Spider-Python3Ⅱ

Spider

[^]: Implemented in Python3


The urllib.request module

  • Handles URLs using the HTTP, HTTPS, and FTP protocols; mainly used for HTTP
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

data is the data submitted to the URL as a POST request; timeout sets a timeout in seconds (this applies only to HTTP, HTTPS, and FTP connections); the ca* parameters deal with certificate verification (cafile and capath specify a set of trusted CA certificates for HTTPS requests).
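For example, a minimal POST sketch (the login endpoint and form field names here are made up purely for illustration):

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only
url = 'http://www.example.com/login'
# data must be bytes: urlencode() builds the form body, encode() converts it
data = urllib.parse.urlencode({'user': 'alice', 'pwd': 'secret'}).encode('utf-8')
# Passing data makes urlopen() send a POST instead of a GET:
# response = urllib.request.urlopen(url, data=data, timeout=3)
```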

P.S. Since Python 3.6, cafile, capath, and cadefault are deprecated in favor of context. Please use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates for you.
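A minimal sketch of the recommended replacement, using the system's trusted CA certificates:

```python
import ssl
import urllib.request

# create_default_context() loads the system's trusted CA certificates
# and enables hostname checking -- the replacement for cafile/capath/cadefault
context = ssl.create_default_context()
# response = urllib.request.urlopen('https://www.baidu.com', context=context)
```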

The object returned by this function provides the following methods:

  • geturl() returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed
  • getcode() returns the HTTP status code of the response (200: OK, 404: Not Found, 503: Service Unavailable)
  • info() returns the page's meta-information in the form of an email.message_from_string() instance

Fetching the Baidu homepage:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'QCF'

import os
import platform
import time
import urllib.request

def clear():
    '''Clear the screen'''
    print('The content is long; the screen will clear in 3 s')
    time.sleep(3)
    OS = platform.system()  # get the OS name
    if OS == 'Windows':
        os.system('cls')    # Windows clear-screen command
    else:
        os.system('clear')  # Linux clear-screen command

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
        result = response.read().decode('utf-8')
    except Exception as e:
        print('Bad network address')
        exit()
    with open('baidu.txt', 'w', encoding='utf-8') as fp:
        fp.write(result)
    print("URL info: response.geturl(): %s" % response.geturl())
    print("Return code: response.getcode(): %s" % response.getcode())
    print("Response info: response.info(): %s" % response.info())
    print("The page content has been saved to baidu.txt in the current directory")


if __name__ == '__main__':
    linkBaidu()

================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl(): http://www.baidu.com
Return code: response.getcode(): 200
Response info: response.info(): Bdpagetype: 1
Bdqid: 0xef5df86600013d61
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+1c8349b37b441e6932e8b8b6e4747690
Date: Fri, 25 Jan 2019 15:03:18 GMT
Expires: Fri, 25 Jan 2019 15:03:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=28A5143FAE268F8DB5005D86DECF2D35:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=28A5143FAE268F8DB5005D86DECF2D35; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428598; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=26524_1439_21110_28329_28414_20718; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked


The page content has been saved to baidu.txt in the current directory
<!DOCTYPE html>
<!--STATUS OK-->

If url = 'http://www.baidu.com' is changed to url = 'https://www.baidu.com', the output becomes:

================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl(): https://www.baidu.com
Return code: response.getcode(): 200
Response info: response.info(): Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Fri, 25 Jan 2019 15:03:32 GMT
Etag: "5c36c624-e3"
Last-Modified: Thu, 10 Jan 2019 04:12:20 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=8CE73187BDBE7A99BC73BBDDA28A698C; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428612; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close

The page content has been saved to baidu.txt in the current directory
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Special note: with open('baidu.txt', 'w', encoding='utf-8') as fp: is the form needed on Windows; on Linux, with open('baidu.txt', 'w') as fp: suffices.

Why?

On Windows, newly created files use gbk encoding by default. If the utf-8 network data decoded by result = response.read().decode('utf-8') is then written out through the gbk codec, the write fails with UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 29527: illegal multibyte sequence.

But if the url is HTTPS, this error does not occur. (Plausibly because the HTTPS response shown above is a short, ASCII-only redirect page, and gbk can encode all ASCII characters.)
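The failure mode can be reproduced without any network access; gbk simply cannot encode every character that utf-8 can. The snowman character below is just a stand-in for any character the page contains that gbk lacks:

```python
# '\u2603' (a snowman) has no gbk mapping, standing in for any
# non-gbk character in the downloaded page
try:
    '\u2603'.encode('gbk')
    raised = False
except UnicodeEncodeError:
    raised = True
# Writing such a character to a file opened with encoding='gbk'
# fails with the same UnicodeEncodeError
print(raised)
```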


The urllib package contains four modules:

  1. urllib.request for opening and reading URLs
  2. urllib.error containing the exceptions raised by urllib.request
  3. urllib.parse for parsing URLs
  4. urllib.robotparser for parsing robots.txt files

(Reference: https://docs.python.org/3/library/urllib.html)
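A quick offline sketch of the two modules not exercised above, urllib.parse and urllib.robotparser (the robots.txt ruleset here is hand-written for the example rather than fetched from a real site):

```python
from urllib import parse, robotparser

# urlparse() splits a URL into scheme, host, path, query, ...
parts = parse.urlparse('https://www.baidu.com/s?wd=python')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# RobotFileParser normally reads a site's robots.txt; here we feed it
# a hand-written ruleset to keep the example offline
rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'https://example.com/private/page'))
print(rp.can_fetch('*', 'https://example.com/index.html'))
```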