Spider-Python3Ⅱ

Spider

[^]: Implemented in Python3


The urllib.request module

  • Handles URLs using the HTTP, HTTPS, and FTP protocols; mainly used for HTTP
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

data is the data submitted to the URL as a POST request; timeout sets a timeout in seconds (this applies only to HTTP, HTTPS, and FTP connections); the ca* parameters deal with certificate verification (cafile and capath specify a set of trusted CA certificates for HTTPS requests).
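For example, a minimal POST sketch (the login endpoint and form field names here are made up purely for illustration):

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only
url = 'http://www.example.com/login'
# data must be bytes: urlencode() builds the form body, encode() converts it
data = urllib.parse.urlencode({'user': 'alice', 'pwd': 'secret'}).encode('utf-8')
# Passing data makes urlopen() send a POST instead of a GET:
# response = urllib.request.urlopen(url, data=data, timeout=3)
```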

P.S. Since Python 3.6, cafile, capath, and cadefault are deprecated in favor of context. Please use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates for you.
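A minimal sketch of the recommended replacement, using the system's trusted CA certificates:

```python
import ssl
import urllib.request

# create_default_context() loads the system's trusted CA certificates
# and enables hostname checking -- the replacement for cafile/capath/cadefault
context = ssl.create_default_context()
# response = urllib.request.urlopen('https://www.baidu.com', context=context)
```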

The object returned by this function provides the following methods:

  • geturl() returns the URL of the resource retrieved, commonly used to determine whether a redirect was followed
  • getcode() returns the HTTP status code of the response (200: OK, 404: Not Found, 503: Service Unavailable)
  • info() returns the page's meta-information in the form of an email.message_from_string() instance

Fetching the Baidu homepage:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'QCF'

import os
import platform
import time
import urllib.request

def clear():
    '''Clear the screen'''
    print('The content is long; the screen will clear in 3 s')
    time.sleep(3)
    OS = platform.system()  # get the OS name
    if OS == 'Windows':
        os.system('cls')    # Windows clear-screen command
    else:
        os.system('clear')  # Linux clear-screen command

def linkBaidu():
    url = 'http://www.baidu.com'
    try:
        response = urllib.request.urlopen(url, timeout=3)
        result = response.read().decode('utf-8')
    except Exception as e:
        print('Bad network address')
        exit()
    with open('baidu.txt', 'w', encoding='utf-8') as fp:
        fp.write(result)
    print("URL info: response.geturl(): %s" % response.geturl())
    print("Return code: response.getcode(): %s" % response.getcode())
    print("Response info: response.info(): %s" % response.info())
    print("The page content has been saved to baidu.txt in the current directory")


if __name__ == '__main__':
    linkBaidu()

================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl(): http://www.baidu.com
Return code: response.getcode(): 200
Response info: response.info(): Bdpagetype: 1
Bdqid: 0xef5df86600013d61
Cache-Control: private
Content-Type: text/html
Cxy_all: baidu+1c8349b37b441e6932e8b8b6e4747690
Date: Fri, 25 Jan 2019 15:03:18 GMT
Expires: Fri, 25 Jan 2019 15:03:00 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=28A5143FAE268F8DB5005D86DECF2D35:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=28A5143FAE268F8DB5005D86DECF2D35; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428598; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: delPer=0; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=26524_1439_21110_28329_28414_20718; path=/; domain=.baidu.com
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked


The page content has been saved to baidu.txt in the current directory
<!DOCTYPE html>
<!--STATUS OK-->

If url = 'http://www.baidu.com' is changed to url = 'https://www.baidu.com', the output becomes:

================= RESTART: F:/Python/Spyder/ConnectBaidu.py =================
URL info: response.geturl(): https://www.baidu.com
Return code: response.getcode(): 200
Response info: response.info(): Accept-Ranges: bytes
Cache-Control: no-cache
Content-Length: 227
Content-Type: text/html
Date: Fri, 25 Jan 2019 15:03:32 GMT
Etag: "5c36c624-e3"
Last-Modified: Thu, 10 Jan 2019 04:12:20 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Pragma: no-cache
Server: BWS/1.1
Set-Cookie: BD_NOT_HTTPS=1; path=/; Max-Age=300
Set-Cookie: BIDUPSID=8CE73187BDBE7A99BC73BBDDA28A698C; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1548428612; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Strict-Transport-Security: max-age=0
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close

The page content has been saved to baidu.txt in the current directory
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>

Special note: with open('baidu.txt', 'w', encoding='utf-8') as fp: is the form needed on Windows; on Linux, with open('baidu.txt', 'w') as fp: suffices.

Why?

On Windows, newly created files use gbk encoding by default. If the utf-8 network data decoded by result = response.read().decode('utf-8') is then written out through the gbk codec, the write fails with UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 29527: illegal multibyte sequence.

But if the url is HTTPS, this error does not occur. (Plausibly because the HTTPS response shown above is a short, ASCII-only redirect page, and gbk can encode all ASCII characters.)
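The failure mode can be reproduced without any network access; gbk simply cannot encode every character that utf-8 can. The snowman character below is just a stand-in for any character the page contains that gbk lacks:

```python
# '\u2603' (a snowman) has no gbk mapping, standing in for any
# non-gbk character in the downloaded page
try:
    '\u2603'.encode('gbk')
    raised = False
except UnicodeEncodeError:
    raised = True
# Writing such a character to a file opened with encoding='gbk'
# fails with the same UnicodeEncodeError
print(raised)
```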


The urllib package contains four modules:

  1. urllib.request for opening and reading URLs
  2. urllib.error containing the exceptions raised by urllib.request
  3. urllib.parse for parsing URLs
  4. urllib.robotparser for parsing robots.txt files

(Reference: https://docs.python.org/3/library/urllib.html)
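A quick offline sketch of the two modules not exercised above, urllib.parse and urllib.robotparser (the robots.txt ruleset here is hand-written for the example rather than fetched from a real site):

```python
from urllib import parse, robotparser

# urlparse() splits a URL into scheme, host, path, query, ...
parts = parse.urlparse('https://www.baidu.com/s?wd=python')
print(parts.scheme, parts.netloc, parts.path, parts.query)

# RobotFileParser normally reads a site's robots.txt; here we feed it
# a hand-written ruleset to keep the example offline
rp = robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'https://example.com/private/page'))
print(rp.can_fetch('*', 'https://example.com/index.html'))
```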