Writing Web Crawlers in Python: 3. More on Using the urllib Library

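Contents

Exception Handling
- URLError
- HTTPError
- Setting a Timeout
Parsing, Constructing, and Joining URLs
- urlparse
- urlsplit
- urljoin
- urlencode
- parse_qs
- quote
The Robots Protocol
- Structure of the Robots Protocol
- Parsing the Protocol
Reference Books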

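The previous article, "Writing Web Crawlers in Python: 2. Basic Usage of the urllib Library", showed how to send simple requests with the request module of the urllib library. To do more with less effort, we need a few more tools: exception handling, URL parsing and construction, and robots.txt parsing.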

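Exception Handling

The error module of the urllib library lets us handle the exceptions raised while sending requests.

URLError

If we request a page that does not exist, urlopen raises an exception. Catching URLError lets us print the reason for the failure instead of letting the program terminate abnormally.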

from urllib import request, error 

try: 
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e: 
    print(e.reason)

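HTTPError

HTTPError is a subclass of URLError that specifically handles HTTP request errors. It has three attributes:

code: the HTTP status code
reason: the reason for the error
headers: the response headers

Code example: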

from urllib import request,error 

try: 
    response = request.urlopen('https://cuiqingcai.com/index.htm') 
except error.HTTPError as e: 
    print(e.reason,e.code,e.headers)

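The output consists of the error reason, the status code, and the response headers.

In some cases e.reason, e.code, or e.headers may be None. You can then print e itself to get the complete exception message.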

print(e)

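Because HTTPError is a subclass of URLError, an except clause for URLError also matches HTTPError. We therefore catch the more specific HTTPError first and then fall back to the parent class URLError: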

from urllib import request, error

try: 
    response = request.urlopen('https://cuiqingcai.com/index.htm') 
except error.HTTPError as e: 
    print(e.reason, e.code, e.headers) 
except error.URLError as e: 
    print(e.reason)
else: 
    print('Request Successfully')
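The relationship between the two exception classes can be checked without any network access. The following is a minimal sketch that builds both exceptions by hand; the describe helper, the example.com URL, and the error messages are illustrative assumptions, not part of urllib.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Return a short description, checking the subclass first."""
    if isinstance(exc, error.HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, error.URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc}'


# Hand-built exceptions (hypothetical URL and messages), so no request is sent.
headers = Message()
headers['Content-Type'] = 'text/html'
http_exc = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                           headers, io.BytesIO(b''))
url_exc = error.URLError('name resolution failed')

# HTTPError is a URLError, so a URLError clause placed first would catch both.
print(issubclass(error.HTTPError, error.URLError))
print(describe(http_exc))
print(describe(url_exc))
```

If the two isinstance checks in describe were swapped, the HTTPError would be reported as a plain URLError and its status code would be lost.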

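Setting a Timeout

Setting a timeout keeps the program from waiting indefinitely on a server that is slow or does not respond. If no response arrives within the given number of seconds, an exception is raised.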

import socket 
import urllib.request 
import urllib.error 

try: 
    response = urllib.request.urlopen('https://www.baidu.com',timeout = 0.01)
except urllib.error.URLError as e: 
    print(type(e.reason)) 
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')
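A timeout does not always arrive wrapped in a URLError; depending on when it happens (for example while reading the response body), it may be raised directly as socket.timeout. The sketch below covers both cases. The is_timeout helper is a hypothetical name introduced for illustration, and the exceptions are built by hand so that the example runs offline.

```python
import socket
from urllib import error


def is_timeout(exc):
    """Return True if exc represents a timeout, wrapped or not."""
    if isinstance(exc, error.URLError):
        # urlopen wraps connection-phase timeouts: the cause is in .reason
        return isinstance(exc.reason, socket.timeout)
    # A timeout may also be raised directly
    return isinstance(exc, socket.timeout)


# Hand-built exceptions for demonstration
wrapped = error.URLError(socket.timeout('timed out'))
direct = socket.timeout('timed out')
refused = error.URLError('connection refused')

print(is_timeout(wrapped), is_timeout(direct), is_timeout(refused))
```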

Parsing, Constructing, and Joining URLs

urlparse

The urlparse function splits a URL into its components.

from urllib.parse import urlparse 

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

The result is a ParseResult object with six parts: the protocol (scheme), the domain (netloc), the path (path), the parameters (params), the query (query), and the anchor (fragment).

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

This implies that a standard URL has the following format:

scheme://netloc/path;params?query#fragment

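Parameters of urlparse
urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: the URL to parse; required.
scheme: the default protocol (for example http) to use when the URL does not contain one. It is ignored if the URL already specifies a scheme.
allow_fragments: whether to recognize the fragment. If True, the fragment is parsed normally; if False, it is not split off and stays part of the path, params, or query, and fragment is an empty string.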
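The effect of these parameters is easiest to see by example. The sketch below reuses the URL from above; the variable names r1 to r4 are only for illustration.

```python
from urllib.parse import urlparse

# No scheme in the URL: the scheme argument is used as the default.
# Without a leading '//', the host name ends up in path, not netloc.
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1)

# The URL already has a scheme, so the scheme argument is ignored.
r2 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r2.scheme)

# allow_fragments=False: the fragment stays attached to the query ...
r3 = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
              allow_fragments=False)
print(r3.query, repr(r3.fragment))

# ... or to the path when there is no query or params.
r4 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(r4.path)

# ParseResult is a named tuple: attribute and index access are equivalent.
print(r2.netloc, r2[1])
```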

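The urlunparse function does the reverse: it builds a URL from an iterable of exactly six components, given in the order of the standard URL structure.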

from urllib.parse import urlunparse 

data =['http','www.baidu.com','index.html','user','a=6','comment'] 
print(urlunparse(data))

The output is http://www.baidu.com/index.html;user?a=6#comment
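Parsing and construction can be combined to modify a single component of an existing URL. The following sketch relies on ParseResult being a named tuple, so its _replace method returns a modified copy; the new query value id=6 is an arbitrary example.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and then unparsing gives back the original URL.
print(urlunparse(parts) == url)

# Change the query and drop the fragment, leaving the rest untouched.
new_parts = parts._replace(query='id=6', fragment='')
new_url = urlunparse(new_parts)
print(new_url)
```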

urlsplit

urlsplit works much like urlparse, except that it does not split out params; that part stays in path.

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment') 
print(result)

The result is:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

Similarly, urlunsplit builds a URL; just note that it takes five components.

from urllib.parse import urlunsplit 

data =['http','www.baidu.com','index.html','a=6','comment']
print(urlunsplit(data))
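As a quick check of the two functions, the sketch below shows that SplitResult is a five-element named tuple and prints the URL built by urlunsplit from the same five components.

```python
from urllib.parse import urlsplit, urlunsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')

# SplitResult has five fields, readable by attribute name or by index.
print(len(result))
print(result.scheme, result[0])

# Five components in, one URL out; the missing leading '/' of the path is added.
rebuilt = urlunsplit(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
print(rebuilt)
```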

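urljoin

The sections above covered parsing and constructing URLs. urljoin merges two URLs: the first argument is the base URL and the second is the new URL. urljoin takes the scheme, netloc, and path of the base URL and uses them to fill in whatever the new URL is missing. In the examples below, the query and fragment of the base URL do not appear in the result: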

from urllib.parse import urljoin 

print(urljoin('http://www.baidu.com','FAQ.html')) 
print(urljoin('http://www.baidu.com','https://cuiqingcai.com/FAQ.html')) 
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html')) 
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html?question=2')) 
print(urljoin('http://www.baidu.com?wd=abc','https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com','?category=2#comment')) 
print(urljoin('www.baidu.com','?category=2#comment')) 
print(urljoin('www.baidu.com#comment','?category=2'))

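The results are as follows:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2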
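In a crawler, urljoin is typically used to turn the links found on a page into absolute URLs. The sketch below assumes a hypothetical page at example.com and a hand-written list of href values; no page is actually downloaded.

```python
from urllib.parse import urljoin

# Hypothetical page URL and the href values found on it
page_url = 'http://example.com/docs/index.html'
hrefs = [
    'faq.html',                         # relative to the current directory
    '../about.html',                    # one directory up
    '/img/logo.png',                    # relative to the site root
    '//cdn.example.com/a.js',           # keeps only the scheme of the base
    'https://cuiqingcai.com/FAQ.html',  # already absolute: returned unchanged
    '#top',                             # fragment only: same page
]

absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```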

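urlencode

When constructing a GET request, we often define the parameters in a dictionary first and then serialize them into a query string. This is what urlencode does.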

from urllib.parse import urlencode

params = {
    'name' : 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

The result is http://www.baidu.com?name=germey&age=22
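urlencode also takes care of escaping. The following sketch shows how it handles non-ASCII text, spaces, and list values; the parameter names wd, page, q, and tag are arbitrary examples.

```python
from urllib.parse import urlencode

# Non-ASCII values are percent-encoded as UTF-8.
q1 = urlencode({'wd': '壁纸', 'page': 1})
print(q1)

# Spaces become '+', as in HTML form submissions.
q2 = urlencode({'q': 'python crawler'})
print(q2)

# With doseq=True, a list value produces one key=value pair per element.
q3 = urlencode({'tag': ['a', 'b']}, doseq=True)
print(q3)
```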

parse_qs

To convert the parameters in a URL back into a dictionary, use parse_qs.

from urllib.parse import parse_qs 

query= 'name=germey&age=22' 
print(parse_qs(query))

Result: {'name': ['germey'], 'age': ['22']}

To get the parameters as a list of tuples instead, use parse_qsl.

from urllib.parse import parse_qsl 

query= 'name=germey&age=22' 
print(parse_qsl(query))

Result: [('name', 'germey'), ('age', '22')]

quote

Chinese characters in URL parameters can end up garbled. quote converts them to URL encoding (percent-encoded UTF-8 bytes).
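The values returned by parse_qs are lists because a key may appear more than once in a query string. The sketch below extracts the query from a full URL with urlparse first; the URL and its tag parameter are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs, parse_qsl

# Hypothetical URL with a repeated key
url = 'http://www.baidu.com/s?tag=python&tag=crawler&name=germey'
query = urlparse(url).query

params = parse_qs(query)   # repeated keys are collected into one list
pairs = parse_qsl(query)   # every key-value pair is kept, in order
flat = dict(pairs)         # plain dict: the last value of a repeated key wins

print(params)
print(pairs)
print(flat)
```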

from urllib.parse import quote 

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd='+ quote(keyword) 
print(url)

Output: https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

To decode the URL encoding back into text, use unquote.

from urllib.parse import unquote
url = 'http://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8' 
print(unquote(url))

Result: http://www.baidu.com/s?wd=壁纸
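Which characters quote leaves alone is controlled by its safe parameter, which defaults to '/'. The sketch below illustrates this, along with quote_plus and unquote_plus, and shows why quote should be applied to a single value rather than to a whole URL. The sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote_plus

# By default '/' is kept, a space becomes %20, non-ASCII becomes UTF-8 bytes.
p1 = quote('a b/壁纸')
print(p1)

# With safe='' the slash is encoded as well.
p2 = quote('a b/c', safe='')
print(p2)

# quote_plus targets query strings: a space becomes '+', '&' is escaped.
p3 = quote_plus('a b&c')
print(p3, unquote_plus(p3))

# Quoting a whole URL also escapes ':', '?' and '=', which breaks it.
p4 = quote('http://a/?x=1')
print(p4)
```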

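The Robots Protocol

The Robots protocol (Robots Exclusion Protocol), also called the crawler protocol or robot protocol, tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the root directory of the website.

When a well-behaved crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler fetches pages according to the rules in the file; otherwise the crawler visits every page that is directly accessible.

Structure of the Robots Protocol

A robots.txt file generally consists of three kinds of directives: User-agent, Disallow, and Allow. They specify which crawlers the rules apply to, which paths must not be crawled, and which paths may be crawled.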

Common crawler names

Name	Meaning
*	All crawlers
BaiduSpider	Baidu
Googlebot	Google
360Spider	360 Search
YodaoBot	Youdao

Here are a few examples of robots.txt.

Disallow all crawlers from every directory:

User-agent: *
Disallow: /

Allow all crawlers to access every directory:

User-agent: *
Disallow:

Disallow all crawlers from certain directories of the site:

User-agent: *
Disallow: /private/
Disallow: /tmp/

Allow only one particular crawler:

User-agent: BaiduSpider
Disallow:
User-agent: *
Disallow: /
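These rule sets can be checked offline with the RobotFileParser class from the standard library, by feeding it the lines directly instead of a URL. The following is a minimal sketch; the allowed helper, the constant names, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


# The "allow only one crawler" example from above
ONLY_BAIDU = """\
User-agent: BaiduSpider
Disallow:
User-agent: *
Disallow: /
"""

# The "disallow certain directories" example from above
BLOCK_DIRS = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

site = 'http://example.com'  # hypothetical site
print(allowed(ONLY_BAIDU, 'BaiduSpider', site + '/index.html'))
print(allowed(ONLY_BAIDU, 'Googlebot', site + '/index.html'))
print(allowed(BLOCK_DIRS, '*', site + '/private/a.html'))
print(allowed(BLOCK_DIRS, '*', site + '/index.html'))
```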

Parsing the Protocol

The RobotFileParser class in the robotparser module parses robots.txt files. You only need to pass in the URL of the robots.txt file:

urllib.robotparser.RobotFileParser(url='')

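Alternatively, create the object without a URL and use the following methods:

set_url: sets the URL of the robots.txt file
read: fetches the robots.txt file and parses it. If the rules are never loaded (with read or parse), every later can_fetch check returns False
parse: parses robots.txt content. The argument is a list of lines from robots.txt, which are analyzed according to the robots.txt syntax rules
can_fetch: takes a User-agent and a URL and returns True or False depending on whether that crawler may fetch the URL
mtime: returns the time robots.txt was last fetched and parsed, which helps a long-running crawler decide when to fetch the latest robots.txt again
modified: sets the time robots.txt was last fetched and parsed to the current time, so that the crawler can avoid fetching robots.txt too often, saving network resources

Example: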

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()  # fetch and parse robots.txt; requires network access

# Check whether a crawler matching '*' may fetch each URL
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
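The mtime and modified methods are mainly useful in long-running crawlers. The sketch below shows one possible way to decide when robots.txt should be loaded again; the needs_refresh helper and the one-hour limit are assumptions, and the rules are supplied through parse so that the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def needs_refresh(rp, max_age_seconds=3600):
    """Return True if the rules were never loaded or are older than max_age_seconds."""
    last_checked = rp.mtime()  # 0 until the rules have been loaded
    return last_checked == 0 or time.time() - last_checked > max_age_seconds


rp = RobotFileParser()
print(needs_refresh(rp))  # nothing loaded yet

# Load rules from a list of lines instead of fetching them over the network.
rp.parse(['User-agent: *', 'Disallow: /private/'])
rp.modified()  # record the current time as the last check
print(needs_refresh(rp))
print(rp.can_fetch('*', 'http://example.com/private/a.html'))
```

In a real crawler, a True result from needs_refresh would be followed by a call to rp.read() to fetch the current file.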