Note: for learning purposes only; do not use for any illegal purpose. If anything here infringes your rights, please contact the blogger for removal.
Author: zhu6201976
1. Catching all network-related Request exceptions
In a spider class, when we construct a Request or FormRequest object, we can pass an errback callback as a parameter. That callback catches every network-related exception raised while the request is processed, as the documentation excerpt below shows:
Documentation: Requests and Responses — Scrapy 2.11.1 documentation
errback (collections.abc.Callable) –
a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
Changed in version 2.0: The callback parameter is no longer required when the errback parameter is specified.
Explanation: errback takes a function that is called when any exception is raised while processing the request, including HTTP errors such as 404 Not Found. The callback's first argument is a Failure object, which carries the failed Request and the reason for the failure.
Example from the docs: Requests and Responses — Scrapy 2.11.1 documentation
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",  # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",  # non-responding host, timeout expected
        "https://example.invalid/",  # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(
                u,
                callback=self.parse_httpbin,
                errback=self.errback_httpbin,
                dont_filter=True,
            )

    def parse_httpbin(self, response):
        self.logger.info("Got successful response from {}".format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error("HttpError on %s", response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error("DNSLookupError on %s", request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError on %s", request.url)
As the example shows, errback catches all network-related exceptions, such as HttpError, DNSLookupError, and TCPTimedOutError. By passing it when constructing every Request object, we can therefore catch all network-related exceptions in a spider.
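When a spider makes many requests, attaching errback by hand every time is easy to forget. One option is a small base class whose helper always attaches a shared errback. The sketch below only illustrates the idea; SafeSpider, safe_request, and default_errback are hypothetical names, not part of Scrapy's API:

import scrapy


class SafeSpider(scrapy.Spider):
    # hypothetical base class: every request built via safe_request()
    # carries a default errback, so no network failure goes unlogged
    def safe_request(self, url, callback=None, **kwargs):
        # keep a caller-supplied errback if one was given
        kwargs.setdefault('errback', self.default_errback)
        return scrapy.Request(url, callback=callback, **kwargs)

    def default_errback(self, failure):
        # failure.request is the original Request, failure.value the raised exception
        self.logger.error(f'{failure.request.url} {failure.value}')

A concrete spider then inherits from SafeSpider and yields self.safe_request(url, callback=self.parse) from start_requests.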
2. Catching all parsing exceptions
Passing errback when constructing a Request only catches network-related exceptions for that request; it does not catch exceptions raised while parsing the response. Those need their own try/except. The following example catches the ZeroDivisionError and NameError deliberately raised inside the callback.
def parse(self, response, **kwargs):
    method_name = sys._getframe().f_code.co_name
    # 2. parsing exceptions must be caught separately
    try:
        1 / 0  # raises ZeroDivisionError
        aaa  # raises NameError (undefined name)
    except Exception as e:
        self.logger.error(f'{method_name} Exception {e}')
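If several callbacks need the same try/except, the boilerplate can be factored into a decorator. The sketch below is one way to do that, not part of Scrapy's API: the name catch_parse_errors is made up, and it assumes callbacks are written as generators (or return nothing), as Scrapy callbacks usually are.

import functools


def catch_parse_errors(func):
    # hypothetical decorator: runs the wrapped callback and logs any
    # exception it raises instead of letting it crash the spider
    @functools.wraps(func)
    def wrapper(self, response, **kwargs):
        try:
            # exhaust the callback inside the try so exceptions raised
            # mid-iteration are also caught; () covers callbacks returning None
            yield from func(self, response, **kwargs) or ()
        except Exception as e:
            self.logger.error(f'{func.__name__} Exception {e}')
    return wrapper

With this in place, putting @catch_parse_errors above each def parse(...) replaces the per-method try/except.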
3. Complete code and running it
import sys

import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    # allowed_domains = ["baidu.com"]
    # start_urls = []

    def start_requests(self):
        # 1. pass an errback when building the request to catch request-level exceptions
        yield scrapy.Request('https://www.baidu7489764654.com', callback=self.parse, errback=self.parse_errback)
        yield scrapy.Request('https://www.baidu.com', callback=self.parse, errback=self.parse_errback)

    def parse(self, response, **kwargs):
        method_name = sys._getframe().f_code.co_name
        # 2. parsing exceptions must be caught separately
        try:
            1 / 0  # raises ZeroDivisionError
            aaa  # raises NameError (undefined name)
        except Exception as e:
            self.logger.error(f'{method_name} Exception {e}')

    def parse_errback(self, failure):
        # handle request-level (network) exceptions
        self.logger.error(f'{failure.request.url} {failure.value}')
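Inside a Scrapy project this spider runs with the usual scrapy crawl baidu command. Outside a project, it can also be run from a plain script with CrawlerProcess; a minimal sketch, assuming BaiduSpider is defined in (or imported into) the same module:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()  # optionally pass settings={'LOG_LEVEL': 'ERROR'}
process.crawl(BaiduSpider)
process.start()  # blocks until the crawl finishes

When it runs, the unreachable domain should end up in parse_errback with a DNS lookup failure, while the reachable one reaches parse, where the deliberate ZeroDivisionError is caught and logged.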
4. Summary
With these two measures in place, in theory every exception in the spider class is now caught: network-related request failures are handled by the errback, and code errors during parsing are handled by the try/except, which makes the spider considerably more robust.