Note: for learning purposes only; do not use for any illegal purpose. If anything here infringes your rights, please contact the blogger for removal.
Author: zhu6201976
1. Catching all network-related Request exceptions
In a spider class, when we construct a Request or FormRequest object, we can pass an errback callback as a parameter. That callback catches every network-related exception raised while the request is processed, as the documentation excerpt below shows:
Documentation: Requests and Responses — Scrapy 2.11.1 documentation
errback (collections.abc.Callable) –
a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
Changed in version 2.0: The callback parameter is no longer required when the errback parameter is specified.
Explanation: errback takes a function that is called when any exception is raised while processing the request, including HTTP errors such as 404 Not Found. The callback's first argument is a Failure object, which carries the failed Request and the reason for the failure.
Example from the docs: Requests and Responses — Scrapy 2.11.1 documentation
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError


class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",  # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",  # non-responding host, timeout expected
        "https://example.invalid/",  # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(
                u,
                callback=self.parse_httpbin,
                errback=self.errback_httpbin,
                dont_filter=True,
            )

    def parse_httpbin(self, response):
        self.logger.info("Got successful response from {}".format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error("HttpError on %s", response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error("DNSLookupError on %s", request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error("TimeoutError on %s", request.url)
As the example shows, errback catches all network-related exceptions, such as HttpError, DNSLookupError, and TCPTimedOutError. By passing it when constructing every Request object, we can therefore catch all network-related exceptions in a spider.
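When a spider makes many requests, attaching errback by hand every time is easy to forget. One option is a small base class whose helper always attaches a shared errback. The sketch below only illustrates the idea; SafeSpider, safe_request, and default_errback are hypothetical names, not part of Scrapy's API:

import scrapy


class SafeSpider(scrapy.Spider):
    # hypothetical base class: every request built via safe_request()
    # carries a default errback, so no network failure goes unlogged
    def safe_request(self, url, callback=None, **kwargs):
        # keep a caller-supplied errback if one was given
        kwargs.setdefault('errback', self.default_errback)
        return scrapy.Request(url, callback=callback, **kwargs)

    def default_errback(self, failure):
        # failure.request is the original Request, failure.value the raised exception
        self.logger.error(f'{failure.request.url} {failure.value}')

A concrete spider then inherits from SafeSpider and yields self.safe_request(url, callback=self.parse) from start_requests.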
2. Catching all parsing exceptions
Passing errback when constructing a Request only catches network-related exceptions for that request; it does not catch exceptions raised while parsing the response. Those need their own try/except. The following example catches the ZeroDivisionError and NameError deliberately raised inside the callback.
def parse(self, response, **kwargs):
    method_name = sys._getframe().f_code.co_name
    # 2. parsing exceptions must be caught separately
    try:
        1 / 0  # raises ZeroDivisionError
        aaa  # raises NameError (undefined name)
    except Exception as e:
        self.logger.error(f'{method_name} Exception {e}')
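If several callbacks need the same try/except, the boilerplate can be factored into a decorator. The sketch below is one way to do that, not part of Scrapy's API: the name catch_parse_errors is made up, and it assumes callbacks are written as generators (or return nothing), as Scrapy callbacks usually are.

import functools


def catch_parse_errors(func):
    # hypothetical decorator: runs the wrapped callback and logs any
    # exception it raises instead of letting it crash the spider
    @functools.wraps(func)
    def wrapper(self, response, **kwargs):
        try:
            # exhaust the callback inside the try so exceptions raised
            # mid-iteration are also caught; () covers callbacks returning None
            yield from func(self, response, **kwargs) or ()
        except Exception as e:
            self.logger.error(f'{func.__name__} Exception {e}')
    return wrapper

With this in place, putting @catch_parse_errors above each def parse(...) replaces the per-method try/except.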
3. Complete code and running it
import sys

import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    # allowed_domains = ["baidu.com"]
    # start_urls = []

    def start_requests(self):
        # 1. pass an errback when building the request to catch request-level exceptions
        yield scrapy.Request('https://www.baidu7489764654.com', callback=self.parse, errback=self.parse_errback)
        yield scrapy.Request('https://www.baidu.com', callback=self.parse, errback=self.parse_errback)

    def parse(self, response, **kwargs):
        method_name = sys._getframe().f_code.co_name
        # 2. parsing exceptions must be caught separately
        try:
            1 / 0  # raises ZeroDivisionError
            aaa  # raises NameError (undefined name)
        except Exception as e:
            self.logger.error(f'{method_name} Exception {e}')

    def parse_errback(self, failure):
        # handle request-level (network) exceptions
        self.logger.error(f'{failure.request.url} {failure.value}')
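Inside a Scrapy project this spider runs with the usual scrapy crawl baidu command. Outside a project, it can also be run from a plain script with CrawlerProcess; a minimal sketch, assuming BaiduSpider is defined in (or imported into) the same module:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()  # optionally pass settings={'LOG_LEVEL': 'ERROR'}
process.crawl(BaiduSpider)
process.start()  # blocks until the crawl finishes

When it runs, the unreachable domain should end up in parse_errback with a DNS lookup failure, while the reachable one reaches parse, where the deliberate ZeroDivisionError is caught and logged.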
4. Summary
With these two measures in place, in theory every exception in the spider class is now caught: network-related request failures are handled by the errback, and code errors during parsing are handled by the try/except, which makes the spider considerably more robust.