Disclaimer
This article is for learning and exchange only; it has no commercial purpose.
Some images are from 尚硅谷.
Introduction to meta
In the Scrapy framework, the meta attribute can be used to pass extra information along with a request. Data attached to meta is available to the different components that handle the request, including the spider, middlewares and pipelines.
In a spider, meta is used to pass data between requests. For example:
yield scrapy.Request(url, callback=self.parse_details, meta={'item': item})
In the example above, setting the meta attribute passes the item object to parse_details, the callback of the next request.
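In the callback, the object is read back from response.meta. A minimal sketch (assuming parse_details is the callback named above):
def parse_details(self, response):
    # Retrieve the item that was attached to the request via meta
    item = response.meta['item']
    # ... fill in further fields from this page, then hand the item on ...
    yield item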
In a middleware, the meta attribute of the request can be read and modified. For example (datetime comes from the standard library):
from datetime import datetime


def process_request(self, request, spider):
    item = request.meta['item']
    item['timestamp'] = datetime.now()
    request.meta['item'] = item
In the example above, process_request reads the item object carried by the request, adds a timestamp field, and stores the modified item back into the meta attribute.
In a pipeline, the data passed along can then be used. For example:
def process_item(self, item, spider):
    timestamp = item['timestamp']
    # do something with timestamp
In the example above, the pipeline reads the timestamp value that the middleware stored on the item and processes it accordingly. (Note that the pipeline works with the item itself, so the value is read from the item rather than from meta.)
In short, Scrapy's meta attribute makes it convenient and flexible to pass data between the different components.
Crawling the names and image links of all domestic movies from 电影天堂 (dygod.net)
import scrapy
from scrapy_movie_070.items import ScrapyMovie070Item


class MvSpider(scrapy.Spider):
    name = "mv"
    allowed_domains = ["www.dygod.net"]
    start_urls = ["https://www.dygod.net/html/gndy/china/index.html"]

    def parse(self, response):
        print("==============成功啦===============")
        # We want the name from the first (list) page and the image from the second (detail) page
        a_list = response.xpath('//div[@class="co_content8"]//td[2]//a[2]')
        for a in a_list:
            # Get the name and detail-page link from the list page
            name = a.xpath('./text()').extract_first()
            src = a.xpath('./@href').extract_first()
            url = 'https://www.dygod.net' + src
            print(name, url)
            yield scrapy.Request(url=url, callback=self.parse_second, meta={'name': name})

    def parse_second(self, response):
        print("==============呀啦嗦===============")
        # If no data comes back, double-check that the XPath is correct
        img_src = response.xpath('//div[@id="Zoom"]//img[1]/@src').extract_first()
        img_url = 'https://www.dygod.net' + img_src
        # Read the value passed along in the request's meta
        name = response.meta['name']
        movie = ScrapyMovie070Item(src=img_url, name=name)
        yield movie
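The item class imported above is not shown in this section; a minimal sketch of what scrapy_movie_070/items.py would contain, inferred from the two fields the spider uses (name and src), is:
import scrapy


class ScrapyMovie070Item(scrapy.Item):
    # Fields inferred from the spider above
    name = scrapy.Field()
    src = scrapy.Field()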
CrawlSpider is a special spider class in the Scrapy framework that provides a fast, rule-based way of crawling. A CrawlSpider uses a set of rules to define its crawling behaviour and automatically follows and crawls the links on each page according to those rules.
With CrawlSpider you can extract data from a site with much less code. The basic steps for using it are as follows:
- Create a subclass of CrawlSpider and set its name attribute (the spider's unique identifier) and its allowed_domains attribute (the domains the crawl is limited to).
- Define a rules attribute containing one or more Rule objects, each of which defines one rule.
  - A Rule's link_extractor attribute defines the link extractor used to pull links out of a page.
  - A Rule's callback attribute names the callback function used to process the pages those links point to.
- Write the callback function that handles the pages the extracted links point to.
- In the callback, extract data with XPath or CSS selectors and use yield to return Item objects or new Request objects for further crawling or processing.
Here is a simple CrawlSpider example:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Extract data and return it as an item
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.content::text').getall(),
        }
In the example above, the allowed_domains attribute restricts crawling to pages under the example.com domain, and the start_urls attribute defines the initial URL to crawl.
The rules attribute defines one rule: the LinkExtractor extracts the links that match the allow pattern and hands the corresponding pages to the parse_page method for processing; follow=True means that links found on those pages are followed as well.
parse_page is the callback that processes the pages the extracted links point to. Inside it you can extract data from the page with XPath or CSS selectors and return Item objects with yield.
With these steps you get a rule-based spider, and the CrawlSpider class automatically takes care of following links and crawling pages.
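For quick reference (a sketch, not an exhaustive list), LinkExtractor accepts several arguments besides allow that control which links are extracted, such as restrict_xpaths and restrict_css; the patterns here are taken from the shell session below:
from scrapy.linkextractors import LinkExtractor

# Links whose URL matches a regular expression
link = LinkExtractor(allow=r'/book/1188_\d+\.html')

# Links found inside the elements matched by an XPath or CSS selector
link = LinkExtractor(restrict_xpaths='//div[@class="pages"]')
link = LinkExtractor(restrict_css='.pages')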
The figure below is from 尚硅谷.
C:\Users\14059>scrapy shell https://www.dushu.com/book/1188.html
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Scrapy 2.9.0 started (bot: scrapybot)
2024-03-08 17:00:29 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.7.5 (tags/v3.7.5:5c02a39a0b, Oct 15 2019, 00:11:34) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.0.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-10-10.0.22621-SP0
2024-03-08 17:00:29 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'LOGSTATS_INTERVAL': 0}
2024-03-08 17:00:29 [py.warnings] WARNING: d:\python\python375\lib\site-packages\scrapy\utils\request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2024-03-08 17:00:29 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet Password: 13c50912dfa84ac1
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-03-08 17:00:29 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-03-08 17:00:29 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-08 17:00:29 [scrapy.core.engine] INFO: Spider opened
2024-03-08 17:00:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.dushu.com/book/1188.html> (referer: None)
2024-03-08 17:00:30 [asyncio] DEBUG: Using selector: SelectSelector
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x00000254496A38C8>
[s] item {}
[s] request <GET https://www.dushu.com/book/1188.html>
[s] response <200 https://www.dushu.com/book/1188.html>
[s] settings <scrapy.settings.Settings object at 0x00000254496A3748>
[s] spider <DefaultSpider 'default' at 0x25449bbdf88>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: from scrapy.linkextractors import LinkExtractor

In [2]: link = LinkExtractor

In [3]: from scrapy.linkextractors import LinkExtractor

In [4]: link = LinkExtractor(allow=r'/book/1188_\d+\.html')

In [5]: link
Out[6]: <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor at 0x2544d2ae508>
In [7]: link.extract_links(response)
Out[7]:
[Link(url='https://www.dushu.com/book/1188_2.html', text='2', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_3.html', text='3', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_4.html', text='4', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_5.html', text='5', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_6.html', text='6', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_7.html', text='7', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_8.html', text='8', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_9.html', text='9', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_10.html', text='10', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_11.html', text='11', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1188_13.html', text='13', fragment='', nofollow=False)]

In [8]: link1 = LinkExtractor

In [9]: link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a/@href')
In [10]: link.extract_links(response)
Out[10]:
[Link(url='https://www.dushu.com/book/1188_2.html', text='2', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_3.html', text='3', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_4.html', text='4', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_5.html', text='5', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_6.html', text='6', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_7.html', text='7', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_8.html', text='8', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_9.html', text='9', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_10.html', text='10', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_11.html', text='11', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_12.html', text='12', fragment='', nofollow=False),
Link(url='https://www.dushu.com/book/1188_13.html', text='13', fragment='', nofollow=False)]
That was the whole back-and-forth with the command line. [○・`Д´・ ○]
CrawlSpider case study
Goal: store the 读书网 (www.dushu.com) data in a database
(1) Create a project
scrapy startproject <project name>
(2) Change into the spiders directory
cd <project name>\<project name>\spiders
(3) Create the spider file
scrapy genspider -t crawl <spider name> <domain to crawl>
Note: always check whether the URL structure of the first page matches that of the other pages.
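For the project used below, the concrete commands would look roughly like this (the project name, spider name and domain are taken from the code that follows):
scrapy startproject scrapy_readbook_090
cd scrapy_readbook_090\scrapy_readbook_090\spiders
scrapy genspider -t crawl read www.dushu.com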
If you do not need to store the data in a database, the code is as follows.
read.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_readbook_090.items import ScrapyReadbook090Item


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1188_1.html"]

    rules = (
        Rule(LinkExtractor(allow=r"/book/1188_\d+\.html"),
             callback="parse_item",
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            name = img.xpath('./@alt').extract_first()
            img_src = img.xpath('./@data-original').extract_first()
            book = ScrapyReadbook090Item(name=name, src=img_src)
            yield book
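The spider is then run from inside the project directory with:
scrapy crawl read
The pipeline below writes the collected items to book.json.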
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyReadbook090Pipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the item's string representation and pass the item on
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()
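Since the stated goal is to load the 读书网 data into a database and only the file-based pipeline is shown above, here is a minimal sketch of what a database pipeline could look like. It is an illustration under assumptions: it uses pymysql, and the connection settings, database name and the book(name, src) table are placeholders, not part of the original project.
import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        # Placeholder connection settings; adjust to your own environment
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password='123456',
            db='spider01',        # hypothetical database name
            charset='utf8mb4',
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Assumes a table book(name, src) already exists in the database
        sql = 'insert into book (name, src) values (%s, %s)'
        self.cursor.execute(sql, (item.get('name'), item.get('src')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Such a pipeline would also need to be registered in ITEM_PIPELINES in settings.py with its own priority, alongside ScrapyReadbook090Pipeline.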
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyReadbook090Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()
settings.py
# Scrapy settings for scrapy_readbook_090 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "scrapy_readbook_090"
SPIDER_MODULES = ["scrapy_readbook_090.spiders"]
NEWSPIDER_MODULE = "scrapy_readbook_090.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "scrapy_readbook_090 (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "scrapy_readbook_090.middlewares.ScrapyReadbook090SpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "scrapy_readbook_090.middlewares.ScrapyReadbook090DownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "scrapy_readbook_090.pipelines.ScrapyReadbook090Pipeline": 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
Summary
In the end there is not much of a difference; once you get used to it, the code itself is fairly simple. The main difficulty lies in the details and in working out the right paths (XPath expressions).