I. CrawlSpider
1. Install Scrapy
In a terminal: pip install scrapy
2. Create a project
1) Create the project
scrapy startproject <project_name>
2) Change into the spiders directory
cd <project_name>\<project_name>\spiders
3) Generate a CrawlSpider file
scrapy genspider -t crawl <spider_name> <domain>
4) Run the spider
scrapy crawl <spider_name>
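For reference, the concrete commands for the 读书网 walkthrough below (project demo_dsw, spider dsw, domain www.dushu.com, matching the code that follows) would be:
scrapy startproject demo_dsw
cd demo_dsw\demo_dsw\spiders
scrapy genspider -t crawl dsw www.dushu.com
scrapy crawl dsw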
3. Edit the Rule in the spider file
Hover over a page-number link on the listing page, then right-click → Inspect to see the link format.
Edit the allow regex in the Rule; once set, the link extractor pulls every matching link from the start_urls page:
allow=r'/book/1104_\d+\.html'
where:
\d matches a digit
+ means one or more of the preceding token
\. escapes the dot so it matches a literal '.' (a quick standalone check of the pattern follows)
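The pattern can be sanity-checked with plain Python re, independent of Scrapy:

import re

pattern = re.compile(r'/book/1104_\d+\.html')

# page links of category 1104 match:
print(bool(pattern.search('/book/1104_2.html')))   # True
print(bool(pattern.search('/book/1104_13.html')))  # True
# the bare category URL without a page number does not:
print(bool(pattern.search('/book/1104.html')))     # False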
Set the rule's follow parameter:
follow = False
Result: book.json comes out at 1440 lines; 1440 (lines) ÷ 40 (lines/page) ÷ 3 (books/line) = 12 pages, yet the category has 13 pages, so one is missing.
Reason: the rule's pattern does not cover the first page as linked, so only pages 2-13 are extracted.
Fix: start from the first page's 1104_1 form, which the rule does match:
# first page: 1104_1
start_urls = ["https://www.dushu.com/book/1104_1.html"]
With the first page included, book.json grows to 1560 lines (13 pages).
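If the goal is to reach every page without depending on what the start page links to, follow=True makes the CrawlSpider re-apply the rule to each response it crawls; a sketch (the rest of the spider is unchanged):

rules = (
    Rule(
        LinkExtractor(allow=r'/book/1104_\d+\.html'),
        callback='parse_item',
        follow=True,  # keep extracting matching links from every crawled page
    ),
)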
4. Example: 读书网 (www.dushu.com)
1) dsw.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# import the item class
from demo_dsw.items import DemoDswItem


class DswSpider(CrawlSpider):
    name = "dsw"
    # narrow allowed_domains down to the bare domain
    allowed_domains = ["www.dushu.com"]
    # first page: 1104_1
    start_urls = ["https://www.dushu.com/book/1104_1.html"]

    rules = (
        Rule(
            LinkExtractor(allow=r'/book/1104_\d+\.html'),
            callback="parse_item",
            follow=False,
        ),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            # the book title sits in alt; the lazy-loaded image URL in data-original
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()
            # build one book item
            book = DemoDswItem(name=name, src=src)
            # hand it to the pipeline
            yield book
# Ctrl+Alt+L (PyCharm Reformat Code) pretty-prints book.json for inspection
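The XPath expressions can be tried interactively before running the spider; outputs depend on the live page:

scrapy shell https://www.dushu.com/book/1104_1.html
>>> img_list = response.xpath('//div[@class="bookslist"]//img')
>>> img_list[0].xpath('./@alt').extract_first()            # book title
>>> img_list[0].xpath('./@data-original').extract_first()  # image URL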
2) items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class DemoDswItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # book title
    name = scrapy.Field()
    # image URL
    src = scrapy.Field()
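A scrapy.Item behaves like a dict, so fields can be set and read by key; a minimal illustration (the values here are made up):

from demo_dsw.items import DemoDswItem

book = DemoDswItem(name='example title', src='https://example.com/cover.jpg')
print(book['name'])   # key access, like a dict
print(dict(book))     # convert to a plain dict if needed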
3) pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DemoDswPipeline:
    # runs once when the spider opens
    def open_spider(self, spider):
        self.fp = open('book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write() only accepts strings, so stringify the item
        self.fp.write(str(item))
        return item

    # runs once when the spider closes
    def close_spider(self, spider):
        self.fp.close()
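Note that str(item) writes a Python repr, not valid JSON. A JSON-lines variant is sketched below; the class name DemoDswJsonLinesPipeline is hypothetical, and ensure_ascii=False keeps Chinese titles readable:

import json

from itemadapter import ItemAdapter


class DemoDswJsonLinesPipeline:
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter gives a uniform dict view over any item type
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False)
        self.fp.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()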
4) settings.py
Uncomment:
ITEM_PIPELINES = {
"demo_dsw.pipelines.DemoDswPipeline": 300,
}
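The value is a priority from 0 to 1000; lower numbers run first when several pipelines are enabled. Registering the hypothetical JSON-lines variant from above alongside the original would look like:

ITEM_PIPELINES = {
    "demo_dsw.pipelines.DemoDswPipeline": 300,
    "demo_dsw.pipelines.DemoDswJsonLinesPipeline": 301,  # runs after the original
}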
5. POST requests in Scrapy
scrapy_post.py
import json

import scrapy


class ScrapypostSpider(scrapy.Spider):
    name = "scrapyPost"
    allowed_domains = ["fanyi.baidu.com"]
    # POST requests:
    # a POST only makes sense with a payload;
    # without parameters it does nothing useful,
    # so start_urls and the default parse() are unused here
    # start_urls = ["https://fanyi.baidu.com/sug"]
    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'
        data = {
            'kw': 'cat'
        }
        # FormRequest sends the payload as a form-encoded POST
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        # response.text is already decoded, so json.loads needs no encoding argument
        obj = json.loads(content)
        print(obj)
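FormRequest sends the data form-encoded (application/x-www-form-urlencoded), which is what this endpoint expects. For an API that takes a JSON body instead, the equivalent is a plain scrapy.Request with a serialized body; a sketch (the helper name is made up, and the URL/payload are whatever the target API requires):

import json

import scrapy


def json_post_request(url, payload, callback):
    # POST with a JSON body instead of form data
    return scrapy.Request(
        url=url,
        method='POST',
        body=json.dumps(payload),
        headers={'Content-Type': 'application/json'},
        callback=callback,
    )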