This article adapts the code from the Scrapy入门 tutorial on CSDN.
1. Declare the scraping targets
Open the mySpider/mySpider1/items.py file and replace the MyspiderItem class with AIspiderItem:
import scrapy

class AIspiderItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    date = scrapy.Field()
2. Edit the dgcuAI.py file
Import AIspiderItem:
from mySpider1.items import AIspiderItem
At the end of the parse function, add code that builds an AIspiderItem:
import scrapy
from scrapy.selector import Selector
from mySpider1.items import AIspiderItem

class DgcuaiSpider(scrapy.Spider):
    name = 'dgcuAI'
    allowed_domains = ['ai.dgcu.edu.cn']
    start_urls = ['http://ai.dgcu.edu.cn/front/category/2.html']

    def parse(self, response):
        print(response.url)
        selector = Selector(response)
        node_list = selector.xpath("//div[@class='pageList']/ul/li")
        for node in node_list:
            # Article title
            title = node.xpath('./a[1]/div[@class="major-content1"]/text()').extract_first()
            # Article link
            url = node.xpath('./a[1]/@href').extract_first()
            # Publication date
            date = node.xpath('./a[1]/div[@class="major-content2"]/text()').extract_first()
            # Build the AIspiderItem and hand it to the pipeline
            item = AIspiderItem()
            item['title'] = title
            item['url'] = url
            item['date'] = date
            yield item
3. Edit the pipelines.py file:
from pymongo import MongoClient

class AIPipeline:
    def open_spider(self, spider):
        # MongoDB connection settings
        self.MONGO_URI = 'mongodb://localhost:27017/'
        self.DB_NAME = 'news'             # database name
        self.COLLECTION_NAME = 'DGCU_AI'  # collection name
        self.client = MongoClient(self.MONGO_URI)
        self.db = self.client[self.DB_NAME]
        self.collection = self.db[self.COLLECTION_NAME]
        # Clear the collection if it already holds data
        self.collection.delete_many({})
        print('Crawl started')

    def process_item(self, item, spider):
        # Convert the item to a plain dict
        item_dict = {
            'title': item['title'],
            'url': item['url'],
            'date': item['date'],
        }
        # Insert the document
        self.collection.insert_one(item_dict)
        return item

    def close_spider(self, spider):
        print('Crawl finished; printing every document in the collection')
        for document in self.collection.find():
            print(document)
        self.client.close()
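The pipeline stores date as the raw scraped string. If you want sortable values in MongoDB, you could normalize it to a datetime before calling insert_one. A minimal sketch, assuming the site prints dates as YYYY-MM-DD (the real format may differ):

```python
from datetime import datetime

def parse_date(raw):
    """Convert a scraped date string to a datetime, or None on failure.

    The '%Y-%m-%d' format is an assumption about the site's markup;
    adjust it to match what the spider actually extracts.
    """
    try:
        return datetime.strptime(raw.strip(), '%Y-%m-%d')
    except (ValueError, AttributeError):
        # ValueError: string doesn't match the format
        # AttributeError: raw was None (field missing on the page)
        return None

print(parse_date(' 2024-01-15 '))  # datetime for that day
print(parse_date('n/a'))           # None
```

Inside process_item you would then store `parse_date(item['date'])` instead of the raw string, keeping None for rows that fail to parse.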
4. Edit the settings.py file:
ITEM_PIPELINES = {
    'mySpider1.pipelines.AIPipeline': 300,
}
5. Run the run.py file:
from scrapy import cmdline
cmdline.execute("scrapy crawl dgcuAI -s LOG_ENABLED=False".split())
If you hit pymongo.errors.ServerSelectionTimeoutError, see the cnblogs.com post by pwindy, "mongoDB 报错 MongoNetworkError: connect ECONNREFUSED 127.0.0.1:27017 : 一个可行的解决方案", for a workable fix.
Run output: (screenshot not reproduced here)
6. Verify the data in MongoDB
# Run the following commands in cmd to inspect the stored data:
> mongosh            # start the MongoDB shell (the mongod server must already be running)
> show dbs           # list all databases
> use news           # switch to the news database
> show collections   # list the collections in the current database
> db.DGCU_AI.find()  # print every document in the DGCU_AI collection