Scrapy spider: concurrently crawl the Weibo hot-search ranking, follow each entry's detail page to grab the first Weibo post, and save the results to a local text file (or to Excel)

If you want to save the results to Excel instead, see my other spider:

Using the Scrapy framework with multiprocessing to crawl Beike (贝壳) data into an Excel file, including paginated list data and detail-page data — for learning and reference only (CSDN blog)

Final data preview
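Each entry in weibo_top_data.txt comes out as a hot-search title on one line followed by the cleaned first post on the next, roughly like this (angle-bracket placeholders, not real data):

<hot-search title>
【<headline>】<cleaned text of the first matching post>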

QuotesSpider — the spider

import re

import scrapy

from weibo_top.items import WeiboTopItem


class QuotesSpider(scrapy.Spider):
    name = "weibo_top"
    allowed_domains = ['s.weibo.com']

    def start_requests(self):
        yield scrapy.Request(url="https://s.weibo.com/top/summary?cate=realtimehot")

    def parse(self, response, **kwargs):
        trs = response.css('#pl_top_realtimehot > table > tbody > tr')
        count = 0
        for tr in trs:
            if count >= 30:  # only keep the top 30 entries
                break
            item = WeiboTopItem()
            item['title'] = tr.css('.td-02 a::text').get()
            href = tr.css('.td-02 a::attr(href)').get()
            if href:  # only build the absolute URL when the row actually has a link
                item['link'] = 'https://s.weibo.com/' + href
                count += 1
                yield scrapy.Request(url=item['link'], callback=self.parse_detail, meta={'item': item})
            else:
                item['link'] = ''
                yield item

    def parse_detail(self, response, **kwargs):
        item = response.meta['item']
        list_items = response.css('div.card-wrap[action-type="feed_list_item"]')
        for li in list_items[:1]:  # only the first post on the detail page
            content = li.xpath('.//p[@class="txt"]/text()').getall()
            # keep only CJK characters, ASCII letters/digits, and 【】,
            processed_content = [re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9【】,]', '', text) for text in content]
            processed_content = [text.strip() for text in processed_content if text.strip()]
            item['desc'] = ','.join(processed_content).replace('【,', '【')
            yield item
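From the project root, the spider runs with the standard Scrapy command (the middleware and pipeline shown below are wired up in settings.py):

scrapy crawl weibo_top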

The item — defining the data structure

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WeiboTopItem(scrapy.Item):
    title = scrapy.Field()  # hot-search title
    link = scrapy.Field()   # detail-page URL
    desc = scrapy.Field()   # cleaned text of the first post

Middleware — setting the Cookie, User-Agent, and Host headers

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from fake_useragent import UserAgent
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class WeiboTopSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class WeiboTopDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    def __init__(self):
        self.cookie_string = "SUB=_2AkMS10-nf8NxqwFRmfoXyG3jaoxxygHEieKki758JRMxHRl-yT9vqhIrtRB6OVdhSYUGwRsrtuQyFPy_aLfaay7wguyu; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WhBJpfihr9Mo_TDhk.fIHFo; _s_tentry=www.baidu.com; UOR=www.baidu.com,s.weibo.com,www.baidu.com; Apache=5259811159487.941.1709629772294; SINAGLOBAL=5259811159487.941.1709629772294; ULV=1709629772313:1:1:1:5259811159487.941.1709629772294:"
        # self.referer = "https://sh.ke.com/chengjiao/"

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Attach the parsed cookies, a random User-Agent, and the Host header
        # to every outgoing request before it reaches the downloader.
        request.cookies = self.get_cookie()
        request.headers['User-Agent'] = UserAgent().random
        request.headers['Host'] = 's.weibo.com'
        # request.headers["referer"] = self.referer
        return None

    def get_cookie(self):
        # Parse the raw "k1=v1; k2=v2" cookie string into the dict Scrapy expects.
        # partition('=') keeps values that themselves contain '=' intact, and
        # strip() drops the space that follows each ';' in the raw string.
        cookie_dict = {}
        for kv in self.cookie_string.split(";"):
            k, _, v = kv.strip().partition('=')
            cookie_dict[k] = v
        return cookie_dict

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
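A throwaway snippet (not part of the project) to sanity-check the cookie parsing above — keys should come out without the leading space that follows each ';' in the raw string:

if __name__ == '__main__':
    mw = WeiboTopDownloaderMiddleware()
    cookies = mw.get_cookie()
    print(sorted(cookies))  # expect clean keys such as 'SUB', 'SUBP', 'ULV'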

Pipeline — saving the data to a local text file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WeiboTopPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Buffer every item in memory; everything is written out when the spider closes.
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Write all collected items to a local text file, one title/desc block per entry.
        with open('weibo_top_data.txt', 'w', encoding='utf-8') as file:
            for item in self.items:
                title = item.get('title', '')
                desc = item.get('desc', '')
                file.write(f'{title}\n{desc}\n\n')
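The post links to a separate article for the Excel variant; as a rough sketch of the idea, a second pipeline along these lines (assuming openpyxl is installed — it is not part of this project, and the class name is made up here for illustration) could write the same three fields to an .xlsx file:

from openpyxl import Workbook


class WeiboTopExcelPipeline:
    # Hypothetical Excel counterpart of WeiboTopPipeline, sketched for illustration.
    def open_spider(self, spider):
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['title', 'link', 'desc'])  # header row

    def process_item(self, item, spider):
        self.ws.append([item.get('title', ''), item.get('link', ''), item.get('desc', '')])
        return item

    def close_spider(self, spider):
        self.wb.save('weibo_top_data.xlsx')

Like the text pipeline, it would still need to be registered in ITEM_PIPELINES to take effect.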

settings — configuring concurrency and download delays

# Scrapy settings for weibo_top project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "weibo_top"

SPIDER_MODULES = ["weibo_top.spiders"]
NEWSPIDER_MODULE = "weibo_top.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "weibo_top (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "weibo_top.middlewares.WeiboTopSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   "weibo_top.middlewares.WeiboTopDownloaderMiddleware": 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "weibo_top.pipelines.WeiboTopPipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True  # must be enabled for the two delay values below to take effect
# The initial download delay
AUTOTHROTTLE_START_DELAY = 80
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 160
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
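Incidentally, with FEED_EXPORT_ENCODING already set, Scrapy's built-in feed exports can dump the same items straight to a file without any custom pipeline — handy for a quick check (the -O overwrite flag assumes Scrapy ≥ 2.1):

scrapy crawl weibo_top -O weibo_top.json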

