
Scrapy Tutorial¶

In this tutorial, we’ll assume that Scrapy is already installed on your system. If that’s not the case, see Installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project

  2. Writing a spider to crawl a site and extract data

  3. Exporting the scraped data using the command line

  4. Changing the spider to recursively follow links

  5. Using spider arguments

Scrapy is written in Python. The more you learn about Python, the more you can get out of Scrapy.

If you’re already familiar with other languages and want to learn Python quickly, the Python Tutorial is a good resource.

If you’re new to programming and want to start with Python, the following books may be useful to you:

  • Automate the Boring Stuff With Python

  • How To Think Like a Computer Scientist

  • Learn Python 3 The Hard Way

You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.

Creating a project¶

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider¶

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to be made, and optionally, how to follow links in pages and parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

    The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

How to run our spider¶

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we’ve just added, which will send some requests to the quotes.toscrape.com domain. You will get an output similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note

If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?¶

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method) passing the response as an argument.

A shortcut to the start_requests method¶

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider.

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.

Extracting data¶

The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:

scrapy shell 'https://quotes.toscrape.com/page/1/'

Note

Remember to always enclose URLs in quotes when running Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "https://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Using the shell, you can try selecting elements using CSS with the response object:

>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do:

>>> response.css("title::text").getall()
['Quotes to Scrape']

There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:

>>> response.css("title").getall()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:

>>> response.css("title::text").get()
'Quotes to Scrape'

As an alternative, you could’ve written:

>>> response.css("title::text")[0].get()
'Quotes to Scrape'

Accessing an index on a SelectorList instance will raise an IndexError exception if there are no results:

>>> response.css("noelement")[0].get()
Traceback (most recent call last):
...
IndexError: list index out of range

You might want to use .get() directly on the SelectorList instance instead, which returns None if there are no results:

>>> response.css("noelement").get()

There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
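
For instance, as a small illustrative addition to the shell session above (not part of the original transcript), .get() accepts a default argument so you can supply your own fallback value instead of None:

>>> response.css("noelement").get(default="not-found")
'not-found'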

Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:

>>> response.css("title::text").re(r"Quotes.*")
['Quotes to Scrape']
>>> response.css("title::text").re(r"Q\w+")
['Quotes']
>>> response.css("title::text").re(r"(\w+) to (\w+)")
['Quotes', 'Scrape']

In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser’s developer tools to inspect the HTML and come up with a selector (see Using your browser’s Developer Tools for scraping).

Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements, and it works in many browsers.

XPath: a brief intro¶

Besides CSS, Scrapy selectors also support using XPath expressions:

>>> response.xpath("//title")
[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath("//title/text()").get()
'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read the text representation of the selector objects in the shell closely.

While perhaps not as popular as CSS selectors, XPath expressions offer more power: besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like the link that contains the text “Next Page”. This makes XPath very fitting for the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
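
As a small illustration (a hypothetical addition to the shell session, not from the original tutorial), an XPath expression can match the pagination link on this page by its text content, which a plain CSS selector cannot do:

>>> response.xpath('//a[contains(text(), "Next")]/@href').get()
'/page/2/'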

We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.

Extracting quotes and authors¶

Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.

Each quote in https://quotes.toscrape.com is represented by HTML elements that look like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let’s open up scrapy shell and play a bit to find out how to extract the data we want:

scrapy shell 'https://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]

Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:

>>> quote = response.css("div.quote")[0]

Now, let’s extract the text, author and tags from that quote using the quote object we just created:

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .getall() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...

Extracting data in our spider¶

Let’s get back to our spider. Until now, it hasn’t extracted any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

To run this spider, exit the scrapy shell by entering:

quit()

Then, run:

scrapy crawl quotes

Now, it should output the extracted data with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data¶

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -O quotes.json

That will generate a quotes.json file containing all scraped items, serialized in JSON.

The -O command-line switch overwrites any existing file; use -o instead to append new content to any existing file. However, appending to a JSON file makes the file contents invalid JSON. When appending to a file, consider using a different serialization format, such as JSON Lines:

scrapy crawl quotes -o quotes.jsonl

The JSON Lines format is useful because it’s stream-like, so you can easily append new records to it. It doesn’t have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory; there are tools like JQ to help do that at the command line.
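
As a hedged sketch (assuming the quotes.jsonl file produced by the command above), here is how such a feed could be processed from Python, one record at a time:

import json

# Each line of a JSON Lines feed is an independent JSON document,
# so the file can be read record by record without loading it all into memory.
with open("quotes.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        print(item["author"], "-", item["text"])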

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. That said, you don’t need to implement any item pipelines if you just want to store the scraped items.
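
For illustration only, here is a minimal sketch of what such a pipeline could look like in tutorial/pipelines.py; the class name and the length check are hypothetical, not something the tutorial requires:

from scrapy.exceptions import DropItem


class DropShortQuotesPipeline:
    """Illustrative pipeline: discard items whose quote text looks suspiciously short."""

    def process_item(self, item, spider):
        # Every item yielded by a spider passes through this method.
        if len(item.get("text") or "") < 10:
            raise DropItem("quote text too short")
        return item

To actually use a pipeline like this, you would also need to enable it in settings.py, e.g. ITEM_PIPELINES = {"tutorial.pipelines.DropShortQuotesPipeline": 300}.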

Following links¶

Let’s say, instead of just scraping the stuff from the first two pages from https://quotes.toscrape.com, you want quotes from all the pages in the website.

Now that you know how to extract data from pages, let’s see how to follow links from them.

The first thing to do is extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css("li.next a::attr(href)").get()
'/page/2/'

There is also an attrib property available (see Selecting element attributes for more):

>>> response.css("li.next a").attrib["href"]
'/page/2/'

Now let’s see our spider, modified to recursively follow the link to the next page, extracting data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as the callback to handle the data extraction for the next page and to keep the crawl going through all the pages.

What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page it’s visiting.

In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one – handy for crawling blogs, forums and other sites with pagination.

A shortcut for creating Requests¶

As a shortcut for creating Request objects you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

yield from response.follow_all(css="ul.pager a", callback=self.parse)

More examples and patterns¶

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

This spider will start from the main page; it will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback as we saw before.

Here we’re passing callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.

The parse_author callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured in the DUPEFILTER_CLASS setting.
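
Conversely, if you ever need to bypass that filter for a specific request, Request and response.follow accept a dont_filter argument. As a purely illustrative fragment, the parse method of the AuthorSpider above could be rewritten to re-fetch author pages on every encounter:

    def parse(self, response):
        for href in response.css(".author + a::attr(href)").getall():
            # dont_filter=True tells Scrapy not to deduplicate this request,
            # so the same author page may be fetched more than once.
            yield response.follow(href, callback=self.parse_author, dont_filter=True)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)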

Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can write your crawlers on top of.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
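
One common way to do that in Scrapy is the cb_kwargs argument of Request and response.follow, which forwards extra keyword arguments to the callback. Below is a minimal sketch with a hypothetical spider name and field names, combining each quote with data scraped from its author page:

import scrapy


class QuoteAuthorSpider(scrapy.Spider):
    name = "quote_author"  # hypothetical name, used only for this illustration

    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            author_href = quote.css(".author + a::attr(href)").get()
            if author_href is not None:
                # Carry the quote text over to the author-page callback.
                yield response.follow(
                    author_href,
                    callback=self.parse_author,
                    cb_kwargs={"quote_text": quote.css("span.text::text").get()},
                )

    def parse_author(self, response, quote_text):
        # Entries from cb_kwargs arrive here as extra keyword arguments.
        yield {
            "quote_text": quote_text,
            "author_name": response.css("h3.author-title::text").get(),
        }

Keep in mind that the duplicate filter discussed in the previous section would drop repeated requests to the same author page, so a real spider handling several quotes per author might need dont_filter or a different crawl structure.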

Using spider arguments¶

You can provide command line arguments to your spiders by using the -a option when running them:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

These arguments are passed to the Spider’s __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as https://quotes.toscrape.com/tag/humor.

You can learn more about handling spider arguments here.

Next steps¶

This tutorial covered only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the section Basic concepts to learn more about the command-line tool, spiders, selectors and other things the tutorial hasn’t covered, like modeling the scraped data. If you’d prefer to play with an example project, check the Examples section.
