How to Scrape and Process Weather Website Data

Purpose

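Accurate historical weather data is essential for meteorological research. This article shows how to collect data from a weather website and convert it into tabular form for later analysis. Fetching the data directly, however, may run into protection on the site's API endpoint. The article explains the steps for handling this and shows how to turn the data into usable tables.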

Implementation Logic

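Overall logic:

First, fetch the weather data for the specified years and months from the website and save it as text files.
Then parse the dates and temperatures out of those text files and save them to CSV files.

The code therefore has two main parts: fetching historical weather data and saving it as text files, and extracting the data from those files into CSV files. Each part is explained in detail below.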

Part 1: Fetch historical weather data and save it as text files

1. Import the required libraries

import requests
from bs4 import BeautifulSoup
from glob import glob
import csv

requests: sends HTTP requests.
BeautifulSoup: parses HTML content.
glob: finds file paths that match a given pattern.
csv: reads and writes CSV files.

2. Set the request headers and cookies

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Sec-Ch-Ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Host': 'tianqi.2345.com',
    'Referer': 'https://tianqi.2345.com/wea_history/51133.htm',
}
cookies = {
    'lastCityId': '51133',
    'lastCountyId': '51133',
    'lastCountyPinyin': 'tacheng',
    'lastCountyTime': '1718931941',
    'lastProvinceId': '40',
    'Hm_lpvt_a3f2879f6b3620a363bec646b7a8bcdd': '1718931942',
    'Hm_lvt_a3f2879f6b3620a363bec646b7a8bcdd': '1718929158',
}

headers: the header fields needed to make the request look like it comes from a browser.
cookies: used to carry session information.

3. Define the lists of years and months

years = [2024, 2023, 2022]
months = [4, 5, 6]

years and months: the years and months for which data will be fetched.
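The loop in the next step visits every (year, month) pair from these two lists. As a small illustration, the same pairs can be listed with itertools.product; the name periods is introduced here only for the example.

```python
from itertools import product

years = [2024, 2023, 2022]
months = [4, 5, 6]

# Every (year, month) combination, in the same order as a nested for-loop
periods = list(product(years, months))

print(len(periods))  # 9 requests in total
print(periods[0])    # (2024, 4)
```

Three years times three months gives nine requests, so nine text files are expected after Part 1 finishes.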

4. Loop over the years and months, send the requests, and save the response data

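for year in years:
    for month in months:
        url = f'https://tianqi.2345.com/Pc/GetHistory?areaInfo%5BareaId%5D=51133&areaInfo%5BareaType%5D=2&date%5Byear%5D={year}&date%5Bmonth%5D={month}'

        # A timeout keeps the script from hanging on a stalled connection
        response = requests.get(url, headers=headers, cookies=cookies, timeout=10)
        # Stop early on an HTTP error instead of saving an error page
        response.raise_for_status()

        # The 'data' field of the JSON response holds the HTML fragment with the table
        with open(f'{year}-{month}.txt', 'w', encoding='utf-8') as file:
            file.write(response.json()['data'])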

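Build the URL: construct the URL for the historical weather data from the given year and month.
Send the request: call requests.get with the request headers and cookies.
Save the data: write the data field of the returned JSON to a text file named after the year and month.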

Part 2: Extract data from the text files and save it to CSV files

1. Iterate over all the text files

# Raw string, so the backslashes are not treated as escape sequences
for file_path in glob(r'D:\lab\paper\date\*.txt'):

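The glob function finds every text file that matches the given path pattern. This assumes the text files from Part 1 were saved in, or moved to, that directory; Part 1 writes them to the directory the script is run from.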
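To see the pattern matching in isolation, the sketch below creates a few empty files in a temporary directory and collects only the .txt files. It uses pathlib's Path.glob, which applies the same wildcard rules as glob; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Create two data files and one unrelated file
    for name in ('2024-4.txt', '2023-5.txt', 'notes.md'):
        (data_dir / name).write_text('', encoding='utf-8')

    # Only names ending in .txt match the pattern
    txt_files = sorted(p.name for p in data_dir.glob('*.txt'))

print(txt_files)  # ['2023-5.txt', '2024-4.txt']
```

Sorting the result is optional, but it makes the processing order predictable, since glob does not promise any particular order.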

2. Read the file contents and parse the HTML

# The files were written as UTF-8 in Part 1, so read them with the same encoding
with open(file_path, encoding='utf-8') as fs:
    soup = BeautifulSoup(fs.read(), 'html.parser')

Read the file contents: open and read each text file.
Parse the HTML: parse the contents with BeautifulSoup.
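Part 1 saves the files as UTF-8, so it is safer to pass the same encoding when reading them back. Without it, open falls back to the platform's default encoding, which on some systems (many Windows setups, for example) is not UTF-8. The round trip below is a minimal sketch with a made-up fragment.

```python
import tempfile
from pathlib import Path

text = '<td>晴</td>'  # made-up fragment containing non-ASCII text

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample.txt'
    # Write and read with the same explicit encoding
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    with open(path, encoding='utf-8') as f:
        restored = f.read()

print(restored == text)  # True
```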

3. Extract the file name and prepare the CSV file

# Take the last path component (Windows-style separator) and drop the extension
filename = file_path.split('\\')[-1].split('.')[0]
with open(f'{filename}.csv', 'w', newline='', encoding='utf-8') as file:
    csv_writer = csv.writer(file)
    csv_writer.writerow(['Date', 'High', 'Low'])

Extract the file name: take the file name from the path and drop the extension.
Prepare the CSV file: create a new CSV file and write the header row.
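Splitting on a backslash only works for Windows-style paths. As a portable alternative, Path.stem from pathlib returns the file name without its extension on any platform. The sketch below derives the CSV name that way and writes a header row into a temporary directory; the input path and the header names are illustrative.

```python
import csv
import tempfile
from pathlib import Path

file_path = Path('data') / '2024-4.txt'  # hypothetical input path
filename = file_path.stem                 # file name without extension

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / f'{filename}.csv'
    # newline='' lets the csv module control line endings itself
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(['Date', 'High', 'Low'])
    # Read the file back to check what was written
    with open(csv_path, newline='', encoding='utf-8') as f:
        header = next(csv.reader(f))

print(filename)  # 2024-4
print(header)    # ['Date', 'High', 'Low']
```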

4. Extract the data and write it to the CSV file

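# Find all rows with date and temperature data
for row in soup.select('.history-table tr')[1:]:  # skip the header row
    columns = row.find_all('td')
    if len(columns) == 6:
        date = columns[0].text.strip()
        high_temp = columns[1].text.strip()
        low_temp = columns[2].text.strip()
        # Write to the CSV file only when the row has the expected 6 cells;
        # otherwise stale or undefined values would be written
        csv_writer.writerow([date, high_temp, low_temp])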

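Select the data rows: use the CSS selector .history-table tr to find all table rows, skipping the header row.
Extract the data: take the date, high temperature, and low temperature from each row.
Write to the CSV file: write the extracted values to the CSV file.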
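The extraction step can be tried without any network access by running it on a small HTML fragment. The fragment below is made up for illustration: the real response may use different cell text, and the assumption that temperatures end in a degree sign is untested against the site. The helper parse_rows is hypothetical and additionally converts the temperatures to integers, which the original code does not do.

```python
from bs4 import BeautifulSoup

# Made-up sample with the same shape the selector expects (6 cells per row)
SAMPLE_HTML = """
<table class="history-table">
  <tr><th>Date</th><th>High</th><th>Low</th><th>Weather</th><th>Wind</th><th>AQI</th></tr>
  <tr><td>2024-04-01</td><td>15°</td><td>3°</td><td>Sunny</td><td>NW 3</td><td>45</td></tr>
  <tr><td>2024-04-02</td><td>-2°</td><td>-9°</td><td>Snow</td><td>N 2</td><td>30</td></tr>
  <tr><td colspan="6">no data</td></tr>
</table>
"""


def parse_rows(html):
    """Return (date, high, low) tuples from a history table (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.select('.history-table tr')[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if len(cells) != 6:
            continue  # ignore rows that do not have the expected layout
        date, high, low = cells[0], cells[1], cells[2]
        # Assumption: temperatures look like '15°'; strip the sign and convert
        records.append((date, int(high.rstrip('°')), int(low.rstrip('°'))))
    return records


print(parse_rows(SAMPLE_HTML))
# [('2024-04-01', 15, 3), ('2024-04-02', -2, -9)]
```

The row with a single merged cell is skipped, which is the case the length check is there to handle.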

Full Code


import requests
from bs4 import BeautifulSoup
from glob import glob 
import csv

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,ja;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    
    'Sec-Ch-Ua': '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Host': 'tianqi.2345.com',
    'Referer': 'https://tianqi.2345.com/wea_history/51133.htm',
}
cookies = {
    'lastCityId': '51133',
    'lastCountyId': '51133',
    'lastCountyPinyin': 'tacheng',
    'lastCountyTime': '1718931941',
    'lastProvinceId': '40',
    'Hm_lpvt_a3f2879f6b3620a363bec646b7a8bcdd': '1718931942',
    'Hm_lvt_a3f2879f6b3620a363bec646b7a8bcdd': '1718929158',

}
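years = [2024, 2023, 2022]
months = [4, 5, 6]

# Part 1: fetch each month's data and save the HTML fragment as a text file
for year in years:
    for month in months:
        url = f'https://tianqi.2345.com/Pc/GetHistory?areaInfo%5BareaId%5D=51133&areaInfo%5BareaType%5D=2&date%5Byear%5D={year}&date%5Bmonth%5D={month}'

        # A timeout keeps the script from hanging on a stalled connection
        response = requests.get(url, headers=headers, cookies=cookies, timeout=10)
        # Stop early on an HTTP error instead of saving an error page
        response.raise_for_status()

        with open(f'{year}-{month}.txt', 'w', encoding='utf-8') as file:
            file.write(response.json()['data'])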

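# Part 2: parse each saved text file and write one CSV file per month.
# Raw string, so the backslashes are not treated as escape sequences.
# Assumes the .txt files saved above are located in this directory.
for file_path in glob(r'D:\lab\paper\date\*.txt'):

    # The files were written as UTF-8, so read them with the same encoding
    with open(file_path, encoding='utf-8') as fs:
        soup = BeautifulSoup(fs.read(), 'html.parser')

    # Take the last path component (Windows-style separator) and drop the extension
    filename = file_path.split('\\')[-1].split('.')[0]
    with open(f'{filename}.csv', 'w', newline='', encoding='utf-8') as file:
        csv_writer = csv.writer(file)
        csv_writer.writerow(['Date', 'High', 'Low'])

        # Find all rows with date and temperature data
        for row in soup.select('.history-table tr')[1:]:  # skip the header row
            columns = row.find_all('td')
            if len(columns) == 6:
                date = columns[0].text.strip()
                high_temp = columns[1].text.strip()
                low_temp = columns[2].text.strip()
                # Write only rows that have the expected 6 cells
                csv_writer.writerow([date, high_temp, low_temp])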