A small spider example: https://industry.cfi.cn/BCA0A4127A4128A4141.html

1. Fetching the list page

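Capturing the traffic shows that the list page data is not returned as plain, readable HTML. The list data looks like the figure below:

Inspecting the page shows that the list data is processed by a function named unes, so let's look at that function next.

The function first replaces every "~" in the list data with "%u", then decodes the resulting string with unescape. That is all it takes to recover the list page data, so let's reproduce it in Python.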

import re
from urllib.parse import unquote

import requests


def get_list_page():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    }
    url = 'https://industry.cfi.cn/BCA0A4127A4128A4141.html'
    response = requests.get(url, headers=headers)
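    # Each list entry is embedded in the page as: var n...="<encoded string>";
    re_data = re.findall(r'var n.*?="(.*?)";', response.text)
    for data in re_data:
        # Mirror unes(): turn "~XXXX" into the escape "\uXXXX" so unicode_escape can decode it
        result = data.replace("~", "\\u")
        list_info = unquote(result).encode('utf8').decode('unicode_escape')
        # Detail page URL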
        detail_url = "https://industry.cfi.cn/"+''.join(re.findall(r'onclick=\"window.open\(\'(.*?)\'\);\"',list_info,re.S))
        print(detail_url)
        # Title: take the text after the onclick handler, then drop the leading tag remainder and any HTML tags
        title_info = re.sub(r'^[^<>]*>|<[^>]+>', '', list_info.split(');"')[-1]).strip()
        print(title_info)
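
The unicode_escape route above works as long as the payload is plain ASCII apart from the escapes. As a hedged alternative, the sketch below follows the JavaScript steps more literally: replace "~" with "%u", then apply a small re-implementation of unescape. The names js_unescape and unes are illustrative Python helpers, not code taken from the site, and surrogate pairs are not recombined.

```python
import re

# The two escape forms understood by JavaScript's unescape():
# %uXXXX (four hex digits) and %XX (two hex digits).
_ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def js_unescape(text):
    """Approximate JavaScript's unescape() for a Python str."""
    def _replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(_replace, text)


def unes(text):
    """Illustrative counterpart of the page's unes(): '~' -> '%u', then unescape."""
    return js_unescape(text.replace("~", "%u"))


print(unes("~4F60~597D"))  # 你好
```

Inside get_list_page, list_info = unes(data) would be one possible stand-in for the replace/unquote/decode chain.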

2. Fetching the detail page

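With the detail page URL in hand, let's look at how to fetch the detail page itself.

The captured response shows the detail data as in the figure above. The function that processes the detail content appears to be ifrnews, so next we locate that function and see what processing it does, as shown in the figure below.

The arrow points to the result we want. The approach is similar to the list page, so let's reproduce the detail page retrieval in Python.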

def get_detail_page():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    }
    url = 'https://industry.cfi.cn/p20231209000312.html'
    response = requests.get(url, headers=headers)
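    # Extract the encoded detail content from the response
    content = ''.join(re.findall(r"var nr\d+=\"(.*?)\";", response.text, re.S))
    # Decode it: percent-decode first, then turn "~XXXX" into "\uXXXX" and decode the escapes
    detail_page_html = unquote(content).replace('~', "\\u").encode('utf8').decode('unicode_escape')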
    print(detail_page_html)
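
The function above hardcodes a single URL and mixes fetching with parsing. As a minimal sketch, the parsing step can be pulled into a pure function that accepts the response text, which makes it easy to check offline and to reuse for every URL collected from the list page. The name parse_detail and the sample string are assumptions made for illustration; the real page's markup may differ.

```python
import re
from urllib.parse import unquote


def parse_detail(page_text):
    """Return the decoded detail HTML embedded in page_text, or '' if none is found."""
    # Same extraction and decoding steps as get_detail_page, applied to a given string.
    content = ''.join(re.findall(r'var nr\d+="(.*?)";', page_text, re.S))
    return unquote(content).replace('~', '\\u').encode('utf8').decode('unicode_escape')


# Synthetic stand-in for response.text, made up for this example.
sample_page = '<script>var nr1="~4F60~597D%3Cbr%3E";</script>'
print(parse_detail(sample_page))  # 你好<br>
```

With this split, get_detail_page could take the URL as a parameter and simply return parse_detail(response.text).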

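Summary:

JavaScript's legacy unescape() understands Unicode escapes written as "%uXXXX", while Python writes the same escapes as "\uXXXX", which the unicode_escape codec can decode.

Examples:

In JavaScript, a "%u"-encoded string is decoded with unescape: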

var str = "%u4F60%u597D";
var decodedStr = unescape(str);
console.log(decodedStr); // Output: 你好

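In Python, the same text written with "\u" escapes is decoded with unicode_escape:

Note that for the escape to reach the decoder as literal text, the backslash itself must be escaped in the string literal, so a double backslash "\\" is written to represent a single backslash.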

encoded_str = "\\u4F60\\u597D"  # avoid naming the variable str, which would shadow the built-in
decoded_str = encoded_str.encode("utf-8").decode("unicode_escape")
print(decoded_str)  # Output: 你好
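
One caveat with this approach: unicode_escape interprets the input bytes as Latin-1, so it only round-trips cleanly when everything outside the escapes is ASCII. The sketch below, using a made-up mixed string, shows the effect and one possible workaround that replaces only the "\uXXXX" sequences.

```python
import re

# Literal Chinese text followed by two \uXXXX escapes (sample data for illustration).
mixed = "你好 \\u4E16\\u754C"

# unicode_escape reads the UTF-8 bytes as Latin-1, so the literal characters are mangled.
mangled = mixed.encode("utf-8").decode("unicode_escape")
print(mangled == "你好 世界")  # False

# Replacing only the \uXXXX sequences leaves the other characters untouched.
fixed = re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), mixed)
print(fixed)  # 你好 世界
```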

The above is for learning purposes only.
