Python: Resolving a Bilibili Short Share Link to Its Permanent Link (and Stripping the Extras from the Share Text)

📚 Blog homepage: knighthood2001
🙏 My own knowledge is limited, so corrections and suggestions are very welcome.

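Anyone who has used Bilibili knows that sharing a video produces a snippet containing the title and a link, for example: 【xxxxx】 https://b23.tv/Wr3moAZ

Today's goal is to find the permanent link behind this short link. The short link may well stop working after a week or two, and it is also awkward for scraping: if you send it as the Referer header, the request will most likely return nothing useful.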

This article proceeds in three steps.

Resolving the short link to its permanent link


import requests


def get_permanent_link(url):
    # Send a HEAD request (GET also works, but HEAD skips the response body,
    # so it is faster and lighter).
    # allow_redirects=True makes requests follow the whole redirect chain,
    # so there is no need to inspect the Location header by hand.
    response = requests.head(url, allow_redirects=True, timeout=10)
    # response.url is the address of the final response, i.e. the long link.
    return response.url


# Try it out
temporary_link = "https://b23.tv/Oz899Lb"
permanent_link = get_permanent_link(temporary_link)
print("Permanent link:", permanent_link)
# https://www.bilibili.com/video/BV1LT421S7sh?buvid=XXBF6E0BE943914C6FCC270C1BC645FEABF88&from_spmid=tm.recommend.0.0&is_story_h5=false&mid=rD2uXjILlSxisOyhyvffSg%3D%3D&p=1&plat_id=116&share_from=ugc&share_medium=android&share_plat=android&share_session_id=a834af0a-dc13-4fbe-bf8f-dbc285b6c4db&share_source=COPY&share_tag=s_i&spmid=united.player-video-detail.0.0&timestamp=1716192491&unique_k=Oz899Lb&up_id=259649365

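The key input is the short link: from it we need the permanent URL behind it, and this is where requests comes in. The mechanism involved is HTTP redirection. Location is the HTTP response header that gives the redirect target; when a response carries it (typically with a 3xx status), the server is asking the client to go to another URL.

One detail is easy to misread. In my tests the final response from Bilibili never contained a Location header, which can look as if no redirect happened. In fact the redirect does happen: because the request is sent with allow_redirects=True, requests follows it automatically, and the response we get back is already the final one. Reading response.url is therefore enough.

At this point you have the long link.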
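To make the redirect-following behaviour concrete, here is a toy model of what an HTTP client does when it follows redirects automatically. It is only a sketch: follow_redirects and the fake_responses table are hypothetical stand-ins that I made up so the example runs offline, and the example.com-style URLs are not real Bilibili addresses.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(start_url, responses, max_hops=10):
    """Follow Location headers through a table of canned responses.

    `responses` maps a URL to (status_code, location_or_None). It stands in
    for real HTTP traffic (an assumption made for this illustration).
    Returns (final_url, history), where history lists the URLs that redirected.
    """
    url = start_url
    history = []
    for _ in range(max_hops):
        status, location = responses[url]
        if status in REDIRECT_CODES and location:
            history.append(url)
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url, history
    raise RuntimeError("too many redirects")


# Made-up data: a short link that redirects once to a long link.
fake_responses = {
    "https://short.example/abc": (302, "https://video.example/video/BV_demo?share_source=COPY"),
    "https://video.example/video/BV_demo?share_source=COPY": (200, None),
}

final_url, history = follow_redirects("https://short.example/abc", fake_responses)
print(final_url)
print(history)
```

The final response in this model has no Location value, just like the final response seen in the tests above. With requests, the intermediate hops are available as response.history and the final address as response.url.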

Stripping everything except the link from the share text

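Share text like the example at the top of this article contains a title as well as the link. We only want the URL, so everything else has to go.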

This is where regular expressions come in.

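import re


# Extract the first URL from text
def get_url(text):
    # Regular expression for a URL: the scheme followed by non-whitespace characters
    pattern = r'https?://\S+'
    match = re.search(pattern, text)
    if match:
        return match.group()
    # No URL found
    return None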

This leaves only the link itself.
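One caveat with a "non-whitespace" pattern: if the share text is pasted into a sentence, punctuation that directly follows the link (for example a Chinese full stop) is captured as part of the URL. The sketch below is one possible way around that, under the assumption that the links of interest contain only a small set of ASCII characters. The name extract_first_url and the chosen character class are my own and narrower than what the URL standard allows.

```python
import re

# ASCII-only character class: the match stops at whitespace and at
# non-ASCII punctuation such as the Chinese full stop.
URL_PATTERN = re.compile(r"https?://[\w\-./?=&%#:~+]+", re.ASCII)


def extract_first_url(text):
    """Return the first http(s) URL in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    if match is None:
        return None
    # Drop trailing ASCII punctuation that usually belongs to the sentence,
    # not to the URL.
    return match.group().rstrip(".,;:!?")


print(extract_first_url("【xxxxx】 https://b23.tv/Wr3moAZ。"))
print(extract_first_url("no link here"))
```

The first call prints the short link without the trailing full stop, and the second prints None.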

Removing the parameters from the long link

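The URL returned by the resolving step is already the long link.

However, it carries a lot of query parameters. As far as I can tell they serve no purpose here, so they can be removed.

Again, a regular expression does the job.

Looking at Bilibili video links, everything after the ? is the query string, and dropping it does not affect loading the video.

The code is as follows: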

import re


# Get the clean URL
def get_pure_url(text):
    # Match everything from the scheme up to (not including) the first "?"
    pattern = r'^(https?://[^?]+)'
    match = re.search(pattern, text)
    if match:
        return match.group()
    else:
        return None

Cutting the URL at the ? character and keeping the first half gives the essential part of the URL.
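As an alternative to the regular expression, the standard library's urllib.parse can split a URL into its components, which also makes it possible to keep selected parameters. The sketch below is not the method used in this article. The function name strip_query and its keep argument are hypothetical, and whether any parameter (such as p in the sample links) is worth keeping is an assumption you would need to check for your own use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query(url, keep=()):
    """Return url without its query string and fragment.

    Parameters whose names appear in `keep` are preserved, in their
    original order.
    """
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    # An empty query and fragment mean no "?" or "#" is emitted.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_query("https://www.bilibili.com/video/BV1Bw4m1i7ko/?share_source=copy_web"))
print(strip_query("https://www.bilibili.com/video/BV11U411d7fR?buvid=X&p=1&plat_id=116", keep=("p",)))
```

With the default empty keep, the result matches what the regular expression produces for these links.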

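Putting it together

Here is everything combined into one tidy script. To reuse it elsewhere, you only need to rename the main function and call it.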

import requests
import re
"""
# Desktop share text: 【夏天晚上吹着小风吃龙虾烧烤太爽了！天津凌奥夜市88元小龙虾自助】 https://www.bilibili.com/video/BV1Bw4m1i7ko/?share_source=copy_web&vd_source=80a8f348074649de1e18f1345dee7db3
# Resolved link (before stripping parameters): https://www.bilibili.com/video/BV1Bw4m1i7ko/?share_source=copy_web&vd_source=80a8f348074649de1e18f1345dee7db3
# Mobile share text: 【杭州首家100元一位313羊庄自助，震惊！-哔哩哔哩】 https://b23.tv/Wr3moAZ
# Resolved link (before stripping parameters): https://www.bilibili.com/video/BV11U411d7fR?buvid=XXBF6E0BE943914C6FCC270C1BC645FEABF88&from_spmid=tm.recommend.0.0&is_story_h5=false&mid=rD2uXjILlSxisOyhyvffSg%3D%3D&p=1&plat_id=116&share_from=ugc&share_medium=android&share_plat=android&share_session_id=4931472e-3d3c-4a66-aa7f-ee419ad505fc&share_source=COPY&share_tag=s_i&spmid=united.player-video-detail.0.0&timestamp=1717260126&unique_k=Wr3moAZ&up_id=257385649
"""
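# Extract the first URL from text
def get_url(text):
    # Regular expression for a URL: the scheme followed by non-whitespace characters
    pattern = r'https?://\S+'
    match = re.search(pattern, text)
    if match:
        return match.group()
    # No URL found
    return None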
def get_permanent_link(url):
    # Send a HEAD request (GET also works, but HEAD skips the response body,
    # so it is faster and lighter).
    # allow_redirects=True makes requests follow the whole redirect chain,
    # so there is no need to inspect the Location header by hand.
    response = requests.head(url, allow_redirects=True, timeout=10)
    # response.url is the address of the final response, i.e. the long link.
    return response.url

# Get the clean URL
def get_pure_url(text):
    # Match everything from the scheme up to (not including) the first "?"
    pattern = r'^(https?://[^?]+)'
    match = re.search(pattern, text)
    if match:
        return match.group()
    else:
        return None
def main(text):
    # Run the pipeline once and reuse the result, so the short link is
    # not requested over the network twice
    result = get_pure_url(get_permanent_link(get_url(text)))
    print(result)
    return result
if __name__ == '__main__':
    aaa = "【夏天晚上吹着小风吃龙虾烧烤太爽了！天津凌奥夜市88元小龙虾自助】 https://www.bilibili.com/video/BV1Bw4m1i7ko/?share_source=copy_web&vd_source=80a8f348074649de1e18f1345dee7db3"
    bbb = "【杭州首家100元一位313羊庄自助，震惊！-哔哩哔哩】 https://b23.tv/Wr3moAZ"
    main(aaa)
    main(bbb)
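If you want to exercise the extract, resolve, strip pipeline without touching the network (for example in a unit test), one option is to pass the resolving step in as a parameter. The following is a minimal sketch under that assumption. normalize_share_text, fake_resolve, and the https://b23.tv/demo link are hypothetical names and data invented for the illustration; in real use the resolve argument would be a network-backed function such as get_permanent_link.

```python
import re


def normalize_share_text(text, resolve):
    """Extract the first URL from text, resolve it, and drop the query string.

    `resolve` is any callable that maps a URL to its final URL.
    Returns None when the text contains no URL.
    """
    match = re.search(r"https?://\S+", text)
    if match is None:
        return None
    final_url = resolve(match.group())
    # Keep everything before the first "?" (or "#", if present).
    return re.split(r"[?#]", final_url, maxsplit=1)[0]


# Hypothetical offline resolver: a lookup table instead of an HTTP request.
fake_table = {
    "https://b23.tv/demo": "https://www.bilibili.com/video/BV_demo?share_source=COPY",
}


def fake_resolve(url):
    # Unknown URLs are returned unchanged, as if no redirect took place.
    return fake_table.get(url, url)


print(normalize_share_text("【demo】 https://b23.tv/demo", fake_resolve))
print(normalize_share_text("no link", fake_resolve))
```

Returning None for text without a URL also avoids passing None on to the resolving step.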