Because Toutiao (今日头条) web pages are rendered dynamically and guarded by various token checks, fetching data directly through the API is difficult. This article uses Selenium to scrape the news content instead.
Selenium core code
Key points:
- The code includes extensive exception handling so that failed requests are retried, which improves stability.
- `EdgeChromiumDriverManager().install()` downloads a matching browser driver automatically, avoiding driver/browser version mismatches after browser updates.
- `driver.refresh()`, `driver.close()`, and `driver.quit()` are called to keep memory usage from growing.
- `--disable-extensions` disables browser extensions to rule out any interference from them.
- `--inprivate` opens an InPrivate window. This sidesteps an annoying problem I ran into with signed-in user profile synchronization; private browsing avoids it.
```python
import time

from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.common.exceptions import WebDriverException, NoSuchWindowException
from webdriver_manager.microsoft import EdgeChromiumDriverManager


def get_html_by_selenium(url):
    print("Fetching:", url)
    options = webdriver.EdgeOptions()
    # Hide the "browser is being controlled by automated software" infobar
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    # Disable extensions
    options.add_argument("--disable-extensions")
    # InPrivate mode
    options.add_argument("--inprivate")
    count = 0
    driver = None
    while count < 10:
        try:
            driver = webdriver.Edge(
                service=Service(executable_path=EdgeChromiumDriverManager().install()),
                options=options,
            )
            # Minimize the window
            driver.minimize_window()
            time.sleep(1)
            driver.get(url)
            break
        except (WebDriverException, ConnectionError) as e:
            print(e)
            count += 1
            time.sleep(3)
            continue
    if driver is None:
        return None
    time.sleep(10)
    try:
        html = driver.page_source
        # Tear down carefully to avoid leaking memory across runs
        driver.refresh()
        try:
            driver.close()
        except WebDriverException:
            pass
        driver.quit()
        return html
    except NoSuchWindowException:
        return None
```
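The retry pattern used above (catch, log, sleep, try again up to 10 times) can be factored into a small generic helper. This is an illustrative sketch of the pattern, not part of the original scraper:

```python
import time


def retry(fn, attempts=10, delay=0, exceptions=(Exception,)):
    """Call fn() until it succeeds or attempts run out.

    Returns fn()'s result, or None if every attempt raised one of
    the given exception types.
    """
    for _ in range(attempts):
        try:
            return fn()
        except exceptions as e:
            print(e)
            time.sleep(delay)
    return None


# Example: a flaky operation that succeeds on the third call
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry(flaky, attempts=5))  # retries twice, then returns "ok"
```

In the real function the driver construction and `driver.get(url)` would be the body of `fn`, with `exceptions=(WebDriverException, ConnectionError)` and `delay=3`.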
News list parsing code
Example URL:
https://www.toutiao.com/c/user/token/MS4wLjABAAAA6Ftyf-tftfbjp1u_TEz6kpY77ZlPaYRV0UsfXkF2UsM/?tab=article
This part is straightforward: we grab each article's title and URL from the list page. One caveat when parsing: the DOM rendered in the browser can differ from the HTML actually returned by the request, so always parse against the HTML content you actually received.
```python
from bs4 import BeautifulSoup

url = f"https://www.toutiao.com/c/user/token/{USER_TOKEN}/?tab=article"
html = get_html_by_selenium(url)
soup = BeautifulSoup(html, "html.parser")
for article in soup.find_all("div", attrs={"class": "profile-article-card-wrapper"}):
    a = article.find("a")
    news_title = a["title"]
    url = a["href"]
    result = parse_and_save_news(url)
    if result is None:
        continue  # the article page failed to load or parse
    content, news_time = result
```
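The anchors on the list page may carry relative hrefs and can occasionally lack a `title` attribute. A more defensive variant of the extraction step could look like the sketch below; it assumes the same class names as the snippet above, while `urljoin`, the missing-attribute handling, and the `extract_articles` helper name are my additions:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.toutiao.com"


def extract_articles(html):
    """Return (title, absolute_url) pairs from a profile list page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.find_all("div", attrs={"class": "profile-article-card-wrapper"}):
        a = card.find("a")
        if a is None or not a.get("href"):
            continue  # skip malformed cards instead of raising KeyError
        title = a.get("title") or a.get_text(strip=True)
        results.append((title, urljoin(BASE, a["href"])))
    return results


sample = '''
<div class="profile-article-card-wrapper">
  <a title="Hello" href="/article/123/">Hello</a>
</div>
'''
print(extract_articles(sample))
# [('Hello', 'https://www.toutiao.com/article/123/')]
```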
News content parsing code
This part is also fairly simple. Image parsing is skipped; the function returns the news body content and its publish time.
```python
from datetime import datetime

from bs4 import BeautifulSoup


def parse_and_save_news(url):
    html = get_html_by_selenium(url)
    if not html:
        return None
    soup = BeautifulSoup(html, "html.parser")
    article_content = soup.find("div", attrs={"class": "article-content"})
    if article_content is None:
        return None
    article_meta = soup.find("div", attrs={"class": "article-meta"})
    # The first <span> in the meta block holds the publish time
    time_string = article_meta.find("span", attrs=None).text
    news_time = datetime.strptime(time_string, "%Y-%m-%d %H:%M")
    article = article_content.article
    # Rebuild a minimal HTML document containing only the paragraph text
    new_soup = BeautifulSoup("<html><body></body></html>", "html.parser")
    body = new_soup.body
    for p in article.find_all("p"):
        body.append(BeautifulSoup(f"<p>{p.text}</p>", "html.parser"))
    content = new_soup.prettify()
    return content, news_time
```
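To see what the parsing steps produce without launching a browser, you can run them over a canned HTML fragment. The markup below is a simplified stand-in I made up for the real page, keeping only the two class names the function relies on:

```python
from datetime import datetime

from bs4 import BeautifulSoup

sample_html = """
<div class="article-meta"><span>2024-01-02 13:45</span></div>
<div class="article-content"><article>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</article></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
meta = soup.find("div", attrs={"class": "article-meta"})
news_time = datetime.strptime(meta.find("span").text, "%Y-%m-%d %H:%M")

article = soup.find("div", attrs={"class": "article-content"}).article
new_soup = BeautifulSoup("<html><body></body></html>", "html.parser")
for p in article.find_all("p"):
    new_soup.body.append(BeautifulSoup(f"<p>{p.text}</p>", "html.parser"))

print(news_time)            # 2024-01-02 13:45:00
print(new_soup.prettify())  # clean document with only the two <p> tags
```

This makes it easy to verify the time format string and the paragraph-only rebuild before pointing the scraper at live pages.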