写在前面
- 工作需要遇到,简单整理
- 理解不足小伙伴帮忙指正
对每个人而言,真正的职责只有一个:找到自我。然后在心中坚守其一生,全心全意,永不停息。所有其它的路都是不完整的,是人的逃避方式,是对大众理想的懦弱回归,是随波逐流,是对内心的恐惧 ——赫尔曼·黑塞《德米安》
逻辑相对简单,主要通过 站长之家 https://cdn.chinaz.com/
,获取全国省市的 CDN节点 IP 信息
采集流程:
- 获取CDN 厂家信息
- 跳转页面到指定的厂家,择需要获取的省份
- 获取当前页IP,循环处理分页数据
- 处理完当前省份,循环跳转其他省份处理
- 处理完当前厂家,循环处理其他厂家
代码:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
@File : cdn_data_dns.py
@Time : 2023/08/21 21:46:47
@Author : Li Ruilong
@Version : 1.0
@Contact : liruilonger@gmail.com
@Desc : 省市CDN 节点IP数据获取
"""
# here put the import lib
from seleniumwire import webdriver
import json
import time
from selenium.webdriver.common.by import By
import pandas as pd
import re
ip_pattern = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"
# 自动登陆
driver = webdriver.Chrome()
with open('C:\\Users\山河已无恙\\Documents\GitHub\\reptile_demo\\demo\\cookie.txt', 'r', encoding='u8') as f:
cookies = json.load(f)
driver.get('https://cdn.chinaz.com/')
for cookie in cookies:
driver.add_cookie(cookie)
driver.get('https://cdn.chinaz.com/')
time.sleep(6)
#CND 商家排行获取 https://cdn.chinaz.com/
CDN_Manufacturer = []
new_div_element = driver.find_element(By.CSS_SELECTOR, ".toplist-main")
div_elements = new_div_element.find_element(By.CSS_SELECTOR, ".ullist")
div_cdn = div_elements.find_elements(By.XPATH,"//a[contains(@href,'server')]")
#CDN_Manufacturer.extend(div_elements)
current_window_1 = driver.current_window_handle
for i,mdn_ms in enumerate(div_cdn):
try:
#driver.execute_script("arguments[0].click();", mdn_ms)
ip_addresse = []
print(mdn_ms.text)
cloud_cdn_name = mdn_ms.text
mdn_ms.click()
time.sleep(2)
driver.switch_to.window(driver.window_handles[-1])
# 滚动到页面底部
driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2)")
time.sleep(5)
areas_list = ["安徽", "河北", "河南", "湖北", "湖南", "江西", "陕西", "山西", "四川", "重庆"]
for a in areas_list:
areas = driver.find_element(By.CSS_SELECTOR,"#areas")
nmg = areas.find_element(By.XPATH,"//a/font[contains(text(),'"+a+"')]")
nmg.click()
time.sleep(2)
new_div_element = driver.find_element(By.CSS_SELECTOR, ".box")
new_table_element = str(new_div_element.text).split("\n")
ip_addresses = re.findall(ip_pattern, str(new_table_element))
ip_addresse.extend(ip_addresses)
if len(driver.find_elements(By.XPATH,"//a[contains(@title, '尾页')]")) < 2:
#driver.close()
#driver.switch_to.window(current_window_1)
ips = {}
ips[cloud_cdn_name] = ip_addresse
df = pd.DataFrame(ips)
df.to_csv('CDN_M_省份_'+a +'_'+cloud_cdn_name+'.csv', index=False)
print("单页数据,数据已保存为CSV文件",'CDN_M_'+a +'_'+cloud_cdn_name+'.csv')
continue
sum_page = driver.find_element(By.XPATH,"//a[contains(@title, '尾页')]")
attribute_value = sum_page.get_attribute('val')
print(attribute_value)
current_window_2 = driver.current_window_handle
for page in range(1,int(attribute_value)):
try:
next_page = driver.find_element(By.XPATH,"//a[contains(@title, '下一页')]")
next_page.click()
time.sleep(5)
new_div_element = driver.find_element(By.CSS_SELECTOR, ".box")
new_table_element = str(new_div_element.text).split("\n")
ip_addresses = re.findall(ip_pattern, str(new_table_element))
ip_addresse.extend(ip_addresses)
except:
print(a,cloud_cdn_name,"没有IP")
time.sleep(5)
pass
continue
ips = {}
ips[cloud_cdn_name] = ip_addresse
df = pd.DataFrame(ips)
df.to_csv('CDN_M_省份_'+a+'_'+cloud_cdn_name+'.csv', index=False)
print("数据已保存为CSV文件",' CDN_M_省份_'+a+'_'+cloud_cdn_name+'.csv')
except:
print(cloud_cdn_name,"没有IP")
pass
continue
finally:
pass
driver.close()
driver.switch_to.window(current_window_1)
continue
博文部分内容参考
© 文中涉及参考链接内容版权归原作者所有,如有侵权请告知
© 2018-2023 liruilonger@gmail.com, All rights reserved. 保持署名-非商用-相同方式共享(CC BY-NC-SA 4.0)