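After opening the 运动汇 results link, the page looks like the figure below:
Right-click and inspect the page, and you will find no data in the DOM; the data is only loaded once you expand "第一单元" (Unit 1), "第二单元" (Unit 2), and so on.
Since we only need to analyze this single page and extract its data, we can simply expand every unit by hand, along with the results of each event inside it, so that the site loads all the data, as shown below:
Next, use the browser directly: right-click and choose Inspect:
Click the topmost <html> node, right-click it, hover over Copy, and choose Copy element:
Paste the copied HTML into a text file yourself; the data can then be parsed easily with XPath expressions. The parsing code is as follows: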
# Output columns: event, name, college, heat/lane, rank, result, score, remarks
from lxml import etree

# Competition date written into the first CSV column; change it for your event.
EVENT_YEAR = "2024.11.15-16"


class ReadHtmlTable:
    def __init__(self, html_content: str):
        self.html_content = html_content
        # CSV header: date, event, name, college, heat/lane, rank, result, score, remarks
        self.table_col = ("比赛时间", "项目", "姓名", "学院", "组/道", "名次", "成绩", "得分", "备注")

    def get_different_units(self) -> list[etree._Element]:
        # Each <li> under #div_Result is one unit ("第一单元", "第二单元", ...).
        tree = etree.HTML(self.html_content)
        units = tree.xpath(r'//*[@id="div_Result"]/ul/li')
        return units

    def get_all_tables(self) -> list[etree._Element]:
        # Collect the <tbody> of every event table in every unit.
        units = self.get_different_units()
        tables = []
        for unit in units:
            table = unit.xpath(r"./div[2]/div/div/table/tbody")
            tables.extend(table)
        return tables

    def get_table_record(self, table_element: etree._Element) -> list[tuple]:
        # A raw row looks like: [None, '2/4', 'XX名字', '日语', '16.97', '27', None, None]
        # The first row of the inner table is the header, so skip it.
        rows = table_element.xpath("./tr[2]/td/table/tbody/tr")[1:]
        records = []
        try:
            # The title cell is "<number>\xa0<event name>"; an empty table has none, so return early.
            event_name = table_element.xpath("./tr[1]/td[1]/text()")[0].strip().split('\xa0')[1]
        except IndexError:
            return records
        for rank, row in enumerate(rows, 1):
            # Rank is taken from the row order, not from a cell in the table.
            items = [i.text.strip() if i.text is not None else "" for i in row.xpath(r"./td")]
            # (date, event, name, college, heat/lane, rank, result, score, remarks)
            records.append((EVENT_YEAR, event_name, items[2], items[3], items[1], rank, items[4], items[5], items[6]))
        return records

    def write_in_csv(self):
        with open("./sports_meeting_res.csv", "w", encoding="utf-8") as f:
            f.write(",".join(self.table_col) + "\n")
            for table in self.get_all_tables():
                for record in self.get_table_record(table):
                    # Convert the record to a comma-separated line and write it out.
                    csv_line = ",".join(map(str, record)) + "\n"
                    f.write(csv_line)


if __name__ == "__main__":
    with open("./sports_meeting_res2024.txt", "r", encoding="utf-8") as f:
        html_content = f.read()
    read_html_table = ReadHtmlTable(html_content)
    read_html_table.write_in_csv()
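Note that the global variable EVENT_YEAR in the code above stands for the competition date (despite the name, the example value "2024.11.15-16" is a date range), and you need to change it to the real date of your own event. You also need to change the path of the file being read, which is "./sports_meeting_res2024.txt" in the code; that file holds the HTML element content I copied.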
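Before pointing the script at your own file, it can help to check that the copied HTML has the shape the XPath expressions expect. The snippet below is only a sketch: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the real page (the event name, the header cells, and the exact nesting are assumptions made for illustration), with one unit holding one event table. The `<tbody>` tags are written out explicitly because the expressions require them, as they appear in a DOM copied from the browser.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the copied page: one unit (<li>)
# containing one event table. The real page has more wrappers and attributes.
SAMPLE_HTML = """
<html><body><div id="div_Result"><ul>
<li>
  <div>第一单元</div>
  <div><div><div><table><tbody>
    <tr><td>1&nbsp;男子100米</td></tr>
    <tr><td><table><tbody>
      <tr><td></td><td>组/道</td><td>姓名</td><td>学院</td><td>成绩</td><td>得分</td><td>备注</td><td></td></tr>
      <tr><td></td><td>2/4</td><td>XX名字</td><td>日语</td><td>16.97</td><td>27</td><td></td><td></td></tr>
    </tbody></table></td></tr>
  </tbody></table></div></div></div>
</li>
</ul></div></body></html>
"""

tree = etree.HTML(SAMPLE_HTML)

# Same expressions as in the script above.
units = tree.xpath('//*[@id="div_Result"]/ul/li')
event_tables = [t for u in units for t in u.xpath("./div[2]/div/div/table/tbody")]
print(len(units), "unit(s),", len(event_tables), "event table(s)")

for table in event_tables:
    # Title cell: "<number>\xa0<event name>"
    title = table.xpath("./tr[1]/td[1]/text()")[0].strip()
    event_name = title.split("\xa0")[1]
    # Skip the header row of the inner table.
    data_rows = table.xpath("./tr[2]/td/table/tbody/tr")[1:]
    first_row = [(td.text or "").strip() for td in data_rows[0].xpath("./td")]
    print(event_name, first_row)
```

If the same two counts come back as 0 for your own file, the units were probably not expanded before copying, or the page layout differs from what the expressions assume.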
Finally, all the records are saved to a CSV file in the specified format. Very easy!
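One caveat with joining fields by hand: if any field (a remark, for instance) contains a comma, the line gains an extra column. A minimal sketch of an alternative, assuming you are free to swap the writer for Python's standard `csv` module, is shown below. The helper name `write_records` and the sample record are hypothetical; an in-memory buffer is used so the example runs without touching any files.

```python
import csv
import io

# Same column order as the script above.
COLUMNS = ("比赛时间", "项目", "姓名", "学院", "组/道", "名次", "成绩", "得分", "备注")

# Hypothetical record whose remark contains a comma.
records = [
    ("2024.11.15-16", "男子100米", "XX名字", "日语", "2/4", 1, "16.97", "27", "备注A,备注B"),
]


def write_records(fileobj, columns, rows):
    """Write a header and rows; csv.writer quotes fields that contain commas."""
    writer = csv.writer(fileobj)
    writer.writerow(columns)
    writer.writerows(rows)


buffer = io.StringIO()
write_records(buffer, COLUMNS, records)

# Read the data back to confirm the remark survived as a single field.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(len(parsed[1]), parsed[1][-1])
```

When writing to a real file instead of a buffer, open it with `newline=""` (for example `open(path, "w", encoding="utf-8", newline="")`), as the `csv` module documentation recommends.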