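Contents
I. Extracting stock data (http://quote.stockstar.com/)
1. Open the web page
2. Pick the information technology sector, click through, and copy the page URL: http://quote.stockstar.com/stock/industry_I.shtml
3. Press F12 to open the developer tools, inspect the page structure, and locate the elements that hold the data to scrape
4. Copy out the located HTML and analyze it
5. Analysis done, time to write the code
(1) Parse the HTML with Beautiful Soup:
(2) Find the table that holds the stock data:
(3) Extract the rows of the table:
(4) Loop over the rows and extract each stock's fields:
(5) Full code
6. Result
II. Extracting the Sina News hot list
III. Closing remarks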
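I. Extracting stock data (http://quote.stockstar.com/)
1. Open the web page
2. Pick the information technology sector, click through, and copy the page URL: http://quote.stockstar.com/stock/industry_I.shtml
3. Press F12 to open the developer tools, inspect the page structure, and locate the elements that hold the data to scrape
In the developer tools you can see that all the data to scrape sits inside the div whose class is box box02.
4. Copy out the located HTML and analyze it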
<div class="box box02">
<div class="bg_box" id="dataTable">
<div class="con">
<!-- This is the table that holds the stock data; it is what we need to extract -->
<table width="100%" border="0" cellpadding="0" cellspacing="0" class="trHover" id="table1">
<thead class="tbody_right">
<tr>
<td width="6%" class="align_center">
<a href="javascript:void(0)" sort="0" target="_self" class="newup">代码</a>
</td>
<td width="24%" class="align_center">简称
</td>
<td width="17.5%" class="align_right">
<a href="javascript:void(0)" sort="1" target="_self">流通市值(万元)</a>
</td>
<td width="17.5%" class="align_right">
<a href="javascript:void(0)" sort="2" target="_self">总市值(万元)</a>
</td>
<td width="17.5%" class="align_right">
<a href="javascript:void(0)" sort="3" target="_self">流通股本(万元)</a>
</td>
<td width="17.5%" class="align_right">
<a href="javascript:void(0)" sort="4" target="_self">总股本(万元)</a>
</td>
</tr>
</thead>
<tbody class="tbody_right" id="datalist">
<!-- start: every row between start and end holds one stock's data; we loop over the rows and print each one -->
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000004.shtml">000004</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000004.shtml">国华网安</a></td>
<td class="align_right ">190063.58</td>
<td class="align_right ">199232.32</td>
<td class="align_right ">12628.81</td>
<td class="align_right ">13238.03</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000032.shtml">000032</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000032.shtml">深桑达A</a></td>
<td class="align_right ">1166377.73</td>
<td class="align_right ">2058568.25</td>
<td class="align_right ">64476.38</td>
<td class="align_right ">113795.92</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000158.shtml">000158</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000158.shtml">常山北明</a></td>
<td class="align_right ">1224653.26</td>
<td class="align_right ">1235730.73</td>
<td class="align_right ">158428.62</td>
<td class="align_right ">159861.67</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000409.shtml">000409</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000409.shtml">云鼎科技</a></td>
<td class="align_right ">364104.22</td>
<td class="align_right ">581661.43</td>
<td class="align_right ">42337.70</td>
<td class="align_right ">67635.05</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000503.shtml">000503</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000503.shtml">国新健康</a></td>
<td class="align_right ">905071.39</td>
<td class="align_right ">991065.37</td>
<td class="align_right ">89877.99</td>
<td class="align_right ">98417.61</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000555.shtml">000555</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000555.shtml">神州信息</a></td>
<td class="align_right ">1166774.74</td>
<td class="align_right ">1170929.32</td>
<td class="align_right ">97231.23</td>
<td class="align_right ">97577.44</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000676.shtml">000676</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000676.shtml">智度股份</a></td>
<td class="align_right ">914176.97</td>
<td class="align_right ">915255.50</td>
<td class="align_right ">127500.28</td>
<td class="align_right ">127650.70</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000682.shtml">000682</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000682.shtml">东方电子</a></td>
<td class="align_right ">1222620.92</td>
<td class="align_right ">1222743.03</td>
<td class="align_right ">134059.31</td>
<td class="align_right ">134072.70</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000839.shtml">000839</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000839.shtml">ST国安</a></td>
<td class="align_right ">764366.14</td>
<td class="align_right ">764366.14</td>
<td class="align_right ">391982.64</td>
<td class="align_right ">391982.64</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000889.shtml">000889</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000889.shtml">ST中嘉</a></td>
<td class="align_right ">148744.29</td>
<td class="align_right ">160105.78</td>
<td class="align_right ">86984.97</td>
<td class="align_right ">93629.11</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000948.shtml">000948</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000948.shtml">南天信息</a></td>
<td class="align_right ">528632.72</td>
<td class="align_right ">539879.79</td>
<td class="align_right ">38614.52</td>
<td class="align_right ">39436.07</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000971.shtml">000971</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000971.shtml">ST高升</a></td>
<td class="align_right ">134618.70</td>
<td class="align_right ">166725.83</td>
<td class="align_right ">84665.85</td>
<td class="align_right ">104859.01</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000997.shtml">000997</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000997.shtml">新 大 陆</a></td>
<td class="align_right ">1788947.49</td>
<td class="align_right ">1798885.70</td>
<td class="align_right ">102636.12</td>
<td class="align_right ">103206.29</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002063.shtml">002063</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002063.shtml">远光软件</a></td>
<td class="align_right ">943962.50</td>
<td class="align_right ">1024941.65</td>
<td class="align_right ">175457.71</td>
<td class="align_right ">190509.60</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002065.shtml">002065</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002065.shtml">东华软件</a></td>
<td class="align_right ">1625588.07</td>
<td class="align_right ">1795070.13</td>
<td class="align_right ">290283.58</td>
<td class="align_right ">320548.24</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002093.shtml">002093</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002093.shtml">国脉科技</a></td>
<td class="align_right ">713868.17</td>
<td class="align_right ">714317.50</td>
<td class="align_right ">100686.62</td>
<td class="align_right ">100750.00</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002095.shtml">002095</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002095.shtml">生 意 宝</a></td>
<td class="align_right ">392665.13</td>
<td class="align_right ">394243.20</td>
<td class="align_right ">25170.84</td>
<td class="align_right ">25272.00</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002123.shtml">002123</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002123.shtml">梦网科技</a></td>
<td class="align_right ">650474.06</td>
<td class="align_right ">757978.52</td>
<td class="align_right ">68687.86</td>
<td class="align_right ">80039.97</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002131.shtml">002131</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002131.shtml">利欧股份</a></td>
<td class="align_right ">1309578.84</td>
<td class="align_right ">1515714.58</td>
<td class="align_right ">584633.41</td>
<td class="align_right ">676658.29</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002148.shtml">002148</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002148.shtml">北纬科技</a></td>
<td class="align_right ">249222.00</td>
<td class="align_right ">308537.10</td>
<td class="align_right ">45148.91</td>
<td class="align_right ">55894.40</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002153.shtml">002153</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002153.shtml">石基信息</a></td>
<td class="align_right ">1121434.43</td>
<td class="align_right ">1913164.88</td>
<td class="align_right ">159976.38</td>
<td class="align_right ">272919.38</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002174.shtml">002174</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002174.shtml">游族网络</a></td>
<td class="align_right ">916019.16</td>
<td class="align_right ">917717.77</td>
<td class="align_right ">91419.08</td>
<td class="align_right ">91588.60</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002195.shtml">002195</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002195.shtml">岩山科技</a></td>
<td class="align_right ">1674280.22</td>
<td class="align_right ">1694554.91</td>
<td class="align_right ">565635.21</td>
<td class="align_right ">572484.77</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002197.shtml">002197</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002197.shtml">证通电子</a></td>
<td class="align_right ">491566.64</td>
<td class="align_right ">565213.89</td>
<td class="align_right ">53431.16</td>
<td class="align_right ">61436.29</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002212.shtml">002212</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002212.shtml">天融信</a></td>
<td class="align_right ">811854.37</td>
<td class="align_right ">823376.63</td>
<td class="align_right ">116813.58</td>
<td class="align_right ">118471.46</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002230.shtml">002230</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002230.shtml">科大讯飞</a></td>
<td class="align_right ">10642815.24</td>
<td class="align_right ">11280510.86</td>
<td class="align_right ">218448.59</td>
<td class="align_right ">231537.58</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002232.shtml">002232</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002232.shtml">启明信息</a></td>
<td class="align_right ">637335.59</td>
<td class="align_right ">637335.59</td>
<td class="align_right ">40854.85</td>
<td class="align_right ">40854.85</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002235.shtml">002235</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002235.shtml">安妮股份</a></td>
<td class="align_right ">319414.68</td>
<td class="align_right ">334413.21</td>
<td class="align_right ">55357.83</td>
<td class="align_right ">57957.23</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002238.shtml">002238</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002238.shtml">天威视讯</a></td>
<td class="align_right ">898866.26</td>
<td class="align_right ">898866.26</td>
<td class="align_right ">80255.92</td>
<td class="align_right ">80255.92</td>
</tr>
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/002247.shtml">002247</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/002247.shtml">聚力文化</a></td>
<td class="align_right ">108114.83</td>
<td class="align_right ">142946.17</td>
<td class="align_right ">64354.07</td>
<td class="align_right ">85087.00</td>
</tr>
<!-- end -->
</tbody>
<tbody>
<tr id="has_fyStock_data" class="noSelect no_trHover">
<td colspan="12" class="time notSelect">
<span class="fl" id="latesttime_span">数据时间:2024-03-29</span>
<div class="fenye fr" id="divPageControl1">共<strong>422</strong>条记录<span><em>1</em></span><a
href="/stock/industry_I_0_0_2.html" target="_self"><em>2</em></a><a
href="/stock/industry_I_0_0_3.html" target="_self"><em>3</em></a><a
href="/stock/industry_I_0_0_4.html" target="_self"><em>4</em></a><a
href="/stock/industry_I_0_0_5.html" target="_self"><em>5</em></a><em>...</em><a
href="/stock/industry_I_0_0_15.html" target="_self"><em>15</em></a><a
href="/stock/industry_I_0_0_2.html" target="_self"
class="n"><em>下一页</em></a>到第<input type="text" class="page_input"
id="txtPageNumber"
onkeydown="if (event.keyCode == 13){PagedControl.GoToThePage('/stock/industry_I_0_0_{0}.html');return false;}">页<a
href="javascript:void(0);"
onclick="PagedControl.GoToThePage('/stock/industry_I_0_0_{0}.html');return false;"><em>确定</em></a>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
5. Analysis done, time to write the code
(1) Parse the HTML with Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url = "http://quote.stockstar.com/stock/industry_I.shtml"
response = requests.get(url)
response.encoding = 'gbk'  # Set the encoding to gbk so the Chinese text decodes correctly
soup = BeautifulSoup(response.text, 'html.parser')
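Why the encoding line matters can be shown offline. The sketch below assumes, as the code above does, that the server sends GBK bytes; it decodes the same bytes once with the right codec and once with latin-1, which is the kind of fallback requests may use for text responses that declare no charset. The variable names are only for illustration.

```python
# Offline illustration: the same bytes decoded with the right and the wrong codec.
sample_name = "国华网安"                # a stock name taken from the table above
raw_bytes = sample_name.encode("gbk")   # stand-in for the bytes the server sends

decoded_ok = raw_bytes.decode("gbk")        # what response.encoding = 'gbk' gives us
decoded_bad = raw_bytes.decode("latin-1")   # what a wrong guess can look like

print(decoded_ok)
print(decoded_bad)
```

If you are unsure of a page's encoding, response.apparent_encoding gives a guess based on the content, which can be a useful starting point.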
(2) Find the table that holds the stock data:
table = soup.find('table', class_='trHover')
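The lookup above can be written in a few equivalent ways, and it returns None when nothing matches. Here is a minimal sketch on a cut-down copy of the page structure (the HTML string is a shortened stand-in, not the full page):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the structure analyzed in step 4.
html = """
<div class="box box02">
  <table class="trHover" id="table1">
    <tbody id="datalist"><tr><td>000004</td></tr></tbody>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("table", class_="trHover")          # same call as above
by_id = soup.find("table", id="table1")                  # match on the id instead
by_css = soup.select_one("div.box02 table#table1")       # CSS selector form

missing = soup.find("table", class_="no_such_class")     # no match gives None
if missing is None:
    print("table not found; the page layout may have changed")
```

Checking for None before calling find_all on the result avoids an AttributeError if the site changes its markup.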
(3) Extract the rows of the table:
rows = table.find_all('tr')
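Calling find_all('tr') on the whole table also returns the header row and the pagination row. One alternative, sketched here on a shortened stand-in for the table, is to search only inside the tbody whose id is datalist, which is where the excerpt in step 4 keeps the stock rows:

```python
from bs4 import BeautifulSoup

# Shortened stand-in: one header row, two data rows, one pagination row.
html = """
<table class="trHover" id="table1">
  <thead><tr><td>代码</td><td>简称</td></tr></thead>
  <tbody id="datalist">
    <tr><td>000004</td><td>国华网安</td></tr>
    <tr><td>000032</td><td>深桑达A</td></tr>
  </tbody>
  <tbody><tr id="has_fyStock_data"><td colspan="12">pagination</td></tr></tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="trHover")

all_rows = table.find_all("tr")                                # header + data + pagination
data_rows = table.find("tbody", id="datalist").find_all("tr")  # data rows only

print(len(all_rows), len(data_rows))
```

With this approach there is no header row to skip, so the loop can run over data_rows directly.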
(4) Loop over the rows and extract each stock's fields:
for row in rows[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) >= 6:  # Skip rows without all six columns, such as the pagination row
        stock_code = cells[0].text.strip()
        stock_name = cells[1].text.strip()
        circulation_market_value = cells[2].text.strip()
        total_market_value = cells[3].text.strip()
        circulation_stock = cells[4].text.strip()
        total_stock = cells[5].text.strip()
        print(f"Code: {stock_code}, Name: {stock_name}, "
              f"Float market cap: {circulation_market_value}, Total market cap: {total_market_value}, "
              f"Float shares: {circulation_stock}, Total shares: {total_stock}")
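If you want to do more than print, the same six cells can be turned into a record with numeric fields. This is only a sketch: the field names and the row_to_record helper are made up for illustration, and the sample row is copied from the excerpt in step 4.

```python
from bs4 import BeautifulSoup

# Hypothetical field names for the six columns, in page order.
FIELDS = ["code", "name", "float_cap", "total_cap", "float_shares", "total_shares"]

def row_to_record(row):
    """Turn one <tr> into a dict, or return None if it lacks six cells."""
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) < 6:
        return None
    record = dict(zip(FIELDS, cells))
    for key in FIELDS[2:]:            # the four numeric columns
        record[key] = float(record[key])
    return record                     # code stays a string to keep leading zeros

sample_row = BeautifulSoup("""
<tr>
<td class="align_center select"><a href="//stock.quote.stockstar.com/000004.shtml">000004</a></td>
<td class="align_center"><a href="//stock.quote.stockstar.com/000004.shtml">国华网安</a></td>
<td class="align_right ">190063.58</td>
<td class="align_right ">199232.32</td>
<td class="align_right ">12628.81</td>
<td class="align_right ">13238.03</td>
</tr>
""", "html.parser").find("tr")

record = row_to_record(sample_row)
print(record)
```

Keeping the stock code as a string matters here, because converting 000004 to a number would drop the leading zeros.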
(5) Full code
import requests
from bs4 import BeautifulSoup

url = "http://quote.stockstar.com/stock/industry_I.shtml"
response = requests.get(url)
response.encoding = 'gbk'  # Set the encoding to gbk; without it the Chinese text comes out garbled
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table', class_='trHover')
rows = table.find_all('tr')

for row in rows[1:]:  # Skip the header row
    cells = row.find_all('td')
    if len(cells) >= 6:  # Skip rows without all six columns, such as the pagination row
        stock_code = cells[0].text.strip()
        stock_name = cells[1].text.strip()
        circulation_market_value = cells[2].text.strip()
        total_market_value = cells[3].text.strip()
        circulation_stock = cells[4].text.strip()
        total_stock = cells[5].text.strip()
        print(f"Code: {stock_code}, Name: {stock_name}, "
              f"Float market cap: {circulation_market_value}, Total market cap: {total_market_value}, "
              f"Float shares: {circulation_stock}, Total shares: {total_stock}")
This extracts the stock data from the first page of the table.
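The script above only reads the first page, while the pagination row in the excerpt reports 422 records and links to pages such as /stock/industry_I_0_0_2.html up to page 15. A rough sketch of how the page URLs could be generated follows. It assumes 30 rows per page (the number of rows in the excerpt) and that page 1 stays at the original address; neither is confirmed by the site, and page_urls is a made-up helper name.

```python
import math
from urllib.parse import urljoin

BASE = "http://quote.stockstar.com/stock/industry_I.shtml"
TOTAL_RECORDS = 422   # from the pagination row in the excerpt
ROWS_PER_PAGE = 30    # assumption: counted from the excerpt

def page_urls(total=TOTAL_RECORDS, per_page=ROWS_PER_PAGE):
    """Build the list of page URLs following the link pattern in the excerpt."""
    last_page = math.ceil(total / per_page)
    urls = [BASE]     # assumption: page 1 is the address we started from
    for n in range(2, last_page + 1):
        urls.append(urljoin(BASE, f"/stock/industry_I_0_0_{n}.html"))
    return urls

urls = page_urls()
print(len(urls))
print(urls[1])
```

When fetching these pages in a loop, a short time.sleep between requests keeps the load on the site low.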
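6. Result
II. Extracting the Sina News hot list
The steps are the same as above:
open the page, press F12, copy out the HTML of the part you want to scrape, analyze it, and write the code.
In this case, copy the HTML for the part highlighted in blue in the developer tools: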
<div class="blk_main_card">
<!-- 热榜 -->
<!-- blk_main_li is the parent element -->
<div class="blk_main_li" tab-type="tab-cont">
<ul class="uni-blk-list02 list-a list-0427" style="padding-top: 7px;">
<li><a href="https://sinanews.sina.cn/native_zt/yingyanlandingpage1711786917" data="0" target="_blank">小米汽车遭遇上百余名消费者投诉</a></li>
<li><a href="https://sinanews.sina.cn/native_page/quanzi_914931027323416577.html" data="1" target="_blank">偷点外卖就不要写真实姓名了</a></li>
<li id="hot_list_ad">
<a id="hotlist_index_3" href="https://s.weibo.com/weibo?q=%E5%93%AA%E4%BA%9B%E4%BA%BA%E5%AE%B9%E6%98%93%E5%BE%97%E7%99%BE%E6%97%A5%E5%92%B3" data="2" target="_blank">哪些人容易得百日咳</a>
<ins class="sinaads sinaads-fail" id="sinaads-right-hotlist" data-ad-pdps="PDPS000000067800" data-ad-width="360" data-ad-height="26" data-ad-type="embed" style="display:none" data-ad-status="done"></ins>
<script>(sinaads = window.sinaads || []).push({
params: {
element: document.getElementById("PDPS000000067800"),
sinaads_success_handler:function () {
var ads = document.getElementById("sinaads-right-hotlist");
var _news= document.getElementById("hotlist_index_3");
var hot_list_ad= document.getElementById("hot_list_ad")
_news.style.display="none";
ads.style.display= "block";
hot_list_ad.classList.add("hotlist_have_ad")
},
sinaads_fail_handler: function () {
console.log('sinaads_fail_handler')
}
}
})</script>
</li>
<!-- Every hot-list item is wrapped in an li tag -->
<li><a href="https://sinanews.sina.cn/native_zt/yingyanlandingpage1711790585" data="3" target="_blank">杭州东站</a></li>
<li><a href="https://sinanews.sina.cn/native_page/quanzi_914336910352965633.html" data="4" target="_blank">2024中国网络媒体论坛</a></li>
<li><a href="https://sinanews.sina.cn/native_page/quanzi_914966334487650305.html" data="5" target="_blank">雷军能不能生产一下相机</a></li>
<li><a href="https://sinanews.sina.cn/native_zt/yingyanlandingpage1711790450" data="6" target="_blank">医院取精室里都有些什么</a></li>
<li><a href="https://k.sina.com.cn/article_5756451891_m1571c7c3303301b0u4.html?from=news&subch=onews" data="7" target="_blank">警方辟谣面具男用病毒针扎人</a></li>
<li><a href="https://finance.sina.cn/2024-03-30/detail-inaqawts0171984.d.html" data="8" target="_blank">殡葬用品店否认南通烧纸普遍2层楼高</a></li>
<li><a href="https://sinanews.sina.cn/native_zt/yingyanlandingpage1711790306" data="9" target="_blank">花间令女性群像没有郑合惠子</a></li>
<li><a class="fe661" href="https://sinanews.sina.cn/h5/top_news_list.d.html" data="10" target="_blank">点击查看更多实时热点</a></li>
</ul>
</div>
</div>
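After the analysis, write the code:
import requests
from bs4 import BeautifulSoup

# Page URL
url = 'https://news.sina.com.cn/'
# Send a GET request and get the response
response = requests.get(url)
# Parse the HTML with BeautifulSoup (passing bytes lets it detect the encoding)
soup = BeautifulSoup(response.content, 'html.parser')
# Find the parent element that holds the hot list
hot_news_parent = soup.find('div', class_='blk_main_li')
# Find all hot-list entries
hot_news_list = hot_news_parent.find_all('li')
# Loop over the hot list and extract each entry
for news_item in hot_news_list:
    if news_item.a is None:  # Skip any li that has no link
        continue
    # Extract the title and link
    news_title = news_item.a.text.strip()  # Title text with surrounding whitespace removed
    news_link = news_item.a['href']  # Link target
    # Print the title and link
    print(f"Title: {news_title}\nLink: {news_link}\n")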
Result:
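One possible refinement of the script above: the last li in the excerpt is a link to more hot topics rather than a news item, and it carries the class fe661. The sketch below skips it and returns an empty list when the parent div is missing. It runs on a shortened stand-in for the HTML, extract_hot_news is a made-up helper name, and treating fe661 as a stable marker is an assumption based only on the excerpt.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the hot-list block analyzed above.
sample_html = """
<div class="blk_main_li" tab-type="tab-cont">
<ul class="uni-blk-list02 list-a list-0427">
<li><a href="https://sinanews.sina.cn/native_zt/yingyanlandingpage1711790585" data="3" target="_blank">杭州东站</a></li>
<li>an item without a link</li>
<li><a class="fe661" href="https://sinanews.sina.cn/h5/top_news_list.d.html" data="10" target="_blank">点击查看更多实时热点</a></li>
</ul>
</div>
"""

def extract_hot_news(html):
    """Return (title, link) pairs, skipping the 'more' link and items without links."""
    soup = BeautifulSoup(html, "html.parser")
    parent = soup.find("div", class_="blk_main_li")
    if parent is None:                      # layout changed or content not in the HTML
        return []
    items = []
    for link in parent.select("li > a[href]"):
        if "fe661" in link.get("class", []):   # assumption: marks the 'more' link
            continue
        items.append((link.get_text(strip=True), link["href"]))
    return items

hot_news = extract_hot_news(sample_html)
for title, href in hot_news:
    print(f"Title: {title}\nLink: {href}\n")
```

Returning a list instead of printing inside the loop also makes it easier to save the items to a file later.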
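III. Closing remarks
Working through today's examples should deepen your understanding of Beautiful Soup and how to use it. When scraping, follow each site's crawler rules and avoid sending requests too often or scraping more than you need, so that you do not burden the site. Keep learning, and keep exploring new techniques and methods to improve your scraping skills and efficiency. The same goes for anything else you do. Best wishes to all of you for more success on your own paths!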
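As a small illustration of following crawler rules, Python's standard library can read a robots.txt file and tell you whether a path is allowed and what delay is requested. The rules and the example.com URLs below are invented for the demo and do not describe any real site; in practice you would load the site's own robots.txt with set_url and read. polite_delay is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used only so the demo runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

def polite_delay(parser, user_agent="*", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, or a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

allowed = parser.can_fetch("*", "http://example.com/stock/industry_I.shtml")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
wait_seconds = polite_delay(parser)

print(allowed, blocked, wait_seconds)
```

In a real crawl loop you would check can_fetch before each request and call time.sleep(wait_seconds) between requests.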