This time I used the BeautifulSoup library to scrape Steam's top sellers. BeautifulSoup focuses on parsing the page structure, extracting data by tags and elements. I ran into two problems along the way:
1. The Steam top-sellers list often contains duplicates, so I keep a list of the titles scraped so far and check each new entry against it; if the title is already there, I skip it to avoid writing it twice.
2. I needed to iterate over two lists at the same time, and found that the zip() function solves this. A brief introduction below.
zip()
The example below should make it clear.
```python
xs = ['I am ', 'You are ', 'He is ']
ys = ['first', 'second', 'third']
for x, y in zip(xs, ys):
    print(x + y)
```
The output:
I am first
You are second
He is third
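One thing worth knowing: zip() stops at the shortest iterable, so any extra elements in the longer list are silently dropped. If the two lists can differ in length, the standard library's itertools.zip_longest pads the shorter one instead. A quick sketch with toy data:

```python
from itertools import zip_longest

xs = ['a', 'b', 'c']
ys = [1, 2]

# zip() truncates to the shorter input: the 'c' is dropped
pairs = list(zip(xs, ys))
print(pairs)

# zip_longest() pads the shorter input with fillvalue
padded = list(zip_longest(xs, ys, fillvalue=0))
print(padded)
```

For the scraper below this doesn't matter as long as every result row has both a title and a release date, but it is the kind of silent truncation that is worth being aware of.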
Below is the full scraper code. Please install the required libraries yourself; I won't cover that here:
```python
from bs4 import BeautifulSoup
import xlwt, os, time, requests

page = 1
total_pages = 3
count = 1
pool = []  # titles seen so far, used to skip duplicates
document = 'Steam_GameTopSellers'

wb = xlwt.Workbook()
ws = wb.add_sheet("TopSellers")
ws.write(0, 0, '#')
ws.write(0, 1, 'Game Title')
ws.write(0, 2, 'Released Date')

root = os.getcwd()
date = time.strftime('%Y%m%d', time.localtime(time.time()))
session = requests.Session()  # one session reused across pages, instead of a new one per request

while page < total_pages:
    url = 'https://store.steampowered.com/search/?tags=597&filter=topsellers&page=%s' % page
    res = session.get(url).text
    soup = BeautifulSoup(res, "html.parser")
    game_names = soup.find_all('span', attrs={'class': 'title'})
    released_dates = soup.find_all('div', attrs={'class': 'col search_released responsive_secondrow'})
    # zip() walks the two result lists in parallel
    for game_name, released_date in zip(game_names, released_dates):
        if game_name.string in pool:  # skip titles we have already written
            continue
        print('%s. Game: %s  Released: %s' % (count, game_name.string, released_date.string))
        pool.append(game_name.string)
        ws.write(count, 0, count)
        ws.write(count, 1, game_name.string)
        ws.write(count, 2, released_date.string)
        count += 1
    rate = page / (total_pages - 1)
    print('-------------------- Page %s done -------------------- Progress: %.2f%%' % (page, rate * 100))
    page += 1

filename = '%s%s.xls' % (document, date)
wb.save(filename)
print('-------------------- Scraping complete --------------------')
print('All data saved to: %s' % os.path.join(root, filename))
```
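A small design note on the duplicate check: `game_name.string in pool` scans a Python list, which is O(n) per lookup; a set does the same membership test in O(1) while an ordinary list alongside it preserves output order. A minimal sketch of the same idea, using made-up titles:

```python
seen = set()
titles = ['Dota 2', 'Elden Ring', 'Dota 2', 'Stardew Valley']

unique = []
for title in titles:
    if title in seen:  # O(1) membership test on a set
        continue
    seen.add(title)
    unique.append(title)

print(unique)  # duplicates dropped, first-seen order preserved
```

For a few hundred titles the list is fine; the set only starts to matter as the pool grows.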
Scrape results (1242 records in total; pages 51~60 were all duplicates):
![Screenshot of the console output](https://img-blog.csdnimg.cn/20190320000827891.?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mjg1MjIxMA==,size_16,color_FFFFFF,t_70)
![Screenshot of the saved spreadsheet](https://img-blog.csdnimg.cn/20190320001107370.?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80Mjg1MjIxMA==,size_16,color_FFFFFF,t_70)
Conclusion:
Steam, really... the last ten pages are all duplicates. At first I assumed it was some anti-scraping mechanism, but checking the actual pages in a browser confirmed the duplicates really are there.