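Step 1. Open the Vipshop site at https://www.vip.com and search for any product, for example "键盘" (keyboard). Scroll down after searching: the page URL does not change, yet more products keep loading. That means the page loads its data dynamically via Ajax, and in that situation the first thing to do is find the underlying API endpoint.
Step 2. Open the developer tools (right-click and choose Inspect), switch to the Network tab, and refresh the Vipshop page to capture its requests. Check the Preview pane of each request: the product information is in 'v2?callback=getMerchandiseDroplets1...'. The URL of that request is very long, but going through its parameters shows that the part that matters is the pid each product has. So once we have the pids, we can fetch the product data.
Keep checking the Preview pane of the other captured requests, and the pids turn up in 'rank?callback=getMerchandiseIds...'. From here it is straightforward. Switch to Headers and look at the URL parameters. Paging on the site changes only pageOffset, which grows by 120 per page, so each page holds 120 products. Searching for a different product changes only keyword. Knowing this, we can search by keyword, fetch the matching product information, and page through the results.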
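The paging rule above can be sketched in a few lines. This is only an illustration: the helper names `build_rank_params` and `build_rank_url` are made up here, and the sketch includes just the parameters that change plus batchSize. A real request would presumably still need the other captured parameters (api_key, mars_cid, and so on).

```python
from urllib.parse import urlencode

# Endpoint path taken from the captured "rank" request.
RANK_URL = "https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank"
BATCH_SIZE = 120  # products per page, as observed while paging


def build_rank_params(keyword, page):
    """Return the parameters that vary between requests (hypothetical helper).

    page is zero-based, so page 0 -> pageOffset 0, page 1 -> pageOffset 120.
    """
    return {
        "keyword": keyword,
        "pageOffset": page * BATCH_SIZE,
        "batchSize": BATCH_SIZE,
    }


def build_rank_url(keyword, page):
    # urlencode percent-encodes the keyword, so non-ASCII input is safe.
    return RANK_URL + "?" + urlencode(build_rank_params(keyword, page))


if __name__ == "__main__":
    for page in range(3):
        print(build_rank_url("键盘", page))
```

Letting `urlencode` build the query string avoids hand-editing a long URL, which is where a missing `=` is easy to overlook.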
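Step 3. Get the product pids. Request the URL of the 'rank?callback=getMerchandiseIds...' request, changing keyword and pageOffset to get the data you want, and remember to send request headers. The Preview pane shows that this endpoint returns JSON, but not valid JSON: the payload is wrapped in a callback call (JSONP). The wrapper has to be stripped before the text can be parsed into a dict, after which the pids are easy to pull out. Here is the code.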
import json
from urllib.parse import quote

import requests

keyword = input('Enter the product keyword to search for >>> ')
pagenum = int(input('Enter the number of pages (120 products per page) >>> '))
for i in range(pagenum):
    # Only keyword and pageOffset change between requests
    url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101108&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1628070214309_e7fbca2c43dda020cc7734c00466d49c&wap_consumer=a&standby_id=nature&keyword={}&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={}&channelId=1&gPlatform=PC&batchSize=120&_=1628070503449'.format(quote(keyword), 120 * i)
    headers = {
        'referer': 'https://category.vip.com/',
        'user-agent': 'Mozilla/5.0'
    }
    html = requests.get(url, headers=headers)
    # print(html.text)
    # Slice from the opening '{"code"' to just before the final character
    # (the closing parenthesis of the callback) to get plain JSON
    start = html.text.find('{"code"')
    json_data = json.loads(html.text[start:-1])['data']['products']
    # print(json_data)
    for data in json_data:
        pid = data['pid']
        # print(pid)
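The slicing above depends on the response ending exactly with the closing parenthesis. A slightly more defensive variant is sketched below, assuming the response has the usual JSONP shape `callback({...})`. The helper name `parse_jsonp` and the sample string are made up for illustration; they are not captured from the site.

```python
import json


def parse_jsonp(text):
    """Extract the JSON payload from a JSONP response such as callback({...})."""
    start = text.find("(")   # first "(" opens the callback call
    end = text.rfind(")")    # last ")" closes it, even if ";" or a newline follows
    if start == -1 or end == -1 or end < start:
        raise ValueError("response does not look like JSONP")
    return json.loads(text[start + 1:end])


if __name__ == "__main__":
    # Made-up sample shaped like the response described above.
    sample = 'getMerchandiseIds({"code": 1, "data": {"products": [{"pid": "123"}, {"pid": "456"}]}})'
    payload = parse_jsonp(sample)
    pids = [product["pid"] for product in payload["data"]["products"]]
    print(pids)
```

Searching for the outermost parentheses rather than for a fixed `{"code"` prefix means the helper does not depend on the order of keys in the payload.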
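Step 4. With the pids in hand, go back to 'v2?callback=getMerchandiseDroplets1...'. Put a pid into its URL and send the request again to get that product's JSON data. Convert it to a dict in the same way, then read whatever fields you need straight out of the dict.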
for data in json_data:
    pid = data['pid']
    # print(pid)
    # The pid goes into the productIds parameter
    stuff_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101108&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1628070214309_e7fbca2c43dda020cc7734c00466d49c&wap_consumer=a&productIds={}&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%2C%22ic2label%22%3A1%7D&context=&_=1628071156110'.format(pid)
    stuff_html = requests.get(stuff_url, headers=headers)
    # print(stuff_html.text)
    # Cut the JSON object out of the JSONP wrapper
    start = stuff_html.text.find('{"code"')
    end = stuff_html.text.find('"}}') + len('"}}')
    stuff_json_data = json.loads(stuff_html.text[start:end])['data']['products']
    # print(stuff_json_data)
    for stuff_data in stuff_json_data:
        title = stuff_data['title']
        price = stuff_data['price']['salePrice']
        imgurl = stuff_data['squareImage']
        print('Name: {}, price: {} yuan'.format(title, price))
        print(imgurl)
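Direct indexing raises KeyError as soon as one product lacks a field. As a minimal sketch, the extraction can be wrapped in a small helper that falls back to None. The field names title, price.salePrice and squareImage come from the code above; the helper name `extract_product` and the sample record are made up for illustration.

```python
def extract_product(record):
    """Pull title, sale price, and image URL out of one product record.

    Missing keys yield None instead of raising KeyError.
    """
    price = record.get("price") or {}
    return {
        "title": record.get("title"),
        "price": price.get("salePrice"),
        "image": record.get("squareImage"),
    }


if __name__ == "__main__":
    # Made-up record with the same shape as the fields read above.
    sample = {
        "title": "Sample mechanical keyboard",
        "price": {"salePrice": "199"},
        "squareImage": "https://example.com/keyboard.jpg",
    }
    print(extract_product(sample))
    print(extract_product({"title": "No price listed"}))
```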
Step 5. Save the data to a local txt file.
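# Append mode, so every product is added to the same file
with open('{}_products.txt'.format(keyword), 'a', encoding='utf8') as f:
    f.write('Product name: {}, price: {} yuan'.format(title, price) + '\n')
    f.write('Product image URL: {}'.format(imgurl) + '\n')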
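The snippet above reopens the file for every product. One possible alternative, sketched here, is to collect the records first and write them in a single pass. The function name `save_products` and the record keys title, price and image are assumptions of this sketch, not part of the original script.

```python
import os
import tempfile


def save_products(path, products):
    """Append two lines per product to a UTF-8 text file.

    The file is opened once for the whole batch rather than once per product.
    Returns the number of products written.
    """
    with open(path, "a", encoding="utf8") as f:
        for p in products:
            f.write("Product name: {}, price: {} yuan\n".format(p["title"], p["price"]))
            f.write("Product image URL: {}\n".format(p["image"]))
    return len(products)


if __name__ == "__main__":
    # Made-up record, written to the system temp directory.
    rows = [{"title": "Sample keyboard", "price": "199", "image": "https://example.com/a.jpg"}]
    out = os.path.join(tempfile.gettempdir(), "sample_products.txt")
    print(save_products(out, rows))
```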
Finally, here is the complete source code. It is original work, so if it helped you, please leave a like!
import json
from urllib.parse import quote

import requests


def get_weipin_info():
    keyword = input('Enter the product keyword to search for >>> ')
    pagenum = int(input('Enter the number of pages (120 products per page) >>> '))
    for i in range(pagenum):
        # Only keyword and pageOffset change between requests
        url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101108&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1628070214309_e7fbca2c43dda020cc7734c00466d49c&wap_consumer=a&standby_id=nature&keyword={}&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={}&channelId=1&gPlatform=PC&batchSize=120&_=1628070503449'.format(quote(keyword), 120 * i)
        headers = {
            'referer': 'https://category.vip.com/',
            'user-agent': 'Mozilla/5.0'
        }
        html = requests.get(url, headers=headers)
        # print(html.text)
        # Strip the JSONP wrapper: keep everything from '{"code"' up to the
        # closing parenthesis of the callback
        start = html.text.find('{"code"')
        json_data = json.loads(html.text[start:-1])['data']['products']
        # print(json_data)
        for data in json_data:
            pid = data['pid']
            # print(pid)
            # The pid goes into the productIds parameter
            stuff_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101108&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1628070214309_e7fbca2c43dda020cc7734c00466d49c&wap_consumer=a&productIds={}&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%2C%22ic2label%22%3A1%7D&context=&_=1628071156110'.format(pid)
            stuff_html = requests.get(stuff_url, headers=headers)
            # print(stuff_html.text)
            start = stuff_html.text.find('{"code"')
            end = stuff_html.text.find('"}}') + len('"}}')
            stuff_json_data = json.loads(stuff_html.text[start:end])['data']['products']
            # print(stuff_json_data)
            for stuff_data in stuff_json_data:
                title = stuff_data['title']
                price = stuff_data['price']['salePrice']
                imgurl = stuff_data['squareImage']
                print('Name: {}, price: {} yuan'.format(title, price))
                print(imgurl)
                # Append each product to the output file
                with open('{}_products.txt'.format(keyword), 'a', encoding='utf8') as f:
                    f.write('Product name: {}, price: {} yuan'.format(title, price) + '\n')
                    f.write('Product image URL: {}'.format(imgurl) + '\n')
    print('Finished scraping product information for {}'.format(keyword))


if __name__ == '__main__':
    get_weipin_info()
Result: