Python Crawler JS Reverse-Engineering Case Study (16): xx Product Reviews & Shop Details

2023-11-11

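A single run of the program collects all of the following:

1. Shop details;

2. Reviews of the current product;

3. The product's questions and their answers.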

The result looks like this: [screenshot]

The analysis proceeds in the following steps (everything below was done in the Chrome browser):

1. Capture traffic to find the relevant endpoints

Shop details: https://item-soa.jd.com/getWareBusiness?skuId=
Product reviews: https://club.jd.com/comment/skuProductPageComments.action
Product questions: https://question.jd.com/question/getQuestionAnswerList.action
Answers to a question: https://question.jd.com/question/getAnswerListById.action
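As a minimal sketch, the four endpoints can be kept in one place and turned into request URLs with the standard library. The dictionary keys, the `build_url` helper, and the sample `skuId` value are illustrative assumptions, not names from the original project.

```python
from urllib.parse import urlencode

# URLs are the ones listed in this article; the keys are illustrative names.
ENDPOINTS = {
    "shop_info": "https://item-soa.jd.com/getWareBusiness",
    "comments": "https://club.jd.com/comment/skuProductPageComments.action",
    "questions": "https://question.jd.com/question/getQuestionAnswerList.action",
    "answers": "https://question.jd.com/question/getAnswerListById.action",
}


def build_url(name, **query):
    """Return the endpoint URL for `name`, with `query` encoded as a query string."""
    base = ENDPOINTS[name]
    return base + "?" + urlencode(query) if query else base


# The skuId below is a made-up placeholder, not a real product.
print(build_url("shop_info", skuId="100012345678"))
```

With `requests`, the same effect is usually achieved by passing the query as `params=`, which is what the crawler code in this article does for the review and question endpoints.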

2. Put the global control parameters in a config file

[screenshot]
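Since the screenshot of the config file is not available, here is a hedged sketch of what such a configuration could look like. Only `random_start` and `random_end` are taken from the crawler code in this article; `max_page`, `product_ids`, the default values, and `load_config` are assumptions made for illustration.

```python
import json

# Hypothetical defaults: only random_start and random_end appear in the article's code.
DEFAULT_CONFIG = {
    "random_start": 2,   # shortest pause between requests, in seconds
    "random_end": 6,     # longest pause between requests, in seconds
    "max_page": 0,       # 0 or less means "no page limit"
    "product_ids": [],   # product (sku) IDs to crawl
}


def load_config(text):
    """Merge JSON text over the defaults and sanity-check the sleep range."""
    config = {**DEFAULT_CONFIG, **json.loads(text)}
    if config["random_start"] > config["random_end"]:
        raise ValueError("random_start must not exceed random_end")
    return config


# In practice the JSON text would be read from a file; the ID is a placeholder.
config = load_config('{"max_page": 3, "product_ids": ["100012345678"]}')
print(config["max_page"], config["random_start"], config["random_end"])
```

Keeping the sleep range and the page limit in one place means a run can be tuned without touching the crawler functions.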

3. Writing the crawler

3.1 Shop details

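# Imports used by the functions in section 3. Names such as headers, cookies,
# config, the *_base_url values, saveDataTool and logsTool are defined elsewhere
# in the project and are not shown in this article.
import math
import random
import time

import requests


def get_shop_info(p_id, max_page):
    """Save the shop details for product p_id, then crawl its reviews and Q&A."""
    q_url = shop_info_base_url + p_id
    res = requests.get(q_url, headers=headers, cookies=cookies).json()

    shop_info = res.get("shopInfo")
    if shop_info is None:
        return

    shop = shop_info.get("shop")
    if shop is None:
        # Shops without a shop rating; in testing these turned out to be banned shops
        return

    # wareInfo may be missing, so fall back to an empty dict
    shop['wname'] = (res.get('wareInfo') or {}).get("wname")
    saveDataTool.save_shop_info(p_id, shop)
    get_comments(p_id, max_page)
    get_questions(p_id, max_page)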
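The shop-details function has to guard against several levels of the response being absent. A small helper can make that pattern explicit. This is only a sketch: `dig` is a hypothetical helper, and the sample dictionaries are made-up shapes, not real API responses.

```python
def dig(data, *keys, default=None):
    """Walk nested dicts by key; return `default` if a level is missing, None, or not a dict."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return default if current is None else current


# Made-up response shapes, for illustration only
ok = {"shopInfo": {"shop": {"name": "demo"}}, "wareInfo": {"wname": "demo product"}}
banned = {"shopInfo": {"shop": None}}

print(dig(ok, "wareInfo", "wname"))
print(dig(banned, "shopInfo", "shop"))
print(dig({}, "shopInfo", "shop"))
```

With such a helper, the two early returns collapse into a single check on `dig(res, "shopInfo", "shop")`.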

3.2 Product reviews

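def get_comments(p_id, max_page):
    """Crawl the review pages of product p_id; max_page <= 0 means no page limit."""
    print('Ready--fetching reviews for product ID <%s>...' % (p_id))
    index = 0  # 0-based page index; shown to the user as index + 1
    # params is shared between calls, so reset the page for every product
    params.update({'productId': p_id, 'page': index})
    while True:
        try:
            print('Ready--fetching page %d...' % (index + 1))
            res = requests.get(comment_base_url,
                               headers=headers,
                               cookies=cookies,
                               params=params).json()

            maxPage = res.get('maxPage') or 0

            if len(res.get("comments") or []) > 0:
                saveDataTool.save_comment(p_id, res)
            else:
                print("Done--no more reviews; presumably this was the last page...")
                break

            # Once the configured page limit is reached, stop crawling this
            # product's reviews whether or not more pages exist
            if max_page > 0 and max_page <= index + 1:
                print("Stopped normally--page %d of <%s> hit the configured page limit..." % (index + 1, p_id))
                break
            if index >= maxPage - 1:
                print("Done--page %d of <%s> is the last page..." % (index + 1, p_id))
                break

            index += 1
            params.update({'page': index})
            sleep = random.randint(config['random_start'],
                                   config['random_end'])
            print('Sleeping for %d seconds...' % (sleep))
            time.sleep(sleep)  # pause a random few seconds before the next request to avoid being banned

        except Exception as err:
            logsTool.log_to_save(p_id, index, err)
            break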
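The three stop conditions in the review loop (empty page, configured page limit, last page reported by the API) can be isolated in a pure function, which makes them easy to test without any network access. This is a sketch under the assumption that pages are 0-based, as in the loop above; `next_review_page` and its parameter names are hypothetical.

```python
def next_review_page(index, max_page, api_max_page, has_comments):
    """Return the next 0-based page index, or None when crawling should stop.

    index        -- 0-based index of the page just fetched
    max_page     -- configured page limit; 0 or less means unlimited
    api_max_page -- the maxPage value reported by the response
    has_comments -- whether the page just fetched contained any reviews
    """
    if not has_comments:
        return None  # empty page: nothing more to fetch
    if max_page > 0 and index + 1 >= max_page:
        return None  # configured limit reached
    if index >= api_max_page - 1:
        return None  # the API says this was the last page
    return index + 1


print(next_review_page(0, 0, 5, True))
print(next_review_page(4, 0, 5, True))
print(next_review_page(1, 2, 5, True))
```

The loop body then reduces to fetching, saving, and asking this function whether to continue.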

3.3 Product questions

def get_questions(p_id, max_page):
    """Crawl the question list of product p_id; max_page <= 0 means no page limit."""
    print('Ready--fetching the question list for product ID <%s>...' % (p_id))
    index = 1
    # quesiton_params is shared between calls, so reset the page for every product
    quesiton_params.update({'productId': p_id, 'page': index})
    while True:
        try:
            print('Ready--fetching question list page %d...' % (index))
            res = requests.get(question_base_url,
                               headers=headers,
                               cookies=cookies,
                               params=quesiton_params).json()

            # The division by 10 assumes 10 questions per page
            totalPage = math.ceil(res.get("totalItem", 0) / 10)
            questionList = res.get("questionList") or []

            if len(questionList) > 0:
                saveDataTool.save_question(p_id, res)
                for question in questionList:
                    answerCount = question.get("answerCount", 0)
                    tempId = question.get('id')
                    if 0 < answerCount <= 2:
                        # One or two answers: save them directly from this response
                        saveDataTool.save_answer(
                            p_id, tempId, question.get('answerList', []))
                    elif answerCount > 2:
                        # More than two answers: page through the answer endpoint
                        get_answer(p_id, tempId, max_page)

            if index >= totalPage:
                print("Done--question list page %d of <%s> is the last page..." % (index, p_id))
                break

            # Once the configured page limit is reached, stop crawling this
            # product's questions whether or not more pages exist
            if max_page > 0 and max_page <= index:
                print("Stopped normally--question list page %d of <%s> hit the configured page limit..." % (index, p_id))
                break

            index += 1
            quesiton_params.update({
                'page': index,
            })
            sleep = random.randint(config['random_start'],
                                   config['random_end'])
            print('Sleeping for %d seconds...' % (sleep))
            time.sleep(sleep)  # pause a random few seconds before the next request to avoid being banned

        except Exception as err:
            logsTool.log_to_save(p_id, index, err)
            break
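Two small decisions are embedded in the question loop: how many pages exist, and what to do with each question's answers. The sketch below pulls them out as pure functions. The page size of 10 and the threshold of 2 are read off the code above; the function names and the `"skip"`, `"inline"`, `"fetch"` labels are illustrative assumptions.

```python
import math

PAGE_SIZE = 10           # questions per page, implied by the division by 10
INLINE_ANSWER_LIMIT = 2  # up to this many answers are saved from the list response


def total_pages(total_items, page_size=PAGE_SIZE):
    """Number of pages needed to show total_items entries."""
    return math.ceil(total_items / page_size)


def answer_plan(question):
    """Decide how to handle one question's answers.

    "skip"   -- no answers
    "inline" -- save the answerList that came with the question
    "fetch"  -- page through the separate answer endpoint
    """
    count = question.get("answerCount", 0)
    if count <= 0:
        return "skip"
    if count <= INLINE_ANSWER_LIMIT:
        return "inline"
    return "fetch"


print(total_pages(23))
print(answer_plan({"id": 1, "answerCount": 2}))
print(answer_plan({"id": 2, "answerCount": 7}))
```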

3.4 Answers to a question

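# Answers to one question; p_id: product ID, q_id: question ID
def get_answer(p_id, q_id, max_page):
    """Crawl the answer pages of question q_id; max_page <= 0 means no page limit."""
    print('Ready--fetching the answer list for question ID <%s>...' % (q_id))
    index = 1
    # answer_params is shared between calls, so reset the page for every question
    answer_params.update({'questionId': q_id, 'page': index})
    while True:
        try:
            print('Ready--fetching answer page %d...' % (index))
            res = requests.get(answer_base_url,
                               headers=headers,
                               cookies=cookies,
                               params=answer_params).json()

            answers = res.get("answers") or []
            if len(answers) > 0:
                saveDataTool.save_answer(p_id, q_id, answers)

            # moreCount is assumed to be the number of answers not yet returned
            if len(answers) == 0 or res.get("moreCount", 0) <= 0:
                print("Done--answer page %d of <%s> is the last page..." % (index, q_id))
                break

            # Once the configured page limit is reached, stop crawling this
            # question's answers whether or not more pages exist
            if max_page > 0 and max_page <= index:
                print("Stopped normally--answer page %d of <%s> hit the configured page limit..." % (index, q_id))
                break

            index += 1
            answer_params.update({
                'page': index,
            })
            sleep = random.randint(config['random_start'],
                                   config['random_end'])
            print('Sleeping for %d seconds...' % (sleep))
            time.sleep(sleep)  # pause a random few seconds before the next request to avoid being banned

        except Exception as err:
            logsTool.log_to_save(q_id, index, err)
            break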
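The review, question, and answer loops share the same skeleton: fetch a page, process it, decide whether to continue, sleep, repeat. One possible way to factor that out is a generator that takes the fetch and stop logic as callables. This is a sketch, not the project's code: `paginate`, its parameters, and the fake three-page data are all assumptions, and the fake data stands in for real HTTP calls.

```python
def paginate(fetch_page, has_more, max_page=0, first_page=1, pause=None):
    """Yield (page_number, response) pairs until has_more() is false or max_page pages were fetched.

    fetch_page -- callable taking a page number and returning the parsed response
    has_more   -- callable taking a response and returning True if more pages exist
    max_page   -- page limit; 0 or less means unlimited
    pause      -- optional callable invoked between requests (e.g. a random sleep)
    """
    page = first_page
    fetched = 0
    while True:
        response = fetch_page(page)
        fetched += 1
        yield page, response
        if not has_more(response):
            break
        if max_page > 0 and fetched >= max_page:
            break
        page += 1
        if pause is not None:
            pause()


# Fake three-page endpoint, standing in for a real HTTP call
FAKE = {
    1: {"answers": ["a", "b"], "moreCount": 3},
    2: {"answers": ["c", "d"], "moreCount": 1},
    3: {"answers": ["e"], "moreCount": 0},
}

pages = list(paginate(FAKE.get, lambda r: r["moreCount"] > 0))
limited = list(paginate(FAKE.get, lambda r: r["moreCount"] > 0, max_page=2))
print([number for number, _ in pages])
print([number for number, _ in limited])
```

In a real run, `pause` could be something like `lambda: time.sleep(random.randint(config['random_start'], config['random_end']))`, matching the random delay used throughout section 3.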

4. CSV output

[screenshot]
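The article does not show how `saveDataTool` writes its files, so the following is only a sketch of one way to append crawled rows to a CSV file with the standard library. `append_rows`, the column names, and the sample rows are assumptions made for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys that are not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory; the rows are made-up sample data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "comments.csv")
    columns = ["product_id", "content", "score"]
    append_rows(path, columns, [{"product_id": "1", "content": "good", "score": 5}])
    append_rows(path, columns, [{"product_id": "1", "content": "ok", "score": 3, "extra": "x"}])
    with open(path, encoding="utf-8", newline="") as f:
        print(f.read())
```

Appending page by page means that an interrupted run still leaves the pages crawled so far on disk.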

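Summary: the whole program runs with a single command, and when an error interrupts a crawl, the error is saved automatically to a log file for later analysis. Multithreading has not been added yet, so collecting a large amount of data in a single thread may take a long time. This case study is purely for my own learning and practice. The source code has been synced to my Knowledge Planet (知识星球) group.

Coming next: implementing the same functionality with the Scrapy crawler framework.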
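The crawler calls `logsTool.log_to_save(some_id, index, err)` whenever a request fails, but its implementation is not shown. A minimal sketch of such a helper, built on the standard `logging` module, could look like the following. The factory function, the log format, and the file name are assumptions; only the three-argument call shape comes from the code in section 3.

```python
import logging
import os
import tempfile


def make_error_logger(log_path):
    """Return a log_to_save(item_id, page, err) function that appends to log_path."""
    logger = logging.getLogger("crawler.errors." + log_path)
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep crawl errors out of the root logger's output
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)

    def log_to_save(item_id, page, err):
        # Record which item and page failed, plus the exception type and message
        logger.error("id=%s page=%s %s: %s",
                     item_id, page, type(err).__name__, err)

    return log_to_save


# Demo: log one failure into a temporary directory (the ID is a placeholder)
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
log_to_save = make_error_logger(log_path)
try:
    int("not a number")
except Exception as err:
    log_to_save("100012345678", 3, err)

with open(log_path, encoding="utf-8") as f:
    print(f.read())
```

Recording the ID and the page index alongside the exception makes it possible to resume a crawl from the point where it stopped.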

