Failed to load resource: the server responded with a status of 429 (Too Many Requests) and 404 (Not Found) with ChromeDriver Chrome through Selenium

2024-02-16

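I am trying to build a scraper using Selenium in Python. The Selenium WebDriver opens a window and starts loading the page, but then suddenly stops. I can access the same link in my local Chrome browser.

Here are the error logs I get from the WebDriver: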

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637}

{'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338}

{'level': 'SEVERE', 'message': 'https://shop.coles.com.au/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1556997748339}

My script:

from selenium import webdriver
import os

path = os.path.join(os.getcwd(), 'chromedriver')
driver = webdriver.Chrome(executable_path=path)

links = [
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/baby-accessories?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/food?pageNumber=1",
    "https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/formula?pageNumber=1",
]


for link in links:
    driver.get(link)

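429 Too Many Requests

The HTTP 429 Too Many Requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429) response status code indicates that the user has sent too many requests in a given amount of time ("rate limiting"). The response representation should include details explaining the condition, and may include a Retry-After header indicating how long to wait before making a new request.

When a server is under attack, or is simply receiving a very large number of requests from a single party, responding to each of them with a 429 status code consumes resources. Servers are therefore not required to use the 429 status code; when limiting resource usage, it may be more appropriate to just drop connections or take other steps.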
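As an illustration of the Retry-After header mentioned above, here is a minimal sketch that turns a header value into a number of seconds to wait. The function name retry_after_seconds and the 30-second fallback are assumptions made for this example, not part of the original post. Selenium's WebDriver API does not expose response headers directly, so how the header value is obtained is deliberately left out.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=30.0):
    """Translate a Retry-After header value into seconds to wait.

    `default` is an arbitrary fallback chosen for this sketch; it is used
    when the header is missing or cannot be parsed.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: a delay in whole seconds, e.g. "120".
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        # Treat a date without an explicit zone as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (retry_at - now).total_seconds())
```

A caller would typically pass the result to time.sleep() before retrying. Note that, as the analysis further down argues, waiting alone may not help when the block comes from bot detection instead of plain rate limiting.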

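404 Not Found

The HTTP 404 Not Found (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404) client error response code indicates that the server cannot find the requested resource. In a browser, this means the URL is not recognized. In an API, it can also mean that the endpoint is valid but the resource itself does not exist. Servers may also send this response instead of 403 to hide the existence of a resource from an unauthorized client. Because it occurs so frequently on the web, it is probably the best-known response code.

A 404 status code does not indicate whether the resource is missing temporarily or permanently. If a resource has been permanently removed, 410 (Gone) should be used instead of 404. In addition, the 404 status code is used whenever the requested resource is not found, whether it does not exist or whether the response would have been a 401 or 403 that the service wants to mask for security reasons.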
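To see at a glance which requests failed with which status, log entries shaped like the ones in the question can be filtered by status code. The following is a rough sketch; the helper name failed_requests and the regular expression are assumptions for this example, and it relies only on the dictionary layout shown in the question's log output.

```python
import re

# Assumption: failed network requests are logged with the phrase
# "status of <code>", as in the entries quoted in the question.
STATUS_RE = re.compile(r"status of (\d{3})")


def failed_requests(log_entries):
    """Return (status, url) pairs for network entries that carry an HTTP status."""
    results = []
    for entry in log_entries:
        if entry.get("source") != "network":
            continue
        message = entry.get("message", "")
        match = STATUS_RE.search(message)
        if match:
            # The URL is the part of the message before the first " - ".
            url = message.split(" - ", 1)[0]
            results.append((int(match.group(1)), url))
    return results


# Two entries copied from the question; the about:blank entry has no
# HTTP status and is therefore skipped.
sample_entries = [
    {'level': 'SEVERE', 'message': 'https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1 - Failed to load resource: the server responded with a status of 429 (Too Many Requests)', 'source': 'network', 'timestamp': 1556997743637},
    {'level': 'SEVERE', 'message': 'about:blank - Failed to load resource: net::ERR_UNKNOWN_URL_SCHEME', 'source': 'network', 'timestamp': 1556997745338},
]

for status, url in failed_requests(sample_entries):
    print(status, url)
```

With a live session, the entries would presumably come from driver.get_log("browser"); the question does not say how its logs were collected, so that part is an assumption.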

Analysis

When I tried your code block, I got a similar result. If you inspect the DOM Tree (https://javascript.info/dom-nodes) of the webpage (https://shop.coles.com.au/a/a-nsw-metro-rouse-hill/everything/browse/baby/nappies-changing?pageNumber=1), you will find that quite a few tags contain the keyword dist. For example:

<link rel="shortcut icon" type="image/x-icon" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">
<link rel="stylesheet" href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/css/google/fonts-Source-Sans-Pro.css" type="text/css" media="screen">
'appDir': '/wcsstore/ColesResponsiveStorefrontAssetStore/dist/30e70cfc76bf73d384beffa80ba6cbee/app'

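The presence of the term dist is a clear indication that the website is protected by the bot management service provider Distil Networks (https://www.distilnetworks.com/), and that the navigation by ChromeDriver is detected and subsequently blocked.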
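The check described above can be repeated on any page source with a few lines of Python. This is only a sketch of the heuristic used in this answer: the function name dist_asset_paths and the regular expression are assumptions made for illustration, and a match is at best a hint, not proof of which protection service a site uses.

```python
import re

# Assumption: the paths of interest are quoted, start with "/" and contain
# a "/dist/" segment, as in the three snippets quoted above.
DIST_PATH_RE = re.compile(r"""["'](/[^"']*/dist/[^"']+)["']""")


def dist_asset_paths(html):
    """Return every quoted absolute path in `html` that has a /dist/ segment."""
    return DIST_PATH_RE.findall(html)


# One of the tags quoted above, used as sample input.
sample_html = (
    '<link rel="shortcut icon" type="image/x-icon" '
    'href="/wcsstore/ColesResponsiveStorefrontAssetStore/dist/'
    '30e70cfc76bf73d384beffa80ba6cbee/img/favicon.ico">'
)

for path in dist_asset_paths(sample_html):
    print(path)
```

In a Selenium session the input would be driver.page_source, fetched after driver.get(link) returns.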

Distil

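As per the article There Really Is Something About Distil.it... (https://www.forbes.com/sites/timconneally/2013/01/28/theres-something-about-distil-it/#6e1881e438b9):

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.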

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".

Reference

You can find a couple of detailed discussions in:

Unable to use Selenium to automate Chase site login (https://stackoverflow.com/questions/53605757/unable-to-use-selenium-to-automate-chase-site-login/54284776#54284776)
Chrome browser initiated through ChromeDriver gets detected (https://stackoverflow.com/questions/52832413/chrome-browser-initiated-through-chromedriver-gets-detected/52833487#52833487)
