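I've written a script in Python that uses proxies while sending requests to some links in order to parse the product name from each page. My current attempt does the job flawlessly. However, the function parse_product() depends entirely on its own returned result (the proxy) in order to reuse the same proxy correctly. I'm trying to modify parse_product() so that it no longer depends on the result of a previous call to itself, while still reusing a working proxy until it becomes invalid. To be clearer, I'd like the main block to look like the one below. Once that is solved, I plan to use multiprocessing to make the script run faster: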
if __name__ == '__main__':
    for url in linklist:
        parse_product(url)
Even so, I still expect the script to behave exactly as it does now.
What I've tried (a working version):
import random
import requests
from random import choice
from bs4 import BeautifulSoup
linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]
proxyVault = [
    '103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632',
    '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128',
    '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312',
    '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251',
    '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080',
    '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243'
]
def process_proxy(proxy):
    global proxyVault
    if not proxy:
        # No proxy yet: pick one at random.
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    else:
        # The given proxy failed: remove it from the vault and pick another.
        proxy_pattern = proxy.get("https").split("//")[-1]
        if proxy_pattern in proxyVault:
            proxyVault.remove(proxy_pattern)
        random.shuffle(proxyVault)
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    return proxy
def parse_product(link, proxy):
    try:
        # Without a proxy, jump straight to the except branch to get one.
        if not proxy:
            raise ValueError("no proxy selected yet")
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return proxy, product_name
    except Exception:
        # When this branch is hit, process_proxy(proxy) removes the bad proxy
        # and produces a new one, and the request is retried with it.
        proxy_link = process_proxy(proxy)
        return parse_product(link, proxy_link)
if __name__ == '__main__':
    proxy = None
    for url in linklist:
        result = parse_product(url, proxy)
        proxy = result[0]
        print(result)
Note: the parse_product() function returns a proxy and a product name. The proxy it returns is then passed back into parse_product() and reused until it becomes invalid.
By the way, the proxies used in proxyVault are just placeholders.
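To make the goal concrete, here is a minimal sketch of the kind of structure I have in mind, not a finished solution. The state "which proxy currently works" moves out of the return value and into a small holder object, so a call no longer needs the result of the previous call. ProxyPool, its methods, and the fetch callable are hypothetical names made up for illustration. In the real script, fetch would wrap the requests.get call and the BeautifulSoup lookup. The addresses are placeholders.

```python
import random


class ProxyPool:
    """Remembers the last working proxy so callers need not pass it around."""

    def __init__(self, addresses):
        self.addresses = list(addresses)  # candidate "host:port" strings
        self.current = None               # address currently believed to work

    def get(self):
        # Reuse the current proxy if there is one, otherwise pick a new one.
        if self.current is None:
            if not self.addresses:
                raise RuntimeError("no proxies left to try")
            self.current = random.choice(self.addresses)
        return self.current

    def discard(self, address):
        # Drop a proxy that failed; the next get() picks a replacement.
        if address in self.addresses:
            self.addresses.remove(address)
        if self.current == address:
            self.current = None


def parse_product(link, pool, fetch):
    """Fetch `link` through the pool's proxy, rotating until one works.

    `fetch(link, proxies)` is a hypothetical callable that returns the
    product name or raises an exception when the proxy is unusable.
    """
    while True:
        address = pool.get()  # raises RuntimeError once the pool is empty
        try:
            return fetch(link, {"https": f"http://{address}"})
        except Exception:
            pool.discard(address)


def fake_fetch(link, proxies):
    # Stand-in for the real request: only one placeholder proxy "works".
    if proxies["https"] != "http://10.0.0.2:8080":
        raise ConnectionError("bad proxy")
    return f"product from {link}"


pool = ProxyPool(["10.0.0.1:8080", "10.0.0.2:8080"])
for url in ["https://example.com/a", "https://example.com/b"]:
    print(parse_product(url, pool, fake_fetch))
```

Binding pool and fetch once (for example with functools.partial, or by making the pool a module-level object) would give the one-argument parse_product(url) call shown at the top. One caveat I'm aware of: with multiprocessing, each worker process would hold its own copy of such a pool, so a proxy found to be good in one process would not automatically be known to the others.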