I found a very similar question and used the second option from its accepted answer to develop a workaround for this problem, since it is not supported out of the box in scrapy.
I created a function that takes a url as input and creates the rules for it:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

def rules_for_url(self, url):
    # Build one rule set for this url's domain: follow internal links
    # with parse_internal, and visit (but do not follow) external links
    # with parse_external.
    domain = Tools.get_domain(url)
    rules = (
        Rule(LinkExtractor(allow_domains=(domain,), deny_domains=()),
             callback='parse_internal', follow=True),
        Rule(LinkExtractor(allow_domains=(), deny_domains=(domain,)),
             callback='parse_external', follow=False),
    )
    return rules
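Tools.get_domain is a small helper of my own, not part of scrapy; a minimal sketch of what it might look like, built on the standard library's urlparse:

```python
from urllib.parse import urlparse

def get_domain(url):
    # Reduce a full url to its domain part, e.g.
    # 'https://example.com/some/page' -> 'example.com'
    return urlparse(url).netloc
```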
Then I overrode some of CrawlSpider's functions.
I changed _rules into a dictionary whose keys are the different website domains and whose values are the rules for that domain (built with rules_for_url). _rules is populated in _compile_rules.
I then made the appropriate changes in _requests_to_follow and _response_downloaded to support the new way of using _rules.
import copy

import six
from scrapy.http import HtmlResponse

_rules = {}

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    domain = Tools.get_domain(response.url)
    # Use only the rules compiled for this response's domain.
    for n, rule in enumerate(self._rules[domain]):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            # Encode both the domain and the rule index, so that
            # _response_downloaded can look the rule up again.
            r = self._build_request(domain + ';' + str(n), link)
            yield rule.process_request(r)

def _response_downloaded(self, response):
    # Split the 'domain;rule_index' identifier stored by _requests_to_follow.
    meta_rule = response.meta['rule'].split(';')
    domain = meta_rule[0]
    rule_n = int(meta_rule[1])
    rule = self._rules[domain][rule_n]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    # Build a separate rule set for every start url, keyed by its domain.
    for url in self.start_urls:
        url_rules = self.rules_for_url(url)
        domain = Tools.get_domain(url)
        self._rules[domain] = [copy.copy(r) for r in url_rules]
        for rule in self._rules[domain]:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
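The 'domain;index' string that _requests_to_follow passes to _build_request, and that _response_downloaded splits apart again, is just a round trip. As a standalone sketch (the helper names here are illustrative, not part of the spider):

```python
def encode_rule_id(domain, n):
    # Pack the domain and the rule's index into the single string that
    # ends up in request.meta['rule'].
    return domain + ';' + str(n)

def decode_rule_id(rule_id):
    # Reverse the encoding: recover the domain and the rule index.
    domain, n = rule_id.split(';')
    return domain, int(n)
```

Since a domain name cannot contain ';', splitting on it is safe.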
See the original functions here.
Now the spider will simply go over each url in start_urls and create a set of rules specific to that url, then use the appropriate rules for each website it is scraping.
I hope this helps anyone who runs into this problem in the future.
Simon.