Hi, I'm creating a CrawlController inside a for-loop because I have more than 100 URLs. I put them all in a list, iterate over it, and crawl each page. I also pass the current URL to setCustomData so that the crawler should not leave that domain.
for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
    String str = iterator.next();
    System.out.println("checking " + str);
    CrawlController controller = new CrawlController(config, pageFetcher,
            robotstxtServer);
    controller.setCustomData(str);
    controller.addSeed(str);
    controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
    controller.waitUntilFinish();
}
But when I run the code above, the first URL crawls perfectly; then, once the second URL starts, it prints errors like the following:
50982 [main] INFO edu.uci.ics.crawler4j.crawler.CrawlController - Crawler 1 started.
51982 [Crawler 1] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager - Connection request: [route: {}->http://www.connectzone.in][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 100]
60985 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController - It looks like no thread is working, waiting for 10 seconds to make sure...
70986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController - No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
80986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController - All of the crawlers are stopped. Finishing the process...
80987 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController - Waiting for 10 seconds before final clean up...
91050 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager - Connection manager is shutting down
91051 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager - Connection manager shut down
Please help me find a solution to the above. I'm creating and starting the controller inside the loop because I have many URLs in a list.
Note: **I am using** crawler4j-3.5.jar and its dependencies.
Try adding all the seeds to a single controller instead of creating one per URL:

for (String url : urls) {
    controller.addSeed(url);
}

and override shouldVisit(WebURL) so the crawler can't leave the domain.
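A minimal sketch of the host check such a shouldVisit override would perform. This is plain Java using java.net.URI, not the crawler4j API itself; the class name DomainFilter and the field allowedDomain are illustrative. In a real BasicCrawler subclass you would run the same comparison against url.getURL() inside shouldVisit(Page, WebURL).

```java
import java.net.URI;

// Illustrative domain filter: the same logic a crawler4j
// shouldVisit(Page, WebURL) override would use to keep the
// crawl on one site. Class and field names are assumptions,
// not crawler4j API.
public class DomainFilter {
    private final String allowedDomain;

    public DomainFilter(String allowedDomain) {
        this.allowedDomain = allowedDomain.toLowerCase();
    }

    // Return true only for URLs whose host is the allowed domain
    // or one of its subdomains, so the crawl never leaves the site.
    public boolean shouldVisit(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase();
            return host.equals(allowedDomain)
                    || host.endsWith("." + allowedDomain);
        } catch (IllegalArgumentException e) {
            // Malformed URL: skip it rather than crash the crawler.
            return false;
        }
    }
}
```

With all 100+ seeds added to one controller, this check replaces the per-URL setCustomData workaround: every fetched link is tested against the seed's domain before it is queued.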