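I'm trying to scrape this website using Selenium.
My code works, but at the moment it only scrapes the first page. The site uses input buttons to move between result pages, so I tried clicking each button in turn, but that doesn't work. Does anyone have another way to handle navigation for this kind of pagination?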
import requests
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

options = Options()
# options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'/Users/liban/Downloads/chromedriver')
url = 'http://www.boston.gov.uk/index.aspx?articleid=6207&ShowAdvancedSearch=true'
driver.get(url)


def get_Data():
    data = []
    divs = driver.find_element_by_xpath('//*[@id="content"]/form').find_elements_by_tag_name('div')
    for div in divs:
        app_number = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/div[1]/h4/a').text
        address = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/div[1]/p[5]').text
        status = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/div[1]/p[1]/strong').text
        link = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/div[1]/h4/a').get_attribute("href")
        proposals = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/div[1]/p[3]').text
        data.append({"caseRef": app_number, "propDesc": proposals, "address": address, "caseUrl": link, "status": status})
    print(data)
    return data


def navigation():
    data = []
    total_inputs = driver.find_element_by_xpath('//div[ contains( concat( " ", normalize-space( @class ), " "), " grid_13 ") ]/form/table/tbody/tr/td/input')
    for input in total_inputs:
        input.click()
        data.extend(get_Data())


def main():
    all_data = []
    select = Select(driver.find_element_by_xpath('//*[@id="DatePresets"]'))
    select.select_by_index(7)
    search_by = driver.find_element_by_xpath('//*[@id="radio-ReceivedDate"]')
    search_by.click()
    show = Select(driver.find_element_by_xpath('//*[@id="ResultSize"]'))
    show.select_by_index(4)
    search_button = driver.find_element_by_xpath('//*[@id="content"]/form/input[3]')
    search_button.click()
    all_data.extend(navigation())


if __name__ == "__main__":
    main()
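One detail in get_Data(): every XPath starts from the document root and ends in div[1], so each pass of the loop reads the first result again. Below is a minimal sketch of reading the fields relative to each result div instead. It assumes the result divs are direct children of the form and that the h4/a, p[1]/strong, p[3] and p[5] layout from the XPaths above holds for every result. extract_row and get_rows are hypothetical names, and "xpath" is the string value behind selenium's By.XPATH.

```python
def extract_row(div):
    """Read one search result from its own <div> using XPaths relative to it."""
    link = div.find_element("xpath", "./h4/a")
    return {
        "caseRef": link.text,
        "propDesc": div.find_element("xpath", "./p[3]").text,
        "address": div.find_element("xpath", "./p[5]").text,
        "caseUrl": link.get_attribute("href"),
        "status": div.find_element("xpath", "./p[1]/strong").text,
    }


def get_rows(driver):
    """Return one dict per <div> found directly under the results form.

    Assumption: every such <div> is a search result; a <div> without the
    expected children would make find_element raise.
    """
    divs = driver.find_elements("xpath", '//*[@id="content"]/form/div')
    return [extract_row(div) for div in divs]
```

The leading "./" is what ties each lookup to the div being processed rather than to the whole page.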
How the website handles pagination:
<td align="center">
<input type="submit" class="pageNumberButton selected" name="searchResults_Page" value="1" disabled="disabled"/>
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="2" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="3" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="4" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="5" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="6" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="7" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="8" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="9" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="10" />
</td>
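To make the markup concrete, here is a small standard-library sketch that lists which page numbers are still clickable: the current page's button is disabled, the others are not. PAGINATION_HTML is an abridged copy of the markup above (three buttons only), and PageButtonParser is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Abridged copy of the pagination markup (first three buttons only).
PAGINATION_HTML = """
<td align="center">
<input type="submit" class="pageNumberButton selected" name="searchResults_Page" value="1" disabled="disabled"/>
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="2" />
<input type="submit" class="pageNumberButton " name="searchResults_Page" value="3" />
</td>
"""


class PageButtonParser(HTMLParser):
    """Collect the values of page buttons that are not disabled."""

    def __init__(self):
        super().__init__()
        self.clickable = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "input" and "pageNumberButton" in classes and "disabled" not in attrs:
            self.clickable.append(int(attrs["value"]))


parser = PageButtonParser()
parser.feed(PAGINATION_HTML)
print(parser.clickable)  # [2, 3]
```

Each button is a submit input whose value is the page number, so a page can be addressed by that value rather than by its position in the table.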
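Manual steps:
- Select date preset = "Last month"
- Search by = "Both dates"
- Click Search
- After scraping each page, go to the next page, and so on until there are no more pages, then return to the original URL.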
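The last manual step could be sketched as below. A likely reason the loop in navigation() fails is that clicking a submit button reloads the page, so element references collected beforehand go stale; in addition, find_element returns a single element rather than a list. This is a sketch under assumptions, not something tested against the site: it assumes the button for page N keeps value="N", that the current page's button carries the selected class as in the markup above, and that no extra "next" control appears beyond the numbered buttons. scrape_all_pages, _wait_until_selected and the two selector constants are hypothetical names, and "css selector" is the string value behind selenium's By.CSS_SELECTOR.

```python
import time

PAGE_BUTTON = 'input.pageNumberButton[value="{}"]'
SELECTED_BUTTON = "input.pageNumberButton.selected"


def scrape_all_pages(driver, scrape_page, timeout=10.0):
    """Call scrape_page() on every results page and return the combined rows.

    driver is a Selenium WebDriver; scrape_page is any callable (for example
    get_Data) that returns a list of rows for the page currently shown.
    """
    rows = []
    page = 1
    while True:
        rows.extend(scrape_page())
        page += 1
        # Look the button up again on every pass: a reference found before
        # the previous click belongs to a page that has since been replaced.
        buttons = driver.find_elements("css selector", PAGE_BUTTON.format(page))
        if not buttons:
            return rows  # no button for the next number: last page reached
        buttons[0].click()
        _wait_until_selected(driver, page, timeout)


def _wait_until_selected(driver, page, timeout):
    """Poll until the button marked "selected" carries the expected number."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            selected = driver.find_elements("css selector", SELECTED_BUTTON)
            if selected and selected[0].get_attribute("value") == str(page):
                return
        except Exception:
            # The old page can vanish mid-lookup (selenium raises a stale
            # element error in that case); just poll again.
            pass
        time.sleep(0.2)
    raise TimeoutError(f"page {page} was not shown within {timeout} seconds")
```

With the driver from the script above, `all_data = scrape_all_pages(driver, get_Data)` would take the place of the call to navigation(), and `driver.get(url)` afterwards returns to the original URL.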