我有一个page https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx有一个表(表 id=“ctl00_ContentPlaceHolder_ctl00_ctl00_GV”class=“GridListings”)我需要抓取。
我通常使用 BeautifulSoup 和 urllib,但在这种情况下,问题是该表需要一些时间来加载,所以当我尝试使用 BS 获取它时,它不会被捕获。
由于一些安装问题,我无法使用 PyQt4、drysracpe 或 Windmill,所以唯一可能的方法是使用 Selenium/PhantomJS
我尝试了以下操作,仍然没有成功:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located(By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV'))
上面的代码没有给我所需的表内容。
我该如何实现这一目标???
您可以使用获取数据requests and bs4,,几乎(如果不是所有)ASP 网站都需要提供一些后置参数,例如__活动目标, __事件验证 etc.. :
from bs4 import BeautifulSoup
import requests
data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
"__EVENTARGUMENT": "LISTINGS;0",
"ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
"__ASYNCPOST": "true"}
对于实际的帖子,我们需要添加更多值来发布数据:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
soup = BeautifulSoup(s.get(post).content)
data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
r = s.post(post, data=data)
soup2 = BeautifulSoup(r.content)
table = soup2.select_one("div.GridListings")
print(table)
运行代码时您将看到打印的表格。
本文内容由网友自发贡献,版权归原作者所有,本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容,请联系:hwhale#tublm.com(使用前将#替换为@)