Scrapy with Selenium only crawls the first page instead of multiple pages
I am currently scraping data that is generated by JavaScript on a website, so I am using Scrapy together with Selenium. However, the spider only crawls and scrapes data from the first page. Can anyone help me with this? Below is the code I wrote. Thanks in advance.
import scrapy
from scrapy.http import Request
import time
from selenium import webdriver


class w01item(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    underlying_bid = scrapy.Field()
    bid = scrapy.Field()


class mqSpider(scrapy.Spider):
    name = "w11"
    allowed_domains = ["kswarrants.kasikornsecurities.com"]
    start_urls = ["http://kswarrants.kasikornsecurities.com/www/Tool/calculator"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in Firefox, set the disclaimer cookie, then reload.
        self.driver.get(response.url)
        self.driver.add_cookie({'name': 'Disc', 'value': 'YES', 'path': '/'})
        self.driver.get("http://kswarrants.kasikornsecurities.com/www/Tool/calculator")
        options = self.driver.find_elements_by_xpath('//select[@id="underling0"]/option')
        # Submit the form once for each of the options 1 to 3.
        for option in options[1:4]:
            a = option.text
            textbox = self.driver.find_element_by_id("calid")
            textbox.send_keys(option.text)
            time.sleep(1)
            self.driver.find_element_by_id("btn_sub").click()
            time.sleep(2)
            # Read the first two result rows.
            for x in xrange(1, 3):
                item = w01item()
                item['title'] = a
                item['date'] = self.driver.find_element_by_id('d_1').text
                item['underlying_bid'] = self.driver.find_element_by_id('d_' + str(x) + '_1').text
                item['bid'] = self.driver.find_element_by_id('d_' + str(x) + '_2').text
                yield item
            self.driver.find_element_by_id("calid").clear()
The log from running the script is below.
2016-03-21 23:14:56 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-21 23:14:56 [scrapy] INFO: Optional features available: ssl, http11
2016-03-21 23:14:56 [scrapy] INFO: Overridden settings: {}
2016-03-21 23:14:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session {"desiredCapabilities": {"platform": "ANY", "br
: "firefox", "version": "", "marionette": false, "javascriptEnabled": true}}
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
leware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-21 23:15:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-21 23:15:02 [scrapy] INFO: Enabled item pipelines:
2016-03-21 23:15:02 [scrapy] INFO: Spider opened
2016-03-21 23:15:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-21 23:15:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-21 23:15:02 [scrapy] DEBUG: Redirecting (302) to <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> from <GET http://kswarrant
securities.com/www/Tool/calculator>
2016-03-21 23:15:02 [scrapy] DEBUG: Crawled (200) <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None)
2016-03-21 23:15:03 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url"
kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"}
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/cookie {"s
"415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "cookie": {"path": "/", "name": "Disc", "value": "YES"}}
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url"
kswarrants.kasikornsecurities.com/www/Tool/calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/elements {
xpath", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "//select[@id=\"underling0\"]/option"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{a
62-4578-896c-9de40ce48162}/value {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{a06dc4f5-0462-4578-896c-9de40ce48162}", "value": ["A", "A", "V",
"C", "1", "6", "0", "4", "A"]}
2016-03-21 23:15:06 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "btn_sub"}
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{e
ae-4848-9bac-450b5567842b}/click {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{ee24c112-f7ae-4848-9bac-450b5567842b}"}
2016-03-21 23:15:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{d2
1-48bc-a76e-fccc5ad9e646}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{d25a795f-4721-48bc-a76e-fccc5ad9e646}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_2"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{3d
b-40a0-8880-9f5675bed655}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{3d380c09-248b-40a0-8880-9f5675bed655}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator>
{'bid': u'0.77',
'date': u'21/03/2016',
'title': u'AAV11C1604A',
'underlying_bid': u'5.10'}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{60
8-4302-93dc-079c6e686055}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{606f3c95-5e98-4302-93dc-079c6e686055}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_2"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{8b
0-461f-aafe-0bdafa2c6d6f}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{8be98315-46d0-461f-aafe-0bdafa2c6d6f}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator>
{'bid': u'0.80',
'date': u'21/03/2016',
'title': u'AAV11C1604A',
'underlying_bid': u'5.15'}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{9
70-408c-b683-5c363412cf0f}/clear {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{9c1a9c4b-aa70-408c-b683-5c363412cf0f}"}
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{00
7-4a5b-86c9-96bed980ebef}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{001d727c-3f87-4a5b-86c9-96bed980ebef}"}
2016-03-21 23:15:12 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:12 [scrapy] ERROR: Spider error processing <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
for x in result:
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or())
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or() if _filter(r))
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
return (r for r in result or() if _filter(r))
File "D:\testing\w11s.py", line 25, in parse
a = option.text
File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 70, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 457, in _execute
return self._parent.execute(command, params)
File "c:\python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 233, in execute
self.error_handler.check_response(response)
File "c:\python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9454)
at Utils.getElementAt (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:9039)
at WebElement.getElementText (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12092)
at DelayedCommand.prototype.executeInternal_/h (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
at DelayedCommand.prototype.executeInternal_ (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
at DelayedCommand.prototype.execute/< (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)
2016-03-21 23:15:12 [scrapy] INFO: Closing spider (finished)
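The traceback above ends in a StaleElementReferenceException raised at a = option.text on the second pass through the loop, after two items for the first option were already scraped. A plausible reading is that submitting the form changes the page, so the option elements looked up before the loop are no longer valid. The following is only a sketch of one way around that, assuming the option labels themselves do not change: copy the labels into plain strings first, and look up every other element again on each pass. The helper names read_option_texts and iter_rows and the pause parameter are made up for illustration, and the find_element_by_* calls are the older Selenium API used in the question.

```python
import time

OPTIONS_XPATH = '//select[@id="underling0"]/option'


def read_option_texts(driver, first=1, last=4):
    """Return the option labels as plain strings.

    Strings stay valid after the page changes; WebElement objects may not.
    """
    options = driver.find_elements_by_xpath(OPTIONS_XPATH)
    return [option.text for option in options[first:last]]


def iter_rows(driver, pause=2):
    """Yield (title, date, underlying_bid, bid) tuples for each option."""
    for title in read_option_texts(driver):
        # Look the text box up again on every pass instead of reusing it.
        textbox = driver.find_element_by_id("calid")
        textbox.clear()
        textbox.send_keys(title)
        driver.find_element_by_id("btn_sub").click()
        time.sleep(pause)  # crude fixed wait, as in the question
        for x in range(1, 3):
            yield (
                title,
                driver.find_element_by_id("d_1").text,
                driver.find_element_by_id("d_%d_1" % x).text,
                driver.find_element_by_id("d_%d_2" % x).text,
            )
```

Inside parse, the loop body could then become something like: for title, date, underlying_bid, bid in iter_rows(self.driver), filling a w01item from the four values and yielding it. Whether the fixed sleeps are long enough for the page to update is a separate question that this sketch does not address.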
You are overriding the constructor without calling the parent constructor – eLRuLL
Sorry to ask. I don't understand what you mean, as I'm still new to Scrapy and Selenium; I picked them up 2-3 weeks ago. Could you give an example? Thanks – scraper
Can you share the logs? – eLRuLL
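As a rough illustration of eLRuLL's first comment, the sketch below shows a constructor that passes its arguments on to the parent class before creating the browser. So that it runs without Scrapy or Firefox installed, it uses a stand-in base class (Base) in place of scrapy.Spider and a hypothetical driver_factory argument in place of webdriver.Firefox; both are assumptions made for this example, not part of the original code. It is not established here that the missing parent call is what causes the error in the log.

```python
class Base(object):
    """Stand-in for scrapy.Spider, only so this sketch runs on its own."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Keyword arguments become attributes on the spider.
        self.__dict__.update(kwargs)


class MqSpider(Base):
    name = "w11"

    def __init__(self, *args, **kwargs):
        # Hypothetical hook; in the question this would be webdriver.Firefox.
        driver_factory = kwargs.pop("driver_factory", None)
        # Let the parent class do its own setup first.
        super(MqSpider, self).__init__(*args, **kwargs)
        self.driver = driver_factory() if driver_factory else None

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; quit the browser here.
        if self.driver is not None:
            self.driver.quit()
```

In the real spider the class would inherit from scrapy.Spider, and the last line of the constructor would simply be self.driver = webdriver.Firefox().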