2016-03-19

Scrapy with Selenium only crawling the first page instead of multiple pages

I am currently scraping data that is generated by JavaScript on a website, so I am using Scrapy together with Selenium. However, the spider only crawls and scrapes data from the first page. Can someone help me with this? The code I wrote is below. Thanks in advance.

import scrapy
from scrapy.http import Request
import time
from selenium import webdriver

class w01item(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    underlying_bid = scrapy.Field()
    bid = scrapy.Field()

class mqSpider(scrapy.Spider):
    name = "w11"
    allowed_domains = ["kswarrants.kasikornsecurities.com"]
    start_urls = ["http://kswarrants.kasikornsecurities.com/www/Tool/calculator"]
    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.add_cookie({'name': 'Disc', 'value': 'YES', 'path': '/'})
        self.driver.get("http://kswarrants.kasikornsecurities.com/www/Tool/calculator")
        options = self.driver.find_elements_by_xpath('//select[@id="underling0"]/option')
        for option in options[1:4]:
            a = option.text
            textbox = self.driver.find_element_by_id("calid")
            textbox.send_keys(option.text)
            time.sleep(1)
            self.driver.find_element_by_id("btn_sub").click()
            time.sleep(2)
            for x in xrange(1, 3):
                item = w01item()
                item['title'] = a
                item['date'] = self.driver.find_element_by_id('d_1').text
                item['underlying_bid'] = self.driver.find_element_by_id('d_' + str(x) + '_1').text
                item['bid'] = self.driver.find_element_by_id('d_' + str(x) + '_2').text
                yield item
            self.driver.find_element_by_id("calid").clear()

The log from running the script is below.

2016-03-21 23:14:56 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot) 
2016-03-21 23:14:56 [scrapy] INFO: Optional features available: ssl, http11 
2016-03-21 23:14:56 [scrapy] INFO: Overridden settings: {} 
2016-03-21 23:14:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session {"desiredCapabilities": {"platform": "ANY", "br 
: "firefox", "version": "", "marionette": false, "javascriptEnabled": true}} 
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH 
leware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-03-21 23:15:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-03-21 23:15:02 [scrapy] INFO: Enabled item pipelines: 
2016-03-21 23:15:02 [scrapy] INFO: Spider opened 
2016-03-21 23:15:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-03-21 23:15:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-03-21 23:15:02 [scrapy] DEBUG: Redirecting (302) to <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> from <GET http://kswarrant 
securities.com/www/Tool/calculator> 
2016-03-21 23:15:02 [scrapy] DEBUG: Crawled (200) <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None) 
2016-03-21 23:15:03 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url" 
kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"} 
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/cookie {"s 
"415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "cookie": {"path": "/", "name": "Disc", "value": "YES"}} 
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url" 
kswarrants.kasikornsecurities.com/www/Tool/calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"} 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/elements { 
xpath", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "//select[@id=\"underling0\"]/option"} 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06 
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"} 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"} 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06 
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"} 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{a 
62-4578-896c-9de40ce48162}/value {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{a06dc4f5-0462-4578-896c-9de40ce48162}", "value": ["A", "A", "V", 
"C", "1", "6", "0", "4", "A"]} 
2016-03-21 23:15:06 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "btn_sub"} 
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{e 
ae-4848-9bac-450b5567842b}/click {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{ee24c112-f7ae-4848-9bac-450b5567842b}"} 
2016-03-21 23:15:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74 
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_1"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{d2 
1-48bc-a76e-fccc5ad9e646}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{d25a795f-4721-48bc-a76e-fccc5ad9e646}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_2"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{3d 
b-40a0-8880-9f5675bed655}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{3d380c09-248b-40a0-8880-9f5675bed655}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> 
{'bid': u'0.77', 
'date': u'21/03/2016', 
'title': u'AAV11C1604A', 
'underlying_bid': u'5.10'} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74 
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_1"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{60 
8-4302-93dc-079c6e686055}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{606f3c95-5e98-4302-93dc-079c6e686055}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_2"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{8b 
0-461f-aafe-0bdafa2c6d6f}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{8be98315-46d0-461f-aafe-0bdafa2c6d6f}"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> 
{'bid': u'0.80', 
'date': u'21/03/2016', 
'title': u'AAV11C1604A', 
'underlying_bid': u'5.15'} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {" 
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"} 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{9 
70-408c-b683-5c363412cf0f}/clear {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{9c1a9c4b-aa70-408c-b683-5c363412cf0f}"} 
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{00 
7-4a5b-86c9-96bed980ebef}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{001d727c-3f87-4a5b-86c9-96bed980ebef}"} 
2016-03-21 23:15:12 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
2016-03-21 23:15:12 [scrapy] ERROR: Spider error processing <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None) 
Traceback (most recent call last): 
    File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback 
    yield next(it) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output 
    for x in result: 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "D:\testing\w11s.py", line 25, in parse 
    a = option.text 
    File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 70, in text 
    return self._execute(Command.GET_ELEMENT_TEXT)['value'] 
    File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 457, in _execute 
    return self._parent.execute(command, params) 
    File "c:\python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 233, in execute 
    self.error_handler.check_response(response) 
    File "c:\python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response 
    raise exception_class(message, screen, stacktrace) 
StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up 
Stacktrace: 
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9454) 
    at Utils.getElementAt (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:9039)
    at WebElement.getElementText (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12092)
    at DelayedCommand.prototype.executeInternal_/h (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at DelayedCommand.prototype.executeInternal_ (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
    at DelayedCommand.prototype.execute/< (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)
2016-03-21 23:15:12 [scrapy] INFO: Closing spider (finished) 
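
Judging from the traceback, the StaleElementReferenceException is raised at `a = option.text` on the second pass through the loop. The `option` elements were looked up before `btn_sub` was clicked, and the page appears to change after that click, so the old references probably no longer resolve. Below is a minimal sketch of the usual workaround: read the option texts into plain strings before interacting with the page. `FakeOption`, `StaleElementError`, `snapshot_texts`, the `page` dict and the value `OPTION_2` are hypothetical stand-ins so that the example runs without a browser; only `AAV11C1604A` comes from the log.

```python
class StaleElementError(Exception):
    """Hypothetical stand-in for Selenium's StaleElementReferenceException."""


class FakeOption(object):
    """Hypothetical stand-in for a WebElement on a page that may be re-rendered."""

    def __init__(self, text, page):
        self._text = text
        self._page = page
        self._version = page["version"]  # page state at lookup time

    @property
    def text(self):
        # Mimics a WebElement: reading it fails once its page was replaced.
        if self._page["version"] != self._version:
            raise StaleElementError(self._text)
        return self._text


def snapshot_texts(elements):
    """Read every element's text right away, before the page can change."""
    return [element.text for element in elements]


page = {"version": 0}
options = [FakeOption(t, page) for t in ("-- select --", "AAV11C1604A", "OPTION_2")]

# Copy the plain strings first; they stay valid after the page changes.
names = snapshot_texts(options[1:])

page["version"] += 1  # simulates the page change after clicking "btn_sub"

for name in names:
    print(name)  # safe: no element is read after the page change
```

In the spider, the equivalent change would be to build `names = [o.text for o in options[1:4]]` before the loop and call `textbox.send_keys(name)` inside it, so that no `option` element is read after the first click.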
You are overriding the constructor without calling the parent constructor. – eLRuLL

Sorry to ask, but I don't understand what you mean, as I am still new to Scrapy and Selenium; I picked them up 2 to 3 weeks ago. Could you give an example? Thanks – scraper
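
As an illustration of the constructor comment, here is a minimal sketch of an `__init__` that hands its arguments to the parent class before doing its own setup. `BaseSpider` is a simplified, hypothetical stand-in for `scrapy.Spider` so that the example runs without Scrapy installed, and `category` is a made-up argument. Judging by the traceback this is probably not what stops the spider, but it keeps the parent's setup from being skipped.

```python
class BaseSpider(object):
    """Simplified, hypothetical stand-in for scrapy.Spider."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Keep any extra keyword arguments as attributes.
        self.__dict__.update(kwargs)


class MqSpider(BaseSpider):
    name = "w11"

    def __init__(self, *args, **kwargs):
        # Run the parent constructor first so its setup is not skipped.
        super(MqSpider, self).__init__(*args, **kwargs)
        # In the real spider this would be webdriver.Firefox().
        self.driver = None


spider = MqSpider(category="warrants")  # "category" is a hypothetical argument
print("%s %s" % (spider.name, spider.category))
```

In the spider from the question, the same pattern would mean changing the signature to `__init__(self, *args, **kwargs)` and calling `super(mqSpider, self).__init__(*args, **kwargs)` before creating the driver.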

Can you share the logs? – eLRuLL
