Scrapy beginner gets an exception

I need help. I wanted to build a crawler for a specific website (UndermineJournal). I want to pull the data from the page and turn it into console output, because I mostly work in consoles and don't want to switch windows that often. I also want to push the data into a database (SQL etc. is not a problem). But somehow, when I try to run the crawler, all I get is the output below; the tutorial isn't really helpful, I think:

2016-10-05 10:55:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 10:55:23 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 10:55:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 10:55:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 10:55:23 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 10:55:24 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 10:55:24 [boto] ERROR: Unable to read instance data, giving up 
2016-10-05 10:55:24 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-10-05 10:55:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-10-05 10:55:24 [scrapy] INFO: Enabled item pipelines: 
2016-10-05 10:55:24 [scrapy] INFO: Spider opened 
2016-10-05 10:55:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-10-05 10:55:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-10-05 10:55:24 [scrapy] ERROR: Error while obtaining start requests 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request 
    request = next(slot.start_requests) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests 
    yield self.make_requests_from_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url 
    return Request(url, dont_filter=True) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__ 
    self._set_url(url) 
    File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url 
    raise ValueError('Missing scheme in request url: %s' % self._url) 
ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442 
2016-10-05 10:55:24 [scrapy] INFO: Closing spider (finished) 
2016-10-05 10:55:24 [scrapy] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 710944), 
'log_count/DEBUG': 2, 
'log_count/ERROR': 3, 
'log_count/INFO': 7, 
'start_time': datetime.datetime(2016, 10, 5, 8, 55, 24, 704378)} 
2016-10-05 10:55:24 [scrapy] INFO: Spider closed (finished) 

My spider looks like this:

# -*- coding: utf-8 -*-
import scrapy


class JournalSpider(scrapy.Spider):
    name = "journal"
    allowed_domains = ["theunderminejournal.com"]
    start_urls = (
        'theunderminejournal.com/#eu/eredar/item/124442',
    )

    def parse(self, response):
        # Derive a filename from the URL and dump the raw response body.
        page = response.url.split("/")[-2]
        filename = 'journal-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Does anyone have a tip?

EDIT: RESULTS

2016-10-05 11:21:35 [scrapy] INFO: Scrapy 1.0.3 started (bot: undermine) 
2016-10-05 11:21:35 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-10-05 11:21:35 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'undermine.spiders', 'SPIDER_MODULES': ['undermine.spiders'], 'BOT_NAME': 'undermine'} 
2016-10-05 11:21:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-10-05 11:21:35 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-10-05 11:21:36 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 429, in open 
    response = self._open(req, data) 
    File "/usr/lib/python2.7/urllib2.py", line 447, in _open 
    '_open', req) 
    File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-10-05 11:21:36 [boto] ERROR: Unable to read instance data, giving up 

Answer

ValueError: Missing scheme in request url: theunderminejournal.com/#eu/eredar/item/124442

Your URLs should always begin with either http:// or https://.

start_urls = (
    'theunderminejournal.com/#eu/eredar/item/124442', 
    #^should be: 
    'http://theunderminejournal.com/#eu/eredar/item/124442', 
) 
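
Since the question also mentions pushing the data into a database, here is a minimal item pipeline sketch for the SQL side. This is only an illustration under assumptions: it targets SQLite, and the prices table and the item_id/price fields are hypothetical names, not part of the original code.

# pipelines.py -- minimal sketch, assumptions as described above
import sqlite3


class SqlitePipeline(object):

    def open_spider(self, spider):
        # Open (or create) the database file when the spider starts.
        self.conn = sqlite3.connect('undermine.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS prices (item_id TEXT, price INTEGER)')

    def close_spider(self, spider):
        # Commit everything and release the connection on shutdown.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called once per scraped item: store it, then pass it on.
        self.conn.execute(
            'INSERT INTO prices (item_id, price) VALUES (?, ?)',
            (item.get('item_id'), item.get('price')))
        return item

To activate it, register the class in settings.py, e.g. ITEM_PIPELINES = {'undermine.pipelines.SqlitePipeline': 300}.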

The error in your edit is completely unrelated; it is caused by the 'boto' package, which cannot connect to anything. You can probably ignore it. Does the spider itself work? – Granitosaurus
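
As for those boto errors in both logs: on Scrapy 1.0.x the built-in S3 download handler makes boto probe the EC2 instance metadata service for credentials, which times out on machines outside AWS. Assuming you do not need S3 downloads, a commonly suggested workaround is to disable that handler in settings.py:

# settings.py -- sketch, assuming S3 support is not needed.
# Mapping the s3 scheme to None disables the handler, so boto
# never tries to read instance metadata (the timeout seen above).
DOWNLOAD_HANDLERS = {'s3': None}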