Scrapy spider not saving data

I am trying to save the schedules of basketball teams to a CSV file using Scrapy. I have written the following code in these files:

settings.py

BOT_NAME = 'test_project' 

SPIDER_MODULES = ['test_project.spiders'] 
NEWSPIDER_MODULE = 'test_project.spiders' 

FEED_FORMAT = "csv" 
FEED_URI = "cportboys.csv" 

# Crawl responsibly by identifying yourself (and your website) on the user-agent 
#USER_AGENT = 'test_project (+http://www.yourdomain.com)' 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = True 
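
From my understanding, the FEED_FORMAT and FEED_URI settings tell Scrapy's built-in feed exporter to write every item the spider yields to cportboys.csv, so no custom pipeline should be needed.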

khsaabot.py

import scrapy 


class KhsaabotSpider(scrapy.Spider): 
    name = 'khsaabot' 
    allowed_domains = ['https://scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978'] 
    start_urls = ['http://https://scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/'] 

def parse(self, response): 
    date = response.css('.mdate::text').extract() 
    opponent = response.css('.opponent::text').extract() 
    place = response.css('.schedule-loc::text').extract() 


    for item in zip(date,opponent,place): 
     scraped_info = { 
      'date' : item[0], 
      'opponent' : item[1], 
      'place' : item[2], 
     } 

     yield scraped_info 

Now, I'm not sure what is going wrong here. When I run it with "scrapy crawl khsaabot" in the terminal, I get no errors and it seems to work fine. However, just in case there is a problem with what happens in the terminal, I'm including the output I get there:

2017-12-27 17:21:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: test_project) 
2017-12-27 17:21:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'test_project', 'FEED_FORMAT': 'csv', 'FEED_URI': 'cportboys.csv', 'NEWSPIDER_MODULE': 'test_project.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['test_project.spiders']} 
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.memusage.MemoryUsage', 
'scrapy.extensions.feedexport.FeedExporter', 
'scrapy.extensions.logstats.LogStats'] 
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-12-27 17:21:49 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Spider opened 
2017-12-27 17:21:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-12-27 17:21:49 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://https/robots.txt>: DNS lookup failed: no results for hostname lookup: https. 
Traceback (most recent call last): 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks 
    result = result.throwExceptionIntoGenerator(g) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator 
    return g.throw(self.type, self.value, self.tb) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request 
    defer.returnValue((yield download_func(request=request,spider=spider))) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts 
    "no results for hostname lookup: {}".format(self._hostStr) 
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/> (failed 1 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/> (failed 2 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://https//scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/> (failed 3 times): DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/> 
Traceback (most recent call last): 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks 
    result = result.throwExceptionIntoGenerator(g) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator 
    return g.throw(self.type, self.value, self.tb) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request 
    defer.returnValue((yield download_func(request=request,spider=spider))) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
  File "/Users/devsandbox/anaconda3/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts 
    "no results for hostname lookup: {}".format(self._hostStr) 
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https. 
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-12-27 17:21:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 6, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 6, 
'downloader/request_bytes': 1416, 
'downloader/request_count': 6, 
'downloader/request_method_count/GET': 6, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 12, 27, 23, 21, 49, 579649), 
'log_count/DEBUG': 7, 
'log_count/ERROR': 2, 
'log_count/INFO': 7, 
'memusage/max': 50790400, 
'memusage/startup': 50790400, 
'retry/count': 4, 
'retry/max_reached': 2, 
'retry/reason_count/twisted.internet.error.DNSLookupError': 4, 
'scheduler/dequeued': 3, 
'scheduler/dequeued/memory': 3, 
'scheduler/enqueued': 3, 
'scheduler/enqueued/memory': 3, 
'start_time': datetime.datetime(2017, 12, 27, 23, 21, 49, 323652)} 
2017-12-27 17:21:49 [scrapy.core.engine] INFO: Spider closed (finished) 

The output looks right to me, but I'm still new to Scrapy, so I could be missing something.

Thanks y'all

Answer


You are getting twisted.internet.error.DNSLookupError messages in the log. Looking at your start_urls list, the entry starts with "http://https://". Change:

start_urls = ['http://https://scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/'] 

zu:

start_urls = ['https://scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978/'] 
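
With the stray "http://" prefix, the downloader treats "https" as the hostname to resolve, which is exactly the "no results for hostname lookup: https" failure in your log.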

Thank you! I did what you said and it got rid of that error, but now I'm getting one in the same place called "NotImplementedError". What can I do about that? – Hunter


Never mind, after searching around a bit I figured out how to fix that error, along with some others I found. Thanks for helping with this one! – Hunter
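
For anyone hitting the NotImplementedError mentioned in the comments: it most likely comes from the indentation in khsaabot.py. The parse method is defined at module level rather than inside the spider class, so Scrapy falls back to the base scrapy.Spider.parse, which raises NotImplementedError. A minimal corrected sketch of the spider, assuming the same CSS selectors as the question (and with allowed_domains trimmed to a bare domain, as Scrapy expects):

import scrapy 


class KhsaabotSpider(scrapy.Spider): 
    name = 'khsaabot' 
    # allowed_domains takes bare domain names, not full URLs 
    allowed_domains = ['scoreboard.12dt.com'] 
    start_urls = ['https://scoreboard.12dt.com/scoreboard/khsaa/kybbk17/?id=51978'] 

    # parse must be indented inside the class so it overrides scrapy.Spider.parse 
    def parse(self, response): 
        dates = response.css('.mdate::text').extract() 
        opponents = response.css('.opponent::text').extract() 
        places = response.css('.schedule-loc::text').extract() 

        # zip the three column lists into one item per game 
        for date, opponent, place in zip(dates, opponents, places): 
            yield { 
                'date': date, 
                'opponent': opponent, 
                'place': place, 
            } 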