Scrapy + Splash: Connection refused

I'm learning how to work with Scrapy + Splash. I created a project with a virtual environment and am now following this tutorial: https://github.com/scrapy-plugins/scrapy-splash
I started Splash with:
$ docker run -p 8050:8050 scrapinghub/splash
which produced:
2017-01-12 09:18:50+0000 [-] Log opened.
2017-01-12 09:18:50.225754 [-] Splash version: 2.3
2017-01-12 09:18:50.227033 [-] Qt 5.5.1, PyQt 5.5.1, WebKit 538.1, sip 4.17, Twisted 16.1.1, Lua 5.2
2017-01-12 09:18:50.227201 [-] Python 3.4.3 (default, Nov 17 2016, 01:08:31) [GCC 4.8.4]
2017-01-12 09:18:50.227645 [-] Open files limit: 1048576
2017-01-12 09:18:50.227882 [-] Can't bump open files limit
2017-01-12 09:18:50.333978 [-] Xvfb is started: ['Xvfb', ':1', '-screen', '0', '1024x768x24']
2017-01-12 09:18:50.438528 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2017-01-12 09:18:50.597573 [-] verbosity=1
2017-01-12 09:18:50.597747 [-] slots=50
2017-01-12 09:18:50.597820 [-] argument_cache_max_entries=500
2017-01-12 09:18:50.598696 [-] Web UI: enabled, Lua: enabled (sandbox: enabled)
2017-01-12 09:18:50.601924 [-] Site starting on 8050
2017-01-12 09:18:50.602119 [-] Starting factory <twisted.web.server.Site object at 0x7ff528490be0>
When I run the following spider:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'spiderman'
    domain = ['web']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        print(response.body)
everything works fine; Scrapy returns the HTML body. However, when I try a SplashRequest from the tutorial, like this:
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'spiderman'
    domain = ['web']
    start_urls = ['http://www.example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                args={'wait': 0.5})

    def parse(self, response):
        response.body
I get the following messages in my terminal:
File "/Users/username/myVirtualEnvironment/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 61: Connection refused.
2017-01-12 11:02:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 11:03:06 [scrapy.downloadermiddlewares.retry] DEBUG:
Retrying <GET http://192.168.59.103:8050/robots.txt> (failed 1 times): TCP connection timed out: 60: Operation timed out
My guess is that Splash is causing some connection problems, but I don't know how to fix them. I added:
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'
DOWNLOAD_DELAY = 0.25
But it doesn't help!
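For reference, the settings.py wiring described in the scrapy-splash README looks roughly like the sketch below. The SPLASH_URL value is an assumption: it must point at the host and port where the Splash container is actually reachable from the machine running Scrapy.

```python
# Sketch of the scrapy-splash configuration from the project README.
# SPLASH_URL is an assumption -- adjust it to wherever the Splash
# container is actually reachable (it may not be localhost if Docker
# runs inside a VM via docker-machine).
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```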
Q: Can anyone solve this problem?
EDIT: Changing ROBOTSTXT_OBEY to False doesn't work. Full console log:
$ scrapy crawl spiderman
2017-01-12 11:25:18 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: myScrapingProject)
2017-01-12 11:25:18 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myScrapingProject', 'DOWNLOAD_DELAY': 0.25, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'myScrapingProject.spiders', 'SPIDER_MODULES': ['myScrapingProject.spiders'], 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'}
2017-01-12 11:25:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-01-12 11:25:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-01-12 11:25:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-01-12 11:25:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-01-12 11:25:18 [scrapy.core.engine] INFO: Spider opened
2017-01-12 11:25:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 11:25:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-12 11:26:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 11:26:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.example.com via http://192.168.59.103:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2017-01-12 11:27:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 11:27:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.example.com via http://192.168.59.103:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2017-01-12 11:28:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-12 11:29:03 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.example.com via http://192.168.59.103:8050/render.html> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2017-01-12 11:29:03 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.example.com via http://192.168.59.103:8050/render.html>
Traceback (most recent call last):
File "/Users/username/myVirtualEnvironment/lib/python3.6/site-packages/twisted/internet/defer.py", line 1297, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/Users/username/myVirtualEnvironment/lib/python3.6/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/Users/username/myVirtualEnvironment/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2017-01-12 11:29:03 [scrapy.core.engine] INFO: Closing spider (finished)
2017-01-12 11:29:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
'downloader/request_bytes': 1746,
'downloader/request_count': 3,
'downloader/request_method_count/POST': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 1, 12, 10, 29, 3, 935527),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'splash/render.html/request_count': 1,
'start_time': datetime.datetime(2017, 1, 12, 10, 25, 18, 451764)}
2017-01-12 11:29:03 [scrapy.core.engine] INFO: Spider closed (finished)
EDIT 2: When I run curl http://localhost:8050/render.html?url=http%3A%2F%2Fwww.example.com%2F in a new terminal window, I get the following output in the terminal window I used to start Splash:
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: Failed to open "/etc/machine-id": No such file or directory
See the manual page for dbus-uuidgen to correct this issue.
2017-01-12 10:48:03.341100 [events] {"path": "/render.html", "load": [0.07, 0.02, 0.0], "fds": 19, "client_ip": "172.17.0.1", "_id": 140690919672912, "method": "GET", "rendertime": 6.497595548629761, "active": 0, "qsize": 0, "maxrss": 83860, "args": {"uid": 140690919672912, "url": "http://www.example.com/"},
"timestamp": 1484218083, "status_code": 200, "user-agent": "curl/7.51.0"}
2017-01-12 10:48:03.343167 [-] "172.17.0.1" - - [12/Jan/2017:10:48:02 +0000] "GET /render.html?url=http%3A%2F%2Fwww.example.com%2F HTTP/1.1" 200 1262 "-" "curl/7.51.0"
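curl succeeding against localhost:8050 while Scrapy keeps timing out against 192.168.59.103:8050 suggests that SPLASH_URL points at a stale docker-machine IP rather than at the running container. A minimal sketch to compare the two endpoints (both addresses are taken from the logs above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The endpoint curl reached successfully vs. the one Scrapy keeps retrying:
print("localhost:8050 reachable:", port_open("localhost", 8050))
print("192.168.59.103:8050 reachable:", port_open("192.168.59.103", 8050))
```

If the first check succeeds and the second fails, pointing SPLASH_URL at 'http://localhost:8050' in settings.py should send the render requests to the live container.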
Did you install 'scrapy_splash.SplashMiddleware' in your Scrapy project settings under 'DOWNLOADER_MIDDLEWARES' as [described in the README](https://github.com/scrapy-plugins/scrapy-splash#configuration)? You can also disable robots.txt handling with 'ROBOTSTXT_OBEY = False' in your Scrapy 'settings.py'. You can also check whether Splash is running by opening the web interface at http://localhost:8050/ –

Ehm, how do I install scrapy_splash.SplashMiddleware? I can't find it in the README. I think it is installed, since I also get the following message: '2017-01-12 11:25:18 [scrapy.middleware] INFO: Enabled downloader middlewares: 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware','. And yes, Splash is up and running! – titusAdam

And have you tried disabling robots.txt handling with ['ROBOTSTXT_OBEY = False'](https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey)? If it still doesn't work, paste your console logs from where you run scrapy crawl (everything, not just the tail with 'Retrying'). And if you see anything in the Splash console in that other terminal, paste that too. –