2017-08-01

Trying out Scrapy + Splash

I'm playing around with Scrapy and Splash and have run into some problems. When I ran my spiders, I got HTTP 502 and 504 errors, so I tried Splash directly in my browser. First I ran "sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600 -v3" to start Splash, then I went to localhost:8050. The web UI opens properly and I can enter code. Here is the basic function I'm trying to run:

function main(splash, args) 
    assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js")) 
    splash.resource_timeout = 30.0 
    splash.images_enabled = false 
    assert(splash:go(args.url)) 
    assert(splash:wait(0.5)) 
    return {
        html = splash:html(),
        --png = splash:png(),
        --har = splash:har(),
    }
end 

When I try to render http://boingboing.net/blog with this function, I get an 'invalid hostname' Lua error; here are the logs:

2017-08-01 18:26:28+0000 [-] Log opened. 
2017-08-01 18:26:28.077457 [-] Splash version: 3.0 
2017-08-01 18:26:28.077838 [-] Qt 5.9.1, PyQt 5.9, WebKit 602.1, sip 4.19.3, Twisted 16.1.1, Lua 5.2 
2017-08-01 18:26:28.077900 [-] Python 3.5.2 (default, Nov 17 2016, 17:05:23) [GCC 5.4.0 20160609] 
2017-08-01 18:26:28.077984 [-] Open files limit: 65536 
2017-08-01 18:26:28.078046 [-] Can't bump open files limit 
2017-08-01 18:26:28.180376 [-] Xvfb is started: ['Xvfb', ':1937726875', '-screen', '0', '1024x768x24', '-nolisten', 'tcp'] 
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root' 
2017-08-01 18:26:28.226937 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles 
2017-08-01 18:26:28.301002 [-] verbosity=3 
2017-08-01 18:26:28.301116 [-] slots=50 
2017-08-01 18:26:28.301202 [-] argument_cache_max_entries=500 
2017-08-01 18:26:28.301530 [-] Web UI: enabled, Lua: enabled (sandbox: enabled) 
2017-08-01 18:26:28.302122 [-] Site starting on 8050 
2017-08-01 18:26:28.302219 [-] Starting factory <twisted.web.server.Site object at 0x7ffa08390dd8> 
2017-08-01 18:26:32.660457 [-] "172.17.0.1" - - [01/Aug/2017:18:26:32 +0000] "GET / HTTP/1.1" 200 7677 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0"
2017-08-01 18:27:18.860020 [-] "172.17.0.1" - - [01/Aug/2017:18:27:18 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5656 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:27:19.038565 [pool] initializing SLOT 0 
libpng warning: iCCP: known incorrect sRGB profile 
libpng warning: iCCP: known incorrect sRGB profile 
process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text 
See the manual page for dbus-uuidgen to correct this issue. 
2017-08-01 18:27:19.066765 [render] [140711856519656] viewport size is set to 1024x768 
2017-08-01 18:27:19.066964 [pool] [140711856519656] SLOT 0 is starting 
2017-08-01 18:27:19.067071 [render] [140711856519656] function main(splash, args)\r\n assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend 
2017-08-01 18:27:19.070107 [render] [140711856519656] [lua_runner] dispatch cmd_id=__START__ 
2017-08-01 18:27:19.070270 [render] [140711856519656] [lua_runner] arguments are for command __START__, waiting for result of __START__ 
2017-08-01 18:27:19.070352 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=() 
2017-08-01 18:27:19.070424 [render] [140711856519656] [lua_runner] send None 
2017-08-01 18:27:19.070496 [render] [140711856519656] [lua_runner] send (lua) None 
2017-08-01 18:27:19.070657 [render] [140711856519656] [lua_runner] got AsyncBrowserCommand(id=None, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'}) 
2017-08-01 18:27:19.070755 [render] [140711856519656] [lua_runner] instructions used: 70 
2017-08-01 18:27:19.070834 [render] [140711856519656] [lua_runner] executing AsyncBrowserCommand(id=0, name='http_get', kwargs={'url': 'https://code.jquery.com/jquery-3.1.1.min.js', 'callback': '<a callback>'}) 
2017-08-01 18:27:19.071141 [network] [140711856519656] GET https://code.jquery.com/jquery-3.1.1.min.js 
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method 
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method 
2017-08-01 18:27:19.082150 [pool] [140711856519656] SLOT 0 is working 
2017-08-01 18:27:19.082298 [pool] [140711856519656] queued 
2017-08-01 18:28:39.151814 [network-manager] Download error 3: the remote host name was not found (invalid hostname) (https://code.jquery.com/jquery-3.1.1.min.js) 
2017-08-01 18:28:39.152087 [network-manager] Finished downloading https://code.jquery.com/jquery-3.1.1.min.js 
2017-08-01 18:28:39.152202 [render] [140711856519656] [lua_runner] dispatch cmd_id=0 
2017-08-01 18:28:39.152268 [render] [140711856519656] [lua_runner] arguments are for command 0, waiting for result of 0 
2017-08-01 18:28:39.152339 [render] [140711856519656] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'invalid_hostname'),) 
2017-08-01 18:28:39.152400 [render] [140711856519656] [lua_runner] send PyResult('return', None, 'invalid_hostname') 
2017-08-01 18:28:39.152468 [render] [140711856519656] [lua_runner] send (lua) (b'return', None, b'invalid_hostname') 
2017-08-01 18:28:39.152582 [render] [140711856519656] [lua_runner] instructions used: 79 
2017-08-01 18:28:39.152642 [render] [140711856519656] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:2: invalid_hostname',) 
2017-08-01 18:28:39.152816 [pool] [140711856519656] SLOT 0 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'invalid_hostname', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:2: invalid_hostname', 'line_number': 2} 
    /app/splash/browser_tab.py:1180:_return_reply 
    /app/splash/qtrender_lua.py:901:callback 
    /app/splash/lua_runner.py:27:return_result 
    /app/splash/qtrender.py:17:stop_on_error_wrapper 
    --- <exception caught here> --- 
    /app/splash/qtrender.py:15:stop_on_error_wrapper 
    /app/splash/qtrender_lua.py:2257:dispatch 
    /app/splash/lua_runner.py:195:dispatch 
    ] 
2017-08-01 18:28:39.152883 [pool] [140711856519656] SLOT 0 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48> 
2017-08-01 18:28:39.152944 [render] [140711856519656] [splash] clearing 0 objects 
2017-08-01 18:28:39.153026 [render] [140711856519656] close is requested by a script 
2017-08-01 18:28:39.153304 [render] [140711856519656] cancelling 0 remaining timers 
2017-08-01 18:28:39.153374 [pool] [140711856519656] SLOT 0 done with <splash.qtrender_lua.LuaRender object at 0x7ffa08477e48> 
2017-08-01 18:28:39.153997 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "invalid_hostname", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:2: invalid_hostname", "line_number": 2}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 107916, "qsize": 0, "path": "/execute", "timestamp": 1501612119, "fds": 18, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend", "resource_timeout": 0, "uid": 140711856519656, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 80.11527562141418, "method": "POST", "_id": 140711856519656, "load": [0.46, 0.51, 0.54]} 
2017-08-01 18:28:39.154127 [-] "172.17.0.1" - - [01/Aug/2017:18:28:38 +0000] "POST /execute HTTP/1.1" 400 325 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:28:39.154237 [pool] SLOT 0 is available 

If I try it without loading jQuery first, I get a 'network5' Lua error (which is some kind of timeout). The logs for that run are as follows:

2017-08-01 18:31:07.110255 [-] "172.17.0.1" - - [01/Aug/2017:18:31:06 +0000] "GET /info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend HTTP/1.1" 200 5658 "http://localhost:8050/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:31:07.489653 [pool] initializing SLOT 1 
2017-08-01 18:31:07.490576 [render] [140711856961016] viewport size is set to 1024x768 
2017-08-01 18:31:07.490692 [pool] [140711856961016] SLOT 1 is starting 
2017-08-01 18:31:07.490829 [render] [140711856961016] function main(splash, args)\r\n --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend 
2017-08-01 18:31:07.493641 [render] [140711856961016] [lua_runner] dispatch cmd_id=__START__ 
2017-08-01 18:31:07.493782 [render] [140711856961016] [lua_runner] arguments are for command __START__, waiting for result of __START__ 
2017-08-01 18:31:07.493865 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=() 
2017-08-01 18:31:07.493937 [render] [140711856961016] [lua_runner] send None 
2017-08-01 18:31:07.494010 [render] [140711856961016] [lua_runner] send (lua) None 
2017-08-01 18:31:07.494270 [render] [140711856961016] [lua_runner] got AsyncBrowserCommand(id=None, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'}) 
2017-08-01 18:31:07.494416 [render] [140711856961016] [lua_runner] instructions used: 166 
2017-08-01 18:31:07.494502 [render] [140711856961016] [lua_runner] executing AsyncBrowserCommand(id=0, name='go', kwargs={'baseurl': None, 'http_method': 'GET', 'headers': None, 'body': None, 'url': 'http://boingboing.net/blog', 'errback': '<an errback>', 'callback': '<a callback>'}) 
2017-08-01 18:31:07.494576 [render] [140711856961016] HAR event: _onStarted 
2017-08-01 18:31:07.494697 [render] [140711856961016] callback 0 is connected to loadFinished 
2017-08-01 18:31:07.495031 [network] [140711856961016] GET http://boingboing.net/blog 
2017-08-01 18:31:07.495617 [pool] [140711856961016] SLOT 1 is working 
2017-08-01 18:31:07.495741 [pool] [140711856961016] queued 
2017-08-01 18:31:37.789845 [network-manager] timed out, aborting: http://boingboing.net/blog 
2017-08-01 18:31:37.790154 [network-manager] Finished downloading http://boingboing.net/blog 
2017-08-01 18:31:37.791064 [render] [140711856961016] mainFrame().urlChanged http://boingboing.net/blog 
2017-08-01 18:31:37.796078 [render] [140711856961016] mainFrame().initialLayoutCompleted 
2017-08-01 18:31:37.796343 [render] [140711856961016] loadFinished: RenderErrorInfo(type='Network', code=5, text='Operation canceled', url='http://boingboing.net/blog') 
2017-08-01 18:31:37.796420 [render] [140711856961016] loadFinished: disconnecting callback 0 
2017-08-01 18:31:37.796518 [render] [140711856961016] [lua_runner] dispatch cmd_id=0 
2017-08-01 18:31:37.796576 [render] [140711856961016] [lua_runner] arguments are for command 0, waiting for result of 0 
2017-08-01 18:31:37.796640 [render] [140711856961016] [lua_runner] entering dispatch/loop body, args=(PyResult('return', None, 'network5'),) 
2017-08-01 18:31:37.796699 [render] [140711856961016] [lua_runner] send PyResult('return', None, 'network5') 
2017-08-01 18:31:37.796765 [render] [140711856961016] [lua_runner] send (lua) (b'return', None, b'network5') 
2017-08-01 18:31:37.796883 [render] [140711856961016] [lua_runner] instructions used: 175 
2017-08-01 18:31:37.796943 [render] [140711856961016] [lua_runner] caught LuaError LuaError('[string "function main(splash, args)\\r..."]:5: network5',) 
2017-08-01 18:31:37.797093 [pool] [140711856961016] SLOT 1 finished with an error <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828>: [Failure instance: Traceback: <class 'splash.exceptions.ScriptError'>: {'error': 'network5', 'type': 'LUA_ERROR', 'source': '[string "function main(splash, args)\r..."]', 'message': 'Lua error: [string "function main(splash, args)\r..."]:5: network5', 'line_number': 5} 
    /app/splash/browser_tab.py:533:_on_content_ready 
    /app/splash/qtrender_lua.py:702:error 
    /app/splash/lua_runner.py:27:return_result 
    /app/splash/qtrender.py:17:stop_on_error_wrapper 
    --- <exception caught here> --- 
    /app/splash/qtrender.py:15:stop_on_error_wrapper 
    /app/splash/qtrender_lua.py:2257:dispatch 
    /app/splash/lua_runner.py:195:dispatch 
    ] 
2017-08-01 18:31:37.797158 [pool] [140711856961016] SLOT 1 is closing <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828> 
2017-08-01 18:31:37.797217 [render] [140711856961016] [splash] clearing 0 objects 
2017-08-01 18:31:37.797310 [render] [140711856961016] close is requested by a script 
2017-08-01 18:31:37.797430 [render] [140711856961016] cancelling 0 remaining timers 
2017-08-01 18:31:37.797491 [pool] [140711856961016] SLOT 1 done with <splash.qtrender_lua.LuaRender object at 0x7ffa083ff828> 
2017-08-01 18:31:37.798067 [events] {"user-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0", "error": {"error": 400, "info": {"error": "network5", "type": "LUA_ERROR", "source": "[string \"function main(splash, args)\r...\"]", "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network5", "line_number": 5}, "type": "ScriptError", "description": "Error happened while executing Lua script"}, "active": 0, "status_code": 400, "maxrss": 113372, "qsize": 0, "path": "/execute", "timestamp": 1501612297, "fds": 21, "args": {"render_all": false, "http_method": "GET", "png": 1, "url": "http://boingboing.net/blog", "wait": 0.5, "html": 1, "response_body": false, "har": 1, "load_args": {}, "lua_source": "function main(splash, args)\r\n --assert(splash:autoload(\"https://code.jquery.com/jquery-3.1.1.min.js\"))\r\n splash.resource_timeout = 30.0\r\n splash.images_enabled = false\r\n assert(splash:go(args.url))\r\n assert(splash:wait(0.5))\r\n return {\r\n html = splash:html(),\r\n --png = splash:png(),\r\n --har = splash:har(),\r\n }\r\nend", "resource_timeout": 0, "uid": 140711856961016, "save_args": [], "viewport": "1024x768", "timeout": 3600, "images": 1}, "client_ip": "172.17.0.1", "rendertime": 30.308406591415405, "method": "POST", "_id": 140711856961016, "load": [0.39, 0.42, 0.49]} 
2017-08-01 18:31:37.798190 [-] "172.17.0.1" - - [01/Aug/2017:18:31:37 +0000] "POST /execute HTTP/1.1" 400 309 "http://localhost:8050/info?wait=0.5&images=1&expand=1&timeout=3600.0&url=http%3A%2F%2Fboingboing.net%2Fblog&lua_source=function+main%28splash%2C+args%29%0D%0A++--assert%28splash%3Aautoload%28%22https%3A%2F%2Fcode.jquery.com%2Fjquery-3.1.1.min.js%22%29%29%0D%0A++splash.resource_timeout+%3D+30.0%0D%0A++splash.images_enabled+%3D+false%0D%0A++assert%28splash%3Ago%28args.url%29%29%0D%0A++assert%28splash%3Await%280.5%29%29%0D%0A++return+%7B%0D%0A++++html+%3D+splash%3Ahtml%28%29%2C%0D%0A++++--png+%3D+splash%3Apng%28%29%2C%0D%0A++++--har+%3D+splash%3Ahar%28%29%2C%0D%0A++%7D%0D%0Aend" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0" 
2017-08-01 18:31:37.798294 [pool] SLOT 1 is available 

If I additionally comment out the resource_timeout line, I get a network3 Lua error (again an invalid hostname, but presented differently this time).
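The error names returned by splash:go follow the pattern network<N>, where N looks like a Qt network error code: the logs above pair code 3 with "the remote host name was not found" and code 5 with "Operation canceled" after the 30-second resource_timeout expired. Below is a minimal Python sketch for decoding such names, assuming the listed subset of Qt's QNetworkReply::NetworkError values. The names describe_splash_error and QT_NETWORK_ERRORS are made up for illustration and are not part of Splash.

```python
import re

# Subset of Qt's QNetworkReply::NetworkError codes. Entries 3 and 5 match
# the messages that appear in the Splash logs quoted in this thread.
QT_NETWORK_ERRORS = {
    1: "connection refused",
    2: "remote host closed the connection",
    3: "remote host name was not found (invalid hostname)",
    4: "connection timed out",
    5: "operation canceled (for example, aborted after resource_timeout)",
    6: "SSL/TLS handshake failed",
}


def describe_splash_error(error):
    """Turn a Splash error name such as 'network5' into a readable hint."""
    match = re.fullmatch(r"network(\d+)", error)
    if match is None:
        # Not a 'network<N>' name (e.g. 'invalid_hostname'): return unchanged.
        return error
    code = int(match.group(1))
    return QT_NETWORK_ERRORS.get(code, "unknown Qt network error %d" % code)
```

Read this way, network3 and invalid_hostname describe the same failure (the name lookup failed), while network5 only says the request was aborted, which is consistent with a lookup that never completes within the timeout.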

Any idea what I'm doing wrong?

Answer

0

It turned out not to be a Scrapy/Splash problem; it was a Docker/IP route/network admin problem. The network admins have set things up so that I can only make HTTP requests via a specific destination; adding "--net=host" to my Docker startup command seems to have fixed this. This webpage was very helpful.
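One way to confirm this kind of diagnosis is to check name resolution from the same network environment Splash runs in. Below is a minimal Python sketch using only the standard library. The helper names can_resolve and check_hosts are hypothetical, and running it inside the container (which ships Python 3.5 according to the startup log) is an assumption about the setup rather than a documented procedure.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so
    the function can be exercised without real DNS.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Name lookup failed: the situation Splash reports as
        # 'invalid_hostname' / 'network3'.
        return False


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to whether it could be resolved."""
    return {name: can_resolve(name, resolver) for name in hostnames}
```

Calling check_hosts(["code.jquery.com", "boingboing.net"]) on the host and again inside the container would show whether only the container's network lacks working name resolution, which is the situation "--net=host" works around.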

0

Try changing

function main(splash, args) 
    ... 
    assert(splash:go(args.url)) 
    ... 

to

function main(splash) 
    ... 
    assert(splash:go(splash.args.url)) 
    ... 

At least, that is how it reads in the default script when I open Splash on port 8050. With this change, your script works for me.
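To try the script outside the web UI, it can be posted to Splash's /execute endpoint, the same path the UI hits in the logs above ("POST /execute"). Below is a minimal Python sketch that only builds the request. The helper name build_execute_request, the 90-second timeout, and the trimmed-down Lua script are assumptions for illustration; the field names lua_source, url, and timeout are taken from the "args" shown in the logged events.

```python
import json
import urllib.request

# Trimmed variant of the script from this thread, reading the URL from
# splash.args as suggested above.
LUA_SOURCE = """
function main(splash)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))
    return {html = splash:html()}
end
"""


def build_execute_request(splash_base, page_url, lua_source=LUA_SOURCE, timeout=90):
    """Build (but do not send) a JSON POST request for Splash's /execute."""
    payload = {"lua_source": lua_source, "url": page_url, "timeout": timeout}
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of build_execute_request("http://localhost:8050", "http://boingboing.net/blog") to urllib.request.urlopen would send it. If name resolution inside the container is still broken, the same Lua error should come back as an HTTP 400 response, as in the logs.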