Scrapy SitemapSpider only dupefiltering one item and finishing

I'm running a scraper with a FilesPipeline that has downloaded 14,550 items so far. At some point, however, it seemed to get "stuck": the downloads were reported as "lost". Since the scraper has a JOBDIR specified in its settings, I tried stopping and restarting it.

Strangely, however, upon restarting it dupefilters a single item and then finishes (see the logs below). I have no idea why the spider behaves this way; can anyone point me in the right direction to debug this?
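For context, crawl persistence is enabled through Scrapy's JOBDIR setting, roughly as follows (the directory path here is illustrative):

# settings.py -- crawl persistence (illustrative path).
# Scrapy keeps the pending-request queue and the dupefilter's
# requests.seen file in this directory, so a restarted crawl
# resumes instead of starting from scratch.
JOBDIR = 'crawls/apkmirror-1'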

scraper_1 | Tor appears to be working. Proceeding with command... 
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: apkmirror_scraper) 
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'apkmirror_scraper', 'NEWSPIDER_MODULE': 'apkmirror_scraper.spiders', 'SPIDER_MODULES': ['apkmirror_scraper.spiders']} 
scraper_1 | 2017-06-02 11:38:20 [apkmirror_scraper.extensions] INFO: The crawler will scrape the following (randomized) number of items before changing identity: 32 
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled extensions: 
scraper_1 | ['scrapy.extensions.corestats.CoreStats', 
scraper_1 | 'scrapy.extensions.telnet.TelnetConsole', 
scraper_1 | 'scrapy.extensions.memusage.MemoryUsage', 
scraper_1 | 'scrapy.extensions.closespider.CloseSpider', 
scraper_1 | 'scrapy.extensions.feedexport.FeedExporter', 
scraper_1 | 'scrapy.extensions.logstats.LogStats', 
scraper_1 | 'scrapy.extensions.spiderstate.SpiderState', 
scraper_1 | 'apkmirror_scraper.extensions.TorRenewIdentity'] 
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled downloader middlewares: 
scraper_1 | ['scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
scraper_1 | 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
scraper_1 | 'apkmirror_scraper.downloadermiddlewares.TorRetryMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
scraper_1 | 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled spider middlewares: 
scraper_1 | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
scraper_1 | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
scraper_1 | 'scrapy.spidermiddlewares.referer.RefererMiddleware', 
scraper_1 | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
scraper_1 | 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3 
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60> 
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840> 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override. 
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60) 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback. 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file 
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3 
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60> 
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840> 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override. 
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60) 
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback. 
scraper_1 | 2017-06-02 11:38:21 [scrapy.middleware] INFO: Enabled item pipelines: 
scraper_1 | ['scrapy.pipelines.images.ImagesPipeline', 
scraper_1 | 'scrapy.pipelines.files.FilesPipeline'] 
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider opened 
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
scraper_1 | 2017-06-02 11:38:21 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.apkmirror.com/sitemap_index.xml> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates) 
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Closing spider (finished) 
scraper_1 | 2017-06-02 11:38:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
scraper_1 | {'dupefilter/filtered': 1, 
scraper_1 | 'finish_reason': 'finished', 
scraper_1 | 'finish_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 946421), 
scraper_1 | 'log_count/DEBUG': 26, 
scraper_1 | 'log_count/INFO': 10, 
scraper_1 | 'memusage/max': 73805824, 
scraper_1 | 'memusage/startup': 73805824, 
scraper_1 | 'start_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 890151)} 
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider closed (finished) 
apkmirrorscrapercompose_scraper_1 exited with code 0 

Here are some details about the spider. It scrapes apkmirror.com using a SitemapSpider:

from scrapy.spiders import SitemapSpider 
from apkmirror_scraper.spiders.base_spider import BaseSpider 


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider): 
    name = 'apkmirror' 
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml'] 
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')] 

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 0,
        'CLOSESPIDER_ERRORCOUNT': 1,
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 50,
        'FEED_URI': '/scraper/apkmirror_scraper/data/apkmirror.json',
        'FEED_FORMAT': 'json',
        'DUPEFILTER_CLASS': 'apkmirror_scraper.dupefilters.URLDupefilter',
    }

    download_timeout = 60 * 15.0  # Allow 15 minutes for downloading APKs 

where I've overridden the dupefilter class as follows:

from scrapy.dupefilters import RFPDupeFilter 

class URLDupefilter(RFPDupeFilter): 

    def request_fingerprint(self, request):
        '''Simply use the URL as the fingerprint. (Scrapy's default is a
        hash of the request's canonicalized URL, method, body, and,
        optionally, headers.)'''
        return request.url
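To illustrate the difference, here is a quick comparison of the default fingerprint and the URL-based one (a self-contained sketch using scrapy.utils.request.request_fingerprint):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request('http://www.apkmirror.com/sitemap_index.xml')
print(request_fingerprint(req))  # default: SHA1 hex digest over method, canonical URL, and body
print(req.url)                   # URLDupefilter's fingerprint: the raw URL itself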

Answer


It looks like SitemapSpider's start_requests() does NOT set dont_filter=True, unlike the default Spider class.

When you restart your crawl, http://www.apkmirror.com/sitemap_index.xml is presumably already recorded as "seen" in your job directory, so the request gets filtered out.
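You can see the difference by comparing the two start_requests() implementations (paraphrased from Scrapy 1.4; details may differ in other versions):

# Spider.start_requests() -- the initial requests bypass the dupefilter:
for url in self.start_urls:
    yield Request(url, dont_filter=True)

# SitemapSpider.start_requests() -- no dont_filter, so a persisted
# dupefilter can drop the sitemap request on a resumed crawl:
for url in self.sitemap_urls:
    yield Request(url, self._parse_sitemap)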

You can override start_requests() in your ApkmirrorSitemapSpider to set dont_filter=True on the sitemap requests, as sketched below. You could also open an issue about this in Scrapy.
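A minimal sketch of such an override (it re-yields the sitemap requests with dont_filter=True and hands them to SitemapSpider's internal _parse_sitemap callback; written against Scrapy 1.4):

from scrapy.http import Request
from scrapy.spiders import SitemapSpider

from apkmirror_scraper.spiders.base_spider import BaseSpider


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    # ... name, sitemap_urls, sitemap_rules, custom_settings as above ...

    def start_requests(self):
        # Bypass the (persisted) dupefilter for the sitemap requests so
        # that a resumed crawl re-fetches the sitemap index.
        for url in self.sitemap_urls:
            yield Request(url, callback=self._parse_sitemap, dont_filter=True)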