Scrapy SitemapSpider only dupefiltering one item and finishing

I am running a scraper with a FilesPipeline that has downloaded 14,550 items so far. At some point, however, it seems to have gotten "stuck": the downloads stopped making progress. Since the scraper has a JOBDIR specified in its settings, I tried stopping and restarting it. Strangely, however, upon restarting it dupefilters one item and finishes (see the logs below). I have no idea why the spider is behaving this way; can someone point me in the right direction for debugging it?
scraper_1 | Tor appears to be working. Proceeding with command...
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: apkmirror_scraper)
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'apkmirror_scraper', 'NEWSPIDER_MODULE': 'apkmirror_scraper.spiders', 'SPIDER_MODULES': ['apkmirror_scraper.spiders']}
scraper_1 | 2017-06-02 11:38:20 [apkmirror_scraper.extensions] INFO: The crawler will scrape the following (randomized) number of items before changing identity: 32
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled extensions:
scraper_1 | ['scrapy.extensions.corestats.CoreStats',
scraper_1 | 'scrapy.extensions.telnet.TelnetConsole',
scraper_1 | 'scrapy.extensions.memusage.MemoryUsage',
scraper_1 | 'scrapy.extensions.closespider.CloseSpider',
scraper_1 | 'scrapy.extensions.feedexport.FeedExporter',
scraper_1 | 'scrapy.extensions.logstats.LogStats',
scraper_1 | 'scrapy.extensions.spiderstate.SpiderState',
scraper_1 | 'apkmirror_scraper.extensions.TorRenewIdentity']
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
scraper_1 | ['scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
scraper_1 | 'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
scraper_1 | 'apkmirror_scraper.downloadermiddlewares.TorRetryMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
scraper_1 | 'scrapy.downloadermiddlewares.stats.DownloaderStats']
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled spider middlewares:
scraper_1 | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.referer.RefererMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
scraper_1 | 'scrapy.spidermiddlewares.depth.DepthMiddleware']
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60>
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840>
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override.
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60>
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840>
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override.
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
scraper_1 | 2017-06-02 11:38:21 [scrapy.middleware] INFO: Enabled item pipelines:
scraper_1 | ['scrapy.pipelines.images.ImagesPipeline',
scraper_1 | 'scrapy.pipelines.files.FilesPipeline']
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider opened
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
scraper_1 | 2017-06-02 11:38:21 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.apkmirror.com/sitemap_index.xml> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-06-02 11:38:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'dupefilter/filtered': 1,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 946421),
scraper_1 | 'log_count/DEBUG': 26,
scraper_1 | 'log_count/INFO': 10,
scraper_1 | 'memusage/max': 73805824,
scraper_1 | 'memusage/startup': 73805824,
scraper_1 | 'start_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 890151)}
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider closed (finished)
apkmirrorscrapercompose_scraper_1 exited with code 0
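For context on why I suspect the restart behaviour is tied to persisted state (this is my assumption, not something I have confirmed): when a job directory is set via Scrapy's JOBDIR setting, RFPDupeFilter writes one request fingerprint per line to `<JOBDIR>/requests.seen` and reloads that file on restart. A minimal sketch of how the persisted fingerprints could be inspected (the `seen_fingerprints` helper and the simulated job directory are mine, not part of my scraper):

```python
import os
import tempfile

def seen_fingerprints(jobdir):
    """Return the set of fingerprints persisted by RFPDupeFilter,
    which stores one fingerprint per line in <jobdir>/requests.seen."""
    path = os.path.join(jobdir, 'requests.seen')
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# Simulate a job directory left over from a previous run:
jobdir = tempfile.mkdtemp()
with open(os.path.join(jobdir, 'requests.seen'), 'w') as f:
    f.write('http://www.apkmirror.com/sitemap_index.xml\n')

print('http://www.apkmirror.com/sitemap_index.xml' in seen_fingerprints(jobdir))  # prints True
```

Since my custom dupefilter uses the raw URL as the fingerprint (see below), the sitemap URL itself would appear verbatim in `requests.seen` if it was persisted during the first run, which would explain the immediate "Filtered duplicate request" on restart.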
Here are some details about the spider. It scrapes apkmirror.com with a SitemapSpider:
from scrapy.spiders import SitemapSpider

from apkmirror_scraper.spiders.base_spider import BaseSpider


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    name = 'apkmirror'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 0,
        'CLOSESPIDER_ERRORCOUNT': 1,
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 50,
        'FEED_URI': '/scraper/apkmirror_scraper/data/apkmirror.json',
        'FEED_FORMAT': 'json',
        'DUPEFILTER_CLASS': 'apkmirror_scraper.dupefilters.URLDupefilter',
    }

    download_timeout = 60 * 15.0  # Allow 15 minutes for downloading APKs
where I have overridden the dupefilter class as follows:
from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):

    def request_fingerprint(self, request):
        '''Simply use the URL as the fingerprint. (Scrapy's default is a hash of the request's canonicalized URL, method, body, and (optionally) headers.)'''
        return request.url
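As a first debugging step (which I have not tried yet), I am considering enabling Scrapy's DUPEFILTER_DEBUG setting, which the log message itself hints at ("see DUPEFILTER_DEBUG to show all duplicates"); it should make the dupefilter log every filtered request rather than only the first, showing exactly what gets dropped on restart. A hypothetical addition to the spider's custom_settings for a debug run:

```python
# Hypothetical debug tweak (not part of my current code): enable verbose
# dupefilter logging so every filtered request is reported on restart.
custom_settings = {
    'DUPEFILTER_DEBUG': True,  # log each filtered duplicate, not just the first
}
```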