
I want to scrape poem analyses with Scrapy, but the request callback count is less than the request count.

I get the analysis through the following steps:

  1. get the poem list url
  2. get the poem detail url
  3. get the poem analyze url

However, I found that the number of callbacks invoked is less than the number of requests: in my demo the request count is 10, but the callback count is only 8.

The log output is:

2016-10-26 16:15:54 [scrapy] INFO: Scrapy 1.2.0 started (bot: poem) 
2016-10-26 16:15:54 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'poem.spiders', 'SPIDER_MODULES': ['poem.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'poem'} 
2016-10-26 16:15:54 [scrapy] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2016-10-26 16:15:54 [scrapy] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2016-10-26 16:15:54 [scrapy] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2016-10-26 16:15:54 [scrapy] INFO: Enabled item pipelines: 
['poem.pipelines.PoemPipeline'] 
2016-10-26 16:15:54 [scrapy] INFO: Spider opened 
2016-10-26 16:15:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
poem list count : 10 
callback count : 1 
item count : 1 
callback count : 2 
item count : 2 
callback count : 3 
item count : 3 
callback count : 4 
item count : 4 
callback count : 5 
item count : 5 
callback count : 6 
item count : 6 
callback count : 7 
item count : 7 
callback count : 8 
item count : 8 
2016-10-26 16:15:55 [scrapy] INFO: Closing spider (finished) 
2016-10-26 16:15:55 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 5385, 
'downloader/request_count': 20, // (10 * 2) 
'downloader/request_method_count/GET': 20, 
'downloader/response_bytes': 139702, 
'downloader/response_count': 20, 
'downloader/response_status_count/200': 20, 
'dupefilter/filtered': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 26, 8, 15, 55, 416028), 
'item_scraped_count': 8, 
'log_count/INFO': 7, 
'request_depth_max': 2, 
'response_received_count': 20, 
'scheduler/dequeued': 19, 
'scheduler/dequeued/memory': 19, 
'scheduler/enqueued': 19, 
'scheduler/enqueued/memory': 19, 
'start_time': datetime.datetime(2016, 10, 26, 8, 15, 54, 887101)} 
2016-10-26 16:15:55 [scrapy] INFO: Spider closed (finished) 

IDE: PyCharm
DEBUG: Terminal in PyCharm
Code:

############## spider.py ############## 
import scrapy 
from poem.items import PoemItem 


class PoemSpider(scrapy.Spider): 

    name = 'poem' 

    analyze_count = 0 

    start_urls = ['http://so.gushiwen.org/type.aspx'] 

    def parse(self, response): 
        # 1. get poem list url 
        poems = response.xpath("//div[@class='typeleft']/div[@class='sons']") 
        for poem in poems: 
            # 2. get poem detail url 
            poem_href = poem.xpath("./p[1]/a/@href").extract_first() 
            poem_url = response.urljoin(poem_href) 
            yield scrapy.Request(poem_url, callback=self.parse_poem) 

    def parse_poem(self, response): 
        # 3. get analyze url 
        analyze_href = response.xpath("//u[text()='%s']/parent::*/@href"%(u'赏析')).extract_first() 
        analyze_url = response.urljoin(analyze_href) 
        yield scrapy.Request(analyze_url, callback=self.parse_poem_analyze) 

    def parse_poem_analyze(self, response): 
        # print analyze callback called count 
        print "#####################################" 
        PoemSpider.analyze_count = PoemSpider.analyze_count + 1 
        print PoemSpider.analyze_count 
        poem = PoemItem() 
        yield poem 

############## pipelines.py ############## 
class PoemPipeline(object): 

    processcount = 0 

    def process_item(self, item, spider): 
        # print item count 
        print ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>" 
        PoemPipeline.processcount = PoemPipeline.processcount + 1 
        print PoemPipeline.processcount 
        return item 
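
Note: to double-check these counts, I could also read them from Scrapy's stats collector when the spider closes, instead of keeping class-level counters. A rough sketch (the stat keys are the ones visible in the log above):

############## spider.py (optional stats check) ##############
import scrapy

class PoemSpider(scrapy.Spider):
    name = 'poem'
    # ... parse / parse_poem / parse_poem_analyze unchanged ...

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; the crawler stats
        # hold the same counters that appear in the final stats dump
        stats = self.crawler.stats
        requests = stats.get_value('downloader/request_count')
        items = stats.get_value('item_scraped_count')
        filtered = stats.get_value('dupefilter/filtered', 0)
        self.logger.info("requests=%s items=%s dupefilter_filtered=%s",
                         requests, items, filtered)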

Answer


Your log is missing stderr, but you can still see what is happening by looking at the stats output:
{'downloader/request_bytes': 5385, 
'downloader/request_count': 20, // (10 * 2) 
'downloader/request_method_count/GET': 20, 
'downloader/response_bytes': 139702, 
'downloader/response_count': 20, 
'downloader/response_status_count/200': 20, 
'dupefilter/filtered': 2, <--------------------------------------------- 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 10, 26, 8, 15, 55, 416028), 
'item_scraped_count': 8, <--------------------------------------------- 
'log_count/INFO': 7, 
'request_depth_max': 2, 
'response_received_count': 20, 
'scheduler/dequeued': 19, 
'scheduler/dequeued/memory': 19, 
'scheduler/enqueued': 19, 
'scheduler/enqueued/memory': 19, 
'start_time': datetime.datetime(2016, 10, 26, 8, 15, 54, 887101)} 

So two of the requests you make are filtered out by the dupefilter middleware, which ensures that every request you make is unique (usually this just means a unique URL).
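
If you want to see which requests the dupefilter is dropping, you can turn on its debug logging, or bypass the filter for a single request. Both are standard Scrapy options; shown here only as a debugging sketch:

# settings.py: log every request the dupefilter discards (off by default)
DUPEFILTER_DEBUG = True

# or, per request in the spider, skip the dupefilter entirely (use with care):
yield scrapy.Request(analyze_url, callback=self.parse_poem_analyze,
                     dont_filter=True)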

The actual problem, though, seems to be unsafe URL creation here:

analyze_href = response.xpath("//u[text()='%s']/parent::*/@href"%(u'赏析')).extract_first() 
analyze_url = response.urljoin(analyze_href) 

Since analyze_href can be empty, you can end up with analyze_url == response.url, and that request is filtered out because you have just crawled it.
To avoid this, check that analyze_href is not empty and only then turn it into a url:

# note: this uses the stdlib logging module, so spider.py also needs `import logging`
analyze_href = response.xpath("//u[text()='%s']/parent::*/@href"%(u'赏析')).extract_first() 
if not analyze_href: 
    logging.error("failed to find analyze_href for {}".format(response.url)) 
    return 
analyze_url = response.urljoin(analyze_href) 
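
To see concretely why the duplicate arises: urljoin() with an empty or missing href simply returns the base URL, so analyze_url ends up being the detail page that was just crawled. A minimal illustration with the stdlib urljoin (which Response.urljoin is based on; the URL below is made up):

try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

base = "http://so.gushiwen.org/view_123.aspx"   # hypothetical detail-page URL
print(urljoin(base, ""))     # -> http://so.gushiwen.org/view_123.aspx
print(urljoin(base, None))   # -> http://so.gushiwen.org/view_123.aspx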

You are right! Thank you very much! – TomatoPeter