
I'm having trouble getting my spider to work in Scrapy v1.0.5: my CrawlSpider does not follow its rules.

import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# NewsItem and BeautifulSoupItemLoader are project-specific classes
# (their imports were not shown in the original snippet).


class MaddynessSpider(CrawlSpider):
    name = "maddyness"
    allowed_domains = ["www.maddyness.com"]

    start_urls = [
        'http://www.maddyness.com/finance/levee-de-fonds/'
    ]

    _extract_article_links = Rule(
        LinkExtractor(
            allow=(
                r'http://www\.maddyness\.com/finance/levee-de-fonds/',
            ),
            restrict_xpaths=('//article[starts-with(@class,"post")]',),
        ),
        callback='parse_article',
    )

    _extract_pagination_links = Rule(
        LinkExtractor(
            allow=(
                r'http://www\.maddyness\.com/finance/levee-de-fonds/',
                r'http://www\.maddyness\.com/page/',
            ),
            restrict_xpaths=('//div[@class="pagination-wrapper"]',),
        )
    )

    rules = (
        _extract_article_links,
        _extract_pagination_links,
    )

    def _extract_date(self, url):
        match = re.match(r'\S+/\S+/\S+/(\S+/\S+/\S+)/\S+/', url)
        return match.group(1) if match else None

    def _extract_slug(self, url):
        match = re.match(r'\S+/\S+/\S+/\S+/\S+/\S+/(\S+)/', url)
        return match.group(1) if match else None

    def parse_article(self, response):
        """Callback invoked for each article page that is scraped."""
        print("la")  # debug marker to confirm the callback fires
        article = NewsItem()
        loader = BeautifulSoupItemLoader(item=article, response=response, from_encoding='cp1252')

        #loader.add_xpath('company_name', u'//meta[@property="article:tag"]/@content')

        return loader.load_item()
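
As an aside, the two URL helpers assume article URLs of the form scheme://host/category/subcategory/YYYY/MM/DD/slug/. Here is a minimal sketch of what they would capture; the article URL below is made up for illustration:

url = 'http://www.maddyness.com/finance/levee-de-fonds/2016/04/28/some-article/'

spider = MaddynessSpider()
print(spider._extract_date(url))   # -> '2016/04/28'
print(spider._extract_slug(url))   # -> 'some-article'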

I never reach my callback function parse_article, and the output shows me this:

[Anaconda2] C:\dev\hubble\workspaces\python\batch\scripts\Crawler>scrapy crawl maddyness 

2016-04-28 17:00:03 [scrapy] INFO: Scrapy 1.0.5 started (bot: Crawler) 
2016-04-28 17:00:03 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-04-28 17:00:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Crawler.spiders', 'SPIDER_MODULES': ['Crawler.spiders'], 'BOT_NAME': 'Crawler'} 
2016-04-28 17:00:04 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-04-28 17:00:04 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-04-28 17:00:04 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-04-28 17:00:04 [scrapy] INFO: Enabled item pipelines: ElasticsearchPipeline 

2016-04-28 17:00:04 [scrapy] INFO: Spider opened 
2016-04-28 17:00:04 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-04-28 17:00:04 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-04-28 17:00:04 [scrapy] DEBUG: Redirecting (301) to <GET https://www.maddyness.com/finance/levee-de-fonds/> from <GET http://www.maddyness.com/finance/levee-de-fonds/> 
2016-04-28 17:00:04 [scrapy] DEBUG: Redirecting (301) to <GET https://www.maddyness.com/index.php?s=%23MaddyPitch> from <GET http://www.maddyness.com/index.php?s=%23MaddyPitch> 
2016-04-28 17:00:04 [scrapy] DEBUG: Crawled (200) <GET https://www.maddyness.com/index.php?s=%23MaddyPitch> (referer: None) 
2016-04-28 17:00:04 [scrapy] DEBUG: Crawled (200) <GET https://www.maddyness.com/finance/levee-de-fonds/> (referer: None) 
2016-04-28 17:00:05 [scrapy] INFO: Closing spider (finished) 
2016-04-28 17:00:05 [scrapy] INFO: Dumping Scrapy stats: {  
'downloader/request_bytes': 1080, 
'downloader/request_count': 4, 
'downloader/request_method_count/GET': 4, 
'downloader/response_bytes': 48223, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 2, 
'downloader/response_status_count/301': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 4, 28, 15, 0, 5, 123000), 
'log_count/DEBUG': 5, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 4, 
'scheduler/dequeued/memory': 4, 
'scheduler/enqueued': 4, 
'scheduler/enqueued/memory': 4, 
'start_time': datetime.datetime(2016, 4, 28, 15, 0, 4, 590000)} 
2016-04-28 17:00:05 [scrapy] INFO: Spider closed (finished) 

Thank you in advance for your help, I'm completely stuck.

Answer


It's simply that you were redirected from http to https, so all subsequent article links now start with https, while your rules are configured to extract http links only. Fix that:

_extract_article_links = Rule(
    LinkExtractor(
        allow=(
            r'https?://www\.maddyness\.com/finance/levee-de-fonds/',
        ),
        restrict_xpaths=('//article[starts-with(@class,"post")]',),
    ),
    callback='parse_article',
)

_extract_pagination_links = Rule(
    LinkExtractor(
        allow=(
            r'https?://www\.maddyness\.com/finance/levee-de-fonds/',
            r'https?://www\.maddyness\.com/page/',
        ),
        restrict_xpaths=('//div[@class="pagination-wrapper"]',),
    )
)

The s? here matches the s zero or one time, so the pattern works for both http and https.
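
A quick check in a Python shell illustrates this (the URLs below are just examples):

>>> import re
>>> pattern = r'https?://www\.maddyness\.com/finance/levee-de-fonds/'
>>> bool(re.match(pattern, 'http://www.maddyness.com/finance/levee-de-fonds/'))
True
>>> bool(re.match(pattern, 'https://www.maddyness.com/finance/levee-de-fonds/'))
True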


Indeed! Thank you very much! –
