Scrapy extracts no data, CSS selectors are correct

This is my first scraper and I am running into some problems. To start with, I built my CSS selectors and they work when used in Scrapy shell. But when I run my spider, all I get is this:

2017-10-26 14:48:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: digikey) 
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'digikey', 'CONCURRENT_REQUESTS': 1, 'NEWSPIDER_MODULE': 'digikey.spiders', 'SPIDER_MODULES': ['digikey.spiders'], 'USER_AGENT': 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'} 
2017-10-26 14:48:49 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.corestats.CoreStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.logstats.LogStats'] 
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-10-26 14:48:50 [scrapy.core.engine] INFO: Spider opened 
2017-10-26 14:48:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-10-26 14:48:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-10-26 14:48:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1> (referer: None) 
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-10-26 14:48:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 329, 
'downloader/request_count': 1, 
'downloader/request_method_count/GET': 1, 
'downloader/response_bytes': 104631, 
'downloader/response_count': 1, 
'downloader/response_status_count/200': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 10, 26, 21, 48, 52, 235020), 
'log_count/DEBUG': 2, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 1, 
'scheduler/dequeued/memory': 1, 
'scheduler/enqueued': 1, 
'scheduler/enqueued/memory': 1, 
'start_time': datetime.datetime(2017, 10, 26, 21, 48, 50, 249076)} 
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Spider closed (finished) 
PS C:\Users\dalla_000\digikey> 

My spider looks like this:

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 
from digikey.items import DigikeyItem 
from scrapy.selector import Selector 

class DigikeySpider(CrawlSpider): 
    name = 'digikey' 
    allowed_domains = ['digikey.com'] 
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1'] 

rules = (
    # Extract links matching 'category.php' (but not matching 'subsection.php') 
    # and follow links from them (since no callback means follow=True by default). 
    Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1',), deny=('subsection\.php',))), 
) 
def parse_item(self, response): 
    for row in response.css('table#productTable tbody tr'): 
     item = DigikeyItem() 
     item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first() 
     item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first() 
     item['description'] = row.css('.tr-description::text').extract_first() 
     item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first() 
     item['price'] = row.css('.tr-unitPrice::text').extract_first() 
     item['minimumquanity'] = row.css('.tr-minQty::text').extract_first() 
     yield item 

     parse_start_url = parse_item 

The settings.py looks like this:

BOT_NAME = 'digikey' 

SPIDER_MODULES = ['digikey.spiders'] 
NEWSPIDER_MODULE = 'digikey.spiders' 


# Crawl responsibly by identifying yourself (and your website) on the user-agent 
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")' 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = False 
And the items.py looks like this:

import scrapy 


class DigikeyItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    partnumber = scrapy.Field() 
    manufacturer = scrapy.Field() 
    description = scrapy.Field() 
    quanity = scrapy.Field() 
    minimumquanity = scrapy.Field() 
    price = scrapy.Field() 
    pass 

I am struggling to understand why no data is being extracted even though the CSS selectors work. Also, the spider just finishes the job and closes. I have restricted the spider to crawl only one page; once it works correctly I will open it up to the whole website.
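
For reference, the selectors were checked with a Scrapy shell session along these lines (a sketch rather than a verbatim transcript; the exact output depends on what the server returns):

scrapy shell "https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1"
>>> len(response.css('table#productTable tbody tr'))   # non-zero in the shell, one entry per product row
>>> row = response.css('table#productTable tbody tr')[0]
>>> row.css('.tr-unitPrice::text').extract_first()     # returns a price string in the shell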

Answer

This works for me (note the change in indentation):

import scrapy 
from scrapy.spiders import CrawlSpider, Rule 
from scrapy.linkextractors import LinkExtractor 

from scrapy.selector import Selector 

class DigikeyItem(scrapy.Item): 
    # define the fields for your item here like: 
    # name = scrapy.Field() 
    partnumber = scrapy.Field() 
    manufacturer = scrapy.Field() 
    description = scrapy.Field() 
    quanity = scrapy.Field() 
    minimumquanity = scrapy.Field() 
    price = scrapy.Field() 

class DigikeySpider(CrawlSpider): 
    name = 'digikey' 
    allowed_domains = ['digikey.com'] 
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1'] 

    rules = (
        # Match the one listing page from start_urls and hand it to parse_item.
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3',)), callback='parse_item'),
    )

    def parse_item(self, response):
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item 

    parse_start_url = parse_item 
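
To quickly check whether any items actually come out, the spider can be run with a feed export (standard Scrapy CLI; the output file name is arbitrary):

scrapy crawl digikey -o items.json

If the indentation is right, the final stats dump should then include an item_scraped_count entry.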

But if you want to test the rule, change your start_urls to

start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58'] 

and remove the parse_start_url = parse_item line.
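
Put together, the rule-testing variant of the class body might look like this (a sketch; the \d+ pattern is an assumption about how the paginated listing URLs are formed):

    rules = (
        # Follow every paginated listing page of category 58 and parse each one.
        Rule(LinkExtractor(allow=(r'/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/\d+',)),
             callback='parse_item'),
    )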

Answer

I don't think you need to use CrawlSpider here, because that spider type was created with the intent of navigating a website such as a forum or blog, following links to the various category, post, or actual item pages (that is what rules are for).

Your rules try to follow different URLs, then visit those specific URLs that match your rule, and then call the specified method with the response for each of those URLs.

But in your case you want to visit one specific URL given in start_urls, and CrawlSpider doesn't work that way. You should just use the plain Spider implementation to get the response for the start_urls entries.

import scrapy

class DigikeySpider(scrapy.Spider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    def parse(self, response):
        ...
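
Filled in with the item definition and selectors from the question, the complete plain-Spider version would look roughly like this (a sketch; it assumes DigikeyItem is importable from digikey.items as in the question, and that the product table is present in the raw HTML):

import scrapy

from digikey.items import DigikeyItem


class DigikeySpider(scrapy.Spider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    def parse(self, response):
        # One item per row of the product listing table.
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item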

Hi, once the spider works correctly, the rules will follow everything under /products/en. I changed the rule to avoid crawling the whole thing every time I try to test it. – Dallas
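
For the record, the wide-open rule that this comment alludes to would presumably look something like this (hypothetical; follow=True keeps the spider traversing every matched listing page under /products/en):

    rules = (
        Rule(LinkExtractor(allow=(r'/products/en/',)), callback='parse_item', follow=True),
    )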