Scrapy extracts no data, CSS selectors are correct

This is my first scraper and I am having some problems. To start with, I built my CSS selectors, and they work when I use them in Scrapy shell. When I run my spider, however, it just outputs this:
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: digikey)
2017-10-26 14:48:49 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'digikey', 'CONCURRENT_REQUESTS': 1, 'NEWSPIDER_MODULE': 'digikey.spiders', 'SPIDER_MODULES': ['digikey.spiders'], 'USER_AGENT': 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'}
2017-10-26 14:48:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-26 14:48:50 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-26 14:48:50 [scrapy.core.engine] INFO: Spider opened
2017-10-26 14:48:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-26 14:48:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-26 14:48:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1> (referer: None)
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-26 14:48:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 329,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 104631,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 26, 21, 48, 52, 235020),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 10, 26, 21, 48, 50, 249076)}
2017-10-26 14:48:52 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\dalla_000\digikey>
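For reference, this is roughly how I verified the selectors in Scrapy shell (a sketch of my session; output omitted here, and the row selector is the same one the spider uses below):

scrapy shell "https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1"
>>> # same row selector the spider uses
>>> len(response.css('table#productTable tbody tr'))
>>> response.css('table#productTable tbody tr .tr-mfgPartNumber [itemprop="name"]::text').extract_first()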
My spider looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from digikey.items import DigikeyItem
from scrapy.selector import Selector


class DigikeySpider(CrawlSpider):
    name = 'digikey'
    allowed_domains = ['digikey.com']
    start_urls = ['https://www.digikey.com/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3?stock=1',), deny=('subsection\.php',))),
    )

    def parse_item(self, response):
        for row in response.css('table#productTable tbody tr'):
            item = DigikeyItem()
            item['partnumber'] = row.css('.tr-mfgPartNumber [itemprop="name"]::text').extract_first()
            item['manufacturer'] = row.css('[itemprop="manufacture"] [itemprop="name"]::text').extract_first()
            item['description'] = row.css('.tr-description::text').extract_first()
            item['quanity'] = row.css('.tr-qtyAvailable::text').extract_first()
            item['price'] = row.css('.tr-unitPrice::text').extract_first()
            item['minimumquanity'] = row.css('.tr-minQty::text').extract_first()
            yield item

    parse_start_url = parse_item
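One detail I am unsure about: the Rule above has no callback, which is why I aliased parse_start_url to parse_item. A variant I have been meaning to try (just a sketch, not tested) sets the callback explicitly and escapes the ? in the allow pattern, since LinkExtractor treats the pattern as a regex:

    rules = (
        Rule(
            LinkExtractor(
                # escape '?' so the pattern matches the literal URL rather than
                # making the preceding character optional
                allow=(r'/products/en/capacitors/aluminum-electrolytic-capacitors/58/page/3\?stock=1',),
                deny=(r'subsection\.php',),
            ),
            callback='parse_item',  # without a callback, matched pages are only followed, not parsed
        ),
    )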
The settings.py looks like this:
BOT_NAME = 'digikey'
SPIDER_MODULES = ['digikey.spiders']
NEWSPIDER_MODULE = 'digikey.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'digikey ("Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02")'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
And the items.py:
import scrapy


class DigikeyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    partnumber = scrapy.Field()
    manufacturer = scrapy.Field()
    description = scrapy.Field()
    quanity = scrapy.Field()
    minimumquanity = scrapy.Field()
    price = scrapy.Field()
    pass
I am struggling to understand why no data is being extracted even though the CSS selectors work. On top of that, the spider just finishes the job and closes. I am restricting the spider to crawl only one page; once it works correctly, I will open it up to the whole site.
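Right now I limit the crawl simply by using a very narrow allow pattern. An alternative I considered for testing (a sketch, assuming the stock CloseSpider extension, which is enabled by default) is to cap the number of crawled pages via custom_settings:

class DigikeySpider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules as above ...
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 1,  # stop the spider after one crawled response while testing
    }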
Hello, once the spider works correctly, the rules are meant to follow everything under /products/en. I changed the Rule to avoid running the whole thing every time I try to test it. – Dallas
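(For context, the broadened Rule I plan to switch back to once this works would look roughly like the sketch below; the exact allow pattern is an assumption on my part:)

    rules = (
        Rule(
            LinkExtractor(allow=(r'/products/en/',)),  # follow every English product listing page
            callback='parse_item',
            follow=True,
        ),
    )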