Scrapy not working (noob level) - 0 pages crawled, 0 items crawled

2017-12-10

I tried to follow the Scrapy tutorial, but I got stuck and have no idea where the error is. The spider runs, but no pages are crawled and no items are scraped.

When I run the spider, I get the following output:

C:\Users\xxx\allegro>scrapy crawl AllegroPrices 
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: AllegroPrices) 
2017-12-10 22:25:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'allegro.spiders', 'SPIDER_MODULES': ['allegro.spiders'], 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'BOT_NAME': 'AllegroPrices'} 
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'allegro.middlewares.AllegroSpiderMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-12-10 22:25:15 [scrapy.middleware] INFO: Enabled item pipelines: 
['allegro.pipelines.AllegroPipeline'] 
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider opened 
2017-12-10 22:25:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-12-10 22:25:15 [AllegroPrices] INFO: Spider opened: AllegroPrices 
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-12-10 22:25:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 527000), 
'log_count/INFO': 8, 
'start_time': datetime.datetime(2017, 12, 10, 21, 25, 15, 517000)} 
2017-12-10 22:25:15 [scrapy.core.engine] INFO: Spider closed (finished) 

My spider file:

# -*- coding: utf-8 -*- 
import scrapy 
from allegro.items import AllegroItem 

class AllegroPrices(scrapy.Spider): 
    name = "AllegroPrices" 
    allowed_domains = ["allegro.pl"] 

#Use working product URL below 
start_urls = [
    "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
    "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
    "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
    "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
]

def parse(self, response): 
    items = AllegroItem() 
    title = response.xpath('//h1[@class="title"]//text()').extract() 
    sale_price = response.xpath('//div[@class="price"]//text()').extract() 
    seller = response.xpath('//div[@class="btn btn-default btn-user"]/span/text()').extract() 
    items['product_name'] = ''.join(title).strip() 
    items['product_sale_price'] = ''.join(sale_price).strip() 
    items['product_seller'] = ''.join(seller).strip() 
    yield items 

Settings:

# -*- coding: utf-8 -*- 

# Scrapy settings for allegro project 
# 
# For simplicity, this file contains only settings considered important or 
# commonly used. You can find more settings consulting the documentation: 
# 
#  http://doc.scrapy.org/en/latest/topics/settings.html 
#  http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 
#  http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 

BOT_NAME = 'AllegroPrices' 

SPIDER_MODULES = ['allegro.spiders'] 
NEWSPIDER_MODULE = 'allegro.spiders' 


# Crawl responsibly by identifying yourself (and your website) on the user-agent 
#USER_AGENT = 'allegro (+http://www.yourdomain.com)' 

# Obey robots.txt rules 
ROBOTSTXT_OBEY = True 

# Configure maximum concurrent requests performed by Scrapy (default: 16) 
#CONCURRENT_REQUESTS = 32 

# Configure a delay for requests for the same website (default: 0) 
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 
# See also autothrottle settings and docs 
#DOWNLOAD_DELAY = 3 
# The download delay setting will honor only one of: 
#CONCURRENT_REQUESTS_PER_DOMAIN = 16 
#CONCURRENT_REQUESTS_PER_IP = 16 

# Disable cookies (enabled by default) 
#COOKIES_ENABLED = False 

# Disable Telnet Console (enabled by default) 
#TELNETCONSOLE_ENABLED = False 

# Override the default request headers: 
#DEFAULT_REQUEST_HEADERS = { 
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
# 'Accept-Language': 'en', 
#} 

# Enable or disable spider middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 
SPIDER_MIDDLEWARES = { 
    'allegro.middlewares.AllegroSpiderMiddleware': 543, 
} 

LOG_LEVEL = 'INFO' 

# Enable or disable downloader middlewares 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 
#DOWNLOADER_MIDDLEWARES = { 
# 'allegro.middlewares.MyCustomDownloaderMiddleware': 543, 
#} 

# Enable or disable extensions 
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 
#EXTENSIONS = { 
# 'scrapy.extensions.telnet.TelnetConsole': None, 
#} 

# Configure item pipelines 
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 
ITEM_PIPELINES = { 
    'allegro.pipelines.AllegroPipeline': 300, 
} 

# Enable and configure the AutoThrottle extension (disabled by default) 
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html 
#AUTOTHROTTLE_ENABLED = True 
# The initial download delay 
#AUTOTHROTTLE_START_DELAY = 5 
# The maximum download delay to be set in case of high latencies 
#AUTOTHROTTLE_MAX_DELAY = 60 
# The average number of requests Scrapy should be sending in parallel to 
# each remote server 
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 
# Enable showing throttling stats for every response received: 
#AUTOTHROTTLE_DEBUG = False 

# Enable and configure HTTP caching (disabled by default) 
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 
#HTTPCACHE_ENABLED = True 
#HTTPCACHE_EXPIRATION_SECS = 0 
#HTTPCACHE_DIR = 'httpcache' 
#HTTPCACHE_IGNORE_HTTP_CODES = [] 
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 

Pipeline:

# -*- coding: utf-8 -*- 

# Define your item pipelines here 
# 
# Don't forget to add your pipeline to the ITEM_PIPELINES setting 
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 


class AllegroPipeline(object): 
    def process_item(self, item, spider): 
        return item

Items:

# -*- coding: utf-8 -*- 

# Define here the models for your scraped items 
# 
# See documentation in: 
# http://doc.scrapy.org/en/latest/topics/items.html 

import scrapy 

class AllegroItem(scrapy.Item): 
    # define the fields for your item here like: 
    product_name = scrapy.Field() 
    product_sale_price = scrapy.Field() 
    product_seller = scrapy.Field() 

I have no problem running it as a standalone script, without creating a project. – furas


I think you have wrong indentation - 'start_urls' and 'parse()' need to be inside the class 'AllegroPrices'. Right now they are not. Indentation is very important in Python. – furas
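
For illustration, a minimal sketch of the layout this comment describes (one sample URL taken from the question; the parse() body is only a placeholder):

import scrapy

class AllegroPrices(scrapy.Spider):
    # everything below is indented one level, so it belongs to the class
    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]
    start_urls = ["http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html"]

    def parse(self, response):
        # placeholder callback; the real one is shown in the answer below
        yield {"url": response.url}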

Answer


I have no problem running it as a standalone script, without creating a project, and saving the results to a CSV file.

And I don't have to change the USER-AGENT.

Maybe there is a problem with some of your settings. You didn't post the URL of the tutorial, so I can't check it.

Or you simply have wrong indentation and start_urls and parse() are not inside the class. Indentation is very important in Python.

BTW: you forgot the /a/ in the xpath for the seller.
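
The two selectors side by side (both lines appear in this thread; only the /a/ step differs):

# question - misses the <a> element between the <div> and the <span>:
seller = response.xpath('//div[@class="btn btn-default btn-user"]/span/text()').extract()
# answer - with the /a/ step:
seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()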

import scrapy

#class AllegroItem(scrapy.Item):
#    product_name = scrapy.Field()
#    product_sale_price = scrapy.Field()
#    product_seller = scrapy.Field()

class AllegroPrices(scrapy.Spider):

    name = "AllegroPrices"
    allowed_domains = ["allegro.pl"]

    start_urls = [
        "http://allegro.pl/diablo-ii-lord-of-destruction-2-pc-big-box-eng-i6896736152.html",
        "http://allegro.pl/diablo-ii-2-pc-dvd-box-eng-i6961686788.html",
        "http://allegro.pl/star-wars-empire-at-war-2006-dvd-box-i6995651106.html",
        "http://allegro.pl/heavy-gear-ii-2-pc-eng-cdkingpl-i7059163114.html"
    ]

    def parse(self, response):
        title = response.xpath('//h1[@class="title"]//text()').extract()
        sale_price = response.xpath('//div[@class="price"]//text()').extract()
        seller = response.xpath('//div[@class="btn btn-default btn-user"]/a/span/text()').extract()

        title = title[0].strip()  # keep only the first text node and trim whitespace

        print(title, sale_price, seller)

        yield {'title': title, 'price': sale_price, 'seller': seller}

        #items = AllegroItem()
        #items['product_name'] = ''.join(title).strip()
        #items['product_sale_price'] = ''.join(sale_price).strip()
        #items['product_seller'] = ''.join(seller).strip()
        #yield items

# --- run it as a standalone script, without a project, and save to CSV ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess()

c = CrawlerProcess({
    # 'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv'
})

c.crawl(AllegroPrices)
c.start()
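
Saved as, say, allegro_prices.py (the filename is arbitrary), the script runs with a plain "python allegro_prices.py" - no scrapy crawl command and no project scaffolding needed - and CrawlerProcess writes output.csv into the current working directory.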

Result in the CSV file:

title,price,seller 
STAR WARS: EMPIRE AT WAR [2006] DVD BOX,"24,90 zł",CDkingpl 
DIABLO II: LORD OF DESTRUCTION 2 PC BIG BOX ENG,"149,00 zł",CDkingpl 
HEAVY GEAR II 2 | PC ENG CDkingpl,"19,90 zł",CDkingpl 
DIABLO II 2 | PC DVD BOX | ENG,"24,90 zł",CDkingpl 

This is the URL to the tutorial: http://blog.datahut.co/tutorial-how-to-scrape-amazon-using-python-scrapy/ – Bodhistawa
