No pages being crawled - scrapy
The following spider, developed with Scrapy, is meant to crawl pages from the americanas website:
# -*- coding: utf-8 -*-
import scrapy
import urllib
import re
import webscrap.items
import time
from urlparse import urljoin
from HTMLParser import HTMLParser

class AmericanasSpider(scrapy.Spider):
    name = "americanas"
    start_urls = ('http://www.americanas.com.br/loja/226795/alimentos-e-bebidas?WT.mc_id=home-menuLista-alimentos/',)
    source = webscrap.items.ImportSource("Americanas")

    def parse(self, response):
        ind = 0
        self.source.submit()
        b = []
        for c in response.xpath('//div[@class="item-menu"]/ul'):
            c1 = re.sub('[\t\n]', '', c.xpath('//span[@class="menu-heading"]/text()').extract()[ind])
            if (c1):
                x = webscrap.items.Category(c1)
                x.submit()
                for b in c.xpath('li'):
                    b1 = webscrap.items.Category(b.xpath('a/text()').extract()[0])
                    if (b1):
                        b1.setParent(x.getID())
                        b1.submit()
                        link = b.xpath('@href').extract()
                        urla = urljoin(response.url, link)
                        request = scrapy.Request(urla, callback=self.parse_category)
                        request.meta['idCategory'] = b1.getID()
                        yield request
                        for a in b.xpath('ul/li/a/text()'):
                            a1 = webscrap.items.Category(a.extract())
                            a1.setParent(b1.getID())
                            a1.submit()
                            link = a.xpath('@href').extract()
                            urla = urljoin(response.url, link)
                            request = scrapy.Request(urla, callback=self.parse_category)
                            request.meta['idCategory'] = a1.getID()
                            yield request
            ind = ind + 1

    def parse_category(self, response):
        # products on the page
        items = response.xpath('//div[@class="paginado"]//article[@class="single-product vitrine230 "]')
        for item in items:
            url = item.xpath('.//div[@itemprop="item"]/form/div[@class="productInfo"]/div]/a[@class="prodTitle"]/@href').extract()
            urla = urljoin(response.url, link)
            request = scrapy.Request(urla, callback=self.parse_product)
            request.meta['idCategory'] = response.meta['idCategory']
            yield request
        # next page (if there is one)
        nextpage = response.xpath('//div[@class="pagination"]/ul/li/a[@class="pure-button next"]/@href').extract()
        if (nextpage):
            link = nextpage[0]
            urlb = urljoin(response.url, link)
            self.log('Next Page: {0}'.format(nextpage))
            request = scrapy.Request(urlb, callback=self.parse_category)
            request.meta['idCategory'] = response.meta['idCategory']
            yield request

    def parse_product(self, response):
        print response.url
        title = response.xpath('//title/text()').extract()
        self.log(u'Título: {0}'.format(title))
but I get the following output:
PS C:\Users\Natalia Oliveira\Desktop\Be Happy\behappy\import\webscrap> scrapy crawl americanas
2016-10-06 17:28:04 [scrapy] INFO: Scrapy 1.1.2 started (bot: webscrap)
2016-10-06 17:28:04 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'webscrap.spiders', 'REDIRECT_ENABLED': False, 'SPIDER_MODULES': ['webscrap.spiders'], 'BOT_NAME': 'webscrap'}
2016-10-06 17:28:04 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-10-06 17:28:05 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-10-06 17:28:05 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-10-06 17:28:05 [scrapy] INFO: Enabled item pipelines:
[]
2016-10-06 17:28:05 [scrapy] INFO: Spider opened
2016-10-06 17:28:05 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-10-06 17:28:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-10-06 17:28:05 [scrapy] DEBUG: Crawled (200) <GET http://www.americanas.com.br/loja/226795/alimentos-e-bebidas?WT.mc_id=home-menuLista-alimentos/> (referer: None)
2016-10-06 17:28:07 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.americanas.com.br/loja/226795/alimentos-e-bebidas?WT.mc_id=home-menuLista-alimentos/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-10-06 17:28:07 [scrapy] DEBUG: Crawled (200) <GET http://www.americanas.com.br/loja/226795/alimentos-e-bebidas?WT.mc_id=home-menuLista-alimentos/> (referer: http://www.americanas.com.br/loja/226795/alimentos-e-bebidas?WT.mc_id=home-menuLista-alimentos/)
2016-10-06 17:28:22 [scrapy] INFO: Closing spider (finished)
2016-10-06 17:28:22 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 931,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 80585,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'dupefilter/filtered': 60,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 6, 20, 28, 22, 257000),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 10, 6, 20, 28, 5, 346000)}
2016-10-06 17:28:22 [scrapy] INFO: Spider closed (finished)
I really don't know what is wrong here, because I am a beginner with Scrapy. Where is the mistake? The parse method runs as expected, so I think the error must be in the parse_category or parse_product methods.
You definitely have a bug with `urla = urljoin(response.url, link)`: `link` is not defined at that point. I assume it should be `urla = urljoin(response.url, url)`. Also, what does `x = webscrap.items.Category(c1)` etc. do? –
@PadraicCunningham thank you very much. You were right, the `url` was a mistake, but it still returns the same. And about your question: it was just an attempt to store a category index –
You can use `print()` to inspect the values of your variables and see which one holds a wrong value; that way you can find the line of code causing the problem. – furas
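One detail consistent with the "Filtered duplicate request" lines in the log: `extract()` returns a list, and an XPath that matches nothing (e.g. `b.xpath('@href')` when the `href` sits on the `a` child, not the `li`) returns an empty list. `urljoin` returns the base URL unchanged for any falsy second argument, so every such request silently points back at `response.url` and gets dropped by the dupefilter. A minimal demonstration (shown with Python 3's `urllib.parse`; `urlparse.urljoin` in Python 2 behaves the same way):

```python
from urllib.parse import urljoin

base = "http://www.americanas.com.br/loja/226795/alimentos-e-bebidas"

# extract() yields a list; a non-matching xpath yields [].
# urljoin short-circuits on a falsy second argument and returns base,
# so the request targets the page we are already on.
assert urljoin(base, []) == base
assert urljoin(base, "") == base

# Taking the first match (or using extract_first()) gives a usable link.
links = ["/produto/12345"]
print(urljoin(base, links[0]))  # resolves the relative href against base
```

So passing `extract()[0]` (after checking the list is non-empty) instead of the whole list is likely part of the fix, alongside the undefined `link` variable already pointed out in the comments.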