
Crawler Spider: Spider error processing raises NotImplementedError

I have been trying to get my head around Scrapy, but I have not had much luck getting past the basics. When I run my spider I get a "Spider error processing" message for the page and a NotImplementedError spider exception, yet scrapy fetch prints the HTML response, so the site is not unreachable. The output is included below, along with my item, spider, and settings values.

Items.py

class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()

mycrawler.py

import scrapy 
from scrapy.spiders import Rule 
from bs4 import BeautifulSoup 
from scrapy.linkextractors import LinkExtractor 
from librarycrawler.items import LibrarycrawlerItem 
class CrawlSpider(scrapy.Spider): 
    name = "mycrawler" 
    allowed_domains = ["example.com"] 
    start_urls = [ 
     "http://www.example.com" 
    ] 
    #LinkExtractor(), 
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True)
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        ScrapedPageTitle = page_soup.title.get_text()
        item = LibrarycrawlerItem()
        item['title'] = ScrapedPageTitle
        item['file_urls'] = response.url

        yield item

Settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}

FILES_STORE = 'C:\MySpider\mycrawler\ExtractedText' 

Terminal output

(scrapy) C:\MySpider\mycrawler>scrapy crawl mycrawler -o mycrawler.csv 
2016-06-03 16:11:47 [scrapy] INFO: Scrapy 1.0.3 started (bot: mycrawler) 
2016-06-03 16:11:47 [scrapy] INFO: Optional features available: ssl, http11, boto 
2016-06-03 16:11:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mycrawler.spiders', 'FEED_URI': 'mycrawler.csv', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['mycrawler.spiders'], 'BOT_NAME': 'mycrawler', 'USER_AGENT': 'mycrawler(+http://www.example.com)', 'FEED_FORMAT': 'csv'} 
2016-06-03 16:11:48 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-06-03 16:11:48 [boto] DEBUG: Retrieving credentials from metadata server. 
2016-06-03 16:11:49 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "C:\Anaconda3\envs\scrapy\lib\site-packages\boto\utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 431, in open 
    response = self._open(req, data) 
    File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 449, in _open 
    '_open', req) 
    File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 409, in _call_chain 
    result = func(*args) 
    File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1227, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1197, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2016-06-03 16:11:49 [boto] ERROR: Unable to read instance data, giving up 
2016-06-03 16:11:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-06-03 16:11:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-06-03 16:11:49 [scrapy] INFO: Enabled item pipelines: FilesPipeline 
2016-06-03 16:11:49 [scrapy] INFO: Spider opened 
2016-06-03 16:11:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-06-03 16:11:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-06-03 16:11:49 [scrapy] DEBUG: Redirecting (meta refresh) to <GET http://myexample.com> from <GET http://myexample.com> 
2016-06-03 16:11:50 [scrapy] DEBUG: Crawled (200) <GET http://myexample.com> (referer: None) 
2016-06-03 16:11:50 [scrapy] ERROR: Spider error processing <GET http://www.example.com> (referer: None) 
Traceback (most recent call last): 
    File "C:\Anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "C:\Anaconda3\envs\scrapy\lib\site-packages\scrapy\spiders\__init__.py", line 76, in parse 
    raise NotImplementedError 
NotImplementedError 
2016-06-03 16:11:50 [scrapy] INFO: Closing spider (finished) 
2016-06-03 16:11:50 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 449, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 23526, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 6, 3, 15, 11, 50, 227000), 
'log_count/DEBUG': 4, 
'log_count/ERROR': 3, 
'log_count/INFO': 7, 
'response_received_count': 1, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'spider_exceptions/NotImplementedError': 1, 
'start_time': datetime.datetime(2016, 6, 3, 15, 11, 49, 722000)} 
2016-06-03 16:11:50 [scrapy] INFO: Spider closed (finished) 

Answer


You need to subclass Scrapy's CrawlSpider if you want this functionality, for example something like this:

from scrapy.item import Field, Item 
from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import Rule 
from scrapy.spiders.crawl import CrawlSpider 


class LibrarycrawlerItem(Item): 
    title = Field() 
    file_urls = Field() 


class MyCrawlSpider(CrawlSpider): 
    name = 'sample' 
    allowed_domains = ['example.com', 'iana.org'] 
    start_urls = ['http://www.example.com'] 
    rules = (
        Rule(LinkExtractor(), callback='scrape_page'),
    )

    def scrape_page(self, response):
        item = LibrarycrawlerItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['file_urls'] = response.url

        yield item
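
Note that the callback is deliberately not called parse: CrawlSpider implements its own parse method to drive the rules, and a plain scrapy.Spider that never overrides parse is exactly what raises the NotImplementedError shown in your log.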

For a better understanding of how the rules work, please refer to the documentation. By the way, you can also use the LinkExtractor inside your parse method without subclassing CrawlSpider, roughly as sketched below.
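
Here is a minimal sketch of that second option: a plain scrapy.Spider that runs a LinkExtractor inside parse and schedules the extracted links itself. The spider, item, and field names below are illustrative rather than taken from the question's project:

import scrapy
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.linkextractors import LinkExtractor


class PageItem(Item):
    title = Field()
    file_urls = Field()


class BasicLinkSpider(scrapy.Spider):
    name = 'basic_sample'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Emit an item for the page itself; the FilesPipeline expects
        # file_urls to be a list of URLs, not a single string.
        item = PageItem()
        item['title'] = response.xpath('//title/text()').extract_first()
        item['file_urls'] = [response.url]
        yield item

        # Extract links by hand and schedule each one back to this callback;
        # OffsiteMiddleware drops any request outside allowed_domains.
        for link in LinkExtractor().extract_links(response):
            yield Request(link.url, callback=self.parse)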
