I've been trying to get my head around Scrapy, but I'm not having much luck getting past the basics. When I run my spider I get "Spider error processing" for the page, along with a spider NotImplementedError exception. However, when I use scrapy fetch the HTML response is printed, so the site is not unavailable. The output is included below, together with my item, spider, and settings values. CrawlSpider: "Spider error processing" raises NotImplementedError
Items.py
import scrapy

class MycrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    files = scrapy.Field()
    file_urls = scrapy.Field()
mycrawler.py
import scrapy
from scrapy.spiders import Rule
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from mycrawler.items import MycrawlerItem

class CrawlSpider(scrapy.Spider):
    name = "mycrawler"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com"
    ]
    # LinkExtractor(),
    rules = (
        Rule(LinkExtractor(), callback='scrape_page', follow=True),
    )

    def scrape_page(self, response):
        page_soup = BeautifulSoup(response.body, "html.parser")
        scraped_page_title = page_soup.title.get_text()
        item = MycrawlerItem()
        item['title'] = scraped_page_title
        item['file_urls'] = [response.url]
        yield item
Settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300,
}
FILES_STORE = r'C:\MySpider\mycrawler\ExtractedText'
Terminal output
[scrapy] C:\MySpider\mycrawler>scrapy crawl mycrawler -o mycrawler.csv
2016-06-03 16:11:47 [scrapy] INFO: Scrapy 1.0.3 started (bot: mycrawler)
2016-06-03 16:11:47 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-06-03 16:11:47 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mycrawler.spiders', 'FEED_URI': 'mycrawler.csv', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['mycrawler.spiders'], 'BOT_NAME': 'mycrawler', 'USER_AGENT': 'mycrawler(+http://www.example.com)', 'FEED_FORMAT': 'csv'}
2016-06-03 16:11:48 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-06-03 16:11:48 [boto] DEBUG: Retrieving credentials from metadata server.
2016-06-03 16:11:49 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Anaconda3\envs\scrapy\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-06-03 16:11:49 [boto] ERROR: Unable to read instance data, giving up
2016-06-03 16:11:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-06-03 16:11:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-06-03 16:11:49 [scrapy] INFO: Enabled item pipelines: FilesPipeline
2016-06-03 16:11:49 [scrapy] INFO: Spider opened
2016-06-03 16:11:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-03 16:11:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-03 16:11:49 [scrapy] DEBUG: Redirecting (meta refresh) to <GET http://myexample.com> from <GET http://myexample.com>
2016-06-03 16:11:50 [scrapy] DEBUG: Crawled (200) <GET http://myexample.com> (referer: None)
2016-06-03 16:11:50 [scrapy] ERROR: Spider error processing <GET http://www.example.com> (referer: None)
Traceback (most recent call last):
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Anaconda3\envs\scrapy\lib\site-packages\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2016-06-03 16:11:50 [scrapy] INFO: Closing spider (finished)
2016-06-03 16:11:50 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 23526,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 3, 15, 11, 50, 227000),
'log_count/DEBUG': 4,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2016, 6, 3, 15, 11, 49, 722000)}
2016-06-03 16:11:50 [scrapy] INFO: Spider closed (finished)