Python/Scrapy: CrawlSpider stops after fetching start_urls

2017-03-09

I have wasted days now trying to get my head around Scrapy, reading the documentation and other Scrapy blogs and Q&As... and now I am about to do what people hate most: ask for directions ;-) The problem is: my spider opens, fetches the start_urls, but apparently does nothing with them. Instead, it closes right away and that's it. Apparently I don't even get to the first self.log() statement.

What I have so far is this:

# -*- coding: utf-8 -*-
import scrapy
# from scrapy.shell import inspect_response
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse, FormRequest, Request
from KiPieSpider.items import *
from KiPieSpider.settings import *

class KiSpider(CrawlSpider):
    name = "KiSpider"
    allowed_domains = ['www.kiweb.de', 'kiweb.de']
    start_urls = (
        # ST Regra start page:
        'https://www.kiweb.de/default.aspx?pageid=206',
        # follow ST Regra links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=206&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
        # ST Thermo start page:
        'https://www.kiweb.de/default.aspx?pageid=202&page=1',
        # follow ST Thermo links in the form of:
        # https://www.kiweb.de/default.aspx?pageid=202&page=\d+
        # https://www.kiweb.de/default.aspx?pageid=299&docid=\d{6}
    )
    rules = (
        # First rule that matches a given link is followed/parsed.
        # Follow category pagination without further parsing:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
                # but only within the pagination table cell:
                restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
            ),
            follow=True,
        ),
        # Follow links to category (202|206) articles and parse them:
        Rule(
            LinkExtractor(
                # Extract links in the form:
                allow=r'Default\.aspx?pageid=299&docid=\d+',
                # but only within article preview cells:
                restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
            ),
            # and parse the resulting pages for article content:
            callback='parse_init',
            follow=False,
        ),
    )

    # Once an article page is reached, check whether a login is necessary:
    def parse_init(self, response):
        self.log('Parsing article: %s' % response.url)
        if not response.xpath('input[@value="Logout"]'):
            # Note: response.xpath() is a shortcut of response.selector.xpath()
            self.log('Not logged in. Logging in...\n')
            return self.login(response)
        else:
            self.log('Already logged in. Continue crawling...\n')
            return self.parse_item(response)

    def login(self, response):
        self.log("Trying to log in...\n")
        self.username = self.settings['KI_USERNAME']
        self.password = self.settings['KI_PASSWORD']
        return FormRequest.from_response(
            response,
            formname='Form1',
            formdata={
                # needs name, not id attributes!
                'ctl04$Header$ctl01$textbox_username': self.username,
                'ctl04$Header$ctl01$textbox_password': self.password,
                'ctl04$Header$ctl01$textbox_logindaten_typ': 'Username_Passwort',
                'ctl04$Header$ctl01$checkbox_permanent': 'True',
            },
            callback=self.parse_item,
        )

    def parse_item(self, response):
        articles = response.xpath('//div[@id="artikel"]')
        items = []
        for article in articles:
            item = KiSpiderItem()
            item['link'] = response.url
            item['title'] = articles.xpath("div[@class='ct1']/text()").extract()
            item['subtitle'] = articles.xpath("div[@class='ct2']/text()").extract()
            item['article'] = articles.extract()
            item['published'] = articles.xpath("div[@class='biblio']/text()").re(r"(\d{2}.\d{2}.\d{4}) PIE")
            item['artid'] = articles.xpath("div[@class='biblio']/text()").re(r"PIE \[(d+)-\d+\]")
            item['lang'] = 'de-DE'
            items.append(item)
        # return(items)
        yield items
        # what is the difference between return and yield?? found both on web.

When I run scrapy crawl KiSpider, this is the result:

2017-03-09 18:03:33 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: KiPieSpider) 
2017-03-09 18:03:33 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'KiPieSpider.spiders', 'DEPTH_LIMIT': 3, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['KiPieSpider.spiders'], 'BOT_NAME': 'KiPieSpider', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'KiPieSpider ([email protected])', 'DOWNLOAD_DELAY': 0.25} 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled extensions: 
['scrapy.extensions.logstats.LogStats', 
'scrapy.extensions.telnet.TelnetConsole', 
'scrapy.extensions.corestats.CoreStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled downloader middlewares: 
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled spider middlewares: 
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
'scrapy.spidermiddlewares.referer.RefererMiddleware', 
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 
2017-03-09 18:03:33 [scrapy.middleware] INFO: Enabled item pipelines: 
[] 
2017-03-09 18:03:33 [scrapy.core.engine] INFO: Spider opened 
2017-03-09 18:03:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-03-09 18:03:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023 
2017-03-09 18:03:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=206> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.kiweb.de/default.aspx?pageid=202&page=1> (referer: None) 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-03-09 18:03:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 465, 
'downloader/request_count': 2, 
'downloader/request_method_count/GET': 2, 
'downloader/response_bytes': 48998, 
'downloader/response_count': 2, 
'downloader/response_status_count/200': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2017, 3, 9, 17, 3, 34, 235000), 
'log_count/DEBUG': 3, 
'log_count/INFO': 7, 
'response_received_count': 2, 
'scheduler/dequeued': 2, 
'scheduler/dequeued/memory': 2, 
'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 
'start_time': datetime.datetime(2017, 3, 9, 17, 3, 33, 295000)} 
2017-03-09 18:03:34 [scrapy.core.engine] INFO: Spider closed (finished) 

Is it that the login routine should not end with a callback, but with some kind of return/yield statement? Or what am I doing wrong? Unfortunately, the docs and tutorials I have seen so far give me only a vague idea of how each bit connects to the others; Scrapy's documentation in particular seems to be written as a reference for people who already know a lot about Scrapy.

Somewhat frustrated regards, Christopher

Answer

rules = (
    # First rule that matches a given link is followed/parsed.
    # Follow category pagination without further parsing:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=(202|206])&page=\d+',
            # but only within the pagination table cell:
            restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
        ),
        follow=True,
    ),
    # Follow links to category (202|206) articles and parse them:
    Rule(
        LinkExtractor(
            # Extract links in the form:
            # allow=r'Default\.aspx?pageid=299&docid=\d+',
            # but only within article preview cells:
            restrict_xpaths=("//td[@class='TOC-zelle TOC-text']"),
        ),
        # and parse the resulting pages for article content:
        callback='parse_init',
        follow=False,
    ),
)

You don't need the allow parameter, because there is only a single link inside the tag selected by the XPath.

I don't understand the regex in the allow parameter, but at the very least you should escape the ? (see the sketch below).
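For illustration, the first rule with the escaping fixed might look roughly like this. This is only a sketch based on the patterns from the question; it also drops the stray ']' after 206:

Rule(
    LinkExtractor(
        # '?' (like '.') is a regex metacharacter, so it has to be
        # escaped to match the literal '?' in the URL:
        allow=r'Default\.aspx\?pageid=(202|206)&page=\d+',
        # still restricted to the pagination table cell:
        restrict_xpaths=('//td[@id="ctl04_teaser_next"]'),
    ),
    follow=True,
),

The same escaping would apply to the second rule's pattern (Default\.aspx\?pageid=299&docid=\d+).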


Thank you very much, it was the unescaped ? inside the allow parameter! –
