Scrapy error NotSupported

When I crawl data from the detail pages of this site, I get the error scrapy.exceptions.NotSupported. With a small number of pages I can still get data, but when I increase the number of pages, Scrapy keeps running without producing any further output and never stops. Thanks in advance!

The pages contain images, but I don't want to crawl the images; perhaps the response content is not text.

This is the error:

2017-02-18 15:35:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> from <GET http://maps.google.com.my/maps?f=q&source=s_q&hl=en&q=bs+bio+science+sdn+bhd&vps=1&jsv=171b&sll=4.109495,109.101269&sspn=25.686885,46.318359&ie=UTF8&ei=jPeISu6RGI7kugOboeXiDg&cd=1&usq=bs+bio+science+sdn+bhd&geocode=FQdNLwAdEm4QBg&cid=12762834734582014964&li=lmd> 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://com> (failed 3 times): DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.byunature> (failed 3 times): DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.borneococonutoil.com> (failed 3 times): DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://com>: DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.byunature>: DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.borneococonutoil.com>: DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed. 
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> from <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> 
2017-02-18 15:35:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> (referer: http://www.bsbioscience.com/contactus.html) 
2017-02-18 15:35:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 
2017-02-18 15:35:41 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 
Traceback (most recent call last): 
    File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback 
    yield next(it) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output 
    for x in result: 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr> 
    return (_set_referer(r) for r in result or()) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr> 
    return (r for r in result or() if _filter(r)) 
    File "D:\Scrapy\tutorial\tutorial\spiders\tu2.py", line 17, in parse 
    company = response.css('font:nth-child(3)::text').extract_first() 
    File "c:\python27\lib\site-packages\scrapy\http\response\__init__.py", line 97, in css 
    raise NotSupported("Response content isn't text") 
NotSupported: Response content isn't text 
2017-02-18 15:35:41 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-02-18 15:35:41 [scrapy.extensions.feedexport] INFO: Stored json feed (30 items) in: tu2.json 
2017-02-18 15:35:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 55, 
'downloader/exception_type_count/scrapy.exceptions.NotSupported': 31, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 24, 

My code:

import scrapy
import json
from scrapy.linkextractors import LinkExtractor
# import LxmlLinkExtractor as LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def start_requests(self):
        baseurl = 'http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F'
        urls = []
        for i in range(1, 3):
            urls.append(baseurl + "&page=" + str(i))

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        company = response.css('font:nth-child(3)::text').extract_first()

        key3 = "Business Address"
        key4 = response.css('tr:nth-child(4) td:nth-child(1) b::text').extract_first()
        key5 = response.css('tr:nth-child(5) td:nth-child(1) b::text').extract_first()

        value3 = response.css('tr:nth-child(3) .table-middle:nth-child(3)::text').extract_first()
        value4 = response.css('tr:nth-child(4) td:nth-child(3)::text').extract_first()
        value5 = response.css('tr:nth-child(5) td:nth-child(3)::text').extract_first()

        # bla = {}
        # if key3 is not None:
        #     bla[key3] = value3

        if value3 is not None:
            json_data = {
                'company': company,
                key3: value3,
                key4: value4,
                key5: value5,
            }
            yield json_data
            # yield json.dumps(bla)

        # detail page
        count = 0
        for button in response.css('td td a'):
            detail_page_url = button.css('::attr(href)').extract_first()
            if detail_page_url is not None:
                page_urls = response.urljoin(detail_page_url)
                yield scrapy.Request(page_urls, callback=self.parse)

Answer

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html) 

The spider is crawling a PDF file here. You need to filter these links out manually, or use a LinkExtractor, which already does that for you.

def parse(self, response): 
    url = 'someurl' 
    if '.pdf' not in url: 
        yield Request(url, self.parse2) 
    # or 
    le = LinkExtractor() 
    links = le.extract_links(response) 
    for link in links: 
        # extract_links() returns Link objects, so use link.url 
        yield Request(link.url, self.parse2) 

By default, LinkExtractor ignores many non-HTML file types, including pdf - source here for the full list.
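For reference, that default list ships with Scrapy as IGNORED_EXTENSIONS, which is what LinkExtractor's deny_extensions argument falls back to when you don't set it. A minimal sketch to check it:

from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS 

# 'pdf' is part of the default ignore list, so PDF links are skipped 
print('pdf' in IGNORED_EXTENSIONS)   # True 

# the filter can be narrowed or widened via deny_extensions if needed 
le = LinkExtractor(deny_extensions=['pdf']) 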

For your code example, try this:

# detail page 
count = 0 
link_extractor = LinkExtractor(restrict_css='td td a')   # restrict to the <a> elements, not ::attr(href) 
for link in link_extractor.extract_links(response): 
    # Link objects already carry absolute URLs, so no urljoin is needed 
    yield scrapy.Request(link.url, callback=self.parse) 
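
Putting it together, the detail-page crawl slots into your existing parse method. A rough sketch under the assumption that your item-extraction selectors stay as they are (trimmed to one field for brevity):

import scrapy 
from scrapy.linkextractors import LinkExtractor 


class QuotesSpider(scrapy.Spider): 
    name = "tu2" 

    # start_requests stays exactly as in your code ... 

    def parse(self, response): 
        # item extraction, same selectors as in your spider 
        company = response.css('font:nth-child(3)::text').extract_first() 
        value3 = response.css('tr:nth-child(3) .table-middle:nth-child(3)::text').extract_first() 
        if value3 is not None: 
            yield {'company': company, 'Business Address': value3} 

        # follow detail-page links; LinkExtractor skips .pdf and other 
        # non-HTML extensions by default, so PDF (and image) links 
        # no longer reach parse() and trigger NotSupported 
        for link in LinkExtractor(restrict_css='td td a').extract_links(response): 
            yield scrapy.Request(link.url, callback=self.parse) 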

@Granitosarus Thanks, but how do I create a filter for this - do I create the filter in __init__.py, following your link? And do you mean we can simply drop them, i.e. not process the PDF links? –


@RoShanShan Yes, just don't process the PDF links. The second example after '# or' is really all you need. See https://doc.scrapy.org/en/latest/topics/link-extractors.html#link-extractors – Granitosaurus


I really don't know where to put the code after '# or'. I want to extract data from the detail pages of this link: [link](http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F). You can see my code above. –