Scrapy error NotSupported

When I crawl data from the detail pages of this site, I get the error scrapy.exceptions.NotSupported. With a small number of pages I can still get data, but when I increase the number of pages, Scrapy keeps running without outputting anything more and never stops. Thanks in advance!
The pages contain images, but I do not want to crawl images; perhaps the response content is not text.
This is the error:
2017-02-18 15:35:35 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en> from <GET http://maps.google.com.my/maps?f=q&source=s_q&hl=en&q=bs+bio+science+sdn+bhd&vps=1&jsv=171b&sll=4.109495,109.101269&sspn=25.686885,46.318359&ie=UTF8&ei=jPeISu6RGI7kugOboeXiDg&cd=1&usq=bs+bio+science+sdn+bhd&geocode=FQdNLwAdEm4QBg&cid=12762834734582014964&li=lmd>
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://com> (failed 3 times): DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.byunature> (failed 3 times): DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.borneococonutoil.com> (failed 3 times): DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://com>: DNS lookup failed: address 'com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.byunature>: DNS lookup failed: address 'www.byunature' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.borneococonutoil.com>: DNS lookup failed: address 'www.borneococonutoil.com' not found: [Errno 11001] getaddrinfo failed.
2017-02-18 15:35:37 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> from <GET https://www.google.com.my:443/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en>
2017-02-18 15:35:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.google.com.my/maps/place/bs+bio+science+sdn+bhd/@4.109495,109.101269,2856256m/data=!3m1!4b1!4m2!3m1!1s0x0:0xb11eb29219c723f4?source=s_q&hl=en&dg=dbrw&newdg=1> (referer: http://www.bsbioscience.com/contactus.html)
2017-02-18 15:35:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html)
2017-02-18 15:35:41 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.canaanalpha.com/extras/Anistrike_Poster.pdf> (referer: http://www.canaanalpha.com/anistrike.html)
Traceback (most recent call last):
File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or())
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or() if _filter(r))
File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or() if _filter(r))
File "D:\Scrapy\tutorial\tutorial\spiders\tu2.py", line 17, in parse
company = response.css('font:nth-child(3)::text').extract_first()
File "c:\python27\lib\site-packages\scrapy\http\response\__init__.py", line 97, in css
raise NotSupported("Response content isn't text")
NotSupported: Response content isn't text
2017-02-18 15:35:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-18 15:35:41 [scrapy.extensions.feedexport] INFO: Stored json feed (30 items) in: tu2.json
2017-02-18 15:35:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 55,
'downloader/exception_type_count/scrapy.exceptions.NotSupported': 31,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 24,
My code:
import scrapy
import json
from scrapy.linkextractors import LinkExtractor
# import LxmlLinkExtractor as LinkExtractor


class QuotesSpider(scrapy.Spider):
    name = "tu2"

    def start_requests(self):
        baseurl = 'http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F'
        urls = []
        for i in range(1, 3):
            urls.append(baseurl + "&page=" + str(i))
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        company = response.css('font:nth-child(3)::text').extract_first()
        key3 = "Business Address"
        key4 = response.css('tr:nth-child(4) td:nth-child(1) b::text').extract_first()
        key5 = response.css('tr:nth-child(5) td:nth-child(1) b::text').extract_first()
        value3 = response.css('tr:nth-child(3) .table-middle:nth-child(3)::text').extract_first()
        value4 = response.css('tr:nth-child(4) td:nth-child(3)::text').extract_first()
        value5 = response.css('tr:nth-child(5) td:nth-child(3)::text').extract_first()
        # bla = {}
        # if key3 is not None:
        #     bla[key3] = value3
        if value3 is not None:
            json_data = {
                'company': company,
                key3: value3,
                key4: value4,
                key5: value5,
            }
            yield json_data
        # yield json.dumps(bla)

        # detail pages
        count = 0
        for button in response.css('td td a'):
            detail_page_url = button.css('::attr(href)').extract_first()
            if detail_page_url is not None:
                page_urls = response.urljoin(detail_page_url)
                yield scrapy.Request(page_urls, callback=self.parse)
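The traceback shows that parse() is being called on a PDF response (Anistrike_Poster.pdf), and response.css() raises NotSupported for any non-text body. A minimal sketch of the kind of guard that avoids this, written as plain Python so it runs without Scrapy (in a real spider the equivalent check would be on response.headers, or isinstance(response, scrapy.http.TextResponse)):

```python
def looks_like_text(headers):
    """Return True when the Content-Type header announces a text body.

    `headers` is assumed to be a dict with bytes keys/values, mirroring
    how Scrapy stores response headers.
    """
    ctype = headers.get(b'Content-Type', b'').lower()
    return ctype.startswith((b'text/', b'application/xhtml', b'application/xml'))

# Only call response.css()/.xpath() when this returns True.
print(looks_like_text({b'Content-Type': b'text/html; charset=utf-8'}))  # True
print(looks_like_text({b'Content-Type': b'application/pdf'}))           # False
```

With such a guard at the top of parse(), binary responses are skipped instead of raising NotSupported.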
@Granitosaurus Thanks, but how do I create a filter for that? Do I create the filter in __init__.py, following your link? And do you mean we can drop them, i.e. simply not process the PDF links? –
@RoShanShan Yes, simply don't process PDF links. The second example after '# or' is all you need, really. See https://doc.scrapy.org/en/latest/topics/link-extractors.html#link-extractors – Granitosaurus
I really don't know where to put the code after '# or'. I want to extract data from the detail pages of this link: [link](http://edirectory.matrade.gov.my/application/edirectory.nsf/category?OpenForm&query=product&code=PT&sid=BED1E22D5BE3F9B5394D6AF0E742828F). You can see my code above. –
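The filter Granitosaurus suggests can also be approximated without a LinkExtractor: before yielding a Request, skip any URL whose path ends in a binary-file extension. A sketch; the extension set here is an assumption (Scrapy's LinkExtractor filters a much longer list, scrapy.linkextractors.IGNORED_EXTENSIONS, by default):

```python
from urllib.parse import urlparse
from posixpath import splitext

# Hypothetical, deliberately short extension set for illustration.
BINARY_EXTENSIONS = {'.pdf', '.jpg', '.jpeg', '.png', '.gif', '.zip', '.doc'}

def is_crawlable(url):
    """Skip links whose URL path ends in a known binary-file extension."""
    path = urlparse(url).path
    return splitext(path)[1].lower() not in BINARY_EXTENSIONS

print(is_crawlable('http://www.canaanalpha.com/extras/Anistrike_Poster.pdf'))  # False
print(is_crawlable('http://www.canaanalpha.com/anistrike.html'))               # True
```

In the spider above, the call would go inside the detail-page loop: only yield scrapy.Request(page_urls, ...) when is_crawlable(page_urls) is true.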