How to pull/extract all the content from a long list of URLs with scrapy?

I would like to access and then extract the content from a list of URLs. For example, consider this website; I would like to extract the content of each post. Based on the posted answers, I tried the following:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver


class Test(scrapy.Spider):
    name = "test"
    allowed_domains = ["https://sfbay.craigslist.org/search/jjj?employment_type=2"]
    start_urls = (
        'https://sfbay.craigslist.org/search/jjj?employment_type=2',
    )

    def parse(self, response):
        driver = webdriver.Firefox()
        # load the page Scrapy just crawled in the browser
        driver.get(response.url)
        # collect the href of every listing link on the search page
        links = driver.find_elements_by_xpath("//a[@class='hdrlnk']")
        links = [x.get_attribute('href') for x in links]
        for x in links:
            print(x)
        driver.quit()
But I don't understand how to scrape all the content from a long list of links in one go, without specifying the target URLs explicitly... Any idea how to do it? I am also trying something similar to this video, and I am still stuck....
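For reference, a minimal sketch of the usual pure-Scrapy pattern for this (no Selenium needed): extract the listing links in parse and yield a Request for each one with a second callback. The parse_post callback name and the title XPath below are illustrative assumptions, not verified Craigslist markup:

import scrapy


class CraigslistSketchSpider(scrapy.Spider):
    name = "craigslist_sketch"
    allowed_domains = ["sfbay.craigslist.org"]  # domain only, not a full URL
    start_urls = ['https://sfbay.craigslist.org/search/jjj?employment_type=2']

    def parse(self, response):
        # follow every listing link found on the search page
        for href in response.xpath("//a[@class='hdrlnk']/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_post)

    def parse_post(self, response):
        # the fields below are placeholders for whatever content you want per post
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }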
UPDATE: Based on @quasarseeker's answer, I tried:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from test.items import TestItem


class TestSpider(CrawlSpider):
    name = "test"
    allowed_domains = ["https://sfbay.craigslist.org/search/jjj?employment_type=2"]
    start_urls = (
        'https://sfbay.craigslist.org/search/jjj?employment_type=2',
    )
    rules = (
        # Rule to parse through all pages
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='button next']",)),
             follow=True),
        # Rule to parse through all listings on a page
        Rule(LinkExtractor(allow=(), restrict_xpaths=("/p[@class='row']/a",)),
             callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        item = TestItem()
        item['url'] = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item['url'].append(link.url)
        print('\n\n\n\n**********************\n\n\n\n', item)
        return item
However, I am not getting anything:
2016-11-03 08:46:24 [scrapy] INFO: Scrapy 1.2.0 started (bot: test)
2016-11-03 08:46:24 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test.spiders', 'BOT_NAME': 'test', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['test.spiders']}
2016-11-03 08:46:24 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.corestats.CoreStats']
2016-11-03 08:46:24 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-11-03 08:46:24 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-11-03 08:46:24 [scrapy] INFO: Enabled item pipelines:
[]
2016-11-03 08:46:24 [scrapy] INFO: Spider opened
2016-11-03 08:46:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-03 08:46:24 [scrapy] DEBUG: Crawled (200) <GET https://sfbay.craigslist.org/robots.txt> (referer: None)
2016-11-03 08:46:25 [scrapy] DEBUG: Crawled (200) <GET https://sfbay.craigslist.org/search/jjj?employment_type=2> (referer: None)
2016-11-03 08:46:25 [scrapy] DEBUG: Filtered offsite request to 'sfbay.craigslist.org': <GET https://sfbay.craigslist.org/search/jjj?employment_type=2&s=100>
2016-11-03 08:46:25 [scrapy] INFO: Closing spider (finished)
2016-11-03 08:46:25 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 516,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 18481,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 11, 3, 14, 46, 25, 230629),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 11, 3, 14, 46, 24, 258110)}
2016-11-03 08:46:25 [scrapy] INFO: Spider closed (finished)
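The "Filtered offsite request" line in the log points at the likely cause: allowed_domains expects bare domain names, not full URLs, so OffsiteMiddleware cannot match the pagination request against "https://sfbay.craigslist.org/search/jjj?employment_type=2" and drops it. A minimal correction sketch (the leading "//" in the listing XPath is my assumption about the intended selector; "/p[...]" matches nothing because <p> is never the document root):

allowed_domains = ["sfbay.craigslist.org"]  # domain only; a full URL here makes every request look offsite

rules = (
    # follow pagination
    Rule(LinkExtractor(restrict_xpaths=("//a[@class='button next']",)), follow=True),
    # follow each listing on a page; note the leading '//'
    Rule(LinkExtractor(restrict_xpaths=("//p[@class='row']//a",)),
         callback="parse_obj", follow=True),
)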
get all 'a' tags and, for each 'a', get the 'href' attrib. – furas
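A minimal sketch of what that comment suggests, using Scrapy's selector API inside a parse callback instead of Selenium (the variable names are illustrative):

def parse(self, response):
    # get all 'a' tags and, for each one, its 'href' attribute
    hrefs = response.xpath('//a/@href').extract()
    for href in hrefs:
        print(response.urljoin(href))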