
I need to extract the links (url) from the Bing search results. How do you extract all the links from Bing? (And how do you declare variables inside a Scrapy class?)

page_links is supposed to hold the URLs of the other result pages that Bing shows at the bottom of the page.

news_link_list is supposed to hold the URLs of all the news-site stories that I want to keep track of (decided by legal_domains).

The yield Request(url, callback) should loop through all the page_links, get each response, and then do the same manipulations to update news_link_list and verified_links.

import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy.http import Request 
import re 


class LinksSpider(scrapy.Spider): 
    name = 'links' 
    verified_links = [] 
    news_link_list = [] 
    legal_domains = [ 
       'www.bloomberg.com', 
       'www.bbc.com', 
       'www.theguardian.com', 
       'www.cnn.com', 
       'www.foxnews.com', 
       'www.breitbart.com' 
    ] 
    legal_domains.sort() 
    start_urls = ['https://www.bing.com/search?q=Brexit&filters=ex1%3a%22ez5_15706_16976%22&qpvt=Brexit'] 

    def parse(self, response):
        # collect every href on the page
        links = response.css("a::attr(href)").extract()

        # page_links: everything from the first '#' anchor onwards (the pager area at the bottom)
        last_index = len(links) - 1
        for i in range(last_index, -1, -1):
            if links[i] == '#':
                last_index = i
        page_links = links[last_index:]

        # keep only absolute http(s) URLs
        filtered_links = []
        for each_link in links:
            filtered_links = filtered_links + re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', each_link)
        filtered_links.sort()

        # record news URLs from the whitelisted domains
        for each_news_url in filtered_links:
            for each_domain in legal_domains:
                if each_domain in each_news_url and each_news_url not in news_link_list:
                    with open('news_link_list', 'a') as f:
                        f.write(each_news_url)
                    news_link_list.append(each_news_url)
                    break
        verified_links = verified_links + page_links

        # follow the pager links
        for each_page_url in page_links:
            yield Request(url=each_page_url, callback="parse")

But I got the following error saying the variable is not defined. I wanted to know whether this is because of the way Scrapy works, and if it is, how can I fix it?

Traceback (most recent call last):
  File "/home/dennis/.local/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/dennis/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/dennis/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/dennis/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/dennis/.local/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/dennis/tutorial/tutorial/spiders/links.py", line 35, in parse
    for each_domain in legal_domains:
NameError: name 'legal_domains' is not defined

I'm new to Scrapy, so please forgive me if this is simple. I'm sure this would help other Scrapy early adopters as well.

Answers


Change for each_domain in legal_domains: to for each_domain in self.legal_domains:
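Why this works: names that are assigned in a class body become class attributes, and Python does not make them visible as bare names inside methods; they have to be reached through self (or through the class name). A minimal sketch, independent of Scrapy, that reproduces the same NameError:

class Demo:
    items = ['a', 'b']        # class attribute, like legal_domains above

    def show(self):
        # print(items)        # NameError: name 'items' is not defined
        print(self.items)     # works: looked up on the instance, then the class

Demo().show()                 # prints ['a', 'b']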


Add self. to every variable you declared at class level (in LinksSpider), like this:

def parse(self, response):
    links = response.css("a::attr(href)").extract()
    last_index = len(links) - 1
    for i in range(last_index, -1, -1):
        if links[i] == '#':
            last_index = i
    page_links = links[last_index:]
    filtered_links = []
    for each_link in links:
        filtered_links = filtered_links + re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', each_link)
    filtered_links.sort()

    for each_news_url in filtered_links:
        for each_domain in self.legal_domains:
            if each_domain in each_news_url and each_news_url not in self.news_link_list:
                with open('news_link_list', 'a') as f:
                    f.write(each_news_url)
                self.news_link_list.append(each_news_url)
                break
    self.verified_links = self.verified_links + page_links
    for each_page_url in page_links:
        # the callback must be the method itself, not a string
        yield Request(url=each_page_url, callback=self.parse)
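A side note that goes beyond the original answer: verified_links and news_link_list are mutable class attributes, so they are shared by every LinksSpider instance in the same process. If per-run state is wanted, one possible variation (my suggestion, not part of the answer) is to create the lists in __init__ instead:

class LinksSpider(scrapy.Spider):
    name = 'links'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.verified_links = []     # per-instance, not shared between spider runs
        self.news_link_list = []     # per-instance, not shared between spider runs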