How do I write an XPath for 2 columns?

I am using Scrapy to scrape content. I have tried a lot of things to scrape this website, which has 2 columns. The website's code:

<div> 
    <div class="something"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
    <div class="something"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
     <article> 
      <h2> 
       <a href="somelinks"> 
</div> 

My code:

for href in response.xpath("//div[@class='something']/article/h2/a/@href"): 
    url = response.urljoin(href.extract()) 
    yield scrapy.Request(url, callback=self.parse_dir_contents) 

Is my code wrong? I can't get it to run; the spider just closes.

Let me see the web page URL –

Can you share more of your spider class, and not just the loop over the href attributes? –

That is, or should be, invalid HTML. Are you sure the first nested 'div' isn't left unclosed? – usr2564301

Answer

You can use a spider like this to scrape all the blog posts from http://www.bebizzy.com/the-bebizzy-blog/:

import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import CrawlSpider, Rule 

from check_site.items import YourItem 


class StackSpider(CrawlSpider):
    name = 'stack'
    allowed_domains = ['bebizzy.com']
    start_urls = ['http://www.bebizzy.com/the-bebizzy-blog/']

    rules = (
        # follow every blog-post link (the "read more" anchors) and parse it
        Rule(LinkExtractor(restrict_css='a.more-link'), callback='parse_item', follow=True),
        # follow the pagination links so the crawl reaches the older listing pages
        Rule(LinkExtractor(restrict_css='div.pagination>div>a'), callback='parse', follow=True),
    )

    def parse_item(self, response):
        self.logger.info(response.url)
        i = YourItem()
        # TODO: fill your item
        # i['title'] = ...
        return i
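
The spider imports YourItem from check_site.items; that module is not shown here, but a minimal item definition matching the #TODO comment might look like this (the title field is just an assumption, add whatever fields you actually need):

import scrapy


class YourItem(scrapy.Item):
    # one example field to be filled in parse_item
    title = scrapy.Field()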

Log obtained from the spider:

2016-05-15 21:45:18 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-05-15 21:45:18 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-05-15 21:45:18 [scrapy] INFO: Enabled item pipelines: 
2016-05-15 21:45:18 [scrapy] INFO: Spider opened 
2016-05-15 21:45:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/04/12/learn-smartphone-features-spring/ 
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/04/why-you-need-a-responsive-website/ 
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/14/samsung-galaxy-s7-s7-edgereview/ 
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/10/marketing-your-business-online/ 
2016-05-15 21:45:26 [stack] INFO: http://www.bebizzy.com/2016/03/16/demographics-of-social-media-users/ 
2016-05-15 21:45:27 [stack] INFO: http://www.bebizzy.com/2016/03/02/websites-launched-creekside-farmstands-and-mandan-farmers-market/ 
2016-05-15 21:45:27 [stack] INFO: http://www.bebizzy.com/2016/03/01/what-is-wordpress/ 
2016-05-15 21:45:32 [stack] INFO: http://www.bebizzy.com/2016/03/18/mobile-friendly-sites-increase-seo-rank-google/ 
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/02/21/manage-multiple-wordpress-installations-with-managewp/ 
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/03/24/buy-laptop-tablet-2/ 
2016-05-15 21:45:33 [stack] INFO: http://www.bebizzy.com/2016/03/30/customizing-android-smartphone-screens/ 
2016-05-15 21:45:34 [stack] INFO: http://www.bebizzy.com/2015/09/18/vzwbuzz-recap-show-mobile-music/ 
2016-05-15 21:45:34 [stack] INFO: http://www.bebizzy.com/2015/09/03/choosing-a-new-logo/ 
2016-05-15 21:45:37 [stack] INFO: http://www.bebizzy.com/2015/10/16/best-android-apps-for-your-ghost-hunting-adventure/ 
2016-05-15 21:45:38 [stack] INFO: http://www.bebizzy.com/2015/10/21/samsung-note-5/ 
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/10/22/ue-roll-bluetooth-speaker/ 
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/11/17/best-apps-for-the-upcoming-election/ 
2016-05-15 21:45:39 [stack] INFO: http://www.bebizzy.com/2015/12/07/best-star-wars-android-apps/ 
2016-05-15 21:45:40 [stack] INFO: http://www.bebizzy.com/2016/02/19/using-microsoft-office-on-your-mobile-device/ 
2016-05-15 21:45:41 [stack] INFO: http://www.bebizzy.com/2016/01/08/best-android-business-apps-for-2016/ 
2016-05-15 21:45:41 [stack] INFO: http://www.bebizzy.com/2015/09/01/best-games-for-your-android-phone-essentialapps/ 
2016-05-15 21:45:44 [stack] INFO: http://www.bebizzy.com/2015/03/12/android-apps-for-your-spring-to-do-list/ 
2016-05-15 21:45:44 [stack] INFO: http://www.bebizzy.com/2015/02/02/mobile-technology-for-a-better-valentines-day/ 
2016-05-15 21:45:45 [stack] INFO: http://www.bebizzy.com/2015/03/18/logitech-k480-bluetooth-keyboard/ 
2016-05-15 21:45:45 [stack] INFO: http://www.bebizzy.com/2015/03/01/the-samsung-s6-and-the-htc-one-m9/ 
2016-05-15 21:45:47 [stack] INFO: http://www.bebizzy.com/2015/07/07/i-had-switchersremorse-once-once/ 
2016-05-15 21:45:47 [stack] INFO: http://www.bebizzy.com/2015/04/10/best-android-fishing-apps/ 
2016-05-15 21:45:48 [stack] INFO: http://www.bebizzy.com/2015/05/17/htcs-new-flagship-the-htc-one-m9/ 
2016-05-15 21:45:48 [stack] INFO: http://www.bebizzy.com/2015/07/28/windows10-twitter-stream/ 
2016-05-15 21:45:49 [stack] INFO: http://www.bebizzy.com/2015/01/06/my-3-words/ 

Just add your item-filling logic after the #TODO comment.
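
For example, the body of parse_item inside the spider above could be filled roughly like this (a sketch only; the title field and the h1.entry-title::text selector are guesses about the item definition and the blog's markup):

def parse_item(self, response):
    self.logger.info(response.url)
    i = YourItem()
    # the selector is only a guess at the post template; adjust it to the real page
    i['title'] = response.css('h1.entry-title::text').extract_first()
    return i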

wow. Can you explain the rules a bit? Because I don't know how to use/write that for future blogs/articles D: – Michimcchicken

The first rule gets all post links via the CSS selector 'a.more-link', while the second one yields the links to the next page. Note that this second rule uses 'callback='parse'', predefined in 'CrawlSpider', to repeat the process. –
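
Since the question started from an XPath, the same two rules could also be written with restrict_xpaths instead of restrict_css (a sketch; both variants are intended to select the same links):

rules = (
    # same "read more" anchors as a.more-link
    Rule(LinkExtractor(restrict_xpaths='//a[contains(@class, "more-link")]'), callback='parse_item', follow=True),
    # same pagination links as div.pagination>div>a
    Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "pagination")]/div/a'), callback='parse', follow=True),
)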

ohh thanks, I thought I had to def parse :O thanks – Michimcchicken