2016-10-10

How do I group XPath results?

I have HTML elements that look like this:

[image: HTML structure of the page]

I would like to group h1, div.article-meta and div.article-content so that I can loop over them and write their data row by row in my Scrapy project.

I am thinking of grouping each of them into a variable and then looping over that variable, but I am not sure how to do it.

Please advise. Thanks.

This is what I have tried so far:

def parse(self, response):
    now = time.strftime('%Y-%m-%d %H:%M:%S')
    hxs = scrapy.Selector(response)

    titles = hxs.xpath('//div[@class="list-article"]/h1')
    images = hxs.xpath('//div[@class="list-article"]/feature-image')
    contents = hxs.xpath('//div[@class="list-article"]/article-content')

    for i, title in titles:
        item = DapnewsItem()
        item['categoryId'] = '1'

        name = titles[i].xpath('a/text()')
        if not name:
            print('DAP => [' + now + '] No title')
        else:
            item['name'] = name.extract()[0]

        description = contents[i].xpath('p/text()')
        if not description:
            print('DAP => [' + now + '] No description')
        else:
            item['description'] = description[1].extract()

        url = titles[i].xpath("a/@href")
        if not url:
            print('DAP => [' + now + '] No url')
        else:
            item['url'] = url.extract()[0]

        imageUrl = images[i].xpath('img/@src')
        if not imageUrl:
            print('DAP => [' + now + '] No imageUrl')
        else:
            item['imageUrl'] = imageUrl.extract()[0]

        yield item

This is the error I get:

[image: error output]


Hi there, I have updated my post with what I have so far. – Vicheanak

Answer


Let's use this HTML snippet to illustrate:

<div class="list-article">

    <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
    <div class="article-meta">Something for 1</div>
    <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
    <div class="article-content"><p>Content 1</p></div>

    <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
    <div class="article-meta">Something for 2</div>
    <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
    <div class="article-content"><p>Content 2</p></div>

    <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
    <div class="article-meta">Something for 3</div>
    <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
    <div class="article-content"><p>Content 3</p></div>

</div>

You can loop over each <h1> and use XPath's following-sibling axis to look at the elements that come after it at the same level of the tree, then keep only the first match. For example, following-sibling::div[@class="feature-image"][1] selects the first <div class="feature-image"> that follows the current <h1>.

>>> import scrapy
>>> selector = scrapy.Selector(text='''<div class="list-article">
...
...     <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
...     <div class="article-meta">Something for 1</div>
...     <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
...     <div class="article-content"><p>Content 1</p></div>
...
...     <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
...     <div class="article-meta">Something for 2</div>
...     <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
...     <div class="article-content"><p>Content 2</p></div>
...
...     <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
...     <div class="article-meta">Something for 3</div>
...     <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
...     <div class="article-content"><p>Content 3</p></div>
...
... </div>''')

>>> # Loop over each <h1>, then reach its nearest matching siblings.
>>> for h in selector.css('div.list-article > h1'):
...     item = {
...         'title': h.xpath('a/text()').extract_first(),
...         'image': h.xpath('''
...             following-sibling::div[@class="feature-image"][1]
...                 /img/@src''').extract_first(),
...         'content': h.xpath('''
...             following-sibling::div[@class="article-content"][1]
...                 /p/text()''').extract_first(),
...     }
...     print(item)
...
{'content': u'Content 1', 'image': u'http://www.example.com/image1.jpg', 'title': u'Title 1'} 
{'content': u'Content 2', 'image': u'http://www.example.com/image2.jpg', 'title': u'Title 2'} 
{'content': u'Content 3', 'image': u'http://www.example.com/image3.jpg', 'title': u'Title 3'} 
>>> 

Works great, updated! Thank you very much. – Vicheanak