How can I get all the plain text from a website with Scrapy?

I would like to get all the visible text from a website after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Is there any solution for this? Thanks!


2014-04-18 tomasyany

The easiest option would be to extract //body//text() and join everything found.

Another option is to use nltk's clean_html():

>>> import nltk 
>>> html = """ 
... <div class="post-text" itemprop="description"> 
... 
...   <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p> 
... 
...  </div>""" 
>>> nltk.clean_html(html) 
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html 
>>> tree = lxml.html.fromstring(html) 
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
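If adding a third-party library is not an option, roughly the same result can be had from the standard library's html.parser. This is not one of the approaches above, only a minimal sketch: the names VisibleTextParser and strip_tags and the sample markup are made up for illustration, and skipping <script> and <style> is an assumption about what counts as "visible" text.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while we are inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces as-is, then collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_tags(html):
    """Return the text of an HTML string with the markup removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample markup, not taken from a real page.
sample = (
    '<div class="post-text"><p>I only want <code>the text</code>.</p>'
    "<script>var x = 1;</script></div>"
)
print(strip_tags(sample))
```

Unlike lxml or BeautifulSoup, html.parser does not build a tree, so this works best on reasonably well-formed markup.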


2014-04-18 15:18:56 alecxe

I have deleted my question.. I used the code below: html = sel.select("//body//text()") tree = lxml.html.fromstring(html) item['description'] = tree.text_content().strip() But I am getting this error: is_full_html = _looks_like_full_html_unicode(html) exceptions.TypeError: expected string or buffer. What went wrong? – Backtrack

'nltk' worked best for me – user4421975

Just as an update: 'nltk' has dropped its 'clean_html' method, which now raises instead: 'NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function' – TheNastyOne

Have you tried this?

''.join(sel.select("//body//text()").extract()).strip()

where sel is a Selector instance. Alternatively:

xpath('//body//text()').re('(\w+)')

OR

xpath('//body//text()').extract()


2014-04-18 15:08:41

That actually works quite well, but it still returns some HTML tags and other things. – tomasyany
