Scrapy: Script tags in the HTML body

I am currently extracting all the text inside the body tag (without whitespace such as \r\n) with the following code:

full_text = response.xpath('normalize-space(/html/body)').extract()

The problem is that this also picks up the JavaScript inside script tags within the body.

Do you know how I can exclude the content inside script tags?

I tried the following, but it does not work:

full_text = response.xpath('normalize-space(/html/body/*[not(self::script)])').extract()

Any help is appreciated.


2016-09-13 Tom Brock

You can follow the answer to this question: Scraping text without javascript code using scrapy

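from w3lib.html import remove_tags, remove_tags_with_content

# extract_first() returns a single HTML string; extract() returns a list,
# which remove_tags_with_content() cannot process
html = response.xpath('//div[@id="content"]').extract_first()
# Drop <script> elements together with their content, then strip the remaining tags
output = remove_tags(remove_tags_with_content(html, ('script',)))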


2016-09-13 18:44:22 MrPandav

That did the trick. Cheers.
