2016-06-10 2 views
-1

Python: extract bold text and the text that follows it

From the HTML below I want to pull out two pieces of data and add them to a list in Python. Each bold text is the name of a horse, and the text that follows it is the comment on that horse.

<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open. 
 
    <br> 
 
    <br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. 
 
    She saw it out well and it´ll be interesting to see how she copes with a rise. 
 
    <br> 
 
    <br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW. 
 
    <br> 
 
    <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form. 
 
    <br> 
 
    <br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton] 
 
    <br> 
 
    <br> 
 
    <div id="resultRaceReport" class="hide"></div> 
 
</div>

From the HTML above I would like to get output like this:

[LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it'll be interesting to see how she copes with a rise.]

[Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.]

[Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.]

[Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]]

but I'm just not sure how to get the desired output (it's more the logic behind it that I'm after).

I'm currently using lxml to scrape the content. I need to match the bold text (the horse names) against my table so that I can add the comments (the text after the bold) to my database.

+3

Possible duplicate of [Parsing HTML using Python](http://stackoverflow.com/questions/11709079/parsing-html-using-python)

+0

@emma perkins, I assume you are using lxml, as in your previous question?

+0

Sorry, yes I am (I'll add that to the question). This is more about the logic of doing it than the how-to.

Answer

2

Using lxml:

h = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.<br><br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.<br><br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.<br><br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.<br><br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]<br><br> <div id="resultRaceReport" class="hide"></div></div>""" 

from lxml import html 

x = html.fromstring(h) 

div = x.xpath("//*[@id='ANALYSIS']")[0] 

# find bold tags by class name 
for b in div.xpath(".//b[@class='black']"): 
    # get bold text 
    print(b.text) 
    # get the first text node after the current bold tag (here, the text up to the next br tag)
    print(b.xpath("./following::text()[1]")) 

That would give you:

LADY MAKFI 
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.'] 
Weardiditallgorong 
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.'] 
Chauvelin 
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.'] 
Happy Jack 
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]'] 

If you want it all in a single list, exactly as you wrote it:

from lxml import html 

x = html.fromstring(h) 
div = x.xpath("//*[@id='ANALYSIS']")[0] 
out = [b.text + "," + b.xpath("./following::text()[1]")[0].lstrip(",") for b in div.xpath(".//b[@class='black']")] 

Which gives you:

[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.', 
'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.', 
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.', 
'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]'] 
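
Because lxml stores the text that follows an element in that element's `tail` attribute, the same name and comment pairs can also be read without an XPath expression for the text. The following is only a minimal sketch: the shortened `sample` string and the helper name `extract_comments` are assumptions made for illustration and are not part of the answer above. It keys each comment by horse name, which may make matching against a table easier.

```python
from lxml import html

# Shortened, hypothetical sample in the same shape as the HTML in the question.
sample = (
    '<div id="ANALYSIS">A weak handicap.<br><br> '
    '<b class="black">LADY MAKFI</b> showed vastly improved form.<br><br> '
    '<b class="black">Chauvelin</b>, in second-time blinkers, ran well.<br><br>'
    '</div>'
)


def extract_comments(markup):
    """Return {horse name: comment} for every <b class="black"> element."""
    root = html.fromstring(markup)
    comments = {}
    for b in root.xpath("//*[@id='ANALYSIS']//b[@class='black']"):
        name = (b.text or "").strip()
        # .tail holds the text between this element's closing tag and the next tag.
        comment = (b.tail or "").strip().lstrip(",").strip()
        comments[name] = comment
    return comments


comments = extract_comments(sample)
for name, comment in comments.items():
    print(name, "->", comment)
```

The `or ""` guards are there because `text` and `tail` are `None` when an element has no text of that kind.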
+0

Perfect, thanks again.

+0

No problem. We can actually simplify the XPath to just get the first following text node after each bold tag. Are you doing some analysis of the data?

+0

Yes, I'm collecting results from past horse races and then analysing them for betting :) so each comment on a horse needs to go into my database and be matched with that horse.

1

I prefer using Beautiful Soup's API over using lxml directly. I can avoid XPath entirely and just write Python.

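import bs4 
# document is assumed to hold the HTML string from the question
soup = bs4.BeautifulSoup(document, 'lxml') 
# join each bold tag's text with the text node that directly follows it
[b.text + b.next_sibling.rstrip() for b in soup.find_all('b')] 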

Output:

['LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.\n She saw it out well and it´ll be interesting to see how she copes with a rise.', 
'Weardiditallgorong went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.', 
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.', 
'Happy Jack not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]'] 
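
One thing to watch with `next_sibling` is that it is not always a string: if a bold tag is followed directly by another tag, or by nothing, calling `rstrip()` on it fails. A hedged sketch of a more defensive variant follows; the shortened `sample` string is an assumption for illustration, and it uses the built-in `html.parser` instead of the `lxml` parser so that it only depends on Beautiful Soup.

```python
import bs4

# Shortened, hypothetical sample; the second bold tag has no comment after it.
sample = (
    '<div id="ANALYSIS">'
    '<b class="black">LADY MAKFI</b> showed vastly improved form.<br>'
    '<b class="black">Happy Jack</b><br>'
    '</div>'
)

soup = bs4.BeautifulSoup(sample, "html.parser")

pairs = []
for b in soup.find_all("b", class_="black"):
    sibling = b.next_sibling
    # Only plain text nodes count as a comment; a tag or None means no comment.
    if isinstance(sibling, bs4.NavigableString):
        comment = sibling.strip()
    else:
        comment = ""
    pairs.append((b.get_text(strip=True), comment))

print(pairs)
```

Keeping the name and comment as a tuple, rather than concatenating them, leaves the horse name available as a separate value for a later database lookup.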