Parsing specific data with Beautiful Soup

I have a web page that contains tabular data. Below is the HTML code for the table:

<table class="confluenceTable"> 
    <tbody> 
     <tr> 
      <th class="confluenceTh"> 
      <p>Prefix</p> 
      </th> 
      <th class="confluenceTh"> 
      <p>Group</p> 
      </th> 
      <th class="confluenceTh"> 
      <p>Contact</p> 
      </th> 
      <th class="confluenceTh"> 
      <p>Dev/Test Lab</p> 
      </th> 
      <th class="confluenceTh"> 
      <p>Performance</p> 
      </th> 
     </tr> 
     <tr> 
      <td class="confluenceTd"> 
      <p> </p> 
      </td> 
      <td class="confluenceTd"> 
      <p> </p> 
      </td> 
      <td class="confluenceTd"> 
      <p> </p> 
      </td> 
     </tr> 
     <tr> 
      <th class="confluenceTh"> 
      <p> </p> 
      </th> 
      <th class="confluenceTh"> 
      <p> </p> 
      </th> 
      <th class="confluenceTh"> 
      <p> </p> 
      </th> 
     </tr> 
     <tr> 
      <td class="confluenceTd"> 
      <p>SEF00</p> 
      </td> 
      <td class="confluenceTd"> 
      <p>APTRA Vision</p> 
      </td> 
      <td class="confluenceTd"> 
      <p> </p> 
      </td> 
      <td class="confluenceTd"> 
      <p><a href="/somepage">VCD Lab</a> , <a href="/somepage">Test Lab</a></p> 
      </td> 
      <td class="confluenceTd"> 
      <p><a href="/display">Perf Lab</a></p> 
      </td> 
     </tr> 
     <tr> 
      <td class="confluenceTd"> 
      <p>SEF01</p> 
      </td> 
      <td class="confluenceTd"> 
      <p>In-Person Bill Payment</p> 
      </td> 
      <td class="confluenceTd"> 
      <p>Swamy PKV</p> 
      </td>

How can I write my Python code so that I get only the data under the Prefix and Group columns? So far I have tried this:

ii=1 
data=requests.get(url,auth=(username,password)) 
sample=data.content 
soup=BeautifulSoup(sample,'html.parser') 
for row in soup.find_all('tr')[1:154]: 
    datatocheck.append(row.get_text(separator='\t')) 
while(ii<=152): 
     print datatocheck[ii][0:30] 
     ii+=1

This gives me the following output:

SEF00 APTRA Vision   VCD Lab 
SEF01 In-Person Bill Payment S

But I only want SEF00 (Prefix) and APTRA Vision (Group), SEF01 and In-Person Bill Payment, not the other columns.

Also, I cannot change the HTML.

Source

2016-11-18 Anurag Joshi

How about using a check like "if SEF00 in ii:"?

That way it would print only SEF00.

Source

2016-11-18 13:53:22 Daniel

I didn't quite understand that. Could you please add an example code block?

I'm not at home and am posting from my iPhone, so I'll check later whether it works and post it here if it does. What I meant was that you should have Python print the string if it contains SEF00. – Daniel

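from bs4 import BeautifulSoup

# html is the page source, e.g. data.content from the question
soup = BeautifulSoup(html, 'lxml')

for row in soup.find_all('tr')[3:]:  # skip the header row and the two empty rows
    tds = [i.get_text(strip=True) for i in row.find_all('td')]
    print(tds[0], tds[1])

out: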

SEF00 APTRA Vision 
SEF01 In-Person Bill Payment

Just get all the td elements in each row, put them in a list, then slice the list.

Source
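
A slightly more defensive variant of the loop above, as a sketch only. It assumes the real page has the same layout as the excerpt in the question, uses the built-in html.parser so that lxml is not required, and skips any row that has fewer than two td cells (for example rows made up only of th cells), since tds[1] would raise an IndexError there. The function name prefix_and_group and the shortened sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same layout as the real page).
html = """
<table class="confluenceTable">
  <tbody>
    <tr><th class="confluenceTh"><p>Prefix</p></th><th class="confluenceTh"><p>Group</p></th></tr>
    <tr><td class="confluenceTd"><p> </p></td><td class="confluenceTd"><p> </p></td></tr>
    <tr><th class="confluenceTh"><p> </p></th><th class="confluenceTh"><p> </p></th></tr>
    <tr>
      <td class="confluenceTd"><p>SEF00</p></td>
      <td class="confluenceTd"><p>APTRA Vision</p></td>
      <td class="confluenceTd"><p><a href="/somepage">VCD Lab</a></p></td>
    </tr>
    <tr>
      <td class="confluenceTd"><p>SEF01</p></td>
      <td class="confluenceTd"><p>In-Person Bill Payment</p></td>
      <td class="confluenceTd"><p>Swamy PKV</p></td>
    </tr>
  </tbody>
</table>
"""


def prefix_and_group(html_text):
    """Return (prefix, group) pairs taken from the first two td cells of each row."""
    soup = BeautifulSoup(html_text, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        tds = [td.get_text(strip=True) for td in row.find_all("td")]
        # Rows with only th cells give an empty list; skip anything too short to index.
        if len(tds) < 2:
            continue
        # Skip rows whose first cell is blank.
        if not tds[0]:
            continue
        pairs.append((tds[0], tds[1]))
    return pairs


for prefix, group in prefix_and_group(html):
    print(prefix, group)
```

Because the rows are filtered by their content, no fixed slice such as [3:] is needed, which may help if further header or blank rows appear later in the table.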

2016-11-18 15:22:04

OK, but is it a problem if I use html.parser instead of lxml? I tried to install lxml with pip, but it always fails.

At print(tds[0], tds[1]) I always get the error message "IndexError: list index out of range". Any suggestions?

Anyway, the other solution you gave yesterday worked. Thanks again!
