2016-10-15

Web scraping with BeautifulSoup 4

Below is the div tag, taken directly from espncricinfo.com.

<div id="rectPlyr_Playerlistt20" style="display: none; visibility: hidden; 
    background:url(http://i.imgci.com/espncricinfo/ciPlayerTablebottom-bg.gif) bottom left no-repeat;"> 
    <table class="playersTable" cellpadding="0" cellspacing="0" style="margin-top:15px; margin-bottom:14px;"> 
     <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td> 
     <td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td> 
     <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td> 
     <td class="divider"><a href="/ci/content/player/290727.html">R Dhawan</a></td> 
     <td class=""><a href="/ci/content/player/28235.html">S Dhawan</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/28081.html">MS Dhoni</a></td> 
     <td class="divider"><a href="/ci/content/player/28671.html">FY Fazal</a></td> 
     <td class=""><a href="/ci/content/player/28763.html">G Gambhir</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/234675.html">RA Jadeja</a></td> 
     <td class="divider"><a href="/ci/content/player/290716.html">KM Jadhav</a></td> 
     <td class=""><a href="/ci/content/player/253802.html">V Kohli</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/277955.html">DS Kulkarni</a></td> 
     <td class="divider"><a href="/ci/content/player/326016.html">B Kumar</a></td> 
     <td class=""><a href="/ci/content/player/398506.html">Mandeep Singh</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/31107.html">A Mishra</a></td> 
     <td class="divider"><a href="/ci/content/player/481896.html">Mohammed Shami</a></td> 
     <td class=""><a href="/ci/content/player/290630.html">MK Pandey</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/554691.html">AR Patel</a></td> 
     <td class="divider"><a href="/ci/content/player/32540.html">CA Pujara</a></td> 
     <td class=""><a href="/ci/content/player/277916.html">AM Rahane</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/422108.html">KL Rahul</a></td> 
     <td class="divider"><a href="/ci/content/player/33141.html">AT Rayudu</a></td> 
     <td class=""><a href="/ci/content/player/279810.html">WP Saha</a></td> 
    </tr> 
    <tr class=""> 
     <td class="divider"><a href="/ci/content/player/236779.html">I Sharma</a></td> 
     <td class="divider"><a href="/ci/content/player/34102.html">RG Sharma</a></td> 
     <td class=""><a href="/ci/content/player/537126.html">BB Sran</a></td> 
    </tr> 
    <tr class="odd"> 
     <td class="divider"><a href="/ci/content/player/390484.html">JD Unadkat</a></td> 
     <td class="divider"><a href="/ci/content/player/237095.html">M Vijay</a></td> 
     <td class=""><a href="/ci/content/player/376116.html">UT Yadav</a></td> 
    </tr> 
    <tr class=""> 
    </tr> 
    </table> 
</div> 

This is how I am trying to scrape that HTML:

from bs4 import BeautifulSoup 
import os 
import urllib2 
BASE_URL = "http://www.espncricinfo.com" 
espn_ = urllib2.urlopen("http://www.espncricinfo.com/ci/content/player/index.html?country=6") 

soup = BeautifulSoup(espn_ , 'html.parser') 

#print soup.prettify().encode('utf-8') 
t20 = soup.find_all('div' , {"id" : "rectPlyr_Playerlistt20"}) 
for row in t20:
    print(row.find('tr' , {"class":"odd"}))

Assume the markup above is what I get from the URL given in the code. When I scrape it, the output I get is None.

Also, when I print t20 I do not get the full output; it only goes as far as JJ Bumrah, i.e. only the first <tr> tag. If the data above is unclear, go to the URL assigned to espn_, select the India team and open the "t20" tab. I want to scrape the href links of all the players listed under t20.

Answer


The HTML is badly broken; you only need to look at the first lines of the table to see that (the first row's cells appear without an opening <tr>). Your best option is to use either lxml or html5lib as the parser, select the anchors directly, and slice with a step:

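import requests
from bs4 import BeautifulSoup

# Assumption: espn_ is a requests response (it has .content), not the urllib2 handle from the question
espn_ = requests.get("http://www.espncricinfo.com/ci/content/player/index.html?country=6")
soup = BeautifulSoup(espn_.content, 'html5lib')

t20 = soup.select("#rectPlyr_Playerlistt20 .playersTable td.divider a")
for a in t20[1::2]:
    print(a)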

Which gives you:

<a href="/ci/content/player/27223.html">STR Binny</a> 
<a href="/ci/content/player/290727.html">R Dhawan</a> 
<a href="/ci/content/player/28671.html">FY Fazal</a> 
<a href="/ci/content/player/290716.html">KM Jadhav</a> 
<a href="/ci/content/player/326016.html">B Kumar</a> 
<a href="/ci/content/player/481896.html">Mohammed Shami</a> 
<a href="/ci/content/player/32540.html">CA Pujara</a> 
<a href="/ci/content/player/33141.html">AT Rayudu</a> 
<a href="/ci/content/player/34102.html">RG Sharma</a> 
<a href="/ci/content/player/237095.html">M Vijay</a>
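
Since the question asks for the href links rather than the tags, here is a minimal sketch of that last step, written for Python 3. It is an illustration and not part of the answer above: `player_links` and `SAMPLE_HTML` are hypothetical names, the sample is a shortened copy of the question's markup with the missing opening <tr> added, and it selects every anchor inside the div instead of slicing with a step.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.espncricinfo.com"  # same base URL as in the question

# Shortened, cleaned-up sample of the question's markup (assumption: the
# opening <tr> that the original snippet lacks has been added here).
SAMPLE_HTML = """
<div id="rectPlyr_Playerlistt20">
  <table class="playersTable">
    <tr class="">
      <td class="divider"><a href="/ci/content/player/26421.html">R Ashwin</a></td>
      <td class="divider"><a href="/ci/content/player/27223.html">STR Binny</a></td>
      <td class=""><a href="/ci/content/player/625383.html">JJ Bumrah</a></td>
    </tr>
    <tr class="odd">
      <td class="divider"><a href="/ci/content/player/430246.html">YS Chahal</a></td>
    </tr>
  </table>
</div>
"""


def player_links(html, parser="html.parser"):
    """Return absolute URLs for every anchor inside the t20 player list div.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, parser)
    # Match any <a> with an href attribute below the div with this id.
    anchors = soup.select("#rectPlyr_Playerlistt20 a[href]")
    # The hrefs are site-relative, so join them with the base URL.
    return [urljoin(BASE_URL, a["href"]) for a in anchors]


links = player_links(SAMPLE_HTML)
for link in links:
    print(link)
```

For the live page you would pass the downloaded markup and, as suggested above, 'html5lib' or 'lxml' as the parser; both have to be installed separately from BeautifulSoup.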