2017-01-10 3 views
1

Scraping tables with Python

I'm trying to scrape tables and convert them into data tables in Python, but I'm having little luck with US election data. This is the HTML of the data I want to scrape.

<tr class="type-republican"> 
<th class="results-name" scope="row"><a href="xxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Republican">R</abbr></span> <span class="token token-winner"><b aria-hidden="true" class="icon icon-check"></b> <span class="icon-text">Winner</span></span> D. Trump</span></a></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">62.9%</span><span class="graph"><span class="bar"><span class="index" style="width:62.9%;"></span></span></span></span></td> 
<td class="results-popular">1,306,925</td> 
<td class="delegates-cell">9</td> 
</tr> 
<tr class="type-democrat"> 
<th class="results-name" scope="row"><a href="xxxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Democratic">D</abbr></span> H. Clinton</span></a></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">34.6%</span><span class="graph"><span class="bar"><span class="index" style="width:34.6%;"></span></span></span></span></td> 
<td class="results-popular">718,084</td> 
<td class="delegates-cell"></td> 
</tr> 
<tr class="type-independent"> 
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> G. Johnson</span></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">2.1%</span><span class="graph"><span class="bar"><span class="index" style="width:2.1%;"></span></span></span></span></td> 
<td class="results-popular">43,869</td> 
<td class="delegates-cell"></td> 
</tr> 
<tr class="type-independent"> 
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> J. Stein</span></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">0.4%</span><span class="graph"><span class="bar"><span class="index" style="width:0.4%;"></span></span></span></span></td> 
<td class="results-popular">9,287</td> 
<td class="delegates-cell"></td> 
</tr> 
</tbody> 
</table>, <table class="results-table"> 
<tbody> 
<tr class="type-republican"> 
<th class="results-name" scope="row"><a href="xxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Republican">R</abbr></span> D. Trump</span></a></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">73.4%</span><span class="graph"><span class="bar"><span class="index" style="width:73.4%;"></span></span></span></span></td> 
<td class="results-popular">18,110</td> 
</tr> 
<tr class="type-democrat"> 
<th class="results-name" scope="row"><a href="xxxxxx"><span class="name-combo"><span class="token token-party"><abbr title="Democratic">D</abbr></span> H. Clinton</span></a></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">24.0%</span><span class="graph"><span class="bar"><span class="index" style="width:24.0%;"></span></span></span></span></td> 
<td class="results-popular">5,908</td> 
</tr> 
<tr class="type-independent"> 
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> G. Johnson</span></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">2.2%</span><span class="graph"><span class="bar"><span class="index" style="width:2.2%;"></span></span></span></span></td> 
<td class="results-popular">538</td> 
</tr> 
<tr class="type-independent"> 
<th class="results-name" scope="row"><span class="name-combo"><span class="token token-party"><abbr title="Independent">I</abbr></span> J. Stein</span></th> 
<td class="results-percentage"><span class="percentage-combo"><span class="number">0.4%</span><span class="graph"><span class="bar"><span class="index" style="width:0.4%;"></span></span></span></span></td> 
<td class="results-popular">105</td> 
</tr> 
</tbody> 

And so on... This is what my code looks like:

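import requests
from bs4 import BeautifulSoup

Percentage = []
Count = []
page = requests.get('xxxx')
soup = BeautifulSoup(page.text, "lxml")
table = soup.find('div', class_='content-alpha')
for row in table.find_all('tr'):
    col = row.find_all('td')
    if len(col) < 2:  # skip rows that have no data cells
        continue
    # Append to the lists instead of overwriting them on every row
    Percentage.append(col[0].get_text(strip=True))
    Count.append(col[1].get_text(strip=True))
print(Count)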

But this only gives me the data from a few of the tables, not all of them. How can I retrieve the data from every table? And why do I only get data from a few of them?

I hope the question is clear.

The HTML is really large, so I'm adding a link to the website instead: http://www.politico.com/2016-election/results/map/president/alabama/. I want to scrape the 2016 US election data for every county in Alabama.

+0

The class 'content-alpha' does not appear in the data you posted here. Can you update the data you want to scrape and the expected results? – Stergios

+0

It's much easier for us to help you if you provide the URL you are trying to scrape – wpercy

+0

I've added the link to the website. – Extria

Answer

1

After some time I managed to scrape all the data from this website. The main problem was that the page content is rendered by JavaScript, so I couldn't scrape it with BeautifulSoup alone. I therefore used selenium + beautifulsoup4 to get the rendered HTML of the page and scrape that.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_path = r"C:\Users\Desktop\chromedriver_win32\chromedriver.exe"
# Selenium 3 style call; Selenium 4+ expects webdriver.Chrome(service=Service(chrome_path))
driver = webdriver.Chrome(chrome_path)
driver.get('http://www.politico.com/2016-election/primary/results/map/president/arizona/')
time.sleep(60)  # the page loads very slowly; wait for the JavaScript to render it
# Scroll to the bottom so that content loaded on scroll is rendered as well
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
for posts in soup.find_all('table', {'class': 'results-table'}):
    for tr in posts.find_all('tr'):
        popular = list(tr.stripped_strings)
        print(popular)

Because it is a dynamic web page, I had to simulate a few things with Selenium, such as scrolling down. I used time.sleep(60) to give the page time to load. It loads really slowly, which is why I set the wait to 60 s. Hope this helps someone.
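The lists printed by the loop above vary in length: some rows carry a "Winner" marker or a delegate count, others don't. If the goal is a table of numbers rather than strings, one option is to use the percentage cell as an anchor. This is only a sketch: `row_to_record` is a made-up helper name, and the sample rows are copied from the HTML in the question.

```python
def row_to_record(strings):
    """Turn one row's stripped strings into a dict with numeric fields.

    Rows differ in length (party letter, optional "Winner" marker,
    optional delegate count), so the cell ending in "%" is the anchor:
    the name sits right before it, the vote count right after it.
    """
    pct_index = next(i for i, s in enumerate(strings) if s.endswith("%"))
    return {
        "name": strings[pct_index - 1],
        "percentage": float(strings[pct_index].rstrip("%")),
        "votes": int(strings[pct_index + 1].replace(",", "")),
    }


# Sample rows as tr.stripped_strings would yield them for the question's HTML.
rows = [
    ["R", "Winner", "D. Trump", "62.9%", "1,306,925", "9"],
    ["D", "H. Clinton", "34.6%", "718,084"],
    ["I", "J. Stein", "0.4%", "9,287"],
]

records = [row_to_record(r) for r in rows]
for record in records:
    print(record)
```

A list of such dicts can be handed to whatever table library you prefer; a row without a percentage cell would raise StopIteration here, so header rows would need to be skipped first.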

0
import requests
import bs4

r = requests.get('http://www.politico.com/2016-election/results/map/president/alabama/')
soup = bs4.BeautifulSoup(r.text, 'lxml')
contents = soup.find(class_='contrast-white')
# Each 'results-group' block holds one county: a title plus a results table
for table in contents.find_all(class_='results-group'):
    title = table.find(class_='title').text
    for tr in table.find_all('tr'):
        # For county rows the strings are: party letter, name, percentage, votes
        _, name, percentage, popular = list(tr.stripped_strings)
        print(title, name, percentage, popular)

Output:

Autauga County D. Trump 73.4% 18,110 
Autauga County H. Clinton 24.0% 5,908 
Autauga County G. Johnson 2.2% 538 
Autauga County J. Stein 0.4% 105 
Baldwin County D. Trump 77.4% 72,780 
Baldwin County H. Clinton 19.6% 18,409 
Baldwin County G. Johnson 2.6% 2,448 
Baldwin County J. Stein 0.5% 453 
Barbour County D. Trump 52.3% 5,431 
Barbour County H. Clinton 46.7% 4,848 
Barbour County G. Johnson 0.9% 93 
Barbour County J. Stein 0.2% 18 
Bibb County D. Trump 77.0% 6,733 
Bibb County H. Clinton 21.4% 1,874 
Bibb County G. Johnson 1.4% 124 
Bibb County J. Stein 0.2% 17 
Blount County D. Trump 89.9% 22,808 
Blount County H. Clinton 8.5% 2,150 
Blount County G. Johnson 1.3% 337 
Blount County J. Stein 0.4% 89 
Bullock County H. Clinton 75.1% 3,530 
Bullock County D. Trump 24.2% 1,139 
Bullock County G. Johnson 0.5% 22 
Bullock County J. Stein 0.2% 10 
Butler County D. Trump 56.3% 4,891 
Butler County H. Clinton 42.8% 3,716 
Butler County G. Johnson 0.7% 65 
Butler County J. Stein 0.1% 13 
Calhoun County D. Trump 69.2% 32,803 
Calhoun County H. Clinton 27.9% 13,197 
Calhoun County G. Johnson 2.4% 1,114 
Calhoun County J. Stein 0.6% 262 
Chambers County D. Trump 56.6% 7,803 
Chambers County H. Clinton 41.8% 5,763 
Chambers County G. Johnson 1.2% 168 
Chambers County J. Stein 0.3% 44 
Cherokee County D. Trump 83.9% 8,809 
Cherokee County H. Clinton 14.5% 1,524 
Cherokee County G. Johnson 1.4% 145 
Cherokee County J. Stein 0.2% 25 

The rest of the tables are empty in the downloaded HTML; there is nothing in them.
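If you want to check this yourself, you could count the rows inside each results table of the HTML that requests returns. Below is a rough sketch using only the standard library. `TableRowCounter` and `count_rows_per_table` are names I made up, and `sample_html` is a hand-written stand-in for `r.text`, not the real page.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Count the <tr> rows inside each <table class="results-table">."""

    def __init__(self):
        super().__init__()
        self.row_counts = []   # one entry per results table, in page order
        self._in_table = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "results-table" in classes:
            self._in_table = True
            self.row_counts.append(0)
        elif tag == "tr" and self._in_table:
            self.row_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False


def count_rows_per_table(html):
    parser = TableRowCounter()
    parser.feed(html)
    return parser.row_counts


# Hypothetical stand-in for r.text: one filled table and one empty shell.
sample_html = """
<table class="results-table"><tbody>
<tr><th>D. Trump</th><td>73.4%</td><td>18,110</td></tr>
<tr><th>H. Clinton</th><td>24.0%</td><td>5,908</td></tr>
</tbody></table>
<table class="results-table"><tbody></tbody></table>
"""

counts = count_rows_per_table(sample_html)
empty = sum(1 for n in counts if n == 0)
print(f"{len(counts)} tables found, {empty} of them without rows")
```

Tables that show up with zero rows would be the ones that the page fills in later with JavaScript, which a plain HTTP request never sees.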

+0

Thank you for the answer. I'm new to Python, so I still have the same question: why does it only scrape part of the page, up to Cherokee County? – Extria

+0

@Extria I've updated my answer. –

+0

So there is no way to scrape the rest of the counties? – Extria