2017-06-28 2 views

Cannot find the desired link for download (Python BeautifulSoup)

I'm fairly new to Python's Beautiful Soup, and I don't know much about HTML or JS. I tried to use bs4 to download all the xls files on this page, but bs4 doesn't seem to be able to find the links under the "Attachment" section. Could someone help me?

My current code is:

""" 
Scraping of all county-level raw data from
http://www.countyhealthrankings.org for all years. Data stored in RawData
folder.
Code modified from
https://null-byte.wonderhowto.com/how-to/download-all-pdfs-webpage-with-python-script-0163031/
"""

# Python 2 code (urllib2, urlparse, print statement)
from bs4 import BeautifulSoup
import urlparse
import urllib2
import os
import sys

def getAllLinks(url):
    """Return every <a> tag with an href found in the page's static HTML."""
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read(), "html.parser")
    links = soup.find_all('a', href=True)
    return links

def download(links):
    for link in links:
        #raw_input("Press Enter to continue...")
        #print link
        #print "------------------------------------"
        #print os.path.splitext(os.path.basename(link['href']))
        #print "------------------------------------"
        #print os.path.splitext(os.path.basename(link['href']))[1]
        suffix = os.path.splitext(os.path.basename(link['href']))[1]
        if suffix == '.xls':
            print link  # cannot find anything
            # urlopen needs the URL string, not the Tag object
            currentLink = urllib2.urlopen(link['href'])

links = getAllLinks("http://www.countyhealthrankings.org/app/iowa/2017/downloads")
download(links)

(By the way, the link I'm looking for looks like this.)

Thanks!

Answer


This seems to be one of those tasks for which BeautifulSoup (on its own, at least) is inadequate. You can, however, do it with Selenium.

>>> from selenium import webdriver 
>>> driver = webdriver.Chrome() 
>>> driver.get('http://www.countyhealthrankings.org/app/iowa/2017/downloads') 
>>> links = driver.find_elements_by_xpath('.//span[@class="file"]/a') 
>>> len(links) 
30 
>>> for link in links: 
...  link.get_attribute('href') 
...  
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2017_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20County%20Health%20Rankings%20Iowa%20Data%20-%20v1.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2016_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20County%20Health%20Rankings%20Iowa%20Data%20-%20v3.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2015_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20County%20Health%20Rankings%20Iowa%20Data%20-%20v3.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2014_IA_v2.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20County%20Health%20Rankings%20Iowa%20Data%20-%20v6.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/states/CHR2013_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20County%20Health%20Ranking%20Iowa%20Data%20-%20v1_0.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/states/CHR2012_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/states/CHR2011_IA.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20Health%20Outcomes%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20Health%20Factors%20-%20Iowa.png' 
'http://www.countyhealthrankings.org/sites/default/files/states/CHR2010_IA_0.pdf' 
'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2010%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls' 
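The session above returns every file type (pdf, xls, png), while the question only asks for the xls files. A minimal sketch of the remaining step, in Python 3 and using only the standard library, might look like this. The helper names `xls_links`, `local_name`, and `download_all` are made up for illustration, the `RawData` folder name is taken from the question's docstring, and the URLs may no longer resolve.

```python
import os
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve


def xls_links(hrefs):
    """Keep only the hrefs whose URL path ends in .xls (case-insensitive)."""
    return [h for h in hrefs if urlsplit(h).path.lower().endswith(".xls")]


def local_name(href):
    """Turn a percent-encoded URL into a readable local file name."""
    return unquote(os.path.basename(urlsplit(href).path))


def download_all(hrefs, folder="RawData"):
    """Save every .xls href into `folder` and return the paths written."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for href in xls_links(hrefs):
        target = os.path.join(folder, local_name(href))
        urlretrieve(href, target)  # network access happens here
        saved.append(target)
    return saved


# Two hrefs copied from the session output above.
hrefs = [
    "http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2017_IA.pdf",
    "http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20County%20Health%20Rankings%20Iowa%20Data%20-%20v1.xls",
]

# Only filters and renames; nothing is downloaded until download_all() is called.
print([local_name(h) for h in xls_links(hrefs)])
```

This prints `['2017 County Health Rankings Iowa Data - v1.xls']`. To fetch the files, one would pass the full list collected with `link.get_attribute('href')` to `download_all`.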

Thanks, Bill. That seems to work! Just curious: do you know why BeautifulSoup doesn't work well in this case? – jliu


That should have been in my answer. I was suspicious because your code was fine. I tried using BeautifulSoup to find all the links on the page and print their hrefs. None of them was what we wanted, which suggested to me that the page probably uses Ajax to load its own content. That's practically the norm these days. You can still use BeautifulSoup, but often you have to load a page's DOM using the facilities of a product like Selenium. BeautifulSoup can't process what hasn't been loaded into the HTML. –
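The combination described in this comment, letting a browser build the DOM and then parsing it with BeautifulSoup, could be sketched roughly as follows (Python 3). The function names and `SAMPLE_HTML` are made up for illustration. The sample markup is an assumption modeled on the XPath used in the answer, not the site's real HTML. The Selenium part assumes Chrome and a matching driver are available.

```python
from bs4 import BeautifulSoup


def file_links(html):
    """Return the href of every <a> that is a direct child of a span with class "file".

    Note: the CSS selector also matches spans with additional classes,
    whereas the XPath in the answer requires class to be exactly "file".
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("span.file > a[href]")]


def rendered_html(url):
    """Load `url` in a real browser and return the DOM after scripts have run."""
    # Imported lazily so file_links() is usable without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # If the content arrives late, an explicit wait may be needed here.
        return driver.page_source
    finally:
        driver.quit()


# Hypothetical markup standing in for the rendered page.
SAMPLE_HTML = """
<div>
  <span class="file"><a href="/files/data.xls">data</a></span>
  <span class="other"><a href="/files/ignored.xls">ignored</a></span>
</div>
"""

print(file_links(SAMPLE_HTML))
```

On the real page the call would be `file_links(rendered_html(url))`. Parsing the response of a plain HTTP request instead would only see the static HTML, which is why the question's code found no xls links.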


Uh OK. Thank you very much :)) – jliu