2013-01-23

Python: How to extract URLs from an HTML page with BeautifulSoup?

I have an HTML page with several divs like:

<div class="article-additional-info"> 
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t... 
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"> 
<span class="arrows">»</span> 
</a> 
</div> 

<div class="article-additional-info"> 
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe... 
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"> 
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments"> 
</div> 

and I need the <a href=> value for all divs with class article-additional-info. I am new to BeautifulSoup, so I need to get the URLs:
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece" 
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece" 

What is the best way to achieve this?

Answers

By your criteria there are three URLs (not two); did you want to filter out the third?

The basic idea is to iterate through the HTML, pull out only the elements with that class, and then loop over all the links inside each of them, extracting the actual URLs:

In [1]: from bs4 import BeautifulSoup

In [2]: html = # your HTML

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print(link.get('href'))
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 

This restricts the search to just those elements with the class article-additional-info, and inside each of them looks for all anchor (a) tags and grabs their corresponding href link.
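If only the two article URLs are wanted and not the #comments link, the search can be narrowed to the a.more anchors instead of all anchors. A minimal sketch, using placeholder HTML that mimics the structure in the question:

```python
from bs4 import BeautifulSoup

# Placeholder HTML mimicking the divs in the question
html = """
<div class="article-additional-info">
  <a class="more" href="http://example.com/article1"></a>
</div>
<div class="article-additional-info">
  <a class="more" href="http://example.com/article2"></a>
  <a class="commentsCount" href="http://example.com/article2#comments"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Only anchors with class "more" inside the target divs,
# so the #comments anchor is never matched
urls = [a["href"] for a in soup.select("div.article-additional-info a.more")]
print(urls)
```

This yields exactly one URL per div, matching the two URLs the question asks for.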

from bs4 import BeautifulSoup as BS

html = # Your HTML
soup = BS(html, 'html.parser')
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print(links.get('href'))

Which prints:

http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece  
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments 
Working from the documentation, I did it the following way. Thank you all for your answers; I appreciate them.

>>> from bs4 import BeautifulSoup
>>> from urllib.request import urlopen
>>> f = urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f, 'html.parser')
>>> for link in soup.select('.article-additional-info'):
...     print(link.find('a').attrs['href'])
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece 
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece 
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece 
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece 
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece 
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece 
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece 
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece 
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article.ece 
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece 
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece 
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece 
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece 
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece 
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece 
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece 
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece 
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece 
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece 
>>> 
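Note that link.find('a') grabs only the first anchor in each div, which is why no #comments URLs appear in this output. A CSS-selector equivalent of that first-anchor trick, sketched here with placeholder HTML, uses select_one:

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the fetched page
html = """
<div class="article-additional-info">
  <a class="more" href="http://example.com/a1"></a>
  <a class="commentsCount" href="http://example.com/a1#comments"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
hrefs = []
for div in soup.select(".article-additional-info"):
    first = div.select_one("a")  # first anchor only, like find('a')
    hrefs.append(first["href"])
print(hrefs)
```

Because only the first anchor per div is taken, the #comments link is skipped even though it sits inside the same div.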
Comment: Please don't link to your own site any more; it is [**spam**](http://stackoverflow.com/help/promotion) on [so]. –
