2016-12-10 4 views
0

Beautiful Soup: Descending links of links that match patterns

I know this is a very green question, but I'm trying to descend through the links on a website and collect the links of links, with the condition that the links followed at each level match some simple pattern. I have seen some tutorials on listing links, but none on pattern matching or on descending into links of links. Some help would be appreciated.

For example, in this case:

from bs4 import BeautifulSoup 
import urllib2 

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks") 
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset')) 

for link in soup.find_all('a', href=True): 
    print link['href'] 

Out:

/contact-gpsbasecamp.php 
/privacy-policy.php 
/terms-of-service.php 
/
       National-Parks/map 
/National-Historic-Parks 
/National-Historic-Sites 
/National-Monuments 
/Other-NPS-Facilities 
national-parks/Acadia_National_Park 
national-parks/Arches_National_Park 
national-parks/Badlands_National_Park 
national-parks/Big_Bend_National_Park 
national-parks/Biscayne_National_Park 
national-parks/Black_Canyon_Of_The_Gunnison_National_Park 
national-parks/Bryce_Canyon_National_Park 
national-parks/Canyonlands_National_Park 
national-parks/Capitol_Reef_National_Park 
national-parks/Carlsbad_Caverns_National_Park 
national-parks/Channel_Islands_National_Park 
national-parks/Congaree_National_Park 
national-parks/Crater_Lake_National_Park 
national-parks/Cuyahoga_Valley_National_Park 
national-parks/Death_Valley_National_Park 
national-parks/Denali_National_Park_and_Preserve 
national-parks/Dry_Tortugas_National_Park 
national-parks/Everglades_National_Park 
national-parks/Gates_Of_The_Arctic_National_Park_and_Preserve 
national-parks/Glacier_Bay_National_Park_and_Preserve 
national-parks/Glacier_National_Park 
national-parks/Grand_Canyon_National_Park 
national-parks/Grand_Teton_National_Park 
national-parks/Great_Basin_National_Park 
national-parks/Great_Smoky_Mountains_National_Park 
national-parks/Guadalupe_Mountains_National_Park 
national-parks/Haleakala_National_Park 
national-parks/Hawaii_Volcanoes_National_Park 
national-parks/Hot_Springs_National_Park 
national-parks/Isle_Royale_National_Park 
national-parks/Joshua_Tree_National_Park 
national-parks/Katmai_National_Park_and_Preserve 
national-parks/Kenai_Fjords_National_Park 
national-parks/Kings_Mountain_National_Military_Park 
national-parks/Kobuk_Valley_National_Park 
national-parks/Lake_Clark_National_Park_and_Preserve 
national-parks/Lassen_Volcanic_National_Park 
national-parks/Mammoth_Cave_National_Park 
national-parks/Mesa_Verde_National_Park 
national-parks/Mount_Rainier_National_Park 
national-parks/National_Park_of_American_Samoa 
national-parks/National_Parks_of_New_York_Harbor 
national-parks/North_Cascades_National_Park 
national-parks/Olympic_National_Park 
national-parks/Petrified_Forest_National_Park 
national-parks/Redwood_National_and_State_Parks 
national-parks/Rocky_Mountain_National_Park 
national-parks/Saguaro_National_Park 
national-parks/Sequoia_and_Kings_Canyon_National_Parks 
national-parks/Shenandoah_National_Park 
national-parks/Theodore_Roosevelt_National_Park 
national-parks/Virgin_Islands_National_Park 
national-parks/Voyageurs_National_Park 
national-parks/Wind_Cave_National_Park 
national-parks/Wolf_Trap_National_Park_for_the_Performing_Arts 
national-parks/Wrangell_-_St_Elias_National_Park_and_Preserve 
national-parks/Yellowstone_National_Park 
national-parks/Yosemite_National_Park 
national-parks/Zion_National_Park 
http://www.gpsbasecamp.com 
http://www.gpsbasecamp.com 
/upload-gps-file.php 
/download-gps-file.php 
/national-parks 
/state-parks 


/mp3/index.php 

How can I then descend into all of the "national-parks" links to get information from the pages at the next level?

Thanks for your help!

+0

What do you mean by descending? I don't know what you're asking. – Stats4224

+0

By descending I mean: follow the link to another link and then retrieve information from the destination web page. Thanks! – user3654387

Answers

0

I think this is the functionality you are looking for: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments

from bs4 import BeautifulSoup 
import urllib2 
import re 

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks") 
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset')) 

# Keep only <a> tags whose href contains "national-parks"
nat_parks_links = [link['href'] for link in soup.find_all('a', href=re.compile("national-parks"))]

Then you can visit each link in turn however you like. (I haven't actually tested the code above.)
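Before visiting each link, the relative hrefs have to be turned into absolute URLs. Here is a minimal sketch of that step, assuming Python 3 and the standard-library `urljoin`. The helper name `absolute_links` and the inline sample HTML are made up for illustration and are not part of the code above.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.gpsbasecamp.com/national-parks"


def absolute_links(html, base_url, pattern):
    """Return absolute URLs for every <a> whose href matches pattern."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", href=re.compile(pattern))
    # urljoin resolves a relative href against the page it was found on.
    return [urljoin(base_url, a["href"]) for a in anchors]


# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<a href="/privacy-policy.php">Privacy</a>
<a href="national-parks/Acadia_National_Park">Acadia</a>
<a href="national-parks/Zion_National_Park">Zion</a>
"""

for url in absolute_links(sample_html, BASE_URL, r"^national-parks/"):
    print(url)
```

The pattern is anchored with `^` here so that site-navigation links such as `/national-parks` are not picked up as well. Whether that is what you want depends on the page.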

+0

Check out this excellent video, which shows some of what you're trying to do! https://youtu.be/N0ph2a6Vd7M You can ignore the multiprocessing part, but the web crawl is basically what you're doing, by the sound of it. – Stats4224

1

Method 1:

for link in soup.select('a[href^="national-parks"]'): 
    print(link['href'])

Method 2:

import re 
for link in soup.find_all('a', href=re.compile(r"^national-parks")): 
    print(link['href']) 

Both methods match href values that start with 'national-parks'.

Out:

national-parks/Acadia_National_Park 
national-parks/Arches_National_Park 
national-parks/Badlands_National_Park 
national-parks/Big_Bend_National_Park 
national-parks/Biscayne_National_Park 
national-parks/Black_Canyon_Of_The_Gunnison_National_Park 
national-parks/Bryce_Canyon_National_Park 
national-parks/Canyonlands_National_Park 
national-parks/Capitol_Reef_National_Park 
national-parks/Carlsbad_Caverns_National_Park 
national-parks/Channel_Islands_National_Park 
national-parks/Congaree_National_Park 
national-parks/Crater_Lake_National_Park 
national-parks/Cuyahoga_Valley_National_Park 
national-parks/Death_Valley_National_Park 
national-parks/Denali_National_Park_and_Preserve 
national-parks/Dry_Tortugas_National_Park 
national-parks/Everglades_National_Park 
national-parks/Gates_Of_The_Arctic_National_Park_and_Preserve 
national-parks/Glacier_Bay_National_Park_and_Preserve 
national-parks/Glacier_National_Park 
national-parks/Grand_Canyon_National_Park 
national-parks/Grand_Teton_National_Park 
national-parks/Great_Basin_National_Park 
national-parks/Great_Smoky_Mountains_National_Park 
national-parks/Guadalupe_Mountains_National_Park 
national-parks/Haleakala_National_Park 
national-parks/Hawaii_Volcanoes_National_Park 
national-parks/Hot_Springs_National_Park 
national-parks/Isle_Royale_National_Park 
national-parks/Joshua_Tree_National_Park 
national-parks/Katmai_National_Park_and_Preserve 
national-parks/Kenai_Fjords_National_Park 
national-parks/Kings_Mountain_National_Military_Park 
national-parks/Kobuk_Valley_National_Park 
national-parks/Lake_Clark_National_Park_and_Preserve 
national-parks/Lassen_Volcanic_National_Park 
national-parks/Mammoth_Cave_National_Park 
national-parks/Mesa_Verde_National_Park 
national-parks/Mount_Rainier_National_Park 
national-parks/National_Park_of_American_Samoa 
national-parks/National_Parks_of_New_York_Harbor 
national-parks/North_Cascades_National_Park 
national-parks/Olympic_National_Park 
national-parks/Petrified_Forest_National_Park 
national-parks/Redwood_National_and_State_Parks 
national-parks/Rocky_Mountain_National_Park 
national-parks/Saguaro_National_Park 
national-parks/Sequoia_and_Kings_Canyon_National_Parks 
national-parks/Shenandoah_National_Park 
national-parks/Theodore_Roosevelt_National_Park 
national-parks/Virgin_Islands_National_Park 
national-parks/Voyageurs_National_Park 
national-parks/Wind_Cave_National_Park 
national-parks/Wolf_Trap_National_Park_for_the_Performing_Arts 
national-parks/Wrangell_-_St_Elias_National_Park_and_Preserve 
national-parks/Yellowstone_National_Park 
national-parks/Yosemite_National_Park 
national-parks/Zion_National_Park
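
To actually descend one level, each matched href can be resolved against the start page and fetched in turn. The following is only a sketch, assuming Python 3. The names `fetch_html` and `descend` are hypothetical, and reading the `<title>` of each page is just a stand-in for whatever information you want, since the question doesn't say.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (standard library only)."""
    with urlopen(url) as resp:
        return resp.read()


def descend(start_url, pattern, fetch=fetch_html):
    """Follow every link on start_url whose href matches pattern.

    Returns a dict mapping each followed URL to the <title> text of that
    page, or None if the page has no <title>.
    """
    soup = BeautifulSoup(fetch(start_url), "html.parser")
    titles = {}
    for a in soup.find_all("a", href=re.compile(pattern)):
        url = urljoin(start_url, a["href"])  # make the relative href absolute
        if url in titles:  # skip links that appear more than once
            continue
        page = BeautifulSoup(fetch(url), "html.parser")
        titles[url] = page.title.get_text(strip=True) if page.title else None
    return titles
```

Calling `descend("http://www.gpsbasecamp.com/national-parks", r"^national-parks/")` would request the start page plus one page per matched link. The `fetch` parameter is there so the download step can be replaced, for example with a cached or rate-limited version. To go deeper than one level, the same function could be applied again to each returned URL.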