2017-12-31 34 views

I need some help with the BeautifulSoup library for web scraping. Complete web scraping with BeautifulSoup

I need to extract the text from this web page: http://thehill.com/.../365407-sean-diddy-combs-wants-to-buy-c ...

My goal is to extract the text exactly as it appears on the web page. To do that, I extract all "p" tags and their text, but inside the "p" tags there are "a" tags that also contain some text.

So my questions are:

  1. How do I convert the Unicode ("") into normal strings, matching the text on the web page? When I extract only the "p" tags, the BeautifulSoup library converts the text to Unicode, and even the special characters are Unicode, so I want to convert the extracted Unicode text into normal text. How can I do that?

  2. How do I extract the text in "p" tags that contain "a" tags? I mean, I want the complete text inside the "p" tags, including the text inside the nested tags.

I tried the following code:

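import requests
from bs4 import BeautifulSoup

html = requests.get("http://thehill.com/…/365407-sean-diddy-combs-wants-to-buy-c…").content 
news_soup = BeautifulSoup(html, "html.parser") 
a_text = news_soup.find_all('p')  # every <p> tag on the page

# find_all() returns a list-like ResultSet, which has no .string attribute, so this line raises AttributeError
y = a_text[1].find_all('a').string 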

Answer


You can use a nested list comprehension to find all the links inside the paragraph tags, and use encode("ascii", 'ignore') to drop the non-ASCII characters from the Unicode text:

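# Python 2 code: in Python 3, urlopen lives in urllib.request
import urllib 
from bs4 import BeautifulSoup as soup 
s = soup(str(urllib.urlopen('http://thehill.com/blogs/blog-briefing-room/365407-sean-diddy-combs-wants-to-buy-carolina-panthers-and-sign-kaepernick').read()), 'lxml') 
# Full text of each <p>, nested tags included, with non-ASCII characters dropped
all_text = [i.text.encode("ascii", 'ignore') for i in s.find_all('p')] 
# Text of the <a> tags inside each <p>; filter(None, ...) drops paragraphs that contain no links
all_paragraphs = filter(None, [[b.text.encode("ascii", 'ignore') for b in i.find_all('a')] for i in s.find_all('p')]) 
print(all_text) 
print(all_paragraphs) 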

Output:

['Hip hop mogul Sean Diddy Combs said Sunday night hes interested in buying the Carolina Panthers and signing quarterback Colin Kaepernick, who has been unemployed this season after kneeling during the national anthem in 2016.', 'Panthers owner Jerry Richardson announced Sunday he would be selling the team after the 2017 season, just hours after Sports Illustrated published accusations of sexual misconduct from former employees. Richardson also allegedly used a racial slur about a team scout.', 'Diddy took to Twitter soon after the Panthers announced the upcoming sale, declaring his desire to own a team and increase diversity among NFL ownership.', 'I would like to buy the @Panthers. Spread the word. Retweet!', 'There are no majority African American NFL owners. Lets make history.', '', 'Kaepernick respondedSundaymorning, saying I want in on the ownership group!', 'I want in on the ownership group! Lets make it happen!', 'Other athletes, including NBA starStephen Curryandformer NFL playerGreg Jennings,responded to Combs saying they were interested in part-owning the team.', "Former league MVP Cam Newton is the team's current quarterback.", 'Kaepernick has been a free agent since the end of the 2016 season, when he made headlinesfor kneeling during the national anthem before games to protest issues of racial inequality.', 'President TrumpDonald John TrumpHouse Democrat slams Donald Trump Jr. for serious case of amnesia after testimony Skier Lindsey Vonn: I dont want to represent Trump at Olympics Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia MORE hascriticized Kaepernick directly, saying the NFL should have suspended him for the demonstration. He has since taken aim at other players who have knelt or sat during the anthem during the 2017 season.', '- This story was updated at 11:03 A.M. EST.', 'View the discussion thread.', 'The Hill 1625 K Street, NW Suite 900 Washington DC 20006 | 202-628-8500 tel | 202-628-8503 fax', 'The contents of this site are 2017 Capitol Hill Publishing Corp., a subsidiary of News Communications, Inc.'] 
[['Sports Illustrated'], ['@Panthers'], ['Stephen Curry', 'former NFL player'], ['President Trump', 'Donald John Trump', 'House Democrat slams Donald Trump Jr. for serious case of amnesia after testimony', 'Skier Lindsey Vonn: I dont want to represent Trump at Olympics', 'Poll: 4 in 10 Republicans think senior Trump advisers had improper dealings with Russia', 'MORE', 'criticized Kaepernick directly', 'knelt or sat'], ['View the discussion thread.']] 
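
On Python 3, where every str is already Unicode text, the same idea can be sketched without dropping characters by default. The snippet below is only an illustration under stated assumptions: SAMPLE_HTML is made-up markup standing in for the article page (nothing is fetched from thehill.com), and paragraph_texts and to_ascii are hypothetical helper names, not part of BeautifulSoup. It uses get_text() to collect the text of each "p" tag together with the text of any nested "a" tags, and unicodedata.normalize to turn non-breaking spaces into ordinary spaces.

```python
import unicodedata

from bs4 import BeautifulSoup

# Hypothetical sample markup; it only imitates the structure of the article page.
SAMPLE_HTML = """
<div>
  <p>Owner announced the sale after <a href="#">Sports Illustrated</a> published a report.</p>
  <p>\u201cI want in!\u201d said the\u00a0player.</p>
  <p></p>
</div>
"""


def paragraph_texts(html):
    """Return the text of every non-empty <p>, including text in nested tags."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for p in soup.find_all("p"):
        # get_text() concatenates all text nodes below the tag, so <a> text is kept.
        text = p.get_text()
        # NFKC maps compatibility characters such as U+00A0 (no-break space) to plain ones.
        text = unicodedata.normalize("NFKC", text).strip()
        if text:
            texts.append(text)
    return texts


def to_ascii(text):
    """Optional lossy step: silently drop every non-ASCII character."""
    return text.encode("ascii", "ignore").decode("ascii")


texts = paragraph_texts(SAMPLE_HTML)
for t in texts:
    print(t)
print(to_ascii(texts[1]))
```

Keeping the text as str preserves curly quotes and apostrophes, which the ASCII step discards (compare "hes" and "Lets" in the output above), so to_ascii is probably best applied only when plain ASCII is really required.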

Nice, thank you very much.