2016-05-19 6 views
0

Trying to find all text between multiple span tags with BeautifulSoup

I am trying to get a block of text from an article (http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK). The relevant section of the page's HTML is:

<span id="midArticle_start"></span> 

<span id="midArticle_0"></span> 
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span> 
    Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.</p></span> 
<span id="midArticle_1"></span><p>Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.</p> 
<span id="midArticle_2"></span><p>President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade.</p> 
<span id="midArticle_3"></span><p>Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government.</p> 
<span id="midArticle_4"></span><p>"Though (sanctions) are not meant to have a blanket effect on the country, their intended targets often play outsize roles ... controlling critical infrastructure impacting trade and business for ordinary citizens," said Nyantha Maw Lin, managing director at consultancy Vriens & Partners in Yangon.</p> 
<span id="midArticle_5"></span><p>On Tuesday, Washington eased some restrictions on Myanmar but also strengthened measures against Law by adding six firms connected to him and his conglomerate, Asia World, to the Treasury blacklist.</p> 
<span id="midArticle_6"></span><p>Yet the blacklisting, which attracted considerable attention in Myanmar, looks like a formality given that the companies were already covered by sanctions, because they were owned 50 percent or more by Law or Asia World. Law was sanctioned in 2008 for alleged ties to Myanmar's military.</p> 
<span id="midArticle_7"></span><p>More important for Law was the U.S. decision to further ease restrictions on trading through his shipping port and airports, extending a temporary six month allowance set in December to an indefinite one.</p> 
<span id="midArticle_8"></span><p></p> 
<span id="midArticle_9"></span><p>PORTS BACK IN FAVOR</p> 
<span id="midArticle_10"></span><p>Law is one of the most powerful and well-connected businessmen in Myanmar with close ties to China.</p> 
<span id="midArticle_11"></span><p>He is not, however, universally popular at home or abroad because of alleged ties to the military, which ruled Myanmar with an iron fist until 2011.</p> 
<span id="midArticle_12"></span> 
This is the section I would like to extract.

The end goal is to have each sentence as a separate object that I can use later, for example:

print(sentence1) 

~ Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.

print(sentence2) 

~ Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.

My code retrieves only the first sentence and nothing after it, as shown below:

import requests 
from bs4 import BeautifulSoup 
z = requests.get("http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK/") 
url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK' 
response2 = requests.get(url2) 

soup2 = BeautifulSoup(response2.content, "html.parser") 
first_sentence = soup2.p.get_text() 
print(first_sentence) 
second_sentence = soup2.p.find_all_next() 
print(second_sentence) 

If anyone could help me figure out how to get all the sentences individually, it would be greatly appreciated. I have already tried the methods discussed in other Stack Overflow questions: "Finding next occuring tag and its enclosed text with Beautiful Soup" and "Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)".

Answers

0

Your problem may be that find_all_next() returns every match that appears after the starting element (the <p> matched earlier). Since you have not specified which tag to match, it matches everything.

If you change the call to soup2.p.find_all_next("p"), you get all remaining <p> tags on the page. You can then iterate over them (or assign them explicitly, if you prefer) with something like:

soup2 = BeautifulSoup(response2.content, "html.parser") 
first_sentence = soup2.p.get_text() 
print(first_sentence) 
for sentence in soup2.p.find_all_next("p"): 
    print(sentence.get_text()) 

This becomes even simpler if you drop the extra variables and use find_all() instead:

soup2 = BeautifulSoup(response2.content, "html.parser") 
for sentence in soup2.find_all("p"): 
    print(sentence.get_text()) 
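
To see the difference on a small scale, here is a minimal sketch that compares find_all_next() with and without a tag name. It runs against a shortened, hypothetical stand-in for the article markup (the HTML string and the variable names are assumptions, not the real page), so no network request is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup in the question.
HTML = """
<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span> First paragraph.</p></span>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p>Third paragraph.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")
first_p = soup.p  # the first <p> in the document

# Without a filter, every tag after the start point is returned,
# including the <span> markers.
all_following = [tag.name for tag in first_p.find_all_next()]

# With a tag name, only the later <p> elements are returned.
following_p = [tag.get_text() for tag in first_p.find_all_next("p")]

print(all_following)
print(following_p)
```

The unfiltered call mixes span and p tags, which is probably why the original output looked like one undifferentiated block.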
+0

How would I go about iterating through the block of text that print(sentence.get_text()) returns? Is it possible to assign each individual sentence to a value? – Shehzad

+0

It is probably easiest to collect them in a dictionary or a list for later processing. If the individual sentences do not need names, you can simply use 'sentences = list(map(lambda x: x.get_text(), soup2.find_all("p")))'
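
As a rough sketch of that suggestion, the paragraphs can be stored in a list and, if names are wanted, in a dictionary as well. The HTML string, the sentenceN key names, and the step that skips empty paragraphs are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup.
HTML = """
<span id="midArticle_0"></span><p>First paragraph.</p>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_8"></span><p></p>
<span id="midArticle_9"></span><p>Third paragraph.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Extract the text of every <p>, then drop the empty ones.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
paragraphs = [text for text in paragraphs if text]

# Optional: give each entry a name such as "sentence1", "sentence2", ...
named = {f"sentence{i}": text for i, text in enumerate(paragraphs, start=1)}

print(paragraphs[1])
print(named["sentence1"])
```

A list is usually enough when the entries are only accessed by position. The dictionary is useful only if the code really needs names such as sentence1.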

0

You can return only the <p> elements inside the <span> whose id equals 'articleText' by using the CSS selector #articleText p:

>>> import requests 
>>> from bs4 import BeautifulSoup 
>>> url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK' 
>>> response2 = requests.get(url2) 
>>> soup2 = BeautifulSoup(response2.content, "html.parser") 
>>> for sentence in soup2.select("#articleText p"): 
...  print(sentence.get_text()) 
...  print() 
... 
YANGON Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law. 

Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly. 

President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade. 

Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government. 

...... 
...... 
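
The same selector can be tried offline. This is a minimal sketch that assumes a simplified page in which the article body sits inside an element with id "articleText" and an unrelated paragraph sits outside it. The markup is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one wrapper with id="articleText" and a paragraph outside it.
HTML = """
<span id="articleText">
  <span id="midArticle_0"></span><p>First paragraph.</p>
  <span id="midArticle_1"></span><p>Second paragraph.</p>
</span>
<p>Footer text outside the article.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "#articleText p" matches <p> elements that are descendants of the wrapper only.
article_paragraphs = [p.get_text() for p in soup.select("#articleText p")]

print(article_paragraphs)
```

Compared with find_all("p"), the selector leaves out paragraphs that belong to other parts of the page.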
0

You can try soup2.p.find_all_next(text=True), like this:

second_sentence = soup2.p.find_all_next(text=True) 
for item in second_sentence: 
    print(item.split('\n')) 
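
Note that this approach returns text nodes rather than tags, so the whitespace between elements comes back too. Here is a small sketch on hypothetical markup that filters those out. It uses string=True, the name the Beautiful Soup documentation gives this argument since version 4.4.0 (text=True is the older spelling):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the article markup.
HTML = """
<p>First paragraph.</p>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p>Third paragraph.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every text node after the opening of the first <p>, including bare newlines.
strings = soup.p.find_all_next(string=True)

# Keep only the nodes that contain something other than whitespace.
cleaned = [s.strip() for s in strings if s.strip()]

print(cleaned)
```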