The general question has been asked and answered in a few places: http://www.resolvinghere.com/sof/18408799.shtml
Pull text between two BeautifulSoup elements
How to get all text between just two specified tags using BeautifulSoup?
But when I try to implement it, I get really awkward strings.
My setup: I'm trying to pull transcript text from the presidential debates, and I thought I'd start here: http://www.presidency.ucsb.edu/ws/index.php?pid=111500
I can isolate just the transcript with
transcript = soup.find_all("span", class_="displaytext")[0]
The formatting of the transcript is not ideal. Every few lines of text are wrapped in a <p>, and a change of speaker is marked with a nested <b>. For example:
<p><b>TRUMP:</b> First of all, I have to say, as a businessman, I get along with everybody. I have business all over the world. [<i>booing</i>]</p>,
<p>I know so many of the people in the audience. And by the way, I'm a self-funder. I don't have — I have my wife and I have my son. That's all I have. I don't have this. [<i>applause</i>]</p>,
<p>So let me just tell you, I get along with everybody, which is my obligation to my company, to myself, et cetera.</p>,
<p>Obviously, the war in Iraq was a big, fat mistake. All right? Now, you can take it any way you want, and it took — it took Jeb Bush, if you remember at the beginning of his announcement, when he announced for president, it took him five days.</p>,
<p>He went back, it was a mistake, it wasn't a mistake. It took him five days before his people told him what to say, and he ultimately said, "It was a mistake." The war in Iraq, we spent $2 trillion, thousands of lives, we don't even have it. Iran has taken over Iraq, with the second-largest oil reserves in the world.</p>,
<p>Obviously, it was a mistake.</p>,
<p><b>DICKERSON:</b> So...</p>
But as I said, this is not a new problem: define a start tag and an end tag, iterate through the elements, and as long as current != next, append the text.
So I'm testing on a single element to get the details right.
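The approach described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: `SAMPLE_HTML` is a shortened, hypothetical stand-in for the real page, `text_between` is a helper name made up for illustration, and the built-in "html.parser" backend is assumed. It walks `next_elements` from the start tag, stops at the end tag, and keeps only the text nodes, so the speaker label itself is included in the result.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup, shaped like the transcript excerpt above.
SAMPLE_HTML = """
<span class="displaytext">
<p><b>TRUMP:</b> First of all, I get along with everybody. [<i>booing</i>]</p>
<p>Obviously, it was a mistake.</p>
<p><b>DICKERSON:</b> So...</p>
</span>
"""


def text_between(start_tag, end_tag):
    """Collect the text nodes found after start_tag opens and before end_tag."""
    pieces = []
    # next_elements yields tags and strings in document order,
    # beginning with the first child of start_tag.
    for element in start_tag.next_elements:
        if element is end_tag:  # identity check: stop at this exact tag
            break
        if isinstance(element, NavigableString):
            pieces.append(str(element))
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join("".join(pieces).split())


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
transcript = soup.find_all("span", class_="displaytext")[0]
speakers = transcript.find_all("b")
result = text_between(speakers[0], speakers[1])
print(result)
```

The comparison uses `is` rather than `!=` on purpose: two different `<b>` tags with identical content would compare equal, which could end the loop at the wrong speaker.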
startTag = transcript.find_all('b')[165]
endTag = transcript.find_all('b')[166]
content = []
content += startTag.string
content
And the result I get is [u'R', u'U', u'B', u'I', u'O', u':']
instead of [u'RUBIO:'].
What am I missing?
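For what it's worth, the same output can be reproduced without BeautifulSoup at all, which suggests it may come from how `+=` treats a string on a list rather than from the parser. A minimal sketch, where the plain string `label` is an assumed stand-in for `startTag.string`:

```python
label = "RUBIO:"  # assumed stand-in for startTag.string

extended = []
extended += label       # += extends the list with each item of the iterable, i.e. each character

appended = []
appended.append(label)  # append adds the whole string as a single list item

print(extended)
print(appended)
```

If that is indeed the cause, `content.append(startTag.string)` or `content += [startTag.string]` would keep the label as a single item.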