2016-04-22 7 views
0

Wie Absätzen von schlecht strukturierten HTML abrufen?Abrufen von Absätzen aus HTML mit Python

Ich habe diesen ursprünglichen HTML-Text:

This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p> 

ich Python verwenden und versuchen, so etwas zu tun:

soup = BeautifulSoup(html) 

Es gibt einen neuen HTML-Text mit 2 kurzen Absätzen:

<html> 

<body> 
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
     <br/> 
    </p> 
    <ul> 
     <li>AA Early Childhood Education, or related field. </li> 
     <li>2+ years experience in a licensed childcare facility </li> 
     <li>Ability to meet state requirements, including finger print clearance. </li> 
     <li>Excellent oral and written communication skills </li> 
     <li>Strong organization and time management skills. </li> 
     <li>Creativity in expanding children's learning through play. 
      <br/> </li> 
     <li>Strong classroom management skills. 
      <br/> </li> 
    </ul> 
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
     <br/> </p> 
</body> 

</html> 

Aber es ist nicht das, was ich erwartet habe. Im Ergebnis würde Ich mag diesen HTML Text erhalten:

<html> 

<body> 
    <p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
     AA Early Childhood Education, or related field. 
     2+ years experience in a licensed childcare facility 
     Ability to meet state requirements, including finger print clearance. 
     Excellent oral and written communication skills 
     Strong organization and time management skills. 
     Creativity in expanding children's learning through play. 
     Strong classroom management skills. 
    </p> 
    <p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p> 
</body> 

</html> 

Für oben html bekommen, denke ich, dass der beste Ansatz, der alle HTML-Tags außer <p> und </p> von den ursprünglichen HTML zu entfernen ist.

Zu diesem Zweck habe ich versucht, den folgenden regulären Ausdruck:

new_html = re.sub('<[^<]+?>', '', html) 

Offensichtlich ist die regelmäßige expession entfernt alle HTML-Tags. Also, wie alle HTML-Tags außer <p> und </p> entfernen?

Wenn jemand mir helfen, die r.e. dann füttere ich new_html zu BeautifulSoup() und html, die ich erwarte.

+2

Haben Sie den Text abrufen möchten? Wenn ja, dann sollte 'sup.get_text()' in Ordnung sein. – styvane

+0

Nein, ich möchte eine Liste von Absätzen abrufen. – user3601768

+1

Und was ist mit all diesen li-Tags? Möchten Sie sie nur durch den Text ersetzen? – styvane

Antwort

1

Kurze Antwort

new_html = re.sub('<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', '', html)

Lange Antwort

Ihre ursprüngliche Regex scheint seltsam. Ich hätte statt [^<] setzen. Sie möchten "alles, was kein schließendes Tag ist".

Auch ist es seltsam zu setzen + gefolgt von ?.

+ bedeutet: "Wiederholung 1 oder mehr Zeit"

? bedeutet: "Wiederholung 0 oder einmal".

Die beiden Zeichen zu haben ist ziemlich seltsam.

Wie auch immer, können wir Ihre Regex wie folgt ausdrücken:

"open-Tag", dann "alles, was nicht 'p' und nicht/p", dann "schließen tag"

das entspricht an:

"open tag", dann entweder "ein eindeutiges Zeichen, das nicht 'p' ist" oder "alles was kein Schrägstrich ist dann ein oder mehrere Zeichen" oder "ein Schrägstrich dann ein eindeutiges Zeichen, das nicht ist" p '"oder" ein Schrägstrich dann zwei oder mehr Zeichen ", dann" Tag schließen ".

Welche äquivalent zu:

< dann ([^p] oder [^>/][^>]+ oder /[^p] oder /[^>][^>]+) dann >

Dies ist, was durch die Regex oben ausgedrückt wird.

Hier ist ein kurzer Test in ein Python-Konsole eingeben:

re.sub(
    '<([^p]|[^>/][^>]+|/[^p]|/[^>][^>]+)>', 
    '', 
    'aa <p> bb <a> cc <li> dd <pp> ee <pa> ff </p> gg </a> hh </li> ii </pp> jj </pa> ff') 
+1

'+?' Bedeutet eins oder mehr und nicht gierig. Ohne dieses '?' Würden auch End-Tags erfasst. Du hast Recht, dass '[^>]' jedoch besser ist. –

+0

Warum sollte Regex anstelle von HTML Parser verwendet werden? – styvane

1

Dies ist eine Art manuelle Dokumentmanipulation, aber Sie können die li Elemente und remove sie nach appending zum ersten Absatz Schleife. Dann entfernen Sie das ul Element auch:

from bs4 import BeautifulSoup 


data = """ 
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>""" 

soup = BeautifulSoup(data, "lxml") 

p = soup.p 
for li in soup.find_all("li"): 
    p.append(li.get_text()) 
    li.extract() 

soup.find("ul").extract() 
print(soup.prettify()) 

Druckt die Absätze 2, wie Sie geplant haben zu haben:

<html> 
<body> 
    <p> 
    This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
    <br/> 
    AA Early Childhood Education, or related field. 
    2+ years experience in a licensed childcare facility 
    Ability to meet state requirements, including finger print clearance. 
    Excellent oral and written communication skills 
    Strong organization and time management skills. 
    Creativity in expanding children's learning through play. 
    Strong classroom management skills. 
    </p> 
    <p> 
    The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br/> 
    </p> 
</body> 
</html> 

Hinweis, dass es einen wichtigen Unterschied in der Art und Weise lxml, html.parser und html5lib parsen Sie den eingegebenen HTML-Code. html5lib und html.parser nicht automatisch den ersten Absatz erstellen den Code oben wirklich lxml spezifisch.


Ein besserer Ansatz wäre wahrscheinlich ein separates "Suppe" -Objekt. Beispiel:

from bs4 import BeautifulSoup 


data = """ 
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
<br> 
<ul> 
    <li>AA Early Childhood Education, or related field. </li> 
    <li>2+ years experience in a licensed childcare facility </li> 
    <li>Ability to meet state requirements, including finger print clearance. </li> 
    <li>Excellent oral and written communication skills </li> 
    <li>Strong organization and time management skills. </li> 
    <li>Creativity in expanding children's learning through play.<br> </li> 
    <li>Strong classroom management skills.<br> </li> 
</ul> 
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br> 
</p>""" 

soup = BeautifulSoup(data, "lxml") 

# create new soup 
new_soup = BeautifulSoup("<body></body>", "lxml") 
new_body = new_soup.body 

# create first paragraph 
first_p = new_soup.new_tag("p") 
first_p.append(soup.p.get_text()) 

for li in soup.find_all("li"): 
    first_p.append(li.get_text()) 

new_body.append(first_p) 

# create second paragraph 
second_p = soup.find_all("p")[-1] 
new_body.append(second_p) 

print(new_soup.prettify()) 

Drucke:

<html> 
<body> 
    <p> 
    This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have: 
    AA Early Childhood Education, or related field. 
    2+ years experience in a licensed childcare facility 
    Ability to meet state requirements, including finger print clearance. 
    Excellent oral and written communication skills 
    Strong organization and time management skills. 
    Creativity in expanding children's learning through play. 
    Strong classroom management skills. 
    </p> 
    <p> 
    The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children. 
    <br/> 
    </p> 
</body> 
</html> 
+1

@ user3601768 ja, wahrscheinlich eine separate Suppe Objekt wäre eine bessere Idee, aktualisiert die Antwort mit einer Probe. Ich hoffe, das hilft. – alecxe

Verwandte Themen