How do I retrieve paragraphs from poorly structured HTML with Python?
I have this original HTML text:
This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.<br> </li>
<li>Strong classroom management skills.<br> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br>
</p>
I am using Python and tried something like this:
soup = BeautifulSoup(html)
This returns a new HTML text with 2 short paragraphs:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
<br/>
</p>
<ul>
<li>AA Early Childhood Education, or related field. </li>
<li>2+ years experience in a licensed childcare facility </li>
<li>Ability to meet state requirements, including finger print clearance. </li>
<li>Excellent oral and written communication skills </li>
<li>Strong organization and time management skills. </li>
<li>Creativity in expanding children's learning through play.
<br/> </li>
<li>Strong classroom management skills.
<br/> </li>
</ul>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.
<br/> </p>
</body>
</html>
But that is not what I expected. As a result, I would like to get this HTML text:
<html>
<body>
<p>This position is responsible for developing and implementing age appropriate lesson and activity plans for preschool children, ages 4-5 years-old. Maintain a fun and interactive classroom that is clean and well organized, provide a safe, healthy and welcoming learning environment. The ideal candidate will have:
AA Early Childhood Education, or related field.
2+ years experience in a licensed childcare facility
Ability to meet state requirements, including finger print clearance.
Excellent oral and written communication skills
Strong organization and time management skills.
Creativity in expanding children's learning through play.
Strong classroom management skills.
</p>
<p>The ideal candidate must be a reliable, self-starting professional who is passionate about teaching young children.</p>
</body>
</html>
To get the HTML above, I think the best approach is to remove all HTML tags except <p> and </p> from the original HTML.
To do that, I tried the following regular expression:
new_html = re.sub('<[^<]+?>', '', html)
Obviously, this regular expression removes all HTML tags. So how do I remove all HTML tags except <p> and </p>?
If someone can help me with the regular expression, I will then feed new_html to BeautifulSoup() and get the HTML I expect.
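One possible way to keep only the <p> and </p> tags with a regular expression is a negative lookahead right after the opening angle bracket. The following is a minimal sketch, not a definitive solution: the sample string is a shortened stand-in for the original HTML, and the name NON_P_TAG is made up for illustration.

```python
import re

# Shortened stand-in for the original HTML from the question (assumption).
html = (
    "Intro text:\n<br>\n<ul>\n"
    "<li>First item </li>\n<li>Second item<br> </li>\n</ul>\n"
    "<p>Closing paragraph.\n<br>\n</p>"
)

# Match any tag unless its name is exactly "p":
#   (?!/?p\b)  refuses to match when "<" is followed by "p" or "/p" as a whole word,
#              so <p>, </p> and <p class="x"> survive while <pre> is still removed.
#   [^<>]+     consumes the rest of the tag up to the closing ">".
NON_P_TAG = re.compile(r"<(?!/?p\b)[^<>]+>", re.IGNORECASE)

new_html = NON_P_TAG.sub("", html)
print(new_html)
```

Note that this only strips the tags. The leading text and the list items still sit outside any <p>, so how they are grouped afterwards depends on the parser that reads new_html. Regular expressions are also fragile on real-world HTML, for example with comments or a ">" inside an attribute value.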
Do you just want to retrieve the text? If so, 'soup.get_text()' should be fine. – styvane
No, I want to retrieve a list of paragraphs. – user3601768
And what about all those li tags? Do you want to replace them with just their text? – styvane
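If the goal is a list of paragraphs in which the li tags are replaced by their text, one option is to skip the regular expression and let BeautifulSoup do the work. The sketch below makes several assumptions: it uses the bs4 package with the built-in "html.parser" (which does not add <html>/<body> wrappers), the sample string is a shortened stand-in for the original HTML, and extract_paragraphs is a hypothetical helper name. It treats every run of text outside a <p> as one paragraph and every <p> as its own paragraph.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the original HTML from the question (assumption).
html = """Intro text. The ideal candidate will have:
<br>
<ul>
<li>First requirement. </li>
<li>Second requirement<br> </li>
</ul>
<p>Closing paragraph.
<br>
</p>"""


def extract_paragraphs(html):
    """Return a list of paragraph strings (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop <br> entirely, then replace every other non-<p> tag by its contents.
    for br in soup.find_all("br"):
        br.decompose()
    for tag in soup.find_all(True):
        if tag.name != "p":
            tag.unwrap()

    paragraphs, loose = [], []

    def flush():
        # Turn the collected text into one paragraph, one non-empty line each.
        lines = [ln.strip() for ln in "".join(loose).splitlines() if ln.strip()]
        if lines:
            paragraphs.append("\n".join(lines))
        loose.clear()

    for node in list(soup.contents):
        if isinstance(node, Tag):  # only <p> tags are left at this point
            flush()                # close the run of loose text before it
            loose.append(node.get_text())
            flush()                # the <p> itself is one paragraph
        else:
            loose.append(str(node))
    flush()
    return paragraphs


paragraphs = extract_paragraphs(html)
for p in paragraphs:
    print(repr(p))
```

Each entry can then be wrapped in <p>...</p> again if HTML output is needed rather than plain strings.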