How do you find spans with a specific class that contain specific text, using Beautiful Soup and re?

How can I find all spans with a class of 'blue' that contain text in the format:

04/18/13 7:29pm

The span content could therefore be:

04/18/13 7:29pm

or:

Posted on 04/18/13 7:29pm

So far I have constructed the following logic:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all 
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re 
for _ in new_content: 
    result = re.findall(pattern, _) 
    print result

I referred to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to find a way to do this, but the above is all I have so far.

Edit:

To clarify the scenario, there are spans with:

<span class="blue">here is a lot of text that i don't need</span>

and

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

Note that I only need 04/18/13 7:29pm, not the rest of the content.

Edit 2:

I also tried:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>') 
for _ in new_content: 
    result = re.findall(pattern, _) 
    print result

and got the error:

'TypeError: expected string or buffer'

Source

2013-04-27 user1063287

import re 
from bs4 import BeautifulSoup 

html_doc = """ 
<html> 
<body> 
<span class="blue">here is a lot of text that i don't need</span> 
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span> 
<span class="blue">04/19/13 7:30pm</span> 
<span class="blue">Posted on 04/20/13 10:31pm</span> 
</body> 
</html> 
""" 

# parse the html 
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly

# find a list of all span elements 
spans = soup.find_all('span', {'class' : 'blue'}) 

# create a list of lines corresponding to element texts 
lines = [span.get_text() for span in spans] 

# collect the dates from the list of lines using regex matching groups 
found_dates = [] 
for line in lines: 
    # [ap] is a character class; the original [a|p] would also match a literal '|'
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected 
for date in found_dates: 
    print(date)

Output:

04/18/13 7:29pm 
04/19/13 7:30pm 
04/20/13 10:31pm

Source

2013-04-27 06:04:19

I could run the code above successfully, but it did not work in my implementation. I figured it might be because in the original source code there is a '&nbsp;' between the date and the time, e.g. '04/18/13&nbsp;7:29pm'. For reference, I added '.replace("&nbsp;", " ")' to the original 'urlopen read object' and it worked. Thank you (to all answerers!). – user1063287
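If the separator mentioned in this comment is a non-breaking space (an assumption; it may arrive as the literal entity `&nbsp;` in unparsed HTML or as the character U+00A0 after parsing), it can be normalized before matching. A minimal sketch using only `re`; the helper name `normalize_spaces` is hypothetical:

```python
import re

# Same date shape as in the answer above, with a plain space as separator.
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{2} \d{1,2}:\d{2}[ap]m')

def normalize_spaces(text):
    """Replace the &nbsp; entity and the U+00A0 character with a plain space."""
    return text.replace('&nbsp;', ' ').replace('\u00a0', ' ')

raw = 'Posted on 04/18/13&nbsp;7:29pm'        # entity as found in raw HTML
decoded = 'Posted on 04/18/13\u00a07:29pm'    # character after entity decoding

for sample in (raw, decoded):
    match = DATE_RE.search(normalize_spaces(sample))
    print(match.group(0) if match else None)
```

Without the normalization step, the pattern with a literal space does not match either sample.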

This pattern seems to match what you are looking for:

>>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>') 
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups() 
('04/18/13 7:29pm',)
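Because `pattern.match` returns `None` for a span without a date (the second line of the session prints nothing), calling `.groups()` on every result would fail for those spans. A minimal sketch of a guard around the same pattern; the helper name `extract_date` is hypothetical:

```python
import re

pattern = re.compile(
    r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>'
)

def extract_date(fragment):
    """Return the captured date/time string, or None when nothing matches."""
    m = pattern.match(fragment)
    return m.group(1) if m else None

fragments = [
    '<span class="blue">here is a lot of text that i dont need</span>',
    '<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>',
]

# Keep only the fragments that actually contained a date.
dates = [d for d in map(extract_date, fragments) if d is not None]
print(dates)
```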

Source

2013-04-27 05:46:12

I don't know how to implement this. I have posted the code I tried, based on your suggestion, in the original post (see Edit 2). – user1063287

@user1063287 try changing your third line to 'result = pattern.match(_).groups()'. 're.findall' expects a string (like the string you used earlier when calling 're.compile') and you are passing it an already compiled regex instead. Essentially, you are trying to compile your pattern twice. –

I get 'TypeError: expected string or buffer' – user1063287
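The TypeError in this thread is most likely caused by the second argument rather than by the compiled pattern: `re.findall` accepts a compiled pattern, but the object being searched must be a string, and `find_all` returns tag objects. The sketch below avoids a bs4 dependency by using a hypothetical stand-in class, `FakeTag`; with real tags, `str(tag)` or `tag.get_text()` would play the same role.

```python
import re

pattern = re.compile(r'\d\d/\d\d/\d\d \d\d?:\d\d\w\w')

class FakeTag:
    """Hypothetical stand-in for a parsed tag: not a str, but convertible to one."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

tag = FakeTag('<span class="blue">Posted on 04/18/13 7:29pm</span>')

try:
    re.findall(pattern, tag)           # a non-string object is rejected
except TypeError as exc:
    print('TypeError:', exc)

print(re.findall(pattern, str(tag)))   # converting to str first works
```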

This is a more flexible regex that you can use:

r"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])"

Example:

>>> import re 
>>> from bs4 import BeautifulSoup 
>>> html = """ 
<html> 
<body> 
<span class="blue">here is a lot of text that i don't need</span> 
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span> 
<span class="blue">04/19/13 7:30pm</span> 
<span class="blue">04/18/13 7:29pm</span> 
<span class="blue">Posted on 15/18/2013 10:00AM</span> 
<span class="blue">Posted on 04/20/13 10:31pm</span> 
<span class="blue">Posted on 4/1/2013 17:09aM</span> 
</body> 
</html> 
""" 
>>> soup = BeautifulSoup(html, 'html.parser')
>>> lines = [i.get_text() for i in soup.find_all('span', {'class': 'blue'})]
>>> ok = [m.group(1)
...       for line in lines
...       for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])', line),)
...       if m]
>>> ok
['04/18/13 7:29pm', '04/19/13 7:30pm', '04/18/13 7:29pm', '15/18/2013 10:00AM', '04/20/13 10:31pm', '4/1/2013 17:09aM']
>>> for i in ok:
...     print(i)
...

04/18/13 7:29pm 
04/19/13 7:30pm 
04/18/13 7:29pm 
15/18/2013 10:00AM 
04/20/13 10:31pm 
4/1/2013 17:09aM
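Note that the output above includes '15/18/2013 10:00AM' and '4/1/2013 17:09aM', which have the right shape but are not valid timestamps if the format is month/day/year with a 12-hour clock. If that matters, each candidate can be validated with `datetime.strptime`. A minimal sketch, assuming that format (the thread does not state it) and an English/C locale for `%p`; the helper name `parse_timestamp` is hypothetical:

```python
import re
from datetime import datetime

# Date part and time part are captured separately so they can be rejoined
# with exactly one space before parsing.
CANDIDATE_RE = re.compile(
    r'(\d\d?/\d\d?/\d{2,4})\s+(\d\d?:\d\d[ap]m)', re.IGNORECASE
)

def parse_timestamp(text):
    """Return a datetime for the first valid timestamp in text, else None."""
    m = CANDIDATE_RE.search(text)
    if not m:
        return None
    candidate = '{} {}'.format(m.group(1), m.group(2))
    # Assumed formats: two-digit year first, then four-digit year.
    for fmt in ('%m/%d/%y %I:%M%p', '%m/%d/%Y %I:%M%p'):
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue
    return None

samples = [
    'Posted on 04/20/13 10:31pm',
    'Posted on 15/18/2013 10:00AM',   # month 15 is rejected
    'Posted on 4/1/2013 17:09aM',     # hour 17 is rejected on a 12-hour clock
]
for sample in samples:
    print(sample, '->', parse_timestamp(sample))
```

The regex only finds candidates; the decision about validity is left to `strptime`, which keeps the pattern simple.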

Source

2013-04-27 05:57:26 pradyunsg
