2017-03-05 8 views

How do I implement re.search() in my code?

I am working on a binary classification problem with text data. I want to classify the words of the text based on their occurrences in some well-defined word-class features that I have chosen. For now, I look up each whole word in every word class and increment that class's count on a match. The counts are then used to compute the frequency of each word class. Here is my code:

import nltk 
import re 

def wordClassFeatures(text): 
    home = """woke home sleep today eat tired wake watch 
     watched dinner ate bed day house tv early boring 
     yesterday watching sit""" 

    conversation = """know people think person tell feel friends 
talk new talking mean ask understand feelings care thinking 
friend relationship realize question answer saying""" 


    countHome = countConversation =0 

    totalWords = len(text.split()) 

    text = text.lower() 
    text = nltk.word_tokenize(text) 
    conversation = nltk.word_tokenize(conversation) 
    home = nltk.word_tokenize(home) 
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords 
    countHome /= 1.0*totalWords 

    return(countHome,countConversation) 

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text)) 

The drawback is the extra effort this requires for every word in every word class, because a word in the text only counts if it appears explicitly in a word class. So I am now trying to use each word of the text as a regular expression and search for it in each word class. This raises the following error:

line 362, in wordClassFeatures 
if re.search(conversation, word): 
    File "/root/anaconda3/lib/python3.6/re.py", line 182, in search 
    return _compile(pattern, flags).search(string) 
    File "/root/anaconda3/lib/python3.6/re.py", line 289, in _compile 
    p, loc = _cache[type(pattern), pattern, flags] 
TypeError: unhashable type: 'list' 

I know this is a big mistake in the syntax, but I could not find an answer online, since most examples of re.search have this form:

re.search("thank|appreciate|advance", x)

Is there a way to implement this correctly?


It should be 're.search(word, conversation)'.


@Rawing Tried it. It throws this error: line 362, in wordClassFeatures if re.search(word, conversation): File "/root/anaconda3/lib/python3.6/re.py", line 182, in search return _compile(pattern, flags).search(string) TypeError: expected string or bytes-like object


This question needs a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. That makes it easier for us to help you.

Answer


I believe re.search expects a string or buffer to search in, not a list, which is what the code passes in for the conversation and home variables.
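To make the type requirement concrete, here is a minimal sketch of the three cases. The helper try_search and the shortened word class are made up for illustration, and the exact wording of the error messages can differ between Python versions.

```python
import re

conversation_words = ["know", "people", "think"]  # a tokenized word class (list)
conversation_text = "know people think"           # the same class as one string


def try_search(pattern, target):
    """Return True/False from re.search, or the TypeError message if it fails."""
    try:
        return bool(re.search(pattern, target))
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# A list as the pattern fails, as in the traceback in the question.
print(try_search(conversation_words, "think"))
# A list as the string to search fails too, as reported in the comments.
print(try_search("think", conversation_words))
# Pattern and target are both strings, so the search works.
print(try_search("think", conversation_text))
```

Both failing calls raise TypeError, only with different messages, which is why swapping the arguments alone does not help.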

Also, the text is tokenized with all of its special characters still in it, which throws the search off. So first we need to strip the special characters from the text:

text = re.sub(r'\W+', ' ', text) #strip text of all special characters
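As a small illustration of what this substitution does, the sketch below applies it to a short sample built from phrases in the question's text. The variable names sample and cleaned are only for this example.

```python
import re

sample = "it's still java (please correct me if I'm wrong)."

# \W+ matches each run of characters that are not letters, digits, or underscore,
# so punctuation and whitespace runs are each collapsed into one space.
cleaned = re.sub(r'\W+', ' ', sample)

print(cleaned)
print(cleaned.split())
```

Note that the apostrophe is a special character as well, so a contraction such as "it's" is split into "it" and "s".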

Next, we leave the conversation and home variables as they are (in string format) and do not tokenize them:

#conversation = nltk.word_tokenize(conversation) 
#home = nltk.word_tokenize(home) 

This gives the desired answer:

(0.21301775147928995, 0.20118343195266272) 

Full code below:

import nltk 
import re 

def wordClassFeatures(text): 
    home = """woke home sleep today eat tired wake watch 
     watched dinner ate bed day house tv early boring 
     yesterday watching sit""" 

    conversation = """know people think person tell feel friends 
talk new talking mean ask understand feelings care thinking 
friend relationship realize question answer saying""" 

    text = re.sub(r'\W+', ' ', text) #strip text of all special characters

    countHome = countConversation =0 

    totalWords = len(text.split()) 

    text = text.lower() 
    text = nltk.word_tokenize(text) 
    #conversation = nltk.word_tokenize(conversation) 
    #home = nltk.word_tokenize(home) 
    '''
    for word in text:
        if word in conversation: #this is my current approach
            countConversation += 1
        if word in home:
            countHome += 1
    '''

    for word in text:
        if re.search(word, conversation): #this is what I want to implement
            countConversation += 1
        if re.search(word, home):
            countHome += 1

    countConversation /= 1.0*totalWords 
    countHome /= 1.0*totalWords 

    return(countHome,countConversation) 

text = """ Long time no see. Like always I was rewriting it from scratch a couple of times. But nevertheless 
it's still java and now it uses metropolis sampling to help that poor path tracing converge. Btw. I did MLT on 
yesterday evening after 2 beers (it had to be Ballmer peak). Altough the implementation is still very fresh it 
easily outperforms standard path tracing, what is to be seen especially when difficult caustics are involved. 
I've implemented spectral rendering too, it was very easy actually, cause all computations on wavelengths are 
linear just like rgb. But then I realised that even if it does feel more physically correct to do so, whats the 
point? 3d applications are operating in rgb color space, and because I cant represent a rgb color as spectrum 
interchangeably I have to approximate it, so as long as I'm not running a physical simulation or something I don't 
see the benefits (please correct me if I'm wrong), thus I abandoned that.""" 

print(wordClassFeatures(text))
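
One caveat with searching each word inside the class string: re.search looks for the pattern anywhere in the string, so a short token such as "a" also matches inside "talk". If whole-word matches are what is wanted, a possible variant is sketched below. It is an alternative, not the code above: it builds one "word1|word2|..." pattern per class, in the format shown in the question, and replaces nltk.word_tokenize with re.findall so the example runs without NLTK. The names build_class_pattern and word_class_frequencies are hypothetical.

```python
import re

HOME = """woke home sleep today eat tired wake watch
watched dinner ate bed day house tv early boring
yesterday watching sit"""

CONVERSATION = """know people think person tell feel friends
talk new talking mean ask understand feelings care thinking
friend relationship realize question answer saying"""


def build_class_pattern(class_text):
    """Compile a 'word1|word2|...' alternation from a whitespace-separated class."""
    words = sorted(set(class_text.split()))
    # re.escape keeps any regex metacharacters in a word literal.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile("(?:{})".format(alternation))


def word_class_frequencies(text):
    """Return (home_frequency, conversation_frequency) using whole-word matches."""
    tokens = re.findall(r"\w+", text.lower())  # assumption: simple regex tokenizer
    if not tokens:
        return (0.0, 0.0)
    home_pattern = build_class_pattern(HOME)
    conversation_pattern = build_class_pattern(CONVERSATION)
    # fullmatch only succeeds when the entire token equals one class word.
    count_home = sum(1 for t in tokens if home_pattern.fullmatch(t))
    count_conversation = sum(1 for t in tokens if conversation_pattern.fullmatch(t))
    return (count_home / len(tokens), count_conversation / len(tokens))


# Substring hit: "a" is not a class word, yet it is found inside the class string.
print(re.search("a", CONVERSATION) is not None)
print(word_class_frequencies("I watched tv and went to bed early."))
```

Because this counts only exact class words, its frequencies will generally be lower than the substring-based numbers reported above.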