'ascii' codec can't decode byte 0xc2 in Python NLTK


I have code that I am using for spam classification, and it works great, but every time I try to stem/lemmatize a word I get this error:

File "/Users/Ramit/Desktop/Bayes1/src/filter.py", line 16, in trim_word
    word = ps.stem(word)
File "/Library/Python/2.7/site-packages/nltk/stem/porter.py", line 664, in stem
    stem = self._step1a(stem)
File "/Library/Python/2.7/site-packages/nltk/stem/porter.py", line 289, in _step1a
    if word.endswith('ies') and len(word) == 4:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Here is my code:

from word import Word
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

ps = PorterStemmer()


class Filter():

    def __init__(self):
        self.words = dict()

    def trim_word(self, word):
        # Helper method to trim away some of the non-alphabetic characters
        # I deliberately do not remove all non-alphabetic characters.
        word = word.strip(' .:,-!()"?+<>*')
        word = word.lower()
        word = ps.stem(word)
        return word

    def train(self, train_file):
        lineNumber = 1
        ham_words = 0
        spam_words = 0
        stop = set(stopwords.words('english'))

        # Loop through all the lines
        for line in train_file:
            if lineNumber % 2 != 0:
                line = line.split('\t')
                category = line[0]
                input_words = line[1].strip().split(' ')

                # Loop through all the words in the line, remove some characters
                for input_word in input_words:
                    input_word = self.trim_word(input_word)
                    if (input_word != "") and (input_word not in stop):

                        # Check if word is in dictionary, else add
                        if input_word in self.words:
                            word = self.words[input_word]
                        else:
                            word = Word(input_word)
                            self.words[input_word] = word

                        # Check whether the word is in ham or spam sentence, increment counters
                        if category == "ham":
                            word.increment_ham()
                            ham_words += 1
                        elif category == "spam":
                            word.increment_spam()
                            spam_words += 1

                        # Probably bad training file input...
                        else:
                            print "Not valid training file format"

            lineNumber += 1

        # Compute the probability for each word in the training set
        for word in self.words:
            self.words[word].compute_probability(ham_words, spam_words)

    def get_interesting_words(self, sms):
        interesting_words = []
        stop = set(stopwords.words('english'))
        # Go through all words in the SMS and append to list.
        # If we have not seen the word in training, assign probability of 0.4
        for input_word in sms.split(' '):
            input_word = self.trim_word(input_word)
            if (input_word != "") and (input_word not in stop):
                if input_word in self.words:
                    word = self.words[input_word]
                else:
                    word = Word(input_word)
                    word.set_probability(0.40)
                interesting_words.append(word)

        # Sort the list of interesting words, return top 15 elements if list is longer than 15
        interesting_words.sort(key=lambda word: word.interesting(), reverse=True)
        return interesting_words[0:15]

    def filter(self, input_file, result_file):
        # Loop through all SMSes and compute total spam probability of the sms-message
        lineNumber = 0
        for sms in input_file:
            lineNumber += 1
            spam_product = 1.0
            ham_product = 1.0
            if lineNumber % 2 != 0:
                try:
                    for word in self.get_interesting_words(sms):
                        spam_product *= word.get_probability()
                        ham_product *= (1.0 - word.get_probability())

                    sms_spam_probability = spam_product/(spam_product + ham_product)
                except:
                    result_file.write("error")

                if sms_spam_probability > 0.8:
                    result_file.write("SPAM: "+sms)
                else:
                    result_file.write("HAM: "+sms)
            result_file.write("\n")

I am just looking for a solution that would allow me to lemmatize/stem the words. I tried looking around on the net and found similar problems, but their solutions did not work for me.


2017-03-18 Ramit Sawhney

Suggestions: (1) Convert your tabs to spaces before posting. (2) Create a [minimal example](http://stackoverflow.com/help/mcve).

Maybe this would help: https://gist.github.com/alvations/07758d02412d928414bb from https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66 – alvas

The problem might be that you are not reading the file correctly. Try: import io; file_in = io.open('filename.txt', 'r', encoding='utf8'). It is a little unclear what is wrong, but if you could post the data you are trying to process, it would be much easier to understand what went wrong. – alvas
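The io.open suggestion above can be tried without NLTK. The following is only a minimal sketch, assuming the training file is UTF-8 encoded; the sample line, write_sample and read_lines are made up for illustration and are not taken from the question. It first reproduces the error message by decoding the bytes as ASCII, then reads the same bytes back as text through io.open.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

# Hypothetical sample data: one tab-separated training line whose message
# starts with a non-breaking space (UTF-8 bytes 0xC2 0xA0).
raw_bytes = b"ham\t\xc2\xa0free entries today\n"

# Decoding those bytes as ASCII fails with the message from the traceback.
try:
    b"\xc2\xa0free".decode("ascii")
    error_text = ""
except UnicodeDecodeError as err:
    error_text = str(err)


def write_sample(data):
    """Write raw bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def read_lines(path):
    """Return the file's lines decoded from UTF-8 as text, not bytes."""
    with io.open(path, "r", encoding="utf8") as handle:
        return [line.rstrip("\n") for line in handle]


path = write_sample(raw_bytes)
try:
    lines = read_lines(path)
finally:
    os.remove(path)

# Same splitting as in train(): category, then the words of the message.
category, message = lines[0].split("\t")
first_word = message.strip().split(" ")[0]
print(error_text)
print(repr(category), repr(first_word))
```

Because the lines are already text when they reach the splitting code, a stemmer that compares them against text suffixes should no longer need an implicit ASCII decode.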

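Use sys:

import sys
# Python 2 only: site.py removes setdefaultencoding at startup,
# so sys has to be reloaded before the call is available again.
reload(sys)
sys.setdefaultencoding('utf-8')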
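A different technique from changing the interpreter-wide default is to decode each word explicitly before it reaches the stemmer. This is only a sketch, assuming the input bytes are UTF-8; to_text is a hypothetical helper, and the ps.stem call from the question is left as a comment so the snippet runs without NLTK.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass text through unchanged.

    Hypothetical helper. The UTF-8 default is an assumption about the data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def trim_word(word):
    """Variant of the question's trim_word that decodes its input first."""
    word = to_text(word)
    word = word.strip(' .:,-!()"?+<>*')
    word = word.lower()
    # In the question's code, ps.stem(word) would be called here,
    # now with text instead of raw bytes.
    return word


print(repr(trim_word(b"Caf\xc3\xa9s!")))
print(repr(trim_word("Hello,")))
```

Decoding at this single point keeps the change local to trim_word, so the rest of the Filter class does not need to know how the file was opened.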


2017-03-18 14:50:17 MFigueredo
