NLTK collocations for specific words

I know how to get bigram and trigram collocations with NLTK, and I apply them to my own corpora. The code is below.

What I am not sure about is: (1) how do I get the collocations for one specific word? (2) Does NLTK have a collocation metric based on the log-likelihood ratio?

import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import word_tokenize

text = "this is a foo bar bar black sheep foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

trigram_measures = nltk.collocations.TrigramAssocMeasures()
# word_tokenize requires NLTK's Punkt tokenizer data to be installed
finder = TrigramCollocationFinder.from_words(word_tokenize(text))

# Print every trigram together with its PMI score, highest score first
for i in finder.score_ngrams(trigram_measures.pmi):
    print(i)

2014-01-16 Sabba

Try this code:

import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Keep only n-grams with 'creature' as a member:
# apply_ngram_filter drops every n-gram for which the function returns True
creature_filter = lambda *w: 'creature' not in w

# The genesis corpus must be installed first, e.g. nltk.download('genesis')

## Bigrams
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(bigram_measures.likelihood_ratio, 10))


## Trigrams
finder = TrigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
# only trigrams that contain 'creature'
finder.apply_ngram_filter(creature_filter)
# return the 10 n-grams with the highest likelihood ratio
print(finder.nbest(trigram_measures.likelihood_ratio, 10))

It uses the likelihood-ratio measure and also filters out all n-grams that do not contain the word 'creature'.
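The same idea can be tried without downloading a corpus. The following is a minimal sketch, assuming a small made-up token list and a hypothetical helper named `bigrams_containing` (neither is from the answer above); it only uses `apply_ngram_filter` and `nbest` as they appear in the code above.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up sample text, used only for illustration
tokens = ("the living creature moved and every living creature ate "
          "and the creature slept").split()


def bigrams_containing(tokens, word, top_n=5):
    """Hypothetical helper: best bigrams that contain `word`."""
    finder = BigramCollocationFinder.from_words(tokens)
    # The filter function returns True for n-grams that should be dropped,
    # so "word not in w" keeps only bigrams that contain the word
    finder.apply_ngram_filter(lambda *w: word not in w)
    # Rank the remaining bigrams by likelihood ratio
    return finder.nbest(BigramAssocMeasures.likelihood_ratio, top_n)


result = bigrams_containing(tokens, "creature")
print(result)
```

On a toy text this small the scores are not meaningful; the point is only that every returned bigram contains the chosen word.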

2014-01-17 11:54:31 bogs

As for question 2: yes! NLTK includes the likelihood ratio among its association measures. The first question remains unanswered!

http://nltk.org/api/nltk.metrics.html?highlight=likelihood_ratio#nltk.metrics.association.NgramAssocMeasures.likelihood_ratio
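To see how the likelihood ratio behaves next to PMI, both measures can be passed to the same finder. This is a rough sketch, assuming a toy token list modelled on the text in the question; the variable names are illustrative only.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy text modelled on the question's example (an assumption for illustration)
tokens = ("foo bar bar black sheep " * 3 + "this is a sentence").split()

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# score_ngrams returns (ngram, score) pairs, highest score first
lr_scores = finder.score_ngrams(measures.likelihood_ratio)
pmi_scores = finder.score_ngrams(measures.pmi)

print("Top bigrams by likelihood ratio:")
for ngram, score in lr_scores[:3]:
    print(ngram, round(score, 3))

print("Top bigrams by PMI:")
for ngram, score in pmi_scores[:3]:
    print(ngram, round(score, 3))
```

Both calls score the same set of bigrams; only the ordering and the score values differ, so switching the measure is a one-argument change.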

2014-01-17 03:57:58 Sabba

Question 1: try this:

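# finder and trigram_measures are the objects defined in the question
target_word = "electronic"  # your choice of word
# drop every trigram that does not contain the target word
finder.apply_ngram_filter(lambda w1, w2, w3: target_word not in (w1, w2, w3))
for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
    print(i)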

The idea is to filter out everything you do not want. This method is normally used to filter out words in specific positions of the n-gram, and you can tweak it to your heart's content.
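A position-specific filter might look like the following sketch. It assumes the sample text from the question, split on whitespace, and an arbitrarily chosen target word `"black"`; it keeps only trigrams whose first word is the target.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# Sample text from the question, split on whitespace to avoid tokenizer data
text = ("this is a foo bar bar black sheep foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")
tokens = text.split()

target_word = "black"  # arbitrary choice for illustration

finder = TrigramCollocationFinder.from_words(tokens)
# Drop every trigram whose FIRST word is not the target word
finder.apply_ngram_filter(lambda w1, w2, w3: w1 != target_word)

scored = finder.score_ngrams(TrigramAssocMeasures.likelihood_ratio)
for ngram, score in scored:
    print(ngram, round(score, 3))
```

Replacing `w1` with `w2` or `w3` in the lambda moves the constraint to the middle or last position.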

2014-01-17 04:22:01 dmvianna
