I am trying to imitate the ngram_range parameter of CountVectorizer() with Gensim. My goal is to use LDA with either scikit-learn or Gensim and to find very similar bigrams.
For example, with scikit-learn we can find bigrams such as "abc computer" and "binary unordered", while with Gensim we get "A survey", "Graph minors", ...
I have attached my code below to compare Gensim and scikit-learn with respect to bigrams/unigrams.
Thanks for your help.
documents = [["Human", "machine", "interface", "for", "lab", "abc", "computer", "applications"],
             ["A", "survey", "of", "user", "opinion", "of", "computer", "system", "response", "time"],
             ["The", "EPS", "user", "interface", "management", "system"],
             ["System", "and", "human", "system", "engineering", "testing", "of", "EPS"],
             ["Relation", "of", "user", "perceived", "response", "time", "to", "error", "measurement"],
             ["The", "generation", "of", "random", "binary", "unordered", "trees"],
             ["The", "intersection", "graph", "of", "paths", "in", "trees"],
             ["Graph", "minors", "IV", "Widths", "of", "trees", "and", "well", "quasi", "ordering"],
             ["Graph", "minors", "A", "survey"]]
With the Gensim model we find 48 unique tokens, and we can print the unigrams/bigrams with print(dictionary.token2id):
# 1. Gensim
from gensim import corpora
from gensim.models import Phrases

# Add bigrams to the documents (min_count=1 keeps every pair that appears at least once).
bigram = Phrases(documents, min_count=1)
for idx in range(len(documents)):
    for token in bigram[documents[idx]]:
        if '_' in token:
            # Token is a bigram, add it to the document.
            documents[idx].append(token)

# Replace the underscore joiner so bigrams look like "a survey" instead of "a_survey".
documents = [[token.replace("_", " ") for token in doc] for doc in documents]
print(documents)

dictionary = corpora.Dictionary(documents)
print(dictionary.token2id)
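Note that Phrases only promotes token pairs whose collocation score passes its threshold, whereas CountVectorizer with ngram_range=(1, 2) simply enumerates every adjacent pair. A minimal sketch of that exhaustive behaviour (the helper all_bigrams below is hypothetical, not part of either library) would be:

# Hypothetical helper: list every adjacent token pair, which is what
# CountVectorizer(ngram_range=(1, 2)) does, independent of any collocation score.
def all_bigrams(tokens):
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(all_bigrams(["Graph", "minors", "A", "survey"]))
# ['Graph minors', 'minors A', 'A survey']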
And with scikit-learn, which finds 96 unique tokens, we can print the scikit vocabulary with print(vocab):
# 2. Scikit
import re

from sklearn.feature_extraction.text import CountVectorizer

token_pattern = re.compile(r"\b\w\w+\b", re.U)

def custom_tokenizer(s, min_term_length=1):
    """
    Tokenizer that splits text on the token pattern above, keeping only terms of at
    least a certain length which start with an alphabetic character.
    """
    return [x.lower() for x in token_pattern.findall(s)
            if len(x) >= min_term_length and x[0].isalpha()]

def preprocess(docs, min_df=1, min_term_length=1, ngram_range=(1, 1), tokenizer=custom_tokenizer):
    """
    Preprocess a list containing text documents stored as strings.
    docs : list of strings (not tokenized)
    """
    # Build the vector space model of raw term counts in one call.
    vec = CountVectorizer(lowercase=True,
                          strip_accents="unicode",
                          tokenizer=tokenizer,
                          min_df=min_df,
                          ngram_range=ngram_range,
                          stop_words=None)
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names()
    return (X, vocab)

docs_join = list()
for doc in documents:
    docs_join.append(' '.join(doc))

(X, vocab) = preprocess(docs_join, ngram_range=(1, 2))
print(vocab)
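To see where the two vocabularies diverge, a simple set comparison of the results above can be appended (a sketch, assuming the vocab and dictionary variables produced by the two snippets):

# Compare the two vocabularies; lowercase the Gensim terms because
# CountVectorizer lowercases its features.
gensim_terms = {t.lower() for t in dictionary.token2id}
scikit_terms = set(vocab)
print(sorted(scikit_terms - gensim_terms))  # n-grams only scikit-learn produced
print(sorted(gensim_terms - scikit_terms))  # terms only the Gensim pipeline produced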