2017-08-16

How to convert the topics into just a list of the top 20 words in each topic in LDA in Python

I am currently working with the LDA algorithm in Python. I want to convert the topics into a list of just the top 20 words in each topic. I tried the code below but got different output. I want my output in the following format, with topic=2, words=20:

['men', 'kill', 'soldier', 'order', 'patient', 'night', 'priest', 'becom', 'new', 'speech', 'friend', 'decid', 'young', 'ward', 'state', 'front', 'would', 'home', 'two', 'father'] 

["n't", 'go', 'fight', 'doe', 'home', 'famili', 'car', 'night', 'say', 'next', 'ask', 'day', 'want', 'show', 'goe', 'friend', 'two', 'polic', 'name', 'meet'] 

Instead, I got the output below:

["(u'ngma', 0.034841332255132154)", "(u'video', 0.0073756817356584745)", "(u'youtube', 0.006524039676605746)", "(u'liked', 0.0065240394176856644)",] 
["(u'ngma', 0.024537057880333127)", "(u'photography', 0.0068263432438681482)", "(u'tvallwhite', 0.0029535361359022566)", "(u'3', 0.0029252727655122079)"] 

My code:

ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=50)
lda = ldamodel.print_topics(num_topics=2, num_words=3)

f = open('LDA.txt', 'w')
f.write(str(lda))
f.close()

topics_matrix = ldamodel.show_topics(formatted=False, num_words=10)
topics_matrix = np.array((topics_matrix), dtype=list)
topic_words = topics_matrix[:, 1]
for i in topic_words:
    print([str(word) for word in i])
    print()
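
The strings in the output shown earlier appear because `str()` is applied to whole `(word, probability)` tuples. A minimal sketch of unpacking them instead, assuming `show_topics(formatted=False)` returns `(topic_id, [(word, probability), ...])` pairs as it does in recent gensim versions. The sample values are copied from that output, the topic ids 0 and 1 are assumed, and `words_only` is a hypothetical helper name:

```python
# Sample shaped like ldamodel.show_topics(formatted=False).
# Words and probabilities are copied from the question's output;
# the topic ids (0 and 1) are an assumption.
sample_topics = [
    (0, [("ngma", 0.034841332255132154), ("video", 0.0073756817356584745),
         ("youtube", 0.006524039676605746), ("liked", 0.0065240394176856644)]),
    (1, [("ngma", 0.024537057880333127), ("photography", 0.0068263432438681482),
         ("tvallwhite", 0.0029535361359022566), ("3", 0.0029252727655122079)]),
]


def words_only(topics, topn=20):
    """Return one list of words per topic, dropping the probabilities."""
    return [[word for word, _prob in pairs[:topn]] for _topic_id, pairs in topics]


for words in words_only(sample_topics):
    print(words)
```

Unpacking each pair as `word, _prob` keeps only the word, so each topic prints as a plain list of strings.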

edit-1:

topic_words = [] 
for i in range(3): 
    tt = ldamodel.get_topic_terms(i,10) 
    topic_words.append([pair[0] for pair in tt]) 
    print topic_words 

This produces the following output, which is not what I expected:

[[1897, 135, 130, 127, 70, 162, 445, 656, 608, 1019], [1897, 364, 56, 1236, 181, 172, 449, 48, 15, 18], [1897, 163, 11, 70, 166, 345, 480, 9, 60, 351]] 

Answer

Try this:

from gensim import corpora 
import gensim 
from gensim.models.ldamodel import LdaModel 
from gensim.parsing.preprocessing import STOPWORDS 

# example docs 
doc1 = """ 
Java (Indonesian: Jawa; Javanese: ꦗꦮ; Sundanese: ᮏᮝ) is an island of Indonesia. \
With a population of over 141 million (the island itself) or 145 million (the \ 
administrative region), Java is home to 56.7 percent of the Indonesian population \ 
and is the most populous island on Earth.[1] The Indonesian capital city, Jakarta, \ 
is located on western Java. Much of Indonesian history took place on Java. It was \ 
the center of powerful Hindu-Buddhist empires, the Islamic sultanates, and the core \ 
of the colonial Dutch East Indies. Java was also the center of the Indonesian struggle \ 
for independence during the 1930s and 1940s. Java dominates Indonesia politically, \ 
economically and culturally. 
""" 
doc2 = """ 
Hydrogen fuel is a zero-emission fuel when burned with oxygen, if one considers water \ 
not to be an emission. It often uses electrochemical cells, or combustion in internal \ 
engines, to power vehicles and electric devices. It is also used in the propulsion of \ 
spacecraft and might potentially be mass-produced and commercialized for passenger vehicles \ 
and aircraft.Hydrogen lies in the first group and first period in the periodic table, i.e. \ 
it is the first element on the periodic table, making it the lightest element. Since \ 
hydrogen gas is so light, it rises in the atmosphere and is therefore rarely found in \ 
its pure form, H2.""" 

doc3 = """ 
The giraffe (Giraffa) is a genus of African even-toed ungulate mammals, the tallest living \ 
terrestrial animals and the largest ruminants. The genus currently consists of one species, \ 
Giraffa camelopardalis, the type species. Seven other species are extinct, prehistoric \ 
species known from fossils. Taxonomic classifications of one to eight extant giraffe species \
have been described, based upon research into the mitochondrial and nuclear DNA, as well \ 
as morphological measurements of Giraffa, but the IUCN currently recognizes only one \ 
species with nine subspecies. 
""" 

documents = [doc1, doc2, doc3] 
document_wrd_splt = [[word for word in document.lower().split() if word not in STOPWORDS]
                     for document in documents]

dictionary = corpora.Dictionary(document_wrd_splt) 
print(dictionary.token2id) 

corpus = [dictionary.doc2bow(text) for text in document_wrd_splt]

lda = LdaModel(corpus, num_topics=3, id2word = dictionary, passes=50) 

num_topics = 3 
topic_words = [] 
for i in range(num_topics): 
    tt = lda.get_topic_terms(i,20) 
    topic_words.append([dictionary[pair[0]] for pair in tt]) 

# output 
>>> topic_words[0] 
['indonesian', 'java', 'species', 'island', 'population', 'million', '(the', 'java.', 'center', 'giraffe', 'currently', 'genus', 'city,', 'economically', 'administrative', 'east', 'sundanese:', 'itself)', 'took', '1940s.'] 
>>> topic_words[1] 
['vehicles', 'fuel', 'hydrogen', 'periodic', 'table,', 'i.e.', 'uses', 'form,', 'considers', 'zero-emission', 'internal', 'period', 'burned', 'cells,', 'rises', 'pure', 'atmosphere', 'aircraft.hydrogen', 'water', 'engines,'] 
>>> topic_words[2] 
['giraffa,', 'even-toed', 'living', 'described,', 'camelopardalis,', 'consists', 'extinct,', 'seven', 'fossils.', 'morphological', 'terrestrial', '(giraffa)', 'dna,', 'mitochondrial', 'nuclear', 'ruminants.', 'classifications', 'species,', 'prehistoric', 'known'] 

I tried the code and got close to the expected output. Please check edit-1. – aneeket

Updated the post. –