Can someone explain? DTM sparsity differs between tf and tf-idf weighting on the same corpus
My understanding:
tf >= 0 (absolute frequency value)
tfidf >= 0 (for negative idf, tf=0)
sparse entry = 0
nonsparse entry > 0
So the exact split between sparse and non-sparse entries should be the same in both DTMs created with the following code.
library(tm)
data(crude)
dtm <- DocumentTermMatrix(crude, control=list(weighting=weightTf))
dtm2 <- DocumentTermMatrix(crude, control=list(weighting=weightTfIdf))
dtm
dtm2
But:
> dtm
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2255/23065**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency (tf)
> dtm2
<<DocumentTermMatrix (documents: 20, terms: 1266)>>
**Non-/sparse entries: 2215/23105**
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
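One thing that can be checked on a toy example: under the common definition idf = log2(N / df), a term that occurs in every document gets idf = 0, so all of its tf-idf entries become zero even though its tf entries are positive. The sketch below uses base R only, with a made-up 3 x 3 matrix (the term names and counts are hypothetical). It assumes, without confirming, that weightTfIdf computes idf this way.

```r
# Toy term-frequency matrix: 3 documents (rows) x 3 terms (columns).
# Hypothetical data; "oil" occurs in every document.
tf <- matrix(c(1, 2, 0,
               1, 0, 3,
               1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(paste0("doc", 1:3), c("oil", "price", "opec")))

n_docs   <- nrow(tf)
doc_freq <- colSums(tf > 0)        # number of documents containing each term
idf      <- log2(n_docs / doc_freq) # assumed idf formula; 0 when doc_freq == n_docs

# Normalized tf-idf: divide each row by its document length,
# then scale each column by that term's idf.
tfidf <- sweep(tf / rowSums(tf), 2, idf, `*`)

nonzero_tf    <- sum(tf != 0)      # non-sparse entries under tf
nonzero_tfidf <- sum(tfidf != 0)   # non-sparse entries under tf-idf

print(c(tf = nonzero_tf, tfidf = nonzero_tfidf))
```

Here the tf-idf matrix loses exactly one non-zero entry per document for the term that appears everywhere. If the same assumption holds for tm, the gap in the output above, 2255 - 2215 = 40, would be consistent with two terms that occur in all 20 documents.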