Hier ist der vollständige Code unvollständig ich die Antwort in. Credits für viele Leute, beschrieb ich alle Quellen auf der Unterseite. Es ist sehr grob, das ist mir klar, aber es fängt mich ein, d. Ich kann txt lemmes Wörterbuch und meine Stoppwörter benutzen, um polnische Texte zu klassifizieren. Danke an Damiano Fantini, Dmitriy Selivanov und viele andere.
#----1. Set up. ----
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
library(readtext)
library(tm)
library(proxy)
library(stringi)
library(stringr)
library(hashmap)
library(data.table)
library(text2vec)
# For reading n-grams
library(RWeka) #(*)
BigramTokenizer <-
function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3)) #(*)
#----2. Read data. ----
stopwordsPL <- as.vector(str_split(readLines("polish.stopwords.text",encoding = "UTF-8"), pattern = " ",simplify = T))
docs <- VCorpus(DirSource(pattern="txt"))
titles <- rownames(summary(docs))
docs <- tm_map(docs, removeWords, words=stopwordsPL)
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, function(x) stri_trans_general(x, "Latin-ASCII"))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
# for English texts it would be simpler
# docs <- tm_map(docs, removeWords, stopwords("english")) #can add other words to remove
# docs <- tm_map(docs, stemDocument, "english")
#====3. Lemmatize ====
# # Dictionary from http://www.lexiconista.com/datasets/lemmatization/
# lemmadict_file = "lemmatization-pl.text"
# dt = fread(file= lemmadict_file, header = F, col.names = c("lemma", "word"), data.table=T, encoding="UTF-8")
# # I threw away Polish letters, maybe changing locales may help.
# dt$lemma <- stri_trans_general(dt$lemma, "Latin-ASCII;lower")
# dt$word <- stri_trans_general(dt$word, "Latin-ASCII;lower")
# dt <- unique(dt)
#
# # Creating hash dictionary
# lemma_hm = hashmap(dt$word, dt$lemma)
#
# # Test if it works
# lemma_hm[["mnozyl"]]
# # [1] "mnozyc"
#
# save_hashmap(lemma_hm, file="lemma_hm", overwrite = TRUE, compress = TRUE)
lemma_hm <- load_hashmap(file="lemma_hm")
lemma_tokenizer = function(x, lemma_hashmap,
tokenizer = text2vec::word_tokenizer) {
tokens_list = tokenizer(x)
for(i in seq_along(tokens_list)) {
tokens = tokens_list[[i]]
replacements = lemma_hashmap[[tokens]]
ind = !is.na(replacements)
tokens_list[[i]][ind] = replacements[ind]
}
# paste together, return a vector
sapply(tokens_list, (function(i){paste(i, collapse = " ")}))
}
docs <- tm_map(docs, (function(x) {lemma_tokenizer(x, lemma_hashmap=lemma_hm)}))
docs <- tm_map(docs, PlainTextDocument)
#====4. Create document term matrix====
docsTDM <-
DocumentTermMatrix(docs, control = list(wordLengths = c(5, 25),tokenize = BigramTokenizer)) # tokenize=LemmaTokenizer, tokenize = BigramTokenizer (*)
docsTDM$dimnames
#====5. Remove sparse and common words====
docsTDM <- removeSparseTerms(docsTDM, .90)
# https://stackoverflow.com/questions/25905144/removing-overly-common-words-occur-in-more-than-80-of-the-documents-in-r
removeCommonTerms <- function (x, pct)
{
stopifnot(inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
is.numeric(pct), pct > 0, pct < 1)
m <- if (inherits(x, "DocumentTermMatrix"))
t(x)
else x
t <- table(m$i) < m$ncol * (pct)
termIndex <- as.numeric(names(t[t]))
if (inherits(x, "DocumentTermMatrix"))
x[, termIndex]
else x[termIndex, ]
}
docsTDM <-
removeCommonTerms(docsTDM, .8) #remove terms that are in >=80% of the documents
docsTDM$dimnames
#====6. Cluster data (hclust). ====
docsdissim <- dist(as.matrix(docsTDM), method = "cosine")
docsdissim2 <- as.matrix(docsdissim)
dim(docsdissim2)
rownames(docsdissim2) <- titles
colnames(docsdissim2) <- titles
h <- hclust(docsdissim, method = "ward.D2")
plot(h, labels = titles, sub = "")
# Library hclust with p-values (pvclust)
library(pvclust)
h_pv <- pvclust(docsdissim2, method.hclust = "ward.D2", method.dist ="correlation")
plot(h_pv)
data.frame(cutree(tree = h_pv$hclust, k = 4))
# pvclust provides two types of p-values: AU (Approximately Unbiased) p-value and BP (Bootstrap Probability) value.
# AU p-value, which is computed by multiscale bootstrap resampling, is a better approximation to unbiased p-value
# than BP value computed by normal bootstrap resampling.
# AU p-value > 0.95 we can assume the clusters exist and may stably be
# observed if we increase the number of observations.
# (http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/)
#==== Literature:====
# Original article:
# http://www.rexamine.com/2014/06/text-mining-in-r-automatic-categorization-of-wikipedia-articles/
# Updates to make it work after some functions became obsolete:
# https://stackoverflow.com/questions/34423823/r-automatic-categorization-of-wikipedia-articles
# https://stackoverflow.com/questions/34372166/error-using-termdocumentmatrix-and-dist-functions-in-r
#
# Based on that:
# http://brazenly.blogspot.co.uk/2015/02/r-categorization-clustering-of.html
#
# Sparse terms:
# https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work
# Lemmatizing function:
# https://stackoverflow.com/questions/46122591/a-lemmatizing-function-using-a-hash-dictionary-does-not-work-with-tm-package-in
# https://stackoverflow.com/questions/45762559/lemmatization-using-txt-file-with-lemmes-in-r/45790325#45790325
Es funktioniert wie ein Charme. Danke Damiano. –