Compute a word co-occurrence matrix in R

I would like to compute a word co-occurrence matrix in R. I have the following data frame of sentences:

dat <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F) 
dat[2,1] <- c("The girl is short.") 
dat[3,1] <- c("The tall boy and the short girl are friends.")

This gives me:

The boy is tall.
The girl is short.
The tall boy and the short girl are friends.

What I would like to do is first build a list of all unique words across all three sentences, namely

The 
boy 
is 
tall 
girl 
short 
and 
are 
friends

Then I would like to create a word co-occurrence matrix that counts how many times words co-occur in a sentence in total, which would look something like this:

     The boy is tall girl short and are friends
The    0   2  2    2    2     2   1   1       1
boy    2   0  1    2    1     1   1   1       1
is     2   1  0    1    1     1   0   0       0
tall   2   2  1    0    1     1   1   1       1
etc.

and so on for all words, where a word cannot co-occur with itself. Note that in sentence 3, where the word "the" appears twice, the solution should count the co-occurrences for "the" only once.

Does anyone have an idea how I could do this? I am working with a data frame of about 3000 sentences.

Source

2016-11-07 Allan Davids

What have you tried, and why didn't it work? You need to show some effort here. – agenis

Have a look at the [tm package](https://cran.r-project.org/web/packages/tm/index.html). – zx8754

With base R, try splitting each sentence on spaces with 'strsplit' and removing periods, commas and the like with 'gsub'. For the unique list of words you can then use the 'unique' command. –
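The base-R suggestion in the comment above can be sketched as follows. This is only a minimal sketch: it assumes words are separated by whitespace and that lowercasing is acceptable (so "The" and "the" count as one word, in line with the question's note about sentence 3). The helper name `tokenize_sentence` is hypothetical.

```r
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Hypothetical helper: lowercase, strip punctuation, split on whitespace
tokenize_sentence <- function(s) {
  s <- tolower(s)
  s <- gsub("[[:punct:]]", "", s)
  strsplit(s, "[[:space:]]+")[[1]]
}

# One character vector of words per sentence
tokens <- lapply(sentences, tokenize_sentence)

# Unique words across all sentences, in order of first appearance
vocab <- unique(unlist(tokens))
print(vocab)
```

For the three example sentences this yields the nine words listed in the question (in lowercase).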

library(tm)
dat <- as.data.frame("The boy is tall.", stringsAsFactors = FALSE)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")

# One document per sentence; VectorSource works on the plain character column
ds  <- Corpus(VectorSource(dat[, 1]))
dtm <- DocumentTermMatrix(ds, control = list(wordLengths = c(1, Inf)))

X   <- as.matrix(dtm)   # documents x terms count matrix
X[X > 0] <- 1           # count each word at most once per sentence
out <- crossprod(X)     # Same as: t(X) %*% X
diag(out) <- 0          # remove own-word occurrences
out

        Terms
Terms    boy friend girl short tall the
  boy      0      1    1     1    2   2
  friend   1      0    1     1    1   1
  girl     1      1    0     2    1   2
  short    1      1    2     0    1   2
  tall     2      1    1     1    0   2
  the      2      1    2     2    2   0
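To see why `crossprod` yields co-occurrence counts, here is a small base-R sketch that does not depend on tm. It builds a binary sentence-by-word matrix, so a word repeated within a sentence (such as "the" in sentence 3) is counted once, as the question asks. Whitespace tokenization and lowercasing are assumptions of this sketch, and the variable names are illustrative.

```r
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Unique lowercase words per sentence, punctuation removed
tokens <- lapply(sentences, function(s) {
  unique(strsplit(gsub("[[:punct:]]", "", tolower(s)), "[[:space:]]+")[[1]])
})
vocab <- unique(unlist(tokens))

# Binary incidence matrix: rows are sentences, columns are words,
# entry is 1 if the word occurs in the sentence and 0 otherwise
X <- t(vapply(tokens,
              function(tk) as.numeric(vocab %in% tk),
              numeric(length(vocab))))
colnames(X) <- vocab

# Entry [i, j] = number of sentences that contain both word i and word j
cooc <- crossprod(X)
diag(cooc) <- 0   # a word does not co-occur with itself
print(cooc)
```

Because each row of `X` holds only 0s and 1s, the product `t(X) %*% X` sums, over sentences, the indicator that both words are present. For a few thousand sentences a dense matrix like this is usually still manageable, though that depends on the vocabulary size.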

You may also want to remove stop words such as "the" (run these steps on the corpus before building the DocumentTermMatrix), i.e.

ds <- tm_map(ds, stripWhitespace) 
ds <- tm_map(ds, removePunctuation) 
ds <- tm_map(ds, stemDocument) 
ds <- tm_map(ds, removeWords, c("the", stopwords("english"))) 
ds <- tm_map(ds, removeWords, c("the", stopwords("spanish")))
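The same idea can be sketched without tm by filtering tokens against a stop-word list before building the matrix. The short `stop_words` vector below is a hand-picked illustration for this example, not tm's `stopwords("english")` list, and the tokens are assumed to be already lowercased and stripped of punctuation.

```r
# Tokens of the third example sentence (lowercased, punctuation removed)
tokens <- c("the", "tall", "boy", "and", "the", "short", "girl", "are", "friends")

# Illustrative stop-word list (an assumption, not a standard list)
stop_words <- c("the", "and", "are", "is")

# Keep only tokens that are not stop words
kept <- tokens[!tokens %in% stop_words]
print(kept)
```

Filtering before the matrix is built keeps the result smaller, which may matter more as the number of sentences grows.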

Source

2016-11-07 12:11:04
