2016-10-09 2 views
0

Turn a text document into a numeric list

I am trying to automatically turn a large corpus into a numeric list, with one number per line. For example, I have the following data:

Df.txt = 

In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. 
We love you Mr. Brown. 
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him. 
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home. 
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy! 
If you have an alternative argument, let's hear it! :) 

First, I read the text with the readLines command:

text <- readLines("Df.txt", encoding = "UTF-8") 

Second, I convert the whole text to lower case and remove unnecessary whitespace:

## str_trim() comes from the stringr package: 
library(stringr) 
## Lower-case the input: 
lower_text <- tolower(text) 
## Remove leading and trailing spaces: 
Spaces_remove <- str_trim(lower_text) 

From here, I would like to assign a number to each line, e.g.:

"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1 
"We love you Mr. Brown." = 2 
... 
"If you have an alternative argument, let's hear it! :)" = 6 

Any ideas?

+0

What do you mean by "assign it a number"? Are you using a two-column data frame? – hrbrmstr

+1

Something like 'setNames(txt, 1:length(txt))' or 'as.list(txt)'? – Jaap

+0

I think it will be as.list(txt). As you can see in the last code block, each line has a number assigned to it. Cheers! –

Answer

1

You already have numeric line-number associations of a sort with the vector (it is indexed numerically), but ...

text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. 
We love you Mr. Brown. 
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him. 
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home. 
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy! 
If you have an alternative argument, let\'s hear it! :)' 

library(dplyr) 
library(purrr) 
library(stringi) 

textConnection(text_input) %>% 
    readLines(encoding="UTF-8") %>% 
    stri_trans_tolower() %>% 
    stri_trim() -> corpus 

# data frame with an explicit line-number column 
# (tibble() replaces the deprecated data_frame(); dplyr re-exports it) 
df <- tibble(line_number = seq_along(corpus), text = corpus) 

# list with an explicit line-number field 
lst <- map(seq_along(corpus), ~list(line_number = .x, text = corpus[.x])) 

# list with implicit numeric ids (the list positions) 
as.list(corpus) 

# list with explicit numeric ids (but they are really string keys) 
setNames(as.list(corpus), seq_along(corpus)) 

# named vector 
set_names(corpus, seq_along(corpus)) 

There is a wealth of R packages that considerably ease the burden of text processing/NLP ops. Doing this work outside of them will probably mean reinventing the wheel. The CRAN NLP Task View lists many of them.
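To illustrate the point that a character vector is already indexed by line number, here is a minimal base-R sketch. The `lines` vector is a shortened, hypothetical stand-in for the cleaned corpus, and the names `line_ids` and `numbered` are made up for this example.

```r
# Hypothetical stand-in for the cleaned corpus (three short lines).
lines <- c("we love you mr. brown.",
           "any ideas?",
           "if you have an alternative argument, let's hear it! :)")

# The position in the vector already serves as the line number.
line_ids <- seq_along(lines)
second_line <- lines[2]

# A base-R data frame makes the numbering explicit.
numbered <- data.frame(line_number = line_ids,
                       text = lines,
                       stringsAsFactors = FALSE)

# Look a line up by its number.
third_line <- numbered$text[numbered$line_number == 3]
```

Whether the plain vector or the data frame is the better fit probably depends on what the downstream function expects as input.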

+0

Thanks for the code and the tips @hrbmstr, one last thing. I would like the object to be of class numeric rather than a list. I am trying to feed it into an n-gram function for word prediction, and it only works with numeric input. Cheers –

+2

Perhaps you should go back to your original question and provide _all_ the information people need in order to try to help you. What you just typed in your comment is unclear. What kind of "prediction"? What "numeric"? You do know that you can do 'as.numeric(vector_of_characters)' and get numbers back, right? You have not clearly articulated your need in any way. – hrbrmstr
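As a small sketch of what as.numeric() does with character input: it assumes a named vector like the one built with setNames() in the answer, and uses a shortened, hypothetical two-line corpus. It is not meant as a guess at what the n-gram function actually needs.

```r
# Hypothetical named vector, built the same way as in the answer.
corpus <- setNames(c("we love you mr. brown.", "any ideas?"), 1:2)

# The names are character strings, even though they look like numbers.
ids_chr <- names(corpus)

# as.numeric() converts number-like strings back into numbers.
ids_num <- as.numeric(ids_chr)

# Free text cannot be parsed as a number; the result is NA (with a warning,
# suppressed here to keep the output clean).
text_num <- suppressWarnings(as.numeric(unname(corpus)))
```

So the line ids can be recovered as numbers, but the text itself does not become numeric this way.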
