Es gibt viele Dateien in einem Verzeichnis, das Json formatierte Einträge in jeder Zeile hat. Die Größe der Dateien variiert von 5k bis 200MB. Ich habe diesen Code, um jede Datei zu gehen, analysiere die Daten, nach denen ich suche, und schließlich einen Datenrahmen. Dieses Skript braucht sehr viel Zeit, um fertig zu sein, es endet nie.ist es möglich, Datei lesen und analysieren in R
Gibt es eine Möglichkeit, es zu beschleunigen, damit ich die Dateien schneller lesen kann?
Code:
library(jsonlite)
library(data.table)
setwd("C:/Files/")
#data <- lapply(readLines("test.txt"), fromJSON)
df<-data.frame(Timestamp=factor(),Source=factor(),Host=factor(),Status=factor())
filenames <- list.files("Json_files", pattern="*.txt", full.names=TRUE)
for(i in filenames){
print(i)
data <- lapply(readLines(i), fromJSON)
myDf <- do.call("rbind", lapply(data, function(d) {
data.frame(TimeStamp = d$payloadData$timestamp,
Source = d$payloadData$source,
Host = d$payloadData$host,
Status = d$payloadData$status)}))
df<-rbind(df,myDf)
}
Dies ist ein Beispieleintrag aber es gibt Tausende von Einträgen wie dies in der Datei:
{"senderDateTimeStamp":"2016/04/08 10:53:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB01","servermember":"test"},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:54:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
{"senderDateTimeStamp":"2016/04/08 10:55:18","senderHost":null,"senderAppcode":"app","senderUsecase":"appinternalstats_prod","destinationTopic":"app_appinternalstats_realtimedata_topic","correlatedRecord":false,"needCorrelationCacheCleanup":false,"needCorrelation":false,"correlationAttributes":null,"correlationRecordCount":0,"correlateTimeWindowInMills":0,"lastCorrelationRecord":false,"realtimeESStorage":true,"receiverDateTimeStamp":1460127623591,"payloadData":{"timestamp":"2016-04-08T10:53:18.169","status":"get","source":"STREAM","fund":"JVV","client":"","region":"","evetid":"","osareqid":"","basis":"","pricingdate":"","content":"","msgname":"","recipient":"","objid":"","idlreqno":"","host":"WEB02","servermember":""},"payloadDataText":"","key":"app:appinternalstats_prod","destinationTopicName":"app_appinternalstats_realtimedata_topic","hdfsPath":"app/appinternalstats_prod","esindex":"app","estype":"appinternalstats_prod","useCase":"appinternalstats_prod","appCode":"app"}
@Tensibai, Ich habe einen Beispieleintrag, aber die Dateien können 10'von Tausenden solcher Einträge haben. – user1471980
Sie können die Anzahl der Zeilen in den Dateien zählen, wodurch Sie eine gute Schätzung der Größe des endgültigen Datasets erhalten. R behandelt wachsende Objekte nicht sehr gut. Sie sollten versuchen, die 'rbind'-Aufrufe zu vermeiden. – niczky12
@Tensibai, sorry, ich habe gerade 3 Einträge platziert. – user1471980