PySpark NaiveBayes model: saving the predicted output to a CSV file

I am using separate train and test CSV files of 345 MB and 21 GB, with 13 columns and up to 80 million rows.

NaiveBayes model code:

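from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# Reading files (assumes an existing SparkContext named sc)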
data="C:/csv/train2004.txt" 
test="C:/csv/ascii20041.asc" 
#Data into RDD 
train=sc.textFile(data).map(lambda x: x.split(",")) 
test=sc.textFile(test).map(lambda y: y.split(" ")) 

#extract header 
header = train.first() 
header1 = test.first() 
print(header) 
print(header1) 

#Removing Header Row 
train = train.filter(lambda Row: Row!=header) 
#test=test.filter(lambda Row: Row!=header) 
print(train.first()) 
print(test.first()) 
train = train.map(lambda x: x[4:17]) 
test = test.map(lambda x: x[3:16]) 
print(train.first()) 
print(test.first()) 

# Reading required column 
train = train.map(lambda x: LabeledPoint(x[0],x[1:13])) 
test = test.map(lambda y: LabeledPoint(y[0],y[1:13])) 
print(train.first()) 
print(test.first()) 

#Naive Bayes Model training 
model = NaiveBayes.train(train, 1.0) 

#Prediction and save as Test file 
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label)) 
print(predictionAndLabel.first()) 
predictionAndLabel.saveAsTextFile('c:/csv/mycsv.csv') 

#Accuracy Checking 
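# Index the tuple instead of unpacking it in the lambda signature (unpacking is not valid in Python 3)
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()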
print('model accuracy {}'.format(accuracy))

ERROR:

An error occurred while calling o5072.saveAsTextFile. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 365.0 failed 1 times, most recent failure: Lost task 0.0 in stage 365.0 (TID 460, localhost)

I am also still facing problems with:

Saving the predicted output in predictionAndLabel to a text file with saveAsTextFile.
Joining the predictionAndLabel result with the test input, using the row number as the reference.

Source

2016-12-10 Ram

In these two lines:

train = train.map(lambda x: LabeledPoint(x[0], x[1:13])) 
test = test.map(lambda y: LabeledPoint(y[0], y[1:13]))

You are passing a list of strings to LabeledPoint, which is not valid input. The features should be one of

NumPy array
list
pyspark.mllib.linalg.SparseVector
scipy.sparse column matrix

of numeric types.
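
A minimal sketch of that conversion, assuming every selected field parses as a number and the label sits in column 0 as in the question's slices. `parse_row` is a hypothetical helper name, not part of PySpark:

```python
def parse_row(fields, n_features=12):
    """Turn one split line (a list of strings) into (label, features) floats.

    Assumes the label is in column 0 and the features follow it, as in
    the question's x[0] / x[1:13] slices.
    """
    values = [float(v) for v in fields[:n_features + 1]]
    return values[0], values[1:]


label, features = parse_row(["1", "0.5", "2", "3.25"])
print(label, features)
```

With such a helper the two lines could become `train = train.map(lambda x: LabeledPoint(*parse_row(x)))`, and likewise for `test`. If the test file has a header row, it would have to be filtered out first, since `float()` raises `ValueError` on non-numeric text.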

Source

2016-12-10 20:11:09

Hello James Z, could you provide some example code that I can adapt to get results? – Ram
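
For the two remaining points in the question (saving the predictions and joining them back to the test input by row number), one possible approach is sketched below. It assumes the test RDD still holds the split string rows with the label in column 0; `format_csv_line` and `save_predictions` are hypothetical helper names, and only standard RDD methods (`map`, `zip`, `zipWithIndex`, `saveAsTextFile`) and `NaiveBayesModel.predict` are used:

```python
def format_csv_line(row_number, fields, prediction):
    """Build one CSV line: row number, the original fields, the prediction."""
    return ",".join([str(row_number)] + list(fields) + [str(prediction)])


def save_predictions(model, test_rows, out_dir):
    """Hypothetical helper. test_rows is an RDD of split lines (lists of
    strings, label in column 0); model is a trained NaiveBayesModel."""
    features = test_rows.map(lambda f: [float(v) for v in f[1:13]])
    predictions = model.predict(features)  # predict also accepts an RDD of vectors
    # zip pairs each row with its prediction; zipWithIndex appends a 0-based row number
    indexed = test_rows.zip(predictions).zipWithIndex()
    lines = indexed.map(lambda t: format_csv_line(t[1], t[0][0], t[0][1]))
    lines.saveAsTextFile(out_dir)  # writes a directory of part-* files


print(format_csv_line(0, ["1", "0.5", "2"], 1.0))
```

Note that `saveAsTextFile` writes a directory of part files rather than a single file, and the target directory must not already exist.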
