The StringIndexer (import org.apache.spark.ml.feature.StringIndexer) is what you are looking for. Documentation: StringIndexer
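
To see what the indexer does on its own, here is a minimal sketch on a made-up three-row DataFrame. The local SparkSession, the app name, and the toy data are assumptions for illustration only.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession, used only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("StringIndexerSketch").getOrCreate()
import spark.implicits._

// Toy data (made up): "male" occurs more often than "female"
val df = Seq("male", "female", "male").toDF("Sex")

// fit() learns the label-to-index mapping; transform() appends the index column
val indexed = new StringIndexer()
  .setInputCol("Sex")
  .setOutputCol("SexIndex")
  .fit(df)
  .transform(df)

indexed.show()
```

By default the indexer orders labels by descending frequency, so the most frequent label ("male" here) gets index 0.0 and "female" gets 1.0.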

Here is an example that uses the Titanic dataset. The fields "Sex" and "Embarked" are categorical and need to be converted to numeric values.


import org.apache.spark.sql.SparkSession 
import org.apache.spark.ml.classification.LogisticRegression 
import org.apache.spark.ml.feature.{OneHotEncoder,StringIndexer,VectorAssembler,VectorIndexer} 
import org.apache.spark.ml.linalg.Vectors 
import org.apache.spark.ml.Pipeline 

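// Load the Titanic CSV. "Survived" is renamed to "label", the default label column of LogisticRegression. 
// Rows with a missing Age or Embarked are dropped, since the indexers and VectorAssembler reject nulls by default. 
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("train.csv").withColumnRenamed("Survived","label").na.drop(Seq("Age","Embarked")) 

// Hold out part of the data for scoring; the 70/30 ratio is an arbitrary, assumed choice 
val Array(training, test) = data.randomSplit(Array(0.7, 0.3)) 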

// Convert the categorical (string) values into numeric values 
val genderIndexer = new StringIndexer().setInputCol("Sex").setOutputCol("SexIndex") 
val embarkIndexer = new StringIndexer().setInputCol("Embarked").setOutputCol("EmbarkIndex") 

// Convert the numerical index columns into One Hot columns 
// The One Hot columns are binary {0,1} values of the categories 
val genderEncoder = new OneHotEncoder().setInputCol("SexIndex").setOutputCol("SexVec") 
val embarkEncoder = new OneHotEncoder().setInputCol("EmbarkIndex").setOutputCol("EmbarkVec") 

// Create the vector structured data (label,features(vector)) 
val assembler = new VectorAssembler().setInputCols(Array("Pclass","SexVec","Age","SibSp","Parch","Fare","EmbarkVec")).setOutputCol("features") 

// Create the Logistic Regression instance 
val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3).setElasticNetParam(0.8) 

// Create the model pipeline 
val pipeline = new Pipeline().setStages(Array(genderIndexer,embarkIndexer,genderEncoder,embarkEncoder,assembler,lr)) 

// Create the Logistic Regression model by fitting the training data 
val lrModel = pipeline.fit(training) 

// Score the data 
val results = lrModel.transform(test) 



|PassengerId|Survived|Pclass|Name                                               |Sex   |Age |SibSp|Parch|Ticket          |Fare   |Cabin|Embarked|
|1          |0       |3     |Braund, Mr. Owen Harris                            |male  |22.0|1    |0    |A/5 21171       |7.25   |null |S       |
|2          |1       |1     |Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38.0|1    |0    |PC 17599        |71.2833|C85  |C       |
|3          |1       |3     |Heikkinen, Miss. Laina                             |female|26.0|0    |0    |STON/O2. 3101282|7.925  |null |S       |
|4          |1       |1     |Futrelle, Mrs. Jacques Heath (Lily May Peel)       |female|35.0|1    |0    |113803          |53.1   |C123 |S       |
|5          |0       |3     |Allen, Mr. William Henry                           |male  |35.0|0    |0    |373450          |8.05   |null |S       |
only showing top 5 rows


|label|Pclass|Name                           |Sex |Age |SibSp|Parch|Fare    |Embarked|SexIndex|EmbarkIndex|SexVec       |EmbarkVec    |features                               |rawPrediction                           |probability                             |prediction|
|0    |1     |Baxter, Mr. Quigg Edmond       |male|24.0|0    |1    |247.5208|C       |0.0     |1.0        |(1,[0],[1.0])|(2,[1],[1.0])|[1.0,1.0,24.0,0.0,1.0,247.5208,0.0,1.0]|[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Blackwell, Mr. Stephen Weart   |male|45.0|0    |0    |35.5    |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,45.0,0.0,0.0,35.5,1.0,0.0]    |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Carlsson, Mr. Frans Olof       |male|33.0|0    |0    |5.0     |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,33.0,0.0,0.0,5.0,1.0,0.0]     |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Carrau, Mr. Francisco M        |male|28.0|0    |0    |47.1    |S       |0.0     |0.0        |(1,[0],[1.0])|(2,[0],[1.0])|[1.0,1.0,28.0,0.0,0.0,47.1,1.0,0.0]    |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
|0    |1     |Foreman, Mr. Benjamin Laventall|male|30.0|0    |0    |27.75   |C       |0.0     |1.0        |(1,[0],[1.0])|(2,[1],[1.0])|[1.0,1.0,30.0,0.0,0.0,27.75,0.0,1.0]   |[0.4221428237360974,-0.4221428237360974]|[0.6039958948910183,0.39600410510898165]|0.0       |
only showing top 5 rows

Thanks, but what I actually wanted to ask is: is it possible for me to define my own custom transformations (like StringIndexer, OneHotEncoder, ...) that can then be plugged into a Pipeline as a stage?

