Essentially, you need to perform a train/validation/test split of your sample data. The training set is used to fit the model's ordinary parameters, the validation set is used to tune hyperparameters in the grid search, and the test set is used to evaluate performance. Here is one way to do this.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
# simulate some artificial data so that I can show you the result of each intermediate step
# 1000 obs, X dim 1000-by-100, 2 different y labels with unbalanced weights
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])
X.shape
Out[78]: (1000, 100)
y.shape
Out[79]: (1000,)
# Nested Cross-Validation, this returns a train/test index iterator
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
# to take a look at the split, you will see it has 5 tuples
list(split)
# the 1st fold
train_index = list(split)[0][0]
Out[80]: array([ 0, 1, 2, ..., 997, 998, 999])
test_index = list(split)[0][1]
Out[81]: array([ 5, 12, 17, ..., 979, 982, 984])
# let's play with just one iteration for now
# your pipe
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# set up params
params_space = dict(logisticregression__C=10.0**np.arange(-5,1),
                    logisticregression__penalty=['l1', 'l2'],
                    logisticregression__class_weight=[None, 'auto'])
# apply your grid search only to the train data, but with a further cv step
# so the original train set is split into [gscv_train, gscv_validation], where the latter is used to tune hyperparameters
# all performance is still evaluated on a separate held-out 'test' set
grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
# fit the data on train set
grid.fit(X[train_index], y[train_index])
# to get the params of your estimator, call your gscv
grid.best_estimator_
Out[82]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0))])
# the performance in validation set
grid.grid_scores_
Out[83]:
[mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]
# or the best score among them
grid.best_score_
Out[84]: 0.94002188482393367
# with the estimator trained, we now predict on the test set
y_pred = grid.predict(X[test_index])
# since LogisticRegression is a probability-based model, we have the luxury of getting the probability for each obs
y_pred_probs = grid.predict_proba(X[test_index])
Out[87]:
array([[ 0.0632, 0.9368],
[ 0.0236, 0.9764],
[ 0.0227, 0.9773],
...,
[ 0.0108, 0.9892],
[ 0.2903, 0.7097],
[ 0.0113, 0.9887]])
# to get evaluation result,
print(classification_report(y[test_index], y_pred))
             precision    recall  f1-score   support
          0       0.93      0.59      0.72        22
          1       0.95      0.99      0.97       179
avg / total       0.95      0.95      0.95       201
# to put all things together with the nested cross-validation
# generate a pandas dataframe to store prediction probability
kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
report = []  # to store classification report
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
for train_index, test_index in split:
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    grid.fit(X[train_index], y[train_index])
    y_pred_probs = grid.predict_proba(X[test_index])
    kfold_df.iloc[test_index, :] = y_pred_probs
    y_pred = grid.predict(X[test_index])
    report.append(classification_report(y[test_index], y_pred))
# your result
print(kfold_df)
Out[88]:
0 1
0 0.1710 0.8290
1 0.0083 0.9917
2 0.2049 0.7951
3 0.0038 0.9962
4 0.0536 0.9464
5 0.0632 0.9368
6 0.1243 0.8757
7 0.1150 0.8850
8 0.0796 0.9204
9 0.4096 0.5904
.. ... ...
990 0.0505 0.9495
991 0.2128 0.7872
992 0.0270 0.9730
993 0.0434 0.9566
994 0.8078 0.1922
995 0.1452 0.8548
996 0.1372 0.8628
997 0.0127 0.9873
998 0.0935 0.9065
999 0.0065 0.9935
[1000 rows x 2 columns]
for r in report:
    print(r)
             precision    recall  f1-score   support
          0       0.93      0.59      0.72        22
          1       0.95      0.99      0.97       179
avg / total       0.95      0.95      0.95       201

             precision    recall  f1-score   support
          0       0.86      0.55      0.67        22
          1       0.95      0.99      0.97       179
avg / total       0.94      0.94      0.93       201

             precision    recall  f1-score   support
          0       0.89      0.38      0.53        21
          1       0.93      0.99      0.96       179
avg / total       0.93      0.93      0.92       200

             precision    recall  f1-score   support
          0       0.88      0.33      0.48        21
          1       0.93      0.99      0.96       178
avg / total       0.92      0.92      0.91       199

             precision    recall  f1-score   support
          0       0.88      0.33      0.48        21
          1       0.93      0.99      0.96       178
avg / total       0.92      0.92      0.91       199
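The session above uses `sklearn.grid_search` and `sklearn.cross_validation`, which exist only in older scikit-learn releases; `class_weight='auto'` was likewise replaced by `'balanced'`. As a rough sketch, assuming a recent scikit-learn where the splitters and `GridSearchCV` live in `sklearn.model_selection`, the same nested loop might look like the following. The reduced grid (only `C` and `class_weight`) and the `random_state` values are choices made for this illustration, not part of the original run, so the scores will differ from those shown above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# same kind of simulated data as above; random_state is an assumption for reproducibility
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5,
                           n_classes=2, weights=[0.1, 0.9], random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# smaller grid than the original, for brevity
params_space = {
    'logisticregression__C': 10.0 ** np.arange(-3, 1),
    'logisticregression__class_weight': [None, 'balanced'],
}

# outer loop: held-out test folds used only for evaluation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
reports = []

for train_index, test_index in outer_cv.split(X, y):
    # inner loop: 3-fold CV on the training part only, used to tune hyperparameters
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(n_splits=3), scoring='roc_auc')
    grid.fit(X[train_index], y[train_index])
    kfold_df.iloc[test_index, :] = grid.predict_proba(X[test_index])
    reports.append(classification_report(y[test_index], grid.predict(X[test_index])))
```

In the newer API the splitter is constructed without the labels and yields indices through `split(X, y)`, so the inner `StratifiedKFold` no longer needs `y[train_index]` passed in by hand; `GridSearchCV` takes care of that during `fit`.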
Thank you for this, very helpful. One thing I don't understand is the need for the transform: if it selects n features, what exactly is being "transformed"? (And I'm not sure how it determines that; there must be a threshold.) The heuristic I use is that RFECV picks the 'n' best features and drops the others. – GPB
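On the transform question: in scikit-learn, calling `transform` on a fitted `RFECV` returns the input with the unselected columns dropped, and the number of features kept is the one that gave the best cross-validated score rather than a fixed threshold. A minimal sketch, assuming a recent scikit-learn; the small dataset, `max_iter`, and CV settings are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# illustrative data, not the data from the answer
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# recursively eliminate one feature at a time, scoring each subset size by cross-validation
selector = RFECV(LogisticRegression(max_iter=1000), step=1,
                 cv=StratifiedKFold(n_splits=3), scoring='roc_auc')
selector.fit(X, y)

# transform keeps only the columns flagged in the boolean mask selector.support_
X_reduced = selector.transform(X)
print(selector.n_features_, X_reduced.shape)
```

So the "transformation" is a column selection: `X_reduced` is the same as `X[:, selector.support_]`.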
Following up on my question above, I get the error "'Pipeline' object has no attribute 'coef_'" when I try to look at coef_ as described above. I'm also curious why you say StratifiedKFold is chosen for any classification problem (which this is): I thought KFold was the default, with stratified K-fold reserved for imbalanced classes (which I have). – GPB
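On the `coef_` error: the `Pipeline` object itself does not carry `coef_`, so the coefficients have to be read from the fitted final step. A minimal sketch, assuming a recent scikit-learn and the step name `logisticregression` that `make_pipeline` generates automatically; the small dataset and grid are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# illustrative data, not the data from the answer
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
grid = GridSearchCV(pipe, {'logisticregression__C': [0.01, 0.1, 1.0]},
                    cv=StratifiedKFold(n_splits=3), scoring='roc_auc')
grid.fit(X, y)

# best_estimator_ is the refitted Pipeline; reach into it by step name
best_pipe = grid.best_estimator_
clf = best_pipe.named_steps['logisticregression']
coefs = clf.coef_  # one row of coefficients for a binary problem
print(coefs.shape)
```

Note that these coefficients refer to the standardized features, since the scaler runs before the classifier inside the pipeline.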