Print feature names for SelectKBest when the k value is inside the param_grid of GridSearchCV

I tried parameter combinations of k for SelectKBest and n_components for PCA inside the param_grid. I am able to print the k value and n_components with the following code. I am posting the whole code so that you can see where the list of features comes from.

#THE FIRST FEATURE HAS TO BE THE LABEL 

featurelist = ['poi', 'exercised_stock_options', 'expenses', 'from_messages', 
      'from_poi_to_this_person', 'from_this_person_to_poi', 'other', 
      'restricted_stock', 'salary', 'shared_receipt_with_poi', 
      'to_messages', 'total_payments', 'total_stock_value', 
      'ratio_from_poi', 'ratio_to_poi'] 

# Select the same columns as featurelist (label first) 
enronml = pd.DataFrame(enron[featurelist].copy()) 


enronml = enronml.to_dict(orient="index") 
dataset = enronml 

# featureFormat takes the dictionary as the dataset and converts the first 
# feature in featurelist into the label 

data = featureFormat(dataset, featurelist, sort_keys = True) 
labels, features = targetFeatureSplit(data) 

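# sklearn.cross_validation was removed in scikit-learn 0.20; 
# train_test_split now lives in sklearn.model_selection 
from sklearn.decomposition import PCA 
from sklearn.feature_selection import SelectKBest 
from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit, 
                                     train_test_split) 
from sklearn.naive_bayes import GaussianNB 
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import MinMaxScaler 

X_train, X_test, y_train, y_test = train_test_split( 
    features, labels, test_size=0.20, random_state=0) 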


pca = PCA() 
gnba = GaussianNB() 
steps = [('scaler', MinMaxScaler()), 
    ('best', SelectKBest()), 
    ('pca', pca), 
    ('gnba', gnba)] 

pipeline = Pipeline(steps) 

parameters = [  
{ 
'best__k':[3], 
'pca__n_components': [1,2] 
}, 
{ 
'best__k':[4], 
'pca__n_components': [1,2,3] 
}, 
{ 
'best__k':[5], 
'pca__n_components': [1,2,3,4] 
}, 
] 

cv = StratifiedShuffleSplit(test_size=0.2, random_state=42) 
gnbawithpca = GridSearchCV(pipeline, param_grid = parameters, cv=cv, 
scoring="f1") 
gnbawithpca.fit(X_train,y_train) 

means = gnbawithpca.cv_results_['mean_test_score'] 
stds = gnbawithpca.cv_results_['std_test_score'] 


for mean, std, params in zip(means, stds, 
gnbawithpca.cv_results_['params']): 
    print("%0.3f (+/-%0.03f) for %r" 
      % (mean, std * 2, params))

I get a result like this:

0.480 (+/-0.510) for {'best__k': 3, 'pca__n_components': 1} 
0.534 (+/-0.409) for {'best__k': 3, 'pca__n_components': 2} 
0.480 (+/-0.510) for {'best__k': 4, 'pca__n_components': 1} 
0.534 (+/-0.409) for {'best__k': 4, 'pca__n_components': 2} 
0.565 (+/-0.342) for {'best__k': 4, 'pca__n_components': 3} 
0.480 (+/-0.510) for {'best__k': 5, 'pca__n_components': 1} 
0.513 (+/-0.404) for {'best__k': 5, 'pca__n_components': 2} 
0.473 (+/-0.382) for {'best__k': 5, 'pca__n_components': 3} 
0.448 (+/-0.353) for {'best__k': 5, 'pca__n_components': 4}
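
The nine rows above come from expanding each dictionary in `parameters` on its own and concatenating the results, which is why n_components always stays below k. The following is a rough plain-Python sketch of that expansion. The `expand` helper is hypothetical and only mimics the behaviour; it is not scikit-learn's own code.

```python
from itertools import product

parameters = [
    {'best__k': [3], 'pca__n_components': [1, 2]},
    {'best__k': [4], 'pca__n_components': [1, 2, 3]},
    {'best__k': [5], 'pca__n_components': [1, 2, 3, 4]},
]

def expand(grids):
    """Hypothetical helper: list every candidate from a list of grids."""
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists within this one grid only
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates

candidates = expand(parameters)
for params in candidates:
    print(params)
```

Each printed dictionary should match one `params` entry in the results listing.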

I would like to know which features were selected. For example, when best__k = 5, I want to know the names of those 5 features.


2017-07-09 Anonymous

Could you clarify what your question is? You have only said what you are doing, not what the problem is.

I'm sorry. I forgot the last line.

RESOLVED

When you define the pipeline that is used in GridSearchCV, you can name each step:

steps = [('scaler', MinMaxScaler()), 
    ('best', SelectKBest()), 
    ('pca', pca), 
    ('gnba', gnba)] 

pipeline = Pipeline(steps)

You do that for two reasons:

So that you can define the parameters in the parameter grid (the names are needed to identify which step each parameter belongs to).

So that you can access the attributes of a step from the GridSearchCV object (this answers your question).

skb_step = gnbawithpca.best_estimator_.named_steps['best'] 

# Get SelectKBest scores, rounded to 2 decimal places, name them "feature_scores" 

feature_scores = ['%.2f' % elem for elem in skb_step.scores_ ] 

# Get SelectKBest pvalues, rounded to 3 decimal places, name them "feature_scores_pvalues" 

feature_scores_pvalues = ['%.3f' % elem for elem in skb_step.pvalues_] 

# Get the selected feature names; their indices are returned by 
# skb_step.get_support(indices=True). The offset i+1 skips the label 'poi', 
# which is the first entry in featurelist. 
# Build a list of (name, score, p-value) tuples, name it "features_selected_tuple" 

features_selected_tuple = [(featurelist[i+1], feature_scores[i], 
                            feature_scores_pvalues[i]) 
                           for i in skb_step.get_support(indices=True)] 

# Sort the list by score, in descending order 

features_selected_tuple = sorted(features_selected_tuple, 
                                 key=lambda feature: float(feature[1]), 
                                 reverse=True) 

# Print 

print(' ') 
print('Selected Features, Scores, P-Values') 
print(features_selected_tuple)
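
Note that best_estimator_ only holds the pipeline refit with the winning parameter combination, so it reports the features for the best k, not for every k in the grid. To see the features for a particular k (say k = 5), one option is to refit a copy of the pipeline with that k fixed. Here is a minimal sketch under some assumptions: the data is synthetic (make_classification), the names f0 to f7 are placeholders, the `selected_names` helper is hypothetical, and the PCA step is left out because it runs after SelectKBest and does not change which features are kept.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the Enron features (assumption, not the real data)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           random_state=0)
names = ['f%d' % i for i in range(X.shape[1])]  # placeholder feature names

pipeline = Pipeline([('scaler', MinMaxScaler()),
                     ('best', SelectKBest()),
                     ('gnba', GaussianNB())])

def selected_names(pipe, k, X, y, names):
    """Refit a copy of the pipeline with best__k fixed; return kept names."""
    fitted = clone(pipe).set_params(best__k=k).fit(X, y)
    kept = fitted.named_steps['best'].get_support(indices=True)
    return [names[i] for i in kept]

for k in (3, 4, 5):
    print(k, selected_names(pipeline, k, X, y, names))
```

With the data from the question, you would presumably pass X_train and y_train, and featurelist[1:] as the names, since the label 'poi' comes first in featurelist.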


2017-07-09 18:45:17
