Dies ist mein erster Beitrag. Ich habe versucht, Features mit FeatureUnion und Pipeline zu kombinieren, aber wenn ich einen tf-idf + svd piepline hinzufügen, schlägt der Test mit einem "Dimensionskonflikt" Fehler fehl. Meine einfache Aufgabe besteht darin, ein Regressionsmodell zu erstellen, um die Relevanz der Suche vorherzusagen. Code und Fehler sind unten aufgeführt. Ist in meinem Code etwas nicht in Ordnung?Dimension nicht übereinstimmenden Fehler mit Scikit Pipeline FeatureUnion
df = read_tsv_data(input_file)
df = tokenize(df)
df_train, df_test = train_test_split(df, test_size = 0.2, random_state=2016)
x_train = df_train['sq'].values
y_train = df_train['relevance'].values
x_test = df_test['sq'].values
y_test = df_test['relevance'].values
# char ngrams
char_ngrams = CountVectorizer(ngram_range=(2,5), analyzer='char_wb', encoding='utf-8')
# TFIDF word ngrams
tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer='word', encoding='utf-8')
# SVD
svd = TruncatedSVD(n_components=100, random_state = 2016)
# SVR
svr_lin = SVR(kernel='linear', C=0.01)
pipeline = Pipeline([
('feature_union',
FeatureUnion(
transformer_list = [
('char_ngrams', char_ngrams),
('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
('tfidf_word_ngrams', tfidf_word_ngrams),
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
]
)
),
('svr_lin', svr_lin)
])
model = pipeline.fit(x_train, y_train)
y_pred = model.predict(x_test)
Beim Hinzufügen der Pipeline unten an der FeatureUnion Liste:
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
Die Ausnahme unten generiert wird:
2016-07-31 10:34:08,712 : Testing ... Test Shape: (400,) - Training Shape: (1600,)
Traceback (most recent call last):
File "src/model/end_to_end_pipeline.py", line 236, in <module>
main()
File "src/model/end_to_end_pipeline.py", line 233, in main
process_data(input_file, output_file)
File "src/model/end_to_end_pipeline.py", line 175, in process_data
y_pred = model.predict(x_test)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 203, in predict
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 523, in transform
for name, trans in self.transformer_list)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
self.results = batch()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 399, in _transform_one
return transformer.transform(X)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 291, in transform
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/decomposition/truncated_svd.py", line 201, in transform
return safe_sparse_dot(X, self.components_.T)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 179, in safe_sparse_dot
ret = a * b
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 389, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
Vielen Dank für Ihren Vorschlag. Das war genau das Problem. Ich habe gerade einen zusätzlichen SVD-Transformer erstellt, um das tf-idf-Wort n-gramm zu verarbeiten, und es hat wie erwartet funktioniert. – sylar