2016-10-29
0

Add column when column in str.contains matches

I want to search through the searchList and check whether the column text str.contains one or more of the search words. If I get a match, I want to append the data to masterdf, which is easily accomplished as seen below. But I also want to add a new column with searchWord so that I know which text matched which word. The code below fills the whole searchWord column with the last search word that matched.

masterdf = pd.DataFrame(columns=['doc_id', 'text'])

for searchWord in searchList:
    search = jsons_data[jsons_data['text'].str.contains(searchWord)]
    if len(search) > 0:
        masterdf = masterdf.append(search)
        # Problem: this overwrites the entire column on every match
        masterdf['searchWord'] = searchWord

Answer

1

I think this is what you want.

Let's set up some sample data:

import pandas as pd

tt = '''I want to search through the. searchList and check if column text str.contains one or more of each searchWord. If I get a match I want to append the data to masterdf which is easily accomplished as seen below. But I also want to add a new column with searchWord so that I know which text matched with what. This code below fills the column searchWord with the. latest search that matched'''
text_col = tt.split('.')  # one row per '.'-separated fragment
id_col = range(len(text_col))
jsons_data = pd.DataFrame({'doc_id': id_col, 'text': text_col})

searchList = ['code','fills', 'But','also','want'] 

The sample jsons_data is:

doc_id text 
0 0  I want to search through the 
1 1  searchList and check if column text str 
2 2  contains one or more of each searchWord 
3 3  If I get a match I want to append the data to... 
4 4  But I also want to add a new column with sear... 
5 5  This code below fills the column searchWord w... 
6 6  latest search that matched 

Changing the code so that search['searchWord'] = searchWord is set on each match before appending, we get:

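masterdf = pd.DataFrame(columns=['doc_id', 'text', 'searchWord'])

for searchWord in searchList:
    # .copy() so the new column is set on an independent frame, not on a slice
    search = jsons_data[jsons_data['text'].str.contains(searchWord)].copy()
    if len(search) > 0:
        search['searchWord'] = searchWord  # tag only the rows that matched this word
        # DataFrame.append was removed in pandas 2.0; use pd.concat there
        masterdf = masterdf.append(search)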

And masterdf is:

doc_id text            searchWord 
5 5.0 This code below fills the column searchWord w... code 
5 5.0 This code below fills the column searchWord w... fills 
4 4.0 But I also want to add a new column with sear... But 
4 4.0 But I also want to add a new column with sear... also 
0 0.0 I want to search through the      want 
3 3.0 If I get a match I want to append the data to... want 
4 4.0 But I also want to add a new column with sear... want 
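On newer pandas versions, where DataFrame.append is no longer available, the same loop can be sketched with pd.concat. This is only a minimal sketch and not part of the original answer: the three-row jsons_data, the extra search word 'missing', and the helper list matches are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample data (an assumption, not the data from the answer).
jsons_data = pd.DataFrame({
    'doc_id': [0, 1, 2],
    'text': ['I want to search', 'This code fills the column', 'nothing here'],
})
searchList = ['code', 'fills', 'want', 'missing']

matches = []  # one tagged DataFrame per search word that matched
for searchWord in searchList:
    search = jsons_data[jsons_data['text'].str.contains(searchWord)]
    if not search.empty:
        # assign returns a new frame, so the slice itself is never modified
        matches.append(search.assign(searchWord=searchWord))

# Concatenate once at the end instead of growing masterdf inside the loop.
if matches:
    masterdf = pd.concat(matches)
else:
    masterdf = pd.DataFrame(columns=['doc_id', 'text', 'searchWord'])

print(masterdf)
```

Collecting the pieces in a list and concatenating once avoids copying the growing frame on every iteration.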
1

I'd suggest using a vectorized (no looping) approach:

In [84]: df 
Out[84]: 
    doc_id                        text 
0  0                  I want to search through the 
1  1                searchList and check if column text str 
2  2                contains one or more of each searchWord 
3  3 If I get a match I want to append the data to masterdf which is easily accomplished as seen below 
4  4  But I also want to add a new column with searchWord so that I know which text matched with what 
5  5            This code below fills the column searchWord with the 
6  6                   latest search that matched 

In [85]: searchList = ['code', 'fills', 'but', 'also', 'want'] 

In [86]: words_re = '{}'.format('|'.join(searchList).lower()) 

In [87]: words_re 
Out[87]: 'code|fills|but|also|want' 

In [88]: masterdf = df[df.text.str.contains('(?:{})'.format(words_re))].copy() 

In [89]: masterdf['searchWord'] = masterdf.text.str.findall('({})'.format(words_re)).str.join('|') 

In [90]: masterdf 
Out[90]: 
    doc_id                        text searchWord 
0  0                  I want to search through the  want 
3  3 If I get a match I want to append the data to masterdf which is easily accomplished as seen below  want 
4  4  But I also want to add a new column with searchWord so that I know which text matched with what also|want 
5  5            This code below fills the column searchWord with the code|fills 
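The session above can also be written as a standalone script. The following is a hedged sketch rather than the answerer's exact code: the small df is made-up sample data, re.escape is an added assumption to keep search words with regex metacharacters literal, and the long_form frame at the end is an optional extra that produces one row per matched word.

```python
import re
import pandas as pd

# Illustrative sample data (an assumption, not the data from the answer).
df = pd.DataFrame({
    'doc_id': [0, 1, 2],
    'text': ['I want to search', 'This code fills the column', 'nothing here'],
})
searchList = ['code', 'fills', 'want']

# Escape each word so regex metacharacters are matched literally,
# then join everything into one alternation pattern.
words_re = '|'.join(re.escape(w) for w in searchList)

# Keep only the rows that contain at least one search word.
mask = df['text'].str.contains(words_re)
masterdf = df[mask].copy()

# findall returns the list of matched words per row; join them into one string.
masterdf['searchWord'] = masterdf['text'].str.findall(words_re).str.join('|')

# Optional: one row per (text, word) pair instead of a joined string.
long_form = (
    df[mask]
    .assign(searchWord=df['text'].str.findall(words_re))
    .explode('searchWord')
)

print(masterdf)
print(long_form)
```

Note that long_form is grouped by row rather than by search word, so its row order differs from the loop-based result, and a word that occurs twice in one text appears twice.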
+0

This looks really nice. Why is this better than looping? And is there a "clearer"/more correct way to think about this problem than "add column when match ..."? – user3471881

+1

@user3471881, because for larger (1000+) data sets, vectorized solutions are usually orders of magnitude faster than "looping" solutions – MaxU
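One rough way to check this on your own data is to time both variants with timeit. The sketch below is an illustration only: the repeated sample texts and the function names loop_version and vectorized_version are assumptions, and the measured ratio will depend on the number of rows and, in particular, on the number of search words, so no specific outcome is implied.

```python
import timeit
import pandas as pd

# Illustrative data: three sample texts repeated to get 3000 rows (an assumption).
texts = ['I want to search', 'This code fills the column', 'nothing here'] * 1000
df = pd.DataFrame({'doc_id': range(len(texts)), 'text': texts})
searchList = ['code', 'fills', 'want']

def loop_version():
    # One pass over the text column per search word.
    parts = []
    for word in searchList:
        hit = df[df['text'].str.contains(word)]
        if not hit.empty:
            parts.append(hit.assign(searchWord=word))
    return pd.concat(parts)

def vectorized_version():
    # A single combined pattern for all search words.
    words_re = '|'.join(searchList)
    out = df[df['text'].str.contains(words_re)].copy()
    out['searchWord'] = out['text'].str.findall(words_re).str.join('|')
    return out

loop_time = timeit.timeit(loop_version, number=10)
vec_time = timeit.timeit(vectorized_version, number=10)
print(f'loop: {loop_time:.3f}s, vectorized: {vec_time:.3f}s')
```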
