2017-09-12 2 views
0

Multi-indexing a pandas DataFrame

I am wondering how to create multiple index levels for a DataFrame based on a list that groups items from another column.

Since it is probably easier to show by example, here is a script that shows what I have and what I want:

import pandas as pd 

def ungroup_column(df, column, split_column = None): 
    ''' 
    # Summary 
     Takes a dataframe column that contains lists and spreads the items in the list over many rows 
     Similar to pandas.melt(), but acts on lists within the column 

    # Example 

     input dataframe: 

       farm_id animals 
      0 1  [pig, sheep, dog] 
      1 2  [duck] 
      2 3  [pig, horse] 
      3 4  [sheep, horse] 


     output dataframe: 

       farm_id animals 
      0 1  pig 
      0 1  sheep 
      0 1  dog 
      1 2  duck 
      2 3  pig 
      2 3  horse 
      3 4  sheep 
      3 4  horse 

    # Arguments 

     df: (pandas.DataFrame) 
      dataframe to act upon 

     column: (String) 
      name of the column which contains lists to separate 

     split_column: (String) 
      column to be added to the dataframe containing the split items that were in the list 
      If this is not given, the values will be written over the original column 
    ''' 
    if split_column is None: 
        split_column = column 

    # split column into multiple columns (one col for each item in list) for every row 
    # then transpose it to make the lists go down the rows 
    list_split_matrix = df[column].apply(pd.Series).T 

    # Now the columns of `list_split_matrix` (they're just integers) 
    # are the indices of the rows in `df` - i.e. `df_row_idx` 
    # so this melt concats each column on top of each other 
    melted_df = pd.melt(list_split_matrix, var_name = 'df_row_idx', value_name = split_column).dropna().set_index('df_row_idx') 

    # when overwriting, drop the original list column before joining the split values back 
    if split_column == column: 
        df = df.drop(column, axis = 1) 
    df = df.join(melted_df) 
    return df 

from IPython.display import display 
# Earlier attempt, left commented out: `train_df` and `utils` are not defined in this post. 
# train_df.index 
# from utils import * 
# play_df = train_df 
# sent_idx = play_df.groupby('pmid')['sentence'].apply(lambda row: range(0, len(list(row)))) #set_index(['pmid', range(0, len())]) 
# play_df.set_index('pmid') 

import pandas as pd 
doc_texts = ['Here is a sentence. And Another. Yet another sentence.', 
      'Different Document here. With some other sentences.'] 
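# `nlp` is assumed to be an already loaded spaCy pipeline; the post does not show how it is created 
playing_df = pd.DataFrame({'doc':[nlp(doc) for doc in doc_texts], 
                           'sentences':[[s for s in nlp(doc).sents] for doc in doc_texts]}) 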
display(playing_df) 
display(ungroup_column(playing_df, 'sentences')) 

The output of this is as follows:

doc sentences 
0 (Here, is, a, sentence, ., And, Another, ., Ye... [(Here, is, a, sentence, .), (And, Another, .)... 
1 (Different, Document, here, ., With, some, oth... [(Different, Document, here, .), (With, some, ... 
doc sentences 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (Here, is, a, sentence, .) 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (And, Another, .) 
0 (Here, is, a, sentence, ., And, Another, ., Ye... (Yet, another, sentence, .) 
1 (Different, Document, here, ., With, some, oth... (Different, Document, here, .) 
1 (Different, Document, here, ., With, some, oth... (With, some, other, sentences, .) 

But what I would really like is an additional index level for the 'sentences' column, like this:

doc_idx sent_idx  document           sentence 
0   0   (Here, is, a, sentence, ., And, Another, ., Ye... (Here, is, a, sentence, .) 
      1   (Here, is, a, sentence, ., And, Another, ., Ye... (And, Another, .) 
      2   (Here, is, a, sentence, ., And, Another, ., Ye... (Yet, another, sentence, .) 
1   0   (Different, Document, here, ., With, some, oth... (Different, Document, here, .) 
      1   (Different, Document, here, ., With, some, oth... (With, some, other, sentences, .) 
+1

I think you can check [this very nice solution by MaxU](https://stackoverflow.com/a/40449726/2901002) – jezrael

+0

What is nlp(doc).sents? An nltk sentence tokenizer? – Dark

+0

@Bharath, yes, it is a sentence tokenizer from spaCy – chase

Answer

1

Based on the second output, you can reset the index, add a per-document counter with cumcount on the old index, call set_index on both columns, and then rename the axis, i.e.:

new_df = ungroup_column(playing_df, 'sentences').reset_index() 
new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx']) 

Output:

 
                   doc  sents 
doc_idx sent_idx              
0  0   [Here, is, a, sentence, ., And, Another, ., Ye...  Here is a sentence. 
     1   [Here, is, a, sentence, ., And, Another, ., Ye...  And Another. 
     2   [Here, is, a, sentence, ., And, Another, ., Ye...  Yet another sentence. 
1  0   [Different, Document, here, ., With, some, oth...  Different Document here. 
     1   [Different, Document, here, ., With, some, oth...  With some other sentences. 
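
The three steps above can be tried without spaCy or the `ungroup_column` helper. The following is only a minimal sketch: `flat_df` is a hypothetical stand-in for the ungrouped frame, with plain strings in place of the spaCy objects and the original row number repeated in the index.

```python
import pandas as pd

# Hypothetical stand-in for the output of ungroup_column:
# one row per sentence, original row number repeated in the index.
flat_df = pd.DataFrame(
    {
        "doc": ["doc A", "doc A", "doc A", "doc B", "doc B"],
        "sentences": [
            "Here is a sentence.",
            "And Another.",
            "Yet another sentence.",
            "Different Document here.",
            "With some other sentences.",
        ],
    },
    index=[0, 0, 0, 1, 1],
)

# reset_index() moves the repeated row numbers into a column named "index".
new_df = flat_df.reset_index()

# cumcount() numbers the rows inside each group starting from 0.
new_df["sent_idx"] = new_df.groupby("index").cumcount()

# Use both columns as index levels and give the levels readable names.
new_df = new_df.set_index(["index", "sent_idx"]).rename_axis(["doc_idx", "sent_idx"])

print(new_df)
```

With this index in place, a single sentence can be looked up by its pair of positions, for example `new_df.loc[(1, 0), "sentences"]`.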

Instead of applying pd.Series, you can use np.concatenate to expand the column. (I used nltk to tokenize the words and sentences.)

import nltk 
import numpy as np 
import pandas as pd 
doc_texts = ['Here is a sentence. And Another. Yet another sentence.', 
             'Different Document here. With some other sentences.'] 
playing_df = pd.DataFrame({'doc':[nltk.word_tokenize(doc) for doc in doc_texts], 
                           'sents':[nltk.sent_tokenize(doc) for doc in doc_texts]}) 

s = playing_df['sents'] 
# repeat each row position once per sentence in that row's list 
i = np.arange(len(playing_df)).repeat(s.str.len()) 

# take the repeated rows without the last column ('sents'), then attach the flattened sentences 
new_df = playing_df.iloc[i, :-1].assign(**{'sents': np.concatenate(s.values)}).reset_index() 

# number the sentences within each original row and use both columns as index levels 
new_df['sent_idx'] = new_df.groupby('index').cumcount() 
new_df.set_index(['index','sent_idx']).rename_axis(['doc_idx','sent_idx']) 
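
On newer pandas versions (0.25 and later) the expansion step can also be written with `DataFrame.explode`, which was not available when this answer was written. This is a different technique from the `np.concatenate` approach above, shown only as a sketch; the sentence lists are hard-coded here so that the example runs without nltk.

```python
import pandas as pd

doc_texts = [
    "Here is a sentence. And Another. Yet another sentence.",
    "Different Document here. With some other sentences.",
]

# Sentence lists are written out by hand (assumption: this is what a
# sentence tokenizer would return for the two texts).
playing_df = pd.DataFrame(
    {
        "doc": doc_texts,
        "sents": [
            ["Here is a sentence.", "And Another.", "Yet another sentence."],
            ["Different Document here.", "With some other sentences."],
        ],
    }
)

# explode() emits one row per list item and repeats the original index label.
exploded = playing_df.explode("sents")

# Number the sentences within each original row (index level 0).
exploded["sent_idx"] = exploded.groupby(level=0).cumcount().to_numpy()

# Append the counter as a second index level and name both levels.
new_df = exploded.set_index("sent_idx", append=True).rename_axis(["doc_idx", "sent_idx"])

print(new_df)
```

Unlike the `iloc[i, :-1]` version, this does not depend on the list column being the last one in the frame.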

Hope it helps.

+0

Thank you very much! That works well. After looking at the [pandas multiindexing documentation](https://pandas.pydata.org/pandas-docs/stable/advanced.html), I was also wondering whether you think there is a more appropriate way to handle the MultiIndex, since I noticed that the "document" level is not repeated the way it is after the "ungroup_column" function I applied here. – chase
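
One possible reading of this observation is that the outer level only looks unrepeated: by default pandas prints a MultiIndex "sparsified", hiding labels that repeat the row above, while the values are still stored for every row. A minimal sketch, assuming a frame shaped like the answer's result (the frame below is built by hand for illustration):

```python
import pandas as pd

# Hand-built frame with the same index layout as the answer's result.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["doc_idx", "sent_idx"]
)
new_df = pd.DataFrame(
    {
        "sents": [
            "Here is a sentence.",
            "And Another.",
            "Yet another sentence.",
            "Different Document here.",
            "With some other sentences.",
        ]
    },
    index=index,
)

# The outer level holds one label per row, even though the default repr hides repeats.
doc_level = new_df.index.get_level_values("doc_idx").tolist()
print(doc_level)

# Print every label explicitly, only inside this block.
with pd.option_context("display.multi_sparse", False):
    print(new_df)

# reset_index() turns both levels back into ordinary, fully repeated columns.
flat = new_df.reset_index()
print(flat)
```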

+0

Glad to help, @chase. – Dark