2016-03-21 13 views
1

Pandas: long processing time for groupby and comparisons

I'm new to pandas, but with help from Stack Overflow I have gotten things working. The code below currently works, but it takes about 30 minutes (the dataset is fairly large). Is there a way to speed it up? Essentially, I'm trying to count, per id, the different combinations of the "Status" column with the "Current_Status" column. Thanks!

import numpy as np
import pandas as pd

df_new = df.groupby('id').apply(lambda x: pd.Series(dict(
    # note: `!= np.nan` is True for every row (NaN never compares equal), so this counts all rows
    new_col1=(x['foo'] != np.nan).sum(),
    new_col2=(x['bar'] == 'P').sum(),
    new_col3=(x['bar'] == 'C').sum(),
    new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
    new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
    new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
)))

Example df structure:

In [15]: df.head(6)
Out[15]:
   id  foo  bar                    Status            Current_Status
0   1   23  'C'          'Approved, paid'          'Approved, paid'
1   1   63  'P'  'Approved, not yet paid'          'Approved, paid'
2   1   84  'P'          'Approved, paid'          'Approved, paid'
3   1  125  'P'  'Approved, not yet paid'  'Approved, not yet paid'
4   1  216  'P'  'Approved, not yet paid'          'Approved, paid'
5   1   12  'C'          'Approved, paid'          'Approved, paid'
+0

Can you add sample data? 5-6 rows would do. – jezrael

+0

Just did so. I'm sure the formatting of the "Out[15]" output is off. Please ignore that! – nonegiven72

Answer

1

You can try notnull and numpy.in1d:

df_new1 = df.groupby('id').apply(lambda x: pd.Series(dict(
new_col1=(x['foo'].notnull()).sum(), 
new_col2=np.in1d(x['bar'],'P').sum(), 
new_col3=np.in1d(x['bar'],'C').sum(), 
new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(), 
new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(), 
new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum() 
))) 
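The switch from `!= np.nan` to notnull matters for correctness as well as speed. A minimal sketch of the difference, using a made-up three-row column (the values are an assumption for illustration, not taken from the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value, for illustration only.
foo = pd.Series([23.0, np.nan, 84.0])

# NaN compares unequal to everything, including itself,
# so this mask is True in every row and the sum is just the row count.
naive_count = int((foo != np.nan).sum())

# notnull() marks only the rows that actually hold a value.
real_count = int(foo.notnull().sum())

print(naive_count, real_count)
```

If the column never contains missing values, both counts agree, which may be why the original comparison went unnoticed.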

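Another, faster solution is to convert the values to 0 and 1 with factorize, then create the inverted columns with abs, and finally use groupby with sum. Note that factorize numbers the values in order of first appearance, so this relies on each column holding exactly two distinct values and on which of them appears first: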

df['new_col1'] = df['foo'].notnull().astype(int) 
df['new_col2'] = df['bar'].factorize()[0] 
df['new_col3'] = (df['new_col2'] - 1).abs() 
df['Status'] = df['Status'].factorize()[0] 
df['invertStatus'] = (df['Status'] - 1).abs() 
df['Current_Status'] = df['Current_Status'].factorize()[0] 
df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs() 

df['new_col4'] = df['Status'] & df['invertCurrent_Status'] 
df['new_col5'] = df['Status'] & df['Current_Status'] 
df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status'] 

# select the new columns before summing so that only they are aggregated
cols = ['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']
print(df.groupby('id')[cols].sum())
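Because the factorize approach depends on the order in which values first occur, it is worth checking the codes before trusting the sums. The sketch below uses two tiny made-up Series (an assumption for illustration) to show the same labels getting opposite codes, and an explicit comparison that does not depend on row order:

```python
import pandas as pd

# Two hypothetical 'bar' columns with the same labels but different first rows.
bar_c_first = pd.Series(['C', 'P', 'P'])
bar_p_first = pd.Series(['P', 'C', 'P'])

# factorize returns (codes, uniques); codes follow order of first appearance.
codes_c, uniques_c = bar_c_first.factorize()   # 'C' -> 0, 'P' -> 1
codes_p, uniques_p = bar_p_first.factorize()   # 'P' -> 0, 'C' -> 1

# Summing the codes counts 'P' in the first case but 'C' in the second.
sum_c_first = int(codes_c.sum())
sum_p_first = int(codes_p.sum())

# An explicit comparison always counts 'P', whatever the row order.
explicit_c_first = int((bar_c_first == 'P').sum())
explicit_p_first = int((bar_p_first == 'P').sum())

print(list(uniques_c), sum_c_first, explicit_c_first)
print(list(uniques_p), sum_p_first, explicit_p_first)
```

On the test data in this answer the first rows happen to be 'C' and 'Approved, paid', which is what makes the 0/1 codes line up with the intended columns.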

Or you can create boolean Series, which is the fastest solution:

df['new_col1'] = df['foo'].notnull() 
df['new_col2'] = np.in1d(df['bar'], 'P') 
df['new_col3'] = ~df['new_col2'] 
Status = np.in1d(df['Status'],'Approved, not yet paid') 
invertStatus = ~Status 
Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid') 
invertCurrent_Status = ~Current_Status 

df['new_col4'] = Status & invertCurrent_Status 
df['new_col5'] = Status & Current_Status 
df['new_col6'] = invertStatus & invertCurrent_Status 
# select the new columns before summing so that only they are aggregated
cols = ['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']
print(df.groupby('id')[cols].sum().astype(int))
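For reference, the boolean-mask idea can be sketched end to end on the small test frame from this answer. This variant is an assumption-laden rewrite rather than the exact code timed here: it builds the flags in a separate frame (the name `flags` is made up), spells out every condition instead of inverting masks, and uses plain `==` comparisons in place of np.in1d, which newer NumPy releases deprecate in favour of np.isin.

```python
import pandas as pd

PAID = 'Approved, paid'
NOT_PAID = 'Approved, not yet paid'

# The 11-row test frame shown in this answer.
df = pd.DataFrame({
    'id': [1] * 5 + [2] * 6,
    'foo': [23, 63, 84, 125, 12, 23, 63, 84, 125, 216, 12],
    'bar': list('CPPPC') + list('CPPPPC'),
    'Status': [PAID, NOT_PAID, PAID, NOT_PAID, PAID,
               PAID, NOT_PAID, PAID, NOT_PAID, NOT_PAID, PAID],
    'Current_Status': [PAID, PAID, PAID, NOT_PAID, PAID,
                       PAID, PAID, PAID, NOT_PAID, PAID, PAID],
})

# One boolean column per condition, computed once for the whole frame.
flags = pd.DataFrame({
    'id': df['id'],
    'new_col1': df['foo'].notnull(),
    'new_col2': df['bar'] == 'P',
    'new_col3': df['bar'] == 'C',
    'new_col4': (df['Status'] == NOT_PAID) & (df['Current_Status'] == PAID),
    'new_col5': (df['Status'] == NOT_PAID) & (df['Current_Status'] == NOT_PAID),
    'new_col6': (df['Status'] == PAID) & (df['Current_Status'] == PAID),
})

# Summing booleans per group counts the True rows.
result = flags.groupby('id').sum().astype(int)
print(result)
```

Spelling out each condition keeps the result correct even if a column later gains a third value, at the cost of a few extra comparisons.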

Timings:

In [25]: len(df) 
Out[25]: 110000 

In [26]: %timeit a(df) 
10 loops, best of 3: 24.7 ms per loop 

In [27]: %timeit b(df1) 
10 loops, best of 3: 39.3 ms per loop 

In [28]: %timeit c(df2) 
10 loops, best of 3: 46 ms per loop 

In [29]: %timeit d(df3) 
10 loops, best of 3: 103 ms per loop 

Code:

df = pd.concat([df]*10000).reset_index(drop=True)  
#print df 
df1,df2,df3 = df.copy(), df.copy(), df.copy() 


def a(df): 
    df['new_col1'] = df['foo'].notnull() 
    df['new_col2'] = np.in1d(df['bar'], 'P') 
    df['new_col3'] = ~df['new_col2'] 
    Status = np.in1d(df['Status'],'Approved, not yet paid') 
    invertStatus = ~Status 
    Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid') 
    invertCurrent_Status = ~Current_Status 
    df['new_col4'] = Status & invertCurrent_Status 
    df['new_col5'] = Status & Current_Status 
    df['new_col6'] = invertStatus & invertCurrent_Status 
    #print df 
    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].astype(int) 

def b(df): 
    df['new_col1'] = df['foo'].notnull().astype(int) 
    df['new_col2'] = df['bar'].factorize()[0] 
    df['new_col3'] = (df['new_col2'] - 1).abs() 
    df['Status'] = df['Status'].factorize()[0] 
    df['invertStatus'] = (df['Status'] - 1).abs() 
    df['Current_Status'] = df['Current_Status'].factorize()[0] 
    df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs() 

    df['new_col4'] = df['Status'] & df['invertCurrent_Status'] 
    df['new_col5'] = df['Status'] & df['Current_Status'] 
    df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status'] 

    return df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']]

def c(df):
    # first solution: groupby + apply with notnull and np.in1d
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'].notnull()).sum(),
        new_col2=np.in1d(x['bar'],'P').sum(),
        new_col3=np.in1d(x['bar'],'C').sum(),
        new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
        new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
        new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
    )))

def d(df):
    # original approach from the question, kept as-is for the comparison
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'] != np.nan).sum(),
        new_col2=(x['bar'] == 'P').sum(),
        new_col3=(x['bar'] == 'C').sum(),
        new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
        new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
        new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum()
    )))
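The %timeit calls above need IPython. Outside it, a comparison along the same lines can be sketched with the standard-library timeit module. Everything in this block is an assumption for illustration: the synthetic data, the sizes, and the helper names `with_apply` and `vectorized`. It covers only one of the six counts, and the measured times will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 20,000 rows spread over up to 500 ids.
rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    'id': rng.integers(0, 500, size=n),
    'bar': rng.choice(['C', 'P'], size=n),
})

def with_apply(frame):
    # Calls a Python function once per group.
    return frame.groupby('id')['bar'].apply(lambda s: (s == 'P').sum())

def vectorized(frame):
    # Builds one boolean mask for the whole frame, then sums it per group.
    return (frame['bar'] == 'P').groupby(frame['id']).sum()

# Total seconds for three runs of each variant.
t_apply = timeit.timeit(lambda: with_apply(df), number=3)
t_vec = timeit.timeit(lambda: vectorized(df), number=3)

print('apply:      %.4f s' % t_apply)
print('vectorized: %.4f s' % t_vec)
```

The gap between the two variants tends to widen as the number of distinct ids grows, since the apply version pays a fixed Python-level cost for every group.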

Test DataFrame:

    id  foo  bar                  Status          Current_Status
0    1   23    C          Approved, paid          Approved, paid
1    1   63    P  Approved, not yet paid          Approved, paid
2    1   84    P          Approved, paid          Approved, paid
3    1  125    P  Approved, not yet paid  Approved, not yet paid
4    1   12    C          Approved, paid          Approved, paid
5    2   23    C          Approved, paid          Approved, paid
6    2   63    P  Approved, not yet paid          Approved, paid
7    2   84    P          Approved, paid          Approved, paid
8    2  125    P  Approved, not yet paid  Approved, not yet paid
9    2  216    P  Approved, not yet paid          Approved, paid
10   2   12    C          Approved, paid          Approved, paid
+0

Thanks, I'll try this and see whether it speeds things up. – nonegiven72

+0

I tested it and it is 2 times faster. – jezrael