How to label duplicate groups in pandas?

I have a DataFrame:

>>> df 
    A 
0 foo 
1 bar 
2 foo 
3 baz 
4 foo 
5 bar

I need to find all duplicate groups and label them with sequential dgroup_id's:

>>> df 
    A dgroup_id 
0 foo   1 
1 bar   2 
2 foo   1 
3 baz 
4 foo   1 
5 bar   2

(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)

I did this:

import pandas as pd 

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')}) 

duplicates = df.groupby('A').size() 
duplicates = duplicates[duplicates>1] 
# Yes, this is ugly, but I didn't know how to do it otherwise: 
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index 
df.insert(1, 'dgroup_id', df['A'].map(duplicates))

This results in:

>>> df 
    A dgroup_id 
0 foo  1.0 
1 bar  0.0 
2 foo  1.0 
3 baz  NaN 
4 foo  1.0 
5 bar  0.0

Is there a simpler/shorter way to achieve this in pandas? I have read that pandas.factorize might help here, but I don't know how to use it... (there is no help on this function)

Also: I don't mind either the 0-based group count or the odd sort order, but I would like the dgroup_id's to be ints, not floats.

Source

2017-07-08 Amenhotep

Not sure, but how about trying '(duplicates.reset_index().index).astype(int)'?

Use chained operations: first get the value_counts for each A, then compute the sequence number for each group, and finally merge the result back with the original DataFrame.

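import numpy as np  # needed for np.nan below

(
    pd.merge(df,
             # 1 for values that occur more than once, NaN for singletons;
             # cumsum() then numbers the duplicated values 1, 2, ...
             df.A.value_counts().apply(lambda x: 1 if x > 1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame(),
             left_on='A', right_index=True).sort_index()
)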
Out[49]: 
    A dgroup_id 
0 foo  1.0 
1 bar  2.0 
2 foo  1.0 
3 baz  NaN 
4 foo  1.0 
5 bar  2.0

If you need NaN for unique groups, you cannot have int as the data type, which is a pandas limitation at the moment. If you are OK with 0 for unique groups, you can do something like:

(
    pd.merge(df, 
      df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan) 
       .cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int), 
      left_on='A', right_index=True).sort_index() 
) 

    A dgroup_id 
0 foo   1 
1 bar   2 
2 foo   1 
3 baz   0 
4 foo   1 
5 bar   2
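Regarding the int limitation mentioned above: pandas 0.24 and later provide a nullable integer dtype, 'Int64' (capital I), which can hold missing values. A minimal sketch, assuming such a pandas version is available; it maps the same value_counts-based ids back onto the column with map instead of merge, and the names ids and result are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})

# 1 for values that occur more than once, NaN for singletons,
# then a running sum to number the duplicated values 1, 2, ...
ids = (df['A'].value_counts()
              .apply(lambda x: 1 if x > 1 else np.nan)
              .cumsum())

# Look up each row's id by its value in A, then convert the float column
# to the nullable integer dtype so missing values and ints can coexist.
result = df.assign(dgroup_id=df['A'].map(ids).astype('Int64'))
print(result)
```

The row for baz then holds a missing value (<NA>) while the other ids are real integers.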

Source

2017-07-08 11:10:16 Allen

You can get a list of the duplicated values with get_duplicates() and then derive the dgroup_id from the position of each A value in that list:

def find_index(string):
    if string in duplicates:
        return duplicates.index(string) + 1
    else:
        return 0

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')}) 
duplicates = df.set_index('A').index.get_duplicates() 
df['dgroup_id'] = df['A'].apply(find_index) 
df

Output:

    A dgroup_id 
0 foo   2 
1 bar   1 
2 foo   2 
3 baz   0 
4 foo   2 
5 bar   1
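Note that Index.get_duplicates() was deprecated in pandas 0.23 and removed in pandas 1.0, so the snippet above only runs on older versions. A minimal sketch of the same idea for current pandas, assuming the sorted order shown in the output above is wanted; the dictionary lookup is a name introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})

# Values that occur more than once, sorted like get_duplicates() used to return them.
idx = pd.Index(df['A'])
duplicates = sorted(idx[idx.duplicated()].unique())

# Hypothetical helper mapping: duplicated value -> 1-based position in the list.
lookup = {value: pos + 1 for pos, value in enumerate(duplicates)}

# Values missing from the mapping (singletons) become NaN, then 0.
df['dgroup_id'] = df['A'].map(lookup).fillna(0).astype(int)
print(df)
```

Using a dictionary lookup with map also avoids calling list.index once per row.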

Source

2017-07-08 11:00:03 Dark

Use duplicated to identify where the dups are. Use where to replace singletons with ''. Use a categorical to factorize.

dups = df.A.duplicated(keep=False) 
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes) 

    A dgroup_id 
0 foo   2 
1 bar   1 
2 foo   2 
3 baz   0 
4 foo   2 
5 bar   1

If you insist on the zeros being '':

dups = df.A.duplicated(keep=False) 
df.assign(
    dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, '')) 

    A dgroup_id 
0 foo   2 
1 bar   1 
2 foo   2 
3 baz   
4 foo   2 
5 bar   1
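Since the question asks about pandas.factorize specifically, here is a minimal sketch of the same duplicated/where idea using that function directly. It relies on factorize coding missing values as -1 and numbering the remaining values in order of first appearance, so the ids follow the order requested in the question rather than sorted order; the names codes, uniques and result are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})

# True for every row whose value occurs more than once.
dups = df['A'].duplicated(keep=False)

# where() replaces singletons with NaN; factorize() assigns them the code -1
# and numbers the remaining values 0, 1, ... in order of first appearance.
codes, uniques = pd.factorize(df['A'].where(dups))

# Shift by one so duplicate groups start at 1 and singletons end up as 0.
result = df.assign(dgroup_id=codes + 1)
print(result)
```

The resulting column is a plain integer column, with 0 marking values that are not duplicated.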

Source

2017-07-08 13:30:36 piRSquared

You could go for:

import pandas as pd 
import numpy as np 
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name']) 

# Create the groups order 
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz'] 

# Find index of each element in the ordered list 
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1) 

# Discard non-duplicated entries 
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan 

print(df) 
# name duplication_index 
# 0 foo    1.0 
# 1 bar    2.0 
# 2 foo    1.0 
# 3 baz    NaN 
# 4 foo    1.0 
# 5 bar    2.0
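The lookup with ordered_names.index runs once per row, which may get slow on large frames. A possible vectorized variant of the same steps, sketched under the assumption that groupby(...).ngroup() is available in the installed pandas version; group_number is a name introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar'], columns=['name'])

# Number each distinct name in order of first appearance, starting at 1.
group_number = df.groupby('name', sort=False).ngroup() + 1

# Keep the number only where the name is duplicated; other rows become NaN.
df['duplication_index'] = group_number.where(df['name'].duplicated(keep=False))
print(df)
```

As in the answer above, singletons still consume a number before being discarded, so the ids can have gaps if a non-duplicated name appears before a duplicated one.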

Source

2017-07-08 13:50:27 Deena

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')}) 
key_set = set(df['A']) 
df_a = pd.DataFrame(list(key_set)) 
df_a['dgroup_id'] = df_a.index 
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left') 

In [32]: result.drop(0,axis=1) 
Out[32]: 
    A dgroup_id 
0 foo  2 
1 bar  0 
2 foo  2 
3 baz  1 
4 foo  2 
5 bar  0

Source

2017-07-08 14:04:30
