Fill Werte FEHLT nur in einem Datenrahmen (Pandas)

was ich in einem Datenrahmen haben:Fill Werte FEHLT nur in einem Datenrahmen (Pandas)

email user_name sessions ymo 
[email protected] JD 1 2015-03-01 
[email protected] JD 2 2015-05-01

Was ich brauche:

email user_name sessions ymo 
[email protected] JD 0 2015-01-01 
[email protected] JD 0 2015-02-01 
[email protected] JD 1 2015-03-01 
[email protected] JD 0 2015-04-01 
[email protected] JD 2 2015-05-01 
[email protected] JD 0 2015-06-01 
[email protected] JD 0 2015-07-01 
[email protected] JD 0 2015-08-01 
[email protected] JD 0 2015-09-01 
[email protected] JD 0 2015-10-01 
[email protected] JD 0 2015-11-01 
[email protected] JD 0 2015-12-01

ymo Spalte sind pd.Timestamp s:

all_ymo 

[Timestamp('2015-01-01 00:00:00'), 
Timestamp('2015-02-01 00:00:00'), 
Timestamp('2015-03-01 00:00:00'), 
Timestamp('2015-04-01 00:00:00'), 
Timestamp('2015-05-01 00:00:00'), 
Timestamp('2015-06-01 00:00:00'), 
Timestamp('2015-07-01 00:00:00'), 
Timestamp('2015-08-01 00:00:00'), 
Timestamp('2015-09-01 00:00:00'), 
Timestamp('2015-10-01 00:00:00'), 
Timestamp('2015-11-01 00:00:00'), 
Timestamp('2015-12-01 00:00:00')]

Leider ist diese Antwort: Adding values for missing data combinations in Pandas nicht gut, da es Duplikate für vorhandene erstellt ymo Werte.

Ich habe versucht, so etwas wie dieses, aber es ist extrem langsam:

for em in all_emails: 
    existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo'] 
    existent_ymo = set([pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo]) 
    missing_ymo = list(existent_ymo - all_ymo) 
    multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names) 
    fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()

Quelle

2016-09-13 LetMeSOThat4U

Wenn die fehlenden Einträge des gefüllten Überzahl, füllen Sie dann einen neuen Datenrahmen mit einem pd.data_range beginnen. Fügen Sie dann Sitzungswerte hinzu, bei denen die Daten übereinstimmen. Wenn E-Mail-Adresse und Benutzername 1-für-1 sind, dann sollten Sie nur einen von ihnen im Datenrahmen haben, um Speicher zu sparen (wenn Größe ein Problem ist) – dodell

Ich versuche allgemeinere Lösung mit periods erstellen:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 


mbeg = pd.period_range('2015-01', periods=12, freq='M') 
print (mbeg) 
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06', 
      '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'], 
      dtype='int64', freq='M') 
#convert column ymo to period 
df.ymo = df.ymo.dt.to_period('m') 
#groupby and reindex with filling 0 
df = df.groupby(['email','user_name']) 
     .apply(lambda x: x.set_index('ymo') 
     .reindex(mbeg, fill_value=0) 
     .drop(['email','user_name'], axis=1)) 
     .rename_axis(('email','user_name','ymo')) 
     .reset_index()

print (df) 

     email user_name  ymo sessions 
0 [email protected]  JD 2015-01   0 
1 [email protected]  JD 2015-02   0 
2 [email protected]  JD 2015-03   1 
3 [email protected]  JD 2015-04   0 
4 [email protected]  JD 2015-05   2 
5 [email protected]  JD 2015-06   0 
6 [email protected]  JD 2015-07   0 
7 [email protected]  JD 2015-08   0 
8 [email protected]  JD 2015-09   0 
9 [email protected]  JD 2015-10   0 
10 [email protected]  JD 2015-11   0 
11 [email protected]  JD 2015-12   0 
12 [email protected]  AB 2015-01   0 
13 [email protected]  AB 2015-02   0 
14 [email protected]  AB 2015-03   1 
15 [email protected]  AB 2015-04   0 
16 [email protected]  AB 2015-05   2 
17 [email protected]  AB 2015-06   0 
18 [email protected]  AB 2015-07   0 
19 [email protected]  AB 2015-08   0 
20 [email protected]  AB 2015-09   0 
21 [email protected]  AB 2015-10   0 
22 [email protected]  AB 2015-11   0 
23 [email protected]  AB 2015-12   0

Dann, wenn Bedarf datetimes Verwendung to_timestamp:

df.ymo = df.ymo.dt.to_timestamp() 
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0

Lösung mit Datetimes:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 

mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin() 

df = df.groupby(['email','user_name']) 
     .apply(lambda x: x.set_index('ymo') 
     .reindex(mbeg, fill_value=0) 
     .drop(['email','user_name'], axis=1)) 
     .rename_axis(('email','user_name','ymo')) 
     .reset_index()

print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0

Quelle

2016-09-13 10:25:06 jezrael

erzeugen Monat beginnen Termine und reindex
ffill und bfill Spalten ['email', 'user_name']
fillna(0) für Spalte 'sessions'

mbeg = pd.date_range('2015-01-31', periods=12, freq='M') - pd.offsets.MonthBegin() 

df1 = df.set_index('ymo').reindex(mbeg) 

df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill() 
df1['sessions'] = df1['sessions'].fillna(0).astype(int) 

df1

Quelle

2016-09-13 09:53:39 piRSquared

Leider funktioniert es nicht, wenn Zeilen für einen anderen Benutzer mit den gleichen Daten (wie df in jezraels Antwort), dies löst "ValueError: kann nicht von einer doppelten Achse neu indizieren". – LetMeSOThat4U

Fill Werte FEHLT nur in einem Datenrahmen (Pandas)

Antwort

Verwandte Themen