2017-05-02 3 views
0

Ich habe einen Datenrahmen in der folgenden Form:Erhalten Beobachtungen basierend auf Perzentilwert in Python Pandas

d1 = {'City_ID': ['City_1','City_1','City_1','City_1','City_2','City_3','City_3','City_3','City_3','City_3'], 
'Indiv_ID': ['Indiv_1','Indiv_2','Indiv_3','Indiv_4','Indiv_5','Indiv_6','Indiv_7','Indiv_8','Indiv_9','Indiv_10'], 
'Expenditure_by_earning': [0.11, 0.66, 0.51, 0.43, 0.46,0.8, 0.14, 0.06, 0.64, 0.95]} 

Die eigentliche Datensatz enthält über 1000 Städte mit mehreren Personen obwohl einige Städte nur 1 Beobachtung enthalten. Ich würde gern Einzelpersonen in jeder Stadt erhalten, deren Ausgaben, indem sie Wert verdienen, weniger als 25% Perzentil und mehr als 75% Perzentil für diese Stadt sind.

Die Ausgabe in diesem Fall ich erwarten würde:

City_ID  Indiv_ID Expenditure_by_earning  Percentile 
City_1  Indiv_1   0.11      25 
City_1  Indiv_2   0.66      75 
City_3  Indiv_7   0.06      25 
City_3  Indiv_8   0.14      25 
City_3  Indiv_6   0.8      75 
City_3  Indiv_10   0.95      75 

Hinweis: City 2 eliminiert wird.

Würde jemand mir helfen, wie dies mit Python zu erreichen? Danke.

Antwort

1
# Calculate quantiles by city (result is indexed by city) 
q25 = d1.groupby('City_ID')['Expenditure_by_earning'].quantile(.25) 
q75 = d1.groupby('City_ID')['Expenditure_by_earning'].quantile(.75) 

# Calculate Residuals Above Percentile Levels 
# (First set d1 Index on CityID (like q25/q75), allowing for direct subtraction) 
d1 = d1.set_index('City_ID') 
d1['Pct_75_resid'] = d1['Expenditure_by_earning'] - q75 
d1['Pct_25_resid'] = d1['Expenditure_by_earning'] - q25 

# Filter 
d1.query('Pct_75_resid >= 0 or Pct_25_resid <=0') 
+1

ehrfürchtig. Sehr geschätzt. – Prometheus

1
import pandas as pd 
df = pd.DataFrame(d1) 
df 

    City_ID Expenditure_by_earning Indiv_ID 
0 City_1 0.11 Indiv_1 
1 City_1 0.66 Indiv_2 
2 City_1 0.51 Indiv_3 
3 City_1 0.43 Indiv_4 
4 City_2 0.46 Indiv_5 
5 City_3 0.80 Indiv_6 
6 City_3 0.14 Indiv_7 
7 City_3 0.06 Indiv_8 
8 City_3 0.64 Indiv_9 
9 City_3 0.95 Indiv_10 

df = df.merge(df.groupby('City_ID')['Expenditure_by_earning'].quantile([0.25, 0.75]).unstack(), left_on='City_ID', right_index=True) 
df.loc[((df['Expenditure_by_earning']<=df[0.25]) | (df['Expenditure_by_earning']>=df[0.75])) & (df[0.25]!=df[0.75])] 


City_ID Expenditure_by_earning Indiv_ID 0.25 0.75 
0 City_1 0.11 Indiv_1 0.35 0.5475 
1 City_1 0.66 Indiv_2 0.35 0.5475 
5 City_3 0.80 Indiv_6 0.14 0.8000 
6 City_3 0.14 Indiv_7 0.14 0.8000 
7 City_3 0.06 Indiv_8 0.14 0.8000 
9 City_3 0.95 Indiv_10 0.14 0.8000 
Verwandte Themen