2017-01-17 6 views
1

Ich habe gerade angefangen durch die PyMC3 documentation Lesen (ich bin viel bequemer mit sklearn) und kam across Rugby hierarchical model example:PyMC3: Hierarchisches Rugby-Modell posterior?

# Imports and Rugby data setup -- model in next section 

import numpy as np 
import pandas as pd 
import pymc3 as pm 
import theano.tensor as tt 
import matplotlib.pyplot as plt 
import seaborn as sns 

games = [ 
    ['Wales', 'Italy', 23, 15], 
    ['France', 'England', 26, 24], 
    ['Ireland', 'Scotland', 28, 6], 
    ['Ireland', 'Wales', 26, 3], 
    ['Scotland', 'England', 0, 20], 
    ['France', 'Italy', 30, 10], 
    ['Wales', 'France', 27, 6], 
    ['Italy', 'Scotland', 20, 21], 
    ['England', 'Ireland', 13, 10], 
    ['Ireland', 'Italy', 46, 7], 
    ['Scotland', 'France', 17, 19], 
    ['England', 'Wales', 29, 18], 
    ['Italy', 'England', 11, 52], 
    ['Wales', 'Scotland', 51, 3], 
    ['France', 'Ireland', 20, 22], 
] 
columns = ['home_team', 'away_team', 'home_score', 'away_score'] 
df = pd.DataFrame(games, columns=columns) 

teams = df.home_team.unique() 
teams = pd.DataFrame(teams, columns=['team']) 
teams['i'] = teams.index 

df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left') 
df = df.rename(columns = {'i': 'i_home'}).drop('team', 1) 
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left') 
df = df.rename(columns = {'i': 'i_away'}).drop('team', 1) 

observed_home_goals = df.home_score.values 
observed_away_goals = df.away_score.values 

home_team = df.i_home.values 
away_team = df.i_away.values 

num_teams = len(df.i_home.drop_duplicates()) 
num_games = len(home_team) 

g = df.groupby('i_away') 
att_starting_points = np.log(g.away_score.mean()) 
g = df.groupby('i_home') 
def_starting_points = -np.log(g.away_score.mean()) 

Hier ist das Haupt PyMC3 Modell Setup:

with pm.Model() as model: 
    # Global model parameters 
    home = pm.Normal('home', 0, tau=.0001) 
    tau_att = pm.Gamma('tau_att', .1, .1) 
    tau_def = pm.Gamma('tau_def', .1, .1) 
    intercept = pm.Normal('intercept', 0, tau=.0001) 

    # Team-specific model parameters 
    atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams) 
    defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams) 

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star)) 
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star)) 
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team]) 
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team]) 

    # Likelihood of observed data 
    home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals) 
    away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals) 

    start = pm.find_MAP() 
    step = pm.NUTS(state=start) 
    trace = pm.sample(20000, step, init=start) 

Ich weiß, wie man die trace:

pm.traceplot(trace[5000:]) 

Und Gen plotten posterior predictive samples rate:

ppc = pm.sample_ppc(trace[5000:], samples=500, model=model) 

Was ich bin nicht sicher: Wie stelle ich Fragen des Modells/posterior?

Zum Beispiel gehe ich davon aus, die Verteilung der Noten für Wales vs Italy matchups wäre:

# Wales vs Italy is the first matchup in our dataset 
home_wales = ppc['home_points'][:, 0] 
away_italy = ppc['away_points'][:, 0] 

Aber was Matchups, die in den ursprünglichen Daten nicht aufgezeichnet werden?

  • Wenn Italien zu Hause gegen Frankreich spielt, wie sieht die Punkteverteilung aus?
  • Wenn Italien zu Hause gegen Frankreich spielt, wie oft erzielen beide Mannschaften unter 15?

Danke für jede Hilfe/Einblicke.

Antwort

1

Ich bin ziemlich sicher, dass ich in der Lage war, dies nach dem Lesen durch die PyMC3 Hierarchical Partial Pooling example herauszufinden. Die Beantwortung der Fragen um:

  1. Ja, das ist es, was die Verteilung für Wales vs Italy matchups wäre (da es das erste Spiel in den beobachteten Daten ist).

  2. Um vorauszusagen (da die beiden Teams in unserem ursprünglichen Datensatz nicht gegeneinander gespielt haben), würden wir Vorhersage thetas benötigen.

Hier ist der Code zu dem aktualisierten Modell:

# Setup the model similarly to the previous one... 
with pm.Model() as model: 
    # Global model parameters 
    home = pm.Normal('home', 0, tau=.0001) 
    tau_att = pm.Gamma('tau_att', .1, .1) 
    tau_def = pm.Gamma('tau_def', .1, .1) 
    intercept = pm.Normal('intercept', 0, tau=.0001) 

    # Team-specific model parameters 
    atts_star = pm.Normal('atts_star', mu=0, tau=tau_att, shape=num_teams) 
    defs_star = pm.Normal('defs_star', mu=0, tau=tau_def, shape=num_teams) 

    atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star)) 
    defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star)) 
    home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team]) 
    away_theta = tt.exp(intercept + atts[away_team] + defs[home_team]) 

    # Likelihood of observed data 
    home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_goals) 
    away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_goals) 

# Now for predictions with no games played... 
with model: 
    # IDs from `teams` DataFrame 
    italy, france = 4, 1 
    # New `thetas` for Italy vs France predictions 
    pred_home_theta = tt.exp(intercept + home + atts[italy] + defs[france]) 
    pred_away_theta = tt.exp(intercept + atts[france] + defs[italy]) 
    pred_home_points = pm.Poisson('pred_home_points', mu=pred_home_theta) 
    pred_away_points = pm.Poisson('pred_away_points', mu=pred_away_theta) 

# Sample the final model 
with model: 
    start = pm.find_MAP() 
    step = pm.NUTS(state=start) 
    trace = pm.sample(20000, step, init=start) 

Sobald die trace abgeschlossen ist, können wir die Vorhersagen zeichnen:

# Use 5,000 as MCMC burn in 
pred = pd.DataFrame({ 
    "italy": trace["pred_home_points"][5000:], 
    "france": trace["pred_away_points"][5000:], 
}) 
# Plot the distributions 
sns.kdeplot(pred.italy, shade=True, label="Italy") 
sns.kdeplot(pred.france, shade=True, label="France") 
plt.show() 

Italy vs France Rugby distributions

Wie oft wird Italien zu Hause gewinnen?

# 19% of the time 
(pred.italy > pred.france).mean() 

Wie oft beide Teams unter 15 punkten?

# 0.7% of the time 
1.0 * len(pred[(pred.italy <= 15) & (pred.france <= 15)])/len(pred) 
+1

Das sieht gut aus für mich. Warum fügst du es nicht zur Dokumentation von PyMC3 hinzu? –

Verwandte Themen