
I'm trying to build a Bayesian multivariate ordered logit model using PyMC3. I have gotten a toy multivariate logit model working based on the examples in this book. I've also gotten an ordered logistic regression model running based on the example at the bottom of this page.

However, I cannot get an ordered, multivariate logistic regression to run. I think the issue could be the way the cutpoints are specified, specifically the shape parameter, but I'm not sure why it would be different if there are multiple independent variables than if there were just one, since the number of response categories has not changed.

Here's my code:

Data prep for MWE:

import pymc3 as pm
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris(return_X_y=False)
iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
iris = iris.rename(index=str, columns={'sepal length (cm)': 'sepal_length', 'sepal width (cm)': 'sepal_width', 'target': 'species'})

Here is a working multivariate (binary) logit:

df = iris.loc[iris['species'].isin([0, 1])]
y = pd.Categorical(df['species']).codes
x = df[['sepal_length', 'sepal_width']].values
with pm.Model() as model_1:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=2, shape=x.shape[1])
    mu = alpha + pm.math.dot(x, beta)
    theta = 1 / (1 + pm.math.exp(-mu))  # inverse logit; pm.math.sigmoid(mu) is equivalent
    y_ = pm.Bernoulli('yl', p=theta, observed=y)
    trace_1 = pm.sample(5000)

Here is a working ordered logit (with one independent variable):

x = iris['sepal_length'].values
y = pd.Categorical(iris['species']).codes
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2,2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)

Here is my attempt at a multivariate ordered logit, which breaks:

x = iris[['sepal_length', 'sepal_width']].values
y = pd.Categorical(iris['species']).codes
with pm.Model() as model:
    cutpoints = pm.Normal("cutpoints", mu=[-2,2], sd=10, shape=2,
                          transform=pm.distributions.transforms.ordered)
    y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=x, observed=y)
    tr = pm.sample(1000)

The error I get is: "ValueError: all the input array dimensions except for the concatenation axis must match exactly."

This suggests it's a data problem (x, y), but the data looks the same as it does for the multivariate logit, which works.
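A quick shape check confirms the inputs look exactly like what the binary model accepted:

print(x.shape, y.shape)  # (150, 2) (150,)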

How can I fix the ordered multivariate logit so it will run?

I've never done multivariate ordinal regression before, but it seems like one must approach the modeling problem in one of two ways:

  • Partition in the predictor space, in which case you'd need cutlines/curves instead of cutpoints.
  • Partition in a transformed space where you've projected the predictor space to a scalar value and can use cutpoints again.

If you want to use pm.OrderedLogistic, it seems like you have to do the latter, since it doesn't appear to support a multivariate eta case out of the box.
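Concretely, the latter is just the standard proportional-odds setup: collapse the predictors into a single linear score and place the cutpoints on that score,

$$P(y_i \le k) = \operatorname{logit}^{-1}(c_k - \eta_i), \qquad \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}.$$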

Here's my stab at it, but again, I'm not sure this is a standard approach.

    import numpy as np
    import pymc3 as pm
    import pandas as pd
    import theano.tensor as tt
    from sklearn.datasets import load_iris
    # Load data
    iris = load_iris(return_X_y=False)
    iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=iris['feature_names'] + ['target'])
    iris = iris.rename(index=str, columns={
        'sepal length (cm)': 'sepal_length', 
        'sepal width (cm)': 'sepal_width', 
        'target': 'species'})
    # Prep input data
    Y = pd.Categorical(iris['species']).codes
    X = iris[['sepal_length', 'sepal_width']].values
    # augment X for simpler regression expression
    X_aug = tt.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
    # Model with sampling
    with pm.Model() as ordered_mvlogit:
        # regression coefficients
        beta = pm.Normal('beta', mu=0, sd=2, shape=X.shape[1] + 1)
        # transformed space (univariate real)
        eta = X_aug.dot(beta)
        # points for separating categories
        cutpoints = pm.Normal("cutpoints", mu=np.array([-1,1]), sd=1, shape=2,
                              transform=pm.distributions.transforms.ordered)
        y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)
        trace_mvordlogit = pm.sample(5000)
    

This seems to converge fine and yields decent intervals.
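You can verify this with the usual PyMC3 diagnostics, e.g.:

    pm.traceplot(trace_mvordlogit)
    pm.summary(trace_mvordlogit)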

If you then work the beta and cutpoint posterior means back to predictor space, you get the following partitioning, which looks reasonable. However, sepal length and width aren't really the best pair of features for separating the species.

    import matplotlib.pyplot as plt  # needed for the plots below
    # Extract posterior mean parameter values
    b0, b1, b2 = trace_mvordlogit.get_values(varname='beta').mean(axis=0)
    cut1, cut2 = trace_mvordlogit.get_values(varname='cutpoints').mean(axis=0)
    # plotting parameters
    x_min, y_min = X.min(axis=0)
    x_max, y_max = X.max(axis=0)
    buffer = 0.2
    num_points = 37
    # compute grid values
    x = np.linspace(x_min - buffer, x_max + buffer, num_points)
    y = np.linspace(y_min - buffer, y_max + buffer, num_points)
    X_plt, Y_plt = np.meshgrid(x, y)
    Z_plt = b0 + b1*X_plt + b2*Y_plt
    # contour + scatter plots
    plt.figure(figsize=(8,8))
    plt.contourf(X_plt, Y_plt, Z_plt, levels=[-80, cut1, cut2, 50])
    plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species)
    plt.xlabel("Sepal Length")
    plt.ylabel("Sepal Width")
    plt.show()
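The outer levels (-80 and 50) are just sentinels wide enough to bracket eta over the grid; the class boundaries themselves come only from the two cutpoints.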
    

Second-Order Terms

You could easily extend eta in the model to include interactions and higher-order terms, so that the final classifier cuts can be curves instead of simple lines. For example, here is a second-order model.

    from sklearn.preprocessing import scale
    Y = pd.Categorical(iris['species']).codes
    # scale X for better sampling
    X = scale(iris[['sepal_length', 'sepal_width']].values)
    # augment with intercept, linear, and second-order terms (6 columns)
    X_aug = tt.concatenate((
        np.ones((X.shape[0], 1)),
        X,  # linear terms; without these, beta (shape=6) would not match
        (X[:, 0] * X[:, 0]).reshape((-1, 1)),
        (X[:, 1] * X[:, 1]).reshape((-1, 1)),
        (X[:, 0] * X[:, 1]).reshape((-1, 1))), axis=1)
    with pm.Model() as ordered_mvlogit_second:
        beta = pm.Normal('beta', mu=0, sd=2, shape=6)
        eta = X_aug.dot(beta)
        cutpoints = pm.Normal("cutpoints", mu=np.array([-1,1]), sd=1, shape=2,
                              transform=pm.distributions.transforms.ordered)
        y_ = pm.OrderedLogistic("y", cutpoints=cutpoints, eta=eta, observed=Y)
        trace_mvordlogit_second = pm.sample(tune=1000, draws=5000, chains=4, cores=4)
    

This samples nicely and all the coefficients have non-zero HPDs.
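One quick check (the exact HPD column names vary a bit across PyMC3 versions):

    pm.summary(trace_mvordlogit_second)  # the beta HPD intervals should exclude zero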

And as above you can generate a plot of the classification regions.
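Here is one possible sketch of that plot, mirroring the first-order plotting code above. It swaps scale for StandardScaler, which computes the same standardization but retains the mean and standard deviation needed to map the plotting grid into the scaled space the model was fit in:

    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    # refit the scaler on the raw features; this reproduces what scale() did above
    scaler = StandardScaler().fit(iris[['sepal_length', 'sepal_width']].values)
    # posterior means (beta order: intercept, x1, x2, x1^2, x2^2, x1*x2)
    b = trace_mvordlogit_second.get_values(varname='beta').mean(axis=0)
    cut1, cut2 = trace_mvordlogit_second.get_values(varname='cutpoints').mean(axis=0)
    # grid in original units, mapped into the scaled space the model saw
    xg = np.linspace(iris.sepal_length.min() - 0.2, iris.sepal_length.max() + 0.2, 100)
    yg = np.linspace(iris.sepal_width.min() - 0.2, iris.sepal_width.max() + 0.2, 100)
    Xg, Yg = np.meshgrid(xg, yg)
    G = scaler.transform(np.c_[Xg.ravel(), Yg.ravel()])
    x1, x2 = G[:, 0], G[:, 1]
    Z = (b[0] + b[1]*x1 + b[2]*x2 + b[3]*x1**2 + b[4]*x2**2 + b[5]*x1*x2).reshape(Xg.shape)
    # contour + scatter; assumes eta spans the two cutpoints over the grid
    plt.figure(figsize=(8, 8))
    plt.contourf(Xg, Yg, Z, levels=[Z.min() - 1, cut1, cut2, Z.max() + 1])
    plt.scatter(iris.sepal_length, iris.sepal_width, c=iris.species)
    plt.xlabel("Sepal Length")
    plt.ylabel("Sepal Width")
    plt.show()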
