Fitting models using R-style formulas (2024)

Since version 0.5.0, statsmodels allows users to fit statisticalmodels using R-style formulas. Internally, statsmodels uses thepatsy package to convert formulas anddata to the matrices that are used in model fitting. The formulaframework is quite powerful; this tutorial only scratches the surface. Afull description of the formula language can be found in the patsydocs:

Loading modules and functions

In [1]: import statsmodels.api as smIn [2]: import statsmodels.formula.api as smfIn [3]: import numpy as npIn [4]: import pandas

Notice that we called statsmodels.formula.api in addition to the usualstatsmodels.api. In fact, statsmodels.api is used here only to loadthe dataset. The formula.api hosts many of the samefunctions found in api (e.g. OLS, GLM), but it also holds lower casecounterparts for most of these models. In general, lower case modelsaccept formula and df arguments, whereas upper case ones takeendog and exog design matrices. formula accepts a stringwhich describes the model in terms of a patsy formula. df takesa pandas data frame.

dir(smf) will print a list of available models.

Formula-compatible models have the following generic call signature:(formula, data, subset=None, *args, **kwargs)

OLS regression using formulas

To begin, we fit the linear model described on the GettingStarted page. Download the data, subset columns,and list-wise delete to remove missing observations:

In [5]: df = sm.datasets.get_rdataset("Guerry", "HistData").dataIn [6]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()In [7]: df.head()Out[7]:  Lottery Literacy Wealth Region0 41 37 73 E1 38 51 22 N2 66 13 61 C3 80 46 76 E4 79 69 83 E

Fit the model:

In [8]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)In [9]: res = mod.fit()In [10]: print(res.summary()) OLS Regression Results ==============================================================================Dep. Variable: Lottery R-squared: 0.338Model: OLS Adj. R-squared: 0.287Method: Least Squares F-statistic: 6.636Date: Thu, 14 Dec 2023 Prob (F-statistic): 1.07e-05Time: 14:49:34 Log-Likelihood: -375.30No. Observations: 85 AIC: 764.6Df Residuals: 78 BIC: 781.7Df Model: 6 Covariance Type: nonrobust =============================================================================== coef std err t P>|t| [0.025 0.975]-------------------------------------------------------------------------------Intercept 38.6517 9.456 4.087 0.000 19.826 57.478Region[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938Region[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419Region[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943Region[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232Wealth 0.4515 0.103 4.390 0.000 0.247 0.656==============================================================================Omnibus: 3.049 Durbin-Watson: 1.785Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694Skew: -0.340 Prob(JB): 0.260Kurtosis: 2.454 Cond. No. 371.==============================================================================Notes:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Categorical variables

Looking at the summary printed above, notice that patsy determinedthat elements of Region were text strings, so it treated Region as acategorical variable. patsy’s default is also to include anintercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treatexplicitly as categorical, we could have done so by using the C()operator:

In [11]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()In [12]: print(res.params)Intercept 38.651655C(Region)[T.E] -15.427785C(Region)[T.N] -10.016961C(Region)[T.S] -4.548257C(Region)[T.W] -10.091276Literacy -0.185819Wealth 0.451475dtype: float64

Examples more advanced features patsy’s categorical variablesfunction can be found here: Patsy: Contrast Coding Systems forcategorical variables

Operators

We have already seen that “~” separates the left-hand side of the modelfrom the right-hand side, and that “+” adds new columns to the designmatrix.

Removing variables

The “-” sign can be used to remove columns/variables. For instance, wecan remove the intercept from a model by:

In [13]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()In [14]: print(res.params)C(Region)[C] 38.651655C(Region)[E] 23.223870C(Region)[N] 28.634694C(Region)[S] 34.103399C(Region)[W] 28.560379Literacy -0.185819Wealth 0.451475dtype: float64

Multiplicative interactions

“:” adds a new column to the design matrix with the product of the othertwo columns. “*” will also include the individual columns that weremultiplied together:

In [15]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()In [16]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()In [17]: print(res1.params)Literacy:Wealth 0.018176dtype: float64In [18]: print(res2.params)Literacy 0.427386Wealth 1.080987Literacy:Wealth -0.013609dtype: float64

Many other things are possible with operators. Please consult the patsydocs to learnmore.

Functions

You can apply vectorized functions to the variables in your model:

In [19]: res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()In [20]: print(res.params)Intercept 115.609119np.log(Literacy) -20.393959dtype: float64

Define a custom function:

In [21]: def log_plus_1(x): ....:  return np.log(x) + 1.0 ....: In [22]: res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()In [23]: print(res.params)Intercept 136.003079log_plus_1(Literacy) -20.393959dtype: float64

Namespaces

Notice that all of the above examples use the calling namespace to look for the functions to apply. The namespace used can be controlled via the eval_env keyword. For example, you may want to give a custom namespace using the patsy:patsy.EvalEnvironment or you may want to use a “clean” namespace, which we provide by passing eval_func=-1. The default is to use the caller’s namespace. This can have (un)expected consequences, if, for example, someone has a variable names C in the user namespace or in their data structure passed to patsy, and C is used in the formula to handle a categorical variable. See the Patsy API Reference for more information.

Using formulas with models that do not (yet) support them

Even if a given statsmodels function does not support formulas, youcan still use patsy’s formula language to produce design matrices.Those matrices can then be fed to the fitting function as endog andexog arguments.

To generate numpy arrays:

In [24]: import patsyIn [25]: f = 'Lottery ~ Literacy * Wealth'In [26]: y, X = patsy.dmatrices(f, df, return_type='matrix')In [27]: print(y[:5])[[41.] [38.] [66.] [80.] [79.]]In [28]: print(X[:5])[[ 1. 37. 73. 2701.] [ 1. 51. 22. 1122.] [ 1. 13. 61. 793.] [ 1. 46. 76. 3496.] [ 1. 69. 83. 5727.]]

y and X would be instances of patsy.DesignMatrix which is a subclass of numpy.ndarray.

To generate pandas data frames:

In [29]: f = 'Lottery ~ Literacy * Wealth'In [30]: y, X = patsy.dmatrices(f, df, return_type='dataframe')In [31]: print(y[:5]) Lottery0 41.01 38.02 66.03 80.04 79.0In [32]: print(X[:5]) Intercept Literacy Wealth Literacy:Wealth0 1.0 37.0 73.0 2701.01 1.0 51.0 22.0 1122.02 1.0 13.0 61.0 793.03 1.0 46.0 76.0 3496.04 1.0 69.0 83.0 5727.0
In [33]: print(sm.OLS(y, X).fit().summary()) OLS Regression Results ==============================================================================Dep. Variable: Lottery R-squared: 0.309Model: OLS Adj. R-squared: 0.283Method: Least Squares F-statistic: 12.06Date: Thu, 14 Dec 2023 Prob (F-statistic): 1.32e-06Time: 14:49:34 Log-Likelihood: -377.13No. Observations: 85 AIC: 762.3Df Residuals: 81 BIC: 772.0Df Model: 3 Covariance Type: nonrobust =================================================================================== coef std err t P>|t| [0.025 0.975]-----------------------------------------------------------------------------------Intercept 38.6348 15.825 2.441 0.017 7.149 70.121Literacy -0.3522 0.334 -1.056 0.294 -1.016 0.312Wealth 0.4364 0.283 1.544 0.126 -0.126 0.999Literacy:Wealth -0.0005 0.006 -0.085 0.933 -0.013 0.012==============================================================================Omnibus: 4.447 Durbin-Watson: 1.953Prob(Omnibus): 0.108 Jarque-Bera (JB): 3.228Skew: -0.332 Prob(JB): 0.199Kurtosis: 2.314 Cond. No. 1.40e+04==============================================================================Notes:[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.[2] The condition number is large, 1.4e+04. This might indicate that there arestrong multicollinearity or other numerical problems.

Last update: Dec 14, 2023

Fitting models using R-style formulas (2024)

References

Top Articles
Latest Posts
Recommended Articles
Article information

Author: Prof. An Powlowski

Last Updated:

Views: 5731

Rating: 4.3 / 5 (44 voted)

Reviews: 91% of readers found this page helpful

Author information

Name: Prof. An Powlowski

Birthday: 1992-09-29

Address: Apt. 994 8891 Orval Hill, Brittnyburgh, AZ 41023-0398

Phone: +26417467956738

Job: District Marketing Strategist

Hobby: Embroidery, Bodybuilding, Motor sports, Amateur radio, Wood carving, Whittling, Air sports

Introduction: My name is Prof. An Powlowski, I am a charming, helpful, attractive, good, graceful, thoughtful, vast person who loves writing and wants to share my knowledge and understanding with you.