Matching Markets
Sorting bias in endogenous groups
Ever worked with grouped data where group membership wasn’t random? Did you correct for endogenous matching? Here’s why you should.
Eh, you already know the answer? Then check out a simple correction method in (Klein 2015a) and the documentation and vignette (Klein 2015b) for the R package matchingMarkets (Klein 2015c).
Still curious? Here’s why. The figure below plots group outcomes (\(R_{ij}\)) against characteristics (\(X_{ij}\)) for all feasible partitions of 4 agents into groups of two (\(AB\)-\(CD\), \(AC\)-\(BD\), \(AD\)-\(BC\)). Underlying this example is a simple linear model where the group characteristic has no effect on outcomes, and thus \(\beta=0\) (black line). If groups are assigned at random, then all 3 partitions are equiprobable and the average slope estimate is zero. If, however, group formation is endogenous, then regression coefficients will generally be biased (red line).
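As a quick check on the count of feasible partitions, the three ways of splitting four agents into two pairs can be enumerated in a few lines of base R. This is only an illustrative sketch; the agent labels are the ones used in the text, and fixing agent A in the first group is a device to avoid listing each partition twice.

```r
# Enumerate all ways to split four agents into two groups of two.
agents <- c("A", "B", "C", "D")

# Fix agent A in the first group so that each partition is listed once.
partners <- setdiff(agents, "A")
partitions <- sapply(partners, function(p) {
  first  <- paste0("A", p)                                  # A's group
  second <- paste(setdiff(partners, p), collapse = "")      # remaining pair
  paste(first, second, sep = "-")
})

print(unname(partitions))
# "AB-CD" "AC-BD" "AD-BC"
```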
Read on for the algebra and R code behind the linear model in this example.
A simple model of group diversity
Let’s start with a simple model where agent \(i\)’s valuation over agent \(j\) is symmetric, that is \(u_{i,j}=u_{j,i}\). The additive valuation of group \(G\) is then given by the sum over all pairwise valuations
\(V_G = \sum_{i\in G} \sum_{j\in G\backslash i} u_{i,j}\).
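To make the double sum concrete, here is a minimal base R sketch. The valuation matrix `u` and the helper `group_value` are hypothetical and exist only for illustration; the numbers are made up.

```r
# Hypothetical symmetric pairwise valuations, u[i, j] == u[j, i].
u <- matrix(c(0, 2, 1, 3,
              2, 0, 4, 1,
              1, 4, 0, 2,
              3, 1, 2, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# V_G sums u[i, j] over all ordered pairs i != j within group G.
group_value <- function(u, members) {
  sub <- u[members, members, drop = FALSE]
  sum(sub) - sum(diag(sub))   # drop the i == j terms
}

group_value(u, c("A", "B"))       # 2 + 2 = 4
group_value(u, c("A", "B", "C"))  # 2 * (2 + 1 + 4) = 14
```

Because valuations are symmetric, every pair enters the sum twice, once from each member's point of view.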
Let group valuation \(V_G\) and group outcome \(R_G\) depend on a group’s diversity \(X\) and some error term. In a population with two types, \(X\) can be measured as the probability that two group members are of the same type, so a higher \(X\) means a less diverse group. Let’s further assume that everybody prefers to work with members of their own type (\(\alpha=+1\)) but diversified teams are more successful (\(\beta=-1\)).
\(V_G = \alpha\cdot X_G + \eta_G\)
\(R_G = \beta\cdot X_G + \delta\eta_G + \xi_G\)
Here, \(\eta_G\) is a group’s unobserved group valuation, which also affects the outcome equation for \(\delta \neq 0\). Let \(\eta\) capture group members’ individual abilities, which have a positive effect \(\delta = 0.5\) on outcomes. Finally, the error term \(\xi_G\) contains random shocks that affect group outcomes but are unknown at group formation.
The endogeneity problem kicks in whenever (i) groups are not formed at random and thus \(cov(X_G, \eta_G)\neq 0\) and (ii) unobserved group characteristics affect group outcomes, \(\delta \neq 0\). It is resolved when we can control for the unobservable \(\eta\) in the outcome equation.
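Both conditions can be read off the probability limit of the naive OLS slope. The short derivation below is a sketch that assumes the shocks \(\xi_G\) are uncorrelated with \(X_G\) (they are unknown at group formation) and that all moments are taken over the groups observed in equilibrium:

\(\hat\beta_{OLS} \;\to\; \frac{cov(X_G, R_G)}{var(X_G)} = \frac{cov(X_G,\; \beta X_G + \delta\eta_G + \xi_G)}{var(X_G)} = \beta + \delta\cdot\frac{cov(X_G, \eta_G)}{var(X_G)}\)

The second term is the sorting bias. It vanishes only if \(cov(X_G, \eta_G) = 0\) (random group formation) or \(\delta = 0\) (unobservables do not affect outcomes).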
An example using simulated data
To illustrate, let us simulate the bias from endogenous group formation and consider the solution implemented in the matchingMarkets package. I proceed in three steps: generation of individual-level data, transformation to group-level variables and outcomes and, finally, comparison of OLS and the correction method presented in (Klein 2015a).
Individual-level data
The stabsim function simulates individual-level, independent variables. The code below generates data for m=1,000 markets with gpm=2 groups per market and group size ind=5.
## Simulate individual-level, independent variables
library(matchingMarkets)
idata <- stabsim(m=1000, ind=5, seed=123, gpm=2)
head(idata)
## m.id g.id wst R
## 1 1 1 0 NA
## 2 1 1 1 NA
## 3 1 1 0 NA
## 4 1 1 1 NA
## 5 1 1 1 NA
## 6 1 2 0 NA
The resulting data contains market and group identifiers m.id and g.id and the independent variable wst \(\sim\) B(1,0.5). The dependent variable R depends on the error terms and is still undefined at this stage.
Group-level data
Next we apply the function stabit that serves three purposes:
- First, it specifies the list of variables to be included in selection and outcome equations and generates group-level variables based on group members’ individual characteristics. For example, the operation ieq="wst" produces the probability that two randomly drawn group members have the same value of wst.
- Second, if simulation="NTU", it draws standard normal, group-level unobservables eta and xi to enter selection and outcome equation and selects equilibrium groups based on the group formation game with non-transferable utility, assuming pairwise aligned preferences as in (Klein 2015a). In the case of two groups per market, this selection rule results in one dominant group with the maximum group valuation and one group comprised of the residual agents.
- Third, the argument method="model.frame" specifies that only the group-level model matrices be generated. Other options are "NTU", which estimates the selection-corrected model with non-transferable utility matching as the selection rule, and "outcome", which estimates the outcome equation only.
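The ieq operation from the first point can be sketched in base R as follows. The helper `same_value_prob` is hypothetical and not the package's internal code. It assumes the two members are drawn without replacement, which appears consistent with the wst.ieq values of 0.4 and 0.6 in the output further down.

```r
# Probability that two distinct, randomly drawn group members share a value.
# Hypothetical helper for illustration; not taken from matchingMarkets.
same_value_prob <- function(x) {
  pairs <- combn(x, 2)               # all unordered pairs of members
  mean(pairs[1, ] == pairs[2, ])     # share of pairs with equal values
}

same_value_prob(c(0, 1, 0, 1, 1))    # 4 of 10 pairs match: 0.4
same_value_prob(c(1, 1, 1, 1, 0))    # 6 of 10 pairs match: 0.6
```

With five members and a binary variable, the only possible values are 0.4, 0.6 and 1.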
## Simulate group-level variables (takes a minute to complete...)
mdata <- stabit(x=idata, simulation="NTU", method="model.frame",
selection = list(ieq="wst"),
outcome = list(ieq="wst"))$model.frame
The resulting object mdata is a list containing data for selection and outcome equations in SEL and OUT, respectively. SEL contains 252,000 rows, one for each of the \({10 \choose 5}\) = 252 feasible groups in each of the 1,000 markets. A group’s valuation is given by V = +1*wst.ieq + eta. The variable D indicates which groups are observed in equilibrium (D=1) and which are not (D=0).
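For intuition, the selection rule for a single market with two groups can be mimicked in base R. This is a simplified sketch of the rule as described in the text (one dominant group with the maximum valuation plus the group of residual agents), not the package's implementation. The seed and all object names are arbitrary.

```r
set.seed(1)                               # arbitrary seed
wst    <- rbinom(10, 1, 0.5)              # types of the 10 agents in one market
groups <- combn(10, 5)                    # one feasible group per column
ncol(groups)                              # choose(10, 5) = 252

# Share of member pairs with equal wst (the group-level regressor).
pair_share <- function(x) {
  pairs <- combn(x, 2)
  mean(pairs[1, ] == pairs[2, ])
}
X   <- apply(groups, 2, function(g) pair_share(wst[g]))
eta <- rnorm(ncol(groups))                # group-level unobservable
V   <- 1 * X + eta                        # valuation with alpha = +1

# Dominant group: highest valuation. Residual group: everyone else.
dominant <- which.max(V)
rest     <- setdiff(1:10, groups[, dominant])
residual <- which(apply(groups, 2, function(g) all(g == rest)))

# D = 1 for the two equilibrium groups, D = 0 for the other 250.
D <- as.integer(seq_len(ncol(groups)) %in% c(dominant, residual))
table(D)
```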
head(mdata$SEL, 4)
## m.id g.id wst.ieq D V eta
## 1 1 1 0.4 1 2.97145815 2.5714581
## 2 1 2 0.4 1 1.11284232 0.7128423
## 3 1 3 0.4 0 0.05608277 -0.3439172
## 4 1 4 0.6 0 2.19850877 1.5985088
The outcome data in OUT contains 2,000 rows, one for each of 2 equilibrium groups per market. The group outcome is given by R = -1*wst.ieq + epsilon with epsilon := +0.5*eta + xi.
head(mdata$OUT, 4)
## m.id g.id intercept wst.ieq R xi epsilon
## 1 1 1 1 0.4 -0.78221286 -1.6679419 -0.3822129
## 2 1 2 1 0.4 0.17095999 0.2145388 0.5709600
## 3 2 1 1 0.6 -0.57408090 -1.3165104 0.0259191
## 4 2 2 1 0.6 -0.08277463 1.4414618 0.5172254
Bias from sorting
The bias in the slope estimate \(\hat\beta-\beta\) = -0.44 - (-1) = 0.56 is illustrated in the left panel of the figure below.
## Naive OLS estimation
lm(R ~ wst.ieq, data=mdata$OUT)$coefficients
## (Intercept) wst.ieq
## 0.3961386 -0.4372392
The source of this bias is the positive correlation between epsilon and the exogenous variable wst.ieq (see the right panel below).
## epsilon is correlated with independent variables
summary(lm(epsilon ~ wst.ieq, mdata$OUT))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3961386 0.08758434 4.522939 6.456494e-06
## wst.ieq 0.5627608 0.15889272 3.541766 4.065683e-04
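Since R = -1*wst.ieq + epsilon, the naive slope is mechanically the true slope plus the slope of epsilon on wst.ieq. The snippet below simply re-checks that arithmetic with the coefficients reported in the two regressions above; the variable names are chosen for illustration.

```r
beta      <- -1           # true slope in the outcome equation
slope_R   <- -0.4372392   # naive OLS slope of R on wst.ieq (reported above)
slope_eps <-  0.5627608   # OLS slope of epsilon on wst.ieq (reported above)

# The bias of the naive estimate equals the slope of epsilon on wst.ieq.
bias <- slope_R - beta
all.equal(bias, slope_eps)   # TRUE
```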
An analytical example and a formal treatment of this bias are available in (Klein 2015a). We know that epsilon = 0.5*eta + xi. Thus, conditional on eta, the unobservables in the outcome equation are independent of the exogenous variables (because xi does not enter the selection equation).
## xi is uncorrelated with independent variables
summary(lm(xi ~ wst.ieq, mdata$OUT))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03837509 0.06827652 -0.5620540 0.5741423
## wst.ieq 0.04147217 0.12386508 0.3348173 0.7377980
Correction of sorting bias
The selection problem is resolved when the residual from the selection equation eta is controlled for in the outcome equation.
## 1st stage: obtain fitted value for eta
lm.sel <- lm(V ~ -1 + wst.ieq, data=mdata$SEL); lm.sel$coefficients
## wst.ieq
## 1.004501
eta <- residuals(lm.sel)[mdata$SEL$D==1]
## 2nd stage: control for eta
lm(R ~ wst.ieq + eta, data=mdata$OUT)$coefficients
## (Intercept) wst.ieq eta
## -0.03858257 -0.96578534 0.50366230
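The same two-stage logic can be reproduced without the package. The base R sketch below assumes the data-generating process described in this post (\(\alpha=+1\), \(\beta=-1\), \(\delta=0.5\), two groups of five per market, dominant-plus-residual selection). It is an illustration under those assumptions, not the package's code; the seed and object names are arbitrary and the numbers will differ from the output above.

```r
set.seed(42)                                  # arbitrary seed
n_markets <- 1000
groups    <- combn(10, 5)                     # 252 feasible groups per market
n_groups  <- ncol(groups)

# Column index of each group's complement (the residual group).
complement <- apply(groups, 2, function(g) {
  which(colSums(matrix(groups %in% g, nrow = 5)) == 0)
})

X <- eta <- matrix(NA_real_, n_groups, n_markets)
D <- matrix(0L, n_groups, n_markets)
for (m in seq_len(n_markets)) {
  wst <- rbinom(10, 1, 0.5)                   # agent types in market m
  k   <- colSums(matrix(wst[groups], nrow = 5))  # members with wst = 1
  # Share of member pairs with equal wst (drawn without replacement).
  X[, m]   <- (choose(k, 2) + choose(5 - k, 2)) / choose(5, 2)
  eta[, m] <- rnorm(n_groups)
  dominant <- which.max(X[, m] + eta[, m])    # highest valuation, alpha = +1
  D[c(dominant, complement[dominant]), m] <- 1L
}
sel   <- data.frame(X = as.vector(X), eta = as.vector(eta), D = as.vector(D))
sel$V <- 1 * sel$X + sel$eta

# Outcomes are only defined for equilibrium groups: beta = -1, delta = 0.5.
out   <- sel[sel$D == 1, ]
out$R <- -1 * out$X + 0.5 * out$eta + rnorm(nrow(out))

# Naive OLS ignores sorting.
naive <- coef(lm(R ~ X, data = out))[["X"]]

# 1st stage: residual of V on X; 2nd stage: control for it.
first       <- lm(V ~ -1 + X, data = sel)
out$eta_hat <- residuals(first)[sel$D == 1]
corrected   <- coef(lm(R ~ X + eta_hat, data = out))[["X"]]

c(naive = naive, corrected = corrected)
```

Up to sampling noise, the corrected slope should land near the true value of -1, while the naive slope should be pulled towards zero.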
The figure below plots the bias from sorting against the independent variable, for the naive OLS and the selection-correction from the structural model.
In most real-world applications, however, the match valuations V are unobserved. The solution is to estimate the selection equation by imposing equilibrium bounds, as derived in (Klein 2015b).
References
Klein, T. 2015a. Does anti-diversification pay? A one-sided matching model of microcredit. Cambridge Working Papers in Economics 1521. Faculty of Economics, University of Cambridge.
———. 2015b. Analysis of stable matchings in R: Package matchingMarkets. Vignette to R package matchingMarkets. The Comprehensive R Archive Network.
———. 2015c. matchingMarkets: Analysis of stable matchings. R package version 0.1-7. The Comprehensive R Archive Network.