# Matching Markets

#### Sorting bias in endogenous groups

Ever worked with grouped data where group membership wasn’t random? Did you correct for endogenous matching? Here’s why you should.

Eh, you already know the answer? Then check out a simple correction method in (Klein 2015a) and in the documentation and vignette (Klein 2015b) of the R package matchingMarkets (Klein 2015c).

Still curious? So here’s why: The figure below plots group outcomes ($$R_{ij}$$) against characteristics ($$X_{ij}$$) for all feasible partitions of 4 agents into groups of two ($$AB$$-$$CD$$, $$AC$$-$$BD$$, $$AD$$-$$BC$$). Underlying this example is a simple linear model where the group characteristic has no effect on outcomes, and thus $$\beta=0$$ (black line). If groups are assigned at random, then all 3 partitions are equiprobable and the average slope estimate is zero. If, however, group formation is endogenous, then regression coefficients will generally be biased (red line).
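To make "all feasible partitions" concrete, here is a minimal base-R sketch that enumerates them. The agent labels and the variable names (`partitions`, `prob_random`) are mine, for illustration only; nothing here uses the matchingMarkets package.

```r
# Enumerate all ways to split 4 agents into two groups of two.
agents <- c("A", "B", "C", "D")

# Fix agent A in the first group so that each split is counted once;
# A's partner then pins down the whole partition.
partners <- agents[-1]
partitions <- unname(sapply(partners, function(p) {
  first  <- c("A", p)
  second <- setdiff(agents, first)
  paste(paste(first, collapse = ""), paste(second, collapse = ""), sep = "-")
}))
print(partitions)

# Under random assignment every partition is equally likely.
prob_random <- rep(1 / length(partitions), length(partitions))
```

With endogenous group formation the probabilities would instead depend on the agents' valuations, which is what drives the bias discussed here.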

Read on for the algebra and R code behind the linear model in this example.

#### A simple model of group diversity

Let’s start with a simple model where agent $$i$$’s valuation over agent $$j$$ is symmetric, that is $$u_{i,j}=u_{j,i}$$. The additive valuation of group $$G$$ is then given by the sum over all pairwise valuations

$$V_G = \sum_{i\in G} \sum_{j\in G\backslash i} u_{i,j}$$.
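As a small sketch of this sum, the snippet below computes $$V_G$$ from a symmetric matrix of pairwise valuations. The numbers in `u` and the helper `group_value()` are hypothetical and only serve to illustrate the formula. Note that the double sum runs over ordered pairs, so each pair of members enters twice.

```r
# Hypothetical symmetric valuations u[i, j] = u[j, i] for four agents.
agents <- c("A", "B", "C", "D")
u <- matrix(0, 4, 4, dimnames = list(agents, agents))
u["A", "B"] <- 1.0; u["A", "C"] <- 0.5; u["A", "D"] <- -0.2
u["B", "C"] <- 0.3; u["B", "D"] <- 0.8; u["C", "D"] <- 0.4
u <- u + t(u)  # mirror the upper triangle so that u is symmetric

# V_G: sum of u[i, j] over all ordered pairs (i, j) in G with i != j.
group_value <- function(G, u) {
  sub <- u[G, G, drop = FALSE]
  sum(sub) - sum(diag(sub))   # drop the i == j terms
}

V_AB  <- group_value(c("A", "B"), u)        # u_AB + u_BA
V_ABC <- group_value(c("A", "B", "C"), u)   # twice the three pairwise values
```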

Let group valuation $$V_G$$ and group outcome $$R_G$$ depend on a group’s diversity $$X$$ and some error term. In a population with two types, group diversity can be measured, inversely, by the probability that two group members are of the same type, so a higher $$X_G$$ means a less diverse group. Let’s further assume that everybody prefers to work with members of their own type ($$\alpha=+1$$) but diversified teams are more successful ($$\beta=-1$$).

$$V_G = \alpha\cdot X_G + \eta_G$$

$$R_G = \beta\cdot X_G + \delta\eta_G + \xi_G$$

Here, $$\eta_G$$ is a group’s unobserved group valuation, which also affects the outcome equation for $$\delta \neq 0$$. Let $$\eta$$ capture group members’ individual abilities, which have a positive effect $$\delta = 0.5$$ on outcomes. Finally, the error term $$\xi_G$$ contains random shocks that affect group outcomes but are unknown at group formation.

The endogeneity problem kicks in whenever (i) groups are not formed at random and thus $$cov(X_G, \eta_G)\neq 0$$ and (ii) unobserved group characteristics affect group outcomes, $$\delta \neq 0$$. It is resolved when we can control for the unobservable $$\eta$$ in the outcome equation.
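Before turning to the package, the two conditions can be sketched in a toy base-R simulation. This is a deliberately simplified stand-in, not the group formation game used in matchingMarkets: I assume `K` candidate groups per market with a uniformly distributed characteristic, and only the candidate with the highest valuation is observed. All names and sizes (`M`, `K`, `obs`) are illustrative assumptions, and the sign and size of the bias need not match the package example below because the selection rule differs.

```r
set.seed(1)
M <- 5000                     # number of markets (illustrative)
K <- 5                        # candidate groups per market (illustrative)
alpha <- 1; beta <- -1; delta <- 0.5

market <- rep(seq_len(M), each = K)
X   <- runif(M * K)           # observed group characteristic
eta <- rnorm(M * K)           # unobserved group valuation
V   <- alpha * X + eta        # group valuation

# Simplified selection rule: only the highest-valued candidate forms.
D <- ave(V, market, FUN = function(v) as.numeric(v == max(v)))

# Outcomes are only observed for the selected groups.
obs <- data.frame(X = X[D == 1], eta = eta[D == 1])
obs$R <- beta * obs$X + delta * obs$eta + rnorm(nrow(obs))

cor_X_eta <- cor(obs$X, obs$eta)                   # (i) selection links X and eta
naive     <- coef(lm(R ~ X, data = obs))           # omits eta
corrected <- coef(lm(R ~ X + eta, data = obs))     # controls for eta
```

In this sketch the naive slope drifts away from $$\beta=-1$$, while the regression that controls for `eta` should land close to $$\beta=-1$$ and $$\delta=0.5$$.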

#### An example using simulated data

To illustrate, let us simulate the bias from endogenous group formation and consider the solution implemented in the matchingMarkets package. I proceed in three steps: generation of individual-level data, transformation to group-level variables and outcomes and, finally, comparison of OLS and the correction method presented in (Klein 2015a).

##### Individual-level data

The stabsim function simulates individual-level, independent variables. The code below generates data for m=1,000 markets with gpm=2 groups per market and group size ind=5.

## Simulate individual-level, independent variables
library(matchingMarkets)
idata <- stabsim(m=1000, ind=5, seed=123, gpm=2)
head(idata)
##   m.id g.id wst  R
## 1    1    1   0 NA
## 2    1    1   1 NA
## 3    1    1   0 NA
## 4    1    1   1 NA
## 5    1    1   1 NA
## 6    1    2   0 NA

The resulting data contains market and group identifiers m.id and g.id and the independent variable wst $$\sim B(1, 0.5)$$. The dependent variable R depends on the error terms and is still undefined at this stage.
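For readers without the package at hand, the layout of this data can be mimicked in base R. The sketch below is a stand-in under my own assumptions, not the internals of stabsim, and its random draws will not reproduce the values printed above. The object name `sketch` and the small market count are illustrative.

```r
set.seed(123)
m <- 3; gpm <- 2; ind <- 5    # small, illustrative sizes

sketch <- data.frame(
  m.id = rep(seq_len(m), each = gpm * ind),               # market identifier
  g.id = rep(rep(seq_len(gpm), each = ind), times = m),   # group within market
  wst  = rbinom(m * gpm * ind, size = 1, prob = 0.5),     # wst ~ B(1, 0.5)
  R    = NA                                               # outcome, not yet defined
)
head(sketch)
```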

##### Group-level data

Next we apply the function stabit that serves three purposes:

• First, it specifies the list of variables to be included in selection and outcome equations and generates group-level variables based on group members’ individual characteristics. For example, the operation ieq="wst" produces the probability that two randomly drawn group members have the same value of wst.
• Second, if simulation="NTU", it draws standard normal, group-level unobservables eta and xi to enter selection and outcome equation and selects equilibrium groups based on the group formation game with non-transferable utility, assuming pairwise aligned preferences as in (Klein 2015a). In the case of two groups per market, this selection rule results in one dominant group with the maximum group valuation and one group comprised of the residual agents.
• Third, the argument method="model.frame" specifies that only the group-level model matrices be generated. Other options are estimators using "NTU" for selection correction using non-transferable utility matching as selection rule or "outcome" for estimation of the outcome equation only.

## Simulate group-level variables (takes a minute to complete...)
mdata <- stabit(x=idata, simulation="NTU", method="model.frame",
                selection = list(ieq="wst"),
                outcome   = list(ieq="wst"))$model.frame

The resulting object mdata is a list containing data for selection and outcome equations in SEL and OUT, respectively. SEL contains 252,000 rows, one for each of the $${10 \choose 5}$$ = 252 feasible groups in each of the 1,000 markets. A group’s valuation is given by V = +1*wst.ieq + eta. The variable D indicates which groups are observed in equilibrium (D=1) and which are not (D=0).

head(mdata$SEL, 4)
##   m.id g.id wst.ieq D          V        eta
## 1    1    1     0.4 1 2.97145815  2.5714581
## 2    1    2     0.4 1 1.11284232  0.7128423
## 3    1    3     0.4 0 0.05608277 -0.3439172
## 4    1    4     0.6 0 2.19850877  1.5985088
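Two of the numbers above can be checked with a few lines of base R. The helper `ieq()` below is my own hypothetical reimplementation of the "same value" probability, assuming two members are drawn without replacement; it is not the package's code. A group with three members of one type and two of the other then gives 0.4, one of the values in the printed wst.ieq column.

```r
# Hypothetical helper (not part of matchingMarkets): share of member pairs
# with the same value of x, i.e. the probability that two members drawn
# without replacement are equal.
ieq <- function(x) {
  pairs <- combn(x, 2)               # all unordered pairs of members
  mean(pairs[1, ] == pairs[2, ])
}

example_ieq <- ieq(c(0, 1, 0, 1, 1)) # three of one type, two of the other

# Number of ways to pick 5 of the 10 agents in a market, and the implied
# number of rows in SEL for 1,000 markets.
n_groups <- choose(10, 5)
n_rows   <- n_groups * 1000
```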

The outcome data in OUT contains 2,000 rows, one for each of 2 equilibrium groups per market. The group outcome is given by R = -1*wst.ieq + epsilon with epsilon := +0.5*eta + xi.

head(mdata$OUT, 4)
##   m.id g.id intercept wst.ieq           R         xi    epsilon
## 1    1    1         1     0.4 -0.78221286 -1.6679419 -0.3822129
## 2    1    2         1     0.4  0.17095999  0.2145388  0.5709600
## 3    2    1         1     0.6 -0.57408090 -1.3165104  0.0259191
## 4    2    2         1     0.6 -0.08277463  1.4414618  0.5172254

##### Bias from sorting

The bias in the slope estimate $$\hat\beta-\beta$$ = -0.44 - (-1) = 0.56 is illustrated in the left panel of the figure below.

## Naive OLS estimation
lm(R ~ wst.ieq, data=mdata$OUT)$coefficients
## (Intercept)     wst.ieq
##   0.3961386  -0.4372392

The source of this bias is the positive correlation between epsilon and the independent variable wst.ieq (see the right panel below).

## epsilon is correlated with independent variables
summary(lm(epsilon ~ wst.ieq, mdata$OUT))$coefficients
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 0.3961386 0.08758434 4.522939 6.456494e-06
## wst.ieq     0.5627608 0.15889272 3.541766 4.065683e-04

An analytical example and a formal treatment of this bias are available in (Klein 2015a). We know that epsilon = 0.5*eta + xi. Thus, conditional on eta, the unobservables in the outcome equation are independent of the exogenous variables (because xi does not enter the selection equation).

## xi is uncorrelated with independent variables
summary(lm(xi ~ wst.ieq, mdata$OUT))$coefficients
##                Estimate Std. Error    t value  Pr(>|t|)
## (Intercept) -0.03837509 0.06827652 -0.5620540 0.5741423
## wst.ieq      0.04147217 0.12386508  0.3348173 0.7377980

##### Correction of sorting bias

The selection problem is resolved when the residual from the selection equation, eta, is controlled for in the outcome equation.

## 1st stage: obtain eta as the residual of the selection equation
lm.sel <- lm(V ~ -1 + wst.ieq, data=mdata$SEL); lm.sel$coefficients
##  wst.ieq
## 1.004501
eta <- lm.sel$resid[mdata$SEL$D==1]
## 2nd stage: control for eta
lm(R ~ wst.ieq + eta, data=mdata$OUT)$coefficients
## (Intercept)     wst.ieq         eta
## -0.03858257 -0.96578534  0.50366230
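As a quick consistency check on the printed estimates: since R = -1*wst.ieq + epsilon holds by construction, the naive OLS slope equals the true slope plus the slope from regressing epsilon on wst.ieq. Writing $$X$$ for wst.ieq and plugging in the coefficients reported above:

$$\hat\beta_{OLS} = \beta + \frac{\widehat{cov}(X, \epsilon)}{\widehat{var}(X)} = -1 + 0.5627608 = -0.4372392$$

This reproduces the naive estimate exactly, whereas the second-stage slope of about -0.97 is close to the true $$\beta=-1$$ once eta is controlled for.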

The figure below plots the bias from sorting against the independent variable, for the naive OLS and the selection-correction from the structural model. In most real-world applications, however, the match valuations V are unobserved. The solution is to estimate the selection equation by imposing equilibrium bounds, as derived in (Klein 2015b).