Propensity Score Matching

Maestría en Economía Uninorte

Author

Carlos Andrés Yanes

Published

August 23, 2024

Introducción

Seguimos trabajando el ejemplo de Nessa, Ali, and Abdul-Hakim (2012) para el análisis del impacto Programa de Microcréditos1 realizado en Bangladesh para 1998.

1 Esto se hizo gracias al libro de Evaluación de Impacto en la Practica del Banco Mundial y la parte de R fue construida por John Woodill

Las variables son: Código hogar (nh); año (year); código de la villa (villid); Código zona (thanaid); Edad del jefe (agehead); Sexo(sexhead); Educación “en años” del jefe (educhead), tamaño de la familia (famsize); Tierra(hhland); Activos del jefe (hhasset); Gasto anual en comida (expfd); gasto anual en elementos distintos a la comida (expnfd); gasto total per capita al año (exptot); Si el hombre tiene un microcredito (dmmfd); Si la mujer tiene un microcrédito (dfmfd); Peso muestral (weight); Acceso del pueblo a una carretera (vaccess); proporcion de la tierra que ha sido regado (pcirr); Precios de alimentos como aceite, arroz, trigo, leche, papa, huevos.

Preparación del entorno de trabajo

Cerciorese primero de tener un entorno adecuado de paquetes para trabajar el archivo o base de datos. Si debe instalar alguno no olvide utilizar p.e:

install.packages(c("pacman", "survey" "haven", "tidyverse", "huxtable"))

Despues que culmine la instalación, carguelos con la opción de library()

library(pacman)
p_load(survey, foreign, haven, tidyverse, huxtable, dplyr, Matching, MatchIt, ggplot2)

Sigue el cargue de los datos, debe cerciorarse que esten en su carpeta de trabajo

hh_98 <- read_dta("hh_98.dta")
head(hh_98) # Encabezado
nhyearvillidthanaidageheadsexheadeducheadfamsizehhlandhhassetexpfdexpnfdexptotdmmfddfmfdweightvaccesspcirrricewheatmilkpotatoeggoil
1.11e+0411179102363.33e+048.26e+031.24e+039.5e+03 011.04 10.4512.68.1211.58.552.2 40.6
1.11e+04111431641161.8e+05 6.72e+031.25e+041.92e+04111.01 10.4512.68.1211.58.552.2 40.6
1.11e+0411152007918.07e+045.11e+032.39e+037.51e+03011.01 10.4512.68.1211.58.552.2 40.6
1.11e+041114810781.68e+044.08e+031.25e+035.33e+03011.01 10.4512.68.1211.58.552.2 40.6
1.2e+04 121351105101.88e+047.03e+031.76e+042.47e+04010.76310.2 10.26.0910.86.872.0343.3
1.2e+04 12133157164.49e+042.11e+031.17e+033.28e+03002.25 10.2 10.26.0910.86.872.0343.3

Mas adelante de ver nuestra base, creamos un par de variables en términos logarítmicos

hh_98.data <- mutate(hh_98, lexptot = log(1 + exptot)) %>%
  mutate(lnland = log((1 + hhland/100)))

Impacto de la participación del programa

Vamos a tomar la base de datos de forma no balanceada, para ello con la opción del paquete de survey lo tomamos

des1 <- svydesign(id = ~nh,  weights = ~weight, data = hh_98.data)

# Modelo probit dependiente la participación
prog.lm <- svyglm(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil + egg, 
                  design=des1, family = quasibinomial(link = "probit"))   

# Lista de variables
X <- prog.lm$fitted
Tr <- hh_98.data$dmmfd
Y <- hh_98.data$lexptot

# Matching
m.out <- Match(Tr = Tr, X = X, Y = Y, caliper = 0.001)
summary(m.out)

Estimate...  -0.081316 
AI SE......  0.031365 
T-stat.....  -2.5926 
p.val......  0.0095266 

Original number of observations..............  1129 
Original number of treated obs...............  220 
Matched number of observations...............  71 
Matched number of observations  (unweighted).  88 

Caliper (SDs)........................................   0.001 
Number of obs dropped by 'exact' or 'caliper'  149 

Podemos ver que existe una diferencia significativa del efecto con el grupo. Teniendo como dependiente el gasto anual de los campesinos.

Primer Match

Ahora vamos a mirar como podemos buscar la relación de matching entre individuos a partir de las características observables dentro de la muestra

MatchBalance(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil + egg, data = hh_98.data, nboots = 500, ks = TRUE)

***** (V1) sexhead *****
before matching:
mean treatment........ 0.97727 
mean control.......... 0.89109 
std mean diff......... 57.697 

mean raw eQQ diff..... 0.086364 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 1 

mean eCDF diff........ 0.043092 
med  eCDF diff........ 0.043092 
max  eCDF diff........ 0.086184 

var ratio (Tr/Co)..... 0.22965 
T-test p-value........ 3.6741e-09 


***** (V2) agehead *****
before matching:
mean treatment........ 44.036 
mean control.......... 46.491 
std mean diff......... -20.891 

mean raw eQQ diff..... 2.7727 
med  raw eQQ diff..... 3 
max  raw eQQ diff..... 15 

mean eCDF diff........ 0.042981 
med  eCDF diff........ 0.030478 
max  eCDF diff........ 0.12168 

var ratio (Tr/Co)..... 0.83529 
T-test p-value........ 0.0066778 
KS Bootstrap p-value.. 0.004 
KS Naive p-value...... 0.010544 
KS Statistic.......... 0.12168 


***** (V3) educhead *****
before matching:
mean treatment........ 2.7409 
mean control.......... 2.2145 
std mean diff......... 14.337 

mean raw eQQ diff..... 0.50455 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 3 

mean eCDF diff........ 0.034492 
med  eCDF diff........ 0.032868 
max  eCDF diff........ 0.079033 

var ratio (Tr/Co)..... 1.1516 
T-test p-value........ 0.054121 
KS Bootstrap p-value.. 0.062 
KS Naive p-value...... 0.21851 
KS Statistic.......... 0.079033 


***** (V4) lnland *****
before matching:
mean treatment........ 0.32431 
mean control.......... 0.39162 
std mean diff......... -16.995 

mean raw eQQ diff..... 0.080432 
med  raw eQQ diff..... 0.016878 
max  raw eQQ diff..... 1.7482 

mean eCDF diff........ 0.034645 
med  eCDF diff........ 0.034623 
max  eCDF diff........ 0.063916 

var ratio (Tr/Co)..... 0.56172 
T-test p-value........ 0.035665 
KS Bootstrap p-value.. 0.412 
KS Naive p-value...... 0.46431 
KS Statistic.......... 0.063916 


***** (V5) vaccess *****
before matching:
mean treatment........ 0.85 
mean control.......... 0.83168 
std mean diff......... 5.1181 

mean raw eQQ diff..... 0.018182 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 1 

mean eCDF diff........ 0.0091584 
med  eCDF diff........ 0.0091584 
max  eCDF diff........ 0.018317 

var ratio (Tr/Co)..... 0.91396 
T-test p-value........ 0.50013 


***** (V6) pcirr *****
before matching:
mean treatment........ 0.55325 
mean control.......... 0.56209 
std mean diff......... -2.5201 

mean raw eQQ diff..... 0.027802 
med  raw eQQ diff..... 0.0175 
max  raw eQQ diff..... 0.19881 

mean eCDF diff........ 0.026626 
med  eCDF diff........ 0.027108 
max  eCDF diff........ 0.05278 

var ratio (Tr/Co)..... 1.1482 
T-test p-value........ 0.73431 
KS Bootstrap p-value.. 0.574 
KS Naive p-value...... 0.70715 
KS Statistic.......... 0.05278 


***** (V7) rice *****
before matching:
mean treatment........ 10.532 
mean control.......... 10.223 
std mean diff......... 17.314 

mean raw eQQ diff..... 0.36551 
med  raw eQQ diff..... 0.22556 
max  raw eQQ diff..... 1.8045 

mean eCDF diff........ 0.051255 
med  eCDF diff........ 0.051545 
max  eCDF diff........ 0.12578 

var ratio (Tr/Co)..... 1.4065 
T-test p-value........ 0.01832 
KS Bootstrap p-value.. 0.002 
KS Naive p-value...... 0.00736 
KS Statistic.......... 0.12578 


***** (V8) wheat *****
before matching:
mean treatment........ 7.5296 
mean control.......... 7.4517 
std mean diff......... 10.062 

mean raw eQQ diff..... 0.1338 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 0.67668 

mean eCDF diff........ 0.034366 
med  eCDF diff........ 0.029853 
max  eCDF diff........ 0.075558 

var ratio (Tr/Co)..... 0.80554 
T-test p-value........ 0.19147 
KS Bootstrap p-value.. 0.074 
KS Naive p-value...... 0.26405 
KS Statistic.......... 0.075558 


***** (V9) milk *****
before matching:
mean treatment........ 10.609 
mean control.......... 10.965 
std mean diff......... -10.695 

mean raw eQQ diff..... 0.42139 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 3.3834 

mean eCDF diff........ 0.028392 
med  eCDF diff........ 0.012681 
max  eCDF diff........ 0.086094 

var ratio (Tr/Co)..... 0.96817 
T-test p-value........ 0.15685 
KS Bootstrap p-value.. 0.064 
KS Naive p-value...... 0.14471 
KS Statistic.......... 0.086094 


***** (V10) oil *****
before matching:
mean treatment........ 39.426 
mean control.......... 39.398 
std mean diff......... 0.64407 

mean raw eQQ diff..... 0.61209 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 5.4135 

mean eCDF diff........ 0.021813 
med  eCDF diff........ 0.015037 
max  eCDF diff........ 0.0531 

var ratio (Tr/Co)..... 1.2295 
T-test p-value........ 0.93047 
KS Bootstrap p-value.. 0.292 
KS Naive p-value...... 0.70003 
KS Statistic.......... 0.0531 


***** (V11) egg *****
before matching:
mean treatment........ 1.8509 
mean control.......... 1.9781 
std mean diff......... -37.098 

mean raw eQQ diff..... 0.12611 
med  raw eQQ diff..... 0 
max  raw eQQ diff..... 0.50751 

mean eCDF diff........ 0.083573 
med  eCDF diff........ 0.077308 
max  eCDF diff........ 0.14831 

var ratio (Tr/Co)..... 0.82698 
T-test p-value........ 1.935e-06 
KS Bootstrap p-value.. < 2.22e-16 
KS Naive p-value...... 0.00082542 
KS Statistic.......... 0.14831 


Before Matching Minimum p.value: < 2.22e-16 
Variable Name(s): egg  Number(s): 11 
fit <- prog.lm$data
fit$fvalues <- prog.lm$fitted.values 

fit.control <- filter(fit, dmmfd == 0)
fit.treated <- filter(fit, dmmfd == 1)

ggplot() + 
  geom_density(aes(x=fit.control$fvalues, linetype = '2')) +
  geom_density(aes(x=fit.treated$fvalues, linetype = '3')) +
  xlim(-.1,.6) +
  xlab("") +
  scale_linetype_discrete(name = "", labels = c("Controles", "Tratados")) +
  ggtitle("Densidades de los Controles vs Tratados")

Segunda regresión pero con balance

A continuación intentamos replicar lo anterior, pero ya mas balanceado en el grupo de controles

# Second Regression (Balanced)

des1 <- svydesign(id = ~nh,  weights = ~weight, data = hh_98.data)
prog.lm <- svyglm(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil, 
                  design=des1, family = quasibinomial(link = "probit"))   

X <- prog.lm$fitted
Tr <- hh_98.data$dmmfd
Y <- hh_98.data$lexptot

# Acá ya existe como tal la zona de soporte común
m.out <- Match(Tr = Tr, X = X, Y = Y, caliper = 0.001, M = 1, CommonSupport = TRUE)
summary(m.out)

Estimate...  -0.025275 
AI SE......  0.030279 
T-stat.....  -0.83474 
p.val......  0.40387 

Original number of observations..............  1084 
Original number of treated obs...............  219 
Matched number of observations...............  76 
Matched number of observations  (unweighted).  106 

Caliper (SDs)........................................   0.001 
Number of obs dropped by 'exact' or 'caliper'  143 

Vamos al grafico

fit <- prog.lm$data
fit$fvalues <- prog.lm$fitted.values 

fit.control <- filter(fit, dmmfd == 0)
fit.treated <- filter(fit, dmmfd == 1)

ggplot() + 
  geom_density(aes(x=fit.control$fvalues, linetype = '2')) +
  geom_density(aes(x=fit.treated$fvalues, linetype = '3')) +
  xlim(-.1,.6) +
  xlab("") +
  scale_linetype_discrete(name = "", labels = c("Controles", "Tratados")) +
  ggtitle("Densidad de Grupos")

Vecino cercano

En esta parte tomaremos por grupos e iremos viendo otras opciones

# Nearest Neighbor
des1 <- svydesign(id = ~nh,  weights = ~weight, data = hh_98.data)
prog.lm <- svyglm(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil, 
                  design=des1, family = quasibinomial(link = "probit"))   

X <- prog.lm$fitted.values
Tr <- hh_98.data$dmmfd
Y <- hh_98.data$lexptot
m.out <- Match(Tr = Tr, X = X, Y = Y, M = 1, caliper = 0.001, replace = TRUE)
summary(m.out)

Estimate...  -0.028309 
AI SE......  0.0301 
T-stat.....  -0.94049 
p.val......  0.34696 

Original number of observations..............  1129 
Original number of treated obs...............  220 
Matched number of observations...............  79 
Matched number of observations  (unweighted).  109 

Caliper (SDs)........................................   0.001 
Number of obs dropped by 'exact' or 'caliper'  141 

Usando MatchIt

Vamos a implementar ahora el paquete de MatchIt para mirar otras opciones de pareo.

# Este es estratificando
m.out <- matchit(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil,
                 data = hh_98.data, method = "nearest", distance = "probit", caliper = 0.001)
summary(m.out)

Call:
matchit(formula = dmmfd ~ sexhead + agehead + educhead + lnland + 
    vaccess + pcirr + rice + wheat + milk + oil, data = hh_98.data, 
    method = "nearest", distance = "probit", caliper = 0.001)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.2243        0.1877          0.5302     0.8132    0.1305
sexhead         0.9773        0.8911          0.5783          .    0.0862
agehead        44.0364       46.4906         -0.2089     0.8353    0.0430
educhead        2.7409        2.2145          0.1434     1.1516    0.0345
lnland          0.3243        0.3916         -0.1700     0.5617    0.0346
vaccess         0.8500        0.8317          0.0513          .    0.0183
pcirr           0.5532        0.5621         -0.0252     1.1482    0.0266
rice           10.5316       10.2228          0.1731     1.4065    0.0513
wheat           7.5296        7.4517          0.1006     0.8055    0.0344
milk           10.6085       10.9654         -0.1070     0.9682    0.0284
oil            39.4259       39.3979          0.0064     1.2295    0.0218
         eCDF Max
distance   0.1896
sexhead    0.0862
agehead    0.1217
educhead   0.0790
lnland     0.0639
vaccess    0.0183
pcirr      0.0528
rice       0.1258
wheat      0.0756
milk       0.0861
oil        0.0531

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.2060        0.2060          0.0001     1.0002    0.0009
sexhead         0.9888        0.9775          0.0754          .    0.0112
agehead        44.0787       45.2247         -0.0976     0.8521    0.0298
educhead        2.0787        2.0674          0.0031     1.1143    0.0142
lnland          0.3467        0.2833          0.1602     0.9305    0.0533
vaccess         0.9101        0.9213         -0.0315          .    0.0112
pcirr           0.5412        0.5534         -0.0349     1.2947    0.0484
rice           10.1173       10.1996         -0.0462     1.2545    0.0298
wheat           7.4853        7.3827          0.1325     0.7428    0.0474
milk           10.9942       11.1006         -0.0319     1.0562    0.0315
oil            39.5669       39.0575          0.1171     1.2771    0.0363
         eCDF Max Std. Pair Dist.
distance   0.0225          0.0005
sexhead    0.0112          0.0754
agehead    0.1124          1.0960
educhead   0.0449          0.7804
lnland     0.1798          0.9021
vaccess    0.0112          0.3461
pcirr      0.1011          1.1403
rice       0.0899          0.8277
wheat      0.1236          1.0847
milk       0.0899          1.2078
oil        0.0899          1.2075

Sample Sizes:
          Control Treated
All           909     220
Matched        89      89
Unmatched     820     131
Discarded       0       0

Solo mujeres

glm.female <- glm(dfmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil, family = binomial, data = hh_98.data)
X <- glm.female$fitted
Tr <- hh_98.data$dmmfd
Y <- hh_98.data$lexptot
m.out <- Match(Tr = Tr, X = X, Y = Y, caliper = 0.001, M = 1, replace = TRUE)
summary(m.out)

Estimate...  0.0059114 
AI SE......  0.027928 
T-stat.....  0.21167 
p.val......  0.83237 

Original number of observations..............  1129 
Original number of treated obs...............  220 
Matched number of observations...............  100 
Matched number of observations  (unweighted).  137 

Caliper (SDs)........................................   0.001 
Number of obs dropped by 'exact' or 'caliper'  120 

¿Qué nos dice esta parte?

Modelo adicional

# Ajuste del modelo de Propensity Score
pscore_model <- matchit(dmmfd ~ sexhead + agehead + educhead + lnland + vaccess + pcirr + rice + wheat + milk + oil + egg,
                        method = "nearest", data = hh_98.data)
summary(pscore_model)

Call:
matchit(formula = dmmfd ~ sexhead + agehead + educhead + lnland + 
    vaccess + pcirr + rice + wheat + milk + oil + egg, data = hh_98.data, 
    method = "nearest")

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.2519        0.1811          0.6523     1.2798    0.1861
sexhead         0.9773        0.8911          0.5783          .    0.0862
agehead        44.0364       46.4906         -0.2089     0.8353    0.0430
educhead        2.7409        2.2145          0.1434     1.1516    0.0345
lnland          0.3243        0.3916         -0.1700     0.5617    0.0346
vaccess         0.8500        0.8317          0.0513          .    0.0183
pcirr           0.5532        0.5621         -0.0252     1.1482    0.0266
rice           10.5316       10.2228          0.1731     1.4065    0.0513
wheat           7.5296        7.4517          0.1006     0.8055    0.0344
milk           10.6085       10.9654         -0.1070     0.9682    0.0284
oil            39.4259       39.3979          0.0064     1.2295    0.0218
egg             1.8509        1.9781         -0.3710     0.8270    0.0836
         eCDF Max
distance   0.3022
sexhead    0.0862
agehead    0.1217
educhead   0.0790
lnland     0.0639
vaccess    0.0183
pcirr      0.0528
rice       0.1258
wheat      0.0756
milk       0.0861
oil        0.0531
egg        0.1483

Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
distance        0.2519        0.2503          0.0147     1.0473    0.0016
sexhead         0.9773        0.9818         -0.0305          .    0.0045
agehead        44.0364       44.2955         -0.0221     0.9099    0.0168
educhead        2.7409        2.7455         -0.0012     0.9079    0.0155
lnland          0.3243        0.3674         -0.1088     0.6618    0.0265
vaccess         0.8500        0.8045          0.1273          .    0.0455
pcirr           0.5532        0.5145          0.1103     1.0611    0.0418
rice           10.5316       10.4865          0.0253     1.4821    0.0388
wheat           7.5296        7.6096         -0.1032     0.7729    0.0449
milk           10.6085       10.5501          0.0175     1.2407    0.0309
oil            39.4259       40.0595         -0.1456     1.2084    0.0332
egg             1.8509        1.8363          0.0426     1.0055    0.0157
         eCDF Max Std. Pair Dist.
distance   0.0273          0.0176
sexhead    0.0045          0.0915
agehead    0.0682          1.0040
educhead   0.0455          0.9496
lnland     0.0636          1.0476
vaccess    0.0455          0.8147
pcirr      0.0955          1.0781
rice       0.0909          0.9335
wheat      0.1182          1.1397
milk       0.1045          1.0058
oil        0.0955          1.0349
egg        0.0318          0.9035

Sample Sizes:
          Control Treated
All           909     220
Matched       220     220
Unmatched     689       0
Discarded       0       0

Para el objeto de extraer datos

# Extraer los datos emparejados
matched_data <- match.data(pscore_model)
head(matched_data)
nhyearvillidthanaidageheadsexheadeducheadfamsizehhlandhhassetexpfdexpnfdexptotdmmfddfmfdweightvaccesspcirrricewheatmilkpotatoeggoillexptotlnlanddistanceweightssubclass
1.11e+04111431641161.8e+05 6.72e+031.25e+041.92e+04111.01 10.4512.6 8.1211.58.552.2 40.69.860.77  0.231 121
1.11e+041114810781.68e+044.08e+031.25e+035.33e+03011.01 10.4512.6 8.1211.58.552.2 40.68.580.077 0.23  190
1.3e+04 131351041388.41e+043.67e+032.48e+036.15e+03011.49 00.759.026.0910.86.112.0340.68.720.867 0.08441119
1.3e+04 131481042376.47e+054.79e+032.2e+03 6.99e+03011.49 00.759.026.0910.86.112.0340.68.851.21  0.0626113
1.31e+0413170104193.4e+04 3.26e+036.08e+039.34e+03100.65700.759.026.0910.86.112.0340.69.140.174 0.0767122
1.31e+041314110534.24e+032.4e+03 523       2.93e+03100.65700.759.026.0910.86.112.0340.67.980.02960.113 124

Carlos Yanes Guerra | Departamento de Economía | Universidad del Norte