Use fl2003.RData
, which is a cleaned up version of the data from Fearon and Laitin (2003). Fit a model where onset is explained by all variables. Use cv.glmnet()
to fit elastic net models for a variety of α values, using a loss function that is appropriate for the binomial nature of the data. Present plots of the model’s predictive accuracy for different α values. Fit a model with glmnet()
using the α value you found that minimizes predictive error. Report coefficient estimates for all variables, and plot the changes in coefficient values vs. the L1 norm, log-lambda value, and deviance explained.
Randomly sample five cases where onset = 0
and five where onset = 1
. Fit an elastic net with the optimal α value you found for the whole dataset. Are the most important coefficients the same?
library(dplyr)
library(glmnet)
library(caret)
library(parallel)
load('fl2003.RData')
# create predictors and response
fl_x <- as.matrix(fl[, -1])
fl_y <- as.factor(fl$onset)
# sequences of alpha values to evaluate
alphas <- seq(0, 1, by = .1)
# cross validation elastic nets for different penalty parameters
fits <- mclapply(alphas, function(x) cv.glmnet(fl_x, fl_y, type.measure = 'auc',
alpha = x, family = 'binomial'))
# plot AUC for different penalty parameters
par(mfrow = c(3,1))
plot(fits[[1]])
plot(fits[[6]])
plot(fits[[11]])
# penalty parameter w/ highest AUC
alpha_best <- which.max(sapply(fits, function(x) max(x$cvm)))
# fit elastic net w/ best penalty parameters
best_fit <- glmnet(fl_x, fl_y, family = 'binomial', alpha = alphas[alpha_best])
# plot coefficients
par(mfrow = c(3,1))
plot(best_fit, xvar = 'norm', label = T)
plot(best_fit, xvar = 'lambda', label = T)
plot(best_fit, xvar = 'dev', label = T)
# coefficients from whole dataset
betas <- varImp(best_fit, lambda = best_fit$lambda[length(best_fit$lambda)])
# sample 10 observations for each outcome
fl_samp <- fl %>% group_by(onset) %>% sample_n(10)
# create predictors and response
fl_x_samp <- as.matrix(fl_samp[, -1])
fl_y_samp <- as.factor(fl_samp$onset)
# fit sample data w/ best penalty parameter from entire dataset
best_fit_samp <- glmnet(fl_x_samp, fl_y_samp, family = 'binomial', alpha = alphas[alpha_best])
# plot coefficients
plot(best_fit_samp, xvar = 'norm', label = T)
plot(best_fit_samp, xvar = 'lambda', label = T)
plot(best_fit_samp, xvar = 'dev', label = T)
# coefficients for sampled data
betas_samp <- varImp(best_fit_samp, lambda = best_fit_samp$lambda[length(best_fit_samp$lambda)])
# compare coefficients for both models
kable(data.frame(betas, betas_samp) %>% rename(All = Overall, Sample = Overall.1))
All | Sample | |
---|---|---|
warl | 1.0222 | 0.0000 |
gdpenl | 0.2353 | 2.3865 |
lpopl1 | 0.2668 | 7.3504 |
lmtnest | 0.2198 | 3.8555 |
ncontig | 0.7021 | 12.0521 |
Oil | 0.6626 | 1.0948 |
nwstate | 1.6557 | 0.0000 |
instab | 0.6097 | 8.2590 |
polity2l | 0.0247 | 0.8997 |
ethfrac | 0.0407 | 7.2016 |
relfrac | 0.0010 | 21.7585 |
western | 1.9047 | 0.0000 |
eeurop | 0.6906 | 0.0000 |
lamerica | 0.3673 | 9.8158 |
ssafrica | 0.0650 | 1.0215 |
asia | 0.2107 | 0.2941 |
Fearon, James D., and David D. Laitin. 2003. “Ethnicity, Insurgency, and Civil War.” The American Political Science Review 97 (1): 75–90.