Presenting results from an arbitrary number of models
The combination of tidyr::nest()
and purrr::map()
makes it easy to
fit the same model to different subsets of a single dataframe.
There are many
tutorials
available to help guide you
through this process. There are substantially fewer (none I’ve been able
to find) that show you how to use these two functions to fit the same
model to different features from your dataframe.
While the former involves splitting your data into different subsets by row, the latter involves cycling through different columns. I recently confronted a problem where I had to run many models, including just one predictor at a time from a large pool of candidate predictors, while also including a standard set of control variables in each.1 Given the (apparent) absence of tutorials on fitting the same model to different features from a dataframe using these functions, I decided to write up the solution I reached in the hope it might be helpful to someone else.2 Start by loading the following packages:
library(tidyverse)
library(broom)
library(modelsummary)
library(kableExtra)
library(nationalparkcolors)
We’ll start with a recap of the subsetting approach, then build on it to
cycle through features instead of subsets of the data. This code is
similar to the official tidyverse
tutorial above, but
pipes the output directly to a ggplot()
call to visualize the results.
mtcars %>%
  nest(data = -cyl) %>% # split data by cylinders
  mutate(mod = map(data, ~lm(mpg ~ disp + wt + am + gear, data = .x)),
         out = map(mod, ~tidy(.x, conf.int = T))) %>% # tidy each model to get coefs
  unnest(out) %>% # unnest to access coefs
  mutate(sig = sign(conf.low) == sign(conf.high), # CI excludes 0 (p < .05)
         cyl = as.factor(cyl)) %>% # factor for nicer plotting
  filter(term == 'disp') %>%
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Cylinders', y = "Coefficient estimate") +
  theme_bw() +
  theme(legend.position = 'bottom')
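If the nest() call is new to you, it can help to run that step on its own before adding the models. A minimal sketch of that inspection:

## peek at the nested structure: one row per cylinder count,
## with the other 10 columns tucked into the data list-column
mtcars %>%
  nest(data = -cyl)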
Multiple predictors
The first thing we have to do is create a custom function because we now
need to be able to specify different predictors in different runs of the
model. The code below is very similar to the code above, except that
we're defining the formula in lm()
via the formula()
function, which
parses a character object that we've assembled via str_c()
. The net
effect is to fit a model where the pred
argument to
func_var()
is the first predictor. This lets us use an external
function to supply different values to pred
. Then we use
broom::tidy()
to create a tidy dataframe of point estimates and
measures of uncertainty from the model and store it in a variable
called out
. Finally, mutate(pred = pred)
creates a variable named
pred
in the output dataframe that records which predictor was used to
fit the model. We could retrieve this from the mod
list-column,
but this approach makes it simpler both to extract the predictor
programmatically and to visually inspect the data. We then use
purrr::map_dfr()
to generate a dataframe where each row corresponds to
a model with a different predictor.
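Before wrapping this in a function, it may help to see the formula construction on its own. A minimal sketch, using 'disp' as a stand-in for pred:

## str_c() glues the character pieces together and
## formula() parses the result into an R formula object
pred <- 'disp'
formula(str_c('mpg ~ ', pred, ' + wt + am + gear'))
## mpg ~ disp + wt + am + gear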
func_var <- function(pred, dataset) {
  dataset %>%
    nest(data = everything()) %>%
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ', pred, # substitute pred
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T))) %>%
    mutate(pred = pred) # record the predictor used
}
## predictors of interest
preds <- c('disp', 'hp', 'drat')
## fit models with different predictors
mods_var <- map_dfr(preds, function(x) func_var(x, mtcars))
## inspect
mods_var
## # A tibble: 3 × 4
## data mod out pred
## <list> <list> <list> <chr>
## 1 <tibble [32 × 11]> <lm> <tibble [5 × 7]> disp
## 2 <tibble [32 × 11]> <lm> <tibble [5 × 7]> hp
## 3 <tibble [32 × 11]> <lm> <tibble [5 × 7]> drat
Plots
You can see our original dataframe condensed into data
with nest()
, the model object in mod
, the tidied model output in
out
, and finally the predictor used to fit each model in pred
. Using
unnest()
on the out
column, we get a dataframe we can
use to plot the main coefficient estimate from each of our three models.
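If you want to sanity-check what unnest() produces before handing it to ggplot(), a quick peek along these lines works:

## one row per term per model; pred records the model each row came from
mods_var %>%
  unnest(out) %>%
  filter(term %in% preds) %>%
  select(pred, term, estimate, conf.low, conf.high)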
mods_var %>%
  unnest(out) %>% # unnest tidied output
  mutate(sig = sign(conf.low) == sign(conf.high)) %>% # CI excludes 0
  filter(term %in% preds) %>% # keep only the focal predictors
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Predictor', y = "Coefficient estimate") +
  theme_bw() +
  theme(legend.position = 'bottom')
Tables
Things get slightly more complicated when we want to represent our
results textually instead of visually. We can use the excellent
modelsummary::modelsummary()
function to create our table, but we need
to supply a list of model objects, rather than the unnested dataframe we
created above to plot the results. We can use the split()
function to
turn our dataframe into a list, and by using split(seq(nrow(.)))
,
we’ll create one list item for each row in our dataframe.
Since each list item will be a one-row dataframe, we can use lapply()
to cycle through the list. The mod
object in each one-row dataframe is
itself a list-column, so we need to index it with [[1]]
to properly
access the model object itself.3 The last step is a call to
unname()
, which will drop the automatically generated list item names
of 1
, 2
, and 3
, allowing modelsummary()
to use the default names
for each model column in the output.
tab_coef_map <- c('disp' = 'Displacement', # format coefficient labels
                  'hp' = 'Horsepower',
                  'drat' = 'Drive ratio',
                  'wt' = 'Weight (1000 lbs)',
                  'am' = 'Manual',
                  'gear' = 'Gears',
                  '(Intercept)' = '(Intercept)')

mods_var %>%
  split(seq(nrow(.))) %>% # list where each item is a one-row dataframe
  lapply(function(x) x$mod[[1]]) %>% # extract the model from each one-row dataframe
  unname() %>% # drop list names so the table uses default model names
  modelsummary(coef_map = tab_coef_map, stars = c('*' = .05))
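As an aside, because mod is already a plain (unnamed) list-column, you can skip the split()/lapply()/unname() dance here and extract the list directly; this sketch should be equivalent whenever there is exactly one model per row:

## mod is an unnamed list of lm objects, so it can go
## straight into modelsummary()
mods_var$mod %>%
  modelsummary(coef_map = tab_coef_map, stars = c('*' = .05))

The split() route is still worth knowing, though, since it generalizes to splitting on a variable, which is exactly what we'll need below.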
Bonus
Now, let's combine both approaches. We're going to split our
dataframe into three sub-datasets by number of cylinders while also
fitting the same model three times with 'disp'
, 'hp'
, and 'drat'
as predictors. The only changes in func_var_obs()
relative to func_var()
are to omit cyl
from the nesting, recode it as a factor so it plots as discrete axis
labels, and drop the nested data column once the models are fit.
func_var_obs <- function(pred, dataset) {
  dataset %>%
    nest(data = -cyl) %>%
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ', pred,
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T)),
           cyl = as.factor(cyl),
           pred = pred) %>%
    select(-data) # drop the nested data; keep mod, out, cyl, pred
}
preds <- c('disp', 'hp', 'drat')
mods_var_obs <- map_dfr(preds, function(x) func_var_obs(x, mtcars))
Plotting involves a call to facet_wrap()
, but is otherwise similar.
mods_var_obs %>%
  unnest(out) %>%
  mutate(sig = sign(conf.low) == sign(conf.high)) %>%
  filter(term %in% preds) %>%
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  facet_wrap(~pred) + # one panel per predictor
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Cylinders', y = "Coefficient estimate") +
  theme_bw() +
  theme(legend.position = 'bottom')
Creating tables is more complex. Here we have to cycle through each
predictor with a call to map()
, filter the output to only contain
results from models using that predictor, then split the dataframe by
cylinders instead of into separate rows. Note the use of
unname(preds_name[x])
to retrieve full English predictor names for more useful table
titles. We'll also be using tab_coef_map
from
above to get more informative row labels in our tables. Running the code
below generates the following tables:
## named vector of full English predictor names
preds_name <- c('displacement', 'horsepower', 'drive ratio')
names(preds_name) <- preds
map(preds, function(x) mods_var_obs %>%
      filter(pred == x) %>% # subset to models using predictor x
      select(mod, cyl) %>% # drop tidied model
      split(.$cyl) %>% # split by number of cylinders in engine
      lapply(function(y) y$mod[[1]]) %>% # only one item in each list
      modelsummary(title = str_c('Predictor: ',
                                 unname(preds_name[x])), # formatted name
                   coef_map = tab_coef_map,
                   stars = c('*' = .05),
                   escape = F) %>%
      add_header_above(c(' ' = 1, 'Cylinders' = 3))) %>%
  walk(print) # print each table; walk() returns invisibly, avoiding [[1]] in the output
We've got one table for each predictor we considered, and each one is
split into three models for cars with four, six, and eight cylinder
engines. This is a bit of overkill for this example, but all you have
to do to scale this framework up to hundreds of potential predictors is
put more items in preds
.
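For example, instead of typing the vector out by hand, you could build preds from the column names. A minimal sketch that feeds func_var() every column other than the outcome and the controls:

## every column except the outcome (mpg) and the
## controls (wt, am, gear) becomes a candidate predictor
preds <- mtcars %>%
  select(-mpg, -wt, -am, -gear) %>%
  names()
mods_var <- map_dfr(preds, function(x) func_var(x, mtcars))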
-
Yes, I know this is a perfect situation to use LASSO. Sometimes people (reviewers) want certain models run, and you just have to run them. ↩
-
There’s a very real chance that someone else is me in six months. ↩
-
Things get a lot more complicated if your
split()
call produces a list of dataframes that aren’t one row each, so make sure that’s what you’re getting before you proceed. ↩