Today we will continue our work on Fama French factor models, but more as a vehicle to explore some of the awesome stuff happening in the world of tidy models. For new readers who want get familiar with Fama French before diving into this post, see here where we covered importing and wrangling the data, here where we covered rolling models and visualization, my most recent previous post here where we covered managing many models, and if you’re into Shiny, this flexdashboard.

Our goal today is to explore k-fold cross-validation via the `rsample`

package, and a bit of model evaluation via the `yardstick`

package. We started the model evaluation theme last time when we used `tidy()`

, `glance()`

and `augment()`

from the `broom`

package. In this post, we will use the `rmse()`

function from `yardstick`

, but our main focus will be on the `vfold_cv()`

function from `rsample`

. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We’ll stay in the world of explanatory models via linear regression world for a few reasons.

First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn about `rsample`

whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in `rsample`

and `yardstick`

. Second, factor models are important in finance, despite relying on good old linear regression. We won’t regret time spent on factor models, and we might even find creative new ways to deploy or visualize them.

The plan for today is take the same models that we ran in the last post, only this use k-fold cross validation and bootstrapping to try to assess the quality of those models.

For that reason, we’ll be working with the same data as we did previously. I won’t go through the logic again, but in short, we’ll import data for daily prices of five ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the five Fama French factor data and join it to our five ETF returns data. Here’s the code to make that happen:

```
library(tidyverse)
library(broom)
library(tidyquant)
library(timetk)
symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")
# The prices object will hold our daily price data.
prices <-
getSymbols(symbols, src = 'yahoo',
from = "2012-12-31",
to = "2017-12-31",
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) %>%
reduce(merge) %>%
`colnames<-`(symbols)
asset_returns_long <-
prices %>%
tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
gather(asset, returns, -date) %>%
group_by(asset) %>%
mutate(returns = (log(returns) - log(lag(returns)))) %>%
na.omit()
factors_data_address <-
"http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"
factors_csv_name <- "Global_5_Factors_Daily.csv"
temp <- tempfile()
download.file(
# location of file to be downloaded
factors_data_address,
# where we want R to store that file
temp,
quiet = TRUE)
Global_5_Factors <-
read_csv(unz(temp, factors_csv_name), skip = 6 ) %>%
rename(date = X1, MKT = `Mkt-RF`) %>%
mutate(date = ymd(parse_date_time(date, "%Y%m%d")))%>%
mutate_if(is.numeric, funs(. / 100)) %>%
select(-RF)
data_joined_tidy <-
asset_returns_long %>%
left_join(Global_5_Factors, by = "date") %>%
na.omit()
```

After running that code, we have an object called `data_joined_tidy`

. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF rows.

```
data_joined_tidy %>%
slice(1)
```

```
# A tibble: 5 x 8
# Groups: asset [5]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-02 AGG -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
3 2013-01-02 EFA 0.0154 0.0199 -0.0043 0.0028 -0.0028 -0.0023
4 2013-01-02 IJS 0.0271 0.0199 -0.0043 0.0028 -0.0028 -0.0023
5 2013-01-02 SPY 0.0253 0.0199 -0.0043 0.0028 -0.0028 -0.0023
```

Let’s work with just one ETF for today and use `filter(asset == "AGG")`

to shrink our data down to just that ETF.

```
agg_ff_data <-
data_joined_tidy %>%
filter(asset == "AGG")
```

Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted r-squared values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using k-fold cross validation. That’s a pretty jargon-heavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, `cross-validation`

. Instead of running our model on the entire data set - all the daily returns of AGG - we’ll run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and the testing sets, though `rsample`

calls them the `analysis`

and `assessment`

sets. We validate the model results by applying them to the `assessment`

data and seeing how the model performed.

The `k-fold`

bit refers to the fact that we’re not just dividing our data into training and testing subsets, we’re actually going to divide it into a bunch of groups, a `k`

number of groups, or a `k`

number of `folds`

. One of those folds will be used as the validation set; the model will be fit on the other `k - 1`

sets, and then tested on the validation set. We’re doing this with a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data (we’ll get there in 2019).^{1}

If you’re like me, it will take a bit of tinkering to really grasp k-fold cross validation, but `rsample`

as a great function for dividing our data into k-folds. If we wish to use five folds (the state of the art seems to be either five or ten folds), we call the `vfold_cv()`

function, pass it our data object `agg_ff_data`

, and set `v = 5`

.

```
library(rsample)
library(yardstick)
set.seed(752)
cved_ff<-
vfold_cv(agg_ff_data, v = 5)
cved_ff
```

```
# 5-fold cross-validation
# A tibble: 5 x 2
splits id
<list> <chr>
1 <S3: rsplit> Fold1
2 <S3: rsplit> Fold2
3 <S3: rsplit> Fold3
4 <S3: rsplit> Fold4
5 <S3: rsplit> Fold5
```

We have an object called `cved_ff`

, with a column called `splits`

and a column called `id`

. Let’s peek at the first split.

`cved_ff$splits[[1]]`

`<1007/252/1259>`

Three numbers. The first, 1007, is telling us how many observations are in the `analysis`

. Since we have five folds, we should have 80% (or 4/5) of our data in the `analysis`

set. The second number, 252, is telling us how many observations are in the `assessment`

, which is 20% of our original data. The third number, 1259, is the total number of observations in our original data.

Next, we want to apply a model to the `analysis`

set of this k-folded data and test the results on the `assessment`

set. Let’s start with one factor and run a simple linear model, `lm(returns ~ MKT)`

.

We want to run it on `analysis(cved_ff$splits[[1]])`

- the analysis set of out first split.

```
ff_model_test <- lm(returns ~ MKT, data = analysis(cved_ff$splits[[1]]))
ff_model_test
```

```
Call:
lm(formula = returns ~ MKT, data = analysis(cved_ff$splits[[1]]))
Coefficients:
(Intercept) MKT
0.0001025 -0.0265517
```

Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add that data to the original set. We’ll use `augment()`

for that task, and pass it `assessment(cved_ff$splits[[1]])`

```
ff_model_test %>%
augment(newdata = assessment(cved_ff$splits[[1]])) %>%
head() %>%
select(returns, .fitted)
```

```
# A tibble: 6 x 2
returns .fitted
<dbl> <dbl>
1 0.000902 0.000118
2 0.00117 0.0000494
3 0.00108 0.000116
4 -0.00244 -0.0000754
5 -0.00217 -0.0000834
6 0.00282 0.000387
```

We just added our fitted values to the `assessment`

data, the subset of the data on which the model was not fit. How well did our model do when compare the fitted values to the data in the held out set?

We can use the `rmse()`

function from `yardstick`

to measure our model. RMSE stands for root mean-squared error. It’s the sum of the squared differences between our fitted values and the actual values in the `assessment`

data. A lower RMSE is better!

```
ff_model_test %>%
augment(newdata = assessment(cved_ff$splits[[1]])) %>%
rmse(returns, .fitted)
```

```
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.00208
```

Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a `split`

, and we’re going to use `pull()`

so we can extract the raw number, instead of the entire `tibble`

result.

```
model_one <- function(split) {
split_for_model <- analysis(split)
ff_model <- lm(returns ~ MKT, data = split_for_model)
holdout <- assessment(split)
rmse <- ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
```

Now we pass it our first split.

```
model_one(cved_ff$splits[[1]]) %>%
head()
```

`[1] 0.00208033`

Now we want to apply that function to each of our five folds that are stored in `agg_cved_ff`

. We do that with a combination of `mutate()`

and `map_dbl()`

. We use `map_dbl()`

instead of `map`

because we are returning a number here and there’s not a good reason to store that number in a list column.

```
cved_ff %>%
mutate(rmse = map_dbl(cved_ff$splits, model_one))
```

```
# 5-fold cross-validation
# A tibble: 5 x 3
splits id rmse
* <list> <chr> <dbl>
1 <S3: rsplit> Fold1 0.00208
2 <S3: rsplit> Fold2 0.00189
3 <S3: rsplit> Fold3 0.00201
4 <S3: rsplit> Fold4 0.00224
5 <S3: rsplit> Fold5 0.00190
```

OK, we have five RMSE’s since we ran the model on five separate `analysis`

fold sets and tested on five separate `assessment`

fold sets. Let’s find the average RMSE by taking the `mean()`

of the `rmse`

column. That can help reduce noisiness that resulted from our random creation of those five folds.

```
cved_ff %>%
mutate(rmse = map_dbl(cved_ff$splits, model_one)) %>%
summarise(mean_rse = mean(rmse))
```

```
# A tibble: 1 x 1
mean_rse
<dbl>
1 0.00202
```

We now have the mean RMSE after running on our model, `lm(returns ~ MKT)`

, on all five of our folds.

That process for finding the mean RMSE can be applied other models, as well. Let’s suppose we wish to find the mean RMSE for two other models: `lm(returns ~ MKT + SMB + HML)`

, the Fama French three-factor model, and `lm(returns ~ MKT + SMB + HML + RMW + CMA`

, the Fama French five-factor model. By comparing the mean RMSE’s, we can evaluate which model explained the returns of AGG better. Since we’re just adding more and more factors, the models can be expected to get more and more accurate but again, we are exploring the `rsample`

machinery and creating a template where we can pop in whatever models we wish to compare.

First, let’s create two new functions, that follow the exact same code pattern as above but house the three-factor and five-factor models.

```
model_two <- function(split) {
split_for_model <- analysis(split)
ff_model <- lm(returns ~ MKT + SMB + HML, data = split_for_model)
holdout <- assessment(split)
rmse <-
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
model_three <- function(split) {
split_for_model <- analysis(split)
ff_model <- lm(returns ~ MKT + SMB + HML + RMW + CMA, data = split_for_model)
holdout <- assessment(split)
rmse <-
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
```

Now we pass those three models to the same `mutate()`

with `map_dbl()`

flow that we used with just one model. The result will be three new columns of RMSE’s, one for each of our three models applied to our five folds.

```
cved_ff %>%
mutate(
rmse_model_1 = map_dbl(
splits,
model_one),
rmse_model_2 = map_dbl(
splits,
model_two),
rmse_model_3 = map_dbl(
splits,
model_three))
```

```
# 5-fold cross-validation
# A tibble: 5 x 5
splits id rmse_model_1 rmse_model_2 rmse_model_3
* <list> <chr> <dbl> <dbl> <dbl>
1 <S3: rsplit> Fold1 0.00208 0.00211 0.00201
2 <S3: rsplit> Fold2 0.00189 0.00184 0.00178
3 <S3: rsplit> Fold3 0.00201 0.00195 0.00194
4 <S3: rsplit> Fold4 0.00224 0.00221 0.00212
5 <S3: rsplit> Fold5 0.00190 0.00183 0.00177
```

We can also find the mean RMSE for each model.

```
cved_ff %>%
mutate(
rmse_model_1 = map_dbl(
splits,
model_one),
rmse_model_2 = map_dbl(
splits,
model_two),
rmse_model_3 = map_dbl(
splits,
model_three)) %>%
summarise(mean_rmse_model_1 = mean(rmse_model_1),
mean_rmse_model_2 = mean(rmse_model_2),
mean_rmse_model_3 = mean(rmse_model_3))
```

```
# A tibble: 1 x 3
mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
<dbl> <dbl> <dbl>
1 0.00202 0.00199 0.00192
```

That code flow worked just fine, but we had to repeat ourselves when creating the functions for each model. Let’s toggle to a flow where we define three models - the ones that we wish to test with via cross-validation and RMSE - then pass those models to one function.

First, we use `as.formula()`

to define our three models.

```
mod_form_1 <- as.formula(returns ~ MKT)
mod_form_2 <- as.formula(returns ~ MKT + SMB + HML)
mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)
```

Now we write one function that takes `split`

as an argument, same as above, but also takes `formula`

as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to `map`

, so I’ll append `_flex`

to the name of this function.

```
ff_rmse_models_flex <- function(split, formula) {
split_for_data <- analysis(split)
ff_model <- lm(formula, data = split_for_data)
holdout <- assessment(split)
rmse <-
ff_model %>%
augment(newdata = holdout) %>%
rmse(returns, .fitted) %>%
pull(.estimate)
}
```

Now we use the same code flow as before, except we call `map_dbl()`

, pass it our `cved_ff$splits`

object, our new `flex`

function called `ff_rmse_models_flex()`

, and the model we wish to pass as the `formula`

argument. First we pass it `mod_form_1`

.

```
cved_ff %>%
mutate(rmse_model_1 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_1))
```

```
# 5-fold cross-validation
# A tibble: 5 x 3
splits id rmse_model_1
* <list> <chr> <dbl>
1 <S3: rsplit> Fold1 0.00208
2 <S3: rsplit> Fold2 0.00189
3 <S3: rsplit> Fold3 0.00201
4 <S3: rsplit> Fold4 0.00224
5 <S3: rsplit> Fold5 0.00190
```

Now let’s pass it all three models and find the mean RMSE.

```
cved_ff %>%
mutate(rmse_model_1 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_1),
rmse_model_2 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_2),
rmse_model_3 = map_dbl(cved_ff$splits,
ff_rmse_models_flex,
formula = mod_form_3)) %>%
summarise(mean_rmse_model_1 = mean(rmse_model_1),
mean_rmse_model_2 = mean(rmse_model_2),
mean_rmse_model_3 = mean(rmse_model_3))
```

```
# A tibble: 1 x 3
mean_rmse_model_1 mean_rmse_model_2 mean_rmse_model_3
<dbl> <dbl> <dbl>
1 0.00202 0.00199 0.00192
```

Alright, that code flow seems a bit more flexible than our original method of writing a function to assess each model. We didn’t do much hard thinking about functional form here, but hopefully this provides a template where you could assess more nuanced models. We’ll get into bootstrapping and time series work next week, then head to Shiny to ring in the New Year!

And, finally, a couple of public service announcements.

First, thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).

Second, applications are open for the Battlefin alternative data contest, and RStudio is one of the tools you can use to analyze the data. Check it out here. In January, they’ll announce 25 finalists who will get to compete for a cash prize and connect with some quant hedge funds. Go get ‘em!

Thanks for reading and see you next time.

For more on cross-validation, see “An Introduction to Statistical Learning”, chapter 5. Available online here: http://www-bcf.usc.edu/~gareth/ISL/.↩