battlefin presentation

2019-01-29

by Jonathan Regenstein

Introducing R and RStudio

+ Statistical programming language -> by data scientists, for data scientists
+ Base R + 17,000 packages
+ RStudio
+ Shiny
+ sparklyr -> big data 
+ tensorflow -> AI
+ Rmarkdown -> reproducible reports
+ database connectors
+ htmlwidgets

Packages for today

library(tidyverse)
library(tidyquant)
library(timetk)
library(tibbletime)
library(highcharter)
library(PerformanceAnalytics)

More packages for finance here: https://cran.r-project.org/web/views/Finance.html

Today’s project

+ Import and wrangle data on the financial services SPDR ETF

+ Import and wrangle data from Freddie Mac on housing prices

+ Try to find a signal and code up a toy strategy

+ Visualize its results and descriptive statistics

+ Compare it to buy-and-hold

+ Conclude by building a Shiny dashboard for further exploration

+ Data science work flow

Import data

We will use the tidyquant package and it’s tq_get() function to grab the data from public sources.

In real life, this would be a pointer to your proprietary data source for market data. Probably a data base or a data lake somewhere, possibly an excel spreadsheet or csv.

symbols <- "XLF"


prices <- 
  tq_get(symbols, 
         get = "stock.prices",
         from = "1998-01-01")


prices %>% 
  slice(1:5)

# A tibble: 5 x 7
  date        open  high   low close volume adjusted
  <date>     <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
1 1998-12-22  19.1  19.1  18.8  18.9  55800     9.71
2 1998-12-23  18.9  19.2  18.9  19.2  78700     9.85
3 1998-12-24  19.2  19.3  19.2  19.3  43800     9.92
4 1998-12-28  19.3  19.3  19.0  19.1  51900     9.79
5 1998-12-29  19.1  19.3  18.9  19.3 100800     9.89

Start with a line chart.

We will use highcharter to create a quick interactive chart.

hc_prices_daily <- 
prices %>% 
  hchart(., 
         hcaes(x = date, y = adjusted),
         type = "line") %>% 
  hc_title(text = "Explore prices") %>% 
  hc_tooltip(pointFormat = "XLF: ${point.y: .2f}")

hc_prices_daily

Why start with a simple line chart? Always good to make sure our data isn’t corrupted or missing values.

returns <- 
prices %>% 
  select(date, adjusted) %>% 
  mutate(returns = log(adjusted) - log(lag(adjusted))) %>% 
  na.omit()

returns %>% 
  slice(1:5)

# A tibble: 5 x 3
  date       adjusted  returns
  <date>        <dbl>    <dbl>
1 1998-12-23     9.85  0.0146 
2 1998-12-24     9.92  0.00658
3 1998-12-28     9.79 -0.0132 
4 1998-12-29     9.89  0.0106 
5 1998-12-30     9.85 -0.00396

Get some quick summary stats on the history of daily returns using the table.Stats function.

table.Stats(returns$returns)

                         
Observations    5066.0000
NAs                0.0000
Minimum           -0.1823
Quartile 1        -0.0071
Median             0.0004
Arithmetic Mean    0.0002
Geometric Mean     0.0000
Quartile 3         0.0078
Maximum            0.2698
SE Mean            0.0003
LCL Mean (0.95)   -0.0003
UCL Mean (0.95)    0.0007
Variance           0.0004
Stdev              0.0191
Skewness           0.5033
Kurtosis          20.6198

hc_returns_daily <- 
returns %>%
  hchart(., hcaes(x = date, y = returns),
               type = "scatter") %>% 
  hc_tooltip(pointFormat = '{point.x: %Y-%m-%d} <br>
                            {point.y:.4f}%')

hc_returns_daily

Now let’s grab some data from a different source, that we think might be related to this market price and returns data.

In this case, it is housing data that is published by Freddie Mac. Not exactly an exotic source of data, but Freddie does produce some great data, it’s publicly available so we can hack around with it, and this is an illustrative example of what we might confront when working with alternative data. It’s coming from a new source, not a provider of our market data. That means it will be in a different format. In this case, it’s a csv file. That also means it almost certainly won’t be in a structure that fits perfectly with our other data. We’ll need to import it, wrangle it, mash it together (or join it) it with our market data, then start investigating.

Here the source: http://www.freddiemac.com/research/indices/house-price-index.html

From the internet:

# not run
hpi_raw <- read_csv("http://www.freddiemac.com/research/docs/fmhpi_master_file.csv")

hpi_raw %>%
  glimpse()

Observations: 229,680
Variables: 7
$ Year      <int> 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, 1975, …
$ Month     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, …
$ GEO_Type  <chr> "State", "State", "State", "State", "State", "State", …
$ GEO_Name  <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", …
$ GEO_Code  <chr> ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".",…
$ Index_NSA <dbl> 34.45636, 34.95436, 35.46517, 35.99843, 36.58180, 37.2…
$ Index_SA  <dbl> 34.65549, 35.12381, 35.51362, 35.89006, 36.29857, 36.7…

From a local CSV:

hpi_raw <- 
  read_csv("path to your scv")

Create a date column
filter just the USA
filter to recent data, since 1990
give the index column a better name
delete the unneeded columns

hpi_wrangled <- 
  hpi_raw %>% 
  select(Year, Month, GEO_Type, Index_SA) %>% 
  rename(housing_index = Index_SA, geog = GEO_Type) %>%
  unite("date", c("Year", "Month"), sep = "-") %>% 
  mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>% 
  filter(geog == "US" & date >= "1998-01-01")

hpi_wrangled %>% 
  head()

# A tibble: 6 x 3
  date       geog  housing_index
  <date>     <chr>         <dbl>
1 1998-01-01 US             80.8
2 1998-02-01 US             81.2
3 1998-03-01 US             81.6
4 1998-04-01 US             82.1
5 1998-05-01 US             82.5
6 1998-06-01 US             83.0

How about monthly changes?

hpi_wrangled %>% 
  mutate(hpi_change = log(housing_index) - log(lag(housing_index))) %>% 
  head()

# A tibble: 6 x 4
  date       geog  housing_index hpi_change
  <date>     <chr>         <dbl>      <dbl>
1 1998-01-01 US             80.8   NA      
2 1998-02-01 US             81.2    0.00512
3 1998-03-01 US             81.6    0.00546
4 1998-04-01 US             82.1    0.00556
5 1998-05-01 US             82.5    0.00537
6 1998-06-01 US             83.0    0.00519

Some quick vis’s on monthly changes

hpi_wrangled %>% 
  mutate(hpi_change = log(housing_index) - log(lag(housing_index)),
         color_of_bars = if_else(hpi_change < lag(hpi_change) |
                                  hpi_change < 0,
                                 "#ff3333", "#6495ed")) %>%
  na.omit() %>% 
  hchart(., hcaes(x = date, y = hpi_change, color = color_of_bars), type = "scatter") %>% 
  hc_tooltip(pointFormat = '{point.x: %Y-%m-%d} <br>
                            {point.y:.4f}%')

hpi_wrangled %>% 
  mutate(hpi_change = log(housing_index) - log(lag(housing_index)),
         pos_neg = case_when(
                                  hpi_change < 0 ~ "negative",# or 0 or mean(na.omit(hpi_change))
                                  hpi_change > 0 ~ "positive")) %>%
  count(pos_neg) %>% 
  na.omit() %>% 
  hchart(., hcaes(x = pos_neg, y = n, color = pos_neg), type = "column", pointWidth = 30) %>% 
  hc_title(text = "Num Pos v. Neg Months", align = "center")

Looks good, except this is monthly data, and our price data is daily. Now we have a decision: how to make this data align so we can start exploring potential relationships.

Let’s coerce prices to monthly.

prices %>% 
  select(date, adjusted) %>%
  as_tbl_time(index = date) %>% 
  as_period("monthly", side = "start") %>% 
  head()

# A time tibble: 6 x 2
# Index: date
  date       adjusted
  <date>        <dbl>
1 1998-12-22     9.71
2 1999-01-04     9.76
3 1999-02-01     9.76
4 1999-03-01    10.1 
5 1999-04-01    10.4 
6 1999-05-03    11.3

Wait, look real close - the first trading day of each month is not the 1st of each month, but Freddie always reports data as of the 1st.

Let’s coerce it to the first of the month using floor_date().

prices_monthly <- 
  prices %>% 
  select(date, adjusted) %>%
  as_tbl_time(index = date) %>% 
  as_period("monthly", side = "start") %>%
  mutate(date = floor_date(date, "month"))

prices_monthly %>% 
  slice(1:7)

# A time tibble: 7 x 2
# Index: date
  date       adjusted
  <date>        <dbl>
1 1998-12-01     9.71
2 1999-01-01     9.76
3 1999-02-01     9.76
4 1999-03-01    10.1 
5 1999-04-01    10.4 
6 1999-05-01    11.3 
7 1999-06-01    10.3

Okay, that worked, we’re ready to join this data, but note we just did something important. We changed the date. Maybe that’s no big deal. But maybe it is. Maybe you or your colleagues or your team think this is the worst idea and we should, instead, coerce the Freddie Data to match our market data. If you turn all your data wrangling and prep over to someone else, make sure they notate it. If you do it yourself, make sure you notate! Future you will be happy.

xlf_hpi_joined <-
  prices %>% 
  select(date, adjusted) %>%
  as_tbl_time(index = date) %>% 
  as_period("monthly", side = "start") %>%
  mutate(date = floor_date(date, "month")) %>% 
  full_join(hpi_wrangled)

xlf_hpi_joined %>% 
  slice(1:7)

# A time tibble: 7 x 4
# Index: date
  date       adjusted geog  housing_index
  <date>        <dbl> <chr>         <dbl>
1 1998-12-01     9.71 US             85.6
2 1999-01-01     9.76 US             86.1
3 1999-02-01     9.76 US             86.5
4 1999-03-01    10.1  US             87.0
5 1999-04-01    10.4  US             87.5
6 1999-05-01    11.3  US             88.0
7 1999-06-01    10.3  US             88.6

xlf_hpi_joined_returns <-
  xlf_hpi_joined %>%  
  mutate(hpi_change = log(housing_index) - log(lag(housing_index)),
         asset_returns = log(adjusted) - log(lag(adjusted))) %>% 
  na.omit()

xlf_hpi_joined_returns %>% 
  slice(1:7)

# A time tibble: 7 x 6
# Index: date
  date       adjusted geog  housing_index hpi_change asset_returns
  <date>        <dbl> <chr>         <dbl>      <dbl>         <dbl>
1 1999-01-01     9.76 US             86.1    0.00600       0.00535
2 1999-02-01     9.76 US             86.5    0.00552       0      
3 1999-03-01    10.1  US             87.0    0.00508       0.0367 
4 1999-04-01    10.4  US             87.5    0.00545       0.0305 
5 1999-05-01    11.3  US             88.0    0.00600       0.0770 
6 1999-06-01    10.3  US             88.6    0.00672      -0.0883 
7 1999-07-01    11.0  US             89.2    0.00670       0.0676

xlf_hpi_hc <- 
xlf_hpi_joined_returns %>% 
  hchart(., hcaes(x = date, y = adjusted),
         type = "line") %>%
  hc_add_series(hpi_wrangled, hcaes(x = date, y = housing_index), 
                type = "line", name = "housing_index")

xlf_hpi_hc

Take a look. Nothing too egregious here but that plateau around February 2006 looks like something.

Add some trading logic

Let’s see if that Freddie data might give us a useful signal. We could head to a predictive model here - maybe for HP’s sales growth or the revenue/yield of XLB.

For now, let’s code up a simple algorithm: if the hpi monthly change starts a flat or negative trend, let’s exit our positions until they pick back up.

If the previous three HPI monthly changes were below mean, exit the XLB position.

The key here is the if_else() call in R. We can implement whatever logic we want and see the results. The data wrangling was the hard part.

xlf_hpi_joined_returns %>%
  mutate(signal = if_else(
                            lag(hpi_change, 1) < 0,
                            0, 1),
         trend_returns = if_else(lag(signal) == 1, 
                                 (signal * asset_returns), 0),
         buy_hold_returns = (asset_returns)) %>% 
  select(date, buy_hold_returns, trend_returns) %>%
  na.omit() %>% 
  mutate(
         trend_growth = accumulate(1 + trend_returns, `*`),
         buy_hold_growth = accumulate(1 + buy_hold_returns, `*`)) %>%
  select(date, trend_growth, buy_hold_growth) %>%
  gather(strategy, growth, -date) %>% 
  hchart(., hcaes(x = date, y = growth, group = strategy), type = "line") %>% 
  hc_tooltip(pointFormat = "{point.strategy}: ${point.growth: .2f}")

To Shiny!