Market Structure Part 1: Order Volume Density

2020-01-11

by Jonathan Regenstein

Welcome to another installment of Reproducible Finance!

Inspired by a great visualization in Hands on Time Series with R by Rami Krispin, today we’ll investigate some market structure data and get to know the Midas data source provided by the SEC.

Let’s start by importing data from the SEC website for the 2nd quarter of 2019.

If you navigate to the SEC website here https://www.sec.gov/opa/data/market-structure/market-structure-data-security-and-exchange.html and right click on the link labeled ‘2019 Q2’, you can copy the link address as https://www.sec.gov/files/opa/data/market-structure/metrics-individual-security-and-exchange/individual_security_exchange_2019_q2.zip. That long string is the path to the zip file holding our market structure data for Q2 of 2019. We will start by downloading that zip file and extracting the CSV of market data. Then we will progress to a programmatic way to download all of the market structure data on that page using a custom-built function.

Have a close look at this string and notice that there are two unique pieces of information https://www.sec.gov/files/opa/data/market-structure/metrics-individual-security-and-exchange/individual_security_exchange_2019_q2.zip: the year and the quarter. In this case, those are equal to 2019 and q2, respectively.

Let’s first create variables to hold those pieces of data.

year <- "2019"
quarter <- "q2"

Now we need to combine those two variables with the rest of the string. We can use paste0 since there are no spaces in this string.

market_structure_data_address <- 
paste0("https://www.sec.gov/files/opa/data/market-structure/metrics-individual-security/individual_security_", year, "_", quarter, ".zip")

And here is the string we just created.

market_structure_data_address

[1] "https://www.sec.gov/files/opa/data/market-structure/metrics-individual-security/individual_security_2019_q2.zip"

Nothing too mind blowing there but note that this same code paradigm can be used for any data housed on a website: find the link address, identify the pieces that are unique, create variables for them, paste them together with the non-unique portions of the string. The date is very often the unique piece of the string, which makes this process not too painful to repeat elsewhere.

Now that we have a web address for the zip file, we need to download it to R.

Let’s first create a temporary file called temp so that we have a place to deposit the download.

temp <- tempfile()

Next we call the download.file() function, supply our pasted string variable market_structure_data_address and point to the temp location.

download.file(
  # location of file to be downloaded
  market_structure_data_address,
  # where we want R to store that file
  temp, 
  quiet = TRUE)

# tempdir() will show you where this temp file is being stored

Next we need to create a file name, so we can tell vroom() the name of the file to extract from the zip. We will use the quarter and year variables for this as well.

file_name <- paste0(quarter, "_", year, "_all.csv")

We can now import our csv file into R using vroom(unzip(temp, filename)). Let’s also pipe straight to janitor::clean_names() to clean up the column names and use a combination of ymd() and parse_date_time() to coerce the dates into a more usable format.

q2_2019 <- 
  vroom(unz(temp, file_name)) %>%
  clean_names() %>% 
  mutate(date = ymd(parse_date_time(date, "%Y%m%d")))

q2_2019 %>% 
  head()

# A tibble: 6 x 19
  date       security ticker mcap_rank turn_rank volatility_rank price_rank
  <date>     <chr>    <chr>      <dbl>     <dbl>           <dbl>      <dbl>
1 2019-04-01 Stock    A             10         5               1          9
2 2019-04-01 Stock    AA             9        10               6          7
3 2019-04-01 Stock    AAC            2         9              10          1
4 2019-04-01 Stock    AAL           10         9               6          7
5 2019-04-01 Stock    AAMC           2         9              10          7
6 2019-04-01 Stock    AAME           2         1               3          2
# … with 12 more variables: lit_vol_000 <dbl>, order_vol_000 <dbl>,
#   hidden <dbl>, trades_for_hidden <dbl>, hidden_vol_000 <dbl>,
#   trade_vol_for_hidden_000 <dbl>, cancels <dbl>, lit_trades <dbl>,
#   odd_lots <dbl>, trades_for_odd_lots <dbl>, odd_lot_vol_000 <dbl>,
#   trade_vol_for_odd_lots_000 <dbl>

We just imported daily market structure data for thousands of equities and ETFs. Let’s use group_by() and summarise(...n_distinct()) to count how many equitis versus ETFs we have.

q2_2019 %>% 
  group_by(security) %>%
  summarise(distinct_tickers = n_distinct(ticker))

# A tibble: 2 x 2
  security distinct_tickers
  <chr>               <int>
1 ETF                  2135
2 Stock                3698

We could choose any structure metric as a good exploratory starting point. Let’s suppose we’re interested in order volume.

Let’s find the mean and median daily order volume, for both stocks and ETFs.

q2_2019 %>% 
  select(date, security, ticker, order_vol_000) %>% 
  group_by(security) %>% 
  summarise(mean_vol = mean(order_vol_000),
            median_vol = median(order_vol_000)) %>% 
  group_by(security) %>% 
  skimr::skim()

Table 1: Data summary
Name	Piped data
Number of rows	2
Number of columns	3
_______________________
Column type frequency:
numeric	2
________________________
Group variables	security

Variable type: numeric

skim_variable	security	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mean_vol	ETF	1	79449.74	NA	79449.74	79449.74	79449.74	79449.74	79449.74	▁▁▇▁▁
mean_vol	Stock	1	13988.47	NA	13988.47	13988.47	13988.47	13988.47	13988.47	▁▁▇▁▁
median_vol	ETF	1	9586.77	NA	9586.77	9586.77	9586.77	9586.77	9586.77	▁▁▇▁▁
median_vol	Stock	1	3054.13	NA	3054.13	3054.13	3054.13	3054.13	3054.13	▁▁▇▁▁

Here’s a plot of the mean daily order volume, by security type.

q2_2019 %>% 
  select(date, security, ticker, order_vol_000) %>% 
  group_by(security, date) %>% 
  summarise(mean_vol = mean(order_vol_000)) %>% 
  ggplot(aes(x = date, y = mean_vol, color = security)) +
    geom_line() +
  scale_x_date(breaks = scales::pretty_breaks(n = 20)) +
  scale_y_continuous(label = scales::comma_format(), breaks = scales::pretty_breaks(n = 10)) +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(hjust = .5),
        plot.subtitle = element_text(hjust = .5)) +
  labs(x = "", y = "mean volume (000's)", title = "Average Order Volume by Type", subtitle = "Q2 2019") +
  scale_colour_manual(values = c("#ff6699", "cornflowerblue"))

Well, we could keep exploring this data set and creating visualizations, and we will do so in the next post, but for now, let’s cover how to programatically import more data from the SEC.

Glance back up at our original import code flow and notice the only variables that we specified individually were year and quarter. That means we can create a function that accepts those two argumnents and then download data from the time period of our choice.

market_str_data_import <- function(year, quarter){

  
  
  market_structure_data_address <-
    paste0(
      "https://www.sec.gov/files/opa/data/market-structure/metrics-individual-security/individual_security_",
      year,
      "_",
      quarter,
      ".zip"
    )
  
  
  file_name <- paste0(quarter, "_", year, "_all.csv")
  
  temp <- tempfile()
  
  download.file(# location of file to be downloaded
    market_structure_data_address,
    # where we want R to store that file
    temp,
    mode = "wb",
    quiet = TRUE)


  vroom(unzip(temp, file_name)) %>%
    clean_names() %>%
    mutate(date = ymd(parse_date_time(date, "%Y%m%d")))
  
}

Let’s test the function on the third quarter of 2018

market_str_data_import("2018", "q3") %>% 
  head()

# A tibble: 6 x 19
  date       security ticker mcap_rank turn_rank volatility_rank price_rank
  <date>     <chr>    <chr>      <dbl>     <dbl>           <dbl>      <dbl>
1 2018-07-02 Stock    A             10         5               1          9
2 2018-07-02 Stock    AA             9         9               4          8
3 2018-07-02 Stock    AAC            3         5               8          3
4 2018-07-02 Stock    AAL           10         7               4          7
5 2018-07-02 Stock    AAMC           2         2              10          9
6 2018-07-02 Stock    AAME           2         1              10          2
# … with 12 more variables: lit_vol_000 <dbl>, order_vol_000 <dbl>,
#   hidden <dbl>, trades_for_hidden <dbl>, hidden_vol_000 <dbl>,
#   trade_vol_for_hidden_000 <dbl>, cancels <dbl>, lit_trades <dbl>,
#   odd_lots <dbl>, trades_for_odd_lots <dbl>, odd_lot_vol_000 <dbl>,
#   trade_vol_for_odd_lots_000 <dbl>

If we wish to import data for all 4 quarters of a year, we can create a year_for_function variable to hold the years and a quarters_vector of quarters. We can then call map2_dfr() to loop over those quarters. For multiple years, we would create a vector years and loop over that as well.

year_for_function <- "2018"

quarters_vector <- c("q1", "q2", "q3", "q4")

market_structure_2018 <- 
map2_dfr(year_for_function, quarters_vector, market_str_data_import)

Let’s start with the same count of stocks and ETFs

market_structure_2018 %>%
  group_by(security) %>%
  summarise(distinct_tickers = n_distinct(ticker))

# A tibble: 2 x 2
  security distinct_tickers
  <chr>               <int>
1 ETF                  2132
2 Stock                3928

Same as before, we have 6060 tickers, broken down into 2132 ETFs and 3928 stocks, but now we have an entire year of daily observations.

Let’s grab one ticker, ‘TSLA’ and let’s also add another column called ‘day of the week’. We eventually want to visualize how certain market structure metrics are distributed across different trading days, so we need to convert from calendar dates to M-F.

ticker_chosen <- "TSLA"

market_structure_2018 %>%
  filter(ticker == ticker_chosen) %>%
  mutate(`day of the week` = as_factor(wday(date, label = TRUE, abbr = FALSE))) %>%
  select(date, `day of the week`, everything()) %>% 
  head()

# A tibble: 6 x 20
  date       `day of the wee… security ticker mcap_rank turn_rank
  <date>     <ord>            <chr>    <chr>      <dbl>     <dbl>
1 2018-01-02 Tuesday          Stock    TSLA          10        10
2 2018-01-03 Wednesday        Stock    TSLA          10        10
3 2018-01-04 Thursday         Stock    TSLA          10        10
4 2018-01-05 Friday           Stock    TSLA          10        10
5 2018-01-08 Monday           Stock    TSLA          10        10
6 2018-01-09 Tuesday          Stock    TSLA          10        10
# … with 14 more variables: volatility_rank <dbl>, price_rank <dbl>,
#   lit_vol_000 <dbl>, order_vol_000 <dbl>, hidden <dbl>,
#   trades_for_hidden <dbl>, hidden_vol_000 <dbl>,
#   trade_vol_for_hidden_000 <dbl>, cancels <dbl>, lit_trades <dbl>,
#   odd_lots <dbl>, trades_for_odd_lots <dbl>, odd_lot_vol_000 <dbl>,
#   trade_vol_for_odd_lots_000 <dbl>

From here, we use ggplot to create density plots, faceted by day of the week.

ticker_chosen <- "TSLA"

market_structure_2018 %>%
  filter(ticker == ticker_chosen) %>%
  mutate(`day of the week` = as_factor(wday(date, label = TRUE, abbr = FALSE))) %>%
  select(date, `day of the week`, everything()) %>%
  ggplot(aes(x = order_vol_000)) +
  geom_density(aes(fill = `day of the week`), alpha = .3) +
  facet_grid(rows = vars(`day of the week`))

That looks pretty good, but let’s do some aesthetic cleanup. I don’t love that geom_density() includes the black line at the bottom. Let’s paint a white line over it with geom_hline(yintercept = 0, colour = "white", size = 1). We can add a title and caption with labs(y = "", x = "", title = paste("2018 Order Volume Density for", ticker_chosen, sep = " "), caption = "source: SEC Midas").

Finally, order_vol_000 column is in the 000’s. Let’s clean up the x-axis labels with scale_x_continuous(labels = function(l) {paste0(round(l / 1e6, 2), "B")}...).

ticker_chosen <- "TSLA"

market_structure_2018 %>%
  filter(ticker == ticker_chosen) %>%
  mutate(`day of the week` = as_factor(wday(date, label = TRUE, abbr = FALSE))) %>%
  select(date, `day of the week`, everything()) %>%
  ggplot(aes(x = order_vol_000)) +
  geom_density(aes(fill = `day of the week`), alpha = .3) +
  geom_hline(yintercept = 0,
             colour = "white",
             size = 1) +
  facet_grid(rows = vars(`day of the week`)) +
  labs(y = "", x = "",  title = paste("2018 Order Volume Density for", ticker_chosen, sep = " "), caption = "source: SEC Midas") +
  scale_x_continuous(
    labels = function(l) {paste0(round(l / 1e6, 2), "B")},
    breaks = scales::pretty_breaks(n = 10)
  ) +
   theme(plot.title = element_text(hjust = .5),
         plot.subtitle = element_text(hjust = .5))

That’s all for today!

Shameless book plug for those who read to the end: if you like this sort of thing, check out my book Reproducible Finance with R!

Thanks for reading and see you next time.

Here’s a quick addendum to this weekend’s post on order volume. Instead of comparing ETFs and stocks, let’s compare ETFs and just stocks in the SP100, i.e. probably the 100 stocks with the highest daily volume.

First, let’s get the tickers for the SP100. To do that, I’m going to import the tickers for OEF, Blackrock’s ETF that tracks the SP100. BLK is nice about making their fund holdings available so we can grab the tickers from this link: https://www.ishares.com/us/products/239723/ishares-sp-100-etf/1467271812596.ajax?fileType=csv&fileName=OEF_holdings&dataType=fund.

oef_sp100_holdings <- read_csv("https://www.ishares.com/us/products/239723/ishares-sp-100-etf/1467271812596.ajax?fileType=csv&fileName=OEF_holdings&dataType=fund", 
    col_types = cols(ISIN = col_skip(), `Notional Value` = col_skip(), 
        SEDOL = col_skip()), skip = 9) %>% 
  mutate(fund = "oef", 
         download_date = ymd(Sys.Date()))


oef_sp100_holdings %>% 
  head()

# A tibble: 6 x 11
  Ticker Name  `Asset Class` `Weight (%)` Price Shares `Market Value`
  <chr>  <chr> <chr>                <dbl> <dbl>  <dbl>          <dbl>
1 AAPL   APPL… Equity                7.41  310. 1.38e6     426849725.
2 MSFT   MICR… Equity                7.09  162. 2.52e6     408162395.
3 AMZN   AMAZ… Equity                4.54 1901. 1.37e5     261333541.
4 FB     FACE… Equity                3.01  218. 7.94e5     173402676.
5 BRKB   BERK… Equity                2.56  229. 6.46e5     147626272.
6 JPM    JPMO… Equity                2.47  137. 1.04e6     142284210.
# … with 4 more variables: Sector <chr>, Exchange <chr>, fund <chr>,
#   download_date <date>

We have a wealth of information in this data frame but we need only the tickers so that we can filter our original Midas data set down to stocks in the SP100.

oef_sp100_tickers <- 
oef_sp100_holdings %>% 
  clean_names() %>% 
  select(ticker, sector, asset_class) %>%
  mutate(ticker = str_replace(ticker, "BRKB", "BRK.B")) %>% 
  filter(asset_class == "Equity") %>% 
  pull(ticker)

Now that we have the tickers for the SP100, we can filter our original data down to just ETFs and stock tickers that match the vector oef_sp100_tickers.

market_structure_2018_etfs_sp100 <- 
market_structure_2018 %>% 
  filter(security == "ETF" | ticker %in% oef_sp100_tickers)

market_structure_2018_etfs_sp100 %>% 
  group_by(security) %>%
  summarise(distinct_tickers = n_distinct(ticker))

# A tibble: 2 x 2
  security distinct_tickers
  <chr>               <int>
1 ETF                  2132
2 Stock                  94

We matched on only 94 of the 101. Not a huge deal for today’s purposes but let’s figure out which ones did not match.

setdiff(oef_sp100_tickers, market_structure_2018_etfs_sp100$ticker)

[1] "MDT" "ACN" "AGN" "SLB" "DD"  "SPG" "DOW"

Well, MDT pops out a bit, that’s Medtronic. Probably worth investigating why we can’t locate that ticker in the Midas/SEC data, for now, though, we’ll plow on to some visualizations.

market_structure_2018_etfs_sp100 %>% 
  select(date, security, ticker, order_vol_000) %>% 
  mutate(security = case_when(security == "Stock" ~ "SP100", 
                              TRUE ~ "ETF")) %>% 
  group_by(security, date) %>% 
  summarise(mean_vol = mean(order_vol_000)) %>% 
  ggplot(aes(x = date, y = mean_vol, color = security)) +
    geom_line() +
  scale_x_date(breaks = scales::pretty_breaks(n = 20)) +
  scale_y_continuous(label = scales::comma_format(), breaks = scales::pretty_breaks(n = 10)) +
  theme(axis.text.x = element_text(angle = 90),
        plot.title = element_text(hjust = .5),
        plot.subtitle = element_text(hjust = .5)) +
  labs(x = "", y = "mean volume (000's)", title = "Average Order Volume by Type", subtitle = "Q2 2019") +
  scale_colour_manual(values = c("#ff6699", "cornflowerblue"))

etf_plot <- 
market_structure_2018_etfs_sp100 %>%
  group_by(security, date) %>% 
  summarise(mean_vol = mean(order_vol_000)) %>% 
  mutate(`day of the week` = as_factor(wday(date, label = TRUE, abbr = FALSE))) %>%
  filter(security == "ETF") %>% 
  select(date, `day of the week`, everything()) %>%
  ggplot(aes(x = mean_vol)) +
  geom_density(aes(fill = `day of the week`), alpha = .3) +
  geom_hline(yintercept = 0,
             colour = "white",
             size = 1) +
  facet_grid(rows = vars(`day of the week`)) +
  labs(y = "", x = "",  title = "2018 Order Volume Density for ETFs", caption = "source: SEC Midas") +
  scale_x_continuous(
    labels = function(l) {paste0(round(l / 1e6, 2), "B")},
    breaks = scales::pretty_breaks(n = 10)
  ) +
   theme(plot.title = element_text(hjust = .5),
         plot.subtitle = element_text(hjust = .5))

etf_plot

sp100_plot <- market_structure_2018_etfs_sp100 %>%
  select(date, security, ticker, order_vol_000) %>% 
  mutate(security = case_when(security == "Stock" ~ "SP100", 
                              TRUE ~ "ETF")) %>% 
  group_by(security, date) %>% 
  summarise(mean_vol = mean(order_vol_000)) %>% 
  mutate(`day of the week` = as_factor(wday(date, label = TRUE, abbr = FALSE))) %>%
  filter(security == "SP100") %>% 
  select(date, `day of the week`, everything()) %>%
  ggplot(aes(x = mean_vol)) +
  geom_density(aes(fill = `day of the week`), alpha = .3) +
  geom_hline(yintercept = 0,
             colour = "white",
             size = 1) +
  facet_grid(rows = vars(`day of the week`)) +
  labs(y = "", x = "",  title = "2018 Order Volume Density for SP100", caption = "source: SEC Midas") +
  scale_x_continuous(
    labels = function(l) {paste0(round(l / 1e6, 2), "B")},
    breaks = scales::pretty_breaks(n = 10)
  ) +
   theme(plot.title = element_text(hjust = .5),
         plot.subtitle = element_text(hjust = .5))

sp100_plot

library(patchwork)

etf_plot / sp100_plot