# Exploratory Data Visualization for Time-Series and Longitudinal Data

Christian Testa
January 22nd, 2025

Hanage, W.P., Testa, C., Chen, J.T. et al. COVID-19: US federal accountability for entry, spread, and inequities—lessons for the future. Eur J Epidemiol 35, 995–1006 (2020).

---

# Motivations

<span style='font-size: 25px;'>

> "The greatest value of a picture is when it forces us to notice what we never expected to see." –John Tukey

--

> "Visualization is often used for evil - twisting insignificant data changes and making them look meaningful. Don't do that crap if you want to be my friend. Present results clearly and honestly. If something isn't working - those reviewing results need to know." —John Tukey

</span>

--

<img src='images/EleanorLutz.png' height='400px'>

---

# Aims

- Learn how to use data manipulation tools such as `dplyr` and `tidyr`
- Learn how to use `ggplot2`, a powerful, flexible framework for visualizing data in R
- Learn where to find more resources

<img src='images/R-for-Data-Science.jpg'/>

---

# Before we get started

There are some packages you'll want to make sure you have installed.
```r
install.packages("tidyverse")
```

```r
library(tidyverse, quietly = FALSE, warn.conflicts = TRUE)
```

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<>) to force all conflicts to become errors
```

---

# Example Data Set

```r
df <- readr::read_csv("example_data/example_dataset_1.csv")
```

```
## Rows: 400 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): strata, gender
## dbl (4): X2005, X2010, X2015, X2020
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

--

Why use `readr::read_csv`?

- Reports the column types it guessed, with options to override them
- Typically loads faster than base `read.csv()`
- Loads into a tibble, a stricter data.frame with cleaner printing.
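
The "options to override" point can be sketched as follows. This is a minimal illustration, not part of the original analysis: the two-row CSV, the temporary file `tmp`, and the name `df_small` are all made up so that the snippet runs on its own.

```r
library(readr)

# Hypothetical miniature of the example file, written to a temporary path
# so that this sketch does not depend on example_data/.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "strata,gender,X2005,X2010",
  "A,M,10.5,15.8",
  "B,F,1.3,5.3"
), tmp)

# Override the guessed types: read the grouping columns as factors
# and every remaining column as a double.
df_small <- read_csv(
  tmp,
  col_types = cols(
    strata = col_factor(),
    gender = col_factor(),
    .default = col_double()
  )
)

# Inspect the column specification that was actually used.
spec(df_small)
```

Supplying `col_types` also quiets the column specification message, since nothing is left to guess.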
---

# Check out the data

```r
knitr::kable(head(df,4))
```

<table>
 <thead>
  <tr> <th style="text-align:left;"> strata </th> <th style="text-align:left;"> gender </th> <th style="text-align:right;"> X2005 </th> <th style="text-align:right;"> X2010 </th> <th style="text-align:right;"> X2015 </th> <th style="text-align:right;"> X2020 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 10.509583 </td> <td style="text-align:right;"> 15.796463 </td> <td style="text-align:right;"> 14.918578 </td> <td style="text-align:right;"> 25.42652 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 19.413696 </td> <td style="text-align:right;"> 25.082179 </td> <td style="text-align:right;"> 25.200332 </td> <td style="text-align:right;"> 30.43139 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 12.475157 </td> <td style="text-align:right;"> 19.262194 </td> <td style="text-align:right;"> 19.930757 </td> <td style="text-align:right;"> 20.63809 </td> </tr>
  <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> F </td> <td style="text-align:right;"> 1.284895 </td> <td style="text-align:right;"> 5.251643 </td> <td style="text-align:right;"> 8.746856 </td> <td style="text-align:right;"> 10.24585 </td> </tr>
</tbody>
</table>

Note that:

- Data is in a wide format
- Column names need cleaning
- We have groups of participants

---

# Examine the categorical variables

Let's check what those groups are:

```r
unique(df$strata)
```

```
## [1] "A" "B" "C" "D"
```

--

```r
table(df$gender)
```

```
## 
##   F   M 
## 210 190
```

---

# Summarize quantitative variables

--

```r
df %>% 
  select(-c(strata, gender)) %>% 
  summary()
```

```
##      X2005            X2010            X2015            X2020       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.217   1st Qu.: 6.765   1st Qu.: 6.135   1st Qu.: 7.621  
##  Median : 9.292   Median :10.592   Median :10.280   Median :12.065  
##  Mean   : 9.537   Mean   :11.260   Mean   :10.677   Mean   :12.678  
##  3rd Qu.:12.851   3rd Qu.:15.073   3rd Qu.:14.331   3rd Qu.:17.638  
##  Max.   :34.977   Max.   :34.707   Max.   :33.236   Max.   :39.052  
```

--

... Let's break this one down.

---

<img src="images/MagrittrPipe.png" align='right' width='20%' style='padding: 50px;'><br>

# Intro to `%>%` and dplyr

```r
df %>% 
  select(-c(strata, gender)) %>% 
  summary()
```

--

`%>%`, the pipe operator, comes from the `magrittr` package and is re-exported by `dplyr`.

--

`x %>% f()` is equivalent to `f(x)` <br>
`x %>% f(y)` is equivalent to `f(x,y)`

--

`df %>% select(-c(strata, gender))` is equivalent to <br>`select(df, -c(strata, gender))`

--

Read `x %>% f(y)` as "`x` gets passed to `f` with additional argument `y`."

--

Using pipes helps to:

1. chain several commands together,
2. without creating unnecessarily nested one-liners, e.g. <br> `summary(select(df, -c(strata, gender)))`

---

# What's the deal with select?

```r
df %>% 
  select(-c(strata, gender)) %>% 
  summary()
```

`select` is the command for subsetting the columns of a data.frame or tibble.

--

Notice that `strata` and `gender` are not in quotes. This is because `dplyr` and many other tidyverse functions use tidy evaluation, which lets you refer to the columns of a data.frame or tibble as if they were variables.

--

The minus sign says that we want to remove `strata` and `gender`, or equivalently, to select all of the columns except `strata` and `gender`.

```
##      X2005            X2010            X2015            X2020       
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.217   1st Qu.: 6.765   1st Qu.: 6.135   1st Qu.: 7.621  
##  Median : 9.292   Median :10.592   Median :10.280   Median :12.065  
##  Mean   : 9.537   Mean   :11.260   Mean   :10.677   Mean   :12.678  
##  3rd Qu.:12.851   3rd Qu.:15.073   3rd Qu.:14.331   3rd Qu.:17.638  
##  Max.   :34.977   Max.   :34.707   Max.   :33.236   Max.   :39.052  
```

---

# Let's add participant ID numbers

```r
df <- df %>% 
  mutate(id = 1:nrow(.)) %>% 
  select(id, everything())

knitr::kable(head(df,3))
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> strata </th> <th style="text-align:left;"> gender </th> <th style="text-align:right;"> X2005 </th> <th style="text-align:right;"> X2010 </th> <th style="text-align:right;"> X2015 </th> <th style="text-align:right;"> X2020 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 10.50958 </td> <td style="text-align:right;"> 15.79646 </td> <td style="text-align:right;"> 14.91858 </td> <td style="text-align:right;"> 25.42652 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 19.41370 </td> <td style="text-align:right;"> 25.08218 </td> <td style="text-align:right;"> 25.20033 </td> <td style="text-align:right;"> 30.43139 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 12.47516 </td> <td style="text-align:right;"> 19.26219 </td> <td style="text-align:right;"> 19.93076 </td> <td style="text-align:right;"> 20.63809 </td> </tr>
</tbody>
</table>

--

Note the use of `.` here, which refers to the argument passed using `%>%`.

--

Equivalently, we could have written `mutate(df, id = 1:nrow(df))`.

--

`everything()` comes from the `tidyselect` package, which supports programmatic selection of columns and offers other helpers such as `starts_with` and `contains`.

---

# A note about `|>` and `%>%`

<small>
`|>` is a pipe now built into base R starting from version 4.1.0.
The main difference is that `%>%` uses `.` to refer to the left-hand side of the pipe, while `|>` uses `_` instead (available from R 4.2.0, and only as a named argument). There are some subtle differences in what you can do with the left-hand side: for example, `%>%` supports `.$var`, while `|>` only gained the comparable `_$var` form in R 4.3.0. In many regards the two pipes are similar, and you will see more and more code using `|>` because it is built into the language and tends to carry less overhead than `%>%`.
</small>

.pull-left[
<a href=""><img src="images/pipe_article.png" width='300px' /> </a>
]

.pull-right[
Read more here: <>
]

---

# Let's convert to a tidy format

```r
df <- df %>% 
  tidyr::pivot_longer(
    cols = starts_with('X'),
    names_to = 'year',
    values_to = 'rate')

knitr::kable(head(df, 4))
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> strata </th> <th style="text-align:left;"> gender </th> <th style="text-align:left;"> year </th> <th style="text-align:right;"> rate </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> X2005 </td> <td style="text-align:right;"> 10.50958 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> X2010 </td> <td style="text-align:right;"> 15.79646 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> X2015 </td> <td style="text-align:right;"> 14.91858 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> X2020 </td> <td style="text-align:right;"> 25.42652 </td> </tr>
</tbody>
</table>

---

# Get rid of "X"

```r
df <- df %>% mutate(year = stringr::str_remove(year, "X"))

knitr::kable(head(df, 4))
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> strata </th> <th style="text-align:left;"> gender </th> <th style="text-align:left;"> year </th> <th style="text-align:right;"> rate </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> 2005 </td> <td style="text-align:right;"> 10.50958 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:right;"> 15.79646 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> 2015 </td> <td style="text-align:right;"> 14.91858 </td> </tr>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:left;"> 2020 </td> <td style="text-align:right;"> 25.42652 </td> </tr>
</tbody>
</table>

---

# Now we can do some plotting

.pull-left[
```r
ggplot(data = df,
  aes(x = year,
      y = rate)) +
  geom_point()
```
]

.pull-right[
<img src="longitudinal_eda_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]

---

# Now let's try geom_line

.pull-left[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
*     group = id)) +
  geom_line()
```
]

.pull-right[
<img src="longitudinal_eda_files/figure-html/unnamed-chunk-18-1.png" width="100%" />
]

---

# Adding color

.pull-left[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
      group = id,
*     color = strata)) +
* geom_line(alpha=0.5)
```
]

.pull-right[
<img src="longitudinal_eda_files/figure-html/unnamed-chunk-19-1.png" width="100%" />
]

---

# Facet Wrapping

.tiny[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
      group = id,
      color = strata)) +
  geom_line(alpha=0.5) +
* facet_wrap(~strata)
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-20-1.png" width="576" height="100%" style="display: block; margin: auto;" />

---

# Facet Grid

.tiny[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
      group = id,
      color = strata)) +
  geom_line(alpha=0.5) +
* facet_grid(gender~strata) +
  ggtitle("Different strata had different trajectories")
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-21-1.png" width="576" height="100%" style="display: block; margin: auto;" />

---

# Using Stat Summaries

.tiny[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
      group = id,
      color = gender)) +
  geom_line(alpha=0.5) +
  facet_wrap(~strata) +
  stat_summary(aes(group = interaction(strata, gender)),
    fun = mean, geom='line', color = 'black') +
  stat_summary(aes(group = interaction(strata, gender), shape=gender),
    fun = mean, geom='point', size=2, color = 'black') +
  labs(shape = 'Gender and Strata\nLevel Average', color = 'Gender') +
  ggtitle("Men have higher rates than women")
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-22-1.png" width="576" height="100%" style="display: block; margin: auto;" />

---

# Another way using boxplots

.tiny[
```r
ggplot(data = df,
  aes(x = year,
      y = rate,
      color = gender)) +
  geom_boxplot(alpha=0.5) +
  stat_summary(aes(group = interaction(strata, gender), shape=''),
    position = position_dodge(width=0.75),
    fun = mean, geom='point', color = 'grey10', alpha=0.8) +
  facet_wrap(~strata) +
  labs(color = "Gender", shape = "Gender + Strata\nLevel Average") +
  ggtitle("Boxplots allow us to see the interquartile range clearly")
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-23-1.png" width="576" height="100%" style="display: block; margin: auto;" />

---

# Using `geom_ribbon`

.tiny[
```r
df %>% 
  group_by(strata, gender, year) %>% 
  summarize(
    percentile_97.5 = quantile(rate, 0.975),
    percentile_2.5 = quantile(rate, 0.025),
    mean = mean(rate),
    .groups = 'keep') %>% 
  ggplot(aes(
    x = year,
    y = mean,
    ymax = percentile_97.5,
    ymin = percentile_2.5,
    group = gender,
    fill = gender,
    color = gender)) +
  geom_ribbon(alpha=0.5, linewidth = 0) +
  geom_line(aes(linetype='')) +
  facet_wrap(~strata) +
  scale_color_manual(values = c('M' = '#2980b9', 'F' = '#c0392b')) +
  labs(linetype = 'Gender+Strata\nLevel Average', fill = 'Gender', color = 'Gender', y = 'Rate') +
  ggtitle(paste0("The difference between men and women was consistent over time"))
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-24-1.png" width="432" height="100%" style="display: block; margin: auto;" />

---

# Using plotly for interactive graphics

.tiny[
```r
suppressMessages(library(plotly))

ggplotly() %>% layout(width = 8, height = 3.5)
```
]
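
With no arguments, `ggplotly()` converts the most recently drawn ggplot. A more explicit variant might look like the sketch below. The `toy` data frame and the pixel sizes are hypothetical stand-ins chosen for illustration, not values from the original slides.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data standing in for the tidy `df` built earlier.
toy <- data.frame(
  id   = rep(1:3, each = 2),
  year = rep(c("2005", "2010"), times = 3),
  rate = c(10.5, 15.8, 19.4, 25.1, 1.3, 5.3)
)

# Build and store the static plot first.
p <- ggplot(toy, aes(x = year, y = rate, group = id)) +
  geom_line()

# Pass the plot object explicitly; width and height are given in pixels.
p_interactive <- ggplotly(p, width = 800, height = 350)
p_interactive
```

Storing the plot in `p` first means the interactive version does not depend on which plot happened to be drawn last.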
---

# Widening Data for Correlation Analysis

Before we can look at correlation across the years, we need to widen the data frame again (similar to how it was originally formatted).

```r
df_wide <- df %>% 
  tidyr::pivot_wider(
    id_cols = c(id, strata, gender),
    names_from = year,
    values_from = rate)

knitr::kable(head(df_wide, 3))
```

<table>
 <thead>
  <tr> <th style="text-align:right;"> id </th> <th style="text-align:left;"> strata </th> <th style="text-align:left;"> gender </th> <th style="text-align:right;"> 2005 </th> <th style="text-align:right;"> 2010 </th> <th style="text-align:right;"> 2015 </th> <th style="text-align:right;"> 2020 </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 10.50958 </td> <td style="text-align:right;"> 15.79646 </td> <td style="text-align:right;"> 14.91858 </td> <td style="text-align:right;"> 25.42652 </td> </tr>
  <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 19.41370 </td> <td style="text-align:right;"> 25.08218 </td> <td style="text-align:right;"> 25.20033 </td> <td style="text-align:right;"> 30.43139 </td> </tr>
  <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> A </td> <td style="text-align:left;"> M </td> <td style="text-align:right;"> 12.47516 </td> <td style="text-align:right;"> 19.26219 </td> <td style="text-align:right;"> 19.93076 </td> <td style="text-align:right;"> 20.63809 </td> </tr>
</tbody>
</table>

---

# Correlation Plot

.tiny[
```r
# install.packages("GGally")
library(GGally)

ggpairs(df_wide,
  aes(color = strata, alpha=0.25),
  columns = c('2005', '2010', '2015', '2020'),
  progress=FALSE)
```
]

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-27-1.png" width="576" height="100%" style="display: block; margin: auto;" />

---

# Bivariate Pairs Plots

```r
ggbivariate(df, outcome = 'gender', explanatory = 'strata') +
  theme(legend.position = 'bottom') +
  ggtitle("Strata were about evenly split across gender")
```

<img src="longitudinal_eda_files/figure-html/unnamed-chunk-28-1.png" width="432" height="100%" style="display: block; margin: auto;" />

---

<img src="images/tt_logo.png" align='right' width='30%' style='padding: 0px;'>

# Where you can learn more

For data manipulation and visualization:

<img src="images/ggplot2-cheatsheet.png" align='right' width='30%' style='padding: 0px;'>

- R for Data Science, by Garrett Grolemund and Hadley Wickham
- The ggplot2 website
- The RStudio Cheatsheets (I suggest starting with dplyr and ggplot2)
- Watch the TidyTuesday tutorials on YouTube or check out TidyTuesday on GitHub

--

For longitudinal data analysis:

- Applied Longitudinal Analysis, by Garrett Fitzmaurice, Nan Laird, and James Ware
- Marie Davidian's slides on Modeling and Analysis of Longitudinal Data
- Patrick Heagerty's notes on Longitudinal Data Analysis (fairly technical)
- Longitudinal Data Analysis: Autoregressive Linear Mixed Effects Models, by Ikuko Funatogawa and Takashi Funatogawa (very technical)

---

# How to take your longitudinal analysis further

.pull-left[
<!-- <div style='font-size: 75% !important;'> -->
<ul>
Use models to make inferences about your data.
Models for longitudinal data often include the following features:
<br>
<br>
<li> Multi-level or random effects design</li>
<li> Generalized Linear Models</li>
<li> Auto-regressive structure</li>
<li> Treatment of missing data</li>
</ul>
<!-- </div> -->
]

.pull-right[
<img src='images/multilevel-reaction.png'/>

<span style='font-size: 16px;'>
`library(lme4)` <br>
`lmer(Reaction ~ Days + (Days|Subject), sleepstudy)`
</span>
]

---

# Get More Inspiration

.pull-left[

* Georgios Karamanis
* Cedric Scherer
* Eleanor Lutz
* Flowing Data

<img src="images/Kaashoek.png" alt="A figure showing effectiveness of statewide interventions as an effect modifier" />

<span style='font-size: 12px;'>Kaashoek J, Testa C, Chen JT, Stolerman LM, Krieger N, Hanage WP, et al. The evolving roles of US political partisanship and social vulnerability in the COVID-19 pandemic from February 2020–February 2021. PLOS Global Public Health. 2022 </span>
]

.pull-right[
<img src="images/EscalatingDrought.jpg" height='500px' alt='A figure on drought from Cedric Scherer' />
]

---

# Join the R User Group!

.pull-left[
<img src="images/rug.png" alt='screenshot of our RUG YouTube'><br>
Join at <> and check out our YouTube
]

<!-- .pull-right[ <img src="images/tidymodels.png" style='max-height: 450px;' alt="poster for our upcoming talk on Tidymodels" /> ] -->

---

# Find this talk on my GitHub

<img src='screenshot.png' height='450px' />
<img src='images/github_qr.png' style = 'vertical-align: top;' height='175px' />

<>

---

# Image Credits

<div style='font-size:small'>
<ul>
<li>R for Data Science: </li>
<li>Magrittr Logo:</li>
<li>Tidy Tuesday Logo:</li>
<li>Escalating Drought:</li>
</ul>
</div>