class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## Introduction and the Grammar of Graphics
]
.author[
### Professor Ron Yurko
]
.date[
### 8/29/2022
]

---

## Who am I?

.pull-left[

* Assistant Teaching Professor

* Finished PhD in Statistics @ CMU in May 2022

* Previously BS in Statistics @ CMU in 2015

* Research interests: statistical genetics, selective inference, clustering, sparsity, statistics in sports

* Industry experience: briefly worked in finance before returning to grad school, and also worked as a data scientist in professional sports

]

--

.pull-right[

]

---
name: quartet

## Do these [datasets](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) have anything in common?

.center[]

---

## __Always visualize your data__ before analyzing it!

.center[]

Need to understand the interplay between graphics and statistical inference

---

## Course Structure (READ THE SYLLABUS):

#### Monday / Wednesday = Lecture

+ Lectures include example code; all slides and additional `R` demos are posted on https://cmu-36613.netlify.app/

+ Want class discussion - so __participate and ask questions__

--

#### 5 Weekly Homework Assignments due Wednesdays by 11:59 PM EST

+ Posted Wednesday mornings and due one week later

--

#### Two Graphics Critiques / Replications of Data Viz in the Wild

+ Submit two graphics critiques / replications (with pseudocode) of data visualizations you find in the wild (see syllabus for due dates)

+ Both data visualizations must be from a __recent source that was posted online within the past month__

--

#### Group EDA Project due Friday October 14th by 11:59 PM EST

+ Each group will write an IMRD report and present their work in 36-611

---

## Course Objectives (READ THE SYLLABUS):

### Learn useful principles for making appropriate statistical graphics.

### Critique existing graphs and remake better ones.

--

### Visualize statistical analyses to facilitate communication.
### Pinpoint the statistical claims you can/cannot make from graphics.

--

### Practice tidy data manipulation in `R` using the `tidyverse`

### Practice reproducible workflows with `RMarkdown`

---

## What do I mean by `tidy` data?

Data are often stored in __tabular__ (or matrix) form:

```
## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
## 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
## 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
## 4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
## 5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
```

--

+ Each row `==` unit of observation, e.g., penguins

--

+ Each column `==` variable/measurement about each observation, e.g., `flipper_length_mm`

--

+ Known as a `data.frame` in base `R` and `tibble` in the `tidyverse`

--

+ Two main variable types: quantitative and categorical

--

__How do we convert data into visualizations?__

---

## [The Grammar of Graphics](https://link.springer.com/book/10.1007/0-387-28695-0) - by Leland Wilkinson

.pull-left[

All plots can be broken down into core components

1. __data__
2. __geometries__: type of geometric objects to represent data, e.g., points, lines
3. __aesthetics__: visual characteristics of geometric objects to represent data, e.g., position, size
4. __scales__: how each aesthetic is converted into values on the graph, e.g., color scales
5. __stats__: statistical transformations to summarize data, e.g., counts, means, regression lines
6. __facets__: split data and view as multiple graphs
7. __coordinate system__: 2D space the data are projected onto, e.g., Cartesian coordinates

]

--

.pull-right[

[Hadley Wickham](http://hadley.nz/) [expanded upon this](http://vita.had.co.nz/papers/layered-grammar.pdf) with [`ggplot2`](https://ggplot2.tidyverse.org/)

1. `data`
2. `geom`
3. `aes`: mappings of columns to geometric objects
4. 
`scale`: one scale for each `aes` variable
5. `stat`
6. `facet`
7. `coord`
8. `labs`: labels/guides for each variable and other parts of the plot, e.g., title, subtitle, caption
9. `theme`: customization of plot layout

]

---

## Start with the `data`

.pull-left[

Access `ggplot2` from the `tidyverse` (the `penguins` data come from the `palmerpenguins` package):

```r
library(tidyverse)
library(palmerpenguins) # provides the penguins data
*ggplot(data = penguins)
```

Or equivalently using `%>%`:

```r
*penguins %>%
  ggplot()
```

]

--

.pull-right[

__Nothing is displayed__

<img src="figs/Lec1/unnamed-chunk-3-1.png" width="100%" />

]

---

## Add geometric object with columns mapped to aesthetics

.pull-left[

+ Use the `+` operator in `ggplot` to add layers

+ Map `bill_length_mm` to x-axis and `bill_depth_mm` to y-axis

```r
penguins %>%
* ggplot(aes(x = bill_length_mm,
*            y = bill_depth_mm)) +
* geom_point()
```

+ NOTE: `aes()` is passed to the `mapping` argument, so the code above is shorthand for:

```r
penguins %>%
* ggplot(mapping = aes(x = bill_length_mm,
                       y = bill_depth_mm)) +
  geom_point()
```

]

--

.pull-right[

__And now we have a scatterplot!__

<img src="figs/Lec1/unnamed-chunk-5-1.png" width="100%" />

]

---

## Modify scale, add statistical summary, and so on...

.pull-left[

+ Set fixed (non-mapped) aesthetics, e.g., `alpha`, outside of `aes`

```r
penguins %>%
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm)) +
  # Adjust alpha of points
* geom_point(alpha = 0.5) +
  # Add linear regression line
* stat_smooth(method = "lm") +
  # Flip the x-axis scale
* scale_x_reverse() +
  # Change title & axes labels
* labs(x = "Bill length (mm)",
*      y = "Bill depth (mm)",
*      title = "Clustering of penguins bills") +
  # Change the theme:
* theme_bw()
```

]

--

.pull-right[

__You will be covering more basics in HW1__

<img src="figs/Lec1/unnamed-chunk-6-1.png" width="100%" />

]

---

## In the beginning...
#### Michael Florent van Langren published the first (known) statistical graphic in 1644

<img src="https://upload.wikimedia.org/wikipedia/commons/6/66/Grados_de_la_Longitud.jpg" width="60%" style="display: block; margin: auto;" />

+ Plots different estimates of the longitudinal distance between Toledo, Spain and Rome, Italy

+ i.e., a visualization of collected data to aid in the estimation of a parameter

--

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQDU0fPHI7y9TstYN0hASi7wlDcBUDnNoTS8yNjXehDAZVJ17glqqGBI7Wxt6y_wdgyyw&usqp=CAU" width="60%" style="display: block; margin: auto;" />

---

## [John Snow](https://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map) Knows Something About Cholera

<img src="https://media.nationalgeographic.org/assets/photos/000/276/27636.jpg" width="60%" style="display: block; margin: auto;" />

---

## [Charles Minard's](https://www.datavis.ca/gallery/minard/minard.pdf) Map of Napoleon's Russian Disaster

<img src="https://datavizblog.files.wordpress.com/2013/05/map-full-size1.png" width="90%" style="display: block; margin: auto;" />

---

## [Florence Nightingale's](https://www.datavis.ca/gallery/flo.php) Rose Diagram

<img src="https://daily.jstor.org/wp-content/uploads/2020/08/florence_nightingagle_data_visualization_visionary_1050x700.jpg" width="75%" style="display: block; margin: auto;" />

---

## [Milestones in Data Visualization History](https://friendly.github.io/HistDataVis/)

<img src="https://www.researchgate.net/profile/Michael-Friendly/publication/45858111/figure/fig1/AS:276894395191302@1443028176705/The-time-distribution-of-events-considered-milestones-in-the-history-of-data.png" width="70%" style="display: block; margin: auto;" />

---

## How to Fail this Class:

<img src="https://socviz.co/assets/ch-01-chartjunk-life-expectancy.png" width="65%" style="display: block; margin: auto;" />

---

## [Edward Tufte's](https://www.edwardtufte.com/tufte/) Principles of Data Visualization

Graphics: visually display
measured quantities by combining points, lines, coordinate systems, numbers, symbols, words, shading, color

--

#### Goal is to show data and/or communicate a story!

--

+ Induce viewer to think about substance, __not graphical methodology__

+ Make large, complex datasets more coherent

+ Encourage comparison of different pieces of data

+ __Describe, explore, and identify relationships__

+ __Avoid data distortion and data decoration__

+ Use consistent graph design

--

#### Avoid graphs that lead to misleading conclusions!

---

<img src="https://github.com/ryurko/SURE22-examples/blob/main/figures/lecture_examples/nyt_ex.png?raw=true" width="110%" style="display: block; margin: auto;" />

--

#### [Think twice before you spiral](https://junkcharts.typepad.com/junk_charts/nyt/)

---
class: center, middle

# Next time: 1D categorical data

Recommended reading:

[CW Chapter 2 Visualizing data: Mapping data onto aesthetics](https://clauswilke.com/dataviz/aesthetic-mapping.html)

[CW Chapter 17 The principle of proportional ink](https://clauswilke.com/dataviz/proportional-ink.html)

[KH Chapter 1 Look at data](https://socviz.co/lookatdata.html#lookatdata)

[KH Chapter 3 Make a plot](https://socviz.co/makeplot.html#makeplot)

Lecture slides created via the `R` packages:

[**xaringan**](https://github.com/yihui/xaringan)<br>
[gadenbuie/xaringanthemer](https://github.com/gadenbuie/xaringanthemer)