class: center, middle, inverse, title-slide .title[ # 36-613: Data Visualization ] .subtitle[ ## 1D Categorical Data ] .author[ ### Professor Ron Yurko ] .date[ ### 8/31/2022 ] --- ## 1D categorical data Two different versions of categorical: -- #### _Nominal_: coded with arbitrary numbers, i.e., no real order + Examples: race, gender, species, text -- #### _Ordinal_: levels with a meaningful order + Examples: education level, grades, ranks -- #### __NOTE__: `R` and `ggplot` considers a categorical variable to be `factor` + `R` will always treat categorical variables as ordinal! Defaults to alphabetical... + We will need to manually define the `factor` levels --- ## 1D categorical data structure + Observations are collected into a vector `\((x_1, \dots, x_n)\)`, where `\(n\)` is number of observations + Each observed value `\(x_i\)` can only belong to one category level `\(\{ C_1, C_2, \dots \}\)` -- We're going to look at `penguins` data from the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/) package, specifically the `species` column: ```r library(palmerpenguins) head(penguins$species) ``` ``` ## [1] Adelie Adelie Adelie Adelie Adelie Adelie ## Levels: Adelie Chinstrap Gentoo ``` #### How could we summarize these data? What information would you report? -- Tables are the most common form of non-graphical EDA: ```r table(penguins$species) ``` ``` ## ## Adelie Chinstrap Gentoo ## 152 68 124 ``` --- ## Area plots <img src="https://clauswilke.com/dataviz/directory_of_visualizations_files/figure-html/proportions-1.png" width="100%" /> -- - Each area corresponds to one categorical level - Area is proportional to counts/frequencies/percentages - Differences between areas correspond to differences between counts/frequencies/percentages --- ## Bar charts .pull-left[ - Rectangular bar is created for each unique categorical level - heights `\(\propto\)` counts (proportions) - width `\(\propto\)` 1 (i.e., nothing!) - `\(\rightarrow\)` area `\(\propto\)` counts (proportions) ```r library(tidyverse) penguins %>% ggplot(aes(x = species)) + * geom_bar() ``` - `geom_bar` to display bar charts + appears to count the levels... ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-4-1.png" width="100%" /> ] --- ## Behind the scenes: statistical summaries <img src="https://d33wubrfki0l68.cloudfront.net/70a3b18a1128c785d8676a48c005ee9b6a23cc00/7283c/images/visualization-stat-bar.png" width="100%" /> From [Chapter 3 of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html) --- ## Spine charts .pull-left[ Consists of a single bar whose height or width is divided into different categories - with two versions: - height `\(\propto\)` counts (proportions) ```r penguins %>% * ggplot(aes(fill = species, x = "")) + geom_bar() ``` - width `\(\propto\)` counts (proportions) ```r penguins %>% ggplot(aes(fill = species, x = "")) + geom_bar() + * coord_flip() ``` ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-6-1.png" width="100%" /> <img src="figs/Lec2/unnamed-chunk-7-1.png" width="100%" /> ] --- ## What does a bar chart show? #### Marginal Distribution - Assume categorical variable `\(X\)` has `\(K\)` categories: `\(C_1, \dots, C_K\)` - __True__ marginal distribution of `\(X\)`: $$ P(X = C_j) = p_j,\ j \in \{ 1, \dots, K \} $$ -- #### We have access to the Empirical Marginal Distribution - Observed distribution of `\(X\)`, our best estimate (MLE) of the marginal distribution of `\(X\)`: `\(\hat{p}_1\)`, `\(\hat{p}_2\)`, `\(\dots\)`, `\(\hat{p}_K\)` ```r # Proportion estimates for penguins species table(penguins$species) / nrow(penguins) ``` ``` ## ## Adelie Chinstrap Gentoo ## 0.4418605 0.1976744 0.3604651 ``` --- ## Bar charts with proportions .pull-left[ - [`after_stat()`](https://ggplot2.tidyverse.org/reference/aes_eval.html) indicates the aesthetic mapping is performed after statistical transformation - Use `after_stat(count)` to access the `stat_count()` called by `geom_bar()` ```r penguins %>% ggplot(aes(x = species)) + * geom_bar(aes(y = after_stat(count) / * sum(after_stat(count)))) + labs(y = "Proportion") ``` - Kind of weird code to use... ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-9-1.png" width="100%" /> ] --- ## Compute and display the proportions directly .pull-left[ ```r penguins %>% * group_by(species) %>% * summarize(count = n(), * .groups = "drop") %>% * mutate(total = sum(count), * prop = count / total) %>% ggplot(aes(x = species)) + geom_bar(aes(y = prop), stat = "identity") ``` - Use `group_by()`, `summarize()`, and `mutate()` in a pipeline to compute then display the proportions directly - Need to indicate we are displaying the `y` axis as given, i.e., the identity function ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-10-1.png" width="100%" /> ] --- ## Statistical inference for proportions - Our estimate for `\(p_j\)` is `\(\hat{p}_j = \frac{n_j}{n}\)`, compute the standard error as: $$ SE(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}} $$ -- - Compute `\(\alpha\)`-level __confidence interval__ (CI) as `\(\hat{p}_j \pm z_{1 - \alpha / 2} \cdot SE(\hat{p}_j)\)` - Good rule-of-thumb: construct 95% CI using `\(\hat{p}_j \pm 2 \cdot SE(\hat{p}_j)\)` -- - Just an approximation justified by CLT, so CI could include values outside of [0,1] -- #### Add CIs to bars for 1D categorical data - Need to remember each CI is for each `\(\hat{p}_j\)` marginally, not jointly - Have to be careful with __multiple testing__ --- ## Add standard errors to bars .pull-left[ ```r penguins %>% group_by(species) %>% summarize(count = n(), .groups = "drop") %>% mutate(total = sum(count), prop = count / total, * se = sqrt(prop * (1 - prop) / total), * lower = prop - 2 * se, * upper = prop + 2 * se) %>% ggplot(aes(x = species)) + geom_bar(aes(y = prop), stat = "identity") + * geom_errorbar(aes(ymin = lower, * ymax = upper), * color = "red") ``` - If CIs don’t overlap `\(\rightarrow\)` likely significant difference - If CIs overlap a little `\(\rightarrow\)` ambiguous - If CIs overlap a lot `\(\rightarrow\)` no significant difference ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-11-1.png" width="100%" /> ] --- ## Why does this matter? .pull-left[ <img src="figs/Lec2/unnamed-chunk-12-1.png" width="100%" /> ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-13-1.png" width="100%" /> ] --- ## Graphs can appear the same with very different statistical conclusions - mainly due to sample size .pull-left[ <img src="figs/Lec2/unnamed-chunk-14-1.png" width="100%" /> ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-15-1.png" width="100%" /> ] --- ## Useful to order categories by frequency with [`forcats`](https://forcats.tidyverse.org/) .pull-left[ ```r penguins %>% group_by(species) %>% summarize(count = n(), .groups = "drop") %>% mutate(total = sum(count), prop = count / total, se = sqrt(prop * (1 - prop) / total), lower = prop - 2 * se, upper = prop + 2 * se, species = * fct_reorder(species, prop)) %>% ggplot(aes(x = species)) + geom_bar(aes(y = prop), stat = "identity") + geom_errorbar(aes(ymin = lower, ymax = upper), color = "red") ``` ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-16-1.png" width="100%" /> ] --- ## So you want to make pie charts... .pull-left[ - Circle is divided up into sections, i.e., _pie slices_, one slice for each category - Total area `\(= \pi r^2\)`, slice area `\(= \frac{\pi r^2 \cdot \theta}{360}\)` - Angle `\(\theta \propto\)` counts (proportions), and radius `\(r \propto 1\)` ```r penguins %>% ggplot(aes(fill = species, x = "")) + * geom_bar(aes(y = after_stat(count))) + * coord_polar(theta = "y") ``` ] .pull-right[ <img src="figs/Lec2/unnamed-chunk-17-1.png" width="100%" /> ] --- ## It's true... <img src="https://cdn-media-1.freecodecamp.org/images/5S8tNA6GCGEl-ANW7fu20o63I35bZ46Trsdy" width="50%" style="display: block; margin: auto;" /> --- ## But why?... .pull-left[ <img src="https://datachant.com/wp-content/uploads/2019/10/Bad-Pie-Chart-1.png" width="90%" style="display: block; margin: auto;" /> ] -- .pull-right[ <img src="https://www.reactiongifs.us/wp-content/uploads/2013/05/this_is_useless_star_wars.gif" width="90%" style="display: block; margin: auto;" /> ] #### You should almost always stick to bars! --- class: center, middle # Next time: 2D categorical data Recap: __Make bar charts with standard errors for 1D categorical data__ Recommended reading: [CW Chapter 10 Visualizing proportions](https://clauswilke.com/dataviz/visualizing-proportions.html) [CW Chapter 16.2 Visualizing the uncertainty of point estimates](https://clauswilke.com/dataviz/visualizing-uncertainty.html#visualizing-the-uncertainty-of-point-estimates) [KH Chapter 3 Make a plot](https://socviz.co/makeplot.html#makeplot)