class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## 1D Categorical Data
]
.author[
### Professor Ron Yurko
]
.date[
### 8/31/2022
]

---

## 1D categorical data

Two different versions of categorical:

#### _Nominal_: coded with arbitrary numbers, i.e., no real order

+ Examples: race, gender, species, text
  
--

#### _Ordinal_: levels with a meaningful order

+ Examples: education level, grades, ranks
  
--

#### __NOTE__: `R` and `ggplot` treat a categorical variable as a `factor`

+ `R` will always treat categorical variables as ordinal, in the sense that `factor` levels are stored in a fixed order! Defaults to alphabetical...
  
  + We will need to manually define the `factor` levels
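
A minimal sketch of defining the levels manually, using made-up survey responses (the `responses` vector and its values are hypothetical, not from the course data):

```r
# Hypothetical ordinal responses (illustrative data only)
responses <- c("medium", "low", "high", "low", "medium")

# Default: levels are sorted alphabetically
default_factor <- factor(responses)
levels(default_factor)

# Manually define the levels in their meaningful order
manual_factor <- factor(responses, levels = c("low", "medium", "high"))
levels(manual_factor)
```

The first call gives the levels as high, low, medium; the second gives low, medium, high, which is the order a plot axis would then follow.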

---

## 1D categorical data structure

+ Observations are collected into a vector `$(x_1, \dots, x_n)$`, where `$n$` is the number of observations

+ Each observed value `$x_i$` belongs to exactly one of the category levels `$\{ C_1, C_2, \dots \}$`

We're going to look at `penguins` data from the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/) package, specifically the `species` column:

```r
library(palmerpenguins)
head(penguins$species)
```

```
## [1] Adelie Adelie Adelie Adelie Adelie Adelie
## Levels: Adelie Chinstrap Gentoo
```

#### How could we summarize these data? What information would you report?

Tables are the most common form of non-graphical EDA:

```r
table(penguins$species)
```

```
## 
##    Adelie Chinstrap    Gentoo 
##       152        68       124
```

---

## Area plots

- Each area corresponds to one categorical level

- Area is proportional to counts/frequencies/percentages

- Differences between areas correspond to differences between counts/frequencies/percentages

---

## Bar charts

.pull-left[

- Rectangular bar is created for each unique categorical level

- heights `$\propto$` counts (proportions)

- width `$\propto$` 1 (i.e., nothing!)

- `$\rightarrow$` area `$\propto$` counts (proportions)

```r
library(tidyverse)
penguins %>% 
  ggplot(aes(x = species)) +
* geom_bar()
```

- `geom_bar` to display bar charts

+ appears to count the observations in each level for us (by default it uses `stat_count()`)...

]

.pull-right[

]

---

## Behind the scenes: statistical summaries

From [Chapter 3 of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html)
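
One way to see the statistical summary that `geom_bar()` computes is to inspect the data behind the plotted layer. A sketch using `layer_data()` from `ggplot2` (the object names `species_bars` and `bar_data` are illustrative):

```r
library(ggplot2)
library(palmerpenguins)

species_bars <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# Data computed by the layer's stat (stat_count by default for geom_bar)
bar_data <- layer_data(species_bars)
bar_data$count
```

The `count` column holds one value per species, matching the `table()` output, which suggests the bar heights come from a summary of the data rather than from the raw rows.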

---

## Spine charts

.pull-left[

A single bar whose height or width is divided among the categories, with two versions:

- height `$\propto$` counts (proportions)

```r
penguins %>% 
* ggplot(aes(fill = species, x = "")) +
  geom_bar() 
```

- width `$\propto$` counts (proportions)

```r
penguins %>% 
  ggplot(aes(fill = species, x = "")) + 
  geom_bar() +
* coord_flip()
```

]

.pull-right[

]

---

## What does a bar chart show?

#### Marginal Distribution

- Assume categorical variable `$X$` has `$K$` categories: `$C_1, \dots, C_K$`

- __True__ marginal distribution of `$X$`:

$$
P(X = C_j) = p_j,\ j \in \{ 1, \dots, K \}
$$

#### We have access to the Empirical Marginal Distribution

- Observed distribution of `$X$`, our best estimate (MLE) of the marginal distribution of `$X$`: `$\hat{p}_1$`, `$\hat{p}_2$`, `$\dots$`, `$\hat{p}_K$`

```r
# Proportion estimates for penguins species
table(penguins$species) / nrow(penguins)
```

```
## 
##    Adelie Chinstrap    Gentoo 
## 0.4418605 0.1976744 0.3604651
```

---

## Bar charts with proportions

.pull-left[

- [`after_stat()`](https://ggplot2.tidyverse.org/reference/aes_eval.html) indicates the aesthetic mapping is performed after the statistical transformation

- Use `after_stat(count)` to access the `count` variable computed by `stat_count()`, which `geom_bar()` calls

```r
penguins %>% 
  ggplot(aes(x = species)) +
* geom_bar(aes(y = after_stat(count) /
*                sum(after_stat(count)))) +
  labs(y = "Proportion")
```

- Kind of weird code to use...

]

.pull-right[

]

---

## Compute and display the proportions directly

.pull-left[

```r
penguins %>%
* group_by(species) %>%
* summarize(count = n(),
*           .groups = "drop") %>%
* mutate(total = sum(count),
*        prop = count / total) %>%
  ggplot(aes(x = species)) +
  geom_bar(aes(y = prop),
           stat = "identity") 
```

- Use `group_by()`, `summarize()`, and `mutate()` in a pipeline to compute then display the proportions directly

- Need to indicate we are displaying the `y` axis as given, i.e., the identity function

]

.pull-right[

]
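
A more compact sketch of the same idea, assuming a recent `tidyverse`: `count()` replaces the `group_by()` and `summarize()` steps, and `geom_col()` is the `ggplot2` shorthand for `geom_bar(stat = "identity")`. The name `species_props` is only for illustration:

```r
library(tidyverse)
library(palmerpenguins)

# Counts and proportions per species
species_props <- penguins %>%
  count(species, name = "count") %>%
  mutate(prop = count / sum(count))

# geom_col() plots the y values as given
species_props %>%
  ggplot(aes(x = species, y = prop)) +
  geom_col() +
  labs(y = "Proportion")
```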

---

## Statistical inference for proportions

- Our estimate for `$p_j$` is `$\hat{p}_j = \frac{n_j}{n}$`, where `$n_j$` is the number of observations in category `$C_j$`, with standard error:

$$
SE(\hat{p}_j) = \sqrt{\frac{\hat{p}_j(1 - \hat{p}_j)}{n}}
$$

- Compute a `$(1 - \alpha)$`-level __confidence interval__ (CI) as `$\hat{p}_j \pm z_{1 - \alpha / 2} \cdot SE(\hat{p}_j)$`

- Good rule-of-thumb: construct a 95% CI using `$\hat{p}_j \pm 2 \cdot SE(\hat{p}_j)$`, since `$z_{0.975} \approx 1.96$`

- Just an approximation justified by the CLT, so the CI could include values outside of [0, 1]

#### Add CIs to bars for 1D categorical data

- Need to remember each CI is for each `$\hat{p}_j$` marginally, not jointly

- Have to be careful with __multiple testing__
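
The standard error and interval formulas on this slide can be worked through in base `R`. A sketch using the species counts from the earlier `table()` output (variable names such as `ci_table` are illustrative, and the intervals are marginal, one per species):

```r
# Counts copied from the table(penguins$species) output
species_counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)
n_total <- sum(species_counts)

# Estimated proportions and their standard errors
prop_hat <- species_counts / n_total
se_hat <- sqrt(prop_hat * (1 - prop_hat) / n_total)

# Normal quantile for a 95% CI (alpha = 0.05); close to the rule-of-thumb 2
alpha <- 0.05
z_crit <- qnorm(1 - alpha / 2)

ci_table <- data.frame(
  species = names(species_counts),
  prop = as.numeric(prop_hat),
  se = as.numeric(se_hat),
  lower = as.numeric(prop_hat - z_crit * se_hat),
  upper = as.numeric(prop_hat + z_crit * se_hat)
)
ci_table
```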

---

## Add standard errors to bars

.pull-left[

```r
penguins %>%
  group_by(species) %>% 
  summarize(count = n(), .groups = "drop") %>% 
  mutate(total = sum(count), 
         prop = count / total,
*        se = sqrt(prop * (1 - prop) / total),
*        lower = prop - 2 * se,
*        upper = prop + 2 * se) %>%
  ggplot(aes(x = species)) +
  geom_bar(aes(y = prop),
           stat = "identity") +
* geom_errorbar(aes(ymin = lower,
*                   ymax = upper),
*               color = "red")
```

- If CIs don’t overlap `$\rightarrow$` likely significant difference

- If CIs overlap a little `$\rightarrow$` ambiguous

- If CIs overlap a lot `$\rightarrow$` no significant difference

]

.pull-right[

]

---

## Why does this matter?

.pull-left[

]

.pull-right[

]

---

## Graphs can appear the same with very different statistical conclusions - mainly due to sample size

.pull-left[

]

.pull-right[

]
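
A small numeric sketch of this point, using made-up proportions and sample sizes (all values below are hypothetical): the bars would look identical in both cases, but the rule-of-thumb interval widths differ by a factor of 10.

```r
# Hypothetical proportions for three categories (same bar heights in both cases)
prop_hat <- c(A = 0.5, B = 0.3, C = 0.2)

# Half-width of the rule-of-thumb 95% CI: 2 * SE
ci_half_width <- function(prop, n) {
  2 * sqrt(prop * (1 - prop) / n)
}

half_small <- ci_half_width(prop_hat, n = 50)    # small sample
half_large <- ci_half_width(prop_hat, n = 5000)  # 100 times more data

# Intervals shrink by sqrt(100) = 10 with 100 times the sample size
half_small / half_large
```

With `n = 50` the intervals for A and B overlap, while with `n = 5000` they are clearly separated, even though the plotted proportions are the same.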

---

## Useful to order categories by frequency with [`forcats`](https://forcats.tidyverse.org/)

.pull-left[

```r
penguins %>%
  group_by(species) %>% 
  summarize(count = n(), .groups = "drop") %>% 
  mutate(total = sum(count), 
         prop = count / total,
         se = sqrt(prop * (1 - prop) / total), 
         lower = prop - 2 * se, 
         upper = prop + 2 * se,
         species = 
*          fct_reorder(species, prop)) %>%
  ggplot(aes(x = species)) +
  geom_bar(aes(y = prop),
           stat = "identity") +
  geom_errorbar(aes(ymin = lower, 
                    ymax = upper), 
                color = "red") 
```

]

.pull-right[

]
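
When plotting the raw data rather than a precomputed summary, a related `forcats` helper, `fct_infreq()`, orders the levels by how often they appear, most frequent first (whereas `fct_reorder(species, prop)` sorts by increasing `prop`). A sketch, with `species_by_freq` as an illustrative name:

```r
library(tidyverse)
library(palmerpenguins)

# Levels reordered from most to least frequent
species_by_freq <- fct_infreq(penguins$species)
levels(species_by_freq)

# Bar chart of counts with bars in decreasing order of frequency
penguins %>%
  mutate(species = fct_infreq(species)) %>%
  ggplot(aes(x = species)) +
  geom_bar()
```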

---

## So you want to make pie charts...

.pull-left[

- Circle is divided up into sections, i.e., _pie slices_, one slice for each
category

- Total area `$= \pi r^2$`, slice area `$= \frac{\pi r^2 \cdot \theta}{360}$` (with `$\theta$` in degrees)

- Angle `$\theta \propto$` counts (proportions), and radius `$r \propto 1$`

```r
penguins %>% 
  ggplot(aes(fill = species, x = "")) + 
* geom_bar(aes(y = after_stat(count))) +
* coord_polar(theta = "y")
```

]

.pull-right[

]
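
To connect the slice-area formula to the data, here is a sketch that computes each slice's angle and area from the species counts shown earlier. The radius of 1 is an arbitrary assumption, since it is the same for every slice:

```r
# Counts copied from the table(penguins$species) output
species_counts <- c(Adelie = 152, Chinstrap = 68, Gentoo = 124)

# Angle (in degrees) is proportional to the counts
slice_angles <- 360 * species_counts / sum(species_counts)

# Slice area from the formula: pi * r^2 * theta / 360
radius <- 1
slice_areas <- pi * radius^2 * slice_angles / 360

# Each slice's share of the total area equals its proportion
slice_areas / sum(slice_areas)
```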

---

## It's true...

---

## But why?...

.pull-left[

]

.pull-right[
<img src="https://www.reactiongifs.us/wp-content/uploads/2013/05/this_is_useless_star_wars.gif" width="90%" style="display: block; margin: auto;" />
]

#### You should almost always stick to bars!

---
class: center, middle

# Next time: 2D categorical data

Recap: __Make bar charts with standard errors for 1D categorical data__