36-613: Data Visualization

class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## Density Estimation
]
.author[
### Professor Ron Yurko
]
.date[
### 9/14/2022
]

---

## Simulate data from mixture of Normal distributions

Will sample 100 draws from `$N(-1.5, 1)$` and 100 draws from `$N(1.5, 1)$`

---

# Revisit histograms

.pull-left[

```r
set.seed(2020)
fake_data <- 
  tibble(fake_x = c(rnorm(100, -1.5), 
                    rnorm(100, 1.5))) %>%
  mutate(component = 
           c(rep("left", 100), 
             rep("right", 100)))

fake_data %>%
  ggplot(aes(x = fake_x)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-5, 5))
```

- Default histogram with 30 bins...

]

.pull-right[

]

---

## What happens as we change the number of bins?

.pull-left[

- Decrease it to 15 bins

```r
fake_data %>%
  ggplot(aes(x = fake_x)) +
* geom_histogram(bins = 15) +
  scale_x_continuous(limits = c(-5, 5))
```

]

.pull-right[

- Increase it to 60 bins

```r
fake_data %>%
  ggplot(aes(x = fake_x)) +
* geom_histogram(bins = 60) +
  scale_x_continuous(limits = c(-5, 5))
```

<img src="figs/Lec5/hist3-1.png" width="100%" />
]

---

## What happens as we change the number of bins?

.pull-left[

- Decrease it to 5 bins

```r
fake_data %>%
  ggplot(aes(x = fake_x)) +
* geom_histogram(bins = 5) +
  scale_x_continuous(limits = c(-5, 5))
```

]

.pull-right[

- Increase it to 100 bins

```r
fake_data %>%
  ggplot(aes(x = fake_x)) +
* geom_histogram(bins = 100) +
  scale_x_continuous(limits = c(-5, 5))
```

<img src="figs/Lec5/hist5-1.png" width="100%" />
]

---

## Variability of graphs - 30 bins

.pull-left[

- First sample with 30 bins...

]

.pull-right[

- __What happens with a different sample?__

]

---

## Variability of graphs - 15 bins

.pull-left[

- First sample with 15 bins...

]

.pull-right[

- __What happens with a different sample?__

]

---

## Variability of graphs - too few bins

.pull-left[

- __High bias__

]

.pull-right[

- __Low variance__

]

---

## Variability of graphs - too many bins

.pull-left[

- __Low bias__

]

.pull-right[

- __High variance__

]

---

## Back to penguins...

.pull-left[

- We make __histograms__ with [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram.html)

```r
penguins %>%
* ggplot(aes(x = flipper_length_mm)) +
* geom_histogram()
```

- __Pros__:
  - Displays full shape of distribution
  - Easy to interpret and see sample size

- __Cons__:
  - Have to choose number of bins and bin locations
  - You can make a bad histogram
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-12-1.png" width="100%" />
]

---

# What about displaying conditional distributions?

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
* geom_histogram(aes(fill = species))
```

- Display conditional distribution of `flipper_length_mm` | `species`, i.e., `x` | `fill`
  
  - Default behavior is to __stack__ histograms
  
  - __What distribution is easy to see here?__
  
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-13-1.png" width="100%" />
]

---

# What about displaying conditional distributions?

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(aes(fill = species),
*                position = "identity",
*                alpha = 0.3)
```

- Can change to overlay histograms
  
  - Modify the `position` to be `identity`
  
  - Need to adjust transparency with `alpha`
  
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-14-1.png" width="100%" />
]

---

# Normalize histogram frequencies with density values

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
  geom_histogram(
*   aes(y = after_stat(density),
        fill = species),
    position = "identity", 
    alpha = 0.3) 
```

- Total area under the bars equals to 1

- Area of any bar is height `$\times$` width `$=$` density `$\times$` width
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-15-1.png" width="100%" />
]

---

## Use density curves instead for comparison

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
* geom_density(aes(color = species))
```

- Much easier to overlay for comparisons
  
  - Uses minimal amount of ink
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-16-1.png" width="100%" />
]

---

## We do NOT fill the density curves

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
  geom_density(aes(fill = species),
*              alpha = .3)
```

- __Unnecessary ink!__
  
  - Fill will be overwhelming with several categorical levels
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-17-1.png" width="100%" />
]

---

### How do histograms relate to the PDF and CDF?

__Remember__: we use the __probability density function (PDF)__ to provide a __relative likelihood__

--
- PDF is the __derivative__ of the cumulative distribution function (CDF)

--
- Histograms approximate the PDF with bins, and __points are equally likely within a bin__

.pull-left[
<img src="figs/Lec5/unnamed-chunk-18-1.png" width="100%" />
]
.pull-right[
<img src="figs/Lec5/ecdf-1.png" width="100%" />

]

--
__What can say about the relative likelihood of data we have not observed?__

---

## Kernel density estimation

__Goal__: estimate the PDF `$f(x)$` for all possible values (assuming it is continuous / smooth)

$$
\text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i)
$$

--
- `$n =$` sample size, `$x =$` new point to estimate `$f(x)$` (does NOT have to be in dataset!)

--
- `$h =$` __bandwidth__, analogous to histogram bin width, ensures `$\hat{f}(x)$` integrates to 1

- `$x_i =$` `$i$`th observation in dataset

--
- `$K_h(x - x_i)$` is the __Kernel__ function, creates __weight__ given distance of `$i$`th observation from new point 
  - as `$|x - x_i| \rightarrow \infty$` then `$K_h(x - x_i) \rightarrow 0$`, i.e. further apart `$i$`th row is from `$x$`, smaller the weight
  
  - as __bandwidth__ `$h \uparrow$` weights are more evenly spread out (as `$h \downarrow$` more concentrated around `$x$`)

- typically use [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `$\propto e^{-(x - x_i)^2 / 2h^2}$`
  
  - `$K_h(x - x_i)$` is large when `$x_i$` is close to `$x$`

---

## [Wikipedia example](https://en.wikipedia.org/wiki/Kernel_density_estimation)

.center[![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Comparison_of_1D_histogram_and_KDE.png/1000px-Comparison_of_1D_histogram_and_KDE.png)]
  
---

## We display __kernel density estimates__ with [`geom_density()`](https://ggplot2.tidyverse.org/reference/geom_density.html)

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
* geom_density() +
  theme_bw()
```

- __Pros__:
  - Displays full shape of distribution
  - Can easily layer
  - Add categorical variable with color

- __Cons__:
  - Need to pick bandwidth and kernel...
  
]
.pull-right[
<img src="figs/Lec5/unnamed-chunk-19-1.png" width="100%" />
]

---

## Choice of [kernel?](https://en.wikipedia.org/wiki/Kernel_(statistics)

---

## What about the bandwidth?

Use __Gaussian reference rule__ (_rule-of-thumb_) `$\approx 1.06 \cdot \sigma \cdot n^{-1/5}$`, where `$\sigma$` is the observed standard deviation

Modify the bandwidth using the `adjust` argument - __value to multiply default bandwidth by__

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
* geom_density(adjust = 0.5) +
  theme_bw()
```

]
.pull-right[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) + 
* geom_density(adjust = 2) +
  theme_bw()
```

<img src="figs/Lec5/curve-smooth-1.png" width="100%" />
]

---

## CAUTION: dealing with bounded data...

.pull-left[

```r
set.seed(101)
bound_data <- tibble(fake_x = runif(100))

bound_data %>%
  ggplot(aes(x = fake_x)) +
  geom_density() +
* geom_rug(alpha = 0.5) +
  stat_function(data = 
                  tibble(fake_x = c(0, 1)),
                fun = dunif, color = "red") +
  scale_x_continuous(limits = c(-.5, 1.5))
```

- Use `geom_rug()` to display raw data points

- __Observe density estimates for impossible values!__

- Proposed solutions based on [reflecting](https://github.com/tidyverse/ggplot2/issues/3387)

- Check out [`evmix` package for more information](https://www.jstatsoft.org/article/view/v084i05)

]

.pull-right[

]

---

## Alternative density-based option: violin plots

.pull-left[

```r
penguins %>%
  ggplot(aes(x = species, 
             y = flipper_length_mm)) +
* geom_violin() +
  coord_flip()
```

- __Pros__:
  - Displays full shape of distribution
  
  - Can easily layer...
  
- __Cons__:

- Inherits same problems as density curves
  
  - Mirror image is duplicate information...

]
.pull-right[

]

---

## Alternative density-based option: violin plots

.pull-left[

```r
penguins %>%
  ggplot(aes(x = species, 
             y = flipper_length_mm)) +
  geom_violin() + 
* geom_boxplot(width = .2) +
  coord_flip()
```

- __Pros__:

- Displays full shape of distribution
  
  - __Can easily layer with box plots on top__
  
- __Cons__:

- Inherits same problems as density curves
  
  - Mirror image is duplicate information...

]
.pull-right[

]

---

## Alternative density-based option: ridge plots

.pull-left[

- Available with [`ggridges` package](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html)

```r
library(ggridges)
penguins %>%
  ggplot(aes(x = flipper_length_mm, 
*            y = species)) +
* geom_density_ridges(rel_min_height = 0.01)
```

- __Pros__:

- Drops the mirror image of violins
  
  - Useful for many categorical levels (especially ordinal data)
  
- __Cons__:

- Inherits same problems as density curves
  
  - Be careful with overlap...

]
.pull-right[

]

---

## Alternative data-based option: beeswarm plots

.pull-left[

- Available with [`ggbeeswarm` package](https://github.com/eclarke/ggbeeswarm)

```r
library(ggbeeswarm)
penguins %>%
  ggplot(aes(x = species, 
             y = flipper_length_mm)) +
* geom_beeswarm(cex = 1.5) +
  coord_flip()
```

- __Pros__:

- Displays full shape of distribution
  
  - Displays each data point individually
  
- __Cons__:

- Which algorithm for arranging points?
  
  - __What's another major disadvantage?__

]
.pull-right[

]

---
class: center, middle

# Next time: 2D Quantitative Data

Reminder: __HW2 due tonight!__ __Graphics critique/replication due Friday!__