class: center, middle, inverse, title-slide .title[ # 36-613: Data Visualization ] .subtitle[ ## Density Estimation ] .author[ ### Professor Ron Yurko ] .date[ ### 9/14/2022 ] --- ## Simulate data from mixture of Normal distributions Will sample 100 draws from `\(N(-1.5, 1)\)` and 100 draws from `\(N(1.5, 1)\)` <img src="figs/Lec5/unnamed-chunk-2-1.png" width="100%" /> --- # Revisit histograms .pull-left[ ```r set.seed(2020) fake_data <- tibble(fake_x = c(rnorm(100, -1.5), rnorm(100, 1.5))) %>% mutate(component = c(rep("left", 100), rep("right", 100))) fake_data %>% ggplot(aes(x = fake_x)) + geom_histogram() + scale_x_continuous(limits = c(-5, 5)) ``` - Default histogram with 30 bins... ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-3-1.png" width="100%" /> ] --- ## What happens as we change the number of bins? .pull-left[ - Decrease it to 15 bins ```r fake_data %>% ggplot(aes(x = fake_x)) + * geom_histogram(bins = 15) + scale_x_continuous(limits = c(-5, 5)) ``` <img src="figs/Lec5/hist2-1.png" width="100%" /> ] -- .pull-right[ - Increase it to 60 bins ```r fake_data %>% ggplot(aes(x = fake_x)) + * geom_histogram(bins = 60) + scale_x_continuous(limits = c(-5, 5)) ``` <img src="figs/Lec5/hist3-1.png" width="100%" /> ] --- ## What happens as we change the number of bins? .pull-left[ - Decrease it to 5 bins ```r fake_data %>% ggplot(aes(x = fake_x)) + * geom_histogram(bins = 5) + scale_x_continuous(limits = c(-5, 5)) ``` <img src="figs/Lec5/hist4-1.png" width="100%" /> ] -- .pull-right[ - Increase it to 100 bins ```r fake_data %>% ggplot(aes(x = fake_x)) + * geom_histogram(bins = 100) + scale_x_continuous(limits = c(-5, 5)) ``` <img src="figs/Lec5/hist5-1.png" width="100%" /> ] --- ## Variability of graphs - 30 bins .pull-left[ - First sample with 30 bins... <img src="figs/Lec5/unnamed-chunk-4-1.png" width="100%" /> ] .pull-right[ - __What happens with a different sample?__ <img src="figs/Lec5/unnamed-chunk-5-1.png" width="100%" /> ] --- ## Variability of graphs - 15 bins .pull-left[ - First sample with 15 bins... <img src="figs/Lec5/unnamed-chunk-6-1.png" width="100%" /> ] .pull-right[ - __What happens with a different sample?__ <img src="figs/Lec5/unnamed-chunk-7-1.png" width="100%" /> ] --- ## Variability of graphs - too few bins .pull-left[ - __High bias__ <img src="figs/Lec5/unnamed-chunk-8-1.png" width="100%" /> ] .pull-right[ - __Low variance__ <img src="figs/Lec5/unnamed-chunk-9-1.png" width="100%" /> ] --- ## Variability of graphs - too many bins .pull-left[ - __Low bias__ <img src="figs/Lec5/unnamed-chunk-10-1.png" width="100%" /> ] .pull-right[ - __High variance__ <img src="figs/Lec5/unnamed-chunk-11-1.png" width="100%" /> ] --- ## Back to penguins... .pull-left[ - We make __histograms__ with [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram.html) ```r penguins %>% * ggplot(aes(x = flipper_length_mm)) + * geom_histogram() ``` - __Pros__: - Displays full shape of distribution - Easy to interpret and see sample size - __Cons__: - Have to choose number of bins and bin locations - You can make a bad histogram ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-12-1.png" width="100%" /> ] --- # What about displaying conditional distributions? .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + * geom_histogram(aes(fill = species)) ``` - Display conditional distribution of `flipper_length_mm` | `species`, i.e., `x` | `fill` - Default behavior is to __stack__ histograms - __What distribution is easy to see here?__ ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-13-1.png" width="100%" /> ] --- # What about displaying conditional distributions? .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_histogram(aes(fill = species), * position = "identity", * alpha = 0.3) ``` - Can change to overlay histograms - Modify the `position` to be `identity` - Need to adjust transparency with `alpha` ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-14-1.png" width="100%" /> ] --- # Normalize histogram frequencies with density values .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_histogram( * aes(y = after_stat(density), fill = species), position = "identity", alpha = 0.3) ``` - Total area under the bars equals to 1 - Area of any bar is height `\(\times\)` width `\(=\)` density `\(\times\)` width ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-15-1.png" width="100%" /> ] --- ## Use density curves instead for comparison .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + * geom_density(aes(color = species)) ``` - Much easier to overlay for comparisons - Uses minimal amount of ink ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-16-1.png" width="100%" /> ] --- ## We do NOT fill the density curves .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + geom_density(aes(fill = species), * alpha = .3) ``` - __Unnecessary ink!__ - Fill will be overwhelming with several categorical levels ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-17-1.png" width="100%" /> ] --- ### How do histograms relate to the PDF and CDF? __Remember__: we use the __probability density function (PDF)__ to provide a __relative likelihood__ -- - PDF is the __derivative__ of the cumulative distribution function (CDF) -- - Histograms approximate the PDF with bins, and __points are equally likely within a bin__ .pull-left[ <img src="figs/Lec5/unnamed-chunk-18-1.png" width="100%" /> ] .pull-right[ <img src="figs/Lec5/ecdf-1.png" width="100%" /> ] -- __What can say about the relative likelihood of data we have not observed?__ --- ## Kernel density estimation __Goal__: estimate the PDF `\(f(x)\)` for all possible values (assuming it is continuous / smooth) -- $$ \text{Kernel density estimate: } \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) $$ -- - `\(n =\)` sample size, `\(x =\)` new point to estimate `\(f(x)\)` (does NOT have to be in dataset!) -- - `\(h =\)` __bandwidth__, analogous to histogram bin width, ensures `\(\hat{f}(x)\)` integrates to 1 - `\(x_i =\)` `\(i\)`th observation in dataset -- - `\(K_h(x - x_i)\)` is the __Kernel__ function, creates __weight__ given distance of `\(i\)`th observation from new point - as `\(|x - x_i| \rightarrow \infty\)` then `\(K_h(x - x_i) \rightarrow 0\)`, i.e. further apart `\(i\)`th row is from `\(x\)`, smaller the weight - as __bandwidth__ `\(h \uparrow\)` weights are more evenly spread out (as `\(h \downarrow\)` more concentrated around `\(x\)`) - typically use [__Gaussian__ / Normal](https://en.wikipedia.org/wiki/Normal_distribution) kernel: `\(\propto e^{-(x - x_i)^2 / 2h^2}\)` - `\(K_h(x - x_i)\)` is large when `\(x_i\)` is close to `\(x\)` --- ## [Wikipedia example](https://en.wikipedia.org/wiki/Kernel_density_estimation) .center[] --- ## We display __kernel density estimates__ with [`geom_density()`](https://ggplot2.tidyverse.org/reference/geom_density.html) .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + * geom_density() + theme_bw() ``` - __Pros__: - Displays full shape of distribution - Can easily layer - Add categorical variable with color - __Cons__: - Need to pick bandwidth and kernel... ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-19-1.png" width="100%" /> ] --- ## Choice of [kernel?](https://en.wikipedia.org/wiki/Kernel_(statistics) <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Kernels.svg/1000px-Kernels.svg.png" width="70%" style="display: block; margin: auto;" /> --- ## What about the bandwidth? Use __Gaussian reference rule__ (_rule-of-thumb_) `\(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\)`, where `\(\sigma\)` is the observed standard deviation Modify the bandwidth using the `adjust` argument - __value to multiply default bandwidth by__ .pull-left[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + * geom_density(adjust = 0.5) + theme_bw() ``` <img src="figs/Lec5/curve-noisy-1.png" width="100%" /> ] .pull-right[ ```r penguins %>% ggplot(aes(x = flipper_length_mm)) + * geom_density(adjust = 2) + theme_bw() ``` <img src="figs/Lec5/curve-smooth-1.png" width="100%" /> ] --- ## CAUTION: dealing with bounded data... .pull-left[ ```r set.seed(101) bound_data <- tibble(fake_x = runif(100)) bound_data %>% ggplot(aes(x = fake_x)) + geom_density() + * geom_rug(alpha = 0.5) + stat_function(data = tibble(fake_x = c(0, 1)), fun = dunif, color = "red") + scale_x_continuous(limits = c(-.5, 1.5)) ``` - Use `geom_rug()` to display raw data points - __Observe density estimates for impossible values!__ - Proposed solutions based on [reflecting](https://github.com/tidyverse/ggplot2/issues/3387) - Check out [`evmix` package for more information](https://www.jstatsoft.org/article/view/v084i05) ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-21-1.png" width="100%" /> ] --- ## Alternative density-based option: violin plots .pull-left[ ```r penguins %>% ggplot(aes(x = species, y = flipper_length_mm)) + * geom_violin() + coord_flip() ``` - __Pros__: - Displays full shape of distribution - Can easily layer... - __Cons__: - Inherits same problems as density curves - Mirror image is duplicate information... ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-22-1.png" width="100%" /> ] --- ## Alternative density-based option: violin plots .pull-left[ ```r penguins %>% ggplot(aes(x = species, y = flipper_length_mm)) + geom_violin() + * geom_boxplot(width = .2) + coord_flip() ``` - __Pros__: - Displays full shape of distribution - __Can easily layer with box plots on top__ - __Cons__: - Inherits same problems as density curves - Mirror image is duplicate information... ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-23-1.png" width="100%" /> ] --- ## Alternative density-based option: ridge plots .pull-left[ - Available with [`ggridges` package](https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html) ```r library(ggridges) penguins %>% ggplot(aes(x = flipper_length_mm, * y = species)) + * geom_density_ridges(rel_min_height = 0.01) ``` - __Pros__: - Drops the mirror image of violins - Useful for many categorical levels (especially ordinal data) - __Cons__: - Inherits same problems as density curves - Be careful with overlap... ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-24-1.png" width="100%" /> ] --- ## Alternative data-based option: beeswarm plots .pull-left[ - Available with [`ggbeeswarm` package](https://github.com/eclarke/ggbeeswarm) ```r library(ggbeeswarm) penguins %>% ggplot(aes(x = species, y = flipper_length_mm)) + * geom_beeswarm(cex = 1.5) + coord_flip() ``` - __Pros__: - Displays full shape of distribution - Displays each data point individually - __Cons__: - Which algorithm for arranging points? - __What's another major disadvantage?__ ] .pull-right[ <img src="figs/Lec5/unnamed-chunk-25-1.png" width="100%" /> ] --- class: center, middle # Next time: 2D Quantitative Data Reminder: __HW2 due tonight!__ __Graphics critique/replication due Friday!__ Recommended reading: [CW Chapter 7 Visualizing distributions: Histograms and density plots](https://clauswilke.com/dataviz/histograms-density-plots.html)