class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## 1D Quantitative Data
]
.author[
### Professor Ron Yurko
]
.date[
### 9/12/2022
]

---

# 1D quantitative data

Observations are collected into a vector `\((x_1, \dots, x_n)\)`, `\(x_i \in \mathbb{R}\)` (or maybe `\(\mathbb{R}^+\)`, `\(\mathbb{Z}\)`)

Common __summary statistics__ for 1D quantitative data:

--

+ __Center__: Mean, median, weighted mean, mode

  + Related to the first moment, i.e., `\(\mathbb{E}[X]\)`

--

+ __Spread__: Variance, range, min/max, quantiles, IQR

  + Related to the second moment, i.e., `\(\mathbb{E}[X^2]\)`

--

+ __Shape__: symmetry, skew, kurtosis ("peakedness")

  + Related to higher order moments, e.g., skewness involves the third moment `\(\mathbb{E}[X^3]\)`, kurtosis the fourth moment `\(\mathbb{E}[X^4]\)`

--

Compute various statistics with `summary()`, `mean()`, `median()`, `quantile()`, `range()`, `sd()`, `var()`, etc.

---

## Box plots visualize summary statistics

.pull-left[

- We make a __box plot__ with [`geom_boxplot()`](https://ggplot2.tidyverse.org/reference/geom_boxplot.html)

```r
penguins %>%
* ggplot(aes(y = flipper_length_mm)) +
* geom_boxplot(aes(x = "")) +
* coord_flip()
```

- __Pros__:
  - Displays outliers, percentiles, spread, skew
  - Useful for side-by-side comparison

- __Cons__:
  - Does not display the full distribution shape!
  - Potentially missing some summary stats
  - Stresses middle portion

_Why use `aes(x = "")` inside `geom_boxplot()`?_

]
.pull-right[
<img src="figs/Lec4/unnamed-chunk-2-1.png" width="100%" />
]

---

## Histograms display 1D continuous distributions

.pull-left[

- We make __histograms__ with [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram.html)

```r
penguins %>%
* ggplot(aes(x = flipper_length_mm)) +
* geom_histogram()
```

$$
\text{# total obs.} = \sum_{b=1}^B \text{# obs. 
in bin }b
$$

- __Pros__:
  - Displays full shape of distribution
  - Easy to interpret and see sample size

- __Cons__:
  - Have to choose number of bins and bin locations (will revisit Wednesday)
  - You can make a bad histogram

]
.pull-right[
<img src="figs/Lec4/unnamed-chunk-3-1.png" width="100%" />
]

---

# [Do NOT rely on box plots...](https://www.autodesk.com/research/publications/same-stats-different-graphs)

<img src="https://damassets.autodesk.net/content/dam/autodesk/research/publications-assets/gifs/same-stats-different-graphs/boxplots.gif" width="100%" />

Three clearly different distributions of data...

#### But they all result in the exact same box plot!

---

### What do visualizations of continuous distributions display?

__Probability that continuous variable `\(X\)` takes a particular value is 0__

e.g., `\(P\)` (`flipper_length_mm` `\(= 200\)`) `\(= 0\)`, _why_?

--

Instead we use the __probability density function (PDF)__ to provide a __relative likelihood__

- Density estimation is the focus of Wednesday's lecture

--

For continuous variables we can use the __cumulative distribution function (CDF)__,

$$
F(x) = P(X \leq x)
$$

--

For `\(n\)` observations we can easily compute the __Empirical CDF (ECDF)__:

`$$\hat{F}_n(x) = \frac{\text{# obs. with variable} \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n}1(x_i \leq x)$$`

- where `\(1()\)` is the indicator function, i.e. 
  `ifelse(x_i <= x, 1, 0)`

---

## Display full distribution with ECDF plot

.pull-left[

- We make __ECDF plots__ with [`stat_ecdf()`](https://ggplot2.tidyverse.org/reference/stat_ecdf.html)

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm)) +
* stat_ecdf() +
  theme_bw()
```

- __Pros__:
  - Displays all of your data at once (except the order)
  - Does NOT require any parameters to adjust
  - As `\(n \rightarrow \infty\)`, our ECDF `\(\hat{F}_n(x)\)` converges to the true CDF `\(F(x)\)`

- __Cons__:
  - _What do you think the cons are?_

]
.pull-right[
<img src="figs/Lec4/unnamed-chunk-5-1.png" width="100%" />
]

---

## What's the relationship between these two figures?

.pull-left[
<img src="figs/Lec4/unnamed-chunk-6-1.png" width="100%" />
]
.pull-right[
<img src="figs/Lec4/unnamed-chunk-7-1.png" width="100%" />
]

---

## What about comparing to theoretical distributions?

.pull-left[
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/2560px-Normal_Distribution_PDF.svg.png" width="100%" />
]

--

.pull-right[
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/2560px-Normal_Distribution_CDF.svg.png" width="100%" />
]

---

## One-Sample Kolmogorov-Smirnov Test

- We compare the ECDF `\(\hat{F}(x)\)` to a theoretical distribution's CDF `\(F(x)\)`

--

- The one-sample KS test statistic is: `\(\text{max}_x |\hat{F}(x) - F(x)|\)`

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cf/KS_Example.png" width="45%" style="display: block; margin: auto;" />

---

## Flipper length example

What if we assume `flipper_length_mm` follows a Normal distribution? 
i.e., `flipper_length_mm` `\(\sim N(\mu, \sigma^2)\)`

+ Need estimates for mean `\(\mu\)` and standard deviation `\(\sigma\)`:

```r
flipper_length_mean <- mean(penguins$flipper_length_mm, na.rm = TRUE)
flipper_length_sd <- sd(penguins$flipper_length_mm, na.rm = TRUE)
```

--

Perform a one-sample KS test using [`ks.test()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html):

```r
ks.test(x = penguins$flipper_length_mm, y = "pnorm",
        mean = flipper_length_mean, sd = flipper_length_sd)
```

```
##
##  Asymptotic one-sample Kolmogorov-Smirnov test
##
## data:  penguins$flipper_length_mm
## D = 0.12428, p-value = 5.163e-05
## alternative hypothesis: two-sided
```

---

## Flipper length example

<img src="figs/Lec4/unnamed-chunk-13-1.png" width="100%" />

---

### Visualize distribution comparisons using quantile-quantile (q-q) plots

.pull-left[

- Compare observed values to theoretical predictions under an assumed distribution

- Theoretical values are based on each observation's rank in the sample and the assumed distribution

```r
penguins %>%
* ggplot(aes(sample = flipper_length_mm)) +
* stat_qq() +
* stat_qq_line()
```

- Use [`stat_qq` and `stat_qq_line`](https://ggplot2.tidyverse.org/reference/geom_qq.html) to create q-q plots (default assumption is Normal distribution)

- Line passes through the first and third quartiles by default; points close to the line indicate observed quantiles consistent with the theoretical ones

]
.pull-right[
<img src="figs/Lec4/unnamed-chunk-14-1.png" width="100%" />
]

---

class: center, middle

# Next time: Density estimation

Reminder: __HW2 due Wednesday!__ __Graphics critique/replication due Friday!__

Recommended reading:

[CW Chapter 7 Visualizing distributions: Histograms and density plots](https://clauswilke.com/dataviz/histograms-density-plots.html)

[CW Chapter 8 Visualizing distributions: Empirical cumulative distribution functions and q-q plots](https://clauswilke.com/dataviz/ecdf-qq.html)

---

## BONUS: Visualizing the KS test statistic

```r
# First create the ECDF function for the variable:
fl_ecdf <- ecdf(penguins$flipper_length_mm)
# Compute the absolute value of the differences between the ECDF for the values
# and the theoretical values with assumed Normal distribution:
abs_ecdf_diffs <- abs(fl_ecdf(penguins$flipper_length_mm) -
                        pnorm(penguins$flipper_length_mm,
                              mean = flipper_length_mean,
                              sd = flipper_length_sd))
# Now find where the maximum difference is:
max_abs_ecdf_diff_i <- which.max(abs_ecdf_diffs)
# Get this flipper length value:
max_fl_diff_value <- penguins$flipper_length_mm[max_abs_ecdf_diff_i]
# Plot the ECDF with the theoretical Normal and KS test info:
penguins %>%
  ggplot(aes(x = flipper_length_mm)) +
  stat_ecdf(color = "darkblue") +
  # Use stat_function to draw the Normal CDF
  stat_function(fun = pnorm,
                args = list(mean = flipper_length_mean,
                            sd = flipper_length_sd),
                color = "black", linetype = "dashed") +
  # Draw KS test line:
  geom_vline(xintercept = max_fl_diff_value, color = "red") +
  # Add text with the test results (x and y are manually entered locations)
  annotate(geom = "text", x = 215, y = .25,
           label = "KS test stat = 0.12428\np-value = 5.163e-05") +
  labs(x = "Flipper length (mm)", y = "Fn(x)") +
  theme_bw()
```
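
---

## BONUS: ECDF and KS statistic by hand

A minimal sketch of how the ECDF formula and the one-sample KS statistic could be computed directly in base R. The simulated vector `x` is a stand-in (an assumption, so the chunk runs without `penguins`), and `manual_ecdf()` is a hypothetical helper name, not part of any package. Because the ECDF is a step function, the largest gap from the Normal CDF can occur just before a jump, so the sketch checks both sides of each step.

```r
# Simulated stand-in data (assumption): swap in a real variable with NAs removed
set.seed(36613)
x <- rnorm(100, mean = 200, sd = 14)
n <- length(x)

# ECDF by hand: proportion of observations <= each query point
# (manual_ecdf is a hypothetical helper for illustration)
manual_ecdf <- function(q) sapply(q, function(q_i) mean(x <= q_i))

# Should agree with the built-in ecdf()
all.equal(manual_ecdf(x), ecdf(x)(x))

# Normal CDF at the sorted observations, using the sample mean and sd
x_sorted <- sort(x)
normal_cdf <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

# KS statistic: largest gap between ECDF and Normal CDF, checking the
# ECDF value at each jump (i / n) and just before it ((i - 1) / n)
ks_stat <- max(pmax(abs((1:n) / n - normal_cdf),
                    abs((0:(n - 1)) / n - normal_cdf)))

# Compare with the statistic reported by ks.test()
ks_builtin <- unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)
all.equal(ks_stat, ks_builtin)
```

Working with the sorted values means only the `n` jump points need to be checked, since the gap between a step function and a continuous CDF is largest at one of them.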