class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## More High Dimensional Data and Shiny
]
.author[
### Professor Ron Yurko
]
.date[
### 9/28/2022
]

---

## Consider the following spiral structure...

<img src="figs/Lec9/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" />

---

## PCA simply rotates the data...

<img src="figs/Lec9/unnamed-chunk-3-1.png" width="100%" />

---

## Nonlinear dimension reduction with t-SNE and UMAP

.pull-left[
<img src="figs/Lec9/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]

Both t-SNE and UMAP look at the local distances between points in the original `\(p\)`-dimensional space and try to reproduce them in a lower `\(k\)`-dimensional space

---

### [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding): t-distributed stochastic neighbor embedding

- Construct conditional probability for similarity between observations in original space

  - i.e., probability `\(x_i\)` will pick `\(x_j\)` as its neighbor

`$$p_{j \mid i}=\frac{\exp \left(-\left\|x_i-x_j\right\|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp \left(-\left\|x_i-x_k\right\|^2 / 2 \sigma_i^2\right)},\quad p_{i j}=\frac{\left(p_{j \mid i}+p_{i \mid j}\right)}{2 n}$$`

- `\(\sigma_i^2\)` is the variance of the Gaussian centered at `\(x_i\)`, controlled by __perplexity__: `\(\log_2 (\text{perplexity})=-\sum_j p_{j \mid i} \log_2 p_{j \mid i}\)`

  - loosely interpreted as the number of close neighbors to consider for each point

--

- Find points `\(y_i\)` in lower dimensional space with symmetrized Student t-distribution

`$$q_{j \mid i}=\frac{\left(1+\left\|y_i-y_j\right\|^2\right)^{-1}}{\sum_{k \neq i}\left(1+\left\|y_i-y_k\right\|^2\right)^{-1}}, \quad q_{i j}=\frac{q_{i \mid j}+q_{j \mid i}}{2 n}$$`

- Match conditional probabilities by minimizing the sum of KL
divergences `\(C=\sum_{i j} p_{i j} \log \left(\frac{p_{i j}}{q_{i j}}\right)\)`

---

## Starbucks t-SNE plot

.pull-left[

Use [`Rtsne`](https://github.com/jkrijthe/Rtsne) package

```r
set.seed(2013)
# standardize the numeric columns, then fit t-SNE
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(check_duplicates = FALSE)

# plot the two embedding coordinates stored in tsne_fit$Y
starbucks %>%
  mutate(tsne1 = tsne_fit$Y[,1],
         tsne2 = tsne_fit$Y[,2]) %>%
  ggplot(aes(x = tsne1, y = tsne2,
             color = size)) +
  geom_point(alpha = 0.5) +
  labs(x = "t-SNE 1", y = "t-SNE 2")
```

]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-6-1.png" width="100%" />
]

---

## Starbucks t-SNE plot - involves randomness!

.pull-left[

__Depends on the random starting point!__

```r
# same fit as before, only the seed changes
*set.seed(2014)
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(check_duplicates = FALSE)

starbucks %>%
  mutate(tsne1 = tsne_fit$Y[,1],
         tsne2 = tsne_fit$Y[,2]) %>%
  ggplot(aes(x = tsne1, y = tsne2,
             color = size)) +
  geom_point(alpha = 0.5) +
  labs(x = "t-SNE 1", y = "t-SNE 2")
```

]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-7-1.png" width="100%" />
]

---

## Starbucks t-SNE plot - watch the perplexity!

.pull-left[

```r
# same seed as the first fit, but with a larger perplexity
*set.seed(2013)
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(perplexity = 100, check_duplicates = FALSE)

starbucks %>%
  mutate(tsne1 = tsne_fit$Y[,1],
         tsne2 = tsne_fit$Y[,2]) %>%
  ggplot(aes(x = tsne1, y = tsne2,
             color = size)) +
  geom_point(alpha = 0.5) +
  labs(x = "t-SNE 1", y = "t-SNE 2")
```

- Perplexity should increase with more data

- Should not be bigger than `\(\frac{n-1}{3}\)`

]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-8-1.png" width="100%" />
]

---

## Back to the spirals: results depend on perplexity!

<img src="figs/Lec9/unnamed-chunk-9-1.png" width="100%" />

---

## Criticisms of t-SNE plots

.pull-left[

- __Poor scalability__: does not scale well to large data, and can practically only embed into 2 or 3 dimensions

- __Meaningless global structure__: distances between clusters might not have a clear interpretation, and cluster size doesn’t have any meaning

- __Poor performance with very high dimensional data__: needs PCA as a preliminary dimension reduction step

- [__Sometimes random noise can lead to false positive structure in the t-SNE projection__](https://distill.pub/2016/misread-tsne/)

- __Can NOT interpret like PCA!__

]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-10-1.png" width="100%" />
]

---

## Interactive web apps with [`Shiny`](https://shiny.rstudio.com/)

Shiny is a framework for building __interactive__ web applications and dynamic dashboards in `R`

__You do NOT need to be a web developer to create Shiny apps__, you just need to learn some additional syntax to augment your `R` code

--

Every Shiny app consists of two scripts (could also be saved into one file `app.R` but that's annoying)

1. `ui.R`: controls __user interface__, sets up the display, __widgets__ for user `input`

   - contains more code specific to Shiny

2. `server.R`: code to generate / display the results! Communicates with `ui.R` through __reactive objects__: processes user `input` to return `output`

   - will contain more _traditional_ `R` code: load packages, data wrangling, create plots

--

Can be run locally or deployed on a Shiny app server for public viewing

---

class: center, middle

# DO IT LIVE

---

class: center, middle

# Next time: Maps

__HW4 due today! HW5 due next Wednesday and Graphics Critique / Replication #2 due Friday Oct 7th!__

Recommended reading:

[How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)

[Understanding UMAP](https://pair-code.github.io/understanding-umap/)

[Shiny tutorials](https://shiny.rstudio.com/tutorial/)

[Shiny Gallery](https://shiny.rstudio.com/gallery/)
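
---

## Sketch: computing perplexity by hand

Returning to the t-SNE conditional probabilities `\(p_{j \mid i}\)` from earlier in the deck, here is a minimal base `R` sketch of how the Gaussian bandwidth `\(\sigma_i\)` relates to perplexity. It is an illustration only, not how `Rtsne` is implemented internally. The toy vector `x` and the helper names `cond_probs()` and `perplexity()` are made up for this example.

```r
# p_{j|i} for one point i, given squared distances to every other point
# and a Gaussian bandwidth sigma (hypothetical helper, not from Rtsne)
cond_probs <- function(sq_dists, sigma) {
  w <- exp(-sq_dists / (2 * sigma^2))
  w / sum(w)
}

# perplexity = 2^(Shannon entropy in bits) of the conditional distribution
perplexity <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  2^(-sum(p * log2(p)))
}

# toy data (assumption): 6 points on a line, take point 1 as x_i
x <- c(0, 1, 2, 3, 10, 11)
sq_dists <- (x[1] - x[-1])^2

# perplexity under a narrow and a wide Gaussian
perp_small <- perplexity(cond_probs(sq_dists, sigma = 0.5))
perp_large <- perplexity(cond_probs(sq_dists, sigma = 100))

# search for the sigma that gives a target perplexity of 3
sigma_3 <- uniroot(
  function(s) perplexity(cond_probs(sq_dists, s)) - 3,
  interval = c(0.5, 100), tol = 1e-8
)$root
```

A small `sigma` puts almost all of the mass on the nearest neighbor (perplexity near 1), while a large `sigma` spreads it almost evenly over the 5 other points (perplexity near 5). The `uniroot()` call mirrors the idea that choosing a perplexity pins down a separate `\(\sigma_i\)` for each point.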