36-613: Data Visualization

class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## More High Dimensional Data and Shiny
]
.author[
### Professor Ron Yurko
]
.date[
### 9/28/2022
]

---

## Consider the following spiral structure...

---

## PCA simply rotates the data...
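
As a minimal sketch of what "rotates" means here, the base R code below simulates a hypothetical two-dimensional spiral (the `spiral` matrix is an assumption, not the data behind the lecture figures) and checks that the PCA scores keep every pairwise distance unchanged.

```r
# Simulate a hypothetical 2D spiral (not the lecture's data)
set.seed(1)
theta <- seq(0, 4 * pi, length.out = 200)
spiral <- cbind(x = theta * cos(theta), y = theta * sin(theta)) +
  matrix(rnorm(400, sd = 0.1), ncol = 2)

# PCA without scaling: centers the data, then rotates onto the PCs
pca_fit <- prcomp(spiral, center = TRUE, scale. = FALSE)

# The rotation matrix is orthogonal: t(R) %*% R is the identity
rotation_check <- crossprod(pca_fit$rotation)

# Pairwise distances are unchanged when all PCs are kept
max_dist_diff <- max(abs(as.vector(dist(spiral)) - as.vector(dist(pca_fit$x))))
```

Because the distances are preserved, the spiral is still a spiral in PC coordinates.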

---

## Nonlinear dimension reduction with t-SNE and UMAP

.pull-left[
<img src="figs/Lec9/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />

]
.pull-right[
<img src="figs/Lec9/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />
]

Both t-SNE and UMAP look at the local distances between points in the original `$p$`-dimensional space and try to reproduce them in a lower `$k$`-dimensional space.

---

### [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding): t-distributed stochastic neighbor embedding

- Construct conditional probability for similarity between observations in original space

- i.e., probability `$x_i$` will pick `$x_j$` as its neighbor

`$$p_{j \mid i}=\frac{\exp \left(-\left\|x_i-x_j\right\|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp \left(-\left\|x_i-x_k\right\|^2 / 2 \sigma_i^2\right)},\quad p_{i j}=\frac{\left(p_{j \mid i}+p_{i \mid j}\right)}{2 n}$$`

- `$\sigma_i^2$` is the variance of the Gaussian centered at `$x_i$`, controlled by __perplexity__:  `$\log_2 (\text { perplexity })=-\sum_j p_{j \mid i} \log _2 p_{j \mid i}$`

-  loosely interpreted as the number of close neighbors to consider for each point
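
The conditional probabilities and perplexity can be sketched in base R. This is a simplified illustration on hypothetical toy data: it uses one fixed `sigma` for every point, whereas t-SNE tunes a separate `$\sigma_i$` per point to hit the requested perplexity. The helper `cond_probs()` is a made-up name, not part of any package.

```r
# Toy data: 5 hypothetical points in 2 dimensions
set.seed(36613)
x <- matrix(rnorm(10), ncol = 2)

# Conditional probabilities p_{j|i} for one fixed bandwidth sigma
# (t-SNE instead tunes a separate sigma_i per point)
cond_probs <- function(x, sigma = 1) {
  sq_dist <- as.matrix(dist(x))^2
  affinity <- exp(-sq_dist / (2 * sigma^2))
  diag(affinity) <- 0            # a point never picks itself
  affinity / rowSums(affinity)   # row i holds p_{j|i}
}

p_cond <- cond_probs(x, sigma = 1)

# Perplexity of each row: 2^(Shannon entropy in bits)
row_perplexity <- apply(p_cond, 1, function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
})

# Symmetrized joint probabilities p_{ij}
n <- nrow(x)
p_joint <- (p_cond + t(p_cond)) / (2 * n)
```

Each row's perplexity falls between 1 and `n - 1`, which is why it reads loosely as an effective number of neighbors.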
  
--

- Find points `$y_i$` in lower dimensional space with similarities from a Student t-distribution (one degree of freedom), normalized over all pairs

`$$q_{i j}=\frac{\left(1+\left\|y_i-y_j\right\|^2\right)^{-1}}{\sum_{k} \sum_{l \neq k}\left(1+\left\|y_k-y_l\right\|^2\right)^{-1}}$$`

- Match the joint probabilities by minimizing the KL divergence `$C=\sum_{i j} p_{i j} \log \left(\frac{p_{i j}}{q_{i j}}\right)$`
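
A rough base R sketch of the cost for one candidate embedding follows. The data `x`, the embedding `y`, and the fixed `sigma = 1` are all assumptions for illustration, and the low-dimensional similarities are normalized over all pairs as in the original t-SNE formulation. An actual fit would update `y` by gradient descent to shrink `kl_cost`.

```r
# Hypothetical high-dimensional points and a candidate 2D embedding
set.seed(36613)
x <- matrix(rnorm(6 * 4), nrow = 6)   # 6 points, p = 4
y <- matrix(rnorm(6 * 2), nrow = 6)   # 6 points, k = 2
n <- nrow(x)

# High-dimensional joint probabilities p_{ij} (fixed sigma = 1 for simplicity)
affinity <- exp(-as.matrix(dist(x))^2 / 2)
diag(affinity) <- 0
p_cond <- affinity / rowSums(affinity)
p_joint <- (p_cond + t(p_cond)) / (2 * n)

# Low-dimensional similarities q_{ij}: Student t kernel (1 degree of freedom),
# normalized over all pairs
t_kernel <- 1 / (1 + as.matrix(dist(y))^2)
diag(t_kernel) <- 0
q_joint <- t_kernel / sum(t_kernel)

# KL divergence cost C = sum_{ij} p_ij * log(p_ij / q_ij), skipping the diagonal
off_diag <- row(p_joint) != col(p_joint)
kl_cost <- sum(p_joint[off_diag] * log(p_joint[off_diag] / q_joint[off_diag]))
```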

---

## Starbucks t-SNE plot

.pull-left[

Use [`Rtsne`](https://github.com/jkrijthe/Rtsne) package

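```r
library(tidyverse)
library(Rtsne)

# Assumes `starbucks` is already loaded
set.seed(2013)
# Standardize the numeric columns, then fit t-SNE
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(check_duplicates = FALSE)

# Add the two embedding coordinates and plot them
starbucks %>%
  mutate(tsne1 = tsne_fit$Y[,1],
         tsne2 = tsne_fit$Y[,2]) %>%
  ggplot(aes(x = tsne1, y = tsne2, 
             color = size)) +
  geom_point(alpha = 0.5) + 
  labs(x = "t-SNE 1", y = "t-SNE 2")
```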

]

.pull-right[

]

---

## Starbucks t-SNE plot - involves randomness!

.pull-left[

__Depends on the random starting point!__

```r
*set.seed(2014)
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(check_duplicates = FALSE)
```

]

.pull-right[

]
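
One way to see the role of the seed is to refit with the same and with a different seed. The sketch below uses the built-in `iris` measurements as a stand-in for the lecture data (an assumption), and `fit_tsne()` is a hypothetical helper. With the default single-threaded settings, the same seed should reproduce the embedding while a new seed should not.

```r
library(Rtsne)

# Built-in iris measurements stand in for the lecture data (an assumption)
x <- scale(as.matrix(iris[, 1:4]))

# Hypothetical helper: fit t-SNE after fixing the random seed
fit_tsne <- function(seed) {
  set.seed(seed)
  Rtsne(x, check_duplicates = FALSE)$Y
}

y_a <- fit_tsne(2013)
y_b <- fit_tsne(2013)  # same seed as y_a
y_c <- fit_tsne(2014)  # different seed

# Compare the embeddings coordinate by coordinate
same_seed_match <- isTRUE(all.equal(y_a, y_b))
diff_seed_match <- isTRUE(all.equal(y_a, y_c))
```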

---

## Starbucks t-SNE plot - watch the perplexity!

.pull-left[

```r
*set.seed(2013)
tsne_fit <- starbucks %>%
  dplyr::select(serv_size_m_l:caffeine_mg) %>%
  scale() %>%
* Rtsne(perplexity = 100,
        check_duplicates = FALSE)
```

- Perplexity should generally increase with more data

- Perplexity should not be bigger than `$\frac{n-1}{3}$`
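
The upper bound is easy to check before fitting. The helpers below are hypothetical names written for illustration, not part of `Rtsne`. `Rtsne()` itself should stop with an error if the perplexity is too large for the number of rows.

```r
# Hypothetical helper: largest integer perplexity with perplexity <= (n - 1) / 3
max_perplexity <- function(n) floor((n - 1) / 3)

# Hypothetical helper: cap a requested perplexity at the bound for n rows
safe_perplexity <- function(requested, n) min(requested, max_perplexity(n))

max_perplexity(150)        # 49, since (150 - 1) / 3 = 49.67
safe_perplexity(100, 150)  # 49: the request of 100 is too large for 150 rows
```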

]

.pull-right[

]

---

## Back to the spirals: results depend on perplexity!

---

## Criticisms of t-SNE plots

.pull-left[

- __Poor scalability__: does not scale well for large data, can practically
only embed into 2 or 3 dimensions

- __Meaningless global structure__: distance between clusters might not
have clear interpretation and cluster size doesn’t have any meaning to
it

- __Poor performance with very high dimensional data__: often needs PCA as a
preliminary dimension reduction step

- [__Sometimes random noise can lead to false positive structure in the
t-SNE projection__](https://distill.pub/2016/misread-tsne/)

- __Can NOT be interpreted like PCA!__

]

.pull-right[
<img src="figs/Lec9/unnamed-chunk-10-1.png" width="100%" />
]
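
The PCA pre-step can be sketched as follows, using simulated noise as hypothetical high-dimensional data. The choice of 10 components is arbitrary and only for illustration. `Rtsne()` also has `pca` and `initial_dims` arguments that perform a similar step internally.

```r
library(Rtsne)

# Hypothetical high-dimensional data: 200 observations, 100 noisy features
set.seed(2022)
x_high <- matrix(rnorm(200 * 100), nrow = 200)

# Step 1: reduce to the first 10 principal components
pc_scores <- prcomp(x_high, center = TRUE, scale. = TRUE)$x[, 1:10]

# Step 2: run t-SNE on the PC scores, turning off Rtsne's built-in PCA step
tsne_fit <- Rtsne(pc_scores, pca = FALSE, perplexity = 30,
                  check_duplicates = FALSE)
```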

---

## Interactive web apps with [`Shiny`](https://shiny.rstudio.com/)

Shiny is a framework for building __interactive__ web applications and dynamic dashboards in `R`

__You do NOT need to be a web developer to create Shiny apps__, you just need to learn some additional syntax to augment your `R` code

Every Shiny app consists of two scripts (could also be saved into one file `app.R` but that's annoying)

1. `ui.R`: controls __user interface__, sets up the display, __widgets__ for user `input`

- contains more code specific to Shiny

2. `server.R`: code to generate / display the results! Communicates with `ui.R` through __reactive objects__: processes user `input` to return `output`

- will contain more _traditional_ `R` code: load packages, data wrangling, create plots
  
--

Can be run locally or deployed on a Shiny app server for public viewing
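
A minimal single-file sketch of the two pieces, using the built-in `faithful` data. The widget and output IDs (`n_bins`, `hist_plot`) are names chosen for this example; in a two-script app the `ui` object would live in `ui.R` and the `server` function in `server.R`.

```r
library(shiny)

# ui: layout plus a slider widget for user input
ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput(inputId = "n_bins", label = "Number of bins:",
              min = 5, max = 50, value = 30),
  plotOutput(outputId = "hist_plot")
)

# server: reacts to input$n_bins and renders output$hist_plot
server <- function(input, output) {
  output$hist_plot <- renderPlot({
    hist(faithful$eruptions, breaks = input$n_bins,
         main = NULL, xlab = "Eruption duration (minutes)")
  })
}

# Build the app object; call runApp(app) in an interactive session to launch it
app <- shinyApp(ui = ui, server = server)
```

Moving the slider changes `input$n_bins`, which triggers `renderPlot()` to redraw the histogram.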

---
class: center, middle

# DO IT LIVE

---
class: center, middle

# Next time: Maps

__HW4 due today! HW5 due next Wednesday and Graphics Critique / Replication #2 due Friday Oct 7th!__