class: center, middle, inverse, title-slide .title[ # 36-613: Data Visualization ] .subtitle[ ## High Dimensional Data ] .author[ ### Professor Ron Yurko ] .date[ ### 9/26/2022 ] --- # Conceptual review Last class: contour plots, heat maps, and diving into high-dimensional data #### Today: how do we visualize structure of high-dimensional data? - Example: What if I give you a dataset with 50 variables, and ask you to make __one visualization__ that best represents the data? _What do you do?_ -- - Do NOT panic and make `\(\binom{50}{2} = 1225\)` pairs of plots! - __Intuition__: Take high-dimensional data and __represent it in 2-3 dimensions__, then visualize those dimensions --- ## Thinking about distance... When describing visuals, we've implicitly "clustered" observations together - e.g., where are the mode(s) in the data? -- These types of tasks require characterizing the __distance__ between observations - Clusters: groups of observations that are "close" together -- This is easy to do for 2 quantitative variables: just make a scatterplot (possibly with contours or a heatmap) #### But how do we define "distance" for high-dimensional data? -- Let `\(\boldsymbol{x}_i = (x_{i1}, \dots, x_{ip})\)` be a vector of `\(p\)` features for observation `\(i\)` Question of interest: How "far away" is `\(\boldsymbol{x}_i\)` from `\(\boldsymbol{x}_j\)`? -- When looking at a scatterplot, you're using __Euclidean distance__ (length of the line in `\(p\)`-dimensional space): `$$d(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + \dots + (x_{ip} - x_{jp})^2}$$` --- ## Distances in general There are a variety of different distance metrics: [Manhattan](https://en.wikipedia.org/wiki/Taxicab_geometry), [Mahalanobis](https://en.wikipedia.org/wiki/Mahalanobis_distance), [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity), [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence), [Wasserstein](https://en.wikipedia.org/wiki/Wasserstein_metric), but we're just going to focus on [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) -- `\(d(\boldsymbol{x}_i, \boldsymbol{x}_j)\)` measures the pairwise distance between two observations `\(i,j\)` and has the following properties: 1. __Identity__: `\(\boldsymbol{x}_i = \boldsymbol{x}_j \iff d(\boldsymbol{x}_i, \boldsymbol{x}_j) = 0\)` 2. __Non-Negativity__: `\(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0\)` 3. __Symmetry__: `\(d(\boldsymbol{x}_i, \boldsymbol{x}_j) = d(\boldsymbol{x}_j, \boldsymbol{x}_i)\)` 4.
__Triangle Inequality__: `\(d(\boldsymbol{x}_i, \boldsymbol{x}_j) \leq d(\boldsymbol{x}_i, \boldsymbol{x}_k) + d(\boldsymbol{x}_k, \boldsymbol{x}_j)\)` -- .pull-left[ __Distance Matrix__: matrix `\(D\)` of all pairwise distances - `\(D_{ij} = d(\boldsymbol{x}_i, \boldsymbol{x}_j)\)` - where `\(D_{ii} = 0\)` and `\(D_{ij} = D_{ji}\)` ] .pull-right[ `$$D = \begin{pmatrix} 0 & D_{12} & \cdots & D_{1n} \\ D_{21} & 0 & \cdots & D_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ D_{n1} & \cdots & \cdots & 0 \end{pmatrix}$$` ] --- ## Multi-dimensional scaling (MDS) #### General approach for visualizing distance matrices - Puts `\(n\)` observations in a `\(k\)`-dimensional space such that the distances are preserved as much as possible - where `\(k << p\)`; typically choose `\(k = 2\)` -- MDS attempts to create a new point `\(\boldsymbol{y}_i = (y_{i1}, y_{i2})\)` for each observation such that: `$$\sqrt{(y_{i1} - y_{j1})^2 + (y_{i2} - y_{j2})^2} \approx D_{ij}$$` - i.e., distance in the 2D MDS world is approximately equal to the actual distance -- #### Then plot the new `\(\boldsymbol{y}\)`s on a scatterplot - Use the `scale()` function to ensure variables are comparable - Make a distance matrix for this dataset - Visualize it with MDS --- ## MDS example with Starbucks drinks .pull-left[ ```r starbucks_scaled_quant_data <- starbucks %>% dplyr::select(serv_size_m_l:caffeine_mg) %>% scale(center = FALSE, * scale = apply(., 2, sd, na.rm = TRUE)) *dist_euc <- dist(starbucks_scaled_quant_data) *starbucks_mds <- cmdscale(d = dist_euc, k = 2) starbucks <- starbucks %>% mutate(mds1 = starbucks_mds[,1], mds2 = starbucks_mds[,2]) starbucks %>% ggplot(aes(x = mds1, y = mds2)) + geom_point(alpha = 0.5) + labs(x = "Coordinate 1", y = "Coordinate 2") ``` ] .pull-right[ <img src="figs/Lec8/unnamed-chunk-2-1.png" width="100%" /> ] --- # View structure with additional variables .pull-left[ <img src="figs/Lec8/unnamed-chunk-3-1.png" width="100%" /> ] .pull-right[ <img src="figs/Lec8/unnamed-chunk-4-1.png" width="100%" /> ] --- ## Dimension reduction - searching for variance __GOAL__: Focus on reducing dimensionality of feature space, i.e., number of columns, while __retaining__ most of the information, i.e., __variance__, in a lower dimensional space - `\(n \times p\)` matrix `\(\rightarrow\)` dimension reduction technique `\(\rightarrow\)` `\(n \times k\)` matrix -- Special case we just discussed: __MDS__ - `\(n \times n\)` __distance__ matrix `\(\rightarrow\)` MDS `\(\rightarrow\)` `\(n \times k\)` matrix (usually `\(k = 2\)`) -- - This requires converting the data into a distance matrix - summarizing all differences between observations into a single number, effectively a "double reduction": 1. Reduce data to a distance matrix 2. Reduce distance matrix to `\(k = 2\)` dimensions #### How can we apply dimension reduction to the original data?
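Side note (not from the original slides): here is a minimal sketch of how one might check how faithfully the two MDS coordinates preserve the original distances in the Starbucks example. It assumes `dist_euc` and `starbucks_mds` from the chunk above are still in the workspace and that the tidyverse is loaded; the object name `mds_check` is just for illustration.

```r
# Compare every original pairwise distance D_ij to the corresponding
# distance between the 2-D MDS coordinates (illustrative check only)
mds_check <- tibble::tibble(
  original_dist = as.vector(dist_euc),       # D_ij from the scaled Starbucks data
  mds_dist = as.vector(dist(starbucks_mds))  # distance in the 2-D MDS space
)
mds_check %>%
  ggplot(aes(x = original_dist, y = mds_dist)) +
  geom_point(alpha = 0.1) +
  # pairs close to the dashed identity line are well preserved by MDS
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "darkred") +
  labs(x = "Original Euclidean distance", y = "Distance between MDS coordinates")
```

The closer the points sit to the identity line, the better the two MDS coordinates summarize the full distance matrix.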
--- ## Principal Component Analysis (PCA) $$ `\begin{pmatrix} & & \text{really} & & \\ & & \text{wide} & & \\ & & \text{matrix} & & \end{pmatrix}` \rightarrow \text{matrix algebra stuff} \rightarrow `\begin{pmatrix} \text{much} \\ \text{thinner} \\ \text{matrix} \end{pmatrix}` $$ - Start with `\(n \times p\)` matrix of __correlated__ variables `\(\rightarrow\)` `\(n \times k\)` matrix of __uncorrelated__ variables -- - Each of the `\(k\)` columns in the right-hand matrix is a __principal component__, and they are all uncorrelated with each other - First column accounts for the most variation in the data, second column for the second-most variation, and so on #### Intuition: first few principal components account for most of the variation in the data --- ## What are principal components? - Assume `\(\boldsymbol{X}\)` is an `\(n \times p\)` matrix that is __centered__ and __standardized__ - _Total variation_ `\(= p\)`, since Var( `\(\boldsymbol{x}_j\)` ) = 1 for all `\(j = 1, \dots, p\)` - PCA will give us `\(p\)` principal components that are `\(n\)`-length columns - call these `\(Z_1, \dots, Z_p\)` -- __First principal component__ (aka PC1): `$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{p1} X_p$$` -- - `\(\phi_{j1}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)` - Weights are normalized: `\(\sum_{j=1}^p \phi_{j1}^2 = 1\)` - `\(\phi_{1} = (\phi_{11}, \phi_{21}, \dots, \phi_{p1})\)` is the __loading vector__ for PC1 -- - `\(Z_1\)` is the linear combination of the `\(p\)` variables that has the __largest variance__ --- ## What are principal components? __Second principal component__: `$$Z_2 = \phi_{12} X_1 + \phi_{22} X_2 + \dots + \phi_{p2} X_p$$` - `\(\phi_{j2}\)` are the weights indicating the contributions of each variable `\(j \in 1, \dots, p\)` - Weights are normalized: `\(\sum_{j=1}^p \phi_{j2}^2 = 1\)` - `\(\phi_{2} = (\phi_{12}, \phi_{22}, \dots, \phi_{p2})\)` is the __loading vector__ for PC2 - `\(Z_2\)` is the linear combination of the `\(p\)` variables that has the __largest variance__ - __Subject to the constraint that it is uncorrelated with `\(Z_1\)`__ -- We repeat this process to create `\(p\)` principal components - __Uncorrelated__: Each pair ($Z_j, Z_{j'}$) is uncorrelated - __Ordered Variance__: Var( `\(Z_1\)` ) `\(>\)` Var( `\(Z_2\)` ) `\(> \dots >\)` Var( `\(Z_p\)` ) - __Total Variance__: `\(\sum_{j=1}^p \text{Var}(Z_j) = p\)` #### Intuition: pick some `\(k << p\)` such that if `\(\sum_{j=1}^k \text{Var}(Z_j) \approx p\)`, then just use `\(Z_1, \dots, Z_k\)` --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="figs/Lec8/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="figs/Lec8/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="figs/Lec8/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="figs/Lec8/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" /> --- ## [Visualizing PCA](https://www.stevejburr.com/post/scatter-plots-and-best-fit-lines/) in two dimensions <img src="figs/Lec8/unnamed-chunk-9-1.png" width="100%"
style="display: block; margin: auto;" /> --- ## So what do we do with the principal components? __The point__: given a dataset with `\(p\)` variables, we can find `\(k\)` variables `\((k << p)\)` that account for most of the variation in the data -- Note that the principal components are NOT easy to interpret - these are combinations of all variables PCA is similar to MDS with these main differences: 1. MDS reduces a _distance_ matrix while PCA reduces a _data_ matrix 2. PCA has a principled way to choose `\(k\)` 3. Can visualize how the principal components are related to variables in data --- ## Working with PCA on Starbucks drinks Use the `prcomp()` function (based on SVD) for PCA on __centered__ and __scaled__ data ```r *starbucks_pca <- prcomp(dplyr::select(starbucks, serv_size_m_l:caffeine_mg), * center = TRUE, scale. = TRUE) summary(starbucks_pca) ``` ``` ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 ## Standard deviation 2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413 0.28123 0.16874 ## Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177 0.00719 0.00259 ## Cumulative Proportion 0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894 0.99657 0.99916 ## PC10 PC11 ## Standard deviation 0.08702 0.04048 ## Proportion of Variance 0.00069 0.00015 ## Cumulative Proportion 0.99985 1.00000 ``` --- ## Computing Principal Components Extract the matrix of principal components `\(\boldsymbol{Z} = XV\)` (dimension of `\(\boldsymbol{Z}\)` will match original data) ```r starbucks_pc_matrix <- starbucks_pca$x head(starbucks_pc_matrix) ``` ``` ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## [1,] -3.766852 -1.0023657 0.2482698 -0.1521871448 0.24739830 -0.11365847 -0.02812472 ## [2,] -3.633234 -0.6946439 1.2059943 -0.3720566566 0.06052789 -0.06406410 0.05460952 ## [3,] -3.518063 -0.3981399 2.2165170 -0.5967175941 -0.13122572 -0.01937237 0.09050806 ## [4,] -3.412061 -0.1067045 3.3741594 -0.8490378243 -0.26095965 -0.00899485 0.11585507 ## [5,] -3.721426 -0.9868147 -1.0705094 0.0949330091 -0.27181508 0.17491809 0.07009414 ## [6,] -3.564899 -0.6712499 -0.7779083 -0.0003019903 -0.72054963 0.37005543 0.20236484 ## PC8 PC9 PC10 PC11 ## [1,] 0.006489978 0.05145094 -0.06678083 -0.019741873 ## [2,] 0.021148978 0.07094211 -0.08080545 -0.023480029 ## [3,] 0.031575955 0.08901403 -0.09389227 -0.028669251 ## [4,] 0.037521689 0.11287190 -0.11582260 -0.034691142 ## [5,] 0.037736197 0.02892317 -0.03631676 -0.005775410 ## [6,] 0.068154160 0.03705252 -0.03497690 -0.002469611 ``` Columns are uncorrelated, such that Var( `\(Z_1\)` ) `\(>\)` Var( `\(Z_2\)` ) `\(> \dots >\)` Var( `\(Z_p\)` ) - can start with a scatterplot of `\(Z_1, Z_2\)` --- ## Starbucks drinks: PC1 and PC2 .pull-left[ ```r starbucks <- starbucks %>% mutate(pc1 = starbucks_pc_matrix[,1], pc2 = starbucks_pc_matrix[,2]) starbucks %>% ggplot(aes(x = pc1, y = pc2)) + geom_point(alpha = 0.5) + labs(x = "PC 1", y = "PC 2") ``` - __Look familiar?__ - Principal components are not interpretable, but we can add a __biplot__ with arrows showing the linear relationship between one variable and other variables ] .pull-right[ <img src="figs/Lec8/unnamed-chunk-12-1.png" width="100%" /> ] --- ## Making PCs interpretable with biplots ([`factoextra`](http://www.sthda.com/english/wiki/factoextra-r-package-easy-multivariate-data-analyses-and-elegant-visualization)) .pull-left[ ```r library(factoextra) # Designate to only label the variables: *fviz_pca_biplot( * starbucks_pca, label = "var", # Change the alpha for observations # which is 
represented by ind alpha.ind = .25, # Modify the alpha for variables (var): alpha.var = .75, col.var = "darkblue") ``` - Arrow direction: "as the variable increases..." - Arrow angles: correlation - 90 degrees means uncorrelated - `\(< 90\)` means positively correlated - `\(> 90\)` means negatively correlated - Arrow length: strength of relationship with PCs ] .pull-right[ <img src="figs/Lec8/unnamed-chunk-13-1.png" width="100%" /> ] --- ## How many principal components to use? #### Intuition: Additional principal components will add smaller and smaller variance - Keep adding components until the added variance _drops off_ ```r summary(starbucks_pca) ``` ``` ## Importance of components: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 ## Standard deviation 2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413 0.28123 0.16874 ## Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177 0.00719 0.00259 ## Cumulative Proportion 0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894 0.99657 0.99916 ## PC10 PC11 ## Standard deviation 0.08702 0.04048 ## Proportion of Variance 0.00069 0.00015 ## Cumulative Proportion 0.99985 1.00000 ``` --- ## Create scree plot (aka "elbow plot") to choose ```r *fviz_eig(starbucks_pca, addlabels = TRUE) + geom_hline(yintercept = 100 * (1 / ncol(starbucks_pca$x)), linetype = "dashed", color = "darkred") ``` <img src="figs/Lec8/scree-plot-1.png" width="80%" style="display: block; margin: auto;" /> - Number of dimensions on x-axis, proportion of variance on y-axis - _Rule of thumb_: horizontal line at `\(1/p\)` (__Why?__) --- ## Nonlinear dimension reduction, e.g., t-SNE <img src="figs/Lec8/tsne-plot-1.png" width="100%" style="display: block; margin: auto;" /> --- class: center, middle # Next time: More High-Dimensional Data and Shiny Reminder: __HW4 due Wednesday!__ Recommended reading: [CW Chapter 12 Visualizing associations among two or more quantitative variables](https://clauswilke.com/dataviz/visualizing-associations.html) --- ## PCA: [__singular value decomposition (SVD)__](https://en.wikipedia.org/wiki/Singular_value_decomposition) $$ X = U D V^T $$ - Matrices `\(U\)` and `\(V\)` contain the left and right (respectively) __singular vectors of the scaled matrix `\(X\)`__ - `\(D\)` is the diagonal matrix of the __singular values__ -- - SVD simplifies matrix-vector multiplication as __rotate, scale, and rotate again__ -- `\(V\)` is called the __loading matrix__ for `\(X\)`, with the `\(\phi_{j}\)` as its columns - `\(Z = X V\)` is the PC matrix -- BONUS __eigenvalue decomposition__ (aka spectral decomposition) - `\(V\)` contains the __eigenvectors__ of `\(X^TX\)` (proportional to the covariance matrix; `\(^T\)` means _transpose_) - `\(U\)` contains the __eigenvectors__ of `\(XX^T\)` - The singular values (diagonal of `\(D\)`) are square roots of the __eigenvalues__ of `\(X^TX\)` or `\(XX^T\)` - Meaning that `\(Z = UD\)` --- ## Eigenvalues guide dimension reduction We want to choose `\(p^* < p\)` such that we still explain most of the variation in the data -- Eigenvalues `\(\lambda_j\)` for `\(j \in 1, \dots, p\)` indicate __the variance explained by each component__ - `\(\sum_{j=1}^p \lambda_j = p\)`, meaning `\(\lambda_j \geq 1\)` indicates `\(\text{PC}j\)` contains at least one variable's worth of variability - `\(\lambda_j / p\)` equals the proportion of variance explained by `\(\text{PC}j\)` - Arranged in descending order so that `\(\lambda_1\)` is the largest eigenvalue and corresponds to PC1 -- - Can compute the cumulative proportion of variance explained (CVE) with `\(p^*\)` components: `$$\text{CVE}_{p^*} =
\frac{\sum_{j=1}^{p^*} \lambda_j}{p}$$` Can use a [__scree plot__](https://en.wikipedia.org/wiki/Scree_plot) to plot the eigenvalues and guide the choice of `\(p^* < p\)` by looking for an "elbow" (rapid to slow change)
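To make the eigenvalue bookkeeping above concrete, here is a small sketch (not from the original slides) that recovers the `\(\lambda_j\)`, the proportions of variance, and the CVE directly from the `starbucks_pca` object created earlier; the 90% cutoff at the end is just an example threshold.

```r
# Eigenvalues and variance explained from the prcomp() output (illustrative)
lambda <- starbucks_pca$sdev^2    # eigenvalues: variance of each principal component
p <- length(lambda)
sum(lambda)                       # total variance = p for centered and standardized data
lambda / p                        # proportion of variance explained by each PC
cve <- cumsum(lambda) / p         # cumulative proportion of variance explained
which(cve >= 0.9)[1]              # e.g., smallest p* explaining at least 90% of variance
```

These values match the `Proportion of Variance` and `Cumulative Proportion` rows reported by `summary(starbucks_pca)` above.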