36-613: Data Visualization

class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## 2D Quantitative Data
]
.author[
### Professor Ron Yurko
]
.date[
### 9/19/2022
]

---

## Let's talk about the project..

#### IDMRAD report, due Friday, Oct 14th at 5 PM EDT

- Read through the rubric file on Canvas in the `project` folder

- You will find your team's assigned project dataset in [this google sheet](https://docs.google.com/spreadsheets/d/1XRCC8WO8O1-fy9nDf1pQ_15XxYYNGU6Z/edit?usp=sharing&ouid=110931742237993657587&rtpof=true&sd=true)

- Look at the example IDMRAD paper for how to structure reports (also covered in rubric)

- Your report will include many more (and higher quality) graphics!
  
  - Also required to make an interactive Shiny app/dashboard with link in report
  
- The remaining HWs will include exercises related to your projects

---

## 2D quantitative data

- We're working with two variables: `$(X, Y) \in \mathbb{R}^2$`, i.e., dataset with `$n$` rows and 2 columns

- Goals:

- describing the relationships between two variables
  
  - describing the conditional distribution `$Y | X$` via regression analysis
  
  - describing the joint distribution `$X,Y$` via contours, heatmaps, etc.
  
--

- Few big picture ideas to keep in mind:

- scatterplots are by far the most common visual
  
  - regression analysis is by far the most popular analysis (you have a whole class on this...)
  
  - relationships may vary across other variables, e.g., categorical variables
  
---

## Making scatterplots with `geom_point()`

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) +
* geom_point()
```

- Displaying the joint distribution

- What is the __obvious flaw__ with this plot?

]

.pull-right[

]

---

## Making scatterplots: ALWAYS adjust the `alpha`

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm,
             y = body_mass_g)) +
* geom_point(alpha = 0.5)
```

- Adjust the transparency of points via `alpha` to __visualize overlap__

- Provides better understanding of joint frequency

- TEspecially important with larger datasets

]

.pull-right[

]

---

## Displaying trend lines: linear regression

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) +
  geom_point(alpha = 0.5) + 
* geom_smooth(method = "lm")
```

- Displays `body_mass_g ~ flipper_length_mm` regression line

- Adds 95% confidence intervals by default

- Estimating the __conditional expectation__ of `body_mass_g` | `flipper_length_mm`,

- i.e., `$\mathbb{E}[Y | X]$`

]
.pull-right[

]

---

## Assessing assumptions of linear regression

Linear regression assumes `$Y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$`

- If this is true, then `$Y_i - \hat{Y}_i \overset{iid}{\sim} N(0, \sigma^2)$`

Plot residuals against `$\hat{Y}_i$`, __residuals vs fit__ plot

- Used to assess linearity, any divergence from mean 0

- Used to assess equal variance, i.e., if `$\sigma^2$` is homogenous across predictions/fits `$\hat{Y}_i$`

More difficult to assess the independence and fixed `$X$` assumptions

- Make these assumptions based on subject-matter knowledge

---

## Residual vs fit plots

.pull-left[

```r
lin_reg <- 
* lm(body_mass_g ~ flipper_length_mm,
     data = penguins)

tibble(fits = fitted(lin_reg), 
       residuals = residuals(lin_reg)) %>%
  ggplot(aes(x = fits, y = residuals)) +
  geom_point(alpha = 0.5) +
* geom_hline(yintercept = 0,
*            linetype = "dashed",
             color = "red")
```

Two things to look for:

1. Any trend around horizontal reference line?

2. Equal vertical spread?

]

.pull-right[

]

---

## Residual vs fit plots

.pull-left[

```r
tibble(fits = fitted(lin_reg), 
       residuals = residuals(lin_reg)) %>%
  ggplot(aes(x = fits, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, 
             linetype = "dashed",
             color = "red") +
* geom_smooth()
```

Two things to look for:

1. Any trend around horizontal reference line?

- add `geom_smooth` to highlight this

2. Equal vertical spread?

]

.pull-right[

]

---

## Local linear regression via LOESS (Local Estimated Scatterplot Smoothing)

Still assume Normality, but not linearity: `$Y_i \overset{iid}{\sim} N(f(x), \sigma^2)$`, where `$f(x)$` is some unknown function

In __local linear regression__, we estimate `$f(X_i)$`:

`$$\text{arg }\underset{\beta_0, \beta_1}{\text{min}} \sum_i^n w_i(x) \cdot \big(Y_i - \beta_0 - \beta_1 X_i \big)^2$$`

- Notice the weights depend on `$x$`, meaning observations close to `$x$` given large weight in estimating `$f(x)$`

`geom_smooth()` uses tri-cubic weighting:

`$$w_i(d_i) = \begin{cases} (1 - |d_i|^3)^3, \text{ if } i \in \text{neighborhood of  } x, \\
0 \text{ if } i \notin \text{neighborhood of  } x \end{cases}$$`

- where `$d_i$` is the distance between `$x$` and `$X_i$` scaled to be between 0 and 1
  
  - control `span`: decides proportion of observations in neighborhood
  
---

## Displaying trend lines: LOESS

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) +
  geom_point(alpha = 0.5) + 
* geom_smooth()
```

- LOESS is default behavior with `$n \leq 1000$`

- Default `span = 0.75`

- For `$n > 1000$`, `mgcv::gam()` is used with `formula = y ~ s(x, bs = "cs")` and `method = "REML"`

]
.pull-right[

]

---

## Displaying trend lines: LOESS

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) +
  geom_point(alpha = 0.5) + 
* geom_smooth(span = .1)
```

- LOESS is default behavior with `$n \leq 1000$`

- Default `span = 0.75`

- For `$n > 1000$`, `mgcv::gam()` is used with `formula = y ~ s(x, bs = "cs")` and `method = "REML"`

]
.pull-right[

]

---

## Can also update formula within plot

.pull-left[

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, 
             y = body_mass_g)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm",
*             formula = y ~ x + I(x^2))
```

- Fit `body_mass_g ~ flipper_length_mm + flipper_length_mm^2` instead

- _Exercise: check the updated residual plot with this model_

]
.pull-right[

]

---

## What about focusing on the joint distribution?

.pull-left[

- Example [dataset of pitches](https://raw.githubusercontent.com/ryurko/DataViz-36613-Fall22/main/data/ohtani_pitches_2022.csv) thrown by baseball superstar [Shohei Ohtani](https://www.baseball-reference.com/players/o/ohtansh01.shtml)

```r
ohtani_pitches %>%
  ggplot(aes(x = plate_x, y = plate_z)) +
  geom_point(alpha = 0.2) +
* coord_fixed() +
  theme_bw()
```

- Where are the high/low concentrations of X,Y?

- How do we display concentration for 2D data?

- `coord_fixed()` so axes match with unit scales

]

.pull-right[
<img src="figs/Lec6/unnamed-chunk-11-1.png" width="100%" />

]

---

## Going from 1D to 2D density estimation

In 1D: estimate density `$f(x)$`, assuming that `$f(x)$` is _smooth_:

$$
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i)
$$

In 2D: estimate joint density `$f(x_1, x_2)$`

`$$\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K(\frac{x_1 - x_{i1}}{h_1}) K(\frac{x_2 - x_{i2}}{h_2})$$`

In 1D there was one bandwidth, now __we have two bandwidths__

- `$h_1$`: controls smoothness as `$X_1$` changes, holding `$X_2$` fixed
  - `$h_2$`: controls smoothness as `$X_2$` changes, holding `$X_1$` fixed

Again Gaussian kernels are the most popular...

---

## So how do we display densities for 2D data?

---

## How to read contour plots?

Best known in topology: outlines (contours) denote levels of elevation

---

## Display 2D contour plot

.pull-left[

```r
ohtani_pitches %>%
  ggplot(aes(x = plate_x, y = plate_z)) +
  geom_point(alpha = 0.2) +
* geom_density2d() +
  coord_fixed() +
  theme_bw()
```

- Use `geom_density2d` to display contour lines

- Inner lines denote "peaks"

]

.pull-right[
<img src="figs/Lec6/unnamed-chunk-14-1.png" width="100%" />

]

---

## Display 2D contour plot

.pull-left[

```r
ohtani_pitches %>%
  ggplot(aes(x = plate_x, y = plate_z)) +
* stat_density2d(aes(fill = after_stat(level)),
*                geom = "polygon") +
  geom_point(alpha = 0.2) +
  coord_fixed() +
* scale_fill_gradient(low = "darkblue",
*                     high = "darkorange") +
  theme_bw()
```

- Use `stat_density2d` for additional features

- May be easier to read than nested lines with color

- __Default color scale is awful!__ Always change it!

]

.pull-right[
<img src="figs/Lec6/unnamed-chunk-15-1.png" width="100%" />

]

---

## Visualizing grid heat maps

.pull-left[

```r
ohtani_pitches %>%
  ggplot(aes(x = plate_x, y = plate_z)) +
* stat_density2d(aes(fill = after_stat(density)),
*                geom = "tile",
*                contour = FALSE) +
  geom_point(alpha = 0.2) +
  coord_fixed() +
* scale_fill_gradient(low = "white",
*                     high = "red") +
  theme_bw()
```

- Divide the space into a grid and color the grid according to high/low values

- Common to treat "white" as empty color

]

.pull-right[
<img src="figs/Lec6/unnamed-chunk-16-1.png" width="100%" />

]

---

## Alternative idea: hexagonal binning

.pull-left[

```r
ohtani_pitches %>%
  ggplot(aes(x = plate_x, y = plate_z)) +
* geom_hex() +
  coord_fixed() +
  scale_fill_gradient(low = "darkblue", 
                      high = "darkorange") + 
  theme_bw()
```

- Can specify `binwidth` in both directions

- 2D version of histogram

- _Need to install `hexbin` package_

]

.pull-right[
<img src="figs/Lec6/unnamed-chunk-17-1.png" width="100%" />

]

---
class: center, middle

# Next time: High-Dimensional Data

Reminder: __HW3 due Wednesday!__