class: center, middle, inverse, title-slide

.title[
# 36-613: Data Visualization
]
.subtitle[
## Visualizing Text Data
]
.author[
### Professor Ron Yurko
]
.date[
### 10/5/2022
]

---

## Working with raw text data

- We'll work with the script from the best episode of ['The Office': Season 4, Episode 13 - 'Dinner Party'](https://en.wikipedia.org/wiki/Dinner_Party_(The_Office))

- We can access the script using the [`schrute` package (yes this is a real thing)](https://cran.r-project.org/web/packages/schrute/vignettes/theoffice.html):


```r
library(tidyverse)
library(schrute)
# Create a table from this package just corresponding to the Dinner Party episode:
dinner_party_table <- theoffice %>%
  filter(season == 4, episode == 13) %>%
  # Just select columns of interest:
  dplyr::select(index, character, text)
head(dinner_party_table)
```

```
## # A tibble: 6 × 3
##   index character text
##   <int> <chr>     <chr>
## 1 16791 Stanley   This is ridiculous.
## 2 16792 Phyllis   Do you have any idea what time we'll get out of here?
## 3 16793 Michael   Nobody likes to work late, least of all me. Do you have plans…
## 4 16794 Jim       Nope I don't, remember when you told us not to make plans 'ca…
## 5 16795 Michael   Yes I remember. Mmm, this is B.S. This is B.S. Why are we her…
## 6 16796 Dwight    Thank you Michael.
```

---

## Bag of Words representation of text

- Most common way to store text data is with a __document-term matrix__ (DTM):

|            | Word 1       | Word 2       | `\(\dots\)` | Word `\(J\)` |
| ---------- | ------------ | ------------ | ----------- | ------------ |
| Document 1 | `\(w_{11}\)` | `\(w_{12}\)` | `\(\dots\)` | `\(w_{1J}\)` |
| Document 2 | `\(w_{21}\)` | `\(w_{22}\)` | `\(\dots\)` | `\(w_{2J}\)` |
| `\(\dots\)` | `\(\dots\)` | `\(\dots\)` | `\(\dots\)` | `\(\dots\)` |
| Document N | `\(w_{N1}\)` | `\(w_{N2}\)` | `\(\dots\)` | `\(w_{NJ}\)` |

- `\(w_{ij}\)`: count of word `\(j\)` in document `\(i\)`, aka _term frequencies_

--

Two additional ways to reduce the number of columns:

1.
__Stop words__: remove extremely common words (e.g., of, the, a)

2. __Stemming__: reduce all words to their "stem"

  - For example: Reducing = reduc. Reduce = reduc. Reduces = reduc.

It is easy to convert text into DTM format using the [`tidytext` package](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)

---

## Tokenize text into long format

- Convert raw text into a long, tidy table with one token per document per row

- A __token__ is a unit of text - typically a word


```r
library(tidytext)
tidy_dinner_party_tokens <- dinner_party_table %>%
*  unnest_tokens(word, text)
# View the first so many rows:
head(tidy_dinner_party_tokens)
```

```
## # A tibble: 6 × 3
##   index character word
##   <int> <chr>     <chr>
## 1 16791 Stanley   this
## 2 16791 Stanley   is
## 3 16791 Stanley   ridiculous
## 4 16792 Phyllis   do
## 5 16792 Phyllis   you
## 6 16792 Phyllis   have
```

---

## Remove stop words and apply stemming

.pull-left[

- Load `stop_words` from `tidytext`


```r
data(stop_words)
tidy_dinner_party_tokens <- tidy_dinner_party_tokens %>%
*  filter(!(word %in% stop_words$word))
head(tidy_dinner_party_tokens)
```

```
## # A tibble: 6 × 3
##   index character word
##   <int> <chr>     <chr>
## 1 16791 Stanley   ridiculous
## 2 16792 Phyllis   idea
## 3 16792 Phyllis   time
## 4 16793 Michael   likes
## 5 16793 Michael   late
## 6 16793 Michael   plans
```

]

--

.pull-right[

- Use the [`SnowballC` package](https://cran.r-project.org/web/packages/SnowballC/SnowballC.pdf) to perform stemming


```r
library(SnowballC)
tidy_dinner_party_tokens <- tidy_dinner_party_tokens %>%
*  mutate(stem = wordStem(word))
head(tidy_dinner_party_tokens)
```

```
## # A tibble: 6 × 4
##   index character word       stem
##   <int> <chr>     <chr>      <chr>
## 1 16791 Stanley   ridiculous ridicul
## 2 16792 Phyllis   idea       idea
## 3 16792 Phyllis   time       time
## 4 16793 Michael   likes      like
## 5 16793 Michael   late       late
## 6 16793 Michael   plans      plan
```

]

---

## Create word cloud using term frequencies

__Word Cloud__: Displays all words mentioned across
documents, where more common words are larger

- To do this, you must compute the _total_ word counts:

`$$w_{\cdot 1} = \sum_{i=1}^N w_{i1} \hspace{0.1in} \dots \hspace{0.1in} w_{\cdot J} = \sum_{i=1}^N w_{iJ}$$`

- Then, the size of Word `\(j\)` is proportional to `\(w_{\cdot j}\)`

--

Create word clouds in `R` using the [`wordcloud` package](https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf)

It takes two main arguments to create word clouds:

1. `words`: vector of unique words

2. `freq`: vector of frequencies

---

## Create word cloud using term frequencies

.pull-left[


```r
token_summary <- tidy_dinner_party_tokens %>%
  group_by(stem) %>%
  count() %>%
  ungroup()

library(wordcloud)
*wordcloud(words = token_summary$stem,
*          freq = token_summary$n,
*          random.order = FALSE,
           max.words = 100,
           colors = brewer.pal(8, "Dark2"))
```

- Set `random.order = FALSE` to place the biggest words in the center

- Can limit the number of words displayed (`max.words`)

- Other options as well, like `colors`

]

.pull-right[

<img src="figs/Lec11/unnamed-chunk-5-1.png" width="100%" />

]

---

## TF-IDF weighting

- We saw that `michael` was the largest word, but what if I'm interested in comparing text across characters (i.e., documents)?

--

- It's arguably of more interest to understand which words are frequently used in one set of texts but not the other, i.e., which words are unique?
- Many text analytics methods will __down-weight__ words that occur frequently across all documents

--

- __Inverse document frequency (IDF)__: for word `\(j\)` we compute `\(\text{idf}_j = \log \frac{N}{N_j}\)`

  - where `\(N\)` is the number of documents and `\(N_j\)` is the number of documents containing word `\(j\)`

--

- Compute __TF-IDF__ `\(= w_{ij} \times \text{idf}_j\)`

---

## TF-IDF example with characters

Compute and join TF-IDF using `bind_tf_idf()`:


```r
character_token_summary <- tidy_dinner_party_tokens %>%
*  group_by(character, stem) %>%
  count() %>%
  ungroup()

character_token_summary <- character_token_summary %>%
*  bind_tf_idf(stem, character, n)
character_token_summary
```

```
## # A tibble: 597 × 6
##    character stem        n     tf   idf tf_idf
##    <chr>     <chr>   <int>  <dbl> <dbl>  <dbl>
##  1 All       cheer       1 1      2.77  2.77
##  2 Andy      anim        1 0.0476 2.77  0.132
##  3 Andy      bet         1 0.0476 2.08  0.0990
##  4 Andy      capit       1 0.0476 2.77  0.132
##  5 Andy      dinner      1 0.0476 0.981 0.0467
##  6 Andy      flower      2 0.0952 2.77  0.264
##  7 Andy      hei         1 0.0476 1.39  0.0660
##  8 Andy      helena      1 0.0476 2.77  0.132
##  9 Andy      hump        2 0.0952 2.77  0.264
## 10 Andy      michael     1 0.0476 0.981 0.0467
## # … with 587 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

---

## Top 10 words by TF-IDF for each character

.pull-left[


```r
character_token_summary %>%
  filter(character %in% c("Michael", "Jan", "Jim", "Pam")) %>%
  group_by(character) %>%
*  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
*  mutate(stem = reorder_within(stem, tf_idf,
*                               character)) %>%
  ggplot(aes(y = tf_idf, x = stem)) +
  geom_col(fill = "darkblue", alpha = 0.5) +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ character, ncol = 2,
*            scales = "free") +
  labs(y = "TF-IDF", x = NULL)
```

- Bars can be simpler for comparison than several word clouds, to focus on the top words

]

.pull-right[

<img src="figs/Lec11/unnamed-chunk-7-1.png" width="100%" />

]

---

## Sentiment Analysis

- The visualizations so far only look at word _frequency_ (possibly weighted with
TF-IDF)

- Doesn't tell you _how_ words are used

--

- A common goal in text analysis is to try to understand the overall __sentiment__ or "feeling" of text, i.e., __sentiment analysis__

- Typical approach:

  1. Find a sentiment dictionary (e.g., "positive" and "negative" words)

  2. Count the number of words belonging to each sentiment

  3. Using the counts, you can compute an "average sentiment" (e.g., positive counts minus negative counts)

--

- This is called a __dictionary-based approach__

- There are many sentiment dictionaries already available

- The __Bing__ dictionary (named after Bing Liu) provides 6,786 words that are either "positive" or "negative"

---

## Character sentiment analysis

.pull-left[


```r
get_sentiments("bing")
```

```
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # … with 6,776 more rows
## # ℹ Use `print(n = ...)` to see more rows
```

]

--

.pull-right[

Join sentiment to the token table (without stemming)


```r
tidy_all_tokens <- dinner_party_table %>%
  unnest_tokens(word, text)

tidy_sentiment_tokens <- tidy_all_tokens %>%
*  inner_join(get_sentiments("bing"))
head(tidy_sentiment_tokens)
```

```
## # A tibble: 6 × 4
##   index character word       sentiment
##   <int> <chr>     <chr>      <chr>
## 1 16791 Stanley   ridiculous negative
## 2 16793 Michael   likes      positive
## 3 16793 Michael   work       positive
## 4 16795 Michael   enough     positive
## 5 16795 Michael   enough     positive
## 6 16795 Michael   mad        negative
```

]

---

## Character sentiment analysis

.pull-left[


```r
tidy_sentiment_tokens %>%
  group_by(character, sentiment) %>%
  summarize(n_words = n()) %>%
  ungroup() %>%
  group_by(character) %>%
  mutate(total_assigned_words = sum(n_words)) %>%
  ungroup() %>%
  mutate(character = fct_reorder(character, total_assigned_words)) %>%
  ggplot(aes(x = character, y
= n_words,
*            fill = sentiment)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("red", "blue")) +
  theme_bw() +
  theme(legend.position = "bottom")
```

]

.pull-right[

<img src="figs/Lec11/unnamed-chunk-10-1.png" width="100%" />

]

---

## Other functions of text

- We've just focused on word counts - __but there are many functions of text__

- For example: __number of unique words__ is often used to measure vocabulary

<img src="https://pbs.twimg.com/media/DxCgsrxWwAAOWO3.jpg" width="70%" style="display: block; margin: auto;" />

---

# Main Takeaways

- Text is arguably infinite-dimensional, and we need to (somehow) represent text with a finite number of dimensions

- Most common representation: Bag of words and term frequencies (possibly weighted by TF-IDF)

--

- Word clouds are the most common way to visualize the most frequent words in a set of documents

- TF-IDF weighting allows you to detect words that are uniquely used in certain documents

- Common to plot the most "important words" of a set of documents, ordered by TF-IDF weights

--

- Can also measure the "sentiment" of text with sentiment-based dictionaries

- Sentiment analyses are only as relevant as the dictionaries you use

---

class: center, middle

# Next time: Dashboards

__HW5 due today! Graphics Critique / Replication #2 due Friday Oct 7th!__

Recommended reading:

[Text Mining With R](https://www.tidytextmining.com/)

[Supervised Machine Learning for Text Analysis in R](https://smltar.com/)
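
---

## Appendix: checking TF-IDF by hand

As a sanity check of the `bind_tf_idf()` output shown earlier, take the `Andy` / `bet` row. The counts below (21 total tokens for Andy, `\(N = 16\)` speaking characters, `\(N_j = 2\)` of whom use `bet`) are inferred from the printed `tf` and `idf` values rather than stated in the slides, so treat them as an assumption:

`$$\text{tf} = \frac{1}{21} \approx 0.0476, \hspace{0.2in} \text{idf} = \log \frac{16}{2} \approx 2.08, \hspace{0.2in} \text{tf-idf} = 0.0476 \times 2.08 \approx 0.0990$$`

which matches the printed `tf_idf` of 0.0990. Note that `bind_tf_idf()` uses the term _proportion_ (count divided by the document's total tokens) as `tf`, rather than the raw count `\(w_{ij}\)` from the TF-IDF slide; since each document is divided by a constant, the ranking of words within a document is the same either way.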