(Broadly speaking) EDA = questions about data + wrangling + visualization
R for Data Science: “EDA is a state of mind”, an iterative cycle:
generate questions
answer via transformations and visualizations
Examples of questions?
What type of variation do the variables display?
What type of relationships exist between variables?
Goal: develop understanding and become familiar with your data
EDA is NOT a replacement for statistical inference and learning
EDA is an important and necessary step to build intuition
We tackle the challenges of EDA with a data science workflow. An
example of this, according to Hadley Wickham in R for Data Science:
[Figure: data science workflow diagram from R for Data Science]
Aspects of data wrangling:
import: reading in data (e.g.,
read_csv())
tidy: rows = observations, columns = variables (i.e. tabular data)
transform: filter observations, create new variables, summarize, etc.
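The three aspects above can be sketched on a toy example. This is only an illustration: the CSV text, the column names (penguin_id, mass_2007, mass_2008), and the values are made up, and it assumes readr 2.0 or later so that literal data can be passed with I().

```r
library(readr)
library(tidyr)
library(dplyr)

# import: read a tiny CSV given as literal text (hypothetical data)
raw <- read_csv(
  I("penguin_id,mass_2007,mass_2008\nA,3700,3800\nB,4100,4050\n"),
  show_col_types = FALSE
)

# tidy: one row per observation (penguin-year), one column per variable
tidy <- pivot_longer(raw,
                     cols = starts_with("mass_"),
                     names_to = "year",
                     names_prefix = "mass_",
                     values_to = "body_mass_g")

# transform: create a new variable, then summarize it
mass_summary <- tidy %>%
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  summarize(mean_mass_kg = mean(body_mass_kg))

tidy
mass_summary
```

In the raw table each row mixes two years of measurements; after pivot_longer() each row is a single penguin-year, which is the shape the dplyr functions expect.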
In R, there are many libraries or packages (groups of
programs) that are not loaded automatically when R starts, so we have
to load them when we want to use them. You can load an R
package by typing library(package_name). (Sometimes we need
to download/install the package first, as described in Demo0.)
Throughout this demo we will use the palmerpenguins
dataset. To access the data, you will need to install the
palmerpenguins package:
install.packages("palmerpenguins")
Import the penguins dataset by loading
the palmerpenguins package using the library
function and then access the data with the data()
function:
library(palmerpenguins)
data(penguins)
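If you re-run a script often, you may not want to reinstall the package every time. One possible pattern (a sketch, not the only way to do it) is to install only when the package is missing, using requireNamespace() from base R:

```r
# Install palmerpenguins only if it is not already available.
if (!requireNamespace("palmerpenguins", quietly = TRUE)) {
  install.packages("palmerpenguins")
}

library(palmerpenguins)
data(penguins)
```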
View some basic info about the penguins dataset:
dim(penguins) # displays same info as c(nrow(penguins), ncol(penguins))
## [1] 344 8
class(penguins)
## [1] "tbl_df" "tbl" "data.frame"
tbl (pronounced tibble) is the
tidyverse way of storing tabular data, like a spreadsheet
or data.frame
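As a small sketch of how a tibble relates to a data.frame, the code below checks the class of penguins and converts back and forth. The object names penguins_df and penguins_tbl are made up for this illustration.

```r
library(palmerpenguins)
library(tibble)

# A tibble is still a data.frame, so base R functions keep working on it.
is.data.frame(penguins)
is_tibble(penguins)

# Convert to a plain data.frame and back to a tibble.
penguins_df <- as.data.frame(penguins)
class(penguins_df)

penguins_tbl <- as_tibble(penguins_df)
class(penguins_tbl)
```

The main difference you will notice is printing: a tibble shows only the first rows and the column types, while a plain data.frame prints as many rows as the console allows.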
I assure you that you’ll run into errors as you code in
R; in fact, my attitude as a coder is that something is
wrong if I never get any errors while working on a project.
When you run into an error, your first reaction may be to panic and post
a question to Piazza. However, checking help documentation in
R can be a great way to figure out what’s going wrong. (For
good or bad, I end up having to read help documentation almost every day
of my life - because, well, I regularly make mistakes in
R.)
Look at the help documentation for penguins by typing
help(penguins) in the Console. What are the names of the
variables in this dataset? How many observations are in this
dataset?
help(penguins)
You should always look at your data before doing
anything: view the first 6 (by default) rows with
head()
head(penguins) # Try just typing penguins into your console, what happens?
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
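Because the printed tibble abbreviates or hides columns when the console is narrow, a few other ways of looking at the data can help. This is a sketch using glimpse() (exported by dplyr), tail(), and the n and width arguments of print() for tibbles:

```r
library(palmerpenguins)
library(dplyr)

# One line per column, showing its type and first few values.
glimpse(penguins)

# Last 3 rows instead of the first 6.
last_rows <- tail(penguins, n = 3)
last_rows

# Print 10 rows and all columns, regardless of console width.
print(penguins, n = 10, width = Inf)
```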
Is our penguins dataset tidy?
Each row = a single penguin
Each column = different measurement about the penguins (can print
out column names directly with colnames(penguins) or
names(penguins))
We’ll now explore differences among the penguins using the
tidyverse.
First, load the tidyverse for exploring the data, and
do NOT worry about the warning messages that will pop up! These
messages tell you when functions from other loaded packages are masked
(replaced) by functions of the same name from the package you just loaded. In
general though, you should just be concerned when an error message pops
up (errors are different from warnings!).
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
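To make the difference between warnings and errors concrete, here is a small sketch in base R. It uses tryCatch() only to capture the messages so the script keeps running; the names warn_msg, err_msg, and moving_avg are made up for this illustration.

```r
# A warning: R can still return a result (here NA) but flags a problem.
warn_msg <- tryCatch(
  as.numeric("three"),
  warning = function(w) conditionMessage(w)
)
warn_msg

# An error: R stops and returns no result at all.
err_msg <- tryCatch(
  log("three"),
  error = function(e) conditionMessage(e)
)
err_msg

# A masked function is still reachable with the package::function syntax.
# stats::filter() computes a moving average, unlike dplyr::filter().
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
moving_avg
```

Typing stats::filter() explicitly works whether or not dplyr has been loaded, which is why masking is usually nothing to worry about.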
We’ll start by summarizing continuous
(e.g., bill_length_mm, flipper_length_mm) and
categorical (e.g., species, island)
variables in different ways.
We can compute summary statistics for
continuous variables with the summary()
function:
summary(penguins$bill_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.10 39.23 44.45 43.92 48.50 59.60 2
Compute counts of categorical variables
with the table() function:
table("island" = penguins$island) # be careful: it ignores NA values by default!
## island
## Biscoe Dream Torgersen
## 168 124 52
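Since table() drops missing values by default, it can be worth counting them explicitly. The sketch below uses the useNA argument of table() and, as an alternative, count() from dplyr; sex is used here because it has missing values in this dataset.

```r
library(palmerpenguins)
library(dplyr)

# Base R: include missing values as their own category.
sex_table <- table("sex" = penguins$sex, useNA = "ifany")
sex_table

# dplyr: count() returns a tibble and keeps NA as its own row.
sex_counts <- count(penguins, sex)
sex_counts
```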
How do we remove the penguins with missing
bill_length_mm values? Within the tidyverse,
dplyr is a
package with functions for data wrangling. (Because it’s part of the
tidyverse, you do NOT have to load it separately with
library(dplyr) after using
library(tidyverse)!) It’s considered a “grammar of
data manipulation”: dplyr functions are
verbs, datasets are nouns.
We can filter()
our dataset to choose observations meeting conditions:
clean_penguins <- filter(penguins, !is.na(bill_length_mm))
# Use help(is.na) to see what it returns. And then observe
# that the ! operator means to negate what comes after it.
# This means !TRUE == FALSE (i.e., opposite of TRUE is equal to FALSE).
nrow(penguins) - nrow(clean_penguins) # Difference in rows
## [1] 2
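filter() also accepts several conditions at once. A brief sketch (the object names adelie_long and not_biscoe are made up for this illustration):

```r
library(palmerpenguins)
library(dplyr)

# Comma-separated conditions are combined with AND.
adelie_long <- filter(penguins, species == "Adelie", bill_length_mm > 40)

# %in% keeps rows matching any of several values.
not_biscoe <- filter(penguins, island %in% c("Dream", "Torgersen"))

nrow(adelie_long)
nrow(not_biscoe)
```

Note that filter() drops rows where a condition evaluates to NA, so the penguins with a missing bill_length_mm are not in adelie_long even though we never called is.na().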
If we want to only consider a subset of columns in our data,
we can select()
variables of interest:
sel_penguins <- select(clean_penguins, species, island, bill_length_mm, flipper_length_mm)
head(sel_penguins, n = 3)
## # A tibble: 3 × 4
## species island bill_length_mm flipper_length_mm
## <fct> <fct> <dbl> <int>
## 1 Adelie Torgersen 39.1 181
## 2 Adelie Torgersen 39.5 186
## 3 Adelie Torgersen 40.3 195
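select() does not require typing every column name. The sketch below shows a selection helper, dropping a column with a minus sign, and renaming while selecting; the object names are made up for this illustration.

```r
library(palmerpenguins)
library(dplyr)

# Keep species plus every column whose name ends in "_mm".
mm_cols <- select(penguins, species, ends_with("_mm"))
names(mm_cols)

# Drop a column by putting a minus sign in front of it.
no_year <- select(penguins, -year)
names(no_year)

# Rename a column while selecting it (new_name = old_name).
renamed <- select(penguins, species, mass = body_mass_g)
names(renamed)
```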
We can arrange()
our dataset to sort observations by variables:
bill_penguins <- arrange(sel_penguins, desc(bill_length_mm)) # use desc() for descending order
head(bill_penguins, n = 3)
## # A tibble: 3 × 4
## species island bill_length_mm flipper_length_mm
## <fct> <fct> <dbl> <int>
## 1 Gentoo Biscoe 59.6 230
## 2 Chinstrap Dream 58 181
## 3 Gentoo Biscoe 55.9 228
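arrange() can also sort by more than one variable; later variables break ties in earlier ones. A small sketch (the name sorted_penguins is made up for this illustration):

```r
library(palmerpenguins)
library(dplyr)

# Sort by species (in factor level order), then by body mass,
# heaviest first, within each species.
sorted_penguins <- arrange(penguins, species, desc(body_mass_g))
head(sorted_penguins, n = 3)
```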
We can summarize()
our dataset to one row based on functions of variables:
summarize(bill_penguins, max(bill_length_mm), median(flipper_length_mm))
## # A tibble: 1 × 2
## `max(bill_length_mm)` `median(flipper_length_mm)`
## <dbl> <dbl>
## 1 59.6 197
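The default column names above are awkward to work with. One option is to name the summaries yourself; the sketch below also uses na.rm = TRUE so that it works on the original penguins data without filtering first. The column names are made up for this illustration.

```r
library(palmerpenguins)
library(dplyr)

bill_summary <- summarize(
  penguins,
  max_bill_length = max(bill_length_mm, na.rm = TRUE),
  median_flipper_length = median(flipper_length_mm, na.rm = TRUE),
  n_missing_bill = sum(is.na(bill_length_mm))  # how many values were skipped
)
bill_summary
```

Without na.rm = TRUE, max() and median() would return NA here, because bill_length_mm and flipper_length_mm contain missing values.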
We can mutate()
our dataset to create new variables (mutate is a weird
name…):
new_penguins <- mutate(bill_penguins,
bill_flipper_ratio = bill_length_mm / flipper_length_mm,
flipper_bill_ratio = flipper_length_mm / bill_length_mm)
head(new_penguins, n = 1)
## # A tibble: 1 × 6
## species island bill_length_mm flipper_length_mm bill_flipper_ratio flipper_b…¹
## <fct> <fct> <dbl> <int> <dbl> <dbl>
## 1 Gentoo Biscoe 59.6 230 0.259 3.86
## # … with abbreviated variable name ¹flipper_bill_ratio
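A variable created inside mutate() can be used right away by later arguments of the same call. A short sketch; the names body_mass_kg and heavy, and the 4.5 kg cutoff, are arbitrary choices for illustration:

```r
library(palmerpenguins)
library(dplyr)

mass_penguins <- mutate(
  penguins,
  body_mass_kg = body_mass_g / 1000,  # unit conversion
  heavy = body_mass_kg > 4.5          # uses the column created one line above
)
head(select(mass_penguins, species, body_mass_g, body_mass_kg, heavy), n = 3)
```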
How do we perform several of these actions?
head(arrange(select(mutate(filter(penguins, !is.na(flipper_length_mm)), bill_flipper_ratio = bill_length_mm / flipper_length_mm), species, island, bill_flipper_ratio), desc(bill_flipper_ratio)), n = 1)
## # A tibble: 1 × 3
## species island bill_flipper_ratio
## <fct> <fct> <dbl>
## 1 Chinstrap Dream 0.320
That’s awfully annoying to do, and also difficult to read…
The %>% (pipe) operator is used in the
tidyverse (from magrittr)
to chain commands together.
%>% directs the data analysis
pipeline: the output of one function pipes into the input of the next
function:
penguins %>%
filter(!is.na(flipper_length_mm)) %>%
mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm) %>%
select(species, island, bill_flipper_ratio) %>%
arrange(desc(bill_flipper_ratio)) %>%
head(n = 5)
## # A tibble: 5 × 3
## species island bill_flipper_ratio
## <fct> <fct> <dbl>
## 1 Chinstrap Dream 0.320
## 2 Chinstrap Dream 0.275
## 3 Chinstrap Dream 0.270
## 4 Chinstrap Dream 0.270
## 5 Chinstrap Dream 0.268
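You may also see the native pipe |> in other people's code. It is built into R itself (this sketch assumes R 4.1 or later) and, for simple chains like the ones in this demo, behaves the same way as %>%:

```r
library(palmerpenguins)
library(dplyr)

# magrittr pipe, as used throughout this demo
n_magrittr <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  nrow()

# native pipe, available without loading any package
n_native <- penguins |>
  filter(!is.na(flipper_length_mm)) |>
  nrow()

c(n_magrittr, n_native)
```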
Instead of head(), we can slice()
our dataset to choose the observations based on the
position:
penguins %>%
filter(!is.na(flipper_length_mm)) %>%
mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm) %>%
select(species, island, bill_flipper_ratio) %>%
arrange(desc(bill_flipper_ratio)) %>%
slice(c(1, 2, 10, 100))
## # A tibble: 4 × 3
## species island bill_flipper_ratio
## <fct> <fct> <dbl>
## 1 Chinstrap Dream 0.320
## 2 Chinstrap Dream 0.275
## 3 Chinstrap Dream 0.264
## 4 Gentoo Biscoe 0.227
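For the common case of "top n by some variable", dplyr (version 1.0.0 or later) also has slice_max(), which combines the arrange() and slice() steps. A sketch, with with_ties = FALSE so that exactly n rows come back:

```r
library(palmerpenguins)
library(dplyr)

top_ratio <- penguins %>%
  filter(!is.na(flipper_length_mm)) %>%
  mutate(bill_flipper_ratio = bill_length_mm / flipper_length_mm) %>%
  select(species, island, bill_flipper_ratio) %>%
  slice_max(bill_flipper_ratio, n = 5, with_ties = FALSE)
top_ratio

# slice_head() is the pipe-friendly counterpart of head().
first_rows <- slice_head(penguins, n = 3)
first_rows
```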
We group_by()
to split our dataset into groups based on a variable’s
values:
penguins %>%
filter(!is.na(flipper_length_mm)) %>%
group_by(island) %>%
summarize(n_penguins = n(), # counts number of rows in group
ave_flipper_length = mean(flipper_length_mm),
sum_bill_depth = sum(bill_depth_mm),
.groups = "drop") %>% # drop all levels of grouping
arrange(desc(n_penguins)) %>%
slice(1:5)
## # A tibble: 3 × 4
## island n_penguins ave_flipper_length sum_bill_depth
## <fct> <int> <dbl> <dbl>
## 1 Biscoe 167 210. 2651.
## 2 Dream 124 193. 2275.
## 3 Torgersen 51 191. 940.
group_by() does not change the data by itself; it is only useful when
followed by another function in a pipeline (e.g., summarize()). Pay
attention to its behavior:
specify the .groups argument to decide if observations
remain grouped or not after summarizing (you can also use
ungroup() for this)
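To see what "remaining grouped" means in practice, the sketch below groups the data, checks the grouping with is_grouped_df() from dplyr, and then uses mutate() within groups before removing the grouping. The names by_species, centered, and mass_diff are made up for this illustration.

```r
library(palmerpenguins)
library(dplyr)

# group_by() alone only attaches grouping information to the tibble.
by_species <- group_by(penguins, species)
is_grouped_df(by_species)

# Verbs applied afterwards work within each group: here each penguin's
# mass is compared with the average mass of its own species.
centered <- by_species %>%
  mutate(mass_diff = body_mass_g - mean(body_mass_g, na.rm = TRUE)) %>%
  ungroup()  # remove the grouping so later steps use the full dataset

is_grouped_df(centered)
```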
As your own exercise, create a tidy dataset where each row = an island with the following variables:
a count of unique values (see help(unique)),
the average body_mass_g,
the variance (see help(var)) of bill_depth_mm
Prior to making those variables, make sure to filter missings and
also only consider female penguins. Then arrange the islands in order of
the average body_mass_g:
# INSERT YOUR CODE HERE