Chapter 22 Reshaping data
As we have seen through the book, having data in tidy format is what makes the tidyverse flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The tidyr package includes several functions that are useful for tidying data.
We will use the fertility wide format dataset described in Section 4.1 as an example in this section.
library(tidyverse) library(dslabs) <- system.file("extdata", package="dslabs") path <- file.path(path, "fertility-two-countries-example.csv") filename <- read_csv(filename)wide_data
One of the most used functions in the tidyr package is
pivot_longer, which is useful for converting wide data into tidy data.
As with most tidyverse functions, the
pivot_longer function’s first argument is the data frame that will be converted. Here we want to reshape the
wide_data dataset so that each row represents a fertility observation, which implies we need three columns to store the year, country, and the observed value. In its current form, data from different years are in different columns with the year values stored in the column names. Through the
values_to argument we will tell
pivot_longer the column names we want to assign to the columns containing the current column names and observations, respectively. The default names are
value, which are often usable choices.
In this case a better choice for these two arguments would be
fertility. Note that nowhere in the data file does it tell us this is fertility data. Instead, we deciphered this from the file name. Through
cols,the second argument we specify the columns containing observed values; these are the columns that will be pivoted. The default is to pivot all columns so, in most cases, we have to specify the columns. In our example we want columns
1961 up to
The code to pivot the fertility data therefore looks like this:
<- pivot_longer(wide_data, `1960`:`2015`, names_to = "year", values_to = "fertility")new_tidy_data
We can also use the pipe like this:
<- wide_data %>% new_tidy_data pivot_longer(`1960`:`2015`, names_to = "year", values_to = "fertility")
We can see that the data have been converted to tidy format with columns
head(new_tidy_data) #> # A tibble: 6 x 3 #> country year fertility #> <chr> <chr> <dbl> #> 1 Germany 1960 2.41 #> 2 Germany 1961 2.44 #> 3 Germany 1962 2.47 #> 4 Germany 1963 2.49 #> 5 Germany 1964 2.49 #> # … with 1 more row
and that each year resulted in two rows since we have two countries and this column was not pivoted. A somewhat quicker way to write this code is to specify which column will not include in the pivot, rather than all the columns that will be pivoted:
<- wide_data %>% new_tidy_data pivot_longer(-country, names_to = "year", values_to = "fertility")
new_tidy_data object looks like the original
tidy_data we defined this way
data("gapminder") <- gapminder %>% tidy_data filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) %>% select(country, year, fertility)
with just one minor difference. Can you spot it? Look at the data type of the year column:
class(tidy_data$year) #>  "integer" class(new_tidy_data$year) #>  "character"
pivot_longer function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the year column to be numbers:
<- wide_data %>% new_tidy_data pivot_longer(-country, names_to = "year", values_to = "fertility") %>% mutate(year = as.integer(year))
Note that we could have also used the
Now that the data is tidy, we can use this relatively simple ggplot code:
%>% ggplot(aes(year, fertility, color = country)) + new_tidy_data geom_point()
As we will see in later examples, it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The
pivot_wider function is basically the inverse of
pivot_longer. The first argument is for the data, but since we are using the pipe, we don’t show it. The
names_from argument tells
pivot_wider which variable will be used as the column names. The
values_from argument specifies which variable to use to fill out the cells.
<- new_tidy_data %>% new_wide_data pivot_wider(names_from = year, values_from = fertility) select(new_wide_data, country, `1960`:`1967`) #> # A tibble: 2 x 9 #> country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 #> 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
values_from default to
The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files, we include an illustration that is slightly more complicated. It contains two variables: life expectancy and fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.
<- system.file("extdata", package = "dslabs") path <- "life-expectancy-and-fertility-two-countries-example.csv" filename <- file.path(path, filename) filename <- read_csv(filename) raw_dat select(raw_dat, 1:5) #> # A tibble: 2 x 5 #> country `1960_fertility` `1960_life_expectancy` `1961_fertility` #> <chr> <dbl> <dbl> <dbl> #> 1 Germany 2.41 69.3 2.44 #> 2 South Korea 6.16 53.0 5.99 #> # … with 1 more variable: 1961_life_expectancy <dbl>
First, note that the data is in wide format. Second, notice that this table includes values for two variables, fertility and life expectancy, with the column name encoding which column represents which variable. Encoding information in the column names is not recommended but, unfortunately, it is quite common. We will put our wrangling skills to work to extract this information and store it in a tidy fashion.
We can start the data wrangling with the
pivot_longer function, but we should no longer use the column name
year for the new column since it also contains the variable type. We will call it
name, the default, for now:
<- raw_dat %>% pivot_longer(-country) dat head(dat) #> # A tibble: 6 x 3 #> country name value #> <chr> <chr> <dbl> #> 1 Germany 1960_fertility 2.41 #> 2 Germany 1960_life_expectancy 69.3 #> 3 Germany 1961_fertility 2.44 #> 4 Germany 1961_life_expectancy 69.8 #> 5 Germany 1962_fertility 2.47 #> # … with 1 more row
The result is not exactly what we refer to as tidy since each observation is associated with two, not one, rows. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the
name column into the year and the variable type. Notice that the entries in this column separate the year from the variable name with an underscore:
$name[1:5] dat#>  "1960_fertility" "1960_life_expectancy" "1961_fertility" #>  "1961_life_expectancy" "1962_fertility"
Encoding multiple variables in a column name is such a common problem that the readr package includes a function to separate these columns into two or more. Apart from the data, the
separate function takes three arguments: the name of the column to be separated, the names to be used for the new columns, and the character that separates the variables. So, a first attempt at this is:
%>% separate(name, c("year", "name"), "_")dat
_ is the default separator assumed by
separate, we do not have to include it in the code:
%>% separate(name, c("year", "name")) dat #> Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [2, #> 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, #> 40, ...]. #> # A tibble: 224 x 4 #> country year name value #> <chr> <chr> <chr> <dbl> #> 1 Germany 1960 fertility 2.41 #> 2 Germany 1960 life 69.3 #> 3 Germany 1961 fertility 2.44 #> 4 Germany 1961 life 69.8 #> 5 Germany 1962 fertility 2.47 #> # … with 219 more rows
The function does separate the values, but we run into a new problem. We receive the warning
Too many values at 112 locations: and that the
life_expectancy variable is truncated to
life. This is because the
_ is used to separate
expectancy, not just year and variable name! We could add a third column to catch this and let the
separate function know which column to fill in with missing values,
NA, when there is no third value. Here we tell it to fill the column on the right:
<- c("year", "first_variable_name", "second_variable_name") var_names %>% separate(name, var_names, fill = "right") dat #> # A tibble: 224 x 5 #> country year first_variable_name second_variable_name value #> <chr> <chr> <chr> <chr> <dbl> #> 1 Germany 1960 fertility <NA> 2.41 #> 2 Germany 1960 life expectancy 69.3 #> 3 Germany 1961 fertility <NA> 2.44 #> 4 Germany 1961 life expectancy 69.8 #> 5 Germany 1962 fertility <NA> 2.47 #> # … with 219 more rows
However, if we read the
separate help file, we find that a better approach is to merge the last two variables when there is an extra separation:
%>% separate(name, c("year", "name"), extra = "merge") dat #> # A tibble: 224 x 4 #> country year name value #> <chr> <chr> <chr> <dbl> #> 1 Germany 1960 fertility 2.41 #> 2 Germany 1960 life_expectancy 69.3 #> 3 Germany 1961 fertility 2.44 #> 4 Germany 1961 life_expectancy 69.8 #> 5 Germany 1962 fertility 2.47 #> # … with 219 more rows
This achieves the separation we wanted. However, we are not done yet. We need to create a column for each variable. As we learned, the
pivot_wider function can do this:
%>% dat separate(name, c("year", "name"), extra = "merge") %>% pivot_wider() #> # A tibble: 112 x 4 #> country year fertility life_expectancy #> <chr> <chr> <dbl> <dbl> #> 1 Germany 1960 2.41 69.3 #> 2 Germany 1961 2.44 69.8 #> 3 Germany 1962 2.47 70.0 #> 4 Germany 1963 2.49 70.1 #> 5 Germany 1964 2.49 70.7 #> # … with 107 more rows
The data is now in tidy format with one row for each observation with three variables: year, fertility, and life expectancy.
It is sometimes useful to do the inverse of
separate, unite two columns into one. To demonstrate how to use
unite, we show code that, although not the optimal approach, serves as an illustration. Suppose that we did not know about
extra and used this command to separate:
<- c("year", "first_variable_name", "second_variable_name") var_names %>% dat separate(name, var_names, fill = "right") #> # A tibble: 224 x 5 #> country year first_variable_name second_variable_name value #> <chr> <chr> <chr> <chr> <dbl> #> 1 Germany 1960 fertility <NA> 2.41 #> 2 Germany 1960 life expectancy 69.3 #> 3 Germany 1961 fertility <NA> 2.44 #> 4 Germany 1961 life expectancy 69.8 #> 5 Germany 1962 fertility <NA> 2.47 #> # … with 219 more rows
We can achieve the same final result by uniting the second and third columns, then pivoting the columns and renaming
%>% dat separate(name, var_names, fill = "right") %>% unite(name, first_variable_name, second_variable_name) %>% pivot_wider() %>% rename(fertility = fertility_NA) #> # A tibble: 112 x 4 #> country year fertility life_expectancy #> <chr> <chr> <dbl> <dbl> #> 1 Germany 1960 2.41 69.3 #> 2 Germany 1961 2.44 69.8 #> 3 Germany 1962 2.47 70.0 #> 4 Germany 1963 2.49 70.1 #> 5 Germany 1964 2.49 70.7 #> # … with 107 more rows
1. Run the following command to define the
<- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) %>% co2_wide setNames(1:12) %>% mutate(year = as.character(1959:1997))
pivot_longer function to wrangle this into a tidy dataset. Call the column with the CO2 measurements
co2 and call the month column
month. Call the resulting object
2. Plot CO2 versus month with a different curve for each year using this code:
%>% ggplot(aes(month, co2, color = year)) + geom_line()co2_tidy
If the expected plot is not made, it is probably because
co2_tidy$month is not numeric:
Rewrite your code to make sure the month column is numeric. Then make the plot.
3. What do we learn from this plot?
- CO2 measures increase monotonically from 1959 to 1997.
- CO2 measures are higher in the summer and the yearly average increased from 1959 to 1997.
- CO2 measures appear constant and random variability explains the differences.
- CO2 measures do not have a seasonal trend.
4. Now load the
admissions data set, which contains admission information for men and women across six majors and keep only the admitted percentage column:
load(admissions) <- admissions %>% select(-applicants)dat
If we think of an observation as a major, and that each observation has two variables (men admitted percentage and women admitted percentage) then this is not tidy. Use the
pivot_wider function to wrangle into tidy shape: one row for each major.
5. Now we will try a more advanced wrangling challenge. We want to wrangle the admissions data so that for each major we have 4 observations:
applicants_women. The trick we perform here is actually quite common: first use
pivot_longer to generate an intermediate data frame and then
pivot_wider to obtain the tidy data we want. We will go step by step in this and the next two exercises.
pivot_longer function to create a
tmp data.frame with a column containing the type of observation
applicants. Call the new columns
6. Now you have an object
tmp with columns
value. Note that if you combine the
gender, we get the column names we want:
applicants_women. Use the function
unite to create a new column called
7. Now use the
pivot_wider function to generate the tidy data with four variables for each major.
8. Now use the pipe to write a line of code that turns
admissions to the table produced in the previous exercise.