Chapter 4 - Data Manipulation in R
Introduction to Data Manipulation in R
Welcome to the chapter on Data Manipulation in R, where we dive into transforming raw data into valuable insights! Data manipulation is a fundamental skill in data analytics, helping you clean, organize, and reshape data for analysis and visualization. In this chapter, we’ll introduce you to the tidyverse ecosystem, which is an essential toolset for data manipulation in R, and explore two powerful packages: dplyr for data manipulation and tidyr for reshaping your data.
Let’s gear up and get your data wrangling skills rolling!
The Tidyverse Ecosystem
The tidyverse is a collection of R packages designed for data science. These packages share an underlying design philosophy, grammar, and data structures, making them easy to use together. Here are some key packages in the tidyverse:
- ggplot2: for data visualization
- dplyr: for data manipulation
- tidyr: for tidying data
- readr: for reading data
- purrr: for functional programming
- tibble: for modern data frames
Why Tidy Data?
In data analytics, having data in a "tidy" format makes analysis easier. Tidy data follows three rules:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
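To make these rules concrete, here is a minimal sketch in base R. The student names and scores are invented for illustration, and it assumes we treat "subject" as a variable in its own right:

```r
# Hypothetical exam scores, invented for illustration only.

# Untidy: the values of the "subject" variable are spread across column names.
untidy <- data.frame(
  student = c("Ana", "Ben"),
  math    = c(90, 70),
  science = c(85, 80)
)

# Tidy: one column per variable (student, subject, score) and
# one row per observation (one student's score in one subject).
tidy <- data.frame(
  student = c("Ana", "Ana", "Ben", "Ben"),
  subject = c("math", "science", "math", "science"),
  score   = c(90, 85, 70, 80)
)

# With tidy data, a per-subject mean is a single grouped operation.
subject_means <- aggregate(score ~ subject, data = tidy, FUN = mean)
print(subject_means)
```

Both data frames hold the same information; the tidy layout simply makes "subject" something you can group, filter, or plot by directly.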
To install the tidyverse, run install.packages("tidyverse") once in the R console, then load it in each session with library(tidyverse).
Data Manipulation with dplyr
The dplyr package is your go-to tool for manipulating data frames in R. It provides a set of functions that allow you to perform operations like filtering, selecting, arranging, and summarizing data.
Key Functions in dplyr
- select() - Choose specific columns from a data frame.
- filter() - Subset rows based on specific conditions.
- arrange() - Sort the data frame by one or more variables.
- mutate() - Add new variables or modify existing ones.
- group_by() / summarize() - Group rows by one or more variables and collapse each group into summary values.
Example: Using dplyr
Let’s work through a practical example using the built-in dataset mtcars:
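A minimal sketch of such a workflow, assuming dplyr is installed. The chosen columns, the 20 mpg threshold, and the derived column name hp_per_wt are illustrative choices, not taken from the chapter:

```r
library(dplyr)

# mtcars ships with R: 32 cars, with columns such as mpg, cyl, hp, and wt.
efficient_cars <- mtcars %>%
  select(mpg, cyl, hp, wt) %>%      # keep four columns
  filter(mpg > 20) %>%              # keep cars above 20 miles per gallon
  mutate(hp_per_wt = hp / wt) %>%   # hypothetical derived column: hp per 1000 lbs
  arrange(desc(mpg))                # most fuel-efficient cars first

head(efficient_cars)

# Average mpg and number of cars for each cylinder count.
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n_cars = n())

print(mpg_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is why the steps chain cleanly with the %>% pipe.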
Reshaping Data with tidyr
tidyr is used to tidy your data by reshaping it, so each variable gets its own column and each observation its own row. The two main functions we’ll cover are:
- pivot_longer(): Convert data from wide format to long format.
- pivot_wider(): Convert data from long format to wide format.
Example: Using tidyr
Suppose we have a dataset that records each student's test scores:
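A minimal sketch of this scenario, assuming tidyr is installed. The student names, the test1 and test2 columns, and the scores are hypothetical values made up for illustration:

```r
library(tidyr)

# Hypothetical scores: one row per student, one column per test (wide format).
scores_wide <- data.frame(
  student = c("Ana", "Ben", "Cara"),
  test1   = c(88, 92, 79),
  test2   = c(91, 85, 84)
)

# Wide -> long: the two test columns collapse into a "test" column
# (holding the former column names) and a "score" column (holding the values).
scores_long <- pivot_longer(
  scores_wide,
  cols      = c(test1, test2),
  names_to  = "test",
  values_to = "score"
)
print(scores_long)

# Long -> wide: spread the "test" values back out into separate columns.
scores_back <- pivot_wider(
  scores_long,
  names_from  = test,
  values_from = score
)
print(scores_back)
```

The long format has one row per student-test pair, which suits grouped summaries and plotting; the wide format is often easier to read as a table.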
Practical Exercises
- Data Frame Manipulation:
  - Load the iris dataset.
  - Use dplyr functions to:
    - Select columns Sepal.Length and Species.
    - Filter rows where Sepal.Length is greater than 5.
    - Create a new column that contains the ratio of Sepal.Length to Sepal.Width.
- Reshape Data:
  - Create your own data frame containing students and scores in various subjects.
  - Use tidyr functions to pivot the data to long format and then back to wide format.
Chapter Summary
In this chapter, you've learned the essentials of data manipulation using R, focusing on the tidyverse package ecosystem. We delved into:
- The tidyverse and its importance for data manipulation.
- Key dplyr functions for manipulating data frames.
- tidyr functions for reshaping your data to make it tidy.
Practice these concepts with the provided exercises, and you'll be well on your way to mastering data manipulation in R!