R Language Filtering Selecting Mutating Summarizing Complete Guide
Understanding the Core Concepts of R Language Filtering, Selecting, Mutating, Summarizing
1. Filtering
Filtering in R is used to select specific rows based on conditions. The dplyr package provides the filter() function for this purpose. It allows you to specify conditions that the rows must meet to be kept in the dataset.
Syntax:
filter(data, condition1, condition2, ...)
Example:
library(dplyr)
# Sample data frame
data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)
# Filter data where age is greater than 30
filtered_data <- filter(data, age > 30)
Important Info:
- You can combine multiple conditions using logical operators (& for AND, | for OR). Conditions separated by commas inside filter() are also combined with AND.
- Use ! for negation to exclude rows that meet a condition.
- Filters can also use functions like between() for a range of values or str_detect() (from the stringr package) for string matching.
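The points above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the data frame is re-created under the hypothetical name employees so the block runs on its own, and base R's grepl() stands in for str_detect() so that only dplyr needs to be loaded.

```r
library(dplyr)

# Hypothetical sample data, mirroring the data frame used above
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

# AND: both conditions must hold
and_rows <- filter(employees, age > 25 & salary < 80000)

# OR: either condition may hold
or_rows <- filter(employees, age < 30 | salary >= 80000)

# Negation: drop the rows that match the condition
not_bob <- filter(employees, !(name == "Bob"))

# Range check: between() includes both endpoints
mid_age <- filter(employees, between(age, 30, 35))

# String matching with base R grepl(); with stringr loaded,
# str_detect(name, "^A") would play the same role
a_names <- filter(employees, grepl("^A", name))
```

Each call returns a new data frame and leaves employees unchanged.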
2. Selecting
Selecting in R allows you to choose specific columns from a dataset. The dplyr package provides the select() function for this. It lets you specify which columns you want to keep and in which order.
Syntax:
select(data, column1, column2, ...)
Example:
# Select name and salary from the data frame
selected_data <- select(data, name, salary)
Important Info:
- You can use helper functions like starts_with(), ends_with(), contains(), one_of(), and matches() to select columns based on patterns. In current dplyr releases, one_of() is superseded by any_of() and all_of().
- The everything() helper selects all remaining columns, so select(data, salary, everything()) moves salary to the front and keeps the rest of the columns after it.
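A short sketch of these helpers in action is given below. It assumes dplyr 1.0.0 or later (for any_of()) and re-creates the sample data under the hypothetical name employees; the result names are illustrative only.

```r
library(dplyr)

# Hypothetical sample data, mirroring the data frame used above
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

# Columns whose names start with "sa"
pay_cols <- select(employees, starts_with("sa"))

# Columns whose names end with "e"
e_cols <- select(employees, ends_with("e"))

# Columns whose names match a regular expression
regex_cols <- select(employees, matches("^s"))

# Move salary to the front and keep every other column after it
reordered <- select(employees, salary, everything())

# any_of() skips names that do not exist instead of raising an error
safe_cols <- select(employees, any_of(c("name", "bonus")))
```

The any_of() call is handy when a script must tolerate optional columns: here "bonus" is absent, so only name is returned.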
3. Mutating
Mutating in R involves creating new variables or modifying existing ones within a dataset. The dplyr package provides the mutate() function, which can compute new columns and add them to the data frame.
Syntax:
mutate(data, new_column = new_values, ...)
Example:
# Create a new column that calculates a 10% bonus on salary
mutated_data <- mutate(data, bonus = salary * 0.10)
Important Info:
- You can use arithmetic operations, logical conditions, and functions to create and modify columns.
- Columns created earlier in a mutate() call can be used by later expressions in the same call.
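As a rough illustration of both points, the sketch below derives one column from another inside a single mutate() call and adds a logical flag. The names employees, total_pay, and senior, and the age threshold of 35, are hypothetical choices made for this example.

```r
library(dplyr)

# Hypothetical sample data, mirroring the data frame used above
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

with_pay <- mutate(
  employees,
  bonus = salary * 0.10,      # arithmetic on an existing column
  total_pay = salary + bonus, # uses bonus, created one line earlier
  senior = age >= 35          # logical condition stored as TRUE/FALSE
)
```

Because expressions are evaluated in order, total_pay must come after bonus; swapping the two lines would fail, since bonus would not exist yet.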
4. Summarizing
Summarizing in R involves computing summary statistics such as mean, median, minimum, and maximum for a dataset or subsets of it. The dplyr package provides the summarise() function (or summarize() in American English) for this purpose.
Syntax:
summarise(data, new_summary_column = summary_function(column), ...)
Example:
# Calculate the average salary from the data frame
summary_data <- summarise(data, avg_salary = mean(salary))
Important Info:
- You can use other functions such as count(), tally(), group_by(), and n() to further organize and compute summary statistics.
- For complex summaries, dplyr allows chaining multiple operations using the %>% pipe operator, enhancing readability and workflow efficiency.
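The grouping functions mentioned above might be combined as in the following sketch. The staff data frame and its department column are hypothetical additions, since the sample data in this section has no grouping variable.

```r
library(dplyr)

# Hypothetical data with a grouping column
staff <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  department = c("HR", "IT", "IT", "HR"),
  salary = c(50000, 60000, 70000, 80000)
)

# One summary row per department
by_dept <- staff %>%
  group_by(department) %>%
  summarise(
    headcount = n(),           # number of rows in each group
    avg_salary = mean(salary),
    max_salary = max(salary)
  )

# count() is a shortcut for group_by() followed by summarise(n = n())
dept_counts <- count(staff, department)
```

Here by_dept has one row each for HR and IT, and dept_counts stores the group sizes in a column named n.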
Using Pipes (%>%)
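Combining the above operations using pipes can streamline data workflows and make code more readable. The pipe passes the result of the expression on its left as the first argument to the function on its right, so data %>% filter(age > 30) is equivalent to filter(data, age > 30).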
Example:
# Filter, select, mutate, and summarize in one pipeline
pipe_data <- data %>%
  filter(age > 30) %>%
  select(name, salary) %>%
  mutate(bonus = salary * 0.10) %>%
  summarise(avg_bonus = mean(bonus))
Through these operations, data manipulation in R becomes intuitive and powerful, allowing analysts to efficiently process, clean, and transform their data.
Step-by-Step Guide: How to Implement R Language Filtering, Selecting, Mutating, Summarizing
Load Required Libraries
First, make sure you have dplyr installed and loaded.
# Install dplyr package if not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
# Load dplyr package
library(dplyr)
# Also, let's create a sample data frame for demonstration
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000),
  Department = c("HR", "Finance", "IT", "Marketing")
)
# View the initial data
print(data)
Filtering Data
Filtering is used to subset rows in a dataset based on some conditions. Let’s filter our dataset to find employees in the IT department.
# Filter rows where Department is "IT"
filtered_data <- filter(data, Department == "IT")
# Print the filtered data
print(filtered_data)
Selecting Data
Selecting is used to choose specific columns from a dataset. Let's select the Name and Salary columns from our dataset.
# Select Name and Salary columns
selected_data <- select(data, Name, Salary)
# Print the selected data
print(selected_data)
Mutating Data
Mutating is used to create new columns or modify existing columns. Let's create a new column AnnualBonus based on the Salary column.
# Create a new column AnnualBonus which is 10% of Salary
mutated_data <- mutate(data, AnnualBonus = Salary * 0.10)
# Print the mutated data
print(mutated_data)
Summarizing Data
Summarizing is used to aggregate data with operations such as mean, sum, etc. Let's find the average salary.
# Find the average salary
summary_data <- summarise(data, avg_salary = mean(Salary))
# Print the summary data
print(summary_data)
Chaining Operations
You can chain multiple dplyr functions using the %>% (pipe) operator to perform multiple operations succinctly. Let's combine filtering, selecting, mutating, and summarizing.
For example, let’s filter for employees earning more than 60000, select their names and salaries, add a bonus column, and then find the total bonus.
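A sketch of that pipeline is shown below. It re-creates the sample data frame from this guide so the block runs on its own; the result names chained_data and total_bonus are our own choices, and the 10% bonus rate is carried over from the mutating step.

```r
library(dplyr)

# Sample data frame from this guide, re-created so the block is self-contained
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000),
  Department = c("HR", "Finance", "IT", "Marketing")
)

# Filter, select, mutate, and summarize in one pipeline
chained_data <- data %>%
  filter(Salary > 60000) %>%                 # keep employees earning more than 60000
  select(Name, Salary) %>%                   # keep only names and salaries
  mutate(AnnualBonus = Salary * 0.10) %>%    # add a 10% bonus column
  summarise(total_bonus = sum(AnnualBonus))  # add up the bonuses

# Print the result
print(chained_data)
```

Only Charlie and David pass the filter, so the result is a single row whose total_bonus is the sum of their two bonuses.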
Top 10 Interview Questions & Answers on R Language Filtering, Selecting, Mutating, Summarizing
1. How do you filter rows of a data frame based on a condition?
Answer: You can use the filter() function to select rows based on one or more conditions. For example, suppose you have a data frame df that includes a column age and you wish to filter for rows where age > 30.
library(dplyr)
# Sample data frame
df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 35, 30))
# Filtering rows where age is greater than 30
filtered_df <- df %>% filter(age > 30)
print(filtered_df)
2. How do you select specific columns from a data frame?
Answer: Use the select() function to choose columns by their names. You can also use helper functions like starts_with(), ends_with(), and contains() to select multiple columns based on their names.
# Selecting specific columns
selected_df <- df %>% select(name, age)
print(selected_df)
# Selecting columns using helper functions
# Suppose df has columns 'name', 'age', 'height', 'weight'
df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 35, 30),
                 height = c(165, 180, 175),
                 weight = c(60, 80, 75))
selected_by_helper <- df %>% select(starts_with("a")) # Selects columns starting with 'a'
print(selected_by_helper)
3. How do you add a new column to a data frame based on the values of existing columns?
Answer: The mutate() function is used to create new columns based on existing data. For example, to add a BMI column based on height (in cm) and weight (in kg), you would use:
mutated_df <- df %>% mutate(bmi = weight / (height / 100)^2)
print(mutated_df)
4. How do you summarize data, such as finding the mean and standard deviation of a column?
Answer: Use the summarise() function, or its alias summarize(), to compute summary statistics:
summary_df <- df %>% summarise(
  mean_age = mean(age),
  sd_age = sd(age)
)
print(summary_df)
5. How do you group data and then perform actions on each group?
Answer: Combine group_by() with a summarizing function. For example, to calculate the mean age by gender:
# Rebuild df with an added 'gender' column for illustration
df <- data.frame(name = c("Alice", "Bob", "Charlie", "Alice"),
                 age = c(25, 35, 30, 28),
                 gender = c("Female", "Male", "Male", "Female"))
grouped_summary <- df %>% group_by(gender) %>%
  summarise(mean_age = mean(age))
print(grouped_summary)
6. How do you filter rows after grouping data?
Answer: To filter on a group-level statistic, summarise each group first and then apply the filter() function to the result. For example, keep the names whose mean age is greater than 30:
filtered_grouped <- df %>% group_by(name) %>%
  summarise(mean_age = mean(age)) %>%
  filter(mean_age > 30)
print(filtered_grouped)
7. How do you rename columns in a data frame?
Answer: Use the rename() function, writing each pair as new_name = old_name:
renamed_df <- df %>% rename(full_name = name)
print(renamed_df)
8. How do you filter rows based on multiple conditions?
Answer: Use logical operators (& for AND, | for OR) within the filter() function:
# Filter rows where age is greater than 28 and gender is 'Male'
filtered_multi <- df %>% filter(age > 28 & gender == "Male")
print(filtered_multi)
9. How do you sort a data frame by column values?
Answer: Use the arrange() function:
# Sort df by age in ascending order
sorted_df <- df %>% arrange(age)
print(sorted_df)
# Sort df by age in descending order
sorted_desc_df <- df %>% arrange(desc(age))
print(sorted_desc_df)
10. How do you modify an existing column in a data frame?
Answer: You can also use mutate() to modify existing columns: assign the new values to a column name that already exists, and mutate() overwrites that column.
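A minimal sketch of this, assuming the df with name, age, and gender columns from the earlier answers (re-created here so the block runs on its own); the specific changes, adding one year to age and upper-casing gender, are arbitrary examples.

```r
library(dplyr)

# df from the earlier answers, re-created so the block is self-contained
df <- data.frame(name = c("Alice", "Bob", "Charlie", "Alice"),
                 age = c(25, 35, 30, 28),
                 gender = c("Female", "Male", "Male", "Female"))

# Reusing an existing column name on the left-hand side overwrites that column
modified_df <- df %>% mutate(age = age + 1,
                             gender = toupper(gender))
print(modified_df)
```

The result has the same three columns as df; only the values in age and gender change, and df itself is left untouched.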