R Language Filtering Selecting Mutating Summarizing Complete Guide

Last update: 2025-06-22. 6 min read. Difficulty level: beginner.

Understanding the Core Concepts of R Language Filtering, Selecting, Mutating, Summarizing

1. Filtering

Filtering in R is used to select specific rows based on conditions. The dplyr package provides the filter() function for this purpose. It allows you to specify conditions that the rows must meet to be kept in the dataset.

Syntax:

filter(data, condition1, condition2, ...)

Example:

library(dplyr)

# Sample data frame
data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

# Filter data where age is greater than 30
filtered_data <- filter(data, age > 30)

Important Info:

You can combine multiple conditions using logical operators (& for AND, | for OR).
Use ! for negation to exclude rows that meet a condition.
Filters can also use helper functions such as between() for a range of values, or str_detect() from the stringr package for string matching.
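The points above can be sketched in a few lines. This is a minimal illustration, assuming a data frame with the same columns as the sample; the names `employees`, `older_and_well_paid`, `young_or_well_paid`, `not_bob`, and `thirties` are made up for this example.

```r
library(dplyr)

# Same shape as the sample data frame, re-created so this block runs on its own
employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

# AND: both conditions must hold
older_and_well_paid <- filter(employees, age > 30 & salary > 70000)

# OR: either condition may hold
young_or_well_paid <- filter(employees, age < 30 | salary >= 80000)

# Negation: drop the rows that meet the condition
not_bob <- filter(employees, !(name == "Bob"))

# between() includes both endpoints
thirties <- filter(employees, between(age, 30, 39))
```

Separating conditions with commas inside filter() behaves like joining them with &.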

2. Selecting

Selecting in R allows you to choose specific columns from a dataset. The dplyr package provides the select() function for this. It lets you specify which columns you want to keep and in which order.

Syntax:

select(data, column1, column2, ...)

Example:

# Select name and salary from the data frame
selected_data <- select(data, name, salary)

Important Info:

You can use helper functions like starts_with(), ends_with(), contains(), and matches() to select columns based on patterns in their names. one_of() also works, but it has been superseded by any_of() and all_of().
The everything() helper matches all columns, so select(data, salary, everything()) moves salary to the front and keeps the remaining columns after it in their original order.
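As a rough sketch of how these helpers behave, the example below uses a small data frame with a hypothetical extra column, `salary_band`, added only so that a name pattern has more than one match. It uses any_of() rather than one_of(), since current dplyr documentation marks one_of() as superseded.

```r
library(dplyr)

# Illustrative data; salary_band is a made-up column for this sketch
employees <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30),
  salary = c(50000, 60000),
  salary_band = c("B", "C")
)

# Columns whose names start with "sal": salary and salary_band
pay_cols <- select(employees, starts_with("sal"))

# Put salary first; everything() fills in the remaining columns
salary_first <- select(employees, salary, everything())

# any_of() silently skips names that are not in the data ("bonus" here)
wanted <- select(employees, any_of(c("name", "bonus")))
```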

3. Mutating

Mutating in R involves creating new variables or modifying existing ones within a dataset. The dplyr package provides the mutate() function, which can compute new columns and add them to the data frame.

Syntax:

mutate(data, new_column = new_values, ...)

Example:

# Create a new column that calculates a 10% bonus on salary
mutated_data <- mutate(data, bonus = salary * 0.10)

Important Info:

You can use arithmetic operations, logical conditions, and functions to create and modify columns.
Columns created earlier in a mutate() call can be used by later expressions in the same call.
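A short sketch of these two points, assuming the same salary figures as the sample data. The column names `total_pay` and `high_earner` and the 70000 threshold are chosen only for illustration.

```r
library(dplyr)

employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  salary = c(50000, 60000, 70000, 80000)
)

with_pay <- mutate(
  employees,
  bonus = salary * 0.10,           # arithmetic on an existing column
  total_pay = salary + bonus,      # uses bonus, created one line earlier
  high_earner = total_pay > 70000  # logical condition on the new column
)

print(with_pay)
```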

4. Summarizing

Summarizing in R involves computing summary statistics such as mean, median, minimum, and maximum for a dataset or subsets of it. The dplyr package provides the summarise() (or summarize() in American English) function for this purpose.

Syntax:

summarise(data, new_summary_column = summary_function(column), ...)

Example:

# Calculate the average salary from the data frame
summary_data <- summarise(data, avg_salary = mean(salary))

Important Info:

group_by() splits the data into groups so that summarise() returns one row per group, and n() counts the rows in each group. count() and tally() are shortcuts for counting rows.
For complex summaries, dplyr allows chaining multiple operations with the %>% pipe operator, which keeps each step on its own line and easy to read.
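To make grouped summaries concrete, here is a minimal sketch. The `staff` data frame and its `department` column are hypothetical, added because the sample data above has no column to group by.

```r
library(dplyr)

# Hypothetical data with a grouping column
staff <- data.frame(
  department = c("IT", "IT", "HR", "HR", "HR"),
  salary = c(70000, 80000, 50000, 55000, 60000)
)

# One row per department: number of rows and mean salary
by_department <- staff %>%
  group_by(department) %>%
  summarise(
    headcount = n(),
    avg_salary = mean(salary)
  )

# count() combines the grouping and counting steps
head_counts <- count(staff, department)

print(by_department)
print(head_counts)
```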

Using Pipes (`%>%`)

Combining the above operations using pipes can streamline data workflows and make code more readable. A pipe passes the result of the expression on its left as the first argument to the function on its right.

Example:

# Filter, select, mutate, and summarize in one pipeline
pipe_data <- data %>%
  filter(age > 30) %>%
  select(name, salary) %>%
  mutate(bonus = salary * 0.10) %>%
  summarise(avg_bonus = mean(bonus))
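To see what the pipe is doing, it can help to compare the pipeline with the equivalent nested calls. The sketch below assumes the same sample data; `employees`, `nested`, and `piped` are illustrative names.

```r
library(dplyr)

employees <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

# Nested form: has to be read from the inside out
nested <- summarise(
  mutate(
    select(filter(employees, age > 30), name, salary),
    bonus = salary * 0.10
  ),
  avg_bonus = mean(bonus)
)

# Piped form: the same steps, read top to bottom
piped <- employees %>%
  filter(age > 30) %>%
  select(name, salary) %>%
  mutate(bonus = salary * 0.10) %>%
  summarise(avg_bonus = mean(bonus))

# Both forms give the same one-row result
print(all.equal(nested, piped))
```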

Through these operations, data manipulation in R becomes intuitive and powerful, allowing analysts to efficiently process, clean, and transform their data.

Step-by-Step Guide: How to Implement R Language Filtering, Selecting, Mutating, Summarizing

Load Required Libraries

First, make sure you have dplyr installed and loaded.

# Install the dplyr package only if it is not already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load dplyr package
library(dplyr)

# Also, let's create a sample data frame for demonstration
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000),
  Department = c("HR", "Finance", "IT", "Marketing")
)

# View the initial data
print(data)

Filtering Data

Filtering is used to subset rows in a dataset based on some conditions. Let’s filter our dataset to find employees in the IT department.

# Filter rows where Department is "IT"
filtered_data <- filter(data, Department == "IT")

# Print the filtered data
print(filtered_data)

Selecting Data

Selecting is used to choose specific columns from a dataset. Let's select Name and Salary columns from our dataset.

# Select Name and Salary columns
selected_data <- select(data, Name, Salary)

# Print the selected data
print(selected_data)

Mutating Data

Mutating is used to create new columns or modify existing columns. Let's create a new column AnnualBonus based on the Salary.

# Create a new column AnnualBonus which is 10% of Salary
mutated_data <- mutate(data, AnnualBonus = Salary * 0.10)

# Print the mutated data
print(mutated_data)

Summarizing Data

Summarizing collapses a dataset into aggregate values such as a mean or a sum. Let's find the average salary.

# Find the average salary
summary_data <- summarise(data, avg_salary = mean(Salary))

# Print the summary data
print(summary_data)

Chaining Operations

You can chain multiple dplyr functions using the %>% (pipe) operator to perform multiple operations succinctly. Let's combine filtering, selecting, mutating, and summarizing.

For example, let’s filter for employees earning more than 60000, select their names and salaries, add a bonus column, and then find the total bonus.
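The original example code for this step is not present here, so the following is a sketch of what such a pipeline could look like. It assumes the sample data frame from the start of this guide and a 10% bonus rate carried over from the AnnualBonus example; `Bonus`, `total_bonus`, and `bonus_summary` are illustrative names.

```r
library(dplyr)

# Sample data frame from the start of this guide
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000),
  Department = c("HR", "Finance", "IT", "Marketing")
)

bonus_summary <- data %>%
  filter(Salary > 60000) %>%            # keep employees earning more than 60000
  select(Name, Salary) %>%              # keep only the columns we need
  mutate(Bonus = Salary * 0.10) %>%     # assumed 10% bonus rate
  summarise(total_bonus = sum(Bonus))   # add up the bonuses

print(bonus_summary)
```

Only Charlie and David pass the filter, so the total is the sum of their two bonuses.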

Top 10 Interview Questions & Answers on R Language Filtering, Selecting, Mutating, Summarizing

1. How do you filter rows of a data frame based on a condition?

Answer: You can use the filter() function to select rows based on one or more conditions. For example, suppose you have a data frame df that includes a column age and you wish to filter for rows where age > 30.

library(dplyr)

# Sample data frame
df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 35, 30))

# Filtering rows where age is greater than 30
filtered_df <- df %>% filter(age > 30)
print(filtered_df)

2. How do you select specific columns from a data frame?

Answer: Use the select() function to choose columns by their names. You can also use helper functions like starts_with(), ends_with(), and contains() to select multiple columns based on their names.

# Selecting specific columns
selected_df <- df %>% select(name, age)
print(selected_df)

# Selecting columns using helper functions
# Suppose df has columns 'name', 'age', 'height', 'weight'
df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 35, 30),
                 height = c(165, 180, 175),
                 weight = c(60, 80, 75))
selected_by_helper <- df %>% select(starts_with("a"))  # Selects columns starting with 'a'
print(selected_by_helper)

3. How do you add a new column to a data frame based on the values of existing columns?

Answer: The mutate() function is used to create new columns based on existing data. For example, to add a BMI column based on height (in cm) and weight (in kg), you would use:

mutated_df <- df %>% mutate(bmi = weight / (height / 100)^2)
print(mutated_df)

4. How do you summarize data, such as finding the mean and standard deviation of a column?

Answer: Use summarise() (or its alias summarize()) to compute summary statistics:

summary_df <- df %>% summarise(
  mean_age = mean(age),
  sd_age = sd(age)
)
print(summary_df)

5. How do you group data and then perform actions on each group?

Answer: Combine group_by() with a summarizing function. For example, to calculate the mean age by gender:

# Suppose adding another column 'gender' for illustration
df <- data.frame(name = c("Alice", "Bob", "Charlie", "Alice"),
                 age = c(25, 35, 30, 28),
                 gender = c("Female", "Male", "Male", "Female"))

grouped_summary <- df %>% group_by(gender) %>%
  summarise(mean_age = mean(age))
print(grouped_summary)

6. How do you filter rows after grouping data?

Answer: Summarise each group first, then apply filter() to the summary. For example, to keep only the names whose mean age is greater than 30:

filtered_grouped <- df %>% group_by(name) %>%
  summarise(mean_age = mean(age)) %>%
  filter(mean_age > 30)
print(filtered_grouped)
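filter() can also be applied directly to grouped data without summarising first, in which case the condition is evaluated within each group and the original rows are kept. A minimal sketch, assuming the same df with a gender column; `oldest_per_gender` is an illustrative name.

```r
library(dplyr)

df <- data.frame(name = c("Alice", "Bob", "Charlie", "Alice"),
                 age = c(25, 35, 30, 28),
                 gender = c("Female", "Male", "Male", "Female"))

# Keep the oldest person within each gender group
oldest_per_gender <- df %>%
  group_by(gender) %>%
  filter(age == max(age)) %>%  # max(age) is computed per group
  ungroup()                    # drop the grouping once it is no longer needed

print(oldest_per_gender)
```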

7. How do you rename columns in a data frame?

Answer: Use the rename() function:

renamed_df <- df %>% rename(full_name = name)
print(renamed_df)

8. How do you filter rows based on multiple conditions?

Answer: Use logical operators (& for AND, | for OR) within the filter() function:

# Filter rows where age is greater than 28 and gender is 'Male'
filtered_multi <- df %>% filter(age > 28 & gender == "Male")
print(filtered_multi)

9. How do you sort a data frame by column values?

Answer: Use the arrange() function:

# Sort df by age in ascending order
sorted_df <- df %>% arrange(age)
print(sorted_df)

# Sort df by age in descending order
sorted_desc_df <- df %>% arrange(desc(age))
print(sorted_desc_df)

10. How do you modify an existing column in a data frame?

Answer: You can also use mutate() to modify existing columns:
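The original example for this answer is not present here. As a sketch of the idea, assigning to a column name that already exists overwrites that column instead of adding a new one. The increment of one year and the name `modified_df` are chosen only for illustration.

```r
library(dplyr)

df <- data.frame(name = c("Alice", "Bob", "Charlie"),
                 age = c(25, 35, 30))

# Reusing the existing column name replaces its values
modified_df <- df %>% mutate(age = age + 1)
print(modified_df)
```

The result has the same columns as df; only the values in age change.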
