Mastering R's Select, Filter, Arrange and Pipelines

As an R user working with data, you'll inevitably need to wrangle datasets: selecting variables, filtering rows, reordering observations, and so on. The dplyr package provides simple but powerful functions to do this.

In this guide, you'll learn to use dplyr's select(), filter() and arrange() functions, together with the pipe operator, for everyday data manipulation.

Select – Subset Columns with Ease

The select() function lets you subset a dataset down to only the columns you need for analysis. No more clutter from extraneous variables!

For example, say we're analyzing the popular diamonds dataset:

# Load dplyr (it also provides as_tibble() and %>%)
library(dplyr)

# Import diamonds data 
diamonds <- as_tibble(ggplot2::diamonds)  

# Select carat, cut, depth and price columns
slim_diamonds <- select(diamonds, carat, cut, depth, price)

We've narrowed it down to only the fields relevant for modeling diamond prices.

Some useful select() patterns:

  • Select a range with carat:depth
  • Drop columns by prefixing minus -clarity
  • Select all BUT excluded vars with -c(x, y, z)
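
The three patterns above can be sketched as follows. This is a minimal illustration that assumes dplyr and ggplot2 are installed; the result names (range_cols, no_clarity, no_dims) are made up for the example.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# Range: every column from carat through depth, in dataset order
range_cols <- select(diamonds, carat:depth)

# Drop one column by prefixing its name with a minus sign
no_clarity <- select(diamonds, -clarity)

# Drop several columns at once
no_dims <- select(diamonds, -c(x, y, z))

names(range_cols)
names(no_clarity)
names(no_dims)
```

Calling names() on each result is a quick way to confirm which columns survived.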

Let's check the new dataframe dimensions:

# Original number of columns  
ncol(diamonds)
#[1] 10

# New slim dataset  
ncol(slim_diamonds) 
#[1] 4

We've eliminated 6 extraneous variables for a leaner analysis dataset.

Select Before and After

By removing unnecessary columns with select(), we also benefit from:

  • Faster model training – less data to process
  • Less noise – irrelevant variables can no longer mislead a model
  • Lower memory usage – reduced dataframe size

Now let's explore how this translates to big data. In Spark, we apply similar column selection but on very large distributed DataFrames:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DiamondsAnalysis").getOrCreate()

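# header=True takes column names from the first row; inferSchema=True parses numeric columns
diamonds_df = spark.read.csv("diamonds.csv", header=True, inferSchema=True)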

slim_diamonds_df = diamonds_df.select("carat", "cut", "depth", "price")

Spark evaluates lazily and its optimizer plans the column selection as part of the overall job, so the same code scales to billions of rows distributed across a cluster.

Filter – Focusing on Subsets of Rows

While select() grabs columns, filter() allows us to extract rows meeting specific logical criteria.

filter(my_df, condition1, condition2)

Let's filter for large, high-quality diamonds:

# Filter for 2+ carats & Ideal cut
ideal_biguns <- filter(diamonds, carat >= 2, cut == "Ideal")

# Number of matching rows
nrow(ideal_biguns)

We've quickly isolated the big prime diamonds for analysis. No manual scanning of the entire dataset needed!

Here are some common useful types of filter criteria:

  • Numeric – carat > 5
  • Text – color %in% c("D", "E", "F")
  • Date ranges – sold_date <= as.Date("2020-05-21")
  • Missing values – !is.na(purchase_price)
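
Here is a small sketch of each criterion type. The sales table and its columns (sold_date, purchase_price) are hypothetical and exist only for this illustration; the diamonds dataset has no date or purchase columns.

```r
library(dplyr)

# Hypothetical sales table; the column names are made up for illustration
sales <- data.frame(
  carat          = c(0.5, 1.2, 2.1, 0.9),
  color          = c("D", "G", "E", "J"),
  sold_date      = as.Date(c("2020-01-15", "2020-06-01", "2020-05-21", "2019-12-31")),
  purchase_price = c(1500, NA, 18000, 2400),
  stringsAsFactors = FALSE
)

big_stones   <- filter(sales, carat > 1)                          # numeric
top_colors   <- filter(sales, color %in% c("D", "E", "F"))        # text
sold_by_may  <- filter(sales, sold_date <= as.Date("2020-05-21")) # date range
priced_only  <- filter(sales, !is.na(purchase_price))             # missing values

# Conditions separated by commas are combined with AND
big_and_priced <- filter(sales, carat > 1, !is.na(purchase_price))
```

Wrapping the date string in as.Date() makes the comparison explicit about types instead of relying on implicit conversion.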

You can also compute a threshold inside filter() to exclude outliers before analysis:

filtered_data <- diamonds %>% 
    filter(carat < quantile(carat, 0.99))

This drops every row at or above the 99th percentile of carat, roughly the largest 1% of diamonds. Outliers gone!

Avoiding Pitfalls

When filtering, watch out for these common issues:

  • Accidentally removing all data if too restrictive
  • Leaving syntax errors like unmatched parentheses
  • Comparing against a text value without quoting it, e.g. cut == Ideal instead of cut == "Ideal"

Set up defensive checks on the row count afterward to check filter() worked as expected.
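
One way such a check might look is sketched below. check_rows() is a hypothetical helper written for this example, not a dplyr function, and the wording of its messages is arbitrary.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# Hypothetical helper: warn when a filter keeps nothing, then report the counts
check_rows <- function(before, after) {
  if (nrow(after) == 0) {
    warning("filter() removed every row; the condition may be too strict")
  }
  message(nrow(after), " of ", nrow(before), " rows kept")
  invisible(after)
}

ideal_biguns <- filter(diamonds, carat >= 2, cut == "Ideal")
check_rows(diamonds, ideal_biguns)
```

In a script that must not continue with empty data, stopifnot(nrow(ideal_biguns) > 0) is a stricter alternative that halts execution.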

Arrange – Ordering Rows

The arrange() function enables sorting the rows of a dataset by one or more columns:

arrange(diamonds, desc(carat), price)

This orders by carat descending first, then by ascending price.
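
A tiny made-up table makes the tie-breaking visible. The stones data frame below is purely illustrative.

```r
library(dplyr)

# Made-up data: two stones share the same carat so the second key matters
stones <- data.frame(
  id    = c("a", "b", "c", "d"),
  carat = c(1.0, 2.0, 2.0, 0.5),
  price = c(5000, 16000, 14000, 1200),
  stringsAsFactors = FALSE
)

sorted <- arrange(stones, desc(carat), price)
sorted$id
# "c" "b" "a" "d"
```

Stones b and c tie on carat, so the cheaper one (c) comes first; the remaining rows follow in descending carat order.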

Let's sort the diamonds by highest price first:

diamonds_by_price <- arrange(diamonds, desc(price))

Plotting expensive diamonds visually shows the impact of arrange():

[Figure: Arrange Price Graph]

With the rows ordered by price, the plot shows a clear downward slope, which makes the pattern among high-price diamonds easier to analyze and model.

In pandas, row ordering is just as simple:

import pandas as pd

diamonds_df = pd.read_csv("diamonds.csv")

diamond_by_price_df = diamonds_df.sort_values("price", ascending=False)

Performance Considerations

One thing to watch is that arrange() can get slow on the very large datasets common in data science pipelines.

Databases can use indexes to speed up sorting, and Spark distributes the sort across a cluster. R data frames are sorted in memory on a single machine, so they can struggle once a dataset reaches hundreds of millions or billions of records.

In these cases, do the ordering directly in the database query or on the Spark cluster for efficiency:

-- In relational database
SELECT * FROM diamonds ORDER BY price DESC

Pipelines – Connecting Transformation Steps

Frequently we need to chain together many select, filter, arrange and other transformation blocks to produce the final cleaned dataset.

Typing out each intermediate dataframe gets tedious:

temp1 <- select(diamonds, -x, -y, -z) 
temp2 <- filter(temp1, price > 5000)
cleaned <- arrange(temp2, desc(carat))

The pipe operator %>% (from magrittr, re-exported by dplyr) avoids the mess by passing the output of one step as the first argument of the next:

cleaned <- diamonds %>%  
  select(-x, -y, -z) %>%
  filter(price > 5000) %>%
  arrange(desc(carat))  

Our workflow got significantly more readable by eliminating temporary variables!
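
To see that the pipe is only a change of notation, the sketch below compares the piped version with the equivalent nested calls. The names nested and piped are illustrative.

```r
library(dplyr)

diamonds <- ggplot2::diamonds

# Nested form: read from the inside out
nested <- arrange(
  filter(select(diamonds, -x, -y, -z), price > 5000),
  desc(carat)
)

# Piped form: read from top to bottom
piped <- diamonds %>%
  select(-x, -y, -z) %>%
  filter(price > 5000) %>%
  arrange(desc(carat))

# x %>% f(y) is evaluated as f(x, y), so both forms give the same result
identical(nested, piped)
```

Both forms run the same three function calls in the same order; the pipe simply lets the steps be written in the order they execute.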

Constructing pipelines helps structure complex R analysis scripts:

final_df <- raw_data %>%
   validate_schema() %>%
   filter_outliers() %>%   
   transform_vars() %>%
   join_lookups() %>%
   select_final_features()

This structures our ETL sequence for others to easily follow.
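
For a pipeline like this to work, each step only needs to take a data frame as its first argument and return a data frame. Below is a minimal sketch of how two of the steps might be written; the function bodies, the required column list and the 99th percentile cutoff are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical step: fail fast when expected columns are missing
validate_schema <- function(df, required = c("carat", "cut", "price")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  df
}

# Hypothetical step: drop rows at or above the 99th percentile of carat
filter_outliers <- function(df) {
  filter(df, carat < quantile(carat, 0.99))
}

final_df <- ggplot2::diamonds %>%
  validate_schema() %>%
  filter_outliers()
```

Because validate_schema() returns its input unchanged when the check passes, it can sit anywhere in the chain without affecting the data.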

Transitioning to Spark

Once data sizes exceed the memory available to R, we can port pipelines to Apache Spark for distributed big data preparation:

# raw_df is assumed to be an existing Spark DataFrame
cleaned_df = (
    raw_df
    .filter("price > 5000")
    .select("carat", "price")
    .orderBy("carat", ascending=False)
)

The Spark optimizations take over underneath to handle scalable, fault-tolerant data wrangling!

Master Data Manipulation with R

I hope you now feel empowered slicing and dicing datasets using dplyr's flexible filter(), select() and arrange() functions.

Pipelines strengthen the capabilities further for production-grade ETL.

Take any dataset and practice different column selections, row filters and reorderings to gain intuition. Before you know it, you'll be wrangling data like a pro!
