An Introduction to R Programming for Data Science

R is an open-source programming language developed in the 1990s for statistical computing and graphics. Over the past decade, R has become one of the leading data science tools due to its versatility in analyzing data and creating beautiful, informative visualizations. This beginner's guide will introduce the basics of using R for data analysis, visualization, and machine learning.

The Meteoric Rise of R

R was created in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand as a statistical analysis tool for coursework and research. But over the past 15 years, R has skyrocketed from a niche academic tool to one of the world's leading programming languages for data science.

Let's look at a few key stats about the R language's meteoric rise:

  • From 2004 to 2022, R jumped from outside the top 20 languages to the #8 most popular language overall, according to the RedMonk programming language rankings, which track popularity by activity and discussion.
  • In the recent Kaggle ML & Data Science Survey 2022 filled out by over 23,000 practitioners, R was the 2nd most popular language behind only Python and ahead of SQL. 76% of respondents reported using R for their work/research.
  • The annual number of R package downloads from CRAN doubled from 1.69 billion in 2016 to 3.33 billion in 2021, based on the R Consortium's reports. This shows incredible growth in R adoption and usage for analyzing data.

The ggplot2 code below plots the number of R users and packages over time. It assumes a file r_users_packages.csv with year, users and packages columns:

library(ggplot2)

# Expects columns: year, users (in millions), packages
users_data <- read.csv("r_users_packages.csv")

ggplot(users_data, aes(x = year)) +
  geom_line(aes(y = users), color = "darkblue") +
  # Multiply packages by 100 so the points line up with the secondary
  # axis, which divides by 100; adjust both factors to suit your data
  geom_point(aes(y = packages * 100), color = "darkred") +
  scale_y_continuous(
    name = "Estimated R Users (Millions)",
    sec.axis = sec_axis(~ . / 100, name = "Packages")
  ) +
  labs(title = "R Language Adoption Over Time", x = "Year")

This massive growth underscores how pivotal R has become to the data science industry. Next, let's understand why R makes a great data analysis tool.

Why Use R for Data Science?

Here are some of the key advantages that make R ideal for data-driven analyses and research:

Open Source

R is free software released under the GNU General Public License (GPL-2 | GPL-3). Anyone can read R's source code, and the community contributes packages, documentation, bug fixes and features through CRAN and platforms such as GitHub. This open ecosystem facilitates rapid progress.

Specialized for Statistical Computing

R was designed specifically for statistical analysis and visualization, unlike general-purpose languages such as Python. Most common statistical tests, models, and algorithms come built in or are available through packages, which makes flexible analyses straightforward.

Reproducible Analyses

R makes your analysis reproducible because you share the actual code rather than a description of the steps. Package versions can also be pinned (for example with the renv package) so an analysis can be rebuilt reliably. This facilitates collaboration and peer review of techniques.

Beautiful Data Visualization

The ggplot2 package produces publication-quality plots. Flexible layers and themes let you build rich charts, and companion packages such as plotly can make them interactive.

Scalability

R runs on anything from a personal laptop to a cluster by integrating with Spark, Hadoop, SQL databases and parallel backends. So you can scale your analysis to big data levels.

Now that you've seen why so many data science practitioners have adopted R, let's look at its key capabilities…

R's Core Capabilities

R delivers the following main capability areas:

Statistical Analysis & Modeling

Base R provides standard statistics like linear models, hypothesis tests, clustering and nonlinear regression out of the box. Specialized model packages extend functionality further.
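To make "out of the box" concrete, here is a minimal sketch that uses only base R. The built-in mtcars dataset and the variables chosen are assumptions made for illustration, not part of the original guide:

```r
# Base R only: no packages need to be installed or loaded.
# mtcars is a dataset that ships with R (used here purely as an example).

# Linear model: fuel efficiency (mpg) as a function of car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))        # intercept and slope
print(summary(fit)$r.squared)

# Hypothesis test: Welch two-sample t-test comparing mpg between
# automatic (am = 0) and manual (am = 1) transmissions
tt <- t.test(mpg ~ am, data = mtcars)
print(tt$p.value)
```

Calling summary(fit) on the model object prints the full coefficient table, which is usually the next step when exploring a fit.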

Data Wrangling

Tidyverse packages like dplyr streamline common data transformation tasks such as filtering, aggregating and mutating variables.

Data Visualization

Built-in graphics and ggplot2 cater to flexible and custom publication-ready visualizations for exploratory analysis.

Machine Learning Algorithms

Packages like caret, tensorflow and sparklyr (an interface to Spark MLlib) provide implementations of supervised and unsupervised machine learning models.

Interactive Dashboards

Shiny lets you create interactive web apps and visualizations customizable via user inputs in a simple interface without traditional web development.

Now let's look at RStudio, which enhances productivity when coding in R…

RStudio Organizes Your Analyses

While R can be used through script files, it is most convenient inside an Integrated Development Environment (IDE) like RStudio for organizing code.

RStudio is a free and open-source IDE specifically tailored for R that provides:

  • Code editor with syntax highlighting and auto-complete
  • Execution of code line by line or as whole scripts
  • Data viewer to see environment variables, history, graphs
  • Menu of packages, help, plots, viewer panes
  • Notebooks for mixing code, visualizations and text

This streamlines importing, transforming, visualizing, modeling and communicating findings in a collaborative fashion with built-in version control integration.

[Figure: the different panes in the RStudio IDE]

We will be using RStudio to learn R by running code examples in the built-in console. First, let's get familiar with the basic data structures that organize information in R.

R's Data Structures

Like any programming language, R has data structures to efficiently store and access data for computation. Common ones include:

Vectors

Homogeneous sequences of values, all of the same type, like:

ratings <- c(1.3, 3.5, 2.8, 4.2) 

Matrices

Tables of data values with rows and columns. Good for math operations:

users_matrix <- matrix(1:12, nrow = 3, ncol = 4)

Data Frames

Heterogeneous tabular data, similar to SQL tables, for storing datasets:

surveys_df <- data.frame(
  student = c("John", "Amy", "James"),
  height = c(64, 57, 72), 
  taken = c(TRUE, FALSE, TRUE)
)

Lists

Containers that can hold different types of objects, such as vectors and functions:

survey_results <- list(
  response_rates = c(0.7, 0.8),
  impute = mean  
)

These structures organize data for efficient analysis operations in R.
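As a short sketch of how these structures are inspected and indexed, the block below recreates the four objects defined above and pulls values out of each. The specific elements accessed are chosen only for illustration:

```r
# Recreate the structures from this section so the block runs on its own
ratings <- c(1.3, 3.5, 2.8, 4.2)
users_matrix <- matrix(1:12, nrow = 3, ncol = 4)
surveys_df <- data.frame(
  student = c("John", "Amy", "James"),
  height = c(64, 57, 72),
  taken = c(TRUE, FALSE, TRUE)
)
survey_results <- list(response_rates = c(0.7, 0.8), impute = mean)

ratings[2]                       # second element (R indexes from 1) -> 3.5
mean(ratings)                    # functions operate on whole vectors
dim(users_matrix)                # rows and columns -> 3 4
users_matrix[2, 3]               # row 2, column 3 (filled by column) -> 8
surveys_df$height                # one column, returned as a vector
surveys_df[surveys_df$taken, ]   # only the rows where taken is TRUE
survey_results$impute(ratings)   # call the function stored in the list
str(surveys_df)                  # compact summary of column types
```

When in doubt about what an object contains, str() and class() are usually the quickest way to check.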

Now let's go over one of the most common early tasks – data manipulation…

Data Wrangling with dplyr

Real-world data is messy, so it must be cleaned and transformed before analysis, a process called data wrangling.

The dplyr package, part of the tidyverse, makes data manipulation simple with easy-to-understand verb functions. The key dplyr functions are:

filter()   - Filter rows that meet conditions
select()   - Select certain columns
mutate()   - Add new columns 
summarize() - Aggregate to summary statistics
group_by()  - Group by categories
arrange()  - Sort rows

By combining these verbs, you can shape datasets just how you need:

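library(dplyr)

surveys_df %>%
  filter(height > 70) %>%        # keep rows with height above 70
  select(student, height) %>%    # keep only these two columns
  arrange(desc(height))          # sort from tallest to shortest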
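The remaining verbs (mutate, group_by and summarize) combine in the same way. The following is a minimal sketch on the same survey data; the height_cm column and the assumption that height is recorded in inches are introduced here for illustration only:

```r
library(dplyr)

# Same survey data as above, recreated so the block runs on its own
surveys_df <- data.frame(
  student = c("John", "Amy", "James"),
  height = c(64, 57, 72),
  taken = c(TRUE, FALSE, TRUE)
)

height_summary <- surveys_df %>%
  mutate(height_cm = height * 2.54) %>%   # assumes height is in inches
  group_by(taken) %>%                     # one group per value of `taken`
  summarize(
    n = n(),                              # rows in each group
    mean_height_cm = mean(height_cm)      # average height in each group
  ) %>%
  arrange(desc(mean_height_cm))           # tallest group first

print(height_summary)
```

Each verb takes a data frame and returns a data frame, which is what lets the steps be chained with %>%.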

This pipeline style chains data transformations together for fast iteration. Let's visualize the cleaned-up data…

Visualization with ggplot2

The ggplot2 package implements the layered grammar of graphics to enable flexible and consistent data visualization in R.

A plot is built by adding layers to a ggplot() call, as in this template (data, weight, height and diet are placeholders):

ggplot(data) + 
  geom_point(mapping = aes(x = weight, y = height), stat = "identity") +
  facet_wrap(~diet) +
  theme(legend.position = "none")

This adds layers of geoms, mappings, faceting and themes to build custom plots. Customization is endless for fine-tuning exactly what insights you want viewers to see.
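For a runnable version of that template, here is a sketch using the built-in mtcars dataset. The dataset and the wt, mpg and cyl columns are stand-ins chosen for illustration, not the data the template refers to:

```r
library(ggplot2)

# mtcars ships with R; wt, mpg and cyl replace the template's placeholders
p <- ggplot(mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +   # geom layer with its mapping
  facet_wrap(~cyl) +                             # one panel per cylinder count
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()                                # replace the default theme

print(p)   # draws the plot; ggsave("mpg_by_weight.png", p) writes a file
```

Because the plot is stored in p, further layers can be added later with p + ... without rebuilding it from scratch.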

Let's quickly visualize the height survey data frame we have:

ggplot(surveys_df) +   
  geom_histogram(aes(x = height), bins = 30)

That was a basic example of ggplot2's ability to create publication-grade graphics tailored to your goals. Next, let's look at R's machine learning functionality.

Scalable Machine Learning in R

R offers implementations of most common machine learning algorithms, covering both supervised and unsupervised learning techniques such as:

Supervised Learning

  • Regression models – linear regression, random forest regression
  • Classification models – logistic regression, support vector machines, naive Bayes
  • Neural networks – shallow and deep learning

Unsupervised Learning

  • Dimensionality reduction – principal component analysis, matrix factorization
  • Clustering – k-means, hierarchical clustering
  • Association rule learning – apriori algorithm
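Several of the unsupervised techniques listed above are available in base R without extra packages. The sketch below uses the built-in iris dataset purely as an illustration; the choice of three clusters and Ward linkage are assumptions, not recommendations from the original guide:

```r
# Base R only; iris is a built-in dataset used here for illustration
features <- scale(iris[, 1:4])   # standardize the four numeric columns

# Dimensionality reduction: principal component analysis
pca <- prcomp(features)
print(summary(pca)$importance["Proportion of Variance", 1:2])

# Clustering: k-means with 3 clusters (random starts, so fix the seed)
set.seed(123)
km <- kmeans(features, centers = 3, nstart = 20)
print(table(cluster = km$cluster, species = iris$Species))

# Clustering: hierarchical clustering, cut into 3 groups
hc <- hclust(dist(features), method = "ward.D2")
hc_groups <- cutree(hc, k = 3)
print(table(hc_groups))
```

The cross-tabulation against iris$Species is only a sanity check; in a real unsupervised setting no such labels would be available.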

Packages like caret and tidymodels provide a unified framework for comparing models through resampling and workflows. There are also packages for specific algorithms like randomForest and e1071.

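Here is a brief example that trains and tests a random forest model using caret:

library(caret)
set.seed(123)  # make the random split reproducible

# `data` is assumed to be a data frame whose `quality` column is the class label
data$quality <- as.factor(data$quality)

# Hold out 20% of the rows for testing, stratified by quality
train_index <- createDataPartition(data$quality, p = 0.8, list = FALSE)
training_set <- data[train_index, ]
test_set <- data[-train_index, ]

# Fit a random forest (requires the randomForest package) and evaluate it
rf_model <- train(quality ~ ., method = "rf", data = training_set)
predictions <- predict(rf_model, newdata = test_set)
print(confusionMatrix(predictions, test_set$quality))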

This support for machine learning helps extract insights from data. Now let's cover Shiny for shared interactive dashboards.

Interactivity with R Shiny

The shiny R package from RStudio lets you create web apps and interactive visualizations directly from R without needing to know typical JavaScript/CSS web development.

You can easily:

  • Render dynamic UI components like plots, tables and controls
  • Customize them based on user inputs and selections
  • Update outputs when the underlying data changes
  • Share them via dashboards deployed on web pages
  • Monitor usage analytics to improve the app

This interactivity makes your analysis more impactful through exploration by both technical and non-technical audiences.

[Figure: a Shiny dashboard with dropdowns that adjust the displayed plot]

Now analyses encoded in R can be made available to anyone with an internet browser!
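As a rough sketch of what such an app looks like in code, the example below defines a dropdown that controls a histogram. The app title, the input and output IDs, and the use of the built-in mtcars dataset are all assumptions made for illustration:

```r
library(shiny)

# UI: a dropdown that picks which mtcars column to plot
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  selectInput("column", "Variable:", choices = names(mtcars), selected = "mpg"),
  plotOutput("hist")
)

# Server: re-renders the plot whenever input$column changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(mtcars[[input$column]], main = input$column, xlab = input$column)
  })
}

# Bundle UI and server into an app object
app <- shinyApp(ui = ui, server = server)
# Launch it locally in a browser with: runApp(app)
```

The reactive link between input$column and renderPlot() is what keeps the plot in sync with the dropdown; no explicit event handlers are needed.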

Additional R Resources

This guide introduced you to the meteoric rise of R for modern data science along with R's core components like data wrangling, visualization and modeling.

To continue mastering R, some useful next steps include:

R has been pivotal to advances in statistics, graphics and machine learning. I hope you're now equipped with the basics to start your journey leveraging R's capabilities for your own data science needs. Feel free to reach out if you have any other questions!
