Scatterplots enable easy visualization of relationships between two continuous variables. The ggplot2 package in R structures scatterplots according to a flexible "grammar of graphics," making customization simple yet consistent.
Let's explore key techniques for creating informative scatterplots for effective data analysis. I'll explain concepts informally along the way.
Introducing ggplot2 Scatterplots
A scatterplot displays values of one variable on the y-axis and values of a second variable on the x-axis, with one dot per observation in your dataset. The position of each dot represents that observation's paired x and y values. Patterns in dot positioning reveal correlations.
For example, here is a scatterplot with engine displacement (displ) on the x-axis and highway fuel efficiency (hwy) on the y-axis, using the mpg dataset that ships with ggplot2:
library(ggplot2)
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
Each dot represents an automobile. Dot position indicates the displ (x) and hwy (y) values for that car. We see smaller engines cluster towards the top left (good hwy) while larger engines cluster towards the bottom right (worse hwy). This negative correlation is expected since larger engines normally use more fuel.
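To put a number on the negative correlation noted above, one option is base R's cor(). This is only a minimal sketch; the variable name r is illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between engine displacement and highway mileage.
# A value close to -1 indicates a strong negative linear relationship.
r <- cor(mpg$displ, mpg$hwy)
print(round(r, 2))
```

A single coefficient only summarizes the linear part of the pattern, so it complements the plot rather than replacing it.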
The ggplot2 package structures plots according to this grammar:
- Data – Dataset
- Aesthetics – Variables mapped to visual properties
- Geom – Geometric objects like points, lines
- Facets – Panels to segment data
- Stats – Statistical transformations
- Scales & Coordinates – Axis scaling
- Theme – Complete theme styling
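The layers listed above can all appear in a single call. The sketch below is one illustrative combination; faceting by drv, the axis breaks, the y-axis range, and theme_minimal() are assumptions chosen for the example, not requirements of the grammar.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +  # data + aesthetics
  geom_point(alpha = 0.5) +                         # geom: one point per car
  stat_smooth(method = "lm", formula = y ~ x) +     # stat: fitted linear trend
  facet_wrap(~ drv) +                               # facets: one panel per drive type
  scale_x_continuous(breaks = 2:7) +                # scale: x-axis tick positions
  coord_cartesian(ylim = c(10, 45)) +               # coordinates: visible y range
  theme_minimal()                                   # theme: overall styling

p
```

Each component is added with +, so any one of them can be swapped or removed without touching the rest.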
This provides immense flexibility! Let's see how to leverage ggplot2's capabilities for insightful scatterplots.
Visualizing Groups with Aesthetics
We can visualize additional data dimensions by mapping variables to aesthetics like color, shape, or size.
Here we map the class variable to the color aesthetic:
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
Mapping the vehicle classes to color reveals clusters of SUVs and pickups with larger engines but lower efficiency (bottom right). Compact and subcompact cars cluster towards the top left with better efficiency. This lets us easily highlight and compare groups.
For better color distinction, we could use a qualitative ColorBrewer palette such as Set1 (Set2 and Dark2 are the colorblind-friendly qualitative options):
+ scale_color_brewer(palette="Set1")
The shape aesthetic can also effectively visualize groups. Because class has seven levels in mpg, a manual shape scale needs seven values:
+ aes(shape = class)
+ scale_shape_manual(values = c(21, 24, 22, 25, 23, 4, 8))
Tip: Use shapes of similar visual size for fair comparison between groups! Vary color & shape together when plotting many groups.
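As a rough illustration of varying color and shape together, the sketch below maps class to both aesthetics. The specific shape codes, the point size, and the Dark2 palette are assumptions picked for the example; mpg has seven classes, so seven shape codes are supplied.

```r
library(ggplot2)

# One shape code per class level (seven levels in mpg).
shape_values <- c(21, 24, 22, 25, 23, 4, 8)

p_groups <- ggplot(mpg, aes(x = displ, y = hwy, color = class, shape = class)) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Dark2") +    # qualitative, colorblind-friendly palette
  scale_shape_manual(values = shape_values)  # manual shapes, one per class

p_groups
```

Because both aesthetics map the same variable, ggplot2 merges them into a single legend.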
Approximating Density with 2D Contours
Visualizing the density of data points shows where observations concentrate. We can approximate density using 2D kernel density estimation, drawn as filled contour regions:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  # geom = "polygon" is needed for fill to apply; after_stat(level) replaces the deprecated ..level..
  stat_density_2d(geom = "polygon", aes(fill = after_stat(level)), color = "black", alpha = 0.4) +
  scale_fill_gradient(low = "yellow", high = "red")
Red regions indicate higher density of points, while yellow regions indicate lower density. We clearly see the negative correlation again, with point density highest along the sloped ridgeline.
This visualizes the distribution better than eyeballing color groups!
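The density estimate behind such contours can also be inspected directly. The sketch below uses MASS::kde2d() (the MASS package ships with standard R installations) to evaluate a 2D kernel density on a grid and locate its peak; the 50-point grid resolution and the variable names are assumptions for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Evaluate a 2D kernel density estimate on a 50 x 50 grid spanning the data range.
dens <- MASS::kde2d(mpg$displ, mpg$hwy, n = 50)

# Locate the grid cell with the highest estimated density.
# Rows of dens$z correspond to dens$x (displ), columns to dens$y (hwy).
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_displ <- dens$x[peak[1, "row"]]
peak_hwy   <- dens$y[peak[1, "col"]]

cat("Densest region near displ =", round(peak_displ, 1),
    "and hwy =", round(peak_hwy, 1), "\n")
```

The peak location depends on the bandwidth kde2d() picks by default, so treat it as an approximate summary.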
Adding Model Fits with geom_smooth()
Rather than judging the trend by eye, we can fit a linear regression line that summarizes its direction and slope.
The geom_smooth() layer plots model fits with optional confidence intervals:
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(alpha = 0.4) +
geom_smooth(method="lm", se=FALSE)
The linear model line (blue by default) shows the direction of the negative relationship. Omitting the confidence band with se = FALSE gives a cleaner plot. The slope of the line quantifies the rate at which hwy drops as displ increases.
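The slope mentioned above can be read off directly by fitting the same model with lm(). This is a minimal sketch; fit, slope, and r_squared are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Fit the same simple linear model that geom_smooth(method = "lm") draws.
fit <- lm(hwy ~ displ, data = mpg)

# Slope: expected change in highway mpg per additional litre of displacement.
slope <- coef(fit)[["displ"]]

# R-squared: share of the variance in hwy explained by displ.
r_squared <- summary(fit)$r.squared

cat("Slope:", round(slope, 2), " R-squared:", round(r_squared, 2), "\n")
```

The slope describes how steep the relationship is, while R-squared describes how tightly the points follow the line; the plot alone does not give either number.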
We could also try non-linear smoothing methods:
+ geom_smooth(method="loess", se=TRUE)
The loess fit suggests the relationship is not strictly linear: efficiency falls steeply across the smaller engines, then the curve flattens out and may even turn slightly upward at the largest displacements, where a handful of two-seater sports cars sit. The confidence region conveys the uncertainty in the smoothed fit, which widens where data are sparse.
Faceting Scatterplots
Visualizing multiple groups together can clutter a plot. Faceting splits data across multiple panels to provide focused views.
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ class, nrow=2)
Now we see the relationship for each vehicle class, side by side. The negative relationship appears stronger within most passenger-car classes than within SUVs and pickups. Such insights can guide and inspire further statistical investigation.
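To check the per-class impressions from the facets numerically, one could compute the correlation within each class. This is a rough sketch in base R; the names by_class, class_cor, and class_n are illustrative. Some classes contain few cars (2seater has only five), so their coefficients are unstable.

```r
library(ggplot2)  # provides the mpg dataset

# Split the data into one data frame per vehicle class.
by_class <- split(mpg, mpg$class)

# Correlation between displ and hwy within each class, plus group sizes.
class_cor <- sapply(by_class, function(d) cor(d$displ, d$hwy))
class_n   <- sapply(by_class, nrow)

print(data.frame(n = class_n, cor = round(class_cor, 2)))
```

Reading the coefficients next to the group sizes helps avoid over-interpreting a class with only a few observations.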
Customizing Appearance
Many options exist for customizing appearance to better resonate with your audience:
Themes – Completely change look and feel:
+ theme_dark()
Labels – Describe data better:
+ labs(
title="Engine Displacement vs Efficiency",
x="Engine Size (L)",
y="Highway MPG"
)
Scales – Draw attention to patterns:
+ scale_x_continuous(limits=c(0,7),breaks=seq(0,7,1))
+ scale_y_continuous(limits=c(0,50),breaks=seq(0,50,10))
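One design point worth knowing when setting axis limits: limits set on a scale discard out-of-range observations before any statistics are computed, while coord_cartesian() only zooms the view. The limits used above contain every observation in mpg, so nothing is dropped there. The narrower range in the sketch below is an assumption chosen to make the difference visible in a fitted line.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Scale limits: cars with displ outside [4, 7] are removed (with a warning),
# so the line is fitted to the remaining cars only.
zoom_scale <- base + scale_x_continuous(limits = c(4, 7))

# Coordinate limits: every car is kept for the fit; only the view is cropped.
zoom_coord <- base + coord_cartesian(xlim = c(4, 7))

zoom_scale
zoom_coord
```

When the goal is just to zoom in on part of a plot, coord_cartesian() is usually the safer choice because fitted lines and summaries stay based on the full data.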
With some polish, even simple scatterplots effectively highlight relationships between your data!
Interactive Tooltips
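Beyond static visualization, we can enable interactive data inspection with tooltips by passing the plot to ggplotly() from the plotly package:
library(plotly)  # assumed to be installed; provides ggplotly()
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  # text is not a ggplot2 aesthetic (expect an "unknown aesthetics" warning); ggplotly() uses it as the tooltip
  geom_point(alpha = 0.5, aes(text = paste0("Row ", seq_along(cty), "<br>City MPG: ", cty))) +
  theme_classic() +
  labs(
    title = "Fuel Efficiency vs Engine Size",
    subtitle = "Hover over points for city mileage",
    caption = "Source: mpg R dataset"
  )
ggplotly(p, tooltip = "text")
Here, the tooltip shows the row number and city mileage when hovering over each car's dot. The row number enables lookup in the underlying mpg dataset to view all variables for that observation. Note that ggplotly() does not translate every ggplot2 element, so the subtitle and caption may not appear in the interactive version. Even basic interactivity delivers focused data insights!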
Final Thoughts
With an understanding of its grammar layers, the extensive customization flexibility of ggplot2 makes scatterplots invaluable for both analysis and presentation. Visualizing relationships often spawns deeper questions to pursue statistically. I hope these examples provide ideas and motivation to explore your data and communicate discoveries with scatterplot visualizations!