3  Visualizations

3.1 Introduction

Whoever said a picture is worth 1,000 words severely understated how many words a picture is actually worth. When working with data, there is a strong argument to make that nothing is more important than visuals.

Tip: If there is one piece of advice to take from this textbook, it is this:

After running summary statistics, always visualize your data!

You may be thinking “how powerful can a visualization even be?” That is a great question, and one that Anscombe’s quartet will help answer.

Quick history lesson. In 1973 (before the invention of R), a statistician named Francis Anscombe created four different datasets that all had nearly identical summary statistics.

library(tidyverse)
library(quartets)
library(knitr)

anscombe_quartet %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x),
            variance_x = var(x),
            mean_y = mean(y),
            variance_y = var(y),
            correlation = cor(x, y)) %>%
  kable(digits = 2,
        caption = "A breakdown of summary statistics from the four individual datasets Anscombe created.")

Table 3.1: A breakdown of summary statistics from the four individual datasets Anscombe created.
dataset        mean_x  variance_x  mean_y  variance_y  correlation
(1) Linear          9          11     7.5        4.13         0.82
(2) Nonlinear       9          11     7.5        4.13         0.82
(3) Outlier         9          11     7.5        4.12         0.82
(4) Leverage        9          11     7.5        4.12         0.82

As Table 3.1 shows, all four datasets have virtually the same means, variances, and correlations (more on that in Chapter 6). With just these summary statistics, you’d likely think “eh, all the data is the same, these datasets are basically identical.”

Wrong.

When we graph these four datasets, we see something totally different from what the table shows.

ggplot(anscombe_quartet, aes(x = x, y = y)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  facet_wrap(~dataset)
`geom_smooth()` using formula = 'y ~ x'
Figure 3.1: While the four datasets looked identical in the table, the visualized Anscombe datasets show an entirely different picture.

With our visualization, we are introduced to an entirely different way of seeing our data. The table showed that the summary statistics were identical, but here we can see:

  1. Dataset One has a linear relationship between x and y.
  2. Dataset Two has a nonlinear relationship between x and y.
  3. Dataset Three, while still linear, has an outlier.
  4. Dataset Four shows something totally different from the rest!

Visualizations provide insights into data that sometimes numbers can’t show.
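The identical statistics can also be checked without any extra packages, because base R ships the quartet as the wide data frame anscombe (columns x1 to x4 and y1 to y4). The sketch below is an illustration, not the code used for Table 3.1, and the name quartet_stats is made up for this example.

```r
# Summary statistics for each of the four x/y pairs in base R's `anscombe`
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), variance_x = var(x),
    mean_y = mean(y), variance_y = var(y),
    correlation = cor(x, y))
})
colnames(quartet_stats) <- paste("dataset", 1:4)

# One column per dataset, one row per statistic
round(quartet_stats, 2)
```

Reading across each row of the result should show the same pattern as the table: the four columns agree to about two decimal places.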

This chapter is meant less as a lesson and more as a reference page to return to whenever you need to make a graph. You do not need to remember every plotting option shown here.

3.2 Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why visualizing data is a critical step alongside summary statistics
  • Describe the core grammar of graphics used by ggplot2 (data, aesthetics, geometry)
  • Create common plot types using ggplot2, including scatterplots, bar charts, column charts, histograms, density plots, boxplots, and line graphs
  • Map and customize aesthetics such as color, shape, size, fill, and transparency
  • Enhance visual clarity using labels, themes, facets, coordinate transformations, and reference lines
  • Add contextual information to plots using trend lines, error bars, and text labels
  • Interpret visual patterns to identify relationships, distributions, outliers, and trends in data
  • Create visualizations that are interpretable and reproducible when viewed independently of accompanying text

With that being said, let’s get right to it.

3.3 Base R

One of R’s strongest qualities is its ability to create visualizations, most notably through ggplot2. It is also possible to create plots with base R. This book recommends ggplot2, but you may encounter base R plots in older scripts or documentation, so an example is included for familiarity.

values <- c(100, 17, 45, 55, 44)

barplot(values, xlab = "X-axis", ylab = "Y-axis", main ="Base R Bar Chart")
Figure 3.2: An example of a bar chart created using base R.

3.4 ggplot2

Before we go into visualizing our data, we should see what data we will be working with! Just as R comes preinstalled with datasets (such as mtcars), ggplot2 ships with its own datasets (mpg, economics, and diamonds) that can be used for practice.

kable(head(mpg), caption = "A ggplot2 dataset: Fuel economy data from 1999 to 2008 for 38 popular models of cars.")

Table 3.2: A ggplot2 dataset: Fuel economy data from 1999 to 2008 for 38 popular models of cars.
manufacturer  model  displ  year  cyl  trans       drv  cty  hwy  fl  class
audi          a4       1.8  1999    4  auto(l5)    f     18   29  p   compact
audi          a4       1.8  1999    4  manual(m5)  f     21   29  p   compact
audi          a4       2.0  2008    4  manual(m6)  f     20   31  p   compact
audi          a4       2.0  2008    4  auto(av)    f     21   30  p   compact
audi          a4       2.8  1999    6  auto(l5)    f     16   26  p   compact
audi          a4       2.8  1999    6  manual(m5)  f     18   26  p   compact

kable(head(economics), caption = "A ggplot2 dataset: US Economic Time Series.")

Table 3.3: A ggplot2 dataset: US Economic Time Series.
date          pce     pop  psavert  uempmed  unemploy
1967-07-01  506.7  198712     12.6      4.5      2944
1967-08-01  509.8  198911     12.6      4.7      2945
1967-09-01  515.6  199113     11.9      4.6      2958
1967-10-01  512.2  199311     12.9      4.9      3143
1967-11-01  517.4  199498     12.8      4.7      3066
1967-12-01  525.1  199657     11.8      4.8      3018

kable(head(diamonds), caption = "A ggplot2 dataset: Prices of over 50,000 round cut diamonds.")

Table 3.4: A ggplot2 dataset: Prices of over 50,000 round cut diamonds.
carat  cut        color  clarity  depth  table  price     x     y     z
0.23   Ideal      E      SI2       61.5     55    326  3.95  3.98  2.43
0.21   Premium    E      SI1       59.8     61    326  3.89  3.84  2.31
0.23   Good       E      VS1       56.9     65    327  4.05  4.07  2.31
0.29   Premium    I      VS2       62.4     58    334  4.20  4.23  2.63
0.31   Good       J      SI2       63.3     58    335  4.34  4.35  2.75
0.24   Very Good  J      VVS2      62.8     57    336  3.94  3.96  2.48

kable(head(mtcars), caption = "A base R dataset: Motor Trend Car Road Tests.")

Table 3.5: A base R dataset: Motor Trend Car Road Tests.
                    mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4          21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag      21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
Datsun 710         22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive     21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2
Valiant            18.1    6   225  105  2.76  3.460  20.22   1   0     3     1

Additionally, ggplot2 is part of the tidyverse. So, you can load either ggplot2 or tidyverse if you want to visualize using ggplot2. Because ggplot2 is part of the tidyverse, everything you learned about pipelines and data manipulation in the chapter introducing the tidyverse carries directly into visualizations.

3.4.1 Basics

When using ggplot2, there are nearly unlimited possibilities for what you can customize. This may be daunting, but always remember that all plots are built on the same framework.

Tip: ggplot2 framework

plot = data + aesthetics + geometry

No matter what ggplot you are making, and no matter how many characteristics you customize, ggplot2 needs only three things:

  1. The data: the data being used to make the plot
  2. The aesthetics: which variables map to x, y, color, shape, etc. In ggplot2, aesthetics is shortened to aes().
  3. The geometry: plot type (e.g., scatterplot, boxplot, etc.)

With only those three things, you can make any type of visualization you want or need. From there, you can build as far as your imagination allows. When you do want to add more layers to your plot, you do so by using the + sign.

Here is an example of a basic graph made with ggplot:

ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()
Figure 3.3: An example of a visualization made using ggplot2.

This is the first ggplot (a scatterplot) we have created, so let’s break it down:

  • The data: we are using the mpg dataset.
  • The aesthetics: the x variable is cty and the y is hwy.
  • The geometry: geom_point(), which creates points on the graph.
    • Importantly, the + sign at the end of the first line is what attaches the geometry to the plot.

With only two lines of code, a scatterplot was created! Since the framework has been established, it is time to build some visualizations!
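Because every piece is attached with +, a plot can also be stored in an object and built up one layer at a time. The sketch below illustrates that idea; the object names base_plot, with_points, and with_labels are made up for this example.

```r
library(ggplot2)

# Data + aesthetics only: printing this would draw empty axes, no geometry yet
base_plot <- ggplot(data = mpg, aes(x = cty, y = hwy))

# Add one piece at a time with +
with_points <- base_plot + geom_point()
with_labels <- with_points + labs(x = "City MPG", y = "Highway MPG")

# Typing the object's name prints the finished plot
with_labels
```

Storing intermediate plots this way makes it easy to try several geometries or label sets on top of the same data and aesthetics.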

Note: How to move on from here

This chapter gives a basic example and a more advanced example of each visualization type, to display the scope of what the ggplot2 package can do. You are encouraged to experiment with the code: change numbers, remove layers, and compare the resulting visuals.

3.4.2 Scatterplot - geom_point()

When your data has two continuous (numeric) variables and you want to visualize their relationship, a scatterplot is a fantastic choice. This is the basis for the relationship analysis that topics such as linear regression (more about this in Section 8.1) rely on.

The geometry that needs to be specified is geom_point().

ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point()
Figure 3.4: A basic example of a scatterplot using ggplot2.

It is often helpful to add a “line of best fit” when creating scatterplots, as in the Anscombe visualization. To do that, you can utilize the geom_smooth() command.

# Line of Best Fit
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm") # lm stands for "linear model"
`geom_smooth()` using formula = 'y ~ x'
Figure 3.5: An example of a scatterplot with a line of best fit.

Notice that adding geom_smooth(), which is another layer of our visualization, required another + sign. In ggplot2 the + is quite literal: each one adds a layer or setting to the plot.

Let’s add some more information to our scatterplot.

ggplot(data = mpg, aes(x = cty, y = hwy, color = class, shape = drv)) +
  geom_point(alpha = 0.8, size = 2.5) + # opacity and size
  labs(
    title = "City vs Highway MPG",      # adds a title to the plot
    x = "City MPG",                     # adds a x-axis label to the plot
    y = "Highway MPG",                  # adds a y-axis label to the plot
    color = "Vehicle Class",            # adds a color label to the plot
    shape = "Drivetrain"                # adds a shape label to the plot
  ) + 
  facet_wrap(~ class) +                 # breaks the graph into individual graphs
  geom_smooth(method = "lm", se = TRUE) # adds a line of best fit
`geom_smooth()` using formula = 'y ~ x'
Figure 3.6: A scatterplot incorporating multiple aesthetics and faceting.
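The example above mixes two ways of styling points: color and shape are mapped to variables inside aes(), while alpha and size are set to fixed values outside it. The following sketch isolates that difference; the object names mapped and fixed are made up for illustration.

```r
library(ggplot2)

# Mapped: color depends on a variable, so it goes inside aes() and gets a legend
mapped <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(color = class))

# Set: every point gets the same fixed value, so it goes outside aes()
fixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(color = "steelblue", alpha = 0.8, size = 2.5)

mapped
fixed
```

A common slip is writing aes(color = "steelblue"), which treats the text "steelblue" as a one-level variable rather than as a color.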

3.4.3 Bar Chart (counts) and Column Chart (values)

Bar charts visualize counts of a discrete variable, while column charts visualize pre-summarized numeric values.

3.4.3.1 Bar Chart - geom_bar()

For a bar chart, the geometry used is geom_bar().

# BASIC (counts by class) - geom_bar() counts rows automatically
ggplot(mpg, aes(x = class)) +
  geom_bar()
Figure 3.7: An example of a basic bar chart.

When the category labels are long, they can overlap along the x-axis, so it may be best to flip the axes. The mapping does not change: the x variable is still the x variable and the y variable is still the y variable; they are simply drawn on the opposite axes. To do this, you can utilize the coord_flip() command.

ggplot(mpg, aes(x = manufacturer)) +
  geom_bar(fill = "steelblue", color = "white") +
  coord_flip() +
  labs(title = "Counts by Manufacturer", x = "", y = "Count")
Figure 3.8: An example of a coordinate flipped bar chart.
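As an alternative sketch, recent versions of ggplot2 (3.3.0 and later, to the best of my knowledge) let you map the category straight to y, which draws horizontal bars without coord_flip(). Note that the axis labels in labs() then refer to the axes as drawn. The name horizontal_bars is made up for this example.

```r
library(ggplot2)

# Mapping the discrete variable to y draws horizontal bars directly
horizontal_bars <- ggplot(mpg, aes(y = manufacturer)) +
  geom_bar(fill = "steelblue", color = "white") +
  labs(title = "Counts by Manufacturer", x = "Count", y = "")

horizontal_bars
```

Either approach gives a similar picture; coord_flip() is still common in existing code, so it is worth recognizing both.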

Depending on the audience, a stacked bar chart may be the best way to visualize the data. To do that, map a second variable to fill inside aes(). Stacking is the default position once fill is mapped, so position = "stack" only makes that choice explicit.

# What if we want a stacked bar chart (default with fill)?
ggplot(mpg, aes(x = manufacturer, fill = drv)) +
  geom_bar(position = "stack", color = "white") +
  coord_flip() +
  labs(
    title = "Counts by Manufacturer and Drivetrain",
    x = "",
    y = "Count",
    fill = "Drivetrain"
  ) +
  theme_minimal()
Figure 3.9: An example of a stacked bar chart.

If you need to create a grouped bar chart instead, you can add position = "dodge" to create your visual.

# What if we want a grouped bar chart?
ggplot(mpg, aes(x = manufacturer, fill = drv)) +
  geom_bar(position = "dodge", color = "white") +
  coord_flip() +
  labs(
    title = "Counts by Manufacturer and Drivetrain",
    x = "",
    y = "Count",
    fill = "Drivetrain"
  ) +
  theme_minimal()
Figure 3.10: An example of a grouped bar chart.

3.4.3.2 Column Chart - geom_col()

Column charts work best when you already have one summarized value per category, rather than raw rows that still need to be counted.

# USING PRE-SUMMARIZED VALUES - geom_col() requires explicit values
class_counts <- mpg %>%
  count(class)  # counts rows by class

kable(class_counts, caption = "Pre-summarized values")

Table 3.6: Pre-summarized values
class        n
2seater      5
compact     47
midsize     41
minivan     11
pickup      33
subcompact  35
suv         62

Once the values have been summarized explicitly, you use geom_col() to create a column chart.

ggplot(class_counts, aes(x = class, y = n)) +
  geom_col()
Figure 3.11: An example of a basic column chart.
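To see that the two geometries really do arrive at the same bars, the sketch below compares the heights ggplot2 computes for each. It relies on layer_data(), a ggplot2 helper that returns the data behind a layer; the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)

class_counts <- count(mpg, class)

bar_plot <- ggplot(mpg, aes(x = class)) + geom_bar()                     # counts rows itself
col_plot <- ggplot(class_counts, aes(x = class, y = n)) + geom_col()    # uses n as given

# Bar heights as computed by each layer
bar_heights <- layer_data(bar_plot, 1)$count
col_heights <- layer_data(col_plot, 1)$y

# TRUE when both approaches produce the same bar heights
all(bar_heights == col_heights)
```

The practical rule of thumb: reach for geom_bar() when you have raw rows, and geom_col() when the numbers to plot are already in a column.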

There are a few things you can do to a column chart to add some flavor. For example:

  • Use the reorder command to reorder the columns into ascending or descending order based on their n values.
  • Inside of geom_col change the width of the columns.
  • Inside theme(), set legend.position to move the legend to a particular spot or to remove it entirely ("none").

# PLUS AESTHETICS (polished)
ggplot(class_counts, aes(x = reorder(class, n), y = n, fill = class)) +
  geom_col(width = 0.7, color = "white") +   # width = bar thickness, color = border
  coord_flip() +                             # flip for readability
  labs(title = "Counts by Vehicle Class", x = "", y = "Count") +
  theme(legend.position = "none")
Figure 3.12: An example of a column chart with polished aesthetics.

Note: Legends in a graph and redundancy

Sometimes a legend is redundant because the axis labels already carry the same information. In this case, you can remove the legend, which avoids the redundancy and saves space.
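The reorder() call in the polished column chart is easier to follow when you look at what it returns on its own. This sketch rebuilds class_counts and inspects the resulting factor levels; the names ascending and descending are made up for illustration.

```r
library(ggplot2)
library(dplyr)

class_counts <- count(mpg, class)

# reorder() returns a factor whose levels are sorted by the second argument
ascending  <- levels(reorder(class_counts$class, class_counts$n))   # smallest count first
descending <- levels(reorder(class_counts$class, -class_counts$n))  # largest count first

ascending
descending
```

With the counts in Table 3.6, the ascending order starts at 2seater and ends at suv. After coord_flip(), the first level is drawn at the bottom, so ascending order places the largest bar at the top of the chart.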

3.4.4 Histograms and Density Plots (distribution)

Histograms are perfect for displaying the distribution of a single numeric variable.

3.4.4.1 Histograms - geom_histogram()

For the geometry of a histogram, you use geom_histogram().

# BASIC
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 3)
Figure 3.13: An example of a basic histogram.

ggplot2 uses 30 bins by default when creating a histogram unless otherwise instructed. There are several things that can be influenced:

  • bin width: use binwidth to change the width of the bins
  • boundary: use boundary to set where a bin edge falls (for example, boundary = 0 with binwidth = 5 puts the edges at multiples of 5)
# STYLED (bin edges + colors)
ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 5, boundary = 0,
                 fill = "red", color = "orange") +
  labs(title = "Highway MPG Distribution", x = "Highway MPG", y = "Frequency")
Figure 3.14: An example of a styled histogram.
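To check what binwidth and boundary actually did, you can pull the bin edges out of the plot. The sketch below uses layer_data(), a ggplot2 helper that returns the computed data for a layer; the names hist_plot and bins are made up for this example.

```r
library(ggplot2)

hist_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5, boundary = 0)

# xmin and xmax are the edges of each bin; count is the bar height
bins <- layer_data(hist_plot, 1)[, c("xmin", "xmax", "count")]
bins
```

Each row is one bar. Changing boundary shifts every edge by the same amount, which can noticeably change the shape of a histogram built from few observations.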

If you need a stacked histogram, the code below creates one. The secret is mapping a variable to the fill aesthetic inside aes().

# MAPPED FILL (stacked by class)
ggplot(mpg, aes(hwy, fill = class)) +
  geom_histogram(binwidth = 10, color = "white") +
  labs(title = "Highway MPG Distribution by Vehicle Class",
       x = "Highway MPG", y = "Count", fill = "Vehicle Class") +
  theme(legend.position = "bottom")
Figure 3.15: An example of a filled histogram.

3.4.4.2 Density - geom_density()

Density plots still show the distribution of data, but instead of grouping values into bins like a histogram, they draw a smoothed curve. The geometry for a density plot is geom_density().

# BASIC
ggplot(diamonds, aes(x = price)) +
  geom_density()
Figure 3.16: An example of a basic density plot.

In the example below, the data is filtered inside the data argument to keep only the cuts “Good”, “Ideal”, and “Premium.” Because cut is mapped to color inside the aesthetics, this creates a grouped density plot with one line for each of the three cuts.

Note: Regarding the data portion of the graph below

Without the filter, this code would still draw separate lines, one for each of the five cuts in the data.

# GROUPED
ggplot(diamonds %>% filter(cut %in% c("Good", "Ideal", "Premium")),
       aes(price, color = cut)) +
  geom_density() +
  labs(title = "Price Density by Cut", x = "Price", y = "Density") +
  theme(legend.position = "bottom")
Figure 3.17: An example of a grouped density plot.

3.4.5 Boxplot - geom_boxplot()

Commonly known as a box-and-whisker plot, a boxplot is a fantastic way to compare a numeric variable across the levels of a categorical variable. There are a few different pieces of a boxplot:

  • Whiskers: there are two whiskers on each boxplot
    • Lower Whisker: extends from the bottom of the box down to the smallest value that is within 1.5 × IQR of the box
    • Upper Whisker: extends from the top of the box up to the largest value that is within 1.5 × IQR of the box
  • Box: the box itself shows the middle 50% of the data. This includes:
    • Interquartile Range (IQR): the bottom edge of the box is the 25th percentile and the top edge is the 75th percentile; the IQR is the distance between them
    • Median: the darker line inside of the box
  • Outliers: any data point that falls beyond the end of a whisker, drawn as an individual point.
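The pieces of a boxplot can be computed by hand, which helps make the picture less mysterious. The sketch below follows the rule ggplot2 documents for geom_boxplot() (whiskers reach the most extreme values within 1.5 × IQR of the box) for the suv class; all object names are made up for this example.

```r
library(ggplot2)

hwy_suv <- mpg$hwy[mpg$class == "suv"]

# Box: 25th percentile, median, 75th percentile
q <- quantile(hwy_suv, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Fences: the furthest a whisker is allowed to reach
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers: the most extreme observed values inside the fences
lower_whisker <- min(hwy_suv[hwy_suv >= lower_fence])
upper_whisker <- max(hwy_suv[hwy_suv <= upper_fence])

# Outliers: anything outside the fences
outliers <- hwy_suv[hwy_suv < lower_fence | hwy_suv > upper_fence]

c(lower_whisker = lower_whisker, q1 = q[[1]], median = q[[2]],
  q3 = q[[3]], upper_whisker = upper_whisker)
outliers
```

Comparing these numbers against the suv box in the plot is a quick way to confirm how to read each part.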

The geometry for a boxplot is geom_boxplot().

# BASIC
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(title = "Highway MPG by Vehicle Class", x = "Vehicle Class", y = "Highway MPG")
Figure 3.18: An example of a basic boxplot.

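Sometimes the points on a boxplot (or any plot) are indistinguishable because they sit so close together or directly on top of one another. In that case, utilize the geom_jitter() command, which adds a small amount of random noise to each point’s position. In the example below, outlier.shape = NA hides the boxplot’s own outlier points so they are not drawn twice.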

# WITH JITTERED POINTS OVERLAID
ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
  coord_flip() +
  labs(title = "Highway MPG by Vehicle Class (with Points)",
       x = "", y = "Highway MPG")
Figure 3.19: An example of a boxplot with jittered points.

3.4.6 Lines (time series) - geom_line()

Time and time again, when working with time-series data, a line graph is created. The geometry for a line graph is geom_line().

# BASIC: unemployment over time
ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  labs(title = "US Unemployment Over Time",
       x = "Date", y = "Number Unemployed (thousands)")
Figure 3.20: An example of a basic line graph.

As with the scatterplot, you can add a line of best fit.

# PLUS: linear trend line (method = "lm"), to contrast with LOESS below
ggplot(economics, aes(date, unemploy)) +
  geom_line(linewidth = 1.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "US Unemployment with Linear Trend",
       x = "Date", y = "Number Unemployed (thousands)")
`geom_smooth()` using formula = 'y ~ x'
Figure 3.21: An example of a line graph with a linear (lm) line of best fit.

Instead of using “lm” for our method, let’s try “loess” and see our results.

ggplot(economics, aes(date, unemploy)) +
  geom_line(linewidth = 0.6) +
  geom_smooth(method = "loess", se = TRUE) +  # loess = flexible smoothing; se = confidence band
  labs(title = "US Unemployment with LOESS Smooth",
       x = "Date", y = "Number Unemployed (thousands)")
`geom_smooth()` using formula = 'y ~ x'
Figure 3.22: An example of a line graph with a LOESS smooth.
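How closely a LOESS curve follows the data is controlled by the span argument of geom_smooth(). The sketch below contrasts a small and a large span; the specific values 0.2 and 0.9 and the object names are arbitrary choices for illustration. Passing formula = y ~ x explicitly also silences the message printed in the output above.

```r
library(ggplot2)

# Smaller span: the curve follows local ups and downs more closely
wiggly_fit <- ggplot(economics, aes(date, unemploy)) +
  geom_line(linewidth = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.2, se = FALSE)

# Larger span: the curve is smoother and shows only the broad trend
smooth_fit <- ggplot(economics, aes(date, unemploy)) +
  geom_line(linewidth = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.9, se = FALSE)

wiggly_fit
smooth_fit
```

Neither setting is more correct; the choice depends on whether the audience needs the short-term swings or the long-run pattern.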

3.4.7 Put text on the plot - geom_text()

No matter the type of plot, it may help the viewer understand your plot better if you add labels to some of your data points, for example, the most extreme. To do this, utilize the geom_text() command.

# BASIC: label extreme points
mpg_extremes <- mpg %>% slice_max(order_by = hwy, n = 5)
mpg_extremes
# A tibble: 6 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 volkswagen   jetta        1.9  1999     4 manua… f        33    44 d     comp…
2 volkswagen   new beetle   1.9  1999     4 manua… f        35    44 d     subc…
3 volkswagen   new beetle   1.9  1999     4 auto(… f        29    41 d     subc…
4 toyota       corolla      1.8  2008     4 manua… f        28    37 r     comp…
5 honda        civic        1.8  2008     4 auto(… f        25    36 r     subc…
6 honda        civic        1.8  2008     4 auto(… f        24    36 c     subc…
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(data = mpg_extremes, aes(label = model), nudge_y = 1, size = 3) +
  labs(title = "Top Highway MPG Models Labeled",
       x = "Engine Displacement (L)", y = "Highway MPG")
Figure 3.23: An example of adding text inside a plot.

The code above first asks slice_max() for the five highest hwy values (six rows come back because two cars tie at 36), then passes that subset to geom_text() so that only the extreme data points are labeled. Now, viewers can read from the plot which car models have the best highway miles per gallon.
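slice_max() keeps ties by default, which is why n = 5 returned six rows in the output above. If you need exactly five labels, the sketch below shows one option using the with_ties argument, along with geom_text()'s check_overlap argument to skip labels that would be drawn on top of an earlier one. The name mpg_top5 is made up for this example.

```r
library(ggplot2)
library(dplyr)

# with_ties = FALSE keeps exactly n rows, even when values are tied
mpg_top5 <- mpg %>% slice_max(order_by = hwy, n = 5, with_ties = FALSE)
nrow(mpg_top5)

top5_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(data = mpg_top5, aes(label = model),
            nudge_y = 1, size = 3, check_overlap = TRUE) +  # drop overlapping labels
  labs(title = "Top Highway MPG Models Labeled",
       x = "Engine Displacement (L)", y = "Highway MPG")

top5_plot
```

Keep in mind that with_ties = FALSE breaks ties by row order, so which of the tied cars gets dropped is arbitrary.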

3.4.8 Error bars (requires summary stats) - geom_errorbar()

When you want to display error bars, you first need to compute the mean and standard error of your values.

Table 3.7: Creating a summarized table for mean and standard error, which will be utilized for the error bars.
# Compute mean & standard error for hwy by class
summ_hwy <- mpg %>%
  group_by(class) %>%
  summarize(
    mean_hwy = mean(hwy, na.rm = TRUE),
    se_hwy   = sd(hwy, na.rm = TRUE) / sqrt(n()))

summ_hwy
# A tibble: 7 × 3
  class      mean_hwy se_hwy
  <chr>         <dbl>  <dbl>
1 2seater        24.8  0.583
2 compact        28.3  0.552
3 midsize        27.3  0.334
4 minivan        22.4  0.622
5 pickup         16.9  0.396
6 subcompact     28.1  0.909
7 suv            18.1  0.378

Once completed, you can utilize the geom_errorbar() command to add error bars to your plot.

# Points + error bars (plot error bars first so points sit on top)
ggplot(summ_hwy, aes(class, mean_hwy)) +
  geom_errorbar(aes(ymin = mean_hwy - se_hwy, ymax = mean_hwy + se_hwy), width = 0.2) +
  geom_point(size = 2) +
  coord_flip() +
  labs(title = "Mean Highway MPG (± SE) by Class", x = "", y = "Mean Highway MPG")
Figure 3.24: An example of a plot with error bars.
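As an alternative sketch, ggplot2 can compute the mean and standard error for you from the raw data using stat_summary() together with its mean_se() helper, which skips the separate summary table. The object names below are made up for illustration, and the manual calculation is included only to show the two approaches agree.

```r
library(ggplot2)

# mean_se() returns the mean (y) and mean -/+ one standard error (ymin, ymax)
suv_hwy <- mpg$hwy[mpg$class == "suv"]
helper_result <- mean_se(suv_hwy)
manual_se <- sd(suv_hwy) / sqrt(length(suv_hwy))

se_plot <- ggplot(mpg, aes(class, hwy)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +  # error bars from raw data
  stat_summary(fun = mean, geom = "point", size = 2) +                # mean as a point
  coord_flip() +
  labs(title = "Mean Highway MPG (± SE) by Class", x = "", y = "Mean Highway MPG")

se_plot
```

Computing the summary table yourself, as in the main example, is still useful when you also want to report the numbers in text or a table.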

3.4.9 Reference lines

Let’s say there is a scenario where you are looking for data above and below a particular threshold. In this case, reference lines can become an essential tool for yourself and your viewers. Once the threshold is established (mean, median, really any number of significance to you), you can utilize the geom_hline() or the geom_vline() commands to create horizontal or vertical reference lines, respectively.

# Horizontal line at overall mean
overall_mean <- mean(mpg$hwy, na.rm = TRUE)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = overall_mean, linetype = "dashed") +
  labs(title = "Reference Line at Overall Mean Highway MPG",
       x = "Engine Displacement (L)", y = "Highway MPG")
Figure 3.25: An example of a plot with a horizontal reference line.

# Vertical line at displ = 3
ggplot(mpg, aes(displ, hwy)) +
  geom_point(alpha = 0.6) +
  geom_vline(xintercept = 3, linetype = "dotted") +
  labs(title = "Reference Line at Engine Displacement = 3L",
       x = "Engine Displacement (L)", y = "Highway MPG")
Figure 3.26: An example of a plot with a vertical reference line.

3.5 Key Takeaways

  • Always visualize your data! Summary statistics can hide patterns (and problems) in your data.
  • ggplot2 follows a consistent grammar: data + aesthetics + geometry.
  • geom_bar() counts rows automatically; geom_col() plots pre-summarized values.
  • Use labs(), theme(), and coord_flip() to improve clarity and readability.
  • facet_wrap() helps compare groups by creating small multiples.
  • Trend lines (geom_smooth()), error bars (geom_errorbar()), labels (geom_text()), and reference lines (geom_hline(), geom_vline()) help communicate the story in your data.

3.6 Checklist

When creating a visualization, have you:

3.7 ggplot2 Visualization Reference

Unlike other chapters, visualization relies on combining multiple components rather than calling single functions. This section serves as a reference for common geometries, aesthetics, and commands used throughout the book.

3.7.1 Summary of ggplot Geometries

Below is a list of plot types, their purpose, and the geom command used:

  • Scatterplot - Relationships - geom_point()
  • Bar Chart - Counts - geom_bar()
  • Column Chart - Pre-summarized values - geom_col()
  • Histogram - Distribution - geom_histogram()
  • Density Plot - Distribution - geom_density()
  • Boxplot - Group comparison - geom_boxplot()
  • Line Graph - Time - geom_line()

3.7.2 Summary of other ggplot commands

Below is a list of other commands used to alter plots:

  • Aesthetics:
    • color: Changes the color of points, lines, and outlines.
    • shape: Changes the shape of the points.
    • alpha: Changes the opacity (0 is fully transparent, 1 is fully opaque).
    • size: Changes the size of points and text.
    • fill: Controls the interior color of shapes.
  • labs(): Creates labels, including the title, x-axis, y-axis, and legend titles.
  • facet_wrap(): Creates one small plot per group and arranges them in a single graphic.
  • coord_flip(): Flips the axes without changing the underlying variables.
  • theme(): Controls the overall appearance of the plot.
  • theme_minimal(): Applies a clean built-in theme with minimal background decoration.
  • theme(legend.position = "..."): Dictates where (if at all) the legend appears on the plot.
  • geom_text(): Adds text labels at data points within the plot.
  • geom_hline(): Adds a horizontal reference line.
  • geom_vline(): Adds a vertical reference line.
  • geom_smooth(): Adds trend lines.
  • reorder(): Reorders categorical variables based on the values of another variable.

3.8 💡 Reproducibility Tip:

With visualizations (especially in R) there are nearly limitless possibilities. To support reproducibility, aim to create figures that clearly communicate their purpose even when viewed on their own.

When creating a visualization, ask yourself:

  1. What question is this visualization answering?
  2. What do I want my audience to understand from it?
  3. What would someone understand if they saw this figure without any surrounding text?

To help close the gap between these questions, use informative labels and captions that guide viewers through what they are seeing.

Within ggplot2, functions like labs() allow you to clearly label axes, titles, and legends so the intent of the plot is immediately clear. When working in R Markdown or bookdown (Section C.4.3.2.2), figure captions (set with the fig-cap chunk option in Quarto, or fig.cap in R Markdown and bookdown) provide additional context that travels with the figure wherever it appears.
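As a minimal sketch of a figure meant to stand on its own, the example below uses the title, subtitle, and caption arguments of labs() alongside the axis labels. The wording of the labels and the name standalone_plot are illustrative choices, not a required template.

```r
library(ggplot2)

standalone_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Larger engines tend to get fewer highway miles per gallon",  # the takeaway
    subtitle = "Each point is one car in the mpg dataset",                # what is plotted
    x = "Engine Displacement (L)",
    y = "Highway MPG",
    caption = "Source: mpg dataset from ggplot2 (fuel economy data, 1999 to 2008)"
  )

standalone_plot
```

A title that states the takeaway, rather than just naming the variables, answers the first two questions above before the reader has looked at a single point.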

Visualizations that are well-labeled and properly captioned are easier to interpret, reuse, and reproduce—both by others and by your future self.