9  Logistic Regression

9.1 Introduction

Two chapters ago, we did a great job of creating linear regression models, which was the first time in this book that we created prediction models. With a lot of TLC, we created a y = mx + b (rhymed on purpose) so that for whatever future value of x we had, we could predict what y would be.

However, that was when we had numeric data. What happens when we do not have any numeric data? From our last chapter, we know that is where a chi-square comes in, but what if we still want to do some predicting? This is exactly where logistic regression comes into play.

Note: The question we are trying to answer is:

Can we predict if someone will default on their credit cards based on other factors about them?

Could you have predicted that there would be another chapter about making predictions…?

Using the ISLR2 package, we will be working with the Default dataset.

9.2 Learning Objectives

By the end of this chapter, you will be able to:

  • Explain when logistic regression is appropriate and how it differs from linear regression
  • Interpret probabilities, odds, log-odds, and odds ratios in the context of binary outcomes
  • Perform exploratory data analysis for a binary response variable, including calculating proportions and odds
  • Visualize relationships between a binary outcome and predictor variables
  • Split data into training and testing sets for predictive modeling
  • Fit a logistic regression model using glm() with a binomial family
  • Interpret logistic regression coefficients and convert them to odds ratios using exp()
  • Assess model fit using McFadden’s pseudo-R²
  • Evaluate predictor importance and check for multicollinearity using VIF
  • Generate predicted probabilities and classifications from a fitted model
  • Evaluate classification performance using confusion matrices, sensitivity, specificity, and balanced accuracy
  • Assess model discrimination using ROC curves and the Area Under the Curve (AUC)

9.3 Load and Preview Data

First, we need to load the ISLR2 package, and then we will create our cc_data dataset using the Default dataset, which is inside the package.

library(ISLR2)
library(tidyverse)

cc_data <- Default # this data comes from the ISLR2 package

library(skimr)

skim(cc_data)
Table 9.1: The skim function provides us with insights into our data that are helpful when first looking at the data.
Data summary
Name cc_data
Number of rows 10000
Number of columns 4
_______________________
Column type frequency:
factor 2
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
default 0 1 FALSE 2 No: 9667, Yes: 333
student 0 1 FALSE 2 No: 7056, Yes: 2944

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
balance 0 1 835.37 483.71 0.00 481.73 823.64 1166.31 2654.32 ▆▇▅▁▁
income 0 1 33516.98 13336.64 771.97 21340.46 34552.64 43807.73 73554.23 ▂▇▇▅▁

Our data contains 10,000 rows, each relating to a particular person. Two variables are factors; the other two are numeric. The variables are:

  1. default: A factor with levels No and Yes indicating whether the customer defaulted on their debt
  2. student: A factor with levels No and Yes indicating whether the customer is a student
  3. balance: The average balance that the customer has remaining on their credit card after making their monthly payment
  4. income: Income of customer
Note: Regarding the ISLR2 package:

The definitions of the columns come directly from the descriptions found in the package.

We have our most crucial data point: whether the person did or did not default on their credit card payments. Soon, we will be using the other data points to see if we can predict if someone will or will not default on their payments.

First, let’s do a little digging to find out more about our data.

9.4 Exploratory Data Analysis

Before we predict if a person defaults or not, it is good to find out more about the default variable, answering some questions like:

  • How many people have defaulted?
  • What percentage of people have defaulted?
  • What are the odds that someone will default?
  • What are the preliminary relationships between default and the other variables?
Table 9.2
table(cc_data$default)

  No  Yes 
9667  333 
# Proportion of people who default
p <- mean(cc_data$default == "Yes")
p*100
[1] 3.33
odds <- p / (1 - p)

# Below provides us with the default rate
cat("Default rate:", round(p*100,2), "%\n")
Default rate: 3.33 %
# Below provides us with the odds of someone defaulting
cat("Odds of defaulting:", round(odds,4), "\n")
Odds of defaulting: 0.0344 
# Below expresses the odds as failures per success
cat("Equivalently, about 1 default for every", round(1/odds,0), "non-defaults\n")
Equivalently, about 1 default for every 29 non-defaults
# Cross-tab between student status and default
cc_data %>% count(default, student) %>% arrange(desc(n))
  default student    n
1      No      No 6850
2      No     Yes 2817
3     Yes      No  206
4     Yes     Yes  127
Table 9.3: A summarized table uncovering insights into the relationship between income and default.
# Quick descriptive stats
library(mosaic)

favstats(income ~ default, data = cc_data)
  default       min       Q1   median       Q3      max     mean       sd    n
1      No  771.9677 21405.06 34589.49 43823.76 73554.23 33566.17 13318.25 9667
2     Yes 9663.7882 19027.51 31515.34 43067.33 66466.46 32089.15 13804.22  333
  missing
1       0
2       0
Table 9.4: A summarized table uncovering insights into the relationship between balance and default.
favstats(balance ~ default, data = cc_data)
  default      min        Q1    median       Q3      max      mean       sd
1      No   0.0000  465.7146  802.8571 1128.249 2391.008  803.9438 456.4762
2     Yes 652.3971 1511.6110 1789.0934 1988.870 2654.323 1747.8217 341.2668
     n missing
1 9667       0
2  333       0

Some things that were uncovered in the code above:

  • The probability of someone defaulting is 0.0333.
  • The percentage of people defaulting is 3.33%.
  • Note: Probabilities are always going to be a number between 0 and 1. This is incredibly important later when we are trying to create our line.
  • There is about 1 default for every 29 non-defaults. These are the odds, which by definition compare successes (defaults) to failures (non-defaults).
  • While probabilities are between 0 and 1, odds range from 0 to ∞ (see the quick sketch after this list).
  • An overwhelming majority of students did not default.
  • The mean income of people who do not default is only slightly higher than that of people who do.
  • The mean balance of people who do default is more than double that of people who don't.
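To make the probability-versus-odds distinction concrete, here is a minimal sketch (p_demo is just a made-up probability for illustration):

# Odds can exceed 1, unlike probabilities. For example, a hypothetical
# probability of 0.75 corresponds to odds of 3 (three successes per failure)
p_demo <- 0.75
p_demo / (1 - p_demo) # 3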

Overall, most people don’t default on their credit cards, and while we see differences in means, we do not know if the difference is statistically significant or not.

Now that we have some insights, it is time to visualize!

9.5 Visualize Relationships

We are going to make three different visualizations:

  1. A bar plot of defaults
  2. A histogram of balances split by default, and then
  3. A scatterplot of balance and income

For fun, we will be utilizing some themes from the ggthemes package. When recreating these on your own, of course feel free to play around and change the themes. As I always say, R puts the r in Artist!

library(ggthemes)
library(ggpubr)

# Default distribution
ggplot(cc_data, aes(x = default, fill = default)) +
  geom_bar() +
  labs(title = "Count of Defaulters vs Non-Defaulters", y = "Count", x = "Default?") +
  theme_economist()
Figure 9.1: Most customers do not default on their credit cards, with defaulters representing only about 3% of the dataset.

For a quick review on histograms, see Section 3.4.4.1.

# Balance by default
ggplot(cc_data, aes(x = balance, fill = default)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Higher Balances Increase Likelihood of Default",
       x = "Average Credit Card Balance ($)", y = "Count") +
  theme_economist_white()
Figure 9.2: While most customers do not default on their credit cards, those who do tend to have higher balances.
# Balance vs. Income
ggplot(cc_data, aes(x = balance, y = income, color = default)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  labs(title = "Customers Who Default Tend to Have Higher Balances",
       subtitle = "Each point represents a customer",
       x = "Average Credit Card Balance ($)",
       y = "Annual Income ($)",
       color = "Default") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
Figure 9.3: Regardless of whether a customer defaults, there is a negative relationship between Annual Income and Average Credit Card Balance.

The visualizations look great! In terms of insights, we see (again) that most people do not default on their credit cards, most balances are around the $800 range for non-defaulters and higher for defaulters, and that as balance increases, income decreases.

With a basic understanding of our data, it is now time to do something really fun - split our data!

9.6 Train and Test Split

Right now, we have one complete data set. What is amazing about our data is that it is complete. We have 10,000 data points each with the crucial information of if the person did or did not default. Our goal is still to create a model that takes the information we have about the people (if they’re a student, their balance, and their income) and use it to predict if a person will or will not default.

Ideally, we want this model to be able to make predictions for people in the future. But when we build it, how will we know how accurate it is?

In scenarios like this, we can actually split our data into two sets: one for training (creating our model) and one for testing (see how well our model actually works). This is important because we get to test our model using real data. It is kind of like practicing a basketball move before you use it in a real game.

Our goal: hold out a decent portion of our real data so we can test our model against data that was not used to build it.

# set.seed allows us to get the same split of the data each time we run this code
set.seed(42)

library(caTools) # Package that helps us split our data into Train and Test sets.

# We are going to be doing a 70/30 split. It is good practice to have most of your data in the training set.
sample <- sample.split(cc_data$default, SplitRatio = 0.7)

# This creates our training data, which we will use to build our logistic regression model
training_data <- subset(cc_data, sample == TRUE)

# This creates our test data, which we will measure our model against
test_data <- subset(cc_data, sample == FALSE)

We have successfully split our data into a training and a testing set! Again, most of our data is going to go into the training data set. We want our model to be as successful as possible while also providing enough of a chance to see if it works.

We have our training set, so let’s get predicting!

9.7 Build Logistic Regression Model

When building a logistic regression model, it is going to look very similar to the linear regression model we built starting in Section 8.7. The major difference from a coding perspective between a linear regression and a logistic regression model is that instead of using lm(), you need to use glm().

Before we create our model, let’s take a step back. Logistic regression, just like linear regression, is a tool to help us predict an outcome. Specifically, we are predicting the probability of an outcome.

Note: The formula for logistic regression is as follows:

\[\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{student}\]

where \(p\) is the probability of default. The left-hand side is called the logit, or log-odds. Notice that the right-hand side looks just like a straight-line regression.

In Linear Regression, your line goes on forever in both directions. But in probability, you can’t be less than 0% or more than 100%. If you used a straight line to predict defaults, the math might eventually say someone has a -20% chance of defaulting, which is impossible.

To fix this, we use the Sigmoid Function. This “squashes” the straight line into an S-curve that stays strictly between 0 and 1.
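To see the squashing in action, here is a minimal sketch of the sigmoid function (base R's plogis() computes the same thing):

# The sigmoid function maps any number on the real line (a log-odds value)
# to a probability strictly between 0 and 1
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(-5) # ~0.007: very unlikely
sigmoid(0)  # 0.5: a coin flip
sigmoid(5)  # ~0.993: very likely

plogis(0) # 0.5, the same function built into base R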

ggplot(cc_data, aes(x = balance, y = as.numeric(default) - 1)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "red") +
  labs(y = "Probability of Default", x = "Balance")
`geom_smooth()` using formula = 'y ~ x'
Figure 9.4: When plotting the balance and the default, the scatterplot creates a line that is not linear, but in fact an S shape.

In Figure 9.4, notice how the red line transitions from a low probability to a high probability as the balance increases. Unlike a straight line, this S-curve naturally ‘levels off’ at 0 and 1, ensuring our predictions remain realistic probabilities.

Now, let’s get to business and create our logistic regression model.

Table 9.5: Logistic Regression Model Predicting the Probability of Credit Default based on Student Status, Account Balance, and Income.
# We are using binomial as family since default is yes/no

train_model <- glm(default ~ student + balance + income,
                   family = "binomial", data = training_data) 

summary(train_model)

Call:
glm(formula = default ~ student + balance + income, family = "binomial", 
    data = training_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.090e+01  5.931e-01 -18.372  < 2e-16 ***
studentYes  -8.245e-01  2.835e-01  -2.908  0.00364 ** 
balance      5.796e-03  2.805e-04  20.661  < 2e-16 ***
income       2.768e-06  9.736e-06   0.284  0.77614    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2043.8  on 6999  degrees of freedom
Residual deviance: 1098.2  on 6996  degrees of freedom
AIC: 1106.2

Number of Fisher Scoring iterations: 8

After we create our model and call summary(), we get an output that should look similar to the outputs we’ve seen in earlier chapters. Taking a look at our wall of words, it is incredibly important to note that the numbers here represent log-odds.

Model interpretation:

  1. Student (negative coefficient, significant): students are less likely to default.
  2. Balance (positive, significant): a higher balance means higher odds of default.
  3. Income (tiny effect, not significant).
  4. “Fisher Scoring iterations: 8” means R needed 8 rounds to find the best fit.

One thing to note is that unlike a linear regression model, which is solved in a single step, a logistic regression model is fit iteratively and takes a few tries to find the best fit (in this case, 8).

While the outputs look the same, there is a major difference in the numbers that are output in a logistic regression vs a linear regression. Remember earlier when we calculated the probability of someone defaulting? It was mentioned that probabilities will always be a number between 0 and 1. But, when we are creating a line in a regression model, we can get any numbers between -∞ and ∞. How does logistic regression, which uses probabilities, make a line then?

Log-odds!

We need to work with a quantity that can range over the entire number line, and this is exactly where log-odds come in. With regression, we need a line (and to build a line, we need a formula). Instead of modeling the probability directly, we model the log of the odds. Here is the step by step of how logistic regression works (a short sketch follows the list):

  1. Start with a probability.
  2. Convert it to odds.
  3. Take the log of the odds.
  4. Fit a regression line to those log-odds.
  5. Convert back to a probability with the logistic function.
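Here is a quick sketch of steps 1 through 3 (and 5), reusing the default rate from our exploratory analysis:

# Probability -> odds -> log-odds, and back again
p <- mean(cc_data$default == "Yes") # 1. probability: 0.0333
odds <- p / (1 - p)                 # 2. odds: 0.0344
log_odds <- log(odds)               # 3. log-odds: about -3.37
plogis(log_odds)                    # 5. the logistic function recovers 0.0333

Now let’s convert our model’s coefficients from log-odds to odds ratios: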
# Convert log-odds → odds ratios
exp(coef(train_model))
 (Intercept)   studentYes      balance       income 
1.850225e-05 4.384379e-01 1.005813e+00 1.000003e+00 
# Example: students have about 0.44x the odds (≈56% lower odds) of defaulting

A very important thing to note is that odds and probability are not the same thing. With odds ratios, we are looking to see how much larger or smaller the ratio is than 1.

  • OR = 1 → no change
  • OR > 1 → increase
  • OR < 1 → decrease
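A handy rule of thumb for reading these numbers: subtract 1 and multiply by 100 to get the percent change in odds. A quick sketch using the (rounded) odds ratios from the output above:

# (OR - 1) * 100 turns an odds ratio into a percent change in odds
(0.438 - 1) * 100  # -56.2: students' odds are about 56% lower
(1.0058 - 1) * 100 # 0.58: each extra dollar of balance raises the odds ~0.58%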

Breaking this down using our model:

  1. If you are a student, the odds of you defaulting decrease, as the odds ratio is 0.438. Students’ odds are 56.2% lower than non-students’ (1 − 0.438). This is a significant difference.
  2. For every dollar your balance increases, the odds of you defaulting increase by 0.58%. This seems small per dollar, but compounds rapidly across real-life balance amounts. For example, increasing the balance by $100 multiplies the odds of default by about 1.78, which is a 78% increase in odds (see the quick check after this list).
  3. The odds ratio for income is so small (OR = 1.000003), that a $1 increase in income has basically no influence on your default odds.
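To verify that $100 claim yourself, a one-line check:

# Scale the balance coefficient by 100 before exponentiating to get
# the odds ratio for a $100 increase in balance
exp(coef(train_model)["balance"] * 100) # roughly 1.785, the ~78% increase above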

9.7.1 McFadden’s Pseudo-R²

Unlike in linear regression, logistic regression does not have an R² (for a review on R², see Section 8.7). However, we can use the pscl package to calculate a pseudo-R² value.

# McFadden’s pseudo-R2
library(pscl)

pR2(train_model)["McFadden"]
fitting null model for pseudo-r2
 McFadden 
0.4626515 
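If you’d rather not load another package, McFadden’s value can also be computed by hand from the deviances reported in summary(); a minimal sketch:

# McFadden's pseudo-R2 = 1 - (residual deviance / null deviance),
# using the deviances stored in the fitted glm object
1 - (train_model$deviance / train_model$null.deviance) # roughly 0.4627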

As of right now, this is a moderately strong model!

9.7.2 Variable Importance

With the caret package, we can also see which variables are the most important in our regression model.

# Variable importance
library(caret)

varImp(train_model) %>% arrange(desc(Overall))
              Overall
balance    20.6607703
studentYes  2.9080015
income      0.2843578
# balance seems to be the most impactful in our model

9.7.3 Multicollinearity check

Just like in Section 7.8, we want to check whether any of our predictor variables are highly correlated with one another, which may be impacting our regression model. Using the vif() function from the car package, we can quantify this for the variables in our model.

# Multicollinearity check
library(car)

vif(train_model)
 student  balance   income 
2.671007 1.072618 2.593366 
# does not seem to be any multicollinearity
# anything less than 5 means no multicollinearity

As of right now, it is safe to say that there is no multicollinearity, as all values are less than 5.

9.8 Make Predictions

Before, we took our data and split it into a training set and a test set. Using the training set, we built a logistic regression model. Now that we have it, it is time to test it against our test data!

As previously stated, it is best to test our model against real data, because we then get to compare and contrast what our model predicted vs what actually happened. It is now time to do exactly that!

We are going to create two new columns in our test data:

  1. pred_prob: Our model is going to predict the likelihood of someone defaulting, using that row’s data.
  2. pred_class: based on the probability from pred_prob, it is going to either be “Yes” or “No.”
test_data$pred_prob <- predict(train_model, newdata = test_data, type = "response")

test_data$pred_class <- ifelse(test_data$pred_prob > 0.5, "Yes", "No") %>% as.factor()

In the above code, we used 0.5 as our cutoff for deciding if a person is predicted to default or not. 0.5 is a natural default since it is the halfway point, but it is up to you to decide where the cutoff should be.
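If you want to see how sensitive the results are to this choice, here is a sketch tabulating classifications at a hypothetical lower cutoff of 0.3 (lowering the cutoff flags more people as defaulters, trading false alarms for catching more true defaults):

# Compare classifications at a lower, hypothetical threshold of 0.3
table(actual = test_data$default,
      predicted = ifelse(test_data$pred_prob > 0.3, "Yes", "No"))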

With this code complete, our model has now churned out the results (if someone will or will not default on their credit card).

9.9 Evaluate Model

Using real data (our test data) and a built logistic regression model, we now have predicted if a person will or will not default on their credit card. The next question is: how well did our model predict if someone defaulted on their credit card or not?

To do this, we run the confusionMatrix() command from the caret package, which we have already loaded.

Table 9.6: Confusion Matrix Evaluating Model Predictions Against Actual Default Status on the Test Dataset.
conf_matrix <- confusionMatrix(test_data$pred_class, test_data$default, positive = "Yes")

conf_matrix
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  2886   71
       Yes   14   29
                                          
               Accuracy : 0.9717          
                 95% CI : (0.9651, 0.9773)
    No Information Rate : 0.9667          
    P-Value [Acc > NIR] : 0.06743         
                                          
                  Kappa : 0.3934          
                                          
 Mcnemar's Test P-Value : 1.247e-09       
                                          
            Sensitivity : 0.290000        
            Specificity : 0.995172        
         Pos Pred Value : 0.674419        
         Neg Pred Value : 0.975989        
             Prevalence : 0.033333        
         Detection Rate : 0.009667        
   Detection Prevalence : 0.014333        
      Balanced Accuracy : 0.642586        
                                          
       'Positive' Class : Yes             
                                          

Another wall of data. The first thing we see is a table of Reference vs Prediction. This tells us our true/false negatives/positives. For instance, there were 14 false positives. Most of our predictions are correct: 2886 true negatives (most of our data is negative, since most people do not default).

Next, we get some performance metrics. Here is an initial breakdown of what we see:

  • Accuracy ≈ how often model is correct overall.
    • The model is very accurate, at about 97%.
  • 95% CI: Where the accuracy of our model most likely lies.
  • No Information Rate (NIR): If we just predicted “No” for each row, how accurate would we be?
    • This is not much different from the actual accuracy.
  • P-value [Acc > NIR]: Is our model statistically better than just predicting “No” for everyone?
    • The p-value is not statistically significant, meaning our model’s accuracy is not reliably better than simply predicting “No” for each person.
  • Sensitivity: how well it finds actual defaults.
    • Only 29% of actual defaulters are detected.
  • Specificity: how well it identifies non-defaulters.
    • It predicts non-defaulters at almost 100%.
  • High accuracy but low sensitivity means the model misses many rare defaults.
  • Balanced Accuracy is a better metric for imbalanced data (a by-hand check follows this list).
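To demystify these numbers, here is a short sketch recomputing the key metrics directly from the four cells of the confusion matrix:

# By-hand versions of the metrics reported by confusionMatrix()
TP <- 29; FN <- 71; TN <- 2886; FP <- 14

TP / (TP + FN)                        # sensitivity: 0.29
TN / (TN + FP)                        # specificity: 0.9952
(TP / (TP + FN) + TN / (TN + FP)) / 2 # balanced accuracy: 0.6426
(TN + FP) / (TP + FN + TN + FP)       # No Information Rate: 0.9667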

9.10 ROC Curve + AUC

As always, we want to visualize our data. To do this, we can create a special visualization: an ROC curve.

library(pROC)

roc_obj <- roc(test_data$default, test_data$pred_prob, levels = c("No", "Yes"))
plot(roc_obj, col = "darkblue", lwd = 2,
     main = "ROC Curve – Logistic Regression")

Figure 9.5: Receiver Operating Characteristic (ROC) curve for the logistic regression model predicting credit card default. The curve plots sensitivity against 1 − specificity across all possible classification thresholds. Curves closer to the top-left corner indicate better classification performance. The area under the curve (AUC) summarizes the model’s ability to distinguish between defaulters and non-defaulters.
auc(roc_obj)
Area under the curve: 0.9529

Interpretation:

  • ROC curve shows sensitivity vs. 1 - specificity.
  • AUC = 0.95 → excellent ability to distinguish defaulters from non-defaulters.
  • Random guessing = 0.5; perfect model = 1.0

9.11 Interpretation

Our logistic regression model predicts the probability that a person will default on their credit card based on balance, income, and student status.

Coefficients:

  • The model found a significant positive relationship between balance and default.
    • As balance increases, the odds of default rise sharply.
  • The student variable had a negative coefficient, meaning students are less likely to default than non-students (about 56% lower odds, holding other factors constant).
  • Income showed a very small, non-significant effect once balance was included.

Model Fit:

  • McFadden’s R² ≈ 0.46, suggesting the model explains a moderate amount of variation.
  • The AUC = 0.95, indicating excellent ability to distinguish between defaulters and non-defaulters.
  • Confusion Matrix: high overall accuracy (~97%), but this is inflated due to the low number of defaults (class imbalance).
  • The model is excellent at identifying people who will not default, but less effective at catching rare defaults.

In short:

  • Balance is the strongest predictor.
  • Students are less likely to default.
  • Income doesn’t matter much here.
  • The model performs very well overall.

9.12 Key Takeaways

  • Logistic Regression predicts the probability of a categorical (binary) outcome.
  • Instead of fitting a straight line, it models the log-odds of the outcome.
  • Coefficients tell us how each variable changes the log-odds of the event.
  • Positive coefficients increase the odds; negative ones decrease them.
  • We exponentiate coefficients (exp()) to interpret them as odds ratios.
  • Example: An odds ratio of 1.50 means 50% higher odds of the outcome.
  • AUC (Area Under the Curve) tells how well the model separates the two groups:
    • 0.5 = no skill, 0.7–0.8 = good, 0.8–0.9 = great, >0.9 = excellent.
  • Accuracy can be misleading when classes are imbalanced — check sensitivity and specificity instead.
  • ROC curves and confusion matrices help visualize model performance.
  • Logistic regression is a gateway to machine learning — it’s the foundation for classification models in predictive analytics.

9.13 Checklist

When running a logistic regression, have you:

  • Explored the binary response variable (counts, proportions, and odds)?
  • Visualized the relationships between the outcome and your predictors?
  • Split the data into training and testing sets (with set.seed() for reproducibility)?
  • Fit the model with glm() using family = "binomial"?
  • Exponentiated the coefficients with exp() to interpret them as odds ratios?
  • Checked model fit (McFadden’s pseudo-R²) and multicollinearity (VIF)?
  • Generated predictions and evaluated them with a confusion matrix and an ROC curve/AUC?

9.14 Key Functions & Commands

The following functions and commands are introduced or reinforced in this chapter to support logistic regression modeling, classification, and model evaluation.

  • glm() (stats)
    • Fits generalized linear models, including logistic regression when using the binomial family.
  • summary() (base R)
    • Summarizes logistic regression output, including coefficients, standard errors, and p-values.
  • exp() (base R)
    • Converts log-odds coefficients into odds ratios for interpretation.
  • predict() (stats)
    • Generates predicted probabilities or classifications from a fitted logistic regression model.
  • sample.split() (caTools)
    • Splits data into training and testing sets for model validation.
  • confusionMatrix() (caret)
    • Evaluates classification performance using accuracy, sensitivity, specificity, and related metrics.
  • pR2() (pscl)
    • Computes pseudo R² measures (e.g., McFadden’s R²) for logistic regression models.
  • varImp() (caret)
    • Assesses the relative importance of predictors in a classification model.
  • vif() (car)
    • Detects multicollinearity among predictors using variance inflation factors.
  • roc() (pROC)
    • Constructs a Receiver Operating Characteristic (ROC) curve to evaluate classification performance.
  • auc() (pROC)
    • Calculates the Area Under the ROC Curve (AUC) as a summary measure of model discrimination.

9.15 Example APA-style Write-up

The following example demonstrates one acceptable way to report the results of a logistic regression analysis in APA style.

Logistic Regression

A logistic regression analysis was conducted to examine whether credit card balance, income, and student status predicted the likelihood of default. The overall logistic regression model was statistically significant, \(\chi^2(3) = 945.60, p < .001\), and demonstrated good model fit, as indicated by McFadden’s pseudo-\(R^2 = .46\). Higher credit card balances were associated with increased odds of default, whereas students had significantly lower odds of default than non-students. Income was not a significant predictor after accounting for balance and student status.
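If you want to pull these numbers from the fitted model yourself, here is a minimal sketch using the deviances stored in the glm object:

# The model chi-square is the drop in deviance from the null model to
# our model, with degrees of freedom equal to the number of predictors
chi_sq <- train_model$null.deviance - train_model$deviance # about 945.6
df_model <- train_model$df.null - train_model$df.residual  # 3
pchisq(chi_sq, df_model, lower.tail = FALSE)               # p < .001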

9.16 💡 Reproducibility Tip

When evaluating predictive models, results can change depending on how the data are split into training and testing sets. To make your results reproducible, always control randomness and clearly document your modeling choices.

Using set.seed() (like in the reproducibility tip found in Section 5.15) ensures that the same observations are assigned to the training and test sets each time the code is run. In addition, choices such as the train/test split ratio and classification threshold (e.g., 0.5) should be explicitly stated, as they directly affect model performance.
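Here is a tiny sketch you can run to convince yourself:

# The same seed reproduces the same random draw every time
set.seed(42); sample(10, 3) # some triple of numbers
set.seed(42); sample(10, 3) # the identical triple again

sample(10, 3) # without resetting the seed, a (likely) different triple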

Predictive results are only reproducible when both the code and the evaluation decisions are transparent.