Understanding Softmax Regression: Principles, Formulas, and Practical Insights
In classification tasks, the question is often not “how much” but “which one.” For example:
- Is an email spam or not?
- Is an image showing a cat, dog, horse, or chicken?
- Which movie will a user choose to watch?
The answer is typically just one category from a set of possible options. Softmax regression is one of the most popular models for multi-class classification.
1. Core Idea of Softmax Regression for Multi-Class Classification
The main difference between Softmax regression and linear regression is:
- Linear regression predicts continuous values.
- Softmax regression predicts a probability distribution over classes and chooses the one with the highest probability as the final prediction.
2. Model Structure and Computation Flow in Softmax Regression
(1) Input Features
Let’s start with a feature vector:
x ∈ ℝᵈ
(2) Linear Transformation (Affine Transformation)
o = W x + b
- o: logits (unnormalized prediction scores, which can be positive or negative)
- W: weight matrix, b: bias vector
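As a concrete illustration, here is a minimal NumPy sketch of this step (the sizes d = 4 and k = 3 and the random values are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3                       # d input features, k classes (illustrative sizes)

x = rng.normal(size=d)            # one feature vector x ∈ R^d
W = rng.normal(size=(k, d))       # weight matrix W: one row of weights per class
b = np.zeros(k)                   # bias vector b

o = W @ x + b                     # logits o = Wx + b, shape (k,)
print(o)                          # unnormalized scores; entries may be positive or negative
```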
(3) Softmax Transformation
We convert logits into a probability distribution:
ŷⱼ = exp(oⱼ) / ∑ₖ exp(oₖ)
- Exponentiation guarantees that all values are non-negative.
- Normalization ensures the sum of probabilities equals 1.
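A small NumPy sketch of this step (the logit values are made up; subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(o):
    # Shift by the max logit so exp() cannot overflow; the probabilities are unchanged.
    z = np.exp(o - np.max(o))
    return z / z.sum()

o = np.array([2.0, -1.0, 0.5])    # example logits
y_hat = softmax(o)

print(y_hat)                      # every entry is non-negative
print(y_hat.sum())                # sums to 1: a valid probability distribution
```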
(4) Predicting the Class
Although Softmax produces a probability for each class, the predicted class is simply:
ŷ = arg maxⱼ oⱼ
Because the exponential function is monotonically increasing and the normalization does not change the ordering of the logits, the class with the largest logit is also the one with the highest probability.
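A quick check of this fact, reusing the example logits from the softmax sketch above:

```python
import numpy as np

o = np.array([2.0, -1.0, 0.5])                  # example logits
z = np.exp(o - o.max()); y_hat = z / z.sum()    # softmax probabilities

print(np.argmax(o) == np.argmax(y_hat))         # True: largest logit = highest probability
```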
3. Mini-batch Computation
In deep learning, we typically process multiple samples at a time:
O = XW + b
- X ∈ ℝⁿˣᵈ: feature matrix for a batch, with n samples and d features.
- Softmax is applied row-wise, normalizing the logits of each sample independently.
This approach improves computational efficiency and makes full use of GPU parallelism.
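A batched sketch of the same computation, following the O = XW + b convention (here W is assumed to have shape (d, k) so that the shapes line up; n, d, k and the data are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3                        # batch size, features, classes (illustrative)

X = rng.normal(size=(n, d))              # feature matrix, one row per sample
W = rng.normal(size=(d, k))              # weight matrix, shape (d, k) for O = XW + b
b = np.zeros(k)

O = X @ W + b                            # logits for the whole batch, shape (n, k)

def softmax_rows(O):
    # Row-wise softmax: each sample's logits are normalized independently.
    Z = np.exp(O - O.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

Y_hat = softmax_rows(O)
print(Y_hat.sum(axis=1))                 # every row sums to 1
```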
4. Cross-Entropy Loss in Softmax Regression (with Examples)
Cross-entropy measures the difference between the predicted probability distribution and the true distribution.
l(y, ŷ) = - ∑ⱼ yⱼ log ŷⱼ
- y is a one-hot label vector, meaning it contains a 1 for the correct class and 0s for all other classes.
- If the true class is c, the formula simplifies to:
l = −log(ŷᶜ)
- The goal is to maximize the predicted probability of the true class.
Example:
- ŷᶜ = 0.9 ⇒ l ≈ 0.105 (good prediction)
- ŷᶜ = 0.2 ⇒ l ≈ 1.609 (poor prediction)
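Both numbers are easy to reproduce with a short sketch (the probability vectors are invented for the example):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # y is one-hot, so only the true class's log-probability contributes to the sum.
    return -np.sum(y * np.log(y_hat))

y = np.array([1.0, 0.0, 0.0])                            # true class is class 0

print(cross_entropy(np.array([0.9, 0.05, 0.05]), y))     # ≈ 0.105 (good prediction)
print(cross_entropy(np.array([0.2, 0.5, 0.3]), y))       # ≈ 1.609 (poor prediction)
```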
5. Gradients of Softmax Regression with Cross-Entropy
When Softmax is combined with cross-entropy, the gradient becomes surprisingly simple.
∂l/∂oⱼ = ŷⱼ − yⱼ
- This means: the gradient for each logit is just the predicted probability minus the true label probability.
- If the predicted probability exceeds the true label probability → positive gradient → the corresponding logit will be decreased.
- If the predicted probability is lower than the true label probability → negative gradient → the corresponding logit will be increased.
This simple gradient formulation makes Softmax regression efficient and stable for multi-class classification.
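The identity is also easy to verify numerically by comparing the analytic gradient ŷ − y with a finite-difference approximation (a sketch with made-up logits and labels):

```python
import numpy as np

def softmax(o):
    z = np.exp(o - o.max())
    return z / z.sum()

def loss(o, y):
    # Cross-entropy of the softmax probabilities against a one-hot label.
    return -np.sum(y * np.log(softmax(o)))

o = np.array([2.0, -1.0, 0.5])        # example logits
y = np.array([0.0, 1.0, 0.0])         # one-hot label: true class is class 1

analytic = softmax(o) - y             # the claimed gradient: y_hat - y

eps = 1e-6                            # central finite differences for comparison
numeric = np.array([
    (loss(o + eps * e, y) - loss(o - eps * e, y)) / (2 * eps)
    for e in np.eye(len(o))
])

print(np.allclose(analytic, numeric, atol=1e-5))    # True
```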
6. Intuitive Understanding of Softmax Regression
- Logits: The raw, unnormalized outputs of the model.
- Softmax: Converts logits into a probability distribution.
- Cross-entropy: Measures the difference between the predicted distribution and the true labels.
- Gradient: The difference between prediction and truth, used for parameter updates.
7. Conclusion
Softmax regression is a go-to model for multi-class classification.
It extends the idea of linear regression into probability space, uses Softmax to obtain a probability distribution, applies cross-entropy loss to measure prediction error, and updates parameters with a simple gradient formula.
Whether in image classification, text classification, or recommendation systems, Softmax regression is a fundamental and highly effective choice.
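To tie the pieces together, here is a minimal end-to-end sketch that trains a softmax regression classifier with plain gradient descent on synthetic data (the sizes, learning rate, and data are all invented for illustration; a real application would typically use a deep learning library instead):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 2, 3                       # samples, features, classes (illustrative)

# Synthetic data: three Gaussian blobs, one per class.
centers = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
labels = rng.integers(0, k, size=n)
X = centers[labels] + rng.normal(scale=0.7, size=(n, d))
Y = np.eye(k)[labels]                     # one-hot labels, shape (n, k)

W = np.zeros((d, k))                      # parameters for O = XW + b
b = np.zeros(k)

def softmax_rows(O):
    Z = np.exp(O - O.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

lr = 0.1
for step in range(200):
    Y_hat = softmax_rows(X @ W + b)       # forward pass: predicted probabilities
    grad_O = (Y_hat - Y) / n              # gradient of mean cross-entropy w.r.t. logits
    W -= lr * (X.T @ grad_O)              # chain rule through O = XW + b
    b -= lr * grad_O.sum(axis=0)

accuracy = (softmax_rows(X @ W + b).argmax(axis=1) == labels).mean()
print(f"training accuracy: {accuracy:.2f}")
```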