A Complete Guide to Linear Regression (Theory + PyTorch Practice)
Introduction: What is Linear Regression?
Linear regression is one of the most fundamental supervised learning algorithms in machine learning.
It is mainly used to solve regression problems, i.e., predicting continuous values such as:
- Housing prices
- Stock prices
- Sales forecasts
In this tutorial, we will cover linear regression theory step by step, followed by a PyTorch implementation with training code and examples.
Core Idea of Linear Regression
We aim to fit a “linear model” that captures the relationship between input features and the output value. Mathematically:
ŷ = wᵀx + b
- ŷ: predicted output
- x: input feature vector
- w: weight vector (indicating feature importance)
- b: bias term (controls the overall offset)
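As a quick sketch, here is this equation in PyTorch for a single sample (the feature and weight values below are made up for illustration):

```python
import torch

# Hypothetical example: one sample with 3 input features
x = torch.tensor([2.5, 1.0, 0.3])   # feature vector x
w = torch.tensor([0.8, -0.4, 1.2])  # weight vector w
b = torch.tensor(0.5)               # bias term b

# ŷ = wᵀx + b
y_hat = torch.dot(w, x) + b
print(y_hat)  # tensor(2.4600)
```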
Key Concepts in Linear Regression
Here are some basic terms you’ll encounter when studying linear regression for beginners:
| Concept | Meaning |
|---|---|
| Sample size n | Number of data rows (e.g., 100 houses with area and price info) |
| Feature dimension d | Number of input variables (e.g., area and age) |
| Training set | Data used to train the model |
| Feature x | Input values per sample (e.g., area, age) |
| Label y | Ground truth output (e.g., price) |
Mathematical Formulation
For 2D features:
price = w₁ · area + w₂ · age + b
For higher dimensions, the linear regression equation can be written as:
ŷ = w₁x₁ + w₂x₂ + ⋯ + w_d x_d + b = wᵀx + b
If we have n samples, we can use matrix notation:
- Feature matrix: X ∈ ℝⁿˣᵈ
- Label vector: y ∈ ℝⁿ
Then:
ŷ = Xw + b
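A minimal sketch of this matrix form in PyTorch, assuming synthetic data with n = 4 samples and d = 2 features; note that the scalar b is broadcast across all n predictions:

```python
import torch

n, d = 4, 2
X = torch.randn(n, d)   # feature matrix X ∈ ℝⁿˣᵈ (synthetic)
w = torch.randn(d)      # weight vector w ∈ ℝᵈ
b = torch.tensor(0.1)   # bias b, broadcast over all samples

# ŷ = Xw + b: one prediction per sample
y_hat = X @ w + b
print(y_hat.shape)  # torch.Size([4])
```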
Loss Function: Mean Squared Error (MSE)
We use the Mean Squared Error (MSE) as the loss function:
L(w, b) = (1 / 2n) · ∑(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
- The smaller the value, the more accurate the model.
- Squaring amplifies the penalty for large errors.
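As a sketch, here is this loss on made-up tensors. PyTorch's built-in MSE uses the 1/n convention; the 1/2n version only rescales the loss and does not change where the minimum is:

```python
import torch

y_hat = torch.tensor([2.5, 0.0, 2.1])   # predictions (made up)
y     = torch.tensor([3.0, -0.5, 2.0])  # ground-truth labels (made up)

# L(w, b) = (1 / 2n) · ∑(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
n = y.numel()
loss = ((y_hat - y) ** 2).sum() / (2 * n)

# Built-in version: averages without the extra factor of 2
mse = torch.nn.functional.mse_loss(y_hat, y)
print(loss.item(), mse.item())  # loss == mse / 2
```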
Training a Linear Regression Model
Method 1: Analytical Solution (Closed-form)
If the dataset is small and the model is purely linear, you can solve for the optimal parameters directly via the normal equation (here X is augmented with a constant column of ones so that the bias b is absorbed into w):
w* = (XᵀX)⁻¹ Xᵀy
This method is fast and accurate. However, it becomes impractical if:
- The dimensionality is too high
- The matrix XᵀX is non-invertible (e.g., due to collinear features)
- Non-linear structure is introduced
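When it does apply, here is a sketch of the closed-form solution on synthetic data. A constant column of ones is appended to X so the solver recovers b along with w, and torch.linalg.lstsq is used instead of an explicit inverse because it is more numerically stable (it gives the same solution when XᵀX is invertible):

```python
import torch

torch.manual_seed(0)
n, d = 100, 2
X = torch.randn(n, d)                      # synthetic features
true_w = torch.tensor([2.0, -3.4])         # ground-truth parameters (made up)
true_b = 4.2
y = X @ true_w + true_b + 0.01 * torch.randn(n)  # labels with small noise

# Absorb the bias into w by adding a constant-1 feature column
X_aug = torch.cat([X, torch.ones(n, 1)], dim=1)

# w* = (XᵀX)⁻¹ Xᵀ y, computed via a least-squares solver
w_star = torch.linalg.lstsq(X_aug, y.unsqueeze(1)).solution.squeeze()
print(w_star)  # ≈ [2.0, -3.4, 4.2]; the last entry is the bias
```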
Method 2: Gradient Descent Optimization
We iteratively update the model parameters by minimizing the loss function. Each step updates the weights and bias in the negative gradient direction:
w ← w − η · ∇w L(w, b)
b ← b − η · ∇b L(w, b)
- η is the learning rate (step size)
- ∇w and ∇b are the gradients of the loss with respect to w and b
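A minimal sketch of these update rules using autograd, with full-batch updates on synthetic data (all names and hyperparameters here are illustrative):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 2)                        # synthetic features
y = X @ torch.tensor([2.0, -3.4]) + 4.2        # synthetic labels

w = torch.zeros(2, requires_grad=True)         # parameters to learn
b = torch.zeros(1, requires_grad=True)
lr = 0.1                                       # learning rate η

for epoch in range(100):
    loss = ((X @ w + b - y) ** 2).mean() / 2   # (1/2n) · ∑(ŷ − y)²
    loss.backward()                            # computes ∇w L and ∇b L
    with torch.no_grad():
        w -= lr * w.grad                       # w ← w − η · ∇w L
        b -= lr * b.grad                       # b ← b − η · ∇b L
        w.grad.zero_()                         # reset gradients for next step
        b.grad.zero_()

print(w.detach(), b.detach())  # approaches [2.0, -3.4] and 4.2
```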
Common Practice: Mini-batch Stochastic Gradient Descent (SGD)
Instead of using all data at once, we randomly select a small batch of samples to update the parameters each time:
- More efficient
- Faster convergence
- Widely used in deep learning
This is the standard approach used in PyTorch linear regression tutorials, because it scales well with large datasets and deep learning.
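Putting the pieces together, here is a compact sketch of that workflow with PyTorch's built-in tools (synthetic data; batch size, learning rate, and epoch count are arbitrary choices for the example):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(1000, 2)                                  # synthetic features
y = (X @ torch.tensor([2.0, -3.4]) + 4.2).unsqueeze(1)    # synthetic labels

# Shuffle the data and serve random mini-batches of 32 samples
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(2, 1)                       # ŷ = wᵀx + b
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)

for epoch in range(10):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)         # loss on this mini-batch only
        optimizer.zero_grad()
        loss.backward()                       # gradients from the mini-batch
        optimizer.step()                      # one SGD parameter update

print(model.weight.data, model.bias.data)     # ≈ [2.0, -3.4] and 4.2
```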
Why Use Vectorization in Linear Regression?
Vectorization makes training faster by replacing Python loops with optimized matrix operations. Since training involves many large matrix computations, this improves efficiency by:
- Calling optimized low-level C libraries (e.g., BLAS)
- Often speeding up computation by 100x or more
- Reducing the chance of bugs
- Avoiding slow for loops
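As a rough sketch of the difference, the snippet below times a Python loop against the equivalent single matrix-vector product (the exact speedup depends on your hardware and tensor sizes):

```python
import time
import torch

X = torch.randn(10_000, 100)   # synthetic data for the comparison
w = torch.randn(100)

# Loop version: one dot product per sample in Python
t0 = time.perf_counter()
y_loop = torch.stack([torch.dot(X[i], w) for i in range(X.shape[0])])
t_loop = time.perf_counter() - t0

# Vectorized version: a single matrix-vector product (BLAS under the hood)
t0 = time.perf_counter()
y_vec = X @ w
t_vec = time.perf_counter() - t0

print(torch.allclose(y_loop, y_vec, atol=1e-4))  # same result
print(f"speedup: {t_loop / t_vec:.0f}x")
```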
Why Squared Loss? Connection to Normal Distribution
We can model prediction error as Gaussian noise:
y = wᵀx + b + ε, where ε ~ 𝒩(0, σ²)
Using Maximum Likelihood Estimation (MLE) to fit the model is mathematically equivalent to minimizing the mean squared error (MSE):
L(w, b) = (1 / 2n) · ∑(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²
This is the mathematical motivation behind using MSE.
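A sketch of the derivation: writing the likelihood of one observation under the Gaussian noise model and taking the negative log over all n samples gives

```latex
p\left(y^{(i)} \mid x^{(i)}\right)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{\left(y^{(i)} - w^\top x^{(i)} - b\right)^2}{2\sigma^2}\right)

-\log \prod_{i=1}^{n} p\left(y^{(i)} \mid x^{(i)}\right)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \text{const}
```

Since σ is fixed, maximizing the likelihood means minimizing the sum of squared errors, i.e., the MSE up to a constant factor.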
From Linear Regression to Neural Networks
Linear regression can be seen as the simplest neural network model (a single-layer perceptron), which helps beginners connect regression to deep learning basics:
- Contains only one fully-connected layer (no activation function)
- Also known as a single-layer perceptron or dense layer
Just like a biological neuron:
- Input: x₁, x₂, ..., x_d
- Weights: w
- Bias: b
- Output: o = wᵀx + b
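In PyTorch terms, this single "neuron" is exactly one nn.Linear layer with no activation; the dimensions below are arbitrary for the sketch:

```python
import torch
from torch import nn

layer = nn.Linear(in_features=3, out_features=1)  # one fully-connected layer, no activation

x = torch.randn(3)                 # inputs x₁, x₂, x₃
o = layer(x)                       # o = wᵀx + b

# The same value, computed by hand from the layer's parameters
manual = layer.weight @ x + layer.bias
print(torch.allclose(o, manual))   # True
```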
Analogy with Biological Neurons
A biological neuron structure includes:
- Dendrites: receive inputs
- Nucleus: performs the weighted sum
- Axon: passes the output to the next neuron
This is where the term neural network originates. Modern deep learning is more influenced by math and engineering than biology.
Summary Table: Linear Regression at a Glance
| Item | Description |
|---|---|
| Goal | Learn a linear function to predict values |
| Model Equation | ŷ = wᵀx + b |
| Loss Function | Mean Squared Error (MSE) |
| Training Methods | Closed-form solution or Gradient Descent |
| Vectorization | Improves performance and readability |
| Relation to NN | Simplest single-layer neural network |
FAQ
Q: What is linear regression used for in real life?
A: Linear regression is widely used for housing price prediction, stock forecasting, marketing analytics, and risk modeling.
Q: Why is Mean Squared Error (MSE) commonly used?
A: Because minimizing MSE is equivalent to Maximum Likelihood Estimation under a Gaussian noise assumption, which makes it mathematically well-founded.
Q: Can I implement linear regression in PyTorch?
A: Yes. PyTorch provides autograd and optimization tools, making it simple to implement linear regression with gradient descent.