Neural Networks From Scratch

Understanding Learning Through Forward Propagation, Backpropagation, and Gradient Descent

I decode data, craft AI solutions, and write about everything from algorithms to analytics. Here to share what I learn and learn from what I share. 🚀 Data Scientist | AI Enthusiast | Building intelligent systems & simplifying complexity through code and curiosity. Sharing insights, projects, and deep dives in ML, data, and innovation.

Introduction

When we train a neural network using frameworks such as PyTorch or TensorFlow, a large portion of the learning process remains hidden behind abstraction layers. We define a model, specify a loss function, call .backward(), and let the framework compute gradients and update parameters.

While this abstraction is incredibly useful in practice, it often leaves an important question unanswered:

What actually happens when a neural network learns?

To answer that question, I implemented a small neural network from scratch using only NumPy. The goal is not to build a production-ready framework. The goal is to expose the mechanics of learning by implementing:

  • Forward propagation

  • Loss computation

  • Backpropagation

  • Gradient descent optimization

from first principles.

To keep the focus entirely on the mechanics of learning, I have intentionally omitted activation functions, softmax, regularization, and advanced optimizers.

The accompanying handwritten derivations contain the complete mathematical proof of the gradients used throughout this implementation. This article focuses on connecting those mathematical results to working code.


The Problem Setup

We will train a neural network to predict house prices from four features:

  • House size (square feet)

  • Number of bedrooms

  • House age

  • Distance from city center

The target value is generated using a known linear relationship:

$$y = 600 \cdot \text{size\_sqft} + 5 \cdot \text{bedrooms} + 15 \cdot \text{age\_years} + 12 \cdot \text{distance} + \text{noise}$$

The features and targets are standardized before training.

This dataset was intentionally chosen because it follows a linear relationship. The objective is not to solve a difficult machine learning problem but to observe learning behavior clearly.
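As a rough illustration of this setup, the sketch below generates a synthetic dataset and standardizes it. The coefficients come from the formula above; the sample count, feature ranges, noise scale, random seed, and the helper name `standardize` are assumptions made for this example, not values from the original experiment.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
n_samples = 200                 # assumed sample count

# Assumed feature ranges; the article does not state them.
size_sqft = rng.uniform(500, 3500, n_samples)
bedrooms = rng.integers(1, 6, n_samples).astype(float)
age_years = rng.uniform(0, 50, n_samples)
distance = rng.uniform(1, 30, n_samples)

X_raw = np.column_stack([size_sqft, bedrooms, age_years, distance])

# Coefficients from the formula above; the noise scale is an assumption.
coefficients = np.array([600.0, 5.0, 15.0, 12.0])
noise = rng.normal(0.0, 1000.0, n_samples)
y_raw = (X_raw @ coefficients + noise).reshape(-1, 1)


def standardize(values):
    """Shift each column to zero mean and scale it to unit variance."""
    return (values - values.mean(axis=0)) / values.std(axis=0)


X = standardize(X_raw)  # shape (200, 4)
y = standardize(y_raw)  # shape (200, 1)
```

Standardizing both features and targets keeps all quantities on a similar scale, which makes a single learning rate workable for every parameter.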


The Network Architecture

The network consists of three dense layers:

$$X \rightarrow \text{Layer}_1 \rightarrow \text{Layer}_2 \rightarrow \text{Layer}_3 \rightarrow \hat y$$

Each dense layer performs:

$$Z = XW + b$$

where:

  • $X$ is the input matrix

  • $W$ is the weight matrix

  • $b$ is the bias vector

  • $Z$ is the output of the layer

The implementation stores:

self.weight
self.bias

inside each DenseLayer.

During the forward pass:

output = np.dot(input_matrix, self.weight) + self.bias

which directly corresponds to the equation above.
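A minimal sketch of such a layer might look like the following. The attribute and method names (`weight`, `bias`, `input_matrix`, `output`, `forward`) follow the snippets in this article; the constructor signature and the small random initialization are assumptions.

```python
import numpy as np


class DenseLayer:
    """A dense layer without an activation: Z = XW + b."""

    def __init__(self, n_inputs, n_outputs, rng):
        # Assumed initialization: small random weights, zero biases.
        self.weight = 0.1 * rng.standard_normal((n_inputs, n_outputs))
        self.bias = np.zeros((1, n_outputs))
        self.input_matrix = None
        self.output = None

    def forward(self, input_matrix):
        # Cache the input; backpropagation needs it later.
        self.input_matrix = input_matrix
        self.output = np.dot(input_matrix, self.weight) + self.bias


rng = np.random.default_rng(0)
layer = DenseLayer(n_inputs=4, n_outputs=3, rng=rng)
X = rng.standard_normal((5, 4))  # 5 samples, 4 features
layer.forward(X)
print(layer.output.shape)  # (5, 3)
```

The bias has shape `(1, n_outputs)`, so NumPy broadcasting adds the same bias row to every sample.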


A Deliberate Simplification

You may notice that the network contains no activation functions.

This is intentional.

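Without activations, multiple dense layers collapse into a single linear transformation. Ignoring the biases for a moment:

$$XW_1W_2W_3 = XW_{effective}$$

With the biases included, expanding the three layers gives:

$$\hat y = ((XW_1 + b_1)W_2 + b_2)W_3 + b_3 = XW_{effective} + b_{effective}$$

where $W_{effective} = W_1W_2W_3$ and $b_{effective} = b_1W_2W_3 + b_2W_3 + b_3$.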

This means the network does not gain additional representational power from having multiple layers.

The additional layers exist solely to demonstrate how gradients propagate backward through several stages.

The purpose of this implementation is not model capacity.

The purpose is visibility into the learning process.
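The collapse can be checked numerically. In the sketch below, the layer widths and random values are assumptions; the expression for `b_effective` comes from expanding the three layers by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))

# Three linear layers with assumed widths 4 -> 8 -> 8 -> 1.
W1, b1 = rng.standard_normal((4, 8)), rng.standard_normal((1, 8))
W2, b2 = rng.standard_normal((8, 8)), rng.standard_normal((1, 8))
W3, b3 = rng.standard_normal((8, 1)), rng.standard_normal((1, 1))

# Layer-by-layer forward pass.
y_layered = ((X @ W1 + b1) @ W2 + b2) @ W3 + b3

# Equivalent single linear layer.
W_effective = W1 @ W2 @ W3
b_effective = b1 @ W2 @ W3 + b2 @ W3 + b3
y_single = X @ W_effective + b_effective

print(np.allclose(y_layered, y_single))  # True
```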


Forward Propagation

Every layer applies the same computation to its own input $X_l$, where the first layer receives the data matrix ($X_1 = X$):

$$Z_l = X_lW_l + b_l$$

The output of one layer becomes the input of the next.

In code:

# Start from the network input, then feed each layer's output to the next.
output = input_matrix
for layer in self.layers.values():
    layer.forward(output)
    output = layer.output

After the final layer:

$$\hat y = \text{logits}$$

The implementation stores this prediction as:

self.logits
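To make the data flow concrete, here is a small standalone sketch of the same loop that records the shape after each layer. The dictionary of `(weight, bias)` pairs, the layer names, and the widths 4 -> 8 -> 8 -> 1 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for a 4 -> 8 -> 8 -> 1 network, stored in a dict
# to mirror the self.layers mapping used in the article.
layers = {
    "layer_1": (0.1 * rng.standard_normal((4, 8)), np.zeros((1, 8))),
    "layer_2": (0.1 * rng.standard_normal((8, 8)), np.zeros((1, 8))),
    "layer_3": (0.1 * rng.standard_normal((8, 1)), np.zeros((1, 1))),
}

X = rng.standard_normal((5, 4))  # 5 samples, 4 features

output = X  # the first layer receives the raw input
shapes = []
for name, (weight, bias) in layers.items():
    output = np.dot(output, weight) + bias  # Z_l = X_l W_l + b_l
    shapes.append(output.shape)

logits = output  # prediction after the final layer
print(shapes)  # [(5, 8), (5, 8), (5, 1)]
```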

Measuring Error Using Mean Squared Error

Once predictions are produced, they must be compared against the target values.

The error is:

$$error = \hat y - y$$

which is implemented as:

error = self.logits - target

The loss is the mean of the squared errors over all $N$ predictions:

$$L = \frac{1}{N}\sum_{i=1}^{N}(\hat y_i - y_i)^2$$

implemented as:

loss = np.mean(error ** 2)

The network now has a numerical measure of how wrong its predictions are.
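A tiny worked example, with made-up numbers, of the error and the mean squared error named in this section's heading:

```python
import numpy as np

prediction = np.array([[2.0], [0.5], [-1.0]])  # made-up values
target = np.array([[1.5], [0.5], [0.0]])

error = prediction - target   # [[0.5], [0.0], [-1.0]]
squared_error = error ** 2    # [[0.25], [0.0], [1.0]]
loss = squared_error.mean()   # (0.25 + 0.0 + 1.0) / 3

print(round(loss, 4))  # 0.4167
```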

Learning begins by asking:

How does this loss change if a weight changes?


The Two Local Derivatives Every Dense Layer Knows

For a dense layer:

$$Z = XW + b$$

there are only two local derivatives required for backpropagation.

Local Derivative With Respect to Weights

$$\frac{\partial Z}{\partial W}=X$$

represented by:

get_local_derivative_wrt_weights()

which returns:

self.input_matrix

Local Derivative With Respect to Inputs

$$\frac{\partial Z}{\partial X}=W$$

represented by:

get_local_derivative_wrt_inputs()

which returns:

self.weight

These two quantities are sufficient for gradient propagation.

Every dense layer only needs to know:

  1. How its output changes when weights change.

  2. How its output changes when inputs change.

Everything else comes from the chain rule.
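Both local derivatives can be sanity-checked with finite differences. The sketch below is an illustrative check, not part of the original implementation; the shapes, indices, and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
W = rng.standard_normal((4, 3))
b = rng.standard_normal((1, 3))
Z = X @ W + b

eps = 1e-6
j, k = 2, 1  # arbitrary weight index

# Nudge a single weight W[j, k] and see how the output moves.
W_nudged = W.copy()
W_nudged[j, k] += eps
dZ_dw = ((X @ W_nudged + b) - Z) / eps
# Only output column k changes, and it changes by X[:, j].
weights_ok = np.allclose(dZ_dw[:, k], X[:, j], atol=1e-4)

# Nudge a single input X[i, j] and see how the output moves.
i = 0
X_nudged = X.copy()
X_nudged[i, j] += eps
dZ_dx = ((X_nudged @ W + b) - Z) / eps
# Only output row i changes, and it changes by W[j, :].
inputs_ok = np.allclose(dZ_dx[i], W[j], atol=1e-4)

print(weights_ok, inputs_ok)  # True True
```

The check shows where $X$ and $W$ come from: changing one weight scales an input column into the output, and changing one input scales a weight row into the output.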


Backpropagation: The Chain Rule in Action

Backpropagation is often presented as a separate algorithm.

In reality, it is simply repeated application of the chain rule.

Differentiating the mean squared error over $N$ predictions with respect to the prediction gives:

$$\frac{\partial L}{\partial \hat y}= \frac{2(\hat y-y)}{N}$$

implemented as:

loss_gradient_wrt_prediction = (2 * self.error) / self.error.size

This becomes the initial gradient flowing backward through the network:

cumulated_backpropagating_gradient
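One way to gain confidence in this starting gradient is to compare it against central finite differences of the mean squared error. The helper `mse` and the random test values below are assumptions for this check only.

```python
import numpy as np

rng = np.random.default_rng(0)
prediction = rng.standard_normal((5, 1))
target = rng.standard_normal((5, 1))


def mse(pred):
    """Mean squared error against the fixed target above."""
    return np.mean((pred - target) ** 2)


# Analytic gradient, as in the article.
error = prediction - target
analytic = (2 * error) / error.size

# Central finite differences, one prediction at a time.
eps = 1e-6
numeric = np.zeros_like(prediction)
for idx in np.ndindex(prediction.shape):
    bump = np.zeros_like(prediction)
    bump[idx] = eps
    numeric[idx] = (mse(prediction + bump) - mse(prediction - bump)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```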

Computing Weight Gradients

For a layer:

$$\frac{\partial L}{\partial W}= \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}$$

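Using the local derivative:

$$\frac{\partial Z}{\partial W}=X$$

and arranging the product so that the result has the same shape as $W$, this becomes:

$$\frac{\partial L}{\partial W}= X^\top \frac{\partial L}{\partial Z}$$

The implementation computes exactly this, with cumulated_backpropagating_gradient holding $\frac{\partial L}{\partial Z}$: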

weight_gradient = np.dot(
    layer.get_local_derivative_wrt_weights().T,
    cumulated_backpropagating_gradient
)

This produces the gradient used to update the layer's weights.
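The transpose is what makes the shapes line up. The sketch below, using made-up shapes and a random stand-in for the incoming gradient, shows that the result matches the weight matrix's shape and equals a sum of per-sample outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))       # layer input: 5 samples, 4 features
W = rng.standard_normal((4, 3))       # layer weights
grad_Z = rng.standard_normal((5, 3))  # stand-in for dL/dZ from the next layer

weight_gradient = np.dot(X.T, grad_Z)  # (4, 5) @ (5, 3) -> (4, 3)

# Same result written as a sum of per-sample outer products.
per_sample = sum(np.outer(X[n], grad_Z[n]) for n in range(X.shape[0]))

print(weight_gradient.shape == W.shape)          # True
print(np.allclose(weight_gradient, per_sample))  # True
```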


Computing Bias Gradients

Because each neuron receives its bias directly:

$$\frac{\partial Z}{\partial b}=1$$

The same bias is added to every sample in the batch, so the bias gradient is obtained by summing the incoming gradient over the sample axis:

bias_gradient = np.sum(
    cumulated_backpropagating_gradient,
    axis=0,
    keepdims=True
)
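A small example with made-up numbers shows what the sum does. `keepdims=True` keeps the result as a row of shape `(1, n_outputs)`, which here is assumed to match the shape of the bias.

```python
import numpy as np

# Made-up incoming gradient for 3 samples and 2 output neurons.
grad_Z = np.array([[0.1, -0.2],
                   [0.3, 0.4],
                   [-0.1, 0.5]])

bias_gradient = np.sum(grad_Z, axis=0, keepdims=True)

print(bias_gradient.shape)  # (1, 2)
print(bias_gradient)        # approximately [[0.3 0.7]]
```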

Propagating Error to the Previous Layer

The chain rule also tells us how to continue moving backward.

$$\frac{\partial L}{\partial X}= \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial X}$$

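Since:

$$\frac{\partial Z}{\partial X}=W$$

the product, arranged so that the result has the same shape as $X$, is:

$$\frac{\partial L}{\partial X}= \frac{\partial L}{\partial Z} W^\top$$

and the implementation becomes: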

cumulated_backpropagating_gradient = np.dot(
    cumulated_backpropagating_gradient,
    layer.get_local_derivative_wrt_inputs().T
)

This is the core mechanism of backpropagation.

The gradient is transformed by each layer's local derivative and passed to the previous layer.

The same process repeats until the first layer is reached.
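Putting the three gradient computations together, a full backward pass can be sketched and verified against a finite-difference estimate. The functions `forward` and `backward`, the list-of-pairs parameter layout, and the widths 4 -> 5 -> 3 -> 1 are assumptions for this sketch rather than the article's actual class structure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
y = rng.standard_normal((6, 1))

# Hypothetical 4 -> 5 -> 3 -> 1 network stored as [W, b] pairs.
params = [
    [rng.standard_normal((4, 5)), np.zeros((1, 5))],
    [rng.standard_normal((5, 3)), np.zeros((1, 3))],
    [rng.standard_normal((3, 1)), np.zeros((1, 1))],
]


def forward(params, X):
    """Return the prediction and the input seen by each layer."""
    inputs, output = [], X
    for W, b in params:
        inputs.append(output)
        output = output @ W + b
    return output, inputs


def backward(params, inputs, prediction, y):
    """Return (weight_gradient, bias_gradient) per layer, first layer first."""
    grad = 2 * (prediction - y) / y.size  # dL/dy_hat for the MSE loss
    grads = []
    for (W, b), layer_input in zip(reversed(params), reversed(inputs)):
        weight_gradient = layer_input.T @ grad
        bias_gradient = grad.sum(axis=0, keepdims=True)
        grad = grad @ W.T  # hand the gradient to the previous layer
        grads.append((weight_gradient, bias_gradient))
    return grads[::-1]


prediction, inputs = forward(params, X)
grads = backward(params, inputs, prediction, y)

# Finite-difference check on one weight of the first layer.
eps = 1e-6
W1 = params[0][0]
W1[0, 0] += eps
loss_plus = np.mean((forward(params, X)[0] - y) ** 2)
W1[0, 0] -= 2 * eps
loss_minus = np.mean((forward(params, X)[0] - y) ** 2)
W1[0, 0] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

print(np.isclose(grads[0][0][0, 0], numeric, atol=1e-5))  # True
```

Checking a first-layer weight is the strictest test, because its gradient has passed through every later layer on the way back.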


Gradient Descent Optimization

Once gradients have been computed, parameters are updated.

For every weight:

$$W_{new}= W- \eta \frac{\partial L}{\partial W}$$

For every bias:

$$b_{new}= b- \eta \frac{\partial L}{\partial b}$$

where $\eta$ is the learning rate.

The implementation performs:

new_weight = weight - learning_rate * weight_gradient

new_bias = bias - learning_rate * bias_gradient

This step moves each parameter a small distance against its gradient, which reduces the loss as long as the learning rate is small enough.
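The pieces can be combined into a compact training loop. This is a sketch under stated assumptions, not the article's implementation: the toy data, layer widths, initialization scale, learning rate, and epoch count are all chosen for illustration, and the parameters are kept in a plain list instead of `DenseLayer` objects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: roughly standardized features, noisy linear target.
X = rng.standard_normal((200, 4))
true_weights = np.array([[0.9], [0.1], [-0.2], [0.3]])
y = X @ true_weights + 0.05 * rng.standard_normal((200, 1))

# Assumed layer widths 4 -> 8 -> 8 -> 1 and initialization scale.
sizes = [4, 8, 8, 1]
params = [
    [rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros((1, n_out))]
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

learning_rate = 0.01  # assumed value
epochs = 300          # assumed value
losses = []

for epoch in range(epochs):
    # Forward pass, caching the input of every layer.
    inputs, output = [], X
    for W, b in params:
        inputs.append(output)
        output = output @ W + b

    error = output - y
    losses.append(np.mean(error ** 2))

    # Backward pass with the gradient descent update applied per layer.
    grad = 2 * error / error.size
    for (W, b), layer_input in zip(reversed(params), reversed(inputs)):
        weight_gradient = layer_input.T @ grad
        bias_gradient = grad.sum(axis=0, keepdims=True)
        grad = grad @ W.T  # uses the weights from before the update
        W -= learning_rate * weight_gradient  # in place: W - eta * dL/dW
        b -= learning_rate * bias_gradient    # in place: b - eta * dL/db

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Note the ordering inside the backward loop: the gradient is passed to the previous layer before the current layer's weights are changed, so every gradient in one epoch is computed from the same set of parameters.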


Training Results

After training for 125 epochs:

| Epoch | Loss |
|---|---|
| 0 | 14.3718 |
| 20 | 0.1922 |
| 40 | 0.0149 |
| 60 | 0.0052 |
| 80 | 0.0037 |
| 100 | 0.0035 |
| 120 | 0.0034 |

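The loss decreases at every logged epoch, from 14.3718 at epoch 0 to 0.0034 at epoch 120, with most of the drop in the first 40 epochs. This indicates that gradient descent is successfully adjusting the parameters.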

Comparing predictions and targets:

Prediction: [ 2.0858 -0.3741 -0.7292 ... ]
Target    : [ 2.0799 -0.3551 -0.7831 ... ]

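shows predictions within about 0.06 of the standardized targets for the displayed samples, which suggests that the model has learned the underlying relationship in the dataset.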


What This Implementation Teaches

This implementation intentionally avoids many components found in modern neural networks.

There are:

  • No activation functions

  • No softmax

  • No cross-entropy

  • No Adam optimizer

  • No regularization

These omissions are deliberate.

The objective is to isolate and expose the learning process itself.

The most important takeaway is that every dense layer only needs two local derivatives:

$$\frac{\partial Z}{\partial W}$$

and

$$\frac{\partial Z}{\partial X}$$

Once those local derivatives are known, the chain rule handles the rest.

Backpropagation is not a mysterious algorithm hidden inside deep learning frameworks.

It is simply calculus applied repeatedly from the output layer back to the input layer.

Modern frameworks automate these calculations, but the mathematics remains exactly the same.

Understanding this process removes much of the abstraction surrounding neural network training and provides a clearer picture of what is actually happening when a model learns.