Neural Networks From Scratch
Understanding Learning Through Forward Propagation, Backpropagation, and Gradient Descent

Introduction
When we train a neural network using frameworks such as PyTorch or TensorFlow, a large portion of the learning process remains hidden behind abstraction layers. We define a model, specify a loss function, call .backward(), and let the framework compute gradients and update parameters.
While this abstraction is incredibly useful in practice, it often leaves an important question unanswered:
What actually happens when a neural network learns?
To answer that question, I implemented a small neural network from scratch using only NumPy. The goal is not to build a production-ready framework but to expose the mechanics of learning by implementing each of the following from first principles:
Forward propagation
Loss computation
Backpropagation
Gradient descent optimization
To keep the focus entirely on the learning mechanics, activation functions, softmax, regularization, and advanced optimizers have been intentionally omitted.
The accompanying handwritten derivations contain the complete mathematical proof of the gradients used throughout this implementation. This article focuses on connecting those mathematical results to working code.
The GitHub repository contains the implementation code and the handwritten notes mentioned above (notes coming soon).
The Problem Setup
We will train a neural network to predict house prices from four features:
House size (square feet)
Number of bedrooms
House age
Distance from city center
The target value is generated using a known linear relationship:
$$y = 600 \cdot \text{size\_sqft} + 5 \cdot \text{bedrooms} + 15 \cdot \text{age\_years} + 12 \cdot \text{distance} + \text{noise}$$
The features and targets are standardized before training.
This dataset was intentionally chosen because it follows a linear relationship. The objective is not to solve a difficult machine learning problem but to observe learning behavior clearly.
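A dataset of this kind could be generated roughly as follows. This is only a sketch: the sample count, feature ranges, noise scale, and variable names are assumptions made for illustration and are not taken from the original implementation. Only the coefficients come from the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_samples = 200                 # assumed sample count

# Assumed feature ranges, chosen only for illustration.
size_sqft = rng.uniform(500, 3500, n_samples)
bedrooms = rng.integers(1, 6, n_samples)
age_years = rng.uniform(0, 50, n_samples)
distance = rng.uniform(1, 30, n_samples)
noise = rng.normal(0, 1000, n_samples)  # assumed noise scale

# Target built from the linear relationship stated in the article.
y = 600 * size_sqft + 5 * bedrooms + 15 * age_years + 12 * distance + noise

X = np.column_stack([size_sqft, bedrooms, age_years, distance])
y = y.reshape(-1, 1)

# Standardize every column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean(axis=0)) / y.std(axis=0)

print(X_std.shape, y_std.shape)
```

Standardizing keeps all four features on a comparable scale, which matters here because house size would otherwise dominate the gradients.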
The Network Architecture
The network consists of three dense layers:
$$X \rightarrow Layer_1 \rightarrow Layer_2 \rightarrow Layer_3 \rightarrow \hat y$$
Each dense layer performs:
$$Z = XW + b$$
where:
$X$ is the input matrix
$W$ is the weight matrix
$b$ is the bias vector
$Z$ is the output of the layer
The implementation stores:
self.weight
self.bias
inside each DenseLayer.
During the forward pass:
output = np.dot(input_matrix, self.weight) + self.bias
which directly corresponds to the equation above.
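To make the pieces concrete, here is a minimal sketch of what such a layer might look like. The constructor signature and the initialization scheme are assumptions; only the attribute names `weight`, `bias`, `input_matrix`, and `output` and the forward computation follow the article.

```python
import numpy as np


class DenseLayer:
    """Minimal dense layer sketch: Z = XW + b."""

    def __init__(self, n_inputs, n_outputs, seed=0):
        # Assumed initialization: small random weights, zero biases.
        rng = np.random.default_rng(seed)
        self.weight = rng.normal(0.0, 0.1, size=(n_inputs, n_outputs))
        self.bias = np.zeros((1, n_outputs))

    def forward(self, input_matrix):
        # Keep the input around: backpropagation needs it later.
        self.input_matrix = input_matrix
        self.output = np.dot(input_matrix, self.weight) + self.bias


layer = DenseLayer(4, 3)
X = np.ones((5, 4))   # 5 samples, 4 features
layer.forward(X)
print(layer.output.shape)  # (5, 3)
```

The bias has shape `(1, n_outputs)`, so NumPy broadcasts it across every row of the batch.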
A Deliberate Simplification
You may notice that the network contains no activation functions.
This is intentional.
Without activations, multiple dense layers collapse into a single linear transformation. Ignoring the biases for a moment:
$$XW_1W_2W_3 = XW_{effective}$$
The biases likewise combine into a single effective bias, so:
$$\hat y = XW_{effective} + b_{effective}$$
This means the network does not gain additional representational power from having multiple layers.
The additional layers exist solely to demonstrate how gradients propagate backward through several stages.
The purpose of this implementation is not model capacity.
The purpose is visibility into the learning process.
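The collapse can be checked numerically. The sketch below uses arbitrary layer widths (4, 5, 3, 1) and random parameters, all of which are assumptions for illustration. Expanding the three layers gives an effective bias of $b_1W_2W_3 + b_2W_3 + b_3$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

# Three dense layers with assumed widths 4 -> 5 -> 3 -> 1.
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=(1, 5))
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=(1, 3))
W3, b3 = rng.normal(size=(3, 1)), rng.normal(size=(1, 1))

# Layer-by-layer forward pass without activations.
stacked = ((X @ W1 + b1) @ W2 + b2) @ W3 + b3

# The same mapping expressed as one linear transformation.
W_effective = W1 @ W2 @ W3
b_effective = b1 @ W2 @ W3 + b2 @ W3 + b3
collapsed = X @ W_effective + b_effective

print(np.allclose(stacked, collapsed))  # True
```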
Forward Propagation
Every layer $l$ applies the same computation to its input $X_l$:
$$Z_l = X_lW_l + b_l$$
The output of one layer becomes the input of the next.
In code:
for key, value in self.layers.items():
    value.forward(output)
    output = value.output
After the final layer:
$$\hat y = logits$$
The implementation stores this prediction as:
self.logits
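Put together, the forward pass might look like the sketch below. The `Network` class, its constructor, and the layer names are hypothetical. The loop body and the `logits` attribute mirror the snippets above. The weights are set to ones so the result is easy to verify by hand.

```python
import numpy as np


class DenseLayer:
    def __init__(self, weight, bias):
        self.weight, self.bias = weight, bias

    def forward(self, input_matrix):
        self.input_matrix = input_matrix
        self.output = np.dot(input_matrix, self.weight) + self.bias


class Network:
    def __init__(self, layers):
        self.layers = layers  # dict: preserves insertion order

    def forward(self, X):
        output = X
        for layer in self.layers.values():
            layer.forward(output)
            output = layer.output  # output of one layer feeds the next
        self.logits = output
        return self.logits


# Assumed widths 4 -> 3 -> 2 -> 1 with all-ones weights and zero biases.
network = Network({
    "layer_1": DenseLayer(np.ones((4, 3)), np.zeros((1, 3))),
    "layer_2": DenseLayer(np.ones((3, 2)), np.zeros((1, 2))),
    "layer_3": DenseLayer(np.ones((2, 1)), np.zeros((1, 1))),
})
logits = network.forward(np.ones((3, 4)))
print(logits.ravel())  # each entry is 4 * 3 * 2 = 24
```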
Measuring Error Using Mean Squared Error
Once predictions are produced, they must be compared against the target values.
The error is:
$$error = \hat y - y$$
which is implemented as:
error = self.logits - target
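The loss is the mean of the squared errors over all $N$ predictions:
$$L = \frac{1}{N}\sum_{i=1}^{N}(\hat y_i - y_i)^2$$
The squared-error term is implemented as:
loss = error ** 2
Averaging these values gives the scalar loss; the $\frac{1}{N}$ factor reappears in the gradient during backpropagation.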
The network now has a numerical measure of how wrong its predictions are.
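A tiny worked example, with made-up predictions and targets, shows the arithmetic. It assumes the squared errors are averaged into one scalar, as the name mean squared error implies.

```python
import numpy as np

# Made-up values for illustration only.
logits = np.array([[2.0], [0.0], [-1.0]])
target = np.array([[1.0], [0.0], [1.0]])

error = logits - target       # [[1], [0], [-2]]
squared_error = error ** 2    # [[1], [0], [4]]
loss = squared_error.mean()   # (1 + 0 + 4) / 3

print(loss)
```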
Learning begins by asking:
How does this loss change if a weight changes?
The Two Local Derivatives Every Dense Layer Knows
For a dense layer:
$$Z = XW + b$$
there are only two local derivatives required for backpropagation.
Local Derivative With Respect to Weights
$$\frac{\partial Z}{\partial W}=X$$
represented by:
get_local_derivative_wrt_weights()
which returns:
self.input_matrix
Local Derivative With Respect to Inputs
$$\frac{\partial Z}{\partial X}=W$$
represented by:
get_local_derivative_wrt_inputs()
which returns:
self.weight
These two quantities are sufficient for gradient propagation.
Every dense layer only needs to know:
How its output changes when weights change.
How its output changes when inputs change.
Everything else comes from the chain rule.
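Both local derivatives can be sanity-checked with finite differences. The sketch below is not part of the original implementation. It nudges one input and one weight of a small, randomly initialized layer and compares the change in the output with $W$ and $X$ respectively.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1, 3))   # one sample, three features
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))


def dense(X, W, b):
    return X @ W + b


eps = 1e-6
j, k = 1, 0  # arbitrary input index and output index

# Nudge input feature j: output k moves by eps * W[j, k].
X_nudged = X.copy()
X_nudged[0, j] += eps
dZ_dXj = (dense(X_nudged, W, b) - dense(X, W, b)) / eps
print(np.allclose(dZ_dXj, W[j], atol=1e-6))  # True

# Nudge weight W[j, k]: only output k moves, by eps * X[0, j].
W_nudged = W.copy()
W_nudged[j, k] += eps
dZ_dWjk = (dense(X, W_nudged, b) - dense(X, W, b)) / eps
expected = np.zeros((1, 2))
expected[0, k] = X[0, j]
print(np.allclose(dZ_dWjk, expected, atol=1e-6))  # True
```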
Backpropagation: The Chain Rule in Action
Backpropagation is often presented as a separate algorithm.
In reality, it is simply repeated application of the chain rule.
The derivative of the loss with respect to the prediction is:
$$\frac{\partial L}{\partial \hat y}= \frac{2(\hat y-y)}{N}$$
implemented as:
loss_gradient_wrt_prediction = (2 * self.error) / self.error.size
This becomes the initial gradient flowing backward through the network:
cumulated_backpropagating_gradient
Computing Weight Gradients
For a layer:
$$\frac{\partial L}{\partial W}= \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}$$
Using the local derivative:
$$\frac{\partial Z}{\partial W}=X$$
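the implementation computes $X^T \frac{\partial L}{\partial Z}$. The transpose makes the shapes line up so that the result has the same shape as $W$:
weight_gradient = np.dot(
    layer.get_local_derivative_wrt_weights().T,
    cumulated_backpropagating_gradient,
)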
This produces the gradient used to update the layer's weights.
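One way to convince yourself that this formula is right is a numerical gradient check. The sketch below uses a single dense layer with random data (an assumption made to keep it short) and compares the analytic gradient with central finite differences of the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 4))
W = rng.normal(size=(4, 1))
b = np.zeros((1, 1))
y = rng.normal(size=(8, 1))


def mse(W):
    return np.mean((X @ W + b - y) ** 2)


# Analytic gradient: X^T times the gradient arriving from the loss.
error = X @ W + b - y
upstream = (2 * error) / error.size
weight_gradient = np.dot(X.T, upstream)

# Numerical gradient: perturb one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    numeric[i, 0] = (mse(W_plus) - mse(W_minus)) / (2 * eps)

print(np.allclose(weight_gradient, numeric, atol=1e-6))  # True
```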
Computing Bias Gradients
Because each neuron receives its bias directly:
$$\frac{\partial Z}{\partial b}=1$$
the bias gradient is obtained by summing the incoming gradient over the batch dimension (one row per sample):
bias_gradient = np.sum(
    cumulated_backpropagating_gradient,
    axis=0,
    keepdims=True,
)
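The sum over rows appears because the same bias is broadcast to every sample in the batch, so each sample contributes to its gradient. A small sketch with made-up gradient values shows that the sum is equivalent to multiplying by a row of ones, which is the matrix form of $\frac{\partial Z}{\partial b}=1$.

```python
import numpy as np

# Made-up incoming gradient: 3 samples, 2 output neurons.
upstream = np.array([
    [0.5, -0.25],
    [0.25, 0.5],
    [-0.25, 0.0],
])

bias_gradient = np.sum(upstream, axis=0, keepdims=True)

# Equivalent formulation: a row of ones times the incoming gradient.
via_ones = np.ones((1, upstream.shape[0])) @ upstream

print(bias_gradient.shape)                   # (1, 2)
print(np.allclose(bias_gradient, via_ones))  # True
```

`keepdims=True` keeps the result two-dimensional, so it matches a bias stored with shape `(1, n_outputs)`.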
Propagating Error to the Previous Layer
The chain rule also tells us how to continue moving backward.
$$\frac{\partial L}{\partial X}= \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial X}$$
Since:
$$\frac{\partial Z}{\partial X}=W$$
the implementation multiplies the incoming gradient by $W^T$:
cumulated_backpropagating_gradient = np.dot(
    cumulated_backpropagating_gradient,
    layer.get_local_derivative_wrt_inputs().T,
)
This is the core mechanism of backpropagation.
The gradient is transformed by each layer's local derivative and passed to the previous layer.
The same process repeats until the first layer is reached.
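The whole backward pass can be sketched in a few lines. The helper functions `forward` and `backward`, the layer widths, and the initialization are assumptions. The gradient formulas and getter names follow the article. A finite-difference check on one first-layer weight confirms that the gradient survives the trip through all three layers.

```python
import numpy as np


class DenseLayer:
    def __init__(self, n_inputs, n_outputs, rng):
        self.weight = rng.normal(0.0, 0.5, size=(n_inputs, n_outputs))
        self.bias = np.zeros((1, n_outputs))

    def forward(self, input_matrix):
        self.input_matrix = input_matrix
        self.output = np.dot(input_matrix, self.weight) + self.bias

    def get_local_derivative_wrt_weights(self):
        return self.input_matrix

    def get_local_derivative_wrt_inputs(self):
        return self.weight


def forward(layers, X):
    output = X
    for layer in layers:
        layer.forward(output)
        output = layer.output
    return output


def backward(layers, logits, target):
    error = logits - target
    gradient = (2 * error) / error.size  # dL/d(prediction)
    gradients = []
    for layer in reversed(layers):
        # Parameter gradients use the gradient arriving at this layer...
        weight_gradient = np.dot(layer.get_local_derivative_wrt_weights().T, gradient)
        bias_gradient = np.sum(gradient, axis=0, keepdims=True)
        gradients.append((weight_gradient, bias_gradient))
        # ...which is then pushed back to the previous layer.
        gradient = np.dot(gradient, layer.get_local_derivative_wrt_inputs().T)
    return gradients[::-1]  # reorder to match `layers`


rng = np.random.default_rng(4)
layers = [DenseLayer(4, 3, rng), DenseLayer(3, 2, rng), DenseLayer(2, 1, rng)]
X = rng.normal(size=(10, 4))
y = rng.normal(size=(10, 1))
gradients = backward(layers, forward(layers, X), y)


def loss():
    return np.mean((forward(layers, X) - y) ** 2)


# Central finite difference on a single weight of the first layer.
eps = 1e-6
layers[0].weight[0, 0] += eps
loss_plus = loss()
layers[0].weight[0, 0] -= 2 * eps
loss_minus = loss()
layers[0].weight[0, 0] += eps  # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)

print(np.isclose(gradients[0][0][0, 0], numeric, atol=1e-6))  # True
```

Note the order inside the loop: the weight and bias gradients must be computed before the running gradient is overwritten for the previous layer.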
Gradient Descent Optimization
Once gradients have been computed, parameters are updated.
For every weight:
$$W_{new}= W- \eta \frac{\partial L}{\partial W}$$
For every bias:
$$b_{new}= b- \eta \frac{\partial L}{\partial b}$$
where:
$\eta$ is the learning rate.
The implementation performs:
new_weight = weight - learning_rate * weight_gradient
new_bias = bias - learning_rate * bias_gradient
This step moves the parameters a small distance against the gradient, which reduces the loss as long as the learning rate is small enough.
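The update rule can be watched in action on a deliberately simplified problem. The sketch below trains a single dense layer rather than the three-layer network, on noiseless synthetic data. The true parameters, learning rate, and epoch count are all assumptions chosen for illustration, not values from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
true_weight = np.array([[1.5], [-0.5], [0.25], [2.0]])  # hypothetical
true_bias = 0.3                                          # hypothetical
y = X @ true_weight + true_bias                          # noiseless target

weight = np.zeros((4, 1))
bias = np.zeros((1, 1))
learning_rate = 0.1  # assumed value

losses = []
for epoch in range(300):
    # Forward pass and loss.
    error = X @ weight + bias - y
    losses.append(np.mean(error ** 2))

    # Backward pass for a single dense layer.
    upstream = (2 * error) / error.size
    weight_gradient = np.dot(X.T, upstream)
    bias_gradient = np.sum(upstream, axis=0, keepdims=True)

    # Gradient descent update.
    weight = weight - learning_rate * weight_gradient
    bias = bias - learning_rate * bias_gradient

print(losses[-1] < losses[0])  # True
print(weight.ravel(), bias.ravel())
```

Because the data is noiseless and the model is linear, the learned parameters approach the ones used to generate the data.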
Training Results
After training for 125 epochs, with the loss logged every 20 epochs:
| Epoch | Loss |
|---|---|
| 0 | 14.3718 |
| 20 | 0.1922 |
| 40 | 0.0149 |
| 60 | 0.0052 |
| 80 | 0.0037 |
| 100 | 0.0035 |
| 120 | 0.0034 |
The loss falls from 14.37 to 0.0034 and levels off after roughly epoch 80, indicating that gradient descent is successfully adjusting the parameters.
Comparing predictions and targets:
Prediction: [ 2.0858 -0.3741 -0.7292 ... ]
Target : [ 2.0799 -0.3551 -0.7831 ... ]
shows that the model has learned the underlying relationship in the dataset.
What This Implementation Teaches
This implementation intentionally avoids many components found in modern neural networks.
There are:
No activation functions
No softmax
No cross-entropy
No Adam optimizer
No regularization
These omissions are deliberate.
The objective is to isolate and expose the learning process itself.
The most important takeaway is that every dense layer only needs two local derivatives:
$$\frac{\partial Z}{\partial W}$$
and
$$\frac{\partial Z}{\partial X}$$
Once those local derivatives are known, the chain rule handles the rest.
Backpropagation is not a mysterious algorithm hidden inside deep learning frameworks.
It is simply calculus applied repeatedly from the output layer back to the input layer.
Modern frameworks automate these calculations, but the mathematics remains exactly the same.
Understanding this process removes much of the abstraction surrounding neural network training and provides a clearer picture of what is actually happening when a model learns.




