Backprop from Scratch

Prathmesh Deshpande
5 min read · Mar 9, 2023


In this post, we will discuss the fundamental concept of derivatives and their application in neural networks. We will explore how changes in inputs affect the output of a function and how this knowledge can be leveraged to optimize neural networks. Additionally, we will work through a simple example and demonstrate how to calculate gradients using PyTorch.

Intuition

An intuitive way to think about a derivative is as a measure of how much the output of a function changes when the input changes by a tiny amount. Let’s take a look at an example:

Let

y = a * b + c ….. (Eq 1)

And let,

a = 7

b = -2

c = 26

Using Eq 1, we get y = 12

To calculate dy/da, we need to see how much the value of y changes if we bump the variable ‘a’ by a small value h while keeping the other variables constant.

let h = 0.0001

Then new values are

a = 7.0001

b = -2

c = 26

Subsequently, the new value of y,

y’ = 11.9998

To get dy/da, we use the formula

dy/da = (y’-y)/h

Hence, dy/da ≈ -2.0 (the exact derivative is b = -2; the finite-difference estimate matches this up to floating-point error).

Intuitively, we can reason about this as follows:

b is negative, hence as long as ‘a’ is positive, a*b will always be negative.

As ‘a’ gets larger, a*b gets more negative.

Hence, if we increase ‘a’ by a small amount, y must decrease, because a larger negative value of (a*b) is added to c.

The slope dy/da is therefore negative, and its sign denotes the direction in which y moves when ‘a’ is changed.

The magnitude of dy/da tells us how sensitive the function is to a change in ‘a’: if |dy/da| is large, a small change in ‘a’ results in a large change in y, and vice versa.

Similarly, bumping b and c by a small amount while keeping the other variables constant gives us dy/db and dy/dc at our original point (a = 7, b = -2, c = 26), as the sketch below shows.
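Here is a minimal sketch of this finite-difference estimate in Python (the function name and the value of h are just for illustration):

def y_fn(a, b, c):
    return a * b + c

a, b, c = 7.0, -2.0, 26.0
h = 0.0001
y0 = y_fn(a, b, c)

#Bump one variable at a time while keeping the others constant
dy_da = (y_fn(a + h, b, c) - y0) / h   # ~ -2.0, i.e. b
dy_db = (y_fn(a, b + h, c) - y0) / h   # ~  7.0, i.e. a
dy_dc = (y_fn(a, b, c + h) - y0) / h   # ~  1.0

print(dy_da, dy_db, dy_dc)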

Applying Intuition to Neural Networks

Picture a shower fed by a main water valve ‘m’, which feeds an intermediate valve ‘h’, which in turn feeds the shower head.

The main water valve ‘m’ represents an input-layer weight of the neural network, and the flow out of the shower head represents the output y.

The intermediate valve ‘h’ represents a hidden-layer weight, and the intermediate flow between the two valves is x.

What we want to know is how changes to m and h change the output y, i.e. dy/dm and dy/dh.

First, we calculate the change in y for a small change in h while keeping the flow through valve m constant; this gives us dy/dh.

The change in y can also be related to its immediate predecessor, the intermediate flow x: keeping h constant, we can measure how much y changes for a small change in x, i.e. dy/dx.

Further, we can find how the intermediate flow x changes with a small change in the opening of valve m, i.e. dx/dm.

Using the chain rule of differentiation, we can calculate dy/dm as

dy/dm = dx/dm * dy/dx

Now we know how the output y changes with the settings of both valves m and h. Hence we can tweak their values such that the output y is optimal.
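A minimal sketch of this chain-rule check in Python; the flow functions below are made-up assumptions purely for illustration, not part of the original example:

import math

#Hypothetical flows: x depends only on valve m, y depends on x and valve h
def flow_x(m):
    return 3.0 * m            # flow after the main valve

def flow_y(x, h):
    return h * math.sqrt(x)   # flow out of the shower head

m, h, eps = 2.0, 0.5, 1e-6
x = flow_x(m)

#Local derivatives via finite differences
dx_dm = (flow_x(m + eps) - flow_x(m)) / eps
dy_dx = (flow_y(x + eps, h) - flow_y(x, h)) / eps

#Chain rule: dy/dm = dx/dm * dy/dx
dy_dm_chain = dx_dm * dy_dx

#Direct check: bump m and recompute y end to end
dy_dm_direct = (flow_y(flow_x(m + eps), h) - flow_y(flow_x(m), h)) / eps

print(dy_dm_chain, dy_dm_direct)  # the two estimates agree closely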

Working through a simple example

Consider the following equations

a = 3
b = 2

c = a * b
d = 4

y = c * d

This can be represented as a simple computation graph: a and b feed into the node c, and c and d feed into the output y.

Let’s look at how we can calculate the different gradients to see how each value affects y:

dy/dc = d

dy/dd = c

dc/da = b

dc/db = a

dy/da = (dy/dc) * (dc/da) = d * b

dy/db = (dy/dc) * (dc/db) = d * a

Hence we can find how y changes with each of the inputs of the neural network.
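Plugging in the numbers: dy/dc = 4, dy/dd = 6, dy/da = 4 * 2 = 8 and dy/db = 4 * 3 = 12. Here is a minimal sketch that checks these values with PyTorch autograd (the same library used in the example further below):

import torch

a = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([2.0], requires_grad=True)
d = torch.tensor([4.0], requires_grad=True)

c = a * b
c.retain_grad()   # keep the gradient of the intermediate node c
y = c * d

y.backward()

print(a.grad.item())  # dy/da = d * b = 8.0
print(b.grad.item())  # dy/db = d * a = 12.0
print(c.grad.item())  # dy/dc = d = 4.0
print(d.grad.item())  # dy/dd = c = 6.0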

A note to remember: in a real neural net, each input has a weight, and it is the weights that are tweaked, not the actual inputs. In the example above, we can think of all the variables as weights and all the inputs as 1.

Example with PyTorch

import torch

#Initialize Inputs
x1 = torch.Tensor([2.0]).double()
x1.requires_grad = True

x2 = torch.Tensor([0.0]).double()
x2.requires_grad = True

#Initialize Weights
w1 = torch.Tensor([-3.0]).double()
w1.requires_grad = True

w2 = torch.Tensor([1.0]).double()
w2.requires_grad = True

#Initialize Bias
b = torch.Tensor([6.8813735870195432]).double()
b.requires_grad = True

#Weighted sum of inputs plus bias (a single output neuron)
n = x1*w1 + x2*w2 + b

#Add non-linearity (tanh activation)
o = torch.tanh(n)

#Backward Pass
#This is where the gradients are calculated and accumulated
o.backward()

#Print the gradients
print('x2', x2.grad.item())
print('w2', w2.grad.item())
print('x1', x1.grad.item())
print('w1', w1.grad.item())

"""
Output:
x2 0.5000001283844369
w2 0.0
x1 -1.5000003851533106
w1 1.0000002567688737
"""

As we can see, we get the gradient of the output with respect to each input and each weight.

In a neural network, a loss function is defined to measure how good the predicted values are.

The weights are then updated using the gradients so that the value of the loss function decreases. One way to think about this is as follows:

Let’s assume our loss function’s value is 100 and, ideally, it should be zero.

From the example above we know the derivative of the output with respect to each input and weight.

Since the gradient of w2 is 0, changing w2 slightly won’t affect the output.

For w1, however, the gradient is ~1, hence we know two things:

If we increase w1 in the positive direction, the output also increases in the positive direction. (Recall that the sign of the gradient shows the direction in which the output shifts with respect to the variable.)

In addition, the change is locally proportional by a factor of ~1, meaning a +1 unit change in w1 causes a change of roughly +1 in the output (for small changes, since the gradient is only a local, linear approximation).

Since we want to move the loss function from 100 towards 0, our change should be ~-100 (see the sketch after the notes below).

Section Notes:

  1. Here ‘~’ means approximately.
  2. Our example is too simple, hence a direct jump by 100 is okay; however, in training there is a step size, a.k.a. the learning rate, denoted by α, which decides how much the weights should move at each iteration of the training process.
  3. The direction is decided based on the sign of gradient and the definition of the loss function.
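A minimal sketch of this update step (the learning rate, weight and gradient values below are made up purely for illustration; they are not taken from the PyTorch example above):

#Gradient descent update: w <- w - alpha * dLoss/dw
learning_rate = 0.01   # alpha, the step size
w1 = -3.0              # current weight (illustrative)
grad_w1 = 1.0          # dLoss/dw1 (illustrative)

w1 = w1 - learning_rate * grad_w1
print(w1)  # the weight moves against the gradient, nudging the loss down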

Conclusion

Understanding derivatives and the chain rule is essential to understanding how neural networks work. By knowing how the output changes with a small change in input, we can tweak the weights of the neural network such that the output is optimal.

📖Resources

  1. Andrej Karpathy’s Tutorial
  2. 3Blue1Brown Video — Intro
  3. 3Blue1Brown Video — Calculus

That’s it for this issue. I hope you found this article interesting. Until next time!

📱Let’s connect :)

Twitter | Instagram
