A Neural Network in Python, Part 1: sigmoid function, gradient descent & backpropagation

In this article, I’ll show you a toy neural network that learns the XOR logical function. My objective is to make it as easy as possible for you to see how the basic ideas work, and to provide a basis from which you can experiment further. In real applications, you would not write these programs from scratch as we do here (apart from relying on numpy for the low-level number crunching); you would use libraries such as Keras, TensorFlow, scikit-learn, etc.

What do you need to know to understand the code here? Python 3, numpy, and some linear algebra (e.g. vectors and matrices). If you want to go deeper into the topic, some calculus (e.g. partial derivatives) would be very useful, if not essential. If you aren’t already familiar with the basic principles of ANNs, please read the sister article over on AILinux.net: A Brief Introduction to Artificial Neural Networks. When you have read this post, you might like to visit A Neural Network in Python, Part 2: activation functions, bias, SGD, etc.

The program below, fewer than 20 lines long, learns the exclusive-or (XOR) logic function. XOR is true only when its two inputs differ. Here is the truth table for XOR:

a	b	a xor b
0	0	0
0	1	1
1	0	1
1	1	0
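
As a quick cross-check, the same table can be produced with Python's built-in ^ (bitwise XOR) operator. This is only an illustrative sketch; the name xor_table is mine and is not used in the program below.

```python
# Build the XOR truth table with Python's bitwise XOR operator (^).
# xor_table is a hypothetical name used only for this illustration.
xor_table = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

for a, b, result in xor_table:
    print(a, b, result)
```

The four rows come out in the same order as the rows of X and Y used for training.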

Main variables:

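Wh and Wz are the weight matrices feeding the hidden and output layers. Each has dimension (previous layer size) * (next layer size), so 2 * 3 and 3 * 1 here.
X is the input matrix, dimension 4 * 2 = all combinations of 2 truth values.
Y holds the corresponding target values, dimension 4 * 1: the XOR of each of the 4 pairs of values in X.
Z is the network's output, dimension 4 * 1: its learned estimates of XOR for those 4 pairs.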
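
If the dimensions feel abstract, a short shape check may help. The sketch below assumes the layer sizes used in this article (2, 3 and 1) and deliberately leaves out the activation function, since it does not change any shapes; the shapes dictionary is just a hypothetical helper for printing.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # 4 samples, 2 inputs each
Y = np.array([[0], [1], [1], [0]])               # 4 targets, 1 output each

Wh = np.random.uniform(size=(2, 3))              # input (2) -> hidden (3)
Wz = np.random.uniform(size=(3, 1))              # hidden (3) -> output (1)

H = X.dot(Wh)    # (4, 2) . (2, 3) -> (4, 3); activation omitted, shapes only
Z = H.dot(Wz)    # (4, 3) . (3, 1) -> (4, 1)

# Hypothetical helper: collect the shapes so they can be printed together
shapes = {name: arr.shape for name, arr in
          [("X", X), ("Y", Y), ("Wh", Wh), ("Wz", Wz), ("H", H), ("Z", Z)]}
print(shapes)
```

Note that Z ends up with the same shape as Y, which is what lets the program subtract one from the other to get the error.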

#   XOR.py-A very simple neural network to do exclusive or.
import numpy as np
 
epochs = 60000           # Number of iterations
inputLayerSize, hiddenLayerSize, outputLayerSize = 2, 3, 1
 
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([ [0],   [1],   [1],   [0]])
 
def sigmoid (x): return 1/(1 + np.exp(-x))      # activation function
def sigmoid_(x): return x * (1 - x)             # sigmoid slope, given x = sigmoid(input)
                                                # weights on layer inputs
Wh = np.random.uniform(size=(inputLayerSize, hiddenLayerSize))
Wz = np.random.uniform(size=(hiddenLayerSize,outputLayerSize))
 
for i in range(epochs):
 
    H = sigmoid(np.dot(X, Wh))                  # hidden layer results
    Z = sigmoid(np.dot(H, Wz))                  # output layer results
    E = Y - Z                                   # how much we missed (error)
    dZ = E * sigmoid_(Z)                        # delta Z
    dH = dZ.dot(Wz.T) * sigmoid_(H)             # delta H
    Wz +=  H.T.dot(dZ)                          # update output layer weights
    Wh +=  X.T.dot(dH)                          # update hidden layer weights
 
print(Z)                # what have we learnt?

Walk-through

We use numpy, because we’ll be using matrices and vectors. There are no ‘neuron’ objects in the code; rather, the neural network is encoded in the weight matrices.

Our hyperparameters (settings we choose before training, as opposed to the weights, which are learned) are epochs (lots) and layer sizes. Since the input data comprises 2 operands for the XOR operation, the input layer devotes 1 neuron per operand. The result of the XOR operation is one truth value, so we have one output node. The hidden layer can have any number of nodes; 3 seems sufficient, but you should experiment with this.

Each training example adds a row to the matrix at every layer, so the input matrix X is 4 * 2, with one row for each possible pair of truth values. The target matrix Y is 4 * 1, holding the result of XOR for each of those pairs.

An activation function corresponds to the biological phenomenon of a neuron ‘firing’, i.e. triggering a nerve signal when the neuron’s inputs combine in some appropriate way. It has to be chosen so as to cause reasonably proportionate outputs within a small range, for small changes of input. We’ll use the very popular sigmoid function, but note that there are others. We also need the sigmoid derivative for backpropagation.
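
One detail is easy to miss: the derivative of the sigmoid can be written in terms of the sigmoid's own output, s * (1 - s), which is why sigmoid_ in the program is applied to already-activated values. The sketch below checks this numerically against a finite-difference estimate; the names eps, numeric_slope and max_diff are mine, introduced only for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_(s):
    # Expects s = sigmoid(x), i.e. the already-activated value
    return s * (1 - s)

x = np.linspace(-4, 4, 9)
s = sigmoid(x)

# Central finite difference as an independent estimate of the slope
eps = 1e-6
numeric_slope = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Largest disagreement between the two ways of computing the slope
max_diff = np.max(np.abs(numeric_slope - sigmoid_(s)))
print(max_diff)
```

The slope peaks at 0.25 when the input is 0 (output 0.5) and shrinks towards zero as the output approaches 0 or 1.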

Initialise the weights with random values. Setting them all to the same value, e.g. zero, would be a poor choice: identical weights receive identical updates, so the hidden neurons would stay copies of each other and could never learn different features. Random starting values provide the necessary ‘symmetry-breaking’.
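
To see why identical starting weights are a problem, here is a small experiment. It assumes every weight starts at 0.5 (an arbitrary value picked for illustration) and runs the same update steps as the program for 100 iterations; columns_identical is a hypothetical name for the result of the check.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [0]])

def sigmoid(x): return 1 / (1 + np.exp(-x))
def sigmoid_(s): return s * (1 - s)

# Assumption for illustration: every weight starts at the same value (0.5)
Wh = np.full((2, 3), 0.5)
Wz = np.full((3, 1), 0.5)

for _ in range(100):
    H = sigmoid(X.dot(Wh))
    Z = sigmoid(H.dot(Wz))
    dZ = (Y - Z) * sigmoid_(Z)
    dH = dZ.dot(Wz.T) * sigmoid_(H)
    Wz += H.T.dot(dZ)
    Wh += X.T.dot(dH)

# Each column of Wh holds one hidden neuron's weights; compare all to the first
columns_identical = np.allclose(Wh, Wh[:, [0]])
print(columns_identical)
```

The three hidden neurons remain copies of one another, so the network behaves as if it had a single hidden neuron.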

Now for the learning process:

We’ll make an initial guess using the random initial weights, propagating the input matrix of truth-value pairs through the hidden layer as the dot product of that matrix and the weights. Recall that matrix multiplication proceeds along each row of the first matrix, multiplying each element by the corresponding element down a column of the second, and then summing the products. The resulting matrix goes into the sigmoid function to produce H. So H = sigmoid(X * Wh), where * stands for the matrix product (np.dot in the code).

The same goes for the Z (output) layer: Z = sigmoid(H * Wz).
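
The row-by-column rule can be spelled out with explicit loops. The sketch below uses fixed weights (hypothetical values chosen only so the arithmetic is easy to follow) and builds the product X * Wh one entry at a time; manual is my own name for the result.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Fixed example weights (hypothetical values, chosen only for readability)
Wh = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])

# Explicit loops: entry (r, c) is row r of X times column c of Wh, summed
manual = np.zeros((4, 3))
for r in range(4):
    for c in range(3):
        for k in range(2):
            manual[r, c] += X[r, k] * Wh[k, c]

print(manual)
```

The last row, for the input pair (1, 1), is simply the sum of the two rows of Wh. In the program, np.dot(X, Wh) does all of this in one call.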

Now we compare the guess with the training data, i.e. Y - Z, giving E.

Finally, backpropagation. This comprises computing changes (deltas) which are multiplied (specifically, via the dot product) with the values at the hidden and input layers, to provide increments for the appropriate weights. If any neuron values are zero or very close, then they aren’t contributing much and might as well not be there. The sigmoid derivative used in the backprop is greatest when a neuron’s input is zero (output 0.5) and falls off as the output nears 0 or 1, so it gives the biggest corrections to neurons that are still undecided and helps to push their inputs away from zero. The sigmoid activation function shapes the output at each layer.

E is the final error Y - Z.
dZ is a change factor dependent on this error, scaled by the slope of the sigmoid at Z; if it’s steep we need to change more, if close to zero, not much. The slope is sigmoid_(Z).
dH is dZ backpropagated through the weights Wz, scaled by the slope of the sigmoid at H.

Finally, Wz and Wh are adjusted by applying those deltas to the inputs at their layers (H and X respectively), because the larger an input is, the more its weight contributed to the error and the more that weight needs to be tweaked. The resulting increments follow the downhill direction of the gradient of the cost function; we’re moving the weights down towards its minimum value.
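
You can check the gradient-descent claim numerically. The sketch below assumes the cost function is half the sum of squared errors (the article does not name one, so treat that as my assumption) and compares the program's weight increments with a finite-difference estimate of the gradient. The names cost, numeric_grad, step_Wz and step_Wh are hypothetical.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [0]])

def sigmoid(x): return 1 / (1 + np.exp(-x))
def sigmoid_(s): return s * (1 - s)

def cost(Wh, Wz):
    # Assumed cost function: half the sum of squared errors
    Z = sigmoid(sigmoid(X.dot(Wh)).dot(Wz))
    return 0.5 * np.sum((Y - Z) ** 2)

rng = np.random.default_rng(0)
Wh = rng.uniform(size=(2, 3))
Wz = rng.uniform(size=(3, 1))

# Increments computed the same way as in the program
H = sigmoid(X.dot(Wh))
Z = sigmoid(H.dot(Wz))
dZ = (Y - Z) * sigmoid_(Z)
dH = dZ.dot(Wz.T) * sigmoid_(H)
step_Wz, step_Wh = H.T.dot(dZ), X.T.dot(dH)

def numeric_grad(W, f, eps=1e-6):
    # Central finite differences, nudging one weight at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old                      # restore the original weight
        grad[idx] = (up - down) / (2 * eps)
    return grad

grad_Wz = numeric_grad(Wz, lambda: cost(Wh, Wz))
grad_Wh = numeric_grad(Wh, lambda: cost(Wh, Wz))

# The increments should equal the negative gradient of the cost
print(np.allclose(step_Wz, -grad_Wz, atol=1e-6))
print(np.allclose(step_Wh, -grad_Wh, atol=1e-6))
```

Under that assumption, both increments match the negative gradient, i.e. each update is a plain gradient-descent step with a step size of 1.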

If you want to understand the code at more than a hand-wavey level, study a mathematical derivation of the backpropagation algorithm, such as this one or this one, so that you appreciate the delta rule, which is used to update the weights. Essentially, it’s the partial derivative chain rule doing the backprop grunt work. Even if you don’t fully grok the math derivation, at least check out the 4 equations of backprop, e.g. as listed here (click on the Backpropagation button near the bottom) and here, because those are where the code ultimately derives from.

The X matrix holds the training data, excluding the required output values. Visualise it being rotated 90 degrees clockwise and fed one pair at a time into the input layer (X00 and X01, etc). Each pair goes across every column of the weight matrix Wh for the hidden layer to produce the first row of the result, then the next, etc., until all rows of the input data have gone in. The result is then fed into the activation function to give H, ready for the corresponding step from the hidden layer to the output layer Z.

If you run this program, you should get something like:

[[ 0.01288433]
 [ 0.99223799]
 [ 0.99223787]
 [ 0.00199393]]

You won’t get the exact same results, but the first and last numbers should be close to zero, while the 2 inner numbers should be close to 1. You might have preferred exact 0s and 1s, but our learning process is analogue rather than digital; you could always just insert a final test to convert ‘nearly 0’ to 0, and ‘nearly 1’ to 1!
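
A minimal sketch of such a final test, assuming a cut-off of 0.5 (my choice; any threshold between the two clusters of values would do). The Z used here is just the sample output shown above, and predictions is a hypothetical name.

```python
import numpy as np

# Example network output, copied from the sample run above
Z = np.array([[0.01288433], [0.99223799], [0.99223787], [0.00199393]])

# Anything above 0.5 counts as 1, everything else as 0
predictions = (Z > 0.5).astype(int)
print(predictions.ravel())
```

The thresholded values match the XOR column of the truth table.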

Here’s an improved version: it has no (or linear) activation on the output layer and gets more accurate results faster. It also introduces a learning rate L, which scales down each weight update.

#   XOR.py-A very simple neural network to do exclusive or.
#   sigmoid activation for hidden layer, no (or linear) activation for output
 
import numpy as np
 
epochs = 20000                                  # Number of iterations
inputLayerSize, hiddenLayerSize, outputLayerSize = 2, 3, 1
L = .1                                          # learning rate      
 
X = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = np.array([ [0],   [1],   [1],   [0]])
 
def sigmoid (x): return 1/(1 + np.exp(-x))      # activation function
def sigmoid_(x): return x * (1 - x)             # sigmoid slope, given x = sigmoid(input)
                                                # weights on layer inputs
Wh = np.random.uniform(size=(inputLayerSize, hiddenLayerSize))
Wz = np.random.uniform(size=(hiddenLayerSize,outputLayerSize))
 
for i in range(epochs):
 
    H = sigmoid(np.dot(X, Wh))                  # hidden layer results
    Z = np.dot(H,Wz)                            # output layer, no activation
    E = Y - Z                                   # how much we missed (error)
    dZ = E * L                                  # delta Z, scaled by learning rate
    dH = dZ.dot(Wz.T) * sigmoid_(H)             # delta H, from Wz before its update
    Wz +=  H.T.dot(dZ)                          # update output layer weights
    Wh +=  X.T.dot(dH)                          # update hidden layer weights
 
print(Z)                # what have we learnt?

Output should look something like this:

[[ 6.66133815e-15]
 [ 1.00000000e+00]
 [ 8.88178420e-15]]
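
If you want to watch the learning happen rather than just inspect the final output, you could record the error at every epoch. The sketch below wraps the linear-output network in a function so that the hidden layer size and learning rate are easy to vary. The function train, its parameters (hidden_size, epochs, lr, seed) and the use of mean squared error as the measure are all my own choices for illustration, not part of the program above. It computes dH from the weights as they stood during the forward pass.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [1], [1], [0]])

def sigmoid(x): return 1 / (1 + np.exp(-x))
def sigmoid_(s): return s * (1 - s)

def train(hidden_size=3, epochs=2000, lr=0.1, seed=0):
    """Train the linear-output network; return the mean squared error per epoch."""
    rng = np.random.default_rng(seed)           # fixed seed for repeatable runs
    Wh = rng.uniform(size=(2, hidden_size))
    Wz = rng.uniform(size=(hidden_size, 1))
    losses = []
    for _ in range(epochs):
        H = sigmoid(X.dot(Wh))                  # hidden layer results
        Z = H.dot(Wz)                           # output layer, no activation
        E = Y - Z                               # error
        losses.append(float(np.mean(E ** 2)))   # record error before updating
        dZ = E * lr
        dH = dZ.dot(Wz.T) * sigmoid_(H)         # uses Wz before it is updated
        Wz += H.T.dot(dZ)
        Wh += X.T.dot(dH)
    return losses

losses = train()
print(losses[0], losses[-1])
```

Calling train(hidden_size=2) or train(lr=0.5) and comparing the returned lists is one way to experiment with the hyperparameters.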

Part 2 will build on this example, introducing biases, graphical visualisation, learning a math function (sine), etc…