Project 0 - Machine Learning and Neural Networks
- James Canova
- Sep 26, 2021
- 4 min read
Updated: Mar 24, 2024
This page can be downloaded as a PDF document from my GitHub repository.
This is an introduction to the theory of machine learning and neural networks.

The equation for the straight line in the figure is y = ax + b. The parameters 'a' and 'b' can be chosen to minimise, for example, the sum of the squares of the differences between the true values of 'y' and those predicted by the straight line. In the language of machine learning, this is called 'training'.
If the data points (the dataset) change, then the parameters 'a' and 'b' are updated, and this can be done programmatically by simply re-training. This is of great advantage if a program is large, with thousands or hundreds of thousands of parameters, because there is no need to change them manually.
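As a minimal sketch of this 'training' step (the data values below are invented purely for illustration), NumPy's polyfit finds the 'a' and 'b' that minimise the sum of squared errors:

```python
import numpy as np

# Invented dataset for illustration: x values and noisy y values
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# 'Training': choose a and b to minimise the sum of squared errors
a, b = np.polyfit(x, y, deg=1)

# Predictions from the trained line y = a*x + b
y_pred = a * x + b
print(f"a = {a:.2f}, b = {b:.2f}")
print("sum of squared errors:", np.sum((y - y_pred) ** 2))
```

If the dataset changes, re-running this snippet re-trains the line automatically, which is the advantage described above.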
A more complicated example of machine learning, and one in very widespread use, is the 'artificial' neural network, as opposed to the 'biological' neural network. The term 'neural network' will be used here to refer to an 'artificial' neural network.
An artificial neural network is analogous to the biological neural network with both consisting of interconnected neurons.
Biological neural networks are part of just about every living creature. A worm has a few hundred neurons, a frog has about 400 million neurons, which is about the number used in a modern very large neural network running on a supercomputer, and the human brain has about 86 billion neurons.
The structure of a very simple ANN is shown below; it mirrors the most basic form of a biological neural network.

The inputs are analogous to the values of x in the linear regression example, and the output(s) are analogous to the values of y.
There can be more than one hidden layer. A typical vision system has about 100 layers.
Ignore the circles with the 1s for now. The remaining circles are the neurons, or nodes, or more properly perceptrons, since they are artificial representations of neurons.
The biases, the Bs, act like the intercept 'b' in the linear regression example: they shift each neuron's output by a constant amount. This topic is beyond my current understanding, but I have found by experimentation that biases can sometimes be excluded.
Whether or not they are needed does not depend on the complexity of a problem but on the data itself: if the data cannot be fitted without such an offset, then biases are required.
Each neuron, except for the input neurons, is modelled as a weighted sum of its input signals passed through an activation function, much as a biological neuron sums the signals it receives.

There are many types of activation functions, and much research goes into understanding which is best suited to the problem being solved.
A very common activation function used with ANNs is the sigmoid function.
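As a small sketch of a single neuron (the weights, bias and inputs below are arbitrary illustration values, not from any trained network), the weighted sum of the inputs plus the bias is passed through the sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for a neuron with three inputs
inputs = np.array([0.5, -1.2, 0.3])
weights = np.array([0.8, 0.1, -0.4])   # the Ws
bias = 0.2                             # the B

# Weighted sum of inputs plus bias, passed through the activation function
output = sigmoid(np.dot(weights, inputs) + bias)
print(output)
```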

Given these details of the structure of an ANN, the weights and biases (the Ws and Bs) are calculated in order to minimise a cost function, also known as a loss function. Such a function can be, for example, the sum of the squares of the differences between calculated and true results. The process of calculating the Ws and Bs is called training.
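For example, a sum-of-squares cost function can be written in a couple of lines (the values passed in below are invented):

```python
import numpy as np

def cost(y_true, y_pred):
    # Sum of the squared differences between true and calculated outputs
    return np.sum((y_true - y_pred) ** 2)

# Invented example: three true outputs and three calculated outputs
print(cost(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.7])))
```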

The process has these steps:
1. Initialisation of all weights and biases with random numbers between -1 and 1
2. Forward propagation
3. Backwards propagation
4. Repeated application of steps 2 and 3
Steps 2 and 3 are repeated until the cost function is minimised. Each repetition is called an epoch, and the number of epochs is a hyperparameter, a value that controls the training process. It is important to note that the complete dataset is processed during each epoch.
Forward propagation is simply applying the input and computing all downstream values to produce a predicted output.
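As a sketch of forward propagation through a network with one hidden layer (the layer sizes and random weights are invented for illustration), each layer is just a weighted sum plus a bias, passed through the activation function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Invented sizes: 3 inputs, 4 hidden neurons, 1 output
W_hidden = rng.uniform(-1, 1, size=(4, 3))   # weights into the hidden layer
b_hidden = rng.uniform(-1, 1, size=4)        # hidden-layer biases
W_output = rng.uniform(-1, 1, size=(1, 4))   # weights into the output layer
b_output = rng.uniform(-1, 1, size=1)        # output-layer bias

x = np.array([0.2, 0.7, -0.1])               # an example input

# Forward propagation: weighted sums plus biases, through the activation
hidden = sigmoid(W_hidden @ x + b_hidden)
output = sigmoid(W_output @ hidden + b_output)
print(output)
```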
Backwards propagation is more complicated. The calculus chain rule is used, together with the error between the known output and the predicted output, to update the weights and biases.
After each application of backwards propagation, the error moves towards the minimum. This is known as gradient descent. This descent can be seen in the figure below.
Another important hyperparameter is the learning rate, which scales the change made to the weights and biases during each backwards propagation. If it is too small then training takes unnecessarily long. If it is too large then the minimum can be overshot and the calculations become unstable.
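Putting the steps together for the simplest possible case, a single sigmoid neuron, gives a sketch like the one below. The dataset, learning rate and epoch count are invented for illustration; a real ANN applies the same idea across many layers of weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented toy dataset: two inputs per example, one target output (logical OR)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
# Step 1: initialise weights and bias with random numbers between -1 and 1
w = rng.uniform(-1, 1, size=2)
b = rng.uniform(-1, 1)

learning_rate = 0.5   # hyperparameter: size of each update
epochs = 1000         # hyperparameter: passes over the complete dataset

for epoch in range(epochs):
    # Step 2: forward propagation over the complete dataset
    z = X @ w + b
    y_pred = sigmoid(z)
    error = y_pred - y

    # Step 3: backwards propagation (chain rule) to get the gradients
    grad_z = 2 * error * y_pred * (1 - y_pred)
    grad_w = X.T @ grad_z
    grad_b = np.sum(grad_z)

    # Gradient descent: move the weights and bias a small step downhill
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_pred = sigmoid(X @ w + b)
print("trained weights:", w, "bias:", b)
print("final cost:", np.sum((final_pred - y) ** 2))
```

Increasing the learning rate speeds up the descent but risks overshooting the minimum, as described above.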

Training is very time consuming and requires a lot of computing power. My most recent project was an ANN to detect and read car licence plates, a real-world application of object detection. My home computers froze, so I resorted to Google Colab, a free cloud computing service. Training took 3 hours and 15 minutes.
Once a network is trained, predicted results are obtained by applying a single forward propagation step. This is called inference. It is much quicker and easier than training, taking seconds or fractions of a second.
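Continuing the single-neuron sketch, inference is just one forward pass using the already-trained weights (the weight values and the new input below are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Trained weights and bias from a previous training run (illustrative values)
w = np.array([4.8, 4.8])
b = -2.2

# Inference: a single forward propagation on a new, unseen input
new_input = np.array([1.0, 0.0])
probability = sigmoid(new_input @ w + b)
print(f"predicted probability: {probability:.3f}")
```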
Inference is a good word to use because the results generated by a neural network are provided as probabilities. In the 1990s probabilities in the range of 60% were typical; now 95% or more seems typical.
In addition to the book Make Your Own Neural Network (ref. 1 in the References post), please see the following article. Amongst other topics, it covers the use of biases, which are not covered in the book.
[1] Wikipedia
If you have any problems or need clarification please contact me: jscanova@gmail.com