Deep neural networks are simply extremely high-dimensional (i.e., many-parameter) composite functions implemented in matrix form. That is, a function can be represented as a matrix operation which, when applied to another matrix (the input), produces yet another matrix transformed in some way (the output). Neural networks are a chain of such transformations, taking an input (at the input layer) and producing an output (at the output layer), where each of these layers, including each layer in between (a.k.a. hidden layers), is an n-dimensional array (a.k.a. a matrix). Each of these matrices is populated and defined by numbers which scale their respective inputs: the weights and a bias vector. The weights are applied such that for a weights matrix \(W\), an input \(X\), and a bias vector \(b\), we get an output matrix \(Z\) defined as
\[Z = W^{T}X + b\]
To the output \(Z\) we apply one of several non-linear activation functions, which transforms each element of \(Z\) to produce the matrix \(A\): \[A = \sigma(Z)\] This setup is often visualized as the following for a fully-connected (all outputs go to all inputs) feed-forward neural network (Figure 1), meaning there are no loops: information is always fed forward, input to output. Here, each layer is a matrix appropriately shaped to receive the inputs from the previous layer (i.e., so that the matrix operations are legal, with compatible shapes).
Figure 1: Fully connected feed-forward neural network diagram
knitr::include_graphics("neural_network_diagram.png")
In this example, we can assume that the input vector is a 2-dimensional numeric vector (the input has to be in numeric form) because there are 2 arrows coming into each node in the first layer. From this, we can calculate that the first layer has \(5 \text{ nodes} \cdot 2 \text{ inputs per node} = 10 \text{ weights}\). The units or nodes of this network are separate in that each gets its own copy of the previous layer's output to be transformed by the weights that it holds.
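To make those shapes concrete, here is a minimal R sketch of the single-layer transformation \(Z = W^{T}X + b\) followed by the activation, matching the 2-input, 5-node first layer described above (the weight and input values are made up purely for illustration):

```r
# A single fully-connected layer: 2 inputs -> 5 nodes.
# W has one row per input and one column per node, so W is 2 x 5,
# and the layer holds 5 nodes * 2 inputs = 10 weights plus 5 biases.
set.seed(42)
W <- matrix(rnorm(2 * 5), nrow = 2, ncol = 5)  # random weights, for now
b <- rep(0, 5)                                 # bias vector
x <- c(0.5, -1.2)                              # a 2-dimensional numeric input

Z <- t(W) %*% x + b        # Z = W^T x + b, a 5 x 1 matrix
sigma <- function(z) 1 / (1 + exp(-z))
A <- sigma(Z)              # element-wise non-linear activation
A                          # one activation per node, fed to the next layer
```

Stacking more of these layers, each one's \(A\) becoming the next one's input, gives the full chain of transformations.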
A high-dimensional composite function thus defined isn't much use to anybody (yet). The weights are random numbers, and as it stands, sending in input will just give us random output. What makes it interesting is that we can tweak the parameters of this function (i.e., the numbers in the matrices that define the nodes) such that the output is something that we want. This is the 'learning' in deep learning. The 'deep' in deep learning simply refers to the fact that we can have an arbitrarily deep chain of transformation after transformation, building one large, elaborate function. For example, the following image is one such elaborate function where the input is a spectrogram (an audio signal as an image) and the output is a prediction of what language is being spoken. Every arrow in this image is the output of some function (implemented as matrix transformations) going in as input to another function.
Figure 2: A deep neural network. A sequence of elaborate transformations to the input.
knitr::include_graphics("a deep net.png")
It has been shown (with math n stuff) that, theoretically, this elaborate function, depending on how elaborate you can afford to make it, can approximate any function (this is the universal approximation theorem). Insofar as the thing we care about (e.g., correctly classifying a language) can be modeled as the output of some function, the weights of a neural network can be tweaked to approximate it.
But, if we already knew the function that generated the output we cared about, we wouldn’t need this thing. We could just use the function to generate our output. How is this useful?
At last, we get to the main reason why this 'universal approximator' is interesting, and the crucial insight:
Can't we fiddle with the weights of \(f\) till its output matches the output of the function we care about? Why yes, yes we can. Have you ever had to hang something on a wall with a friend standing behind you telling you how to move the thing so that it's straight? That's pretty much what is going to happen here.
Okay, so we have this thing that we know can be fiddled with to learn any function (subject to the realities of GPU shortages, energy costs, and the like). The question now is: how do we do the fiddling?
Remember that the whole reason we're trying to do this is that the actual function \(F\) we care about is, well, too complicated. There are two ways to look at this: we are never able to observe all the possible outputs of \(F\), and we are never able to test all the possible inputs that \(F\) can handle.

But we need to approximate \(F\) somehow, and the best indication we have of what it's actually doing is the samples of data it produces or consumes. So when we collect a million images of what we consider to be a cat, we're collecting a million examples of the output of \(F\). Not all the cats in the world will be photographed, and even if they were, they wouldn't all be in our sample. We've just decided that in this collection of images there is some information about "what it means to be a cat," layered in with a bunch of other stuff that has nothing to do with what a cat is (e.g., the color of the background or the brightness of the image).
Our goal now is to fiddle with \(f\) until its output approximately matches the sample of data from \(F\) that we've collected. If we've collected a representative sample and manage to disregard the unrelated details in it, we should be able to get \(f\) to approximate \(F\) well enough to satisfy the requirements of our problem.
This is why a big part of deep learning (and machine learning in general) is getting the data we need to approximate these insanely complicated \(F\) functions that we care about. The more high-quality data we have, the more closely we can approximate \(F\). In the hanging-something-on-the-wall analogy, it would be pretty hard to know what your friend wants if every so often she tells you to move the thing in a completely random direction. But if you have enough examples of her instructions, you might be able to ignore these random instructions and approximate what she actually wants.
What we do now is take this data that we've collected, pump it through \(f\), observe the outputs, and make the adjustments necessary until it produces non-garbage.
Okay, I have all this data and my function accepts it gladly, but it doesn't produce the output I want. How do I teach this function what I want?
Well, now that we know this \(f\) is just a function, we can turn to the tools of calculus, specifically, to methods of differentiation. Calculus is the mathematics of change. It lets us take functions and ask: how will you change if I take your input from here to there?
So we've sent some data through \(f\) and we have some output. We need a way to quantify how different this output is compared to that of \(F\). Depending on the problem, there's a cost function for it. It's called a cost function because if you consider every output from \(f\) for a given example a bet that you place, how far off you are from the actual result is how much you lose. If you sum all of these losses up over all the training examples, you get the total cost of all your bets. (People use the terms loss and cost interchangeably, though, so don't sweat it.)
Once again, we have another function. What we need to do is figure out what has to happen to make this cost as small as possible. A loss for a single training example \(i\) looks something like this: \[Loss_i = difference(actual_i, predicted_i)\] and the cost is the average of all the losses: \[Cost(model,\ training\ data) = \frac{1}{N}\sum_{i=1}^{N} Loss_i\]
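As a concrete (and entirely hypothetical) choice of \(difference\), here is the squared error written in R, with the cost as the average loss over the training examples:

```r
# One common choice of "difference": the squared error for each example.
loss <- function(actual, predicted) (actual - predicted)^2

# The cost averages the losses over all N training examples.
cost <- function(actual, predicted) mean(loss(actual, predicted))

actual    <- c(1.0, 0.0, 1.0, 1.0)  # toy targets
predicted <- c(0.8, 0.2, 0.6, 0.9)  # toy model outputs
cost(actual, predicted)             # (0.04 + 0.04 + 0.16 + 0.01) / 4 = 0.0625
```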
The most important thing about this function is that it is chosen to be continuous and differentiable with respect to the parameters. This will be important in what comes next.
Now, notice that the cost function is a function composed of other functions: specifically, the model \(f\) and our training data. The training data is what it is: we can't really do anything about it to improve our cost. (Well, this isn't totally true. Suppose we know our data set is too small to be representative of the behavior of \(F\) that we want to emulate. We can employ data augmentation techniques to generate new samples that are meaningful representations of what \(F\) might produce. We can also mess with the data by adding information that we would like \(f\) to ignore.)
For the most part though, we have the data we have and the only option left is to mess with the weights of our network, \(f\).
Now imagine that in Figure 3, the curvy surface represents all the possible values that the cost function can produce. For any given value of \(weight_1\) and \(weight_2\), when we pass the training data through the model, we get a cost. (Note that this is just some random function generated for the purpose of demonstration. In reality, if you have a million weights in your function, this will be a million-dimensional surface.) I'm only playing the greatest hits here; for the full album, check out this playlist by 3Blue1Brown.
Figure 3: A cost surface
knitr::include_graphics("a func surface.png")
What we want to figure out is what the values of \(weight_1\) and \(weight_2\) are at the lowest point of this surface. Recall that the weights of our function are initialized to some random values (subject to initialization techniques that improve the speed of finding the lowest point). So we start somewhere on this surface (the cost at initialization) and we need to figure out how to get to the lowest point.
We need to step in a direction, and the only guide we have is the landscape. So we need to feel around with our feet and gauge whether we're moving down this surface or up it. For example, we want to know what moving in a direction that increases \(weight_2\) feels like: is it taking me higher up the surface or lower? Similarly for \(weight_1\).
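That foot-stepping can be written down literally as a finite-difference probe: nudge one weight a tiny bit, recompute the cost, and compare. Here is a sketch in R, with a made-up bowl-shaped cost standing in for the real surface:

```r
# Estimate the slope of the cost along each weight by nudging that
# weight a tiny amount and seeing how the cost responds.
finite_diff_gradient <- function(cost_fn, weights, eps = 1e-6) {
  sapply(seq_along(weights), function(i) {
    nudged    <- weights
    nudged[i] <- nudged[i] + eps
    (cost_fn(nudged) - cost_fn(weights)) / eps
  })
}

# A toy bowl-shaped cost surface over weight_1 and weight_2.
toy_cost <- function(w) (w[1] - 3)^2 + (w[2] + 1)^2

finite_diff_gradient(toy_cost, c(0, 0))
# roughly c(-6, 2): the cost falls if we increase weight_1
# and decrease weight_2, so that's the direction to step.
```

In practice, back propagation (below) gives us these slopes exactly and far more cheaply, but the finite-difference picture is a faithful description of what the gradient means, and it's handy for sanity-checking a back propagation implementation.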
Given a multivariate (many-variable) function (i.e., \(Cost(f, data)\)), we want to know how it changes with respect to its inputs. The inputs that we care about in this case are the weights of our network \(f\), assuming the data stays constant.
Imagine our network as a composite \(f\) like this, with input \(x_i\) (just two weights for simplicity, but this could be an arbitrarily deep composition of functions where \(g\) is a function of another function, etc.): \[Cost = Difference(f(weight_1, weight_2, g(x_i)), Actual)\]
Figuring out how the cost changes with respect to the weights of \(f\) is nothing more than a very long partial differentiation using the multivariate chain rule. That is, we take the partial derivative of the cost with respect to each of the weights in our network.
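For a concrete instance of that chain rule, take a toy single-weight 'network' \(a = \sigma(z)\) with \(z = wx + b\) and squared-error cost \(C = (a - y)^2\). Unrolling the composition:

\[
\frac{\partial C}{\partial w} = \frac{\partial C}{\partial a}\cdot\frac{\partial a}{\partial z}\cdot\frac{\partial z}{\partial w} = 2(a - y)\cdot\sigma'(z)\cdot x
\]

Each factor is local to one step of the composition; a deeper network just multiplies more of these local factors together, and organizing that multiplication efficiently is exactly what back propagation does.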
The term "back propagation" itself offers a clue: we are propagating the error "backwards" through the network. Once you calculate the error at the end (at the output layer), you distribute, or propagate, this error backward through the layers of the network in order to adjust the weights accordingly.
Here's a simple breakdown of how this works (a minimal, runnable sketch follows the list):
Feed Forward: Input is passed through the network, layer by layer, until it reaches the output layer. This produces the predicted output using the current weights and biases.
Calculate the Error: Compute the difference between the predicted output and the actual target values, typically using the chosen cost function.
Backward Pass:
Using calculus, find the gradient of the error with respect to each weight. This essentially tells you the direction and magnitude of change to minimize the error.
Adjust the weights and biases of the network based on the computed gradient. This is done using optimization techniques, the most common being Gradient Descent. By adjusting the weights in the direction of the negative gradient, the error is minimized.
The error gradient is passed backwards through the network, layer by layer, allowing each set of weights to be adjusted in proportion to how much they contributed to the total error. This is achieved through the chain rule of differentiation.
Iterate: Repeat these steps for a set number of iterations or until the network performs satisfactorily.
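Here is the minimal sketch promised above: a one-hidden-layer network trained by gradient descent in R, with every size, learning rate, and data point made up for illustration. Note that here the weight matrices are stored with one row per node, so the forward pass reads \(WX + b\) rather than \(W^{T}X + b\); it's the same operation with the matrix transposed up front.

```r
sigma       <- function(z) 1 / (1 + exp(-z))
sigma_prime <- function(z) sigma(z) * (1 - sigma(z))

# Toy data: 4 examples of 2 features each (columns), XOR-style targets.
X <- matrix(c(0, 0,  0, 1,  1, 0,  1, 1), nrow = 2)
Y <- matrix(c(0, 1, 1, 0), nrow = 1)

# Random initialization: 2 inputs -> 3 hidden nodes -> 1 output.
set.seed(1)
W1 <- matrix(rnorm(3 * 2), nrow = 3); b1 <- rep(0, 3)
W2 <- matrix(rnorm(1 * 3), nrow = 1); b2 <- 0
lr <- 0.5  # learning rate: how big a step to take down the surface

for (step in 1:5000) {
  # 1. Feed forward: layer by layer, input to output.
  Z1 <- W1 %*% X + b1;  A1 <- sigma(Z1)
  Z2 <- W2 %*% A1 + b2; A2 <- sigma(Z2)

  # 2. Calculate the error (mean squared error cost).
  current_cost <- mean((A2 - Y)^2)

  # 3. Backward pass: chain rule, from the output layer back to the input.
  dZ2 <- 2 * (A2 - Y) * sigma_prime(Z2)     # error at the output layer
  dW2 <- dZ2 %*% t(A1) / ncol(X)
  db2 <- mean(dZ2)
  dZ1 <- (t(W2) %*% dZ2) * sigma_prime(Z1)  # error propagated backwards
  dW1 <- dZ1 %*% t(X) / ncol(X)
  db1 <- rowMeans(dZ1)

  # 4. Adjust weights in the direction of the negative gradient.
  W2 <- W2 - lr * dW2; b2 <- b2 - lr * db2
  W1 <- W1 - lr * dW1; b1 <- b1 - lr * db1
}

round(sigma(W2 %*% sigma(W1 %*% X + b1) + b2), 2)
# should approach Y = (0, 1, 1, 0), though how close it gets
# depends on the random seed and the number of iterations
```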
The beauty of back propagation is its efficiency. By using the chain rule, it cleverly calculates the gradients for each layer without redundant recalculations. Without back propagation, updating a deep neural network would be computationally prohibitive.
However, it’s important to note that the success of training a network isn’t solely dependent on back propagation. Other elements like the choice of activation functions, optimization algorithms (like momentum or RMSprop), initialization techniques, the network architecture, and regularization all play crucial roles in the performance of a neural network.
In essence, neural networks 'learn' by iteratively adjusting their weights to minimize the error. This process is steered by back propagation and optimized using algorithms like gradient descent. The network's aim is to reach a state where it has mimicked, as closely as possible, the mysterious function \(F\) through many rounds of feedback and refinement.