## How a Neural Network Works | Deep Learning | Machine Learning

- September 14, 2020
- By Saurav prasad

You must be wondering what is happening here. This GIF image explains the whole story of a Neural Network. You may not understand it now, but after reading this post, you will appreciate this GIF.

So let us dive deep into the nitty-gritty of neural networks.

Above we have two images: one represents a biological neuron, and the other a perceptron, which is a mathematical representation of a neuron.

**Let's talk a little bit about the neuron.**

Here we have three main components of the neuron:

- **Dendrites**: The work of the dendrites is to receive a signal from other neurons or from a sensory organ like the eyes, tongue, etc.
- **Nucleus**: Here, some transformation happens in the nucleus with the received signal.
- **Axon**: Once the signal is transformed, it is passed on to another neuron with the help of the axon.

**So how does a perceptron work?**

As we know, the perceptron imitates a biological neuron. Neurons transmit sensory and other information to other neurons, and a perceptron can do the same.

Here we see that two inputs are fed to the perceptron, and it returns an output just like our neurons do. So what happens inside that circle? We perform some operation on the incoming data with the help of a mathematical function. In the above case, it is the addition of the two inputs. So if the two inputs are 2 and 3, the output from the perceptron is 5.

**So how does the perceptron learn?**

A perceptron learns by adjusting its weights. Each input has its own weight, or importance, and a weight can be positive or negative. In our case, suppose we want the perceptron to return the average of the two numbers. We give each weight a value of 0.5, and with the inputs 2 and 3 we get an output of 2.5. These weights are adjusted during training, which we will talk about in a later part of the blog.
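To make the arithmetic concrete, here is a minimal sketch of the weighted sum in plain Python. The function name `perceptron` is just an illustrative choice, and the sketch leaves out the bias and activation function, which are covered later.

```python
def perceptron(inputs, weights):
    """Return the weighted sum of the inputs (no bias, no activation)."""
    return sum(x * w for x, w in zip(inputs, weights))


# Weights of 1 reproduce the plain addition from the earlier example.
plain_sum = perceptron([2, 3], [1.0, 1.0])

# Weights of 0.5 turn the same perceptron into an averaging unit.
average = perceptron([2, 3], [0.5, 0.5])

print(plain_sum, average)  # 5.0 2.5
```

The inputs did not change between the two calls. Only the weights did, and that is the sense in which the weights hold what the perceptron has "learned".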

**What is bias?**

Bias is a term that we add to our neural network. It helps us model real-life scenarios more precisely. The simplest example is a regression line. Suppose we want to build a regression model using a neural network. We need to introduce the intercept, and that is done using the bias. Below is the side-by-side comparison.
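As a rough illustration of the regression analogy, the sketch below treats a one-input neuron as the line y = weight * x + bias. The slope and intercept values are made up for the example.

```python
def neuron(x, weight, bias):
    """One-input neuron with a bias: the same form as a regression line."""
    return weight * x + bias


slope, intercept = 2.0, 1.0  # hypothetical values, chosen only for illustration

# Without a bias, the output at x = 0 is always 0: the line is stuck at the origin.
without_bias = neuron(0.0, slope, 0.0)

# With a bias, the line can cross the y-axis wherever the data needs it to.
with_bias = neuron(0.0, slope, intercept)

print(without_bias, with_bias)  # 0.0 1.0
```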

**What is a Neural Network?**

A single perceptron is not capable enough to model complex problems. To overcome this, we build a network of perceptrons connected layer by layer, also known as the multi-layer perceptron model. Above is a fully connected multi-layer perceptron model.

In this neural network model, the output of one layer is the input for the next layer. Each neuron in one layer is connected to every neuron in the next layer, and this is known as a fully connected layer. This dense connection allows the network to learn complex features that some other algorithms cannot learn. Building on this idea, there are several other kinds of networks, like CNNs and RNNs.
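The layer-to-layer flow described above can be sketched in a few lines of plain Python. The helper `dense_layer` and all the weight values are hypothetical, and the activation function is left out to keep the arithmetic easy to follow.

```python
def dense_layer(inputs, weights, biases):
    """Fully connected layer: every output neuron sees every input."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # one row of weights per hidden neuron
hidden_b = [0.0, 0.0, 1.0]
output_w = [[1.0, 1.0, 1.0]]
output_b = [0.0]

hidden = dense_layer([2.0, 3.0], hidden_w, hidden_b)
output = dense_layer(hidden, output_w, output_b)  # previous output becomes next input

print(hidden, output)  # [2.5, -1.0, 7.0] [8.5]
```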

**What are the different parts of the Neural Network?**

**Input layer** - This layer directly receives the data, in tabular form, for which we are trying to predict the outcome.

**Hidden Layers** - These layers are more complex than the other two because of the high inter-connectivity among their neurons, and it is tough for a human to interpret what is going on inside them. As we keep adding hidden layers, the network becomes more intricate. A network with more than two hidden layers is commonly called a deep neural network.

**Output Layer** - This is the last layer in the network, and it gives the outcome we are trying to predict. It can have a single neuron or multiple neurons. For a regression task with one target, the last layer has only one neuron, while for classification it usually has one neuron per class (binary classification can also use a single neuron).

**Width** - The number of neurons present in a single layer.

**Depth** - The number of layers present in the neural network.

**What is the Activation Function?**

In our brain, not every neuron transmits a signal. A neuron only passes information on when its electrical signal surpasses a certain threshold, and the same is the case with our perceptrons. So how does the neural network do this?

It does this by using a mathematical function. When values are fed into this function, it decides whether to give an output or not. That is why we call it the activation function: it decides whether to activate a neuron or not. The activation function helps reduce noise by deactivating neurons. In the image, we can see that only some neurons fire, which is what happens during training.

**Different Types of Activation Function:**

**ReLU (Rectified Linear Unit):**

ReLU stands for rectified linear unit. If the input is negative, it does not fire the neuron, that is, it gives zero as the output. If the input is positive, it fires the neuron and passes the input through unchanged. Because of this non-linearity, the network can learn complex non-linear relationships in the data.

This function has become the default choice for many deep learning models. The reason is that, with this function, models tend to take less training time and still give quite accurate results.
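ReLU is simple enough to write in one line. This is a minimal sketch with made-up sample inputs, not how a deep learning library implements it internally.

```python
def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


outputs = [relu(x) for x in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```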

**Sigmoid (S-shape Function):**

It is an S-shaped function, also known as the logistic function, whose values saturate at the ends of the curve. To be more precise, the curve has a pair of horizontal asymptotes as x tends to positive and negative infinity. The values of this function range from 0 to 1, which makes it useful in binary classification. It is easy to understand and is mostly used in shallow neural networks, that is, networks with one hidden layer. Its major drawbacks are the sharp drop in the gradient during backpropagation, gradient saturation, and so on.

In Deep Learning, there are three variants of this function.

- *Hard sigmoid*
- *Sigmoid-Weighted Linear Units*
- *Derivative of Sigmoid-Weighted Linear Units*
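To see the saturation of the plain sigmoid in numbers, here is a small sketch using Python's `math` module. The helper names and the sample inputs are illustrative choices. The gradient formula s * (1 - s) is the standard derivative of the logistic function.

```python
import math


def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_gradient(x):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The output flattens near 0 and 1, and the gradient shrinks toward 0 at both ends.
for x in [-10.0, 0.0, 10.0]:
    print(x, round(sigmoid(x), 5), round(sigmoid_gradient(x), 5))
```

At x = 0 the gradient is at its largest value, 0.25. Far from zero it is almost nothing, which is the gradient saturation mentioned above.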

**Tanh (Hyperbolic Tangent):**

Tanh, the hyperbolic tangent function, is another activation function used in deep learning tasks. Its range is from -1 to 1, centered on zero. It performs better than the sigmoid when training multi-layer neural networks, but it does not solve the vanishing gradient problem that the sigmoid also suffers from. It is widely used in LSTM (long short-term memory) recurrent neural networks.
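A quick sketch with `math.tanh` shows both properties: the output is centered on zero, and the gradient still vanishes for large inputs. The helper `tanh_gradient` is an illustrative name, and the formula 1 - tanh(x)^2 is the standard derivative.

```python
import math


def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


values = [math.tanh(x) for x in [-5.0, 0.0, 5.0]]
gradients = [tanh_gradient(x) for x in [-5.0, 0.0, 5.0]]

# Outputs sit close to -1, exactly 0, and close to 1.
print([round(v, 4) for v in values])
# The gradient is 1 at zero but nearly 0 at the extremes.
print([round(g, 4) for g in gradients])
```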

There are many other activation functions in deep learning. In case you want to dive deep into the different types of activation functions and their variations, here is the paper: "Activation Functions: Comparison of Trends in Practice and Research for Deep Learning" by Chigozie Enyinna Nwankpa, Winifred Ijomah, Anthony Gachagan, and Stephen Marshall.

**What is a Cost Function?**

The cost function helps evaluate the performance of the model. It tells us how far off our prediction is from the actual value. In the above GIF image, we saw that the jet cost $70K and Gru was $5K off from the actual value. That is a basic example. In deep learning we deal with large datasets, so there are more complex cost functions for evaluating the performance of a neural network.

**For the regression task:**

- SSE (Sum of Squared Errors) - It is the sum of the squared differences between the predicted values and the original values, also known as the residual sum of squares. We aim to minimize this error as much as possible. So what does it tell us? It tells us how much of the variance in the data the model was not able to explain.

- RMSE (Root Mean Square Error) - Another way to calculate the error: the square root of the mean of the squared errors. It tells us how spread out the errors are.

- MAE (Mean Absolute Error) - The sum of the absolute differences between the predicted and original values, divided by the number of observations.
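The three regression metrics above can be sketched directly from their definitions. The function names and the four sample points are made up for illustration.

```python
import math


def sse(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(sse(actual, predicted) / len(actual))


def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]  # errors of 1, 0, -2, 0

print(sse(actual, predicted))   # 5.0
print(rmse(actual, predicted))  # square root of 5 / 4
print(mae(actual, predicted))   # 0.75
```

Notice that the single error of size 2 contributes 4 of the 5 units of SSE. Squared-error metrics punish large misses more heavily than MAE does.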

**For the classification task:**

**Cross-entropy:** This is the most commonly used loss function for classification tasks in machine learning. Here we calculate the log loss across the n classes for each sample and average it over the samples. It can be used for binary as well as multi-class classification.
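Here is a minimal sketch of cross-entropy, assuming the true labels are given as class indices and the model outputs one probability per class. The function name and the sample probabilities are hypothetical. With one correct class per sample, the loss reduces to the negative log of the probability given to that class.

```python
import math


def cross_entropy(true_labels, predicted_probs):
    """Average negative log-probability assigned to the correct class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)


labels = [0, 2, 1]  # index of the correct class for each of three samples

# A model that puts most of its probability on the right class...
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
# ...versus one that is right, but only barely.
unsure = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]

print(cross_entropy(labels, confident))
print(cross_entropy(labels, unsure))
```

Both models pick the correct class every time, yet the confident one gets the lower loss. Cross-entropy rewards being sure as well as being right.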

*So you must be wondering how we minimize these errors. It is done using optimization algorithms. Below are a few of them.*

**Optimization Algorithms:**

**Gradient Descent - Going down the hill**

The best way to understand this problem is to imagine a blind person on a hill who has to get down. To descend, he needs two pieces of information: first, whether he is going down or up, and second, what his step size should be so that he can reach the bottom as soon as possible.

Now let's talk about it in mathematical terms. Gradient descent is a first-order optimization algorithm that tries to find a local minimum. It has two ingredients: the first is the direction, given by the negative of the gradient, and the second is the step size, most often called the learning rate in deep learning.
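The hill analogy translates into a very short loop. This is a sketch on a made-up one-dimensional cost, f(x) = (x - 3)^2, whose gradient is 2(x - 3). The function name, starting point, learning rate, and step count are all illustrative assumptions.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # direction: downhill; size: learning rate
    return x


# Hypothetical cost f(x) = (x - 3)^2 has its minimum at x = 3.
minimum = gradient_descent(
    gradient=lambda x: 2.0 * (x - 3.0),
    start=0.0,
    learning_rate=0.1,
    steps=100,
)

print(round(minimum, 6))  # 3.0
```

With a learning rate that is too large, the steps overshoot the bottom of the hill. With one that is too small, the walk down takes much longer.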

**Stochastic Gradient Descent**

Gradient descent is computationally expensive because every update needs a pass over the whole dataset. To overcome this problem, we more often use stochastic gradient descent (SGD) for machine learning and deep learning tasks. The term stochastic means random: the algorithm computes the cost and its gradient for a single randomly chosen data point (or a small batch) instead of taking the squared sum over the whole dataset. SGD can take more iterations to reach a minimum than gradient descent, but each iteration is much cheaper, so SGD is usually still less computationally expensive overall.
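As a rough sketch of the idea, the code below fits a line with SGD, updating the weight and bias after every single data point. The function name, learning rate, epoch count, and the noise-free data generated from y = 2x + 1 are all assumptions made for the example.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b with one squared-error update per data point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the points in a random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of error^2 for this single point only.
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # hypothetical data on the line y = 2x + 1

w, b = sgd_linear_fit(xs, ys)
print(round(w, 3), round(b, 3))
```

Each update looks at one point, so its cost does not grow with the size of the dataset. That is where the savings over full gradient descent come from.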

Here is the link to the paper, in which all the other state-of-the-art optimization algorithms used in deep learning are explained.

**Back Propagation:**

It is the most important and the toughest part of the neural network because it involves calculus. Don't worry about the calculus part, because we will not be dealing with it.

**So what is backpropagation?**

There are two types of flow in a neural network: the forward flow and the backward flow. In the forward flow, data passes through the neural network using the current weights, which are assigned randomly at the start. In the backward flow, or backpropagation, the weights are updated to minimize the error.

**So how does it happen?**

It happens using the chain rule, which determines how much each weight across the network should change. Here we differentiate the loss function with respect to the weights. This lets the neural network learn how sensitive the cost function is to a change in each weight. A similar thing happens for the bias terms.
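For readers who do want a small taste of the calculus, here is a sketch of the chain rule for a single sigmoid neuron with a squared-error loss. That setup, the helper names, and the numbers are all assumptions made for illustration. The result is checked against a numerical estimate of the same slope.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dLoss/dw = dLoss/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dloss_da = 2.0 * (a - y)  # how the loss changes with the neuron's output
    da_dz = a * (1.0 - a)     # how the output changes with the weighted sum
    dz_dw, dz_db = x, 1.0     # how the weighted sum changes with w and b
    return dloss_da * da_dz * dz_dw, dloss_da * da_dz * dz_db


w, b, x, y = 0.5, -0.2, 1.5, 1.0  # hypothetical values
grad_w, grad_b = gradients(w, b, x, y)

# Numerical check: nudge w slightly and measure how the loss responds.
eps = 1e-6
numeric_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(grad_w, numeric_w)
```

The negative gradient for `w` says that increasing `w` would lower the loss here, so a gradient descent step would push `w` up.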

Note: There is a lot of calculus involved in backpropagation. If you want to go into depth, here is the link that gives an idea about how it is done mathematically. Here is the link to a video that gives a visual representation of backpropagation.

**So let's summarise the steps of Neural Network:**

- Step 1: The data is fed to the first layer, and random weights and biases are assigned to the network. The weighted result passes through the activation function.
- Step 2: The output of the first layer is then fed to the second layer, which has its own weights and biases. This keeps happening until the data reaches the output layer.
- Step 3: The loss is calculated using the loss function we specified.
- Step 4: Based on the loss, the weights and biases are updated throughout the neural network using backpropagation.
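The four steps above can be put together in a small from-scratch sketch. Everything specific here is an assumption chosen to keep the example short: one input, a hidden layer of three tanh neurons, a linear output, mean squared error, and a toy dataset for y = x^2. A real project would use a library such as Keras instead.

```python
import math
import random


def train_tiny_network(data, hidden_size=3, learning_rate=0.05, epochs=500, seed=0):
    """Train a 1 -> hidden_size -> 1 network; return the loss after each epoch."""
    rng = random.Random(seed)
    # Step 1: assign random weights (biases start at zero here).
    w1 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    losses = []
    n = len(data)
    for _ in range(epochs):
        grad_w1 = [0.0] * hidden_size
        grad_b1 = [0.0] * hidden_size
        grad_w2 = [0.0] * hidden_size
        grad_b2 = 0.0
        total_loss = 0.0
        for x, y in data:
            # Steps 1-2: forward pass through the hidden layer, then the output layer.
            hidden = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden_size)]
            prediction = sum(w2[j] * hidden[j] for j in range(hidden_size)) + b2
            # Step 3: squared-error loss for this sample.
            error = prediction - y
            total_loss += error ** 2
            # Step 4: backpropagation, applying the chain rule layer by layer.
            d_pred = 2.0 * error
            grad_b2 += d_pred
            for j in range(hidden_size):
                grad_w2[j] += d_pred * hidden[j]
                d_hidden = d_pred * w2[j] * (1.0 - hidden[j] ** 2)
                grad_w1[j] += d_hidden * x
                grad_b1[j] += d_hidden
        losses.append(total_loss / n)
        # Update every weight and bias against its average gradient.
        for j in range(hidden_size):
            w1[j] -= learning_rate * grad_w1[j] / n
            b1[j] -= learning_rate * grad_b1[j] / n
            w2[j] -= learning_rate * grad_w2[j] / n
        b2 -= learning_rate * grad_b2 / n
    return losses


# Hypothetical dataset: nine points on the curve y = x^2 for x in [-1, 1].
data = [(x / 4.0, (x / 4.0) ** 2) for x in range(-4, 5)]
losses = train_tiny_network(data)
print(losses[0], losses[-1])
```

The loss at the end of training should be lower than at the start, which is the whole point of repeating the four steps.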

**Let's see how we build Neural Networks using the Keras library.**

Here we have imported the Sequential model. Let's understand why. In Keras, there are two ways to build a model: Sequential and Functional.

*Sequential:* The Sequential API allows us to create a model in which the output of one layer is the input for the next layer. It lets us build a model layer by layer.

*Functional:* The Functional API allows us to create more flexible models. For example, we can feed the output of the first layer to the 10th layer or to any other layer.

**So what is batch size and why do we use it?**

**Frequently asked questions:**

**What is a neural network used for?**

**What are the disadvantages of neural networks?**

**Data**: A neural network requires a lot more data than the usual algorithms to learn the hidden patterns. This means we cannot use it for problems with small datasets, because it will not generalize well.

**Computationally expensive**: If the data is very large, neural networks typically need a GPU (Graphics Processing Unit) or TPU (Tensor Processing Unit). Deep learning saw a surge after 2000 because, before that, we did not have the computational power.

**Training time**: When the data is huge, a neural network takes a lot of time to train.

**What is a fully connected neural network?**

"Thanks for reading"