# Notes on Deep Learning

This article is a cheat sheet for a recent review. If you already know the material, skim it; if you don't, come back to it later.

Grab a cup of coffee ☕️ and use it alongside other resources to get into deep learning.

# What is Deep Learning?

## Brief Theory

Deep learning (*also known as deep structured learning, hierarchical learning or deep machine learning*) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.

In practice, defining the term “deep”: in this context, deep means that we are studying a neural network with several hidden layers (more than one), no matter what type (convolutional, pooling, normalization, fully-connected, etc.). The most interesting part is that some papers have observed that deep neural networks with the right architecture and hyper-parameters achieve better results than shallow neural networks with the same computational budget (e.g. number of neurons, connections, or units).

**A classic example** is using the famous MNIST dataset to build two neural networks capable of classifying handwritten digits.

The first Network is a simple Multi-layer Perceptron (MLP) and the second one is a Convolutional Neural Network (CNN from now on). In other words, when given an input our algorithm will say, with some associated error, what type of digit this input represents.

In practice, defining “learning”: in the context of supervised learning (digit recognition in our example 👆):

- The learning part consists of a feature/target which is to be predicted using a given set of **observations** with the already known **final prediction (label)**. In our case 👆, the **target** will be the digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and the **observations** are the intensity and relative position of the pixels.
- After some training, it is possible to generate a “**function**” that **maps** *inputs* (digit image) to desired *outputs* (type of digit). The only problem is how well this map operation occurs.
- While trying to generate this “function”, the training process continues until the model achieves **a desired level of accuracy** on the training data.
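To make the “desired level of accuracy” idea concrete, here is a minimal sketch in plain Python. The helper `accuracy`, the sample predictions, and the `target_accuracy` threshold are all made up for illustration; they are not part of any library or of the original example.

```python
def accuracy(predicted_digits, true_digits):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == t for p, t in zip(predicted_digits, true_digits))
    return correct / len(true_digits)

# Hypothetical predictions and labels for five digit images.
preds = [7, 2, 1, 0, 4]
labels = [7, 2, 1, 0, 9]
acc = accuracy(preds, labels)      # 4 of the 5 predictions are correct

target_accuracy = 0.95             # assumed stopping threshold
keep_training = acc < target_accuracy
```

In a real training loop this check would run after each pass over the training data, and training would stop once `keep_training` becomes false.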

# Applying Weights and Biases to the Input

## Forward propagation

In TensorFlow we can write it like this:

```python
def forward(x):
    return tf.matmul(x, W) + b  # linear pass: x·W + b
```
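To make the shapes concrete, here is a dependency-free sketch of the same linear pass in plain Python rather than TensorFlow. The helper name `forward_plain` and the toy values of `x`, `W` and `b` are assumptions for illustration only; in the MNIST example `x` would hold 784 pixel values and `W` would have 10 columns, one per digit.

```python
def forward_plain(x, W, b):
    """Compute x·W + b for one input row.

    x: list of input values, W: list of rows (one per input, one column
    per output unit), b: list with one bias per output unit.
    """
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]

# Toy example: 3 input features, 2 output units (hypothetical values).
x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]
b = [0.5, -0.5]
logits = forward_plain(x, W, b)   # one raw score per output unit
```

Each output is just a weighted sum of the inputs plus a bias, which is why this step on its own stays linear.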

## Back-propagation

Back-propagation computes the derivatives of the loss at each layer to obtain the gradients. The gradients are then used to update the weights and biases.

The gradients help us drive the cost/loss function 👇 toward convergence at a minimum, or equivalently find the maximum likelihood of the model distribution (the negative of the cost/loss).
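As a rough illustration of what “computing derivatives for the gradients” means, the sketch below applies the chain rule to a single linear unit and checks the result against a finite-difference estimate. It uses a squared-error loss purely for simplicity (not the cross-entropy discussed later), and every name and value in it is an assumption made for this example.

```python
def loss(w, b, x, t):
    """Squared error of a single linear unit y = w*x + b against target t."""
    y = w * x + b
    return (y - t) ** 2

def gradients(w, b, x, t):
    """Analytic derivatives of the loss, obtained with the chain rule."""
    y = w * x + b
    dL_dy = 2.0 * (y - t)        # derivative of the loss w.r.t. the output
    return dL_dy * x, dL_dy      # dL/dw = dL/dy * x, dL/db = dL/dy * 1

w, b, x, t = 0.5, 0.1, 2.0, 3.0  # hypothetical parameter, input and target
dw, db = gradients(w, b, x, t)

# Finite-difference estimates of the same derivatives, as a sanity check.
eps = 1e-6
dw_num = (loss(w + eps, b, x, t) - loss(w - eps, b, x, t)) / (2 * eps)
db_num = (loss(w, b + eps, x, t) - loss(w, b - eps, x, t)) / (2 * eps)
```

In a deep network the same chain rule is applied layer by layer, from the output back to the input, which is where the name back-propagation comes from.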

**In fact, the cost/loss is derived from the maximum likelihood of the conditional probabilities over the sample space.**
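One way to spell out that link, as a sketch under the usual assumption of independent training samples. The symbols $\theta$ (model parameters), $N$ (number of samples), $x_n$, $y_n$ (input and one-hot label of sample $n$) and $\hat{y}_n$ (predicted class probabilities) are introduced here and do not come from the original text:

$$
\theta^{*} = \arg\max_{\theta} \prod_{n=1}^{N} p(y_n \mid x_n; \theta)
           = \arg\min_{\theta} \left( -\sum_{n=1}^{N} \log p(y_n \mid x_n; \theta) \right)
$$

For a one-hot label the log-probability of the correct class can be written as a sum over classes $i$:

$$
\log p(y_n \mid x_n; \theta) = \sum_{i} y_{n,i} \log \hat{y}_{n,i}
$$

so the quantity being minimised is $-\sum_{n}\sum_{i} y_{n,i} \log \hat{y}_{n,i}$, the cross-entropy summed over the samples.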

# Activation Function

The objective of the activation function is to introduce **non-linearity after a linear pass** in the network.
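A small sketch of why this matters: without an activation, two stacked linear layers collapse into a single linear layer, so depth adds nothing. The example uses scalar layers and ReLU, which is chosen here only as a convenient example of an activation function; the function names and the weights are hypothetical.

```python
def linear(x, w, b):
    """One scalar linear layer: w*x + b."""
    return w * x + b

def relu(x):
    """ReLU activation, used here only as an example non-linearity."""
    return max(0.0, x)

def two_linear(x):
    # Two stacked linear layers with no activation in between.
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)

def collapsed(x):
    # The same mapping as ONE linear layer: w = -3*2, b = -3*1 + 0.5.
    return linear(x, -6.0, -2.5)

def with_relu(x):
    # The same two layers, but with a ReLU between them.
    return linear(relu(linear(x, 2.0, 1.0)), -3.0, 0.5)
```

For any input, `two_linear` and `collapsed` agree, while `with_relu` differs from both whenever the first layer's output is negative (for example at `x = -2.0`).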

## Softmax Regression

Softmax is an activation function that is normally used in classification problems. It generates the probabilities for the output.

In our example 👆, the model will not be 100% sure that a digit is a nine. Instead, the answer will be a probability distribution in which, if the model is right, nine has a larger probability than the other digits.

For comparison, below is the one-hot vector for the label of the digit nine:

0 → 0

1 → 0

2 → 0

3 → 0

4 → 0

5 → 0

6 → 0

7 → 0

8 → 0

9 → 1

A machine does not have this much certainty, so we want to know its best guess, but we also want to understand how sure it was and what the second-best option was. Below is an example of a hypothetical distribution for the digit nine:

0 → 0.01

1 → 0.02

2 → 0.03

3 → 0.02

4 → 0.12

5 → 0.01

6 → 0.03

7 → 0.06

8 → 0.1

9 → 0.6

In TensorFlow we can write it like this:

```python
def activate(x):
    return tf.nn.softmax(forward(x))
```
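For intuition, softmax itself fits in a few lines of plain Python. This is a minimal sketch, not TensorFlow's implementation; the function name `softmax` and the sample scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])              # hypothetical raw scores
best = max(range(len(probs)), key=probs.__getitem__)  # index of the best guess
```

The largest raw score always gets the largest probability, so `best` is the model's guess and `probs[best]` tells us how sure it was.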

# Cost function

The cost function measures the difference between the **right answers (labels) and the estimated outputs** of our network; training tries to minimize it.

For our example 👆 we can use the **cross-entropy function**, which is a popular cost function for categorical models. The function is defined in terms of probabilities, which is why we must use normalized vectors. For a label vector $y$ and a predicted distribution $\hat{y}$ it is given as:

$$H(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$$

In TensorFlow we can write it like this:

```python
def cross_entropy(y_label, y_pred):
    return -tf.reduce_sum(y_label * tf.math.log(y_pred + 1e-10))  # 1e-10 avoids log(0)
```
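As a worked example, the sketch below evaluates the same formula in plain Python on the one-hot label and the hypothetical distribution for the digit nine shown earlier. The helper name `cross_entropy_plain` is made up for this illustration.

```python
import math

def cross_entropy_plain(y_label, y_pred, eps=1e-10):
    """-sum(label * log(prediction)); eps keeps log() away from zero."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_label, y_pred))

one_hot_nine = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
predicted = [0.01, 0.02, 0.03, 0.02, 0.12, 0.01, 0.03, 0.06, 0.1, 0.6]

loss = cross_entropy_plain(one_hot_nine, predicted)        # -log(0.6), about 0.51
perfect = cross_entropy_plain(one_hot_nine, one_hot_nine)  # essentially 0
```

Because the label is one-hot, only the probability assigned to the correct digit contributes: the more confident the model is in the right answer, the closer the loss gets to zero.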

# Optimisation: Gradient Descent

To minimise the training cost as quickly as possible (that is, to reach the maximum likelihood), we need to keep updating the weights and biases, which can be understood as “acceleration”.

The optimiser helps us update the weights and biases using different variants of gradient descent:

SGD, Adam, …
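The basic update shared by these methods is “step against the gradient”. Here is a minimal sketch of plain gradient descent on a toy one-parameter cost; the cost function, the `learning_rate` value and the number of steps are assumptions chosen for illustration, and Adam's extra bookkeeping is not shown.

```python
def grad(w):
    """Derivative of the toy cost f(w) = (w - 3)**2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

learning_rate = 0.1      # hypothetical step size
w = 0.0                  # arbitrary starting point
for step in range(100):
    w = w - learning_rate * grad(w)   # move a small step against the gradient
```

After the loop `w` sits very close to 3, the minimum of the toy cost. In a real network the same update is applied to every weight and bias, using the gradients from back-propagation.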

**Training batches**

## Train using minibatch Gradient Descent

In practice, batch gradient descent (using the whole training set as a single batch) is **not often** used because it is too computationally expensive. The good part about this method is that you get the **true gradient**, **but** at the cost of the **expensive** computation of processing the whole dataset at **once**. Because of this, neural networks are usually trained with **mini-batches**.

## Mini-batch

We can divide our full dataset into batches of 50 samples each using any dataset API (the TensorFlow dataset package, PyTorch, sklearn).

Then we can iterate through each of those batches to compute a gradient. Once we iterate through all of the batches in the dataset, we complete an **epoch**, or a full traversal of the dataset.
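The batching and epoch loop can be sketched without any dataset API. The generator `minibatches`, the stand-in dataset of 120 integers, and the per-epoch shuffling are assumptions for illustration; a real pipeline would use one of the libraries mentioned above.

```python
import random

def minibatches(data, batch_size=50, seed=0):
    """Yield shuffled batches of batch_size items; the last one may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

dataset = list(range(120))      # stand-in for 120 training samples
for epoch in range(2):          # one full pass over all batches is one epoch
    sizes = [len(batch) for batch in minibatches(dataset, seed=epoch)]
    # A gradient step would be computed here for each batch.
```

With 120 samples and a batch size of 50, each epoch consists of two full batches and one smaller final batch.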

# How to improve our model?

## Several options, as follows

- Regularization of Neural Networks using DropConnect
- Multi-column Deep Neural Networks for Image Classification
- APAC: Augmented Pattern Classification with Neural Networks
- Simple Deep Neural Network with Dropout