Deep Learning, key-points

14 min read · Nov 11, 2022

Introduction to Neural Networks

First things first

Neural Networks and Deep Learning are behind most of the AI that shapes our everyday life. Think of how you interact with these technologies every day just by using the best features of your phone (face recognition, autocorrect, text autocomplete, voicemail-to-text previews), finding what you need on the internet (predictive internet searches, content or product recommendations), or using self-driving cars. Some of the classification and regression problems you need to solve are also good candidates for Neural Networks and Deep Learning.

Some basic facts of Neural Networks:

Use biology as inspiration for mathematical models
Get signals from previous neurons
Generate signals according to inputs
Pass signals on to next neurons
You can create a complex model by layering many neurons

The basic syntax of Multi-Layer Perceptrons in scikit-learn is:

# Import Scikit-Learn model
from sklearn.neural_network import MLPClassifier

# Specify the hidden layer sizes and an activation function
mlp = MLPClassifier(hidden_layer_sizes=(5, 2), activation='logistic')

# Fit and predict data (similar to approach for other sklearn models)
# X_train, y_train, and X_test are assumed to be defined already
mlp.fit(X_train, y_train)
mlp.predict(X_test)

These are the main parts of MLP:

Weights
Input layer
Hidden Layer
Weights
Net Input
Activation

Deep Learning Use Cases Summary

Training a Neural Network

In a nutshell this is the process to train a neural network:

Put in Training inputs, get the output.
Compare output to correct answers: Look at loss function J.
Adjust and repeat.
Backpropagation tells us how to make a single adjustment using calculus.
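
The loop above can be sketched with a single sigmoid neuron trained by gradient descent. This is only a minimal illustration, not a full backpropagation implementation: the toy data, the learning rate `lr`, and the helper names `sigmoid` and `train` are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    """Train one sigmoid neuron with full-batch gradient descent on log-loss."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    losses = []
    eps = 1e-12  # avoids log(0)
    for _ in range(epochs):
        # 1. Put in training inputs, get the output.
        p = sigmoid(X @ w + b)
        # 2. Compare output to correct answers: loss function J (log-loss here).
        J = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        losses.append(J)
        # 3. Adjust: calculus gives the gradient of J with respect to w and b.
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

# Toy data (hypothetical): the label is 1 when the two features sum to more than 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 0.8]])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0])
w, b, losses = train(X, y)
```

Plotting `losses` against the epoch number is a quick way to check that the adjust-and-repeat cycle is actually reducing J.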

The vanishing gradient problem arises because, as you add more layers, the gradient gets very small at the early layers. For this reason, activations other than the sigmoid (such as ReLU) have become more common.
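
One way to see why the gradient shrinks: the derivative of the sigmoid is never larger than 0.25, and backpropagation multiplies one such factor per layer. The sketch below tracks only this best-case factor and ignores the weights; `max_gradient_factor` is a name made up for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def max_gradient_factor(n_layers):
    """Upper bound on the product of sigmoid derivatives across n_layers."""
    # The sigmoid derivative peaks at z = 0, where it equals 0.25.
    return sigmoid_derivative(0.0) ** n_layers

for n in (1, 5, 10, 20):
    print(n, max_gradient_factor(n))
```

With ReLU the per-layer factor is 1 for every active unit, so the product does not shrink in the same way.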

The right activation function depends on the application, and there are no hard and fast rules. These are some of the most used activation functions and their most common use cases:

Deep Learning and Regularization

Technically, a deep Neural Network has 2 or more hidden layers (often, many more). Deep Learning involves Machine Learning with deep Neural Networks. However, the term Deep Learning is often used to broadly describe a subset of Machine Learning approaches that use deep Neural Networks to uncover otherwise-unobservable relationships in the data, often as an alternative to manual feature engineering. Deep Learning approaches are common in Supervised, Unsupervised, and Semi-supervised Machine Learning.

These are some common ways to prevent overfitting and regularize neural networks:

Regularization penalty in cost function — This option explicitly adds a penalty to the loss function

Dropout — This is a mechanism in which at each training iteration (batch) we randomly remove a subset of neurons. This prevents a neural network from relying too much on individual pathways, making it more robust. At test time the weight of the neuron is rescaled to reflect the percentage of the time it was active.
Early stopping — This is another heuristic approach to regularization that refers to choosing some rules to determine if the training should stop.

Example:

Check the validation log-loss every 10 epochs.

If it is higher than it was last time, stop and use the previous model.

Optimizers — These approaches tweak and improve the weights using variants of gradient descent rather than the plain algorithm.
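
The early-stopping example above (check every 10 epochs, stop when the validation log-loss goes up) can be sketched as follows. The function works on an already recorded loss curve to keep the example short; in a real training loop the same check would run during training. The function name and the validation curve are hypothetical.

```python
def early_stopping_epoch(val_losses, check_every=10):
    """Return the 1-based epoch of the model to keep.

    val_losses[i] is the validation log-loss after epoch i + 1. The loss is
    checked every `check_every` epochs; as soon as a check is higher than the
    previous check, training stops and the previously checked model is kept.
    """
    previous_epoch = None
    previous_loss = None
    for epoch in range(check_every, len(val_losses) + 1, check_every):
        loss = val_losses[epoch - 1]
        if previous_loss is not None and loss > previous_loss:
            return previous_epoch  # stop and use the previous model
        previous_epoch, previous_loss = epoch, loss
    return previous_epoch  # the loss never got worse: keep the last checked model

# Hypothetical validation curve: improves until epoch 30, then starts to overfit.
val_losses = [1.0 - 0.02 * e for e in range(1, 31)] + [0.4 + 0.01 * e for e in range(1, 21)]
best = early_stopping_epoch(val_losses)
```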

Details of Neural Networks

Training Neural Networks is sensitive to how the derivative with respect to each weight is computed and how convergence is reached. Important concepts involved at this step:

Batching methods, which include techniques like full-batch, mini-batch, and stochastic gradient descent, determine the set of points used to compute the derivative.

Data shuffling, which aids convergence by making sure data is presented in a different order every epoch.
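
Batching and shuffling can be combined in one small generator. This is a sketch using NumPy; `iterate_minibatches` and the toy arrays are assumptions for this example. Setting `batch_size` to the dataset size gives full-batch gradient descent, and setting it to 1 gives stochastic gradient descent.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover the data once, in random order."""
    order = rng.permutation(len(X))  # a new order on every call, i.e. every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(42)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10)

for epoch in range(2):
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=4, rng=rng):
        pass  # compute the derivative on this batch and update the weights here
```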

Keras

Keras is a high-level library that can run on either TensorFlow or Theano. It simplifies the syntax, and allows multiple backend tools, though it is most commonly used with TensorFlow.

This is a common approach to train a deep learning model using Keras:

Compile the model, specifying your loss function, metrics, and optimizer.
Fit the model on your training data (specifying batch size, number of epochs).
Predict on new data.
Evaluate your results.

Below is the syntax to create a sequential model in Keras.

First, import the Sequential class and the layer types, then initialise your model object and add layers to it:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()

# First layer: 3 input features, 4 units, sigmoid activation
model.add(Dense(units=4, input_dim=3))
model.add(Activation('sigmoid'))

# Second layer: 4 units, sigmoid activation
model.add(Dense(units=4))
model.add(Activation('sigmoid'))

Softmax activation, classification

Scale inputs

CNNs

Convolutional Layers have relatively few weights and more layers than other architectures. In practice, data scientists add layers to CNNs to solve specific problems using Transfer Learning.

Kernels (Filters)


Transfer Learning

The main idea of Transfer Learning consists of keeping the early layers of a pre-trained network and retraining the later layers for a specific application.

The last layers in the network capture features that are more particular to the specific data you are trying to classify.

Later layers are easier to train as adjusting their weights has a more immediate impact on the final result.

Guiding Principles for Fine Tuning

While there are no hard and fast rules, these are some guiding principles to keep in mind:

The more similar your data and problem are to the source data of the pre-trained network, the less intensive fine-tuning will be.
If your data is substantially different in nature than the data the source model was trained on, Transfer Learning may be of little value.

CNN Architectures

LeNet-5

Created by Yann LeCun in the 1990s
Used on the MNIST data set.
Novel Idea: Use convolutions to efficiently learn features on data set.

AlexNet

Considered the “flash point” for modern deep learning
Created in 2012 for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
Task: predict the correct label from among 1000 classes.
Dataset: around 1.2 million images.

AlexNet developers performed data augmentation for training.

Cropping, horizontal flipping, and other manipulations.

Basic AlexNet Template:

Convolutions with ReLUs.
Sometimes add maxpool after convolutional layer.
Fully connected layers at the end before a softmax classifier.

VGG

Simplified network structure: VGG uses the same concepts and ideas as LeNet, but is considerably deeper.

This architecture avoids manual choices of convolution size and is a very deep network built from 3x3 convolutions.

Stacked 3x3 convolutions give rise to the receptive field of a larger convolution.

This was one of the first architectures to experiment with many layers (more is better!). It can use multiple 3x3 convolutions to simulate larger kernels with fewer parameters, and it served as a “base model” for future works.
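
The "fewer parameters" claim can be checked with simple arithmetic. Two stacked 3x3 convolutions see a 5x5 region and three see a 7x7 region. The sketch below counts weights only (biases ignored) and assumes the same number of input and output channels, here 64, which is a value chosen for illustration.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 64  # assumed: same number of input and output channels

two_3x3 = 2 * conv_params(3, channels, channels)    # receptive field 5x5
one_5x5 = conv_params(5, channels, channels)
three_3x3 = 3 * conv_params(3, channels, channels)  # receptive field 7x7
one_7x7 = conv_params(7, channels, channels)

print(two_3x3, one_5x5)    # 73728 102400
print(three_3x3, one_7x7)  # 110592 200704
```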

Inception

Proposed by Szegedy et al. in 2014, this architecture was built to turn each layer of the neural network into further branches of convolutions. Each branch handles a smaller portion of the workload.

The network concatenates different branches at the end. These networks use different receptive fields and have sparse activations of groups of neurons.

Inception V3 is a relevant example of an Inception architecture.

ResNet

Researchers were building deeper and deeper networks but started finding these issues:

In theory, the very deep (56-layer) networks should fit the training data better (even if they overfit) but that was not happening.

It seemed that the early layers were just not getting updated and the signal got lost (due to vanishing-gradient-type issues).

These are the main reasons why adding layers does not always decrease training error:

Early layers of Deep Networks are very slow to adjust.
Analogous to “Vanishing Gradient” issue.
In theory, the extra layers should be able to just learn an “identity” transformation that makes the deeper network behave like a shallow one.

In a nutshell, a ResNet:

Has several layers such as convolutions
Enforces “best transformation” by adding “shortcut connections”.
Adds the inputs from an earlier layer to the output of current layer.
Keeps passing both the initial unchanged information and the transformed information to the next layer.
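
A shortcut connection is just an addition. The sketch below uses two small dense layers in place of convolutions to keep it short, and leaves out details such as batch normalization and the activation that real ResNet blocks usually apply after the addition. The names `residual_block`, `W1`, and `W2` are assumptions for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x), where F is two small dense layers standing in for convolutions."""
    transformed = relu(x @ W1) @ W2  # F(x): the transformed information
    return x + transformed           # shortcut connection: add the unchanged input

x = np.array([[1.0, -2.0, 3.0]])

# With all-zero weights F(x) = 0, so the block reduces to the identity.
zeros = np.zeros((3, 3))
identity_out = residual_block(x, zeros, zeros)
```

This is why extra residual blocks can, at worst, behave like the “identity” transformation mentioned above: the block only has to learn F(x) = 0.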

RNN (recurrent neural networks)

State cell and Unrolled, associated weights with dimensions

Recurrent Neural Networks are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are mostly used in applications of natural language processing and speech recognition.

One of the main motivations for RNNs is to derive insights from text and do better than “bag of words” implementations. Ideally, each word is processed or understood in the appropriate context.

Words should be handled differently depending on “context”. Also, each word should update the context.

Under the notion of recurrence, words are input one by one. This way, we can handle variable lengths of text. This means that the response to a word depends on the words that preceded it.

These are the two main outputs of an RNN:

Prediction: What would be the prediction if the sequence ended with that word.
State: Summary of everything that happened in the past.

Mathematical Details

Mathematically, an RNN consists of a recurrent core followed by a dense layer:

current state = function1(old state, current input).
current output = function2(current state).

We learn function1 and function2 by training our network!

r = dimension of input vector
s = dimension of hidden state
t = dimension of output vector (after dense layer)
U is an s × r matrix
W is an s × s matrix
V is a t × s matrix

The weight matrices U, V, and W are the same across all positions.
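
The two functions and the matrix shapes above can be sketched in NumPy. The text does not say which nonlinearity is used, so the `tanh` in function1, the absence of bias terms, and the concrete sizes of r, s, and t are assumptions made for this example.

```python
import numpy as np

r, s, t = 4, 3, 2  # input, hidden-state, and output dimensions

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(s, r))  # input -> state
W = rng.normal(scale=0.1, size=(s, s))  # old state -> state
V = rng.normal(scale=0.1, size=(t, s))  # state -> output (dense layer)

def rnn_step(old_state, current_input):
    # function1: the new state depends on the old state and the current input
    current_state = np.tanh(U @ current_input + W @ old_state)
    # function2: the output depends only on the current state
    current_output = V @ current_state
    return current_state, current_output

sequence = rng.normal(size=(5, r))  # five inputs of dimension r
state = np.zeros(s)
outputs = []
for x in sequence:  # the same U, W, V are reused at every position
    state, output = rnn_step(state, x)
    outputs.append(output)
```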

Practical Details

Often, we train on just the “final” output and ignore intermediate outputs (sequence-to-vector).

A slight variation of backpropagation, called Backpropagation Through Time (BPTT), is used to train RNNs.

RNNs are sensitive to the length of the sequence (due to the “vanishing/exploding gradient” problem).

In practice, we still set a maximum length to our sequences. If the input is shorter than maximum, we “pad” it. If the input is longer than maximum, we truncate it.
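
Padding and truncating to a maximum length can be sketched in a few lines. This version pads and truncates at the end of the sequence, which is an assumption; libraries may default to the start instead. The function name and `pad_value` are hypothetical.

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Return a list with exactly max_length items."""
    if len(sequence) >= max_length:
        return list(sequence[:max_length])  # truncate
    padding = [pad_value] * (max_length - len(sequence))
    return list(sequence) + padding         # pad

short = pad_or_truncate([5, 3], max_length=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
```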

RNN Applications

RNNs often focus on text applications, but are commonly used for other sequential data:

Forecasting: Customer Sales, Loss Rates, Network Traffic.
Speech Recognition: Call Center Automation, Voice Applications.
Manufacturing Sensor Data
Genome Sequences

Long Short-Term Memory RNNs (LSTM)

LSTMs are a special kind of RNN (invented in 1997). They are motivated by one of the main weaknesses of RNNs: their transitional nature makes it hard to keep information from the distant past in current memory without reinforcement.

LSTMs have a more complex mechanism for updating the state.

Standard RNNs have poor memory because the transition matrix necessarily weakens the signal.

This is the problem addressed by Long Short-Term Memory RNNs (LSTM).

To solve it, you need a structure that can leave some dimensions unchanged over many steps.

By default, LSTMs remember the information from the last step.
Items are overwritten as an active choice.

The idea of updating states that RNNs use is old, but the computing power needed to apply it to sequence-to-sequence mapping, explicit memory units, and text generation tasks has only become available relatively recently.

Augment RNNs with a few additional Gate Units:

Gate Units control whether, and for how long, events stay in memory.
Input Gate: when its value is close to 1, items are stored in memory.
Forget Gate: when its value is close to 0, items are removed from memory.
Output Gate: when its value is close to 1, the hidden unit feeds forward (outputs) in the network.
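
The three gates can be sketched as one LSTM step in NumPy, using the commonly used formulation in which each gate is a sigmoid between 0 and 1. The parameter layout (a dictionary of weight and bias pairs) and the extreme bias values are assumptions chosen to show that memory can stay unchanged over many steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps gate names to (W, b) acting on [h_prev, x]."""
    z = np.concatenate([h_prev, x])
    i = sigmoid(params["input"][0] @ z + params["input"][1])    # input gate
    f = sigmoid(params["forget"][0] @ z + params["forget"][1])  # forget gate
    o = sigmoid(params["output"][0] @ z + params["output"][1])  # output gate
    candidate = np.tanh(params["cand"][0] @ z + params["cand"][1])
    c = f * c_prev + i * candidate  # keep old memory where f is near 1, write where i is near 1
    h = o * np.tanh(c)              # expose memory where o is near 1
    return h, c

hidden, inputs = 2, 3
zero_W = np.zeros((hidden, hidden + inputs))
# Hypothetical extreme biases: forget gate saturated at 1 (keep everything),
# input gate saturated at 0 (write nothing).
params = {
    "input": (zero_W, np.full(hidden, -50.0)),
    "forget": (zero_W, np.full(hidden, 50.0)),
    "output": (zero_W, np.zeros(hidden)),
    "cand": (zero_W, np.zeros(hidden)),
}
c = np.array([0.7, -0.3])
h = np.zeros(hidden)
for x in np.ones((100, inputs)):
    h, c = lstm_step(x, h, c, params)
```

After 100 steps the memory `c` still holds its initial values, which is the "remember by default, overwrite as an active choice" behaviour described above.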

Gated Recurrent Units (GRUs)

GRUs are a gating mechanism for RNNs and an alternative to LSTMs. They are based on the principle of a removed cell state:

The hidden state is now used to transfer past information.
Think of it as a “simpler” and faster version of LSTM.

These are the gates of GRU:

Reset gate: helps decide how much past information to forget.

Update gate: helps decide what information to throw away and what new information to keep.

LSTM vs GRU

LSTMs are a bit more complex and may therefore be able to find more complicated patterns.

Conversely, GRUs are a bit simpler and therefore are quicker to train.

GRUs will generally perform about as well as LSTMs with shorter training time, especially for smaller datasets.

In Keras it is easy to switch from one to the other by specifying a layer type.

Sequence-to-Sequence Models (Seq2Seq)

Thinking back to how any type of RNN interprets text, the model will have a new hidden state at each step of the sequence containing information about all past words.

Seq2Seq models build on this by keeping the necessary information in the hidden state from one sequence to the next.

This way, at the end of a sentence, the hidden state will have all information relating to past words.

The size of the vector from the hidden state is the same no matter the size of the sentence.

In a nutshell, there is an encoder, a hidden state, and a decoder.

Beam Search

Beam search is an attempt to solve the main weakness of greedy inference.

Greedy inference means that a model produces one word at a time, which implies that if it produces one wrong word, it might output an entire wrong sequence of words.
Beam search produces multiple different hypotheses, extending each one word at a time until <EOS>, and then sees which full sentence is most likely.
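
The difference between greedy inference and beam search can be sketched with a toy "model" that is just a lookup table of next-word probabilities. The table, the vocabulary, and the function name are all hypothetical; a beam width of 1 is equivalent to greedy inference.

```python
import math

EOS = "<EOS>"

# Hypothetical toy "model": probability of the next word given the words so far.
NEXT_WORD_PROBS = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"cat": 0.5, "dog": 0.5},
    ("the",): {"cat": 0.9, "dog": 0.1},
    ("a", "cat"): {EOS: 1.0},
    ("a", "dog"): {EOS: 1.0},
    ("the", "cat"): {EOS: 1.0},
    ("the", "dog"): {EOS: 1.0},
}

def beam_search(beam_width, max_steps=10):
    """Return the most likely finished sentence found and its probability."""
    beams = [((), 0.0)]  # (words so far, log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for words, logp in beams:
            for word, p in NEXT_WORD_PROBS[words].items():
                candidate = (words + (word,), logp + math.log(p))
                if word == EOS:
                    finished.append(candidate)
                else:
                    candidates.append(candidate)
        if not candidates:
            break
        # Keep only the beam_width most likely partial sentences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_words, best_logp = max(finished, key=lambda c: c[1])
    return list(best_words[:-1]), math.exp(best_logp)

greedy_sentence, greedy_prob = beam_search(beam_width=1)
beam_sentence, beam_prob = beam_search(beam_width=2)
```

Greedy inference commits to "a" because it is the most likely first word and ends with a sentence of probability 0.3, while a beam of width 2 keeps "the" alive and finds "the cat" with probability 0.36.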

These are examples of common enterprise applications of LSTM models:

Forecasting: (LSTM among most common Deep Learning models used in forecasting).
Speech Recognition
Machine Translation
Image Captioning
Question Answering
Anomaly Detection
Robotic Control

Autoencoder

Autoencoders are a neural network architecture that forces the learning of a lower dimensional representation of data, commonly images.

Autoencoders are a type of unsupervised deep learning model that use hidden layers to decompose and then recreate their input. They have several applications:

Dimensionality reduction
Preprocessing for classification
Identifying ‘essential’ elements of the input data, and filtering out noise

One of the main motivations is to find whether two pictures are similar.

Autoencoder and PCA

Autoencoders can be used in cases that are suited for Principal Component Analysis (PCA).

Autoencoders also help to deal with one of PCA’s limitations: the features PCA learns are linear combinations of the original features.

Autoencoders can detect complex, nonlinear relationships between the original features and the best lower dimensional representation.

Autoencoding process

The process for autoencoding can be summarized as:

Feed image through encoder network
Generate the lower dimension embedding
Feed embedding through decoder network
Generate reconstructed version of the original data
Compare the result of the generated vs the original image

Result: A network will learn the lower dimensional space that represents the original data
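
The encode, decode, compare cycle can be sketched with a tiny autoencoder in NumPy. To keep it short, the encoder and decoder here are single linear layers without activations, so this toy version can only learn linear structure; real autoencoders add nonlinear hidden layers. The synthetic data, layer sizes, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 points in 8 dimensions that really depend on 2 latent factors.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.01 * rng.normal(size=(200, 8))

W_enc = rng.normal(scale=0.1, size=(8, 2))  # encoder: 8 -> 2
W_dec = rng.normal(scale=0.1, size=(2, 8))  # decoder: 2 -> 8

def reconstruction_loss(X, W_enc, W_dec):
    embedding = X @ W_enc               # lower-dimensional embedding
    reconstruction = embedding @ W_dec  # reconstructed version of the input
    return np.mean((reconstruction - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(2000):
    embedding = X @ W_enc
    error = embedding @ W_dec - X  # compare the generated data with the original
    # Gradients of the mean squared error with respect to both weight matrices.
    grad_dec = embedding.T @ error * (2 / error.size)
    grad_enc = X.T @ (error @ W_dec.T) * (2 / error.size)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
final_loss = reconstruction_loss(X, W_enc, W_dec)
```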

Autoencoder applications

Autoencoders have a wide variety of enterprise applications:

Dimensionality reduction as preprocessing for classification
Information retrieval
Anomaly detection
Machine translation
Image-related applications (generation, denoising, processing and compression)
Drug discovery
Popularity prediction for social media posts
Sound and music synthesis
Recommender systems

Variational Autoencoder

Variational autoencoders also generate a latent representation and then use this representation to generate new samples (e.g. images).

These are some important features of variational autoencoders:

Data are assumed to be represented by a set of normally-distributed latent factors.
The encoder generates parameters of these distributions, namely µ and σ.
Images can be generated by sampling from these distributions.

VAE goals

The main goal of VAEs: generate images using the decoder

The secondary goal is to have similar images be close together in latent space

Loss Function of Variational Autoencoder

The VAE reconstructs the original images from a vector drawn from a latent space that is encouraged to follow a standard normal distribution.

The two components of the loss function are:

A penalty for not reconstructing the image correctly.
A penalty for generating vectors of parameters µ and σ that are different than 0 and 1, respectively: the parameters of the standard normal distribution.
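
The two penalties can be written down for a single example. This sketch assumes squared error for the reconstruction term and the closed-form KL divergence between a diagonal normal N(µ, σ²) and the standard normal for the second term; the function name and the sample values are hypothetical.

```python
import numpy as np

def vae_loss(x, x_reconstructed, mu, sigma):
    """Sum of the two penalties for a single example."""
    # Penalty 1: the image was not reconstructed correctly (squared error here).
    reconstruction = np.sum((x - x_reconstructed) ** 2)
    # Penalty 2: KL divergence between N(mu, sigma^2) and N(0, 1),
    # summed over the latent dimensions.
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))
    return reconstruction + kl

x = np.array([0.2, 0.8, 0.5])
perfect = vae_loss(x, x, mu=np.zeros(2), sigma=np.ones(2))
shifted = vae_loss(x, x, mu=np.array([1.0, 0.0]), sigma=np.ones(2))
```

The second penalty is zero exactly when µ = 0 and σ = 1, and grows as the encoder's parameters move away from those values.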

Generative Adversarial Networks (GANs)

The invention of GANs was connected to neural networks’ vulnerability to adversarial examples. Researchers were going to run a speech synthesis contest, to see which neural network could generate the most realistic-sounding speech.

A neural network — the “discriminator” — would judge whether the speech was real or not.

In the end, they decided not to run the contest, because they realized people would generate speech to fool this particular network, rather than actually generating realistic speech.

One of the main advantages of GANs over other adversarial networks is that they do not spend any time evaluating whether an input or image is fake or real. They only compute the probability of it being fake.

These are the steps to train GANs:

Randomly initialize weights of generator and discriminator networks
Randomly initialize noise vector and generate image using generator
Predict probability generated image is real using discriminator
Compute losses both assuming the image was fake and assuming it was real
Train the discriminator to output whether the image is fake
Compute the penalty for the discriminator probability, without using it to train the discriminator
Train the generator to generate images that the discriminator thinks are real
Use the discriminator to calculate the probability that a real image is real
Use the resulting loss L to train the discriminator to output 1 when it sees real images
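
The losses used in these steps can be sketched with binary cross-entropy on the discriminator's output, read as the probability that an image is real. The function names are hypothetical, the generator loss shown is the commonly used "non-saturating" form, and the inputs are assumed to lie strictly between 0 and 1.

```python
import math

def discriminator_loss(p_real_on_real, p_real_on_fake):
    """Binary cross-entropy: real images should score 1, generated images 0."""
    return -math.log(p_real_on_real) - math.log(1.0 - p_real_on_fake)

def generator_loss(p_real_on_fake):
    """The generator is rewarded when the discriminator thinks its image is real."""
    return -math.log(p_real_on_fake)

# A discriminator that separates real from fake well has a low loss ...
good_discriminator = discriminator_loss(p_real_on_real=0.9, p_real_on_fake=0.1)
# ... while one that cannot tell them apart does not.
confused_discriminator = discriminator_loss(p_real_on_real=0.5, p_real_on_fake=0.5)
```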

Reinforcement Learning

In Reinforcement Learning, Agents interact with an Environment

They choose from a set of available Actions

The actions impact the Environment, which impacts agents via Rewards

Rewards are generally unknown and must be estimated by the agent

The process repeats dynamically, so agents learn how to estimate rewards over time
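
The agent, action, and reward cycle can be sketched with a very small toy problem. The environment here has a single state and two actions with hidden reward probabilities, so actions do not change the state; the epsilon-greedy rule, the function name, and all numbers are assumptions chosen for illustration.

```python
import random

def run_agent(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a toy environment with one state."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # the agent's estimated reward per action
    counts = [0] * len(reward_probs)
    for _ in range(steps):
        # The agent chooses from the set of available actions.
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = estimates.index(max(estimates))   # exploit the best estimate
        # The environment responds with a reward the agent could not know in advance.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # The agent updates its running-average estimate for that action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_agent(reward_probs=[0.2, 0.8])
```

Over time the estimates approach the hidden reward probabilities, and the agent picks the better action more and more often.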

Advances in deep learning have led to many recent RL developments:

In 2013, researchers from DeepMind developed a system to play Atari games
In 2017, the AlphaGo system defeated the world champion in Go

As a result, many well-known use cases involve learning to play games. More recently, progress has been made in areas with more direct business applications.

Attributes of RL

In general, RL algorithms have been limited by significant data and computational requirements.
Simulation is a common approach for Reinforcement Learning applications that are complex or computing intensive.
Discounting rewards does NOT refer to an agent reducing the value of the reward based on its uncertainty; it refers to valuing rewards received later less than rewards received sooner.
Successful Reinforcement Learning approaches are often limited by extreme sensitivity to hyperparameters.
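
Discounting can be sketched as a weighted sum in which a reward k steps in the future is multiplied by a discount factor raised to the power k. The name `gamma` and its value of 0.9 are assumptions for this example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps in the future is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

now = discounted_return([1.0, 0.0, 0.0])    # the reward arrives immediately
later = discounted_return([0.0, 0.0, 1.0])  # the same reward, two steps later
```

The same reward is worth 1.0 when it arrives immediately and about 0.81 when it arrives two steps later, regardless of how certain the agent is about it.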

Reinforcement Learning Architecture

The main components of reinforcement learning are: Policy, Agents, Actions, State, and Reward.

A solution is represented by a Policy by which Agents choose Actions in response to the State

Agents typically maximize expected rewards over time

In Python, the most common library for RL is OpenAI Gym

This differs from typical Machine Learning Problems:

Unlike labels, rewards are not known and are often highly uncertain

As actions impact the environment, the state changes, which changes the problem

Agents face a tradeoff between rewards in different periods

Examples of everyday applications of Reinforcement Learning include recommendation engines, marketing, and automated bidding.