Deep Learning, key-points
Introduction to Neural Networks
First things first
Neural Networks and Deep Learning are behind most of the AI that shapes our everyday life. Think of how you interact with these technologies every day just by using the best features of your phone (face recognition, autocorrect, text autocomplete, voicemail-to-text previews), finding what you need on the internet (predictive searches, content or product recommendations), or using self-driving cars. Some of the classification and regression problems you need to solve are also good candidates for Neural Networks and Deep Learning.
Some basic facts of Neural Networks:
- Use biology as inspiration for mathematical models
- Get signals from previous neurons
- Generate signals according to inputs
- Pass signals on to next neurons
- You can create a complex model by layering many neurons
The basic syntax of Multi-Layer Perceptrons in scikit-learn is:
# Import Scikit-Learn model
from sklearn.neural_network import MLPClassifier
# Specify an activation function
mlp = MLPClassifier(hidden_layer_sizes=(5, 2), activation='logistic')
# Fit and predict data (similar to approach for other sklearn models)
mlp.fit(X_train, y_train)
mlp.predict(X_test)
These are the main parts of MLP:
- Weights
- Input layer
- Hidden Layer
- Weights
- Net Input
- Activation
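To make the roles of weights, net input, and activation concrete, here is a minimal sketch of a forward pass through one hidden layer and one output neuron. The weights, biases, and layer sizes are illustrative assumptions, not values from a trained model, and the logistic activation is chosen only to match the scikit-learn example above.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """Compute one layer: net input (weighted sum plus bias), then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net_input = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(net_input))
    return outputs

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron
x = [1.0, 0.0]
hidden = dense_layer(x, weights=[[0.5, -0.5], [0.0, 0.0]], biases=[0.0, 0.0])
output = dense_layer(hidden, weights=[[1.0, 1.0]], biases=[0.0])
```

Each neuron receives signals from the previous layer, computes its net input, and passes the activated value on to the next layer.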
Deep Learning Use Cases Summary
Training a Neural Network
In a nutshell this is the process to train a neural network:
- Put in Training inputs, get the output.
- Compare output to correct answers: Look at loss function J.
- Adjust and repeat.
- Backpropagation tells us how to make a single adjustment using calculus.
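The loop above can be sketched on the smallest possible model: a single weight fitted with gradient descent. The toy data, the mean squared error loss, and the learning rate are assumptions chosen for illustration; a real network repeats the same idea for every weight, with backpropagation supplying the derivatives.

```python
# Toy data generated from y = 2 * x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def loss(w):
    """Loss function J: mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """dJ/dw, derived with calculus (the one-weight case of backpropagation)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # initial guess
learning_rate = 0.05  # illustrative value
for step in range(100):
    w -= learning_rate * gradient(w)   # adjust and repeat
```

After the loop, `w` is very close to 2.0 and `loss(w)` is close to zero.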
The vanishing gradient problem arises because, as you add more layers, the gradient becomes very small at the early layers. For this reason, other activations (such as ReLU) have become more common than the sigmoid.
The right activation function depends on the application, and there are no hard and fast rules. These are some of the most used activation functions and their most common use cases:
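A rough way to see the effect: the gradient reaching an early layer contains one activation derivative per layer it passes through. The sketch below multiplies only those derivatives and ignores the weights, which is a simplifying assumption. The sigmoid derivative is never larger than 0.25, so the product shrinks quickly, while the ReLU derivative is 1 for positive inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # never larger than 0.25

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def gradient_factor(derivative, z, n_layers):
    """Product of activation derivatives accumulated across n_layers."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= derivative(z)
    return factor

sigmoid_factor = gradient_factor(sigmoid_derivative, 0.0, 10)  # 0.25 ** 10
relu_factor = gradient_factor(relu_derivative, 1.0, 10)        # stays at 1.0
```

Even in the best case for the sigmoid (input 0), ten layers scale the gradient by less than one millionth.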
Deep Learning and Regularization
Technically, a deep Neural Network has 2 or more hidden layers (often, many more). Deep Learning involves Machine Learning with deep Neural Networks. However, the term Deep Learning is often used to broadly describe a subset of Machine Learning approaches that use deep Neural Networks to uncover otherwise-unobservable relationships in the data, often as an alternative to manual feature engineering. Deep Learning approaches are common in Supervised, Unsupervised, and Semi-supervised Machine Learning.
These are some common ways to prevent overfitting and regularize neural networks:
- Regularization penalty in the cost function: this option explicitly adds a penalty to the loss function.
- Dropout: at each training iteration (batch), a randomly chosen subset of neurons is removed. This prevents the neural network from relying too much on individual pathways, making it more robust. At test time, the weight of each neuron is rescaled to reflect the percentage of the time it was active.
- Early stopping: another heuristic approach to regularization, in which you choose rules that determine when training should stop.
Example:
Check the validation log-loss every 10 epochs.
If it is higher than it was last time, stop and use the previous model.
- Optimizers: these approaches tweak and improve the weights using methods other than plain gradient descent.
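The early stopping example can be written as a small function. This is a sketch under stated assumptions: the list `validation_losses` is a hypothetical record with one log-loss value per epoch, and the function only decides which checkpoint to keep rather than training anything.

```python
def early_stopping_epoch(validation_losses, check_every=10):
    """Return the epoch of the model to keep.

    validation_losses[i] is the validation log-loss after epoch i + 1.
    The loss is checked every `check_every` epochs; training stops at the
    first check that is worse than the previous check.
    """
    previous_loss = float("inf")
    previous_epoch = None
    for epoch in range(check_every, len(validation_losses) + 1, check_every):
        current_loss = validation_losses[epoch - 1]
        if current_loss > previous_loss:
            return previous_epoch      # stop and use the previous model
        previous_loss, previous_epoch = current_loss, epoch
    return previous_epoch              # never got worse: keep the last check

# Hypothetical losses: improving until epoch 20, worse at epoch 30
losses = [1.0] * 40
losses[9], losses[19], losses[29], losses[39] = 0.9, 0.7, 0.8, 0.6
best_epoch = early_stopping_epoch(losses)
```

Here the check at epoch 30 is worse than the one at epoch 20, so the model from epoch 20 is kept, even though the loss would have improved again later. That is the price of a simple stopping rule.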
Details of Neural Networks
Training a Neural Network is sensitive to how the derivative with respect to each weight is computed and to how convergence is reached. Important concepts at this step:
- Batching methods, which include full-batch, mini-batch, and stochastic gradient descent, determine the set of points used to compute the derivative.
- Data shuffling, which aids convergence by making sure the data is presented in a different order every epoch.
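Batching and shuffling can be combined in one small generator. This is a minimal sketch: the function name `minibatches` and the fixed random seed are assumptions made for the example, and deep learning libraries provide their own versions of this step.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield the data in shuffled mini-batches (one pass = one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)               # new order every epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)                 # fixed seed only for reproducibility
data = list(range(10))
epoch_batches = list(minibatches(data, batch_size=4, rng=rng))
```

Setting `batch_size` to `len(data)` gives full-batch gradient descent, and setting it to 1 gives stochastic gradient descent. Anything in between is mini-batch.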
Keras
Keras is a high-level library that can run on either TensorFlow or Theano. It simplifies the syntax, and allows multiple backend tools, though it is most commonly used with TensorFlow.
This is a common approach to train a deep learning model using Keras:
- Compile the model, specifying your loss function, metrics, and optimizer.
- Fit the model on your training data (specifying batch size, number of epochs).
- Predict on new data.
- Evaluate your results.
Below is the syntax to create a sequential model in Keras.
First, import the Sequential class and initialise your model object:
from keras.models import Sequential
from keras.layers import Dense, Activation
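# Create an empty model, then add layers in order
model = Sequential()
# First layer: 4 units, fed by 3 input features
model.add(Dense(units=4, input_dim=3))
model.add(Activation('sigmoid'))
# Second layer: 4 units
model.add(Dense(units=4))
model.add(Activation('sigmoid'))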
Softmax activation, classification
Scale inputs
CNNs
Convolutional Layers have relatively few weights and more layers than other architectures. In practice, data scientists add layers to CNNs to solve specific problems using Transfer Learning.
Kernels (Filters)
Transfer Learning
The main idea of Transfer Learning is to keep the early layers of a pre-trained network and re-train the later layers for a specific application.
The last layers in the network capture features that are more particular to the specific data you are trying to classify.
Later layers are easier to train as adjusting their weights has a more immediate impact on the final result.
Guiding Principles for Fine Tuning
While there are no hard and fast rules, these are some guiding principles to keep in mind:
- The more similar your data and problem are to the source data of the pre-trained network, the less intensive fine-tuning will be.
- If your data is substantially different in nature from the data the source model was trained on, Transfer Learning may be of little value.
CNN Architectures
LeNet-5
- Created by Yann LeCun in the 1990s
- Used on the MNIST data set.
- Novel Idea: Use convolutions to efficiently learn features on data set.
AlexNet
- Considered the “flash point” for modern deep learning
- Created in 2012 for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
- Task: predict the correct label from among 1000 classes.
- Dataset: around 1.2 million images.
AlexNet developers performed data augmentation for training.
- Cropping, horizontal flipping, and other manipulations.
Basic AlexNet Template:
- Convolutions with ReLUs.
- Sometimes add maxpool after convolutional layer.
- Fully connected layers at the end before a softmax classifier.
VGG
VGG simplifies the network structure: it keeps the same concepts and ideas as LeNet but is considerably deeper.
This architecture avoids manual choices of convolution size by building a very deep network from 3x3 convolutions only.
Stacked together, these small convolutions act like larger convolutions.
This was one of the first architectures to experiment with many layers (more is better!). It uses multiple 3x3 convolutions to simulate larger kernels with fewer parameters, and it served as a “base model” for future work.
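The parameter saving from stacking 3x3 convolutions can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases. The channel count of 64 is only an illustrative value.

```python
def conv_weights(kernel_size, channels, n_layers=1):
    """Weights in n_layers stacked conv layers (biases ignored),
    each with `channels` input and output channels."""
    return n_layers * kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n_layers stacked stride-1 convolutions."""
    return n_layers * (kernel_size - 1) + 1

channels = 64  # illustrative value
two_3x3 = conv_weights(3, channels, n_layers=2)    # covers a 5x5 region
one_5x5 = conv_weights(5, channels)
three_3x3 = conv_weights(3, channels, n_layers=3)  # covers a 7x7 region
one_7x7 = conv_weights(7, channels)
```

Two stacked 3x3 layers see the same 5x5 region as a single 5x5 layer but need 18 instead of 25 weights per channel pair. Three stacked layers match a 7x7 layer with 27 instead of 49.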
Inception
Proposed by Szegedy et al. in 2014, this architecture turns each layer of the neural network into several parallel branches of convolutions. Each branch handles a smaller portion of the workload.
The network concatenates different branches at the end. These networks use different receptive fields and have sparse activations of groups of neurons.
Inception V3 is a relevant example of an Inception architecture.
ResNet
Researchers were building deeper and deeper networks but started running into these issues:
In theory, very deep (56-layer) networks should fit the training data better (even if they overfit), but that was not happening.
It seemed that the early layers were just not getting updated and the signal got lost (due to vanishing-gradient-type issues).
These are the main reasons why adding layers does not always decrease training error:
- Early layers of Deep Networks are very slow to adjust.
- Analogous to “Vanishing Gradient” issue.
- In theory, the extra layers should be able to learn an “identity” transformation that makes the deeper network behave like a shallow one.
In a nutshell, a ResNet:
- Has several layers such as convolutions
- Enforces “best transformation” by adding “shortcut connections”.
- Adds the inputs from an earlier layer to the output of current layer.
- Keeps passing both the initial unchanged information and the transformed information to the next layer.
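A shortcut connection is easy to express in code. The sketch below is a simplification under stated assumptions: plain matrix-vector products stand in for the convolutions, the helper names are hypothetical, and the activation that ResNets usually apply after the addition is left out.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(vector, weights):
    """Stand-in for a convolution: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, weights_1, weights_2):
    """Compute F(x) + x: the transformed signal plus the unchanged input."""
    transformed = linear(relu(linear(x, weights_1)), weights_2)
    return [t + xi for t, xi in zip(transformed, x)]   # shortcut connection

x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
# With all-zero weights the block reduces to the identity transformation
identity_output = residual_block(x, zero_weights, zero_weights)
```

Because the input is added back, a block whose weights are zero simply passes its input through. This is why extra residual blocks can behave like a shallower network when the additional transformation is not useful.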
RNN (recurrent neural networks)
Recurrent Neural Networks are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are mostly used in applications of natural language processing and speech recognition.
One of the main motivations for RNNs is to derive insights from text and do better than “bag of words” implementations. Ideally, each word is processed or understood in the appropriate context.
Words should be handled differently depending on “context”. Also, each word should update the context.
Under the notion of recurrence, words are input one by one. This way, we can handle variable lengths of text. This means that the response to a word depends on the words that preceded it.
These are the two main outputs of an RNN:
- Prediction: What would be the prediction if the sequence ended with that word.
- State: Summary of everything that happened in the past.
Mathematical Details
Mathematically, there are recurrent cores and subsequent dense layers:
- current state = function1(old state, current input).
- current output = function2(current state).
We learn function1 and function2 by training our network!
- r = dimension of input vector
- s = dimension of hidden state
- t = dimension of output vector (after dense layer)
- U is an s × r matrix (applied to the current input)
- W is an s × s matrix (applied to the old state)
- V is a t × s matrix (applied to the current state)
The weight matrices U, V, and W are the same across all positions.
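The recurrence can be sketched in a few lines. The choice of tanh for function1 and a plain linear layer for function2 is an assumption (a common one, but not stated above), and the sizes and weight values are hypothetical.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(old_state, current_input, U, W):
    """current state = function1(old state, current input)."""
    from_input = matvec(U, current_input)    # U is s x r
    from_state = matvec(W, old_state)        # W is s x s
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]

def rnn_output(current_state, V):
    """current output = function2(current state), here a dense layer."""
    return matvec(V, current_state)          # V is t x s

# Hypothetical sizes: r = 2 (input), s = 3 (state), t = 1 (output)
U = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
W = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0]]
V = [[1.0, -1.0, 0.5]]

state = [0.0, 0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = []
for word_vector in sequence:                 # same U, W, V at every position
    state = rnn_step(state, word_vector, U, W)
    outputs.append(rnn_output(state, V))
```

The loop produces one output per position, and the final `state` summarizes the whole sequence regardless of its length.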
Practical Details
Often, we train on just the “final” output and ignore the intermediate outputs (sequence-to-vector).
A slight variation of backpropagation, called Backpropagation Through Time (BPTT), is used to train RNNs.
RNNs are sensitive to the length of the sequence (due to the “vanishing/exploding gradient” problem).
In practice, we still set a maximum length to our sequences. If the input is shorter than maximum, we “pad” it. If the input is longer than maximum, we truncate it.
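Padding and truncating to a fixed length can be sketched as follows. Padding with 0 and cutting at the end of the sequence are assumptions made for this example. Libraries differ in their defaults, so check which side your tooling pads and truncates.

```python
def pad_or_truncate(sequence, max_length, pad_value=0):
    """Return a copy of `sequence` with exactly max_length items."""
    if len(sequence) >= max_length:
        return sequence[:max_length]                               # truncate
    return sequence + [pad_value] * (max_length - len(sequence))   # pad

short = pad_or_truncate([5, 8, 2], max_length=5)
long = pad_or_truncate([5, 8, 2, 9, 4, 7, 1], max_length=5)
```

Both results have length 5, so they can be stacked into one batch.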
RNN Applications
RNNs often focus on text applications, but are commonly used for other sequential data:
- Forecasting: Customer Sales, Loss Rates, Network Traffic.
- Speech Recognition: Call Center Automation, Voice Applications.
- Manufacturing Sensor Data
- Genome Sequences
Long Short-Term Memory RNNs (LSTM)
LSTMs are a special kind of RNN (invented in 1997). The motivation for LSTMs is to solve one of the main weaknesses of RNNs: their transitional nature makes it hard to keep information from the distant past in current memory without reinforcement.
LSTMs have a more complex mechanism for updating the state.
Standard RNNs have poor memory because the transition matrix necessarily weakens the signal.
This is the problem addressed by Long Short-Term Memory RNNs (LSTM).
To solve it, you need a structure that can leave some dimensions unchanged over many steps.
- By default, LSTMs remember the information from the last step.
- Items are overwritten as an active choice.
The idea of updating states that RNNs use is old, but the computing power needed to apply it to sequence-to-sequence mapping, explicit memory units, and text generation tasks has become available only relatively recently.
Augment RNNs with a few additional Gate Units:
- Gate Units control how long/if events will stay in memory.
- Input Gate: when it is open (value close to 1), items are stored in memory.
- Forget Gate: when it is triggered, items are removed from memory.
- Output Gate: when it is open (value close to 1), the hidden unit feeds forward (outputs) in the network.
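The gates can be illustrated with a single LSTM step on scalar values. This is a sketch of the commonly used formulation, in which the forget gate multiplies the old memory (so a value near 0 erases it). The parameter dictionary and its values are hypothetical, and real cells use vectors and matrices instead of scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, bias = p[name]
        return activation(w_x * x + w_h * h_prev + bias)

    forget = gate("forget", sigmoid)         # near 0: erase stored memory
    store = gate("input", sigmoid)           # near 1: store the candidate
    release = gate("output", sigmoid)        # near 1: expose the memory
    candidate = gate("candidate", math.tanh)

    c = forget * c_prev + store * candidate  # updated memory cell
    h = release * math.tanh(c)               # hidden state passed onward
    return h, c

# Hypothetical parameters: keep old memory (forget ~1), store nothing (input ~0)
keep_params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 20.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.9, h_prev=0.0, c_prev=0.7, p=keep_params)
```

With these settings the memory `c` stays at about 0.7 even though a new input arrives, which is the "leave some dimensions unchanged over many steps" behavior that standard RNNs lack.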
Gated Recurrent Units (GRUs)
GRUs are a gating mechanism for RNNs and an alternative to LSTMs. They are based on the principle of a removed cell state:
- The hidden state is now used to transfer past information.
- Think of it as a “simpler” and faster version of LSTM.
These are the gates of GRU:
Reset gate: helps decide how much past information to forget.
Update gate: helps decide what information to throw away and what new information to keep.
LSTM vs GRU
LSTMs are a bit more complex and may therefore be able to find more complicated patterns.
Conversely, GRUs are a bit simpler and therefore are quicker to train.
GRUs will generally perform about as well as LSTMs with shorter training time, especially for smaller datasets.
In Keras it is easy to switch from one to the other by specifying a different layer type, so trying both is relatively quick.
Sequence-to-Sequence Models (Seq2Seq)
Thinking back to how any type of RNN interprets text, the model will have a new hidden state at each step of the sequence, containing information about all past words.
Seq2Seq models improve on keeping the necessary information in the hidden state from one sequence to the next.
This way, at the end of a sentence, the hidden state will have all information relating to past words.
The size of the vector from the hidden state is the same no matter the size of the sentence.
In a nutshell, there is an encoder, a hidden state, and a decoder.
Beam Search
Beam search is an attempt to solve the problem of greedy inference.
- Greedy inference means that the model produces one word at a time, always taking the most likely one. If it produces one wrong word, it might output an entirely wrong sequence of words.
- Beam search keeps multiple different hypotheses, extending each word by word until <EOS>, and then checks which full sentence is most likely.
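The difference between greedy inference and beam search shows up even on a tiny example. The "language model" below is a hypothetical lookup table invented for illustration, and the sketch omits refinements such as length normalization.

```python
import math

EOS = "<EOS>"

def next_word_probs(prefix):
    """Hypothetical toy language model: P(next word | words so far)."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"cat": 0.5, "dog": 0.5},
        ("the",): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

def beam_search(beam_width, max_steps=10):
    """Keep the beam_width most likely hypotheses at every step."""
    beams = [((), 0.0)]                      # (words, log-probability)
    for _ in range(max_steps):
        candidates = []
        for words, score in beams:
            if words and words[-1] == EOS:   # finished: carry over unchanged
                candidates.append((words, score))
                continue
            for word, prob in next_word_probs(words).items():
                candidates.append((words + (word,), score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(words[-1] == EOS for words, _ in beams):
            break
    return beams[0][0]

greedy = beam_search(beam_width=1)   # greedy inference is a beam of width 1
beam = beam_search(beam_width=2)
```

Greedy inference commits to "a" because it is the most likely first word, and ends with a sentence of probability 0.30. With a beam of width 2, the hypothesis starting with "the" survives the first step and wins with probability 0.36.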
These are examples of common enterprise applications of LSTM models:
- Forecasting: (LSTM among most common Deep Learning models used in forecasting).
- Speech Recognition
- Machine Translation
- Image Captioning
- Question Answering
- Anomaly Detection
- Robotic Control
Autoencoder
Autoencoders are a neural network architecture that forces the learning of a lower dimensional representation of data, commonly images.
Autoencoders are a type of unsupervised deep learning model that use hidden layers to decompose and then recreate their input. They have several applications:
- Dimensionality reduction
- Preprocessing for classification
- Identifying ‘essential’ elements of the input data, and filtering out noise
One of the main motivations is to find whether two pictures are similar.
Autoencoder and PCA
Autoencoders can be used in cases that are suited for Principal Component Analysis (PCA).
Autoencoders also help with one of PCA’s limitations: the features learned by PCA are linear combinations of the original features.
Autoencoders can detect complex, nonlinear relationships between the original features and the best lower dimensional representation.
Autoencoding process
The process for autoencoding can be summarized as:
- Feed image through encoder network
- Generate the lower dimension embedding
- Feed embedding through decoder network
- Generate reconstructed version of the original data
- Compare the result of the generated vs the original image
Result: A network will learn the lower dimensional space that represents the original data
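The encode, decode, and compare steps can be sketched on 2-D points instead of images. The weights below are hand-picked assumptions so that the example stays short; a real autoencoder learns them by minimizing the reconstruction error, and usually adds nonlinear activations.

```python
import math

# Hypothetical, hand-picked weights; a real autoencoder learns them by
# minimizing the reconstruction error over the training data.
direction = [1 / math.sqrt(5), 2 / math.sqrt(5)]

def encode(x):
    """Encoder: map a 2-D input to a 1-D embedding."""
    return sum(w * xi for w, xi in zip(direction, x))

def decode(z):
    """Decoder: map the 1-D embedding back to 2-D."""
    return [w * z for w in direction]

def reconstruction_error(x):
    """Squared distance between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))

on_pattern = reconstruction_error([3.0, 6.0])    # lies on the line y = 2x
off_pattern = reconstruction_error([3.0, -1.0])  # does not fit the pattern
```

Inputs that follow the pattern captured by the embedding are reconstructed almost perfectly, while inputs that do not follow it produce a large error. The same comparison underlies the anomaly detection use case.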
Autoencoder applications
Autoencoders have a wide variety of enterprise applications:
- Dimensionality reduction as preprocessing for classification
- Information retrieval
- Anomaly detection
- Machine translation
- Image-related applications (generation, denoising, processing and compression)
- Drug discovery
- Popularity prediction for social media posts
- Sound and music synthesis
- Recommender systems
Variational Autoencoder
Variational autoencoders also generate a latent representation and then use this representation to generate new samples (e.g., images).
These are some important features of variational autoencoders:
- Data are assumed to be represented by a set of normally-distributed latent factors.
- The encoder generates parameters of these distributions, namely µ and σ.
- Images can be generated by sampling from these distributions.
VAE goals
The main goal of VAEs: generate images using the decoder
The secondary goal is to have similar images be close together in latent space
Loss Function of Variational Autoencoder
The VAE reconstructs the original images from a latent vector sampled from a distribution that is encouraged to be close to a standard normal distribution.
The two components of the loss function are:
- A penalty for not reconstructing the image correctly.
- A penalty for generating vectors of parameters µ and σ that are different from 0 and 1, respectively: the parameters of the standard normal distribution.
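The two penalties can be written out directly. In this sketch the reconstruction penalty is a squared error, which is an assumption (other choices such as cross-entropy are also used). The second penalty is the KL divergence between a normal distribution with parameters µ and σ and the standard normal distribution.

```python
import math

def reconstruction_penalty(original, reconstructed):
    """Squared error between the original and the reconstructed image."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))

def kl_penalty(mu, sigma):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return sum(
        0.5 * (m ** 2 + s ** 2 - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )

def vae_loss(original, reconstructed, mu, sigma):
    return reconstruction_penalty(original, reconstructed) + kl_penalty(mu, sigma)

# The second penalty vanishes only when mu = 0 and sigma = 1
no_penalty = kl_penalty([0.0, 0.0], [1.0, 1.0])
shifted_mean = kl_penalty([1.0], [1.0])
```

A perfect reconstruction with µ = 0 and σ = 1 gives a loss of zero, and any deviation in either component increases it.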
Generative Adversarial Networks (GANs)
The invention of GANs was connected to neural networks’ vulnerability to adversarial examples. Researchers were going to run a speech synthesis contest, to see which neural network could generate the most realistic-sounding speech.
A neural network — the “discriminator” — would judge whether the speech was real or not.
In the end, they decided not to run the contest, because they realized people would generate speech to fool this particular network, rather than actually generating realistic speech.
One of the main advantages of GANs over other adversarial networks is that they do not spend any time making a hard decision about whether an input or image is fake or real. They only compute the probability of it being fake.
These are the steps to train GANs:
- Randomly initialize weights of generator and discriminator networks
- Randomly initialize noise vector and generate image using generator
- Predict probability generated image is real using discriminator
- Compute losses both assuming the image was fake and assuming it was real
- Train the discriminator to output whether the image is fake
- Compute the penalty for the discriminator probability, without using it to train the discriminator
- Train the generator to generate images that the discriminator thinks are real
- Use the discriminator to calculate the probability that a real image is real
- Use the resulting loss L to train the discriminator to output 1 when it sees real images
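The losses behind these steps can be sketched without any network code. The example assumes binary cross-entropy as the loss, a common choice that the steps above do not name explicitly. The discriminator outputs used here are made-up numbers.

```python
import math

def binary_cross_entropy(predicted_prob_real, label):
    """Loss for one prediction; label is 1 for real, 0 for fake."""
    if label == 1:
        return -math.log(predicted_prob_real)
    return -math.log(1.0 - predicted_prob_real)

def discriminator_loss(prob_on_real, prob_on_fake):
    """Discriminator wants 1 on real images and 0 on generated ones."""
    return (binary_cross_entropy(prob_on_real, 1)
            + binary_cross_entropy(prob_on_fake, 0))

def generator_loss(prob_on_fake):
    """Generator wants the discriminator to output 1 on generated images."""
    return binary_cross_entropy(prob_on_fake, 1)

# Hypothetical discriminator outputs: P(real) for a real and a generated image
d_loss = discriminator_loss(prob_on_real=0.9, prob_on_fake=0.2)
g_loss = generator_loss(prob_on_fake=0.2)
```

The same discriminator output on a generated image is scored twice with opposite labels: as "fake" when updating the discriminator and as "real" when updating the generator. That opposition is what makes the training adversarial.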
Reinforcement Learning
In Reinforcement Learning, Agents interact with an Environment
They choose from a set of available Actions
The actions impact the Environment, which impacts agents via Rewards
Rewards are generally unknown and must be estimated by the agent
The process repeats dynamically, so agents learn how to estimate rewards over time
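The loop of acting, receiving a reward, and updating an estimate can be sketched with a simple multi-armed bandit. This is a deliberately reduced setting chosen for illustration: the environment has no state, the true reward means are made-up numbers, and the epsilon-greedy rule is one of several possible strategies.

```python
import random

def run_bandit(true_means, n_steps, epsilon, rng):
    """Epsilon-greedy agent estimating the reward of each action."""
    n_actions = len(true_means)
    estimates = [0.0] * n_actions
    counts = [0] * n_actions
    for step in range(n_steps):
        if step < n_actions:
            action = step                        # try every action once
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)    # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # The environment answers with a noisy reward unknown to the agent
        reward = true_means[action] + rng.uniform(-0.1, 0.1)
        counts[action] += 1
        # Running average of the rewards observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

estimates = run_bandit([0.2, 0.5, 0.9], n_steps=500, epsilon=0.1,
                       rng=random.Random(0))
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
```

The agent never sees `true_means` directly. It only observes rewards, yet its estimates end up close enough to identify the last action as the best one.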
Advances in deep learning have led to many recent RL developments:
- In 2013, researchers from DeepMind developed a system to play Atari games
- In 2017, the AlphaGo system defeated the world champion in Go
As a result, many well-known use cases involve learning to play games. More recently, progress has been made in areas with more direct business applications.
Attributes of RL
- In general, RL approaches have been limited by their significant data and computational requirements.
- Simulation is a common approach for Reinforcement Learning applications that are complex or computing intensive.
- Discounting rewards means valuing future rewards less than immediate ones. It does NOT refer to an agent reducing the value of a reward based on its uncertainty.
- Successful Reinforcement Learning approaches are often limited by extreme sensitivity to hyperparameters.
Reinforcement Learning Architecture
The main components of reinforcement learning are: Policy, Agents, Actions, State, and Reward.
A solution represents a Policy by which Agents choose Actions in response to the State
Agents typically maximize expected rewards over time
In Python, the most common library for RL is OpenAI Gym
This differs from typical Machine Learning Problems:
Unlike labels, rewards are not known and are often highly uncertain
As actions impact the environment, the state changes, which changes the problem
Agents face a tradeoff between rewards in different periods
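The tradeoff between rewards in different periods is commonly expressed with a discount factor: a reward that arrives k steps in the future is weighted by gamma to the power k. The sketch below assumes this standard formulation. The reward values and the discount factors are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where a reward k steps ahead is weighted by gamma ** k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(rewards, gamma=1.0)
discounted = discounted_return(rewards, gamma=0.5)
```

With gamma equal to 0.5 the three rewards are worth 1 + 0.5 + 0.25 = 1.75 instead of 3, so the agent prefers rewards that arrive sooner.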
Examples of everyday applications of Reinforcement Learning include recommendation engines, marketing, and automated bidding.