Dropout (inverted dropout)
Terminology check here https://machinelearning.wtf/terms/inverted-dropout/
Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.
Both dropout and regularization can be seen as a kind of weight decay, that is, they weaken the effect that weights or parameters have on the neurons, which can in theory include the input layer.
How it works
At each training iteration with dropout, you shut down (= set to zero) each neuron of a layer with probability 1 - keep_prob, or keep it with probability keep_prob (for example 50% if keep_prob = 0.5).
The dropped neurons contribute nothing to the training in either the forward or the backward propagation of that iteration.
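As a rough sketch of this (the names A, dA and D below are illustrative only, not taken from any particular code base), the same random mask is applied in both passes of one iteration:
keep_prob = 0.85;
A  = rand(4, 3);                 % activations of one hidden layer (4 units, 3 examples)
dA = rand(4, 3);                 % gradient flowing back into that layer
D = rand(size(A)) < keep_prob;   % random 0/1 mask, drawn once per iteration
A = (A .* D) ./ keep_prob;       % forward pass: dropped units output 0 (inverted dropout scaling)
% ... later, in the backward pass of the SAME iteration:
dA = (dA .* D) ./ keep_prob;     % the dropped units receive no gradient either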
Intuition
When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons.
With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.
You should use dropout (randomly eliminate nodes) only in training.
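For example (just a sketch with made-up shapes; W, b and A_prev are placeholders), the mask and the 1/keep_prob scaling are applied at training time only, while at test time every unit contributes as-is:
keep_prob = 0.85;
A_prev = rand(4, 3);   % activations from the previous layer (4 units, 3 examples)
W = rand(2, 4);        % weights of the current layer (2 units)
b = rand(2, 1);
% Training time: apply the inverted-dropout mask.
D = rand(size(A_prev)) < keep_prob;
Z_train = W * ((A_prev .* D) ./ keep_prob) + b;
% Test time: no mask and no scaling.
Z_test = W * A_prev + b;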
Implementation
Usage
Assume we have a neural network containing one hidden layer L, and training is running for one iteration (forward and backward). We want to eliminate some units of L (here L is just a random matrix for the demo; each row represents one training example's activations or derivatives: 3 examples, 4 dimensions):
L = randi(35, 3, 4);
We define a keep_prob, let's say 0.85, which means a fraction (1 - keep_prob) of the units will be dropped out:
keep_prob = 0.85;
After running dropout, we will see something like:
L =
   24   15   13   11
    1   26   26   22
   30   28    7   13
sum(L): 63.000000
sum(L): 75.000000
sum(L): 78.000000
mask =
   0   1   1   1
   1   1   1   1
   1   1   0   1
sum(L): 45.882353
sum(L): 88.235294
sum(L): 83.529412
L =
    0        17.6471   15.2941   12.9412
    1.1765   30.5882   30.5882   25.8824
   35.2941   32.9412    0        15.2941
The sum(L) lines show the row-wise sum of each training example's activations, before and after dropout. For a single random mask the sums differ, but because of the division by keep_prob they stay close to the originals on average.
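To convince ourselves of that, we can redraw the mask many times and average the resulting row sums (a quick sanity check, not part of the demo above):
L = randi(35, 3, 4);
keep_prob = 0.85;
num_trials = 10000;
acc = zeros(3, 1);
for t = 1:num_trials
  mask = rand(size(L)) < keep_prob;
  acc = acc + sum((L .* mask) ./ keep_prob, 2);
end
disp([sum(L, 2), acc / num_trials]);   % the averaged row sums are close to the original sum(L, 2)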
Full code in Matlab:
L = randi(35, 3, 4);
keep_prob = 0.85;
L
printf("sum(L): %f\n", sum(L, 2));
L = dropout(L, keep_prob);
printf("sum(L): %f\n", sum(L, 2));
L
Dropout algorithm
Assume we have a matrix X of feature vectors or activations and a predefined keep_prob; our goal is to eliminate a fraction (1 - keep_prob) of the units of each vector.
Logic-wise, we can define a mask that has exactly the same dimensions as X and sets some units to 0:
sz = size(X);
mask = rand(sz);
mask = mask < keep_prob; % Each element of mask becomes 1 with probability keep_prob, otherwise 0
The mask will then look something like:
0 1 1 1
1 1 1 1
1 1 0 1
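As an extra sanity check (not part of the algorithm itself), the fraction of ones in a large mask should come out close to keep_prob:
keep_prob = 0.85;
mask = rand(1000, 1000) < keep_prob;
sum(mask(:)) / numel(mask)   % roughly 0.85: about keep_prob of the units are kept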
We see that some of the units have been set to 0. Now we can put this mask over X so that the elements of X at the same positions as the zeros in the mask become 0.
We do an elementwise multiplication over X:
X = X .* mask;
Important
Because the mask has dropped a fraction (1 - keep_prob) of the elements of X (input), the overall value of X (output) is reduced as well. To roughly compensate, we invert the change by dividing by keep_prob, so that the value of X (output) is barely impacted on average.
To apply this rough compensation, we scale the remaining values:
X = X ./ keep_prob;
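Written out for a single unit x with mask entry m (where m = 1 with probability keep_prob and 0 otherwise), the expected value of the kept-and-scaled unit is unchanged:
E[m * x / keep_prob] = keep_prob * (x / keep_prob) + (1 - keep_prob) * 0 = x
So on average each activation keeps its original value, which is exactly what the division by keep_prob is for.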
Full code in Matlab:
function [X] = dropout(X, keep_prob)
  % Dropout some units from X.
  % A fraction (1 - keep_prob) of the units will be dropped out.
  sz = size(X);
  mask = rand(sz);
  mask = mask < keep_prob; % Each element of mask becomes 1 with probability keep_prob, otherwise 0.
  X = X .* mask;           % Zero out the dropped units.
  % The mask has removed a fraction (1 - keep_prob) of the elements of X, so the
  % overall value of X is reduced. Dividing by keep_prob (inverted dropout)
  % compensates, so the expected value of X stays roughly unchanged.
  X = X ./ keep_prob;
endfunction
Check out the complete code here https://gist.github.com/XinyueZ/debcff1838abb8e00143d7ecf99283c9#file-ml_dropout_inverted_dropout-m-L23
Illustration of dropout
Explaining dropout at the purely mathematical level tends to be obscure, so it is better to roughly explain it here with two diagrams, which is enough for a developer working in applied ML.
Assume the NN without dropout looks as usual; this picture is very familiar.
During ONE iteration, dropout randomly eliminates (sets to 0) some units in each layer according to a certain ratio (1 - keep_prob), so after dropout the network looks roughly like the second diagram.
This is only one iteration, which means the picture may be different in the next iteration, e.g.
Do not be surprised: each iteration randomly annihilates some units, so of course the network changes.
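A tiny sketch of this behaviour (arbitrary sizes): each iteration draws a fresh random mask, so different units get dropped every time:
keep_prob = 0.85;
for iter = 1:3
  mask = rand(3, 4) < keep_prob;   % a new random mask is drawn every iteration
  disp(mask);                      % different units are dropped each time
end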
Intuitively, however, notice that the network becomes simpler, and at the extreme it degenerates into an essentially linear network (assuming ReLU as the activation function of each layer).
A simple thought experiment makes this clear.
Assume we are looking at one layer of the NN and the model is just a linear one:
Without dropping any units:
Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(x1, x2, x3) ... (some "x"s)] * [W1, W2, W3, ...] + [b1, b2, b3, ...]
After dropping some units out:
Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(x1, 0, 0) ... (some "x"s and "0"s)] * [W1, W2, W3, ...] + [b1, b2, b3, ...]
After dropping ALL units out:
Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(0, 0, 0) ... ("0"s)] * [W1, W2, W3, ...] + [b1, b2, b3, ...]
=> Y = b
The model Y becomes a model with only bias b.
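A tiny numerical sketch of the three cases above (made-up numbers, and ignoring the 1/keep_prob scaling just like the equations):
W = [1 2 3];                          % weights of one linear unit
b = 5;
X = [4; 6; 2];                        % activations coming into the unit
Y_full = W * (X .* [1; 1; 1]) + b     % nothing dropped: 4 + 12 + 6 + 5 = 27
Y_some = W * (X .* [1; 0; 0]) + b     % some units dropped: 4 + 5 = 9
Y_all  = W * (X .* [0; 0; 0]) + b     % everything dropped: only the bias b = 5 remains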
This model has a high bias (underfitting), extremely high because it is extremely simple.
Because bias and variance (overfitting) trade off against each other, the model then has extremely low variance (no overfitting).
This is the intuitive reason why dropout works for reducing overfitting, i.e. high variance.
Notice
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.