Dropout (inverted dropout)

TeeTracker
6 min read · Jan 15, 2022


Terminology check: https://machinelearning.wtf/terms/inverted-dropout/

Dropout is a widely used regularization technique that is specific to deep learning. It randomly shuts down some neurons in each iteration.

Both dropout and (L2) regularization act like a kind of weight decay: they weaken the effect of the weights or parameters on individual neurons, which can in theory include the input layer.

How it works

At each training iteration with dropout, you shut down (= set to zero) each neuron of a layer with probability 1 − keep_prob, or keep it with probability keep_prob (for example, 0.5).

The dropped neurons contribute to neither the forward nor the backward propagation of that iteration.
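As a minimal sketch (the variable names and values here are illustrative, not from the article), dropping neurons of one layer's activation vector could look like this:

keep_prob = 0.5;                   % probability of keeping a neuron
a = rand(4, 1);                    % activations of one layer (made-up values)
mask = rand(size(a)) < keep_prob;  % 1 = keep the neuron, 0 = shut it down
a = a .* mask;                     % dropped neurons output 0 for this iteration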

Intuition

When you shut some neurons down, you actually modify your model. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons.

With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.

You should use dropout (randomly eliminating nodes) only during training.
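As a rough sketch of this training/test difference (illustrative code, not the article's implementation), a layer could branch like this:

is_training = true;                  % set to false at test time
keep_prob = 0.85;
a = rand(4, 1);                      % activations of one layer (made-up values)
if is_training
  mask = rand(size(a)) < keep_prob;  % random mask, training only
  a = (a .* mask) ./ keep_prob;      % inverted-dropout scaling
end
% at test time a is used as-is: no mask, no scaling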

Implementation

Usage

Assume we have a neural network containing one hidden layer L, and training is running for one iteration (a forward and a backward pass). We want to eliminate some units of L. Here L is just a random matrix for the demo; each row represents one training example's activations (or derivatives), with 3 examples of 4 dimensions each:

L = randi(35, 3, 4); 

We define keep_prob, let's say 0.85, which means a fraction (1 − keep_prob) of the units will be dropped out:

keep_prob = 0.85;

After running dropout, we will see something like:

L =
   24   15   13   11
    1   26   26   22
   30   28    7   13

sum(L): 63.000000
sum(L): 75.000000
sum(L): 78.000000

mask =
   0   1   1   1
   1   1   1   1
   1   1   0   1

sum(L): 45.882353
sum(L): 88.235294
sum(L): 83.529412

L =
    0        17.6471   15.2941   12.9412
    1.1765   30.5882   30.5882   25.8824
   35.2941   32.9412    0        15.2941

Each sum(L) line is the sum of one training example's activations, printed before and after dropout.
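As a quick manual check (not part of the original output), each row sum after dropout is just the sum of the surviving entries divided by keep_prob:

row 1: (15 + 13 + 11) / 0.85 = 39 / 0.85 ≈ 45.882353   (24 was dropped)
row 2: (1 + 26 + 26 + 22) / 0.85 = 75 / 0.85 ≈ 88.235294   (nothing dropped)
row 3: (30 + 28 + 13) / 0.85 = 71 / 0.85 ≈ 83.529412   (7 was dropped)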

Full code in MATLAB/Octave:

L = randi(35, 3, 4);
keep_prob = 0.85;
L
printf("sum(L): %f\n", sum(L, 2));
L = dropout(L, keep_prob);
printf("sum(L): %f\n", sum(L, 2));

Dropout algorithm

Assume we have a matrix X of feature vectors or activations and a predefined keep_prob; our goal is to eliminate a fraction (1 − keep_prob) of the units of each vector.

Logic-wise, we can define a mask that has exactly the same dimensions as X and in which some units are 0:

sz = size(X);
mask = rand(sz);
mask = mask < keep_prob; % Each element of mask is set to 1 with probability keep_prob, otherwise 0

The output should be

0   1   1   1
1   1   1   1
1   1   0   1

We see that some of the units have been set to 0. Now we can put this mask over X so that the elements of X at the same positions as the zeros in the mask are also set to 0.

We do an element-wise multiplication over X:

X = X .* mask;

Important

Because the mask drops a fraction (1 − keep_prob) of the elements of X (input), the values of X (output) are reduced as well. To compensate for this roughly, we invert the change by dividing by keep_prob, so that the overall value of X (output) is barely affected.

To apply this rough compensation, we scale the rows by:

X = X ./ keep_prob;
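In expectation, this compensation is exact (a short derivation, not in the original article): for a single activation x that is kept with probability keep_prob and scaled by 1/keep_prob,

E[output] = keep_prob * (x / keep_prob) + (1 − keep_prob) * 0 = x

so the expected value of each activation is unchanged. Doing this scaling during training, instead of scaling by keep_prob at test time, is what makes the technique inverted dropout.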

Full code in MATLAB/Octave:

function [X] = dropout(X, keep_prob)
  % Drop some units from X.
  % A fraction (1 - keep_prob) of the units will be dropped out.
  sz = size(X);
  mask = rand(sz);
  mask = mask < keep_prob; % Each element of mask is set to 1 with probability keep_prob, otherwise 0
  X = X .* mask;           % Zero out the dropped units
  X = X ./ keep_prob;      % Inverted dropout: rescale so the expected value of X is preserved
endfunction

Check out the full code here: https://gist.github.com/XinyueZ/debcff1838abb8e00143d7ecf99283c9#file-ml_dropout_inverted_dropout-m-L23

Illustrating dropout

Frankly, explaining dropout at a purely mathematical level is rather obscure, so it is better to explain it roughly with two diagrams here, which is enough for a developer working in applied ML.

Assume the NN without dropout looks like this; it is the familiar, fully connected picture.

During ONE iteration, dropout randomly eliminates (sets to 0) some units in each layer according to a certain ratio (1 − keep_prob), so after dropout the network looks something like this.

This is only one iteration, which means the picture may be different in the next iteration, e.g.

Do not be surprised: each iteration randomly eliminates a different set of units, so of course the network changes.
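A tiny illustration (made-up code, not from the article): each iteration draws a fresh, independent mask, so a different sub-network is trained every time.

keep_prob = 0.85;
mask_iter1 = rand(3, 4) < keep_prob   % mask used in iteration 1
mask_iter2 = rand(3, 4) < keep_prob   % a new, independent mask for iteration 2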

Intuitively, however, notice that the network becomes simpler, and at the extreme it becomes a fully linear network (assuming ReLU as the activation function of each layer).

One simplifying assumption makes this clear.

Assume we are monitoring one layer of NN and the model is just a linear one:

Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(x1, x2, x3) .... (some "x"s)]*[W1, W2, W3….] + [b1, b2, b3...]

after dropping some units out

Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(x1, 0, 0) .... (some "x"s and "0"s)]*[W1, W2, W3….] + [b1, b2, b3...]

after dropping ALL units out

Y = W * X + b
Y = W * dropout(X) + b
=> Y = [(0, 0, 0) .... ("0"s)]*[W1, W2, W3….] + [b1, b2, b3...]
=> Y = b

The model Y becomes a model with only bias b.
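A tiny numeric check of this extreme case (with made-up values):

X = [1; 2; 3];            % activations of the previous layer
W = [0.5, -1, 2];
b = 0.1;
mask = zeros(size(X));    % extreme case: ALL units dropped
Y = W * (X .* mask) + b   % Y == b == 0.1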

This model has high bias (underfitting), extremely high because it is extremely simple.

Because bias and variance (overfitting) trade off against each other, the model then has extremely low variance (no overfitting).

This is the intuitive reason why dropout works for reducing overfitting, i.e. high variance.

Notice

  • Dropout is a regularization technique.
  • You only use dropout during training. Don’t use dropout (randomly eliminating nodes) during test time.
  • Apply dropout both during forward and backward propagation.
  • During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average we shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, hence the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5 (see the quick numeric check below).
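A quick numeric check of that last point (illustrative, not from the article): with inverted dropout the mean activation stays roughly the same.

keep_prob = 0.5;
X = rand(100000, 1);                % fake activations
mask = rand(size(X)) < keep_prob;
X_drop = (X .* mask) ./ keep_prob;  % inverted dropout
printf("mean before: %f, mean after: %f\n", mean(X), mean(X_drop));
% the two means are close, because dividing by keep_prob compensates on
% average for the roughly half of the nodes that were shut down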
