PixelRNN, image generation with RNN (lab note 1: model architecture)

Nov 7, 2022

Use a recurrent neural network (RNN) to generate images, arguably the simplest kind of image generative model.

First try

It surprised me how easy it is to generate images with an RNN. In a lab experiment, I started with a simple gray-scale image, scaled it down, and the generated image restored the original 100%.

first try with gray-scale image
Sampling loss

The architecture follows the encoder-decoder pattern. Assume we flatten the image, then use the previous pixel to generate the next pixel, and then use this new pixel together with the previous ones to generate the pixel after it.

t: represents one pixel
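The pixel-by-pixel scheme can be sketched in plain Python. This is a minimal illustration, not the code used in the lab: the tiny 2 x 3 image and the helper names `flatten` and `next_pixel_pairs` are made up for the example. Each pair holds every pixel seen so far as the input and the pixel that follows as the target.

```python
def flatten(image):
    """Turn a 2-D list of pixel rows into one flat pixel sequence."""
    return [pixel for row in image for pixel in row]


def next_pixel_pairs(pixels):
    """Pair every prefix of the sequence with the pixel that follows it."""
    return [(pixels[:t], pixels[t]) for t in range(1, len(pixels))]


# Hypothetical 2 x 3 gray-scale image with intensities in [0, 1].
image = [
    [0.0, 0.5, 1.0],
    [1.0, 0.5, 0.0],
]

pixels = flatten(image)
pairs = next_pixel_pairs(pixels)

for context, target in pairs:
    print(context, "->", target)
```

At generation time the same loop would run with the model's own predictions appended to the context instead of the ground-truth pixels.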

One optimization is to add batch normalization after some layers, which lets each layer learn more independently of the others, adds some degree of regularization, and improves training efficiency.
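To make the idea concrete, here is a rough sketch of what batch normalization computes for one batch of features. It is a simplification under stated assumptions: the learnable scale and shift parameters and the running statistics used at inference time are left out, and the function name `batch_norm` and the sample values are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column of the batch to zero mean and unit variance."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n
        for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / math.sqrt(var + eps)
         for v, m, var in zip(sample, means, variances)]
        for sample in batch
    ]


# Hypothetical batch: 3 samples, 2 features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch)
```

After normalization both feature columns share the same scale, which is what keeps the inputs of the following layer stable during training.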

In-depth attempt

With a complex image, first binarize the image intensities (0 or 1) so as to avoid blurring the image, and then flatten each row of the image across all colour channels, i.e. each row becomes a single vector of columns × channels values.


Keep the previous logic, but instead of a pixel generating the next pixel, a row now generates the next row.
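The row-based preparation could look roughly like the sketch below. The 0.5 threshold, the helper names `binarize` and `flatten_rows`, and the sample image are assumptions made for illustration; the note does not specify them.

```python
def binarize(image, threshold=0.5):
    """Map every channel value to 0 or 1 (the threshold is an assumed choice)."""
    return [[[1 if value >= threshold else 0 for value in pixel]
             for pixel in row]
            for row in image]


def flatten_rows(image):
    """Flatten each row of an m x n x c image into one vector of n * c values."""
    return [[value for pixel in row for value in pixel] for row in image]


# Hypothetical 3 x 2 image with 3 colour channels, intensities in [0, 1].
image = [
    [[0.9, 0.1, 0.2], [0.7, 0.8, 0.3]],
    [[0.4, 0.6, 0.5], [0.0, 1.0, 0.2]],
    [[0.3, 0.3, 0.9], [0.6, 0.1, 0.7]],
]

rows = flatten_rows(binarize(image))

# Row-level version of the pixel logic: the rows seen so far predict the next row.
row_pairs = [(rows[:t], rows[t]) for t in range(1, len(rows))]
```

Each element of `rows` is one RNN time step, so the sequence length equals the number of image rows.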

Try a 3-channel image

After generation, comparing the result with the original images shows a very small loss of 0.1160.

Sampling loss

left: Many-To-One, right: Many-To-Many

RNN output

Many-To-One (seq2vec) or Many-To-Many (seq2seq)

The only difference between them is which part of the RNN output is used to generate the next pixel row. In other words, Many-To-One keeps only the last time step of the output, so there’s an extra call

y = y[:, -1, :]

after the RNN completes.

left: Many-To-Many, right: Many-To-One
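The difference between the two output modes can be seen with a tiny hand-written RNN. This is only a sketch: it uses a single tanh layer, random weights in place of a trained model, and no batch dimension, and the function name `rnn_forward` is hypothetical.

```python
import math
import random


def rnn_forward(rows, w_in, w_rec, bias):
    """Run a one-layer tanh RNN over the rows and return every hidden state."""
    hidden = [0.0] * len(bias)
    outputs = []
    for x in rows:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[j], x))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
        outputs.append(hidden)
    return outputs


random.seed(0)
input_size, hidden_size = 4, 3

# Random weights stand in for a trained model (assumption for illustration).
w_in = [[random.uniform(-1, 1) for _ in range(input_size)]
        for _ in range(hidden_size)]
w_rec = [[random.uniform(-1, 1) for _ in range(hidden_size)]
         for _ in range(hidden_size)]
bias = [0.0] * hidden_size

rows = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]  # three binarized image rows

y = rnn_forward(rows, w_in, w_rec, bias)

many_to_many = y     # one output per input row (seq2seq)
many_to_one = y[-1]  # last step only (seq2vec), like y[:, -1, :] on a batched tensor
```

Many-To-Many yields as many output vectors as input rows, while Many-To-One collapses the sequence to the final hidden state before predicting the next row.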

Create model

model = GenModel(
    input_size=...,     # Dimension of one RNN time step (time series step)
    hidden_size=...,    # Number of internal (hidden) units of the RNN
    num_layers=...,     # Number of stacked RNN layers, think of general MLP layers
    bidirectional=...,  # Bi-directional RNN or not, True for yes
)

Note on input_size:

Assume we have an m × n × c image, where m is the number of rows, n is the number of columns, and c is the number of color channels.
For grey-scale, the input_size should be n × 1 because there is only one color channel. For multi-channel images the input_size is n × c.

Generally speaking: input_size = image_flatten.shape[-1]

Code snippet of Many-To-One(seq2vec)

Code snippet of Many-To-Many(seq2seq)
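As a rough idea of how both variants could share one model, here is a hypothetical PyTorch sketch. It is not the lab's actual GenModel: the `many_to_one` flag, the linear output head, the sigmoid activation, and all sizes are assumptions chosen for illustration. Only the constructor arguments `input_size`, `hidden_size`, `num_layers`, `bidirectional` and the `y[:, -1, :]` slice come from the note.

```python
import torch
from torch import nn


class GenModel(nn.Module):
    """Hypothetical row-generating RNN, written only to illustrate the two modes."""

    def __init__(self, input_size, hidden_size, num_layers,
                 bidirectional=False, many_to_one=True):
        super().__init__()
        self.many_to_one = many_to_one
        # batch_first=True gives tensors shaped (batch, rows, features).
        self.rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers,
                          batch_first=True, bidirectional=bidirectional)
        directions = 2 if bidirectional else 1
        # Assumed output head: project RNN features back to one flattened row.
        self.head = nn.Linear(hidden_size * directions, input_size)

    def forward(self, x):
        y, _ = self.rnn(x)        # (batch, rows, hidden_size * directions)
        if self.many_to_one:
            y = y[:, -1, :]       # keep only the last time step
        return torch.sigmoid(self.head(y))


# Random stand-in for a flattened 3-channel image: m=8 rows, n=6 columns.
image_flatten = torch.rand(8, 6 * 3)
input_size = image_flatten.shape[-1]       # n * c = 18
batch = image_flatten[:-1].unsqueeze(0)    # first 7 rows as a batch of one

seq2vec = GenModel(input_size, hidden_size=32, num_layers=2, many_to_one=True)
seq2seq = GenModel(input_size, hidden_size=32, num_layers=2, many_to_one=False)

next_row = seq2vec(batch)     # shape (1, 18): one predicted row
next_rows = seq2seq(batch)    # shape (1, 7, 18): one predicted row per input row
```

With Many-To-One the target would be the single row that follows the input sequence, while with Many-To-Many the target would be the input sequence shifted by one row.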

To be continued…