From Linear combination To Recommendation algorithm

TeeTracker
8 min read · Nov 20, 2022

This article introduces a use case of a collaborative-filtering-based recommendation system built with deep learning.

This article will not rely on the mathematical terminology of linear algebra; such terms are only occasionally mentioned, so do not dwell on them too much.

A “recommendation” usage

A question arises here: can we use some kind of supervised learning algorithm to predict the rating a user might give, based on a user-ID and a movie-ID, or a user-ID and a lecture-ID? We can then compare the estimated rating, or quantity of interest, with a reasonable threshold to decide whether we should recommend the item to the user or not.

Linear combination

Recall that the row vectors X0 and X1 can be stretched by a Beta vector (two factors) along their directions, vertically and horizontally. Check:

Left: equation for a single vector. Right: equation for a matrix (vectors stacked row-wise)

Decomposition

It turns out that a vector can be "decomposed" into 2 vectors, as long as there are some reasonable Beta factors.

Conversely, if we combine two random vectors to form the vector we want (the green point), the ideal case is that some best X and Beta pair make the "green" and "yellow" points overlap.

This is where machine learning can help.

Intuition and a vanilla example

We all know that 10 = 3 + 7; we have learned this since primary school. But what about 10 = 1.5 × 2 + 7, or even 10 = 1.5 × 2 + 2 × 3.5? The latter forms give the same result as the first one.

C0, C1 are components and 10 = 1.5 × 2 + 2 × 3.5

In more mathematical terms:

10 = 3 + 7 

10 = [1.5, 2] dot [2, 3.5]
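
A quick NumPy check of this dot-product view (a throwaway sketch, assuming only NumPy, just to verify the arithmetic):

import numpy as np

# 10 written as a dot product of two small vectors
beta = np.array([1.5, 2.0])
x = np.array([2.0, 3.5])
print(np.dot(beta, x))  # 10.0, the same as 3 + 7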

I guess this example is enough for us to recall the idea of a linear combination.

If you cannot get this point, don't read on; otherwise it would be a waste of your time……🫡.

Go ahead

2-D to 3-D and more

We can extend this to the 3-D world as long as we can find reasonable Betas; in 3-D, the vector representing Beta is 3-dimensional.

Decomposition 2-D

So far we have done a decomposition for a scalar. For a multi-dimensional vector we can do the same thing:

vector := (10, 10) or [10, 10]

# After decomposing:

vector := [ [1.5, 2],
            [1.5, 2] ]  dot  [ 2,
                               3.5 ]

I tried to write them Python-style.
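
The same thing written as an actual matrix product, a minimal NumPy sketch of the decomposition above:

import numpy as np

# The target we want to reconstruct: (10, 10) stacked row-wise
target = np.array([[10.0], [10.0]])

# The two factors from the decomposition
factor_a = np.array([[1.5, 2.0],
                     [1.5, 2.0]])
factor_b = np.array([[2.0],
                     [3.5]])

# The matrix product "inverts" the decomposition
print(factor_a @ factor_b)                        # [[10.], [10.]]
print(np.allclose(factor_a @ factor_b, target))   # True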

Assume we have two feature vectors in a 2-D coordinate system, represented row-wise by (10) and (10); together they form a dense matrix. According to the decomposition, we have split it into two matrices, both of which are sparse.

Usually, the dense format is preferred as it saves a lot of storage and memory space, while the benefit of the sparse matrix is that it is in the natural matrix format, so you can apply computations such as cosine similarity to it directly.
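
For instance, here is a small sketch (assuming SciPy and scikit-learn, with made-up numbers) of keeping a user-item matrix in sparse format and computing user-user cosine similarity on it directly:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# A tiny user-item interaction matrix: rows = users, columns = items
ratings = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 0, 5],
])
sparse_ratings = csr_matrix(ratings)   # same data, sparse storage

# Cosine similarity between users, computed directly on the sparse matrix
print(cosine_similarity(sparse_ratings))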

ML use case, recommendation system

We can use the idea of decomposition to build a recommendation system. Assume we have a rating system containing the quantity of each user's interest in each item. We can define a threshold over the model's prediction of that quantity for a specific user to determine whether an item should be recommended to that user or not.
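
The decision rule itself is just a comparison; a minimal sketch (the threshold value and the helper name are my own, hypothetical choices):

THRESHOLD = 0.5  # chosen per use case, e.g. by validating against held-out ratings

def should_recommend(predicted_score: float, threshold: float = THRESHOLD) -> bool:
    """Decide whether to recommend an item given the model's predicted interest."""
    return predicted_score >= threshold

print(should_recommend(0.73))  # True -> recommend
print(should_recommend(0.31))  # False -> skip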

A typical example of the rating data structure

We have table data containing the ratings given by users for different lectures. We call this the user-item interaction table (matrix):

Recalling the decomposition algorithm, we can decompose this table into two tables (matrices), U and I:

How can we figure out the tables U and I, or in other words, how can we find a model that can always invert two tables like these back into the user-item interaction table (matrix)?

This concept has a name, "collaborative filtering", because the rating of an unseen user × item pair is inferred from other user × item pairs, including the user's own (collaborative working).

One solution with deep learning

Assume we have a multi-dimensional vector to represent each user-ID and item-ID, each of them unique; i.e. if we have 1000 users, then each user-ID is a unique one-hot vector. The same happens for the items, e.g. 20000 items.

Naturally, when we encode them into one-hot, the user-ID and item-ID have different dimensions.

Left: the original rating table. Right: one-hot (sparse) encoded

Using the concept of embedding in deep learning, we can transform two vectors of different dimensions into a unified dimension. The dimension the two vectors share after embedding is just a hyperparameter.
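
A minimal Keras sketch of that idea (the sizes, 1000 users, 20000 items and an embedding size of 16, are just the example numbers from above):

import numpy as np
from tensorflow.keras import layers

num_users, num_items, embedding_size = 1000, 20000, 16

user_embedding = layers.Embedding(input_dim=num_users, output_dim=embedding_size)
item_embedding = layers.Embedding(input_dim=num_items, output_dim=embedding_size)

# Integer IDs (the position of the "1" in the one-hot vector) go in,
# vectors of the same, unified dimension come out.
user_vec = user_embedding(np.array([42]))       # shape: (1, 16)
item_vec = item_embedding(np.array([13117]))    # shape: (1, 16)
print(user_vec.shape, item_vec.shape)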

It turns out that by using deep learning, gradient descent, and ground-truth labels, we perform the inverse of the decomposition, and the resulting model is exactly what we are aiming for.

Objective of this learning system: the weights of the embedding layers for the two vectors (representing the user-ID and item-ID respectively).

We can consider the embedding layers as the "Encoder" and the inverting layers as the "Decoder". Encoder-Decoder-Linear-Predictor is a typical deep learning pattern.

Optional: I’m not sure whether it is possible:

Because I do computer vision as well, I use 2D convolutions quite often to extract image features. I suppose that a 1D convolution could do something similar here, extracting features from users and items 🧐 and bringing them into the same dimension, of course with the help of some feature engineering, for example if we have unique user information in some other text form, and the same for the items 🧐.

You can try if you want, “never say never”.

Back to our case: we have user-IDs and item-IDs, both unique and already preprocessed. The inverse of the decomposition, i.e. the dot operation, is defined in the framework. As mentioned earlier, the inversion is about "approximating" the real values, so we can also (optionally) add a bias for each of the two vectors to the dot result.
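
Written in one line, the network below learns to produce

rating ≈ predictor( user_vector dot item_vector + user_bias + item_bias )

where the predictor is the Tanh (or Sigmoid) activation discussed next.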

Illustration of DL Net structure (above) and embedding (below)

The code is in Keras; it could be done in PyTorch, too. To keep both close without huge differences, I wrote it class-wise (not functional or sequential), so that PyTorch users can read it easily as well.

Notice that the predictor is Tanh; we could basically use Sigmoid, too.

Personally, I think Tanh is more "active", meaning that the weight updates are "not too slow" during learning, while Sigmoid tends to be "flat". Of course, by adjusting the learning rate we can overcome the difference between the two, which can only be determined by trial and error.

Left: Tanh(). Right: Sigmoid()
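
A quick way to see the difference is to compare the gradients around 0: the slope of Tanh at 0 is 1, while the slope of Sigmoid at 0 is only 0.25, so with the same learning rate Tanh pushes the weight updates harder. A small NumPy check:

import numpy as np

x = 0.0
tanh_grad = 1.0 - np.tanh(x) ** 2      # d/dx tanh(x) = 1 - tanh(x)^2    -> 1.0
sig = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sig * (1.0 - sig)       # d/dx sigmoid(x) = s(x)(1 - s(x)) -> 0.25
print(tanh_grad, sigmoid_grad)         # 1.0 0.25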

If we are solving a regression problem, unlike the classification (categorical) case we are discussing, we can let the predictor simply pass through the result of the penultimate layer, here the dropout.
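
For example, a sketch of that change inside the constructor of RecommenderNet below (the 'linear' activation is just the identity, so the dropout output passes through unchanged):

# Regression variant, in RecommenderNet.__init__:
self.predictor = layers.Activation('linear')   # identity instead of 'tanh'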

Dropout is not really a layer; however, frameworks name it a "layer". I will not address that topic in this article.

Code snippets

First things first: do the data preprocessing. Load the data from a rating-system database (or wherever) and encode it into one-hot, for instance:

Notice that scale=True determines the type of objective: either the classification problem (predict whether the user would rate the item, 1, or not, 0) or the regression problem (predict continuous values for the quantity of the rating).

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# Data preprocessing: create train, validation and test sets.
def generate_train_test_datasets(dataset, scale=True):
    min_rating = min(dataset["rating"])
    max_rating = max(dataset["rating"])

    dataset = dataset.sample(frac=1, random_state=42)
    x = dataset[["user", "item"]].values
    if scale:
        # Min-max scale the ratings into [0, 1]
        y = dataset["rating"].apply(lambda r: (r - min_rating) / (max_rating - min_rating)).values
    else:
        y = dataset["rating"].values

    # Train on 80% of the data, validate on 10%, test on 10%
    train_indices = int(0.8 * dataset.shape[0])
    test_indices = int(0.9 * dataset.shape[0])

    x_train, x_val, x_test, y_train, y_val, y_test = (
        x[:train_indices],
        x[train_indices:test_indices],
        x[test_indices:],
        y[:train_indices],
        y[train_indices:test_indices],
        y[test_indices:],
    )
    return x_train, x_val, x_test, y_train, y_val, y_test

# Prepare the dataset
x_train, x_val, x_test, y_train, y_val, y_test = generate_train_test_datasets(encoded_data)
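
The snippet above assumes that encoded_data already exists as a DataFrame with integer-encoded user, item and rating columns; here is a hypothetical sketch of producing it from raw IDs (the column names and sample values are assumptions):

import pandas as pd

# Hypothetical raw ratings, e.g. loaded from a rating-system database
raw = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "item_id": ["ML0101", "ML0101", "CV0201", "DL0110"],
    "rating":  [3.0, 2.0, 5.0, 4.0],
})

# Map every unique raw ID to a contiguous integer index (the position of the
# "1" in its one-hot vector); the Embedding layers below expect these indices.
user_index = {uid: i for i, uid in enumerate(raw["user_id"].unique())}
item_index = {iid: i for i, iid in enumerate(raw["item_id"].unique())}

encoded_data = pd.DataFrame({
    "user": raw["user_id"].map(user_index),
    "item": raw["item_id"].map(item_index),
    "rating": raw["rating"],
})
num_users, num_items = len(user_index), len(item_index)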

class RecommenderNet(keras.Model):

    def __init__(self, num_users, num_items, dropout=3.28e-2, embedding_size=16, **kwargs):
        """
        Constructor
        :param int num_users: number of users
        :param int num_items: number of items
        :param float dropout: dropout rate used for regularization
        :param int embedding_size: the size of the embedding vector
        """
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_items = num_items
        self.embedding_size = embedding_size

        self.user_embedding_layer = layers.Embedding(
            input_dim=num_users,
            output_dim=embedding_size,
            name="user_embedding_layer",
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.user_bias = layers.Embedding(
            input_dim=num_users,
            output_dim=1,
            embeddings_regularizer=keras.regularizers.l2(1e-6),
            name="user_bias",
        )

        self.item_embedding_layer = layers.Embedding(
            input_dim=num_items,
            output_dim=embedding_size,
            name="item_embedding_layer",
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
        )
        self.item_bias = layers.Embedding(
            input_dim=num_items,
            output_dim=1,
            embeddings_regularizer=keras.regularizers.l2(1e-6),
            name="item_bias",
        )

        self.dot = layers.Dot(axes=(-1, -1))
        self.add = layers.Add()
        self.relu = layers.Activation("relu")
        self.dropout = layers.Dropout(dropout)

        self.predictor = layers.Activation("tanh")

    def call(self, inputs):
        """
        Method called during model fitting.

        :param inputs: batch of (user index, item index) pairs
        """
        # Look up the user embedding vector and its bias
        user_vector = self.user_embedding_layer(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        # Look up the item embedding vector and its bias
        item_vector = self.item_embedding_layer(inputs[:, 1])
        item_bias = self.item_bias(inputs[:, 1])

        # Invert the decomposition: dot product of the two embeddings
        dot_user_item = self.dot([user_vector, item_vector])
        # Add all the components (including the biases)
        x = self.add([dot_user_item, user_bias, item_bias])
        x = self.relu(x)
        # Regularization with dropout
        x = self.dropout(x)

        # Tanh output layer producing the predicted (scaled) rating
        return self.predictor(x)
early_stop_callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=0.0001, patience=5, verbose=1,
)

embedding_size = 32
model_improved = RecommenderNet(
    num_users,
    num_items,
    0.6e-2,
    embedding_size,
    name="recommender_net",
)

loss_object = tf.keras.losses.MeanSquaredError()
optimizer = keras.optimizers.Adam()
metric = tf.keras.metrics.RootMeanSquaredError()
model_improved.compile(loss=loss_object, optimizer=optimizer, metrics=[metric])

model_improved.trainable = True
# EpochCallback is a custom callback defined elsewhere (e.g. for logging).
fit_1 = model_improved.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=2**11,
    epochs=100,
    verbose=2,
    callbacks=[EpochCallback(), early_stop_callback],
)

plt.plot(fit_1.history['loss'], label='train loss')
plt.plot(fit_1.history['val_loss'], label='validation loss')
plt.legend()
plt.tight_layout()
plt.show()

plt.plot(fit_1.history['root_mean_squared_error'], label='train rms')
plt.plot(fit_1.history['val_root_mean_squared_error'], label='validation rms')
plt.legend()
plt.tight_layout()
plt.show()
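
Once trained, the model can be used exactly as sketched at the beginning: score the items a user has not rated yet and keep the ones above a threshold, or simply the top-N. A hypothetical usage sketch:

import numpy as np

user_idx = 42                                   # an already-encoded user index
candidate_items = np.arange(num_items)          # here: score every item

# Build (user, item) pairs and predict the scaled interest for each one
pairs = np.stack([np.full_like(candidate_items, user_idx), candidate_items], axis=1)
scores = model_improved.predict(pairs, verbose=0).ravel()

top_10 = candidate_items[np.argsort(scores)[::-1][:10]]   # the 10 best-scoring items
recommended = candidate_items[scores >= 0.5]              # or: threshold-based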

Basically, the model is universal for data with the structure introduced ☝️. However, different use cases have different contexts; this article only offers a soft landing on the recommendation planet, connected through the basic knowledge of linear combination, decomposition, and embedding.

NMF (Non-negative matrix factorization) approach

More on that on another page 🫡 later.
