Unsupervised Machine Learning (review)

TeeTracker
17 min read · Oct 31, 2022

Unsupervised Learning Algorithms

Unsupervised algorithms are relevant when we don’t have an outcome or labeled variable we are trying to predict.

They are helpful for finding structure within our data set and for partitioning the data set into smaller pieces.

Types of Unsupervised Learning: the two covered here are clustering and dimensionality reduction.

Dimensionality reduction is important in the context of large amounts of data.

The Curse of Dimensionality

In theory, a large number of features should improve performance, since models have more data to learn from. But in practice, too many features often lead to worse performance. There are several reasons why; if you have too many features, several things can go wrong, for example:

- Some features reflect spurious correlations: they correlate with the outcome inside your data set, but the relationship does not hold up once new data comes in.

- Too many features create more noise than signal.

- Algorithms find it hard to sort through non-meaningful features if there are too many of them.

- The number of training examples required increases exponentially with dimensionality.

- Higher dimensionality slows performance.

- Larger data sets are computationally more expensive.

- Higher incidence of outliers.

To fix these problems in real life, it’s best to reduce the dimension of the data set.

Similar to feature selection, you can use unsupervised machine learning models for this, such as Principal Component Analysis.

Common uses of clustering cases in the real world

1. Anomaly detection

Example: Fraudulent transactions.

Suspicious fraud patterns, such as a small cluster of credit card transactions with a high volume of attempts, small amounts, and new merchants, form their own cluster. That cluster stands out as an anomaly, suggesting that fraudulent transactions may be happening.

2. Customer segmentation

You could segment customers by recency, frequency, and average number of visits in the last 3 months. Another common type of segmentation is by demographics and level of engagement, for example: single customers, new parents, empty nesters, etc., combined with each group's preferred marketing channel, so you can use these insights for future marketing campaigns.

3. Improve supervised learning

You can perform a Logistic regression for each cluster. This means training one model for each segment of your data to try to improve classification.

Common uses of Dimension Reduction in the real world

1. Turn high resolution images into compressed images

This means producing a reduced, more compact version of those images that still contains most of the information telling us what the image is about.

2. Image tracking

Reduce the noise down to the primary factors that are relevant in a video capture. Reducing the data set in this way can greatly speed up the computational efficiency of the detection algorithms.

Note how the more dimensions you add, the more rows of data you need to get the same amount of coverage of the feature space.

K-means algorithm

The k-means algorithm assigns each example to its nearest centroid, moves each centroid to the new mean of its cluster, and keeps repeating this process until no example is reassigned to another cluster.

The problem with the K-means algorithm is that it is sensitive to the choice of the initial points, so different initial configurations may yield different results.

K-means++

Problem

Local optima, or simply non-optimal solutions, often happen when two cluster centroids are initialized close to each other.

Solution

Initialize with points that are far enough away from one another.

  • Start with a random initial point.
  • For the second pick, instead of choosing it randomly, we prioritize far-away points: each point is assigned a probability equal to its squared distance from the first centroid, divided by the sum of the squared distances of all points from that centroid.
    Because this probability puts more weight on the points that are far away, our second cluster centroid is likely to be more distant.
So we look at every single point, square its distance from the original centroid, and, as the formula below shows, put much more weight on the points that are far away, because they take up a larger proportion of the total squared distance of all of our points. So we are more likely to end up with a point that is not so close (such as the blue point in the figure) as our second cluster centroid.
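In formula form (a standard statement of the k-means++ weighting), a point x is picked as the next centroid with probability

P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}

where D(x) is the distance from x to the nearest centroid chosen so far.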

Metrics, clustering performance evaluation for selecting K

Inertia

Distortion

How to use?

  • Inertia will always increase as more members are added to each cluster. But this will not be the case with distortion, since it’ll work by taking that average.
  • When the similarity of points in the cluster is more important, you should use distortion. And if you’re more concerned that clusters have similar numbers of points, then you should use inertia.

Do with scikit-learn
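A minimal sketch with scikit-learn, assuming X is your feature matrix and the parameter values are illustrative:

from sklearn.cluster import KMeans

# X is a (n_samples, n_features) array of your data
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)    # cluster index for each sample
print(kmeans.cluster_centers_)    # coordinates of the 3 centroids
print(kmeans.inertia_)            # sum of squared distances to nearest centroid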

K-means summary

K-means clustering is an iterative process in which similar observations are grouped together. The algorithm starts by taking K random points known as centroids, calculates the distance of each observation to every centroid, and assigns each observation to the nearest centroid. After the first iteration every point belongs to a cluster.

Next, the centroid of each cluster is recalculated as the mean of all the points assigned to that cluster, and every point is re-assigned to its nearest centroid. We keep repeating this process until no example is assigned to another cluster.

The algorithm is called k-means because it partitions the data into K clusters, each represented by its mean. It converges when the clusters do not move anymore.

We can also run the algorithm multiple times and obtain multiple solutions. By multiple solutions we mean that each run converges (the centroids stop moving), but different runs can converge in different places, ending with different final centroids.

Advantages and Disadvantages of K-Means

The main advantage of k-means algorithm is that it is easy to compute. One disadvantage is that this algorithm is sensitive to the choice of the initial points, so different initial configurations may yield different results.

To overcome this, there is a smarter initialization of the K-means clusters called K-means++, which helps to avoid getting stuck at local optima. This is the default initialization in scikit-learn's KMeans implementation.

Model Selection, choosing K number of clusters

Sometimes you want to split your data into a predetermined number of groups or segments. Often, the number of clusters (K) is unclear, and you need an approach to select it.

A common metric is Inertia, defined as the sum of the squared distances from each point to its cluster centroid.

Smaller values of Inertia correspond to tighter clusters: we are penalizing spread-out clusters and rewarding clusters whose points sit tight around their centroids.

The drawback of this metric is that its value is sensitive to the number of points in the clusters. The more points you add, the more you keep increasing the inertia of a cluster, even if those points are relatively closer to the centroid than the existing points.

Another metric is Distortion, defined as the average squared distance from each point to its cluster centroid.

Smaller values of distortion correspond to tighter clusters.

An advantage of distortion is that, unlike inertia, it doesn't generally increase as more points are added: adding points that are close to the centroid can actually decrease the average distance, and with it the distortion.
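Written out (a standard statement of the two definitions above, with x_i the points assigned to a cluster, \mu its centroid, and n the number of points in that cluster):

\text{Inertia} = \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2 \qquad \text{Distortion} = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2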

Inertia Vs. Distortion

Both Inertia and Distortion are measures of entropy per-cluster.

Inertia will always increase as more members are added to each cluster, while this will not be the case with distortion.

When the similarity of the points within a cluster is what matters most, you should use distortion, and if you are more concerned that clusters should have a similar number of points, then you should use inertia.

Finding the right cluster

To find the clustering with a low entropy metric, you can run a few k-means clustering models with different initial configurations, compare the results, and determine which of the different initial configurations leads to the lowest inertia or distortion.

The standard deviation (std) of a cluster describes how tightly its points sit around the centroid: with a small standard deviation, the points are closer to the centroid.
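A minimal sketch of that comparison with scikit-learn, assuming X is your feature matrix; each value of K is tried with several initializations (n_init), and the run with the lowest inertia is kept:

from sklearn.cluster import KMeans

inertias = {}
for k in range(2, 11):
    # n_init runs k-means several times with different centroid seeds
    # and keeps the best (lowest-inertia) result
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Inspect how inertia drops as K grows and look for an "elbow"
for k, val in inertias.items():
    print(k, val)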

Distance Metrics

Assume that we have two-dimensional samples: [Recency, Visits].

Clustering methods rely very heavily on our definition of distance. Our choice of distance metric is extremely important to our clustering algorithms and to clustering success.

Each metric has its strengths and its most appropriate use cases, but sometimes choosing a distance metric also comes down to empirical evaluation, to determine which metric works best to achieve our goals.

These are the most common distance metrics:

Euclidean Distance

This is the most intuitive distance metric and the one we use in K-means; another name for it is the L2 distance. You probably remember it from your trigonometry classes.

We calculate the distance d by taking the square root of the sum of the squared differences along each coordinate. We can extend this to higher dimensions, for example 3 dimensions, 4 dimensions, etc. In general, for an n-dimensional space, the distance is:
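d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}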

Manhattan Distance (L1 or City Block)

Another distance metric is the L1 distance or the Manhattan distance: instead of squaring each term, we add up the absolute value of each difference. It is always at least as large as the L2 distance, and only equal when the points differ along a single axis. We use it in business cases where there is very high dimensionality.

As high dimensionality often leads to difficulty in distinguishing distances between one point and the other, the L1 score does better than the L2 score in distinguishing these different distances once we move into a higher dimensional space.
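In formula form (a standard statement of the L1 distance), for points x and y in n dimensions:

d_{L1}(x, y) = \sum_{i=1}^{n} |x_i - y_i|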

Cosine Distance

This is a slightly less intuitive distance metric. What we really care about with the cosine distance is the angle between two points, for example, for two given points A and B:

This metric gives us the cosine of the angle between the two vectors defined from the origin to two given points in a two-dimensional space. To translate this definition into higher dimensions, we take the dot product of the vectors and divide it by the norm of each point.

The key property of the cosine distance is that it remains insensitive to scaling with respect to the origin: we can move points along the same line through the origin and the distance remains the same. So any two points on the same ray through the origin have a cosine distance of zero from one another.
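In formula form, for vectors A and B, the cosine similarity is the cosine of the angle between them, and the cosine distance is one minus that value:

d_{\cos}(A, B) = 1 - \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}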

Euclidean VS Cosine distances

  • Euclidean distance is useful for coordinate based measurements.
  • Euclidean distance is more sensitive to the curse of dimensionality.
  • Cosine is better for data such as text where location of occurrence is less important.

Jaccard Distance

This distance is useful for text and is often applied to word occurrence.

Consider the following example:

In this case, the Jaccard distance is one minus the proportion of shared values: one minus the intersection over the union. The intersection is the set of words shared by the two sentences, and the union is the set of all unique words across sentences A and B.
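In formula form, where A and B are the sets of unique words in the two sentences:

d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}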

It can be useful in cases where you have text documents and you want to group similar topics together.

Hierarchical Clustering

This clustering algorithm successively splits or merges clusters until it reaches convergence.

The algorithm first identifies the pair of points with the minimal distance and turns it into the first cluster; then the pair of points with the second-smallest distance forms the second cluster, and so on. If the algorithm kept merging the closest pairs, all points would eventually end up in a single cluster, which is why HAC (hierarchical agglomerative clustering) also needs a stopping criterion.

There are a few linkage types or methods to measure the distance between clusters. These are the most common:

Single linkage: minimum pairwise distance between clusters (shortest distance between two points in each cluster).

It takes the distance between specific points and declares that as the distance between two clusters; it then finds, among all these pairwise linkages, which one is the minimum, and combines those clusters as we move up to a higher level of the hierarchy.

Pros: It helps ensure a clear separation between clusters.

Cons: It won’t be able to separate the clusters out cleanly if there is some noise between two different clusters.

Complete linkage: maximum pairwise distance between clusters (longest distance between the points in each cluster).

Instead of taking the minimum distance between the points of the two clusters, it takes the maximum. Then, among those maximum distances, it picks the smallest one and merges those clusters as we move up the hierarchy.

Pro: It would do a much better job of separating out the clusters if there’s a bit of noise or overlapping points of two different clusters.

Cons: Tends to break apart a larger existing cluster, depending on where the maximum distance between points ends up lying.

Average linkage: Average pairwise distance between clusters (the average distance between the points in one cluster and the points in the other).

Takes the average of all the points for a given cluster and uses those averages (the cluster centroids) to determine the distance between the different clusters.

Pros: The same as the single and complete linkage.

Cons: It also tends to break apart a larger existing cluster.

Ward linkage: Cluster merge is based on inertia (Centroid is the average of the feature sets of points in a cluster).

Computes the inertia for all candidate merges and picks the pair of clusters whose merge ultimately minimizes the value of inertia.

The pros and cons are the same as the average linkage.
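A minimal scikit-learn sketch of how the linkage is selected, assuming X is your feature matrix (the number of clusters is illustrative; 'ward' is the default linkage):

from sklearn.cluster import AgglomerativeClustering

# linkage can be 'ward', 'complete', 'average' or 'single'
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)   # cluster index for each sample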

Performance, and vs. K-Means

Example

Let’s say we want to cluster six cities in Canada based on their distances from one another. They are Toronto(TO), Ottawa(OT), Vancouver(VA), Montreal(MO), Winnipeg(WI), and Edmonton(ED).

We construct a distance matrix at this stage, where the number in row i, column j is the distance between city i and city j.

dist(i, j)

In fact, this table shows the distances between each pair of cities. The algorithm starts by assigning each city to its own cluster. So if we have six cities, we have six clusters, each containing just one city. Let’s denote each city by the first two characters of its name.

The first step is to determine which cities, let’s call them clusters from now on, to merge into a cluster. Usually, we want to take the two closest clusters according to the chosen distance. Looking at the distance matrix, MO and OT are the closest clusters, so we make a cluster out of them.

Notice that we just use a simple one-dimensional distance feature here, but our objects can be multidimensional, and the distance measurement can be Euclidean, Pearson, average distance, or many others, depending on the data type and domain knowledge.

See the images above: they use two dimensions.

Anyhow, we have to merge these two closest cities in the distance matrix as well.

So, rows and columns are merged as the cluster is constructed.

  • As you can see in the distance matrix, the rows and columns related to the MO and OT cities are merged as the cluster is constructed.
  • The distances from all cities to this new merged cluster get updated. But how? For example, how do we calculate the distance from WI to the [OT/MO] cluster? There are different approaches, but let’s assume, for example, that we simply take the distance from the center of the [OT/MO] cluster (Ward linkage) to WI. Updating the distance matrix, we now have one less cluster.
  • We look for the closest clusters once again. In this case, [OT/MO] and TO are the closest ones, which creates another cluster.
  • Next, the closest distance is between the VA cluster and the ED cluster. They form a new cluster, and the data in the matrix table gets updated.

Essentially, the rows and columns are merged as the clusters are merged, and the distances are updated. This is a common way to implement this type of clustering, and it has the benefit of caching distances between clusters. In the same way, the agglomerative algorithm proceeds by merging clusters, and we repeat it until all clusters are merged and the tree is complete, that is, until all cities are clustered into a single cluster of size six. Hierarchical clustering is typically visualised as a dendrogram, as shown below.

Dendrogram:

  • Each merge is represented by a horizontal line.
  • The y-coordinate of the horizontal line is the similarity of the two clusters that were merged where cities are viewed as singleton clusters.
  • By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering.

Essentially, hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters, just as in flat clustering. In those cases, the hierarchy needs to be cut at some point. For example, cutting at a specific level of similarity (a threshold), we create three clusters of similar cities.
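A minimal sketch of building and cutting the dendrogram with SciPy, assuming X is your feature matrix and the distance threshold is only illustrative:

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

Z = linkage(X, method='ward')   # the merge history
dendrogram(Z)                   # plot the dendrogram
plt.show()

# Cut the hierarchy at a chosen distance threshold to get flat clusters
labels = fcluster(Z, t=10.0, criterion='distance')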

DBSCAN

Density-based spatial clustering of applications with noise

DBSCAN picks an unvisited point and looks for neighbors within a given radius (eps); if there are at least min_samples of them, the point becomes a core point and starts a cluster, and its neighbors are explored in the same way. We keep going until we find the entire cluster, and no point is left unvisited by this chain reaction. If there are no neighbors left, we randomly try a new unvisited point to potentially start a new cluster. You can find more information in the lesson.

Unlike other models, we cannot call db.predict(); scikit-learn’s DBSCAN only labels the data it was fit on, via fit_predict() or the labels_ attribute.
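A minimal scikit-learn sketch (the eps and min_samples values are illustrative); fit_predict is used instead of predict, and noise points are labeled -1:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)   # equivalent to db.fit(X); db.labels_
# labels == -1 marks points classified as noise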

Mean Shift

Fukunaga 1975, Comaniciu 2002
The new centroid (mean) can be a weighted mean instead of just the plain mean.
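A minimal scikit-learn sketch, assuming X is your feature matrix; estimating the bandwidth (the size of the window used to compute each new mean) is optional:

from sklearn.cluster import MeanShift, estimate_bandwidth

bw = estimate_bandwidth(X, quantile=0.2)   # optional bandwidth estimate
ms = MeanShift(bandwidth=bw)
labels = ms.fit_predict(X)
print(ms.cluster_centers_)                 # the converged modes (centroids)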

Summary of clustering approaches

Model Selection

Performance comparison

PCA

Principal Component Analysis
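As a minimal scikit-learn sketch, assuming X is your feature matrix; standardizing first and keeping 2 components are illustrative choices:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)            # data projected onto the components
print(pca.explained_variance_ratio_)           # variance kept by each component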

Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) is another way of reducing the number of dimensions. Similar to PCA, it is also a matrix decomposition method, of the form V = W x H.

The main difference is that it can only be applied to matrices whose values are non-negative, for example:

  • pixels in a matrix
  • non-negative attributes (values of zero or higher)

In the case of word and vocabulary recognition, each row in the input matrix can be considered a document, while each column can be considered a word or term; the extracted components then correspond to topics.

NMF has proven to be powerful for:

  • word and vocabulary recognition
  • image processing,
  • text mining
  • transcribing
  • encoding and decoding
  • decomposition of video, music, or images

There are advantages and disadvantages of only dealing with non negative values.

An advantage is that NMF leads to features that tend to be more interpretable. For example, in facial recognition, the decomposed components map to something more interpretable such as the nose, the eyebrows, or the mouth.

A disadvantage is that NMF truncates negative values by default to impose the added constraint of only positive values. This truncation tends to lose more information than other decomposition methods.

Unlike PCA, it does not have to use orthogonal latent vectors, and can end up using vectors that point in the same direction.

NMF for NLP

In the case of Natural Language Processing, NMF works as below given these inputs, parameters to tune, and outputs:

Inputs

Given vectorized inputs, which are usually pre-processed using a count vectorizer or a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer.

Parameters to tune

The main two parameters are:

  • Number of Topics
  • Text Preprocessing (stop words, min/max document frequency, parts of speech, etc)

Output

The output of NMF will be two matrices:

  1. A W matrix telling us how each document relates to the different topics.
  2. An H matrix telling us how each topic is built from the original terms, so that the product W x H reconstructs our original documents.

Syntax

The syntax consists of importing the class containing the clustering method:

from sklearn.decomposition import NMF

creating the instance of the class:

nmf = NMF(n_components=3, init='random')

and fit the instance and create a transformed version of the data:

x_nmf = nmf.fit_transform(X)
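As a short follow-up, assuming X is a non-negative document-term matrix (for example TF-IDF output): x_nmf holds the W matrix, and the H matrix lives on the fitted estimator.

W = x_nmf             # documents x topics
H = nmf.components_   # topics x terms
# The original matrix is approximately reconstructed by W @ H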
