Machine Learning (recap summary of notes)

TeeTracker
22 min read · Aug 23, 2022

--

Supervised Machine Learning

The types of supervised Machine Learning are:

  • Regression, in which the target variable is continuous (for example, movie revenue).
  • Classification, in which the target variable is categorical (for example, whether an email is spam or not).

To build a classification model you need:

  • Features that can be quantified
  • A labeled target or outcome variable
  • Method to measure similarity
Using PyTorch as an example: the loss measures the difference between the prediction yhat and the ground truth y. The loss triggers backpropagation, and the optimizer step updates the model parameters.
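
A minimal PyTorch-style sketch of this loss / backpropagation / optimizer-step cycle (the model, data, and hyperparameters below are illustrative assumptions, not from the notes):

import torch
import torch.nn as nn

# Illustrative setup (assumed): a linear model, MSE loss, SGD optimizer.
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(32, 10)     # a batch of input features
y = torch.randn(32, 1)      # ground truth targets

y_hat = model(X)            # forward pass: predicted yhat
loss = criterion(y_hat, y)  # loss between predicted yhat and ground truth y

optimizer.zero_grad()       # clear gradients from the previous step
loss.backward()             # loss triggers backpropagation
optimizer.step()            # optimizer step updates the model parameters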

Linear Regression

A linear regression models the relationship between a continuous dependent variable and one or more scaled independent variables. It is usually written as the dependent variable equal to an intercept plus coefficients (scaling factors) times the independent variables.

Residuals are defined as the difference between an actual value and a predicted value.

Modeling best practices for linear regression:

  • Use a cost function to fit the linear regression model
  • Develop multiple models
  • Compare the results and choose the model that best fits your data and your purpose, whether you are using the model for prediction or interpretation

Three common measures of error for linear regressions are:

  • Sum of Squared Errors (SSE), the typical squared-error measure
  • Total Sum of Squares (TSS)
  • Coefficient of Determination (R2)
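
As a quick sketch of how these three measures relate (R² = 1 - SSE/TSS), using numpy and scikit-learn with illustrative toy values:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # toy ground-truth values
y_pred = np.array([2.8, 5.3, 6.9, 9.4])   # toy predictions

sse = np.sum((y_true - y_pred) ** 2)          # Sum of Squared Errors
tss = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
r2 = 1 - sse / tss                            # Coefficient of Determination

assert np.isclose(r2, r2_score(y_true, y_pred))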

Linear Regression Syntax

The simplest syntax to train a linear regression using scikit-learn is:


from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR = LR.fit(X_train, y_train)

# To score a data frame X_test you would use this syntax:
y_predict = LR.predict(X_test)

Training and Test Splits

Splitting your data into a training and a test set can help you choose a model that has better chances at generalizing and is not overfitted.

The training data is used to fit the model, while the test data is used to measure error and performance.

Training error tends to decrease with a more complex model. Cross-validation error generally has a U-shape: it decreases with more complex models up to a point at which it starts to increase again.

  • Underfitting model: high bias, low accuracy
  • Overfitting model: high variance of parameters, a large loss gap between the training and validation sets
  • Just-right model: not too simple, not too complex

Cross Validation

The three most common cross validation approaches are:

  • k-fold cross validation
  • leave one out cross validation
  • stratified cross validation
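
All three are available in scikit-learn; here is a minimal sketch (the iris data and logistic regression model are illustrative choices, not from the notes):

from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

for cv in (KFold(n_splits=5, shuffle=True, random_state=0),  # k-fold
           LeaveOneOut(),                                    # leave one out
           StratifiedKFold(n_splits=5)):                     # stratified (preserves class ratios)
    scores = cross_val_score(model, X, y, cv=cv)
    print(type(cv).__name__, scores.mean())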

Polynomial Regression

Polynomial terms help you capture nonlinear effects of your features.
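
For example, scikit-learn's PolynomialFeatures can generate the polynomial and interaction terms before a linear fit; a minimal sketch, assuming an existing X_train / y_train split:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# degree=2 adds squared terms and pairwise interactions to capture nonlinear effects
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
# poly_model.fit(X_train, y_train); y_predict = poly_model.predict(X_test)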

Other algorithms that help you extend your linear models are:

  • Logistic Regression
  • K-Nearest Neighbors
  • Decision Trees
  • Support Vector Machines
  • Deep Learning Approaches
  • Random Forests
  • Ensemble Methods

Regularization Techniques

Three sources of error for your model are: bias, variance, and, irreducible error.

Regularization is a way to achieve building simple models with relatively low error. It helps you avoid overfitting by penalizing high-valued coefficients. It reduces parameters and shrinks the model.

Regularization adds an adjustable regularization strength parameter directly into the cost function.

Regularization performs feature selection by shrinking the contribution of features, which can prevent overfitting.

Ridge Regression

The complexity penalty λ is applied proportionally to squared coefficient values.

– The penalty term has the effect of “shrinking” coefficients toward 0.

– This imposes bias on the model, but also reduces variance.

– We can select the best regularization strength lambda via cross-validation.

– It’s a best practice to scale features (e.g. using StandardScaler) so penalties aren’t impacted by variable scale.
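
A minimal sketch of these practices (scaling plus cross-validated strength selection); note that scikit-learn calls the regularization strength alpha rather than lambda, and X_train / y_train are assumed to exist:

from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scale first so the penalty treats all coefficients on an equal footing,
# then let cross-validation choose the regularization strength.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5))
# ridge.fit(X_train, y_train)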

LASSO regression

The complexity penalty λ (lambda) is applied proportionally to the absolute value of the coefficients. LASSO stands for Least Absolute Shrinkage and Selection Operator.

– Similar effect to Ridge in terms of complexity tradeoff: increasing lambda raises bias but lowers variance.

– LASSO is more likely than Ridge to perform feature selection, in that for a fixed λ, LASSO is more likely to result in coefficients being set to zero.
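
A small sketch of that feature-selection behaviour on synthetic data (the dataset and the alpha value are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

# Only 5 of the 20 features are truly informative; LASSO should zero out most of the rest.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
print(np.sum(lasso.named_steps['lasso'].coef_ == 0), "coefficients shrunk exactly to zero")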

Elastic Net

Combines penalties from both Ridge and LASSO regression. It requires tuning of an additional parameter that determines emphasis of L1 vs. L2 regularization penalties.

LASSO’s feature selection property yields an interpretability advantage, but may underperform if the target truly depends on many of the features.

Elastic Net, an alternative hybrid approach, introduces a new parameter α (alpha) that determines a weighted average of L1 and L2 penalties.
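
A minimal scikit-learn sketch; note that sklearn's ElasticNet uses alpha for the overall penalty strength and l1_ratio for the L1-vs-L2 mixing weight described above (X_train / y_train assumed):

from sklearn.linear_model import ElasticNet

# l1_ratio=1.0 is pure LASSO, l1_ratio=0.0 is pure Ridge; 0.5 weights them equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
# enet.fit(X_train, y_train)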

Regularization techniques have an analytical, a geometric, and a probabilistic interpretation.

Classification Problems

The two main types of supervised learning models are:

  • Regression models, which predict a continuous outcome (problem of “how much”).
  • Classification models, which predict a categorical outcome.

The most common models used in supervised learning are:

  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machines
  • Decision Tree
  • Neural Networks
  • Random Forests
  • Boosting
  • Ensemble Models

With the exception of logistic regression, these models are commonly used for both regression and classification.

Logistic regression is most common for dichotomous and nominal dependent variables.

Normalisation, standardisation

In terms of the raw predictions, it doesn’t matter for linear regression whether you scale the features beforehand. However, it does matter for other algorithms, e.g. those using L1 or L2 regularization.

Logistic Regression

A type of regression that models the probability of a certain class occurring given other independent variables. It uses a logistic or logit function to model a dependent variable. It is a very common predictive model because of its high interpretability.
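
A minimal scikit-learn sketch, assuming an existing X_train / y_train / X_test split:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
# log_reg.fit(X_train, y_train)
# log_reg.predict_proba(X_test)[:, 1]   # modeled probability of the positive class
# log_reg.coef_                         # coefficients, readable for interpretation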

Classification Error Metrics

A confusion matrix tabulates:

true positives, false negatives (Type II errors), false positives (Type I errors), and true negatives

Accuracy is defined as the sum of true positives and true negatives divided by the total number of observations. It measures how often the model correctly predicts both positive and negative instances.

Recall or sensitivity identifies the ratio of true positives divided by the total number of actual positives. It quantifies the percentage of positive instances correctly identified.

Precision is the ratio of true positives divided by the total number of predicted positives.

The closer this value is to 1.0, the better job this model does at identifying only positive instances.

Specificity is the ratio of true negatives divided by the total number of actual negatives.

The closer this value is to 1.0, the better job this model does at avoiding false alarms.

The receiver operating characteristic (ROC) curve plots the true positive rate (sensitivity) of a model vs. its false positive rate (1 - specificity).

The area under the curve (AUC) of a ROC plot is a very common criterion for selecting a classification method.

The precision-recall curve measures the trade-off between precision and recall.

The ROC curve generally works better for data with balanced classes, while the precision-recall curve generally works better for data with unbalanced classes.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_metrics(yt, yp):
    results_pos = {}
    results_pos['accuracy'] = accuracy_score(yt, yp)
    precision, recall, f_beta, _ = precision_recall_fscore_support(yt, yp, average='binary')
    results_pos['recall'] = recall
    results_pos['precision'] = precision
    results_pos['f1score'] = f_beta
    return results_pos
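
To complement the snippet above with the curve-based metrics discussed earlier, something like the following could be used (the labels and scores below are toy values for illustration):

from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

yt = [0, 0, 1, 1, 1]                   # true labels (toy example)
y_score = [0.1, 0.4, 0.35, 0.8, 0.7]   # predicted probabilities of the positive class

auc = roc_auc_score(yt, y_score)                            # area under the ROC curve
fpr, tpr, _ = roc_curve(yt, y_score)                        # points of the ROC curve
precision, recall, _ = precision_recall_curve(yt, y_score)  # points of the precision-recall curve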

K Nearest Neighbour Methods for Classification

K nearest neighbor methods are useful for classification. The elbow method is frequently used to identify a model with low K and low error rate.

Low K -> overfitting
High K -> overly generalised

These methods are popular due to their easy computation and interpretability, although scoring new observations can take time, they provide no fitted parameters (estimators), and they might not be suited to large data sets.
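
A minimal sketch of the elbow search over K, using cross-validation on an illustrative dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # illustrative data

# Print the cross-validated error for a range of K and pick the "elbow":
# the smallest K after which the error stops improving noticeably.
for k in range(1, 21):
    error = 1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(error, 3))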

Support vector machines

The main idea behind support vector machines is to find a hyperplane that separates classes by determining decision boundaries that maximize the distance between classes.

SVM tries to find hyperplanes that have the maximum margin.

The hyperplanes are determined by the support vectors (the data points with the smallest distance to the hyperplane).

Meanwhile, in order to reduce model variance, the SVM model aims to find the maximum possible margins so that unseen data will be more likely to be classified correctly.

The cost function for logistic regression decreases toward zero, but rarely reaches zero.

SVMs use the Hinge Loss function as a cost function to penalize misclassification. This tends to lead to better accuracy at the cost of having less sensitivity on the predicted probabilities.

Regularization can help SVMs generalize better with future data. Support Vector Machine models rarely overfit on training data.

Kernel trick

Support vector machines can be extended to work with nonlinear classification boundaries by using the kernel trick.

SVMs address data that is not linearly separable via the kernel trick. Kernels are a special type of function that takes two vectors and returns a real number, like a dot-product operation.

Kernels are not themselves explicit mapping functions from low-dimensional spaces to high-dimensional spaces.

Instead, computing k(x, y) is equivalent to computing a dot product of the corresponding higher-dimensional vectors, without performing the actual feature-space transform.

Consequently, an SVM with a non-linear kernel can implicitly transform existing features into high-dimensional features that can be linearly separated in the higher-dimensional space.

The SVC model provided by sklearn has two important arguments to be tuned: the regularization parameter C and the kernel.

subject to: |margin| · ‖β‖ ≥ 1

The C argument (which Prof. Ng also uses in his ML course) is a regularization parameter.

For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly.

A small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points (underfitting).

By using Gaussian kernels, you transform your data space vectors into a different coordinate system and may have better chances of finding a hyperplane that classifies your data well.

SVMs with RBF kernels are slow to train on data sets that are large or have many features.

The kernel argument specifies the kernel to be used for transforming features to higher-dimensional spaces, some commonly used non-linear kernels are:

  • rbf: Gaussian Radial Basis Function (RBF)
  • poly: Polynomial Kernel
  • sigmoid: Sigmoid Kernel
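
A minimal sketch of tuning those two arguments with scikit-learn (the parameter values are illustrative, and X_train / y_train are assumed to exist):

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Larger C -> narrower margin that fits the training data harder;
# smaller C -> wider margin that tolerates more misclassifications.
svm_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
# svm_rbf.fit(X_train, y_train)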

More to SVM

  1. As a classification technique, SVMs handle high-dimensional feature spaces well, which makes them suitable for document classification and sentiment analysis, where data dimensionality is remarkably large.
  2. They are memory efficient: only a subset of the training points (the support vectors) is used in the decision function, so few data points need to be stored.
  3. They are adaptable: strongly non-linear class boundaries can be handled by plugging different kernels into the decision function, which gives ample flexibility and high classification performance.
  4. SVMs often deliver high accuracy compared to other classifiers such as logistic regression and decision trees, thanks to the kernel's ability to manage non-linear input spaces.

Decision tree

Decision trees split your data using impurity measures. They are a greedy algorithm and are not based on statistical assumptions.

The most common splitting impurity measures are Entropy and Gini index.

Decision trees tend to overfit and to be very sensitive to different data.

Cross validation and pruning sometimes help with some of this.

Great advantages of decision trees:

  • They are easy to interpret
  • They require no data preprocessing

Ensemble learning

1. Bagging

A model that averages the predictions of multiple models reduces the variance of a single model and has high chances to generalize well when scoring new data. Bagging is a tree ensemble that combines the prediction of several trees that were trained on bootstrap samples of the data.

If our bagging produced n independent trees, each with variance σ² (sigma squared), then the bagged variance would be σ²/n. So the larger n is, that is, the more trees we use (assuming independent trees), the more we reduce the overall variance. In reality, though, these trees are not independent: since we are sampling with replacement, they are likely to be highly correlated. As the equation below shows, if the correlation is close to one we end up with no reduction in variance, which makes sense: if you keep using the same or very similar trees over and over, you gain no new information. You need to ensure that the individual decision trees are somewhat different from one another.
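
The equation referred to above is the standard expression for the variance of an average of n trees, each with variance σ² and pairwise correlation ρ (quoted here as a standard result rather than taken from the original notes):

Var( (1/n) Σᵢ Tᵢ(x) ) = ρσ² + ((1 - ρ)/n) σ²

With ρ = 0 (independent trees) this reduces to σ²/n, and as ρ approaches 1 it approaches σ², i.e. no variance reduction, which is exactly the point made above.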

Ensemble Based Methods and Bagging

Tree ensembles have been found to generalize well when scoring new data. Some useful and popular tree ensembles are bagging, boosting, and random forests. Bagging combines decision trees trained on bootstrap-aggregated samples of the data. An advantage specific to bagging is that it can be multithreaded or computed in parallel. Most of these ensembles are assessed using out-of-bag error.

2. Random Forest

Random forest is a tree ensemble that has a similar approach to bagging. Their main characteristic is that they add randomness by only using a subset of features to train each split of the trees it trains. Extra Random Trees (Extremely Randomized Trees) is an implementation that adds randomness by creating splits at random, instead of using a greedy search to find split variables and split points.

Random Forest is similar to Bagging in that it uses multiple model versions and aggregates the ensemble of models to make a single prediction. RF uses an ensemble of trees and introduces randomness into each tree by randomly selecting a subset of the features for each node to split on. This makes the predictions of the individual trees less correlated, improving results when the models are aggregated.

In general, a random forest can be considered a special case of bagging and it tends to have better out of sample accuracy.

3. Boosting

Boosting methods are additive in the sense that they sequentially retrain decision trees using the observations with the highest residuals on the previous tree. To do so, observations with a high residual are assigned a higher weight.

The key is that each tree gets a vote in the final decision. Say we start with three different trees and use them to vote on a single classification. For a specific row of the dataset we run it through each fitted tree and collect the predicted classes, then take the majority class as the prediction for that row; we repeat this across all rows. Getting this majority class is called meta-classification, since several classifiers' outputs are combined (aggregated, or here voted on) to decide on the class. As an example, if tree one and tree two predicted red and tree three predicted blue, we have two reds versus one blue, so we decide red. This is where the term bagging from the intro comes from: it is short for bootstrap aggregating. We bootstrap by drawing each of the smaller samples and building a decision tree on each, then aggregate by bringing the predictions of those trees together.

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

n_estimators = 20
max_features = "sqrt"   # number of features considered at each split (illustrative value)

bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=2),
                            n_estimators=n_estimators, random_state=0, bootstrap=True)
random_forest = RandomForestClassifier(max_features=max_features, n_estimators=n_estimators, random_state=0)

Random forest is essentially bagging (bootstrapping plus aggregating) in which not only the subset of rows is random, but the subset of features (columns) considered at each split is random as well.

Random forests are a combination of trees such that each tree depends on a random subset of the features and data. As a result, each tree in the forest is different and usually performs better than Bagging. The most important parameters are the number of trees and the number of features to sample. First, we import RandomForestClassifier.

Like Bagging, increasing the number of trees improves results and does not lead to overfitting in most cases, but the improvements plateau as you add more trees. For this example, note that the default number of trees in the forest is 100.

Boosting and stacking

Boosting

Boosting methods are additive in the sense that they sequentially retrain decision trees using the observations with the highest residuals on the previous tree. To do so, observations with a high residual are assigned a higher weight.

Boosting algorithms create trees iteratively, not independently, by boosting observations with high residuals from the previous tree model.

  • Generally speaking, you will want more trees if you are working with a lower learning rate.
  • A smaller learning rate means less overfitting, hence higher bias and lower variance.

AdaBoost is actually part of a family of Boosting algorithms. Like Bagging and Random Forest (RF), AdaBoost combines the outputs of many classifiers into an ensemble, but there are some differences. In both Bagging and RF, each classifier in the ensemble is powerful but prone to overfitting. As Bagging or RF aggregate more and more classifiers, they reduce overfitting.

With AdaBoost, each Classifier usually has performance slightly better than random. This is referred to as a weak learner or weak classifier. AdaBoost combines these weak classifiers to get a strong classifier. Unlike Bagging and Random Forest, in AdaBoost, adding more learners can cause overfitting. As a result, AdaBoost requires Hyperparameter tuning, taking more time to train. One advantage of AdaBoost is that each classifier is smaller, so predictions are faster.

In AdaBoost, the strong classifier H(x) is a linear combination of T weak classifiers h_t(x) weighted by coefficients α_t, as shown in (1). Although each classifier h_t(x) appears independent, each α_t contains information about the errors of the classifiers h_1(x), ..., h_{t-1}(x). As we add more classifiers, the training accuracy increases. What is not so apparent in (1) is that during the training process the sample weights are modified before fitting each h_t(x). For a more in-depth look at the theory behind AdaBoost, check out The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )    (1)
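
A minimal scikit-learn sketch; by default AdaBoostClassifier boosts depth-1 decision trees (stumps), which play the role of the weak classifiers h_t(x) above (the hyperparameter values are illustrative):

from sklearn.ensemble import AdaBoostClassifier

# The weighted votes of the weak learners form the strong classifier H(x) from (1).
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
# ada.fit(X_train, y_train)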

Gradient Boosting

The main loss functions for boosting algorithms are:

  • 0–1 loss function, which ignores observations that were correctly classified. The shape of this loss function makes it difficult to optimize.
  • Adaptive boosting (AdaBoost) loss function, which has an exponential nature. The shape of this function makes it more sensitive to outliers.
  • Gradient boosting loss function. The most common gradient boosting implementation uses a binomial log-likelihood loss function called deviance. It tends to be more robust to outliers than AdaBoost.

The nature of boosting algorithms tends to produce good results in the presence of outliers and rare events.

The additive nature of gradient boosting makes it prone to overfitting. This can be addressed using cross validation or fine tuning the number of boosting iterations. Other hyperparameters to fine tune are:

  • learning rate (shrinkage)
  • subsample
  • number of features.
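
A minimal scikit-learn sketch showing those hyperparameters (the values are illustrative, not tuned, and X_train / y_train are assumed to exist):

from sklearn.ensemble import GradientBoostingClassifier

# Deviance (binomial log-likelihood) loss by default; learning_rate is the shrinkage,
# subsample < 1.0 uses stochastic gradient boosting, max_features limits features per split.
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 subsample=0.8, max_features='sqrt', random_state=0)
# gbc.fit(X_train, y_train)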

Stacking

Stacking is an ensemble method that combines any type of model by combining the predicted probabilities of classes. In that sense, it is a generalized case of bagging. The two most common ways to combine the predicted probabilities in stacking are: using a majority vote or using weights for each predicted probability.
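
A minimal sketch with scikit-learn's StackingClassifier (the choice of base models and final estimator is illustrative):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Any mix of model types can be stacked; the final estimator combines their predictions.
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=4)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression())
# stack.fit(X_train, y_train)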

Save sklearn model, just a note
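
For the note above, a common way to persist a fitted scikit-learn model is joblib (the model variable and file name below are illustrative):

import joblib

# "model" stands for any fitted sklearn estimator; 'model.joblib' is an arbitrary file name.
joblib.dump(model, 'model.joblib')            # save the model to disk
model_restored = joblib.load('model.joblib')  # load it back later for scoring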

Bagging vs Boosting

With boosting, we cannot build the trees in parallel, so it is not as computationally efficient. With bagging, by contrast, we can grow the trees in parallel, because each tree depends only on its own bootstrap sample and not on any other tree.

Basically, machine learning models can be categorized into two groups regarding their interpretability:

Self-interpretable models refer to those models with simple and intuitive structures, which are easily comprehended by humans without extra explanation methods. Self-interpretable models are usually preferred in high-risk areas such as finance or health because humans could also understand the model’s logic and make similar predictions.

In contrast, non-self-interpretable models are models with complex structures that can be described as black-box models. The main reason we use such complex models is that they can normally achieve state-of-the-art performance on specific problems such as natural language translation, image recognition, and traffic patterns.

Linear models are probably the most widely used predictive models due to their simplicity and effectiveness, especially in the financial industry.

An effective way to make a linear model easier to understand is to use feature selection methods such as LASSO regression so that the model only includes important features, thus both increasing interpretability and decreasing the risk of overfitting.

Tree models such as decision trees, are another popular self-interpretable type of model. The main characteristic of tree models is they mimic human’s reasoning process via creating a set of IF-THEN-ELSE rules. For example, if a house has more than 4 rooms and if its associated school is within the top-10, then its estimated house price is $850,000. However, like linear models, big tree models with large widths and depths also become difficult to understand and prone to suffer from overfitting. Tree pruning is one way to reduce the size of the generated trees.

The K-nearest neighbor model, or KNN, can also be considered a self-interpretable model if the feature spaces can be comprehensible and kept small. For example, suppose we want to predict the price of a house with 4 rooms and its associated high school ranking is 10th. A trained KNN model predicts the price of house zero to be $855,000 based on the average of three of its neighbors. Such a prediction process is consistent with our intuition and common sense, as we will also check for similar houses sold in the neighborhood as baselines. Also, like linear and tree models, KNN models can become difficult to understand if the features space is large. We can reduce the number of neighbor instances and features to simplify KNN and make it more interpretable.

One common way to explain a machine learning model is to find its important features, and permutation feature importance is a popular method to calculate feature importance. The basic idea is to shuffle the values of a feature of interest and make predictions using the shuffled values. The feature's importance is then measured as the difference between the prediction errors before and after the permutation.
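
A minimal sketch with scikit-learn's permutation_importance, assuming a fitted model and an X_test / y_test split already exist:

from sklearn.inspection import permutation_importance

# Each feature is shuffled n_repeats times; the drop in score measures how much
# the model relies on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")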

A Partial Dependence Plot (PDP) is an effective way to illustrate the relationship between a feature of interest and the model outcome. It essentially visualizes the marginal effect of a feature, that is, it shows how the model outcome changes as that feature varies across its distribution.

Since a machine learning model may include many features, it is not feasible to create a PDP for every single feature. Thus, we normally first find the most important features by ranking their feature importances, and then focus the PDPs on those important features.

Model-agnostic explanations (e.g. surrogate models) can be used to describe different types of machine learning models, no matter their complexity, while keeping the same format and presentation for the model explanations.

Surrogate Model

Global Surrogate Model

  • First, we select a dataset X_test as input.
  • Then, we use the black-box model to make predictions y_blackbox using the X_test.
  • With both testing data and labels ready, we can use them to train a simple logistic regression model OR a decision tree model which are surrogate models.
  • The surrogate model outputs its own predictions y_surrogate.
  • Lastly, we can measure the difference between y_surrogate and y_blackbox using an accuracy score (or f1_score or roc_auc) to determine how well the surrogate model approximates the black-box model (see the sketch below).
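
A minimal sketch of that procedure, assuming a fitted black_box model and an X_test dataset already exist, and using a shallow decision tree as the surrogate:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

y_blackbox = black_box.predict(X_test)                        # labels produced by the black-box model
surrogate = DecisionTreeClassifier(max_depth=3).fit(X_test, y_blackbox)
y_surrogate = surrogate.predict(X_test)
print("fidelity:", accuracy_score(y_blackbox, y_surrogate))   # how well the surrogate mimics the black box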

Local Surrogate

Global surrogate models may show a large prediction inconsistency between the complex black-box model and the simple surrogate model. Also, when there are many instance groups or clusters in the dataset, the surrogate model becomes too generalized across those different patterns and loses interpretability for any specific data group.

A Global Surrogate model might fail if there is a large inconsistency between surrogate models and black-box models.

On the other hand, we are also interested in how black-box models make predictions on some representative instances. By understanding these very typical examples, we can sometimes obtain valuable insights without understanding the model’s behaviors on the entire dataset.

The main difference between LIME when compared to building a global surrogate model is that LIME first generates an artificial dataset based on the selected data instance.

Modeling Unbalanced Classes

Classification algorithms are built to optimize accuracy, which makes it challenging to create a model when there is not a balance across the number of observations of different classes. Common methods to approach balancing the classes are:

  • Downsampling or removing observations from the most common class
  • Upsampling or duplicating observations from the rarest class or classes
  • A mix of downsampling and upsampling

Modeling Approaches for Unbalanced Classes

Specific algorithms to upsample and downsample are:

  • Stratified sampling
  • Random oversampling
  • Synthetic oversampling, the main two approaches being Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic sampling (ADASYN)
  • Cluster Centroids implementations like NearMiss, Tomek Links, and Nearest Neighbors

A best practice is to do a stratified train/test split before, then use an upsample or downsample technique.
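
A minimal sketch of that practice, assuming the imbalanced-learn package is installed and a feature matrix X with labels y already exists:

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Stratified split first, then resample ONLY the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train)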

Blagging (balanced bagging) is an intuitive technique for unbalanced datasets that ensures each bootstrap sample is downsampled so its classes stay balanced.

--