F-Score Metrics

TeeTracker
4 min read · Jan 8, 2022


Problem from:

  1. We have samples of some event (cancer, spam emails, etc.), of which only 0.2% are positive (POS).
  2. We have a learning model with a 1% error, i.e. it detects POS with 99% accuracy. Predicting on the samples with this model, the result looks cool: 1% error.
  3. If we simply classify every single example as negative (NEG) without any learning model, misclassifying all the POS samples in the process, the accuracy rises to 99.8% and the error drops to 0.2%.
    Well, that is even cooler than step 2.
  4. In other words, we can build a "model" that always returns NEG, and when we deploy it to predict on these samples, 99.8% of its predictions are correct. This is a ridiculous model.

So does the learning model not help at all, or is the metric, accuracy, simply not reliable? It feels like a paradox; something must have gone wrong somewhere.

Accuracy:=(correct POS predictions+correct NEG predictions) ∕ total samples
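To see the trap in code, here is a small sketch with made-up numbers (10,000 samples, 0.2% positive):

```python
import numpy as np

# Hypothetical skewed dataset: 10,000 samples, only 0.2% positive (POS = 1).
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# The "ridiculous model": always predict NEG.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.998 -> looks great, yet not a single POS is detected
```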

Intuitively, a metric built around correct POS predictions is better than plain accuracy; after all, what the project ultimately wants is for the model to predict POS correctly.

A dataset like the one at the beginning of this page, where at least one class (here POS) is very rare, is said to have "skewed classes".

Terminology

Before talking about the F-Score, let's define the terms it is built from.

The task of a classifier is nothing more than making predictions, and every prediction is either right or wrong:

True-positive (TRUE_POS): predicted POS, ground truth is POS
False-positive (FALSE_POS): predicted POS, ground truth is NEG
True-negative (TRUE_NEG): predicted NEG, ground truth is NEG
False-negative (FALSE_NEG): predicted NEG, ground truth is POS

Note: each of these terms denotes a count. For example, TRUE_POS is the number of times the model predicted POS and the ground truth was indeed POS. We get the counts by comparing the model's predictions with the ground truth values.
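For concreteness, here is a minimal NumPy sketch of counting these four quantities (the encoding 1 = POS, 0 = NEG is an assumption):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TRUE_POS, FALSE_POS, TRUE_NEG, FALSE_NEG (1 = POS, 0 = NEG)."""
    true_pos  = np.sum((y_pred == 1) & (y_true == 1))
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    true_neg  = np.sum((y_pred == 0) & (y_true == 0))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))
    return true_pos, false_pos, true_neg, false_neg

y_true = np.array([1, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 1)
```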

Precision

The fraction of TRUE_POS over all POS predictions the model made.

(P)recision:=TRUE_POS ∕ (TRUE_POS+FALSE_POS)

Recall

The fraction of TRUE_POS over all actual POS cases. Caution: the actual POS cases are the ones detected correctly (TRUE_POS) plus the ones wrongly predicted as NEG (FALSE_NEG). Take a few seconds to digest this point.

(R)ecall:=TRUE_POS ∕ (TRUE_POS+FALSE_NEG)

Both precision and recall are built on TRUE_POS: the former measures how trustworthy the POS predictions are, while the latter measures how many of the actual POS cases are found.

A good model must have both a high P and a high R, each lying in the range [0, 1].

Try P and R on the ridiculous model: TRUE_POS is 0, so recall is 0 and precision degenerates to 0∕0 (conventionally treated as 0). The model is plainly bad and helps nothing.
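As a quick check, here is a small self-contained sketch computing P and R for the always-NEG model (the zero-denominator guard is a common convention, not something specified in this post):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # guard against 0/0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

# The "ridiculous model" on the skewed dataset: always predict NEG.
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1                      # 0.2% positives
always_neg = np.zeros_like(y_true)
print(precision_recall(y_true, always_neg))  # (0.0, 0.0)
```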

F-Score

You can treat the F-Score as an "average" of P and R; it reflects the quality of the model and helps us avoid building ridiculous ones.

Having had a bad experience with the ridiculous model on the skewed-classes dataset, we spot the problem and set out to build models that do not always return NEG. After some hard work we end up with three models that produce the following P and R:

Model 1: P=.5, R=.4
Model 2: P=.7, R=.1
Model 3: P=.02, R=1.0

We now introduce the F-Score (here, the F1 score).

On Wikipedia it is written as the harmonic mean of P and R:

F1:=2 ∕ (1∕P+1∕R)

We reduce it to:

F1:=2·P·R ∕ (P+R)

Score:

Model 1: .4444
Model 2: .175
Model 3: .0392

Because P and R both lie in the range [0, 1], the F-Score lies in the same range and the best possible score is 1. So, reasonably, the 1st model, whose score is closest to 1, is the best.
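The three scores can be verified in a few lines (the `f_score` helper below is just an illustration of the formula above):

```python
def f_score(p, r):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

for name, (p, r) in {"Model 1": (0.5, 0.4),
                     "Model 2": (0.7, 0.1),
                     "Model 3": (0.02, 1.0)}.items():
    print(name, round(f_score(p, r), 4))
# Model 1 0.4444, Model 2 0.175, Model 3 0.0392
```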

My experience

When you are not sure whether accuracy alone can be trusted, for example because you have spotted skewed classes in your cross-validation or test data, plugging the F-Score into your metrics can help you tune hyperparameters and adjust model training.

Coding instead of talking….

Here is an implementation of the F-Score. It is an unoptimized version, but it truly demonstrates everything.
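A minimal, unoptimized sketch of such an F-Score computation, working directly from predictions and ground truth (names like `f_score_from_predictions` are illustrative, not necessarily those used in the linked notebook):

```python
import numpy as np

def f_score_from_predictions(y_true, y_pred):
    """Unoptimized F1 computation built directly from the confusion counts."""
    true_pos  = np.sum((y_pred == 1) & (y_true == 1))
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))

    if true_pos == 0:          # covers the always-NEG "ridiculous model"
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall    = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Example on the skewed dataset from the beginning of the post
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1
always_neg = np.zeros_like(y_true)
print(f_score_from_predictions(y_true, always_neg))  # 0.0
```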

Here is a sample of the "Multinomial Event Model", based on Naive Bayes, for predicting spam email. The model itself is not part of this post's topic, but I did use the F-Score to evaluate the cross-validation data during training.

https://colab.research.google.com/drive/1LXPNXpglEFtyni2rZB5MmwqA1A_xkEu1?usp=sharing
