Naive Bayes Classifier (NBC)
Detecting spam / non-spam messages
There are two major types of NBC models: the Bernoulli Event Model (BEM) and the Multinomial Event Model (MEM).
Advantages
Simple and easy to understand; the implementation is straightforward and needs little framework support, and the model has low variance.
Disadvantages
For text classification, such models cannot capture semantics, and they are not well suited to very large datasets.
Terminology and notation
Indicator function
1{.}: If the expression inside {} is true, 1{.} gives 1; otherwise it gives 0.
For example:
1{2=2} -> 1, 1{2=3} ->0
∑1{.}: Counts the elements for which the expression inside {} is true.
For example:
∑1{x=2}: Counts all elements in the collection that equal 2.
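As a small illustration, here is the indicator and the counting sum written in Python (the collection values are made up):

```python
# Indicator: 1 if the condition inside holds, otherwise 0.
def indicator(condition):
    return 1 if condition else 0

xs = [2, 3, 2, 5, 2]                          # a made-up collection
count = sum(indicator(x == 2) for x in xs)    # ∑ 1{x = 2}
print(count)                                  # 3 -> three elements equal 2
```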
Bernoulli Event Model (BEM)
Models the presence or absence of each event in a sample; the parameters are the fraction of samples (per class) in which an event is present.
Parameters or weights
φ: represents the estimated probability that an event X is present, given the label Y.
n: the number of samples.
φ_j|y=1: probability that event j is present, given a positive label.
φ_j|y=0: probability that event j is present, given a negative label.
φ_y: represents the estimated probability of the label Y.
φ_y=1: probability of a positive label.
φ_y=0: probability of a negative label.
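As a rough sketch (not the repository's exact code), these presence fractions can be estimated as below, assuming each sample is a binary vector over the n predefined events; the add-one (Laplace) smoothing is my own addition to avoid zero probabilities:

```python
import numpy as np

def fit_bernoulli(X, Y):
    """X: 2-D binary array, X[i, j] = 1 if event j is present in sample i.
    Y: 1-D array of 0/1 labels. Returns (phi_j_y1, phi_j_y0, phi_y)."""
    X, Y = np.asarray(X), np.asarray(Y)
    pos, neg = X[Y == 1], X[Y == 0]
    # Fraction of positive / negative samples in which each event is present,
    # with add-one (Laplace) smoothing (an assumption, not shown above).
    phi_j_y1 = (pos.sum(axis=0) + 1) / (len(pos) + 2)
    phi_j_y0 = (neg.sum(axis=0) + 1) / (len(neg) + 2)
    phi_y = Y.mean()                            # estimated P(y = 1)
    return phi_j_y1, phi_j_y0, phi_y
```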
Multinomial Event Model (MEM)
Models how many times each event occurs in a sample; the parameters are the fraction of all event occurrences (per class) that belong to a given event.
Parameters or weights
φ: represents the estimated probability of an occurrence of event X, given the label Y.
n: the number of samples.
φ_j|y=1: probability of an occurrence of event j, given a positive label.
φ_j|y=0: probability of an occurrence of event j, given a negative label.
φ_y: represents the estimated probability of the label Y.
φ_y=1: probability of a positive label.
φ_y=0: probability of a negative label.
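Again as a rough sketch under the same caveats, the occurrence fractions can be estimated as below, assuming each sample is a list of event indices; the add-one smoothing is my own addition:

```python
import numpy as np

def fit_multinomial(X, Y, n_events):
    """X: list of samples, each a list of event indices in [0, n_events).
    Y: list of 0/1 labels. Returns (phi_j_y1, phi_j_y0, phi_y)."""
    counts = {0: np.zeros(n_events), 1: np.zeros(n_events)}
    totals = {0: 0, 1: 0}
    for events, y in zip(X, Y):
        for j in events:
            counts[y][j] += 1        # occurrences of event j under label y
        totals[y] += len(events)     # total occurrences under label y
    # Fraction of all event occurrences under each label,
    # with add-one (Laplace) smoothing (an assumption, not shown above).
    phi_j_y1 = (counts[1] + 1) / (totals[1] + n_events)
    phi_j_y0 = (counts[0] + 1) / (totals[0] + n_events)
    phi_y = sum(Y) / len(Y)          # estimated P(y = 1)
    return phi_j_y1, phi_j_y0, phi_y
```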
Experiment
Suppose we have a collection of samples; each sample contains independent events (Xs) and a label (Y), which is itself an event.
The relationship between the Xs and Y is that these events occur in the context of Y.
The label can be binary or multinomial; in this paper we use a binary label to keep things simple. All possible events of the samples are predefined; let's say there are n events.
Label or Y: positive (1) or negative (0).
Each sample contains a random subset of those n events.
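For concreteness, here is one hypothetical sample under this setup (the event names and the value of n are made up):

```python
# Hypothetical setup: n = 5 predefined events.
events = ["free", "winner", "meeting", "prize", "lunch"]

# One sample: the events it contains, plus its label (1 = positive).
sample_events = ["free", "prize", "winner"]
label = 1

# BEM view of the sample: a presence/absence vector over the n events.
x_bem = [1 if e in sample_events else 0 for e in events]   # [1, 1, 0, 1, 0]
```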
Goal
Given an input sample containing some of the n events, find the likelihood that the label is positive or negative, i.e. Y=1 or Y=0.
Bayes rule
P(Y|X) = [ P(X|Y) P(Y) ] / P(X)
P(X) = P(X|Y=0) P(Y=0) + P(X|Y=1) P(Y=1)
Then
P(Y=1|X) = [ P(X|Y=1) P(Y=1) ] / [ P(X|Y=1) P(Y=1) + P(X|Y=0) P(Y=0) ]
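A small numeric check of the rule above, with made-up likelihoods and priors:

```python
# Made-up class-conditional likelihoods and priors for one input X.
p_x_given_y1, p_y1 = 0.020, 0.3    # positive class
p_x_given_y0, p_y0 = 0.005, 0.7    # negative class

p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * p_y0
p_y1_given_x = p_x_given_y1 * p_y1 / p_x
print(round(p_y1_given_x, 3))       # 0.632 -> this input leans positive
```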
Build model
In my experiment I use a Naive Bayes Classifier to detect spam / non-spam messages.
Download: https://dl.dropbox.com/s/igq20e1nvwjwxw4/spam_ham_dataset.csv
There isn’t a training loop as in deep learning, linear regression, or logistic regression. To train the Naive Bayes Classifier models, we simply implement the method above (👆) in a language such as Python or MATLAB.
Framework
The model is built in pure Python.
The TensorFlow Tokenizer is used to convert text (sentences) into numbers.
For example, there are some sentences:
The quick brown fox jumps over the lazy dog.
This is a beautiful garden I ever seen.
Sorry for the next absence.
…..
The tokenizer will encode those sentences into:
[2, 5, 6, 7, 3, 9, 10, 15, 19]
[20, 22, 56, 36, 222, 123, 999, 345]
[100, 58, 10, 245, 93]
With the tokenizer we can call index_word to get the mapping between words and their indices; for example, “sorry” is 100 and “the” is 10 (it appears twice in the encodings).
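A minimal usage sketch of the TensorFlow (Keras) Tokenizer; the actual indices depend on word frequency in the corpus, so they will differ from the illustrative numbers above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "This is a beautiful garden I ever seen.",
    "Sorry for the next absence.",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                     # build the vocabulary
sequences = tokenizer.texts_to_sequences(sentences)   # sentences -> index lists
print(sequences)
print(tokenizer.index_word)                           # {index: word} mapping
```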
Method
preprocess(sentences, labels): The first method to call after constructing the model instance.
sentences: All sample messages
labels: Y, 1 or 0. 1: spam, 0: non-spam
fit(self, X, Y, X_cv, Y_cv): Trains the model, following the method above (👆) to calculate the likelihood of each valid word.
X: The training data, encoded with indices.
Y: The labels associated with X.
X_cv: The cross-validation data, encoded with indices.
Y_cv: The labels associated with X_cv.
The F-score is used as the metric on the cross-validation data.
predict(self, X): Predicts on X with the model. X is a collection of test data.
Returns a tuple with a list of prediction likelihoods and a list of 1s or 0s. 1: spam, 0: non-spam.
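Putting the pieces together, here is a rough sketch of the interface described above; the class name and method bodies are placeholders, not the repository's actual code:

```python
class NaiveBayesClassifier:
    """Sketch of the spam / non-spam classifier interface."""

    def preprocess(self, sentences, labels):
        # Tokenize the sample messages into index sequences (e.g. with the
        # TensorFlow Tokenizer shown earlier) and keep the labels (1/0).
        ...

    def fit(self, X, Y, X_cv, Y_cv):
        # Estimate phi_j|y=1, phi_j|y=0 and phi_y from (X, Y) as described
        # above, then check the F-score on the cross-validation split.
        ...

    def predict(self, X):
        # Apply Bayes' rule with the estimated parameters and return
        # (likelihoods, labels), where label 1 = spam and 0 = non-spam.
        ...
```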