Naive Bayes Classifier (NBC)
Detecting spam / non-spam messages
There are two major types of NBC models: the Bernoulli Event Model (BEM) and the Multinomial Event Model (MEM).
Advantages
Simple and easy to understand; the implementation is straightforward and needs little framework support, and the model has low variance.
Disadvantages
For text classification, such models cannot capture semantics, and they are not well suited to very large datasets.
Terminology and notation
Indicator function
1{.}: If the expression inside {} is true, 1{.} gives 1; otherwise it gives 0.
For example:
1{2=2} -> 1, 1{2=3} ->0
∑1{.}: Counts the elements for which the expression inside {} is true.
For example:
∑1{x=2}: Counts all elements in the collection that equal 2.
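As a small illustration, here is the indicator and the counting sum written in Python (the collection values are made up):

```python
# Indicator: 1 if the condition inside holds, otherwise 0.
def indicator(condition):
    return 1 if condition else 0

xs = [2, 3, 2, 5, 2]                          # a made-up collection
count = sum(indicator(x == 2) for x in xs)    # ∑ 1{x = 2}
print(count)                                  # 3 -> three elements equal 2
```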
Bernoulli Event Model (BEM)
Models the presence or absence of each event in a sample; the parameters are the fraction of samples (per class) in which an event is present.
Parameters or weights
φ: represents the estimated probability that an event X is present, given the label Y.
n: the number of samples.
φ_j|y=1: probability that event j is present, given a positive label.
φ_j|y=0: probability that event j is present, given a negative label.
φ_y: represents the estimated probability of the label Y.
φ_y=1: probability of a positive label.
φ_y=0: probability of a negative label.
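As a rough sketch (not the repository's exact code), these presence fractions can be estimated as below, assuming each sample is a binary vector over the n predefined events; the add-one (Laplace) smoothing is my own addition to avoid zero probabilities:

```python
import numpy as np

def fit_bernoulli(X, Y):
    """X: 2-D binary array, X[i, j] = 1 if event j is present in sample i.
    Y: 1-D array of 0/1 labels. Returns (phi_j_y1, phi_j_y0, phi_y)."""
    X, Y = np.asarray(X), np.asarray(Y)
    pos, neg = X[Y == 1], X[Y == 0]
    # Fraction of positive / negative samples in which each event is present,
    # with add-one (Laplace) smoothing (an assumption, not shown above).
    phi_j_y1 = (pos.sum(axis=0) + 1) / (len(pos) + 2)
    phi_j_y0 = (neg.sum(axis=0) + 1) / (len(neg) + 2)
    phi_y = Y.mean()                            # estimated P(y = 1)
    return phi_j_y1, phi_j_y0, phi_y
```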
Multinomial Event Model (MEM)
Models how many times each event occurs in a sample; the parameters are the fraction of all event occurrences (per class) that belong to a given event.
Parameters or weights
φ: represents the estimated probability of an occurrence of event X, given the label Y.
n: the number of samples.
φ_j|y=1: probability of an occurrence of event j, given a positive label.
φ_j|y=0: probability of an occurrence of event j, given a negative label.
φ_y: represents the estimated probability of the label Y.
φ_y=1: probability of a positive label.
φ_y=0: probability of a negative label.
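Again as a rough sketch under the same caveats, the occurrence fractions can be estimated as below, assuming each sample is a list of event indices; the add-one smoothing is my own addition:

```python
import numpy as np

def fit_multinomial(X, Y, n_events):
    """X: list of samples, each a list of event indices in [0, n_events).
    Y: list of 0/1 labels. Returns (phi_j_y1, phi_j_y0, phi_y)."""
    counts = {0: np.zeros(n_events), 1: np.zeros(n_events)}
    totals = {0: 0, 1: 0}
    for events, y in zip(X, Y):
        for j in events:
            counts[y][j] += 1        # occurrences of event j under label y
        totals[y] += len(events)     # total occurrences under label y
    # Fraction of all event occurrences under each label,
    # with add-one (Laplace) smoothing (an assumption, not shown above).
    phi_j_y1 = (counts[1] + 1) / (totals[1] + n_events)
    phi_j_y0 = (counts[0] + 1) / (totals[0] + n_events)
    phi_y = sum(Y) / len(Y)          # estimated P(y = 1)
    return phi_j_y1, phi_j_y0, phi_y
```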
Experiment
Suppose we have a collection of samples; each sample contains independent events (Xs) and a label (Y), which is itself an event.
The relationship between the Xs and Y is that these events occur in the context of Y.
The label can be binary or multinomial; in this paper we use a binary label to keep things simple. All possible events of the samples are predefined; let's say there are n events.
Label or Y: positive (1) or negative (0).
Each sample contains a random subset of those n events.
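For concreteness, here is one hypothetical sample under this setup (the event names and the value of n are made up):

```python
# Hypothetical setup: n = 5 predefined events.
events = ["free", "winner", "meeting", "prize", "lunch"]

# One sample: the events it contains, plus its label (1 = positive).
sample_events = ["free", "prize", "winner"]
label = 1

# BEM view of the sample: a presence/absence vector over the n events.
x_bem = [1 if e in sample_events else 0 for e in events]   # [1, 1, 0, 1, 0]
```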
Goal
Given an input sample containing some of the n events, find the likelihood that the label is positive or negative, i.e. Y=1 or Y=0.
Bayes rule
P(Y|X) = [ P(X|Y) P(Y) ] / P(X)
P(X) = P(X|Y=0) P(Y=0) + P(X|Y=1) P(Y=1)
Then
P(Y=1|X) = [ P(X|Y=1) P(Y=1) ] / [ P(X|Y=1) P(Y=1) + P(X|Y=0) P(Y=0) ]
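A small numeric check of the rule above, with made-up likelihoods and priors:

```python
# Made-up class-conditional likelihoods and priors for one input X.
p_x_given_y1, p_y1 = 0.020, 0.3    # positive class
p_x_given_y0, p_y0 = 0.005, 0.7    # negative class

p_x = p_x_given_y1 * p_y1 + p_x_given_y0 * p_y0
p_y1_given_x = p_x_given_y1 * p_y1 / p_x
print(round(p_y1_given_x, 3))       # 0.632 -> this input leans positive
```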
Build model
In my experiment I use a Naive Bayes Classifier to detect spam / non-spam messages.
Download: https://dl.dropbox.com/s/igq20e1nvwjwxw4/spam_ham_dataset.csv
There isn’t a training loop as in deep learning, linear regression, or logistic regression. To train the Naive Bayes Classifier models, we simply implement the method above (👆) in a language such as Python or MATLAB.
Framework
The model is built in pure Python.
The TensorFlow Tokenizer is used to convert text (sentences) into numbers.
For example, there are some sentences:
The quick brown fox jumps over the lazy dog.
This is a beautiful garden I ever seen.
Sorry for the next absence.
…..
The tokenizer will encode those sentences into:
[2, 5, 6, 7, 3, 9, 10, 15, 19]
[20, 22, 56, 36, 222, 123, 999, 345]
[100, 58, 10, 245, 93]
With the tokenizer we can call index_word to get the mapping between words and their indices; for example, “sorry” is 100 and “the” is 10 (it appears twice in the encodings).
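A minimal usage sketch of the TensorFlow (Keras) Tokenizer; the actual indices depend on word frequency in the corpus, so they will differ from the illustrative numbers above:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "This is a beautiful garden I ever seen.",
    "Sorry for the next absence.",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                     # build the vocabulary
sequences = tokenizer.texts_to_sequences(sentences)   # sentences -> index lists
print(sequences)
print(tokenizer.index_word)                           # {index: word} mapping
```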
Method
preprocess(sentences, labels): The first method to call after constructing the model instance.
sentences: All sample messages
labels: Y, 1 or 0. 1: spam, 0: non-spam
fit(self, X, Y, X_cv, Y_cv): Trains the model, following the method above (👆) to calculate the likelihood of each valid word.
X: The training data, encoded with indices.
Y: The labels associated with X.
X_cv: The cross-validation data, encoded with indices.
Y_cv: The labels associated with X_cv.
The F-score is used as the metric on the cross-validation data.
predict(self, X): Predicts on X with the model. X is a collection of test data.
Returns a tuple with a list of prediction likelihoods and a list of 1s or 0s. 1: spam, 0: non-spam.
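Putting the pieces together, here is a rough sketch of the interface described above; the class name and method bodies are placeholders, not the repository's actual code:

```python
class NaiveBayesClassifier:
    """Sketch of the spam / non-spam classifier interface."""

    def preprocess(self, sentences, labels):
        # Tokenize the sample messages into index sequences (e.g. with the
        # TensorFlow Tokenizer shown earlier) and keep the labels (1/0).
        ...

    def fit(self, X, Y, X_cv, Y_cv):
        # Estimate phi_j|y=1, phi_j|y=0 and phi_y from (X, Y) as described
        # above, then check the F-score on the cross-validation split.
        ...

    def predict(self, X):
        # Apply Bayes' rule with the estimated parameters and return
        # (likelihoods, labels), where label 1 = spam and 0 = non-spam.
        ...
```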