Bayesian Statistics and Naive Bayes Classifiers
Bayesian statistics is an approach to statistics in which probability is interpreted as a degree of belief in an event. The probability of event A happening given that B is true can be defined as follows.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where the probability is a degree of belief based on observation. Suppose we define event A as a traffic accident happening on a highway, and event B as rainfall in the area. Then P(A) is the probability of a traffic accident happening on the highway and P(B) is the probability of rainfall in the area. P(B|A) is the probability of rainfall whenever an accident happens, based on our belief (which is built from our observations). P(A|B) can then be calculated by plugging the numbers into the formula above.
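As a quick illustration, here's a minimal sketch of the calculation in Python; the probability values are made-up numbers for the accident and rainfall events, not results from any real observation.

```python
# Hypothetical degrees of belief for the events described above.
p_a = 0.01          # P(A): probability of a traffic accident on the highway (assumed)
p_b = 0.10          # P(B): probability of rainfall in the area (assumed)
p_b_given_a = 0.30  # P(B|A): probability of rainfall when an accident happens (assumed)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.03
```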
Let's suppose we're observing the number of accidents on the highway while the area is having a drought and no rainfall has been observed throughout the observation period. We haven't observed any rainfall, but setting P(B)=0 would imply that, in our belief, it's absolutely impossible for the area to have rainfall, and it would break the equation since calculating P(A|B) would involve division by zero.
In Bayesian statistics it's advised to avoid setting the probability of an event to zero. When the event isn't observed during the observation period but there's a small chance it might happen some time in the future, we can set its probability to a very small value instead.
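A minimal sketch of that idea, assuming an arbitrary epsilon value chosen purely for illustration:

```python
# If rainfall was never observed, substitute a small value for P(B) instead of zero
# so that P(A|B) = P(B|A) * P(A) / P(B) doesn't divide by zero.
EPSILON = 1e-6  # arbitrary small probability, an assumption for this sketch

rainy_days_observed = 0
total_days_observed = 365

p_b = rainy_days_observed / total_days_observed  # 0.0 during the drought
if p_b == 0:
    p_b = EPSILON
```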
The concepts of Bayesian statistics are applied in naive Bayes classifiers to determine which class an event most likely belongs to given a data input, according to the beliefs built up from the past observation data the classifier has processed for each possible class.
Gaussian Naive Bayes
The following is the formula to calculate the probability that an event with the value v for the feature x belongs to a certain class $C_k$ in the set of classes $C$.
$$p(x = v \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\, e^{-\frac{(v - \mu_k)^2}{2\sigma_k^2}}$$

The Gaussian naive Bayes classifier calculates the probability of the event belonging to each class $C_k \in C$ given that the feature x has a certain value v, assuming that the values of x follow a Gaussian distribution for each $C_k$.
For each class $C_k$, the distribution of the values of the feature x is assumed to be Gaussian, described by the population mean $\mu_k$ and the standard deviation $\sigma_k$, which translates to the variance $\sigma_k^2$. The values of v, $\mu_k$, and $\sigma_k^2$ can be plugged into the formula to get the probability $p(x = v \mid C_k)$.
Since we're calculating the probability $p(x = v \mid C_k)$ for each $C_k \in C$, we end up with a probability $P_k = p(x = v \mid C_k)$ with $P_k \in P$, where $P$ is the set of probabilities of x=v for each element $C_k \in C$.
The probabilities $P_k \in P$ that we have calculated are essentially the likelihoods of the event belonging to each class $C_k$. We can simply classify the event as an occurrence of the class $C_k$ which has the highest probability $P_k$ of having x=v.
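To make the steps concrete, here's a minimal sketch in Python that evaluates the Gaussian likelihood for each class and picks the class with the highest value; the class names, means, and standard deviations are made-up numbers.

```python
import math

def gaussian_likelihood(v, mu, sigma):
    """p(x = v | C_k) under a Gaussian with mean mu and standard deviation sigma."""
    variance = sigma ** 2
    return math.exp(-((v - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical per-class mean and standard deviation of the feature x.
classes = {
    "class_a": {"mu": 5.0, "sigma": 1.0},
    "class_b": {"mu": 8.0, "sigma": 2.0},
}

v = 6.0  # observed feature value
likelihoods = {name: gaussian_likelihood(v, params["mu"], params["sigma"])
               for name, params in classes.items()}
print(max(likelihoods, key=likelihoods.get))  # class with the highest P_k
```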
Notice that the probability formula for Gaussian naive Bayes is similar to the probability density function of the Gaussian distribution, which is as follows.
$$P(X = x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

The formula can be transformed into the exact same form used in the Gaussian naive Bayes formula above.
$$P(X = x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

$$P(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{(x - \mu)^2}{\sigma^2}\right)}$$

$$P(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

A sample usage of the Gaussian naive Bayes classifier implemented in scikit-learn can be seen in this article.
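For reference, here's a brief sketch of what such a usage can look like with scikit-learn's GaussianNB on the bundled Iris dataset (the linked article's example may differ); the estimator learns each class's per-feature mean and variance from the training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)         # estimates mu_k and sigma_k^2 per class and feature
print(model.score(X_test, y_test))  # accuracy on the held-out split
```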
Multinomial Naive Bayes
With a multinomial model, we treat a feature vector $x = (x_1, x_2, x_3, \ldots, x_n)$ as a histogram, where $x_i$ is the number of times event $i$ is observed. Each class $k$ also has a vector of probabilities $p_k = (p_{k1}, p_{k2}, p_{k3}, \ldots, p_{kn})$, where $p_{ki}$ is the probability of event $i$ happening for class $C_k$.
The following is the formula to calculate the probability $p(x \mid C_k)$, which is the probability of the feature vector $x$ belonging to the class $C_k$ with a known probability vector $p_k$.
$$p(x \mid C_k) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i p_{ki}^{x_i}$$

Similar to the Gaussian naive Bayes case before, from here we have the probability $P_k = p(x \mid C_k)$ of the feature vector $x$ belonging to the class $C_k$. After calculating $P_k$ for each $C_k \in C$, we have the values $P_k \in P$ for the probability of $x$ belonging to each class. From here, we can pick the class $C_k$ with the highest $P_k$ value as the most likely class for the feature vector $x$ to belong to.
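Here's a minimal sketch of that calculation in Python, using made-up event counts and per-class probability vectors:

```python
import math

def multinomial_likelihood(x, p_k):
    """p(x | C_k) for a histogram feature vector x and class probability vector p_k."""
    coefficient = math.factorial(sum(x)) / math.prod(math.factorial(x_i) for x_i in x)
    return coefficient * math.prod(p_ki ** x_i for x_i, p_ki in zip(x, p_k))

# Hypothetical counts of three events and per-class probability vectors.
x = [2, 1, 0]
p_k1 = [0.7, 0.2, 0.1]
p_k2 = [0.2, 0.5, 0.3]
print(multinomial_likelihood(x, p_k1))  # about 0.294
print(multinomial_likelihood(x, p_k2))  # about 0.06 -> the first class is more likely
```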
Just like the Gaussian naive Bayes classifier before, the multinomial naive Bayes classifier is based on the probability mass function of the multinomial distribution.
$$P(X) = \frac{n!}{x_1!\, x_2!\, x_3! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3} \cdots p_k^{x_k}$$

which can be expressed as follows using product notation.
$$P(X) = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i}$$

Multinomial naive Bayes is commonly used for text classification. For a dataset with a lot of words, however, each word found in the dataset becomes a feature in the feature vector $x$, so the feature vector is very likely to be sparse.
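As a rough sketch of how this typically looks in practice with scikit-learn, assuming a tiny made-up set of documents and labels; CountVectorizer builds the (sparse) word-count feature vectors described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical documents and labels for illustration only.
documents = [
    "win a free prize now",
    "limited offer win money",
    "meeting agenda for tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # sparse matrix of word counts per document

model = MultinomialNB()
model.fit(X, labels)
print(model.predict(vectorizer.transform(["free money offer"])))  # likely ['spam']
```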
Bernoulli Naive Bayes
Bernoulli naive Bayes is similar to multinomial naive Bayes in the sense that we're calculating the probability of a feature vector $x = (x_1, x_2, x_3, \ldots, x_n)$ belonging to a class. The difference is that for Bernoulli naive Bayes, each feature $x_i$ is assumed to be binary-valued, either 1 or 0. Each class $k$ also has a vector of probabilities $p_k = (p_{k1}, p_{k2}, p_{k3}, \ldots, p_{kn})$, where $p_{ki}$ is the probability of $x_i$ having the value 1.
$$p(x \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i} (1 - p_{ki})^{(1 - x_i)}$$

Just like the previous naive Bayes classifiers we've covered, we can plug the numbers into the above formula and pick the class with the highest value of $p(x \mid C_k)$ as the most likely class for the feature vector $x$ to belong to. The Bernoulli naive Bayes classifier is also based on the distribution it takes its name from.
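A minimal sketch of that product in Python, with made-up binary features (e.g. a word being present or absent) and per-class probabilities:

```python
def bernoulli_likelihood(x, p_k):
    """p(x | C_k) for a binary feature vector x and class probability vector p_k."""
    result = 1.0
    for x_i, p_ki in zip(x, p_k):
        result *= (p_ki ** x_i) * ((1 - p_ki) ** (1 - x_i))
    return result

# Hypothetical binary feature vector and per-class probabilities of each feature being 1.
x = [1, 0, 1]
p_k1 = [0.8, 0.1, 0.6]
p_k2 = [0.2, 0.4, 0.3]
print(bernoulli_likelihood(x, p_k1))  # 0.8 * 0.9 * 0.6 = 0.432
print(bernoulli_likelihood(x, p_k2))  # 0.2 * 0.6 * 0.3 = 0.036 -> first class is more likely
```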
The following is the probability mass function formula for the Bernoulli distribution.
$$P(X = x) = p^x (1 - p)^{(1 - x)}$$

where $p$ is the probability of $X$ having the value 1, and $x \in \{0, 1\}$. Suppose we define the probability of $X = 1$ as $p = 0.2$; if we plug the numbers into the formula to find $P(X = 1)$ and $P(X = 0)$, we get the following results.
$$P(X = 1) = (0.2)^1 (1 - 0.2)^{(1 - 1)} = (0.2)^1 (0.8)^0 = (0.2)(1) = 0.2$$

$$P(X = 0) = (0.2)^0 (1 - 0.2)^{(1 - 0)} = (0.2)^0 (0.8)^1 = (1)(0.8) = 0.8$$

We can see that the Bernoulli naive Bayes formula is slightly different from the Bernoulli distribution's probability mass function, unlike the Gaussian and multinomial naive Bayes formulas in the previous sections, which are exactly the same as the distributions they are derived from. This is because in the Bernoulli naive Bayes classifier, we're processing a feature vector which contains multiple Bernoulli random variables that are assumed to be independent of each other.
Since we're assuming all of the random events in the feature vector $x = (x_1, x_2, x_3, \ldots, x_n)$ to be independent of one another, the overall probability of the feature vector can be calculated as the product of the probabilities of each random event happening.
$$P(x) = P(X = x_1) P(X = x_2) P(X = x_3) \cdots P(X = x_n) = \prod_i P(X = x_i)$$

This article shows an example of a Bernoulli naive Bayes application and a sample use of scikit-learn for Bernoulli naive Bayes classification. This other article by Matt Johnson digs deeper into the math of the Bernoulli naive Bayes classifier and also gives us a sample implementation of the classifier.
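For completeness, here's a brief sketch of what a scikit-learn BernoulliNB usage can look like, using made-up binary feature vectors (the linked articles' examples may differ); binarize=None tells the estimator the inputs are already 0/1 valued.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature vectors (e.g. presence/absence of three words) and labels.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array(["spam", "spam", "ham", "ham"])

model = BernoulliNB(binarize=None)  # inputs are already binary, skip thresholding
model.fit(X, y)
print(model.predict([[1, 0, 0]]))
```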