Understanding Bayes’ Theorem

“When the facts change, I change my mind. What do you do sir?”

– John Maynard Keynes

Bayes’ Theorem is a method to update our beliefs based on new information acquired.

For example, if we were trying to find out the probability of a said person liking Evangelion, we could just take the percentage of the entire population who likes Evangelion and call it a day. But, if there was new information that said person actually likes Anime, we can adjust our probability upwards based on the fact that most likely, Evangelion lovers are also Anime lovers.

The formula of Bayes’ Theorem

In it’s simplest form, the equation of Bayes’ theorem is:

\[P(A|B) = \frac{P(B|A) P(A) }{P(B)}\]

For the above equation,

\(P(A|B)\) is the posterior probability; the probability of event \(A\) happening given that event \(B\) has happened. Relating to the example in our introduction, this would be the probability of liking Evangelion given the person also likes Anime.

\(P(B|A)\) is the likelihood; the probability of event \(B\) happening given that event \(A\) has happened. This would be the probability of liking Anime given the person likes Evangelion.

\(P(A)\) is the prior probability; the probability of event \(A\) happening. This would be the probability of liking Evangelion. \(P(B)\) is the evidence; the probability of event \(B\) happening. This would be the probability of liking Anime.

Basically, in simple English sense, the above equation can written as:

\[posterior = \frac{prior \times likelihood }{evidence}\]

Bayes’ Theorem example calculation

Because examples are always good to explain complex problems, let’s use our previous example from the introduction to explain the calculations in Bayes’ Theorem.

For starters, let’s say the possibility of someone liking Evangelion, \(P(Evangelion)\) is somewhat rare with a 1% probability (not because it’s bad, but because it’s unconventional).

In addition, Anime fans, \(P(Anime)\) are fairly common with a 10% probability due to the weeb movement.

And as for Evangelion fans who like Anime, \(P(Anime | Evangelion)\), that’s about 95%.

With the information we have obtained above, we can find the probability of someone liking Evangelion if that said someone likes Anime.

\[P(Evangelion|Anime) = \frac{P(Anime|Evangelion) P(Evangelion) }{P(Anime)}\\= \frac{0.95 \times 0.01}{0.10}\\= 9.5\%\]

Consequently, just by obtaining our new evidence, we manage to achieve a better probability, from a probability of 0.01 to 0.095.

Classification using Bayes’ Theorem

In a classification problem where we have to categorise one from another, Bayes’ Theorem can prove to be very useful as we have previously seen in our blog post, “Fake Jobs Text Classification using Naive Bayes”.

As so, if we have a problem instance that needs to be classified, our equation can be written as:

\[P(Class|Data) = \frac{P(Data|Class) P(Class) }{P(Data)}\]

This equation is then used to find the probability of each class in a problem, the class with the highest probability is then categorised as that class.

However, the issue arises when we start including multiple features within our data. The calculation of the likelihood, \(P(x|Class)\) becomes exponentially more expensive if these features relate to each other.

Because of this, we don’t use Bayes’ Theorem fully, we simplify it.

Naive Bayes Classifier

Naive Bayes aims to simply the Bayes’ Theorem by making two huge assumptions:

  • All features of our data are independent of one another
  • All features have an equal contribution to the outcome

With that said, if we consider any two events A and B are independent, then,

\[P(A,B) = P(A) \cdot P(B)\]

Because of this, in a conditional independence, we can define the independence in a conditional model as:

\[P(A,B|C) = P(A|C) \cdot P(B|C)\]

So, if we define, \(Data = (x_1, x_2,…,x_n)\) and \(x_1, x_2,…,x_n\) are representatives of the features in our data, we can substitute them for our \(Data\).

\[P(Class|x_1,x_2,...,x_3) = \frac{P(x_1,x_2,...,x_3|Class)P(Class)}{P(x_1)P(x_2)...P(x_n)}\]

And because we’re assuming that all features of our data are independent, we can actually split up the conditional probability so that \(P(x_1, x_2 | Class)\) can be written as \(P(x_1 | Class) P(x_2 | Class) \).

\[P(Class|x_1,x_2,...,x_3) = \frac{P(x_1|Class)P(x_2|Class)...,P(x_n|Class)P(Class)}{P(x_1)P(x_2)...,P(x_n)}\]

Furthermore, as the denominator remains constant for all entries in the data, we can remove the denominator and introduce a proportionality.

\[P(Class|x_1,x_2,...,x_3) \propto P(Class) \prod_{i=1}^{n} P(x_i|Class)\]

Once we have derived our independent model, we just need to apply a MAP decision rule to find the Class with the highest probability. Which is essentially just like a MAX function in programming. This can be shown mathematically as,

\[Class = argmax_y P(Class) \prod_{i=1}^{n} P(x_i|Class)\]

Finally, all we’re required to do is to calculate \(P(Class)\) and \(P(x_i|Class)\). This can be done easily by calculating the appropriate probabilities from our data.

Parameters and types of Naive Bayes classifiers

Now that we have derived our model, we need a way to calculate \(P(x_i|Class)\), which requires us to deal with our data.

When dealing with data, probability distributions may differ from one another. Different distributions can have very different characteristics, for example:

  • Continuous distributions vs. Discrete distributions
  • For continuous distributions, there could be different sets of possible values (e.g. numbers \(x, x>0,\) or \(x\in[0,1]\))
  • For discrete distributions, there could be a finite vs. an infinite number of possible values

In essence, each distribution within a family of distributions is different in terms of the parameters of the distribution. These parameters are what help determine the mean and variance of the distribution, and the values of probabilities from it.

Gaussian Naive Bayes

Typically when we deal with continuous data, we assume that continuous values associated with each feature are distributed across a Gaussian distribution, or another term you might be more familiar with, Normal distribution.

Gaussian Distribution
Gaussian Distribution, made with matplotlib

When the likelihood of the features in our data are assumed to be Gaussian, the conditional probability can be described as:

\[P(x_i|Class) = \frac{1}{\sqrt{2 \pi \sigma_{Class}^2}}e^{-\frac{(x_i - \mu_{Class})^2}{2 \sigma_{Class}^2}}\]

Where \(\mu_{Class}\) and \(\sigma_{Class}^2\) is the mean and variance of our features respectively (our parameters) in respect to our \(Class\).

Multinomial Naive Bayes

In a Multinomial model, we estimate the conditional probability of a certain feature based on the frequency of the feature appearing for a specific \(Class\). It also has a bag of words assumption that the position of the features does not matter.

In this case, the likelihood of observing a feature can be given as,

\[P(x_{i}|Class) = \frac{(\sum_i x_i)!}{\prod_{i} x_i!} \prod_{i} {p_{Class_{i}}}^{x_i}\]

Bernoulli Naive Bayes

For the Bernoulli model, features are considered independent booleans (binary variables). In contrast to the Multinomial model, it finds the binary feature occurrence as opposed to the feature frequency.

The likelihood of a observing a feature in this model can be give as,

\[P(x_i|Class) = \prod_{i=1}^{n}p_{Class_i}^{x_i} (1 - p_{Class_i})^{1-x_i}\]


Despite the simplified assumptions on Naive Bayes, the Naive Bayes classifiers have performed surprisingly well in real-world applications. They require only a handful of training data to estimate the necessary parameters.

Naive Bayes algorithms are also fast and easy to implement in comparison to other models. The assumption on independent conditional probability helps greatly with speeding things up. However, in real-world scenarios, our features are most likely dependent on each other, hence could cripple the performance of our classifier.

Note. This was also a complimentary blog post to the “Fake Jobs Text Classification using Naive Bayes” where we used Naive Bayes for a natural language processing (NLP) problem.

Further Reading