Naïve Bayes classification model for a Natural Language Processing problem using Python

In this post, we fit a Naïve Bayes classification model (read about Naïve Bayes in this post) to a natural language processing (NLP) problem: detecting spam in SMS messages.

  • Download sample dataset
  • Split dataset into test and train data
  • Vectorize
  • Build and measure the accuracy of the model


Step 1: Let us use a publicly available dataset for spam detection. Download the dataset from this site and extract the files. The dataset contains 5,572 SMS messages, each labelled ham (legitimate) or spam.

Sample dataset: ham or spam?

Step 2: Import the text dataset and provide column names.
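
A minimal sketch of this step, assuming pandas and a tab-separated file with the label in the first column and the message text in the second. The column names `label` and `message` and the three sample rows are made up for illustration; with the downloaded file you would pass its path to `read_csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: three made-up rows, tab-separated,
# with no header line. Replace `sample_file` with the real file path.
sample_file = io.StringIO(
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tYou have won a free prize! Call now to claim.\n"
    "ham\tI will call you back in ten minutes.\n"
)

# The file has no header row, so the column names are supplied here.
df = pd.read_csv(sample_file, sep="\t", header=None, names=["label", "message"])

print(df.shape)
print(df.head())
```

If some messages contain stray quote characters, passing `quoting=csv.QUOTE_NONE` to `read_csv` is one way to stop pandas from treating them as field delimiters.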

Step 3: Convert labels (ham and spam) to numbers (0 and 1).
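
One way to do the conversion, assuming the data sits in a pandas DataFrame. The column names `label` and `label_num` and the sample rows are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the imported dataset.
df = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham"],
        "message": [
            "Are we still meeting for lunch today?",
            "You have won a free prize! Call now to claim.",
            "I will call you back in ten minutes.",
        ],
    }
)

# Map the text labels to integers: ham -> 0, spam -> 1.
df["label_num"] = df["label"].map({"ham": 0, "spam": 1})

print(df[["label", "label_num"]])
```

A label that is missing from the mapping becomes `NaN`, so it is worth checking `df["label_num"].isna().sum()` on the real data.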

Step 4: Split the dataset into training and test sets.
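
A sketch of the split using scikit-learn's `train_test_split`. The 25% test size, the fixed `random_state`, the stratified split and the eight sample messages are all assumptions made for illustration; the post does not say which settings were used.

```python
from sklearn.model_selection import train_test_split

# Made-up messages and labels (0 = ham, 1 = spam).
messages = [
    "Are we still meeting for lunch today?",
    "You have won a free prize! Call now to claim.",
    "I will call you back in ten minutes.",
    "Urgent: your account has been selected for a cash reward.",
    "Can you pick up some milk on the way home?",
    "Free entry in a weekly competition, text WIN to enter.",
    "Running late, see you at seven.",
    "Claim your free ringtone now, reply YES.",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Hold out 25% of the messages for testing. `stratify` keeps the
# ham/spam ratio the same in both parts; `random_state` makes the
# split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```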

Step 5: Vectorize

In this step, the words in each message are converted into a numerical representation that the model can work with. You can read more on this here.
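
To make the idea concrete, here is a small hand-rolled sketch of one common approach, a bag-of-words count, using only the standard library. That the post uses word counts is an assumption, and the three messages are made up.

```python
from collections import Counter

docs = ["free prize call now", "call me now", "free free prize"]

# Build the vocabulary: every distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Represent each message as a vector of word counts over that vocabulary.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each message becomes one row of numbers, and each column corresponds to one word of the vocabulary. Word order is discarded; only the counts remain.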

Step 6: Vectorize the training dataset

Step 7: Vectorize the test dataset
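
Steps 6 and 7 could look like the following, assuming scikit-learn's `CountVectorizer` (the post does not name the vectorizer it uses). The vocabulary is learned from the training messages only and then reused for the test messages, so both matrices have the same columns. The sample messages are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_messages = [
    "free prize call now",
    "call me when you are home",
    "win a free prize today",
]
test_messages = ["claim your free prize", "are you home"]

vectorizer = CountVectorizer()

# Step 6: learn the vocabulary from the training messages and count words.
X_train_counts = vectorizer.fit_transform(train_messages)

# Step 7: reuse the same vocabulary for the test messages. Words never
# seen in training (here "claim" and "your") are ignored.
X_test_counts = vectorizer.transform(test_messages)

print(X_train_counts.shape, X_test_counts.shape)
```

Calling `fit_transform` on the test data instead would build a different vocabulary, and the model trained on the first one could no longer be applied to it.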

Step 8: Build the Naïve Bayes classification model. To learn how the Naïve Bayes model works, read my post on Naïve Bayes.
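
A minimal sketch of this step, assuming scikit-learn's `MultinomialNB`, a Naïve Bayes variant often paired with word counts. Which variant the post uses is not stated, and the training messages here are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "win a free prize now",
    "free cash prize call now",
    "are we meeting for lunch",
    "call me when you are home",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Turn the training messages into word-count vectors.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_messages)

# Fit the classifier on the count vectors and their labels.
model = MultinomialNB()
model.fit(X_train_counts, train_labels)

# Classify a new message with the same vectorizer and the fitted model.
new_counts = vectorizer.transform(["free prize"])
print(model.predict(new_counts))
```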

Step 9: Measure the accuracy on the test data

The Naïve Bayes model classifies the test data with an accuracy of 0.98851 (about 98.9%).
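
The measurement in Step 9 can be sketched with scikit-learn's `accuracy_score`, assuming the true test labels and the model's predictions are available as two sequences. The labels below are made up, so the result will not match the figure reported above.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and model predictions for eight test messages.
y_test = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1]

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

On the real data, `y_pred` would come from calling the fitted model's `predict` method on the vectorized test messages.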


I have used code from the following sites and modified it wherever needed:

Further reference materials:
I personally found this post very helpful:
You can find sample datasets on this site.