Detecting Spam Emails using Naive Bayes Algorithm


In this notebook, we will explore Python code for implementing the Naive Bayes algorithm to detect spam emails.

The model being built is a Supervised Learning model. In other words, we will feed the machine data containing spam emails and good emails, along with labels indicating which emails are spam and which are good, and make the machine learn to detect spam emails and good emails. Once the model is ready, if we feed any email to the model, the model should respond by stating whether the input email is a spam email or a good email.

In [ ]:
import warnings
warnings.filterwarnings('ignore')

Loading the data

The first step in building a supervised learning model is to load the data on which the model will be trained. The data for building this model is available in the file spamEmailDataset.csv. We will use the pandas read_csv() function to read the data into a pandas dataframe.

In [1]:
# Import the pandas library
import pandas as pd
In [2]:
# Read the data into a dataframe and display the dataframe
df = pd.read_csv('spamEmailDataset.csv')
df
Out[2]:

Unnamed: 0 label text label_num
0 605 ham Subject: enron methanol ; meter # : 988291\r\n… 0
1 2349 ham Subject: hpl nom for january 9 , 2001\r\n( see… 0
2 3624 ham Subject: neon retreat\r\nho ho ho , we ‘ re ar… 0
3 4685 spam Subject: photoshop , windows , office . cheap … 1
4 2030 ham Subject: re : indian springs\r\nthis deal is t… 0
5166 1518 ham Subject: put the 10 on the ft\r\nthe transport… 0
5167 404 ham Subject: 3 / 4 / 2000 and following noms\r\nhp… 0
5168 2933 ham Subject: calpine daily gas nomination\r\n>\r\n… 0
5169 1409 ham Subject: industrial worksheets for august 2000… 0
5170 4807 spam Subject: important online banking alert\r\ndea… 1

5171 rows × 4 columns

If the data is loaded correctly, you will notice that the dataframe has 4 columns and 5171 rows.

Notice that there is a column named label_num in the dataframe. Let us see the unique values in this column.

In [3]:
# Checking the unique values in the column label_num in the dataframe df
df.label_num.unique()
Out[3]:
array([0, 1])

Notice that the column label_num has 2 values – 0 and 1. Here, 0 indicates that the corresponding email is a good email and 1 indicates that the corresponding email is a spam email.

Apart from the column label_num, we need the column text which contains the contents of the emails. We do not need the rest of the columns in the dataframe df. So, we will only retain these 2 columns in the dataframe for our purpose.

In [4]:
# Retain only the "text" and the "label_num" columns
df = df[['text', 'label_num']]
df
Out[4]:

text label_num
0 Subject: enron methanol ; meter # : 988291\r\n… 0
1 Subject: hpl nom for january 9 , 2001\r\n( see… 0
2 Subject: neon retreat\r\nho ho ho , we ‘ re ar… 0
3 Subject: photoshop , windows , office . cheap … 1
4 Subject: re : indian springs\r\nthis deal is t… 0
5166 Subject: put the 10 on the ft\r\nthe transport… 0
5167 Subject: 3 / 4 / 2000 and following noms\r\nhp… 0
5168 Subject: calpine daily gas nomination\r\n>\r\n… 0
5169 Subject: industrial worksheets for august 2000… 0
5170 Subject: important online banking alert\r\ndea… 1

5171 rows × 2 columns

The column text is the independent variable and the column label_num is the dependent variable.

However, we should understand that the computer cannot make anything out of the text available in the column text. So, we have to convert the contents of the column text to numbers. We will see how this can be done. Nevertheless, for the moment, we will term the independent variable as X and the dependent column as y.

In [5]:
X = df['text']
y = df['label_num']

Let us check how many examples of good emails we have and how many examples of spam emails we have.

In [6]:
y.value_counts()
Out[6]:
0    3672
1    1499
Name: label_num, dtype: int64

We see that the number of spam examples (1499) is less than half the number of good email examples (3672). This is what we refer to as an unbalanced dataset. The data available from any fraud detection system is typically like this: when dealing with credit card fraud, for example, we may have only about 1% of the data relating to frauds, with the rest being good transactions.
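As a quick check, we can also look at the class proportions rather than the raw counts. A small sketch using pandas' normalize option on the same y variable:

# Fraction of each class in the dataset; with the counts above this is roughly
# 0.71 for good emails (0) and 0.29 for spam emails (1)
y.value_counts(normalize=True)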

Having an unbalanced dataset tends to bias our models, as more weight is given to the class that has more data points. Later in this notebook, we will see how we can balance the dataset. For the moment, we will work with the dataset as it is.

Creating the Training and Test sets

We will split the data at random so that we have 2 sets. We will use one set for training our model and the other set to validate the goodness of the model. We will split the data such that 80% of it is used for training and the remaining 20% is used for validation.

To split the data into training and test sets, we will use the function train_test_split() from the Scikit-Learn library.

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
print("Number of data points in the training set = %d" % (X_train.shape[0]))
print("Number of data points in the test set = %d" % (X_test.shape[0]))
Number of data points in the training set = 4136
Number of data points in the test set = 1035

train_test_split() returns 4 datasets, which we have captured in the variables X_train, X_test, y_train and y_test. X_train contains 80% (in our case) of the rows from the original dataset (i.e. X), chosen at random, and contains only the independent variable. y_train contains the corresponding values of the dependent variable for X_train. X_test contains the remaining 20% of the rows from the original dataset, again containing only the independent variable, and y_test contains the corresponding values of the dependent variable for X_test.
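Since our classes are unbalanced, one optional refinement (not used in this notebook, shown here only as a hedged sketch) is to pass stratify=y so that the spam/ham ratio is preserved in both splits:

# Hypothetical alternative split that keeps the class ratio the same
# in the training and the test sets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
print(y_tr_s.value_counts(normalize=True))
print(y_te_s.value_counts(normalize=True))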

We will train the model using X_train and y_train.

We will test the model by making predictions using X_test.

We will measure the goodness of the model by comparing the predictions made on X_test to the actual values in y_test.

Pre-processing the data

As stated before, we need to find a way to represent the text data as numbers. One way to do this is by preparing a count of each word in every data point. We refer to such a data as a vector of counts. However, before we create the vector of counts, we can remove non essential words/elements from our data.

Among the non essential words/elements are the following:

  1. Commonly used words like a, an, the, etc. These are referred to as Stop Words.
  2. Punctuations and Special Characters
  3. URLs
  4. Numbers
  5. Extra white spaces

There are more categories. However, dealing with the categories stated above will give an idea of how to treat such data.

Functions to remove non essential elements from the data

We will now write functions to remove the non essential elements from the data. We will then collect all these individual functions into a single function so that they can be applied as a whole on the complete dataset.

Function to remove Stop Words

We can get a list of Stop Words in English from the Natural Language Tool Kit (nltk) library. We will get this list and from every data point we will remove these words.
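If the NLTK resources are not already present on your machine, they can be downloaded once before running the next cell. A small sketch (depending on your NLTK version, additional tokenizer resources may also be requested):

import nltk

# One-time downloads; NLTK skips resources that are already installed
nltk.download('stopwords')   # the list of common English stop words
nltk.download('punkt')       # tokenizer models used by word_tokenize()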

In [8]:
# Import the required libraries 
import nltk
from   nltk.corpus import stopwords
from   nltk.tokenize import word_tokenize

# Gather the list of Stop Words
stopWordList = nltk.corpus.stopwords.words('english')

# Function to remove Stop Words
def removeStopWords(text):
    # splitting strings into tokens (list of words)
    tokens = word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    # filtering out the stop words
    filtered_tokens = [token for token in tokens if token not in stopWordList]
    filtered_text = ' '.join(filtered_tokens) 
    
    return filtered_text

We now test our function.

We will pass our test string in lower case.

This is because we will convert all the data in our text column to lower case before building the model.

In [9]:
removeStopWords("This is a test of Stop Word remover".lower())
Out[9]:
'test stop word remover'

However, we will not be supplying one row at a time to this function by hand. We want all the rows in our dataframe to be processed by it. To do this, we will use the code below.

As we are just testing the function, we will make a copy of the dataset and try the function on that copy. This way, we will not disturb our original dataset which we will treat later.

In [10]:
dfXDemo = X_train.copy()
print("Original")
print(X_train.head())
dfXDemo = dfXDemo.apply(removeStopWords)
print("\nAfter removing Stop Words")
print(dfXDemo.head())
Original
5132    Subject: april activity surveys\r\nwe are star...
2067    Subject: message subject\r\nhey i ' am julie ^...
4716    Subject: txu fuels / sds nomination for may 20...
4710    Subject: re : richardson volumes nov 99 and de...
2268    Subject: a new era of online medical care .\r\...
Name: text, dtype: object

After removing Stop Words
5132    Subject : april activity surveys starting coll...
2067    Subject : message subject hey ' julie ^ _ ^ . ...
4716    Subject : txu fuels / sds nomination may 2001 ...
4710    Subject : : richardson volumes nov 99 dec 99 m...
2268    Subject : new era online medical care . new er...
Name: text, dtype: object

The apply() function passes one row at a time from the dataset to the given function (in this case removeStopWords()) and collects the outputs of the function into the destination variable.

Function to remove Special Characters

Next, we remove all the special characters from the text. This is a debatable step, as spam emails might contain characteristic sequences of special characters, so ideally we might want to use them as features as well. However, for the sake of simplicity, we will remove the special characters from the text.

In [11]:
import re

def removeSpecialCharacters(text):
    # Note: the character range a-zA-z (ending in a lower-case z) also matches a few
    # punctuation characters such as ^, _ and the backslash, which is why symbols like
    # ^ and _ survive the cleaning later; use a-zA-Z to strip those as well.
    pattern = r'[^a-zA-z0-9\s]'
    return re.sub(pattern, '', text)

This function replaces any character in the input which is not a letter, a digit or a whitespace character (apart from the few punctuation characters noted in the comment) with an empty string. So, in the process, such characters get eliminated from the input.

Let us test this function.

In [12]:
removeSpecialCharacters("'This'; is a #test $tring, sent: to @someone.")
Out[12]:
'This is a test tring sent to someone'

Function to remove URLs

We will remove all the URLs from the text. Here again, we need to decide whether we should be doing this in the case of spam detection. As this is a tutorial, I expose you to the method of removing URLs from text. As a developer, you need to decide whether you would like to apply this function while pre-processing text for spam detection.

In [13]:
import re

def removeURLs(text):
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

Let us test this function.

In [14]:
removeURLs("Click this link to register - https://www.fake.com OR this link for information - http://info.org or download this file - file://test.txt")
Out[14]:
'Click this link to register -  OR this link for information -  or download this file - '

Function to remove Numbers

Numbers in the text are unlikely to help us identify a spam email, so we remove all the numbers present in the text.

In [15]:
import re

def removeNumbers(text):
    return re.sub('([0-9]+)', '', text)

Let us test our function.

In [53]:
removeNumbers("This text contains 66 numbers and 1654 characters.")
Out[53]:
'This text contains  numbers and  characters.'

Function to remove extra spaces

Extra Spaces contribute nothing to building the model. Extra Spaces may be present in the raw data and/or could have been introduced due to the operations we performed earlier. So, we need to remove them.

In [17]:
def removeExtraWhiteSpaces(text):
    return " ".join(text.split())

Let us test our code.

In [18]:
removeExtraWhiteSpaces("This   is a    test of removing     extra white    spaces")
Out[18]:
'This is a test of removing extra white spaces'

Putting together the functions to create a single function to clean the data

We now put together all the individual functions we have written so far so that we have a single function which can clean the input data. We will run this function on the Training dataset first and get to the next step in formulating the training data.

In [19]:
def cleanData(text):
    text = removeURLs(text)
    text = removeSpecialCharacters(text)
    text = removeStopWords(text)
    text = removeNumbers(text)
    text = removeExtraWhiteSpaces(text)
    
    return text

Now that our function is ready, we will apply this function on our input data. We will treat only the text in the X_train dataset for the moment as this is what we will use for training our model.

Before applying the function, we convert all the data to lower case. This is so that all occurrences of a word look the same. For example, the word "same" may appear in the text as "Same", "SAME", "same" or in some other form. If we do not convert all the text to either lower case or upper case, the different representations of the same word will be treated as distinct entities, which adds no value to the spam detector. Suppose we consider that the word "free" is commonly used in spam; then we need to find out how frequently the word "free" is used in the text of an email, and if we give a different treatment to each representation of "free", our model will become weaker.

In [20]:
X_train = X_train.str.lower()
X_train = X_train.apply(cleanData)
X_train
Out[20]:
5132    subject april activity surveys starting collec...
2067    subject message subject hey julie ^ _ ^ turned...
4716    subject txu fuels sds nomination may attached ...
4710    subject richardson volumes nov dec meter nick ...
2268    subject new era online medical care new era on...
                              ...                        
4426    subject ena sales hpl last legal reviewing con...
466     subject tenaska iv bob understand sandi handli...
3092    subject broom bristles flew differentiable ono...
3772    subject calpine daily gas nomination weekend r...
860     subject meter yep right except june please let...
Name: text, Length: 4136, dtype: object

Converting our text to numbers

Now that we have a clean set of data, we need to convert this text data to numbers so that the computer can use it to model the data. There are a number of ways to convert text data to numbers. We will see 2 such methods.

Converting using CountVectoriser

The first method we will use is to create Count Vectors. What we will do in this method is as follows:

  1. From our dataset, we will identify all the unique words across all the data points
  2. For each data point, we will count how many times each word (determined in the step 1) appears in the data point.

So, we will have a vector for each data point, containing a number (0 or greater) for each unique word in our dataset.
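To make this concrete, here is a tiny, self-contained sketch on two toy sentences of my own (not from the dataset), showing the vocabulary and the resulting count vectors (get_feature_names_out() assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Toy corpus: two short, made-up "emails"
toyDocs = ["free money now", "meeting agenda for money matters"]

toyCv = CountVectorizer()
toyCounts = toyCv.fit_transform(toyDocs)

# Each row is one document; each column holds the count of one unique word
print(pd.DataFrame(toyCounts.todense(), columns=toyCv.get_feature_names_out()))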

To create the Count Vectors, we will use the CountVectorizer from the SciKit-Learn library as shown below.

In [21]:
from   sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cntText = cv.fit_transform(X_train)
print('Shape of original Training Data:', X_train.shape)
print('Shape of transformed Training Data:', cntText.shape)
Shape of original Training Data: (4136,)
Shape of transformed Training Data: (4136, 40923)

So, we see that we originally had one column containing all the text. After transforming using CountVectorizer, we now have 40923 columns. This implies that the text in our training set contains 40923 unique words.

Let us see what this transformed data looks like.

In [22]:
cntText.todense()
Out[22]:
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])

The above matrix contains the count of every unique word in the dataset for every data point. However, this may seem very abstract. We can get a better picture if we can also see the words for which the counts are being displayed. We can do this by converting the above matrix to a dataframe as shown below.

In [23]:
dfTemp = pd.DataFrame(cntText.todense(), columns=cv.get_feature_names())
dfTemp
/Users/parthamajumdar/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
Out[23]:

aa aaa aabda aabvmmq aac aachecar aaer aafco aaigrcrb aaihmqv zyl zynsdirnh zynve zyqtaqlt zyrtec zzezrjok zzn zzo zzocb zzsyt
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4132 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4133 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4134 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4135 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4136 rows × 40923 columns
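Note that the FutureWarning above comes from get_feature_names(), which is deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, the same dataframe would presumably be built with get_feature_names_out():

# Equivalent construction on scikit-learn 1.0+
dfTemp = pd.DataFrame(cntText.todense(), columns=cv.get_feature_names_out())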

The above dataframe suggests that there are a lot of junk words used in the emails of our dataset. We can try to look up all the words which start with the letter "s".

In [24]:
dfTemp[dfTemp.columns[pd.Series(dfTemp.columns).str.startswith('s')]]
Out[24]:

sa saaaaalute saaug saave saazjegp sab saba sabbatical sabeve sabina syzdek syzygy szajac szdzvylc szeb szilard szol szpbi szunrne szvdqvsck
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4132 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4133 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4134 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4135 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4136 rows × 3527 columns

Let us look at the counts of the word "free" in the first few data points.

In [25]:
dfTemp['free'].head()
Out[25]:
0    0
1    1
2    0
3    0
4    1
Name: free, dtype: int64

This shows that the word "free" appears in the second and the fifth data points.

We can see the total number of times the word “free” occurs in the dataset.

In [26]:
dfTemp['free'].sum()
Out[26]:
425

So, the word “free” appears 425 times across all the data points.

For the purpose of the next section, we will add a column to the dataframe dfTemp indicating whether each data point is a good email or a spam email.

In [27]:
dfTemp['spam'] = y_train

Understanding the Bayes Theorem and how it can be applied to Spam Detection

With the formulation we have made so far, let us understand Bayes' Theorem and its application to our problem.

Bayes Theorem states that

P(A|B) = P(B|A) * P(A) / P(B)

Here, A and B are 2 events. P(A) is the probability of the event A occurring. P(A|B) is the probability that the event A occurs given that the event B has occurred.

In our problem, we want to find the probability that an email is spam. To do this, we will evaluate the occurrence of every word in the email and see whether its presence indicates that the email is spam or not.

To understand the above statement, say that we have a hypothesis that if the word "free" appears in an email, there is a chance that the email may be spam. To formulate this as per Bayes' Theorem, we can write

P(spam|free) = P(free|spam) * P(spam) / P(free)

So, P(spam|free) is the probability that the email is spam given that the word "free" appears in it. However, we cannot directly calculate this number, so we use the formula stated above.

We can find P(spam). This is the number of spam emails in our dataset divided by the total number of data points in our dataset.

P(spam) = (Number of spam mails in our dataset) / (Total number of data points in our dataset)

We can also find P(free); this was illustrated in the previous step. Similarly, we can find the probability of occurrence of any other word in the email being analysed.

We can find P(free|spam). This is found by determining the proportion of data points marked as spam in which the word "free" occurs.

In [28]:
P_spam = dfTemp[dfTemp.spam == 1]['spam'].sum() / len(dfTemp)
P_free = dfTemp['free'].sum() / len(dfTemp)

Finding P(spam) and P(free) (or, similarly, the probability of any given word) is relatively straightforward, as shown above.

However, finding probabilities like P(free|spam) is a bit more involved. For this, we first subset the dataframe dfTemp to contain only the rows corresponding to spam emails and then count the data points in this subset in which the word "free" occurs.

In [29]:
P_free_given_spam = len(dfTemp[(dfTemp.spam == 1) & (dfTemp['free'] == 1)]) / len(dfTemp[dfTemp.spam == 1])
In [30]:
print("P_spam = %5.5f" % (P_spam))
print("P_free = %5.5f" % (P_free))
print("P_free_given_spam = %5.5f" % (P_free_given_spam))
P_spam = 0.22921
P_free = 0.10276
P_free_given_spam = 0.04325

So, now we can calculate P(spam|free) = P(free|spam) * P(spam) / P(free)

In [31]:
P_spam_given_free = P_free_given_spam * P_spam / P_free
print("P_spam_given_free = %5.5f" % (P_spam_given_free))
P_spam_given_free = 0.09647

So, there is a 9.65% chance that an email is spam given that it contains the word "free".

Let us try the same for the word “money”.

In [32]:
P_spam = len(dfTemp[dfTemp.spam == 1]) / len(dfTemp)
P_money = dfTemp['money'].sum() / len(dfTemp)
P_money_given_spam = len(dfTemp[(dfTemp.spam == 1) & (dfTemp['money'] == 1)]) / len(dfTemp[dfTemp.spam == 1])

print("P_spam = %5.5f" % (P_spam))
print("P_money = %5.5f" % (P_money))
print("P_money_given_spam = %5.5f" % (P_money_given_spam))

P_spam_given_money = P_money_given_spam * P_spam / P_money
print("P_spam_given_money = %5.5f" % (P_spam_given_money))
P_spam = 0.22921
P_money = 0.07302
P_money_given_spam = 0.02848
P_spam_given_money = 0.08940

So, there is an 8.94% chance that an email is spam given that it contains the word "money".

We now need to find the chance that an email is spam given that it contains both the words "free" and "money". So, by Bayes' Theorem, what we need to find is

P(spam|(free, money)) = P((free, money)|spam) * P(spam) / P((free, money))

We can calculate all the elements on the right hand side as we are dealing with only 2 words. However, when we consider all the words in the input email, this calculation becomes impractical. Here we make the Naive Bayes assumption that the presence of each word is independent of the others. Using this assumption, and dropping the denominator P((free, money)), which is the same for both classes, the comparison reduces to the following.

P(spam|(free, money)) ∝ P(free|spam) * P(money|spam) * P(spam)

P(good|(free, money)) ∝ P(free|good) * P(money|good) * P(good)

Let us calculate P(spam|(free, money)) and P(good|(free, money)).

In [33]:
P_free_given_spam = len(dfTemp[(dfTemp.spam == 1) & (dfTemp['free'] == 1)]) / len(dfTemp[dfTemp.spam == 1])
P_money_given_spam = len(dfTemp[(dfTemp.spam == 1) & (dfTemp['money'] == 1)]) / len(dfTemp[dfTemp.spam == 1])
P_spam = len(dfTemp[dfTemp.spam == 1]) / len(dfTemp)
print("P_free_given_spam = %5.5f" % (P_free_given_spam))
print("P_money_given_spam = %5.5f" % (P_money_given_spam))
print("P_spam = %5.5f" % (P_spam))

P_spam_given_free_and_money = P_free_given_spam * P_money_given_spam * P_spam
print("P_spam_given_free_and_money = %5.5f" % (P_spam_given_free_and_money))
P_free_given_spam = 0.04325
P_money_given_spam = 0.02848
P_spam = 0.22921
P_spam_given_free_and_money = 0.00028
In [34]:
P_free_given_good = len(dfTemp[(dfTemp.spam == 0) & (dfTemp['free'] == 1)]) / len(dfTemp[dfTemp.spam == 0])
P_money_given_good = len(dfTemp[(dfTemp.spam == 0) & (dfTemp['money'] == 1)]) / len(dfTemp[dfTemp.spam == 0])
P_good = len(dfTemp[dfTemp.spam == 0]) / len(dfTemp)
print("P_free_given_good = %5.5f" % (P_free_given_good))
print("P_money_given_good = %5.5f" % (P_money_given_good))
print("P_good = %5.5f" % (P_good))

P_good_given_free_and_money = P_free_given_good * P_money_given_good * P_good
print("P_good_given_free_and_money = %5.5f" % (P_good_given_free_and_money))
P_free_given_good = 0.04772
P_money_given_good = 0.03590
P_good = 0.57253
P_good_given_free_and_money = 0.00098

We notice that the value computed for P(spam|(free, money)) is 0.00028 and the value for P(good|(free, money)) is 0.00098.

Clearly, P(good|(free, money)) > P(spam|(free, money)).

So, at this stage, we can conclude that the email being examined is more likely to be good than to be spam.

When we apply the Naive Bayes theorem to our problem, we will find P(good|(word1, word2, …, wordn)) and P(spam|(word1, word2, …, wordn)), where word1 to wordn are all the words in the email being examined. Based on whether P(good|(word1, word2, …, wordn)) > P(spam|(word1, word2, …, wordn)) or the other way around, we will conclude whether the email is good or spam.
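As an illustration only, the two-word comparison above can be generalised to any list of words with a small helper of my own (not part of any library), working directly on the counts in dfTemp, counting a data point as containing a word if the word occurs at least once. Logarithms are used so that the product of many small probabilities does not underflow, and a tiny constant avoids log(0):

import numpy as np

# Hedged sketch: un-normalised log-score of a class (1 = spam, 0 = good)
# given a list of words, using the word counts in dfTemp
def classScore(words, label):
    subset = dfTemp[dfTemp.spam == label]           # data points of this class
    score = np.log(len(subset) / len(dfTemp))       # log of the class prior
    for w in words:
        if w in subset.columns:
            # fraction of this class's data points containing the word
            pWordGivenClass = (subset[w] >= 1).mean() + 1e-9
            score += np.log(pWordGivenClass)
    return score

wordsInEmail = ['free', 'money']
print("spam score = %8.5f" % classScore(wordsInEmail, 1))
print("good score = %8.5f" % classScore(wordsInEmail, 0))

Whichever class gets the higher score would be the predicted label, mirroring the comparison we made by hand above.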

Implementing Naive Bayes algorithm for Spam Detection using Scikit-Learn

Now that we have discussed the theory behind Naive Bayes algorithm, let us use the library provided in Scikit-Learn to implement Naive Bayes algorithm for Spam Detection.

There are several variations of the Naive Bayes implementation provided in Scikit-Learn. We will discuss Multinomial Naive Bayes, as this variation is written specifically for applying the Naive Bayes algorithm to Count Vectors.

Building the Multinomial Naive Bayes model

For building the model, we need the training data and the associated labels. We built the count vectors for the training data set and stored it in cntText. We have the corresponding labels in y_train. Using this, we can build our model as shown below.

In [35]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(cntText, y_train)
Out[35]:
MultinomialNB()

Now our model has been prepared. Let us test how the model is working on the training set.

To test the model, we need to make predictions from the model. As we are testing the training set first, we make predictions on the training dataset, i.e. cntText, and compare the predictions with y_train. The code below demonstrates making predictions on the training dataset using our model.

In [36]:
yTrainPred = mnb.predict(cntText)

To compare the predictions made by the model with the actual data, we can use the Confusion Matrix as shown below.

In [37]:
from   sklearn import metrics
import matplotlib.pyplot as plt

confusionMatrix = metrics.confusion_matrix(y_train, yTrainPred)
cmDisplay = metrics.ConfusionMatrixDisplay(confusion_matrix = confusionMatrix, display_labels = [False, True])
cmDisplay.plot()
plt.show()
Confusion Matrix

From the above confusion matrix, we see that 2913 good emails were correctly classified by the model as good and 1167 spam emails were correctly classified as spam. The model misclassified 17 good emails as spam and 39 spam emails as good.

So, the accuracy of the model is (2913 + 1167) / (2913 + 1167 + 17 + 39)

In [38]:
print("Training Accuracy = %5.5f" % ((2913 + 1167) / (2913 + 1167 + 17 + 39)))
Training Accuracy = 0.98646

We can also calculate the accuracy using functions available in Scikit-Learn as shown below.

In [39]:
print("Training Accuracy = %5.5f" % (metrics.accuracy_score(y_train, yTrainPred)))
Training Accuracy = 0.98646
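Since the dataset is unbalanced, accuracy alone can be misleading: a model that labels almost everything as a good email would still score fairly high. Precision and recall per class give a fuller picture; a small sketch using scikit-learn's classification_report (the label names are ours):

# Precision, recall and F1-score for each class on the training predictions
print(metrics.classification_report(y_train, yTrainPred, target_names=['good', 'spam']))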

Validating the model on the test dataset

We have seen the accuracy of the model on the training data. However, the training dataset is the data on which the model was built. We will now feed in the test dataset (which the model has not yet seen) and check the accuracy of the model.

We have kept the test data in the variable X_test and the corresponding labels in y_test. Before we can apply the model to the test dataset, we must transform it just as we transformed the training dataset. The steps below transform the test dataset.

In [40]:
X_test = X_test.str.lower()
X_test = X_test.apply(cleanData)
X_test
Out[40]:
1566    subject hpl nom march see attached file hplno ...
1988    subject online pharxmacy meds disscount phafrm...
1235    subject nom actual volume april th agree eilee...
2868    subject meter dec robert put heads together de...
4903    subject coastal oil gas corporation melissa de...
                              ...                        
1175    subject alert spam prevention cllck stop sign ...
4476    subject enron blockbuster launch entertainment...
4198    subject make computer like new remove spyware ...
2689    subject temp forecast model xls file city temp...
2142    subject enron hpl actuals august teco tap enro...
Name: text, Length: 1035, dtype: object
In [41]:
cntTestText = cv.transform(X_test)
print('Shape of original Test Data:', X_test.shape)
print('Shape of transformed Test Data:', cntTestText.shape)
Shape of original Test Data: (1035,)
Shape of transformed Test Data: (1035, 40923)

The important aspect to notice is that we applied the same instance of CountVectorizer that we used for the training dataset: we fit the CountVectorizer on the training data and only transform the test data with it. Notice that the transformed test dataset also has 40923 columns. This is because the test data is mapped onto the vocabulary of unique words extracted while fitting the training dataset.
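One optional way to guarantee that the same fitted vectoriser is always applied before the classifier is to bundle the two steps in a scikit-learn Pipeline. This is not used in the rest of the notebook; the sketch below is only to illustrate the idea (the step names are arbitrary):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Vectorisation and classification bundled together, so the cleaned text
# can be passed straight to fit() and predict()
spamPipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
spamPipeline.fit(X_train, y_train)
print("Pipeline test accuracy = %5.5f" % spamPipeline.score(X_test, y_test))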

Now that we have transformed the test dataset, we can use it to make predictions with the model we developed.

In [42]:
yTestPred = mnb.predict(cntTestText)

Let us generate the Confusion Matrix for the predictions made on the test dataset and check the goodness of the model.

In [43]:
confusionMatrix = metrics.confusion_matrix(y_test, yTestPred)
cmDisplay = metrics.ConfusionMatrixDisplay(confusion_matrix = confusionMatrix, display_labels = [False, True])
cmDisplay.plot()
plt.show()
Confusion Matrix

Let us check the test accuracy of our model.

In [44]:
print("Test Accuracy = %5.5f" % (metrics.accuracy_score(y_test, yTestPred)))
Test Accuracy = 0.97391

Making predictions using our model

We can feed new emails into our model and get predictions from it. As I do not have a real email at hand, I will use a simple English sentence as shown below. Here, I am explaining the mechanics; the same mechanics can be applied to real data.

In [45]:
testEmail = "This is a test email to be checked for spam"
testEmail = testEmail.lower()
testEmail = cleanData(testEmail)
testEmail
Out[45]:
'test email checked spam'
In [46]:
cntTestEmail = cv.transform([testEmail])
In [47]:
mnb.predict(cntTestEmail)
Out[47]:
array([1])

So, for the text we have given, our model predicts that the email would be spam.
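If we also want to see how confident the model is, MultinomialNB provides predict_proba(), which returns the estimated probability of each class. A brief sketch on the same vectorised test email:

# Estimated probabilities for [good (0), spam (1)] for the test email
print(mnb.predict_proba(cntTestEmail))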

Using TF-IDF Vectoriser

While building the previous model, the vectoriser we used to convert text to numbers was the Count Vectoriser. A generally better-performing alternative is TF-IDF, which stands for Term Frequency-Inverse Document Frequency. I will not get into the theory behind TF-IDF here; I will leave that for you to research. I will show the implementation of TF-IDF to vectorise our data. The code is shown below.

In [48]:
from   sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0, use_idf=True, ngram_range=(1,4))
tfidfText = tv.fit_transform(X_train)
tfidfText.shape
Out[48]:
(4136, 730859)
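The large jump from 40923 columns to 730859 columns is mainly due to ngram_range=(1, 4), which treats every run of 1 to 4 consecutive words as a separate feature in addition to the individual words. A tiny sketch on a made-up sentence of my own shows the effect (get_feature_names_out() assumes scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import TfidfVectorizer

toyTv = TfidfVectorizer(ngram_range=(1, 4))
toyTv.fit(["click here for free money"])
# The features now include single words as well as phrases of up to four words
print(toyTv.get_feature_names_out())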

Balancing the dataset

We saw that the number of spam examples was less than half the number of good emails in our dataset, i.e. the dataset is unbalanced. Models built on such datasets can be biased towards the class that has more data points, so it is generally advised to use balanced datasets to develop models. However, gathering more real spam emails (in our case) to balance the dataset could be very expensive. Instead, we will use statistical techniques to balance our dataset.

One such technique for balancing datasets is SMOTE. SMOTE stands for Synthetic Minority Over-sampling Technique. Using this technique, we will generate additional data points for the minority class, i.e., we will generate more examples for spam emails.

The implementation of SMOTE is provided below.

In [49]:
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(tfidfText, y_train)

y_res.value_counts()
Out[49]:
0    2930
1    2930
Name: label_num, dtype: int64

We see that we now have an equal number of examples of good emails and spam emails.

We will use this dataset to build the next Naive Bayes model.

Building the model

Now, we will use the dataset generated above to build our model, as shown below.

In [50]:
mnb = MultinomialNB()
mnb.fit(X_res, y_res)
Out[50]:
MultinomialNB()

We now check the accuracy of this model.

In [51]:
yTrainPred = mnb.predict(X_res)

confusionMatrix = metrics.confusion_matrix(y_res, yTrainPred)
cmDisplay = metrics.ConfusionMatrixDisplay(confusion_matrix = confusionMatrix, display_labels = [False, True])
cmDisplay.plot()
plt.show()

print("Training Accuracy = %5.5f" % (metrics.accuracy_score(y_res, yTrainPred)))
Confusion Matrix
Training Accuracy = 0.99795

We now test this model on our test dataset.

In [52]:
# X_test was already lower-cased and cleaned earlier; re-applying these steps here is harmless
X_test = X_test.str.lower()
X_test = X_test.apply(cleanData)
tfidfTest = tv.transform(X_test)
yTestPred = mnb.predict(tfidfTest)

confusionMatrix = metrics.confusion_matrix(y_test, yTestPred)
cmDisplay = metrics.ConfusionMatrixDisplay(confusion_matrix = confusionMatrix, display_labels = [False, True])
cmDisplay.plot()
plt.show()

print("Test Accuracy = %5.5f" % (metrics.accuracy_score(y_test, yTestPred)))
Confusion Matrix
Test Accuracy = 0.97778
