Concept of the Product¶
Cybersecurity is of paramount importance for any modern government or enterprise. Both are under constant cyber attack from all kinds of adversaries, and it is often said that the next war will be fought in cyberspace. It is therefore of utmost importance to develop products and strategies to deal with the threat of cyber attacks.
One means of cyber attack is malware: malicious computer programs embedded in software of regular use. For example, an email may carry malware which infects the machine of the receiver and then spreads across an enterprise.
Images are among the most ubiquitous files exchanged by people across the globe, and the advent of social media has boosted this exchange. Images are exchanged not only between computers but also between devices such as mobile phones, and people tend to spread them across wide circles. Malware developers therefore target images as carriers for their malicious code, because an infected image gets a lot of traction.
JPEG is one of the most popular image file formats, so infecting JPEG files holds a lot of incentive for malware developers. So far, antivirus software has detected such malware by studying signatures inside JPEG files. However, this technique is expensive, because large teams have to be deployed to constantly research new signatures. This project uses supervised machine learning so that computers can detect malware in JPEG files without having to be constantly fed knowledge of new signatures.
Results obtained from the Project¶
During training, out of a training set of 4,582 clean JPEG files, 4,580 were classified as clean and 2 were classified as containing malware. Out of a training set of 400 JPEG files containing malware, none was classified as clean. This was a significant result, because the most unwanted outcome of this project is a missed detection: a malware file classified as clean is far more dangerous than a clean file flagged as malware.
During testing, out of a test set of 117 clean JPEG files, 107 were classified as clean and 10 were classified as containing malware. Out of a test set of 22 JPEG files containing malware, none was classified as clean.
The training accuracy is 99.95% and the testing accuracy is 92.80%. From this observation, it is reasonable to conclude that the model does not suffer from severe overfitting.
By training on around 5,000 files, the model is able to detect malware in JPEG files with an accuracy of about 93%. The model has some notable weaknesses, which could be reduced by training on a much larger data set (of about a million files or more). The project demonstrates that this machine-learning strategy can save a lot of money and effort in detecting malware, and the concepts used here can be extended to other file types, broadening the utility of the product.
Purpose of this Product¶
The product is intended to be used as follows:
- The product will expose an API that analyses whether a JPEG file is clean or contains malware.
- When a JPEG file, or a set of JPEG files, is given as input to the API, the API will report back which files are clean and which contain malware.
- Based on the result returned by the API, appropriate actions can be planned for dealing with the files.
Consider a use case for this product. Suppose a company is in the business of acquiring photographs from around the world; there are many companies of this nature, such as 123RF, Adobe, Instagram and Flickr. If the company sets up a directory on its server where all photographs uploaded by users are stored, a file watcher can be set up on that directory to pick up every incoming JPEG file. The file watcher can submit each picked-up file to a processor that calls the API of this product to analyse the JPEG file. Based on the analysis, further action, such as quarantining the file, deleting it, or accepting it onto the platform, can be taken by software. A minimal sketch of such a pipeline follows.
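The sketch below illustrates this flow, assuming the watchdog library for the file watcher and a hypothetical REST endpoint; the URL, the /analyse route, the verdict response field and the directory names are all illustrative and not part of this project:
import shutil
import requests
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

UPLOAD_DIR = "./uploads"                    # directory watched for incoming photographs (hypothetical)
QUARANTINE_DIR = "./quarantine"             # destination for suspicious files (hypothetical)
API_URL = "http://localhost:8080/analyse"   # hypothetical analysis endpoint of the product

class JPEGUploadHandler(FileSystemEventHandler):
    def on_created(self, event):
        # React only to newly created JPEG files
        if event.is_directory or not event.src_path.lower().endswith((".jpg", ".jpeg")):
            return
        # Send the file to the analysis API
        with open(event.src_path, "rb") as f:
            response = requests.post(API_URL, files={"file": f})
        verdict = response.json().get("verdict")          # assumed response field
        if verdict == "malware":
            shutil.move(event.src_path, QUARANTINE_DIR)   # quarantine the infected file

observer = Observer()
observer.schedule(JPEGUploadHandler(), UPLOAD_DIR, recursive=False)
observer.start()    # watch the upload directory until the process is stopped
observer.join()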
The main advantage of the product is that it eliminates the need for a dedicated team to constantly research JPEG files looking for malware signatures. The models behind the product can be refreshed at routine intervals so that they keep adapting to new data.
Important Note¶
This project only demonstrates the model. The API and the associated software are not part of this project, though they are part of the larger product development.
About JPEG Files¶
JPEG files are compressed files used to store images. A JPEG file can be identified by the marker 0xFFD8 (the Start of Image marker) at the very start of the file. A minimal sketch of this check follows.
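The following sketch (with a hypothetical file name) checks the first two bytes of a file for the Start of Image marker:
def hasJPEGSOIMarker(path):
    # A JPEG file starts with the Start of Image (SOI) marker 0xFFD8
    with open(path, "rb") as f:
        return f.read(2) == b"\xff\xd8"

# Example: hasJPEGSOIMarker("./Data/clean_jpeg/sample.jpg")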
The information in a JPEG file can be classified into two types: image data and EXIF tags. EXIF tags were added to the JPEG ecosystem by the Exif standard in the mid-1990s.
The Image Data of JPEG files have the following segments:¶
- Header
- 2 Quantization Tables
- Frame Information
The Header segment contains the following data:
- Identifier (the ASCII string "JFIF" for JFIF-format JPEG files)
- Version
- Units
- Density
- Thumbnail
Quantization Table 1 contains the data regarding the luminance of the image. It is an 8×8 table.
Quantization Table 2 contains the data regarding the chrominance of the image. It is also an 8×8 table.
The Frame Information is a series of Huffman encoded tables containing the bit pattern of the image.
The important thing to note is that if any of the above values is tampered with, the JPEG file will not render properly. So malware developers do not tamper with this part of a JPEG file to introduce malware. (A small sketch of inspecting these segments with Pillow follows.)
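As a minimal sketch (the file name is hypothetical), Pillow exposes some of the segments described above:
from PIL import Image

img = Image.open("./Data/clean_jpeg/sample.jpg")   # hypothetical file
print(img.format)                        # 'JPEG'
print(img.size)                          # image dimensions from the frame information
print(img.info.get("jfif_version"))      # version from the JFIF header, if present
print(len(img.quantization))             # number of quantization tables (typically 2)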
EXIF Tags¶
Exif tags provide additional information about a JPEG file. They can be altered programmatically to change the nature of the stored image. For example, by altering the Exif tag BrightnessValue, the brightness of the image can be altered; similarly, by altering the Exif tags ExifImageHeight and ExifImageWidth, the recorded size of the image can be altered.
Exif tags give photograph editors the facility to record enhancements and alterations to an image stored as a JPEG file. Because these values can be altered, malware developers alter these tags to introduce their spurious code. One way to clean an infected JPEG file is to convert it to a BMP file: BMP files contain only the image information and do not support Exif tags, so converting to BMP strips the Exif tags and thus cleans the file of the malware. A small sketch of reading and stripping Exif data follows.
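The following is a minimal sketch (with hypothetical file names): read the Exif tags of a JPEG file with Pillow, then re-save the image as BMP, which drops the Exif data:
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("./Data/clean_jpeg/sample.jpg")    # hypothetical input file
exif = img.getexif()
for tag_id in exif:
    # Print each tag name and value, e.g. BrightnessValue, ExifImageWidth
    print(TAGS.get(tag_id, tag_id), exif.get(tag_id))

img.save("./sample_stripped.bmp")                   # BMP cannot carry Exif tags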
A JPEG file may or may not contain Exif tags. In particular, JPEG files created before the Exif standard was introduced do not contain Exif tags.
Conclusion¶
If a JPEG file does not contain Exif tags, it is reasonably safe to consider it a clean file.
So, for malware analysis, we need to concentrate only on JPEG files that contain Exif tags.
About the data¶
The initial data for this project was obtained from McAfee Labs in India. McAfee provided 3,124 Clean JPEG Files and 278 JPEG Files with Malware.
As the total number of files was small, 2,763 additional JPEG files were collected from my own library of photographs and from a friend's collection. These files were transferred using Google Drive, which served as an extra round of checking that the files were actually clean, since Google Drive discards JPEG files containing malware.
As the number of JPEG files containing malware was small, C3i Labs at IIT Kanpur was approached, and it provided 160 JPEG files containing malware.
So, the total data set contained 5,887 clean JPEG files and 438 JPEG files with malware.
Training Set¶
Out of the total data set, 5,583 clean JPEG files and 304 JPEG files with malware were used for the training set.
Test Set¶
Out of the total data set, 304 clean JPEG files and 23 JPEG files with malware were used for the test set.
Strategies tried to arrive at a model¶
The initial part of the project involved finding reliable libraries to read JPEG files. Once a library was found and the JPEG files were being read, two significant discoveries emerged: not all the files in the obtained data set were actually JPEG files, and many of the JPEG files did not contain EXIF tags. Both findings reduced the usable data set and led to the following conclusions (a small triage sketch based on these rules follows the list):
- If a file was not a valid JPEG file, the software would simply reject it as outside the scope of this product.
- If a file was a valid JPEG file but did not contain EXIF tags, it would be classified as benign (because in almost all cases it is not possible to transmit malware in a JPEG file without editing the EXIF tags).
- Some of the files labelled as containing malware were found not to be JPEG files at all. This is possible because a file could have been damaged while it was being manipulated; JPEG files are identified by the JPEG signature in the file header.
- Surprisingly, some of the files labelled as containing malware were found to contain no EXIF tags. This set remains the unexplained part of this project.
- Not all JPEG files contain the same set of tags.
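A minimal sketch of the triage rules above (the logic only, not the exact code used in the project):
from PIL import Image

def triage(path):
    # Rule 1: files that are not valid JPEG images are rejected as out of scope
    try:
        with Image.open(path) as img:
            img.verify()                  # raises an exception for corrupt files
            isJPEG = (img.format == "JPEG")
    except Exception:
        return "reject: not a valid image file"
    if not isJPEG:
        return "reject: not a JPEG file"
    # Rule 2: valid JPEG files without EXIF tags are classified as benign
    with Image.open(path) as img:         # reopen, since verify() leaves the file unusable
        exif = img.getexif()
    if len(exif) == 0:
        return "benign: valid JPEG file without EXIF tags"
    # Rule 3: only valid JPEG files with EXIF tags are sent to the model
    return "analyse: valid JPEG file with EXIF tags"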
Strategy 1: Using the length of the tags as features¶
Once the tags were extracted from the JPEG files, a unique list of tags found across all the JPEG files was created, and a data frame was formed with each of these tags as a column.
The length of each tag value in every JPEG file was determined. If a tag was not present in a JPEG file, the corresponding column in the data frame was assigned a value of zero. This produced a data frame containing only numbers, which could be used for machine learning.
Using this data frame, a logistic regression model was developed; it achieved an accuracy of about 56%. A random forest model built on the same data frame performed slightly better, at about 58%. Finally, an artificial neural network built with the TensorFlow API reached an accuracy of about 72%. A sketch of this length-based feature construction with a baseline model follows.
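The following is a minimal sketch of Strategy 1 (not the exact code used in the project, which was discarded). It assumes the per-file tag dictionaries benignFileFeatures and malwareFileFeatures produced by extractTagsFromADirectory() later in this notebook, builds one column per tag holding the length of the tag value (zero when absent), and fits a logistic regression baseline:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

records = []
for features, label in [(f, 0) for f in benignFileFeatures] + [(f, 1) for f in malwareFileFeatures]:
    # One column per tag, holding the length of the tag value; skip the file name
    row = {str(tag): len(str(value)) for tag, value in features.items() if tag != 'FileName'}
    row['FileType'] = label
    records.append(row)

lengthDF = pd.DataFrame(records).fillna(0)     # a missing tag counts as length 0
X_len = lengthDF.drop(columns=['FileType'])
y_len = lengthDF['FileType']

logreg = LogisticRegression(max_iter=1000)
print(cross_val_score(logreg, X_len, y_len, cv=5).mean())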
Strategy 2: Forming TF-IDF¶
Following the mentor's guidance to use TF-IDF over the tags, a string was formed by concatenating all the tag values available in each JPEG file, and the TF-IDF was formed from these strings. A model was then trained using the Decision Tree algorithm; it gave an accuracy of 99.5% during training.
Encouraged by this result, a Random Forest model was created using the same TF-IDF. The results were similar to those of the Decision Tree model.
This document only contains the model formed using strategy 2. The model formed using strategy 1 has been discarded.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os.path
import datetime
# For Progress display
from tqdm import tqdm
import time
# For JPEG File Feature Extraction
from PIL import Image
from PIL.ExifTags import TAGS
# For File Handling
import glob
# For SMOTE
import imblearn as ib
# For building the Term Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
# For building the model
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For computing the metrics
from sklearn import metrics
# To save and load the models
import pickle
Constants¶
Constants were used to avoid hard coding. This also gave the advantage that an indicator can be changed in one place and the change takes effect across the complete program.
# Constants
FILE_NAME_COLUMN_NAME = 'FileName'
FILE_TYPE_COLUMN_NAME = 'FileType'
TAG_STRING_COLUMN_NAME = 'TagString'
NUMERIC_COLUMN_IDENTIFIER = 'AAA'
BENIGN_FILE = 0
FILE_WITH_MALWARE = 1
Function used to extract EXIF Tags from JPEG Files¶
The following 3 functions were used to extract the EXIF Tags from the JPEG Files.
extractTagsFromADirectory¶
This function takes a glob pattern covering one file or a set of files and returns the tags extracted from all the matching JPEG files as a list. Along with the tags, this function returns the number of valid JPEG files, the number of invalid image files and the number of JPEG files that contained no tags.
This function calls the functions JPEGFileFeatureExtractorToDictionary() and isImageFile().
JPEGFileFeatureExtractorToDictionary¶
This function takes one JPEG file as input and returns all the EXIF Tags in the JPEG file in a Dictionary as output.
isImageFile¶
This function takes a file as an input and returns TRUE if the file is a JPEG File and returns FALSE if the file is not a JPEG File.
def extractTagsFromADirectory(inputDirectory):
    # Declare counters
    numberOfValidFiles = 0
    numberOfInvalidFiles = 0
    numberOfFilesWithoutTags = 0
    # Create an empty list to hold the features of all the files
    returnValue = []
    # Loop through all the files matching the input pattern
    for file in glob.glob(inputDirectory):
        # Create an empty dictionary for this file
        oneFileFeatures = {}
        try:
            # Read the file and extract the features
            fileFeatures = JPEGFileFeatureExtractorToDictionary(file)
            # If the file had some features, create an entry for it
            if len(fileFeatures.keys()) > 0:
                # Record the file name
                oneFileFeatures[FILE_NAME_COLUMN_NAME] = file
                # Add the file features to the dictionary
                oneFileFeatures.update(fileFeatures)
                # Add the entry to the return value
                returnValue.append(oneFileFeatures)
                numberOfValidFiles = numberOfValidFiles + 1
            else:
                # Check whether the file is a valid image file
                if isImageFile(file):
                    numberOfFilesWithoutTags = numberOfFilesWithoutTags + 1
                else:
                    numberOfInvalidFiles = numberOfInvalidFiles + 1
        except:
            # Reading failed: check whether the file is a valid image file
            if isImageFile(file):
                numberOfFilesWithoutTags = numberOfFilesWithoutTags + 1
            else:
                numberOfInvalidFiles = numberOfInvalidFiles + 1
    return (returnValue, numberOfValidFiles, numberOfInvalidFiles, numberOfFilesWithoutTags)
def JPEGFileFeatureExtractorToDictionary(imageFile):
    # Declare an empty dictionary
    returnValue = {}
    # Read the image data using PIL
    image = Image.open(imageFile)
    # Extract the EXIF data
    exifdata = image.getexif()
    # Iterate over all EXIF data fields
    for tag_id in exifdata:
        # Get the tag name
        tag = TAGS.get(tag_id, tag_id)
        data = exifdata.get(tag_id)
        # Decode bytes
        if isinstance(data, bytes):
            data = data.decode('iso8859-1')
        returnValue[tag] = data
    return returnValue
def isImageFile(imageFileName):
    returnValue = True
    try:
        img = Image.open('./' + imageFileName)  # open the image file
        img.verify()                             # verify that it is an image
    except (IOError, SyntaxError) as e:
        returnValue = False
    return returnValue
Extract Features from Benign Files¶
In this step, the program extracts all the features from the clean JPEG files. All the clean JPEG files were stored in a separate directory for easy classification. The extension pattern *.j* is used because, while transferring photographs from my library and from my friend's library, some video files came along as well. Even if a video file were processed, it would make no difference, because such files would be segregated as non-JPEG files and not considered later.
Notice that there were 816 valid JPEG files in the directory which contained no EXIF tags.
benignFileFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/clean_jpeg/*.j*")
print("Valid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
Extract Features from Files containing Malware¶
In this step, the program extracts the features from the JPEG files containing malware. A significant point to note is that there are 3 JPEG files labelled as containing malware which have no tags. As these files had been marked as containing malware, I tried to upload them to Google Drive. The files got uploaded, as can be seen in the picture below; however, when I tried to download them, the download failed. So I am unable to conclude whether these are actually JPEG files containing malware.
malwareFileFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/malicious_files/*")
print("Valid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
Extract the unique list of Keys for all the files¶
From the last two steps, we have two lists of dictionaries containing all the tags extracted from the clean JPEG files and from the JPEG files containing malware. In this step, a list of all the unique tags is prepared. To do this, a set of tags is created; a set is chosen because sets cannot contain duplicate items.
Another significant observation was that some tag identifiers were numeric. Since a data frame should not have a column name consisting only of numbers, these tags were prefixed with a fixed string.
The list of unique tags as extracted is listed below.
featureList = set()
for i in benignFileFeatures:
    for k in i.keys():
        if type(k) == int:
            featureList.add(NUMERIC_COLUMN_IDENTIFIER + str(k))
        else:
            featureList.add(k)
for i in malwareFileFeatures:
    for k in i.keys():
        if type(k) == int:
            featureList.add(NUMERIC_COLUMN_IDENTIFIER + str(k))
        else:
            featureList.add(k)
featureList
Create a Data Frame containing all the Keys extracted in the previous step as Columns and fill in the values in the Data Frame¶
fillDataInDataFrame¶
Function to create a data frame of all the tags available in the set of JPEG Files.
For Strategy 1, separate columns were created for each tag; each column contained the tag value if it was present in a JPEG file and zero otherwise. As that strategy was abandoned, a string is also created containing the concatenation of all the tag values in a JPEG file, separated by spaces, and this string is inserted as a separate column in the data frame.
Note that the trailing space in the string of tags is removed. It is not known to me why the presence of this trailing space caused an error while creating the TF-IDF.
def fillDataInDataFrame(featureDictionary, extractionDescription, fileType):
    # Create an empty DataFrame object
    df = pd.DataFrame()
    for record in tqdm(featureDictionary, desc = extractionDescription):
        # Create an empty dictionary for this record
        oneRecord = {}
        # Create an empty string
        recordString = ""
        # Initialise all the columns to zero
        for colName in featureList:
            oneRecord[colName] = pd.to_numeric(0, downcast='integer')
        # Loop through all the features in the record
        for k in record.keys():
            # Extract the column name from the record
            if type(k) == int:
                extractedColumnName = NUMERIC_COLUMN_IDENTIFIER + str(k)
            else:
                extractedColumnName = k
            # Extract the value for the key and store it in the dictionary
            if ((type(record[k]) == int) | (type(record[k]) == float)):
                oneRecord[extractedColumnName] = pd.to_numeric(record[k], downcast='integer')
            else:
                oneRecord[extractedColumnName] = record[k]
            # Append the value to the record string (used later for the TF-IDF),
            # with a space between tag values; do not include the file name
            if k != FILE_NAME_COLUMN_NAME:
                recordString = recordString + str(record[k]) + " "
        # Add the record string (without the trailing space) as a separate column
        oneRecord[TAG_STRING_COLUMN_NAME] = recordString[:-1]
        # Add the column that marks the dependent variable (benign or malware)
        oneRecord[FILE_TYPE_COLUMN_NAME] = pd.to_numeric(fileType, downcast='integer')
        # Add the record to the data frame
        # (DataFrame.append was removed in pandas 2.0; with newer pandas,
        # collect the records in a list and build the frame once at the end)
        df = df.append(oneRecord, ignore_index=True)
        time.sleep(0.0001)  # brief pause so that tqdm updates smoothly
    return df
Create data frame of tags¶
In this step, 2 data frames are created – one containing the tags from clean JPEG Files and one containing tags from JPEG Files containing malware.
benignFileDF = fillDataInDataFrame(benignFileFeatures, 'Extracting Clean Files Features', BENIGN_FILE)
malignantFileDF = fillDataInDataFrame(malwareFileFeatures, 'Extracting Features from Files with Malware', FILE_WITH_MALWARE)
Create a single data frame containing all the tags¶
In this step, the 2 data frames created in the previous step are concatenated into a single data frame.
The data frame is saved to a CSV file.
df = pd.concat([benignFileDF, malignantFileDF], ignore_index=True)
df['FileType'] = pd.to_numeric(df['FileType'], downcast='integer')
df.to_csv('./JPEGDataSet.csv')
This was another unexplained issue in the program; over a week was spent solving this mystery.
If the saved CSV file was read back to form the data frame and that data frame was used for creating the TF-IDF, the TF-IDF had very few features. The problem was solved by eliminating CHR(0) from the tag strings: after eliminating CHR(0) and then saving the data frame, the TF-IDF got created properly with all the features. When the data frame was saved without eliminating CHR(0), the tags in the saved CSV were getting truncated.
In the next step, CHR(0) is eliminated from the tags and the data frame is saved to a CSV file. Now that the data frame is saved to a CSV file, the steps above can be skipped on later runs: we can start from reading the CSV file and forming the data frame, and run the extraction steps only when a new set of JPEG files is received. For example, the program can be altered to skip the steps above if the file JPEGCleanDataSet.csv is already present in the current directory (a small sketch follows).
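A minimal sketch of that skip logic, reusing the os.path and pandas imports at the top of this notebook (the extraction itself is the code shown above):
if os.path.isfile('./JPEGCleanDataSet.csv'):
    # Reuse the previously cleaned data set
    dfTFIDClean = pd.read_csv('./JPEGCleanDataSet.csv')
else:
    # Otherwise run the extraction and cleaning steps above, then save the
    # result with dfTFIDClean.to_csv('./JPEGCleanDataSet.csv')
    pass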
So far, we have extracted the features from the JPEG files, both from the JPEG files containing malware and from the clean JPEG files. Now these features need to be organised so that models can be prepared for classifying JPEG files as clean or containing malware.
Notice that we have 4,582 clean JPEG Files and 400 JPEG Files with Malware in our data frame.
df[FILE_TYPE_COLUMN_NAME].value_counts()
dfTFID = df[[TAG_STRING_COLUMN_NAME, FILE_TYPE_COLUMN_NAME]].copy()
print('Start Time: %s' % datetime.datetime.now())
dfTFIDClean = pd.DataFrame()
for i in dfTFID.index:
    dfTFIDClean.loc[i, TAG_STRING_COLUMN_NAME] = dfTFID.iloc[i, 0].replace(chr(0), '')
    dfTFIDClean.loc[i, FILE_TYPE_COLUMN_NAME] = dfTFID.iloc[i, 1]
print('End Time: %s' % datetime.datetime.now())
dfTFIDClean.shape
In this step, there is no real need to read the CSV file back and form the data frame; this is done only to prove that the process works, as so much time was wasted in finding this nuance.
dfTFIDClean = dfTFIDClean.dropna()
dfTFIDClean[FILE_TYPE_COLUMN_NAME]= pd.to_numeric(dfTFIDClean[FILE_TYPE_COLUMN_NAME], downcast='integer')
dfTFIDClean.to_csv('./JPEGCleanDataSet.csv')
dfTFIDClean = pd.read_csv('./JPEGCleanDataSet.csv')
dfTFIDClean[FILE_TYPE_COLUMN_NAME].value_counts()
Form the TF-IDF¶
TF-IDF is the acronym for Term Frequency-Inverse Document Frequency. TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. In this case, each JPEG file is a document in the collection, and each tag value (or part of a tag value) in a JPEG file is a term.
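As a reference, with the default settings of scikit-learn's TfidfVectorizer (smooth_idf=True, followed by L2 normalisation of each document vector), the weight of a term t in a document d is roughly
$$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\left(\ln\frac{1+n}{1+\mathrm{df}(t)} + 1\right)$$
where tf(t, d) is the raw count of t in d, n is the number of documents and df(t) is the number of documents containing t.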
Notice that after forming the TF-IDF, there are 64,774 features in our data set.
The TF-IDF is stored in an array, with each element of the array corresponding to a JPEG file in the same order as the data frame was formed. This forms the set of independent variables for our machine learning model and is stored in the variable X. The dependent variable is the FileType recorded while loading the JPEG files (remember that the clean JPEG files were stored in a separate directory, read as a set and marked as benign; similarly, all the JPEG files with malware were stored in a separate directory, read as a set and marked as malware).
print('Start Time: %s' % datetime.datetime.now())
tfidfconverter = TfidfVectorizer(max_features=90000, min_df=1, max_df=0.7)
X = tfidfconverter.fit_transform(dfTFIDClean.TagString).toarray()
print('End Time: %s' % datetime.datetime.now())
y = dfTFIDClean.FileType  # use the cleaned data frame so that X and y stay aligned
y.value_counts()
Handle the Imbalanced Data Sets¶
Notice that we have 4,582 clean JPEG files and 400 JPEG files with malware. Our data set is therefore imbalanced: there are more than ten times as many clean JPEG files as JPEG files with malware. Imbalanced data sets prevent algorithms like Random Forest and Decision Tree (and most other algorithms) from functioning properly. We can understand this specifically in the case of the Random Forest algorithm: the data set is sampled at random, both by rows and by features, and each sample is analysed by a separate decision tree. When one class has far fewer data points, it is possible that one or more of the row samples will contain data of only one class, so those decision trees never see both types of data and the classification suffers.
To resolve the problem of the imbalanced data set, we oversample it. Oversampling means increasing the number of data points in the minority class until it is similar in size to the majority class. For oversampling, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm is used.
Notice that after applying SMOTE, we have 9,164 data points in our data set and there are 4,582 data points for clean JPEG Files and 4,582 data points for JPEG Files with malware.
# Transform the dataset
print('Start Time: %s' % datetime.datetime.now())
oversample = ib.over_sampling.SMOTE()
X, y = oversample.fit_resample(X, y)
print('End Time: %s' % datetime.datetime.now())
X.shape
y.value_counts()
Form the Decision Tree Model¶
Now that we have the data in the form we desire, we form the Model using the Decision Tree algorithm.
Notice that the model achieves a cross-validation score (mean ROC AUC) of about 0.995.
print('Start Time: %s' % datetime.datetime.now())
# Create the Decision Tree Model
modelDT = DecisionTreeClassifier()
# Evaluate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(modelDT, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.5f' % scores.mean())
print('End Time: %s' % datetime.datetime.now())
In this step, the Decision Tree model is fitted and saved to a file. When the software is run to evaluate JPEG files, we can simply load the model from this file instead of regenerating it every time. A small loading sketch follows the next cell.
print('Start Time: %s' % datetime.datetime.now())
modelDT.fit(X, y)
pickle.dump(modelDT, open("./DTModelMalwareDetection", 'wb'))
print('End Time: %s' % datetime.datetime.now())
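A minimal sketch of loading the saved model later (for example in the API process) instead of retraining it:
with open("./DTModelMalwareDetection", "rb") as f:
    loadedModelDT = pickle.load(f)
# loadedModelDT.predict(...) can then be applied to a TF-IDF matrix produced
# by the same tfidfconverter that was fitted during training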
Now that the model is ready, we evaluate it on the training data. Notice that only 2 files from the training set were wrongly classified, and no malware file was classified as clean.
y_pred = modelDT.predict(X)
cm = metrics.confusion_matrix(y, y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Benign', 'Malware']);
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
print("\n\nConfusion Classification Report\n")
print(metrics.classification_report(y, y_pred))
Form the Random Forest Model¶
Now we create the Random Forest model. Notice that its cross-validation score (mean ROC AUC) is 1.00. However, when this model is evaluated on the training set below, 2 files are classified incorrectly; even so, the Random Forest model does not classify any malware file as clean.
print('Start Time: %s' % datetime.datetime.now())
# Create the Random Forest Model
modelRF = RandomForestClassifier()
# Evaluate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(modelRF, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.5f' % scores.mean())
print('End Time: %s' % datetime.datetime.now())
print('Start Time: %s' % datetime.datetime.now())
modelRF.fit(X, y)
pickle.dump(modelRF, open("./RFModelMalwareDetection", 'wb'))
print('End Time: %s' % datetime.datetime.now())
y_pred = modelRF.predict(X)
cm = metrics.confusion_matrix(y, y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Benign', 'Malware']);
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
print("\n\nConfusion Classification Report\n")
print(metrics.classification_report(y, y_pred))
Make predictions for test data¶
Now that the models are ready, we test the models using the test data.
First the Decision Tree Model is tested.
To test the models with the test data, we conduct the following steps:
- Read the tags from the clean JPEG files, form a data frame of tags and mark these rows as benign files.
- Read the tags from the JPEG files containing malware, form a data frame of tags and mark these rows as files containing malware.
- Combine the two data frames created in steps 1 and 2 into a single data frame (the two data frames could also have been tested separately).
- Remove CHR(0) from the tags.
- Form the TF-IDF using the TF-IDF converter already created earlier (it is important to use the same converter so that the number of features matches the one used to create the models).
- Once the TF-IDF is ready, use the models to make the predictions.
Notice that 10 files have been classified incorrectly.
# Read the data for the Benign Files
testCleanFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/ValidationSet-Clean/*.j*")
print("\nBENIGN FILES\n------------\nValid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
# Form the data frame
testCleanDF = fillDataInDataFrame(testCleanFeatures, 'Extracting Clean Files Features for Test Data', BENIGN_FILE)
# Read the data for the file with Malware
testMalwareFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/ValidationSet-Malicious/*")
print("\nMALWARE FILES\n-------------\nValid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
# Form the data frame
testMalwareDF = fillDataInDataFrame(testMalwareFeatures, 'Extracting Clean Files Features for Test Data', FILE_WITH_MALWARE)
# Combine the 2 data frames formed above
testDF = pd.concat([testCleanDF, testMalwareDF], ignore_index=True)
testDF['FileType'] = pd.to_numeric(testDF['FileType'], downcast='integer')
# Create data for TFID
testTFIDDF = testDF[[TAG_STRING_COLUMN_NAME, FILE_TYPE_COLUMN_NAME]].copy()
# Clean the data
testTFIDClean = pd.DataFrame()
for i in testTFIDDF.index:
    testTFIDClean.loc[i, TAG_STRING_COLUMN_NAME] = testTFIDDF.iloc[i, 0].replace(chr(0), '')
    testTFIDClean.loc[i, FILE_TYPE_COLUMN_NAME] = testTFIDDF.iloc[i, 1]
# Drop NULL Values
testTFIDClean = testTFIDClean.dropna()
testTFIDClean[FILE_TYPE_COLUMN_NAME]= pd.to_numeric(testTFIDClean[FILE_TYPE_COLUMN_NAME], downcast='integer')
testTFIDClean[FILE_TYPE_COLUMN_NAME].value_counts()
# Create the TFID
X = tfidfconverter.transform(testTFIDClean.TagString).toarray()
y = testTFIDClean.FileType
# Make the prediction as per the Decision Tree Model and check the result
y_pred = modelDT.predict(X)
cm = metrics.confusion_matrix(y, y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Benign', 'Malware']);
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
Lastly, we test the Random Forest model.
Notice that this model also classifies 10 files incorrectly; however, this time one malware file is classified as clean.
# Make the prediction as per the Random Forest Model and check the result
y_pred = modelRF.predict(X)
print("\nPrediction as per Random Forest Model\n")
cm = metrics.confusion_matrix(y, y_pred)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);
# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels');
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Benign', 'Malware']);
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
Conclusion¶
The models have been tested on JPEG files that were never seen while the models were being developed. As the models rarely classify a malware file as clean, they can be reasonably relied upon.
Weakness of the model¶
Though TF-IDF gives importance to the significant terms in a document, certain terms in a JPEG file containing malware need to be given even more emphasis. For example, if a JPEG file contains malware, one of the tags is bound to contain some form of code to run an executable, and such code often involves a call such as eval(). If the model could be enhanced so that such patterns are given more importance, the model could perform better. One possible way of doing this is sketched below.
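A hypothetical sketch (not part of the project) of one way to emphasise such patterns: append hand-crafted indicator columns, for example whether the concatenated tag string contains "eval(", to the TF-IDF features before training. The pattern list and the helper name are illustrative only:
import numpy as np

SUSPICIOUS_PATTERNS = ["eval(", "exec(", "<script"]    # illustrative list only

def addPatternFeatures(tfidfMatrix, tagStrings):
    # One extra binary column per pattern, marking its presence in each file
    indicators = np.array([[1 if p in s else 0 for p in SUSPICIOUS_PATTERNS]
                           for s in tagStrings])
    return np.hstack([tfidfMatrix, indicators])

# Usage on the TF-IDF matrix before SMOTE oversampling:
# X_extended = addPatternFeatures(X, dfTFIDClean.TagString)
# modelDT.fit(X_extended, y)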