Detecting Malware in JPEG Files


Concept of the Product

Cybersecurity is of paramount importance for any government or enterprise in the modern world. Both are under constant cyber attack from all kinds of adversaries, and it is often said that the next war will be fought in cyberspace. It is therefore essential to develop products and strategies that can deal with the threat of cyber attacks.

One common vehicle for cyber attacks is malware: malicious computer programs embedded in software of regular use. For example, an email attachment may carry malware that infects the recipient's machine and then spreads across an enterprise.

Images are among the most ubiquitous files exchanged across the globe, and the advent of social media has only boosted this exchange. Images move not only between computers but also between devices such as mobile phones, and people tend to spread them across wide circles. Malware developers therefore target images as carriers for their malicious code, because an infected image gets a lot of traction.

JPEG is one of the most popular image file formats, so infecting JPEG files is especially attractive to malware developers. So far, antivirus software has relied on studying signatures inside JPEG files to detect the presence of malware. However, this technique is expensive, as large teams have to be deployed to constantly research new signatures. This project aims to use supervised machine learning so that computers can detect malware in JPEG files without having to be constantly fed knowledge of new signatures.

Results obtained from the Project

During training, from a training set of 4,582 clean JPEG files, 4,580 were classified as clean and 2 were classified as JPEG files containing malware. However, from a training set of 400 JPEG files containing malware, none were classified as clean. This was a significant result, because the most unwanted outcome for this project is a missed detection: a malware file being classified as clean is far more dangerous than a clean file being classified as malware.

During testing, from a test set of 117 clean JPEG files, 107 were classified as clean and 10 were classified as JPEG files containing malware. However, from a test set of 22 JPEG files containing malware, none were classified as clean.

The training accuracy is 99.95% and the testing accuracy is 92.80%. From this observation, it is reasonable to conclude that the developed model does not suffer from severe overfitting.

By training the model with around 5,000 files, it is able to detect malware in JPEG files with an accuracy of about 93%. The model has some notable weaknesses, which could be reduced by providing a much larger data set (of about a million files or more). The project demonstrates that this machine-learning strategy for malware detection can save a lot of money and effort. The concepts used in this project can be extended to other file types, thus broadening the utility of the product.

Purpose of this Product

The product is intended to be used as follows:

  1. The product will expose an API which analyses whether a JPEG file is clean or contains malware.
  2. When one JPEG file or a set of JPEG files is given as input to the API, the API will report back which of the files are clean and which contain malware.
  3. Based on the result obtained from the API, appropriate actions can be planned for dealing with the files.

Let us take a use case for this product. Suppose a company is in the business of acquiring photographs from around the world; 123RF, Adobe, Instagram and Flickr are a few examples of companies of this nature. If such a company sets up a directory on its server where all the photographs uploaded by users are stored, a file watcher can be set up on that directory to pick up any incoming JPEG file. The watcher can submit the picked-up file to a processor which calls the API from this product to analyse the JPEG file. Based on the analysis, further action such as quarantining the file, deleting it, or accepting it onto the platform can be taken by software.
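For illustration only, below is a minimal sketch of such a file watcher. It polls the upload directory and hands each new file to analyze_jpeg(), a hypothetical stand-in for the product's API; the function name, directory layout and polling interval are assumptions and not part of this project.

import glob
import time

def analyze_jpeg(path):
    # Hypothetical call into the product's API: return True if the file is judged clean.
    raise NotImplementedError

def watch_directory(upload_dir, poll_seconds=5):
    seen = set()
    while True:
        for path in glob.glob(upload_dir + "/*.jp*g"):
            if path in seen:
                continue
            seen.add(path)
            if analyze_jpeg(path):
                print("ACCEPT", path)        # hand the file over to the platform
            else:
                print("QUARANTINE", path)    # set the file aside for further action
        time.sleep(poll_seconds)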

The main advantage of the product is that it eliminates the need for a dedicated team to constantly research JPEG files looking for malware signatures. The models can be refreshed at routine intervals so that they keep learning from newly observed files.

Important Note

This project only demonstrates the Model. The API and the associated software are not part of this project though they are part of the larger product development.

About JPEG Files

JPEG files are compressed files used to store images. A JPEG file can be identified by the marker 0xFFD8 (Start of Image) at the beginning of the file.
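As a quick illustration (a one-off check, not part of the project code), the marker can be verified by reading the first two bytes of the file:

def has_jpeg_soi_marker(path):
    # A JPEG stream begins with the Start of Image marker bytes 0xFF 0xD8.
    with open(path, "rb") as f:
        return f.read(2) == b"\xff\xd8"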

The information in a JPEG file can be classified into 2 types – image data and EXIF tags. EXIF tags were added on top of the basic JPEG format by the Exif standard in the mid-1990s.

The image data of a JPEG file has the following segments:

  1. Header
  2. 2 Quantization Tables
  3. Frame Information

The Header segment contains the following data:

  1. Identifier ("JFIF" for JFIF-encoded JPEG files, etc.)
  2. Version
  3. Units
  4. Density
  5. Thumbnail

Quantization Table 1 contains the data regarding the luminance of the image; it is an 8 × 8 table.
Quantization Table 2 contains the data regarding the chrominance of the image; it is also an 8 × 8 table.

The Frame Information is a series of Huffman-encoded tables containing the bit pattern of the image.

The important thing to note is that if any of the above values is tampered with, the JPEG file will not render properly. So malware developers do not tamper with this part of a JPEG file to introduce their code.

EXIF Tags

Exif tags provide additional information about a JPEG file. They can be altered programmatically to alter the nature of the stored image. For example, by altering the Exif tag BrightnessValue, the brightness of the image can be changed; similarly, by altering the tags ExifImageHeight and ExifImageWidth, the size of the image can be changed.

Exif tags give photograph editors a facility to make enhancements or alterations to an image stored as a JPEG file. Because these values can be altered, malware developers alter these tags to introduce their spurious code. One way to clean an infected JPEG file is to convert it to a BMP file: BMP files contain only the image information and do not support Exif tags, so the conversion strips the Exif tags and thus cleans the file of the embedded malware.
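A minimal sketch of this cleaning step using Pillow, the same library used later for tag extraction; the file names are placeholders:

from PIL import Image

def strip_exif_via_bmp(jpeg_path, bmp_path):
    # Re-encoding as BMP keeps only the pixel data; the Exif tags are discarded.
    Image.open(jpeg_path).save(bmp_path, "BMP")

strip_exif_via_bmp("suspect.jpg", "cleaned.bmp")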

A JPEG file may or may not contain Exif tags. In particular, JPEG files created before the Exif standard came into use do not contain Exif tags.

Conclusion

If a JPEG file does not contain Exif tags, it can reasonably be considered a clean file.

So, for malware analysis, we need to concentrate only on JPEG files which contain Exif tags.

About the data

The initial data for this project was obtained from McAfee Labs in India. McAfee provided 3,124 Clean JPEG Files and 278 JPEG Files with Malware.

As the total number of files was small, 2,763 additional JPEG files were collected from my own library of photographs and from a friend's collection. These files were transferred using Google Drive, which served as an extra check that they were actually clean, since Google Drive discards any JPEG file containing malware.

As the number of JPEG files containing malware was small, C3i Labs at IIT Kanpur was approached. C3i Labs provided 160 JPEG files containing malware.

So, the total data set contained 5,887 clean JPEG files and 438 JPEG files with malware.

Training Set

Out of the total data set, 5,583 clean JPEG files and 304 JPEG files with malware were used for the training set.

Test Set

Out of the total data set, 304 clean JPEG files and 23 JPEG files with malware were used for the test set.

Strategies tried to arrive at a model

The initial part of the project involved finding reliable libraries to read JPEG files. Once a library was found and JPEG files were being read, two significant discoveries were made: not all the files in the obtained data set were actually JPEG files, and many of the JPEG files did not contain EXIF tags. Both these factors reduced the data set. The discovery also led to the following conclusions:

  1. If a file is not a valid JPEG file, the software simply rejects it as out of scope for this product.
  2. If a file is a valid JPEG file but contains no EXIF tags, it is classified as benign. (In almost all cases it is not possible to transmit malware in a JPEG file without editing the EXIF tags.)
  3. Some of the files labelled as containing malware turned out not to be JPEG files at all. This is possible because a file can get damaged while it is being manipulated. JPEG files are identified by the JPEG signature in the file header.
  4. The surprise was that some of the files labelled as containing malware contained no EXIF tags at all. This is the unexplained set of this project.
  5. Not all JPEG files contain the same set of tags.

Strategy 1: Using the length of the tags as features

Once the tags were extracted from the JPEG files, a unique list of tags found across all the files was created. A data frame was formed with each of these tags as a column.

The length of each tag in every JPEG file was determined. If a tag was not present in a JPEG file, the corresponding column in the data frame was assigned a value of zero. The result was a data frame containing only numbers, which could be used for machine learning.

Using this data frame, a Logistic Regression model was developed. The accuracy obtained through this model was about 56%.

Using the same data frame, a random forest model was developed. This model performed slightly better at an accuracy of about 58%.

Then, using this data frame, an artificial neural network was tried using the TensorFlow API. This model performed at an accuracy of about 72%.
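For reference, here is a minimal sketch of how the Strategy 1 features might be assembled, assuming per-file tag dictionaries like those returned by extractTagsFromADirectory() later in this document; the split and model settings are simplified and only indicative:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def tag_length_frame(fileFeatures):
    # One row per file, one column per tag; the value is the length of the tag's value.
    rows = [{str(k): len(str(v)) for k, v in record.items()} for record in fileFeatures]
    # Tags missing from a file become zero; the file name column is not a feature.
    return pd.DataFrame(rows).fillna(0).drop(columns=["FileName"], errors="ignore")

# X_len = tag_length_frame(benignFileFeatures + malwareFileFeatures)
# y_len = [0] * len(benignFileFeatures) + [1] * len(malwareFileFeatures)
# X_tr, X_te, y_tr, y_te = train_test_split(X_len, y_len, stratify=y_len, random_state=1)
# print(LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))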

Strategy 2: Forming TF-IDF

Following the guidance of the Mentor to use TF-IDF over the tags, a string was formed by concatenating all the tag values in each JPEG file. A TF-IDF matrix was formed from these strings. After forming the TF-IDF, a model was trained using the Decision Tree algorithm. This model gave an accuracy of 99.5% during training.

Encouraged by this result, a Random Forest model was created using the same TF-IDF. The results were similar to those of the Decision Tree model.

This document only contains the model formed using strategy 2. The model formed using strategy 1 has been discarded.

Building and Testing the model

The rest of the document states all the programming steps taken to build and test the model.

Listed below are the libraries used.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt     
import seaborn as sns
import os.path
import datetime

# For Progress display
from tqdm import tqdm
import time

# For JPEG File Feature Extraction
from PIL import Image
from PIL.ExifTags import TAGS

# For File Handling
import glob

# For SMOTE
import imblearn as ib

# For building the Term Frequency
from sklearn.feature_extraction.text import TfidfVectorizer

# For building the model
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# For computing the metrics
from sklearn import metrics

# To save and load the models
import pickle

Constants

Constants were used to avoid hard-coding values. This also gave the advantage that an indicator could be changed in one place to affect the complete program.

In [2]:
# Constants
FILE_NAME_COLUMN_NAME = 'FileName'
FILE_TYPE_COLUMN_NAME = 'FileType'
TAG_STRING_COLUMN_NAME = 'TagString'
NUMERIC_COLUMN_IDENTIFIER = 'AAA'
BENIGN_FILE = 0
FILE_WITH_MALWARE = 1

Function used to extract EXIF Tags from JPEG Files

The following 3 functions were used to extract the EXIF Tags from the JPEG Files.

extractTagsFromADirectory

This function takes a glob pattern covering a single file or a set of files and returns the tags extracted from all the JPEG files as a list of dictionaries, one per file. Along with the tags, this function returns the number of valid JPEG files, the number of invalid image files and the number of JPEG files which contained no tags.

This function calls the functions JPEGFileFeatureExtractorToDictionary() and isImageFile().

JPEGFileFeatureExtractorToDictionary

This function takes one JPEG file as input and returns all the EXIF Tags in the JPEG file in a Dictionary as output.

isImageFile

This function takes a file as input and returns TRUE if the file is a valid image file (one that PIL can open and verify) and FALSE otherwise.

In [3]:
def extractTagsFromADirectory(inputDirectory):
    # Declare Counters
    numberOfValidFiles = 0
    numberOfInvalidFiles = 0
    numberOfFilesWithoutTags = 0

    # Create an Empty List to hold all the features of all the files
    returnValue = []
    
    n = 0
    
    # Loop through all the files in the Input Directory
    for file in glob.glob(inputDirectory):
        # Create an empty Dictionary
        oneFileFeatures = {}
        
        try:
            # Read the file and extract the features
            fileFeatures = JPEGFileFeatureExtractorToDictionary(file)

            # If the File had some features, then create an entry for the file
            if len(fileFeatures.keys()) > 0:
                # Write the File Name
                oneFileFeatures[FILE_NAME_COLUMN_NAME] = file

                # Add the File Features to the main Dictionary
                oneFileFeatures.update(fileFeatures)
            
                # Add the entry to the return value
                returnValue.append(oneFileFeatures)
                
                numberOfValidFiles = numberOfValidFiles + 1
            else:
                # Check if the file is a valid Image File
                if isImageFile(file):
                    numberOfFilesWithoutTags = numberOfFilesWithoutTags + 1
                else:
                    numberOfInvalidFiles = numberOfInvalidFiles + 1
 
        except:
            # Check if the file is a valid Image File
            if isImageFile(file):
                numberOfFilesWithoutTags = numberOfFilesWithoutTags + 1
            else:
                numberOfInvalidFiles = numberOfInvalidFiles + 1
            
    return (returnValue, numberOfValidFiles, numberOfInvalidFiles, numberOfFilesWithoutTags)
In [4]:
def JPEGFileFeatureExtractorToDictionary(imageFile):
    #Declare an empty Dictionary
    returnValue = {}
    
    # read the image data using PIL
    image = Image.open(imageFile)
    
    # extract EXIF data
    exifdata = image.getexif()

    # iterating over all EXIF data fields
    for tag_id in exifdata:
        # get the tag name
        tag = TAGS.get(tag_id, tag_id)
        data = exifdata.get(tag_id)
        # decode bytes 
        if isinstance(data, bytes):
            data = data.decode('iso8859-1')

        returnValue[tag] = data
        
    return returnValue
In [5]:
def isImageFile(imageFileName):
    returnValue = True
    
    try:
        img = Image.open('./' + imageFileName) # open the image file
        img.verify() # verify that it is an image
    except (IOError, SyntaxError) as e:
        returnValue = False
        
    return returnValue

Extract Features from Benign Files

In this step, the program extracts all the features from the clean JPEG files. All the clean JPEG files were stored in a separate directory for easy classification. The glob pattern *.j* is used because, while transferring photographs from my library and from my friend's library, some video files came along as well. Even if the video files had been processed, it would not have made any difference, as those files would have been segregated as not JPEG files and thus not considered later.

Notice that there were 816 valid JPEG files in the directory which contained no EXIF tags.

In [6]:
benignFileFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/clean_jpeg/*.j*")
print("Valid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
Valid JPEG Files = 4582
Invalid Image Files = 0
JPEG Files without Tags = 816

Extract Features from Files containing Malware

In this step, the program extracts the features from the JPEG files containing malware. A significant point to note is that there are 3 files marked as containing malware which have no tags. As these files had been marked as containing malware, I tried to upload them to Google Drive. The files got uploaded successfully; however, when I tried to download them, the download failed. So I am unable to conclude whether these are actually JPEG files containing malware.

In [7]:
malwareFileFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/malicious_files/*")
print("Valid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))
Valid JPEG Files = 400
Invalid Image Files = 12
JPEG Files without Tags = 3

Extract the unique list of Keys for all the files

From the last 2 steps, we have 2 lists of dictionaries: one containing all the tags extracted from the clean JPEG files and one containing the tags from the JPEG files with malware. In this step, a list of all the unique tags is prepared. A set is used to collect them, as sets cannot contain duplicate items.

Another significant aspect noticed was that some tags were purely numeric. Since purely numeric column names cause problems in the data frame, these tags were prefixed with a fixed string.

The list of unique tags as extracted is listed below.

In [8]:
featureList = set()

for i in benignFileFeatures:
    for k in i.keys():
        if type(k) == int:
            featureList.add(NUMERIC_COLUMN_IDENTIFIER + str(k))
        else:
            featureList.add(k)

for i in malwareFileFeatures:
    for k in i.keys():
        if type(k) == int:
            featureList.add(NUMERIC_COLUMN_IDENTIFIER + str(k))
        else:
            featureList.add(k)
In [9]:
featureList
Out[9]:
{'AAA0',
 'AAA20736',
 'AAA20752',
 'AAA20753',
 'AAA20754',
 'AAA34864',
 'AAA34866',
 'AAA36873',
 'AAA36880',
 'AAA36881',
 'AAA36882',
 'AAA42080',
 'AAA59932',
 'AAA59933',
 'AAA769',
 'AAA770',
 'AAA771',
 'ApertureValue',
 'Artist',
 'BitsPerSample',
 'BodySerialNumber',
 'BrightnessValue',
 'CFAPattern',
 'CameraOwnerName',
 'ColorSpace',
 'ComponentsConfiguration',
 'CompressedBitsPerPixel',
 'Compression',
 'Contrast',
 'Copyright',
 'CustomRendered',
 'DateTime',
 'DateTimeDigitized',
 'DateTimeOriginal',
 'DeviceSettingDescription',
 'DigitalZoomRatio',
 'DocumentName',
 'ExifImageHeight',
 'ExifImageWidth',
 'ExifInteroperabilityOffset',
 'ExifOffset',
 'ExifVersion',
 'ExposureBiasValue',
 'ExposureIndex',
 'ExposureMode',
 'ExposureProgram',
 'ExposureTime',
 'FNumber',
 'FileName',
 'FileSource',
 'Flash',
 'FlashPixVersion',
 'FocalLength',
 'FocalLengthIn35mmFilm',
 'FocalPlaneResolutionUnit',
 'FocalPlaneXResolution',
 'FocalPlaneYResolution',
 'GPSInfo',
 'GainControl',
 'HostComputer',
 'ISOSpeedRatings',
 'ImageDescription',
 'ImageLength',
 'ImageUniqueID',
 'ImageWidth',
 'JpegIFByteCount',
 'JpegIFOffset',
 'LensMake',
 'LensModel',
 'LensSerialNumber',
 'LensSpecification',
 'LightSource',
 'Make',
 'MakerNote',
 'MaxApertureValue',
 'MeteringMode',
 'Model',
 'OECF',
 'Orientation',
 'PageName',
 'PhotometricInterpretation',
 'PlanarConfiguration',
 'PrimaryChromaticities',
 'PrintImageMatching',
 'ProcessingSoftware',
 'Rating',
 'RatingPercent',
 'ReferenceBlackWhite',
 'RelatedImageLength',
 'RelatedImageWidth',
 'RelatedSoundFile',
 'ResolutionUnit',
 'SamplesPerPixel',
 'Saturation',
 'SceneCaptureType',
 'SceneType',
 'SensingMethod',
 'Sharpness',
 'ShutterSpeedValue',
 'Software',
 'SpatialFrequencyResponse',
 'SpectralSensitivity',
 'SubjectDistance',
 'SubjectDistanceRange',
 'SubjectLocation',
 'SubsecTime',
 'SubsecTimeDigitized',
 'SubsecTimeOriginal',
 'TileLength',
 'TileWidth',
 'TimeZoneOffset',
 'UserComment',
 'WhiteBalance',
 'WhitePoint',
 'XPAuthor',
 'XPComment',
 'XResolution',
 'YCbCrCoefficients',
 'YCbCrPositioning',
 'YCbCrSubSampling',
 'YResolution'}

Create a Data Frame containing all the Keys extracted in the previous step as Columns and fill in the values in the Data Frame

fillDataInDataFrame

Function to create a data frame of all the tags available in the set of JPEG Files.

For Strategy 1, separate columns were created for each tag; each column contained the tag's value if it was present in a JPEG file, otherwise zero. As that strategy was abandoned, a string is now also created containing the concatenation of all the tag values in a JPEG file, separated by spaces. This string of concatenated tag values is inserted as a separate column in the data frame.

Note that the last space in the string of tags was removed. It is not known to me why the presence of this trailing space caused an error while creating the TF-IDF.

In [10]:
def fillDataInDataFrame(featureDictionary, extractionDescription, fileType):
    # Create an Empty DataFrame object
    df = pd.DataFrame()

    for record in tqdm(featureDictionary, desc = extractionDescription):
        # Create an empty Dictionary
        oneRecord = {}

        # Create an empty string
        recordString = ""

        # Initialise all the columns
        for colName in featureList:
            oneRecord[colName] = pd.to_numeric(0, downcast='integer')

        # Loop through all the features in a record
        for k in record.keys():
            # Extract the column name from the record
            if type(k) == int:
                extractedColumnName = NUMERIC_COLUMN_IDENTIFIER + str(k)
            else:
                extractedColumnName = k

            # Extract the value for the key and store in the Dictionary
            if ((type(record[k]) == int) | (type(record[k]) == float)):
                oneRecord[extractedColumnName] = pd.to_numeric(record[k], downcast='integer')
            else:
                oneRecord[extractedColumnName] = record[k]

            # Extract the value for the key and append to the record string
            # This will be used for TFID
            # Add a SPACE between each tag value
            # Do not include File Name
            if k != FILE_NAME_COLUMN_NAME:
                recordString = recordString + str(record[k]) + " "

        # Add the record string as a separate column in the record
        oneRecord[TAG_STRING_COLUMN_NAME] = recordString[:-1]

        # Add column to mark Dependent Column as Benign File
        oneRecord[FILE_TYPE_COLUMN_NAME] = pd.to_numeric(fileType, downcast='integer')

        # Add the Record to the Data Frame
        df = df.append(oneRecord, ignore_index=True)

        time.sleep(0.0001)

    return df

Create data frame of tags

In this step, 2 data frames are created – one containing the tags from clean JPEG Files and one containing tags from JPEG Files containing malware.

In [11]:
benignFileDF = fillDataInDataFrame(benignFileFeatures, 'Extracting Clean Files Features', BENIGN_FILE)
malignantFileDF = fillDataInDataFrame(malwareFileFeatures, 'Extracting Features from Files with Malware', FILE_WITH_MALWARE)
Extracting Clean Files Features: 100%|██████████| 4582/4582 [03:13<00:00, 23.64it/s]
Extracting Features from Files with Malware: 100%|██████████| 400/400 [00:12<00:00, 31.28it/s]

Create a single data frame containing all the tags

In this step, the 2 data frames created in the previous step are concatenated into a single data frame.

The data frame is saved to a CSV file.

In [12]:
df = pd.concat([benignFileDF, malignantFileDF], ignore_index=True)
df['FileType'] = pd.to_numeric(df['FileType'], downcast='integer')
df.to_csv('./JPEGDataSet.csv')

This was another unexplained puzzle in the program; over a week was spent solving it. If the CSV file was read back to form the data frame and that data frame was used for creating the TF-IDF, the TF-IDF had very few features. The problem was solved by eliminating CHR(0) from the tag strings: once CHR(0) was removed before saving the data frame, the TF-IDF got created properly with all the features. When the data frame was saved without eliminating CHR(0), the tags in the saved CSV were getting truncated.

In the next step, the CHR(0) is eliminated from the tags and the data frame is saved to a CSV file. Once the data frame has been saved, all the above steps can be skipped on subsequent runs by starting from the step of reading the CSV file and forming the data frame. The above steps need to be run only when a new set of JPEG files is received, so the program can be altered to skip them if the file JPEGCleanDataSet.csv is present in the current directory, as sketched below.
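A sketch of that short-circuit, assuming the file lives in the current directory under the name used below:

import os.path
import pandas as pd

if os.path.isfile('./JPEGCleanDataSet.csv'):
    # Reuse the features saved on a previous run
    dfTFIDClean = pd.read_csv('./JPEGCleanDataSet.csv')
else:
    # ... run the extraction and CHR(0) cleaning steps above, then save:
    # dfTFIDClean.to_csv('./JPEGCleanDataSet.csv')
    pass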

So far, we have extracted the features from the JPEG files, both from those containing malware and from the clean ones. Now these features need to be organised so that models can be trained to classify JPEG files as clean or containing malware.

Notice that we have 4,582 clean JPEG Files and 400 JPEG Files with Malware in our data frame.

In [13]:
df[FILE_TYPE_COLUMN_NAME].value_counts()
Out[13]:
0    4582
1     400
Name: FileType, dtype: int64
In [14]:
dfTFID = df[[TAG_STRING_COLUMN_NAME, FILE_TYPE_COLUMN_NAME]].copy()
In [15]:
print('Start Time: %s' % datetime.datetime.now())

dfTFIDClean = pd.DataFrame()
for i in dfTFID.index:
    dfTFIDClean.loc[i, TAG_STRING_COLUMN_NAME] = dfTFID.iloc[i, 0].replace(chr(0), '')
    dfTFIDClean.loc[i, FILE_TYPE_COLUMN_NAME] = dfTFID.iloc[i, 1]
    
print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 03:39:35.366684
End Time: 2021-06-29 03:39:41.592263
In [16]:
dfTFIDClean.shape
Out[16]:
(4982, 2)

Strictly speaking, there is no need at this point to read the CSV file back and form the data frame from it; it is done here just to prove that the process works, given how much time was wasted in finding this nuance.

In [17]:
dfTFIDClean = dfTFIDClean.dropna()
dfTFIDClean[FILE_TYPE_COLUMN_NAME]= pd.to_numeric(dfTFIDClean[FILE_TYPE_COLUMN_NAME], downcast='integer')
dfTFIDClean.to_csv('./JPEGCleanDataSet.csv')

dfTFIDClean = pd.read_csv('./JPEGCleanDataSet.csv')
In [18]:
dfTFIDClean[FILE_TYPE_COLUMN_NAME].value_counts()
Out[18]:
0    4582
1     400
Name: FileType, dtype: int64

Form the TF-IDF

TF-IDF is the acronym for Term Frequency-Inverse Document Frequency. TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. For this case, each JPEG file is a document in the collection, and each tag, or part of a tag, in a JPEG file is a term.

Notice that after forming the TF-IDF, there are 64,774 features in our data set.

The TF-IDF is stored in an array with each row corresponding to a JPEG file, in the same order as the data frame. This forms the set of independent variables for our machine learning model and is stored in the variable X. The dependent variable is the FileType recorded while loading the JPEG files (recall that the clean JPEG files were stored in a separate directory, read as a set and marked as benign; similarly, all the JPEG files with malware were stored in a separate directory, read as a set and marked as malware).
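To make the idea concrete, here is a tiny toy example, not part of the project pipeline and with invented tag strings, showing that a token appearing in only one document gets a higher TF-IDF weight than a token appearing in most documents:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Canon 72 72 2019:05:01",        # invented clean-looking tag string
    "Nikon 300 300 2020:11:12",      # invented clean-looking tag string
    "Canon 72 72 eval(payload)",     # invented tag string carrying a suspicious token
]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)
row = 2  # the third document
print(weights[row, vec.vocabulary_["eval"]])    # rare token: relatively high weight
print(weights[row, vec.vocabulary_["canon"]])   # common token: lower weight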

In [19]:
print('Start Time: %s' % datetime.datetime.now())

tfidfconverter = TfidfVectorizer(max_features=90000, min_df=1, max_df=0.7)
X = tfidfconverter.fit_transform(dfTFIDClean.TagString).toarray()
    
print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 03:39:41.882733
End Time: 2021-06-29 03:39:42.874750
In [20]:
y = dfTFID.FileType
y.value_counts()
Out[20]:
0    4582
1     400
Name: FileType, dtype: int64

Handle the Imbalanced Data Sets

Notice that we have 4,582 clean JPEG files and only 400 JPEG files with malware, so our data set is imbalanced: there are more than ten times as many clean files as files with malware. Imbalanced data sets cause algorithms like Random Forest and Decision Tree (and most other algorithms) to perform poorly. This is easy to see in the case of the Random Forest algorithm: each tree is trained on a random sample of the rows and considers a random subset of the columns, and when one class has far fewer data points, some of those row samples may contain data of only one class. Those trees never see both types of data, so their classifications suffer.

To resolve the problem of the imbalanced data set, we oversample. Oversampling means increasing the number of data points in the minority class until it is comparable to the number in the majority class. For oversampling, the SMOTE (Synthetic Minority Over-sampling Technique) algorithm is used.

Notice that after applying SMOTE, we have 9,164 data points in our data set and there are 4,582 data points for clean JPEG Files and 4,582 data points for JPEG Files with malware.

In [21]:
# Transform the dataset
print('Start Time: %s' % datetime.datetime.now())

oversample = ib.over_sampling.SMOTE()
X, y = oversample.fit_resample(X, y)

print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 03:39:42.900536
End Time: 2021-06-29 03:39:58.995813
In [22]:
X.shape
Out[22]:
(9164, 64774)
In [23]:
y.value_counts()
Out[23]:
1    4582
0    4582
Name: FileType, dtype: int64

Form the Decision Tree Model

Now that we have the data in the form we desire, we form the Model using the Decision Tree algorithm.

Notice that the model achieves a mean ROC AUC of about 0.993 when we generate the cross-validation score.

In [24]:
print('Start Time: %s' % datetime.datetime.now())

# Create the Decision Tree Model
modelDT = DecisionTreeClassifier()

# Evaluate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(modelDT, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.5f' % scores.mean())

print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 03:39:59.154544
Mean ROC AUC: 0.99347
End Time: 2021-06-29 04:11:10.642740

In this step, the Decision Tree model is fitted on the full training data and saved to a file. When the software is run to evaluate JPEG files, the model can simply be loaded from this file instead of being regenerated every time.

In [25]:
print('Start Time: %s' % datetime.datetime.now())

modelDT.fit(X, y)
pickle.dump(modelDT, open("./DTModelMalwareDetection", 'wb'))

print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 04:11:10.679177
End Time: 2021-06-29 04:18:10.996525
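For completeness, a short sketch of how the evaluating software would load the saved model back, using the same file name as above:

import pickle

with open("./DTModelMalwareDetection", "rb") as f:
    loadedModelDT = pickle.load(f)

# loadedModelDT.predict(...) can now be called without retraining the model.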

Now that the model is ready, we evaluate it on the training data. Notice that only 2 files from the training set are wrongly classified, and none of the errors is a missed detection: no malware file is classified as clean.

In [26]:
y_pred = modelDT.predict(X)

cm = metrics.confusion_matrix(y, y_pred)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Benign', 'Malware']); 
ax.yaxis.set_ticklabels(['Benign', 'Malware']);

print("\n\nConfusion Classification Report\n")
print(metrics.classification_report(y, y_pred))
Confusion Classification Report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4582
           1       1.00      1.00      1.00      4582

    accuracy                           1.00      9164
   macro avg       1.00      1.00      1.00      9164
weighted avg       1.00      1.00      1.00      9164

Confusion Matrix

Form the Random Forest Model

Now we create the Random Forest model. Notice that its cross-validation ROC AUC is 0.99995, essentially perfect. However, notice later that when we evaluate this model on the training set, there are 2 files which are classified incorrectly. Even so, the Random Forest model does not let any malware file pass as clean.

In [27]:
print('Start Time: %s' % datetime.datetime.now())

# Create the Random Forest Model
modelRF = RandomForestClassifier()

# Evaluate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(modelRF, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.5f' % scores.mean())

print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 04:18:13.060353
Mean ROC AUC: 0.99995
End Time: 2021-06-29 04:40:54.765297
In [28]:
print('Start Time: %s' % datetime.datetime.now())

modelRF.fit(X, y)
pickle.dump(modelRF, open("./RFModelMalwareDetection", 'wb'))

print('End Time: %s' % datetime.datetime.now())
Start Time: 2021-06-29 04:40:54.800978
End Time: 2021-06-29 04:42:20.807526
In [29]:
y_pred = modelRF.predict(X)

cm = metrics.confusion_matrix(y, y_pred)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Benign', 'Malware']); 
ax.yaxis.set_ticklabels(['Benign', 'Malware']);

print("\n\nConfusion Classification Report\n")
print(metrics.classification_report(y, y_pred))
Confusion Classification Report

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4582
           1       1.00      1.00      1.00      4582

    accuracy                           1.00      9164
   macro avg       1.00      1.00      1.00      9164
weighted avg       1.00      1.00      1.00      9164

Confusion Matrix

Make predictions for test data

Now that the models are ready, we test the models using the test data.

First the Decision Tree Model is tested.

To test the models with the test data, we need to conduct the following steps:

  1. Read the Tags from the clean JPEG Files and form the data frame of tags and mark these rows as Benign files.
  2. Read the Tags from the JPEG Files containing malware and form the data frame of tags and mark these rows as JPEG files containing malware.
  3. Combine the 2 dataframes created in steps 1 and 2 into a single data frame (We could have tested the 2 data frames created in steps 1 and 2 separately as well).
  4. Remove the CHR(0) from the tags.
  5. Then we form the TF-IDF using the TF-IDF converter already created before (It is important to note that we should use the same TF-IDF converter as the number of features in the TF-IDF should be the same as that was used to create the models).
  6. Once the TF-IDF is ready, we can use the model to make the predictions.

Notice that 10 files have been classified incorrectly.

In [30]:
# Read the data for the Benign Files
testCleanFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/ValidationSet-Clean/*.j*")
print("\nBENIGN FILES\n------------\nValid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))

# Form the data frame
testCleanDF = fillDataInDataFrame(testCleanFeatures, 'Extracting Clean Files Features for Test Data', BENIGN_FILE)

# Read the data for the file with Malware
testMalwareFeatures, numValidFiles, numInvalidFiles, numFilesWithoutTags = extractTagsFromADirectory("./Data/ValidationSet-Malicious/*")
print("\nMALWARE FILES\n-------------\nValid JPEG Files = %d\nInvalid Image Files = %d\nJPEG Files without Tags = %d" % (numValidFiles, numInvalidFiles, numFilesWithoutTags))

# Form the data frame
testMalwareDF = fillDataInDataFrame(testMalwareFeatures, 'Extracting Clean Files Features for Test Data', FILE_WITH_MALWARE)

# Combine the 2 data frames formed above
testDF = pd.concat([testCleanDF, testMalwareDF], ignore_index=True)
testDF['FileType'] = pd.to_numeric(testDF['FileType'], downcast='integer')

# Create data for TFID
testTFIDDF = testDF[[TAG_STRING_COLUMN_NAME, FILE_TYPE_COLUMN_NAME]].copy()

# Clean the data
testTFIDClean = pd.DataFrame()
for i in testTFIDDF.index:
    testTFIDClean.loc[i, TAG_STRING_COLUMN_NAME] = testTFIDDF.iloc[i, 0].replace(chr(0), '')
    testTFIDClean.loc[i, FILE_TYPE_COLUMN_NAME] = testTFIDDF.iloc[i, 1]
    
# Drop NULL Values
testTFIDClean = testTFIDClean.dropna()
testTFIDClean[FILE_TYPE_COLUMN_NAME]= pd.to_numeric(testTFIDClean[FILE_TYPE_COLUMN_NAME], downcast='integer')
testTFIDClean[FILE_TYPE_COLUMN_NAME].value_counts()

# Create the TFID
X = tfidfconverter.transform(testTFIDClean.TagString).toarray()
y = testTFIDClean.FileType

# Make the prediction as per the Decision Tree Model and check the result
y_pred = modelDT.predict(X)

cm = metrics.confusion_matrix(y, y_pred)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Benign', 'Malware']); 
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
Extracting Clean Files Features for Test Data:   4%|▍         | 5/117 [00:00<00:02, 45.18it/s]
BENIGN FILES
------------
Valid JPEG Files = 117
Invalid Image Files = 0
JPEG Files without Tags = 175
Extracting Clean Files Features for Test Data: 100%|██████████| 117/117 [00:02<00:00, 45.16it/s]
Extracting Clean Files Features for Test Data:  23%|██▎       | 5/22 [00:00<00:00, 49.28it/s]
MALWARE FILES
-------------
Valid JPEG Files = 22
Invalid Image Files = 0
JPEG Files without Tags = 0
Extracting Clean Files Features for Test Data: 100%|██████████| 22/22 [00:00<00:00, 43.91it/s]
Confusion Matrix

Lastly, we test the Random Forest Model.

Notice that this model also classifies 10 files incorrectly. However, one of those misclassifications is now a missed detection: one malware file has been classified as clean.

In [31]:
# Make the prediction as per the Random Forest Model and check the result
y_pred = modelRF.predict(X)

print("\nPrediction as per Random Forest Model\n")
cm = metrics.confusion_matrix(y, y_pred)

ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax);

# labels, title and ticks
ax.set_xlabel('Predicted labels');
ax.set_ylabel('True labels'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['Benign', 'Malware']); 
ax.yaxis.set_ticklabels(['Benign', 'Malware']);
Prediction as per Random Forest Model

Confusion Matrix

Conclusion

The models have been tested on JPEG files that were never seen while the models were being developed. As the models generally do not let a malware file pass as clean, they can be reasonably relied upon.

Weakness of the model

Though TF-IDF gives importance to the significant terms in a document, certain terms in a JPEG file containing malware deserve even more emphasis. For example, if a JPEG file contains malware, one of the tags is likely to carry some form of code to run an executable, often including a call such as eval(). If the model could be enhanced so that such patterns are given more importance, it could perform better.
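One possible way to give such patterns extra weight (an illustrative sketch only, not part of the delivered model; the token list is an assumption) is to append hand-crafted indicator columns to the TF-IDF matrix before training:

import numpy as np

SUSPICIOUS_TOKENS = ["eval(", "exec(", "<script"]   # assumed indicators, not exhaustive

def suspicious_flags(tagStrings):
    # One binary column per suspicious token: 1 if the token occurs in the tag string.
    return np.array([[int(tok in s) for tok in SUSPICIOUS_TOKENS] for s in tagStrings])

# X is the TF-IDF array built earlier; dfTFIDClean.TagString holds the tag strings.
# X_augmented = np.hstack([X, suspicious_flags(dfTFIDClean.TagString)])
# The augmented matrix can then be fed to the same Decision Tree / Random Forest models.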

 

 
