Authors: Eva Marková, Tomáš Bajtoš, Pavol Sokol, Terézia Mézešová, Patrik Pekarčík
Abstract
Owning and using an email address is an inherent part of everyday life and work on a computer. This paper analyzes existing approaches to classifying malicious emails into four subcategories: spam, scam, phishing, and emails that contain malware. First, we prepared a labelled data set. Next, we extracted 34 features from the emails in the data set and identified a more effective subset of these features. Finally, we evaluated four supervised machine learning methods (Random Forest, Decision Tree, Support Vector Machines, and k-Nearest Neighbors). According to our results, Random Forest is the most suitable approach for email classification when accuracy is the most important criterion.
Introduction
It is well known that humans are the weakest link in the security of organizations. Attackers exploit current world events and manipulate people’s fears with social engineering techniques because, for many of them, this attack vector is easier to execute than more technical exploits. The majority of security incidents nowadays involve some social engineering component, and organizations are trying to improve the awareness of their employees. However, because phishing emails are so diverse, identifying them is difficult even for security-aware individuals. Widespread phishing emails are a growing threat to organizations now more than ever, as more of them adapt to epidemiological restrictions and digital-first becomes the new normal [13]. Spam and phishing campaigns had many current events to tie into in 2020, especially in the first quarter [16]. Through malicious emails, which may contain fraudulent links, attackers often obtain credentials to various services or even bank details. Such emails may also prompt users to install malicious software that can monitor their activity on the computer. Many email service providers let users report an email as junk, but users do not have a universal option to verify whether an email is legitimate.
This paper builds on previous research [9], in which the authors described how to implement an automated service that classifies emails received by users and automatically replies with a response tailored to the resulting classification. Today, it is not enough to distinguish only between legitimate and illegitimate emails. Even as users, we can observe various types of fraudulent emails. Each type poses a different risk to the organization, and security teams need to adjust their approach when reacting to security incidents stemming from such emails.
We considered four main categories of illegitimate email: spam, scam, phishing, and emails that contain malware. These labels are taken directly from those used by the Office 365 Quarantine functionality. Spam is any irrelevant or unsolicited message, mostly advertising, that is sent over the Internet. A scam is a popular form of fraud in which an attacker tries to convince a victim to pay a certain amount of money in exchange for a promised reward. Phishing emails contain links to malicious websites that appear legitimate, encourage the recipient to click on a link, or attempt to obtain sensitive information. Finally, emails with malware are those that contain malicious code. It is difficult to find characteristics of emails that clearly distinguish these categories (e.g., because a scam email contains no suspicious links or attachments, its content must be examined).
Machine learning offers various classification methods that learn to assign a class label from already labelled examples. The training examples should represent the problem well and include multiple examples of each class. Each example is described by several features, and the model learns which features best distinguish between the individual classes. An interesting question is which features should be considered in order to distinguish between categories of malicious emails. We chose to extract as many features as possible and discard those that are irrelevant or worsen accuracy. We aimed to identify features that are suitable for different classification methods.
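To make the notion of describing an email by features more concrete, the following minimal sketch maps one raw message to a small feature dictionary using only the Python standard library. The function name `extract_features`, the five feature names, and the sample message are assumptions chosen for illustration; they are not the 34 features used in this paper.

```python
import re
from email import message_from_string

# Hypothetical pattern for counting links; real link extraction is more involved.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_features(raw_email: str) -> dict:
    """Map one raw RFC 5322 message to a flat dictionary of numeric features."""
    msg = message_from_string(raw_email)
    body_parts = []
    attachment_count = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_filename():
            attachment_count += 1  # a part with a filename is treated as an attachment
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)
    subject = msg.get("Subject", "")
    return {
        "url_count": len(URL_PATTERN.findall(body)),
        "attachment_count": attachment_count,
        "subject_length": len(subject),
        "body_length": len(body),
        "has_reply_to": int(msg.get("Reply-To") is not None),
    }


# Invented sample message, for illustration only.
sample = (
    "From: support@example.com\n"
    "Reply-To: attacker@example.net\n"
    "Subject: Verify your account\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Please log in at http://example.org/login to keep your mailbox.\n"
)
features = extract_features(sample)
```

Each email then becomes one row of numeric values, which is the input form that supervised classifiers such as those compared in this paper generally expect.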
This paper compares and discusses the performance of four classification algorithms applied to the email classification problem: Random Forest, Decision Tree, Support Vector Machines, and k-Nearest Neighbors. We evaluate their performance on a self-collected data set of 1000 emails. The results show that Random Forest achieves the best performance.
The main goal of this paper is to distinguish between categories of malicious emails and to extract features that achieve the best results in terms of accuracy and classification time. We state the following research sub-goals:
- to compare the performance of methods for classifying malicious emails, and
- to find, among as many candidate features as possible, the set of attributes that yields the best results.
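Both sub-goals rest on measuring accuracy for a given method and a given feature subset. The sketch below illustrates that measurement with a plain k-Nearest Neighbors classifier and leave-one-out evaluation. It is not the evaluation procedure of this paper: the helper names `knn_predict` and `loo_accuracy`, the six toy samples, and the choice of leave-one-out are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(samples, feature_idx, k=3):
    """Leave-one-out accuracy using only the feature columns in feature_idx."""
    projected = [([x[i] for i in feature_idx], y) for x, y in samples]
    hits = 0
    for i, (x, y) in enumerate(projected):
        rest = projected[:i] + projected[i + 1:]  # hold out sample i
        hits += knn_predict(rest, x, k) == y
    return hits / len(projected)


# Invented toy data: (url_count, attachment_count, unrelated_value), label.
samples = [
    ((3, 0, 7), "phishing"),
    ((4, 0, 1), "phishing"),
    ((5, 0, 4), "phishing"),
    ((0, 1, 2), "malware"),
    ((0, 2, 8), "malware"),
    ((1, 2, 5), "malware"),
]

acc_informative = loo_accuracy(samples, feature_idx=[0, 1])
acc_noise = loo_accuracy(samples, feature_idx=[2])
print(f"informative features: {acc_informative:.2f}, unrelated feature: {acc_noise:.2f}")
```

On this toy data, the two informative columns separate the classes while the unrelated column does not. Comparing accuracies in this way is one simple basis for discarding a feature.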
The remainder of this paper is organized into five sections. Section 2 reviews published research on methods for detecting and classifying malicious emails. Section 3 describes the research methodology. Section 4 describes the processing of the emails. Section 5 presents the results and discussion, and the last section contains the conclusion and our suggestions for future research.