Authors: Eva Marková, Tomáš Bajtoš, Pavol Sokol, Terézia Mézešová
Abstract
Owning and using an email address is an inherent part of everyday life and work on a computer. The main aim of this paper is to analyze existing approaches to the classification of malicious emails. We have implemented a system that distinguishes between legitimate and malicious emails. Malicious emails are subsequently classified into three subcategories: spam, scam, and phishing. We prepared a labeled dataset and extracted several features from the emails it contains. Within the system, we implemented and evaluated four supervised machine learning methods: Random Forest, Decision Tree, Support Vector Machines, and k-Nearest Neighbors. According to our results, Random Forest is the most suitable approach for email classification.
Introduction
Every year, attacks on the Internet become more sophisticated. Attackers often obtain credentials or bank details through malicious emails, which may contain fraudulent links. Such emails may also ask users to install malicious software that monitors their activity on the computer. Today's malicious emails no longer read as if they were written by a machine, so it is difficult to distinguish illegitimate emails from legitimate ones.
This problem needs to be solved in order to prevent, for example, data leaks in an organisation. It is also important for users, who would otherwise face the loss of money or credentials. We also want to make work easier for IT staff, who often have to deal with problems that could be avoided by a system capable of detecting malicious emails.
Before the system was built, we tested employees and students in our organization, which has around 10,000 people [1]. We sent them three different versions of a phishing email. Approximately 5 percent of the recipients (about 500 people) responded to the email, which is a significant loss for an organization such as a university.
In this paper, we compare the performance of four classifiers designed to categorize emails, and we evaluate them using performance metrics. We used Random Forest, Decision Tree (CART), Support Vector Machines, and k-Nearest Neighbors. Random Forest achieved the best results; in our system we used the default implementation offered by the scikit-learn library [2] in Python. Our system also reports the results of the other algorithms, but Random Forest is the most important one.
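To illustrate the kind of comparison described above, the following is a minimal sketch that trains the four classifiers with scikit-learn defaults and reports two metrics. It is not the authors' implementation: the synthetic data from `make_classification`, the four-class setup, the 75/25 split, the fixed `random_state`, and the choice of accuracy and macro F1 are all assumptions made for illustration, since this section does not describe the real features, dataset, or evaluation protocol.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email feature matrix (assumption: the real
# features and labels are not available here). Four classes mimic
# legitimate, spam, scam, and phishing.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Default scikit-learn estimators; random_state is fixed only to make
# the sketch reproducible.
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree (CART)": DecisionTreeClassifier(random_state=0),
    "Support Vector Machines": SVC(),
    "k-Nearest Neighbors": KNeighborsClassifier(),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, "
          f"macro F1={scores['macro_f1']:.3f}")
```

On synthetic data the ranking of the classifiers carries no meaning; the sketch only shows how the four methods can be evaluated side by side on the same split.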
The main difficulty in classification is how to clearly distinguish between spam, scam, phishing, and legitimate email. Spam is any irrelevant or unsolicited message, mostly advertising, that is sent over the Internet. Scam is a popular form of fraud in which an attacker convinces a victim to pay a certain amount of money by promising a greater reward. Phishing contains links to malicious websites that appear legitimate, encourages the recipient to click on a link, or attempts to retrieve sensitive information. It is very difficult to find characteristics of emails that clearly separate these categories. For example, detecting a scam requires examining the text of the email, because a scam often contains no suspicious links or attachments. The main goal of this paper is to analyze techniques for detecting malicious emails and to implement a system for preventing them. We state the following research sub-goals:
- comparison of methods for the detection of malicious emails, and
- design and implementation of a system for the detection of malicious emails.
This paper is organized into six sections. Section II reviews published research related to methods for detecting malicious emails and to the design and implementation of prevention systems. Section III describes the research methodology. Section IV describes the design and implementation of the proposed system. Section V presents the results and discussion, and the last section contains the conclusion and our suggestions for future research.