Designing an Effective Malware Detector: Key Insights

How can we determine whether a program is malicious or benign? Malicious software deceives users by hiding its intent until it is too late: once it has executed, the damage is already done. To detect and block malware before it can harm user devices, we need to uncover its intent: does the program seek to harm the user, or will it behave as expected? Unlike simple tasks, such as distinguishing an apple from an orange, malware detection is challenging because we cannot truly identify a program’s intent. As a result, achieving 100% accuracy in malware detection is impossible. However, certain static and behavioral differences between malware and benign programs can help us tell them apart effectively. In this article, we will discuss the design principles for a machine learning-based malware detector and outline the key requirements for its effectiveness.

A Malware Detector Must Be Scalable, Efficient, and Robust

In the early days of cybersecurity, antivirus software relied on manually written detection rules to identify and block malware. However, as the number of threats and malware families grew, this approach became unsustainable. With millions of potential threats emerging daily, a more scalable and efficient solution is needed. Furthermore, the malware landscape is constantly evolving, with malware writers continuously releasing new versions to evade detection. To keep pace, a malware detector must be robust and capable of adapting to these ongoing changes.

Key Requirements for Effective Malware Detection

A machine learning-based malware detector must meet several key requirements to be effective and adaptable. First, its models must be trained on a large, representative dataset that accurately captures the real-world differences between malicious and benign programs. Without such a dataset, models may learn spurious rules; for example, they might label all large files as malicious if the training data includes large malware samples but lacks large goodware samples. Second, the models should be transparent and their decisions interpretable, so that we can identify the root causes of false alarms. This transparency is crucial for refining both the model and the dataset. Third, while accurate malware detection is crucial, minimizing false positives is even more important: misclassifying legitimate programs as threats frustrates users and can lead them to disable their protection, leaving their devices vulnerable. Finally, because malware authors frequently release new versions to evade detection, the data a model sees in production gradually diverges from the data it was trained on. This concept drift makes models outdated over time. To address it, a malware detector should be flexible enough to accept updates with new data between full retraining runs.
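One common way to act on the low false-positive requirement is to choose the decision threshold from the scores a model assigns to known-benign files. The sketch below is illustrative only: the function names, the convention that a file is flagged when its score is at or above the threshold, and the example scores are all assumptions, not something taken from a specific product.

```python
import math

def false_positive_rate(benign_scores, threshold):
    """Fraction of benign files that would be flagged at this threshold."""
    flagged = sum(1 for s in benign_scores if s >= threshold)
    return flagged / len(benign_scores)

def threshold_for_fpr(benign_scores, max_fpr):
    """Smallest threshold whose false-positive rate on benign_scores
    does not exceed max_fpr (hypothetical helper for illustration)."""
    # Candidate thresholds: every observed score, plus infinity,
    # which flags nothing and therefore always satisfies the limit.
    candidates = sorted(set(benign_scores)) + [math.inf]
    for t in candidates:
        if false_positive_rate(benign_scores, t) <= max_fpr:
            return t

# Made-up maliciousness scores for ten benign validation files.
benign_scores = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.95]
print(threshold_for_fpr(benign_scores, max_fpr=0.1))
```

A lower threshold catches more malware but flags more benign files, so fixing the acceptable false-positive rate first and deriving the threshold from it keeps that trade-off explicit.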

Machine Learning Approaches to Malware Detection

Machine learning offers several approaches to malware detection, each with its strengths and limitations. Supervised learning models, such as random forests and support vector machines, are effective at detecting known threats and their variations. However, these models require labeled data, which can be expensive and time-consuming to obtain. In contrast, unsupervised learning models, such as clustering algorithms, can group similar threats and separate malware from benign programs more cost-effectively. Although unsupervised learning reduces the manual effort needed for labeling data, it cannot by itself assign labels to the groups it finds. Deep learning models, a subset of machine learning, are adept at recognizing sophisticated threats by inferring high-level patterns from low-level data. However, they require significantly more data than traditional approaches and offer limited interpretability due to their complexity.

Program Features - Static vs. Dynamic

Program features are what allow machine learning models to differentiate between malware and goodware. These features are gathered in two phases: pre-execution and post-execution. In the pre-execution phase, static features such as signatures, header metadata, and imports are extracted from the program’s binary file without executing it. In the post-execution phase, dynamic features such as network traffic, process creation, and registry changes are collected by executing the program in a controlled environment. While dynamic features provide more detailed insight into the program’s actual behavior, they are more resource-intensive to collect and may not be suitable for real-time detection.
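To make the idea of static features concrete, here is a minimal sketch that derives a few features from raw bytes without executing anything. The specific features (file size, byte entropy, and a check for the "MZ" magic bytes that begin Windows PE files) are assumptions chosen for illustration; a real extractor would also parse headers, imports, and signatures.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def static_features(data: bytes) -> dict:
    """Hypothetical static feature vector computed without running the file."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),          # high values can hint at packing or encryption
        "starts_with_mz": data[:2] == b"MZ",    # DOS/PE magic bytes
    }

# Usage on an in-memory byte string standing in for a file's contents.
sample = b"MZ" + bytes(range(256))
print(static_features(sample))
```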

Static features can be extracted and used with lightweight models on user devices to provide real-time detection. If the on-device model cannot reach a decision with high confidence, the detection task can be forwarded to the cloud, where a more complex machine learning model can use both static and dynamic features to make a more accurate decision.
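The device-then-cloud flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: both models are plain callables that return a probability of maliciousness, the 0.9 confidence cutoff and 0.5 cloud cutoff are arbitrary, and `classify` and `Verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A model maps a feature vector to P(malicious) in [0, 1] (assumed interface).
Model = Callable[[Sequence[float]], float]

@dataclass
class Verdict:
    label: str   # "malicious" or "benign"
    source: str  # "device" or "cloud"

def classify(static_features: Sequence[float],
             local_model: Model,
             cloud_model: Model,
             confidence: float = 0.9) -> Verdict:
    """Decide on the device when the lightweight model is confident,
    otherwise forward the request to the heavier cloud model."""
    p = local_model(static_features)
    if p >= confidence:
        return Verdict("malicious", "device")
    if p <= 1.0 - confidence:
        return Verdict("benign", "device")
    # Low confidence: escalate. In a real system the cloud side would
    # also collect dynamic features; here it is just another callable.
    p_cloud = cloud_model(static_features)
    return Verdict("malicious" if p_cloud >= 0.5 else "benign", "cloud")

# Usage with stand-in models.
print(classify([0.3, 7.1], local_model=lambda f: 0.55, cloud_model=lambda f: 0.8))
```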

Two-Stage Pre-Execution Detection Approach by Kaspersky Lab

Kaspersky Lab’s two-stage pre-execution detection approach [1] is an efficient, robust, and interpretable method designed to keep the false-positive rate low. Given the high volume of daily malware detection requests and the time and resources that machine learning models require, running a heavy model on every file is impractical. To address this, Kaspersky employs a two-stage approach that combines a similarity hash function with an ensemble of decision trees.

Pre-Detect Stage

In the first stage, known as the pre-detect stage, lightweight features (those requiring minimal overhead to extract) are used to compute a similarity hash based on locality-sensitive hashing (LSH). LSH has the useful property that almost identical files fall into the same hash bucket with high probability. As a result, a malicious executable and its slightly modified versions will tend to fall into the same bucket. Similarly, almost identical benign files will be grouped into one bucket. This allows for fast classification: if a target file falls into a bucket in which all known files are benign, or all are malicious, it can be quickly categorized accordingly. However, if a file falls into a bucket with a mix of benign and malicious files, further analysis is required in the second stage, the detect stage.
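The bucket logic of the pre-detect stage can be sketched in a few lines. The hash below is a deliberately crude stand-in (coarse quantization of numeric features) and is not the similarity hash Kaspersky uses, which the source does not specify. The class name, the bin width, and the "unknown" verdict for never-seen buckets are likewise assumptions.

```python
from collections import defaultdict

def similarity_hash(features, bin_width=10.0):
    """Toy locality-sensitive hash: nearby feature vectors usually share
    a key because each value is rounded down to a coarse bin."""
    return tuple(int(x // bin_width) for x in features)

class PreDetect:
    def __init__(self, bin_width=10.0):
        self.bin_width = bin_width
        self.buckets = defaultdict(set)  # hash key -> labels seen in that bucket

    def add(self, features, label):
        """Record a known file ("malicious" or "benign") in its bucket."""
        self.buckets[similarity_hash(features, self.bin_width)].add(label)

    def verdict(self, features):
        labels = self.buckets.get(similarity_hash(features, self.bin_width))
        if not labels:
            return "unknown"              # assumption: unseen bucket, no fast answer
        if len(labels) == 1:
            return next(iter(labels))     # pure bucket: answer immediately
        return "needs_detect_stage"       # mixed bucket: defer to stage two

pre = PreDetect()
pre.add([101.0, 42.0], "malicious")
pre.add([250.0, 7.0], "benign")
pre.add([500.0, 90.0], "malicious")
pre.add([505.0, 93.0], "benign")
print(pre.verdict([103.0, 45.0]))   # same bucket as the first known sample
print(pre.verdict([509.0, 99.0]))   # bucket containing both labels
```

Quantization has hard bin boundaries, so two similar files can still land in adjacent buckets; practical LSH schemes reduce this effect, for example by combining several hash functions.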

Detect Stage

During the detect stage, specialized classifiers trained on heavy features perform a more in-depth analysis. Extracting these features requires significant computational resources, but the pre-detect stage significantly reduces the number of files that need this costly analysis. Each classifier is trained using malicious files from one specific hash bucket and benign files from all buckets. Since the bucket-specific classifiers deal with nearly identical malware samples, they tend to have a lower false-positive rate.
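The way training data is split across bucket-specific classifiers can be illustrated with a small helper. This sketch only assembles the training sets; it does not train a model, and the tuple layout and function name are assumptions made for the example.

```python
from collections import defaultdict

def bucket_training_sets(samples):
    """Build one training set per hash bucket.

    samples: iterable of (bucket_id, features, label) tuples, where
    label is "malicious" or "benign" (hypothetical layout).
    Each bucket's set pairs the malicious files from that bucket only
    with the benign files from every bucket.
    """
    samples = list(samples)
    all_benign = [f for _, f, label in samples if label == "benign"]
    malicious_by_bucket = defaultdict(list)
    for bucket_id, features, label in samples:
        if label == "malicious":
            malicious_by_bucket[bucket_id].append(features)
    return {
        bucket_id: {"malicious": malicious, "benign": all_benign}
        for bucket_id, malicious in malicious_by_bucket.items()
    }

samples = [
    ("A", [1.0, 2.0], "malicious"),
    ("A", [1.1, 2.1], "malicious"),
    ("A", [9.0, 9.0], "benign"),
    ("B", [5.0, 5.0], "malicious"),
    ("C", [7.0, 1.0], "benign"),
]
sets = bucket_training_sets(samples)
print(sorted(sets))                 # buckets that contain malware
print(len(sets["A"]["malicious"]), len(sets["A"]["benign"]))
```

Because each bucket has its own training set, a classifier for a newly added bucket can be trained without touching the others.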

This two-stage approach is interpretable because each hash bucket contains nearly identical samples. It is also adaptable to new threats. We can add new hash buckets and classifiers without retraining the entire model.

Addressing the Challenge of Imbalanced Datasets

In real-world scenarios, there is a significant imbalance between the amount of malware and goodware, which poses a challenge for traditional machine learning approaches. These approaches often struggle with such imbalanced datasets, as they tend to bias predictions toward the majority class. Oak et al. [2] demonstrate that Bidirectional Encoder Representations from Transformers (BERT), a pretrained language model, achieves state-of-the-art performance even on an extremely imbalanced dataset in which only 0.5% of the samples are malware.
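A short worked example shows why class imbalance is a problem for evaluation as well as training. The sketch below uses a made-up dataset of 1,000 files with 0.5% malware and a trivial "classifier" that always predicts benign; it is not related to the method of Oak et al., and the function name is an assumption.

```python
def evaluate(y_true, y_pred):
    """Accuracy, precision, and recall for the positive class (1 = malware)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Report 0.0 when the ratio is undefined (nothing predicted or nothing present).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# 1,000 files, of which 5 (0.5%) are malware.
y_true = [1] * 5 + [0] * 995
always_benign = [0] * 1000

print(evaluate(y_true, always_benign))
```

The always-benign predictor reaches 99.5% accuracy while detecting no malware at all, which is why precision and recall on the malware class are more informative than accuracy for imbalanced data.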

How Human Experts and Machine Algorithms Approach Malware Classification

Now that we understand the key factors to consider when designing machine learning-based malware classifiers, let’s explore how human experts approach malware classification and the key differences in the decision-making processes between humans and machine algorithms [3].

Human experts tend to focus on dynamic features because they contain semantically rich categorical data that is easier for humans to interpret. For example, a program connecting to an unrelated web URL looks suspicious to a human analyst. Although dynamic features are more often missing than static ones, humans can use complementary data to make informed decisions. For machine learning models, however, missing information makes dynamic features less useful, because the models cannot infer the missing data.

On the other hand, static features are typically numeric, making it difficult for humans to perform statistical analysis and identify meaningful correlations. Machine learning models, however, excel at analyzing static features, detecting strong statistical correlations without needing deep semantic understanding.

Both humans and machines recognize the importance of network traffic and signatures in malware detection. However, they differ in the other features they prioritize. For instance, machine learning models consider the resource section of an executable to be the most important, whereas human experts consider it the least important. Malware frequently embeds executables, DLLs, or large raw data within the resource section, a pattern that machine learning can detect through statistical analysis. Additionally, human experts and machine learning algorithms do not agree on which samples are the most challenging to classify. These differences highlight that humans and machines have complementary skills and can benefit from each other’s insights.

References

[1] Kaspersky Enterprise Cybersecurity. “Machine learning for malware detection.” 2017.

[2] Oak, Rajvardhan, et al. “Malware detection on highly imbalanced data through sequence modeling.” Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 2019.

[3] Aonzo, Simone, et al. “Humans vs. machines in malware classification.” 32nd USENIX Security Symposium (USENIX Security 23). 2023.