Building Better Malware Detectors: Challenges, Pitfalls, and Practical Recommendations

Introduction

Creating machine learning-based malware detectors comes with many challenges and pitfalls that can arise at nearly every stage of the workflow, often leading to unrealistic performance expectations and misunderstandings of the results. In this article, we will explore the key challenges faced in this domain, highlight common pitfalls throughout the machine learning workflow, and provide practical recommendations to navigate these obstacles effectively.

Data Collection and Labeling Stage

An appropriate dataset is crucial for any machine learning application because poor-quality data leads to poor results. Selecting the right dataset is particularly challenging in the cybersecurity domain due to the constant evolution of both benign and malicious samples. Malware changes more rapidly than benign files, as attackers continuously adapt their strategies to evade detection, whereas benign programs tend to evolve slowly to maintain stability. Below, we explore the key challenges and pitfalls associated with data collection and discuss potential solutions.

The dataset must represent the true distribution of the underlying security problem; otherwise, the model will fail to generalize to unseen data. Ignoring real-world conditions when creating a dataset leads to a pitfall known as sampling bias. Capturing real-world scenarios in a dataset is challenging: small datasets may overlook many real-world scenarios, while larger datasets require longer training times, which may not be feasible for real-time detection on endpoint devices. Moreover, a large dataset does not guarantee better performance; Ceschin et al. [1] show that model performance improves with dataset size only up to a certain point.

Regional differences also affect model performance. For example, a model trained on a U.S. dataset may perform poorly when applied to Brazilian samples, as attack vectors vary significantly between regions: U.S. users often rely on SMS, while Brazilian users prefer messaging apps like WhatsApp, exposing them to different types of malware. To mitigate these challenges, it is essential to incorporate data from various regions and diverse malware families. This diversity enables the trained model to generalize better.
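To make this concrete, here is a minimal sketch of a composition check and a stratified subsample that preserves the regional and family mix. It assumes a hypothetical metadata file (samples.csv) with region, family, and label columns; these names are illustrative, not part of any standard dataset.

```python
import pandas as pd

# Hypothetical metadata file and column names ("region", "family", "label").
df = pd.read_csv("samples.csv")
df["family"] = df["family"].fillna("benign")   # benign rows carry no family

# How is the dataset distributed across regions and malware families?
print(df["region"].value_counts(normalize=True))
print(df.loc[df["label"] == 1, "family"].value_counts(normalize=True))

# Stratified subsample that keeps the region/family mix intact.
subset = (
    df.groupby(["region", "family"], group_keys=False)
      .apply(lambda g: g.sample(frac=0.2, random_state=0))
)
```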

For effective malware detection, the labels in the training dataset must be accurate. However, ground truth labels are often inaccurate and unstable. Researchers often assume that samples collected from app stores are benign, while those gathered from blacklisted sources are malicious; yet app stores can also contain malware. Another common labeling method relies on the VirusTotal service, which aggregates verdicts from multiple antivirus (AV) engines to determine a label. However, AV labels can change over time, and recent research indicates that they may not stabilize until 20 to 30 days after a new threat emerges. Training a model on incorrect labels may teach it to treat malicious behaviors as legitimate, which hampers its real-world detection performance. Failing to consider these labeling issues leads to a pitfall known as label inaccuracy. To address these challenges, delayed evaluation approaches should be considered; additionally, manually investigating false positives or actively modeling label noise can help mitigate the effects of label inaccuracies.
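As a rough illustration of delayed evaluation, the sketch below derives a label from per-engine AV verdicts and refuses to trust it until the scan is old enough for the verdicts to have stabilized. The 30-day window and the threshold of four detections are illustrative choices, not values prescribed by VirusTotal or the cited work.

```python
from datetime import datetime, timedelta

STABILIZATION_DAYS = 30      # illustrative waiting period
DETECTION_THRESHOLD = 4      # illustrative number of engines that must agree

def label_sample(av_verdicts: dict, scan_date: datetime, now: datetime):
    """av_verdicts maps engine name -> True if that engine flagged the sample."""
    if now - scan_date < timedelta(days=STABILIZATION_DAYS):
        return None          # label not yet considered stable; re-check later
    detections = sum(av_verdicts.values())
    return 1 if detections >= DETECTION_THRESHOLD else 0
```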

In practice, benign files far outnumber malicious ones, causing typical machine learning models to bias their predictions toward the majority class. To address this issue, we can re-sample the dataset using two approaches: (1) undersampling, which removes instances from the majority class, and (2) oversampling, which creates synthetic instances for the minority class. When undersampling, failing to consider temporal information can alter the distribution of the different subclasses, causing a deviation from the original distribution; to preserve the real-world distribution of malware families, all subclasses should be undersampled proportionally. Similarly, when creating synthetic samples, the distribution of malware families over time must be preserved. Alternative ways to tackle class imbalance include increasing the cost of misclassifying the minority class, which elevates its importance during training (as sketched below); ensemble learning, which combines multiple models to improve performance; and anomaly detection algorithms, which treat malware as anomalies.
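The cost-sensitive option is often the simplest to try first. Below is a minimal sketch using scikit-learn, where class_weight="balanced" raises the misclassification cost of the rare malicious class in proportion to its rarity; the synthetic data merely stands in for an imbalanced malware feature set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an imbalanced dataset: roughly 1% positives.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)

# Cost-sensitive learning: misclassifying the rare malicious class costs more.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)
```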

System Design and Learning Stage

In traditional machine learning applications, k-fold cross-validation is commonly used to estimate model performance. However, in the cybersecurity landscape, this approach can lead to data leakage, i.e., using test data during training. In k-fold cross-validation, the dataset is randomly split into training and test sets without considering the temporal order of the samples. As a result, the model may be exposed to future samples (test data) during training. In reality, malware is constantly changing, and future malware is expected to drift. The same conditions (train on past samples, test on future samples) should therefore be maintained during the learning stage; otherwise, performance will degrade significantly once the model is deployed in the real world. Failing to consider this problem is known as the data snooping pitfall. To mitigate this issue, it is essential to order the dataset based on the timestamps of data samples and ensure that the model is not exposed to test data during training.
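A minimal sketch of such a time-aware split is shown below. The file name and the timestamp column are assumptions about how the metadata is stored.

```python
import pandas as pd

# Time-aware split: every training sample predates every test sample,
# so the model never sees "future" malware during training.
df = pd.read_csv("samples.csv", parse_dates=["timestamp"])   # hypothetical file
df = df.sort_values("timestamp").reset_index(drop=True)

cut = int(0.8 * len(df))          # e.g. first 80% of the timeline for training
train, test = df.iloc[:cut], df.iloc[cut:]
```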

Sometimes, a model learns irrelevant patterns that correlate with the task but fail to generalize. For instance, in a network intrusion detection system, if a large portion of the attacks in the dataset originates from a specific network region, the model may learn to detect that particular IP range rather than generic attack patterns. Similarly, if the dataset includes large malware files but mostly small benign files, the model might mistakenly associate file size with the likelihood of being malicious. Such shortcuts can inflate performance during evaluation but break down in real-world scenarios. Failing to identify these irrelevant correlations is known as the spurious correlation pitfall. Spurious correlations are largely a consequence of sampling bias, and explanation techniques can help reveal them.
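One simple, model-agnostic explanation technique is permutation importance: if a single feature (say, file size) dominates the ranking, it deserves manual scrutiny. The sketch below runs on synthetic data purely to show the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data stands in for extracted malware features; in practice the
# split should be temporal, as discussed above.
X, y = make_classification(n_samples=5_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

# A single feature with outsized importance is a candidate spurious signal.
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```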

It is common practice to tune hyperparameters and select the best-performing model based on its performance on test data. In this process, however, the test data guides parameter selection, a problem known as the biased parameter selection pitfall. To prevent this, the dataset should be divided into three distinct sets: training, validation, and test. The validation data should be used exclusively for parameter tuning and model selection, while the test data should be reserved solely for reporting the performance of the final solution.
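The sketch below illustrates the discipline this implies: a chronological 60/20/20 split, validation scores driving the hyperparameter choice, and the test set touched exactly once. The synthetic data and the small max_depth grid are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Synthetic, already time-ordered stand-in data with roughly 1% positives.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
n = len(y)
tr = slice(0, int(0.6 * n))
va = slice(int(0.6 * n), int(0.8 * n))
te = slice(int(0.8 * n), n)

best_model, best_f1 = None, -1.0
for max_depth in (4, 8, 16):
    model = RandomForestClassifier(max_depth=max_depth, class_weight="balanced",
                                   random_state=0).fit(X[tr], y[tr])
    f1 = f1_score(y[va], model.predict(X[va]))     # validation guides the choice
    if f1 > best_f1:
        best_model, best_f1 = model, f1

print("test F1:", f1_score(y[te], best_model.predict(X[te])))   # test used once
```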

Performance Evaluation Stage

Accurate evaluation of machine learning-based security solutions is essential, as incorrect assessments can lead to wrong conclusions and unrealistic expectations about their effectiveness. For example, if a model is tested on 100 samples, 99 benign and 1 malicious, it can achieve 99% accuracy by blindly predicting every sample as benign. This high accuracy is misleading. Precision and recall offer a clearer picture: precision measures the quality of the positive predictions (the proportion of true positives among all positive predictions), while recall measures their completeness (the proportion of true positives among all actual positive samples). Here, both precision and recall would be 0, showing that the model completely failed to identify the malicious sample. In the cybersecurity domain, where class distribution can be highly imbalanced, accuracy alone therefore does not provide a reliable measure of a model’s effectiveness. Failing to consider class imbalance and using inappropriate metrics when evaluating a solution is known as the inappropriate performance measure pitfall.

Class imbalance can also give rise to another pitfall, the base rate fallacy. Suppose the model flags two of the 100 samples as malicious: the one true malware and one benign file. The resulting false positive rate of roughly 1% might seem negligible, but because only about 1% of samples are malicious, it corresponds to roughly 100 false positives for every 100 true positives, meaning about half of all alerts are false alarms. To address these challenges, we should use metrics such as the precision-recall curve or the Matthews correlation coefficient, which better account for class imbalance, and we should always discuss false positives in relation to the base rate of the negative class to provide a comprehensive evaluation.
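The toy example above can be reproduced directly with scikit-learn's metrics, which makes the gap between accuracy and the other measures explicit.

```python
from sklearn.metrics import (accuracy_score, matthews_corrcoef,
                             precision_score, recall_score)

# 99 benign (0) samples, 1 malicious (1) sample, and a model that blindly
# predicts "benign" for everything.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred))                      # 0.0
print("MCC      :", matthews_corrcoef(y_true, y_pred))                 # 0.0
```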

In performance evaluation, another challenge is selecting appropriate baselines for comparison. Consider two solutions: one detects malware in real time on end-user devices, while the other is designed for internal analysis using significantly more resources and time. The first is expected to be faster and may produce more false negatives than the second. Comparing the two is like comparing apples to oranges, as their use cases and resource requirements differ substantially. Likewise, comparing your solution to proprietary products such as antivirus software is not helpful, since the inner workings of those systems are not publicly known. Failing to take these factors into account when making comparisons leads to the inappropriate baseline pitfall. To avoid it, we should compare solutions against well-understood baselines that are similar in terms of detection timing and context.

Deployment and Operation Stage

In the deployment and operation stage of the machine learning workflow, two major pitfalls are prevalent.

The first pitfall in the deployment and operation stage is the use of an inappropriate threat model. Without clearly defining the capabilities and motivations of potential attackers, security solutions can develop blind spots. Since defending against every possible attack is unrealistic, it’s crucial to prioritize protection against high-reward attacks. For instance, if the cost of breaking into a house exceeds the benefit, attackers are unlikely to target it; if the reward is substantial, they may invest more resources and explore every possible entry point. A real-world example is the Stuxnet attack on Iran’s nuclear facility, where internal threats were overlooked. This failure to account for internal attackers exemplifies the inappropriate threat model pitfall. To address this issue, it is essential to conduct a thorough threat analysis and consider the motivations and resources of potential attackers. In machine learning-based malware detection, a similar issue arises if attackers can easily bypass the model: a poorly regularized model that depends on a small set of features is particularly vulnerable to evasion. To counter this, vulnerabilities must be assessed at every stage of the machine learning pipeline, and white-box attacks should be conducted to evaluate the model’s robustness.
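Short of a full white-box attack, even a crude evasion probe can reveal over-reliance on a handful of features. The sketch below assumes a fitted classifier clf and a feature vector x for a sample currently flagged as malicious; it counts how many single-feature edits flip the verdict to benign.

```python
def single_feature_evasions(clf, x):
    """Return the indices whose removal alone flips a malicious verdict to benign.

    clf is assumed to be a fitted scikit-learn-style classifier and x a 1-D
    NumPy feature vector that clf currently classifies as malicious (label 1).
    """
    flips = []
    for i in range(len(x)):
        perturbed = x.copy()
        perturbed[i] = 0.0                 # attacker masks or removes feature i
        if clf.predict(perturbed.reshape(1, -1))[0] == 0:
            flips.append(i)
    return flips
```

A model that can be evaded by zeroing out a single feature is relying on too narrow a signal and needs either better regularization or a richer feature set.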

The second pitfall is evaluating the solution only in a lab setting. An inherent limitation of any security solution is that it can rarely be tested under real attack conditions before deployment, so researchers usually evaluate performance in controlled environments. While a model may perform well in the lab, it can fail in real-world scenarios if the test conditions don’t reflect the dynamic nature of actual attacks. To address this issue, it’s crucial to test the model in environments that closely mimic real-world conditions; accounting for the temporal and spatial relationships in the data helps capture the dynamics encountered outside the lab.

Impacts of Pitfalls

To illustrate the impacts of the pitfalls discussed above, we present two case studies from the cybersecurity domain.

Case Study 1: Vulnerability Discovery

Vulnerabilities in source code are unintentional, unlike malware, which is deliberately created to cause harm. However, vulnerabilities can be exploited by attackers to create malware, making them a significant threat. Arp et al. [2] show that the state-of-the-art machine learning-based vulnerability detector VulDeePecker suffers from three pitfalls. First, the dataset used in VulDeePecker contains certain buffer sizes in only one class, leading to the spurious correlation pitfall: the model learns to associate buffer sizes with vulnerabilities, which may not generalize to unseen data. Second, during preprocessing, VulDeePecker replaces buffer size values with generic tokens such as INT0 or INT1, making it impossible to determine whether a buffer overflow can actually occur; this results in the label inaccuracy pitfall. Third, the authors find that VulDeePecker performs almost the same as basic linear classifiers, indicating that the model is not exploiting relations in the sequence as claimed. This highlights the inappropriate baseline pitfall, which could have been avoided by comparing the model against simpler baselines.

Case Study 2: Source Code Author Attribution

Identifying the attacker is just as important as detecting and preventing malware. Bringing attackers to justice can help prevent future attacks and provide valuable insights into their motives and methods. While many studies focus on identifying code authorship through coding style and patterns, they suffer from the pitfalls mentioned earlier, reducing their effectiveness in the real world. Recent approaches in this area often rely on data from Google Code Jam competitions, where participants reuse personalized templates across challenges. These templates are frequently unrelated to the challenge at hand, leading to sampling bias. As a result, machine learning models may associate an author with code based on these templates, creating a spurious correlation rather than learning the actual coding style.

Conclusion

Arp et al. [2] reviewed 30 papers from top security conferences over the past decade and found that the pitfalls discussed here are widespread in current security research. The prevalence of these issues highlights a general lack of awareness and reflects the inherent complexity of the cybersecurity domain. Clearly stating the assumptions behind proposed solutions improves understanding and identifies areas for future research. By addressing these challenges and pitfalls, we can build more robust and effective machine learning-based malware detectors that perform well in real-world scenarios.

References

[1] Ceschin, Fabrício, et al. “Machine learning (in) security: A stream of problems.” Digital Threats: Research and Practice 5.1 (2024): 1-32.

[2] Arp, Daniel, et al. “Dos and don’ts of machine learning in computer security.” 31st USENIX Security Symposium (USENIX Security 22). 2022.