Adapting Malware Detection to Concept Drift
Introduction
Machine learning pipelines are widely used in antivirus (AV) systems because they detect malware quickly and at scale. However, malware developers continually modify their samples’ features to evade detection. This constant evolution changes the underlying data distribution and degrades model detection rates, a phenomenon known as concept drift. Traditional batch learning methods have proven inadequate at handling malware concept drift. Recent research therefore emphasizes drift detection strategies and modeling malware detection as an ML data stream pipeline with drift detection capabilities to ensure robustness in real-world deployments.
Pervasiveness of Concept Drift
While machine learning-based malware detectors perform well in experimental settings, their real-world deployments often suffer performance degradation over time because malware authors continually adapt and modify malware features to evade detection, producing the concept drift described above. Ceschin et al. [1] demonstrate that concept drift is a widespread issue, not limited to specific datasets, by analyzing two well-known datasets of Android applications: DREBIN and AndroZoo. They identify two main factors driving the constant changes in malware samples. The first is periodic trends, where statistical correlations shift over time. The second is the evolution of the operating system, where the features themselves change.
To ensure that malware remains functional on newer systems while maintaining its malicious intent, malware authors must adapt to platform evolution. This involves supporting newly introduced APIs or dropping deprecated ones in response to operating system updates. For example, an analysis of the DREBIN dataset shows that malware samples using the SMS-sending feature appear and disappear in sync with changes in Android’s SMS subsystem. These samples often carry out SMS jacking attacks, which secretly subscribe victims to attacker-controlled services for financial gain. This behavior reflects the ongoing arms race between Google and attackers over the SMS-sending permission. Similar patterns of malware evolution are observed in both the DREBIN and AndroZoo datasets, confirming that concept drift is a pervasive issue.
Concept Drift Detection Strategies
To build an effective machine learning model for malware detection in real-world environments, it is essential to detect concept drift early, before the model’s performance declines. If concept drift is not detected early enough, attackers can exploit the model’s outdated detection capabilities. Early detection enables timely retraining, ensuring the model continues to recognize new samples. A common approach to measuring concept drift is to assess the model’s credibility when classifying new samples. If a prediction lacks sufficient credibility, the sample is set aside - a process known as classification with rejection.
Credibility is typically measured by estimating the probability that a test sample belongs to a particular class. However, since class probabilities must sum to 100%, this method can be unreliable for unseen test samples that don’t belong to any predefined class. For instance, suppose an ML model designed to classify network traffic as either HTTP or DNS finds 60% similarity between an unknown sample and an HTTP request. The model would assign 60% confidence to the HTTP class and, because there are only two possible classes, the remaining 40% to DNS. But in reality, the sample might belong to an entirely different category, such as a Skype request.
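To make this concrete, the toy snippet below trains a two-class classifier on synthetic “HTTP” and “DNS” feature vectors and then scores an out-of-distribution point. The feature vectors, class names, and 0.9 rejection threshold are illustrative assumptions, not taken from any of the cited systems; the point is only that predict_proba must split 100% of the probability mass between the known classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two known traffic classes, "HTTP" (0) and "DNS" (1), as toy 2-D feature vectors.
X_http = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
X_dns = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2))
X = np.vstack([X_http, X_dns])
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression().fit(X, y)

# An out-of-distribution sample (say, Skype traffic) that belongs to neither class.
x_unknown = np.array([[-5.0, -5.0]])
probs = clf.predict_proba(x_unknown)[0]
print("P(HTTP), P(DNS):", probs)  # probabilities still sum to 1 for unseen traffic

# Probability-based classification with rejection (hypothetical 0.9 cut-off):
# the model is highly confident here, so the bogus "HTTP" prediction is accepted.
if probs.max() < 0.9:
    print("Rejected: prediction lacks sufficient credibility")
else:
    print("Accepted as:", "HTTP" if probs.argmax() == 0 else "DNS")
```

The unknown point lies far from every training sample, yet the classifier still reports near-certainty for one of the two known classes, which is exactly the failure mode that motivates the statistical credibility measures discussed next.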
Because the model’s confidence or class probability is not always a reliable indicator for unseen data, alternative methods are needed to assess a model’s credibility. Jordaney et al. [3] propose TRANSCEND, a framework based on conformal prediction theory that measures prediction credibility by statistically comparing samples encountered during deployment with those used to train the model. Unlike probabilistic methods, which estimate the likelihood that a test sample belongs to a class, this statistical approach considers each decision in the context of previous decisions: rather than asking how likely a test sample is to fit a class, it evaluates how similar the test sample is to the other members of that class.
Transcend uses p-values as the assessment criterion in its Conformal Evaluator, measuring how well a sample fits into a class through statistical comparison. The p-value of a test element is the proportion of training elements that are at least as dissimilar to the class as the element under test. Transcend’s evaluation empirically demonstrates that these p-value-based, statistical assessments outperform probability-based methods in detecting concept drift.
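As a rough illustration of the idea (not Transcend’s exact algorithm), the sketch below computes a conformal-style p-value from a simple nonconformity measure, here the mean distance of a sample to the members of the candidate class. The synthetic feature vectors and the distance-based measure are assumptions made only for this example; Transcend derives its nonconformity scores from the underlying classifier.

```python
import numpy as np

def nonconformity(x, class_members):
    """Nonconformity score: mean distance of x to the members of the candidate class
    (a stand-in measure for this sketch)."""
    return np.mean(np.linalg.norm(class_members - x, axis=1))

def conformal_p_value(x_test, class_members):
    """p-value = fraction of class members that are at least as nonconforming
    (i.e. as dissimilar to the class) as the element under test."""
    alpha_test = nonconformity(x_test, class_members)
    alphas = np.array([
        nonconformity(x, np.delete(class_members, i, axis=0))
        for i, x in enumerate(class_members)
    ])
    return float(np.mean(alphas >= alpha_test))

rng = np.random.default_rng(1)
malware_train = rng.normal(loc=0.0, scale=1.0, size=(100, 8))  # hypothetical feature vectors

in_dist = rng.normal(loc=0.0, scale=1.0, size=8)   # resembles the training data
drifted = rng.normal(loc=5.0, scale=1.0, size=8)   # drifted sample

print("p-value (in-distribution):", conformal_p_value(in_dist, malware_train))
print("p-value (drifted):        ", conformal_p_value(drifted, malware_train))
```

A drifted sample is more nonconforming than almost every training element, so its p-value approaches zero and it can be rejected and set aside for analysis or retraining.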
While Transcend is effective in measuring concept drift, it has several limitations, including experimental bias, high resource demands that make it impractical for real-world use, and a lack of sufficient evidence to support its generalization claims. To overcome these challenges, Barbero et al. introduced TRANSCENDENT [4], an enhanced version of Transcend. TRANSCENDENT incorporates two additional conformal evaluators that match or exceed Transcend’s performance while significantly reducing computational overhead. The experimental bias present in Transcend’s evaluation is eliminated, and TRANSCENDENT has been empirically demonstrated to generalize effectively across various malware domains and classifiers.
Modeling Malware Detection as an ML Data Stream Pipeline with Drift Detection
Batch learning methods require processing all training samples simultaneously. When concept drift is detected, reapplying batch learning becomes time-consuming and resource-intensive. In contrast, data stream-based machine learning algorithms only need to process the new samples responsible for the drift, making them far more efficient in terms of time and resource usage.
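The difference is easy to see with scikit-learn’s SGDClassifier, which supports incremental updates via partial_fit; the data shapes and sizes below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(2)

# Hypothetical historical data (goodware = 0, malware = 1).
X_hist = rng.normal(size=(10_000, 50))
y_hist = rng.integers(0, 2, size=10_000)

# New samples arriving from the stream after drift is flagged.
X_new = rng.normal(loc=0.5, size=(200, 50))
y_new = rng.integers(0, 2, size=200)

# Batch learning: retrain from scratch on everything (cost grows with the history).
batch_model = SGDClassifier()
batch_model.fit(np.vstack([X_hist, X_new]), np.concatenate([y_hist, y_new]))

# Stream learning: incrementally update an existing model with only the new samples.
stream_model = SGDClassifier()
stream_model.partial_fit(X_hist, y_hist, classes=np.array([0, 1]))  # initial fit
stream_model.partial_fit(X_new, y_new)                              # cheap incremental update
```

Retraining the batch model touches the entire, ever-growing history, whereas the stream model only processes the new samples responsible for the drift.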
Traditional data stream-based machine learning models that account for concept drift typically update only the classifier when drift is detected. Ceschin et al. [1] extend this approach with Fast & Furious, which also updates the feature extractor, and demonstrate that it significantly outperforms traditional data stream-based approaches. Fast & Furious distinguishes three drift detection levels - Normal, Warning, and Drift. At the Normal level, only the classifier is incrementally updated with the new sample. At the Warning level, the classifier is incrementally updated and the new sample is stored in a buffer. At the Drift level, both the classifier and the feature extractor are retrained using the data collected during the Warning level. The performance improvement gained by including the feature extractor in the update suggests that tracking changes in feature attributes, such as new API calls, permissions, and URLs, helps sustain detection rates in the presence of concept drift.
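The following sketch mirrors that three-level logic in schematic form. The detector, its thresholds, and the vectorizer/classifier interfaces (fit_transform/transform and fit/partial_fit, as in scikit-learn) are simplified stand-ins for the stream drift detectors and pipeline components used in the paper, not its actual implementation.

```python
from collections import deque

class ThresholdDriftDetector:
    """Stand-in for a DDM/EDDM-style detector: tracks a running error rate and
    reports 'warning' or 'drift' when it exceeds illustrative thresholds."""
    def __init__(self, warning=0.10, drift=0.20, window=500):
        self.errors = deque(maxlen=window)
        self.warning, self.drift = warning, drift

    def update(self, error):
        self.errors.append(error)
        rate = sum(self.errors) / len(self.errors)
        if rate >= self.drift:
            return "drift"
        if rate >= self.warning:
            return "warning"
        return "normal"

def stream_pipeline(stream, vectorizer, classifier, detector):
    """Three-level handling of labelled samples (raw_features, label).
    The classifier is assumed to be pre-trained on an initial batch."""
    warning_buffer = []
    for raw, label in stream:
        x = vectorizer.transform([raw])
        pred = classifier.predict(x)[0]
        level = detector.update(int(pred != label))

        if level == "normal":
            classifier.partial_fit(x, [label])      # incremental update only
        elif level == "warning":
            classifier.partial_fit(x, [label])
            warning_buffer.append((raw, label))     # remember recent samples
        else:  # "drift": rebuild feature extractor and classifier on buffered data
            warning_buffer.append((raw, label))
            raws, labels = zip(*warning_buffer)
            X_buf = vectorizer.fit_transform(list(raws))  # feature extractor retrained too
            classifier.fit(X_buf, list(labels))
            warning_buffer.clear()
```

The key design choice is the last branch: on drift, the feature extractor is refit on the buffered samples so that new attributes (API calls, permissions, URLs) become part of the feature space before the classifier is retrained.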
A key limitation of Fast & Furious is its dependence on ground-truth labels, which require either manual labeling or resource-intensive, costly dynamic analysis of the drifted samples, both of which introduce delay. The delay in obtaining true labels creates a window of opportunity for attackers to exploit the model’s outdated detection capabilities, degrading overall detection performance. To address this challenge, Xu et al. [2] introduce DroidEvolver, which automatically and continuously updates itself during malware detection without human intervention.
DroidEvolver operates in two phases: initialization and detection. During the initialization phase, it builds a feature set and a model pool containing several detection models, using samples with true labels; this is the only stage where true labels are needed. In the detection phase, predictions for unknown samples are made through weighted voting among “young” models. A model’s youth is determined by the Juvenilization Indicator (JI), which measures the similarity between the sample being detected and previously classified samples with the same prediction label. If the JI for any model falls outside a specific range, the sample is flagged as a drifting sample and the corresponding model is considered an aging model. Upon detecting a drifting sample, DroidEvolver updates all aging models using the sample and a pseudo label generated by the model pool. Simultaneously, the feature set is updated to reflect the feature changes observed in the drifting sample. This allows DroidEvolver to make lightweight updates with an evolving feature set and pseudo labels, avoiding the need for true labels.
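A compressed sketch of this loop is shown below. The JI computation, the accepted JI range, and the model interface (predict/partial_fit, e.g. online linear models) are simplified placeholders rather than DroidEvolver’s exact formulas, and the feature-set expansion step is omitted.

```python
import numpy as np

JI_LOW, JI_HIGH = 0.3, 0.7  # illustrative bounds; DroidEvolver tunes its own range

class PoolMember:
    """One online model in the pool, plus a buffer of recently classified apps."""
    def __init__(self, model, weight=1.0):
        self.model = model
        self.weight = weight
        self.recent = {0: [], 1: []}  # feature vectors keyed by predicted label

    def juvenilization_indicator(self, x, label):
        """Simplified JI: mean cosine similarity between x and recent apps that
        this model assigned the same label."""
        buffer = self.recent[label]
        if not buffer:
            return 1.0
        sims = [np.dot(x, b) / (np.linalg.norm(x) * np.linalg.norm(b) + 1e-9)
                for b in buffer]
        return float(np.mean(sims))

def detect(pool, x):
    """Classify one app; refresh aging models with a pseudo label if drift is seen."""
    votes, aging = [], []
    for member in pool:
        label = int(member.model.predict([x])[0])
        ji = member.juvenilization_indicator(x, label)
        if JI_LOW <= ji <= JI_HIGH:
            votes.append((label, member.weight))   # "young" model: its vote counts
        else:
            aging.append(member)                   # model has aged for this sample

    # Weighted vote among young models produces the prediction / pseudo label.
    score = sum(w if lab == 1 else -w for lab, w in votes)
    pseudo_label = int(score > 0)

    if aging:  # drifting sample: update aging models without any true label
        for member in aging:
            member.model.partial_fit([x], [pseudo_label])
    for member in pool:
        member.recent[pseudo_label].append(x)
    return pseudo_label, bool(aging)
```

In the real system the pool consists of several online learning algorithms (e.g. passive-aggressive classifiers), and new features observed in drifting apps are appended to the shared feature set rather than ignored as in this sketch.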
While DroidEvolver adapts quickly to concept drift and operates without true labels, it must maintain a pool of models and periodically update the aging ones, which increases resource consumption and cost. Fast & Furious, in contrast, relies on ground truth for drifted samples, which can be acquired through large-scale dynamic analysis in the cloud, and it outperforms DroidEvolver in detection accuracy. The two systems can therefore be viewed as complementary, each addressing different challenges in the detection process.
References
[1] Ceschin, Fabrício, et al. “Fast & Furious: On the modelling of malware detection as an evolving data stream.” Expert Systems with Applications 212 (2023): 118590.
[2] Xu, Ke, et al. “Droidevolver: Self-evolving android malware detection system.” 2019 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2019.
[3] Jordaney, Roberto, et al. “Transcend: Detecting concept drift in malware classification models.” 26th USENIX Security Symposium (USENIX Security 17). 2017.
[4] Barbero, Federico, et al. “Transcending Transcend: Revisiting malware classification in the presence of concept drift.” 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022.