Machine Learning in Security Applications

Machine learning (ML) is changing security practice by addressing key challenges in authentication and malware detection. In continuous authentication, ML models track user behavior to verify identity without constant password prompts. Usable authentication methods, such as combining passphrases with keystroke dynamics, strengthen security while staying user-friendly. In malware detection, ML tools like AutoYara speed up malware identification by automating Yara rule creation. This article explores the role of ML in these security applications.

Continuous Authentication

Authentication and authorization are distinct yet interconnected components of system security. Authentication verifies a user’s identity, ensuring they are who they claim to be, while authorization determines the resources or actions the user is permitted to access. Traditional authentication methods, such as passwords, verify identity only once, at login. If a device is then left unattended or stolen, an adversary can gain unauthorized access. Continuous authentication offers a more secure alternative by repeatedly verifying a user’s identity throughout a session. However, relying on repeated password prompts for continuous authentication compromises usability, as it frustrates users and increases the time spent on the authentication process. Additionally, the widespread issue of password leakage further undermines the reliability of such methods.

To address these challenges, Giovanini et al. [1] investigated the use of computer usage profiles as a means to enhance security while maintaining usability. A computer usage profile is a digital representation of a user based on their typical usage patterns. The authors collected network events (sites accessed, frequency, and patterns of online activity), process events (applications opened and used), and mouse and keystroke dynamics (frequency and patterns of physical interaction) from 31 participants over 8 weeks. Their findings revealed that most users exhibit consistent usage patterns over time. They discovered that including background process activity increased temporal correlations in the profiles, suggesting that the model captured computer-specific behavior rather than user behavior. The authors also found that computer usage profiling has the potential to uniquely characterize computer users. They compared online and offline models for profiling. Online models, which adapt to gradual changes in user behavior, outperformed offline models in accuracy and robustness to temporal changes in profiles. Offline models, on the other hand, become less effective when user behavior evolves after initial training. The authors also found that network-related events are the most relevant features for accurately recognizing profiles.
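The difference between offline and online models can be illustrated with a toy sketch. It is not the model used by Giovanini et al.; the two features (say, site visits and application launches per hour), the weekly drift, and the update rate are all assumptions made for illustration. The offline profile is fixed after training, while the online profile moves toward each new sample.

```python
from typing import List, Sequence, Tuple


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally sized feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; a larger value means a less familiar sample."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class OfflineProfile:
    """Trained once on the initial data and never updated."""

    def __init__(self, training: Sequence[Sequence[float]]) -> None:
        self.center = mean_vector(training)

    def score(self, sample: Sequence[float]) -> float:
        return distance(self.center, sample)


class OnlineProfile(OfflineProfile):
    """Same starting point, but nudged toward every new sample."""

    def __init__(self, training: Sequence[Sequence[float]], rate: float = 0.5) -> None:
        super().__init__(training)
        self.rate = rate  # hypothetical update rate, not taken from the paper

    def update(self, sample: Sequence[float]) -> None:
        # Exponential moving average of the profile center.
        self.center = [
            (1 - self.rate) * c + self.rate * s
            for c, s in zip(self.center, sample)
        ]


def compare_under_drift() -> Tuple[List[float], List[float]]:
    """Score eight synthetic weeks in which the first feature drifts upward."""
    weeks = [[10.0 + i, 2.0] for i in range(8)]  # made-up data
    offline = OfflineProfile(weeks[:2])
    online = OnlineProfile(weeks[:2])
    offline_scores, online_scores = [], []
    for sample in weeks[2:]:
        offline_scores.append(offline.score(sample))
        online_scores.append(online.score(sample))
        online.update(sample)  # score first, then adapt
    return offline_scores, online_scores
```

In this synthetic run the offline distance keeps growing as the user drifts, while the online distance stays bounded. A real system would likely update the profile only with samples it has already accepted as genuine, so that an intruder cannot retrain it.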

Despite its promise, this approach has notable limitations. It requires an initial dataset before it can begin recognizing users, leading to a cold start problem in which the system is vulnerable until sufficient data is collected. Additionally, user behavior changes over time, leading to concept drift. If the system is slow to detect and adjust to these changes, its effectiveness diminishes. For real-world adoption, continuous authentication systems must be consistently accurate to ensure reliability. Another critical concern is privacy. Since this approach continuously tracks keystroke dynamics and other interaction data, its functionality resembles that of malware such as keyloggers. Although the data is used for security purposes, it raises ethical questions about user consent, transparency, and the potential misuse of sensitive information. For widespread adoption, these issues must be addressed to balance security, usability, and user trust.

Usable Authentication

Authentication methods like biometrics and text-based methods have their own strengths and weaknesses. Biometric authentication, such as fingerprints, face recognition, or voice identification, offers enhanced security and ease of use. However, it carries a critical risk: if an attacker learns how to spoof a specific biometric, it cannot be reissued like a password and becomes permanently unusable as a form of authentication.

Text-based methods, such as passwords and passphrases, remain popular but face significant challenges. The strength of these methods can be measured using Shannon entropy, which quantifies the uncertainty of an outcome and therefore how hard it is to guess. Higher entropy indicates greater unpredictability and stronger security. While passwords are theoretically more secure than passphrases due to policies requiring special characters, numbers, and uppercase letters, these same policies often undermine both security and usability. Users frustrated by complexity may create weak or predictable passwords, reducing security. Additionally, complex passwords are prone to typing errors and memory failures, which further diminishes their effectiveness and increases the time spent on the authentication process.

Passwords are also vulnerable to brute-force attacks, in which attackers systematically guess the password until the correct one is found. While retry limits can deter brute-force attempts during login, the attack remains effective against leaked databases. In many systems, passwords are stored in one of two ways: as plaintext or as hashed values. When hashed passwords are obtained from a leaked database, attackers can use brute-force techniques to crack them. However, this process is slow. To accelerate it, attackers often use a rainbow table, a precomputed structure that maps hash values back to candidate passwords. Instead of generating a new hash for each guess, the attacker simply looks up each hash from the stolen database in the rainbow table, significantly speeding up the attack. To counter rainbow table attacks, a technique called salting is employed. A salt is a unique, random value added to the password before hashing. This ensures that even if two users choose the same password, their hashed values will be different. Salting makes rainbow table attacks impractical, as the attacker would need to regenerate the table for each unique salt.
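The effect of salting can be sketched with Python's standard library. This is an illustration under stated assumptions, not a production scheme: the function names are hypothetical, the iteration count is a placeholder, and the precomputed table is a plain dictionary, which is simpler than a real rainbow table (those use hash chains to save space).

```python
import hashlib
import hmac
import os
from typing import Dict, Iterable, Optional, Tuple


def unsalted_hash(password: str) -> str:
    """Plain SHA-256: identical passwords always give identical hashes."""
    return hashlib.sha256(password.encode()).hexdigest()


def build_lookup_table(candidates: Iterable[str]) -> Dict[str, str]:
    """Precompute hash -> password for a list of guesses (simplified table)."""
    return {unsalted_hash(p): p for p in candidates}


def crack_with_table(stolen_hash: str, table: Dict[str, str]) -> Optional[str]:
    """A single dictionary lookup replaces hashing every guess again."""
    return table.get(stolen_hash)


def hash_password(password: str, iterations: int = 100_000) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using a fresh random salt per password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 100_000) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(digest, expected)
```

With unsalted hashes, one table built from common passwords works against every account in a leak. With per-user salts, two users who both choose `hunter2` end up with different digests, so a table computed in advance no longer matches any stored value.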

To tackle the shortcomings of current authentication methods, Bhana and Flowerday [2] propose a two-tier user authentication method designed to improve both security and usability. This approach combines passphrases with keystroke dynamics, a backend technique that records the user's keystroke patterns as they enter their credentials into the system. The authors show that combining keystroke dynamics with passphrases significantly increases entropy compared to traditional passwords while enhancing the user experience.
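To make the notion of entropy concrete, the sketch below computes it in two ways. It does not reproduce the measurements of Bhana and Flowerday, and it does not model keystroke dynamics. The first function assumes every symbol is chosen uniformly at random and independently, which user-chosen secrets generally are not; the 94-character alphabet and the 7776-word Diceware-style list are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def entropy_bits(symbol_count: int, length: int) -> float:
    """Entropy of `length` symbols drawn uniformly from `symbol_count` options."""
    return length * math.log2(symbol_count)


def shannon_entropy(probabilities: Sequence[float]) -> float:
    """H = -sum(p * log2(p)) for one symbol with the given distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# 8 random characters from 94 printable ASCII symbols: about 52.4 bits.
password_bits = entropy_bits(94, 8)

# 5 random words from a 7776-word list: about 64.6 bits.
passphrase_bits = entropy_bits(7776, 5)

# A skewed choice among four options is easier to guess than a uniform one.
uniform_choice = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # exactly 2 bits
skewed_choice = shannon_entropy([0.7, 0.1, 0.1, 0.1])       # about 1.36 bits
```

The second function shows why predictable habits matter: when some choices are far more likely than others, the entropy falls below the uniform maximum even though the set of allowed symbols is unchanged.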

However, despite its potential, this two-tier authentication approach faces practical challenges. Variations in keystroke patterns across different keyboards can lead to login failures, negatively impacting usability. Additionally, factors such as illness, temporary physical impairments, or mental states like fatigue can alter typing patterns, further complicating access. Moreover, the technology’s resemblance to malware such as keyloggers raises concerns about privacy and security perceptions.
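The sensitivity to keyboards and physical state follows from how keystroke dynamics are typically compared. The sketch below is a simplified illustration, not the method of Bhana and Flowerday: the feature set (dwell and flight times), the averaging template, and the 25 ms threshold are all assumptions.

```python
from statistics import mean
from typing import List, Sequence, Tuple

# One key stroke as (press_time_ms, release_time_ms).
KeyEvent = Tuple[float, float]


def timing_features(events: Sequence[KeyEvent]) -> List[float]:
    """Dwell time per key, followed by flight time between consecutive keys."""
    dwell = [release - press for press, release in events]
    flight = [events[i + 1][0] - events[i][1] for i in range(len(events) - 1)]
    return dwell + flight


def build_template(samples: Sequence[Sequence[float]]) -> List[float]:
    """Average the feature vectors collected during enrollment."""
    return [mean(column) for column in zip(*samples)]


def matches(template: Sequence[float], features: Sequence[float],
            threshold_ms: float = 25.0) -> bool:
    """Accept when the mean absolute timing deviation is within the threshold."""
    deviation = mean(abs(t - f) for t, f in zip(template, features))
    return deviation <= threshold_ms


# Made-up enrollment data for a three-key sequence.
enrollment = [
    timing_features([(0, 90), (150, 240), (310, 395)]),
    timing_features([(0, 100), (160, 250), (330, 410)]),
]
template = build_template(enrollment)

usual_keyboard = timing_features([(0, 95), (150, 245), (320, 400)])
unfamiliar_keyboard = timing_features([(0, 140), (230, 370), (470, 600)])
```

Here `matches(template, usual_keyboard)` accepts the login, while the slower timings in `unfamiliar_keyboard` exceed the threshold and are rejected even though the same person typed the same passphrase. Loosening the threshold reduces such failures but also makes impersonation easier.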

Malware Detection

Malware can be categorized into 0-day, 1-day, and N-day samples. 0-day malware refers to threats that have not yet been discovered, while 1-day malware includes newly discovered threats. N-day malware, on the other hand, comprises threats that have been known for some time. Machine learning-based malware detection methods require extensive training data and processing time, making them less effective against 1-day threats but highly effective for N-day threats.

To address 1-day threats, malware analysts often rely on Yara, an open-source tool for identifying and classifying files based on specific patterns. Analysts write rules that define the strings or byte sequences to detect, and Yara scans files or directories, comparing their content against these rules. Yara matches many patterns efficiently across large datasets by leveraging the Aho-Corasick algorithm, a string-searching technique that finds all occurrences of multiple patterns in a single pass over the input.
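The idea behind Aho-Corasick can be sketched in a few dozen lines. This is a textbook-style illustration, not Yara's implementation, which is written in C and is considerably more elaborate. The function names are hypothetical, and patterns are assumed to be non-empty byte strings.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def build_automaton(patterns: Sequence[bytes]):
    """Build the trie, failure links, and output lists for the patterns."""
    goto: List[Dict[int, int]] = [{}]   # state -> {byte: next state}
    fail: List[int] = [0]               # state -> fallback state
    out: List[List[int]] = [[]]         # state -> indices of matched patterns
    for index, pattern in enumerate(patterns):
        state = 0
        for byte in pattern:
            if byte not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][byte] = len(goto) - 1
            state = goto[state][byte]
        out[state].append(index)
    # Breadth-first pass: each failure link points at the longest proper
    # suffix of the current prefix that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(data: bytes, patterns: Sequence[bytes]) -> List[Tuple[int, bytes]]:
    """Return (start_offset, pattern) for every match, scanning data once."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits: List[Tuple[int, bytes]] = []
    for pos, byte in enumerate(data):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for index in out[state]:
            hits.append((pos - len(patterns[index]) + 1, patterns[index]))
    return hits
```

For example, `find_all(b"ushers", [b"he", b"she", b"hers"])` reports `she` at offset 1 and both `he` and `hers` at offset 2, without rescanning the input once per pattern. That single-pass property is what keeps scanning fast as the number of rule strings grows.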

Developing high-quality Yara rules to detect specific malware families can be time-consuming and labor-intensive, even for experts. To address this challenge, researchers have explored automated methods for generating Yara rules. Raff et al. [3] introduce AutoYara, a system that uses n-grams and biclustering to automate the process. Biclustering identifies shared patterns by grouping malware samples and features simultaneously. To reduce runtime, the authors precompute the n-grams, and to store the large number of n-grams efficiently, they use Bloom filters, probabilistic data structures designed to test set membership. Bloom filters are highly space-efficient: they allow a small probability of false positives (an element incorrectly reported as present) but never produce false negatives. This property makes Bloom filters an effective complement to the biclustering approach.
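A minimal sketch of byte n-gram extraction and a Bloom filter is given below. It is not AutoYara's code: the n-gram length, filter size, number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions chosen to keep the example short.

```python
import hashlib
from typing import Iterator


def byte_ngrams(data: bytes, n: int) -> Iterator[bytes]:
    """Yield every contiguous run of n bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


class BloomFilter:
    """Fixed-size bit array; may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits      # placeholder size
        self.num_hashes = num_hashes    # placeholder hash count
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several hash functions by salting BLAKE2b with a counter.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def index_sample(data: bytes, n: int = 4) -> BloomFilter:
    """Store all n-grams of one sample in a fresh filter."""
    bloom = BloomFilter()
    for gram in byte_ngrams(data, n):
        bloom.add(gram)
    return bloom
```

Every n-gram that was added is always reported as present, so a negative answer is reliable. A positive answer may occasionally be wrong, and the chance of that grows as more items are packed into the same number of bits.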

AutoYara is designed to be efficient, even on low-resource equipment. It produces rules with useful true-positive rates while maintaining low false-positive rates, sometimes matching or even outperforming human analysts. By automating the rule creation process, AutoYara significantly reduces the time analysts spend developing Yara rules, enabling them to focus on more complex malware that existing tools are unable to address.

References

[1] Giovanini, Luiz, et al. “Online binary models are promising for distinguishing temporally consistent computer usage profiles.” IEEE Transactions on Biometrics, Behavior, and Identity Science 4.3 (2022): 412-423.

[2] Bhana, Bhaveer, and Stephen Flowerday. “Passphrase and keystroke dynamics authentication: Usable security.” Computers & Security 96 (2020): 101925.

[3] Raff, Edward, et al. “Automatic Yara rule generation using biclustering.” Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security. 2020.