Given the continuous growth of illicit activities on the Internet, there is a need for intelligent systems to identify malicious web pages. It has been shown that URL analysis is an effective tool for detecting phishing, malware, and other attacks. Previous studies have performed URL classification using a combination of lexical features, network traffic, hosting information, and other strategies. These approaches require time-intensive lookups, which introduce significant delay in real-time systems. This paper describes a lightweight approach for classifying malicious web pages using URL lexical analysis alone. The goal is to explore the upper bound of the classification accuracy of a purely lexical approach. Another aim is to develop an approach that could be used in a real-time system. These goals culminate in the development of a classification system based on lexical analysis of URLs. It classifies URLs of malicious web pages with 99.1% accuracy, a 0.4% false positive rate, and an F1-Score of 98.7, and requires 0.62 milliseconds on average. This method substantially outperforms previously published algorithms on out-of-sample data.
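To make the idea of a purely lexical approach concrete, the following is a minimal sketch of how features might be computed from the URL string alone, with no network lookups. The abstract does not list the features actually used, so every feature name here (`url_length`, `num_tokens`, `url_entropy`, and so on) and both helper functions are illustrative assumptions rather than the method described in this work.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL text only.

    Hypothetical feature set: chosen for illustration, not taken from the thesis.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Split the whole URL on any run of non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_tokens": len(tokens),
        "longest_token": max((len(t) for t in tokens), default=0),
        # Loose dotted-quad check; it does not validate octet ranges.
        "host_is_ipv4": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "url_entropy": shannon_entropy(url),
    }
```

A feature dictionary of this kind could then be passed to any supervised classifier. Because every value is derived from the string itself, extraction involves no DNS, WHOIS, or page-content requests, which is the property that makes a lexical approach attractive for real-time use.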
Machine Learning, Malware Detection, Classification, Malicious Web Pages, Supervised Learning, Natural Language Processing
Amrita Center for CyberSecurity
Level of Degree
Electrical and Computer Engineering
First Committee Member (Chair)
Second Committee Member
Darling, Michael. "A Lexical Approach for Classifying Malicious URLs." (2015). https://digitalrepository.unm.edu/ece_etds/63