Weighted ensemble classifier for malicious link detection using natural language processing

Saleem Raja A. (Information Technology Department, University of Technology and Applied Sciences – Shinas, Shinas, Oman)
Sundaravadivazhagan Balasubaramanian (Information Technology Department, University of Technology and Applied Sciences – Al Musannah, Al Musannah, Oman)
Pradeepa Ganesan (Information Technology Department, University of Technology and Applied Sciences – Shinas, Shinas, Oman)
Justin Rajasekaran (Information Technology Department, University of Technology and Applied Sciences – Shinas, Shinas, Oman)
Karthikeyan R. (Department of Artificial Intelligence and Machine Learning (CSE), Vardhaman College of Engineering, Hyderabad, India)

International Journal of Pervasive Computing and Communications

ISSN: 1742-7371

Article publication date: 3 January 2023

Abstract

Purpose

The internet has become fully integrated into contemporary life, and people rely heavily on internet services for everyday activities. Consequently, an abundance of information about people and organizations is available online, which encourages the proliferation of cybercrime. Cybercriminals often use malicious links, disseminated via email, SMS and social media, for large-scale cyberattacks. Recognizing malicious links online can be exceedingly challenging. The purpose of this paper is to present a strong security system that can detect malicious links in cyberspace using natural language processing techniques.

Design/methodology/approach

Researchers have recommended a variety of approaches for automatically recognizing malicious links, including blacklisting and rule-based machine/deep learning. However, these approaches generally require generating a set of features to generalize the detection process. Most features are generated by processing the URL and the content of the web page, along with some external features such as the ranking of the web page and domain name system information. This process of feature extraction and selection is typically time-consuming and demands a high level of domain expertise. Moreover, the generated features may not leverage the full potential of the data set. In addition, the majority of currently deployed systems use a single classifier for the classification of malicious links, yet prediction accuracy may vary widely depending on the data set and the classifier used.

Findings

To address the issue of generating feature sets, the proposed method uses natural language processing techniques (term frequency and inverse document frequency) to vectorize URLs. To build a robust system for the classification of malicious links, the proposed system implements a weighted soft voting classifier, an ensemble classifier that combines the predictions of base classifiers. The weight assigned to each classifier is based on its ability or skill.
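The two ideas named above, TF-IDF vectorization of URLs and weighted soft voting, can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the tokenization, the IDF variant, the base classifiers or the weights, so the tokenizer, the smoothed IDF formula, the function names `tokenize`, `tfidf_vectors` and `weighted_soft_vote`, and the probability values in the usage note are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed tokenization)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


def tfidf_vectors(urls):
    """Return (vocabulary, vectors), one TF-IDF vector per URL.

    Uses relative term frequency and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is one common variant
    and an assumption here, not necessarily the paper's formula.
    """
    docs = [Counter(tokenize(u)) for u in urls]
    n_docs = len(docs)
    # Document frequency: number of URLs in which each token appears.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        total = sum(doc.values()) or 1  # guard against an empty URL
        vectors.append([doc[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def weighted_soft_vote(probas, weights):
    """Combine per-classifier class probabilities by weighted averaging.

    probas  -- one probability list per base classifier, e.g. [P(benign), P(malicious)]
    weights -- one non-negative weight per base classifier
    Returns (index of the winning class, weighted average probabilities).
    """
    total_weight = sum(weights)
    n_classes = len(probas[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probas, weights)) / total_weight
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged
```

As a usage illustration with made-up numbers, three hypothetical base classifiers that output `[0.6, 0.4]`, `[0.3, 0.7]` and `[0.2, 0.8]` yield class 1 under equal weights `[1, 1, 1]`, but class 0 when the first classifier is weighted heavily, as in `[6, 1, 1]`. This is why the choice of weights matters. In practice, scikit-learn's `TfidfVectorizer` and `VotingClassifier(voting="soft", weights=...)` provide library equivalents of these two steps.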

Originality/value

The proposed method performs better when optimal weights are assigned. Its performance was assessed on two different data sets (D1 and D2) and compared against base machine learning classifiers and previous research results. The accuracy results show that the proposed method is superior to the existing methods, offering 91.4% and 98.8% accuracy for data sets D1 and D2, respectively.

Citation

A., S.R., Balasubaramanian, S., Ganesan, P., Rajasekaran, J. and R., K. (2023), "Weighted ensemble classifier for malicious link detection using natural language processing", International Journal of Pervasive Computing and Communications, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/IJPCC-09-2022-0312

Publisher

Emerald Publishing Limited

Copyright © 2022, Emerald Publishing Limited
