Abstract

Malicious traffic poses a substantial risk to network systems and to the integrity of confidential information. Organisations can strengthen their defences and limit the impact of malicious traffic on their networks by remaining vigilant, deploying comprehensive security measures, and cultivating a cybersecurity-aware culture. This study proposes a theoretical framework for identifying and analysing potentially harmful traffic within a network system. To identify and classify malicious network traffic in a multi-class setting, we used a dataset covering nine distinct categories of network attack. Exploratory data analysis (EDA) was carried out on the dataset to check for missing values, correlation among features, class imbalance, and the most significant features. The EDA showed that the dataset is imbalanced, which, if left unaddressed, may result in overfitting to the majority class. The imbalance was addressed in Python with the RandomOverSampling approach, which oversamples the minority classes. After balancing, a random forest classifier was used to rank the features, and the ten highest-ranked features were selected. These features were used to train the proposed model, which detects malicious activity within a network system. The model identified malicious traffic with an accuracy of 99.99%, and its precision, recall, and F1-score were likewise 99.99%.

Introduction

In the field of IT/OT, “IT” stands for information technology, which covers the management and use of computer systems and networks, while “OT” stands for operational technology, the hardware and software used to monitor and control physical processes and devices. Safeguarding the Internet of Things (IoT) is an essential and urgent task. The vulnerability of the IoT can be ascribed to two main factors. First, IoT devices are widely used across sectors such as homes, smart grids, transportation systems, and critical infrastructure. Second, these devices employ a wide array of transmission mechanisms. The number of IoT devices rose from 15.4 billion in 2015 to 26.7 billion in 2019, a trend attributed to the growing reliance of individuals and organisations on online services for their everyday tasks [1].

The wide array of Internet of Things (IoT) devices found in sectors such as smart homes, connected cars, smart cities, healthcare, and industrial IoT has spurred rapid advances in IoT technology and shortened product development cycles. It is therefore crucial to create a unified system that enables seamless communication between different devices while accommodating the varied communication protocols they employ. In addition, tools and experimental platforms for operating IoT devices and for running network-level simulations are becoming increasingly accessible [2].

The MQTT protocol enables two-way communication and provides a simple, efficient way to connect Internet of Things (IoT) devices remotely to centralised brokers. Its lightweight design makes it well suited to mobile applications that require low resource usage. Nevertheless, MQTT has vulnerabilities that adversaries can exploit. The threats include device compromise, breaches of data privacy, Denial of Service (DoS) attacks on MQTT services, and Man-in-the-Middle (MitM) attacks on MQTT messages [3].

Reconnaissance attacks, also known as intelligence-gathering attacks, are used by attackers as an initial step towards breaching a specific system or network. During reconnaissance, attackers systematically gather information about the target, including its weaknesses, network topology, and potential entry points. Common techniques include port scanning, network mapping, and social engineering such as phishing. The information obtained in this phase helps attackers plan and execute the subsequent stages of the attack, increasing the likelihood of a successful breach [4].

Machine learning plays an important role in mitigating reconnaissance attacks. Security systems based on machine learning can evaluate large amounts of data and detect patterns and anomalies that human operators might overlook. Such systems observe and flag unusual network activity, recognise patterns linked to potential attackers, and distinguish legitimate from suspicious actions within a large volume of network traffic. Machine learning algorithms can also improve their detection capability as new data is incorporated. This strengthens an organisation’s defences against reconnaissance attempts and reduces the probability that such attempts escalate into more serious security breaches [5].

Review of Related Works

In [6], the authors introduced an approach for detecting intrusions in Internet of Things (IoT) networks that uses deep learning to classify traffic flows. They used a recently released IoT dataset and extracted features from the packet-level fields. A feed-forward neural network was built for multi-class classification and can detect several forms of attack on IoT devices, namely denial of service, distributed denial of service, reconnaissance, and information theft. Evaluation on the modified dataset showed a high classification accuracy.

Shafiq et al. [7] present a framework for identifying malicious traffic in Internet of Things (IoT) networks. It introduces a feature-selection metric called CorrAUC, together with a feature-selection approach that uses the area under the curve (AUC) as its performance measure. The framework then integrates Entropy and TOPSIS measures on bijective soft sets to validate the selected features for malicious traffic identification in the IoT. The approach was evaluated on the Bot-IoT dataset with four machine learning algorithms and achieved an average classification accuracy above 96%.

Bendiab et al. [8] introduce a method for analysing the network activity of malware in the Internet of Things (IoT). The method applies deep learning and visual representation techniques to detect and classify newly developed malware, commonly known as zero-day malware. Its main goal is to identify malicious network traffic at the level of individual packets, which reduces detection time. In the experiments, a Residual Neural Network (ResNet50) gave a detection accuracy of 94.50% for malware traffic.

In [9], the authors present an approach for identifying malicious communication using a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN). Although both techniques give satisfactory results with few false identifications, the SVM outperforms the CNN on all evaluation metrics. The study also suggests directions for future work, such as analysing the effect of different sizes and directions at the transport layer as features, and using a CNN combined with a Long Short-Term Memory (LSTM) network for automated feature engineering when identifying malicious network traffic.

Feng et al. [4] introduce a two-layer approach to detecting malware in Android applications. The first layer uses a static model that identifies malware from permission, intent, and component information by means of a fully connected neural network. The second layer introduces a method called CACNN, which combines a Convolutional Neural Network (CNN) with an AutoEncoder to identify malware from the network traffic of applications. The experiments show that the two-layer method is effective not only for binary classification but also for detecting malware by category and by malicious family.

Mitsuhashi et al. [10] introduce an approach for detecting malicious DNS tunnelling tools. The approach applies hierarchical machine learning classification to DNS over HTTPS (DoH) network traffic. The system prototype was evaluated on the CIRA-CIC-DoHBrw-2020 dataset and showed high accuracy in filtering DoH traffic, identifying suspicious DoH traffic, and detecting malicious DNS tunnelling tools.

Rose et al. [11] explore network profiling and machine learning as a means of securing Internet of Things (IoT) systems against cyber-attacks. The proposed anomaly-based intrusion detection solution profiles and monitors all networked devices in order to detect tampering with IoT devices and suspicious transactions within the network. When assessed on the Cyber-Trust testbed with a mixture of benign and malicious network data, the method attained an overall accuracy of 98.35% and a false-positive rate of 0.98%.

The dataset used by Rajesh and Satyanarayana [12] comprises real-time SCADA test bed data covering both routine operations and malicious intrusions. Feature selection methods such as Chi-Square, ANOVA, and LASSO are used to reduce the number of features. The researchers also apply SVMSMOTE, a variant of the Synthetic Minority Over-sampling Technique (SMOTE) that uses a Support Vector Machine (SVM) to guide the generation of minority samples, to address the class imbalance. Of the four machine learning algorithms evaluated, the SVM combined with feature filtering and SVMSMOTE performs best, reaching a Receiver Operating Characteristic (ROC) score of 99.96%.

Hwang et al. [13] present a traffic detection system called D-PACK. It combines a Convolutional Neural Network (CNN) with an Autoencoder, an unsupervised deep learning method, to perform traffic profiling and filtering. D-PACK identifies malicious network traffic by analysing only the first two packets, achieving an accuracy close to 100% with a false-positive rate of 0.83%. It is therefore effective at stopping malicious flows early.

Abdulhammed et al. [14] investigate the issue of imbalanced datasets in intrusion detection systems and apply several methods to deal with it. They empirically evaluate a range of machine learning classifiers, including deep neural networks, random forest, voting, variational autoencoder, and stacking, on the CIDDS-001 dataset. The proposed approach reaches a maximum accuracy of 99.99% in threat detection and handles imbalanced class distributions with a smaller number of samples. It is therefore suited to the challenges of integrating real-time data [16].

Design Methodology

This section describes the components of the proposed system for detecting intrusions in network logs. The system architecture is shown in Fig. 1.

Fig. 1. Architectural design.

Dataset

The dataset is an audited collection covering a diverse range of intrusions simulated in a military network environment. It was created by capturing raw TCP/IP dump data from a network that emulated a typical United States Air Force local area network (LAN). The LAN was operated as if it were a genuine environment and subjected to numerous attacks. A connection is a sequence of Transmission Control Protocol (TCP) packets that starts and ends within a specific time frame, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labelled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Data Pre-Processing

In the pre-processing stage, the data was cleaned and then scaled with the standard scaler. Data cleaning involves finding and correcting missing values, outliers, and inconsistencies in order to maintain data quality and reliability.
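As a minimal sketch of these two steps, the following standard-library Python fills missing values and then applies z-score scaling. It assumes that the "Standard scaler" refers to scikit-learn's StandardScaler, which subtracts the mean and divides by the population standard deviation. Mean imputation, the helper names `clean_column` and `standard_scale`, and the `src_bytes` column are illustrative assumptions; the text does not state how missing values were handled.

```python
import math
from statistics import fmean, pstdev

def clean_column(values):
    """Replace missing entries (None or NaN) with the mean of the observed values.

    Mean imputation is an assumption made for this sketch.
    """
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fill = fmean(observed)
    return [fill if v is None or math.isnan(v) else v for v in values]

def standard_scale(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column has no spread to scale
    return [(v - mean) / std for v in values]

# Hypothetical feature column with one missing value.
src_bytes = [200.0, None, 400.0, 600.0]
cleaned = clean_column(src_bytes)
scaled = standard_scale(cleaned)
print(cleaned)
print([round(v, 3) for v in scaled])
```

After scaling, the column has zero mean and unit standard deviation, so features measured on very different scales contribute comparably to the model.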

Feature Extraction

A random forest classifier was used to identify the most relevant features. Feature importance is derived from the random forest as follows:

1. For each tree b in the Random Forest and each node that splits on feature X_i:

  1. Before the split, compute the Gini impurity (or an alternative impurity measure) of the data at the node being split, denoted Gini_root.
  2. After the split, compute Gini_children, the weighted average Gini impurity of the child nodes. Each child’s weight is the number of data points in that child divided by the number of data points in its parent node.

2. The decrease in Gini impurity (or other impurity measure) measures how much tree b benefits from splitting on feature X_i:

  1. Decrease in Gini (X_i) = Gini_root − Gini_children

3. The average decrease in Gini impurity (or other impurity measure) for feature X_i over all B trees in the Random Forest is:

  1. Average decrease in Gini (X_i) = (1/B) × Σ Decrease in Gini (X_i) for all trees b

4. Normalize the feature importance values so that they sum to 1 (or to 100, if percentages are preferred).
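The steps above can be illustrated on a single split. The sketch below computes Gini_root, Gini_children, and their difference for a toy node, and then normalizes a set of averaged decreases. The toy labels, the feature names, and the averaged values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Gini_root - Gini_children for one split of `parent` into `left` and `right`."""
    n = len(parent)
    # Each child is weighted by its share of the parent's data points.
    gini_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - gini_children

def normalize(importances):
    """Rescale importance values so that they sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}

# Hypothetical node holding 4 normal and 4 anomalous connections.
parent = ["normal"] * 4 + ["anomaly"] * 4
left = ["normal"] * 3 + ["anomaly"] * 1
right = ["normal"] * 1 + ["anomaly"] * 3
decrease = gini_decrease(parent, left, right)
print(decrease)

# Hypothetical average decreases for three features after averaging over all trees.
average_decrease = {"feature_a": 0.125, "feature_b": 0.075, "feature_c": 0.05}
importance = normalize(average_decrease)
print(importance)
```

Here the parent node has impurity 0.5 and each child 0.375, so the split reduces the impurity by 0.125.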

Hybrid Model

A Random Forest classifier was used to extract the most important features from the dataset. The extracted features were then used as input to the gradient boosting classifier, which detects intrusions in network logs.

Model Output

The model output indicates whether a network flow is anomalous or normal.

Results and Discussion

The experiment was conducted in Jupyter Notebook and consisted of two main phases: Exploratory Data Analysis (EDA), which included ranking the features with the Random Forest Classifier, and training the classifier that detects malicious traffic on network systems.

Exploratory Data Analysis (EDA)

An exploratory data analysis (EDA) was performed on the dataset to gain a deeper understanding of its characteristics, beginning with the class distribution. Fig. 2 shows that the dataset is imbalanced, with a disparity in the number of instances across classes. Figs. 3 and 4 show the distribution of the attack class over the protocol_type feature and the flag feature, and these plots also show the imbalance. To address it, the SMOTE technique was applied in Python, increasing the number of instances in the minority classes to match that of the majority class. The balanced data is shown in Fig. 5. Fig. 6 presents the Pearson correlation matrix, which was computed to identify the features of the dataset that are correlated. The final step of the EDA was to identify the most significant features. The features were ranked with the Random Forest Classifier, and Fig. 7 displays the top ten features according to this ranking.
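The core idea of SMOTE can be sketched in a few lines: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The code below is a simplified, standard-library illustration of that idea and not the implementation used in the study; in practice one would more likely use the `SMOTE` class of the imbalanced-learn package with its `fit_resample` method. The function name `smote_sketch`, the value of `k`, and the toy data are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical two-feature data: 6 majority points and 3 minority points.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1), (0.2, 0.3)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because the new points are interpolated rather than copied, SMOTE adds variety to the minority class, whereas plain random oversampling only duplicates existing rows.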

Fig. 2. Count plot of the imbalanced data.

Fig. 3. Distribution of the attack class on protocol_type feature.

Fig. 4. Distribution of the attack class on flag feature.

Fig. 5. SMOTE analysis of the balanced class.

Fig. 6. Pearson correlation matrix.

Fig. 7. Important features.

The count plot in Fig. 2 shows that the normal class has more instances than the anomaly class.
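The Pearson correlation matrix in Fig. 6 rests on a simple pairwise computation, sketched here with the standard library only. The feature names and values are hypothetical and were chosen so that the expected correlations are obvious; they are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict mapping feature name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric features.
features = {
    "duration": [1.0, 2.0, 3.0, 4.0],
    "src_bytes": [2.0, 4.0, 6.0, 8.0],  # rises with duration
    "dst_bytes": [8.0, 6.0, 4.0, 2.0],  # falls as duration rises
}
matrix = correlation_matrix(features)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Pairs of features with a coefficient close to 1 or −1 carry largely the same information, which is why such a matrix is useful before feature selection.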

Model Training with Gradient Boost Classifier (GBC)

The GBC model was trained on the significant features. The hyperparameters selected after fine-tuning were: n_estimators = 100, learning_rate = 0.1, max_depth = 3, and random_state = 42. The trained model was then used to make predictions on unseen data in order to detect intrusions in network logs. Its performance was assessed with evaluation metrics, namely the classification report and the ROC curve. The results of the GBC model are presented in Figs. 8 and 9.
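The hybrid procedure can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data is a synthetic stand-in generated with `make_classification`, the 75/25 train/test split and the settings of the ranking forest are assumptions, and only the GBC hyperparameters are taken from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataset (assumption: 20 numeric features).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Step 1: rank the features with a random forest and keep the ten most important.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]

# Step 2: train the gradient boosting classifier on the selected features,
# using the hyperparameters reported in the text.
gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbc.fit(X_train[:, top10], y_train)

# Step 3: predict on held-out data that the model has not seen.
y_pred = gbc.predict(X_test[:, top10])
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the results reported in this study. Fitting the ranking forest on the training split alone keeps the held-out data out of the feature-selection step.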

Fig. 8. Classification report.

Fig. 9. ROC curve.
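The quantities listed in a classification report (accuracy, precision, recall, and F1-score) follow directly from the counts of correct and incorrect predictions. The sketch below computes them for a binary normal/anomaly task; the label lists are hypothetical, and `binary_report` is an illustrative helper rather than a library function.

```python
def binary_report(y_true, y_pred, positive="anomaly"):
    """Accuracy, precision, recall and F1-score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical predictions: one missed anomaly and one false alarm.
y_true = ["anomaly", "anomaly", "anomaly", "normal", "normal", "normal", "normal", "normal"]
y_pred = ["anomaly", "anomaly", "normal", "anomaly", "normal", "normal", "normal", "normal"]
report = binary_report(y_true, y_pred)
print({name: round(value, 3) for name, value in report.items()})
```

In this toy case the accuracy is 0.75 while precision, recall, and F1-score are each 2/3, which shows why all four values are worth reporting together.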

Conclusion

This study introduced a theoretical framework for identifying and analysing potentially harmful network traffic within a network infrastructure. To identify and classify malicious network traffic in a multi-class setting, we used a dataset covering nine distinct types of attack on network systems. Exploratory data analysis (EDA) was carried out on the dataset to assess missing values, correlations between features, class imbalance, and the most significant features. The EDA showed that the dataset is imbalanced, which, if left unaddressed, may result in overfitting to the majority class. The imbalance was addressed in Python by oversampling the minority classes with SMOTE. After balancing, a random forest classifier was used to rank the features, and the ten highest-ranked features were selected. These features were used to train the gradient boosting model, which detects malicious activity within a network system. The model detected malicious network traffic with a precision, recall, and F1-score of 99.99%.

References

  1. Gao M, Ma L, Liu H, Zhang Z, Ning Z, Xu J. Malicious network traffic detection based on deep neural networks and association analysis. Sensors. 2020;20(5):1452.
  2. Zheng J, Zeng Z, Feng T. GCN-ETA: high-efficiency encrypted malicious traffic detection. Secur Commun Netw. 2022;2022:1–11.
  3. Xin L, Ziang L, Yingli Z, Wenqiang Z, Dong L, Qingguo Z. TCN enhanced novel malicious traffic detection for IoT devices. Conn Sci. 2022;34(1):1322–41.
  4. Feng J, Shen L, Chen Z, Wang Y, Li H. A two-layer deep learning method for android malware detection using network traffic. IEEE Access. 2020;8:125786–96.
  5. Wang W, Zhu M, Zeng X, Ye X, Sheng Y. Malware traffic classification using convolutional neural network for representation learning. 2017 International Conference on Information Networking (ICOIN), pp. 712–7, IEEE, Jan 2017.
  6. Ge M, Fu X, Syed N, Baig Z, Teo G, Robles-Kelly A. Deep learning-based intrusion detection for IoT networks. 2019 IEEE 24th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 256–25609, IEEE, Dec 2019.
  7. Shafiq M, Tian Z, Bashir AK, Du X, Guizani M. CorrAUC: a malicious bot-IoT traffic detection method in IoT network using machine-learning techniques. IEEE Internet Things J. 2020;8(5):3242–54.
  8. Bendiab G, Shiaeles S, Alruban A, Kolokotronis N. IoT malware network traffic classification using visual representation and deep learning. 2020 6th IEEE Conference on Network Softwarization (NetSoft), pp. 444–9, IEEE, Jun 2020.
  9. De Lucia MJ, Cotton C. Detection of encrypted malicious network traffic using machine learning. MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), pp. 1–6, IEEE, Nov 2019.
  10. Mitsuhashi R, Satoh A, Jin Y, Iida K, Shinagawa T, Takai Y. Identifying malicious dns tunnel tools from doh traffic using hierarchical machine learning classification. Information Security: 24th International Conference, ISC 2021, Virtual Event, November 10–12, 2021, Proceedings 24, pp. 238–56, Springer International Publishing, 2021.
  11. Rose JR, Swann M, Bendiab G, Shiaeles S, Kolokotronis N. Intrusion detection using network traffic profiling and machine learning for IoT. 2021 IEEE 7th International Conference on Network Softwarization (NetSoft), pp. 409–15, IEEE, Jun 2021.
  12. Rajesh L, Satyanarayana P. Evaluation of machine learning algorithms for detection of malicious traffic in scada network. J Electr Eng Technol. 2021;1–16.
  13. Hwang RH, Peng MC, Huang CW, Lin PC, Nguyen VL. An unsupervised deep learning model for early network traffic anomaly detection. IEEE Access. 2020;8:30387–99.
  14. Abdulhammed R, Faezipour M, Abuzneid A, AbuMallouh A. Deep and machine learning approaches for anomaly-based intrusion detection of imbalanced network traffic. IEEE Sens Lett. 2018;3(1):1–4.
  15. Indrasiri PL, Lee E, Rupapara V, Rustam F, Ashraf I. Malicious traffic detection in iot and local networks using stacked ensemble classifier. Comput Mater Contin. 2022;71(1):489–515.
  16. Alshammari A, Aldribi A. Apply machine learning techniques to detect malicious network traffic in cloud computing. J Big Data. 2021;8(1):1–24.