Pundra University of Science & Technology, Bangladesh
* Corresponding author
Pundra University of Science & Technology, Bangladesh
Varendra University, Bangladesh
Varendra University, Bangladesh
Varendra University, Bangladesh
Varendra University, Bangladesh

Ransomware is a major cybersecurity threat with severe financial and operational impact on industries worldwide. This paper investigates machine learning algorithms for ransomware detection and contrasts them with conventional methods, which consistently fall prey to dynamically changing attacks. Several algorithms were evaluated, including Support Vector Machines, Random Forest, Gradient Boosting, Artificial Neural Networks, Logistic Regression, and ensemble methods. The ensemble of Gradient Boosting and Logistic Regression achieved a validation accuracy of 100%, and Random Forest achieved a validation accuracy of 100% with 99.99% recall. These findings support the viability of machine learning for detecting both known and unknown forms of ransomware, and the work opens avenues for developing sophisticated, adaptive anti-ransomware frameworks.

Introduction

Ransomware has become one of the most common and difficult cyber threats to defend against. This kind of malware encrypts critical files and systems and holds them hostage until a ransom is paid, typically in cryptocurrency. Increasingly sophisticated ransomware variants exploit the vulnerabilities of organizations across many sectors. In 2024, Synnovis, a provider under the UK’s National Health Service, was hit by a ransomware attack that cost it an estimated £32.7 million [1]. That figure far exceeded its prior-year profit of £4.3 million, showing the massive financial damage such an attack can cause. The incident disrupted thousands of medical procedures in London, underscoring the critical risk that ransomware poses to public health systems [2]. The financial sector has also suffered heavily: reports from early 2024 indicate that more than 65% of firms had been hit by ransomware. Recovery costs in this sector are high, averaging $2.58 million per attack, in addition to compromised client data and reputational damage [3]. Cultural institutions have been affected as well. In late 2023, the British Library experienced a significant ransomware attack; it refused to pay, and as a result 600 GB of data was exposed [4], [5].

In this research, we systematically assessed multiple ML models—including SVM, Random Forest, Gradient Boosting, Logistic Regression, and ANN—as well as ensemble methods, using a dataset of 138,047 samples with 57 key features. Our experiments demonstrated near-perfect detection accuracy and minimal false positives in detecting both recognized and previously unknown ransomware variants.

To facilitate understanding and navigation, the manuscript is divided into eight sections. Section 1 introduces ransomware and our contribution to this research. Section 2 provides an extensive review of previous research. Section 3 defines the research problem, hypotheses, and research objectives. Section 4 describes the methodology, including dataset preparation and model development, and Section 5 presents the experimental findings and their analysis. Section 6 discusses the limitations and future research directions, and Section 7 concludes the paper. Section 8 contains references and further reading.

Literature Review

Recent advancements in machine learning have significantly improved ransomware detection, addressing challenges posed by evolving threats and encryption techniques. Various studies have explored different algorithms, highlighting their strengths and limitations in identifying malicious activities. This section examines significant research contributions in the field, focusing on innovative approaches.

Ransomware Detection using GANs

Wiles et al. proposed a Generative Adversarial Network (GAN)-based method for ransomware detection via network traffic analysis. By simulating normal behavior, it detected anomalies with 94.2% accuracy and an AUC of 0.952. This approach addressed limitations of traditional methods, including encrypted traffic and evolving threats, and proved effective in real-time use [6].

Ransomware Detection through Machine Learning

Masum et al. designed a feature-selection-based framework that uses conventional machine learning and neural network classifiers to identify and classify ransomware. Random Forest performed best, with the highest accuracy (99%) and the highest precision, and 10-fold cross-validation consistently produced similar values. The study stressed the importance of feature selection and the effectiveness of machine learning for ransomware identification [7].

Ransomware Detection using Random Forests

Wu et al. proposed a machine learning-based ransomware detection system for Linux environments using the Random Forest algorithm. Working with a mixed corpus of ransomware and benign samples from a range of Linux distributions, they focused feature extraction on file metadata, system call sequences, and network behavior. The Random Forest model attained 94% accuracy, 93% precision, 89% recall, and an AUC of 0.96, outperforming traditional methods such as SVMs and neural networks. The work addressed the challenges that high-dimensional data and varied system configurations pose for Linux ransomware detection, and it demonstrated adaptability to new ransomware variants. Computational complexity and preprocessing requirements remain concerns, but the work demonstrates the efficacy of the Random Forest algorithm in strengthening cybersecurity in Linux environments [8].

Ransomware Identification via Cooperative Clustering

Panaras et al. proposed a cooperative clustering model that identifies ransomware through real-time collaboration between distributed agents. Using unsupervised techniques, it reached 96.8% accuracy with a 2.7% false positive rate, outperforming traditional techniques. It scaled well and remained effective under varying network environments, despite memory overhead and intermittent false positives. The model is adaptive to changing ransomware attack environments [9].

Problem, Hypothesis and Research Objectives

The growing sophistication of ransomware attacks poses a critical challenge to traditional cybersecurity defenses. Conventional methods often struggle to detect new variants due to advanced evasion techniques. Machine learning offers a promising alternative by identifying behavioral patterns and anomalies indicative of ransomware activity. This section defines the research problem, outlines key hypotheses, and establishes objectives aimed at developing an effective ransomware detection system.

Need for Advanced Methods of Detection

Traditional detection systems have fallen short against the constantly changing tactics of cybercriminals. Most ransomware variants are built to circumvent traditional defenses using polymorphic encryption, fileless attacks, and other advanced obfuscation techniques. This makes it important to seek more adaptive and proactive ways of detecting ransomware.

ML is promising because it analyzes behavior patterns and anomalies that indicate ransomware activity. Unlike traditional methods, ML models can detect both known and previously unseen versions of ransomware. Several studies of ML-based approaches in different fields report near-perfect classification performance. These developments show how machine learning can significantly improve defenses against ransomware, reduce the risk of such attacks, and make critical infrastructure more resilient.

Hypothesis Statement

The implementation of machine learning methodologies will significantly improve the detection of ransomware compared to traditional methods by attaining greater precision while minimizing false positive occurrences.

Primary Hypothesis: Machine learning-based detection models will outperform traditional detection systems in identifying both known and unknown (zero-day) ransomware variants [10].

Secondary Hypothesis: Certain algorithms, such as Random Forest, SVM, or Neural Networks, will perform better than others on specific metrics, such as accuracy and false positive rate, in ransomware detection [11].

Research Objectives

• To design an ML-powered system for detecting ransomware attacks in real-life scenarios.

• To compare the performance of a range of algorithms.

• To design a robust model with high detection accuracy and zero false positives.

• To strengthen cybersecurity with insights for improving anti-ransomware strategies.

Methodology

This section outlines the approach used for ransomware detection, including dataset collection, feature selection, model training, and evaluation. Various machine learning techniques were tested to achieve optimal accuracy and efficiency. Fig. 1 illustrates the overall training and evaluation process.

Fig. 1. Machine learning model training process.

Dataset Collection and Features

The dataset used in this study consisted of 138,047 samples and 57 attributes. These attributes cover a wide range of technical properties, including filenames, MD5 hashes, structural elements (machine type, SizeOfOptionalHeader, and Characteristics), memory organization, and entropy-related features. Entropy-based features, most notably SectionsMeanEntropy and SectionsMinEntropy, were particularly useful for distinguishing among obfuscation methods. Other attributes, such as Subsystem, DllCharacteristics, and LoadConfigurationSize, helped characterize the behavior of the files. The dependent variable “legitimate” distinguishes benign files (value 1) from malicious files (value 0), which enables supervised learning. A feature importance analysis showed that SectionsMeanEntropy, SizeOfInitializedData, and ResourcesNb were among the most discriminative attributes.
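To make the entropy-related features concrete, the following is a minimal sketch of how a per-section Shannon entropy and its mean and minimum could be computed. The paper does not state how SectionsMeanEntropy or SectionsMinEntropy were derived, so the functions `shannon_entropy` and `section_entropy_stats` and the two example sections are hypothetical illustrations, not the dataset's actual extraction code.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


def section_entropy_stats(sections):
    """Summarize per-section entropies as mean, min, and max."""
    values = [shannon_entropy(s) for s in sections]
    return {"mean": sum(values) / len(values), "min": min(values), "max": max(values)}


# Hypothetical sections: constant padding versus uniformly distributed bytes.
padding = b"\x00" * 1024
uniform = bytes(range(256)) * 4
stats = section_entropy_stats([padding, uniform])
print(stats)
```

Constant padding carries no information and scores 0 bits per byte, while uniformly distributed bytes score the maximum of 8, which is why packed or encrypted content tends to stand out on these features.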

Data Preparation

Preprocessing was performed using the Python libraries Pandas and Scikit-learn. Missing values, if any, were cleaned, and categorical features were encoded to ensure data quality. The dataset was then split into the features X and the target variable y. Numerical data were standardized with StandardScaler, applied uniformly across features to support efficient model convergence.
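The preparation steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the helper name `prepare_features`, dropping rows with missing values (the paper only says they were cleaned), one-hot encoding via `pandas.get_dummies`, and the tiny example frame are all hypothetical. Only the `legitimate` target name and the feature names come from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_features(df, target="legitimate"):
    """Clean, encode, split into X / y, and standardize the features."""
    df = df.dropna()                        # assumption: rows with missing values are dropped
    y = df[target].to_numpy()               # target: 1 = benign, 0 = malicious
    X = df.drop(columns=[target])
    X = pd.get_dummies(X, dtype=float)      # assumption: one-hot encode categorical columns
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
    return X_scaled, y, list(X.columns)


# Hypothetical miniature frame using feature names mentioned in the paper.
demo = pd.DataFrame({
    "SectionsMeanEntropy": [6.1, 7.8, 3.2, 7.9, np.nan, 4.0],
    "SizeOfInitializedData": [1024, 4096, 512, 8192, 2048, 256],
    "legitimate": [1, 0, 1, 0, 1, 1],
})
X_demo, y_demo, feature_names = prepare_features(demo)
print(X_demo.shape, feature_names)
```

When the data are later split for evaluation, fitting the scaler on the training portion only is the usual way to keep test statistics from leaking into training.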

Feature Selection

Because all features other than the target are relevant to ransomware detection, none were excluded. However, features concerning entropy, memory layout, and file structure were given more weight than the others, since these are the usual places where malicious software reveals its patterns.

Pseudocode for SVM and Ensemble (ANN + RF)

SVM

The dataset is loaded and split into features (X) and the target (y). An SVM classifier is initialized, and 5-fold stratified cross-validation is performed. For each fold, the model is trained, tested, and evaluated using accuracy, precision, recall, F1-score, AUC, false positive/negative rates, and validation accuracy. The ROC curve is plotted for each fold. Finally, the mean performance metrics are computed and displayed.
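The SVM procedure described above can be sketched in Scikit-learn as follows. The synthetic data from `make_classification`, the function name `cross_validate_svm`, the shuffling seed, and the choice to scale inside each fold are assumptions made for illustration. The kernel, C, and gamma values follow Table I, and ROC plotting is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cross_validate_svm(X, y, n_splits=5, seed=42):
    """Run stratified k-fold CV for an RBF SVM and return the mean of each metric."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = []
    for train_idx, test_idx in skf.split(X, y):
        # Scaling is fitted inside each fold so the test fold stays unseen.
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
        model.fit(X[train_idx], y[train_idx])
        y_true = y[test_idx]
        y_pred = model.predict(X[test_idx])
        scores = model.decision_function(X[test_idx])  # continuous scores for AUC
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        folds.append({
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
            "auc": roc_auc_score(y_true, scores),
            "fpr": fp / (fp + tn),
            "fnr": fn / (fn + tp),
        })
    # Average every metric over the folds.
    return {name: float(np.mean([f[name] for f in folds])) for name in folds[0]}


# Synthetic stand-in data; the paper's dataset is not reproduced here.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
mean_metrics = cross_validate_svm(X_demo, y_demo)
print(mean_metrics)
```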

Ensemble (ANN+RF)

The dataset is imported and divided into training and testing sets, and standardized. An ANN model and Random Forest (RF) model are trained separately. Their predictions are averaged to form an ensemble. Key metrics like accuracy, precision, recall, F1-score, AUC, and a confusion matrix are computed. Finally, a heatmap for the confusion matrix and an ROC curve are plotted for evaluation.
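A rough sketch of the probability-averaging step is shown below. To keep the example dependent on Scikit-learn only, the Keras ANN is replaced by `MLPClassifier`, which mirrors the 128-64-32 ReLU layout but has no dropout, so it is a stand-in and not the paper's network. The synthetic data, the 80/20 split, the 0.5 cut-off on the averaged probability, and the function name `averaged_ensemble` are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler


def averaged_ensemble(X, y, test_size=0.2, seed=42):
    """Train an MLP and a Random Forest separately, then average their probabilities."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    scaler = StandardScaler().fit(X_tr)          # fit scaling on the training split only
    X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

    # Stand-in for the Keras ANN (no dropout available in MLPClassifier).
    ann = MLPClassifier(hidden_layer_sizes=(128, 64, 32), activation="relu",
                        learning_rate_init=0.001, max_iter=500, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=seed)
    ann.fit(X_tr_s, y_tr)
    rf.fit(X_tr_s, y_tr)

    # Average the positive-class probabilities of the two models.
    proba = (ann.predict_proba(X_te_s)[:, 1] + rf.predict_proba(X_te_s)[:, 1]) / 2
    y_pred = (proba >= 0.5).astype(int)          # assumed cut-off
    return {
        "accuracy": accuracy_score(y_te, y_pred),
        "auc": roc_auc_score(y_te, proba),
        "confusion_matrix": confusion_matrix(y_te, y_pred, labels=[0, 1]),
    }


X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
result = averaged_ensemble(X_demo, y_demo)
print(result["accuracy"], result["auc"])
print(result["confusion_matrix"])
```

The confusion matrix returned here is the array that would be passed to a heatmap plot, and `proba` is what an ROC curve would be drawn from.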

Model Development and Hyperparameter Optimization

Table I lists the hyperparameters used for each model in this study.

| Model | Hyperparameters |
| --- | --- |
| Support Vector Machine (SVM) | Kernel: RBF; C (regularization): 1.0; Gamma (kernel coefficient): scale |
| Logistic Regression | Solver: lbfgs; C (regularization): 1.0; Max iterations: 1000; Threshold: 0.6 |
| Gradient Boosting | n_estimators: 100; Learning rate: 0.1; Max depth: 3 |
| Random Forest (RF) | n_estimators: 100; Max depth: None; Class weight: balanced |
| Artificial Neural Network (ANN) | Layers: 3 dense layers (128-64-32 neurons); Activation: ReLU; Dropout: 0.3; Optimizer: Adam; Learning rate: 0.001; Loss function: binary cross-entropy; Final activation: sigmoid |
| Ensemble (LR + GB) | Logistic Regression: same as above; Gradient Boosting: same as above |
| Ensemble (ANN + RF) | ANN: same as above; Random Forest: same as above |

Table I. Comparison of Model Hyperparameters

Support Vector Machine (SVM) Implementation

The SVM model was implemented with Scikit-learn’s SVC using an RBF kernel, which can learn complex, non-linear relations in the data. Its parameters were a regularization parameter C = 1.0 and a kernel coefficient gamma = scale. To account for imbalanced classes and to test the robustness of the model, stratified five-fold cross-validation was performed [12].

Logistic Regression (LR) Implementation

Logistic regression was implemented with Scikit-learn’s LogisticRegression module using the lbfgs solver, which is suited to larger datasets. The iteration limit was set to 1000 to ensure convergence, with regularization C = 1.0. A custom decision threshold of 0.6 was applied in place of the default, with the aim of improving recall and balancing predictive performance [13].
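Scikit-learn's LogisticRegression has no threshold parameter, so a custom cut-off is normally applied to the output of `predict_proba`. The sketch below shows one way to do that. The helper `predict_with_threshold` and the synthetic data are hypothetical, and which class the paper treated as positive when applying 0.6 is not stated. Note that for whichever class is treated as positive, a threshold above 0.5 can only keep or reduce the number of positive predictions, so it tends to raise precision and lower recall for that class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def predict_with_threshold(model, X, threshold=0.6):
    """Label a sample positive only if its class-1 probability reaches the threshold."""
    return (model.predict_proba(X)[:, 1] >= threshold).astype(int)


# Synthetic stand-in data; the paper's dataset is not reproduced here.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

clf = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000).fit(X_tr, y_tr)

pred_default = predict_with_threshold(clf, X_te, threshold=0.5)
pred_strict = predict_with_threshold(clf, X_te, threshold=0.6)

recall_default = recall_score(y_te, pred_default)
recall_strict = recall_score(y_te, pred_strict)
print("threshold 0.5: precision", precision_score(y_te, pred_default), "recall", recall_default)
print("threshold 0.6: precision", precision_score(y_te, pred_strict), "recall", recall_strict)
```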

Gradient Boosting (GB) Implementation

Gradient Boosting was implemented with Scikit-learn’s GradientBoostingClassifier. Its main hyperparameters were n_estimators = 100, learning_rate = 0.1, and max_depth = 3, chosen to balance accuracy against overfitting. Stratified cross-validation was applied to compute the metrics [14].

Random Forest (RF) Implementation

A Random Forest model was built using Scikit-learn’s RandomForestClassifier with 100 trees and unrestricted maximum depth (max_depth = None). The class_weight = balanced setting was used to counter the imbalance in the target classes. A fixed random seed was set for reproducibility so that the models could be evaluated fairly [15].
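As an illustration, the configuration above can be written as follows, together with a ranking of impurity-based feature importances of the kind a feature importance analysis would produce. The imbalanced synthetic data, the seed value 42, and the generic feature names are assumptions; the paper does not say which importance measure it used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, deliberately imbalanced stand-in data (about 70% / 30%).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, weights=[0.7, 0.3], random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

rf = RandomForestClassifier(
    n_estimators=100,         # 100 trees
    max_depth=None,           # trees grow until leaves are pure
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=42,          # assumed seed for reproducibility
)
rf.fit(X_demo, y_demo)

# Rank features by impurity-based importance, highest first.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking[:3]:
    print(f"{name}: {importance:.3f}")
```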

Artificial Neural Network (ANN) Implementation

The ANN model was built with TensorFlow’s Keras API. Its architecture has three dense layers, with 128 neurons in the first (input) layer, 64 in the second, and 32 in the third, all using the ReLU activation function. A dropout rate of 0.3 was applied to avoid overfitting. The model was trained using the Adam optimizer with a learning rate of 0.001, and the output layer used the sigmoid activation [16].

Ensemble Methods Implementation

This study used two ensemble models that combine the strengths of multiple algorithms to improve classification performance. The first ensemble combined Logistic Regression with Gradient Boosting using a soft VotingClassifier, which averages the predicted probabilities of the two models for reliability. The Logistic Regression model was allowed up to 1000 iterations, while Gradient Boosting used 100 estimators, a learning rate of 0.1, and a maximum depth of three. The second ensemble combined an Artificial Neural Network (ANN) with a Random Forest: the outputs of the ANN (three hidden layers with ReLU activation and 30% dropout) were combined with those of the Random Forest (100 estimators with balanced class weights) to identify complex patterns in the data efficiently [17]. Models were assessed using accuracy, precision, recall, F1-score, AUC, false positive and false negative rates, and ROC curves.
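The first ensemble can be sketched with Scikit-learn's VotingClassifier in soft-voting mode. The hyperparameters follow Table I, while the synthetic data, the train/test split, the scaling step placed in front of Logistic Regression, and the seed are assumptions made for illustration. The custom 0.6 threshold is not applied here, since the paper does not say whether it was used inside the ensemble.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the paper's dataset is not reproduced here.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000))),
        ("gb", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                          max_depth=3, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities instead of counting votes
)
ensemble.fit(X_tr, y_tr)

ensemble_proba = ensemble.predict_proba(X_te)
# With equal weights, soft voting is the plain mean of the members' probabilities.
member_mean = np.mean([m.predict_proba(X_te) for m in ensemble.estimators_], axis=0)
test_accuracy = ensemble.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Soft voting lets a confident member outweigh an uncertain one, which is the usual reason to prefer it over hard voting when both models produce usable probabilities.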

Result and Discussion

This section presents the findings of the study, analyzing the performance of different machine learning models in detecting ransomware. The findings are assessed based on essential performance indicators, including accuracy, precision, recall, F1-score, and the area under the curve (AUC). A comparative analysis of the models is provided in Table II.

| Model | Accuracy | Precision | Recall | F1 score | AUC score | False positive rate | False negative rate | Validation accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVM | 0.9961 | 1.0000 | 0.9871 | 0.9935 | 1.0000 | 0.0000 | 0.0129 | 0.9961 |
| ANN | 0.9034 | 0.9491 | 0.7166 | 0.8145 | 0.8527 | not reported | not reported | 0.9034 |
| Random Forest | 0.9995 | 0.9985 | 0.9999 | 0.9992 | 1.0000 | 0.0006 | 0.0001 | 1.0000 |
| ANN & R.F. Ensemble | 0.9992 | 0.9982 | 0.9993 | 0.9987 | 1.0000 | 15 (count) | 6 (count) | 0.9990 |
| Logistic Regression | 0.8405 | 0.9937 | 0.4733 | 0.6110 | 0.7271 | 0.0026 | 0.5267 | 0.8362 |
| Gradient Boosting | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0 | 0 | 1.0000 |
| GB & LR Ensemble | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0 | 0 | 1.0000 |

Table II. Comparison of Model Performance

Accuracy and Other Performance Metrics

The performance metrics confirmed that tree-based models such as Gradient Boosting and Random Forest perform much better at separating benign from malicious files. They scored very highly at identifying non-linear patterns and at exploiting entropy-based characteristics. With ensemble methods, performance reached the optimum score across all metrics. This supports the value of combining complementary algorithms in ransomware detection [18].

Evaluation of ML Performance Metrics

When evaluating classification algorithms in machine learning, metrics such as accuracy, precision, recall, and F1-score are crucial. Together they give a fuller picture of a model's capability than any single score, covering both how often its predictions are correct and how many of the relevant cases it finds [19]. In this study, these metrics were instrumental in evaluating the ML models developed for ransomware detection. Given how critical it is to differentiate accurately between benign and malicious files, this choice of metrics ensures a robust assessment. Furthermore, the study prioritized minimizing false positive and false negative rates, as these directly affect the reliability and applicability of the detection system.

Comparative Analysis with Existing Studies

Compared with previous research, this study reports a better trade-off between precision and recall in ransomware detection. The high AUC scores achieved by Gradient Boosting and the ensemble techniques indicate a strong ability to discriminate between normal and ransomware-related files, placing these models alongside state-of-the-art developments in this area [8]. Figs. 2 and 3 show how the ROC curve and the confusion matrix differ between the simple ANN and the ensemble of ANN + Random Forest.

Fig. 2. ROC curve of (left) simple ANN model and (right) ensemble ANN + Random Forest model.

Fig. 3. Confusion matrices of (left) simple ANN model and (right) ensemble ANN + RF model.

ROC Curve Analysis

Fig. 2 compares the ROC curves of the (left) standalone ANN and (right) the ensemble ANN + RF model. The ANN (left) achieves an AUC of 0.8527, showing good but improvable classification performance. In contrast, the ensemble model (right) boosts the AUC to 1.0000, indicating a near-perfect classification ability with significantly enhanced sensitivity and specificity.

Confusion-Matrix Analysis

Fig. 3 compares the classification performance of (left) the standalone ANN and (right) the ANN + Random Forest ensemble. The ANN (left) correctly classifies 95,538 legitimate and 31,253 malicious files but produces 1,186 false positives and 10,070 false negatives. In contrast, the ensemble (right) reduces errors to just 15 false positives and 6 false negatives while correctly identifying 19,330 legitimate and 8,259 malicious files, highlighting greatly improved sensitivity and specificity.
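The counts reported for the ensemble can be converted into rates with a few lines of arithmetic. The sketch below treats the malicious class as positive, following the wording above; that assignment and the helper name `rates_from_counts` are assumptions. Under it, the accuracy, precision, recall, and F1 values round to the figures in the ANN & R.F. Ensemble row of Table II, and the 15 false positives and 6 false negatives correspond to rates of roughly 0.0008 and 0.0007.

```python
def rates_from_counts(tp, fp, fn, tn):
    """Derive standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),   # share of negatives wrongly flagged
        "fnr": fn / (fn + tp),   # share of positives that were missed
    }


# Ensemble counts from Fig. 3; malicious is assumed to be the positive class.
ensemble_rates = rates_from_counts(tp=8259, fp=15, fn=6, tn=19330)
for name, value in ensemble_rates.items():
    print(f"{name}: {value:.4f}")
```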

Performance Analysis

These results underline that tree-based models, Gradient Boosting and Random Forest in this case, are especially good at detecting the complex patterns in the dataset. Their ability to exploit entropy-related features significantly increases their precision and recall scores. Although Artificial Neural Networks showed promise, their lower recall indicates a need for better hyperparameter tuning and for methods that mitigate overfitting. In addition, ensemble techniques proved effective by aggregating multiple algorithms, indicating flexibility and reliability across a variety of situations.

Limitation and Future Work

Study Limitations

This study acknowledges several limitations. First, the dataset used, although extensive, may not comprehensively represent the wide and continuously evolving range of ransomware strains. This raises concerns about how well the models' performance will generalize to novel or zero-day ransomware attacks. Second, some of the methods evaluated, in particular the Artificial Neural Network (ANN) and the ensemble methods, are computationally demanding. This makes them harder to deploy in real-time applications, especially in resource-restricted settings such as embedded platforms, Internet of Things (IoT) sensors, or small companies. Third, the performance of these models depends on properties specific to individual domains and on the classification threshold chosen. Different application domains have their own characteristics, and models need to be tuned and tailored to them to remain applicable.

Future Work

In the future, we will explore how deep learning algorithms can achieve even greater accuracy and efficiency in ransomware detection. Additionally, efforts will focus on expanding the dataset by incorporating new strains of ransomware, aiming to enhance generalizability and adaptability to emerging threats over the course of development.

Conclusion

This research confirms the efficacy and efficiency of machine learning algorithms for ransomware detection, in particular ensembles such as Logistic Regression + Gradient Boosting and Random Forest + ANN, which achieved high accuracy and low false positive rates. Through its use of rich features and well-chosen algorithms, the work sets a high standard for cybersecurity practice. Even so, dataset representativeness and computational efficiency leave room for future work. This work contributes to the growing body of research on ransomware detection and offers a starting point for more effective defenses.

References

  1. Sollof J. Cyber attack cost Synnovis estimated £32.7m in 2024 [Internet]. Digital Health. 2025. Available from: https://www.digitalhealth.net/2025/01/cyber-attack-cost-synnovis-estimated-32-7m-in-2024.
  2. Sollof J. Synnovis cyber attack caused two cases of severe patient harm [Internet]. Digital Health. 2025. Available from: https://www.digitalhealth.net/2025/01/synnovis-attack-led-to-at-least-two-cases-of-severe-patient-harm/.
  3. Arcserve. Ransomware hit over two-thirds of financial services firms in 2024: 5 steps to ensure your firm can recover from an attack [Internet]. Arcserve. 2024 [cited 2025 Jan 25]. Available from: https://www.arcserve.com/blog/ransomware-hit-over-two-thirds-financial-services-firms-2024-5-steps-ensure-your-firm-can.
  4. British Library. Learning lessons from the cyber-attack: British Library cyber incident review [Internet]. 2024 Mar 8. Available from: https://www.bl.uk/home/british-library-cyber-incident-review-8-march-2024.pdf.
  5. Report: British Library cyber incident review | NISO website [Internet]. Niso.org. 2024. Available from: https://www.niso.org/niso-io/2024/03/report-british-library-cyber-incident-review.
  6. Wiles A, Colombo F, Mascorro R. Ransomware detection using network traffic analysis and generative adversarial networks. Authorea Preprints. 2024. [Internet]. Available from: https://www.authorea.com/users/831526/articles/1224977-ransomware-detection-using-network-traffic-analysis-and-generative-adversarial-networks.
  7. Masum M, Faruk MJ, Shahriar H, Qian K, Lo D, Adnan MI. Ransomware classification and detection with machine learning algorithms. 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0316–22, IEEE, 2022 Jan 26.
  8. Wu YC, Chang YL. Ransomware detection on Linux using machine learning with random forest algorithm. Authorea Preprints. 2024 Jun 7. [Internet]. Available from: https://www.techrxiv.org/users/789959/articles/1053538-ransomware-detection-on-linux-using-machine-learning-with-random-forest-algorithm.
  9. Panaras A, Silverstein B, Edwards S. Automated cooperative clustering for proactive ransomware detection and mitigation using machine learning. Authorea Preprints. 2024 Sep 20. [Internet]. Available from: https://www.techrxiv.org/users/833673/articles/1226582-automated-cooperative-clustering-for-proactive-ransomware-detection-and-mitigation-using-machine-learning.
  10. Alraizza A, Algarni A. Ransomware detection using machine learning: a survey. Big Data Cogn Comput. 2023 Aug 16;7(3):143.
  11. Rani N, Dhavale SV. Leveraging machine learning for ransomware detection [Internet]. arXiv.org. 2022. Available from: https://arxiv.org/abs/2206.01919.
  12. Saputra H, Stiawan D, Satria H. Malware detection in portable document format (PDF) files with byte frequency distribution (BFD) and support vector machine (SVM). Jurnal Ilmiah Teknik Elektro Komputer dan Informatika (JITEKI). 2023 Dec;9(4):1144–53.
  13. Chen J, Zhang G. Detecting stealthy ransomware in IPFS networks using machine learning. OSF Preprints. 2024. [Internet]. Available from: https://osf.io/38ex9.
  14. Adejumo IE, Ayeni OA. Detection of ransomware using random forest, support vector machine and gradient boosting techniques. Int J Appl Inf Syst. 2024 Mar;12(42):63–70. [Internet]. Available from: https://www.ijais.org/archives/volume12/number42/detection-of-ransomware-using-random-forest-support-vector-machine-and-gradient-boosting-techniques/.
  15. Rafapa J, Konokix A. Ransomware detection using aggregated random forest technique with recent variants. Authorea Preprints. 2025 Apr 3. [Internet]. Available from: https://www.authorea.com/users/816233/articles/1216996-ransomware-detection-using-aggregated-random-forest-technique-with-recent-variants.
  16. Cahyani ND, Nuha HH. Ransomware detection on bitcoin transactions using artificial neural network methods. 2021 9th International Conference on Information and Communication Technology (ICoICT), pp. 1–5, IEEE, 2021 Aug 3.
  17. Marais B, Quertier T, Morucci S. AI-based malware and ransomware detection models. arXiv preprint arXiv:2207.02108. 2022 Jul 5.
  18. Bajaj A. Performance metrics in machine learning [Complete Guide] [Internet]. neptune.ai. 2022. Available from: https://neptune.ai/blog/performance-metrics-in-machine-learning-complete-guide.
  19. Google LLC. Classification: accuracy, recall, precision, and related metrics [Internet]. Google for Developers. 2024. Available from: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall.