Investigating a SMOTE-Tomek Boosted Stacked Learning Scheme for Phishing Website Detection: A Pilot Study
(Eferhire Valentine Ugbotu, Frances Uchechukwu Emordi, Emeke Ugboh, Kizito Eluemunor Anazia, Christopher Chukwufunaya Odiakaose, Paul Avwerosuoghene Onoma, Rebecca Okeoghene Idama, Arnold Adimabua Ojugo, Victor Ochuko Geteloma, Amanda Enaodona Oweimieotu, Tabitha Chukwudi Aghaunor, Amaka Patience Binitie, Anne Odoh, Chris Chukwudi Onochie, Peace Oguguo Ezzeh, Andrew Okonji Eboka, Joy Agboi, Patrick Ogholuwarami Ejeh)
DOI : 10.62411/jcta.14472
- Volume: 3,
Issue: 2,
Citations : 80 01-Oct-2025
| Last.29-Jan-2026
Abstract:
The daily exchange of information over the Internet has eased the spread of resources and improved the accessibility, availability, and interoperability of connected devices. The recent proliferation of smartphones and other computing devices has also advanced features such as miniaturization, portability, ease of data access, and mobility. It has equally given rise to adversarial attacks that target network infrastructures and aim to exploit interconnected, shared resources. These exploits seek to compromise an unsuspecting user's device. The increased susceptibility to these attacks, and their success rate, have been traced to users' personality traits and behaviours, which leave them repeatedly vulnerable to such exploits, especially those delivered as malicious content on spoofed websites. Our study proposes a stacked, transfer learning approach to classify the malicious content that adversaries deploy on spoofed phishing websites. The stacked approach uses three base classifiers, namely Cultural Genetic Algorithm, Random Forest, and Korhonen Modular Neural Network, whose outputs serve as input to an XGBoost meta-learner. Two major challenges with learning schemes are the selection of appropriate features for estimation and the imbalanced nature of the dataset, in which the target class is often under-represented. Our study resolved the dataset imbalance using SMOTE-Tomek and selected predictors using relief rank feature selection. Results show that our hybrid yields F1 0.995, Accuracy 0.997, Recall 0.998, Precision 1.000, AUC-ROC 0.997, and Specificity 1.000 to accurately classify all 2,764 cases of its held-out test dataset. The results affirm that it outperformed benchmark ensembles.
The proposed model was evaluated on the UCI Phishing Website dataset and effectively classified phishing content (cues and lures) on websites.
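The SMOTE-Tomek balancing step named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the neighbour count `k`, the choice to drop both members of each Tomek link, and the toy points are all assumptions made for illustration.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: euclidean(points[i], points[j]))

def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link: a pair of opposite-class
    points that are each other's nearest neighbour."""
    nn = [nearest(i, points) for i in range(len(points))]
    drop = {i for i in range(len(points))
            if nn[nn[i]] == i and labels[i] != labels[nn[i]]}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]

def smote_tomek(points, labels, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then clean Tomek links."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_majority = len(points) - len(minority)
    new = smote(minority, n_majority - len(minority), k, seed)
    return remove_tomek_links(points + new,
                              labels + [minority_label] * len(new))

# Hypothetical toy data: six majority points (label 0), three minority (label 1).
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
minority = [(5, 5), (5, 6), (6, 5)]
balanced_points, balanced_labels = smote_tomek(majority + minority,
                                               [0] * 6 + [1] * 3)
```

In practice the resampling would be fitted on the training split only, so that synthetic points never leak into the held-out test data.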
Resolving Data Imbalance Using a Bi-Directional Long-Short Term Memory for Enhanced Diabetes Mellitus Detection
(Andrew Okonji Eboka, Christopher Chukwufunaya Odiakaose, Joy Agboi, Margaret Dumebi Okpor, Paul Avweresuoghene Onoma, Tabitha Chukwudi Aghaunor, Arnold Adimabua Ojugo, Eferhire Valentine Ugbotu, Asuobite ThankGod Max-Egba, Victor Ochuko Geteloma, Amaka Patience Binitie, Christopher Chukwudi Onochie, Rita Erhovwo Ako)
DOI : 10.62411/faith.3048-3719-73
- Volume: 2,
Issue: 1,
Citations : 84 08-May-2025
| Last.29-Jan-2026
Abstract:
Diabetes is the body’s inability to efficiently break down sugar or secrete enough insulin to process glucose, which supports normal bodily functions. As a prevalent chronic disorder, diabetes has contributed to numerous underlying health challenges among those living with it and is classified by the WHO as the world’s deadliest disease and silent killer. Its non-communicable nature makes early diagnosis difficult, allowing progression through various stages: type I, type II, pre-diabetes, and gestational. This challenge is further compounded by the imbalanced nature of diabetes datasets, which leads to high misclassification, poor generalization, and reduced accuracy. This study predicts diabetes using a bi-directional long short-term memory (BiLSTM) model applied to two datasets, (a) the PIMA Indian Diabetes (PID) dataset and (b) the Iraqi Society Dataset (ISD), to evaluate the impact of six known balancing techniques and assess their effectiveness. Results show that for PID, the SMOTE-Tomek fused BiLSTM outperforms other balancing schemes with F1, Accuracy, Precision, Recall, and Specificity scores of 0.9182, 0.9198, 0.9128, 0.9248, and 0.9208, respectively. For ISD, it also achieves the best performance with values of 0.9367, 0.9369, 0.9386, 0.9388, and 0.9313, respectively. Other balancing approaches yielded F1 scores from 0.6751 to 0.9347, Accuracy from 0.684 to 0.9358, Precision from 0.6851 to 0.9296, Recall from 0.6639 to 0.9356, and Specificity from 0.6658 to 0.9298. These results imply that BiLSTM is resilient to the vanishing gradient problem and can effectively classify diabetes cases with enhanced performance.
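The five scores reported above all derive from the same four confusion-matrix counts. The sketch below shows that relationship. The function name and the counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute precision, recall, specificity, accuracy and F1 from
    confusion-matrix counts (true/false positives and negatives)."""
    def ratio(num, den):
        # Guard against empty denominators, e.g. no predicted positives.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical counts for illustration only.
scores = binary_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Accuracy alone can look high on an imbalanced dataset when the majority class dominates, which is why recall and specificity are usually reported alongside it.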
Integrating Hybrid Statistical and Unsupervised LSTM-Guided Feature Extraction for Breast Cancer Detection
(De Rosal Ignatius Moses Setiadi, Arnold Adimabua Ojugo, Octara Pribadi, Etika Kartikadarma, Bimo Haryo Setyoko, Suyud Widiono, Robet Robet, Tabitha Chukwudi Aghaunor, Eferhire Valentine Ugbotu)
DOI : 10.62411/jcta.12698
- Volume: 2,
Issue: 4,
Citations : 0 05-May-2025
| Last.29-Jan-2026
Abstract:
Breast cancer is the most prevalent cancer among women worldwide, requiring early and accurate diagnosis to reduce mortality. This study proposes a hybrid classification pipeline that integrates Hybrid Statistical Feature Selection (HSFS) with unsupervised LSTM-guided feature extraction for breast cancer detection using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. Initially, 20 features were selected using HSFS based on Mutual Information, Chi-square, and Pearson Correlation. To address class imbalance, the training set was balanced using the Synthetic Minority Over-sampling Technique (SMOTE). Subsequently, an LSTM encoder extracted non-linear latent features from the selected features. A fusion strategy was applied by concatenating the statistical and latent features, followed by re-selection of the top 30 features. The final classification was performed using a Support Vector Machine (SVM) with RBF kernel and evaluated using 5-fold cross-validation and a held-out test set. Experimental results showed that the proposed method achieved an average training accuracy of 98.13%, F1-score of 98.13%, and AUC-ROC of 99.55%. On the held-out test set, the model reached an accuracy of 99.30%, precision of 100%, and F1-score of 99.05%, with an AUC-ROC of 0.9973. The proposed pipeline demonstrates improved generalization and interpretability compared to existing methods such as LightGBM-PSO, DHH-GRU, and ensemble deep networks. These results highlight the effectiveness of combining statistical selection and LSTM-based latent feature encoding in a balanced classification framework.
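A hybrid statistical selection of the kind described above can be sketched as a rank aggregation over per-feature scores. The abstract does not say how the three criteria are combined, so averaging ranks is an assumption here. For brevity this sketch uses only two of the three criteria (Pearson correlation and mutual information on equal-width bins) and omits Chi-square. All names and the toy columns are hypothetical.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mutual_information(xs, ys, bins=4):
    """Mutual information (in nats) between a feature discretised into
    equal-width bins and the class labels."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0
    binned = [min(int((x - lo) / width), bins - 1) for x in xs]
    n = len(xs)
    joint, px, py = Counter(zip(binned, ys)), Counter(binned), Counter(ys)
    return sum(c / n * math.log(c * n / (px[b] * py[y]))
               for (b, y), c in joint.items())

def ranks(scores):
    """Rank 0 goes to the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order):
        result[index] = position
    return result

def hybrid_select(columns, labels, top_k):
    """Return indices of the top_k features by mean rank across criteria."""
    corr = [abs(pearson(col, labels)) for col in columns]
    mi = [mutual_information(col, labels) for col in columns]
    mean_rank = [(a + b) / 2 for a, b in zip(ranks(corr), ranks(mi))]
    return sorted(range(len(columns)), key=lambda i: mean_rank[i])[:top_k]

# Hypothetical toy columns: feature 0 separates the classes, feature 1 is
# independent of the label, feature 2 is weakly informative.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [1, 2, 1, 2, 8, 9, 8, 9],
    [5, 1, 5, 1, 5, 1, 5, 1],
    [1, 6, 2, 7, 3, 8, 4, 9],
]
selected = hybrid_select(columns, labels, top_k=2)
```

The same ranking routine could be reused for the re-selection step after the statistical and latent features are concatenated, since it only needs a list of feature columns and the labels.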
Hypertension Detection via Tree-Based Stack Ensemble with SMOTE-Tomek Data Balance and XGBoost Meta-Learner
(Christopher Chukwufunaya Odiakaose, Fidelis Obukohwo Aghware, Margaret Dumebi Okpor, Andrew Okonji Eboka, Amaka Patience Binitie, Arnold Adimabua Ojugo, De Rosal Ignatius Moses Setiadi, Ayei Egu Ibor, Rita Erhovwo Ako, Victor Ochuko Geteloma, Eferhire Valentine Ugbotu, Tabitha Chukwudi Aghaunor)
DOI : 10.62411/faith.3048-3719-43
- Volume: 1,
Issue: 3,
Citations : 114 01-Dec-2024
| Last.29-Jan-2026
Abstract:
High blood pressure (or hypertension) is a causative disorder for a plethora of other ailments, as it masks them and makes them difficult to diagnose and to manage effectively with a targeted treatment plan. While some patients living with high blood pressure can effectively manage their condition via lifestyle adjustments and monitoring with follow-up treatments, others live in self-denial, which leads to unreported instances, mishandled cases, and, increasingly, death. Even with the use of machine learning schemes in medicine, two significant issues abound: (a) the dataset used to construct a model often yields non-perfect scores, and (b) complex deep learning models have yielded improved accuracy but often require large datasets. To curb these issues, our study explores a tree-based stacking ensemble with Decision Tree, Adaptive Boosting, and Random Forest as base learners and XGBoost as the meta-learner. With the retrieved Kaggle dataset, our stacking ensemble yields a prediction accuracy of 1.00 and an F1-score of 1.00, correctly classifying all instances of the test dataset.
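The stacking idea in this abstract, where base learners feed their predictions to a meta-learner, can be sketched in pure Python. This is a simplified stand-in and not the study's pipeline: single-feature decision stumps replace Decision Tree, Adaptive Boosting, and Random Forest, another stump replaces XGBoost, and the fold scheme, function names, and toy data are all assumptions.

```python
def fit_stump(X, y, features):
    """Fit a one-split classifier: choose the feature, threshold and
    direction with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between consecutive values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
        for t in thresholds:
            for sign in (1, -1):
                err = sum(int(sign * (row[f] - t) >= 0) != label
                          for row, label in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    _, f, t, sign = best
    return lambda row: int(sign * (row[f] - t) >= 0)

def out_of_fold(X, y, features, n_folds=3):
    """Predict every row with a stump trained on the other folds only, so
    the meta-learner never sees predictions made on training rows."""
    oof = [0] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        model = fit_stump([X[i] for i in train], [y[i] for i in train], features)
        for i in range(fold, len(X), n_folds):
            oof[i] = model(X[i])
    return oof

def fit_stack(X, y, base_features, n_folds=3):
    """Train one base stump per feature and a meta stump on their
    out-of-fold predictions."""
    meta_X = list(zip(*(out_of_fold(X, y, [f], n_folds) for f in base_features)))
    meta = fit_stump(meta_X, y, range(len(base_features)))
    bases = [fit_stump(X, y, [f]) for f in base_features]
    return lambda row: meta([base(row) for base in bases])

# Hypothetical, well-separated toy data with two mirrored features.
values = [0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15]
X = [(v, -v) for v in values]
y = [0] * 6 + [1] * 6
stack = fit_stack(X, y, base_features=[0, 1])
predictions = [stack(row) for row in X]
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions is the design choice that keeps a stacked model from simply memorising its base learners' training fit.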