CLASSIFICATION OF SMS SPAM WITH N-GRAM AND PEARSON CORRELATION BASED USING MACHINE LEARNING TECHNIQUES

Authors

  • Nova Tri Romadloni, Universitas Muhammadiyah Karanganyar
  • Nisa Dwi Septiyanti, Universitas Muhammadiyah Karanganyar
  • Cucut Hariz Pratomo, Universitas Muhammadiyah Karanganyar
  • Wakhid Kurniawan, Universitas Muhammadiyah Karanganyar
  • Rauhulloh Ayatulloh Khomeini Noor Bintang, Universitas Muhammadiyah Karanganyar

DOI:

https://doi.org/10.55681/sentri.v3i2.2252

Keywords:

Feature Selection, Machine Learning, N-Gram, Pearson Correlation, SMS Classification

Abstract

The Short Message Service (SMS) is widely used because it is simple, reliable, and universally accessible. This study aims to improve SMS spam classification by streamlining the classification process: reducing feature dimensionality and eliminating irrelevant attributes. The text data is preprocessed and represented with N-Gram features, after which features are selected using Pearson Correlation. Five classification algorithms are evaluated. The findings indicate that the best results come from combining N-Gram features with feature selection through Pearson Correlation. Among the algorithms, the Support Vector Machine stands out, with a reported accuracy of 91.41% without feature selection, 91.96% with N-Gram features, and 70.80% after the inclusion of weighted correlation. The model's generalizability is limited, however, mainly because a relatively small dataset was used. Although Pearson Correlation and N-Gram-based feature selection reduce data dimensionality and improve processing efficiency, some relevant features may have been overlooked, and the selected attributes may not be optimal for particular classification tasks.
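
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of word N-Gram extraction followed by Pearson Correlation feature ranking. It is an illustration under stated assumptions, not the authors' implementation: the function names (`word_ngrams`, `build_matrix`, `pearson`, `select_features`), the choice of unigrams plus bigrams, the top-k cutoff, and the four toy messages are all hypothetical. The selected features would then be passed to a classifier such as a Support Vector Machine, which is omitted here.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased message."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_matrix(messages, n_max=2):
    """Build an n-gram count matrix (one row per message) and its vocabulary."""
    counts = [Counter(word_ngrams(m, n_max)) for m in messages]
    vocab = sorted(set().union(*counts))
    matrix = [[c[g] for g in vocab] for c in counts]
    return matrix, vocab


def pearson(xs, ys):
    """Pearson correlation r = cov(x, y) / (std(x) * std(y)); 0.0 if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_features(matrix, vocab, labels, k):
    """Keep the k n-grams with the largest |r| against the class label."""
    scores = []
    for j, gram in enumerate(vocab):
        column = [row[j] for row in matrix]
        scores.append((abs(pearson(column, labels)), gram))
    # Sort by descending |r|; break ties alphabetically for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [gram for _, gram in scores[:k]]


# Hypothetical toy data: 1 = spam, 0 = not spam.
messages = [
    "win a free prize now",
    "free prize call now",
    "are we meeting for lunch",
    "see you at lunch",
]
labels = [1, 1, 0, 0]

matrix, vocab = build_matrix(messages, n_max=2)
selected = select_features(matrix, vocab, labels, k=5)
print(len(vocab), "n-gram features reduced to", len(selected))
print(selected)
```

Ranking by the absolute value of the correlation keeps features that indicate either class; a threshold on |r| could be used instead of a fixed k, and which of the two the study applied is not stated in the abstract.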

Published

2024-02-06

How to Cite

Romadloni, N. T., Septiyanti, N. D., Pratomo, C. H., Kurniawan, W., & Bintang, R. A. K. N. (2024). CLASSIFICATION OF SMS SPAM WITH N-GRAM AND PEARSON CORRELATION BASED USING MACHINE LEARNING TECHNIQUES. SENTRI: Jurnal Riset Ilmiah, 3(2), 967–977. https://doi.org/10.55681/sentri.v3i2.2252