Digital Repository

Hate Words Detection Among Sri Lankan Social Media Text Messages

dc.contributor.author Shalinda, J. A. D. U.
dc.contributor.author Munasinghe, Lankeshwara
dc.date.accessioned 2022-10-31T07:37:34Z
dc.date.available 2022-10-31T07:37:34Z
dc.date.issued 2022
dc.identifier.citation Shalinda J. A. D. U.; Munasinghe Lankeshwara (2022), Hate Words Detection Among Sri Lankan Social Media Text Messages, International Research Conference on Smart Computing and Systems Engineering (SCSE 2022), Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka. 55-60. en_US
dc.identifier.uri http://repository.kln.ac.lk/handle/123456789/25401
dc.description.abstract The number of Sri Lankan social media users grew by 23% between 2020 and 2021, reaching 7.9 million in January 2021. Social media platforms became more popular when they started supporting native languages, and the problems associated with them grew along with that popularity. In 2019, social media platforms were banned for Sri Lankan users to prevent the spread of hate messages and incorrect information among citizens. The lack of tools for automatically recognizing hate messages in Sinhala and Romanized Sinhala was reported as the reason for the ban, and identifying such messages manually wastes time and money. Many studies have addressed hate message detection in English and in Sinhala separately. However, users in Sri Lanka tend to combine Sinhala, Romanized Sinhala, and English phrases when expressing their opinions, for example, “Mama job ekakata apply kara.” Research that identifies hate speech across all of these languages will therefore be more effective in reaching Sri Lankan users. An open-source data set of 2500 comments, each categorized as either hateful or non-hateful, was used for training. To pre-process the data set, an open-source Sinhala stop word corpus and stem word corpus were utilized, and the two corpora were manually converted into Romanized Sinhala equivalents to identify stop words and stem words in Romanized Sinhala. English words were recognized using an open-source English word corpus, and a library was used to obtain the English stop word corpus and to stem English words. Bag of words and term frequency-inverse document frequency (TF-IDF) were compared for feature engineering. Linear support vector classifier, random forest, SGD, logistic regression, XGBoost, and multinomial Naive Bayes classifiers were evaluated as classification algorithms. The highest accuracy, 74.2%, was obtained with the SGD classifier using TF-IDF with unigrams and bigrams. en_US
dc.publisher Department of Industrial Management, Faculty of Science, University of Kelaniya Sri Lanka en_US
dc.subject English, hate speech detection, NLP, Romanized Sinhala, Sinhala en_US
dc.title Hate Words Detection Among Sri Lankan Social Media Text Messages en_US
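
The feature-engineering step named in the abstract (TF-IDF over unigrams and bigrams, after stop word removal) can be illustrated with a minimal sketch. This is not the authors' code: the stop word set, the function names, and the second sample comment are hypothetical, and the weighting tf * (ln(N / df) + 1) is one common TF-IDF variant chosen here for illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny stop word set for illustration only; the paper's
# Sinhala, Romanized Sinhala, and English corpora are not reproduced here.
STOP_WORDS = {"the", "a", "is", "mama"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,!?\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def ngrams(tokens):
    """Return the unigrams followed by the bigrams (adjacent token pairs)."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed weighting: tf * (ln(N / df) + 1), where tf is the relative
    term frequency in the document and df is the number of documents
    that contain the term.
    """
    term_lists = [ngrams(tokenize(doc)) for doc in documents]
    n_docs = len(term_lists)
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# The first comment is the mixed-language example from the abstract;
# the second is a made-up English comment.
comments = ["Mama job ekakata apply kara", "apply for the job"]
vectors = tfidf(comments)
```

Terms that occur in only one comment, such as the bigram "apply kara", receive a higher weight than terms shared by both comments, such as "job". Weight dictionaries like these would then be turned into feature vectors for a classifier such as SGD. In practice a library vectorizer (for example, scikit-learn's TfidfVectorizer with ngram_range=(1, 2)) would likely replace this hand-written version, although its weighting differs in detail.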

