dc.contributor.author |
Shanmugalingam, K. |
|
dc.contributor.author |
Sumathipala, S. |
|
dc.date.accessioned |
2019-05-13T04:24:34Z |
|
dc.date.available |
2019-05-13T04:24:34Z |
|
dc.date.issued |
2019 |
|
dc.identifier.citation |
Shanmugalingam, K., Sumathipala, S. (2019). Language identification at word level in Sinhala-English code-mixed social media text. IEEE International Research Conference on Smart computing & Systems Engineering (SCSE) 2019, Department of Industrial Management, Faculty of Science, University of Kelaniya, Sri Lanka. P. 113 |
en_US |
dc.identifier.uri |
http://repository.kln.ac.lk/handle/123456789/20164 |
|
dc.description.abstract |
Automatically analyzing and extracting useful information from noisy social media content is currently attracting
attention from the research community. People commonly mix their native language with English to express their
thoughts on social media, using either Unicode characters or those characters transliterated into Roman script.
Such noisy code-mixed text is characterized by a high percentage of spelling mistakes arising from phonetic typing,
wordplay, creative spelling, abbreviations, meta tags, and so on. Identifying languages at the word level therefore
becomes a necessary step in analyzing noisy social media content. It could also serve as an intermediate language
identifier for chatbot applications that use native languages. For this study we used Sinhala-English code-mixed
text from social media. Natural Language Processing (NLP) and Machine Learning (ML) techniques are used
to identify the language tags at the word level. The novel approach proposed and implemented for this system is a
machine learning classifier based on features such as Sinhala Unicode characters written in Roman script,
dictionaries, and term frequency. Different machine learning classifiers, such as Support Vector Machines (SVM),
Naive Bayes, Logistic Regression, Random Forest and Decision Trees, were used in the evaluation process. Among them,
the highest accuracy, 90.5%, was obtained with the Random Forest classifier. |
en_US |
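The feature types named in the abstract (script, dictionaries, term frequency) can be illustrated with a minimal Python sketch of word-level feature extraction. This is not the authors' implementation: the word lists `ENGLISH_WORDS` and `ROMANIZED_SINHALA_WORDS`, the function names, and the exact feature set are hypothetical stand-ins, since the abstract does not specify the actual dictionaries or feature encoding.

```python
from collections import Counter

# Hypothetical toy dictionaries; the real resources are not given in the abstract.
ENGLISH_WORDS = {"i", "am", "going", "home", "love", "you", "the"}
ROMANIZED_SINHALA_WORDS = {"mama", "gedara", "yanawa", "oyata", "adarei"}

# Unicode Sinhala block: U+0D80 through U+0DFF.
SINHALA_BLOCK = range(0x0D80, 0x0E00)


def has_sinhala_script(word):
    """Return True if any character of the word lies in the Sinhala block."""
    return any(ord(ch) in SINHALA_BLOCK for ch in word)


def word_features(word, term_counts):
    """Build an assumed feature dictionary for one token."""
    lower = word.lower()
    return {
        "sinhala_script": has_sinhala_script(word),
        "in_english_dict": lower in ENGLISH_WORDS,
        "in_romanized_sinhala_dict": lower in ROMANIZED_SINHALA_WORDS,
        "term_frequency": term_counts[lower],
    }


def sentence_features(sentence):
    """Split on whitespace and return (token, features) pairs."""
    tokens = sentence.split()
    counts = Counter(t.lower() for t in tokens)
    return [(t, word_features(t, counts)) for t in tokens]


if __name__ == "__main__":
    for token, feats in sentence_features("mama home yanawa"):
        print(token, feats)
```

Feature dictionaries of this kind could then be vectorized and passed to any of the classifiers listed in the abstract, for example a random forest. That training step is omitted here because the abstract gives no details of the training data or hyperparameters.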
dc.language.iso |
en |
en_US |
dc.publisher |
IEEE International Research Conference on Smart computing & Systems Engineering (SCSE) 2019, Department of Industrial Management, Faculty of Science, University of Kelaniya, Sri Lanka |
en_US |
dc.subject |
Code-mixing |
en_US |
dc.subject |
Language identification |
en_US |
dc.subject |
Machine learning |
en_US |
dc.subject |
Natural Language Processing (NLP) |
en_US |
dc.title |
Language identification at word level in Sinhala-English code-mixed social media text |
en_US |
dc.type |
Article |
en_US |