A Comprehensive Part of Speech (POS) Tag Set for Sinhala Language.

Dilshani, N.; Fernando, S.; Ranathunga, S.; Jayasena, S.; Dias, G.

dc.contributor.author	Dilshani, N.
dc.contributor.author	Fernando, S.
dc.contributor.author	Ranathunga, S.
dc.contributor.author	Jayasena, S.
dc.contributor.author	Dias, G.
dc.date.accessioned	2017-12-04T08:47:13Z
dc.date.available	2017-12-04T08:47:13Z
dc.date.issued	2017
dc.identifier.citation	Dilshani, N., Fernando, S., Ranathunga, S., Jayasena, S. and Dias, G. (2017). A Comprehensive Part of Speech (POS) Tag Set for Sinhala Language. The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017. Department of Linguistics, University of Kelaniya, Sri Lanka. p. 59.	en_US
dc.identifier.uri	http://repository.kln.ac.lk/handle/123456789/18366
dc.description.abstract	Sinhala, which belongs to the Indo-Aryan language family, is a morphologically complex language. Most features of a word are affixed postpositionally to the root word. Thus, well-developed Part of Speech (POS) tag sets for languages such as English cannot easily be adapted to create a POS tag set for Sinhala. Moreover, currently available Sinhala POS tag sets have many limitations, such as the unavailability of tags for certain words. The objective of this research is to identify and overcome the ambiguities and limitations of the present POS tag sets for the Sinhala language, and to develop a comprehensive multi-level tag set for it. The new tag set was designed after a thorough evaluation of different types of corpora, such as news articles and official government letters, as well as an analysis of the existing POS tag set for Sinhala. The new tag set consists of 148 tags organized into 3 levels. It therefore covers most of the word classes and inflection-based grammatical variations of the Sinhala language. The ultimate purpose of developing this tag set is to implement an automatic POS tagger, which is an essential tool for implementing Natural Language Processing applications. To train the automatic POS tagger, a corpus of 300,000 words has been manually POS-annotated using this tag set. Tagging with this tag set produced an overall accuracy of 84.68%, surpassing the other Sinhala POS taggers. However, the annotation was done only up to level 2 of the tag set. Annotating at level 3 has the potential to introduce many ambiguities into the manual annotation process, owing to the large number of POS tags. This opens up new research avenues for investigating the use of the inflectional morphological features of the Sinhala language to determine the POS tag of a word at the third level.	en_US
dc.language.iso	en	en_US
dc.publisher	The Third International Conference on Linguistics in Sri Lanka, ICLSL 2017. Department of Linguistics, University of Kelaniya, Sri Lanka.	en_US
dc.subject	Lexical	en_US
dc.subject	Morphology	en_US
dc.subject	Natural Language Processing (NLP)	en_US
dc.subject	Parts of Speech (POS)	en_US
dc.subject	Sinhala	en_US
dc.title	A Comprehensive Part of Speech (POS) Tag Set for Sinhala Language.	en_US
dc.type	Article	en_US