This is the first tool available for part-of-speech (POS) tagging of the Khasi language. In simple terms, a POS tagger automatically assigns a part-of-speech tag to each word in a given sentence. The tags correspond to the familiar parts of speech taught in English grammar, applied here to the parts of speech present in Khasi grammar. The tags assigned to each word in a Khasi sentence are taken from the Bureau of Indian Standards (BIS) tagset proposed for Khasi.
Those interested in the details of the proposed BIS tagset for Khasi, along with the corpus used for training and testing the POS tagger, can download the paper Challenges and Issues in Developing an Annotated Corpus and HMM POS Tagger for Khasi, which is available online.
This tagger uses a Hidden Markov Model (HMM) trained on 3,984 sentences comprising 86,087 tokens and 5,313 word types. The test set consists of 402 sentences comprising 8,565 tokens and 1,110 word types. The tagging accuracy achieved on the test set is 95.68%. Using ten-fold cross validation, the tagger achieved an accuracy of 93.39%. This is the first release of the tagger, and you can try out the tool here.
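To give a rough idea of how an HMM tagger works, here is a minimal sketch in Python. It is not the code behind this tool: the function names, the add-one smoothing, and the bigram (previous tag only) model are all assumptions made for illustration, and the training data is just the single tagged example that appears on this page.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of words
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker (an assumption of this sketch)
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unknown words
    # best[i][tag] = (log score of best path ending in `tag` at i, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions["<s>"], tag, n_tags)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], tag, n_tags) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy training data: the one tagged example shown on this page.
corpus = [[("I", "PR_PRP_M"), ("jong", "IN"), ("nga", "PR_PRP")]]
model = train_hmm(corpus)
predicted = viterbi(["I", "jong", "nga"], *model)
print(predicted)
```

A real tagger would be trained on the full annotated corpus and would need a more careful treatment of unknown words than the single smoothing slot used here.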
To help you understand the tags assigned to each word, a quick reference of what the tags represent is given below. For example, the Khasi sentence "I mei jong nga" (My mother) is tagged as I/PR_PRP_M jong/IN nga/PR_PRP, where PR_PRP_M stands for Pronominal Marker, IN stands for Preposition, and PR_PRP stands for Pronoun.
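If you want to process the tagger's output in a program, the word/TAG format shown above is easy to split apart. The following is a small sketch, assuming tokens are separated by spaces and that the tag follows the last slash. The helper name parse_tagged and the TAG_NAMES table are hypothetical, and the table lists only the three tags explained on this page, not the full BIS tagset.

```python
# Only the three tags explained on this page; the full BIS tagset is larger.
TAG_NAMES = {
    "PR_PRP_M": "Pronominal Marker",
    "IN": "Preposition",
    "PR_PRP": "Pronoun",
}

def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last slash so a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged("I/PR_PRP_M jong/IN nga/PR_PRP")
for word, tag in pairs:
    # Fall back to the raw tag when it is not in the small table above.
    print(word, "->", TAG_NAMES.get(tag, tag))
```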
A brief note on the HMM tagger
This tagger is a piece of software in which you can type a sentence in the Khasi language, and it will give the parts of speech of the words in that sentence. The parts of speech used in this tagger are taken from the Bureau of Indian Standards (BIS) tagset for the Khasi language, and a more detailed explanation of the tagger can be found in the paper Challenges and Issues in Developing an Annotated Corpus and HMM POS Tagger for Khasi.
When this tagger was tested, it assigned the correct part of speech about 95.68 per cent of the time. As an example, for the sentence "I mei jong nga" you will get the answer I/PR_PRP_M jong/IN nga/PR_PRP, where PR_PRP_M means Pronominal Marker, IN means Preposition, and PR_PRP means Pronoun. You can find the list of these parts of speech below. We can also say that this software is the very first of its kind for the Khasi language.
To reduce the errors in the HMM POS tagger's output, conditional random fields (CRFs) are used. Unlike an HMM, a CRF allows the inclusion of features that are not independent of one another and that vary in depth, even on the same observation. If you are interested in the implementation details of this tagger, you may download the paper A Hybrid POS Tagger for Khasi, an Under Resourced Language. The CRF tagger is trained and tested with ten-fold cross validation on the same training data used by the HMM POS tagger. With the 4k training sentences and 10 per cent of them held out for tagging in each fold, the average tagging accuracy under ten-fold cross validation is 95.29%.
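For readers unfamiliar with ten-fold cross validation, the procedure can be sketched as follows. This is an illustration only: the function names are hypothetical, the folds are taken as contiguous blocks (whether the actual experiments shuffled the sentences is not stated here), and the sentence count of 3,984 is taken from the HMM training set described above.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

def cross_validate(sentences, train_fn, accuracy_fn, k=10):
    """Average accuracy over k folds; train_fn and accuracy_fn are supplied by the caller."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(sentences), k):
        model = train_fn([sentences[i] for i in train_idx])
        scores.append(accuracy_fn(model, [sentences[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Fold sizes for a corpus of 3,984 sentences.
folds = list(k_fold_splits(3984, k=10))
print([len(test) for _, test in folds])
```

Each sentence is tagged exactly once as held-out data, and the reported figure is the mean of the ten per-fold accuracies.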
A brief note on the Hybrid tagger
This tagger is a piece of software that improves on the HMM tagger by using conditional random fields (CRFs), which can capture the various features of Khasi words. A more detailed explanation of this tagger can be found in the paper A Hybrid POS Tagger for Khasi, an Under Resourced Language. When this tagger was tested, it assigned the correct part of speech about 95.29 per cent of the time.
Using ten-fold cross validation, the Hybrid tagger's accuracy is 95.29%.