Analysing Accuracy of Slovak Language Lemmatization and MSD Tagging
Author(s): Radovan Garabík, Denis Mitana
Subject(s): Language and Literature Studies, Applied Linguistics, Computational linguistics
Published by: SAV - Slovenská akadémia vied - Jazykovedný ústav Ľudovíta Štúra Slovenskej akadémie vied
Keywords: lemmatization; MSD tagging; POS tagging; Slovak
Summary/Abstract: Lemmatization and morphological tagging are indispensable steps in Slovak corpus linguistics. In this article, we evaluate two state-of-the-art Slovak language lemmatizers and MSD taggers, one based on MorphoDiTa and the other on spaCy. We measured accuracy on the test subset of a manually lemmatized and MSD-annotated corpus and found that the combination of lemma and tag achieved 93.5% accuracy with MorphoDiTa and 95.6% accuracy with spaCy. Most of the errors occurred in disambiguating MSD tags for homonymous uninflected parts of speech such as particles, conjunctions, and adverbs, and in distinguishing the nominative from the accusative of singular masculine inanimate words. In these cases, spaCy shows a noticeable improvement over MorphoDiTa, likely because it makes better use of the context of the words.
Journal: Slovenská reč
- Issue Year: 88/2023
- Issue No: 2
- Page Range: 129-140
- Page Count: 12
- Language: English
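
The abstract reports accuracy for the combination of lemma and tag. As a minimal sketch of how such a figure could be computed, assuming the gold and predicted annotations are aligned token by token, the Python below counts a token as correct for the joint score only when both its lemma and its MSD tag match. The `Token` type, the `accuracies` function, and the toy data are hypothetical illustrations, not the authors' evaluation code, and the tag strings are placeholders rather than values taken from the article.

```python
from typing import Dict, NamedTuple, Sequence


class Token(NamedTuple):
    """One annotated token (hypothetical structure for this sketch)."""
    lemma: str
    tag: str


def accuracies(gold: Sequence[Token], pred: Sequence[Token]) -> Dict[str, float]:
    """Return lemma, tag, and joint lemma+tag accuracy over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    n = len(gold)
    lemma_ok = sum(g.lemma == p.lemma for g, p in zip(gold, pred))
    tag_ok = sum(g.tag == p.tag for g, p in zip(gold, pred))
    # Joint score: a token counts only if lemma AND tag are both correct.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return {"lemma": lemma_ok / n, "tag": tag_ok / n, "lemma+tag": both_ok / n}


# Toy data; the tag strings are illustrative placeholders.
gold = [
    Token("hrad", "NOUN-nom"),
    Token("stáť", "VERB"),
    Token("na", "PREP"),
    Token("kopec", "NOUN-loc"),
]
pred = [
    Token("hrad", "NOUN-acc"),   # tag error: nominative read as accusative
    Token("stáť", "VERB"),
    Token("na", "PREP"),
    Token("kopce", "NOUN-loc"),  # lemma error
]

result = accuracies(gold, pred)
```

Because the joint score requires both fields to match, it can never exceed the lemma-only or tag-only accuracy; in the toy data each of those is 0.75 while the joint score is 0.5.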