Categorization of Ottoman Literary Texts using Machine Learning

Project Staff (Bilkent):

Fazli Can
Pinar Duygulu
Ethem F. Can
Mehmet Kalpakli

Abstract

Millions of handwritten documents are available in the Ottoman language. Automatic categorization of Ottoman texts would make them much more accessible for applications ranging from literary analysis to historical investigation. The Ottoman Text Archive Project (OTAP) and the Text Bank Project (TBP) aim to make Ottoman texts available as a language resource to various types of users by providing transcribed versions in adapted Latin characters. In this work, we study the automatic text categorization (ATC) of transcribed versions of handwritten Ottoman manuscripts. For this purpose, we use five different machine learning algorithms and employ forty-nine style markers. In the experiments, we use the collected works (divans) of ten different poets: two poets from each of five hundred-year periods ranging from the 15th to the 19th century. The experimental results show that our approach achieves high correct classification rates for both categorization by poet and categorization by period, and that it can distinguish differences in style over time and among poets.
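To make the notion of a style marker concrete, the following is a minimal Python sketch that computes a handful of surface statistics for one block of transcribed text. The abstract does not list the forty-nine markers used in the study, so the five markers below, the function name style_markers, and the sample line are illustrative assumptions rather than the project's actual feature set.

```python
import re
import unicodedata
from collections import Counter
from typing import Dict


def style_markers(text: str) -> Dict[str, float]:
    """Compute a few illustrative style markers for one block of text.

    These markers are hypothetical examples; they are not the forty-nine
    markers used in the study.
    """
    # Normalize to NFC so letters with diacritics (common in transcribed
    # Ottoman) are single code points and are not split by the tokenizer.
    normalized = unicodedata.normalize("NFC", text.lower())
    # A token is a run of Unicode letters; digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", normalized)
    if not tokens:
        raise ValueError("text contains no word tokens")

    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Hapax legomena: word types that occur exactly once in the block.
    hapaxes = sum(1 for c in counts.values() if c == 1)

    return {
        "token_count": float(n_tokens),
        "type_count": float(n_types),
        "avg_token_length": sum(len(t) for t in tokens) / n_tokens,
        "type_token_ratio": n_types / n_tokens,
        "hapax_ratio": hapaxes / n_types,
    }


# Made-up sample line, used only to exercise the function.
sample = "gül gül bülbül bâğ"
markers = style_markers(sample)
```

Each block of a divan would yield one such vector of markers, and those vectors, labeled by poet or by period, would form the input to a classifier.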

According to average correct classification rates, the Support Vector Machine (SVM) is the most effective categorization tool in the context of transcribed Ottoman texts. In addition, statistical tests show that each classifier differs from the others to a statistically significant degree. In this pioneering work on the Ottoman language, we show that it is possible to develop efficient and effective ATC methods that can be applied to this language.
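The average correct classification rate used to compare classifiers can be sketched as follows. This assumes the rate is the fraction of test items assigned to their true category, averaged over several evaluation runs. The abstract does not describe the evaluation protocol, so the use of folds, the function name, and the toy labels below are all assumptions made for illustration and do not reproduce any result from the study.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def correct_classification_rate(
    true_labels: Sequence[str], predicted_labels: Sequence[str]
) -> float:
    """Return the fraction of items whose predicted label equals the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy (true, predicted) label pairs for two hypothetical evaluation folds.
# These values are invented for illustration only.
folds: List[Tuple[List[str], List[str]]] = [
    (["poet_a", "poet_b", "poet_a", "poet_b"],
     ["poet_a", "poet_b", "poet_b", "poet_b"]),
    (["poet_a", "poet_a", "poet_b", "poet_b"],
     ["poet_a", "poet_a", "poet_b", "poet_b"]),
]

rates = [correct_classification_rate(t, p) for t, p in folds]
average_rate = mean(rates)
```

Computing the same average for each classifier gives one number per classifier for ranking. The per-fold rates are the kind of paired samples to which a significance test could then be applied.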

OTAP
Ottoman Text Archive Project