Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian

DOI: doi:10.18710/T9NQ9L
Repository: DataverseNO; The Tromsø Repository of Language and Linguistics (TROLLing)
Version: 1
Dates listed in the record: 2017-08-10; 2016; 2016-04-18; 2016
Citation: Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana, 2017, "Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian", https://doi.org/10.18710/T9NQ9L, DataverseNO, V1

Authors: Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana
Institutions: UiT The Arctic University of Norway; National Research University Higher School of Economics
Other names listed in the record: Berdicevskis, Aleksandrs; Conzett, Philipp

Subject: Arts and Humanities
Keywords: Middle Russian; morphology; diachronic; inflection
Topic classification: Field: Morphology; Time-depth: diachronic; Topic: inflection
Country: Russian Federation
License: CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0)

Abstract

We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other one (“TOROT”) is being used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set only to those tokens for which the analyzers provide a guess and in addition consider the RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.

Related publication

Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111

Files

00_readme.txt (text/plain)

01_sergij_original.txt (text/plain)
    This is the original (non-normalised) fragment from Sergij of Radonezh (see 3.1)

02_sergij_normalised.txt (text/plain)
    This is the same fragment as File 1 in normalised orthography (see 3.1 and 2)

03_orv_combined.txt (text/plain)
    This is the training set the TnT was trained on (see 3.1)

04_orv_combined.123 (application/vnd.lotus-1-2-3)
    This is a TnT model file

05_orv_combined.lex (application/octet-stream)
    This is a TnT model file

06_torotlemmata.tab (text/tab-separated-values)
    This is the list of TOROT lemmata (used for lemma guessing, see 3.1 and 4.2)

07_sergij_rnc.csv (text/csv)
    This is the output of the RNC tagger (to get access to the tagger itself, contact the third author)

08_sergij_torot.xml (text/xml)
    This is the output of the TOROT tagger (to get access to the tagger itself, contact the second author)

09_sergij_gold.xml (text/xml)
    This is the gold standard (see 3.2)

10_comparison.rb (application/octet-stream)
    This is the comparison script. Use Ruby 1.9.0 or higher to launch it. Make sure files 1, 6, 7, 8 and 9 are in the same directory. Warning messages about duplicated keys can most likely be ignored, otherwise make sure all Unicode symbols are being read correctly. The script will generate files 11, 12, 13, 15, 17 and 18. Contact the first author if you have any questions.

11_aligned.tab (text/tab-separated-values)

12_aligned_for_manual.tab (text/tab-separated-values)
    This is the same output as in 11 in a slightly different form (intended to facilitate manual comparisons). Can be generated by file 10.

13_aligned_for_morph.csv (text/csv)
    This is the morphological tagging output, aligned with each other and with gold. Can be generated by file 10. Meant to be used as input for file 14.

14_morph.rb (application/octet-stream)
    This is the script that performs comparison of morphological tagging (all other comparisons are made by File 10). Use Ruby 1.9.0 or higher to launch it. Make sure files 8, 9 and 13 are in the same directory. The script will generate file 16. Contact the second author if you have any questions.

15_comparison.tab (text/tab-separated-values)
    This is the detailed information about whether each guess of lemma and POS by both taggers is correct or no. Can be generated by file 10.

16_morph_comparison.csv (text/csv)
    This is the detailed information about whether each guess of morphological tag by both taggers is correct or no (and how wrong it is in the latter case). Can be generated by file 14.

17_results.csv (text/csv)
    This is the summary of file 15 (lists the results reported in the paper). Can be generated by file 10.

18_guess_fixme.tab (text/tab-separated-values)
    This is the output of the RNC-based booster of TOROT performance (see 4.2). Can be generated by file 10.
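
Illustrative sketch of the lemma booster

The abstract describes boosting TOROT lemmatization by falling back on RNC lemma guesses when TOROT provides none, and accepting an RNC guess only if it matches the existing TOROT lemma database. The Ruby sketch below shows one way such a fallback could look. It is not taken from 10_comparison.rb: the method name boosted_lemma, the token layout, and the toy lemma strings are all hypothetical, and it assumes exact string matching against the lemma list, whereas the actual script may harmonize spellings first.

```ruby
require 'set'

# Hypothetical helper (not from the dataset's scripts).
# torot_lemma   - TOROT's lemma guess, or nil/empty when it provides none
# rnc_guesses   - array of RNC lemma guesses for the same token (may be empty)
# known_lemmata - Set of lemmata, standing in for the TOROT lemma list
# Returns TOROT's own guess if present; otherwise the first RNC guess that
# is found in the lemma list; otherwise nil.
def boosted_lemma(torot_lemma, rnc_guesses, known_lemmata)
  return torot_lemma unless torot_lemma.nil? || torot_lemma.empty?
  rnc_guesses.find { |guess| known_lemmata.include?(guess) }
end

# Toy data: invented placeholder strings, not real TOROT or RNC entries.
known = Set.new(%w[byti otec glagolati])
tokens = [
  { torot: 'byti', rnc: ['byt'] },           # TOROT has a guess: keep it
  { torot: nil,    rnc: ['otets', 'otec'] }, # fall back to a matching RNC guess
  { torot: nil,    rnc: ['abc'] }            # no RNC guess is in the list
]

results = tokens.map { |t| boosted_lemma(t[:torot], t[:rnc], known) }
results.each { |r| puts(r.nil? ? '(no lemma)' : r) }
```

Keeping TOROT's own guess whenever one exists reflects the abstract's wording that the RNC guesses are used only when TOROT fails to provide a lemma; checking candidates against the lemma list filters out RNC guesses that have no counterpart in the TOROT database.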