Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian (doi:10.18710/T9NQ9L)

Document Description
Citation
Title:	Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian
Identification Number:	doi:10.18710/T9NQ9L
Distributor:	DataverseNO
Date of Distribution:	2017-08-10
Version:	1
Bibliographic Citation:	Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana, 2017, "Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian", https://doi.org/10.18710/T9NQ9L, DataverseNO, V1
Study Description
Citation
Title:	Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian
Identification Number:	doi:10.18710/T9NQ9L
Authoring Entity:	Berdicevskis, Aleksandrs (UiT The Arctic University of Norway)
	Eckhoff, Hanne (UiT The Arctic University of Norway)
	Gavrilova, Tatjana (National Research University Higher School of Economics)
Producer:	UiT The Arctic University of Norway
	National Research University Higher School of Economics
Date of Production:	2016
Distributor:	DataverseNO
Distributor:	The Tromsø Repository of Language and Linguistics (TROLLing)
Access Authority:	Berdicevskis, Aleksandrs
Depositor:	Conzett, Philipp
Date of Deposit:	2016-04-18
Date of Distribution:	2016
Holdings Information:	https://doi.org/10.18710/T9NQ9L
Study Scope
Keywords:	Arts and Humanities, Middle Russian, morphology, diachronic, inflection
Topic Classification:	Field: Morphology, Time-depth: diachronic, Topic: inflection
Abstract:	We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other (“TOROT”) is used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set to those tokens for which the analyzers provide a guess and, in addition, consider the RNC response correct if any of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large amount of material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.
Country:	Russian Federation
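The two evaluation settings described in the abstract (all tokens versus only tokens with a guess, with any matching guess accepted) can be illustrated with a minimal Ruby sketch. This is not the code from 10_comparison.rb: the Token struct, the method names and the sample data are hypothetical, and scoring only the first guess in the strict setting is an assumption, since the abstract does not say how multiple guesses are handled there.

```ruby
# Hypothetical token record: the gold value plus an analyzer's guesses.
# An empty guesses array means the analyzer offered no guess.
Token = Struct.new(:gold, :guesses)

# Strict setting: every token counts. Scoring only the first guess is an
# assumption made for this sketch.
def strict_accuracy(tokens)
  return 0.0 if tokens.empty?
  correct = tokens.count { |t| t.guesses.first == t.gold }
  correct.to_f / tokens.size
end

# Lenient setting: tokens without a guess are left out, and a token counts
# as correct if any of its guesses matches gold.
def lenient_accuracy(tokens)
  guessed = tokens.reject { |t| t.guesses.empty? }
  return 0.0 if guessed.empty?
  correct = guessed.count { |t| t.guesses.include?(t.gold) }
  correct.to_f / guessed.size
end

# Made-up placeholder data, not taken from the deposited files.
tokens = [
  Token.new("a", ["a"]),       # single correct guess
  Token.new("b", ["x", "b"]),  # correct guess is not the first one
  Token.new("c", []),          # no guess at all
  Token.new("d", ["y"])        # single wrong guess
]

puts strict_accuracy(tokens)   # 1 of 4 tokens
puts lenient_accuracy(tokens)  # 2 of 3 guessed tokens
```

On the placeholder data the lenient score is higher than the strict one, which mirrors the direction of the change reported in the abstract but says nothing about its size.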
Methodology and Processing
Sources Statement
Data Access
Other Study Description Materials
Related Publications
Citation
Title:	Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111
Bibliographic Citation:	Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111
Other Study-Related Materials
Label:	00_readme.txt
Notes:	text/plain
Other Study-Related Materials
Label:	01_sergij_original.txt
Text:	This is the original (non-normalised) fragment from Sergij of Radonezh (see 3.1)
Notes:	text/plain
Other Study-Related Materials
Label:	02_sergij_normalised.txt
Text:	This is the same fragment as File 1 in normalised orthography (see 3.1 and 2)
Notes:	text/plain
Other Study-Related Materials
Label:	03_orv_combined.txt
Text:	This is the training set that TnT was trained on (see 3.1)
Notes:	text/plain
Other Study-Related Materials
Label:	04_orv_combined.123
Text:	This is a TnT model file
Notes:	application/vnd.lotus-1-2-3
Other Study-Related Materials
Label:	05_orv_combined.lex
Text:	This is a TnT model file
Notes:	application/octet-stream
Other Study-Related Materials
Label:	06_torotlemmata.tab
Text:	This is the list of TOROT lemmata (used for lemma guessing, see 3.1 and 4.2)
Notes:	text/tab-separated-values
Other Study-Related Materials
Label:	07_sergij_rnc.csv
Text:	This is the output of the RNC tagger (to get access to the tagger itself, contact the third author)
Notes:	text/csv
Other Study-Related Materials
Label:	08_sergij_torot.xml
Text:	This is the output of the TOROT tagger (to get access to the tagger itself, contact the second author)
Notes:	text/xml
Other Study-Related Materials
Label:	09_sergij_gold.xml
Text:	This is the gold standard (see 3.2)
Notes:	text/xml
Other Study-Related Materials
Label:	10_comparison.rb
Text:	This is the comparison script. Use Ruby 1.9.0 or higher to launch it. Make sure files 1, 6, 7, 8 and 9 are in the same directory. Warning messages about duplicated keys can most likely be ignored; otherwise, make sure all Unicode symbols are being read correctly. The script will generate files 11, 12, 13, 15, 17 and 18. Contact the first author if you have any questions.
Notes:	application/octet-stream
Other Study-Related Materials
Label:	11_aligned.tab
Text:	This is the aligned output (file 12 contains the same output in a slightly different form). Can be generated by file 10.
Notes:	text/tab-separated-values
Other Study-Related Materials
Label:	12_aligned_for_manual.tab
Text:	This is the same output as in 11 in a slightly different form (intended to facilitate manual comparisons). Can be generated by file 10.
Notes:	text/tab-separated-values
Other Study-Related Materials
Label:	13_aligned_for_morph.csv
Text:	This is the morphological tagging output of both taggers, aligned with each other and with the gold standard. Can be generated by file 10. Meant to be used as input for file 14.
Notes:	text/csv
Other Study-Related Materials
Label:	14_morph.rb
Text:	This is the script that performs the comparison of morphological tagging (all other comparisons are made by file 10). Use Ruby 1.9.0 or higher to launch it. Make sure files 8, 9 and 13 are in the same directory. The script will generate file 16. Contact the second author if you have any questions.
Notes:	application/octet-stream
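The input requirements stated for files 10 and 14 can be checked before launching either script. The following is a minimal sketch, not part of the deposit: the constant REQUIRED_INPUTS and the method missing_inputs are hypothetical names, and the file names are the labels listed in this codebook.

```ruby
# Input files each script expects in its own directory, according to the
# descriptions of files 10 and 14 in this codebook.
REQUIRED_INPUTS = {
  "10_comparison.rb" => %w[
    01_sergij_original.txt 06_torotlemmata.tab 07_sergij_rnc.csv
    08_sergij_torot.xml 09_sergij_gold.xml
  ],
  "14_morph.rb" => %w[
    08_sergij_torot.xml 09_sergij_gold.xml 13_aligned_for_morph.csv
  ]
}

# Returns the names of required inputs that are absent from dir.
def missing_inputs(script, dir = ".")
  REQUIRED_INPUTS.fetch(script).reject do |name|
    File.exist?(File.join(dir, name))
  end
end

missing = missing_inputs("10_comparison.rb")
if missing.empty?
  puts "All inputs found; run: ruby 10_comparison.rb"
else
  puts "Missing inputs: #{missing.join(', ')}"
end
```

Since file 13 is itself generated by file 10, the check for 14_morph.rb is expected to pass only after 10_comparison.rb has been run.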
Other Study-Related Materials
Label:	15_comparison.tab
Text:	This is the detailed information about whether each guess of lemma and POS by both taggers is correct or not. Can be generated by file 10.
Notes:	text/tab-separated-values
Other Study-Related Materials
Label:	16_morph_comparison.csv
Text:	This is the detailed information about whether each guess of morphological tag by both taggers is correct or not (and, if not, how wrong it is). Can be generated by file 14.
Notes:	text/csv
Other Study-Related Materials
Label:	17_results.csv
Text:	This is the summary of file 15 (lists the results reported in the paper). Can be generated by file 10.
Notes:	text/csv
Other Study-Related Materials
Label:	18_guess_fixme.tab
Text:	This is the output of the RNC-based booster of TOROT performance (see 4.2). Can be generated by file 10.
Notes:	text/tab-separated-values
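The booster idea summarised in the abstract (fall back to RNC lemma guesses when TOROT provides none, keeping only guesses found in the TOROT lemma list) can be sketched as follows. This is an illustration under assumptions, not the procedure implemented in 10_comparison.rb: the method name boosted_lemma is hypothetical, exact string matching and taking the first matching guess are simplifications, and the lemmata are Latin-script placeholders rather than entries from 06_torotlemmata.tab.

```ruby
require "set"

# Returns TOROT's own lemma when it has one. Otherwise returns the first
# RNC guess that occurs in the TOROT lemma list, or nil if none does.
def boosted_lemma(torot_lemma, rnc_guesses, torot_lemmata)
  return torot_lemma unless torot_lemma.nil? || torot_lemma.empty?
  rnc_guesses.find { |guess| torot_lemmata.include?(guess) }
end

# Placeholder lemma list; a real run would load it from the lemma file.
torot_lemmata = Set.new(%w[byti slovo])

p boosted_lemma(nil, %w[slovu slovo], torot_lemmata)  # => "slovo"
p boosted_lemma("byti", %w[slovo], torot_lemmata)     # => "byti"
p boosted_lemma(nil, %w[xxx], torot_lemmata)          # => nil
```

Because the abstract notes that the analyzers follow different spelling principles, a real implementation would presumably need to harmonize spellings before matching; that step is left out here.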