Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian |
Identification Number: |
doi:10.18710/T9NQ9L |
Distributor: |
DataverseNO |
Date of Distribution: |
2017-08-10 |
Version: |
1 |
Bibliographic Citation: |
Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana, 2017, "Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian", https://doi.org/10.18710/T9NQ9L, DataverseNO, V1 |
Citation |
|
Title: |
Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian |
Identification Number: |
doi:10.18710/T9NQ9L |
Authoring Entity: |
Berdicevskis, Aleksandrs (UiT The Arctic University of Norway) |
Eckhoff, Hanne (UiT The Arctic University of Norway) |
|
Gavrilova, Tatjana (National Research University Higher School of Economics) |
|
Producer: |
UiT The Arctic University of Norway |
National Research University Higher School of Economics |
|
Date of Production: |
2016 |
Distributor: |
DataverseNO |
Distributor: |
The Tromsø Repository of Language and Linguistics (TROLLing) |
Access Authority: |
Berdicevskis, Aleksandrs |
Depositor: |
Conzett, Philipp |
Date of Deposit: |
2016-04-18 |
Date of Distribution: |
2016 |
Holdings Information: |
https://doi.org/10.18710/T9NQ9L |
Study Scope |
|
Keywords: |
Arts and Humanities, Middle Russian, morphology, diachronic, inflection |
Topic Classification: |
Field: Morphology, Time-depth: diachronic, Topic: inflection |
Abstract: |
We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other (“TOROT”) is being used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set to those tokens for which the analyzers provide a guess and in addition consider the RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large amount of material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one. |
Country: |
Russian Federation |
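The two ways of counting described in the abstract can be illustrated with a small sketch. This is not the authors' comparison script (file 10 is written in Ruby). The function names, the toy tags and the exact treatment of multiple guesses in the strict count are assumptions made for illustration only.

```python
from typing import List, Sequence


def strict_accuracy(guesses: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Accuracy over ALL tokens.

    Assumption: a token counts as correct only if the analyzer gives
    exactly one guess and that guess equals the gold label.
    """
    correct = sum(1 for g, t in zip(guesses, gold) if len(g) == 1 and g[0] == t)
    return correct / len(gold)


def lenient_accuracy(guesses: Sequence[List[str]], gold: Sequence[str]) -> float:
    """Accuracy over tokens that received at least one guess.

    A token counts as correct if ANY of its guesses equals the gold label.
    """
    answered = [(g, t) for g, t in zip(guesses, gold) if g]
    if not answered:
        return 0.0
    correct = sum(1 for g, t in answered if t in g)
    return correct / len(answered)


# Toy data, invented for illustration (not taken from the dataset).
gold = ["N", "V", "A", "N"]
guesses = [["N"], ["V", "N"], [], ["A"]]

print(strict_accuracy(guesses, gold))   # 1 of 4 tokens counted as correct
print(lenient_accuracy(guesses, gold))  # 2 of 3 answered tokens counted as correct
```

Under the second count, unanswered tokens leave the denominator and ambiguous answers can score, which is why an analyzer that often gives no guess or several guesses looks much better there.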
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Title: |
The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian |
Bibliographic Citation: |
Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111 |
Label: |
00_readme.txt |
Notes: |
text/plain |
Label: |
01_sergij_original.txt |
Text: |
This is the original (non-normalised) fragment from Sergij of Radonezh (see Section 3.1 of the related publication) |
Notes: |
text/plain |
Label: |
02_sergij_normalised.txt |
Text: |
This is the same fragment as file 1, in normalised orthography (see Sections 2 and 3.1 of the related publication) |
Notes: |
text/plain |
Label: |
03_orv_combined.txt |
Text: |
This is the training set that the TnT tagger was trained on (see 3.1) |
Notes: |
text/plain |
Label: |
04_orv_combined.123 |
Text: |
This is a TnT model file |
Notes: |
application/vnd.lotus-1-2-3 |
Label: |
05_orv_combined.lex |
Text: |
This is a TnT model file |
Notes: |
application/octet-stream |
Label: |
06_torotlemmata.tab |
Text: |
This is the list of TOROT lemmata (used for lemma guessing, see 3.1 and 4.2) |
Notes: |
text/tab-separated-values |
Label: |
07_sergij_rnc.csv |
Text: |
This is the output of the RNC tagger (to get access to the tagger itself, contact the third author) |
Notes: |
text/csv |
Label: |
08_sergij_torot.xml |
Text: |
This is the output of the TOROT tagger (to get access to the tagger itself, contact the second author) |
Notes: |
text/xml |
Label: |
09_sergij_gold.xml |
Text: |
This is the gold standard (see 3.2) |
Notes: |
text/xml |
Label: |
10_comparison.rb |
Text: |
This is the comparison script. Use Ruby 1.9.0 or higher to launch it. Make sure files 1, 6, 7, 8 and 9 are in the same directory. Warning messages about duplicated keys can most likely be ignored; otherwise, make sure all Unicode symbols are being read correctly. The script will generate files 11, 12, 13, 15, 17 and 18. Contact the first author if you have any questions. |
Notes: |
application/octet-stream |
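Before launching file 10, it may help to confirm that its input files are in place. The following is a minimal sketch of such a check, assuming the files keep the names listed in this record. The helper `missing_inputs` is hypothetical and is not part of the dataset.

```python
from pathlib import Path

# Input files that the description of file 10 asks for (files 1, 6, 7, 8 and 9).
REQUIRED = [
    "01_sergij_original.txt",
    "06_torotlemmata.tab",
    "07_sergij_rnc.csv",
    "08_sergij_torot.xml",
    "09_sergij_gold.xml",
]


def missing_inputs(directory: str = ".") -> list:
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files: " + ", ".join(missing))
    else:
        print("All input files for 10_comparison.rb are present.")
```

Run it from the directory that holds `10_comparison.rb`. The check covers file presence only, not encoding, so the Unicode caveat in the description still applies.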
Label: |
11_aligned.tab |
Notes: |
text/tab-separated-values |
Label: |
12_aligned_for_manual.tab |
Text: |
This is the same output as in file 11, in a slightly different form intended to facilitate manual comparison. Can be generated by file 10. |
Notes: |
text/tab-separated-values |
Label: |
13_aligned_for_morph.csv |
Text: |
These are the morphological tagging outputs of the two taggers, aligned with each other and with the gold standard. Can be generated by file 10. Meant to be used as input for file 14. |
Notes: |
text/csv |
Label: |
14_morph.rb |
Text: |
This is the script that performs the comparison of morphological tagging (all other comparisons are made by file 10). Use Ruby 1.9.0 or higher to launch it. Make sure files 8, 9 and 13 are in the same directory. The script will generate file 16. Contact the second author if you have any questions. |
Notes: |
application/octet-stream |
Label: |
15_comparison.tab |
Text: |
This is the detailed information about whether each lemma and POS guess by each tagger is correct or not. Can be generated by file 10. |
Notes: |
text/tab-separated-values |
Label: |
16_morph_comparison.csv |
Text: |
This is the detailed information about whether each morphological tag guess by each tagger is correct or not (and, if not, how wrong it is). Can be generated by file 14. |
Notes: |
text/csv |
Label: |
17_results.csv |
Text: |
This is the summary of file 15 (lists the results reported in the paper). Can be generated by file 10. |
Notes: |
text/csv |
Label: |
18_guess_fixme.tab |
Text: |
This is the output of the RNC-based booster of TOROT performance (see 4.2). Can be generated by file 10. |
Notes: |
text/tab-separated-values |
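The idea behind the booster, as summarised in the abstract, is to use RNC lemma guesses where TOROT gives none and to accept only those guesses found in the TOROT lemma list. The sketch below is one reading of that summary, not the authors' implementation. The function name, the data structures and the toy lemmata are assumptions, and the harmonization of spelling between the two analyzers is left out.

```python
from typing import Iterable, Optional, Set


def boost_lemma(torot_guess: Optional[str],
                rnc_guesses: Iterable[str],
                torot_lemmata: Set[str]) -> Optional[str]:
    """Keep TOROT's lemma if present; otherwise fall back to an RNC guess.

    An RNC guess is accepted only if it occurs in the TOROT lemma list.
    """
    if torot_guess:
        # TOROT provided a lemma: leave it unchanged.
        return torot_guess
    for guess in rnc_guesses:
        # Try the RNC guesses in the order given (ordering is an assumption).
        if guess in torot_lemmata:
            return guess
    # Neither analyzer yields a usable lemma.
    return None


# Toy data, invented for illustration (not taken from file 6 or file 7).
known_lemmata = {"богъ", "быти"}
print(boost_lemma(None, ["быть", "быти"], known_lemmata))  # prints быти
```

Matching against the lemma list filters out RNC guesses that TOROT has no entry for, so the fallback can only add lemmata that already exist in the TOROT database.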