Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian (doi:10.18710/T9NQ9L)

Document Description

Citation

Title:

Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian

Identification Number:

doi:10.18710/T9NQ9L

Distributor:

DataverseNO

Date of Distribution:

2017-08-10

Version:

1

Bibliographic Citation:

Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana, 2017, "Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian", https://doi.org/10.18710/T9NQ9L, DataverseNO, V1

Study Description

Citation

Title:

Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian

Identification Number:

doi:10.18710/T9NQ9L

Authoring Entity:

Berdicevskis, Aleksandrs (UiT The Arctic University of Norway)

Eckhoff, Hanne (UiT The Arctic University of Norway)

Gavrilova, Tatjana (National Research University Higher School of Economics)

Producer:

UiT The Arctic University of Norway

National Research University Higher School of Economics

Date of Production:

2016

Distributor:

DataverseNO

Distributor:

The Tromsø Repository of Language and Linguistics (TROLLing)

Access Authority:

Berdicevskis, Aleksandrs

Depositor:

Conzett, Philipp

Date of Deposit:

2016-04-18

Date of Distribution:

2016

Holdings Information:

https://doi.org/10.18710/T9NQ9L

Study Scope

Keywords:

Arts and Humanities, Middle Russian, morphology, diachronic, inflection

Topic Classification:

Field: Morphology, Time-depth: diachronic, Topic: inflection

Abstract:

We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other (“TOROT”) is used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set to those tokens for which the analyzers provide a guess and, in addition, consider the RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large amount of material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.
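
The two evaluation settings mentioned in the abstract can be illustrated with a minimal Ruby sketch. The token structure (`:gold`, `:guesses`), the method names and the toy data are assumptions made for illustration; they are not the format or logic of the comparison script in this dataset. The abstract also does not state how multiple guesses are scored in the first setting, so the sketch assumes that only the first guess counts there.

```ruby
# Hypothetical token records (not the dataset's format):
# :gold is the manually annotated value, :guesses is the list of values
# an analyzer proposed (empty if it gave no guess).
TOKENS = [
  { gold: "lemma_a", guesses: ["lemma_a"] },
  { gold: "lemma_b", guesses: ["lemma_x", "lemma_b"] }, # correct, but not first
  { gold: "lemma_c", guesses: [] },                     # no guess at all
  { gold: "lemma_d", guesses: ["lemma_y"] }             # wrong guess
].freeze

# Strict setting (assumption: only the first guess counts):
# every token is included in the denominator.
def strict_accuracy(tokens)
  correct = tokens.count { |t| t[:guesses].first == t[:gold] }
  correct.to_f / tokens.size
end

# Lenient setting: only tokens with at least one guess are evaluated,
# and a token is correct if any of its guesses matches the gold value.
def lenient_accuracy(tokens)
  guessed = tokens.reject { |t| t[:guesses].empty? }
  return 0.0 if guessed.empty?
  correct = guessed.count { |t| t[:guesses].include?(t[:gold]) }
  correct.to_f / guessed.size
end
```

On the toy data the strict score is computed over all four tokens and the lenient score over the three tokens that received a guess, which is why the two settings can diverge sharply for an analyzer that often abstains or returns several candidates.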

Country:

Russian Federation

Methodology and Processing

Sources Statement

Data Access

Other Study Description Materials

Related Publications

Citation

Title:

Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111

Other Study-Related Materials

Label:

00_readme.txt

Notes:

text/plain

Other Study-Related Materials

Label:

01_sergij_original.txt

Text:

This is the original (non-normalised) fragment from Sergij of Radonezh (see 3.1)

Notes:

text/plain

Other Study-Related Materials

Label:

02_sergij_normalised.txt

Text:

This is the same fragment as File 1 in normalised orthography (see 3.1 and 2)

Notes:

text/plain

Other Study-Related Materials

Label:

03_orv_combined.txt

Text:

This is the training set that the TnT tagger was trained on (see 3.1)

Notes:

text/plain

Other Study-Related Materials

Label:

04_orv_combined.123

Text:

This is a TnT model file

Notes:

application/vnd.lotus-1-2-3

Other Study-Related Materials

Label:

05_orv_combined.lex

Text:

This is a TnT model file

Notes:

application/octet-stream

Other Study-Related Materials

Label:

06_torotlemmata.tab

Text:

This is the list of TOROT lemmata (used for lemma guessing, see 3.1 and 4.2)

Notes:

text/tab-separated-values

Other Study-Related Materials

Label:

07_sergij_rnc.csv

Text:

This is the output of the RNC tagger (to get access to the tagger itself, contact the third author)

Notes:

text/csv

Other Study-Related Materials

Label:

08_sergij_torot.xml

Text:

This is the output of the TOROT tagger (to get access to the tagger itself, contact the second author)

Notes:

text/xml

Other Study-Related Materials

Label:

09_sergij_gold.xml

Text:

This is the gold standard (see 3.2)

Notes:

text/xml

Other Study-Related Materials

Label:

10_comparison.rb

Text:

This is the comparison script. Use Ruby 1.9.0 or higher to launch it. Make sure files 1, 6, 7, 8 and 9 are in the same directory. Warning messages about duplicated keys can most likely be ignored; otherwise, make sure all Unicode symbols are being read correctly. The script will generate files 11, 12, 13, 15, 17 and 18. Contact the first author if you have any questions.
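
Before launching the script, it may help to confirm that the five input files are present. The following is a minimal sketch, not part of the dataset: the file names are taken from the labels in this record, while the constant `REQUIRED_INPUTS` and the method `missing_inputs` are hypothetical names introduced here.

```ruby
# Input files that the description of 10_comparison.rb says must sit in
# the same directory (files 1, 6, 7, 8 and 9 of this dataset).
REQUIRED_INPUTS = %w[
  01_sergij_original.txt
  06_torotlemmata.tab
  07_sergij_rnc.csv
  08_sergij_torot.xml
  09_sergij_gold.xml
].freeze

# Hypothetical helper: returns the names of required files that are
# absent from the given directory (an empty array means all are present).
def missing_inputs(dir, required = REQUIRED_INPUTS)
  required.reject { |name| File.file?(File.join(dir, name)) }
end
```

Calling `missing_inputs(".")` from the directory that holds the downloaded files should return an empty array if everything needed by file 10 is in place.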

Notes:

application/octet-stream

Other Study-Related Materials

Label:

11_aligned.tab

Notes:

text/tab-separated-values

Other Study-Related Materials

Label:

12_aligned_for_manual.tab

Text:

This is the same output as in 11 in a slightly different form (intended to facilitate manual comparisons). Can be generated by file 10.

Notes:

text/tab-separated-values

Other Study-Related Materials

Label:

13_aligned_for_morph.csv

Text:

These are the morphological tagging outputs of both taggers, aligned with each other and with the gold standard. Can be generated by file 10. Meant to be used as input for file 14.

Notes:

text/csv

Other Study-Related Materials

Label:

14_morph.rb

Text:

This is the script that compares the morphological tagging (all other comparisons are made by file 10). Use Ruby 1.9.0 or higher to launch it. Make sure files 8, 9 and 13 are in the same directory. The script will generate file 16. Contact the second author if you have any questions.

Notes:

application/octet-stream

Other Study-Related Materials

Label:

15_comparison.tab

Text:

This is the detailed information about whether each lemma and POS guess by each tagger is correct or not. Can be generated by file 10.

Notes:

text/tab-separated-values

Other Study-Related Materials

Label:

16_morph_comparison.csv

Text:

This is the detailed information about whether each morphological tag guess by each tagger is correct or not (and, if not, how wrong it is). Can be generated by file 14.

Notes:

text/csv

Other Study-Related Materials

Label:

17_results.csv

Text:

This is the summary of file 15 (lists the results reported in the paper). Can be generated by file 10.

Notes:

text/csv

Other Study-Related Materials

Label:

18_guess_fixme.tab

Text:

This is the output of the RNC-based booster of TOROT performance (see 4.2). Can be generated by file 10.
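
The idea behind the booster, as summarised in the abstract (use RNC lemma guesses when TOROT provides none, and match them against the TOROT lemma list), can be sketched as follows. This is an illustration under stated assumptions, not the code in file 10: the method name `boosted_lemma`, exact string matching, and taking the first matching guess are all assumptions, and the real procedure may harmonize spelling before matching.

```ruby
require "set"

# Hypothetical helper: keeps TOROT's own lemma if there is one; otherwise
# returns the first RNC guess found in the set of known TOROT lemmata,
# or nil if none of the guesses is known.
def boosted_lemma(torot_lemma, rnc_guesses, known_lemmata)
  return torot_lemma unless torot_lemma.nil? || torot_lemma.empty?
  rnc_guesses.find { |guess| known_lemmata.include?(guess) }
end

# Toy lemma inventory standing in for the contents of 06_torotlemmata.tab.
known = Set.new(%w[lemma_a lemma_b])

boosted_lemma("lemma_a", ["lemma_x"], known)      # TOROT lemma is kept
boosted_lemma(nil, ["lemma_x", "lemma_b"], known) # falls back to an RNC guess
boosted_lemma(nil, ["lemma_x"], known)            # no known guess, stays nil
```

Restricting the fallback to lemmata that already exist in the TOROT list keeps the borrowed guesses compatible with TOROT's own lemma inventory.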

Notes:

text/tab-separated-values