<?xml version='1.0' encoding='UTF-8'?><codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"><docDscr><citation><titlStmt><titl>Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian</titl><IDNo agency="DOI">doi:10.18710/T9NQ9L</IDNo></titlStmt><distStmt><distrbtr source="archive">DataverseNO</distrbtr><distDate>2017-08-10</distDate></distStmt><verStmt source="archive"><version date="2023-09-28" type="RELEASED">1</version></verStmt><biblCit>Berdicevskis, Aleksandrs; Eckhoff, Hanne; Gavrilova, Tatjana, 2017, "Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian", https://doi.org/10.18710/T9NQ9L, DataverseNO, V1</biblCit></citation></docDscr><stdyDscr><citation><titlStmt><titl>Replication data for: The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian</titl><IDNo agency="DOI">doi:10.18710/T9NQ9L</IDNo></titlStmt><rspStmt><AuthEnty affiliation="UiT The Arctic University of Norway">Berdicevskis, Aleksandrs</AuthEnty><AuthEnty affiliation="UiT The Arctic University of Norway">Eckhoff, Hanne</AuthEnty><AuthEnty affiliation="National Research University Higher School of Economics">Gavrilova, Tatjana</AuthEnty></rspStmt><prodStmt><producer abbr="UiT">UiT The Arctic University of Norway</producer><producer abbr="HSE">National Research University Higher School of Economics</producer><prodDate>2016</prodDate></prodStmt><distStmt><distrbtr source="archive">DataverseNO</distrbtr><distrbtr abbr="TROLLing" URI="https://trolling.uit.no/">The Tromsø Repository of Language and Linguistics (TROLLing)</distrbtr><contact email="alexberd@gmail.com">Berdicevskis, Aleksandrs</contact><depositr>Conzett, 
Philipp</depositr><depDate>2016-04-18</depDate><distDate>2016</distDate></distStmt><holdings URI="https://doi.org/10.18710/T9NQ9L"/></citation><stdyInfo><subject><keyword xml:lang="en">Arts and Humanities</keyword><keyword>Middle Russian</keyword><keyword>morphology</keyword><keyword>diachronic</keyword><keyword>inflection</keyword><topcClas vocab="&lt;Field term: Choose one or more>">Field: Morphology</topcClas><topcClas vocab="&lt;Time depth: Choose one or more>">Time-depth: diachronic</topcClas><topcClas vocab="&lt;Topic: Choose one or more>">Topic: inflection</topcClas></subject><abstract>We describe and compare two tools for processing Middle Russian texts.
Both tools provide lemmatization, part-of-speech and morphological annotation.
One (“RNC”) was developed for annotating texts in the Russian
National Corpus and is rule-based. The other (“TOROT”) is used
for annotating the eponymous corpus and is statistical. We apply the two
analyzers to the same Middle Russian text and then compare their outputs
with high-quality manual annotation. Since the analyzers use different annotation
schemes and spelling principles, we have to harmonize their outputs
before we can compare them. The comparison shows that TOROT
performs considerably better than RNC (lemmatization 69.8% vs. 47.3%,
part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however,
we limit the evaluation set only to those tokens for which the analyzers provide
a guess and in addition consider the RNC response correct if one of the
multiple guesses it provides is correct, the numbers become comparable
(88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple
procedure which boosts TOROT lemmatization accuracy by 8.7% by using
RNC lemma guesses when TOROT fails to provide one and matching them
against the existing TOROT lemma database. We conclude that a statistical
analyzer (trained on a large amount of material) can deal with non-standardised
historical texts better than a rule-based one. Still, it is possible to make the
analyzers collaborate, boosting the performance of the superior one.</abstract><sumDscr><nation>Russian Federation</nation></sumDscr></stdyInfo><method><dataColl><sources/></dataColl><anlyInfo/></method><dataAccs><setAvail/><useStmt/><notes type="DVN:TOU" level="dv">&lt;a href="http://creativecommons.org/publicdomain/zero/1.0">CC0 1.0&lt;/a></notes></dataAccs><othrStdyMat><relPubl><citation><titlStmt><titl>Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111</titl></titlStmt><biblCit>Berdicevskis, Aleksandrs, Hanne Eckhoff and Tatjana Gavrilova. 2016. The beginning of a beautiful friendship: rule-based and statistical analysis of Middle Russian. Computational linguistics and intellectual technologies. Papers from the annual international conference "Dialogue", 15: 99–111</biblCit></citation><ExtLink URI="http://www.dialog-21.ru/media/3384/berdi%C4%8Devskisaetal.pdf"/></relPubl></othrStdyMat></stdyDscr><otherMat ID="f1676" URI="https://dataverse.no/api/access/datafile/1676" level="datafile"><labl>00_readme.txt</labl><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/plain</notes></otherMat><otherMat ID="f1677" URI="https://dataverse.no/api/access/datafile/1677" level="datafile"><labl>01_sergij_original.txt</labl><txt>This is the original (non-normalised) fragment from Sergij of Radonezh (see 3.1)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/plain</notes></otherMat><otherMat ID="f1678" URI="https://dataverse.no/api/access/datafile/1678" level="datafile"><labl>02_sergij_normalised.txt</labl><txt>This is the same fragment as File 1 in normalised orthography (see 3.1 and 2)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME 
Type">text/plain</notes></otherMat><otherMat ID="f1679" URI="https://dataverse.no/api/access/datafile/1679" level="datafile"><labl>03_orv_combined.txt</labl><txt>This is the training set the TnT was trained on (see 3.1)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/plain</notes></otherMat><otherMat ID="f1680" URI="https://dataverse.no/api/access/datafile/1680" level="datafile"><labl>04_orv_combined.123</labl><txt>This is a TnT model file</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/vnd.lotus-1-2-3</notes></otherMat><otherMat ID="f1681" URI="https://dataverse.no/api/access/datafile/1681" level="datafile"><labl>05_orv_combined.lex</labl><txt>This is a TnT model file</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/octet-stream</notes></otherMat><otherMat ID="f1682" URI="https://dataverse.no/api/access/datafile/1682" level="datafile"><labl>06_torotlemmata.tab</labl><txt>This is the list of TOROT lemmata (used for lemma guessing, see 3.1 and 4.2)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat><otherMat ID="f1683" URI="https://dataverse.no/api/access/datafile/1683" level="datafile"><labl>07_sergij_rnc.csv</labl><txt>This is the output of the RNC tagger (to get access to the tagger itself, contact the third author)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/csv</notes></otherMat><otherMat ID="f1684" URI="https://dataverse.no/api/access/datafile/1684" level="datafile"><labl>08_sergij_torot.xml</labl><txt>This is the output of the TOROT tagger (to get access to the tagger itself, contact the second author)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/xml</notes></otherMat><otherMat ID="f1685" URI="https://dataverse.no/api/access/datafile/1685" 
level="datafile"><labl>09_sergij_gold.xml</labl><txt>This is the gold standard (see 3.2)</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/xml</notes></otherMat><otherMat ID="f1686" URI="https://dataverse.no/api/access/datafile/1686" level="datafile"><labl>10_comparison.rb</labl><txt>This is the comparison script. Use Ruby 1.9.0 or higher to launch it. Make sure files 1, 6, 7, 8 and 9 are in the same directory. Warning messages about duplicated keys can most likely be ignored; otherwise, make sure all Unicode symbols are being read correctly. The script will generate files 11, 12, 13, 15, 17 and 18. Contact the first author if you have any questions.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/octet-stream</notes></otherMat><otherMat ID="f1687" URI="https://dataverse.no/api/access/datafile/1687" level="datafile"><labl>11_aligned.tab</labl><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat><otherMat ID="f1688" URI="https://dataverse.no/api/access/datafile/1688" level="datafile"><labl>12_aligned_for_manual.tab</labl><txt>This is the same output as in file 11 in a slightly different form (intended to facilitate manual comparisons). Can be generated by file 10.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat><otherMat ID="f1689" URI="https://dataverse.no/api/access/datafile/1689" level="datafile"><labl>13_aligned_for_morph.csv</labl><txt>These are the morphological tagging outputs, aligned with each other and with the gold standard. Can be generated by file 10.
Meant to be used as input for file 14.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/csv</notes></otherMat><otherMat ID="f1690" URI="https://dataverse.no/api/access/datafile/1690" level="datafile"><labl>14_morph.rb</labl><txt>This is the script that compares the morphological tagging (all other comparisons are made by file 10). Use Ruby 1.9.0 or higher to launch it. Make sure files 8, 9 and 13 are in the same directory. The script will generate file 16. Contact the second author if you have any questions.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">application/octet-stream</notes></otherMat><otherMat ID="f1691" URI="https://dataverse.no/api/access/datafile/1691" level="datafile"><labl>15_comparison.tab</labl><txt>This is the detailed information about whether each guess of lemma and POS by both taggers is correct or not. Can be generated by file 10.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat><otherMat ID="f1692" URI="https://dataverse.no/api/access/datafile/1692" level="datafile"><labl>16_morph_comparison.csv</labl><txt>This is the detailed information about whether each guess of morphological tag by both taggers is correct or not (and how wrong it is in the latter case). Can be generated by file 14.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/csv</notes></otherMat><otherMat ID="f1693" URI="https://dataverse.no/api/access/datafile/1693" level="datafile"><labl>17_results.csv</labl><txt>This is the summary of file 15 (lists the results reported in the paper).
Can be generated by file 10.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/csv</notes></otherMat><otherMat ID="f1694" URI="https://dataverse.no/api/access/datafile/1694" level="datafile"><labl>18_guess_fixme.tab</labl><txt>This is the output of the RNC-based booster of TOROT performance (see 4.2). Can be generated by file 10.</txt><notes level="file" type="DATAVERSE:CONTENTTYPE" subject="Content/MIME Type">text/tab-separated-values</notes></otherMat></codeBook>