This dataset consists of a TSV file with five columns of data originating in Zaliznyak's grammatical dictionary of Russian (1977). The data was programmatically scraped from Giella project data (Moshagen et al., 2013) by Spektor (2021), where it served as one of four sources for the RusLex application. For the present dataset, the only change made to the scraped data was the removal of symbols.
The Russian word data is preserved in Cyrillic, as in the original. The last column contains abbreviated morphological features in English (e.g. "V" for verb, "N" for noun, "Fem" for feminine, "Cmpr" for comparative, "Impf" for imperfective). A word often has many features; they are separated by semicolons.
A stress code representing stress placement was derived for each word by counting syllables from the end of the word. If the stressed vowel was in the last syllable of the word, a stress code of 0 was assigned, signifying oxytone stress. If the stress fell on the penultimate syllable, the word was assigned a 1, signifying paroxytone stress.
If the antepenultimate syllable carried the stress, the word was assigned a 2, signifying proparoxytone stress. The script continued in this way until a stress code was assigned, with one exception: words without explicit stress markers were assigned a -1.
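The counting scheme can be illustrated with a minimal Python sketch. It is not the script used to build the dataset. It assumes that stress is marked with a combining acute accent (U+0301) placed directly after the stressed vowel, which the description does not confirm, and the names `VOWELS`, `STRESS_MARK`, and `stress_code` are hypothetical.

```python
# Assumption: stress is marked by a combining acute accent (U+0301)
# that directly follows the stressed vowel. The actual marker used
# in the dataset may differ.
STRESS_MARK = "\u0301"

# Russian vowel letters; each vowel is counted as one syllable nucleus.
VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def stress_code(stressed_word: str) -> int:
    """Return the number of syllables after the stressed one.

    0 = final syllable, 1 = penultimate, 2 = antepenultimate, and so on.
    Returns -1 when the word carries no explicit stress marker.
    """
    pos = stressed_word.find(STRESS_MARK)
    if pos <= 0:
        # No marker found (or a marker with no vowel before it).
        return -1
    # Count the vowels that follow the stress mark.
    return sum(1 for ch in stressed_word[pos + 1:] if ch in VOWELS)


print(stress_code("молоко\u0301"))   # stress on the final syllable
print(stress_code("соба\u0301ка"))   # stress on the penultimate syllable
print(stress_code("ко\u0301мната"))  # stress on the antepenultimate syllable
print(stress_code("дом"))            # no explicit stress marker
```

Under these assumptions the four calls print 0, 1, 2, and -1. Note that this sketch treats an unmarked "ё" as unstressed, in line with the rule that only explicit markers are counted.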
The columns in the resultant TSV are: the word without stress markers, the word with stress markers, the derived stress code, the lemma, and all morphological features.
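As a rough illustration of this layout, the following Python sketch reads rows in the column order given above and splits the feature column on semicolons. It assumes the file has no header row and contains no quoted fields; the field names and the sample line are made up for the example and are not taken from the dataset.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple


class Entry(NamedTuple):
    # Hypothetical field names following the column order described above.
    form: str            # word without stress markers
    stressed_form: str   # word with stress markers
    stress_code: int     # derived stress code
    lemma: str
    features: List[str]  # morphological features, split on ";"


def read_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per well-formed five-column TSV row."""
    # Assumption: plain tab-separated text without quoting or a header row.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed rows
        form, stressed, code, lemma, feats = row
        yield Entry(form, stressed, int(code), lemma,
                    [f for f in feats.split(";") if f])


# Invented sample row, used only to show the expected shape of the data.
sample = "собака\tсоба\u0301ка\t1\tсобака\tN;Fem;Sg;Nom\n"
entries = list(read_entries(io.StringIO(sample)))
print(entries[0].lemma, entries[0].stress_code, entries[0].features)
```

To read the real file, an open file handle can be passed instead of the `io.StringIO` object, for example one created with `open(path, encoding="utf-8", newline="")`, where `path` is the location of the TSV.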
The dataset contains over 300,000 words from Zaliznyak (1977). Many words are repeated, with each repetition carrying a unique set of morphological features. Please see the paper for a full description of the dataset.
References:
Moshagen, Sjur N., Tommi Pirinen, and Trond Trosterud. (2013). Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (pp. 343–352).
Spektor, Y. (2021). Detection and morphological analysis of novel Russian loanwords (Master’s thesis, CUNY Graduate Center, New York, NY). Retrieved from https://academicworks.cuny.edu/gc_etds/4572/
Zaliznyak, A.A. (1977). Grammatičeskij slovar’ russkogo jazyka. Slovoizmenenie [A grammatical dictionary of Russian: Inflection]. Moscow: Russkij jazyk.
(2022-09-01)