The data files in this dataset contain single words and text fragments from the following sources:
Russian National Corpus, https://ruscorpora.ru/.
Araneum Russicum Maius Corpus, http://unesco.uniba.sk/aranea_about/index.html. For more information, see the following papers:
Benko, Vladimír: Aranea: Yet Another Family of (Comparable) Web Corpora. In Petr Sojka, Aleš Horák, Ivan Kopeček and Karel Pala (Eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, 2014. pp. 257-264. ISBN: 978-3-319-10815-5 (Print), 978-3-319-10816-2 (Online).
Benko, Vladimír: Compatible Sketch Grammars for Comparable Corpora. In Andrea Abel, Chiara Vettori, Natascia Ralli (Eds.): Proceedings of the XVI EURALEX International Congress: The User In Focus. 15–19 July 2014. Bolzano/Bozen: Eurac Research, 2014. pp. 417-430. ISBN 978-88-88906-97-3.
Rychlý, Pavel: Manatee/Bonito – A Modular Corpus Manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Masaryk University, 2007, pp. 65-70. ISBN 978-80-210-4471-5.
The extracted text fragments and single words contained in the data files of this dataset represent only non-substantial portions of the sources listed above, and they do not form coherent larger texts. Therefore, the reuse (including redistribution) of these excerpts is permitted under the exception rules in IPR and database protection regulations, such as Fair use (USA; cf. US Copyright Act), Fair dealing (UK; cf. Exceptions to copyright), the EU Database Directive (cf. article 8 Rights and obligations of lawful users), "lover, forskrifter, rettsavgjørelser og andre vedtak av offentlig myndighet" (Norway; cf. § 14 in Åndsverkloven), "uvesentlige deler av databaser" (Norway; cf. § 24 in Åndsverkloven), and "sitatretten" (Norway; cf. § 29 in Åndsverkloven).