{"id":24,"identifier":"700FNV","persistentUrl":"https://doi.org/10.18710/700FNV","protocol":"doi","authority":"10.18710","separator":"/","publisher":"DataverseNO","publicationDate":"2016-03-29","storageIdentifier":"S3://10037.1/10291","datasetType":"dataset","datasetVersion":{"id":3759,"datasetId":24,"datasetPersistentId":"doi:10.18710/700FNV","storageIdentifier":"S3://10037.1/10291","versionNumber":1,"versionMinorNumber":2,"versionState":"RELEASED","latestVersionPublishingState":"RELEASED","deaccessionLink":"","distributionDate":"2016","productionDate":"2016","lastUpdateTime":"2023-09-28T19:44:52Z","releaseTime":"2023-09-28T19:44:52Z","createTime":"2023-09-28T15:40:58Z","alternativePersistentId":"hdl:10037.1/10291","publicationDate":"2016-03-29","citationDate":"2016-03-29","license":{"name":"CC0 1.0","uri":"http://creativecommons.org/publicdomain/zero/1.0","iconUri":"https://licensebuttons.net/p/zero/1.0/88x31.png","rightsIdentifier":"CC0-1.0","rightsIdentifierScheme":"SPDX","schemeUri":"https://spdx.org/licenses/","languageCode":"en"},"fileAccessRequest":true,"metadataBlocks":{"citation":{"displayName":"Citation Metadata","name":"citation","fields":[{"typeName":"title","multiple":false,"typeClass":"primitive","value":"Replication data for: Who needs particles? 
A challenge to the classification of particles as a part of speech in Russian"},{"typeName":"author","multiple":true,"typeClass":"compound","value":[{"authorName":{"typeName":"authorName","multiple":false,"typeClass":"primitive","value":"Endresen, Anna"},"authorAffiliation":{"typeName":"authorAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"}},{"authorName":{"typeName":"authorName","multiple":false,"typeClass":"primitive","value":"Janda, Laura A."},"authorAffiliation":{"typeName":"authorAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"}},{"authorName":{"typeName":"authorName","multiple":false,"typeClass":"primitive","value":"Reynolds, Robert"},"authorAffiliation":{"typeName":"authorAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"}},{"authorName":{"typeName":"authorName","multiple":false,"typeClass":"primitive","value":"Tyers, Francis M."},"authorAffiliation":{"typeName":"authorAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"}}]},{"typeName":"datasetContact","multiple":true,"typeClass":"compound","value":[{"datasetContactName":{"typeName":"datasetContactName","multiple":false,"typeClass":"primitive","value":"Janda, Laura A."},"datasetContactAffiliation":{"typeName":"datasetContactAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"},"datasetContactEmail":{"typeName":"datasetContactEmail","multiple":false,"typeClass":"primitive","value":"laura.janda@uit.no"}}]},{"typeName":"dsDescription","multiple":true,"typeClass":"compound","value":[{"dsDescriptionValue":{"typeName":"dsDescriptionValue","multiple":false,"typeClass":"primitive","value":"In 1985, Zwicky argued that “particle” is a pretheoretical notion that should be eliminated from linguistic analysis. 
We propose a reclassification of Russian particles that implements Zwicky’s directive. Russian particles lack a coherent conceptual basis as a category and many are ambiguous with respect to part of speech. Our corpus analysis of Russian particles addresses theoretical questions about the cognitive status of parts of speech and practical concerns about how particles should be represented in computational models. We focus on nine high-frequency words commonly classed as particles: ešče, tak, ved’, slovno, daže, že, li, da, net. We show that current tagging of particles in the manually disambiguated Morphological Standard of the Russian National Corpus (RNC) is not entirely consistent, and that this can create challenges for training a part-of-speech tagger. We offer an alternative tagging scheme that eliminates the category of “particle” altogether. We show that our enriched scheme makes it possible for a part-of-speech tagger to achieve more useful results. Our analysis of particles provides a detailed account of various sub-uses that correspond to different parts of speech, their relationships, and relative distribution. In this sense, our study also contributes to the study of words that exhibit part-of-speech ambiguities. We construct a database by extracting from the RNC gold standard 100 random sentences for each of the nine focus words. This database is used for both training and testing a Hidden Markov Model (HMM) trigram tagger (Halácsy et al. 2007), which is the standard model for training part-of-speech tagging. This is done in two rounds: in Experiment 1 we use the tagging of the nine words as in the RNC, including the use of “particle” as a tag; in Experiment 2 we use our own tagging scheme which eliminates “particle” as a tag. In both experiments we partition our database into ten chunks and perform a ten-fold cross-validation, each time using 90 sentences as the training set and 10 sentences as the test set. 
This means that each part of the total set is tested in the course of the ten repetitions of training and testing."},"dsDescriptionDate":{"typeName":"dsDescriptionDate","multiple":false,"typeClass":"primitive","value":"2016"}}]},{"typeName":"subject","multiple":true,"typeClass":"controlledVocabulary","value":["Arts and Humanities"]},{"typeName":"keyword","multiple":true,"typeClass":"compound","value":[{"keywordValue":{"typeName":"keywordValue","multiple":false,"typeClass":"primitive","value":"Russian"}},{"keywordValue":{"typeName":"keywordValue","multiple":false,"typeClass":"primitive","value":"Hidden Markov Model"}}]},{"typeName":"topicClassification","multiple":true,"typeClass":"compound","value":[{"topicClassValue":{"typeName":"topicClassValue","multiple":false,"typeClass":"primitive","value":"Field: Lexis"},"topicClassVocab":{"typeName":"topicClassVocab","multiple":false,"typeClass":"primitive","value":"<Field term: Choose one or more>"}},{"topicClassValue":{"typeName":"topicClassValue","multiple":false,"typeClass":"primitive","value":"Time-depth: synchronic"},"topicClassVocab":{"typeName":"topicClassVocab","multiple":false,"typeClass":"primitive","value":"<Time depth: Choose one or more>"}},{"topicClassValue":{"typeName":"topicClassValue","multiple":false,"typeClass":"primitive","value":"Topic: particles"},"topicClassVocab":{"typeName":"topicClassVocab","multiple":false,"typeClass":"primitive","value":"<Topic: Choose one or more>"}}]},{"typeName":"publication","multiple":true,"typeClass":"compound","value":[{"publicationCitation":{"typeName":"publicationCitation","multiple":false,"typeClass":"primitive","value":"Endresen, A., Janda, L. A., Reynolds, R., & Tyers, F. M. (2016). Who needs particles? A challenge to the classification of particles as a part of speech in Russian / Кому нужны частицы? Стоит ли определять частицы как отдельную часть речи в русском языке? Russian Linguistics, 40(2), 103–132. 
http://www.jstor.org/stable/43945159"},"publicationIDType":{"typeName":"publicationIDType","multiple":false,"typeClass":"controlledVocabulary","value":"url"},"publicationIDNumber":{"typeName":"publicationIDNumber","multiple":false,"typeClass":"primitive","value":"https://www.jstor.org/stable/43945159"},"publicationURL":{"typeName":"publicationURL","multiple":false,"typeClass":"primitive","value":"https://www.jstor.org/stable/43945159"}}]},{"typeName":"language","multiple":true,"typeClass":"controlledVocabulary","value":["English"]},{"typeName":"producer","multiple":true,"typeClass":"compound","value":[{"producerName":{"typeName":"producerName","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of Norway"},"producerAbbreviation":{"typeName":"producerAbbreviation","multiple":false,"typeClass":"primitive","value":"UiT"},"producerURL":{"typeName":"producerURL","multiple":false,"typeClass":"primitive","value":"https://en.uit.no/"}}]},{"typeName":"productionDate","multiple":false,"typeClass":"primitive","value":"2016"},{"typeName":"grantNumber","multiple":true,"typeClass":"compound","value":[{"grantNumberAgency":{"typeName":"grantNumberAgency","multiple":false,"typeClass":"primitive","value":"The Research Council of Norway"},"grantNumberValue":{"typeName":"grantNumberValue","multiple":false,"typeClass":"primitive","value":"222506"}}]},{"typeName":"distributor","multiple":true,"typeClass":"compound","value":[{"distributorName":{"typeName":"distributorName","multiple":false,"typeClass":"primitive","value":"The Tromsø Repository of Language and Linguistics (TROLLing)"},"distributorAffiliation":{"typeName":"distributorAffiliation","multiple":false,"typeClass":"primitive","value":"UiT The Arctic University of 
Norway"},"distributorURL":{"typeName":"distributorURL","multiple":false,"typeClass":"primitive","value":"https://trolling.uit.no/"}}]},{"typeName":"distributionDate","multiple":false,"typeClass":"primitive","value":"2016"},{"typeName":"dateOfDeposit","multiple":false,"typeClass":"primitive","value":"2016-03-28"},{"typeName":"kindOfData","multiple":true,"typeClass":"primitive","value":["corpus"]}]},"geospatial":{"displayName":"Geospatial Metadata","name":"geospatial","fields":[{"typeName":"geographicCoverage","multiple":true,"typeClass":"compound","value":[{"otherGeographicCoverage":{"typeName":"otherGeographicCoverage","multiple":false,"typeClass":"primitive","value":"Russia"}},{"otherGeographicCoverage":{"typeName":"otherGeographicCoverage","multiple":false,"typeClass":"primitive","value":"Russia"}}]}]}},"files":[{"description":"This is a spreadsheet of our database, which was used in both Experiment 1 and Experiment 2.","label":"DATABASE particles.csv","restricted":false,"version":1,"datasetVersionId":3759,"dataFile":{"id":642,"persistentId":"","filename":"DATABASE particles.csv","contentType":"text/plain; charset=UTF-8","friendlyType":"Plain Text","filesize":418876,"description":"This is a spreadsheet of our database, which was used in both Experiment 1 and Experiment 2.","storageIdentifier":"S3://uit-dataverseno-prod01:623","rootDataFileId":-1,"md5":"1b8aa0f9ac611f416e5ace5c2ed34cd7","checksum":{"type":"MD5","value":"1b8aa0f9ac611f416e5ace5c2ed34cd7"},"tabularData":false,"creationDate":"2016-03-29","publicationDate":"2016-03-29","fileAccessRequest":true}},{"description":"This file contains all the data and code needed to run Experiment 1 and Experiment 2.","label":"experiment.tar.gz","restricted":false,"version":1,"datasetVersionId":3759,"dataFile":{"id":639,"persistentId":"","filename":"experiment.tar.gz","contentType":"application/x-gzip","friendlyType":"Gzip Archive","filesize":4835563,"description":"This file contains all the data and code needed to run 
Experiment 1 and Experiment 2.","storageIdentifier":"S3://uit-dataverseno-prod01:620","rootDataFileId":-1,"md5":"6f1bbc8da3aa938539905d2a4b1f0e90","checksum":{"type":"MD5","value":"6f1bbc8da3aa938539905d2a4b1f0e90"},"tabularData":false,"creationDate":"2016-03-28","publicationDate":"2016-03-29","fileAccessRequest":true}},{"description":"This file describes the columns and values in those columns for the DATABASE particles.csv file.","label":"Readme file for DATABASE particles.txt","restricted":false,"version":1,"datasetVersionId":3759,"dataFile":{"id":643,"persistentId":"","filename":"Readme file for DATABASE particles.txt","contentType":"text/plain; charset=UTF-8","friendlyType":"Plain Text","filesize":1022,"description":"This file describes the columns and values in those columns for the DATABASE particles.csv file.","storageIdentifier":"S3://uit-dataverseno-prod01:624","rootDataFileId":-1,"md5":"1f16a8d04cfcb0b0b49cafe80f3456d9","checksum":{"type":"MD5","value":"1f16a8d04cfcb0b0b49cafe80f3456d9"},"tabularData":false,"creationDate":"2016-03-29","publicationDate":"2016-03-29","fileAccessRequest":true}},{"description":"This file explains the contents of the experiment.tar.gz file.","label":"Readme file for experiment.txt","restricted":false,"version":1,"datasetVersionId":3759,"dataFile":{"id":641,"persistentId":"","filename":"Readme file for experiment.txt","contentType":"text/plain; charset=UTF-8","friendlyType":"Plain Text","filesize":831,"description":"This file explains the contents of the experiment.tar.gz file.","storageIdentifier":"S3://uit-dataverseno-prod01:622","rootDataFileId":-1,"md5":"880a5543e7c81cb25f5d1c4ddf7aad77","checksum":{"type":"MD5","value":"880a5543e7c81cb25f5d1c4ddf7aad77"},"tabularData":false,"creationDate":"2016-03-29","publicationDate":"2016-03-29","fileAccessRequest":true}}],"citation":"Endresen, Anna; Janda, Laura A.; Reynolds, Robert; Tyers, Francis M., 2016, \"Replication data for: Who needs particles? 
A challenge to the classification of particles as a part of speech in Russian\", https://doi.org/10.18710/700FNV, DataverseNO, V1"}}