Replication data for: Verbal constructional profiles: reliability, distinction power and practical applications
doi:10.18710/T6KSX4
DataverseNO
2014-12-06
1
Berdicevskis, Aleksandrs; Eckhoff, Hanne, 2014, "Replication data for: Verbal constructional profiles: reliability, distinction power and practical applications", https://doi.org/10.18710/T6KSX4, DataverseNO, V1
Berdicevskis, Aleksandrs
Eckhoff, Hanne
UiT The Arctic University of Norway
2014222506
DataverseNO
The Tromsø Repository of Language and Linguistics (TROLLing)
Berdicevskis, Aleksandrs
2014-11-20
2014
Arts and Humanities
constructional profiles; profiles; treebanks; Russian; SynTagRus; PROIEL
Field: Syntax; Time-depth: synchronic; Topic: verbs

A linguistic profile is a frequency distribution of occurrences of a linguistic item across a given parameter. Containing useful quantitative information about an item's usage, a profile can help to discover fundamental properties of the item. Here we focus on verbal constructional profiles, where the item is always a verb, and the parameter is its syntactic environment. This methodology has been used for various purposes with some success, but little is known about the basic properties of the profiles. We start by addressing two general methodological questions. First, is there such a thing as a reliable constructional profile, i.e. is there a stable distribution of a verb across its syntactic contexts? If yes, what corpus size is required to capture it? Second, what distinction power do the profiles possess at different corpus sizes? To test that, we used the SynTagRus treebank of modern Russian, both in its native dependency format and converted into the PROIEL format. As a secondary goal, we compare the two dependency schemes' ability to yield useful argument structure data.
We then zoom in on a more language-specific question and estimate the possibility of using verbal constructional profiles as an objective criterion in the study of Russian aspect, with a view to using it in diachronic studies. We test the method's applicability to Russian aspect, taking into account the answers we give to our methodological questions. We test two different hypotheses: first, that constructional profiles can be used to identify the most likely aspectual partner of a verb; second, that constructional profiles can be used to tell whether a verb is perfective or imperfective. We see that both hypotheses hold to some extent, but only the second one seems applicable to actual research questions.

Russian Federation
Norway
corpus
Berdicevskis, Aleksandrs and Eckhoff, Hanne Martine (2014). "Verbal constructional profiles: reliability, distinction power and practical applications." In: Verena Henrich et al. (eds.): Proceedings of the Thirteenth International Workshop on Treebanks and Linguistic Theories (TLT13), University of Tübingen (2014)
11022/0000-0000-2E8A-2

aspect.rb
Ruby script which performs the aspect experiments (aspect identification, partner matching). Run the script by typing "ruby aspect.rb" in the command line (Ruby 1.9.3 or compatible has to be installed on your machine, and the frames files have to be in the same directory; see readme.txt for more). In addition, provide three parameters, in this order: frame type (nm, vm or fm: simple, partly enriched or fully enriched, respectively); annotation scheme (mtt or proiel); which experiment to perform (aspect or partner). Example: ruby aspect.rb vm proiel aspect
text/plain; charset=US-ASCII

aspect_matching_vm_proiel_sub.csv
Results of the aspect identification test. Profile type: partly enriched; annotation scheme: PROIEL. First column: frequency cutoff; second column: success rate; third column: number of verbs.
Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

frames_fm_mtt_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: MTT (=SynTagRus' native annotation); profile type: fully enriched (=fm, full morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

frames_fm_proiel_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: PROIEL; profile type: fully enriched (=fm, full morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

frames_nm_mtt_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: MTT (=SynTagRus' native annotation); profile type: simple (=nm, no morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma.
Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

frames_nm_proiel_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: PROIEL; profile type: simple (=nm, no morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

frames_vm_mtt_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: MTT (=SynTagRus' native annotation); profile type: partly enriched (=vm, verbal morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

frames_vm_proiel_sub.csv
Argument frames for all verbs in SynTagRus. First column: the verb lemma (aspect explicitly indicated, separated by a full stop; for paired verbs, the perfective lemma will look like IMPERFECTIVE_LEMMA.pf). Second column: the verb's absolute frequency. Other columns: all frames in which the given verb occurs. Annotation scheme: PROIEL; profile type: partly enriched (=vm, verbal morphology); subjects included. Encoding: UTF-8 without BOM; cell separator: comma.
Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

matches_fm_mtt_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: fully enriched; annotation scheme: MTT (=SynTagRus' native annotation). First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matches_fm_proiel_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: fully enriched; annotation scheme: PROIEL. First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matches_nm_mtt_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: simple; annotation scheme: MTT (=SynTagRus' native annotation). First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matches_nm_proiel_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: simple; annotation scheme: PROIEL. First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matches_vm_mtt_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: partly enriched; annotation scheme: MTT (=SynTagRus' native annotation).
First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matches_vm_proiel_subsample_27.csv
Results of the matching test (distinction power evaluation). Profile type: partly enriched; annotation scheme: PROIEL. First column: sample size (=frequency cutoff/2); other columns (number of columns = number of permutations, 30 by default): matching rates. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot.
text/plain; charset=US-ASCII

matching.rb
Ruby script which evaluates the distinction power of profiles built using frames of different types (by performing the matching test). Notice that the algorithm uses randomization, i.e. the results in the output files will be slightly different every time you run the script. Run the script by typing "ruby matching.rb" in the command line (Ruby 1.9.3 or compatible has to be installed on your machine, and the frames files have to be in the same directory; see readme.txt for more).
text/plain; charset=US-ASCII

matching_significance.xlsx
Statistical significance for the matching experiment: p-values (pairwise homoscedastic t-tests with Bonferroni correction) and effect sizes (Cohen's d). Large p-values (corrected) and negligible effect sizes are highlighted in red and yellow, respectively (may not work if the software is not compatible with Excel 2010).
application/octet-stream

partners_matching_fm_mtt_sub.csv
Results of the partner matching test. Profile type: fully enriched; annotation scheme: MTT (=SynTagRus' native annotation). First column: verb lemma; second column: aspect (NB: paired perfective and imperfective are represented by the same lemma); third column: partner's lemma; fourth column: partner's aspect; fifth column: 1 indicates that partners are matched correctly, 0 indicates failure.
Before each result batch a frequency cutoff is indicated; after the batch the average success rate is given. "Rate for attempted pairs" is calculated only for those pairs that the guesser attempted to match. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

partners_matching_fm_proiel_sub.csv
Results of the partner matching test. Profile type: fully enriched; annotation scheme: PROIEL. First column: verb lemma; second column: aspect (NB: paired perfective and imperfective are represented by the same lemma); third column: partner's lemma; fourth column: partner's aspect; fifth column: 1 indicates that partners are matched correctly, 0 indicates failure. Before each result batch a frequency cutoff is indicated; after the batch the average success rate is given. "Rate for attempted pairs" is calculated only for those pairs that the guesser attempted to match. Encoding: UTF-8 without BOM; cell separator: comma; decimal separator: dot. Microsoft Excel fails to read the separator and the Cyrillic characters; Open Office reads the file correctly.
text/plain; charset=UTF-8

readme.txt
Some general information about the files available here.
text/plain; charset=US-ASCII

spotcheck50.xls
Conversion accuracy check for 50 randomly selected sentences. Misannotations are highlighted in light red (this might not work if the software is not compatible with Excel 2010) and marked by "!!" at the beginning. These cells contain the label provided by the conversion script and the label that should have been assigned (e.g. "!! adv obl" means that the script labelled the relation as "adv", while the correct label is "obl").
application/vnd.ms-excel

stability.rb
Ruby script which evaluates the stability of profiles built using frames of different types.
For every combination of frame type and annotation scheme, the script outputs two files: average values (not uploaded here) and the full set of datapoints (uploaded). Notice that the algorithm uses randomization, i.e. the results will be slightly different every time you run the script. Run the script by typing "ruby stability.rb" in the command line (Ruby 1.9.3 or compatible has to be installed on your machine, and the frames files have to be in the same directory; see readme.txt for more).
text/plain; charset=US-ASCII

stability_fm_mtt_sub.csv
Results of the stability experiment. Profile type: fully enriched; annotation scheme: MTT (=SynTagRus' native annotation); subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000. Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII

stability_fm_proiel_sub.csv
Results of the stability experiment. Profile type: fully enriched; annotation scheme: PROIEL; subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000. Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII

stability_nm_mtt_sub.csv
Results of the stability experiment. Profile type: simple; annotation scheme: MTT (=SynTagRus' native annotation); subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000.
Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII

stability_nm_proiel_sub.csv
Results of the stability experiment. Profile type: simple; annotation scheme: PROIEL; subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000. Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII

stability_significance.xlsx
Statistical significance for the stability experiment: p-values (pairwise homoscedastic t-tests with Bonferroni correction) and effect sizes (Cohen's d). Large p-values (corrected) and negligible effect sizes are highlighted in red and yellow, respectively (may not work if the software is not compatible with Excel 2010).
application/octet-stream

stability_vm_mtt_sub.csv
Results of the stability experiment. Profile type: partly enriched; annotation scheme: MTT (=SynTagRus' native annotation); subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000. Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII

stability_vm_proiel_sub.csv
Results of the stability experiment. Profile type: partly enriched; annotation scheme: PROIEL; subjects included. First column: sample size; other columns: stability for a given datapoint. Note that the exact number of datapoints (=number of columns) depends on the number of verbs available at a given sample size, but is never less than 1000. Encoding: UTF-8 without BOM, cell separator: comma, decimal separator: dot.
text/plain; charset=US-ASCII
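
The stability_*.csv files described above hold the sample size in the first column and one stability value per datapoint in the remaining columns. The averaged output of stability.rb is not uploaded, so per-sample-size means may need to be recomputed from these files. The following is a minimal Ruby sketch of one way to do that. It assumes that the files have no header row and that every cell after the first is a number with a dot as decimal separator. The function name mean_stability_by_sample_size and the sample values are illustrative only; they are not part of the dataset or of stability.rb, and a plain arithmetic mean may not be exactly how the script computes its averages.

```ruby
# Mean stability per sample size, for text shaped like the stability_*.csv
# files: the first cell of each row is the sample size, the remaining cells
# are stability values (comma-separated, dot as decimal separator).
# Assumption: no header row; a non-numeric value cell raises ArgumentError.
def mean_stability_by_sample_size(text)
  means = {}
  text.each_line do |line|
    cells = line.strip.split(",").map(&:strip).reject(&:empty?)
    next if cells.size < 2 # skip blank rows and rows without values
    values = cells.drop(1).map { |cell| Float(cell) }
    # Key by the sample-size label as written in the file.
    means[cells.first] = values.inject(0.0) { |sum, v| sum + v } / values.size
  end
  means
end

# Made-up numbers for illustration only; not taken from the dataset.
sample = "10,0.5,0.25\n20,0.75,0.75\n"
mean_stability_by_sample_size(sample).each do |size, mean|
  puts "#{size}: #{mean}"
end
```

To apply the sketch to one of the uploaded files, pass the file contents instead of the sample string, e.g. File.read("stability_nm_proiel_sub.csv", encoding: "UTF-8").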