Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Romanian Weak Pronoun Choice Data |
Identification Number: |
doi:10.18710/GSV27M |
Distributor: |
DataverseNO |
Date of Distribution: |
2014-06-18 |
Version: |
1 |
Bibliographic Citation: |
Gerstenberger, Ciprian-Virgil, 2014, "Romanian Weak Pronoun Choice Data", https://doi.org/10.18710/GSV27M, DataverseNO, V1 |
Citation |
|
Title: |
Romanian Weak Pronoun Choice Data |
Identification Number: |
doi:10.18710/GSV27M |
Authoring Entity: |
Gerstenberger, Ciprian-Virgil (UiT The Arctic University of Norway) |
Producer: |
UiT The Arctic University of Norway |
Date of Production: |
2014-06-17 |
Distributor: |
DataverseNO |
Distributor: |
The Tromsø Repository of Language and Linguistics (TROLLing) |
Access Authority: |
Gerstenberger, Ciprian-Virgil |
Date of Deposit: |
2014-06-17 |
Date of Distribution: |
2014-06-17 |
Holdings Information: |
https://doi.org/10.18710/GSV27M |
Study Scope |
|
Keywords: |
Arts and Humanities, Romanian, clitic pronouns, surface form, sandhi phenomena |
Topic Classification: |
Field: Morphology, Topic: clitics, Topic: pronouns, Time-depth: synchronic |
Abstract: |
This corpus study shows that soft linguistic constraints are hard to describe and operationalize. In specific contexts, some Romanian clitic pronouns allow a choice between phonological hosts, as in că-mi dai cartea vs. că îmi dai cartea, both meaning [that you give me the book]. What determines the choice between the subjunction că in că-mi and prosthetic î in îmi (cf. Lombard 1976)? Popescu (2003, p. 160) argues that speech rate triggers the surface realization (monosyllabic că-mi in fast speech vs. bisyllabic că îmi in normal speech), while Dindelegan (2013, p. 388) argues for register rules (informal că-mi vs. formal că îmi). On the register account, formal, written language represents one extreme of a formality scale and informal, spoken language the other. Thus, a Romanian corpus of official documents, such as legal texts, is expected to contain exclusively, or at least predominantly, forms with prosthetic î in contexts where both variants are otherwise possible. To test these two hypotheses, the Romanian part of the JRC-Acquis corpus (http://ec.europa.eu/dgs/jrc/) has been tagged with the RACAI tagger (http://www.racai.ro). The resulting corpus of 62,650,821 tokens (including punctuation) has been evaluated with respect to the phenomena under scrutiny. For specific hosts, enclitic forms have been compared with their î-prosthetic counterparts. The numbers show almost no difference in usage, or a statistically insignificant one, for some specific host+clitic pairs (e.g., 3886 să îşi vs. 3852 să-şi [that to himself/herself], 200 ce îi vs. 110 ce-i [what to him/her]). From a usage-based perspective, these findings are clear arguments both against the register rules proposed by Dindelegan (2013) and against a pure speech rate hypothesis as in Popescu (2003). 
Since the JRC-Acquis corpus is translated from English by different translators, perhaps both native and non-native speakers of Romanian, a further corpus of original Romanian legal texts is being compiled for further analysis and comparison. |
The full dataset consists of (1) two tgz files containing the POS-tagged data extracted from the JRC-Acquis corpus, one with enclitic forms and one with î-prosthetic forms; (2) a description file documenting the XML format of the data; and (3) a draft of the article as a PDF file, for linguistic background. |
|
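As a rough illustration of the comparison described in the abstract, the following sketch computes the share of î-prosthetic forms for the two host+clitic pairs quoted there. The counts are taken from the abstract; the dictionary layout and the function name `prosthetic_share` are hypothetical and not part of the dataset, and the sketch performs no significance test.

```python
# Counts quoted in the abstract: (î-prosthetic form count, enclitic form count).
# The pair labels and this data layout are illustrative assumptions.
counts = {
    "să îşi / să-şi": (3886, 3852),
    "ce îi / ce-i": (200, 110),
}


def prosthetic_share(prosthetic, enclitic):
    """Return the fraction of occurrences that use the î-prosthetic form."""
    total = prosthetic + enclitic
    if total == 0:
        raise ValueError("no occurrences for this host+clitic pair")
    return prosthetic / total


# Share of î-prosthetic forms per host+clitic pair.
shares = {pair: prosthetic_share(p, e) for pair, (p, e) in counts.items()}

for pair, share in shares.items():
    print(f"{pair}: {share:.1%} î-prosthetic")
```

A share close to 0.5 means the two variants occur about equally often in the corpus, whereas the register hypothesis would predict a share near 1 for legal texts.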
Geographic Coverage: |
EU countries, Romania |
Kind of Data: |
corpus |
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Title: |
The hardness of soft linguistic constraints (work in progress) |
Bibliographic Citation: |
The hardness of soft linguistic constraints (work in progress) |
Label: |
data_description.txt |
Text: |
description of the xml format of the extracted data |
Notes: |
text/plain; charset=UTF-8 |
Label: |
hardness_of_sc.pdf |
Text: |
article draft as linguistic background |
Notes: |
application/pdf |
Label: |
jrc_ac_enclitic_weak_pronouns_RON.tgz |
Text: |
enclitic forms extracted from the Romanian part of the JRC-Acquis corpus (http://ec.europa.eu/dgs/jrc/) |
Notes: |
application/octet-stream |
Label: |
jrc_ac_prosthetic_weak_pronouns_RON.tgz |
Text: |
î-prosthetic forms extracted from the Romanian part of the JRC-Acquis corpus |
Notes: |
application/octet-stream |