Romanian Weak Pronoun Choice Data (doi:10.18710/GSV27M)

Document Description

Citation

Title:

Romanian Weak Pronoun Choice Data

Identification Number:

doi:10.18710/GSV27M

Distributor:

DataverseNO

Date of Distribution:

2014-06-18

Version:

1

Bibliographic Citation:

Gerstenberger, Ciprian-Virgil, 2014, "Romanian Weak Pronoun Choice Data", https://doi.org/10.18710/GSV27M, DataverseNO, V1

Study Description

Citation

Title:

Romanian Weak Pronoun Choice Data

Identification Number:

doi:10.18710/GSV27M

Authoring Entity:

Gerstenberger, Ciprian-Virgil (UiT The Arctic University of Norway)

Producer:

UiT The Arctic University of Norway

Date of Production:

2014-06-17

Distributor:

DataverseNO

Distributor:

The Tromsø Repository of Language and Linguistics (TROLLing)

Access Authority:

Gerstenberger, Ciprian-Virgil

Date of Deposit:

2014-06-17

Date of Distribution:

2014-06-17

Holdings Information:

https://doi.org/10.18710/GSV27M

Study Scope

Keywords:

Arts and Humanities, Romanian, clitic pronouns, surface form, sandhi phenomena

Topic Classification:

Field: Morphology, Topic: clitics, Topic: pronouns, Time-depth: synchronic

Abstract:

The following corpus study shows that soft linguistic constraints are hard to describe and operationalize. In specific contexts, some Romanian clitic pronouns allow a choice between phonological hosts, as in că-mi dai cartea vs. că îmi dai cartea, both meaning [that you give me the book]. What determines the choice between the subjunction că in că-mi and the prosthetic î in îmi (cf. Lombard 1976)? Popescu (2003, p. 160) argues that speech rate triggers the surface realization (monosyllabic că-mi in fast speech vs. bisyllabic că îmi in normal speech), while Dindelegan (2013, p. 388) argues for register rules (informal că-mi vs. formal că îmi). On the latter view, formal, written language represents one extreme of a formality scale and informal, spoken language the other. Thus, a Romanian corpus of official documents, such as legal texts, is expected to contain exclusively, or at least significantly more often, forms with prosthetic î in contexts where both variants are otherwise possible. To test these two hypotheses, the Romanian part of the JRC-Acquis corpus (http://ec.europa.eu/dgs/jrc/) has been tagged with the RACAI tagger (http://www.racai.ro). The resulting corpus of 62,650,821 tokens (including punctuation) has been evaluated with respect to the phenomena under scrutiny. Taking specific hosts into account, enclitic forms have been compared with their î-prosthetic counterparts. The numbers show almost no difference, or a statistically insignificant difference, in usage for some specific host+clitic pairs (e.g., 3886 să îşi vs. 3852 să-şi [that to himself/herself], 200 ce îi vs. 110 ce-i [what to him/her]). From a usage-based perspective, these findings are clear arguments both against the register rules proposed by Dindelegan (2013) and against a pure speech rate hypothesis as in Popescu (2003).
Since the JRC-Acquis corpus is translated from English by different translators, perhaps both native and non-native speakers of Romanian, a further corpus of original Romanian legal texts is being compiled for further analysis and comparison.
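
As a quick illustration of the counts quoted in the abstract, the following minimal sketch computes the share of î-prosthetic forms for the two example host+clitic pairs. The function name prosthetic_share is hypothetical, and the calculation is only a descriptive proportion; it is not the evaluation procedure used in the study, and no significance test is implied.

```python
# Counts quoted in the abstract, as (prosthetic form, enclitic form).
COUNTS = {
    "să îşi / să-şi": (3886, 3852),
    "ce îi / ce-i": (200, 110),
}


def prosthetic_share(prosthetic, enclitic):
    """Return the fraction of tokens realized with prosthetic î (hypothetical helper)."""
    total = prosthetic + enclitic
    return prosthetic / total


for pair, (prosthetic, enclitic) in COUNTS.items():
    share = prosthetic_share(prosthetic, enclitic)
    total = prosthetic + enclitic
    print(f"{pair}: {share:.1%} prosthetic out of {total} tokens")
```

Under a strict register rule, a corpus of legal texts would be expected to show shares close to 100%; the two pairs above come out at roughly 50% and 65%.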

The full dataset consists of (1) two tgz files containing the POS-tagged data extracted from the JRC-Acquis corpus, one with enclitic forms and one with î-prosthetic forms; (2) a description file documenting the XML format of the data; and (3) a draft of the article as a PDF file, for linguistic background.
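
A minimal sketch of how the two archives might be inspected after download, assuming they are ordinary gzip-compressed tar files saved in the current directory under the names listed in this record. The helper list_members is hypothetical; the internal layout of the archives and the XML schema are not assumed here and should be taken from data_description.txt.

```python
import tarfile
from pathlib import Path

# Archive names as listed in this record; the local paths are an assumption.
ARCHIVES = [
    "jrc_ac_enclitic_weak_pronouns_RON.tgz",
    "jrc_ac_prosthetic_weak_pronouns_RON.tgz",
]


def list_members(archive_path):
    """Return the names of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


for path in ARCHIVES:
    if Path(path).exists():
        names = list_members(path)
        print(f"{path}: {len(names)} files")
    else:
        # Nothing is downloaded by this sketch; the files must be fetched first.
        print(f"{path}: not found in the current directory")
```

Listing the members first, rather than extracting everything, gives a quick view of how many XML files each archive holds before any parsing is attempted.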

Geographic Coverage:

EU countries, Romania

Kind of Data:

corpus

Methodology and Processing

Sources Statement

Data Access

Other Study Description Materials

Related Publications

Citation

Title:

The hardness of soft linguistic constraints (work in progress)

Bibliographic Citation:

The hardness of soft linguistic constraints (work in progress)

Other Study-Related Materials

Label:

data_description.txt

Text:

description of the xml format of the extracted data

Notes:

text/plain; charset=UTF-8

Other Study-Related Materials

Label:

hardness_of_sc.pdf

Text:

article draft as linguistic background

Notes:

application/pdf

Other Study-Related Materials

Label:

jrc_ac_enclitic_weak_pronouns_RON.tgz

Text:

enclitic forms extracted from the Romanian part of the JRC-Acquis corpus (http://ec.europa.eu/dgs/jrc/)

Notes:

application/octet-stream

Other Study-Related Materials

Label:

jrc_ac_prosthetic_weak_pronouns_RON.tgz

Text:

î-prosthetic forms extracted from the Romanian part of the JRC-Acquis corpus

Notes:

application/octet-stream