The dataset supports the research article "Salience-simplification strategy to markedness of causal subordinators: The case of “because” and “since” in argumentative essays". In total, the dataset marks features of 976 causal adverbial subordinations retrieved from student argumentative essays. Data points were extracted from three corpora. Specifically, all essays in NESSIE (Native English Speakers’ Similarly or Identically-prompted Essays, created by Xu Jiajin; 781 essays; 291,911 tokens) and the argumentative essays in LOCNESS (the Louvain Corpus of Native English Essays, created by Granger; 323 essays; 230,138 tokens) were selected. From BAWE (British Academic Written English, created by Hilary Nesi), argumentative essays by native speakers in the Arts and Humanities disciplinary group were chosen (512 essays; 1,360,932 tokens). In total, 1,616 essays comprising 1,882,981 tokens were examined.
The dataset comprises 976 data points of causal subordinations conjoined by "because" and "since" in students' argumentative essays: all 488 "since" subordinations found in the corpora, and 488 randomly selected "because" subordinations. For each data point, ten contextual features that are potential predictors of writers' choice between the causal subordinators "because" and "since" were annotated.
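The balancing step described above (keeping every "since" item and drawing an equally sized random sample of "because" items) can be sketched as follows. This is a minimal illustration only, not the authors' actual procedure: the function name, the fixed seed, and the size of the toy "because" pool are all assumptions introduced for the example.

```python
import random


def balance_by_downsampling(since_items, because_items, seed=0):
    """Keep all 'since' items and randomly sample as many 'because' items.

    The seed is a hypothetical choice that makes the draw repeatable;
    the source does not say how the random selection was performed.
    """
    if len(because_items) < len(since_items):
        raise ValueError("not enough 'because' items to match the 'since' items")
    rng = random.Random(seed)
    # Sampling without replacement, so no 'because' item is picked twice.
    sampled_because = rng.sample(because_items, k=len(since_items))
    return list(since_items), sampled_because


# Toy stand-ins for the retrieved subordinations. The pool size of 2000
# is an arbitrary placeholder, not a figure reported for the corpora.
since_pool = [f"since_{i}" for i in range(488)]
because_pool = [f"because_{i}" for i in range(2000)]

kept_since, kept_because = balance_by_downsampling(since_pool, because_pool)
print(len(kept_since), len(kept_because))
```

Sampling without replacement yields two classes of equal size, which is the balanced design the 976 data points reflect.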
The ten contextual features annotated are "position", "separation", "embeddedness", "initial adverbials", "sub-clause", "de-ranking", "clause-length ratio", "hedging terms", "clausal relationship", and "bridging".
Overall, fourteen variables, including the ten contextual features, are annotated:
(1) "No." is the ID of each data point (this is an ID marker);
(2) "subordinator" marks the logical subordinators (this categorical variable has two values: "because" and "since");
(3) "position" marks the position of the logical adverbial clause relative to the main clause (this categorical variable has two values: "preposed" or "postposed");
(4) "sep" indicates whether a separating punctuation mark exists between the subordinate and main clauses (this categorical variable has two values: "YES" or "NO");
(5) "embeddedness" indicates whether a complex sentence is embedded in a larger complex sentence (this categorical variable has two values: "YES" or "NO");
(6) "ini.adv" denotes whether an initial adverbial exists in the causal subordination (this categorical variable has two values: "YES" or "NO");
(7) "sub-clau" indicates whether the causal subordinate clause contains sub-clauses of any type (this categorical variable has two values: "YES" or "NO");
(8) "deranking" indicates whether the predicate of the subordinate clause is complete (this categorical variable has two values: "YES" or "NO");
(9) "sub.main.ratio" is the length ratio of the subordinate and main clauses in terms of word count (this numerical variable is converted to its natural logarithm (ln) for better interpretation);
(10) "hedging" indicates whether a hedging term exists in the subordinate clause (this categorical variable has two values: "YES" or "NO");
(11) "clau.rel" denotes the interclausal relationships on the general level (this categorical variable has two values: "direct" or "indirect");
(12) "spc.clau.rel2" denotes the interclausal relationships on the secondary level (this categorical variable has five values: "im", "rm", "asst", "inpr", and "sugg");
(13) "bridging" indicates whether the subordinate clause contains any information referring back to the preceding clause (this categorical variable has two values: "YES" or "NO");
(14) "source" shows the specific corpus each data point comes from (this categorical variable has three values: "NESSIE", "LOCNESS", or "BAWE").
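The variable list above can be turned into a simple consistency check. The sketch below is illustrative: the variable names and allowed values are copied from the list, but the row representation (a plain dictionary), the helper names, and the example row are assumptions. The ratio is taken here to be subordinate-clause words divided by main-clause words, which is suggested by the variable name but is not stated explicitly.

```python
import math

YES_NO = {"YES", "NO"}

# Allowed values for each categorical variable, as listed in the description.
ALLOWED = {
    "subordinator": {"because", "since"},
    "position": {"preposed", "postposed"},
    "sep": YES_NO,
    "embeddedness": YES_NO,
    "ini.adv": YES_NO,
    "sub-clau": YES_NO,
    "deranking": YES_NO,
    "hedging": YES_NO,
    "clau.rel": {"direct", "indirect"},
    "spc.clau.rel2": {"im", "rm", "asst", "inpr", "sugg"},
    "bridging": YES_NO,
    "source": {"NESSIE", "LOCNESS", "BAWE"},
}


def log_length_ratio(sub_words, main_words):
    """Natural log of the word-count ratio (assumed direction: sub / main)."""
    return math.log(sub_words / main_words)


def validate_row(row):
    """Return a list of problems found in one data point (empty if none)."""
    problems = []
    for name, allowed in ALLOWED.items():
        if row.get(name) not in allowed:
            problems.append(f"{name}: unexpected value {row.get(name)!r}")
    try:
        float(row["sub.main.ratio"])
    except (KeyError, TypeError, ValueError):
        problems.append("sub.main.ratio: missing or not numeric")
    return problems


# A made-up row for illustration; it is not taken from the dataset.
example_row = {
    "No.": 1, "subordinator": "since", "position": "preposed", "sep": "YES",
    "embeddedness": "NO", "ini.adv": "NO", "sub-clau": "NO", "deranking": "NO",
    "sub.main.ratio": log_length_ratio(12, 15), "hedging": "NO",
    "clau.rel": "direct", "spc.clau.rel2": "im", "bridging": "YES",
    "source": "BAWE",
}
print(validate_row(example_row))
```

Under the assumed direction, a log ratio of 0 means the two clauses are equally long, negative values mean the subordinate clause is shorter than the main clause, and positive values mean it is longer.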
This dataset was constructed to explore the contextual features that discriminate between the causal subordinators "because" and "since" and to rank the effective features.
(2021-08-05)