Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays

Version 1.2

Kang, Hui; Xu, Jiajin, 2022, "Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays", https://doi.org/10.18710/RULYMP, DataverseNO, V1

Dataset Metrics

50 Downloads

Description	The dataset supports the research article "Salience-simplification strategy to markedness of causal subordinators: The case of “because” and “since” in argumentative essays". In total, the dataset marks features of 976 causal adverbial subordinations retrieved from student argumentative essays.

Data points were extracted from three corpora. All essays in NESSIE (Native English Speakers’ Similarly or Identically-prompted Essays, created by Xu Jiajin; 781 essays; 291,911 tokens) and the argumentative essays in LOCNESS (the Louvain Corpus of Native English Essays, created by Granger; 323 essays; 230,138 tokens) were selected. From BAWE (British Academic Written English, created by Hilary Nesi), native argumentative essays in the Arts and Humanities disciplinary group were chosen (512 essays; 1,360,932 tokens). In total, 1,616 essays comprising 1,882,981 tokens were examined.

The dataset comprises 976 data points of causal subordinations conjoined by "because" and "since" in students' argumentative essays: all 488 "since" subordinations and 488 randomly selected "because" subordinations. On these data points, ten contextual features that are potential predictors of people's choices between the causal subordinators "because" and "since" were annotated. The ten contextual features are "position", "separation", "embeddedness", "initial adverbials", "sub-clause", "de-ranking", "clause-length ratio", "hedging terms", "clausal relationship", and "bridging".

Overall, fourteen variables, including the ten contextual features, are annotated:
(1) "No." is the ID of each data point (an ID marker);
(2) "subordinator" marks the logical subordinator (categorical, two values: "because" and "since");
(3) "position" marks the position of the logical adverbial clause relative to the main clause (categorical, two values: "preposed" or "postposed");
(4) "sep" indicates whether a separating punctuation mark exists between the subordinate and main clauses (categorical, two values: "YES" or "NO");
(5) "embeddedness" indicates whether a complex sentence is embedded in a larger complex sentence (categorical, two values: "YES" or "NO");
(6) "ini.adv" denotes whether an initial adverbial exists in the causal subordination (categorical, two values: "YES" or "NO");
(7) "sub-clau" indicates whether the causal subordinate clause contains sub-clauses of any type (categorical, two values: "YES" or "NO");
(8) "deranking" indicates whether the predicate of the subordinate clause is complete (categorical, two values: "YES" or "NO");
(9) "sub.main.ratio" is the length ratio of the subordinate and main clauses in terms of word count (numerical, converted to its natural logarithm (ln) for better interpretation);
(10) "hedging" indicates whether a hedging term exists in the subordinate clause (categorical, two values: "YES" or "NO");
(11) "clau.rel" denotes the interclausal relationship on the general level (categorical, two values: "direct" or "indirect");
(12) "spc.clau.rel2" denotes the interclausal relationship on the secondary level (categorical, five values: "im", "rm", "asst", "inpr", and "sugg");
(13) "bridging" indicates whether the subordinate clause contains any information referring back to the preceding clause (categorical, two values: "YES" or "NO");
(14) "source" shows the corpus each data point comes from (categorical, three values: "NESSIE", "LOCNESS", or "BAWE").

This dataset was constructed to explore contextual features that discriminate between the causal subordinators "because" and "since" and to rank the effective features. (2021-08-05)
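The variable list in the description can be read as a simple schema. The Python sketch below is illustrative only and is not part of the deposited R scripts. It assumes that the CSV column headers match the variable names exactly as quoted in the description, and that "sub.main.ratio" is the natural log of subordinate words divided by main-clause words (the column name suggests this direction, but the description does not state it). The helper names and the sample row are hypothetical.

```python
import math

# Allowed values for each categorical variable, as listed in the description.
# Assumption: the CSV column headers match these names exactly.
YES_NO = {"YES", "NO"}
LEVELS = {
    "subordinator": {"because", "since"},
    "position": {"preposed", "postposed"},
    "sep": YES_NO,
    "embeddedness": YES_NO,
    "ini.adv": YES_NO,
    "sub-clau": YES_NO,
    "deranking": YES_NO,
    "hedging": YES_NO,
    "clau.rel": {"direct", "indirect"},
    "spc.clau.rel2": {"im", "rm", "asst", "inpr", "sugg"},
    "bridging": YES_NO,
    "source": {"NESSIE", "LOCNESS", "BAWE"},
}


def log_length_ratio(sub_words: int, main_words: int) -> float:
    """Natural log of the subordinate/main word-count ratio.

    Assumption: the ratio is subordinate over main, as the column name
    "sub.main.ratio" suggests.
    """
    if sub_words <= 0 or main_words <= 0:
        raise ValueError("word counts must be positive")
    return math.log(sub_words / main_words)


def invalid_fields(row: dict) -> list:
    """Return names of categorical fields whose value is missing or undocumented."""
    return [name for name, allowed in LEVELS.items() if row.get(name) not in allowed]


# Hypothetical row for illustration; not taken from the dataset.
example = {
    "No.": "1", "subordinator": "since", "position": "preposed", "sep": "YES",
    "embeddedness": "NO", "ini.adv": "NO", "sub-clau": "NO", "deranking": "NO",
    "sub.main.ratio": str(log_length_ratio(8, 8)), "hedging": "NO",
    "clau.rel": "direct", "spc.clau.rel2": "im", "bridging": "YES",
    "source": "BAWE",
}
print(invalid_fields(example))  # []
```

With the real file, rows could be read with `csv.DictReader` and passed to `invalid_fields` one at a time; any non-empty result would point to a header or value that differs from the description.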
Subject	Arts and Humanities
Keyword	causal subordinators, "because", "since", contextual features, argumentative essays, syntax, English
Related Publication	Xu, Jiajin, and Hui Kang. ‘Salience-Simplification Strategy for Markedness of Causal Subordinators: “Because” and “since” in Argumentative Essays’. Lingua, vol. 272, June 2022, p. 103256. https://doi.org/10.1016/j.lingua.2022.103256
License/Data Use Agreement	Custom Dataset Terms

Files (5)

File	Type	Size	Published	Downloads	MD5
word_sentence_count.R	R Syntax	1.2 KB	Jan 27, 2022	9	3e3a15d14db523b99718ebfeb5b660c7
pub-causalsubordinator.csv	Comma Separated Values	71.5 KB	Jan 27, 2022	11	cec02bd4b59c6f8403945004d35b8e0b
exact.matches.2.r	R Syntax	9.8 KB	Jan 27, 2022	9	e67f2ddd4621eeab66d9e8d5cc2079d0
corpus_processing.R	R Syntax	2.7 KB	Jan 27, 2022	11	e17d2f4f622ffe3c1abe1a94445ff044
00_readme_causal-subordinators.txt	Plain Text	11.7 KB	Jan 27, 2022	10	dafb1ac2afbe8f881afc4950cc7cbb34
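
The MD5 values in the file listing can be used to confirm that a download is intact. The following Python sketch shows one way to do this with the standard library. It assumes the five files were saved under their listed names in a single local directory; the function names and the directory argument are hypothetical.

```python
import hashlib
from pathlib import Path

# MD5 checksums as published in the file listing.
PUBLISHED_MD5 = {
    "word_sentence_count.R": "3e3a15d14db523b99718ebfeb5b660c7",
    "pub-causalsubordinator.csv": "cec02bd4b59c6f8403945004d35b8e0b",
    "exact.matches.2.r": "e67f2ddd4621eeab66d9e8d5cc2079d0",
    "corpus_processing.R": "e17d2f4f622ffe3c1abe1a94445ff044",
    "00_readme_causal-subordinators.txt": "dafb1ac2afbe8f881afc4950cc7cbb34",
}


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory):
    """Map each listed file name to True/False (checksum match) or None (file absent)."""
    results = {}
    for name, expected in PUBLISHED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == expected) if path.is_file() else None
    return results
```

Calling `verify_download(".")` in the download directory returns a dictionary; a `False` entry would suggest a corrupted or altered file, and `None` means the file was not found under its listed name.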

Citation Metadata

Persistent Identifier	doi:10.18710/RULYMP
Publication Date	2022-01-27
Title	Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays
Author	Kang, Hui (Dalian University of Foreign Languages) - ORCID: 0000-0002-5979-1658; Xu, Jiajin (Beijing Foreign Studies University) - ORCID: 0000-0003-3454-9352
Point of Contact	Kang, Hui (Dalian University of Foreign Languages)
Language	English
Producer	Dalian University of Foreign Languages (DLUFL) https://www.dlufl.edu.cn/en/
Production Date	2021-08-05
Production Location	Dalian
Contributor	Hosting Institution: Software School/Intelligence Language Research Center; Project Member: Wang, Luojia; Project Member: Zhang, Yaxin; Project Member: Zhang, Xiaobo
Funding Information	Liaoning Social Science Foundation: L20BYY016; National Social Science Fund of China (NSSFC): 19ZDA319
Distributor	The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/
Depositor	Kang, Hui
Deposit Date	2021-08-05
Time Period	Start Date: 1995 ; End Date: 2007
Date of Collection	Start Date: 2019-12-01 ; End Date: 2021-05-01
Data Type	corpus data
Software	AntConc, Version: 3.5.8; R Language, Version: 3.6.2; RStudio Team, Version: 1.1.456
Other Reference	Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. 1985. A comprehensive grammar of the English language. London: Longman.
Data Source	NESSIE Corpus. See: Xu, Jiajin. (2012). NESSIE Corpus 1st release (NESSIEv1): Native English Speakers' Similarly or Identically-prompted Essays 1st release. Beijing: National Research Centre for Foreign Language Education, Beijing Foreign Studies University. Available at http://corpus.bfsu.edu.cn/info/1070/1335.htm; LOCNESS (the Louvain Corpus of Native English Essays). See: Granger, S. (1998). The computer learner corpus: A versatile new source of data for SLA research. In Granger, S. (ed.) Learner English on Computer. Addison Wesley Longman: London & New York, 3-18. Available at https://www.learnercorpusassociation.org/resources/tools/locness-corpus/; BAWE (British Academic Written English). See: Nesi, Hilary; Gardner, Sheena; Thompson, Paul; et al., 2008, British Academic Written English Corpus, Oxford Text Archive, http://hdl.handle.net/20.500.12024/2539.

Geospatial Metadata

Geographic Coverage	United States; United Kingdom

Social Science and Humanities Metadata

Collector Training	Annotators of the variable "clausal relationship" were trained with definitions and examples of the two-level clausal relationship system examined in our research.
Collection Mode	corpus retrieval

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

This dataset, "Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays", may be reused according to the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license as described here: https://creativecommons.org/licenses/by-nc/4.0/.

This dataset contains statistical data obtained by analyzing texts from the following three corpora:

NESSIE Corpus. See: Xu, Jiajin. (2012). NESSIE Corpus 1st release (NESSIEv1): Native English Speakers' Similarly or Identically-prompted Essays 1st release. Beijing: National Research Centre for Foreign Language Education, Beijing Foreign Studies University. Available at http://corpus.bfsu.edu.cn/info/1070/1335.htm.
LOCNESS (the Louvain Corpus of Native English Essays). See: Granger, S. (1998). The computer learner corpus: A versatile new source of data for SLA research. In Granger, S. (ed.) Learner English on Computer. Addison Wesley Longman : London & New York, 3-18. Available at https://www.learnercorpusassociation.org/resources/tools/locness-corpus/.
BAWE (British Academic Written English). See: Nesi, Hilary; Gardner, Sheena; Thompson, Paul; et al., 2008, British Academic Written English Corpus, Oxford Text Archive, http://hdl.handle.net/20.500.12024/2539.

All three corpora can be used for non-commercial purposes only. The BAWE corpus is explicitly licensed under Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0).

In this dataset, "Replication data for: Salience-simplification strategy for markedness of causal subordinators: “because” and “since” in argumentative essays", the data file "pub-causalsubordinator.csv" contains statistical data / calculations based on texts contained in the three source corpora.

The file does not contain any coherent utterances, or parts of utterances, in which the keywords were found, because all context was removed from the data file. The use of the three source corpora thus does not infringe the copyright of any rights holders who have contributed to the corpora.
