Replication Data for: A multivariate account of particle alternation after bare-form try in native varieties of English

Version 1.1

Tizón-Couto, David, 2022, "Replication Data for: A multivariate account of particle alternation after bare-form try in native varieties of English", https://doi.org/10.18710/GVUZWI, DataverseNO, V1

Description	[Dataset abstract] This is the data and code from a multifactorial study reviewing the determinants of particle alternation after uninflected try in native varieties of English. The effects of a number of previously discussed and novel predictors (see Section 3.1 of the paper) are probed in data from well-known corpora (ICE, GloWbE, BNC and COCA). The paper is published in English Language and Linguistics (https://www.doi.org/10.1017/S1360674321000393). I used R (R Core Team 2021) for all data analyses, hence the code can best be replicated in R. (2021-09-17) [Article abstract] This multifactorial study reviews the determinants of particle alternation after uninflected try in varieties where English is native. The effects of a number of previously discussed and novel predictors are probed in data from well-known corpora. The results confirm the inclinations of North American varieties (try to) in contrast with those of the Australasian, British and Irish varieties (try and in speech but try to in writing). The previously reported general effects of the tense of try, mode and horror aequi are also corroborated. As regards the effect of register, the study contributes the finding that following Latin-based infinitives favor try to in most varieties, especially in writing. The paper discusses the status of the substantiated effects with respect to the notions of conventionalization and entrenchment: crucially, the higher degree of conventionalization of try to in North American varieties (a) makes the use of this variant less conditional on the sequential need to license euphony and (b) neutralizes the general contextual/register distinction for the alternation. From a usage-based viewpoint, the findings suggest that the higher frequency of a multiword sequence in a specific variety, and the higher degree of activation in the language users’ minds, can make it less contingent on general probabilistic constraints. (2021-09-17)
Subject	Arts and Humanities
Keyword	English, particle alternation, probabilistic grammar, conventionalization, entrenchment, horror aequi
Related Publication	TIZÓN-COUTO, D. (2022). A multivariate account of particle alternation after bare-form try in native varieties of English. English Language and Linguistics, 1-32. doi:10.1017/S1360674321000393
License/Data Use Agreement	Custom Dataset Terms

	00_ReadMe_tryandto.txt	Documentation/Plain Text, 7.1 KB. Published Sep 21, 2022. MD5: 31ce55e6476ec9fff74040ccb8d0f039. Description of the dataset.
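
The MD5 value listed for 00_ReadMe_tryandto.txt can be used to check that a downloaded copy is intact. The following is a minimal sketch in Python rather than a procedure prescribed by the repository. The local file name and the helper names md5_of_file and matches_listed_checksum are assumptions made for illustration. Only the checksum itself comes from the listing.

```python
import hashlib
from pathlib import Path

# Checksum as shown in the file listing for 00_ReadMe_tryandto.txt.
EXPECTED_MD5 = "31ce55e6476ec9fff74040ccb8d0f039"


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, expected=EXPECTED_MD5):
    """Compare a local file's MD5 digest with the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical location of the downloaded file; adjust as needed.
    local_copy = Path("00_ReadMe_tryandto.txt")
    if local_copy.exists():
        ok = matches_listed_checksum(local_copy)
        print("checksum OK" if ok else "checksum mismatch")
    else:
        print(f"{local_copy} not found; download it first")
```

A mismatch usually means an incomplete download or a different version of the file, so it may be worth downloading again before relying on the contents.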

Citation Metadata

Persistent Identifier	doi:10.18710/GVUZWI
Publication Date	2022-09-21
Title	Replication Data for: A multivariate account of particle alternation after bare-form try in native varieties of English
Author	Tizón-Couto, David (University of Vigo) - ORCID: 0000-0003-0788-7954
Point of Contact	Tizón-Couto, David (University of Vigo)
Related Publication	TIZÓN-COUTO, D. (2022). A multivariate account of particle alternation after bare-form try in native varieties of English. English Language and Linguistics, 1-32. https://doi.org/10.1017/S1360674321000393
Language	English
Producer	University of Vigo https://lvtc.uvigo.es/
Funding Information	Spanish Ministry of Science and Innovation: PID2020-118143GA-I00 ; Xunta de Galicia: ED431C2021/52
Distributor	The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/
Depositor	Tizón-Couto, David
Deposit Date	2021-09-17
Time Period	Start Date: 1964 ; End Date: 2019
Date of Collection	Start Date: 2019-01-01 ; End Date: 2021-01-01
Data Type	annotated corpus data
Software	RStudio, Version: 1.2.1335 ; R, Version: 4.0.3 ; brms: Bayesian Regression Models using 'Stan', Version: 2.16.1
Data Source	BNC: British National Corpus. Available online at http://www.natcorp.ox.ac.uk/. ; COCA: Corpus of Contemporary American English. Davies, Mark. (2008-) The Corpus of Contemporary American English (COCA). Available online at https://www.english-corpora.org/coca/. ; GloWbE: Davies, Mark. (2013) Corpus of Global Web-Based English. Available online at https://www.english-corpora.org/glowbe/. ; ICE: The International Corpus of English. Available at https://www.ice-corpora.uzh.ch/en.html. Including the following components: ICE-AUS: The Australian Component of ICE. Provided by Macquarie University. ICE-CAN: The Canadian Component of ICE. Provided by University of Alberta. ICE-GB: The British Component of ICE. Provided by Survey of English Usage, University College London. ICE-IRE: The Irish Component of ICE. Provided by The Queen's University Belfast, Trinity College Dublin. ICE-NZ: The New Zealand Component of ICE. Provided by School of Linguistics & Applied Language Studies, Victoria University of Wellington.

Geospatial Metadata

Geographic Coverage	Australia ; Canada ; Ireland ; New Zealand ; United Kingdom ; United States

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

This dataset, "Replication Data for: A multivariate account of particle alternation after bare-form try in native varieties of English" (henceforth: "Dataset"), may be reused according to the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license as described here: https://creativecommons.org/licenses/by-nc/4.0.

This Dataset contains data from the following sources:

BNC: The British National Corpus. Examples of usage taken from the British National Corpus were obtained under the terms of the BNC End User Licence (see http://www.natcorp.ox.ac.uk/docs/licence.html or the file "BNC_End_User_Licence.pdf" included in this Dataset). Copyright in the individual texts cited resides with the original IPR holders. For information and licensing conditions relating to the BNC, please see the Terms tab on the landing page of the Dataset, and the BNC web site at http://www.natcorp.ox.ac.uk/.

Section "2 Terms of the Licence Granted to the Licensee" of the BNC End User Licence states among other things that "(f) [t]here is no restriction on the use of the Licensee's Results except that the Licensee may not publish in print or electronic form or exploit commercially in any form whatsoever any extracts from the BNC Processed Material other than those permitted under the fair dealings provision of copyright law."

In this Dataset, the data file "bnc.csv" contains the following information:

the keywords which the BNC was searched for, and for each token
annotations/values for six variables. This information has been provided by the author of this Dataset.

This means that the file does not contain any coherent (parts of) utterances which the keywords were found in as all context was removed from the data file. Therefore, publishing this data file is considered to be permitted under the fair dealings provision of copyright law; see details in section "Fair dealing" below.

COCA: Corpus of Contemporary American English. COCA does not provide an (openly accessible) end user license agreement. However, on their webpage (cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset), they mention that the use of their source texts is "strictly for academic research, and is purely non-commercial". This may be interpreted to mean that text from COCA may likewise be reused for non-commercial purposes only. On the same webpage, COCA also argues that their use and dissemination of the text sources fall within the bounds of US Fair Use Law.

In this Dataset, the data file "coca.csv" contains the following information:

the keywords which the COCA was searched for, and for each token
annotations/values for six variables. This information has been provided by the author of this Dataset.

GloWbE: Corpus of Global Web-Based English. GloWbE does not provide an (openly accessible) end user license agreement. However, on their webpage (cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset), they mention that the use of their source texts is "strictly for academic research, and is purely non-commercial". This may be interpreted to mean that text from GloWbE may likewise be reused for non-commercial purposes only. On the same webpage, GloWbE also argues that their use and dissemination of the text sources fall within the bounds of US Fair Use Law.

In this Dataset, the data file "glowbe.csv" contains the following information:

the keywords which the GloWbE was searched for, and for each token
annotations/values for six variables. This information has been provided by the author of this Dataset.

ICE: The International Corpus of English, including the following components:

ICE-AUS: The Australian Component of ICE.
ICE-CAN: The Canadian Component of ICE.
ICE-GB: The British Component of ICE.
ICE-IRE: The Irish Component of ICE.
ICE-NZ: The New Zealand Component of ICE.

ICE-CAN and ICE-IRE were used under the general ICE License Agreement; see https://www.ice-corpora.uzh.ch/dam/jcr:7ae594b2-ee97-4935-8022-7d2d91b60be4/ICElicence_UZH.pdf or the file "ICE_License_Agreement.pdf" included in this Dataset.

ICE-GB was used under the ICE-GB License Agreement; see the file "ICE-GB_License_Agreement.pdf" included in this Dataset.

ICE-NZ was used under the ICE-NZ License Agreement; see the file "ICE-NZ_License_Agreement.pdf" included in this Dataset.

The ICE license agreements mentioned above include the following conditions (here cited according to the general ICE License Agreement):

“The Corpus must be used for non-profit academic research purposes only. […] The Licensee agrees not to reproduce or redistribute the Corpus or to use all or any part of the Corpus texts in any commercial product or service.”
“Publications based on the Corpus may include citations from texts only in a way which would be permitted under the fair dealings provision of copyright law.”
“If you publish a paper using any ICE corpus, please send a reference to ice@es.uzh.ch.”

In this Dataset, the data file "ice.csv" contains the following information:

the keywords which the ICE was searched for, and for each token
the context (usually one sentence) which the keyword appears in
annotations/frequency calculations/values for ten variables. This information has been provided by the author of this Dataset.
the corpus component where the keywords were found

This means that the file only contains very limited excerpts from the works that are the bases for the ICE components that were used. Therefore, publishing this data file is considered to be permitted under the fair dealings provision of copyright law; see details in section "Fair dealing" below.

While no explicit, separate license agreement for ICE-AUS exists, its use and the publication of data from ICE-AUS as represented in this Dataset correspond to the use and publication of the data extracted from the other ICE components, and thus are considered as qualifying as fair dealing.

Fair dealing:

According to UK Copyright Law (cf. https://www.gov.uk/guidance/exceptions-to-copyright#fair-dealing), “[f]actors that have been identified by the courts as relevant in determining whether a particular dealing with a work is fair include:

does using the work affect the market for the original work? If a use of a work acts as a substitute for it, causing the owner to lose revenue, then it is not likely to be fair
is the amount of the work taken reasonable and appropriate? Was it necessary to use the amount that was taken? Usually only part of a work may be used”

The corpus extracts used in this Dataset may be said to represent fair dealing according to both of these factors:

The extracted material does not affect the market for the original work, as it is unlikely that any researcher would refrain from using the corpora or the original works which the corpora are based on because of the availability of the extracted material contained in this Dataset.
The amount of the extracted work is reasonable and appropriate as it was necessary to carry out the study, and as it is necessary to replicate the study. Therefore, publishing the data files is not considered to infringe the copyright of the original IPR holders.

Fair use:

According to US Copyright Act (cf. https://www.copyright.gov/fair-use/more-info.html), "Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances". The Corpus of Contemporary American English (COCA; cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset) provides an extended discussion of why they believe that their use of the texts in COCA is within the bounds of US Fair Use Law. These arguments may also be applied to other corpora that have been used in this Dataset. Below, the discussion by COCA is adapted to the data files included in this Dataset:

The following are the four criteria used to determine whether materials fall under the provisions of the Fair Use Law:

Criterion: The amount and substantiality of the portion taken

What favors Fair Use status: Small portions of the original text, rather than full-text access
The data files in this Dataset: Under no circumstances whatsoever do end users / reusers have access to entire texts (e.g. newspaper, magazine, or journal articles, or short stories). The vast majority of what users see are simply lists of words or phrases from different parts of the corpus and possibly frequency charts showing the frequency of these items. Access to small portions of the original text is more of an "afterthought", rather than the central feature of the text excerpts contained in the data files included in this Dataset. Access to actual portions of the original text is limited to short excerpts, in some cases only keywords. As a result, it would be difficult for end users to re-create even one paragraph from the original text, and it would be virtually impossible to re-create an entire page of text, much less the entire work.

Criterion: The purpose and character of the use

What favors Fair Use status: Academic, non-commercial
The data files in this Dataset: Given the license under which this Dataset is published, the use of any content of this Dataset is strictly for non-commercial purposes.

Criterion: The nature of the copyrighted work

What favors Fair Use status: Non-creative works
The data files in this Dataset: The source texts used in this Dataset include some creative works (e.g. short stories and small sections of novels), but the majority of these texts is composed of transcripts of TV shows, and articles from newspapers, magazines, and academic journals.

Criterion: The effect of the use upon the potential market

What favors Fair Use status: Little or no effect on the copyright holder

The data files in this Dataset: Because of the very limited access to entire works included in the corpora that have been used in this Dataset (see the first item above), it is extremely unlikely that anyone would use the data files included in this Dataset as a "substitute" for other access to the original texts. Other sources make these texts available as "complete works", which are meant to be read in their entirety. That is impossible with the data files included in this Dataset. The very limited access to the texts through these data files and access via other sources serve two completely different audiences. The data files are intended for linguists and other researchers who want to see the frequency of the investigated linguistic phenomena, and they are completely inadequate for anyone who wishes to read the entire text of a work. As a result, there is very little or no "competition" between the data files as distributed in this Dataset and services that are provided by others. The distribution of the data files included in this Dataset therefore has virtually no market impact.
