This dataset, "Replication Data for: A multivariate account of particle alternation after bare-form try in native varieties of English" (henceforth: "Dataset"), may be reused according to the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license as described here: https://creativecommons.org/licenses/by-nc/4.0.
This Dataset contains data from the following sources:
BNC: The British National Corpus. Examples of usage taken from the British National Corpus were obtained under the terms of the BNC End User Licence (see http://www.natcorp.ox.ac.uk/docs/licence.html or the file "BNC_End_User_Licence.pdf" included in this Dataset). Copyright in the individual texts cited resides with the original IPR holders. For information and licensing conditions relating to the BNC, please see the Terms tab on the landing page of the Dataset, and the BNC web site at http://www.natcorp.ox.ac.uk/.
Section "2 Terms of the Licence Granted to the Licensee" of the BNC End User Licence states among other things that "(f) [t]here is no restriction on the use of the Licensee's Results except that the Licensee may not publish in print or electronic form or exploit commercially in any form whatsoever any extracts from the BNC Processed Material other than those permitted under the fair dealings provision of copyright law."
In this Dataset, the data file "bnc.csv" contains the following information:
- the keywords which the BNC was searched for, and for each token
- annotations/values for six variables. This information has been provided by the author of this Dataset.
This means that the file does not contain any coherent (parts of) utterances which the keywords were found in as all context was removed from the data file. Therefore, publishing this data file is considered to be permitted under the fair dealings provision of copyright law; see details in section "Fair dealing" below.
COCA: Corpus of Contemporary American English. COCA does not provide an (openly accessible) end user license agreement. However, on their webpage (cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset), they mention that the use of their source texts is "strictly for academic research, and is purely non-commercial". This may be interpreted to mean that the reuse of text from COCA is likewise allowed for non-commercial purposes only. On the same webpage, COCA also provides evidence that their use and dissemination of the text sources is within the bounds of US Fair Use Law.
In this Dataset, the data file "coca.csv" contains the following information:
- the keywords which the COCA was searched for, and for each token
- annotations/values for six variables. This information has been provided by the author of this Dataset.
This means that the file does not contain any coherent (parts of) utterances which the keywords were found in, as all context was removed from the data file. Therefore, publishing this data file is considered to be permitted under the fair use provision of copyright law; see details in section "Fair use" below.
GloWbE: Corpus of Global Web-Based English. GloWbE does not provide an (openly accessible) end user license agreement. However, on their webpage (cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset), they mention that the use of their source texts is "strictly for academic research, and is purely non-commercial". This may be interpreted to mean that the reuse of text from GloWbE is likewise allowed for non-commercial purposes only. On the same webpage, GloWbE also provides evidence that their use and dissemination of the text sources is within the bounds of US Fair Use Law.
In this Dataset, the data file "glowbe.csv" contains the following information:
- the keywords which the GloWbE was searched for, and for each token
- annotations/values for six variables. This information has been provided by the author of this Dataset.
This means that the file does not contain any coherent (parts of) utterances which the keywords were found in, as all context was removed from the data file. Therefore, publishing this data file is considered to be permitted under the fair use provision of copyright law; see details in section "Fair use" below.
ICE: The International Corpus of English, including the following components:
- ICE-AUS: The Australian Component of ICE.
- ICE-CAN: The Canadian Component of ICE.
- ICE-GB: The British Component of ICE.
- ICE-IRE: The Irish Component of ICE.
- ICE-NZ: The New Zealand Component of ICE.
ICE-CAN and ICE-IRE were used under the general ICE License Agreement; see https://www.ice-corpora.uzh.ch/dam/jcr:7ae594b2-ee97-4935-8022-7d2d91b60be4/ICElicence_UZH.pdf or the file "ICE_License_Agreement.pdf" included in this Dataset.
ICE-GB was used under the ICE-GB License Agreement; see the file "ICE-GB_License_Agreement.pdf" included in this Dataset.
ICE-NZ was used under the ICE-NZ License Agreement; see the file "ICE-NZ_License_Agreement.pdf" included in this Dataset.
The ICE license agreements mentioned above include the following conditions (here cited according to the general ICE License Agreement):
- “The Corpus must be used for non-profit academic research purposes only. […] The Licensee agrees not to reproduce or redistribute the Corpus or to use all or any part of the Corpus texts in any commercial product or service.”
- “Publications based on the Corpus may include citations from texts only in a way which would be permitted under the fair dealings provision of copyright law.”
- “If you publish a paper using any ICE corpus, please send a reference to ice@es.uzh.ch.”
In this Dataset, the data file "ice.csv" contains the following information:
- the keywords which the ICE was searched for, and for each token
- the context (usually one sentence) which the keyword appears in
- annotations/frequency calculations/values for ten variables. This information has been provided by the author of this Dataset.
- the corpus component where the keywords were found
This means that the file only contains very limited excerpts from the works that are the bases for the ICE components that were used. Therefore, publishing this data file is considered to be permitted under the fair dealings provision of copyright law; see details in section "Fair dealing" below.
While no explicit, separate license agreement for ICE-AUS exists, its use and the publication of data from ICE-AUS as represented in this Dataset correspond to the use and publication of the data extracted from the other ICE components, and are thus considered to qualify as fair dealing.
Fair dealing:
According to UK Copyright Law (cf. https://www.gov.uk/guidance/exceptions-to-copyright#fair-dealing), “[f]actors that have been identified by the courts as relevant in determining whether a particular dealing with a work is fair include:”
- “does using the work affect the market for the original work? If a use of a work acts as a substitute for it, causing the owner to lose revenue, then it is not likely to be fair”
- “is the amount of the work taken reasonable and appropriate? Was it necessary to use the amount that was taken? Usually only part of a work may be used”
The corpus extracts used in this Dataset may be said to represent fair dealing according to both of these factors:
- The extracted material does not affect the market for the original work, as it is unlikely that any researcher would refrain from using the corpora or the original works which the corpora are based on because of the availability of the extracted material contained in this Dataset.
- The amount of the extracted work is reasonable and appropriate as it was necessary to carry out the study, and as it is necessary to replicate the study. Therefore, publishing the data files is not considered to infringe the copyright of the original IPR holders.
Fair use:
According to the US Copyright Act (cf. https://www.copyright.gov/fair-use/more-info.html), "Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances". The Corpus of Contemporary American English (COCA; cf. https://www.english-corpora.org/copyright.asp; see also the file "COCA_Note_on_Copyright.pdf" included in this Dataset) provides an extended discussion of why they believe that their use of the texts in COCA is within the bounds of US Fair Use Law. These arguments may also be applied to the other corpora that have been used in this Dataset. Below, the discussion by COCA is adapted to the data files included in this Dataset:
The following are the four criteria used to determine whether materials fall under the provisions of the Fair Use Law:
Criterion: The amount and substantiality of the portion taken
- What favors Fair Use status: Small portions of the original text, rather than full-text access
- The data files in this Dataset: Under no circumstances whatsoever do end users / reusers have access to entire texts (e.g. newspaper, magazine, or journal articles, or short stories). The vast majority of what users see are simply lists of words or phrases from different parts of the corpus and possibly frequency charts showing the frequency of these items. Access to small portions of the original text is more of an "afterthought", rather than the central feature of the text excerpts contained in the data files included in this Dataset. Access to actual portions of the original text is limited to short excerpts, in some cases only keywords. As a result, it would be difficult for end users to re-create even one paragraph from the original text, and it would be virtually impossible to re-create an entire page of text, much less the entire work.
Criterion: The purpose and character of the use
- What favors Fair Use status: Academic, non-commercial
- The data files in this Dataset: Given the license under which this Dataset is published, the use of any content of this Dataset is strictly for non-commercial purposes.
Criterion: The nature of the copyrighted work
- What favors Fair Use status: Non-creative works
- The data files in this Dataset: The source texts used in this Dataset include some creative works (e.g. short stories and small sections of novels), but the majority of these texts is composed of transcripts of TV shows, and articles from newspapers, magazines, and academic journals.
Criterion: The effect of the use upon the potential market
- What favors Fair Use status: Little or no effect on the copyright holder
- The data files in this Dataset: Because of the very limited access to entire works included in the corpora that have been used in this Dataset (see the first item above), it is extremely unlikely that anyone would use the data files included in this Dataset as a "substitute" for other access to the original texts. Other sources make these texts available as "complete works", which are meant to be read in their entirety. That is completely impossible using the data files included in this Dataset. The very limited access to the texts through the data files included in this Dataset and the access via other sources serve two completely different audiences. The data files are intended for linguists and other researchers who want to see the frequency of the investigated linguistic phenomena, and they are completely inadequate for anyone who wishes to read the entire text of a work. As a result, there is very little or no "competition" between the data files as distributed in this Dataset and services that are provided by others. The distribution of the data files included in this Dataset therefore has virtually no market impact.