Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data

Version 1.1

Sönning, Lukas, 2023, "Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data", https://doi.org/10.18710/5KCE4U, DataverseNO, V1

Description	Dataset description: This dataset, which is adapted from Jenset and McGillivray (2017), contains tabular files documenting the alternating usage of -(e)th and -(e)s to mark third-person verb inflection in Early Modern English. The data provided by Jenset and McGillivray (2017) are drawn from the PPCEME corpus (Kroch et al. 2004) and cover the period from 1500 to 1700. In total, 13,757 third-person singular tokens (excluding the verb BE) were annotated by these authors for a range of variables. For the purposes of the present methodological study, this dataset was reduced to a subset of 11,645 tokens, and the coding of variables was in some parts revised, completed, or modified. The dataset includes information about the Author and Verb Lemma, as well as a number of predictor variables, including Genre, Year, Frequency (of the verb lemma in the third-person singular), Phonological Context (stem-final sound), and the Gender of the author. (2023-07-20)

Abstract for related publication: Resource constraints often force researchers to down-size the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: Year, Gender, Genre, Frequency, and Phonological Context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 sub-samples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria. (2023-10-23)
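The two selection schemes contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the related publication: the field names `author` and `verb`, the toy data, and the weight formula (one over the product of the author's token count and the verb's token count) are assumptions chosen for illustration, and the weighted draw uses the Efraimidis-Spirakis key method as a stand-in for whatever sampling routine the paper employs.

```python
import heapq
import math
import random
from collections import Counter


def simple_downsample(tokens, k, rng):
    """Simple down-sampling: every hit has the same selection probability."""
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, rng):
    """Weighted draw without replacement that favours sparsely attested
    authors and verbs (one possible reading of 'structured' down-sampling)."""
    per_author = Counter(t["author"] for t in tokens)
    per_verb = Counter(t["verb"] for t in tokens)
    keyed = []
    for index, t in enumerate(tokens):
        # ASSUMPTION: weight = 1 / (author token count * verb token count).
        weight = 1.0 / (per_author[t["author"]] * per_verb[t["verb"]])
        # Efraimidis-Spirakis key u ** (1 / weight), computed on the log
        # scale to avoid floating-point underflow for very small weights.
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        keyed.append((math.log(u) / weight, index))
    # The k largest keys form the sample.
    return [tokens[i] for _, i in heapq.nlargest(k, keyed)]


# Hypothetical toy data: author A dominates with 900 tokens, B has 100.
toy_tokens = (
    [{"id": i, "author": "A", "verb": "say"} for i in range(900)]
    + [{"id": 900 + i, "author": "B", "verb": "go"} for i in range(100)]
)

rng = random.Random(1)
simple = simple_downsample(toy_tokens, 50, rng)
structured = structured_downsample(toy_tokens, 50, rng)

print("Tokens by B, simple:    ", sum(t["author"] == "B" for t in simple))
print("Tokens by B, structured:", sum(t["author"] == "B" for t in structured))
```

In this toy setting the structured draw retains far more tokens from the sparsely attested author B than the simple draw does. The study described in the abstract works at a larger scale, selecting 2,000 of 11,645 tokens and repeating each scheme 500 times.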
Subject	Arts and Humanities
Keyword	Early Modern English, verb inflection, language change, lexical diffusion, third person singular, methodology, down-sampling, corpus linguistics, PPCEME, Penn-Helsinki Parsed Corpus of Early Modern English
Related Publication	Sönning, Lukas. 2024. Down-sampling from hierarchically structured corpus data. International Journal of Corpus Linguistics 29(4). 507–533. doi: 10.1075/ijcl.23079.son
License/Data Use Agreement	Custom Dataset Terms

	1 File
	data_jenset_mcgillivray_downsampling.tsv	Tab-Separated Values, 2.0 MB; published Oct 24, 2023; MD5: ceca38119ea898d2bbb23c5dd5a32b41. Tab-delimited data table containing the 11,645 annotated verb tokens.
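A downloaded copy of the file can be checked against the MD5 value listed above before it is read. The sketch below uses only the Python standard library; it assumes that the file sits in the current working directory under its published name, that it is UTF-8 encoded, and that its first row holds column names. None of these details are stated on this page, and the helper names `md5_of_file` and `read_tsv` are hypothetical.

```python
import csv
import hashlib
import os

# Values copied from the file listing on this page.
FILENAME = "data_jenset_mcgillivray_downsampling.tsv"
EXPECTED_MD5 = "ceca38119ea898d2bbb23c5dd5a32b41"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_tsv(path, encoding="utf-8"):
    """Read a tab-delimited file into a list of dicts keyed by the header row.

    ASSUMPTION: the first row is a header and the encoding is UTF-8.
    """
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Only runs when a local copy is present (hypothetical location).
if os.path.exists(FILENAME):
    if md5_of_file(FILENAME) == EXPECTED_MD5:
        rows = read_tsv(FILENAME)
        print(len(rows), "rows read")
    else:
        print("Checksum mismatch: the local copy differs from the listed file.")
```

If the copy matches the published file, the row count should correspond to the 11,645 annotated verb tokens mentioned in the file description. A mismatch in the checksum may simply mean that the file was downloaded in a different export format.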

Citation Metadata

Persistent Identifier	doi:10.18710/5KCE4U
Publication Date	2023-10-24
Title	Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data
Author	Sönning, Lukas (University of Bamberg) - ORCID: 0000-0002-2705-395X
Point of Contact	Sönning, Lukas (University of Bamberg)
Subject	Arts and Humanities
Keyword	Early Modern English, verb inflection, language change, lexical diffusion, third person singular, methodology, down-sampling, corpus linguistics, PPCEME, Penn-Helsinki Parsed Corpus of Early Modern English
Related Publication	Sönning, Lukas. 2024. Down-sampling from hierarchically structured corpus data. International Journal of Corpus Linguistics 29(4). 507–533. doi: 10.1075/ijcl.23079.son https://doi.org/10.1075/ijcl.23079.son
Language	English
Producer	Alan Turing Institute, University of Cambridge (https://www.c2d3.cam.ac.uk/research/alan-turing-institute); University of Bamberg (https://www.uni-bamberg.de/eng-ling/)
Distributor	The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/
Depositor	Sönning, Lukas
Deposit Date	2023-07-20
Time Period	Start Date: 1500-01-01 ; End Date: 1707-12-31
Date of Collection	Start Date: 2022-11-15 ; End Date: 2023-06-15
Data Type	observational data; textual linguistic data; corpus data
Software	R, Version: 4.2.1; RStudio, Version: 2023.06.2
Data Source	Data in the tabular file data_jenset_mcgillivray_downsampling.tsv has been adapted from a dataset published by Gard Jenset in 2018 ("Jenset 2018") on GitHub at https://github.com/gjenset/quanthistbook/tree/master/eme_v3sng_study. Jenset 2018 contains supporting data for Gard B. Jenset and Barbara McGillivray, 'A new methodology for quantitative historical linguistics', Quantitative Historical Linguistics: A Corpus Framework, Oxford Studies in Diachronic and Historical Linguistics (Oxford, 2017), https://doi.org/10.1093/oso/9780198718178.003.0007. Data from Jenset 2018 is reused here (as described in the ReadMe file) under an MIT License:

Copyright (c) 2018 Gard Jenset

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Data included in Jenset 2018 were in turn derived from: Anthony Kroch, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. https://www.ling.upenn.edu/ppche/ppche-release-2016/PPCEME-RELEASE-3. PPCEME is currently distributed by the Linguistic Data Consortium as part of the Penn Parsed Corpora of Historical English under a user license agreement. This agreement permits the User to "include limited excerpts from the Data in articles, reports and other documents describing the results of User’s non-commercial projects related to linguistic education, research and technology development". The text fragments extracted from PPCEME by Jenset and McGillivray and incorporated into the present dataset only represent limited excerpts of the kind that may be shared under limitations and exceptions to copyright, such as Fair Use or Fair Dealing.

Geospatial Metadata

Geographic Coverage	United Kingdom

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Custom Dataset Terms — the following Custom Dataset Terms have been defined for this dataset.

With the exception of the tabular file data_jenset_mcgillivray_downsampling.tsv, the dataset “Background data (adapted from Jenset & McGillivray 2017) for: Down-sampling from hierarchically structured corpus data” has been marked as dedicated to the public domain, as described here: https://creativecommons.org/publicdomain/zero/1.0/.

The tabular file data_jenset_mcgillivray_downsampling.tsv constitutes an adaptation of a dataset published by Gard Jenset on GitHub in 2018 ("Jenset 2018"), available at https://github.com/gjenset/quanthistbook/tree/master/eme_v3sng_study. Material has been adapted from Jenset 2018, as documented in the ReadMe file included in the present dataset, under an MIT License. Contributions made by the author of the present dataset to the adaptation are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, as described here: https://creativecommons.org/licenses/by/4.0/. Reusers of the adapted work must comply with both this CC license and the original MIT License:

Copyright (c) 2018 Gard Jenset

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Data included in Jenset 2018 were in turn derived from The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), available here: https://www.ling.upenn.edu/ppche/ppche-release-2016/PPCEME-RELEASE-3. Note that the MIT License under which Jenset 2018 is made available cannot be considered to apply to text fragments extracted from PPCEME.
