Replication data for: Measure schematicity through information content: A quantitative approach to grammaticalization

Version 1.0

Zhang, Liulin, 2025, "Replication data for: Measure schematicity through information content: A quantitative approach to grammaticalization", https://doi.org/10.18710/APTUHA, DataverseNO, V1

Dataset Metrics

14 Downloads

Description	This study proposes a quantitative method for computing the schematicity of constructions, a key indicator of how far a morpheme has grammaticalized. To estimate the schematicity of a schema made up of two morphemes, i.e., X_ (where X is the target morpheme and _ represents an open slot), the method needs the total token frequency of all types of X_ and the token frequencies of all elements occurring in the open slot. For example, to estimate the schematicity of “_ment”, we need the total token frequency of “_ment”, which is the sum of the frequencies of “shipment”, “equipment”, “employment”, “appointment” … (all types of “_ment”), as well as the token frequencies of “ship”, “equip”, “employ”, “appoint” … (all types of elements occurring in the open slot). The data are therefore morpheme bigrams (2-grams) generated from the English and Chinese corpora, showing which morphemes each morpheme can combine with, together with the token frequency of each bigram and the token frequencies of its two components. (2023-01-28)
Subject	Arts and Humanities
Keyword	morpheme bigram, schematicity, gradience, gradualness, Chinese, English
Related Publication	Zhang, L., & Tao, J. (2025). Measure schematicity through information content. Language and Linguistics, 26(2), 323-354. https://doi.org/10.1075/lali.00189.zha
License/Data Use Agreement	CC BY 4.0
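
The two inputs named in the Description above (the total token frequency of a schema and the token frequencies of the elements filling its open slot) can be gathered from bigram rows roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `schema_inputs`, the tuple layout, and every count in `TOY_ROWS` are assumptions made up for illustration, and the sketch stops short of the schematicity measure itself, which is defined in the related publication rather than on this page.

```python
# Toy morpheme bigram rows: (M1, M2, bigram frequency, M1 frequency, M2 frequency).
# All counts are invented for illustration; none are taken from the dataset.
TOY_ROWS = [
    ("ship", "ment", 12, 40, 100),
    ("equip", "ment", 30, 35, 100),
    ("employ", "ment", 50, 90, 100),
    ("appoint", "ment", 8, 20, 100),
    ("ship", "ing", 15, 40, 500),
]


def schema_inputs(rows, target, slot="first"):
    """Collect the inputs for one schema (hypothetical helper).

    slot="first":  the open slot precedes the target, as in "_ment".
    slot="second": the open slot follows the target, as in "X_".
    Returns (total token frequency of the schema,
             {slot filler: token frequency of that filler}).
    """
    if slot not in ("first", "second"):
        raise ValueError("slot must be 'first' or 'second'")
    total = 0
    fillers = {}
    for m1, m2, freq, m1_freq, m2_freq in rows:
        if slot == "first" and m2 == target:
            total += freq          # sum over all types of the schema
            fillers[m1] = m1_freq  # frequency of the element in the slot
        elif slot == "second" and m1 == target:
            total += freq
            fillers[m2] = m2_freq
    return total, fillers


total, fillers = schema_inputs(TOY_ROWS, "ment", slot="first")
print(total)    # 100
print(fillers)  # {'ship': 40, 'equip': 35, 'employ': 90, 'appoint': 20}
```

Passing the slot position explicitly keeps the same helper usable for both prefix-like schemas (X_) and suffix-like schemas (_X).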

	3 Files
	00_ReadMe_Measure_schematicity_through_information_content.txt	Plain Text - 8.9 KB. Published Sep 3, 2025. 8 Downloads. MD5: 6024abade0c3d2c1ff18d7ab76861e7f
	01_EnglishData.csv	Comma Separated Values - 4.7 MB. Published Sep 3, 2025. 4 Downloads. MD5: bd8faebbd6077ee5bf232aacb38b328b. Morpheme bigrams of the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus (https://anc.org/data/masc/). Columns, in order: (1) “FrequencyRank”, the frequency rank of the bigram; (2) “Frequency”, the frequency of the bigram; (3) “M1”, the first morpheme of the bigram; (4) “M2”, the second morpheme of the bigram; (5) “M1Frequency”, the frequency of the first morpheme; (6) “M2Frequency”, the frequency of the second morpheme.
	02_ChineseData.csv	Comma Separated Values - 2.1 MB. Published Sep 3, 2025. 2 Downloads. MD5: 0e9e0118de614fd29ebbd10ae979300b. Morpheme bigrams of the self-built corpora of five historical periods of Chinese. Columns, in order: (1) “Frequency”, the frequency of the bigram; (2) “C1”, the first morpheme of the bigram; (3) “C2”, the second morpheme of the bigram; (4) “C1Frequency”, the frequency of the first morpheme; (5) “C2Frequency”, the frequency of the second morpheme; (6) “Time”, the historical period.
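
The column layout described for 02_ChineseData.csv suggests one way to read the file and split it by historical period. The following is a sketch under stated assumptions: it assumes the CSV has a header row whose names match the documented columns and that the frequency columns hold integers, neither of which has been checked against the actual file. The helper names `read_bigrams` and `group_by`, the `SAMPLE` rows, and the period labels are hypothetical placeholders, not real data.

```python
import csv
import io
from collections import defaultdict

# Frequency columns documented for 02_ChineseData.csv (assumed to be integers).
CHINESE_INT_COLUMNS = ("Frequency", "C1Frequency", "C2Frequency")

# Placeholder rows in the documented column order; morphemes, counts, and
# period labels are invented and do not come from the dataset.
SAMPLE = """Frequency,C1,C2,C1Frequency,C2Frequency,Time
5,A,B,20,30,period1
3,A,C,20,10,period1
7,A,B,50,60,period2
"""


def read_bigrams(text, int_columns):
    """Parse CSV text into dicts, converting the frequency columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in int_columns:
            row[col] = int(row[col])
        rows.append(row)
    return rows


def group_by(rows, key):
    """Group parsed rows by the value of one column, e.g. "Time"."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


rows = read_bigrams(SAMPLE, CHINESE_INT_COLUMNS)
by_period = group_by(rows, "Time")
tokens_per_period = {
    period: sum(r["Frequency"] for r in period_rows)
    for period, period_rows in by_period.items()
}
print(tokens_per_period)  # {'period1': 8, 'period2': 7}
```

For the real file, the text would come from something like `open(path, encoding="utf-8", newline="").read()`; the encoding is a guess. The same reader should work for 01_EnglishData.csv with the integer columns “FrequencyRank”, “Frequency”, “M1Frequency”, and “M2Frequency”.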

Citation Metadata

Persistent Identifier	doi:10.18710/APTUHA
Publication Date	2025-09-03
Author	Zhang, Liulin (Soochow University), ORCID: 0000-0002-9369-6232
Point of Contact	Zhang, Liulin (Soochow University)
Language	English
Producer	Soochow University https://eng.suda.edu.cn
Distributor	The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/
Depositor	Zhang, Liulin
Deposit Date	2023-01-27
Time Period	Start Date: 400BC; End Date: 301BC
	Start Date: 500BC; End Date: 401BC
	Start Date: 0700; End Date: 0900
	Start Date: 1640; End Date: 1640
	Start Date: 1989; End Date: 1993
	Start Date: 1990; End Date: 2015
Date of Collection	Start Date: 2023-01-01; End Date: 2023-01-31
Data Type	annotated corpus data
Software	AntConc, Version: 4.2.0
Data Source	The Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus (https://anc.org/data/masc/), used under the conditions of the Creative Commons Attribution 3.0 United States License; and a self-built Chinese corpus comprising texts from five historical periods: Zuozhuan (《左传》), about 180,000 characters, 4th century BCE; Shishuoxinyu (《世说新语》), about 79,000 characters, 5th century CE; six vernacular stories from Bianwen (《变文》), i.e.,《刘家太子变》《大目乾连冥间救母变文》《降魔变文》《降魔变柙座文》《八相变》《汉将王陵变》, about 40,000 characters, 700-900 CE; four novels from Sanyan Erpai (《三言二拍》), i.e.,《施润泽滩阙遇友》《杜十娘怒沉百宝箱》《蒋兴哥重会珍珠衫》《卖油郎独占花魁》, about 70,000 characters, around 1640 CE; and five novels written by Wang Shuo (王朔), i.e.,《你不是个俗人》《玩的就是心跳》《顽主》《无人喝彩》《与青春有关的日子》, about 490,000 characters, contemporary Modern Mandarin. The words extracted from these texts and included in the data files of this dataset represent only an insignificant part of the corpus and do not form coherent text. Their reuse (including redistribution) is therefore permitted by the exception rules in IPR and database protection regulations, such as fair use (USA; cf. US Copyright Act), fair dealing (UK; cf. Exceptions to copyright), and "sitatretten" (Norway; cf. § 29 i Åndsverkloven).

Geospatial Metadata

Geographic Coverage	China

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

Creative Commons Attribution 4.0 International License. CC BY 4.0
