Persistent Identifier
|
doi:10.18710/APTUHA |
Publication Date
|
2025-09-03 |
Title
| Replication data for: Measure schematicity through information content: A quantitative approach to grammaticalization |
Author
| Zhang, Liulin (Soochow University) - ORCID: 0000-0002-9369-6232 |
Point of Contact
|
Zhang, Liulin (Soochow University) |
Description
| This study proposes a quantitative method for computing the schematicity of constructions, a key indicator of how far a morpheme has grammaticalized. To estimate the schematicity of a schema made up of two morphemes, i.e., X_ (where X is the target morpheme and _ represents an open slot), we need to know the total token frequency of all types of X_ and the token frequencies of all elements occurring in the open slot. For example, to estimate the schematicity of “_ment”, we need to know the total token frequency of “_ment”, which is the sum of the frequencies of “shipment”, “equipment”, “employment”, “appointment” … (all types of “_ment”). We also need to know the token frequencies of “ship”, “equip”, “employ”, “appoint” … (all types of elements occurring in the open slot). The data are therefore morpheme bigrams (2-grams) generated from the English and Chinese corpora, showing which morphemes each morpheme can combine with, together with the token frequency of each bigram and the token frequencies of its two components. (2023-01-28) |
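The two inputs named in the description can be gathered from bigram records roughly as follows. This is a minimal sketch only: the record layout (filler, target morpheme, bigram token frequency, filler token frequency), the function name `schema_inputs`, and every count are hypothetical and are not taken from the data files. The schematicity measure itself is defined in the related publication and is not reproduced here.

```python
# Hypothetical bigram records, one per type of the schema:
# (slot filler, target morpheme, bigram token frequency, filler token frequency)
# All counts are invented for illustration; they are NOT from the dataset.
BIGRAMS = [
    ("ship", "ment", 120, 900),
    ("equip", "ment", 300, 450),
    ("employ", "ment", 250, 700),
    ("appoint", "ment", 180, 400),
    ("kind", "ness", 90, 1500),
]


def schema_inputs(bigrams, target):
    """Collect the two inputs for one schema.

    Returns a pair:
      - the total token frequency of the schema, i.e. the sum of the
        token frequencies of all its bigram types;
      - a dict mapping each slot filler to its own token frequency.
    """
    total = 0
    fillers = {}
    for filler, morpheme, bigram_freq, filler_freq in bigrams:
        if morpheme == target:
            total += bigram_freq
            fillers[filler] = filler_freq
    return total, fillers


total, fillers = schema_inputs(BIGRAMS, "ment")
print(total)    # 850 with the invented counts above
print(fillers)
```

With real data, the records would be read from the dataset's bigram files instead of the inline list; the column order assumed here would need to be checked against those files.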
Subject
| Arts and Humanities |
Keyword
| morpheme bigram
schematicity
gradience
gradualness
Chinese
English |
Related Publication
| Zhang, L., & Tao, J. (2025). Measure schematicity through information content. Language and Linguistics, 26(2), 323-354. https://doi.org/10.1075/lali.00189.zha |
Language
| English |
Producer
| Soochow University https://eng.suda.edu.cn |
Distributor
| The Tromsø Repository of Language and Linguistics (TROLLing) https://trolling.uit.no/ |
Depositor
| Zhang, Liulin |
Deposit Date
| 2023-01-27 |
Time Period
| Start Date: 400BC ; End Date: 301BC
Start Date: 500BC ; End Date: 401BC
Start Date: 0700 ; End Date: 0900
Start Date: 1640 ; End Date: 1640
Start Date: 1989 ; End Date: 1993
Start Date: 1990 ; End Date: 2015 |
Date of Collection
| Start Date: 2023-01-01 ; End Date: 2023-01-31 |
Data Type
| annotated corpus data |
Software
| AntConc, Version: 4.2.0 |
Data Source
| The Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus (https://anc.org/data/masc/), used under the conditions of the Creative Commons Attribution 3.0 United States License. ; A self-built Chinese corpus comprising texts from five historical periods, including:
- Zuozhuan (《左传》), about 180,000 characters, the 4th century BCE;
- Shishuoxinyu (《世说新语》), about 79,000 characters, the 5th century CE;
- Six vernacular stories from Bianwen (《变文》), i.e.,《刘家太子变》《大目乾连冥间救母变文》《降魔变文》《降魔变柙座文》《八相变》《汉将王陵变》, about 40,000 characters, 700-900 CE;
- Four novels from Sanyan Erpai (《三言二拍》), i.e.,《施润泽滩阙遇友》《杜十娘怒沉百宝箱》《蒋兴哥重会珍珠衫》《卖油郎独占花魁》, about 70,000 characters, around 1640 CE;
- Five novels written by Wang Shuo (王朔), i.e.,《你不是个俗人》《玩的就是心跳》《顽主》《无人喝彩》《与青春有关的日子》, about 490,000 characters, Contemporary Modern Mandarin.
The words extracted from these texts and contained in the data files of this dataset represent only an insignificant part of the corpus and do not form coherent text. Therefore, the reuse (including redistribution) of these words is permitted by the exception rules in IPR and database protection regulations, such as fair use (USA; cf. US Copyright Act), fair dealing (UK; cf. Exceptions to copyright), and "sitatretten" (Norway; cf. § 29 i Åndsverkloven). |