Title: The Study of Modern Chinese Core Vocabulary for the Second Language Acquisition

Author: Yinghua Zhai

Degree and Year: Doctor of Philosophy, Wuhan University 2012

Abstract: Core vocabulary is the set of words that plays an important role in conveying information in a language; it comprises the most basic and most widely used words in communication. Across different registers, genres, and styles, in speech and in writing alike, these words form the common core of the vocabulary, and they therefore occupy the central position in the lexical-semantic system. From the language user's point of view, core vocabulary is the set of words that any adult native speaker of a language can use and understand. From the perspective of cross-cultural communication, core vocabulary can to some extent be regarded as the "universal vocabulary" found across languages. For second language teaching, core vocabulary is the set of words that must be mastered at the beginner stage of learning Chinese, roughly the words needed for basic, everyday communication.

There is no clear boundary between core and non-core vocabulary. Intermediate states exist between the two, and the vocabulary system as a whole is a gradual transition from core to non-core words. Core vocabulary therefore divides into a prototypical category and a peripheral category, and it is more accurate to describe a word by how close it lies to the core of the vocabulary, that is, by degree.

The core vocabulary list for teaching Chinese as a second language is built on the Modern Chinese core vocabulary list, with full consideration of the regularities of second language teaching. It is intended for any beginner learning Chinese as a second language. Beyond this, it deliberately disregards factors such as age, nationality, social status, learning purpose, learning environment (a target-language or non-target-language country), and mode of learning (formal schooling or self-study), in pursuit of the widest possible generality and applicability.

According to Zipf's law, the most frequent word in a language occurs twice as often as the second most frequent, three times as often as the third, a hundred times as often as the hundredth, a thousand times as often as the thousandth, and so on. This means that a learner who masters a small proportion of the most frequent words in a language can understand a considerable part of its content. The distribution of Modern Chinese vocabulary indicates that 3,000 entries is a significant threshold for the core vocabulary: they cover about 75% of the whole corpus, and beyond that point coverage grows markedly more slowly.
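
As a rough illustration of the coverage argument, the sketch below computes cumulative coverage under an idealized Zipf distribution, in which the frequency of the word at rank r is proportional to 1/r. The function name and the vocabulary size of 50,000 are assumptions made for illustration; they are not taken from the thesis, whose figures come from real corpus counts that an idealized model need not reproduce.

```python
def harmonic(n: int) -> float:
    """Sum of 1/r for r = 1..n (the n-th harmonic number)."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_coverage(top_n: int, vocab_size: int) -> float:
    """Share of all tokens covered by the top_n most frequent words,
    assuming an idealized Zipf distribution (frequency of rank r ~ 1/r)."""
    if not 0 < top_n <= vocab_size:
        raise ValueError("top_n must be between 1 and vocab_size")
    return harmonic(top_n) / harmonic(vocab_size)


# Hypothetical vocabulary size, chosen only for illustration.
VOCAB_SIZE = 50_000
for n in (100, 1_000, 3_000, 10_000):
    print(f"top {n:>6} words: {zipf_coverage(n, VOCAB_SIZE):.1%} of tokens")
```

Because the harmonic sum grows only logarithmically, each additional block of words adds less coverage than the previous one, which is the slowdown the paragraph above describes.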

The main task of beginner-level Chinese teaching is to teach standard spoken Chinese. Compared with the distribution of Modern Chinese vocabulary as a whole, the coverage of high-frequency words in an everyday spoken corpus is more concentrated. The core vocabulary for teaching Chinese as a second language should therefore contain fewer than 3,000 words, with roughly 2,300 to 2,800 high-frequency words as the appropriate range.
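
One way to make "more concentrated" concrete is to count how many top-ranked words are needed to reach a given share of all tokens. The sketch below does this for two made-up toy samples; the function name, the 75% target as applied here, and the sample data are illustrative assumptions, not the corpora or the procedure used in the thesis.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Smallest number of top-ranked word types whose combined
    frequency reaches the target share of all tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Made-up toy data: the "spoken" sample repeats a few words heavily,
# while the "general" sample spreads its tokens over more word types.
spoken = ["你好"] * 6 + ["谢谢"] * 3 + ["再见"]
general = ["你好"] * 3 + ["谢谢"] * 2 + ["再见"] * 2 + ["经济", "政策", "研究"]

print(words_needed_for_coverage(spoken, 0.75))   # fewer word types needed
print(words_needed_for_coverage(general, 0.75))  # more word types needed
```

A corpus whose high-frequency words are more concentrated reaches the same coverage with a shorter list, which is the reasoning behind a core vocabulary smaller than 3,000 words.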

The criteria for identifying core vocabulary operate at two levels, surface and internal. Frequency, balanced distribution, and stability are surface criteria that capture the external characteristics of core vocabulary. Universality, by contrast, is an internal criterion and an essential attribute of core vocabulary.

The core vocabulary list for teaching Modern Chinese as a second language was developed in three successive steps. First, a contrast-extraction method was used to extract the most frequent Modern Chinese words from three balanced corpora. Second, the words meeting the universality criterion and the standardization principle were selected from these to form the Modern Chinese core vocabulary list. Finally, in light of the regularities of second language acquisition, the universality of the list was further optimized at the micro level and acquisition factors were taken into account, yielding 2,397 Modern Chinese core words for second language teaching.
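
The first step can be sketched in code. The abstract does not spell out how the three corpora are compared, so the rule used below (keep only words that rank among the top n in every corpus) is an assumption, as are the function names and the toy token lists, which stand in for real, pre-segmented corpora.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the set of the n most frequent tokens in one corpus."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def contrast_extract(corpora, n):
    """Keep only words that rank in the top n of every corpus.
    This intersection rule is an assumed stand-in for the thesis's method."""
    if not corpora:
        raise ValueError("at least one corpus is required")
    candidate_sets = [top_words(tokens, n) for tokens in corpora]
    return set.intersection(*candidate_sets)


# Made-up token lists standing in for three segmented balanced corpora.
corpus_a = ["我", "是", "我", "是", "我", "书"]
corpus_b = ["我", "是", "是", "我", "茶", "我"]
corpus_c = ["是", "我", "我", "是", "是", "山"]

candidates = contrast_extract([corpus_a, corpus_b, corpus_c], n=3)
print(sorted(candidates))
```

Words that are frequent in only one corpus (here 书, 茶, and 山) drop out, leaving candidates for the later universality screening and acquisition-based adjustment.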


The development of the core vocabulary list led to the following observations:

  1. The scientific rigor of a vocabulary list is always relative. On the one hand, the randomness of linguistic signs means that corpus statistics can give only a rough picture. On the other hand, a corpus is inevitably shaped by social and cultural factors, so a vocabulary list extracted from a corpus can never be "pure" in an absolute sense.
  2. Problems that arise in developing a vocabulary list often stem from the rigor and suitability of the corpus itself. On the whole, supplementing the list by association does not overcome the limitations of word frequency statistics well. Such supplementation is based on semantic distance and does not reflect how frequently a word is used; moreover, a large body of linguistic research has long shown that meaning is often an unreliable criterion. Because corpus-based lexical statistics reflect the facts of the language better than association does, efforts to make vocabulary lists more rigorous should return to the construction and selection of corpora.
  3. Listening, reading, speaking, and writing are different processes, and even among frequently used words the spoken and written corpora differ considerably. A core vocabulary list for second language teaching likewise has its own characteristics. The main task of the beginner stage of teaching Chinese as a second language is to teach standard spoken Chinese. Accordingly, the corpora used to develop vocabulary lists for the beginner level and for the intermediate and advanced levels should differ. The beginner-level list should be extracted from a topic-balanced corpus of standard spoken Chinese rather than from a balanced corpus in the general sense.

Keywords: core vocabulary, corpus, word frequencies, Chinese vocabulary




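Abstract (translated from the Chinese version): Core words are the most basic and most commonly used words in a language, those that play an important role in conveying information in communication. They form the common core of the vocabulary across verbal communication, both spoken and written, in different registers, genres, and styles, and they occupy the central position in the lexical-semantic system. From the language user's point of view, core words can be used and understood by any adult native speaker. From the perspective of cross-cultural communication, core words can to some extent be regarded as the "universal vocabulary" found across languages. For second language teaching, core words are the words that must be mastered at the beginner stage of learning Chinese. There is no clear boundary between core and non-core words; intermediate states exist between the two, and the vocabulary system as a whole is a gradual transition from core to non-core words. Core words therefore divide into prototypical and peripheral categories, and it is more accurate to say that a word lies closer to the core layer of the vocabulary, or to describe core status in terms of degree. The core vocabulary list for second language teaching is built on the Modern Chinese core vocabulary list, with full consideration of the regularities of second language teaching. It serves any beginner learning Chinese as a second language. Beyond that, it disregards the influence of other factors such as age, nationality, social status, learning purpose, learning environment (a target-language or non-target-language country), and mode of learning (formal schooling or self-study), and aims for the widest possible generality and applicability.

According to Zipf's law, the most frequent word in any language occurs twice as often as the second most frequent, three times as often as the third, a hundred times as often as the hundredth, a thousand times as often as the thousandth, and so on. This means that mastering a small proportion of the most frequent words in a language makes it possible to understand a considerable part of its content. The distribution of Modern Chinese vocabulary shows that 3,000 entries is a significant threshold for the core vocabulary, covering roughly 75% of the whole corpus. Beyond that point, coverage grows markedly more slowly.

The task of beginner-level Chinese teaching is to teach standard spoken Chinese. Compared with the distribution of Modern Chinese vocabulary as a whole, the coverage of high-frequency words in an everyday spoken corpus is more concentrated. The number of Chinese core words for second language teaching should therefore be below 3,000, roughly between 2,300 and 2,800. Core words are judged on five criteria: frequency, balanced distribution, stability, universality, and strong combinatory capacity.

The core vocabulary list for teaching Modern Chinese as a second language was developed in three successive steps. First, a contrast-extraction method was used to extract frequent Modern Chinese words from three balanced corpora. Second, the words meeting the universality criterion and the standardization principle were selected to form the Modern Chinese core vocabulary list. Finally, in light of the regularities of second language acquisition, the universality of the list was further optimized at the micro level and acquisition factors were taken into account, yielding 2,397 Modern Chinese core words for second language teaching.

The development of the list led to the following observations:

  1. The scientific rigor of a vocabulary list is always relative. On the one hand, the randomness of linguistic signs means that corpus statistics can give only a rough picture. On the other hand, a corpus is inevitably influenced by social and cultural factors, so a list extracted from a corpus can never be "pure" in an absolute sense.
  2. Problems in developing a vocabulary list often stem from the rigor and suitability of the corpus itself. On the whole, supplementing the list by association does not overcome the limitations of word frequency statistics well. Such supplementation is based on semantic distance and does not reflect how frequently a word is used; moreover, a large body of linguistic research has long shown that meaning is often an unreliable criterion. Corpus-based lexical statistics reflect the facts of the language better than association does, so efforts to make vocabulary lists more rigorous should return to the construction and selection of corpora.
  3. A core vocabulary list for second language teaching has its own characteristics. The task of beginner-level Chinese teaching is standard spoken Chinese, while the intermediate and advanced levels gradually shift toward the written language. Accordingly, the corpora used to develop vocabulary lists for the beginner level and for the intermediate and advanced levels should differ. The beginner-level list should be extracted from a topic-balanced corpus of standard spoken Chinese rather than from a balanced corpus in the general sense.
  4. A vocabulary list generated from a corpus reflects how frequently adult native speakers of Chinese use words; it does not take into account the order of acquisition in second language learning. Some very frequent words are, from the standpoint of acquisition order, unsuitable for the beginner stage. Developing a beginner-level Chinese vocabulary list therefore also requires screening out a number of words that are harder to master, based on acquisition order.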