Eric Long, Shoichi Yokoyama

Text genre and kanji frequency

This study explores several ways of using kanji frequency lists derived from multiple corpora to characterise kanji usage within those corpora. First we discuss the scope of, and issues in processing, four corpora derived from commercially available CD-ROMs: two encyclopedias, a database of newspaper articles, and a four-CD-ROM collection of the texts of mostly fictional paperback books. Next, a summary of the kanji frequency data is given, and it is pointed out that the frequency distribution differs noticeably from a classic Zipf's law distribution. The standard set of Jōyō kanji is compared with the high-frequency kanji in the corpora, and the degrees of similarity among the corpora are obtained with the Chi-squared By Degrees of Freedom (CBDF) measure proposed by Kilgarriff (1997). Finally, a simple method is tried and evaluated for identifying kanji that have a high frequency in a particular corpus compared to their cross-corpus frequency.
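
As an illustration of the corpus-similarity step, the following is a minimal sketch of a chi-squared-by-degrees-of-freedom comparison between two kanji frequency lists. It is not taken from the article: the function name `cbdf`, the `top_n` cutoff, and the choice to compute expected counts from the totals of the selected kanji only are assumptions made for this example, and the authors' exact procedure may differ.

```python
from collections import Counter


def cbdf(freq_a, freq_b, top_n=500):
    """Chi-squared divided by degrees of freedom for two frequency lists.

    freq_a, freq_b: dicts mapping a kanji to its count in each corpus.
    top_n: number of most frequent kanji (in the combined counts) to compare.
    Assumption: expected counts come from the totals of the selected kanji,
    not from the full corpus sizes.
    """
    joint = Counter(freq_a) + Counter(freq_b)
    items = [kanji for kanji, _ in joint.most_common(top_n)]
    if len(items) < 2:
        raise ValueError("need at least two distinct kanji to compare")

    total_a = sum(freq_a.get(k, 0) for k in items)
    total_b = sum(freq_b.get(k, 0) for k in items)
    if total_a == 0 or total_b == 0:
        raise ValueError("each corpus must contain at least one selected kanji")
    grand_total = total_a + total_b

    chi2 = 0.0
    for k in items:
        obs_a = freq_a.get(k, 0)
        obs_b = freq_b.get(k, 0)
        row_total = obs_a + obs_b
        # Expected counts if both corpora shared one underlying distribution.
        exp_a = total_a * row_total / grand_total
        exp_b = total_b * row_total / grand_total
        chi2 += (obs_a - exp_a) ** 2 / exp_a
        chi2 += (obs_b - exp_b) ** 2 / exp_b

    # A 2 x N contingency table has (N - 1) degrees of freedom.
    return chi2 / (len(items) - 1)


# Toy counts, invented for the example.
newspaper = {"日": 10, "本": 5}
fiction = {"日": 20, "本": 10}
print(cbdf(newspaper, fiction))
```

Under this sketch, two corpora with the same relative kanji frequencies score zero, and larger values indicate less similar usage.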

Glottometrics, RAM-Verlag

Volume: 10, 06/2005
Pages: 55-72
