Development of Classification Indices to Treat a Variety of Texts
Abbreviation: Text Classification Indices
Project Leader: KASHINO Wakako (Associate Professor, Department of Corpus Studies, NINJAL)
Project Period: October 2009 - September 2012
Research Field: Japanese Linguistics
Keywords: Text classification, Writing style, Corpus
The classification indices commonly available for books are limited to the Nippon Decimal Classification (NDC), used for genre classification, and the Japanese book classification codes (C codes), used to indicate marketing targets and sales outlets. Neither is sufficient for studying texts or for using corpora in linguistic research. This project aims to design and verify a classification scheme for book texts that captures the variety of formats, contents, and expressions relevant to text research and corpus use.
First, an index indicates whether the text has a simple structure (e.g., organized into chapters and sections) or an atypical one (e.g., conversation, Q&A format, illustrations, or a glossary). Second, an index classifies texts with a simple structure according to features of their content and expression, such as difficult or easy, stiff or relaxed, polite or informal, written or spoken, and subjective or objective.
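The two-level scheme described above could be represented roughly as follows. This is only an illustrative sketch, not the project's actual data format: the names `Structure`, `STYLE_AXES`, and `TextIndex`, the axis names, and the 1-to-5 rating scale are all assumptions introduced here. Only the structure distinction and the paired features come from the description.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Structure(Enum):
    """First index: overall text structure."""
    SIMPLE = "simple"      # conventional, chapter-based layout
    ATYPICAL = "atypical"  # conversation, Q&A, illustrations, glossary, ...


# Second index: paired features from the project description.
# Axis names and the (low pole, high pole) ordering are hypothetical.
STYLE_AXES = {
    "difficulty": ("easy", "difficult"),
    "formality": ("relaxed", "stiff"),
    "politeness": ("informal", "polite"),
    "mode": ("spoken", "written"),
    "stance": ("objective", "subjective"),
}


@dataclass
class TextIndex:
    sample_id: str
    structure: Structure
    # Assumed scale: 1 (low pole) to 5 (high pole), 3 = neutral.
    style: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self):
        # The second index applies only to texts with a simple structure.
        if self.structure is Structure.ATYPICAL and self.style:
            raise ValueError("style axes apply only to simple-structure texts")
        for axis, score in self.style.items():
            if axis not in STYLE_AXES:
                raise ValueError(f"unknown axis: {axis}")
            if not 1 <= score <= 5:
                raise ValueError(f"score out of range for {axis}: {score}")

    def label(self, axis: str) -> str:
        """Return the pole name for a rated axis, or 'neutral' at the midpoint."""
        score = self.style[axis]
        low, high = STYLE_AXES[axis]
        if score == 3:
            return "neutral"
        return low if score < 3 else high


sample = TextIndex("sample-001", Structure.SIMPLE,
                   {"difficulty": 4, "politeness": 1})
print(sample.label("difficulty"), sample.label("politeness"))
```

Keeping the structure index separate from the style ratings mirrors the description: a text is first sorted by structure, and only simple-structure texts receive the content and expression ratings.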
The classification indices will be assigned, manually or automatically, to the more than 10,000 text samples to be included in the Balanced Corpus of Contemporary Written Japanese (BCCWJ), and the scheme will then be verified systematically.