Development of Classification Indices to Treat a Variety of Texts

Abbreviation: Text Classification Indices
Project Leader: KASHINO Wakako
Associate Professor, Department of Corpus Studies, NINJAL
Research field: Japanese Linguistics
Keywords: Text classification, Writing style, Corpus

Summary

The classification indices commonly available for books are limited to the Nippon Decimal Classification (NDC), used for genre classification, and the Japan book classification codes (C codes), used to indicate marketing targets and sales outlets. Neither is sufficient for the linguistic study of texts or for the linguistic use of corpora. This project aims to design and verify a classification scheme for book texts that covers the variety of formats, contents, and expressions that text research and corpus use need to distinguish.

First, an index is provided to indicate whether the text structure is a simple type (e.g., chapter-and-section structure) or an atypical type (e.g., conversation, Q&A format, illustrations, or a glossary). Second, an index is provided to classify texts with a simple structure according to features of their content and expression: difficult or easy, stiff or relaxed, polite or informal, written or spoken, subjective or objective, and so on.
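The two indices described above can be pictured as a small record per text sample. The following is only an illustrative sketch, not the project's actual data format: the names `Structure`, `STYLE_AXES`, and `TextIndex`, the axis labels, and the assumption that each axis takes exactly one of two values are all hypothetical, and the source notes that the list of axes is not exhaustive.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Structure(Enum):
    """First index: overall text structure (labels are hypothetical)."""
    SIMPLE = "simple"      # e.g., chapter-and-section structure
    ATYPICAL = "atypical"  # e.g., conversation, Q&A format, glossary


# Second index: content/expression axes named in the summary.
# Axis names and the binary treatment are assumptions for illustration.
STYLE_AXES = {
    "difficulty": ("difficult", "easy"),
    "tone": ("stiff", "relaxed"),
    "politeness": ("polite", "informal"),
    "mode": ("written", "spoken"),
    "stance": ("subjective", "objective"),
}


@dataclass
class TextIndex:
    sample_id: str
    structure: Structure
    # Only texts with a simple structure receive the second index.
    style: Optional[Dict[str, str]] = None

    def __post_init__(self) -> None:
        if self.style is None:
            return
        if self.structure is not Structure.SIMPLE:
            raise ValueError("style index applies only to simple-structure texts")
        for axis, value in self.style.items():
            if axis not in STYLE_AXES:
                raise ValueError(f"unknown axis: {axis}")
            if value not in STYLE_AXES[axis]:
                raise ValueError(f"invalid value {value!r} for axis {axis}")


# Usage: a simple-structure sample labeled on two of the axes.
sample = TextIndex("sample-0001", Structure.SIMPLE,
                   {"difficulty": "easy", "mode": "written"})
```

Keeping the second index optional mirrors the description: it is assigned only after a text has been judged to have a simple structure.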

The classification indices will be assigned, manually or automatically, to the more than 10,000 text samples to be included in the Balanced Corpus of Contemporary Written Japanese, and will then be verified systematically.