The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese

Abbreviated Name: Diachronic Corpora
Project Leader: OGISO Toshinobu (Professor, Language Change Division, NINJAL)
Project Period: April 2016 - March 2022
Keywords: Corpus of Historical Japanese, Morphological analysis
Related Site: The Corpus of Historical Japanese (CHJ)

Summary

Background and Purpose

Across language research at large, corpus-based empirical research has advanced and yielded considerable results. A corpus is a large-scale language resource stored on computers. It systematically collects examples of language use from texts and provides information that is essential to researchers. For languages of the past, all researchers must base their arguments on extant texts and the examples of usage they contain. Japanese language historians have conducted their research in this way, and their main sources have been printed editions of historical texts, together with general reference materials that record where and how often these examples occur in those editions. These materials were highly specialized and could not be used by laypeople.

If these paper-based materials can be converted into a corpus format, historical Japanese language research can be developed using new methods. On the one hand, corpus-based research will continue existing lines of research while making them more efficient, in keeping with the times. On the other hand, it will expand the range of possibilities. For example, linguistic research will be able to incorporate the statistical methods used in corpus linguistics. In addition, by making it easier to handle a variety of materials from many different time periods, a corpus will enable researchers to take a macro perspective and view the texts as a whole. Furthermore, publishing a corpus online will encourage researchers overseas and in other disciplines to draw on historical Japanese language research, which will in turn bring broader perspectives into the field.

To bring about such corpus-based historical Japanese language research, it is first essential to create a historical corpus. The National Institute for Japanese Language and Linguistics (NINJAL) has started work on the construction of a corpus titled “Corpus of Historical Japanese (CHJ).” This project involves converting the major historical Japanese texts to corpus format and, as the final step, creating a “diachronic corpus” with which researchers can trace the history of Japanese. The project also involves preparing a “word information database” that handles information related to the history of the Japanese language. The plan is to collate this information with the information in the corpus and open a portal site with which researchers can trace the history of the language. The research groups assigned to each time period or research area will use the finished corpus to develop the research assigned to them.

Objectives and Methods

The project members pursue their research activities in the following three units: the “corpus construction unit,” which is responsible for creating the diachronic corpus; the “word information database unit,” which is responsible for creating the word information database and portal site; and the “corpus application unit,” which is responsible for utilizing the corpus and database in historical Japanese language research.

The corpus construction unit will enter texts from each period into the corpus.

After transliterating the texts and annotating their document structure, the members will use morphological analysis tools to divide the entire text into linguistic units and add morphological information such as readings, parts of speech, and lemma identification. They will then manually correct the results in the database.
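The annotation workflow described above can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the project's actual data model or tooling: the `Token` fields, the `apply_corrections` helper, the simplified English part-of-speech labels, and the deliberately wrong analyzer output are all hypothetical. It shows only the general idea of keeping automatic analysis results and overlaying manual corrections on them.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    """One linguistic unit with its morphological information (hypothetical schema)."""
    surface: str  # form as written in the text
    reading: str  # reading in kana
    pos: str      # part of speech (simplified label, not a real tag set)
    lemma: str    # headword the token is identified with


def apply_corrections(tokens, corrections):
    """Return a new token list with manual corrections overlaid.

    `corrections` maps a token index to the fields an annotator changed.
    The automatic analysis in `tokens` is left untouched.
    """
    return [replace(tok, **corrections.get(i, {})) for i, tok in enumerate(tokens)]


# Hypothetical analyzer output for the phrase 春は曙, with an invented
# error: the particle は has been misidentified as the noun 歯 ("tooth").
auto = [
    Token("春", "ハル", "noun", "春"),
    Token("は", "ハ", "noun", "歯"),
    Token("曙", "アケボノ", "noun", "曙"),
]

# A manual correction for token 1, as an annotator might record it.
manual = {1: {"pos": "particle", "lemma": "は"}}

corrected = apply_corrections(auto, manual)
```

Keeping the corrections separate from the automatic output, as this sketch does, is one possible design choice: it makes it easy to see what the annotators changed.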

The finished corpus will be released to the public on a corpus search application called “Chunagon.” On the site, users will be able to carry out sophisticated searches that combine various morphological information and will also be able to download usage examples.
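To give a rough idea of what a search that combines morphological information involves, here is a small Python sketch. It is not Chunagon's interface or query syntax; the `search` and `kwic` functions, the dictionary keys, and the simplified part-of-speech labels are hypothetical and serve only to illustrate matching tokens on several fields at once and returning the surrounding context.

```python
def search(tokens, **conditions):
    """Yield (index, token) pairs whose fields match every given condition."""
    for i, tok in enumerate(tokens):
        if all(tok.get(field) == value for field, value in conditions.items()):
            yield i, tok


def kwic(tokens, index, width=2):
    """Return (left context, keyword, right context) as surface strings."""
    left = "".join(t["surface"] for t in tokens[max(0, index - width):index])
    right = "".join(t["surface"] for t in tokens[index + 1:index + 1 + width])
    return left, tokens[index]["surface"], right


# A toy annotated text (春は曙 / 夏は夜); the annotation labels are illustrative.
tokens = [
    {"surface": "春", "pos": "noun", "lemma": "春"},
    {"surface": "は", "pos": "particle", "lemma": "は"},
    {"surface": "曙", "pos": "noun", "lemma": "曙"},
    {"surface": "夏", "pos": "noun", "lemma": "夏"},
    {"surface": "は", "pos": "particle", "lemma": "は"},
    {"surface": "夜", "pos": "noun", "lemma": "夜"},
]

# Combine two conditions: part of speech and lemma.
hits = list(search(tokens, pos="particle", lemma="は"))
examples = [kwic(tokens, i, width=1) for i, _ in hits]
```

Each entry in `examples` is a usage example in keyword-in-context form, which is the kind of result a user might download for further study.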

The word information database unit will work on preparing a database of old dictionaries, linguistic maps, and language articles. They will then combine this database with statistical information acquired from the corpus, and prepare and publish a word information portal site. This site will link to various linguistic resources, and thus serve as a portal for language research.

The corpus application unit will establish a number of groups for each time period and a number of groups for each area of research, including grammar, vocabulary, and annotation. Each research group will hold its own research presentation meetings and develop historical Japanese language research using the corpus. The unit will hold one or more workshops and symposia to report the research outcomes. It will also hold corpus application seminars and develop activities designed to expand the range of applications of the corpus.

Project Members