Extending the Diachronic Corpus through an Open Co-construction Environment
- Project Leader: OGISO Toshinobu (Professor, NINJAL)
- Project Period: April 2022 -
Background and Purpose
The Corpus of Historical Japanese (CHJ), a diachronic corpus that has been constructed by NINJAL, contains historical Japanese language materials from the Manyōshū era to the Meiji and Taisho eras. It is available free of charge on the Internet and has been used by many people, including researchers. It is now becoming an indispensable resource for historical research on the Japanese language.
While the main materials are already included in the corpus, much more is needed for linguistic research, and some errors remain in the corpus that need correction. This project will therefore continue and expand the previous diachronic corpus project by adding the missing material.
Furthermore, to cover even more material, we need the help of experts in various texts from different periods, and there are limits to what can be done within the scope of NINJAL's projects. This project will accordingly create an open collaborative environment that will allow data created by people outside of NINJAL to be used in the same way as NINJAL’s own corpus.
In addition, using the diachronic corpus constructed thus far, we will apply natural language processing methods, which have been rapidly developing in recent years, to the study of the history of the Japanese language.
Objectives and Methods
To pursue the three initiatives above, three research groups will be established.
The first is the CHJ Expansion Group, which will take over from the project "The Construction of Diachronic Corpora and New Developments in Research on the History of Japanese," which ran until FY2021, and expand the CHJ to make it more complete as a diachronic corpus. In particular, we plan to select important materials from the Edo period onward, of which a large number have survived, and compile them into subcorpora. In addition, in collaboration with a Grant-in-Aid for Scientific Research (KAKENHI) project, we will work to create a corpus of medieval "Shōmono". We will also include data from the Showa and Heisei eras, based on the results of another KAKENHI project. Furthermore, in collaboration with Oxford University, we will further develop the Oxford-NINJAL Corpus of Old Japanese.
The second is the Open CHJ research group, which will build and maintain corpora in an open collaborative environment. Drawing on the corpus construction knowledge accumulated by NINJAL, the group will develop the tools and guidelines needed for corpus construction, including standard data formats and licenses. For example, we will extend "Web Chamame," which enables the morphological analysis of Japanese language materials from various periods, so that it supports users in making their own data publicly available as a corpus. This will allow outside researchers and members of the general public interested in diachronic corpora to publish their materials on the Internet through an interface similar to that of the CHJ. In addition, we will operate a system that lets users report errors in the corpus from within the "Chunagon" corpus search application, improving the accuracy of the corpus through user contributions. Through these efforts, we will enhance the diachronic corpus of the Japanese language with the involvement of the entire academic community. We also plan to hold workshops and publish guidebooks for this purpose.
The third group is the Natural Language Processing Application Group for the study of Japanese language history. Together with researchers from other fields, including those at the Institute of Statistical Mathematics, the group will take on new research projects that utilize diachronic corpora, such as elucidating mechanisms of language change using statistical models, extracting historical language change from corpora, and translating historical Japanese texts into contemporary Japanese using neural machine translation technology. In addition, we will conduct research on the assignment of semantic information to the CHJ based on the Bunrui Goihyō (Word List by Semantic Principles).