Development of and Linguistic Research with a Parsed Corpus of Japanese

Abbreviated Name: Parsed Corpus
Project Leader: Prashant PARDESHI (Professor, Theory & Typology Division, NINJAL)
Project Period: April 2016 - March 2022
Keywords: A parsed corpus with syntactic and semantic tagging, Annotation
Related Site: Project Site

Summary

Background and Purpose

As with a Google search, queries of currently available corpora typically return large amounts of data, and it takes human effort to pick out what is relevant. Morphological information, i.e. the specification of parts of speech such as noun and verb, is often too basic to identify sentence structures or recover meanings. This project aims to build a corpus with high-quality syntactic annotations (e.g., subject and object) that will make searching by syntactic pattern possible. For example, such annotation will distinguish the noun kenkyuu in chuumoku-sarete iru kenkyuu (research which is watched with interest), where kenkyuu is the subject of the embedded verb chuumoku-suru, from the use of the same noun in sekai ga zessan-suru kenkyuu (research which the world admires), where kenkyuu is the object of the embedded verb zessan-suru. Building this type of corpus is an essential task and is already under way for other languages of the world. For Japanese, however, no corpora are publicly available at present with the syntactic annotations indispensable for understanding sentence structures and meanings.

This project aims to develop and offer a freely accessible corpus with syntactic annotations attached to texts, as well as associated meaning representations (logical formulas). We hope this innovative corpus will facilitate the progress of research on Japanese. Moreover, through the publication of our research output in Japan and abroad, we hope to contribute to contrastive studies between Japanese and the languages of the world.

Objectives and Methods

Our project consists of the Research Unit, which investigates problems in corpus building, and the Development Unit, which builds the corpus. The two units work together to accomplish the above-mentioned goals. We will also invite leading scholars from Japan and abroad to join the Advisory Board, with a view to making our activities widely open to scholars across the world and establishing a global network of corpus-based linguistic research.

The Research Unit deals with both theoretical and practical problems in corpus building with the aim of achieving high quality.

The Research Unit will also cooperate with the grammar group of the Project entitled “Cross-linguistic Studies of Japanese Prosody and Grammar” at NINJAL with a view to creating a new research field of corpus-based contrastive studies of Japanese and other languages.

The Development Unit aims to build and make public a corpus of annotated modern Japanese, mostly from written texts. We follow the annotation scheme of the Penn Historical Treebank, a variant of the Penn Treebank, which was first developed for English at the University of Pennsylvania and is now applied to various languages of the world. We adopt this scheme because of its abundant functional labels associated with grammatical categories, which make it possible to capture the syntactic and semantic information of the constituents of a sentence correctly. We will also provide a Romanized version of our corpus to remove the script barrier, a user-friendly interface for non-tech-savvy researchers and students, and a user manual in both Japanese and English.
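To illustrate how functional labels in a bracketed, Penn-style tree can support searching by syntactic pattern, here is a minimal sketch in Python. The two trees, the labels used (IP-REL, NP-SBJ, NP-OB1), the trace symbol *T*, and the helper names parse, find and gap_role are simplified assumptions made for this illustration; they are not taken from the project's actual annotation or tools.

```python
import re

# Hypothetical, simplified trees for the two relative clauses discussed above.
# "*T*" marks the position inside the clause that the head noun kenkyuu fills.
WATCHED = "(NP (IP-REL (NP-SBJ *T*) (VB chuumoku-sarete) (VB2 iru)) (N kenkyuu))"
ADMIRED = ("(NP (IP-REL (NP-OB1 *T*) (NP-SBJ (N sekai) (P ga)) "
           "(VB zessan-suru)) (N kenkyuu))")


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word or trace (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def find(tree, label):
    """Yield every subtree whose label equals `label`."""
    if tree[0] == label:
        yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from find(child, label)


def gap_role(tree):
    """Return the functional tag of the trace in the first relative clause."""
    for rel in find(tree, "IP-REL"):
        for child in rel[1]:
            if isinstance(child, tuple) and child[1] == ["*T*"]:
                return child[0].split("-", 1)[1]
    return None


print(gap_role(parse(WATCHED)))  # role of kenkyuu in the first example
print(gap_role(parse(ADMIRED)))  # role of kenkyuu in the second example
```

Under these assumptions, the function reports a subject role for the first tree and an object role for the second, a distinction that a search over part-of-speech tags alone could not draw.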

Through the interaction of the Advisory Board, the Research Unit, and the Development Unit, we plan to build and make publicly available an innovative corpus for Japanese, and in so doing we aim to make a valuable contribution to research on the Japanese language worldwide.

Project Members

National Institute for Japanese Language and Linguistics