Agency for Cultural Affairs Commissioned Project: Digital Infrastructure Development Project for Preservation and Utilization of Modern Japanese as a Reliable Language Resource

Project Period
April 2024 -
Related Website
【Agency for Cultural Affairs commissioned project】BCCWJ2 -Balanced Corpus of Contemporary Written Japanese


In 2024, the Agency for Cultural Affairs commissioned NINJAL to undertake the “Digital Infrastructure Development Project for Preservation and Utilization of Modern Japanese as a Reliable Language Resource.”

NINJAL will expand the Balanced Corpus of Contemporary Written Japanese (BCCWJ), the first large-scale balanced corpus of the Japanese language, from its current size of approximately 100 million words to 200 million words, in order to capture the diversity of contemporary written Japanese.

BCCWJ makes available, in online and paid offline versions, approximately 100 million words of text randomly sampled from books, magazines, newspapers, white papers, the Web, laws, and more, annotated with morphological information and document structure tags. In this project, we will select and identify statistically appropriate sentence samples from books, newspapers, etc., published from 2006 to 2025 to serve as a microcosm of the modern Japanese language. Following the copyright clearance process, we will annotate the samples with information on parts of speech, meaning, and sentence structure, and compile them into an electronic database. 20 million words will be added each year, for a total of 100 million words added to the current BCCWJ over five years.
