Balanced Corpus of Contemporary Written Japanese (BCCWJ)
BCCWJ is a balanced corpus of one hundred million words of contemporary written Japanese, and one of the components of KOTONOHA. It is probably the most important of the KOTONOHA component corpora, because contemporary written Japanese is the variety of greatest interest to language researchers and the general public alike. It is also the variety most readily applied to practical products such as dictionaries and teaching materials. The compilation of BCCWJ started in 2006 as a five-year project, and is supported in part by a Grant-in-Aid for Scientific Research on Priority Area "Japanese Corpus" from MEXT (the Japanese Ministry of Education).
As shown in the figure below, BCCWJ consists of three subcorpora. The one in the top left corner is called the Publication Subcorpus. Samples of this corpus are extracted randomly from the population of all books, magazines, and major newspapers published in the years 2001-2005.
The corpus in the top right corner is called the Library Subcorpus. Its population consists of all books that are catalogued at more than 13 metropolitan libraries in Tokyo.
Lastly, the corpus at the bottom is called the Special-purpose Subcorpus. It contains a series of mutually unrelated mini corpora that the NINJAL research groups need for specific research purposes. The mini corpora include governmental white papers, textbooks, laws, bestselling books, and text from the Internet (provided courtesy of Yahoo! Japan Inc.). Each of these mini corpora contains several million words of text.
[Figure: Structure of the BCCWJ. Publication Subcorpus: 35 million words. Library Subcorpus (books catalogued at more than ...): 30 million words. Special-purpose Subcorpus (white paper text, Internet text, Diet minutes, bestselling books, etc.): 35 million words.]
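The random extraction described for the Publication Subcorpus can be illustrated with a minimal sketch. It is not the procedure actually used to compile BCCWJ: the catalogue below, the `draw_sample` helper, and the sample size are hypothetical, and the sketch only shows uniform random sampling without replacement from a known population.

```python
import random

# Hypothetical population: (title, word_count) pairs standing in for
# a catalogue of published texts. Not real BCCWJ data.
population = [(f"title_{i:04d}", 1000 + 10 * i) for i in range(1000)]


def draw_sample(items, k, seed=0):
    """Return k items chosen uniformly at random without replacement.

    A fixed seed makes the draw reproducible (an assumption made for
    this illustration, not a documented property of BCCWJ).
    """
    rng = random.Random(seed)
    return rng.sample(items, k)


# Draw 50 texts from the hypothetical catalogue.
sample = draw_sample(population, 50)
```

Because every item has the same chance of being drawn, the sample reflects the population without favouring any particular publisher, genre, or year, which is the property a balanced corpus relies on.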