Corpora built by the National Institute for Japanese Language and Linguistics. (A corpus is an electronic collection of systematically gathered language data with a variety of tags added for linguistic analysis.)
- Balanced Corpus of Contemporary Written Japanese
‘The Balanced Corpus of Contemporary Written Japanese’ (BCCWJ) is a corpus created to capture the diversity of contemporary written Japanese. It comprises 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Randomly drawn samples are annotated with morphological information and document structure. BCCWJ is available to the public online as well as on a DVD set.
Shonagon is a web concordancer with which even beginners in corpus linguistics can run string searches over the BCCWJ.
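For readers new to concordancers, the following is a minimal sketch of a keyword-in-context (KWIC) string search, the general kind of lookup a concordancer performs. It is not Shonagon's implementation; the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
def kwic(text, keyword, width=10):
    """Return (left context, keyword, right context) for each occurrence."""
    if not keyword:
        return []
    hits = []
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]   # up to `width` characters before the hit
        right = text[end:end + width]              # up to `width` characters after the hit
        hits.append((left, keyword, right))
        start = text.find(keyword, start + 1)      # continue after this hit (allows overlaps)
    return hits


# Hand-made sample sentence, not taken from BCCWJ.
sample = "言語の研究には言語資料が必要である。"
for left, kw, right in kwic(sample, "言語", width=4):
    print(f"{left} [{kw}] {right}")
```

A plain character-level search like this needs no word segmentation, which is one reason string search is a convenient entry point for Japanese text.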
- NINJAL-LWP for BCCWJ (NLB)
NINJAL-LWP for BCCWJ (NLB) is an online search tool for the BCCWJ which uses the lexical profiling technique. It has been jointly developed by the National Institute for Japanese Language and Linguistics (NINJAL) and Lago Gengo Kenkyusho.
- Corpus of Spontaneous Japanese
The “Corpus of Spontaneous Japanese” (or CSJ) is a database containing a large collection of Japanese spoken language data, together with information for use in linguistic research. Jointly developed by NINJAL, NICT and the Tokyo Institute of Technology, the CSJ is world-class in both the quantity (7.5 million words) and the quality of the available data.
The corpus has been used for a wide variety of research purposes such as spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.
- Corpus of Historical Japanese
This corpus collects materials for research on the history of the Japanese language. Development is ongoing, with a view to producing a diachronic corpus covering the period from ancient to modern times. The parts completed so far are already available.
- NINJAL Web Japanese Corpus
A 20-billion-word Web text corpus built by crawling 100 million pages every three months. The corpus is automatically annotated with morphological information and dependency structures.
- Learner-Corpus Study of Acquisition of Japanese as a Second Language
(1) Corpus of Japanese as a Second Language (C-JAS)
The NINJAL learners’ longitudinal oral data, C-JAS, are now open to the public. The corpus contains interview data from 6 JSL learners (3 Chinese and 3 Korean) studying in Japan, collected over 3 years.
NB: JSL = Japanese as a Second Language
(2) International Corpus of Japanese as a Second Language (I-JAS)
In May 2016, NINJAL made public the I-JAS learners’ corpus, which contains cross-sectional oral and written data from 20 different areas across the world, covering more than 12 native languages. It contains oral task data (story-telling, role-play, interview and picture-description) from 210 learners and 15 native Japanese speakers, with audio recordings. It also contains written data (story-writing, e-mail writing and an essay), which were voluntary tasks. NINJAL will provide the data of 1000 learners and 50 native speakers of Japanese in 2020.
- Corpora of Modern Japanese
This is a corpus developed for research on the Japanese language of the Meiji and Taisho eras. The ‘Taiyo corpus’, ‘Modern women’s magazines corpus’, ‘Meiroku Zasshi corpus’, and ‘Kokumin-no-Tomo corpus’ are available.
Chunagon is a web concordancer that enables a three-way search of the corpora developed by NINJAL. Three search modes are available: short unit word, long unit word, and string. By combining conditions on morphological information, it is possible to run advanced searches of the corpus.
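As a rough illustration of what combining morphological conditions means, the sketch below filters annotated tokens by any mix of surface form, lemma, and part of speech. It is not Chunagon's query interface: the `Token` fields, the `search` function, the English part-of-speech labels, and the hand-made token list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str  # form as written in the text
    lemma: str    # dictionary form
    pos: str      # part-of-speech label (hypothetical English labels)


def search(tokens, **conditions):
    """Return the indices of tokens that satisfy every given condition."""
    return [
        i for i, tok in enumerate(tokens)
        if all(getattr(tok, name) == value for name, value in conditions.items())
    ]


# Hand-made annotation of the phrase 本を読んだ, for illustration only.
tokens = [
    Token("本", "本", "noun"),
    Token("を", "を", "particle"),
    Token("読ん", "読む", "verb"),
    Token("だ", "た", "auxiliary"),
]

surface_hits = search(tokens, surface="読ん")           # match the written form
lemma_hits = search(tokens, lemma="読む", pos="verb")   # match lemma and part of speech together
print(surface_hits, lemma_hits)
```

Searching by lemma rather than surface form retrieves every inflected form of a word at once, which is the main practical gain of morphological annotation over plain string search.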
- A Glossed Audio Corpus of Ainu Folklore
This is the first fully glossed and annotated digital collection of Ainu folktales with translations into Japanese and English. It contains 10 stories (8 uepeker ‘prosaic folktales’ and 2 kamuy yukar ‘divine epics’) narrated by Mrs. Kimi Kimura (1900-1988, born in Penakori Village, upper district of the Saru River) with a total recording time of about 3 hours.
- The NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
NPCMJ is a syntactically and semantically annotated corpus of both written and spoken Modern Japanese. There are interfaces available for anyone to search, browse, and download trees easily.
- Nagoya University Conversation Corpus
‘Nagoya University Conversation Corpus’ (NUCC) is composed of transcriptions of 129 uncontrolled, natural conversations between or among friends, family members or colleagues.
- Oxford-NINJAL Corpus of Old Japanese
“The Oxford-NINJAL Corpus of Old Japanese” is a lemmatized, parsed and comprehensively annotated digital corpus of all texts in Japanese from the Old Japanese period. In its present version, the ONCOJ contains the full corpus of Old Japanese poetic texts, including the Man'yōshū.
- BTSJ-Japanese Natural Conversation Corpus (Studies on the language use of Japanese language learners)
This is one of the world's largest corpora of naturally occurring conversations in Japanese. It currently consists of 377 conversations, with transcripts and audio, by native speakers and learners of Japanese. The conversations were compiled while controlling for social factors such as the age and gender of the speakers and interlocutors and the situation, and were transcribed using BTSJ (Basic Transcription System for Japanese), a transcription system particularly suited to pragmatic and interactional analysis. In 2021, the final version of the “BTSJ Japanese Natural Conversation Corpus,” which will include conversations by more than 1,000 speakers, will be released.
- NPCMJ Child Language Development Timeline
The NPCMJ Child Language Development Timeline (NPCMJ-CLDT) provides an interactive, timeline-mediated interface to the Soyogo Parsed Corpus (a parsed corpus of Japanese child language). The interface opens the morpho-syntactic analysis of child language to search and exploration, most notably through an age range filter. Here is your chance to discover patterns in the morpho-syntactic development of children acquiring Japanese, with an interface that makes it easy to zoom in on very specific information about individual acquisition behaviour.