Corpora

The following corpora were built by the National Institute for Japanese Language and Linguistics (NINJAL). (A corpus is an electronic collection of systematically gathered language data, annotated with a variety of tags for linguistic analysis.)

  • Balanced Corpus of Contemporary Written Japanese

    ‘The Balanced Corpus of Contemporary Written Japanese’ (BCCWJ) is a corpus created to capture the diversity of contemporary written Japanese. The data comprise 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Randomly selected samples were annotated with morphological information and document structure. BCCWJ is available to the public online as well as on a DVD set.

  • Shonagon

    Shonagon is a web concordancer with which even beginners in corpus linguistics can run string searches on the BCCWJ.

  • NINJAL-LWP for BCCWJ (NLB)

    NINJAL-LWP for BCCWJ (NLB) is an online search tool for the BCCWJ which uses the lexical profiling technique. It has been jointly developed by the National Institute for Japanese Language and Linguistics (NINJAL) and Lago Gengo Kenkyusho.

  • Corpus of Spontaneous Japanese

    The “Corpus of Spontaneous Japanese” (CSJ) is a database containing a large collection of Japanese spoken-language data and associated information for use in linguistic research. It was jointly developed by NINJAL, NICT, and the Tokyo Institute of Technology. At 7.5 million words, the CSJ is world-class in both the quantity and quality of its data.
    The corpus has been used for a wide variety of research purposes such as spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.

  • Corpus of Historical Japanese

    This corpus collects materials for research on the history of the Japanese language. Development is ongoing, with the aim of producing a diachronic corpus covering the period from ancient to modern times. The parts built so far are already available.

  • NINJAL Web Japanese Corpus

    A 20-billion-word Web text corpus built by crawling 100 million pages every three months. The corpus was automatically annotated with morphological information and dependency structures.

  • Learner-Corpus Study of Acquisition of Japanese as a Second Language

    (1) Corpus of Japanese as a Second Language (C-JAS)
    The NINJAL learners’ longitudinal oral data, C-JAS, are now open to the public. The corpus contains interview data from 6 JSL learners (3 Chinese and 3 Korean) studying in Japan for 3 years.
    NB: JSL = Japanese as a Second Language

    (2) International Corpus of Japanese as a Second Language (I-JAS)
    In May 2016, NINJAL released the I-JAS learner corpus, which contains cross-sectional oral and written data from 20 different areas across the world, covering more than 12 native languages. It contains oral task data (story-telling, role-play, interview, and picture description) from 210 learners and 15 native Japanese speakers, together with the audio recordings. It also contains written data (story-writing, e-mail writing, and an essay), which were voluntary tasks. NINJAL plans to provide data from 1,000 learners and 50 native speakers of Japanese in 2020.

  • Corpora of Modern Japanese

    These corpora were developed for research on the Japanese language of the Meiji and Taisho eras. The ‘Taiyo corpus’, ‘Modern women’s magazines corpus’, ‘Meiroku Zasshi corpus’, and ‘Kokumin-no-Tomo corpus’ are available.

  • Chunagon

    Chunagon is a web concordancer that enables a three-way search of the corpora developed by NINJAL. Three search modes are available: short-unit word, long-unit word, and string. By combining morphological information, users can run advanced searches of the corpora.

  • A Glossed Audio Corpus of Ainu Folklore

    This is the first fully glossed and annotated digital collection of Ainu folktales with translations into Japanese and English. It contains 10 stories (8 uepeker ‘prosaic folktales’ and 2 kamuy yukar ‘divine epics’) narrated by Mrs. Kimi Kimura (1900-1988, born in Penakori Village, upper district of the Saru River) with a total recording time of about 3 hours.

  • Learners' L1-Japanese Contrastive Databases

    "Learners' L1-Japanese Contrastive Databases", developed by NIJLA (former name of NINJAL), consist of the following two kinds of sub-databases: "Japanese Learners' Contrastive Short Essay Database" and "Japanese Learners' Contrastive Speech Production Database". The data contained in the databases are produced by Japanese learners both in Japanese language and their first languages (L1).

  • The NINJAL Parsed Corpus of Modern Japanese (NPCMJ)

    NPCMJ is a syntactically and semantically annotated corpus of both written and spoken Modern Japanese. Interfaces are available that let anyone search, browse, and download trees easily.