Research resources related to contemporary standard Japanese
- Balanced Corpus of Contemporary Written Japanese
The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is a corpus created to capture the diversity of contemporary written Japanese. It comprises 104.3 million words, covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents, among others. Randomly selected samples are annotated with morphological information and document structure. BCCWJ is available to the public online and as a DVD set.
Shonagon is a web concordancer that lets even beginners in corpus linguistics run string searches on the BCCWJ.
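As a rough illustration of what a string-search concordancer returns, the sketch below lists every occurrence of a search string together with a few characters of context on each side (a keyword-in-context, or KWIC, view). It is not Shonagon's implementation: the function name `kwic`, the context width, and the sample sentences are assumptions made purely for illustration.

```python
from typing import List, Tuple


def kwic(texts: List[str], query: str, width: int = 5) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for every hit of `query`."""
    if not query:
        return []
    hits = []
    for text in texts:
        start = text.find(query)
        while start != -1:
            end = start + len(query)
            # Keep at most `width` characters on each side of the match.
            left = text[max(0, start - width):start]
            right = text[end:end + width]
            hits.append((left, query, right))
            # Continue searching after the current hit's first character.
            start = text.find(query, start + 1)
    return hits


# Hypothetical sample sentences, not taken from BCCWJ.
samples = ["今日は図書館で本を読む。", "本を読むのが好きです。"]
for left, match, right in kwic(samples, "読む"):
    print(f"{left}[{match}]{right}")
```

Because the search works on raw character strings, it needs no word segmentation, which is what makes this kind of search approachable for beginners.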
- NINJAL-LWP for BCCWJ (NLB)
NINJAL-LWP for BCCWJ (NLB) is an online search tool for the BCCWJ which uses the lexical profiling technique. It has been jointly developed by the National Institute for Japanese Language and Linguistics (NINJAL) and Lago Gengo Kenkyusho.
- Corpus of Spontaneous Japanese
The “Corpus of Spontaneous Japanese” (CSJ) is a database containing a large collection of Japanese spoken language data, together with information for use in linguistic research. Jointly developed by NINJAL, NICT, and the Tokyo Institute of Technology, the CSJ is world-class in both the quantity and the quality of its data (7.5 million words).
The corpus has been used for a wide variety of research purposes such as spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.
- NINJAL Web Japanese Corpus
A 20-billion-word Web text corpus built by crawling 100 million pages every three months. The corpus is automatically annotated with morphological information and dependency structures.
- Nagoya University Conversation Corpus
The Nagoya University Conversation Corpus (NUCC) consists of transcriptions of 129 uncontrolled, natural conversations among friends, family members, or colleagues.
Chunagon is a web concordancer that offers three ways to search the corpora developed by NINJAL: by short unit word, by long unit word, and by character string. Combining morphological information in a query makes advanced searches of a corpus possible.
- Databases of Japanese Examples Extracted from Web Corpora (Japanese compound verbs, Sahen verbs, adjectives)
These databases provide examples of Japanese compound verbs, adjectives, and nominal (Sahen) verbs. The examples are extracted from special-purpose Web corpora, one constructed for each entry word, so that every entry has an adequate number of examples and the collected examples are not biased.
- Bunrui Goihyo (Word List by Semantic Principles, Revised and Enlarged Edition)
Bunrui Goihyo is a Japanese thesaurus. This database version incorporates the contents of the Bunrui Goihyo book edition (revised and enlarged edition). It is provided in CSV format so that it can be loaded into data management software. The total number of records is 101,070.
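Since the database is distributed as CSV, it can also be read with a short script. The following is a minimal sketch that counts records and tallies the values in one column. The column layout, the helper name `count_records`, and the three sample rows are all hypothetical; the actual file's columns, header row, and character encoding should be checked against the distributed data.

```python
import csv
import io
from collections import Counter


def count_records(csv_file, key_column: int = 0):
    """Count non-empty rows and tally the values found in one column."""
    reader = csv.reader(csv_file)
    total = 0
    tally = Counter()
    for row in reader:
        if not row:  # skip blank lines
            continue
        total += 1
        if key_column < len(row):
            tally[row[key_column]] += 1
    return total, tally


# Hypothetical three-row stand-in; NOT the real Bunrui Goihyo column layout.
sample = io.StringIO("体,人間,私\n体,人間,君\n用,動き,走る\n")
total, tally = count_records(sample)
print(total, dict(tally))
```

To run this on the real file, replace the `io.StringIO` object with `open(path, newline="", encoding=...)`, where both the path and the encoding are assumptions to be filled in by the user.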
- Linguistic Survey of Two Million Characters in Contemporary Magazines (1994)
These frequency lists are part of the outcomes of ‘A Survey of Vocabulary in Contemporary Magazines (1994)’ carried out at the National Institute for Japanese Language from 2001 to 2005.
- “Honorifics in Japanese Schools”: Results from Questionnaires
Data from a questionnaire survey on the use and awareness of honorifics among junior high school and high school students, carried out from 1989 to 1990 by the National Institute for Japanese Language and Linguistics. In all, 2,456 junior high school students from Tokyo and 339 from Yamagata, together with 2,222 high school students from Tokyo and 1,004 from Osaka, took part in the survey.
- Web ChaMame
A tool for performing morphological analysis using various UniDic dictionaries. It lets researchers carry out the series of steps needed for morphological analysis online through a user-friendly interface.
- Compound Verb Lexicon
Comprising over 2,700 verb-verb compound verbs of contemporary Japanese, this online dictionary provides useful information on their linguistic features for both researchers and learners of Japanese. In addition to the Japanese entries, it offers English, Chinese, and Korean translations of the semantic definitions and example sentences. The original Excel data is downloadable upon agreement.
- The World Atlas of Transitivity Pairs (WATP)
This web application presents, in the form of a map and charts, typological information on the formal relationship between lexical pairs of transitive and intransitive verbs in selected languages of the world, including Japanese.
- The NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
NPCMJ is a syntactically and semantically annotated corpus of both written and spoken Modern Japanese. There are interfaces available for anyone to search, browse, and download trees easily.
A large-scale Japanese lexicon with morphological information, including statistical models for the morphological analyzer MeCab.
- A Cineradiograph of Japanese Pronunciation
Cineradiographic footage of articulatory movements in Japanese, filmed in 1965 and 1967.
- Hideo Teramura: Collected Papers on Adnominal Modification
This site presents English translations of several of the research contributions of the late Professor Hideo Teramura (1928-1990), who contributed greatly to the development of the fundamentals of Japanese language studies and Japanese language education through the 1970s and 1980s.
- BTSJ-Japanese Natural Conversation Corpus (Studies on the language use of Japanese language learners)
This is one of the world's largest corpora of naturally occurring conversations in Japanese. It currently consists of 377 conversations, with transcripts and audio, by native speakers and learners of Japanese. The conversations were collected while controlling for social factors such as the age and gender of the speakers and their interlocutors and the situation, and they were transcribed using BTSJ, a transcription system well suited to pragmatic and interactional analysis. The final version of the “BTSJ Japanese Natural Conversation Corpus,” which will include conversations by more than 1,000 speakers, is scheduled for release in 2021.
- NPCMJ Child Language Development Timeline
The NPCMJ Child Language Development Timeline (NPCMJ-CLDT) provides an interactive, timeline-based interface to the Soyogo Parsed Corpus (a parsed corpus of child language Japanese). The interface opens the morpho-syntactic analysis of child language to search and exploration, most notably through an age range filter. Because it makes it easy to zoom in on very specific information about individual acquisition behaviour, it can be used to discover patterns in the morpho-syntactic development of children acquiring Japanese.