Research resources related to contemporary standard Japanese
- Balanced Corpus of Contemporary Written Japanese
‘The Balanced Corpus of Contemporary Written Japanese’ (BCCWJ) is a corpus created to capture the diversity of contemporary written Japanese. It comprises 104.3 million words covering genres such as general books and magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents. Randomly drawn samples were annotated with morphological information and document structure. BCCWJ is available to the public online as well as on a DVD set.
Shonagon is a web concordancer with which even beginners in corpus linguistics can run string searches on the BCCWJ.
- NINJAL-LWP for BCCWJ (NLB)
NINJAL-LWP for BCCWJ (NLB) is an online search tool for the BCCWJ which uses the lexical profiling technique. It has been jointly developed by the National Institute for Japanese Language and Linguistics (NINJAL) and Lago Gengo Kenkyusho.
- Corpus of Spontaneous Japanese
The “Corpus of Spontaneous Japanese” (CSJ) is a database containing a large collection of Japanese spoken-language data and accompanying information for use in linguistic research. Jointly developed by NINJAL, NICT, and the Tokyo Institute of Technology, the CSJ is world-class in both the quantity (7.5 million words) and the quality of its data.
The corpus has been used for a wide variety of research purposes such as spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.
- NINJAL Web Japanese Corpus
A 20-billion-word Web text corpus built by crawling 100 million pages every three months. The corpus was automatically annotated with morphological information and dependency structures.
Chunagon is a web concordancer that enables a three-way search of the corpora developed by NINJAL: by short unit word, by long unit word, and by character string. By combining conditions on morphological information, users can run advanced searches of the corpus.
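To illustrate what a search that combines morphological conditions amounts to, here is a minimal sketch in Python. It is not Chunagon's implementation or API: the `Token` class, the `search` function, the part-of-speech labels, and the hand-annotated sample are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str  # form as it appears in the text
    lemma: str    # dictionary form
    pos: str      # part-of-speech label (simplified, hypothetical tag set)


# Hypothetical hand-annotated sample; not taken from any NINJAL corpus.
SAMPLE = [
    Token("本", "本", "noun"),
    Token("を", "を", "particle"),
    Token("読ん", "読む", "verb"),
    Token("だ", "た", "auxiliary"),
    Token("。", "。", "symbol"),
    Token("新聞", "新聞", "noun"),
    Token("を", "を", "particle"),
    Token("読む", "読む", "verb"),
    Token("。", "。", "symbol"),
]


def search(tokens, context=2, **conditions):
    """Return (left, hit, right) triples for tokens matching every condition.

    Each keyword argument names a Token field and the value it must equal,
    e.g. lemma="読む", pos="verb". Context is given as joined surface forms.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if all(getattr(tok, key) == value for key, value in conditions.items()):
            left = "".join(t.surface for t in tokens[max(0, i - context):i])
            right = "".join(t.surface for t in tokens[i + 1:i + 1 + context])
            hits.append((left, tok.surface, right))
    return hits


# Find every inflected form of the verb 読む, with two tokens of context.
results = search(SAMPLE, lemma="読む", pos="verb")
for left, hit, right in results:
    print(f"{left} [{hit}] {right}")
```

Searching by lemma rather than by surface string is what lets a single query retrieve both 読ん and 読む, which a plain string search would treat as unrelated.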
- Databases of Japanese Examples Extracted from Web Corpora (Japanese compound verbs, Sahen verbs, adjectives)
These databases provide examples of Japanese compound verbs, adjectives, and nominal (Sahen) verbs. The examples are extracted from special-purpose Web corpora, one constructed for each entry word, so that every entry has an adequate number of examples and bias in the collected examples is avoided.
- Bunrui Goihyo (Word List by Semantic Principles, Revised and Enlarged Edition)
Bunrui Goihyo is a Japanese thesaurus. This database version was built from the contents of the book edition of Bunrui Goihyo (revised and enlarged edition). It is provided in CSV format so that it can be loaded into data organization software. The total number of records is 101,070.
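Because the database is distributed as CSV, it can also be read with a general-purpose scripting language. The following is a minimal sketch using Python's standard `csv` module. The column names (`record_id`, `category_number`, `headword`) and the sample rows are placeholders invented for this example, not the actual layout of the Bunrui Goihyo file; the real header and character encoding should be checked against the distributed data.

```python
import csv
import io

# Placeholder data with hypothetical column names; NOT the real file layout.
SAMPLE_CSV = """record_id,category_number,headword
1,1.0000,example_a
2,1.0000,example_b
3,2.0000,example_c
"""


def load_records(handle):
    """Read all rows from an open CSV file object into a list of dicts."""
    return list(csv.DictReader(handle))


def group_by(records, key):
    """Group records by the value found in the given column."""
    groups = {}
    for record in records:
        groups.setdefault(record[key], []).append(record)
    return groups


# For a real file, replace io.StringIO(...) with something like
# open(path, newline="", encoding=...), using the encoding of the actual data.
records = load_records(io.StringIO(SAMPLE_CSV))
by_category = group_by(records, "category_number")

print(len(records), "records in", len(by_category), "categories")
```

Counting the rows loaded from the real file and comparing the result with the stated total of 101,070 records is a quick way to confirm that the file was read completely.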
- Linguistic Survey of Two Million Characters in Contemporary Magazines (1994)
These frequency lists are part of the outcomes of ‘A Survey of Vocabulary in Contemporary Magazines (1994)’ carried out at the National Institute for Japanese Language from 2001 to 2005.
- “Honorifics in Japanese Schools”: Results from Questionnaires
Data from a questionnaire survey on the use and awareness of honorifics among junior high school and high school students, carried out from 1989 to 1990 by the National Institute for Japanese Language and Linguistics. In all, 2,456 junior high school students from Tokyo and 339 from Yamagata, and 2,222 high school students from Tokyo and 1,004 from Osaka, took part in the survey.
- Web ChaMame
A tool for performing morphological analysis using various UniDic dictionaries. It allows researchers to carry out the series of steps needed for morphological analysis online through a user-friendly interface.
- Compound Verb Lexicon
Comprising over 2,700 verb-verb compound verbs of contemporary Japanese, this online dictionary provides useful information on their linguistic features for both researchers and learners of Japanese. In addition to Japanese representations, it offers English, Chinese, and Korean translations of the semantic definitions and example sentences. The original Excel data are downloadable upon agreement.
- The World Atlas of Transitivity Pairs (WATP)
This web application presents typological information, in the form of a map and charts, on the formal relationship between lexical pairs of transitive and intransitive verbs in selected world languages, including Japanese.
- The NINJAL Parsed Corpus of Modern Japanese (NPCMJ)
NPCMJ is a syntactically and semantically annotated corpus of both written and spoken Modern Japanese. There are interfaces available for anyone to search, browse, and download trees easily.
A large-scale Japanese lexicon with morphological information, including statistical models for the morphological analyzer MeCab.