Corpus of Spontaneous Japanese (CSJ)
The Corpus of Spontaneous Japanese (CSJ) is a database of spoken Japanese stored on 18 DVD-ROM discs. It is one of the largest spoken language databases in the world. The CSJ is the result of the ‘Spontaneous Speech: Corpus and Processing Technology’ project, conducted jointly by the Communications Research Laboratory, the Tokyo Institute of Technology, and the Institute. It contains 658 hours of speech, amounting to approximately 7.5 million words. The speech materials were provided by more than 1,400 speakers ranging in age from their twenties to their eighties. About 95% of the CSJ is devoted to spontaneous monologues, such as academic presentations and public speaking. The remaining 5% consists of spontaneous dialogues and read-aloud speech. The CSJ provides a rich set of annotations, including transcriptions, part-of-speech tags, and phonetic segmentation and intonation labels, all of which are supplied both as text files and in XML format. The CSJ should serve as a useful resource for research in fields such as speech engineering, linguistics, phonetics, and lexicography. It has been publicly available since the spring of 2004. For more information, please visit the English web page of the Institute at:
/corpus_center/csj/misc/preliminary/index_e.html
Images of the text-audio browsing tool of the CSJ

