A Multifaceted Study of Spoken Language Using a Large-scale Corpus of Everyday Japanese Conversation

Abbreviated Name: Conversation Corpus
Project Leader: KOISO Hanae (Professor, Spoken Language Division, NINJAL)
Project Period: April 2016 - March 2022
Keywords: Spoken corpus, Conversation analysis
Related Site: Project Site

Summary

Background and Purpose

Since everyday conversation is one of the foundations of social life, it is important to describe the characteristics of spoken language and clarify the mechanisms of conversational interaction.

To capture the diversity of everyday conversation, it is necessary to record the many kinds of conversations that occur in daily life. Although several corpora of Japanese conversation have been developed, most of them are biased in terms of speakers and situations, and no corpus yet covers the full diversity of ordinary conversation.

Our project will develop a large-scale, balanced corpus of Japanese everyday conversation. Because informants record their own conversations in everyday situations, naturally occurring conversations can be collected. To build an empirical foundation for the corpus design, we conducted a survey of the ordinary conversational behavior of about 250 adults. Drawing on the survey results, we will collect various kinds of everyday conversations so that the corpus is balanced across them.

Language and behavior change with the times. In the future, our conversation corpus will be a valuable record of everyday language and conversational behavior in the early twenty-first century. Recording and preserving the diversity of daily conversation, which mirrors Japanese culture, is an important responsibility of researchers.

Objectives and Methods

In this project, we will build a large-scale corpus of Japanese everyday conversation, the Corpus of Everyday Japanese Conversation (CEJC), and explore the characteristics of conversation in contemporary Japanese through multiple approaches. For this purpose, we have organized four groups: corpus construction, language register, conversational interaction, and diachronic change.

Based on the results of our survey of conversational behavior, the corpus construction group collects about 200 hours of various kinds of conversations in everyday situations in a balanced manner. The recorded speech is transcribed and annotated with morphological information, dependency structure, utterance boundaries, dialogue acts, and so on.
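As a rough illustration of what collecting "in a balanced manner" could look like in practice, the sketch below splits a recording budget across conversation categories in proportion to survey counts. It is only a minimal sketch under stated assumptions: the category names, the counts, and the function `allocate_hours` are hypothetical, and proportional allocation is one possible balancing scheme, not necessarily the one the project uses.

```python
def allocate_hours(survey_counts, total_hours):
    """Split total_hours across categories in proportion to survey_counts.

    Uses integer arithmetic with largest-remainder rounding so that the
    per-category targets are whole hours and sum exactly to total_hours.
    """
    total_count = sum(survey_counts.values())
    # Whole-hour quota and leftover remainder for each category.
    quotas = {}
    remainders = {}
    for category, count in survey_counts.items():
        quotas[category], remainders[category] = divmod(total_hours * count, total_count)
    # Hand out the hours lost to rounding, largest remainder first.
    leftover = total_hours - sum(quotas.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for category in by_remainder[:leftover]:
        quotas[category] += 1
    return quotas


# Hypothetical survey counts (NOT figures from the project's survey).
example_counts = {"chat": 5, "meeting": 3, "lesson": 1}
targets = allocate_hours(example_counts, 200)
print(targets)
```

Integer arithmetic is used here so that the targets always add up to the full budget, which would matter if the allocation were tracked against a fixed total such as the 200 hours mentioned above.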

The other three groups advance the study of Japanese conversation using several spoken corpora, including the CEJC. The language register group compares varieties of spoken language, including written conversations and scenarios, and analyzes their lexical, syntactic, phonetic, and prosodic characteristics. The conversational interaction group annotates dialogue acts in collaboration with the corpus construction group and investigates the roles of syntax in conversational interaction, mainly using the CEJC. The diachronic change group develops a database of speech recorded in the 1950s and compares it with the CEJC to examine how speaking style has changed over the last five decades.

Project Members

National Institute for Japanese Language and Linguistics