A Comprehensive Study of Spoken Language Using a Multi-Generational Corpus of Japanese Conversation

Project Leader
KOISO Hanae (Professor, NINJAL)
Project Period
April 2022 -


Background and Purpose

The spoken language we use in everyday life changes with age, not only in infancy but also through childhood, adolescence, prime age, adulthood, and into old age. To capture the characteristics of spoken language across generations, it is essential to use a spoken corpus that contains the daily conversations of a wide range of speakers.

The Corpus of Everyday Japanese Conversation, developed by the National Institute for Japanese Language and Linguistics and released in March 2022, contains 200 hours of daily conversation by a wide variety of speakers and makes it possible to examine these characteristics empirically.

For example, using the corpus, we examined by age group the proportion of polite and non-polite forms used when talking to friends and acquaintances (see the figure "Usage rate of polite and non-polite forms by age"). The figure shows that teenagers rarely use polite forms, and that the rate of polite forms increases with age and experience in society. Among speakers over 60 years of age, however, the rate of polite forms decreases. This could be because the elderly tend to talk to people of the same or a younger generation and, as they tend to make fewer new acquaintances, most of their conversations are with close acquaintances whom they have known for a long time, which entails familiarity.

Thus, not only does the use of language change as children grow up, but it also changes dramatically among adults, depending on the social environment in which they find themselves. In this project, we will use corpora to empirically clarify these multi-generational changes in language.
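As a rough illustration of the age-group comparison described above, the following minimal sketch computes the share of polite-form utterances per age group. It is not the project's actual analysis procedure: the record layout, the labels "polite" and "plain", the age-group names, and the function name polite_rate_by_age are all assumptions made for this example, and the sample records are invented rather than taken from the corpus.

```python
from collections import Counter, defaultdict

# Hypothetical utterance records: (speaker age group, style of the utterance).
# These values are invented for illustration and are NOT taken from the CEJC.
utterances = [
    ("10s", "plain"), ("10s", "plain"), ("10s", "plain"), ("10s", "polite"),
    ("30s", "polite"), ("30s", "plain"), ("30s", "polite"), ("30s", "plain"),
    ("60s+", "plain"), ("60s+", "plain"), ("60s+", "polite"),
]


def polite_rate_by_age(records):
    """Return {age_group: share of utterances labelled 'polite'}."""
    counts = defaultdict(Counter)
    for age_group, style in records:
        counts[age_group][style] += 1
    # Divide the polite count by the total number of utterances in each group.
    return {
        age: c["polite"] / sum(c.values())
        for age, c in counts.items()
    }


rates = polite_rate_by_age(utterances)
for age, rate in rates.items():
    print(f"{age}: {rate:.0%} polite")
```

In a real analysis, the style label of each utterance would come from the corpus annotation, and the comparison would be restricted to conversations with friends and acquaintances, as in the study described above.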

Figure: Usage rate of polite and non-polite forms by age

Objectives and Methods

The Corpus of Everyday Japanese Conversation (CEJC) contains conversations by more than 1,500 diverse speakers, but few of them involve children under 10 years old because only adult informants were asked to record conversations. In this project, we will construct a new corpus that includes child-centered conversations.

The corpora of infants' and children's speech constructed so far have focused on family conversations, such as those between mother and child. However, as children grow up, they encounter a wider variety of situations and conversation partners, such as relatives, friends, and kindergarten teachers. Therefore, we will construct a corpus with videos that includes conversations not only at home but also in other situations and with other people. We also plan to record conversations at kindergartens and elementary schools. By building a child-centered conversation corpus and combining it with the adult-centered CEJC, we will analyze language development and change across multiple generations, from infants and children to the elderly.

The recorded speech will be manually transcribed and automatically divided into words. Morphological information, such as parts of speech, readings, and lemmas, will then be added automatically and corrected manually. The corpus will be made available via the online search application "Chunagon." The application enables advanced searches that combine part-of-speech and other information, and allows the user to listen to the target speech. Raw data, including video and audio, will also be released for research purposes.

Figure: Search screen of the online search system "Chunagon"
Figure: Recording of conversations
