Basic Research on Corpus Annotation - Extension, Integration and Machine-aided Approaches

Project Leader
ASAHARA Masayuki (Professor, Center for Corpus Development, NINJAL)


As corpora annotated with morphological information have become a commodity, linguistic research increasingly requires higher-layer annotations. The Center for Corpus Development organized three groups, “Syntax”, “Semantics”, and “Speech”, to explore how to extend annotations, how to integrate multiple annotations, and how to incorporate machine-aided techniques into them.

The Syntax Group collaborates on advancing research in phrase-based dependency structures, predicate-argument structures, and clause boundaries. We also participate in the international joint research project Universal Dependencies to produce word-based dependency treebanks. The Semantics Group develops language resources based on the ‘Word List by Semantic Principles (WLSP)’. We develop a UniDic-WLSP alignment table and annotation data for the ‘Balanced Corpus of Contemporary Written Japanese’ and the ‘Corpus of Historical Japanese’. The Speech Group explores a machine-aided method for annotating voice quality features on the ‘Corpus of Spontaneous Japanese’ and designs an articulatory movement database. We also conduct research to improve the accuracy of speech-text alignment and to develop a speech browsing environment for the alignment data.