Data-driven Parsing Evaluation for Child-Parent Interactions
Abstract
We present a syntactic dependency treebank for naturalistic child and child-directed spoken English. Our annotations largely follow the guidelines of the Universal Dependencies project (UD; Zeman et al., 2022), with detailed extensions for lexical and syntactic structures unique to spontaneous spoken language, as opposed to written texts or prepared speech. Compared to existing UD-style spoken treebanks, and to other dependency corpora of child-parent interactions in particular, our data set is much larger (44,744 utterances; 233,907 words) and contains data from 10 children covering a wide age range (18 to 66 months). We conduct thorough dependency parser evaluations using both graph-based and transition-based parsers, trained on three different types of out-of-domain written texts: news, tweets, and learner data. Out-of-domain parsers demonstrate reasonable performance for both child and parent data. In addition, parser performance for child data improves along children’s developmental paths, especially between 18 and 48 months, and gradually approaches the performance for parent data. These results are further validated with in-domain training.