A Morpheme-Aware Child-Inspired Language Model

Details

Citation

Bolucu N & Can Buglalilar B (2025) A Morpheme-Aware Child-Inspired Language Model. In: Proceedings of the 30th Conference on Empirical Methods in Natural Language Processing: Volume 3: The BabyLM Challenge. Association for Computational Linguistics.

Abstract
Most tokenization methods in language models rely on subword units that lack explicit linguistic correspondence. In this work, we investigate the impact of using morpheme-based tokens in a small language model, comparing them to the widely used frequency-based method, byte-pair encoding (BPE). We apply the morpheme-based tokenization method to both the 10-million- and 100-million-word datasets from the BabyLM Challenge. Our results show that using a morphological tokenizer improves EWoK (basic world knowledge) performance by around 20% and entity tracking by around 40%, highlighting the impact of morphological information in developing smaller language models. We also apply curriculum learning, in which morphological information is gradually introduced during training, mirroring the vocabulary-building stage in infants that precedes morphological processing. The results are consistent with previous research: curriculum learning yields slight improvements on some tasks but degrades performance on others.
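
As a minimal illustration of the contrast the abstract draws, the toy Python sketch below splits words with a greedy affix-stripping pass against a small hand-written morpheme list; BPE, by comparison, merges character sequences purely by corpus frequency, so its pieces need not align with morpheme boundaries. The affix lists and splitting heuristic here are hypothetical and are not the tokenizer used in the paper.

# Toy sketch: morpheme-based splitting via greedy affix stripping.
# The affix lists are hypothetical examples, not the lexicon or the
# tokenizer used in the paper.

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "ment", "er", "s"]

def morpheme_split(word):
    """Split a word into prefix(es), stem, suffix(es) using a toy affix list."""
    tokens = []
    # Peel known prefixes off the front of the word.
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                tokens.append(p)
                word = word[len(p):]
                stripped = True
                break
    # Peel known suffixes off the back, remembering the order they came off.
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.append(s)
                word = word[:-len(s)]
                stripped = True
                break
    tokens.append(word)  # whatever remains is treated as the stem
    tokens.extend(reversed(suffixes))
    return tokens

for w in ["unhappiness", "replaying", "disagreements"]:
    print(w, "->", morpheme_split(w))
# unhappiness -> ['un', 'happi', 'ness']
# replaying -> ['re', 'play', 'ing']
# disagreements -> ['dis', 'agree', 'ment', 's']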

Status: Accepted
Funders: University of Stirling
Publisher: Association for Computational Linguistics

People (1)

Dr Burcu Can Buglalilar

Lecturer in Computing Science, Computing Science
