Thu-2-4-4 Sound-Image Grounding Based Focusing Mechanism for Efficient Automatic Spoken Language Acquisition

Mingxin Zhang (Tokyo Institute of Technology), Tomohiro Tanaka (Tokyo Institute of Technology), Wenxin Hou (Tokyo Institute of Technology), Shengzhou Gao (Tokyo Institute of Technology) and Takahiro Shinozaki (Tokyo Institute of Technology)
Abstract: The process of spoken language acquisition based on sound-image grounding has attracted significant interest from linguists and human scientists for decades. To understand this process and enable new possibilities for intelligent robots, we designed a spoken-language acquisition task in which a software robot learns to fulfill its desire by correctly identifying and uttering the name of its preferred object in given images, without relying on any labeled dataset. We propose an unsupervised vision-based focusing strategy and a pre-training approach based on sound-image grounding to improve the efficiency of reinforcement learning. These ideas are motivated by the observation that human babies first observe the world and then try actions to realize their desires. Our experiments show that the software robot successfully acquires spoken language from spoken indications paired with images and dialogues. Moreover, its reinforcement learning converges significantly faster than several baseline approaches.
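
As a rough illustration of how a grounding-based focusing step of this kind might look, the sketch below computes similarity-based focus weights between an utterance embedding and candidate image-region embeddings. This is a minimal sketch under stated assumptions, not the paper's method: the pre-trained audio and vision encoders, the shared embedding dimension, and the softmax temperature are all hypothetical stand-ins introduced only for illustration.

```python
# Minimal, illustrative sketch of a sound-image grounding "focusing" step.
# Assumptions (not from the paper): audio and image-region embeddings come
# from hypothetical pre-trained encoders, share a common dimension, and the
# focus distribution is a temperature-scaled softmax over cosine similarities.
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between vector a and each row of matrix b."""
    a = a / (np.linalg.norm(a) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return b @ a


def focus_on_regions(audio_embedding, region_embeddings, temperature=0.1):
    """Return focus weights (a probability distribution) over image regions."""
    scores = cosine_similarity(audio_embedding, region_embeddings)
    logits = scores / temperature
    logits -= logits.max()                 # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64                                  # shared embedding dimension (assumed)
    audio = rng.normal(size=d)              # stand-in for an utterance embedding
    regions = rng.normal(size=(5, d))       # stand-ins for 5 image-region embeddings
    regions[2] = audio + 0.1 * rng.normal(size=d)   # make region 2 "match" the sound
    w = focus_on_regions(audio, regions)
    print("focus weights:", np.round(w, 3))
    print("focused region:", int(np.argmax(w)))
```

In a reinforcement-learning setting, such focus weights could narrow the agent's candidate objects before it attempts an utterance, which is one plausible way a grounding-based focusing mechanism can reduce the exploration burden described in the abstract.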