Sound is a powerful force, capable of enriching our lives and disrupting them in equal measure. Unwanted noise intrudes on our work, our leisure, and even our health, forcing a constant battle for auditory focus. While the remarkable cocktail party effect allows us to selectively attend to desired sounds in noisy environments, this ability is not universal, and even those with acute auditory attention can struggle in particularly challenging situations.
This research envisions a future where we transcend these limitations, augmenting our auditory perception with intelligent, AI-driven systems that endow us with unprecedented sound control. Imagine a world where we can effortlessly quiet distracting noises, focus on specific conversations in crowded spaces, or even expand our hearing capabilities to perceive subtle sonic details. Such a world promises clearer communication in bustling restaurants, enhanced focus in busy work environments, and richer enjoyment of musical performances.
The emergence of audio computers—powerful devices controlled primarily through natural language interaction and capable of sophisticated audio manipulation—hints at this future. These systems, currently in their early stages of development, demonstrate the potential for intuitive and personalized sound control without relying on visual interfaces.
However, this research seeks to investigate a critical question: What unique role can vision play in enhancing these predominantly audio-centric systems? While our ears are remarkably adept at localizing sounds, vision provides valuable contextual information that can refine our auditory perception and guide our attention.
This project explores the integration of visual cues into audio user interfaces, investigating how head and eye tracking, combined with speech commands, can enable more intuitive sound selection, enhance source separation, and facilitate dynamic audio processing. We will explore these questions through a proof-of-concept system that manipulates sound in pre-recorded and live video, using off-the-shelf microphones, cameras, and motion sensors to simulate the capabilities of future wearable devices.
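To make the gaze-driven selection idea concrete, the sketch below shows one plausible scheme: given separated per-source audio signals and an azimuth estimate for each source, the source whose direction best matches the user's gaze is boosted while the others are attenuated. This is a minimal illustration, not the project's actual pipeline; all function names, parameters, and the assumption of already-separated sources with known azimuths are hypothetical.

```python
import math

def select_source(gaze_azimuth_deg, source_azimuths_deg):
    """Return the index of the source whose azimuth is closest
    to the user's gaze direction (all angles in degrees)."""
    def angular_diff(a, b):
        # Smallest absolute difference between two angles on a circle.
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return min(range(len(source_azimuths_deg)),
               key=lambda i: angular_diff(gaze_azimuth_deg,
                                          source_azimuths_deg[i]))

def mix_with_focus(frames, source_azimuths_deg, gaze_azimuth_deg,
                   focus_gain=1.0, other_gain=0.2):
    """Mix per-source audio frames (lists of samples), applying
    focus_gain to the gazed-at source and other_gain to the rest."""
    target = select_source(gaze_azimuth_deg, source_azimuths_deg)
    mixed = [0.0] * len(frames[0])
    for i, frame in enumerate(frames):
        gain = focus_gain if i == target else other_gain
        for j, sample in enumerate(frame):
            mixed[j] += gain * sample
    return mixed
```

In a real system the gains would be smoothed over time to avoid audible clicks when gaze shifts between sources, and gaze would likely be fused with head orientation and speech commands rather than used alone.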