Cognitive Machines

How to build machines that learn to use language in human-like ways, and develop tools and models to better understand how children learn to communicate.

The goal of the Cognitive Machines group is to create systems that engage in fluid, situated, meaningful communication with human partners. We seek to understand and model the processes by which words are grounded in the physical world as a result of embodied perception, action, and learning. These models are applied to create situated human-machine interfaces. We also use our computational models as a source of predictions and possible accounts for a number of cognitive phenomena including aspects of children's language acquisition, concept formation, and attention.

Research Projects

Behavior Capture from Thousands of People Online

Jeff Orkin and Deb Roy

The Restaurant Game is a multiplayer simulation that captures the behavior and language of thousands of people playing the roles of wait staff and customers. We are developing machine-learning algorithms that mine game-play logs to acquire generative models of human language, behavior, and social roles. These models will power synthetic conversational characters that interact with humans in training simulations, games, and other virtual worlds.
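
As a rough illustration of what mining game-play logs for a generative model could look like, the sketch below fits a first-order (bigram) model over sequences of player actions. The log format, the action names, and the function names are hypothetical, and this simple counting model is a stand-in, not the algorithms the project actually uses.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

def fit_bigram_model(sessions):
    """Count action-to-action transitions and normalize them to probabilities."""
    counts = defaultdict(Counter)
    for actions in sessions:
        path = [START] + list(actions) + [END]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[prev] = {a: c / total for a, c in nxt_counts.items()}
    return model

def most_likely_next(model, action):
    """Return the most probable action to follow the given one."""
    options = model[action]
    return max(options, key=options.get)

# Hypothetical, hand-written logs (one list of actions per game session).
sessions = [
    ["greet", "seat", "order", "serve", "pay"],
    ["greet", "seat", "order", "serve", "order", "serve", "pay"],
    ["greet", "order", "serve", "pay"],
]
model = fit_bigram_model(sessions)
print(most_likely_next(model, "greet"))
```

A model of this kind can be sampled to generate plausible action sequences; richer models would also need to account for dialogue and for the two roles.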

BlitzScribe: Speech Analysis for the Human Speechome Project

Brandon Roy and Deb Roy

BlitzScribe is a new approach to speech transcription driven by the demands of today's massive multimedia corpora. High-quality annotations are essential for indexing and analyzing many multimedia datasets; in particular, our study of language development for the Human Speechome Project depends on speech transcripts. Unfortunately, automatic speech transcription is inadequate for many natural speech recordings, and traditional approaches to manual transcription are extremely labor intensive and expensive. BlitzScribe uses a semi-automatic approach, combining human and machine effort to dramatically improve transcription speed. Automatic methods identify and segment speech in dense, multitrack audio recordings, allowing us to build streamlined user interfaces maximizing human productivity. The first version of BlitzScribe is already about 4-6 times faster than existing systems. We are exploring user-interface design, machine-learning and pattern-recognition techniques to build a human-machine collaborative system that will make massive transcription tasks feasible and affordable.
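
The description above does not say which automatic methods find speech in the recordings. As a minimal sketch of the general idea, the following uses a plain energy threshold to split audio into candidate speech segments that a human transcriber could then step through. The frame length, the threshold, and all names are assumptions chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def segment_speech(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample spans whose frame energy exceeds threshold."""
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_len            # a loud frame opens a segment
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # a quiet frame closes it
            start = None
    if start is not None:
        segments.append((start, len(energies) * frame_len))
    return segments

# Toy signal: silence, a loud burst, silence, a quieter burst, silence.
audio = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.3, -0.3] * 2 + [0.0] * 4
segments = segment_speech(audio)
print(segments)
```

Real household recordings contain noise and overlapping speakers, so a fixed threshold like this one would only be a starting point.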

Collective Discovery

Frank Moss, Deb Roy, Ian Eslick and Charles Tam

The choices we make about diet, environment, medications, or alternative therapies constitute a massive collection of "everyday experiments." These data are largely unrecorded and underutilized by the traditional research establishment. Collective Discovery aims to leverage the intuition and insight of patient communities to capture and mine information about everyday experiences. Moving the community discourse from anecdotes to data will lead to better decision-making, stronger self-advocacy, identification of novel therapies, and inspiration of better hypotheses in traditional research, accelerating the search for new drugs and treatments. The unique characteristic of our Collective Discovery model is the use of knowledge representation and natural language processing to mediate communal hypothesis generation and to compensate for methodological errors and self-reporting bias. This model is being deployed in a real-world context as part of a partnership with the LAM Treatment Alliance and the greater LAM community.

Concrete Financial Sim

Sheng-Ying (Aithne) Pao and Deb Roy

Concrete Financial Sim aims to anticipate probable outcomes of different decisions across time. Life consistently presents choices that require a rational balance between instant gratification and long-term consequences. Should I buy the sunglasses now or should I save? Should I buy a house, or should I rent a room? What if I do it next year instead of next month? Intertemporal components of choices complicate the decision-making process. The complexity lies not just in the immediate one-to-one tradeoff, but in its long-term implications. Based on one’s past financial behavior and current plans, we are designing a decision environment that visualizes the future values of present choices. The goal is to create a reality-based model that informs decision makers of their probable rewards and penalties over time, and that will serve as a “cognitive prosthesis” for people to externalize their mental model of intertemporal choices.
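
One simple way to make "the future value of a present choice" concrete is compound growth: money not spent today is worth amount × (1 + rate)^years later. The sketch below applies that standard formula to the sunglasses example. The price, the 5% annual rate, and the function names are illustrative assumptions and are not taken from the project.

```python
def future_value(amount, annual_rate, years):
    """Value of `amount` after `years` of annually compounded growth."""
    return amount * (1 + annual_rate) ** years

def cost_of_spending_now(price, annual_rate, horizons):
    """What the purchase price would have grown to at each horizon (in years)."""
    return {y: round(future_value(price, annual_rate, y), 2) for y in horizons}

# Hypothetical choice: buy 150-dollar sunglasses now, or save at 5% per year.
table = cost_of_spending_now(150, 0.05, [1, 5, 10])
for years, value in table.items():
    print(f"after {years:2d} years: {value:.2f}")
```

A decision environment of the kind described would go further, for example by drawing the rate and the horizons from a person's own financial history.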

Data-Driven Architectural Design

Rony Kubat, Kenneth Jackowitz (BOA) and Deb Roy

Dense longitudinal video recording of architectural spaces presents new opportunities for design analysis, exploration, and optimization. As part of the Speechome Video for Retail Analysis project, high-resolution video cameras are being deployed in a retail banking environment. From the months of data that will be collected, a variety of performance metrics will be extracted (for example: queueing time, customer confusion, customer/employee social interaction). Beyond the analysis of current building performance, agent-based models of human behavior, trained on the collected raw data, can be used to evaluate potential changes to the space, or to evaluate unbuilt environments. Finally, these agent-based models can be used as a fitness function to evolve procedurally generated buildings to maximize performance across the extracted metrics.

Grounding Spatial Language for Video Retrieval and Robotic Direction Following

Deb Roy, Stefanie Tellex, Nicholas Roy and Thomas Kollar

Understanding spatial language is a challenging problem that requires the ability to map between language and situations in the real world. We are building a spatial language understanding system that bridges this representational gap by computationally modeling the semantics of spatial prepositions. Our model enables a system to retrieve video clips that match natural language queries such as "Show me people going across the kitchen." We are also applying it to build robots that can follow natural-language directions such as "Go through the door near the elevators." By using corpus-based machine learning techniques, our model is robust to real-world noise and linguistic variation. Exploring the connection between language and the real world in concrete domains enables us to make progress towards computers that understand language in human-like ways.
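
To give a feel for what modeling the semantics of a spatial preposition can involve, the sketch below scores how well a tracked path fits "across" a rectangular region, then uses the score to rank clips for a query. The hand-written feature, the region coordinates, and all names are hypothetical. The system described above learns its models from a corpus, which this toy example does not attempt.

```python
def inside(point, region):
    """True if (x, y) lies within region = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = region
    return xmin <= point[0] <= xmax and ymin <= point[1] <= ymax

def across_score(path, region):
    """Fraction of the region's width or height spanned by the path inside it."""
    xmin, ymin, xmax, ymax = region
    pts = [p for p in path if inside(p, region)]
    if len(pts) < 2:
        return 0.0
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    x_span = (max(xs) - min(xs)) / (xmax - xmin)
    y_span = (max(ys) - min(ys)) / (ymax - ymin)
    return max(x_span, y_span)

# Hypothetical floor-plan region and two tracked paths (coordinates in meters).
kitchen = (0.0, 0.0, 4.0, 3.0)
clips = {
    "walks_through": [(-1, 1), (0, 1), (2, 1.5), (4, 2), (5, 2)],
    "stands_at_sink": [(1, 1), (1.5, 1), (1, 1.2)],
}
ranked = sorted(clips, key=lambda name: across_score(clips[name], kitchen),
                reverse=True)
print(ranked)
```

A learned model would combine many such geometric features and weight them according to how people actually use the word.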

Human Speechome Project

Deb Roy, Philip DeCamp, Brandon Roy, Jethran Guinness, Rony Kubat, Stefanie Tellex and George Shaw

The Human Speechome Project is an effort to observe and computationally model the longitudinal language development of a single child at an unprecedented scale. To achieve this, we are recording, storing, visualizing, and analyzing communication and behavior patterns in over 400,000 hours of home video and speech recordings. The tools that are being developed for mining and learning from thousands of terabytes of multimedia data offer the potential for breaking open new business opportunities for a broad range of industries—from security to Internet commerce.

Internomics

Ed Boyden, Dan Ariely, Deb Roy, Nathan Greenslit, Sheng-Ying (Aithne) Pao, Coco Krumme, Deborah Egloff and James Barabas

How do high-level cognitive functions emerge from primitive neural computations to mediate complex human behavior? We are developing precise, focal ways of investigating phenomena such as trust and risk-taking, in order to understand how they play roles in purchasing, decision-making, social interaction, and other real-world scenarios.

P2P Lending Game

Deb Roy and Sheng-Ying (Aithne) Pao

Peer-to-peer lending, also known as social lending, has been growing in a tight economy. Trust and risk are the most important forces that influence the market. By studying trust in a social context, we aim to model decision-making behavior in different lender/borrower relationships. Furthermore, risky financial behavior may be mitigated if greater trust is built. We have designed a social-themed lending game that tests how social perception changes risk-taking and trusting behavior. The data collected will lead to a better understanding of how we can leverage trust to strengthen the lender/borrower relationship and create more efficient cooperation.

Space-Time Machine

Deb Roy, Philip DeCamp and George Shaw

The Space-Time Machine combines audio-video recordings from multiple cameras and microphones to generate an interactive, 3-D reconstruction of recorded events. Developed for use with the longitudinal recordings collected by the Human Speechome Project, this software enables the user to move freely throughout a virtual model of a home and to play back events at any time or speed. In addition to audio and video, the project explores how different kinds of data may be visualized in a virtual space, including speech transcripts, person tracking data, and retail transactions.

Speechome Recorder for the Study of Child Development Disorders

Sophia Yuditskaya, Kleovoulos Tsourides, Philip DeCamp, Brandon Roy, George Shaw, Matthew Goodwin and Deb Roy

Collection and analysis of dense, longitudinal observational data of child behavior in natural, ecologically valid, non-laboratory settings holds significant benefits for advancing the understanding of autism and other developmental disorders. We have developed the Speechome Recorder, a portable version of the embedded audio/video recording technology originally developed for the Human Speechome Project, to facilitate swift, cost-effective deployment in special-needs clinics and homes. Recording child behavior daily in these settings will enable us to study developmental trajectories of autistic children from infancy through early childhood, as well as atypical dynamics of social interaction as they evolve on a day-to-day basis. The recorder's portability makes large-scale comparative studies of developmental milestones in both neurotypical and autistic children possible. Data-analysis tools developed in this research aim to reveal new insights toward early detection, provide more accurate assessments of context-specific behaviors for individualized treatment, and shed light on the enduring mysteries of autism.

Speechome Video for Retail Analysis

George Shaw, Kleovoulos Tsourides, Sophia Yuditskaya, Philip DeCamp, Kenneth Jackowitz (BOA) and Deb Roy

We are adapting the video data collection and analysis technology derived from the Human Speechome Project for the retail sector through real-world deployments. We will develop strategies and tools for the analysis of dense, longitudinal video data to study the behavior of, and interaction between, customers and employees in commercial retail settings. One key question in our study is how the architecture of a retail space affects customer activity and satisfaction, and what parameters in the design of a space are operant in this causal relationship.

Study of Child Language Acquisition in the Human Speechome Project

Deb Roy and Soroush Vosoughi

What is the relationship between the input that children hear and the words that children acquire? We investigate the role of variables such as input word frequency and prosody in one child's lexical acquisition using the Human Speechome Project corpus. We analyze data from ages 9 to 24 months, a period that includes the child's first productive use of language at about 11 months and ends with the child’s active use of a vocabulary of more than 500 words.
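
One simple way to relate input frequency to acquisition is to correlate each word's log frequency in caregiver speech with the age at which the child first produced it. The sketch below computes a Pearson correlation for that comparison. The words, counts, and ages are made-up toy values, not Human Speechome Project data, and the study's actual analysis may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy values only: word -> (count in caregiver speech, month of first production).
words = {
    "ball": (1000, 12),
    "dog": (100, 14),
    "moon": (10, 19),
    "violin": (1, 20),
}
log_freq = [math.log10(count) for count, _ in words.values()]
age_of_acquisition = [month for _, month in words.values()]

r = pearson(log_freq, age_of_acquisition)
print(f"correlation between log frequency and age of acquisition: {r:.2f}")
```

A negative correlation here would mean that words heard more often tend to be produced earlier. Other variables, such as prosody, would enter as further predictors.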

TrackMarks: Semi-Automatic Video Annotation

Philip DeCamp and Deb Roy

This project attempts to address the practical problems involved with extracting behavioral information from large, multi-camera video corpora. Ultra-dense video recordings offer new possibilities for in-depth, quantitative analysis of human behavior, with applications ranging from child development research to determining how people are affected by different retail environments. Despite the growing sophistication of computer vision systems being developed for person tracking, gesture recognition, and object identification, these technologies remain error prone. Accurate video annotation still requires substantial human input. In order to analyze the hundreds of thousands of hours of video collected for the Human Speechome Project, we have developed a new software system for semi-automatically annotating longitudinal, multi-track video data. This system combines computer vision algorithms with a novel interface design to enable human annotators to generate and edit video annotations with speed and accuracy.