If we make a wearable computer that sees as we see and hears as we hear, might it provide new insight into our daily lives? Going further, suppose we have the computer monitor the motion and manipulations of our hands and listen to our speech. Perhaps with enough data it can infer how we interact with the world. Might we create a symbiotic arrangement, in which an intelligent computer assistant lives alongside us, providing useful functionality in exchange for occasional tips on the meaning of the patterns and correlations it observes? At Georgia Tech, we have been capturing “first person” views of everyday human interactions with other people and with objects in the world, using wearable computers equipped with cameras, microphones, and gesture sensors. Our goal is to automatically cluster large databases of time-varying signals into groups of actions (e.g., reaching into a pocket, pressing a button, opening a door, turning a key in a lock, shifting gears, steering, braking) and then to reveal higher-level patterns by discovering grammars of these lower-level actions over time (e.g., driving to work at 9am every day). By asking the user of the wearable computer to name these grammars (e.g., morning coffee, buying groceries, driving home), the wearable computer can begin to communicate with its user in more human terms and provide useful information and suggestions (“if you are about to drive home, do you need to buy groceries for your morning coffee?”). By watching the wearable computer’s user, we also gain a new perspective on difficult computer vision and robotics problems: objects can be identified by how they are used (turning pages indicates a book) rather than by how they appear (the cover of Foley and van Dam versus the cover of Wired magazine). By creating increasingly observant and useful intelligent assistants, we encourage wearable computer use and a cooperative framework for creating intelligence grounded in everyday interactions.
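A minimal sketch, in Python, of the kind of unsupervised grouping described above: windowed features from a hypothetical gesture/accelerometer stream are clustered, and the resulting clusters are candidate low-level actions the user could later name. The sensor feed, window length, feature choice, and cluster count are illustrative assumptions, not details of the Georgia Tech system.

```python
# Illustrative sketch only: cluster windowed wearable-sensor features into
# candidate "action" groups a user could later name. The data source, window
# length, and number of clusters are assumptions, not the actual system.
import numpy as np
from sklearn.cluster import KMeans

def window_features(signal, window=64, step=32):
    """Split a (time, channels) signal into overlapping windows and
    summarize each window with simple statistics (mean and std per channel)."""
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        feats.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.array(feats)

# Hypothetical 3-axis gesture/accelerometer recording: (samples, 3)
recording = np.random.randn(10_000, 3)

features = window_features(recording)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)

# Each cluster is a candidate low-level action; the wearable's user would be
# asked to label them ("reaching into a pocket", "turning a key", ...), and
# sequences of labels could then be mined for higher-level "grammars".
print(np.bincount(clusters))
```

In a real pipeline the grammar-discovery step would operate on the sequence of labeled clusters over time, looking for recurring patterns such as a morning commute.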
Thad Starner is a wearable computing pioneer. He is a Professor in the School of Interactive Computing at the Georgia Institute of Technology and a Technical Lead on Google’s Glass. Starner coined the term “augmented reality” in 1990 to describe the types of interfaces he envisioned at the time and has been wearing a head-up-display-based computer as part of his daily life since 1993, perhaps the longest such experience known. Thad is a founder of the annual ACM/IEEE International Symposium on Wearable Computers, now in its 18th year, and has authored over 150 peer-reviewed scientific publications. He is an inventor on over 80 United States patents awarded or in process.
Over the past decade, both the bottom-up and top-down aspects of attention and eye movements have been modeled computationally. In recent years, this work has yielded several successful applications of computational processing of video inputs in a goal-dependent manner. In one system, which I will describe, neuromorphic algorithms of bottom-up visual attention are employed to predict, in a task-independent manner, which elements in a video scene are likely to attract the gaze of a human observer most strongly.
These bottom-up predictions have more recently been combined with top-down “gist” or context processing, which allowed the system to learn from examples (recorded eye movements and videos of humans engaged in various 3D video games, including flight combat, driving, first-person, and running a hot-dog stand that serves hungry customers) to associate particular scenes with particular locations of interest, given the task (e.g., when the task is to drive and the scene depicts a road turning left, the system learns to look at that left turn). Pushing further into real-time, joint online analysis of video and eye movements with our neuromorphic models, we have recently been able to predict future gaze locations, and the intent behind future actions, while a player is engaged in a task. Using a similar approach, in which our computational models provide a normative gold standard against which one particular individual’s gaze behavior can be compared, we have demonstrated a system that can predict, from eye movements recorded over 15 minutes of television viewing, whether that person has ADHD or other neurological disorders.
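As a rough illustration of the kind of combination described above (a toy stand-in, not the actual neuromorphic model), the sketch below blends a simple bottom-up saliency map, here just center-surround intensity contrast, with a hypothetical task-dependent prior map to produce a predicted gaze location; the function names, blur scales, and blending weight are all assumptions.

```python
# Toy illustration of combining bottom-up saliency with a top-down task
# prior to predict gaze. "Saliency" here is only center-surround intensity
# contrast, and the task prior is an assumed spatial map standing in for
# one learned from recorded eye movements.
import numpy as np
from scipy.ndimage import gaussian_filter

def bottom_up_saliency(gray_frame):
    """Center-surround contrast: difference between a fine and a coarse blur."""
    fine = gaussian_filter(gray_frame, sigma=2)
    coarse = gaussian_filter(gray_frame, sigma=16)
    sal = np.abs(fine - coarse)
    return sal / (sal.max() + 1e-8)

def predict_gaze(gray_frame, task_prior, top_down_weight=0.5):
    """Blend bottom-up saliency with a task-dependent prior map and
    return the (row, col) of the most likely gaze location."""
    combined = (1 - top_down_weight) * bottom_up_saliency(gray_frame) \
               + top_down_weight * task_prior
    return np.unravel_index(np.argmax(combined), combined.shape)

# Hypothetical 480x640 grayscale frame and a "look toward the left turn"
# prior that weights the left half of the image more heavily.
frame = np.random.rand(480, 640)
prior = np.tile(np.linspace(1.0, 0.0, 640), (480, 1))

print(predict_gaze(frame, prior))
```

Comparing an individual viewer’s recorded fixations against such a model-predicted map is, in spirit, how a normative gold standard can be used to flag atypical gaze behavior.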
Laurent Itti received his M.S. degree in Image Processing from the Ecole Nationale Superieure des Telecommunications (Paris, France) in 1994, and his Ph.D. in Computation and Neural Systems from Caltech (Pasadena, California) in 2000. He has since been an Assistant, Associate, and now Full Professor of Computer Science, Psychology, and Neuroscience at the University of Southern California. Dr. Itti’s research interests are in biologically-inspired computational vision, in particular in the domains of visual attention, scene understanding, control of eye movements, and surprise. This basic research has technological applications to, among others, video compression, target detection, and robotics. Dr. Itti has co-authored over 130 publications in peer-reviewed journals, books, and conferences, three patents, and several open-source neuromorphic vision software toolkits.
by Pernilla Qvarfordt