Dr. Adriana Kovashka just received an NSF CAREER Award. Her project is entitled "Natural Narratives and Multimodal Context as Weak Supervision for Learning Object Categories".
In a nutshell:
Object detection is the task of recognizing object categories (e.g. dog, car) by learning from a dataset. However, existing methods assume a dataset meticulously labeled by humans who draw boxes around objects in images and provide the category names. Instead, this project explores a more accessible (less expensive) but also noisier form of supervision, akin to how children learn to name objects. If a parent uses the word "milk bottle" around her baby, and the bottle is sometimes in view, the baby will eventually learn to associate the object category "bottle", heard in the parent's speech, with the visual input corresponding to the bottle. Similarly, plentiful videos on the web contain both a visual demonstration and a spoken narrative of how to do an activity. Objects are often both shown and mentioned in some temporal proximity, but not always at the same time: a narrator might say "take the fruit out of the fridge" while the fruit is still in the fridge, i.e. not yet in view. The goal of this project is to cope with this noise by modeling the purpose of the spoken utterances, and by using additional modalities (senses) beyond speech for disambiguation.
Dr. Kovashka's project develops a framework for learning computer vision models that detect objects from weak, naturally occurring supervision in the form of language (text or speech) together with additional multimodal signals. It considers dynamic settings, where humans interact with their visual environment and refer to the objects they encounter (e.g. "Carefully put the tomato plants in the ground" and "Please put the phone down and come set the table"), as well as captions written for a human audience to complement an image, e.g. news article captions. The challenge of using such language-based supervision for training detection systems is that, along with the useful signal, the speech contains many irrelevant tokens. The project will benefit society by exploring novel avenues for overcoming this challenge and reducing the need for expensive and potentially unnatural crowdsourced labels for training. It has the potential to make object detection systems more scalable and thus usable by a broad user base in a variety of settings. The resources and tools developed would allow natural, lightweight learning in different environments, e.g. different languages or types of imagery where the well-known object categories are not useful, or where there is a shift both in the pixels and in the way humans refer to objects (different cultures, medicine, art). The project opens possibilities for learning in vivo rather than in vitro; while the focus here is on object categories, multimodal weak supervision is useful for a larger variety of tasks. Research and education are integrated through local community outreach, research mentoring for students from lesser-known universities, new programs for student training (including honing graduate students' writing skills), and the development of interactive educational modules and demos based on research findings.
This project creatively connects two domains, vision-and-language and object detection, and pioneers the training of object detection models with weak language supervision and a large vocabulary of potential classes. The impact of noise in the language channel will be mitigated through three complementary techniques that model the visual concreteness of words, the extent to which the text refers to the visual environment it appears with, and whether the weakly-supervised models that are learned are logically consistent. Two complementary word-region association mechanisms will be used (metric learning and cross-modal transformers), whose application to weakly-supervised detection is novel. Importantly, to make detection feasible, not only the semantics of image-text pairs but also their discourse relationship will be captured. To facilitate and disambiguate the association of words with a physical environment, the latter will be represented through additional modalities, namely sound, motion, depth, and touch, which are either present in the data or estimated. This project advances knowledge of how multimodal cues contextualize the relation between image and text; no prior work has modeled image-text relationships along multiple channels (sound, depth, touch, motion). Finally, to connect the appearance of objects to their purpose and use, relationships between objects, properties, and actions will be semantically organized in a graph, and grammars representing activities involving objects will be extracted, while still maintaining the weakly-supervised setting.
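To give a flavor of what "word-region association via metric learning" can mean in a weakly-supervised setting, here is a minimal sketch. It is purely illustrative and not the project's actual formulation: the function names, the cosine-similarity scoring, the multiple-instance max-pooling over region proposals, and the simple triplet-style loss are all assumptions chosen for clarity.

```python
import numpy as np

def word_region_scores(region_feats, word_embs):
    """Cosine similarity between each caption word and each region proposal.

    region_feats: (num_regions, d) array of region features.
    word_embs:    (num_words, d) array of word embeddings.
    Returns a (num_words, num_regions) score matrix.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    return w @ r.T

def image_word_score(region_feats, word_embs):
    """Weak (image-level) supervision: no boxes are given, so a word is
    considered matched if it matches at least one region (max over regions),
    and the image-caption score averages over the caption's words."""
    return word_region_scores(region_feats, word_embs).max(axis=1).mean()

def triplet_loss(pos_score, neg_scores, margin=0.2):
    """Hinge-style metric-learning objective (an illustrative choice):
    a matched image-caption pair should outscore mismatched pairs by a margin."""
    return sum(max(0.0, margin - pos_score + n) for n in neg_scores)
```

Training such a model only requires image-caption pairs: the matched pair supplies `pos_score`, captions from other images supply `neg_scores`, and the word-to-region alignment emerges as a by-product of the max-pooling, which is what makes detection possible without box annotations.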
More information: Award Abstract at NSF