Searching for a given target object in a scene requires not only detecting the target if it is visible, but also identifying promising locations to search if it is not. The quantity that measures how interesting an image region is for a given task is called top-down saliency.
One of the most basic skills a robot should possess is predicting the effect of physical interactions with objects in the environment. This enables optimal action selection to reach a certain goal state. Traditionally, dynamics are approximated by physics-based analytical models. These models rely on specific state representations that may be hard to obtain from raw sensory data, especially if no knowledge of the object shape is assumed. More recently, learning approaches have been shown to predict the effect of complex physical interactions directly from sensory input. It remains an open question, however, how far these models generalize beyond their training data. In this work, we investigate the advantages and limitations of neural-network-based learning approaches for predicting the effects of actions from sensory input, and we show how analytical and learned models can be combined to leverage the best of both worlds. As a physical interaction task, we use planar pushing, for which there exists a well-known analytical model and a large real-world dataset. We propose to use a convolutional neural network to convert raw depth images or organized point clouds into a suitable representation for the analytical model, and we compare this approach to using neural networks for both perception and prediction.
A systematic evaluation of the proposed approach on a very large real-world dataset shows two main advantages of the hybrid architecture: compared to a pure neural network, it significantly (i) reduces the required training data and (ii) improves generalization to novel physical interactions.
Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, IEEE, IROS, October 2016 (conference)
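The hybrid architecture described above can be illustrated with a minimal sketch. The analytical part below is the standard quasi-static planar pushing model with an ellipsoidal limit surface under sticking contact; the learned perception module, which in the paper is a convolutional network regressing the model's state representation from depth images, is replaced here by a hypothetical centroid-based stub. The function names, the constant `c`, and the stub's logic are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def analytical_push(contact, push_vel, c=0.05):
    """Quasi-static planar pushing with an ellipsoidal limit surface,
    assuming sticking contact. `contact` is the contact point in the
    object frame, `push_vel` the pusher velocity, and `c` the ratio of
    torsional to linear friction (an assumed constant here).
    Returns the object twist (vx, vy, omega)."""
    rx, ry = contact
    vpx, vpy = push_vel
    denom = c**2 + rx**2 + ry**2
    vx = ((c**2 + rx**2) * vpx + rx * ry * vpy) / denom
    vy = (rx * ry * vpx + (c**2 + ry**2) * vpy) / denom
    omega = (rx * vy - ry * vx) / c**2
    return np.array([vx, vy, omega])

def perceive(depth_image):
    """Stand-in for the learned perception module: in the hybrid
    architecture a CNN would regress the contact point (and object
    pose) from a raw depth image. Here we fake it with the centroid
    of the closest pixels, purely for illustration."""
    ys, xs = np.where(depth_image < depth_image.mean())
    center = np.array([xs.mean(), ys.mean()])
    return center / depth_image.shape[0] - 0.5  # crude object-frame coords

# Pushing straight through the center of friction yields pure translation,
# while an off-center contact additionally rotates the object.
twist = analytical_push(contact=(0.0, 0.0), push_vel=(0.01, 0.0))
```

The appeal of this split is that the network only has to solve a perception problem (where is the contact?), while the well-understood physics stays fixed, which is consistent with the reduced data requirements reported above.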
One of the central tasks for a household robot is searching for specific objects. This requires not only localizing the target object but also identifying promising search locations in the scene if the target is not immediately visible. As computation time and hardware resources are usually limited in robotics, it is desirable to avoid expensive visual processing steps that are applied exhaustively over the entire image. The human visual system can quickly select those image locations that have to be processed in detail for a given task. This allows us to cope with huge amounts of information and to efficiently deploy the limited capacities of our visual system. In this paper, we therefore propose to use human fixation data to train a top-down saliency model that predicts relevant image locations when searching for specific objects. We show that the learned model can successfully prune bounding box proposals without rejecting the ground truth object locations. In this aspect, the proposed model outperforms a model that is trained only on the ground truth segmentations of the target object instead of fixation data.
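The pruning step mentioned above can be sketched in a few lines: score each bounding-box proposal by the mean top-down saliency it covers and keep only the highest-scoring fraction. The function name, the mean-saliency score, and the `keep_frac` threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def prune_proposals(saliency, boxes, keep_frac=0.25):
    """Rank bounding-box proposals by the mean top-down saliency inside
    each box and keep only the top fraction. `boxes` are (x0, y0, x1, y1)
    tuples in pixel coordinates; `saliency` is an (H, W) map."""
    scores = []
    for x0, y0, x1, y1 in boxes:
        region = saliency[y0:y1, x0:x1]
        scores.append(region.mean() if region.size else 0.0)
    order = np.argsort(scores)[::-1]          # best proposals first
    n_keep = max(1, int(len(boxes) * keep_frac))
    return [boxes[i] for i in order[:n_keep]]

# Proposals overlapping the salient blob survive; the rest are discarded.
sal = np.zeros((20, 20))
sal[5:10, 5:10] = 1.0
kept = prune_proposals(sal, [(5, 5, 10, 10), (0, 0, 4, 4),
                             (12, 12, 18, 18), (6, 6, 9, 9)], keep_frac=0.5)
```

Only the boxes that survive this cheap filter would then be passed to an expensive object classifier, which is the computational saving the abstract argues for.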
Eberhard Karls Universität Tübingen, May 2015 (mastersthesis)
Detecting and identifying the different objects in an image quickly and reliably is an important skill for interacting with one’s environment. The main problem is that, in theory, all parts of an image have to be searched for objects on many different scales to make sure that no object instance is missed. It takes considerable time and effort, however, to actually classify the content of a given image region, and both the time and the computational capacity that an agent can spend on classification are limited. Humans use a process called visual attention to quickly decide which locations of an image need to be processed in detail and which can be ignored. This allows us to deal with the huge amount of visual information and to employ the capacities of our visual system efficiently.

For computer vision, researchers have to deal with exactly the same problems, so learning from the behaviour of humans provides a promising way to improve existing algorithms. In this master’s thesis, a model is trained on eye-tracking data recorded from 15 participants who were asked to search images for objects from three different categories. It uses a deep convolutional neural network to extract features from the input image that are then combined to form a saliency map. This map indicates which image regions are interesting when searching for the given target object and can thus be used to reduce the parts of the image that have to be processed in detail. The method is based on a recent publication by Kümmerer et al., but in contrast to the original method, which computes general, task-independent saliency, the presented model is supposed to respond differently when searching for different target categories.
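The feature-combination step described above can be sketched as a linear readout in the spirit of the Kümmerer et al. approach: a per-category weighted sum of deep feature channels followed by a spatial softmax, so the map can be read as a fixation density. The function name, the shapes, and the softmax readout are assumptions of this sketch, not the thesis's exact architecture.

```python
import numpy as np

def saliency_readout(features, weights):
    """Combine CNN features into a task-dependent saliency map.
    `features` is an (H, W, C) feature tensor from a deep network;
    `weights` is a (C,) vector learned per target category, so swapping
    in a different category's weights changes the map for the same
    image. The spatial softmax normalizes the map into a density."""
    logits = features @ weights      # (H, W) linear combination of channels
    logits -= logits.max()           # subtract max for numerical stability
    density = np.exp(logits)
    return density / density.sum()   # non-negative, sums to 1 over the image

# Same features, two hypothetical categories -> two different maps.
feats = np.random.default_rng(0).standard_normal((4, 4, 8))
map_a = saliency_readout(feats, np.ones(8))
map_b = saliency_readout(feats, -np.ones(8))
```

Because only the small weight vector differs between target categories, the expensive feature extraction is shared, which keeps category-specific search cheap.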
Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.