Machine learning image classifiers use context clues to help understand the contents of a room: if, for example, they identify a dining-room table with a high degree of confidence, that confidence can help resolve ambiguity about nearby objects, identifying them as chairs.
The downside of this powerful approach is that such classifiers can be thrown off by out-of-context elements in a scene, as demonstrated in The Elephant in the Room, a paper from a trio of Toronto-based computer-science researchers.
The authors show that computer vision systems able to confidently identify a large number of items in a living-room scene (a man, a chair, a TV, a sofa, etc.) become hopelessly confused when an elephant is added to the room: not only do they struggle to identify the elephant, they also struggle with everything else in the scene, including items they identified confidently when the elephant was absent.
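For a sense of how such an experiment can be reproduced, here is a minimal sketch (not the authors' code): paste an out-of-context object into a photo and compare a pretrained detector's output before and after. The model choice (torchvision's Faster R-CNN) and the file names are assumptions for illustration.

```python
# A rough reproduction of the transplant experiment (assumptions: torchvision
# >= 0.13 for the weights= API, and hypothetical image file names).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect(img, score_thresh=0.7):
    """Return (label, score) pairs the detector is confident about."""
    with torch.no_grad():
        out = model([to_tensor(img)])[0]
    keep = out["scores"] > score_thresh
    return list(zip(out["labels"][keep].tolist(), out["scores"][keep].tolist()))

room = Image.open("living_room.jpg").convert("RGB")            # hypothetical file
elephant = Image.open("elephant_cutout.png").convert("RGBA")   # cutout with alpha

print("before:", detect(room))

# Transplant the elephant into the scene at an arbitrary position and re-detect.
pasted = room.copy()
pasted.paste(elephant, (60, 180), mask=elephant)
print("after: ", detect(pasted))
```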
It's a new wrinkle on the idea of adversarial examples, those minor, often human-imperceptible changes to inputs that can completely confuse machine-learning systems.
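For readers unfamiliar with the classic version of that trick, here is a hedged sketch of the fast gradient sign method (FGSM); it is not from the paper, and it assumes a standard pretrained torchvision classifier. Each pixel is nudged slightly in the direction that increases the loss, which can flip the predicted label while the change stays nearly invisible.

```python
# FGSM in a few lines (illustrative only; model choice and epsilon are assumptions).
import torch
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()

def fgsm(image, label, eps=0.01):
    """image: (1, 3, H, W) normalized tensor; label: (1,) true class index."""
    image = image.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Step each pixel by eps in the direction that increases the loss.
    return (image + eps * image.grad.sign()).detach()
```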
Contextual Reasoning:

It is not common for current object detectors to explicitly take context into account on a semantic level, meaning that the interplay between object categories and their relative spatial layout (or possibly additional relations) is encoded in the reasoning process of the network. Though many methods claim to incorporate contextual reasoning, this is done more at the feature level, meaning that global image information is somehow encoded in each decision. This is in contrast to older works, in which explicit contextual reasoning was quite popular (see [3] for mention of many such works). Still, some implicit form of contextual reasoning does seem to take place. One such example is a person detected near the keyboard (Figure 6, last column, last row). Some of the created images contain pairs of objects that may never appear together in the same image in the training set, or otherwise give rise to scenes with unlikely configurations: for example, non co-occurring categories, such as elephants and books, or unlikely spatial/functional relations, such as a large person (in terms of image area) above a small bus. Such scenes could cause misinterpretation due to contextual reasoning, whether it is learned explicitly or not.
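One way to see how lopsided those co-occurrence statistics are is to count them directly in a detection training set. The sketch below is my own illustration, not the paper's analysis; it assumes COCO-style annotations at a hypothetical path and tallies how often each pair of categories appears in the same training image.

```python
# Count category co-occurrences in COCO-style annotations (path is hypothetical).
import json
from collections import Counter
from itertools import combinations

with open("annotations/instances_train2017.json") as f:
    coco = json.load(f)

cat_name = {c["id"]: c["name"] for c in coco["categories"]}

# Which category names appear in each training image?
cats_per_image = {}
for ann in coco["annotations"]:
    cats_per_image.setdefault(ann["image_id"], set()).add(cat_name[ann["category_id"]])

# Tally every pair of categories that shares an image.
pair_counts = Counter()
for cats in cats_per_image.values():
    pair_counts.update(combinations(sorted(cats), 2))

print(pair_counts[("book", "elephant")])  # pairs like this are rare or absent
print(pair_counts.most_common(5))         # the context pairs a detector leans on
```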
The Elephant in the Room [Amir Rosenfeld, Richard Zemel and John K. Tsotsos/Arxiv]
Machine Learning Confronts the Elephant in the Room [Kevin Hartnett/Quanta]