“Visual attention is estimated as having the capacity of only 40 bits/second for humans”


“…carrying out much of the selection quickly and by bottom-up (or autonomous) mechanisms is computationally efficient, and indeed essential to respond to unexpected events. Bottom-up selection is more potent and quicker than top-down selection, which could be based on features or objects, as well as location. Early visual processes could facilitate bottom-up selection by explicitly computing and representing bottom-up saliency to guide selection of salient locations. Meanwhile, any data reduction before the selection should be as lossless as possible, for any lost information could never be selected to be perceived. This suggests a process for early vision to incorporate sequentially two data reduction strategies: (1) data compression with minimum information loss and (2) creating a representation with explicit saliency information to facilitate selection by saliency.”

“At its heart, vision is a problem of object recognition and localization for (eventually) motor responses.  However, before this end comes the critical task of input selection, of limited aspects of the input, for detailed processing through the attentional bottleneck…it is computationally efficient to carry out much of this selection quickly and by bottom-up mechanisms, by directing attention to a restricted region of visual space. Towards this goal, it has recently been proposed that V1 creates a bottom-up saliency map of visual space, such that a location with a higher scalar value in this map is more likely to be selected for further visual processing, i.e., to be salient and attract attention.  The saliency values are represented by the firing rates of the V1 neurons, such that the RF location of the most active V1 cell is most likely to be selected, regardless of the input feature tunings of the V1 neurons.”
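The readout described in this quote is a max rule: saliency at each location is the highest firing rate among the V1 cells responding there, whatever their feature tuning. A minimal numerical sketch (the firing rates and the three-cell layout below are invented for illustration, not taken from the paper):

```python
import numpy as np

# Toy V1 responses: rows are RF locations, columns are cells with
# different (hypothetical) feature tunings at that location.
rates = np.array([
    [12.0,  8.0,  5.0],   # location 0
    [ 4.0, 30.0,  6.0],   # location 1: one very active cell
    [10.0,  9.0, 11.0],   # location 2
])

# Saliency is read out as the maximum firing rate at each location,
# regardless of which feature the winning cell is tuned to.
saliency = rates.max(axis=1)
attended = int(saliency.argmax())  # RF location of the most active cell
print(attended)  # -> 1
```

Firing rate here plays the role of the "universal currency" for selection: location 1 wins because one of its cells fires at 30 Hz, even though its other cells are nearly silent.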

It is apparent that V1’s overcomplete representation should also be useful for other computational goals that V1 could serve. Indeed, V1 also sends its outputs to higher visual areas for operations beyond selection, e.g., recognition and learning. Within the scope of this paper, I do not elaborate further on our poor understanding of these additional roles.

The V1 saliency theory differs from the traditional theories mainly because it was motivated by understanding V1. It aims for fast computation, and thus requires no separate feature maps, no combination of them, and no decoding of input features to obtain saliency. Indeed, many V1 neurons are tuned to more than one feature dimension, e.g., a single neuron tuned to both orientation and motion direction (Livingstone and Hubel 1984), making it impossible to have separate groups of V1 cells for separate feature dimensions. Furthermore, V1 neurons signal saliency by their responses regardless of their feature tunings; hence their firing rates are the universal currency for saliency (to bid for selection) irrespective of the feature selectivity of the cells, just as the purchasing power of the euro is independent of the nationality or gender of the currency holder.  In contrast, the traditional theories were motivated by explaining behavioral data within a natural framework, without specifying the cortical location of the feature maps or of the master saliency map, and without a drive for algorithmic simplicity.  This in particular leads to the feature-map summation rule for saliency determination, and implies that the master saliency map should be in a higher-level visual area (such as the lateral intraparietal area, LIP), where cells are untuned to features.
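The two readout rules can disagree about which location is most salient. A minimal sketch of the contrast, using invented numbers in which each column plays the role of one feature map:

```python
import numpy as np

# Toy responses: location 0 activates several feature maps moderately;
# location 1 activates a single feature map strongly.
rates = np.array([
    [ 9.0,  9.0,  9.0],   # location 0
    [20.0,  1.0,  1.0],   # location 1
])

sum_rule = rates.sum(axis=1)  # traditional summation rule: 27 vs 22
max_rule = rates.max(axis=1)  # V1 max rule:                 9 vs 20
print(int(sum_rule.argmax()), int(max_rule.argmax()))  # -> 0 1
```

The summation rule favors the location supported by many moderately active maps, while the max rule favors the location containing the single most active cell; such divergences are what make the two frameworks behaviorally distinguishable.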

This paper reviews two lines of work aimed at understanding early vision through its role of data reduction in the face of information bottlenecks. The efficient coding principle views the input sampling and input transformations performed by the early visual RFs as serving the goal of encoding visual inputs efficiently, so that as much input information as possible can be transmitted to higher visual areas through information-channel bottlenecks. It not only accounts for these neural properties but also, by linking them with visual sensitivity in behavior, provides an understanding of sensitivity or perceptual changes caused by adaptation to different environments, and of the effects of developmental deprivation.  Non-trivial and easily testable predictions have also been made, some of which have subsequently been confirmed experimentally, for example on the correlation between the preferred orientation and ocularity of V1 cells.
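One standard instance of efficient coding is decorrelation: an invertible linear transform removes redundancy between channels without losing any information, so the channel capacity of the bottleneck is used efficiently. A sketch of this idea with a whitening filter (the toy correlated inputs and the symmetric-whitening choice are illustrative, not the paper's specific model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated inputs: 4 channels sharing one latent signal,
# standing in for nearby photoreceptors viewing a smooth scene.
n, dim = 5000, 4
latent = rng.normal(size=(n, 1))
x = latent + 0.3 * rng.normal(size=(n, dim))

# Symmetric whitening filter W = C^(-1/2): invertible, hence
# information-lossless, but it decorrelates the output channels.
cov = np.cov(x, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
y = x @ W

# The output covariance is (numerically) the identity: redundancy removed.
out_cov = np.cov(y, rowvar=False)
print(np.allclose(out_cov, np.eye(dim), atol=1e-6))  # -> True
```

Because W is invertible, the original inputs remain fully recoverable from y; the transform reduces redundancy, not information, which is the sense in which efficient coding compresses "with minimum information loss."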

The V1 saliency map hypothesis views V1 as creating a bottom-up saliency map to facilitate information selection or discarding, so that the data rate can be further reduced for detailed processing through the visual attentional bottleneck.  This hypothesis not only explains the V1 properties not accounted for by the efficient coding principle, but also links V1’s physiology to complex visual search and segmentation behavior previously thought of as not associated with V1. It also makes testable predictions, some of which have subsequently been confirmed, as shown here and previously. Furthermore, its computational considerations and physiological basis raise fundamental questions about the traditional, behaviorally based framework of visual selection mechanisms.
The goal of theoretical understanding is not only to give insights into known facts, thus linking seemingly unrelated data, e.g., from physiology and from behavior, but also to make testable predictions and motivate new experiments and research directions. This strategy should also be the most fruitful for answering many more open questions regarding early visual processes, most particularly the mysterious functional role of the LGN, which receives retinal outputs, sends outputs to V1, and receives massive feedback fibers from V1 (Casagrande et al 2005). This paper has also exposed our lack of a full understanding of the overcomplete representation in V1, despite our recognition of its usefulness for the saliency map and its contradiction of efficient coding. Such understanding is likely to arise from a better understanding of bottom-up saliency computation, and from the study of possible roles of V1 beyond input selection, or even beyond bottom-up visual processes, such as learning and recognition (Lennie 2003, Lee 2003, Salinas and Abbott 2000, Olshausen and Field 2005). Furthermore, such pursuit can hopefully expose gaps in our current understanding and prepare the way to investigate behavioral and physiological phenomena beyond early vision.

