Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)
Suzanna Becker
A biologically motivated model of cortical self-organization is proposed. Context is combined with bottom-up information via a maximum likelihood cost function. Clusters of one or more units are modulated by a common contextual gating signal; they thereby organize themselves into mutually supportive predictors of abstract contextual features. The model was tested on its ability to discover viewpoint-invariant classes on a set of real image sequences of centered, gradually rotating faces. It performed considerably better than supervised back-propagation at generalizing to novel views from a small number of training examples.
1 THE ROLE OF CONTEXT
The importance of context effects¹ in perception has been demonstrated in many domains. For example, letters are recognized more quickly and accurately in the context of words (see e.g. McClelland & Rumelhart, 1981), words are recognized more efficiently when preceded by related words (see e.g. Neely, 1991), and individual speech utterances are more intelligible in the context of continuous speech. Further, there is mounting evidence that neuronal responses are modulated by context. For example, even at the level of the LGN in the thalamus, the primary source of visual input to the cortex, Murphy & Sillito (1987) have reported cells with "end-stopped" or length-tuned receptive fields which depend on top-down inputs from the cortex. The end-stopped behavior disappears when the top-down connections are removed, suggesting that the cortico-thalamic connections are providing contextual modulation to the LGN. Moving a bit higher up the visual hierarchy, von der Heydt et al. (1984) found cells which respond to "illusory contours", in the absence of a contoured stimulus within the cells' classical receptive fields. These examples demonstrate that neuronal responses can be modulated by secondary sources of information in complex ways, provided the information is consistent with their expected or preferred input.
1 We use the term context rather loosely here to mean any secondary source of input. It could be from a different sensory modality, a different input channel within the same modality, a temporal history of the input, or top-down information.
Learning Temporally Persistent Hierarchical Representations
Figure 1: Two sequences of 48 by 48 pixel images digitized with an IndyCam and preprocessed with a Sobel edge filter. Eleven views of each of four to ten faces were used in the simulations reported here. The alternate (odd) views of two of the faces are shown above.
Why would contextual modulation be such a pervasive phenomenon? One obvious reason is that if context can influence processing, it can help in disambiguating or cleaning up a noisy stimulus. A less obvious reason may be that if context can influence learning, it may lead to more compact representations, and hence a more powerful processing system. To illustrate, consider the benefits of incorporating temporal history into an unsupervised classifier. Given a continuous sensory signal as input, the classifier must try to discover important partitions in its training data. If it can discover features that are temporally persistent, and thus insensitive to transformations in the input, it should be able to represent the signal compactly with a small set of features. Further, these features are more likely to be associated with the identity of objects rather than lower-level attributes. However, most classifiers group patterns together on the basis of spatial overlap. This may be reasonable if there is very little shift or other form of distortion between one time step and the next, but is not a reasonable assumption about the sensory input to the cortex. Pre-cortical stages of sensory processing, certainly in the visual system (and probably in other modalities), tend to remove low-order correlations in space and time, e.g. with centre-surround filters. Consider the image sequences of gradually rotating faces in Figure 1. They have been preprocessed by a simple edge filter, so that successive views of the same face have relatively little pixel overlap. In contrast, identical views of different faces may have considerable overlap. Thus, a classifier such as k-means, which groups patterns based on their Euclidean distance, would not be expected to do well at classifying these patterns. So how are people (and in fact very young children) able to learn to classify a virtually infinite number of objects based on relatively brief exposures?
It is argued here that the assumption of temporal persistence is a powerful constraining factor for achieving this, and is one which may be used to advantage in artificial neural networks as well. Not only does it lead to the development of higher-order feature analyzers, but it can result in more compact codes which are important for applications like image compression. Further, as the simulations reported here show, improved generalization may be achieved by allowing high-level expectations (e.g. of class labels) to influence the development of lower-level feature detectors.
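The earlier point about Euclidean-distance classifiers can be illustrated with a toy sketch. It is not taken from the paper's simulations: the 1-D binary "edge images", the edge positions in `FACES`, and the 2-pixel shift per view are all assumptions invented for illustration. The sketch only shows that, when edges are thin and views shift them, patterns from the same view of different objects can lie closer together than adjacent views of the same object.

```python
from math import sqrt

LENGTH = 40  # pixels in each toy 1-D "edge image" (assumed size)

# Hypothetical edge positions for two objects; they share two of three edges.
FACES = {"A": (5, 15, 25), "B": (5, 15, 30)}


def make_view(edges, view, length=LENGTH):
    """Return a binary edge image; each successive view shifts edges by 2 pixels."""
    image = [0.0] * length
    for position in edges:
        image[(position + 2 * view) % length] = 1.0
    return image


def euclidean(a, b):
    """Euclidean distance between two equal-length patterns."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Four views of each face, keyed by (face label, view index).
patterns = {
    (face, view): make_view(edges, view)
    for face, edges in FACES.items()
    for view in range(4)
}


def nearest_neighbour(key):
    """Key of the closest other pattern under Euclidean distance."""
    others = (k for k in patterns if k != key)
    return min(others, key=lambda k: euclidean(patterns[key], patterns[k]))


same_face = euclidean(patterns[("A", 0)], patterns[("A", 1)])
same_view = euclidean(patterns[("A", 0)], patterns[("B", 0)])
print(f"same face, adjacent views: {same_face:.3f}")
print(f"different faces, same view: {same_view:.3f}")
print("nearest neighbour of A view 0:", nearest_neighbour(("A", 0)))
```

In this toy set-up every pattern's nearest neighbour is the same view of the other object, so a purely distance-based grouping would tend to cluster by viewpoint rather than by identity. Real edge-filtered face images are far less clean, so the effect there is a matter of degree.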
2 THE MODEL
Competitive learning (for a review, see Becker & Plumbley, 1996) is considered by many to be a reasonably strong candidate model of cortical learning. It can be implemented, in its simplest form, by a Hebbian learning rule in a network