Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Part of Advances in Neural Information Processing Systems 6 (NIPS 1993)


Authors

Gregory Wolff, K. Prasad, David Stork, Marcus Hennecke

Abstract

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling - the Bayesian optimal method given conditional independence, which seems to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
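
For intuition, the sketch below illustrates the independent opinion pooling rule mentioned in the abstract: under conditional independence of the two modalities, per-class posteriors from the video and acoustic subnetworks are multiplied, divided by the class priors, and renormalized. The function name, the five-class posterior vectors, and the uniform priors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def independent_opinion_pool(p_video, p_audio, priors):
    """Combine per-class posteriors from two conditionally independent
    modalities: multiply them, divide out the shared prior, renormalize.
    This is the Bayes-optimal fusion rule given conditional independence."""
    combined = (p_video * p_audio) / priors
    return combined / combined.sum()

# Hypothetical posteriors over five utterance classes from each subnetwork.
priors = np.full(5, 0.2)
p_video = np.array([0.10, 0.50, 0.15, 0.15, 0.10])
p_audio = np.array([0.20, 0.40, 0.20, 0.10, 0.10])
print(independent_opinion_pool(p_video, p_audio, priors))
```

In this toy example the pooled posterior concentrates on the class both subnetworks favor, which is the behavior that lets the video channel sharpen an ambiguous acoustic decision.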