Reproduced from Interface, Summer 1986 (House Journal of Cambridge Consultants Ltd)
Why can't it see what it's doing?
In designing machine vision systems for the recognition, inspection or tracking of objects for automating industry, we cannot help making comparisons with the capabilities of human vision. We take for granted the tremendous power of our sense of sight to resolve, recognise and track what we see, but even the best current machine vision systems are extremely primitive by comparison. TREVOR SMITH gives some insights into how human vision achieves its performance and suggests there may be some useful lessons for machine vision engineers.
Image processing has been with us ever since the early days of the digital computer, but until comparatively recently it has found refuge in applications such as the enhancement and display of satellite imagery, the reconstruction of sensor data into images for medical scanners (e.g. X-ray computer-aided tomography (CAT) and magnetic resonance imaging), and the manipulation of microscope images for scientific applications.
In these fields the emphasis has been on the display of images as pictures for humans to look at; a human is still relied on to make judgements from the pictorial information. The machine is merely being asked to give him or her a good picture. Moreover, the first generation of machines doing image processing were slow, large and expensive.
Industrial machine vision
With the dramatic reduction in cost, size and power consumption of computer hardware, particularly over the last decade, the use of image processing has become attractive for many other areas of application. At CCL we have seen image processing systematically applied to various military applications. Now, we see the emergence of image processing - or machine vision as it is often called - on the factory floor. In so many industries, automation has been implemented for practically every part of the process except for inspection.
One challenge for machine vision is to cut costs by automating this last labour-intensive operation. However, the techniques used to display images to humans are not enough; we are now asking the machine to make the judgements. The techniques that allow enhanced images from satellites to be displayed are not in themselves enough for industrial inspection. The industrial machine vision system must make decisions, resolve conflicts and convey its findings to other machines as well as to its human supervisors.
Robots more demanding
With the advent of flexible manufacturing systems, and the use of robots in factory automation, machine vision will have a role to play in machine guidance and error recovery. The computational requirements for integrating machine vision fully into the manufacturing cell are more demanding than for inspection and will inevitably come with a second wave following the widespread installation of vision-based inspection equipment.
Another major problem for machine vision is the speed at which it can be performed. Processing a single satellite image often takes minutes on a large mainframe or super-mini computer. However, manufacturing processes often run at many units per second. You may argue that the sort of processing done on satellite images or in a CAT scanner is completely different in nature from what is required for machine vision, but even the most powerful of today's microcomputers is totally inadequate when it comes to inspecting glass bottles or pills or electronic components. Special computers, designed specifically for image processing, are needed. So-called pipelined computer architectures are already well established in this area, but they lack flexibility and become very costly when complicated manipulation of any image is required. Computer architectures which consist essentially of many microprocessors working in parallel are starting to emerge and represent the state of the art in image processing.
Performance comparison
When we automate inspection we are of course replacing a human vision system with a machine vision system, and we do it on the basis of cost for performance. Academics compare machine and human vision systems out of curiosity, but when we are installing automatic inspection equipment on the factory floor the comparison of human and machine performance is forced on us. We have to justify the high-technology approach; most industries these days demand an 18-month payback period for capital equipment investment!
Performance comparisons are difficult. Often machines cannot spot defects which human beings find easy to recognise, but the one great advantage of the machine is its tireless vigilance. The machine may not be perfect, but at least it goes on performing consistently while the human inspector, particularly if the task is very repetitive, will often miss major defects - which the machine can find - because he is distracted or lacks concentration. Why can't we build a machine that performs as well as a human inspector - and does it consistently? This is obviously a 'big' question. But let's just look at some of the details we know of how the human vision system achieves its performance.
It does it, remember, in a space smaller than a cubic foot (if you include that integral part - the human brain) and consumes only a small number of watts of energy.
Human vision
If you hold a copy of The Times open in front of you at arm's length and you have normal vision, you should have no difficulty in reading the print over the whole of the two pages without moving your head. The information in this visual field would require more than 15 million individual picture elements (pixels) to represent it as an image with the detail that the eye could resolve (1 minute of arc). This is assuming that a binary image is adequate (i.e. each pixel is either black or white). If we include all the shades of brightness and all the possible colours that the eye can perceive, the field of view available to the human eye actually contains something like 60 billion bits of information.
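As a rough check on the 15 million figure, the sketch below counts one-arc-minute pixels across an open broadsheet. The page dimensions and viewing distance are assumptions made here for illustration (the text does not give them), and the function names are hypothetical, so only the order of magnitude is meaningful.

```python
import math

# Assumed geometry (NOT from the article): an open two-page broadsheet
# about 0.75 m wide and 0.60 m tall, held 0.60 m from the eye.
SPREAD_WIDTH_M = 0.75
SPREAD_HEIGHT_M = 0.60
VIEWING_DISTANCE_M = 0.60

# Resolution quoted in the text: 1 minute of arc.
RESOLUTION_ARCMIN = 1.0


def subtended_arcmin(extent_m: float, distance_m: float) -> float:
    """Angle subtended by a flat extent centred on the line of sight, in arc minutes."""
    angle_rad = 2.0 * math.atan((extent_m / 2.0) / distance_m)
    return math.degrees(angle_rad) * 60.0


def pixel_count(width_m: float, height_m: float,
                distance_m: float, resolution_arcmin: float) -> float:
    """Pixels needed if each pixel spans `resolution_arcmin` in both directions.

    Treats the angular width and height as a simple rectangular grid,
    which is itself an approximation for so wide a field.
    """
    cols = subtended_arcmin(width_m, distance_m) / resolution_arcmin
    rows = subtended_arcmin(height_m, distance_m) / resolution_arcmin
    return cols * rows


pixels = pixel_count(SPREAD_WIDTH_M, SPREAD_HEIGHT_M,
                     VIEWING_DISTANCE_M, RESOLUTION_ARCMIN)
print(f"{pixels / 1e6:.1f} million binary pixels")
```

With these assumed dimensions the count comes out at a little over ten million, the same order of magnitude as the figure quoted above; holding the paper slightly closer pushes it higher.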
Studies of the rate of flicker at which chopped light subjectively appears as a continuous light suggest that to represent adequately what we perceive when things are moving or changing in light intensity within our visual field, the data must be sampled at about 40 times a second. At this rate of picture update the human vision system would seem to have available to it something like two million million bits of information per second. Processing this much information in real time is no mean feat so how does the human vision system cope?
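The arithmetic behind the "two million million" figure can be written out as a short sketch. Both input numbers are taken from the text; the function and variable names are hypothetical.

```python
# Figures quoted in the text.
BITS_PER_FRAME = 60e9       # "something like 60 billion bits" in the field of view
FRAMES_PER_SECOND = 40      # sampling rate suggested by flicker studies


def raw_bit_rate(bits_per_frame: float, frames_per_second: float) -> float:
    """Information rate if the whole visual field were sampled at full detail."""
    return bits_per_frame * frames_per_second


raw_rate = raw_bit_rate(BITS_PER_FRAME, FRAMES_PER_SECOND)
print(f"{raw_rate:.1e} bits per second")
```

The product is 2.4 million million bits per second, which the text rounds to two million million.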
If we are truthful, very little is yet known about how the human vision system works, but what we do know provides a useful insight into how we might better design machine vision systems. The first thing to realise is that although we can resolve down to about one minute of arc we can only do this over a very small area of the retina called the fovea. To demonstrate this, fix your eyes on a point on the edge of a printed page and you will find that you cannot recognise any of the words in your peripheral vision - even large newspaper headlines. The resolving power of the fovea falls off so rapidly that even the margin down the side of the page takes the characters out of a region where they can be identified. What allows us to read the whole page of our newspaper is, of course, eye movement.
When I was in the Department of Medical Electronics at St. Bartholomew's Hospital, I developed some new techniques to characterise eye movements and have been fascinated ever since by the simplicity and elegance of the eye movement control system and how it is integrated into the whole visual system. I was conducting research into eye movements because their characteristics are of particular interest to pharmacologists. This is because quantitative changes in eye movement provide a sensitive and objective measure of the effects of medicines such as tranquillisers, which otherwise produce only subjective effects.
Much of my work at that time was concerned with how to measure characteristics of eye movements, such as maximum velocity or reaction time, and seeing how these changed when people had taken a particular medication. What I learnt was that the vision sensor that we call the eye is only part of a complete vision system. The overall system includes closed loop control for stabilising the eye pointing direction and for tracking moving objects. This relies on inertial sensors (in the ear) and peripheral (i.e. non-foveal) vision. Other systems that are centrally controlled focus the eye and regulate the light input by means of a diaphragm. For stereo vision the eye movement control system can automatically align the images falling on the retina for both near and far vision. This is done by lateral displacement and by rotation - did you know your eye could rotate about an axis through the pupil? The vision processing is done by a series of highly parallel computing structures (the visual cortex of the brain) which deal with the information in successively more and more complex representation. These achieve pattern recognition at the lower levels and can be used with reasoning (i.e. thought) at higher levels to achieve understanding (see Fig 2).
One key lesson for the machine vision engineer, to my mind, is the extent to which the eye achieves data reduction before passing information to the brain (see the insert box for more details). And it is a sobering thought that the underlying technology for information transmission and processing (i.e. the neurone) is extremely slow by comparison with a microchip. Gate delay times are many milliseconds and transmission rates only a few metres per second. The designers of modern computer hardware worry about delays of a few nanoseconds. With technology that runs a million times faster than the human equivalent, you would think that machine vision should be a lot better than it is!
The conclusion must be that we should worry more about system design, the topology of computer architectures and the design of algorithms. People have said this before? Well, let's get on with doing it.
Eye movements as an integral part of human vision
When everything in the visual field is static, the eye uses saccadic movements. Saccades are pre-programmed jumps in the pointing direction of the eye which occur several times a second almost all the time. When you read, although you may think your eyes are scanning smoothly across the page, in fact you make a series of these jumps, probably about five or six along each line. Interestingly, you do not see anything while you are performing a saccade - the information rate would in any case be too high for your brain to process. All the word recognition occurs while the eye is static for about 200 ms between saccades. These eye movements bring particular features of interest into optical alignment with the fovea, from which recognition can be performed. A series of static images is presented like slides in a slide projector. The advantage of this system is that the brain can keep track separately of how the series of slides relates to the internal picture it is building up of the world. The visual sensor and the recognition system thus concentrate on a very small subset of the available information. In fact, we know that there are only about a million nerve fibres in the optic nerve from each eye and the information bandwidth of each fibre is only about 20 bits per second. This 20 million bits per second is a lot less than the two million million bits per second that we estimated were needed to adequately represent the information available to the eye. 20 million bits per second is about what we need for a broadcast quality colour TV picture!
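The size of the data reduction implied by these figures can be made explicit with a short calculation. All three numbers come from the text; the variable names are hypothetical.

```python
# Figures quoted in the text.
FIBRES_PER_OPTIC_NERVE = 1_000_000    # "about a million nerve fibres"
BITS_PER_SECOND_PER_FIBRE = 20        # "only about 20 bits per second"
RAW_BITS_PER_SECOND = 2e12            # "two million million bits per second"

# Total capacity of one optic nerve.
optic_nerve_rate = FIBRES_PER_OPTIC_NERVE * BITS_PER_SECOND_PER_FIBRE

# How much smaller this is than the raw information available to the eye.
reduction_factor = RAW_BITS_PER_SECOND / optic_nerve_rate

print(f"Optic nerve: {optic_nerve_rate:,} bits per second")
print(f"Reduction factor: {reduction_factor:,.0f} to 1")
```

On these figures the eye passes on roughly one part in a hundred thousand of the information nominally available to it.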

Figure 2
The integrated human vision system, which achieves better performance than the best machine vision system for most tasks but uses technology which runs a million times slower.
In fact, there are mechanisms other than the eye-movement–foveal–vision trick for reducing data bandwidth in the optic nerve and recognition system of the brain. These have to do with how the retina pre-processes the image formed on it: for example, the information is filtered to detect edges and changes in time. Because of the high pass filtering in time, another type of eye movement is needed to keep the system functioning - the micro-saccade. Even between major saccades the eye performs a very small jittering movement.
The high pass filtering, which essentially means that changes in the signals are emphasised, ensures that relevant information is transmitted with good noise immunity. Moreover, the signals are correspondingly less affected by the rather crude representation in the neurones. In neurones, the intensity of a signal is coded by pulse repetition rate, and although some neurones are capable of firing at up to 1000 times per second, the rate rarely exceeds 200 and is mostly a lot less. Out-going nerves controlling muscles operate in the same way and it is interesting that one of the most notable effects of tranquillisers on eye movement is reduction in the maximum velocity that can be achieved in a saccadic eye movement. This appears to be related to a reduction in the rate of nerve firing in a part of the brain called the para-pontine reticular formation.
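Why high pass filtering in time makes the micro-saccade necessary can be illustrated with a toy model. The sketch below is not a model of the real retina: the first-order filter, its coefficient, the eight-sample scene and the one-sample jitter are all assumptions chosen here for illustration. It shows only that a receptor looking at a steady image stops signalling, while the same receptor jittered across an edge keeps signalling.

```python
# A one-dimensional "scene": a dark region followed by a bright region (one edge).
SCENE = [0.0] * 4 + [1.0] * 4
RECEPTOR_INDEX = 3      # assumed receptor position, just on the dark side of the edge
STEPS = 50


def high_pass(samples, alpha=0.8):
    """First-order discrete high-pass filter (assumed form):
    y[n] = alpha * (y[n-1] + x[n] - x[n-1]), starting from darkness."""
    out = []
    prev_x, prev_y = 0.0, 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def receptor_input(eye_offsets):
    """Brightness seen by the receptor as the eye offset changes over time."""
    return [SCENE[RECEPTOR_INDEX + offset] for offset in eye_offsets]


# Eye held perfectly still, receptor resting on the bright side of the edge.
static_offsets = [1] * STEPS
# Eye jittering by one sample, carrying the receptor back and forth across the edge.
jitter_offsets = [n % 2 for n in range(STEPS)]

static_response = high_pass(receptor_input(static_offsets))
jitter_response = high_pass(receptor_input(jitter_offsets))

print(f"static eye, final response:    {abs(static_response[-1]):.5f}")
print(f"jittering eye, final response: {abs(jitter_response[-1]):.5f}")
```

In this toy model the response to the stationary image decays towards zero after the initial onset, whereas the jittered image keeps producing a response of steady size.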

Since saccades are pre-programmed there is no closed loop control during the manoeuvre itself. This can result in quite long delays between the requirement to look at an object and achieving foveal vision of the object. In the tests used to measure the characteristics of eye movements, subjects are asked to follow a small spot of light as it jumps horizontally on an otherwise featureless screen, usually on the opposite wall of a darkened room. Subjects rest their heads against a pad and are asked to keep still. Under these conditions it is possible to study the nature of saccades with angles up to 40° (much larger than are usually used). Since there is no vision during saccades, the brain estimates the size of the saccade that will be needed when the dot stimulus moves, and after a reaction time of about 180 to 200 ms a primary saccade manoeuvre is executed. It takes this long for the brain to recognise an event in the peripheral vision, estimate the size of saccade needed and then initiate the saccade - see Fig 3. A large saccade may take about 80 ms to perform. After this, with the eye stationary for about 100 ms, vision is restored and used to assess the need for further corrective saccades. Because they are generally smaller in angle they can be performed quite quickly, but on occasions more than one inter-saccadic interval and consequent corrective saccade may be needed to achieve the required eye position. For small saccades, as used in reading, corrective saccades do not generally occur but for large saccades they are quite common. Interestingly, these large primary saccades almost always undershoot and only rarely overshoot the target. The control system for saccadic eye movements might, therefore, be described loosely as slightly over-damped successive approximation.
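The "over-damped successive approximation" behaviour can be sketched as a simple simulation. The reaction time and inter-saccadic interval are taken from the text. The undershoot gain, the foveal tolerance and the linear duration model are assumptions introduced here (the duration model is chosen only so that a large saccade takes roughly the 80 ms quoted), and the function names are hypothetical.

```python
# Timings quoted in the text.
REACTION_TIME_MS = 190.0      # "about 180 to 200 ms"
INTER_SACCADE_MS = 100.0      # eye stationary while vision is restored

# Assumptions made for this sketch (NOT from the text).
UNDERSHOOT_GAIN = 0.9         # each saccade covers 90% of the remaining error
FOVEAL_TOLERANCE_DEG = 0.5    # error small enough to count as "on target"


def saccade_duration_ms(amplitude_deg):
    """Assumed linear model: about 80 ms for a saccade of roughly 30 degrees."""
    return 20.0 + 2.0 * amplitude_deg


def simulate_saccades(target_deg, gain=UNDERSHOOT_GAIN,
                      tolerance_deg=FOVEAL_TOLERANCE_DEG, max_saccades=10):
    """Return (landing_time_ms, eye_position_deg) for each saccade in turn."""
    events = []
    position = 0.0
    time_ms = REACTION_TIME_MS            # delay before the primary saccade starts
    for _ in range(max_saccades):
        error = target_deg - position
        if abs(error) <= tolerance_deg:
            break                         # target is within foveal vision
        amplitude = gain * error          # pre-programmed jump, no feedback in flight
        time_ms += saccade_duration_ms(abs(amplitude))
        position += amplitude
        events.append((time_ms, position))
        time_ms += INTER_SACCADE_MS       # vision restored, remaining error assessed
    return events


events = simulate_saccades(target_deg=30.0)
for landing_time, position in events:
    print(f"t = {landing_time:6.1f} ms, eye at {position:5.2f} degrees")
```

With the assumed gain of 0.9, a 30° target is reached by a primary saccade that lands short, followed by one smaller corrective saccade, and the eye never overshoots. A gain above 1 would give overshoot instead, which the text says is rare.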

Figure 3
Angular eye position plotted against time while a subject is following a spot of light projected onto a wall opposite. The spot appears in one of two positions 30° apart. Note that some saccades are not large enough to get the eye on target and corrective saccades are necessary.
The author
Dr Trevor Smith is leader of the Image and Digital Processing Group. His primary interests are in the application of advanced image and signal processing techniques. He recently managed the system design phase of a large project to develop a Doppler sonar speed sensor. He is currently managing projects to achieve automatic crack detection using image processing and in automatic recognition of objects from image data. He has carried out technical consultancy assignments in machine vision and medical equipment development.
He graduated from Imperial College and, prior to joining CCL, he did research, development engineering and lecturing on the medical application of signal processing at St Bartholomew's Hospital.