Human Interfacing Issues of Virtual Reality


(c)1995 by Brian Lingard

What is Virtual Reality?

Virtual reality conjures up visions of people wearing strange looking helmets and bizarre gloves. They gesture wildly in space, turning and twisting their heads to see unknown visions. The term virtual reality (VR), first coined by Jaron Lanier back in 1989, refers to a computer-generated, interactive, 3D environment. There have been several terms used to label this new technology. Myron Krueger used the term artificial reality to refer to environments generated by computers. Virtual environments and virtual worlds refer to worlds which exist entirely within the memory of a computer. Telepresence allows the user to experience a real but remote environment that would ordinarily be dangerous or difficult to experience in real life. Telerobotics and video teleconferencing are other examples of such technology. Cyberspace was described as a computer environment spanning multiple computers, users, and data, forming a consensual hallucination that spanned the globe.

VR actually can be classified into three stages -- passive, exploratory, and immersive. Passive VR refers to experiences most people are familiar with in everyday life -- watching TV, seeing movies, reading books, or visiting amusement parks. Exploratory VR is interactively exploring a 3D environment solely through the monitor of a computer. Immersive VR is the classic stage of VR, in which the user fully interacts with the artificial environment, receives stimulation for all the senses, and has their actions directly affect the computer-generated environment. Please consult the references at the end of this article for more information about the history and roots of VR.

The Human Interfaces

What allows our senses to be fooled by VR? VR interfaces commonly seek to stimulate our visual, auditory, and tactile systems. This article will briefly examine how each of these systems processes sensory information. A few examples of VR interfaces will also be cited. Please consult the references at the end of the article for more detailed information about these systems and interfaces.

Our Visual System

We obtain most of our knowledge of the world around us through our eyes. Our visual system processes information in two distinct ways -- conscious and preconscious processing. Looking at a photograph, or reading a book or map, requires conscious visual processing and hence usually requires some learned skill. Preconscious visual processing, however, describes our basic ability to perceive light, color, form, depth, and movement. Such processing is more autonomous, and we are less aware that it is happening.

Physically, our eyes are fairly complicated organs. Specialized cells form structures which perform several functions -- the pupil acts as the aperture, where muscles control how much light passes; the crystalline lens focuses light by using muscles to change its shape; and the retina is the workhorse, converting light into electrical impulses for processing by our brains. Our brain performs visual processing by breaking down the neural information into smaller chunks and passing it through several filter neurons. Some of these neurons detect only drastic changes in color, while other neurons detect only vertical edges or horizontal edges.

Depth information is conveyed in many different ways. Static depth cues include interposition, brightness, size, linear perspective, and texture gradients. Motion depth cues come from the effect of motion parallax, where objects closer to the viewer appear to move more rapidly against the background when the head is moved back and forth. Physiological depth cues convey information in two distinct ways -- accommodation, which is how the lens of each eye changes shape when focusing on objects at different distances, and convergence, which is a measure of how far our eyes must turn inward when looking at objects closer than about 20 feet. We obtain stereoscopic cues by extracting depth information from the differences between the left and right views coming from each of our eyes.
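To make the convergence cue concrete, here is a minimal sketch (in Python, not part of the original article; the 6.5 cm interpupillary distance is an assumed average) that computes the total convergence angle for an object straight ahead at several distances. The angle shrinks to a fraction of a degree past roughly 20 feet, which is why convergence is only a useful cue for nearby objects.

    import math

    def convergence_angle_deg(object_distance_m, ipd_m=0.065):
        """Total angle (degrees) the two eyes rotate inward from parallel gaze
        to fixate an object straight ahead at the given distance.
        ipd_m is the interpupillary distance (roughly 6.5 cm on average)."""
        half_angle = math.atan((ipd_m / 2.0) / object_distance_m)
        return 2.0 * math.degrees(half_angle)

    # Convergence is a strong cue only for nearby objects:
    for d in (0.3, 1.0, 6.0, 20.0):   # distances in meters (6 m is about 20 feet)
        print(f"{d:5.1f} m -> {convergence_angle_deg(d):5.2f} degrees")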

Our sense of visual immersion in VR comes from several factors, including field of view, frame refresh rate, and eye tracking. A limited field of view can result in a tunnel-vision feeling. Frame refresh rates must be high enough to allow our eyes to blend the individual frames into the illusion of motion, and to limit the sense of latency between movements of the head and body and regeneration of the scene. Eye tracking can address the fact that users do not always look in the direction their head is oriented. Eye tracking can also help reduce computational load when rendering frames, since we could render in high resolution only where the eyes are looking.
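The sketch below (Python; the 5-degree foveal region and the falloff constants are illustrative assumptions, not figures from this article) shows one simple policy for the idea just mentioned: scale rendering resolution with angular distance from the tracked gaze point, spending full detail only where the eyes are actually fixating.

    def resolution_scale(angle_from_gaze_deg, fovea_deg=5.0, periphery_deg=30.0):
        """Hypothetical foveated-rendering policy: full resolution inside an
        assumed ~5 degree region around the tracked gaze direction, falling
        off linearly to a coarse periphery."""
        if angle_from_gaze_deg <= fovea_deg:
            return 1.0                      # full resolution at the fovea
        if angle_from_gaze_deg >= periphery_deg:
            return 0.25                     # coarse resolution far from the gaze point
        t = (angle_from_gaze_deg - fovea_deg) / (periphery_deg - fovea_deg)
        return 1.0 - 0.75 * t               # linear blend between the two regions

    for angle in (0, 5, 10, 20, 30, 45):
        print(f"{angle:2d} deg from gaze -> render at {resolution_scale(angle):.2f}x resolution")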

The sense of virtual immersion is usually achieved via some means of position and orientation tracking. The most common means of tracking include optical, ultrasonic, electromagnetic, and mechanical. All of these means have been used on various head mounted display (HMD) devices. HMDs come in three basic varieties: stereoscopic, monocular, and head coupled. The earliest stereoscopic HMD was Ivan Sutherland's Sword of Damocles, built in 1968 while he was at Harvard. It got its name from the large mechanical position-sensing arm which hung from the ceiling and made the device ungainly to wear. NASA has built several HMDs, chiefly using LCD displays which had poor resolution. The University of North Carolina has also built several HMDs using such items as LCD screens, magnifying optics, and bicycle helmets. VPL Research's EyePhone series were the first commercial HMDs. A good example of a monocular HMD is the Private Eye by Reflection Technologies of Waltham, MA. This unit is just 1.2 x 1.3 x 3.5 inches and is suspended by a lightweight headband in front of one eye. The wearer sees what appears to be a 12-inch monitor floating in midair about 2 feet in front of them. The BOOM is a head-coupled HMD developed at NASA's Ames Research Center. The BOOM uses two 2.5-inch CRTs mounted within a small black box that has a hand grip on each side and is attached to the end of an articulated, counterbalanced arm that serves as the position sensor.

Our Auditory System

Our ears form the most visible part of our auditory system, guiding sound waves into the auditory canal. The canal itself enhances the sounds we hear and directs them onto the ear drum. The ear drum converts the sound waves into mechanical vibrations. In the middle ear three tiny bones, the hammer, anvil, and stirrup, form a bridge across an air void and amplify slight sounds by a factor of 30. The stirrup rotates away and the ear drum tightens to inhibit loud sounds. The inner ear translates these mechanical vibrations into electrochemical signals for processing by the brain. The electrochemical signals are conveyed to the brain by the auditory nerve. Sounds detected by both ears are processed by what are called binaural cells.

Our sense of sound localization comes from three different cues. Interaural time difference is a measure of the difference in time between when a sound reaches our left ear and when it reaches our right ear. Interaural intensity difference is a measure of how much louder a sound is in the ear nearer the source than in the farther ear. Acoustic shadow is the effect of higher-frequency sounds being blocked by objects between us and the sound's source.
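The interaural time difference lends itself to a quick back-of-the-envelope calculation. The sketch below (Python) uses the classic spherical-head approximation attributed to Woodworth with an assumed head radius of about 8.75 cm; neither the formula nor the numbers come from this article. It shows that even for a sound directly to one side, the arrival times at the two ears differ by well under a millisecond.

    import math

    SPEED_OF_SOUND = 343.0   # meters per second in air
    HEAD_RADIUS = 0.0875     # meters, an assumed average head radius

    def interaural_time_difference_s(azimuth_deg):
        """Spherical-head (Woodworth) estimate of the interaural time difference,
        in seconds, for a distant source at the given azimuth
        (0 = straight ahead, 90 = directly to one side)."""
        theta = math.radians(azimuth_deg)
        return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

    for az in (0, 15, 45, 90):
        print(f"azimuth {az:3d} deg -> ITD = {interaural_time_difference_s(az) * 1e6:6.1f} microseconds")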

In VR systems computer-generated sound comes in several different forms. The use of stereo sound adds some level of sound feedback to the VR environment, but does not correctly resemble the real world. When using 3D sound, we can "place" sounds within the simulated environment using the sound localization cues described above. A 3D sound system usually begins by recording the differences in the sound that reaches each of our ears, by placing microphones at each ear. The recordings are then used to produce what is called a head-related transfer function (HRTF). These HRTFs are used during playback of recorded sounds to effectively place them within a 3D environment. A virtual sound system requires not only the same sound localization cues but must change and react in real time to move those sounds around within the 3D environment. An example of a 3D sound system is the Convolvotron, developed by Crystal River Engineering. This system convolves analog audio source material with the HRTFs, creating a startlingly realistic 3D sound effect. Another system, called the Virtual Audio Processing System (VAPS), mixes noninteractive binaural recording techniques and Convolvotron-like signal processing to produce both live and recorded 3D sound fields. A recent development is the attempt at performing what is called aural ray tracing, which is similar to the light ray tracing found in computer graphics.
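To illustrate the playback step in code, here is a minimal sketch (Python with NumPy) of HRTF-style spatialization: a mono source is convolved with a left-ear and a right-ear impulse response to produce a binaural signal. The impulse responses below are made-up placeholders that only crudely mimic the time and intensity differences for a source off to the listener's left; they are not measured HRTFs, and this is not the Convolvotron's actual implementation.

    import numpy as np

    def spatialize(mono_signal, hrir_left, hrir_right):
        """Convolve a mono source with left- and right-ear head-related impulse
        responses to produce a two-channel (binaural) signal."""
        left = np.convolve(mono_signal, hrir_left)
        right = np.convolve(mono_signal, hrir_right)
        n = max(len(left), len(right))
        out = np.zeros((n, 2))
        out[:len(left), 0] = left
        out[:len(right), 1] = right
        return out

    # Placeholder impulse responses: the right ear hears a delayed, attenuated
    # copy, roughly mimicking ITD and IID for a source on the listener's left.
    hrir_left = np.array([1.0, 0.3, 0.1])
    hrir_right = np.concatenate([np.zeros(30), [0.6, 0.2, 0.05]])  # ~0.7 ms delay at 44.1 kHz
    source = np.random.randn(44100)                                 # one second of noise
    binaural = spatialize(source, hrir_left, hrir_right)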

Our Tactile System

Our sense of touch comes from what is called the haptic or tactile system. The tactile system relays information about touch via two different mechanisms. Receptors in the skin provide information about shape, texture, and temperature. Proprioceptive feedback conveys information about touch via overall muscle interactions. These muscle interactions can inform the brain about the gross shape of an object, the sensing of movement or resistance to movement, the weight of an object, and the firmness of an object. This touch information is conveyed to the brain by both slowly adapting fibers and rapidly adapting fibers.

In VR systems, tactile and force feedback devices seek to emulate the tactile cues our haptic system relays to our brains. Several examples of force feedback devices have been built. The Argonne Remote Manipulator provided force feedback via a mechanical arm assembly and many tiny motors. This device was used in the molecular docking simulations of the GROPE system. The Portable Dextrous Master used piston-like cylinders mounted on ball joints to pass force feedback from a robot's gripper to the operator's hand. The TeleTact system used two gloves -- one glove acquired touch data via an array of force-sensitive resistors and relayed that information to a second glove, which provided feedback via many small air bladders that inflated at the corresponding pressure points to simulate the touch information.
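One simple way such a device can decide how hard to push back, not described in this article but commonly used in haptic rendering, is a spring law on penetration depth: the deeper the user's virtual fingertip sinks into a virtual surface, the larger the opposing force, up to the hardware's limit. The sketch below (Python, with made-up stiffness and force values) illustrates the idea.

    def contact_force_n(penetration_m, stiffness_n_per_m=500.0, max_force_n=10.0):
        """Spring-law haptic rendering: push back with a force proportional to
        how far the virtual fingertip has penetrated a virtual surface,
        clamped to what the hardware can safely exert."""
        if penetration_m <= 0.0:
            return 0.0                              # no contact, no force
        return min(stiffness_n_per_m * penetration_m, max_force_n)

    # A fingertip 4 mm inside a virtual wall -> command about 2 N of resistance.
    print(contact_force_n(0.004))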

In many VR systems the ubiquitous data glove plays the same role as that of the mouse in modern computer systems. The VPL Data Glove is perhaps the most well known. It used a series of fiber optic cables to detect the bending of fingers and used magnetic sensors for position and orientation tracking of the hand. Mattel's PowerGlove is the most popular amongst VR system hackers, due to its low cost, despite being discontinued by the manufacturer several years ago. This glove used electrically resistive ink to sense finger position and used ultrasonic sensors for detecting hand orientation.
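As a rough idea of how raw glove readings become usable input, the sketch below (Python; the two-pose calibration scheme and all of the numbers are illustrative assumptions, not the actual VPL or Mattel processing) linearly maps a flex-sensor reading onto an approximate finger bend angle, using readings captured with the hand held flat and with the hand clenched into a fist.

    def finger_bend_deg(raw, flat_raw, fist_raw, max_angle_deg=90.0):
        """Map a raw flex-sensor reading to an approximate bend angle using two
        calibration poses: hand flat (0 degrees) and hand in a fist (max angle)."""
        span = fist_raw - flat_raw
        if span == 0:
            return 0.0
        t = (raw - flat_raw) / span
        t = min(max(t, 0.0), 1.0)        # clamp to the calibrated range
        return t * max_angle_deg

    # A mid-range reading is interpreted as a half-bent finger (about 45 degrees).
    print(finger_bend_deg(raw=140, flat_raw=100, fist_raw=180))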

Summary

VR systems both hold much promise and have many problems. These problems include poor display resolution, limited field of view, visual latency, and position tracking latency. The visual and position tracking latencies often lead to a common problem known as visually induced motion sickness (VIMS). Many users of VR systems will experience VIMS within about 15 minutes of donning an HMD. VIMS occurs because our experience of the VR world is mostly visual and aural, but the movement we perceive is not reflected within our bodies. Visual latency is directly related to the attainable frame refresh rate. Complex scenes with many thousands of polygons tend to lower the attainable frame refresh rate, and hence increase the minimum time before a detected movement can be reflected in an updated scene. Display size and resolution can also affect the frame refresh rate. Having more pixels on a given display increases the amount of time needed to update a frame, but having lower resolution displays spoils the illusion of reality. Other problems come from what is called display mismatch. When the user reaches out to grab a virtual ball and grabs thin air instead, the illusion of reality fails. The user needs some sense of tactile feedback when interacting with the VR environment. Most of these problems will be solved in the future by improved technology. But until then, VR is finding success in such applications as entertainment systems, architectural walk-throughs, and scientific visualization.
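The relationship between scene complexity, frame rate, and perceived lag can be made concrete with a little arithmetic. The sketch below (Python; the 30 ms tracker latency and the frame rates are assumed, illustrative values, not measurements from this article) adds the tracker's reporting delay to the worst-case wait for the next frame, showing how dropping from 60 to 15 frames per second roughly doubles the total latency for the same tracker.

    def worst_case_latency_ms(tracker_latency_ms, frame_rate_hz):
        """Back-of-the-envelope estimate: a head movement must wait for the tracker
        to report it and, in the worst case, for one full frame to be redrawn
        before the scene reflects it.  Deeper rendering pipelines are ignored."""
        frame_time_ms = 1000.0 / frame_rate_hz
        return tracker_latency_ms + frame_time_ms

    for fps in (60, 30, 15):   # a complex scene drags the attainable frame rate down
        print(f"{fps:2d} fps -> ~{worst_case_latency_ms(30.0, fps):5.1f} ms worst-case latency")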

References

Artificial Reality, Myron W. Krueger, Addison Wesley, 1983

Silicon Mirage, Steve Aukstakalnis and David Blatner, Peachpit Press, 1992

Virtual Reality: Through the New Looking Glass, Ken Pimentel and Kevin Teixeira, Windcrest Books, 1992

1993 IEEE Virtual Reality Annual International Symposium

AI Expert magazine, Miller Freeman Inc.

Virtual Reality World magazine, Mecklermedia

news:sci.virtual-worlds

NCSA's VR Home Web Page

Sunsite's Wealth of VR Info



lingard@wpi.wpi.edu