Human-Computer Interaction (HCI), as the name suggests, is concerned with humans and computers and the way the two interact with each other. In this survey we review the major approaches to multimodal human-computer interaction from a computer vision perspective. In particular, we focus on the core vision techniques (body, gesture, gaze) and affective interaction techniques (facial expression recognition and emotion in audio) that are needed for Multimodal Human Computer Interaction (MMHCI) research. Since MMHCI is a very dynamic and broad research area, we do not intend to present a complete report. The main contribution of this survey, therefore, is to consolidate some of the main issues and approaches, and to highlight some of the techniques and applications developed recently within the context of MMHCI. We also give an idea of how HCI may evolve in the future: from advanced HCI techniques to improved HCI devices, we describe how HCI could change dramatically in the coming years. The advanced techniques discussed include richer GUI-related gestures, display (VDU) fabrics and related devices that can simplify a person's interaction with a computer.
I. INTRODUCTION
Human-computer interaction (HCI) is the study of interaction between people (users) and computers. It is often regarded as the intersection of computer science, behavioural sciences, design and several other fields of study. Interaction between users and computers occurs at the user interface (or simply interface), which includes both software and hardware; for example, characters or objects displayed by software on a personal computer’s monitor, input received from users via hardware peripherals such as keyboards and mice, and other user interactions with large-scale computerized systems such as aircraft and power plants. Because human-computer interaction studies a human and a machine in conjunction, it draws from supporting knowledge on both the machine and the human side. On the machine side, techniques in computer graphics, operating systems, programming languages, and development environments are relevant. On the human side, communication theory, graphic and industrial design disciplines, linguistics, social sciences, cognitive psychology, and human factors are relevant. Attention to human-machine interaction is important, because poorly designed human-machine interfaces can lead to many unexpected problems.
A Multimodal Human Computer Interaction (MMHCI) system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). MMHCI lies at the crossroads of several research areas including computer vision, psychology, artificial intelligence, and many others. As computers become integrated into everyday objects (ubiquitous and pervasive computing), effective natural human-computer interaction becomes critical: in many applications, users need to be able to interact naturally with computers the way face-to-face human-human interaction takes place. Human-computer interaction techniques must keep pace with fast-growing technology. Our report describes some of the uses and areas of interest in which human interaction with computers can be applied, and addresses the question of how human-computer interaction might look in the future. The aim of this report is to reflect upon the changes afoot and outline a new paradigm for understanding our relationship with technology. A more extensive set of lenses, tools and methods is needed that puts human values center stage. Here, both positive and negative aspects need to be considered: people use technology to pursue healthier and more enjoyable lifestyles, expand their creative skills with digital tools, and instantly gain access to information whenever it is needed.
II. MMHCI Implementation
As depicted in Figure 1, multimodal techniques can be used to construct a variety of interfaces. Of particular interest for our goals are perceptual and attentive interfaces. Perceptual interfaces, as defined in the literature, are highly interactive, multimodal interfaces that enable rich, natural, and efficient interaction with computers. Perceptual interfaces seek to leverage sensing (input) and rendering (output) technologies in order to provide interactions not feasible with standard interfaces and common I/O devices such as the keyboard, the mouse and the monitor. Attentive interfaces, on the other hand, are context-aware interfaces that rely on a person's attention as the primary input; the goal of these interfaces is to use gathered information to estimate the best time and approach for communicating with the user. We communicate through speech and use body language (posture, gaze and hand motions) to express emotion, mood, attitude, and attention. A multimodal HCI system is simply one that responds to inputs in more than one modality or communication channel (e.g., speech, gesture, writing, and others). Some of the MMHCI techniques are listed below; a minimal sketch of combining such modality-specific inputs follows the list:
A. Core vision techniques
B. Affective computer interaction
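As an illustration of how inputs from more than one modality can be combined, the sketch below shows a simple decision-level ("late") fusion scheme: each modality-specific recognizer is assumed to output a probability distribution over the same set of commands, and the distributions are merged by a weighted average. The modality names, class labels, and weights are illustrative assumptions, not part of any particular system described here.

```python
# Hypothetical decision-level ("late") fusion of modality-specific recognizers.
# Each recognizer is assumed to output a probability distribution over the same
# set of target classes; names and weights below are illustrative only.
import numpy as np

def late_fusion(modality_scores, weights=None):
    """Combine per-modality class probabilities with a weighted average.

    modality_scores: dict mapping modality name -> array of class probabilities
    weights:         dict mapping modality name -> confidence weight (optional)
    """
    names = list(modality_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    stacked = np.stack([np.asarray(modality_scores[n], dtype=float) for n in names])
    w = np.array([weights[n] for n in names])[:, None]
    fused = (stacked * w).sum(axis=0) / w.sum()
    return fused / fused.sum()          # renormalize to a distribution

# Example: three hypothetical recognizers voting over the commands
# ["select", "scroll", "dismiss"].
fused = late_fusion(
    {"speech":  [0.70, 0.20, 0.10],
     "gesture": [0.30, 0.60, 0.10],
     "gaze":    [0.50, 0.40, 0.10]},
    weights={"speech": 0.5, "gesture": 0.3, "gaze": 0.2},
)
print("fused distribution:", fused, "-> decision index:", int(np.argmax(fused)))
```

Late fusion is only one design choice; feature-level (early) fusion, which concatenates features before classification, is the usual alternative.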
A. Core Vision Techniques
We classify vision techniques for MMHCI using a human-centered approach and divide them according to how humans may interact with the system:
(1) large-scale body movements, (2) gestures, and (3) gaze.
1) Large-Scale Body Movements
Tracking of large-scale body movements (head, arms, torso, and legs) is necessary to interpret pose and motion in many MMHCI applications. Three important issues arise in articulated motion analysis: representation (joint angles or the motion of all the sub-parts), computational paradigms (deterministic or probabilistic), and computation reduction. Body posture analysis is important in many MMHCI applications; for example, a stereo and thermal infrared video system has been used to estimate driver posture for the deployment of smart air bags, and a learning-based method has been proposed for recovering articulated body pose without initialization and tracking. The pose and velocity vectors are used to recognize body parts and detect different activities, while temporal templates are also often used. Important issues for large-scale body tracking include whether the approach uses 2D or 3D, desired accuracy, speed, occlusion and other constraints. Some of the issues pertaining to gesture recognition, discussed next, can also apply to body tracking.
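As a concrete illustration of the temporal templates mentioned above, the following minimal sketch maintains a Motion History Image (MHI) in plain NumPy. The frame source, image size, and thresholds are illustrative assumptions; a real system would consume camera frames and feed the resulting template to an activity classifier.

```python
# A minimal temporal-template (Motion History Image) sketch in NumPy,
# one classic way of summarizing large-scale body motion over time.
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30.0, diff_thresh=25):
    """Update a Motion History Image with one new grayscale frame.

    Pixels where motion is detected are set to tau (the history length);
    all other pixels decay by 1 per frame, so bright regions indicate
    recent movement and darker regions older movement.
    """
    motion = np.abs(frame.astype(int) - prev_frame.astype(int)) > diff_thresh
    return np.where(motion, tau, np.maximum(mhi - 1.0, 0.0))

# Example with synthetic frames (a bright block "moving" to the right).
h, w, tau = 120, 160, 30.0
mhi = np.zeros((h, w))
prev = np.zeros((h, w), dtype=np.uint8)
for t in range(10):
    frame = np.zeros((h, w), dtype=np.uint8)
    frame[40:80, 10 + 5 * t: 50 + 5 * t] = 255   # moving region
    mhi = update_mhi(mhi, prev, frame, tau)
    prev = frame
print("MHI max/min:", mhi.max(), mhi.min())       # recent motion -> values near tau
```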
2) Gesture Recognition
Psycholinguistic studies for human-to-human communication describe gestures as the critical link between our conceptualizing capacities and our linguistic abilities. Humans use a very wide variety of gestures ranging from simple actions of using the hand to point at objects to the more complex actions that express feelings and allow communication with others. Gestures should therefore play an essential role in MMHCI. A major motivation for these research efforts is the potential of using hand gestures in various applications aiming at natural interaction between the human and the computer-controlled interface.
There are several important issues that should be considered when designing a gesture recognition system. The first phase of a recognition task is choosing a mathematical model that may consider both the spatial and the temporal characteristics of the hand and hand gestures. The approach used for modeling plays a crucial role in the nature and performance of gesture interpretation. Once the model is selected, an analysis stage is required for computing the model parameters from the features that are extracted from single or multiple input streams. These parameters represent some description of the hand pose or trajectory and depend on the modeling approach used. Most gesture-based HCI systems allow only symbolic commands based on hand posture or 3D pointing. This is due to the complexity associated with gesture analysis and the desire to build real-time interfaces.
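To make the trajectory-analysis step concrete, the sketch below classifies a 2D hand trajectory by comparing it against stored gesture templates with dynamic time warping (DTW), one common template-matching approach for gestures; the gesture names and trajectories here are purely illustrative.

```python
# A minimal sketch of template-based gesture classification using dynamic
# time warping (DTW) over 2-D hand trajectories (illustrative data only).
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two trajectories of shape (T, 2)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify_gesture(trajectory, templates):
    """Return the template label with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(trajectory, templates[label]))

# Toy templates: a horizontal "swipe" and a vertical "lift".
t = np.linspace(0, 1, 20)
templates = {
    "swipe": np.stack([t, np.zeros_like(t)], axis=1),
    "lift":  np.stack([np.zeros_like(t), t], axis=1),
}
observed = np.stack([t * 0.9 + 0.02, 0.05 * np.sin(6 * t)], axis=1)  # noisy swipe
print(classify_gesture(observed, templates))   # -> "swipe"
```

DTW handles variations in gesture speed; statistical models such as HMMs are the usual alternative when more training data is available.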
3) Gaze Detection
Gaze, defined as the direction to which the eyes are pointing in space, is a strong indicator of attention, and it has been studied extensively since as early as 1879 in psychology, and more recently in neuroscience and in computing applications. While early eye tracking research focused only on systems for in-lab experiments, many commercial and experimental systems are available today for a wide range of applications. Eye tracking systems can be grouped into wearable or non-wearable, and infrared-based or appearance-based. In infrared-based systems, a light shining on the subject whose gaze is to be tracked creates a red-eye effect: the difference in reflection between the cornea and the pupil is used to determine the direction of sight. In appearance-based systems, computer vision techniques are used to find the eyes in the image and then determine their orientation.
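As a rough illustration of the appearance-based approach, the sketch below locates eye regions with an OpenCV Haar cascade and approximates each pupil centre as the darkest point within the region. The input image path is an assumption, and a real gaze tracker would add calibration and an explicit orientation estimate; this is only a heuristic starting point.

```python
# A minimal appearance-based sketch: find the eyes with a Haar cascade and
# approximate each pupil centre as the darkest point in the eye region.
import cv2

eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def pupil_centres(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    centres = []
    for (x, y, w, h) in eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        eye = cv2.GaussianBlur(gray[y:y + h, x:x + w], (7, 7), 0)
        # The pupil is usually the darkest blob in the eye region.
        _, _, min_loc, _ = cv2.minMaxLoc(eye)
        centres.append((x + min_loc[0], y + min_loc[1]))
    return centres

image = cv2.imread("face.jpg")            # assumed input image
if image is not None:
    print("approximate pupil centres:", pupil_centres(image))
```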
The main issues in developing gaze tracking systems are intrusiveness, speed, robustness, and accuracy. The type of hardware and algorithms necessary, however, depend highly on the level of analysis desired. Gaze analysis can be performed at three different levels: (a) highly detailed low-level micro-events, (b) low-level intentional events, and (c) coarse-level goal-based events. Micro-events include micro-saccades, jitter, nystagmus, and brief fixations, which are studied for their physiological and psychological relevance by vision scientists and psychologists. Low-level intentional events are the smallest coherent units of movement that the user is aware of during visual activity, which include sustained fixations and revisits. Although most of the work in HCI has focused on coarse-level goal-based events, it is easy to foresee the importance of analysis at lower levels, particularly to infer the user's cognitive state in affective interfaces. Within this context, an important issue that is often overlooked is how to interpret eye tracking data.
B. Affective Human-computer Interaction
Affective states are intricately linked to other functions such as attention, perception, memory, decision-making, and learning. This suggests that it may be beneficial for computers to recognize the user’s emotions and other related cognitive states and expressions. The techniques used in this context are:
1. Facial expression recognition
2. Emotions in audio
1) Facial Expression Recognition
Expressions are classified into a predetermined set of categories. Some methods follow a feature-based approach, in which specific features such as the corners of the mouth or the eyebrows are detected and tracked. Other methods use a region-based approach, in which facial motions are measured in certain regions of the face such as the eye/eyebrow and the mouth. In addition, we can distinguish two types of classification schemes: dynamic and static. Static classifiers (e.g., Bayesian networks) classify each frame in a video into one of the facial expression categories based on that frame alone. Dynamic classifiers use several video frames and perform classification by analyzing the temporal patterns of the regions analyzed or features extracted. Dynamic classifiers are very sensitive to appearance changes in the facial expressions of different individuals, so they are more suited to person-dependent experiments. Static classifiers, on the other hand, are easier to train and in general need less training data, but when used on a continuous video sequence they can be unreliable, especially for frames that are not at the peak of an expression.
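The sketch below illustrates the static, per-frame scheme with a generic SVM classifier trained on synthetic per-frame feature vectors that stand in for landmark-based measurements; the feature dimensions, class labels, and data are illustrative assumptions.

```python
# A minimal sketch of a *static* facial-expression classifier: each frame is
# described by a feature vector (e.g., distances between tracked facial
# landmarks, assumed precomputed) and labelled independently of its neighbours.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_frames, n_features = 600, 12
X = rng.normal(size=(n_frames, n_features))   # stand-in for landmark features
y = rng.integers(0, 3, size=n_frames)         # 3 expression classes, e.g. neutral/smile/frown
X[y == 1, 0] += 2.0                           # inject some class structure
X[y == 2, 1] -= 2.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print("per-frame accuracy:", clf.score(X_test, y_test))

# A dynamic classifier would instead look at sequences of frames, e.g. by
# feeding windows of consecutive feature vectors to a temporal model.
```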
2) Emotions in Audio
Researchers use a variety of methods to analyze emotions in audio. One approach is to classify emotions into discrete categories such as joy, fear, love, surprise, sadness, etc., using different modalities as audio inputs to emotion recognition models. The vocal aspect of a communicative message carries various kinds of information. If we disregard the manner in which a message is spoken and consider only the textual content, we are likely to miss important aspects of the utterance and might even completely misunderstand the meaning of the message. Recent studies tend to use Ekman's six basic emotions, although others in the past have used many more categories. The reasons for using these basic categories are often not justified, since it is not clear whether there exist universal emotional characteristics in the voice for these six categories. The most surprising issue regarding the multimodal affect recognition problem is that, although recent advances in video and audio processing could make the multimodal analysis of human affective state tractable, only a few research efforts have tried to implement a multimodal affective analyzer.
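As an illustration of the audio side, the sketch below extracts simple prosodic and spectral features (pitch, energy, MFCCs) from an utterance with librosa and indicates how a discrete emotion classifier could be trained on top of them; the file path, label set, and corpus are hypothetical, since real systems rely on labelled emotional-speech databases.

```python
# A minimal sketch of prosodic/spectral feature extraction for emotion in
# speech, followed by a generic classifier (illustrative assumptions only).
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(path):
    """Summarize one utterance as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)   # pitch contour
    energy = librosa.feature.rms(y=y)[0]                      # loudness contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral shape
    return np.concatenate([
        [np.mean(f0), np.std(f0)],
        [np.mean(energy), np.std(energy)],
        mfcc.mean(axis=1), mfcc.std(axis=1),
    ])

# With a labelled corpus (paths and labels below are hypothetical), a discrete
# emotion classifier can then be trained in the usual way:
# X = np.stack([utterance_features(p) for p in wav_paths])
# clf = SVC(kernel="rbf").fit(X, labels)   # labels in {"joy", "fear", ...}
```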
III. The Future of HCI
What can the HCI community do to intervene and help? How can it build on what it has achieved? In this part we map out some fundamental changes that we suggest need to occur within the field. Specifically, we suggest that HCI needs to extend its methods and approaches so as to focus more clearly on human values. This will require a more sensitive view of the role, function and consequences of design, just as it will force HCI to be more inventive. HCI will need to form new partnerships with other disciplines, too, and for this to happen HCI practitioners will need to be sympathetic to the tools and techniques of other trades.
A. GUIs to Gestures
In the last few years, new input techniques have been developed that are richer and less prone to the many shortcomings of keyboard and mouse interaction. For example, tablet computers use stylus-based interaction on a screen, and paper-based systems digitally capture markings made on specialized paper using a camera embedded in a pen. These developments support interaction through sketching and handwriting. Speech-recognition systems also support a different kind of 'natural' interaction, allowing people to issue commands and dictate through voice. Meanwhile, multi-touch surfaces enable interaction with the hands and the fingertips on touch-sensitive surfaces, allowing us to manipulate objects digitally as if they were physical. From GUIs to multi-touch, speech to gesturing, the ways we interact with computers are diversifying as never before (see Fig 3). Two-handed and multi-fingered input is providing a more natural and flexible means of interaction beyond the single point of contact offered by either the mouse or stylus. The shift to multiple points of input also supports novel forms of interaction where people can share a single interface by gathering around it and interacting together (see the 'Reactable', Fig 2 below).
Fig 2: The Reactable: a multitouch interface for playing music. Performers can simultaneously interact with it by moving and rotating physical objects on its surface.
Fig 3: The Hot Hand device: a ring worn by electric guitar players that uses motion sensors and a wireless transmitter to create different kinds of sound effects by various hand gestures.
B. VDUs to Smart Fabrics
The fixed video display units (VDUs) of the 1980s are being superseded by a whole host of flexible display technologies and 'smart fabrics'. Displays are being built in all sizes, from the tiny to the gigantic, and will soon become part of the fabric of our clothes and our buildings. Within a decade or so, these advances are likely to have revolutionized the form that computers take.
Recent advances in Organic Light Emitting Diodes (OLEDs) (see Fig 4) and plastic electronics are enabling displays to be made much more cheaply, with higher resolution and lower power consumption, some without requiring a backlight to function. An OLED display uses an emissive electroluminescent layer made from a film of organic compounds, enabling a matrix of pixels to emit light of different colors. Plastic electronics also use organic materials to create very thin semi-conductive transistors that can be embedded in all sorts of materials, from paper to cloth, enabling, for example, the paper in books or newspapers to be digitised. Electronic components and devices, such as Micro-Electro-Mechanical Systems (MEMS), are also being made at an extremely small size, allowing for very small displays.
Fig 4: Animated Textiles developed by Studio at the Hexagram Institute, Montreal, Canada. The two jackets sync up when the wearers hold hands, and a message scrolls from the back of one person to the other.
C. Hard Disks to Digital Footprints
People are beginning to talk about their ever-growing digital footprints. Part of the reason for this is that the limits of digital storage are no longer a pressing issue. Storage is all around us, costing next to nothing, from ten-a-penny memory sticks and cards to vast digital Internet data banks that are freely available for individuals to store their photos, videos, emails and documents (see Fig 5). The decreasing cost and increasing capacity of digital storage also go hand-in-hand with new and cheap methods for capturing, creating and viewing digital media. The effect on our behaviour has been quite dramatic: people are taking thousands of pictures rather than hundreds each year. They no longer keep them in shoeboxes or stick them in albums but keep them as ever-growing digital collections, often online.
Fig 5: The Rovio robot connected to the Internet. It roams around the home, providing an audio and video link to keep an eye on family or pets when you are out.
D. Changing Lives
Within a decade or so, more people than ever will be using computing devices of one form or another, be they a retiree in Japan, a schoolchild in Italy or a farmer in India (see Fig 6). At the same time, each generation will have its own set of demands. Technology will continue to have an important impact at all stages of life.
Fig 6: A boy using a digitally augmented probe tool that shows real-time measurements of light and moisture on an accompanying mobile device.
E. New Ways of Family Living
New technologies are proliferating that enable people to lead their own busy social and working lives while taking an active part in their family life. A number of computer applications have been developed to enable family members to keep an eye on one another, from the Family Locator feature on the Disney cell phone (which allows parents to display the location of a child’s handset on a map) to devices that can be installed in cars to track their location and speed. In the next decade or two, we will witness many changes in family life brought about by technology, changes that will in turn spark new forms of digital tools. Such changes will of course have a broader impact on societal and ethical issues that is difficult to predict (see Fig 7).
Fig 7: Audiovox's Digital Message Center is designed to be attached to the refrigerator, letting families scribble digital notes and leave audio and video messages for each other.
IV. Conclusion
We have highlighted major vision approaches for multimodal human-computer interaction. We discussed techniques for large-scale body movement, gesture recognition, and gaze detection, as well as facial expression recognition, emotion analysis and a variety of emerging applications. Another important issue is the affective aspect of communication that should be considered when designing an MMHCI system. Emotion modulates almost all modes of human communication: facial expression, gestures, posture, tone of voice, choice of words, respiration, skin temperature and clamminess, and so on.
Emotions can significantly change the message: often it is not what was said that is most important, but how it was said. How we define and think about our relationships with computers is radically changing. How we use them and rely on them is also being transformed. At the same time, we are becoming hyper-connected, and our interactions are increasingly etched into our digital landscapes. There is more scope than ever before to solve hard problems and allow new forms of creativity. We have begun to raise the issues and concerns that these transformations bring. Some will be within the remit of Human-Computer Interaction to address and others will not.