Applied Mathematics & Information Sciences

In this paper, we propose an automatic system for recognizing continuous gestures in real time, including Arabic numbers (0-9) and alphabets (A-Z). We present an improved method for hand area detection and segmentation based on a mixed YCbCr and HSI skin-color space, and an improved CAMSHIFT algorithm is used for hand tracking. Orientation dynamic features are obtained from spatiotemporal trajectories and then quantized to generate code words. An improved HMM-FNN model is proposed for gesture recognition based on the code words; it combines the ability of the HMM to model temporal data with that of the fuzzy neural network to model fuzzy rules and perform fuzzy inference. The presented algorithm achieves better performance, with average recognition rates of 95.76% and 93.64% for Arabic numbers and alphabets, respectively.


Introduction
Rapid technological change has played a dominant role in all disciplines of science and technology. Virtual reality technologies, which give humans the sensation of being immersed in a computer-generated world, have been a popular research field for many years. Hand gesture recognition is an active area of research in the vision community, mainly for the purposes of sign language recognition and Human-Computer Interaction (HCI).
Gesture and posture recognition are application areas in HCI for communicating with computers. A gesture is a spatiotemporal pattern that may be static, dynamic or both. Static hand shapes are called postures, and hand movements are called gestures. In gesture recognition, Yoon et al. [1] developed a hand gesture system in which a combination of location, angle and velocity is used for recognition. Liu et al. [2] developed a system to recognize the 26 letters of the alphabet using different HMM topologies. Hunter et al. [3] used HMMs for recognition, with Zernike moments as image features for sequences of hand gestures.
In the last decade, several methods with potential applications in advanced gesture interfaces for HCI have been suggested, but they differ from one another in their models. Some of these models are Neural Networks [4], Hidden Markov Models (HMM) [5] and Fuzzy Systems [6]. The HMM is one of the most successful and widely used tools for modeling signals with spatiotemporal variability [7].
In this paper, we present a new skin-color segmentation method that combines the nonlinear YCbCr elliptic cluster skin-color model with the HSI skin-color segmentation model. We estimate the principal gesture plane using the Least Squares Method and classify gestures using HMM-FNN: the likelihood of each HMM for the observation sequence is taken as a membership value of the FNN, and the gesture is classified through the fuzzy inference of the FNN. With the proposed algorithm we achieve better recognition results for continuous gesture trajectories.

Hand Gesture Segmentation Algorithm
In color images, skin-color information is a very important characteristic of the human face. Research shows that, across different races, ages and genders, the difference in chrominance is far smaller than the difference in brightness. Once the influence of luminance is removed, skin colors form a cluster in the skin-color space.
Normally, in order to reduce the impact of brightness, we use the nonlinear YCbCr elliptic cluster skin-color segmentation model, in which the illumination is concentrated in a single component (Y) while the color is contained in the blue (Cb) and red (Cr) chrominance components. Cb and Cr are defined as the difference between the blue component and a reference value and the difference between the red component and a reference value, respectively. We ignore the Y channel in order to reduce the effect of brightness variation and use only the chrominance channels, which fully represent the color information. Studies of a large number of color pixels show that skin colors cluster in a very small range of the CbCr color space. From normalized chrominance distribution maps, we find that different skin colors follow the same 2D Gaussian model. This method can detect skin-color regions accurately. The YCbCr color space decomposes the RGB color into luminance and chrominance information through a conversion formula. Nevertheless, because illumination changes and complex backgrounds can resemble skin color, this method may still classify skin regions as non-skin and non-skin regions as skin [15].
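As an illustration of the RGB to YCbCr decomposition and of chrominance-only skin detection, the following is a minimal sketch. It assumes the common full-range ITU-R BT.601 (JPEG) coefficients and uses hypothetical rectangular Cb/Cr thresholds in place of the nonlinear elliptic cluster model; the function names and threshold values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (..., 3) RGB array (0-255) to Y, Cb, Cr.

    Assumes full-range BT.601 (JPEG) coefficients.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def skin_mask_cbcr(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Binary skin mask from chrominance only (Y is ignored).

    The box thresholds are hypothetical stand-ins for the
    elliptic cluster model described in the text.
    """
    _, cb, cr = rgb_to_ycbcr(rgb)
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))
```

Because Y is discarded before thresholding, two pixels that differ only in brightness receive similar Cb and Cr values, which is the property the segmentation relies on.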
The other commonly used method of extracting the skin area is based on the HSI color space, which consists of hue (H), saturation (S) and luminance (I). The HSI color space is an important and attractive color model for image processing applications because it represents colors similarly to how the human eye senses them. Skin color can thus be analyzed in the hue and saturation space so as to reduce the impact of luminance. This color space detects skin color well, so it is used in many skin-color detection studies; pixels are transformed from the RGB color space to the HSI color space by a conversion formula. Nevertheless, under the influence of the environment, this method may still classify non-skin regions as skin. Based on the skin-color segmentation results in the YCbCr and HSI color spaces, we analyzed the advantages and shortcomings of each in order to devise a method that gives more satisfactory results. Through repeated experiments, we arrived at a fusion of the results of the two methods: we apply a pixel-wise "OR" operation to the two binary images obtained from the two color spaces.
Compared with processing the image only in the YCbCr or the HSI color space, this gives a better segmentation result [8]; the experimental results are shown in Figure 1. Since we only care about the hand area, the head area is an interference region and must be separated from the hand. Normally, the head is assumed to exhibit no motion or only slight motion. Therefore, this region can be removed by a difference operation between the current image frame and the previous image frame. The difference operation can be performed faster on binary images than on color images. The difference between the binary images of the current and previous frames removes edge noise, the face region and other stationary objects whose color is close to skin color. Thus, the background and most of the face are removed, as shown in Figure 1(e).
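The HSI conversion and the pixel-wise "OR" fusion can be sketched as follows. This is a minimal illustration that assumes the standard textbook RGB to HSI formulas; the function names `rgb_to_hsi` and `fuse_masks` are hypothetical and the small constant `eps` is only there to avoid division by zero.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert an (..., 3) RGB array (0-255) to hue (degrees), saturation, intensity."""
    rgb = np.asarray(rgb, dtype=np.float64) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-12  # guards against division by zero for gray or black pixels
    intensity = (r + g + b) / 3.0
    saturation = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b) / (r + g + b + eps)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + eps
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    hue = np.where(b > g, 360.0 - theta, theta)
    return hue, saturation, intensity

def fuse_masks(mask_ycbcr, mask_hsi):
    """Pixel-wise OR of the two binary skin masks."""
    return np.logical_or(mask_ycbcr, mask_hsi)
```

With an OR fusion, a pixel is kept as skin if either color space accepts it, so regions missed by one model can be recovered by the other.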
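One way to realize the difference operation on binary masks is sketched below. It assumes that "difference" means keeping the skin pixels of the current frame that were not skin in the previous frame; this interpretation and the name `moving_skin` are assumptions made for illustration.

```python
import numpy as np

def moving_skin(prev_mask, curr_mask):
    """Keep skin pixels of the current frame that were not skin in the previous frame.

    Stationary skin-colored regions (for example a still face) appear in
    both masks and are therefore removed.
    """
    prev_mask = np.asarray(prev_mask, dtype=bool)
    curr_mask = np.asarray(curr_mask, dtype=bool)
    return curr_mask & ~prev_mask
```

For example, if a face occupies the same pixels in both frames while the hand moves, only the newly covered hand pixels survive the operation.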

Hand Tracking Based on Improved CAMSHIFT Algorithm
Computer vision hand tracking is an active and developing field, yet existing hand trackers are not sufficient for our needs. We want a tracker that follows a given hand in the presence of noise. It must also run fast and efficiently, so that objects can be tracked in real time (24 frames per second) while consuming as few system resources as possible, for example when running with inexpensive consumer cameras [9].
Compared with other similar algorithms, the method in this paper improves accuracy without adding computational complexity and is suitable for all the registration parameters. We start from the mean shift algorithm, a simple iterative procedure that climbs the gradient of a probability distribution to find the nearest dominant mode. The mean shift algorithm operates on probability distributions, so to track a hand in video frame sequences, the image data has to be represented as a probability distribution. Distributions derived from video image sequences change over time, so the mean shift algorithm has to be modified to adapt dynamically to the probability distribution it is tracking. The algorithm that meets all these requirements is called CAMSHIFT.

CAMSHIFT Algorithm
The CAMSHIFT algorithm is a non-parametric method that estimates the gradient of a density function for dynamically changing distributions [10]. The algorithm proceeds as follows:
1. Select a search window W of size s in the skin-color probability distribution.
2. Calculate the zeroth moment and the first moments of the distribution, where x and y range over the search window.
3. Calculate the mean search window location (the centroid).
4. Set the search window size equal to a function of the zeroth moment found in step 2.
5. Repeat steps 2, 3 and 4 until convergence (the mean location moves less than a preset threshold).
The zeroth moment reflects the area of the object in the image, and the skin-color probability distribution map is a discrete gray-scale image whose maximum value is 255. The size s of the search window is therefore derived from the zeroth moment $Z_{00}$ scaled by this maximum value. For symmetry, s is rounded to the nearest odd number.
The long axis l, the short axis w and the direction angle of the object can be obtained from the second-order moments. When the CAMSHIFT algorithm tracks an object of a specific color, it does not have to calculate the color probability distribution for all the pixels of every frame; it only calculates the distribution in an area slightly larger than the current search window. This saves a great deal of computation [9].
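The steps above can be sketched in a few lines. The moment definitions, the window-size rule s = 2·sqrt(Z00/256) and the orientation formulas below follow the commonly cited CAMSHIFT formulation and are assumptions here, since the paper's own equations are not reproduced in this text; all function names are hypothetical.

```python
import numpy as np

def window_moments(prob, x0, y0, w, h):
    """Zeroth, first and second moments of prob inside the search window."""
    roi = prob[y0:y0 + h, x0:x0 + w].astype(np.float64)
    ys, xs = np.mgrid[y0:y0 + roi.shape[0], x0:x0 + roi.shape[1]]
    z00 = roi.sum()
    z10, z01 = (xs * roi).sum(), (ys * roi).sum()
    z20, z02 = (xs ** 2 * roi).sum(), (ys ** 2 * roi).sum()
    z11 = (xs * ys * roi).sum()
    return z00, z10, z01, z20, z02, z11

def camshift(prob, window, max_iter=20, tol=1.0):
    """Mean shift with an adaptive square window; returns (xc, yc, s)."""
    x0, y0, w, h = window
    rows, cols = prob.shape
    xc, yc, s = x0 + w / 2.0, y0 + h / 2.0, max(w, h)
    for _ in range(max_iter):
        z00, z10, z01 = window_moments(prob, x0, y0, w, h)[:3]
        if z00 <= 0:
            break  # empty window: nothing to track
        new_xc, new_yc = z10 / z00, z01 / z00       # step 3: centroid
        s = int(round(2.0 * np.sqrt(z00 / 256.0)))  # step 4 (assumed rule)
        s = max(3, s + 1 - s % 2)                   # force an odd size
        shift = np.hypot(new_xc - xc, new_yc - yc)
        xc, yc = new_xc, new_yc
        x0 = int(np.clip(round(xc - s / 2.0), 0, cols - 1))
        y0 = int(np.clip(round(yc - s / 2.0), 0, rows - 1))
        w = h = s
        if shift < tol:                             # step 5: convergence
            break
    return xc, yc, s

def orientation_and_axes(prob, x0, y0, w, h):
    """Direction angle (radians), long axis and short axis from second moments."""
    z00, z10, z01, z20, z02, z11 = window_moments(prob, x0, y0, w, h)
    xc, yc = z10 / z00, z01 / z00
    a = z20 / z00 - xc ** 2
    b = 2.0 * (z11 / z00 - xc * yc)
    c = z02 / z00 - yc ** 2
    theta = 0.5 * np.arctan2(b, a - c)
    root = np.sqrt(b ** 2 + (a - c) ** 2)
    length = np.sqrt((a + c + root) / 2.0)
    width = np.sqrt(max((a + c - root) / 2.0, 0.0))
    return theta, length, width
```

Only the pixels inside the current window are visited in each iteration, which reflects the computational saving described above.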

The Improved CAMSHIFT Algorithm
The region of interest given by the formulas above is square, but the hand is closer to a rectangle; when the hand rotates or its angle toward the camera changes, the aspect ratio of the rectangle enclosing the hand changes.

Estimating Principal Gesture Plane
For a nonlinear gesture, we have found that its trajectory lies almost in a plane, which we call the principal gesture plane [11]. Finding the principal gesture plane is the most important step. The gesture coordinate system is established from the gesture trajectory and can be represented by the principal gesture plane and its normal vector. We use singular value decomposition to compute the eigenvectors and eigenvalues [12].
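A minimal sketch of a least-squares plane fit by singular value decomposition is given below. It assumes the trajectory is available as an array of 3D points and takes the right singular vector with the smallest singular value as the plane normal; the function name `principal_plane` is hypothetical.

```python
import numpy as np

def principal_plane(points):
    """Fit a plane to (N, 3) trajectory points in the least-squares sense.

    Returns the centroid, the unit normal vector and the singular values.
    """
    pts = np.asarray(points, dtype=np.float64)
    centroid = pts.mean(axis=0)
    # Rows of vt are the principal directions of the centered points.
    _, sing, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    normal = vt[-1]  # direction of least variance
    return centroid, normal, sing
```

The first two rows of `vt` span the fitted plane, so they can serve as in-plane axes of a gesture coordinate system.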

Trajectory Projection
After obtaining the normal vector of the principal gesture plane, the next step is to project the gesture trajectory onto the plane. Since frontal-view gestures are chosen as training sets, we first rotate the side-view principal gesture plane so that it is parallel to the frontal one. The projection result is then obtained by calculation (Eq. 4.4). Here, R is the rotation matrix between the side-view gesture coordinate system and the frontal-view gesture coordinate system.
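The projection and rotation can be illustrated as follows. This sketch assumes an orthogonal projection onto the plane and builds a rotation that aligns one plane normal with another using the Rodrigues formula; how the paper derives R is not recoverable from the text, so `rotation_between` is only one plausible construction and both function names are hypothetical.

```python
import numpy as np

def project_to_plane(points, centroid, normal):
    """Orthogonally project (N, 3) points onto the plane through centroid."""
    n = np.asarray(normal, dtype=np.float64)
    n = n / np.linalg.norm(n)
    pts = np.asarray(points, dtype=np.float64)
    dist = (pts - centroid) @ n          # signed distance to the plane
    return pts - np.outer(dist, n)

def rotation_between(a, b):
    """Rotation matrix that takes direction a to direction b (Rodrigues formula)."""
    a = np.asarray(a, dtype=np.float64) / np.linalg.norm(a)
    b = np.asarray(b, dtype=np.float64) / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Opposite directions: rotate by pi about any axis perpendicular to a.
        u = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(u) < 1e-8:
            u = np.cross(a, [0.0, 1.0, 0.0])
        u = u / np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)
```

For instance, `rotation_between(side_normal, frontal_normal)` gives a matrix that can be applied to the projected side-view points before feature extraction, where both normals are inputs supplied by the caller.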

Feature Extraction
There are three basic features: location, orientation and velocity. Previous research [13,14] showed that the orientation feature gives the best accuracy. Therefore, we take the orientation feature as the main feature in our work.
The feature of the 2D motion trajectory of a hand gesture is represented by a series of discrete movement direction values. For the 2D motion plane, each angle is converted into one of the eight direction codes shown in Figure 4. The angle ranges of the direction codes have different widths. The trajectory of a dynamic gesture can therefore be described by a sequence of discrete direction values, each ranging from 1 to 8. The resulting discrete vector is then used as input to the recognition system.
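The quantization of orientations into direction codes can be sketched as follows. Note the simplifying assumptions: the sketch uses eight equal 45-degree sectors centered on the axis directions and numbers them counterclockwise starting at 1 for the positive x direction, whereas the paper states that the angle ranges have different widths and defines the numbering in Figure 4. The function name is hypothetical.

```python
import numpy as np

def direction_codes(points, n_codes=8):
    """Map consecutive 2D trajectory points to direction codes 1..n_codes.

    Assumes equal-width sectors; code 1 is centered on the +x direction
    and codes increase counterclockwise.
    """
    pts = np.asarray(points, dtype=np.float64)
    d = np.diff(pts, axis=0)
    angles = np.degrees(np.arctan2(d[:, 1], d[:, 0])) % 360.0
    width = 360.0 / n_codes
    shifted = (angles + width / 2.0) % 360.0  # center sectors on the axes
    return np.floor(shifted / width).astype(int) + 1
```

A trajectory of T points yields T-1 codes, which form the discrete observation sequence passed to the recognizer.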

Dynamic Gesture Recognition Based On HMM-FNN
Although gestures can be categorized in several different ways [15], from the viewpoint of motion features they are usually divided into static and dynamic gestures. A static gesture, also called a posture, expresses meaning only by means of hand shape or finger configuration, and it can be regarded as a special case of a dynamic gesture. A dynamic gesture involves a fixed posture together with a change in the position or orientation of the hand, such as making a pinching posture while changing the hand's position.
It is well known that the HMM has a strong ability to model temporal data, so we apply a left-right banded HMM to model the gesture trajectory. The Fuzzy Neural Network (FNN) has a strong ability for fuzzy rule modeling and fuzzy inference because it integrates fuzzy set theory with neural networks. Since a traditional FNN cannot model temporal data and a conventional HMM has no ability for fuzzy inference [16], we integrate the two models to represent complex gesture trajectories and perform inference with the integrated HMM-FNN model for the recognition of dynamic gestures. The HMM-FNN model includes five layers. Its first layer, second layer and HMM layer constitute the fuzzy preprocessing part; the third and fourth layers constitute the fuzzy inference part; the fifth layer is the defuzzification part of HMM-FNN and produces a crisp output. These five layers are introduced in detail below.
The first layer is the input layer of the model and it has three neurons, which correspond to the two movement components of the dynamic gesture: $Q_T$ and $Q_S$, respectively.
The third layer is the fuzzy inference layer, and each of its neurons represents a fuzzy rule. The connecting weights between neurons in the second and third layers indicate the degree to which each antecedent contributes to the rule, and the output of each third-layer neuron is calculated from them.
The fourth layer is the normalization layer, whose number of neurons is equal to that of the third layer. In order to speed up the convergence of the network during training, the outputs of the third layer are normalized so that their sum is equal to 1.
The fifth layer is the defuzzification layer. In its output, the coefficient associated with rule j indicates the importance of that rule for the final classification output, and N is the total number of fuzzy rules.
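The inference, normalization and defuzzification layers can be illustrated with a minimal sketch. The exact layer equations are not reproduced in this text, so the sketch assumes a weighted product of membership values for the rule firing strength and a weighted sum for defuzzification; the names `fnn_forward`, `rule_weights` and `out_weights` are hypothetical.

```python
import numpy as np

def fnn_forward(memberships, rule_weights, out_weights):
    """Forward pass through layers 3 to 5 of a simple fuzzy neural network.

    memberships:  (M,) membership values from the preprocessing part
    rule_weights: (N, M) contribution of each antecedent to each rule (assumed form)
    out_weights:  (N,) importance of each rule for the final output
    """
    mu = np.clip(np.asarray(memberships, dtype=np.float64), 1e-12, 1.0)
    rule_weights = np.asarray(rule_weights, dtype=np.float64)
    # Layer 3: firing strength of each rule (weighted product, an assumption).
    firing = np.prod(mu[None, :] ** rule_weights, axis=1)
    # Layer 4: normalize so that the rule outputs sum to 1.
    normalized = firing / firing.sum()
    # Layer 5: defuzzification as a weighted sum of normalized rule outputs.
    output = float(normalized @ np.asarray(out_weights, dtype=np.float64))
    return output, normalized
```

In the HMM-FNN setting, the membership values would be derived from the HMM likelihoods of the observation sequence.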
Suppose that the complex gesture trajectory has already been decomposed into two independent parts during hand tracking. The feature sequences are taken as input to the HMM-FNN model, and the likelihood of each HMM is calculated with the forward probability method. Isolated and continuous gesture paths are recognized from their discrete vectors by the HMM Forward algorithm, selecting the gesture model with maximal likelihood over the Viterbi best path. Moreover, the Baum-Welch (BW) algorithm is used to fully train the initialized HMM parameters and construct the gesture database.
We choose the left-right banded model [18] as the type of HMM because of its straightforward structure. Corresponding to the types of the features, the HMMs for posture changing and for movement in the Z-axis direction are one-dimensional continuous HMMs, while the HMM for the 2D trajectory is a one-dimensional discrete one. For the continuous HMMs, we employ a Gaussian Mixture Model (GMM) as the emission probability of the observations, with the likelihood described in Eq. 5.4.
Suppose that the complex gesture trajectory has already been decomposed into three independent parts during hand tracking. The three feature sequences are taken as input to the HMM-FNN model, and the likelihood of each HMM is calculated with the forward probability method. The final output of the HMM-FNN model indicates the class to which the input gesture belongs; for example, the output for trajectory A lies in the range (α, β], the output for trajectory B lies in (β, γ], and so on. Continuous gesture paths are recognized from their discrete vectors by the HMM Forward algorithm, selecting the gesture model with maximal likelihood over the Viterbi best path. Moreover, the Baum-Welch (BW) algorithm is used to fully train the initialized HMM parameters and construct the gesture database.
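For the discrete case, a left-right banded transition matrix and the forward likelihood computation can be sketched as follows. This is a generic scaled forward algorithm, not the paper's trained models; the self-transition probability `stay`, the model dictionary layout and all function names are illustrative assumptions. Observation symbols are 0-based indices, so direction codes 1 to 8 would be shifted by one.

```python
import numpy as np

def left_right_banded(n_states, stay=0.6):
    """Transition matrix in which each state either stays or moves to the next state."""
    a = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        a[i, i], a[i, i + 1] = stay, 1.0 - stay
    a[-1, -1] = 1.0  # the last state only loops on itself
    return a

def forward_log_likelihood(obs, pi, a, b):
    """Log-likelihood of a discrete observation sequence (scaled forward algorithm).

    pi: (S,) initial probabilities, a: (S, S) transitions,
    b: (S, K) emission probabilities, obs: sequence of symbols in 0..K-1.
    """
    pi, a, b = (np.asarray(m, dtype=np.float64) for m in (pi, a, b))
    alpha = pi * b[:, obs[0]]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ a) * b[:, obs[t]]
        scale = alpha.sum()
        if scale == 0.0:
            return -np.inf  # sequence is impossible under this model
        log_lik += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_lik

def classify(obs, models):
    """Return the name of the model with the highest forward likelihood.

    models maps a gesture name to a (pi, a, b) tuple.
    """
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
```

The banded structure forces the state index to be non-decreasing over time, which matches the sequential nature of a stroke trajectory.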
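Mapping the model output to a class through half-open intervals such as (α, β] and (β, γ] can be sketched as below. The boundary values and labels in the usage example are hypothetical, and the function name `class_from_output` is an assumption made for illustration.

```python
from bisect import bisect_left

def class_from_output(value, bounds, labels):
    """Return the label whose interval (bounds[i], bounds[i + 1]] contains value.

    bounds must be sorted and have one more entry than labels.
    Returns None when the value falls outside all intervals.
    """
    idx = bisect_left(bounds, value) - 1
    if 0 <= idx < len(labels):
        return labels[idx]
    return None
```

For example, with `bounds = [0.0, 1.0, 2.0]` and `labels = ["A", "B"]`, an output of 1.0 is assigned to "A" because each interval is closed on the right.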

Experimental Results
In our experiments, we evaluate the performance of the proposed model by comparison with the conventional HMM. Over 357 tests, we obtained the results shown in Table 1. The experimental procedure and results are shown in Figure 5. Through our experiments, we also found that several characters are similar to each other in the feature space, for example "2" and "Z". We call these similar characters confusion characters (Table 2).
Orientation dynamic features are obtained from spatiotemporal trajectories and then quantized to generate code words. The presented algorithm has the best performance and achieves average recognition rates of 95.76% and 93.64% for Arabic numbers and alphabets, respectively (as shown in Table 1).

Conclusion
In this paper, we propose an automatic system for recognizing continuous gestures in real time, including Arabic numbers (0-9) and alphabets (A-Z). An improved method based on a mixed YCbCr and HSI skin-color space is used for hand area detection and segmentation, and an improved CAMSHIFT algorithm is used for hand tracking. Orientation dynamic features are obtained from spatiotemporal trajectories and then quantized to generate code words. An improved HMM-FNN model is proposed for gesture recognition based on the code words; it combines the ability of the HMM to model temporal data with that of the fuzzy neural network to model fuzzy rules and perform fuzzy inference. The proposed system shows satisfactory performance and achieves average recognition rates of 95.76% and 93.64% for Arabic numbers (0-9) and alphabets (A-Z), respectively.