An easy guide to recognizing affective body expressions using video games

Kalpani Ranasinghe
7 min read · Jul 12, 2023

Today I’m going to discuss how video games can be used to train models to recognize emotions. I’ll take you through some of the research done in this area and discuss the common approaches researchers have followed.

As we all know, as humans we have the ability to identify others’ emotions from their bodily expressions. If we can give computers the same ability, we will be able to develop many important and innovative applications, e.g., domestic violence detection, theft detection, etc. Even lie detection will become easier when bodily expressions are combined with the biosignals (heart rate, EDA, etc.) that are currently taken into account.

Early studies on emotion recognition were mainly based on facial expressions, while body expressions were used only to estimate how strong an emotion was. However, according to studies in neuroscience and psychology, body expressions are just as important as facial expressions in emotion recognition.

Bodily movements are difficult to analyze, and finding high-quality gesture expression datasets is not an easy task. Body gesture datasets can contain performance (acted) data or non-performance (non-acted) data. Performance data is extracted from acts performed by actors, while non-performance data is collected from the natural movements of participants during a data collection phase. The majority of related studies so far have focused on acted (performance) expressions.

Researchers have started using body movements in games (captured using tools such as the Nintendo Wii and Microsoft Kinect) to improve current methods for recognizing affective body expressions. Combining these tools with virtual reality (VR) further improves the level of immersion players experience during video games.

In this article I’m going to focus on the following three studies on recognizing affective body expressions.

  1. Andrea Kleinsmith and Nadia Bianchi-Berthouze: Form as a cue in the automatic recognition of non-acted affective body expressions.
  2. Nikolaos Savva, Alfonsina Scarinzi, and Nadia Bianchi-Berthouze: Continuous recognition of player’s affective body expression as dynamic quality of aesthetic experience.
  3. Xinyi Fu et al.: Gesture based fear recognition using nonperformance dataset from VR horror games.

In most of these studies, to capture non-acted and natural expressions, participants were not told the exact purpose of the study. In some cases, participants were asked to bring a friend with them as a game partner. The researchers believed that playing with a friend would keep participants from getting nervous, allowing them to capture more accurate, natural bodily expressions.

Also, in the surveys conducted for labeling, videos of participants were replaced with animated or static avatars. A set of external observers was recruited for labeling, and online surveys were used to collect their input. To build a common understanding among observers, each group of researchers used its own method to set an agreement level or base rate for how observers annotate the affective data, e.g., a base rate for identifying a gesture as happiness.
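To make this concrete, here is a minimal sketch of what a majority-based base rate could look like. This is purely illustrative: the labels, the number of observers, and the 60% threshold are my assumptions, not values from any of the three studies.

```python
from collections import Counter

def modal_agreement(labels):
    """Return the most frequent label for one clip and the fraction of observers who chose it."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical annotations from five observers for three replay clips.
clips = {
    "clip_01": ["frustrated", "frustrated", "defeated", "frustrated", "frustrated"],
    "clip_02": ["triumphant", "triumphant", "concentrating", "triumphant", "triumphant"],
    "clip_03": ["defeated", "concentrating", "frustrated", "defeated", "concentrating"],
}

BASE_RATE = 0.6  # assumed: keep a clip only if at least 60% of observers agree

for clip, votes in clips.items():
    label, agreement = modal_agreement(votes)
    status = "kept" if agreement >= BASE_RATE else "discarded"
    print(f"{clip}: {label} ({agreement:.0%} agreement) -> {status}")
```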

In the next part of this article, I discuss some high-level details about the three studies mentioned above. I will mainly point out the approaches they followed, the issues they faced, and how they overcame them. I hope this article will help anyone interested in this research area to get started.

Form as a cue in the automatic recognition of non-acted affective body expressions

Kleinsmith and Bianchi-Berthouze captured a single non-performance body posture from each replay session of the game and trained a model to detect the affective state and affective dimensions of each selected session. They used a Gypsy 5 motion capture suit to acquire the participants’ body movements, and the postures were manually located in replay windows. The affective state labels were concentrating, defeated, frustrated, and triumphant, while the affective dimensions were arousal, valence, potency, and avoidance.

A vector containing 3D joint Euler rotations was used to describe each posture; this was known as the “low-level description of the posture”. The vector covered several main joints of the body and their rotations about the three main axes. The upper body and arms were identified as the most important features. Finally, a Multi-Layer Perceptron (MLP) was used as the modeling algorithm, with the features discussed above as its inputs.
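As a rough illustration of this pipeline, the sketch below flattens per-joint Euler rotations into a feature vector and feeds it to an off-the-shelf MLP. The joint list, angle ranges, and random training data are all hypothetical; the original work used a richer description extracted from the Gypsy 5 suit.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Assumed joint set for illustration; each joint contributes (rot_x, rot_y, rot_z).
JOINTS = ["head", "spine", "l_shoulder", "l_elbow", "r_shoulder", "r_elbow"]

def posture_vector(rotations):
    """rotations: dict mapping joint name -> (rot_x, rot_y, rot_z) in degrees."""
    return np.array([angle for j in JOINTS for angle in rotations[j]])

# Toy training data: 100 random "postures" with the four affective-state labels.
rng = np.random.default_rng(0)
X = rng.uniform(-90, 90, size=(100, len(JOINTS) * 3))
y = rng.choice(["concentrating", "defeated", "frustrated", "triumphant"], size=100)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X, y)

# Predict the affective state of one new (random) posture.
example = {j: rng.uniform(-90, 90, size=3) for j in JOINTS}
print(model.predict([posture_vector(example)]))
```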

Continuous recognition of player’s affective body expression as dynamic quality of aesthetic experience

Savva’s approach to recognizing emotions in video gameplay sessions is more complex than Kleinsmith’s, since he used dynamic body features instead of postural features. He used the Nintendo Wii Grand Slam Tennis video game and the 17 sensors of the Animazoo IGS-190 motion capture system to record body movements. The collected data was divided into playing and non-playing windows. The initial labels frustration and anger were mapped to high-intensity negative emotions, while sadness and boredom were mapped to low-intensity negative emotions.

The features in this research included the rotation of each body segment, angular velocity, angular frequency, orientation, angular acceleration, the directionality of the body (head and spine), and the amount of movement. The selected modeling algorithm was a recurrent neural network (RNN); the main reason for choosing an RNN was that most of the selected features (e.g., angular velocity and angular acceleration) are time-based.
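For intuition, here is a minimal sketch of how dynamic features such as angular velocity and acceleration can be estimated from a window of joint rotations using finite differences. The 60 fps rate and the “amount of movement” definition are assumptions for illustration, not Savva’s exact formulation.

```python
import numpy as np

def dynamic_features(rotations, fps=60):
    """rotations: array of shape (frames, joints, 3) with Euler angles per frame.

    Estimates per-frame angular velocity and acceleration with finite
    differences, plus a scalar 'amount of movement' for the whole window.
    """
    dt = 1.0 / fps
    velocity = np.diff(rotations, axis=0) / dt       # (frames-1, joints, 3)
    acceleration = np.diff(velocity, axis=0) / dt    # (frames-2, joints, 3)
    amount_of_movement = np.abs(velocity).sum()      # total rotational change
    return velocity, acceleration, amount_of_movement

# Toy window: 2 seconds at an assumed 60 fps, 17 sensors, 3 rotation axes.
rng = np.random.default_rng(1)
window = rng.normal(size=(120, 17, 3))
vel, acc, movement = dynamic_features(window)
print(vel.shape, acc.shape, round(movement, 2))
```

Time-ordered sequences like these are exactly what an RNN consumes step by step, which is why it was a natural modeling choice here.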

Gesture based fear recognition using nonperformance dataset from VR horror games

Fu and colleagues found that the level of immersion players experience has a huge impact on the resulting affective expressions, so they used VR horror games to induce higher immersion. This research focused only on recognizing fear, while the first two studies recognized several emotions with the same model.

In this case, the researchers focused on creating a non-performance fear dataset of body gestures. A Kinect was used to acquire the gesture clips, and audio and video of the participants were recorded using a mobile phone. After the cleaning phase, the dataset was organized into 1-second-long segments containing the 3D coordinates of 25 skeletal points. Labels represent the player’s fear level from 0 to 5, with 5 being the highest level of fear.
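A minimal sketch of that segmentation step, assuming a 30 fps Kinect capture rate (an assumption on my part), could look like this:

```python
import numpy as np

def segment_stream(stream, fps=30):
    """Split a skeleton stream of shape (frames, 25, 3) into 1-second windows."""
    n_windows = stream.shape[0] // fps
    return stream[: n_windows * fps].reshape(n_windows, fps, 25, 3)

# Toy stream: 10 seconds of 25 Kinect joints in 3D at an assumed 30 fps.
stream = np.zeros((300, 25, 3))
windows = segment_stream(stream)
print(windows.shape)  # (10, 30, 25, 3)
```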

68 features were computed from the 3D coordinates of the 25 skeletal points. These features fell into two categories: spatial distance features, which captured the relative positions of the skeletal points, and energy features, computed from the acceleration of the skeletal points. Finally, a BLSTM-RNN model was built to recognize fear.
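The sketch below shows one way such a model could be wired up in PyTorch: a bidirectional LSTM over per-frame feature vectors with a 6-way classification head for the fear levels. The hidden size, frame count, and classify-from-the-last-step readout are my assumptions; the paper’s exact architecture may differ.

```python
import torch
import torch.nn as nn

class FearBLSTM(nn.Module):
    """Minimal bidirectional LSTM over per-frame feature vectors."""
    def __init__(self, n_features=68, hidden=64, n_levels=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_levels)  # fear levels 0-5

    def forward(self, x):              # x: (batch, frames, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # classify from the final time step

model = FearBLSTM()
segment = torch.randn(4, 30, 68)       # 4 one-second segments, 30 frames each
logits = model(segment)
print(logits.shape)                     # torch.Size([4, 6])
```

The bidirectional layer reads each 1-second segment both forwards and backwards, so the prediction can use context from the whole gesture rather than only the frames seen so far.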

Identified issues and the solutions suggested by the researchers

  • Decisions of non-expert observers affect the annotation process: To overcome this, the researchers used a repeated subsampling method to generalize the annotation results (see the sketch after this list). They also suggested assessing the empathizing capacity of observers with a questionnaire before the annotation process.
  • In Savva’s work, labeling at the game-point level was identified as a limitation. He therefore proposed that labeling be done continuously throughout the game to obtain higher performance. Although this may improve performance, continuous labeling requires a considerable amount of time and will quickly exhaust the observers.
  • Imperfect 3D skeletal point recognition in the Kinect introduced some unavoidable errors in Fu et al.’s work. We can therefore assume that faulty devices can always cause issues in this type of research, and the status of the devices should be checked before starting the trials.
  • Lack of non-performance data
  • Individual biases among annotators: These biases show up as a poor level of agreement between annotators. The works of Kleinsmith and Savva had more elaborate processes for computing human agreement levels than Fu’s work.
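Regarding the repeated subsampling mentioned in the first point, here is one possible (assumed) interpretation of the idea: repeatedly draw subsets of observers, take the modal label of each subset, and keep the label that wins most often along with how stable it is.

```python
import numpy as np
from collections import Counter

def subsampled_label(observer_labels, n_rounds=100, subset=3, seed=0):
    """observer_labels: list of labels, one per observer, for a single clip.

    Returns the label that wins the most subsampling rounds and the
    fraction of rounds it won (a rough stability measure).
    """
    rng = np.random.default_rng(seed)
    winners = []
    for _ in range(n_rounds):
        sample = rng.choice(observer_labels, size=subset, replace=False)
        winners.append(Counter(sample).most_common(1)[0][0])
    label, wins = Counter(winners).most_common(1)[0]
    return label, wins / n_rounds

# Hypothetical fear-level annotations from five observers for one clip.
labels = ["fear_3", "fear_3", "fear_2", "fear_3", "fear_4"]
print(subsampled_label(labels))
```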

In summary, in recent years researchers have started to use modern video gaming tools to capture and recognize the emotions players express during gameplay. Since these are non-acted data, this line of research will eventually help build more robust and reliable emotion recognition systems. Some researchers worked with postural features, while others worked with more complex dynamic body features. The main steps in recognizing affective body expressions in games are the following: creating the dataset, selecting features, modeling, and finally testing. Although the models mostly perform as expected, researchers have identified some issues that need attention in future work in this area. Since this is an emerging research area, I hope this article gives you some insights to get started if you are interested.

Thanks for reading! I will meet you again with an interesting topic. 😊

References

  • Andrea Kleinsmith and Nadia Bianchi-Berthouze. “Form as a cue in the automatic recognition of non-acted affective body expressions”. In: International conference on affective computing and intelligent interaction. Springer. 2011, pp. 155–164.
  • Andrea Kleinsmith, Nadia Bianchi-Berthouze, and Anthony Steed. “Automatic recognition of non-acted affective postures”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41.4 (2011), pp. 1027–1038.
  • Nikolaos Savva, Alfonsina Scarinzi, and Nadia Bianchi-Berthouze. “Continuous recognition of player’s affective body expression as dynamic quality of aesthetic experience”. In: IEEE Transactions on Computational Intelligence and AI in Games 4.3 (2012), pp. 199–212.
  • Xinyi Fu et al. “Gesture based fear recognition using nonperformance dataset from VR horror games”. In: 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE. 2021, pp. 1–8.

Kalpani Ranasinghe

Backend Developer | Graduate Student at University of Oulu