Implementation of human behavior recognition based on visual skeletal model estimation by deep learning

Master Thesis Defense

Owing to the compact and rich high-level representation it offers, skeleton-based human action recognition has recently become a highly active research topic. However, some action classes appear very similar in the skeleton modality. Moreover, there is considerable diversity within each class, since the same action can be performed differently depending on the circumstances or the individual. Classification becomes even more challenging when occlusion occurs in the skeleton data, leaving some joint information unavailable in certain frames. A recognition method therefore needs to exploit the semantic information embedded in the observed joints and extract higher-level features that differ between action classes while remaining consistent across samples of the same class. Previous studies have demonstrated that investigating joint relationships in the spatial and temporal dimensions provides information critical to action recognition. However, effectively encoding the global dependencies of joints during spatio-temporal feature extraction remains challenging.

In this thesis, we aim to develop neural network architectures that extract robust spatio-temporal features by learning a hierarchical representation of the joints in an action through a dynamic process. We introduce the Action Capsule, which identifies action-related key joints by considering the latent correlations of joints in a skeleton sequence. To gain a deeper understanding of how our algorithms advance the state of the art and contribute to the literature, we design custom interpretation methods to analyze the intuition behind the proposed approach both quantitatively and qualitatively. We show that, during inference, our end-to-end network attends to a set of joints specific to each action in both the spatial and temporal dimensions, and that their encoded spatio-temporal features are aggregated to recognize the action. In the case of occlusion, provided the occluded part does not include the key joints of the action, our network can still detect the action by exploring relationships between the visible joints. Additionally, the use of multiple stages of Action Capsules enhances the network's ability to distinguish similar actions. Furthermore, leveraging multiple streams of Action Capsules that operate on different inputs, including joint, motion, and bone information, significantly improves classification accuracy for some classes.

Consequently, our network outperforms state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTU RGB+D dataset, while having significantly lower computational requirements as measured in GFLOPs. We conclude with a brief overview of novel aspects of this work that merit further exploration, along with a roadmap for potential future directions.
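As a rough illustration of the multi-stream idea mentioned above, the sketch below shows one common way such inputs are derived from raw joint coordinates: the motion stream as per-frame joint displacement and the bone stream as the vector from each joint's parent to the joint itself. The function name, tensor layout, and parent indices here are illustrative assumptions, not the exact preprocessing used in the thesis.

```python
import numpy as np

def build_streams(joints, parents):
    """Derive joint, motion, and bone streams from a skeleton sequence.

    joints  : array of shape (T, V, C) -- T frames, V joints, C coordinates.
    parents : length-V list of each joint's parent index (root maps to itself).
    Returns three arrays, each of shape (T, V, C).
    """
    joint = joints
    # Motion stream: frame-to-frame displacement of every joint.
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]
    # Bone stream: vector from each joint's parent to the joint itself.
    bone = joints - joints[:, parents, :]
    return joint, motion, bone

if __name__ == "__main__":
    # Toy 5-joint skeleton; the kinematic tree below is hypothetical.
    T, V, C = 30, 5, 3
    parents = [0, 0, 1, 1, 3]
    sequence = np.random.randn(T, V, C)
    j, m, b = build_streams(sequence, parents)
    print(j.shape, m.shape, b.shape)  # (30, 5, 3) each
```

Each stream would then feed its own network branch, with the per-stream predictions fused at the end, which is the usual rationale for multi-stream skeleton models.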