Implementation of human behavior recognition based on visual skeletal model estimation by deep learning

Master Thesis Defense

Due to the compact and rich high-level representations it offers, skeleton-based human action recognition has recently become a highly active research topic. However, some action classes appear very similar in the skeleton modality. Moreover, there is considerable diversity within each action class, since the same action can be performed differently depending on the circumstances or the individual. On top of that, classification becomes more challenging when occlusion occurs in the skeleton data, leaving some joint information unavailable in certain frames. A recognition method must therefore exploit the semantic information embedded in the observed joints and extract higher-level features that differ between action classes while remaining consistent across samples of the same class.

Previous studies have demonstrated that investigating joint relationships in the spatial and temporal dimensions provides information critical to action recognition. However, effectively encoding the global dependencies of joints during spatio-temporal feature extraction remains challenging. In this thesis, we aim to develop neural network architectures that extract robust spatio-temporal features by learning a hierarchical representation of the joints in an action through a dynamic process. We introduce the Action Capsule, which identifies action-related key joints by considering the latent correlation of joints in a skeleton sequence. To better understand how our algorithms advance the state of the art and contribute to the literature, we design custom interpretation methods to analyze the intuition behind the proposed approach both quantitatively and qualitatively. We show that, during inference, our end-to-end network pays more attention to a set of joints specific to each action in both the spatial and temporal dimensions, and aggregates their encoded spatio-temporal features to recognize the action.
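The "dynamic process" by which capsules agree on action-related key joints can be illustrated with the standard dynamic-routing scheme from the capsule-network literature. The sketch below is not the thesis's actual architecture; the shapes, iteration count, and the idea of treating per-joint features as input capsules are illustrative assumptions.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Capsule "squash" non-linearity: keeps the vector's direction
    # and maps its length into [0, 1).
    sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: (n_in, n_out, d_out) prediction vectors from input capsules
    # (here, hypothetically, per-joint features) to output action capsules.
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                               # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum over inputs
        v = squash(s)                                         # output capsule vectors
        b = b + (u_hat * v[None]).sum(axis=-1)                # reward agreement
    return v, c
```

After routing, the coupling coefficients `c` concentrate on the input capsules (joints) that agree most with each output capsule, which is one intuition for how capsule layers can highlight action-specific key joints.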
In the case of occlusion, where the occluded part does not include the key joints of the action, our network is still capable of detecting actions by exploiting the relationships between the visible joints. Additionally, using multiple stages of Action Capsules enhances the network's ability to distinguish similar actions. Furthermore, by leveraging multiple streams of Action Capsules that operate on different inputs, including joint, motion, and bone information, classification accuracy for some classes improves significantly. Consequently, our network outperforms state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTU RGB+D dataset, while having significantly lower computational requirements as measured in GFLOPs. We conclude with a brief overview of novel aspects of this work that merit further exploration, along with a roadmap for future research on the topic.
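The joint, motion, and bone streams mentioned above are commonly derived from the raw joint coordinates alone: motion as frame-to-frame differences, bone as each joint minus its kinematic parent. The sketch below assumes a toy 5-joint skeleton with a hypothetical parent table; the thesis's actual preprocessing may differ, and real skeletons (e.g. the 25-joint NTU RGB+D layout) use the dataset's own kinematic tree.

```python
import numpy as np

# Hypothetical parent index per joint for a 5-joint toy skeleton;
# joint 0 is the root and is listed as its own parent.
PARENTS = np.array([0, 0, 1, 2, 3])

def build_streams(joints):
    # joints: (T, V, C) array — frames, joints, coordinate channels.
    motion = np.zeros_like(joints)
    motion[1:] = joints[1:] - joints[:-1]   # temporal difference (first frame: zero)
    bone = joints - joints[:, PARENTS, :]   # joint minus its parent (root bone: zero)
    return joints, motion, bone
```

Each stream is then fed to its own network, and the per-stream class scores are typically fused (e.g. summed) at inference time.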