Enhancing Action Recognition Models Using Synthetic Data

Rongchai Wang
Dec 03, 2024 19:31

NVIDIA explores the use of synthetic data to improve action recognition models, highlighting the benefits and applications across industries such as retail and healthcare.

NVIDIA is using synthetic data to improve action recognition models such as PoseClassificationNet. The approach is most valuable where gathering real-world data is costly or impractical, according to an NVIDIA blog post authored by Monika Jhuria.

Challenges in Action Recognition

Action recognition models are designed to identify and classify human actions, such as walking or waving. However, developing robust models that can accurately recognize a wide range of actions across various scenarios remains challenging. A significant hurdle is acquiring sufficient and diverse training data. Synthetic data generation (SDG) emerges as a practical solution to this issue by simulating real-world scenarios through 3D simulations.

Synthetic Data Generation with NVIDIA Isaac Sim

NVIDIA’s Isaac Sim, a reference application built on NVIDIA Omniverse, plays a central role in generating synthetic data. It is used across multiple domains, including retail, sports, warehouses, and hospitals. The process creates artificial data from 3D simulations that mimic real-world conditions, so that models can be improved through iterative training.

Creating a Human Action Recognition Dataset

Using Isaac Sim, NVIDIA has developed a method to create datasets for action recognition models. This involves generating action animations and extracting key points as inputs for the models. The Omni.Replicator.Agent extension within Isaac Sim facilitates the generation of synthetic data across various 3D environments, offering features like multi-camera consistency and position randomization.
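To make the "key points as inputs" step more concrete, here is a minimal sketch of how per-frame joint coordinates might be packed into a fixed-length, channel-first sequence before training. The layout (coordinate channel, then frame, then joint), the zero padding, and the name `pack_sequence` are assumptions chosen for illustration; they are not taken from Isaac Sim or from NVIDIA's pipeline.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]  # hypothetical (x, y, z) position of one joint


def pack_sequence(frames: List[List[Joint]], num_frames: int) -> List[List[List[float]]]:
    """Pack per-frame joints into a [channel][frame][joint] nested list.

    Sequences longer than num_frames are truncated; shorter ones are
    zero-padded at the end. This layout is an assumption for illustration.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_joints = len(frames[0])
    if any(len(frame) != num_joints for frame in frames):
        raise ValueError("every frame must have the same number of joints")

    # Three channels (x, y, z), each a num_frames x num_joints grid of zeros.
    packed = [[[0.0] * num_joints for _ in range(num_frames)] for _ in range(3)]
    for t, frame in enumerate(frames[:num_frames]):
        for v, joint in enumerate(frame):
            for c in range(3):
                packed[c][t][v] = float(joint[c])
    return packed


# Usage: one frame with two joints, padded to two frames.
sequence = pack_sequence([[(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]], num_frames=2)
```

A fixed sequence length keeps every sample the same shape, which is what batch training generally needs.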

Expanding Model Capabilities with Synthetic Data

The generated synthetic data is used to expand the capabilities of spatial-temporal graph convolutional network (ST-GCN) models, which detect human actions from skeletal information. NVIDIA’s approach involves training models like PoseClassificationNet on the 3D skeleton data produced by Isaac Sim, using NVIDIA TAO for training and fine-tuning.
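As a rough illustration of how a graph convolution uses skeletal structure, the sketch below applies one spatial aggregation step over a toy skeleton: each joint's features are mixed with those of the joints it is connected to by a bone. It is a simplified sketch, assuming a symmetric-normalized adjacency matrix and omitting the learned weights and the temporal convolution that a full ST-GCN includes. The three-joint skeleton and the function names are hypothetical and do not come from PoseClassificationNet or TAO.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def normalized_adjacency(num_joints: int, bones: List[Tuple[int, int]]) -> Matrix:
    """Build D^(-1/2) (A + I) D^(-1/2) for an undirected skeleton graph."""
    a = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        a[i][i] = 1.0  # self-loop so a joint keeps part of its own feature
    for i, j in bones:
        a[i][j] = 1.0
        a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_joints)]
        for i in range(num_joints)
    ]


def spatial_graph_conv(a_hat: Matrix, features: Matrix) -> Matrix:
    """Aggregate features over neighbors: out[i][c] = sum_j a_hat[i][j] * features[j][c]."""
    num_channels = len(features[0])
    return [
        [sum(a_hat[i][j] * features[j][c] for j in range(len(features)))
         for c in range(num_channels)]
        for i in range(len(a_hat))
    ]


# Usage: a toy chain of three joints (for example shoulder, elbow, wrist).
a_hat = normalized_adjacency(3, bones=[(0, 1), (1, 2)])
joint_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one feature row per joint
mixed = spatial_graph_conv(a_hat, joint_features)
```

In a trained network this aggregation would be followed by a learned linear map and a convolution along the time axis, so that the model sees both pose and motion.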

Training and Testing Results

In testing, the ST-GCN model, trained solely on synthetic data, achieved an average accuracy of 97% across 85 action classes. This performance was further validated using the NTU-RGB+D dataset, demonstrating that the model could generalize to real-world data it was not explicitly trained on.
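The article does not say how the average accuracy was computed. One common reading is the mean of per-class accuracies, which weights every action class equally regardless of how many samples it has. The sketch below shows that calculation under that assumption; the labels and function names are made up for illustration and are not NVIDIA's results or evaluation code.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def per_class_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the fraction of correctly classified samples for each true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def mean_class_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the per-class accuracies so each class counts equally."""
    scores = per_class_accuracy(y_true, y_pred)
    if not scores:
        raise ValueError("need at least one labeled sample")
    return sum(scores.values()) / len(scores)


# Usage with made-up labels for three action classes.
truth = ["walk", "walk", "wave", "wave", "wave", "sit"]
predicted = ["walk", "wave", "wave", "wave", "wave", "walk"]
score = mean_class_accuracy(truth, predicted)
```

With many classes of uneven size, this figure can differ noticeably from plain overall accuracy, so it helps to know which one a reported number refers to.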

Scaling and Orchestrating Data Generation

NVIDIA has also explored the use of NVIDIA OSMO, a cloud-native orchestration platform, to scale the data generation process. This has significantly accelerated data generation, allowing for the creation of thousands of samples with diverse action animations and camera angles.

For further details on NVIDIA’s approach to scaling action recognition models using synthetic data, please refer to the NVIDIA blog.
