EgoNRG

Egocentric Gesture Dataset for Robust Human-Robot Communication via Head Mounted Devices in Industrial and Military Settings

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

Institution Hidden for Blind Review

Submission is currently under review.

Abstract

Gesture recognition in Human-Robot Interaction (HRI) presents unique challenges in industrial, military, and emergency response settings where operators must communicate non-verbal navigation commands to semi-autonomous robotic systems while wearing Personal Protective Equipment (PPE). We present EgoNRG (Egocentric Navigation Robot Gestures), a comprehensive dataset comprising 3,000 first-person multi-view video recordings and 160,000 annotated images captured from four strategically positioned head-mounted cameras. The dataset addresses critical limitations in existing egocentric gesture recognition systems by providing joint hand-arm segmentation labels for both covered and uncovered limbs across diverse environmental conditions. Our multi-viewpoint collection methodology captures twelve navigation gestures (ten derived from the Army Field Manual, one deictic, and one emblem) performed by 32 participants across indoor and outdoor environments with systematic variations in clothing conditions and background complexity. Comprehensive experiments demonstrate the dataset's effectiveness for training robust gesture recognition systems suitable for real-world deployment in challenging operational environments.

EgoNRG Dataset Overview

Overview of the EgoNRG dataset contents. The dataset contains 3,044 videos and 160,639 annotated frames captured from the egocentric perspective. 32 participants performed 12 gestures related to ground vehicle robot control. The dataset features conditions with and without background people visible, and both indoor and outdoor environments. The dataset was also recorded from four synchronized monochrome cameras, each providing a different perspective of each gesture performed by the participants.

Dataset Overview

EgoNRG (Egocentric Navigation Robot Gestures) is a comprehensive dataset featuring joint hand and arm segmentations captured from 32 participants (14 female and 18 male) performing 12 gesture-based commands for ground vehicle robot control. The participants were divided into four groups of eight, with each group executing a specific set of four gestures. Ten of the twelve gestures were derived from the Army Field Manual; the remaining two are a deictic gesture and an emblem gesture. The dataset encompasses 3,044 videos and 160,639 annotated frames. The dataset features:

  • Joint hand and arm segmentations of each participant's left and right limbs.
  • Gestures performed with (1) long sleeves and gloves (replica flame-resistant solid-color clothing and military camouflage) and (2) bare skin, mimicking conditions in real-world industrial and military environments.
  • Environments with and without background people visible.
  • Data captured in both indoor and outdoor environments at various points throughout the day (morning, midday, and dusk).
  • Data captured from four synchronized monochrome cameras, each with a different perspective.
  • Gestures that map directly to standard ground vehicle robot commands (stop, move forward, go left, move in reverse, etc.); a mapping sketch follows this list.
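As a purely illustrative example of that mapping, the sketch below pairs hypothetical gesture-class names with the ground vehicle commands they would trigger. The label strings, command set, and function names are assumptions, not the dataset's released conventions.

from enum import Enum, auto

class RobotCommand(Enum):
    # Standard ground vehicle commands referenced above.
    STOP = auto()
    MOVE_FORWARD = auto()
    MOVE_IN_REVERSE = auto()
    GO_LEFT = auto()
    GO_RIGHT = auto()
    POINT_TO_TARGET = auto()   # deictic "point" gesture
    ACKNOWLEDGE = auto()       # emblem "approve" gesture

# Hypothetical gesture-class names; substitute the label strings shipped with the dataset.
GESTURE_TO_COMMAND = {
    "stop": RobotCommand.STOP,
    "move_forward": RobotCommand.MOVE_FORWARD,
    "move_in_reverse": RobotCommand.MOVE_IN_REVERSE,
    "go_left": RobotCommand.GO_LEFT,
    "go_right": RobotCommand.GO_RIGHT,
    "point": RobotCommand.POINT_TO_TARGET,
    "approve": RobotCommand.ACKNOWLEDGE,
}

def to_command(predicted_label: str) -> RobotCommand:
    """Map a recognized gesture label to the robot command it triggers."""
    return GESTURE_TO_COMMAND[predicted_label]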

Segmentation Examples

EgoNRG Dataset Gestures

These are examples of the segmentation masks that were annotated for the EgoNRG dataset. You can see the joint hand-arm segmentation masks created for both the left and right limbs. You can also see how the dataset varies in clothing conditions, lighting conditions, and the presence of background people.
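For readers who want to work with the masks programmatically, here is a minimal sketch of decoding per-pixel left/right hand-arm labels. It assumes the annotations are single-channel index images; the actual annotation format, file names, and class IDs are not specified on this page, so treat them as placeholders.

import numpy as np
from PIL import Image

# Assumed class indices for the per-pixel masks; the released format
# (e.g., palette PNGs or polygon annotations) and IDs may differ.
BACKGROUND, LEFT_LIMB, RIGHT_LIMB = 0, 1, 2

def load_limb_masks(mask_path):
    """Return boolean masks for the left and right joint hand-arm regions."""
    mask = np.array(Image.open(mask_path))
    return mask == LEFT_LIMB, mask == RIGHT_LIMB

# Placeholder usage (path is hypothetical):
# left, right = load_limb_masks("annotations/p01_stop_cam0_000123.png")
# print("left-limb pixels:", int(left.sum()), "right-limb pixels:", int(right.sum()))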

Gesture Classes

EgoNRG Dataset Gestures

Above is an image depicting a static representation of the 12 gestures performed in the dataset. Ten were adopted from the Army Field Manual, one is a deictic gesture ("point"), and one is an emblem gesture ("approve"). Below is a video recorded in the third-person view showing an example of each gesture class that was captured in the dataset from the first-person view.

Above is a video showing the 12 gestures being performed from the third-person viewpoint, provided for reference only; all gestures in the dataset were captured from the first-person point of view.

Example Videos

Examples of raw videos captured from the various camera viewpoints. You can see gestures being performed in both indoor and outdoor environments, with participants wearing long sleeves and gloves as well as with bare skin, under varying lighting conditions, and with and without background people visible.

Egocentric Viewpoints

This video shows all four viewpoints that were captured in the dataset for each gesture performed by the participants. In each viewpoint, a different part of the participant's hand and arm is visible, providing more information for training models that can generalize to other egocentric vision platforms.
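As a sketch of how the four synchronized views might be consumed together, the generator below reads time-aligned frames from four per-camera video files. The file naming, camera identifiers, and directory layout are assumptions for illustration, not the dataset's actual convention.

import cv2

# Assumed per-camera file naming, e.g. "<clip>_cam0.mp4" ... "<clip>_cam3.mp4".
CAMERAS = ["cam0", "cam1", "cam2", "cam3"]

def read_synchronized_frames(clip_stem):
    """Yield tuples of time-aligned grayscale frames, one per viewpoint."""
    caps = [cv2.VideoCapture(f"{clip_stem}_{cam}.mp4") for cam in CAMERAS]
    try:
        while True:
            frames = []
            for cap in caps:
                ok, frame = cap.read()
                if not ok:          # stop at the end of the shortest stream
                    return
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
            yield tuple(frames)
    finally:
        for cap in caps:
            cap.release()

# Placeholder usage:
# for views in read_synchronized_frames("videos/p01_stop"):
#     pass  # views[0] .. views[3] are the four egocentric viewpoints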

BibTeX

@inproceedings{last2026egocentric,
  author    = {last, first and last, first and last, first and last, first},
  title     = {Egocentric Gesture Dataset for Robust Human-Robot Communication via Head Mounted Devices in Industrial and Military Settings},
  booktitle = {IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026},
  note      = {Submitted, under review},
}