Detecting Precise Hand Touch Moments in Egocentric Video

Abstract

We address the task of detecting the precise moment when hands make contact with objects in egocentric videos. This frame-level detection is crucial for augmented reality, human-computer interaction, assistive technologies, and robot learning applications, where contact onset signals action initiation or completion. Temporally precise detection is difficult because hand motion varies only subtly near contact, occlusions are frequent, manipulation patterns are fine-grained, and first-person video carries its own inherent motion dynamics.

To tackle these challenges, we propose a Hand-informed Context Enhanced module (HiCE; pronounced 'high-see') that leverages spatiotemporal features from hand regions and their surrounding context through cross-attention mechanisms, learning to identify potential contact patterns. Our approach is further refined with a grasp-aware loss and soft labels that emphasize the hand pose patterns and movement dynamics characteristic of touch events, enabling the model to distinguish near-contact frames from actual contact frames. We also introduce TouchMoment, an egocentric dataset containing 4,021 videos and 8,456 annotated contact moments spanning over one million frames. We evaluate on TouchMoment under a strict criterion that counts a prediction as correct only if it falls within a two-frame tolerance of the ground-truth moment. Under this criterion, our method outperforms state-of-the-art event-spotting baselines by 16.91% in average precision.
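The two-frame tolerance criterion can be illustrated with a minimal sketch. The abstract does not specify the matching procedure or the exact average-precision formula, so the greedy one-to-one matching and the non-interpolated AP below are assumptions, and the function names `match_predictions` and `average_precision` are hypothetical.

```python
def match_predictions(pred_frames, gt_frames, tolerance=2):
    """Label each prediction as a true or false positive.

    pred_frames: list of (frame_index, confidence) tuples.
    gt_frames:   list of ground-truth touch-moment frame indices.
    Assumption: predictions are matched greedily in descending confidence,
    and each ground-truth moment can be matched at most once.
    """
    matched = set()
    results = []
    for frame, score in sorted(pred_frames, key=lambda p: -p[1]):
        best = None
        for i, gt in enumerate(gt_frames):
            if i in matched:
                continue
            dist = abs(frame - gt)
            # Keep the closest unmatched ground truth within the tolerance.
            if dist <= tolerance and (best is None or dist < abs(frame - gt_frames[best])):
                best = i
        if best is not None:
            matched.add(best)
        results.append((score, best is not None))
    return results


def average_precision(results, num_gt):
    """Non-interpolated AP: mean of precision values at each true positive."""
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(results, start=1):
        if is_tp:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_gt if num_gt else 0.0


# Toy example with made-up predictions, not data from TouchMoment.
preds = [(10, 0.9), (13, 0.8), (30, 0.7), (52, 0.6)]
gts = [11, 30, 55]
labels = match_predictions(preds, gts, tolerance=2)
ap = average_precision(labels, num_gt=len(gts))
```

In this toy case the prediction at frame 52 is three frames from the ground truth at frame 55, so it counts as a false positive even though it is close; this is what makes the criterion strict.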

Method

HiCE augments frame-level feature extractors with dedicated hand-centric spatiotemporal modeling. The module first extracts hand regions using off-the-shelf detectors and expands them to include relevant local context. These regions are then processed by a backbone that captures both spatial structure and short-term temporal dynamics from neighboring frames. The resulting hand-specific features are fused with global frame features through cross-attention, guiding the model to attend to contact-critical regions.
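Two steps of this pipeline, context expansion and cross-attention fusion, can be sketched in plain Python. This is an illustration under stated assumptions rather than the HiCE implementation: the expansion factor of 1.5, the choice of frame features as queries and hand features as keys and values, and the absence of learned projections are all assumptions, and `expand_box` and `cross_attention` are hypothetical names.

```python
import math


def expand_box(box, frame_w, frame_h, scale=1.5):
    """Grow an (x1, y1, x2, y2) hand box about its center, clipped to the frame.

    The scale value is an assumed placeholder, not a value from the paper.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(frame_w), cx + half_w), min(float(frame_h), cy + half_h))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    Assumption: queries come from global frame features, while keys and
    values come from hand-region features.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused


# Toy inputs: one frame token attending over two hand tokens.
context_box = expand_box((100, 100, 200, 200), frame_w=640, frame_h=480)
frame_tokens = [[1.0, 0.0]]
hand_tokens = [[1.0, 1.0], [1.0, 1.0]]
hand_values = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(frame_tokens, hand_tokens, hand_values)
```

A practical model would add learned query, key, and value projections, multiple heads, and a residual connection; the sketch keeps only the attention arithmetic to show how hand features can reweight what each frame token attends to.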

TouchMoment Dataset

We introduce TouchMoment, an egocentric video dataset that captures a wide range of everyday manipulation scenarios and supplies exact touch-moment annotations for each interaction. The dataset comprises 4,021 videos sourced from diverse environments, object categories, and interaction contexts, and includes 8,456 manually annotated touch moments recorded at the frame level. Alongside touch annotations, TouchMoment provides temporal segmentation of interaction sequences and localized hand regions to support hand-centric modeling.

BibTeX

@InProceedings{nguyen_hice_cvpr_2026,
      author    = {Huy Anh Nguyen and Feras Dayoub and Minh Hoai},
      title     = {Detecting Precise Hand Touch Moments in Egocentric Video},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
      month     = {June},
      year      = {2026},
      pages     = {3565--3574}
    }