HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild

Stony Brook University

CVPR 2024

Illustrative qualitative results of HOIST-Former. Within each row, a distinct hand-held object is assigned a unique tracking ID and is consistently represented in the same color.

Abstract

We address the challenging task of identifying, segmenting, and tracking hand-held objects, which is crucial for applications such as human action segmentation and performance evaluation. The task is particularly difficult due to heavy occlusion, rapid motion, and the transitory nature of hand-held objects: an object may be held, released, and subsequently picked up again. To tackle these challenges, we develop HOIST-Former, a novel transformer-based architecture. HOIST-Former spatially and temporally segments hands and objects by iteratively pooling features from each other, so that the identification, segmentation, and tracking of hand-held objects are conditioned on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on regions where hands are in contact with objects. We also contribute HOIST, an extensive in-the-wild video dataset comprising 4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs for hand-held objects. Through comprehensive experiments on the HOIST dataset and two additional public datasets, we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.

Architecture

Inspired by the success of Mask2Former in video instance segmentation, we designed HOIST-Former with a similar overall architectural framework, consisting of three main components: a backbone network, a pixel decoder, and a transformer decoder.
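To make the dataflow among these three components concrete, here is a minimal, hypothetical NumPy sketch of the Mask2Former-style pipeline. The function names, strides, and channel sizes are illustrative assumptions, not the paper's actual configuration; the stubs only show how per-pixel embeddings and query embeddings combine into spatio-temporal masks.

```python
import numpy as np

# Hypothetical sketch of a Mask2Former-style pipeline; all shapes,
# strides, and channel counts are illustrative assumptions.

def backbone(frames):
    """Extract a coarse feature map from video frames (stub, stride 32)."""
    T, H, W, _ = frames.shape
    return np.zeros((T, H // 32, W // 32, 256))

def pixel_decoder(feats):
    """Upsample backbone features into high-resolution per-pixel
    embeddings (stub: naive 8x nearest-neighbor upsampling)."""
    return np.repeat(np.repeat(feats, 8, axis=1), 8, axis=2)

def transformer_decoder(pixel_feats, num_queries=100):
    """Object queries attend to pixel features; final masks come from
    a dot product between query embeddings and per-pixel embeddings."""
    C = pixel_feats.shape[-1]
    queries = np.zeros((num_queries, C))  # learned queries in practice
    masks = np.einsum("qc,thwc->qthw", queries, pixel_feats)
    return masks

frames = np.zeros((4, 256, 256, 3))  # T=4 video frames
masks = transformer_decoder(pixel_decoder(backbone(frames)))
print(masks.shape)  # (100, 4, 64, 64): one spatio-temporal mask per query
```

Each query thus produces one mask tube spanning all frames, which is what allows a single query to carry a consistent tracking ID through the video.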

Hand-Object Transformer Decoder

The Hand-Object Transformer Decoder determines the positions of hands and hand-held objects through an iterative, collaborative feature-pooling process, effectively conditioning the identification and segmentation of hand-held objects on the appearance of hands and their immediate surroundings. This design allows us to segment and track arbitrary hand-held objects in an open-world setting, satisfying selection criteria that extend beyond categorical membership and object visibility.
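As a rough sketch of the collaborative pooling idea, the following NumPy snippet shows hand queries and object queries alternately attending to the image features and to each other. The exact layer ordering, attention variant, and dimensions here are hypothetical simplifications of the paper's decoder, not its actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries pool features from keys_values."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

def hand_object_decoder_layer(hand_q, obj_q, pixel_feats):
    """One hypothetical decoder layer: hands and objects iteratively
    pool features from the image and from each other."""
    # Hand queries attend to image features.
    hand_q = hand_q + cross_attend(hand_q, pixel_feats)
    # Object queries pool from the updated hand queries, conditioning
    # object localization on hand positions and context.
    obj_q = obj_q + cross_attend(obj_q, hand_q)
    # Object queries also attend to image features directly.
    obj_q = obj_q + cross_attend(obj_q, pixel_feats)
    # Hand queries in turn pool from object queries (collaborative step).
    hand_q = hand_q + cross_attend(hand_q, obj_q)
    return hand_q, obj_q

# Toy shapes: 10 hand queries, 10 object queries, 100 pixel tokens, dim 32.
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 32))
o = rng.normal(size=(10, 32))
p = rng.normal(size=(100, 32))
for _ in range(3):  # stack several decoder layers
    h, o = hand_object_decoder_layer(h, o, p)
print(h.shape, o.shape)  # (10, 32) (10, 32)
```

Because each object query repeatedly pools from a specific hand query, the resulting object predictions are tied to hands rather than to a fixed category vocabulary, which is what enables open-world segmentation of held objects.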

More results

BibTeX

@InProceedings{sn_hoist_cvpr_2024,
  author    = {Supreeth Narasimhaswamy and Huy Anh Nguyen and Lihan Huang and Minh Hoai},
  title     = {HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}