H-InDex

Abstract

Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human Hand-Informed visual representation learning framework to solve difficult Dexterous manipulation tasks (H-InDex). Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) adapting the representations offline with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. Together, the last two stages modify only 0.36% of the parameters of the pre-trained representation, ensuring that the knowledge from pre-training is preserved to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that our method surpasses both the previous state-of-the-art method and recent visual foundation models for motor control by a large margin.
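To make stage (iii) more concrete, the following is a minimal sketch of an exponential moving average update for the normalization statistics of a single feature channel. It assumes that only the running mean and variance are refreshed from incoming batches; the function names, the `momentum` value, and the per-channel scalar layout are illustrative assumptions, not details taken from the H-InDex implementation.

```python
from statistics import fmean, pvariance


def ema_update(running, observed, momentum):
    """Blend a running statistic with a newly observed one (standard EMA)."""
    return (1.0 - momentum) * running + momentum * observed


def update_bn_stats(running_mean, running_var, batch, momentum=0.01):
    """Return the updated (mean, var) for one channel given a batch of scalars.

    The momentum default is a hypothetical value chosen for illustration.
    """
    batch_mean = fmean(batch)
    batch_var = pvariance(batch, mu=batch_mean)  # population (biased) variance
    new_mean = ema_update(running_mean, batch_mean, momentum)
    new_var = ema_update(running_var, batch_var, momentum)
    return new_mean, new_var


def normalize(x, mean, var, eps=1e-5):
    """Normalize a scalar activation with the running statistics."""
    return (x - mean) / (var + eps) ** 0.5


# Usage: start from pre-trained statistics and drift slowly toward new data.
mean, var = 0.0, 1.0
for batch in ([2.0, 4.0], [3.0, 5.0], [1.0, 3.0]):
    mean, var = update_bn_stats(mean, var, batch, momentum=0.1)
```

A small momentum keeps the statistics close to their pre-trained values while still letting them track the observation distribution seen during reinforcement learning, which is one plausible reading of why so few parameters need to change.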

Method Overview

Visualization of Tasks

We show successful trajectories for each task in our dexterous manipulation suite, generated by policies trained with H-InDex.

Hammer

Door

Pen

Pour

Place Inside

Relocate Large Clamp

Relocate Foam Brick

Relocate Box

Relocate Mug

Relocate Mustard Bottle

Relocate Tomato Soup Can

Relocate Potted Meat Can

Visualization of Self-Supervised Keypoint Detection

We visualize the results of self-supervised keypoint detection from Stage 2. The trajectory shown here is taken from the training videos.