Summary
It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments.
To achieve this goal, a robot needs a comprehensive understanding of the 3D structure and semantics of the scene.
In this work, we present GNFactor, a visual behavior cloning agent for multi-task robotic manipulation with Generalizable Neural Feature Fields.
GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation.
To incorporate semantics in 3D, the reconstruction module distills rich semantic information from a vision-language foundation model (e.g., Stable Diffusion) into the deep 3D voxel representation.
We evaluate GNFactor on 3 real-robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor.
Method
GNFactor is composed of a volumetric rendering module and a 3D policy module, sharing the same deep volumetric representation.
The volumetric rendering module learns a Generalizable Neural Feature Field (GNF) that reconstructs the RGB images from the cameras and the embeddings of a vision-language foundation model, e.g., Stable Diffusion.
The 3D policy module is a Perceiver Transformer that takes the deep volumetric representation built from a single RGB-D camera as input and outputs a 3D Q-function.
The task-agnostic nature of the vision-language embedding enables the volumetric representation to learn generalizable features via neural rendering and thus helps the 3D policy module better handle multi-task robotic manipulation.
The task description is encoded with CLIP to obtain the task embedding.
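To make the two-branch design concrete, below is a minimal PyTorch sketch of the data flow: a voxel encoder lifts the voxelized RGB-D observation into a deep 3D feature volume, a GNF head renders RGB and vision-language features from that volume, and a policy head conditioned on the CLIP task embedding outputs a per-voxel Q-value. All module names, shapes, and hyperparameters are illustrative assumptions (e.g., a small 3D CNN stands in for the actual Perceiver Transformer); this is not the authors' implementation.

import torch
import torch.nn as nn


class VoxelEncoder(nn.Module):
    # Lifts a voxelized RGB-D observation into a deep 3D feature volume (assumed small 3D CNN).
    def __init__(self, in_dim=4, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, voxel_grid):           # (B, 4, V, V, V): RGB + occupancy
        return self.net(voxel_grid)          # (B, feat_dim, V, V, V)


class GNFRenderer(nn.Module):
    # Renders per-point RGB and a vision-language feature (e.g., Stable Diffusion) from the volume.
    def __init__(self, feat_dim=64, vl_dim=512):
        super().__init__()
        self.rgb_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.feat_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, vl_dim))

    def forward(self, point_feats):          # (B, N, feat_dim) features sampled along camera rays
        return self.rgb_head(point_feats), self.feat_head(point_feats)


class PolicyHead(nn.Module):
    # Stand-in for the Perceiver Transformer: voxel features + task embedding -> 3D Q-function.
    def __init__(self, feat_dim=64, lang_dim=512):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, feat_dim)
        self.q_head = nn.Conv3d(feat_dim, 1, 1)

    def forward(self, voxel_feats, task_embedding):
        lang = self.lang_proj(task_embedding)[:, :, None, None, None]
        return self.q_head(voxel_feats + lang).squeeze(1)   # (B, V, V, V) per-voxel Q-values


if __name__ == "__main__":
    B, V = 1, 20
    voxel_grid = torch.randn(B, 4, V, V, V)       # voxelized single-view RGB-D observation
    task_embedding = torch.randn(B, 512)          # e.g., CLIP-encoded task description
    encoder, renderer, policy = VoxelEncoder(), GNFRenderer(), PolicyHead()

    feats = encoder(voxel_grid)
    # Rendering branch: in practice features are trilinearly sampled along camera rays;
    # here we simply take a subset of voxel features for illustration.
    sampled = feats.flatten(2).transpose(1, 2)[:, :1024]
    rgb, vl_feat = renderer(sampled)              # supervised by pixels / foundation-model features
    q_values = policy(feats, task_embedding)      # the argmax voxel gives the gripper target
    print(rgb.shape, vl_feat.shape, q_values.shape)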
Visualize Policy by 3D Grad-CAM
Thanks to the 3D structure of our policy module, we can visualize the policy with Grad-CAM directly in 3D space. Although the supervision signal during training is only the Q-value of a single voxel, the visualizations show that our policy clearly attends to the target objects.
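As a rough illustration of how such a visualization can be computed, the sketch below applies standard Grad-CAM to a 3D feature volume: the gradient of the selected voxel's Q-value is pooled over space to weight the feature channels, yielding a 3D saliency map. The module names follow the illustrative sketch in the Method section and are assumptions, not the authors' code.

import torch
import torch.nn.functional as F


def grad_cam_3d(encoder, policy, voxel_grid, task_embedding):
    # Returns a (V, V, V) saliency volume for the highest-scoring voxel action.
    feats = encoder(voxel_grid)                    # (1, C, V, V, V) intermediate 3D features
    feats.retain_grad()
    q_values = policy(feats, task_embedding)       # (1, V, V, V)
    q_values.flatten(1).max(dim=1).values.sum().backward()  # Q-value of the selected voxel

    # Channel weights: average the gradients over the spatial volume, then weight and ReLU.
    weights = feats.grad.mean(dim=(2, 3, 4), keepdim=True)  # (1, C, 1, 1, 1)
    cam = F.relu((weights * feats).sum(dim=1))              # (1, V, V, V)
    return (cam / (cam.max() + 1e-8))[0].detach()           # normalized to [0, 1]


# Usage with the illustrative modules from the Method sketch:
# cam = grad_cam_3d(encoder, policy, voxel_grid, task_embedding)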
Generalization to Unseen Environments
We show the generalization ability of GNFactor compared with PerAct in both real-world and simulated environments. GNFactor distills features from the vision-language foundation model to aid robotic manipulation, learning more generalizable representations that remain robust to distractors in unseen environments.
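As a rough sketch of the distillation objective implied here, the rendering branch can be trained to both reconstruct the observed RGB pixels and match the features produced by the vision-language foundation model along each ray; the loss forms and weighting below are assumptions for illustration, not the authors' exact objective.

import torch.nn.functional as F


def rendering_loss(pred_rgb, gt_rgb, pred_feat, target_vl_feat, feat_weight=0.01):
    # pred_rgb / gt_rgb: (N, 3) rendered vs. observed pixel colors along sampled rays.
    # pred_feat / target_vl_feat: (N, D) rendered features vs. features extracted by the
    # vision-language foundation model for the corresponding pixels.
    rgb_loss = F.mse_loss(pred_rgb, gt_rgb)            # reconstruct the camera views
    feat_loss = F.mse_loss(pred_feat, target_vl_feat)  # distill foundation-model semantics
    return rgb_loss + feat_weight * feat_loss          # feat_weight is an assumed hyperparameter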
Drag Stick with Distractor
Slide Smaller Block to Target
Citation
@inproceedings{Ze2023GNFactor,
title={GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields},
author={Yanjie Ze and Ge Yan and Yueh-Hua Wu and Annabella Macaluso and Yuying Ge and Jianglong Ye and Nicklas Hansen and Li Erran Li and Xiaolong Wang},
booktitle={Conference on Robot Learning (CoRL)},
year={2023},
}