Summary
It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments.
To achieve this goal, a robot needs a comprehensive understanding of the 3D structure and semantics of the scene.
In this work, we present GNFactor, a visual behavior cloning agent for multi-task robotic manipulation with Generalizable Neural Feature Fields.
GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation.
To incorporate semantics in 3D, the reconstruction module distills rich semantic information from a vision-language foundation model (e.g., Stable Diffusion) into the deep 3D voxel representation.
We evaluate GNFactor on 3 real-robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor.
Method
GNFactor is composed of a volumetric rendering module and a 3D policy module, sharing the same deep volumetric representation.
The volumetric rendering module learns a Generalizable Neural Feature Field (GNF) that reconstructs the RGB images from the cameras and the embeddings of a vision-language foundation model, e.g., Stable Diffusion.
The 3D policy module is a Perceiver Transformer that takes the deep volumetric representation built from a single RGB-D camera as input and outputs a 3D Q-function.
The task-agnostic nature of the vision-language embedding enables the volumetric representation to learn generalizable features via neural rendering and thus helps the 3D policy module better handle multi-task robotic manipulation.
The task description is encoded with CLIP to obtain the task embedding.
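To make the two-branch design concrete, below is a minimal PyTorch sketch of the data flow: a voxel encoder lifts the voxelized RGB-D observation into a deep 3D feature volume, a GNF head renders RGB and vision-language features from that volume, and a policy head conditioned on the CLIP task embedding outputs a per-voxel Q-value. All module names, shapes, and hyperparameters are illustrative assumptions (e.g., a small 3D CNN stands in for the actual Perceiver Transformer); this is not the authors' implementation.

import torch
import torch.nn as nn


class VoxelEncoder(nn.Module):
    # Lifts a voxelized RGB-D observation into a deep 3D feature volume (assumed small 3D CNN).
    def __init__(self, in_dim=4, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )

    def forward(self, voxel_grid):           # (B, 4, V, V, V): RGB + occupancy
        return self.net(voxel_grid)          # (B, feat_dim, V, V, V)


class GNFRenderer(nn.Module):
    # Renders per-point RGB and a vision-language feature (e.g., Stable Diffusion) from the volume.
    def __init__(self, feat_dim=64, vl_dim=512):
        super().__init__()
        self.rgb_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.feat_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, vl_dim))

    def forward(self, point_feats):          # (B, N, feat_dim) features sampled along camera rays
        return self.rgb_head(point_feats), self.feat_head(point_feats)


class PolicyHead(nn.Module):
    # Stand-in for the Perceiver Transformer: voxel features + task embedding -> 3D Q-function.
    def __init__(self, feat_dim=64, lang_dim=512):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, feat_dim)
        self.q_head = nn.Conv3d(feat_dim, 1, 1)

    def forward(self, voxel_feats, task_embedding):
        lang = self.lang_proj(task_embedding)[:, :, None, None, None]
        return self.q_head(voxel_feats + lang).squeeze(1)   # (B, V, V, V) per-voxel Q-values


if __name__ == "__main__":
    B, V = 1, 20
    voxel_grid = torch.randn(B, 4, V, V, V)       # voxelized single-view RGB-D observation
    task_embedding = torch.randn(B, 512)          # e.g., CLIP-encoded task description
    encoder, renderer, policy = VoxelEncoder(), GNFRenderer(), PolicyHead()

    feats = encoder(voxel_grid)
    # Rendering branch: in practice features are trilinearly sampled along camera rays;
    # here we simply take a subset of voxel features for illustration.
    sampled = feats.flatten(2).transpose(1, 2)[:, :1024]
    rgb, vl_feat = renderer(sampled)              # supervised by pixels / foundation-model features
    q_values = policy(feats, task_embedding)      # the argmax voxel gives the gripper target
    print(rgb.shape, vl_feat.shape, q_values.shape)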
Visualize Policy by 3D Grad-CAM
Thanks to the 3D structure of our policy module, we can visualize the policy with Grad-CAM directly in 3D space. Although the supervision signal during training is only the Q-value of a single voxel, the visualizations show that our policy clearly attends to the target objects.
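As a rough illustration of how such a visualization can be computed, the sketch below applies standard Grad-CAM to a 3D feature volume: the gradient of the selected voxel's Q-value is pooled over space to weight the feature channels, yielding a 3D saliency map. The module names follow the illustrative sketch in the Method section and are assumptions, not the authors' code.

import torch
import torch.nn.functional as F


def grad_cam_3d(encoder, policy, voxel_grid, task_embedding):
    # Returns a (V, V, V) saliency volume for the highest-scoring voxel action.
    feats = encoder(voxel_grid)                    # (1, C, V, V, V) intermediate 3D features
    feats.retain_grad()
    q_values = policy(feats, task_embedding)       # (1, V, V, V)
    q_values.flatten(1).max(dim=1).values.sum().backward()  # Q-value of the selected voxel

    # Channel weights: average the gradients over the spatial volume, then weight and ReLU.
    weights = feats.grad.mean(dim=(2, 3, 4), keepdim=True)  # (1, C, 1, 1, 1)
    cam = F.relu((weights * feats).sum(dim=1))              # (1, V, V, V)
    return (cam / (cam.max() + 1e-8))[0].detach()           # normalized to [0, 1]


# Usage with the illustrative modules from the Method sketch:
# cam = grad_cam_3d(encoder, policy, voxel_grid, task_embedding)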
Generalization to Unseen Environments
We show the generalization ability of GNFactor compared with PerAct in both real-world and simulated environments. GNFactor distills features from the vision-language foundation model to aid robotic manipulation, learning more generalizable representations that remain robust to distractors in unseen environments.
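As a rough sketch of the distillation objective implied here, the rendering branch can be trained to both reconstruct the observed RGB pixels and match the features produced by the vision-language foundation model along each ray; the loss forms and weighting below are assumptions for illustration, not the authors' exact objective.

import torch.nn.functional as F


def rendering_loss(pred_rgb, gt_rgb, pred_feat, target_vl_feat, feat_weight=0.01):
    # pred_rgb / gt_rgb: (N, 3) rendered vs. observed pixel colors along sampled rays.
    # pred_feat / target_vl_feat: (N, D) rendered features vs. features extracted by the
    # vision-language foundation model for the corresponding pixels.
    rgb_loss = F.mse_loss(pred_rgb, gt_rgb)            # reconstruct the camera views
    feat_loss = F.mse_loss(pred_feat, target_vl_feat)  # distill foundation-model semantics
    return rgb_loss + feat_weight * feat_loss          # feat_weight is an assumed hyperparameter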
Drag Stick with Distractor
Slide Smaller Block to Target
Citation
@inproceedings{Ze2023GNFactor,
title={GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields},
author={Yanjie Ze and Ge Yan and Yueh-Hua Wu and Annabella Macaluso and Yuying Ge and Jianglong Ye and Nicklas Hansen and Li Erran Li and Xiaolong Wang},
booktitle={Conference on Robot Learning (CoRL)},
year={2023},
}