GNFactor
Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

CoRL 2023 Oral
Yanjie Ze1*   Ge Yan2*   Yueh-Hua Wu2*   Annabella Macaluso2
Yuying Ge3   Jianglong Ye2   Nicklas Hansen2   Li Erran Li4   Xiaolong Wang2

1Shanghai Jiao Tong University 2UC San Diego 3University of Hong Kong 4AWS AI, Amazon
*Equal contribution


GNFactor is a visual behavior cloning agent for real-world multi-task robotic manipulation. With a single policy and only 5 demonstrations per task, it successfully performs three tasks in two kitchens. GNFactor utilizes a Generalizable Neural Feature Field (GNF) to learn a 3D volumetric representation, which is jointly optimized with the action prediction module.

Training

Generalization

Summary

It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot will need to have a comprehensive understanding of the 3D structure and semantics of the scene.

In this work, we present GNFactor, a visual behavior cloning agent for multi-task robotic manipulation with Generalizable Neural Feature Fields. GNFactor jointly optimizes a Generalizable Neural Field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module distills rich semantic information from a vision-language foundation model (e.g., Stable Diffusion) into the deep 3D voxel representation.
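As a rough illustration of this joint optimization, the sketch below combines a behavior-cloning loss on the predicted Q-values with RGB and feature reconstruction losses from neural rendering. The module interfaces, batch keys, and loss weight are hypothetical placeholders for illustration, not the released implementation.

import torch.nn.functional as F

def joint_loss(voxel_repr, batch, gnf_renderer, policy, lambda_recon=0.01):
    # Behavior cloning: the 3D policy predicts per-voxel Q-values and is
    # supervised by the expert's discretized action (hypothetical batch keys).
    q_pred = policy(voxel_repr, batch["task_embedding"])
    bc_loss = F.cross_entropy(q_pred, batch["expert_action"])

    # Neural rendering: the GNF reconstructs the RGB image and the
    # vision-language feature map (e.g., Stable Diffusion features) for target rays.
    rgb_pred, feat_pred = gnf_renderer(voxel_repr, batch["target_rays"])
    recon_loss = F.mse_loss(rgb_pred, batch["target_rgb"]) \
               + F.mse_loss(feat_pred, batch["target_feat"])

    # Both objectives are optimized jointly over the shared voxel representation.
    return bc_loss + lambda_recon * recon_loss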

We evaluate GNFactor on 3 real-robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods on both seen and unseen tasks, demonstrating the strong generalization ability of GNFactor.

Method

GNFactor is composed of a volumetric rendering module and a 3D policy module that share the same deep volumetric representation. The volumetric rendering module learns a Generalizable Neural Feature Field (GNF) to reconstruct the RGB images from the cameras and the embeddings from a vision-language foundation model, e.g., Stable Diffusion. The 3D policy module is a Perceiver Transformer that takes the deep volumetric representation built from a single RGB-D camera as input and outputs a 3D Q-function. The task-agnostic nature of the vision-language embedding enables the volumetric representation to learn generalizable features via neural rendering, which helps the 3D policy module better handle multi-task robotic manipulation. The task description is encoded with CLIP to obtain the task embedding.
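The schematic below sketches how the two modules could share the deep volumetric representation; the module classes, argument names, and call signatures are assumptions made for illustration rather than the actual code.

import torch.nn as nn

class GNFactorAgent(nn.Module):
    # Illustrative wiring of GNFactor's two branches over a shared voxel grid.
    def __init__(self, voxel_encoder, gnf_renderer, perceiver_policy, clip_text_encoder):
        super().__init__()
        self.voxel_encoder = voxel_encoder        # RGB-D observation -> deep 3D voxel representation
        self.gnf_renderer = gnf_renderer          # volumetric rendering of RGB + vision-language features
        self.perceiver_policy = perceiver_policy  # Perceiver Transformer over voxel tokens
        self.clip_text_encoder = clip_text_encoder

    def forward(self, rgbd_obs, task_description, target_rays=None):
        voxels = self.voxel_encoder(rgbd_obs)                 # shared deep volumetric representation
        task_emb = self.clip_text_encoder(task_description)   # CLIP task embedding
        q_values = self.perceiver_policy(voxels, task_emb)    # 3D Q-function over the voxel grid

        renders = None
        if target_rays is not None:                           # rendering branch used only during training
            renders = self.gnf_renderer(voxels, target_rays)  # (RGB, feature) reconstructions
        return q_values, renders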


Visualizing the Policy with 3D Grad-CAM

Thanks to the 3D structure of our policy module, we can visualize the policy with Grad-CAM directly in 3D space. Although the supervision signal during training is only the Q-value of a single voxel, the visualizations show that our policy clearly attends to the target objects.
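A minimal sketch of how such a 3D Grad-CAM volume could be computed is shown below, assuming access to an intermediate voxel feature map of shape (B, C, X, Y, Z); the function and tensor names are illustrative, not taken from the released code.

import torch
import torch.nn.functional as F

def grad_cam_3d(feat3d, q_values, best_voxel_index):
    # Gradient of the selected voxel's Q-value w.r.t. the intermediate 3D feature map.
    q_best = q_values.flatten(1)[:, best_voxel_index].sum()
    grads = torch.autograd.grad(q_best, feat3d, retain_graph=True)[0]

    # Channel weights are the spatially averaged gradients, as in standard Grad-CAM.
    weights = grads.mean(dim=(2, 3, 4), keepdim=True)    # (B, C, 1, 1, 1)
    cam = F.relu((weights * feat3d).sum(dim=1))          # (B, X, Y, Z) saliency volume
    return cam / (cam.amax(dim=(1, 2, 3), keepdim=True) + 1e-8)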


"Turn the Faucet"
"Open the Top Microwave Door"
"Place the Tea Pot on the Stove"

Generalization to Unseen Environments

We show the generalization ability of GNFactor compared with PerAct in both real-world and simulated environments. GNFactor distills features from the vision-language foundation model to aid robotic manipulation, learning more generalizable features and becoming more robust to distractors in unseen environments.


Drag Stick with Distractor

GNFactor

PerAct

Slide Smaller Block to Target

GNFactor

PerAct

Slide Larger Block to Target

GNFactor

PerAct

Push Buttons with Distractor

GNFactor

PerAct

Open Drawer in New Position

GNFactor

PerAct

Turn Tap in New Position

GNFactor

PerAct

Citation

@inproceedings{Ze2023GNFactor,
  title={Multi-Task Real Robot Learning with Generalizable Neural Feature Fields},
  author={Yanjie Ze and Ge Yan and Yueh-Hua Wu and Annabella Macaluso and Yuying Ge and Jianglong Ye and Nicklas Hansen and Li Erran Li and Xiaolong Wang},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2023},
}