Unlike HumanPlus, which uses single-stage RL, and OmniH2O, which employs DAgger, we find that our combined RL+BC pipeline achieves superior tracking accuracy and motion smoothness. Pure RL approaches frequently exhibit foot-sliding artifacts because they cannot anticipate future motion goals. Meanwhile, DAgger occasionally fails to track unseen motions stably and robustly, since it lacks the task-reward guidance that RL provides. In summary, while RL generalizes better than BC, combining the two approaches yields the best performance.
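To make the combination concrete, below is a minimal PyTorch sketch of a student update that mixes a PPO-style RL surrogate with a BC regression term toward a teacher policy. All names here (`student`, `teacher`, `bc_coef`, the batch keys) are illustrative assumptions, not the exact TWIST implementation.

```python
import torch
import torch.nn.functional as F

def rl_bc_update(student, teacher, optimizer, batch, bc_coef=1.0, clip_eps=0.2):
    obs = batch["obs"]
    actions = batch["actions"]
    advantages = batch["advantages"]
    old_log_probs = batch["old_log_probs"]

    # RL term: clipped PPO surrogate driven by the motion-tracking task reward.
    dist = student.action_distribution(obs)  # assumed: returns a torch Distribution
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    rl_loss = -surrogate.mean()

    # BC term: regress the student's mean action onto the teacher's action,
    # which supplies the smooth, anticipatory behavior that pure RL lacks.
    with torch.no_grad():
        teacher_actions = teacher.act(obs)  # assumed teacher interface
    bc_loss = F.mse_loss(dist.mean, teacher_actions)

    loss = rl_loss + bc_coef * bc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"rl_loss": rl_loss.item(), "bc_loss": bc_loss.item()}
```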
We find that adding even a small set of in-house MoCap sequences, retargeted online to mimic real teleoperation, substantially reduces tracking errors on unseen motions. The gain arises because this data better matches test-time conditions in two ways: (1) our in-house captures are inherently noisier and less stable, suffering from calibration drift and occlusions; and (2) our online retargeter yields less smooth reference motions than the offline version. Training on such data thus exposes the controller to the same noise it encounters during real teleoperation.
Because the controller's learning objective is purely motion tracking, tasks that require exerting force (e.g., lifting a box) rather than reaching target positions are out of distribution, and the controller occasionally produces jittering behaviors on them. To enable the controller to learn to apply force, we propose training it with large end-effector perturbations, as sketched below.
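Here is a hedged sketch of sampling random end-effector forces each simulation step. The simulator call in the trailing comment follows the Isaac Gym `apply_rigid_body_force_tensors` API; the body indices, push probability, and force magnitude are illustrative assumptions, not the paper's exact values.

```python
import torch

def sample_ee_perturbations(num_envs, ee_body_ids, num_bodies,
                            max_force=100.0, push_prob=0.1, device="cuda"):
    """Sample a (num_envs, num_bodies, 3) force tensor that pushes the
    end-effectors of a random subset of environments in random directions."""
    forces = torch.zeros(num_envs, num_bodies, 3, device=device)
    pushed = (torch.rand(num_envs, device=device) < push_prob).float()
    directions = torch.randn(num_envs, len(ee_body_ids), 3, device=device)
    directions = directions / directions.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    magnitudes = torch.rand(num_envs, len(ee_body_ids), 1, device=device) * max_force
    forces[:, ee_body_ids, :] = directions * magnitudes * pushed.view(-1, 1, 1)
    return forces

# Inside the environment step, before stepping physics (Isaac Gym API):
#   forces = sample_ee_perturbations(self.num_envs, self.ee_ids, self.num_bodies)
#   self.gym.apply_rigid_body_force_tensors(
#       self.sim, gymtorch.unwrap_tensor(forces), None, gymapi.ENV_SPACE)
```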
The total teleoperation delay of our system is approximately 0.9 seconds, as measured from video recordings. The major overhead comes from generating tracking goals (0.7 seconds), while policy inference remains efficient (0.2 seconds). Reducing this latency is a key focus of future work.
To demonstrate that TWIST serves as a general framework across embodiments, we further evaluate it on the Booster T1. Below we show the sim-to-sim evaluation results: the controller successfully tracks diverse motions, including arm swinging with coordinated whole-body movement, deep crouching, and walking.
Q: What is the limitation of TWIST?
A: 1) Visual occlusions during teleoperation; 2) no tactile feedback to tell whether the robot is gripping an object tightly; 3) overheating of the robot motors after 5-10 minutes of teleoperation (especially for tasks like crouching); 4) the MoCap setup is not portable.
Q: Why is there no autonomous result?
A: We want to focus on building a strong whole-body teleoperation system first, as no such system exists yet. We will next study how to learn visuomotor policies with our system.
Q: How do you pause and resume teleoperation (so that the human can move freely)?
A: This is essential in real-world teleoperation. The teleoperator holds a joystick and presses a button to pause and resume teleoperation; a minimal sketch follows.
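A minimal sketch of the pause/resume gate, assuming a pygame-readable joystick. The button index and the two send functions are hypothetical placeholders, not the system's actual interface.

```python
import time
import pygame

def send_hold_command():
    """Hypothetical placeholder: keep the robot holding its current pose."""

def send_tracking_goal():
    """Hypothetical placeholder: stream the latest retargeted MoCap goal."""

pygame.init()
pygame.joystick.init()
joystick = pygame.joystick.Joystick(0)
joystick.init()

PAUSE_BUTTON = 0   # assumed button index; depends on the joystick mapping
paused = True      # start paused so the operator can get into position first

while True:
    for event in pygame.event.get():
        if event.type == pygame.JOYBUTTONDOWN and event.button == PAUSE_BUTTON:
            paused = not paused
    if paused:
        send_hold_command()
    else:
        send_tracking_goal()
    time.sleep(0.01)  # run the gate at roughly 100 Hz
```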
Q: Are there any other interesting notes on the system design?
A: Adding a voice reminder to the system is quite helpful; see the sketch below.
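For instance, a spoken reminder can be produced with an off-the-shelf TTS library; pyttsx3 here is an assumed choice, and the announced message is illustrative rather than what the system actually says.

```python
import pyttsx3

engine = pyttsx3.init()

def announce(message: str):
    """Speak a short status message (e.g., when teleop pauses or resumes)."""
    engine.say(message)
    engine.runAndWait()

announce("Teleoperation paused")
```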
For more questions, please contact Yanjie Ze; we will update this section accordingly.
We would like to thank all members of the CogAI group and The Movement Lab from Stanford University for their support, Sirui (Eric) Chen for his help with the video shooting and real-world experiments, and Haoyu Xiong for his helpful discussions. We also thank the Stanford Robotics Center for providing the experiment space and the MoCap devices. This work is supported in part by ONR MURI N00014-24-1-2748, NSF:FRR 2153854, Stanford HAI, and the Stanford Wu-Tsai Human Performance Alliance.
@article{ze2025twist,
  title={TWIST: Teleoperated Whole-Body Imitation System},
  author={Yanjie Ze and Zixuan Chen and João Pedro Araújo and Zi-ang Cao and Xue Bin Peng and Jiajun Wu and C. Karen Liu},
  year={2025},
  journal={arXiv preprint arXiv:2505.02833}
}