Unlike HumanPlus, which uses single-stage RL, and OmniH2O, which employs DAgger, we find that our combined RL+BC pipeline achieves superior tracking accuracy and motion smoothness. Pure RL approaches frequently exhibit foot-sliding artifacts because they cannot anticipate future motion goals. Meanwhile, DAgger occasionally fails to track unseen motions stably and robustly, since it lacks the task-reward guidance that RL provides. In summary, while RL generalizes better than BC, combining the two approaches yields the best performance.
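To make the combination concrete, below is a minimal PyTorch sketch of a student update that mixes a PPO-style RL surrogate with a BC regression term toward a teacher policy. All names here (`student`, `teacher`, `bc_coef`, the batch keys) are illustrative assumptions, not the exact TWIST implementation.

```python
import torch
import torch.nn.functional as F

def rl_bc_update(student, teacher, optimizer, batch, bc_coef=1.0, clip_eps=0.2):
    obs = batch["obs"]
    actions = batch["actions"]
    advantages = batch["advantages"]
    old_log_probs = batch["old_log_probs"]

    # RL term: clipped PPO surrogate driven by the motion-tracking task reward.
    dist = student.action_distribution(obs)  # assumed: returns a torch Distribution
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    rl_loss = -surrogate.mean()

    # BC term: regress the student's mean action onto the teacher's action,
    # which supplies the smooth, anticipatory behavior that pure RL lacks.
    with torch.no_grad():
        teacher_actions = teacher.act(obs)  # assumed teacher interface
    bc_loss = F.mse_loss(dist.mean, teacher_actions)

    loss = rl_loss + bc_coef * bc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return {"rl_loss": rl_loss.item(), "bc_loss": bc_loss.item()}
```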
We find that adding even a small set of in-house MoCap sequences, retargeted online to mimic real teleoperation, substantially reduces tracking errors on unseen motions. The gain arises because this data better matches test-time conditions in two ways: (1) our in-house captures are inherently noisier and less stable, suffering from calibration drift and occlusions; and (2) our online retargeter yields less smooth reference motions than the offline version. Training on such data thus exposes the controller to the same noise it encounters during real teleoperation.
Because the controller's learning objective is purely motion tracking, tasks that require exerting force (e.g., lifting a box) rather than reaching target positions are out of distribution, and the controller occasionally produces jittering behaviors on them. To enable the controller to learn to apply force, we propose training it with large end-effector perturbations, as sketched below.
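Here is a hedged sketch of sampling random end-effector forces each simulation step. The simulator call in the trailing comment follows the Isaac Gym `apply_rigid_body_force_tensors` API; the body indices, push probability, and force magnitude are illustrative assumptions, not the paper's exact values.

```python
import torch

def sample_ee_perturbations(num_envs, ee_body_ids, num_bodies,
                            max_force=100.0, push_prob=0.1, device="cuda"):
    """Sample a (num_envs, num_bodies, 3) force tensor that pushes the
    end-effectors of a random subset of environments in random directions."""
    forces = torch.zeros(num_envs, num_bodies, 3, device=device)
    pushed = (torch.rand(num_envs, device=device) < push_prob).float()
    directions = torch.randn(num_envs, len(ee_body_ids), 3, device=device)
    directions = directions / directions.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    magnitudes = torch.rand(num_envs, len(ee_body_ids), 1, device=device) * max_force
    forces[:, ee_body_ids, :] = directions * magnitudes * pushed.view(-1, 1, 1)
    return forces

# Inside the environment step, before stepping physics (Isaac Gym API):
#   forces = sample_ee_perturbations(self.num_envs, self.ee_ids, self.num_bodies)
#   self.gym.apply_rigid_body_force_tensors(
#       self.sim, gymtorch.unwrap_tensor(forces), None, gymapi.ENV_SPACE)
```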
The total teleoperation delay of our system is approximately 0.9 seconds, as measured from video recordings. The major overhead comes from generating tracking goals (0.7 seconds), while policy inference remains efficient (0.2 seconds). Reducing this latency is a key focus of future work.
To demonstrate that TWIST serves as a general framework across embodiments, we further evaluate it on the Booster T1. Below we show the sim-to-sim evaluation results: the controller successfully tracks diverse motions, including arm swinging with coordinated whole-body movement, deep crouching, and walking.
Q: What is the limitation of TWIST?
A: 1) Visual occlusions during teleoperation; 2) no tactile feedback to tell whether the robot is gripping an object tightly; 3) overheating of the robot motors after 5-10 minutes of teleoperation (especially for tasks like crouching); 4) the MoCap setup is not portable.
Q: Why is there no autonomous result?
A: We want to focus on building a strong whole-body teleoperation system first, as no such system exists yet. We will next study how to learn visuomotor policies with our system.
Q: How do you pause and resume teleoperation (so that the human can move freely)?
A: This is essential in real-world teleoperation. The teleoperator holds a joystick and presses a button to pause and resume teleoperation; a minimal sketch follows.
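A minimal sketch of the pause/resume gate, assuming a pygame-readable joystick. The button index and the two send functions are hypothetical placeholders, not the system's actual interface.

```python
import time
import pygame

def send_hold_command():
    """Hypothetical placeholder: keep the robot holding its current pose."""

def send_tracking_goal():
    """Hypothetical placeholder: stream the latest retargeted MoCap goal."""

pygame.init()
pygame.joystick.init()
joystick = pygame.joystick.Joystick(0)
joystick.init()

PAUSE_BUTTON = 0   # assumed button index; depends on the joystick mapping
paused = True      # start paused so the operator can get into position first

while True:
    for event in pygame.event.get():
        if event.type == pygame.JOYBUTTONDOWN and event.button == PAUSE_BUTTON:
            paused = not paused
    if paused:
        send_hold_command()
    else:
        send_tracking_goal()
    time.sleep(0.01)  # run the gate at roughly 100 Hz
```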
Q: Are there any other interesting notes on the system design?
A: Adding a voice reminder to the system is quite helpful; see the sketch below.
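For instance, a spoken reminder can be produced with an off-the-shelf TTS library; pyttsx3 here is an assumed choice, and the announced message is illustrative rather than what the system actually says.

```python
import pyttsx3

engine = pyttsx3.init()

def announce(message: str):
    """Speak a short status message (e.g., when teleop pauses or resumes)."""
    engine.say(message)
    engine.runAndWait()

announce("Teleoperation paused")
```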
For more questions, please contact Yanjie Ze; we will update this section accordingly.
We would like to thank all members of the CogAI group and The Movement Lab from Stanford University for their support, Sirui (Eric) Chen for his help with the video shooting and real-world experiments, and Haoyu Xiong for his helpful discussions. We also thank the Stanford Robotics Center for providing the experiment space and the MoCap devices. This work is supported in part by ONR MURI N00014-24-1-2748, NSF:FRR 2153854, Stanford HAI, and the Stanford Wu-Tsai Human Performance Alliance.
@article{ze2025twist,
  title={TWIST: Teleoperated Whole-Body Imitation System},
  author={Yanjie Ze and Zixuan Chen and João Pedro Araújo and Zi-ang Cao and Xue Bin Peng and Jiajun Wu and C. Karen Liu},
  year={2025},
  journal={arXiv preprint arXiv:2505.02833}
}