Many real-world manipulation tasks consist of a series of subtasks that differ significantly from one another. Such long-horizon, complex tasks highlight the potential of dexterous hands, which possess the adaptability and versatility to transition seamlessly between different modes of functionality without re-grasping or external tools. However, challenges arise from the high-dimensional action space of a dexterous hand and the complex compositional dynamics of long-horizon tasks. We present Sequential Dexterity, a general system based on reinforcement learning (RL) that chains multiple dexterous policies to achieve long-horizon task goals. The core of the system is a transition feasibility function that progressively fine-tunes the sub-policies to improve the chaining success rate, while also enabling autonomous policy switching for recovery from failures and bypassing of redundant stages. Despite being trained only in simulation with a few task objects, our system generalizes to novel object shapes and transfers zero-shot to a real-world robot equipped with a dexterous hand.
The primary contributions of this work encompass: (1) Sequential Dexterity, an RL-based system that chains multiple dexterous sub-policies to accomplish long-horizon manipulation goals; (2) a transition feasibility function that progressively fine-tunes the sub-policies to improve chaining success and enables autonomous policy switching for failure recovery and skipping redundant stages; and (3) generalization to novel object shapes and zero-shot transfer from simulation to a real-world robot with a dexterous hand.
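To make the policy-switching idea concrete, here is a minimal sketch of how a learned transition feasibility function could drive switching at execution time. It is an illustrative assumption rather than the paper's implementation; the names (select_stage, run_episode, feasibility_fns) and the gym-style environment interface are hypothetical.

def select_stage(obs, feasibility_fns, threshold=0.5):
    # feasibility_fns[k] scores how likely sub-policy k is to succeed from obs.
    # Picking the latest feasible stage lets the system bypass redundant stages;
    # falling back to an earlier stage recovers from failures.
    scores = [f(obs) for f in feasibility_fns]
    feasible = [k for k, s in enumerate(scores) if s > threshold]
    return max(feasible) if feasible else 0

def run_episode(env, policies, feasibility_fns, max_steps=1000):
    obs = env.reset()
    for _ in range(max_steps):
        stage = select_stage(obs, feasibility_fns)   # autonomous policy switching
        action = policies[stage](obs)                # execute the selected sub-policy
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs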
Q: Why a dexterous hand?
A: Many real-world manipulation tasks consist of a sequence of smaller but drastically different subtasks. Such a task demands a flexible and versatile manipulator that can adapt and switch between different modes of functionality seamlessly, avoiding re-grasping or the use of external tools. Dexterous hands have the potential to reach human-level dexterity by exploiting various hand configurations and their inherent capabilities. Our approach underscores the broad potential of a dexterous hand as a versatile manipulator, capable of handling a sequence of tasks without alternating between task-oriented end-effectors.
Q: Can parallel-jaw grippers accomplish the Building Blocks task?
A: This task poses several challenges for parallel-jaw grippers. In the searching subtask, the small contact area makes a parallel gripper very inefficient at pushing blocks aside, and it struggles to retrieve blocks that are deeply buried. In the inserting subtask, if the object pose is unfavorable (e.g., the slot of the block is occluded by one of the gripper fingers), the parallel gripper has to adjust its grasp through re-grasping and cannot perform an efficient in-hand adjustment.
Q: Why not execute each single-stage skill one after the other? (Why is policy chaining important?)
A: While this simple strategy works in some scenarios [10, 11], a sub-policy can easily fail when it encounters a starting state it has never seen during training. Regularizing the state space between neighboring skills can mitigate this out-of-distribution issue [13, 14], but long-horizon dexterous manipulation requires optimizing the entire skill chain as a whole, because of the complex coordination between non-adjacent subtasks. For instance, in the Building Blocks task, the robot needs to strategize in advance when orienting the block, aiming for an object pose that facilitates not only the immediate grasping but also the insertion at a later stage of the task.
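As a rough illustration of such whole-chain optimization, the sketch below fine-tunes each sub-policy in reverse order with a terminal bonus given by the next stage's transition feasibility, so that earlier skills learn to end in states that later skills can work with. The trainer train_ppo and the reward-shaping interface are assumptions for illustration, not the paper's actual training code.

def finetune_chain(policies, feasibility_fns, env, bonus_weight=1.0):
    # Walk the chain backwards: the last skill stays fixed, and each earlier
    # skill is fine-tuned against the feasibility of its successor, so the
    # signal propagates from the end of the task toward the beginning.
    for k in reversed(range(len(policies) - 1)):
        next_feasibility = feasibility_fns[k + 1]

        def chained_reward(obs, reward, done, feas=next_feasibility):
            # Original task reward plus a terminal bonus that tells sub-policy k
            # how workable its end state is for sub-policy k + 1.
            if done:
                reward += bonus_weight * feas(obs)
            return reward

        policies[k] = train_ppo(policies[k], env, reward_fn=chained_reward)
    return policies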
Q: Why not fine-tune the sub-policy with the original RL value function from the next sub-policy?
A: In Table 1 (results for the Building Blocks task), the models learned with the transition feasibility function (Ours and Ours w/o temporal) outperform the one using the PPO-trained value function (V-Chain) by more than 30% in task success rate. This result implies that the value function of the PPO policy fails to model the feasibility of the subsequent policy, which in turn degrades the policy-chaining results. Qualitatively, an RL value function struggles to correctly model the final state of an MDP because of reward discounting, yet the final state (success or failure) is what matters for successful policy chaining.
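For intuition, one way to model terminal success directly is to train the transition feasibility function as a binary classifier on whether rollouts of the subsequent sub-policy end in success, rather than regressing a discounted return. The PyTorch sketch below is a simplified, per-state version under that assumption; it omits the temporal component ablated as "Ours w/o temporal", and FeasibilityNet / train_feasibility are hypothetical names, not the paper's code.

import torch
import torch.nn as nn

class FeasibilityNet(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        # Probability that the next sub-policy succeeds when started from obs.
        return torch.sigmoid(self.net(obs))

def train_feasibility(model, states, successes, epochs=10, lr=1e-3):
    # states: observations handed over to the next sub-policy (N x obs_dim tensor)
    # successes: 1.0 if the subsequent rollout ended in task success, else 0.0
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        pred = model(states).squeeze(-1)
        loss = loss_fn(pred, successes)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model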
@article{chen2023sequential,
title={Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation},
author={Chen, Yuanpei and Wang, Chen and Fei-Fei, Li and Liu, C Karen},
journal={arXiv preprint arXiv:2309.00987},
year={2023}
}