Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation

CoRL 2023

Stanford University
*Equal Contribution

Sequential Dexterity is a general system based on reinforcement learning (RL) that chains multiple dexterous policies for achieving long-horizon task goals.

Despite being trained only in simulation with a few task objects, our system demonstrates zero-shot transfer to a real-world robot equipped with a dexterous hand. These videos show our system building blocks into different shapes in the real world.


Many real-world manipulation tasks consist of a series of subtasks that differ significantly from one another. Such long-horizon, complex tasks highlight the potential of dexterous hands, which possess the adaptability and versatility to seamlessly transition between different modes of functionality without re-grasping or external tools. However, challenges arise from the high-dimensional action space of a dexterous hand and the complex compositional dynamics of long-horizon tasks. We present Sequential Dexterity, a general system based on reinforcement learning (RL) that chains multiple dexterous policies to achieve long-horizon task goals. The core of the system is a transition feasibility function that progressively fine-tunes the sub-policies to enhance the chaining success rate, while also enabling autonomous policy-switching for recovering from failures and bypassing redundant stages. Despite being trained only in simulation with a few task objects, our system generalizes to novel object shapes and zero-shot transfers to a real-world robot equipped with a dexterous hand.


(a). A bi-directional optimization scheme consisting of a forward initialization process and a backward fine-tuning mechanism based on the transition feasibility function.
(b). The learned system zero-shot transfers to the real world. The transition feasibility function serves as a policy-switching identifier, selecting the most appropriate sub-policy to execute at each time step.
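The two-phase scheme in (a) can be sketched as a simple training loop. This is a hypothetical sketch, not the actual implementation: `train`, `finetune`, and `fit_feasibility` are placeholder callables standing in for the RL training, feasibility-guided fine-tuning, and feasibility-function fitting steps.

```python
def bidirectional_optimize(subtasks, train, finetune, fit_feasibility):
    """Sketch of the bi-directional optimization scheme.

    subtasks: sub-policy names in task order, e.g. search -> insert.
    """
    # Forward initialization: train each sub-policy in task order,
    # seeding every stage with terminal states from the previous stage.
    policies, init_states = {}, None
    for task in subtasks:
        policies[task], init_states = train(task, init_states)

    # Backward fine-tuning: walk the chain in reverse, fit a transition
    # feasibility function for each later stage, and use it as an extra
    # training signal when fine-tuning the preceding sub-policy.
    feasibility = {}
    for i in range(len(subtasks) - 1, 0, -1):
        later, earlier = subtasks[i], subtasks[i - 1]
        feasibility[later] = fit_feasibility(policies[later])
        policies[earlier] = finetune(policies[earlier], feasibility[later])
    return policies, feasibility
```

The backward pass means an early skill (e.g. Orient) is optimized against the feasibility of everything downstream of it, not just its own reward.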


Environment Setups

We test Sequential Dexterity in two environments:
(a). Workspace of the Building Blocks task in simulation and the real world. This long-horizon task includes four subtasks: Searching for a block with the desired dimensions and color in a pile of cluttered blocks, Orienting the block to a favorable pose, Grasping the block, and finally Inserting the block into its designated position on the structure. This sequence repeats until the structure is completed according to the given assembly instructions.
(b). The setup of the Tool Positioning task. Initially, the tool is placed on the table in a random pose, and the dexterous hand needs to grasp the tool and re-orient it to a ready-to-use pose. The comparison results illustrate how the grasping strategy directly influences the subsequent re-orientation.

Learning sequential sub-policies in Isaac-Gym

Evaluation rollouts of the learned sub-policies

Evaluation rollouts of the chained policy sequence

Building Blocks task

Since the contact-rich insertion of blocks cannot be accurately simulated, the insertion skill in simulation is a simplified version in which the robot learns to use finger jittering to place the block. In the real-world experiments, we only run the insertion policy while the hand is still in the air (in-hand rotation and alignment to the goal location). The downward moving and pressing motion used to fully insert the block is scripted.

Tool Positioning task

Hammer - Policy-Seq [10]
Hammer - Ours
Spatula (Unseen) - Policy-Seq [10]
Spatula (Unseen) - Ours
Spoon (Unseen) - Policy-Seq [10]
Spoon (Unseen) - Ours

Interactive GUI

Feel free to try out our quick demo (Interactive GUI)! To achieve the best performance in this demo, the policy takes the full state information (object acceleration, motor velocity, …) as input and is allowed to control the end-effector orientation; this is the policy before distillation for real-world deployment.

Additional Experiments

Qualitative results of policy-switching with the transition feasibility function. Each result contains an image from the wrist-mounted camera (left) and the corresponding feasibility scores output by the transition feasibility functions (right). We highlight the target block in the image for better visualization. The policy-switching process visits each sub-policy in reverse order (Insert - Grasp - Orient). The first sub-policy with a feasibility score greater than 1.0 is selected for execution. If no feasibility score exceeds 1.0, the Search sub-policy is selected.
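The switching rule above is simple enough to state in a few lines. The sketch below is an assumed formalization, not the deployed code: each feasibility function is modeled as a callable mapping the current observation to a scalar score, and the subtask names and 1.0 threshold follow the description.

```python
# Sub-policies in task order; Search is the fallback when nothing is feasible.
SUBTASK_ORDER = ["search", "orient", "grasp", "insert"]

def select_subpolicy(obs, feasibility_fns, threshold=1.0):
    """Pick the sub-policy to execute at the current time step.

    Visits the learned sub-policies in reverse task order
    (insert -> grasp -> orient) and returns the first one whose
    transition feasibility score exceeds the threshold.
    """
    for task in reversed(SUBTASK_ORDER[1:]):
        if feasibility_fns[task](obs) > threshold:
            return task
    # No later stage is feasible from this state: go back to Search.
    return "search"
```

Because later stages are checked first, the system can skip redundant stages (e.g. jump straight to Insert when the block is already well grasped) and can also fall back to an earlier stage after a failure.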

(a). Performance improvements of our approach given a maximum of 0/1/2/3 policy switches. The ability to switch sub-policies autonomously with the transition feasibility function greatly improves the success rate on long-horizon tasks.
(b). Visualization of object poses with high feasibility scores for the Grasp sub-policy in the Building Blocks task. The x, y, and z axes are the roll, yaw, and pitch of the object, respectively. Our transition feasibility function correctly propagates the goal of the succeeding Insert skill back to the preceding Grasp policy and encourages it to grasp the block with its studs facing up (easy to insert).

Quantitative results in the Building Blocks task

Sim-to-real system preparation (target object tracking)

To transfer the learned policies to the real world, we develop an object-tracking system consisting of a top-down camera and a wrist RGB-D camera mounted on the robot:
(1). We use the top-down camera to localize the target block and guide the end-effector towards it.
(2). We use a color-based segmentation method to localize the target block in the wrist-view and use XMem to track the segment.
(3). Based on the segmentation and RGB-D inputs, DenseFusion is used to estimate the 6D pose of the block in real-time.
(4). During real-world deployment, we also increase the exponential moving average[6] smoothing factor to alleviate finger jittering.
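Step (4) can be illustrated with a minimal sketch of exponential-moving-average action smoothing. This is an assumed formulation: per-joint target angles are blended with the previous command, under the convention that a larger smoothing factor `alpha` retains more of the previous target (smoother but laggier fingers).

```python
def ema_smooth(prev_target, new_target, alpha=0.9):
    """Blend the new per-joint target with the previous command.

    alpha close to 1.0 keeps more of the previous target, which damps
    the high-frequency finger jittering seen during deployment.
    """
    return [alpha * p + (1.0 - alpha) * n
            for p, n in zip(prev_target, new_target)]
```

In a control loop, the smoothed output of one step becomes `prev_target` for the next, so raising `alpha` for real-world deployment trades a little responsiveness for steadier finger motion.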


The primary contributions of this work encompass:

  • To the best of our knowledge, the first work to explore policy chaining for long-horizon dexterous manipulation.
  • A general bi-directional optimization framework that effectively chains multiple dexterous skills for long-horizon dexterous manipulation.
  • State-of-the-art results in multi-stage dexterous manipulation tasks and zero-shot transfer to a real-world dexterous robot system.

Q & A

Q: Why dexterous hand?
A: Many real-world manipulation tasks consist of a sequence of smaller but drastically different subtasks. Such a task demands a flexible and versatile manipulator that can adapt and switch between different modes of functionality seamlessly, avoiding re-grasping or the use of external tools. Dexterous hands have the potential to reach human-level dexterity by utilizing various hand configurations and their inherent capabilities. Our approach underscores the extensive potential of a dexterous hand as a versatile manipulator, capable of managing a sequence of tasks without alternating between task-oriented end-effectors.

Q: Can parallel-jaw grippers accomplish the Building Blocks task?
A: This task poses several challenges for parallel-jaw grippers. In the Searching subtask, due to the small contact area, a parallel gripper is very inefficient at pushing the blocks and struggles to retrieve blocks that have been deeply buried. In the Inserting subtask, if the object pose is not optimal (e.g., the slot of the block is occluded by one of the gripper fingers), the parallel gripper has to adjust the grasping pose by re-grasping and cannot perform an efficient in-hand adjustment.

Q: Why not execute each single-stage skill one after the other? (Why is policy-chaining important?)
A: While this simple strategy works in some scenarios [10, 11], a subtask can easily fail when it encounters a starting state it never saw during training. Regularizing the state space between neighboring skills can mitigate this out-of-distribution issue [13, 14], but long-horizon dexterous manipulation requires a comprehensive optimization of the entire skill chain, due to the complex coordination between non-adjacent tasks. For instance, in the Building Blocks task, the robot needs to strategize in advance when orienting the block, aiming for an object pose that facilitates not only the immediate subsequent grasping but also the insertion later in the task.

Q: Why not fine-tune the sub-policy with the original RL value function from the next sub-policy?
A: In Table 1 (results for the Building Blocks task), the models learned with the transition feasibility function (Ours and Ours w/o temporal) outperform the one using the PPO-trained value function (V-Chain) by more than 30% in task success rate. This result implies that the value function of the PPO policy fails to model the feasibility of the subsequent policy, which in turn degrades the policy-chaining results. Qualitatively, the RL value function struggles to correctly model the final state of an MDP due to reward discounting, yet the final state (success or failure) is what matters for successful policy chaining.


@article{chen2023sequential,
      title={Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation},
      author={Chen, Yuanpei and Wang, Chen and Fei-Fei, Li and Liu, C Karen},
      journal={arXiv preprint arXiv:2309.00987},
      year={2023}
}