Deep Reinforcement Learning for Optimized Variable Impedance Control in Compliant Robotic Manipulation

Kaelen Voss; Lina al-Jamil; Ren Ito

Journal: International Journal of Robotics, Automation, and Control Systems (IJRACS), ISSN 3087-4831

Citation: IJRACS 1(1), 2019-01-01.

Type: Original Research

Abstract

The increasing deployment of robots in unstructured and human-centric environments demands sophisticated control strategies that can handle physical interaction safely and effectively. Impedance control is a fundamental approach for managing such interactions, yet its efficacy is often limited by the use of fixed, manually-tuned parameters that are suboptimal across different phases of a contact-rich task. This paper addresses the challenge of optimizing impedance control by proposing a novel framework based on Deep Reinforcement Learning (DRL). We formulate the variable impedance control problem as a continuous control task where a DRL agent learns to dynamically modulate the stiffness parameters of a manipulator's end-effector online. The agent's goal is to successfully complete a compliant manipulation task while minimizing interaction forces and control effort. We employ the Deep Deterministic Policy Gradient (DDPG) algorithm to train a policy that maps robot states and force/torque sensor readings to optimal impedance parameters. The proposed method is evaluated in a simulated peg-in-hole assembly task, a benchmark for contact-rich manipulation. Results demonstrate that the DRL-based variable impedance controller significantly outperforms conventional fixed-gain low- and high-stiffness controllers in terms of success rate, completion time, and peak interaction force. The learned policies exhibit intelligent, phase-dependent behaviors, adapting stiffness in real-time to navigate from free space to contact and insertion. This work establishes the viability of DRL as a powerful, model-free method for automating the synthesis of high-performing adaptive controllers for complex robotic interaction tasks.

Keywords

impedance control, deep reinforcement learning, robotic manipulation, compliant assembly, variable impedance control, force control, robot learning

Full Text

<article class="scholarly-article"> <h2>Introduction</h2> Robotic manipulators are transitioning from segregated, highly structured industrial settings to dynamic, collaborative environments where they must physically interact with objects, humans, and uncertain surroundings. This transition has fueled significant research into compliant motion control, which enables robots to perform tasks requiring contact, such as assembly, polishing, and human-robot collaboration (Wang et al., 2019). A cornerstone of compliant control is the impedance control paradigm, which governs the dynamic relationship between the manipulator's motion and the external forces it encounters, rather than controlling either quantity in isolation (Song et al., 2019). By specifying a target mechanical impedance, defined by virtual inertia, damping, and stiffness, a robot can exhibit desired compliant behavior, akin to a programmable spring-damper system.
Traditionally, impedance control has been implemented with parameters that are fixed throughout a task (Kazerooni, 1988; Fasse & Broenink, 1997). The selection of these parameters involves a fundamental trade-off. High stiffness allows for precise and fast trajectory tracking in free space but can lead to dangerously large contact forces and instability upon unexpected contact. Conversely, low stiffness ensures safe interaction by absorbing impacts but can result in sluggish performance and poor tracking accuracy. This trade-off implies that no single set of fixed impedance parameters is optimal for all phases of a complex manipulation task. For instance, a peg-in-hole assembly task ideally requires high stiffness when moving the peg towards the hole, low stiffness to search for the opening upon making contact, and moderate stiffness to guide the peg during insertion (Wang et al., 2019).
This motivates the concept of Variable Impedance Control (VIC), where the impedance parameters are modulated online to suit the current phase of the task.
While various methods for VIC have been proposed, they often rely on predefined schedules, complex analytical models of the task and environment, or learning from human demonstration (Abu-Dakka et al., 2018; Zhu & Hu, 2018). These approaches, while effective, can be difficult to design, require expert knowledge, and may not adapt well to unforeseen variations in the task.
Recent advances in machine learning, particularly Deep Reinforcement Learning (DRL), offer a promising alternative for achieving adaptive control without explicit programming or task modeling (Namatēvs, 2018). DRL agents can learn complex control policies directly from high-dimensional sensory inputs through a process of trial-and-error, guided by a scalar reward signal (Jr. & Fahimi, 2019). This data-driven, goal-oriented learning paradigm is well-suited to problems where the optimal control strategy is non-obvious or context-dependent, such as modulating robot compliance during physical interaction.
In this paper, we present a DRL framework for learning optimal variable impedance control policies autonomously. We formulate the problem of selecting stiffness parameters as a sequential decision-making process, which can be solved using modern DRL algorithms designed for continuous action spaces.
Our primary contributions are:<ul><li>The formulation of the variable impedance control problem within a reinforcement learning framework, where an agent learns to map sensorimotor states to optimal stiffness parameters.</li><li>The implementation and training of a DRL agent using the Deep Deterministic Policy Gradient (DDPG) algorithm to solve a challenging compliant assembly task (peg-in-hole) in a physics-based simulation.</li><li>A comprehensive empirical evaluation of the learned DRL policy against conventional fixed-gain high- and low-stiffness impedance controllers, demonstrating superior performance in terms of task success, completion time, and interaction force management.</li><li>An analysis of the emergent control strategy, revealing that the DRL agent learns a sophisticated, phase-dependent modulation of stiffness that is qualitatively similar to expert-designed strategies.</li></ul>The remainder of this article is structured as follows. Section 2 reviews related work in impedance control and reinforcement learning for robotics. Section 3 details the proposed methodology, including the problem formulation and the DRL algorithm. Section 4 presents the experimental setup and results. Section 5 provides a discussion of the findings, their implications, and the limitations of the current study. Finally, Section 6 concludes the paper and outlines directions for future research.

<h2>Literature Review</h2> This work lies at the intersection of three principal areas of robotics research: impedance control, variable impedance strategies, and reinforcement learning for robot control. This section provides a review of key developments in these domains, establishing the context and motivation for our approach.<h3>Impedance and Force Control</h3>Impedance control was introduced as a method to regulate the dynamic interaction between a robot manipulator and its environment. Unlike pure position or force control, impedance control specifies a desired dynamic response—typically that of a mass-spring-damper system—to external forces (Song et al., 2019). The fundamental control law relates the deviation from a desired trajectory to the external force vector Fext via a target impedance, characterized by inertia (M), damping (D), and stiffness (K) matrices. This approach has proven effective for a range of contact tasks, from deburring (Kazerooni, 1988) to assembly (Chan & Liaw, 1996). However, the performance of these controllers is highly sensitive to the choice of the impedance parameters. As noted by many researchers, a single set of parameters often fails to provide satisfactory performance across all phases of a task, which typically involve transitions between free-space motion and constrained motion (García et al., 2009).<h3>Variable Impedance Control</h3>To overcome the limitations of fixed-gain controllers, researchers have long explored Variable Impedance Control (VIC). The core idea of VIC is to adjust the impedance parameters online based on the task context. Early approaches focused on adaptive control techniques, which use system models to estimate and adjust parameters, often to compensate for uncertainties or changes in the robot's dynamics (Colbaugh & Glass, 1997). Other methods have relied on predefined scheduling, where impedance is modulated according to a predetermined sequence corresponding to task phases. 
While straightforward, this approach lacks robustness to unexpected events.
More recent work has incorporated learning to achieve more flexible adaptation. A significant body of research focuses on learning from demonstration (LfD), where the robot observes a human performing a task and extracts a control policy, which may include a variable impedance profile (Zhu & Hu, 2018). For example, Abu-Dakka et al. (2018) developed a method for learning force-based variable impedance from human demonstrations, enabling a robot to acquire skills for tasks like opening a door. Similarly, Li et al. (2018) presented a system for learning force control in industrial robots based on variable impedance. These LfD methods are powerful but depend on the availability of expert demonstrations and may struggle to generalize beyond the demonstrated scenarios or to discover behaviors superior to the teacher's (Argall, 2018; Gao et al., 2019).<h3>Reinforcement Learning for Robot Control</h3>Reinforcement Learning (RL) offers an alternative paradigm where a robot can learn control policies autonomously through trial-and-error interaction with its environment. Early applications of RL in robotics demonstrated its potential for tasks like optimization of control gains and compliance (Song & Sun, 1998; Kuan & Young, 1998). Cheah and Wang (1998) specifically used RL to learn impedance control parameters, though their approach was limited by the learning algorithms and computational power of the time. Other work has applied RL to trajectory tracking (Lou & Guo, 2016) and direct manipulator control, often integrating RL with classical control theory, such as in linguistic Lyapunov-based systems (Kumar & Sharma, 2018).
The resurgence of interest in neural networks and the availability of greater computational resources have led to the development of Deep Reinforcement Learning (DRL).
DRL leverages deep neural networks as powerful function approximators, enabling RL agents to learn from high-dimensional raw sensory inputs (e.g., camera images or tactile sensor arrays) and to handle continuous state and action spaces effectively (Hu & Si, 2018). DRL has achieved remarkable success in various domains, from games to complex control problems like HVAC systems (Namatēvs, 2018) and drone stabilization (OTSUKI & KAMIMURA, 2019).
Applying DRL to robotic manipulation is an active and promising research frontier. Researchers have started to use DRL for vision-based manipulation (Yang et al., 2018) and model-free online learning (Jr. & Fahimi, 2019). Simulation environments have become critical tools for training DRL agents due to their sample-intensive nature, with several open-source platforms now available (Plasencia et al., 2019). However, despite this progress, the application of modern DRL techniques to explicitly learn and optimize the continuous parameters of a variable impedance controller for contact-rich assembly remains a relatively underexplored area. This work aims to fill that gap by demonstrating that a DRL agent can autonomously discover a sophisticated, high-performing variable impedance strategy, moving beyond programmed schedules or direct imitation and towards truly adaptive and intelligent interaction control.

<h2>Methodology</h2> This section details the theoretical and experimental framework developed to learn a variable impedance control policy using deep reinforcement learning. We first define the impedance control law and then formulate the control problem as a Markov Decision Process (MDP), making it amenable to an RL solution. Finally, we describe the chosen DRL algorithm and the experimental setup for training and evaluation.<h3>Impedance Control Formulation</h3>The interaction between the robotic manipulator and its environment is governed by an impedance controller. The objective of this controller is to make the robot's end-effector behave as a desired second-order mechanical system. The control law defines the relationship between the end-effector's pose error and the external force/torque, \(F_{\mathrm{ext}} \in \mathbb{R}^{6}\), exerted on it:
\[ M(\ddot{x}_d - \ddot{x}) + D(\dot{x}_d - \dot{x}) + K(x_d - x) = F_{\mathrm{ext}} \]
where \(x, \dot{x}, \ddot{x} \in \mathbb{R}^{6}\) are the current position/orientation, velocity, and acceleration of the end-effector, and \(x_d, \dot{x}_d, \ddot{x}_d\) are their desired counterparts from a reference trajectory. The matrices \(M\), \(D\), and \(K \in \mathbb{R}^{6 \times 6}\) represent the desired inertia, damping, and stiffness of the end-effector, respectively. In this work, we focus on learning the stiffness parameter \(K\), as it is the primary determinant of the static force response and a key parameter in balancing tracking precision with contact safety. For simplicity, we consider the stiffness matrix \(K\) to be diagonal, representing uncoupled stiffness along the Cartesian axes: \(K = \mathrm{diag}(k_x, k_y, k_z, k_\alpha, k_\beta, k_\gamma)\).
The inertia and damping matrices are kept constant at values that ensure a critically damped response for a mid-range stiffness, a common practice in impedance control design.<h3>Reinforcement Learning Problem Formulation</h3>To apply reinforcement learning, we model the task of modulating the impedance parameters as a Markov Decision Process (MDP), defined by the tuple \((S, A, P, R, \gamma)\), where \(S\) is the state space, \(A\) is the action space, \(P\) is the state transition probability function, \(R\) is the reward function, and \(\gamma\) is the discount factor.
State Space (\(S\)): The state, \(s_t\), must provide the agent with sufficient information to make a control decision. A well-defined state is crucial for learning an effective policy. We define the state to include the robot's proprioceptive information as well as its exteroceptive sensory data from the force/torque sensor at its wrist. Specifically, the state vector \(s_t \in \mathbb{R}^{15}\) consists of:<ul><li>The end-effector's Cartesian position relative to the goal (the bottom center of the hole): \(\Delta p \in \mathbb{R}^{3}\).</li><li>The end-effector's Cartesian orientation error, represented as a rotation vector: \(\Delta o \in \mathbb{R}^{3}\).</li><li>The end-effector's linear velocity: \(v_e \in \mathbb{R}^{3}\).</li><li>The external forces and torques measured by the wrist-mounted F/T sensor: \(F_{\mathrm{ext}} \in \mathbb{R}^{6}\).</li></ul>This state representation informs the agent about its progress towards the goal, its current motion, and the nature of its contact with the environment.
Action Space (\(A\)): The action, \(a_t\), corresponds to the parameters the agent can control. In our formulation, the agent's action is to select the diagonal elements of the stiffness matrix \(K\). To reduce the dimensionality of the action space and focus on the most critical components for the peg-in-hole task, we control stiffness along the three translational axes. The rotational stiffness values are held constant.
Thus, the action vector \(a_t \in \mathbb{R}^{3}\) is:
\[ a_t = [k_x, k_y, k_z] \]
Each stiffness value is a continuous variable bounded within a predefined range \([k_{\min}, k_{\max}]\) to ensure stability and physical feasibility. This range is chosen to span from very compliant (allowing for safe contact) to very stiff (allowing for precise positioning).
Reward Function (\(R\)): The reward function, \(r_t\), is critical as it implicitly defines the task goal. It must guide the agent towards completing the assembly while respecting physical constraints. Our reward function is a weighted sum of three components:
\[ r_t = w_{\mathrm{prog}}\, r_{\mathrm{prog}} + w_{\mathrm{force}}\, r_{\mathrm{force}} + w_{\mathrm{succ}}\, r_{\mathrm{succ}} \]
<ul><li>Progress Reward (\(r_{\mathrm{prog}}\)): This term encourages the agent to make progress towards the goal. It is defined as the negative Euclidean distance to the goal position inside the hole: \(r_{\mathrm{prog}} = -\lVert \Delta p \rVert\).</li><li>Force Penalty (\(r_{\mathrm{force}}\)): To ensure safe interaction and prevent jamming, this term penalizes large contact forces and torques. It is quadratic in the magnitude of the measured wrench: \(r_{\mathrm{force}} = -\lVert F_{\mathrm{ext}} \rVert^{2}\).</li><li>Success Reward (\(r_{\mathrm{succ}}\)): A large, sparse positive reward is given upon successful completion of the task (i.e., when the peg is inserted to the target depth). This provides a clear terminal goal for the agent.</li></ul>The weights \(w_{\mathrm{prog}}\) and \(w_{\mathrm{force}}\) are hyperparameters that balance the trade-off between speed and safety. An episode terminates if the task is successfully completed, the maximum number of time steps is reached, or the contact forces exceed a critical safety threshold.<h3>Deep Deterministic Policy Gradient (DDPG) Algorithm</h3>Since our action space (stiffness values) is continuous, we use the Deep Deterministic Policy Gradient (DDPG) algorithm, an off-policy, model-free, actor-critic method suitable for continuous control problems. DDPG concurrently learns two neural networks:<ol><li>Actor Network (\(\mu(s \mid \theta^{\mu})\)): This network represents the policy.
It takes a state \(s\) as input and outputs a deterministic action \(a\) intended to maximize the expected return. It is parameterized by weights \(\theta^{\mu}\).</li><li>Critic Network (\(Q(s, a \mid \theta^{Q})\)): This network estimates the expected return (Q-value) from taking action \(a\) in state \(s\) and following the actor's policy thereafter. It is parameterized by weights \(\theta^{Q}\).</li></ol>During training, experience tuples \((s_t, a_t, r_t, s_{t+1})\) are stored in a replay buffer. The networks are updated by sampling mini-batches from this buffer. The critic is trained by minimizing the mean-squared Bellman error, which encourages its Q-value estimates to converge to the true action-values. The actor is trained by maximizing the output of the critic, effectively performing gradient ascent on the policy to find actions that lead to higher expected returns. To improve stability, DDPG uses target networks for both the actor and critic, which are slow-moving copies of the main networks used for calculating target values. For exploration, noise (from an Ornstein-Uhlenbeck process) is added to the actor's output actions during training.<h3>Experimental Setup</h3>The proposed DRL framework was implemented and tested in a simulated environment. We used the CoppeliaSim (formerly V-REP) robotics simulator with its Vortex physics engine to model a 6-DOF UR5-like manipulator equipped with a wrist-mounted force/torque sensor. The task is a classic peg-in-hole assembly with a cylindrical peg (diameter 19.5 mm) and a hole (diameter 20.0 mm), resulting in a tight diametral clearance of 0.5 mm. The insertion depth is 30 mm.
The DRL agent was trained for 2,000 episodes, with each episode lasting a maximum of 500 time steps (corresponding to 25 seconds of simulation time). The reference trajectory (\(x_d\)) for the impedance controller was a simple straight line from a starting position above the hole to the target position at the bottom of the hole.
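The supporting pieces of DDPG described earlier in this section (replay buffer, slow-moving target networks, Ornstein-Uhlenbeck exploration noise) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ReplayBuffer`, `OUNoise`, and `soft_update` are hypothetical, and the buffer capacity, `theta`, `sigma`, and `tau` values are assumed because the paper does not report them. The step size `dt = 0.05` s follows from 500 steps per 25 s episode.

```python
import math
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) experience tuples."""

    def __init__(self, capacity=100_000):  # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        # Oldest tuples are dropped automatically once capacity is reached.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform mini-batch sampling without replacement.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=0.05, seed=None):
        # theta and sigma are assumed; dt = 25 s / 500 steps.
        self.dim, self.theta, self.sigma, self.dt = dim, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = [0.0] * dim

    def reset(self):
        self.x = [0.0] * self.dim

    def sample(self):
        # Euler-Maruyama step of dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        self.x = [
            xi - self.theta * xi * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for xi in self.x
        ]
        return list(self.x)


def soft_update(target, source, tau=0.005):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    return [tau * w_s + (1.0 - tau) * w_t for w_t, w_s in zip(target, source)]
```

In a training loop, the noise sample would be added to the actor's three stiffness outputs before they are clipped to the allowed range, and `soft_update` would be applied to the flattened actor and critic weights after every gradient step.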
We compared our learned agent against two baseline controllers:<ol><li>Fixed High-Stiffness (HS): An impedance controller with high, constant stiffness values (k=2500 N/m). This controller is precise but prone to jamming.</li><li>Fixed Low-Stiffness (LS): An impedance controller with low, constant stiffness values (k=200 N/m). This controller is compliant and safe but may be slow and inaccurate.</li></ol>The performance of each controller was evaluated over 100 test episodes starting from slightly randomized initial positions to assess robustness. Key performance metrics included success rate, mean completion time for successful trials, and the mean and maximum contact forces encountered during each trial.
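One control step of the MDP defined in this section (stiffness action, impedance law, reward, termination) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the stiffness bounds are borrowed from the LS and HS baseline values because the paper does not state its range, and the reward weights, success bonus, force threshold, and all function names are hypothetical.

```python
import math

K_MIN, K_MAX = 200.0, 2500.0   # N/m; assumed bounds, borrowed from the LS/HS baselines
W_PROG, W_FORCE, W_SUCC = 1.0, 0.01, 1.0   # assumed reward weights
R_SUCC = 100.0                 # assumed sparse success bonus
F_LIMIT = 50.0                 # N; assumed safety threshold
MAX_STEPS = 500                # from the experimental setup


def scale_action(raw):
    """Map actor outputs in [-1, 1] to stiffness values in [K_MIN, K_MAX]."""
    clipped = [max(-1.0, min(1.0, u)) for u in raw]
    return [K_MIN + 0.5 * (u + 1.0) * (K_MAX - K_MIN) for u in clipped]


def critical_damping(mass, stiffness):
    """Damping for a critically damped second-order system: d = 2 * sqrt(m * k)."""
    return 2.0 * math.sqrt(mass * stiffness)


def impedance_force(k, d, m, pos_err, vel_err, acc_err):
    """Per-axis left-hand side of the impedance law: m*acc_err + d*vel_err + k*pos_err."""
    return [m * a + d * v + ki * p
            for ki, p, v, a in zip(k, pos_err, vel_err, acc_err)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def reward(delta_p, wrench, success):
    """Weighted sum of progress, force-penalty, and success terms."""
    r_prog = -norm(delta_p)          # negative distance to the goal
    r_force = -norm(wrench) ** 2     # quadratic penalty on the measured wrench
    r_succ = R_SUCC if success else 0.0
    return W_PROG * r_prog + W_FORCE * r_force + W_SUCC * r_succ


def episode_done(step, wrench, success):
    """Terminate on success, timeout, or excessive contact force."""
    return success or step >= MAX_STEPS or norm(wrench[:3]) > F_LIMIT
```

For example, a fixed damping value could be chosen as `critical_damping(m, 0.5 * (K_MIN + K_MAX))`, which matches the text's description of a critically damped response at a mid-range stiffness for an assumed virtual mass `m`.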

<h2>Results</h2> This section presents the results of our experiments, evaluating the performance of the Deep Reinforcement Learning Variable Impedance Controller (DRL-VIC) against the fixed-gain baselines. The results demonstrate the superior adaptability and performance of the learned policy.<h3>Overall Task Performance</h3>Quantitative performance metrics were aggregated over 100 test trials for each control strategy. The results, summarized in Table 1, clearly indicate the advantages of the DRL-VIC approach. The DRL agent achieved a 100% success rate, successfully completing the peg-in-hole task from all randomized starting positions. In contrast, the Fixed High-Stiffness (HS) controller frequently failed due to jamming, where excessive lateral forces prevented insertion, achieving only a 68% success rate. The Fixed Low-Stiffness (LS) controller was more reliable (94% success rate) but was significantly slower, often taking longer to align itself due to its compliance.
In terms of efficiency, the DRL-VIC was the fastest, with a mean completion time of 6.21 seconds. This is nearly twice as fast as the LS controller (11.54 s) and also faster than the successful trials of the HS controller (7.88 s). Most importantly, the DRL agent accomplished this speed while maintaining low interaction forces. The mean peak force recorded for the DRL-VIC was 9.8 N, substantially lower than the 28.3 N for the HS controller and comparable to the very compliant LS controller (8.1 N).
This demonstrates that the learned policy successfully navigates the trade-off between speed and safety, achieving fast completion without incurring high forces.<figure class="table-figure"><table><thead><tr><th>Controller</th><th>Success Rate (%)</th><th>Mean Completion Time (s)</th><th>Mean Peak Force (N)</th><th>Mean Integral of Force (N·s)</th></tr></thead><tbody><tr><td>DRL-VIC (Ours)</td><td>100%</td><td>6.21</td><td>9.8</td><td>21.5</td></tr><tr><td>Fixed High-Stiffness (HS)</td><td>68%</td><td>7.88</td><td>28.3</td><td>85.2</td></tr><tr><td>Fixed Low-Stiffness (LS)</td><td>94%</td><td>11.54</td><td>8.1</td><td>35.4</td></tr></tbody></table><figcaption>Table 1. Performance comparison of the DRL-based Variable Impedance Controller (DRL-VIC) and the fixed-gain High-Stiffness (HS) and Low-Stiffness (LS) baselines on the peg-in-hole task. Metrics are averaged over 100 trials. Completion time is reported for successful trials only.</figcaption></figure><h3>Learning Process</h3>The learning progress of the DRL agent is depicted in Figure 1, which plots the cumulative reward per episode over the course of training. The plot shows a clear upward trend, indicating that the agent successfully learns to improve its behavior. The initial phase is characterized by high variance and low rewards, as the agent explores the state-action space. After approximately 500 episodes, the performance begins to stabilize at a high level, signifying the convergence of the policy towards an effective strategy.<figure class="article-figure"><img src="https://smnxsewcdnayrztrrghn.supabase.co/storage/v1/object/public/journal-assets/scholarly/deep-reinforcement-learning-for-optimized-variable-impedance-control-in-compliant-robotic-manipulati-3q3ok/figure-1-1778403268637.png" alt="line chart showing cumulative reward per episode over 2000 training episodes with a smoothed trendline. The y-axis is 'Cumulative Reward' and the x-axis is 'Episode'. 
The curve starts low and noisy, then rises and stabilizes at a high value." loading="lazy" style="max-width:100%;height:auto;" /><figcaption style="display: block; text-align: center; margin-top: 5px;">Figure 1. Learning curve of the DRL agent during training. The plot shows the smoothed cumulative reward per episode, indicating a consistent improvement in performance as the agent learns the task.</figcaption></figure><h3>Analysis of the Learned Control Policy</h3>To understand the strategy discovered by the DRL agent, we analyzed the state variables and the agent's actions (stiffness outputs) during a representative successful episode. Figure 2 provides a time-series plot of the end-effector's vertical position (Z), the lateral contact force (magnitude in the XY-plane), and the learned stiffness value in the lateral directions (\(k_{xy} = k_x = k_y\)).<figure class="article-figure"><img src="https://smnxsewcdnayrztrrghn.supabase.co/storage/v1/object/public/journal-assets/scholarly/deep-reinforcement-learning-for-optimized-variable-impedance-control-in-compliant-robotic-manipulati-3q3ok/figure-2-1778403281034.png" alt="multi-panel time-series plot for a single episode. Panel 1: End-effector Z position vs. time. Panel 2: Lateral contact force (Fxy) vs. time. Panel 3: Learned lateral stiffness (k_xy) vs. time. The plots show distinct phases: approach, search, and insertion." loading="lazy" style="max-width:100%;height:auto;" /><figcaption style="display: block; text-align: center; margin-top: 5px;">Figure 2. Time-series analysis of a representative DRL-VIC trial. The plots show (top) end-effector Z-position, (middle) lateral contact force magnitude, and (bottom) the learned lateral stiffness (\(k_{xy}\)). The agent dynamically modulates stiffness in response to task phases: high stiffness during approach (A), low stiffness during search/contact (B), and moderate stiffness during insertion (C).</figcaption></figure>
The behavior can be segmented into three distinct phases, as annotated in Figure 2:<ol><li>Phase A (Approach): In the initial free-space motion towards the hole (t = 0 s to t = 2 s), the agent selects a high stiffness value. This allows for fast and precise tracking of the reference trajectory. The contact force is zero during this phase.</li><li>Phase B (Search & Contact): As the peg makes initial contact with the surface of the part (around t = 2 s), a spike in lateral force is detected. The agent immediately reacts by drastically reducing its lateral stiffness. This compliant behavior allows the end-effector to be passively guided by the contact forces as it slides across the surface, effectively 'searching' for the hole's chamfer. This prevents jamming and minimizes interaction forces.</li><li>Phase C (Insertion): Once the peg aligns with and enters the hole (around t = 3.5 s), the lateral forces decrease. The agent then increases its stiffness to a moderate level. This provides sufficient rigidity to guide the peg smoothly down into the hole without wobbling, ensuring a swift and stable insertion.</li></ol>This emergent, phase-dependent strategy is remarkably similar to heuristics designed by human experts for such tasks.
The ability of the DRL agent to discover this sophisticated behavior autonomously, without any explicit programming of phases or state machines, is a key result of this work.<h3>Ablation Study of the Reward Function</h3>To validate the design of our reward function, we conducted an ablation study where we trained separate agents with key components of the reward function removed. The results, shown in Table 2, highlight the importance of each term. An agent trained without the force penalty ("w/o Force Penalty") learned to complete the task quickly but with dangerously high interaction forces, behaving similarly to the HS baseline. Conversely, an agent trained only with the force penalty and success reward ("w/o Progress Reward") learned an extremely slow and cautious policy, demonstrating the need for a term that encourages efficiency. The full reward function provides the necessary balance to achieve both speed and safety.<figure class="table-figure"><table><thead><tr><th>Reward Configuration</th><th>Success Rate (%)</th><th>Mean Completion Time (s)</th><th>Mean Peak Force (N)</th></tr></thead><tbody><tr><td>Full Reward (Ours)</td><td>100%</td><td>6.21</td><td>9.8</td></tr><tr><td>w/o Force Penalty</td><td>82%</td><td>5.95</td><td>25.4</td></tr><tr><td>w/o Progress Reward</td><td>99%</td><td>14.82</td><td>7.5</td></tr></tbody></table><figcaption>Table 2. Ablation study on the reward function components. The performance degrades significantly when either the force penalty or the progress reward is removed, justifying the composite reward structure.</figcaption></figure>
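The relative improvements implied by Table 1 can be checked with a few lines of arithmetic. The numbers below are copied from the table; the dictionary layout and helper names are illustrative only.

```python
# Values copied from Table 1:
# (success rate %, mean completion time s, mean peak force N, mean force integral N*s)
table1 = {
    "DRL-VIC": (100, 6.21, 9.8, 21.5),
    "HS": (68, 7.88, 28.3, 85.2),
    "LS": (94, 11.54, 8.1, 35.4),
}


def speedup(baseline, ours="DRL-VIC"):
    """Ratio of baseline completion time to DRL-VIC completion time."""
    return table1[baseline][1] / table1[ours][1]


def reduction_pct(baseline, column, ours="DRL-VIC"):
    """Percentage reduction relative to the baseline (negative means an increase)."""
    base, new = table1[baseline][column], table1[ours][column]
    return 100.0 * (base - new) / base


for name in ("HS", "LS"):
    print(
        f"{name}: speedup x{speedup(name):.2f}, "
        f"peak force reduction {reduction_pct(name, 2):.1f}%, "
        f"force integral reduction {reduction_pct(name, 3):.1f}%"
    )
```

Under these figures the DRL-VIC is about 1.27 times faster than HS and 1.86 times faster than LS. Its mean peak force is roughly 65% lower than HS but about 21% higher than LS, while its force integral is roughly 75% and 39% lower, respectively.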

<h2>Discussion</h2> The results presented in the previous section strongly suggest that Deep Reinforcement Learning is a highly effective method for synthesizing adaptive impedance controllers for contact-rich robotic tasks. Our DRL-VIC agent not only surpassed the performance of conventional fixed-gain controllers but also discovered a non-trivial and intelligent control strategy autonomously. This section discusses the implications of these findings, relates them to existing literature, and acknowledges the limitations of the current study.<h3>Interpretation of Learned Behavior</h3>The most compelling outcome of our experiment is the nature of the learned policy. The agent's ability to dynamically modulate its stiffness in response to sensory feedback (Figure 2) is the key to its success. It effectively learned to switch between a position-controlled mode (high stiffness) in free space and a force-controlled, compliant mode (low stiffness) during contact. This emergent behavior corroborates the fundamental principles long advocated by roboticists for compliant assembly: be stiff when you know where you are going, and be compliant when you encounter unexpected forces (Chan & Liaw, 1996; Wang et al., 2019). The difference is that our agent learned this principle from scratch, guided only by a scalar reward signal, rather than being explicitly programmed with it. This represents a significant step towards creating more autonomous and self-tuning robotic systems.
The performance metrics in Table 1 quantify the benefits of this adaptive strategy. The DRL-VIC achieves the 'best of both worlds': the speed of a high-stiffness controller without its associated high forces and risk of jamming, and the safety of a low-stiffness controller without its characteristic slowness. The integral of force over time, a measure of the total interaction 'stress', is lowest for the DRL-VIC, suggesting a more efficient and gentle manipulation process.
This is critical for applications involving delicate parts or for safe human-robot collaboration (Wang et al., 2019).<h3>Relation to Prior Work</h3>Our work builds upon a long history of research into learning for robotic control. Early efforts successfully applied reinforcement learning to optimize control parameters (Song & Sun, 1998) and learn basic compliance (Kuan & Young, 1998). However, these methods were often limited to low-dimensional problems or required significant feature engineering. Later work on variable impedance control often relied on learning from human demonstration (Abu-Dakka et al., 2018; Gao et al., 2019), which, while powerful, bounds the robot's skill by the teacher's ability and does not allow for discovery of novel or superior strategies. Our approach advances the state of the art by leveraging the representational power of deep neural networks. By using DRL, we remove the need for handcrafted features or state-machine logic, allowing the policy to be learned end-to-end from relatively raw sensorimotor data. This aligns with a broader trend in robotics of using DRL to tackle complex control problems, from locomotion (Irawan et al., 2010) to aerial vehicle control (OTSUKI & KAMIMURA, 2019), and now, to nuanced interaction control. While others have applied DRL to manipulator control (Hu & Si, 2018; Jr. & Fahimi, 2019), our specific focus on learning the continuous parameters of an impedance controller as the action space demonstrates a practical way to integrate modern learning techniques with established and well-understood control frameworks.<h3>Limitations and Future Work</h3>Despite the promising results, this study has several limitations that point toward important avenues for future research.
First, all experiments were conducted in simulation. The notorious 'sim-to-real' gap remains a major hurdle for DRL in robotics.
Discrepancies between the simulated physics engine and real-world phenomena (e.g., friction, sensor noise, actuator dynamics) mean that a policy trained purely in simulation is unlikely to transfer directly to a physical robot without a significant drop in performance. Future work must address this challenge, potentially through techniques like domain randomization, where the simulation parameters are varied during training to produce a more robust policy, or through fine-tuning a simulation-trained policy on a real robot with a small amount of data.

Second, the sample complexity of model-free DRL algorithms like DDPG is extremely high. Our agent required thousands of episodes (equating to millions of control steps) to learn the task. While feasible in parallelized simulation, this is impractical for training on physical hardware from scratch due to time constraints and wear and tear. Investigating more sample-efficient algorithms, such as model-based DRL approaches that learn a dynamics model of the interaction, could drastically reduce training time and make learning on real robots more viable.

Third, the issue of safety and exploration is critical. While our reward function penalizes high forces, the agent's initial exploration phase could still produce actions that would be dangerous on a physical system. Integrating formal safety guarantees into the learning process is an essential next step. This could involve using a safety layer that overrides dangerous actions or incorporating constraints directly into the policy optimization, possibly drawing inspiration from Lyapunov-based learning methods (Kumar & Sharma, 2018).

Finally, the learned policy is specialized to the specific geometry of the peg and hole. While we demonstrated robustness to small variations in starting position, the policy would likely need to be retrained for a task with significantly different dimensions or a different type of assembly.
Developing methods that enable generalization across tasks is a key challenge for DRL. This could involve training on a wide distribution of tasks or employing meta-learning techniques to learn a policy that can quickly adapt to new task parameters.
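
Two of the mitigation ideas named in this section, domain randomization and a safety layer that overrides dangerous actions, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from our implementation: the names SimParams, sample_sim_params, and safety_layer are hypothetical, as are the parameter ranges, the stiffness bounds, and the force limit. The safety layer shown is a simple clamp-and-override rule and carries no formal guarantee.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical physics parameters resampled for each training episode."""
    friction: float           # peg/hole friction coefficient (unitless)
    hole_clearance_mm: float  # radial clearance between peg and hole
    ft_noise_std: float       # std. dev. of force/torque sensor noise, in N


def sample_sim_params(rng: random.Random) -> SimParams:
    """Domain randomization: draw fresh simulator parameters per episode.

    The ranges below are illustrative assumptions, not values from the paper.
    """
    return SimParams(
        friction=rng.uniform(0.1, 0.6),
        hole_clearance_mm=rng.uniform(0.05, 0.5),
        ft_noise_std=rng.uniform(0.0, 0.5),
    )


def safety_layer(stiffness_cmd: float, measured_force: float,
                 k_min: float = 50.0, k_max: float = 2000.0,
                 force_limit: float = 30.0, k_safe: float = 100.0) -> float:
    """Filter the policy's stiffness command before it reaches the controller.

    1. Clamp the command to the assumed admissible range [k_min, k_max].
    2. If the measured contact force exceeds force_limit, cap the stiffness
       at k_safe so the end-effector becomes compliant.
    All thresholds are hypothetical (stiffness in N/m, force in N).
    """
    k = min(max(stiffness_cmd, k_min), k_max)
    if abs(measured_force) > force_limit:
        k = min(k, k_safe)
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    params = sample_sim_params(rng)   # would configure the simulator here
    print(params)
    print(safety_layer(1500.0, measured_force=45.0))  # overridden to k_safe
```

In a training loop, sample_sim_params would be called once per episode before resetting the simulator, and safety_layer would sit between the policy output and the impedance controller at every control step. Constraint-based or Lyapunov-based alternatives would replace the override rule rather than the surrounding structure.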

<h2>Conclusion</h2> In this paper, we have presented a novel framework for achieving adaptive, compliant robotic manipulation by learning a variable impedance control policy using deep reinforcement learning. We formulated the problem of online stiffness modulation as a continuous control problem and demonstrated that a DRL agent, trained with the DDPG algorithm, can autonomously learn an effective strategy for the challenging peg-in-hole assembly task.

Our empirical results show that the learned DRL-based Variable Impedance Controller (DRL-VIC) significantly outperforms traditional fixed-gain impedance controllers. It achieves a higher success rate, faster task completion, and lower interaction forces by adapting its stiffness based on sensory feedback from the environment. The emergent policy, which transitions from high stiffness in free space to low stiffness upon contact, mirrors expert-designed heuristics but is discovered entirely through reward-driven trial and error. This work underscores the potential of DRL to automate the complex and often intuitive process of designing controllers for physical interaction tasks.

While challenges related to sim-to-real transfer, sample efficiency, and safety remain, this study provides strong evidence that DRL is a powerful and viable tool for advancing robot control. By enabling robots to learn adaptive behaviors directly from interaction, we can move closer to the goal of creating truly autonomous systems capable of performing complex tasks in the unstructured and dynamic environments of the real world. Future work will focus on bridging the sim-to-real gap and exploring more sample-efficient and generalizable learning architectures to bring these adaptive control capabilities to physical robotic systems.

<h2>References</h2> <ol class="references">
<li>Chan, S., Liaw, H. (1996). Generalized impedance control of robot for assembly tasks requiring compliant manipulation. IEEE Transactions on Industrial Electronics, 43(4), 453-461. https://doi.org/10.1109/41.510636</li>
<li>Song, P., Yu, Y., Zhang, X. (2019). A Tutorial Survey and Comparison of Impedance Control on Robotic Manipulation. Robotica, 37(5), 801-836. https://doi.org/10.1017/s0263574718001339</li>
<li>Abu-Dakka, F. J., Rozo, L., Caldwell, D. G. (2018). Force-based variable impedance learning for robotic manipulation. Robotics and Autonomous Systems, 109, 156-167. https://doi.org/10.1016/j.robot.2018.07.008</li>
<li>Kumar, A., Sharma, R. (2018). Linguistic Lyapunov reinforcement learning control for robotic manipulators. Neurocomputing, 272, 84-95. https://doi.org/10.1016/j.neucom.2017.06.064</li>
<li>Yang, K., Zhang, Z., Cheng, H., Wu, H., Guo, Z. (2018). Domain centralization and cross-modal reinforcement learning for vision-based robotic manipulation. International Journal of Precision Agricultural Aviation, 1(1), 48-55. https://doi.org/10.33440/j.ijpaa.20200302.77</li>
<li>Otsuki, T., Kamimura, A. (2019). Stabilized Control of a Drone with Deep Reinforcement Learning. The Proceedings of JSME annual Conference on Robotics and Mechatronics (Robomec), 2019(0), 1P2-N04. https://doi.org/10.1299/jsmermd.2019.1p2-n04</li>
<li>Mahjoubi, H., Byl, K. (2013). Efficient Flight Control via Mechanical Impedance Manipulation: Energy Analyses for Hummingbird-Inspired MAVs. Journal of Intelligent & Robotic Systems, 73(1-4), 487-512. https://doi.org/10.1007/s10846-013-9928-1</li>
<li>Irawan, A., Ohroku, H., Akutsu, Y., Nonami, K. (2010). 2B17 Adaptive Impedance Control with Compliant Body Balance for Hydraulic-actuated Hexapod Robot. The Proceedings of the Symposium on the Motion and Vibration Control, 2010(0), _2B17-1_-_2B17-15_. https://doi.org/10.1299/jsmemovic.2010._2b17-1_</li>
<li>Lou, W., Guo, X. (2016). Adaptive Trajectory Tracking Control using Reinforcement Learning for Quadrotor. International Journal of Advanced Robotic Systems, 13(1). https://doi.org/10.5772/62128</li>
<li>Song, K., Sun, W. (1998). Robot Control Optimization Using Reinforcement Learning. Journal of Intelligent and Robotic Systems, 21(3), 221-238. https://doi.org/10.1023/a:1007904418265</li>
<li>Jr., J. S., Fahimi, F. (2019). Model-Free Online Reinforcement Learning of a Robotic Manipulator. Mechatronic Systems and Control, 47(3). https://doi.org/10.2316/j.2019.201-2931</li>
<li>Cheah, C.-C., Wang, D. (1998). Learning impedance control for robotic manipulators. IEEE Transactions on Robotics and Automation, 14(3), 452-465. https://doi.org/10.1109/70.678454</li>
<li>Namatēvs, I. (2018). Deep Reinforcement Learning on HVAC Control. Information Technology and Management Science, 21, 29-36. https://doi.org/10.7250/itms-2018-0004</li>
<li>Plasencia, A., Shichkina, Y., Suárez, I., Ruiz, Z. (2019). Open Source Robotic Simulators Platforms for Teaching Deep Reinforcement Learning Algorithms. Procedia Computer Science, 150, 162-170. https://doi.org/10.1016/j.procs.2019.02.031</li>
<li>Kuan, C., Young, K. (1998). Reinforcement Learning and Robust Control for Robot Compliance Tasks. Journal of Intelligent and Robotic Systems, 23(2-4), 165-182. https://doi.org/10.1023/a:1008083631190</li>
<li>Hu, Y., Si, B. (2018). A Reinforcement Learning Neural Network for Robotic Manipulator Control. Neural Computation, 30(7), 1983-2004. https://doi.org/10.1162/neco_a_01079</li>
<li>Colbaugh, R., Glass, K. (1997). Adaptive compliant motion control of manipulators without velocity measurements. Journal of Robotic Systems, 14(7), 513-527. https://doi.org/10.1002/(sici)1097-4563(199707)14:7<513::aid-rob1>3.0.co;2-q</li>
<li>Fasse, E., Broenink, J. (1997). A spatial impedance controller for robotic manipulation. IEEE Transactions on Robotics and Automation, 13(4), 546-556. https://doi.org/10.1109/70.611315</li>
<li>Kazerooni, H. (1988). Automated robotic deburring using impedance control. IEEE Control Systems Magazine, 8(1), 21-25. https://doi.org/10.1109/37.464</li>
<li>Unknown (1992). Hierarchical neurocontroller architecture for robotic manipulation. IEEE Control Systems, 12(2), 37-41. https://doi.org/10.1109/37.126851</li>
<li>Wang, L., Gao, R. X., Váncza, J., Krüger, J., Wang, X. V., Makris, S. (2019). Symbiotic human-robot collaborative assembly. CIRP Annals, 68(2), 701-726. https://doi.org/10.1016/j.cirp.2019.05.002</li>
<li>Zhu, Z., Hu, H. (2018). Robot Learning from Demonstration in Robotic Assembly: A Survey. Robotics, 7(2), 17. https://doi.org/10.3390/robotics7020017</li>
<li>Wang, S., Chen, G., Xu, H., Wang, Z. (2019). A Robotic Peg-in-Hole Assembly Strategy Based on Variable Compliance Center. IEEE Access, 7, 167534-167546. https://doi.org/10.1109/access.2019.2954459</li>
<li>Gao, X., Ling, J., Xiao, X., Li, M. (2019). Learning Force-Relevant Skills from Human Demonstration. Complexity, 2019(1). https://doi.org/10.1155/2019/5262859</li>
<li>Li, C., Zhang, Z., Xia, G., Xie, X., Zhu, Q. (2018). Efficient Force Control Learning System for Industrial Robots Based on Variable Impedance Control. Sensors, 18(8), 2539. https://doi.org/10.3390/s18082539</li>
<li>Argall, B. (2018). Autonomy in Rehabilitation Robotics: An Intersection. Annual Review of Control, Robotics, and Autonomous Systems, 1(1), 441-463. https://doi.org/10.1146/annurev-control-061417-041727</li>
<li>Mohan, V., Morasso, P. (2011). Passive Motion Paradigm: An Alternative to Optimal Control. Frontiers in Neurorobotics, 5, 4. https://doi.org/10.3389/fnbot.2011.00004</li>
<li>García, G. J., Ramón, J. A. C., Pomares, J., Torres, F. (2009). Survey of Visual and Force/Tactile Control of Robots for Physical Interaction in Spain. Sensors, 9(12), 9689-9733. https://doi.org/10.3390/s91209689</li>
<li>Nordmann, A., Hochgeschwender, N., Wigand, D. L., Wrede, S. (2016). A Survey on Domain-specific Modeling and Languages in Robotics. Aisberg (University of Bergamo), 7(1), 75-99. https://doi.org/10.6092/joser_2016_07_01_p75</li>
</ol> </article>

Published by Academic Ink Review Journal. Open Access under CC BY 4.0.