Battery electric vehicles (BEVs) already offer high energy efficiency, but reinforcement learning (RL) presents a promising path to further optimization – especially in dual motor configurations. BMW Group is exploring RL-based control strategies to unlock remaining efficiency potential by dynamically optimizing torque distribution using data-driven algorithms. This software-centric approach minimizes energy utilization – without any changes to the vehicle’s hardware.
Reinforcement learning (RL) is a subfield of artificial intelligence in which agents make decisions based on input data. RL originated in game-playing applications such as chess and Go and is now being applied to a broad range of domains. The agent selects an action from the current state of the environment, and the action is executed in the next time step. The agent receives a reward that indicates how well the chosen action meets defined criteria. Through numerous interactions with the environment, the agent learns a policy that maximizes the long-term reward. RL agents are particularly advantageous when the relationship between state and optimal action is highly complex. In many scenarios, RL systems can achieve performance that surpasses that of human experts or conventional programming approaches.
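The interaction loop described above can be illustrated with a deliberately tiny example (not BMW code): a one-state environment with two actions, where the agent learns action values from repeated interaction and gradually prefers the action with the higher reward.

```python
import random

# Toy illustration of the RL loop: action 1 yields reward 1.0,
# action 0 yields 0.0. The agent estimates a value per action and
# improves its policy through repeated interaction.

def train_bandit(steps=2000, alpha=0.1, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [0.0, 0.0]                       # value estimate per action
    for _ in range(steps):
        # epsilon-greedy: mostly exploit the current policy, sometimes explore
        a = rng.randrange(2) if rng.random() < epsilon else q.index(max(q))
        r = 1.0 if a == 1 else 0.0       # reward returned by the environment
        q[a] += alpha * (r - q[a])       # incremental value update
    return q

q = train_bandit()
```

After training, the learned values reflect that action 1 is the better choice; real RL algorithms such as those listed in the sidebar generalize this idea to large, continuous state and action spaces with neural networks.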
Dual-Motor Electric Vehicles optimized by AI
As part of a BMW Group research project, RL agents are applied to powertrain operating strategies in electrified vehicles. The scenario involves a vehicle with two electric machines: one on the front axle and one on the rear axle. Under typical operating conditions, the total wheel torque requested by the driver can be set at the front axle, the rear axle, or distributed between them. The agent’s task is to determine a torque distribution for each operating point that minimizes the electrical powertrain’s energy consumption.
Scalable, Efficient Deployment of RL Agents using Python
Most artificial intelligence (AI) applications are implemented using the Python programming language. A wide range of established RL algorithms (e.g., DDPG, TD3, SAC, PPO) are available as open-source implementations (see “Reinforcement Learning”). Python also enables convenient graphical processing unit (GPU) parallelization, significantly accelerating agent training. To minimize development effort, reusing these existing Python implementations is advantageous. The algorithms should run on a compact Linux-based industrial computer with an ARM processor. This offers several benefits:
- The hardware can be integrated in the vehicle without complex modifications or additional safety measures.
- Low power consumption minimizes load on the vehicle’s electrical network.
- A short boot time enables prompt data generation.
Therefore, it is crucial to implement a lean process with minimal overhead.
Learning Efficiency
A critical challenge in the development of robust RL agents is the appropriate definition of both state and action space, as well as the formulation of the reward function. In this project, the state is composed of vehicle speed, requested torque, battery voltage, steering angle, and several temperatures of the electric machines, among other signals. The action is a scalar value representing the desired percentage-based torque distribution between the vehicle’s electric motors. The reward is simply formulated as the negative electrical power input to the electric machines (since reward is maximized in RL). We deliberately avoided utilizing power loss as a metric, as this would require mechanical power calculations and wheel torque measurements which are not directly available but rather calculated in production vehicles. Using calculated values could lead agents to inherently learn to exploit potential modeling errors, which would reduce the outcome quality. The large training data sets acquired during the learning phase contain many operating points that have the same state but different actions, resulting in different rewards or power consumption values. The RL agent can discern these variations in power consumption and adapt the operating strategy accordingly. This approach remains feasible during recuperation phases where electrical power becomes negative.
Figure 4: Schematic implementation of a reinforcement learning (RL) agent in the RTMaps Python bridge framework. © BMW Group
Consumption Optimization Across the Entire Drive Chain, Including Tire Losses
Choosing electrical power consumption as the reward provides the additional benefit that consumption is optimized across the entire drive chain, including tire losses. A good choice of torque distribution can not only operate the electric machines at beneficial operating points but also reduce tire slip and thus further reduce energy losses. For the sake of completeness, it should be mentioned that RL agents provide the capability to optimize not only the reward for the current operating point but also the cumulative future reward. In the context of optimal torque distribution, this means that energy consumption can be minimized over an entire drive trajectory, not just at discrete time points. As an example, the agent can learn to maintain electric machine temperatures within an efficient operational range through strategic torque distribution allocation in earlier timesteps.
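The "cumulative future reward" mentioned above is standard RL terminology: the agent maximizes the discounted return, the sum of future rewards weighted by a discount factor gamma. A minimal computation of this quantity:

```python
def discounted_return(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    # Computed backwards for numerical simplicity.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Because the agent optimizes this return rather than the instantaneous reward, a torque split that is slightly suboptimal now (e.g., to keep a machine in an efficient temperature range) can still be preferred if it lowers consumption over the rest of the trajectory.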
Deploying RL Agents for Real-Time Control with RTMaps
The RL agent workflow is implemented using the RTMaps middleware from Intempora (a dSPACE Company). It provides a Python bridge that seamlessly integrates Python code into the signal processing pipeline. Within this framework, input arguments (state and reward) and output arguments (action) are defined, which can then be connected by a graphical interface with other signal blocks. The code structure executes a core function at a predetermined sampling rate, allowing input data to be processed using any standard Python library. With modest adaptations, the existing code can be integrated into the Python bridge. The RL agent receives state and reward information and computes the corresponding action by utilizing its underlying neural networks. To ensure responsive torque distribution control, a sampling rate of 100 Hz was selected. Concurrently, the system logs data to the onboard computer for subsequent analysis and to facilitate training on more powerful cloud-based systems. Communication with the drive control unit is established via the XCP over CAN protocol supported within RTMaps. This integration is achieved by generating a configuration file using the dSPACE Interface Manager that defines the input and output signals to the control unit. This configuration file is then compiled into an RTMaps block and incorporated directly into the workflow. For deployment, RTMaps Runtime for embedded platforms enables execution on ARM architectures without requiring a graphical user interface, thereby reducing computational overhead. The workflow is initiated automatically upon system boot, creating a streamlined operational process.
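The core-function pattern described above can be sketched in plain Python. This is an illustrative stand-in, not the actual RTMaps Python bridge API: a component whose core function is invoked at the fixed sampling rate, reads the state and reward inputs, logs the transition for later cloud training, and writes the action output that is forwarded to the drive control unit.

```python
# Illustrative stand-in for the RTMaps Python-bridge pattern (class and
# method names are assumptions; the real RTMaps API differs).

class TorqueSplitComponent:
    SAMPLE_PERIOD_S = 0.01          # 100 Hz control rate

    def __init__(self, policy):
        self.policy = policy        # trained agent: state -> action
        self.log = []               # onboard logging for later training

    def core(self, state, reward):
        # In the vehicle this is neural-network inference; here a stub.
        action = self.policy(state)
        self.log.append((state, action, reward))
        return action               # forwarded to the drive control unit

comp = TorqueSplitComponent(policy=lambda s: 0.5)   # dummy 50/50 policy
action = comp.core(state={"vehicle_speed": 80.0}, reward=-12000.0)
```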
Validating RL Agents in Real Vehicle Scenarios
Generating a robust RL agent requires a large amount of training data. Thus, the vehicle is driven through a wide range of operating conditions while state, action, and reward are stored in a so-called replay buffer. Based on the replay buffer, the agents are trained using off-policy reinforcement learning algorithms that adapt their underlying neural networks. The fully trained agents are subsequently deployed to the vehicle and benchmarked against the conventional operating strategy in different operating points. The measurements confirm energy savings in the low single-digit percentage range, depending on the operating point. In some operating regions, the RL-derived strategy converges with the conventional approach, confirming the optimality of the existing control methodology. While the energy savings appear modest at first glance, they represent significant value, considering the fact that they require only software adaptation without any costly hardware modifications. These improvements are particularly noteworthy as they extract efficiency gains from an already highly optimized system, targeting the last remaining percentages of potential energy savings that conventional methods have been unable to unlock.
AI-Driven, Optimized Torque Distribution
The results demonstrate that RL agents can effectively identify complex correlations between multiple state variables – including driving demands, battery voltage, and electric machine temperatures – to determine optimal torque distributions that minimize power consumption in ways difficult to achieve through conventional control engineering methodologies.
The RL agents’ outcomes can be integrated into production vehicles through different implementation pathways. The agent’s learned policies can serve purely as analytical tools, providing insights into complex physical correlations that inform enhancements to conventional control strategies – keeping the development process engineer-centric while leveraging AI-derived insights. Alternatively, the fully validated agent can be deployed directly to vehicle control units as a deterministic function that maps states to actions with consistent behavior, effectively transforming the AI methodology into an embedded product component. This deterministic nature ensures that for any given state, the agent will reliably produce identical control actions, maintaining the predictability required for automotive systems.
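The deterministic deployment property described above can be made concrete: once training and validation are complete, the exported policy is a fixed function with frozen parameters, so identical states always produce identical actions. The tiny linear "actor" below is a hypothetical stand-in for a real neural network, with made-up weights and signal ordering.

```python
# Hypothetical frozen policy: a tiny linear actor standing in for a
# trained neural network. Weights and state layout are illustrative.

WEIGHTS = [0.002, 0.0005, -0.001]   # frozen after training and validation

def policy(state):
    # Linear combination, clamped to the valid torque-split range [0, 1].
    raw = 0.5 + sum(w * s for w, s in zip(WEIGHTS, state))
    return min(1.0, max(0.0, raw))

state = [80.0, 150.0, 400.0]        # e.g., speed, torque, voltage
a1, a2 = policy(state), policy(state)
```

Because no randomness or learning remains at deployment time, repeated evaluation of the same state yields the same action, which is the predictability an embedded automotive function requires.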
RTMaps as a Key Enabler for Embedded AI Deployment in Vehicles
The intuitive RTMaps middleware proved to be a key enabler for successfully implementing these agents on vehicle computers and managing various data signals and streams. The platform allowed existing Python algorithms to be used with only minor adjustments. Its low hardware requirements ensured that deployment costs in the vehicle remained minimal, facilitating rollout to additional development vehicles with different powertrain configurations. Data exchange with the vehicle using XCP over CAN provided fast and reliable communication. Notably, at the start of the project, the XCP over CAN and XCP over Ethernet interfaces were not yet available; however, they were implemented upon request within three months, allowing the project to proceed without delays.
Next Steps: Extending AI Control to Thermal Management Systems
The RL agent approach extends beyond torque distribution to various operating strategies involving complex physical relationships. A promising application is thermal management system control, where components such as pumps, flaps, valves, and fans can be optimally controlled to minimize energy consumption while maintaining appropriate thermal conditions.
Dr. Benjamin Schläpfer, BMW Group
About the author
Dr. Benjamin Schläpfer is an AI engineer in the powertrain research department at BMW Group in Garching, Germany.
dSPACE MAGAZINE, PUBLISHED DECEMBER 2025
Reinforcement Learning
Reinforcement learning (RL) is a machine learning methodology where an agent learns optimal policies through trial-and-error interaction with its environment. The agent receives numerical reward signals that indicate action quality, enabling it to iteratively improve its decision-making strategy to maximize cumulative reward. Examples of RL algorithms include:
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed DDPG (TD3)
- Soft Actor Critic (SAC)
- Proximal Policy Optimization (PPO)
The individual algorithms have different advantages and disadvantages. The choice of a suitable algorithm depends on the problem setup, more specifically on the environment with its state, action, and reward structure. A good and detailed overview of established RL algorithms is provided by the following source:
Joshua Achiam: Spinning Up in Deep Reinforcement Learning, 2018. https://github.com/openai/spinningup