How to train your wagon – Reinforcement Learning for Automated Driving

Written by Márton Görög / Posted at 8/9/19

Or not only your pet: machines can be trained much as we often learn ourselves, through trial, error and rewards. Reinforcement learning has what it takes to be the final piece of the puzzle.

Every day, drivers around the world make billions of decisions in situations where the best next step is unknown. These range from the most menial driving tasks to extraordinary and dangerous situations, which humans solve by relying on their knowledge, experience, and reflexes. We can extrapolate from past events to find the right solution. The more traditional supervised-learning method of artificial intelligence, on the other hand, is ill-prepared for this. Hence the idea: why not teach a computer to complete tasks just as we teach ourselves, our children, or even our pets? Enter: reinforcement learning.

The information limitation

The more traditional use-cases of artificial intelligence (if you can call any of them that) focus on tasks for which the solutions are known. The format is simple: we know the answer; thus, a model can be trained to reach optimal behavior in a supervised way. However, this isn’t enough for many use-cases, which has fueled research into artificial intelligence for problems where the best next step is unknown. These are mostly environments with a lot of discrete options (a game of Go), complex continuous cases (a double inverted pendulum), and highly interactive situations (like driving around the Arc de Triomphe).

The fundamental premise behind reinforcement learning is that even when the best next step is unknown, it is easy to tell if the long-term results of a given behavior are appropriate. In the examples above: did the AI win the game against the world champion; was the pendulum swung up and held successfully; did the vehicle reach its destination on time and without an accident? By scoring a series of steps, that is, a trajectory, we are able to choose the best solution.

Using this hindsight knowledge (called rewards), an agent trying out various strategies (called exploring) can be guided to find the optimal behavior. This may sound familiar. When you train your new puppy to sit, you first show it the desired result and then reinforce it with a treat. Reinforcement learning, in a way, follows the same methodology to teach an AI agent.


Rewards can be delayed and sparse. The feedback may be the result of earlier steps; it’s up to the agent to figure out which move or moves helped to achieve the result, much as when that growing puppy receives a treat for a sequence of actions rather than for obeying a single command. Think, for example, of the game Pong: the paddle’s movement after the last hit doesn’t matter at all. Because the target policy, the environment dynamics and the rewards are all unknown, in practice machines often need an enormous number of moves, in the millions, to figure out and learn an optimal policy (a mapping from states to actions).
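One common way to spread a delayed reward back over the earlier steps is the discounted return. The sketch below is only an illustration, not a method described in the talk: the function name and the discount factor `gamma` are assumptions. Note that discounting alone cannot tell which moves actually mattered; it only gives more credit to steps closer to the reward.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A Pong-like rally: no feedback at all until the point is won on the last step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Every earlier step receives a share of the final reward, which gives the agent a learning signal even for moves that produced no immediate feedback.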

The reward function can be designed in several ways, and some designs are more helpful to the agent than others. In TORCS (an open-source car racing simulator often used for testing algorithms), we may choose or combine some of these:

  1. A higher score for a faster lap time or for finishing first? – These rewards may be too sparse: too many actions are taken without any feedback.
  2. Overtaking an opponent? – Take care: the agent must not learn to overtake, let the opponent pass again, and then repeat the maneuver to maximize rewards.
  3. Higher speed along the track? – Provides dense information and could help the agent make sensible decisions quickly.
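The three options above could be combined roughly as follows. This is a hedged sketch: the `StepInfo` fields, the weights and the target lap time are all hypothetical and do not correspond to the TORCS API. Counting the net change in race position, so that being overtaken costs as much as overtaking earns, is one possible guard against the overtake-and-yield exploit from option 2.

```python
import math
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step observation; field names are illustrative only."""
    speed: float           # m/s
    angle_to_track: float  # radians between the car heading and the track axis
    places_gained: int     # +1 for an overtake, -1 when overtaken, else 0
    lap_finished: bool
    lap_time: float        # seconds, meaningful only when lap_finished is True


def reward(info, w_progress=1.0, w_place=5.0, w_lap=100.0, target_lap_time=90.0):
    # Dense term (option 3): speed projected onto the track direction.
    progress = info.speed * math.cos(info.angle_to_track)
    # Symmetric term (option 2): overtaking and then yielding sums to zero.
    place = info.places_gained
    # Sparse term (option 1): bonus only when a lap is completed.
    lap = 0.0
    if info.lap_finished:
        lap = (target_lap_time - info.lap_time) / target_lap_time
    return w_progress * progress + w_place * place + w_lap * lap


print(reward(StepInfo(speed=10.0, angle_to_track=0.0, places_gained=0,
                      lap_finished=False, lap_time=0.0)))  # 10.0
```

The weights decide how strongly the dense speed signal dominates the sparse lap bonus, and in practice they would need tuning.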

In its standard setup, RL learns the expected reward for taking an action and for the actions that follow it. The next move is then decided by choosing the action with the highest expected reward.
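Tabular Q-learning is one textbook instance of this idea, shown here as a minimal sketch. The action names, the learning rate `alpha` and the discount `gamma` are assumptions made for illustration; real driving agents would replace the table with a neural network.

```python
from collections import defaultdict

ACTIONS = ("left", "straight", "right")  # hypothetical discrete action set


def make_q_table():
    # Every unseen state starts with an expected reward of 0 for each action.
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def greedy_action(q, state):
    """Pick the action with the highest expected reward in this state."""
    return max(ACTIONS, key=lambda a: q[state][a])


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * best value of next_state."""
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])


q = make_q_table()
q_update(q, "corner_ahead", "right", reward=1.0, next_state="exit",
         done=True, alpha=0.5)
print(greedy_action(q, "corner_ahead"))  # right
```

The `best_next` term is what lets the estimate for one action account for the rewards of the following ones.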


The agent needs to take exploratory steps to find better behavior, as the initial policy is determined by the random weights of a neural network. The basic exploration strategy is to take a random step now and then, that is, to try behaving in a different, novel way and check whether the received reward is higher. This kind of exploration is noise that is independent of the network’s output.
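The usual name for this basic strategy is epsilon-greedy. A minimal sketch follows, in which the function name and the `epsilon` parameter are illustrative choices rather than anything specified in the post:

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise take the best action.

    q_values is a list of expected rewards, one per action index.
    """
    if rng.random() < epsilon:
        # Exploratory step: ignores the network's output entirely.
        return rng.randrange(len(q_values))
    # Greedy step: index of the highest expected reward.
    return max(range(len(q_values)), key=q_values.__getitem__)


print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

Typically `epsilon` starts high and is reduced during training, so that the agent explores less as its estimates improve.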

Another, more recently introduced method (called Noisy Nets) injects additive noise into the network’s weights and changes the behavior that way. This makes it possible to probe a state-dependent, consistent change of behavior, which often results in better final performance.
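The core of the idea can be sketched as a linear layer whose effective weights are `mu + sigma * eps`. This is a simplified, assumption-laden illustration: it omits the bias, uses independent Gaussian noise, and does not train `mu` or `sigma`, whereas in Noisy Nets both are learned by gradient descent.

```python
import random


class NoisyLinear:
    """Bias-free linear layer with additive Gaussian noise on its weights."""

    def __init__(self, n_in, n_out, sigma_init=0.5, seed=0):
        self.rng = random.Random(seed)
        bound = 1.0 / n_in ** 0.5
        self.w_mu = [[self.rng.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.w_sigma = [[sigma_init * bound] * n_in for _ in range(n_out)]
        self.resample_noise()

    def resample_noise(self):
        # One noise sample is kept until resampled, so behavior stays consistent.
        self.w_eps = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.w_mu]

    def forward(self, x, noisy=True):
        out = []
        for mu_row, sig_row, eps_row in zip(self.w_mu, self.w_sigma, self.w_eps):
            out.append(sum((m + (s * e if noisy else 0.0)) * xi
                           for m, s, e, xi in zip(mu_row, sig_row, eps_row, x)))
        return out


layer = NoisyLinear(n_in=3, n_out=2)
x = [1.0, 2.0, 3.0]
print(layer.forward(x) == layer.forward(x))  # True: same noise, same behavior
layer.resample_noise()                       # a new, consistent perturbation
```

Because the perturbation lives in the weights, its effect on the output depends on the input state, unlike a random step that ignores the network altogether.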

Finally, some exploration strategies motivate the agent through rewards (“intrinsically motivated”) to seek out unseen situations, without modifying the network’s output in any direct way.
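A count-based novelty bonus is one simple form of intrinsic motivation. The sketch below assumes states can be reduced to a hashable key, which is a strong simplification for driving, and the class name and `scale` parameter are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Extra reward that shrinks as the same state is visited more often."""

    def __init__(self, scale=1.0):
        self.counts = Counter()
        self.scale = scale

    def __call__(self, state_key):
        self.counts[state_key] += 1
        return self.scale / math.sqrt(self.counts[state_key])


bonus = CountBonus()
extrinsic = 0.0                           # the environment gave no reward here
print(extrinsic + bonus("roundabout"))    # 1.0 on the first visit
print(bonus("roundabout") < 1.0)          # True: the novelty wears off
```

The bonus is simply added to the environment's reward, so the policy itself is left untouched and only its training signal changes.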

Learning the safe way

RL opens up new avenues in automated driving development. Currently, the vast majority of AI in automated driving is connected to perception, classification and behavior prediction. With reinforcement learning, it may be possible to expand AI capabilities into other areas. However, there are several challenges to be overcome.

In a computer game, there’s nothing wrong with crashing a car 1000 times. In the real world, that’s not such a good idea… There’s no way an agent could be let loose to wreak havoc as it makes random decisions in public traffic and learns from the negative feedback.

A typical solution is to use a simulator to train the agent and only deploy the mature agent on roads. Gathering experience in a simulated environment is safe, and much more scalable than real-world driving. Nevertheless, it’s almost impossible to replicate the complexity of the real world. There will always be slight differences, no matter how realistic low-level dynamics and physics are, how realistically traffic behaves and how detailed the roads and environment are.

Another approach, which sits between supervised and reinforcement learning, is called imitation learning. Here, recordings of several hours of human driving are used to train a model to replicate the observed behavior. There are some difficulties: it can be challenging to collect enough samples of rare cases (for example, U-turns). Furthermore, while the model can learn a good (but probably not optimal) policy, a danger remains: it has no knowledge of the cost of accidents or of how to avoid them. The training data must therefore contain dangerous situations as well, to help the agent recognize possible accidents.
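The simplest form of imitation learning, often called behavior cloning, is plain supervised regression on recorded state and action pairs. The toy sketch below fits a one-dimensional linear steering policy by least squares; the data, the state definition (lateral offset from the lane center) and the function name are invented for illustration only.

```python
def fit_linear_policy(states, actions):
    """Least-squares fit of action = gain * state + bias for a 1-D state."""
    n = len(states)
    mean_s = sum(states) / n
    mean_a = sum(actions) / n
    var = sum((s - mean_s) ** 2 for s in states)
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(states, actions))
    gain = cov / var
    bias = mean_a - gain * mean_s
    return lambda s: gain * s + bias


# Made-up recordings: lateral offset in meters -> steering command of the driver.
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0]
steering = [0.5, 0.25, 0.0, -0.25, -0.5]

policy = fit_linear_policy(offsets, steering)
print(policy(0.2))  # steers back towards the lane center
print(policy(5.0))  # far outside the recorded range: pure extrapolation
```

The second call shows the weakness described above: for situations that never occurred in the recordings, the cloned policy can only extrapolate.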

Reinforcing automation

RL presents several unique paths forward for the future of automated driving. There are possibilities for its deployment in low-level control (steering), planning (lane change decision making), or even fleet management. At AImotive we have begun research into these areas. For safety, these tests are either conducted in virtual worlds or heavily supervised. At the current maturity of the technology, artificial intelligence is never allowed to control the vehicle directly. Furthermore, existing safety limits in our drive-by-wire system ensure that no erratic actuator commands can be given to the vehicle.
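As a generic illustration of what such limits can look like, the sketch below clamps a command to a fixed range and also bounds how fast it may change per control cycle. It is not AImotive's drive-by-wire implementation; the function name, parameters and numeric limits are all hypothetical.

```python
def limit_command(requested, previous, lo, hi, max_step):
    """Clamp a command to [lo, hi] and limit its change per control cycle."""
    # Rate limit: the command may move at most max_step away from the last one.
    step = max(-max_step, min(max_step, requested - previous))
    # Range limit: never leave the allowed interval.
    return max(lo, min(hi, previous + step))


# An erratic jump from 0.0 to 1.0 is reduced to a single small step.
print(limit_command(requested=1.0, previous=0.0, lo=-0.5, hi=0.5, max_step=0.1))  # 0.1
```

A limiter of this kind sits between the planner and the actuators, so it applies regardless of whether the command came from a learned policy or from classical control.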


This post is the second in a series of blogs based on presentations given at the first AImotive Meetup in Budapest. To be notified of future events, subscribe to our newsletter or follow us on LinkedIn. You can find the first blog of the series here.

Márton Görög is a Lead AI Research Scientist at AImotive. With a computer science degree specializing in bionic engineering, Márton gained experience as a software developer in various industries before joining AImotive to focus on artificial intelligence. Working within our aiDrive team, he is responsible for research into AI-based technologies that improve the performance of our motion and trajectory planning systems.