Inference at the Edge

Written by Tony King-Smith / Posted at 4/12/19

The bigger it gets, the more confusing it becomes…

The hype around neural networks (NNs) never ceases to amaze me. It seems like AI is the answer to pretty much every world problem!

However, when a technology as ubiquitous as AI is adopted on such a broad scale, across so many markets, over-simplistic generalizations inevitably arise. These often lead people to make vital engineering decisions based on erroneous assumptions. One such generalization concerns the hardware needed to execute AI in embedded applications such as autonomous vehicles (AVs).

Many of the popular AI functions we experience today such as face detection, object recognition and natural speech recognition are based on a huge amount of computing executed over hours, days or even weeks on massive data servers. This task is called “NN Training” and requires enormous compute power. Since training takes so much compute power and is fundamental to generating accurate NN-based inference engines, training servers have become the poster child for AI – as they involve such big numbers!

However, this leads to the common assumption that any hardware platform running AI algorithms needs compute power similar to that used for training. That’s not the case for “NN Inference”: the task of using the results of training to infer decisions quickly. And inference is what NNs in autonomous vehicles are all about.

Powerful compute isn’t everything

NN inference engines have very different characteristics to NN training engines. They need to make each decision extremely quickly (for AVs, in a few milliseconds); they must run continuously for hours at a time without fail; and they need to use as little power as possible. So, while they need to be powerful computation engines, they only need to calculate results once for each set of input data.

Data, known as “weights”, are created as the result of the NN training process. During the training phase, the NN designer can change not only the weights but also the NN’s configuration (known as the “NN topology”), to see if they can make the training deliver better results. The inference engine then loads in these weights during its calculations and applies them in various ways (often some form of convolution) to each set of input data to calculate the results.
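To make the idea of applying fixed weights concrete, here is a minimal sketch in plain Python. The kernel values and the names `TRAINED_WEIGHTS`, `conv1d` and `infer` are hypothetical, chosen only for illustration; a real NN has many layers and far more weights, but the principle is the same: the weights are loaded, never modified, and each set of input data passes through once.

```python
def conv1d(inputs, weights):
    """Slide fixed weights over the inputs (a 'valid' 1D convolution as used in NN layers)."""
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, inputs[i:i + k]))
        for i in range(len(inputs) - k + 1)
    ]

# Hypothetical weights standing in for values produced by training.
# This particular kernel responds to a rising edge in the input.
TRAINED_WEIGHTS = [-1.0, 0.0, 1.0]

def infer(inputs):
    """Inference: the weights are read-only, and each input set is processed once."""
    return conv1d(inputs, TRAINED_WEIGHTS)

result = infer([0.0, 0.0, 1.0, 1.0, 1.0])
print(result)  # [1.0, 1.0, 0.0]
```

Nothing in `infer` updates `TRAINED_WEIGHTS`; changing the weights is the job of training, which happens beforehand.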

Unlike the NN training engine, the NN inference engine isn’t trying to learn anything new. It’s using the results of all that training done earlier to deliver the best answers quickly. The compute power needed by the NN inference engine is therefore usually significantly smaller than that used for “training” – but it must be very fast, especially for applications such as AVs.

Determinism and latency – key for AVs

Inference engines used to control something also need to deliver results consistently, in a precisely bounded amount of time. Providing a few results, then waiting for another batch to be loaded up, is not an option. This is called being “deterministic”, and it’s crucial for applications such as AVs, robots, or anything else that requires robust, real-time behavior.
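One way to picture determinism is to look at the worst case and the spread of per-result times rather than the average. The sketch below assumes a hypothetical 10 ms deadline and two made-up series of timings; `timing_report` is an illustrative helper, not part of any real toolkit.

```python
from statistics import mean

def timing_report(latencies_ms, deadline_ms):
    """Summarize per-result timings against a hard deadline."""
    worst = max(latencies_ms)
    best = min(latencies_ms)
    return {
        "average_ms": mean(latencies_ms),
        "worst_ms": worst,
        "jitter_ms": worst - best,          # spread between best and worst case
        "meets_deadline": worst <= deadline_ms,
    }

DEADLINE_MS = 10.0  # hypothetical real-time budget per result

# Made-up timings: one engine is steady, the other delivers results in bursts.
steady = timing_report([4.9, 5.0, 5.1, 5.0], DEADLINE_MS)
bursty = timing_report([18.5, 0.5, 0.5, 0.5], DEADLINE_MS)

print("steady:", steady)
print("bursty:", bursty)
```

Both series average about 5 ms, yet only the steady one stays inside the deadline every time. For a control loop, the worst case is what matters.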

Furthermore, the time taken from when a sensor input is generated – a camera image, or a LiDAR point cloud for example – until the result is known is vital. This time, known as “latency”, determines how quickly an AV can respond to anything detected.
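A quick back-of-the-envelope calculation shows why this matters. The sketch below assumes constant speed and uses illustrative figures (100 km/h and a few example latencies) that are not taken from any specific vehicle or system.

```python
def distance_during_latency_m(speed_kmh, latency_ms):
    """Distance in metres covered at constant speed while a result is still pending."""
    speed_m_per_s = speed_kmh * 1000.0 / 3600.0
    return speed_m_per_s * latency_ms / 1000.0

# Illustrative latencies only; real figures depend on the whole sensing pipeline.
for latency_ms in (10, 100, 500):
    d = distance_during_latency_m(100, latency_ms)
    print(f"{latency_ms:>4} ms at 100 km/h -> {d:.2f} m")
```

At 100 km/h a vehicle covers roughly 2.8 m in 100 ms, so every millisecond of latency translates directly into distance travelled before the AV can begin to react.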

Just to confuse things more, latency means different things to different people, even within the specialized world of AI. For a NN training engine, latency is used to define the average time it takes to calculate a result.

Training servers use a technique called “batching” to execute the same NN algorithm over thousands of different data inputs such as photos. This enables the server, with its massive memory and power resources, to set up all the data in such a way that the algorithm then runs extremely quickly. The end result is a lot of throughput. Once a server has loaded up all the data it needs to process a batch, it generates the results very rapidly. However, setting it all up takes a long time. That means the first answer takes a long time to arrive, but once you get the first one, the next result, and the one after that, come extremely quickly.

That’s no good for NN inference engines. As soon as one set of data appears (e.g. an image from a camera), you want to calculate the result as soon as possible, without fail. You can’t wait when you are deciding whether you are going to hit a pedestrian or bicycle! So, NN inference engines for AVs usually ignore batching – known technically as “batch size = 1”. That means as soon as one new piece of data arrives, the result is calculated immediately.
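The trade-off between batching and “batch size = 1” can be sketched with a toy timing model. Everything here is an assumption made for illustration: a camera producing a frame every 33 ms, a fixed 20 ms setup cost per batch, and 1 ms of compute per frame. Real hardware behaves in more complex ways, so treat this as a sketch rather than a measurement.

```python
def first_frame_latency_ms(batch_size, frame_interval_ms, setup_ms, per_item_ms):
    """Time from the first frame of a batch arriving until its result is ready.

    The first frame must wait for the rest of the batch to arrive,
    then for setup, then for the whole batch to be computed.
    """
    wait_ms = (batch_size - 1) * frame_interval_ms
    return wait_ms + setup_ms + batch_size * per_item_ms

def throughput_per_s(batch_size, setup_ms, per_item_ms):
    """Results per second while computing, with setup amortized over the batch."""
    return 1000.0 * batch_size / (setup_ms + batch_size * per_item_ms)

# Hypothetical figures, for illustration only.
FRAME_INTERVAL_MS = 33.0   # roughly a 30 frames-per-second camera
SETUP_MS = 20.0
PER_ITEM_MS = 1.0

for batch_size in (1, 32):
    latency = first_frame_latency_ms(batch_size, FRAME_INTERVAL_MS, SETUP_MS, PER_ITEM_MS)
    rate = throughput_per_s(batch_size, SETUP_MS, PER_ITEM_MS)
    print(f"batch size {batch_size:>2}: latency {latency:.0f} ms, throughput {rate:.0f} results/s")
```

In this model a batch of 32 delivers far higher throughput, but the first frame waits over a second for its result, while batch size 1 answers in 21 ms. That is the wrong trade for a system deciding whether to brake.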

Low latency and determinism are two of many reasons why an optimized NN inference hardware accelerator for AVs is designed quite differently to a NN training accelerator.

Designed for the job

A hardware NN accelerator designed to power real-time inference has a significantly different architecture to one designed to power a training engine. That’s why NN hardware accelerators designed for datacenter-based NN training are often far less suitable for NN inference than specialized inference hardware accelerators like AImotive’s aiWare.