The Need for a New Approach to Neural Network Acceleration – As the wider use of artificial intelligence unfolds, its focus is shifting from non-real-time uses, like image classification, to latency-critical environments such as self-driving technology. However, because the majority of software developed for AI is used primarily in non-real-time applications, low-latency software, and the platforms it runs on, must be rethought. This also means that tools like the benchmarks used to tune algorithms and system performance need to be reinvented.
These challenges may be best answered by a team in direct contact with latency-critical use cases. At AImotive, our hardware and software development efforts are deeply connected, and with good reason: self-driving is one of the most demanding uses of AI today. Accelerators in current prototypes consume up to 1000 W, a figure that must fall dramatically before mass production can begin. To achieve this, developers must be aware of these problems and design systems with the low-latency, high-performance demands of self-driving in mind.
The task at the core of computer vision, image classification, is relatively simple when run with small inputs and no latency requirements. In an offline scenario, it does not matter how long the neural network takes to recognize a group of puppies, only that it recognizes them correctly. But consider the implications on the road, at 75 mph, where the difference between life and death can be measured in milliseconds: time is as important as accuracy. The majority of current hardware solutions are not built with such use cases in mind. aiWare, on the other hand, is designed to handle high-resolution inputs efficiently in any use case, including self-driving.
In traditional implementations, low-resolution images are run through relatively simple neural networks. However, using a 224 by 224 pixel image (common in benchmarking) for self-driving would mean the car sees only a few feet ahead clearly. Images with a resolution of at least one or two megapixels are needed to guarantee safety on the road. What happens when such large images are used as input?
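The gap between benchmark and driving resolutions is easy to quantify. The sketch below compares the multiply-accumulate (MAC) workload of a single convolutional layer at 224 x 224 versus a ~2-megapixel camera frame; the layer shape (3x3 kernel, 3 input channels, 64 output channels, stride 1) is an illustrative assumption, not a description of any specific network.

```python
# Rough per-frame workload for one convolutional layer at benchmark
# resolution (224 x 224) versus a driving-grade camera frame (1920 x 1080).
# Layer shape is an illustrative assumption: 3x3 kernel, 3 input channels,
# 64 output channels, stride 1, "same" padding.

def conv_macs(width, height, k=3, c_in=3, c_out=64):
    """Multiply-accumulate operations for one conv layer (stride 1, same padding)."""
    return width * height * k * k * c_in * c_out

benchmark = conv_macs(224, 224)      # ~86.7 million MACs
camera = conv_macs(1920, 1080)       # ~3.58 billion MACs

print(f"224x224 frame:   {benchmark / 1e6:7.1f} MMACs")
print(f"1920x1080 frame: {camera / 1e6:7.1f} MMACs")
print(f"ratio: {camera / benchmark:.1f}x")
```

The workload scales with the pixel count, so a 2-megapixel frame costs roughly 41 times more compute per layer than the benchmark image, before any latency constraint is even considered.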
First, traditional hardware solutions work by storing all data relevant to the process, other than the weights, in the on-chip cache. This is impossible with large input images. The demands of the process cannot be changed; what can be optimized is how and when the accelerator accesses external memory. Using external memory efficiently is one of the keys to viable hardware solutions for advanced real-world applications.
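To see why on-chip storage breaks down, consider the size of a single intermediate feature map. The channel count and 16-bit storage below are assumptions for illustration, but the conclusion holds broadly: at megapixel resolutions, activations alone dwarf the few megabytes of SRAM a typical accelerator carries on-chip.

```python
# Activation (feature-map) memory for one layer of a CNN, comparing a
# benchmark-size input with a ~2 MP camera frame. The 64-channel width and
# fp16 (2-byte) storage are illustrative assumptions.

def activation_bytes(width, height, channels, bytes_per_value=2):
    """Bytes needed to hold one layer's output feature map."""
    return width * height * channels * bytes_per_value

small = activation_bytes(224, 224, 64)      # ~6.1 MiB
large = activation_bytes(1920, 1080, 64)    # ~253 MiB

print(f"224x224, 64 ch, fp16:   {small / 2**20:6.1f} MiB")
print(f"1920x1080, 64 ch, fp16: {large / 2**20:6.1f} MiB")
```

Even the benchmark-size feature map strains a typical on-chip cache; the camera-size one exceeds it by two orders of magnitude, which is why how and when the accelerator touches external memory becomes the deciding factor.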
Second, the time needed to complete the convolutional tasks balloons uncontrollably, if the hardware can handle the process at all. Some industry players believe the solution lies in artificial job parallelization, in other words, batching: the system completes repeated tasks on batches of data, for example 64 images at a time instead of one. However, this approach is only viable in processes that are not latency-critical.
Continuing the previous example: a car travelling at 75 mph on a highway with cameras operating at a conservative 30 FPS (frames per second) would need over two seconds to collect a single 64-image batch. More than two seconds pass before the system even begins to process the images, while life and death remain separated by mere milliseconds.
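The arithmetic behind this example can be made explicit:

```python
# Batch-collection latency for the article's example: a 64-frame batch
# filled by a 30 FPS camera, and the distance a car at 75 mph covers
# while the batch is still being collected.

BATCH_SIZE = 64
FPS = 30
MPH_TO_MPS = 0.44704            # metres per second per mph

wait_s = BATCH_SIZE / FPS       # time to fill one batch: ~2.13 s
distance_m = 75 * MPH_TO_MPS * wait_s

print(f"time to collect one batch: {wait_s:.2f} s")
print(f"distance covered at 75 mph: {distance_m:.0f} m")
```

In the ~2.13 seconds it takes just to assemble the batch, the car travels roughly 72 metres with no processing done at all.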
The only long-term solution is for hardware to adapt. What makes GPUs good neural network accelerators is their high performance and massive data bandwidth. They are, in a sense, a naive implementation of what embedded NN accelerators for computer vision should be: they complete the tasks at hand, but consume large amounts of power.
To be truly effective, aiWare is tuned to the challenges of high-resolution input in a latency-critical environment and streamlines the processes that most severely affect its operation. First, it optimizes the way external memory is accessed, reading and writing data as few times as possible; avoiding the reading and writing of partial results drastically reduces the strain on external memory. Second, it maximizes MAC utilization, bringing the number of mathematical operations required to complete a task close to the theoretical minimum.
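The benefit of not spilling partial results can be estimated with a simple traffic model. The sketch below compares, for two consecutive convolutional layers on a 2 MP frame, a naive schedule that writes the intermediate feature map to DRAM and reads it back against a fused schedule that keeps it on-chip. The channel counts and fp16 storage are assumptions for illustration, not aiWare's actual configuration.

```python
# External-memory traffic for two conv layers on a 1920x1080 frame:
# naive schedule (intermediate feature map written to DRAM, then read back)
# versus a fused schedule (intermediate stays on-chip). Channel counts and
# fp16 storage are illustrative assumptions.

def fmap_bytes(w, h, c, bytes_per_value=2):
    return w * h * c * bytes_per_value

inp = fmap_bytes(1920, 1080, 3)     # camera input, read once
mid = fmap_bytes(1920, 1080, 32)    # intermediate activations
out = fmap_bytes(1920, 1080, 32)    # final output, written once

naive = inp + mid + mid + out       # mid is written out, then read back
fused = inp + out                   # mid never leaves the chip

print(f"naive traffic: {naive / 2**20:.0f} MiB per frame")
print(f"fused traffic: {fused / 2**20:.0f} MiB per frame")
print(f"saving: {1 - fused / naive:.0%}")
```

Even with only two layers, skipping the round trip for one intermediate result cuts external traffic by roughly 65 percent per frame; over a deep network the savings compound.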
Accessing external memory is an extremely power-hungry task, but it is also where the greatest gains can be made. To achieve this goal, the very way RAM is accessed may have to change; so much so that random access memory may not stay random much longer. Today's solutions use only a fraction of the total external data bandwidth, wasting a huge amount of potential speed. Optimizing the bandwidth and speed of the accelerator's external memory is yet another way to save power.
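Why external memory dominates the power budget can be illustrated with widely cited, order-of-magnitude energy figures for ~45 nm silicon (roughly 640 pJ per 32-bit DRAM access versus about 5 pJ per 32-bit on-chip SRAM access). The per-frame access count below is an assumption for illustration.

```python
# Back-of-envelope energy comparison using widely cited approximate figures
# for ~45 nm silicon: a 32-bit off-chip DRAM access costs on the order of
# 100x more energy than a 32-bit on-chip SRAM access. The per-frame access
# count is an assumed round number for illustration.

PJ_PER_DRAM_ACCESS = 640.0     # ~640 pJ per 32-bit DRAM access (approximate)
PJ_PER_SRAM_ACCESS = 5.0       # ~5 pJ per 32-bit SRAM access (approximate)
ACCESSES_PER_FRAME = 100e6     # assumed 32-bit values moved per frame

dram_mj = ACCESSES_PER_FRAME * PJ_PER_DRAM_ACCESS * 1e-9   # pJ -> mJ
sram_mj = ACCESSES_PER_FRAME * PJ_PER_SRAM_ACCESS * 1e-9

print(f"per frame: DRAM {dram_mj:.0f} mJ vs on-chip SRAM {sram_mj:.1f} mJ")
print(f"DRAM traffic alone, at 30 FPS: {dram_mj * 30 / 1000:.2f} W")
```

Under these assumptions, data movement through DRAM alone consumes nearly 2 W at 30 FPS, which is why keeping data on-chip and maximizing useful bandwidth per access matters so much against a tight power budget.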
Ideally, the power consumption of a complete L5 autonomous system should be below 50 W, a tiny amount compared to the 400–1000 W consumed by GPU-based systems in today's prototype vehicles. This makes power efficiency a central characteristic of any hardware solution for self-driving use cases.
With the importance of artificial intelligence in our everyday lives constantly increasing, hardware suppliers must now solve a myriad of previously unforeseen difficulties. Only hardware designs that keep the points above in mind will cope with the demands of latency-critical computer vision. This is why we created aiWare.