Written by Mustafa Ali / Posted on 12/28/22
aiWare4+ – Neural Processing Unit for the future
aiMotive’s latest NPU is aimed at next-gen automated driving workloads
We recently announced the launch of aiWare4+, a significant upgrade to our highly efficient aiWare4 Neural Processing Unit (NPU), addressing the demands of evolving neural network workloads while maintaining efficiency. aiWare4+ brings new hardware features along with an enhanced software development kit (SDK) and aiWare Studio tool, enabling users to take full advantage of these new capabilities.
Evolving Neural Network Workloads
As carmakers strive to move beyond L2+ and bring higher levels of automated driving to the mass market, advancing the perception system to better perceive and respond to the most challenging driving scenarios within the targeted Operational Design Domain (ODD) is key. New kinds of neural networks are being proposed to achieve these goals, and these networks place increased demand on compute resources. The industry trend towards more centralized automotive electronics architectures is a further contributor to higher compute demand.
aiMotive has one of the largest teams of AI-focused engineers and researchers in Europe, who relentlessly analyze industry trends in autonomous driving to offer best-in-class, end-to-end solutions and tools to the industry. This extensive domain knowledge also guides our approach to aiWare feature support, ensuring that aiWare not only executes the evolving AI workloads the industry demands but does so with class-leading efficiency.
A recent development in the design of neural networks for automotive vision is the application of Transformer networks. Researchers have shown that Transformer-based networks can outperform pure CNNs in accuracy; however, Transformers are known to impose increased compute demands for both training and inference.
aiWare4+ has been conceived to serve these trends through flexibility, increased programmability, and scalability without compromising efficiency, the core principle that guides aiWare’s design choices. The enhanced programmability of aiWare4+ allows the highly intelligent aiWare compiler to optimize a wider range of network types for the aiWare architecture, including LSTMs and Transformer networks, while maintaining efficiency.
Doing More with Less
These ever-increasing compute demands have led researchers to explore ways of reducing the workload while maintaining prediction accuracy. The AI community has looked at two broad approaches: reducing the number of computations, or executing more computations on a given set of hardware units. An example of the latter is the use of lower precision, such as FP8 instead of FP16 or FP32, which enables higher performance by performing more computations with a given number of MAC units while also reducing memory bandwidth. An example of the former is skipping computations where operands are zero. This capitalizes on a property of the data known as sparsity, which is exploited by pruning weights using an approach called fine-grained structured sparsity, as sketched below.
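To make the pruning step concrete, the following Python/NumPy sketch applies a 2:4 fine-grained structured sparsity pattern to a weight matrix: within every group of four consecutive weights, only the two largest-magnitude values are kept, so a sparsity-aware MAC array can skip the zeroed operands. This is purely illustrative and not aiWare SDK code; the group size and layout are the generic 2:4 scheme, not aiWare internals.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Apply 2:4 fine-grained structured sparsity along the last axis.

    In every contiguous group of 4 weights, the 2 smallest-magnitude
    values are zeroed, halving the multiplications a sparsity-aware
    MAC array has to perform.
    """
    # assumes the total number of weights is divisible by 4
    w = weights.reshape(-1, 4)
    # indices of the 2 smallest magnitudes in each group of 4
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    pruned = w.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)   # zero them out
    return pruned.reshape(weights.shape)

# Example: a 2x8 weight matrix keeps exactly 2 non-zeros per group of 4.
w = np.random.randn(2, 8).astype(np.float32)
print(prune_2_4(w))
```

In practice, pruning is typically followed by fine-tuning so that the network recovers the accuracy lost to the zeroed weights.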
An important consideration is that networks must be portable between the training and inferencing stages. Since it cannot be assumed that the same platform will be used for both, aligning on industry-standard practices, such as a common FP8 format and sparsity approach, is key to maintaining network portability between training and inferencing. Hence, we have chosen to support 2:4 fine-grained structured sparsity and to align with the emerging industry standard for the FP8 format.
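As a rough illustration of what such a common FP8 format implies, the sketch below rounds float32 values onto a simulated E4M3 grid (4 exponent bits, 3 mantissa bits, the variant most often cited for inference). It is an illustrative approximation, not aiWare SDK code, and it ignores the subnormal and NaN encoding details of the real specification.

```python
import numpy as np

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Round float32 values onto an approximate FP8 E4M3 grid.

    Illustrative only: 3 mantissa bits give 8 steps per power of two,
    and magnitudes are clamped to the E4M3 maximum of 448. Subnormal
    and NaN encodings of the real FP8 spec are not modeled.
    """
    x = np.asarray(x, dtype=np.float32)
    clipped = np.clip(x, -448.0, 448.0)
    mag = np.abs(clipped)
    # floor(log2) picks the power-of-two interval each value falls in
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** -9)))
    step = 2.0 ** (exp - 3)   # spacing of representable values in that interval
    return np.sign(clipped) * np.round(mag / step) * step

# Storing activations in FP8 halves the DRAM traffic relative to FP16.
acts = np.random.randn(4, 4).astype(np.float32)
print(fake_quant_e4m3(acts))
```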
Data-first Architecture
We believe a ground-up design is the best approach to solving the problem efficiently. aiWare is designed from the ground up as a data-first architecture because we view efficient NN processing first and foremost as a data movement problem. Reducing off-chip bandwidth while maintaining high efficiency requires that data can move unhindered on-chip. To reduce off-chip bandwidth, aiWare uses a combination of innovative techniques such as tile-based processing and wavefront processing. At the same time, it is vital that on-chip data movement does not become the bottleneck. aiWare achieves this using near-memory computing with a two-level on-chip memory hierarchy: closer to the MACs are smaller but very high-bandwidth memories that implement massively parallel on-chip I/O for unhindered data transfer between the large number of compute units, while aiWare’s Wavefront RAM is a larger but lower-bandwidth on-chip memory that saves area and power. The aiWare compiler uses it to implement wavefront scheduling, which drastically reduces off-chip data movement.
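The effect of tile-based processing on off-chip traffic can be sketched in a few lines: a feature map is split into tiles small enough that each tile’s intermediate results stay in on-chip memory across several fused layers, so only the input tile and the final output cross the external memory interface. The tile sizes, layer operations, and fusion depth below are invented for illustration and do not reflect the actual aiWare pipeline.

```python
import numpy as np

def fused_layer_stack(tile: np.ndarray) -> np.ndarray:
    """Two fused layers executed back to back on one tile.

    The intermediate tensor never leaves on-chip memory; in a real
    schedule these would be convolution layers with halo handling.
    """
    hidden = np.maximum(tile * 0.5, 0.0)   # layer 1: scale + ReLU
    return hidden + 1.0                    # layer 2: bias add

def run_tiled(feature_map: np.ndarray, tile_h: int = 32, tile_w: int = 32) -> np.ndarray:
    """Process a feature map tile by tile.

    Each tile is fetched from external memory once, run through the
    fused layer stack, and written back once, instead of streaming the
    full intermediate tensor off-chip between layers.
    """
    h, w = feature_map.shape
    out = np.empty_like(feature_map)
    for y in range(0, h, tile_h):
        for x in range(0, w, tile_w):
            tile = feature_map[y:y + tile_h, x:x + tile_w]                  # one off-chip read
            out[y:y + tile_h, x:x + tile_w] = fused_layer_stack(tile)       # one off-chip write
    return out

fmap = np.random.randn(128, 128).astype(np.float32)
result = run_tiled(fmap)
```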
aiWare4+ offers up to 256 TOPS in a single-core instance, scaling to 1,024 TOPS using a multi-core approach. However, a given workload may not need the massive compute resources of a large core. With aiWare4+, we introduce interleaved multi-tasking, which allows the compiler to schedule multiple workloads concurrently, making use of otherwise idle resources. This is especially useful for centralized vehicle compute running multiple workloads, such as speech recognition or driver monitoring, alongside a more demanding AD workload. Using interleaved multi-tasking, the compiler can schedule the driver monitoring task on spare resources without impacting the higher-priority AD workload, as illustrated below.
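A toy model of interleaved multi-tasking is sketched below: a high-priority AD workload is scheduled first, and a lower-priority task (here, driver monitoring) is slotted into whatever compute capacity each time slice leaves idle. The slice granularity, utilization numbers, and task names are hypothetical and only illustrate the scheduling idea, not the aiWare compiler’s actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    work_per_slice: list[float]   # fraction of NPU capacity needed per time slice

def interleave(primary: Task, secondary: Task) -> list[str]:
    """Fill the capacity left idle by the primary task with the secondary task."""
    schedule = []
    remaining = sum(secondary.work_per_slice)   # total secondary work outstanding
    for t, used in enumerate(primary.work_per_slice):
        idle = max(0.0, 1.0 - used)              # spare capacity in this slice
        granted = min(idle, remaining)           # secondary work scheduled here
        remaining -= granted
        schedule.append(f"slice {t}: {primary.name}={used:.2f}, {secondary.name}={granted:.2f}")
    return schedule

# Hypothetical loads: the AD stack leaves some headroom in every slice.
ad = Task("AD perception", [0.9, 0.7, 0.8, 0.6])
dms = Task("driver monitoring", [0.2, 0.2, 0.2, 0.2])
for line in interleave(ad, dms):
    print(line)
```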
Not Just Hardware
With any automotive IP selection, future-proofing is important: automotive silicon has a long production cycle, and silicon vendors rightly worry about how to serve future needs in the face of evolving technology. aiWare addresses this with a combination of core hardware architecture flexibility, greatly enhanced in aiWare4+, and, equally importantly, the network-designer-friendly tool aiWare Studio. Together, these enable network designers to tweak their networks to best fit the hardware’s capabilities, and they facilitate mapping existing and upcoming workloads to aiWare4+. Firstly, the intelligent aiWare compiler manages optimal mapping of an NN to a given hardware configuration, ensuring the highest efficiency is achieved. Secondly, aiWare Studio gives users insight into how a given NN executes on the hardware at layer-by-layer granularity. Both tools exploit aiWare’s highly deterministic architecture, both for scheduling tasks and for highly accurate performance prediction.
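Because the execution schedule is deterministic, per-layer performance can be predicted with simple arithmetic rather than measured on silicon. The sketch below estimates the cycle count of a convolution layer from its MAC count, a hypothetical number of MAC units, and an assumed utilization figure; the formula and numbers are illustrative and are not the actual aiWare Studio model.

```python
def conv_layer_cycles(out_h: int, out_w: int, out_c: int,
                      in_c: int, k: int,
                      mac_units: int, utilization: float) -> float:
    """Estimate cycles for one convolution layer on a deterministic NPU.

    macs   = output pixels * output channels * (k * k * in_c) multiply-accumulates
    cycles = macs / (MAC units actually kept busy per cycle)
    """
    macs = out_h * out_w * out_c * k * k * in_c
    return macs / (mac_units * utilization)

# Hypothetical layer: 256x256x64 output, 3x3 kernel over 32 input channels,
# on an engine with 16,384 MAC units assumed to run at 95% utilization.
cycles = conv_layer_cycles(256, 256, 64, 32, 3, mac_units=16384, utilization=0.95)
print(f"estimated cycles: {cycles:,.0f}")
```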
Conclusion
aiWare4+ brings enhanced programmability, flexibility, and scalability, and builds on a highly efficient NPU architecture, making it more future-proof than ever. aiWare4+ continues to deliver the automotive industry’s highest NPU efficiency of up to 98% across a wide range of AI workloads, enabling superior performance using less silicon and less power.