Written by Lajos Németh / Posted at 9/8/20
Haste makes waste
You are the data
Why do we need data in the first place? A well-functioning automated driving software stack needs a recognition engine that perceives reality as accurately as possible, and from whose output an accurate virtual model space can be built. To do this we rely heavily on state-of-the-art AI solutions. The current industry trend is to collect and label tremendous amounts of data. However, a few questions always emerge… How much data is required? How do we know that this is the best strategy?
It is estimated that roughly 1.4 billion vehicles drive around the world’s roads, and people travel more than 23 trillion miles by road every year. A 2016 report showed that Americans spend an average of 17,600 minutes driving every year. By those estimates, Americans are generating 1.8 TB of data every year in their vehicles, without including raw sensor data. Why is this important? Well, McKinsey believes there could be as much as $750 billion of value in vehicle data by 2030. Considering these figures, it’s no surprise that companies are spending astronomical sums to buy data. However, there must be a smarter way to gather data and to keep costs down – especially after the economy took a big hit from COVID-19.
Do you need an n+1 image of the same thing?
The smarter way
We certainly think that there is a smarter way to gather data for our software stack. Forward-looking companies are already transitioning to a solution that collects data only from scenarios in which the recognition engine is not working seamlessly and omits those where it runs smoothly.
This requires a recognition engine that can tell when it is not performing optimally. For this reason, a confidence-based performance evaluation has to be implemented, one that can run both online and offline in a post-processing step. This basically means that instead of labeling every single frame or piece of data, you only deal with the ones that contain valuable information – the corner cases – which saves time and resources. Corner cases are examples in which the software would perform poorly or make a bad decision. With these examples the software can be refined, either by improving the software directly or by using the significantly reduced set of collected data for further training of the neural networks.
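As a rough illustration of confidence-based selection, the sketch below keeps a frame for labeling only when at least one detection in it scores below a threshold. The `Frame` type, the `select_for_labeling` helper, and the 0.5 threshold are all hypothetical names and values chosen for this example; a real recognition engine would likely combine several confidence and consistency signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Hypothetical container: one recorded frame and its detection scores."""
    frame_id: int
    confidences: List[float]  # one score per detection, each in [0, 1]


def is_corner_case(frame: Frame, threshold: float = 0.5) -> bool:
    """Flag a frame if any detection falls below the confidence threshold.

    A frame with no detections is not flagged by this simple rule.
    """
    return any(c < threshold for c in frame.confidences)


def select_for_labeling(frames: List[Frame], threshold: float = 0.5) -> List[Frame]:
    """Keep only the frames that look like corner cases; drop the rest."""
    return [f for f in frames if is_corner_case(f, threshold)]


# Illustrative data only: frame 2 contains one low-confidence detection.
frames = [
    Frame(1, [0.97, 0.95]),
    Frame(2, [0.91, 0.32]),
    Frame(3, [0.88]),
]
selected_ids = [f.frame_id for f in select_for_labeling(frames)]
print(selected_ids)
```

Here only one of the three frames would be sent for annotation, which is the kind of reduction in labeling effort described above.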
The corner case list changes dynamically. For example, today an elephant on the road may be a corner case, but once you have lots of images of elephants on the road, it is no longer a corner case. A corner case list always belongs to a specific version of the software, and there is no such thing as a universal corner case list. Therefore, a feedback loop is an essential part of the solution: the collected corner cases are annotated and used to improve the algorithm, and the improved algorithm defines new corner cases based on the corresponding confidences and inconsistencies.
Furthermore, the software needs to be tested in real road conditions as often as possible. The AImotive fleet is active on three continents, which means we test in different driving cultures and, in these diverse circumstances, encounter a large percentage of possible traffic situations. Our solution also allows us to further increase our testing with a fleet maintained by a third party, especially because our system can also collect data when the car is not in self-driving mode and the software is only running in the background. These levels of testing are considered necessary to safely introduce a higher level of automation.
The third and final component necessary to gather data the smart way is an infrastructure that supports and processes this pipeline. This is covered by our on-premises server farm, which can be easily scaled out to any cloud provider in case of increased computing demand.
Testing on three continents allows us to gather data from different driving cultures
The goal is to speed up cycles
In many current approaches the whole pipeline is built from incremental steps. After a certain amount of data has been gathered and labeled, the affected neural networks are automatically trained further. Once training is done, the improved neural network is benchmarked against an adjusted evaluation dataset, and if the quality of the network is better, it is released. The goal here is to keep the iterations as short as possible, and to speed up the cycles by fully automating the workflow.
This is not only a smarter and more efficient method but also one that helps to cut costs, which is crucial – particularly now, when the industry is still recovering from the shock caused by COVID-19. Moreover, companies could further reduce spending by using automated annotation instead of manual annotation.
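One iteration of such a cycle can be sketched as a simple release gate: train a candidate, benchmark it against the current version on the same evaluation set, and release it only if it scores higher. Everything in the sketch is a toy stand-in invented for illustration: the "model" is just the set of object classes it has seen, and `train`, `evaluate`, and `run_iteration` are hypothetical helpers, not part of any real pipeline.

```python
from typing import FrozenSet, Sequence

Model = FrozenSet[str]  # toy stand-in: the classes the model has learned


def train(model: Model, labeled: Sequence[str]) -> Model:
    """Toy 'further training': the model also learns the newly labeled classes."""
    return model | frozenset(labeled)


def evaluate(model: Model, eval_set: Sequence[str]) -> float:
    """Toy benchmark: fraction of evaluation samples the model recognizes."""
    return sum(sample in model for sample in eval_set) / len(eval_set)


def run_iteration(current: Model, labeled: Sequence[str],
                  eval_set: Sequence[str]) -> Model:
    """Train a candidate and release it only if it beats the current model."""
    candidate = train(current, labeled)
    if evaluate(candidate, eval_set) > evaluate(current, eval_set):
        return candidate  # quality improved: release the new version
    return current        # no improvement: keep the current version


eval_set = ["car", "elephant", "truck"]
released = run_iteration(frozenset({"car"}), ["elephant"], eval_set)
print(sorted(released))
```

Because every step is a plain function call, the whole iteration can run without manual intervention, which is what keeps the cycles short.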
There is no question that the right machine learning pipeline is essential to bring new features to market.
This example shows that shorter iterations and faster cycles mean more value-adding features in the car