aiSim Hybrid Rendering

Neural reconstruction and rendering, combined with standard virtual assets and scenarios, provide virtually limitless testing possibilities and substantially reduce the domain gap in virtual testing of ADAS and AD systems.

Authors: Máté Tóth, Péter Kovács, Zoltán Bendefy, Zoltán Hortsin, Balázs Teréki, Tamás Matuszka

Hybrid Rendering for Multimodal Autonomous Driving: Merging Neural and Physics-Based Simulation

TL;DR: We developed a method that combines neural reconstruction with traditional physics-based rendering, enhancing both techniques to support autonomous driving development. Since our solution is integrated into aiSim, our simulator, it can be tested interactively in real time, making it ideal for demonstrations.

Abstract

Neural reconstruction has advanced significantly in the past year, and dynamic models are becoming increasingly common. However, these models are limited to handling in-domain objects that closely follow their original trajectories. This demonstration presents a hybrid approach that combines the advantages of neural reconstruction with those of physics-based rendering. First, we remove dynamic objects from the scene and reconstruct the static environment using a neural reconstruction model. Then, we populate the reconstructed environment with dynamic objects in aiSim. This approach mitigates the key drawbacks of both methods: the domain gap of traditional simulation and the out-of-domain object rendering of neural reconstruction.
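To make the hybrid pipeline concrete, the sketch below illustrates one way the neural-rendered static background and the rasterized dynamic objects could be merged per pixel, assuming both renderers produce aligned RGB and depth buffers for the same camera. The function name and the simple depth test are illustrative only; a full hybrid renderer also needs exposure matching, consistent lighting, and shadow interactions, which this sketch omits.

```python
import numpy as np

def composite_hybrid_frame(bg_rgb, bg_depth, fg_rgb, fg_depth, fg_alpha):
    """Depth-aware composite of rasterized dynamic objects (foreground)
    over a neural-reconstructed static environment (background).

    bg_rgb, fg_rgb:     (H, W, 3) float arrays in [0, 1]
    bg_depth, fg_depth: (H, W) metric depth along the view ray
    fg_alpha:           (H, W) coverage of the rasterized objects (0 where empty)
    """
    # Dynamic objects win only where they are present and closer than the
    # reconstructed background, so occlusions stay consistent.
    fg_visible = (fg_alpha > 0.0) & (fg_depth < bg_depth)
    blend = np.where(fg_visible, fg_alpha, 0.0)[..., None]
    rgb = blend * fg_rgb + (1.0 - blend) * bg_rgb
    depth = np.where(fg_visible, fg_depth, bg_depth)
    return rgb, depth
```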

Method

We train our 3D Gaussian Splatting (3DGS) and NeRF-based models on synchronized data collected from vehicles equipped with RGB cameras, GNSS devices, and LiDAR sensors. The reconstructed environment allows dynamic agents to be placed at arbitrary locations, environmental conditions to be adjusted, and the scene to be rendered from novel camera viewpoints.

Our novel training method, NeRF2GS, significantly improves novel view synthesis quality, particularly for road surfaces and lane markings, while maintaining interactive frame rates, which makes it well suited to autonomous driving tasks. NeRF2GS combines the stronger generalization of NeRF-based methods with the real-time rendering speed of 3DGS: we first train a customized NeRF model on the original images, regularized with depth from a noisy LiDAR point cloud, and then use it as a teacher for 3DGS training, providing accurate depth, surface normal, and appearance supervision.

In addition, our method supports multiple sensor modalities (LiDAR, radar target lists) and different camera models (e.g., fisheye), and it accounts for camera exposure mismatches. It can also predict segmentation masks, surface normals, and depth maps, even for large-scale reconstructions (>100,000 m²), thanks to our block-based training parallelization approach.
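As a rough illustration of the teacher-student idea behind NeRF2GS, the snippet below sketches a distillation objective in which a frozen NeRF teacher supervises the 3DGS student with per-pixel appearance, depth, and surface-normal targets. The tensor layout, loss weights, and function name are assumptions made for illustration, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def nerf2gs_distillation_loss(student_out, teacher_out,
                              w_rgb=1.0, w_depth=0.5, w_normal=0.5):
    """Toy teacher-student objective: a 3DGS student rendered at a training
    view is supervised by a frozen NeRF teacher's appearance, depth, and
    surface normals. Both dicts hold per-pixel tensors:
    'rgb' (H, W, 3), 'depth' (H, W), 'normal' (H, W, 3).
    Loss weights are illustrative placeholders.
    """
    rgb_loss = F.l1_loss(student_out["rgb"], teacher_out["rgb"])
    depth_loss = F.l1_loss(student_out["depth"], teacher_out["depth"])
    # Normals are compared via angular agreement (1 - cosine similarity).
    cos = F.cosine_similarity(student_out["normal"], teacher_out["normal"], dim=-1)
    normal_loss = (1.0 - cos).mean()
    return w_rgb * rgb_loss + w_depth * depth_loss + w_normal * normal_loss
```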

Qualitative results

Our method works in different operational design domains. Urban environment (San Francisco, CA).

Rotating LiDAR sensor simulation within aiSim supported by our hybrid rendering method. Colors indicate LiDAR intensity.

Neural radar target reconstruction where the static environment is reconstructed using NeRF, and another neural network predicts the radar target list (colors indicate distance from the ego vehicle).

Novel view synthesis with 3DGS on a proving ground (ZalaZone, Hungary).

Novel view synthesis and dynamic object removal with NeRF on a 2 km-long highway section (M0, Hungary).

NeRF novel view synthesis from Waymo Open Dataset with camera model change in extreme conditions (top row: RGB/normal, bottom row: depth/segmentation).

NeRF novel view synthesis from Waymo Open Dataset with camera model change in urban environments (top row: RGB/normal, bottom row: depth/segmentation).

Novel view synthesis from an extreme viewpoint with 3DGS (top row: RGB/depth, bottom row: normal/segmentation by a pretrained Mask2Former overlaid on RGB from 3DGS). The area of the reconstruction is about 165,000 m².

Our hybrid rendering approach can also be applied to public datasets like Waymo.

Domain gap measurement

Detections from a publicly available monocular model¹ on scenarios generated using our method. The reconstruction model rendered the static environment, while the mesh-based rendering engine introduced the dynamic vehicles. As shown, the model successfully detects vehicles from both rendering methods, suggesting that no significant domain gap is introduced. Distant objects are not recognized due to the detector's limited range (< 50 m).
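One way to quantify the "no significant domain gap" observation is to compare detection recall between vehicles produced by the two renderers. The sketch below assumes a hypothetical evaluation setup in which every placed vehicle has a ground-truth 2D box tagged with its rendering source ("neural" or "mesh") and the off-the-shelf detector returns 2D boxes; the data format and the IoU threshold are illustrative, not part of the actual evaluation protocol.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / max(area_a + area_b - inter, 1e-9)

def recall_by_render_source(gt_boxes, detections, iou_thr=0.5):
    """Per-source detection recall. `gt_boxes` is a list of
    (box, source) pairs with source in {"neural", "mesh"};
    `detections` is the list of predicted boxes for the same frame."""
    hits = {"neural": 0, "mesh": 0}
    totals = {"neural": 0, "mesh": 0}
    for box, source in gt_boxes:
        totals[source] += 1
        if any(iou_2d(box, det) >= iou_thr for det in detections):
            hits[source] += 1
    return {s: hits[s] / totals[s] if totals[s] else float("nan") for s in totals}
```

Similar recall for both sources would indicate that the detector treats neural-rendered and mesh-rendered vehicles alike.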

Comparison of segmentation results between our learned model and the output of a public model², using both on-trajectory and extreme novel view renderings (with a 3 m horizontal camera shift). Green, yellow, and blue indicate 'car' segmentations detected by both models, only by the public model, and only by our model, respectively. Most yellow 'errors' stem from the public model predicting overly dilated object boundaries, while blue regions appear where cars are partially occluded or too distant for the public model to detect.
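The public model² is available through Hugging Face Transformers, so a minimal sketch of how its 'car' mask could be extracted and compared against our model's mask is shown below. The agreement categories mirror the green/yellow/blue color coding of the figure; the class-id lookup by label name is a convenience and should be verified against the checkpoint's id2label mapping.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

CKPT = "facebook/mask2former-swin-large-mapillary-vistas-semantic"
processor = AutoImageProcessor.from_pretrained(CKPT)
model = Mask2FormerForUniversalSegmentation.from_pretrained(CKPT).eval()

def car_mask_from_public_model(image: Image.Image) -> np.ndarray:
    """Boolean per-pixel 'car' mask from the public Mask2Former checkpoint."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    seg = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]  # (height, width)
    )[0]
    # Find the class id(s) labeled as "car"; exact label strings vary by
    # checkpoint, so inspect model.config.id2label if this lookup misses.
    car_ids = [
        int(i) for i, name in model.config.id2label.items()
        if name.lower() == "car" or name.lower().endswith("--car")
    ]
    return np.isin(seg.cpu().numpy(), car_ids)

def agreement_counts(ours: np.ndarray, public: np.ndarray) -> dict:
    """Pixel counts for the three comparison categories used in the figure:
    both models ('green'), public only ('yellow'), ours only ('blue')."""
    return {
        "both": int(np.sum(ours & public)),
        "public_only": int(np.sum(~ours & public)),
        "ours_only": int(np.sum(ours & ~public)),
    }
```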

Interested in more details?

You can read the arXiv paper, which provides detailed explanations, equations, methodology, supplementary material, and more:

Read more

¹ https://github.com/abhi1kumar/DEVIANT/

² Mask2Former checkpoint "facebook/mask2former-swin-large-mapillary-vistas-semantic": https://huggingface.co/facebook/mask2former-swin-large-mapillary-vistas-semantic