Introduction
This article summarizes and explains in depth the paper by Thomas Roddick and Roberto Cipolla, ‘Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks’ [roddick2020predicting]. The authors present a novel approach for creating an orthographic top-down map of the environment from a single monocular image or a series of them. The approach combines a wide variety of techniques in a single end-to-end trainable machine learning model. Basic probabilistic techniques then make it possible to build a 360° top-down map by combining the predictions from a ring of monocular cameras. The map itself contains rich semantic information about the different types of objects in the scene.
Semantic Top Down Map
The resulting top-down environment map is a semantic occupancy grid map, which represents which objects take up which space in the scene. For a more in-depth explanation, let’s untangle the term semantic occupancy grid map, starting with the latter part: occupancy grid map.
Occupancy Grid Maps
To generate a top-down occupancy grid map of the real world, we rasterize a two-dimensional slice of the world into a grid \(m\). Each cell \(m_i\) of the grid is either free (\(0\)) or occupied (\(1\)). We cannot observe the true state of the world directly, because we rely on sensors to perceive our environment, and this introduces a significant amount of uncertainty. We therefore treat each cell as a probability \(p(m_i) \in [0, 1]\) of being occupied.
Semantic Occupancy Grid Maps
A semantic occupancy grid map extends the definition of a regular occupancy grid map with semantic information. The grid now has an additional dimension with one channel for each class we want to encode in the map. For example, if we want to differentiate between carpets, tables and chairs, we generate one occupancy grid map per class and stack the three maps on top of each other. We now have a tensor of size \(H\times W\times 3\), with \(H\) and \(W\) being the desired height and width of the occupancy map. Note that the classes are not mutually exclusive: with this representation we can still encode a chair standing on a carpet.
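The stacked representation can be sketched in a few lines of NumPy. The grid size, the class names and the probability values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

H, W = 4, 6                                  # grid height and width in cells (illustrative)
CLASSES = ["carpet", "table", "chair"]       # hypothetical class list

# One occupancy probability per cell and class, initialised to "free".
grid = np.zeros((H, W, len(CLASSES)), dtype=np.float32)

# A carpet covering a 3x4 patch of the floor.
grid[0:3, 1:5, CLASSES.index("carpet")] = 0.9
# A chair standing on that carpet occupies one of the same cells.
grid[1, 2, CLASSES.index("chair")] = 0.8

# Classes are not mutually exclusive: one cell can be occupied by several.
cell = grid[1, 2]
occupied = [name for name, p in zip(CLASSES, cell) if p > 0.5]
print(grid.shape, occupied)
```

Because every class has its own channel, marking the chair does not overwrite the carpet underneath it.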
Bayesian Filtering
To fuse multiple measurements (occupancy maps) taken over a period of time into a single, larger occupancy map, we can use Bayesian filtering, a popular technique in robotics.
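A minimal sketch of such a fusion step, using the standard log-odds form of the binary Bayes filter. It assumes a static scene, a uniform prior and grids that are already aligned to a common frame; the function names are hypothetical, and the paper's exact implementation may differ.

```python
import numpy as np

def logit(p):
    """Convert probabilities to log-odds."""
    return np.log(p / (1.0 - p))

def fuse(observations, prior=0.5):
    """Fuse per-frame occupancy probabilities with a binary Bayes filter.

    observations: array of shape (T, H, W) with values in (0, 1).
    Assumes a static scene and already-aligned grids (simplifications).
    """
    observations = np.clip(observations, 1e-6, 1.0 - 1e-6)
    # Each observation adds its evidence relative to the prior.
    log_odds = logit(prior) + np.sum(logit(observations) - logit(prior), axis=0)
    return 1.0 / (1.0 + np.exp(-log_odds))

frames = np.array([
    [[0.7, 0.5, 0.2]],   # frame 1: occupancy estimates for three cells
    [[0.7, 0.5, 0.4]],   # frame 2: the same three cells seen again
])
fused = fuse(frames)
print(fused)
```

Two agreeing observations of 0.7 reinforce each other into a higher fused probability, while a cell that is always observed at 0.5 carries no evidence and stays at the prior.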
Predicting the Environment
As stated above, the goal of the paper is to generate a semantic occupancy grid map of the environment from a monocular image.
Traditional Approach
Let’s first look at earlier ways of generating a semantic occupancy grid map from a monocular image.
- Inverse Perspective Mapping (IPM) [bruls2019right] - A simple image transformation from the perspective view into a top-down view.
- RGB-D - Special cameras that are able to perceive depth.
- VED [Lu_2019] - An encoder-decoder network that extracts a semantic representation directly from the input image.
- VPN [Pan_2020] - A framework for generating a semantic top-down view from an array of camera images.
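To make the first item in the list above more concrete, here is a minimal sketch of the geometry behind inverse perspective mapping. It assumes a pinhole camera that looks parallel to a perfectly flat ground plane; the function name and all parameter values are hypothetical, and the cited method is considerably more elaborate.

```python
def pixel_to_ground(u, v, focal_px, cx, cy, cam_height_m):
    """Map an image pixel to a point on a flat ground plane.

    Assumes a pinhole camera with its optical axis parallel to the ground
    and mounted cam_height_m above it. Returns (x, z) in metres, where x is
    the lateral offset and z the distance ahead, or None for pixels at or
    above the horizon, which never hit the ground.
    """
    if v <= cy:
        return None
    z = focal_px * cam_height_m / (v - cy)   # depth from the image row
    x = (u - cx) * z / focal_px              # lateral offset from the column
    return x, z

# A pixel 100 rows below the principal point, 100 columns to the right.
print(pixel_to_ground(420, 340, focal_px=500.0, cx=320.0, cy=240.0, cam_height_m=1.5))
```

The flat-ground assumption is the main weakness of this kind of mapping: anything that rises above the road, such as a car or a pedestrian, is smeared along the viewing ray.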
Structure of the Network
The architecture proposed in the paper [roddick2020predicting] consists of four stages. The first stage uses a ResNet-50 feature extractor network to create a semantically rich representation of the input image. Unfortunately, the deepest and semantically richest feature map of this network also has the lowest resolution, which does not suffice for generating an orthographic top-down map. To compensate for the low resolution, the second stage is a feature pyramid network [lin2017feature], which upscales the low-resolution, semantically rich representation in a top-down pass with the help of intermediate layers of the ResNet.
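The top-down pass of a feature pyramid can be sketched as follows. This is a simplified illustration in NumPy: a real feature pyramid network additionally applies learned 1x1 lateral convolutions and 3x3 smoothing convolutions, which are omitted here, and the function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling of a (C, H, W) feature map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_pass(backbone_maps):
    """Merge backbone feature maps from coarse to fine.

    backbone_maps: list of (C, H, W) arrays ordered fine to coarse, each with
    half the resolution of the previous one and the same channel count C.
    """
    merged = [backbone_maps[-1]]                 # start at the coarsest map
    for lateral in reversed(backbone_maps[:-1]):
        # Add upsampled coarse semantics to the finer backbone map.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]                          # fine to coarse again

maps = [np.ones((8, 16, 16)), np.ones((8, 8, 8)), np.ones((8, 4, 4))]
pyramid = top_down_pass(maps)
print([p.shape for p in pyramid])
```

Each output level keeps the resolution of its backbone map but now also carries information from all coarser, semantically richer levels.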
The third stage of the network transforms the outputs of the feature pyramid (the scaled feature maps) into the desired orthographic top-down perspective. This is necessary because the ResNet feature maps are still in the original perspective of the camera. Unfortunately, we can’t just transform a single ‘big’ feature map into the orthographic view, because areas far away from the camera are covered at a far lower resolution than areas near the camera. This would result in undersampling in areas near the camera and oversampling in areas far away from the camera. To generate the best possible top-down feature map, we subdivide the desired output feature map along the depth axis. The transformer then maps areas near the camera to the low-resolution feature maps generated by the feature pyramid, and areas far from the camera to the high-resolution ones.
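The subdivision along the depth axis can be illustrated with a simple geometric heuristic. A grid cell of width cell_size_m at depth z projects to roughly focal_px * cell_size_m / (scale * z) columns of a feature map downsampled by scale, so a level is treated as sufficient up to the depth where this drops to one column. This rule, the function name and all numbers are assumptions made for illustration; the thresholds used in the paper may be chosen differently.

```python
def assign_depth_bands(focal_px, cell_size_m, scales, z_min, z_max):
    """Split the depth range into bands, one per pyramid level.

    Returns a list of (scale, z_near, z_far) tuples, nearest band first.
    Coarse levels (large scale) are assigned to the bands near the camera,
    and the finest level covers whatever range remains up to z_max.
    """
    ordered = sorted(scales, reverse=True)       # coarsest pyramid level first
    bands = []
    z_near = z_min
    for i, scale in enumerate(ordered):
        is_finest = i == len(ordered) - 1
        limit = focal_px * cell_size_m / scale   # depth where one cell spans one column
        z_far = z_max if is_finest else min(limit, z_max)
        if z_far > z_near:
            bands.append((scale, z_near, z_far))
            z_near = z_far
    return bands

bands = assign_depth_bands(focal_px=640.0, cell_size_m=0.5,
                           scales=[8, 16, 32, 64], z_min=1.0, z_max=50.0)
for scale, z_near, z_far in bands:
    print(f"1/{scale} resolution: {z_near} m to {z_far} m")
```

With these illustrative numbers, each halving of the downsampling factor doubles the depth that a level covers, which mirrors the idea of assigning the lowest-resolution feature maps to the nearest areas.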
In the fourth stage, the transformed feature map is processed by a simple feed-forward network, the top-down network. It generates the final semantic occupancy grid map from the transformed top-down feature map.
Loss
The loss function is the sum of two terms. The first term is a binary cross entropy loss, which encourages detection of the correct class in the semantic occupancy map. Misclassifications of objects that are usually small weigh less than those of big objects, because small objects take up fewer cells on the map. This is of course not the desired behavior, because pedestrians are at least as important as cars. To compensate for this, we multiply the loss of each class by a class-specific scalar weight.
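A class-weighted binary cross entropy can be sketched as follows. The weights in the example are made up for illustration; how the paper derives its class weights is not covered here.

```python
import numpy as np

def weighted_bce(pred, target, class_weights, eps=1e-7):
    """Class-weighted binary cross entropy over an (H, W, C) grid.

    pred:          predicted occupancy probabilities, shape (H, W, C)
    target:        ground-truth occupancy (0 or 1), shape (H, W, C)
    class_weights: one scalar per class, shape (C,), broadcast over all cells
    """
    pred = np.clip(pred, eps, 1.0 - eps)         # avoid log(0)
    per_cell = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(per_cell * class_weights))

pred = np.full((2, 2, 2), 0.5)                   # an uninformative prediction
target = np.zeros((2, 2, 2))
weights = np.array([1.0, 3.0])                   # hypothetical: [car, pedestrian]
loss = weighted_bce(pred, target, weights)
print(loss)
```

Raising the weight of a class that covers few cells makes an error on one of its cells count as much as errors on several cells of a larger class.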
Artificial neural networks typically tend to output false positives, and we cannot make meaningful predictions for areas in which the camera cannot make any measurements. We therefore add a maximum entropy loss for the cells the camera can’t see. This drives the affected values towards \(0.5\), i.e. the network expresses that it is unsure about these cells.
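One way to express such a term is the negative binary entropy of the predictions, averaged over the unseen cells. This is a sketch under that assumption; the exact formulation and weighting used in the paper may differ, and the function name is hypothetical.

```python
import numpy as np

def uncertainty_loss(pred, unseen_mask, eps=1e-7):
    """Negative binary entropy (in bits) averaged over unseen cells.

    pred:        predicted occupancy probabilities, any shape
    unseen_mask: boolean array of the same shape, True where the camera
                 cannot observe the cell
    The loss is lowest (-1) when every unseen prediction equals 0.5.
    """
    p = np.clip(pred[unseen_mask], eps, 1.0 - eps)
    if p.size == 0:
        return 0.0                               # nothing is hidden from the camera
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return float(-np.mean(entropy))

pred = np.array([[0.5, 0.9],
                 [0.1, 0.5]])
unsure_cells = np.array([[True, False], [False, True]])
confident_cells = np.array([[False, True], [True, False]])
print(uncertainty_loss(pred, unsure_cells), uncertainty_loss(pred, confident_cells))
```

Minimizing this term rewards predictions near 0.5 in the hidden cells and penalizes confident guesses there, while the visible cells are left to the cross entropy term.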
Data
The authors used two datasets for training and evaluating the proposed neural network architecture: the Argoverse 3D dataset [chang2019argoverse] and the more challenging nuScenes dataset [caesar2020nuscenes]. Each of these datasets contains videos of real-world driving situations, annotated with 3D bounding boxes. The datasets were not originally designed for occupancy map generation; they are primarily used for 3D object detection. In order to train this model on them, the annotated classes first have to be transformed into a top-down view.
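The core of such a transformation is rasterizing the ground-plane footprint of each annotated box into the grid of its class. The sketch below handles a single rotated box; the function name, the coordinate conventions and the values are assumptions made for illustration and do not reproduce the authors' preprocessing pipeline.

```python
import numpy as np

def rasterize_box(grid_shape, resolution, center, size, yaw):
    """Mark the grid cells whose centre lies inside a rotated box footprint.

    grid_shape: (rows, cols) of the top-down grid
    resolution: cell size in metres
    center:     (x, z) of the box centre in metres
    size:       (length, width) of the box in metres
    yaw:        heading of the box in radians
    """
    rows, cols = grid_shape
    zs, xs = np.meshgrid((np.arange(rows) + 0.5) * resolution,
                         (np.arange(cols) + 0.5) * resolution, indexing="ij")
    dx, dz = xs - center[0], zs - center[1]
    # Rotate the cell centres into the coordinate frame of the box.
    along = np.cos(yaw) * dx + np.sin(yaw) * dz
    across = -np.sin(yaw) * dx + np.cos(yaw) * dz
    return (np.abs(along) <= size[0] / 2) & (np.abs(across) <= size[1] / 2)

mask = rasterize_box((10, 10), 1.0, center=(5.0, 5.0), size=(4.0, 2.0), yaw=0.0)
print(mask.astype(int))
```

Repeating this for every annotated box and writing each mask into the channel of its class yields a ground-truth semantic occupancy grid for one frame.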
Evaluation
The metric used for evaluation is the intersection over union (IoU) percentage score.
CS Mean refers to the mean IoU over all classes that also exist in the Cityscapes dataset; these classes are marked with an asterisk.
Objects that are not visible to the camera are ignored during evaluation.
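The per-class score can be sketched as follows, including the rule that cells hidden from the camera are ignored. The decision threshold of 0.5 and the function name are assumptions made for this example.

```python
import numpy as np

def iou(pred, target, visible=None, threshold=0.5):
    """Intersection over union for one class on a top-down grid.

    pred:    predicted occupancy probabilities
    target:  ground-truth occupancy (0 or 1)
    visible: optional boolean mask; cells set to False are ignored
    """
    pred_bin = pred >= threshold
    target_bin = target.astype(bool)
    if visible is None:
        visible = np.ones_like(target_bin)
    intersection = np.sum(pred_bin & target_bin & visible)
    union = np.sum((pred_bin | target_bin) & visible)
    return float(intersection) / float(union) if union else float("nan")

pred = np.array([[0.9, 0.8],
                 [0.2, 0.7]])
target = np.array([[1, 0],
                   [0, 1]])
visible = np.array([[True, False],
                    [True, True]])
print(iou(pred, target), iou(pred, target, visible))
```

In this toy example the only false positive lies in a cell the camera cannot see, so masking it raises the score from two thirds to 1.0.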
The authors also conducted an ablation study to confirm that each sub-network contributes to the evaluation accuracy.
The study shows that each module (Dense Transformer Layer, Transformer Pyramid and Top-Down Network) improves the mean accuracy by about 1%.
Only the dense transformer improves the detection of small objects by a significant amount.
Bibliography
[roddick2020predicting] Roddick & Cipolla, Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks, arXiv:2003.13402 [cs.CV] (2020).
[Nieto2008] Nieto, Arróspide, Salgado & Santos, Video-based Driver Assistance Systems (2008).
[bruls2019right] Bruls, Porav, Kunze & Newman, The Right (Angled) Perspective: Improving the Understanding of Road Scenes Using Boosted Inverse Perspective Mapping, arXiv:1812.00913 [cs.CV] (2019).
[Lu_2019] Lu, van de Molengraft & Dubbelman, Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder–Decoder Networks, IEEE Robotics and Automation Letters, 4(2), 445–452 (2019).
[Pan_2020] Pan, Sun, Leung, Andonian & Zhou, Cross-View Semantic Segmentation for Sensing Surroundings, IEEE Robotics and Automation Letters, 5(3), 4867–4873 (2020).
[lin2017feature] Lin, Dollár, Girshick, He, Hariharan & Belongie, Feature Pyramid Networks for Object Detection, arXiv:1612.03144 [cs.CV] (2017).
[chang2019argoverse] Chang, Lambert, Sangkloy, Singh, Bak, Hartnett, Wang, Carr, Lucey, Ramanan & Hays, Argoverse: 3D Tracking and Forecasting with Rich Maps, arXiv:1911.02620 [cs.CV] (2019).
[caesar2020nuscenes] Caesar, Bankiti, Lang, Vora, Liong, Xu, Krishnan, Pan, Baldan & Beijbom, nuScenes: A multimodal dataset for autonomous driving, arXiv:1903.11027 [cs.LG] (2020).