
Summary: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks

Paper by Thomas Roddick and Roberto Cipolla

 ·  ☕ 7 min read  ·  ✍️ [Lucas Sas Brunschier]

Introduction

This article summarizes and gives an in-depth explanation of the paper by Thomas Roddick and Roberto Cipolla, ‘Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks’ roddick2020predicting. The authors present a novel approach for creating an orthographic top-down map of the environment from a single monocular image or a series of them. The approach combines a wide variety of techniques into a single end-to-end trainable machine learning model. Basic stochastic techniques then allow us to acquire a 360° top-down map by evaluating and combining the outputs of an array of circularly arranged monocular cameras. The map itself contains rich semantic information about the different types of objects in the scene.

Semantic Top Down Map

The resulting top-down environment map is a semantic occupancy grid map, representing which objects take up which space in the scene. For a more in-depth explanation, let's untangle the term semantic occupancy grid map and first focus on the latter part, the occupancy grid map.

Occupancy Grid Maps

To generate a top-down occupancy grid map of the real world, we rasterize a two-dimensional slice of the world into a grid. In the ideal case, each cell \(m_i\) of the grid \(m\) is either \(0\) (free) or \(1\) (occupied). However, we can't observe the true state of the real world: we rely on sensors to perceive our environment, which introduces a significant amount of uncertainty. We therefore treat each cell as a probability \(p(m_i) \in [0, 1]\), representing how likely the cell is to be occupied.
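To make this concrete, here is a minimal sketch of such a probabilistic grid in NumPy; the grid size, resolution and cell values are made up for illustration:

```python
import numpy as np

# A 20 m x 20 m slice of the world at 0.25 m resolution (illustrative values).
# Each cell holds the probability of being occupied; 0.5 means "unknown".
resolution = 0.25
size = int(20 / resolution)
grid = np.full((size, size), 0.5)

grid[40:44, 40:44] = 0.9   # a region the sensor believes is occupied
grid[10:30, 10:30] = 0.1   # a region the sensor believes is free
```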

Semantic Occupancy Grid Maps

A semantic occupancy grid map extends the definition of a regular occupancy grid map with semantic information. The grid now has multiple channels, one for each class we want to encode in the map. For example, if we want to differentiate between carpets, tables and chairs, we generate a separate occupancy grid map for each class and stack them on top of each other. We then have a tensor of size \(H\times W\times 3\), with \(H\) and \(W\) being the desired height and width of the occupancy map. Note that the classes are not exclusive: with this representation we can still encode a chair standing on a carpet.
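As a small illustration (not taken from the paper; the class choice and grid size are arbitrary), the per-class grids are simply stacked along a third axis:

```python
import numpy as np

H, W = 200, 200  # desired height and width of the occupancy map

# One occupancy grid per class.
carpet = np.zeros((H, W))
table = np.zeros((H, W))
chair = np.zeros((H, W))

# A chair standing on a carpet: both channels are occupied in the same
# cells, which is possible because the classes are not exclusive.
carpet[80:120, 80:120] = 1.0
chair[95:105, 95:105] = 1.0

semantic_map = np.stack([carpet, table, chair], axis=-1)  # shape (H, W, 3)
```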

Bayesian Filtering

To fuse multiple measurements (occupancy maps) taken over a time range into a single, larger occupancy map, we can use Bayesian filtering, a popular technique in robotics.
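The article does not spell out the update equations, but a common formulation works in log-odds space, where fusing a new measurement reduces to an addition. A minimal sketch, assuming each measurement is itself a per-cell occupancy probability map:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def bayes_update(map_logodds, measurement_prob):
    """Fuse one new per-cell occupancy measurement into the running map."""
    return map_logodds + logit(measurement_prob)

H, W = 200, 200
map_logodds = np.zeros((H, W))  # prior p = 0.5 everywhere (log-odds 0)

# Stand-in for a sequence of five predicted occupancy maps.
rng = np.random.default_rng(0)
measurements = rng.uniform(0.3, 0.7, size=(5, H, W))

for measurement in measurements:
    map_logodds = bayes_update(map_logodds, measurement)

map_prob = 1.0 / (1.0 + np.exp(-map_logodds))  # back to probabilities
```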

Predicting the Environment

As stated above, the goal of this paper is to generate a semantic occupancy grid map of the environment from a monocular image.

Traditional Approach

Figure 1: Inverse perspective mapping (IPM) transforming a monocular perspective image into an orthographic top-down perspective. Nieto2008


Let's first look at previous approaches for generating a semantic occupancy grid map from a monocular image:

  • Inverse Perspective Mapping (IPM) bruls2019right - a simple image transformation from the perspective view into the top-down view (a rough sketch of this warp follows the list).
  • RGB-D - special cameras that can also perceive depth directly.
  • VED Lu_2019 - an encoder-decoder network that extracts a semantic top-down representation directly from the input image.
  • VPN Pan_2020 - a framework for generating a semantic top-down view from an array of camera images.
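To give a feel for the IPM baseline, here is a rough sketch using OpenCV. The four point correspondences between the road plane in the image and the top-down map are invented; in practice they follow from the camera calibration:

```python
import cv2
import numpy as np

# Hypothetical correspondences: four points on the road plane in the
# camera image (src) and their locations in the top-down map (dst), in pixels.
src = np.float32([[520, 700], [760, 700], [1100, 1000], [180, 1000]])
dst = np.float32([[200, 0], [400, 0], [400, 600], [200, 600]])

H = cv2.getPerspectiveTransform(src, dst)

image = cv2.imread("frame.png")                       # perspective camera image
top_down = cv2.warpPerspective(image, H, (600, 600))  # bird's-eye view
```

Because the homography assumes a flat world, anything that sticks out of the ground plane, such as cars or pedestrians, is heavily distorted, which is one of the shortcomings the paper's approach avoids.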

Structure of the Network

Figure 2: The network architecture proposed by the paper.


The architecture proposed by the paper roddick2020predicting consists of four individual stages. The first stage uses a ResNet-50 feature extractor to create a semantically rich representation of the input image. Unfortunately, the deepest and semantically richest feature map of this network also has the lowest resolution, which does not suffice for generating an orthographic top-down map. To compensate for this, a feature pyramid network lin2017feature upscales the low-resolution, semantically rich representation in a top-down pass, using the intermediate layers of the ResNet as lateral connections.
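The following PyTorch sketch shows the general idea of such a top-down pathway with lateral connections on a ResNet-50. The channel counts are the standard ResNet-50 ones; everything else (output channels, a recent torchvision API, omission of the smoothing convolutions) is illustrative and not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class TinyFPN(nn.Module):
    """Minimal FPN-style top-down pathway on a ResNet-50 backbone."""

    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        # 1x1 lateral convolutions bring every stage to a common channel count.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in (256, 512, 1024, 2048)]
        )

    def forward(self, x):
        c1 = self.layer1(self.stem(x))   # 1/4 resolution, semantically shallow
        c2 = self.layer2(c1)             # 1/8
        c3 = self.layer3(c2)             # 1/16
        c4 = self.layer4(c3)             # 1/32, semantically richest
        # Top-down pass: upsample the coarse map and add the lateral features.
        p4 = self.lateral[3](c4)
        p3 = self.lateral[2](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[1](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        return [p1, p2, p3, p4]          # multi-scale, semantically rich feature maps

features = TinyFPN()(torch.randn(1, 3, 224, 224))
```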

The third stage transforms the outputs of the feature pyramid (the scaled feature maps) into the desired orthographic top-down perspective. This is necessary because the ResNet feature maps are still in the original perspective of the camera. Unfortunately, we can't simply transform a single 'big' feature map into the orthographic view, because regions far away from the camera cover far fewer image pixels than regions near the camera. This would result in undersampling in areas near the camera and oversampling in areas far away from it. To generate the best possible top-down feature map, the desired output feature map is subdivided along the depth axis. The transformer then maps areas near the camera to the low-resolution feature maps generated by the feature pyramid, and areas far away to the high-resolution ones.
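Conceptually, this amounts to assigning each depth interval of the output grid to one pyramid scale. The intervals and scale assignments below are made up purely for illustration; the paper derives them from the camera geometry:

```python
# Near intervals are produced from coarse feature maps, far intervals
# from fine ones (all values are illustrative only).
depth_intervals = [
    (0.0, 5.0, "1/32-scale features"),    # closest to the camera -> coarsest map
    (5.0, 12.0, "1/16-scale features"),
    (12.0, 25.0, "1/8-scale features"),
    (25.0, 50.0, "1/4-scale features"),   # farthest -> finest map
]

def pyramid_scale_for_depth(z):
    """Return which pyramid scale is responsible for a given depth z (in metres)."""
    for z_min, z_max, scale in depth_intervals:
        if z_min <= z < z_max:
            return scale
    raise ValueError("depth outside the mapped range")
```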

The transformed feature map is then processed by a simple feed-forward network, which generates the final semantic occupancy grid map from the transformed top-down feature map.
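A stand-in for this last stage could look as follows; the channel counts, depth and number of classes are placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

num_classes = 14  # placeholder

topdown_head = nn.Sequential(
    nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, num_classes, 1),
)

bev_features = torch.randn(1, 256, 200, 200)       # output of the transformer stage
occupancy_logits = topdown_head(bev_features)       # (1, num_classes, 200, 200)
occupancy_probs = torch.sigmoid(occupancy_logits)   # per-class occupancy probabilities
```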

Loss

As a loss function, the authors use the sum of two terms. The first is a binary cross-entropy loss that encourages the detection of the correct class in each cell of the semantic occupancy map. Misclassifications of objects that are usually small weigh less than those of big objects, because they occupy fewer cells in the map. This is of course not desired behaviour, as pedestrians are at least as important as cars. To compensate for this, the loss of each class is weighted with a constant scalar.
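A sketch of such a class-weighted binary cross-entropy; the weights are invented, and the paper's actual balancing scheme may differ:

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(logits, targets, class_weights):
    """Binary cross-entropy where every class is scaled by its own weight.

    logits, targets: tensors of shape (B, C, H, W)
    class_weights:   tensor of shape (C,), e.g. larger for pedestrians
    """
    per_cell = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (per_cell * class_weights.view(1, -1, 1, 1)).mean()
```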

Artificial neural networks tend to produce false positives, and we can't make meaningful predictions for areas where the camera can't take any measurements. Consequently, a maximum-entropy loss is added for areas the camera can't see. This drives the affected values towards \(0.5\), i.e. the network expresses that it is unsure about these cells.
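One possible way to express this term (a sketch, not necessarily the paper's exact formulation) is to maximise the binary entropy of the predictions in all cells outside the camera's field of view:

```python
import torch

def max_entropy_loss(logits, visible_mask):
    """Push predictions towards p = 0.5 in cells the camera cannot observe.

    visible_mask: tensor with 1 where a cell is observable, 0 elsewhere.
    """
    p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())
    # Maximising entropy = minimising its negative, restricted to unseen cells.
    return -(entropy * (1 - visible_mask)).mean()
```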

Data

The authors used two different datasets for training and evaluating the proposed network architecture: the Argoverse 3D chang2019argoverse dataset and the more challenging nuScenes caesar2020nuscenes dataset. Both datasets contain videos of real-world driving situations, annotated with 3D bounding boxes. They were not originally designed for occupancy map generation but are primarily used for 3D object detection. So, in order to use them for training this model, the annotations have to be transformed into a top-down view of the annotated classes.
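As an illustration of that preprocessing step (coordinates, resolution and map extent are invented), the ground-plane footprint of one annotated 3D box can be rasterised into a binary top-down label like this:

```python
import cv2
import numpy as np

# Illustrative only: rasterise the ground-plane footprint of one annotated
# 3D box into a binary top-down label. Resolution, map extent and the box
# corners below are made up.
resolution = 0.25                    # metres per grid cell
extent = 50.0                        # map covers 0-50 m ahead and +/-25 m sideways
H = W = int(extent / resolution)     # 200 x 200 cells

# Four ground-plane corners (x lateral, z forward) of a "car" box in metres.
corners_xz = np.array([[3.0, 10.0], [5.0, 10.0], [5.0, 14.5], [3.0, 14.5]])

# Metric coordinates -> grid indices, then fill the polygon.
pixels = ((corners_xz + np.array([extent / 2, 0.0])) / resolution).astype(np.int32)
car_mask = np.zeros((H, W), dtype=np.uint8)
cv2.fillPoly(car_mask, [pixels], 1)  # binary top-down label for the car class
```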

Figure 3: Annotated frame from the Argoverse dataset.


Evaluation

Figure 4: The model proposed by Thomas Roddick and Roberto Cipolla improves on other state-of-the-art approaches by about 9%.


The metric used for evaluation is the intersection-over-union (IoU) score. CS Mean refers to the mean IoU over all classes that also exist in the Cityscapes dataset; these classes are marked with an asterisk. Objects that are not visible to the camera are ignored during evaluation. The authors also performed an ablation study to confirm that the subnetworks each provide a benefit in terms of evaluation accuracy. The study shows that each module (dense transformer layer, transformer pyramid and top-down network) improves the mean accuracy by about 1%, and only the dense transformer improves the detection of small objects by a significant amount.
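For completeness, a per-class IoU that ignores cells invisible to the camera (as the authors do) might be computed like this; the array shapes and masking convention are assumptions:

```python
import numpy as np

def iou(prediction, target, visible_mask):
    """IoU for one class over boolean (H, W) arrays, ignoring unseen cells."""
    prediction = prediction & visible_mask
    target = target & visible_mask
    union = (prediction | target).sum()
    if union == 0:
        return float("nan")
    return (prediction & target).sum() / union
```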

Bibliography

[roddick2020predicting] Thomas Roddick and Roberto Cipolla, "Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks", arXiv:2003.13402, 2020.

[Nieto2008] Nieto, Arróspide, Salgado and Santos, "Video-based Driver Assistance Systems", 2008.

[bruls2019right] Tom Bruls, Horia Porav, Lars Kunze and Paul Newman, "The Right (Angled) Perspective: Improving the Understanding of Road Scenes Using Boosted Inverse Perspective Mapping", arXiv:1812.00913, 2019.

[Lu_2019] Lu, van de Molengraft and Dubbelman, "Monocular Semantic Occupancy Grid Mapping With Convolutional Variational Encoder–Decoder Networks", IEEE Robotics and Automation Letters, 4(2), 445–452, 2019.

[Pan_2020] Pan, Sun, Leung, Andonian and Zhou, "Cross-View Semantic Segmentation for Sensing Surroundings", IEEE Robotics and Automation Letters, 5(3), 4867–4873, 2020.

[lin2017feature] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan and Serge Belongie, "Feature Pyramid Networks for Object Detection", arXiv:1612.03144, 2017.

[chang2019argoverse] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan and James Hays, "Argoverse: 3D Tracking and Forecasting with Rich Maps", arXiv:1911.02620, 2019.

[caesar2020nuscenes] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan and Oscar Beijbom, "nuScenes: A multimodal dataset for autonomous driving", arXiv:1903.11027, 2020.
