Mesh labeling is the key problem of classifying the facets of a 3D mesh with a label among a set of possible ones. State-of-the-art methods model mesh labeling as a Markov Random Field over the facets. These algorithms map image segmentations to the mesh by minimizing an energy function that comprises a data term, a smoothness terms, and class-specific priors. The latter favor a labeling with respect to another depending on the orientation of the facet normals. In this paper we propose a novel energy term that acts as a prior, but does not require any prior knowledge about the scene nor scene-specific relationship among classes. It bootstraps from a coarse mapping of the 2D segmentations on the mesh, and it favors the facets to be labeled according to the statistics of the mesh normals in their neighborhood. We tested our approach against five different datasets and, even if we do not inject prior knowledge, our method adapts to the data and overcomes the state-of-the-art.

* Accepted at 3DV2018
Click to Read Paper
Urban reconstruction from a video captured by a surveying vehicle constitutes a core module of automated mapping. When computational power represents a limited resource and, a detailed map is not the primary goal, the reconstruction can be performed incrementally, from a monocular video, carving a 3D Delaunay triangulation of sparse points; this allows online incremental mapping for tasks such as traversability analysis or obstacle avoidance. To exploit the sharp edges of urban landscape, we propose to use a Delaunay triangulation of Edge-Points, which are the 3D points corresponding to image edges. These points constrain the edges of the 3D Delaunay triangulation to real-world edges. Besides the use of the Edge-Points, a second contribution of this paper is the Inverse Cone Heuristic that preemptively avoids the creation of artifacts in the reconstructed manifold surface. We force the reconstruction of a manifold surface since it makes it possible to apply computer graphics or photometric refinement algorithms to the output mesh. We evaluated our approach on four real sequences of the public available KITTI dataset by comparing the incremental reconstruction against Velodyne measurements.

* Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on (IROS) 2015. http://hdl.handle.net/11311/972021
Click to Read Paper
As incremental Structure from Motion algorithms become effective, a good sparse point cloud representing the map of the scene becomes available frame-by-frame. From the 3D Delaunay triangulation of these points, state-of-the-art algorithms build a manifold rough model of the scene. These algorithms integrate incrementally new points to the 3D reconstruction only if their position estimate does not change. Indeed, whenever a point moves in a 3D Delaunay triangulation, for instance because its estimation gets refined, a set of tetrahedra have to be removed and replaced with new ones to maintain the Delaunay property; the management of the manifold reconstruction becomes thus complex and it entails a potentially big overhead. In this paper we investigate different approaches and we propose an efficient policy to deal with moving points in the manifold estimation process. We tested our approach with four sequences of the KITTI dataset and we show the effectiveness of our proposal in comparison with state-of-the-art approaches.

* Accepted in International Conference on Image Analysis and Processing (ICIAP 2015)
Click to Read Paper
This paper presents a novel method for the reconstruction of 3D edges in multi-view stereo scenarios. Previous research in the field typically relied on video sequences and limited the reconstruction process to either straight line-segments, or edge-points, i.e., 3D points that correspond to image edges. We instead propose a system, denoted as EdgeGraph3D, able to recover both straight and curved 3D edges from an unordered image sequence. A second contribution of this work is a graph-based representation for 2D edges that allows the identification of the most structurally significant edges detected in an image. We integrate EdgeGraph3D in a multi-view stereo reconstruction pipeline and analyze the benefits provided by 3D edges to the accuracy of the recovered surfaces. We evaluate the effectiveness of our approach on multiple datasets from two different collections in the multi-view stereo literature. Experimental results demonstrate the ability of EdgeGraph3D to work in presence of strong illumination changes and reflections, which are usually detrimental to the effectiveness of classical photometric reconstruction systems.

* Accepted for WACV 2018
Click to Read Paper
3D reconstruction is a core task in many applications such as robot navigation or sites inspections. Finding the best poses to capture part of the scene is one of the most challenging topic that goes under the name of Next Best View. Recently, many volumetric methods have been proposed; they choose the Next Best View by reasoning over a 3D voxelized space and by finding which pose minimizes the uncertainty decoded into the voxels. Such methods are effective, but they do not scale well since the underlaying representation requires a huge amount of memory. In this paper we propose a novel mesh-based approach which focuses on the worst reconstructed region of the environment mesh. We define a photo-consistent index to evaluate the 3D mesh accuracy, and an energy function over the worst regions of the mesh which takes into account the mutual parallax with respect to the previous cameras, the angle of incidence of the viewing ray to the surface and the visibility of the region. We test our approach over a well known dataset and achieve state-of-the-art results.

* 13 pages, 5 figures, to be published in IAS-15
Click to Read Paper
In the era of autonomous driving, urban mapping represents a core step to let vehicles interact with the urban context. Successful mapping algorithms have been proposed in the last decade building the map leveraging on data from a single sensor. The focus of the system presented in this paper is twofold: the joint estimation of a 3D map from lidar data and images, based on a 3D mesh, and its texturing. Indeed, even if most surveying vehicles for mapping are endowed by cameras and lidar, existing mapping algorithms usually rely on either images or lidar data; moreover both image-based and lidar-based systems often represent the map as a point cloud, while a continuous textured mesh representation would be useful for visualization and navigation purposes. In the proposed framework, we join the accuracy of the 3D lidar data, and the dense information and appearance carried by the images, in estimating a visibility consistent map upon the lidar measurements, and refining it photometrically through the acquired images. We evaluate the proposed framework against the KITTI dataset and we show the performance improvement with respect to two state of the art urban mapping algorithms, and two widely used surface reconstruction algorithms in Computer Graphics.

* accepted at iros 2017
Click to Read Paper
Detecting moving objects in dynamic scenes from sequences of lidar scans is an important task in object tracking, mapping, localization, and navigation. Many works focus on changes detection in previously observed scenes, while a very limited amount of literature addresses moving objects detection. The state-of-the-art method exploits Dempster-Shafer Theory to evaluate the occupancy of a lidar scan and to discriminate points belonging to the static scene from moving ones. In this paper we improve both speed and accuracy of this method by discretizing the occupancy representation, and by removing false positives through visual cues. Many false positives lying on the ground plane are also removed thanks to a novel ground plane removal algorithm. Efficiency is improved through an octree indexing strategy. Experimental evaluation against the KITTI public dataset shows the effectiveness of our approach, both qualitatively and quantitatively with respect to the state- of-the-art.

* 6 pages, to appear in IROS 2016
Click to Read Paper
Event-based cameras are neuromorphic sensors capable of efficiently encoding visual information in the form of sparse sequences of events. Being biologically inspired, they are commonly used to exploit some of the computational and power consumption benefits of biological vision. In this paper we focus on a specific feature of vision: visual attention. We propose two attentive models for event based vision. An algorithm that tracks events activity within the field of view to locate regions of interest and a fully-differentiable attention procedure based on DRAW neural model. We highlight the strengths and weaknesses of the proposed methods on two datasets, the Shifted N-MNIST and Shifted MNIST-DVS collections, using the Phased LSTM recognition network as a baseline reference model obtaining improvements in terms of both translation and scale invariance.

* WACV2019 Submission
Click to Read Paper
Event-based cameras are bioinspired sensors able to perceive changes in the scene at high frequency with a low power consumption. Becoming available only very recently, a limited amount of work addresses object detection on these devices. In this paper we propose two neural networks architectures for object detection: YOLE, which integrates the events into frames and uses a frame-based model to process them, eFCN, a event-based fully convolutional network that uses a novel and general formalization of the convolutional and max pooling layers to exploit the sparsity of the camera events. We evaluated the algorithm with different extension of publicly available dataset, and on a novel custom dataset.

* Submitted to BMVC 2018
Click to Read Paper
While 3D reconstruction is a well-established and widely explored research topic, semantic 3D reconstruction has only recently witnessed an increasing share of attention from the Computer Vision community. Semantic annotations allow in fact to enforce strong class-dependent priors, as planarity for ground and walls, which can be exploited to refine the reconstruction often resulting in non-trivial performance improvements. State-of-the art methods propose volumetric approaches to fuse RGB image data with semantic labels; even if successful, they do not scale well and fail to output high resolution meshes. In this paper we propose a novel method to refine both the geometry and the semantic labeling of a given mesh. We refine the mesh geometry by applying a variational method that optimizes a composite energy made of a state-of-the-art pairwise photo-metric term and a single-view term that models the semantic consistency between the labels of the 3D mesh and those of the segmented images. We also update the semantic labeling through a novel Markov Random Field (MRF) formulation that, together with the classical data and smoothness terms, takes into account class-specific priors estimated directly from the annotated mesh. This is in contrast to state-of-the-art methods that are typically based on handcrafted or learned priors. We are the first, jointly with the very recent and seminal work of [M. Blaha et al arXiv:1706.08336, 2017], to propose the use of semantics inside a mesh refinement framework. Differently from [M. Blaha et al arXiv:1706.08336, 2017], which adopts a more classical pairwise comparison to estimate the flow of the mesh, we apply a single-view comparison between the semantically annotated image and the current 3D mesh labels; this improves the robustness in case of noisy segmentations.

* {\pounds}D Reconstruction Meets Semantic, ICCV workshop
Click to Read Paper
In this paper we propose a new approach to incrementally initialize a manifold surface for automatic 3D reconstruction from images. More precisely we focus on the automatic initialization of a 3D mesh as close as possible to the final solution; indeed many approaches require a good initial solution for further refinement via multi-view stereo techniques. Our novel algorithm automatically estimates an initial manifold mesh for surface evolving multi-view stereo algorithms, where the manifold property needs to be enforced. It bootstraps from 3D points extracted via Structure from Motion, then iterates between a state-of-the-art manifold reconstruction step and a novel mesh sweeping algorithm that looks for new 3D points in the neighborhood of the reconstructed manifold to be added in the manifold reconstruction. The experimental results show quantitatively that the mesh sweeping improves the resolution and the accuracy of the manifold reconstruction, allowing a better convergence of state-of-the-art surface evolution multi-view stereo algorithms.

* in IEEE Winter Conference on Applications of Computer Vision (WACV) 2016
Click to Read Paper
We introduce ReConvNet, a recurrent convolutional architecture for semi-supervised video object segmentation that is able to fast adapt its features to focus on any specific object of interest at inference time. Generalization to new objects never observed during training is known to be a hard task for supervised approaches that would need to be retrained. To tackle this problem, we propose a more efficient solution that learns spatio-temporal features self-adapting to the object of interest via conditional affine transformations. This approach is simple, can be trained end-to-end and does not necessarily require extra training steps at inference time. Our method shows competitive results on DAVIS2016 with respect to state-of-the art approaches that use online fine-tuning, and outperforms them on DAVIS2017. ReConvNet shows also promising results on the DAVIS-Challenge 2018 winning the $10$-th position.

* CVPR Workshop - DAVIS Challenge 2018
Click to Read Paper
Recently, sparse representation based visual tracking methods have attracted increasing attention in the computer vision community. Although achieve superior performance to traditional tracking methods, however, a basic problem has not been answered yet --- that whether the sparsity constrain is really needed for visual tracking? To answer this question, in this paper, we first propose a robust non-sparse representation based tracker and then conduct extensive experiments to compare it against several state-of-the-art sparse representation based trackers. Our experiment results and analysis indicate that the proposed non-sparse tracker achieved competitive tracking accuracy with sparse trackers while having faster running speed, which support our non-sparse tracker to be used in practical applications.

* 12 pages, 3 figures
Click to Read Paper
In this paper, we propose a deep neural network architecture for object recognition based on recurrent neural networks. The proposed network, called ReNet, replaces the ubiquitous convolution+pooling layer of the deep convolutional neural network with four recurrent neural networks that sweep horizontally and vertically in both directions across the image. We evaluate the proposed ReNet on three widely-used benchmark datasets; MNIST, CIFAR-10 and SVHN. The result suggests that ReNet is a viable alternative to the deep convolutional neural network, and that further investigation is needed.

Click to Read Paper
We propose a structured prediction architecture, which exploits the local generic features extracted by Convolutional Neural Networks and the capacity of Recurrent Neural Networks (RNN) to retrieve distant dependencies. The proposed architecture, called ReSeg, is based on the recently introduced ReNet model for image classification. We modify and extend it to perform the more challenging task of semantic segmentation. Each ReNet layer is composed of four RNN that sweep the image horizontally and vertically in both directions, encoding patches or activations, and providing relevant global information. Moreover, ReNet layers are stacked on top of pre-trained convolutional layers, benefiting from generic local features. Upsampling layers follow ReNet layers to recover the original image resolution in the final predictions. The proposed ReSeg architecture is efficient, flexible and suitable for a variety of semantic segmentation tasks. We evaluate ReSeg on several widely-used semantic segmentation datasets: Weizmann Horse, Oxford Flower, and CamVid; achieving state-of-the-art performance. Results show that ReSeg can act as a suitable architecture for semantic segmentation tasks, and may have further applications in other structured prediction problems. The source code and model hyperparameters are available on https://github.com/fvisin/reseg.

* In CVPR Deep Vision Workshop, 2016
Click to Read Paper