Models, code, and papers for "Li Erran Li":
We present an approach with a novel differentiable flow-to-depth layer for video depth estimation. The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network. Given optical flow and camera pose, our flow-to-depth layer generates depth proposals and the corresponding confidence maps by explicitly solving an epipolar geometry optimization problem. Unlike other methods, our flow-to-depth layer is differentiable, and thus we can refine camera poses by maximizing the aggregated confidence in camera pose refinement module. Our depth fusion network can utilize depth proposals and their confidence maps inferred from different adjacent frames to produce the final depth map. Furthermore, the depth fusion network can additionally take the depth proposals generated by other methods to improve the results further. The experiments on three public datasets show that our approach outperforms state-of-the-art depth estimation methods, and has strong generalization capability: our model trained on KITTI performs well on the unseen Waymo dataset while other methods degenerate a lot.
An increasing amount of analytics is performed on data that is procured in a real-time fashion to make real-time decisions. Such tasks include simple reporting on streams to sophisticated model building. However, the practicality of such analyses are impeded in several domains because they are faced with a fundamental trade-off between data collection latency and analysis accuracy. In this paper, we study this trade-off in the context of a specific domain, Cellular Radio Access Networks (RAN). Our choice of this domain is influenced by its commonalities with several other domains that produce real-time data, our access to a large live dataset, and their real-time nature and dimensionality which makes it a natural fit for a popular analysis technique, machine learning (ML). We find that the latency accuracy trade-off can be resolved using two broad, general techniques: intelligent data grouping and task formulations that leverage domain characteristics. Based on this, we present CellScope, a system that addresses this challenge by applying a domain specific formulation and application of Multi-task Learning (MTL) to RAN performance analysis. It achieves this goal using three techniques: feature engineering to transform raw data into effective features, a PCA inspired similarity metric to group data from geographically nearby base stations sharing performance commonalities, and a hybrid online-offline model for efficient model updates. Our evaluation of CellScope shows that its accuracy improvements over direct application of ML range from 2.5x to 4.4x while reducing the model update overhead by up to 4.8x. We have also used CellScope to analyze a live LTE consisting of over 2 million subscribers for a period of over 10 months, where it uncovered several problems and insights, some of them previously unknown.
We present Uncertainty-aware Cascaded Stereo Network (UCS-Net) for 3D reconstruction from multiple RGB images. Multi-view stereo (MVS) aims to reconstruct fine-grained scene geometry from multi-view images. Previous learning-based MVS methods estimate per-view depth using plane sweep volumes with a fixed depth hypothesis at each plane; this generally requires densely sampled planes for desired accuracy, and it is very hard to achieve high-resolution depth. In contrast, we propose adaptive thin volumes (ATVs); in an ATV, the depth hypothesis of each plane is spatially varying, which adapts to the uncertainties of previous per-pixel depth predictions. Our UCS-Net has three stages: the first stage processes a small standard plane sweep volume to predict low-resolution depth; two ATVs are then used in the following stages to refine the depth with higher resolution and higher accuracy. Our ATV consists of only a small number of planes; yet, it efficiently partitions local depth ranges within learned small intervals. In particular, we propose to use variance-based uncertainty estimates to adaptively construct ATVs; this differentiable process introduces reasonable and fine-grained spatial partitioning. Our multi-stage framework progressively subdivides the vast scene space with increasing depth resolution and precision, which enables scene reconstruction with high completeness and accuracy in a coarse-to-fine fashion. We demonstrate that our method achieves superior performance compared with state-of-the-art benchmarks on various challenging datasets.
Recently, autonomous driving development ignited competition among car makers and technical corporations. Low-level automation cars are already commercially available. But high automated vehicles where the vehicle drives by itself without human monitoring is still at infancy. Such autonomous vehicles (AVs) rely on the computing system in the car to to interpret the environment and make driving decisions. Therefore, computing system design is essential particularly in enhancing the attainment of driving safety. However, to our knowledge, no clear guideline exists so far regarding safety-aware AV computing system and architecture design. To understand the safety requirement of AV computing system, we performed a field study by running industrial Level-4 autonomous driving fleets in various locations, road conditions, and traffic patterns. The field study indicates that traditional computing system performance metrics, such as tail latency, average latency, maximum latency, and timeout, cannot fully satisfy the safety requirement for AV computing system design. To address this issue, we propose a `safety score' as a primary metric for measuring the level of safety in AV computing system design. Furthermore, we propose a perception latency model, which helps architects estimate the safety score of given architecture and system design without physically testing them in an AV. We demonstrate the use of our safety score and latency model, by developing and evaluating a safety-aware AV computing system computation hardware resource management scheme.
In spite of achieving revolutionary successes in machine learning, deep convolutional neural networks have been recently found to be vulnerable to adversarial attacks and difficult to generalize to novel test images with reasonably large geometric transformations. Inspired by a recent neuroscience discovery revealing that primate brain employs disentangled shape and appearance representations for object recognition, we propose a general disentangled deep autoencoding regularization framework that can be easily applied to any deep embedding based classification model for improving the robustness of deep neural networks. Our framework effectively learns disentangled appearance code and geometric code for robust image classification, which is the first disentangling based method defending against adversarial attacks and complementary to standard defense methods. Extensive experiments on several benchmark datasets show that, our proposed regularization framework leveraging disentangled embedding significantly outperforms traditional unregularized convolutional neural networks for image classification on robustness against adversarial attacks and generalization to novel test data.
In recent years, convolutional neural networks (CNNs) have achieved impressive performance for various visual recognition scenarios. CNNs trained on large labeled datasets can not only obtain significant performance on most challenging benchmarks but also provide powerful representations, which can be used to a wide range of other tasks. However, the requirement of massive amounts of data to train deep neural networks is a major drawback of these models, as the data available is usually limited or imbalanced. Fine-tuning (FT) is an effective way to transfer knowledge learned in a source dataset to a target task. In this paper, we introduce and systematically investigate several factors that influence the performance of fine-tuning for visual recognition. These factors include parameters for the retraining procedure (e.g., the initial learning rate of fine-tuning), the distribution of the source and target data (e.g., the number of categories in the source dataset, the distance between the source and target datasets) and so on. We quantitatively and qualitatively analyze these factors, evaluate their influence, and present many empirical observations. The results reveal insights into what fine-tuning changes CNN parameters and provide useful and evidence-backed intuitions about how to implement fine-tuning for computer vision tasks.
Since scenes are composed in part of objects, accurate recognition of scenes requires knowledge about both scenes and objects. In this paper we address two related problems: 1) scale induced dataset bias in multi-scale convolutional neural network (CNN) architectures, and 2) how to combine effectively scene-centric and object-centric knowledge (i.e. Places and ImageNet) in CNNs. An earlier attempt, Hybrid-CNN, showed that incorporating ImageNet did not help much. Here we propose an alternative method taking the scale into account, resulting in significant recognition gains. By analyzing the response of ImageNet-CNNs and Places-CNNs at different scales we find that both operate in different scale ranges, so using the same network for all the scales induces dataset bias resulting in limited performance. Thus, adapting the feature extractor to each particular scale (i.e. scale-specific CNNs) is crucial to improve recognition, since the objects in the scenes have their specific range of scales. Experimental results show that the recognition accuracy highly depends on the scale, and that simple yet carefully chosen multi-scale combinations of ImageNet-CNNs and Places-CNNs, can push the state-of-the-art recognition accuracy in SUN397 up to 66.26% (and even 70.17% with deeper architectures, comparable to human performance).
Traditional framework of discriminative correlation filters (DCF) is often subject to undesired boundary effects. Several approaches to enlarge search regions have been already proposed in the past years to make up for this shortcoming. However, with excessive background information, more background noises are also introduced and the discriminative filter is prone to learn from the ambiance rather than the object. This situation, along with appearance changes of objects caused by full/partial occlusion, illumination variation, and other reasons has made it more likely to have aberrances in the detection process, which could substantially degrade the credibility of its result. Therefore, in this work, a novel approach to repress the aberrances happening during the detection process is proposed, i.e., aberrance repressed correlation filter (ARCF). By enforcing restriction to the rate of alteration in response maps generated in the detection phase, the ARCF tracker can evidently suppress aberrances and is thus more robust and accurate to track objects. Considerable experiments are conducted on different UAV datasets to perform object tracking from an aerial view, i.e., UAV123, UAVDT, and DTB70, with 243 challenging image sequences containing over 90K frames to verify the performance of the ARCF tracker and it has proven itself to have outperformed other 20 state-of-the-art trackers based on DCF and deep-based frameworks with sufficient speed for real-time applications.
We report on experimental results on the use of a learning-based approach to infer the location of a mobile user of a cellular network within a cell, for a 5G-type Massive multiple input, multiple output (MIMO) system. We describe how the sample spatial covariance matrix computed from the CSI can be used as the input to a learning algorithm which attempts to relate it to user location. We discuss several learning approaches, and analyze in depth the application of extreme learning machines, for which theoretical approximate performance benchmarks are available, to the localization problem. We validate the proposed approach using experimental data collected on a Huawei 5G testbed, provide some performance and robustness benchmarks, and discuss practical issues related to the deployment of such a technique in 5G networks.