Models, code, and papers for "Matthew Johnson-Roberson":
Autonomous robot manipulation involves estimating the pose of the object to be manipulated. Methods using RGB-D data have shown great success in solving this problem. However, there are situations where cost constraints or the working environment may limit the use of RGB-D sensors. When limited to monocular camera data only, the problem of object pose estimation is very challenging. In this work, we introduce a novel method called SilhoNet that predicts 6D object pose from monocular images. We use a Convolutional Neural Network (CNN) pipeline that takes in region of interest proposals to simultaneously predict an intermediate silhouette representation for objects with an associated occlusion mask and a 3D translation vector. The 3D orientation is then regressed from the predicted silhouettes. We show that our method achieves better overall performance than the state-of-the art PoseCNN network for 6D pose estimation.
Autonomous robot manipulation often involves both estimating the pose of the object to be manipulated and selecting a viable grasp point. Methods using RGB-D data have shown great success in solving these problems. However, there are situations where cost constraints or the working environment may limit the use of RGB-D sensors. When limited to monocular camera data only, both the problem of object pose estimation and of grasp point selection are very challenging. In the past, research has focused on solving these problems separately. In this work, we introduce a novel method called SilhoNet that bridges the gap between these two tasks. We use a Convolutional Neural Network (CNN) pipeline that takes in ROI proposals to simultaneously predict an intermediate silhouette representation for objects with an associated occlusion mask. The 3D pose is then regressed from the predicted silhouettes. Grasp points from a precomputed database are filtered by back-projecting them onto the occlusion mask to find which points are visible in the scene. We show that our method achieves better overall performance than the state-of-the art PoseCNN network for 3D pose estimation on the YCB-video dataset.
Illumination effects in images, specifically cast shadows and shading, have been shown to decrease the performance of deep neural networks on a large number of vision-based detection, recognition and segmentation tasks in urban driving scenes. A key factor that contributes to this performance gap is the lack of `time-of-day' diversity within real, labeled datasets. There have been impressive advances in the realm of image to image translation in transferring previously unseen visual effects into a dataset, specifically in day to night translation. However, it is not easy to constrain what visual effects, let alone illumination effects, are transferred from one dataset to another during the training process. To address this problem, we propose deep learning framework, called Shadow Transfer, that can relight complex outdoor scenes by transferring realistic shadow, shading, and other lighting effects onto a single image. The novelty of the proposed framework is that it is both self-supervised, and is designed to operate on sensor and label information that is easily available in autonomous vehicle datasets. We show the effectiveness of this method on both synthetic and real datasets, and we provide experiments that demonstrate that the proposed method produces images of higher visual quality than state of the art image to image translation methods.
Highway driving places significant demands on human drivers and autonomous vehicles (AVs) alike due to high speeds and the complex interactions in dense traffic. Merging onto the highway poses additional challenges by limiting the amount of time available for decision-making. Predicting others' trajectories accurately and quickly is crucial to safely executing these maneuvers. Many existing prediction methods based on neural networks have focused on modeling interactions to achieve better accuracy while assuming the existence of observation windows over 3s long. This paper proposes a novel probabilistic model for trajectory prediction that performs competitively with as little as 400ms of observations. The proposed method fits a low-dimensional car-following model to observed behavior and introduces nonconvex regularization terms that enforce realistic driving behaviors in the predictions. The resulting inference procedure allows for realtime forecasts up to 10s into the future while accounting for interactions between vehicles. Experiments on dense traffic in the NGSIM dataset demonstrate that the proposed method achieves state-of-the-art performance with both highly constrained and more traditional observation windows.
In applications such as autonomous driving, it is important to understand, infer, and anticipate the intention and future behavior of pedestrians. This ability allows vehicles to avoid collisions and improve ride safety and quality. This paper proposes a biomechanically inspired recurrent neural network (Bio-LSTM) that can predict the location and 3D articulated body pose of pedestrians in a global coordinate frame, given 3D poses and locations estimated in prior frames with inaccuracy. The proposed network is able to predict poses and global locations for multiple pedestrians simultaneously, for pedestrians up to 45 meters from the cameras (urban intersection scale). The outputs of the proposed network are full-body 3D meshes represented in Skinned Multi-Person Linear (SMPL) model parameters. The proposed approach relies on a novel objective function that incorporates the periodicity of human walking (gait), the mirror symmetry of the human body, and the change of ground reaction forces in a human gait cycle. This paper presents prediction results on the PedX dataset, a large-scale, in-the-wild data set collected at real urban intersections with heavy pedestrian traffic. Results show that the proposed network can successfully learn the characteristics of pedestrian gait and produce accurate and consistent 3D pose predictions.
This paper presents an interconnected control-planning strategy for redundant manipulators, subject to system and environmental constraints. The method incorporates low-level control characteristics and high-level planning components into a robust strategy for manipulators acting in complex environments, subject to joint limits. This strategy is formulated using an adaptive control rule, the estimated dynamic model of the robotic system and the nullspace of the linearized constraints. A path is generated that takes into account the capabilities of the platform. The proposed method is computationally efficient, enabling its implementation on a real multi-body robotic system. Through experimental results with a 7 DOF manipulator, we demonstrate the performance of the method in real-world scenarios.
Knowing and predicting dangerous factors within a scene are two key components during autonomous driving, especially in a crowded urban environment. To navigate safely in environments, risk assessment is needed to quantify and associate the risk of taking a certain action. Risk assessment and planning is usually done by first tracking and predicting trajectories of other agents, such as vehicles and pedestrians, and then choosing an action to avoid collision in the future. However, few existing risk assessment algorithms handle occlusion and other sensory limitations effectively. This paper explores the possibility of efficient risk assessment under occlusion via both forward and backward reachability. The proposed algorithm can not only identify where the risk-induced factors are, but also be used for motion planning by executing low-level commands, such as throttle. The proposed method is evaluated on various four-way highly occluded intersections with up to five other vehicles in the scene. Compared with other risk assessment algorithms, the proposed method shows better efficiency, meaning that the ego vehicle reaches the goal at a higher speed. In addition, it also lowers the median collision rate by 7.5x.
Navigating safely in urban environments remains a challenging problem for autonomous vehicles. Occlusion and limited sensor range can pose significant challenges to safely navigate among pedestrians and other vehicles in the environment. Enabling vehicles to quantify the risk posed by unseen regions allows them to anticipate future possibilities, resulting in increased safety and ride comfort. This paper proposes an algorithm that takes advantage of the known road layouts to forecast, quantify, and aggregate risk associated with occlusions and limited sensor range. This allows us to make predictions of risk induced by unobserved vehicles even in heavily occluded urban environments. The risk can then be used either by a low-level planning algorithm to generate better trajectories, or by a high-level one to plan a better route. The proposed algorithm is evaluated on intersection layouts from real-world map data with up to five other vehicles in the scene, and verified to reduce collision rates by 4.8x comparing to a baseline method while improving driving comfort.
Urban environments pose a significant challenge for autonomous vehicles (AVs) as they must safely navigate while in close proximity to many pedestrians. It is crucial for the AV to correctly understand and predict the future trajectories of pedestrians to avoid collision and plan a safe path. Deep neural networks (DNNs) have shown promising results in accurately predicting pedestrian trajectories, relying on large amounts of annotated real-world data to learn pedestrian behavior. However, collecting and annotating these large real-world pedestrian datasets is costly in both time and labor. This paper describes a novel method using a stochastic sampling-based simulation to train DNNs for pedestrian trajectory prediction with social interaction. Our novel simulation method can generate vast amounts of automatically-annotated, realistic, and naturalistic synthetic pedestrian trajectories based on small amounts of real annotation. We then use such synthetic trajectories to train an off-the-shelf state-of-the-art deep learning approach Social GAN (Generative Adversarial Network) to perform pedestrian trajectory prediction. Our proposed architecture, trained only using synthetic trajectories, achieves better prediction results compared to those trained on human-annotated real-world data using the same network. Our work demonstrates the effectiveness and potential of using simulation as a substitution for human annotation efforts to train high-performing prediction algorithms such as the DNNs.
Recent work has focused on generating synthetic imagery to increase the size and variability of training data for learning visual tasks in urban scenes. This includes increasing the occurrence of occlusions or varying environmental and weather effects. However, few have addressed modeling variation in the sensor domain. Sensor effects can degrade real images, limiting generalizability of network performance on visual tasks trained on synthetic data and tested in real environments. This paper proposes an efficient, automatic, physically-based augmentation pipeline to vary sensor effects --chromatic aberration, blur, exposure, noise, and color cast-- for synthetic imagery. In particular, this paper illustrates that augmenting synthetic training datasets with the proposed pipeline reduces the domain gap between synthetic and real domains for the task of object detection in urban driving scenes.
Performance on benchmark datasets has drastically improved with advances in deep learning. Still, cross- dataset generalization performance remains relatively low due to the domain shift that can occur between two different datasets. This domain shift is especially exaggerated between synthetic and real datasets. Significant research has been done to reduce this gap, specifically via modeling variation in the spatial layout of a scene, such as occlusions, and scene environmental factors, such as time of day and weather effects. However, few works have addressed modeling the variation in the sensor domain as a means of reducing the synthetic to real domain gap. The camera or sensor used to capture a dataset introduces artifacts into the image data that are unique to the sensor model, suggesting that sensor effects may also contribute to domain shift. To address this, we propose a learned augmentation network composed of physically-based augmentation functions. Our proposed augmentation pipeline transfers specific effects of the sensor model --chromatic aberration, blur, exposure, noise, and color temperature-- from a real dataset to a synthetic dataset. We provide experiments that demonstrate that augmenting synthetic training datasets with the proposed learned augmentation framework reduces the domain gap between synthetic and real domains for object detection in urban driving scenes.
Recent work has shown that convolutional neural networks (CNNs) can be applied successfully in disparity estimation, but these methods still suffer from errors in regions of low-texture, occlusions and reflections. Concurrently, deep learning for semantic segmentation has shown great progress in recent years. In this paper, we design a CNN architecture that combines these two tasks to improve the quality and accuracy of disparity estimation with the help of semantic segmentation. Specifically, we propose a network structure in which these two tasks are highly coupled. One key novelty of this approach is the two-stage refinement process. Initial disparity estimates are refined with an embedding learned from the semantic segmentation branch of the network. The proposed model is trained using an unsupervised approach, in which images from one half of the stereo pair are warped and compared against images from the other camera. Another key advantage of the proposed approach is that a single network is capable of outputting disparity estimates and semantic labels. These outputs are of great use in autonomous vehicle operation; with real-time constraints being key, such performance improvements increase the viability of driving applications. Experiments on KITTI and Cityscapes datasets show that our model can achieve state-of-the-art results and that leveraging embedding learned from semantic segmentation improves the performance of disparity estimation.
Path planning for autonomous vehicles in arbitrary environments requires a guarantee of safety, but this can be impractical to ensure in real-time when the vehicle is described with a high-fidelity model. To address this problem, this paper develops a method to perform trajectory design by considering a low-fidelity model that accounts for model mismatch. The presented method begins by computing a conservative Forward Reachable Set (FRS) of a high-fidelity model's trajectories produced when tracking trajectories of a low-fidelity model over a finite time horizon. At runtime, the vehicle intersects this FRS with obstacles in the environment to eliminate trajectories that can lead to a collision, then selects an optimal plan from the remaining safe set. By bounding the time for this set intersection and subsequent path selection, this paper proves a lower bound for the FRS time horizon and sensing horizon to guarantee safety. This method is demonstrated in simulation using a kinematic Dubin's car as the low-fidelity model and a dynamic unicycle as the high-fidelity model.
An accurate depth map of the environment is critical to the safe operation of autonomous robots and vehicles. Currently, either light detection and ranging (LIDAR) or stereo matching algorithms are used to acquire such depth information. However, a high-resolution LIDAR is expensive and produces sparse depth map at large range; stereo matching algorithms are able to generate denser depth maps but are typically less accurate than LIDAR at long range. This paper combines these approaches together to generate high-quality dense depth maps. Unlike previous approaches that are trained using ground-truth labels, the proposed model adopts a self-supervised training process. Experiments show that the proposed method is able to generate high-quality dense depth maps and performs robustly even with low-resolution inputs. This shows the potential to reduce the cost by using LIDARs with lower resolution in concert with stereo systems while maintaining high resolution.
One of the major open challenges in self-driving cars is the ability to detect cars and pedestrians to safely navigate in the world. Deep learning-based object detector approaches have enabled great advances in using camera imagery to detect and classify objects. But for a safety critical application, such as autonomous driving, the error rates of the current state of the art are still too high to enable safe operation. Moreover, the characterization of object detector performance is primarily limited to testing on prerecorded datasets. Errors that occur on novel data go undetected without additional human labels. In this letter, we propose an automated method to identify mistakes made by object detectors without ground truth labels. We show that inconsistencies in the object detector output between a pair of similar images can be used as hypotheses for false negatives (e.g., missed detections) and using a novel set of features for each hypothesis, an off-the-shelf binary classifier can be used to find valid errors. In particular, we study two distinct cues - temporal and stereo inconsistencies - using data that are readily available on most autonomous vehicles. Our method can be used with any camera-based object detector and we illustrate the technique on several sets of real world data. We show that a state-of-the-art detector, tracker, and our classifier trained only on synthetic data can identify valid errors on KITTI tracking dataset with an average precision of 0.94. We also release a new tracking dataset with 104 sequences totaling 80,655 labeled pairs of stereo images along with ground truth disparity from a game engine to facilitate further research. The dataset and code are available at https://fcav.engin.umich.edu/research/failing-to-learn
This paper reports on WaterGAN, a generative adversarial network (GAN) for generating realistic underwater images from in-air image and depth pairings in an unsupervised pipeline used for color correction of monocular underwater images. Cameras onboard autonomous and remotely operated vehicles can capture high resolution images to map the seafloor, however, underwater image formation is subject to the complex process of light propagation through the water column. The raw images retrieved are characteristically different than images taken in air due to effects such as absorption and scattering, which cause attenuation of light at different rates for different wavelengths. While this physical process is well described theoretically, the model depends on many parameters intrinsic to the water column as well as the objects in the scene. These factors make recovery of these parameters difficult without simplifying assumptions or field calibration, hence, restoration of underwater images is a non-trivial problem. Deep learning has demonstrated great success in modeling complex nonlinear systems but requires a large amount of training data, which is difficult to compile in deep sea environments. Using WaterGAN, we generate a large training dataset of paired imagery, both raw underwater and true color in-air, as well as depth data. This data serves as input to a novel end-to-end network for color correction of monocular underwater images. Due to the depth-dependent water column effects inherent to underwater environments, we show that our end-to-end network implicitly learns a coarse depth estimate of the underwater scene from monocular underwater images. Our proposed pipeline is validated with testing on real data collected from both a pure water tank and from underwater surveys in field testing. Source code is made publicly available with sample datasets and pretrained models.
Online control design using a high-fidelity, full-order model for a bipedal robot can be challenging due to the size of the state space of the model. A commonly adopted solution to overcome this challenge is to approximate the full-order model (anchor) with a simplified, reduced-order model (template), while performing control synthesis. Unfortunately it is challenging to make formal guarantees about the safety of an anchor model using a controller designed in an online fashion using a template model. To address this problem, this paper proposes a method to generate safety-preserving controllers for anchor models by performing reachability analysis on template models while bounding the modeling error. This paper describes how this reachable set can be incorporated into a Model Predictive Control framework to select controllers that result in safe walking on the anchor model in an online fashion. The method is illustrated on a 5-link RABBIT model, and is shown to allow the robot to walk safely while utilizing controllers designed in an online fashion.
Autonomous mobile robots must operate with limited sensor horizons in unpredictable environments. To do so, they use a receding-horizon strategy to plan trajectories, by executing a short plan while creating the next plan. However, creating safe, dynamically-feasible trajectories in real time is challenging; and, planners must ensure that they are persistently feasible, meaning that a new trajectory is always available before the previous one has finished executing. Existing approaches make a tradeoff between model complexity and planning speed, which can require sacrificing guarantees of safety and dynamic feasibility. This work presents the Reachability-based Trajectory Design (RTD) method for trajectory planning. RTD begins with an offline Forward Reachable Set (FRS) computation of a robot's motion while it tracks parameterized trajectories; the FRS also provably bounds tracking error. At runtime, the FRS is used to map obstacles to the space of parameterized trajectories, which allows RTD to select a safe trajectory at every planning iteration. RTD prescribes a method of representing obstacles to ensure that these constraints can be created and evaluated in real time while maintaining provable safety. Persistent feasibility is achieved by prescribing a minimum duration of planned trajectories, and a minimum sensor horizon. A system decomposition approach is used to increase the dimension of the parameterized trajectories in the FRS, allowing for RTD to create more complex plans at runtime. RTD is compared in simulation with Rapidly-exploring Random Trees (RRT) and Nonlinear Model-Predictive Control (NMPC). RTD is also demonstrated on two hardware platforms in randomly-crafted environments: a differential-drive Segway, and a car-like Rover. The proposed method is shown as safe and persistently feasible across thousands of simulations and dozens of hardware demos.