Models, code, and papers for "Dorsa Sadigh":
Data generation and labeling are usually an expensive part of learning for robotics. While active learning methods are commonly used to tackle the former problem, preference-based learning is a concept that attempts to solve the latter by querying users with preference questions. In this paper, we will develop a new algorithm, batch active preference-based learning, that enables efficient learning of reward functions using as few data samples as possible while still having short query generation times. We introduce several approximations to the batch active learning problem, and provide theoretical guarantees for the convergence of our algorithms. Finally, we present our experimental results for a variety of robotics tasks in simulation. Our results suggest that our batch active learning algorithm requires only a few queries that are computed in a short amount of time. We then showcase our algorithm in a study to learn human users' preferences.
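To make the preference-query idea concrete, here is a minimal sketch of how a belief over reward parameters could be updated after a single answer. It is an illustration, not the paper's algorithm: the linear reward `w . phi`, the logistic response model, and all names (`update_belief`, `w_samples`, `phi_a`, `phi_b`) are assumptions introduced here, and batch query selection is not shown.

```python
import numpy as np

def update_belief(w_samples, weights, phi_a, phi_b, prefers_a):
    """Reweight sampled reward parameters after one preference answer."""
    # Assumed model (not from the abstract): reward is w . phi, and the user
    # picks trajectory A with probability sigmoid(w . (phi_a - phi_b)).
    diff = w_samples @ (phi_a - phi_b)
    p_a = 1.0 / (1.0 + np.exp(-diff))
    likelihood = p_a if prefers_a else 1.0 - p_a
    new_weights = weights * likelihood
    return new_weights / new_weights.sum()

rng = np.random.default_rng(0)
w_samples = rng.normal(size=(500, 2))      # hypothetical prior samples of w
weights = np.full(500, 1.0 / 500)          # uniform prior weights
phi_a = np.array([1.0, 0.0])               # hypothetical features of trajectory A
phi_b = np.array([0.0, 1.0])               # hypothetical features of trajectory B

posterior = update_belief(w_samples, weights, phi_a, phi_b, prefers_a=True)
mean_w = posterior @ w_samples             # posterior mean of the reward weights
```

After the user prefers A, the posterior mean shifts toward weights that reward A's feature more than B's.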
Controller synthesis for hybrid systems that satisfy temporal specifications expressing various system properties is a challenging problem that has drawn the attention of many researchers. However, making the assumption that such temporal properties are deterministic is far from the reality. For example, many of the properties the controller has to satisfy are learned through machine learning techniques based on sensor input data. In this paper, we propose a new logic, Probabilistic Signal Temporal Logic (PrSTL), as an expressive language to define the stochastic properties, and enforce probabilistic guarantees on them. We further show how to synthesize safe controllers using this logic for cyber-physical systems under the assumption that the stochastic properties are based on a set of Gaussian random variables. One of the key distinguishing features of PrSTL is that the encoded logic is adaptive and changes as the system encounters additional data and updates its beliefs about the latent random variables that define the safety properties. We demonstrate our approach by synthesizing safe controllers under the PrSTL specifications for multiple case studies including control of quadrotors and autonomous vehicles in dynamic environments.
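The abstract does not spell out how a probabilistic predicate is enforced, so the following is only a sketch of the standard Gaussian chance-constraint reformulation, which may differ from the paper's encoding. For a coefficient vector `a ~ N(mu, sigma)`, the requirement `P(a . x <= c) >= 1 - eps` is equivalent to the deterministic condition `mu . x + z * sqrt(x' sigma x) <= c`, where `z` is the standard normal quantile at `1 - eps`. The function name and the example numbers are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chance_constraint_holds(x, mu, sigma, c, eps):
    """Check P(a . x <= c) >= 1 - eps when a ~ N(mu, sigma)."""
    mean = mu @ x                          # mean of the scalar a . x
    std = np.sqrt(x @ sigma @ x)           # standard deviation of a . x
    z = NormalDist().inv_cdf(1.0 - eps)    # standard normal quantile
    return bool(mean + z * std <= c)

x = np.array([1.0, 0.0])
mu = np.zeros(2)
sigma = np.eye(2)

loose = chance_constraint_holds(x, mu, sigma, c=2.0, eps=0.05)   # True
tight = chance_constraint_holds(x, mu, sigma, c=1.0, eps=0.05)   # False
```

Because the condition depends on `mu` and `sigma`, re-estimating them from new data changes the constraint, which is one way to read the adaptive behaviour described above.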
Humans often assume that robots are rational. We believe robots take optimal actions given their objective; hence, when we are uncertain about what the robot's objective is, we interpret the robot's actions as optimal with respect to our estimate of its objective. This approach makes sense when robots straightforwardly optimize their objective, and enables humans to learn what the robot is trying to achieve. However, our insight is that, when robots are aware that humans learn by trusting that the robot's actions are rational, intelligent robots do not act as the human expects; instead, they take advantage of the human's trust, and exploit this trust to more efficiently optimize their own objective. In this paper, we formally model instances of human-robot interaction (HRI) where the human does not know the robot's objective using a two-player game. We formulate different ways in which the robot can model the uncertain human, and compare solutions of this game when the robot has conservative, optimistic, rational, and trusting human models. In an offline linear-quadratic case study and a real-time user study, we show that trusting human models can naturally lead to communicative robot behavior, which influences end-users and increases their involvement.
Although deep reinforcement learning has advanced significantly over the past several years, sample efficiency remains a major challenge. Careful choice of input representations can help improve efficiency depending on the structure present in the problem. In this work, we present an attention-based method to project inputs into an efficient representation space that is invariant under changes to input ordering. We show that our proposed representation results in a search space that is a factor of m! smaller for inputs of m objects. Our experiments demonstrate improvements in sample efficiency for policy gradient methods on a variety of tasks. We show that our representation allows us to solve problems that are otherwise intractable when using naive approaches.
Poor sample efficiency is a major limitation of deep reinforcement learning in many domains. This work presents an attention-based method to project neural network inputs into an efficient representation space that is invariant under changes to input ordering. We show that our proposed representation results in an input space that is a factor of $m!$ smaller for inputs of $m$ objects. We also show that our method is able to represent inputs over variable numbers of objects. Our experiments demonstrate improvements in sample efficiency for policy gradient methods on a variety of tasks. We show that our representation allows us to solve problems that are otherwise intractable when using naïve approaches.
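The ordering invariance can be illustrated with a small attention-pooling sketch. This is not the paper's architecture: the single fixed `query` vector, the `w_key` matrix, and the function name are assumptions chosen to keep the example short. The point is that a softmax-weighted sum over objects gives the same output for every one of the m! orderings of the same m objects, and it accepts any number of objects.

```python
import numpy as np

def attention_pool(objects, w_key, query):
    """Pool a set of object vectors into one vector, ignoring their order."""
    scores = objects @ w_key @ query            # one attention score per object
    scores = scores - scores.max()              # stabilise the softmax
    attn = np.exp(scores) / np.exp(scores).sum()
    return attn @ objects                       # weighted sum over objects

rng = np.random.default_rng(0)
objects = rng.normal(size=(4, 3))               # m = 4 objects, 3 features each
w_key = rng.normal(size=(3, 3))                 # hypothetical key projection
query = rng.normal(size=3)                      # hypothetical query vector

pooled = attention_pool(objects, w_key, query)
shuffled = attention_pool(objects[rng.permutation(4)], w_key, query)
fewer = attention_pool(objects[:2], w_key, query)   # also works for m = 2
```

Here `pooled` and `shuffled` agree up to floating-point rounding, and `fewer` has the same output size even though it sees fewer objects.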
In this paper, we present a planning framework that uses a combination of implicit (robot motion) and explicit (visual/audio/haptic feedback) communication during mobile robot navigation in a manner that humans find understandable and trustworthy. First, we developed a model that approximates both continuous movements and discrete decisions in human navigation, considering the effects of implicit and explicit communication on human decision making. The model approximates the human as an optimal agent, with a reward function obtained through inverse reinforcement learning. Second, a planner uses this model to generate communicative actions that maximize the robot's transparency and efficiency. We implemented the planner on a mobile robot, using a wearable haptic device for explicit communication. In a user study of navigation in an indoor environment, the robot was able to actively communicate its intent to users in order to avoid collisions and facilitate efficient trajectories. Results showed that the planner generated plans that were easier to understand, reduced users' effort, and increased users' trust of the robot, compared to simply performing collision avoidance.
Verified artificial intelligence (AI) is the goal of designing AI-based systems that are provably correct with respect to mathematically-specified requirements. This paper considers Verified AI from a formal methods perspective. We describe five challenges for achieving Verified AI, and five corresponding principles for addressing these challenges.
Road congestion induces significant costs across the world, and road network disturbances, such as traffic accidents, can cause highly congested traffic patterns. If a planner had control over the routing of all vehicles in the network, they could easily reverse this effect. In a more realistic scenario, we consider a planner that controls autonomous cars, which are a fraction of all present cars. We study a dynamic routing game, in which the route choices of autonomous cars can be controlled and the human drivers react selfishly and dynamically to autonomous cars' actions. As the problem is prohibitively large, we use deep reinforcement learning to learn a policy for controlling the autonomous vehicles. This policy influences human drivers to route themselves in such a way that minimizes congestion on the network. To gauge the effectiveness of our learned policies, we establish theoretical results characterizing equilibria on a network of parallel roads and empirically compare the learned policy results with best possible equilibria. Moreover, we show that in the absence of these policies, high demands and network perturbations would result in large congestion, whereas using the policy greatly decreases the travel times by minimizing the congestion. To the best of our knowledge, this is the first work that employs deep reinforcement learning to reduce congestion by influencing humans' routing decisions in mixed-autonomy traffic.
Data collection and labeling is one of the main challenges in employing machine learning algorithms in a variety of real-world applications with limited data. While active learning methods attempt to tackle this issue by labeling only the data samples that give high information, they generally suffer from large computational costs and are impractical in settings where data can be collected in parallel. Batch active learning methods attempt to overcome this computational burden by querying batches of samples at a time. To avoid redundancy between samples, previous works rely on some ad hoc combination of sample quality and diversity. In this paper, we present a new principled batch active learning method using Determinantal Point Processes, a repulsive point process that enables generating diverse batches of samples. We develop tractable algorithms to approximate the mode of a DPP distribution, and provide theoretical guarantees on the degree of approximation. We further demonstrate that an iterative greedy method for DPP maximization, which has lower computational costs but worse theoretical guarantees, still gives competitive results for batch active learning. Our experiments show the value of our methods on several datasets against state-of-the-art baselines.
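The iterative greedy method mentioned above can be sketched as follows. This is a plain illustration of greedy determinant maximization rather than the authors' implementation: the RBF similarity kernel, the toy points, and the name `greedy_dpp` are assumptions, and a practical version would fold a per-sample quality score into the kernel and use incremental updates instead of recomputing determinants.

```python
import numpy as np

def greedy_dpp(kernel, k):
    """Greedily pick k items that approximately maximise det(kernel[S, S])."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            det = np.linalg.det(kernel[np.ix_(idx, idx)])
            if det > best_det:
                best_item, best_det = i, det
        selected.append(best_item)
    return selected

# Toy 1-D samples: two near-duplicates and one distant point.
points = np.array([0.0, 0.1, 5.0])
kernel = np.exp(-0.5 * (points[:, None] - points[None, :]) ** 2)

batch = greedy_dpp(kernel, 2)   # → [0, 2]
```

The determinant shrinks when two selected items are similar, so the greedy step skips the near-duplicate at 0.1 and picks the distant point instead.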
Autonomous vehicles have the potential to increase the capacity of roads via platooning, even when human drivers and autonomous vehicles share roads. However, when users of a road network choose their routes selfishly, the resulting traffic configuration may be very inefficient. Because of this, we consider how to influence human decisions so as to decrease congestion on these roads. We consider a network of parallel roads with two modes of transportation: (i) human drivers who will choose the quickest route available to them, and (ii) ride hailing service which provides an array of autonomous vehicle ride options, each with different prices, to users. In this work, we seek to design these prices so that when autonomous service users choose from these options and human drivers selfishly choose their resulting routes, road usage is maximized and transit delay is minimized. To do so, we formalize a model of how autonomous service users make choices between routes with different price/delay values. Developing a preference-based algorithm to learn the preferences of the users, and using a vehicle flow model related to the Fundamental Diagram of Traffic, we formulate a planning optimization to maximize a social objective and demonstrate the benefit of the proposed routing and learning scheme.
While reinforcement learning (RL) has the potential to enable robots to autonomously acquire a wide range of skills, in practice, RL usually requires manual, per-task engineering of reward functions, especially in real world settings where aspects of the environment needed to compute progress are not directly accessible. To enable robots to autonomously learn skills, we instead consider the problem of reinforcement learning without access to rewards. We aim to learn an unsupervised embedding space under which the robot can measure progress towards a goal for itself. Our approach explicitly optimizes for a metric space under which action sequences that reach a particular state are optimal when the goal is the final state reached. This enables learning effective and control-centric representations that lead to more autonomous reinforcement learning algorithms. Our experiments on three simulated environments and two real-world manipulation problems show that our method can learn effective goal metrics from unlabeled interaction, and use the learned goal metrics for autonomous reinforcement learning.
Traffic congestion has large economic and social costs. The introduction of autonomous vehicles can potentially reduce this congestion, both by increasing network throughput and by enabling a social planner to incentivize users of autonomous vehicles to take longer routes that can alleviate congestion on more direct roads. We formalize the effects of altruistic autonomy on roads shared between human drivers and autonomous vehicles. In this work, we develop a formal model of road congestion on shared roads based on the fundamental diagram of traffic. We consider a network of parallel roads and provide algorithms that compute optimal equilibria that are robust to additional unforeseen demand. We further plan for optimal routings when users have varying degrees of altruism. We find that even with arbitrarily small altruism, total latency can be unboundedly better than without altruism, and that the best selfish equilibrium can be unboundedly better than the worst selfish equilibrium. We validate our theoretical results through microscopic traffic simulations and show average latency decrease of a factor of 4 from worst-case selfish equilibrium to the optimal equilibrium when autonomous vehicles are altruistic.
The emerging technology enabling autonomy in vehicles has led to a variety of new problems in transportation networks, such as planning and perception for autonomous vehicles. Other works consider social objectives such as decreasing fuel consumption and travel time by platooning. However, these strategies are limited by the actions of the surrounding human drivers. In this paper, we consider proactively achieving these social objectives by influencing human behavior through planned interactions. Our key insight is that we can use these social objectives to design local interactions that influence human behavior to achieve these goals. To this end, we characterize the increase in road capacity afforded by platooning, as well as the vehicle configuration that maximizes road capacity. We present a novel algorithm that uses a low-level control framework to leverage local interactions to optimally rearrange vehicles. We showcase our algorithm using a simulated road shared between autonomous and human-driven vehicles, in which we illustrate the reordering in action.
Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.
To communicate with new partners in new contexts, humans rapidly form new linguistic conventions. Recent language models trained with deep neural networks are able to comprehend and produce the existing conventions present in their training data, but are not able to flexibly and interactively adapt those conventions on the fly as humans do. We introduce a repeated reference task as a benchmark for models of adaptation in communication and propose a regularized continual learning framework that allows an artificial agent initialized with a generic language model to more accurately and efficiently communicate with a partner over time. We evaluate this framework through simulations on COCO and in real-time reference game experiments with human partners.
When teams of robots collaborate to complete a task, communication is often necessary. Like humans, robot teammates should implicitly communicate through their actions, but interpreting a partner's actions is typically difficult, since a given action may have many different underlying reasons. Here we propose an alternative approach: rather than leaving teammates unable to infer whether an action is due to exploration, exploitation, or communication, we define separate roles for each agent. Because each role defines a distinct reason for acting (e.g., only exploit, only communicate), teammates can now correctly interpret the meaning behind their partner's actions. Our results suggest that leveraging and alternating roles leads to performance comparable to teams that explicitly exchange messages. You can find more images and videos of our experimental setups at http://ai.stanford.edu/blog/learning-from-partners/.
Our goal is to accurately and efficiently learn reward functions for autonomous robots. Current approaches to this problem include inverse reinforcement learning (IRL), which uses expert demonstrations, and preference-based learning, which iteratively queries the user for her preferences between trajectories. In robotics, however, IRL often struggles because it is difficult to get high-quality demonstrations; conversely, preference-based learning is very inefficient since it attempts to learn a continuous, high-dimensional function from binary feedback. We propose a new framework for reward learning, DemPref, that uses both demonstrations and preference queries to learn a reward function. Specifically, we (1) use the demonstrations to learn a coarse prior over the space of reward functions, to reduce the effective size of the space from which queries are generated; and (2) use the demonstrations to ground the (active) query generation process, to improve the quality of the generated queries. Our method alleviates the efficiency issues faced by standard preference-based learning methods and does not exclusively depend on (possibly low-quality) demonstrations. In numerical experiments, we find that DemPref is significantly more efficient than a standard active preference-based learning method. In a user study, we compare our method to a standard IRL method; we find that users rated the robot trained with DemPref as being more successful at learning their desired behavior, and preferred to use the DemPref system (over IRL) to train the robot.
We propose a safe exploration algorithm for deterministic Markov Decision Processes with unknown transition models. Our algorithm guarantees safety by leveraging Lipschitz-continuity to ensure that no unsafe states are visited during exploration. Unlike many other existing techniques, the provided safety guarantee is deterministic. Our algorithm is optimized to reduce the number of actions needed for exploring the safe space. We demonstrate the performance of our algorithm in comparison with baseline methods in simulation on navigation tasks.
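One way Lipschitz continuity can certify safety is sketched below. The abstract does not give the exact rule, so this is an assumption: a real-valued safety function with Lipschitz constant `lipschitz`, a Euclidean state distance, and a threshold of zero. A candidate state is certified only if some state with a known safety value is close enough that the worst-case drop cannot cross the threshold. All names are hypothetical.

```python
import numpy as np

def certified_safe(candidate, known_states, known_values, lipschitz, threshold=0.0):
    """True if a known state proves the candidate's safety value >= threshold."""
    dists = np.linalg.norm(known_states - candidate, axis=1)
    # Lipschitz continuity bounds the safety value from below:
    # value(candidate) >= value(known) - lipschitz * distance.
    lower_bounds = known_values - lipschitz * dists
    return bool(lower_bounds.max() >= threshold)

known_states = np.array([[0.0, 0.0]])   # one state already observed
known_values = np.array([1.0])          # its measured safety value

near = certified_safe(np.array([0.3, 0.0]), known_states, known_values, lipschitz=2.0)  # True
far = certified_safe(np.array([1.0, 0.0]), known_states, known_values, lipschitz=2.0)   # False
```

The check is deterministic: it either proves safety from known values or declines to certify, with no probability of error attached to the bound itself.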
Subspace identification is a classical and very well studied problem in system identification. The problem was recently posed as a convex optimization problem via the nuclear norm relaxation. Inspired by robust PCA, we extend this framework to handle outliers. The proposed framework takes the form of a convex optimization problem with an objective that trades off fit, rank and sparsity. As in robust PCA, it can be problematic to find a suitable regularization parameter. We show how the space in which a suitable parameter should be sought can be limited to a bounded open set of the two dimensional parameter space. In practice, this is very useful since it restricts the parameter space that is needed to be surveyed.
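The rank and sparsity terms in objectives of this kind are usually handled through two proximal operators, sketched here. This is background for the nuclear-norm and l1 penalties familiar from robust PCA, not the paper's solver. The function names and example values are assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink entries toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(mat, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vt

shrunk = soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0)    # entries 2, 0, -1
low_rank = singular_value_threshold(np.diag([3.0, 0.5]), 1.0)
rank = np.linalg.matrix_rank(low_rank)                       # 1
```

Larger thresholds drive more singular values or entries to zero, which is why the choice of the two regularization parameters matters and why bounding the region to search is useful.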