Models, code, and papers for "Hang Chu":
Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read and generate facial gestures alongside text. This allows our model to adapt its response based on the "mood" of the conversation. In particular, we introduce an RNN encoder-decoder that exploits the movement of facial muscles, as well as the verbal conversation. The decoder consists of two layers, where the lower layer aims at generating the verbal response and coarse facial expressions, while the second layer fills in the subtle gestures, making the generated output smoother and more natural. We train our neural network by having it "watch" 250 movies. We showcase our joint face-text model in generating more natural conversations through automatic metrics and a human study. We demonstrate an example application with a face-to-face chatting avatar.
We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show a strong preference for our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.
We propose a method for accurately localizing ground vehicles with the aid of satellite imagery. Our approach takes a ground image as input, and outputs the location from which it was taken on a georeferenced satellite image. We perform visual localization by estimating the co-occurrence probabilities between the ground and satellite images based on a ground-satellite feature dictionary. The method is able to estimate likelihoods over arbitrary locations without the need for a dense ground image database. We present a ranking-loss based algorithm that learns location-discriminative feature projection matrices that result in further improvements in accuracy. We evaluate our method on the Malaga and KITTI public datasets and demonstrate significant improvements over a baseline that performs exhaustive search.
In this paper, a new heat-map-based (HMB) algorithm is proposed for group activity recognition. The proposed algorithm first models human trajectories as a series of "heat sources" and then applies a thermal diffusion process to create a heat map (HM) for representing the group activities. Based on this heat map, a new key-point based (KPB) method is used for handling the alignments among heat maps with different scales and rotations. A surface-fitting (SF) method is also proposed for recognizing group activities. Our proposed HM feature can efficiently embed the temporal motion information of the group activities, while the proposed KPB and SF methods can effectively utilize the characteristics of the heat map for activity recognition. Experimental results demonstrate the effectiveness of our proposed algorithms.
We tackle the problem of using 3D information in convolutional neural networks for down-stream recognition tasks. Using depth as an additional channel alongside the RGB input has the scale variance problem present in image convolution based approaches. On the other hand, 3D convolution wastes a large amount of memory on mostly unoccupied 3D space, which consists of only the surface visible to the sensor. Instead, we propose SurfConv, which "slides" compact 2D filters along the visible 3D surface. SurfConv is formulated as a simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth Discretization (D4) scheme. We demonstrate the effectiveness of our method on indoor and outdoor 3D semantic segmentation datasets. Our method achieves state-of-the-art performance with fewer than 30% of the parameters used by the 3D convolution-based approaches.
We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show that it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch parts of the city road layout to be synthesized. In addition to synthesis, the proposed NTG finds uses in an analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.
In this paper we introduce the TorontoCity benchmark, which covers the full greater Toronto area (GTA) with 712.5 km² of land, 8439 km of road and around 400,000 buildings. Our benchmark provides different perspectives of the world captured from airplanes, drones and cars driving around the city. Manually labeling such a large scale dataset is infeasible. Instead, we propose to utilize different sources of high-precision maps to create our ground truth. Towards this goal, we develop algorithms that allow us to align all data sources with the maps while requiring minimal human supervision. We have designed a wide variety of tasks including building height estimation (reconstruction), road centerline and curb extraction, building instance segmentation, building contour extraction (reorganization), semantic labeling and scene type classification (recognition). Our pilot study shows that most of these tasks are still difficult for modern convolutional neural networks.
Deep reinforcement learning for multi-agent cooperation and competition has been a hot topic recently. This paper focuses on cooperative multi-agent problems, using actor-critic methods under local-observation settings. Multi-agent deep deterministic policy gradient obtained state-of-the-art results for some multi-agent games, but it does not scale well with a growing number of agents. In order to boost scalability, we propose a parameter sharing deterministic policy gradient method with three variants based on neural networks: actor-critic sharing, actor sharing, and actor sharing with a partially shared critic. Benchmarks from rllab show that the proposed method has advantages in learning speed and memory efficiency, scales well with a growing number of agents, and, moreover, can make full use of reward sharing and exchangeability where possible.
Rectifier neuron units (ReLUs) have been widely used in deep convolutional networks. A ReLU converts negative values to zeros and does not change positive values, which leads to a high sparsity of neurons. In this work, we first examine the sparsity of the outputs of ReLUs in some popular deep convolutional architectures. We then use the sparsity property of ReLUs to accelerate the calculation of convolution by skipping calculations of zero-valued neurons. The proposed sparse convolution algorithm achieves some speedup on CPUs compared to the traditional matrix-matrix multiplication algorithm for convolution when the sparsity is not less than 0.9.
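The zero-skipping idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's CPU algorithm: it is a plain-Python, single-channel, "valid" cross-correlation in which the function names (`relu`, `sparsity`, `dense_conv2d`, `sparse_conv2d`) and the toy input are assumptions made for illustration. The sparse variant loops over inputs rather than outputs, so a zero activation is skipped before any multiplication is done.

```python
def relu(x):
    """Elementwise ReLU on a 2D list: negatives become 0, positives are kept."""
    return [[v if v > 0 else 0.0 for v in row] for row in x]


def sparsity(x):
    """Fraction of entries that are exactly zero."""
    flat = [v for row in x for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)


def dense_conv2d(x, k):
    """Valid cross-correlation, computed output by output (no skipping)."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    out = [[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            out[i][j] = sum(
                x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw)
            )
    return out


def sparse_conv2d(x, k):
    """Same result, but each nonzero input is scattered to the outputs it
    touches; zero-valued inputs are skipped entirely."""
    H, W, kh, kw = len(x), len(x[0]), len(k), len(k[0])
    out = [[0.0] * (W - kw + 1) for _ in range(H - kh + 1)]
    for r in range(H):
        for c in range(W):
            v = x[r][c]
            if v == 0:
                continue  # the skipped work: kh * kw multiply-adds
            for a in range(kh):
                for b in range(kw):
                    i, j = r - a, c - b  # output that reads x[r][c] via k[a][b]
                    if 0 <= i <= H - kh and 0 <= j <= W - kw:
                        out[i][j] += v * k[a][b]
    return out


# Hypothetical toy data: mostly negative pre-activations, so ReLU output is sparse.
pre_activation = [
    [-1.0, 2.0, -3.0, -0.5],
    [-2.0, -1.0, -4.0, 1.0],
    [-0.5, -2.0, 3.0, -1.0],
    [-1.0, -3.0, -2.0, -2.0],
]
kernel = [[1.0, -1.0], [2.0, 0.5]]

activations = relu(pre_activation)
dense_out = dense_conv2d(activations, kernel)
sparse_out = sparse_conv2d(activations, kernel)
```

In this sketch the work of the sparse variant scales with the number of nonzero activations, which is why a benefit would only be expected at high sparsity; the 0.9 threshold quoted in the abstract is the paper's measurement and is not reproduced here.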
Stochastic simulation approaches perform probabilistic inference in Bayesian networks by estimating the probability of an event based on the frequency with which the event occurs in a set of simulation trials. This paper describes the evidence weighting mechanism for augmenting the logic sampling stochastic simulation algorithm [Henrion, 1986]. Evidence weighting modifies the logic sampling algorithm by weighting each simulation trial by the likelihood of a network's evidence given the sampled state node values for that trial. We also describe an enhancement to the basic algorithm which uses the evidential integration technique [Chin and Cooper, 1987]. A comparison of the basic evidence weighting mechanism with the Markov blanket algorithm [Pearl, 1987], the logic sampling algorithm, and the evidence integration algorithm is presented. The comparison is aided by analyzing the performance of the algorithms in a simple example network.
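The difference between logic sampling and evidence weighting described above can be sketched on a two-node network A → B with evidence B = true. The network, its probabilities, and the function names are hypothetical and chosen only for illustration; they are not the example network from the paper. Logic sampling samples every node and discards trials that contradict the evidence, whereas evidence weighting samples only the non-evidence node and weights each trial by the likelihood of the evidence given the sampled values.

```python
import random

# Hypothetical network A -> B (all numbers are illustrative assumptions).
P_A = 0.3                                  # P(A = true)
P_B_GIVEN_A = {True: 0.9, False: 0.2}      # P(B = true | A)


def logic_sampling(n_trials, rng):
    """Estimate P(A = true | B = true) by sampling all nodes and
    keeping only the trials in which the evidence B = true occurs."""
    kept = hits = 0
    for _ in range(n_trials):
        a = rng.random() < P_A
        b = rng.random() < P_B_GIVEN_A[a]
        if b:
            kept += 1
            hits += a
    return hits / kept if kept else float("nan")


def evidence_weighting(n_trials, rng):
    """Estimate the same posterior by sampling only A and weighting
    each trial by the likelihood of the evidence, P(B = true | A = a)."""
    weighted_hits = total_weight = 0.0
    for _ in range(n_trials):
        a = rng.random() < P_A
        w = P_B_GIVEN_A[a]
        total_weight += w
        if a:
            weighted_hits += w
    return weighted_hits / total_weight


# Exact posterior by Bayes' rule, for comparison.
exact = P_A * P_B_GIVEN_A[True] / (
    P_A * P_B_GIVEN_A[True] + (1 - P_A) * P_B_GIVEN_A[False]
)
ls_estimate = logic_sampling(20000, random.Random(0))
ew_estimate = evidence_weighting(20000, random.Random(0))
```

In this toy setting every evidence-weighting trial contributes to the estimate, while logic sampling wastes each trial in which B comes out false; that waste grows as the evidence becomes less likely.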
In almost all situation assessment problems, it is useful to dynamically contract and expand the states under consideration as assessment proceeds. Contraction is most often used to combine similar events or low probability events together in order to reduce computation. Expansion is most often used to make distinctions of interest which have significant probability in order to improve the quality of the assessment. Although other uncertainty calculi, notably Dempster-Shafer [Shafer, 1976], have addressed these operations, there has not yet been any approach to refining and coarsening state spaces for the Bayesian Network technology. This paper presents two operations for refining and coarsening the state space in Bayesian Networks. We also discuss their practical implications for knowledge acquisition.
Recent research on the Symbolic Probabilistic Inference (SPI) algorithm has focused attention on the importance of resolving general queries in Bayesian networks. SPI applies the concept of dependency-directed backward search to probabilistic inference, and is incremental with respect to both queries and observations. In response to this research we have extended the evidence potential algorithm with the same features. We call the extension symbolic evidence potential inference (SEPI). SEPI, like SPI, can handle generic queries and is incremental with respect to queries and observations. While in SPI operations are done on a search tree constructed from the nodes of the original network, in SEPI a clique-tree structure obtained from the evidence potential algorithm is the basic framework for recursive query processing. In this paper, we describe the systematic query and caching procedure of SEPI. SEPI begins with finding a clique tree from a Bayesian network, the standard procedure of the evidence potential algorithm. With the clique tree, various probability distributions are computed and stored in each clique. This is the "pre-processing" step of SEPI. Once this step is done, the query can then be computed. To process a query, a recursive process similar to the SPI algorithm is used. The queries are directed to the root clique and decomposed into queries for the clique's subtrees until a particular query can be answered at the clique at which it is directed. The algorithm and the computation are simple. The SEPI algorithm will be presented in this paper along with several examples.
Research on Symbolic Probabilistic Inference (SPI) [2, 3] has provided an algorithm for resolving general queries in Bayesian networks. SPI applies the concept of dependency-directed backward search to probabilistic inference, and is incremental with respect to both queries and observations. Unlike traditional Bayesian network inferencing algorithms, the SPI algorithm is goal directed, performing only those calculations that are required to respond to queries. Research to date on SPI applies to Bayesian networks with discrete-valued variables and does not address variables with continuous values. In this paper, we extend the SPI algorithm to handle Bayesian networks made up of continuous variables where the relationships between the variables are restricted to be "linear Gaussian". We call this variation of the SPI algorithm SPI Continuous (SPIC). SPIC modifies the three basic SPI operations: multiplication, summation, and substitution. However, SPIC retains the framework of the SPI algorithm, namely building the search tree and the recursive query mechanism, and therefore retains the goal-directed and incrementality features of SPI.
The evolution of MobileNets has laid a solid foundation for neural network applications on mobile devices. With the latest MobileNetV3, neural architecture search again claimed its supremacy in network design. Unfortunately, to date all mobile methods mainly focus on CPU latencies instead of GPU latencies, even though the GPU is much preferred in practice for its faster speed, lower overhead and less interference. Bearing the target hardware in mind, we propose the first Mobile GPU-Aware (MoGA) neural architecture search in order to be precisely tailored for real-world applications. Further, the ultimate objective in devising a mobile network lies in achieving better performance by maximizing the utilization of bounded resources. Demanding higher capability while restraining time consumption creates conflicting goals. We alleviate the tension by weighted evolution techniques. Moreover, we encourage increasing the number of parameters for higher representational power. With 200x fewer GPU days than MnasNet, we obtain a series of models that outperform MobileNetV3 under similar latency constraints: MoGA-A achieves 75.9% top-1 accuracy on ImageNet, and MoGA-B reaches 75.5% while costing only 0.5 ms more on mobile GPU. MoGA-C best attests GPU-awareness by reaching 75.3% and being slower on CPU but faster on GPU. The models and test code are available at https://github.com/xiaomi-automl/MoGA.
We consider Hadamard product parametrization as a change-of-variable (over-parametrization) technique for solving least squares problems in the context of linear regression. Despite the non-convexity and exponentially many saddle points induced by the change-of-variable, we show that under certain conditions, this over-parametrization leads to implicit regularization: if we directly apply gradient descent to the residual sum of squares with sufficiently small initial values, then under a proper early stopping rule, the iterates converge to a nearly sparse rate-optimal solution with relatively better accuracy than explicitly regularized approaches. In particular, the resulting estimator does not suffer from extra bias due to explicit penalties, and can achieve the parametric root-$n$ rate (independent of the dimension) under proper conditions on the signal-to-noise ratio. We perform simulations to compare our methods with high dimensional linear regression with explicit regularizations. Our results illustrate advantages of using implicit regularization via gradient descent after over-parametrization in sparse vector estimation.
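The mechanism described above, gradient descent on the residual sum of squares after the change of variable β = g ∘ l, can be sketched in a few lines. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function name `hadamard_gd`, the initialization (g = α, l = 0 with small α), the step size, the fixed iteration count standing in for an early stopping rule, and the identity design matrix used in the toy run.

```python
def hadamard_gd(X, y, alpha=1e-3, step=0.1, n_steps=150):
    """Gradient descent on 0.5 * ||y - X (g * l)||^2 over (g, l),
    where * is the elementwise (Hadamard) product.

    Assumed initialization: g = alpha, l = 0. Stopping after a fixed
    n_steps plays the role of early stopping in this sketch."""
    n, p = len(X), len(X[0])
    g = [alpha] * p
    l = [0.0] * p
    for _ in range(n_steps):
        beta = [gj * lj for gj, lj in zip(g, l)]
        # Residual X beta - y and gradient of the loss with respect to beta.
        resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
        grad_beta = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        # Chain rule: d/dg = grad_beta * l and d/dl = grad_beta * g.
        g, l = (
            [gj - step * gb * lj for gj, lj, gb in zip(g, l, grad_beta)],
            [lj - step * gb * gj for gj, lj, gb in zip(g, l, grad_beta)],
        )
    return [gj * lj for gj, lj in zip(g, l)]


# Hypothetical toy problem: identity design, two signal coordinates (2.0 and
# -1.5) and two coordinates that carry only small noise (0.05 and -0.03).
X = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
y = [2.0, -1.5, 0.05, -0.03]
beta_hat = hadamard_gd(X, y)
```

With a small initialization, each coordinate grows at a rate tied to the size of its own signal, so in this toy run the large coordinates are fitted long before the noise coordinates move away from zero. Stopping in between yields a nearly sparse estimate without an explicit penalty; running far longer would eventually fit the noise as well.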
Chinese input methods are used to convert pinyin sequences or other Latin encoding systems into Chinese character sentences. For more effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands manual maintenance on a schedule. To remove the need for this inconvenient vocabulary maintenance, this work focuses on automatic wordhood acquisition by fully considering that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced for measuring how likely a character sequence is to be a user-recognized word with respect to using an IME. Then an online algorithm is proposed to adjust the word likelihood or generate new words by comparing the user's true choice for inputting with the algorithm's prediction. The experimental results show that the proposed solution can agilely adapt to diverse typing and demonstrates performance approaching that of a highly optimized IME with a fixed vocabulary.
Probabilistic modeling is one of the foundations of modern machine learning and artificial intelligence. In this paper, we propose a novel type of probabilistic model named latent dependency forest models (LDFMs). An LDFM models the dependencies between random variables with a forest structure that can change dynamically based on the variable values. It is therefore capable of modeling context-specific independence. We parameterize an LDFM using a first-order non-projective dependency grammar. Learning LDFMs from data can be formulated purely as a parameter learning problem, and hence the difficult problem of model structure learning is circumvented. Our experimental results show that LDFMs are competitive with existing probabilistic models.
Mixed language data is one of the difficult yet less explored domains of natural language processing. Most research in fields like machine translation or sentiment analysis assumes monolingual input. However, people who are capable of using more than one language often communicate using multiple languages at the same time. Sociolinguists believe this "code-switching" phenomenon to be socially motivated, for example, to express solidarity or to establish authority. Most past work depends on external tools or resources, such as part-of-speech tagging, dictionary look-up, or named-entity recognizers, to extract rich features for training machine learning models. In this paper, we train recurrent neural networks with only raw features, and use word embeddings to automatically learn meaningful representations. Using the same mixed-language Twitter corpus, our system is able to outperform the best SVM-based systems reported in the EMNLP'14 Code-Switching Workshop by 1% in accuracy, or by 17% in error rate reduction.
In this paper, we argue for the need to distinguish between task and dialogue initiatives, and present a model for tracking shifts in both types of initiatives in dialogue interactions. Our model predicts the initiative holders in the next dialogue turn based on the current initiative holders and the effect that observed cues have on changing them. Our evaluation across various corpora shows that the use of cues consistently improves the accuracy in the system's prediction of task and dialogue initiative holders by 2-4 and 8-13 percentage points, respectively, thus illustrating the generality of our model.