Models, code, and papers for "Hao Tong":
Surrogate-assisted evolutionary algorithms (SAEAs) are powerful optimisation tools for computationally expensive problems (CEPs). However, a randomly selected algorithm may fail in solving unknown problems due to no free lunch theorems, and it will cause more computational resource if we re-run the algorithm or try other algorithms to get a much solution, which is more serious in CEPs. In this paper, we consider an algorithm portfolio for SAEAs to reduce the risk of choosing an inappropriate algorithm for CEPs. We propose two portfolio frameworks for very expensive problems in which the maximal number of fitness evaluations is only 5 times of the problem's dimension. One framework named Par-IBSAEA runs all algorithm candidates in parallel and a more sophisticated framework named UCB-IBSAEA employs the Upper Confidence Bound (UCB) policy from reinforcement learning to help select the most appropriate algorithm at each iteration. An effective reward definition is proposed for the UCB policy. We consider three state-of-the-art individual-based SAEAs on different problems and compare them to the portfolios built from their instances on several benchmark problems given limited computation budgets. Our experimental studies demonstrate that our proposed portfolio frameworks significantly outperform any single algorithm on the set of benchmark problems.
We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor-box free, as well as proposal free. By eliminating the pre-defined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training and significantly reduces the training memory footprint. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), our detector FCOS outperforms previous anchor-based one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at: https://tinyurl.com/FCOSv1
We extend Convolutional Neural Networks (CNNs) on flat and regular domains (e.g. 2D images) to curved surfaces embedded in 3D Euclidean space that are discretized as irregular meshes and widely used to represent geometric data in Computer Vision and Graphics. We define surface convolution on tangent spaces of a surface domain, where the convolution has two desirable properties: 1) the distortion of surface domain signals is locally minimal when being projected to the tangent space, and 2) the translation equi-variance property holds locally, by aligning tangent spaces with the canonical parallel transport that preserves metric. For computation, we rely on a parallel N-direction frame field on the surface that minimizes field variation and therefore is as compatible as possible to and approximates the parallel transport. On the tangent spaces equipped with parallel frames, the computation of surface convolution becomes standard routine. The frames have rotational symmetry which we disambiguate by constructing the covering space of surface induced by the parallel frames and grouping the feature maps into N sets accordingly; convolution is computed on the N branches of the cover space with respective feature maps while the kernel weights are shared. To handle irregular points of a discrete mesh while sharing kernel weights, we make the convolution semi-discrete, i.e. the convolution kernels are polynomial functions, and their convolution with discrete surface points becomes sampling and weighted summation. Pooling and unpooling operations are computed along a mesh hierarchy built through simplification. The presented surface CNNs allow effective deep learning on meshes. We show that for tasks of classification, segmentation and non-rigid registration, surface CNNs using only raw input signals achieve superior performances than previous models using sophisticated input features.
Very expensive problems are very common in practical system that one fitness evaluation costs several hours or even days. Surrogate assisted evolutionary algorithms (SAEAs) have been widely used to solve this crucial problem in the past decades. However, most studied SAEAs focus on solving problems with a budget of at least ten times of the dimension of problems which is unacceptable in many very expensive real-world problems. In this paper, we employ Voronoi diagram to boost the performance of SAEAs and propose a novel framework named Voronoi-based efficient surrogate assisted evolutionary algorithm (VESAEA) for very expensive problems, in which the optimization budget, in terms of fitness evaluations, is only 5 times of the problem's dimension. In the proposed framework, the Voronoi diagram divides the whole search space into several subspace and then the local search is operated in some potentially better subspace. Additionally, in order to trade off the exploration and exploitation, the framework involves a global search stage developed by combining leave-one-out cross-validation and radial basis function surrogate model. A performance selector is designed to switch the search dynamically and automatically between the global and local search stages. The empirical results on a variety of benchmark problems demonstrate that the proposed framework significantly outperforms several state-of-art algorithms with extremely limited fitness evaluations. Besides, the efficacy of Voronoi-diagram is furtherly analyzed, and the results show its potential to optimize very expensive problems.
With a lot of work about context-free question answering systems, there is an emerging trend of conversational question answering models in the natural language processing field. Thanks to the recently collected datasets, including QuAC and CoQA, there has been more work on conversational question answering, and some models have achieved competitive performance on both datasets. However, to best of our knowledge, two important questions for conversational comprehension research have not been well studied: 1) How well can the benchmark dataset reflect models' content understanding? 2) Do the models utilize the conversation content well? To investigate the two questions, we design different training settings, testing settings, as well as an attack to verify the models' capability of content understanding on QuAC and CoQA. The results indicate some potential hazards of using QuAC and CoQA for conversational comprehension research. Our analysis also sheds some light one both models and datasets. Given deeper investigation of the task, it is believed that this work is beneficial to the future progress of conversation comprehension.
High levels of air pollution may seriously affect people's living environment and even endanger their lives. In order to reduce air pollution concentrations, and warn the public before the occurrence of hazardous air pollutants, it is urgent to design an accurate and reliable air pollutant forecasting model. However, most previous research have many deficiencies, such as ignoring the importance of predictive stability, and poor initial parameters and so on, which have significantly effect on the performance of air pollution prediction. Therefore, to address these issues, a novel hybrid model is proposed in this study. Specifically, a powerful data preprocessing techniques is applied to decompose the original time series into different modes from low- frequency to high- frequency. Next, a new multi-objective algorithm called MOHHO is first developed in this study, which are introduced to tune the parameters of ELM model with high forecasting accuracy and stability for air pollution series prediction, simultaneously. And the optimized ELM model is used to perform the time series prediction. Finally, a scientific and robust evaluation system including several error criteria, benchmark models, and several experiments using six air pollutant concentrations time series from three cities in China is designed to perform a compressive assessment for the presented hybrid forecasting model. Experimental results indicate that the proposed hybrid model can guarantee a more stable and higher predictive performance compared to others, whose superior prediction ability may help to develop effective plans for air pollutant emissions and prevent health problems caused by air pollution.
The CCKS2019 shared task was devoted to inter-personal relationship extraction. Given two person entities and at least one sentence containing these two entities, participating teams are asked to predict the relationship between the entities according to a given relation list. This year, 358 teams from various universities and organizations participated in this task. In this paper, we present the task definition, the description of data and the evaluation methodology used during this shared task. We also present a brief overview of the various methods adopted by the participating teams. Finally, we present the evaluation results.
This paper presents a novel end-to-end Learned Point Cloud Geometry Compression (a.k.a., Learned-PCGC) framework, to efficiently compress the point cloud geometry (PCG) using deep neural networks (DNN) based variational autoencoders (VAE). In our approach, PCG is first voxelized, scaled and partitioned into non-overlapped 3D cubes, which is then fed into stacked 3D convolutions for compact latent feature and hyperprior generation. Hyperpriors are used to improve the conditional probability modeling of latent features. A weighted binary cross-entropy (WBCE) loss is applied in training while an adaptive thresholding is used in inference to remove unnecessary voxels and reduce the distortion. Objectively, our method exceeds the geometry-based point cloud compression (G-PCC) algorithm standardized by well-known Moving Picture Experts Group (MPEG) with a significant performance margin, e.g., at least 60% BD-Rate (Bjontegaard Delta Rate) gains, using common test datasets. Subjectively, our method has presented better visual quality with smoother surface reconstruction and appealing details, in comparison to all existing MPEG standard compliant PCC methods. Our method requires about 2.5MB parameters in total, which is a fairly small size for practical implementation, even on embedded platform. Additional ablation studies analyze a variety of aspects (e.g., cube size, kernels, etc) to explore the application potentials of our learned-PCGC.
End-to-end dialogue generation has achieved promising results without using handcrafted features and attributes specific for each task and corpus. However, one of the fatal drawbacks in such approaches is that they are unable to generate informative utterances, so it limits their usage from some real-world conversational applications. This paper attempts at generating diverse and informative responses with a variational generation model, which contains a joint attention mechanism conditioning on the information from both dialogue contexts and extra knowledge.
Omnidirectional scene text detection has received increasing research attention. Previous methods directly predict words or text lines of quadrilateral shapes. However, most methods neglect the significance of consistent labeling, which is important to maintain a stable training process, especially when a large amount of data are included. For the first time, we solve the problem in this paper by proposing a novel method termed Sequential-free Box Discretization (SBD). The proposed SBD first discretizes the quadrilateral box into several key edges, which contains all potential horizontal and vertical positions. In order to decode accurate vertex positions, a simple yet effective matching procedure is proposed to reconstruct the quadrilateral bounding boxes. It departs from the learning ambiguity which has a significant influence during the learning process. Exhaustive ablation studies have been conducted to quantitatively validate the effectiveness of our proposed method. More importantly, built upon SBD, we provide a detailed analysis of the impact of a collection of refinements, in the hope to inspire others to build state-of-the-art networks. Combining both SBD and these useful refinements, we achieve state-of-the-art performance on various benchmarks, including ICDAR 2015, and MLT. Our method also wins the first place in text detection task of the recent ICDAR2019 Robust Reading Challenge on Reading Chinese Text on Signboard, further demonstrating its powerful generalization ability. Code is available at https://tinyurl.com/sbdnet.
As facial appearance is subject to significant intra-class variations caused by the aging process over time, age-invariant face recognition (AIFR) remains a major challenge in face recognition community. To reduce the intra-class discrepancy caused by the aging, in this paper we propose a novel approach (namely, Orthogonal Embedding CNNs, or OE-CNNs) to learn the age-invariant deep face features. Specifically, we decompose deep face features into two orthogonal components to represent age-related and identity-related features. As a result, identity-related features that are robust to aging are then used for AIFR. Besides, for complementing the existing cross-age datasets and advancing the research in this field, we construct a brand-new large-scale Cross-Age Face dataset (CAF). Extensive experiments conducted on the three public domain face aging datasets (MORPH Album 2, CACD-VS and FG-NET) have shown the effectiveness of the proposed approach and the value of the constructed CAF dataset on AIFR. Benchmarking our algorithm on one of the most popular general face recognition (GFR) dataset LFW additionally demonstrates the comparable generalization performance on GFR.
We introduce SDM-NET, a deep generative neural network which produces structured deformable meshes. Specifically, the network is trained to generate a spatial arrangement of closed, deformable mesh parts, which respect the global part structure of a shape collection, e.g., chairs, airplanes, etc. Our key observation is that while the overall structure of a 3D shape can be complex, the shape can usually be decomposed into a set of parts, each homeomorphic to a box, and the finer-scale geometry of the part can be recovered by deforming the box. The architecture of SDM-NET is that of a two-level variational autoencoder (VAE). At the part level, a PartVAE learns a deformable model of part geometries. At the structural level, we train a Structured Parts VAE (SP-VAE), which jointly learns the part structure of a shape collection and the part geometries, ensuring a coherence between global shape structure and surface details. Through extensive experiments and comparisons with the state-of-the-art deep generative models of shapes, we demonstrate the superiority of SDM-NET in generating meshes with visual quality, flexible topology, and meaningful structures, which benefit shape interpolation and other subsequently modeling tasks.
Recently, the Network Representation Learning (NRL) techniques, which represent graph structure via low-dimension vectors to support social-oriented application, have attracted wide attention. Though large efforts have been made, they may fail to describe the multiple aspects of similarity between social users, as only a single vector for one unique aspect has been represented for each node. To that end, in this paper, we propose a novel end-to-end framework named MCNE to learn multiple conditional network representations, so that various preferences for multiple behaviors could be fully captured. Specifically, we first design a binary mask layer to divide the single vector as conditional embeddings for multiple behaviors. Then, we introduce the attention network to model interaction relationship among multiple preferences, and further utilize the adapted message sending and receiving operation of graph neural network, so that multi-aspect preference information from high-order neighbors will be captured. Finally, we utilize Bayesian Personalized Ranking loss function to learn the preference similarity on each behavior, and jointly learn multiple conditional node embeddings via multi-task learning framework. Extensive experiments on public datasets validate that our MCNE framework could significantly outperform several state-of-the-art baselines, and further support the visualization and transfer learning tasks with excellent interpretability and robustness.
Understanding the coordinated activity underlying brain computations requires large-scale, simultaneous recordings from distributed neuronal structures at a cellular-level resolution. One major hurdle to design high-bandwidth, high-precision, large-scale neural interfaces lies in the formidable data streams that are generated by the recorder chip and need to be online transferred to a remote computer. The data rates can require hundreds to thousands of I/O pads on the recorder chip and power consumption on the order of Watts for data streaming alone. We developed a deep learning-based compression model to reduce the data rate of multichannel action potentials. The proposed model is built upon a deep compressive autoencoder (CAE) with discrete latent embeddings. The encoder is equipped with residual transformations to extract representative features from spikes, which are mapped into the latent embedding space and updated via vector quantization (VQ). The decoder network reconstructs spike waveforms from the quantized latent embeddings. Experimental results show that the proposed model consistently outperforms conventional methods by achieving much higher compression ratios (20-500x) and better or comparable reconstruction accuracies. Testing results also indicate that CAE is robust against a diverse range of imperfections, such as waveform variation and spike misalignment, and has minor influence on spike sorting accuracy. Furthermore, we have estimated the hardware cost and real-time performance of CAE and shown that it could support thousands of recording channels simultaneously without excessive power/heat dissipation. The proposed model can reduce the required data transmission bandwidth in large-scale recording experiments and maintain good signal qualities. The code of this work has been made available at https://github.com/tong-wu-umn/spike-compression-autoencoder
We present a new method for black-box adversarial attack. Unlike previous methods that combined transfer-based and scored-based methods by using the gradient or initialization of a surrogate white-box model, this new method tries to learn a low-dimensional embedding using a pretrained model, and then performs efficient search within the embedding space to attack an unknown target network. The method produces adversarial perturbations with high level semantic patterns that are easily transferable. We show that this approach can greatly improve the query efficiency of black-box adversarial attack across different target network architectures. We evaluate our approach on MNIST, ImageNet and Google Cloud Vision API, resulting in a significant reduction on the number of queries. We also attack adversarially defended networks on CIFAR10 and ImageNet, where our method not only reduces the number of queries, but also improves the attack success rate.
Accurate state estimation is a fundamental module for various intelligent applications, such as robot navigation, autonomous driving, virtual and augmented reality. Visual and inertial fusion is a popular technology for 6-DOF state estimation in recent years. Time instants at which different sensors' measurements are recorded are of crucial importance to the system's robustness and accuracy. In practice, timestamps of each sensor typically suffer from triggering and transmission delays, leading to temporal misalignment (time offsets) among different sensors. Such temporal offset dramatically influences the performance of sensor fusion. To this end, we propose an online approach for calibrating temporal offset between visual and inertial measurements. Our approach achieves temporal offset calibration by jointly optimizing time offset, camera and IMU states, as well as feature locations in a SLAM system. Furthermore, the approach is a general model, which can be easily employed in several feature-based optimization frameworks. Simulation and experimental results demonstrate the high accuracy of our calibration approach even compared with other state-of-art offline tools. The VIO comparison against other methods proves that the online temporal calibration significantly benefits visual-inertial systems. The source code of temporal calibration is integrated into our public project, VINS-Mono.
Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling can guarantee that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.
Stochastic Gradient Descent (SGD) is a popular optimization method which has been applied to many important machine learning tasks such as Support Vector Machines and Deep Neural Networks. In order to parallelize SGD, minibatch training is often employed. The standard approach is to uniformly sample a minibatch at each step, which often leads to high variance. In this paper we propose a stratified sampling strategy, which divides the whole dataset into clusters with low within-cluster variance; we then take examples from these clusters using a stratified sampling technique. It is shown that the convergence rate can be significantly improved by the algorithm. Encouraging experimental results confirm the effectiveness of the proposed method.
Many image processing tasks involve image-to-image mapping, which can be addressed well by fully convolutional networks (FCN) without any heavy preprocessing. Although empirically designing and training FCNs can achieve satisfactory results, reasons for the improvement in performance are slightly ambiguous. Our study is to make progress in understanding their generalization abilities through visualizing the optimization landscapes. The visualization of objective functions is obtained by choosing a solution and projecting its vicinity onto a 3D space. We compare three FCN-based networks (two existing models and a new proposed in this paper for comparison) on multiple datasets. It has been observed in practice that the connections from the pre-pooled feature maps to the post-upsampled can achieve better results. We investigate the cause and provide experiments to shows that the skip-layer connections in FCN can promote flat optimization landscape, which is well known to generalize better. Additionally, we explore the relationship between the models generalization ability and loss surface under different batch sizes. Results show that large-batch training makes the model converge to sharp minimizers with chaotic vicinities while small-batch method leads the model to flat minimizers with smooth and nearly convex regions. Our work may contribute to insights and analysis for designing and training FCNs.