Models, code, and papers for "Wei Hua":

To deal with changing environments, a new performance measure, adaptive regret, defined as the maximum static regret over any interval, has been proposed in online learning. Under the setting of online convex optimization, several algorithms have been successfully developed to minimize the adaptive regret. However, existing algorithms lack universality in the sense that they can only handle one type of convex function and need a priori knowledge of parameters. By contrast, there exist universal algorithms, such as MetaGrad, that attain optimal static regret for multiple types of convex functions simultaneously. Along this line of research, this paper presents the first universal algorithm for minimizing the adaptive regret of convex functions. Specifically, we borrow the idea of maintaining multiple learning rates in MetaGrad to handle the uncertainty of functions, and utilize the technique of sleeping experts to capture changing environments. In this way, our algorithm automatically adapts to the property of functions (convex, exponentially concave, or strongly convex), as well as the nature of environments (stationary or changing). As a by-product, it also allows the type of functions to switch between rounds.
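
The definition of adaptive regret given in words above can be written out as follows. The notation is assumed here for illustration and is not taken from the abstract: $f_t$ is the loss function revealed in round $t$, $\mathbf{w}_t$ is the learner's decision, $\mathcal{W}$ is the feasible domain, and $T$ is the number of rounds.

```latex
\mathrm{A\text{-}Regret}(T)
  = \max_{1 \le r \le s \le T}
    \left(
      \sum_{t=r}^{s} f_t(\mathbf{w}_t)
      - \min_{\mathbf{w} \in \mathcal{W}} \sum_{t=r}^{s} f_t(\mathbf{w})
    \right)
```

The inner bracket is the static regret on the interval $[r, s]$; taking the maximum over all intervals forces the learner to compete with a different fixed comparator on each interval, which is why this measure suits changing environments.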

Reusable model design becomes desirable with the rapid expansion of machine learning applications. In this paper, we focus on the reusability of pre-trained deep convolutional models. Specifically, different from treating pre-trained models as feature extractors, we reveal more treasures beneath convolutional layers, i.e., the convolutional activations could act as a detector for the common object in the image co-localization problem. We propose a simple but effective method, named Deep Descriptor Transforming (DDT), for evaluating the correlations of descriptors and then obtaining the category-consistent regions, which can accurately locate the common object in a set of images. Empirical studies validate the effectiveness of the proposed DDT method. On benchmark image co-localization datasets, DDT consistently outperforms existing state-of-the-art methods by a large margin. Moreover, DDT also demonstrates good generalization ability for unseen categories and robustness for dealing with noisy data.

Disagreement-based approaches generate multiple classifiers and exploit the disagreement among them with unlabeled data to improve learning performance. Co-training is a representative paradigm, which trains two classifiers separately on two sufficient and redundant views; for applications where there is only one view, several successful variants of co-training that use two different classifiers on single-view data have been proposed. Several important issues concerning these disagreement-based approaches remain unsolved. In this article, we present theoretical analyses that address these issues and provide a theoretical foundation for co-training and disagreement-based approaches.

AUC (area under ROC curve) is an important evaluation criterion, which has been popularly used in many learning tasks such as class-imbalance learning, cost-sensitive learning, learning to rank, etc. Many learning approaches try to optimize AUC, but owing to the non-convexity and discontinuity of AUC, almost all of them work with surrogate loss functions. Thus, the consistency of such surrogate losses with AUC is crucial; however, it has been almost untouched before. In this paper, we provide a sufficient condition for the asymptotic consistency of learning approaches based on surrogate loss functions. Based on this result, we prove that exponential loss and logistic loss are consistent with AUC, but hinge loss is inconsistent. Then, we derive the $q$-norm hinge loss and general hinge loss that are consistent with AUC. We also derive the consistent bounds for exponential loss and logistic loss, and obtain the consistent bounds for many surrogate loss functions under the non-noise setting. Further, we disclose an equivalence between the exponential surrogate loss of AUC and exponential surrogate loss of accuracy, and one straightforward consequence of this finding is that AdaBoost and RankBoost are equivalent.
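
To make the distinction between AUC and its surrogates concrete, here is a minimal sketch in plain Python. It computes the empirical AUC as the fraction of correctly ordered positive/negative pairs, and the corresponding pairwise surrogate risk for the three losses named above. The function names and the toy scores are hypothetical and chosen only for illustration; the sketch does not reproduce the paper's analysis.

```python
import math

def auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as one half."""
    total = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

def pairwise_surrogate(pos_scores, neg_scores, phi):
    """Average of phi(score_pos - score_neg) over all positive/negative pairs."""
    total = sum(phi(sp - sn) for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

# Surrogate losses mentioned in the abstract, as functions of the score gap z.
def exp_loss(z):
    return math.exp(-z)

def logistic_loss(z):
    return math.log(1.0 + math.exp(-z))

def hinge_loss(z):
    return max(0.0, 1.0 - z)

if __name__ == "__main__":
    pos = [0.9, 0.8]    # hypothetical scores of positive instances
    neg = [0.1, 0.85]   # hypothetical scores of negative instances
    print("AUC:", auc(pos, neg))
    for name, phi in [("exp", exp_loss), ("logistic", logistic_loss), ("hinge", hinge_loss)]:
        print(name, pairwise_surrogate(pos, neg, phi))
```

The AUC itself is a step function of the scores, which is why it cannot be optimized directly by gradient methods; each surrogate replaces the 0/1 pair indicator with a convex function of the score gap.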

Great successes of deep neural networks have been witnessed in various real applications. Many algorithmic and implementation techniques have been developed; however, theoretical understanding of many aspects of deep neural networks is far from clear. A particularly interesting issue is the usefulness of dropout, which was motivated by the intuition of preventing complex co-adaptation of feature detectors. In this paper, we study the Rademacher complexity of different types of dropout, and our theoretical results disclose that for shallow neural networks (with one or no hidden layer) dropout is able to reduce the Rademacher complexity polynomially, whereas for deep neural networks it can amazingly lead to an exponential reduction of the Rademacher complexity.

Margin theory provides one of the most popular explanations for the success of \texttt{AdaBoost}, where the central point lies in the recognition that \textit{margin} is the key for characterizing the performance of \texttt{AdaBoost}. This theory has been very influential, e.g., it has been used to argue that \texttt{AdaBoost} usually does not overfit since it tends to enlarge the margin even after the training error reaches zero. Previously, the \textit{minimum margin bound} was established for \texttt{AdaBoost}; however, \cite{Breiman1999} pointed out that maximizing the minimum margin does not necessarily lead to a better generalization. Later, \cite{Reyzin:Schapire2006} emphasized that the margin distribution rather than minimum margin is crucial to the performance of \texttt{AdaBoost}. In this paper, we first present the \textit{$k$th margin bound} and further study its relationship to previous work such as the minimum margin bound and Emargin bound. Then, we improve the previous empirical Bernstein bounds \citep{Maurer:Pontil2009,Audibert:Munos:Szepesvari2009}, and based on such findings, we defend the margin-based explanation against Breiman's doubts by proving a new generalization error bound that considers exactly the same factors as \cite{Schapire:Freund:Bartlett:Lee1998} but is sharper than \cite{Breiman1999}'s minimum margin bound. By incorporating factors such as average margin and variance, we present a generalization error bound that is heavily related to the whole margin distribution. We also provide margin distribution bounds for generalization error of voting classifiers in finite VC-dimension space.
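
The quantities the bounds above depend on (minimum margin, $k$th margin, average margin and margin variance) can be computed for any weighted voting classifier. The sketch below assumes the standard normalized margin $y \sum_t \alpha_t h_t(x) / \sum_t |\alpha_t|$ with base predictions in $\{-1, +1\}$; the function names and the toy ensemble are hypothetical, and the code illustrates the statistics only, not the bounds themselves.

```python
import statistics

def margins(alphas, predictions, labels):
    """Normalized margins of a voting classifier.

    alphas:      weight of each base classifier
    predictions: predictions[t][i] is the +1/-1 output of classifier t on example i
    labels:      true +1/-1 label of each example
    """
    total_weight = sum(abs(a) for a in alphas)
    result = []
    for i, y in enumerate(labels):
        vote = sum(a * preds[i] for a, preds in zip(alphas, predictions))
        result.append(y * vote / total_weight)
    return result

def kth_margin(margin_values, k):
    """The k-th smallest margin (k = 1 gives the minimum margin)."""
    return sorted(margin_values)[k - 1]

if __name__ == "__main__":
    # Hypothetical ensemble of three base classifiers on three examples.
    alphas = [0.5, 0.3, 0.2]
    predictions = [
        [+1, +1, -1],
        [+1, -1, -1],
        [-1, +1, +1],
    ]
    labels = [+1, +1, -1]

    m = margins(alphas, predictions, labels)
    print("margins:        ", m)
    print("minimum margin: ", kth_margin(m, 1))
    print("2nd margin:     ", kth_margin(m, 2))
    print("average margin: ", statistics.mean(m))
    print("margin variance:", statistics.pvariance(m))
```

A positive margin means the weighted vote is correct; the size of the margin reflects how decisive the vote is, which is why the whole distribution, and not just its minimum, carries information about generalization.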

The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under the multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be $\widetilde{O}(\log\frac{1}{\epsilon})$, in contrast to the single-view setting, where a polynomial improvement is the best possible achievement. We also prove that in the general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, in contrast to previous polynomial bounds, where the order of $1/\epsilon$ is related to the parameter in Tsybakov noise.

To integrate strategic, tactical and operational decisions, the two-stage optimization has been widely used to guide dynamic decision making. In this paper, we study the two-stage stochastic programming for complex systems with unknown response estimated by simulation. We introduce the global-local metamodel assisted two-stage optimization via simulation that can efficiently employ the simulation resource to iteratively solve for the optimal first- and second-stage decisions. Specifically, at each visited first-stage decision, we develop a local metamodel to simultaneously solve a set of scenario-based second-stage optimization problems, which also allows us to estimate the optimality gap. Then, we construct a global metamodel accounting for the errors induced by: (1) using a finite number of scenarios to approximate the expected future cost occurring in the planning horizon, (2) second-stage optimality gap, and (3) finite visited first-stage decisions. Assisted by the global-local metamodel, we propose a new simulation optimization approach that can efficiently and iteratively search for the optimal first- and second-stage decisions. Our framework can guarantee the convergence of optimal solution for the discrete two-stage optimization with unknown objective, and the empirical study indicates that it achieves substantial efficiency and accuracy.

Developing high-performance entity normalization algorithms that can alleviate the term variation problem is of great interest to the biomedical community. Although deep learning-based methods have been successfully applied to biomedical entity normalization, they often depend on traditional context-independent word embeddings. Bidirectional Encoder Representations from Transformers (BERT), BERT for Biomedical Text Mining (BioBERT) and BERT for Clinical Text Mining (ClinicalBERT) were recently introduced to pre-train contextualized word representation models using bidirectional Transformers, advancing the state-of-the-art for many natural language processing tasks. In this study, we proposed an entity normalization architecture by fine-tuning the pre-trained BERT / BioBERT / ClinicalBERT models and conducted extensive experiments to evaluate the effectiveness of the pre-trained models for biomedical entity normalization using three different types of datasets. Our experimental results show that the best fine-tuned models consistently outperformed previous methods and advanced the state-of-the-art for biomedical entity normalization, with up to 1.17% increase in accuracy.

This paper considers the optimal control problem of an extended spring-loaded inverted pendulum (SLIP) model with two additional actuators for active leg length and hip torque modulation. These additional features arise naturally in practice, allowing for consideration of swing leg kinematics during flight and active control over stance dynamics. On the other hand, nonlinearity and the hybrid nature of the overall SLIP dynamics introduce challenges in the analysis and control of the model. In this paper, we first show that the stance dynamics of the considered SLIP model are differentially flat, which has a strong implication regarding controllability of the stance dynamics. Leveraging this powerful property, a tractable optimal control strategy is developed. This strategy enables online solution while also treating the hybrid nature of the SLIP dynamics. Together with the optimal control strategy, the extended SLIP model grants active disturbance rejection capability at any point during the gait. Performance of the proposed control strategy is demonstrated via numerical tests and shows significant advantage over existing methods.

A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity, as it gets rid of the classifier or human in the loop needed to select data before annotation and subsequent application of paraphrase identification algorithms in the previous work. We present the largest human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.
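
The collection idea described above, linking tweets through shared URLs, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the regular expression, the function name `candidate_pairs` and the example tweets are all hypothetical, and the real pipeline's filtering and preprocessing steps are not described in the abstract. The output is a list of candidate pairs only; as the abstract notes, the pairs still need human labeling to decide which ones are paraphrases.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simple URL pattern; an assumption made for this sketch.
URL_RE = re.compile(r"https?://\S+")

def candidate_pairs(tweets):
    """Group tweets by the URLs they contain and pair up the texts in each group."""
    by_url = defaultdict(list)
    for text in tweets:
        urls = set(URL_RE.findall(text))
        sentence = URL_RE.sub("", text).strip()  # tweet text without the URL
        for url in urls:
            by_url[url].append(sentence)

    pairs = []
    for url, sentences in by_url.items():
        unique = sorted(set(sentences))          # drop exact duplicates (e.g. retweets)
        for a, b in combinations(unique, 2):
            pairs.append((a, b, url))
    return pairs

if __name__ == "__main__":
    tweets = [
        "Storm closes the bridge https://example.com/a",
        "Bridge shut because of the storm https://example.com/a",
        "New phone released https://example.com/b",
    ]
    for a, b, url in candidate_pairs(tweets):
        print(f"{url}: {a!r} <-> {b!r}")
```

Because the grouping key is the URL itself, no classifier is needed to select the data before annotation, which is the simplicity the abstract emphasizes.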

Connecting consumers with relevant products is a very important problem in both online and offline commerce. In physical retail, product placement is an effective way to connect consumers with products. However, selecting product locations within a store can be a tedious process. Moreover, learning important spatial patterns in offline retail is challenging due to the scarcity of data and the high cost of exploration and experimentation in the physical world. To address these challenges, we propose a stochastic model of spatial demand in physical retail. We show that the proposed model is more predictive of demand than existing baselines. We also perform a preliminary study into different automation techniques and show that an optimal product allocation policy can be learned through Deep Q-Learning.

Traffic signal control is an important and challenging real-world problem, which aims to minimize the travel time of vehicles by coordinating their movements at the road intersections. Current traffic signal control systems in use still rely heavily on oversimplified information and rule-based methods, although we now have richer data, more computing power and advanced methods to drive the development of intelligent transportation. With the growing interest in intelligent transportation using machine learning methods like reinforcement learning, this survey covers the widely acknowledged transportation approaches and a comprehensive list of recent literature on reinforcement learning for traffic signal control. We hope this survey can foster interdisciplinary research on this important topic.

Printed text recognition is an important problem for industrial OCR systems. Printed text is constructed in a standard procedural fashion in most settings. We develop a mathematical model for this process that can be applied to the backward inference problem of text recognition from an image. Through ablation experiments we show that this model is realistic and that a multi-task objective setting can help to stabilize estimation of its free parameters, enabling use of conventional deep learning methods. Furthermore, by directly modeling the geometric perturbations of text synthesis we show that our model can help recover missing characters from incomplete text regions, the bane of multicomponent OCR systems, enabling recognition even when the detection returns incomplete information.

We propose a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data that commonly arises from imaging studies. Motivated by the equivalence of the conventional linear discriminant analysis and the ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties including a non-asymptotic risk bound and a rank consistency result are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over the existing approaches.
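
A generic form of nuclear norm penalized least squares for matrix-valued predictors is shown below for orientation. The notation is assumed here and is not taken from the abstract: $\mathbf{X}_i \in \mathbb{R}^{p \times q}$ is the $i$th matrix-valued observation, $y_i$ is a numeric coding of its class label, $b$ is an intercept, $\lambda > 0$ is a tuning parameter, and $\sigma_k(\mathbf{B})$ denotes the $k$th singular value of $\mathbf{B}$. The paper's exact response coding and scaling may differ.

```latex
\min_{\mathbf{B} \in \mathbb{R}^{p \times q},\; b \in \mathbb{R}}
  \frac{1}{2n} \sum_{i=1}^{n}
    \left( y_i - b - \langle \mathbf{X}_i, \mathbf{B} \rangle \right)^2
  + \lambda \, \|\mathbf{B}\|_{*},
\qquad
\|\mathbf{B}\|_{*} = \sum_{k} \sigma_k(\mathbf{B}),
\qquad
\langle \mathbf{X}, \mathbf{B} \rangle = \operatorname{tr}(\mathbf{X}^{\top} \mathbf{B})
```

The nuclear norm is the convex surrogate of matrix rank, so increasing $\lambda$ shrinks singular values toward zero and encourages the low-rank structure mentioned above.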

Nearest neighbor has always been one of the most appealing non-parametric approaches in machine learning, pattern recognition, computer vision, etc. Previous empirical studies partly show that nearest neighbor is resistant to noise, yet there is a lack of deep analysis. This work presents the finite-sample and distribution-dependent bounds on the consistency of nearest neighbor in the random noise setting. The theoretical results show that, for asymmetric noises, k-nearest neighbor is robust enough to classify most data correctly, except for a handful of examples, whose labels are totally misled by random noises. For symmetric noises, however, k-nearest neighbor achieves the same consistency rate as in the noise-free setting, which verifies the resistance of k-nearest neighbor to random noisy labels. Motivated by the theoretical analysis, we propose the Robust k-Nearest Neighbor (RkNN) approach to deal with noisy labels. The basic idea is to make unilateral corrections to examples, whose labels are totally misled by random noises, and classify the others directly by utilizing the robustness of k-nearest neighbor. We verify the effectiveness of the proposed algorithm both theoretically and empirically.
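
The intuition behind the robustness claim can be seen in a toy example: a single flipped label misleads 1-nearest neighbor, while a majority vote over several neighbors outvotes it. The sketch below is plain k-nearest neighbor on hypothetical one-dimensional data; it is not the RkNN algorithm, whose correction step is not specified in the abstract.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Hypothetical data: class 0 lives near 0..3, class 1 near 10..13.
    points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,), (13.0,)]
    noisy_labels = [0, 0, 1, 0, 1, 1, 1, 1]  # label of the point at 2.0 was flipped

    query = (2.1,)
    print("k=1:", knn_predict(points, noisy_labels, query, k=1))  # follows the flipped label
    print("k=3:", knn_predict(points, noisy_labels, query, k=3))  # neighbors outvote it
```

Using an odd `k` with two classes avoids tied votes in this sketch.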

The incidence of thyroid nodules is very high and generally increases with age. A thyroid nodule may presage the emergence of thyroid cancer, and it can be completely cured if detected early. Fine needle aspiration cytology is a recognized method for the early diagnosis of thyroid nodules. It still has some limitations, however, and ultrasound has become the first choice for auxiliary examination of thyroid nodular disease. If medical imaging technology and fine needle aspiration cytology could be combined, the diagnostic rate of thyroid nodules would improve significantly. The properties of ultrasound degrade image quality, which makes it difficult for physicians to recognize edges. Image segmentation based on graph theory has become a research hotspot. Normalized cut (Ncut) is a representative technique that is suitable for segmenting feature parts of medical images. However, solving the normalized cut is itself a problem: it requires a large memory capacity and heavy computation of the weight matrix, and it often produces over-segmentation or under-segmentation, which makes the segmentation inaccurate. The speckle noise in B-mode ultrasound images of thyroid tumors degrades image quality. In light of this characteristic, we combine the anisotropic diffusion model with the normalized cut in this paper. Enhancement with the anisotropic diffusion model removes the noise in the B-mode ultrasound image while preserving the important edges and local details. This reduces the amount of computation in constructing the weight matrix of the improved normalized cut and improves the accuracy of the final segmentation results. The experimental results demonstrate the feasibility of the method.
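
For readers unfamiliar with the normalized cut criterion, the sketch below evaluates the standard two-way Ncut value, cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), for a given partition of a small weighted graph. The weight matrix and function name are hypothetical. The sketch only scores a partition; it does not solve the eigenvalue problem normally used to find one, and it does not include the anisotropic diffusion step or the paper's improved variant.

```python
def ncut_value(weights, part_a):
    """Normalized cut value of the partition (A, B) of a weighted graph.

    weights: symmetric matrix, weights[i][j] is the similarity of nodes i and j
    part_a:  iterable of node indices forming side A; B is the complement
    """
    nodes = range(len(weights))
    a = set(part_a)
    b = [v for v in nodes if v not in a]

    cut = sum(weights[i][j] for i in a for j in b)          # weight crossing the cut
    assoc_a = sum(weights[i][j] for i in a for j in nodes)  # total weight attached to A
    assoc_b = sum(weights[i][j] for i in b for j in nodes)  # total weight attached to B
    return cut / assoc_a + cut / assoc_b

if __name__ == "__main__":
    # Hypothetical 4-node graph: nodes 0-1 and nodes 2-3 are tightly connected.
    w = [
        [0.0, 1.0, 0.1, 0.0],
        [1.0, 0.0, 0.0, 0.1],
        [0.1, 0.0, 0.0, 1.0],
        [0.0, 0.1, 1.0, 0.0],
    ]
    print("natural split {0,1} | {2,3}:", ncut_value(w, [0, 1]))
    print("poor split    {0,2} | {1,3}:", ncut_value(w, [0, 2]))
```

For an image, the weight matrix has one row per pixel (or region), which is the source of the memory and computation cost mentioned above.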

Recommender systems have attracted much attention during the past decade. Many attack detection algorithms have been developed for better recommendations, mostly focusing on shilling attacks, where an attack organizer produces a large number of user profiles by the same strategy to promote or demote an item. This work considers a different attack style: unorganized malicious attacks, where attackers individually utilize a small number of user profiles to attack different items without any organizer. This attack style occurs in many real applications, yet relevant study remains open. We first formulate the detection of unorganized malicious attacks as a matrix completion problem, and propose the Unorganized Malicious Attacks detection (UMA) approach, a proximal alternating splitting augmented Lagrangian method. We verify, both theoretically and empirically, the effectiveness of our proposed approach.

Despite being so vital to the success of Support Vector Machines, the principle of separating margin maximisation is not used in deep learning. We show that minimisation of margin variance, rather than maximisation of the margin, is more suitable for improving generalisation in deep architectures. We propose the Halfway loss function that minimises the Normalised Margin Variance (NMV) at the output of a deep learning model and evaluate its performance against the Softmax Cross-Entropy loss on the MNIST, smallNORB and CIFAR-10 datasets.

AUC is an important performance measure and many algorithms have been devoted to AUC optimization, mostly by minimizing a surrogate convex loss on a training data set. In this work, we focus on one-pass AUC optimization that requires only going through the training data once without storing the entire training dataset, where conventional online learning algorithms cannot be applied directly because AUC is measured by a sum of losses defined over pairs of instances from different classes. We develop a regression-based algorithm which only needs to maintain the first- and second-order statistics of training data in memory, resulting in a storage requirement independent of the size of the training data. To efficiently handle high-dimensional data, we develop a randomized algorithm that approximates the covariance matrices by low-rank matrices. We verify, both theoretically and empirically, the effectiveness of the proposed algorithm.
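
Why first- and second-order statistics can suffice is easiest to see for a pairwise square loss, which is an assumption made for this sketch (the abstract says only "regression-based"). Averaged over all positive/negative pairs, the loss $(1 - w^\top(x^+ - x^-))^2$ equals $(1 - w^\top(\mu^+ - \mu^-))^2 + w^\top(S^+ + S^-)w$, where $\mu^\pm$ and $S^\pm$ are the class means and (population) covariance matrices. The class and function names below are hypothetical, and the sketch evaluates the loss only; it is not the paper's update rule or its low-rank approximation.

```python
class ClassStats:
    """Running count, sum and sum of outer products for one class (one pass)."""

    def __init__(self, dim):
        self.n = 0
        self.total = [0.0] * dim
        self.outer = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.n += 1
        for i, xi in enumerate(x):
            self.total[i] += xi
            for j, xj in enumerate(x):
                self.outer[i][j] += xi * xj

    def mean(self):
        return [s / self.n for s in self.total]

    def covariance(self):
        m = self.mean()
        d = len(m)
        return [[self.outer[i][j] / self.n - m[i] * m[j] for j in range(d)]
                for i in range(d)]

def loss_from_stats(w, pos, neg):
    """Average pairwise square loss computed from the stored statistics only."""
    mp, mn = pos.mean(), neg.mean()
    gap = sum(wi * (a - b) for wi, a, b in zip(w, mp, mn))
    cp, cn = pos.covariance(), neg.covariance()
    d = len(w)
    spread = sum(w[i] * (cp[i][j] + cn[i][j]) * w[j]
                 for i in range(d) for j in range(d))
    return (1.0 - gap) ** 2 + spread

def loss_brute_force(w, pos_x, neg_x):
    """Same loss computed by enumerating every pair (needs all data in memory)."""
    total = 0.0
    for xp in pos_x:
        for xn in neg_x:
            score_gap = sum(wi * (a - b) for wi, a, b in zip(w, xp, xn))
            total += (1.0 - score_gap) ** 2
    return total / (len(pos_x) * len(neg_x))

if __name__ == "__main__":
    pos_x = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]  # hypothetical positive instances
    neg_x = [[0.0, 0.0], [1.0, -1.0]]             # hypothetical negative instances
    w = [0.5, -0.25]

    pos, neg = ClassStats(2), ClassStats(2)
    for x in pos_x:
        pos.update(x)
    for x in neg_x:
        neg.update(x)

    print("from statistics:", loss_from_stats(w, pos, neg))
    print("brute force:    ", loss_brute_force(w, pos_x, neg_x))
```

The stored state has size O(d^2) for d features regardless of how many instances are streamed, which matches the storage claim above; the quadratic dependence on d is also what motivates a low-rank approximation of the covariance matrices for high-dimensional data.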