Models, code, and papers for "Ying Qu":

Photorealistic style transfer aims to transfer the style of a reference photo onto a content photo naturally, such that the stylized image looks like a real photo taken by a camera. Existing state-of-the-art methods are prone to spatial structure distortion of the content image and global color inconsistency across different semantic objects, making the results less photorealistic. In this paper, we propose a one-shot mutual Dirichlet network, to address these challenging issues. The essential contribution of the work is the realization of a representation scheme that successfully decouples the spatial structure and color information of images, such that the spatial structure can be well preserved during stylization. This representation is discriminative and context-sensitive with respect to semantic objects. It is extracted with a shared sparse Dirichlet encoder. Moreover, such representation is encouraged to be matched between the content and style images for faithful color transfer. The affine-transfer model is embedded in the decoder of the network to facilitate the color transfer. The strong representative and discriminative power of the proposed network enables one-shot learning given only one content-style image pair. Experimental results demonstrate that the proposed method is able to generate photorealistic photos without spatial distortion or abrupt color changes.

Hyperspectral images (HSI) provide rich spectral information that contributed to the successful performance improvement of numerous computer vision tasks. However, it can only be achieved at the expense of images' spatial resolution. Hyperspectral image super-resolution (HSI-SR) addresses this problem by fusing low resolution (LR) HSI with multispectral image (MSI) carrying much higher spatial resolution (HR). All existing HSI-SR approaches require the LR HSI and HR MSI to be well registered and the reconstruction accuracy of the HR HSI relies heavily on the registration accuracy of different modalities. This paper exploits the uncharted problem domain of HSI-SR without the requirement of multi-modality registration. Given the unregistered LR HSI and HR MSI with overlapped regions, we design a unique unsupervised learning structure linking the two unregistered modalities by projecting them into the same statistical space through the same encoder. The mutual information (MI) is further adopted to capture the non-linear statistical dependencies between the representations from two modalities (carrying spatial information) and their raw inputs. By maximizing the MI, spatial correlations between different modalities can be well characterized to further reduce the spectral distortion. A collaborative $l_{2,1}$ norm is employed as the reconstruction error instead of the more common $l_2$ norm, so that individual pixels can be recovered as accurately as possible. With this design, the network allows to extract correlated spectral and spatial information from unregistered images that better preserves the spectral information. The proposed method is referred to as unregistered and unsupervised mutual Dirichlet Net ($u^2$-MDN). Extensive experimental results using benchmark HSI datasets demonstrate the superior performance of $u^2$-MDN as compared to the state-of-the-art.

In many computer vision applications, obtaining images of high resolution in both the spatial and spectral domains are equally important. However, due to hardware limitations, one can only expect to acquire images of high resolution in either the spatial or spectral domains. This paper focuses on hyperspectral image super-resolution (HSI-SR), where a hyperspectral image (HSI) with low spatial resolution (LR) but high spectral resolution is fused with a multispectral image (MSI) with high spatial resolution (HR) but low spectral resolution to obtain HR HSI. Existing deep learning-based solutions are all supervised that would need a large training set and the availability of HR HSI, which is unrealistic. Here, we make the first attempt to solving the HSI-SR problem using an unsupervised encoder-decoder architecture that carries the following uniquenesses. First, it is composed of two encoder-decoder networks, coupled through a shared decoder, in order to preserve the rich spectral information from the HSI network. Second, the network encourages the representations from both modalities to follow a sparse Dirichlet distribution which naturally incorporates the two physical constraints of HSI and MSI. Third, the angular difference between representations are minimized in order to reduce the spectral distortion. We refer to the proposed architecture as unsupervised Sparse Dirichlet-Net, or uSDN. Extensive experimental results demonstrate the superior performance of uSDN as compared to the state-of-the-art.

Keyword Spotting (KWS) provides the start signal of ASR problem, and thus it is essential to ensure a high recall rate. However, its real-time property requires low computation complexity. This contradiction inspires people to find a suitable model which is small enough to perform well in multi environments. To deal with this contradiction, we implement the Hierarchical Neural Network(HNN), which is proved to be effective in many speech recognition problems. HNN outperforms traditional DNN and CNN even though its model size and computation complexity are slightly less. Also, its simple topology structure makes easy to deploy on any device.

Compressed sensing has shown great potential in reducing data acquisition time in magnetic resonance imaging (MRI). Recently, a spread spectrum compressed sensing MRI method modulates an image with a quadratic phase. It performs better than the conventional compressed sensing MRI with variable density sampling, since the coherence between the sensing and sparsity bases are reduced. However, spread spectrum in that method is implemented via a shim coil which limits its modulation intensity and is not convenient to operate. In this letter, we propose to apply chirp (linear frequency-swept) radio frequency pulses to easily control the spread spectrum. To accelerate the image reconstruction, an alternating direction algorithm is modified by exploiting the complex orthogonality of the quadratic phase encoding. Reconstruction on the acquired data demonstrates that more image features are preserved using the proposed approach than those of conventional CS-MRI.

To improve the efficiency of surgical trajectory segmentation for robot learning in robot-assisted minimally invasive surgery, this paper presents a fast unsupervised method using video and kinematic data, followed by a promoting procedure to address the over-segmentation issue. Unsupervised deep learning network, stacking convolutional auto-encoder, is employed to extract more discriminative features from videos in an effective way. To further improve the accuracy of segmentation, on one hand, wavelet transform is used to filter out the noises existed in the features from video and kinematic data. On the other hand, the segmentation result is promoted by identifying the adjacent segments with no state transition based on the predefined similarity measurements. Extensive experiments on a public dataset JIGSAWS show that our method achieves much higher accuracy of segmentation than state-of-the-art methods in the shorter time.

Predicting user responses, such as clicks and conversions, is of great importance and has found its usage in many Web applications including recommender systems, web search and online advertising. The data in those applications is mostly categorical and contains multiple fields; a typical representation is to transform it into a high-dimensional sparse binary feature representation via one-hot encoding. Facing with the extreme sparsity, traditional models may limit their capacity of mining shallow patterns from the data, i.e. low-order feature combinations. Deep models like deep neural networks, on the other hand, cannot be directly applied for the high-dimensional input because of the huge feature space. In this paper, we propose a Product-based Neural Networks (PNN) with an embedding layer to learn a distributed representation of the categorical data, a product layer to capture interactive patterns between inter-field categories, and further fully connected layers to explore high-order feature interactions. Our experimental results on two large-scale real-world ad click datasets demonstrate that PNNs consistently outperform the state-of-the-art models on various metrics.

The game of Chinese Checkers is a challenging traditional board game of perfect information that differs from other traditional games in two main aspects: first, unlike Chess, all checkers remain indefinitely in the game and hence the branching factor of the search tree does not decrease as the game progresses; second, unlike Go, there are also no upper bounds on the depth of the search tree since repetitions and backward movements are allowed. Therefore, even in a restricted game instance, the state-space of the game can still be unbounded, making it challenging for a computer program to excel. In this work, we present an approach that effectively combines the use of heuristics, Monte Carlo tree search, and deep reinforcement learning for building a Chinese Checkers agent without the use of any human game-play data. Experiment results show that our agent is competent under different scenarios and reaches the level of experienced human players.

Signals are generally modeled as a superposition of exponential functions in spectroscopy of chemistry, biology and medical imaging. For fast data acquisition or other inevitable reasons, however, only a small amount of samples may be acquired and thus how to recover the full signal becomes an active research topic. But existing approaches can not efficiently recover $N$-dimensional exponential signals with $N\geq 3$. In this paper, we study the problem of recovering N-dimensional (particularly $N\geq 3$) exponential signals from partial observations, and formulate this problem as a low-rank tensor completion problem with exponential factor vectors. The full signal is reconstructed by simultaneously exploiting the CANDECOMP/PARAFAC structure and the exponential structure of the associated factor vectors. The latter is promoted by minimizing an objective function involving the nuclear norm of Hankel matrices. Experimental results on simulated and real magnetic resonance spectroscopy data show that the proposed approach can successfully recover full signals from very limited samples and is robust to the estimated tensor rank.

In this paper, we consider unregularized online learning algorithms in a Reproducing Kernel Hilbert Spaces (RKHS). Firstly, we derive explicit convergence rates of the unregularized online learning algorithms for classification associated with a general gamma-activating loss (see Definition 1 in the paper). Our results extend and refine the results in Ying and Pontil (2008) for the least-square loss and the recent result in Bach and Moulines (2011) for the loss function with a Lipschitz-continuous gradient. Moreover, we establish a very general condition on the step sizes which guarantees the convergence of the last iterate of such algorithms. Secondly, we establish, for the first time, the convergence of the unregularized pairwise learning algorithm with a general loss function and derive explicit rates under the assumption of polynomially decaying step sizes. Concrete examples are used to illustrate our main results. The main techniques are tools from convex analysis, refined inequalities of Gaussian averages, and an induction approach.

Aesthetic assessment is subjective, and the distribution of the aesthetic levels is imbalanced. In order to realize the auto-assessment of photo aesthetics, we focus on retraining the CNN-based aesthetic assessment model by dropping out the unavailable samples in the middle levels from the training data set repetitively to overcome the effect of imbalanced aesthetic data on classification. Further, the method of extracting aesthetics highlight region of the photo image by using the two repetitively trained models is presented. Therefore, the correlation of the extracted region with the aesthetic levels is analyzed to illustrate what aesthetics features influence the aesthetic quality of the photo. Moreover, the testing data set is from the different data source called 500px. Experimental results show that the proposed method is effective.

This paper shows that pairwise PageRank orders emerge from two-hop walks. The main tool used here refers to a specially designed sign-mirror function and a parameter curve, whose low-order derivative information implies pairwise PageRank orders with high probability. We study the pairwise correct rate by placing the Google matrix $\textbf{G}$ in a probabilistic framework, where $\textbf{G}$ may be equipped with different random ensembles for model-generated or real-world networks with sparse, small-world, scale-free features, the proof of which is mixed by mathematical and numerical evidence. We believe that the underlying spectral distribution of aforementioned networks is responsible for the high pairwise correct rate. Moreover, the perspective of this paper naturally leads to an $O(1)$ algorithm for any single pairwise PageRank comparison if assuming both $\textbf{A}=\textbf{G}-\textbf{I}_n$, where $\textbf{I}_n$ denotes the identity matrix of order $n$, and $\textbf{A}^2$ are ready on hand (e.g., constructed offline in an incremental manner), based on which it is easy to extract the top $k$ list in $O(kn)$, thus making it possible for PageRank algorithm to deal with super large-scale datasets in real time.

This paper considers a general data-fitting problem over a networked system, in which many computing nodes are connected by an undirected graph. This kind of problem can find many real-world applications and has been studied extensively in the literature. However, existing solutions either need a central controller for information sharing or requires slot synchronization among different nodes, which increases the difficulty of practical implementations, especially for a very large and heterogeneous system. As a contrast, in this paper, we treat the data-fitting problem over the network as a stochastic programming problem with many constraints. By adapting the results in a recent paper, we design a fully distributed and asynchronized stochastic gradient descent (SGD) algorithm. We show that our algorithm can achieve global optimality and consensus asymptotically by only local computations and communications. Additionally, we provide a sharp lower bound for the convergence speed in the regular graph case. This result fits the intuition and provides guidance to design a `good' network topology to speed up the convergence. Also, the merit of our design is validated by experiments on both synthetic and real-world datasets.

Large-batch training approaches have enabled researchers to utilize large-scale distributed processing and greatly accelerate deep-neural net (DNN) training. For example, by scaling the batch size from 256 to 32K, researchers have been able to reduce the training time of ResNet50 on ImageNet from 29 hours to 2.2 minutes (Ying et al., 2018). In this paper, we propose a new approach called linear-epoch gradual-warmup (LEGW) for better large-batch training. With LEGW, we are able to conduct large-batch training for both CNNs and RNNs with the Sqrt Scaling scheme. LEGW enables Sqrt Scaling scheme to be useful in practice and as a result we achieve much better results than the Linear Scaling learning rate scheme. For LSTM applications, we are able to scale the batch size by a factor of 64 without losing accuracy and without tuning the hyper-parameters. For CNN applications, LEGW is able to achieve the same accuracy even as we scale the batch size to 32K. LEGW works better than previous large-batch auto-tuning techniques. LEGW achieves a 5.3X average speedup over the baselines for four LSTM-based applications on the same hardware. We also provide some theoretical explanations for LEGW.

This paper presents a neural network approach for solving two-dimensional optical tomography (OT) problems based on the radiative transfer equation. The mathematical problem of OT is to recover the optical properties of an object based on the albedo operator that is accessible from boundary measurements. Both the forward map from the optical properties to the albedo operator and the inverse map are high-dimensional and nonlinear. For the circular tomography geometry, a perturbative analysis shows that the forward map can be approximated by a vectorized convolution operator in the angular direction. Motivated by this, we propose effective neural network architectures for the forward and inverse maps based on convolution layers, with weights learned from training datasets. Numerical results demonstrate the efficiency of the proposed neural networks.

This paper considers a multi-agent Markov Decision Process (MDP), where there are $n$ agents and each agent $i$ is associated with a state $s_i$ and action $a_i$ taking values from a finite set. Though the global state space size and action space size are exponential in $n$, we impose local dependence structures and focus on local policies that only depend on local states, and we propose a method that finds nearly optimal local policies in polynomial time (in $n$) when the dependence structure is a one directional tree. The algorithm builds on approximated reward functions which are evaluated using locally truncated Markov process. Further, under some special conditions, we prove that the gap between the approximated reward function and the true reward function is decaying exponentially fast as the length of the truncated Markov process gets longer. The intuition behind this is that under some assumptions, the effect of agent interactions decays exponentially in the distance between agents, which we term "fast decaying property".

The recent decades have seen a surge of interests in distributed computing. Existing work focus primarily on either distributed computing platforms, data query tools, or, algorithms to divide big data and conquer at individual machines etc. It is, however, increasingly often that the data of interest are inherently distributed, i.e., data are stored at multiple distributed sites due to diverse collection channels, business operations etc. We propose to enable learning and inference in such a setting via a general framework based on the distortion minimizing local transformations. This framework only requires a small amount of local signatures to be shared among distributed sites, eliminating the need of having to transmitting big data. Computation can be done very efficiently via parallel local computation. The error incurred due to distributed computing vanishes when increasing the size of local signatures. As the shared data need not be in their original form, data privacy may also be preserved. Experiments on linear (logistic) regression and Random Forests have shown promise of this approach. This framework is expected to apply to a general class of tools in learning and inference with the continuity property.

We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the algorithm from the equilibrium point of an associated ordinary differential equation (ODE). We obtain finite-time bounds on the mean-square error in the case of constant step-size algorithms by considering the drift of an appropriately chosen Lyapunov function. The Lyapunov function can be interpreted either in terms of Stein's method to obtain bounds on steady-state performance or in terms of Lyapunov stability theory for linear ODEs. We also provide a comprehensive treatment of the moments of the square of the 2-norm of the approximation error. Our analysis yields the following results: (i) for a given step-size, we show that the lower-order moments can be made small as a function of the step-size and can be upper-bounded by the moments of a Gaussian random variable; (ii) we show that the higher-order moments beyond a threshold may be infinite in steady-state; and (iii) we characterize the number of samples needed for the finite-time bounds to be of the same order as the steady-state bounds. As a by-product of our analysis, we also solve the open problem of obtaining finite-time bounds for the performance of temporal difference learning algorithms with linear function approximation and a constant step-size, without requiring a projection step or an i.i.d. noise assumption.

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid extensive cost of collecting and annotating large-scale datasets, as a subset of unsupervised learning methods, self-supervised learning methods are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that used for self-supervised learning are summarized. Next, the main components and evaluation metrics of self-supervised learning methods are reviewed followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. At last, this paper is concluded and lists a set of promising future directions for self-supervised visual feature learning.