Research papers and code for "Lei Wu":
Representation learning and unsupervised learning are two central topics of machine learning and signal processing. Deep learning is one of the most effective unsupervised representation learning approach. The main contributions of this paper to the topics are as follows. (i) We propose to view the representative deep learning approaches as special cases of the knowledge reuse framework of clustering ensemble. (ii) We propose to view sparse coding when used as a feature encoder as the consensus function of clustering ensemble, and view dictionary learning as the training process of the base clusterings of clustering ensemble. (ii) Based on the above two views, we propose a very simple deep learning algorithm, named deep random model ensemble (DRME). It is a stack of random model ensembles. Each random model ensemble is a special k-means ensemble that discards the expectation-maximization optimization of each base k-means but only preserves the default initialization method of the base k-means. (iv) We propose to select the most powerful representation among the layers by applying DRME to clustering where the single-linkage is used as the clustering algorithm. Moreover, the DRME based clustering can also detect the number of the natural clusters accurately. Extensive experimental comparisons with 5 representation learning methods on 19 benchmark data sets demonstrate the effectiveness of DRME.

* This paper has been withdrawn by the author due to a lack of full empirical evaluation. More advanced method has been developed. This method has been fully out of date
Click to Read Paper and Get Code
Mismatching problem between the source and target noisy corpora severely hinder the practical use of the machine-learning-based voice activity detection (VAD). In this paper, we try to address this problem in the transfer learning prospective. Transfer learning tries to find a common learning machine or a common feature subspace that is shared by both the source corpus and the target corpus. The denoising deep neural network is used as the learning machine. Three transfer techniques, which aim to learn common feature representations, are used for analysis. Experimental results demonstrate the effectiveness of the transfer learning schemes on the mismatch problem.

* This paper has been submitted to the conference "INTERSPEECH2013" in March 4, 2013 for review
Click to Read Paper and Get Code
Recently, the deep-belief-networks (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features, and achieves the state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority to the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address the aforementioned problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way by the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and its corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over shallower layers.

* This paper has been accepted by IEEE ICASSP-2013, and will be published online after May, 2013
Click to Read Paper and Get Code
One of the key issues in the analysis of machine learning models is to identify the appropriate function space for the model. This is the space of functions that the particular machine learning model can approximate with good accuracy, endowed with a natural norm associated with the approximation process. In this paper, we address this issue for two representative neural network models: the two-layer networks and the residual neural networks. We define Barron space and show that it is the right space for two-layer neural network models in the sense that optimal direct and inverse approximation theorems hold for functions in the Barron space. For residual neural network models, we construct the so-called compositional function space, and prove direct and inverse approximation theorems for this space. In addition, we show that the Rademacher complexity has the optimal upper bounds for these spaces.

Click to Read Paper and Get Code
With the thriving of deep learning, 3D Convolutional Neural Networks have become a popular choice in volumetric image analysis due to their impressive 3D contexts mining ability. However, the 3D convolutional kernels will introduce a significant increase in the amount of trainable parameters. Considering the training data is often limited in biomedical tasks, a tradeoff has to be made between model size and its representational power. To address this concern, in this paper, we propose a novel 3D Dense Separated Convolution (3D-DSC) module to replace the original 3D convolutional kernels. The 3D-DSC module is constructed by a series of densely connected 1D filters. The decomposition of 3D kernel into 1D filters reduces the risk of over-fitting by removing the redundancy of 3D kernels in a topologically constrained manner, while providing the infrastructure for deepening the network. By further introducing nonlinear layers and dense connections between 1D filters, the network's representational power can be significantly improved while maintaining a compact architecture. We demonstrate the superiority of 3D-DSC on volumetric image classification and segmentation, which are two challenging tasks often encountered in biomedical image computing.

* 7 pages,5 figures
Click to Read Paper and Get Code
A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general regimes for the network width and training data size are considered. In the over-parametrized regime, it is shown that gradient descent dynamics can achieve zero training loss exponentially fast regardless of the quality of the labels. In addition, it is proved that throughout the training process the functions represented by the neural network model are uniformly close to that of a kernel method. For general values of the network width and training data size, sharp estimates of the generalization error is established for target functions in the appropriate reproducing kernel Hilbert space. Our analysis suggests strongly that in terms of `implicit regularization', two-layer neural network models do not outperform the kernel method.

* 30 pages, 5 figures
Click to Read Paper and Get Code
3D face reconstruction is an important task in the field of computer vision. Although 3D face reconstruction has being developing rapidly in recent years, it is still a challenge for face reconstruction under large pose. That is because much of the information about a face in a large pose will be unknowable. In order to address this issue, this paper proposes a novel 3D face reconstruction algorithm (PIFR) based on 3D Morphable Model (3DMM). After input a single face image, it generates a frontal image by normalizing the image. Then we set weighted sum of the 3D parameters of the two images. Our method solves the problem of face reconstruction of a single image of a traditional method in a large pose, works on arbitrary Pose and Expressions, greatly improves the accuracy of reconstruction. Experiments on the challenging AFW, LFPW and AFLW database show that our algorithm significantly improves the accuracy of 3D face reconstruction even under extreme poses .

* 8 pages
Click to Read Paper and Get Code
New estimates for the generalization error are established for the two-layer neural network model. These new estimates are a priori in nature in the sense that the bounds depend only on some norms of the underlying functions to be fitted, not the parameters in the model. In contrast, most existing results for neural networks are a posteriori in nature in the sense that the bounds depend on some norms of the model parameters. The error rates are comparable to that of the Monte Carlo method for integration problems. Moreover, these bounds are equally effective in the over-parametrized regime when the network size is much larger than the size of the dataset.

* 14 pages, 2 figures
Click to Read Paper and Get Code
We propose a general framework for solving statistical mechanics of systems with a finite size. The approach extends the celebrated variational mean-field approaches using autoregressive neural networks which support direct sampling and exact calculation of normalized probability of configurations. The network computes variational free energy, estimates physical quantities such as entropy, magnetizations and correlations, and generates uncorrelated samples all at once. Training of the network employs the policy gradient approach in reinforcement learning, which unbiasedly estimates the gradient of variational parameters. We apply our approach to several classical systems, including 2-d Ising models, Hopfield model, Sherrington--Kirkpatrick spin glasses, and the inverse Ising model, for demonstrating its advantages over existing variational mean-field methods. Our approach sheds light on solving statistical physics problems using modern deep generative neural networks.

Click to Read Paper and Get Code
It is widely observed that deep learning models with learned parameters generalize well, even with much more model parameters than the number of training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between the minima (with the same training error) that generalize well and those they don't. We show that it is the characteristics the landscape of the loss function that explains the good generalization capability. For the landscape of loss function for deep networks, the volume of basin of attraction of good minima dominates over that of poor minima, which guarantees optimization methods with random initialization to converge to good minima. We theoretically justify our findings through analyzing 2-layer neural networks; and show that the low-complexity solutions have a small norm of Hessian matrix with respect to model parameters. For deeper networks, extensive numerical evidence helps to support our arguments.

Click to Read Paper and Get Code
Distributed learning is an effective way to analyze big data. In distributed regression, a typical approach is to divide the big data into multiple blocks, apply a base regression algorithm on each of them, and then simply average the output functions learnt from these blocks. Since the average process will decrease the variance, not the bias, bias correction is expected to improve the learning performance if the base regression algorithm is a biased one. Regularization kernel network is an effective and widely used method for nonlinear regression analysis. In this paper we will investigate a bias corrected version of regularization kernel network. We derive the error bounds when it is applied to a single data set and when it is applied as a base algorithm in distributed regression. We show that, under certain appropriate conditions, the optimal learning rates can be reached in both situations.

Click to Read Paper and Get Code
3D face alignment of monocular images is a crucial process in the recognition of faces with disguise.3D face reconstruction facilitated by alignment can restore the face structure which is helpful in detcting disguise interference.This paper proposes a dual attention mechanism and an efficient end-to-end 3D face alignment framework.We build a stable network model through Depthwise Separable Convolution, Densely Connected Convolutional and Lightweight Channel Attention Mechanism. In order to enhance the ability of the network model to extract the spatial features of the face region, we adopt Spatial Group-wise Feature enhancement module to improve the representation ability of the network. Different loss functions are applied jointly to constrain the 3D parameters of a 3D Morphable Model (3DMM) and its 3D vertices. We use a variety of data enhancement methods and generate large virtual pose face data sets to solve the data imbalance problem. The experiments on the challenging AFLW,AFLW2000-3D datasets show that our algorithm significantly improves the accuracy of 3D face alignment. Our experiments using the field DFW dataset show that DAMDNet exhibits excellent performance in the 3D alignment and reconstruction of challenging disguised faces.The model parameters and the complexity of the proposed method are also reduced significantly.The code is publicly available at https:// github.com/LeiJiangJNU/DAMDNet

* ICCV2019 workshop
* 10 pages
Click to Read Paper and Get Code
We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a. captionbot) and a text-to-image generator (a.k.a. drawingbot). The key idea behind the joint training is that image-to-text generation and text-to-image generation as dual problems can form a closed loop to provide informative feedback to each other. Based on such feedback, we introduce a new loss metric by comparing the original input with the output produced by the closed loop. In addition to the old loss metrics used in captionbot and drawingbot, this extra loss metric makes the jointly trained captionbot and drawingbot better than the separately trained captionbot and drawingbot. Furthermore, the turbo-learning approach enables semi-supervised learning since the closed loop can provide peudo-labels for unlabeled samples. Experimental results on the COCO dataset demonstrate that the proposed turbo learning can significantly improve the performance of both captionbot and drawingbot by a large margin.

Click to Read Paper and Get Code
Co-training is a popular semi-supervised learning framework to utilize a large amount of unlabeled data in addition to a small labeled set. Co-training methods exploit predicted labels on the unlabeled data and select samples based on prediction confidence to augment the training. However, the selection of samples in existing co-training methods is based on a predetermined policy, which ignores the sampling bias between the unlabeled and the labeled subsets, and fails to explore the data space. In this paper, we propose a novel method, Reinforced Co-Training, to select high-quality unlabeled samples to better co-train on. More specifically, our approach uses Q-learning to learn a data selection policy with a small labeled dataset, and then exploits this policy to train the co-training classifiers automatically. Experimental results on clickbait detection and generic text classification tasks demonstrate that our proposed method can obtain more accurate text classification results.

* 11 pages, 3 figures. Accepted to NAACL 2018
Click to Read Paper and Get Code
Convolutional neural networks (CNNs) can be applied to graph similarity matching, in which case they are called graph CNNs. Graph CNNs are attracting increasing attention due to their effectiveness and efficiency. However, the existing convolution approaches focus only on regular data forms and require the transfer of the graph or key node neighborhoods of the graph into the same fixed form. During this transfer process, structural information of the graph can be lost, and some redundant information can be incorporated. To overcome this problem, we propose the disordered graph convolutional neural network (DGCNN) based on the mixed Gaussian model, which extends the CNN by adding a preprocessing layer called the disordered graph convolutional layer (DGCL). The DGCL uses a mixed Gaussian function to realize the mapping between the convolution kernel and the nodes in the neighborhood of the graph. The output of the DGCL is the input of the CNN. We further implement a backward-propagation optimization process of the convolutional layer by which we incorporate the feature-learning model of the irregular node neighborhood structure into the network. Thereafter, the optimization of the convolution kernel becomes part of the neural network learning process. The DGCNN can accept arbitrary scaled and disordered neighborhood graph structures as the receptive fields of CNNs, which reduces information loss during graph transformation. Finally, we perform experiments on multiple standard graph datasets. The results show that the proposed method outperforms the state-of-the-art methods in graph classification and retrieval.

* 16 pages,8 figures
Click to Read Paper and Get Code
Energy disaggregation is to discover the energy consumption of individual appliances from their aggregated energy values. To solve the problem, most existing approaches rely on either appliances' signatures or their state transition patterns, both hard to obtain in practice. Aiming at developing a simple, universal model that works without depending on sophisticated machine learning techniques or auxiliary equipments, we make use of easily accessible knowledge of appliances and the sparsity of the switching events to design a Sparse Switching Event Recovering (SSER) method. By minimizing the total variation (TV) of the (sparse) event matrix, SSER can effectively recover the individual energy consumption values from the aggregated ones. To speed up the process, a Parallel Local Optimization Algorithm (PLOA) is proposed to solve the problem in active epochs of appliance activities in parallel. Using real-world trace data, we compare the performance of our method with that of the state-of-the-art solutions, including Least Square Estimation (LSE) and iterative Hidden Markov Model (HMM). The results show that our approach has an overall higher detection accuracy and a smaller overhead.

* 12 pages, 5 figures, and 4 tables
Click to Read Paper and Get Code
The behavior of the gradient descent (GD) algorithm is analyzed for a deep neural network model with skip-connections. It is proved that in the over-parametrized regime, for a suitable initialization, with high probability GD can find a global minimum exponentially fast. Generalization error estimates along the GD path are also established. As a consequence, it is shown that when the target function is in the reproducing kernel Hilbert space (RKHS) with a kernel defined by the initialization, there exist generalizable early-stopping solutions along the GD path. In addition, it is also shown that the GD path is uniformly close to the functions given by the related random feature model. Consequently, in this "implicit regularization" setting, the deep neural network model deteriorates to a random feature model. Our results hold for neural networks of any width larger than the input dimension.

* 29 pages, 4 figures
Click to Read Paper and Get Code
Generative Adversarial Networks (GANs) have been widely used for the image-to-image translation task. While these models rely heavily on the labeled image pairs, recently some GAN variants have been proposed to tackle the unpaired image translation task. These models exploited supervision at the domain level with a reconstruction process for unpaired image translation. On the other hand, parallel works have shown that leveraging perceptual loss functions based on high level deep features could enhance the generated image quality. Nevertheless, as these GAN-based models either depended on the pretrained deep network structure or relied on the labeled image pairs, they could not be directly applied to the unpaired image translation task. Moreover, despite the improvement of the introduced perceptual losses from deep neural networks, few researchers have explored the possibility of improving the generated image quality from classical image quality measures. To tackle the above two challenges, in this paper, we propose a unified quality-aware GAN-based framework for unpaired image-to-image translation, where a quality-aware loss is explicitly incorporated by comparing each source image and the reconstructed image at the domain level. Specifically, we design two detailed implementations of the quality loss. The first method is based on a classical image quality assessment measure by defining a classical quality-aware loss. The second method proposes an adaptive deep network based loss. Finally, extensive experimental results on many real-world datasets clearly show the quality improvement of our proposed framework, and the superiority of leveraging classical image quality measures for unpaired image translation compared to the deep network based model.

* IEEE Transactions on Multimedia
Click to Read Paper and Get Code
Network embedding represents nodes in a continuous vector space and preserves structure information from the Network. Existing methods usually adopt a "one-size-fits-all" approach when concerning multi-scale structure information, such as first- and second-order proximity of nodes, ignoring the fact that different scales play different roles in the embedding learning. In this paper, we propose an Attention-based Adversarial Autoencoder Network Embedding(AAANE) framework, which promotes the collaboration of different scales and lets them vote for robust representations. The proposed AAANE consists of two components: 1) Attention-based autoencoder effectively capture the highly non-linear network structure, which can de-emphasize irrelevant scales during training. 2) An adversarial regularization guides the autoencoder learn robust representations by matching the posterior distribution of the latent embeddings to given prior distribution. This is the first attempt to introduce attention mechanisms to multi-scale network embedding. Experimental results on real-world networks show that our learned attention parameters are different for every network and the proposed approach outperforms existing state-of-the-art approaches for network embedding.

* 8 pages, 5 figures
Click to Read Paper and Get Code
State-of-the-art deep neural networks are known to be vulnerable to adversarial examples, formed by applying small but malicious perturbations to the original inputs. Moreover, the perturbations can \textit{transfer across models}: adversarial examples generated for a specific model will often mislead other unseen models. Consequently the adversary can leverage it to attack deployed systems without any query, which severely hinder the application of deep learning, especially in the areas where security is crucial. In this work, we systematically study how two classes of factors that might influence the transferability of adversarial examples. One is about model-specific factors, including network architecture, model capacity and test accuracy. The other is the local smoothness of loss function for constructing adversarial examples. Based on these understanding, a simple but effective strategy is proposed to enhance transferability. We call it variance-reduced attack, since it utilizes the variance-reduced gradient to generate adversarial example. The effectiveness is confirmed by a variety of experiments on both CIFAR-10 and ImageNet datasets.

* 15 pages
Click to Read Paper and Get Code