Models, code, and papers for "Xiaolin Wu":
Deep convolutional neural networks (DCNN) have enjoyed great successes in many signal processing applications because they can learn complex, non-linear causal relationships from input to output. In this light, DCNNs are well suited for the task of sequential prediction of multidimensional signals, such as images, and have the potential of improving the performance of traditional linear predictors. In this research we investigate how far DCNNs can push the envelop in terms of prediction precision. We propose, in a case study, a two-stage deep regression DCNN framework for nonlinear prediction of two-dimensional image signals. In the first-stage regression, the proposed deep prediction network (PredNet) takes the causal context as input and emits a prediction of the present pixel. Three PredNets are trained with the regression objectives of minimizing $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of prediction residuals, respectively. The second-stage regression combines the outputs of the three PredNets to generate an even more precise and robust prediction. The proposed deep regression model is applied to lossless predictive image coding, and it outperforms the state-of-the-art linear predictors by appreciable margin.
Recently a number of CNN-based techniques were proposed to remove image compression artifacts. As in other restoration applications, these techniques all learn a mapping from decompressed patches to the original counterparts under the ubiquitous L2 metric. However, this approach is incapable of restoring distinctive image details which may be statistical outliers but have high semantic importance (e.g., tiny lesions in medical images). To overcome this weakness, we propose to incorporate an L-infinity fidelity criterion in the design of neural network so that no small, distinctive structures of the original image can be dropped or distorted. Moreover, our anti-artifacts neural network is designed to work on a range of compression bit rates, rather than a fixed one as in the past. Experimental results demonstrate that the proposed method outperforms the state-of-the-art methods in L-infinity error metric and perceptual quality, while being competitive in L2 error metric as well. It can restore subtle image details that are otherwise destroyed or missed by other algorithms. Our research suggests a new machine learning paradigm of ultra high fidelity image compression that is ideally suited for applications in medicine, space, and sciences.
In November 2016 we submitted to arXiv our paper "Automated Inference on Criminality Using Face Images". It generated a great deal of discussions in the Internet and some media outlets. Our work is only intended for pure academic discussions; how it has become a media consumption is a total surprise to us. Although in agreement with our critics on the need and importance of policing AI research for the general good of the society, we are deeply baffled by the ways some of them mispresented our work, in particular the motive and objective of our research.
Small compression noises, despite being transparent to human eyes, can adversely affect the results of many image restoration processes, if left unaccounted for. Especially, compression noises are highly detrimental to inverse operators of high-boosting (sharpening) nature, such as deblurring and superresolution against a convolution kernel. By incorporating the non-linear DCT quantization mechanism into the formulation for image restoration, we propose a new sparsity-based convex programming approach for joint compression noise removal and image restoration. Experimental results demonstrate significant performance gains of the new approach over existing image restoration methods.
Regularization techniques are widely used to improve the generality, robustness, and efficiency of deep convolutional neural networks (DCNNs). In this paper, we propose a novel approach of regulating DCNN convolutional kernels by a structured filter bank. Comparing with the existing regularization methods, such as $\ell_1$ or $\ell_2$ minimization of DCNN kernel weights and the kernel orthogonality, which ignore sample correlations within a kernel, the use of filter bank in regularization of DCNNs can mold the DCNN kernels to common spatial structures and features (e.g., edges or textures of various orientations and frequencies) of natural images. On the other hand, unlike directly making DCNN kernels fixed filters, the filter bank regularization still allows the freedom of optimizing DCNN weights via deep learning. This new DCNN design strategy aims to combine the best of two worlds: the inclusion of structural image priors of traditional filter banks to improve the robustness and generality of DCNN solutions and the capability of modern deep learning to model complex non-linear functions hidden in training data. Experimental results on object recognition tasks show that the proposed regularization approach guides DCNNs to faster convergence and better generalization than existing regularization methods of weight decay and kernel orthogonality.
Given the success of the deep convolutional neural networks (DCNNs) in applications of visual recognition and classification, it would be tantalizing to test if DCNNs can also learn spatial concepts, such as straightness, convexity, left/right, front/back, relative size, aspect ratio, polygons, etc., from varied visual examples of these concepts that are simple and yet vital for spatial reasoning. Much to our dismay, extensive experiments of the type of cognitive psychology demonstrate that the data-driven deep learning (DL) cannot see through superficial variations in visual representations and grasp the spatial concept in abstraction. The root cause of failure turns out to be the learning methodology, not the computational model of the neural network itself. By incorporating task-specific convolutional kernels, we are able to construct DCNNs for spatial cognition tasks that can generalize to input images not drawn from the same distribution of the training set. This work raises a precaution that without manually-incorporated priors or features DCCNs may fail spatial cognitive tasks at rudimentary level.
In many applications of deep learning, particularly those in image restoration, it is either very difficult, prohibitively expensive, or outright impossible to obtain paired training data precisely as in the real world. In such cases, one is forced to use synthesized paired data to train the deep convolutional neural network (DCNN). However, due to the unavoidable generalization error in statistical learning, the synthetically trained DCNN often performs poorly on real world data. To overcome this problem, we propose a new general training method that can compensate for, to a large extent, the generalization errors of synthetically trained DCNNs.
Subitizing, or the sense of small natural numbers, is a cognitive construct so primary and critical to the survival and well-being of humans and primates that is considered and proven to be innate; it responds to visual stimuli prior to the development of any symbolic skills, language or arithmetic. Given highly acclaimed successes of deep convolutional neural networks (DCNN) in tasks of visual intelligence, one would expect that DCNNs can learn subitizing. But somewhat surprisingly, our carefully crafted extensive experiments, which are similar to those of cognitive psychology, demonstrate that DCNNs cannot, even with strong supervision, see through superficial variations in visual representations and distill the abstract notion of natural number, a task that children perform with high accuracy and confidence. The DCNN black box learners driven by very large training sets are apparently still confused by geometric variations and fail to grasp the topological essence in subitizing. In sharp contrast to the failures of the black box learning, by incorporating a mechanism of mathematical morphology into convolutional kernels, we are able to construct a recurrent convolutional neural network that can perform subitizing deterministically. Our findings in this study of cognitive computing, without and with prior of human knowledge, are discussed; they are, we believe, significant and thought-provoking in the interests of AI research, because visual-based numerosity is a benchmark of minimum sort for human cognition.
Taking photos of optoelectronic displays is a direct and spontaneous way of transferring data and keeping records, which is widely practiced. However, due to the analog signal interference between the pixel grids of the display screen and camera sensor array, objectionable moir\'e (alias) patterns appear in captured screen images. As the moir\'e patterns are structured and highly variant, they are difficult to be completely removed without affecting the underneath latent image. In this paper, we propose an approach of deep convolutional neural network for demoir\'eing screen photos. The proposed DCNN consists of a coarse-scale network and a fine-scale network. In the coarse-scale network, the input image is first downsampled and then processed by stacked residual blocks to remove the moir\'e artifacts. After that, the fine-scale network upsamples the demoir\'ed low-resolution image back to the original resolution. Extensive experimental results have demonstrated that the proposed technique can efficiently remove the moir\'e patterns for camera acquired screen images; the new technique outperforms the existing ones.
All existing image enhancement methods, such as HDR tone mapping, cannot recover A/D quantization losses due to insufficient or excessive lighting, (underflow and overflow problems). The loss of image details due to A/D quantization is complete and it cannot be recovered by traditional image processing methods, but the modern data-driven machine learning approach offers a much needed cure to the problem. In this work we propose a novel approach to restore and enhance images acquired in low and uneven lighting. First, the ill illumination is algorithmically compensated by emulating the effects of artificial supplementary lighting. Then a DCNN trained using only synthetic data recovers the missing detail caused by quantization.
In recent years, deep neural networks have yielded state-of-the-art performance on several tasks. Although some recent works have focused on combining deep learning with recommendation, we highlight three issues of existing works. First, most works perform deep content feature learning and resort to matrix factorization, which cannot effectively model the highly complex user-item interaction function. Second, due to the difficulty on training deep neural networks, existing models utilize a shallow architecture, and thus limit the expressive potential of deep learning. Third, neural network models are easy to overfit on the implicit setting, because negative interactions are not taken into account. To tackle these issues, we present a generic recommender framework called Neural Collaborative Autoencoder (NCAE) to perform collaborative filtering, which works well for both explicit feedback and implicit feedback. NCAE can effectively capture the relationship between interactions via a non-linear matrix factorization process. To optimize the deep architecture of NCAE, we develop a three-stage pre-training mechanism that combines supervised and unsupervised feature learning. Moreover, to prevent overfitting on the implicit setting, we propose an error reweighting module and a sparsity-aware data-augmentation strategy. Extensive experiments on three real-world datasets demonstrate that NCAE can significantly advance the state-of-the-art.
This paper presents a generic pre-processor for expediting conventional template matching techniques. Instead of locating the best matched patch in the reference image to a query template via exhaustive search, the proposed algorithm rules out regions with no possible matches with minimum computational efforts. While working on simple patch features, such as mean, variance and gradient, the fast pre-screening is highly discriminative. Its computational efficiency is gained by using a novel octagonal-star-shaped template and the inclusion-exclusion principle to extract and compare patch features. Moreover, it can handle arbitrary rotation and scaling of reference images effectively. Extensive experiments demonstrate that the proposed algorithm greatly reduces the search space while never missing the best match.
This article is a sequel to our earlier work . The main objective of our research is to explore the potential of supervised machine learning in face-induced social computing and cognition, riding on the momentum of much heralded successes of face processing, analysis and recognition on the tasks of biometric-based identification. We present a case study of automated statistical inference on sociopsychological perceptions of female faces controlled for race, attractiveness, age and nationality. Our empirical evidences point to the possibility of training machine learning algorithms, using example face images characterized by internet users, to predict perceptions of personality traits and demeanors.
Image of a scene captured through a piece of transparent and reflective material, such as glass, is often spoiled by a superimposed layer of reflection image. While separating the reflection from a familiar object in an image is mentally not difficult for humans, it is a challenging, ill-posed problem in computer vision. In this paper, we propose a novel deep convolutional encoder-decoder method to remove the objectionable reflection by learning a map between image pairs with and without reflection. For training the neural network, we model the physical formation of reflections in images and synthesize a large number of photo-realistic reflection-tainted images from reflection-free images collected online. Extensive experimental results show that, although the neural network learns only from synthetic data, the proposed method is effective on real-world images, and it significantly outperforms the other tested state-of-the-art techniques.
Given the prevalence of JPEG compressed images, optimizing image reconstruction from the compressed format remains an important problem. Instead of simply reconstructing a pixel block from the centers of indexed DCT coefficient quantization bins (hard decoding), soft decoding reconstructs a block by selecting appropriate coefficient values within the indexed bins with the help of signal priors. The challenge thus lies in how to define suitable priors and apply them effectively. In this paper, we combine three image priors---Laplacian prior for DCT coefficients, sparsity prior and graph-signal smoothness prior for image patches---to construct an efficient JPEG soft decoding algorithm. Specifically, we first use the Laplacian prior to compute a minimum mean square error (MMSE) initial solution for each code block. Next, we show that while the sparsity prior can reduce block artifacts, limiting the size of the over-complete dictionary (to lower computation) would lead to poor recovery of high DCT frequencies. To alleviate this problem, we design a new graph-signal smoothness prior (desired signal has mainly low graph frequencies) based on the left eigenvectors of the random walk graph Laplacian matrix (LERaG). Compared to previous graph-signal smoothness priors, LERaG has desirable image filtering properties with low computation overhead. We demonstrate how LERaG can facilitate recovery of high DCT frequencies of a piecewise smooth (PWS) signal via an interpretation of low graph frequency components as relaxed solutions to normalized cut in spectral clustering. Finally, we construct a soft decoding algorithm using the three signal priors with appropriate prior weights. Experimental results show that our proposal outperforms state-of-the-art soft decoding algorithms in both objective and subjective evaluations noticeably.
As a powerful statistical image modeling technique, sparse representation has been successfully used in various image restoration applications. The success of sparse representation owes to the development of l1-norm optimization techniques, and the fact that natural images are intrinsically sparse in some domain. The image restoration quality largely depends on whether the employed sparse domain can represent well the underlying image. Considering that the contents can vary significantly across different images or different patches in a single image, we propose to learn various sets of bases from a pre-collected dataset of example image patches, and then for a given patch to be processed, one set of bases are adaptively selected to characterize the local sparse domain. We further introduce two adaptive regularization terms into the sparse representation framework. First, a set of autoregressive (AR) models are learned from the dataset of example image patches. The best fitted AR models to a given patch are adaptively selected to regularize the image local structures. Second, the image non-local self-similarity is introduced as another regularization term. In addition, the sparsity regularization parameter is adaptively estimated for better image restoration performance. Extensive experiments on image deblurring and super-resolution validate that by using adaptive sparse domain selection and adaptive regularization, the proposed method achieves much better results than many state-of-the-art algorithms in terms of both PSNR and visual perception.
This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input belong to two different modalities, e.g., the output is an image and the input is a sketch. We solve this problem by learning two models that bear similarities to those in reinforcement learning and optimal control. One model is policy-like. It generates the output directly by a non-linear transformation of the input and a noise vector. This amounts to fast thinking because the conditional generation is accomplished by direct sampling. The other model is planner-like. It learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. This amounts to slow thinking because the sampling process is accomplished by an iterative algorithm such as Langevin dynamics. We propose to learn the two models jointly, where the fast thinking policy-like model serves to initialize the sampling of the slow thinking planner-like model, and the planner-like model refines the initial output by an iterative algorithm. The planner-like model learns from the difference between the refined output and the observed output, while the policy-like model learns from how the planner-like model refines its initial output. We demonstrate the effectiveness of the proposed method on various image generation tasks.
This paper proposes the divergence triangle as a framework for joint training of generator model, energy-based model and inference model. The divergence triangle is a compact and symmetric (anti-symmetric) objective function that seamlessly integrates variational learning, adversarial learning, wake-sleep algorithm, and contrastive divergence in a unified probabilistic formulation. This unification makes the processes of sampling, inference, energy evaluation readily available without the need for costly Markov chain Monte Carlo methods. Our experiments demonstrate that the divergence triangle is capable of learning (1) an energy-based model with well-formed energy landscape, (2) direct sampling in the form of a generator network, and (3) feed-forward inference that faithfully reconstructs observed as well as synthesized data. The divergence triangle is a robust training method that can learn from incomplete data.
Distillation-based learning boosts the performance of the miniaturized neural network based on the hypothesis that the representation of a teacher model can be used as structured and relatively weak supervision, and thus would be easily learned by a miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a high lower bound of congruence loss. In this work, inspired by curriculum learning we consider the knowledge distillation from the perspective of curriculum learning by routing. Instead of supervising the student model with a converged teacher model, we supervised it with some anchor points selected from the route in parameter space that the teacher model passed by, as we called route constrained optimization (RCO). We experimentally demonstrate this simple operation greatly reduces the lower bound of congruence loss for knowledge distillation, hint and mimicking learning. On close-set classification tasks like CIFAR100 and ImageNet, RCO improves knowledge distillation by 2.14% and 1.5% respectively. For the sake of evaluating the generalization, we also test RCO on the open-set face recognition task MegaFace.
Although deeper and larger neural networks have achieved better performance, the complex network structure and increasing computational cost cannot meet the demands of many resource-constrained applications. Existing methods usually choose to execute or skip an entire specific layer, which can only alter the depth of the network. In this paper, we propose a novel method called Dynamic Multi-path Neural Network (DMNN), which provides more path selection choices in terms of network width and depth during inference. The inference path of the network is determined by a controller, which takes into account both previous state and object category information. The proposed method can be easily incorporated into most modern network architectures. Experimental results on ImageNet and CIFAR-100 demonstrate the superiority of our method on both efficiency and overall classification accuracy. To be specific, DMNN-101 significantly outperforms ResNet-101 with an encouraging 45.1% FLOPs reduction, and DMNN-50 performs comparably to ResNet-101 while saving 42.1% parameters.