Models, code, and papers for "Jinwen Ma":
In this paper, we propose a novel non-convex tensor rank surrogate function and a novel non-convex sparsity measure for tensor. The basic idea is to sidestep the bias of $\ell_1-$norm by introducing concavity. Furthermore, we employ the proposed non-convex penalties in tensor recovery problems such as tensor completion and tensor robust principal component analysis, which has various real applications such as image inpainting and denoising. Due to the concavity, the models are difficult to solve. To tackle this problem, we devise majorization minimization algorithms, which optimize upper bounds of original functions in each iteration, and every sub-problem is solved by alternating direction multiplier method. Finally, experimental results on natural images and hyperspectral images demonstrate the effectiveness and efficiency of the proposed methods.
Recent studies on face attribute transfer have achieved great success. A lot of models are able to transfer face attributes with an input image. However, they suffer from three limitations: (1) incapability of generating image by exemplars; (2) being unable to transfer multiple face attributes simultaneously; (3) low quality of generated images, such as low-resolution or artifacts. To address these limitations, we propose a novel model which receives two images of opposite attributes as inputs. Our model can transfer exactly the same type of attributes from one image to another by exchanging certain part of their encodings. All the attributes are encoded in a disentangled manner in the latent space, which enables us to manipulate several attributes simultaneously. Besides, our model learns the residual images so as to facilitate training on higher resolution images. With the help of multi-scale discriminators for adversarial training, it can even generate high-quality images with finer details and less artifacts. We demonstrate the effectiveness of our model on overcoming the above three limitations by comparing with other methods on the CelebA face database. A pytorch implementation is available at https://github.com/Prinsphield/ELEGANT.
Disentangling factors of variation has become a very challenging problem on representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, poor quality of generated images from encodings, lack of identity information, etc. In this paper, we propose a supervised learning model called DNA-GAN which tries to disentangle different factors or attributes of images. The latent representations of images are DNA-like, in which each individual piece (of the encoding) represents an independent factor of the variation. By annihilating the recessive piece and swapping a certain piece of one latent representation with that of the other one, we obtain two different representations which could be decoded into two kinds of images with the existence of the corresponding attribute being changed. In order to obtain realistic images and also disentangled representations, we further introduce the discriminator for adversarial training. Experiments on Multi-PIE and CelebA datasets finally demonstrate that our proposed method is effective for factors disentangling and even overcome certain limitations of the existing methods.
Fashion landmark detection is a challenging task even using the current deep learning techniques, due to the large variation and non-rigid deformation of clothes. In order to tackle these problems, we propose Spatial-Aware Non-Local (SANL) block, an attentive module in deep neural network which can utilize spatial information while capturing global dependency. Actually, the SANL block is constructed from the non-local block in the residual manner which can learn the spatial related representation by taking a spatial attention map from Grad-CAM. We then establish our fashion landmark detection framework on feature pyramid network, equipped with four SANL blocks in the backbone. It is demonstrated by the experimental results on two large-scale fashion datasets that our proposed fashion landmark detection approach with the SANL blocks outperforms the current state-of-the-art methods considerably. Some supplementary experiments on fine-grained image classification also show the effectiveness of the proposed SANL block.
Universal style transfer is an image editing task that renders an input content image using the visual style of arbitrary reference images, including both artistic and photorealistic stylization. Given a pair of images as the source of content and the reference of style, existing solutions usually first train an auto-encoder (AE) to reconstruct the image using deep features and then embeds pre-defined style transfer modules into the AE reconstruction procedure to transfer the style of the reconstructed image through modifying the deep features. While existing methods typically need multiple rounds of time-consuming AE reconstruction for better stylization, our work intends to design novel neural network architectures on top of AE for fast style transfer with fewer artifacts and distortions all in one pass of end-to-end inference. To this end, we propose two network architectures named ArtNet and PhotoNet to improve artistic and photo-realistic stylization, respectively. Extensive experiments demonstrate that ArtNet generates images with fewer artifacts and distortions against the state-of-the-art artistic transfer algorithms, while PhotoNet improves the photorealistic stylization results by creating sharp images faithfully preserving rich details of the input content. Moreover, ArtNet and PhotoNet can achieve 3X to 100X speed-up over the state-of-the-art algorithms, which is a major advantage for large content images.
Neural Architecture Search (NAS) has been widely studied for designing discriminative deep learning models such as image classification, object detection, and semantic segmentation. As a large number of priors have been obtained through the manual design of architectures in the fields, NAS is usually considered as a supplement approach. In this paper, we have significantly expanded the application areas of NAS by performing an empirical study of NAS to search generative models, or specifically, auto-encoder based universal style transfer, which lacks systematic exploration, if any, from the architecture search aspect. In our work, we first designed a search space where common operators for image style transfer such as VGG-based encoders, whitening and coloring transforms (WCT), convolution kernels, instance normalization operators, and skip connections were searched in a combinatorial approach. With a simple yet effective parallel evolutionary NAS algorithm with multiple objectives, we derived the first group of end-to-end deep networks for universal photorealistic style transfer. Comparing to random search, a NAS method that is gaining popularity recently, we demonstrated that carefully designed search strategy leads to much better architecture design. Finally compared to existing universal style transfer networks for photorealistic rendering such as PhotoWCT that stacks multiple well-trained auto-encoders and WCT transforms in a non-end-to-end manner, the architectures designed by StyleNAS produce better style-transferred images with details preserving, using a tiny number of operators/parameters, and enjoying around 500x inference time speed-up.
Understanding the behavior of stochastic gradient descent (SGD) in the context of deep neural networks has raised lots of concerns recently. Along this line, we theoretically study a general form of gradient based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics. Through investigating this general optimization dynamics, we analyze the behavior of SGD on escaping from minima and its regularization effects. A novel indicator is derived to characterize the efficiency of escaping from minima through measuring the alignment of noise covariance and the curvature of loss function. Based on this indicator, two conditions are established to show which type of noise structure is superior to isotropic noise in term of escaping efficiency. We further show that the anisotropic noise in SGD satisfies the two conditions, and thus helps to escape from sharp and poor minima effectively, towards more stable and flat minima that typically generalize well. We verify our understanding through comparing this anisotropic diffusion with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics) and other types of position-dependent noise.
Existing neural models for dialogue response generation assume that utterances are sequentially organized. However, many real-world dialogues involve multiple interlocutors (i.e., multi-party dialogues), where the assumption does not hold as utterances from different interlocutors can occur "in parallel." This paper generalizes existing sequence-based models to a Graph-Structured neural Network (GSN) for dialogue modeling. The core of GSN is a graph-based encoder that can model the information flow along the graph-structured dialogues (two-party sequential dialogues are a special case). Experimental results show that GSN significantly outperforms existing sequence-based models.
Images or videos always contain multiple objects or actions. Multi-label recognition has been witnessed to achieve pretty performance attribute to the rapid development of deep learning technologies. Recently, graph convolution network (GCN) is leveraged to boost the performance of multi-label recognition. However, what is the best way for label correlation modeling and how feature learning can be improved with label system awareness are still unclear. In this paper, we propose a label graph superimposing framework to improve the conventional GCN+CNN framework developed for multi-label recognition in the following two aspects. Firstly, we model the label correlations by superimposing label graph built from statistical co-occurrence information into the graph constructed from knowledge priors of labels, and then multi-layer graph convolutions are applied on the final superimposed graph for label embedding abstraction. Secondly, we propose to leverage embedding of the whole label system for better representation learning. In detail, lateral connections between GCN and CNN are added at shallow, middle and deep layers to inject information of label system into backbone CNN for label-awareness in the feature learning process. Extensive experiments are carried out on MS-COCO and Charades datasets, showing that our proposed solution can greatly improve the recognition performance and achieves new state-of-the-art recognition performance.
Sparsity is regarded as a desirable property of representations, especially in terms of explanation. However, its usage has been limited due to the gap with dense representations. Most NLP research progresses in recent years are based on dense representations. Thus the desirable property of sparsity cannot be leveraged. Inspired by Fourier Transformation, in this paper, we propose a novel Semantic Transformation method to bridge the dense and sparse spaces, which can facilitate the NLP research to shift from dense space to sparse space or to jointly use both spaces. The key idea of the proposed approach is to use a Forward Transformation to transform dense representations to sparse representations. Then some useful operations in the sparse space can be performed over the sparse representations, and the sparse representations can be used directly to perform downstream tasks such as text classification and natural language inference. Then, a Backward Transformation can also be carried out to transform those processed sparse representations to dense representations. Experiments using classification tasks and natural language inference task show that the proposed Semantic Transformation is effective.
This paper deals with inference and prediction for multiple correlated time series, where one has also the choice of using a candidate pool of contemporaneous predictors for each target series. Starting with a structural model for the time-series, Bayesian tools are used for model fitting, prediction, and feature selection, thus extending some recent work along these lines for the univariate case. The Bayesian paradigm in this multivariate setting helps the model avoid overfitting as well as capture correlations among the multiple time series with the various state components. The model provides needed flexibility to choose a different set of components and available predictors for each target series. The cyclical component in the model can handle large variations in the short term, which may be caused by external shocks. We run extensive simulations to investigate properties such as estimation accuracy and performance in forecasting. We then run an empirical study with one-step-ahead prediction on the max log return of a portfolio of stocks that involve four leading financial institutions. Both the simulation studies and the extensive empirical study confirm that this multivariate model outperforms three other benchmark models, viz. a model that treats each target series as independent, the autoregressive integrated moving average model with regression (ARIMAX), and the multivariate ARIMAX (MARIMAX) model.
Widely popular transformer-based NLP models such as BERT and GPT have enormous capacity trending to billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high speed interconnectivity for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer) where at any given moment, the device memory is primarily populated only with the executing layer(s)'s footprint. The model resides in the DRAM memory attached to either a CPU or an FPGA as an entity we call eager param-server (EPS). Unlike a traditional param-server, EPS transmits the model piecemeal to the devices thereby allowing it to perform other tasks in the background such as reduction and distributed optimization. To overcome the bandwidth issues of shuttling parameters to and from EPS, the model is executed a layer at a time across many micro-batches instead of the conventional method of minibatches over whole model. In this paper, we explore a conservative version of L2L that is implemented on a modest Azure instance for BERT-Large running it with a batch size of 32 on a single V100 GPU using less than 8GB memory. Our results show a more stable learning curve, faster convergence, better accuracy and 35% reduction in memory compared to the state-of-the-art baseline. Our method reproduces BERT results on any mid-level GPU that was hitherto not feasible. L2L scales to arbitrary depth without impacting memory or devices allowing researchers to develop affordable devices. It also enables dynamic approaches such as neural architecture search. This work has been performed on GPUs first but also targeted towards high TFLOPS/Watt accelerators such as Graphcore IPUs. The code will soon be available on github.
Deep learning has gained tremendous success and great popularity in the past few years. However, recent research found that it is suffering several inherent weaknesses, which can threaten the security and privacy of the stackholders. Deep learning's wide use further magnifies the caused consequences. To this end, lots of research has been conducted with the purpose of exhaustively identifying intrinsic weaknesses and subsequently proposing feasible mitigation. Yet few is clear about how these weaknesses are incurred and how effective are these attack approaches in assaulting deep learning. In order to unveil the security weaknesses and aid in the development of a robust deep learning system, we are devoted to undertaking a comprehensive investigation on attacks towards deep learning, and extensively evaluating these attacks in multiple views. In particular, we focus on four types of attacks associated with security and privacy of deep learning: model extraction attack, model inversion attack, poisoning attack and adversarial attack. For each type of attack, we construct its essential workflow as well as adversary capabilities and attack goals. Many pivot metrics are devised for evaluating the attack approaches, by which we perform a quantitative and qualitative analysis. From the analysis, we have identified significant and indispensable factors in an attack vector, \eg, how to reduce queries to target models, what distance used for measuring perturbation. We spot light on 17 findings covering these approaches' merits and demerits, success probability, deployment complexity and prospects. Moreover, we discuss other potential security weaknesses and possible mitigation which can inspire relevant researchers in this area.
Flocking control has been studied extensively along with the wide application of multi-vehicle systems. In this paper the Multi-vehicles System (MVS) flocking control with collision avoidance and communication preserving is considered based on the deep reinforcement learning framework. Specifically the deep deterministic policy gradient (DDPG) with centralized training and distributed execution process is implemented to obtain the flocking control policy. First, to avoid the dynamically changed observation of state, a three layers tensor based representation of the observation is used so that the state remains constant although the observation dimension is changing. A reward function is designed to guide the way-points tracking, collision avoidance and communication preserving. The reward function is augmented by introducing the local reward function of neighbors. Finally, a centralized training process which trains the shared policy based on common training set among all agents. The proposed method is tested under simulated scenarios with different setup.