Models, code, and papers for "Yuntao Chen":
In blind image deconvolution, priors are often leveraged to constrain the solution space, so as to alleviate the under-determinacy. Priors which are trained separately from the task of deconvolution tend to be instable, or ineffective. We propose the Golf Optimizer, a novel but simple form of network that learns deep priors from data with better propagation behavior. Like playing golf, our method first estimates an aggressive propagation towards optimum using one network, and recurrently applies a residual CNN to learn the gradient of prior for delicate correction on restoration. Experiments show that our network achieves competitive performance on GoPro dataset, and our model is extremely lightweight compared with the state-of-art works.
Spectral images captured by satellites and radio-telescopes are analyzed to obtain information about geological compositions distributions, distant asters as well as undersea terrain. Spectral images usually contain tens to hundreds of continuous narrow spectral bands and are widely used in various fields. But the vast majority of those image signals are beyond the visible range, which calls for special visualization technique. The visualizations of spectral images shall convey as much information as possible from the original signal and facilitate image interpretation. However, most of the existing visualizatio methods display spectral images in false colors, which contradict with human's experience and expectation. In this paper, we present a novel visualization generative adversarial network (GAN) to display spectral images in natural colors. To achieve our goal, we propose a loss function which consists of an adversarial loss and a structure loss. The adversarial loss pushes our solution to the natural image distribution using a discriminator network that is trained to differentiate between false-color images and natural-color images. We also use a cycle loss as the structure constraint to guarantee structure consistency. Experimental results show that our method is able to generate structure-preserved and natural-looking visualizations.
Displaying the large number of bands in a hyper spectral image on a trichromatic monitor has been an active research topic. The visualized image shall convey as much information as possible form the original data and facilitate image interpretation. Most existing methods display HSIs in false colors which contradict with human's experience and expectation. In this paper, we propose a nonlinear approach to visualize an input HSI with natural colors by taking advantage of a corresponding RGB image. Our approach is based on Moving Least Squares, an interpolation scheme for reconstructing a surface from a set of control points, which in our case is a set of matching pixels between the HSI and the corresponding RGB image. Based on MLS, the proposed method solves for each spectral signature a unique transformation so that the non linear structure of the HSI can be preserved. The matching pixels between a pair of HSI and RGB image can be reused to display other HSIs captured b the same imaging sensor with natural colors. Experiments show that the output image of the proposed method no only have natural colors but also maintain the visual information necessary for human analysis.
We have witnessed rapid evolution of deep neural network architecture design in the past years. These latest progresses greatly facilitate the developments in various areas such as computer vision and natural language processing. However, along with the extraordinary performance, these state-of-the-art models also bring in expensive computational cost. Directly deploying these models into applications with real-time requirement is still infeasible. Recently, Hinton etal. have shown that the dark knowledge within a powerful teacher model can significantly help the training of a smaller and faster student network. These knowledge are vastly beneficial to improve the generalization ability of the student model. Inspired by their work, we introduce a new type of knowledge -- cross sample similarities for model compression and acceleration. This knowledge can be naturally derived from deep metric learning model. To transfer them, we bring the "learning to rank" technique into deep metric learning formulation. We test our proposed DarkRank method on various metric learning tasks including pedestrian re-identification, image retrieval and image clustering. The results are quite encouraging. Our method can improve over the baseline method by a large margin. Moreover, it is fully compatible with other existing methods. When combined, the performance can be further boosted.
Deep learning has been widely used for hyperspectral pixel classification due to its ability of generating deep feature representation. However, how to construct an efficient and powerful network suitable for hyperspectral data is still under exploration. In this paper, a novel neural network model is designed for taking full advantage of the spectral-spatial structure of hyperspectral data. Firstly, we extract pixel-based intrinsic features from rich yet redundant spectral bands by a subnetwork with supervised pre-training scheme. Secondly, in order to utilize the local spatial correlation among pixels, we share the previous subnetwork as a spectral feature extractor for each pixel in a patch of image, after which the spectral features of all pixels in a patch are combined and feeded into the subsequent classification subnetwork. Finally, the whole network is further fine-tuned to improve its classification performance. Specially, the spectral-spatial factorization scheme is applied in our model architecture, making the network size and the number of parameters great less than the existing spectral-spatial deep networks for hyperspectral image classification. Experiments on the hyperspectral data sets show that, compared with some state-of-art deep learning methods, our method achieves better classification results while having smaller network size and less parameters.
Video objection detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods emphasize more on the temporally nearby frames. In this work, we argue that aggregating features in the full-sequence level will lead to more discriminative and robust features for video object detection. To achieve this goal, we devise a novel Sequence Level Semantics Aggregation (SELSA) module. We further demonstrate the close relationship between the proposed method and the classic spectral clustering method, providing a novel view for understanding the VID problem. We test the proposed method on the ImageNet VID and the EPIC KITCHENS dataset and achieve new state-of-the-art results. Our method does not need complicated postprocessing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean.
Recently, one-stage object detectors gain much attention due to their simplicity in practice. Its fully convolutional nature greatly reduces the difficulty of training and deployment compared with two-stage detectors which require NMS and sorting for the proposal stage. However, a fundamental issue lies in all one-stage detectors is the misalignment between anchor boxes and convolutional features, which significantly hinders the performance of one-stage detectors. In this work, we first reveal the deep connection between the widely used im2col operator and the RoIAlign operator. Guided by this illuminating observation, we propose a RoIConv operator which aligns the features and its corresponding anchors in one-stage detection in a principled way. We then design a fully convolutional AlignDet architecture which combines the flexibility of learned anchors and the preciseness of aligned features. Specifically, our AlignDet achieves a state-of-the-art mAP of 44.1 on the COCO test-dev with ResNeXt-101 backbone.
Scale variation is one of the key challenges in object detection. In this work, we first present a controlled experiment to investigate the effect of receptive fields on the detection of different scale objects. Based on the findings from the exploration experiments, we propose a novel Trident Network (TridentNet) aiming to generate scale-specific feature maps with a uniform representational power. We construct a parallel multi-branch architecture in which each branch shares the same transformation parameters but with different receptive fields. Then, we propose a scale-aware training scheme to specialize each branch by sampling object instances of proper scales for training. As a bonus, a fast approximation version of TridentNet could achieve significant improvements without any additional parameters and computational cost. On the COCO dataset, our TridentNet with ResNet-101 backbone achieves state-of-the-art single-model results by obtaining an mAP of 48.4. Code will be made publicly available.
With the surge of deep learning techniques, the field of person re-identification has witnessed rapid progress in recent years. Deep learning based methods focus on learning a feature space where samples are clustered compactly according to their corresponding identities. Most existing methods rely on powerful CNNs to transform the samples individually. In contrast, we propose to consider the sample relations in the transformation. To achieve this goal, we incorporate spectral clustering technique into CNN. We derive a novel module named Spectral Feature Transformation and seamlessly integrate it into existing CNN pipeline with negligible cost,which makes our method enjoy the best of two worlds. Empirical studies show that the proposed approach outperforms previous state-of-the-art methods on four public benchmarks by a considerable margin without bells and whistles.
Object detection and instance recognition play a central role in many AI applications like autonomous driving, video surveillance and medical image analysis. However, training object detection models on large scale datasets remains computationally expensive and time consuming. This paper presents an efficient and open source object detection framework called SimpleDet which enables the training of state-of-the-art detection models on consumer grade hardware at large scale. SimpleDet supports up-to-date detection models with best practice. SimpleDet also supports distributed training with near linear scaling out of box. Codes, examples and documents of SimpleDet can be found at https://github.com/tusimple/simpledet .