Models, code, and papers for "Hairong Qi":

Discriminative Cross-View Binary Representation Learning

Apr 04, 2018
Liu Liu, Hairong Qi

Learning compact representation is vital and challenging for large scale multimedia data. Cross-view/cross-modal hashing for effective binary representation learning has received significant attention with exponentially growing availability of multimedia content. Most existing cross-view hashing algorithms emphasize the similarities in individual views, which are then connected via cross-view similarities. In this work, we focus on the exploitation of the discriminative information from different views, and propose an end-to-end method to learn semantic-preserving and discriminative binary representation, dubbed Discriminative Cross-View Hashing (DCVH), in light of learning multitasking binary representation for various tasks including cross-view retrieval, image-to-image retrieval, and image annotation/tagging. The proposed DCVH has the following key components. First, it uses convolutional neural network (CNN) based nonlinear hashing functions and multilabel classification for both images and texts simultaneously. Such hashing functions achieve effective continuous relaxation during training without explicit quantization loss by using Direct Binary Embedding (DBE) layers. Second, we propose an effective view alignment via Hamming distance minimization, which is efficiently accomplished by bit-wise XOR operation. Extensive experiments on two image-text benchmark datasets demonstrate that DCVH outperforms state-of-the-art cross-view hashing algorithms as well as single-view image hashing algorithms. In addition, DCVH can provide competitive performance for image annotation/tagging.

* WACV2018
* Published in WACV2018. Code will be available soon
RRPN: Radar Region Proposal Network for Object Detection in Autonomous Vehicles

May 15, 2019
Ramin Nabati, Hairong Qi

Region proposal algorithms play an important role in most state-of-the-art two-stage object detection networks by hypothesizing object locations in the image. Nonetheless, region proposal algorithms are known to be the bottleneck in most two-stage object detection networks, increasing the processing time for each image and resulting in slow networks not suitable for real-time applications such as autonomous driving vehicles. In this paper we introduce RRPN, a Radar-based real-time region proposal algorithm for object detection in autonomous driving vehicles. RRPN generates object proposals by mapping Radar detections to the image coordinate system and generating pre-defined anchor boxes for each mapped Radar detection point. These anchor boxes are then transformed and scaled based on the object's distance from the vehicle, to provide more accurate proposals for the detected objects. We evaluate our method on the newly released NuScenes dataset [1] using the Fast R-CNN object detection network [2]. Compared to the Selective Search object proposal algorithm [3], our model operates more than 100x faster while at the same time achieves higher detection precision and recall. Code has been made publicly available at https://github.com/mrnabati/RRPN .

* To appear in ICIP 2019
Attention-based Few-Shot Person Re-identification Using Meta Learning

Oct 08, 2018
Alireza Rahimpour, Hairong Qi

In this paper, we investigate the challenging task of person re-identification from a new perspective and propose an end-to-end attention-based architecture for few-shot re-identification through meta-learning. The motivation for this task lies in the fact that humans, can usually identify another person after just seeing that given person a few times (or even once) by attending to their memory. On the other hand, the unique nature of the person re-identification problem, i.e., only few examples exist per identity and new identities always appearing during testing, calls for a few shot learning architecture with the capacity of handling new identities. Hence, we frame the problem within a meta-learning setting, where a neural network based meta-learner is trained to optimize a learner i.e., an attention-based matching function. Another challenge of the person re-identification problem is the small inter-class difference between different identities and large intra-class difference of the same identity. In order to increase the discriminative power of the model, we propose a new attention-based feature encoding scheme that takes into account the critical intra-view and cross-view relationship of images. We refer to the proposed Attention-based Re-identification Metalearning model as ARM. Extensive evaluations demonstrate the advantages of the ARM as compared to the state-of-the-art on the challenging PRID2011, CUHK01, CUHK03 and Market1501 datasets.

Addressing Ambiguity in Multi-target Tracking by Hierarchical Strategy

May 30, 2017
Ali Taalimi, Liu Liu, Hairong Qi

This paper presents a novel hierarchical approach for the simultaneous tracking of multiple targets in a video. We use a network flow approach to link detections in low-level and tracklets in high-level. At each step of the hierarchy, the confidence of candidates is measured by using a new scoring system, ConfRank, that considers the quality and the quantity of its neighborhood. The output of the first stage is a collection of safe tracklets and unlinked high-confidence detections. For each individual detection, we determine if it belongs to an existing or is a new tracklet. We show the effect of our framework to recover missed detections and reduce switch identity. The proposed tracker is referred to as TVOD for multi-target tracking using the visual tracker and generic object detector. We achieve competitive results with lower identity switches on several datasets comparing to state-of-the-art.

* 5 pages, Accepted in International Conference of Image Processing, 2017
Image color transfer to evoke different emotions based on color combinations

Nov 02, 2014
Li He, Hairong Qi, Russell Zaretzki

In this paper, a color transfer framework to evoke different emotions for images based on color combinations is proposed. The purpose of this color transfer is to change the "look and feel" of images, i.e., evoking different emotions. Colors are confirmed as the most attractive factor in images. In addition, various studies in both art and science areas have concluded that other than single color, color combinations are necessary to evoke specific emotions. Therefore, we propose a novel framework to transfer color of images based on color combinations, using a predefined color emotion model. The contribution of this new framework is three-fold. First, users do not need to provide reference images as used in traditional color transfer algorithms. In most situations, users may not have enough aesthetic knowledge or path to choose desired reference images. Second, because of the usage of color combinations instead of single color for emotions, a new color transfer algorithm that does not require an image library is proposed. Third, again because of the usage of color combinations, artifacts that are normally seen in traditional frameworks using single color are avoided. We present encouraging results generated from this new framework and its potential in several possible applications including color transfer of photos and paintings.

* Signal, Image and Video Processing, September 2014
One-Shot Mutual Affine-Transfer for Photorealistic Stylization

Jul 24, 2019
Ying Qu, Zhenzhou Shao, Hairong Qi

Photorealistic style transfer aims to transfer the style of a reference photo onto a content photo naturally, such that the stylized image looks like a real photo taken by a camera. Existing state-of-the-art methods are prone to spatial structure distortion of the content image and global color inconsistency across different semantic objects, making the results less photorealistic. In this paper, we propose a one-shot mutual Dirichlet network, to address these challenging issues. The essential contribution of the work is the realization of a representation scheme that successfully decouples the spatial structure and color information of images, such that the spatial structure can be well preserved during stylization. This representation is discriminative and context-sensitive with respect to semantic objects. It is extracted with a shared sparse Dirichlet encoder. Moreover, such representation is encouraged to be matched between the content and style images for faithful color transfer. The affine-transfer model is embedded in the decoder of the network to facilitate the color transfer. The strong representative and discriminative power of the proposed network enables one-shot learning given only one content-style image pair. Experimental results demonstrate that the proposed method is able to generate photorealistic photos without spatial distortion or abrupt color changes.

Unsupervised and Unregistered Hyperspectral Image Super-Resolution with Mutual Dirichlet-Net

May 10, 2019
Ying Qu, Hairong Qi, Chiman Kwan

Hyperspectral images (HSI) provide rich spectral information that contributed to the successful performance improvement of numerous computer vision tasks. However, it can only be achieved at the expense of images' spatial resolution. Hyperspectral image super-resolution (HSI-SR) addresses this problem by fusing low resolution (LR) HSI with multispectral image (MSI) carrying much higher spatial resolution (HR). All existing HSI-SR approaches require the LR HSI and HR MSI to be well registered and the reconstruction accuracy of the HR HSI relies heavily on the registration accuracy of different modalities. This paper exploits the uncharted problem domain of HSI-SR without the requirement of multi-modality registration. Given the unregistered LR HSI and HR MSI with overlapped regions, we design a unique unsupervised learning structure linking the two unregistered modalities by projecting them into the same statistical space through the same encoder. The mutual information (MI) is further adopted to capture the non-linear statistical dependencies between the representations from two modalities (carrying spatial information) and their raw inputs. By maximizing the MI, spatial correlations between different modalities can be well characterized to further reduce the spectral distortion. A collaborative $l_{2,1}$ norm is employed as the reconstruction error instead of the more common $l_2$ norm, so that individual pixels can be recovered as accurately as possible. With this design, the network allows to extract correlated spectral and spatial information from unregistered images that better preserves the spectral information. The proposed method is referred to as unregistered and unsupervised mutual Dirichlet Net ($u^2$-MDN). Extensive experimental results using benchmark HSI datasets demonstrate the superior performance of $u^2$-MDN as compared to the state-of-the-art.

* Submitted to IEEE Transactions on Image Processing
Unsupervised Sparse Dirichlet-Net for Hyperspectral Image Super-Resolution

Jul 15, 2018
Ying Qu, Hairong Qi, Chiman Kwan

In many computer vision applications, obtaining images of high resolution in both the spatial and spectral domains are equally important. However, due to hardware limitations, one can only expect to acquire images of high resolution in either the spatial or spectral domains. This paper focuses on hyperspectral image super-resolution (HSI-SR), where a hyperspectral image (HSI) with low spatial resolution (LR) but high spectral resolution is fused with a multispectral image (MSI) with high spatial resolution (HR) but low spectral resolution to obtain HR HSI. Existing deep learning-based solutions are all supervised that would need a large training set and the availability of HR HSI, which is unrealistic. Here, we make the first attempt to solving the HSI-SR problem using an unsupervised encoder-decoder architecture that carries the following uniquenesses. First, it is composed of two encoder-decoder networks, coupled through a shared decoder, in order to preserve the rich spectral information from the HSI network. Second, the network encourages the representations from both modalities to follow a sparse Dirichlet distribution which naturally incorporates the two physical constraints of HSI and MSI. Third, the angular difference between representations are minimized in order to reduce the spectral distortion. We refer to the proposed architecture as unsupervised Sparse Dirichlet-Net, or uSDN. Extensive experimental results demonstrate the superior performance of uSDN as compared to the state-of-the-art.

* Accepted by The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018, Spotlight)
Fast-converging Conditional Generative Adversarial Networks for Image Synthesis

May 05, 2018
Chengcheng Li, Zi Wang, Hairong Qi

Building on top of the success of generative adversarial networks (GANs), conditional GANs attempt to better direct the data generation process by conditioning with certain additional information. Inspired by the most recent AC-GAN, in this paper we propose a fast-converging conditional GAN (FC-GAN). In addition to the real/fake classifier used in vanilla GANs, our discriminator has an advanced auxiliary classifier which distinguishes each real class from an extra fake' class. The fake' class avoids mixing generated data with real data, which can potentially confuse the classification of real data as AC-GAN does, and makes the advanced auxiliary classifier behave as another real/fake classifier. As a result, FC-GAN can accelerate the process of differentiation of all classes, thus boost the convergence speed. Experimental results on image synthesis demonstrate our model is competitive in the quality of images generated while achieving a faster convergence rate.

* Accepted by ICIP 2018
Decoupled Learning for Conditional Adversarial Networks

Jan 21, 2018
Zhifei Zhang, Yang Song, Hairong Qi

Incorporating encoding-decoding nets with adversarial nets has been widely adopted in image generation tasks. We observe that the state-of-the-art achievements were obtained by carefully balancing the reconstruction loss and adversarial loss, and such balance shifts with different network structures, datasets, and training strategies. Empirical studies have demonstrated that an inappropriate weight between the two losses may cause instability, and it is tricky to search for the optimal setting, especially when lacking prior knowledge on the data and network. This paper gives the first attempt to relax the need of manual balancing by proposing the concept of \textit{decoupled learning}, where a novel network structure is designed that explicitly disentangles the backpropagation paths of the two losses. Experimental results demonstrate the effectiveness, robustness, and generality of the proposed method. The other contribution of the paper is the design of a new evaluation metric to measure the image quality of generative models. We propose the so-called \textit{normalized relative discriminative score} (NRDS), which introduces the idea of relative comparison, rather than providing absolute estimates like existing metrics.

r-BTN: Cross-domain Face Composite and Synthesis from Limited Facial Patches

Dec 06, 2017
Yang Song, Zhifei Zhang, Hairong Qi

We start by asking an interesting yet challenging question, "If an eyewitness can only recall the eye features of the suspect, such that the forensic artist can only produce a sketch of the eyes (e.g., the top-left sketch shown in Fig. 1), can advanced computer vision techniques help generate the whole face image?" A more generalized question is that if a large proportion (e.g., more than 50%) of the face/sketch is missing, can a realistic whole face sketch/image still be estimated. Existing face completion and generation methods either do not conduct domain transfer learning or can not handle large missing area. For example, the inpainting approach tends to blur the generated region when the missing area is large (i.e., more than 50%). In this paper, we exploit the potential of deep learning networks in filling large missing region (e.g., as high as 95% missing) and generating realistic faces with high-fidelity in cross domains. We propose the recursive generation by bidirectional transformation networks (r-BTN) that recursively generates a whole face/sketch from a small sketch/face patch. The large missing area and the cross domain challenge make it difficult to generate satisfactory results using a unidirectional cross-domain learning structure. On the other hand, a forward and backward bidirectional learning between the face and sketch domains would enable recursive estimation of the missing region in an incremental manner (Fig. 1) and yield appealing results. r-BTN also adopts an adversarial constraint to encourage the generation of realistic faces/sketches. Extensive experiments have been conducted to demonstrate the superior performance from r-BTN as compared to existing potential solutions.

* Accepted by AAAI 2018
Feature Encoding in Band-limited Distributed Surveillance Systems

Jun 06, 2017
Alireza Rahimpour, Ali Taalimi, Hairong Qi

Distributed surveillance systems have become popular in recent years due to security concerns. However, transmitting high dimensional data in bandwidth-limited distributed systems becomes a major challenge. In this paper, we address this issue by proposing a novel probabilistic algorithm based on the divergence between the probability distributions of the visual features in order to reduce their dimensionality and thus save the network bandwidth in distributed wireless smart camera networks. We demonstrate the effectiveness of the proposed approach through extensive experiments on two surveillance recognition tasks.

* To be published (Accepted) in: The 42th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2017)
Age Progression/Regression by Conditional Adversarial Autoencoder

Mar 28, 2017
Zhifei Zhang, Yang Song, Hairong Qi

"If I provide you a face image of mine (without telling you the actual age when I took the picture) and a large amount of face images that I crawled (containing labeled faces of different ages but not necessarily paired), can you show me what I would look like when I am 80 or what I was like when I was 5?" The answer is probably a "No." Most existing face aging works attempt to learn the transformation between age groups and thus would require the paired samples as well as the labeled query image. In this paper, we look at the problem from a generative modeling perspective such that no paired samples is required. In addition, given an unlabeled image, the generative model can directly produce the image with desired age attribute. We propose a conditional adversarial autoencoder (CAAE) that learns a face manifold, traversing on which smooth age progression and regression can be realized simultaneously. In CAAE, the face is first mapped to a latent vector through a convolutional encoder, and then the vector is projected to the face manifold conditional on age through a deconvolutional generator. The latent vector preserves personalized face features (i.e., personality) and the age condition controls progression vs. regression. Two adversarial networks are imposed on the encoder and generator, respectively, forcing to generate more photo-realistic faces. Experimental results demonstrate the appealing performance and flexibility of the proposed framework by comparing with the state-of-the-art and ground truth.

* Accepted by The IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)
Single-shot Channel Pruning Based on Alternating Direction Method of Multipliers

Feb 18, 2019
Chengcheng Li, Zi Wang, Xiangyang Wang, Hairong Qi

Channel pruning has been identified as an effective approach to constructing efficient network structures. Its typical pipeline requires iterative pruning and fine-tuning. In this work, we propose a novel single-shot channel pruning approach based on alternating direction methods of multipliers (ADMM), which can eliminate the need for complex iterative pruning and fine-tuning procedure and achieve a target compression ratio with only one run of pruning and fine-tuning. To the best of our knowledge, this is the first study of single-shot channel pruning. The proposed method introduces filter-level sparsity during training and can achieve competitive performance with a simple heuristic pruning criterion (L1-norm). Extensive evaluations have been conducted with various widely-used benchmark architectures and image datasets for object classification purpose. The experimental results on classification accuracy show that the proposed method can outperform state-of-the-art network pruning works under various scenarios.

* Submitted to ICIP 2019
End-to-end Binary Representation Learning via Direct Binary Embedding

Jun 04, 2017
Liu Liu, Alireza Rahimpour, Ali Taalimi, Hairong Qi

Learning binary representation is essential to large-scale computer vision tasks. Most existing algorithms require a separate quantization constraint to learn effective hashing functions. In this work, we present Direct Binary Embedding (DBE), a simple yet very effective algorithm to learn binary representation in an end-to-end fashion. By appending an ingeniously designed DBE layer to the deep convolutional neural network (DCNN), DBE learns binary code directly from the continuous DBE layer activation without quantization error. By employing the deep residual network (ResNet) as DCNN component, DBE captures rich semantics from images. Furthermore, in the effort of handling multilabel images, we design a joint cross entropy loss that includes both softmax cross entropy and weighted binary cross entropy in consideration of the correlation and independence of labels, respectively. Extensive experiments demonstrate the significant superiority of DBE over state-of-the-art methods on tasks of natural object recognition, image retrieval and image annotation.

* Accepted by ICIP'17
Multi-View Task-Driven Recognition in Visual Sensor Networks

May 31, 2017
Ali Taalimi, Alireza Rahimpour, Liu Liu, Hairong Qi

Nowadays, distributed smart cameras are deployed for a wide set of tasks in several application scenarios, ranging from object recognition, image retrieval, and forensic applications. Due to limited bandwidth in distributed systems, efficient coding of local visual features has in fact been an active topic of research. In this paper, we propose a novel approach to obtain a compact representation of high-dimensional visual data using sensor fusion techniques. We convert the problem of visual analysis in resource-limited scenarios to a multi-view representation learning, and we show that the key to finding properly compressed representation is to exploit the position of cameras with respect to each other as a norm-based regularization in the particular signal representation of sparse coding. Learning the representation of each camera is viewed as an individual task and a multi-task learning with joint sparsity for all nodes is employed. The proposed representation learning scheme is referred to as the multi-view task-driven learning for visual sensor network (MT-VSN). We demonstrate that MT-VSN outperforms state-of-the-art in various surveillance recognition tasks.

* 5 pages, Accepted in International Conference of Image Processing, 2017
Context Aware Road-user Importance Estimation (iCARE)

Aug 30, 2019
Alireza Rahimpour, Sujitha Martin, Ashish Tawari, Hairong Qi

Road-users are a critical part of decision-making for both self-driving cars and driver assistance systems. Some road-users, however, are more important for decision-making than others because of their respective intentions, ego vehicle's intention and their effects on each other. In this paper, we propose a novel architecture for road-user importance estimation which takes advantage of the local and global context of the scene. For local context, the model exploits the appearance of the road users (which captures orientation, intention, etc.) and their location relative to ego-vehicle. The global context in our model is defined based on the feature map of the convolutional layer of the module which predicts the future path of the ego-vehicle and contains rich global information of the scene (e.g., infrastructure, road lanes, etc.), as well as the ego vehicle's intention information. Moreover, this paper introduces a new data set of real-world driving, concentrated around inter-sections and includes annotations of important road users. Systematic evaluations of our proposed method against several baselines show promising results.

* Published in: IEEE Intelligent Vehicles (IV), 2019
Investigating Channel Pruning through Structural Redundancy Reduction - A Statistical Study

May 19, 2019
Chengcheng Li, Zi Wang, Dali Wang, Xiangyang Wang, Hairong Qi

Most existing channel pruning methods formulate the pruning task from a perspective of inefficiency reduction which iteratively rank and remove the least important filters, or find the set of filters that minimizes some reconstruction errors after pruning. In this work, we investigate the channel pruning from a new perspective with statistical modeling. We hypothesize that the number of filters at a certain layer reflects the level of 'redundancy' in that layer and thus formulate the pruning problem from the aspect of redundancy reduction. Based on both theoretic analysis and empirical studies, we make an important discovery: randomly pruning filters from layers of high redundancy outperforms pruning the least important filters across all layers based on the state-of-the-art ranking criterion. These results advance our understanding of pruning and further testify to the recent findings that the structure of the pruned model plays a key role in the network efficiency as compared to inherited weights.

* 2019 ICML Workshop, Joint Workshop on On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR). Unnecessary figure after reference removed
Image Super-Resolution by Neural Texture Transfer

Mar 06, 2019
Zhifei Zhang, Zhaowen Wang, Zhe Lin, Hairong Qi

Due to the significant information loss in low-resolution (LR) images, it has become extremely challenging to further advance the state-of-the-art of single image super-resolution (SISR). Reference-based super-resolution (RefSR), on the other hand, has proven to be promising in recovering high-resolution (HR) details when a reference (Ref) image with similar content as that of the LR input is given. However, the quality of RefSR can degrade severely when Ref is less similar. This paper aims to unleash the potential of RefSR by leveraging more texture details from Ref images with stronger robustness even when irrelevant Ref images are provided. Inspired by the recent work on image stylization, we formulate the RefSR problem as neural texture transfer. We design an end-to-end deep model which enriches HR details by adaptively transferring the texture from Ref images according to their textural similarity. Instead of matching content in the raw pixel space as done by previous methods, our key contribution is a multi-level matching conducted in the neural space. This matching scheme facilitates multi-scale neural transfer that allows the model to benefit more from those semantically related Ref patches, and gracefully degrade to SISR performance on the least relevant Ref inputs. We build a benchmark dataset for the general research of RefSR, which contains Ref images paired with LR inputs with varying levels of similarity. Both quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art.

* Project Page: http://web.eecs.utk.edu/~zzhang61/project_page/SRNTT/SRNTT.html. arXiv admin note: text overlap with arXiv:1804.03360
Speeding up convolutional networks pruning with coarse ranking

Feb 18, 2019
Zi Wang, Chengcheng Li, Dali Wang, Xiangyang Wang, Hairong Qi

Channel-based pruning has achieved significant successes in accelerating deep convolutional neural network, whose pipeline is an iterative three-step procedure: ranking, pruning and fine-tuning. However, this iterative procedure is computationally expensive. In this study, we present a novel computationally efficient channel pruning approach based on the coarse ranking that utilizes the intermediate results during fine-tuning to rank the importance of filters, built upon state-of-the-art works with data-driven ranking criteria. The goal of this work is not to propose a single improved approach built upon a specific channel pruning method, but to introduce a new general framework that works for a series of channel pruning methods. Various benchmark image datasets (CIFAR-10, ImageNet, Birds-200, and Flowers-102) and network architectures (AlexNet and VGG-16) are utilized to evaluate the proposed approach for object classification purpose. Experimental results show that the proposed method can achieve almost identical performance with the corresponding state-of-the-art works (baseline) while our ranking time is negligibly short. In specific, with the proposed method, 75% and 54% of the total computation time for the whole pruning procedure can be reduced for AlexNet on CIFAR-10, and for VGG-16 on ImageNet, respectively. Our approach would significantly facilitate pruning practice, especially on resource-constrained platforms.

* Submitted to ICIP 2019