Models, code, and papers for "Haojie Li":

Probabilistic Filtered Soft Labels for Domain Adaptation

Dec 24, 2019
Wei Wang, Zhihui Wang, Haojie Li, Zhengming Ding

Many domain adaptation (DA) methods aim to project the source and target domains into a common feature space, where the inter-domain distributional differences are reduced and some intra-domain properties are preserved. Recent research obtains their respective new representations using some predefined statistics. However, because no labeled target data is available, these methods usually formulate the class-wise statistics, such as the class-wise MMD and class scatter matrices, using pseudo hard labels. The probabilities of data points belonging to each class given by the hard labels are either 0 or 1, while soft labels relax this strong constraint and can take any value between them. Although existing works have noticed the advantage of soft labels, they either deal with those class-wise statistics inadequately or introduce small irrelevant probabilities in the soft labels. Therefore, we propose filtered soft labels to discard those confusing probabilities, and both the class-wise MMD and the class scatter matrices are then modeled in this way. In order to obtain more accurate filtered soft labels, we take advantage of a well-designed Graph-based Label Propagation (GLP) method and incorporate it into the DA procedure to formulate a unified framework.

* 15 pages, 7 figures, IEEE Transactions on Image Processing. arXiv admin note: text overlap with arXiv:1906.07441 by other authors
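
The contrast between hard, soft, and filtered soft labels in the abstract above can be illustrated with a minimal sketch. The abstract does not state the filtering rule, so the fixed `threshold`, the renormalization step, and the function name `filter_soft_labels` are assumptions made only for illustration, not the authors' method.

```python
def filter_soft_labels(probs, threshold=0.1):
    """Zero out class probabilities below `threshold`, then renormalize.

    Both `threshold` and the renormalization are illustrative assumptions;
    the abstract does not specify how the filtering is done.
    """
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        # Every entry was filtered out: fall back to a hard label on the argmax.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    return [p / total for p in kept]


# A soft label over four classes for one target sample (made-up numbers).
soft = [0.55, 0.35, 0.06, 0.04]
# The corresponding pseudo hard label keeps only the most likely class.
hard = [1.0 if i == 0 else 0.0 for i in range(len(soft))]
# The filtered soft label drops the two small entries and keeps the rest.
filtered = filter_soft_labels(soft)
```

Under these assumptions, the filtered label keeps the uncertainty between the two plausible classes while the small probabilities of the other two no longer contribute to any class-wise statistic.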

Sequential Dual Deep Learning with Shape and Texture Features for Sketch Recognition

Aug 09, 2017
Qi Jia, Meiyu Yu, Xin Fan, Haojie Li

Recognizing freehand sketches with high arbitrariness is greatly challenging. Most existing methods either ignore the geometric characteristics or treat sketches as handwritten characters with fixed structural ordering. Consequently, they can hardly yield high recognition performance even though sophisticated learning techniques are employed. In this paper, we propose a sequential deep learning strategy that combines both shape and texture features. A coded shape descriptor is exploited to characterize the geometry of sketch strokes with high flexibility, while the outputs of convolutional neural networks (CNNs) are taken as the abstract texture feature. We develop dual deep networks with memorable gated recurrent units (GRUs), and sequentially feed these two types of features into the dual networks, respectively. These dual networks enable the feature fusion by another gated recurrent unit (GRU), and thus accurately recognize sketches invariant to stroke ordering. The experiments on the TU-Berlin data set show that our method outperforms the average of human and state-of-the-art algorithms even when significant shape and appearance variations occur.

* 8 pages, 8 figures 

Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

Apr 01, 2019
Xinzhu Ma, Zhihui Wang, Haojie Li, Wanli Ouyang, Pengbo Zhang

In this paper, we propose a monocular 3D object detection framework in the domain of autonomous driving. Unlike previous image-based methods, which focus on RGB features extracted from 2D images, our method solves this problem in the reconstructed 3D space in order to exploit 3D contexts explicitly. To this end, we first leverage a stand-alone module to transform the input data from the 2D image plane to 3D point cloud space for a better input representation, then we perform the 3D detection using a PointNet backbone net to obtain objects' 3D locations, dimensions and orientations. To enhance the discriminative capability of point clouds, we propose a multi-modal feature fusion module to embed the complementary RGB cue into the generated point cloud representation. We argue that it is more effective to infer the 3D bounding boxes from the generated 3D scene space (i.e., X, Y, Z space) compared to the image plane (i.e., R, G, B image plane). Evaluation on the challenging KITTI dataset shows that our approach boosts the performance of state-of-the-art monocular approaches by a large margin, i.e., around 15% absolute AP on both 3D localization and detection tasks for the Car category at 0.7 IoU threshold.

* arXiv admin note: text overlap with arXiv:1711.06396 by other authors 
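
The step from the 2D image plane to a 3D point cloud mentioned in the abstract above is commonly done by back-projecting a per-pixel depth map through a pinhole camera model. The sketch below shows that standard back-projection only; the abstract does not say how depth is obtained or which camera model is used, so the depth values, the intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `backproject` are all assumptions.

```python
def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth map into 3D camera-frame points (pinhole model).

    `depth` is a list of rows of depth values; a value <= 0 marks a pixel
    without valid depth. The intrinsics here are hypothetical placeholders.
    """
    points = []
    for v, row in enumerate(depth):          # v: pixel row (image y)
        for u, z in enumerate(row):          # u: pixel column (image x)
            if z <= 0:
                continue                     # skip pixels without valid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A tiny 2x2 depth map with one invalid pixel (made-up numbers).
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each resulting (X, Y, Z) point could then carry its source pixel's RGB value as extra channels, which is one simple way to picture the color embedding the abstract refers to.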

User-Guided Deep Anime Line Art Colorization with Conditional Adversarial Networks

Aug 10, 2018
Yuanzheng Ci, Xinzhu Ma, Zhihui Wang, Haojie Li, Zhongxuan Luo

Scribble-color-based line art colorization is a challenging computer vision problem, since neither greyscale values nor semantic information is present in line art, and the lack of authentic illustration-line art training pairs also increases the difficulty of model generalization. Recently, several Generative Adversarial Nets (GANs) based methods have achieved great success. They can generate colorized illustrations conditioned on given line art and color hints. However, these methods fail to capture the authentic illustration distributions and are hence perceptually unsatisfying in the sense that they often lack accurate shading. To address these challenges, we propose a novel deep conditional adversarial architecture for scribble-based anime line art colorization. Specifically, we integrate the conditional framework with WGAN-GP criteria as well as the perceptual loss to enable us to robustly train a deep network that makes the synthesized images more natural and real. We also introduce a local features network that is independent of synthetic data. With GANs conditioned on features from such a network, we notably increase the generalization capability over "in the wild" line art. Furthermore, we collect two datasets that provide high-quality colorful illustrations and authentic line art for training and benchmarking. With the proposed model trained on our illustration dataset, we demonstrate that images synthesized by the presented approach are considerably more realistic and precise than alternative approaches.

* Accepted for publication at the 2018 ACM Multimedia Conference (MM '18) 

A Single Shot Text Detector with Scale-adaptive Anchors

Jul 05, 2018
Qi Yuan, Bingwang Zhang, Haojie Li, Zhihui Wang, Zhongxuan Luo

Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large number of anchors with different scales to discover texts in scene images, thus leading to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a small number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to discrete scales used in previous methods, the learned continuous scales are more reliable, especially for small text detection. Additionally, we propose Anchor convolution to better exploit necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only 0.28 seconds per image, while outperforming most state-of-the-art methods in accuracy.

* 8 pages, 6 figures

UDD: An Underwater Open-sea Farm Object Detection Dataset for Underwater Robot Picking

Mar 03, 2020
Zhihui Wang, Chongwei Liu, Shijie Wang, Tao Tang, Yulong Tao, Caifei Yang, Haojie Li, Xing Liu, Xin Fan

To promote the development of underwater robot picking in sea farms, we propose an underwater open-sea farm object detection dataset called UDD. Concretely, UDD consists of 3 categories (seacucumber, seaurchin, and scallop) with 2227 images. To the best of our knowledge, it is the first dataset collected in a real open-sea farm for underwater robot picking. We also propose a novel Poisson-blending-embedded Generative Adversarial Network (Poisson GAN) to overcome the class-imbalance and massive small objects issues in UDD. By utilizing Poisson GAN to change the number, position, and even size of objects in UDD, we construct a large-scale augmented dataset (AUDD) containing 18K images. Besides, in order to make the detector better adapted to the underwater picking environment, a dataset (Pre-trained dataset) for pre-training containing 590K images is also proposed. Finally, we design a lightweight network (UnderwaterNet) to address the problems of detecting small objects in cloudy underwater pictures and meeting the efficiency requirements of robots. Specifically, we design a depth-wise-convolution-based Multi-scale Contextual Features Fusion (MFF) block and a Multi-scale Blursampling (MBP) module to reduce the parameters of the network to 1.3M at 48FPS, without any loss in accuracy. Extensive experiments verify the effectiveness of the proposed UnderwaterNet, Poisson GAN, UDD, AUDD, and Pre-trained datasets.

* 10 pages, 9 figures 

Hybrid Robotic-assisted Frameworks for Endomicroscopy Scanning in Retinal Surgeries

Sep 15, 2019
Zhaoshuo Li, Mahya Shahbazi, Niravkumar Patel, Eimear O'Sullivan, Haojie Zhang, Khushi Vyas, Preetham Chalasani, Anton Deguet, Peter L. Gehlbach, Iulian Iordachita, Guang-Zhong Yang, Russell H. Taylor

High-resolution real-time imaging at the cellular level in retinal surgeries is very challenging due to the extremely confined space within the eyeball and the lack of appropriate modalities. A probe-based confocal laser endomicroscopy (pCLE) system, which has a small footprint and provides highly magnified images, can be a potential imaging modality for improved diagnosis. The ability to visualize at the cellular level the retinal pigment epithelium and the choroidal blood vessels underneath can provide useful information for surgical outcomes in conditions such as retinal detachment. However, the adoption of pCLE is limited due to its narrow field of view and micron-level range of focus. The physiological tremor of surgeons' hands also deteriorates the image quality considerably and leads to poor imaging results. In this paper, a novel image-based hybrid motion control approach is proposed to mitigate the challenges of using pCLE in retinal surgeries. The proposed framework enables shared control of the pCLE probe by a surgeon to scan the tissue precisely without hand tremors and an auto-focus image-based control algorithm that optimizes the quality of pCLE images. The control strategy is deployed on two semi-autonomous frameworks: cooperative and teleoperated. Both frameworks consist of the Steady-Hand Eye Robot (SHER), whose end-effector holds the pCLE probe...

* 12 pages, TMRB 

Object-Based Image Coding: A Learning-Driven Revisit

Mar 18, 2020
Qi Xia, Haojie Liu, Zhan Ma

Object-Based Image Coding (OBIC), which was extensively studied about two decades ago, promised a vast application perspective for both ultra-low bitrate communication and high-level semantic content understanding, but it has rarely been used due to the inefficient compact representation of objects with arbitrary shapes. The fundamental issue is how to efficiently process arbitrary-shaped objects at a fine granularity (e.g., feature element or pixel wise). To attack this, we propose to apply element-wise masking and compression by devising an object segmentation network for image layer decomposition, and parallel convolution-based neural image compression networks to process masked foreground objects and the background scene separately. All components are optimized in an end-to-end learning framework to intelligently weigh their (e.g., object and background) contributions for visually pleasant reconstruction. We have conducted comprehensive experiments to evaluate the performance on the PASCAL VOC dataset in a very low bitrate scenario (e.g., $\lesssim$0.1 bits per pixel, bpp), which demonstrate noticeable subjective quality improvement compared with JPEG2K, HEVC-based BPG and another learned image compression method. All relevant materials are made publicly accessible at

* ICME 2020

Multi-Sensor 3D Object Box Refinement for Autonomous Driving

Sep 11, 2019
Peiliang Li, Siqi Liu, Shaojie Shen

We propose a 3D object detection system with multi-sensor refinement in the context of autonomous driving. In our framework, the monocular camera serves as the fundamental sensor for 2D object proposal and initial 3D bounding box prediction, while the stereo cameras and LiDAR are treated as adaptive plug-in sensors to refine the 3D box localization performance. For each observed element in the raw measurement domain (e.g., pixels for stereo, 3D points for LiDAR), we model the local geometry as an instance vector representation, which indicates the 3D coordinate of each element with respect to the object frame. Using this unified geometric representation, the 3D object location can be refined in a unified way by the stereo photometric alignment or point cloud alignment. We demonstrate superior 3D detection and localization performance compared to state-of-the-art monocular and stereo methods, and competitive performance compared with the baseline LiDAR method on the KITTI object benchmark.

Stereo R-CNN based 3D Object Detection for Autonomous Driving

Apr 10, 2019
Peiliang Li, Xiaozhi Chen, Shaojie Shen

We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate objects in left and right images. We add extra branches after the stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input or 3D position supervision, yet it outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code has been released at

* Accepted by CVPR 2019

Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving

Jul 26, 2018
Peiliang Li, Tong Qin, Shaojie Shen

We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box using end-to-end approaches, we propose to use the easy-to-label 2D detection and discrete viewpoint classification together with a light-weight semantic inference method to obtain rough 3D object measurements. Based on the object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach to fuse temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with the state-of-the-art solutions.

* 14 pages, 9 figures, ECCV 2018

Relocalization, Global Optimization and Map Merging for Monocular Visual-Inertial SLAM

Mar 05, 2018
Tong Qin, Peiliang Li, Shaojie Shen

The monocular visual-inertial system (VINS), which consists of one camera and one low-cost inertial measurement unit (IMU), is a popular approach to achieve accurate 6-DOF state estimation. However, such locally accurate visual-inertial odometry is prone to drift and cannot provide absolute pose estimation. Leveraging history information to relocalize and correct drift has become a hot topic. In this paper, we propose a monocular visual-inertial SLAM system, which can relocalize the camera and get the absolute pose in a previously built map. Then 4-DOF pose graph optimization is performed to correct drifts and achieve global consistency. The 4-DOF contains x, y, z, and yaw angle, which is the actual drifted direction in the visual-inertial system. Furthermore, the proposed system can reuse a map by saving and loading it in an efficient way. The current map and a previous map can be merged together by the global pose graph optimization. We validate the accuracy of our system on public datasets and compare against other state-of-the-art algorithms. We also evaluate the map merging ability of our system in a large-scale outdoor environment. The source code of map reuse is integrated into our public code, VINS-Mono.

* 8 pages 
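
The abstract above restricts drift correction to four degrees of freedom: x, y, z, and yaw. A minimal sketch of what a 4-DOF correction does to a single pose is given below. It is not the paper's pose graph optimizer; the pose layout, the convention of rotating about the vertical axis before translating, and the function names are assumptions chosen for illustration.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians into the range [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def apply_4dof_correction(pose, correction):
    """Apply a (dx, dy, dz, dyaw) correction to an (x, y, z, yaw) pose.

    Assumed convention: rotate the position about the vertical (z) axis by
    dyaw, then translate by (dx, dy, dz). Roll and pitch are left untouched,
    which is what makes the correction 4-DOF rather than 6-DOF.
    """
    x, y, z, yaw = pose
    dx, dy, dz, dyaw = correction
    c, s = math.cos(dyaw), math.sin(dyaw)
    return (c * x - s * y + dx,
            s * x + c * y + dy,
            z + dz,
            wrap_angle(yaw + dyaw))


# A drifted pose and a hypothetical correction found by loop closure.
drifted = (1.0, 0.0, 0.0, 0.0)
corrected = apply_4dof_correction(drifted, (0.0, 0.0, 1.0, math.pi / 2))
```

In a pose graph, a correction of this form would be estimated for every keyframe jointly; the sketch only shows how one such correction acts on one pose.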

VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator

Aug 13, 2017
Tong Qin, Peiliang Li, Shaojie Shen

A monocular visual-inertial system (VINS), consisting of a camera and a low-cost inertial measurement unit (IMU), forms the minimum sensor suite for metric six degrees-of-freedom (DOF) state estimation. However, the lack of direct distance measurement poses significant challenges in terms of IMU processing, estimator initialization, extrinsic calibration, and nonlinear optimization. In this work, we present VINS-Mono: a robust and versatile monocular visual-inertial state estimator. Our approach starts with a robust procedure for estimator initialization and failure recovery. A tightly-coupled, nonlinear optimization-based method is used to obtain high accuracy visual-inertial odometry by fusing pre-integrated IMU measurements and feature observations. A loop detection module, in combination with our tightly-coupled formulation, enables relocalization with minimum computation overhead. We additionally perform four degrees-of-freedom pose graph optimization to enforce global consistency. We validate the performance of our system on public datasets and real-world experiments and compare against other state-of-the-art algorithms. We also perform onboard closed-loop autonomous flight on the MAV platform and port the algorithm to an iOS-based demonstration. We highlight that the proposed work is a reliable, complete, and versatile system that is applicable for different applications that require high accuracy localization. We open source our implementations for both PCs and iOS mobile devices.

* journal paper 

Towards Distribution-Free Multi-Armed Bandits with Combinatorial Strategies

Oct 05, 2014
Xiang-yang Li, Shaojie Tang, Yaqin Zhou

In this paper we study a generalized version of the classical multi-armed bandits (MABs) problem by allowing for arbitrary constraints on constituent bandits at each decision point. The motivation of this study comes from many situations that involve repeatedly making choices subject to arbitrary constraints in an uncertain environment: for instance, regularly deciding which advertisements to display online in order to gain a high click-through rate without knowing user preferences, or what route to drive home each day under uncertain weather and traffic conditions. Assume that there are $K$ unknown random variables (RVs), i.e., arms, each evolving as an i.i.d. stochastic process over time. At each decision epoch, we select a strategy, i.e., a subset of RVs, subject to arbitrary constraints on constituent RVs. We then gain a reward that is a linear combination of observations on selected RVs. The performance of prior results for this problem heavily depends on the distribution of strategies generated by the corresponding learning policy. For example, if the reward difference between the best and second-best strategies approaches zero, prior results may lead to arbitrarily large regret. Meanwhile, when there is an exponential number of possible strategies at each decision point, a naive extension of a prior distribution-free policy would cause poor performance in terms of regret, computation and space complexity. To this end, we propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret, regardless of the probability distribution of the resultant strategies. Our learning policy has both $O(K)$ time complexity and $O(K)$ space complexity. In successive generations, we show that even if finding the optimal strategy at each decision point is NP-hard, our policy still allows for approximated solutions while retaining near zero-regret.

Neural Subgraph Isomorphism Counting

Dec 25, 2019
Xin Liu, Haojie Pan, Mutian He, Yangqiu Song, Xin Jiang

In this paper, we study a new graph learning problem: learning to count subgraph isomorphisms. Although the learning based approach is inexact, we are able to generalize to count large patterns and data graphs in polynomial time compared to the exponential time of the original NP-complete problem. Different from other traditional graph learning problems such as node classification and link prediction, subgraph isomorphism counting requires more global inference to oversee the whole graph. To tackle this problem, we propose a dynamic intermedium attention memory network (DIAMNet) which augments different representation learning architectures and iteratively attends pattern and target data graphs to memorize different subgraph isomorphisms for the global counting. We develop both small graphs (<= 1,024 subgraph isomorphisms in each) and large graphs (<= 4,096 subgraph isomorphisms in each) sets to evaluate different models. Experimental results show that learning based subgraph isomorphism counting can help reduce the time complexity with acceptable accuracy. Our DIAMNet can further improve existing representation learning models for this more global problem.

Multi-Channel Pyramid Person Matching Network for Person Re-Identification

Mar 07, 2018
Chaojie Mao, Yingming Li, Yaqing Zhang, Zhongfei Zhang, Xi Li

In this work, we present a Multi-Channel deep convolutional Pyramid Person Matching Network (MC-PPMN) based on the combination of the semantic-components and the color-texture distributions to address the problem of person re-identification. In particular, we learn separate deep representations for semantic-components and color-texture distributions from two person images and then employ pyramid person matching network (PPMN) to obtain correspondence representations. These correspondence representations are fused to perform the re-identification task. Further, the proposed framework is optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art literature, especially on the rank-1 recognition rate.

* 9 pages, 5 figures, 7 tables and accepted by the 32nd AAAI Conference on Artificial Intelligence 

Pyramid Person Matching Network for Person Re-identification

Mar 07, 2018
Chaojie Mao, Yingming Li, Zhongfei Zhang, Yaqing Zhang, Xi Li

In this work, we present a deep convolutional pyramid person matching network (PPMN) with a specially designed Pyramid Matching Module to address the problem of person re-identification. The architecture takes a pair of RGB images as input, and outputs a similarity value indicating whether the two input images represent the same person or not. Based on deep convolutional neural networks, our approach first learns the discriminative semantic representation with the semantic-component-aware features for persons and then employs the Pyramid Matching Module to match the common semantic-components of persons, which is robust to the variation of spatial scales and misalignment of locations posed by viewpoint changes. The above two processes are jointly optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art approaches, especially on the rank-1 recognition rate.

* 11 pages, 3 figures, 4 tables and accepted by Proceedings of 9th Asian Conference on Machine Learning (ACML2017) JMLR Workshop and Conference Proceedings, vol. 77, 2017

Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Mar 05, 2020
Bang Liu, Haojie Wei, Di Niu, Haolan Chen, Yancheng He

The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep the conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of the generation quality, while being scalable in the meantime. With models trained on a relatively smaller amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.

* Accepted by The Web Conference 2020 (WWW 2020) as full paper (oral presentation) 

PLIN: A Network for Pseudo-LiDAR Point Cloud Interpolation

Sep 16, 2019
Haojie Liu, Kang Liao, Chunyu Lin, Yao Zhao, Yulan Guo

LiDAR sensors can provide dependable 3D spatial information at a low frequency (around 10Hz) and have been widely applied in the fields of autonomous driving and UAVs. However, the frame rate of a camera with a higher frequency (around 20Hz) has to be decreased to match the LiDAR in a multi-sensor system. In this paper, we propose a novel Pseudo-LiDAR interpolation network (PLIN) to increase the frequency of LiDAR sensors. PLIN can generate temporally and spatially high-quality point cloud sequences to match the high frequency of cameras. To achieve this goal, we design a coarse interpolation stage guided by consecutive sparse depth maps and motion relationship. We also propose a refined interpolation stage guided by the realistic scene. Using this coarse-to-fine cascade structure, our method can progressively perceive multi-modal information and generate accurate intermediate point clouds. To the best of our knowledge, this is the first deep framework for Pseudo-LiDAR point cloud interpolation, which shows appealing applications in navigation systems equipped with LiDAR and cameras. Experimental results demonstrate that PLIN achieves promising performance on the KITTI dataset, significantly outperforming the traditional interpolation method and the state-of-the-art video interpolation technique.

* 7 pages, 5 figures, Submitted to ICRA2020 
