Models, code, and papers for "Yuning Jiang":

##### What Can Help Pedestrian Detection?

May 08, 2017
Jiayuan Mao, Tete Xiao, Yuning Jiang, Zhimin Cao

Aggregating extra features has been considered an effective approach to boost traditional pedestrian detection methods. However, there is still a lack of studies on whether and how CNN-based pedestrian detectors can benefit from these extra features. The first contribution of this paper is exploring this issue by aggregating extra features into a CNN-based pedestrian detection framework. Through extensive experiments, we evaluate the effects of different kinds of extra features quantitatively. Moreover, we propose a novel network architecture, namely HyperLearner, to jointly learn pedestrian detection as well as the given extra feature. Through multi-task training, HyperLearner is able to utilize the information of given features and improve detection performance without extra inputs at inference. The experimental results on multiple pedestrian benchmarks validate the effectiveness of the proposed HyperLearner.

* Accepted to IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2017
##### Acquisition of Localization Confidence for Accurate Object Detection

Jul 30, 2018
Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, Yuning Jiang

Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression to localize objects. While the probabilities for class labels naturally reflect classification confidence, localization confidence is absent. This causes properly localized bounding boxes to degenerate during iterative regression or even to be suppressed during NMS. In this paper, we propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground-truth. The network acquires this confidence of localization, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, where the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.
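
The NMS variant described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: boxes are ranked by a predicted localization confidence instead of the classification score, and a kept box inherits the highest classification score among the boxes it suppresses. The function names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_guided_nms(boxes, cls_scores, loc_conf, thresh=0.5):
    """Return (box_index, class_score) pairs kept after suppression.

    Boxes are visited in order of decreasing localization confidence
    (hypothetical predicted IoU), not classification score.
    """
    order = sorted(range(len(boxes)), key=lambda i: loc_conf[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)
        score = cls_scores[best]
        remaining = []
        for j in order:
            if iou(boxes[best], boxes[j]) > thresh:
                # Suppressed box: pass its class score on if it is higher.
                score = max(score, cls_scores[j])
            else:
                remaining.append(j)
        order = remaining
        kept.append((best, score))
    return kept
```

With two heavily overlapping boxes, the one with the higher localization confidence survives even if its classification score is lower, which is the behavior the abstract motivates.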

* Accepted to European Conference on Computer Vision (ECCV) 2018
##### SOLO: Segmenting Objects by Locations

Dec 15, 2019
Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li

We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first and then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on-par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.
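
One way to picture an "instance category" is as the index of the grid cell that contains the instance center. The sketch below assumes an S x S grid over the image and ignores the size component mentioned in the abstract; the function name, the grid size of 5, and the row-major indexing are illustrative assumptions rather than details taken from the paper.

```python
def instance_category(center_x, center_y, img_w, img_h, grid=5):
    """Map an instance center to a location-based category in [0, grid*grid).

    Assumption: the image is divided into a grid x grid layout and the
    category is the row-major index of the cell containing the center.
    """
    # Clamp so that a center on the right or bottom edge stays in range.
    col = min(int(center_x / img_w * grid), grid - 1)
    row = min(int(center_y / img_h * grid), grid - 1)
    return row * grid + col
```

Under this view, predicting which cell an instance belongs to is an ordinary classification problem with `grid * grid` classes.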

* 10 pages. Add "error analysis"; and fix typos
##### FoveaBox: Beyond Anchor-based Object Detector

Apr 08, 2019
Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Jianbo Shi

We present FoveaBox, an accurate, flexible and completely anchor-free framework for object detection. While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales and aspect ratios for the search of the objects, their performance and generalization ability are also limited by the design of anchors. Instead, FoveaBox directly learns the object existing possibility and the bounding box coordinates without anchor reference. This is achieved by: (a) predicting category-sensitive semantic maps for the object existing possibility, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. Without bells and whistles, FoveaBox achieves state-of-the-art single model performance of 42.1 AP on the standard COCO detection benchmark. Especially for objects with arbitrary aspect ratios, FoveaBox brings significant improvement compared to anchor-based detectors. More surprisingly, when it is challenged by stretched testing images, FoveaBox shows great robustness and generalization ability to the changed distribution of bounding box shapes. The code will be made publicly available.

* Technical report
##### Consistent Optimization for Single-Shot Object Detection

Jan 23, 2019
Tao Kong, Fuchun Sun, Huaping Liu, Yuning Jiang, Jianbo Shi

We present consistent optimization for single stage object detection. Previous works on single stage object detectors usually rely on regular, densely sampled anchors to generate hypotheses for the optimization of the model. Through an examination of the behavior of the detector, we observe that the misalignment between the optimization target and inference configurations has hindered the performance improvement. We propose to bridge this gap by consistent optimization, which is an extension of the traditional single stage detector's optimization strategy. Consistent optimization focuses on matching the training hypotheses and the inference quality by utilizing the refined anchors during training. To evaluate its effectiveness, we evaluate various design choices based on the state-of-the-art RetinaNet detector. We demonstrate it is the consistent optimization, not the architecture design, that yields the performance boosts. Consistent optimization is nearly cost-free, and achieves stable performance gains independent of the model capacities or input scales. Specifically, utilizing consistent optimization improves RetinaNet from 39.1 AP to 40.1 AP on the COCO dataset without any bells or whistles, which surpasses the accuracy of all existing state-of-the-art one-stage detectors when adopting ResNet-101 as backbone. The code will be made available.

* Technical report
##### Unified Perceptual Parsing for Scene Understanding

Jul 26, 2018
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun

Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at https://github.com/CSAILVision/unifiedparsing.

* Accepted to European Conference on Computer Vision (ECCV) 2018
##### Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

Jun 27, 2018
Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun

We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitation of current frameworks and image-text datasets (e.g., MS-COCO) both quantitatively and qualitatively. The large gap between the number of possible constitutions of real-world semantics and the size of parallel data, to a large extent, restricts the model's ability to establish the link between textual semantics and visual concepts. We alleviate this problem by augmenting the MS-COCO image captioning datasets with textual contrastive adversarial samples. These samples are synthesized using linguistic rules and the WordNet knowledge base. The construction procedure is both syntax- and semantics-aware. The samples force the model to ground learned embeddings to concrete concepts within the image. This simple but powerful technique brings a noticeable improvement over the baselines on a diverse set of downstream tasks, in addition to defending against known-type adversarial attacks. We release the code at https://github.com/ExplorerFreda/VSE-C.

* To Appear at COLING 2018
##### FastMask: Segment Multi-scale Object Candidates in One Shot

Apr 11, 2017
Hexiang Hu, Shiyi Lan, Yuning Jiang, Zhimin Cao, Fei Sha

Objects appear to scale differently in natural images. This fact requires methods dealing with object-centric tasks (e.g. object proposal) to have robust performance over variances in object scales. In the paper, we present a novel segment proposal framework, namely FastMask, which takes advantage of hierarchical features in deep convolutional neural networks to segment multi-scale objects in one shot. Innovatively, we adapt the segment proposal network into three different functional components (body, neck and head). We further propose a weight-shared residual neck module as well as a scale-tolerant attentional head module for efficient one-shot inference. On the MS COCO benchmark, the proposed FastMask outperforms all state-of-the-art segment proposal methods in average recall while being 2~5 times faster. Moreover, with a slight trade-off in accuracy, FastMask can segment objects in near real time (~13 fps) with 800×600 resolution images, demonstrating its potential in practical applications. Our implementation is available at https://github.com/voidrank/FastMask.

* Accepted to CVPR 2017
##### UnitBox: An Advanced Object Detection Network

Aug 04, 2016
Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, Thomas Huang

In present object detection systems, deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over the traditional region proposal methods. However, existing deep CNN methods assume the object bounds to be four independent variables, which can be regressed by the $\ell_2$ loss separately. Such an oversimplified assumption is contrary to the well-received observation that those variables are correlated, resulting in less accurate localization. To address the issue, we first introduce a novel Intersection over Union ($IoU$) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit. By taking advantage of the $IoU$ loss and deep fully convolutional networks, UnitBox is introduced, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges fast. We apply UnitBox to the face detection task and achieve the best performance among all published methods on the FDDB benchmark.
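
To make the contrast with per-coordinate $\ell_2$ regression concrete, the sketch below computes an IoU loss of the form $-\ln(IoU)$ for a single location. It assumes each box is described by the distances from that location to the top, bottom, left, and right bounds; the parameterization, the function name, and the `eps` clamp are assumptions made for this illustration and are not stated in the abstract.

```python
import math


def iou_loss(pred, gt, eps=1e-9):
    """-ln(IoU) for boxes given as (top, bottom, left, right) distances
    measured from the same pixel. All four bounds enter a single term,
    so they are optimized jointly rather than independently.
    """
    pt, pb, pl, pr = pred
    gt_t, gt_b, gt_l, gt_r = gt
    pred_area = (pt + pb) * (pl + pr)
    gt_area = (gt_t + gt_b) * (gt_l + gt_r)
    # Both boxes contain the pixel, so the overlap along each axis is the
    # sum of the smaller distances on either side of it.
    inter_h = min(pt, gt_t) + min(pb, gt_b)
    inter_w = min(pl, gt_l) + min(pr, gt_r)
    inter = inter_h * inter_w
    union = pred_area + gt_area - inter
    iou = inter / union if union > 0 else 0.0
    return -math.log(max(iou, eps))
```

A perfect prediction gives a loss of zero, and the loss grows as the overlap shrinks regardless of the absolute box size, which is one reason an IoU-based loss can be less sensitive to object scale than an $\ell_2$ loss.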

* To appear in ACM MM 2016, 5 pages, 6 figures
##### Learning Deep Face Representation

Mar 12, 2014
Haoqiang Fan, Zhimin Cao, Yuning Jiang, Qi Yin, Chinchilla Doudou

Face representation is a crucial step of face recognition systems. An optimal face representation should be discriminative, robust, compact, and very easy to implement. While numerous hand-crafted and learning-based representations have been proposed, considerable room for improvement is still present. In this paper, we present a very easy-to-implement deep learning framework for face representation. Our method is based on a new structure of deep network (called Pyramid CNN). The proposed Pyramid CNN adopts a greedy-filter-and-down-sample operation, which enables the training procedure to be very fast and computation-efficient. In addition, the structure of Pyramid CNN can naturally incorporate feature sharing across multi-scale face representations, increasing the discriminative ability of the resulting representation. Our basic network is capable of achieving high recognition accuracy ($85.8\%$ on the LFW benchmark) with only an 8-dimensional representation. When extended to feature-sharing Pyramid CNN, our system achieves state-of-the-art performance ($97.3\%$) on the LFW benchmark. We also introduce a new benchmark of realistic face images from social networks and validate that our proposed representation generalizes well.

##### DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction using Aerial Images and Trajectories

Feb 17, 2020
Hao Wu, Hanyuan Zhang, Xinyu Zhang, Weiwei Sun, Baihua Zheng, Yuning Jiang

Automatic map extraction is of great importance to urban computing and location-based services. Aerial image and GPS trajectory data refer to two different data sources that could be leveraged to generate the map, although they carry different types of information. Most previous works on data fusion between aerial images and data from auxiliary sensors do not fully utilize the information of both modalities and hence suffer from the issue of information loss. We propose a deep convolutional neural network called DeepDualMapper which fuses the aerial image and trajectory data in a more seamless manner to extract the digital map. We design a gated fusion module to explicitly control the information flows from both modalities in a complementary-aware manner. Moreover, we propose a novel densely supervised refinement decoder to generate the prediction in a coarse-to-fine way. Our comprehensive experiments demonstrate that DeepDualMapper can fuse the information of images and trajectories much more effectively than existing approaches, and is able to generate maps with higher accuracy.
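
A gated fusion of two modalities can be sketched as a per-element convex combination, where a gate in (0, 1) decides how much each source contributes. The code below is an illustrative assumption of what "complementary-aware" gating might look like (weights `g` and `1 - g`); in a real network the gate logits would be produced by learned layers, and the names used here are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(img_feat, traj_feat, gate_logits):
    """Fuse two equally sized feature vectors element by element.

    The image feature is weighted by g and the trajectory feature by
    (1 - g), so the two weights always sum to one.
    """
    fused = []
    for a, b, z in zip(img_feat, traj_feat, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * b)
    return fused
```

A logit of zero averages the two sources, while a large positive or negative logit lets one modality dominate, for example where trees occlude roads in the aerial image but trajectories are dense.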

* 7 pages, AAAI 2020 accepted paper
##### Task-Aware Monocular Depth Estimation for 3D Object Detection

Sep 17, 2019
Xinlong Wang, Wei Yin, Tao Kong, Yuning Jiang, Lei Li, Chunhua Shen

Monocular depth estimation enables 3D perception from a single 2D image, thus attracting much research attention for years. Almost all methods treat foreground and background regions ("things" and "stuff") in an image equally. However, not all pixels are equal. Depth of foreground objects plays a crucial role in 3D object recognition and localization. To date, how to boost the depth prediction accuracy of foreground objects has rarely been discussed. In this paper, we first analyse the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground depth and background depth using separate optimization objectives and depth decoders. Our method significantly improves the depth estimation performance on foreground objects. Applying ForeSeE to 3D object detection, we achieve 7.5 AP gains and set new state-of-the-art results among other monocular methods.

##### Effective Domain Knowledge Transfer with Soft Fine-tuning

Sep 05, 2019
Zhichen Zhao, Bowen Zhang, Yuning Jiang, Li Xu, Lei Li, Wei-Ying Ma

Convolutional neural networks require large amounts of data for training. Considering the difficulties in data collection and labeling in some specific tasks, existing approaches generally use models pre-trained on a large source domain (e.g. ImageNet), and then fine-tune them on these tasks. However, the datasets from the source domain are simply discarded in the fine-tuning process. We argue that the source datasets could be better utilized and benefit fine-tuning. This paper first introduces the concept of general discrimination to describe the ability of a network to distinguish untrained patterns, and then experimentally demonstrates that general discrimination could potentially enhance the total discrimination ability on the target domain. Furthermore, we propose a novel and lightweight method, namely soft fine-tuning. Unlike traditional fine-tuning, which directly replaces the optimization objective with a loss function on the target domain, soft fine-tuning effectively keeps general discrimination by holding the previous loss and removing it softly. By doing so, soft fine-tuning improves the robustness of the network to data bias, and meanwhile accelerates the convergence. We evaluate our approach on several visual recognition tasks. Extensive experimental results support that soft fine-tuning provides consistent improvement on all evaluated tasks, and outperforms the state-of-the-art significantly. Code will be made available to the public.
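
The idea of holding the previous loss and removing it softly can be illustrated with a weighted sum whose source-domain weight decays over training. The linear schedule, the function names, and the `decay_steps` parameter below are assumptions chosen for this sketch; the abstract does not specify the actual schedule.

```python
def source_weight(step, decay_steps, initial=1.0):
    """Weight of the source-domain loss, decayed linearly to zero.

    Assumption: a linear schedule; any monotonically decreasing schedule
    would fit the description of removing the loss softly.
    """
    if step >= decay_steps:
        return 0.0
    return initial * (1.0 - step / decay_steps)


def soft_finetune_loss(target_loss, source_loss, step, decay_steps):
    """Combine the target-domain loss with a fading source-domain loss."""
    return target_loss + source_weight(step, decay_steps) * source_loss
```

At step 0 this behaves like joint training on both domains, and after `decay_steps` it reduces to ordinary fine-tuning on the target loss alone.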

##### Repulsion Loss: Detecting Pedestrians in a Crowd

Mar 26, 2018
Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, Chunhua Shen

Detecting individual pedestrians in a crowd remains a challenging problem since the pedestrians often gather together and occlude each other in real-world scenarios. In this paper, we first explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion via experimentation, providing insights into the crowd occlusion problem. Then, we propose a novel bounding box regression loss specifically designed for crowd scenes, termed repulsion loss. This loss is driven by two motivations: the attraction by target, and the repulsion by other surrounding objects. The repulsion term prevents the proposal from shifting to surrounding objects thus leading to more crowd-robust localization. Our detector trained by repulsion loss outperforms all the state-of-the-art methods with a significant improvement in occlusion cases.
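
The two motivations above can be sketched as a sum of an attraction term and a repulsion term. This is a simplified illustration under stated assumptions, not the loss as defined in the paper: attraction is a smooth L1 distance to the assigned target, repulsion penalizes the fraction of a neighbouring ground-truth box covered by the prediction, and the weight `alpha` and all function names are hypothetical.

```python
import math


def intersection_area(a, b):
    """Overlap area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def smooth_l1(d):
    """Quadratic near zero, linear for large errors."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def attraction_term(pred, target):
    """Pull the predicted box toward its assigned ground-truth box."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def repulsion_term(pred, other_gts, eps=1e-6):
    """Push the prediction away from ground-truth boxes of other objects.

    Uses the largest intersection-over-ground-truth among the other boxes
    and grows without bound as that overlap approaches one.
    """
    worst = 0.0
    for gt in other_gts:
        gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
        if gt_area > 0:
            worst = max(worst, intersection_area(pred, gt) / gt_area)
    return -math.log(max(1.0 - worst, eps))


def repulsion_loss(pred, target, other_gts, alpha=0.5):
    """Attraction to the target plus weighted repulsion from neighbours."""
    return attraction_term(pred, target) + alpha * repulsion_term(pred, other_gts)
```

A prediction that drifts toward a neighbouring pedestrian is penalized twice: once for leaving its own target and once for overlapping the neighbour.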

* Accepted to IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018
##### UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations

Apr 28, 2019
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma

We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes. A contrastive learning approach is proposed for the fine-grained alignment from only image-caption pairs. Moreover, we present an effective approach for enforcing the coverage of semantic components that appear in the sentence. We demonstrate the robustness of Unified VSE in defending text-domain adversarial attacks on cross-modal retrieval tasks. Such robustness also empowers the use of visual cues to resolve word dependencies in novel sentences.

* v1 is the full version which is accepted by CVPR 2019. v2 is the short version accepted by NAACL 2019 SpLU-RoboNLP workshop (in non-archival proceedings)
##### Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Apr 11, 2019
Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma

We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences.

* Accepted by CVPR 2019
##### MegDet: A Large Mini-Batch Object Detector

Apr 11, 2018
Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, Jian Sun

The improvements in recent CNN-based object detection works, from R-CNN [11], Fast/Faster R-CNN [10, 31] to recent Mask R-CNN [14] and RetinaNet [24], mainly come from new networks, new frameworks, or novel loss designs. But mini-batch size, a key factor in the training, has not been well studied. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a much larger mini-batch size than before (e.g. from 16 to 256), so that we can effectively utilize multiple GPUs (up to 128 in our experiments) to significantly shorten the training time. Technically, we suggest a learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours), and achieve even better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.
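
A common learning rate policy for large mini-batches combines linear scaling with a warmup phase, and the sketch below shows that combination. The abstract does not give the exact policy, so the base rate of 0.02 for a mini-batch of 16, the 500 warmup steps, and the linear ramp are all assumptions made for illustration.

```python
def learning_rate(step, batch_size, base_lr=0.02, base_batch=16, warmup_steps=500):
    """Learning rate with linear scaling and linear warmup.

    The target rate grows in proportion to the mini-batch size. During
    warmup the rate ramps from base_lr up to the target, which avoids
    starting a large-batch run at a very high rate.
    """
    target_lr = base_lr * batch_size / base_batch
    if step < warmup_steps:
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr
```

For a mini-batch of 256, this gives a target rate 16 times the base rate, reached after the warmup steps have elapsed.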