Models, code, and papers for "Hao Ji":
The key idea of variational auto-encoders (VAEs) resembles that of traditional auto-encoder models in which spatial information is supposed to be explicitly encoded in the latent space. However, the latent variables in VAEs are vectors, which are commonly interpreted as multiple feature maps of size 1x1. Such representations can only convey spatial information implicitly when coupled with powerful decoders. In this work, we propose spatial VAEs that use latent variables as feature maps of larger size to explicitly capture spatial information. This is achieved by allowing the latent variables to be sampled from matrix-variate normal (MVN) distributions whose parameters are computed from the encoder network. To increase dependencies among locations on latent feature maps and reduce the number of parameters, we further propose spatial VAEs via low-rank MVN distributions. Experimental results show that the proposed spatial VAEs outperform original VAEs in capturing rich structural and spatial information.
In this paper, we describe work in progress towards a real-time vision-based traffic flow prediction (TFP) system. The proposed method consists of three elemental operators, that are dynamic texture model based motion segmentation, feature extraction and Gaussian process (GP) regression. The objective of motion segmentation is to recognize the target regions covering the moving vehicles in the sequence of visual processes. The feature extraction operator aims to extract useful features from the target regions. The extracted features are then mapped to the number of vehicles through the operator of GP regression. A training stage using historical visual data is required for determining the parameter values of the GP. Using a low-resolution visual data set, we performed preliminary evaluations on the performance of the proposed method. The results show that our method beats a benchmark solution based on Gaussian mixture model, and has the potential to be developed into qualified and practical solutions to real-time TFP.
Faster R-CNN is one of the most representative and successful methods for object detection, and has been becoming increasingly popular in various objection detection applications. In this report, we propose a robust deep face detection approach based on Faster R-CNN. In our approach, we exploit several new techniques including new multi-task loss function design, online hard example mining, and multi-scale training strategy to improve Faster R-CNN in multiple aspects. The proposed approach is well suited for face detection, so we call it Face R-CNN. Extensive experiments are conducted on two most popular and challenging face detection benchmarks, FDDB and WIDER FACE, to demonstrate the superiority of the proposed approach over state-of-the-arts.
Today's state-of-the-art image classifiers fail to correctly classify carefully manipulated adversarial images. In this work, we develop a new, localized adversarial attack that generates adversarial examples by imperceptibly altering the backgrounds of normal images. We first use this attack to highlight the unnecessary sensitivity of neural networks to changes in the background of an image, then use it as part of a new training technique: localized adversarial training. By including locally adversarial images in the training set, we are able to create a classifier that suffers less loss than a non-adversarially trained counterpart model on both natural and adversarial inputs. The evaluation of our localized adversarial training algorithm on MNIST and CIFAR-10 datasets shows decreased accuracy loss on natural images, and increased robustness against adversarial inputs.
Visualizing the details of different cellular structures is of great importance to elucidate cellular functions. However, it is challenging to obtain high quality images of different structures directly due to complex cellular environments. Fluorescence microscopy is a popular technique to label different structures but has several drawbacks. In particular, labeling is time consuming and may affect cell morphology, and simultaneous labels are inherently limited. This raises the need of building computational models to learn relationships between unlabeled and labeled fluorescence images, and to infer fluorescent labels of other unlabeled fluorescence images. We propose to develop a novel deep model for fluorescence image prediction. We first propose a novel network layer, known as the global transformer layer, that fuses global information from inputs effectively. The proposed global transformer layer can generate outputs with arbitrary dimensions, and can be employed for all the regular, down-sampling, and up-sampling operators. We then incorporate our proposed global transformer layers and dense blocks to build an U-Net like network. We believe such a design can promote feature reusing between layers. In addition, we propose a multi-scale input strategy to encourage networks to capture features at different scales. We conduct evaluations across various label-free prediction tasks to demonstrate the effectiveness of our approach. Both quantitative and qualitative results show that our method outperforms the state-of-the-art approach significantly. It is also shown that our proposed global transformer layer is useful to improve the fluorescence image prediction results.
Despite the great success of word embedding, sentence embedding remains a not-well-solved problem. In this paper, we present a supervised learning framework to exploit sentence embedding for the medical question answering task. The learning framework consists of two main parts: 1) a sentence embedding producing module, and 2) a scoring module. The former is developed with contextual self-attention and multi-scale techniques to encode a sentence into an embedding tensor. This module is shortly called Contextual self-Attention Multi-scale Sentence Embedding (CAMSE). The latter employs two scoring strategies: Semantic Matching Scoring (SMS) and Semantic Association Scoring (SAS). SMS measures similarity while SAS captures association between sentence pairs: a medical question concatenated with a candidate choice, and a piece of corresponding supportive evidence. The proposed framework is examined by two Medical Question Answering(MedicalQA) datasets which are collected from real-world applications: medical exam and clinical diagnosis based on electronic medical records (EMR). The comparison results show that our proposed framework achieved significant improvements compared to competitive baseline approaches. Additionally, a series of controlled experiments are also conducted to illustrate that the multi-scale strategy and the contextual self-attention layer play important roles for producing effective sentence embedding, and the two kinds of scoring strategies are highly complementary to each other for question answering problems.
Deconvolutional layers have been widely used in a variety of deep models for up-sampling, including encoder-decoder networks for semantic segmentation and deep generative models for unsupervised learning. One of the key limitations of deconvolutional operations is that they result in the so-called checkerboard problem. This is caused by the fact that no direct relationship exists among adjacent pixels on the output feature map. To address this problem, we propose the pixel deconvolutional layer (PixelDCL) to establish direct relationships among adjacent pixels on the up-sampled feature map. Our method is based on a fresh interpretation of the regular deconvolution operation. The resulting PixelDCL can be used to replace any deconvolutional layer in a plug-and-play manner without compromising the fully trainable capabilities of original models. The proposed PixelDCL may result in slight decrease in efficiency, but this can be overcome by an implementation trick. Experimental results on semantic segmentation demonstrate that PixelDCL can consider spatial features such as edges and shapes and yields more accurate segmentation outputs than deconvolutional layers. When used in image generation tasks, our PixelDCL can largely overcome the checkerboard problem suffered by regular deconvolution operations.
Bone age is one of the most important indicators for assessing bone's maturity, which can help to interpret human's growth development level and potential progress. In the clinical practice, bone age assessment (BAA) of X-ray images requires the joint consideration of the appearance and location information of hand bones. These kinds of information can be effectively captured by the relation of different anatomical parts of hand bone. Recently developed methods differ mostly in how they model the part relation and choose useful parts for BAA. However, these methods neglect the mining of relationship among different parts, which can help to improve the assessment accuracy. In this paper, we propose a novel part relation module, which accurately discovers the underlying concurrency of parts by using multi-scale context information of deep learning feature representation. Furthermore, based on the part relation, we explore a new part selection module, which comprehensively measures the importance of parts and select the top ranking parts for assisting BAA. We jointly train our part relation and selection modules in an end-to-end way, achieving state-of-the-art performance on the public RSNA 2017 Pediatric Bone Age benchmark dataset and outperforming other competitive methods by a significant margin.
Recommending personalized learning materials for online language learning is challenging because we typically lack data about the student's ability and the relative difficulty of learning materials. This makes it hard to recommend appropriate content that matches the student's prior knowledge. In this paper, we propose a refined hierarchical knowledge structure to model vocabulary knowledge, which enables us to automatically organize the authentic and up-to-date learning materials collected from the internet. Based on this knowledge structure, we then introduce a hybrid approach to recommend learning materials that adapts to a student's language level. We evaluate our work with an online Japanese learning tool and the results suggest adding adaptivity into material recommendation significantly increases student engagement.
Particle Imaging Velocimetry (PIV) estimates the flow of fluid by analyzing the motion of injected particles. The problem is challenging as the particles lie at different depths but have similar appearance and tracking a large number of particles is particularly difficult. In this paper, we present a PIV solution that uses densely sampled light field to reconstruct and track 3D particles. We exploit the refocusing capability and focal symmetry constraint of the light field for reliable particle depth estimation. We further propose a new motion-constrained optical flow estimation scheme by enforcing local motion rigidity and the Navier-Stoke constraint. Comprehensive experiments on synthetic and real experiments show that using a single light field camera, our technique can recover dense and accurate 3D fluid flows in small to medium volumes.
Face detection has achieved great success using the region-based methods. In this report, we propose a region-based face detector applying deep networks in a fully convolutional fashion, named Face R-FCN. Based on Region-based Fully Convolutional Networks (R-FCN), our face detector is more accurate and computational efficient compared with the previous R-CNN based face detectors. In our approach, we adopt the fully convolutional Residual Network (ResNet) as the backbone network. Particularly, We exploit several new techniques including position-sensitive average pooling, multi-scale training and testing and on-line hard example mining strategy to improve the detection accuracy. Over two most popular and challenging face detection benchmarks, FDDB and WIDER FACE, Face R-FCN achieves superior performance over state-of-the-arts.
Heterogeneous information network (HIN) embedding has gained increasing interests recently. However, the current way of random-walk based HIN embedding methods have paid few attention to the higher-order Markov chain nature of meta-path guided random walks, especially to the stationarity issue. In this paper, we systematically formalize the meta-path guided random walk as a higher-order Markov chain process, and present a heterogeneous personalized spacey random walk to efficiently and effectively attain the expected stationary distribution among nodes. Then we propose a generalized scalable framework to leverage the heterogeneous personalized spacey random walk to learn embeddings for multiple types of nodes in an HIN guided by a meta-path, a meta-graph, and a meta-schema respectively. We conduct extensive experiments in several heterogeneous networks and demonstrate that our methods substantially outperform the existing state-of-the-art network embedding algorithms.
Image classification is an important task in the field of machine learning and image processing. However, the usually used classification method --- the K Nearest-Neighbor algorithm has high complexity, because its two main processes: similarity computing and searching are time-consuming. Especially in the era of big data, the problem is prominent when the amount of images to be classified is large. In this paper, we try to use the powerful parallel computing ability of quantum computers to optimize the efficiency of image classification. The scheme is based on quantum K Nearest-Neighbor algorithm. Firstly, the feature vectors of images are extracted on classical computers. Then the feature vectors are inputted into a quantum superposition state, which is used to achieve parallel computing of similarity. Next, the quantum minimum search algorithm is used to speed up searching process for similarity. Finally, the image is classified by quantum measurement. The complexity of the quantum algorithm is only O((kM)^(1/2)), which is superior to the classical algorithms. Moreover, the measurement step is executed only once to ensure the validity of the scheme. The experimental results show that, the classification accuracy is 83.1% on Graz-01 dataset and 78% on Caltech-101 dataset, which is close to existing classical algorithms. Hence, our quantum scheme has a good classification performance while greatly improving the efficiency.
In human-computer interaction, it is important to accurately estimate the hand pose especially fingertips. However, traditional approaches for fingertip localization mainly rely on depth images and thus suffer considerably from the noise and missing values. Instead of depth images, stereo images can also provide 3D information of hands and promote 3D hand pose estimation. There are nevertheless limitations on the dataset size, global viewpoints, hand articulations and hand shapes in the publicly available stereo-based hand pose datasets. To mitigate these limitations and promote further research on hand pose estimation from stereo images, we propose a new large-scale binocular hand pose dataset called THU-Bi-Hand, offering a new perspective for fingertip localization. In the THU-Bi-Hand dataset, there are 447k pairs of stereo images of different hand shapes from 10 subjects with accurate 3D location annotations of the wrist and five fingertips. Captured with minimal restriction on the range of hand motion, the dataset covers large global viewpoint space and hand articulation space. To better present the performance of fingertip localization on THU-Bi-Hand, we propose a novel scheme termed Bi-stream Pose Guided Region Ensemble Network (Bi-Pose-REN). It extracts more representative feature regions around joint points in the feature maps under the guidance of the previously estimated pose. The feature regions are integrated hierarchically according to the topology of hand joints to regress the refined hand pose. Bi-Pose-REN and several existing methods are evaluated on THU-Bi-Hand so that benchmarks are provided for further research. Experimental results show that our new method has achieved the best performance on THU-Bi-Hand.
This paper proposes a new learning paradigm called filter grafting, which aims to improve the representation capability of Deep Neural Networks (DNNs). The motivation is that DNNs have unimportant (invalid) filters (e.g., l1 norm close to 0). These filters limit the potential of DNNs since they are identified as having little effect on the network. While filter pruning removes these invalid filters for efficiency consideration, filter grafting re-activates them from an accuracy boosting perspective. The activation is processed by grafting external information (weights) into invalid filters. To better perform the grafting process, we develop an entropy-based criterion to measure the information of filters and an adaptive weighting strategy for balancing the grafted information among networks. After the grafting operation, the network has very few invalid filters compared with its untouched state, enpowering the model with more representation capacity. We also perform extensive experiments on the classification and recognition tasks to show the superiority of our method. For example, the grafted MobileNetV2 outperforms the non-grafted MobileNetV2 by about 7 percent on CIFAR-100 dataset.
Face recognition has made extraordinary progress owing to the advancement of deep convolutional neural networks (CNNs). The central task of face recognition, including face verification and identification, involves face feature discrimination. However, the traditional softmax loss of deep CNNs usually lacks the power of discrimination. To address this problem, recently several loss functions such as center loss, large margin softmax loss, and angular softmax loss have been proposed. All these improved losses share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we propose a novel loss function, namely large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by $L_2$ normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. Extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as MegaFace Challenge, Youtube Faces (YTF) and Labeled Face in the Wild (LFW). We achieve the state-of-the-art performance on these benchmarks, which confirms the effectiveness of our proposed approach.
As facial appearance is subject to significant intra-class variations caused by the aging process over time, age-invariant face recognition (AIFR) remains a major challenge in face recognition community. To reduce the intra-class discrepancy caused by the aging, in this paper we propose a novel approach (namely, Orthogonal Embedding CNNs, or OE-CNNs) to learn the age-invariant deep face features. Specifically, we decompose deep face features into two orthogonal components to represent age-related and identity-related features. As a result, identity-related features that are robust to aging are then used for AIFR. Besides, for complementing the existing cross-age datasets and advancing the research in this field, we construct a brand-new large-scale Cross-Age Face dataset (CAF). Extensive experiments conducted on the three public domain face aging datasets (MORPH Album 2, CACD-VS and FG-NET) have shown the effectiveness of the proposed approach and the value of the constructed CAF dataset on AIFR. Benchmarking our algorithm on one of the most popular general face recognition (GFR) dataset LFW additionally demonstrates the comparable generalization performance on GFR.
This paper proposes a new method for simultaneous 3D reconstruction and semantic segmentation of indoor scenes. Unlike existing methods that require recording a video using a color camera and/or a depth camera, our method only needs a small number of (e.g., 3-5) color images from uncalibrated sparse views as input, which greatly simplifies data acquisition and extends applicable scenarios. Since different views have limited overlaps, our method allows a single image as input to discern the depth and semantic information of the scene. The key issue is how to recover relatively accurate depth from single images and reconstruct a 3D scene by fusing very few depth maps. To address this problem, we first design an iterative deep architecture, IterNet, that estimates depth and semantic segmentation alternately, so that they benefit each other. To deal with the little overlap and non-rigid transformation between views, we further propose a joint global and local registration method to reconstruct a 3D scene with semantic information from sparse views. We also make available a new indoor synthetic dataset simultaneously providing photorealistic high-resolution RGB images, accurate depth maps and pixel-level semantic labels for thousands of complex layouts, useful for training and evaluation. Experimental results on public datasets and our dataset demonstrate that our method achieves more accurate depth estimation, smaller semantic segmentation errors and better 3D reconstruction results, compared with state-of-the-art methods.
Person re-identification (re-ID), is a challenging task due to the high variance within identity samples and imaging conditions. Although recent advances in deep learning have achieved remarkable accuracy in settled scenes, i.e., source domain, few works can generalize well on the unseen target domain. One popular solution is assigning unlabeled target images with pseudo labels by clustering, and then retraining the model. However, clustering methods tend to introduce noisy labels and discard low confidence samples as outliers, which may hinder the retraining process and thus limit the generalization ability. In this study, we argue that by explicitly adding a sample filtering procedure after the clustering, the mined examples can be much more efficiently used. To this end, we design an asymmetric co-teaching framework, which resists noisy labels by cooperating two models to select data with possibly clean labels for each other. Meanwhile, one of the models receives samples as pure as possible, while the other takes in samples as diverse as possible. This procedure encourages that the selected training samples can be both clean and miscellaneous, and that the two models can promote each other iteratively. Extensive experiments show that the proposed framework can consistently benefit most clustering-based methods, and boost the state-of-the-art adaptation accuracy. Our code is available at https://github.com/FlyingRoastDuck/ACT_AAAI20.
In this demo paper, we present the XFake system, an explainable fake news detector that assists end-users to identify news credibility. To effectively detect and interpret the fakeness of news items, we jointly consider both attributes (e.g., speaker) and statements. Specifically, MIMIC, ATTN and PERT frameworks are designed, where MIMIC is built for attribute analysis, ATTN is for statement semantic analysis and PERT is for statement linguistic analysis. Beyond the explanations extracted from the designed frameworks, relevant supporting examples as well as visualization are further provided to facilitate the interpretation. Our implemented system is demonstrated on a real-world dataset crawled from PolitiFact, where thousands of verified political news have been collected.