Models, code, and papers for "Zhihui Hao":
A growing demand for natural-scene text detection has been witnessed by the computer vision community since text information plays a significant role in scene understanding and image indexing. Deep neural networks are being used due to their strong capabilities of pixel-wise classification or word localization, similar to being used in common vision problems. In this paper, we present a novel two-task network with integrating bottom and top cues. The first task aims to predict a pixel-by-pixel labeling and based on which, word proposals are generated with a canonical connected component analysis. The second task aims to output a bundle of character candidates used later to verify the word proposals. The two sub-networks share base convolutional features and moreover, we present a new loss to strengthen the interaction between them. We evaluate the proposed network on public benchmark datasets and show it can detect arbitrary-orientation scene text with a finer output boundary. In ICDAR 2013 text localization task, we achieve the state-of-the-art performance with an F-score of 0.919 and a much better recall of 0.915.
In this work, we propose a new optimization framework for multiclass boosting learning. In the literature, AdaBoost.MO and AdaBoost.ECC are the two successful multiclass boosting algorithms, which can use binary weak learners. We explicitly derive these two algorithms' Lagrange dual problems based on their regularized loss functions. We show that the Lagrange dual formulations enable us to design totally-corrective multiclass algorithms by using the primal-dual optimization technique. Experiments on benchmark data sets suggest that our multiclass boosting can achieve a comparable generalization capability with state-of-the-art, but the convergence speed is much faster than stage-wise gradient descent boosting. In other words, the new totally corrective algorithms can maximize the margin more aggressively.
Many domain adaptation (DA) methods aim to project the source and target domains into a common feature space, where the inter-domain distributional differences are reduced and some intra-domain properties preserved. Recent research obtains their respective new representations using some predefined statistics. However, they usually formulate the class-wise statistics using the pseudo hard labels due to no labeled target data, such as class-wise MMD and class scatter matrice. The probabilities of data points belonging to each class given by the hard labels are either 0 or 1, while the soft labels could relax the strong constraint of hard labels and provide a random value between them. Although existing work have noticed the advantage of soft labels, they either deal with thoes class-wise statistics inadequately or introduce those small irrelevant probabilities in soft labels. Therefore, we propose the filtered soft labels to discard thoes confusing probabilities, then both of the class-wise MMD and class scatter matrice are modeled in this way. In order to obtain more accurate filtered soft labels, we take advantage of a well-designed Graph-based Label Propagation (GLP) method, and incorporate it into the DA procedure to formulate a unified framework.
With the increasing variety of services that e-commerce platforms provide, criteria for evaluating their success become also increasingly multi-targeting. This work introduces a multi-target optimization framework with Bayesian modeling of the target events, called Deep Bayesian Multi-Target Learning (DBMTL). In this framework, target events are modeled as forming a Bayesian network, in which directed links are parameterized by hidden layers, and learned from training samples. The structure of Bayesian network is determined by model selection. We applied the framework to Taobao live-streaming recommendation, to simultaneously optimize (and strike a balance) on targets including click-through rate, user stay time in live room, purchasing behaviors and interactions. Significant improvement has been observed for the proposed method over other MTL frameworks and the non-MTL model. Our practice shows that with an integrated causality structure, we can effectively make the learning of a target benefit from other targets, creating significant synergy effects that improve all targets. The neural network construction guided by DBMTL fits in with the general probabilistic model connecting features and multiple targets, taking weaker assumption than the other methods discussed in this paper. This theoretical generality brings about practical generalization power over various targets distributions, including sparse targets and continuous-value ones.
In this paper, we propose a monocular 3D object detection framework in the domain of autonomous driving. Unlike previous image-based methods which focus on RGB feature extracted from 2D images, our method solves this problem in the reconstructed 3D space in order to exploit 3D contexts explicitly. To this end, we first leverage a stand-alone module to transform the input data from 2D image plane to 3D point clouds space for a better input representation, then we perform the 3D detection using PointNet backbone net to obtain objects 3D locations, dimensions and orientations. To enhance the discriminative capability of point clouds, we propose a multi-modal feature fusion module to embed the complementary RGB cue into the generated point clouds representation. We argue that it is more effective to infer the 3D bounding boxes from the generated 3D scene space (i.e., X,Y, Z space) compared to the image plane (i.e., R,G,B image plane). Evaluation on the challenging KITTI dataset shows that our approach boosts the performance of state-of-the-art monocular approach by a large margin, i.e., around 15% absolute AP on both 3D localization and detection tasks for Car category at 0.7 IoU threshold.
Scribble colors based line art colorization is a challenging computer vision problem since neither greyscale values nor semantic information is presented in line arts, and the lack of authentic illustration-line art training pairs also increases difficulty of model generalization. Recently, several Generative Adversarial Nets (GANs) based methods have achieved great success. They can generate colorized illustrations conditioned on given line art and color hints. However, these methods fail to capture the authentic illustration distributions and are hence perceptually unsatisfying in the sense that they often lack accurate shading. To address these challenges, we propose a novel deep conditional adversarial architecture for scribble based anime line art colorization. Specifically, we integrate the conditional framework with WGAN-GP criteria as well as the perceptual loss to enable us to robustly train a deep network that makes the synthesized images more natural and real. We also introduce a local features network that is independent of synthetic data. With GANs conditioned on features from such network, we notably increase the generalization capability over "in the wild" line arts. Furthermore, we collect two datasets that provide high-quality colorful illustrations and authentic line arts for training and benchmarking. With the proposed model trained on our illustration dataset, we demonstrate that images synthesized by the presented approach are considerably more realistic and precise than alternative approaches.
Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large amount of anchors with different scales to discover texts in scene images, thus leading to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a few number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to discrete scales used in previous methods, the learned continuous scales are more reliable, especially for small texts detection. Additionally, we propose Anchor convolution to better exploit necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only $0.28$ second per image, while outperforming most state-of-the-art methods in accuracy.