Research papers and code for "Xiao Shu":
Small compression noises, despite being imperceptible to human eyes, can adversely affect the results of many image restoration processes if left unaccounted for. In particular, compression noise is highly detrimental to inverse operators of a high-boosting (sharpening) nature, such as deblurring and super-resolution against a convolution kernel. By incorporating the non-linear DCT quantization mechanism into the formulation for image restoration, we propose a new sparsity-based convex programming approach for joint compression noise removal and image restoration. Experimental results demonstrate significant performance gains of the new approach over existing image restoration methods.
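The quantization constraint at the heart of such a formulation can be sketched in a few lines. This is an illustration of the standard JPEG quantization mechanism, not the paper's actual solver: each decoded DCT coefficient only pins down an interval of feasible values, and a restoration method can constrain its estimate to stay inside that interval.

```python
# Sketch (not the paper's code): JPEG quantization maps each DCT
# coefficient c to round(c / q); decoding only recovers the interval
# of coefficients that map to the same index.
def quantize(c, q):
    """Quantize DCT coefficient c with quantization step q."""
    return round(c / q)

def feasible_interval(k, q):
    """All coefficients that quantize to index k lie in this interval.
    A restoration method can constrain its estimate to stay inside it."""
    return ((k - 0.5) * q, (k + 0.5) * q)

k = quantize(37.2, 16)             # -> 2
lo, hi = feasible_interval(k, 16)  # -> (24.0, 40.0); 37.2 lies inside
```

The interval, rather than the single dequantized value, is what a convex program can use as a hard data-fidelity constraint.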

In many applications of deep learning, particularly those in image restoration, it is either very difficult, prohibitively expensive, or outright impossible to obtain paired training data precisely as in the real world. In such cases, one is forced to use synthesized paired data to train the deep convolutional neural network (DCNN). However, due to the unavoidable generalization error in statistical learning, the synthetically trained DCNN often performs poorly on real world data. To overcome this problem, we propose a new general training method that can compensate for, to a large extent, the generalization errors of synthetically trained DCNNs.

Subitizing, or the sense of small natural numbers, is a cognitive construct so primary and critical to the survival and well-being of humans and primates that it is considered, and has been shown, to be innate; it responds to visual stimuli prior to the development of any symbolic skills, language or arithmetic. Given the highly acclaimed successes of deep convolutional neural networks (DCNN) in tasks of visual intelligence, one would expect that DCNNs can learn subitizing. But somewhat surprisingly, our carefully crafted extensive experiments, which are similar to those of cognitive psychology, demonstrate that DCNNs cannot, even with strong supervision, see through superficial variations in visual representations and distill the abstract notion of natural number, a task that children perform with high accuracy and confidence. The DCNN black box learners driven by very large training sets are apparently still confused by geometric variations and fail to grasp the topological essence in subitizing. In sharp contrast to the failures of the black box learning, by incorporating a mechanism of mathematical morphology into convolutional kernels, we are able to construct a recurrent convolutional neural network that can perform subitizing deterministically. Our findings in this study of cognitive computing, without and with priors of human knowledge, are discussed; they are, we believe, significant and thought-provoking in the interests of AI research, because visual-based numerosity is a benchmark of minimum sort for human cognition.
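The topological essence referred to above can be illustrated with a toy version of the task. This is our own sketch, not the paper's recurrent network: subitizing reduces to counting connected components, a property invariant to the objects' geometry.

```python
# Illustrative sketch (our construction, not the paper's network):
# counting 4-connected components of a binary scene, a topological
# quantity unaffected by the objects' shapes, sizes or positions.
def count_objects(grid):
    """Count 4-connected components of 1s in a binary grid."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                count += 1
                stack = [(r, c)]        # flood-fill this component
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < rows and 0 <= x < cols \
                            and grid[y][x] and not seen[y][x]:
                        seen[y][x] = True
                        stack += [(y + 1, x), (y - 1, x),
                                  (y, x + 1), (y, x - 1)]
    return count

scene = [[1, 1, 0, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 1]]
# three objects, regardless of how each one is drawn
```

A learner that captures this invariant generalizes across visual representations; one that memorizes geometric appearance does not.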

Taking photos of optoelectronic displays is a direct and spontaneous way of transferring data and keeping records, which is widely practiced. However, due to the analog signal interference between the pixel grids of the display screen and the camera sensor array, objectionable moiré (alias) patterns appear in captured screen images. As the moiré patterns are structured and highly variant, they are difficult to remove completely without affecting the underlying latent image. In this paper, we propose a deep convolutional neural network (DCNN) approach for demoiréing screen photos. The proposed DCNN consists of a coarse-scale network and a fine-scale network. In the coarse-scale network, the input image is first downsampled and then processed by stacked residual blocks to remove the moiré artifacts. After that, the fine-scale network upsamples the demoiréd low-resolution image back to the original resolution. Extensive experimental results demonstrate that the proposed technique can efficiently remove the moiré patterns from camera-acquired screen images; the new technique outperforms the existing ones.
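The coarse-to-fine structure can be sketched as follows. This is a minimal stand-in using average pooling and nearest-neighbour upsampling; the learned residual blocks and the learned fine-scale network are replaced by placeholders here.

```python
import numpy as np

# Structural sketch of the coarse-to-fine pipeline (the trained residual
# blocks are replaced by an identity placeholder in this illustration).
def downsample2x(img):
    """2x2 average pooling: the coarse-scale network's input reduction."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(img):
    """Nearest-neighbour upsampling back to the original resolution."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
coarse = downsample2x(x)        # moiré removal happens at this scale
restored = upsample2x(coarse)   # fine-scale network restores resolution
```

Working at the coarse scale makes the large-period moiré structure easier to remove; the fine-scale stage then recovers the resolution lost to downsampling.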

All existing image enhancement methods, such as HDR tone mapping, cannot recover A/D quantization losses due to insufficient or excessive lighting (the underflow and overflow problems). The loss of image details due to A/D quantization is complete and cannot be recovered by traditional image processing methods, but the modern data-driven machine learning approach offers a much-needed cure to the problem. In this work we propose a novel approach to restore and enhance images acquired in low and uneven lighting. First, the poor illumination is algorithmically compensated for by emulating the effects of artificial supplementary lighting. Then a DCNN trained using only synthetic data recovers the missing detail caused by quantization.
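Why the loss is unrecoverable by pointwise methods can be seen from a toy A/D model. This is our illustration; the particular sensor range mapping is an assumption.

```python
# Sketch of the A/D quantization loss (our illustration): radiance values
# outside the sensor range clip, and clipping destroys detail that no
# pointwise tone mapping can bring back.
def quantize_8bit(x):
    """Simulate an 8-bit A/D converter on a radiance value in [0, 4)."""
    level = int(x * 255 / 4)           # map the sensor range onto 0..255
    return max(0, min(255, level))     # underflow / overflow clip here

# Two distinct over-exposed radiances collapse to the same code word:
a = quantize_8bit(4.5)
b = quantize_8bit(9.0)
# a == b == 255, so the difference between them is gone for good
```

No invertible tone curve can separate `a` from `b` again; only a learned prior over natural images can plausibly hallucinate the missing detail back.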

This paper presents a generic pre-processor for expediting conventional template matching techniques. Instead of locating the best-matched patch in the reference image for a query template via exhaustive search, the proposed algorithm rules out regions with no possible matches with minimal computational effort. While relying only on simple patch features, such as mean, variance and gradient, the fast pre-screening is highly discriminative. Its computational efficiency is gained by using a novel octagonal-star-shaped template and the inclusion-exclusion principle to extract and compare patch features. Moreover, it can handle arbitrary rotation and scaling of reference images effectively. Extensive experiments demonstrate that the proposed algorithm greatly reduces the search space while never missing the best match.
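The inclusion-exclusion idea behind the constant-time feature extraction can be sketched with an ordinary rectangular summed-area table. The paper's octagonal-star-shaped template generalizes this principle; the sketch below is our own illustration.

```python
import numpy as np

# Sketch of the pre-screening principle via a standard summed-area table:
# any rectangular patch sum (hence mean) costs four lookups.
def integral_image(img):
    """Summed-area table with a zero top row / left column for indexing."""
    s = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    s[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return s

def patch_sum(s, y, x, h, w):
    """Sum of the h-by-w patch at (y, x) in O(1) via inclusion-exclusion."""
    return s[y + h, x + w] - s[y, x + w] - s[y + h, x] + s[y, x]

img = np.arange(16, dtype=float).reshape(4, 4)
s = integral_image(img)
# mean of the 2x2 patch at (1, 1) without revisiting its pixels
mean = patch_sum(s, 1, 1, 2, 2) / 4
```

If a candidate patch's cheap features (mean, variance) already disagree with the template's, the expensive full comparison can be skipped, which is the essence of the pre-screening.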

An image of a scene captured through a piece of transparent and reflective material, such as glass, is often spoiled by a superimposed layer of reflection image. While separating the reflection from a familiar object in an image is mentally not difficult for humans, it is a challenging, ill-posed problem in computer vision. In this paper, we propose a novel deep convolutional encoder-decoder method to remove the objectionable reflection by learning a mapping between image pairs with and without reflection. To train the neural network, we model the physical formation of reflections in images and synthesize a large number of photo-realistic reflection-tainted images from reflection-free images collected online. Extensive experimental results show that, although the neural network learns only from synthetic data, the proposed method is effective on real-world images, and it significantly outperforms the other tested state-of-the-art techniques.
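A minimal version of such a synthesis model might look like the following. This is an assumption-level sketch of a common formation model, not the paper's exact recipe: real pipelines vary the blur kernel, mixing weights and luminance.

```python
import numpy as np

# Sketch of a common physical formation model for synthesizing training
# pairs: the captured image is the background plus an attenuated,
# defocus-blurred reflection layer (our illustration).
def box_blur(img, k=3):
    """Crude box blur standing in for the reflection layer's defocus."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def synthesize(background, reflection, alpha=0.75):
    """Mix a clean background with a blurred reflection layer."""
    return alpha * background + (1 - alpha) * box_blur(reflection)

bg = np.ones((8, 8))
rf = np.zeros((8, 8)); rf[4, 4] = 1.0
mixed = synthesize(bg, rf)   # training input; bg is the training target
```

The network is then trained to map `mixed` back to `bg`, so that the physical model, rather than scarce real pairs, supplies the supervision.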

High-quality dehazing performance is highly dependent upon the accurate estimation of the transmission map. In this work, a coarse estimate is first obtained by the weighted fusion of two different transmission maps, which are generated from the foreground and sky regions, respectively. A hybrid variational model with promoted regularization terms is then proposed to assist in refining the transmission map. The resulting complicated optimization problem is effectively solved via an alternating direction algorithm. The final haze-free image can then be obtained according to the refined transmission map and the atmospheric scattering model. Our dehazing framework has the capacity to preserve important image details while suppressing undesirable artifacts, even for hazy images with large sky regions. Experiments on both synthetic and realistic images illustrate that the proposed method is competitive with or even outperforms state-of-the-art dehazing techniques under different imaging conditions.
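The atmospheric scattering model that the refined transmission map plugs into is the standard I = J·t + A·(1 − t); inverting it for the scene radiance J is direct. A minimal sketch follows; the lower clamp t0 is a common stabilization choice, not necessarily the paper's.

```python
import numpy as np

# The standard atmospheric scattering model: I = J*t + A*(1 - t).
# Given a refined transmission map t and airlight A, invert for J.
def dehaze(I, t, A, t0=0.1):
    """Recover scene radiance J; t is clamped below by t0 for stability."""
    return (I - A) / np.maximum(t, t0) + A

# Round trip on a synthetic pixel: haze it, then recover it.
J, A, t = 0.2, 0.9, 0.5
I = J * t + A * (1 - t)                  # forward model -> hazy intensity
recovered = dehaze(np.array(I), np.array(t), A)
```

The quality of `recovered` rests entirely on `t`, which is why the work above concentrates on refining the transmission map.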

* 5 pages, 5 figures
We focus on the task of amodal 3D object detection in RGB-D images, which aims to produce a 3D bounding box of an object in metric form at its full extent. We introduce Deep Sliding Shapes, a 3D ConvNet formulation that takes a 3D volumetric scene from an RGB-D image as input and outputs 3D object bounding boxes. In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D. In particular, we handle objects of various sizes by training an amodal RPN at two different scales and an ORN to regress 3D bounding boxes. Experiments show that our algorithm outperforms the state-of-the-art by 13.8 in mAP and is 200x faster than the original Sliding Shapes. All source code and pre-trained models will be available on GitHub.

Although there has been significant progress in the past decade, tracking is still a very challenging computer vision task, due to problems such as occlusion and model drift. Recently, the increased popularity of depth sensors, e.g., the Microsoft Kinect, has made it easy to obtain depth data at low cost. This may be a game changer for tracking, since depth information can be used to prevent model drift and handle occlusion. In this paper, we construct a benchmark dataset of 100 RGBD videos with high diversity, including deformable objects, various occlusion conditions and moving cameras. We propose a very simple but strong baseline model for RGBD tracking, and present a quantitative comparison of several state-of-the-art tracking algorithms. Experimental results show that including depth information and reasoning about occlusion significantly improves tracking performance. The datasets, evaluation details, source code for the baseline algorithm, and instructions for submitting new models will be made available online after acceptance.

In this paper we tackle a novel problem, namely height estimation from a single monocular remote sensing image. The problem is inherently ambiguous and technically ill-posed, with a large source of uncertainty coming from the overall scale. We propose a fully convolutional-deconvolutional network architecture, trained end-to-end and incorporating residual learning, to model the ambiguous mapping between monocular remote sensing images and height maps. Specifically, it is composed of two parts, i.e., a convolutional sub-network and a deconvolutional sub-network. The former acts as a feature extractor that transforms the input remote sensing image into a high-level multidimensional feature representation, whereas the latter plays the role of a height generator that produces a height map from the features extracted by the convolutional sub-network. Moreover, to preserve fine edge details in the estimated height maps, we introduce a skip connection into the network, which is able to shuttle low-level visual information, e.g., object boundaries and edges, directly across the network. To demonstrate the usefulness of single-view height prediction, we show a practical example of instance segmentation of buildings using the estimated height map. This paper, for the first time in the remote sensing community, attempts to estimate height from monocular vision. The proposed network is validated using a large-scale high-resolution aerial image data set covering an area of Berlin. Both visual and quantitative analyses of the experimental results demonstrate the effectiveness of our approach.

While general object recognition is still far from being solved, this paper proposes a way for a robot to recognize every object at almost human-level accuracy. Our key observation is that many robots will stay in a relatively closed environment (e.g. a house or an office). By constraining a robot to stay in a limited territory, we can ensure that the robot has seen most objects before and that the rate of introducing new objects is slow. Furthermore, we can build a 3D map of the environment to reliably subtract the background and make recognition easier. We propose extremely robust algorithms to obtain a 3D map and enable humans to collectively annotate objects. During testing time, our algorithm can recognize all objects very reliably, and it queries humans via a crowdsourcing platform if confidence is low or new objects are identified. This paper explains the design decisions in building such a system, and constructs a benchmark for extensive evaluation. Experiments suggest that making robot vision appear to be working from an end user's perspective is a reachable goal today, as long as the robot stays in a closed environment. By formulating this task, we hope to lay the foundation of a new direction in vision for robotics. Code and data will be available upon acceptance.

This work presents a first attempt to automatically recognize pancreatitis in CT scan images. Unlike traditional object recognition, pancreatitis recognition is challenging due to the fine-grained and non-rigid appearance variability of the local diseased regions. To this end, we propose a customized Region-Manipulated Fusion Network (RMFN) to capture the key characteristics of local lesions for pancreatitis recognition. Specifically, to effectively highlight the imperceptible lesion regions, a novel region-manipulation scheme in RMFN is proposed to emphasize the lesion regions while weakening the non-lesion regions by ceaselessly aggregating multi-scale local information onto the feature maps. The proposed scheme can be flexibly plugged into existing neural networks, such as AlexNet and VGG. To evaluate the performance of the proposed method, a real CT image database of pancreatitis is collected from hospitals \footnote{The database is available later}. Experimental results on this database demonstrate the effectiveness of the proposed method for pancreatitis recognition.

We present a unified, efficient and effective framework for point-cloud-based 3D object detection. Our two-stage approach utilizes both a voxel representation and the raw point cloud data to exploit their respective advantages. The first-stage network, with the voxel representation as input, consists only of light convolutional operations, producing a small number of high-quality initial predictions. The coordinates and indexed convolutional features of each point in an initial prediction are effectively fused with an attention mechanism, preserving both accurate localization and context information. The second stage works on the interior points with their fused features to further refine the prediction. Our method is evaluated on the KITTI dataset, in terms of both 3D and Bird's Eye View (BEV) detection, and achieves state-of-the-art results at a 15 FPS detection rate.

An online personalized news product needs a suitable cover image for each article. The news cover must have high image quality and draw readers' attention at the same time, which is extraordinarily challenging due to the subjectivity of the task. In this paper, we assess the news cover from the perspectives of image clarity and object salience. We propose an end-to-end multi-task learning network for image clarity assessment and semantic segmentation simultaneously, the results of which can guide news cover assessment. The proposed network is based on a modified DeepLabv3+ model. The network backbone is used for multi-scale spatial feature extraction, followed by two branches for image clarity assessment and semantic segmentation, respectively. The experimental results show that the proposed model is able to capture important content in images and performs better than single-task learning baselines on our proposed game-content-based CIA dataset.

* 6 pages, 9 figures
State-of-the-art human pose estimation methods are based on the heat map representation. In spite of its good performance, the representation has a few inherent issues, such as non-differentiability and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat-map-based method. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
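The integral operation can be sketched in a few lines. This is our illustration of the soft-argmax idea: soften the heat map into a probability map, then take the expected coordinate, which is differentiable and free of hard-argmax quantization.

```python
import numpy as np

# Sketch of the integral (soft-argmax) operation: normalize the heat map
# with a softmax, then read off the expected (x, y) coordinate.
def integral_regression(heatmap):
    """Return the (x, y) expectation of a softmax-normalized heat map."""
    p = np.exp(heatmap - heatmap.max())   # stable softmax
    p /= p.sum()
    ys, xs = np.indices(heatmap.shape)
    return (p * xs).sum(), (p * ys).sum()

h = np.zeros((5, 5))
h[2, 3] = 10.0                        # a sharp peak at x=3, y=2
x, y = integral_regression(h)         # lands near (3, 2), sub-pixel capable
```

Because the expectation is a smooth function of the heat map values, the joint coordinate can be supervised directly with a regression loss while keeping the heat map as an intermediate representation.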

Regression-based methods do not perform as well as detection-based methods for human pose estimation. A central problem is that the structural information in the pose is not well exploited in previous regression methods. In this work, we propose a structure-aware regression approach. It adopts a reparameterized pose representation using bones instead of joints. It exploits the joint connection structure to define a compositional loss function that encodes the long-range interactions in the pose. It is simple, effective, and general for both 2D and 3D pose estimation in a unified setting. Comprehensive evaluation validates the effectiveness of our approach. It significantly advances the state-of-the-art on Human3.6M and is competitive with state-of-the-art results on MPII.
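The bone reparameterization can be sketched on a toy skeleton. The chain skeleton below is hypothetical, chosen only to show the idea: bones are offsets from each joint to its parent, and long-range joint relations become sums of bones along kinematic paths.

```python
import numpy as np

# Sketch of the bone reparameterization on a hypothetical chain skeleton
# 0 -> 1 -> 2 -> 3 (not the paper's skeleton definition).
PARENT = {1: 0, 2: 1, 3: 2}

def joints_to_bones(joints):
    """bone[j] = joint[j] - joint[parent(j)]; the root keeps its position."""
    bones = joints.copy()
    for j, p in PARENT.items():
        bones[j] = joints[j] - joints[p]
    return bones

def bones_to_joints(bones):
    """Invert by accumulating bones from the root outward."""
    joints = bones.copy()
    for j, p in PARENT.items():   # parents precede children in this chain
        joints[j] = joints[p] + bones[j]
    return joints

j = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [3.0, 2.0]])
assert np.allclose(bones_to_joints(joints_to_bones(j)), j)  # lossless
```

A loss on a long-range pair, say joints 0 and 3, then decomposes into the sum of the bones between them, which is the compositional structure the loss function exploits.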

* Accepted by International Conference on Computer Vision (ICCV) 2017
With the industry trend of shifting from a traditional hierarchical approach to a flatter management structure, crowdsourced performance assessment has gained mainstream popularity. One fundamental challenge of crowdsourced performance assessment is the risk that personal interests can introduce distortions of fact, especially when the system is used to determine merit pay or promotion. In this paper, we develop a method to identify bias and strategic behavior in crowdsourced performance assessment, using a rich dataset collected from a professional service firm in China. We find a pattern of "discriminatory generosity" in peer evaluation, where raters downgrade peer coworkers who have passed objective promotion requirements while overrating peer coworkers who have not yet passed. This introduces two types of bias: the first aimed against more competent competitors, and the other favoring less eligible peers, which can serve as a mask for the first. This paper also aims to bring the perspective of fairness-aware data mining to talent and management computing. Historical decision records, such as performance ratings, often contain subjective judgments which are prone to bias and strategic behavior. For practitioners of predictive talent analytics, it is important to investigate the potential bias and strategic behavior underlying historical decision records.

* International Workshop of Talent and Management Computing, KDD 2019
Principal Component Analysis (PCA) is a popular tool for dimensionality reduction and feature extraction in data analysis. There is a probabilistic version of PCA, known as Probabilistic PCA (PPCA). However, standard PCA and PPCA are not robust, as they are sensitive to outliers. To alleviate this problem, this paper introduces the self-paced learning mechanism into PPCA and proposes a novel method called Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). Furthermore, we design the corresponding optimization algorithm based on an alternating search strategy and the expectation-maximization algorithm. SP-PPCA looks for optimal projection vectors and filters out outliers iteratively. Experiments on both synthetic problems and real-world datasets clearly demonstrate that SP-PPCA is able to reduce or eliminate the impact of outliers.
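The self-paced selection rule at the core of such methods can be sketched in isolation. This is our illustration of the standard hard self-paced regularizer, not the paper's full SP-PPCA algorithm: per iteration, only samples whose current loss falls below an age parameter participate in the model update, and the parameter grows to admit harder samples over time.

```python
# Sketch of the hard self-paced selection rule (our illustration):
# weight 1 for "easy" samples with loss below lam, weight 0 otherwise.
def self_paced_weights(losses, lam):
    """Binary sample weights under the hard self-paced regularizer."""
    return [1.0 if l < lam else 0.0 for l in losses]

losses = [0.1, 0.3, 5.0, 0.2]             # third sample looks like an outlier
w = self_paced_weights(losses, lam=1.0)   # -> [1.0, 1.0, 0.0, 1.0]
# As lam increases across iterations, more samples are admitted;
# persistent outliers keep a large loss and remain filtered out.
```

In SP-PPCA these weights interleave with EM updates of the PPCA parameters, so that the projection is estimated mostly from inlier samples.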
