Models, code, and papers for "Jingyi Li":
The availability of affordable 3D full body reconstruction systems has given rise to free-viewpoint video (FVV) of human shapes. Most existing solutions produce temporally uncorrelated point clouds or meshes with unknown point/vertex correspondences. Individually compressing each frame is ineffective and still yields to ultra-large data sizes. We present an end-to-end deep learning scheme to establish dense shape correspondences and subsequently compress the data. Our approach uses sparse set of "panoramic" depth maps or PDMs, each emulating an inward-viewing concentric mosaics. We then develop a learning-based technique to learn pixel-wise feature descriptors on PDMs. The results are fed into an autoencoder-based network for compression. Comprehensive experiments demonstrate our solution is robust and effective on both public and our newly captured datasets.
Community detection or clustering is a fundamental task in the analysis of network data. Many real networks have a bipartite structure which makes community detection challenging. In this paper, we consider a model which allows for matched communities in the bipartite setting, in addition to node covariates with information about the matching. We derive a simple fast algorithm for fitting the model based on variational inference ideas and show its effectiveness on both simulated and real data. A variation of the model to allow for degree-correction is also considered, in addition to a novel approach to fitting such degree-corrected models.
Particle Imaging Velocimetry (PIV) estimates the flow of fluid by analyzing the motion of injected particles. The problem is challenging as the particles lie at different depths but have similar appearance and tracking a large number of particles is particularly difficult. In this paper, we present a PIV solution that uses densely sampled light field to reconstruct and track 3D particles. We exploit the refocusing capability and focal symmetry constraint of the light field for reliable particle depth estimation. We further propose a new motion-constrained optical flow estimation scheme by enforcing local motion rigidity and the Navier-Stoke constraint. Comprehensive experiments on synthetic and real experiments show that using a single light field camera, our technique can recover dense and accurate 3D fluid flows in small to medium volumes.
We present an intelligent virtual interviewer that engages with a user in a text-based conversation and automatically infers the user's psychological traits, such as personality. We investigate how the personality of a virtual interviewer influences a user's behavior from two perspectives: the user's willingness to confide in, and listen to, a virtual interviewer. We have developed two virtual interviewers with distinct personalities and deployed them in a real-world recruiting event. We present findings from completed interviews with 316 actual job applicants. Notably, users are more willing to confide in and listen to a virtual interviewer with a serious, assertive personality. Moreover, users' personality traits, inferred from their chat text, influence their perception of a virtual interviewer, and their willingness to confide in and listen to a virtual interviewer. Finally, we discuss the implications of our work on building hyper-personalized, intelligent agents based on user traits.
Human visual system relies on both binocular stereo cues and monocular focusness cues to gain effective 3D perception. In computer vision, the two problems are traditionally solved in separate tracks. In this paper, we present a unified learning-based technique that simultaneously uses both types of cues for depth inference. Specifically, we use a pair of focal stacks as input to emulate human perception. We first construct a comprehensive focal stack training dataset synthesized by depth-guided light field rendering. We then construct three individual networks: a FocusNet to extract depth from a single focal stack, a EDoFNet to obtain the extended depth of field (EDoF) image from the focal stack, and a StereoNet to conduct stereo matching. We then integrate them into a unified solution to obtain high quality depth maps. Comprehensive experiments show that our approach outperforms the state-of-the-art in both accuracy and speed and effectively emulates human vision systems.
In multi-view human body capture systems, the recovered 3D geometry or even the acquired imagery data can be heavily corrupted due to occlusions, noise, limited field of- view, etc. Direct estimation of 3D pose, body shape or motion on these low-quality data has been traditionally challenging.In this paper, we present a graph-based non-rigid shape registration framework that can simultaneously recover 3D human body geometry and estimate pose/motion at high fidelity.Our approach first generates a global full-body template by registering all poses in the acquired motion sequence.We then construct a deformable graph by utilizing the rigid components in the global template. We directly warp the global template graph back to each motion frame in order to fill in missing geometry. Specifically, we combine local rigidity and temporal coherence constraints to maintain geometry and motion consistencies. Comprehensive experiments on various scenes show that our method is accurate and robust even in the presence of drastic motions.
Depth from defocus (DfD) and stereo matching are two most studied passive depth sensing schemes. The techniques are essentially complementary: DfD can robustly handle repetitive textures that are problematic for stereo matching whereas stereo matching is insensitive to defocus blurs and can handle large depth range. In this paper, we present a unified learning-based technique to conduct hybrid DfD and stereo matching. Our input is image triplets: a stereo pair and a defocused image of one of the stereo views. We first apply depth-guided light field rendering to construct a comprehensive training dataset for such hybrid sensing setups. Next, we adopt the hourglass network architecture to separately conduct depth inference from DfD and stereo. Finally, we exploit different connection methods between the two separate networks for integrating them into a unified solution to produce high fidelity 3D disparity maps. Comprehensive experiments on real and synthetic data show that our new learning-based hybrid 3D sensing technique can significantly improve accuracy and robustness in 3D reconstruction.
Nearly all existing visual saliency models by far have focused on predicting a universal saliency map across all observers. Yet psychology studies suggest that visual attention of different observers can vary significantly under specific circumstances, especially a scene is composed of multiple salient objects. To study such heterogenous visual attention pattern across observers, we first construct a personalized saliency dataset and explore correlations between visual attention, personal preferences, and image contents. Specifically, we propose to decompose a personalized saliency map (referred to as PSM) into a universal saliency map (referred to as USM) predictable by existing saliency detection models and a new discrepancy map across users that characterizes personalized saliency. We then present two solutions towards predicting such discrepancy maps, i.e., a multi-task convolutional neural network (CNN) framework and an extended CNN with Person-specific Information Encoded Filters (CNN-PIEF). Extensive experimental results demonstrate the effectiveness of our models for PSM prediction as well their generalization capability for unseen observers.
This paper presents a novel Coprime Blurred Pair (CBP) model for visual data-hiding for security in camera surveillance. While most previous approaches have focused on completely encrypting the video stream, we introduce a spatial encryption scheme by blurring the image/video contents to create a CBP. Our goal is to obscure detail in public video streams by blurring while allowing behavior to be recognized and to quickly deblur the stream so that details are available if behavior is recognized as suspicious. We create a CBP by blurring the same latent image with two unknown kernels. The two kernels are coprime when mapped to bivariate polynomials in the z domain. To deblur the CBP we first use the coprime constraint to approximate the kernels and sample the bivariate CBP polynomials in one dimension on the unit circle. At each sample point, we factor the 1D polynomial pair and compose the results into a 2D kernel matrix. Finally, we compute the inverse Fast Fourier Transform (FFT) of the kernel matrices to recover the coprime kernels and then the latent video stream. It is therefore only possible to deblur the video stream if a user has access to both streams. To improve the practicability of our algorithm, we implement our algorithm using a graphics processing unit (GPU) to decrypt the blurred video streams in real-time, and extensive experimental results demonstrate that our new scheme can effectively protect sensitive identity information in surveillance videos and faithfully reconstruct the unblurred video stream when two blurred sequences are available.
Robust segmentation of hair from portrait images remains challenging: hair does not conform to a uniform shape, style or even color; dark hair in particular lacks features. We present a novel computational imaging solution that tackles the problem from both input and processing fronts. We explore using Time-of-Flight (ToF) RGBD sensors on recent mobile devices. We first conduct a comprehensive analysis to show that scattering and inter-reflection cause different noise patterns on hair vs. non-hair regions on ToF images, by changing the light path and/or combining multiple paths. We then develop a deep network based approach that employs both ToF depth map and the RGB gradient maps to produce an initial hair segmentation with labeled hair components. We then refine the result by imposing ToF noise prior under the conditional random field. We collect the first ToF RGBD hair dataset with 20k+ head images captured on 30 human subjects with a variety of hairstyles at different view angles. Comprehensive experiments show that our approach outperforms the RGB based techniques in accuracy and robustness and can handle traditionally challenging cases such as dark hair, similar hair/background, similar hair/foreground, etc.
Time-of-Flight (ToF) depth sensing camera is able to obtain depth maps at a high frame rate. However, its low resolution and sensitivity to the noise are always a concern. A popular solution is upsampling the obtained noisy low resolution depth map with the guidance of the companion high resolution color image. However, due to the constrains in the existing upsampling models, the high resolution depth map obtained in such way may suffer from either texture copy artifacts or blur of depth discontinuity. In this paper, a novel optimization framework is proposed with the brand new data term and smoothness term. The comprehensive experiments using both synthetic data and real data show that the proposed method well tackles the problem of texture copy artifacts and blur of depth discontinuity. It also demonstrates sufficient robustness to the noise. Moreover, a data driven scheme is proposed to adaptively estimate the parameter in the upsampling optimization framework. The encouraging performance is maintained even in the case of large upsampling e.g. $8\times$ and $16\times$.
A surface light field represents the radiance of rays originating from any points on the surface in any directions. Traditional approaches require ultra-dense sampling to ensure the rendering quality. In this paper, we present a novel neural network based technique called deep surface light field or DSLF to use only moderate sampling for high fidelity rendering. DSLF automatically fills in the missing data by leveraging different sampling patterns across the vertices and at the same time eliminates redundancies due to the network's prediction capability. For real data, we address the image registration problem as well as conduct texture-aware remeshing for aligning texture edges with vertices to avoid blurring. Comprehensive experiments show that DSLF can further achieve high data compression ratio while facilitating real-time rendering on the GPU.
We present a learning-based scheme for robustly and accurately estimating clothing fitness as well as the human shape on clothed 3D human scans. Our approach maps the clothed human geometry to a geometry image that we call clothed-GI. To align clothed-GI under different clothing, we extend the parametric human model and employ skeleton detection and warping for reliable alignment. For each pixel on the clothed-GI, we extract a feature vector including color/texture, position, normal, etc. and train a modified conditional GAN network for per-pixel fitness prediction using a comprehensive 3D clothing. Our technique significantly improves the accuracy of human shape prediction, especially under loose and fitted clothing. We further demonstrate using our results for human/clothing segmentation and virtual clothes fitting at a high visual realism.
We introduce a large-scale 3D shape understanding benchmark using data and annotation from ShapeNet 3D object database. The benchmark consists of two tasks: part-level segmentation of 3D shapes and 3D reconstruction from single view images. Ten teams have participated in the challenge and the best performing teams have outperformed state-of-the-art approaches on both tasks. A few novel deep learning architectures have been proposed on various 3D representations on both tasks. We report the techniques used by each team and the corresponding performances. In addition, we summarize the major discoveries from the reported results and possible trends for the future work in the field.
Health data are generally complex in type and small in sample size. Such domain-specific challenges make it difficult to capture information reliably and contribute further to the issue of generalization. To assist the analytics of healthcare datasets, we develop a feature selection method based on the concept of Coverage Adjusted Standardized Mutual Information (CASMI). The main advantages of the proposed method are: 1) it selects features more efficiently with the help of an improved entropy estimator, particularly when the sample size is small, and 2) it automatically learns the number of features to be selected based on the information from sample data. Additionally, the proposed method handles feature redundancy from the perspective of joint-distribution. The proposed method focuses on non-ordinal data, while it works with numerical data with an appropriate binning method. A simulation study comparing the proposed method to six widely cited feature selection methods shows that the proposed method performs better when measured by the Information Recovery Ratio, particularly when the sample size is small.
Occlusion is one of the most challenging problems in depth estimation. Previous work has modeled the single-occluder occlusion in light field and get good results, however it is still difficult to obtain accurate depth for multi-occluder occlusion. In this paper, we explore the multi-occluder occlusion model in light field, and derive the occluder-consistency between the spatial and angular space which is used as a guidance to select the un-occluded views for each candidate occlusion point. Then an anti-occlusion energy function is built to regularize depth map. The experimental results on public light field datasets have demonstrated the advantages of the proposed algorithm compared with other state-of-the-art light field depth estimation algorithms, especially in multi-occluder areas.
Stochastic approximation (SA) algorithms have been widely applied in minimization problems where the loss functions and/or the gradient are only accessible through noisy evaluations. Among all the SA algorithms, the second-order simultaneous perturbation stochastic approximation (2SPSA) and the second-order stochastic gradient (2SG) are particularly efficient in high-dimensional problems covering both gradient-free and gradient-based scenarios. However, due to the necessary matrix operations, the per-iteration floating-point-operation cost of the original 2SPSA/2SG is $ O(p^3) $ with $ p $ being the dimension of the underlying parameter. Note that the $O(p^3)$ floating-point-operation cost is distinct from the classical SPSA-based per-iteration $O(1)$ cost in terms of the number of noisy function evaluations. In this work, we propose a technique to efficiently implement the 2SPSA/2SG algorithms via the symmetric indefinite matrix factorization and show that the per-iteration floating-point-operation cost is reduced from $ O(p^3) $ to $ O(p^2) $. The almost sure convergence and rate of convergence for the newly-proposed scheme are inherited from the original 2SPSA/2SG naturally. The numerical improvement manifests its superiority in numerical studies in terms of computational complexity and numerical stability.
Recently, it has been shown that deep neural networks (DNN) are subject to attacks through adversarial samples. Adversarial samples are often crafted through adversarial perturbation, i.e., manipulating the original sample with minor modifications so that the DNN model labels the sample incorrectly. Given that it is almost impossible to train perfect DNN, adversarial samples are shown to be easy to generate. As DNN are increasingly used in safety-critical systems like autonomous cars, it is crucial to develop techniques for defending such attacks. Existing defense mechanisms which aim to make adversarial perturbation challenging have been shown to be ineffective. In this work, we propose an alternative approach. We first observe that adversarial samples are much more sensitive to perturbations than normal samples. That is, if we impose random perturbations on a normal and an adversarial sample respectively, there is a significant difference between the ratio of label change due to the perturbations. Observing this, we design a statistical adversary detection algorithm called nMutant (inspired by mutation testing from software engineering community). Our experiments show that nMutant effectively detects most of the adversarial samples generated by recently proposed attacking methods. Furthermore, we provide an error bound with certain statistical significance along with the detection.
Contact modeling is essential for robotic grasping and manipulation. The relation between friction and relative body motion is fundamental for controlled pushing. An accurate friction model is indispensable for grasp analysis as the stability heavily relies on friction. To increase the grasp stability, soft fingers are widely deployed for manipulation tasks as they adapt to the object geometry, where the deformability results in a curved contact area. The friction of such curved surfaces is in six dimensions and its model is yet not well defined. To address this issue, we derive the friction computation for curved surfaces by combining concepts of differential geometry and Coulomb friction law. We further generalize two classical limit surface models from three to six dimensions, which describe the friction-motion constraints for a single contact. To analyze multi contacts for grasping, we build the grasp wrench space by merging the normal wrench and the fitted limit surfaces of each contact. The performance of the two limit surface models is evaluated with six parametric surfaces and 2473 meshed contacts obtained from simulations using the finite element method. Results indicate that the proposed models yield 1.81% fitting error of the 6D friction wrench samples. We demonstrate the applicability of the proposed models to predict grasp success for a parallel-jaw gripper. Robotic experiments suggest that a prediction accuracy of up to 92.6% can be achieved with the presented frictional contact modeling.
The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.