Models, code, and papers for "Yu-Shen Liu":
Low-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets that are hard to load and decompose in limited memory. Experimental results show that our method achieves high accuracy and is much faster than state-of-the-art methods on several benchmarks with large numbers of high-dimensional data. Code has been made available at https://github.com/highan911/FLRML
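
To make "searching along the descent curves of the manifold" concrete, here is a minimal sketch of one common way to move along a curve that stays on the Stiefel manifold: the Cayley-transform update. The abstract does not say which curve FLRML uses, so the Cayley form, the function name `cayley_step`, and the random test data are all assumptions for illustration. The sketch also solves a dense n-by-n system, so it does not reflect the complexity reductions the paper claims.

```python
import numpy as np

def cayley_step(X, G, tau):
    """Move X along a Cayley-transform curve on the Stiefel manifold.

    X   : (n, p) matrix with orthonormal columns (X.T @ X == I).
    G   : (n, p) Euclidean gradient of the objective at X.
    tau : step size along the curve.
    """
    n = X.shape[0]
    A = G @ X.T - X @ G.T              # skew-symmetric by construction
    I = np.eye(n)
    # Y(tau) = (I + tau/2 * A)^{-1} (I - tau/2 * A) X keeps Y.T @ Y == I.
    return np.linalg.solve(I + 0.5 * tau * A, (I - 0.5 * tau * A) @ X)

# Hypothetical data: a random point on the manifold and a random gradient.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((8, 3)))
G = rng.standard_normal((8, 3))
Y = cayley_step(X, G, tau=0.1)
```

Because the update multiplies X by an orthogonal matrix, the new point keeps orthonormal columns without any explicit re-projection, which is why the constrained problem can be treated as unconstrained on the manifold.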
Learning discriminative shape representations directly on point clouds is still challenging in 3D shape analysis and understanding. Recent studies usually involve three steps: first splitting a point cloud into local regions, then extracting the corresponding feature of each local region, and finally aggregating all individual local region features into a global feature as the shape representation using simple max pooling. However, such pooling-based feature aggregation methods do not adequately take the spatial relationships between local regions into account, which greatly limits the ability to learn discriminative shape representations. To address this issue, we propose a novel deep learning network, named Point2SpatialCapsule, for aggregating features and spatial relationships of local regions on point clouds, which aims to learn more discriminative shape representations. Compared with traditional max-pooling based feature aggregation networks, Point2SpatialCapsule can explicitly learn not only geometric features of local regions but also spatial relationships among them. It consists of two modules. To resolve the disorder problem of local regions, the first module, named geometric feature aggregation, is designed to aggregate the local region features into the learnable cluster centers, which explicitly encodes the spatial locations from the original 3D space. The second module, named spatial relationship aggregation, is proposed for further aggregating clustered features and the spatial relationships among them in the feature space using the spatial-aware capsules developed in this paper. Compared to the previous capsule network based methods, the feature routing on the spatial-aware capsules can learn more discriminative spatial relationships among local regions for point clouds, which establishes a direct mapping between log priors and the spatial locations through feature clusters.
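
The limitation of max pooling described above can be seen in a few lines. This is a small sketch with made-up feature values, not code from Point2SpatialCapsule: it only shows that a max-pooled global feature is unchanged when the local region features are reassigned to different regions, so the spatial arrangement is lost.

```python
import numpy as np

def max_pool_aggregate(region_features):
    """Collapse (num_regions, feature_dim) local features into one global vector."""
    return region_features.max(axis=0)

# Hypothetical local features for 16 regions, 64 dimensions each.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 64))

# Assign the same features to the regions in a different order.
shuffled = features[rng.permutation(16)]

global_a = max_pool_aggregate(features)
global_b = max_pool_aggregate(shuffled)
same_global_feature = np.array_equal(global_a, global_b)
```

The two global vectors are identical, which is the order invariance that makes pooling convenient and also the reason it cannot encode which region sits where.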
Deep learning has achieved remarkable results in 3D shape analysis by learning global shape features from the pixel-level over multiple views. Previous methods, however, compute low-level features for entire views without considering part-level information. In contrast, we propose a deep neural network, called Parts4Feature, to learn 3D global features from part-level information in multiple views. We introduce a novel definition of generally semantic parts, which Parts4Feature learns to detect in multiple views from different 3D shape segmentation benchmarks. A key idea of our architecture is that it transfers the ability to detect semantically meaningful parts in multiple views to learn 3D global features. Parts4Feature achieves this by combining a local part detection branch and a global feature learning branch with a shared region proposal module. The global feature learning branch aggregates the detected parts in terms of learned part patterns with a novel multi-attention mechanism, while the region proposal module enables locally and globally discriminative information to be promoted by each other. We demonstrate that Parts4Feature outperforms the state-of-the-art under three large-scale 3D shape benchmarks.
Exploring contextual information in the local region is important for shape understanding and analysis. Existing studies often employ hand-crafted or explicit ways to encode contextual information of local regions. However, it is hard to capture fine-grained contextual information in hand-crafted or explicit manners, such as the correlation between different areas in a local region, which limits the discriminative ability of learned features. To resolve this issue, we propose a novel deep learning model for 3D point clouds, named Point2Sequence, to learn 3D shape features by capturing fine-grained contextual information in a novel implicit way. Point2Sequence employs a novel sequence learning model for point clouds to capture the correlations by aggregating multi-scale areas of each local region with attention. Specifically, Point2Sequence first learns the feature of each area scale in a local region. Then, it captures the correlation between area scales in the process of aggregating all area scales using a recurrent neural network (RNN) based encoder-decoder structure, where an attention mechanism is proposed to highlight the importance of different area scales. Experimental results show that Point2Sequence achieves state-of-the-art performance in shape classification and segmentation tasks.
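
As a rough illustration of highlighting "the importance of different area scales" with attention, the sketch below computes softmax weights over per-scale features and returns their weighted sum. The dot-product scoring, the `query` context vector, and the toy numbers are assumptions; the abstract describes an RNN-based encoder-decoder that this sketch does not reproduce.

```python
import numpy as np

def attention_aggregate(scale_features, query):
    """Weight per-scale features by softmax attention and sum them.

    scale_features : (num_scales, dim) features, one row per area scale.
    query          : (dim,) hypothetical context vector used for scoring.
    """
    scores = scale_features @ query
    scores = scores - scores.max()            # stabilise the exponentials
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ scale_features, weights

# Toy example: 4 area scales with 3-dimensional features.
scale_features = np.arange(12, dtype=float).reshape(4, 3)
query = np.ones(3)
region_feature, weights = attention_aggregate(scale_features, query)
```

Scales whose features score higher against the context vector receive larger weights, so they contribute more to the aggregated region feature.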
3D shape captioning is a challenging application in 3D shape understanding. Captions from recent multi-view based methods reveal that they cannot capture part-level characteristics of 3D shapes. This leads to a lack of detailed part-level description in captions, which humans tend to focus on. To resolve this issue, we propose ShapeCaptioner, a generative caption network, to perform 3D shape captioning from semantic parts detected in multiple views. Our novelty lies in learning the knowledge of part detection in multiple views from 3D shape segmentations and transferring this knowledge to facilitate learning the mapping from 3D shapes to sentences. Specifically, ShapeCaptioner aggregates the parts detected in multiple colored views using our novel part class specific aggregation to represent a 3D shape, and then employs a sequence-to-sequence model to generate the caption. Our results show that ShapeCaptioner can learn 3D shape features with more detailed part characteristics to facilitate better 3D shape captioning than previous work.
Unsupervised feature learning for point clouds has been vital for large-scale point cloud understanding. Recent deep learning based methods depend on learning global geometry from self-reconstruction. However, these methods still suffer from ineffective learning of local geometry, which significantly limits the discriminability of learned features. To resolve this issue, we propose MAP-VAE to enable the learning of global and local geometry by jointly leveraging global and local self-supervision. To enable effective local self-supervision, we introduce multi-angle analysis for point clouds. In a multi-angle scenario, we first split a point cloud into a front half and a back half from each angle, and then train MAP-VAE to learn to predict a back half sequence from the corresponding front half sequence. MAP-VAE performs this half-to-half prediction using an RNN to simultaneously learn each local geometry and the spatial relationship among them. In addition, MAP-VAE also learns global geometry via self-reconstruction, where we employ a variational constraint to facilitate novel shape generation. Results in four shape analysis tasks show that MAP-VAE can learn more discriminative global or local features than the state-of-the-art methods.
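
The front-half and back-half split can be sketched as follows. This is one plausible reading, not MAP-VAE's actual procedure: it assumes the viewing direction points from the viewer toward the shape, so points with the smallest projection onto that direction count as the front, and it splits at the median depth. The function name and the random point cloud are hypothetical.

```python
import numpy as np

def split_front_back(points, view_dir):
    """Split an (N, 3) point cloud into front and back halves along view_dir.

    Assumes view_dir points from the viewer toward the shape, so a smaller
    projection onto view_dir means the point is closer to the viewer.
    """
    direction = np.asarray(view_dir, dtype=float)
    direction = direction / np.linalg.norm(direction)
    depth = points @ direction
    order = np.argsort(depth)                 # nearest points first
    half = len(points) // 2
    return points[order[:half]], points[order[half:]]

# Hypothetical point cloud and one viewing angle.
rng = np.random.default_rng(0)
cloud = rng.standard_normal((1024, 3))
front, back = split_front_back(cloud, view_dir=[0.0, 0.0, 1.0])
```

Repeating the split for several viewing directions gives one front/back pair per angle, each of which could serve as an input and target for half-to-half prediction.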
In this paper we present a novel unsupervised representation learning approach for 3D shapes, which is an important research challenge as it avoids the manual effort required for collecting supervised data. Our method trains an RNN-based neural network architecture to solve multiple view inter-prediction tasks for each shape. Given several nearby views of a shape, we define view inter-prediction as the task of predicting the center view between the input views, and reconstructing the input views in a low-level feature space. The key idea of our approach is to implement the shape representation as a shape-specific global memory that is shared between all local view inter-predictions for each shape. Intuitively, this memory enables the system to aggregate information that is useful to better solve the view inter-prediction tasks for each shape, and to leverage the memory as a view-independent shape representation. Our approach obtains the best results using a combination of L_2 and adversarial losses for the view inter-prediction task. We show that VIP-GAN outperforms state-of-the-art methods in unsupervised 3D feature learning on three large scale 3D shape benchmarks.
The auto-encoder is an important architecture for understanding point clouds through an encoding and decoding procedure of self-reconstruction. Current auto-encoders mainly focus on learning global structure through global shape reconstruction, while ignoring the learning of local structures. To resolve this issue, we propose the Local-to-Global auto-encoder (L2G-AE) to simultaneously learn the local and global structure of point clouds by local-to-global reconstruction. Specifically, L2G-AE employs an encoder to encode the geometry information of multiple scales in a local region at the same time. In addition, we introduce a novel hierarchical self-attention mechanism to highlight the important points, scales and regions at different levels in the information aggregation of the encoder. Simultaneously, L2G-AE employs a recurrent neural network (RNN) as the decoder to reconstruct a sequence of scales in a local region, based on which the global point cloud is incrementally reconstructed. Our results in shape classification, retrieval and upsampling show that L2G-AE can understand point clouds better than state-of-the-art methods.
A recent method employs 3D voxels to represent 3D shapes, but this limits the approach to low resolutions due to the computational cost caused by the cubic complexity of 3D voxels. Hence the method suffers from a lack of detailed geometry. To resolve this issue, we propose Y^2Seq2Seq, a view-based model, to learn cross-modal representations by joint reconstruction and prediction of view and word sequences. Specifically, the network architecture of Y^2Seq2Seq bridges the semantic meaning embedded in the two modalities by two coupled 'Y'-like sequence-to-sequence (Seq2Seq) structures. In addition, our novel hierarchical constraints further increase the discriminability of the cross-modal representations by employing more detailed discriminative information. Experimental results on cross-modal retrieval and 3D shape captioning show that Y^2Seq2Seq outperforms the state-of-the-art methods.
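
The cubic complexity of voxel grids mentioned above is simple arithmetic, sketched here with illustrative resolutions that are not taken from the paper.

```python
def voxel_count(resolution):
    """Number of cells in a cubic voxel grid with the given side length."""
    return resolution ** 3

# Illustrative resolutions only; the abstract does not state which were used.
counts = {r: voxel_count(r) for r in (32, 64, 128)}
for resolution, cells in counts.items():
    print(resolution, cells)
```

Doubling the side length multiplies the number of cells by eight, which is why voxel-based models tend to stay at low resolutions.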
Learning global features by aggregating information over multiple views has been shown to be effective for 3D shape analysis. For view aggregation in deep learning models, pooling has been applied extensively. However, pooling leads to a loss of the content within views and of the spatial relationship among views, which limits the discriminability of learned features. We propose 3DViewGraph to resolve this issue, which learns 3D global features by more effectively aggregating unordered views with attention. Specifically, unordered views taken around a shape are regarded as view nodes on a view graph. 3DViewGraph first learns a novel latent semantic mapping to project low-level view features into meaningful latent semantic embeddings in a lower dimensional space, which is spanned by latent semantic patterns. Then, the content and spatial information of each pair of view nodes are encoded by a novel spatial pattern correlation, where the correlation is computed among latent semantic patterns. Finally, all spatial pattern correlations are integrated with attention weights learned by a novel attention mechanism. This further increases the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguity. We show that 3DViewGraph outperforms state-of-the-art methods under three large-scale benchmarks.
Deep Convolutional Neural Networks (CNNs) are widely employed in modern computer vision algorithms, where the input image is convolved iteratively by many kernels to extract the knowledge behind it. However, as networks have grown deeper in recent years, their enormous computational complexity makes them difficult to deploy on embedded systems with limited hardware resources. In this paper, we propose two computation-performance optimization methods to reduce the redundant convolution kernels of a CNN under performance and architecture constraints, and apply them to a network for super resolution (SR). Using the PSNR drop relative to the original network as the performance criterion, our method can obtain the optimal PSNR under a given computation budget. Conversely, our method can also minimize the computation required under a given PSNR drop.
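
For readers unfamiliar with the criterion, PSNR is defined as 10 * log10(MAX^2 / MSE), and a "PSNR drop" is the difference between the PSNR of the original network's output and that of the reduced network's output against the same reference. The sketch below is a generic implementation of that definition; the function names, the 8-bit peak value of 255, and the toy images are assumptions and are not taken from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def psnr_drop(reference, original_output, pruned_output, max_value=255.0):
    """How many dB are lost when the pruned network replaces the original."""
    return (psnr(reference, original_output, max_value)
            - psnr(reference, pruned_output, max_value))

# Toy 8-bit images: the "pruned" output is off by 2 everywhere, the original by 1.
reference = np.full((4, 4), 100, dtype=np.uint8)
original_output = reference + 1
pruned_output = reference + 2
drop = psnr_drop(reference, original_output, pruned_output)
```

A positive drop means the reduced network reconstructs the reference less faithfully than the original, which is the quantity traded off against the computation budget.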