Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kent Fujiwara

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches

May 08, 2024
Qing Yu, Mikihiro Tanaka, Kent Fujiwara

To build a cross-modal latent space between 3D human motion and language, acquiring large-scale and high-quality human motion data is crucial. However, unlike the abundance of image data, the scarcity of motion data has limited the performance of existing motion-language models. To counter this, we introduce "motion patches", a new representation of motion sequences, and propose using Vision Transformers (ViT) as motion encoders via transfer learning, aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches, created by dividing and sorting skeleton joints based on body parts in motion sequences, are robust to varying skeleton structures, and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis, presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches, used jointly with ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval, and other novel challenging tasks, such as cross-skeleton recognition, zero-shot motion classification, and human interaction recognition, which are currently impeded by the lack of data.

* Accepted to CVPR 2024, Project website: https://yu1ut.com/MotionPatches-HP/

Via

Access Paper or Ask Questions

Canonical and Compact Point Cloud Representation for Shape Classification

Sep 13, 2018
Kent Fujiwara, Ikuro Sato, Mitsuru Ambai, Yuichi Yoshida, Yoshiaki Sakakura

Figure 1 for Canonical and Compact Point Cloud Representation for Shape Classification

Figure 2 for Canonical and Compact Point Cloud Representation for Shape Classification

Figure 3 for Canonical and Compact Point Cloud Representation for Shape Classification

Figure 4 for Canonical and Compact Point Cloud Representation for Shape Classification

We present a novel compact point cloud representation that is inherently invariant to scale, coordinate change and point permutation. The key idea is to parametrize a distance field around an individual shape into a unique, canonical, and compact vector in an unsupervised manner. We firstly project a distance field to a $4$D canonical space using singular value decomposition. We then train a neural network for each instance to non-linearly embed its distance field into network parameters. We employ a bias-free Extreme Learning Machine (ELM) with ReLU activation units, which has scale-factor commutative property between layers. We demonstrate the descriptiveness of the instance-wise, shape-embedded network parameters by using them to classify shapes in $3$D datasets. Our learning-based representation requires minimal augmentation and simple neural networks, where previous approaches demand numerous representations to handle coordinate change and point permutation.

* 16 pages, 5 figures

Via

Access Paper or Ask Questions