Models, code, and papers for "Piotr Koniusz":

Hallucinating Statistical Moment and Subspace Descriptors from Object and Saliency Detectors for Action Recognition

Jan 14, 2020
Lei Wang, Piotr Koniusz

In this paper, we build on a deep translational action recognition network which takes RGB frames as input to learn to predict both action concepts and auxiliary supervisory feature descriptors e.g., Optical Flow Features and/or Improved Dense Trajectory descriptors. The translation is performed by so-called hallucination streams trained to predict auxiliary cues which are simultaneously fed into classification layers, and then hallucinated for free at the testing stage to boost recognition. In this paper, we design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. Another descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As numbers of coefficients in the mean, covariance, coskewness and cokurtotsis grow linearly, quadratically, cubically and quartically w.r.t. the dimension of feature vectors, we describe the covariance matrix by its leading n' eigenvectors (so-called subspace) and we capture skewness/kurtosis rather than costly coskewness/cokurtosis. We obtain state of the art on three popular datasets.


  Access Model/Code and Paper
Power Normalizing Second-order Similarity Network for Few-shot Learning

Nov 10, 2018
Hongguang Zhang, Piotr Koniusz

Second- and higher-order statistics of data points have played an important role in advancing the state of the art on several computer vision problems such as the fine-grained image and scene recognition. However, these statistics need to be passed via an appropriate pooling scheme to obtain the best performance. Power Normalizations are non-linear activation units which enjoy probability-inspired derivations and can be applied in CNNs. In this paper, we propose a similarity learning network leveraging second-order information and Power Normalizations. To this end, we propose several formulations capturing second-order statistics and derive a sigmoid-like Power Normalizing function to demonstrate its interpretability. Our model is trained end-to-end to learn the similarity between the support set and query images for the problem of one- and few-shot learning. The evaluations on Omniglot, miniImagenet and Open MIC datasets demonstrate that this network obtains state-of-the-art results on several few-shot learning protocols.


  Access Model/Code and Paper
Model Selection for Generalized Zero-shot Learning

Nov 08, 2018
Hongguang Zhang, Piotr Koniusz

In the problem of generalized zero-shot learning, the datapoints from unknown classes are not available during training. The main challenge for generalized zero-shot learning is the unbalanced data distribution which makes it hard for the classifier to distinguish if a given testing sample comes from a seen or unseen class. However, using Generative Adversarial Network (GAN) to generate auxiliary datapoints by the semantic embeddings of unseen classes alleviates the above problem. Current approaches combine the auxiliary datapoints and original training data to train the generalized zero-shot learning model and obtain state-of-the-art results. Inspired by such models, we propose to feed the generated data via a model selection mechanism. Specifically, we leverage two sources of datapoints (observed and auxiliary) to train some classifier to recognize which test datapoints come from seen and which from unseen classes. This way, generalized zero-shot learning can be divided into two disjoint classification tasks, thus reducing the negative influence of the unbalanced data distribution. Our evaluations on four publicly available datasets for generalized zero-shot learning show that our model obtains state-of-the-art results.


  Access Model/Code and Paper
CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps

Jun 24, 2018
Yusuf Tas, Piotr Koniusz

Deep learning is ubiquitous across many areas areas of computer vision. It often requires large scale datasets for training before being fine-tuned on small-to-medium scale problems. Activity, or, in other words, action recognition, is one of many application areas of deep learning. While there exist many Convolutional Neural Network architectures that work with the RGB and optical flow frames, training on the time sequences of 3D body skeleton joints is often performed via recurrent networks such as LSTM. In this paper, we propose a new representation which encodes sequences of 3D body skeleton joints in texture-like representations derived from mathematically rigorous kernel methods. Such a representation becomes the first layer in a standard CNN network e.g., ResNet-50, which is then used in the supervised domain adaptation pipeline to transfer information from the source to target dataset. This lets us leverage the available Kinect-based data beyond training on a single dataset and outperform simple fine-tuning on any two datasets combined in a naive manner. More specifically, in this paper we utilize the overlapping classes between datasets. We associate datapoints of the same class via so-called commonality, known from the supervised domain adaptation. We demonstrate state-of-the-art results on three publicly available benchmarks.


  Access Model/Code and Paper
Zero-Shot Kernel Learning

Jun 24, 2018
Hongguang Zhang, Piotr Koniusz

In this paper, we address an open problem of zero-shot learning. Its principle is based on learning a mapping that associates feature vectors extracted from i.e. images and attribute vectors that describe objects and/or scenes of interest. In turns, this allows classifying unseen object classes and/or scenes by matching feature vectors via mapping to a newly defined attribute vector describing a new class. Due to importance of such a learning task, there exist many methods that learn semantic, probabilistic, linear or piece-wise linear mappings. In contrast, we apply well-established kernel methods to learn a non-linear mapping between the feature and attribute spaces. We propose an easy learning objective inspired by the Linear Discriminant Analysis, Kernel-Target Alignment and Kernel Polarization methods that promotes incoherence. We evaluate performance of our algorithm on the Polynomial as well as shift-invariant Gaussian and Cauchy kernels. Despite simplicity of our approach, we obtain state-of-the-art results on several zero-shot learning datasets and benchmarks including a recent AWA2 dataset.

* IEEE Conference on Computer Vision and Pattern Recognition 2018 

  Access Model/Code and Paper
Dictionary Learning and Sparse Coding for Third-order Super-symmetric Tensors

Sep 09, 2015
Piotr Koniusz, Anoop Cherian

Super-symmetric tensors - a higher-order extension of scatter matrices - are becoming increasingly popular in machine learning and computer vision for modelling data statistics, co-occurrences, or even as visual descriptors. However, the size of these tensors are exponential in the data dimensionality, which is a significant concern. In this paper, we study third-order super-symmetric tensor descriptors in the context of dictionary learning and sparse coding. Our goal is to approximate these tensors as sparse conic combinations of atoms from a learned dictionary, where each atom is a symmetric positive semi-definite matrix. Apart from the significant benefits to tensor compression that this framework provides, our experiments demonstrate that the sparse coefficients produced by the scheme lead to better aggregation of high-dimensional data, and showcases superior performance on two common computer vision tasks compared to the state-of-the-art.

* 13 pages, NIPS 

  Access Model/Code and Paper
Few-Shot Learning via Saliency-guided Hallucination of Samples

Apr 06, 2019
Hongguang Zhang, Jing Zhang, Piotr Koniusz

Learning new concepts from a few of samples is a standard challenge in computer vision. The main directions to improve the learning ability of few-shot training models include (i) a robust similarity learning and (ii) generating or hallucinating additional data from the limited existing samples. In this paper, we follow the latter direction and present a novel data hallucination model. Currently, most datapoint generators contain a specialized network (i.e., GAN) tasked with hallucinating new datapoints, thus requiring large numbers of annotated data for their training in the first place. In this paper, we propose a novel less-costly hallucination method for few-shot learning which utilizes saliency maps. To this end, we employ a saliency network to obtain the foregrounds and backgrounds of available image samples and feed the resulting maps into a two-stream network to hallucinate datapoints directly in the feature space from viable foreground-background combinations. To the best of our knowledge, we are the first to leverage saliency maps for such a task and we demonstrate their usefulness in hallucinating additional datapoints for few-shot learning. Our proposed network achieves the state of the art on publicly available datasets.

* IEEE Conference on Computer Vision and Pattern Recognition 2019 

  Access Model/Code and Paper
Fisher-Bures Adversary Graph Convolutional Networks

Mar 11, 2019
Ke Sun, Piotr Koniusz, Jeff Wang

In a graph convolutional network, we assume that the graph $G$ is generated with respect to some observation noise. We make small random perturbations $\Delta{}G$ of the graph and try to improve generalization. Based on quantum information geometry, we can have quantitative measurements on the scale of $\Delta{}G$. We try to maximize the intrinsic scale of the permutation with a small budget while minimizing the loss based on the perturbed $G+\Delta{G}$. Our proposed model can consistently improve graph convolutional networks on semi-supervised node classification tasks with reasonable computational overhead. We present two different types of geometry on the manifold of graphs: one is for measuring the intrinsic change of a graph; the other is for measuring how such changes can affect externally a graph neural network. These new analytical tools will be useful in developing a good understanding of graph neural networks and fostering new techniques.


  Access Model/Code and Paper
A Deeper Look at Power Normalizations

Jun 24, 2018
Piotr Koniusz, Hongguang Zhang, Fatih Porikli

Power Normalizations (PN) are very useful non-linear operators in the context of Bag-of-Words data representations as they tackle problems such as feature imbalance. In this paper, we reconsider these operators in the deep learning setup by introducing a novel layer that implements PN for non-linear pooling of feature maps. Specifically, by using a kernel formulation, our layer combines the feature vectors and their respective spatial locations in the feature maps produced by the last convolutional layer of CNN. Linearization of such a kernel results in a positive definite matrix capturing the second-order statistics of the feature vectors, to which PN operators are applied. We study two types of PN functions, namely (i) MaxExp and (ii) Gamma, addressing their role and meaning in the context of nonlinear pooling. We also provide a probabilistic interpretation of these operators and derive their surrogates with well-behaved gradients for end-to-end CNN learning. We apply our theory to practice by implementing the PN layer on a ResNet-50 model and showcase experiments on four benchmarks for fine-grained recognition, scene recognition, and material classification. Our results demonstrate state-of-the-art performance across all these tasks.

* IEEE Conference on Computer Vision and Pattern Recognition, 2018 

  Access Model/Code and Paper
Artwork Identification from Wearable Camera Images for Enhancing Experience of Museum Audiences

Jun 24, 2018
Rui Zhang, Yusuf Tas, Piotr Koniusz

Recommendation systems based on image recognition could prove a vital tool in enhancing the experience of museum audiences. However, for practical systems utilizing wearable cameras, a number of challenges exist which affect the quality of image recognition. In this pilot study, we focus on recognition of museum collections by using a wearable camera in three different museum spaces. We discuss the application of wearable cameras, and the practical and technical challenges in devising a robust system that can recognize artworks viewed by the visitors to create a detailed record of their visit. Specifically, to illustrate the impact of different kinds of museum spaces on image recognition, we collect three training datasets of museum exhibits containing variety of paintings, clocks, and sculptures. Subsequently, we equip selected visitors with wearable cameras to capture artworks viewed by them as they stroll along exhibitions. We use Convolutional Neural Networks (CNN) which are pre-trained on the ImageNet dataset and fine-tuned on each of the training sets for the purpose of artwork identification. In the testing stage, we use CNNs to identify artworks captured by the visitors with a wearable camera. We analyze the accuracy of their recognition and provide an insight into the applicability of such a system to further engage audiences with museum exhibitions.

* Museums and the Web, 2017 

  Access Model/Code and Paper
Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors

Apr 11, 2017
Piotr Koniusz, Yusuf Tas, Fatih Porikli

In this paper, we propose an approach to the domain adaptation, dubbed Second- or Higher-order Transfer of Knowledge (So-HoT), based on the mixture of alignments of second- or higher-order scatter statistics between the source and target domains. The human ability to learn from few labeled samples is a recurring motivation in the literature for domain adaptation. Towards this end, we investigate the supervised target scenario for which few labeled target training samples per category exist. Specifically, we utilize two CNN streams: the source and target networks fused at the classifier level. Features from the fully connected layers fc7 of each network are used to compute second- or even higher-order scatter tensors; one per network stream per class. As the source and target distributions are somewhat different despite being related, we align the scatters of the two network streams of the same class (within-class scatters) to a desired degree with our bespoke loss while maintaining good separation of the between-class scatters. We train the entire network in end-to-end fashion. We provide evaluations on the standard Office benchmark (visual domains), RGB-D combined with Caltech256 (depth-to-rgb transfer) and Pascal VOC2007 combined with the TU Berlin dataset (image-to-sketch transfer). We attain state-of-the-art results.

* CVPR'17 

  Access Model/Code and Paper
Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

Jan 19, 2017
Anoop Cherian, Piotr Koniusz, Stephen Gould

Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips, and then their predictions from sliding-windows over the video sequence are pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets and show that our scheme leads to state-of-the-art results.

* 9 pages 

  Access Model/Code and Paper
Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons (Extended Version)

Jul 28, 2016
Piotr Koniusz, Anoop Cherian, Fatih Porikli

In this paper, we explore tensor representations that can compactly capture higher-order relationships between skeleton joints for 3D action recognition. We first define RBF kernels on 3D joint sequences, which are then linearized to form kernel descriptors. The higher-order outer-products of these kernel descriptors form our tensor representations. We present two different kernels for action recognition, namely (i) a sequence compatibility kernel that captures the spatio-temporal compatibility of joints in one sequence against those in the other, and (ii) a dynamics compatibility kernel that explicitly models the action dynamics of a sequence. Tensors formed from these kernels are then used to train an SVM. We present experiments on several benchmark datasets and demonstrate state of the art results, substantiating the effectiveness of our representations.


  Access Model/Code and Paper
A Comparative Review of Recent Kinect-based Action Recognition Algorithms

Jun 24, 2019
Lei Wang, Du Q. Huynh, Piotr Koniusz

Video-based human action recognition is currently one of the most active research areas in computer vision. Various research studies indicate that the performance of action recognition is highly dependent on the type of features being extracted and how the actions are represented. Since the release of the Kinect camera, a large number of Kinect-based human action recognition techniques have been proposed in the literature. However, there still does not exist a thorough comparison of these Kinect-based techniques under the grouping of feature types, such as handcrafted versus deep learning features and depth-based versus skeleton-based features. In this paper, we analyze and compare ten recent Kinect-based algorithms for both cross-subject action recognition and cross-view action recognition using six benchmark datasets. In addition, we have implemented and improved some of these techniques and included their variants in the comparison. Our experiments show that the majority of methods perform better on cross-subject action recognition than cross-view action recognition, that skeleton-based features are more robust for cross-view recognition than depth-based features, and that deep learning features are suitable for large datasets.

* Accepted by the IEEE Transactions on Image Processing 

  Access Model/Code and Paper
Hallucinating Bag-of-Words and Fisher Vector IDT terms for CNN-based Action Recognition

Jun 13, 2019
Lei Wang, Piotr Koniusz, Du Q. Huynh

In this paper, we revive the use of old-fashioned handcrafted video representations and put new life into these techniques via a CNN-based hallucination step. Specifically, we address the problem of action classification in videos via an I3D network pre-trained on the large scale Kinetics-400 dataset. Despite of the use of RGB and optical flow frames, the I3D model (amongst others) thrives on combining its output with the Improved Dense Trajectory (IDT) and extracted with it low-level video descriptors encoded via Bag-of-Words (BoW) and Fisher Vectors (FV). Such a fusion of CNNs and hand crafted representations is time-consuming due to various pre-processing steps, descriptor extraction, encoding and fine-tuning of the model. In this paper, we propose an end-to-end trainable network with streams which learn the IDT-based BoW/FV representations at the training stage and are simple to integrate with the I3D model. Specifically, each stream takes I3D feature maps ahead of the last 1D conv. layer and learns to `translate' these maps to BoW/FV representations. Thus, our enhanced I3D model can hallucinate and use such synthesized BoW/FV representations at the testing stage. We demonstrate simplicity/usefulness of our model on three publicly available datasets and we show state-of-the-art results.

* First two authors contributed equally 

  Access Model/Code and Paper
Second-order Democratic Aggregation

Aug 22, 2018
Tsung-Yu Lin, Subhransu Maji, Piotr Koniusz

Aggregated second-order features extracted from deep convolutional networks have been shown to be effective for texture generation, fine-grained recognition, material classification, and scene understanding. In this paper, we study a class of orderless aggregation functions designed to minimize interference or equalize contributions in the context of second-order features and we show that they can be computed just as efficiently as their first-order counterparts and they have favorable properties over aggregation by summation. Another line of work has shown that matrix power normalization after aggregation can significantly improve the generalization of second-order representations. We show that matrix power normalization implicitly equalizes contributions during aggregation thus establishing a connection between matrix normalization techniques and prior work on minimizing interference. Based on the analysis we present {\gamma}-democratic aggregators that interpolate between sum ({\gamma}=1) and democratic pooling ({\gamma}=0) outperforming both on several classification tasks. Moreover, unlike power normalization, the {\gamma}-democratic aggregations can be computed in a low dimensional space by sketching that allows the use of very high-dimensional second-order features. This results in a state-of-the-art performance on several datasets.


  Access Model/Code and Paper
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss

Feb 10, 2020
Xin Yu, Zheyu Zhuang, Piotr Koniusz, Hongdong Li

Estimating a 6DOF object pose from a single image is very challenging due to occlusions or textureless appearances. Vector-field based keypoint voting has demonstrated its effectiveness and superiority on tackling those issues. However, direct regression of vector-fields neglects that the distances between pixels and keypoints also affect the deviations of hypotheses dramatically. In other words, small errors in direction vectors may generate severely deviated hypotheses when pixels are far away from a keypoint. In this paper, we aim to reduce such errors by incorporating the distances between pixels and keypoints into our objective. To this end, we develop a simple yet effective differentiable proxy voting loss (DPVL) which mimics the hypothesis selection in the voting procedure. By exploiting our voting loss, we are able to train our network in an end-to-end manner. Experiments on widely used datasets, i.e. LINEMOD and Occlusion LINEMOD, manifest that our DPVL improves pose estimation performance significantly and speeds up the training convergence.


  Access Model/Code and Paper
Few-shot Learning with Multi-scale Self-supervision

Jan 06, 2020
Hongguang Zhang, Philip H. S. Torr, Piotr Koniusz

Learning concepts from the limited number of datapoints is a challenging task usually addressed by the so-called one- or few-shot learning. Recently, an application of second-order pooling in few-shot learning demonstrated its superior performance due to the aggregation step handling varying image resolutions without the need of modifying CNNs to fit to specific image sizes, yet capturing highly descriptive co-occurrences. However, using a single resolution per image (even if the resolution varies across a dataset) is suboptimal as the importance of image contents varies across the coarse-to-fine levels depending on the object and its class label e. g., generic objects and scenes rely on their global appearance while fine-grained objects rely more on their localized texture patterns. Multi-scale representations are popular in image deblurring, super-resolution and image recognition but they have not been investigated in few-shot learning due to its relational nature complicating the use of standard techniques. In this paper, we propose a novel multi-scale relation network based on the properties of second-order pooling to estimate image relations in few-shot setting. To optimize the model, we leverage a scale selector to re-weight scale-wise representations based on their second-order features. Furthermore, we propose to a apply self-supervised scale prediction. Specifically, we leverage an extra discriminator to predict the scale labels and the scale discrepancy between pairs of images. Our model achieves state-of-the-art results on standard few-shot learning datasets.


  Access Model/Code and Paper
Deep Stacked Hierarchical Multi-patch Network for Image Deblurring

Apr 06, 2019
Hongguang Zhang, Yuchao Dai, Hongdong Li, Piotr Koniusz

Despite deep end-to-end learning methods have shown their superiority in removing non-uniform motion blur, there still exist major challenges with the current multi-scale and scale-recurrent models: 1) Deconvolution/upsampling operations in the coarse-to-fine scheme result in expensive runtime; 2) Simply increasing the model depth with finer-scale levels cannot improve the quality of deblurring. To tackle the above problems, we present a deep hierarchical multi-patch network inspired by Spatial Pyramid Matching to deal with blurry images via a fine-to-coarse hierarchical representation. To deal with the performance saturation w.r.t. depth, we propose a stacked version of our multi-patch model. Our proposed basic multi-patch model achieves the state-of-the-art performance on the GoPro dataset while enjoying a 40x faster runtime compared to current multi-scale methods. With 30ms to process an image at 1280x720 resolution, it is the first real-time deep motion deblurring model for 720p images at 30fps. For stacked networks, significant improvements (over 1.2dB) are achieved on the GoPro dataset by increasing the network depth. Moreover, by varying the depth of the stacked model, one can adapt the performance and runtime of the same network for different application scenarios.

* IEEE Conference on Computer Vision and Pattern Recognition 2019 

  Access Model/Code and Paper
Face Destylization

Feb 05, 2018
Fatemeh Shiri, Xin Yu, Fatih Porikli, Piotr Koniusz

Numerous style transfer methods which produce artistic styles of portraits have been proposed to date. However, the inverse problem of converting the stylized portraits back into realistic faces is yet to be investigated thoroughly. Reverting an artistic portrait to its original photo-realistic face image has potential to facilitate human perception and identity analysis. In this paper, we propose a novel Face Destylization Neural Network (FDNN) to restore the latent photo-realistic faces from the stylized ones. We develop a Style Removal Network composed of convolutional, fully-connected and deconvolutional layers. The convolutional layers are designed to extract facial components from stylized face images. Consecutively, the fully-connected layer transfers the extracted feature maps of stylized images into the corresponding feature maps of real faces and the deconvolutional layers generate real faces from the transferred feature maps. To enforce the destylized faces to be similar to authentic face images, we employ a discriminative network, which consists of convolutional and fully connected layers. We demonstrate the effectiveness of our network by conducting experiments on an extensive set of synthetic images. Furthermore, we illustrate our network can recover faces from stylized portraits and real paintings for which the stylized data was unavailable during the training phase.


  Access Model/Code and Paper