Models, code, and papers for "Fatih Porikli":

Sparse Coding on Cascaded Residuals

Nov 07, 2019
Tong Zhang, Fatih Porikli

This paper seeks to combine dictionary learning and hierarchical image representation in a principled way. To let dictionary atoms capture additional information from extended receptive fields and attain improved descriptive capacity, we present a two-pass multi-resolution cascade framework for dictionary learning and sparse coding. The cascade allows collaborative reconstructions at different resolutions using dictionary atoms of the same dimension. Our jointly learned dictionary comprises atoms that adapt to the information available at the coarsest layer, where the support of the atoms reaches its maximum range, and to the residual images, where the supplementary details progressively refine the reconstruction objective. The residual at a layer is computed as the difference between the aggregated reconstructions of the previous layers and the downsampled original image at that layer. Our method generates more flexible and accurate representations using far fewer coefficients. Its computational efficiency stems from encoding at the coarsest resolution, which is tiny, and from encoding the residuals, which are comparatively much sparser. Our extensive experiments on multiple datasets demonstrate that this new method is powerful in image coding, denoising, inpainting, and artifact removal tasks, outperforming state-of-the-art techniques.

* ACCV 2016 
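
To make the cascade concrete, here is a minimal sketch of coarse-to-fine residual coding, assuming a fixed patch dictionary, non-overlapping patches, and OMP via scikit-learn as the sparse solver; the layer count and resampling are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from scipy.ndimage import zoom
from sklearn.decomposition import sparse_encode

def encode_cascade(image, dictionary, n_layers=3, n_nonzero=5):
    """Encode an image as sparse codes of the coarsest layer plus per-layer residuals.
    `dictionary` holds atoms as rows of length patch*patch (same dimension at every layer)."""
    patch = int(np.sqrt(dictionary.shape[1]))
    codes, reconstruction = [], None
    for layer in range(n_layers):
        scale = 2.0 ** (layer - n_layers + 1)              # coarsest resolution first
        target = zoom(image, scale, order=1)               # downsampled original at this layer
        if reconstruction is None:
            signal = target                                # coarsest layer: encode the image itself
        else:
            factors = (target.shape[0] / reconstruction.shape[0],
                       target.shape[1] / reconstruction.shape[1])
            upsampled = zoom(reconstruction, factors, order=1)
            signal = target - upsampled                    # residual w.r.t. aggregated reconstruction
        h, w = (signal.shape[0] // patch) * patch, (signal.shape[1] // patch) * patch
        tiles = signal[:h, :w].reshape(h // patch, patch, w // patch, patch)
        tiles = tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
        code = sparse_encode(tiles, dictionary, algorithm="omp", n_nonzero_coefs=n_nonzero)
        recon = (code @ dictionary).reshape(h // patch, w // patch, patch, patch)
        recon = recon.transpose(0, 2, 1, 3).reshape(h, w)
        approx = np.zeros_like(signal)
        approx[:h, :w] = recon
        reconstruction = approx if reconstruction is None else upsampled + approx
        codes.append(code)                                 # residual codes are increasingly sparse
    return codes, reconstruction

# Example with a random (hypothetical) dictionary of unit-norm 8x8 atoms.
atoms = np.random.randn(128, 64)
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
codes, recon = encode_cascade(np.random.rand(128, 128), atoms)
```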

Multipartite Pooling for Deep Convolutional Neural Networks

Oct 20, 2017
Arash Shahriari, Fatih Porikli

We propose a novel pooling strategy that learns how to adaptively rank deep convolutional features in order to select more informative representations. To this end, we exploit discriminative analysis to project the features onto a space spanned by the number of classes in the dataset under study. This maps the notion of labels in the feature space to instances in the projected space. We employ the distances in this projected space as a measure to rank the existing features with respect to their discriminant power for each individual class. We then apply multipartite ranking to score the separability of the instances and aggregate the one-versus-all scores into an overall distinction score for each feature. For the pooling, we pick the features with the highest scores in a pooling window instead of using maximum, average, or stochastic random assignments. Our experiments on various benchmarks confirm that the proposed multipartite pooling strategy consistently improves the performance of deep convolutional networks through better generalization of the trained models to test-time data.
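
As a toy illustration of the selection step, the sketch below pools a feature map by picking, in each window, the activation at the position with the highest precomputed score; the per-position scores are placeholders for the discriminant projection and multipartite ranking described above.

```python
import numpy as np

def score_guided_pool(feature_map, scores, k=2):
    """Pool an HxWxC map by selecting, per kxk window, the activation whose
    position carries the highest score (instead of the max or mean activation)."""
    h, w, c = feature_map.shape
    h, w = (h // k) * k, (w // k) * k
    fm = feature_map[:h, :w].reshape(h // k, k, w // k, k, c)
    fm = fm.transpose(0, 2, 1, 3, 4).reshape(h // k, w // k, k * k, c)
    sc = scores[:h, :w].reshape(h // k, k, w // k, k)
    sc = sc.transpose(0, 2, 1, 3).reshape(h // k, w // k, k * k)
    best = sc.argmax(axis=-1)                                   # best-scored position per window
    return np.take_along_axis(fm, best[..., None, None], axis=2).squeeze(2)

# Placeholder scores; in the paper they come from class-wise discriminant analysis.
fmap = np.random.rand(8, 8, 32).astype(np.float32)
scores = np.random.rand(8, 8).astype(np.float32)
print(score_guided_pool(fmap, scores).shape)                    # (4, 4, 32)
```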


DeepTrack: Learning Discriminative Feature Representations Online for Robust Visual Tracking

Feb 28, 2015
Hanxi Li, Yi Li, Fatih Porikli

Deep neural networks, despite their great success at feature learning in various computer vision tasks, are usually considered impractical for online visual tracking because they require very long training times and a large number of training samples. In this work, we present an efficient and very robust tracking algorithm that uses a single Convolutional Neural Network (CNN) to learn effective feature representations of the target object in a purely online manner. Our contributions are threefold: First, we introduce a novel truncated structural loss function that maintains as many training samples as possible and reduces the risk of tracking error accumulation. Second, we enhance the ordinary Stochastic Gradient Descent approach in CNN training with a robust sample selection mechanism. This mechanism randomly draws positive and negative samples from different temporal distributions, which are constructed by taking temporal relations and label noise into account. Finally, a lazy yet effective updating scheme is designed for CNN training. Equipped with this novel updating algorithm, the CNN model is robust to long-standing difficulties in visual tracking, such as occlusions and incorrect detections, without losing its ability to adapt to significant appearance changes. In our experiments, the CNN tracker outperforms all compared state-of-the-art methods on two recently proposed benchmarks that together comprise over 60 video sequences. The remarkable performance improvement over existing trackers illustrates the superiority of the feature representations learned online by our method.

* 12 pages 

Component Attention Guided Face Super-Resolution Network: CAGFace

Oct 19, 2019
Ratheesh Kalarot, Tao Li, Fatih Porikli

To make the best use of the underlying structure of faces, the collective information available across face datasets, and the intermediate estimates produced during the upsampling process, we introduce a fully convolutional multi-stage neural network for 4x super-resolution of face images. We implicitly impose facial component-wise attention maps using a segmentation network to allow our network to focus on face-inherent patterns. Each stage of our network is composed of a stem layer, a residual backbone, and spatial upsampling layers. We apply the stages recurrently to reconstruct an intermediate image, and then reuse its space-to-depth converted versions to bootstrap and progressively enhance image quality. Our experiments show that our face super-resolution method achieves quantitatively superior and perceptually pleasing results in comparison to the state of the art.

* Submitted to WACV 2020 
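
The space-to-depth conversion used to feed an intermediate estimate back into the next stage can be written in a few lines; the block size of 2 and the HxWxC layout are assumptions for illustration.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange an HxWxC array into (H/block)x(W/block)x(block*block*C),
    moving spatial detail into channels so a later stage can reuse it."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h // block, w // block, block * block * c)

# Example: an intermediate 128x128 RGB estimate becomes a 64x64x12 tensor
# that can be fed into the next stage of the network.
stage_output = np.random.rand(128, 128, 3).astype(np.float32)
print(space_to_depth(stage_output).shape)  # (64, 64, 12)
```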

Deep Underwater Image Enhancement

Jul 10, 2018
Saeed Anwar, Chongyi Li, Fatih Porikli

In an underwater scene, wavelength-dependent light absorption and scattering degrade the visibility of images, causing low contrast and distorted color casts. To address this problem, we propose a convolutional neural network based image enhancement model, UWCNN, which is trained efficiently using a synthetic underwater image database. Unlike existing works that require estimating the parameters of an underwater imaging model or impose inflexible frameworks applicable only to specific scenes, our model directly reconstructs the clear latent underwater image by leveraging an automatic, end-to-end, data-driven training mechanism. Consistent with underwater imaging models and the optical properties of underwater scenes, we first synthesize ten different marine image databases. Then, we separately train a UWCNN model for each underwater image formation type. Experimental results on real-world and synthetic underwater images demonstrate that the presented method generalizes well to different underwater scenes and outperforms existing methods both qualitatively and quantitatively. In addition, we conduct an ablation study to demonstrate the effect of each component in our network.

* 12 pages, 11 figures 
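
As a rough sketch of how such synthetic training pairs can be produced, the snippet below applies a simplified underwater image formation model to a clean image and its depth map; the attenuation and ambient-light values are placeholders, not the paper's ten calibrated water types.

```python
import numpy as np

def synthesize_underwater(clean, depth, beta=(0.35, 0.15, 0.05), ambient=(0.1, 0.5, 0.6)):
    """Simplified underwater image formation:
        I(x) = J(x) * t(x) + B * (1 - t(x)),  with  t(x) = exp(-beta * d(x)).
    `clean` is HxWx3 in [0, 1], `depth` is HxW in meters; `beta` (per-channel
    attenuation) and `ambient` (background light B) are illustrative values only."""
    beta = np.asarray(beta, dtype=np.float32)
    ambient = np.asarray(ambient, dtype=np.float32)
    t = np.exp(-depth[..., None] * beta)              # per-channel transmission map
    return clean * t + ambient * (1.0 - t)

# One (clean, synthetic) pair per water type would then form the training data
# for the corresponding UWCNN model.
```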

Local Gradients Smoothing: Defense against localized adversarial attacks

Jul 03, 2018
Muzammal Naseer, Salman Khan, Fatih Porikli

Deep neural networks (DNNs) have shown vulnerability to adversarial attacks, i.e., carefully perturbed inputs designed to mislead the network at inference time. Recently introduced localized attacks, LaVAN and Adversarial Patch, pose a new challenge to deep learning security by adding adversarial noise only within a specific region, without affecting the salient objects in an image. Driven by the observation that such attacks introduce concentrated high-frequency changes at a particular image location, we have developed an effective method that estimates the noise location in the gradient domain and transforms those high-activation regions caused by adversarial noise in the image domain, while having minimal effect on the salient object that is important for correct classification. Our proposed Local Gradients Smoothing (LGS) scheme achieves this by regularizing gradients in the estimated noisy region before feeding the image to the DNN for inference. We show the effectiveness of our method in comparison to other defenses, including JPEG compression, Total Variance Minimization (TVM), and feature squeezing, on the ImageNet dataset. In addition, we systematically study the robustness of the proposed defense mechanism against Backward Pass Differentiable Approximation (BPDA), a state-of-the-art attack recently developed to break defenses that transform an input sample to minimize the adversarial effect. Compared to other defense mechanisms, LGS is by far the most resistant to BPDA in the localized adversarial attack setting.
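
A rough sketch of the idea, not the paper's exact scheme: estimate where the windowed gradient density is unusually high and suppress the high-frequency content only there, leaving the rest of the image untouched. The window size, quantile threshold, and smoothing are illustrative simplifications.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def local_gradients_smoothing(image, window=16, keep_ratio=0.9, sigma=3.0):
    """Suppress high-frequency content in regions with unusually dense gradients.
    `image` is an HxW grayscale array in [0, 1]; thresholding the top (1 - keep_ratio)
    fraction of windowed gradient density is an illustrative choice."""
    gy, gx = np.gradient(image)
    magnitude = np.hypot(gx, gy)
    magnitude /= magnitude.max() + 1e-8
    density = uniform_filter(magnitude, size=window)          # windowed gradient density
    mask = (density > np.quantile(density, keep_ratio)).astype(np.float32)
    mask = gaussian_filter(mask, sigma=2.0)                   # soften the region boundary
    smoothed = gaussian_filter(image, sigma=sigma)
    # keep the original image outside the estimated noisy regions,
    # replace it with a smoothed version inside them
    return (1.0 - mask) * image + mask * smoothed
```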


A Deeper Look at Power Normalizations

Jun 24, 2018
Piotr Koniusz, Hongguang Zhang, Fatih Porikli

Power Normalizations (PN) are useful non-linear operators in the context of Bag-of-Words data representations, as they tackle problems such as feature imbalance. In this paper, we reconsider these operators in the deep learning setup by introducing a novel layer that implements PN for non-linear pooling of feature maps. Specifically, by using a kernel formulation, our layer combines the feature vectors and their respective spatial locations in the feature maps produced by the last convolutional layer of a CNN. Linearization of such a kernel results in a positive definite matrix that captures the second-order statistics of the feature vectors, to which the PN operators are applied. We study two types of PN functions, namely (i) MaxExp and (ii) Gamma, and address their role and meaning in the context of non-linear pooling. We also provide a probabilistic interpretation of these operators and derive their surrogates with well-behaved gradients for end-to-end CNN learning. We put our theory into practice by implementing the PN layer on a ResNet-50 model and showcase experiments on four benchmarks for fine-grained recognition, scene recognition, and material classification. Our results demonstrate state-of-the-art performance across all these tasks.

* IEEE Conference on Computer Vision and Pattern Recognition, 2018 
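
A compact sketch of second-order pooling followed by the two PN functions named above, applied element-wise to a second-order statistic of (assumed non-negative) feature vectors; the scaling to [0, 1] and the parameter values are illustrative.

```python
import numpy as np

def second_order_pn(features, eta=20.0, gamma=0.5, kind="maxexp"):
    """Pool a set of N D-dimensional feature vectors (rows of `features`,
    assumed non-negative, e.g. post-ReLU) into a PN-transformed D x D matrix."""
    m = features.T @ features / features.shape[0]     # second-order statistic
    m = m / (m.max() + 1e-12)                         # scale to [0, 1] (illustrative)
    if kind == "maxexp":
        return 1.0 - (1.0 - m) ** eta                 # MaxExp power normalization
    return np.sign(m) * np.abs(m) ** gamma            # Gamma power normalization

# Usage: a descriptor for an image from its last-conv feature vectors.
feats = np.random.rand(196, 256).astype(np.float32)   # e.g. 14x14 spatial positions
descriptor = second_order_pn(feats, kind="maxexp").ravel()
```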

Selective Video Object Cutout

Mar 22, 2018
Wenguan Wang, Jianbing Shen, Fatih Porikli

Conventional video segmentation approaches rely heavily on appearance models. Such methods often use appearance descriptors that have limited discriminative power under complex scenarios. To improve segmentation performance, this paper presents a pyramid-histogram-based confidence map that incorporates structure information into appearance statistics and combines it with geodesic distance based dynamic models. It then employs an efficient measure of uncertainty propagation using local classifiers to determine the image regions where the object labels might be ambiguous. The final foreground cutout is obtained by refining these uncertain regions. Additionally, to reduce manual labeling, our method determines the frames to be labeled by the human operator in a principled manner, which further boosts segmentation performance and minimizes the labeling effort. Our extensive experimental analyses on two large benchmarks demonstrate that our solution achieves superior performance, favorable computational efficiency, and reduced manual labeling in comparison to the state-of-the-art.

* IEEE Transactions on Image Processing, Vol. 26, No. 12, pp 5645-5655, 2017 
* W. Wang, J. Shen, and F. Porikli. "Selective video object cutout." IEEE Transactions on Image Processing 26.12 (2017): 5645-5655 

Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts

Mar 16, 2018
Shafin Rahman, Salman Khan, Fatih Porikli

Current Zero-Shot Learning (ZSL) approaches are restricted to recognizing a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications, where unseen objects appear only as part of a complex scene, warranting both the recognition and the localization of an unseen category. To address this limitation, we introduce a new 'Zero-Shot Detection' (ZSD) problem setting, which aims to simultaneously recognize and locate object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues such as the rarity of unseen objects. To the best of our knowledge, we propose the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from the recognition to the detection setting. Our extensive experiments show a significant performance boost over this baseline on the imperative yet difficult ZSD problem.


Regularization of Deep Neural Networks with Spectral Dropout

Nov 23, 2017
Salman Khan, Munawar Hayat, Fatih Porikli

The big breakthrough on the ImageNet challenge in 2012 was partially due to the 'dropout' technique used to avoid overfitting. Here, we introduce a new approach called 'Spectral Dropout' to improve the generalization ability of deep neural networks. We cast the proposed approach in the form of regular Convolutional Neural Network (CNN) weight layers using a decorrelation transform with fixed basis functions. Our spectral dropout method prevents overfitting by eliminating weak and 'noisy' Fourier-domain coefficients of the neural network activations, leading to remarkably better results than current regularization methods. Furthermore, the proposed approach is very efficient due to the fixed basis functions used for the spectral transformation. In particular, compared to Dropout and DropConnect, our method significantly speeds up the network convergence rate during training (roughly 2x), with considerably higher neuron pruning rates (an increase of ~30%). We demonstrate that spectral dropout can also be used in conjunction with other regularization approaches, resulting in additional performance gains.
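
A hedged sketch of the core operation: apply a fixed decorrelating transform (here the DCT stands in for the Fourier basis), zero out the weakest coefficients, optionally drop a random subset of the remainder, and invert; the quantile-based threshold is an illustrative choice, not the paper's exact criterion.

```python
import numpy as np
from scipy.fft import dctn, idctn

def spectral_dropout(activations, keep_fraction=0.7, drop_prob=0.0, rng=None):
    """Apply a DCT to a batch of activation maps (N, H, W), suppress the weakest
    coefficients, optionally drop a random subset of the rest, and invert."""
    rng = rng or np.random.default_rng()
    coeffs = dctn(activations, axes=(1, 2), norm="ortho")
    threshold = np.quantile(np.abs(coeffs), 1.0 - keep_fraction, axis=(1, 2), keepdims=True)
    mask = (np.abs(coeffs) >= threshold).astype(coeffs.dtype)
    if drop_prob > 0:
        mask *= rng.random(coeffs.shape) >= drop_prob
    return idctn(coeffs * mask, axes=(1, 2), norm="ortho")

out = spectral_dropout(np.random.randn(8, 14, 14))   # same shape, weak spectra removed
```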


Learning an Invariant Hilbert Space for Domain Adaptation

Apr 17, 2017
Samitha Herath, Mehrtash Harandi, Fatih Porikli

This paper introduces a learning scheme to construct a Hilbert space (i.e., a vector space together with its inner product) to address both unsupervised and semi-supervised domain adaptation problems. This is achieved by learning projections from each domain to a latent space, along with the Mahalanobis metric of that latent space, so as to simultaneously minimize a notion of domain variance while maximizing a measure of discriminatory power. In particular, we make use of Riemannian optimization techniques to match statistical properties (e.g., first- and second-order statistics) between samples projected into the latent space from different domains. When class labels are available, we further require samples sharing the same label to form more compact clusters while pulling away samples coming from different classes. We extensively evaluate and contrast our proposal against state-of-the-art methods for the task of visual domain adaptation using both handcrafted and deep-net features. Our experiments show that, even with a simple nearest neighbor classifier, the proposed method can outperform several state-of-the-art methods that benefit from more involved classification schemes.

* 24 pages, 7 figures 
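
A toy evaluation of the unsupervised part of such an objective: project source and target samples into a shared latent space and measure how far apart their first- and second-order statistics are; the Riemannian optimization over the projections and metric, and the discriminative term, are omitted here.

```python
import numpy as np

def domain_variance(Xs, Xt, Ws, Wt):
    """Discrepancy of first- and second-order statistics after projecting source
    Xs (Ns x Ds) and target Xt (Nt x Dt) into a shared latent space via
    Ws (Ds x d) and Wt (Dt x d)."""
    Zs, Zt = Xs @ Ws, Xt @ Wt
    mean_term = np.sum((Zs.mean(axis=0) - Zt.mean(axis=0)) ** 2)
    cov_term = np.linalg.norm(np.cov(Zs, rowvar=False) - np.cov(Zt, rowvar=False), "fro") ** 2
    return mean_term + cov_term

# In the paper this quantity is minimized over the projections and metric while a
# discriminative term keeps classes separated; here it is only evaluated once.
Xs, Xt = np.random.randn(100, 50), np.random.randn(80, 40)
Ws, Wt = np.random.randn(50, 10), np.random.randn(40, 10)
print(domain_variance(Xs, Xt, Ws, Wt))
```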

Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors

Apr 11, 2017
Piotr Koniusz, Yusuf Tas, Fatih Porikli

In this paper, we propose an approach to domain adaptation, dubbed Second- or Higher-order Transfer of Knowledge (So-HoT), based on a mixture of alignments of second- or higher-order scatter statistics between the source and target domains. The human ability to learn from few labeled samples is a recurring motivation in the domain adaptation literature. Towards this end, we investigate the supervised target scenario, in which only a few labeled target training samples per category exist. Specifically, we utilize two CNN streams: the source and target networks, fused at the classifier level. Features from the fc7 fully connected layer of each network are used to compute second- or even higher-order scatter tensors, one per network stream per class. As the source and target distributions are somewhat different despite being related, we align the scatters of the two network streams of the same class (within-class scatters) to a desired degree with our bespoke loss, while maintaining good separation of the between-class scatters. We train the entire network in an end-to-end fashion. We provide evaluations on the standard Office benchmark (visual domains), RGB-D combined with Caltech256 (depth-to-RGB transfer), and Pascal VOC2007 combined with the TU Berlin dataset (image-to-sketch transfer). We attain state-of-the-art results.

* CVPR'17 
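
A minimal sketch of the second-order alignment term, assuming per-class fc7 features from the two streams are already collected; higher-order tensors and the full two-stream training loop are omitted.

```python
import numpy as np

def scatter(features):
    """Second-order scatter of N x D features (centered outer-product average)."""
    centered = features - features.mean(axis=0, keepdims=True)
    return centered.T @ centered / max(len(features) - 1, 1)

def so_alignment_loss(source_feats, target_feats):
    """Sum over classes of || scatter_s^c - scatter_t^c ||_F^2, given dicts
    mapping class label -> (N_c x D) features from each network stream."""
    loss = 0.0
    for c in source_feats:
        diff = scatter(source_feats[c]) - scatter(target_feats[c])
        loss += np.sum(diff ** 2)
    return loss
```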

Ordered Pooling of Optical Flow Sequences for Action Recognition

Apr 06, 2017
Jue Wang, Anoop Cherian, Fatih Porikli

Training Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation and then fed into the CNN as an input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image capturing pixel dynamics was recently proposed. In this paper, we introduce a novel ordered representation of consecutive optical flow frames as an alternative and argue that this representation captures the action dynamics more effectively than RGB frames. We provide intuitions on why such a representation is better suited for action recognition. We validate our claims on standard benchmark datasets and demonstrate that using summaries of flow images leads to significant improvements over RGB frames while achieving accuracy comparable to the state of the art on the UCF101 and HMDB datasets.

* Accepted in WACV 2017 
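
A sketch of one way to collapse a stack of flow frames into a single ordered summary, using the closed-form approximate rank-pooling weights common in the dynamic-image literature; the paper's exact ordered representation may differ.

```python
import numpy as np

def approximate_rank_pool(frames):
    """Collapse a (T, H, W, C) stack of optical-flow frames into one ordered
    summary image using closed-form approximate rank-pooling coefficients."""
    T = frames.shape[0]
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))  # H_0..H_T
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    return np.tensordot(alpha, frames, axes=(0, 0))   # same spatial size as one frame

# Example: 10 flow frames with 2 channels (u, v) become one CNN input.
flow = np.random.randn(10, 224, 224, 2).astype(np.float32)
print(approximate_rank_pool(flow).shape)  # (224, 224, 2)
```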

Going Deeper into Action Recognition: A Survey

Feb 01, 2017
Samitha Herath, Mehrtash Harandi, Fatih Porikli

Understanding human actions in visual data is tied to advances in complementary research areas, including object recognition, human dynamics, domain adaptation, and semantic segmentation. Over the last decade, human action analysis has evolved from earlier schemes that were often limited to controlled environments to today's advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications, from video surveillance to human-computer interaction, scientific milestones in action recognition are being reached more rapidly, quickly rendering obsolete what was considered state of the art only a short time earlier. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations and then navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable shortcomings, in the hope of raising fresh questions and motivating new research directions for the reader.


Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons (Extended Version)

Jul 28, 2016
Piotr Koniusz, Anoop Cherian, Fatih Porikli

In this paper, we explore tensor representations that can compactly capture higher-order relationships between skeleton joints for 3D action recognition. We first define RBF kernels on 3D joint sequences, which are then linearized to form kernel descriptors. The higher-order outer products of these kernel descriptors form our tensor representations. We present two different kernels for action recognition, namely (i) a sequence compatibility kernel that captures the spatio-temporal compatibility of joints in one sequence against those in another, and (ii) a dynamics compatibility kernel that explicitly models the action dynamics of a sequence. Tensors formed from these kernels are then used to train an SVM. We present experiments on several benchmark datasets and demonstrate state-of-the-art results, substantiating the effectiveness of our representations.
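
A small sketch of the linearization step: approximate an RBF kernel on joint coordinates by evaluating Gaussians at a fixed set of pivots, then form second-order outer products of the resulting descriptors as the tensor feature; pivot placement, bandwidth, and tensor order are illustrative.

```python
import numpy as np

def kernel_descriptor(values, pivots, sigma=0.5):
    """Linearize an RBF kernel on scalar inputs by evaluating Gaussians at pivots:
    k(x, y) ~= phi(x) . phi(y)  with  phi(x)_i = exp(-(x - z_i)^2 / (2 sigma^2))."""
    return np.exp(-(values[..., None] - pivots) ** 2 / (2.0 * sigma ** 2))

def sequence_tensor(joint_positions, pivots):
    """Second-order tensor representation of a (T, J, 3) skeleton sequence:
    outer products of per-coordinate kernel descriptors, averaged over time."""
    phi = kernel_descriptor(joint_positions, pivots)          # (T, J, 3, P)
    phi = phi.reshape(phi.shape[0], -1)                       # flatten per-frame descriptors
    tensor = np.einsum('ti,tj->ij', phi, phi) / phi.shape[0]  # averaged outer products
    return tensor.ravel()                                     # feature vector for an SVM

pivots = np.linspace(-1.0, 1.0, 5)
seq = np.random.randn(30, 15, 3) * 0.3                        # 30 frames, 15 joints
feat = sequence_tensor(seq, pivots)
```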


Beyond Local Search: Tracking Objects Everywhere with Instance-Specific Proposals

May 06, 2016
Gao Zhu, Fatih Porikli, Hongdong Li

Most tracking-by-detection methods employ a local search window around the predicted object location in the current frame, assuming that the previous location is accurate, that the trajectory is smooth, and that the computational capacity permits a search radius large enough to accommodate the maximum speed yet small enough to reduce mismatches. These assumptions, however, may not always hold, in particular for fast and irregularly moving objects. Here, we present an object tracker that is not limited to a local search window and has the ability to probe the entire frame efficiently. Our method generates a small number of "high-quality" proposals using a novel instance-specific objectness measure and evaluates them against the object model, which can be adopted from an existing tracking-by-detection approach as a core tracker. During tracking, we update the object model concentrating on hard false positives supplied by the proposals, which helps suppress distractors caused by difficult background clutter, and we learn how to re-rank proposals according to the object model. Since we significantly reduce the number of hypotheses the core tracker evaluates, we can use richer object descriptors and a stronger detector. Our method outperforms most recent state-of-the-art trackers on popular tracking benchmarks and provides improved robustness for fast-moving objects as well as for ultra-low-frame-rate videos.

* CVPR'16. arXiv admin note: text overlap with arXiv:1507.08085 
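
A bare-bones sketch of the selection step: combine an instance-specific objectness score with the core tracker's appearance score and keep only the top few frame-wide proposals for full evaluation; both score vectors and the weighting are placeholders for the paper's learned measures.

```python
import numpy as np

def select_proposals(proposals, objectness, appearance_scores, top_k=10, alpha=0.5):
    """Re-rank frame-wide candidate boxes (N x 4) by combining an instance-specific
    objectness score with the core tracker's appearance score, then keep the
    top_k for full evaluation. `alpha` and both score vectors are placeholders."""
    combined = alpha * objectness + (1.0 - alpha) * appearance_scores
    order = np.argsort(combined)[::-1][:top_k]
    return proposals[order], combined[order]
```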

Tracking Randomly Moving Objects on Edge Box Proposals

Nov 29, 2015
Gao Zhu, Fatih Porikli, Hongdong Li

Most tracking-by-detection methods employ a local search window around the predicted object location in the current frame, assuming that the previous location is accurate, that the trajectory is smooth, and that the computational capacity permits a search radius large enough to accommodate the maximum speed yet small enough to reduce mismatches. These assumptions, however, may not always hold, in particular for fast and irregularly moving objects. Here, we present an object tracker that is not limited to a local search window and has the ability to probe the entire frame efficiently. Our method generates a small number of "high-quality" proposals using a novel instance-specific objectness measure and evaluates them against the object model, which can be adopted from an existing tracking-by-detection approach as a core tracker. During tracking, we update the object model concentrating on hard false positives supplied by the proposals, which helps suppress distractors caused by difficult background clutter, and we learn how to re-rank proposals according to the object model. Since we significantly reduce the number of hypotheses the core tracker evaluates, we can use richer object descriptors and a stronger detector. Our method outperforms most recent state-of-the-art trackers on popular tracking benchmarks and provides improved robustness for fast-moving objects as well as for ultra-low-frame-rate videos.


When VLAD met Hilbert

Jul 30, 2015
Mehrtash Harandi, Mathieu Salzmann, Fatih Porikli

Vectors of Locally Aggregated Descriptors (VLAD) have emerged as powerful image/video representations that compete with, or even outperform, state-of-the-art approaches on many challenging visual recognition tasks. In this paper, we address two fundamental limitations of VLAD: its requirement that the local descriptors have vector form and its restriction to linear classifiers due to its high dimensionality. To this end, we introduce a kernelized version of VLAD. This not only lets us inherently exploit more sophisticated classification schemes, but also enables us to efficiently aggregate non-vector descriptors (e.g., tensors) in the VLAD framework. Furthermore, we propose three approximate formulations that allow us to accelerate the coding process while still benefiting from the properties of kernel VLAD. Our experiments demonstrate the effectiveness of our approach at handling manifold-valued data, such as covariance descriptors, on several classification tasks. Our results also demonstrate the benefits of our nonlinear VLAD descriptors over their linear counterparts in Euclidean space on several standard benchmark datasets.
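
For context, here is the standard (linear) VLAD encoding that the paper kernelizes, written as a minimal sketch: hard-assign each descriptor to its nearest codeword, accumulate residuals, and normalize.

```python
import numpy as np

def vlad(descriptors, codebook):
    """Standard VLAD: descriptors (N x D), codebook (K x D) -> K*D vector."""
    # hard-assign each descriptor to its nearest codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    encoding = np.zeros_like(codebook)
    for k in range(len(codebook)):
        members = descriptors[assign == k]
        if len(members):
            encoding[k] = (members - codebook[k]).sum(axis=0)   # residual accumulation
    encoding = np.sign(encoding) * np.sqrt(np.abs(encoding))    # power normalization
    flat = encoding.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)                # L2 normalization

descs = np.random.randn(500, 64).astype(np.float32)
codebook = np.random.randn(16, 64).astype(np.float32)
print(vlad(descs, codebook).shape)   # (1024,)
```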


Efficient Upsampling of Natural Images

Feb 28, 2015
Chinmay Hegde, Oncel Tuzel, Fatih Porikli

We propose a novel method of efficient upsampling of a single natural image. Current methods for image upsampling tend to produce high-resolution images with either blurry salient edges, or loss of fine textural detail, or spurious noise artifacts. In our method, we mitigate these effects by modeling the input image as a sum of edge and detail layers, operating upon these layers separately, and merging the upscaled results in an automatic fashion. We formulate the upsampled output image as the solution to a non-convex energy minimization problem, and propose an algorithm to obtain a tractable approximate solution. Our algorithm comprises two main stages. 1) For the edge layer, we use a nonparametric approach by constructing a dictionary of patches from a given image, and synthesize edge regions in a higher-resolution version of the image. 2) For the detail layer, we use a global parametric texture enhancement approach to synthesize detail regions across the image. We demonstrate that our method is able to accurately reproduce sharp edges as well as synthesize photorealistic textures, while avoiding common artifacts such as ringing and haloing. In addition, our method involves no training phase or estimation of model parameters, and is easily parallelizable. We demonstrate the utility of our method on a number of challenging standard test photos.
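
A toy illustration of the two-layer decomposition the method builds on: split the image into a smooth base/edge layer and a detail layer, upscale each separately, and merge; the Gaussian filter, spline zoom, and detail boost stand in for the paper's dictionary-based edge synthesis and parametric texture enhancement.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def upsample_two_layer(image, factor=2, sigma=1.0, detail_boost=1.3):
    """Split `image` (HxW, values in [0, 1]) into base + detail layers,
    upscale each, and merge. The filter and boost values are illustrative."""
    base = gaussian_filter(image, sigma=sigma)        # smooth edge/structure layer
    detail = image - base                             # fine-texture layer
    base_up = zoom(base, factor, order=3)
    detail_up = zoom(detail, factor, order=3) * detail_boost
    return np.clip(base_up + detail_up, 0.0, 1.0)
```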


Bregman Divergences for Infinite Dimensional Covariance Matrices

Jun 11, 2014
Mehrtash Harandi, Mathieu Salzmann, Fatih Porikli

We introduce an approach to computing and comparing Covariance Descriptors (CovDs) in infinite-dimensional spaces. CovDs have become increasingly popular for addressing classification problems in computer vision. While CovDs offer some robustness to measurement variations, they also discard part of the information contained in the original data by retaining only the second-order statistics over the measurements. Here, we propose to overcome this limitation by first mapping the original data to a high-dimensional Hilbert space, and only then computing the CovDs. We show that several Bregman divergences can be computed between the resulting CovDs in Hilbert space via the use of kernels. We then exploit these divergences for classification purposes. Our experiments demonstrate the benefits of our approach on several tasks, such as material and texture recognition, person re-identification, and action recognition from motion capture data.

* IEEE Conference on Computer Vision and Pattern Recognition (2014) 
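
A hedged sketch of the pipeline: approximate the Hilbert-space mapping with explicit random Fourier features (instead of the paper's exact kernel/Gram formulation), compute covariance descriptors there, and compare them with the Burg (log-det) Bregman divergence.

```python
import numpy as np

def rff_covariance(X, W, b):
    """Covariance descriptor of samples X (N x D) after an explicit random Fourier
    feature map approximating an RBF kernel: z(x) = sqrt(2/M) cos(W x + b)."""
    Z = np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)
    Z = Z - Z.mean(axis=0, keepdims=True)
    return Z.T @ Z / max(len(X) - 1, 1) + 1e-6 * np.eye(W.shape[0])  # keep it PD

def burg_divergence(A, B):
    """Burg (log-det) Bregman divergence between SPD matrices A and B."""
    n = A.shape[0]
    M = A @ np.linalg.inv(B)
    return np.trace(M) - np.linalg.slogdet(M)[1] - n

# Two sample sets compared through their kernel-space covariance descriptors.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 10))                 # M=64 random projections, D=10 input dims
b = rng.uniform(0, 2 * np.pi, size=64)
X1, X2 = rng.normal(size=(200, 10)), rng.normal(loc=0.5, size=(150, 10))
print(burg_divergence(rff_covariance(X1, W, b), rff_covariance(X2, W, b)))
```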
