Research papers and code for "Tomasz Trzcinski":
Classification of human emotions remains an important and challenging task for many computer vision algorithms, especially in the era of humanoid robots which coexist with humans in their everyday life. Currently proposed methods for emotion recognition solve this task using multi-layered convolutional networks that do not explicitly infer any facial features in the classification phase. In this work, we postulate a fundamentally different approach to solve emotion recognition task that relies on incorporating facial landmarks as a part of the classification loss function. To that end, we extend a recently proposed Deep Alignment Network (DAN) with a term related to facial features. Thanks to this simple modification, our model called EmotionalDAN is able to outperform state-of-the-art emotion classification methods on two challenging benchmark dataset by up to 5%. Furthermore, we visualize image regions analyzed by the network when making a decision and the results indicate that our EmotionalDAN model is able to correctly identify facial landmarks responsible for expressing the emotions.

* Under review at Special Issue on Deep Neural Networks for Digital Media Algorithms, Fundamenta Informaticae. arXiv admin note: text overlap with arXiv:1805.00326
Click to Read Paper and Get Code
In this paper we present an efficient method for aggregating binary feature descriptors to allow compact representation of 3D scene model in incremental structure-from-motion and SLAM applications. All feature descriptors linked with one 3D scene point or landmark are represented by a single low-dimensional real-valued vector called a \emph{prototype}. The method allows significant reduction of memory required to store and process feature descriptors in large-scale structure-from-motion applications. An efficient approximate nearest neighbours search methods suited for real-valued descriptors, such as FLANN, can be used on the resulting prototypes to speed up matching processed frames.

Click to Read Paper and Get Code
The work by Gatys et al. [1] recently showed a neural style algorithm that can produce an image in the style of another image. Some further works introduced various improvements regarding generalization, quality and efficiency, but each of them was mostly focused on styles such as paintings, abstract images or photo-realistic style. In this paper, we present a comparison of how state-of-the-art style transfer methods cope with transferring various comic styles on different images. We select different combinations of Adaptive Instance Normalization [11] and Universal Style Transfer [16] models and confront them to find their advantages and disadvantages in terms of qualitative and quantitative analysis. Finally, we present the results of a survey conducted on over 100 people that aims at validating the evaluation results in a real-life application of comic style transfer.

* 10 pages
Click to Read Paper and Get Code
Predicting popularity of social media videos before they are published is a challenging task, mainly due to the complexity of content distribution network as well as the number of factors that play part in this process. As solving this task provides tremendous help for media content creators, many successful methods were proposed to solve this problem with machine learning. In this work, we change the viewpoint and postulate that it is not only the predicted popularity that matters, but also, maybe even more importantly, understanding of how individual parts influence the final popularity score. To that end, we propose to combine the Grad-CAM visualization method with a soft attention mechanism. Our preliminary results show that this approach allows for more intuitive interpretation of the content impact on video popularity, while achieving competitive results in terms of prediction accuracy.

* Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), CVPR 2018 workshop on Visual Understanding of Subjective Attributes of Data (V-USAD)
Click to Read Paper and Get Code
Approximate nearest neighbour (ANN) search is one of the most important problems in computer science fields such as data mining or computer vision. In this paper, we focus on ANN for high-dimensional binary vectors and we propose a simple yet powerful search method that uses Random Binary Search Trees (RBST). We apply our method to a dataset of 1.25M binary local feature descriptors obtained from a real-life image-based localisation system provided by Google as a part of Project Tango. An extensive evaluation of our method against the state-of-the-art variations of Locality Sensitive Hashing (LSH), namely Uniform LSH and Multi-probe LSH, shows the superiority of our method in terms of retrieval precision with performance boost of over 20%

Click to Read Paper and Get Code
Evaluating aesthetic value of digital photographs is a challenging task, mainly due to numerous factors that need to be taken into account and subjective manner of this process. In this paper, we propose to approach this problem using deep convolutional neural networks. Using a dataset of over 1.7 million photos collected from Flickr, we train and evaluate a deep learning model whose goal is to classify input images by analysing their aesthetic value. The result of this work is a publicly available Web-based application that can be used in several real-life applications, e.g. to improve the workflow of professional photographers by pre-selecting the best photos.

Click to Read Paper and Get Code
In this paper we evaluate performance of data-dependent hashing methods on binary data. The goal is to find a hashing method that can effectively produce lower dimensional binary representation of 512-bit FREAK descriptors. A representative sample of recent unsupervised, semi-supervised and supervised hashing methods was experimentally evaluated on large datasets of labelled binary FREAK feature descriptors.

Click to Read Paper and Get Code
In this work, we propose a regression method to predict the popularity of an online video based on temporal and visual cues. Our method uses Support Vector Regression with Gaussian Radial Basis Functions. We show that modelling popularity patterns with this approach provides higher and more stable prediction results, mainly thanks to the non-linearity character of the proposed method as well as its resistance against overfitting. We compare our method with the state of the art on datasets containing over 14,000 videos from YouTube and Facebook. Furthermore, we show that results obtained relying only on the early distribution patterns, can be improved by adding social and visual metadata.

* Transactions on Multimedia, 2017
Click to Read Paper and Get Code
Classification of human emotions remains an important and challenging task for many computer vision algorithms, especially in the era of humanoid robots which coexist with humans in their everyday life. Currently proposed methods for emotion recognition solve this task using multi-layered convolutional networks that do not explicitly infer any facial features in the classification phase. In this work, we postulate a fundamentally different approach to solve emotion recognition task that relies on incorporating facial landmarks as a part of the classification loss function. To that end, we extend a recently proposed Deep Alignment Network (DAN), that achieves state-of-the-art results in the recent facial landmark recognition challenge, with a term related to facial features. Thanks to this simple modification, our model called EmotionalDAN is able to outperform state-of-the-art emotion classification methods on two challenging benchmark dataset by up to 5%.

* CVPRW 2018, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2018
Click to Read Paper and Get Code
In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to the traditional approaches that build their speaker embeddings using manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly on magnitude spectrograms. To compare our approach with the state of the art, we collect and release for the public an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces diarization error rate by a large margin of over 30% with respect to the baseline.

Click to Read Paper and Get Code
In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the state-of-the-art failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge.

* IEEE Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) 2017
Click to Read Paper and Get Code
In this paper, we propose a novel method to incorporate partial evidence in the inference of deep convolutional neural networks. Contrary to the existing methods, which either iteratively modify the input of the network or exploit external label taxonomy to take partial evidence into account, we add separate network modules to the intermediate layers of a pre-trained convolutional network. The goal of those modules is to incorporate additional signal - information about known labels - into the inference procedure and adjust the predicted outputs accordingly. Since the attached "Plugin Networks", have a simple structure consisting of only fully connected layers, we drastically reduce the computational cost of training and inference. At the same time, the proposed architecture allows to propagate the information about known labels directly to the intermediate layers that are trained to intrinsically model correlations between the labels. Extensive evaluation of the proposed method confirms that our Plugin Networks outperform the state-of-the-art in a variety of tasks, including scene categorization and multi-label image annotation.

Click to Read Paper and Get Code
Textual overlays are often used in social media videos as people who watch them without the sound would otherwise miss essential information conveyed in the audio stream. This is why extraction of those overlays can serve as an important meta-data source, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds up on multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, i.e. the text recognition module, is inspired by a convolutional recurrent neural network architecture and we improve its performance using synthetically generated dataset of over 600,000 images with text prepared by authors specifically for this task. We also develop a filtering method that reduces the amount of overlapping text phrases using Levenshtein distance and further boosts system's performance. The final accuracy of our solution reaches over 80A% and is au pair with state-of-the-art methods.

* International Conference on Computer Vision and Graphics (ICCVG) 2018
Click to Read Paper and Get Code
In this paper, we address the problem of popularity prediction of online videos shared in social media. We prove that this challenging task can be approached using recently proposed deep neural network architectures. We cast the popularity prediction problem as a classification task and we aim to solve it using only visual cues extracted from videos. To that end, we propose a new method based on a Long-term Recurrent Convolutional Network (LRCN) that incorporates the sequentiality of the information in the model. Results obtained on a dataset of over 37'000 videos published on Facebook show that using our method leads to over 30% improvement in prediction performance over the traditional shallow approaches and can provide valuable insights for content creators.

Click to Read Paper and Get Code
In the recent years, social media have become one of the main places where creative content is being published and consumed by billions of users. Contrary to traditional media, social media allow the publishers to receive almost instantaneous feedback regarding their creative work at an unprecedented scale. This is a perfect use case for machine learning methods that can use these massive amounts of data to provide content creators with inspirational ideas and constructive criticism of their work. In this work, we present a comprehensive overview of machine learning-empowered tools we developed for video creators at Group Nine Media - one of the major social media companies that creates short-form videos with over three billion views per month. Our main contribution is a set of tools that allow the creators to leverage massive amounts of data to improve their creation process, evaluate their videos before the publication and improve content quality. These applications include an interactive conversational bot that allows access to material archives, a Web-based application for automatic selection of optimal video thumbnail, as well as deep learning methods for optimizing headline and predicting video popularity. Our A/B tests show that deployment of our tools leads to significant increase of average video view count by 12.9%. Our additional contribution is a set of considerations collected during the deployment of those tools that can hel

* 2pages, 6 figures
Click to Read Paper and Get Code
Deep generative architectures provide a way to model not only images, but also complex, 3-dimensional objects, such as point clouds. In this work, we present a novel method to obtain meaningful representations of 3D shapes that can be used for clustering and reconstruction. Contrary to existing methods for 3D point cloud generation that train separate decoupled models for representation learning and generation, our approach is the first end-to-end solution that allows to simultaneously learn a latent space of representation and generate 3D shape out of it. To achieve this goal, we extend a deep Adversarial Autoencoder model (AAE) to accept 3D input and create 3D output. Thanks to our end-to-end training regime, the resulting method called 3D Adversarial Autoencoder (3dAAE) obtains either binary or continuous latent space that covers much wider portion of training data distribution, hence allowing smooth interpolation between the shapes. Finally, our extensive quantitative evaluation shows that 3dAAE provides state-of-the-art results on a set of benchmark tasks.

* 9 pages, 8 figures
Click to Read Paper and Get Code
In this paper, we propose a novel regularization method for Generative Adversarial Networks, which allows the model to learn discriminative yet compact binary representations of image patches (image descriptors). We employ the dimensionality reduction that takes place in the intermediate layers of the discriminator network and train binarized low-dimensional representation of the penultimate layer to mimic the distribution of the higher-dimensional preceding layers. To achieve this, we introduce two loss terms that aim at: (i) reducing the correlation between the dimensions of the binarized low-dimensional representation of the penultimate layer i. e. maximizing joint entropy) and (ii) propagating the relations between the dimensions in the high-dimensional space to the low-dimensional space. We evaluate the resulting binary image descriptors on two challenging applications, image matching and retrieval, and achieve state-of-the-art results.

* Paper accepted to NIPS 2018
Click to Read Paper and Get Code
With the ever decreasing attention span of contemporary Internet users, the title of online content (such as a news article or video) can be a major factor in determining its popularity. To take advantage of this phenomenon, we propose a new method based on a bidirectional Long Short-Term Memory (LSTM) neural network designed to predict the popularity of online content using only its title. We evaluate the proposed architecture on two distinct datasets of news articles and news videos distributed in social media that contain over 40,000 samples in total. On those datasets, our approach improves the performance over traditional shallow approaches by a margin of 15%. Additionally, we show that using pre-trained word vectors in the embedding layer improves the results of LSTM models, especially when the training set is small. To our knowledge, this is the first attempt of applying popularity prediction using only textual information from the title.

Click to Read Paper and Get Code
In this paper, we propose a solution to transform a video into a comics. We approach this task using a neural style algorithm based on Generative Adversarial Networks (GANs). Several recent works in the field of Neural Style Transfer showed that producing an image in the style of another image is feasible. In this paper, we build up on these works and extend the existing set of style transfer use cases with a working application of video comixification. To that end, we train an end-to-end solution that transforms input video into a comics in two stages. In the first stage, we propose a state-of-the-art keyframes extraction algorithm that selects a subset of frames from the video to provide the most comprehensive video context and we filter those frames using image aesthetic estimation engine. In the second stage, the style of selected keyframes is transferred into a comics. To provide the most aesthetically compelling results, we selected the most state-of-the art style transfer solution and based on that implement our own ComixGAN framework. The final contribution of our work is a Web-based working application of video comixification available at http://comixify.ii.pw.edu.pl.

* 14 pages. arXiv admin note: substantial text overlap with arXiv:1809.01726
Click to Read Paper and Get Code
State-of-the-art machine learning algorithms can be fooled by carefully crafted adversarial examples. As such, adversarial examples present a concrete problem in AI safety. In this work we turn the tables and ask the following question: can we harness the power of adversarial examples to prevent malicious adversaries from learning identifying information from data while allowing non-malicious entities to benefit from the utility of the same data? For instance, can we use adversarial examples to anonymize biometric dataset of faces while retaining usefulness of this data for other purposes, such as emotion recognition? To address this question, we propose a simple yet effective method, called Siamese Generative Adversarial Privatizer (SGAP), that exploits the properties of a Siamese neural network to find discriminative features that convey identifying information. When coupled with a generative model, our approach is able to correctly locate and disguise identifying information, while minimally reducing the utility of the privatized dataset. Extensive evaluation on a biometric dataset of fingerprints and cartoon faces confirms usefulness of our simple yet effective method.

* Paper accepted to ACCV 2018 (Asian Conference on Computer Vision)
Click to Read Paper and Get Code