Models, code, and papers for "Hadi Kazemi":
In this paper, we address the problem of face hallucination by proposing a novel multi-scale generative adversarial network (GAN) architecture optimized for face verification. First, we propose a multi-scale generator architecture for face hallucination with a high up-scaling ratio factor, which has multiple intermediate outputs at different resolutions. The intermediate outputs have the growing goal of synthesizing small to large images. Second, we incorporate a face verifier with the original GAN discriminator and propose a novel discriminator which learns to discriminate different identities while distinguishing fake generated HR face images from their ground truth images. In particular, the learned generator cares for not only the visual quality of hallucinated face images but also preserving the discriminative features in the hallucination process. In addition, to capture perceptually relevant differences we employ a perceptual similarity loss, instead of similarity in pixel space. We perform a quantitative and qualitative evaluation of our framework on the LFW and CelebA datasets. The experimental results show the advantages of our proposed method against the state-of-the-art methods on the 8x downsampled testing dataset.
The objective for this work is to develop a data-driven proxy to high-fidelity numerical flow simulations using digital images. The proposed model can capture the flow field and permeability in a large verity of digital porous media based on solid grain geometry and pore size distribution by detailed analyses of the local pore geometry and the local flow fields. To develop the model, the detailed pore space geometry and simulation runs data from 3500 two-dimensional high-fidelity Lattice Boltzmann simulation runs are used to train and to predict the solutions with a high accuracy in much less computational time. The proposed methodology harness the enormous amount of generated data from high-fidelity flow simulations to decode the often under-utilized patterns in simulations and to accurately predict solutions to new cases. The developed model can truly capture the physics of the problem and enhance prediction capabilities of the simulations at a much lower cost. These predictive models, in essence, do not spatio-temporally reduce the order of the problem. They, however, possess the same numerical resolutions as their Lattice Boltzmann simulations equivalents do with the great advantage that their solutions can be achieved by significant reduction in computational costs (speed and memory).
Face sketch-photo synthesis is a critical application in law enforcement and digital entertainment industry where the goal is to learn the mapping between a face sketch image and its corresponding photo-realistic image. However, the limited number of paired sketch-photo training data usually prevents the current frameworks to learn a robust mapping between the geometry of sketches and their matching photo-realistic images. Consequently, in this work, we present an approach for learning to synthesize a photo-realistic image from a face sketch in an unsupervised fashion. In contrast to current unsupervised image-to-image translation techniques, our framework leverages a novel perceptual discriminator to learn the geometry of human face. Learning facial prior information empowers the network to remove the geometrical artifacts in the face sketch. We demonstrate that a simultaneous optimization of the face photo generator network, employing the proposed perceptual discriminator in combination with a texture-wise discriminator, results in a significant improvement in quality and recognition rate of the synthesized photos. We evaluate the proposed network by conducting extensive experiments on multiple baseline sketch-photo datasets.
Disentangling factors of variation within data has become a very challenging problem for image generation tasks. Current frameworks for training a Generative Adversarial Network (GAN), learn to disentangle the representations of the data in an unsupervised fashion and capture the most significant factors of the data variations. However, these approaches ignore the principle of content and style disentanglement in image generation, which means their learned latent code may alter the content and style of the generated images at the same time. This paper describes the Style and Content Disentangled GAN (SC-GAN), a new unsupervised algorithm for training GANs that learns disentangled style and content representations of the data. We assume that the representation of an image can be decomposed into a content code that represents the geometrical information of the data, and a style code that captures textural properties. Consequently, by fixing the style portion of the latent representation, we can generate diverse images in a particular style. Reversely, we can set the content code and generate a specific scene in a variety of styles. The proposed SC-GAN has two components: a content code which is the input to the generator, and a style code which modifies the scene style through modification of the Adaptive Instance Normalization (AdaIN) layers' parameters. We evaluate the proposed SC-GAN framework on a set of baseline datasets.
In this paper, we present a deep coupled learning frame- work to address the problem of matching polarimetric ther- mal face photos against a gallery of visible faces. Polariza- tion state information of thermal faces provides the miss- ing textural and geometrics details in the thermal face im- agery which exist in visible spectrum. we propose a coupled deep neural network architecture which leverages relatively large visible and thermal datasets to overcome the problem of overfitting and eventually we train it by a polarimetric thermal face dataset which is the first of its kind. The pro- posed architecture is able to make full use of the polari- metric thermal information to train a deep model compared to the conventional shallow thermal-to-visible face recogni- tion methods. Proposed coupled deep neural network also finds global discriminative features in a nonlinear embed- ding space to relate the polarimetric thermal faces to their corresponding visible faces. The results show the superior- ity of our method compared to the state-of-the-art models in cross thermal-to-visible face recognition algorithms.
Situational awareness in vehicular networks could be substantially improved utilizing reliable trajectory prediction methods. More precise situational awareness, in turn, results in notably better performance of critical safety applications, such as Forward Collision Warning (FCW), as well as comfort applications like Cooperative Adaptive Cruise Control (CACC). Therefore, vehicle trajectory prediction problem needs to be deeply investigated in order to come up with an end to end framework with enough precision required by the safety applications' controllers. This problem has been tackled in the literature using different methods. However, machine learning, which is a promising and emerging field with remarkable potential for time series prediction, has not been explored enough for this purpose. In this paper, a two-layer neural network-based system is developed which predicts the future values of vehicle parameters, such as velocity, acceleration, and yaw rate, in the first layer and then predicts the two-dimensional, i.e. longitudinal and lateral, trajectory points based on the first layer's outputs. The performance of the proposed framework has been evaluated in realistic cut-in scenarios from Safety Pilot Model Deployment (SPMD) dataset and the results show a noticeable improvement in the prediction accuracy in comparison with the kinematics model which is the dominant employed model by the automotive industry. Both ideal and nonideal communication circumstances have been investigated for our system evaluation. For non-ideal case, an estimation step is included in the framework before the parameter prediction block to handle the drawbacks of packet drops or sensor failures and reconstruct the time series of vehicle parameters at a desirable frequency.
In this paper, we propose a deep multimodal fusion network to fuse multiple modalities (face, iris, and fingerprint) for person identification. The proposed deep multimodal fusion algorithm consists of multiple streams of modality-specific Convolutional Neural Networks (CNNs), which are jointly optimized at multiple feature abstraction levels. Multiple features are extracted at several different convolutional layers from each modality-specific CNN for joint feature fusion, optimization, and classification. Features extracted at different convolutional layers of a modality-specific CNN represent the input at several different levels of abstract representations. We demonstrate that an efficient multimodal classification can be accomplished with a significant reduction in the number of network parameters by exploiting these multi-level abstract representations extracted from all the modality-specific CNNs. We demonstrate an increase in multimodal person identification performance by utilizing the proposed multi-level feature abstract representations in our multimodal fusion, rather than using only the features from the last layer of each modality-specific CNNs. We show that our deep multi-modal CNNs with multimodal fusion at several different feature level abstraction can significantly outperform the unimodal representation accuracy. We also demonstrate that the joint optimization of all the modality-specific CNNs excels the score and decision level fusions of independently optimized CNNs.
Face sketches are able to capture the spatial topology of a face while lacking some facial attributes such as race, skin, or hair color. Existing sketch-photo recognition approaches have mostly ignored the importance of facial attributes. In this paper, we propose a new loss function, called attribute-centered loss, to train a Deep Coupled Convolutional Neural Network (DCCNN) for the facial attribute guided sketch to photo matching. Specifically, an attribute-centered loss is proposed which learns several distinct centers, in a shared embedding space, for photos and sketches with different combinations of attributes. The DCCNN simultaneously is trained to map photos and pairs of testified attributes and corresponding forensic sketches around their associated centers, while preserving the spatial topology information. Importantly, the centers learn to keep a relative distance from each other, related to their number of contradictory attributes. Extensive experiments are performed on composite (E-PRIP) and semi-forensic (IIIT-D Semi-forensic) databases. The proposed method significantly outperforms the state-of-the-art.
Unsupervised image-to-image translation is a class of computer vision problems which aims at modeling conditional distribution of images in the target domain, given a set of unpaired images in the source and target domains. An image in the source domain might have multiple representations in the target domain. Therefore, ambiguity in modeling of the conditional distribution arises, specially when the images in the source and target domains come from different modalities. Current approaches mostly rely on simplifying assumptions to map both domains into a shared-latent space. Consequently, they are only able to model the domain-invariant information between the two modalities. These approaches usually fail to model domain-specific information which has no representation in the target domain. In this work, we propose an unsupervised image-to-image translation framework which maximizes a domain-specific variational information bound and learns the target domain-invariant representation of the two domain. The proposed framework makes it possible to map a single source image into multiple images in the target domain, utilizing several target domain-specific codes sampled randomly from the prior distribution, or extracted from reference images.
In this paper, we present a deep coupled framework to address the problem of matching sketch image against a gallery of mugshots. Face sketches have the essential in- formation about the spatial topology and geometric details of faces while missing some important facial attributes such as ethnicity, hair, eye, and skin color. We propose a cou- pled deep neural network architecture which utilizes facial attributes in order to improve the sketch-photo recognition performance. The proposed Attribute-Assisted Deep Con- volutional Neural Network (AADCNN) method exploits the facial attributes and leverages the loss functions from the facial attributes identification and face verification tasks in order to learn rich discriminative features in a common em- bedding subspace. The facial attribute identification task increases the inter-personal variations by pushing apart the embedded features extracted from individuals with differ- ent facial attributes, while the verification task reduces the intra-personal variations by pulling together all the fea- tures that are related to one person. The learned discrim- inative features can be well generalized to new identities not seen in the training data. The proposed architecture is able to make full use of the sketch and complementary fa- cial attribute information to train a deep model compared to the conventional sketch-photo recognition methods. Exten- sive experiments are performed on composite (E-PRIP) and semi-forensic (IIIT-D semi-forensic) datasets. The results show the superiority of our method compared to the state- of-the-art models in sketch-photo recognition algorithms
Elastic distortion of fingerprints has a negative effect on the performance of fingerprint recognition systems. This negative effect brings inconvenience to users in authentication applications. However, in the negative recognition scenario where users may intentionally distort their fingerprints, this can be a serious problem since distortion will prevent recognition system from identifying malicious users. Current methods aimed at addressing this problem still have limitations. They are often not accurate because they estimate distortion parameters based on the ridge frequency map and orientation map of input samples, which are not reliable due to distortion. Secondly, they are not efficient and requiring significant computation time to rectify samples. In this paper, we develop a rectification model based on a Deep Convolutional Neural Network (DCNN) to accurately estimate distortion parameters from the input image. Using a comprehensive database of synthetic distorted samples, the DCNN learns to accurately estimate distortion bases ten times faster than the dictionary search methods used in the previous approaches. Evaluating the proposed method on public databases of distorted samples shows that it can significantly improve the matching performance of distorted samples.
In this paper a novel cross-device text-independent speaker verification architecture is proposed. Majority of the state-of-the-art deep architectures that are used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of the adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they only represent some aspects of short-term acoustic level traits of the speaker's voice. However, the human voice consists of several linguistic levels such as acoustic, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherited shortcomings in spectro-temporal features, we propose to enhance the proposed Siamese convolutional neural network architecture by deploying a multilayer perceptron network to incorporate the prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously. This proposed architecture displays significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification.
Performing recognition tasks using latent fingerprint samples is often challenging for automated identification systems due to poor quality, distortion, and partially missing information from the input samples. We propose a direct latent fingerprint reconstruction model based on conditional generative adversarial networks (cGANs). Two modifications are applied to the cGAN to adapt it for the task of latent fingerprint reconstruction. First, the model is forced to generate three additional maps to the ridge map to ensure that the orientation and frequency information is considered in the generation process, and prevent the model from filling large missing areas and generating erroneous minutiae. Second, a perceptual ID preservation approach is developed to force the generator to preserve the ID information during the reconstruction process. Using a synthetically generated database of latent fingerprints, the deep network learns to predict missing information from the input latent samples. We evaluate the proposed method in combination with two different fingerprint matching algorithms on several publicly available latent fingerprint datasets. We achieved the rank-10 accuracy of 88.02\% on the IIIT-Delhi latent fingerprint database for the task of latent-to-latent matching and rank-50 accuracy of 70.89\% on the IIIT-Delhi MOLF database for the task of latent-to-sensor matching. Experimental results of matching reconstructed samples in both latent-to-sensor and latent-to-latent frameworks indicate that the proposed method significantly increases the matching accuracy of the fingerprint recognition systems for the latent samples.
The generalization and robustness of an electroencephalogram (EEG)-based computer aided diagnostic system are crucial requirements in actual clinical practice. To reach these goals, we propose a new EEG representation that provides a more realistic view of brain functionality by applying multi-instance (MI) framework to consider the non-stationarity of the EEG signal. The non-stationary characteristic of EEG is considered by describing the signal as a bag of relevant and irrelevant concepts. The concepts are provided by a robust representation of homogenous segments of EEG signal using spatial covariance matrices. Due to the nonlinear geometry of the space of covariance matrices, we determine the boundaries of the homogeneous segments based on adaptive segmentation of the signal in a Riemannian framework. Each subject is described as a bag of covariance matrices of homogenous segments and the bag-level discriminative information is used for classification. To evaluate the performance of the proposed approach, we examine it in attention deficit hyperactivity/bipolar mood disorder detection and depression/normal diagnosis applications. Experimental results confirm the superiority of the proposed approach, which is gained due to the robustness of covariance descriptor, the effectiveness of Riemannian geometry, and the benefits of considering the inherent non-stationary nature of the brain.