Many details about our world are not captured in written records because they are too mundane or too abstract to describe in words. Fortunately, since the invention of the camera, an ever-increasing number of photographs capture much of this otherwise lost information. This plethora of artifacts documenting our "visual culture" is a treasure trove of knowledge as yet untapped by historians. We present a dataset of 37,921 frontal-facing American high school yearbook photos that allow us to use computation to glimpse into the historical visual record too voluminous to be evaluated manually. The collected portraits provide a constant visual frame of reference with varying content. We can therefore use them to consider issues such as a decade's defining style elements, or trends in fashion and social norms over time. We demonstrate that our historical image dataset may be used together with weakly-supervised data-driven techniques to perform scalable historical analysis of large image corpora with minimal human effort, much in the same way that large text corpora together with natural language processing revolutionized historians' workflow. Furthermore, we demonstrate the use of our dataset in dating grayscale portraits using deep learning methods.
The aim of this paper is twofold: first we will use vector space distributional compositional categorical models of meaning to compare the meaning of sentences in Irish and in English (and thus ascertain when a sentence is the translation of another sentence) using the cosine similarity score. Then we shall outline a procedure which translates nouns by understanding their context, using a conceptual space model of cognition. We shall use metrics on the category ConvexRel to determine the distance between concepts (and determine when a noun is the translation of another noun). This paper will focus on applications to Irish, a member of the Gaelic family of languages.
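As a rough illustration of the cosine similarity score mentioned above, the sketch below compares two sentence vectors. The vectors `irish_vec` and `english_vec` are hypothetical placeholders; the abstract does not say how the compositional model produces sentence vectors, so only the scoring step is shown.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(u @ v / denom)

# Hypothetical sentence vectors for an Irish sentence and a candidate English translation.
irish_vec = np.array([0.9, 0.1, 0.4])
english_vec = np.array([0.8, 0.2, 0.5])
score = cosine_similarity(irish_vec, english_vec)
```

A score close to 1 would suggest that the two sentences occupy nearly the same direction in the shared meaning space; any threshold for calling one a translation of the other would have to be chosen separately.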
In many classification problems it is desirable to output well-calibrated probabilities on the different classes. We propose a robust, non-parametric method of calibrating probabilities called SplineCalib that utilizes smoothing splines to determine a calibration function. We demonstrate how applying certain transformations as part of the calibration process can improve performance on problems in deep learning and other domains where the scores tend to be "overconfident". We adapt the approach to multi-class problems and find that better calibration can improve accuracy as well as log-loss by better resolving uncertain cases. Finally, we present a cross-validated approach to calibration which conserves data. Significant improvements to log-loss and accuracy are shown on several different problems. We also introduce the ml-insights Python package which contains an implementation of the SplineCalib algorithm.
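To make the cost of "overconfident" scores concrete, the following sketch computes binary log-loss for two constant predictors on a tiny made-up label set. It does not implement SplineCalib or use the ml-insights package; the labels and probabilities are illustrative assumptions only.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Made-up labels: 4 positives out of 5, so the empirical positive rate is 0.8.
y = np.array([1, 1, 1, 1, 0])
overconfident = np.full(5, 0.99)  # claims 99% on every case
calibrated = np.full(5, 0.80)     # matches the empirical rate

loss_overconfident = log_loss(y, overconfident)
loss_calibrated = log_loss(y, calibrated)
```

The overconfident predictor is penalized heavily on the single negative case, which is the kind of loss a calibration function aims to reduce.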
Many regression problems involve not one but several response variables (y's). Often the responses are suspected to share a common underlying structure, in which case it may be advantageous to share information across them; this is known as multitask learning. As a special case, we can use multiple responses to better identify shared predictive features -- a project we might call multitask feature selection. This thesis is organized as follows. Section 1 introduces feature selection for regression, focusing on $\ell_0$ regularization methods and their interpretation within a Minimum Description Length (MDL) framework. Section 2 proposes a novel extension of MDL feature selection to the multitask setting. The approach, called the "Multiple Inclusion Criterion" (MIC), is designed to borrow information across regression tasks by more easily selecting features that are associated with multiple responses. We show in experiments on synthetic and real biological data sets that MIC can reduce prediction error in settings where features are at least partially shared across responses. Section 3 surveys hypothesis testing by regression with a single response, focusing on the parallel between the standard Bonferroni correction and an MDL approach. Mirroring the ideas in Section 2, Section 4 proposes a novel MIC approach to hypothesis testing with multiple responses and shows that on synthetic data with significant sharing of features across responses, MIC sometimes outperforms standard FDR-controlling methods in terms of finding true positives for a given level of false positives. Section 5 concludes.
Machine learning is finding increasingly broad application in the physical sciences. This most often involves building a model relationship between a dependent, measurable output and an associated set of controllable, but complicated, independent inputs. We present a tutorial on current techniques in machine learning -- a jumping-off point for interested researchers to advance their work. We focus on deep neural networks with an emphasis on demystifying deep learning. We begin with background ideas in machine learning and some example applications from current research in plasma physics. We discuss supervised learning techniques for modeling complicated functions, beginning with familiar regression schemes, then advancing to more sophisticated deep learning methods. We also address unsupervised learning and techniques for reducing the dimensionality of input spaces. Along the way, we describe methods for practitioners to help ensure that their models generalize from their training data to as-yet-unseen test data. We describe classes of tasks -- predicting scalars, handling images, fitting time-series -- and prepare the reader to choose an appropriate technique. We finally point out some limitations to modern machine learning and speculate on some ways that practitioners from the physical sciences may be particularly suited to help.
Jakob Bernoulli, working in the late 17th century, identified a gap in contemporary probability theory. He cautioned that it was inadequate to specify force of proof (probability of provability) for some kinds of uncertain arguments. After 300 years, this gap remains in present-day probability theory. We present axioms analogous to Kolmogorov's axioms for probability, specifying uncertainty that lies in an argument's inference/implication itself rather than in its premise and conclusion. The axioms focus on arguments spanning two Boolean algebras, but generalize the obligatory: "force of proof of A implies B is the probability of B or not A" in the case that the Boolean algebras are identical. We propose a categorical framework that relies on generalized probabilities (objects) to express uncertainty in premises, to mix with arguments (morphisms) to express uncertainty embedded directly in inference/implication. There is a direct application to Shafer's evidence theory (Dempster-Shafer theory), greatly expanding its scope for applications. Therefore, we can offer this framework not only as an optimal solution to a difficult historical puzzle, but also to advance the frontiers of contemporary artificial intelligence. Keywords: force of proof, probability of provability, Ars Conjectandi, non-additive probabilities, evidence theory.
We present a novel, automatic eye gaze tracking scheme inspired by smooth pursuit eye motion while playing mobile games or watching virtual reality content. Our algorithm continuously calibrates an eye tracking system for a head mounted display. This eliminates the need for an explicit calibration step and automatically compensates for small movements of the headset with respect to the head. The algorithm finds correspondences between corneal motion and screen space motion, and uses these to generate Gaussian Process Regression models. A combination of those models provides a continuous mapping from corneal position to screen space position. Accuracy is nearly as good as that achieved with an explicit calibration step.
The ubiquitous proliferation of online social networks has led to the widescale emergence of relational graphs expressing unique patterns in link formation and descriptive user node features. Matrix Factorization and Completion have become popular methods for Link Prediction due to the low rank nature of mutual node friendship information, and the availability of parallel computer architectures for rapid matrix processing. Current Link Prediction literature has demonstrated vast performance improvement through the utilization of sparsity in addition to the low rank matrix assumption. However, the majority of research has introduced sparsity through the limited L1 or Frobenius norms, instead of considering the more detailed distributions which led to the graph formation and relationship evolution. In particular, social networks have been found to express either Pareto or, as more recently discovered, Log-Normal distributions. Employing the convexity-inducing Lovasz Extension, we demonstrate how incorporating specific degree distribution information can lead to large scale improvements in Matrix Completion based Link Prediction. We introduce Log-Normal Matrix Completion (LNMC), and solve the complex optimization problem by employing Alternating Direction Method of Multipliers. Using data from three popular social networks, our experiments yield up to 5% AUC increase over top-performing non-structured sparsity based methods.
This work studies two interrelated problems - online robust PCA (RPCA) and online low-rank matrix completion (MC). In recent work by Candès et al., RPCA has been defined as a problem of separating a low-rank matrix (true data), $L:=[\ell_1, \ell_2, \dots \ell_{t}, \dots , \ell_{t_{\max}}]$ and a sparse matrix (outliers), $S:=[x_1, x_2, \dots x_{t}, \dots, x_{t_{\max}}]$ from their sum, $M:=L+S$. Our work uses this definition of RPCA. An important application where both these problems occur is in video analytics in trying to separate sparse foregrounds (e.g., moving objects) and slowly changing backgrounds. While there has been a large amount of recent work on both developing and analyzing batch RPCA and batch MC algorithms, the online problem is largely open. In this work, we develop a practical modification of our recently proposed algorithm to solve both the online RPCA and online MC problems. The main contribution of this work is that we obtain correctness results for the proposed algorithms under mild assumptions. The assumptions that we need are: (a) a good estimate of the initial subspace is available (easy to obtain using a short sequence of background-only frames in video surveillance); (b) the $\ell_t$'s obey a "slow subspace change" assumption; (c) the basis vectors for the subspace from which $\ell_t$ is generated are dense (non-sparse); (d) the support of $x_t$ changes by at least a certain amount at least every so often; and (e) algorithm parameters are appropriately set.
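The problem setup $M := L + S$ can be sketched on synthetic data as follows. The dimensions, the outlier magnitude, and the projection step are illustrative assumptions, not the paper's algorithm; the projection only hints at why a good subspace estimate (assumption (a)) is useful, since projecting onto the orthogonal complement of the subspace removes $\ell_t$ and leaves a projected copy of $x_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t_max, r = 20, 50, 2  # hypothetical ambient dimension, number of frames, subspace rank

# Low-rank part: every column l_t lies in an r-dimensional subspace with orthonormal basis P.
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
L = P @ rng.standard_normal((r, t_max))

# Sparse part: every column x_t has two large nonzero entries (outliers).
S = np.zeros((n, t_max))
for t in range(t_max):
    support = rng.choice(n, size=2, replace=False)
    S[support, t] = 10.0

# Observed data matrix.
M = L + S

# Projector onto the orthogonal complement of span(P); it annihilates each l_t.
Phi = np.eye(n) - P @ P.T
residual = Phi @ M  # equals Phi @ S up to floating-point rounding
```

Recovering $x_t$ itself from `residual` would still require a sparse recovery step, which is outside the scope of this sketch.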
Many applications of intelligent systems require reasoning about the mental states of agents in the domain. We may want to reason about an agent's beliefs, including beliefs about other agents; we may also want to reason about an agent's preferences, and how his beliefs and preferences relate to his behavior. We define a probabilistic epistemic logic (PEL) in which belief statements are given a formal semantics, and provide an algorithm for asserting and querying PEL formulas in Bayesian networks. We then show how to reason about an agent's behavior by modeling his decision process as an influence diagram and assuming that he behaves rationally. PEL can then be used for reasoning from an agent's observed actions to conclusions about other aspects of the domain, including unobserved domain variables and the agent's mental states.
Bayesian models offer great flexibility for clustering applications: Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.
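A hard clustering procedure of the kind described above, k-means-like but with a penalty per cluster, might be sketched as follows. This is a minimal reading of the abstract, not the authors' code: the function name `dp_means`, the penalty parameter `lam`, and the rule "open a new cluster when the squared distance to every center exceeds `lam`" are assumptions of this sketch.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Hard clustering that opens a new cluster when a point is farther than
    lam (in squared Euclidean distance) from every existing center."""
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]           # start with one cluster at the global mean
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        new_labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d2 = [float(np.sum((x - c) ** 2)) for c in centers]
            j = int(np.argmin(d2))
            if d2[j] > lam:              # too far from every center: open a new cluster
                centers.append(x.copy())
                j = len(centers) - 1
            new_labels[i] = j
        # Drop clusters that received no points, relabel 0..k-1, and recompute means.
        used = np.unique(new_labels)
        centers = [X[new_labels == j].mean(axis=0) for j in used]
        new_labels = np.searchsorted(used, new_labels)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return np.array(centers), labels

def penalized_objective(X, centers, labels, lam):
    """Sum of squared distances to assigned centers plus lam per cluster."""
    X = np.asarray(X, dtype=float)
    return float(np.sum((X - centers[labels]) ** 2) + lam * len(centers))

# Two well-separated groups of points; the number of clusters is not given in advance.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, labels = dp_means(X, lam=9.0)
```

Larger values of `lam` make new clusters more expensive and so yield fewer clusters; with a very large `lam` the procedure keeps a single cluster at the global mean.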
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets for which PSC often provides state-of-the-art results.
In this paper, we propose a method of improving Convolutional Neural Networks (CNN) by determining the optimal alignment of weights and inputs using dynamic programming. Conventional CNNs convolve learnable shared weights, or filters, across the input data. These filters use an inner product to linearly match the shared weights to a window of the input. However, it is possible that there exists a better alignment of weights. Thus, we propose the use of Dynamic Time Warping (DTW) to dynamically align the weights to optimized input elements. This dynamic alignment is especially useful for time series recognition due to the complexities with temporal distortions, such as varying rates and sequence lengths. We demonstrate the effectiveness of the proposed architecture on the Unipen online handwritten digit and character datasets, the UCI Spoken Arabic Digit dataset, and the UCI Activities of Daily Life dataset.
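For readers unfamiliar with Dynamic Time Warping, the classic dynamic-programming recurrence between two one-dimensional sequences is sketched below. This is plain DTW with an absolute-difference local cost, chosen here for illustration; it is not the paper's layer that aligns CNN weights to input windows.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimum cumulative |a_i - b_j| cost over all monotonic alignments of a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # D[i, j] = cost of the best alignment of a[:i] with b[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# The second sequence repeats one element; DTW absorbs the repetition at zero cost.
d_warped = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Because the alignment may repeat elements of either sequence, two series that differ only in local speed can have zero distance, which is the property that makes DTW attractive for sequences of varying rate and length.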
Covariate shift relaxes the widely-employed independent and identically distributed (IID) assumption by allowing different training and testing input distributions. Unfortunately, common methods for addressing covariate shift by trying to remove the bias between training and testing distributions using importance weighting often provide poor performance guarantees in theory and unreliable predictions with high variance in practice. Recently developed methods that construct a predictor that is inherently robust to the difficulties of learning under covariate shift are restricted to minimizing logloss and can be too conservative when faced with high-dimensional learning tasks. We address these limitations in two ways: by robustly minimizing various loss functions, including non-convex ones, under the testing distribution; and by separately shaping the influence of covariate shift according to different feature-based views of the relationship between input variables and example labels. These generalizations make robust covariate shift prediction applicable to more task scenarios. We demonstrate the benefits on classification under covariate shift tasks.
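The importance-weighting baseline that the abstract criticizes can be sketched as below: each training example is reweighted by the ratio of test to training input density. The Gaussian input densities with known means, and the helper names, are assumptions made for this illustration; in practice the density ratio has to be estimated, and this is not the robust method the paper proposes.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def importance_weights(x, mu_train, mu_test, sigma=1.0):
    """Density ratio p_test(x) / p_train(x) for two Gaussians sharing the same sigma."""
    return gaussian_pdf(x, mu_test, sigma) / gaussian_pdf(x, mu_train, sigma)

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=5000)   # training inputs drawn from N(0, 1)
w = importance_weights(x_train, mu_train=0.0, mu_test=1.0)

# Hypothetical per-example loss; the weighted average estimates its expectation
# under the test input distribution N(1, 1) using only training samples.
losses = (x_train - 1.0) ** 2
weighted_risk = float(np.mean(w * losses))
```

A handful of training points in the right tail receive very large weights, which is one way to see the high-variance behavior the abstract mentions.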
This paper introduces a novel approach to in-painting where the identity of the object to remove or change is preserved and accounted for at inference time: Exemplar GANs (ExGANs). ExGANs are a type of conditional GAN that utilize exemplar information to produce high-quality, personalized in-painting results. We propose using exemplar information in the form of a reference image of the region to in-paint, or a perceptual code describing that object. Unlike previous conditional GAN formulations, this extra information can be inserted at multiple points within the adversarial network, thus increasing its descriptive power. We show that ExGANs can produce photo-realistic personalized in-painting results that are both perceptually and semantically plausible by applying them to the task of closed-to-open eye in-painting in natural pictures. A new benchmark dataset is also introduced for the task of eye in-painting for future comparisons.
Biometrics have a long-held hope of replacing passwords by establishing a non-repudiated identity and providing authentication with convenience. Convenience drives consumers toward biometrics-based access management solutions. Unlike passwords, biometrics cannot be script-injected; however, biometric data is considered highly sensitive due to its personal nature and unique association with users. Biometrics differ from passwords in that compromised passwords may be reset. Compromised biometrics offer no such relief. A compromised biometric offers unlimited risk in privacy (anyone can view the biometric) and authentication (anyone may use the biometric). Standards such as the Biometric Open Protocol Standard (BOPS) (IEEE 2410-2016) provide a detailed mechanism to authenticate biometrics based on pre-enrolled devices and a previous identity by storing the biometric in encrypted form. This paper describes a biometric-agnostic approach that addresses the privacy concerns of biometrics through the implementation of BOPS. Specifically, two novel concepts are introduced. First, a biometric is applied to a neural network to create a feature vector. This neural network alone can be used for one-to-one matching (authentication), but would require a search in linear time for the one-to-many case (identity lookup). The classifying algorithm described in this paper addresses this concern by producing normalized floating-point values for each feature vector. This allows authentication lookup to occur in up to polynomial time, allowing for search in encrypted biometric databases with speed, accuracy and privacy.
Bayesian nonparametrics are a class of probabilistic models in which the model size is inferred from data. A recently developed methodology in this field is small-variance asymptotic analysis, a mathematical technique for deriving learning algorithms that capture much of the flexibility of Bayesian nonparametric inference algorithms, but are simpler to implement and less computationally expensive. Past work on small-variance analysis of Bayesian nonparametric inference algorithms has exclusively considered batch models trained on a single, static dataset, which are incapable of capturing time evolution in the latent structure of the data. This work presents a small-variance analysis of the maximum a posteriori filtering problem for a temporally varying mixture model with a Markov dependence structure, which captures temporally evolving clusters within a dataset. Two clustering algorithms result from the analysis: D-Means, an iterative clustering algorithm for linearly separable, spherical clusters; and SD-Means, a spectral clustering algorithm derived from a kernelized, relaxed version of the clustering problem. Empirical results from experiments demonstrate the advantages of using D-Means and SD-Means over contemporary clustering algorithms, in terms of both computational cost and clustering accuracy.
I suggest an approach that helps online marketers target their gamification elements to users by modifying the order of the list of tasks that they send to users. It is more realistic and flexible as it allows the model to learn more parameters when the online marketers collect more data. The targeting approach is scalable and quick, and it can be used over streaming data.
Contagions such as the spread of popular news stories, or infectious diseases, propagate in cascades over dynamic networks with unobservable topologies. However, "social signals" such as product purchase time, or blog entry timestamps are measurable, and implicitly depend on the underlying topology, making it possible to track it over time. Interestingly, network topologies often "jump" between discrete states that may account for sudden changes in the observed signals. The present paper advocates a switched dynamic structural equation model to capture the topology-dependent cascade evolution, as well as the discrete states driving the underlying topologies. Conditions under which the proposed switched model is identifiable are established. Leveraging the edge sparsity inherent to social networks, a recursive $\ell_1$-norm regularized least-squares estimator is put forth to jointly track the states and network topologies. An efficient first-order proximal-gradient algorithm is developed to solve the resulting optimization problem. Numerical experiments on both synthetic data and real cascades measured over the span of one year are conducted, and test results corroborate the efficacy of the advocated approach.
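The basic building block behind a proximal-gradient solver for $\ell_1$-norm regularized least squares is soft-thresholding. The sketch below shows the standard batch iteration (often called ISTA) for minimizing $\tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$; it is a generic illustration with assumed names `A`, `y`, and `lam`, not the recursive, switched estimator developed in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, n_iter=200):
    """Proximal-gradient iterations for 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step size 1/Lip, where Lip is the squared spectral norm of A
    # (the Lipschitz constant of the gradient of the smooth term).
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                          # gradient of the least-squares term
        x = soft_threshold(x - step * grad, step * lam)   # proximal step for the l1 term
    return x

# With A equal to the identity, the minimizer is the soft-thresholded observation.
x_hat = ista(np.eye(3), np.array([3.0, -1.0, 0.5]), lam=1.0)
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is how the $\ell_1$ penalty promotes sparse edge sets.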
Most real-world networks exhibit community structure, a phenomenon characterized by the existence of node clusters whose intra-edge connectivity is stronger than edge connectivities between nodes belonging to different clusters. In addition to facilitating a better understanding of network behavior, community detection finds many practical applications in diverse settings. Communities in online social networks are indicative of shared functional roles, or affiliation to a common socio-economic status, the knowledge of which is vital for targeted advertisement. In buyer-seller networks, community detection facilitates better product recommendations. Unfortunately, reliability of community assignments is hindered by anomalous user behavior often observed as unfair self-promotion, or "fake" highly-connected accounts created to promote fraud. The present paper advocates a novel approach for jointly tracking communities while detecting such anomalous nodes in time-varying networks. By postulating edge creation as the result of mutual community participation by node pairs, a dynamic factor model with anomalous memberships captured through a sparse outlier matrix is put forth. Efficient tracking algorithms suitable for both online and decentralized operation are developed. Experiments conducted on both synthetic and real network time series successfully unveil underlying communities and anomalous nodes.