Models, code, and papers for "Alexander Wong":
Much of the focus in the design of deep neural networks has been on improving accuracy, leading to more powerful yet highly complex network architectures that are difficult to deploy in practical scenarios, particularly on edge devices such as mobile and other consumer devices, given their high computational and memory requirements. As a result, there has been recent interest in the design of quantitative metrics for evaluating deep neural networks that account for more than just model accuracy as the sole indicator of network performance. In this study, we continue the conversation towards universal metrics for evaluating the performance of deep neural networks for practical on-device edge usage. In particular, we propose a new balanced metric called NetScore, designed specifically to provide a quantitative assessment of the balance between accuracy, computational complexity, and network architecture complexity of a deep neural network, which is important for on-device edge operation. In what is one of the largest comparative analyses of deep neural networks in the literature, the NetScore metric, the top-1 accuracy metric, and the popular information density metric were compared across a diverse set of 60 different deep convolutional neural networks for image classification on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset. The evaluation results across these three metrics for this diverse set of networks are presented in this study to act as a reference guide for practitioners in the field. The proposed NetScore metric, like the other tested metrics, is by no means perfect, but the hope is to push the conversation towards better universal metrics for evaluating deep neural networks for use in practical on-device edge scenarios and to help guide practitioners in model design for such scenarios.
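For reference, a NetScore-style computation is simple to sketch. The block below follows the balanced form described above: accuracy is rewarded while parameter count and multiply-accumulate (MAC) operations are penalized, combined on a logarithmic scale. The exponent values and unit conventions shown are assumptions for illustration and should be checked against the paper.

```python
import math

def netscore(accuracy, params, macs, alpha=2.0, beta=0.5, gamma=0.5):
    """Balanced network score combining accuracy, parameter count,
    and computational cost (multiply-accumulate operations).

    accuracy : top-1 accuracy in percent (e.g., 71.8)
    params   : number of parameters, in millions (assumed unit)
    macs     : number of MAC operations, in millions (assumed unit)
    """
    return 20.0 * math.log10(accuracy ** alpha / (params ** beta * macs ** gamma))

# Example: a hypothetical network with 71.8% top-1 accuracy,
# 4.2M parameters, and 569M MACs.
print(f"NetScore: {netscore(71.8, 4.2, 569):.1f}")
```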
Modern face recognition systems leverage datasets containing images of hundreds of thousands of specific individuals' faces to train deep convolutional neural networks to learn an embedding space that maps an arbitrary individual's face to a vector representation of their identity. The performance of a face recognition system in face verification (1:1) and face identification (1:N) tasks is directly related to the ability of an embedding space to discriminate between identities. Recently, there has been significant public scrutiny into the source and privacy implications of large-scale face recognition training datasets such as MS-Celeb-1M and MegaFace, as many people are uncomfortable with their face being used to train dual-use technologies that can enable mass surveillance. However, the impact of an individual's inclusion in training data on a derived system's ability to recognize them has not previously been studied. In this work, we audit ArcFace, a state-of-the-art, open source face recognition system, in a large-scale face identification experiment with more than one million distractor images. We find a Rank-1 face identification accuracy of 79.71% for individuals present in the model's training data and an accuracy of 75.73% for those not present. This modest difference in accuracy demonstrates that face recognition systems using deep learning work better for individuals they are trained on, which has serious privacy implications when one considers that all major open source face recognition training datasets were collected without the informed consent of the individuals they contain.
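To make the evaluation protocol concrete, here is a minimal sketch of rank-1 face identification against a distractor-filled gallery, assuming deep embeddings compared by cosine similarity. This is illustrative only, not the audit's actual evaluation code; all inputs are assumed to be NumPy arrays.

```python
import numpy as np

def rank1_accuracy(probe_emb, probe_ids, gallery_emb, gallery_ids):
    """Rank-1 identification: each probe is matched against the full
    gallery (enrolled identities plus distractors); a hit is scored when
    the most similar gallery entry shares the probe's identity."""
    # L2-normalize so the dot product equals cosine similarity.
    probe = probe_emb / np.linalg.norm(probe_emb, axis=1, keepdims=True)
    gallery = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = probe @ gallery.T                      # (n_probes, n_gallery)
    top1 = gallery_ids[np.argmax(sims, axis=1)]   # best match per probe
    return np.mean(top1 == probe_ids)
```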
From fully connected neural networks to convolutional neural networks, the learned parameters within a neural network have been primarily relegated to the linear parameters (e.g., convolutional filters). The non-linear functions (e.g., activation functions) have, with few exceptions in recent years, largely remained parameter-less and static throughout training, and have seen limited variation in design. Largely ignored by the deep learning community, radial basis function (RBF) networks provide an interesting mechanism for learning more complex non-linear activation functions in addition to the linear parameters in a network. However, the interest in RBF networks has waned over time due to the difficulty of integrating RBFs into more complex deep neural network architectures in a tractable and stable manner. In this work, we present a novel approach that enables end-to-end learning of deep RBF networks with fully learnable activation basis functions in an automatic and tractable manner. We demonstrate that our approach for enabling the use of learnable activation basis functions in deep neural networks, which we will refer to as DeepLABNet, is an effective tool for automated activation function learning within complex network architectures.
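As a rough illustration of the general idea of learnable activation basis functions (not DeepLABNet's exact formulation), the PyTorch module below implements an element-wise activation as a learnable sum of Gaussian radial basis functions, with centers, widths, and mixing weights all trained end-to-end alongside the rest of the network:

```python
import torch
import torch.nn as nn

class RBFActivation(nn.Module):
    """Element-wise activation built from a learnable sum of Gaussian
    radial basis functions: f(x) = sum_i w_i * exp(-(x - c_i)^2 / s_i^2)."""
    def __init__(self, num_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2.0, 2.0, num_basis))
        self.log_widths = nn.Parameter(torch.zeros(num_basis))
        self.weights = nn.Parameter(torch.randn(num_basis) * 0.1)

    def forward(self, x):
        # Broadcast inputs against each basis function, then sum.
        d = x.unsqueeze(-1) - self.centers           # (..., num_basis)
        phi = torch.exp(-(d ** 2) / torch.exp(self.log_widths) ** 2)
        return (phi * self.weights).sum(dim=-1)
```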
The ImageNet dataset ushered in a flood of academic and industry interest in deep learning for computer vision applications. Despite its significant impact, there has not been a comprehensive investigation into the demographic attributes of images contained within the dataset. Such a study could lead to new insights on inherent biases within ImageNet, particularly important given it is frequently used to pretrain models for a wide variety of computer vision tasks. In this work, we introduce a model-driven framework for the automatic annotation of apparent age and gender attributes in large-scale image datasets. Using this framework, we conduct the first demographic audit of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) subset of ImageNet and the "person" hierarchical category of ImageNet. We find that 41.62% of faces in ILSVRC appear as female, 1.71% appear as individuals above the age of 60, and males aged 15 to 29 account for the largest subgroup with 27.11%. We note that the presented model-driven framework is not fair for all intersectional groups, so annotation are subject to bias. We present this work as the starting point for future development of unbiased annotation models and for the study of downstream effects of imbalances in the demographics of ImageNet. Code and annotations are available at: http://bit.ly/ImageNetDemoAudit
Researchers are actively trying to gain better insights into the representational properties of convolutional neural networks for guiding better network designs and for interpreting a network's computational nature. Gaining such insights can be an arduous task due to the number of parameters in a network and the complexity of a network's architecture. Current approaches to neural network interpretation include Bayesian probabilistic interpretations and information theoretic interpretations. In this study, we take a different approach to studying convolutional neural networks by proposing an abstract algebraic interpretation using finite transformation semigroup theory. Specifically, convolutional layers are broken up and mapped to a finite space. The state space of the proposed finite transformation semigroup is then defined as a single element within the convolutional layer, with the acting elements defined by surrounding state elements combined with convolution kernel elements. Generators of the finite transformation semigroup are defined to complete the interpretation. We leverage this approach to analyze the basic properties of the resulting finite transformation semigroup to gain insights on the representational properties of convolutional neural networks, including insights into quantized network representation. Such a finite transformation semigroup interpretation can also enable better understanding outside of the confines of fixed lattice data structures, and is thus useful for handling data that lie on irregular lattices. Furthermore, the proposed abstract algebraic interpretation is shown to be viable for interpreting convolutional operations within a variety of convolutional neural network architectures.
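As a generic illustration of the algebraic object involved (not the paper's specific construction from convolutional layers), the snippet below enumerates the finite transformation semigroup generated by a set of maps acting on a small quantized state space:

```python
from itertools import product

def generate_semigroup(generators):
    """Enumerate the transformation semigroup generated by a set of maps
    on a finite state space. Each map is a tuple t where t[i] is the
    image of state i; composition is (s o t)[i] = s[t[i]]."""
    semigroup = set(generators)
    frontier = set(generators)
    while frontier:
        new = set()
        for s, t in product(semigroup, frontier):
            for comp in (tuple(s[i] for i in t), tuple(t[i] for i in s)):
                if comp not in semigroup:
                    new.add(comp)
        semigroup |= new
        frontier = new
    return semigroup

# Two generator maps acting on the quantized state space {0, 1, 2}.
gens = {(1, 2, 0), (0, 0, 1)}
print(len(generate_semigroup(gens)), "elements in the generated semigroup")
```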
Computer vision based technology is becoming ubiquitous in society. One application area that has seen an increase in computer vision is assistive technologies, specifically for those with visual impairment. Research has shown the ability of computer vision models to perform tasks such as scene captioning, object detection, and face recognition. Although assisting individuals with visual impairment with these tasks increases their independence and autonomy, concerns over bias, privacy, and potential usefulness arise. This paper addresses the positive and negative implications that computer vision based assistive technologies have on individuals with visual impairment, as well as considerations for computer vision researchers and developers to mitigate these negative implications.
Recent improvements in object detection have shown potential to aid in tasks that previous solutions were unable to address. A particular area is assistive devices for individuals with visual impairment. While state-of-the-art deep neural networks have been shown to achieve superior object detection performance, their high computational and memory requirements make them cost prohibitive for on-device operation. Alternatively, cloud-based operation raises privacy concerns; neither option is attractive to potential users. To address these challenges, this study investigates creating an efficient object detection network specifically for OLIV, an AI-powered assistant for object localization for the visually impaired, via micro-architecture design exploration. In particular, we formulate the problem of finding an optimal network micro-architecture as a numerical optimization problem, where we find the set of hyperparameters controlling the MobileNetV2-SSD network micro-architecture that maximizes a modified NetScore objective function for the MSCOCO-OLIV dataset of indoor objects. Experimental results show that such a micro-architecture design exploration strategy leads to a compact deep neural network with a balanced trade-off between accuracy, size, and speed, making it well-suited for enabling on-device computer vision driven assistive devices for the visually impaired.
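A micro-architecture design exploration of this kind can be sketched generically as a search over a small hyperparameter space that maximizes a NetScore-like objective. In the sketch below, `evaluate`, the search-space values, and the toy objective are all hypothetical placeholders; in practice `evaluate` would train and score a candidate MobileNetV2-SSD configuration:

```python
import random

def random_search(evaluate, space, trials=50, seed=0):
    """Simple random search over a micro-architecture hyperparameter
    space, keeping the configuration with the best objective value.
    `evaluate` is assumed to train/score a candidate network and
    return its objective (e.g., a modified NetScore)."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space over MobileNetV2-SSD-style knobs.
space = {
    "width_multiplier": [0.35, 0.5, 0.75, 1.0],
    "input_resolution": [160, 192, 224, 300],
    "expansion_ratio": [3, 4, 6],
}

# Toy stand-in objective for demonstration only.
toy = lambda cfg: cfg["width_multiplier"] * cfg["input_resolution"] / cfg["expansion_ratio"]
print(random_search(toy, space, trials=20))
```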
In this study, we propose the Affine Variational Autoencoder (AVAE), a variant of the Variational Autoencoder (VAE) designed to improve robustness by overcoming the inability of VAEs to generalize to distributional shifts in the form of affine perturbations. By optimizing an affine transform to maximize the evidence lower bound (ELBO), the proposed AVAE transforms an input to the training distribution without the need to increase model complexity to model the full distribution of affine transforms. In addition, we introduce a training procedure to create an efficient model by learning a subset of the training distribution and using the AVAE to improve generalization and robustness to distributional shift at test time. Experiments on affine perturbations demonstrate that the proposed AVAE significantly improves generalization and robustness to distributional shift in the form of affine perturbations without an increase in model complexity.
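The core test-time mechanism can be sketched as follows: given a trained VAE, gradient-ascend on the parameters of a 2D affine transform so that the transformed input maximizes the ELBO. The helper `vae_elbo` is an assumed callable returning a scalar ELBO for a batch; this is a minimal sketch, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def optimize_affine(vae_elbo, x, steps=100, lr=0.05):
    """Test-time optimization of a 2D affine transform so that the
    transformed input maximizes the ELBO under a trained VAE.
    x is a batch of images with shape (N, C, H, W)."""
    # Start from the identity transform for each batch element.
    theta = torch.tensor([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]]).repeat(x.size(0), 1, 1)
    theta.requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        grid = F.affine_grid(theta, x.shape, align_corners=False)
        x_t = F.grid_sample(x, grid, align_corners=False)
        loss = -vae_elbo(x_t)          # ascend the ELBO
        opt.zero_grad()
        loss.backward()
        opt.step()
    return theta.detach()
```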
Automated deep neural network architecture design has received a significant amount of recent attention. However, this attention has not been equally shared by one of the fundamental building blocks of a deep neural network, the neurons. In this study, we propose PolyNeuron, a novel automatic neuron discovery approach based on learned polyharmonic spline activations. More specifically, PolyNeuron revolves around learning polyharmonic splines, characterized by a set of control points, that represent the activation functions of the neurons in a deep neural network. A relaxed variant of PolyNeuron, which we term PolyNeuron-R, loosens the constraints imposed by PolyNeuron to reduce the computational complexity for discovering the neuron activation functions in an automated manner. Experiments show both PolyNeuron and PolyNeuron-R lead to networks that have improved or comparable performance on multiple network architectures (LeNet-5 and ResNet-20) using different datasets (MNIST and CIFAR10). As such, automatic neuron discovery approaches such as PolyNeuron are a worthy direction to explore.
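To illustrate the flavor of a control-point-parameterized spline activation (an illustrative stand-in, not PolyNeuron's exact parameterization), the module below implements an element-wise activation as a cubic polyharmonic spline over learnable control points plus a linear term:

```python
import torch
import torch.nn as nn

class PolyharmonicActivation(nn.Module):
    """Element-wise activation parameterized as a 1-D polyharmonic
    spline over learnable control points:
    f(x) = sum_i w_i * phi(|x - c_i|) + a*x + b,
    with phi(r) = r^3 (an odd-order polyharmonic basis)."""
    def __init__(self, num_controls=6):
        super().__init__()
        self.controls = nn.Parameter(torch.linspace(-2.0, 2.0, num_controls))
        self.weights = nn.Parameter(torch.zeros(num_controls))
        self.linear = nn.Parameter(torch.tensor([1.0, 0.0]))  # a, b

    def forward(self, x):
        r = (x.unsqueeze(-1) - self.controls).abs()   # (..., num_controls)
        spline = (self.weights * r.pow(3)).sum(dim=-1)
        return spline + self.linear[0] * x + self.linear[1]
```

Initializing the spline weights to zero makes the activation start as the identity, so training begins from a stable configuration.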
While microscopic analysis of histopathological slides is generally considered the gold standard method for performing cancer diagnosis and grading, the current method for analysis is extremely time consuming and labour intensive, as it requires pathologists to visually inspect tissue samples in a detailed fashion for the presence of cancer. As such, there has been significant recent interest in computer aided diagnosis systems for analysing histopathological slides for cancer grading to aid pathologists to perform cancer diagnosis and grading in a more efficient, accurate, and consistent manner. In this work, we investigate and explore a deep triple-stream residual network (TriResNet) architecture for the purpose of tile-level histopathology grading, which is the critical first step to computer-aided whole-slide histopathology grading. In particular, the design mentality behind the proposed TriResNet network architecture is to facilitate the learning of a more diverse set of quantitative features to better characterize the complex tissue characteristics found in histopathology samples. Experimental results on two widely-used computer-aided histopathology benchmark datasets (CAMELYON16 dataset and Invasive Ductal Carcinoma (IDC) dataset) demonstrated that the proposed TriResNet network architecture was able to achieve noticeably improved accuracies when compared with two other state-of-the-art deep convolutional neural network architectures. Based on these promising results, the hope is that the proposed TriResNet network architecture could become a useful tool for aiding pathologists in increasing the consistency, speed, and accuracy of the histopathology grading process.
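The triple-stream design pattern itself is easy to sketch: three streams process the same tile and their features are fused before classification. The block below is a minimal stand-in with shallow streams; the actual TriResNet streams are deep residual networks.

```python
import torch
import torch.nn as nn

class TriStreamNet(nn.Module):
    """Minimal sketch of a triple-stream design: three independent
    convolutional streams process the same tile, and their pooled
    features are concatenated before classification."""
    def __init__(self, num_classes=2):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.streams = nn.ModuleList([stream() for _ in range(3)])
        self.classifier = nn.Linear(3 * 64, num_classes)

    def forward(self, x):
        feats = torch.cat([s(x) for s in self.streams], dim=1)
        return self.classifier(feats)
```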
There exists a fundamental tradeoff between spectral resolution and efficiency, or throughput, for all optical spectrometers. The primary factors affecting the spectral resolution and throughput of an optical spectrometer are the size of the entrance aperture and the optical power of the focusing element. Thus far, joint optimization of these factors has proven difficult. Here, we introduce the concept of high-throughput computational slits (HTCS), a numerical technique for improving both the effective spectral resolution and efficiency of a spectrometer. The proposed HTCS approach was experimentally validated using an optical spectrometer configured with a 200 µm entrance aperture (test) and a 50 µm entrance aperture (control), demonstrating an improvement in spectral resolution of ~50% over the control and an improvement in efficiency of more than 2x over the efficiency of the largest entrance aperture used in the study, while producing highly accurate spectra.
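While the paper's HTCS technique is its own numerical method, the general idea of computationally recovering resolution lost to a wide entrance aperture can be illustrated with a generic 1-D deconvolution, here Richardson-Lucy iterations, treating the measured spectrum as the true spectrum convolved with a normalized slit broadening function:

```python
import numpy as np

def computational_slit(measured, slit_kernel, iterations=50):
    """Toy illustration only: numerically deconvolve a spectrum measured
    through a wide entrance aperture, modeled as the true spectrum
    convolved with the slit's (normalized) broadening function, using
    Richardson-Lucy iterations. `measured` is a float array."""
    estimate = np.full_like(measured, measured.mean())
    kernel_flipped = slit_kernel[::-1]
    for _ in range(iterations):
        blurred = np.convolve(estimate, slit_kernel, mode="same")
        ratio = measured / np.maximum(blurred, 1e-12)
        estimate *= np.convolve(ratio, kernel_flipped, mode="same")
    return estimate

# Example: a narrow doublet blurred by a wide, box-shaped slit function.
x = np.zeros(200); x[90] = 1.0; x[110] = 0.8
slit = np.ones(15) / 15.0
recovered = computational_slit(np.convolve(x, slit, mode="same"), slit)
```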
Wide-field lensfree on-chip microscopy, which leverages holography principles to capture interferometric light-field encodings without lenses, is an emerging imaging modality with widespread interest given the large field-of-view compared to lens-based techniques. In this study, we introduce the idea of laser light-field fusion for lensfree on-chip phase contrast nanoscopy, where interferometric laser light-field encodings acquired using an on-chip setup with laser pulsations at different wavelengths are fused to produce marker-free phase contrast images of superior quality with resolving power more than five times below the pixel pitch of the sensor array and more than 40% beyond the diffraction limit. As a proof of concept, we demonstrate, for the first time, a wide-field lensfree on-chip instrument successfully detecting 300 nm particles, resulting in a numerical aperture of 1.1, across a large field-of-view of ~30 mm² without any specialized or intricate sample preparation, or the use of synthetic aperture- or shift-based techniques.
Much of the focus in the area of knowledge distillation has been on distilling knowledge from a larger teacher network to a smaller student network. However, there has been little research on how the concept of distillation can be leveraged to distill the knowledge encapsulated in the training data itself into a reduced form. In this study, we explore the concept of progressive label distillation, where we leverage a series of teacher-student network pairs to progressively generate distilled training data for learning deep neural networks with greatly reduced input dimensions. To investigate the efficacy of the proposed progressive label distillation approach, we experimented with learning a deep limited vocabulary speech recognition network based on generated 500ms input utterances distilled progressively from 1000ms source training data, and demonstrated a significant increase in test accuracy of almost 78% compared to direct learning.
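One step of this kind of cross-length distillation can be sketched as follows, with `teacher`, `student`, and `optimizer` as assumed stand-ins: the teacher scores full-length utterances, and the student, which sees only a cropped segment, is trained against the teacher's temperature-softened labels.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, x_long, optimizer, T=4.0):
    """One step of label distillation across input lengths: the teacher
    scores full-length utterances, and the student, which sees only a
    cropped segment, is trained to match the teacher's soft labels."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x_long) / T, dim=1)
    x_short = x_long[..., : x_long.shape[-1] // 2]   # e.g. 1000ms -> 500ms
    log_probs = F.log_softmax(student(x_short) / T, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the progressive setting described above, this step would be repeated across a series of teacher-student pairs, each further reducing the input dimension.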
While skin cancer is the most diagnosed form of cancer in men and women, with more cases diagnosed each year than all other cancers combined, sufficiently early diagnosis results in very good prognosis and as such makes early detection crucial. While radiomics have shown considerable promise as a powerful diagnostic tool for significantly improving oncological diagnostic accuracy and efficiency, current radiomics-driven methods have largely relied on pre-defined, hand-crafted quantitative features, which can greatly limit the ability to fully characterize the unique cancer phenotypes that distinguish cancer from healthy tissue. Recently, the notion of discovery radiomics was introduced, where a large amount of custom, quantitative radiomic features are directly discovered from the wealth of readily available medical imaging data. In this study, we present a novel discovery radiomics framework for skin cancer detection, where we leverage novel deep multi-column radiomic sequencers for high-throughput discovery and extraction of a large amount of custom radiomic features tailored for characterizing unique skin cancer tissue phenotype. The discovered radiomic sequencer was tested against 9,152 biopsy-proven clinical images comprising different skin cancers such as melanoma and basal cell carcinoma, and demonstrated sensitivity and specificity of 91% and 75%, respectively, thus achieving dermatologist-level performance; it can hence be a powerful tool for assisting general practitioners and dermatologists alike in improving the efficiency, consistency, and accuracy of skin cancer diagnosis.
There has been significant recent interest towards achieving highly efficient deep neural network architectures. A promising paradigm for achieving this is the concept of evolutionary deep intelligence, which attempts to mimic biological evolution processes to synthesize highly-efficient deep neural networks over successive generations. An important aspect of evolutionary deep intelligence is the genetic encoding scheme used to mimic heredity, which can have a significant impact on the quality of offspring deep neural networks. Motivated by the neurobiological phenomenon of synaptic clustering, we introduce a new genetic encoding scheme where synaptic probability is driven towards the formation of a highly sparse set of synaptic clusters. Experimental results for the task of image classification demonstrated that the synthesized offspring networks using this synaptic cluster-driven genetic encoding scheme can achieve state-of-the-art performance while having network architectures that are not only significantly more efficient (with a ~125-fold decrease in synapses for MNIST) compared to the original ancestor network, but also tailored for GPU-accelerated machine learning applications.
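A toy version of the two-level, cluster-driven sampling idea (not the paper's exact genetic encoding) might look like the following, where a synapse can survive into the offspring network only if its whole cluster is first selected:

```python
import numpy as np

def sample_offspring_mask(weights, clusters, cluster_rate=0.7,
                          synapse_rate=0.7, seed=0):
    """Cluster-driven synaptic sampling: a synapse can only survive into
    the offspring if its whole cluster is first selected, encouraging a
    sparse set of surviving synaptic clusters. Probabilities here scale
    with weight magnitude as a stand-in for synaptic strength.
    `weights` and `clusters` are same-shape 1-D arrays."""
    rng = np.random.default_rng(seed)
    strength = np.abs(weights) / np.abs(weights).max()
    mask = np.zeros_like(weights, dtype=bool)
    for c in np.unique(clusters):
        idx = clusters == c
        # Cluster-level gate: probability driven by mean synaptic strength.
        if rng.random() < cluster_rate * strength[idx].mean():
            # Synapse-level sampling within the surviving cluster.
            mask[idx] = rng.random(idx.sum()) < synapse_rate * strength[idx]
    return mask
```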
Evolutionary deep intelligence has recently shown great promise for producing small, powerful deep neural network models via the synthesis of increasingly efficient architectures over successive generations. Despite recent research showing the efficacy of multi-parent evolutionary synthesis, little has been done to directly assess architectural similarity between networks during the synthesis process for improved parent network selection. In this work, we present a preliminary study into quantifying architectural similarity via the percentage overlap of architectural clusters. Results show that networks synthesized using architectural alignment (via gene tagging) maintain higher architectural similarities within each generation, potentially restricting the search space of highly efficient network architectures.
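A percentage-overlap measure between two networks' architectural clusters can be sketched as a simple Jaccard-style overlap of their cluster sets; the paper's exact definition may differ.

```python
def cluster_overlap(clusters_a, clusters_b):
    """Percentage overlap between two networks' architectural clusters,
    measured as the Jaccard overlap of their cluster (gene tag) sets."""
    a, b = set(clusters_a), set(clusters_b)
    return 100.0 * len(a & b) / len(a | b)

print(cluster_overlap({"c1", "c2", "c3"}, {"c2", "c3", "c4"}))  # 50.0
```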
Massive Open Online Courses (MOOCs) are educational programs that are open and accessible to a large number of people through the internet. To facilitate learning, MOOC discussion forums exist where students and instructors communicate questions, answers, and thoughts related to the course. The primary objective of this paper is to investigate tracing discussion forum posts back to course lecture videos and readings using topic analysis. We utilize both unsupervised and supervised variants of Latent Dirichlet Allocation (LDA) to extract topics from course material and classify forum posts. We validate our approach on posts bootstrapped from five Coursera courses and determine that topic models can be used to map student discussion posts back to the underlying course lecture or reading. Labeled LDA outperforms unsupervised Hierarchical Dirichlet Process LDA and base LDA for our traceability task. This research is useful as it provides an automated approach for clustering student discussions by course material, enabling instructors to quickly evaluate student misunderstanding of content and clarify materials accordingly.
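To make the traceability task concrete, here is a minimal sketch using gensim's unsupervised LdaModel (the Labeled LDA variant used in the paper is not part of gensim): posts are traced to the lecture with the most similar inferred topic distribution, on toy tokenized data.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import numpy as np

# Toy course materials and a forum post (already tokenized).
lectures = [
    ["gradient", "descent", "loss", "optimization", "learning", "rate"],
    ["matrix", "vector", "linear", "algebra", "eigenvalue", "basis"],
]
post = ["how", "do", "i", "pick", "the", "learning", "rate",
        "for", "gradient", "descent"]

dictionary = Dictionary(lectures)
corpus = [dictionary.doc2bow(doc) for doc in lectures]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)

def topic_vector(doc):
    """Infer a dense topic-distribution vector for a tokenized document."""
    bow = dictionary.doc2bow(doc)
    return np.array([p for _, p in
                     lda.get_document_topics(bow, minimum_probability=0.0)])

# Trace the post to the lecture with the most similar topic distribution.
post_vec = topic_vector(post)
scores = [float(post_vec @ topic_vector(lec)) for lec in lectures]
print("Post traces to lecture", int(np.argmax(scores)))
```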
While depth cameras and inertial sensors have been frequently leveraged for human action recognition, these sensing modalities are impractical in many scenarios where cost or environmental constraints prohibit their use. As such, there has been recent interest in human action recognition using low-cost, readily available RGB cameras via deep convolutional neural networks. However, many of the deep convolutional neural networks proposed for action recognition thus far have relied heavily on learning global appearance cues directly from imaging data, resulting in highly complex network architectures that are computationally expensive and difficult to train. Motivated to reduce network complexity and achieve higher performance, we introduce the concept of spatio-temporal activation reprojection (STAR). More specifically, we reproject the spatio-temporal activations generated by human pose estimation layers in space and time using a stack of 3D convolutions. Experimental results on UTD-MHAD and J-HMDB demonstrate that an end-to-end architecture based on the proposed STAR framework (which we nickname STAR-Net) is proficient in single-environment and small-scale applications. On UTD-MHAD, STAR-Net outperforms several methods using richer data modalities such as depth and inertial sensors.
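The reprojection idea can be sketched as a small 3D-convolutional head over pose-estimation activation maps stacked across time; the shapes and layer sizes below are illustrative assumptions, not the STAR-Net architecture.

```python
import torch
import torch.nn as nn

class STARHead(nn.Module):
    """Sketch of the reprojection idea: pose-estimation activation maps
    stacked over time, shape (N, J, T, H, W) with J joint channels, are
    processed by a stack of 3D convolutions to produce action logits."""
    def __init__(self, num_joints=16, num_classes=27):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(num_joints, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.fc = nn.Linear(64, num_classes)

    def forward(self, heatmaps):
        return self.fc(self.conv3d(heatmaps))

# Example: batch of 2 clips, 16 joints, 8 frames of 56x56 heatmaps.
logits = STARHead()(torch.randn(2, 16, 8, 56, 56))
```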