Deep convolutional neural networks take GPU-days of compute time to train on large data sets. Pedestrian detection for self-driving cars requires very low latency. Image recognition for mobile phones is constrained by limited processing resources. The success of convolutional neural networks in these situations is limited by how fast we can compute them. Conventional FFT-based convolution is fast for large filters, but state-of-the-art convolutional neural networks use small, 3x3 filters. We introduce a new class of fast algorithms for convolutional neural networks using Winograd's minimal filtering algorithms. The algorithms compute minimal-complexity convolution over small tiles, which makes them fast with small filters and small batch sizes. We benchmark a GPU implementation of our algorithm with the VGG network and show state-of-the-art throughput at batch sizes from 1 to 64.

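To make the tile-level idea concrete, below is a minimal NumPy sketch of the one-dimensional F(2,3) Winograd transform, which computes two outputs of a 3-tap filter with 4 multiplies instead of the 6 required by direct computation; the transform matrices are the standard minimal-filtering ones. The paper's GPU kernels nest this construction into 2D F(2x2, 3x3) tiles and amortize the transforms over channels and batches, which this sketch omits.

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap filter in 4 multiplies.
# Standard minimal-filtering transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    U = G @ g        # filter transform (precomputable per filter)
    V = BT @ d       # data transform
    M = U * V        # the 4 elementwise multiplies
    return AT @ M    # inverse transform -> 2 outputs

# Check against direct correlation (what CNNs call convolution).
rng = np.random.default_rng(0)
d = rng.standard_normal(4)
g = rng.standard_normal(3)
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f23(d, g), direct)
```

Since U = Gg depends only on the filter, it can be computed once and reused across all tiles, so the per-tile cost is dominated by the four elementwise multiplies.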
Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited-precision inference in recent years, training of neural networks at low bit-widths remains a challenging problem. Here we present the Flexpoint data format, aiming at a complete replacement of the 32-bit floating point format for training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network, and a generative adversarial network, using a simulator implemented with the neon deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint is a promising numerical format for future hardware for training and inference.

* 14 pages, 5 figures, accepted at Neural Information Processing Systems 2017
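As a rough illustration of the shared-exponent idea, here is a simplified Python sketch of Flexpoint-style quantization. It is an assumption-laden stand-in, not the paper's format: the function names (`to_flex`, `from_flex`) are hypothetical, and the shared exponent is derived naively from the current tensor's maximum magnitude, whereas Flexpoint adjusts exponents dynamically from value statistics to anticipate overflows.

```python
import numpy as np

def to_flex(x, mantissa_bits=16):
    """Quantize a float tensor to integers with one shared exponent.

    Simplified stand-in for a Flexpoint-style format: here the
    exponent comes from the current peak magnitude, not from the
    dynamic statistics-based management the paper describes.
    """
    qmax = 2 ** (mantissa_bits - 1) - 1
    peak = np.max(np.abs(x))
    # Smallest power-of-two scale that fits the peak value.
    exp = int(np.ceil(np.log2(peak / qmax))) if peak > 0 else 0
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int16)
    return q, exp

def from_flex(q, exp):
    """Dequantize back to float for inspection."""
    return q.astype(np.float64) * (2.0 ** exp)

x = np.random.default_rng(1).standard_normal(1000) * 3.0
q, exp = to_flex(x)
err = np.max(np.abs(from_flex(q, exp) - x))
print(f"shared exponent: {exp}, max abs error: {err:.2e}")
```

Because every element of the tensor shares one exponent, elementwise arithmetic reduces to integer operations plus occasional exponent bookkeeping, which is the source of the claimed hardware efficiency.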
At least two software packages, DARWIN (Eckerd College) and FinScan (Texas A&M), exist to facilitate the identification of cetaceans (whales, dolphins, porpoises) based upon the naturally occurring features along the edges of their dorsal fins. Such identification is useful for biological studies of population, social interaction, migration, etc. The process whereby fin outlines are extracted in current fin-recognition software packages is manually intensive and represents a major user-input bottleneck: it is both time-consuming and visually fatiguing. This research aims to develop automated methods (employing unsupervised thresholding and morphological processing techniques) to extract cetacean dorsal fin outlines from digital photographs, thereby reducing manual user input. Ideally, automatic outline generation will improve the overall user experience and improve the ability of the software to correctly identify cetaceans. Various transformations from color to grayscale were examined to determine which produced an image in which a suitable threshold could be easily identified. To assist with unsupervised thresholding, a new metric was developed to evaluate the jaggedness of figures ("pixelarity") in an image after thresholding. The metric indicates how cleanly a threshold segments background and foreground elements and hence provides a good measure of the quality of a given threshold. This research results in successful extractions in roughly 93% of images and significantly reduces user-input time.

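A plausible shape for such a pipeline, sketched in Python with SciPy, is shown below. The Otsu criterion stands in for the unsupervised thresholding step, and `jaggedness` is only a guessed proxy for the paper's "pixelarity" metric (whose exact definition is not reproduced in the abstract): it scores boundary pixels per unit of foreground area, so a noisy, ragged thresholding scores high and a clean segmentation scores low.

```python
import numpy as np
from scipy import ndimage as ndi

def otsu_threshold(gray):
    """Unsupervised threshold maximizing between-class variance."""
    hist, edges = np.histogram(gray.ravel(), bins=256)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                    # background weight
    w1 = w0[-1] - w0                        # foreground weight
    m = np.cumsum(hist * centers)
    mu0 = m / np.maximum(w0, 1)
    mu1 = (m[-1] - m) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def jaggedness(mask):
    """Guessed proxy for 'pixelarity': boundary pixels per unit
    of foreground area; lower means a cleaner segmentation."""
    eroded = ndi.binary_erosion(mask)
    perimeter = np.count_nonzero(mask & ~eroded)
    return perimeter / max(np.count_nonzero(mask), 1)

def extract_fin_mask(gray):
    """Threshold, clean with morphology, keep the largest blob."""
    mask = gray < otsu_threshold(gray)      # assumes fin darker than water
    mask = ndi.binary_opening(mask, structure=np.ones((3, 3)))
    labels, n = ndi.label(mask)
    if n == 0:
        return mask
    sizes = ndi.sum(mask, labels, range(1, n + 1))
    return labels == (1 + np.argmax(sizes))
```

In a full system, candidate thresholds from several color-to-grayscale transformations would be compared by their jaggedness scores, and the fin outline would then be traced from the boundary of the winning mask.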
This paper introduces a fast, general method for dictionary-free parameter estimation in quantitative magnetic resonance imaging (QMRI) via regression with kernels (PERK). PERK first uses prior distributions and the nonlinear MR signal model to simulate many parameter-measurement pairs. Inspired by machine learning, PERK then takes these parameter-measurement pairs as labeled training points and learns from them a nonlinear regression function using kernel functions and convex optimization. PERK admits a simple implementation as per-voxel nonlinear lifting of MRI measurements followed by linear minimum mean-squared error regression. We demonstrate PERK for $T_1,T_2$ estimation, a well-studied application where it is simple to compare PERK estimates against dictionary-based grid search estimates. Numerical simulations as well as single-slice phantom and in vivo experiments demonstrate that PERK and grid search produce comparable $T_1,T_2$ estimates in white and gray matter, but PERK is consistently at least $23\times$ faster. This acceleration factor will increase by several orders of magnitude for full-volume QMRI estimation problems involving more latent parameters per voxel.

* submitted to IEEE Transactions on Medical Imaging
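The PERK recipe (simulate labeled pairs from the prior and the signal model, lift the measurements nonlinearly, then solve a linear regression) can be sketched in a few lines of NumPy. Everything model-specific below is an assumption for illustration: a toy mono-exponential $T_2$ decay replaces the actual MR signal model, and the echo times, prior range, kernel bandwidth, and regularizer are made up. Random Fourier features are used as the nonlinear lifting, a standard way to approximate a Gaussian kernel so that kernel regression reduces to linear regression on lifted features.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate labeled training pairs from a prior and a signal model.
# Toy stand-in: mono-exponential T2 decay at a few echo times.
TE = np.array([10.0, 30.0, 60.0, 100.0])      # echo times, ms (assumed)
N = 5000
t2 = rng.uniform(20.0, 200.0, N)              # prior on T2, ms (assumed)
y = np.exp(-TE[None, :] / t2[:, None])
y += 0.01 * rng.standard_normal(y.shape)      # measurement noise

# Nonlinear lifting via random Fourier features.
Z = 300                                       # lifted dimension (assumed)
sigma = 0.5                                   # kernel bandwidth (assumed)
W = rng.standard_normal((y.shape[1], Z)) / sigma
b = rng.uniform(0, 2 * np.pi, Z)

def lift(meas):
    return np.sqrt(2.0 / Z) * np.cos(meas @ W + b)

# Linear (ridge-regularized) MMSE regression on the lifted features.
Phi = lift(y)
Phi_c = Phi - Phi.mean(0)
lam = 1e-6
wts = np.linalg.solve(Phi_c.T @ Phi_c + lam * np.eye(Z),
                      Phi_c.T @ (t2 - t2.mean()))

def perk_estimate(meas):
    """Per-voxel estimation: one lifting plus one dot product."""
    return (lift(meas) - Phi.mean(0)) @ wts + t2.mean()

test_t2 = np.array([40.0, 80.0, 150.0])
test_y = np.exp(-TE[None, :] / test_t2[:, None])
print(perk_estimate(test_y))   # expected to land near [40, 80, 150]
```

This structure explains the reported speed: once the regression weights are trained, each voxel costs only a feature lifting and a dot product, with no per-voxel grid search over a dictionary.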