Deep convolutional neural networks take GPU-days of compute time to train on large data sets. Pedestrian detection for self-driving cars requires very low latency. Image recognition for mobile phones is constrained by limited processing resources. The success of convolutional neural networks in these situations is limited by how fast we can compute them. Conventional FFT-based convolution is fast for large filters, but state-of-the-art convolutional neural networks use small, 3x3 filters. We introduce a new class of fast algorithms for convolutional neural networks based on Winograd's minimal filtering algorithms. The algorithms compute minimal-complexity convolution over small tiles, which makes them fast with small filters and small batch sizes. We benchmark a GPU implementation of our algorithm with the VGG network and show state-of-the-art throughput at batch sizes from 1 to 64.
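Winograd's F(2,3) minimal filtering algorithm is the 1-D building block behind this approach: it computes two outputs of a 3-tap filter with four multiplications instead of six. A minimal sketch of that transform (illustrative only, not the paper's GPU implementation, which tiles 2-D convolutions):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap correlation using
    4 multiplies (the m1..m4 products) instead of the naive 6.
    d: 4 input values, g: 3 filter taps."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # y0 = d0*g0 + d1*g1 + d2*g2,  y1 = d1*g0 + d2*g1 + d3*g2
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The filter-side factors `(g0+g1+g2)/2` and `(g0-g1+g2)/2` depend only on the filter, so in a network they can be precomputed once per filter, amortizing their cost across every tile.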

Generating Long Sequences with Sparse Transformers

Apr 23, 2019

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Transformers are powerful sequence models, but require time and memory that grow quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.
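The quadratic cost comes from every position attending to every earlier position; a strided factorization replaces this with a local pattern and a strided pattern whose union still reaches every earlier position within two attention steps. A sketch of the two boolean masks, assuming a strided pattern with stride on the order of $\sqrt{n}$ (names and layout here are illustrative, not the paper's fused kernels):

```python
import numpy as np

def strided_masks(n, stride):
    """Boolean attention masks for a strided sparse factorization (sketch).
    local:   each position attends to the previous `stride` positions.
    strided: each position attends to positions a multiple of `stride` back.
    Each row has O(stride) nonzeros, so with stride ~ sqrt(n) the total
    work is O(n * sqrt(n)) rather than O(n^2)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i                          # no attending to the future
    local = causal & (i - j < stride)        # recent-window head
    strided = causal & ((i - j) % stride == 0)  # summary-stride head
    return local, strided
```

Splitting the two patterns across alternating layers or heads lets information from any past position propagate to the current one in two hops: first into the nearest strided "summary" position, then locally to the target.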


Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks

Dec 02, 2017

Urs Köster, Tristan J. Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William H. Constable, Oğuz H. Elibol, Scott Gray, Stewart Hall, Luke Hornof, Amir Khosrowshahi, Carey Kloss, Ruby J. Pai, Naveen Rao

Deep neural networks are commonly developed and trained in 32-bit floating point format. Significant gains in performance and energy efficiency could be realized by training and inference in numerical formats optimized for deep learning. Despite advances in limited precision inference in recent years, training of neural networks in low bit-width remains a challenging problem. Here we present the Flexpoint data format, aimed at a complete replacement of 32-bit floating point for both training and inference, designed to support modern deep network topologies without modifications. Flexpoint tensors have a shared exponent that is dynamically adjusted to minimize overflows and maximize available dynamic range. We validate Flexpoint by training AlexNet, a deep residual network and a generative adversarial network, using a simulator implemented with the neon deep learning framework. We demonstrate that 16-bit Flexpoint closely matches 32-bit floating point in training all three models, without any need for tuning of model hyperparameters. Our results suggest Flexpoint as a promising numerical format for future hardware for training and inference.
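The core of the format is a single exponent shared by a whole tensor, chosen so the largest magnitude fits the signed mantissa range. A minimal sketch of such a shared-exponent encode/decode, as one reading of the idea (this is not Nervana's implementation, and it omits Flexpoint's dynamic exponent management across training iterations):

```python
import numpy as np

def to_flex16(x, mant_bits=16):
    """Encode a float tensor with one shared exponent (sketch).
    The exponent is picked so the largest magnitude fits in a signed
    mant_bits-wide integer mantissa; all values then share that scale."""
    max_int = 2 ** (mant_bits - 1) - 1
    amax = np.max(np.abs(x))
    exp = 0 if amax == 0 else int(np.ceil(np.log2(amax / max_int)))
    mant = np.clip(np.round(x / 2.0 ** exp), -max_int, max_int).astype(np.int32)
    return mant, exp

def from_flex16(mant, exp):
    """Decode: mantissas times the shared power-of-two scale."""
    return mant.astype(np.float64) * 2.0 ** exp
```

Because the exponent is per-tensor rather than per-element, the mantissas behave like fixed-point integers inside a tensor, which is what makes cheap integer arithmetic possible in hardware; the cost is that a tensor with a wide spread of magnitudes loses precision in its small values.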

* 14 pages, 5 figures, accepted in Neural Information Processing Systems 2017


Unsupervised Threshold for Automatic Extraction of Dolphin Dorsal Fin Outlines from Digital Photographs in DARWIN (Digital Analysis and Recognition of Whale Images on a Network)

Feb 18, 2012

Scott A. Hale

At least two software packages---DARWIN, Eckerd College, and FinScan, Texas A&M---exist to facilitate the identification of cetaceans---whales, dolphins, porpoises---based upon the naturally occurring features along the edges of their dorsal fins. Such identification is useful for biological studies of population, social interaction, migration, etc. The process whereby fin outlines are extracted in current fin-recognition software packages is manually intensive and represents a major user-input bottleneck: it is both time-consuming and visually fatiguing. This research aims to develop automated methods (employing unsupervised thresholding and morphological processing techniques) to extract cetacean dorsal fin outlines from digital photographs, thereby reducing manual user input. Ideally, automatic outline generation will improve the overall user experience and improve the ability of the software to correctly identify cetaceans. Various transformations from color to gray space were examined to determine which produced a grayscale image in which a suitable threshold could be easily identified. To assist with unsupervised thresholding, a new metric was developed to evaluate the jaggedness of figures ("pixelarity") in an image after thresholding. The metric indicates how cleanly a threshold segments background and foreground elements and hence provides a good measure of the quality of a given threshold. This research results in successful extractions in roughly 93% of images, and significantly reduces user-input time.
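Otsu's method is the classic example of unsupervised thresholding: it picks the gray level that maximizes between-class variance. The paper's "pixelarity" metric is a different criterion, but a sketch of Otsu-style selection illustrates what "unsupervised threshold" means here:

```python
import numpy as np

def otsu_threshold(gray):
    """Unsupervised threshold via Otsu's method (a standard criterion;
    the paper's own 'pixelarity' metric is a different one).
    gray: 2-D array of uint8 intensities.
    Returns the level t maximizing between-class variance, so that
    pixels are split into background (<= t) and foreground (> t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # probability of class 0 up to t
    mu = np.cumsum(p * np.arange(256))    # cumulative mean up to t
    mu_t = mu[-1]                         # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0    # degenerate splits score zero
    return int(np.argmax(sigma_b))
```

Any such criterion scores every candidate threshold on the histogram alone, with no labeled training data; the pixelarity metric instead scores the jaggedness of the resulting binary figure.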


Dictionary-Free MRI PERK: Parameter Estimation via Regression with Kernels

Oct 06, 2017

Gopal Nataraj, Jon-Fredrik Nielsen, Clayton Scott, Jeffrey A. Fessler

This paper introduces a fast, general method for dictionary-free parameter estimation in quantitative magnetic resonance imaging (QMRI) via regression with kernels (PERK). PERK first uses prior distributions and the nonlinear MR signal model to simulate many parameter-measurement pairs. Inspired by machine learning, PERK then takes these parameter-measurement pairs as labeled training points and learns from them a nonlinear regression function using kernel functions and convex optimization. PERK admits a simple implementation as per-voxel nonlinear lifting of MRI measurements followed by linear minimum mean-squared error regression. We demonstrate PERK for $T_1,T_2$ estimation, a well-studied application where it is simple to compare PERK estimates against dictionary-based grid search estimates. Numerical simulations as well as single-slice phantom and in vivo experiments demonstrate that PERK and grid search produce comparable $T_1,T_2$ estimates in white and gray matter, but PERK is consistently at least $23\times$ faster. This acceleration factor will increase by several orders of magnitude for full-volume QMRI estimation problems involving more latent parameters per voxel.
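PERK's learning step can be sketched as kernel regression on simulated (measurement, parameter) pairs. The snippet below uses plain Gaussian-kernel ridge regression as a simplified stand-in for the paper's kernel MMSE estimator; all names, defaults, and the choice of kernel are illustrative:

```python
import numpy as np

def perk_fit(X_train, y_train, bandwidth=0.3, reg=1e-6):
    """Fit a Gaussian-kernel ridge regressor (sketch of PERK's idea).
    X_train: simulated measurements, shape (n, d).
    y_train: the latent parameter that generated each measurement, shape (n,).
    Returns the dual weights alpha solving (K + reg*I) alpha = y."""
    d2 = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    return np.linalg.solve(K + reg * np.eye(len(K)), y_train)

def perk_predict(X_train, alpha, X_new, bandwidth=0.3):
    """Per-voxel estimate: kernel similarities to training points times alpha."""
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)) @ alpha
```

The key contrast with dictionary-based grid search is that the expensive step (building and solving the kernel system) happens once at training time; per-voxel estimation is then a single matrix-vector product rather than a search over a dictionary of signal simulations.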

* submitted to IEEE Transactions on Medical Imaging
