In this work, we build a generic architecture of Convolutional Neural Networks to discover empirical properties of neural networks. Our first contribution is to introduce a state-of-the-art framework that depends upon few hyper parameters and to study the network when we vary them. It has no max pooling, no biases, only 13 layers, is purely convolutional and yields up to 95.4% and 79.6% accuracy respectively on CIFAR10 and CIFAR100. We show that the nonlinearity of a deep network does not need to be continuous, non expansive or point-wise, to achieve good performance. We show that increasing the width of our network permits being competitive with very deep networks. Our second contribution is an analysis of the contraction and separation properties of this network. Indeed, a 1-nearest neighbor classifier applied on deep features progressively improves with depth, which indicates that the representation is progressively more regular. Besides, we defined and analyzed local support vectors that separate classes locally. All our experiments are reproducible and code is available online, based on TensorFlow.

* CVPR 2017, 8 pages
Click to Read Paper
Dictionary learning algorithms or supervised deep convolution networks have considerably improved the efficiency of predefined feature representations such as SIFT. We introduce a deep scattering convolution network, with predefined wavelet filters over spatial and angular variables. This representation brings an important improvement to results previously obtained with predefined features over object image databases such as Caltech and CIFAR. The resulting accuracy is comparable to results obtained with unsupervised deep learning and dictionary based representations. This shows that refining image representations by using geometric priors is a promising direction to improve image classification and its understanding.

* 9 pages, 3 figures, CVPR 2015 paper
Click to Read Paper
A commonly cited inefficiency of neural network training by back-propagation is the update locking problem: each layer must wait for the signal to propagate through the network before updating. We consider and analyze a training procedure, Decoupled Greedy Learning (DGL), that addresses this problem more effectively and at scales beyond those of previous solutions. It is based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization in layers. We show theoretically and empirically that this approach converges. In addition, we empirically find that it can lead to better generalization than sequential greedy optimization and even standard end-to-end back-propagation. We show that an extension of this approach to asynchronous settings, where modules can operate with large communication delays, is possible with the use of a replay buffer. We demonstrate the effectiveness of DGL on the CIFAR-10 datasets against alternatives and on the large-scale ImageNet dataset, where we are able to effectively train VGG and ResNet-152 models.

* 14 pages
Click to Read Paper
Shallow supervised 1-hidden layer neural networks have a number of favorable properties that make them easier to interpret, analyze, and optimize than their deep counterparts, but lack their representational power. Here we use 1-hidden layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. Contrary to previous approaches using shallow networks, we focus on problems where deep learning is reported as critical for success. We thus study CNNs on image recognition tasks using the large-scale ImageNet dataset and the CIFAR-10 dataset. Using a simple set of ideas for architecture and training we find that solving sequential 1-hidden-layer auxiliary problems leads to a CNN that exceeds AlexNet performance on ImageNet. Extending our training methodology to construct individual layers by solving 2-and-3-hidden layer auxiliary problems, we obtain an 11-layer network that exceeds VGG-11 on ImageNet obtaining 89.8% top-5 single crop. To our knowledge, this is the first competitive alternative to end-to-end training of CNNs that can scale to ImageNet. We conduct a wide range of experiments to study the properties this induces on the intermediate layers.

Click to Read Paper
We use the scattering network as a generic and fixed ini-tialization of the first layers of a supervised hybrid deep network. We show that early layers do not necessarily need to be learned, providing the best results to-date with pre-defined representations while being competitive with Deep CNNs. Using a shallow cascade of 1 x 1 convolutions, which encodes scattering coefficients that correspond to spatial windows of very small sizes, permits to obtain AlexNet accuracy on the imagenet ILSVRC2012. We demonstrate that this local encoding explicitly learns invariance w.r.t. rotations. Combining scattering networks with a modern ResNet, we achieve a single-crop top 5 error of 11.4% on imagenet ILSVRC2012, comparable to the Resnet-18 architecture, while utilizing only 10 layers. We also find that hybrid architectures can yield excellent performance in the small sample regime, exceeding their end-to-end counterparts, through their ability to incorporate geometrical priors. We demonstrate this on subsets of the CIFAR-10 dataset and on the STL-10 dataset.

Click to Read Paper
We introduce a two-layer wavelet scattering network, for object classification. This scattering transform computes a spatial wavelet transform on the first layer and a new joint wavelet transform along spatial, angular and scale variables in the second layer. Numerical experiments demonstrate that this two layer convolution network, which involves no learning and no max pooling, performs efficiently on complex image data sets such as CalTech, with structural objects variability and clutter. It opens the possibility to simplify deep neural network learning by initializing the first layers with wavelet filters.

* Workshop, 3 pages, prepared for ICLR 2014
Click to Read Paper
It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the i-RevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned, we overcome this by providing an explicit inverse. An analysis of i-RevNets learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the i-RevNet we reconstruct linear interpolations between natural image representations.

* ICLR 2018 - International Conference on Learning Representations, Apr 2018, Vancouver, Canada. 2018, https://iclr.cc/
Click to Read Paper
We study the first-order scattering transform as a candidate for reducing the signal processed by a convolutional neural network (CNN). We show theoretical and empirical evidence that in the case of natural images and sufficiently small translation invariance, this transform preserves most of the signal information needed for classification while substantially reducing the spatial resolution and total signal size. We demonstrate that cascading a CNN with this representation performs on par with ImageNet classification models, commonly used in downstream tasks, such as the ResNet-50. We subsequently apply our trained hybrid ImageNet model as a base model on a detection system, which has typically larger image inputs. On Pascal VOC and COCO detection tasks we demonstrate improvements in the inference speed and training memory consumption compared to models trained directly on the input image.

* ECCV 2018
Click to Read Paper
The Regularized Nonlinear Acceleration (RNA) algorithm is an acceleration method capable of improving the rate of convergence of many optimization schemes such as gradient descend, SAGA or SVRG. Until now, its analysis is limited to convex problems, but empirical observations shows that RNA may be extended to wider settings. In this paper, we investigate further the benefits of RNA when applied to neural networks, in particular for the task of image recognition on CIFAR10 and ImageNet. With very few modifications of exiting frameworks, RNA improves slightly the optimization process of CNNs, after training.

Click to Read Paper
Regularized nonlinear acceleration (RNA) is a generic extrapolation scheme for optimization methods, with marginal computational overhead. It aims to improve convergence using only the iterates of simple iterative algorithms. However, so far its application to optimization was theoretically limited to gradient descent and other single-step algorithms. Here, we adapt RNA to a much broader setting including stochastic gradient with momentum and Nesterov's fast gradient. We use it to train deep neural networks, and empirically observe that extrapolated networks are more accurate, especially in the early iterations. A straightforward application of our algorithm when training ResNet-152 on ImageNet produces a top-1 test error of 20.88%, improving by 0.8% the reference classification pipeline. Furthermore, the code runs offline in this case, so it never negatively affects performance.

Click to Read Paper
Deep neural network algorithms are difficult to analyze because they lack structure allowing to understand the properties of underlying transforms and invariants. Multiscale hierarchical convolutional networks are structured deep convolutional networks where layers are indexed by progressively higher dimensional attributes, which are learned from training data. Each new layer is computed with multidimensional convolutions along spatial and attribute variables. We introduce an efficient implementation of such networks where the dimensionality is progressively reduced by averaging intermediate layers along attribute indices. Hierarchical networks are tested on CIFAR image data bases where they obtain comparable precisions to state of the art networks, with much fewer parameters. We study some properties of the attributes learned from these databases.

Click to Read Paper
Scattering networks are a class of designed Convolutional Neural Networks (CNNs) with fixed weights. We argue they can serve as generic representations for modelling images. In particular, by working in scattering space, we achieve competitive results both for supervised and unsupervised learning tasks, while making progress towards constructing more interpretable CNNs. For supervised learning, we demonstrate that the early layers of CNNs do not necessarily need to be learned, and can be replaced with a scattering network instead. Indeed, using hybrid architectures, we achieve the best results with predefined representations to-date, while being competitive with end-to-end learned CNNs. Specifically, even applying a shallow cascade of small-windowed scattering coefficients followed by 1$\times$1-convolutions results in AlexNet accuracy on the ILSVRC2012 classification task. Moreover, by combining scattering networks with deep residual networks, we achieve a single-crop top-5 error of 11.4% on ILSVRC2012. Also, we show they can yield excellent performance in the small sample regime on CIFAR-10 and STL-10 datasets, exceeding their end-to-end counterparts, through their ability to incorporate geometrical priors. For unsupervised learning, scattering coefficients can be a competitive representation that permits image recovery. We use this fact to train hybrid GANs to generate images. Finally, we empirically analyze several properties related to stability and reconstruction of images from scattering coefficients.

* IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2018, pp.11
* arXiv admin note: substantial text overlap with arXiv:1703.08961
Click to Read Paper
The wavelet scattering transform is an invariant signal representation suitable for many signal processing and machine learning applications. We present the Kymatio software package, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks. All transforms may be executed on a GPU (in addition to CPU), offering a considerable speed up over CPU implementations. The package also has a small memory footprint, resulting inefficient memory usage. The source code, documentation, and examples are available undera BSD license at https://www.kymat.io/

Click to Read Paper