A new prior is proposed for representation learning, which can be combined with other priors to help disentangle abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation: such low-dimensional thought vectors can correspond to statements about reality which are true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint, and it deviates considerably from maximum-likelihood approaches to modelling data and modelling how states unfold in the future based on an agent's actions. Instead of making predictions in the sensory (e.g., pixel) space, the consciousness prior allows the agent to make predictions in the abstract space, with only a few dimensions of that space involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural-language utterances, or to express classical AI knowledge in the form of facts and rules, although conscious states may be richer than what can easily be expressed as a sentence, a fact, or a rule.
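As a toy illustration only (not the paper's mechanism), forming a low-dimensional "thought" from a high-dimensional state can be sketched as a hard top-k bottleneck; the function name and the magnitude-based selection rule here are our own assumptions:

```python
import numpy as np

def conscious_bottleneck(h, k):
    """Toy sketch: keep only the k largest-magnitude elements of a
    high-dimensional state h, zeroing the rest -- one crude way to form
    a sparse, low-dimensional 'thought vector' from a rich state."""
    idx = np.argsort(-np.abs(h))[:k]  # indices of the k strongest elements
    c = np.zeros_like(h)
    c[idx] = h[idx]                   # sparse 'conscious' summary of h
    return c

state = np.array([0.1, -5.0, 2.0, 0.3, -0.05])
thought = conscious_bottleneck(state, k=2)  # only two dimensions survive
```

In the paper the selection would be done by a learned attention mechanism rather than a fixed magnitude rule; the sketch only conveys the "few elements of the current state" constraint.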

How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation

Sep 18, 2014

Yoshua Bengio

Deep learning research aims at discovering learning algorithms that discover multiple levels of distributed representations, with higher levels representing more abstract concepts. Although the study of deep learning has already led to impressive theoretical results, learning algorithms and breakthrough experiments, several challenges lie ahead. This paper proposes to examine some of these challenges, centering on the questions of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward-looking research directions aimed at overcoming these challenges.

Stochastic neurons can be useful for a number of reasons in deep learning models, but in many cases they pose a challenging problem: how to estimate the gradient of a loss function with respect to the input of such stochastic neurons, i.e., can we "back-propagate" through these stochastic neurons? We examine this question, review existing approaches, and present two novel families of solutions, applicable in different settings. In particular, we demonstrate that a simple, biologically plausible formula gives rise to an unbiased (but noisy) estimator of the gradient with respect to a binary stochastic neuron's firing probability. Unlike other estimators, which view the noise as a small perturbation in order to estimate gradients by finite differences, this estimator is unbiased even without assuming that the stochastic perturbation is small. This estimator is also interesting because it can be applied in very general settings which do not allow gradient back-propagation, including the estimation of the gradient with respect to future rewards, as required in reinforcement learning setups. We also propose an approach to approximating this unbiased but high-variance estimator by learning to predict it using a biased estimator. The second approach we propose assumes that an estimator of the gradient can be back-propagated; it provides an unbiased estimator of the gradient, but only works with non-linearities that, unlike the hard threshold but like the rectifier, are not flat over their entire range. This is similar to traditional sigmoidal units, but has the advantage that for many inputs a hard decision (e.g., a 0 output) can be produced, which is convenient for conditional computation and for achieving sparse representations and sparse gradients.
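The unbiased estimator for binary stochastic neurons can be checked numerically. Below is a minimal numpy sketch (our own toy loss and function names, not the paper's code): for h ~ Bernoulli(sigmoid(a)), averaging (h - sigmoid(a)) * loss over samples gives an unbiased but noisy estimate of the gradient of the expected loss with respect to a.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unbiased_grad_estimate(a, target, rng, n_samples=50000):
    """Unbiased (but noisy) estimator of d E[loss] / d a for binary
    stochastic neurons h ~ Bernoulli(sigmoid(a)) and the toy loss
    loss(h) = sum_i (h_i - target_i)^2: average (h - p) * loss."""
    p = sigmoid(a)
    h = (rng.random((n_samples, a.size)) < p).astype(float)  # sampled firings
    loss = np.sum((h - target) ** 2, axis=1, keepdims=True)  # loss per sample
    return np.mean((h - p) * loss, axis=0)

rng = np.random.default_rng(0)
a = np.array([0.5, -1.0])
target = np.array([1.0, 0.0])
estimate = unbiased_grad_estimate(a, target, rng)
# For this separable loss the exact gradient is p * (1 - p) * (1 - 2 * target),
# which the noisy estimate approaches as n_samples grows.
exact = sigmoid(a) * (1 - sigmoid(a)) * (1 - 2 * target)
```

Note the variance of this estimator is high, which is exactly why the abstract proposes learning a biased predictor of it.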

Practical recommendations for gradient-based training of deep architectures

Sep 16, 2012

Yoshua Bengio

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.
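Two of the most commonly tuned hyper-parameters the chapter covers, the learning rate and the momentum coefficient, already appear in the simplest gradient-based training loop. A minimal numpy sketch on a toy quadratic objective (our own illustration, not code from the chapter):

```python
import numpy as np

def sgd_momentum(grad_fn, w0, lr=0.05, momentum=0.9, n_steps=500):
    """Gradient descent with classical momentum; lr and momentum are
    typically among the first hyper-parameters to tune."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = momentum * v - lr * grad_fn(w)  # velocity accumulates past gradients
        w = w + v
    return w

# Toy quadratic objective f(w) = ||w - target||^2 with known minimum.
target = np.array([1.0, 2.0])
w_star = sgd_momentum(lambda w: 2.0 * (w - target), np.zeros(2))
```

On real networks these values interact with batch size, initialization, and schedule choices, which is the chapter's point about having to adjust many hyper-parameters jointly.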

On the difficulty of training Recurrent Neural Networks

Feb 16, 2013

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

* Improved description of the exploding gradient problem and description and analysis of the vanishing gradient problem
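The exploding-gradient remedy this paper is best known for, gradient norm clipping, fits in a few lines of numpy (a minimal illustration of the idea, not the authors' code):

```python
import numpy as np

def clip_grad_norm(grad, threshold):
    """Rescale the gradient when its L2 norm exceeds a threshold: the
    direction is preserved while the step size is bounded, a standard
    remedy for exploding gradients when training RNNs with BPTT."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_grad_norm(np.array([6.0, 8.0]), threshold=5.0)  # norm 10 -> rescaled to 5
```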

How to Construct Deep Recurrent Neural Networks

Apr 24, 2014

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio

* Accepted at ICLR 2014 (Conference Track). 10-page text + 3-page references

The Benefits of Over-parameterization at Initialization in Deep ReLU Networks

Jan 11, 2019

Devansh Arpit, Yoshua Bengio

Speech and Speaker Recognition from Raw Waveform with SincNet

Dec 13, 2018

Mirco Ravanelli, Yoshua Bengio

* submitted to ICASSP 2019. arXiv admin note: substantial text overlap with arXiv:1811.09725, arXiv:1808.00158

The effects of negative adaptation in Model-Agnostic Meta-Learning

Dec 05, 2018

Tristan Deleu, Yoshua Bengio

* Workshop on Meta-Learning - 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, Canada

Learning Speaker Representations with Mutual Information

Dec 01, 2018

Mirco Ravanelli, Yoshua Bengio

* Submitted to ICASSP 2019

* In Proceedings of NIPS@IRASL 2018. arXiv admin note: substantial text overlap with arXiv:1808.00158

Depth with Nonlinearity Creates No Bad Local Minima in ResNets

Oct 21, 2018

Kenji Kawaguchi, Yoshua Bengio

In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets studied in previous work, in the sense that the values of all local minima are no worse than the global minima values of corresponding shallow linear predictors with arbitrary fixed features, and are guaranteed to further improve via residual representations. As a result, this paper provides an affirmative answer to an open question stated in a paper in the conference on Neural Information Processing Systems (NIPS) 2018. We note that even though our paper advances the theoretical foundation of deep learning and non-convex optimization, there is still a gap between theory and many practical deep learning applications.
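A minimal sketch of the residual structure the analysis relies on (our own toy forward pass, not the paper's exact model): each block adds a nonlinear correction to an identity skip path, so an arbitrarily deep stack with zero residual weights is the identity map, and residual representations can only refine it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def resnet_forward(x, weights):
    """Each residual block computes h <- h + relu(h @ W): an identity
    skip connection plus a learned nonlinear correction."""
    h = np.array(x, dtype=float)
    for W in weights:
        h = h + relu(h @ W)
    return h

x = np.array([1.0, -2.0, 0.5])
# With all residual weights at zero, an arbitrarily deep stack is the identity.
out = resnet_forward(x, [np.zeros((3, 3))] * 10)
```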

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly with raw speech samples. Rather than employing standard hand-crafted features, such CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, the proposed method learns only the low and high cutoff frequencies directly from data. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
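The band-pass parametrization can be sketched directly from its definition: an ideal band-pass filter is the difference of two low-pass sinc filters, so only the two cutoff frequencies are free parameters. A minimal numpy version (our own illustration with assumed parameter values, not the authors' implementation):

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size, sample_rate=16000):
    """SincNet-style filter kernel: difference of two low-pass sinc
    filters with cutoffs f_high and f_low (in Hz), Hamming-windowed
    to reduce ripple. Only f_low and f_high would be learned."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    # np.sinc is the normalized sinc: sin(pi x) / (pi x)
    h = 2 * f_high * np.sinc(2 * f_high * t) - 2 * f_low * np.sinc(2 * f_low * t)
    return h * np.hamming(kernel_size)

kernel = sinc_bandpass(f_low=50.0, f_high=300.0, kernel_size=101)
```

A full SincNet layer would hold f_low and f_high as trainable parameters per filter and convolve the resulting kernels with the raw waveform.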

* Accepted at SLT 2018

Equivalence of Equilibrium Propagation and Recurrent Backpropagation

May 22, 2018

Benjamin Scellier, Yoshua Bengio

Low-memory convolutional neural networks through incremental depth-first processing

Apr 28, 2018

Jonathan Binas, Yoshua Bengio

Measuring the tendency of CNNs to Learn Surface Statistical Regularities

Nov 30, 2017

Jason Jo, Yoshua Bengio

* Submitted

Learning Independent Features with Adversarial Nets for Non-linear ICA

Oct 13, 2017

Philemon Brakel, Yoshua Bengio

Reliable measures of statistical dependence could be useful tools for learning independent features and performing tasks like source separation using Independent Component Analysis (ICA). Unfortunately, many such measures, like the mutual information, are hard to estimate and optimize directly. We propose to learn independent features with adversarial objectives which optimize such measures implicitly. These objectives compare samples from the joint distribution and the product of the marginals without the need to compute any probability densities. We also propose two methods for obtaining samples from the product of the marginals, using either a simple resampling trick or a separate parametric distribution. Our experiments show that this strategy can easily be applied to different types of model architectures and solve both linear and non-linear ICA problems.
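The "simple resampling trick" for sampling from the product of the marginals can be sketched as an independent permutation of each coordinate within a minibatch (a minimal numpy version of the idea; the function name is our own):

```python
import numpy as np

def product_of_marginals(batch, rng):
    """Independently permute each column of an (n, d) batch: each output
    row then mixes coordinates drawn from different joint samples,
    approximating a draw from the product of the marginal distributions."""
    n = batch.shape[0]
    out = np.empty_like(batch)
    for j in range(batch.shape[1]):
        out[:, j] = batch[rng.permutation(n), j]  # shuffle one coordinate
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
joint = np.stack([x1, x1 + 0.1 * rng.normal(size=2000)], axis=1)  # correlated pair
shuffled = product_of_marginals(joint, rng)
```

Each marginal is preserved exactly (every column is a permutation of the original), while cross-coordinate dependence is destroyed, which is what the adversarial objective needs to compare against the joint.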

* A preliminary version of this work was presented at the ICML 2017 workshop on implicit models

Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation

Mar 28, 2017

Benjamin Scellier, Yoshua Bengio
