Mingwei Wei

How noise affects the Hessian spectrum in overparameterized neural networks

Oct 29, 2019
Mingwei Wei, David J Schwab

Stochastic gradient descent (SGD) forms the core optimization method for deep neural networks. While some theoretical progress has been made, it remains unclear why SGD leads the learning dynamics in overparameterized networks to solutions that generalize well. Here we show that for overparameterized networks with a degenerate valley in their loss landscape, SGD on average decreases the trace of the Hessian of the loss. We also generalize this result to other noise structures and show that isotropic noise in the non-degenerate subspace of the Hessian decreases its determinant. In addition to explaining SGD's role in sculpting the Hessian spectrum, this opens the door to new optimization approaches that may confer better generalization performance. We test our results with experiments on toy models and deep neural networks.
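To make the terms "degenerate valley" and "trace of the Hessian" concrete, here is a minimal sketch on a hypothetical two-parameter toy loss, L(a, b) = 0.5 (a b - 1)^2. This model, and the function names `loss`, `hessian`, and `trace_and_eigenvalues`, are assumptions chosen for illustration and are not taken from the paper. Every point with a b = 1 is a global minimum, one Hessian eigenvalue vanishes along that valley, and the trace a^2 + b^2 varies from point to point, so some minima are flatter than others.

```python
import math

def loss(a, b):
    """Toy loss (illustrative assumption) whose minima form the valley a * b = 1."""
    return 0.5 * (a * b - 1.0) ** 2

def hessian(a, b):
    """Analytic 2x2 Hessian of the toy loss: ((L_aa, L_ab), (L_ab, L_bb))."""
    off_diag = 2.0 * a * b - 1.0
    return ((b * b, off_diag), (off_diag, a * a))

def trace_and_eigenvalues(h):
    """Trace and (small, large) eigenvalues of a symmetric 2x2 matrix."""
    tr = h[0][0] + h[1][1]
    det = h[0][0] * h[1][1] - h[0][1] ** 2
    # Clamp at zero to guard against tiny negative values from rounding.
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return tr, ((tr - disc) / 2.0, (tr + disc) / 2.0)

# Walk along the valley a * b = 1 and report the curvature at each minimum.
for a in (0.5, 1.0, 2.0, 4.0):
    b = 1.0 / a
    tr, (small, large) = trace_and_eigenvalues(hessian(a, b))
    print(f"a={a:.1f} b={b:.2f} loss={loss(a, b):.1e} "
          f"trace={tr:.4f} eigenvalues=({small:.1e}, {large:.4f})")
```

In this toy example the trace is smallest at a = b = 1. The sketch only evaluates curvature along the valley; it does not simulate SGD or reproduce the paper's analysis.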

Mean-field Analysis of Batch Normalization

Mar 06, 2019
Mingwei Wei, James Stokes, David J Schwab

Batch Normalization (BatchNorm) is an extremely useful component of modern neural network architectures, enabling optimization using higher learning rates and achieving faster convergence. In this paper, we use mean-field theory to analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers. We show that it has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These findings are then used to justify the use of larger learning rates for networks that use BatchNorm, and we provide a quantitative characterization of the maximal allowable learning rate to ensure convergence. Experiments support our theoretically predicted maximum learning rate, and furthermore suggest that networks with smaller values of the BatchNorm parameter achieve lower loss after the same number of epochs of training.
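The link between the largest curvature eigenvalue and the largest usable learning rate can be sketched with the standard quadratic-model argument: for plain gradient descent on a quadratic loss with curvatures c_i, each coordinate is multiplied by (1 - lr * c_i) per step, so the iteration converges only when lr < 2 / max(c_i). The curvature values and the function name `final_loss` below are hypothetical, and this quadratic model is an assumption used for illustration rather than the mean-field calculation from the paper.

```python
def final_loss(curvatures, lr, steps=200, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * sum(c_i * x_i**2); return the final loss."""
    x = [x0] * len(curvatures)
    for _ in range(steps):
        # The gradient of the i-th term is c_i * x_i.
        x = [xi - lr * c * xi for xi, c in zip(x, curvatures)]
    return 0.5 * sum(c * xi * xi for xi, c in zip(x, curvatures))

curvatures = [0.1, 1.0, 4.0]        # hypothetical curvature eigenvalues
lr_max = 2.0 / max(curvatures)      # stability threshold of the quadratic model

initial = final_loss(curvatures, lr=0.0, steps=0)
below = final_loss(curvatures, lr=0.9 * lr_max)
above = final_loss(curvatures, lr=1.1 * lr_max)
print(f"threshold lr = {lr_max:.2f}")
print(f"initial loss = {initial:.3f}")
print(f"loss with lr just below threshold = {below:.3e}")
print(f"loss with lr just above threshold = {above:.3e}")
```

Under this model, lowering the largest eigenvalue raises the threshold 2 / max(c_i), which is the sense in which a flatter landscape tolerates a larger learning rate.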
