Three stories on deep linear networks
Seminar Données et Aléatoire Théorie & Applications
26/09/2024 - 14:00, Pierre Marion (EPFL), Room 106
We discuss three related facets of the optimization dynamics of deep linear networks with a quadratic loss, in connection with the largest eigenvalue of the loss Hessian, also known as the sharpness. The first result concerns the maximal learning rate that ensures stable learning. We show that it is upper-bounded and explain this by a lower bound on the sharpness of minimizers, which grows linearly with depth. Second, we study the properties of the minimizer found by gradient flow (the limit of gradient descent with vanishing learning rate) starting from a small-scale initialization. We show that the learned weight matrices are approximately rank-one and that their singular vectors align. This implies an implicit regularization towards flat minima: the sharpness of the minimizer is at most a constant times the lower bound. Finally, we study the case of a residual initialization. In this case, we prove convergence of the gradient flow for a Gaussian initialization of the residual network, as well as a bound on the sharpness of the minimizer.
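To fix ideas, here is a minimal numerical sketch of the setting (an illustration, not the speaker's code): a small deep linear network f(x) = W_L ... W_1 x with a quadratic loss, whose sharpness is computed as the largest eigenvalue of the full loss Hessian and compared with the classical gradient-descent stability threshold 2/sharpness. The width d, depth L, sample size n, the small-scale initialization factor, and the helper unflatten are assumed choices for illustration.

```python
import jax
import jax.numpy as jnp

d, L, n = 3, 4, 20                      # width, depth, number of samples (illustrative)
key = jax.random.PRNGKey(0)
kx, ky, kw = jax.random.split(key, 3)
X = jax.random.normal(kx, (n, d))       # inputs
Y = jax.random.normal(ky, (n, d))       # targets

def unflatten(theta):
    # Reshape the flat parameter vector into the stack of weight matrices W_1, ..., W_L.
    return theta.reshape(L, d, d)

def loss(theta):
    # Quadratic loss of the deep linear network f(x) = W_L ... W_1 x.
    Ws = unflatten(theta)
    out = X
    for W in Ws:
        out = out @ W.T
    return 0.5 * jnp.mean(jnp.sum((out - Y) ** 2, axis=1))

# Small-scale initialization, as in the abstract's second result.
theta0 = 0.01 * jax.random.normal(kw, (L * d * d,))

H = jax.hessian(loss)(theta0)           # full Hessian of the loss in the parameters
sharpness = jnp.linalg.eigvalsh(H)[-1]  # sharpness = largest eigenvalue
print("sharpness at initialization:", sharpness)
print("stability threshold 2/sharpness:", 2.0 / sharpness)
```

Evaluating the same quantities at a minimizer (e.g. after running gradient descent or integrating the gradient flow) is what the depth-dependent lower bound and the flatness results in the abstract refer to.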