Monday, 06 January 2020

In these two lectures, we will introduce the first-order black-box setting of convex optimization, describe several fundamental algorithms (gradient descent, Nesterov’s accelerated gradient descent, mirror descent, stochastic gradient descent, and stochastic variance-reduced gradient), and derive bounds on their performance for appropriate (smooth/nonsmooth) classes of convex functions. The lectures will not assume any prior background in optimization from the audience.
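
As a concrete illustration (not part of the abstract itself), the most basic of these algorithms, gradient descent with the standard 1/L step size on a smooth strongly convex quadratic, can be sketched as follows; the objective and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A_ = rng.standard_normal((5, 5))
A = A_.T @ A_ + np.eye(5)        # positive definite: f is smooth and strongly convex
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)   # unique minimizer of f

L = np.linalg.eigvalsh(A).max()  # smoothness constant (largest eigenvalue of A)

def grad(x):                     # gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b

x = np.zeros(5)                  # gradient descent with step size 1/L
for _ in range(500):
    x = x - grad(x) / L

print(bool(np.linalg.norm(x - x_star) < 1e-6))
```

With step size 1/L the iterates contract toward the minimizer at a linear rate governed by the condition number; the accelerated and stochastic variants covered in the lectures improve on this rate or its per-iteration cost.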

Deep networks are mysterious. These over-parametrized machine learning models, trained with rudimentary optimization algorithms on non-convex landscapes in millions of dimensions, have defied attempts to put a sound theoretical footing beneath their impressive performance.

This talk will shed light upon some of these mysteries. I will employ diverse ideas---from thermodynamics and optimal transportation to partial differential equations, control theory and Bayesian inference---and paint a picture of the training process of deep networks. Along the way, I will develop state-of-the-art algorithms for non-convex optimization.

I will conclude with a few recent results on learning with limited labeled data.

We show that a random concave function having a periodic Hessian on an equilateral lattice has a quadratic scaling limit, if the average Hessian of the function satisfies certain conditions. We consider the set of all concave functions g on an equilateral lattice L that, when shifted by an element of nL, incur addition by a linear function (this condition is equivalent to the periodicity of the Hessian of g). We identify this set, up to addition by a constant, with a convex polytope P(s), where s corresponds to the average Hessian. We show that the diameter of P(s) is bounded below by c(s)n^2 in sup norm, where c(s) is a positive constant depending only on s. Our main result is that, for any epsilon > 0, the normalized Lebesgue measure of all points in P(s) that are not contained in a cube Q of side length epsilon, centered at the unique (up to addition of a linear term) quadratic polynomial with Hessian s, tends to 0 as n tends to ∞. Such results for other function spaces would lead to a better understanding of function fitting in Bayesian contexts. arXiv preprint: “Random concave functions on an equilateral lattice with periodic hessians” (2019), https://arxiv.org/abs/1909.08586 .

Artificial Intelligence is about to have a dramatic impact on many sectors of human activity. In the last ten years, thanks to the development of machine learning in “deep networks”, we have experienced spectacular breakthroughs in diverse applications such as automatic interpretation of images, speech recognition, consumer profiling, or Go and chess playing. Algorithms are now competing with the best professionals at analyzing skin cancer symptoms or detecting specific anomalies in radiology; and much more is to come. Worrisome perspectives are frequently raised, from massive job destruction to autonomous decision-making “warrior” robots.

In this talk, we shall open the black box of deep networks and explore how they are programmed to learn from data by themselves. This will allow us to understand their limits, to question whether their achievements have anything to do with “intelligence”, and to reflect on the foundations of scientific intelligence.

Tuesday, 07 January 2020

TBA

TBA

Approximate Message Passing, or AMP, is a class of low-complexity, scalable algorithms used to solve high-dimensional noisy linear regression problems, where the goal is to estimate a vector x from a noisy measurement y = Ax + noise. AMP has the attractive feature that its performance, measured for example by the squared error loss, can be tracked accurately by a scalar iteration referred to as state evolution. In this talk, I will present recent performance guarantees for this algorithm under various problem conditions, and I will introduce recent applications in statistical inference and estimation.
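
To make the iteration concrete, here is a minimal sketch (not from the talk) of AMP for a sparse signal with a soft-thresholding denoiser. The sensing matrix has i.i.d. Gaussian entries as AMP assumes, and all dimensions, the noise level, and the threshold rule are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 500, 250, 25                        # signal dim, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)  # i.i.d. Gaussian sensing matrix
y = A @ x_true + 0.01 * rng.standard_normal(m)

def soft(u, t):                               # soft-thresholding denoiser eta
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

x, z = np.zeros(n), y.copy()
for _ in range(30):
    tau = np.sqrt(np.mean(z ** 2))            # effective noise level, as tracked by state evolution
    x_new = soft(x + A.T @ z, 2 * tau)        # denoise the pseudo-data x + A^T z
    z = y - A @ x_new + z * np.count_nonzero(x_new) / m   # residual with Onsager correction
    x = x_new

print(bool(np.linalg.norm(x - x_true) / np.linalg.norm(x_true) < 0.1))
```

The Onsager term added to the residual is what distinguishes AMP from plain iterative thresholding and what makes the scalar state-evolution prediction accurate.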

In these two lectures, we will introduce the first-order black-box setting of convex optimization, describe several fundamental algorithms (gradient descent, Nesterov’s accelerated gradient descent, mirror descent, stochastic gradient descent, and stochastic variance-reduced gradient), and derive bounds on their performance for appropriate (smooth/nonsmooth) classes of convex functions. The lectures will not assume any prior background in optimization from the audience.

Wednesday, 08 January 2020

TBA

I will survey recent progress on the study of the landscapes of inference.

First about their topological complexity, and then about the performance of simple plain-vanilla algorithms, mostly Gradient Descent and Online Stochastic Gradient. I will also explore some of the smarter algorithms needed to “escape mediocrity” at the beginning of the search.

This will be done in the hard case of Tensor PCA, then for the p+k model, and finally for the usual inference problems in the class of Generalized Linear Models (which are not linear, as everybody knows). This will use recent works by G. Biroli, C. Cammarota, R. Gheissari, A. Jagannath, S. Mei, A. Maillard, A. Montanari, M. Nica, V. Ros, E. Subag and more.
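
As a toy illustration of the Tensor PCA landscape mentioned above (not one of the algorithms analyzed in the lectures), here is tensor power iteration on a small spiked-tensor instance. The warm start with a modest overlap with the planted spike, the signal strength, and the dimensions are all illustrative; a cold random start can get lost in the rough landscape, which is precisely the regime the lectures discuss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 30, 10.0
v = rng.standard_normal(n)
v /= np.linalg.norm(v)                       # planted unit spike
noise = rng.standard_normal((n, n, n)) / np.sqrt(n)
T = lam * np.einsum('i,j,k->ijk', v, v, v) + noise

# warm start with overlap ~0.3 with the spike (illustration of the easy phase)
u = rng.standard_normal(n)
u -= (u @ v) * v
u /= np.linalg.norm(u)
x = 0.3 * v + np.sqrt(1 - 0.3 ** 2) * u

for _ in range(20):
    x = np.einsum('ijk,j,k->i', T, x, x)     # tensor power iteration T(., x, x)
    x /= np.linalg.norm(x)

print(bool(abs(x @ v) > 0.9))
```

With a strong enough spike and a warm start, the iteration locks onto the planted direction in a handful of steps; the interesting theoretical questions concern what happens from random initialization as the signal-to-noise ratio decreases.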

Recent engineering achievements of deep-learning machines have both impressed and intrigued the scientific community due to our limited theoretical understanding of the underlying reasons for their success. This work provides a general principled framework to investigate the function space of different types of deep-learning machines, based on the generating functional analysis. The framework facilitates studying the number of solution networks of a given error around a reference multi-layer network. Exploring the function landscape of densely-connected networks, we uncover a general layer-by-layer learning behaviour, while the study of sparsely-connected networks indicates the advantage of having more layers for increasing the generalization ability of such models. This framework accommodates other network architectures and computing elements, including networks with correlated weights, convolutional networks and discretised variables. Additionally, using large deviation theory it facilitates studying the finite-size effect of deep-learning networks, the sensitivity of networks to noise, weight binarisation and sparsification. It offers a useful tool for better understanding the properties of deep-learning machines.

Bo Li and David Saad, Physical Review Letters 120, 248301 (2018)

Bo Li and David Saad, arXiv:1910.05769 (2019)

In understanding natural systems over hundreds of years, physicists have developed a wealth of dynamics and viewpoints. Some of these methods, when viewed (and abstracted) through the lens of computation, could lead to new optimization and sampling techniques for problems in machine learning, statistics, and theoretical computer science. I will present some recent examples from my research on such interactions between physics and algorithms, e.g., a Hamiltonian-dynamics-inspired algorithm for sampling from continuous distributions.

Based on joint works with Oren Mangoubi.
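
A minimal sketch of the Hamiltonian-dynamics idea for sampling, in the form of Hamiltonian Monte Carlo with a leapfrog integrator and a Metropolis correction, applied to a standard Gaussian target; the step size and trajectory length are illustrative, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(q):                 # potential energy = -log density (standard normal)
    return 0.5 * q @ q

def grad_U(q):
    return q

def hmc_step(q, eps=0.2, L=20):
    p = rng.standard_normal(q.shape)              # resample momentum
    q_new, p_new = q.copy(), p.copy()
    p_new -= 0.5 * eps * grad_U(q_new)            # leapfrog: initial half step
    for _ in range(L - 1):
        q_new += eps * p_new
        p_new -= eps * grad_U(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * grad_U(q_new)            # final half step
    # Metropolis accept/reject on the Hamiltonian H = U + kinetic energy
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p @ p)
    return q_new if rng.random() < np.exp(-dH) else q

q = np.zeros(2)
samples = []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q.copy())
samples = np.array(samples)

print(bool(abs(samples.mean()) < 0.1), bool(abs(samples.var() - 1.0) < 0.1))
```

Because the leapfrog integrator nearly conserves the Hamiltonian, long trajectories are accepted with high probability, which is what lets HMC make large moves compared to random-walk samplers.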

Thursday, 09 January 2020

I will survey recent progress on the study of the landscapes of inference.

First about their topological complexity, and then about the performance of simple plain-vanilla algorithms, mostly Gradient Descent and Online Stochastic Gradient. I will also explore some of the smarter algorithms needed to “escape mediocrity” at the beginning of the search.

This will be done in the hard case of Tensor PCA, then for the p+k model, and finally for the usual inference problems in the class of Generalized Linear Models (which are not linear, as everybody knows). This will use recent works by G. Biroli, C. Cammarota, R. Gheissari, A. Jagannath, S. Mei, A. Maillard, A. Montanari, M. Nica, V. Ros, E. Subag and more.

The problem of approximating partition functions of spin systems is an important and well-studied algorithmic primitive, in part due to its strong connections with Gibbs sampling. Randomized methods based on Markov chain Monte Carlo have traditionally been the tool of choice for attacking such problems. More recently, deterministic algorithms based on decay of correlations have also been used to tackle the problem, and these yielded some of the tightest known connections between computational complexity theory and statistical physics.

This talk discusses a third method, proposed by Alexander Barvinok, which is based on the geometry of polynomials and, in hindsight, is related to the Yang-Lee theory of phase transitions. We discuss two recent cases where the method yields deterministic polynomial-time algorithms for approximating classical partition functions (on bounded-degree graphs) that have remained out of the reach of correlation-decay methods: the ferromagnetic Ising model on bounded-degree graphs beyond the correlation decay threshold, and the zero-temperature anti-ferromagnetic Potts model ("random colourings").

We also discuss emerging connections between the geometry of polynomials, correlation decay, and Markov chain Monte Carlo, that are suggested by this line of work.

Based on joint work with Jingcheng Liu and Alistair Sinclair.
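
To illustrate the core of the geometry-of-polynomials idea on a tiny instance (a sketch, not the talk's actual algorithms): the partition function below is the independence polynomial Z(λ) of the 8-cycle, and ln Z is approximated by a truncated Taylor series, which is accurate as long as λ stays away from the complex zeros of Z. Here the coefficients of Z are found by brute force; Barvinok's method instead computes the needed low-order coefficients efficiently.

```python
from itertools import combinations
import math

n = 8
edges = [(i, (i + 1) % n) for i in range(n)]   # cycle C_8

def independent(S):
    S = set(S)
    return all(not (u in S and v in S) for u, v in edges)

# Coefficients a_k of Z(lambda) = sum_I lambda^{|I|}, by brute force (a_0 = 1)
a = [sum(1 for S in combinations(range(n), k) if independent(S))
     for k in range(n + 1)]

# Power-series logarithm: ln Z = sum_k b_k lambda^k, with the standard
# recurrence b_k = a_k - (1/k) * sum_{j=1}^{k-1} j * b_j * a_{k-j}
b = [0.0] * (n + 1)
for k in range(1, n + 1):
    b[k] = a[k] - sum(j * b[j] * a[k - j] for j in range(1, k)) / k

lam = 0.1
exact = math.log(sum(a[k] * lam ** k for k in range(n + 1)))
approx = sum(b[k] * lam ** k for k in range(1, 4))   # truncate at order 3

print(bool(abs(exact - approx) < 0.01))
```

Zero-freeness of Z in a complex region around the evaluation point is exactly what guarantees that a logarithmic number of Taylor terms suffices, which is how the method connects to Yang-Lee zeros.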

References:

1. Liu, J., Sinclair, A. & Srivastava, P. The Ising Partition Function: Zeros and Deterministic Approximation. J Stat Phys 174, 287–315, 2019.

https://doi.org/10.1007/s10955-018-2199-2

2. Liu, J., Sinclair, A. & Srivastava, P. Fisher zeros and correlation decay in the Ising model. J Math Phys 60, 103304, 2019.

https://doi.org/10.1063/1.5082552

3. Liu, J., Sinclair, A. & Srivastava, P. A deterministic algorithm for counting colorings with 2Δ colors. Extended abstract in Proc. IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2019.

https://arxiv.org/abs/1906.01228

TBA

Gradient-based optimization algorithms applied to artificial neural networks with many parameters typically lead to models with good train and test performance. Two lines of theoretical research have recently emerged as attempts to explain this phenomenon: the lazy-training analysis and the mean-field analysis. The lazy-training analysis studies the situation in which a model remains in a small neighborhood of its initialization throughout training and implicitly performs kernel regression, while the mean-field analysis is a general description of the training dynamics of infinitely wide two-layer neural networks. In this talk, I will present the insights brought by these analyses, their connections, strengths and limitations.

This talk is based on joint work with Francis Bach and Edouard Oyallon.

https://arxiv.org/abs/1805.09545

https://arxiv.org/abs/1812.07956

https://arxiv.org/abs/1907.10300

Friday, 10 January 2020

I will survey recent progress on the study of the landscapes of inference.

First about their topological complexity, and then about the performance of simple plain-vanilla algorithms, mostly Gradient Descent and Online Stochastic Gradient. I will also explore some of the smarter algorithms needed to "escape mediocrity” at the beginning of the search.

This will be done in the hard case of Tensor PCA, then for the p+k model, and finally for the usual inference problems in the class of Generalized Linear Models (which are not linear, as everybody knows). This will use recent works by G. Biroli, C. Cammarota, R. Gheissari, A. Jagannath, S. Mei, A. Maillard, A. Montanari, M. Nica, V.Ros, E. Subag and more.

TBA

Non-convex optimization is ubiquitous in modern machine learning applications. Gradient-descent-based algorithms (such as gradient descent or Nesterov's accelerated gradient descent) are widely used in practice since they have been observed to perform well on these problems. From a theoretical standpoint, however, this is quite surprising, since these non-convex problems are teeming with highly suboptimal saddle points, and it is well known that gradient descent (and its variants) can get stuck at such saddle points. A key question that arises in resolving this apparent paradox is whether gradient-descent-based algorithms efficiently escape saddle points where the Hessian has negative eigenvalues and converge to points where the Hessian is positive semidefinite (such points are called second-order stationary points). We answer this question in the affirmative by showing the following:

(1) Perturbed gradient descent converges to second-order stationary points in a number of iterations that depends only poly-logarithmically on the dimension (i.e., it is almost “dimension-free”). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors; and

(2) A variant of Nesterov’s accelerated gradient descent converges to second-order stationary points at a faster rate than perturbed gradient descent.

The key technical components behind these results are (1) understanding geometry around saddle points and (2) designing a potential function for Nesterov’s accelerated gradient descent for non-convex problems.
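
A toy numpy sketch of the perturbed-gradient-descent idea (illustrative function, parameters, and stopping rule, not the algorithm's exact form from the papers): starting exactly at the strict saddle (0, 0) of f(x, y) = (x² − 1)² + y², plain gradient descent is stuck, while a small random perturbation followed by descent escapes to a minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p):
    return (p[0] ** 2 - 1) ** 2 + p[1] ** 2

def grad(p):
    return np.array([4 * p[0] * (p[0] ** 2 - 1), 2 * p[1]])

def perturbed_gd(p, eta=0.05, radius=1e-3, steps=3000):
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) > 1e-6:
            p = p - eta * g                           # ordinary gradient step
        else:
            q = p + radius * rng.standard_normal(2)   # small random perturbation
            for _ in range(100):                      # descend from the perturbed point
                q = q - eta * grad(q)
            if f(q) < f(p) - 1e-9:
                p = q           # real decrease: we escaped a strict saddle
            else:
                return p        # no decrease: approximately second-order stationary
    return p

p = perturbed_gd(np.zeros(2))
print(bool(np.isclose(abs(p[0]), 1.0, atol=1e-4)), bool(abs(p[1]) < 1e-4))
```

The perturbation works because the negative-curvature direction at a strict saddle amplifies small displacements exponentially; testing for sufficient function decrease after the perturbation is what certifies (approximate) second-order stationarity when no escape happens.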

Based on joint works with Chi Jin, Rong Ge, Sham M. Kakade and Mike Jordan.

Deep learning is very powerful at a variety of tasks, including self-driving cars and playing Go beyond human level. Despite these engineering successes, why deep learning works remains a question with many facets. I will discuss two of them: (i) Deep supervised learning is a fitting procedure, achieved by defining a loss function which is high when data are poorly fitted. Learning corresponds to a descent in the loss landscape. Why isn't it stuck in bad local minima, as occurs when cooling glassy systems in physics? What is the geometry of the loss landscape? (ii) In recent years it has been realised that deep learning works best in the overparametrized regime, where the number of fitting parameters is much larger than the number of data to be fitted, contrary to intuition and to standard views in statistics. I will propose a resolution of these two problems, based on both an analogy with the energy landscape of repulsive particles and an analysis of asymptotically wide nets.

Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel Θ (the NTK). By contrast, in the Mean Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as α h^{-1/2} at initialization. By varying α and h, we probe the crossover between the two limits. We observe two regimes that we call "lazy training" and "feature training".
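
A minimal sketch of the α-scaling above on a toy two-layer network (illustrative data, width, and hyperparameters; not the paper's experiment): the output layer is scaled by α/√h, the "doubling trick" makes the network output zero at initialization, and the step size is rescaled by 1/α² as in the lazy-training setup. Large α should then leave the hidden weights nearly frozen (lazy), while small α moves them substantially (feature learning).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
X = rng.standard_normal((n, 2))
y = np.sin(X[:, 0])

def train(alpha, h=200, steps=500):
    wrng = np.random.default_rng(1)                 # same weight init for every alpha
    W = wrng.standard_normal((h, 2))
    W = np.vstack([W, W.copy()])                    # doubling trick: duplicated neurons
    a = np.concatenate([np.ones(h), -np.ones(h)])   # opposite output signs: f = 0 at init
    W0 = W.copy()
    lr = 0.05 / alpha ** 2                          # lazy-training step-size rescaling
    for _ in range(steps):
        z = np.tanh(X @ W.T)                        # (n, 2h) hidden activations
        f = alpha / np.sqrt(2 * h) * (z @ a)        # network output
        r = f - y                                   # residuals
        # gradients of the half mean-squared error wrt W and a
        gz = (alpha / np.sqrt(2 * h)) * r[:, None] * a[None, :] * (1 - z ** 2)
        W -= lr * (gz.T @ X) / n
        a -= lr * (alpha / np.sqrt(2 * h)) * (z.T @ r) / n
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)   # relative hidden-weight movement

move_lazy = train(alpha=30.0)
move_feature = train(alpha=0.3)
print(bool(move_lazy < move_feature))
```

With the 1/α² step-size rescaling, the function-space dynamics is comparable across α, so the relative movement of the hidden weights isolates the lazy-versus-feature crossover that varying α and h probes.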