Monday, 03 July 2023

Graph alignment is a generic high-dimensional statistical inference problem with many applications, ranging from social network data de-anonymization to separation of functional regions in cortical surfaces. The aim of this short course is to present several recent theoretical results on graph alignment when the graphs to be aligned come from a specific generative model, namely the correlated Erdős–Rényi random graph model. Specifically, we shall first present results on information-theoretic limits to feasibility of graph alignment: below some threshold, the observed graphs do not contain enough information for alignment to be feasible, while above that threshold, some algorithms are known to succeed at alignment, although in exponential time. We shall then present results on computational feasibility of alignment, describing a second threshold which determines when a family of ‘local’, polynomial-time algorithms succeed at alignment. Together, these results show a rich phenomenology for graph alignment, displaying an ‘impossible phase’, a ‘hard phase’ and an ‘easy phase’ for feasibility of the task.
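As a concrete reference point, here is a minimal Python sketch (function name hypothetical) of the standard parent-subsampling construction of the correlated Erdős–Rényi model: draw a parent graph G(n, p), then keep each parent edge independently with probability s in each of the two children. In the alignment problem one child would additionally have its vertices relabelled by a hidden permutation; that step is omitted here for brevity.

```python
import random

def correlated_erdos_renyi(n, p, s, seed=0):
    """Sample a correlated pair of Erdos-Renyi graphs via a shared parent.

    Each child is marginally G(n, p*s); the two children are correlated
    because they subsample edges from the same parent graph G(n, p).
    """
    rng = random.Random(seed)
    g1, g2 = set(), set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:        # edge present in the parent graph
                if rng.random() < s:    # retained in the first child
                    g1.add((i, j))
                if rng.random() < s:    # retained in the second child
                    g2.add((i, j))
    return g1, g2

g1, g2 = correlated_erdos_renyi(n=200, p=0.1, s=0.9)
```

With s close to 1 the two edge sets overlap heavily, which is exactly the signal an alignment algorithm exploits; as s decreases toward the thresholds discussed in the course, the shared signal fades.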


Many data processing problems can be cast as minimizing a convex functional over (a subset of) the space of signed measures. These include problems as diverse as training two-layer neural networks, sparse spikes deconvolution, Wasserstein barycenters, and trajectory inference. In this mini-course I will present recent conceptual tools and theoretical guarantees pertaining to "particle methods", a versatile and practical class of approaches to such problems in which the unknown measure is parameterized as a sum of Dirac masses. We will cover fixed-support methods, conic particle methods, and noisy gradient methods, and discuss their links with Wasserstein geometry and with (generalized) Langevin dynamics for sampling. Concepts from Optimal Transport, Stochastic Differential Equations (SDEs) and Partial Differential Equations (PDEs) will be introduced and used throughout the course.
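To make the particle parameterization concrete, the following toy sketch (names and setup hypothetical, not from the course) fits a 1-D signal by a sum of Gaussian bumps, running plain gradient descent on both the weights and the positions of a fixed number of Dirac "particles" — the simplest instance of moving both mass and support:

```python
import numpy as np

def particle_fit(y, grid, m=3, steps=5000, lr=0.002, width=0.1, seed=0):
    """Fit y(x) ~ sum_k a_k * phi((x - t_k) / width) by gradient descent
    on particle weights a_k and positions t_k (phi is a Gaussian bump,
    so this is a toy sparse deconvolution)."""
    rng = np.random.default_rng(seed)
    a = 0.1 * np.ones(m)                           # initial weights
    t = rng.uniform(grid.min(), grid.max(), m)     # initial positions
    losses = []
    for _ in range(steps):
        d = grid[None, :] - t[:, None]             # (m, n) offsets
        phi = np.exp(-d**2 / (2 * width**2))       # bumps on the grid
        resid = a @ phi - y                        # model minus data
        losses.append(0.5 * (resid**2).sum())
        ga = phi @ resid                           # dL/da_k
        gt = a * ((d / width**2 * phi) @ resid)    # dL/dt_k (chain rule)
        a -= lr * ga
        t -= lr * gt
    return a, t, losses

grid = np.linspace(0.0, 1.0, 100)
target = np.exp(-(grid - 0.3)**2 / 0.02) + 0.5 * np.exp(-(grid - 0.7)**2 / 0.02)
a, t, losses = particle_fit(target, grid)
```

The mean-field and conic refinements discussed in the course modify exactly this kind of dynamics (e.g. multiplicative updates on the weights, or added noise turning the flow into a Langevin-type SDE).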

Stein Variational Gradient Descent (SVGD) has emerged as a popular particle-based sampling algorithm. While its infinite-particle, mean-field limit is well understood, its finite-particle behavior is not. In this talk, I will introduce Virtual Particle SVGD (VP-SVGD), which is computationally efficient and provably converges fast to the target in the finite-particle regime, the first algorithm with such a guarantee. The main idea is to employ unbiased stochastic approximation in the space of probability distributions and to use it to track the evolution of the KL divergence to the target distribution.
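For orientation, here is a minimal sketch of plain SVGD in one dimension (not the VP-SVGD algorithm of the talk): each particle follows a kernel-smoothed gradient of log p plus a repulsion term that keeps the particle cloud spread out.

```python
import numpy as np

def svgd(grad_log_p, x0, steps=1000, eps=0.05):
    """Plain 1-D SVGD with an RBF kernel and the median bandwidth heuristic."""
    x = np.array(x0, dtype=float)
    n = len(x)
    for _ in range(steps):
        diff = x[:, None] - x[None, :]                 # diff[i, j] = x_i - x_j
        h = np.median(np.abs(diff)) ** 2 / np.log(n + 1) + 1e-8
        k = np.exp(-diff**2 / (2 * h))                 # RBF kernel matrix
        drive = k @ grad_log_p(x) / n                  # mean_j k(x_j, x_i) grad log p(x_j)
        repulse = -(k * diff).sum(axis=0) / (n * h)    # mean_j grad_{x_j} k(x_j, x_i)
        x += eps * (drive + repulse)
    return x

# Target N(0, 1); the particles start far from the mode and are pulled back.
x = svgd(lambda z: -z, np.linspace(2.0, 4.0, 30))
```

The finite-particle question the talk addresses is precisely how fast the empirical distribution of such an `x` approaches the target, and VP-SVGD modifies the update to make that rate provable.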

The sample complexity of sequential or active testing problems can often be determined by simple information-theoretic arguments, but not always. We will discuss in this talk some paradigmatic situations, some of which are well understood and some of which remain to be investigated.

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal O(1/t) rate, both in expectation and with high probability. Our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation, and show that this variant fares favourably in problems with ill-conditioned features.
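The scheme is easy to state in the simplest (tabular) special case; the talk's results concern linear function approximation, of which this is an instance with one-hot features. A sketch, with setup invented for illustration: run constant-step TD(0) on a small Markov reward process and average only the second half of the iterates, which damps both the initial bias and the noise.

```python
import numpy as np

def tail_averaged_td(P, r, gamma, steps=20000, alpha=0.05, seed=0):
    """Tabular TD(0) with tail-averaging over the last half of the iterates."""
    rng = np.random.default_rng(seed)
    n = len(r)
    v = np.zeros(n)           # running TD iterate
    tail = np.zeros(n)        # accumulator for the tail average
    s = 0
    for t in range(steps):
        s_next = rng.choice(n, p=P[s])
        v[s] += alpha * (r[s] + gamma * v[s_next] - v[s])  # TD(0) update
        if t >= steps // 2:
            tail += v
        s = s_next
    return tail / (steps - steps // 2)

# A 3-state chain; the true value function solves v = r + gamma P v.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])
gamma = 0.9
v_star = np.linalg.solve(np.eye(3) - gamma * P, r)
v_hat = tail_averaged_td(P, r, gamma)
```

Averaging only the tail discards the early iterates, which are dominated by the initial error; this is the mechanism behind the sharper bias decay mentioned in the abstract.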

Tuesday, 04 July 2023

Many data processing problems can be cast as minimizing a convex functional over (a subset of) the space of signed measures. These include problems as diverse as training two-layer neural networks, sparse spikes deconvolution, Wasserstein barycenters, and trajectory inference. In this mini-course I will present recent conceptual tools and theoretical guarantees pertaining to "particle methods", a versatile and practical class of approaches to such problems in which the unknown measure is parameterized as a sum of Dirac masses. We will cover fixed-support methods, conic particle methods, and noisy gradient methods, and discuss their links with Wasserstein geometry and with (generalized) Langevin dynamics for sampling. Concepts from Optimal Transport, Stochastic Differential Equations (SDEs) and Partial Differential Equations (PDEs) will be introduced and used throughout the course.


Graph alignment is a generic high-dimensional statistical inference problem with many applications, ranging from social network data de-anonymization to separation of functional regions in cortical surfaces. The aim of this short course is to present several recent theoretical results on graph alignment when the graphs to be aligned come from a specific generative model, namely the correlated Erdős–Rényi random graph model. Specifically, we shall first present results on information-theoretic limits to feasibility of graph alignment: below some threshold, the observed graphs do not contain enough information for alignment to be feasible, while above that threshold, some algorithms are known to succeed at alignment, although in exponential time. We shall then present results on computational feasibility of alignment, describing a second threshold which determines when a family of ‘local’, polynomial-time algorithms succeed at alignment. Together, these results show a rich phenomenology for graph alignment, displaying an ‘impossible phase’, a ‘hard phase’ and an ‘easy phase’ for feasibility of the task.

Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed entries of the matrix are “missing completely at random” (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of “latent confounders”, i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. For example, in the context of movie recommender systems (a canonical application for matrix completion), a user who vehemently dislikes horror films is unlikely to ever watch horror films. In general, these confounders yield “missing not at random” (MNAR) data, which can severely impact any inference procedure. We develop a formal causal model for matrix completion through the language of potential outcomes, and provide novel identification arguments for a variety of causal estimands of interest. We combine a nearest neighbors approach for matrix completion, popularly known as collaborative filtering, with the synthetic controls approach for panel data, to design a simple two-step algorithm, which we call “synthetic nearest neighbors” (SNN), to estimate these causal estimands. We prove entry-wise finite-sample consistency and asymptotic normality of the SNN estimator for matrix completion with MNAR data.

Lastly, we show how algorithms for reinforcement learning, where the data is collected offline and there is potentially unobserved confounding in how actions are picked, can be effectively designed through the lens of causal tensor completion, an extension of matrix completion to higher dimensions.
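To illustrate the nearest-neighbours building block (the collaborative-filtering step, not the full SNN estimator), here is a bare-bones sketch with a hypothetical helper: estimate a missing entry by averaging that column over the rows most similar to the target row on commonly observed columns.

```python
import numpy as np

def nn_impute(A, mask, i, j, k=1):
    """Estimate the missing entry A[i, j] from the k nearest rows.

    A row u is a candidate neighbour if it observes column j; similarity
    is mean squared distance over columns observed by both rows i and u.
    """
    dists = []
    for u in range(A.shape[0]):
        if u == i or not mask[u, j]:
            continue                       # neighbour must observe column j
        common = mask[i] & mask[u]         # columns both rows observe
        common[j] = False
        if common.sum() == 0:
            continue
        d = np.mean((A[i, common] - A[u, common]) ** 2)
        dists.append((d, u))
    dists.sort()
    rows = [u for _, u in dists[:k]]
    return np.mean([A[u, j] for u in rows])

# Toy check: rows come in identical pairs, so the nearest neighbour is exact.
A = np.array([[1., 2., 3., 4.],
              [1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [5., 6., 7., 8.]])
mask = np.ones((4, 4), dtype=bool)
mask[0, 2] = False                         # pretend A[0, 2] is unobserved
est = nn_impute(A, mask, 0, 2)
```

The SNN algorithm of the talk replaces this naive averaging with a synthetic-controls regression step, which is what makes it robust to MNAR missingness.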

Motivated by applications in Reinforcement Learning (RL), this talk focuses on the Stochastic Approximation (SA) method to find fixed points of a contractive operator. First proposed by Robbins and Monro, SA is a popular approach for solving fixed point equations when the information is corrupted by noise. We consider the SA algorithm for operators that are contractive under arbitrary norms (especially the l-infinity norm). We present finite sample bounds on the mean square error, which are established using a Lyapunov framework based on infimal convolution and the generalized Moreau envelope. We then present our more recent result on concentration of the tail error, even when the iterates are not bounded by a constant. These tail bounds are obtained using exponential supermartingales in conjunction with the Moreau envelope and a novel bootstrapping approach. Our results immediately imply the state-of-the-art sample complexity results for a large class of RL algorithms.
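The basic Robbins–Monro recursion the talk analyzes is short enough to state in full. A scalar sketch under invented parameters: iterate x ← x + a_k (T(x) − x + w_k) for a contractive T observed through zero-mean noise w_k, with tapered step sizes.

```python
import numpy as np

def stochastic_approximation(T, x0, steps=50000, seed=0, noise=0.1):
    """Robbins-Monro iteration for a noisy fixed-point equation x = T(x)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(1, steps + 1):
        a = 1.0 / (k + 10)                    # tapered step sizes ~ 1/k
        w = noise * rng.standard_normal()     # zero-mean observation noise
        x += a * (T(x) - x + w)
    return x

T = lambda x: 0.5 * np.cos(x) + 1.0           # a 0.5-contraction on R

x_star = 1.0
for _ in range(100):                          # noiseless fixed-point iteration
    x_star = T(x_star)

x_hat = stochastic_approximation(T, x0=0.0)
```

Classical analyses of this recursion rely on the norm being Euclidean; the Lyapunov framework of the talk (infimal convolution with a generalized Moreau envelope) is what extends the finite-sample analysis to contractions in arbitrary norms such as l-infinity.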

Artificial intelligence (AI) has focused on a paradigm in which intelligence inheres in a single, autonomous agent. Social issues are entirely secondary in this paradigm. When AI systems are deployed in social contexts, however, the overall design of such systems is often naive---a centralized entity provides services to passive agents and reaps the rewards. Such a paradigm need not be the dominant paradigm for information technology. In a broader framing, agents are active, they are cooperative, and they wish to obtain value from their participation in learning-based systems. Agents may supply data and other resources to the system, only if it is in their interest to do so. Critically, intelligence inheres as much in the overall system as it does in individual agents, be they humans or computers. This is a perspective familiar in the social sciences, and a first goal in this line of work is to bring economics into contact with the computing and data sciences. The long-term goal is two-fold---to provide a broader conceptual foundation for emerging real-world AI systems, and to upend received wisdom in the computational, economic, and inferential disciplines.

Wednesday, 05 July 2023

Artificial intelligence (AI) has focused on a paradigm in which intelligence inheres in a single, autonomous agent. Social issues are entirely secondary in this paradigm. When AI systems are deployed in social contexts, however, the overall design of such systems is often naive---a centralized entity provides services to passive agents and reaps the rewards. Such a paradigm need not be the dominant paradigm for information technology. In a broader framing, agents are active, they are cooperative, and they wish to obtain value from their participation in learning-based systems. Agents may supply data and other resources to the system, only if it is in their interest to do so. Critically, intelligence inheres as much in the overall system as it does in individual agents, be they humans or computers. This is a perspective familiar in the social sciences, and a first goal in this line of work is to bring economics into contact with the computing and data sciences. The long-term goal is two-fold---to provide a broader conceptual foundation for emerging real-world AI systems, and to upend received wisdom in the computational, economic, and inferential disciplines.

Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7 billion parameter version of GPT-3 using only 7% of its pretraining compute budget, and with some asterisks, we get a performance comparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I'll end with several concrete key mathematical research questions whose resolutions will have incredible impact on how practitioners scale up their NNs.

In this talk, we discuss several contemporary aspects of risk-aware multi-armed bandits. First, we design and analyze Thompson sampling-style algorithms for mean-variance MAB and provide comprehensive regret analyses for Gaussian and Bernoulli bandits. Our algorithms achieve the best known regret bounds for mean-variance MABs and also attain the information-theoretic bounds in some parameter regimes. Second, we seek to identify the best arm under the fixed-confidence setup and under a stringent constraint that the variance of the chosen arm is strictly smaller than a given threshold. An algorithm VA-LUCB is developed and its sample complexity is shown to be characterized by a fundamental variance-aware hardness quantity H_{VA}. By proving an information-theoretic lower bound, we show that the sample complexity of VA-LUCB is optimal up to a factor logarithmic in H_{VA}. Finally, we formulate a probably anytime-safe stochastic combinatorial semi-bandits problem in which the agent is given the option to select a subset of size at most K from a set of L ground items. Each item is associated with a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that, with probability at least 1−δ over the entire horizon of time T, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget.
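A toy sketch in the spirit of the first part (not the talk's algorithm, and with invented arm parameters): Thompson-sampling-flavoured play for the mean-variance objective MV(arm) = mean − ρ·variance, sampling a posterior mean for each arm and plugging in the empirical variance.

```python
import numpy as np

def mv_thompson(means, sds, rho=1.0, horizon=2000, seed=0):
    """Mean-variance Thompson-style play on Gaussian arms; returns pull counts."""
    rng = np.random.default_rng(seed)
    K = len(means)
    rewards = [[] for _ in range(K)]
    for a in range(K):                       # pull each arm twice to initialize
        for _ in range(2):
            rewards[a].append(rng.normal(means[a], sds[a]))
    counts = [2] * K
    for _ in range(horizon - 2 * K):
        scores = []
        for a in range(K):
            n = counts[a]
            mu_hat = np.mean(rewards[a])
            var_hat = np.var(rewards[a])
            mu_tilde = rng.normal(mu_hat, 1.0 / np.sqrt(n))  # posterior sample
            scores.append(mu_tilde - rho * var_hat)          # sampled MV score
        a = int(np.argmax(scores))
        rewards[a].append(rng.normal(means[a], sds[a]))
        counts[a] += 1
    return counts

# Arm 0 has a higher mean but much higher risk; with rho = 1 the
# risk-adjusted objective favours arm 1 (0.8 - 0.01 vs 1.0 - 4.0).
counts = mv_thompson(means=[1.0, 0.8], sds=[2.0, 0.1], rho=1.0)
```

The regret analyses in the talk quantify how quickly such a sampler concentrates its pulls on the arm with the best mean-variance trade-off.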

We present results on fitting manifolds to data where the standard deviation of the noise is large compared to the reach of the manifold being fit. This is joint work with Charles Fefferman, Sergei Ivanov and Matti Lassas.


Thursday, 06 July 2023

Artificial intelligence (AI) has focused on a paradigm in which intelligence inheres in a single, autonomous agent. Social issues are entirely secondary in this paradigm. When AI systems are deployed in social contexts, however, the overall design of such systems is often naive---a centralized entity provides services to passive agents and reaps the rewards. Such a paradigm need not be the dominant paradigm for information technology. In a broader framing, agents are active, they are cooperative, and they wish to obtain value from their participation in learning-based systems. Agents may supply data and other resources to the system, only if it is in their interest to do so. Critically, intelligence inheres as much in the overall system as it does in individual agents, be they humans or computers. This is a perspective familiar in the social sciences, and a first goal in this line of work is to bring economics into contact with the computing and data sciences. The long-term goal is two-fold---to provide a broader conceptual foundation for emerging real-world AI systems, and to upend received wisdom in the computational, economic, and inferential disciplines.

In this talk, we discuss several contemporary aspects of risk-aware multi-armed bandits. First, we design and analyze Thompson sampling-style algorithms for mean-variance MAB and provide comprehensive regret analyses for Gaussian and Bernoulli bandits. Our algorithms achieve the best known regret bounds for mean-variance MABs and also attain the information-theoretic bounds in some parameter regimes. Second, we seek to identify the best arm under the fixed-confidence setup and under a stringent constraint that the variance of the chosen arm is strictly smaller than a given threshold. An algorithm VA-LUCB is developed and its sample complexity is shown to be characterized by a fundamental variance-aware hardness quantity H_{VA}. By proving an information-theoretic lower bound, we show that the sample complexity of VA-LUCB is optimal up to a factor logarithmic in H_{VA}. Finally, we formulate a probably anytime-safe stochastic combinatorial semi-bandits problem in which the agent is given the option to select a subset of size at most K from a set of L ground items. Each item is associated with a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that, with probability at least 1−δ over the entire horizon of time T, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget.

We consider a novel formulation of the dynamic pricing and demand learning problem, where the evolution of demand in response to posted prices is governed by a stochastic variant of the popular Bass model with parameters (α, β) that are linked to the so-called "innovation" and "imitation" effects. Finding a revenue-maximizing dynamic pricing policy in this model is non-trivial even when model parameters are known and requires solving for the optimal non-stationary policy of a continuous-time, continuous-state Markov Decision Process. In this paper, we consider the problem of dynamic pricing in conjunction with learning the model parameters, with the objective of optimizing the cumulative revenues over a given selling horizon. Our main contribution is an algorithm with a regret guarantee of O(m^{2/3}), where m is mnemonic for the (known) market size, along with a matching lower bound.
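To fix ideas, here is a minimal simulation of the stochastic Bass model without the pricing control (in the talk, the adoption rate additionally depends on the posted price): with N(t) adopters so far out of a market of size m, the next adoption arrives at exponential rate (m − N)(α + β N/m), combining the innovation term α and the imitation term β N/m.

```python
import numpy as np

def simulate_bass(m, alpha, beta, seed=0):
    """Simulate the stochastic Bass adoption process until the market saturates.

    Returns the (m + 1) adoption times, starting with t = 0 at N = 0.
    """
    rng = np.random.default_rng(seed)
    t, N = 0.0, 0
    times = [0.0]
    while N < m:
        rate = (m - N) * (alpha + beta * N / m)  # current adoption intensity
        t += rng.exponential(1.0 / rate)         # waiting time to next adoption
        N += 1
        times.append(t)
    return np.array(times)

times = simulate_bass(m=500, alpha=0.01, beta=0.5)
```

The characteristic S-shaped adoption curve emerges because the rate is small both early (N small, imitation weak) and late (few non-adopters left); a pricing policy modulates this rate, which is what makes the control problem non-stationary.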

Friday, 07 July 2023

Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7 billion parameter version of GPT-3 using only 7% of its pretraining compute budget, and with some asterisks, we obtained performance comparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I'll end with several concrete key mathematical research questions whose resolutions will have incredible impact on how practitioners scale up their NNs.


Running a random walk in a convex body K ⊆ Rⁿ is a standard approach to sample approximately uniformly from the body. The requirement is that from a suitable initial distribution, the distribution of the walk comes close to the uniform distribution π on K after a number of steps polynomial in the dimension n and the aspect ratio R/r (i.e., when the body is contained in a ball of radius R and contains a ball of radius r).

Proofs of rapid mixing of such walks often require that the initial distribution from which the random walk starts should be somewhat diffuse: formally, the probability density η₀ of the initial distribution with respect to π should be at most polynomial in the dimension n: this is called a "warm start". Achieving a warm start often requires non-trivial pre-processing before starting the random walk.

This motivates proving rapid mixing from a "cold start", where the initial density η₀ with respect to π can be exponential in the dimension n. Unlike warm starts, a cold start is usually trivial to achieve. However, a random walk need not mix rapidly from a cold start: an example being the well-known "ball walk". On the other hand, Lovász and Vempala proved that the "hit-and-run" random walk mixes rapidly from a cold start. For the related coordinate hit-and-run (CHR) walk, which has been found to be promising in computational experiments, rapid mixing from a warm start was proved only recently but the question of rapid mixing from a cold start remained open.

We construct a family of random walks inspired by the classical Whitney decomposition of subsets of Rⁿ into countably many axis-aligned dyadic cubes. We show that even with a cold start, the mixing times of these walks are bounded by a polynomial in n and the aspect ratio. Our main technical ingredient is an isoperimetric inequality for K for a metric that magnifies distances between points close to the boundary of K. As a corollary, we show that the coordinate hit-and-run walk also mixes rapidly both from a cold start and even from any initial point not too close to the boundary of K.
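To make the coordinate hit-and-run move concrete, here is a minimal sketch of the walk in the unit Euclidean ball (a simple body chosen for illustration; the walk is defined for any convex body with a chord oracle, and this sketch starts at the centre rather than from the cold starts analyzed above): pick a random coordinate, compute the chord of the body through the current point along that axis, and jump to a uniform point on the chord.

```python
import numpy as np

def chr_walk(n, steps=20000, seed=0):
    """Coordinate hit-and-run in K = {x in R^n : ||x|| <= 1}.

    Along axis i through the point x, the chord is the set of values x_i'
    with (x_i')^2 <= 1 - sum_{j != i} x_j^2; we resample x_i uniformly on it.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n)                          # start at the centre of the ball
    samples = []
    for _ in range(steps):
        i = rng.integers(n)                  # random coordinate direction
        c = 1.0 - (x**2).sum() + x[i]**2     # chord half-length squared
        x[i] = rng.uniform(-np.sqrt(c), np.sqrt(c))
        samples.append(x.copy())
    return np.array(samples)

S = chr_walk(n=5)
```

Every iterate stays inside the body by construction, and after mixing the empirical distribution of the iterates approximates the uniform distribution π on K; the results above bound how many steps this takes as a function of n, the aspect ratio, and the warmth of the start.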

Joint work with Hariharan Narayanan (TIFR) and Amit Rajaraman (IIT Bombay).