Seminar of Statistics              stylocafe                

CMAP, Ecole Polytechnique, IP Paris
Cristina Butucea
Karim Lounici
Alexandre B. Tsybakov
Zoltan Szabo


Sept 14:

Anna Korba

A Non Asymptotic Analysis of Stein Variational Gradient Descent

Abstract : We study the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution π e^(−V) on ℝ^d. In the population limit, SVGD performs gradient descent in the space of probability distributions on the KL divergence with respect to π, where the gradient is smoothed through a kernel integral operator. In this paper, we provide a novel finite time analysis for the SVGD algorithm.

We provide a descent lemma establishing that the algorithm decreases the objective at each iteration, and rates of convergence.

We also provide a convergence result of the finite particle system corresponding to the practical implementation of SVGD to its population version.

Sept 21

Jaouad Mourtada

Statistical leverage in prediction: from least squares to logistic regression

Abstract: We consider statistical learning/prediction problems: based on an iid sample, one aims to find a good predictor of a response given some features. In this context, an input sample has high leverage (relative to some learning algorithm or class of predictors) if the prediction at this point is sensitive to the associated output.
In the context of random-design linear prediction with square loss, we show that the hardness of the problem is characterized by the distribution of leverage scores of feature points. As an application, in high dimension Gaussian design (with nearly constant leverage) is seen to be nearly most favorable through lower bounds on the minimax risk, as well as refinements depending on the signal-to-noise ratio.
We then turn to conditional density estimation with entropy risk. We study this problem in a statistical learning framework, where given a class of conditional densities (a model), the goal is to find estimators that predict almost as well as the best distribution in the model, without restrictive assumptions on the true distribution. We introduce a new estimator, given by the solution of some min-max problem involving 'virtual' samples, which quantifies uncertainty according to a notion of leverage. In the case of logistic regression, this procedure achieves fast learning rates under weak assumptions, improving over within-model estimators such as regularized MLE; it also admits a simple form and does not require posterior sampling, providing a computationally less demanding alternative to Bayesian approaches.
Sept 28

Tor Lattimore

DeepMind, London
Improved regret for zeroth-order adversarial bandit convex optimisation

: I will explain recent advances in adversarial bandit convex optimisation, improving the best known minimax regret from O(d^(9.5) sqrt(n) log(n)^(7.5)) to O(d^(2.5) sqrt(n) og(n)) where d is the dimension and n is the number of interactions. The new analysis combines minimax duality with the information-theoretic Bayesian regret analysis by Russo and Van Roy as well as basic tools from asymptotic convex geometry. A preprint of this work is available at
Oct 5

Chao Gao

University of Chicago
Iterative Algorithm for Discrete Structure Recovery

We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, signs of regression coefficients, cyclic shifts, and even group elements from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd's algorithm and the power method. A linear convergence result for the proposed algorithm is established in this paper under appropriate abstract conditions on stochastic errors and initialization. We illustrate our general theory by applying it on several representative problems: (1) clustering in Gaussian mixture model, (2) approximate ranking, (3) sign recovery in compressed sensing, (4) multireference alignment, and (5) group synchronization, and show that minimax rate is achieved in each case.

Oct 12
Séminaire Parisien
Oct 19 

Rémi Flamary

CO-Optimal Transport : Optimal Transport across non registered space

Optimal transport (OT) is a powerful geometric and probabilistic tool for finding correspondences and measuring similarity between two distributions. Yet, its original formulation relies on the existence of a cost function between the samples of the two distributions, which makes it impractical for comparing data distributions supported on different topological spaces. To circumvent this limitation, we propose a novel OT problem, named COOT for CO-Optimal Transport, that aims to
simultaneously optimize two transport maps between both samples and features. This is different from other approaches that either discard the individual features by focusing on pairwise distances (e.g. Gromov-Wasserstein) or need to model explicitly the relations between the features. COOT leads to interpretable correspondences between both samples and feature representations and holds metric properties. We provide a theoretical analysis of our framework and establish rich connections with the Gromov-Wasserstein distance. We demonstrate its versatility with two machine learning applications in heterogeneous domain adaptation and co-clustering/data summarization, where COOT leads to performance improvements over the competing state-of-the-art methods.
Oct 26
Pas de séminaire
Nov 2

Yuhao Wang

Tsinghua University

Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders

We consider estimation of average treatment effects given observational data with high-dimensional pretreatment variables. Existing methods for this problem typically assume some form of sparsity for the regression functions. In this work, we introduce a debiased inverse propensity score weighting (DIPW) scheme for average treatment effect estimation that delivers $\sqrt{n}$ consistent estimates of the average treatment effect when the propensity score follows a sparse logistic regression model; the regression functions are permitted to be arbitrarily complex. Given the lack of assumptions on the regression functions, averages of transformed responses under each treatment may also be estimated at the $\sqrt{n}$ rate, and so for example, the variances of the responses may be estimated. We show how confidence intervals centred on our estimates may be constructed, and also extend the method to estimate heterogeneous treatment effects. This is joint work with Rajen Shah from the University of Cambridge.
Nov 9

Gil Kur


On the Minimax Optimality of the Maximum Likelihood in the non-Donsker regime

The minimax optimality of the Maximum Likelihood Estimator (MLE) is a fundamental question in mathematical statistics. For Donsker classes, it is well-known that the MLE is minimax optimal. However, in the non-Donsker regime, the MLE could be minimax sub-optimal. In this talk, we present new techniques to evaluate the statistical performance of the MLE in the non-Donsker regime. As an application, we demonstrate that the log-concave MLE is optimal in all dimensions. In comparison, for convex regression, the MLE can be suboptimal when d \ge 5. This talk is based on joint work with Dagan, Gao, Guntuboyina, Rakhlin and Sen.
Nov 16

Aaditya Ramdas

Carnegie Melon University

Dimension-agnostic inference

Classical asymptotic theory for statistical hypothesis testing, for example Wilks' theorem for likelihood ratios, usually involves calibrating the test statistic by fixing the dimension d while letting the sample size n increase to infinity. In the last few decades, a great deal of effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where d_n and n both increase to infinity together at some prescribed relative rate. This often leads to different tests in the two settings, depending on the assumptions about the dimensionality. This leaves the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming n >> d, or d_n/n \approx 0.2? This talk considers the goal of dimension-agnostic inference---developing methods whose validity does not depend on any assumption on d_n. I will summarize results from 3 papers that achieve this goal in parametric and nonparametric settings.
(A) Universal inference (PNAS'20) --- we describe the split/crossfit likelihood ratio test whose validity holds nonasymptotically without any regularity conditions.
(B) Classification accuracy as a proxy for two-sample testing (Annals of Statistics'21)
(C) Dimension-agnostic inference ( We describe a generic approach that uses variational representations of existing test statistics along with sample splitting and self-normalization (studentization) to produce a Gaussian limiting null distribution. We exemplify this technique for a handful of classical problems, such as one-sample mean testing, testing if a covariance matrix equals the identity, and kernel methods for testing equality of distributions using degenerate U-statistics like the maximum mean discrepancy. Without explicitly targeting the high-dimensional setting, our tests are shown to be minimax rate-optimal, meaning that the power of our tests cannot be improved further up to a constant factor of \sqrt{2}.
This is primarily joint work with several excellent coauthors including Ilmun Kim (postdoc, Cambridge), Larry Wasserman, Sivaraman Balakrishnan and Aarti Singh.
Nov 23
Séminaire Parisien
Nov 30

Matthieu Lerasle


Optimal Change-Point Detection and Localization

Given a times series Y in R n, with a piece-wise contant mean and independent components, the twin problems of change-point detection and change-point localization respectively amount to detecting the existence of times where the mean varies and estimating the positions of those change-points. In this work, we tightly characterize optimal rates for both problems and uncover the phase transition phenomenon from a global testing problem to a local estimation problem. Introducing a suitable definition of the energy of a change-point, we first establish in the single change-point setting that the optimal detection threshold is p 2 log log(n). When the energy is just above the detection threshold, then the problem of localizing the change-point becomes purely parametric: it only depends on the difference in means and not on the position of the change-point anymore. Interestingly, for most change-point positions, including all those away from the endpoints of the time series, it is possible to detect and localize them at a much smaller energy level. In the multiple change-point setting, we establish the energy detection threshold and show similarly that the optimal localization error of a specific change-point becomes purely parametric. Along the way, tight optimal rates for Hausdorff and l1 estimation losses of the vector of all change-points positions are also established. Two procedures achieving these optimal rates are introduced. The first one is a least-squares estimator with a new multiscale penalty that favours well spread change-points. The second one is a two-step multiscale post-processing procedure whose computational complexity can be as low as O(n log(n)). Notably, these two procedures accommodate with the presence of possibly many low-energy and therefore undetectable change-points and are still able to detect and localize high-energy change-points even with the presence of those nuisance parameters.​
Dec 7
Pas de séminaire
Dec 14
Pas de séminaire
Jan 4
Séminaire Parisien
Jan 11

Nicolas Schreuder


A minimax framework for quantifying risk-fairness trade-off in regression

Abstract : A theoretical framework is proposed for the problem of learning a real-valued function which meets fairness requirements. This framework is built upon the notion of alpha-relative (fairness) improvement of the regression function which we introduce using the theory of optimal transport. Setting alpha = 0 corresponds to the regression problem under the Demographic Parity constraint, while alpha = 1 corresponds to the classical regression problem without any constraints. For alpha \in (0, 1) the proposed framework allows to continuously interpolate between these two extreme cases and to study partially fair predictors. Within this framework we precisely quantify the cost in risk induced by the introduction of the fairness constraint. We put forward a statistical minimax setup and derive a general problem-dependent lower bound on the risk of any estimator satisfying alpha-relative improvement constraint. We illustrate our framework on a model of linear regression with Gaussian design and systematic group-dependent bias, deriving matching (up to absolute constants) upper and lower bounds on the minimax risk under the introduced constraint. This talk is based on a joint work with Evgenii Chzhen, see [arXiv:2007.14265].
Jan 18

Mayya Zhylova

Georgia Tech
Higher-order accuracy of the bootstrap and the normal approximation on the set of all Euclidean balls

Abstract: In this talk I would like to discuss the problem of establishing a higher-order accuracy of bootstrapping procedures and (non-)normal approximation in a multivariate or high-dimensional setting. This topic is important for numerous problems in statistical inference and applications concerned with confidence estimation and hypothesis testing, and involving a growing dimension of random data or unknown parameter. In particular, I will focus on higher-order expansions for the uniform distance over the set of all Euclidean balls. The talk will include an overview of the latest results on this topic, and examples of statistical problems where these results lead to improvements in an approximation accuracy.
Jan 25
Pas de séminaire
Feb 1 - 11h00

Pierre Alquier

RIKEN, Tokyo

Parametric estimation via MMD optimization: robustness to outliers and dependence

In  this  talk,  I  will  study  the  properties  of  parametric  estimators  based  on  the  Maximum Mean Discrepancy (MMD) defined by Briol et al.  (2019).  In a first time,  I will show that these estimators are universal in the i.i.d setting: even in case of misspecification, they converge to the best approximation of the distribution of the data in the model, without ANY assumption on this model. This leads to very strong robustness properties. In a second time,I will show that these results remain valid when the data is not independent,  but satisfy instead a weak-dependence condition. This condition is based on a new dependence coefficient, which is itself defined thanks to the MMD. I will show through examples that this new notion of dependence is actually quite general.  This talk is based on the following papers and softwares, with Badr-Eddine Chérief Abdellatif (Oxford University), Mathieu Gerber(University of Bristol), Jean-David Fermanian (ENSAE Paris) and Alexis Derumigny (University of Twente):
Feb 8
Séminaire Parisien
Feb 15 - 17h00

Jiantao Jiao

UC Berkeley

Sharp Minimax Rates for Imitation Learning

Abstract: We establish sharp minimax bounds on Imitation Learning (IL) in episodic Markov Decision Processes (MDPs), where the learner is provided a dataset of demonstrations from an expert. It is known that Behavior Cloning (BC) achieves suboptimality growing quadratically in horizon, which is termed as error compounding in the literature. We show that when the MDP transition function is unknown, all algorithms have to suffer a suboptimality that grows quadratically with the horizon, even if the algorithm can interactively query the expert such as in the setting of DAGGER. We then propose the setting of known transitions and show that one can provably break the quadratic dependence and improve the exponent to 3/2, which is shown to be tight. Our upper bound is established using a computationally efficient algorithm which we name as Mimic-MD, and the lower bound is established by proving a two-way reduction between IL and the value estimation problem of the unknown expert policy under any given reward function, as well as linear functional estimation with subsampled observations. We further show that under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term as Mimic-Mixture, that provably achieves suboptimality independent of the horizon for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, no algorithm can achieve suboptimality growing slower than the square root of the horizon with high probability if the expert is not constrained to be optimal. We formally establish the benefit of expert optimal assumption in the known transition setting and show that this additional assumption does not help when the transition functions are unknown.

Based on joint work with Nived Rajaraman, Yanjun Han, Lin F. Yang, and Kannan Ramchandran.
Feb 22 Pas de séminaire
March 1

Gabriel Peyré


Scaling Optimal Transport for High dimensional Learning   

Abstract: Optimal transport (OT) has recently gained lot of interest in machine learning. It is a natural tool to compare in a geometrically faithful way probability distributions. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is however plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will explain how to leverage entropic regularization methods to define computationally efficient loss functions, approximating OT with a better sample complexity. More information and references can be found on the website of our book "Computational Optimal Transport"
March 8

Matteo Pirotta

Facebook AI

Introduction to exploration-exploitation in reinforcement learning 

One of the major challenges in online reinforcement learning (RL) is to trade off between exploration of the environment to gather information and exploitation of the samples observed so far to perform a near-optimal policy. While the exploration-exploitation trade-off has been widely studied in the multi-armed bandit literature, the RL setting poses specific challenges due to the dynamical nature of the environment.
In this seminar, we will review basic notions of RL (i.e., Markov decision process, value function, value iteration) and we will introduce the regret minimization problem in the finite-horizon setting in environments with finite states and actions. Then we will study how algorithmic principles, such as optimism in face of uncertainty, are instantiated in RL and what are their theoretical guarantees. Finally, we will briefly discuss the most recent results on the topic and the remaining open questions.

March 15

Alessandro Lazaric

Facebook AI

Exploration-exploitation in reinforcement learning with function approximation 

Most of recent reinforcement learning algorithms combine standard dynamic programming algorithms (e.g., value iteration) with advanced function approximation techniques (notably, adapted from deep learning). Nonetheless, the effect of function approximation on the online performance of RL algorithms is still relatively poorly understood.
In this seminar, we will review how exploration-exploitation techniques can be paired with RL function approximation methods in environments with continuous state-action spaces. In particular, we will focus on linear approximations and show how they impact the learning performance. We will also review the different assumptions that have been used in this space and what are the major open questions.

March 22
Séminaire Parisien
March 29 - 16h00 CET

Csaba Szepesvari

University of Alberta
Towards Efficient Planning in Large Markov Decision Processes

Abstract: During the last decade reinforcement learning algorithms achieved outcomes far beyond expectations and not only in one area, but in a diverse set of areas, such as control, robotics, electronics, or computer games, just to list a few. What is common in the algorithms powering these breakthroughs that they all use powerful computers, simulators and reinforcement learning algorithms with one common key component: function approximation. Formally, these algorithms solve planning problems in large Markov decision processes where intractabality is avoided by using function approximation. Why, when and how these algorithms succeed is the central questions of this talk, in which I report on recent theoretical advances made on large scale planning in the presence of function approximation. The emerging picture is intriguingly complex: seemingly small changes to the way a problem is posed can cause a provably tractable problem to become intractable and vice versa. In this talk I will summarize the most striking of these results and will discuss a few of fascinating open questions.
April 5
Pas de séminaire
April 12 Séminaire Parisien
April 19

Bharath Sriperumbudur

Pennsylvania State University
Johnson & Lindenstrauss meet Hilbert at a Kernel
Abstract: Johnson-Lindenstrauss (J-L) lemma is a famous result about the almost-isometric embeddability of a finite set in Euclidean space into a Euclidean space of much smaller dimension. This result has applications in fast dimensionality reduction through random projections and has led to many randomized algorithms that achieve fast matrix computations. In this work, we extend this idea of random projections to an infinite-dimensional reproducing kernel Hilbert space (RKHS), wherein a finite-dimensional representation is obtained for each function in the RKHS by projecting it along certain random directions. Using this finite-dimensional representation, we obtain a J-L type lemma by showing that certain weighted distances and inner products are preserved. We show the random projection in RKHS to be closely related to Gaussian sketching and discuss its implications in speeding up the kernel algorithms.
(Joint work with Samory Kpotufe, Columbia University)
April 26 - 15h00 CET

Bodhisattva Sen

Columbia University

Multivariate Distribution-free Nonparametric Testing using Optimal Transport

Abstract : We propose a general framework for distribution-free nonparametric testing in multi-dimensions, based on a notion of multivariate ranks defined using the theory of optimal transport (see e.g., Villani (2003)). We demonstrate the applicability of this approach by constructing exactly distribution-free tests for two classical nonparametric problems: (i) testing for the equality of two multivariate distributions, and (ii) testing for mutual independence between two random vectors. In particular, we propose (multivariate) rank versions of Hotelling T^2 and kernel two-sample tests (e.g., Gretton et al. (2012), Szekely and Rizzo (2013)), and kernel tests for independence (e.g., Gretton et al. (2007), Szekely et al. (2007)) for scenarios (i) and (ii) respectively. We investigate the consistency and asymptotic distributions of these tests, both under the null and local contiguous alternatives. We also study the local power and asymptotic (Pitman) efficiency of these multivariate tests (based on optimal transport), and show that a subclass of these tests achieve attractive efficiency lower bounds that mimic the remarkable efficiency results of Hodges and Lehmann (1956) and Chernoff and Savage (1958) (for the Wilcoxon-rank sum test). To the best of our knowledge, these are the first collection of multivariate, nonparametric, exactly distribution-free tests that provably achieve such attractive efficiency lower bounds. We also study the rates of convergence of the rank maps (aka optimal transport maps).
May 3
Pas de séminaire
May 10

Gabor Lugosi

Pompeu Fabra University

Mean estimation in high dimensions

Abstract : We consider the problem of estimating the mean of a random vector based on independent, identically distributed observations.  In particular, we are interested in estimators that are close to the true mean vector in all directions. We prove the existence of an estimator that has a near-optimal error in all directions in which the variance of the one dimensional marginal of the random vector is not too small. The random vector may have a heavy-tailed distribution.

The proof relies on novel uniform bounds for the ratio of empirical and true probabilities.

The talk is based on joint work with Shahar Mendelson.

May 17

Larry Wasserman

Carnegie Mellon University

Causal Inference in Time of Covid-19

I will discuss our recent work on developing causal models for estimating the effect of social mobility on deaths from Covid-19. We propose a marginal structural model motivated by a basic epidemic model. We estimate the counterfactual time series of deaths under interventions on mobility. We conduct several types of sensitivity analyses. We find that the data support the idea that reduced mobility causes reduced deaths, but the conclusion comes with caveats. There is evidence of sensitivity to model misspecification and unmeasured confounding which implies that the size of the causal effect needs to be interpreted with caution. While there is little doubt that the effect is real, our work highlights the challenges in drawing causal inferences from pandemic data. This is joint work with Matteo Bonvini, Edward Kennedy and Valerie Ventura.

May 24
Pas de séminaire
May 31

Olivier Guédon

Université Gustav Eiffel
Géométrie de polytopes aléatoires et quelques liens avec le compressed sensing 

Résumé : Nous étudierons la géométrie de polytopes aléatoires conv{ X_1, ... , X_N} lorsque les X_i sont des copies indépendantes d'un vecteur aléatoire X de R^n satisfaisant des hypothèses minimales (il peut en particulier être à queues lourdes). Je présenterai aussi les liens entre cette étude et un problème de stabilité en théorie du compressed sensing.

Travail en commun avec
F. Krahmer, C. Kümmerle, S. Mendelson, H. Rauhut
June 7 - 15h00 CET

Samory Kpotude

Columbia University

Some Recent Insights on Transfer-Learning

A common situation in Machine Learning is one where training data is not fully representative of a target population due to bias in the sampling mechanism or due to prohibitive target sampling costs. In such situations, we aim to ’transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information is in the source data about the target application? Would some amount of target data improve transfer? These are all practical questions that depend crucially on 'how far' the source domain is from the target. However, how to properly measure 'distance' between source and target domains remains largely unclear.
In this talk we will argue that much of the traditional notions of 'distance' (e.g. KL-divergence, extensions of TV such as D_A discrepancy, density-ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of 'relative dimension' between source and target (which we simply term 'transfer-exponents') capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates; they encode relative benefits of source and target samples, and have interesting implications for related problems such as 'multi-task or multi-source learning'.
In particular, in the case of transfer from multiple sources, we will discuss (if time permits) a strong dichotomy between minimax and adaptive rates: no adaptive procedure exists that can achieve the same rates as minimax (oracle) procedures.

The talk is based on earlier work with Guillaume Martinet, and ongoing work with Steve Hanneke.

June 14

Jonathan Niles-Weed

New-York University

The "all-or-nothing" phenomenon in sparse estimation

Abstract: We explore a sharp phase transition known as the "all-or-nothing" phenomenon in estimation problems. This phenomenon arises when there exists a critical signal to noise ratio (SNR) such that below this threshold it is impossible to achieve any positive correlation with the hidden signal, whereas above this threshold it is possible to achieve almost perfect correlation with the hidden signal. This phenomenon has been observed in a few different models and settings, but with no unified explanation. We give a sharp characterization of this phenomenon in the presence of Gaussian noise and give general conditions under which it holds. As a corollary, we obtain the all-or-nothing phenomenon for the sparse tensor PCA, Bernoulli group testing, and the planted Gaussian perceptron problems. Joint work with Ilias Zadik.
June 21 Séminaire Parisien
June 28