
Sept 14:
Anna Korba
ENSAE

A Non Asymptotic Analysis of
Stein Variational Gradient Descent
Abstract : We study the Stein
Variational Gradient Descent (SVGD) algorithm,
which optimises a set of particles to
approximate a target probability distribution
π∝ e^(−V)
on ℝ^d. In the
population limit, SVGD performs gradient
descent in the space of probability
distributions on the KL divergence with
respect to π, where the gradient is smoothed
through a kernel integral operator. In this
paper, we provide a novel finite time analysis
for the SVGD algorithm.
We provide a descent
lemma establishing that the algorithm
decreases the objective at each iteration, and
rates of convergence.
We also provide a
convergence result of the finite particle
system corresponding to the practical
implementation of SVGD to its population
version.

Sept 21
Jaouad Mourtada
ENSAE

Statistical
leverage in prediction: from least squares
to logistic regression
Abstract: We
consider statistical learning/prediction
problems: based on an iid sample, one aims to
find a good predictor of a response given some
features. In this context, an input sample has
high leverage (relative to some learning
algorithm or class of predictors) if the
prediction at this point is sensitive to the
associated output.
In the context of randomdesign linear
prediction with square loss, we show that the
hardness of the problem is characterized by
the distribution of leverage scores of feature
points. As an application, in high dimension
Gaussian design (with nearly constant
leverage) is seen to be nearly most favorable
through lower bounds on the minimax risk, as
well as refinements depending on the
signaltonoise ratio.
We then turn to conditional density estimation
with entropy risk. We study this problem in a
statistical learning framework, where given a
class of conditional densities (a model), the
goal is to find estimators that predict almost
as well as the best distribution in the model,
without restrictive assumptions on the true
distribution. We introduce a new estimator,
given by the solution of some minmax problem
involving 'virtual' samples, which quantifies
uncertainty according to a notion of leverage.
In the case of logistic regression, this
procedure achieves fast learning rates under
weak assumptions, improving over withinmodel
estimators such as regularized MLE; it also
admits a simple form and does not require
posterior sampling, providing a
computationally less demanding alternative to
Bayesian approaches.

Sept 28
Tor Lattimore
DeepMind, London

Improved regret for
zerothorder adversarial bandit convex
optimisation
Abstract: I will explain
recent advances in adversarial bandit convex
optimisation, improving the best known
minimax regret from O(d^(9.5) sqrt(n)
log(n)^(7.5)) to O(d^(2.5) sqrt(n) og(n))
where d is the dimension and n is the number
of interactions. The new analysis combines
minimax duality with the
informationtheoretic Bayesian regret
analysis by Russo and Van Roy as well as
basic tools from asymptotic convex geometry.
A preprint of this work is available at https://arxiv.org/pdf/2006.00475.pdf.

Oct 5
Chao Gao
University of Chicago

Iterative Algorithm
for Discrete Structure Recovery
We propose a general
modeling and algorithmic framework for
discrete structure recovery that can be
applied to a wide range of problems. Under
this framework, we are able to study the
recovery of clustering labels, ranks of
players, signs of regression coefficients,
cyclic shifts, and even group elements from a
unified perspective. A simple iterative
algorithm is proposed for discrete structure
recovery, which generalizes methods including
Lloyd's algorithm and the power method. A
linear convergence result for the proposed
algorithm is established in this paper under
appropriate abstract conditions on stochastic
errors and initialization. We illustrate our
general theory by applying it on several
representative problems: (1) clustering in
Gaussian mixture model, (2) approximate
ranking, (3) sign recovery in compressed
sensing, (4) multireference alignment, and (5)
group synchronization, and show that minimax
rate is achieved in each case.

Oct 12

Séminaire
Parisien

Oct 19
Rémi Flamary
CMAP

COOptimal
Transport : Optimal Transport across non
registered space
Abstract: Optimal
transport (OT) is a powerful geometric and
probabilistic tool for
finding correspondences and measuring
similarity between two distributions.
Yet, its original formulation relies on
the existence of a cost
function between the samples of the two
distributions, which makes it
impractical for comparing data
distributions supported on different
topological spaces. To circumvent this
limitation, we propose a novel
OT problem, named COOT for COOptimal
Transport, that aims to simultaneously
optimize two transport maps between both
samples and features.
This is different from other approaches that
either discard the
individual features by focusing on pairwise
distances (e.g. GromovWasserstein)
or need to model explicitly the relations
between the
features. COOT leads to interpretable
correspondences between both samples
and feature representations and holds metric
properties. We provide a
theoretical analysis of our framework and
establish rich connections
with the GromovWasserstein distance. We
demonstrate its versatility
with two machine learning applications in
heterogeneous domain
adaptation and coclustering/data
summarization, where COOT leads to
performance improvements over the competing
stateoftheart methods.

Oct 26

Pas de
séminaire 
Nov 2
Yuhao Wang
Tsinghua University

Debiased
Inverse Propensity Score Weighting for
Estimation of Average Treatment Effects
with HighDimensional Confounders
We consider
estimation of average treatment effects given
observational data with highdimensional
pretreatment variables. Existing methods for
this problem typically assume some form of
sparsity for the regression functions. In this
work, we introduce a debiased inverse
propensity score weighting (DIPW) scheme for
average treatment effect estimation that
delivers $\sqrt{n}$ consistent estimates of
the average treatment effect when the
propensity score follows a sparse logistic
regression model; the regression functions are
permitted to be arbitrarily complex. Given the
lack of assumptions on the regression
functions, averages of transformed responses
under each treatment may also be estimated at
the $\sqrt{n}$ rate, and so for example, the
variances of the responses may be estimated.
We show how confidence intervals centred on
our estimates may be constructed, and also
extend the method to estimate heterogeneous
treatment effects. This is joint work with
Rajen Shah from the University of Cambridge.

Nov 9
Gil Kur
MIT

On the
Minimax Optimality of the Maximum Likelihood
in the nonDonsker regime
The minimax
optimality of the Maximum Likelihood Estimator
(MLE) is a fundamental question in mathematical
statistics. For Donsker classes, it is
wellknown that the MLE is minimax optimal.
However, in the nonDonsker regime, the MLE
could be minimax suboptimal. In this talk, we
present new techniques to evaluate the
statistical performance of the MLE in the
nonDonsker regime. As an application, we
demonstrate that the logconcave MLE is optimal
in all dimensions. In comparison, for convex
regression, the MLE can be suboptimal when d \ge
5. This talk is based on joint work with Dagan,
Gao, Guntuboyina, Rakhlin and Sen.

Nov 16
Aaditya Ramdas
Carnegie Melon University

Dimensionagnostic
inference
Classical asymptotic theory for
statistical hypothesis testing, for example
Wilks' theorem for likelihood ratios, usually
involves calibrating the test statistic by
fixing the dimension d while letting the
sample size n increase to infinity. In the
last few decades, a great deal of effort has
been dedicated towards understanding how these
methods behave in highdimensional settings,
where d_n and n both increase to infinity
together at some prescribed relative rate.
This often leads to different tests in the two
settings, depending on the assumptions about
the dimensionality. This leaves the
practitioner in a bind: given a dataset with
100 samples in 20 dimensions, should they
calibrate by assuming n >> d, or d_n/n
\approx 0.2? This talk considers the goal of
dimensionagnostic inferencedeveloping
methods whose validity does not depend on any
assumption on d_n. I will summarize results
from 3 papers that achieve this goal in
parametric and nonparametric settings.
(A) Universal
inference
(PNAS'20)https://arxiv.org/abs/1912.11436 
we describe the split/crossfit likelihood
ratio test whose validity holds
nonasymptotically without any regularity
conditions.
(B)
Classification accuracy as a proxy for
twosample testing (Annals of
Statistics'21) https://arxiv.org/abs/1602.02210.
(C)
Dimensionagnostic inference (http://arxiv.org/abs/2011.05068)
We describe a generic approach that uses
variational representations of existing test
statistics along with sample splitting and
selfnormalization (studentization) to produce
a Gaussian limiting null distribution. We
exemplify this technique for a handful of
classical problems, such as onesample mean
testing, testing if a covariance matrix equals
the identity, and kernel methods for testing
equality of distributions using degenerate
Ustatistics like the maximum mean
discrepancy. Without explicitly targeting the
highdimensional setting, our tests are shown
to be minimax rateoptimal, meaning that the
power of our tests cannot be improved further
up to a constant factor of \sqrt{2}.
This is
primarily joint work with several excellent
coauthors including Ilmun Kim (postdoc,
Cambridge), Larry Wasserman, Sivaraman
Balakrishnan and Aarti Singh.

Nov 23

Séminaire
Parisien 
Nov 30
Matthieu Lerasle
ENSAE

Optimal ChangePoint Detection
and Localization
Given a times series
Y in R n, with a piecewise contant mean and
independent components, the twin problems of
changepoint detection and changepoint
localization respectively amount to detecting
the existence of times where the mean varies
and estimating the positions of those
changepoints. In this work, we tightly
characterize optimal rates for both problems
and uncover the phase transition phenomenon
from a global testing problem to a local
estimation problem. Introducing a suitable
definition of the energy of a changepoint, we
first establish in the single changepoint
setting that the optimal detection threshold
is p 2 log log(n). When the energy is just
above the detection threshold, then the
problem of localizing the changepoint becomes
purely parametric: it only depends on the
difference in means and not on the position of
the changepoint anymore. Interestingly, for
most changepoint positions, including all
those away from the endpoints of the time
series, it is possible to detect and localize
them at a much smaller energy level. In the
multiple changepoint setting, we establish
the energy detection threshold and show
similarly that the optimal localization error
of a specific changepoint becomes purely
parametric. Along the way, tight optimal rates
for Hausdorff and l1 estimation losses of the
vector of all changepoints positions are also
established. Two procedures achieving these
optimal rates are introduced. The first one is
a leastsquares estimator with a new
multiscale penalty that favours well spread
changepoints. The second one is a twostep
multiscale postprocessing procedure whose
computational complexity can be as low as O(n
log(n)). Notably, these two procedures
accommodate with the presence of possibly many
lowenergy and therefore undetectable
changepoints and are still able to detect and
localize highenergy changepoints even with
the presence of those nuisance parameters.

Dec 7

Pas de
séminaire 
Dec 14

Pas de
séminaire

Jan 4

Séminaire
Parisien 
Jan 11
Nicolas Schreuder
ENSAE

A minimax
framework for quantifying riskfairness
tradeoff in regression
Abstract : A theoretical framework is
proposed for the problem of learning a
realvalued function which meets fairness
requirements. This framework is built upon the
notion of alpharelative (fairness)
improvement of the regression function which
we introduce using the theory of optimal
transport. Setting alpha = 0 corresponds to
the regression problem under the Demographic
Parity constraint, while alpha = 1 corresponds
to the classical regression problem without
any constraints. For alpha \in (0, 1) the
proposed framework allows to continuously
interpolate between these two extreme cases
and to study partially fair predictors. Within
this framework we precisely quantify the cost
in risk induced by the introduction of the
fairness constraint. We put forward a
statistical minimax setup and derive a general
problemdependent lower bound on the risk of
any estimator satisfying alpharelative
improvement constraint. We illustrate our
framework on a model of linear regression with
Gaussian design and systematic groupdependent
bias, deriving matching (up to absolute
constants) upper and lower bounds on the
minimax risk under the introduced constraint.
This talk is based on a joint work with
Evgenii Chzhen, see [arXiv:2007.14265].

Jan 18
Mayya Zhylova
Georgia Tech

Higherorder
accuracy of the bootstrap and the normal
approximation on the set of all Euclidean
balls
Abstract: In
this talk I would like to discuss the problem
of establishing a higherorder accuracy of
bootstrapping procedures and (non)normal
approximation in a multivariate or
highdimensional setting. This topic is
important for numerous problems in statistical
inference and applications concerned with
confidence estimation and hypothesis testing,
and involving a growing dimension of random
data or unknown parameter. In particular, I
will focus on higherorder expansions for the
uniform distance over the set of all Euclidean
balls. The talk will include an overview of
the latest results on this topic, and examples
of statistical problems where these results
lead to improvements in an approximation
accuracy.

Jan 25

Pas de
séminaire 
Feb 1  11h00
Pierre Alquier
RIKEN, Tokyo

Parametric
estimation via MMD optimization: robustness
to outliers and dependence
In this
talk, I will study
the properties of
parametric estimators based
on the Maximum Mean
Discrepancy (MMD) defined by Briol et al.
(2019). In a first time, I
will show that these estimators are universal
in the i.i.d setting: even in case of
misspecification, they converge to the best
approximation of the distribution of the data
in the model, without ANY assumption on this
model. This leads to very strong robustness
properties. In a second time,I will show that
these results remain valid when the data is
not independent, but satisfy instead a
weakdependence condition. This condition is
based on a new dependence coefficient, which
is itself defined thanks to the MMD. I will
show through examples that this new notion of
dependence is actually quite general.
This talk is based on the following papers and
softwares, with BadrEddine Chérief Abdellatif
(Oxford University), Mathieu Gerber(University
of Bristol), JeanDavid Fermanian (ENSAE
Paris) and Alexis Derumigny (University of
Twente):

Feb 8

Séminaire
Parisien 
Feb 15  17h00
Jiantao Jiao
UC Berkeley

Sharp
Minimax Rates for Imitation Learning
Abstract:
We establish sharp minimax bounds on
Imitation Learning (IL) in episodic Markov
Decision Processes (MDPs), where the learner
is provided a dataset of demonstrations from
an expert. It is known that Behavior Cloning
(BC) achieves suboptimality growing
quadratically in horizon, which is termed as
error compounding in the literature. We show
that when the MDP transition function is
unknown, all algorithms have to suffer a
suboptimality that grows quadratically with
the horizon, even if the algorithm can
interactively query the expert such as in
the setting of DAGGER. We then propose the
setting of known transitions and show that
one can provably break the quadratic
dependence and improve the exponent to 3/2,
which is shown to be tight. Our upper bound
is established using a computationally
efficient algorithm which we name as
MimicMD, and the lower bound is established
by proving a twoway reduction between IL
and the value estimation problem of the
unknown expert policy under any given reward
function, as well as linear functional
estimation with subsampled observations. We
further show that under the additional
assumption that the expert is optimal for
the true reward function, there exists an
efficient algorithm, which we term as
MimicMixture, that provably achieves
suboptimality independent of the horizon for
arbitrary 3state MDPs with rewards only at
the terminal layer. In contrast, no
algorithm can achieve suboptimality growing
slower than the square root of the horizon
with high probability if the expert is not
constrained to be optimal. We formally
establish the benefit of expert optimal
assumption in the known transition setting
and show that this additional assumption
does not help when the transition functions
are unknown.
Based on joint work
with Nived Rajaraman, Yanjun Han, Lin F.
Yang, and Kannan Ramchandran.

Feb 22 
Pas de
séminaire 
March 1
Gabriel Peyré
CNRS  ENS

Scaling
Optimal Transport for High dimensional
Learning
Abstract:
Optimal transport (OT) has recently gained
lot of interest in machine learning. It is a
natural tool to compare in a geometrically
faithful way probability distributions. It
finds applications in both supervised
learning (using geometric loss functions)
and unsupervised learning (to perform
generative model fitting). OT is however
plagued by the curse of dimensionality,
since it might require a number of samples
which grows exponentially with the
dimension. In this talk, I will explain how
to leverage entropic regularization methods
to define computationally efficient loss
functions, approximating OT with a better
sample complexity. More information and
references can be found on the website of
our book "Computational Optimal Transport" https://optimaltransport.github.io/

March 8
Matteo Pirotta
Facebook AI

Introduction
to explorationexploitation in
reinforcement learning
One of the
major challenges in online reinforcement
learning (RL) is to trade off between
exploration of the environment to gather
information and exploitation of the samples
observed so far to perform a nearoptimal
policy. While the explorationexploitation
tradeoff has been widely studied in the
multiarmed bandit literature, the RL
setting poses specific challenges due to the
dynamical nature of the environment.
In this
seminar, we will review basic notions of RL
(i.e., Markov decision process, value
function, value iteration) and we will
introduce the regret minimization problem in
the finitehorizon setting in environments
with finite states and actions. Then we will
study how algorithmic principles, such as
optimism in face of uncertainty, are
instantiated in RL and what are their
theoretical guarantees. Finally, we will
briefly discuss the most recent results on
the topic and the remaining open questions.

March 15
Alessandro Lazaric
Facebook AI

Explorationexploitation in
reinforcement learning with function
approximation
Most of
recent reinforcement learning algorithms
combine standard dynamic programming
algorithms (e.g., value iteration) with
advanced function approximation techniques
(notably, adapted from deep learning).
Nonetheless, the effect of function
approximation on the online performance of
RL algorithms is still relatively poorly
understood.
In this
seminar, we will review how
explorationexploitation techniques can be
paired with RL function approximation
methods in environments with continuous
stateaction spaces. In particular, we will
focus on linear approximations and show how
they impact the learning performance. We
will also review the different assumptions
that have been used in this space and what
are the major open questions.

March 22

Séminaire
Parisien 
March 29  16h00 CET
Csaba Szepesvari
University of Alberta

Towards Efficient
Planning in Large Markov Decision Processes
Abstract: During the last decade
reinforcement learning algorithms achieved
outcomes far beyond expectations and not only
in one area, but in a diverse set of areas,
such as control, robotics, electronics, or
computer games, just to list a few. What is
common in the algorithms powering these
breakthroughs that they all use powerful
computers, simulators and reinforcement
learning algorithms with one common key
component: function approximation. Formally,
these algorithms solve planning problems in
large Markov decision processes where
intractabality is avoided by using function
approximation. Why, when and how these
algorithms succeed is the central questions of
this talk, in which I report on recent
theoretical advances made on large scale
planning in the presence of function
approximation. The emerging picture is
intriguingly complex: seemingly small changes
to the way a problem is posed can cause a
provably tractable problem to become
intractable and vice versa. In this talk I
will summarize the most striking of these
results and will discuss a few of fascinating
open questions.

April 5

Pas de
séminaire 
April 12 
Séminaire
Parisien 
April 19
Bharath Sriperumbudur
Pennsylvania State University

Johnson &
Lindenstrauss meet Hilbert at a Kernel
Abstract: JohnsonLindenstrauss
(JL) lemma is a famous result about the
almostisometric embeddability of a finite set
in Euclidean space into a Euclidean space of
much smaller dimension. This result has
applications in fast dimensionality reduction
through random projections and has led to many
randomized algorithms that achieve fast matrix
computations. In this work, we extend this
idea of random projections to an
infinitedimensional reproducing kernel
Hilbert space (RKHS), wherein a
finitedimensional representation is obtained
for each function in the RKHS by projecting it
along certain random directions. Using this
finitedimensional representation, we obtain a
JL type lemma by showing that certain
weighted distances and inner products are
preserved. We show the random projection in
RKHS to be closely related to Gaussian
sketching and discuss its implications in
speeding up the kernel algorithms.
(Joint work with Samory Kpotufe, Columbia
University)

April 26 
15h00 CET
Bodhisattva Sen
Columbia University

Multivariate Distributionfree
Nonparametric Testing using Optimal
Transport
Abstract :
We propose a general framework for
distributionfree nonparametric testing in
multidimensions, based on a notion of
multivariate ranks defined using the theory of
optimal transport (see e.g., Villani (2003)).
We demonstrate the applicability of this
approach by constructing exactly
distributionfree tests for two classical
nonparametric problems: (i) testing for the
equality of two multivariate distributions,
and (ii) testing for mutual independence
between two random vectors. In particular, we
propose (multivariate) rank versions of
Hotelling T^2 and kernel twosample tests
(e.g., Gretton et al. (2012), Szekely and
Rizzo (2013)), and kernel tests for
independence (e.g., Gretton et al. (2007),
Szekely et al. (2007)) for scenarios (i) and
(ii) respectively. We investigate the
consistency and asymptotic distributions of
these tests, both under the null and local
contiguous alternatives. We also study the
local power and asymptotic (Pitman) efficiency
of these multivariate tests (based on optimal
transport), and show that a subclass of these
tests achieve attractive efficiency lower
bounds that mimic the remarkable efficiency
results of Hodges and Lehmann (1956) and
Chernoff and Savage (1958) (for the
Wilcoxonrank sum test). To the best of our
knowledge, these are the first collection of
multivariate, nonparametric, exactly
distributionfree tests that provably achieve
such attractive efficiency lower bounds. We
also study the rates of convergence of the
rank maps (aka optimal transport maps).

May 3

Pas de
séminaire 
May 10
Gabor Lugosi
Pompeu Fabra University

Mean estimation in high
dimensions
Abstract : We consider the problem
of estimating the mean of a random vector
based on independent, identically distributed
observations. In particular, we are
interested in estimators that are close to the
true mean vector in all directions. We prove
the existence of an estimator that has a
nearoptimal error in all directions in which
the variance of the one dimensional marginal
of the random vector is not too small. The
random vector may have a heavytailed
distribution.
The proof relies on novel
uniform bounds for the ratio of empirical and
true probabilities.
The talk is based on
joint work with Shahar Mendelson.

May 17
Larry Wasserman
Carnegie Mellon University

Causal
Inference in Time of Covid19
I will discuss our recent work on developing
causal models for estimating the effect of
social mobility on deaths from Covid19. We
propose a marginal structural model motivated
by a basic epidemic model. We estimate the
counterfactual time series of deaths under
interventions on mobility. We conduct several
types of sensitivity analyses. We find that
the data support the idea that reduced
mobility causes reduced deaths, but the
conclusion comes with caveats. There is
evidence of sensitivity to model
misspecification and unmeasured confounding
which implies that the size of the causal
effect needs to be interpreted with caution.
While there is little doubt that the effect is
real, our work highlights the challenges in
drawing causal inferences from pandemic data.
This is joint work with Matteo Bonvini, Edward
Kennedy and Valerie Ventura.

May 24

Pas de
séminaire 
May 31
Olivier Guédon
Université Gustav Eiffel

Géométrie
de polytopes aléatoires et quelques liens
avec le compressed sensing
Résumé :
Nous
étudierons la géométrie de polytopes
aléatoires conv{ X_1, ... , X_N}
lorsque les X_i sont des copies indépendantes
d'un vecteur aléatoire X de R^n satisfaisant
des hypothèses minimales (il peut en
particulier être à queues lourdes). Je
présenterai aussi les liens entre cette étude
et un problème de stabilité en théorie du
compressed sensing.
Travail en commun avec
F. Krahmer, C.
Kümmerle, S. Mendelson, H. Rauhut 
June 7  15h00
CET
Samory Kpotude
Columbia University

Some Recent
Insights on TransferLearning
A common situation in Machine
Learning is one where training data is not fully
representative of a target population due to
bias in the sampling mechanism or due to
prohibitive target sampling costs. In such
situations, we aim to ’transfer’ relevant
information from the training data (a.k.a.
source data) to the target application. How much
information is in the source data about the
target application? Would some amount of target
data improve transfer? These are all practical
questions that depend crucially on 'how far' the
source domain is from the target. However, how
to properly measure 'distance' between source
and target domains remains largely unclear.
In this talk we will argue that much of the
traditional notions of 'distance' (e.g.
KLdivergence, extensions of TV such as D_A
discrepancy, densityratios, Wasserstein
distance) can yield an overpessimistic picture
of transferability. Instead, we show that some
new notions of 'relative dimension' between
source and target (which we simply term
'transferexponents') capture a continuum from
easy to hard transfer. Transferexponents
uncover a rich set of situations where transfer
is possible even at fast rates; they encode
relative benefits of source and target samples,
and have interesting implications for related
problems such as 'multitask or multisource
learning'.
In particular, in the case of transfer from
multiple sources, we will discuss (if time
permits) a strong dichotomy between minimax and
adaptive rates: no adaptive procedure exists
that can achieve the same rates as minimax
(oracle) procedures.
The talk is based on earlier work with Guillaume
Martinet, and ongoing work with Steve Hanneke.

June 14
Jonathan NilesWeed
NewYork University

The "allornothing" phenomenon in
sparse estimation
Abstract:
We explore a sharp phase transition known as
the "allornothing" phenomenon in
estimation problems. This phenomenon arises
when there exists a critical signal to noise
ratio (SNR) such that below this threshold
it is impossible to achieve any positive
correlation with the hidden signal, whereas
above this threshold it is possible to
achieve almost perfect correlation with the
hidden signal. This phenomenon has been
observed in a few different models and
settings, but with no unified explanation.
We give a sharp characterization of this
phenomenon in the presence of Gaussian noise
and give general conditions under which it
holds. As a corollary, we obtain
the allornothing phenomenon for
the sparse tensor PCA, Bernoulli group
testing, and the planted Gaussian perceptron
problems. Joint work with Ilias Zadik.

June 21 
Séminaire
Parisien 
June 28


