Universal estimation with Maximum Mean Discrepancy (MMD)

A very old and yet very exciting problem in statistics is the definition of a universal estimator $\hat{\theta}$. An estimation procedure that would work all the time. Close your eyes, push the button, it works, for any model, in any context.

Formally speaking, we want that for some metric $d$ on probability distributions, for any statistical model $(P_\theta,\theta\in\Theta)$, given $X_1,\dots,X_n$ drawn i.i.d from some $P^0$ not necessarily in the model,

\begin{equation} \label{equa:universal} d\left(P_{\hat{\theta}},P^0 \right) \leq \inf_{\theta\in\Theta} d\left(P_{\theta},P^0 \right) + r_n(\Theta), \end{equation}

where $r_n(\Theta) \rightarrow 0$ when $n\rightarrow \infty$ holds, either in expectation or with large probability.

Why would this be nice? Well, first, if the model is well specified, that is, $P^0 = P_{\theta^0}$ for some $\theta^0\in\Theta$, we would have:

\[d\left(P_{\hat{\theta}},P^0 \right) \leq r_n(\Theta) \rightarrow 0,\]

so the estimator is consistent. But the ill-specified case is also relevant. Remember that “all models are wrong, some models are useful”. A very interesting case is Huber’s contamination model: assume that the data is drawn from the model, but might be corrupted with small probability. That is, $P^0 = (1-\varepsilon) P_{\theta^0} + \varepsilon Q$ where $\varepsilon\in[0,1]$ is small and $Q$, the contamination distribution, can be absolutely whatever. Then we would have:

\[d\left(P_{\hat{\theta}},P^0 \right) \leq d(P_{\theta^0},P^0) + r_n(\Theta) = d(P_{\theta^0},(1-\varepsilon) P_{\theta^0} + \varepsilon Q) + r_n(\Theta).\]

If the metric $d$ is such that $d(P_{\theta^0},(1-\varepsilon) P_{\theta^0} + \varepsilon Q) \leq C \varepsilon$ for some constant $C$, we end up with

\[d\left(P_{\hat{\theta}},P^0 \right) \leq C \varepsilon + r_n(\Theta),\]

which means that, as long as $\varepsilon$ remains quite small, the estimator is still not too bad. That is, the estimator is robust to contamination.


In case you believe that popular estimators such as Maximum Likelihood Estimator (MLE) is universal, surprise: it’s not. There is no metric such that \eqref{equa:universal} holds for the MLE. Even if you relax the metric assumption, and consider a divergence, for example, the Kullback-Leibler (KL) divergence, $KL(P^0,P_\theta)$… it still does not work (you can find counterexamples in [8]). And anyway, $KL((1-\varepsilon) P_{\theta^0} + \varepsilon Q,P_{\theta^0}) = + \infty$ for many contamination distributions (for example, as soon as the distributions $P_\theta$ in the model are continuous and the contamination $Q$ is discrete).


The first example of universal estimator we are aware of: Yatracos’ estimator [7], with $d$ being the total variation distance. It works with the rate $r_n(\Theta) = [\mathrm{dim}(\Theta)/n]^{\frac{1}{2}}$ when $\Theta$ is in some finite dimensional space. And it doesn’t work for nonparametric estimation. Still, it’s nice, and the paper is beautiful (and short). Equally beautiful is the book by Devroye and Lugosi [9] which studies many estimations methods for the total variation distance, including variants of Yatracos’ estimator.

Another example is Birgé, Barraud and Sart’s $\rho$-estimator [8], which satisfies a similar result for the Hellinger distance. If you want to read the paper, be aware: this is extremely difficult to prove! It is also very nice, because the Hellinger distance looks locally very similar to the KL in many models. In some sense, the $\rho$-estimator actually does what the MLE should do.

By the way. We live in the big data era, high dimensional data, big networks, more layers, you know. So it must be said that Yatracos’ estimator, and the $\rho$-estimators, cannot be used in practice. They require an exhaustive search in a fine discretization of the parameter space, don’t expect to do that with a deep NN. Don’t expect to do it either for a very shallow NN, not even for a linear regression in dimension 50 (as discussed later, a variant of Yatracos’ estimator by Devroye and Lugosi might be feasible, though).


Let us now describe yet another metric, and another estimator. This metric is based on kernels. So, let $K$ be a kernel on the observations space, $\mathcal{X}$. This means that there is an Hilbert space $\mathcal{H}$, equipped with a norm $\left\lVert\cdot\right\rVert _{\mathcal{H}}$, and a continuous map $\Phi:\mathcal{X}\rightarrow \mathcal{H}$ such that $K(x,x’) = \left<\Phi(x),\Phi(x’)\right>$.

Given a probability distribution $P$ on $\mathcal{X}$, let us define the kernel mean embedding

\[\mu(P) = \int \Phi(x) P(\mathrm{d} x) = \mathbb{E} _{X \sim P }[\Phi(X)].\]

Wait. Of course, this is not always defined! It is, say, if $\int \left\lVert\Phi(x)\right\rVert_{\mathcal{H}} P(\mathrm{d} x) <+\infty$.

Now, it appears that some kernels $k$ are known such that:

(i) $k(x,x) = \left\lVert\Phi(x) \right\rVert_{\mathcal{H}}^2 \leq 1 $ for any $x$, which in turn ensures that $\mu(P)$ is well defined for any $P$.

(ii) $P\mapsto \mu(P)$ is one to one.

For example, when $\mathcal{X}=\mathbb{R}^d$, the Gaussian kernel

\[k(x,x') = \exp\left(- \frac{\left\lVert x-x'\right\rVert^2}{\gamma^2} \right),\]

for some $\gamma>0$, satisfies (i) and (ii).

For any such kernel,

\[d_{k}(P,Q)=\left\lVert \mu(P)-\mu(Q)\right\rVert_{\mathcal{H}}\]

is a distance between probability distributions. We can now define the MMD-estimator. Let $\hat{P}_n$ denote the empirical probability distribution, that is,

\[\hat{P} _n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}.\]

The estimator $\hat{\theta}$ is defined by

\[\hat{\theta} = \arg\min_{\theta\in\Theta}d_k(P_\theta,\hat{P}_n).\]

Note: [12] below provide some conditions ensuring that the minimizer indeed exists. But actually, if if does not exist, just take any $\varepsilon$-minimizer for $\varepsilon$ small enough, and all the good properties discussed below still hold.

Note: if you believed the statement “Close your eyes, push the button, it works” above, you will of course be disappointed. Life is not that simple. The choice of the kernel $k$ is far from easy, and is of course context dependent.


We now prove that, as long as the kernel satisfies (i) and (ii) above, for any statistical model $(P_\theta,\theta\in\Theta)$ (parametric, or not!), given $X_1,\dots,X_n$ drawn i.i.d from some $P^0$,

\[\mathbb{E} \left[ d_k\left(P_{\hat{\theta}},P^0 \right) \right] \leq \inf_{\theta\in\Theta} d_k\left(P_{\theta},P^0 \right) + \frac{2}{\sqrt{n}}.\]

This is taken from our paper [1]. (Note that the expectation is with respect to the sample: $X_1,\dots,X_n$ as $\hat{\theta}=\hat{\theta}(X_1,\dots,X_n)$, the dependence with respect to the sample is always dropped from the notation in satistics and in machine learning).

First, for any $\theta\in\Theta$,

\[d_k\left(P_{\hat{\theta}_n},P^0 \right) \leq d_k\left(P_{\hat{\theta}_n},\hat{P}_n \right) + d_k\left(\hat{P}_n,P^0 \right)\]

by the triangle inequality. Using the defining property of $\hat{\theta}$, that is, that it mimizes the first term in the right-hand side,

\[d_k\left(P_{\hat{\theta}_n},P^0 \right) \leq d_k\left(P_{\theta},\hat{P}_n \right) + d_k\left(\hat{P}_n,P^0 \right),\]

and using again the triangle inequality,

\[d_k\left(P_{\hat{\theta}_n},P^0 \right) \leq d_k \left(P_{\theta},P^0 \right) + 2 d_k\left(\hat{P}_n,P^0 \right).\]

Take the expectation of both sides, and keeping in mind that this holds for any $\theta$, this gives:

\[\mathbb{E} \left[ d_k\left(P_{\hat{\theta}},P^0 \right) \right] \leq \inf_{\theta\in\Theta} d_k\left(P_{\theta},P^0 \right) + 2 \mathbb{E} \left[d_k\left(\hat{P}_n,P^0 \right) \right].\]

So, it all boils down to a control of the expectation of $d_k(\hat{P}_n,P^0 )$ in the right-hand side. Using Jensen’s inequality,

\[\mathbb{E} \left[ d_k\left(\hat{P}_n,P^0 \right)\right] \leq \sqrt{\mathbb{E} \left[d_k^2\left(\hat{P}_n,P^0 \right)\right]}\]

so let us just focus on bounding the expected square distance. Using the definition of the MMD distance,

\[\mathbb{E} \left[d_k^2\left(\hat{P}_n,P^0 \right)\right] = \mathbb{E} \left[ \left\lVert \frac{1}{n} \sum_{i=1}^n \Phi(X_i)-\mathbb{E}_{X\sim P^0}[\Phi(X)] \right\rVert_{\mathcal{H}}^2 \right] = \mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^n \Phi(X_i) \right)\]

and so, as $X_1,\dots,X_n$ are i.i.d,

\[\mathbb{E} \left[d_k^2\left(\hat{P}_n,P^0 \right)\right] = \frac{1}{n} \mathrm{Var}[\Phi(X_1)] \leq \frac{1}{n} \mathbb{E}\left[ \left\lVert\Phi(X_1) \right\rVert_\mathcal{H}^2\right] \leq \frac{1}{n}.\]


Now, of course, one question remains: is it easier to compute $\hat{\theta}$ than Yatracos’ estimator?

This question is discussed in depth in [1] and [12]. The main message is that the minimization of $d_k^2(P_\theta,\hat{P}_{n}) $ is usually a smooth, but non-convex problem (in [1] we exhibit one model for which the problem is convex, though). An unbiased estimate of the gradient of this quantity is easy to build. So, it is possible to use a stochastic gradient algorithm (SGA), but because of the non-convexity of the problem, it is not possible to show that this will lead to a global minimum. Still, in practice, the performances of the estimator obtained by using the SGA are excellent, see [1,12].


The idea to use an estimator of the form

\begin{equation} \label{equa:mde} \hat{\theta} = \arg\min_{\theta\in\Theta}d(P_\theta,\hat{P}_n) \end{equation}

goes back to the 50s, see [5]. The paper [6] is followed by a discussion by Sture Holm who argues that this leads to robust estimators when the distance $d$ is bounded. The reader can try for example the Kolmogorov-Smirnov distance defined for $\mathcal{X}=\mathbb{R}$,

\[d(P,Q) = \sup_{a\in\mathbb{R}} |P(X\leq a) - Q(X\leq a)|.\]

Another example is the total variation distance. Note that the initial procedure proposed by Yatracos [7] is more involved that simply~\eqref{equa:mde} with the TV distance. But later, Devroye and Lugosi [9] proved that~\eqref{equa:mde} used with the TV distance leads to the same guarantees than Yatracos’ estimator.

Also, we mention that the procedure used by Barraud, Birgé and Sart in [8] is much more involved and cannot be interpreted as minimum distance estimation.

The MMD distance has been used in kernel methods for years, we refer the reader to the excellent tutorial [10]. However, up to your knowledge, the first time it was used as described in this blog post was in [11] where the authors used this technique to estimate $\theta$ in a special type of model $(P_\theta,\theta\in\Theta)$ called Generative Adversarial Network (GAN, I guessed you already heard about it). The first general study of MMD-estimation is [12], where the authors study the consistency and asymptotic normality of $\hat{\theta}$ (among others!).


Our preprint on MMD-estimation and robustness study specifically the robustness properties of $\hat{\theta}$. Namely, in Huber’s contamination mode, we study in detail the dependence of $\mathbb{E}[d(P_{\hat{\theta}},P^0)]$ with respect to $n$, the sample size, and $\varepsilon$, the level of contamination. Moreover, in this paper, we also extend the consistency proof to the case where the observations are not independent.

[1] (B.-E. Chérief-Abdellatif and P. Alquier (2019). Finite Sample Properties of Parametric MMD Estimation: Robustness to Misspecification and Dependence. Preprint arxiv:1912.05737.)

In the short paper on MMD-Bayes, we study a generalized Bayes procedure using the MMD distance:

\[\pi(\theta|X_1,\dots,X_n) \propto \exp\left(- \eta d_k^2(P_\theta,\hat{P}_n) \right) \pi(\theta)\]

where $\pi$ is some prior distribution and $\eta>0$ some tuning parameter. This leads to robust Bayesian estimators.

[2] (B.-E. Chérief-Abdellatif and P. Alquier (2020). MMD-Bayes: Robust Bayesian Estimation via Maximum Mean Discrepancy. Proceedings of The 2nd Symposium on Advances in Approximate Bayesian Inference (AABI), Proceedings of Machine Learning Research, vol. 118, pp. 1-21.)

We finally have two recent preprints on MMD-estimation in regression models and in copulas models, respectively. These models have in common that they are semi-parametric: we want to estimate a parameter that does not completely define the distribution of the data. For example, in the linear regression model $Y_i = \left<\theta,X_i\right> + \varepsilon_i$, we usually only specify the distribution of $\varepsilon$, say $\mathcal{N}(0,\sigma^2)$. However, in order to use the MMD-estimator as defined above, one should also specify the distribution of the $X_i$’s. In [4], we propose a trick to avoid that, but the analysis becomes immediately much more complex.

[3] (P. Alquier, B.-E. Chérief-Abdellatif, A. Derumigny and J.-D. Fermanian (2009). Estimation of Copulas via Maximum Mean Discrepancy, 2020. Preprint arxiv:2009.03017.) The paper comes with the R package (MMDCopula).

[4] (P. Alquier and M. Gerber (2020). Universal Robust Regression via Maximum Mean Discrepancy. Preprint arxiv:2006.00840.) We are currently working on the corresponding R package.


[5] (J. Wolfowitz (1957). The minimum distance method. The Annals of Mathematical Statistics, 28(1), 75-88.)

[6] P. J. Bickel (1976). Another Look at Robustness: A Review of Reviews and Some New Developments. Scandinavian Journal of Statistics 3, 145-168.

[7] (Y. G. Yatracos (1985). Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics, 13(2), 768-774.)

[8] (Y. Baraud, L. Birgé and M. Sart (2017). A new method for estimation and model selection: $\rho$-estimation. Inventiones mathematicae, 207, 425-517.)

[9] (L. Devroye and G. Lugosi (2001). Combinatorial methods in density estimation. Springer.)

[10] (K. Muandet, K. Fukumizu, B. Sriperumbudur and B. Schölkopf, (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2), 1-141.)

[11] (G. K. Dziugaite, D. M. Roy and Z. Ghahramani (2015). Training generative neural networks via maximum mean discrepancy optimization. UAI’15: Proceedings of the Thirty-First Conference on Uncertainty in Artificial IntelligenceJuly 2015, 258-267.)

[12] (F.-X. Briol, A. Barp, A. B. Duncan and M. Girolami (2019). Statistical Inference for Generative Models via Maximum Mean Discrepancy. Preprint arXiv:1906.05944.)