Bayesian Likelihood Free Inference using Mixtures of Experts

We extend Bayesian Synthetic Likelihood (BSL) methods to non-Gaussian approximations of the likelihood function. In this setting, we introduce Mixtures of Experts (MoEs), a class of neural network models, as surrogate likelihoods that exhibit desirable approximation theoretic properties. Moreover, MoEs can be estimated using Expectation-Maximization algorithm-based approaches, such as the Gaussian Locally Linear Mapping model estimators that we implement. Further, we provide theoretical evidence towards the ability of our procedure to estimate and approximate a wide range of likelihood functions. Through simulations, we demonstrate the superiority of our approach over existing BSL variants in terms of both posterior approximation accuracy and computational efficiency.


I. INTRODUCTION
Likelihood-free inference, or simulation-based inference, entails estimation of parameters θ of a stochastic model without a feasible likelihood function.However, we assume the ability to simulate observations given known parameters.This study takes a Bayesian stance, where we aim to estimate posterior parameter distributions.
In this work, we concentrate on methods that utilize surrogate parametric models in place of an intractable likelihoods.Specifically, we study the class of methods generally called Synthetic likelihood (SL) procedures.These methods provide estimates of likelihood functions that are then used as inputs for a sampling procedure, such as a Markov chain Monte Carlo (MCMC) scheme, to estimate the posterior distribution.Bayesian Synthetic Likelihood (BSL) approaches [23] have, in the past, been investigated as Bayesian extensions of the SL approach of [31].SL is typically characterized by a Gaussian assumption, while more general formulations are studied under the parametric Bayesian indirect likelihood (pBIL) framework, which includes a number of variants [5].For comparisons with our approaches, we focus on the variants implemented in the BSL package in R [2].
The typical approach of BSL methods is to approximate the intractable likelihood by a multivariate Gaussian distribution whose mean and covariance parameters depend on θ and are estimated pointwise for each value of θ via the empirical mean and covariance estimators of a sample of m independent and identically distributed (i.i.d.) summary statistics, simulated from the underlying likelihood, respectively (cf.[23], [31]).
For good performance, the number of simulations m should not be too small, and evidence suggests that, in practice, the ideal m increases as the dimension of the summary statistics grows [23].Various approaches have been explored to decrease the required number of simulations, such as sparse techniques [3] and shrinkage techniques [22], [24], which aim to diminish the number of parameters necessary for estimating the covariance matrix.Additionally, the uBSL approach of [23] consider unbiased estimation of the normal density functional rather than its mean and covariance parameters.A Semi-parametric variant, semiBSL, has also been suggested to provide robustness when likelihoods are non-Gaussian [1].
In [7], two other variants are considered, referred to as miss-BSL.They aim to estimate the Gaussian synthetic likelihood in a more robust manner, to account for incompatibilities between model and summary choice, i.e., model misspecification.The first approach, denoted as missBSLmean, augments the mean of the simulated summaries with additional free parameters, while the second approach, missBSLvar, augments the variance with free parameters, instead.
For all described BSL alternatives, above, an MCMC scheme is required to carry out the posterior inference.Therefore, if I evaluations of the likelihood are needed in the subsequent MCMC algorithm, it is necessary to simulate I values of θ according to the prior and then simulate I × m values of observations y due to the pointwise construction of the SL estimators.For large I and m, this can be overly costly.The solution we investigate is based on Mixture of Experts (MoE), a class of neural network models [15], [33], using the so-called Gaussian Locally Linear Mapping model (GLLiM; [4], [32]) estimator.Our approach has the advantage to both reduce the number of simulations m and depart from Gaussianity assumptions.Additionally, it allows us to exploit recent approximation and estimation theoretic results regarding MoEs [16]- [20] to establish desirable theoretical results.These results fill a gap in the BSL literature, where there is a lack of theory based on mild and easily checkable assumptions that allow for guarantees in the relationship between the estimated BSL posterior and the target posterior measures.

II. BAYESIAN SYNTHETIC LIKELIHOOD
Let (Ω, F, P) be a probability space.We observe data X n = (X i ) i∈ [n] , where [n] = {1, . . ., n} and X i : Ω → X is a random variable taking value in the measurable space (X, X), for each i ∈ [n].We can thus endow X n with the push-forward probability space (X n , X ⊗n , P n ).Further define the parameter space (T, T), with typical element θ, and equip it with the prior measure Π : T → [0, 1].In classical Bayesian inference (see e.g., [25] and [29]), one assumes that (θ, X 1 , . . ., for some dominating measure λ (e.g.Lebesgue or counting measures), where, for each A ∈ T ⊗ X ⊗n : We typically call f n : X n × T → R ≥0 the likelihood function, with the property that X n f n (x n |θ)λ(dx n ) = 1, for each θ ∈ T, and where π : T → R ≥0 is the density of the prior measure Π, with respect to λ.The target of Bayesian inference is to either provide an expression for the posterior measure Π(•|x n ) : T → [0, 1], characterized by integration with respect to the posterior density π(θ|x n ) ∝ f n (x n |θ)π(θ) or to construct Monte Carlo estimators for integrals with respect to Π(•|x n ).

A. BSL original and variants
Most Bayesian approaches require a closed-form expression for f n and cannot be used in the likelihood-free setting.In the BSL setting, f n can be intractable but we assume that we can simulate i.i.d.samples: , where θ j has prior measure Π and X k n,j |θ j has measure We then use these data to estimate some tractable replacement for the likelihood function: , for each θ ∈ T and a summary statistic η : X n → R d .In particular, the approach of [23] suggests to use simulations Y m N to estimate the likelihood replacement where and N d (•; µ, Σ) is the d-dimensional normal density function, with mean µ ∈ R d and positive definite covariance Σ ∈ R d×d .Variations on this construction have been proposed, for example, by [1] and [7].In [1] the authors propose to replace (1), by a copula transformation of a marginal kernel density estimator, leading to the semiBSL.In [7], the authors introduce a prior on an additional free parameter that improves robustness to misspecification of choice of summary statistic η, leading to the missBSL approach.Further refinements have subsequently been considered by [8], where the covariance estimator (3) is replaced by more general classes of covariance estimators.

B. Theoretical insights of BSL
It is noteworthy that a number of theoretical results have been proved with respect to the described BSL algorithms.Firstly, for fixed x n , [5] proved a weak consistency result regarding the approximate posterior measure defined by density to a limiting posterior measure defined by in measure f n (•|θ j )dλ, along with uniform integrability assumptions on the sequences of measures (g n,m (•|x n )dλ) m∈N and densities . We note that these results only say that Stronger results were obtained by [8], who required stricter conditions, to prove Bernstein-von Mises-type normal limit theorems for the class of covariance estimator-adjusted BSL techniques.For instance, the authors assume a central limit theorem with respect to the summary η(X n ) and its limit η 0 , for some θ 0 ∈ R d .They further assume that there is a mapping θ → η(θ), for which η(θ 0 ) = η 0 , uniquely for some θ 0 ∈ T, and further that η is differentiable with full-rank Jacobian in a neighbourhood of θ 0 .The covariance matrix estimator is further assumed to satisfy a uniform law of large numbers, under appropriate scaling, and the moment generating function of the scaled difference between η(X n ) and η 0 , which admits a central limit theorem, also has sub-Gaussian tails for sufficiently large n.Under these assumptions, the BSL posterior density estimator obtained using covariance estimators that satisfy the regularity conditions will converge, in probability, to a normal density function, in the total variation topology, as both n and m approach infinity.This can better be interpreted as the convergence in distribution of an appropriately scaling of the posterior mean, X n → T θg m (θ|X n )λ(dθ), to a normal random variable.
Notice that these assumptions are difficult to intuit and to verify for many sufficiently complex practical scenarios, and can be violated in simple cases.For example, one cannot use summary statistics such as M-estimator solutions [27], where the extrema are unidentifiable (see, e.g., [26, Sec.5.1]); nor U-and V-statistics defined by degenerate kernels [12,Ch. 4].

III. BSL VIA MIXTURE OF EXPERTS A. Surrogate likelihoods via mixture of experts
As with the BSL methods described above, we seek to approximate the likelihood f n (x n |θ) in some form, when T ⊂ R p , for some p ∈ N. Namely, given a choice of summary statistic η : X n → R d , we consider the classes of MoEs with normal experts and Gaussian gating (cf.[11], [15], [32]).The main reason for this choice is that the maximum conditional likelihood estimator (MCLE) is well approximated by the computationally more convenient GLLiM model estimator of [4].Mutatis mutandis, the same results can be obtained for the softmax gating function via the equivalence between two classes (cf.[17,Lem. 1]).Writing η n = η(x n ), our likelihood approximators take the form where K ∈ N is the number of mixture components, and in the set of Gaussian gating functions: where . Then we assume that Ψ K = (ψ K , χ K ) ∈ X , for some domain X satisfying the parameter space restrictions, above.
For a fixed Ψ K , a mixture in M K can be seen as a function of θ.The idea is then to learn an estimate of Ψ K so that the corresponding mixture is a good approximation of the likelihood for every θ.

B. Approximation capacities
The approximation of the likelihood by a function in M K is appealing for a number of reasons.Let fn (η n |θ) denote the pushforward likelihood of η n = η(X n ), based on f n (x n |θ).Then, on every compact subset K ⊂ T, as long as fn (η n ; •) is continuous on T, for every > 0, there exists a sufficiently large K ∈ N and h n,K (η n |θ) ∈ M K , such that the conditional expectations according to fn (η n |θ) and h n,K (η n |θ) are uniformly close [16]: This implies that we can approximate the mean of any pushforward likelihood arbitrarily well using an approximation in M K , for sufficiently large K. Further, on any compact sets H ⊂ R d and K ⊂ T, if fn is a density on H for each fixed θ ∈ K, and if fn is continuous on H × K, then, by [17, Thm.1], for each q ∈ [1, ∞) and > 0, there exists a Thus, not only is the mean of h n,K (η n |θ) close to its target, but if the target is compactly supported, then h n,K (η n |θ) will be close to its target conditional density in any q-norm as well.

C. Posterior consistency
In Bayesian settings, the subsequent step is to consider the posterior distribution induced by the likelihood approximation.For fixed, K ∈ N, we first wish to estimate the parameter Ψ K that determines the optimal h n,K = h n,K (•|•; Ψ k ), where we now make the dependence on Ψ K explicit.More specifically, using N simulated i.i.d.samples Y N = ((θ j , X n,j )) j∈[N ] from the joint measure Q n capturing the likelihood information, we consider the MCLE of Ψ K : and parameters Ψ * K minimizing the Kullback-Leibler divergence between fn and mixtures in M K (cf.[30,Ch. 21]): .
For each parameter Ψ K , we then define the posterior measure corresponding to h n,K (η(x n )|θ; Ψ K ) via the density and show the following convergence result.
Theorem 1 (Posterior consistency).Assume that η(X n ) and θ have finite second moments with respect to Q n and X is compact.Then, if Φ is compact and ΨK,N N ∈N is a convergent sequence, the posterior measures defined by ΨK,N N ∈N converge in total variation, almost surely, to the posterior measure defined by Ψ * K , in the sense that, for each The proof of Theorem 1 is given in the Appendix.

D. Fast convergence rates
Note that not only is the MoE approximation of the likelihood and posterior consistency attractive, but we can also obtain near-optimal convergence estimation rates via Theorem 2, which is proved in the Appendix.We specialize to the well-specified case, where the generative measure Q n has conditional density in M K , denoted as h n,K 0 (•|•; Ψ 0 K 0 ) with K 0 number of mixture components, where K 0 ≤ K.For each θ ∈ T, we define the Hellinger distance, denoted by He(•, •), as follows: .
Theorem 2 (Conditional density estimation).Assume that ((θ j , X n,j )) j∈[N ] are sampled i.i.d from generative joint measure Q n .Assume that X is compact and T is bounded.Given ΨK,N defined in (5), the corresponding conditional density function h n,K (•|•; ΨK,N ) admits the convergence rate of order O((log N/N ) −1/2 ) under the Hellinger distance in the sense that: where C 1 , C 2 and C 3 are universal positive constants.

IV. NUMERICAL ILLUSTRATIONS A. Surrogate MoE likelihoods via GLLiM
For our numerical illustration, we use the GLLiM estimator of [4].GLLiM has been used previously in [6] to provide surrogate posterior estimators.In our current setting, it is the likelihood that we approximate as an MoE: with n = 1 and η n (X n ) = X in each of our examples.
In the pBIL framework and notation of [5], we thus have an auxiliary model h n,K , which can be viewed as a mixture of K Gaussian densities with parameters Specifically, Φ(θ, Ψ K ) is now a parametric function of θ that depends on Ψ K and specifies the proportions, means and covariance matrices of the K components.To define the mapping Φ, we only need an estimate of Ψ K .The parameter Ψ K can be estimated, from the sample Y N , using a GLLiM model estimator ΨK,N , computed via a standard Expectation-Maximization (EM) algorithm.Details of the estimation procedure appears in [4].Once we have computed ΨK,N , no further simulations are required.That is, Ψ K can be estimated using only the size N simulation: Y N = ((θ j , X n,j )) j∈[N ] , with N fixed and independent of the required number of MCMC iterations.In the sequel, we will referred to our approach, using GLLiM model estimated MoE for BSL, as GLLiM-BSL.

B. Posterior samples
To sample from the posterior measure, BSL procedures use an estimation of the likelihood, plugged into an MCMC algorithm.In the BSL package [2], the default MCMC scheme is a Random Walk Metropolis Hastings (RW MH) algorithm, as provided by the mcmc package [9].The covariance matrix of the Gaussian proposal is set to sI, where s > 0 is a scale parameter that has to be carefully chosen and I is the identity matrix.For GLLiM-BSL, we also test a Slice Sampler (SS) [14] and a Metropolis Hastings scheme, using the GLLiM approximation of the posterior as a proposal distribution (G MH).These two latter choices have the advantage of not requiring tuning.For all MCMC schemes, we perform 3 × 10 5 iterations, with a burnin of 2 × 10 5 and a 1-in-100 sample thinning, resulting in a sample of 1000 θ values.
An MoE is learned on a sample Y N of size N = 10 5 , obtained by simulating parameters from the prior and underlying measure defined by f n , using a GLLiM estimator.The Bayesian information criterion (BIC) is used to choose the number of mixture components K. Once estimated with the selected K, the MoE provides an approximation of the likelihood which is used together with one of the aforementioned MCMC schemes.For comparison, we also use the GLLiM model approximation of the posterior measure to directly generate a sample of size 1000, as per [6].This does not require any MCMC scheme.These direct GLLiMbased samples are then compared with samples resulting from various BSL procedures from the BSL package: BSL, semiBSL, missBSLmean, missBSLvar and uBSL; see [2] for details.We limit to visual comparison as it is enough to illustrate the improvement obtained by our method.There exists quantitative criteria for comparing samples, such as distances between samples (e.g.Wasserstein, energy distances etc.), 2-sample tests, etc. [13].However, they provide highly volatile and inconsistent rankings between methods that are inconsistent with visual diagnoses.The development of quality assessment tools in likelihood free settings is actually an open question.It is a promising direction for future research that falls outside the scope of this paper.

C. Two moons example
The two moons model corresponds to a simulator that, given some parameters θ = (θ 1 , θ 2 ) ∈ R 2 , produces an observation X ∈ R 2 via the scheme: , where U is the uniform distribution.
We adopt the same setting as in [10].Variable P follow a single crescent-shaped distribution, which is subsequently shifted and rotated around the origin, depending on θ.The absolute value |θ 1 + θ 2 | gives rise to the second crescent in the posterior.The prior is uniform over [−1, 1] 2 and the observed data is set to x = (0, 0).The likelihood cannot be expressed explicitly but Figure 1 (a) shows 1000 simulations for θ = (−0.5,0.75), which clearly exhibit a non-Gaussian shape.A sample obtained from the GLLiM approximation of the likelihood, with K = 49 Gaussian components, is shown in Figure 1 (b), for comparison.The approximation is quite good, with a few extra outliers visible on the right indicating that some of the components are located there, but with low weight.The true posterior measure is made of two moon-like parts, see e.g.[10] and Figure 2 (a).The GLLiM model is estimated using simulations Y N .Selecting from K = 2 to 50, the smallest BIC was obtained for K = 49.Figure 2 shows the different obtained samples.In this example, only the RW MH algorithm is tested as a MCMC scheme.We note that the uBSL variant from the BSL package exhibited a runtime error in this particular example and, as a result, was not used.
All methods identify the bimodality of the posterior distribution, but the BSL methods do not correctly recover the local structure of the two parts.In contrast, GLLiM-BSL provides a good representation of the posterior mass and moon structures.Among GLLiM-based procedures, the two moons were slightly better recovered with GLLiM-BSL than with the direct GLLiM posterior approximation.Table I shows the computing times obtained on a laptop (MacBook Pro, 2.4 GHz Quad-Core Intel Core i5) using the mentioned CRAN packages, and additional basic R code with no resorting to parallel computing.For the low dimensions of this example, i.e., d = p = 2, the computing times were always less for GLLiM-based procedures, but not significantly so.However, the amortization nature of the GLLiM solution becomes an advantage in higher-dimensional problems, as seen in the following example.

D. Hyperboloids example
This example was introduced in [6] and exhibits a posterior distribution whose mass is located on 4 hyperboloids, as illustrated in Figure 3 (a).The GLLiM estimator was used to produce an MoE with K = 38 mixture components, as selected by BIC.The GLLiM-based likelihood was used with an RW MH algorithm to make comparisons with the standard BSL procedures.We also considered SS and G MH.In G MH, the variance of the GLLiM posterior was multiplied by 2 to avoid the proposal distribution being too narrow.The acceptance rate was 60% for G MH vs 16% for RW MH.
As depicted in Figure 3, although the posterior is far from being unimodal, some of the standard BSL variants (semiBSL and missBSL) succeed in capturing it satisfactorily compared to the previous example.This is likely due to the fact that the likelihood is simpler here, being a mixture of two Student distributions. Figure 3 shows the best results, obtained with GLLiM-BSL (c,d) and semiBSL (f).GLLiM approximations (Figure 3 (b,c,d)) appear to be better at capturing the hyperboloid branches, while some of the BSL variants (f,g,h), are more precise in the center with an obvious excess mass at the intersections of the branches.To complement this visual comparison, we also show the posterior marginals in Figure 4.The marginal plots allow us to better visualize the difference with standard BSL procedures.Refer to Figure 4 (fj), which all show larger deviations from the truth, determined by numerical integration.Both true posterior marginals are the same due to symmetry in the model formulation and exhibit a non-smooth shape, which has also been doublechecked using a long run of 3 × 10 5 iterations of the SS algorithm; see Figure 4 (a).For GLLiM-BSL, among the three MCMC schemes, it appears that the SS version in Figure 4 (d) provides more satisfactory samples than the MH versions (c, e).The gain over the direct GLLiM posterior sample (b) is also clearer.Computing times are reported in Table I.For the larger dimensional example (d = 10), GLLiM methods take much less time than standard BSL, even when considering BIC and learning times.

V. CONCLUSION
MoE approaches provide several advantages over previous BSL variants.The flexibility of the model allows for better approximations of likelihoods that strongly depart from Gaussianity.In particular, GLLiM model estimators have interesting amortization properties.GLLiM-based procedures can be applied in a wide variety of settings, such as sets of i.i.d.observations or time series, as illustrated in [6].To the best of our knowledge, we are the first to demonstrate approximation and estimation theoretical properties of MoEs in this setting.

Fig. 1 .
Fig. 1.Data X generated from the Two Moons example for θ = (−0.5,0.75).Samples of size 1000 from (a) the 2 moons simulator and from (b) the GLLiM likelihood estimation with K = 49 Gaussian components.
that η (X n ) and θ have finite second moments with respect to Q n , and compact X .By [26, Thm.5.3], this implies that since ΨK,N N ∈N is convergent and conditional likelihood maximizing, ΨK,N −→ N →∞ Ψ * K for almost every (Y N ) N ∈N , for some Kullback-Leibler divergence(