Bayesian Inversion by ω-Complete Cone Duality

The process of inverting Markov kernels relates to the important subject of Bayesian modelling and learning. In fact, Bayesian update is exactly kernel inversion. In this paper, we investigate how and when Markov kernels (aka stochastic relations, or probabilistic mappings, or simply kernels) can be inverted. We address the question both directly on the category of measurable spaces, and indirectly by interpreting kernels as Markov operators. For the direct option, we introduce a typed version of the category of Markov kernels and use the so-called 'disintegration of measures'. Here, one has to specialise to measurable spaces born from a simple class of topological spaces, e.g. Polish spaces (other choices are possible). Our method and result greatly simplify a recent development in Ref. [4]. For the operator option, we use a cone version of the category of Markov operators (kernels seen as predicate transformers). That is to say, our linear operators are not just continuous, but are required to satisfy the stronger condition of being ω-chain-continuous. Prior work shows that one obtains an adjunction in the form of a pair of contravariant and inverse functors between the categories of L_1- and L_∞-cones [3]. Inversion, seen through the operator prism, is just adjunction. No topological assumption is needed. We show that both categories (Markov kernels and ω-chain-continuous Markov operators) are related by a family of contravariant functors T_p for 1 ≤ p ≤ ∞. The T_p's are Kleisli extensions of (duals of) conditional expectation functors introduced in Ref. [3].


Introduction
Before we get into the technicalities, we review informally the central notions of interest: kernels, kernels as operators, and Bayesian update. We also take the opportunity to introduce some of the notation used in the remainder of the paper.
A kernel is a (measurable) map from some measurable space X to the set of probability distributions on another measurable space Y. The set of probability distributions has to be equipped with a σ-algebra of measurable sets for this to make sense. We will write X → GY or equivalently X ⇸ Y for the type of such kernels from X to Y, where GX is the set of probability distributions over X. With the appropriate constructions, G is a monad over Mes, the category of measurable spaces and measurable maps [9]. Kernels are often used as models of probabilistic behaviour. If X = Y is finite, then this is but the familiar notion of probabilistic state machine (or finite discrete-time Markov chain), where one jumps probabilistically from one state to the next with no dependence on the past. We say a kernel f : X → GY is deterministic if it factorises in Mes as f = δ_Y ∘ f_d for some f_d : X → Y, where δ_X : X ⇸ X sends x to δ_x, the Dirac measure at x. Intuitively, determinism means that f allows only one possible jump at each x in X. In particular, δ_X is a deterministic kernel itself (with an identical jump).
One can also use a kernel as a family of probabilities over Y parameterised by X. Each probability in the range of the kernel can be thought of as a competing description of a hidden true probability. Given an additional probability p on the parameter space X, a prior, one can specify how much 'trust' one has in any particular description offered. This is the Bayesian model of quantification of uncertain probabilities: the prior describes our beliefs and 'Bayesian update' is a process by which new data (minted by the true random source) can be incorporated to modify the prior, and update our beliefs. The hope is that in the long run the successive priors will take us closer to the truth.

Kernels as operators
Yet another standpoint on kernels is to look at them as linear maps. Indeed, a finite f : X → GY can be seen as a transition matrix T(f) of type X × Y with the (x, y)-entry specifying the probability that, being at x in X, one jumps to y in Y. Thus, one can think of f : X → GY as a linear map from the free vector space over Y to that over X. Clearly T(δ_X) = I_X. In particular, a probability p over X, i.e. a map 1 → GX, is a matrix T(p) of type 1 × X (a row vector). This new standpoint gives ready access to the key operation of kernel composition, written •_G. Because G is a monad (known as the Giry monad), one knows how to compose kernels for general reasons using the so-called Kleisli composition (see Section 2). In the operator interpretation, Kleisli composition is just plain matrix multiplication. E.g. the composition f •_G p of p and f is represented as T(p)T(f), a new row vector of type 1 × Y. The (contravariant) functor T can be extended to arbitrary measurable kernels, using the machinery of Banach cones of real functions (see Section 4.4).
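For finite spaces, the matrix reading of kernels described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: kernels are encoded as lists of rows of exact fractions, and the names compose and identity, as well as the numerical values, are ours and not part of the paper.

```python
from fractions import Fraction as F

def compose(f, g):
    """Matrix product T(f) T(g), representing the Kleisli composite g •_G f
    of finite kernels f : X -> GY and g : Y -> GZ (hypothetical helper)."""
    return [[sum(f[x][y] * g[y][z] for y in range(len(g)))
             for z in range(len(g[0]))]
            for x in range(len(f))]

def identity(n):
    """T(delta_X) for a space X with n points: the n x n identity matrix."""
    return [[F(int(i == j)) for j in range(n)] for i in range(n)]

# A kernel f : X -> GY with |X| = 2 and |Y| = 3 (illustrative values).
# Each row is a probability distribution over Y.
f = [[F(1, 2), F(1, 2), F(0)],
     [F(0),    F(1, 4), F(3, 4)]]

# A probability p : 1 -> GX is a 1 x |X| row vector.
p = [[F(1, 3), F(2, 3)]]

# T(p) T(f): the row vector representing f •_G p.
q = compose(p, f)
```

Here q is again a single row summing to 1, and composing with identity(2) on the left or identity(3) on the right leaves f unchanged, which matches T(δ_X) = I_X.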

Importantly, we will deal not with 'naked' Markov kernels, but with 'typed' ones of the form:
[Diagram: the commuting triangle in the Kleisli category of G formed by p : 1 ⇸ X, q : 1 ⇸ Y and f : X ⇸ Y (left), and its compact form f : (X, p) ⇸ (Y, q) (right).]
where the triangle (above, left) is drawn in the Kleisli category of G in Mes and assumed to commute. The simpler diagram (above, right) is just a more compact notation for the same.
(Recall that we use blocky arrows to remind us that these are Kleisli arrows.) Either diagram means that in addition to f, we are given probabilities p on X, q on Y with f •_G p = q. This category of typed kernels has a natural subcategory where f is deterministic. In that case the typing simply amounts to saying that q is the push-forward of p along the underlying measurable map f_d. E.g. δ_X : (X, p) ⇸ (X, p) for any p in GX. This subcategory is (equivalent to) the familiar category of probabilistic triples and measurable measure-preserving maps.

Bayesian update or inversion
Our main question is as follows. Given a typed kernel f : (X, p) ⇸ (Y, q), we wish to build and characterise a 'weak' inverse f† : (Y, q) ⇸ (X, p). In the Bayesian world, f† should be the update map we mentioned above. Given a data input y in Y, this map returns a posterior f†(y) which represents our updated set of beliefs. It is therefore central to the theory to obtain good and general descriptions of f†. We will access such descriptions of f† following two routes. The direct route uses the non-trivial notion of disintegration (aka regular conditional probabilities), which solves the inversion problem in the special case of a deterministic f. A clever construction based on couplings allows one to generalise it to any kernel. This is done in Section 3.3 and follows the construction of Ref. [4] while being markedly simpler. The other route goes the operator way. As alluded to in the abstract, we use a domain-theoretic variant of the usual interpretation which carries a perfect duality (which one does not have in the standard interpretation). Then, as the notation suggests, we can just define T(f)† as the adjoint of T(f) (this is done in Section 4). Bayesian inversion is cone duality! Whence the title. We then show that through our functorial bridge, T (really a family T_p), both routes agree (Section 5).
The paper is almost self-contained. The only pieces of mathematics borrowed are the disintegration and the cone duality results and some of the most basic definitions. We now turn to the technical preliminaries.

Measurable and Polish spaces
We refer to Ref. [1] for the definitions of measurable spaces and maps, and Ref. [11] for an introduction to the theory of Polish spaces (completely metrisable and separable topological spaces). Where convenient, we will denote measurable spaces and related structures by their underlying set. If X is a set, (Y, Λ) a measurable space and f : X → Y a function, we denote by σ(f) its initial σ-algebra, which is the smallest σ-algebra that makes it measurable. As seen earlier, the category of measurable spaces and measurable maps is denoted by Mes, and that of Polish spaces and continuous maps by Pol. There is a functor B : Pol → Mes associating any Polish space to the measurable space with the same underlying set equipped with the 'Borel' σ-algebra (generated by open sets), and interpreting continuous maps as measurable ones. Measurable spaces in the range of B are the standard Borel spaces.
A measure p on a measurable space (X, Σ) is a set function Σ → R which is σ-additive and such that p(∅) = 0. One says p is a finite measure whenever p(X) < ∞, and a probability measure if p(X) = 1. A property holds p-almost surely (p-a.s.) if its negation holds on a set of measure 0. A measure space is a triple (X, Σ, p) such that (X, Σ) is a measurable space and p is a finite measure on (X, Σ). We denote by p|Λ the restriction of p to a sub-σ-algebra Λ ⊆ Σ.

Radon-Nikodym and conditional expectations
Let (X, Σ) be some measurable space. For p, q finite measures, we say that p is absolutely continuous with respect to q if for all B measurable, q(B) = 0 implies p(B) = 0. This will be denoted by p ≪ q. The Radon-Nikodym theorem tells us that we can express p in terms of its derivative with respect to q:
Theorem 1 (Radon-Nikodym). If p ≪ q there exists a q-a.s. unique positive integrable function denoted by dp/dq : (X, Σ, q) → R such that p = B ↦ ∫_B (dp/dq) dq.
The function dp/dq is called the Radon-Nikodym derivative of p with respect to q. Let us denote by f · p the measure B ↦ ∫_B f dp. The following two identities follow from Theorem 1: (i) d(f · p)/dp = f, (ii) (dp/dq) · q = p. We refer the reader to [1] for further facts about Radon-Nikodym derivatives. Conditional expectations can be implemented in terms of Radon-Nikodym derivatives.
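On a finite space the Radon-Nikodym derivative is simply a pointwise ratio of masses, which makes the two identities easy to check by hand. The sketch below assumes that the product of a function f and a measure p denotes the measure with point masses f(x)p(x); the helper names rn_derivative and density_times and the numbers are illustrative and not taken from the paper.

```python
from fractions import Fraction as F

def rn_derivative(p, q):
    """dp/dq for finite measures given as lists of point masses, assuming p << q."""
    assert all(pi == 0 for pi, qi in zip(p, q) if qi == 0), "p is not << q"
    # The derivative is only q-a.s. unique: on q-null points we arbitrarily pick 0.
    return [pi / qi if qi != 0 else F(0) for pi, qi in zip(p, q)]

def density_times(f, p):
    """The measure with density f with respect to p (point masses f(x) p(x))."""
    return [fi * pi for fi, pi in zip(f, p)]

q = [F(1, 2), F(1, 2), F(0)]
p = [F(1, 4), F(3, 4), F(0)]
d = rn_derivative(p, q)

# Identity (ii): integrating the derivative against q gives back p.
recovered_p = density_times(d, q)

# Identity (i): the derivative of the measure with density h is h, q-almost surely.
h = [F(2), F(5), F(7)]
dh = rn_derivative(density_times(h, q), q)
```

Note that dh and h differ at the third point, which carries no q-mass; this is exactly the q-a.s. uniqueness in Theorem 1.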

Probability functors
The endofunctor G : Mes → Mes associates to any measurable space X the set of all probability measures on X with the smallest σ-algebra that makes the evaluation functions ev_B : G(X) → R, p ↦ p(B) measurable, for B a measurable set in X. If f : X → Y is measurable, the action of the functor is defined by G(f)(P) = P ∘ f^{-1}. This functor can be endowed with the structure of the Giry monad (G, µ, δ). The multiplication µ : The Kleisli category of G will be denoted by K. It has the same objects as Mes. For all X, Y measurable spaces, a Kleisli arrow f :

Bayesian inversion
Let D be a space representing some space of data and let t ∈ G(D) be the truth, an unknown probability measure that we wish to discover by sampling repeatedly from it. In order to make this search analytically or computationally tractable, or more generally to reflect some additional knowledge or assumptions held about the truth, one might wish to parameterise the search through a space H of parameters and a measurable likelihood function f : H ⇸ D. The uncertainty about which parameter best matches the truth is represented by a probability p ∈ G(H) called the prior. The composite of the two arrows q = f •_G p is called the marginal likelihood.
Bayesian inversion is the construction from these data of a posterior map g : D ⇸ H, also called the inference map. Upon observing a sample d ∈ D, this inference map will produce an updated prior g(d). In good cases (e.g. H and D finite, and q absolutely continuous w.r.t. t), sampling independently and identically from the truth t and iterating this Bayesian update will make the marginal likelihood converge (in some topology to be chosen carefully) to t. The key step in the above process is the construction of the posterior g, which relies crucially on disintegrations.
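When H and D are both finite, the posterior map reduces to the elementary Bayes rule g(d)(h) = p(h) f(h)(d) / q(d). The following sketch is a toy instance of this special case only; the function names marginal and posterior, the convention used on q-null data points, and the numbers are assumptions made for illustration.

```python
from fractions import Fraction as F

def marginal(p, f):
    """Marginal likelihood q = f •_G p, i.e. q(d) = sum_h p(h) f(h)(d)."""
    return [sum(p[h] * f[h][d] for h in range(len(p)))
            for d in range(len(f[0]))]

def posterior(p, f):
    """Inference map g : D -> GH with g(d)(h) = p(h) f(h)(d) / q(d)."""
    q = marginal(p, f)
    # On q-null data points g is unconstrained; returning the prior there is our choice.
    return [[p[h] * f[h][d] / q[d] for h in range(len(p))] if q[d] != 0
            else list(p)
            for d in range(len(q))]

p = [F(1, 2), F(1, 2)]            # prior over two hypotheses
f = [[F(9, 10), F(1, 10)],        # likelihood f(h) for h = 0
     [F(1, 2),  F(1, 2)]]         # likelihood f(h) for h = 1
q = marginal(p, f)                # marginal likelihood on D
g = posterior(p, f)               # posterior map, one row per observation d
```

Pushing q forward along g returns the prior p, which is the finite counterpart of g being typed as an arrow from (D, q) back to (H, p).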
Culbertson & Sturtz give in [4] a nice categorical account of Bayesian inversion in a setting close to K .In the following, we provide a streamlined view of their work by defining a category of kernels where disintegration and Bayesian inversion admit rather elegant statements.

Categories of kernels
Let F : Mes → K be the functor embedding Mes into the Kleisli category of G. It acts identically on spaces and maps measurable arrows f : X → Y to Kleisli arrows F(f) = δ_Y ∘ f.
1 ↓ F is the category having as objects probabilities p : 1 ⇸ X, denoted by (X, p), and as morphisms f : (X, p) ⇸_δ (Y, q) degenerate Kleisli arrows F(f) : X ⇸ Y such that q = F(f) •_G p. As said, these correspond to the usual notion of measure-preserving map.
1 ↓ K is the category having the same objects as 1 ↓ F but where arrows are nondegenerate, i.e. an arrow from (X, p) to (Y, q) as above is any Kleisli arrow f : X ⇸ Y such that q = f •_G p. Clearly, 1 ↓ F is a subcategory of 1 ↓ K with the same objects (aka lluf).
The following result ensures that for an arrow f : (X, p) ⇸ (Y, q), there are p-negligibly many points jumping to q-negligible sets (it corresponds to the condition of non-singularity of [3]).
Proof. By definition of 1 ↓ K, q(B) = ∫_X f(x)(B) dp. Assume q(B) = 0; then having f(x)(B) > 0 on a set of strictly positive p-measure implies that the integral is strictly positive, yielding a contradiction.
CONCUR 2016

For all objects (X, p), (Y, q), let R_{(X,p),(Y,q)} be the smallest equivalence relation on Hom_{1↓K}(X, Y) such that (f, f′) ∈ R_{(X,p),(Y,q)} if f and f′ are p-a.s. equal.

Lemma 4. R defines a congruence relation on 1 ↓ K .
Proof.We must show that for all g : (X , p ) (X, p) and all h : p p -a.s., hence the following equation holds for p -almost all x : which concludes the proof.
This congruence relation allows us to consider R-equivalence classes of 1 ↓ K arrows as proper morphisms in the corresponding quotient category (Section 2.8, [13]):
Definition 5. The category Krn is the quotient category (1 ↓ K)/R, with subcategory Krn_δ obtained in the same way from 1 ↓ F. In other terms, an arrow f : (X, p) ⇸ (Y, q) in Krn is an equivalence class of kernels that are p-a.s. equal.

Disintegrations
Disintegrations are also called regular conditional probabilities and correspond to measurable families of conditional probabilities. Working in the setting of standard Borel spaces ensures their existence, and the corresponding statement admits a particularly elegant form in Krn:
Theorem 6 (Disintegration, [8]). Let X and Y be standard Borel spaces, and let f : (X, p) ⇸_δ (Y, q) be an arrow in Krn_δ. There exists a unique Krn arrow f† : (Y, q) ⇸ (X, p) that verifies f†(y)(f^{-1}({y})) = 1 q-a.s.
We call f† the disintegration of p along f. We will show in Section 5 that disintegrations and more generally Bayesian inverses are adjoints, hence the use of the −† notation. The last condition can be equivalently stated as the fact that f •_G f† = id_{(Y,q)}. In order to bridge our crisp statement of Theorem 6 with the usual measure-theoretic one, let us unfold the objects at play. If we inspect the type of the arrows f : (X, p) ⇸_δ (Y, q) and f† : (Y, q) ⇸ (X, p) we see that by definition of composition in K, we have the equation p = f† •_G q, i.e. p(A) = ∫_{y∈Y} f†(y)(A) dq for every measurable set A in X. We recall that the uniqueness of f† claimed in Theorem 6 is really that of a q-equivalence class of kernels. Note also that disintegrations do not in general exist in Pol as they need not be continuous (even when disintegrating along a continuous map).
Disintegration, as said above, is a measurable family of conditional probabilities. The sub-σ-algebras against which the conditionings are performed are encoded through the measurable map f along which the disintegration is computed. Simple calculations make explicit how conditional expectation underpins disintegration: for all h : (X, p) → R integrable,
∫_X h dp = ∫_{y∈Y} (∫_X h df†(y)) dq (Equation 1)
The last equation corresponds to the usual measure-theoretic characteristic identity of disintegrations. Let us consider a measurable set B ∈ σ(f) and let us apply this identity to the function h · 1_B. Applying a change of variables on q = G(f)(p), we get: We recognise the characteristic identity of conditional expectations (Definition 2). This implies that the following identity holds p-almost everywhere:
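In the finite case a disintegration along a deterministic map is nothing more than conditioning p on each fibre, and Equation 1 becomes a finite double sum. The sketch below illustrates this under our own encoding: the measurable map is a lookup table fd, and the helper names pushforward and disintegrate, together with the convention on q-null points, are hypothetical.

```python
from fractions import Fraction as F

def pushforward(p, fd, n_y):
    """q = G(fd)(p): the mass of each y is the p-mass of its fibre."""
    q = [F(0)] * n_y
    for x, px in enumerate(p):
        q[fd[x]] += px
    return q

def disintegrate(p, fd, n_y):
    """f†(y): p conditioned on the fibre of y.
    On q-null points the value is arbitrary; uniform on X is our choice."""
    q = pushforward(p, fd, n_y)
    n_x = len(p)
    return [[(p[x] / q[y] if fd[x] == y else F(0)) for x in range(n_x)]
            if q[y] != 0 else [F(1, n_x)] * n_x
            for y in range(n_y)]

p = [F(1, 6), F(1, 3), F(1, 2)]   # probability on X = {0, 1, 2}
fd = [0, 0, 1]                    # deterministic map X -> Y = {0, 1}
q = pushforward(p, fd, 2)
f_dag = disintegrate(p, fd, 2)

# Both sides of Equation 1 for an arbitrary test function h on X.
h = [F(3), F(6), F(-2)]
lhs = sum(h[x] * p[x] for x in range(3))
rhs = sum(q[y] * sum(h[x] * f_dag[y][x] for x in range(3)) for y in range(2))
```

Each f_dag[y] puts all its mass on the fibre of y, which is the finite form of the condition f†(y)(f^{-1}({y})) = 1, and lhs and rhs coincide.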

Bayesian inversion
Bayesian inversion is a reformulation of the disintegration theorem where the map f is allowed to be any arrow of Krn (and not just a deterministic one). To formulate our Bayesian inversion theorem we will define two Set-valued functors and two natural transformations between them. The first functor is simply the functor Hom((X, p), −) : Krn → Set for a given (X, p) in Krn. For notational clarity, we will abbreviate objects (X, p), (Y, q), (Y′, q′) in Krn to X, Y, Y′ with the understanding that they come equipped with measures p, q, q′. It is useful to explicitly write the action of Hom(X, −) on morphisms g : Y ⇸ Y′. By definition Hom(X, −)(g) : Hom(X, Y) → Hom(X, Y′) maps f ∈ Hom(X, Y) to the kernel g •_G f.
Given X in Krn, our second functor Γ(X, −) : Krn → Set is defined on objects as follows: we define Γ(X, Y) ⊆ G(X × Y) to be the set of couplings of p and q, corresponding to measures γ such that G(π_X)(γ) = p and G(π_Y)(γ) = q. Couplings correspond to elements γ such that the following diagram commutes in K:

where ⊗ : GX × GY → G(X × Y ) is the product measure bifunctor and δ X is the Giry unit at X.By unravelling the definitions we get The proof that Γ(X, −) commutes with composition follows from the disintegration theorem.We now define a transformation α X : Hom(X, −) → Γ(X, −) defined at Y by where (1) follows from the definition of Γ, (2) follows from the fact that α X Y constructs a coupling in an explicitly disintegrated form, and (3) follows from the definition of α X and Hom(X, −).
Our second natural transformation goes in the opposite direction and is given by the disintegration along the first projection (which exists by Theorem 6), i.e. we define D X : Γ(X, −) → Hom(X, −) at Y in Krn by: Proof.Let g : (Y, q) → (Y , q ) and γ ∈ Γ(X, Y ).For notational clarity, for any h ∈ Hom(X, Y ) let us write h = Hom(X, −)(g)(h), and let us define f D X Y (γ).We now calculate: where (1) follows by definition of Γ(X, −), (2) is by definition of f and the Disintegration Theorem 6, and (3) by definition of Hom(X, −).It follows immediately that f factors through the disintegration of Γ(X, g)(γ) along the first projection, i.e. that F. Dahlqvist, V. Danos, I. Garnier, and O. Kammar
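The passage between kernels and couplings is easy to visualise on finite spaces. In the sketch below, a kernel f : (X, p) ⇸ (Y, q) is turned into the joint matrix γ(x, y) = p(x) f(x)(y); conditioning γ on its first coordinate recovers f, while conditioning on the second coordinate yields a candidate Bayesian inverse. The function names, the fallback on null rows and the numbers are our assumptions, and the code only mirrors the finite case.

```python
from fractions import Fraction as F

def coupling(p, f):
    """Joint measure gamma(x, y) = p(x) f(x)(y) built from a prior and a kernel."""
    return [[p[x] * f[x][y] for y in range(len(f[0]))] for x in range(len(p))]

def marginals(gamma):
    """First and second marginals of a joint matrix."""
    return [sum(row) for row in gamma], [sum(col) for col in zip(*gamma)]

def transpose(gamma):
    return [list(col) for col in zip(*gamma)]

def normalise_rows(gamma, fallback):
    """Condition on the row index; rows of zero mass get the fallback distribution."""
    return [[g / sum(row) for g in row] if sum(row) != 0 else list(fallback)
            for row in gamma]

p = [F(1, 4), F(3, 4)]
f = [[F(1, 2), F(1, 2)],
     [F(1, 3), F(2, 3)]]

gamma = coupling(p, f)
first, second = marginals(gamma)                      # should be p and q = f •_G p
f_again = normalise_rows(gamma, second)               # conditioning on X recovers f
f_dag = normalise_rows(transpose(gamma), first)       # conditioning on Y: a kernel Y -> GX
```

With these values the second marginal is q = (3/8, 5/8), and f_dag has rows (1/3, 2/3) and (1/5, 4/5).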

ω-complete normed cones
We recall some facts pertaining to the categories of ω-complete normed cones introduced in [14].The main object of this section is to give a functional analytic account of kernels as operators on ω-complete normed cones, in the style of [3].We improve on the latter by presenting the transformation from kernels to operators functorially.In Section 5, we will use the machinery developed here to interpret Bayesian inversion in this domain-theoretic setting.We first recall some general definitions about ω-complete normed cones and the associated category ωCC.We then concentrate on the duality existing between the subcategories of cones of integrable and bounded functions.Proofs not provided here can be found in Ref. [3].

Basic definitions
Cones are axiomatisations of the positive elements of (real) vector spaces.

Definition 10 (Cones).
A cone (V, +, •, 0) is a set V together with an associative and commutative operation + with unit 0 and with a multiplication by real positive scalars • distributive over +.We have two more axioms: the cancellation law ∀u, v, w ∈ V, v Any cone C admits a natural partial order structure ≤ defined as follows: u ≤ v if and only if there exists w such that v = u + w.We will consider normed cones which are complete with respect to increasing sequences (chains) in this order which are of bounded norm.

Definition 11 (Normed cones, ω-complete).
A normed cone C is a cone together with a function ‖−‖ : Note that the norm ‖−‖ is ω-continuous. All the cones we are going to consider in the following are ω-complete and normed. ω-continuous linear maps form the natural notion of morphism between such structures. Note that linearity implies monotonicity in the natural order.
Definition 12 (ω-continuous linear maps). For C, D ω-complete normed cones, an ω-continuous linear map f : C → D is a linear map such that for every chain (u_n)_{n∈N} for which The dual of an ω-complete normed cone is defined in the usual way. We will admit the following result: Therefore, ‖f*‖ ≤ ‖f‖. We now introduce the cones we are going to work with in the remainder of the paper.

Cones of measures and of measurable functions
Let us fix a measure space (X, Σ, p) with p finite. Many of the constructions in the rest of the paper rely on dualities between the cones of measurable functions L^+_1(X, Σ, p), L^+_∞(X, Σ, p) and cones of measures M^p(X, Σ), M^p_UB(X, Σ). Let us introduce these cones in more detail.

Cones of measurable functions
Two positive measurable maps f, f′ : (X, Σ, p) → R^+ are said to be p-equivalent if f = f′ p-a.s. Clearly, being p-integrable is preserved by p-equivalence. The elements of the cone L^+_1(X, Σ, p) are p-equivalence classes of real-valued integrable maps. L^+_1(X, Σ, p) is normed by ‖f‖_1 = ∫_X f dp. The dominated convergence theorem implies that L^+_1(X, Σ, p) is an ω-complete normed cone. A positive measurable map f is p-essentially bounded if there exists C ≥ 0 such that p({x | f(x) > C}) = 0. The elements of the cone L^+_∞(X, Σ, p) are p-equivalence classes of real-valued essentially bounded maps. The norm is given by ‖f‖_∞ = inf {C ≥ 0 | f(x) ≤ C p-a.s.}.

Cones of measures
Closely related to the cones above are cones of absolutely continuous measures M^p(X, Σ) and bounded measures M^p_UB(X, Σ). M^p(X, Σ) is the cone of finite measures which are absolutely continuous with respect to p, with norm given by ‖q‖ = q(X). M^p_UB(X, Σ) is the cone of finite measures which are uniformly bounded by a finite multiple of p, with norm given by ‖q‖_UB = inf {c ≥ 0 | q ≤ cp}. The ω-completeness of these cones will appear as a byproduct of the duality to be proved next.
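The four norms introduced so far can be computed directly on a finite space, where the infimum in the essential-supremum and uniform-bound norms becomes a maximum over points of positive p-mass. The sketch below is a finite illustration only; the function names are ours, and measures and functions are encoded as lists of exact fractions.

```python
from fractions import Fraction as F

def norm_1(f, p):
    """L_1 norm of a positive function f: its integral against p."""
    return sum(fx * px for fx, px in zip(f, p))

def norm_inf(f, p):
    """Essential supremum of f: values on p-null points are ignored."""
    return max((fx for fx, px in zip(f, p) if px != 0), default=F(0))

def norm_total(q):
    """Norm on the cone of measures absolutely continuous w.r.t. p: total mass q(X)."""
    return sum(q)

def norm_ub(q, p):
    """Least c with q <= c p, assuming q vanishes wherever p does."""
    assert all(qx == 0 for qx, px in zip(q, p) if px == 0)
    return max((qx / px for qx, px in zip(q, p) if px != 0), default=F(0))

p = [F(1, 2), F(1, 2), F(0)]
f = [F(1), F(3), F(100)]                    # the value 100 sits on a p-null point
q = [fx * px for fx, px in zip(f, p)]       # the measure with density f w.r.t. p
```

In this example the measure with density f has total mass equal to the L_1 norm of f and uniform bound equal to its essential supremum, which gives a concrete feel for why the cones of functions and the cones of measures are paired as they are.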

Duality between L 1 and L ∞ cones
In the following, we will denote by L^+_1 the full subcategory of ωCC having as objects cones L^+_1(X, Σ, p) and by L^+_∞ the full subcategory of ωCC having as objects cones L^+_∞(X, Σ, p). As indicated before, the construction of the duality goes through cones of absolutely continuous measures M^p(X, Σ) and bounded measures M^p_UB(X, Σ).

See Sections 3 and 4 of [3] for more details. We sum up the developments so far in the following (non-commuting) diagram: We conclude this section by indicating that the duality between L^+_1 and L^+_∞ generalises to arbitrary dual pairs L^+_p, L^+_q [2]. We conjecture that the functors T_1, T_∞ have counterparts in this more general setting. In the next section, we give a functional interpretation of Bayesian inversion through these functors.

Bayesian inversion as duality
As shown in Section 3, Bayesian inversion is a symmetrised disintegration, which by Equation 2 corresponds to a measurable family of conditional probabilities. As these are a fundamental tool of the modern probabilistic toolkit, a natural question is to find a corresponding process in the functional analytic setting of norm-1 operators between the ω-complete cones L^+_1 and L^+_∞. It is well-known that conditional expectation can be framed as a projection operator (e.g. in the L_2 case, see [10]). The result we are about to prove provides a fresh perspective on this classical problem: the Bayesian inverse of a kernel corresponds to the adjoint of its functional form.
Theorem 19 (Inversion as duality). Let X, Y be standard Borel spaces and f : (X, p) ⇸ (Y, q) a Krn arrow. We have: Proof. Let us recall the types of the objects: Unfolding, we must prove: Let γ ∈ Γ(p, q) be the coupling corresponding to f : (X, p) ⇸ (Y, q). Continuing the string of equivalences above, we must prove: Taking e(x, y) = v(x)u(y) and using equations marked 1 and 2 above concludes the proof.
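A finite sanity check may help to see why an inverse behaves like an adjoint. In the sketch below a kernel acts on functions by averaging, (T(f)u)(x) = Σ_y f(x)(y) u(y), and the pairing of two functions against a measure is the integral of their product; both sides of the adjunction-style identity then reduce to the same sum Σ γ(x, y) v(x) u(y) over the coupling γ. The precise form of the identity, the helper names and the numbers are our assumptions for the finite case, not a restatement of the theorem.

```python
from fractions import Fraction as F

def apply(kernel, u):
    """Action of a finite kernel on a function: (T(k) u)(x) = sum_y k(x)(y) u(y)."""
    return [sum(k * uy for k, uy in zip(row, u)) for row in kernel]

def pairing(u, v, m):
    """Integral of the pointwise product u v against the measure m."""
    return sum(ux * vx * mx for ux, vx, mx in zip(u, v, m))

def bayes_inverse(p, f):
    """Finite Bayes rule: returns q = f •_G p and f†(y)(x) = p(x) f(x)(y) / q(y).
    Assumes q has full support, which holds for the values below."""
    q = [sum(p[x] * f[x][y] for x in range(len(p))) for y in range(len(f[0]))]
    f_dag = [[p[x] * f[x][y] / q[y] for x in range(len(p))] for y in range(len(q))]
    return q, f_dag

p = [F(1, 4), F(3, 4)]
f = [[F(1, 2), F(1, 2)],
     [F(1, 3), F(2, 3)]]
q, f_dag = bayes_inverse(p, f)

u = [F(2), F(5)]     # a function on Y
v = [F(7), F(1)]     # a function on X

lhs = pairing(apply(f, u), v, p)        # pair T(f)u with v over (X, p)
rhs = pairing(u, apply(f_dag, v), q)    # pair u with T(f†)v over (Y, q)
```

Both lhs and rhs evaluate to the same fraction for any choice of u and v, since each is a rearrangement of the same double sum over the coupling.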
Some comments are in order. Note that the disintegration and Bayesian inversion theorems rely on some strong assumptions on the underlying spaces: here, we assume the spaces to be standard Borel; Culbertson & Sturtz [4] work in the setting of perfect measure spaces and equiperfect kernels. However, the cone duality works for any measure space! We conjecture that these regularity conditions are necessary if one wishes to extract a measurable kernel from a Markov operator or dually from an abstract Markov kernel.

Conclusion
We have established that the functional representation of measurable kernels as operators acting on ω-complete cones presented in [3] is functorial. Two variants of the functor exist, mapping kernels to operators acting either on bounded functions or on integrable ones. The category of 'typed' kernels on which these functors are defined allows one to state elegantly the famous disintegration theorem and its generalisation, Bayesian inversion. What's more, we uncovered the categorical underpinnings of Bayesian inversion as particular natural transformations mapping kernels to couplings and reciprocally. Finally, we have shown that Bayesian inversion amounts in the functional world to adjunction. Several further developments suggest themselves. First of all, it remains to be seen whether our construction generalises from the duality L^+_1 / L^+_∞ to arbitrary pairs of dual cones L^+_p / L^+_q (e.g. the pair p = q = 2, which allows one to talk about reversible kernels), and whether one can prove a stronger statement, namely T_p(f†) = T_q(f)† for all conjugate exponents p, q. Another line of thought is to connect these results with some of the authors' recent work [7,5]. In particular, instantiating the kernel-theoretic framework with the Dirichlet process [7] might provide insight into the operator-theoretic counterpart of so-called nonparametric methods in Bayesian learning. On a different note, this process has the type of a natural and continuous kernel and admits a convenient finitary characterisation. We are eager to study how these properties map through the operator interpretation.