Classification among Hidden Markov Models

An important task in AI is that of classifying an observation as belonging to one class among several (e.g. image classification). We revisit this problem in a verification context: given k partially observable systems modeled as Hidden Markov Models (also called labeled Markov chains), and an execution of one of them, can we eventually classify which system performed this execution, just by looking at its observations? Interestingly, this problem generalizes several problems in verification and control, such as fault diagnosis and opacity. Also, classification has strong connections with different notions of distances between stochastic models. In this paper, we study a general and practical notion of classifiers, namely limit-sure classifiers, which allow misclassification, i.e. errors in classification, as long as the probability of misclassification tends to 0 as the length of the observation grows. To study the complexity of several notions of classification, we develop techniques based on a simple but powerful notion of stationary distributions for HMMs. We prove that one cannot classify among HMMs iff there is no finite separating word from their stationary distributions. This provides a direct proof that classifiability can be checked in PTIME, as an alternative to existing proofs using separating events (i.e. sets of infinite separating words) for the total variation distance. Our approach also allows us to introduce and tackle new notions of classifiability which are applicable in a security context.


Introduction
The spectacular success of artificial intelligence (AI) and machine learning techniques in many varied application domains in the last decade has raised several questions, new and old, especially regarding their guarantees and correctness. This has led to several recent projects at the interface of formal methods and AI, whose broad goal is to formally reason about and verify properties of these AI models and tasks. One such important task in AI is classification, which is a fundamental problem with many practical applications, e.g., in image processing. In this paper, we consider classification in a verification context. One main issue when verifying systems is partial observability. It is thus important to know what information can be recovered from a partially observable system. We first consider a system perspective: we want to know whether, no matter the execution of the system, some hidden information is retrievable, at least with high probability. To represent the system, we thus consider a partially observable stochastic model, namely Hidden Markov Models (HMM for short) [14,10], also known as labeled Markov chains [5] or probabilistic labeled transition systems [4]. While notationally different, these various models are equivalent in terms of expressive power. In HMMs, states are not observable, but we get some (potentially stochastic) signals from states. In the specific variant of HMMs that we study in this paper, we encode the signals from states as labels of transitions exiting states. That is, the observation from an execution of an HMM is its labeling sequence. We encode the different hidden information as several HMMs, with different transition probabilities. Finding the hidden information from the observation thus amounts to classifying the observation among the different HMMs.
Many problems concerning systems with hidden information can be recast in the framework of classification, such as (i) fault diagnosis: classifying between a faulty system that has executed errors and the system without faults [16,18,3,4]; (ii) opacity: classifying between high and low privilege parts of the system [10], etc. Although some problems are incomparable (e.g. diagnosis is intrinsically "asymmetric" while classification is "symmetric"), most proof techniques and ideas are common. Moreover, results on classification problems have been applied to show results in these related contexts. While it is not our aim to survey these applications here, we provide two instances: a fault diagnosis problem [4] is solved using a result on distance between stochastic systems [5], which is equivalent to classification [11]. Also, opacity is cast as a classification problem in [10]. We hence believe that classification is a good framework to state and prove algorithmic and complexity results.
Several notions of classification can be defined: sure, almost-sure, and limit-sure, depending respectively on whether we want the classification to eventually happen for sure, with probability 1, or with arbitrarily small error. The first two notions have classical solutions coming from fault diagnosis [16,3]: the existence of such classifiers can be checked in PTIME and PSPACE respectively. The third notion is however the most practical, as the classifier is the most powerful: it can use the long run statistics on observations to take its decision (e.g. the frequency of ab's in the word). It is also the hardest notion to study for this very reason.
We focus on this notion of limit-sure classification in this paper. First, a closely related problem, distinguishability, has been proved to be in PTIME by [11], using the PTIME algorithm from [5] to test whether the total variation metric between two HMMs is 1. We reinvestigate these deep results using different techniques, which shine some new light on this problem. Our starting point is the following: for a very restricted class of HMMs [10], whose underlying Markov chains are ergodic and, crucially, whose initial distributions have non-zero probability on every state, it is sufficient to consider the statistics on states (e.g. the frequency of state s). These statistics on states are obtained by [10] using the classical notion of stationary distributions over the underlying Markov chain, i.e. the HMM where we forget all observations. As we show in Example 2, stationary distributions on Markov chains do not suffice for solving limit-sure classification for general HMMs.

Building on this idea, our first contribution is to develop a notion of stationary distributions for general HMMs in order to study the long run statistics of the observations. To do so, we focus on beliefs, that is, the sets of states that can be reached with the same observation. We show that a notion of stationary distributions can be defined for beliefs in Bottom Strongly Connected Components (BSCCs), and that it also corresponds to a notion of asymptotic distributions, describing the asymptotic statistics of beliefs. This generalizes stationary distributions for Markov chains: for instance, irreducible Markov chains of period k correspond to cycling through k different beliefs. We believe that this notion can find applications in other contexts.
S. Akshay, H. Bazille, E. Fabre, B. Genest

Our next contribution is to show how this notion of stationary distribution of HMMs can be used to characterize limit-sure classifiability. We show that we cannot classify between HMMs iff they have beliefs which can be reached by the same observation and for which the stationary distributions cannot be separated by one finite word (for which the probability is different). This provides a PTIME algorithm to test for limit-sure classifiability. Note that the existence of such a PTIME algorithm has been established in [11], where this result was formulated in terms of HMM distinguishability. The proofs are different however, as [11] focuses on separating events [5], that is, sets of infinite words with probability 0 (resp. 1) in one of the HMMs (resp. the other one), while considering stationary distributions allows us to focus on a single finite separating word with probability p (resp. q ≠ p).
Our final contribution is to study classifiability in a security context: an attacker has different attacks against different HMMs. To be able to perform an attack, the attacker needs to find one execution that can be classified (and thus attacked), rather than to classify every execution. We call this notion attack-classification. We study limit-sure attack-classification using the notion of stationary distributions for HMMs developed above. We show that deciding whether there exists a limit-sure attack-classifier between two HMMs is PSPACE-complete. On the other hand, if we consider a variation on the notion of limit-sure attack-classifier, which extends distinguishability for HMMs [11], we show not only that it differs from limit-sure attack-classification, but also that the corresponding problem is undecidable.

Preliminaries and Problem Statement
A Hidden Markov Model [14,15,10] (HMM for short) $A$ on a finite alphabet $\Sigma$ is a tuple $A = (S, M, \sigma_0)$ with $S$ a set of states, $\sigma_0$ an initial distribution, and $M : S \times \Sigma \times S \to [0, 1]$ such that for all $s$, $\sum_{a, s'} M(s, a, s') = 1$. Notice that this notion has been referred to using different names in the literature: labeled Markov chains, pLTS (probabilistic labeled transition systems) in [4], probabilistic automata (not to be confused with Rabin's probabilistic automata), etc. Classical Markov chains can be viewed as HMMs with a single-letter alphabet. In what follows we assume knowledge of classical properties and definitions about Markov chains, such as irreducibility and aperiodicity, and refer to [9] for a formal treatment.

A run $\rho$ of $A$ is a sequence in $S(\Sigma \times S)^*$. It starts in $s^-(\rho)$, with $\sigma_0(s^-(\rho)) > 0$, and ends in state $s^+(\rho)$. An observation $w$ from $A$ is a sequence of letters $w = a_1 \cdots a_n \in \Sigma^*$ such that there exists a run $\rho$ made of $n + 1$ states, $\rho = s_0 a_1 s_1 \cdots a_n s_n$, with $\sigma_0(s_0) > 0$ and for all $i > 0$, $M(s_{i-1}, a_i, s_i) > 0$. We denote $obs(\rho) = w$. For a run $\rho = s_0 a_1 s_1 \cdots a_n s_n$, we define its probability as $P(\rho) = \sigma_0(s_0) \cdot \prod_{i=1}^{n} M(s_{i-1}, a_i, s_i)$. We sometimes abuse notation and write $M(s_0, w, s_n)$ to mean $\prod_{i=1}^{n} M(s_{i-1}, a_i, s_i)$. We define the probability of an observation $w \in \Sigma^*$ as $P(w) = \sum_{\rho \mid obs(\rho) = w} P(\rho)$. In general we write $P^A_\sigma$ to express the probability in HMM $A$ with initial distribution $\sigma$. If $\sigma(s) = 1$, then we use $P^A_s$ instead.

A non-deterministic finite automaton (NFA for short) is as usual a structure $A = (S, \Delta, S_0)$, where the transition probabilities (as in an HMM) are replaced with a transition relation $\Delta$ and the initial distribution is replaced by a set of initial states $S_0$. With an HMM $(S, M, \sigma_0)$, we can associate the NFA $A = (S, \Delta, S_0)$, by taking $(s, a, t) \in \Delta$ iff $M(s, a, t) > 0$ and $s \in S_0$ iff $\sigma_0(s) > 0$. The notions of paths and observations are preserved. Fig. 1 shows an HMM on the left and an NFA on the right.
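The definition of $P(w)$ sums over all runs labeled by $w$; in practice this sum is usually computed by propagating weights letter by letter. The following is a minimal sketch of that computation, assuming an HMM is encoded as a dictionary from triples (s, a, s') to probabilities. The function name, the encoding and the two-state HMM are illustrative assumptions and are not taken from the paper (in particular, this is not the HMM of Fig. 1).

```python
from fractions import Fraction as F

def observation_probability(states, M, sigma0, word):
    """P(w): total probability of all runs labeled by `word` (forward propagation)."""
    # alpha[s] = probability of emitting the prefix read so far and ending in s
    alpha = {s: sigma0.get(s, F(0)) for s in states}
    for a in word:
        nxt = {t: F(0) for t in states}
        for (s, b, t), p in M.items():
            if b == a:
                nxt[t] += alpha[s] * p
        alpha = nxt
    return sum(alpha.values())

# Hypothetical HMM: from p, emit a (stay in p) or b (go to q); from q, emit a (go to p).
states = ["p", "q"]
M = {("p", "a", "p"): F(1, 2), ("p", "b", "q"): F(1, 2), ("q", "a", "p"): F(1)}
sigma0 = {"p": F(1)}

print(observation_probability(states, M, sigma0, "ab"))  # 1/4
print(observation_probability(states, M, sigma0, "bb"))  # 0
```

Exact fractions are used so that later comparisons between probabilities are not affected by rounding.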
The language of an automaton (or by extension of an HMM) is the set of observations $L(A) = \{w \mid w = obs(\rho), \rho \text{ a path of } A\}$. We denote by $L^\infty(A)$ the set of infinite observations in $A$, that is, the infinite words all of whose finite prefixes are in $L(A)$. Finally, we use the standard way to extend probabilities to some sets of infinite paths, by means of cylinder sets [1]. In particular, taking two HMMs $A_1, A_2$ on the same alphabet, $L^\infty(A_1) \cap L^\infty(A_2)$ is measurable. We write $L(A, s)$ for the language of $A$ starting in state $s$.
A strongly connected component $C$ of an HMM $A$ is a maximal set of states such that there is a path from any state of $C$ to any state of $C$. A strongly connected component $C$ is called a bottom strongly connected component (BSCC) if the only states reachable from $C$ are in $C$. For instance, there is only one BSCC in the NFA of Fig. 1, with 2 states, $\{x, y\}$ and $\{z\}$. Runs of an HMM end up in one of the BSCCs with probability 1.
Probabilistic Finite Automata (PFA). Several lower bounds will come from results on Rabin's probabilistic finite automata (PFA) [8]. A PFA $A$ on a finite alphabet $\Sigma$ is a tuple $A = (S, (M_a)_{a \in \Sigma}, \sigma_0)$ with $S$ a set of states, $\sigma_0$ an initial distribution, and $M_a : S \times S \to [0, 1]$ for each $a \in \Sigma$, such that for all $a, s$, $\sum_{s'} M_a(s, s') = 1$. Similar to HMMs, the states of a PFA are not observed, but only letters $a \in \Sigma$ are. The difference is that we can control a PFA by choosing an action $a \in \Sigma$, while in HMMs, we passively observe a letter $a \in \Sigma$.
It suffices then to check whether (σ

Notice that equivalence of PFAs has been known to be in PTIME for 30 years [19], before HMMs [2]. Actually, equivalence of HMMs and equivalence of PFAs are inter-reducible (one direction can be found in [7], and the other one is easy by considering the HMM associated with a PFA, which performs actions of the PFA uniformly at random).
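The easy direction mentioned above, the HMM that performs the actions of a PFA uniformly at random, can be sketched as follows. The encoding (one dictionary of transition probabilities per action), the function name and the example PFA are assumptions made for illustration only.

```python
from fractions import Fraction as F

def pfa_to_hmm(pfa):
    """HMM that picks each action of the PFA uniformly at random and emits it."""
    k = len(pfa)  # size of the alphabet
    return {(s, a, t): p / k
            for a, M_a in pfa.items()
            for (s, t), p in M_a.items()}

# Hypothetical PFA with two states p, q and two actions a, b.
pfa = {
    "a": {("p", "p"): F(1, 2), ("p", "q"): F(1, 2), ("q", "q"): F(1)},
    "b": {("p", "q"): F(1), ("q", "p"): F(1)},
}
M = pfa_to_hmm(pfa)

# Each state of the resulting HMM has outgoing probability mass 1.
for state in ["p", "q"]:
    print(state, sum(p for (s, _, _), p in M.items() if s == state))
```

Dividing every entry by the alphabet size is what turns the per-action stochastic matrices into a single transition function summing to 1 over letters and successors.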

The classification problem and its variants
Let $(A_i)_{i \le k}$ be a set of HMMs representing different behaviors of a system under observation. The system secretly picks one HMM behavior to follow, i.e. it is a priori unknown which of the HMMs is being followed by the system. We want to classify, i.e. find out, which HMM behavior the system follows, only by looking at the observation $w \in \Sigma^*$. The longer we observe the system, the longer the observation and the better the information we have to find out the HMM. This leads us to the notion of classifiability. As it suffices to consider HMMs pairwise, we will assume in the following that there is only a choice between $k = 2$ HMMs. We will denote them by $A_1$, with $n$ states, and $A_2$, with $m$ states. Formally, a classifier is a function $f : \Sigma^* \to \{\bot, 1, 2\}$ that outputs the index of the HMM from an observation, or possibly $\bot$ if it cannot conclude (yet).

Consider for example $A_1, A_2$, both following the HMM in Figure 1, the difference being that $A_1$ starts in $x$ while $A_2$ starts in $z$. If the observation starts with $b$, then we know the system follows $A_2$, because $b$ is not possible from $x$. We can thus let $f(bw) = 2$. However, if the observation is $ab^2a$, then it could come from either $A_1$ or $A_2$. If the systems are probabilistically equivalent, then no matter how much we observe, we cannot classify among them. However, this is one extreme case. One can consider several notions of classifiability:

sure classifiability: there exists a classifier $f$ that eventually identifies the accurate HMM that generated $w$. That is, for all $w \in \Sigma^\infty$, there exists a finite prefix $v$ of $w$ and a classifier $f$ for $v$ such that $f(v) = 1$ (resp. $f(v) = 2$) iff there is no path $\rho$ of $A_2$ (resp. of $A_1$) with $obs(\rho) = w$.

almost-sure classifiability: there exists a classifier $f$ that eventually identifies the accurate HMM that generated $w$ with probability 1. This classifier cannot make errors, but there may be some infinite observations that cannot be classified, though the probability that this happens should be 0 (such as tossing tails forever on a fair coin).

limit-sure classifiability: there exists a classifier $f$ that, for all $\varepsilon > 0$, eventually provides the accurate HMM with probability $> 1 - \varepsilon$. This is the most general notion: sure implies almost-sure implies limit-sure classifiability.
This leads to the two main questions that we are interested in, for each of the above notions: (i) how easy is it to decide if there exists a classifier? (ii) if there exists a classifier, how easy is it to produce one explicitly? For the first question, we can answer easily for the first two notions, which have been studied in different contexts.
For the first two notions, building the classifier is also easy: intuitively, it suffices to compute the set of states reached with the observation (called belief in the next section) for both HMMs. If the system is classifiable, one of these sets will eventually (almost surely with the second notion) become empty. The classifier answers the HMM with the non-empty set.
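The belief-based classifier described above can be sketched as follows. This is an illustration under assumptions: each HMM is encoded as a pair (transition dictionary, set of initial states), `None` stands for ⊥, and the two one-state HMMs are hypothetical.

```python
def belief(M, initial, word):
    """Set of states reachable by some run labeled by `word` (subset construction)."""
    current = set(initial)
    for a in word:
        current = {t for (s, b, t), p in M.items()
                   if b == a and s in current and p > 0}
    return current

def belief_classifier(hmm1, hmm2, word):
    """Answer 1 or 2 once the belief of the other HMM is empty, else None (for ⊥)."""
    b1 = belief(hmm1[0], hmm1[1], word)
    b2 = belief(hmm2[0], hmm2[1], word)
    if b1 and not b2:
        return 1
    if b2 and not b1:
        return 2
    return None

# Hypothetical HMMs: the first emits a or b, the second only emits a.
hmm1 = ({("p", "a", "p"): 0.5, ("p", "b", "p"): 0.5}, {"p"})
hmm2 = ({("q", "a", "q"): 1.0}, {"q"})

print(belief_classifier(hmm1, hmm2, "aa"))  # None: both beliefs are non-empty
print(belief_classifier(hmm1, hmm2, "ab"))  # 1: the letter b is impossible in hmm2
```

Since an empty belief stays empty on every extension of the observation, a verdict given by this sketch never changes once issued.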
Unlike the first two notions, limit-sure classifiability cannot be expressed in terms of the language. Indeed, it is possible to limit-surely classify among $A_1, A_2$, and yet $L(A_1) = L(A_2)$. Also, a limit-sure classifier can use statistics in order to give its estimate, which opens a lot of possibilities. Let us illustrate these:

Example 2. Consider again $A_1, A_2$, where both are the HMM $A$ from Fig. 1, where $A_1$ starts from state $x$ and $A_2$ starts from state $z$. If the observation starts with $b$, then it is easy to conclude that the HMM is $A_2$. If it starts with $a$, then the set of states which can be reached after observation $a$ is $\{x, y\}$ in $A_1$ and $\{z\}$ in $A_2$, which are both in the BSCCs. Actually, after an even number of $b$'s (and any number of $a$'s), the set of possible states is still $\{x, y\}$ in $A_1$ and $\{z\}$ in $A_2$. In the following section, using stationary distributions on HMMs, we will show how to compute that if the HMM is $A_1$, after an even number of $b$'s, the long term average is $\frac{3}{5}$ to be in $x$ and $\frac{2}{5}$ to be in $y$. From this, we deduce that the long term average is $\frac{4}{5} = \frac{3}{5} \cdot 1 + \frac{2}{5} \cdot \frac{1}{2}$ to perform an $a$ after an even number of $b$'s. On the other hand, if the HMM is $A_2$, then the state is $z$ and we obtain the long term average $\frac{1}{2}$ to perform letter $a$ after an even number of $b$'s. As the observation grows, the average frequency over the observation will tend towards the long term average by the law of large numbers. Thus the classifier $f(w) = 1$ if the average frequency of $a$'s after an even number of $b$'s observed in $w$ is closer to $\frac{4}{5}$ than to $\frac{1}{2}$, and $f(w) = 2$ otherwise, is limit-sure. Notice that using the standard stationary distributions on Markov chains as in [10] only tells us that both $A_1$ and $A_2$ stay in long term average frequency $\frac{3}{7}$ in $x$, $\frac{2}{7}$ in $y$, and $\frac{2}{7}$ in $z$, and thus do $\frac{3}{7} \cdot 1 + \frac{2}{7} \cdot \frac{1}{2} + \frac{2}{7} \cdot \frac{1}{2} = \frac{5}{7}$ of $a$'s on average, which cannot limit-surely classify between $A_1, A_2$.

FSTTCS 2019
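The statistical part of Example 2 can be sketched in code. The two long term averages below are the numbers stated in the example; the function names and the tie-breaking choice are assumptions of this sketch, which also ignores the special case of an observation starting with b.

```python
from fractions import Fraction as F

# Long term averages from Example 2, for letters emitted after an even number of b's.
freq_A1 = F(3, 5) * 1 + F(2, 5) * F(1, 2)  # belief {x, y} in A_1
freq_A2 = F(1, 2)                          # belief {z} in A_2

def a_frequency_after_even_bs(word):
    """Frequency of 'a' among the letters emitted after an even number of b's."""
    even, total, count_a = True, 0, 0
    for letter in word:
        if even:
            total += 1
            count_a += (letter == "a")
        if letter == "b":
            even = not even
    return F(count_a, total) if total else None

def frequency_classifier(word):
    """Answer 1 if the observed frequency is closer to freq_A1 than to freq_A2."""
    f = a_frequency_after_even_bs(word)
    if f is None:
        return None
    return 1 if abs(f - freq_A1) < abs(f - freq_A2) else 2

print(freq_A1)                       # 4/5
print(frequency_classifier("aaab"))  # 1 (observed frequency 3/4)
print(frequency_classifier("abab"))  # 2 (observed frequency 1/2)
```

On short observations the answer may well be wrong; the limit-sure guarantee only concerns the behavior as the observation grows.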
From the point of view of practical applicability, limit-sure classifiers are the most powerful, although harder to study. In Section 4, we will study limit-sure classifiability, which we simply call classifiability. In Section 5, we further generalize this notion to a game-theoretic attack-classification framework, which is applicable in security settings.

Stationary distributions for HMMs
In order to solve limit-sure classification, we would like to use statistics on observations. Stationary distributions, a concept developed for Markov chains, tell us which frequencies to expect for states, as used in [10]. We generalize this concept to HMMs to take observations into account. While stationary distributions for HMMs turn out to be crucial in the realm of classifiability, we believe they are also of independent interest.

For a Markov chain $M$, a stationary distribution $\sigma$ is a distribution over states of $M$ such that $\sigma \cdot M = \sigma$. In HMMs, the observation plays an important role and changes our knowledge of the states in which the run could be. Thus, we consider the set of states that could be reached in an HMM $A$ with a given observation, and call it the belief state, or just belief. Formally, let $w$ be an observation. The belief $B_A(w)$ associated with $w$ is the set of states $\{s^+(\rho) \mid obs(\rho) = w\}$ which can be reached by a path labeled by $w$. For instance, with the HMM $A$ from Fig. 1, we have $B_A(aa) = \{x, y\}$. We let $B_A = (2^S, \Delta, s_0)$ be the belief automaton associated with $A$: (i) its states represent the beliefs associated with observations of $A$, (ii) we have $(B, a, B') \in \Delta$ if $B' = \{s' \mid \exists s \in B, M(s, a, s') > 0\}$, and (iii)

We denote by $E_D$ the set of beliefs $X$ in the unique BSCC of $B^x_D$, and by $E_A$ the union over all BSCCs $D$ of $A$. Notice that $E_A$ may not contain all beliefs in the BSCCs of $B_A$, because we restrict ourselves to beliefs $X$ reachable from $\{x\}$ with a single state $x$ of a BSCC of $A$. This is crucial for Lemma 3 to hold. We will see that considering singletons is not a restriction: assume that the belief reached in a BSCC of beliefs comes from two different states $x, y$. Either the statistics on observations from $x$ and $y$ are the same, in which case we change nothing by considering them only from $x$. Otherwise, they have different statistics on observations, and looking at the observed statistics will give away, with arbitrarily small error, the state $x$ or $y$ from which they originate.
For Markov chains (i.e. HMMs on a one-letter alphabet), the BSCC $E_D$ is exactly with $k$ the period of this BSCC. Hence, this construction can be seen as a generalization to HMMs of the notion of period of a Markov chain. We use it to generalize the fundamental theorem of Markov chains to HMMs.
Let $X \in E_A$. We are interested in the asymptotic distribution associated with belief $X$, that is, the statistics over states of $X$ given that the belief state is $X$. From that, we will be able to deduce the statistics over observations. Let $W_X$ be the (possibly countably infinite) set of words which lead from belief $X$ to belief $X$ without seeing belief $X$ in between. Consider $\sigma_{y,i}$, the distribution over $X$ such that $\sigma_{y,i}(x) = \sum_{w \in W_X^i} M(y, w, x)$, the probability of reaching $x$ from $y$ after seeing $i$ words of $W_X$. To compute the limit of $\sigma_{y,i}$, we define the stationary distribution $\sigma_X : X \to [0, 1]$ of the HMM given a belief $X$. For that, we enrich the states of $A$ with its beliefs, considering the product $A \times B_A$ (same runs with same probabilities as in $A$). For all $x, y \in X$, let $M_X(x, y)$ be the probability in the HMM $A \times B_A$ to reach $(y, X)$ from $(x, X)$ before reaching any other $(z, X)$, $z \neq y$ (we refer to [1] to compute $M_X(x, y)$ for all $x, y$). We have that for all $x \in X$, $\sum_{y \in X} M_X(x, y) = 1$, that is, $M_X$ is a Markov chain. For instance, on Fig. 1, let $X = \{x, y\} \in E$. The Markov chain $M_X$ is depicted in Fig. 2; it has a unique stationary distribution, $\sigma(x) = \frac{3}{5}$ and $\sigma(y) = \frac{2}{5}$. We obtain:

Theorem 4. Given an HMM $A$, let $X$ be a belief in

Proof sketch. We apply the fundamental theorem to $M_X$ to get the statement. It suffices to show that $M_X$ is ergodic. For all $x \in X$, by Lemma 3, there is an observation $v_x$ leading from $\{x\}$ to $X$, i.e. $\Delta(\{x\}, v_x) = X$. As $\Delta(\{x\}, v_x^i)$ is increasing with $i$ and |∆({x} ). We can then obtain a word $w_x$ with $\Delta(\{x\}, w_x) = \Delta(X, w_x) = X$. Now, by induction on the size of $X$, we can build a uniform word $w$ such that $\Delta(\{x\}, w) = X$ for all $x \in X$. For all $x, y \in X$, we get $M_X^{|w|}(x, y) > 0$.
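To make the stationary distribution of $M_X$ concrete, the following sketch approximates it by power iteration. The transition matrix of Fig. 2 is not reproduced in the text, so the matrix below is a hypothetical two-state chain chosen only because its stationary distribution is (3/5, 2/5), matching the values stated above; the function name is also an assumption.

```python
def stationary_distribution(M, iterations=200):
    """Approximate sigma with sigma * M = sigma by power iteration.

    Assumes the chain is ergodic, so that the iteration converges.
    """
    n = len(M)
    sigma = [1.0 / n] * n
    for _ in range(iterations):
        sigma = [sum(sigma[i] * M[i][j] for i in range(n)) for j in range(n)]
    return sigma

# Hypothetical M_X over X = {x, y} (row 0 is x, row 1 is y); not the chain of Fig. 2.
M_X = [[2 / 3, 1 / 3],
       [1 / 2, 1 / 2]]
sigma_X = stationary_distribution(M_X)
print([round(p, 6) for p in sigma_X])  # [0.6, 0.4]

# Long term frequency of letter a, if a has probability 1 from x and 1/2 from y
# (the emission probabilities used in Example 2).
print(round(sigma_X[0] * 1 + sigma_X[1] * 0.5, 6))  # 0.8
```

Power iteration is used here for brevity; solving the linear system $\sigma \cdot M_X = \sigma$ exactly would serve equally well.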

Limit-sure Classifiability
We start by stating the definition of limit-sure classification more precisely:

Definition 5. Two HMMs $A_1, A_2$ are limit-sure classifiable iff there exists a computable function, also called a classifier, $f : \Sigma^* \to \{1, 2\}$ such that $P(\rho \text{ run of } A_1 \text{ of size } k \mid f(obs(\rho)) = 2) \to_{k \to \infty} 0$, and similarly for $\rho$ run of $A_2$.
(Notice we do not need $\bot$ as the classifier is allowed to give erroneous answers at first.) Consider the Maximum A Posteriori (MAP) classifier [14,10]: it answers 1 if $P^{A_1}(u) > P^{A_2}(u)$, and 2 otherwise. To do so, it just needs to record for every state $s_1$ of $A_1$ (resp. every state $s_2$ of $A_2$) the probability to observe $u$ and finish in state $s_1$ (resp. $s_2$). Indeed, we may then compute $confidence(i, u) = \frac{P^{A_i}(u)}{P^{A_1}(u) + P^{A_2}(u)}$, i.e. the probability that the decision $i$ is correct after observing $u$. Notice that this confidence is not necessarily non-decreasing, and that the answer of a classifier can also switch from one answer to the other. In fact, we show in Proposition 16 (in Appendix) that if $(A_1, A_2)$ is limit-sure classifiable, then the MAP classifier will be a limit-sure classifier.

The main problem is to decide when limit-sure classification holds. In fact, this problem can be solved in PTIME. We remark that a variant of the problem, namely distinguishability, was already shown to be in PTIME [5,11]. While both problems coincide for HMMs, as explained in Section 4.4, our proof, described in the rest of this section, crucially uses the notion of stationary distributions for HMMs developed in the previous section.
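A minimal sketch of the MAP classifier follows. It records, for each state, the probability of observing u and finishing in that state, and takes as confidence the normalized likelihood of the chosen HMM, which assumes that both HMMs are a priori equally likely (an assumption of this sketch). The encoding, the function names and the two one-state "biased coin" HMMs are hypothetical.

```python
from fractions import Fraction as F

def forward(M, sigma0, word):
    """Map each state s to the probability of observing `word` and finishing in s."""
    alpha = dict(sigma0)
    for a in word:
        nxt = {}
        for (s, b, t), p in M.items():
            if b == a and s in alpha:
                nxt[t] = nxt.get(t, F(0)) + alpha[s] * p
        alpha = nxt
    return alpha

def map_classifier(hmm1, hmm2, word):
    """Return (decision, confidence); confidence assumes a uniform prior on the HMMs."""
    p1 = sum(forward(*hmm1, word).values())
    p2 = sum(forward(*hmm2, word).values())
    decision = 1 if p1 > p2 else 2
    total = p1 + p2
    confidence = None if total == 0 else (p1 if decision == 1 else p2) / total
    return decision, confidence

# Hypothetical HMMs: one state each, emitting a with probability 3/4 (resp. 1/4).
hmm1 = ({("h", "a", "h"): F(3, 4), ("h", "b", "h"): F(1, 4)}, {"h": F(1)})
hmm2 = ({("t", "a", "t"): F(1, 4), ("t", "b", "t"): F(3, 4)}, {"t": F(1)})

print(map_classifier(hmm1, hmm2, "aab"))  # (1, Fraction(3, 4))
print(map_classifier(hmm1, hmm2, "abb"))  # (2, Fraction(3, 4))
```

The two calls illustrate that the answer can switch between observations, as noted above.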

The Twin Automaton and the Twin Belief Automaton
Given HMMs $A_1, A_2$, we define their twin automaton $A = (S = S_1 \times S_2, \Delta, s_0)$ as the product of the automata associated with $A_1$ and $A_2$, obtained by forgetting the probabilities. Recall that $A_1$ has $n$ states and $A_2$ has $m$ states. The transition relation is $\Delta = \{((s_1, s_2), a, (t_1, t_2)) \mid M_1(s_1, a, t_1) > 0 \text{ and } M_2(s_2, a, t_2) > 0\}$, where $M_1, M_2$ denote the transition functions of $A_1, A_2$. We call states of $A$ twin states. In the following, we will often consider the belief automata $B_A, B_{A_1}, B_{A_2}$ associated with $A, A_1, A_2$, obtained by the subset construction (see Section 3). States of $B_A$ will be called twin beliefs. Notice that although twin beliefs are formally sets of pairs of states in $2^{S_1 \times S_2}$, we can also present them as pairs of sets of states in $2^{S_1} \times 2^{S_2}$, because if $(s_1, s_2)$ and $(s'_1, s'_2)$ are in the same twin belief, then we also have $(s_1, s'_2)$ and $(s'_1, s_2)$ in this twin belief. We will thus write the twin belief $X(u)$ associated with observation $u$ as $X(u) = (X_1(u), X_2(u))$, with $X_1(u), X_2(u)$ the belief states of $B_{A_1}, B_{A_2}$ associated with $u$. Figure 3 presents an example with a twin automaton and the twin belief automaton for two copies of the HMM given in Figure 1, one starting in state $y$ and the other starting in state $z$.
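The twin construction can be sketched as follows, assuming HMMs are encoded as dictionaries from triples (s, a, s') to probabilities. The function names and the two small HMMs are illustrative assumptions (they are not the HMMs of Figure 3).

```python
def twin_transitions(M1, M2):
    """Transitions of the twin automaton: both components move on the same letter."""
    return {((s1, s2), a, (t1, t2))
            for (s1, a, t1), p1 in M1.items() if p1 > 0
            for (s2, b, t2), p2 in M2.items() if p2 > 0 and a == b}

def twin_belief(delta, initial, word):
    """Set of twin states reachable by `word` from the initial twin states."""
    current = set(initial)
    for a in word:
        current = {t for (s, b, t) in delta if b == a and s in current}
    return current

def as_pair(twin):
    """Present a twin belief (a set of pairs) as a pair of sets (X1, X2)."""
    return ({s1 for s1, _ in twin}, {s2 for _, s2 in twin})

# Hypothetical HMMs over the alphabet {a, b}.
M1 = {("p", "a", "p"): 0.5, ("p", "a", "q"): 0.5, ("q", "b", "p"): 1.0}
M2 = {("r", "a", "r"): 0.5, ("r", "b", "r"): 0.5}
delta = twin_transitions(M1, M2)

X = twin_belief(delta, {("p", "r")}, "a")
print(as_pair(X))                               # the pair ({p, q}, {r})
print(twin_belief(delta, {("p", "r")}, "ab"))   # {('p', 'r')}
```

In this encoding a twin belief is empty as soon as one component cannot produce the observation, so pairs of the form $(X_1, \emptyset)$ are not represented.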

Characterization for classifiability
Our goal is to use the result of Section 3 to obtain stationary distributions in $A_1, A_2$, and classify between them by comparing the stochastic languages with respect to these stationary distributions using probabilistic equivalence (see Section 2.1). In order to do this, we first need to compare the same information in both HMMs. The idea is to consider twin beliefs from each HMM: we will enrich $A_1$ with the beliefs of $A_2$, and vice versa. Let $A'_1$ be the HMM where the state space is $S_1 \times 2^{S_2}$, and the transition matrix is $M'_1((x, Y), a, (x', Y')) = M_1(x, a, x')$ if $Y' = \{y' \mid M_2(y, a, y') > 0, y \in Y\}$, and 0 otherwise, for all $x, Y, a, x', Y'$, where $M_1, M_2$ denote the transition functions of $A_1, A_2$. We define similarly $A'_2$ with set of states $S_2 \times 2^{S_1}$. It is easy to see that for all observations $w$, the belief state , and we will abuse notation and represent beliefs of $A'_1$ and $A'_2$ as twin beliefs $(X_1, X_2)$, where $X_1$ or $X_2$ can be empty.
What we are interested in is what happens after a BSCC of $A$ is reached. We thus consider twin beliefs reachable from some $(x_1, x_2)$ in the BSCC of $A$. The sets of twin beliefs reachable in $A'_1$ and in $A'_2$ from $(\{x_1\}, \{x_2\})$ are almost the same, except for twin beliefs of the form $(X_1, \emptyset)$, which cannot be reached in $A'_2$, and of the form $(\emptyset, X_2)$, which cannot be reached in $A'_1$.

Definition 7.
We say that a twin belief $(X_1, X_2)$ is oblivious if the languages of $B_{A_1}$ from $X_1$ and of $B_{A_2}$ from $X_2$ are the same.
By definition, if $(X_1, X_2)$ is not oblivious, there are words differentiating $X_1$ and $X_2$. Now, assume that $X = (X_1, X_2)$ is oblivious. The twin beliefs reachable from $(X_1, X_2)$ are the same in the enriched HMMs $A'_1$ and $A'_2$. To potentially differentiate them, we need to consider their long term statistics. Let $B_1$ and $B_2$ be the belief automata associated with $A'_1$ and $A'_2$. Let $E_A$ be the union of BSCCs of twin beliefs accessible from twin states in the BSCCs of twin states, as in Lemma 3. Let $X \in E_A$. In this case, we say that $X$ is in the BSCCs of twin beliefs. We define $\sigma^1_X : X_1 \to [0, 1]$ as the stationary distribution in $A'_1$ around the twin belief $X$ (formally, $\sigma^1_X$ is defined on $(x, X_2)$ for all $x \in X_1$, and we omit the second component $X_2$ because it is constant). In the same way, we define $\sigma^2_X : X_2 \to [0, 1]$ for the second component $X_2$ around the twin belief $X$. We can then look for words differentiating $A_1, A_2$, i.e. with different probabilities from $\sigma^1_X$ and from $\sigma^2_X$. We can now state our characterization:

Theorem 8. The following are equivalent:
1. One cannot limit-surely classify between $A_1, A_2$,
2. There exists an oblivious $X \in E_A$ in a BSCC of twin beliefs such that

The negation of the second condition is sufficient to show that MAP is a limit-sure classifier (see Proposition 16 in Appendix). However, checking condition 2 explicitly is not algorithmically efficient, as the belief automaton can have exponentially many states. Instead, to obtain a PTIME algorithm to check limit-sure classifiability, we will use the third condition. For comparison, in [5], a variant of the equivalence between (1) and (3) is shown, without using the stationary distributions $\sigma^1_X, \sigma^2_X$ of (2).

For the proof, we note that the case of 2 implies 3 is easy. For the remaining two directions, i.e. 1 implies 2 and 3 implies 1, proofs are technical, and can be found in the appendix. For 1 implies 2, we prove that the negation of 2 implies that the MAP classifier (defined at the beginning of Section 4) is limit-sure, implying the negation of 1. Intuitively, the negation of 2 means that every pair of reachable beliefs has a distinguishing word. It then suffices to consider statistics on this finite number of distinguishing words to know the originating HMM with arbitrarily high probability. For 3 implies 1, we show that any twin belief $(H_1, H_2)$ reached from $(y_1, y_2)$ in $E_A$ must be oblivious because of the probabilistic equivalence. We show this implies that $(A_1, \sigma^1_{H_1,H_2})$ and $(A_2, \sigma^2_{H_1,H_2})$ are equivalent and conclude using Lemma 6.

A PTIME Algorithm
Theorem 8 gives us a characterization for the existence of a limit-sure classifier. The third condition is particularly interesting, because it does not require computing beliefs. Using this, we can build an efficient algorithm, similar to [5], to test in PTIME whether there exists a limit-sure classifier between $A_1, A_2$.

Algorithm 1 Limit-sure Classifiability
if there exist two distributions $\sigma_1, \sigma_2$ over $X_1, X_2$ with $\sigma_1(y_1) > 0$ and $\sigma_2(y_2) > 0$
    return not classifiable
return classifiable

The correctness of the algorithm is immediate from Theorem 8, as it checks explicitly for the third condition to hold, in which case it returns not classifiable. If the third condition is false for every BSCC $D$, then it returns classifiable.

Comparison with Distinguishability between HMMs [11]
We complete this section by comparing our results with a related result on HMMs. In [11], the problem of distinguishability between labeled Markov chains has been considered. First, labeled Markov chains are just another name for HMMs. The idea behind distinguishability is similar to the idea behind classifiability. Still, there are some technical differences: distinguishability asks that for all $\varepsilon > 0$, there exists a $(1 - \varepsilon)$-classifier, that is, a classifier $f : \Sigma^* \to \{\bot, 1, 2\}$ such that if the classifier answers $f(u) = 1$, then there is probability at least $(1 - \varepsilon)$ that the observation comes from a run of $A_1$, and similarly for $f(u) = 2$. To compare, limit-sure classifiers need to be uniform over $\varepsilon$ (see the next section).
The authors of [11] show that this notion can be checked in PTIME, by indirectly using the result of [5] stating that one can check in PTIME whether the total variation distance between two HMMs is 1. More precisely, the total variation distance is defined as:

Definition 9. The total variation distance between two HMMs $A_1$ and $A_2$ is given by $d(A_1, A_2) = \sup_{E \subseteq \Sigma^\infty} |P^{A_1}(E) - P^{A_2}(E)|$, where $E$ ranges over measurable sets of infinite words.

This supremum has been shown to be a maximum [5]. It is not too hard to show that limit-sure classification coincides with these notions as well for HMMs:

Theorem 10. The following are equivalent:
1. There exists a limit-sure classifier for $A_1, A_2$,
2. For all $\varepsilon > 0$, there exists a

The proofs to obtain the PTIME algorithms are quite different though: we use stationary distributions in HMMs while [5] focuses on separating events. Some intermediate results are however related: our Proposition 18 in the appendix is to be compared with Proposition 19 b) of [5]. Our statement is stronger, as the equivalence is true from all pairs of states with the same (non-stochastic) language, and in particular from $(i_1, j_1) = (y_1, y_2)$ (cf. Proposition 17 in the appendix). Also, the proof of Proposition 18 in the appendix is simple, using strict convexity and focusing on one finite separating word, while in [5], the existence of a maximal separating event (a set of infinite words) is used crucially in the proof of Proposition 19 b).
Surprisingly, our resulting algorithm is very similar to the one in [5], even though we use very different methods. Still, we can restrict the search to distributions in a BSCC of twin states, while [5] considers subdistributions on the whole state space of twin states. This allows us to reduce the number of variables in the linear program.
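To give some intuition for the total variation distance between two HMMs, one can compute its restriction to events determined by the first n letters, namely half the sum over words w of length n of $|P^{A_1}(w) - P^{A_2}(w)|$. This quantity is non-decreasing in n and never exceeds the total variation distance. The sketch below is a brute-force illustration only (exponential in n), unrelated to the PTIME algorithms discussed above; the encoding, the function names and the two one-state HMMs are assumptions.

```python
from fractions import Fraction as F
from itertools import product

def word_probability(M, sigma0, word):
    """Probability of observing `word` (forward propagation over states)."""
    alpha = dict(sigma0)
    for a in word:
        nxt = {}
        for (s, b, t), p in M.items():
            if b == a and s in alpha:
                nxt[t] = nxt.get(t, F(0)) + alpha[s] * p
        alpha = nxt
    return sum(alpha.values())

def tv_on_prefixes(hmm1, hmm2, alphabet, n):
    """Total variation distance restricted to events defined by the first n letters."""
    return sum(abs(word_probability(*hmm1, w) - word_probability(*hmm2, w))
               for w in product(alphabet, repeat=n)) / 2

# Hypothetical HMMs: one state each, emitting a with probability 3/4 (resp. 1/4).
hmm1 = ({("h", "a", "h"): F(3, 4), ("h", "b", "h"): F(1, 4)}, {"h": F(1)})
hmm2 = ({("t", "a", "t"): F(1, 4), ("t", "b", "t"): F(3, 4)}, {"t": F(1)})

print(tv_on_prefixes(hmm1, hmm2, "ab", 1))  # 1/2
print(tv_on_prefixes(hmm1, hmm2, "ab", 3))  # 11/16
```

For these two HMMs the value keeps growing with n, in line with the fact that they can be told apart from the frequency of a's.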

Attack-classification
While limit-sure classification allows for some misclassification, i.e. error in classification, it requires that every execution of the HMMs is classifiable. From a security perspective, if one wants to make sure that two systems cannot be distinguished from each other, then the question changes slightly: an attacker who could exploit the knowledge of which model the system is following need not classify every single execution. The attacker only needs to find one execution for which it can decide. This gives rise to what we call attack-classification, which amounts to providing the attacker with a reset action it can play when it believes the execution cannot be classified. Then, a new (possibly the same) HMM is taken at random and an execution of this new HMM is observed by the attacker.
For instance, it is not possible to limit-surely classify between HMM $A_3$ and HMM $A_4$ in Figure 4, because executions starting with a $b$ cannot be classified. On the other hand, an attacker can wait for an execution of the system starting with an $a$, for which she is sure the HMM is $A_3$. If it starts with a $b$, then the attacker just forgets this execution and waits for a new execution of the system (the "reset" operation). We start by considering limit-sure attack-classifiers, namely, we require that there exists a reset strategy which, with probability 1, resets only finitely many times, and a limit-sure classifier for the observation after the last reset. We also consider what happens if, instead of a limit-sure classifier, we ask for the existence of a family of $(1-\varepsilon)$-classifiers after the last reset, one for each $\varepsilon$. The difference is that the reset action can take $\varepsilon$ into account in the latter, but not in the former. While both notions coincide for the classifiers defined in the previous section, we now show that they do not coincide for attack-classification.
Figure 4 illustrates the difference between these two notions, considering $A_3$ and $A_5$. First, for all $\varepsilon > 0$, there exists a $(1-\varepsilon)$-attack-classifier: given an $\varepsilon$, the reset strategy resets if the first letter $b$ happens within the first $k_\varepsilon = \log(\frac{1}{9\varepsilon})$ steps. That is, the reset strategy is $\tau(a^*) = \bot$, $\tau(a^{k_\varepsilon} w) = \bot$ and $\tau(a^\ell b) = \text{reset}$ for $\ell < k_\varepsilon$. For observation $a^{k_\varepsilon} w$, the classifier claims that the HMM is $A_3$, which is true with probability at least $1-\varepsilon$. However, this reset strategy is not compatible with a limit-sure classifier (and, in fact, no reset strategy is), because it is not uniform with respect to $\varepsilon$: once a $b$ has been produced, no more information can be gathered. On the other hand, limit-sure attack-classifiability implies the existence of $(1-\varepsilon)$-attack-classifiers for all $\varepsilon$. Thus the former notion of limit-sure attack-classifier is strictly contained in the latter. More importantly, we show that deciding the former is PSPACE-complete, while the latter turns out to be undecidable.
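The reset strategy of this example can be sketched as a small function. This is only an illustration: the alphabet {a, b}, the name `reset_strategy`, and the encoding of ⊥ as `None` are assumptions, and the threshold `k` is passed as a parameter rather than computed from ε.

```python
from typing import Optional

RESET = "reset"
CONTINUE = None  # stands for the "no reset" answer, written ⊥ in the text


def reset_strategy(u: str, k: int) -> Optional[str]:
    """Value of tau on u, the observation seen since the last reset.

    The strategy resets exactly when the first b appears after fewer
    than k letters a; otherwise it keeps observing.
    """
    first_b = u.find("b")
    if first_b == -1:
        return CONTINUE  # u is in a^*: nothing to decide yet
    if first_b < k:
        return RESET     # u starts with a^l b for some l < k
    return CONTINUE      # u starts with a^k: never reset again
```

With `k = 3`, the observation `aab` triggers a reset, while `aaab` does not, since three letters a have already been seen before the first b.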

Limit-sure attack-classifiability is PSPACE-complete
Let us first formalize the definition of attack-classification.

Definition 11. We say two HMMs $A_1, A_2$ are limit-sure attack-classifiable if there exist:
1. a reset strategy $\tau : \Sigma^* \to \{\bot, \text{reset}\}$ telling when to reset, and which eventually stops resetting, with probability 1 on the reset runs, and
2. a limit-sure classifier for $u$, where $u \in \Sigma^*$ denotes the suffix of observations since the last reset.
In the following, we show an algorithmic characterization for this concept. Intuitively, there needs to exist one execution of one HMM (say $A_1$), such that no matter the execution of the other HMM with the same observation, we can eventually classify between these two executions. We will thus consider enriched versions of $A_1$ and $A_2$, in which each HMM additionally tracks the beliefs of the other HMM.
First, we define classifiable twin states in the BSCC of twin states: […] the stationary distributions built for $(X_1, X_2)$. Notice that it does not depend upon the choice of $(X_1, X_2)$. For a belief state $X_2$ of $A_2$, we say that […]

In case there are more than two HMMs, we follow the state $s$ of one HMM and the beliefs of all the other HMMs along the observation, and we need to check classifiability between $(s, t)$ for every $t$ in the belief of any of the other HMMs. Using this characterization, we obtain:

Theorem 13. Let $A_1, A_2$ be two HMMs. It is PSPACE-complete to check whether $(A_1, A_2)$ are limit-sure attack-classifiable.

Existence of (1 − ε) attack-classifiers for all ε is undecidable.
We now turn to the other notion. Let $\varepsilon > 0$. A $(1-\varepsilon)$ attack-classifier for two HMMs $A_1, A_2$ is given by:
1. a reset strategy $\tau : \Sigma^* \to \{\bot, \text{reset}\}$ telling when to reset, and which eventually stops resetting, with probability 1 on the reset runs, and
2. a $(1-\varepsilon)$-classifier for $u$, where $u \in \Sigma^*$ denotes the suffix of the observations since the last reset.
We next show that this notion, which we showed to be weaker than limit-sure attack-classifiability on Figure 4, is also computationally much harder: in fact, it is undecidable.

Theorem 14. It is undecidable to decide whether for all $\varepsilon$, there exists a $(1-\varepsilon)$ attack-classifier between two HMMs.
Intuitively, we reduce from the problem of whether a PFA $B$ that accepts all words with probability in $(0, 1)$ is 0 and 1 isolated, that is, there is no sequence of words $(w_i)_{i \in \mathbb{N}}$ such that $\lim_{i \to \infty} P_B(w_i) = 0$ or $= 1$. This problem is undecidable [8]. The idea is to transform the PFA into an HMM which performs the actions of the PFA uniformly at random. We check whether we can attack-classify this HMM against an HMM which accepts all words of size $k$ with probability $1/2^k$. This is possible if 0 is not isolated or if 1 is not isolated.

Conclusion
In this paper, we tackled the notion of limit-sure classifiability between HMMs, which is a general notion for studying how to uncover hidden information in partially observable systems. The class of classifiers we consider is quite powerful, as they can use statistics on the observations in order to take their decision. To obtain our results, summarized in the table below, we developed a robust theory of stationary distributions for HMMs. While limit-sure classifiability is stronger and more complex than almost-sure classifiability, checking for it is in a lower complexity class: PTIME instead of PSPACE-complete. This result sheds new light on the total variation metric for stochastic systems, recovering with different techniques the PTIME result from [5]. We also considered attack-classifiability, where the attacker needs to classify at least one observation rather than every execution. In this setting, there is a difference between a limit-sure classifier and the existence of $(1-\varepsilon)$-classifiers for each $\varepsilon$. Limit-sure attack-classifiability is decidable (PSPACE-complete), whereas the existence of $(1-\varepsilon)$-classifiers for all $\varepsilon$ is undecidable.

Proposition 1. [16,3] We can surely classify among 2 HMMs iff $L^\infty(A_1) \cap L^\infty(A_2) = \emptyset$, and this can be checked in PTIME. We can almost-surely classify among 2 HMMs iff the set $L^\infty(A_1) \cap L^\infty(A_2)$ has probability 0, and this is a PSPACE-complete problem.
Proof. The first result is a classical result in the context of fault diagnosis [16], which can be adapted trivially to the case of classification. Clearly, an observation $w \in L$ […] then, considering the product of both HMMs, called the twin machine, it has no loop. It means that after at most $n \cdot m$ observations, we can classify. Looking for a loop in the twin machine is in PTIME.
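The loop search in the twin machine can be sketched as follows. This is a minimal illustration under stated assumptions: each HMM is reduced to its support, encoded as a dictionary from states to lists of `(letter, successor)` pairs, and the function name `twin_has_loop` is hypothetical. The twin machine is explored from the pairs of initial states, and a depth-first search reports whether a cycle is reachable.

```python
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

State = Hashable
Support = Dict[State, List[Tuple[str, State]]]  # state -> [(letter, successor)]


def twin_has_loop(trans1: Support, init1: Iterable[State],
                  trans2: Support, init2: Iterable[State]) -> bool:
    """Return True iff the reachable part of the twin machine has a cycle."""

    def successors(pair: Tuple[State, State]) -> Iterator[Tuple[State, State]]:
        s, t = pair
        for a, s2 in trans1.get(s, []):
            for b, t2 in trans2.get(t, []):
                if a == b:  # both machines must produce the same observation
                    yield (s2, t2)

    GREY, BLACK = 1, 2  # GREY: on the current DFS path, BLACK: fully explored
    color: Dict[Tuple[State, State], int] = {}
    for start in [(s, t) for s in init1 for t in init2]:
        if start in color:
            continue
        color[start] = GREY
        stack = [(start, successors(start))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color.get(nxt) == GREY:
                return True  # back edge: a loop in the twin machine
            elif nxt not in color:
                color[nxt] = GREY
                stack.append((nxt, successors(nxt)))
    return False
```

The search visits each pair of states once, so it is polynomial in the sizes of the two HMMs, in line with the PTIME bound stated in the proof.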
For the second result we use [18,3]. […] has probability $> 0$, then clearly no almost-sure classifier exists for these observations. Conversely, assume that $L^\infty(A_1) \cap L^\infty(A_2)$ has probability 0. Consider the belief automata associated with $A_1, A_2$ and perform their product. The hypothesis implies that all BSCCs of the product have one of the components empty: one can thus classify when BSCCs are reached, which eventually happens with probability 1. To get the PSPACE algorithm, it suffices to check whether a BSCC of the belief product, with both components non-empty, can be reached. The PSPACE lower bound follows the one in [3].

Proof of Lemma 3 from Section 3
Lemma 3. There is a unique BSCC in $B^x_D$, and it does not depend upon $x \in D$.
Proof. Assume by contradiction that $X_1$ and $X_2$ are in two distinct BSCCs of $B^x_D$ (w.l.o.g., we can choose $x \in X_1$, $x \in X_2$, as $x$ is reachable from any state, and thus $x$ must belong to at least one member of each BSCC). Let $w_1, w_2$ be observations reaching $X_1$ and $X_2$ respectively. As $x \in X_1$, there is a path in $B^x_D$ labeled $w_2$ from $X_1$ to some $X'_2$ with $X_2 \subsetneq X'_2$ (they cannot be equal because they are in two different BSCCs).
As $x \in X'_2$, there is a path in $B^x_D$ labeled $w_1$ from $X'_2$ to some $X'_1$ with $X_1 \subsetneq X'_1$. We can then play $w_2$ to obtain some $X''_2$ from $X'_1$ with $X'_2 \subsetneq X''_2$. We can iterate this process infinitely, which gives a contradiction with the bounded number of states.
In the same way, consider $B^x_D$ and $B^y_D$, and assume by contradiction that they have different BSCCs. Let $Y$ (resp. $X$) be a configuration in the unique BSCC of $B^x_D$ (resp. $B^y_D$), reachable by playing $w_1$ (resp. $w_2$), with $x \in X$ and $y \in Y$. One can play $w_2$ (resp. $w_1 w_2$) from $Y$ (resp. $X$) and reach some $X'$ (resp. $X''$), with $X \subsetneq X' \subsetneq X''$. Again, one can iterate and reach a contradiction with the boundedness of the number of states.

Proof of Theorem 4 from Section 3
Theorem 4. Given an HMM $A$, let $X$ be a belief in $E$ […]
Proof. We first prove that there exists $\ell$ such that for all $x, y \in X$, we have $M_X^\ell(x, y) > 0$, which implies that $M_X$ is irreducible and aperiodic. Then we will use the fundamental theorem of Markov chains [9]. For all $x \in X$, by Lemma 3, there is an observation $v_x$ leading from $\{x\}$ to $X$, i.e. $\Delta(\{x\},$ […] $X_2$, then we apply $v_x$ again. As […]

Now, by induction on the size of $X$, we build a uniform word $w$ such that $\Delta(\{x\}, w) = X$ for all $x \in X$. Let $x_1, \ldots, x_k$ be the elements of $X$. The word $w$ starts with $w_{x_1}$. We have that for all $i \le k$, $\Delta(\{x_i\}, w_{x_1}) \subseteq X$. Let $y_2 \in \Delta(\{x_2\}, w_{x_1})$. Hence $y_2 \in X$, and we append to $w_{x_1}$ the observation $w_{y_2}$, obtaining $\Delta(\{x_1\}, w_{x_1} w_{y_2}) = \Delta(\{x_2\}, w_{x_1} w_{y_2}) = X$, and for all $i \le k$, $\Delta(\{x_i\}, w_{x_1} w_{y_2}) \subseteq X$. By induction, we obtain the desired word $w$. Then, for $\ell$ the size of $w$, we have $M_X^\ell(x, y) > 0$ for all $x, y \in X$. That is, $M_X$ is irreducible and aperiodic.
We now apply the fundamental theorem of Markov chains to the irreducible and aperiodic Markov chain $M_X$: $M_X$ has a unique stationary distribution, denoted $\sigma^X$. Further, for $\sigma^X_{y,i}$ the distribution with $\sigma^X_{y,i}(x) = M_X^i(y, x)$, we have that $\lim_{i \to \infty} \sigma^X_{y,i}$ exists and is unique, it does not depend upon $y \in X$, and it is equal to $\sigma^X$. Now, let $W_X$ be the (possibly countably infinite) set of words which lead from belief $X$ to belief $X$ without seeing belief $X$ in between. Consider $\sigma_{y,i}$ the distribution over $X$ such that $\sigma_{y,i}(x) = \sum_{w \in (W_X)^i} P(w) M(y, w, x)$, the probability of reaching $x$ from $y$ after seeing $i$ words of $W_X$. Now, notice that by definition of $M_X$, we have $\sigma_{y,i} = \sigma^X_{y,i}$. Hence the limit of $\sigma_{y,i}$ exists and is unique, it does not depend upon $y \in X$, and it is equal to $\sigma^X$.
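The convergence used in this proof can be observed numerically on a small chain. The sketch below is illustrative only: the 2-state matrix `M` is a made-up example, not the chain of Figure 2, and `limit_distribution` is a hypothetical helper that iterates the row-stochastic matrix from a Dirac distribution.

```python
from typing import List

Matrix = List[List[float]]


def step(dist: List[float], M: Matrix) -> List[float]:
    """One step of the chain: returns the row vector dist * M."""
    n = len(M)
    return [sum(dist[y] * M[y][x] for y in range(n)) for x in range(n)]


def limit_distribution(M: Matrix, start: int, iterations: int = 200) -> List[float]:
    """Approximate lim_i M^i(start, .) by repeated multiplication."""
    dist = [1.0 if x == start else 0.0 for x in range(len(M))]
    for _ in range(iterations):
        dist = step(dist, M)
    return dist


# Hypothetical irreducible and aperiodic chain (all entries positive).
M = [[0.5, 0.5],
     [0.25, 0.75]]
```

For this matrix, solving σ = σM by hand gives σ = (1/3, 2/3), and iterating from either starting state approaches the same vector, which is the independence from $y$ stated above.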

Proof of Lemma 6 from Section 4
Lemma 6 (Proposition 18 in [5]). Let $(X_1, X_2)$ be a reachable twin belief of […] Then one cannot classify between $A_1, A_2$.
2 implies 3 is easy. Indeed, consider the oblivious twin-belief […] We have that all $(x_1, x_2) \in (X_1, X_2)$ belong to the same BSCC $D$. Thus, we can let $\sigma_1 = \sigma^1_X$ and $\sigma_2 = \sigma^2_X$ and choose any $y_1 \in X_1$, $y_2 \in X_2$, which gives us the statement. We now prove the two remaining implications. We start in the next subsection by showing 1 implies 2. Then we show 3 implies 1, completing the proof.
(1 ⟹ 2): MAP is a limit-sure classifier when condition 2 is false
To prove 1 implies 2, we prove that the negation of 2 implies that the MAP classifier (defined at the beginning of Section 4) is limit-sure, which of course implies that 1 cannot hold. Intuitively, (not 2) means that every pair of accessible beliefs has a distinguishing word. It then suffices to consider statistics on this finite number of distinguishing words to know the originating HMM with arbitrarily high probability.
Let $\varepsilon > 0$. Intuitively, when the observation $u$ is long enough, the MAP classifier can claim that the observation comes from one HMM with probability at least $1 - \varepsilon$. Long enough means that we can decompose $u$ into $u = u_1 u_2 u_3$, with some specific properties on $u_1$, $u_2$, $u_3$. That is, eventually with probability 1, we will reach a word $u$ that can be decomposed into $u_1 u_2 u_3$. Intuitively, there is a high probability to reach a BSCC of the twin automaton with $u_1$, to reach a BSCC of the twin belief automaton after $u_2$, and $u_3$ allows with high probability to eliminate one of the two possible HMMs.
We now formalize this decomposition into $u_1$, $u_2$, $u_3$. Let $u$ be an observation from a run of $A_1$. We denote by $p_1(s, u)$ (resp. $p_2(t, u)$) the probability in $A_1$ (resp. $A_2$) to observe $u$ and reach state $s$ (resp. state $t$). Let $\varepsilon > 0$. Then $u = u_1 u_2 u_3$ is a good decomposition if the following conditions hold:

$u_1$ is such that there exist sets $R_1, R_2$ of states of $A_1, A_2$ with:
1. $(s, t)$ is in a BSCC of $A$ for all $(s, t) \in R_1 \times R_2$,
2. $\sum_{t \notin R_2} p_2(t, u_1) < 2 \min_{s \in R_1} p_1(s, u_1)$.

$u_2$ is such that for all $(s, t) \in R_1 \times R_2$, the twin-belief $X_{s,t} = (X_s, X_t)$ reached by reading $u_2$ from $(s, t)$ is in the BSCC of the twin-beliefs automaton. It is easy to see that eventually, with probability 1, we will observe such a $u_2$.
Last, we tackle the condition on $u_3$. If $X_{s,t}$ is oblivious, let $\sigma^1_{s,t}, \sigma^2_{s,t}$ be the stationary distributions around $X_{s,t}$. By hypothesis (not 2), there exists $w_{s,t}$ such that $P^{A_1}$ […] $(w_{s,t})|$. From any state of $X_s$, denoting by $n_{s,t}(u_3)$ the number of times $X_{s,t}$ has been a twin-belief along $u_3$, and $n_{s,t}(u_3)$ the number of times $w_{s,t}$ has been observed from $X_{s,t}$, by the central limit theorem, we have that […] $(w_{s,t})$ with probability 1. We consider observations $u_3$ in $L(B_{A_1}, X_s) = L(B_{A_2}, X_t)$ such that: […]

Let $W_k(\varepsilon)$ be the set of observations $u_1 u_2 u_3$ of size $k$ which are good decompositions. Then,

Lemma 15. For all $\varepsilon > 0$, for $k$ large enough, we have […]

Proof. As runs converge towards BSCCs, eventually with probability 1, observation $u_1$ satisfies the first two conditions. For the last one, consider some $u_1$ satisfying the first two conditions. Then let $p_1(u_1) = \min_{s \in S_1} p_1(s, u_1)$. Considering extensions $u_1 u'_1$ of $u_1$, one gets […]

F S T T C S 2 0 1 9
Intuitively, we show that any twin belief $(H_1, H_2)$ reached from $(y_1, y_2)$ in $E_A$ must be oblivious because of the probabilistic equivalence. We show this implies that $(A_1, \sigma^1_{H_1,H_2})$ and $(A_2, \sigma^2_{H_1,H_2})$ are equivalent and conclude using Lemma 6. We write $X_1 = \{i_1, \ldots, i_n\}$ and $X_2 = \{j_1, \ldots, j_m\}$. We let $i_1 = y_1$ and $j_1 = y_2$. If $(X_1, X_2)$ was a twin belief, we would have an observation $w$ such that $(X_1, X_2) = B_A(w)$, and then we could apply Lemma 6 and obtain that one cannot classify between $A_1, A_2$. However, in general, $(X_1, X_2)$ is not a twin belief (testing it would not be polynomial time). Instead, we will show that there is probabilistic equivalence from $y_1, y_2$ after reading some observation $u$. As $(y_1, y_2)$ can be reached in $A$, we can conclude on the non-classifiability using Lemma 6.
As already shown in the proof of Theorem 3, we know that there is a word $w$ and a twin belief $(H_1, H_2)$ in the BSCC of $E^1_D$ such that for all $(x, y) \in D$, the belief from $\{(x, y)\}$ […] In particular, this is true for $(y_1, x_2)$ for all $x_2 \in X_2$ and for $(x_1, y_2)$ for all $x_1 \in X_1$. This implies that after $w$, from all ( […]

We first show that every twin belief in the BSCC $E^1_D$ is oblivious. In particular, we have […]

First, assume by contradiction that there is a word $u$ possible from $H_1$ in $B_{A_1}$ but not possible from $H_2$ in $B_{A_2}$. Consider $(i_1, j_1) = (y_1, y_2) \in D$. By Lemma 3, there is some […] Otherwise, we already have $B_1(u_1 u) \neq \emptyset$ and $C_2(u_1 u) = \emptyset$. Either way, $|Z_2| < m - 2$ […] By induction, we can find an observation $w$ with $Z_2(w) = \emptyset$ and $B_1(w) \in Z_1(w) \neq \emptyset$, a contradiction, as $0 < P_{\sigma_1}(w) = P_{\sigma_2}(w) = 0$.
The case of $w$ possible from $H_2$ but not from $H_1$ is symmetric, using $C_1$ as the non-empty set.
For every twin belief $(H_1, H_2)$ in the BSCC $E^1_D$, we can thus consider $\sigma^1_{H_1,H_2}$ and $\sigma^2_{H_1,H_2}$, the stationary distributions of the HMMs $A_1$ and $A_2$ around twin belief $(H_1, H_2)$. Now, it is not necessarily the case that we can reach the BSCC $E_D$ of twin beliefs in a uniform way over all $(x_1, x_2) \in D$ (Theorem 4 shows that it is the case for all $(x_1, x_2) \in X$, a belief in the BSCC of the belief states, but again, $(X_1, X_2)$ is not necessarily (included in) a belief). Let $(H_1, H_2) \in E_D$. In the following, we will consider observations that reach the BSCC of $E_D$ from $u$. Let $u_1$ be such that $B_1(u_1) = H_1$ and $C_1(u_1) = H_2$. Such a $u_1$ exists by Lemma 3. Let $V$ be the language from $H_1$, which is equal to the language from $H_2$. Now, consider what happens from $i_2$ reading observations in $V$. There are several cases. First, assume that there is an observation $v_2$ in $V$ such that a belief state in the BSCC of beliefs is reached from $\{i_2\}$ […] If it is the same language, we say that $i_2$ is of type 1. Otherwise, or if there is no observation $v_2 \in V$ such that the BSCC of beliefs can be reached reading $u_1 v_2$, then we say that $i_2$ is of type 2. Intuitively, a state of type 2 will be negligible when following $y_1, y_2$, whereas a state of type 1 needs to be tracked because it is not negligible. We then consider the state $i_3$ and the belief $B_3(u_1 v_2)$, and classify each state $i_3, \ldots$ then $j_2, \ldots$ inductively into type 1 and type 2. We have an observation $w$ leading all the type 1 states to their BSCC, and all the type 1 states have the same language.
We reorder $X_1 = \{i_1, \ldots, i_n\}$ and $X_2 = \{j_1, \ldots, j_m\}$ such that $i_1, \ldots, i_k$ and $j_1, \ldots, j_\ell$ are of type 1 and the rest are of type 2. We now follow every type 1 belief in parallel: consider a $(k + \ell)$-belief $H = (H_1, \ldots, H_k, K_1, \ldots, K_\ell)$ in the BSCC of belief states of $A_1^k \times A_2^\ell$. Let $u$ be an observation such that $B_r(u) = H_r$ for all $r \le k$ and $C_r(u) = K_r$ for all $r \le \ell$. Because the languages of the type 1 states are the same from their belief states, we can compute $\sigma_r : H_r \to [0, 1]$, the stationary distribution for $i_r$ to be around belief $H$ for all $r \le k$, and $\tau_r : K_r \to [0, 1]$, the stationary distribution over $H$ for all $r \le \ell$. Let $W_H$ be the set of observations from the $(k + \ell)$-belief $H$ to $H$ without seeing $H$ in between.
For all $w'$, we have by definition of the equivalence: $\sum_{w \in W_H^\kappa} \sum_{r \le n} \sigma(i_r) P^{A_1}_{i_r}(u w w') = \sum_{w \in W_H^\kappa} \sum_{r \le \ell} \tau(j_r) P^{A_2}_{j_r}(u w w')$. Considering the limit when $\kappa$ tends to infinity, we have for all $r > k$, $\lim_{\kappa \to \infty} \sum_{w \in W_H^\kappa} \alpha_r P^{A_1}_{i_r}(u w) = 0$. Indeed, consider $i_r$, $r > k$. For paths reaching a state from which the BSCC of beliefs cannot be reached, the probability to stay out of the BSCC tends to 0 with the size of the run. Otherwise, the path reaches the BSCC of beliefs, say in belief $X_r$. By definition of a type 2 state, the language is not the same as the language of $H_1$, which is $W_H^*$. Hence either there is a word in $W_H^*$ which cannot be done from $X_r$ and can be done from $H_1$, in which case avoiding this word forever has probability 0, or there is a word which can be done from $X_r$ but not from $H_1$: this word is not in $W_H^*$, and at each $W_H$ iteration, there is some missing probability from $X_r$, say $1 -$ […], and eventually the probability is 0. We thus obtain: […]

Proof. Assume by contradiction that it is not the case. That is, there is a $w$ such that $P^{A_1}_{\sigma_1}(w) > P^{A_1}_{\sigma}(w)$. Let us write $x = P^{A_1}_{\sigma_1}(w)$ and $P^{A_1}_{\sigma}(w) = \gamma x$, with $\gamma < 1$. We have the following: $\tau(w) = \alpha P^{A_1}_{\sigma_1}(w) + (1 - \alpha) P^{A_1}_{\sigma}(w) = \alpha x + (1 - \alpha) \gamma x$. We let $W$ be the set of minimal observations $u$ sending to $X$ from $(B_1(w), \ldots, B_k(w), C_1(w), \ldots, C_\ell(w))$. We have that […] Now, the function $x \mapsto x^2$ is strictly convex (its second derivative is strictly positive). Applying the definition to $(1, \gamma)$ (this is also Jensen's inequality), we obtain a contradiction: […] We can then apply this result symmetrically to the second component and obtain $(A_1, \sigma_1) \equiv (A_2, \tau_1)$. As $(i_1, j_1) = (y_1, y_2) \in D$, we can conclude about non-classifiability using Lemma 3.
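For reference, the strict convexity step applied to the pair $(1, \gamma)$ with weights $\alpha$ and $1 - \alpha$ can be written out as below. This is a sketch of the generic inequality only, assuming $0 < \alpha < 1$; how the proof combines it with the sums over $W$ is not recoverable from the text above.

```latex
\[
  \alpha \cdot 1^2 + (1-\alpha)\,\gamma^2 - \bigl(\alpha \cdot 1 + (1-\alpha)\,\gamma\bigr)^2
  \;=\; \alpha\,(1-\alpha)\,(1-\gamma)^2 \;>\; 0
  \qquad \text{for } 0 < \alpha < 1,\ \gamma \neq 1 .
\]
```

In particular, with $\gamma < 1$ the square of the mixture $\alpha x + (1 - \alpha)\gamma x$ is strictly smaller than the mixture of the squares $\alpha x^2 + (1 - \alpha)\gamma^2 x^2$ for any $x > 0$.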
It remains to show that 3 implies 1. Assume that $d(A_1, A_2) = 1$. We will show that the MAP classifier is a limit-sure classifier. Let $\mathrm{mis}(A_1, A_2, w)$ be its probability of misclassification. Thus, for all $\varepsilon > 0$, there exist $k$ and $W_k \subset \Sigma^k$ such that $P_1(W_k \Sigma^\omega) \ge 1 - \varepsilon$ and $P_2(W_k \Sigma^\omega) \le \varepsilon$, and we obtain: […] That is, when $k \to \infty$, the probability of misclassification, i.e. error in classification, tends towards 0.
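A MAP classifier for two HMMs can be sketched with the standard forward computation. The definition from Section 4 is not part of this excerpt, so the sketch assumes a uniform prior over the two HMMs, under which MAP amounts to picking the HMM that gives the observation the higher probability. The encoding of an HMM as `(init, trans)`, the names `observation_probability` and `map_classify`, and the two toy HMMs are all hypothetical.

```python
from typing import Dict, List, Tuple

# init: state -> initial probability
# trans: state -> list of (probability, letter, successor)
HMM = Tuple[Dict[str, float], Dict[str, List[Tuple[float, str, str]]]]


def observation_probability(hmm: HMM, u: str) -> float:
    """Probability that the HMM produces a run whose observation starts with u."""
    init, trans = hmm
    weights = dict(init)  # weights[s]: probability of observing the prefix and being in s
    for letter in u:
        nxt: Dict[str, float] = {}
        for s, w in weights.items():
            for p, a, t in trans[s]:
                if a == letter:
                    nxt[t] = nxt.get(t, 0.0) + w * p
        weights = nxt
    return sum(weights.values())


def map_classify(hmm1: HMM, hmm2: HMM, u: str) -> int:
    """Return 1 or 2, the index of the more likely HMM (ties go to 1)."""
    p1 = observation_probability(hmm1, u)
    p2 = observation_probability(hmm2, u)
    return 1 if p1 >= p2 else 2


# Two toy single-state HMMs with opposite letter biases.
A1: HMM = ({"s": 1.0}, {"s": [(0.9, "a", "s"), (0.1, "b", "s")]})
A2: HMM = ({"t": 1.0}, {"t": [(0.1, "a", "t"), (0.9, "b", "t")]})
```

On these toy HMMs, `map_classify(A1, A2, "aaab")` answers 1 and `map_classify(A1, A2, "bbba")` answers 2. For long observations, a practical version would work with logarithms or rescale the weights to avoid floating-point underflow.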
Proof. First, if there exists a classifiable $(x_1, X_2) \in A_1$, then let $\rho_1$ be a path in $A_1$ ending in $(x_1, X_2)$. Now, for all $x_2 \in X_2$, consider $(x_1, x_2)$, and let $(Y_1, Y_2)$ be a twin belief in the BSCC of twin beliefs reachable from $(x_1, x_2)$ by path $\rho_2$. As $(x_1, x_2)$ is classifiable, there are several cases: either there is a word […] $Y_2$, and we consider the path $\rho_3$ labeled by $w_{x_2}$ after $\rho_1 \rho_2$ in $A_1$, which proves that the state cannot be $x_2$; or there is a word […] $Y_1$, and we set $\rho_3 = \varepsilon$; otherwise, $(Y_1, Y_2)$ is oblivious, and we also let $\rho_3 = \varepsilon$.
From $\rho_1 \rho_2 \rho_3$, we define $\rho_4 \rho_5$ associated with another $x_2$, and so on until we have taken into account every $x_2 \in X_2$. The resulting path $\rho = \rho_1 \rho_2 \rho_3 \rho_4 \cdots$ has strictly positive probability to happen in $A_1$, and thus strictly positive probability to happen in the union of HMMs (recall that the runs are picked with uniform probability among the HMMs).
Given this path $\rho$ and the associated observation $w$, the reset strategy is to play $\tau(u) = \text{reset}$ if:
1. the observation $u$ of the system since the last reset is of length $|u| < |w|$, and $u$ is not a prefix of $w$, or
2. otherwise, if there is no extension $\rho'$ of $\rho$ in $A_1$ such that $\rho \rho'$ is labeled by $u$,
3. otherwise, if the statistical count of the frequency of $w_{x_2}$ from $(Y_1, Y_2)$ is closer to the average value $av_{Y_2,Y_1}$ given by $\sigma^2_{Y_1,Y_2}$ than to the average value $av_{Y_1,Y_2}$ given by $\sigma^1_{Y_1,Y_2}$.

Figure 1 Example of an HMM $A$ on alphabet $\Sigma = \{a, b\}$ and of an NFA $B_A$ on alphabet $\Sigma$.

This is the usual subset construction used for determinizing an automaton, as shown in Figure 1. As $B_A$ is deterministic, we sometimes abuse notation and write $\Delta(B, a)$ for the unique $B'$ with $(B, a, B') \in \Delta$. Consider a BSCC $D$ of HMM $A$ (as for Markov chains, this is to ensure irreducibility). For $x \in D$, we denote by $B^x_D$ the subgraph of $B_A$ reachable from $\{x\}$. In Figure 1, we have $B^y_D = B_A$. It has a unique BSCC, with 2 beliefs $\{x, y\}$ and $\{z\}$. We now show that this is the general form of the belief automaton:

Lemma 3. There is a unique BSCC in $B^x_D$, and it does not depend upon $x \in D$.
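The subset construction and the BSCC of the resulting belief automaton can be sketched as follows. The encoding of the HMM support as a dictionary, the helper names, and the three-state example are assumptions made for illustration; the example is a made-up HMM, not the one of Figure 1, although it was chosen so that its BSCC also consists of the beliefs {x, y} and {z}.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Support = Dict[str, List[Tuple[str, str]]]  # state -> [(letter, successor)]
Belief = FrozenSet[str]
BeliefGraph = Dict[Belief, Dict[str, Belief]]


def post(trans: Support, belief: Belief, letter: str) -> Belief:
    """Delta(belief, letter): states reachable from the belief by the letter."""
    return frozenset(t for s in belief for (a, t) in trans[s] if a == letter)


def belief_automaton(trans: Support, start: str) -> BeliefGraph:
    """Part of the belief automaton reachable from {start} (non-empty beliefs only)."""
    delta: BeliefGraph = {}
    todo = [frozenset([start])]
    while todo:
        b = todo.pop()
        if b in delta:
            continue
        delta[b] = {}
        for a in {a for s in b for (a, _) in trans[s]}:  # letters enabled in b
            delta[b][a] = post(trans, b, a)
            todo.append(delta[b][a])
    return delta


def reachable(delta: BeliefGraph, b: Belief) -> Set[Belief]:
    seen, todo = {b}, [b]
    while todo:
        for d in delta[todo.pop()].values():
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return seen


def bottom_beliefs(delta: BeliefGraph) -> Set[Belief]:
    """Beliefs in a BSCC: every belief reachable from b can reach b again."""
    return {b for b in delta
            if all(b in reachable(delta, c) for c in reachable(delta, b))}


# Hypothetical strongly connected HMM support on states x, y, z.
EXAMPLE: Support = {
    "x": [("a", "x"), ("a", "y")],
    "y": [("b", "z")],
    "z": [("a", "x"), ("a", "y")],
}
```

On `EXAMPLE`, starting from `{x}`, `{y}` or `{z}` yields the same set of bottom beliefs, which matches the independence from the starting state claimed by the lemma. The quadratic reachability test keeps the code short; a standard SCC algorithm would be the efficient choice.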

Figure 2 Markov chain $M_{x,y}$ associated with the belief $\{x, y\}$.

Figure 4 HMMs $A_3$, $A_4$ and $A_5$ (left to right). One cannot classify between $A_3$ and $A_4$, but they can be attack-classified. On the other hand, one cannot attack-classify between $A_3$ and $A_5$.
Let $u$ be an observation. Let $B_k(u)$ be the belief of $A_1$ reached by $u$ from $\{i_k\}$, and $C_k(u)$ be the belief of $A_2$ reached by $u$ from $\{j_k\}$. We define $Z_1(u)$ as the set of beliefs $B_i(u)$, $i \le n$, and $Z_2(u)$ as the set of beliefs $C_i(u)$, $i \le m$. Notice that the sizes $|Z_1(u)|$ and $|Z_2(u)|$ (the number of distinct non-empty beliefs) are non-increasing with $u$.

[…] that we denote $E_D$.

Proposition 17. Let $(H_1, H_2)$ be a twin belief in the BSCC $E^1_D$. Then $(H_1, H_2)$ is oblivious.