Playing Stochastically in Weighted Timed Games to Emulate Memory

Weighted timed games are two-player zero-sum games played in a timed automaton equipped with integer weights. We consider optimal reachability objectives, in which one of the players, whom we call Min, wants to reach a target location while minimising the cumulated weight. While deciding whether Min has a strategy to guarantee a value lower than a given threshold is known to be undecidable (with two or more clocks), several conditions, one of them being divergence, have been given to recover decidability. In such weighted timed games (as in untimed weighted games in the presence of negative weights), Min may need finite memory to play (close to) optimally. It is thus tempting to try to emulate this finite memory with other strategic capabilities. In this work, we allow the players to use stochastic decisions, both in the choice of transitions and of timing delays. We give for the first time a definition of the expected value in weighted timed games, overcoming several theoretical challenges. We then show that, in divergent weighted timed games, the stochastic value is indeed equal to the classical (deterministic) value, thus proving that Min can guarantee the same value while only using stochastic choices, and no memory.


Introduction
Real-time aspects are often inherent in the behaviour of critical software systems. Timed automata [2] extend finite-state automata with timing constraints, providing an automata-theoretic framework to model and verify real-time systems. While this has led to the development of mature verification tools, the design of programs verifying some real-time specifications remains a notoriously difficult problem. One way to avoid the need for a posteriori debugging is to automatise the process as much as possible. To do so, the situation is modelled as a timed game, played by a controller and an antagonistic environment: they act, in a turn-based fashion, over a timed automaton. A simple, yet realistic, objective for the controller is to reach a target location. We are thus looking for a strategy of the controller, that is, a recipe dictating how to play so that the target is reached no matter how the environment plays. Reachability timed games are decidable [4], and EXPTIME-complete [19].

For many applications, this qualitative setting is too coarse to model the system faithfully. This motivated a shift to a quantitative setting, based on weighted extensions of the models considered so far. Weighted extensions of timed automata and games have thus been considered in order to measure the quality of the winning strategy for the controller [11,1]: when the controller has several winning strategies, the quantitative version of the game helps in choosing a good one with respect to some metrics. More precisely, the controller, which we now call player Min, wants to reach the target while minimising the cumulated weight. The model we consider, called weighted timed game (WTG for short), is defined as follows: the game takes place over a weighted (or priced) timed automaton [5,3], where locations are split among the two players, transitions are equipped with weights, and locations with rates of weights (the cost is then proportional to the time spent in this location, with the rate as proportionality coefficient). In this setting, the possibility to use negative weights on transitions and locations is crucial when one wants to model energy or other resources that can grow or decrease during the execution of the system under study.
While solving the optimal reachability problem on weighted timed automata has been shown to be PSPACE-complete [8] (i.e. the same complexity as the non-weighted version), the corresponding problem for WTGs is known to be undecidable [13]. Many restrictions have then been considered in order to regain decidability, the first and most interesting one being the class of strictly non-Zeno cost with only non-negative weights (in transitions and locations) [11]: this hypothesis requires that every execution of the timed automaton that follows a cycle of the region automaton has a weight far from 0 (in the interval $[1, +\infty)$, for instance). This setting has been extended to the presence of negative weights in transitions and locations [16]: in the so-called divergent WTGs, each execution that follows a cycle of the region automaton has a weight in $(-\infty, -1] \cup [1, +\infty)$. A triply-exponential-time algorithm allows one to compute the values and almost-optimal strategies, while deciding the divergence of a WTG is PSPACE-complete.
When studying optimal reachability objectives with both positive and negative weights, it is known that strategies of player Min require memory to play optimally (see [15] for the case of finite games). More precisely, the memory needed is pseudo-polynomial (i.e. polynomial if constants are encoded in unary). For WTGs, the memory needed even becomes exponential. An important challenge is thus to find ways to avoid using such complex strategies, e.g. by proposing alternative classes of strategies that are more easily amenable to implementation.
Strategies considered so far are deterministic. Though the game has no stochastic edges, it is possible to allow players to use stochastic strategies. This approach has been recently studied in the setting of finite games [20], where it is shown that memory may indeed be emulated using randomness in finite reachability games with integer weights. More precisely, the minimal value Min can achieve using memoryless stochastic strategies is the same as the value achievable using deterministic strategies. In the present work, we lift the results obtained in [20] for finite games to the timed setting.
A first important challenge is to analyse how to play stochastically in WTGs. To our knowledge, this has not been studied before. Starting from a notion of stochastic behaviours in a timed automaton considered in [7] (for the one-player setting), we propose a new class of stochastic strategies. Compared with [7], our class is larger in the sense that we allow Dirac distributions for delays, which subsumes the setting of deterministic strategies. However, in order to ensure that strategies yield a well-defined probability distribution on sets of executions, we need measurability properties stronger than those considered in [7] (we actually provide an example showing that their hypothesis was not strong enough).
Then, we turn our attention towards the expected cumulated weight of the set of plays conforming to a pair of stochastic strategies. We first prove that under the previous measurability hypotheses, this expectation is well-defined when restricting to the set of plays following a finite sequence of transitions. In order to have the convergence of the global expectation, we identify another property of strategies of Min, which intuitively ensures that the set of target locations is reached quickly enough. This allows us to define a notion of stochastic value (resp. memoryless stochastic value) of the game, i.e. the best value Min can achieve using stochastic strategies (resp. memoryless stochastic strategies), when Max uses stochastic strategies (resp. memoryless stochastic strategies) too.
Figure 1: On the left, a weighted timed game. Locations belonging to Min (resp. Max) are depicted by circles (resp. squares). The target location is $\ell_3$. Location $\ell_1$ (resp. $\ell_5$) has (deterministic) value $+\infty$ (resp. $-\infty$). As a consequence, the value in $\ell_4$ is determined by the edge to $\ell_3$, and depicted in blue on the right. In location $\ell_2$, the value associated with the transition to $\ell_3$ is depicted in red, and the deterministic value in $\ell_2$ is obtained as the minimum of these two curves.
In a second step, we aim at adapting the proof techniques of [20] from finite to infinite games. It is well-known that the classical region abstraction of timed automata is not suited to analyse WTGs (there are cases in which one has to split regions). In order to obtain positive results, we focus on the class of divergent WTGs. We prove that the notion of optimal deterministic switching strategy, which was central in the approach of [20], can be adapted to divergent WTGs. Our main result is then to show that for these games, the two versions of stochastic values are equal to the deterministic value. In other words, we show that Min can emulate memory using randomisation, and vice versa. Moreover, combining memory and randomisation does not increase Min's capabilities. Due to the lack of space, detailed proofs of all results can be found in the long version [21].

Weighted timed games
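We let $C$ be a finite set of variables called clocks. A valuation is a mapping $\nu : C \to \mathbb{R}_{\geq 0}$. For a valuation $\nu$, a delay $t \in \mathbb{R}_{\geq 0}$ and a subset $Y \subseteq C$ of clocks, we define the valuation $\nu + t$ by $(\nu + t)(x) = \nu(x) + t$ for all $x \in C$, and the valuation $\nu[Y := 0]$ by $(\nu[Y := 0])(x) = 0$ if $x \in Y$, and $(\nu[Y := 0])(x) = \nu(x)$ otherwise. Guards are constraints on the values of clocks, and we write $\nu \models g$ when the valuation $\nu$ satisfies the guard $g$. A weighted timed game $\mathcal{G}$ is given by a finite set $L$ of locations split between the players Min and Max, a subset $L_T \subseteq L$ of target locations, a finite set $\Delta$ of transitions $(\ell, g, Y, \ell')$ going from location $\ell$ to location $\ell'$ under guard $g$ while resetting the clocks of $Y$, and a weight function $\mathsf{wt}$ that associates an integer weight with every location and every transition. Without loss of generality, we suppose the absence of deadlocks except on target locations, i.e. for each location $\ell \in L \setminus L_T$ and valuation $\nu$, there exists $(\ell, g, Y, \ell') \in \Delta$ such that $\nu \models g$, and no transitions start in $L_T$. The semantics of a WTG $\mathcal{G}$ is defined in terms of a game played on an infinite transition system whose vertices are configurations of the WTG. A configuration is a pair $(\ell, \nu)$ with a location and a valuation of the clocks. Configurations are split between the players according to the location. A configuration is final if its location is a target location of $L_T$. The alphabet of the transition system is given by $\Delta \times \mathbb{R}_{\geq 0}$: a pair $(\delta, t)$ encodes the delay $t$ that a player wants to spend in the current location, before firing transition $\delta$. For every delay $t \in \mathbb{R}_{\geq 0}$, transition $\delta = (\ell, g, Y, \ell') \in \Delta$ and valuation $\nu$, there is an edge $(\ell, \nu) \xrightarrow{\delta, t} (\ell', \nu')$ if $\nu + t \models g$ and $\nu' = (\nu + t)[Y := 0]$. The weight of such an edge $e$ is given by $t \times \mathsf{wt}(\ell) + \mathsf{wt}(\delta)$. An example is depicted in Figure 1.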
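As an illustration of these semantics, here is a minimal Python sketch of a single edge: a delay elapses, the guard is checked on the delayed valuation, the clocks of the transition are reset, and the edge weight is the delay times the rate of the source location plus the weight of the transition. The representation is an assumption made for this example only: valuations are dictionaries from clock names to `Fraction`s, a guard is an arbitrary Python predicate, and `delay`, `reset` and `fire` are hypothetical helper names, not notation from the article.

```python
from fractions import Fraction

def delay(valuation, t):
    """Valuation reached after letting t time units elapse (all clocks advance)."""
    return {clock: value + t for clock, value in valuation.items()}

def reset(valuation, clocks):
    """Valuation in which the given clocks are set back to 0."""
    return {clock: Fraction(0) if clock in clocks else value
            for clock, value in valuation.items()}

def fire(configuration, transition, t, rate):
    """Take the edge labelled (transition, t) from a configuration.

    Returns (successor configuration, edge weight), or None when the edge
    does not exist (wrong source location, negative delay, or guard violated).
    """
    location, valuation = configuration
    source, guard, resets, target, weight = transition
    if source != location or t < 0:
        return None
    delayed = delay(valuation, t)
    if not guard(delayed):
        return None
    return (target, reset(delayed, resets)), t * rate[location] + weight

# Hypothetical one-transition example: guard 1 <= x <= 2, reset of x, weight 3.
EXAMPLE_TRANSITION = ("l0", lambda v: 1 <= v["x"] <= 2, {"x"}, "l1", 3)
EXAMPLE_RATE = {"l0": -2}  # rate of weight of location l0
START = ("l0", {"x": Fraction(1, 2), "y": Fraction(0)})
```

Exact rationals are used rather than floats so that guard tests on clock values are not affected by rounding.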
A finite play is a finite sequence of consecutive edges $\rho = (\ell_0, \nu_0) \xrightarrow{\delta_0, t_0} (\ell_1, \nu_1) \xrightarrow{\delta_1, t_1} \cdots \xrightarrow{\delta_{k-1}, t_{k-1}} (\ell_k, \nu_k)$. We sometimes denote such a play $(\ell_0, \nu_0) \xrightarrow{\delta_0, t_0} \cdots \xrightarrow{\delta_{k-1}, t_{k-1}}$ since intermediate locations and valuations are uniquely defined by the initial configuration and the sequence of transitions and delays. We denote by $|\rho|$ the length $k$ of $\rho$. The concatenation of two finite plays $\rho_1$ and $\rho_2$, such that $\rho_1$ ends in the same configuration as $\rho_2$ starts, is denoted by $\rho_1 \rho_2$. We denote by $I(\rho, \delta)$ the interval of delays $t$ such that the play $\rho$ can be extended with the edge $\xrightarrow{\delta, t}$. We let $\mathsf{FPlays}$ be the set of all finite plays, whereas $\mathsf{FPlays}_{\mathsf{Min}}$ (resp. $\mathsf{FPlays}_{\mathsf{Max}}$) denotes the finite plays that end in a configuration of Min (resp. Max). A play is then a maximal sequence of consecutive edges (it is either infinite or it reaches $L_T$).
We call path a finite or infinite sequence $\pi$ of transitions of $\mathcal{G}$. Each play $\rho$ of $\mathcal{G}$ is associated with a unique path $\pi$ (by projecting away everything but the transitions): we say that $\rho$ follows the path $\pi$. A target path is a finite path ending in the target set $L_T$. We denote by $\mathsf{TPaths}$ the set of target paths. We let $\mathsf{TPaths}_\rho$ (resp. $\mathsf{TPaths}^n_\rho$) be the subset of target paths that start from the last location of the finite play $\rho$ (resp. that moreover contain $n$ transitions). A path is said to be maximal if it is infinite or if it is a target path.
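The objective of Min is to reach a target configuration, while minimising the cumulated weight up to the target. Hence, we associate with every finite play $\rho = (\ell_0, \nu_0) \xrightarrow{\delta_0, t_0} \cdots \xrightarrow{\delta_{k-1}, t_{k-1}} (\ell_k, \nu_k)$ its cumulated weight, taking into account both discrete and continuous costs: $\mathsf{wt}_\Sigma(\rho) = \sum_{i=0}^{k-1} \bigl(t_i \times \mathsf{wt}(\ell_i) + \mathsf{wt}(\delta_i)\bigr)$. Then, the weight of a maximal play $\rho$, denoted by $\mathsf{wt}(\rho)$, is defined by $+\infty$ if $\rho$ is infinite (does not reach $L_T$), and $\mathsf{wt}_\Sigma(\rho)$ if it ends in $(\ell_T, \nu)$ with $\ell_T \in L_T$.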
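The cumulated weight of a finite play simply adds, for each edge, the delay multiplied by the rate of the source location and the weight of the fired transition. The following Python sketch computes it; the encoding of a play as a list of `(location, transition, delay)` triples and the function name `cumulated_weight` are assumptions made for illustration.

```python
from fractions import Fraction

def cumulated_weight(play, location_rate, transition_weight):
    """Sum of delay * rate(source location) + weight(transition) over the edges.

    `play` is a list of (location, transition, delay) triples, one per edge.
    """
    total = Fraction(0)
    for location, transition, delay in play:
        total += Fraction(delay) * location_rate[location] + transition_weight[transition]
    return total

# Hypothetical two-edge play: half a time unit in l0, then 3/2 time units in l1.
RATES = {"l0": 2, "l1": -1}
WEIGHTS = {"d0": 1, "d1": -3}
PLAY = [("l0", "d0", Fraction(1, 2)), ("l1", "d1", Fraction(3, 2))]
```

On this play the first edge contributes 1/2 × 2 + 1 = 2 and the second 3/2 × (−1) − 3 = −9/2, so `cumulated_weight(PLAY, RATES, WEIGHTS)` equals −5/2.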
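As usual in related work [1,11,12,16], we assume that all clocks are bounded by a constant $M \in \mathbb{N}$, i.e. every transition of the WTG is equipped with a guard $g$ such that $\nu \models g$ implies $\nu(x) \leq M$ for all clocks $x \in C$. We denote by $w^L_{\max}$ (resp. $w^\Delta_{\max}$, $w^e_{\max}$) the maximal weight in absolute value of locations (resp. of transitions, edges) of $\mathcal{G}$, i.e. $w^L_{\max} = \max_{\ell \in L} |\mathsf{wt}(\ell)|$, $w^\Delta_{\max} = \max_{\delta \in \Delta} |\mathsf{wt}(\delta)|$, and $w^e_{\max} = M w^L_{\max} + w^\Delta_{\max}$.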
In the following, we rely on the crucial notion of regions, as introduced in the seminal work on timed automata [2]. A game $\mathcal{G}$ can be populated with the region information, without loss of generality, as described formally in [16], for instance. The region automaton, or region game, $\mathcal{R}(\mathcal{G})$ is thus the WTG with locations $S = L \times \mathsf{Reg}(C, M)$ and all transitions $((\ell, r), g'', Y, (\ell', r'))$ with $(\ell, g, Y, \ell') \in \Delta$ such that the model of guard $g''$ (i.e. all valuations $\nu$ such that $\nu \models g''$) is a region $r''$, time successor of $r$ such that $r''$ satisfies the guard $g$, and $r'$ is the region obtained from $r''$ by resetting all clocks of $Y$. Distribution of locations to players, final locations, and weights are inherited from $\mathcal{G}$. We call region path a finite or infinite sequence of transitions in this automaton, and we again denote such paths by $\pi$. A play $\rho$ in $\mathcal{G}$ is projected on a region path $\pi$, with a similar definition as the projection on paths: we again say that $\rho$ follows the region path $\pi$. It is important to notice that, even if $\pi$ is a cycle (i.e. starts and ends in the same location of the region game), there may exist plays following it in $\mathcal{G}$ that are not cycles, due to the fact that regions are sets of valuations.
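We denote by $\mathsf{dVal}_{\ell,\nu}$ the deterministic value of a configuration $(\ell, \nu)$, i.e. $\mathsf{dVal}_{\ell,\nu} = \inf_{\sigma \in \mathsf{dStrat}_{\mathsf{Min}}} \sup_{\tau \in \mathsf{dStrat}_{\mathsf{Max}}} \mathsf{wt}(\mathsf{Play}((\ell, \nu), \sigma, \tau))$, where $\sigma$ and $\tau$ range over the deterministic strategies of Min and Max, and $\mathsf{Play}((\ell, \nu), \sigma, \tau)$ is the play from $(\ell, \nu)$ conforming to both. As shown in previous work [11,16], knowing whether $\mathsf{dVal}_{\ell,\nu} = +\infty$ for a certain configuration is a purely qualitative problem that can be decided easily by using the region game: indeed, $\mathsf{dVal}_{\ell,\nu} = +\infty$ if and only if Min has no strategy that guarantees reaching the target $L_T$. This is thus a reachability objective, where weights are useless. Moreover, Max has a strategy that guarantees that no plays reach the target $L_T$ from any configuration $(\ell, \nu)$ such that $\mathsf{dVal}_{\ell,\nu} = +\infty$. In this situation, considering stochastic choices is not interesting. We thus rule out this case by supposing in the following that no configurations of $\mathcal{G}$ have a value $+\infty$: such configurations can be removed in the region game by strengthening the guards on transitions.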

Playing stochastically in WTGs
Our first contribution consists in allowing both players to use stochastic choices in their strategies. From a game theory point of view, this seems natural. From a controller synthesis point of view, we claim that the question is natural too, especially because player Min may require exponential memory to play optimally in WTGs. This is already the case even without clocks (such games are then sometimes called shortest-path games) where it has been shown in [20] that the memory required by Min could be traded for stochastic choices instead (and vice versa). We aim at extending this result to the context of weighted timed games. Before doing so, we must introduce stochastic strategies in the context of weighted timed games, which has never been explored until now, as far as we are aware. We will however strongly rely on a recent line of works aiming at studying stochastic timed automata [7,9,6,10], thus extending the results to the context of two-player games (instead of model-checking) and with weights, which indeed represents the main challenge in order to give a meaning to the expected payoff. Naturally, deterministic strategies for Min are extended to more general stochastic strategies as mappings $\eta : \mathsf{FPlays}_{\mathsf{Min}} \to \mathsf{Dist}(\Delta \times \mathbb{R}_{\geq 0})$ where each finite play is associated with a probability distribution over the set of pairs of transition and delay. Here, we let $\mathsf{Dist}(S)$ denote the set of all probability distributions over a set $S$ (equipped with an underlying σ-algebra). Since $\Delta$ is a finite set, this is equivalent to letting first Min choose a transition via $\eta_\Delta : \mathsf{FPlays}_{\mathsf{Min}} \to \mathsf{Dist}(\Delta)$, and then, knowing the chosen transition, choose a delay via $\eta_{\mathbb{R}_+} : \mathsf{FPlays}_{\mathsf{Min}} \times \Delta \to \mathsf{Dist}(\mathbb{R}_{\geq 0})$, the support of the distribution $\eta_{\mathbb{R}_+}(\rho, \delta)$ being included in the interval $I(\rho, \delta)$ of valid delays. We can then recombine $\eta_\Delta$ and $\eta_{\mathbb{R}_+}$ to obtain the distribution $\eta(\rho)$. Similar definitions hold for Max, whose general strategies are denoted by $\theta$.
Notice that deterministic strategies are a special case of strategies, where the distributions are chosen to be Dirac distributions. Another useful restriction over strategies is the absence of memory: a strategy $\eta$ is memoryless if for all finite plays $\rho$, $\rho'$ ending in the same configuration, we have that $\eta(\rho) = \eta(\rho')$. A similar definition holds for Max.
Probability measure on plays. We fix two strategies $\eta$ and $\theta$ for both players, and an initial configuration $(\ell_0, \nu_0)$. Our goal is to define a probability measure on plays. To do so, and following the contribution of [7] for stochastic timed automata, the set of plays of a WTG $\mathcal{G}$ starting from $(\ell_0, \nu_0)$ and conforming to $\eta$ and $\theta$ can naturally be equipped with a structure of σ-algebra whose generators are all subsets of plays that start with a finite prefix following the same finite path $\pi$ (remember that paths are sequences of transitions, with no information on the delayed time) with some Borel-measurable constraints on the delays taken along $\pi$. The a priori idea is thus to define a probability measure $\mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}$ on such generators which extends uniquely as a probability measure over the whole σ-algebra, by Carathéodory's extension theorem.
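Consider thus a finite path $\pi$, starting in location $\ell$, and a play $\rho$ ending in the same location $\ell$. We define the probability $\mathbb{P}^{\eta,\theta}_\rho(\pi)$ taking into account all possible plays that start with $\rho$ and continue according to $\pi$ (we leave the Borel-measurable constraints on the delays for now, but discuss them later). It is defined by induction on the length of $\pi$ by $\mathbb{P}^{\eta,\theta}_\rho(\varepsilon) = 1$, and for all transitions $\delta = (\ell, g, Y, \ell')$, $\mathbb{P}^{\eta,\theta}_\rho(\delta\pi) = \eta_\Delta(\rho)(\delta) \times \int_{I(\rho,\delta)} \mathbb{P}^{\eta,\theta}_{\rho \xrightarrow{\delta,t}}(\pi) \, \mathrm{d}\eta_{\mathbb{R}_+}(\rho,\delta)(t)$ if $\rho \in \mathsf{FPlays}_{\mathsf{Min}}$, and similarly with $\theta$ in place of $\eta$ if $\rho \in \mathsf{FPlays}_{\mathsf{Max}}$. This definition is very similar to the one in [7] except that we choose to decouple the distribution on pairs of $\Delta \times \mathbb{R}_{\geq 0}$ by first selecting a transition and then a delay, whereas the authors of [7] consider independent choices, the one on transitions being described by some weights on transitions (depending on the current region).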
For modelling purposes, the authors of [7] enforce that probability distributions on delays do not forbid any delays of the interval $I(\rho, \delta)$ of possible delays, thus ruling out singular distributions like Dirac ones that would consider taking a single possible delay (like deterministic strategies do). More formally, they require $\eta_{\mathbb{R}_+}(\rho, \delta)$ to be absolutely continuous (i.e. equivalent to the Lebesgue measure) on the interval $I(\rho, \delta)$. We claim that even with this assumption, the previous definition of the probability may not be well-founded, as demonstrated by the example given in [21, Appendix A]. From this example, we see the importance of moreover enforcing that the distributions $\eta_\Delta(\rho)$ and $\eta_{\mathbb{R}_+}(\rho, \delta)$ are "measurable wrt the sequence of delays along the play $\rho$". This is easy to define for the transition part. For delays, since we want deterministic strategies to be a subset of stochastic strategies, we must be able to choose delays by using Dirac distributions, and by extension discrete distributions (that are not absolutely continuous, as [7] requires). This results in the following hypothesis:
▶ Hypothesis 1. A strategy $\eta$ satisfies this hypothesis if 1. for all transitions $\delta_0, \ldots, \delta_k, \delta$, the mapping [...]
This hypothesis allows us to obtain the following: if $\eta$ and $\theta$ are strategies satisfying Hypothesis 1, the probabilities $\mathbb{P}^{\eta,\theta}_\rho(\pi)$ of following a path $\pi$ after the play $\rho$ are well defined. This can be extended into a probability distribution over maximal paths $\pi$ starting in the last location of $\rho$.
Apart from well-definedness, which is new, the rest of the proof is very close to the one of [7]. The probability measure easily extends to unions of maximal paths: in particular, $\mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}(\mathsf{TPaths}_{\ell_0,\nu_0})$ is set as the sum $\sum_{\pi \in \mathsf{TPaths}_{\ell_0,\nu_0}} \mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}(\pi)$ of probabilities of all paths reaching $L_T$ from $\ell_0$. The authors of [7] go one step further, by using Carathéodory's theorem to extend the probability measure on paths ($\mathbb{P}^{\eta,\theta}_\rho(\pi)$) to a measure on plays ($\mathbb{P}^{\eta,\theta}_\rho$), whose σ-algebra is generated by maximal plays with Borel-measurable constraints on the delays. We do not formally need this further extension and will only use it to give an intuitive introduction of the expected payoff below. In the following, we let $\mathsf{Strat}_{\mathsf{Min}}$ and $\mathsf{Strat}_{\mathsf{Max}}$ be the sets of (stochastic) strategies satisfying Hypothesis 1, for both players. We let $\mathsf{mStrat}_{\mathsf{Min}}$ and $\mathsf{mStrat}_{\mathsf{Max}}$ be the respective subsets of memoryless strategies.
Expected payoff of plays. As explained before, by Carathéodory's theorem, the set of plays can be equipped with a probability distribution, and we are interested in the expectation of the random variable $\mathsf{wt}(\rho)$ (where $\rho$ conforms with two fixed strategies $\eta$ and $\theta$). This only makes sense if the probability to reach a target location is equal to 1, since otherwise, the expected weight will intuitively be $+\infty$ (there is a non-zero probability to not reach the target location, the weight of all such plays being $+\infty$). We thus now require that $\mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}(\mathsf{TPaths}_{\ell_0,\nu_0}) = 1$ (i.e. the probability to follow an infinite path is 0). We will see afterwards that this is not a sufficient condition to ensure that the expected weight is finite.
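We would like to define the expected weight to reach the target as (we write $\mathbb{E}^{\eta,\theta}_{\ell_0,\nu_0}$ instead of $\mathbb{E}^{\eta,\theta}_{\ell_0,\nu_0}(\mathsf{wt})$, since we only consider the expectation of the weight $\mathsf{wt}$): $\mathbb{E}^{\eta,\theta}_{\ell_0,\nu_0} = \int_\rho \mathsf{wt}(\rho) \, \mathrm{d}\mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}(\rho)$ where the integral is over all plays $\rho$ that start in $(\ell_0, \nu_0)$ and reach the target $L_T$ (such a restriction is again justified by the fact that the probability mass of all other plays is 0). This is problematic a priori (and we will see below an example where this indeed would be a problem) since the cumulated weight is not known to be a measurable function of the play, wrt the measure $\mathbb{P}^{\eta,\theta}_{\ell_0,\nu_0}$. To overcome this challenge, we follow a different approach, consisting in mimicking the construction of the probability before: first define the expected payoff of all plays following a given path, and then sum over all possible paths.
▶ Definition 3. We define the expected weight $\mathbb{E}^{\eta,\theta}_\rho(\pi)$ of plays that can extend $\rho$ (the weight of $\rho$ is thus not counted in the expectation) and that follow the path $\pi$. It is defined by induction on the length of $\pi$ by $\mathbb{E}^{\eta,\theta}_\rho(\varepsilon) = 0$ and for all transitions $\delta = (\ell, g, Y, \ell')$: $\mathbb{E}^{\eta,\theta}_\rho(\delta\pi) = \eta_\Delta(\rho)(\delta) \times \int_{I(\rho,\delta)} \bigl[ (t \times \mathsf{wt}(\ell) + \mathsf{wt}(\delta)) \, \mathbb{P}^{\eta,\theta}_{\rho \xrightarrow{\delta,t}}(\pi) + \mathbb{E}^{\eta,\theta}_{\rho \xrightarrow{\delta,t}}(\pi) \bigr] \, \mathrm{d}\eta_{\mathbb{R}_+}(\rho,\delta)(t)$ if $\rho \in \mathsf{FPlays}_{\mathsf{Min}}$, and similarly with $\theta$ in place of $\eta$ if $\rho \in \mathsf{FPlays}_{\mathsf{Max}}$. We then define the expected weight $\mathbb{E}^{\eta,\theta}_\rho = \sum_{\pi \in \mathsf{TPaths}_\rho} \mathbb{E}^{\eta,\theta}_\rho(\pi)$, when this sum converges.
1 We let $H$ denote the mapping from $\mathbb{R}$ to $[0, 1]$ such that $H(t) = 0$ if $t < 0$ and $H(t) = 1$ otherwise. Recall that it is the CDF of the Dirac distribution choosing $t = 0$.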
ICALP 2021

Hypothesis 1 is sufficient to show the well-definedness of all expectations $\mathbb{E}^{\eta,\theta}_\rho(\pi)$: if $\eta$ and $\theta$ satisfy Hypothesis 1, then $\mathbb{E}^{\eta,\theta}_\rho(\pi)$ is well-defined for all $\rho$ and $\pi$. However, the infinite sum in $\mathbb{E}^{\eta,\theta}_\rho$ can be problematic. We thus need a stronger hypothesis to ensure its convergence. We adopt here an asymmetrical point of view, relying only on a hypothesis on the strategy $\eta$ of Min. Our choice is grounded in our controller synthesis view, Min being the controller desiring to reach a target location with minimum expected payoff, while Max is an uncontrollable environment.
▶ Definition 5. A strategy $\eta \in \mathsf{Strat}_{\mathsf{Min}}$ of Min is said to be proper if for all finite plays $\rho$ and strategies $\theta \in \mathsf{Strat}_{\mathsf{Max}}$, $\mathbb{P}^{\eta,\theta}_\rho(\mathsf{TPaths}_\rho) = 1$ and the infinite sum $\sum_{\pi \in \mathsf{TPaths}_\rho} \mathbb{E}^{\eta,\theta}_\rho(\pi)$ converges. We let $\mathsf{Strat}^p_{\mathsf{Min}}$ be the set of proper strategies of Min, and $\mathsf{mStrat}^p_{\mathsf{Min}}$ the subset of memoryless proper strategies.
Notice that a deterministic strategy of Min is proper as soon as it guarantees to reach the target set of locations (remember that we have ruled out configurations with a deterministic value $\mathsf{dVal}_{\ell,\nu} = +\infty$ where Min cannot deterministically guarantee to reach the target $L_T$): this shows that proper strategies exist (even without using memory). For stochastic strategies, we have seen above that reaching the target set of locations with probability 1 is a necessary but not sufficient condition to be proper. Not only must Min reach the target almost surely, but he must do so quickly enough for the expectation to converge. We now give a sufficient condition for a strategy to be proper, that we will use in the rest of this article.

▶ Hypothesis 2. A strategy $\eta \in \mathsf{Strat}_{\mathsf{Min}}$ of Min satisfies this hypothesis if there exist $m \in \mathbb{N}$ and $\alpha \in (0, 1]$ such that for all finite plays $\rho$ and strategies $\theta \in \mathsf{Strat}_{\mathsf{Max}}$, $\mathbb{P}^{\eta,\theta}_\rho\bigl(\bigcup_{n \leq m} \mathsf{TPaths}^n_\rho\bigr) \geq \alpha$.
This hypothesis is indeed a sufficient condition for a strategy to be proper:
▶ Lemma 6. All strategies of Min satisfying Hypothesis 2 are proper.
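To see why a uniform bound of this kind is plausible as a sufficient condition, the following back-of-the-envelope computation may help. It is a sketch under stated assumptions, not the proof given in [21]: we assume that every edge has weight at most $w^e_{\max}$ in absolute value (clocks being bounded), so that a target path $\pi$ contributes at most $|\pi| \, w^e_{\max} \, \mathbb{P}^{\eta,\theta}_\rho(\pi)$ in absolute value, and that Hypothesis 2 can be applied again after each block of $m$ steps, which gives a geometric bound on the probability of not having reached the target.

```latex
% Sketch only: p_k is the probability of not reaching the target within km steps.
\begin{align*}
p_k &:= 1 - \mathbb{P}^{\eta,\theta}_{\rho}\Bigl(\bigcup_{n \le km} \mathsf{TPaths}^n_\rho\Bigr)
      \le (1-\alpha)^k, \\
\sum_{\pi \in \mathsf{TPaths}_\rho} \bigl|\mathbb{E}^{\eta,\theta}_\rho(\pi)\bigr|
 &\le \sum_{k \ge 0} \; \sum_{km < |\pi| \le (k+1)m}
      |\pi| \, w^e_{\max} \, \mathbb{P}^{\eta,\theta}_\rho(\pi) \\
 &\le \sum_{k \ge 0} (k+1)\, m \, w^e_{\max} \, (1-\alpha)^k
  = \frac{m \, w^e_{\max}}{\alpha^2}.
\end{align*}
```

The last equality uses $\sum_{k \ge 0} (k+1) x^k = 1/(1-x)^2$ with $x = 1 - \alpha$. Under these assumptions the sum of expectations converges absolutely, and $p_k \to 0$ gives that the target is reached with probability 1.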
Our main contribution, presented in detail in Section 5, is to compare the memoryless (stochastic) value, the deterministic value and the stochastic value, showing their equality for a fragment of WTGs. Along the way, we will need the following result (Lemma 7), showing that when Min plays with a proper strategy, Max always has a best-response strategy that is deterministic.

Divergent weighted timed games
As we have already seen in the introduction, interesting fragments of WTGs have been designed, in order to regain decidability of the problem of determining whether the value of a WTG is below a certain threshold. One such fragment is obtained by enforcing a semantical property of divergence (originally called strictly non-Zeno cost when only dealing with non-negative weights [11]): it asks that every play following a cycle in the region automaton has weight far from 0. We will consider this restriction in the following, since it allows for a large class of decidable WTGs, with no limitations on the number of clocks. Formally, a cyclic region path $\pi$ of $\mathcal{R}(\mathcal{G})$ is said to be a positive cycle (resp. a negative cycle) if every finite play $\rho$ following $\pi$ satisfies $\mathsf{wt}_\Sigma(\rho) \geq 1$ (resp. $\mathsf{wt}_\Sigma(\rho) \leq -1$).

▶ Definition 8 ([16]). A WTG is divergent if every cyclic region path is positive or negative.
In [16], it is shown that this definition is equivalent to requiring that for all strongly connected components (SCC) $S$ of the graph of $\mathcal{R}(\mathcal{G})$, either every cycle $\pi$ inside $S$ is positive (we say that the SCC is positive), or every cycle $\pi$ inside $S$ is negative (we say that the SCC is negative). The best computability result in this setting is:
▶ Theorem 9 ([16]). The deterministic value of a divergent WTG can be computed in triply-exponential time.
We explain how to recover from Theorem 9 the needed shape of ε-optimal strategies, since this is one of the new technical ingredients we need afterwards.
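Switching strategies for Min. Theorem 9 is obtained in [16] by using a value iteration algorithm (originally described in [1] for acyclic timed automata). If $V$ represents a value function, i.e. a mapping $L \times \mathbb{R}^C_{\geq 0} \to \mathbb{R}_\infty$, we denote by $V_{\ell,\nu}$ the image $V(\ell, \nu)$, for better readability. One step of the game is summarised in the following operator $\mathcal{F}$ mapping each value function $V$ to the value function defined for all $(\ell, \nu)$ by $\mathcal{F}(V)_{\ell,\nu} = 0$ if $\ell \in L_T$, $\mathcal{F}(V)_{\ell,\nu} = \sup_{(\ell,\nu) \xrightarrow{\delta,t} (\ell',\nu')} \bigl(t \times \mathsf{wt}(\ell) + \mathsf{wt}(\delta) + V_{\ell',\nu'}\bigr)$ if $\ell$ belongs to Max, and $\mathcal{F}(V)_{\ell,\nu} = \inf_{(\ell,\nu) \xrightarrow{\delta,t} (\ell',\nu')} \bigl(t \times \mathsf{wt}(\ell) + \mathsf{wt}(\delta) + V_{\ell',\nu'}\bigr)$ if $\ell$ belongs to Min, where $(\ell, \nu) \xrightarrow{\delta,t} (\ell', \nu')$ ranges over valid edges in $\mathcal{G}$. Then, starting from $V_0$ mapping every configuration to $+\infty$, except for the targets mapped to 0, we let $V_i = \mathcal{F}(V_{i-1})$ for all $i > 0$. The value function $V_i$ is intuitively what Min can guarantee when forced to reach the target in at most $i$ steps.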
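The iteration of the one-step operator can be illustrated on the untimed special case (no clocks), where each vertex only takes a minimum or a maximum over finitely many outgoing edges. The sketch below is not the algorithm of [16], which manipulates piecewise affine functions over valuations; the three-vertex graph, the function names `value_iteration_step` and `iterate`, and the data layout are assumptions made for illustration.

```python
import math

def value_iteration_step(values, edges, owner, targets):
    """One application of the one-step operator on a finite weighted game."""
    new_values = {}
    for vertex in values:
        if vertex in targets:
            new_values[vertex] = 0
            continue
        options = [weight + values[successor] for weight, successor in edges[vertex]]
        # Min minimises and Max maximises the weight to the target.
        choose = min if owner[vertex] == "Min" else max
        new_values[vertex] = choose(options)
    return new_values

def iterate(edges, owner, targets, steps):
    """Start from +infinity everywhere except targets (value 0), then iterate."""
    values = {vertex: (0 if vertex in targets else math.inf) for vertex in edges}
    for _ in range(steps):
        values = value_iteration_step(values, edges, owner, targets)
    return values

# Hypothetical game: Min owns a, Max owns b, t is the target.
# The only cycle a -> b -> a has weight -1, so every cycle is negative.
EXAMPLE_EDGES = {
    "a": [(0, "t"), (-1, "b")],
    "b": [(0, "a"), (-3, "t")],
    "t": [],
}
EXAMPLE_OWNER = {"a": "Min", "b": "Max"}
EXAMPLE_TARGETS = {"t"}
```

On this game the values stabilise after nine iterations at −4 for `a` and −3 for `b`. Min can only guarantee −4 by taking the edge to `b` several times before exiting to the target, which is a small instance of the memory requirement discussed in the introduction.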
The value computation of Theorem 9 is then obtained in two steps. First, configurations $(\ell, \nu)$ of value $\mathsf{dVal}_{\ell,\nu} = -\infty$ are found by using a decomposition of the region game $\mathcal{R}(\mathcal{G})$ into strongly connected components (SCC). Indeed, in divergent WTGs, configurations of value $-\infty$ are all the ones from which Min has a strategy to visit infinitely many times configurations of a single location $(\ell, r)$ of $\mathcal{R}(\mathcal{G})$ contained in a negative SCC. This is thus a Büchi objective on the region game, that can easily be solved with some attractor computations. Notice that if a configuration $(\ell, \nu)$ has value $-\infty$, this implies that all configurations $(\ell, \nu')$ with $\nu'$ in the same region as $\nu$ have value $-\infty$. As we explained at the end of Section 2 for the values $+\infty$, we can then remove configurations of value $-\infty$ by strengthening the guards on transitions, while leaving the other finite values unchanged.

Then, the (finite) value $\mathsf{dVal}$ is obtained as an iterate $V_H$ of the previous operator, with $H$ polynomial in the size of the region game and the maximal weights of $\mathcal{G}$. This means that playing for only a bounded number of steps is equivalent to the original game. In particular, at horizon $H$, we have that $\mathcal{F}(V_H) = V_{H+1} = \mathsf{dVal}$ so that $\mathsf{dVal}$ is a fixpoint of $\mathcal{F}$. As a side effect, this allows one to decompose the clock space $\mathbb{R}^C_{\geq 0}$ into a finite number $\alpha$ of cells (a refinement of the classical regions) such that $\mathsf{dVal}$ is affine on each cell.
Based on this, we can construct good strategies for Min that have a special form, the so-called switching strategies (introduced in [15] in the untimed setting, further extended to the timed setting with only one clock in [14]).

▶ Definition 10. A switching strategy $\sigma$ is described by two deterministic memoryless strategies $\sigma_1$ and $\sigma_2$, as well as a switching threshold $K$. The strategy $\sigma$ then consists in playing strategy $\sigma_1$ until either we reach a target location, or the finite play has length at least $K$, in which case we switch to strategy $\sigma_2$.
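The way a switching strategy combines its two memoryless components can be sketched as a small higher-order function in Python. The encoding is an assumption made for illustration: a finite play is a non-empty list of configurations (so its length is the number of configurations minus one), a memoryless strategy is a function of the current configuration, and `switching_strategy` is a hypothetical name.

```python
def switching_strategy(sigma1, sigma2, threshold):
    """Combine two memoryless strategies into a switching strategy.

    The result plays sigma1 while the finite play has fewer than `threshold`
    edges, and sigma2 once its length is at least `threshold`.
    """
    def sigma(play):
        length = len(play) - 1   # number of edges of the finite play
        current = play[-1]       # memoryless components only see the last configuration
        if length < threshold:
            return sigma1(current)
        return sigma2(current)
    return sigma

# Hypothetical usage: keep looping for the first two steps, then head for the target.
EXAMPLE_SIGMA = switching_strategy(lambda config: "loop", lambda config: "exit", 2)
```

The only memory used by the combined strategy is the length of the play so far, compared against the threshold.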
In particular, if all configurations have a finite deterministic value, there exists an ε-optimal switching strategy wrt the deterministic value. In the presence of a configuration with a deterministic value $-\infty$, we build from Theorem 11 a family of switching strategies (indexed by the parameter $N$) whose value tends to $-\infty$.
The proof of Theorem 11 requires building both strategies $\sigma_1$ and $\sigma_2$, as well as a switching threshold $K$. The second strategy $\sigma_2$ only consists in reaching the target and is thus obtained as a deterministic memoryless strategy from a classical attractor computation in the region game $\mathcal{R}(\mathcal{G})$. It is easy to choose $\sigma_2$ smooth enough so that it fulfils Hypothesis 1. In contrast, the first strategy $\sigma_1$ requires more care. We build it so that it fulfils two properties, which we summarise by saying that $\sigma_1$ is fake-ε-optimal wrt the deterministic value:
1. each finite play $\rho$ conforming to $\sigma_1$ from $(\ell, \nu)$ and reaching the target has a cumulated weight at most $\mathsf{dVal}_{\ell,\nu} + |\rho| \varepsilon$ (in particular, if $\mathsf{dVal}_{\ell,\nu} = -\infty$, no such plays should exist);
2. each finite play conforming to $\sigma_1$ following a long enough cycle in the region game $\mathcal{R}(\mathcal{G})$ has a cumulated weight at most $-1$.
Here, "fake" means that $\sigma_1$ is not obliged to guarantee reaching the target, but if it does so, it must do it with a cumulated weight close to $\mathsf{dVal}_{\ell,\nu}$, the error factor depending linearly on the length of the play. The second property ensures that playing $\sigma_1$ long enough without reaching the target results in diminishing the cumulated weight. Then, if the switch happens at a horizon $K$ big enough ($K = (w^e_{\max} |\mathcal{R}(\mathcal{G})| (|L|\alpha + 2) + N)(|\mathcal{R}(\mathcal{G})| (|L|\alpha + 1) + 1)$ suffices for instance), Min is sure that the cumulated weight so far is low enough so that the rest of the play to reach a target location (following $\sigma_2$ only) will not make the weight increase too much. In the absence of values $-\infty$ in $\mathsf{dVal}$, the first property allows one to obtain a $K\varepsilon$-optimal strategy even in the case where the switch does not occur (because we reach the target prematurely). The construction of a fake-$\varepsilon/K$-optimal strategy $\sigma_1$ (the linear dependency on the length of the play in the first property of fake-optimality is thus taken care of by a division by $K$ here) relies on the fact that $\mathcal{F}(\mathsf{dVal}) = \mathsf{dVal}$ to play almost-optimally at horizon 1. More formally, for all configurations of value $-\infty$, $\sigma_1$ is built as a winning strategy for the Büchi objective "visit infinitely often configurations of a location $(\ell, r)$ of $\mathcal{R}(\mathcal{G})$ contained in a negative SCC". By definition, all cyclic paths following $\sigma_1$ will be inside a negative SCC, and thus be of cumulated weight at most $-1$, by divergence of the WTG. Moreover, no plays conforming to $\sigma_1$ from such a configuration of value $-\infty$ will reach a target location, since the chosen negative SCC is a trap controlled by Min. It is easy to choose $\sigma_1$ smooth enough so that it fulfils Hypothesis 1.
For the remaining configurations of finite value, we rely upon operator $\mathcal{F}$, letting $\sigma_1$ choose a decision that minimises the value at horizon 1. However, because of the guards on clocks, infimum/supremum operators in $\mathcal{F}$ are not necessarily minima/maxima, and we thus need to allow for a small error at each step of the strategy: this is the main difference with the untimed setting, which explains why our definition of switching strategy needed to be adapted. We will use the $\operatorname{arginf}^\varepsilon$ operator, defined for all mappings $f$ as the set of arguments whose image by $f$ is at most $\varepsilon$ above the infimum of $f$, and let $\sigma_1$ pick its decision in such a set for the one-step value computed from $\mathsf{dVal}$. This set is non-empty since $\mathsf{dVal}$ is a fixpoint of operator $\mathcal{F}$ in this case. Moreover, knowing that the mapping $\mathsf{dVal}_\ell$ is piecewise affine by the results shown in [16], it is possible to choose $\sigma_1$ so that it fulfils the measurability (even piecewise continuity) conditions of Hypothesis 1. More precisely, we can consider it to take the same kind of decision for all configurations of a same cell: same transition, and either no delay or a delay jumping to the same border of the cell. The strategy $\sigma_1$ thus built makes a small error wrt the optimal at each step. But once again strongly relying on the divergence of the WTG, we can nevertheless show that $\sigma_1$ is fake-$\varepsilon/K$-optimal wrt the deterministic value.
Memoryless strategies for Max. WTGs are known to be determined [14], i.e. the deterministic value is also equal to $\mathsf{dVal}_{\ell,\nu} = \sup_{\tau \in \mathsf{dStrat}_{\mathsf{Max}}} \inf_{\sigma \in \mathsf{dStrat}_{\mathsf{Min}}} \mathsf{wt}(\mathsf{Play}((\ell, \nu), \sigma, \tau))$. In this setting, we can turn our study to the point of view of Max, looking for good strategies for this other player. A deterministic strategy $\tau$ of Max has an associated value: $\mathsf{dVal}^{\tau}_{\ell,\nu} = \inf_{\sigma \in \mathsf{dStrat}_{\mathsf{Min}}} \mathsf{wt}(\mathsf{Play}((\ell, \nu), \sigma, \tau))$. It is $\varepsilon$-optimal wrt the deterministic value if $\mathsf{dVal}^{\tau}_{\ell,\nu} \geq \mathsf{dVal}_{\ell,\nu} - \varepsilon$ for all $(\ell, \nu)$. As Max does not wish to go to the target, we show that no switch is necessary to play $\varepsilon$-optimally: memoryless strategies are sufficient to guarantee a value as close as wanted to the deterministic value. For a configuration with a value equal to $-\infty$, all the deterministic strategies for Max are equivalent, since they are all equally bad. Without loss of generality, we can therefore suppose that there are no configurations in $G$ with a value equal to $-\infty$. Then, it is shown in [16] that the remaining values are bounded in absolute value by $w^{e}_{\max}\,|R(G)|$, since optimal plays have no cycles. We use that fact to build a memoryless deterministic strategy $\tau$ analogous to strategy $\sigma_1$ before:

▶ Theorem 12. In a divergent WTG, there exists a memoryless $\varepsilon$-optimal strategy for player Max wrt the deterministic value (that moreover satisfies Hypothesis 1).

Emulate memory with randomness, and vice versa
The main contribution of this article, apart from defining a notion of expected value in weighted timed games, is to relate the different notions of values. In divergent WTGs, memory can thus be fully emulated with stochastic choices, and combining memory and stochastic choices does not bring more power to players, which we summarise by: in all divergent WTGs, for all configurations $(\ell, \nu)$, $\mathsf{dVal}_{\ell,\nu} = \mathsf{Val}_{\ell,\nu} = \mathsf{mVal}_{\ell,\nu}$.

The proof of this result is decomposed into several inequalities on these values. One is easier, and holds for all WTGs: the stochastic value is at most equal to the deterministic value, using the inclusion of deterministic strategies into stochastic ones, and Lemma 7.
We show in the rest of this section the inequalities comparing the deterministic value with the other values: first we show that memoryless stochastic strategies can emulate deterministic ones ($\mathsf{mVal}_{\ell,\nu} \leq \mathsf{dVal}_{\ell,\nu}$); then we show that deterministic strategies can emulate stochastic ones ($\mathsf{dVal}_{\ell,\nu} \leq \mathsf{Val}_{\ell,\nu}$ and $\mathsf{dVal}_{\ell,\nu} \leq \mathsf{mVal}_{\ell,\nu}$).
Simulating deterministic strategies with memoryless strategies. We focus here on showing that, for all configurations $(\ell, \nu)$, $\mathsf{mVal}_{\ell,\nu} \leq \mathsf{dVal}_{\ell,\nu}$. We build a memoryless strategy of Min at least as good as a deterministic strategy. By Theorem 11, we can start from a switching strategy for Min. For $N \in \mathbb{N}$ and $\varepsilon > 0$, we thus consider a switching strategy $\sigma = (\sigma_1, \sigma_2, K)$ of value $\mathsf{dVal}^{\sigma}_{\ell,\nu} \leq \max(-N, \mathsf{dVal}_{\ell,\nu}) + \varepsilon$, and simulate it with a memoryless strategy for Min, denoted $\eta^p$, with a probability parameter $p \in (0, 1)$. This new strategy is a probabilistic superposition of the two memoryless deterministic strategies $\sigma_1$ and $\sigma_2$.
Formally, we define $\eta^p(\ell, \nu)$, with $\ell \in L_{\mathsf{Min}}$, depending on the sign of the SCC containing the location $(\ell, r)$, with $r$ the region of $\nu$, of the region game $R(G)$.
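As an illustration of this probabilistic superposition, here is a minimal Python sketch. It is an assumption-laden reading of the proof that follows, not the paper's definition: in a positive SCC the decision of $\sigma_1$ is taken with probability 1, and in a negative SCC the decision of $\sigma_1$ is taken with probability $p$ and that of $\sigma_2$ with probability $1 - p$. The function `superposition`, the predicate `in_negative_scc` and the encoding of decisions are hypothetical names introduced here.

```python
import random
from typing import Callable, Optional, Tuple, TypeVar

C = TypeVar("C")             # a configuration (location, valuation)
Decision = Tuple[str, float]  # hypothetical encoding: (transition name, delay)


def superposition(
    sigma1: Callable[[C], Decision],
    sigma2: Callable[[C], Decision],
    in_negative_scc: Callable[[C], bool],  # assumed oracle on the region game
    p: float,
    rng: Optional[random.Random] = None,
) -> Callable[[C], Decision]:
    """Memoryless randomised mix of two memoryless deterministic strategies."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    source = rng if rng is not None else random.Random()

    def eta_p(config: C) -> Decision:
        # Positive SCC: follow sigma1 with probability 1.
        if not in_negative_scc(config):
            return sigma1(config)
        # Negative SCC: sigma1 with probability p, sigma2 with probability 1 - p.
        return sigma1(config) if source.random() < p else sigma2(config)

    return eta_p
```

The returned function depends only on the current configuration, never on the history: the switch to $\sigma_2$ is triggered by a coin flip rather than by a step counter.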
Since Theorem 11 ensures that strategies $\sigma_1$ and $\sigma_2$ satisfy Hypothesis 1, the superposition $\eta^p$ also satisfies this hypothesis. Moreover, we use the sufficient condition in Hypothesis 2 to show that $\eta^p$ is also proper:

▶ Lemma 15. For all $p \in (0, 1)$, the strategy $\eta^p$ satisfies Hypothesis 2.
Proof. Lemma 7 allows us to limit ourselves to deterministic strategies for Max. For all deterministic strategies $\tau$ of Max, we compute a lower bound on $p$, independent of $\tau$, such that $\mathbb{E}^{\eta^p,\tau}_{\ell,\nu} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu} + 3\varepsilon/2$. By Lemma 7 (with $\varepsilon/2$), we obtain the desired $\mathsf{mVal}^{\eta^p}_{\ell,\nu} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu} + 2\varepsilon$.

The case where the whole region game only contains positive SCCs is easy, since then $\eta^p$ chooses the transition and delay given by $\sigma_1$ with probability 1. By divergence, $G$ then contains no negative cycles. A play conforming to $\eta^p$ is also conforming to the deterministic strategy $\sigma_1$, so it must be acyclic. In particular, there exists only one play $\rho$ conforming to $\eta^p$ and $\tau$. This one is also conforming to $\sigma$ and thus reaches the target with a cumulated weight $\mathsf{wt}_{\Sigma}(\rho) = \mathbb{E}^{\eta^p,\tau}_{\ell,\nu} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu}$ as expected.

Now, suppose that the region game contains at least one negative SCC. We let $c > 0$ be the maximal size of an elementary cycle of the region game (that visits a pair $(\ell, r)$ at most once) and $w^{-} > 0$ be the opposite of the maximal cumulated weight of an elementary negative cycle in $R(G)$ (necessarily bounded by $w^{e}_{\max}\,|R(G)|$). We partition the set $\mathsf{FPlays}^{\eta^p,\tau}_{\ell,\nu}$ into subsets $\Pi_{i,j}$ according to the number $i$ of choices of probability $1 - p$ along the play (this probability being, as described previously, the product of the probabilities given by $\eta^p_{\Delta}$ and $\eta^p_{\mathbb{R}^+}$), and their length $j$ (we always have $i \leq j$). The partition is depicted in Figure 2: $\Pi_{\mathbb{N},\geq K}$, depicted in blue, contains all plays with a length greater than $K$ (the switching threshold); $\Pi_{0,\leq K}$, depicted in yellow, contains all plays without any choice of probability $1 - p$, with a length at most $K$; $\Pi$, depicted in red, contains the rest of the plays.
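The three zones of this partition can be made concrete with a small Python sketch. The functions `zone` and `classify` and the encoding of a play as a list of booleans (one per step, true when the step uses a choice of probability $1 - p$) are hypothetical. The treatment of plays of length exactly $K$ follows the prose ("greater than $K$" for blue, "at most $K$" for yellow and red), which is an assumption about the boundary case.

```python
from typing import Sequence


def zone(i: int, j: int, K: int) -> str:
    """Zone of a play with i choices of probability 1 - p and length j.

    Assumed boundary convention: length exactly K counts as "at most K".
    """
    if not 0 <= i <= j:
        raise ValueError("expected 0 <= i <= j")
    if j > K:
        return "blue"    # long plays, whatever the number of 1 - p choices
    if i == 0:
        return "yellow"  # short plays that only follow sigma1
    return "red"         # short plays with at least one 1 - p choice


def classify(used_switch_choice: Sequence[bool], K: int) -> str:
    """Classify a play given, for each step, whether it took a 1 - p choice."""
    i = sum(1 for flag in used_switch_choice if flag)
    j = len(used_switch_choice)
    return zone(i, j, K)
```

For instance, with $K = 5$, a play of three steps that never takes a choice of probability $1 - p$ is in the yellow zone, whereas the same play with one such choice is in the red zone.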
We can use the particular shape of the memoryless strategy $\eta^p$ for Min, and the fact that we fixed a deterministic strategy $\tau$ for Max, to decompose the expectation $\mathbb{E}^{\eta^p,\tau}_{\rho}$ over the partition. Indeed, notice that the set of plays conforming to $\eta^p$ and $\tau$, from a particular configuration $(\ell, \nu)$, is countable. Moreover, we can associate a probability to each play (instead of a probability to a path). For a finite play $\rho = (\ell_0, \nu_0) \xrightarrow{\delta_0, t_0} \cdots$ conforming to $\eta^p$ and $\tau$, we let

where, for all $i \in \{0, \ldots, k - 1\}$

This definition allows us to recover the probability of a path $\pi$ using the probability of all plays following $\pi$. Then, we obtain easily

We now compute and bound the three expectations $\gamma_{0,\leq K}$, $\gamma_{\mathbb{N},\geq K}$ and $\gamma$. In the following, for a set $\Pi$ of plays, we let $\mathbb{P}^{\eta^p,\tau}_{\ell,\nu}(\Pi) = \sum_{\rho \in \Pi} \mathbb{P}^{\eta^p,\tau}_{\ell,\nu}(\rho)$.
Red zone is such that $\gamma \leq \varepsilon/4$. All plays in $\Pi$ have a length at most $K$, so the cumulated weight of every such play is at most $K w^{e}_{\max}$. So, we have

But all plays $\rho \in \bigcup_{j \leq K} \Pi_{i,j}$ with $i \leq K$ take $i$ transitions of probability $1 - p$. In particular, by bounding all other probabilities by 1, and since there are at most $2^{K}$ plays in $\bigcup_{j \leq K} \Pi_{i,j}$, we obtain (using that $1 - (1 - p)^{K} \leq 1$)

If we suppose that

Yellow and blue zones are such that $\gamma_{0,\leq K} + \gamma_{\mathbb{N},\geq K} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu} + 5\varepsilon/4$. All plays in $\Pi_{0,\leq K}$ reach the target without taking any choice of probability $1 - p$ from $\eta^p_{\Delta}$, so they are conforming to $\sigma_1$. In the case where $\mathsf{dVal}_{\ell,\nu} = -\infty$, $\Pi_{0,\leq K} = \emptyset$ and $\gamma_{0,\leq K} = 0$, since no play conforming to $\sigma_1$ from $(\ell, \nu)$ reaches the target. In this case, Min can stay in a cycle with a negative cumulated weight as long as he wants. Now, if $\mathsf{dVal}_{\ell,\nu}$ is finite, Theorem 11 (see [21, Lemma 19]) allows us to show that the cumulated weight of a play in $\Pi_{0,\leq K}$ is at most $\mathsf{dVal}_{\ell,\nu} + K\varepsilon/K = \mathsf{dVal}_{\ell,\nu} + \varepsilon$, as $\mathsf{dVal}_{\ell,\nu} = \inf_{\sigma \in \mathsf{dStrat}_{\mathsf{Min}}} \mathsf{dVal}^{\sigma}_{\ell,\nu} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu}$. Therefore, in both cases, we can write

Let $\rho$ be a play in $\Pi_{i,j}$ with $0 \leq i$ and $j \geq K$. Since $\eta^p$ only allows cycles in negative SCCs, all region cycles in $\rho$ have a cumulated weight at most $-1$. By definition of $K$ and the proof of Theorem 11, $\mathsf{wt}(\rho) \leq \mathsf{dVal}^{\sigma}_{\ell,\nu} \leq \mathsf{dVal}^{\sigma}_{\ell,\nu} + \varepsilon$. By summing up the contributions of the yellow and blue zones, we get $\gamma_{0,\leq K} + \gamma_{\mathbb{N},\geq K} \leq (\mathsf{dVal}^{\sigma}_{\ell,\nu} + \varepsilon)\, \mathbb{P}^{\eta^p,\tau}_{\ell,\nu}(\Pi_{0,\leq K} \cup \Pi_{\mathbb{N},\geq K})$. We distinguish two cases.
The deterministic strategy $\sigma$ uses the same kind of memory as $\eta$ (in particular, it will be memoryless if $\eta$ is memoryless). However, we want this strategy to be relatively simple to define, independently of the memory of $\eta$. Intuitively, we want to build a switching strategy (as in Section 4) on a game induced by the memory of $\eta$, i.e. a deterministic strategy $\sigma_1$ that uses the memory capabilities of $\eta$, a memoryless deterministic strategy $\sigma_2$ obtained by an attractor in the region game, and a threshold $K$. Strategy $\sigma$ then consists in playing $\sigma_1$ for at most $K$ steps, before switching to strategy $\sigma_2$. The construction of $\sigma_1$ is done in a similar way as in the deterministic case, Min always choosing the best possible candidate according to the choices of $\eta$, thus trying to minimise the immediate reward obtained in one turn. Under this condition, we verify that $\sigma_1$ satisfies some properties similar to the fake-$\varepsilon$-optimality encountered in Section 4. Then, by mimicking the techniques of Theorem 11, we obtain that the switching strategy $\sigma$ obtained from $\sigma_1$ satisfies the desired inequality $\mathsf{dVal}^{\sigma}_{\ell,\nu} \leq \mathsf{Val}^{\eta}_{\ell,\nu} + \varepsilon$.

Conclusion
We have introduced stochastic strategies for WTGs, showing that, in divergent games, Min can use randomisation to emulate memory, and vice versa. We aim at extending our study to more general WTGs. As a first step, we may consider the class of almost-divergent WTGs (adding the possibility for an execution following a region cycle to have weight exactly 0), used in [12, 17] to obtain an approximation schema of the optimal value. We wonder if similar $\varepsilon$-optimal switching strategies may also exist in this context, as they are one of the crucial arguments needed to extend our emulation result. Another question concerns the implementability of the randomised strategies: even if they use no memory, they still need to know the precise current clock valuation. In (non-weighted) timed games, previous work [18] aimed at removing this need for precision, by using stochastic strategies where the delays are chosen with probability distributions that do not require exact knowledge of the clock measurements. In our setting, we aim at further studying the implementability of the randomised strategies of Min in WTGs, e.g. by requiring them to be robust against small imprecisions.

for all $x \in C$, and the valuation $\nu[Y := 0]$ as $(\nu[Y := 0])(x) = 0$ if $x \in Y$, and $(\nu[Y := 0])(x) = \nu(x)$ otherwise. The valuation $0$ assigns 0 to every clock. A (non-diagonal) guard on clocks of $C$ is a conjunction of atomic constraints of the form $x \bowtie c$, where ${\bowtie} \in \{\leq, <, =, >, \geq\}$ and $c \in \mathbb{N}$. A valuation $\nu : C \to \mathbb{R}_{\geq 0}$ satisfies an atomic constraint $x \bowtie c$ if $\nu(x) \bowtie c$. The satisfaction relation is extended to all guards $g$ naturally, and denoted by $\nu \models g$. We let $\mathsf{Guards}(C)$ denote the set of guards over $C$.

▶ Definition 1. A weighted timed game (WTG) is a tuple $G = \langle L_{\mathsf{Min}}, L_{\mathsf{Max}}, L_{T}, \Delta, \mathsf{wt} \rangle$ where $L_{\mathsf{Min}}$, $L_{\mathsf{Max}}$, $L_{T}$ are finite disjoint subsets of Min locations, Max locations, and target locations, respectively (we let $L = L_{\mathsf{Min}} \uplus L_{\mathsf{Max}} \uplus L_{T}$), $\Delta \subseteq L \times \mathsf{Guards}(C) \times 2^{C} \times L$ is a finite set of transitions, and $\mathsf{wt} : \Delta \uplus L \to \mathbb{Z}$ is the weight function.
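The objects of this definition map naturally onto simple data structures, as the following Python sketch suggests. The names (`Transition`, `WTG`, `delay`, `reset`, `satisfies`, `fire`) and the encoding of guards as tuples `(clock, operator, constant)` are hypothetical. The functions `delay` and `fire` assume the usual timed-automata semantics, where a delay $t$ adds $t$ to every clock and a transition is taken if the guard holds after the delay; this semantics is not spelled out in the passage above.

```python
import operator
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple

Valuation = Dict[str, float]     # clock name -> non-negative real value
Atom = Tuple[str, str, int]      # (clock, comparison, constant), e.g. ("x", "<=", 1)
Guard = Tuple[Atom, ...]         # conjunction of atomic constraints

_OPS = {"<=": operator.le, "<": operator.lt, "==": operator.eq,
        ">": operator.gt, ">=": operator.ge}


def delay(nu: Valuation, t: float) -> Valuation:
    """Valuation obtained after letting t time units elapse (assumed semantics)."""
    return {x: v + t for x, v in nu.items()}


def reset(nu: Valuation, clocks: FrozenSet[str]) -> Valuation:
    """The valuation nu[Y := 0]: clocks in Y are set to 0, others unchanged."""
    return {x: (0.0 if x in clocks else v) for x, v in nu.items()}


def satisfies(nu: Valuation, guard: Guard) -> bool:
    """nu |= g: every atomic constraint of the conjunction holds."""
    return all(_OPS[op](nu[x], c) for (x, op, c) in guard)


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Guard
    resets: FrozenSet[str]
    target: str


@dataclass(frozen=True)
class WTG:
    min_locations: FrozenSet[str]
    max_locations: FrozenSet[str]
    target_locations: FrozenSet[str]
    transitions: Tuple[Transition, ...]
    weights: Dict[object, int]   # wt, defined on both transitions and locations


def fire(nu: Valuation, t: float, delta: Transition) -> Optional[Valuation]:
    """Delay by t, check the guard, apply the resets; None if the guard fails."""
    later = delay(nu, t)
    if not satisfies(later, delta.guard):
        return None
    return reset(later, delta.resets)
```

Storing the weight function as a single dictionary keyed by transitions and location names mirrors the signature $\mathsf{wt} : \Delta \uplus L \to \mathbb{Z}$; a real implementation would more likely work on regions or zones than on concrete valuations.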

ICALP 2021, 137:14. Playing Stochastically in Weighted Timed Games to Emulate Memory
