Sliding window property testing for regular languages

We study the problem of recognizing regular languages in a variant of the streaming model of computation, called the sliding window model. In this model, we are given a size of the sliding window $n$ and a stream of symbols. At each time instant, we must decide whether the suffix of length $n$ of the current stream ("the active window") belongs to a given regular language. Recent works showed that the space complexity of an optimal deterministic sliding window algorithm for this problem is either constant, logarithmic or linear in the window size $n$ and provided natural language theoretic characterizations of the space complexity classes. Subsequently, those results were extended to randomized algorithms to show that any such algorithm admits either constant, double logarithmic, logarithmic or linear space complexity. In this work, we make an important step forward and combine the sliding window model with the property testing setting, which results in ultra-efficient algorithms for all regular languages. Informally, a sliding window property tester must accept the active window if it belongs to the language and reject it if it is far from the language. We consider deterministic and randomized sliding window property testers with one-sided and two-sided errors. In particular, we show that for any regular language, there is a deterministic sliding window property tester that uses logarithmic space and a randomized sliding window property tester with two-sided error that uses constant space.


Introduction
Regular expression search constitutes an important part of many search engines for biological data or code, such as Elasticsearch Service. In this paper, we consider the following formalization of this problem. We assume to be given an integer $n$, a regular

Our results
Previous work implies that even simple languages require linear space in the sliding window model, which motivates the search for novel approaches that yield efficient algorithms for all regular languages. We take our inspiration from the property testing model introduced by Goldreich et al. [22]. In this model, the task is to decide whether the input has a particular property $P$, or is "far" from any input satisfying it. For a function $\gamma : \mathbb{N} \to \mathbb{R}_{\geq 0}$, we say that a word $w$ of length $n$ is $\gamma$-far from satisfying $P$ if the Hamming distance between $w$ and any word $w'$ satisfying $P$ is at least $\gamma(n)$. We call the function $\gamma(n)$ the Hamming gap of the tester. The decision must be made by inspecting as few symbols of the input as possible, and the time complexity of the algorithm is defined as the number of inspected symbols. The motivation is that when working with large-scale data, accessing a data item is a very time-expensive operation. The membership problem for a regular language in the property testing model was studied by Alon et al. [2], who showed that for every regular language $L$ and every constant $\epsilon > 0$, there is a property tester with Hamming gap $\gamma(n) = \epsilon n$ for deciding membership in $L$ that makes its decision by inspecting a random constant-size sample of symbols of the input word.
In this work, we introduce a class of algorithms called sliding window property testers. Informally, at each time instant, a sliding window property tester must accept if the active window has the property $P$ and reject if it is far from satisfying $P$. The space complexity of a sliding window property tester is defined to be all the space used, including the space needed to store information about the input. We consider deterministic sliding window property testers and randomized sliding window property testers with one-sided and two-sided errors (for a formal definition, see Section 2). A similar but simpler model of streaming property testers, where the whole stream is considered, was introduced by Feigenbaum et al. [11].
François et al. [12] continued the study of this model in the context of language membership problems and came up with a streaming property tester for visibly pushdown languages that uses polylogarithmic space. Note that deciding membership in a regular language becomes trivial in this model (where the active window is the whole stream): one can simply simulate a deterministic finite automaton on the stream. What makes the sliding window model more difficult is the fact that the oldest symbol in the active window expires in the next step.
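The remark that whole-stream membership is trivial can be illustrated with a minimal sketch. The transition table `DELTA` and the example language (an even number of `a`'s) are hypothetical choices made for illustration only; the memory used is a single state, independent of the stream length.

```python
def stream_membership(delta, start, finals, stream):
    """Yield, after each symbol, whether the stream read so far is accepted.

    delta maps (state, symbol) -> state; only the current state is stored.
    """
    state = start
    for symbol in stream:
        state = delta[(state, symbol)]
        yield state in finals


# Hypothetical example DFA: words over {a, b} with an even number of a's.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

if __name__ == "__main__":
    print(list(stream_membership(DELTA, 0, {0}, "abab")))
```

In the sliding window setting this simulation no longer suffices, because the effect of an expired symbol on the current state cannot be undone from the state alone.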
While at first sight the only connection between property testers and sliding window property testers is that we must accept the input if it satisfies $P$ and reject if it is far from satisfying $P$, there is, in fact, a deeper link. In particular, the above-mentioned result of Alon et al. [2], combined with an optimal sampling algorithm for sliding windows [4], immediately yields an $O(\log n)$-space, two-sided error sliding window property tester with Hamming gap $\gamma(n) = \epsilon n$ for every regular language. We will improve on this observation. Our main contributions are tight complexity bounds for each of the following classes of sliding window property testers for regular languages: deterministic sliding window property testers and randomized sliding window property testers with one-sided and two-sided error.
Deterministic sliding window property testers. We call a language $L$ trivial if for some constant $c > 0$ the following holds: for every word $w \in \Sigma^*$ such that $L$ contains a word of length $|w|$, the Hamming distance from $w$ to $L$ is at most $c$. Every trivial regular language has a constant-space deterministic sliding window property tester with constant Hamming gap (Theorem 4). For generic regular languages, we show a deterministic sliding window property tester with constant Hamming gap that uses $O(\log n)$ space. This is particularly surprising, because for Hamming gap zero (i.e., the exact case) a space lower bound of $\Omega(n)$ for generic regular languages was shown in [16]. In other words, a constant Hamming gap allows an exponential space improvement. We also show that for non-trivial regular languages, $O(\log n)$ space is the best one can hope to achieve, even for Hamming gap $\gamma(n) = \epsilon n$ (Theorem 6).
Randomized sliding window property testers with two-sided error. Next, we show that for every regular language, there is a randomized sliding window property tester with Hamming gap $\gamma(n) = \epsilon n$ and two-sided error that uses constant space (Theorem 7). This is an optimal bound and a considerable improvement compared to the tester that can be obtained by combining the property tester of Alon et al. [2] and an optimal sampling algorithm for sliding windows [4]. Our constant-space tester makes use of a probabilistic counter from [16].
Randomized sliding window property testers with one-sided error. While our randomized sliding window property tester with two-sided error is optimal, we believe that a two-sided error is a very strong relaxation that has to be avoided in some applications. We therefore study the one-sided error randomized setting. The general landscape for this setting is the most complex: in Theorems 8 and 9, we show that for every regular language $L$, the space complexity of an optimal randomized sliding window property tester with one-sided error is either $O(1)$, $O(\log \log n)$, or $O(\log n)$, and we provide language-theoretic characterizations of these space classes.
In order to show our upper bound results, we demonstrate novel combinatorial properties of automata and regular languages and develop new streaming techniques, such as probabilistic counters, which can be of interest on their own. To show the lower bound results, we introduce a new methodology, which could potentially simplify further proofs of lower bounds for string processing tasks in the streaming setting: namely, we view the testers as nondeterministic automata and study their behaviour.

Related work
The results above assume that the regular language admits a constant-space description, and we make the same assumption in this work. Currently, there are few studies of how the complexity of sliding window algorithms depends on the size of the language description. On the negative side, Ganardi et al. [14] showed that there are regular languages for which the space used by any sliding window algorithm that achieves logarithmic space (in the window size) depends exponentially on the automaton size. On the positive side, there is an extensive study of the pattern matching problem and its variants that gives sub-exponential upper bounds for a class of (very simple) regular languages. In this problem, we are given a pattern and a streaming text $T$, and at each moment we must decide if the active window is equal to the pattern. This problem and its generalisations have been studied in [5,6,7,8,9,19,20,21,28,30].
As for regular languages, we can ask whether the current active window belongs to a given context-free language. This question was studied in [3,24,25,26] for the model where the active window is the complete stream and in [13,18] for the sliding window model.

Sliding window property tester
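We fix a finite alphabet $\Sigma$ for the rest of the paper. We denote by $\Sigma^*$ the set of all words over $\Sigma$ and by $\Sigma^n$ the set of words over $\Sigma$ of length $n$. The empty word is denoted by $\lambda$. Let $w$ be a word. We say that $v$ is a prefix (suffix) of $w$ if $w = vx$ ($w = xv$) for some word $x$. We say that $v$ is a factor of $w$ if $w = xvy$ for some words $x, y$. The Hamming distance between two words $u = a_1 \cdots a_n$ and $v = b_1 \cdots b_n$ of equal length is the number of positions at which they differ, $\mathrm{dist}(u, v) = |\{i : a_i \neq b_i\}|$; if $|u| \neq |v|$, we set $\mathrm{dist}(u, v) = \infty$. The distance of a word $u$ to a language $L$ is $\mathrm{dist}(u, L) = \inf\{\mathrm{dist}(u, v) : v \in L\}$.
A deterministic finite automaton (DFA) is a tuple $A = (Q, \Sigma, q_0, \delta, F)$, where $Q$ is a finite set of states, $\Sigma$ is the input alphabet, $q_0$ is the initial state, $\delta : Q \times \Sigma \to Q$ is the transition mapping and $F \subseteq Q$ is the set of final states. We extend $\delta$ to a mapping $\delta : Q \times \Sigma^* \to Q$ inductively in the usual way: $\delta(q, \lambda) = q$ and $\delta(q, aw) = \delta(\delta(q, a), w)$. The language accepted by $A$ is $L(A) = \{w \in \Sigma^* : \delta(q_0, w) \in F\}$. A language is regular if it is accepted by a DFA. For more background in automata theory see [23].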
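For a regular language given by a DFA, the distance of a word to the language can be computed offline by a simple dynamic program over the states. The following is a minimal sketch, assuming the convention that the distance is infinite when the language contains no word of the same length; the table `ONLY_A` (a DFA for the language $a^*$ over $\{a, b\}$) is a hypothetical example and not taken from the paper.

```python
import math


def hamming(u, v):
    """Number of differing positions; infinite for words of unequal length."""
    if len(u) != len(v):
        return math.inf
    return sum(1 for a, b in zip(u, v) if a != b)


def dist_to_language(word, alphabet, delta, start, finals):
    """Minimum Hamming distance from word to a word of equal length in L(A)."""
    # cost[q] = fewest substitutions so that the prefix read so far leads to q
    cost = {start: 0}
    for symbol in word:
        nxt = {}
        for state, c in cost.items():
            for a in alphabet:
                target = delta[(state, a)]
                new_cost = c + (a != symbol)  # pay 1 if we substitute the symbol
                if new_cost < nxt.get(target, math.inf):
                    nxt[target] = new_cost
        cost = nxt
    return min((c for q, c in cost.items() if q in finals), default=math.inf)


# Hypothetical example DFA for a*: state 0 accepts, state 1 is a sink.
ONLY_A = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}

if __name__ == "__main__":
    print(hamming("abba", "aaaa"))
    print(dist_to_language("abba", "ab", ONLY_A, 0, {0}))
```

This offline computation reads the whole word and is only meant to make the definition concrete; it is not a sliding window algorithm.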
A stream is a word $a_1 a_2 \cdots a_m$ over $\Sigma$. A sliding window algorithm is a family $A = (A_n)_{n \geq 0}$ of streaming algorithms. Given a window size $n \in \mathbb{N}$ and an input stream $a_1 a_2 \cdots a_m \in \Sigma^*$, the algorithm $A_n$ reads the stream symbol by symbol from left to right and thereby updates its memory content. After reading a prefix $a_1 \cdots a_t$ ($0 \leq t \leq m$), the algorithm is required to compute an output value that depends on the active window $\mathrm{last}_n(a_1 \cdots a_t) = a_{t-n+1} \cdots a_t$. For convenience, for $i \leq 0$ we define $a_i = \Box$, where $\Box \in \Sigma$ is an arbitrary fixed symbol. In other words, we assume an initial window $\Box^n$ that is active at time $t = 0$. We consider deterministic sliding window algorithms (where every $A_n$ can be viewed as a DFA) and randomized sliding window algorithms (where every $A_n$ can be viewed as a probabilistic finite automaton in the sense of Rabin [29]). In the latter case, $A_n$ updates its memory content in each step according to a probability distribution that depends on the current memory content and the current input symbol. Let $\gamma : \mathbb{N} \to \mathbb{R}_{\geq 0}$ be a function such that $\gamma(n) \leq n$ for all $n \in \mathbb{N}$, let $\alpha, \beta$ be probabilities, and let $L \subseteq \Sigma^*$ be a language.
Definition 1. A deterministic sliding window (property) tester for $L$ with Hamming gap $\gamma(n)$ is a deterministic sliding window algorithm $A = (A_n)_{n \geq 0}$ such that for every input stream $w \in \Sigma^*$ and every window size $n$ the following properties hold: if $\mathrm{last}_n(w) \in L$, then $A_n$ accepts; if $\mathrm{dist}(\mathrm{last}_n(w), L) > \gamma(n)$, then $A_n$ rejects.
Definition 2. A randomized sliding window (property) tester for $L$ with Hamming gap $\gamma(n)$ and error $(\alpha, \beta)$ is a randomized sliding window algorithm $A = (A_n)_{n \geq 0}$ such that for every input stream $w \in \Sigma^*$ and every window size $n$ the following properties hold: if $\mathrm{last}_n(w) \in L$, then $A_n$ accepts with probability at least $1 - \alpha$; if $\mathrm{dist}(\mathrm{last}_n(w), L) > \gamma(n)$, then $A_n$ rejects with probability at least $1 - \beta$. We say that $A$ has one-sided error if $A$ has error $(0, 1/2)$ and two-sided error if $A$ has error $(1/3, 1/3)$.
Notice that our definition is non-uniform since we allow an arbitrary algorithm $A_n$ for each window size $n$. If the window size is not specified, then it is implicitly universally quantified. The space consumption of $A$ is the mapping $s(n)$, where $s(n)$ is the space consumption of $A_n$, i.e., the maximal number of bits stored by $A_n$ while reading any input stream. We can assume that $s(n) \in O(n)$ since $A_n$ can store the active window in $O(n)$ bits. The goal is to devise algorithms which only use $o(n)$ space. Using probability amplification (similar to [16]) one can replace the error probability $1/3$ in the two-sided error setting (resp. $1/2$ in the one-sided error setting) by any probability $p < 1/2$ (resp. $p < 1$). This influences the space complexity only by a constant factor. The case of Hamming gap $\gamma(n) = 0$ corresponds to exact membership testing for $L$, which was studied in [14,15,16]. In this paper, we focus on the two cases $\gamma(n) = c$ for some constant $c > 0$ and $\gamma(n) = \epsilon n$ for some $\epsilon > 0$.
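The $O(n)$ baseline mentioned above can be sketched as follows: store the active window explicitly and rerun a DFA on it after every symbol. This is an exact algorithm (Hamming gap zero) and serves only as a reference point. The function name, the `filler` parameter (playing the role of the arbitrary fixed initial symbol) and the example table `ONLY_A` are illustrative assumptions.

```python
from collections import deque


def naive_sliding_window(n, delta, start, finals, stream, filler="a"):
    """Exact baseline using O(n) space: keep the window, rerun the DFA.

    Yields one answer after each stream symbol. The window initially
    consists of n copies of the filler symbol.
    """
    window = deque([filler] * n, maxlen=n)  # the oldest symbol expires on append
    for symbol in stream:
        window.append(symbol)
        state = start
        for c in window:
            state = delta[(state, c)]
        yield state in finals


# Hypothetical example DFA for a*: state 0 accepts, state 1 is a sink.
ONLY_A = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}

if __name__ == "__main__":
    print(list(naive_sliding_window(2, ONLY_A, 0, {0}, "abaa")))
```

Any tester with sublinear space has to avoid storing the window in this way.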
Before we come to the main results of the paper, we state two simple facts about sliding window testers.
Lemma 3. Assume that $L = \bigcup_{i=1}^{k} L_i$ and that for every $1 \leq i \leq k$ there exists a randomized sliding window tester for $L_i$ with Hamming gap $\gamma(n)$ and error $(\alpha, \beta)$ that uses space $s_i(n)$. Then there exists a randomized sliding window tester for $L$ with Hamming gap $\gamma(n)$ and error $(\alpha, \beta)$ that uses space $O(\sum_{i=1}^{k} s_i(n))$.
The second fact concerns so-called trivial languages. Let $\gamma : \mathbb{N} \to \mathbb{R}_{\geq 0}$ be a mapping with $\gamma(n) \leq n$ for all $n \geq 0$. A language $L \subseteq \Sigma^*$ is $\gamma$-trivial if there exists $n_0 \in \mathbb{N}$ such that for all $n \geq n_0$ with $L \cap \Sigma^n \neq \emptyset$ and all $w \in \Sigma^n$ we have $\mathrm{dist}(w, L) \leq \gamma(n)$. If $\gamma(n) \in O(1)$, we say that $L$ is trivial. Note that Alon et al. [2] call a language $L$ trivial if $L$ is $(\epsilon n)$-trivial for all $\epsilon > 0$ according to our definition. In the long version [17] we show that both definitions coincide for regular languages, but we will not make use of this fact.
Theorem 4. For every trivial (but not necessarily regular) language there is a deterministic sliding window tester with constant Hamming gap that uses constant space. The converse is also true: if for a language $L$ there is a deterministic constant-space sliding window tester with Hamming gap $\gamma(n)$, then there exists a constant $c$ such that $L$ is $(\gamma + c)$-trivial.
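To get a feeling for triviality, one can compute by brute force, for small $n$, the largest distance of any word of length $n$ to the language. The sketch below does this for two hypothetical example DFAs over $\{a, b\}$: words ending in `a` (the distance never exceeds 1, as a trivial language requires) and $a^*$ (the distance grows with $n$, so the language is not trivial). The helper names are illustrative, and the enumeration is exponential in $n$.

```python
import itertools
import math


def dist_to_language(word, alphabet, delta, start, finals):
    """Minimum Hamming distance from word to a word of equal length in L(A)."""
    cost = {start: 0}
    for symbol in word:
        nxt = {}
        for state, c in cost.items():
            for a in alphabet:
                target = delta[(state, a)]
                new_cost = c + (a != symbol)
                if new_cost < nxt.get(target, math.inf):
                    nxt[target] = new_cost
        cost = nxt
    return min((c for q, c in cost.items() if q in finals), default=math.inf)


def max_distance(n, alphabet, delta, start, finals):
    """Largest dist(w, L) over all words w of length n (brute force)."""
    return max(
        dist_to_language(w, alphabet, delta, start, finals)
        for w in itertools.product(alphabet, repeat=n)
    )


# Hypothetical examples: words ending in a (state 1 = last symbol was a), and a*.
ENDS_IN_A = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
ONLY_A = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, max_distance(n, "ab", ENDS_IN_A, 0, {1}),
              max_distance(n, "ab", ONLY_A, 0, {0}))
```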

Main results
Our first main contribution is a deterministic logspace sliding window tester for every regular language, together with a matching lower bound for non-trivial regular languages (as defined above).
Theorem 5. For every regular language $L$, there exists a deterministic sliding window tester for $L$ with constant Hamming gap which uses $O(\log n)$ space.
Theorem 6. For every non-trivial regular language $L$, there exist $\epsilon > 0$ and infinitely many window sizes $n \in \mathbb{N}$ on which every deterministic sliding window tester for $L$ with Hamming gap $\epsilon n$ uses space $\Omega(\log n)$.
ISAAC 2019

Our second main contribution is a constant-space randomized sliding window property tester with two-sided error for any regular language:
Theorem 7. For every regular language $L$ and every $\epsilon > 0$, there exists a randomized sliding window tester for $L$ with two-sided error and Hamming gap $\gamma(n) = \epsilon n$ that uses space $O(1/\epsilon)$.
While the randomized setting with two-sided error allows ultra-efficient testers, we find that allowing a two-sided error is a very strong relaxation. We therefore study the randomized setting with one-sided error. In this setting, only a small class of regular languages admits sliding window testers working in space $o(\log n)$. A language $L \subseteq \Sigma^*$ is suffix-free if $xy \in L$ and $x \neq \lambda$ imply $y \notin L$.
Theorem 8. If $L$ is a finite union of trivial regular languages and suffix-free regular languages, then there exists a randomized sliding window tester for $L$ with one-sided error and constant Hamming gap which uses $O(\log \log n)$ space.
Theorem 9. Let $L$ be a regular language.
1. If $L$ is not a finite union of trivial regular languages and suffix-free regular languages, then there exist $\epsilon > 0$ and infinitely many window sizes $n$ on which every randomized sliding window tester for $L$ with one-sided error and Hamming gap $\epsilon n$ uses space $\Omega(\log n)$.
2. If $L$ is non-trivial, then there exist $\epsilon > 0$ and infinitely many window sizes $n$ on which every randomized sliding window tester for $L$ with one-sided error and Hamming gap $\epsilon n$ uses space $\Omega(\log \log n)$.
We sketch the proofs of Theorems 5, 7, and 8 in Sections 4.1, 4.2, and 4.3, respectively. The proofs of the lower bounds (Theorems 6 and 9) can be found in the long version [17]. We would like to emphasize that the lower bounds shown in [17] are stronger than those stated in Theorems 6 and 9. More precisely, we show space lower bounds for nondeterministic and co-nondeterministic sliding window testers; see [17] for definitions.

Proofs of the upper bounds
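In this section we sketch proofs of Theorems 5, 7, and 8, which give upper bounds for deterministic and (one-sided and two-sided error) randomized sliding window testers. All algorithms in this section satisfy the stronger property that words with large prefix distance are rejected by the algorithm with high probability (probability one in the deterministic setting). The prefix distance between words $u = a_1 \cdots a_n$ and $v = b_1 \cdots b_n$ is $\mathrm{pdist}(u, v) = \min\{i \in \{0, \dots, n\} : a_{i+1} \cdots a_n = b_{i+1} \cdots b_n\}$, so that $\mathrm{dist}(u, v) \leq \mathrm{pdist}(u, v)$. We extend the definition to languages: for a language $L$, let $\mathrm{pdist}(u, L) = \min\{\mathrm{pdist}(u, v) : v \in L\}$. The prefix distance between two runs $\pi = (q_0, a_1, \dots, q_{n-1}, a_n, q_n)$ and $\rho$ of the same length is defined analogously.
For our upper bound proofs it is convenient to work with DFAs which read the input word from right to left. A right-deterministic finite automaton (rDFA) is a tuple $B = (Q, \Sigma, F, \delta, q_0)$, where $Q$, $\Sigma$, $q_0$ and $F$ are as in a DFA, and $\delta : \Sigma \times Q \to Q$ is the transition function. We extend $\delta$ to a mapping $\delta : \Sigma^* \times Q \to Q$ analogously to DFAs: $\delta(\lambda, q) = q$ and $\delta(wa, q) = \delta(w, \delta(a, q))$. The regular language recognized by the rDFA $B$ is $L(B) = \{w \in \Sigma^* : \delta(w, q_0) \in F\}$. A run of $B$ on a word $w = a_n \cdots a_1$ from $p_0$ is a sequence $\pi = (p_n, a_n, p_{n-1}, \dots, a_1, p_0)$ with $p_i = \delta(a_i, p_{i-1})$ for $1 \leq i \leq n$. The length of $\pi$ is $|\pi| = n$. We visualize $\pi$ in the form $p_n \xleftarrow{a_n} p_{n-1} \cdots p_1 \xleftarrow{a_1} p_0$. If $p_n \in F$, then $\pi$ is an accepting run. A run of length 1 is a transition. If $\pi$ is a run from $p$ to $q$ on a word $v$, and $\rho$ is a run from $q$ to $r$ on a word $u$, then $\rho\pi$ denotes the unique run from $p$ to $r$ on $uv$. We denote by $\pi_{w,q}$ the unique run on $w$ from $q$.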
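The right-to-left reading direction and the prefix distance can be made concrete with a short sketch. It assumes that the prefix distance of two equal-length words is the length of the shortest prefix outside which they agree; if the intended definition differs, `pdist` below should be adapted. The rDFA `STARTS_WITH_A` is a hypothetical example whose state records the most recently read (that is, leftmost so far) symbol.

```python
def run_right_to_left(delta, word, start):
    """States [p_0, p_1, ..., p_n] of the rDFA run on word = a_n ... a_1.

    delta maps (symbol, state) -> state; the word is consumed from its
    last symbol to its first.
    """
    states = [start]
    for symbol in reversed(word):
        states.append(delta[(symbol, states[-1])])
    return states


def pdist(u, v):
    """Least i such that u and v agree after dropping their first i symbols."""
    if len(u) != len(v):
        raise ValueError("prefix distance is only computed for equal lengths")
    i = len(u)
    while i > 0 and u[i - 1] == v[i - 1]:
        i -= 1
    return i


# Hypothetical rDFA over {a, b} for words whose first symbol is a.
STARTS_WITH_A = {("a", 0): 1, ("a", 1): 1, ("b", 0): 0, ("b", 1): 0}

if __name__ == "__main__":
    print(run_right_to_left(STARTS_WITH_A, "abb", 0))
    print(pdist("abba", "bbba"), pdist("abba", "abbb"))
```

A mismatch in the last position forces the maximal prefix distance even though the Hamming distance is 1, which shows why rejecting words of large prefix distance is the stronger requirement.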
Strongly connected graphs. With a DFA $A = (Q, \Sigma, q_0, \delta, F)$ we associate the directed graph $(Q, E)$ with edge set $E = \{(p, \delta(p, a)) \mid p \in Q, a \in \Sigma\}$. Similarly, with an rDFA $A = (Q, \Sigma, F, \delta, q_0)$ we associate the directed graph $(Q, E)$ with edge set $E = \{(p, \delta(a, p)) \mid p \in Q, a \in \Sigma\}$. Let $A$ be a DFA or an rDFA. Two states $p, q$ in $A$ are strongly connected if there exists a path in $(Q, E)$ from $p$ to $q$, and vice versa. The strongly connected components (SCCs) of $A$ with state set $Q$ are the maximal subsets $C \subseteq Q$ in which all states $p, q \in C$ are strongly connected. A state $q \in Q$ is transient if there exists no nonempty path from $q$ to $q$. An SCC $C$ is transient if it only contains a single transient state. There is a natural partial order on the SCCs, called the SCC-ordering, where the SCC $C_1$ is smaller than the SCC $C_2$ if there exists a path in $(Q, E)$ from a state in $C_1$ to a state in $C_2$.
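For the constant-size automata considered here, SCCs and transient states can be computed directly from reachability. The following is a minimal sketch using plain depth-first search rather than a linear-time SCC algorithm; the DFA table `DELTA` is a hypothetical example.

```python
def reachable(edges, source):
    """States reachable from source by a nonempty path."""
    seen, stack = set(), [source]
    while stack:
        u = stack.pop()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sccs_and_transient(states, edges):
    """Return (list of SCCs as frozensets, set of transient states)."""
    plus = {q: reachable(edges, q) for q in states}
    components, assigned = [], set()
    for p in states:
        if p in assigned:
            continue
        # q is in the SCC of p iff each is reachable from the other
        comp = {p} | {q for q in plus[p] if p in plus[q]}
        components.append(frozenset(comp))
        assigned |= comp
    transient = {q for q in states if q not in plus[q]}
    return components, transient


# Hypothetical DFA on states 0, 1, 2; state 0 is left immediately.
DELTA = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 2, (1, "b"): 1,
         (2, "a"): 1, (2, "b"): 2}
EDGES = {}
for (p, _symbol), q in DELTA.items():
    EDGES.setdefault(p, set()).add(q)

if __name__ == "__main__":
    print(sccs_and_transient([0, 1, 2], EDGES))
```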
The following combinatorial result from [2] will be used in this paper. Consider a directed graph $G = (V, E)$. The period of $G$ is the greatest common divisor of all cycle lengths in $G$.
If $G$ is acyclic we define the period to be $\infty$.
Lemma 10 (cf. [2]). Let $G = (V, E)$ be a strongly connected directed graph with $E \neq \emptyset$ and finite period $g$. Then there exist a partition $V = \bigcup_{i=0}^{g-1} V_i$ and a constant $m(G) \leq 3|V|^2$ with the following properties:
1. For every $0 \leq i, j \leq g - 1$ and for every $u \in V_i$, $v \in V_j$, the length of every directed path from $u$ to $v$ in $G$ is congruent to $j - i$ modulo $g$.
2. For every $0 \leq i, j \leq g - 1$, for every $u \in V_i$, $v \in V_j$ and every integer $r \geq m(G)$, if $r$ is congruent to $j - i$ modulo $g$, then there exists a directed path from $u$ to $v$ in $G$ of length $r$.
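The period and a partition as in Lemma 10 can be computed with a standard breadth-first search argument, which is not taken from the paper: in a strongly connected graph with BFS levels $d(\cdot)$ from an arbitrary root, the period is the gcd of $d(u) + 1 - d(v)$ over all edges $(u, v)$, and the classes are given by the levels modulo the period. The sketch below assumes a strongly connected input with at least one edge; the graph `EDGES` is a hypothetical example.

```python
import math
from collections import deque


def period_and_partition(vertices, edges, root):
    """Period g and classes V_0, ..., V_{g-1} of a strongly connected graph."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in edges[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    g = 0
    for u in vertices:
        for v in edges[u]:
            g = math.gcd(g, level[u] + 1 - level[v])
    classes = [set() for _ in range(g)]
    for v in vertices:
        classes[level[v] % g].add(v)
    return g, classes


def shift(classes, u, v):
    """(j - i) mod g for u in V_i and v in V_j."""
    index = {x: i for i, cls in enumerate(classes) for x in cls}
    return (index[v] - index[u]) % len(classes)


# Hypothetical example: a 4-cycle 0 -> 1 -> 2 -> 3 -> 0 plus the edge 1 -> 0.
EDGES = {0: [1], 1: [2, 0], 2: [3], 3: [0]}

if __name__ == "__main__":
    g, classes = period_and_partition(range(4), EDGES, 0)
    print(g, classes, shift(classes, 0, 1))
```

In the example the cycle lengths 2 and 4 give period 2, and every path between two vertices of the same class has even length.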
If $G = (V, E)$ is strongly connected with $E \neq \emptyset$ and finite period $g$, and $V_0, \dots, V_{g-1}$ satisfy the properties from Lemma 10, then we define the shift from $u \in V_i$ to $v \in V_j$ by
$$\mathrm{shift}(u, v) = (j - i) \bmod g \in \{0, \dots, g - 1\}. \qquad (1)$$
Notice that this definition is independent of the partition $\bigcup_{i=0}^{g-1} V_i$ since any path from $u$ to $v$ has length $\equiv \mathrm{shift}(u, v) \pmod{g}$ by Lemma 10. Also note that $\mathrm{shift}(u, v) + \mathrm{shift}(v, u) \equiv 0 \pmod{g}$. In the following let $g(C)$ denote the period of the SCC $C$.
Lemma 11. For every regular language $L$ there exists an rDFA $A$ for $L$ and a number $g$ such that every non-transient SCC $C$ in $A$ has period $g(C) = g$.
Path summaries. We start by recalling the notion of a path summary from [14], where it was used in order to prove a logspace upper bound for regular left-ideals (in the exact setting where the Hamming gap is zero). For the rest of Section 4 we fix a regular language $L \subseteq \Sigma^*$ and an rDFA $B = (Q, \Sigma, F, \delta, q_0)$ which recognizes $L$. By Lemma 11, we can assume that every non-transient SCC $C$ of $B$ has period $g(C) = g$. Consider a run $\pi = (p_n, a_n, \dots, a_1, p_0)$ on $x = a_n \cdots a_1$. If all states $p_n, \dots, p_0$ are contained in a single SCC, we call $\pi$ internal.

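We can decompose
$$\pi = \pi_m \tau_{m-1} \pi_{m-1} \cdots \tau_1 \pi_1,$$
where each $\pi_i$ is a possibly empty internal run and each $\tau_i$ is a single transition connecting two distinct SCCs. We call this unique factorization the SCC-factorization of $\pi$, which is illustrated in Figure 1. The path summary of $\pi$ is
$$\mathrm{ps}(\pi) = (\ell_m, q_m)(\ell_{m-1}, q_{m-1}) \cdots (\ell_1, q_1),$$
where $\ell_m = |\pi_m|$, $\ell_i = |\tau_i \pi_i| = |\pi_i| + 1$ for $1 \leq i < m$, and $q_i$ is the first state in $\pi_i$ ($1 \leq i \leq m$). Note that $m$ is bounded by the constant number of states of $B$. Hence, a path summary can be stored with $O(\log |\pi|)$ bits.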
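A path summary can be read off a run in one pass over its state sequence. The sketch below assumes the convention that each length entry counts an internal run together with the SCC-changing transition that follows it, while the last entry counts only the final internal run; this convention is inferred from the update rule in the proof of Theorem 5. The function name and the map `scc_of` are illustrative.

```python
def path_summary(states, scc_of):
    """Path summary of a run given by its states [p_0, ..., p_n] in reading order.

    Returns [(l_m, q_m), ..., (l_1, q_1)], i.e. the pair for the first
    internal run comes last, matching the notation of the text.
    """
    blocks = [[0, states[0]]]
    for prev, cur in zip(states, states[1:]):
        blocks[-1][0] += 1                 # count the transition prev -> cur
        if scc_of[prev] != scc_of[cur]:    # it left the SCC: start a new block
            blocks.append([0, cur])
    return [tuple(block) for block in reversed(blocks)]


if __name__ == "__main__":
    # Hypothetical run 0 -> 1 -> 2 -> 1 -> 1 where {0} and {1, 2} are the SCCs.
    print(path_summary([0, 1, 2, 1, 1], {0: 0, 1: 1, 2: 1}))
```

The length entries always sum to the length of the run, and the number of pairs is at most the number of SCCs, which is why logarithmically many bits suffice.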

Figure 1: The SCC-factorization of a run.
Periodic acceptance sets. For $a \in \mathbb{N}$ and $X \subseteq \mathbb{N}$ we use the standard notation $X + a = \{a + x : x \in X\}$. For a state $q \in Q$ we define $\mathrm{Acc}(q) = \{n \in \mathbb{N} : \exists w \in \Sigma^n : \delta(w, q) \in F\}$. A set $X \subseteq \mathbb{N}$ is eventually $d$-periodic, where $d \geq 1$ is an integer, if there exists a threshold $t \in \mathbb{N}$ such that for all $x \geq t$ we have $x \in X$ if and only if $x + d \in X$. If $X$ is eventually $d$-periodic for some $d \geq 1$, then $X$ is eventually periodic.
Lemma 12. For every $q \in Q$ the set $\mathrm{Acc}(q)$ is eventually $g$-periodic.
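The acceptance sets can be tabulated up to a bound by a backward layering over word lengths: a state accepts some word of length $k + 1$ exactly if one of its successors accepts some word of length $k$. The following sketch uses hypothetical names and a hypothetical one-letter rDFA for words of even length; it also checks periodicity on the computed range only, which is evidence for, not a proof of, eventual periodicity.

```python
def acceptance_lengths(states, alphabet, delta, finals, bound):
    """acc[q] = lengths k <= bound such that some word of length k is accepted from q."""
    acc = {q: set() for q in states}
    layer = set(finals)  # states accepting some word of length 0
    for k in range(bound + 1):
        for q in layer:
            acc[q].add(k)
        # a state accepts a word of length k + 1 iff some successor is in the layer
        layer = {q for q in states
                 if any(delta[(a, q)] in layer for a in alphabet)}
    return acc


def periodic_on_range(xs, d, t, bound):
    """Check x in xs <=> x + d in xs for all t <= x <= bound - d."""
    return all((x in xs) == (x + d in xs) for x in range(t, bound - d + 1))


# Hypothetical rDFA over {a}: the state is the parity of the length read so far.
PARITY = {("a", 0): 1, ("a", 1): 0}

if __name__ == "__main__":
    acc = acceptance_lengths([0, 1], "a", PARITY, {0}, 6)
    print(acc, periodic_on_range(acc[0], 2, 0, 6))
```

In this example both states lie in one SCC of period 2, and the two acceptance sets differ by a shift of 1 from threshold 1 onwards.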
Two sets $X, Y \subseteq \mathbb{N}$ are equal up to a threshold $t \in \mathbb{N}$, in symbols $X =_t Y$, if for all $x \geq t$ we have $x \in X$ if and only if $x \in Y$. Two sets are almost equal if they are equal up to some threshold.
Lemma 13. Let $C$ be a non-transient SCC in $B$, $p, q \in C$ and $s = \mathrm{shift}(p, q)$. Then $\mathrm{Acc}(p)$ and $\mathrm{Acc}(q) + s$ are almost equal.
Corollary 14. There exists a threshold $t \in \mathbb{N}$ such that
1. $\mathrm{Acc}(q) =_t \mathrm{Acc}(q) + g$ for all $q \in Q$, and
2. $\mathrm{Acc}(p) =_t \mathrm{Acc}(q) + \mathrm{shift}(p, q)$ for all non-transient SCCs $C$ and all $p, q \in C$.
We fix the threshold $t$ from Corollary 14 for the rest of Section 4. The following lemma is the main tool to prove the correctness of our sliding window testers. It states that if a word of length $n$ is accepted from $p$ and $\rho$ is any internal run from $p$ of length at most $n$, then, up to a bounded-length prefix, $\rho$ can be extended to an accepting run of length $n$. Formally, a run $\pi$ $k$-simulates a run $\rho$ if one can factorize $\rho = \rho_1 \rho_2$ and $\pi = \pi' \rho_2$ where $|\rho_1| \leq k$.
Lemma 15. If $\rho$ is an internal run starting from $p$ of length at most $n$ and $n \in \mathrm{Acc}(p)$, then there exists an accepting run $\pi$ from $p$ of length $n$ which $t$-simulates $\rho$.

Deterministic logspace tester
Proof of Theorem 5. Let $n \in \mathbb{N}$ with $n \geq |Q|$ (for $n < |Q|$ we use a trivial streaming algorithm which stores the window explicitly). The algorithm maintains the set $\{\mathrm{ps}(\pi_{w,q}) \mid q \in Q\}$, where $w \in \Sigma^n$ is the active window. Initially this set is $\{\mathrm{ps}(\pi_{w,q}) \mid q \in Q\}$ for $w = \Box^n$. Now suppose $w = av$ for some $a \in \Sigma$ and the next symbol of the stream is $b \in \Sigma$, i.e., the new active window is $vb$. For each transition $q \xleftarrow{b} p$ in $B$ we can compute $\mathrm{ps}(\pi_{vb,p})$ from $\mathrm{ps}(\pi_{av,q})$ as follows. Suppose that $\mathrm{ps}(\pi_{av,q}) = (\ell_m, q_m) \cdots (\ell_1, q_1)$, where $q = q_1$.
1. If $p$ and $q$ belong to the same SCC, then we increment $\ell_1$ by one and replace $q_1$ by $p$, which is now the first state of the run; otherwise we append a new pair $(1, p)$.
2. If $\ell_m > 0$, we decrement $\ell_m$ by one. If $\ell_m = 0$, we remove the pair $(\ell_m, q_m)$ and decrement $\ell_{m-1}$ by one (in this case we must have $m > 1$ and $\ell_{m-1} > 0$).
The obtained path summary is $\mathrm{ps}(\pi_{vb,p})$. This data structure can be stored with $O(\log n)$ bits since it contains $|Q|$ path summaries, each of which can be stored in $O(\log n)$ bits.
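The update step can be sketched and checked against direct recomputation as follows. This is an illustration under stated assumptions, not the authors' implementation: path summaries are lists in the order $(\ell_m, q_m), \dots, (\ell_1, q_1)$, every length entry except $\ell_m$ also counts the SCC-changing transition that follows the internal run, and in the same-SCC case the stored state is replaced by the new start state $p$. The rDFA `DELTA` (words containing a `b`) and all function names are hypothetical.

```python
def direct_summary(word, start, delta, scc_of):
    """Recompute ps(pi_{word,start}) from scratch by running the rDFA right to left."""
    states = [start]
    for symbol in reversed(word):
        states.append(delta[(symbol, states[-1])])
    blocks = [[0, start]]
    for prev, cur in zip(states, states[1:]):
        blocks[-1][0] += 1
        if scc_of[prev] != scc_of[cur]:
            blocks.append([0, cur])
    return [tuple(block) for block in reversed(blocks)]


def update_summary(ps, p, q, scc_of):
    """From ps(pi_{av,q}) compute ps(pi_{vb,p}) for a transition q <-b- p."""
    ps = [list(pair) for pair in ps]
    if scc_of[p] == scc_of[q]:
        ps[-1][0] += 1      # the first internal run grows by one transition
        ps[-1][1] = p       # and now starts in p
    else:
        ps.append([1, p])   # empty internal run at p plus the transition to q
    if ps[0][0] > 0:
        ps[0][0] -= 1       # the expired symbol was read inside the last SCC
    else:
        ps.pop(0)           # the expired symbol was an SCC-changing transition
        ps[0][0] -= 1
    return [tuple(pair) for pair in ps]


def step(summaries, b, states, delta, scc_of):
    """Update the path summaries of all states when symbol b arrives."""
    return {p: update_summary(summaries[delta[(b, p)]], p, delta[(b, p)], scc_of)
            for p in states}


# Hypothetical rDFA for words containing a b: state 1 means a b has been read.
STATES = [0, 1]
DELTA = {("a", 0): 0, ("b", 0): 1, ("a", 1): 1, ("b", 1): 1}
SCC_OF = {0: 0, 1: 1}

if __name__ == "__main__":
    window = "aaa"
    summaries = {q: direct_summary(window, q, DELTA, SCC_OF) for q in STATES}
    for b in "baaab":
        summaries = step(summaries, b, STATES, DELTA, SCC_OF)
        window = window[1:] + b
        print(window, summaries)
```

The incremental version never looks at the window itself; it only adjusts a bounded number of counters per state, which is where the logarithmic space bound comes from.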