Computing the longest common prefix of a context-free language in polynomial time

We present two structural results concerning longest common prefixes of non-empty languages. First, we show that the longest common prefix of the language generated by a context-free grammar of size $N$ equals the longest common prefix of the same grammar where the heights of the derivation trees are bounded by $4N$. Second, we show that each nonempty language $L$ has a representative subset of at most three elements which behaves like $L$ w.r.t. the longest common prefix as well as w.r.t. longest common prefixes of $L$ after unions or concatenations with arbitrary other languages. From that, we conclude that the longest common prefix, and thus the longest common suffix, of a context-free language can be computed in polynomial time.


Introduction
Let Σ denote an alphabet. On the set Σ* of all words over Σ, the prefix relation provides us with a partial ordering ⊑ defined by u ⊑ v iff uu′ = v for some u′ ∈ Σ*. The longest common prefix (lcp for short) of a non-empty set L ⊆ Σ* is then given by the greatest lower bound ⨅L of L w.r.t. this ordering. For two words u, v ∈ Σ*, we also denote this greatest lower bound by u ⊓ v. Our goal is to compute the lcp when the language L is context-free, i.e., generated by a context-free grammar (CFG).

The computation of the lcp arises naturally in various applications of context-free languages. As a first example, recall that the possible (unevaluated) runs of a recursive program can be described by a context-free grammar: the nonterminals of the grammar are then in one-to-one correspondence with the nodes of the interprocedural control-flow graph, and the lcp of the language generated by any nonterminal is the longest common sequence of statements always executed whenever the respective node in the control-flow graph is reached. CFGs are also a popular formalism for specifying which words are well-formed and which can be rejected immediately. Assume that we are given a CFG for the legal outputs of a program. This CFG might be derived from the specification as well as from an abstract interpretation of the program. Then the lcp of this language represents a prefix which can be output already before the program has actually been run. This kind of information is crucial for the construction of normal forms, e.g., of string-producing processors such as linear tree-to-string transducers [1,6]. For these devices, the normal forms have further interesting applications as they allow for simple algorithms to decide equivalence [2] and enable efficient learning [7].
Obviously, the lcp of the context-free language L is a prefix of the shortest word in L. Since the shortest word of a context-free language can be effectively computed, the lcp of L is also effectively computable. The shortest word generated from a context-free grammar G, however, may be of length exponential in the size of G. Therefore, it is an intriguing question whether or not the lcp can be efficiently computed.
Here, we show that the longest common prefix can in fact be computed in polynomial time. As the words the algorithm computes with may be of exponential length, we have to resort to compressed representations of long words by means of straight-line programs (SLPs) [9]. We will rely on algorithms for basic computational problems for straight-line programs as presented, e.g., in [8].
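To make the role of straight-line programs concrete, the following minimal Python sketch represents a word of length 2^41 by 43 rules and reads off its length and individual letters without ever expanding it. The dictionary encoding and the names `slp`, `length` and `char_at` are illustrative assumptions, not the data structures of [8] or [9].

```python
from functools import lru_cache

# Toy SLP (hypothetical encoding): a nonterminal maps either to one letter
# or to a pair of nonterminals whose words are concatenated.
slp = {"A": "a", "B": "b", "X0": ("A", "B")}
for i in range(1, 41):
    slp[f"X{i}"] = (f"X{i-1}", f"X{i-1}")  # X_i derives (ab)^(2^i)


@lru_cache(maxsize=None)
def length(name):
    """Length of the word derived from `name` (recursive, memoized)."""
    rhs = slp[name]
    if isinstance(rhs, str):
        return 1
    left, right = rhs
    return length(left) + length(right)


def char_at(name, pos):
    """Letter at 0-based position `pos`, found by descending a single path."""
    while True:
        rhs = slp[name]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if pos < length(left):
            name = left
        else:
            pos -= length(left)
            name = right
```

Both functions touch each rule only a bounded number of times, which is the kind of behaviour the polynomial-time algorithms on SLPs rely on; comparing two SLP-compressed words for a common prefix needs more machinery than shown here.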
Our method of computing ⨅L is based on two structural results. First, we show in Section 3 that it suffices to consider the finite sublanguage of L consisting of those words for which there is a derivation tree of height at most 4N, with N the number of nonterminals of the CFG for L. This implies that (1) for the proofs we can replace the grammar by an acyclic context-free grammar, and (2) the actual fixpoint iteration will converge within at most 4N iterations. Second, we show in Section 4 that for every non-empty language L there is a subset L′ ⊆ L of at most three elements which is equivalent to L w.r.t. the lcp after arbitrary concatenations with other words. This means that for every word w, the language L′w has the same lcp as Lw.
We illustrate both results by examples. For the first result, i.e. the restriction to derivation trees of bounded height, consider the language generated by the context-free grammar consisting of the following rules over the alphabet Σ = {a, b, c} and the six nonterminals {S, [...]

It is easy to check that here the lcp is already determined by repeating the derivation of X to aabXaaba at most two times, which corresponds to the sublanguage consisting of all words which have a derivation tree of height at most 9.
We remark that the bound of 4N, i.e. 24 for this example, on the height resp. the number of iterations needed to converge is a crude overapproximation based on the pigeon-hole principle, which does not take into account the structure of the grammar. The actual computation of the lcp may thus terminate much earlier. We discuss this in more detail in Example 3.
We show in Section 4 that for every language L there is a unique longest, possibly infinite, word w_L satisfying ⨅(L w_L) = (⨅L) w_L. Based on this observation, every language can be equivalently represented by a sublanguage consisting of at most three words, where these three words have to be chosen such that both the lcp and w_L are preserved. Consider, e.g., the language L₁ = a(ba)*. This language is quasi-periodic, meaning that it is a subset of a language uv* for some words u, v ∈ Σ*. We show that for every quasi-periodic language L we have w_L = v^ω and L is equivalent to any sublanguage containing both the shortest and any other word of L. Thus, L₁ can be represented by the language L′₁ = {a, aba}. As a second example, consider the language L₂ = a + a(bbc* + bbd*). Here, we have w_{L₂} = bb. In order to preserve both the lcp and w_{L₂} we may choose L′₂ = {a, abbc, abbd}. As a last example, consider L₃ = abab + aba(ba)* with ⨅L₃ = aba and w_{L₃} = b. We can choose L′₃ = {aba, abab, ababa} as an equivalent finite sublanguage; note that L″₃ = {aba, abab} is not equivalent to L₃ as it is quasi-periodic with w_{L″₃} = b^ω.

In order to compute the lcp of a given context-free language L we then (implicitly) unfold the given context-free grammar into an acyclic grammar, and compute for every nonterminal of the unfolded grammar an equivalent sublanguage of at most three words, each compressed by means of an SLP, instead of the actual language. From this finite representation of L we can then easily obtain its lcp. Altogether, we arrive at a polynomial time algorithm.
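The three example languages can be sanity-checked by brute force. The sketch below is only an empirical check under stated assumptions: the stars are unrolled at most four times, the appended words w are limited to a small length, and the helper names are hypothetical.

```python
from itertools import product


def lcp(words):
    """Longest common prefix of a non-empty finite collection of strings."""
    words = list(words)
    shortest = min(words, key=len)
    for i, ch in enumerate(shortest):
        if any(w[i] != ch for w in words):
            return shortest[:i]
    return shortest


def same_lcp_after_concat(lang, sub, alphabet, max_len):
    """Check lcp(lang . w) == lcp(sub . w) for all words w up to max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if lcp(x + w for x in lang) != lcp(x + w for x in sub):
                return False
    return True


# Finite truncations of the example languages (each star unrolled 0..4 times).
L1 = {"a" + "ba" * i for i in range(5)}
L2 = {"a"} | {"abb" + "c" * i for i in range(5)} | {"abb" + "d" * i for i in range(5)}
L3 = {"abab"} | {"aba" + "ba" * i for i in range(5)}

ok1 = same_lcp_after_concat(L1, {"a", "aba"}, "ab", 4)
ok2 = same_lcp_after_concat(L2, {"a", "abbc", "abbd"}, "abcd", 3)
ok3 = same_lcp_after_concat(L3, {"aba", "abab", "ababa"}, "ab", 3)
bad3 = same_lcp_after_concat(L3, {"aba", "abab"}, "ab", 3)
```

The two-element candidate for L₃ fails, for instance, on w = bb: appending bb to {aba, abab} gives the lcp ababb, whereas the word ababa of L₃ forces the lcp abab.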
For u, v ∈ Σ* we write u ⊑ v (u ⊏ v) to denote that u is a (strict) prefix of v, i.e. v = uw for some w ∈ Σ* (w ∈ Σ⁺). For L ⊆ Σ* (with L ≠ ∅) its longest common prefix (lcp) ⨅L is given by the greatest lower bound of L w.r.t. this ordering. We simply write u ⊓ v for ⨅{u, v}. Note that for any word w ∈ L there is at least one word α ∈ L s.t. ⨅L = w ⊓ α; we call any such α a witness (w.r.t. w). Note that ⊓ is commutative and associative; concatenation distributes from the left over the lcp (i.e. u(v ⊓ w) = uv ⊓ uw); and the lcp is monotonically decreasing on the union of languages, i.e. ⨅(L ∪ L′) = (⨅L) ⊓ (⨅L′). The lcp of infinite words is defined analogously.
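The definitions of this paragraph translate directly into a few lines of Python. This is a sketch on explicit strings, with hypothetical function names, that folds the binary operation ⊓ over a finite language and lists the witnesses w.r.t. a given word.

```python
def lcp2(u, v):
    """Binary meet: the longest common prefix of u and v."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def lcp(lang):
    """lcp of a non-empty finite language, folding lcp2 (associativity)."""
    it = iter(lang)
    result = next(it)
    for w in it:
        result = lcp2(result, w)
    return result


def witnesses(lang, w):
    """All alpha in lang with lcp(lang) equal to the meet of w and alpha."""
    target = lcp(lang)
    return [alpha for alpha in sorted(lang) if lcp2(w, alpha) == target]


lang = {"abab", "abaa", "abc"}
```

For this language the lcp is ab; the only witness w.r.t. abab is abc, while w.r.t. abc both other words are witnesses.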
A word p ∈ Σ* is called a power of a word q if p ∈ q*; then q is called a root of p. If p ≠ ε is its own shortest root, it is called primitive. We recall two well-known results:

Lemma 1 (Commutative Words, [3]). Let u, v ∈ Σ* be two words. If uv = vu, then u, v ∈ p* for some primitive p ∈ Σ*.

Lemma 2 (Periodicity lemma of Fine and Wilf, [5]). Let u, v ∈ Σ⁺ be two non-empty words. If [...]

Combining these two lemmata yields the following result, which is a useful tool in the proofs to follow: [...]

Proof. Since the bound on |uv ⊓ vu| follows from Lemma 2, we only have to show that uv [...]

Here is a short example for the last corollary: [...] i.e. the bound is sharp. Note that this example also shows that even if uv = vu and ε [...]
We briefly discuss properties of the LCP for very simple regular languages. These will be used several times in the proofs of Section 3 in order to bound the height of the derivation trees we need to consider: [...]

Proof. Since w ⋢ yw, w ≠ ε and y ≠ ε. Assume that w = w ⊓ yw, then by the preceding lemma w = w ⊓ yw = w ⊓ y^ω, i.e. w ⊏ y^ω and thus w ⊑ yw.
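The notions of this section are easy to experiment with. The following sketch (hypothetical helper names, explicit strings, short words over {a, b} only) computes primitive roots, checks the characterisation of commuting words from Lemma 1, and tests the identity w ⊓ yw = w ⊓ y^ω used in the proof above. It is an empirical check on small inputs, not a proof.

```python
from itertools import product


def primitive_root(p):
    """Shortest q with p in q* (p must be non-empty)."""
    i = (p + p).find(p, 1)  # smallest rotation of p that gives p again
    return p[:i]


def lcp2(u, v):
    """Longest common prefix of two strings."""
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return u[:n]


def lcp_with_omega(w, y):
    """Meet of w with the infinite word y^omega, for non-empty y."""
    reps = len(w) // len(y) + 1  # enough copies of y to cover all of w
    return lcp2(w, y * reps)


def words(alphabet, max_len, min_len=0):
    """All words over `alphabet` with length between min_len and max_len."""
    for n in range(min_len, max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


# Lemma 1 on small inputs: u and v commute iff they share their primitive root.
lemma1_ok = all(
    (u + v == v + u) == (primitive_root(u) == primitive_root(v))
    for u in words("ab", 4, 1)
    for v in words("ab", 4, 1)
)

# The meet of w and yw coincides with the meet of w and y^omega.
identity_ok = all(
    lcp2(w, y + w) == lcp_with_omega(w, y)
    for w in words("ab", 4)
    for y in words("ab", 4, 1)
)
```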

LCP of a context-free language
Our main result in this section, Theorem 2, shows that for every context-free language L = L(G) generated by the given CFG G, its lcp ⨅L is equal to the lcp of its finite sublanguage L′ which contains only those words w ∈ L which possess a derivation tree w.r.t. G whose height (considering only nonterminals) is at most four times the number of nonterminals of G. For the main result we require the following, very technical lemma.
[...] x ⊏ xwx̄ ⊓ xywȳx̄ for some (y, ȳ) ∈ {(y₁, ȳ₁), ..., (y_l, ȳ_l)}, then w.r.t. this y there exists some primitive q ∈ Σ* and some k > 0 such that [...]

The proof of the main theorem of this section, Theorem 2, crucially depends on the observation that in the case ⨅L ⊏ xwx̄ ⊓ xywȳx̄, all the words y_i are powers of the same primitive word p, and all that is needed to obtain a witness is one additional power of p resp. its conjugate q (with pw = wq), to which Theorem 1 refers. We give an example in order to clarify the statement of Theorem 1 in the case of l = 2 ∧ y₁y₂ = y₂y₁, which is central to Theorem 2:

Example 2. We write (y, ȳ) for (y₁, ȳ₁) and (z, z̄) for (y₂, ȳ₂), respectively. Let (x, x̄) = (ε, ababaaa) = (ε, qqaaa), (y, ȳ) = (ab, abaab) = (q, qaab), (z, z̄) = (ab, abaac) = (q, qaac), and w = ε with q = ab = y = z. We then have: [...]

So in this example, any word except for xywȳx̄ and xzwz̄x̄ is a witness for the lcp w.r.t. xwx̄. W.r.t. the proof of Theorem 2 it is important that also in general we can pick a witness which is derived using only (y, ȳ) or only (z, z̄) but not both, and that we need to use (y, ȳ) resp. (z, z̄) at most twice in order to get one additional copy of the conjugate q of the primitive root of both y and z.
To give an impression of the proof of Theorem 1, we show the case l = 1. The complete proof of Theorem 1 can be found in the appendix.

Lemma 5. Let L = (x, x̄)(y, ȳ)* w. Then ⨅L = ⨅(x, x̄)(y, ȳ)^{≤2} w. If ⨅L ⊏ xwx̄ ⊓ xywȳx̄, then there is some primitive q and some k > 0 s.t. [...]
Hence, assume that y ≠ ε and ȳ ≠ ε in the following. If w ⋢ yw, then from Lemma 4 we have w ⊓ y*w = w ⊓ yw = w ⊓ y^ω ⊏ w, and thus [...]

So assume that w ⊑ yw from now on. Then there is some ŷ s.t. wŷ = yw. Let q be the primitive root of ŷ s.t. ŷ = q^k (k > 0). Thus for i large enough that |wx̄| ≤ i|y|, we have [...]

We factorize x̄ and ȳ w.r.t. q^ω:
• Let x̄ = q^n q′ x̄′ with x̄ ⊓ q^ω = q^n q′ ⊏ q^{n+1}.
If q n q ′ ⊏ q k+k ′ q, then Thus q k (ȳ ⊓ q ω )q k+k ′ q ⊑ q n q ′ = x ⊓ q ω for the following.
If q ȳ = ȳq, then ȳ = q k ′ , and xywȳ x is a witness as k > 0 We therefore also assume q ȳ = ȳq from here on.If x ⊓ q ω = q n q ′ = q k+k ′ q = q k (ȳ ⊓ q ω ), then xyywȳ ȳx is a witness: Note that we only need at most one additional q (coming from y = q k ), i.e. also L = xwx ⊓ xywq ȳx.It remains the case q k+k ′ q ⊏ q n q ′ , i.e. q k ȳ ⊓ q ω ⊏ x ⊓ q ω .Hence 0 < k ≤ k + k ′ ≤ n.Define φ ⊏ q ω by L = xwφ.From q k+k ′ q ⊏ q n q ′ it follows that xwq k+k ′ q ⊑ xwx ⊓ xy i wȳ i x for all i ∈ N 0 , implying ȳ ⊓ q ω = q k+k ′ q ⊑ φ.
If φ = q k+k ′ q, then we have L = xwx ⊓ xywȳx: from xwq 2k+k ′ q ⊑ xy i wȳ i x for i > 1 and q k+k ′ q ⊏ q n q ′ it follows that xwx ⊓ xy i wȳ i x is a strictly longer prefix of xwq ω than xwq k+k ′ q as k > 0.
As q ȳ = q k ′ q q = q k ′ qq (recall q = q q) also q q = qq and thus Hence q k+k ′ q ⊏ φ ⊏ q k+k ′ +1 q.That is either xywȳ x or xyywȳ ȳ x has to be a witness as we can extend q k+k ′ q by at most |q| − 1 symbols, i.e.only by a strict prefix of q q; but as k > 0, this additional q is again given by xyywȳ ȳx.In particular, we have again that, if xyywȳ ȳx is a witness, then so is xywq ȳ x, i.e. we only require one additional power of q left of ȳ resp.q.
Using Theorem 1, we show:

Theorem 2. Let L = L(G) be given by a context-free grammar G = (Σ, V, P, S) in Chomsky normal form. Let L′ ⊆ L be the finite language of all words of L for which there is a derivation tree w.r.t. G of height at most 4|V|, with V the nonterminals of G. Then ⨅L = ⨅L′.

Proof. Let N be the number of nonterminals of G. Let σ ∈ L be a shortest word, and α ∈ L a shortest word with ⨅L = σ ⊓ α. Set π := ⨅L. We claim that there is at least one such α (for any fixed σ) that has a derivation tree w.r.t. G of height less than 4N (not counting the leaves representing the letters of a word). If σ = α, we are done as σ has a derivation tree of height less than N. So assume σ ≠ α s.t. σ = πaσ′ and α = πbα′ with a ≠ b alphabet symbols, and fix any derivation tree t of α w.r.t. G.
We will show the stronger claim, namely that any path from the root of t to any letter of πb has length at most 3N (i.e. all the paths leading to the separating letter b or a letter left of it, see Figure 1); note that any path that leads to a letter right of b (i.e. into α′) has to enter a subtree of height less than N as soon as it leaves the path leading to b because of the minimality of α. Hence, if all the paths leading to b or a letter left of b have length less than 3N, the longest path in the derivation tree must have length at most 4N. So assume there is a path leading to a letter within πb that has length at least 3N, i.e. consists of at least 3N + 1 nonterminals. Then there is one nonterminal A that occurs at least four times, leading to a factorization α = (x, x̄)(y₁, ȳ₁)(y₂, ȳ₂)(y₃, ȳ₃)w. Note that xx̄ ≠ ε, y_iȳ_i ≠ ε (i = 1, 2, 3), and w ≠ ε as G is in CNF. As this path ends at b or left of it, we have xy₁y₂y₃ ⊑ π, implying that y_iy_j = y_jy_i for all i, j ∈ {1, 2, 3}, so y_i = p^{k_i} for the same primitive p.
Let L′ = (x, x̄)[(y₁, ȳ₁) + (y₂, ȳ₂) + (y₃, ȳ₃)]* w. By construction L′ ⊆ L and thus [...] As xwx̄ is shorter than α, it cannot be a witness, so πa ⊑ xwx̄ and π = xwx̄ ⊓ α. Hence [...] It therefore suffices to consider L′ in the following; in particular, α has to be a witness w.r.t. xwx̄ of minimal length, too. By virtue of Theorem 1: [...] Note that for any i = 1, 2, 3 [...] Set (y, ȳ) := (y_I, ȳ_I) and [...] Applying Theorem 1 to L″ we obtain that [...] for the primitive q satisfying pw = wq. As y_i = p^{k_i} for all i ∈ {1, 2, 3}, also y_iw = wq^{k_i}. As xwq^k ⊑ ⨅L ⊏ xwq^ω, we find some m ≥ 0 and q̂ ⊏ q s.t. π = ⨅L = xwq^k q^m q̂. As xywȳx̄ is not a witness, but [...] Hence q̂b ⊑ q as q^m q̂a ⊑ ȳx̄ and q^m q̂b ⊑ qȳx̄ ⊒ qq^m q̂a. Using ⨅L ⊏ xwq^{k+1}(ȳ ⊓ q^ω), we further obtain q^m q̂b ⊑ q(ȳ ⊓ q^ω).
The explicit formula from Lemma 7 can be used to identify small equivalent sublanguages.
Theorem 3. For every non-empty language L ⊆ Σ* there is a language L′ ⊆ L consisting of at most three words such that L ≡ L′.
Proof. If L is a singleton language, we choose L′ = L. Now assume that L contains at least two words with lcp u. If the lcp u of L is not contained in L, then we choose L′ as consisting of the two minimal words w₁, w₂ so that u = w₁ ⊓ w₂. It remains to consider the case where the lcp u of L is contained in L. Then we have for each word w ∈ Σ*, [...]

If L is quasi-periodic, then all words in L are of the form uv₀^i for some v₀ ∈ Σ⁺ and i ≥ 0, where (v₀^i)^ω = v₀^ω for i > 0. Thus, ⨅(Lw) = u(w ⊓ v^ω) for any uv ∈ L with v ≠ ε. Hence, L ≡ L′ = {u, uv} for any such v.
If L is not quasi-periodic, then we choose words uv₁, uv₂ ∈ L so that the lcp of v₁^ω and v₂^ω has minimal length. Then [...]

Since for any non-empty words w₁, w₂ given by SLPs, an SLP for w₁^ω ⊓ w₂^ω can be computed in polynomial time, we have:

Corollary 2. For every non-empty finite L ⊆ Σ* consisting of words each of which is represented by an SLP, a subset L′ ⊆ L consisting of at most three words can be calculated in polynomial time such that L ≡ L′.
Proof. The proof distinguishes the same cases as in the proof of Theorem 3. If L is a singleton or contains at most three words, we are done. Since the words in L are given as SLPs, we can calculate the lcp u of the words in L. Next, we determine whether u is contained in L. This can again be checked in polynomial time. If this is not the case, then we can select two words w₁, w₂ ∈ L so that u = w₁ ⊓ w₂, giving us L′ = {w₁, w₂} in polynomial time. So now assume that u is contained in L. Next, we check whether or not L is quasi-periodic, i.e., whether for any non-empty words [...] By the periodicity lemma of Fine and Wilf (see also Corollary 1), this is the case iff v₁v₂ = v₂v₁. The latter can be checked in polynomial time. If this is the case, then we obtain L′ = {u, uv} for some uv ∈ L with v ≠ ε in polynomial time.
It remains to consider the case where the lcp u is contained in L and L is not quasi-periodic. Then we need to determine words uv₁ and uv₂ in L with v₁ ≠ ε ≠ v₂ such that v₁^ω ⊓ v₂^ω has minimal length. Since (again by the periodicity lemma of Fine and Wilf) [...], such a pair can be computed in polynomial time as well. Therefore, L′ = {u, uv₁, uv₂} can be computed in polynomial time.
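The case distinction of Theorem 3 and Corollary 2 can be sketched on explicit Python strings as follows. This is an illustration under assumptions, not the SLP-based procedure of the corollary: words are uncompressed, all names are hypothetical, and the tie-breaking (shortest words first) is an arbitrary choice. The helper `omega_lcp_len` compares prefixes of length |v₁| + |v₂| of the two infinite words, which by the periodicity lemma of Fine and Wilf suffices to decide whether they coincide.

```python
def lcp2(u, v):
    """Longest common prefix of two strings."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def omega_lcp_len(v1, v2):
    """Length of the lcp of v1^omega and v2^omega, or None if they coincide."""
    n = len(v1) + len(v2)
    p1 = (v1 * (n // len(v1) + 1))[:n]
    p2 = (v2 * (n // len(v2) + 1))[:n]
    k = len(lcp2(p1, p2))
    return None if k == n else k


def reduce_to_three(lang):
    """Return a subset of at most three words following the cases of Theorem 3."""
    words = sorted(set(lang), key=lambda w: (len(w), w))
    if len(words) == 1:
        return set(words)
    u = words[0]
    for w in words[1:]:
        u = lcp2(u, w)
    if u not in words:
        # The lcp is not a word of the language: two words meeting in u suffice.
        w1 = words[0]
        w2 = next(w for w in words if lcp2(w1, w) == u)
        return {w1, w2}
    tails = [w[len(u):] for w in words if w != u]  # the non-empty v with uv in lang
    v0 = tails[0]
    best, best_len = None, None
    for v in tails[1:]:
        k = omega_lcp_len(v0, v)
        if k is not None and (best_len is None or k < best_len):
            best, best_len = v, k
    if best is None:
        return {u, u + v0}  # quasi-periodic: all v^omega coincide
    return {u, u + v0, u + best}
```

Fixing v0 and searching only for its best partner relies on the witness property of the lcp: for any fixed word of a set there is a second word such that the two already meet in the lcp of the whole set. On the truncation {aba, abab, ababa, abababa} of L₃ from the introduction the sketch returns {aba, abab, ababa}.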
The following lemma states that equivalence of two non-empty languages of cardinality at most 3 can be decided in polynomial time.

Lemma 8. Let L₁, L₂ ⊆ Σ* denote non-empty languages consisting of at most three words each, which are all given by SLPs. Then it can be decided in polynomial time whether or not L₁ ≡ L₂.

Proof. If one of the two languages contains just a single word, then L₁ ≡ L₂ iff L₁ = L₂, which can be decided in polynomial time. Otherwise, we first compute the lcps of L₁ and L₂, respectively. If these differ, then L₁ cannot be equivalent to L₂. Therefore assume now that u is the lcp of both L₁ and L₂. Now assume that for all words [...] In this case, both L₁ and L₂ are quasi-periodic with the same period and thus equivalent. Again according to the periodicity lemma of Fine and Wilf, this can be checked in polynomial time.

Next assume that neither L₁ nor L₂ is quasi-periodic, i.e., there are uv_i, uv [...] Since w₁, w₂ can be computed in polynomial time, the result follows in this case as well.

Finally, when none of the listed cases applies, L₁ is necessarily inequivalent to L₂. Thus, we ultimately arrive at a polynomial time decision procedure for equivalence.
Remark 1. Note that in light of the equivalence test, we can choose distinct letters a, b ∈ Σ, and equivalently replace the language L₁ = {uv₁, uv₂} with L′₁ = {ua, ub} whenever v₁ ≠ ε ≠ v₂ and v₁ ⊓ v₂ = ε, and the language L₂ = {u, uv₁, uv₂} by the language L′₂ = {u, uwa, uwb} [...] This kind of reduced representation may allow the use of shorter words.

Now we have all prerequisites to prove the main theorem of our paper.

Theorem 4. Assume that G is a context-free grammar with L = L(G) non-empty. Then the longest common prefix of L can be calculated in polynomial time.
Proof. Assume w.l.o.g. that G is a CFG in Chomsky normal form. Then we calculate ⨅L(G) as follows.
We build (implicitly, see the following remark) an acyclic CFG Ĝ in polynomial time such that L(Ĝ) consists of all words of L(G) for which there is a derivation tree of height at most 4N, where N is the number of nonterminals in G. To this end, for every rewriting rule A → BC of G and every i ∈ {1, ..., 4N} we add to Ĝ the rule A^(i) → B^(i−1) C^(i−1), and for every rule A → a of G and every i ∈ {0, ..., 4N} we add the rule A^(i) → a to Ĝ. A straightforward induction on i shows that the derivation trees rooted at A^(i) are isomorphic to the derivation trees rooted at A of height at most i. For more details, see e.g. [4].
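The unfolding can be sketched in a few lines of Python. The rule encoding and the function names are hypothetical; note one assumption made explicit here: the terminal rules A^(i) → a are added at every level i, so that A^(i) generates the words with a derivation tree of height at most i rather than exactly i.

```python
def unfold(binary_rules, terminal_rules, max_height):
    """Rules of the acyclic grammar with nonterminals (A, i), 0 <= i <= max_height.

    binary_rules:   list of (A, B, C) for rules A -> B C
    terminal_rules: list of (A, a)    for rules A -> a
    """
    rules = []
    for i in range(max_height + 1):
        for (A, a) in terminal_rules:
            rules.append(((A, i), (a,)))  # assumption: available at every level
        if i > 0:
            for (A, B, C) in binary_rules:
                rules.append(((A, i), ((B, i - 1), (C, i - 1))))
    return rules


def language(rules, start):
    """Finite language of an indexed nonterminal (exponential; for tiny tests only)."""
    result = set()
    for lhs, rhs in rules:
        if lhs != start:
            continue
        if len(rhs) == 1:
            result.add(rhs[0])
        else:
            left = language(rules, rhs[0])
            right = language(rules, rhs[1])
            result |= {l + r for l in left for r in right}
    return result


# Hypothetical toy grammar in CNF: S -> S S | a
unfolded = unfold([("S", "S", "S")], [("S", "a")], 2)
```

The unfolded grammar has (max_height + 1) copies of every nonterminal, i.e. N(4N + 1) nonterminals for max_height = 4N, which is polynomial in the size of G. For the toy grammar, level 2 generates exactly a, aa, aaa and aaaa.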
By Theorem 2, we know that ⨅L(G) = ⨅L(Ĝ). By construction, Ĝ is also in Chomsky normal form. For i from 0 to (at most) 4N, we then compute in every iteration for every nonterminal A^(i) first the language [...] By induction on i, we may assume that the languages [B^(i−1)], [C^(i−1)] (a) have already been computed, (b) consist of at most three words, and (c) every word is given as an SLP. Note that the cardinality of every language [A^(i)]′ is polynomial in the size of G. By virtue of Corollary 2, we can therefore reduce [A [...] Since Ĝ has only polynomially many nonterminals, the overall algorithm runs in polynomial time.
Remark 2. Note that we can drop the assumption that the grammars G and likewise Ĝ are in Chomsky normal form if the right-hand sides of all rules have bounded lengths. Then the cardinalities of the languages [A^(i)]′ are still polynomial. Further, instead of spelling out the grammar Ĝ explicitly, we may perform a round robin fixpoint iteration where in every round we first compute [...] We demonstrate this simplified version of the algorithm described in Theorem 4 by an example.

Example 3. Consider the grammar G with the following rules: [...]

The round robin fixpoint iteration would proceed by iteratively evaluating the equations [...] and recomputing the languages [A] and [S] so that [A] ≡ [A]′ and [S] ≡ [S]′ and both [A] and [S] consist of at most three words, where we further reduce the words of [A] and [S] as described in the remark following Lemma 8. As [A] does not depend on [S], we can postpone the computation of [S] until after [A] has converged. In the first round, we have: [...] For the second round, we first calculate: [...] So ⨅L = (ab)^3 ⊓ (ab)^2 aa = (ab)^2 a.
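The round robin iteration can be sketched end to end on explicit strings. Everything below is an illustration under assumptions: the grammar is a hypothetical CNF grammar for (ab)⁺(c + d), not the grammar of Example 3; words are plain Python strings rather than SLPs; the iteration simply runs the 4N rounds guaranteed by Theorem 2 instead of testing convergence; and a compact three-word reduction is included so that the block runs standalone.

```python
def lcp2(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def lcp(words):
    words = list(words)
    result = words[0]
    for w in words[1:]:
        result = lcp2(result, w)
    return result


def omega_lcp_len(v1, v2):
    """Length of the lcp of v1^omega and v2^omega, or None if they coincide."""
    n = len(v1) + len(v2)  # by Fine and Wilf, agreeing this far means equality
    p1 = (v1 * (n // len(v1) + 1))[:n]
    p2 = (v2 * (n // len(v2) + 1))[:n]
    k = len(lcp2(p1, p2))
    return None if k == n else k


def reduce_to_three(lang):
    """Equivalent subset of at most three words (cases of Theorem 3)."""
    words = sorted(set(lang), key=lambda w: (len(w), w))
    if len(words) == 1:
        return set(words)
    u = lcp(words)
    if u not in words:
        return {words[0], next(w for w in words if lcp2(words[0], w) == u)}
    tails = [w[len(u):] for w in words if w != u]
    best, best_len = None, None
    for v in tails[1:]:
        k = omega_lcp_len(tails[0], v)
        if k is not None and (best_len is None or k < best_len):
            best, best_len = v, k
    return {u, u + tails[0]} | ({u + best} if best is not None else set())


def lcp_of_cfg(rules, start, rounds):
    """rules: nonterminal -> list of right-hand sides (tuples of symbols);
    symbols that are not keys of `rules` are terminals."""
    rep = {A: set() for A in rules}  # current representative of each [A]
    for _ in range(rounds):
        new = {}
        for A, rhss in rules.items():
            words = set()
            for rhs in rhss:
                partial = {""}
                for sym in rhs:
                    choices = rep[sym] if sym in rules else {sym}
                    partial = {p + c for p in partial for c in choices}
                words |= partial
            new[A] = reduce_to_three(words) if words else set()
        rep = new
    return lcp(rep[start]) if rep[start] else None


# Hypothetical CNF grammar: L(A) = (ab)^+, L(S) = (ab)^+ (c + d).
rules = {
    "S": [("A", "C"), ("A", "D")],
    "A": [("P", "A"), ("U", "V")],
    "P": [("U", "V")],
    "U": [("a",)], "V": [("b",)], "C": [("c",)], "D": [("d",)],
}
result = lcp_of_cfg(rules, "S", 4 * len(rules))
```

For this grammar the representatives stabilise after a handful of rounds at [A] = {ab, abab} and [S] = {abc, abd}, giving the lcp ab. With explicit strings the words may still grow exponentially on other grammars, which is exactly what the SLP representation avoids.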

Conclusion
We have shown that the longest common prefix of a non-empty context-free language can be computed in polynomial time. This result was based on two structural results, namely, first that it suffices to consider words with derivation trees of bounded height, and second that each non-empty language is equivalent to a sublanguage consisting of at most three elements. For the actual algorithm, we relied on succinct representations of long words by means of SLPs. It remains an intriguing open question whether the presented method can be generalized to more expressive grammar formalisms.

[...] Let w = p^m p′ w′ with p = p′p″ and w ⊓ p^ω = w ⊓ p^{m+1} = p^m p′ ⊏ p^{m+1}. Set q = p″p′ s.t. pp′ = p′q and q ⊓ w′ = ε and p″ ≠ ε. As p is primitive, so is q.
If w ⊏ p^ω does not hold, then w′ ≠ ε. If y ≠ ε, then xywȳx̄ is a witness; if z ≠ ε, then xzwz̄x̄ is a witness, too.
Hence, w ⊏ p^ω in the following. Then w′ = ε and pw = wq. We factorize x̄, ȳ, z̄ w.r.t. q: [...]

[...] at most 4|V| with V the nonterminals of G. Then: ⨅L = ⨅L′. Proof. Let L = L(G) with G a CFG in CNF. Let N be the number of nonterminals of G. Let σ ∈ L be a shortest word, and α ∈ L a shortest word with ⨅L = σ ⊓ α. Set π := ⨅L.

Figure 1: Factorization of a witness α = (x, x̄)(y₁, ȳ₁)(y₂, ȳ₂)(y₃, ȳ₃)w = πbα′ w.r.t. a nonterminal A occurring at least four times along a path (here the dashed path) in a derivation tree of α leading to a letter either within the LCP π = ⨅L or to the letter, here b (the leaf of the dotted path), that bounds the LCP.
Theorem 2 guarantees that the lcp is attained after at most 4N iterations.Using standard approaches like work lists, we only need to recompute [A] if there is some rule A → γBδ in G and [B] has changed since the last recomputation of [A].As shown in Lemma 8 we can easily check if [B] ≡ [B] ′ in every round and accordingly insert A into the work list.