Approximating Longest Common Substring with k mismatches: Theory and Practice

In the problem of the longest common substring with k mismatches we are given two strings X


Introduction
For decades, the edit distance and its variants remained the most relevant measure of similarity between biological sequences. However, there is strong evidence that the edit distance cannot be computed in strongly subquadratic time [7]. One possible approach to overcoming the quadratic time barrier is computing the edit distance approximately, and last year, in a breakthrough paper, Chakraborty et al. [8] showed a constant-factor approximation algorithm that computes the edit distance between two strings of length $n$ in time $\tilde{O}(n^{2-2/7})$. Nevertheless, the algorithm is highly non-trivial and is therefore likely to be impractical.

16:2 Approximating Longest Common Substring with k Mismatches
A different approach is to consider alignment-free measures of similarity. Ideally, we want the measure to be robust and simple enough so that we could compute it efficiently. One candidate for such a measure is the length of the longest common substring with $k$ mismatches. Formally, given two strings $X, Y$ of lengths at most $n$ and an integer $k$, we want to find the maximal length $\mathrm{LCS}_k(X, Y)$ of a substring of $X$ that occurs in $Y$ with at most $k$ mismatches. Computing this value constitutes the LCS with k Mismatches problem.
The LCS with k Mismatches problem was first considered for $k = 1$ [6,13], with the current best algorithm taking $O(n \log n)$ time and $O(n)$ space. The first algorithm for the general value of $k$ was shown by Flouri et al. [13]. Their simple approach used quadratic time and linear space. Grabowski [15] focused on a data-dependent approach, namely, he showed two linear-space algorithms with running times $O(n((k+1)(\mathrm{LCS}+1))^k)$ and $O(n^2 k / \mathrm{LCS}_k)$, where $\mathrm{LCS}$ is the length of the longest common substring of $X$ and $Y$ and $\mathrm{LCS}_k$, similarly to above, is the length of the longest common substring with $k$ mismatches of $X$ and $Y$. Abboud et al. [1] showed a $k^{1.5} n^2 / 2^{\Omega(\sqrt{(\log n)/k})}$-time randomised solution to the problem via the polynomial method. Thankachan et al. [24] presented an $O(n \log^k n)$-time, $O(n)$-space solution for constant $k$. This approach was recently extended by Charalampopoulos et al. [10] to develop an $O(n)$-time and $O(n)$-space algorithm for the case of $\mathrm{LCS}_k = \Omega(\log^{2k+2} n)$.
On the other hand, Kociumaka, Radoszewski, and Starikovskaya [19] showed that there is $k = \Theta(\log n)$ such that the LCS with k Mismatches problem cannot be solved in strongly subquadratic time, even for the binary alphabet, unless the Strong Exponential Time Hypothesis (SETH) of Impagliazzo, Paturi, and Zane [16] is false. This conditional lower bound implies that there is little hope to improve existing solutions to LCS with k Mismatches. To overcome this barrier, they introduced an approximation approach to LCS with k Mismatches, inspired by the work of Andoni and Indyk [4].
Problem 1 (LCS with Approximately k Mismatches). Two strings $X, Y$ of length at most $n$, an integer $k$, and a constant $\varepsilon > 0$ are given. Return a substring of $X$ of length at least $\mathrm{LCS}_k(X, Y)$ that occurs in $Y$ with at most $(1+\varepsilon) \cdot k$ mismatches.

Kociumaka, Radoszewski, and Starikovskaya [19] also showed that for any $\varepsilon \in (0, 2)$ the LCS with Approximately k Mismatches problem can be solved in $O(n^{1+1/(1+\varepsilon)} \log^2 n)$ time and $O(n^{1+1/(1+\varepsilon)})$ space. Besides requiring superlinear space, their solution uses a very complex class of hash functions which requires $n^{4/3+o(1)}$-time preprocessing, and that is the underlying reason for the bounds on $\varepsilon$. In this work, we significantly improve the complexity of the LCS with Approximately k Mismatches problem and show the following results.

Theorem 2. Let $\varepsilon > 0$ be an arbitrary constant. The LCS with Approximately k Mismatches problem can be solved correctly with high probability:
1) In $O(n^{1+1/(1+2\varepsilon)+o(1)})$ time and $O(n^{1+1/(1+2\varepsilon)+o(1)})$ space assuming a constant-size alphabet;
2) In $O(n^{1+1/(1+\varepsilon)} \log^3 n)$ time and $O(n)$ space for alphabets of arbitrary size.
Our first solution uses the Approximate Nearest Neighbour data structure [5] as a black box. The definition of this data structure is extremely involved, and we view this result as being mainly of theoretical interest. On the other hand, our second solution is simple and practical, which we confirm by experimental evaluation (see Section 4 for details).
As a final remark, we note that a construction similar to the one used to show a lower bound for the LCS with k Mismatches problem [19] gives a lower bound for LCS with Approximately k Mismatches. A proof of the following fact can be found in Section 5.
Dimension reduction. We will exploit a computationally efficient variant of the Johnson-Lindenstrauss lemma [17], which describes a low-distortion embedding from a high-dimensional Euclidean space into a low-dimensional one. Let $\|\cdot\|$ be the Euclidean ($L_2$) norm of a vector. We will exploit the following claim, which follows immediately from [2, Theorem 1.1]:

Lemma 4. Let $P$ be a set of $n$ vectors in $\mathbb{R}^{\ell}$, where $\ell \le n$. Given $\alpha = \alpha(n) > 0$ and a constant $\beta > 0$, there is $d = \Theta(\alpha^{-2} \log n)$ and a scalar $c > 0$ such that the following holds. Let $M$ be a $d \times \ell$ matrix filled with i.u.d. $\pm 1$ random variables. For all

Since the Hamming distance between binary strings $U, V$ is equal to $\|U - V\|^2$, the matrix $M$ defines a low-distortion embedding from an $\ell$-dimensional into a $d$-dimensional Hamming space as well. For non-binary strings, an extra step is required. Let the alphabet be $\Sigma = \{1, 2, \ldots, \sigma\}$ and consider a morphism $\mu : \Sigma \to \{0, 1\}^{\sigma}$, where $\mu(a) = 0^{a-1} 1 0^{\sigma-a}$ for all $a \in \Sigma$. We extend $\mu$ to strings in a natural way. Note that for two strings $U, V$ over the alphabet $\Sigma$ the Hamming distance between $\mu(U), \mu(V)$ is exactly twice the Hamming distance between $U, V$. We therefore obtain:

Footnote 1. Here $\delta$ is a function of $\varepsilon$ for which the explicit form is not known (a condition inherited from [22]).
CPM 2020

Corollary 5. Let $P$ be a set of $n$ strings in $\Sigma^{\ell}$, where $\ell \le n$. Given $\alpha = \alpha(n) > 0$ and a constant $\beta > 0$, there is $d = \Theta(\alpha^{-2} \log n)$ and a scalar $c > 0$ such that the following holds. Let $M$ be a $d \times (\sigma \cdot \ell)$ matrix filled with i.u.d. $\pm 1$ random variables. For all $U \in P$, define

We will use the corollary for dimension reduction, and also to design a simple test that checks whether the Hamming distance between two strings is at most $k$.

Corollary 6. Let $P$ be a set of $n$ strings in $\Sigma^{\ell}$, where $\ell \le n$. With probability at least $1 - n^{-\beta}$, for all $U, V \in P$:
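The encoding and the random $\pm 1$ projection described in this section can be illustrated with a small Python sketch. It is only an illustration under our own assumptions: the function names (`mu`, `hamming`, `sketch_distance`) are hypothetical, symbols are integers in $\{1, \ldots, \sigma\}$, and the normalisation by $2d$ is our choice (it makes the expected value equal to the Hamming distance), not the scalar $c$ of Corollary 5.

```python
import random

def mu(s, sigma):
    """One-hot encode a string over {1..sigma}: mu(a) = 0^(a-1) 1 0^(sigma-a)."""
    bits = []
    for a in s:
        bits.extend(1 if j == a else 0 for j in range(1, sigma + 1))
    return bits

def hamming(u, v):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(u, v))

def sketch_distance(u, v, sigma, d, rng):
    """Estimate the Hamming distance of u and v from d random +-1 projections.

    Each projection row r satisfies E[(r.(mu(u) - mu(v)))^2] = 2 * hamming(u, v),
    so we average the squared differences and divide by 2 (our normalisation).
    """
    xu, xv = mu(u, sigma), mu(v, sigma)
    total = 0
    for _ in range(d):
        row = [rng.choice((-1, 1)) for _ in range(len(xu))]
        cu = sum(m * b for m, b in zip(row, xu))
        cv = sum(m * b for m, b in zip(row, xv))
        total += (cu - cv) ** 2
    return total / (2 * d)
```

For example, `sketch_distance(U, V, 4, 64, random.Random(1))` returns a randomised estimate of the Hamming distance of `U` and `V`; the estimate is exactly 0 when the two strings are equal.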

The Twenty Questions game
Consider the following version of the classic game "Twenty Questions". There are two players: Paul and Carole. Carole thinks of two numbers $A, B$ between $0$ and $N$, and Paul must return some number in $[A, B]$. He is allowed to ask questions of the form "Is $x \le A$?", for any $x \in [0, N]$. If $x \le A$, Carole must return YES; if $A < x \le B$, she can return anything; and if $B < x$, she must return NO. Paul must return the answer after having asked at most $Q$ questions, where Carole can tell at most $\rho Q$ lies, and only in the case when $x \le A$.
We show that Paul has a winning strategy for $Q = \Theta(\log N)$ and any $\rho < 1/3$ by a black-box reduction to the result of Dhagat, Gács, and Winkler [11], who showed a winning strategy for $A = B$.

Theorem 7 ([11]). For $A = B$, Paul has a winning strategy for all $\rho < \frac{1}{3}$ asking $Q = \frac{8 \log N}{(1-3\rho)^2}$ questions.

This result is obtained by maintaining a stack of trusted intervals. Once Paul knows that $A$ is between $\ell$ and $r$, where $\ell \le r$, he checks whether $A$ is in the left or the right half of the interval $[\ell, r]$. If no inconsistencies appear (like $A < \ell$ or $r < A$), he pushes the new interval to the stack, else he removes the interval $[\ell, r]$ from the stack of trusted intervals. After $Q$ rounds, Paul returns the only number in the top interval in the stack, which is guaranteed to have length 1 and to contain $A$. We give the pseudocode of Paul's strategy in Algorithm 1. By Carole($x$), we denote the answer of Carole for a question "Is $x \le A$?".

Algorithm 1. The Twenty Questions game.
if Carole(mid) then
    if Carole(r) then
        S.pop()    ▷ The answer is inconsistent with I; remove I.

else
    S.pop()    ▷ The answer is inconsistent with I; remove I.
We now show a winning strategy for our variant of the game.

Proof. We introduce just one change to Algorithm 1, namely, we return the argument of the largest YES obtained in the course of the algorithm. From the problem statement it follows that the answer is at most $B$. We shall now prove that the answer is at least $A$. If Carole ever returned YES for $A < x \le B$, then it is obviously the case. Otherwise, Carole actually behaved as if she had $A = B$ in mind: apart from the small fraction of erroneous answers, she returned YES for $x \le A$, and NO for $x > A$. Thus, the strategy of Dhagat, Gács, and Winkler ends up with $A$ as the answer (and this must be due to a YES for $x = A$).
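The stack-of-trusted-intervals idea, together with the "largest YES" modification from the proof above, can be sketched in Python as follows. This is a simplified illustration and not a transcription of Algorithm 1: the half-open intervals, the rule of re-pushing a length-1 interval, the guard that keeps the root interval on the stack, and the name `paul` are all our assumptions. Each round asks up to two questions, and the sketch makes no claim about the exact fraction of lies it tolerates.

```python
def paul(carole, N, rounds):
    """Return the largest x for which Carole answered YES to "Is x <= A?".

    carole(x) -> bool answers the question for 0 <= x <= N.
    Intervals (lo, hi) are half-open and mean "we trust lo <= A < hi".
    """
    best_yes = 0  # 0 <= A always holds, so 0 is a safe initial answer

    def ask(x):
        nonlocal best_yes
        if x > N:
            return False  # x > N >= B, so the truthful answer is NO
        answer = carole(x)
        if answer:
            best_yes = max(best_yes, x)
        return answer

    stack = [(0, N + 1)]
    for _ in range(rounds):
        lo, hi = stack[-1]
        if hi - lo == 1:
            # Nothing left to halve: re-check both ends of the interval.
            consistent, child = ask(lo) and not ask(hi), (lo, hi)
        else:
            mid = (lo + hi) // 2
            if ask(mid):
                # A seems to be in the right half; hi must still exceed A.
                consistent, child = not ask(hi), (mid, hi)
            else:
                # A seems to be in the left half; lo must still be <= A.
                consistent, child = ask(lo), (lo, mid)
        if consistent:
            stack.append(child)
        elif len(stack) > 1:
            stack.pop()  # the answers contradict the top interval; distrust it
    return best_yes
```

With a truthful Carole and $A = B$, about $\log_2 N$ rounds suffice to reach a length-1 interval; a lie leads to a wrong interval that is popped as soon as truthful answers contradict it.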

LCS with Approximately k Mismatches
In this section, we prove Theorem 2. Let us first introduce a decision variant of the LCS with Approximately k Mismatches problem.
Problem 9. Two strings $X, Y$ of length at most $n$, integers $k, \ell$, and a constant $\varepsilon > 0$ are given. We must return: If we return YES, we must also give a witness pair of length-$\ell$ substrings $S_1$ and $S_2$ of $X$ and $Y$, respectively, such that $d$

The decision variant of the LCS with Approximately k Mismatches problem can be reduced to the following $(c, r)$-Approximate Near Neighbour problem.

Problem 10. In the $(c, r)$-Approximate Near Neighbour problem with failure probability $f$, the aim is, given a set $P$ of $n$ points in $\mathbb{R}^d$, to construct a data structure supporting the following queries: given any point $q \in \mathbb{R}^d$, if there exists $p \in P$ such that $\|p - q\| \le r$, then return some point $p' \in P$ such that $\|p' - q\| \le cr$ with probability at least $1 - f$.
Using the reduction, we will show our first solution to the LCS with Approximately k Mismatches decision problem based on the result of Andoni and Razenshteyn [5], who showed that for any constant $f$, there is a data structure for the $(c, r)$-Approximate Near Neighbour problem that has $O(n^{1+\rho+o(1)} + d \cdot n)$ space, $O(d \cdot n^{\rho+o(1)})$ query time, and $O(d \cdot n^{1+\rho+o(1)})$ preprocessing time, where $\rho = 1/(2c^2 - 1)$.
Proof. Let $P$ be the set of all length-$\ell$ substrings of $X$ and $Q$ be the set of all length-$\ell$ substrings of $Y$, all encoded in binary using the morphism $\mu$ (see Section 2). We start by applying the dimension reduction procedure of Corollary 5 to $P$ and $Q$ with $\alpha = 1/(\log \log n)^{\Theta(1)}$ and $\beta = 2$ to obtain sets $P'$ and $Q'$. We can implement the procedure in $O(\sigma n \log^2 n (\log \log n)^{\Theta(1)}) = O(n \log^{2+o(1)} n)$ time by encoding $X, Y$ using $\mu$ and running the FFT algorithm [12] for each of the $O(\log^{1+o(1)} n)$ rows of the matrix and $\mu(X), \mu(Y)$.
To solve the decision variant of LCS with Approximately k Mismatches, we build the data structure of Andoni and Razenshteyn [5] for $((1+\varepsilon)(1-\alpha), (1+\alpha)k)$-Approximate Near Neighbour over $Q'$. We make a query for each string in $P'$. If, when queried for $\mathrm{sk}_\alpha(S_1) \in P'$, where $S_1$ is a length-$\ell$ substring of $X$, the data structure outputs $\mathrm{sk}_\alpha(S_2) \in Q'$, where $S_2$ is a length-$\ell$ substring of $Y$, then we compute $\|\mathrm{sk}_\alpha(S_1) - \mathrm{sk}_\alpha(S_2)\|^2$. If it is at most $(1+\varepsilon)k$, we output YES and the witness pair $(S_1, S_2)$ of substrings. As the length of vectors in $P', Q'$ is $d = O(\log^{1+o(1)} n)$, we obtain the desired complexity.
To show that the algorithm is correct, suppose that there are length-$\ell$ substrings $S_1$ and $S_2$ of $X$ and $Y$, respectively, with $d_H(S_1, S_2) \le k$. By Corollary 5, $\|\mathrm{sk}_\alpha(S_1) - \mathrm{sk}_\alpha(S_2)\|^2 \le (1+\alpha)k$ holds with probability at least $1 - 1/n$. Then, when querying for $\mathrm{sk}_\alpha(S_1)$, with constant probability the data structure will output a string $\mathrm{sk}_\alpha(S_2')$ such that $\|\mathrm{sk}_\alpha(S_1) -$ Then, our algorithm will return YES.
On the other hand, if we output YES with a witness pair $(S_1, S_2)$, then $\|\mathrm{sk}_\alpha(S_1) -$

While this solution is very fast, it uses quite a lot of space. Furthermore, the data structure of [5] that we use as a black box applies highly non-trivial techniques. To overcome these two disadvantages, we will show a different solution based on a careful implementation of ideas first introduced in [4], which showed a data structure for approximate text indexing with mismatches. In [19], the authors developed these ideas further to show an algorithm that solves the LCS with Approximately k Mismatches problem in $O(n^{1+1/(1+\varepsilon)})$ space and $O(n^{1+1/(1+\varepsilon)} \log^2 n)$ time for $\varepsilon \in (0, 2)$ with constant error probability. In this work, we significantly improve and simplify the approach to show the following result:

Proof of Theorem 2. We will rely on the modified version of the Twenty Questions game that we described in Section 2.1. In our case, $A = \mathrm{LCS}_k(X, Y)$ and $B = \mathrm{LCS}_{(1+\varepsilon)k}(X, Y)$. For Carole, we use either the algorithm of Lemma 11, or the algorithm of Theorem 12, with an additional procedure verifying the witness pair $(S_1, S_2)$ character by character to check that it indeed satisfies $d_H(S_1, S_2) \le (1+\varepsilon)k$. We output the longest pair of (honest) witness substrings found across all iterations. We will return a correct answer assuming that the fraction of errors is $\rho < \frac{1}{3}$. Recall that the algorithm solves the decision variant of the LCS with Approximately k Mismatches problem incorrectly with probability not exceeding a constant $\delta$, and we can ensure $\delta < \frac{1}{3}$ by repeating it a constant number of times. It means that Carole can answer an individual question erroneously with probability less than $\frac{1}{3}$. Therefore, for a sufficiently large constant in the number of queries $Q = \Theta(\log n)$, the fraction of erroneous answers is $\rho < \frac{1}{3}$ with high probability by Chernoff-Hoeffding bounds. The claim of the theorem follows immediately from Lemma 11 and Theorem 12.

Proof of Theorem 12
We first give an algorithm for the decision version of the LCS with Approximately k Mismatches problem that uses $O(n \log n)$ space and $O(n^{1+1/(1+\varepsilon)} \log n + \sigma n \log^2 n)$ time, and then we improve the space and time complexity.
We assume that we have fixed a Karp-Rabin fingerprinting function $\varphi$ for a prime $q = \Omega(\max\{n^5, \sigma\})$ and an integer $r \in \mathbb{Z}_q$. With error probability inverse polynomial in $n$, we can find such $q$ in $O(\log^{O(1)} n)$ time; see [23,3].
Let $\Pi$ be the set of all projections of strings of length $\ell$ onto a single position, i.e., the value $\pi_i(S)$ of the $i$-th projection on a string $S$ of length $\ell$ is simply its $i$-th character $S[i]$. More generally, for a length-$\ell$ string $S$ and a function $h = (\pi_{a_1}, \ldots, \pi_{a_m}) \in \Pi^m$, we define $h(S) = S[a_1] S[a_2] \cdots S[a_m]$.

We choose a set $H$ of $L = \Theta(n^{1/(1+\varepsilon)})$ hash functions in $\Pi^m$ uniformly at random. Let $C_H$ be the multiset of all collisions of length-$\ell$ substrings of $X$ and $Y$ under the functions from $H$.

We must explain how we compute $C_H$ and choose the collisions that we test. We consider each hash function $h \in H$ in turn. Let $h = (\pi_{a_1}, \ldots, \pi_{a_m})$. Recall that for a string $S$ of length $\ell$ we define $h(S)$ as $S[a_1] S[a_2] \cdots S[a_m]$. We create a vector $U$ of length $\ell$ where each entry is initialised with 0. For each $i$, we add $r^{i-1} \bmod q$ to the $a_i$-th entry of $U$. Finally, we run the FFT algorithm [12] for $U$ and $X, Y$ in the field $\mathbb{Z}_q$, and sort the resulting values. We obtain a list of sorted values that we can use to generate the collisions. Namely, consider some fixed value $z$. Assume that there are $x$ substrings of $X$ and $y$ substrings of $Y$ of length $\ell$ such that the fingerprint of their projection is equal to $z$. The value $z$ then gives $xy$ collisions, and we can generate each one of them in constant time. This explains how to choose the subset $C'$ in $O(nL \log n)$ time.
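As an illustration of how the fingerprints of the projections group substrings into collisions, here is a small Python sketch. It makes several simplifying assumptions: it evaluates $\sum_i r^{i-1} S[a_i] \bmod q$ directly for every substring instead of using the FFT, positions are 0-based, characters are mapped to integers with `ord`, and the function names and the sample values of `r` and `q` used in the note below are hypothetical.

```python
from collections import defaultdict

def fingerprints(text, ell, positions, r, q):
    """Fingerprint of the projection h(S) for every length-ell substring S of text.

    positions = (a_1, ..., a_m) are 0-based offsets inside the window;
    the fingerprint is sum_i r^(i-1) * S[a_i] mod q.
    """
    weights = [pow(r, i, q) for i in range(len(positions))]
    result = []
    for start in range(len(text) - ell + 1):
        value = sum(w * ord(text[start + a]) for w, a in zip(weights, positions))
        result.append(value % q)
    return result

def collision_groups(X, Y, ell, positions, r, q):
    """Map each fingerprint z to (starts in X, starts in Y) that share it.

    A group with x starts in X and y starts in Y represents x * y collisions.
    """
    groups = defaultdict(lambda: ([], []))
    for i, z in enumerate(fingerprints(X, ell, positions, r, q)):
        groups[z][0].append(i)
    for j, z in enumerate(fingerprints(Y, ell, positions, r, q)):
        groups[z][1].append(j)
    return {z: g for z, g in groups.items() if g[0] and g[1]}
```

For instance, with `X = "abcabc"`, `Y = "xbcx"`, `ell = 3`, `positions = [1, 2]`, `r = 1000003` and `q = 2**61 - 1`, the only group pairs the starts `[0, 3]` in `X` with the start `[0]` in `Y`, that is, two collisions on the projection "bc".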
To draw a collision from $C_H$ uniformly at random, we could simply compute the total number of collisions across all functions $h \in H$, draw a number in $[1, |C_H|]$, and generate the corresponding collision. However, this would require generating the collisions twice. Instead, we use the weighted reservoir sampling algorithm [9]. We divide all collisions into subsets according to the values of fingerprints. We assume that the weighted reservoir sampling algorithm receives the fingerprint values one-by-one, as well as the number of corresponding collisions. At all times, the algorithm maintains a "reservoir" containing one fingerprint value and a random collision corresponding to this value. When a new value $z$ with $xy$ collisions arrives, the algorithm replaces the value in the reservoir with $z$ and a random collision with some probability. Note that to select a random collision it suffices to choose a pair from $[1, x] \times [1, y]$ uniformly at random. It is guaranteed that if for a value $z$ we have $xy$ collisions, the algorithm will select $z$ with probability $xy/|C_H|$. Consequently, after processing all values, the reservoir will contain a collision chosen from $C_H$ uniformly at random.

Lemma 14. Let $S_1$ and $S_2$ be two length-$\ell$ substrings of $X$ and $Y$, respectively, with $d_H(S_1, S_2) \le k$. If the constant in $L = \Theta(n^{1/(1+\varepsilon)})$ is large enough, then, with probability at least 3/4, there exists a function $h \in H$ such that $h(S_1) = h(S_2)$.
Proof. Consider a function $h = (\pi_{a_1}, \ldots, \pi_{a_m})$ drawn from $\Pi^m$ uniformly at random. The probability of $h(S_1) = h(S_2)$ is at least $p_1^m$. Due to $p_1 \le 1$, we have Moreover, Hence, we can choose the constant in $L = |H|$ so that the claim of the lemma holds.
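One way to carry out the calculation behind this proof is sketched below. The definitions used here are assumptions on our part, since they do not appear in the surviving text: we take $p_1 = 1 - k/\ell$, $p_2 = 1 - (1+\varepsilon)k/\ell > 0$ and $m = \lceil \log_{p_2}(1/n) \rceil$, in the style of [4], and we assume that the functions in $H$ are drawn independently. The middle inequality of the first line uses Bernoulli's inequality $(1-x)^{1+\varepsilon} \ge 1 - (1+\varepsilon)x$, which gives $\ln p_1 / \ln p_2 \le 1/(1+\varepsilon)$.

```latex
\begin{align*}
p_1^{m} &\ge p_1^{\log_{p_2}(1/n) + 1}
         = p_1 \cdot n^{-\ln p_1 / \ln p_2}
         \ge p_1 \cdot n^{-1/(1+\varepsilon)}, \\
\Pr\bigl[\forall h \in H :\ h(S_1) \ne h(S_2)\bigr]
        &\le \bigl(1 - p_1^{m}\bigr)^{L}
         \le \exp\bigl(-L \, p_1^{m}\bigr)
         \le \tfrac{1}{4}
         \quad \text{whenever } L \ge \ln 4 \cdot n^{1/(1+\varepsilon)} / p_1 .
\end{align*}
```

Under these assumptions, a set of $L = \Theta(n^{1/(1+\varepsilon)})$ functions suffices as long as $p_1$ is bounded below by a constant.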
In total, we have $n^2 |H|$ possible triples $(S_1, S_2, h)$, so by linearity of expectation, we conclude that the expected number of such triples is at most $\frac{2}{n} \cdot n^2 L = 2nL$. Therefore the probability to hit a triple $(S_1, S_2, h)$

Below, we combine the previous results to prove that, with constant probability, Algorithm 2 correctly solves the decision variant of the LCS with Approximately k Mismatches problem. Note that we can reduce the error probability to an arbitrarily small constant $\delta > 0$: it suffices to repeat the algorithm a constant number of times.
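The single-pass selection of a uniformly random collision by weighted reservoir sampling, as described earlier in this section, can be sketched in Python as follows. This is a minimal illustration of the standard one-item weighted reservoir rule, not the authors' code; the function name `sample_collision` and the representation of a fingerprint group as a pair of lists of starting positions are our assumptions.

```python
import random

def sample_collision(groups, rng):
    """Pick a uniformly random pair (x, y) among all collisions in one pass.

    groups: iterable of (xs, ys), the starting positions in X and in Y that
            share one fingerprint value; such a group stands for
            len(xs) * len(ys) collisions.
    Returns None if there are no collisions at all.
    """
    total = 0
    chosen = None
    for xs, ys in groups:
        weight = len(xs) * len(ys)
        if weight == 0:
            continue
        total += weight
        # Replace the reservoir with probability weight / total, so that each
        # group ends up selected with probability weight / (final total).
        if rng.randrange(total) < weight:
            chosen = (rng.choice(xs), rng.choice(ys))
    return chosen
```

Only the running total and the current reservoir are stored, so the groups can be streamed and discarded as they are produced.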

Proof. Suppose first that $\ell \le \mathrm{LCS}_k(X, Y)$, which means that there are two length-$\ell$ substrings $S_1, S_2$ of $X, Y$ such that $d_H(S_1, S_2) \le k$. By Lemma 14, with probability at least 3/4, there exists a function $h \in H$ such that $h(S_1) = h(S_2)$. In other words, $(S_1, S_2, h) \in C_H$ with probability at least $\frac{3}{4}$. If $|C_H| < 4nL$, we will find this triple and it will pass the test with probability at least $1 - n^{-6}$. If $|C_H| \ge 4nL$, then by Lemma 15 the Hamming distance between $S_1, S_2$, where $(S_1, S_2, h)$ was drawn from $C_H$ uniformly at random, is at most $(1+\varepsilon)k$ with probability $\ge 1/2$, and therefore this pair will pass the test with probability $\ge 1/2$. It follows that in this case the algorithm outputs YES with constant probability.
Suppose now that $\ell > \mathrm{LCS}_{(1+\varepsilon)k}(X, Y)$. In this case, the Hamming distance between any pair of length-$\ell$ substrings of $X$ and $Y$ is at least $(1+\varepsilon)k$, so none of them will ever pass the second test and none of them will pass the first test with constant probability.
We now improve the space of the algorithm to linear.Note that the only reason why we needed O(n log n) space is that we precompute and store the sketches for the Hamming distance.Below we explain how to overcome this technicality.
First, we do not precompute the sketches. Second, we process the collisions in $C'$ in batches of size $n$. Consider one of the batches, $B$. For each collision $(S_1, S_2, h) \in B$ we must compute $\|\mathrm{sk}_\varepsilon(S_1) - \mathrm{sk}_\varepsilon(S_2)\|^2$. We initialise a counter for every collision, setting it to zero. The number of rounds in the algorithm will be equal to the length of the sketches, and, in round $i$, the counter for a collision $(S_1, S_2, h) \in B$ will contain the squared $L_2$ distance between the length-$i$ prefixes of $\mathrm{sk}_\varepsilon(S_1)$ and $\mathrm{sk}_\varepsilon(S_2)$. In more detail, let $\mathcal{S}$ be the set of all substrings of $X, Y$ that participate in the collisions in $B$. Recall that all these substrings have length $\ell$. At round $i$, we compute the $i$-th coordinate of the sketches of the substrings in $\mathcal{S}$. By definition, the $i$-th coordinate is the dot product of the $i$-th row of $c \cdot M$, where $c$ and $M$ are as in Corollary 5, and a substring encoded using $\mu$. Hence, we can compute the coordinate using the FFT algorithm [12] in $O(\sigma n \log n)$ time and $O(n)$ space. When we have the coordinate $i$ computed, we update the counters for the collisions and repeat.
At any time, the algorithm uses $O(n)$ space. Compared to the time consumption proven in Lemma 13, the algorithm spends an additional $O(\sigma n^{1+1/(1+\varepsilon)} \log^2 n)$ time for computing the coordinates of the sketches. Therefore, in total the algorithm uses $O(\sigma n^{1+1/(1+\varepsilon)} \log^2 n) = O(n^{1+1/(1+\varepsilon)} \log^2 n)$ time and $O(n)$ space. For constant-size alphabets, this completes the proof of Theorem 12. For alphabets of arbitrary size, we replace the sketches from Section 2 with the sketches defined in [19] to achieve the desired complexity. We note that we could use the sketches of [19] for small-size alphabets as well, but their lengths hide a large constant.
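The round-by-round update of the counters can be sketched in Python as follows. The sketch rests on assumptions of ours: symbols are integers in $\{0, \ldots, \sigma-1\}$, each coordinate is computed by a direct dot product rather than by the FFT, the scalar $c$ is omitted (the counters hold unscaled squared distances), and the function name `batched_sketch_distances` is hypothetical. The point it illustrates is that only one row of the $\pm 1$ matrix and one coordinate per substring are alive at any time.

```python
import random

def batched_sketch_distances(X, Y, ell, collisions, d, sigma, seed):
    """For each collision (i, j), accumulate the unscaled squared distance
    between the d-coordinate sketches of X[i:i+ell] and Y[j:j+ell]."""
    rng = random.Random(seed)
    counters = [0] * len(collisions)
    for _ in range(d):
        # One row of the +-1 matrix over the one-hot encoded window.
        row = [rng.choice((-1, 1)) for _ in range(sigma * ell)]

        def coord(text, start):
            # Dot product of the row with the one-hot encoding of the window.
            return sum(row[j * sigma + text[start + j]] for j in range(ell))

        cache_x, cache_y = {}, {}  # coordinates for this round only
        for idx, (i, j) in enumerate(collisions):
            if i not in cache_x:
                cache_x[i] = coord(X, i)
            if j not in cache_y:
                cache_y[j] = coord(Y, j)
            counters[idx] += (cache_x[i] - cache_y[j]) ** 2
    return counters
```

Dividing a counter by $2d$ gives an estimate of the Hamming distance of the corresponding pair, under the same normalisation assumption as above.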

Experiments
We now present results of experimental evaluation of the second solution of Theorem 2.
Methodology and test environment. The baselines and our solution are written in C++11 and compiled with optimizations using gcc 7.4.0. The experimental results were generated on an Intel Xeon E5-2630 CPU using 128 GiB RAM. To ensure the reproducibility of our results, our complete experimental setup, including data files, is available at https://github.com/fnareoh/LCS_Approx_k_mis.
Baseline. The only other solution to the LCS with Approximately k Mismatches problem was presented in [19]. However, it has a worse complexity and is likely to be impractical because it uses a very complex class of hash functions. We therefore chose to compare our algorithm against algorithms for the LCS with k Mismatches problem. To the best of our knowledge, none of the existing algorithms has been implemented. We implemented the solution to LCS with k Mismatches by Flouri et al., which we refer to as FGKU [13]. (The other algorithms seem too complex to be efficient in practice.) The main idea of the algorithm of Flouri et al. is that if we know that the longest common substring with $k$ mismatches is obtained by a substring of $X$ that starts at a position $p$ and a substring of $Y$ that starts at a position $p + i$, then we can find it by scanning $X$ and $Y[i, |Y|]$ in linear time.
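The scanning idea can be illustrated by the following Python sketch, which tries every alignment (shift) of $X$ against $Y$ and slides a window containing at most $k$ mismatches along it. This is our reading of the idea and not the authors' C++ implementation; the function name `lcs_k_mismatches` and the returned tuple layout are assumptions.

```python
def lcs_k_mismatches(X, Y, k):
    """Quadratic-time LCS with k mismatches.

    Returns (length, start in X, start in Y) of a longest pair of equal-length
    substrings of X and Y at Hamming distance at most k.
    """
    n, m = len(X), len(Y)
    best = (0, 0, 0)
    for shift in range(-(n - 1), m):
        # On this alignment, X[p] is compared with Y[p + shift].
        p_lo, p_hi = max(0, -shift), min(n, m - shift)
        left, mismatches = p_lo, 0
        for right in range(p_lo, p_hi):
            if X[right] != Y[right + shift]:
                mismatches += 1
            # Shrink the window until it has at most k mismatches again.
            while mismatches > k:
                if X[left] != Y[left + shift]:
                    mismatches -= 1
                left += 1
            if right - left + 1 > best[0]:
                best = (right - left + 1, left, left + shift)
    return best
```

Each alignment is scanned once with two pointers, so the total work is proportional to $|X| \cdot |Y|$ and only constant extra space is used.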

Details of implementation.
We made several adjustments to the theoretical algorithm we described. First, we use the fact that $\mathrm{LCS}(X, Y) \le \mathrm{LCS}_k(X, Y) \le (k+1) \cdot \mathrm{LCS}(X, Y) + k$ to bound the interval in the Twenty Questions game. We also treated the number of questions in the Twenty Questions game and $L$, the size of the set of hash functions $H$, as parameters that trade time for accuracy, and set the number of questions to $2 \log(B - A)$ in the Twenty Questions game and $L = n^{1/(1+\varepsilon)}/16$. In Line 6 of Algorithm 2, we used sketches to estimate the Hamming distance. In practice, we computed the Hamming distance via character-by-character comparison when $\ell$ is small compared to $k$ and via kangaroo jumps [14] otherwise. Also, when $\ell \le 2 \log n$ in Algorithm 2, we computed the hash values of the length-$\ell$ substrings of $S_1$ and $S_2$ naively, instead of using the FFT algorithm [12].
Data sets and results. We considered $k \in \{10, 25, 50\}$ and $\varepsilon \in \{1.0, 1.25, 1.5, 1.75, 2.0\}$. We tested the algorithms on pairs of random strings (each character is selected independently and uniformly from a four-character alphabet {A, T, G, C}) and on pairs of strings extracted at random from the E. coli genome. The lengths of the strings in each pair are equal and vary from 0 to 60000 with a step of 5000. All timings reported are averaged over ten runs. Figures 1-3 show the results for $k = 10, 25, 50$. We note that for $\varepsilon = 1$ and $k = 10, 25$, the standard deviation of the running time on the E. coli data set is quite large, which is probably caused by our choice of the method to compute the Hamming distance between substrings, but for all other parameter combinations it is within the standard range. We can see that the time decreases when $\varepsilon$ grows, which is consistent with the theoretical complexity.
As for the accuracy, note that our algorithm cannot return a pair of strings at Hamming distance more than $(1+\varepsilon)k$, and so the only risk is returning strings which are too short. Consequently, we measured the accuracy of our implementation by the ratio of the length $\widetilde{\mathrm{LCS}}_k(X, Y)$ returned by our algorithm to $\mathrm{LCS}_k(X, Y)$ for 10 pairs of strings for each length from 5000 to 60000 with a step of 5000, as well as the error rate, i.e., the percentage of experiments where $\widetilde{\mathrm{LCS}}_k(X, Y) < \mathrm{LCS}_k(X, Y)$ (see Table 1). Not surprisingly, $r_{\min}$ and $r_{\max}$ grow as $k$ and $\varepsilon$ grow, while the error rate drops. Even though there is no theoretical upper bound on $r_{\max}$, the latter is at most 2.24 at all times. We also note that even in the cases when the error rate is non-negligible, $\widetilde{\mathrm{LCS}}_k \ge 0.86 \cdot \mathrm{LCS}_k$; in other words, our algorithm returns a reasonable approximation of $\mathrm{LCS}_k$.

Proof of Fact 3
We now show the lower bound of Fact 3 by a reduction from the (1 + γ)-approximate Bichromatic Closest Pair problem.
Rubinstein [22] proved that for every constant $\delta > 0$, there exists $\gamma = \gamma(\delta)$ such that any randomised algorithm that solves $(1+\gamma)$-approximate Bichromatic Closest Pair correctly with constant probability requires $\Omega(N^{2-\delta})$ time assuming SETH:

Hypothesis 18 (SETH). For every $\delta > 0$, there exists an integer $q$ such that SAT on $q$-CNF formulas with $m$ clauses and $n$ variables cannot be solved in $m^{O(1)} 2^{(1-\delta)n}$ time even by a Monte-Carlo randomised algorithm (with error probability bounded by a small constant).
We show the lower bound by reducing a single instance of $(1+\gamma)$-approximate Bichromatic Closest Pair to a polylogarithmic number of instances of LCS with Approximately k Mismatches. We assume that $U_i, V_j$ are over the alphabet {0,

The substring $S_2$ contains either $H V_j$ or $V_j H$ for some $j$. Without loss of generality, we can assume that $S_2$ contains a copy of $H$ followed by $V_j$ for some $j$. Let us consider the substring $S$ of $X$ aligned with the copy of $H$ in $S_2$. Below we will prove that $S = H$, and since $S$ is followed by $U_i$ for some $i$, this will imply that $d_H(H U_i H, H V_j H) \le k$.

)^2, we will have at least one more mismatch from the alignment of the copy of $H$ in $Y$ and the copies of $H$ in $X$ that surround $U_i$. Therefore, in total there are at least $d + 1 > k$ mismatches, a contradiction. To conclude, both cases are impossible, and hence $s = 0$. The lemma follows as explained above.

By the definition of $k_0$, Observation 19 and Lemma 20, there do not exist $i, j \in [1, N]$ such that $d_H(U_i, V_j) \le k_0/(1+\varepsilon)$, but there exist $i, j \in [1, N]$ such that $d_H(U_i, V_j) \le k_0 (1+\varepsilon)$. In the $(1+\gamma)$-approximate Bichromatic Closest Pair problem, this translates to $k_0/(1+\varepsilon) < h \le k_0 (1+\varepsilon)$, where $h$ is the minimal distance between all pairs $U_i, V_j$. This is equivalent to which means that the pair $(U_i, V_j)$ found by the algorithm for $k_0$ is a valid solution for

Theorem 12. Assume an alphabet of arbitrary size $\sigma = n^{O(1)}$. The decision variant of LCS with Approximately k Mismatches can be solved in $O(n^{1+1/(1+\varepsilon)} \log^2 n)$ time and $O(n)$ space. The answer is correct with constant probability.

Let us defer the proof of the theorem until Section 3.1 and start by explaining how we use Lemma 11 and Theorem 12 and the Twenty Questions game to show Theorem 2.

Figure 1. Comparison of the FGKU algorithm versus our algorithm for $k = 10$ and different values of $\varepsilon$. Large standard deviation for length 60000 is caused by an outlier with a very long longest common substring with $k$ mismatches.

Figure 2. Comparison of the FGKU algorithm versus our algorithm for $k = 25$ and different values of $\varepsilon$.

Figure 3. Comparison of the FGKU algorithm versus our algorithm for $k = 50$ and different values of $\varepsilon$.

Figure 4. Substrings $S_1$ and $S_2$ of $X$ and $Y$, respectively, substring $S$ aligned with a copy of $H$ in $S_2$, and the shift $s$.
We will perform two tests. The first test chooses an arbitrary subset $C' \subseteq C_H$ of size $|C'| = \min\{4nL, |C_H|\}$ and, for each collision $(S_1, S_2, h) \in C'$, computes $\|\mathrm{sk}_\varepsilon(S_1) - \mathrm{sk}_\varepsilon(S_2)\|^2$. If this value is at most $(1+\varepsilon)k$, then the algorithm returns YES and the pair $(S_1, S_2)$ as a witness. The second test chooses a collision $(S_1, S_2, h) \in C_H$ uniformly at random and computes the Hamming distance between $S_1$ and $S_2$ character by character in $O(\ell) = O(n)$ time. If the Hamming distance is at most $(1+\varepsilon)k$, the algorithm returns YES and the witness pair $(S_1, S_2)$. Otherwise, the algorithm returns NO. See Algorithm 2.

Algorithm 2. LCS with Approximately k Mismatches (decision variant).
1: Choose a set $H$ of $L$ functions from $\Pi^m$ uniformly at random
2:

Lemma 15. If $|C_H| > 4nL$ and $(S_1, S_2, h)$ is a uniformly random element of $C_H$, then $\Pr[d_H(S_1, S_2) \ge (1+\varepsilon)k] \le \frac{1}{2}$.

Proof. Consider length-$\ell$ substrings $S_1, S_2$ of $X, Y$, respectively, such that $d_H(S_1, S_2) \ge (1+\varepsilon)k$, and a hash function $h$. Let us bound the probability of $(S_1, S_2, h) \in C_H$. There are two possible cases: either $h(S_1) \ne h(S_2)$ but $\varphi(h(S_1)) = \varphi(h(S_2))$, or $h(S_1) = h(S_2)$. The probability of the first event is bounded by the collision probability of Karp-Rabin fingerprints, which is at most $1/n$. Let us now bound the probability of the second event. Since $d_H(S_1, S_2) \ge (1+\varepsilon)k$, we have $\Pr[h(S_1) = h(S_2)] \le p_2^m \le 1/n$, where the last inequality follows from the definition of $m$. Therefore, the probability that for some function $h \in H$ we have $\varphi$

Table 1. Accuracy of the LCS with Approximately k Mismatches algorithm. For each $k$ and $\varepsilon$, we show $r_{\min}(\varepsilon, k)$, $r_{\max}(\varepsilon, k)$, as well as the error rate.

$\widetilde{\mathrm{LCS}}_k(X, Y)$ returned by our algorithm divided by $\mathrm{LCS}_k(X, Y)$ computed by the dynamic programming. We estimate $r_{\min}$