Good predictions are worth a few comparisons

Most modern processors are heavily parallelized and use predictors to guess the outcome of conditional branches, in order to avoid costly stalls in their pipelines. We propose predictor-friendly versions of two classical algorithms: exponentiation by squaring and binary search in a sorted array. These variants result in fewer mispredictions on average, at the cost of an increased number of operations. These theoretical results are supported by experiments showing that our algorithms perform significantly better than the standard ones for primitive data types.


Introduction
As an introductory example, consider the simple problem of computing both the minimum and the maximum of an array of size $n$. The naive approach is to compare each entry to the current minimum and maximum, which uses $2n$ comparisons. A better solution, in terms of number of comparisons, is to look at the elements of the array two by two, and to compare the smallest to the current minimum and the greatest to the current maximum. This uses only $3n/2$ comparisons, which is optimal. In order to observe the benefit of this optimization, we implemented both versions (see Figure 3) and measured their execution time for large arrays of uniform random floats in $[0, 1]$. The results are given in Figure 1 and are very far from what was expected, since the naive implementation is almost twice as fast as the optimized one.

Clearly, counting comparisons cannot explain this counterintuitive performance. An obvious explanation could be a difference in the number of cache misses. However, both implementations make the same memory accesses, in the same order. Instead, we turn our attention to the comparisons themselves. Most modern processors are heavily parallelized and use predictors to guess the outcome of conditional branches in order to avoid costly stalls in their pipelines. Every time a conditional is used in a program, there is a mechanism that tries to predict whether the corresponding conditional jump will be taken or not. The cost of a misprediction can be quite large compared to a basic instruction, and should be taken into account in order to explain accurately the behavior of algorithms that use a fair amount of comparisons.
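The two strategies described above can be sketched in C as follows. This is only an illustrative sketch, not the code of Figure 3: the function names `naive_minmax` and `pair_minmax`, the `float` element type and the handling of a leftover element are assumptions made here, and the line numbers quoted elsewhere in the text refer to the original figure, not to this sketch.

```c
#include <stddef.h>

/* Naive version: two comparisons per element, about 2n in total.
   Assumes n >= 1. */
void naive_minmax(const float *t, size_t n, float *min, float *max) {
    *min = t[0];
    *max = t[0];
    for (size_t i = 1; i < n; i++) {
        if (t[i] < *min) *min = t[i];   /* true only at a new minimum */
        if (t[i] > *max) *max = t[i];   /* true only at a new maximum */
    }
}

/* Pairwise version: three comparisons per pair, about 3n/2 in total.
   Assumes n >= 1. */
void pair_minmax(const float *t, size_t n, float *min, float *max) {
    size_t i;
    *min = t[0];
    *max = t[0];
    for (i = 1; i + 1 < n; i += 2) {
        float lo = t[i], hi = t[i + 1];
        /* This test is a fair coin on uniform random input. */
        if (lo > hi) { float tmp = lo; lo = hi; hi = tmp; }
        if (lo < *min) *min = lo;
        if (hi > *max) *max = hi;
    }
    if (i < n) {  /* leftover element not covered by the pairing */
        if (t[i] < *min) *min = t[i];
        if (t[i] > *max) *max = t[i];
    }
}
```

In the pairwise version, the extra test that orders each pair is the one whose outcome cannot be guessed on random input, whereas both tests of the naive version are almost always false.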

In this matter, our example is quite revealing, since the trick used to lower the number of comparisons relies on a conditional branch that is unpredictable (for an input taken uniformly at random) and will cause a substantial increase in the number of mispredictions. As we will see in the sequel, the expected number of mispredictions caused by the naive algorithm is $\Theta(\log n)$, whereas it is $\Theta(n)$ for the "optimal" one.
The influence of branch predictors over comparison based algorithms has already been studied, mostly to acknowledge the over-cost induced by mispredictions. Our approach is quite the opposite, as we propose to take advantage of this feature by proposing predictor-friendly versions of two classical algorithms.
Our contributions. After dealing with our introductory example using combinatorial arguments, we turn our attention towards the classical exponentiation by squaring and give a simple alternative algorithm, which reduces the number of mispredictions without increasing the number of multiplications. The analysis is based on the study of the Markov chains that describe the dynamic local predictors (see next section for a brief description of predictors). Finally, in the same vein, we propose biased versions of the binary search in a sorted array. We analyze the expected number of mispredictions for local predictors and we also give the (first to our knowledge) analysis of a global predictor. For these two different problems, we manage to significantly lower the number of mispredictions by breaking the perfect balance usually favored in the divide and conquer strategy. In practice, the trade-off between comparisons and mispredictions allows a noticeable speed-up in the execution time when the comparisons involve primitive data types, which supports our theoretical results.

Related work.
Over the past decade, several articles began to address the influence of branch predictors, and especially the cost of mispredictions, in comparison based algorithms. For instance, Biggar and his coauthors [1] investigated the behavior of branches for many sorting algorithms, in an extensive experimental study. Brodal, Fagerberg and Moruz reviewed the trade-offs between comparisons and mispredictions for several sorting algorithms [3] and studied how the number of inversions in the data affects statistics such as the number of mispredictions [2]. Moreover, these works introduced the first theoretical analysis of static branch predictors.
Also interested in the influence of mispredictions on the running time of sorting algorithms, Sanders and Winkel considered dissociating comparisons from branches in their SampleSort, which avoids most of the misprediction cost [13]. Elmasry, Katajainen and Stenmark then proposed a version of MergeSort that is not affected by mispredictions [6], by taking advantage of some processor-specific instructions. The influence of mispredictions was also studied for Quicksort: Kaligosi and Sanders gave an in-depth analysis of simple dynamic branch predictors to explain how mispredictions affect this classical algorithm [9]; however, Martínez, Nebel and Wild pointed out that this is not enough to explain the "better than expected" performance of the dual-pivot version of QuickSort [11] implemented in Java's standard library.
Besides, Brodal and Moruz conducted an experimental study of skewed binary search trees in [4], highlighting that such data structures can outperform well-balanced trees, since branching to the right or left does not necessarily have the same cost, due to branch prediction schemes. Our work follows the same line, as we also want to take advantage of the branch predictions, but we focus on algorithms rather than on data structures.

Elements of computer architecture
To analyze the complexity of searching or sorting algorithms, the standard model consists in counting the number of comparison operations performed. However, most modern processors are pipelined, and to avoid stalling the pipeline when coming across a conditional jump, the processor tries to predict whether the jump will occur and proceeds according to its prediction. A correctly predicted jump does not stall the pipeline, whereas a misprediction causes it to be flushed, with a significant performance loss. Therefore, the cost of a comparison in an "if" statement actually depends on the quality of the prediction.

For any conditional jump, a branch predictor will "guess" whether the corresponding branch will be taken or not. For this purpose, many different strategies have been designed. The simplest one is a static branch predictor, which does not use information from the code execution. It can, for example, predict that all branches will be taken. To improve its accuracy, a dynamic branch predictor uses the outcome of past branches to guess whether a particular branch should be taken or not. We now describe different techniques of dynamic branch prediction.
A 1-bit predictor is a state buffer which remembers the last outcome of the branch; the guess is that the next outcome will be the same. As an improvement, the 2-bit predictors try to avoid making two mispredictions when a branch takes an unlikely path. Two slightly different schemes are given in Figure 2. The saturating counter scheme can be further improved by keeping more information ($k$-bit predictors using $2^k$ states). All these predictors are local: there is one for every conditional (up to some limit in practice).
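To make the difference between the 1-bit predictor and the 2-bit saturating counter concrete, the following C sketch counts the mispredictions of each on a given sequence of outcomes (1 for taken, 0 for not taken). The function names and the encoding of the states as integers are assumptions made for this illustration; the flip-on-consecutive variant of Figure 2 is not covered.

```c
#include <stddef.h>

/* 1-bit predictor: predicts that the next outcome equals the last one.
   state is 0 (predict not taken) or 1 (predict taken); outcomes are 0 or 1. */
size_t mispredictions_1bit(const int *outcomes, size_t len, int state) {
    size_t misses = 0;
    for (size_t i = 0; i < len; i++) {
        if (state != outcomes[i]) misses++;
        state = outcomes[i];
    }
    return misses;
}

/* 2-bit saturating counter: states 0..3, predicts taken when state >= 2.
   The counter moves one step up on "taken" and one step down otherwise. */
size_t mispredictions_2bit_saturating(const int *outcomes, size_t len, int state) {
    size_t misses = 0;
    for (size_t i = 0; i < len; i++) {
        int predicted = (state >= 2);
        if (predicted != outcomes[i]) misses++;
        if (outcomes[i]) { if (state < 3) state++; }
        else             { if (state > 0) state--; }
    }
    return misses;
}
```

On the sequence 0,0,0,1,0,0,0,1,0,0, starting from the "not taken" end, each isolated 1 costs the 1-bit predictor two mispredictions (one on the 1, one on the following 0) but costs the saturating counter only one.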
A history table has $2^n$ entries indexed by the sequence of the last $n$ branches (1 for taken, 0 otherwise). The entries themselves are usually $k$-bit predictors. Such a table is said to be local when its entries correspond to the behavior of one specific branch and are used for this one only. On the contrary, in a global history table, the outcomes of the most recently executed branches are used to index the table, which is shared by all the conditionals.
To get the best of both worlds, correlating branch predictors use local and global information mixed together, and tournament predictors use an additional dynamic scheme to decide whether they follow the local or the global prediction. These types of predictors are far beyond what we study in this article, but are worth mentioning for further analysis.
Strictly speaking, mispredictions can only be analyzed on a given assembly code, as they occur at conditional jumps. In this article, we use C-style pseudo code. We implicitly work on the non-optimized assembly code, where control structures are translated into conditional jumps in the standard way. For our experimental results, we checked that this was indeed the case. Furthermore, we observed that our good results still hold when considering fully optimized binaries.
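A history table can be sketched in C as a small array of 2-bit saturating counters indexed by the last outcomes. The type `history_predictor`, the constant `HISTORY_BITS` and the function names below are hypothetical, chosen for this illustration only. Whether the table is local or global depends only on how it is used: one instance per conditional gives a local table, a single instance fed with the outcomes of all conditionals gives a global one.

```c
#include <stddef.h>

#define HISTORY_BITS 2
#define TABLE_SIZE (1u << HISTORY_BITS)

typedef struct {
    unsigned history;                   /* last HISTORY_BITS outcomes, most recent in the low bit */
    unsigned char counter[TABLE_SIZE];  /* one 2-bit saturating counter per history pattern */
} history_predictor;

void predictor_init(history_predictor *p) {
    p->history = 0;
    for (unsigned i = 0; i < TABLE_SIZE; i++) p->counter[i] = 0;
}

/* Feeds one outcome (0 or 1) to the predictor.
   Returns 1 if it was mispredicted, 0 otherwise, then updates the state. */
int predictor_step(history_predictor *p, int taken) {
    unsigned char *c = &p->counter[p->history];
    int miss = ((*c >= 2) != taken);
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    p->history = ((p->history << 1) | (unsigned)taken) & (TABLE_SIZE - 1);
    return miss;
}
```

With such a table, a strictly alternating branch (taken, not taken, taken, ...) is predicted perfectly after a short warm-up, which a single saturating counter cannot achieve.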

Figure 3: The algorithms NaiveMinmax and $\frac{3}{2}$-Minmax. T is an array of size n. Both min and max are returned.

Simultaneous maximum and minimum finding
In this section, we go back to the example given in the introduction: we consider two algorithms that simultaneously compute the minimum and the maximum of an array of length $n$. These algorithms are given in Figure 3 (see also [5, Sec. 9.1]). For our analysis, we consider the local predictors presented in Section 2.
In the classical setting for the analysis, the algorithm $\frac{3}{2}$-Minmax is optimal (see footnote 1 in the introduction), the number of comparisons performed being asymptotically equivalent to $\frac{3}{2}n$. Obviously, NaiveMinmax needs $2n - 2$ comparisons. In order to explain the experimental results presented in Figure 1, where NaiveMinmax outperforms $\frac{3}{2}$-Minmax, we estimate the expected number of mispredictions for both algorithms. Our probabilistic model is the following: we consider the uniform random distribution on arrays of size $n$, where each element is chosen uniformly and independently in $[0, 1]$. Up to an event of probability 0 (when the elements of the input are not pairwise distinct), this is the same as choosing a uniform random permutation of $\{1, \ldots, n\}$, since we only use comparisons on the elements in both algorithms.
Recall that a min-record (resp. max-record) in an array or a permutation is an element that is strictly smaller (resp. greater) than any element to its left. Obviously, in NaiveMinmax, the first conditional at line 3 (resp. the one at line 5) is true for each min-record (resp. max-record), except for the first position. The number of records in a random permutation is a well-known statistic, which we can use to establish the following proposition.
Proposition 1. The expected number of mispredictions performed by NaiveMinmax, for the uniform distribution on arrays of size $n$, is asymptotically equivalent to $4 \log n$ for the 1-bit predictor and to $2 \log n$ for the two 2-bit predictors and the 3-bit saturating counter. The expected number of mispredictions performed by $\frac{3}{2}$-Minmax is asymptotically equivalent to $\frac{n}{4}$ for all the considered predictors.
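The constants in Proposition 1 can be understood through the following heuristic computation. It is only a sketch of the intuition, not the combinatorial proof: it treats records as isolated events and ignores lower-order terms, and $H_n$ denotes the $n$-th harmonic number.

```latex
% Position i holds a min-record with probability 1/i, hence
\mathbb{E}[\#\text{min-records}] = \sum_{i=1}^{n} \frac{1}{i} = H_n \sim \log n ,
% and the same holds for max-records.

% 1-bit predictor: an isolated record costs one misprediction when the
% branch is taken and one more when it returns to "not taken"; there are
% two conditionals (min and max):
\mathbb{E}[\text{mispredictions}] \approx 2 \cdot 2 H_n \sim 4 \log n .

% 2-bit predictors and 3-bit saturating counter: an isolated record costs
% a single misprediction:
\mathbb{E}[\text{mispredictions}] \approx 2 H_n \sim 2 \log n .

% 3/2-Minmax: the test comparing the two elements of a pair is a fair coin,
% mispredicted with probability 1/2 by any predictor, and it is executed
% about n/2 times:
\mathbb{E}[\text{mispredictions}] \approx \frac{1}{2} \cdot \frac{n}{2} = \frac{n}{4} .
```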
In light of these results, we observe that the mispredictions occurring in NaiveMinmax are negligible compared to the number of comparisons. On the other hand, the additional test used to optimize $\frac{3}{2}$-Minmax (line 3) causes the number of mispredictions to be comparable to the number of comparisons performed. We believe this is enough to explain why the naive implementation performs better (Figure 1), since we know that mispredictions can cost many CPU cycles and that comparisons are cheap operations. Of course, we are aware that other factors can influence the performance of such simple programs, including cache effects. In our implementation, we took care to fetch each element of the array only once and in the same order, so that the cache behavior should not interfere with our results. We also tried the most commonly used optimization of the gcc compiler (-O3) to check that these results withstand strong code optimization. In this particular case, all the branches but the one at line 3 in $\frac{3}{2}$-Minmax are replaced by conditional moves that are not vulnerable to misprediction. Hence, $\frac{3}{2}$-Minmax still causes approximately $\frac{1}{4}n$ mispredictions on average.

Figure 4: The algorithms ClassicalPow, UnrolledPow and GuidedPow. x is a floating-point number, n is an integer and r is the returned value.

Exponentiation by squaring
We saw in the previous section that conditional branches with equal probabilities of going one way or another are particularly harmful when using branch prediction. Besides, most divide and conquer algorithms feature such branches, since they tend to split problems into parts of equal size to reach an optimal complexity. In the sequel, we explore two different ways of disrupting this balance, to end up with better performance for two classical algorithms: exponentiation by squaring and binary search.

Modified algorithms
The classical divide and conquer algorithm to compute $x^n$ consists in rewriting $x^n = (x^2)^{\lfloor n/2 \rfloor} x^{n_0}$, where $n_k \ldots n_1 n_0$ is the binary decomposition of $n$, in order to divide the size $n$ of the problem by two. This is the algorithm ClassicalPow of Figure 4. As expected, the conditional branch of line 3 is taken with probability $\frac{1}{2}$, which is what we want to avoid. In order to introduce some imbalance in the algorithm, we first unroll the loop (UnrolledPow, Figure 4) using the decomposition $x^n = (x^4)^{\lfloor n/4 \rfloor} (x^2)^{n_1} x^{n_0}$. Still, both conditional branches are taken with probability $\frac{1}{2}$, but we can now guide the algorithm by injecting the test that determines whether the last two bits of $n$ are 00 or not. This is the third algorithm of Figure 4. Note that this conditional branch (line 4) is absolutely unnecessary in the algorithm, as it is redundant with the tests of lines 5 and 7. But on the other hand, this branch is taken with probability $\frac{3}{4}$ and the branches of lines 5 and 7 are now both taken with probability $\frac{2}{3}$. This is how we aim at using the branch predictions.

Figure 5: Measurements obtained using the PAPI library. The number of branches is given excluding the ones caused by loops, since these branches do not yield mispredictions.

To compare their performance experimentally, we computed the floating-point value of $x^n$ using each of the algorithms $5 \cdot 10^7$ times, with $n$ chosen uniformly at random in $\{0, \ldots, 2^{26} - 1\}$. We measured the execution time, as well as some other parameters given by the latest version of the PAPI library, which gives access, for instance, to the number of mispredictions occurring during the execution. These results are depicted in Figure 5. The first observation is that GuidedPow is 14% faster than UnrolledPow and 29% faster than ClassicalPow, and yet the number of multiplications performed is essentially the same for the three algorithms. The main explanation we have found for the speed-up between UnrolledPow and ClassicalPow is that the number of loop iterations is divided by two. As for GuidedPow, the number of loop iterations is the same as for UnrolledPow and it uses 25% more comparisons, but still the guided version is faster. The main difference between the two is that the test added at line 4 decreases the number of mispredictions by about a quarter. We are in a setting similar to that of the simultaneous minimum and maximum, where the increased number of comparisons is balanced by fewer mispredictions. We now proceed with the analysis of this phenomenon.
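The three variants discussed above can be sketched in C as follows. This is a reconstruction from the description in the text, not the code of Figure 4: the function names, the `unsigned` type for n and the exact layout are assumptions, and the line numbers quoted in the text refer to the original figure rather than to this sketch.

```c
/* Classical exponentiation by squaring: one test per bit of n. */
double classical_pow(double x, unsigned n) {
    double r = 1.0;
    while (n > 0) {
        if (n & 1u) r *= x;       /* taken with probability 1/2 on random bits */
        x *= x;
        n >>= 1;
    }
    return r;
}

/* Unrolled version: two bits of n are consumed per iteration. */
double unrolled_pow(double x, unsigned n) {
    double r = 1.0;
    while (n > 0) {
        double t = x * x;
        if (n & 1u) r *= x;       /* probability 1/2 */
        if (n & 2u) r *= t;       /* probability 1/2 */
        x = t * t;
        n >>= 2;
    }
    return r;
}

/* Guided version: a redundant test on the two low bits skews the branches. */
double guided_pow(double x, unsigned n) {
    double r = 1.0;
    while (n > 0) {
        double t = x * x;
        if (n & 3u) {             /* low bits are not 00: probability 3/4 */
            if (n & 1u) r *= x;   /* probability 2/3 given the outer test */
            if (n & 2u) r *= t;   /* probability 2/3 given the outer test */
        }
        x = t * t;
        n >>= 2;
    }
    return r;
}
```

All three functions return the same value; only the pattern of conditional branches differs. The outer test of `guided_pow` adds comparisons but replaces two fair-coin branches by three biased ones.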

Analysis of the average number of mispredictions for GuidedPow
For the analysis, we consider that $n$ is taken uniformly at random in $\{0, \ldots, N - 1\}$, for $N = 4^k$ with $k \geq 1$. This model is exactly the same as choosing each of the $2k$ bits of the binary representation of $n$ uniformly at random and independently. We consider the local predictors presented in Section 2.
Let $L_k(n)$ be the number of loop iterations of GuidedPow. This is a random variable, which is easy to analyze since it is equal to the smallest integer $\ell$ such that $4^\ell$ is greater than $n$. In particular, we have $\mathbb{E}[L_k] \sim k$.

We now recall, using our algorithm as an example, why Markov chains are the key tools for this kind of analysis (as done in [9, 11]). Let us consider the first conditional, at line 4. In our model, at each iteration, the condition is true with probability $\frac{3}{4}$, as it is not satisfied when the last two bits are 00. It follows that the behavior of the predictor associated with this conditional is exactly described by the Markov chain obtained by replacing the edge labels "taken" by $\frac{3}{4}$ and the labels "not taken" by $\frac{1}{4}$ (see Figure 6). A misprediction occurs whenever an edge labeled by "taken" (resp. "not taken") is used from a state that predicts "not taken" (resp. "taken"). We also need to know the initial state of the predictor, but it has no influence on our asymptotic results, as we shall see.
Hence, we reduced our problem to counting the number of times some particular edges are taken in a Markov chain, when we perform a random walk of (random) length $L_k$. We can therefore conclude using the classical Ergodic Theorem [10], which we restate below in order to fit our needs.
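As a worked instance of this reduction, consider the 1-bit predictor for a branch taken with probability $p$. The two-state chain below is a sketch written for illustration, with states named $N$ (predict "not taken") and $T$ (predict "taken"); these names, the matrix $M_p$ and the row vector $\pi_p$ are notation assumed here.

```latex
% After each branch the 1-bit predictor simply stores the last outcome,
% so both rows of the transition matrix are equal (state order: N, T):
M_p = \begin{pmatrix} 1-p & p \\ 1-p & p \end{pmatrix},
\qquad
\pi_p = (1-p,\; p), \qquad \pi_p M_p = \pi_p .

% Mispredicting edges: N -> T (taken while predicting "not taken")
% and T -> N (not taken while predicting "taken"):
\pi_p(N)\, p + \pi_p(T)\,(1-p) = 2p(1-p) ,

% which, for the conditional of line 4 (p = 3/4), gives
2 \cdot \tfrac{3}{4} \cdot \tfrac{1}{4} = \tfrac{3}{8}
\text{ mispredictions per execution of the test, in the long run.}
```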
Theorem 2 (Ergodic Theorem). Let $(M, \pi_0)$ be a primitive and aperiodic Markov chain on the finite set $S$. Let $\pi$ be its stationary distribution. Let $E$ be a set of edges of $M$, that is, a set of pairs $(i, j) \in S^2$ such that $M(i, j) > 0$. For any nonnegative integer $n$, let $L_n$ be a random variable on nonnegative integers such that $\lim_{n \to \infty} \mathbb{E}[L_n] = +\infty$. Let $X_n$ be the random variable that counts the number of edges in $E$ that are used during a random walk of length $L_n$ in $M$ (starting from the initial distribution $\pi_0$). Then the following asymptotic equivalence holds:

$$\mathbb{E}[X_n] \sim \mathbb{E}[L_n] \sum_{(i,j) \in E} \pi(i) M(i, j).$$

When considering a given predictor, under the model where the condition is satisfied with probability $p$, we denote by $M_p$ its transition matrix, by $\pi_p$ its stationary vector and by $\mu(p)$ its expected misprediction probability, defined by $\mu(p) = \sum_{(i,j) \in E} \pi_p(i) M_p(i, j)$, where $E$ is the set of edges corresponding to mispredictions. As shown in [11], if we denote by $\mu_1(p)$, $\mu_2(p)$ and $\mu_2'(p)$ the expected misprediction probability of the 1-bit, 2-bit saturating counter and flip-on-consecutive 2-bit predictors, then we have, with $q = p(1-p)$:

$$\mu_1(p) = 2q, \qquad \mu_2(p) = \frac{q}{1 - 2q}, \qquad \mu_2'(p) = \frac{2q^2 + q}{1 - q}. \tag{1}$$

Similarly, the expected misprediction probability $\mu_3(p)$ of the 3-bit saturating counter is

$$\mu_3(p) = \frac{q(1 - 3q)}{1 - 4q + 2q^2}. \tag{2}$$

Applying these mathematical tools to GuidedPow yields the following results. The theorem is stated for values of $N$ that are not powers of 4, which is more complicated since the bits are not exactly 0's and 1's with probability $\frac{1}{2}$ (and not independent). In Section 5 we show how to deal with the cases where we slightly deviate from the ideal case.

Theorem 3. Assume that $n$ is taken uniformly at random in $\{0, \ldots, N - 1\}$. The expected number of conditional tests in ClassicalPow and UnrolledPow is asymptotically equivalent to $\log_2 N$, whereas it is asymptotically equivalent to $\frac{5}{4} \log_2 N$ for GuidedPow. The expected number of mispredictions is asymptotically equivalent to $\frac{1}{2} \log_2 N$ for ClassicalPow and UnrolledPow, for any kind of predictor. For GuidedPow, it is asymptotically equivalent to $\alpha \log_2 N$, where $\alpha = \frac{1}{2} \mu\left(\frac{3}{4}\right) + \frac{3}{4} \mu\left(\frac{2}{3}\right)$ and $\mu$ is the expected misprediction probability associated with the predictor.

Using Theorem 3 and Equations (1) and (2), we get that $\alpha$ is equal to $\frac{25}{48} \approx 0.52$, $\frac{9}{20} = 0.45$, $\frac{2045}{4368} \approx 0.47$ and $\frac{1095}{2788} \approx 0.39$ for the 1-bit, 2-bit saturating counter, flip-on-consecutive 2-bit and 3-bit saturating counter predictors, respectively. These values are to be compared with the $\frac{1}{2}$ of the other two algorithms. In particular, for the 1-bit predictor, the expected number of mispredictions is greater for GuidedPow than for ClassicalPow or UnrolledPow. This predictor is not efficient enough to offset the mispredictions caused by the additional conditional. For the 3-bit saturating counter, GuidedPow therefore uses $\approx 0.25 \log_2 n$ more comparisons than UnrolledPow, but $\approx 0.11 \log_2 n$ fewer mispredictions.
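The values of $\alpha$ quoted for the saturating counters can be checked numerically with the C sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the branch outcomes are taken independent with a fixed probability, and the formula for $\alpha$ counts, per pair of bits, one outer test taken with probability $\frac{3}{4}$ and, three times out of four, two inner tests taken with probability $\frac{2}{3}$. The flip-on-consecutive predictor is not covered, since its chain is not a simple counter.

```c
/* Stationary misprediction probability of a k-bit saturating counter
   (2^k states, predicting "taken" in the upper half) for a branch that is
   taken independently with probability p, 0 < p < 1. */
double saturating_mu(int k, double p) {
    int states = 1 << k;
    double r = p / (1.0 - p);   /* stationary weight of state i is proportional to r^i */
    double w = 1.0, total = 0.0, miss = 0.0;
    for (int i = 0; i < states; i++) {
        total += w;
        /* lower half predicts "not taken": wrong with probability p;
           upper half predicts "taken": wrong with probability 1 - p */
        miss += w * (i < states / 2 ? p : 1.0 - p);
        w *= r;
    }
    return miss / total;
}

/* Misprediction constant of GuidedPow per bit of the exponent:
   (mu(3/4) + (3/4) * 2 * mu(2/3)) per pair of bits, divided by 2. */
double guided_alpha(int k) {
    return 0.5 * saturating_mu(k, 0.75) + 0.75 * saturating_mu(k, 2.0 / 3.0);
}
```

With k = 1, 2 and 3 this gives approximately 0.52, 0.45 and 0.39, in agreement with the values stated for the 1-bit predictor and the 2-bit and 3-bit saturating counters.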

Binary search and variants

Unbalancing the binary search
We first consider the classical binary search, which partitions a sorted array of size $n$ into two parts of size $\frac{n}{2}$ and compares the value $x$ that is searched for to the middle of the array, in order to determine in which part of the array to continue the search. As before, if we consider arrays of uniform random floating-point numbers, we get a conditional branch that is taken with probability $\frac{1}{2}$. A simple way to change that is to partition another way, for instance with parts of size about $\frac{n}{4}$ and $\frac{3n}{4}$, as in the BiasedBinarySearch (see Figure 7). Carrying on with the divide and conquer strategy, but partitioning the array into three parts of size about $\frac{n}{3}$, gives a ternary search. The main issue with this approach is that, in practice, the division by 3 is costly in terms of hardware. Thus, to limit the cost of partitioning, we choose to slice the array into two parts of size $\frac{n}{4}$ and one part of size $\frac{n}{2}$. This can be done using only divisions by powers of two, which are simple binary shifts, as in the initial binary search (see SkewSearch in Figure 7).
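The three partitioning schemes can be sketched in C as follows. This is not the code of Figure 7: the function names, the convention that each function returns the number of elements strictly smaller than x (a value in $\{0, \ldots, n\}$), and the exact placement of the pivots are assumptions made for this illustration. The variable names d and f follow the ones used later in the text.

```c
#include <stddef.h>

/* Each function returns the number of elements of the sorted array
   t[0..n-1] that are strictly smaller than x. */

size_t binary_search(const double *t, size_t n, double x) {
    size_t d = 0, f = n;
    while (d < f) {
        size_t m = d + (f - d) / 2;       /* pivot in the middle */
        if (t[m] < x) d = m + 1; else f = m;
    }
    return d;
}

size_t biased_binary_search(const double *t, size_t n, double x) {
    size_t d = 0, f = n;
    while (d < f) {
        size_t m = d + (f - d) / 4;       /* pivot at the first quarter */
        if (x <= t[m]) f = m;             /* about 1/4 of the time */
        else d = m + 1;
    }
    return d;
}

size_t skew_search(const double *t, size_t n, double x) {
    size_t d = 0, f = n;
    while (d < f) {
        size_t m1 = d + (f - d) / 4;      /* first quarter */
        if (x <= t[m1]) {
            f = m1;                       /* about 1/4 of the time */
        } else {
            size_t m2 = d + (f - d) / 2;  /* middle */
            if (x <= t[m2]) {             /* about 1/3 of the remaining cases */
                d = m1 + 1;
                f = m2;
            } else {
                d = m2 + 1;
            }
        }
    }
    return d;
}
```

Both pivots are obtained with divisions by powers of two only, and each branch of the skewed version is biased away from one half.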

Experiments
As expected at this point in our work, the BiasedBinarySearch experimentally performs better than the classical binary search, and the SkewSearch performs much better. Unlike our previous examples, the changes we brought in the binary search are quite sensitive to cache effects, since the way we partition the array influences the location where the memory is accessed. Thus we conducted experiments on arrays that fit in the last-level cache of our machine, in order to mostly measure the effects of branch prediction. The results are given in Figure 8 and we can see that, for medium-size arrays, SkewSearch is up to 23% faster than the binary search (program compiled with gcc without optimization, in order to keep track of what really happens during the execution). Experiments in Java using a dedicated micro-benchmarking library (jmh, http://openjdk.java.net/projects/code-tools/jmh/) gave roughly the same results (but with a lesser speedup of about 12%), when comparing our skew search to the implementation of the binary search on doubles in the standard library.

Local predictors analysis
As in Section 4, we aim at using the Ergodic Theorem (page 7) to obtain a good asymptotic estimate of the number of mispredictions. We therefore need to compute the expected number of times each given conditional is performed, in our different algorithms. We consider that each possible output is equally likely (i.e. the uniform distribution on $\{0, \ldots, n\}$).
A first order estimation of the expected number of times a given conditional is executed can be obtained using the following version of Roura's Master Theorem [12], which has been simplified for our specific case:

Theorem 4 (Master Theorem). Let $k \geq 1$, and $a_1, \ldots, a_k$ and $b_1, \ldots, b_k$ be positive real numbers such that $\sum_{i=1}^{k} a_i = 1$. For every $i \in \{1, \ldots, k\}$, let also $\varepsilon_i(n)$ be a real valued sequence such that $b_i n$ [...] Let $T(n)$ be the real valued sequence that satisfies, for some positive constants $c$ and $d$, [...]

Figure 9: The decomposition tree of BiasedBinarySearch for $n = 8$.
Before stating our main result, we describe the main steps of our analysis on the algorithm BiasedBinarySearch. The expected number of iterations $L(n)$ of BiasedBinarySearch satisfies the relation [...] and $L(0) = 0$.
Thus, Theorem 4 applies and $L(n) \sim \lambda \log n$, with $\lambda = \frac{4}{4 \log 4 - 3 \log 3} \approx 1.78$. Unfortunately, we cannot directly transform the predictor into a Markov chain as we did in Section 4, because the probabilities $\frac{a_n}{n+1}$ and $\frac{b_n}{n+1}$ are not fixed anymore (they slightly depend on $n$). However, since $\frac{a_n}{n+1} \approx \frac{1}{4}$ (and $\frac{b_n}{n+1} \approx \frac{3}{4}$), this Markov chain should still yield a good approximation of the number of mispredictions with Theorem 2.
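The constant $\lambda$ can be checked by hand. The computation below is a sketch that assumes $\log$ denotes the natural logarithm (the numerical value 1.78 is only consistent with that reading) and observes that $\lambda$ is the inverse of the entropy $H$ of the $\frac{1}{4}$ / $\frac{3}{4}$ split; the symbol $H$ is introduced here for illustration.

```latex
% Entropy (in nats) of the 1/4 - 3/4 split:
H = -\tfrac{1}{4} \log \tfrac{1}{4} - \tfrac{3}{4} \log \tfrac{3}{4}
  = \log 4 - \tfrac{3}{4} \log 3 \approx 0.562 ,
\qquad
\lambda = \frac{1}{H} = \frac{4}{4 \log 4 - 3 \log 3} \approx 1.78 .

% In base 2, for comparison with the balanced search
% (about \log_2 n iterations):
\lambda \log n = (\lambda \ln 2) \, \log_2 n \approx 1.23 \, \log_2 n .
```

In other words, under this reading the biased search performs roughly 23% more iterations than the balanced one, which is the price paid for more predictable branches.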
A convenient way to prove this formally is to introduce the decomposition tree $\mathcal{T}$ associated with the search algorithms, which is defined as follows. If the input has size $n$, its root is labeled by the pair $(0, n)$, and each node corresponds to the possible values of $d$ and $f$ during one loop of the algorithm. The leaves are the pairs $(i, i)$, for $i \in \{0, \ldots, n\}$; they are identified with the output of the algorithm in $\{0, \ldots, n\}$. There is a directed edge between $(d, f)$ and $(d', f')$ whenever the variables $d$ and $f$ can be changed into $d'$ and $f'$ during the current iteration of the loop. Such an edge is labeled with the probability $\frac{f' - d' + 1}{f - d + 1}$, which is the probability that this update happens in our model. An example of such a decomposition tree for BiasedBinarySearch is depicted in Figure 9.
By construction, following a path from the root to a leaf, by choosing between left and right according to the edge probability, is exactly the same as choosing an integer uniformly at random in $\{0, \ldots, n\}$. Let $u = (u_0, u_1, \ldots)$ be an infinite sequence of elements of $[0, 1]$ taken uniformly at random and independently. To $u$ is associated its path $\mathrm{Path}_n(\mathcal{T}, u)$ in $\mathcal{T}$ where, at step $i$, we go to the left if $u_i$ is smaller than the left child edge probability and to the right otherwise. Let $L_n(\mathcal{T}, u)$ be the length of $\mathrm{Path}_n(\mathcal{T}, u)$. Let also $\mathrm{Path}_n(\mathcal{I}, u)$ be the path following the values in $u$ in the ideal (infinite) tree $\mathcal{I}$ where we go to the left with probability $\frac{1}{4}$ and to the right with probability $\frac{3}{4}$. Then the following result holds.

Lemma 5. The probability that $\mathrm{Path}_n(\mathcal{T}, u)$ and $\mathrm{Path}_n(\mathcal{I}, u)$ differ at one of the first [...]

Hence, the algorithm BiasedBinarySearch behaves almost like the idealized version for most of the iterations of its main loop, and we have a sufficiently precise estimation of the error term. This is enough to prove that the idealized version is a correct first order approximation of the number of mispredictions. The same construction can be done for all three algorithms, yielding the following result.
Figure 10 A fully global predictor scheme: the history table of size 2 keeps track of the outcomes of the last branches encountered during the execution, the last one corresponding to the rightmost bit. To each sequence of branches is associated a global 2-bit predictor (shared by all the conditional branches).

Theorem 6. Let C n and M n be the number of comparisons and mispredictions performed in our model of randomness. For
where µ is the expected misprediction probability associated with the predictor.

Analysis of the global predictor for SkewSearch
In this section we give hints about the behavior of a global branch predictor, such as the one depicted in Figure 10 (see also Section 2), for the algorithm SkewSearch. Notice in particular that the predictor of each entry is a 2-bit saturating counter. This is not the only possible choice of a global predictor, but it is simple enough without being trivial. We carry out the analysis in an idealized framework that resembles the real case sufficiently well, ignoring the rounding effects of dealing with integers. We saw in the previous section why these approximations still give the correct first-order asymptotics.
In our idealized model we only consider the sequence of taken / not taken produced by the two conditional tests of SkewSearch. We deliberately do not consider the conditional induced by the test within the "while" loop, which would be always not taken in our settings (except for the very last step). Adding it would complicate the model without adding interesting information to the branch predictor. We encode a taken conditional by a 1 and a not taken conditional by a 0. The trace of an execution of the algorithm is thus a non-empty word on the binary alphabet $B = \{0, 1\}$. Because of the way the two conditional tests are nested within the algorithm, we can keep track of the current "if" by the use of the simple deterministic automaton $\mathcal{A}_{\mathrm{if}}$ with two states depicted in Figure 11: main stands for the first conditional and nested for the second one. In our model, main is taken with probability $\frac{1}{4}$ and nested with probability $\frac{1}{3}$. As done in Section 4, $\mathcal{A}_{\mathrm{if}}$ can be changed into a Markov chain $M_{\mathrm{if}}$ using these transition probabilities. A direct computation shows that its stationary vector $\pi_{\mathrm{if}}$ satisfies $\pi_{\mathrm{if}}(\text{main}) = \frac{4}{7}$ and $\pi_{\mathrm{if}}(\text{nested}) = \frac{3}{7}$.
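The stationary vector can be recovered numerically by power iteration. The sketch below encodes one reading of the automaton of Figure 11 (an assumption, since the figure is not reproduced here): from main, a taken test (probability 1/4) stays in main and a not taken test moves to nested; from nested, either outcome returns to main. The names `M_IF` and `stationary_if` are hypothetical.

```c
/* Transition matrix of the chain; state 0 = main, state 1 = nested.
   This encoding is an assumed reading of Figure 11. */
static const double M_IF[2][2] = {
    { 0.25, 0.75 },  /* main: taken (1/4) -> main, not taken (3/4) -> nested */
    { 1.0,  0.0  },  /* nested: either outcome leads back to main            */
};

/* Power iteration: applies pi <- pi * M_IF the requested number of times,
   starting from the distribution passed in pi. */
void stationary_if(double pi[2], int steps) {
    for (int s = 0; s < steps; s++) {
        double next0 = pi[0] * M_IF[0][0] + pi[1] * M_IF[1][0];
        double next1 = pi[0] * M_IF[0][1] + pi[1] * M_IF[1][1];
        pi[0] = next0;
        pi[1] = next1;
    }
}
```

Starting from `{1.0, 0.0}`, a couple of hundred steps are enough for the vector to settle near $(4/7, 3/7)$, since the second eigenvalue of this matrix is $-3/4$.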
For the same reason as above, in the global table, we only record the history for the two conditionals main and nested. Let $\ell$ denote the history length, that is, the number of bits used in the history table of Figure 10. We assume that $\ell$ is even. A history $h$ is thus seen as a binary word of length $\ell$. Let $0^\ell$ be the history made of 0's only.
When a conditional is tested at time $t$, the predictor uses the entry at position $h_t$ to make the prediction, where $h_t$ is the current history. To follow the evolution of the algorithm at time $t + 1$, we therefore only have to keep track of (1) the history table $T_t$, (2) the current history $h_t$ and (3) which of the two conditionals $\mathrm{IF}_t$ is under consideration. Knowing $\mathrm{IF}_t$ is required in order to compute the probability that the next outcome is 0 or 1. This defines a Markov chain $M_{\mathrm{up}}$ for the updates in the history table. From $M_{\mathrm{up}}$, one can theoretically estimate the expected number of mispredictions using Theorem 2, as we did for local predictors. The main issue with this approach is that computing $\pi_{\mathrm{up}}$ is typically in $O(m^3)$, where $m$ is the number of states of $M_{\mathrm{up}}$. Since the number of states is exponential in $\ell$, the computations are completely intractable for reasonable history lengths (such as $\ell \geq 6$), even if we first remove the unreachable states. In the sequel, we therefore use the particular structure of $M_{\mathrm{up}}$ to directly compute the typical number of mispredictions.
Let $h \in B^\ell$ be a history that is not equal to $0^\ell$. There is at least one 1 in $h$. Since reading a 1 always sends to state main in $\mathcal{A}_{\mathrm{if}}$, we know for sure the conditional $\mathrm{IF}_t$ under consideration when an occurrence of $h$ has just happened at time $t$. Hence, we know the probability to have a 0 or a 1 at time $t + 1$, given that $h_t = h$. As a consequence, each entry $h \neq 0^\ell$ in the table $T$ behaves like a fixed-probability local 2-bit saturating predictor, with probability $\frac{1}{4}$ (resp. $\frac{1}{3}$) for histories associated to main (resp. to nested). Therefore, $h = 0^\ell$ concentrates all the differences between the local and the global predictors.
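To make the notion of a fixed-probability 2-bit saturating predictor concrete, the following sketch computes its expected misprediction probability from the stationary distribution of the four-state counter. The state numbering (0 and 1 predict not taken, 2 and 3 predict taken) and the function names are assumptions made for the illustration, not notation from the paper.

```c
/* Expected misprediction probability of a 2-bit saturating counter fed
   with independent outcomes that are taken with probability p.
   The counter moves up on taken and down on not taken, saturating at
   states 0 and 3; states 0, 1 predict not taken, states 2, 3 predict taken. */
double mu_2bit(double p) {
    double q = 1.0 - p;
    double pi[4] = { 0.25, 0.25, 0.25, 0.25 };
    for (int s = 0; s < 2000; s++) {          /* power iteration */
        double nx[4];
        nx[0] = q * (pi[0] + pi[1]);
        nx[1] = p * pi[0] + q * pi[2];
        nx[2] = p * pi[1] + q * pi[3];
        nx[3] = p * (pi[2] + pi[3]);
        for (int k = 0; k < 4; k++) pi[k] = nx[k];
    }
    /* Mispredict when "taken" occurs in states 0, 1 or "not taken" in 2, 3. */
    return p * (pi[0] + pi[1]) + q * (pi[2] + pi[3]);
}

/* Misprediction rate per conditional test for two local predictors,
   weighted by the stationary frequencies 4/7 (main) and 3/7 (nested). */
double local_skew_rate(void) {
    return 4.0 / 7.0 * mu_2bit(0.25) + 3.0 / 7.0 * mu_2bit(1.0 / 3.0);
}
```

Under these assumptions `mu_2bit(0.25)` is close to 3/10 and `mu_2bit(1.0 / 3.0)` is close to 2/5, which combine to about 12/35 per conditional test.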
What happens for the entry $0^\ell$ is well described by considering the automaton on pairs $(s, i)$, where $s$ is a state of the predictor and $i$ is the current conditional. This automaton can be turned into a Markov chain, and the Ergodic Theorem yields a precise estimation of the number of mispredictions. Following this idea yields the following result.

Theorem 7. For the global predictor, the average number of mispredictions caused during SkewSearch on an input of size $n$ is asymptotically equivalent to $\left(\frac{12}{35} + \frac{1}{595 \cdot 2^{\ell}}\right) E[C_n]$.

By Theorem 6, if we use a local 2-bit predictor for each conditional, the expected number of mispredictions is asymptotically equivalent to $\frac{12}{35} E[C_n]$. The difference with the global predictor is therefore extremely small, which is not surprising as there is a difference only when the history is $0^\ell$. However, if there is a competition between a global predictor and a more accurate local predictor (a 3-bit saturating counter for instance), then the local predictor performs better; it is probably slightly disrupted by the global one, as the dynamic selector between both predictors can choose to follow the global predictor from time to time.

Conclusion
In this article we propose unbalanced predictor-friendly versions of two very classical algorithms, namely exponentiation by squaring and binary search. Using a precise estimation of the expected number of mispredictions, we show that our new algorithms are worth considering when the cost of a comparison is reasonable compared to the cost of a misprediction. This is typically the case for primitive data types. We believe that these theoretical results, supported by experiments, advocate strongly for considering this particular feature of modern computers in the design and analysis of algorithms: we showed that taking branch prediction into account can yield significant improvements, even on very classical algorithms.

A.2.1 Proof of Proposition 1
Proposition 1. The expected number of mispredictions performed by NaiveMinmax for the uniform distribution on arrays of size $n$ is asymptotically equivalent to $4 \log n$ for the 1-bit predictor and to $2 \log n$ for the two 2-bit predictors and the 3-bit saturating counter. The expected number of mispredictions performed by $\frac{3}{2}$-Minmax for the uniform distribution on arrays of size $n$ is asymptotically equivalent to $\frac{n}{4}$ for all the considered predictors.
Proof. For a given positive integer $n$, let $[n]$ be the set $\{1, \ldots, n\}$. It is convenient for this proof to use the correspondence between cycles and records in a permutation. If one sees a permutation $\sigma$ as a one-to-one mapping from $[n]$ to $[n]$, its underlying graph is a set of labeled directed cycles.
In other words, we read the cycle starting from its smallest element. If the cycles $C_1, \ldots, C_m$ of $\sigma$ are ordered by decreasing order of their smallest element, then $f(\sigma)$ is defined by

Classically, $f$ is a bijection from the set $S_n$ of permutations of size $n$ onto itself, such that the number of min-records in $f(\sigma)$ is equal to the number of cycles of $\sigma$. Hence, the expected number of records in a uniform random element of $S_n$ is asymptotically equivalent to $\log n$ (see [7]).
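The correspondence between cycles and min-records can be illustrated with a small sketch. It assumes that $f(\sigma)$ is the classical construction in which the cycles, each read from its smallest element and sorted by decreasing smallest element, are concatenated; the defining formula is missing from the text above, so this reading is an assumption. Permutations are 0-based here, unlike the $[n] = \{1, \ldots, n\}$ convention of the proof, and the function names are hypothetical.

```c
#include <stdlib.h>

/* Writes the assumed f(sigma) into out (both arrays have length n and hold
   the values 0..n-1) and returns the number of cycles of sigma. */
int fundamental_bijection(const int *sigma, int n, int *out) {
    char *seen = calloc((size_t)n, 1);
    int *mins = malloc((size_t)n * sizeof *mins);
    int cycles = 0, k = 0;
    /* Scanning upward, the first unseen element is the minimum of its cycle. */
    for (int s = 0; s < n; s++) {
        if (seen[s]) continue;
        mins[cycles++] = s;
        for (int x = s; !seen[x]; x = sigma[x]) seen[x] = 1;
    }
    /* Emit the cycles by decreasing minimum, each read from its minimum. */
    for (int c = cycles - 1; c >= 0; c--) {
        int x = mins[c];
        do { out[k++] = x; x = sigma[x]; } while (x != mins[c]);
    }
    free(seen);
    free(mins);
    return cycles;
}

/* Number of min-records (left-to-right minima) of a[0..n-1]. */
int min_records(const int *a, int n) {
    int count = 0, best = 0;
    for (int i = 0; i < n; i++)
        if (i == 0 || a[i] < best) { best = a[i]; count++; }
    return count;
}
```

For instance, the permutation `{2, 0, 1, 4, 3, 5}` has the three cycles (0 2 1), (3 4) and (5); its image is `5 3 4 0 2 1`, whose min-records are 5, 3 and 0.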
In the sequel, we will use the fact that $f$ is a bijection by remarking that if $\xi$ is a real-valued random variable on uniform random permutations, the expectation of $\xi$ satisfies:

We first consider the 1-bit predictor. Let $\sigma$ be a permutation of $S_n$ whose cycles, ordered by decreasing order of their smallest element, are $C_1, \ldots, C_m$. We want to estimate the number of mispredictions $\xi(f(\sigma))$ caused by line 3 of NaiveMinmax applied to $\sigma$. We count the mispredictions cycle by cycle. Assume that the predictor is in state $NT$ (not taken) just before processing $f(C_i)$. As the first element of $f(C_i)$ is a min-record, it causes a misprediction and the predictor switches to state $T$ (taken). If $C_i$ has length at least 2, the second element of $f(C_i)$ is greater than the first one: it also causes a misprediction and the predictor goes back to state $NT$. The remaining elements of $C_i$, if any, are all greater than its first element and therefore cause no more mispredictions. Then, either $C_i$ has length 1, and processing $f(C_i)$ causes 1 misprediction, leaving the predictor in state $T$, or $C_i$ has length at least 2 and it causes 2 mispredictions, leaving the predictor in state $NT$. A similar study can be done when the starting state of the predictor is $T$, yielding the following

From this, we readily get that the number $\chi(f(\sigma))$ of mispredictions caused by the first conditional satisfies

since there can be two mispredictions caused by $f(C_i)$ only if $C_{i-1}$ has length at most 3. This concludes the proof for this predictor, by summing the contribution of both conditionals, as

The flip-on-consecutive 2-bit predictor and the 3-bit saturating counter are analyzed the same way, yielding the same results.
We now consider the algorithm $\frac{3}{2}$-Minmax. Using the model of $n$ random numbers in $[0, 1]$, it is straightforward to see that the first test in the loop of $\frac{3}{2}$-Minmax (line 3) causes a misprediction with probability $\frac{1}{2}$, for every predictor. Hence, this first test causes around $\frac{n}{4}$ mispredictions on average, when the algorithm ranges through the whole input. Moreover, the inner tests are true only when a min-record or max-record occurs. Using the same kinds of arguments as for Lemma 1, the expected number of mispredictions caused by these inner tests is in $O(\log n)$, concluding the proof.
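The $\frac{n}{4}$ estimate for the first test can be observed with a quick simulation. The sketch below is not the code of Figure 3: it assumes that the first test of the loop compares the two elements of each pair, models a 1-bit predictor as "predict the previous outcome", and uses a small xorshift generator only to make runs reproducible. All names are hypothetical.

```c
#include <stdint.h>

/* Deterministic xorshift64 generator, used only for reproducibility. */
static uint64_t rng_state = 88172645463325252ULL;

static double next_uniform(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (double)(rng_state >> 11) / 9007199254740992.0;  /* in [0, 1) */
}

/* Counts the mispredictions of a 1-bit predictor on the pairwise test
   a < b, over n random values processed two by two. */
long pair_test_mispredictions(long n) {
    int predicted = 0;            /* 1-bit predictor: repeats the last outcome */
    long mispredictions = 0;
    for (long i = 0; i + 1 < n; i += 2) {
        double a = next_uniform(), b = next_uniform();
        int taken = a < b;
        if (taken != predicted) mispredictions++;
        predicted = taken;
    }
    return mispredictions;
}
```

With `n` equal to one million, the count should land close to 250000, that is, about half of the `n / 2` executions of the test.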

A.2.2 Proof of Theorem 2
Theorem 2. Let $(M, \pi_0)$ be a primitive and aperiodic Markov chain on the finite set $S$. Let $\pi$ be its stationary distribution. Let also $E$ be a set of edges of $M$, that is, a set of pairs $(i, j) \in S^2$ such that $M(i, j) > 0$. For any nonnegative integer $n$, let $L_n$ be a random variable on nonnegative integers such that $\lim_{n \to \infty} E[L_n] = +\infty$. Let $X_n$ be the random variable that counts the number of edges in $E$ that are used during a random walk of length $L_n$ in $M$ (starting from any given distribution $\pi_0$). Then the following asymptotic equivalence holds:

Proof. From the classical Ergodic Theorem [10, p. 58], we get that if $F$ is a subset of $S$, and if $Y_n$ counts the number of times an element of $F$ is met during a random walk of length

We just have to modify it so that it works for edges and for random walks of random lengths. Let $B$ be the set of all edges of $M$. We consider the Markov chain of order two $M_2$ obtained from $M$ as follows: its set of states is $B$, and the only positive values of $M_2$ are $M_2(i \to j, j \to k) = M(j, k)$, for every $i \to j$ and $j \to k$ in $B$. It is straightforward to verify that $M_2$ is irreducible and aperiodic, and that its stationary vector $\pi_2$ satisfies $\pi_2(i \to j) = \pi(i) M(i, j)$. Applying the classical Ergodic Theorem on $M_2$ for the set of edges $E$ yields, if $Z_\ell$ counts the number of times an edge of $E$ is used during a random walk of length $\ell$ in $M$:

with $\varepsilon_\ell \to 0$ as $\ell$ tends to infinity.
We now prove the result for variable length random walks:

At this point we only have to prove that $E[L_n \varepsilon_{L_n}] = o(E[L_n])$ to conclude the proof. For any real $\alpha > 0$, there exists an integer $\ell_0$ such that, for every $\ell \geq \ell_0$, $\varepsilon_\ell \leq \frac{\alpha}{2}$. Hence,

with

Since $E[L_n]$ tends to infinity, for $n$ sufficiently large we have η0( 0−1)
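The identity $\pi_2(i \to j) = \pi(i) M(i, j)$ used in the proof can be checked numerically on a small chain. In the sketch below, the 3-state matrix is an arbitrary irreducible and aperiodic example (it does not come from the paper), and the function names are hypothetical. The code computes $\pi$ by power iteration and measures how far the candidate $\pi_2$ is from satisfying the stationarity equations of $M_2$.

```c
#include <math.h>

#define NS 3  /* number of states of the example chain */

/* Arbitrary example chain: 0 -> 1 -> 2 -> 0 is a cycle and 0 has a loop. */
static const double M[NS][NS] = {
    { 0.2, 0.8, 0.0 },
    { 0.3, 0.0, 0.7 },
    { 0.5, 0.5, 0.0 },
};

/* Stationary vector of M by power iteration from the uniform distribution. */
void stationary(double pi[NS]) {
    for (int i = 0; i < NS; i++) pi[i] = 1.0 / NS;
    for (int s = 0; s < 1000; s++) {
        double nx[NS] = { 0.0 };
        for (int i = 0; i < NS; i++)
            for (int j = 0; j < NS; j++) nx[j] += pi[i] * M[i][j];
        for (int j = 0; j < NS; j++) pi[j] = nx[j];
    }
}

/* Largest violation of the stationarity equations of M_2 by the candidate
   pi2(i -> j) = pi(i) M(i, j). A state of M_2 is an edge j -> k of M; it is
   entered from every edge i -> j with probability M(j, k). */
double edge_chain_residual(void) {
    double pi[NS], worst = 0.0;
    stationary(pi);
    for (int j = 0; j < NS; j++)
        for (int k = 0; k < NS; k++) {
            if (M[j][k] == 0.0) continue;          /* j -> k is not an edge */
            double inflow = 0.0;
            for (int i = 0; i < NS; i++)
                inflow += pi[i] * M[i][j] * M[j][k];
            double gap = fabs(inflow - pi[j] * M[j][k]);
            if (gap > worst) worst = gap;
        }
    return worst;
}
```

A residual at the level of rounding error is what the identity predicts: summing $\pi(i) M(i, j)$ over $i$ gives $\pi(j)$, so the inflow into the edge $j \to k$ is $\pi(j) M(j, k)$.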

A.2.3 Proof of Theorem 3
Theorem 3. Assume that $n$ is taken uniformly at random in $\{0, \ldots, N - 1\}$. The expected number of conditional tests in ClassicalPow and UnrolledPow is asymptotically equivalent to $\log_2 N$, whereas it is asymptotically equivalent to $\frac{5}{4} \log_2 N$ for GuidedPow. The expected number of mispredictions is asymptotically equivalent to $\frac{1}{2} \log_2 N$ for ClassicalPow and UnrolledPow, for any kind of predictor. For GuidedPow, it is asymptotically equivalent to $\alpha \log_2 N$, where $\alpha = \frac{1}{2} \mu(3/4) + \frac{3}{4} \mu(2/3)$.
Proof. The proof is done for $N = 4^k$. Some care is required for other values of $N$, since the bits are not 0's and 1's with probability exactly $\frac{1}{2}$, and since they are not completely independent. We explain in Section 5 how to deal with these approximations rigorously: it can also be done here in a similar way.
It is straightforward to prove, as we did for UnrolledPow, that the expected number of iterations of ClassicalPow is $\log_2 N + O(1)$. As each iteration performs one conditional test which is mispredicted with probability $\frac{1}{2}$, we get the announced results for this algorithm.
We already saw that the expected number of iterations of UnrolledPow and GuidedPow is asymptotically equivalent to $\log_4 N = \frac{1}{2} \log_2 N$. At each iteration, the first conditional is performed, and the other two are performed with probability $\frac{3}{4}$. This yields that the expected number of conditional tests is asymptotically equivalent to $\frac{5}{4} \log_2 N$. In UnrolledPow, each conditional is mispredicted with probability $\frac{1}{2}$, yielding the announced result.
For GuidedPow we use the Ergodic Theorem and the fact that $E[L_k] \sim k \sim \log_4 N$. The first conditional causes an expected number of mispredictions asymptotically equivalent to $\mu(3/4) \log_4 N$ and each of the two nested conditionals causes $\frac{3}{4} \mu(2/3) \log_4 N$ mispredictions, since they are tested with probability $\frac{3}{4}$. Hence, the expected number of mispredictions is asymptotically equivalent to $\left(\mu(3/4) + \frac{3}{2} \mu(2/3)\right) \log_4 N$, concluding the proof.
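As a worked instance, consider the 2-bit saturating counter. The values $\mu(1/4) = \frac{3}{10}$ and $\mu(1/3) = \frac{2}{5}$ are the ones quoted for this predictor in the analysis of the global predictor; the only extra assumption made here is the symmetry $\mu(p) = \mu(1 - p)$, which holds for a counter that treats taken and not taken symmetrically.

```latex
\begin{align*}
\Bigl(\mu(3/4) + \tfrac{3}{2}\,\mu(2/3)\Bigr) \log_4 N
  &= \Bigl(\tfrac{3}{10} + \tfrac{3}{2}\cdot\tfrac{2}{5}\Bigr)\cdot\tfrac{1}{2}\log_2 N
   = \tfrac{9}{20}\,\log_2 N, \\
\alpha = \tfrac{1}{2}\,\mu(3/4) + \tfrac{3}{4}\,\mu(2/3)
  &= \tfrac{3}{20} + \tfrac{6}{20} = \tfrac{9}{20} \;<\; \tfrac{1}{2}.
\end{align*}
```

Under these assumptions GuidedPow performs $\frac{5}{4} \log_2 N$ tests instead of $\log_2 N$, but about $0.45 \log_2 N$ mispredictions instead of $0.5 \log_2 N$.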

A.2.4 Proof of Lemma 5
Lemma 5. The probability that $\mathrm{Path}_n(\mathcal{T}, u)$ and $\mathrm{Path}_n(\mathcal{I}, u)$ differ at one of the first

Proof. Let $(d, f)$ be a node of $\mathcal{T}_n$ that is not a leaf, and let $t = f - d + 1$. The probability to go to the left child at the next step is p

And therefore, the probability that the two paths differ at this step is at most $\frac{\alpha}{t}$. Let $(d', f')$ be the node reached after $L_n(u) - \lambda_n$ steps. For $n$ sufficiently large, the range of a node is multiplied by a number that lies in $[\frac{1}{5}, \frac{4}{5}]$ (as $\frac{1}{5} < \frac{1}{4}$ and $\frac{4}{5} > \frac{3}{4}$). Hence, we have

As a consequence, all the nodes at distance at most $L_n(u) - \lambda_n$ from the root have a range that is greater than or equal to $\gamma_n$, with γ n = 5

and thus Theorem 4 applies, and yields the same result that f (n) = 3/4+

Proof (of Theorem 7). Let $w = w_0 w_1 \ldots w_{N-1} \in \{0, 1\}^N$ be the output of the two conditionals, where 1 stands for "taken" and 0 for "not taken". Let $h \neq 0^\ell$ be a history. Let $1 \leq \tau_1 < \tau_2 < \ldots < \tau_m \leq N - 1$ be the times $t$ such that $h_t = h$ (there is an occurrence of $h$ in $w$ that ends at position $t$). Of course, the $\tau_i$'s and $m$ depend on $N$ and $h$, and are random variables for random inputs of the algorithm. By the Ergodic Theorem, there exists a constant $\alpha_h > 0$ such that $E[m(h)] \sim \alpha_h N$, as $N$ tends to infinity. Indeed, if $\pi_{\mathrm{up}}$ is the stationary vector of the final strongly connected component of $M_{\mathrm{up}}$, the big Markov chain that captures everything, then $\alpha_h$ is the sum of the $\pi_{\mathrm{up}}(x)$, when $x$ ranges over the states such that the current history is $h$.
Observe also that the entry $h$ of the global table can only be updated at the times $\tau_i + 1$, that is, $T_t(h)$ is constant for $\tau_i + 1 \leq t \leq \tau_{i+1}$, with the convention that $\tau_{m+1} = N$.
Let $H_{\text{main}}$ (resp. $H_{\text{nested}}$) be the set of histories $h$ such that if $h_t = h$ then $\mathrm{IF}_t$ = main (resp. $\mathrm{IF}_t$ = nested). The sets $H_{\text{main}}$ and $H_{\text{nested}}$ are disjoint and contain all the histories of size $\ell$ but $0^\ell$. Let $h \in H_{\text{main}}$. From what we have just established, snapshots of $T_t(h)$ at times $\tau_1 + 1, \tau_2 + 1, \ldots, \tau_m + 1$ describe a random walk in the predictor of parameter $\frac{1}{4}$. Hence, the expected number of mispredictions caused by this history is asymptotically equivalent to $\mu(1/4) \alpha_h N$, by Theorem 2. Similarly, if $h \in H_{\text{nested}}$, the expected number of mispredictions caused by this history is asymptotically equivalent to $\mu(1/3) \alpha_h N$.
Let us analyze the behavior of the predictor of the entry $0^\ell$. When the history is $0^\ell$, the current conditional can be either main or nested. Thus, we have to distinguish the two cases and consider the pairs $(s, i)$, where $s$ is a state of the 2-bit predictor and $i$ is either main or nested. Thanks to the specificity of our problem, knowing the starting pair $(s, i)$ and whether the conditional is taken or not is enough to determine the next pair $(s', i')$ reached the next time the history is $0^\ell$: $s'$ is determined by the associated transition in the 2-bit predictor; if the conditional is taken then we have another occurrence of $0^\ell$ immediately after and $i' = i$, otherwise the next time the history is $0^\ell$ is immediately after a 1, then $i'$ = main as $\ell$ is even. This yields the automaton (and its associated Markov chain $M_0$) depicted in Figure 12. Note that for odd $\ell$ we obtain a different automaton, which can be studied in the same way; however, $\ell$ is always even in real computers.
The stationary vector $\pi_0$ of $M_0$ can easily be computed. In particular, we get that the stationary probability of being in a main-state is $p_{\text{main}} = \frac{4}{7}$ and of being in a nested-state is $p_{\text{nested}} = \frac{3}{7}$ (this would be different for odd $\ell$). Moreover, the expected misprediction probability of $M_0$ is $\mu_0 = \frac{41}{119}$. The stationary probability that the history is $0^\ell$ is $p_0 = (3/4)^{\ell/2} (1/3)^{\ell/2} = 2^{-\ell}$. Hence, by the Ergodic Theorem, the number $G_n$ of mispredictions for the global predictor has an expectation that is asymptotically equivalent to

As $\mu_2(1/4) = \frac{3}{10}$ and $\mu_2(1/3) = \frac{2}{5}$, we get the announced result: $E[G_n] \sim \left(\frac{12}{35} + \frac{1}{595 \cdot 2^{\ell}}\right) E[C_n]$.
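The constants of Theorem 7 can be recombined by hand. The computation below is a sketch under one assumption that the text does not spell out: the histories other than $0^\ell$ are taken to contribute the local rate $\frac{12}{35}$ with total weight $1 - p_0$, which is consistent with the weights $\frac{4}{7}$ and $\frac{3}{7}$ being the same inside and outside the entry $0^\ell$.

```latex
\begin{align*}
\tfrac{4}{7}\,\mu_2(1/4) + \tfrac{3}{7}\,\mu_2(1/3)
  &= \tfrac{4}{7}\cdot\tfrac{3}{10} + \tfrac{3}{7}\cdot\tfrac{2}{5}
   = \tfrac{6}{35} + \tfrac{6}{35} = \tfrac{12}{35}, \\
\mu_0 - \tfrac{12}{35}
  &= \tfrac{41}{119} - \tfrac{12}{35}
   = \tfrac{205 - 204}{595} = \tfrac{1}{595}, \\
(1 - p_0)\,\tfrac{12}{35} + p_0\,\mu_0
  &= \tfrac{12}{35} + p_0\Bigl(\mu_0 - \tfrac{12}{35}\Bigr)
   = \tfrac{12}{35} + \tfrac{1}{595 \cdot 2^{\ell}}
   \qquad\text{with } p_0 = 2^{-\ell}.
\end{align*}
```

The gap between the global and the local predictor is thus $\frac{1}{595}$ of the weight of the single entry $0^\ell$, which explains why it is so small.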

Figure 1 Execution time of simultaneous minimum and maximum searching.

Figure 3 Naive and optimized implementations of simultaneous maximum and minimum finding.

Figure 4 Three versions of exponentiation by squaring, in C. The & denotes the bitwise AND of the C language.

Figure 6 The saturating counter and its associated Markov chain for the first conditional of GuidedPow. The bold edges correspond to mispredictions.

$\frac{1}{2} \mu(3/4) + \frac{3}{4} \mu(2/3)$, where $\mu$ is the expected misprediction probability associated to the local predictor.

In both cases, T is an array of floats of size n and x is the number that is searched for. The classical binary search is obtained by replacing line 3 of BiasedBinarySearch by m = (d+f)/2;

Figure 7 Algorithms for the biased binary search and skew search. Both return the position where the element should be inserted.

Figure 8 Execution time of the three searching algorithms of Figure 7 for small-size arrays (that fit in the first-level cache) and medium-size arrays (that fit in the last-level cache).

Figure 12 The Markov chain corresponding to the entry $0^\ell$.

For any positive integer $i$, let $\mathrm{Cyc}_i(\sigma)$ be the number of cycles of length $i$ of $\sigma$. Let $\mathrm{Cyc}(\sigma) = \sum_{i \geq 1} \mathrm{Cyc}_i(\sigma)$ be the number of cycles of $\sigma$. We see the table above as follows: there is one misprediction if the starting state is $NT$, plus one misprediction if the cycle has length at least 2. As the starting state is $NT$ when the previous cycle has length at least 2, we get that the number of mispredictions caused by $f(\sigma)$ at line 3 is $\xi(f(\sigma)) = 2\,\mathrm{Cyc}(\sigma) - 2\,\mathrm{Cyc}_1(\sigma) + O(1)$, where the $O(1)$ captures the border effects (initial configuration of the predictor and whether the last cycle of $\sigma$ has length 1 or not). This concludes the proof for the 1-bit predictor, since $E_n[\mathrm{Cyc}] \sim \log n$ and $E_n[\mathrm{Cyc}_1] \to 1$, as $n$ tends to infinity. There is a factor two since we have to add the number of mispredictions of both conditionals.

For the 2-bit saturating counter predictor, we use the same technique. If, for instance, the starting state is $NT$ when beginning the process of $f(C_i)$, then there is a misprediction and the predictor switches to $T$. If $C_i$ has length at least 2, then the next element also causes a misprediction and the predictor goes back to $NT$. If it has length at least 3, then the predictor is set to $SNT$ with no further misprediction. All useful cases are depicted in the following table:
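The count $2\,\mathrm{Cyc}(\sigma) - 2\,\mathrm{Cyc}_1(\sigma) + O(1)$ for the 1-bit predictor can be checked on any concrete array. The sketch below makes two assumptions about the initial configuration, which only affect the $O(1)$ term: the predictor starts in state not taken and the first element counts as a min-record. Blocks running from one min-record to just before the next play the role of the cycles of $\sigma$; the function names are hypothetical.

```c
/* Mispredictions of a 1-bit predictor (it predicts the previous outcome)
   on the test a[i] < min, scanning a[0..n-1]. */
long one_bit_mispredictions(const int *a, int n) {
    int predicted = 0;   /* initial state: not taken (an assumption) */
    int min = 0;
    long miss = 0;
    for (int i = 0; i < n; i++) {
        int taken = (i == 0) || (a[i] < min);  /* a[0] counts as a record */
        if (taken) min = a[i];
        if (taken != predicted) miss++;
        predicted = taken;
    }
    return miss;
}

/* Number of blocks of length at least 2, that is, Cyc - Cyc_1 when the
   array is the image f(sigma) of a permutation. */
long long_blocks(const int *a, int n) {
    int min = 0, prev_record = 0;
    long blocks = 0;
    for (int i = 0; i < n; i++) {
        int record = (i == 0) || (a[i] < min);
        if (record) min = a[i];
        else if (prev_record) blocks++;   /* second element of a block */
        prev_record = record;
    }
    return blocks;
}
```

On the array `5 3 4 0 2 1`, the blocks are (5), (3 4) and (0 2 1): two of them have length at least 2, and the predictor is wrong four times. In general the two quantities `one_bit_mispredictions` and `2 * long_blocks` differ by at most 1 under the stated assumptions.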

A.2.6 Proof of Theorem 7
Theorem 7. For the global predictor, the average number of mispredictions caused during SkewSearch on an input of size $n$ is asymptotically equivalent to $\left(\frac{12}{35} + \frac{1}{595 \cdot 2^{\ell}}\right) E[C_n]$.