Minimizing I/Os in Out-of-Core Task Tree Scheduling

Scientific applications are usually described as directed acyclic graphs, where nodes represent tasks and edges represent dependencies between tasks. For some applications, such as the multifrontal method of sparse matrix factorization, this graph is a tree: each task produces a single output data, used by a single task (its parent in the tree). We focus on the case when the data manipulated by tasks have a large size, which is especially the case in the multifrontal method. To process a task, both its inputs and its output must fit in the main memory. Moreover, output results of tasks have to be stored between their production and their use by the parent task. It may therefore happen, during an execution, that not all data fit together in memory. In particular, this is the case if the total available memory is smaller than the minimum memory required to process the whole tree. In such a case, some data have to be temporarily written to disk and read afterwards. These Input/Output (I/O) operations are very expensive; hence, the need to minimize them. We revisit this open problem in this paper. Specifically, our goal is to minimize the total volume of I/O while processing a given task tree. We first formalize and generalize known results, then prove that existing solutions can be arbitrarily worse than optimal. Finally, we propose a novel heuristic algorithm, based on the optimal tree traversal for memory minimization. We demonstrate good performance of this new heuristic through simulations on both synthetic trees and realistic trees built from actual sparse matrices.


Introduction
Scientific applications are often modeled by directed task graphs, where nodes represent tasks and edges represent the dependencies between tasks. There is an abundant literature on task graph scheduling when the objective is to minimize the total completion time, or makespan. However, with the increase of the size of the data to be processed, the memory footprint of the application can have a dramatic impact on the algorithm execution time, and thus needs to be optimized. When handling very large data, the available main memory may be too small to simultaneously handle all data needed by the computation. In this case, we have to resort to using disk as a secondary storage, which is sometimes known as out-of-core execution. The cost of the Input/Output (I/O) operations to transfer data to and from the disk is known to be several orders of magnitude larger than the cost of accessing the main memory. Thus, in the case of out-of-core execution, it is a natural objective to minimize the total volume of I/O.
In the present paper, we consider the scheduling of rooted in-trees. Such tree-shaped task graphs arise in several computational domains, such as the factorization of sparse matrices [8], or in computational physics codes modeling electronic properties [15]. Dependencies between tasks, modeled by the edges of the tree, represent the tasks' input and output data: each task uses all the data produced by its children to output new data for its parent. In particular, a task must have enough available memory to fit the input from all its children.
It is known that the problem of minimizing the peak memory M_peak of a tree traversal, that is, the minimum amount of memory needed to process a tree, is polynomial [13,19]. However, it may well happen that the available amount of memory M is smaller than the peak memory M_peak. In this case, we have to decide which child's data, or which part of a child's data, has to be written to disk. In a previous study [13], we focused on the case when a child's data cannot be partially written to disk, and we proved that this variant of the problem is NP-complete. However, it is usually possible to split data that resides in memory, and write only part of it to the disk if needed. This is for instance what is done by operating systems using paging: all data is divided into same-size pages, which can be moved from main memory to secondary storage when needed. Since all modern computer systems implement paging, we consider here that data can be split into parts (corresponding to system pages), each of which can be written to disk.
Note that the present study focuses on sequential scheduling for task trees. Even if parallel processing is necessary for most large-scale scientific applications, we claim that such a sequential study is needed: (i) for memory-constrained applications, a sequential traversal of the task tree is preferred, as it lowers the memory pressure, but each task is performed in parallel on multiple cores; (ii) if the ultimate goal is to obtain a parallel schedule of the tree, it is important to first investigate the sequential problem before moving to its more complex, bi-criteria (I/O and makespan) parallel variant. The present paper is therefore a step towards understanding the sequential version of this problem.
The main contributions of this work are:
• A formalization, in a common framework, of previous results from the literature.
• A proof of optimality of postorder traversals when trees are homogeneous (all output data have the same size). An algorithm to compute the best postorder traversal was previously proposed by E. Agullo [1]; our proof shows that this algorithm is optimal for homogeneous trees.
• A proof that neither the best postorder traversal nor the memory-peak minimization algorithms are approximation algorithms for minimizing the I/O volume.
• An Integer Linear Programming (ILP) formulation of the considered problem that allows us to compute an optimal solution for small-scale problems.
• The design of a new heuristic that takes advantage of peak-memory-optimizing algorithms.
• An extensive experimental comparison of all available strategies (including the ILP for small test cases) through simulations on both synthetic and realistic trees built from actual sparse matrices. These simulations show the very good performance of the proposed solution.
The rest of this paper is organized as follows. We give an overview of the related work in Section 2. Then in Section 3 we formalize our model and present elementary results. Existing solutions are studied in Section 4. An optimal ILP is presented in Section 5. We introduce a new heuristic in Section 6 and evaluate its performance through simulations in Section 7. We finally conclude and present future directions in Section 8.

Related work
As stated above, rooted trees are commonly used to represent task dependencies for scientific applications. This is, for example, the case for some computational physics codes modeling the electronic properties of semiconductors and metals [15,17,22], and for the accurate modeling of the electronic structure of atoms and molecules in quantum chemistry [5,12]. In the domain of sparse linear algebra, Liu [18] gives a detailed description of the construction of the elimination tree, its use for Cholesky and LU (Lower-Upper) factorizations, and its role in multifrontal direct methods: during the factorization, the computations are organized as a task tree, and the huge size of the data involved makes it absolutely necessary to reduce the memory requirement of the factorization. Note that peak memory minimization is still a crucial question for direct solvers, as highlighted by Agullo et al. [2], who study the effect of processor mapping on memory consumption for multifrontal methods.
Memory and storage have always been a limited parameter for large computations, as outlined by the pioneering work of Sethi and Ullman [25] on register allocation for task trees. In the realm of sparse direct solvers, the problem of scheduling a tree so as to minimize peak memory has first been investigated by Liu [20] in the sequential case: he proposed an algorithm to find a peak-memory-minimizing traversal of a task tree when the traversal is required to correspond to a postorder traversal of the tree. A postorder traversal requires that each subtree of a given node must be fully processed before the processing of another subtree can begin. A follow-up study [19] presented an optimal algorithm to solve the general problem, without the postorder constraint on the traversal. Postorder traversals are known to be arbitrarily worse than optimal traversals for memory minimization [13]. However, they are very natural and straightforward solutions to this problem, as they allow us to fully process one subtree before starting a new one. Therefore, they are widely used in sparse matrix software like MUMPS [3,4] (MUltifrontal Massively Parallel sparse direct Solver), and achieve good performance on actual elimination trees [13].
The problem of minimizing I/Os, and not only peak memory, has been formalized as the so-called red-blue pebble game [14], where pebbles have to be placed on the computation DAG and the two colors represent the small, fast memory (red) and the large, slow memory (blue). This model is adapted to unit-size data, and several hardness and inapproximability results have been developed for several variants; see [23] among the recent work. Our problem restricted to unit-size data falls under the one-shot model in the terminology adopted in [23]. On general task graphs (i.e., directed acyclic graphs), this problem is shown to be NP-hard and impossible to approximate within a factor less than 2 unless the Unique Games Conjecture is invalid. Note that the fact that we show an optimal algorithm for unit-size data (Section 4) directly implies a polynomial-time solution for the red-blue pebble game on trees.
As mentioned in the introduction, the problem of minimizing the I/O volume when traversing a tree has been studied in [13] with the constraint that each node's data either stays in the memory or has to be written wholly to disk. Here we study the case when we have the option to store part of the data, which is also the topic of E. Agullo's PhD thesis [1]. In his thesis, Agullo exhibited the best postorder traversal for minimizing I/O volume, which we adapt to our model in Section 4.1. He also studied numerous variants of the model that are important for direct solvers, as well as other memory management issues, both for sequential and parallel processing. Based on these preliminaries, he finally presented an out-of-core version of the MUMPS solver.
Finally, out-of-core execution is a well-known approach for computing on large data, especially (but not exclusively) in linear algebra [24,26].

Model and notation
As introduced above, we assume that we have an available memory (or primary storage) of limited size M , and a disk (or secondary storage) of unlimited size.
We consider a workflow of tasks whose precedence constraints are modeled by a tree of tasks G = (V, E). Its nodes v ∈ V represent tasks and its edges e ∈ E represent dependencies. All dependencies are directed toward the root (denoted by root): a node can only be executed after the termination of all its children. The output data of a node i occupies a size w i in the main memory. This data may be written totally or partially to the disk after task i produces it. In order for a node to be executed, the output data of all its children must be entirely stored in the main memory. An amount of memory m can be moved between the memory and the disk at a cost of m I/O operations, regardless of which data it corresponds to. We assume that all memory values (M , w i ) are given in an appropriate unit (such as kilobytes) and are integers. We divide the main memory into slots, where each slot holds one such unit of memory.
At the beginning of the computation of a task i, the output data of i's children must be in memory, while at the end of its computation, its own output data must be in memory. The amount of memory needed in order to execute node i is thus $\bar{w}_i = \max\left(w_i,\ \sum_{j \in \mathit{Chil}(i)} w_j\right)$, where Chil(i) denotes the set of children of i. We assume that M is at least as large as every $\bar{w}_i$, as otherwise the tree cannot be processed.
Our objective is to find a solution minimizing the total I/O volume. A solution needs to give the order in which nodes should be executed, and how much of each node should be written out during I/O operations. In particular, for a tree of n tasks, we define a solution to our problem as a permutation σ of [1 . . . n] and a function τ. We call such a solution a traversal. The permutation σ represents the schedule of the nodes, that is, σ(i) = t means that task i is the t-th task computed. The function τ represents the amount of I/Os for each task: τ(i) = m means that m units of the output data of task i are written to disk (see below). Note that we do not need to clarify which part of the data is written to disk, as our cost function only depends on the volume. We assume without loss of generality that when τ(i) ≠ 0, the write operation on the output data of task i is performed right after task i completes (and produces the data), and the read operation is performed just before the use of this data by task i's parent. Finally, since there are exactly the same number of read and write operations, we only count the write operations.
In order for a traversal to be valid, it must respect the following conditions:
• Tasks are processed in topological order: $\forall (i,j) \in E,\ \sigma(i) < \sigma(j)$. We say that a node i of parent j is active at step t under the schedule σ if σ(i) < t < σ(j). This means that its output data is partially in memory and/or partially written to disk at time t.
• The amount of data written to disk never exceeds the size of the data: $\forall i \in G,\ 0 \le \tau(i) \le w_i$.
• Enough memory remains available for the processing of each task (taking into account active nodes): $\forall i \in G,\ \sum_{(k,p) \in E,\ \sigma(k) < \sigma(i) < \sigma(p)} \left(w_k - \tau(k)\right) \le M - \bar{w}_i$ (1), where $\bar{w}_i$ denotes the amount of memory needed to execute node i.
The problem we are considering in this paper, called MinIO, is to find a valid traversal that minimizes the total amount of I/Os, given by $\sum_{i \in G} \tau(i)$.
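The three validity conditions can be checked mechanically. The following is a minimal sketch, not the authors' code: it assumes the tree is given as a hypothetical `parent` map (the root maps to `None`), weights as a map `w`, and that the memory needed to execute node i is the larger of its output size and the total size of its children's outputs, as the model section describes. The 5-node tree is invented for illustration.

```python
def children_of(parent):
    """Build a children map from a parent map (the root has parent None)."""
    chil = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            chil[p].append(v)
    return chil


def is_valid_traversal(parent, w, sigma, tau, M):
    """Check the three validity conditions of a traversal (sigma, tau)."""
    chil = children_of(parent)
    # sigma must be a permutation of 1..n
    if sorted(sigma.values()) != list(range(1, len(parent) + 1)):
        return False
    for i, p in parent.items():
        # Condition 1: every node is executed before its parent.
        if p is not None and sigma[i] >= sigma[p]:
            return False
        # Condition 2: the written volume never exceeds the data size.
        if not 0 <= tau[i] <= w[i]:
            return False
    for i in parent:
        # Memory needed by i: all inputs at the start, its output at the end.
        need = max(w[i], sum(w[j] for j in chil[i]))
        # Condition 3: in-memory parts of the nodes active while i executes.
        active = sum(w[k] - tau[k] for k, p in parent.items()
                     if p is not None and sigma[k] < sigma[i] < sigma[p])
        if active + need > M:
            return False
    return True


def io_volume(tau):
    """Objective of MinIO: total volume written to disk."""
    return sum(tau.values())


# Hypothetical tree: leaves c and d, their parents a and b, root r.
parent = {"c": "a", "a": "r", "d": "b", "b": "r", "r": None}
w = {"c": 3, "a": 2, "d": 3, "b": 2, "r": 1}
sigma = {"c": 1, "a": 2, "d": 3, "b": 4, "r": 5}
tau_ok = {"c": 0, "a": 1, "d": 0, "b": 0, "r": 0}
tau_none = dict.fromkeys(w, 0)

print(is_valid_traversal(parent, w, sigma, tau_ok, M=4))    # True
print(is_valid_traversal(parent, w, sigma, tau_none, M=4))  # False
```

With M = 4, leaf d (size 3) cannot be processed while the 2 units of a are all in memory, so at least one unit of a has to be written out.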
We formally define a postorder traversal as a traversal σ such that, for any node i and for any node k outside the subtree T i rooted at i, we have either ∀j ∈ T i , σ(k) < σ (j) or ∀j ∈ T i , σ(j) < σ(k).
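Given a schedule that respects the precedence constraints, this definition is equivalent to requiring that each subtree occupies a contiguous block of positions. A small sketch of that check follows; the `parent` map and the example tree are assumptions made for illustration.

```python
def is_postorder(parent, order):
    """Return True if `order` (list of nodes) is a postorder of the tree."""
    pos = {v: t for t, v in enumerate(order)}
    # A schedule must execute every node before its parent.
    if any(p is not None and pos[v] > pos[p] for v, p in parent.items()):
        return False
    size = {v: 1 for v in parent}   # number of nodes in each subtree
    first = dict(pos)               # earliest position used by each subtree
    for v in order:                 # children are visited before parents
        p = parent[v]
        if p is not None:
            size[p] += size[v]
            first[p] = min(first[p], first[v])
    # The subtree of v ends at pos[v]; it is contiguous iff its span
    # contains exactly size[v] positions.
    return all(pos[v] - first[v] + 1 == size[v] for v in parent)


# Hypothetical tree: leaves c and d, their parents a and b, root r.
parent = {"c": "a", "a": "r", "d": "b", "b": "r", "r": None}
print(is_postorder(parent, ["c", "a", "d", "b", "r"]))  # True
print(is_postorder(parent, ["c", "d", "a", "b", "r"]))  # False
```

In the second schedule, d is executed between c and a, so it falls strictly inside the span of the subtree rooted at a.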

Towards a compact solution
Although a traversal is described by both the schedule σ and the I/O function τ, the following results show that one can be deduced from the other. The first result is adapted from [1, Property 2.1], which has the same result limited to postorder traversals (see Section 2). It states that given a schedule σ, it is easy to derive an I/O scheme τ which minimizes the I/O volume of the traversal (σ, τ).
Theorem 1. We consider a tree G, a memory bound M, and a schedule σ. The I/O function τ following the Furthest in the Future policy achieves the best performance under σ.
The I/O function τ following the Furthest in the Future (FiF) policy is defined as follows: during the execution of σ, whenever the memory exceeds the limit M, I/O operations are performed on the active nodes that will remain active the furthest in the future, i.e., those whose parents are executed last in the schedule σ. This result is similar to Belady's rule, which states that the offline cache replacement policy MIN is optimal [6,16]. MIN evicts from the cache the data that will be used the latest.
Proof. Given a tree G, a memory bound M , a schedule σ, and an I/O function τ that does not respect the FiF policy, it is straightforward to transform τ into another I/O function τ ′ following the rule. Consider the first step when an I/O is performed on data i that is not the last to be used among active data. Let j denote the last-used among active data (so FiF would evict j). We can safely increase τ ′ (j) and decrease τ ′ (i) until either τ ′ (j) = w j or τ ′ (i) = 0. As j is active longer than i is, the memory freed by τ ′ is available for a longer time than the one freed by τ , which keeps the traversal valid. Repeating this transformation, we produce an I/O function which respects the FiF policy.
On the other hand, if we have an I/O function τ describing how much of each node is written to disk, we can compute a schedule σ such that (σ, τ ) is a valid traversal (if such a schedule exists).
Theorem 2. We consider a tree G, a memory bound M , and an I/O function τ for which there exists a valid schedule. Such a schedule can be computed in polynomial time.
The proof of this result is delegated to Section 6 where we use a similar method to derive a heuristic: once we know where the I/O operations take place, we may transform the tree by expanding some nodes to make these I/O operations explicit within the tree structure. If a valid traversal using τ exists, the resulting tree may be completely scheduled without any additional I/Os, and such a schedule can be computed using an optimal scheduling algorithm for memory minimization.
Both previous results allow us to describe solutions in a more compact format (as either a schedule or an I/O function). However, this does not make the problem less combinatorial: there are n! possible schedules and already 2^n functions τ even if we restrict ourselves to functions such that τ(i) = 0 or τ(i) = w_i.
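For very small trees, the schedule-only description can be exploited by exhaustive search: enumerate every topological order and evaluate it with the FiF policy. The sketch below is an illustration of this idea under assumed data structures (a `parent` map and hypothetical function names); it is exponential in n and is not one of the methods studied in the paper.

```python
from itertools import permutations


def fif_volume(parent, chil, w, order, M):
    """Total volume written by the FiF policy for schedule `order`."""
    pos = {v: t for t, v in enumerate(order)}
    resident, total = {}, 0
    for i in order:
        need = max(w[i], sum(w[j] for j in chil[i]))
        for j in chil[i]:
            del resident[j]       # inputs are counted in `need`
        excess = sum(resident.values()) + need - M
        # Evict first the active data whose parent is executed last.
        for k in sorted(resident, key=lambda k: pos[parent[k]], reverse=True):
            if excess <= 0:
                break
            moved = min(excess, resident[k])
            resident[k] -= moved
            total += moved
            excess -= moved
        if parent[i] is not None:
            resident[i] = w[i]
    return total


def brute_force_min_io(parent, w, M):
    """Best (volume, schedule) over all topological orders.
    Assumes M is large enough to execute each task on its own."""
    chil = {v: [] for v in parent}
    for v, p in parent.items():
        if p is not None:
            chil[p].append(v)
    best = None
    for order in permutations(parent):
        pos = {v: t for t, v in enumerate(order)}
        if any(p is not None and pos[v] > pos[p] for v, p in parent.items()):
            continue              # not a valid schedule
        cost = fif_volume(parent, chil, w, order, M)
        if best is None or cost < best[0]:
            best = (cost, list(order))
    return best


# Hypothetical tree: leaves c and d, their parents a and b, root r.
parent = {"c": "a", "a": "r", "d": "b", "b": "r", "r": None}
w = {"c": 3, "a": 2, "d": 3, "b": 2, "r": 1}
best_cost, best_order = brute_force_min_io(parent, w, M=4)
print(best_cost, best_order)
```

Here the second leaf needs 3 units while at least 2 units of the other branch are active, so no schedule avoids I/Os with M = 4 and the minimum volume is 1.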

Related algorithms
As mentioned in Section 2, the problem of minimizing peak memory, denoted MinMem, is closely related to our problem, and has been extensively studied. In this problem, the available memory is unbounded (which means no I/Os are required) and we look for a schedule that minimizes the peak memory, i.e., the maximum amount of memory used at any time during the execution. There are at least two important algorithms for this problem, which we use in the present paper:
• It is possible to compute a schedule minimizing the peak memory in polynomial time, as proven by Liu [19]. We refer to this algorithm as OptMinMem.
• The best postorder traversal for peak-memory minimization can also be computed in polynomial time [20]. We refer to this algorithm as PostOrderMinMem.

Existing solutions are not satisfactory for heterogeneous data
We now detail two existing solutions for the MinIO problem. The first one is the best postorder traversal for MinIO proposed by Agullo [1]. We show that it is optimal if all data have unit size. The second uses the optimal traversal for MinMem proposed by Liu [19], and then applies Theorem 1 to obtain a valid traversal. After presenting these algorithms, we prove that neither of them is constant-factor competitive compared to the optimal traversal.

Computing the best postorder traversal
For the sake of completeness, we present the algorithm computing the best postorder traversal for MinIO from [1] and adapt it to our model. Recall that in a postorder traversal, when a node is processed, its whole subtree must be processed before any other external node may be started. Given a node i and a postorder schedule σ, we first recursively define S_i as the storage requirement of the subtree T_i rooted at i. Let Chil(i) be the children of i. Then:
$S_i = \max\left(w_i,\ \max_{j \in \mathit{Chil}(i)} \left(S_j + \sum_{k \in \mathit{Chil}(i),\ \sigma(k) < \sigma(j)} w_k\right)\right)$
This expression represents the maximum memory peak reached during the execution of T_i. If the peak is obtained at the end of the execution, it is then equal to w_i. Otherwise, it appears during the execution of the subtree of some child j. In this case, the peak is composed of the weights of the children already processed, plus the peak S_j of T_j.
We may now consider A_i = min(M, S_i), which represents the amount of main memory used for the out-of-core execution of the subtree T_i by σ. We recursively define V_i as the volume of I/Os performed by σ during the execution of T_i when I/O operations are chosen using the FiF policy:
$V_i = \max\left(0,\ \max_{j \in \mathit{Chil}(i)} \left(A_j + \sum_{k \in \mathit{Chil}(i),\ \sigma(k) < \sigma(j)} w_k\right) - M\right) + \sum_{j \in \mathit{Chil}(i)} V_j$
The expression of V_i has a similar structure to the expression of S_i. No I/Os can be incurred when only the root i is in memory, hence w_i has no effect here. The second term accounts for the I/Os incurred on the children of i. Indeed, during the execution of node j, some parts of children of i must be written to disk if the memory peak exceeds M, and this quantity is at least $\left(A_j + \sum_{k \in \mathit{Chil}(i),\ \sigma(k) < \sigma(j)} w_k\right) - M$. The last term accounts for the I/Os occurring inside the subtrees. Note that such I/Os can only happen if the memory peak of the subtree exceeds M.
It remains to determine which postorder traversal minimizes the quantity V_root. Note that the only term sensitive to the ordering of the children of i in the expression of V_i is:
$\max_{j \in \mathit{Chil}(i)} \left(A_j + \sum_{k \in \mathit{Chil}(i),\ \sigma(k) < \sigma(j)} w_k\right)$
Theorem 3 states that sorting the children of i in decreasing order of A_j − w_j achieves the minimum V_i.

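Theorem 3 (Lemma 3.1 in [20]). Given a set of values $(x_i, y_i)_{1 \le i \le n}$, the minimum value of $\max_{1 \le i \le n} \left(x_i + \sum_{j=1}^{i-1} y_j\right)$ is obtained by sorting the sequence $(x_i, y_i)$ in decreasing order of $x_i - y_i$.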
Therefore, the postorder traversal that processes the children nodes by decreasing order of A i − w i minimizes the I/O cost among all postorder traversals. This traversal is described in Algorithm 1, initially called with r = root, and will be referred to as PostOrderMinIO. Note that in the algorithm ⊕ refers to the concatenation operation on lists.
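Algorithm 1: PostOrderMinIO(G, r)
Input: a tree G and a node r in G
Output: an ordered list ℓ_r of the nodes in the subtree rooted at r, corresponding to a postorder
for each child i of r do
    ℓ_i ← PostOrderMinIO(G, i)
    Compute the A_i value using postorder ℓ_i
Sort the children of r by decreasing value of A_i − w_i
ℓ_r ← the lists ℓ_i concatenated (⊕) in this order, followed by r
return ℓ_r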
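The recursion can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the implementation of [1]: it assumes a hypothetical `chil` map from each node to its list of children, and it computes the storage requirement S, the in-core footprint A = min(M, S), and the I/O volume V in the same pass, processing the children by decreasing A_j − w_j.

```python
def post_order_min_io(chil, w, M, r):
    """Return (order, S, A, V) for the subtree rooted at r."""
    results = [(j,) + post_order_min_io(chil, w, M, j)
               for j in chil.get(r, [])]
    # Process children by decreasing A_j - w_j (ties keep their input order).
    results.sort(key=lambda res: res[3] - w[res[0]], reverse=True)
    order = []
    prefix = 0        # total weight of the children already processed
    peak_s = w[r]     # unbounded-memory peak of the subtree
    peak_a = 0        # peak when each child subtree is capped at M
    volume = 0        # I/Os performed inside the child subtrees
    for j, order_j, s_j, a_j, v_j in results:
        order += order_j
        peak_s = max(peak_s, s_j + prefix)
        peak_a = max(peak_a, a_j + prefix)
        volume += v_j
        prefix += w[j]
    volume += max(0, peak_a - M)   # I/Os on the children of r themselves
    return order + [r], peak_s, min(M, peak_s), volume


# Hypothetical tree: leaves c and d, their parents a and b, root r.
chil = {"r": ["a", "b"], "a": ["c"], "b": ["d"]}
w = {"c": 3, "a": 2, "d": 3, "b": 2, "r": 1}
order, S, A, V = post_order_min_io(chil, w, M=4, r="r")
print(order, S, A, V)
```

On this instance the postorder c, a, d, b, r has an unbounded-memory peak of 5 and incurs a single unit of I/O when M = 4. The recursive formulation is kept for readability; a very deep tree would need an iterative traversal to stay within Python's recursion limit.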

PostOrderMinIO is optimal on homogeneous trees
In this section we focus on homogeneous trees, that is, on trees where all nodes have output data of size one. We show that PostOrderMinIO is optimal on these homogeneous trees, i.e., that it performs the minimum number of I/Os. This generalizes a result of Sethi and Ullman [25], which considers binary trees from arithmetic expressions and aims to minimize the number of store/load operations when evaluating these expressions with a limited number of registers. They considered different variants, and the one with commutative operators closely resembles our problem, where the registers are replaced by memory slots and load/store by read/write. However, in our work, we do not limit the model to binary trees, but consider any tree with homogeneous data sizes. In the case that the heterogeneity in data sizes is limited, our result provides a good strategy for minimizing the number of I/O operations.
Theorem 4. PostOrderMinIO is optimal for homogeneous trees.
In order to prove this theorem, we need first to define labels on the nodes of the tree. Let T be any homogeneous tree (w v = 1 for all nodes v of T ). In the following definitions, whenever v is a node of T with k children, v 1 , . . . , v k will be its children.
Memory bound l(v). For each node v of T, we recursively define a label l(v) which represents the minimum amount of memory necessary to execute the subtree T(v) rooted at v without performing any I/Os:
$l(v) = 1$ if v is a leaf, and $l(v) = \max_{1 \le i \le k} \left(l(v_i) + i - 1\right)$ otherwise, where the children of v are numbered so that $l(v_1) \ge \dots \ge l(v_k)$.
Let Postorder be a postorder schedule that executes the children of any node by non-increasing l-labels (ties being arbitrarily broken). Intuitively, under Postorder, while computing the i-th child, we have i − 1 extra nodes in memory, each of size one, so we need l(v_i) + (i − 1) memory slots in total.
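As an illustration, the l-labels can be computed bottom-up. The sketch below assumes that a leaf has label 1 (as used in the proof of Lemma 6) and that an internal node takes the maximum of l(v_i) + (i − 1) over its children sorted by non-increasing label; the `chil` map and the example trees are hypothetical.

```python
def l_label(chil, v):
    """Memory (in unit slots) needed to process the subtree rooted at v
    without I/Os, for a tree whose data all have size one."""
    labels = sorted((l_label(chil, c) for c in chil.get(v, [])), reverse=True)
    if not labels:
        return 1      # a leaf only needs room for its own output
    # The i-th child (0-based) runs while i earlier siblings stay in memory.
    return max(label + i for i, label in enumerate(labels))


# Hypothetical examples: a chain, a star, and a complete binary tree.
chain = {"r": ["a"], "a": ["b"]}
star = {"r": ["a", "b", "c"]}
binary = {"r": ["x", "y"], "x": ["x1", "x2"], "y": ["y1", "y2"]}
print(l_label(chain, "r"), l_label(star, "r"), l_label(binary, "r"))  # 1 3 3
```

A chain never needs more than one slot, whereas a node with three leaf children needs three slots to hold all of its inputs.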
We set c(root) = 0. To ease the writing of some proofs, we use the notation $m(v_i) = \sum_{j < i} \left(1 - c(v_j)\right)$. Thus m(v_i) represents the number of children of v in memory right before v_i is executed. Note that m(v_1) = 0 and $m(v_{i+1}) = m(v_i) + 1 - c(v_i)$.
I/O volumes w(v) and W(T(v)). w(v) represents the total number of children of v stored by Postorder: $w(v) = \sum_{i=1}^{k} c(v_i)$.
Finally, for a given node v, we define W(T(v)) on the subtree rooted at v: $W(T(v)) = w(v) + \sum_{i=1}^{k} W(T(v_i))$.
Intuitively, W(T(v)) represents the total volume of communications performed during the execution of the tree T(v) by Postorder.
We first state the correctness of the l-labels and the optimality of Postorder for the MinMem problem.
Lemma 5. With infinite memory, Postorder uses l(n) slots to compute the subtree rooted at node n.
Proof. The result follows from the definition of l(v).
Lemma 6. With infinite memory, any schedule uses at least l(v) slots to compute the subtree rooted at v.
Proof. We prove this result by induction on the size of T (v). If v is a leaf, the result holds (l(v) = 1).
Otherwise, we assume the lemma to be true for the subtrees rooted at the children v_1, . . . , v_k of v. We consider the schedule returned by MinMem. The memory peak inherent to the execution of a subtree T(v_i) is at least l(v_i) by the induction hypothesis. Assume without loss of generality that the children of v are ordered such that MinMem first computes a node of T(v_1), then the next executed node not in T(v_1) belongs to T(v_2), and so on. Then, the memory peak reached during the execution of T(v_i) is at least l(v_i) + i − 1, since each of the i − 1 subtrees started earlier keeps at least one node in memory. Finally, the total memory peak is at least equal to max_{1≤i≤k} (l(v_i) + i − 1). By Theorem 3, this quantity is minimized when the nodes are ordered by non-increasing values of l(v_i). Hence, the total memory peak is at least l(v).
We now state the performance of Postorder for the MinIO problem (I/Os are performed using the FiF policy).

Lemma 7. Postorder performs at most W(T) I/Os to execute a tree T.
Proof.
We prove this result by induction on the size of T. We introduce new notation: [...] In other words, W(v) intuitively represents the total volume of communications performed during the execution of the tree T(v) if we had nothing to execute but T(v) (in practice T(v) may be a strict subtree of T and, therefore, the execution of T(v) in the midst of the execution of T can induce more communications). Note that [...] if v is the root of T. We prove by induction on the size of T(v) that at most W(v) I/Os are performed during the execution of T(v).
Let us assume that v is a leaf. Then W(v) = 0. Because we have assumed (in Section 3.1) that M was large enough for a single node to be processed without [...]
Now assume that v is not a leaf. By the induction hypothesis, for any i ∈ [1; k], Postorder executes the tree T(v_i) alone using at most W(v_i) I/Os. We prove that to process the tree T(v_i), after the trees [...]
Then, according to Lemma 5, no I/Os are required to execute T(v_{i+1}) under Postorder even after the processing of T(v_1) through T(v_i). Indeed, before the start of the processing of T(v_{i+1}) the memory contains ex[...]
We are now in the case c( [...]
Proof. Let v_1, . . . , v_k be the children of v, ordered so that l(v_1) ≥ · · · ≥ l(v_k). Let j be the index of a: a = v_j. As the label of a in T′, l′(a), is not larger than l(a), we can have l′(a) < l′(v_{j+1}). Therefore, we define another ordering of the children of [...]
Let us now consider the case [...]
The following lemma gives a lower bound on the I/Os performed by any schedule.
Lemma 9. No schedule can compute a tree T performing strictly fewer than W(T) I/Os.
Here is a short summary of the proof, which is given in full detail in Appendix A. The result is proven by induction on the size of the tree. The case where no I/Os are required can be deduced from Lemma 6.
We then consider a tree T for which any schedule performs at least one I/O, and an optimal schedule P on this tree. We focus on the first node s to be stored under this schedule, and define the tree T ′ in which T (s) is replaced by s. Using the induction hypothesis, we know that any schedule on T ′ , including the restriction of P on T ′ , performs at least W (T ′ ) I/Os. Therefore, we deduce that P performs at least W (T ′ ) + 1 I/Os on T . Thus, it remains to prove that W (T ′ ) ≥ W (T ) − 1.
To do so, we focus on the closest ancestor of s to have a label l larger than M , and denote it as µ. We first prove that in the new tree T ′ , we have l(µ) ≥ M . This means, by Theorem 8, that the w labels of the ancestors of µ are unchanged in T ′ . Then, we prove through an extensive case study that w(µ) in T ′ cannot be smaller than w(µ) in T minus one. Finally, we conclude that all the other w labels are equal in T and in T ′ ; therefore, W (T ′ ) ≥ W (T ) − 1.
We are now ready to prove Theorem 4.
Proof of Theorem 4. From Lemmas 7 and 9, Postorder is optimal for homogeneous trees. Moreover, PostOrderMinIO is a post-order that minimizes the volume of I/O operations. Hence, it is also optimal for homogeneous trees.
The Postorder algorithm designed in this proof is actually equivalent to PostOrderMinMem, the postorder algorithm minimizing the peak memory, when applied to homogeneous trees. The only difference with PostOrderMinIO is that the latter sorts the children by non-increasing A_i = min(M, l_i) whereas Postorder sorts them by non-increasing l_i. PostOrderMinIO is then less specific, as it does not specify the order among subtrees with l_i ≥ M: there are more ties that can be arbitrarily broken. This difference also implies that the schedule given by Postorder does not depend on the value of M. It is thus cache-oblivious [11] and optimal (on homogeneous trees) for any memory size. Therefore, if we consider several levels of memory (e.g., a cache memory connected to a Random-Access Memory, itself connected to a disk), Postorder minimizes the memory transfers between every level (e.g., both cache-RAM and RAM-disk transfers). Note that with heterogeneous trees, this result does not hold anymore, as the optimal traversal depends on the memory size. Therefore, no algorithm can simultaneously minimize transfer between all levels of the memory hierarchy.

Postorder traversals are not competitive
Previous research has shown that the best postorder traversal for the MinMem problem is arbitrarily far from the optimal traversal [13]. We prove here that postorder traversals may also have bad performance for the MinIO problem. More specifically, we prove that there exist problem instances on which PostOrderMinIO performs arbitrarily more I/Os than the optimal amount. We could exhibit an example where the optimal traversal does not perform any I/Os and PostOrderMinIO performs some I/Os, but we rather present a more general example where the optimal traversal does perform some I/Os: in the following example, the optimal traversal requires 1 I/O, whereas PostOrderMinIO requires Ω(nM) I/Os. The tree used in this instance is depicted in Figure 1(a).
It is possible to traverse the tree of Figure 1(a) with a memory of size M using only a single I/O, by executing the nodes in increasing order of the (red) labels next to the nodes. After processing the minimal subtree including the two leftmost leaves, our strategy is to process leaves from left to right. Before processing a new leaf, we complete the previous subtree up to a node of weight 1; this way the leaf and the active nodes can both fit in memory.
On the other hand, the best postorder traversal must perform a volume of I/O equal to M/2 − 1 before processing any leaf, except for the very first processed leaf. This is because the least common ancestor of any two leaf nodes has two children of size M/2, and all leaves have size at least M − 1. Thus, any postorder traversal performs at least M/2−1 I/Os for all but one leaf node, leading to at least 3M/2−3 I/Os for the tree in Figure 1(a) (a best postorder starts from any of the two M leaves and performs 3M/2 − 2 I/Os). We can extend this tree as follows: we replace root by a node of size 1, add to it a parent of size M/2 which is the left child of the new root; the right child of the new root is then a chain containing a leaf of size M − 1 and its parent of size M/2. Doing this repeatedly until n nodes are used gives the lower bound of Ω(nM ). Therefore, PostOrderMinIO is not constant-factor competitive.

OptMinMem is not competitive
Minimizing the amount of I/Os in an out-of-core execution seems similar to minimizing the peak memory when the memory is unbounded. Thus, in order to derive a good solution for MinIO, it seems reasonable to use an optimal algorithm for MinMem, such as the OptMinMem algorithm presented by Liu [19], to compute a schedule σ and then to perform I/Os using the FiF policy. In the following, we also use OptMinMem to denote this strategy for MinIO. We prove here that there exist problem instances on which this strategy will also perform arbitrarily more I/Os than the optimal traversal.
We first exhibit in Figure 1(b) a tree showing that OptMinMem does not always lead to minimum I/Os in our model. Let M = 6. The tree of Figure 1(b) can be completed with 3 I/Os, by doing one chain after the other. This corresponds to a peak memory of 9. But OptMinMem achieves a peak memory of 8 at the cost of 4 I/Os by executing the nodes in increasing order of the labels next to the nodes.
This example can be extended to show that OptMinMem may perform arbitrarily more I/Os than the optimal strategy. The extended tree is illustrated on Figure 1(c). It contains two identical chains of length 2k + 2, for a given parameter k, and the memory size is set to 4k. The weights of the tasks in each chain (in order from root to leaf) are defined by interleaving two sequences: {2k, 2k − 1, . . . , k} and {3k, 3k + 1, . . . , 4k}. As above, it is possible to schedule this tree with only 2k I/Os, but with a memory peak of 6k, by computing one entire chain, then the other. However, OptMinMem achieves a memory peak of 5k by alternating between chains, each time processing the chain until reaching a node with a weight smaller than 2k, as represented by the labels besides the nodes. OptMinMem performs k I/Os on each of the k + 1 smallest nodes, leading to a cost of k(k + 1) I/Os. The competitive ratio is then larger than k/2, and OptMinMem is not constant-factor competitive for the MinIO problem.
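The arithmetic behind this ratio can be spelled out as follows, using only the quantities stated above (memory 4k, a peak of 6k for the chain-by-chain schedule, and k I/Os on each of k + 1 nodes for OptMinMem). Reading the 2k I/Os as the part of the 6k peak that exceeds the memory is an interpretation of the text, not a statement from it.

```latex
\[
\underbrace{6k - 4k}_{\text{chain by chain}} = 2k,
\qquad
\underbrace{k \cdot (k+1)}_{\text{OptMinMem}} = k(k+1),
\qquad
\frac{k(k+1)}{2k} = \frac{k+1}{2} > \frac{k}{2}.
\]
```

Since k can be chosen arbitrarily large, no constant bounds this ratio.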

Unknown complexity
As shown above, polynomial-time approaches based on similar problems fail to even give a constant-competitive ratio. The main issue facing a polynomial approach is the highly nonlocal aspect of the optimal solution. For example, since postorder traversals are not optimal, it may be highly useful to stop at intermediate points of a subtree's execution in order to process entirely different subtrees.
We conjecture that this problem is NP-hard due to these difficult dependencies. As mentioned above, if we require entire nodes to be written to disk, the problem has been shown to be NP-hard by reduction from Partition [13]. However, this proof relies heavily on indivisible nodes, rather than on the recursive structure of trees. Taking advantage of the structure of our problem to give an NP-hardness result could lead to an interesting understanding of optimal solutions, and possibly further heuristics. We leave this as an open problem.

ILP formulation of the problem
We now present an Integer Linear Program solving the MinIO problem.
The linear program relies on the Boolean variables $\delta_{ij}$ to express the schedule constraints: $\delta_{ij}$ is equal to 1 if node $i$ precedes node $j$ in the corresponding schedule and 0 otherwise, as used previously in [7] for instance. All the variables considered in this linear program are nonnegative. The first constraints represent the antisymmetric (Equation (2)), acyclic (Equation (3)) and reflexive (Equation (4)) properties of the order, and the consistency with the precedence constraints (Equation (5)).
We introduce the variable $\alpha_i \in [0, 1]$, which represents the fraction of node $i$ written to disk.
The memory constraint is then equivalent to a nonlinear inequality, Equation (8), which will ultimately be linearized. Indeed, Equation (8) is the transposition of the memory constraint defined as Equation (1) in Section 3.1, noting that, for a node $k$ with parent $p$, $\delta_{ki} \delta_{ip} = 1$ if and only if node $k$ is active during the execution of node $i$.
Finally, the objective function (Equation (9)) is to minimize the I/O cost. We have now formalized the MinIO problem described in Section 3.1 through a quadratic program $P_{quad}$ composed of Equations (2) to (8). It then remains to linearize Equation (8).
We define the variables $x_{ik}$ and $y_{ik}$. They are constrained to satisfy the following: if node $k$ is active during the execution of node $i$, then $x_{ik}$ is equal to 1 and $y_{ik}$ is not larger than $\alpha_k$; otherwise they are both null. Note that in the special case when $i$ is either $k$ or its parent, both variables are forced to be 0.
$\forall (k, p) \in E, \quad x_{pk} = y_{pk} = x_{kk} = y_{kk} = 0 \qquad (12)$

The final integer linear program $P_{lin}$ is then composed of Equations (2) to (7) and (10) to (13), with the objective function described by Equation (9). It requires $O(n^2)$ variables and $O(n^3)$ constraints. We now prove that the linearization is correct: for any value $X$ of the objective function, there exists a solution to $P_{lin}$ of objective value $X$ if and only if there exists a solution to $P_{quad}$ of objective value $X$.
First, as an intermediate step, we show that if there exists a valuation of variables that satisfies Equations (2) to (7) and (10) to (12), then for $(k, p) \in E$ and $i \in G$ with $i \notin \{k, p\}$, we have $x_{ik} - y_{ik} \ge \delta_{ki} \delta_{ip} (1 - \alpha_k)$. We consider such a valuation of the variables. By the precedence constraint (5), we have $\delta_{kp} = 1$. Hence, thanks to the antisymmetric (2) and acyclic (3) constraints, we deduce that we cannot have both $\delta_{ki} = 0$ and $\delta_{ip} = 0$. Therefore, thanks to Equation (10), we have $x_{ik} = \delta_{ki} \delta_{ip}$. From Equation (11), we deduce $y_{ik} \le \alpha_k$ and $y_{ik} = 0$ if $\delta_{ki} \delta_{ip} = 0$. Thus, we have $y_{ik} \le \delta_{ki} \delta_{ip} \alpha_k$ and finally $x_{ik} - y_{ik} \ge \delta_{ki} \delta_{ip} (1 - \alpha_k)$.
Assume that $P_{lin}$ allows a feasible valuation of variables $V$. $V$ respects the conditions of the above paragraph, so that $x_{ik} - y_{ik} \ge \delta_{ki} \delta_{ip} (1 - \alpha_k)$. Therefore, the left-hand side of Equation (13) is not smaller than the left-hand side of Equation (8). As Equation (13) is satisfied as part of $P_{lin}$, Equation (8) is satisfied. $V$ (restricted to the $\delta$ and $\alpha$ variables) is then a solution of $P_{quad}$.
Conversely, let us assume now that $P_{quad}$ allows a feasible valuation of variables $V$. Then, we complete $V$ by setting, for $(k, p) \in E$ and $i \in G$ with $i \notin \{k, p\}$, $x_{ik} = \delta_{ki} \delta_{ip}$ and $y_{ik} = \delta_{ki} \delta_{ip} \alpha_k$, and for $i \in \{k, p\}$, $x_{ik} = y_{ik} = 0$. Let $V'$ be the completed valuation. In $V'$, Equation (13) is then equivalent to Equation (8), which is thus also satisfied. We now show that $V'$ satisfies Equations (10) and (11). Let $(k, p) \in E$ and $i \in G$ with $i \notin \{k, p\}$. By the precedence constraint (5), we have $\delta_{kp} = 1$. Hence, thanks to the antisymmetric (2) and acyclic (3) constraints, we deduce that we cannot have both $\delta_{ki} = 0$ and $\delta_{ip} = 0$. Therefore, $x_{ik} = \delta_{ki} \delta_{ip} = \delta_{ki} + \delta_{ip} - 1$, so Equation (10) is satisfied. Then, $y_{ik} = \delta_{ki} \delta_{ip} \alpha_k$ is equal to 0 if $\delta_{ki}$ or $\delta_{ip}$ is null, and to $\alpha_k$ otherwise, so Equation (11) is satisfied. Therefore, $V'$ is a solution of $P_{lin}$, achieving the same objective value as $V$ in $P_{quad}$.
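The key identity in this proof, that the product of the two order variables coincides with their sum minus one whenever they are not both null, can be checked exhaustively. The sketch below does so; the function names and the sample values of the fraction written to disk are illustrative assumptions, and the code only exercises the identities stated in the proof, not the full program.

```python
from itertools import product


def linearized_x(d_ki, d_ip):
    """Linear expression standing in for the product d_ki * d_ip."""
    return d_ki + d_ip - 1


def check_linearization(alphas=(0.0, 0.25, 1.0)):
    """Compare the linear and quadratic forms on every admissible case.

    The case d_ki = d_ip = 0 is skipped: for an edge (k, p), node k
    precedes p, so i cannot be both before k and after p.
    """
    for d_ki, d_ip in product((0, 1), repeat=2):
        if d_ki == 0 and d_ip == 0:
            continue
        prod = d_ki * d_ip
        x = linearized_x(d_ki, d_ip)
        if x != prod:
            return False
        for a in alphas:
            y = prod * a  # completion of the valuation used in the proof
            # y is 0 when k is inactive and alpha_k when k is active
            if y != (a if prod == 1 else 0):
                return False
            # x - y equals the quadratic term d_ki * d_ip * (1 - alpha_k)
            if abs((x - y) - prod * (1 - a)) > 1e-12:
                return False
    return True
```

The excluded case shows why the order constraints matter: with both variables null the linear expression evaluates to −1, which is not a valid value for a nonnegative variable.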

Heuristic
We now move to the design of a novel heuristic, FullRecExpand, whose goal is to improve the performance of OptMinMem for the MinIO problem. The main idea of this heuristic is to run OptMinMem several times: when we detect that an I/O is needed on some node, we force this I/O by transforming the tree. This way, the following iterations of OptMinMem will benefit from the knowledge of this I/O. We continue transforming the tree until no more I/Os are necessary.
In order to enforce I/Os, we use the technique of expanding a node (illustrated in Figure 2). Under an I/O function $\tau$, we define the expansion of a node $i$ as the substitution of this node with a chain of three nodes $i_1$, $i_2$, $i_3$ of respective weights $w_i$, $w_i - \tau(i)$, and $w_i$. The expansion of a node actually mimics the action of executing I/Os: the weights of the three tasks represent the amount of main memory occupied by this node 1) when it is first completed ($w_{i_1} = w_i$), 2) when part of it is moved to disk ($w_{i_2} = w_i - \tau(i)$), and 3) when the whole data is transferred back to main memory ($w_{i_3} = w_i$).
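As a minimal sketch of the expansion just described, the following helper returns the three weights of the chain replacing a node. The names `expand` and `expansion_volume` are hypothetical, and representing the chain as a plain list (in execution order) is a simplification: a real implementation would also rewire the tree.

```python
def expand(weight, tau):
    """Weights of the chain i1, i2, i3 that replaces a node of the given weight.

    The list is in execution order: full output, then the part kept in
    memory after writing tau units to disk, then the full output again.
    """
    if not 0 <= tau <= weight:
        raise ValueError("tau must lie between 0 and the node weight")
    return [weight, weight - tau, weight]


def expansion_volume(expanded_chains):
    """Total I/O volume encoded by a list of expanded chains."""
    return sum(chain[0] - chain[1] for chain in expanded_chains)
```

For instance, `expand(10, 3)` yields the weights 10, 7 and 10, which encode 3 units written to disk and later read back.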
This technique first allows us to prove Theorem 2, which states that given an I/O function τ , we can find a schedule σ such that (σ, τ ) is a valid traversal if there exists one.
Proof of Theorem 2. Consider the tree $G'$ obtained from $G$ by expanding all the nodes for which $\tau$ is not null. Then, consider the schedule $\sigma'$ obtained by OptMinMem on $G'$, and let $\sigma$ be the corresponding schedule on $G$. Then, the memory used by $\sigma$ on $G$ during the execution of a node $i$ is the same as the one used by $\sigma'$ on $G'$ during the execution of the same node $i$, or of $i_1$ if $i$ is expanded. Then, as OptMinMem achieves the optimal memory peak on $G'$, we know that $\sigma$ uses as little main memory as possible under the I/O function $\tau$. Then, $(\sigma, \tau)$ is a valid traversal of $G$.
The heuristic FullRecExpand is described in Algorithm 2. The main idea of the heuristic is to expand nodes in order to obtain a tree that can be scheduled without any I/O, which is equivalent to building an I/O function.
First, the heuristic recursively calls itself on the subtrees rooted at the children of the root, so that each subtree can be scheduled without any I/O (but using expansions). Then, the algorithm computes OptMinMem on this new tree, and if I/Os are necessary, it determines which node should be expanded next. This selection is the only part where FullRecExpand can deviate from an optimal strategy. Our choice is to select a node on which the FiF policy would perform I/Os; if there are several such nodes, we choose the one whose parent is scheduled the latest. After the expansion, the algorithm recomputes OptMinMem on the modified tree, and proceeds until no more I/Os are necessary. At the end of the computation, the returned schedule is obtained by running OptMinMem on the final tree computed by FullRecExpand, and by transposing it on the original tree. The I/O performance of this schedule is then equal to the sum of the expansions.
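The selection rule of the previous paragraph can be sketched as follows. The data layout (dictionaries for the I/O function and the parent relation, a list for the schedule) and the function name are assumptions made for illustration; ties between candidates are broken arbitrarily here, which the text does not specify.

```python
def select_node_to_expand(tau, parent, schedule):
    """Pick the node to expand next, or None when no I/O is needed.

    tau:      dict node -> I/O volume assigned by the FiF policy
    parent:   dict node -> parent node (the root is absent or maps to None)
    schedule: list of nodes in execution order
    """
    position = {node: rank for rank, node in enumerate(schedule)}
    # Only nodes that actually incur I/O are candidates.
    candidates = [i for i, volume in tau.items()
                  if volume > 0 and parent.get(i) is not None]
    if not candidates:
        return None
    # Among them, keep the one whose parent is scheduled the latest.
    return max(candidates, key=lambda i: position[parent[i]])
```

Recomputing the schedule after each expansion, as the heuristic does, means this function is called once per iteration of the main loop.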
FullRecExpand is only a heuristic: it may give suboptimal results but also may achieve better performance than OptMinMem, as illustrated in several examples in Appendix B.
Unfortunately, the complexity of FullRecExpand is not polynomial, as the number of iterations of the while loop at Line 4 cannot be bounded by the number of nodes, but may depend also on their weights. We therefore propose a simpler variant, named RecExpand, where the while loop at Line 4 is exited after 2 iterations. In this variant, the resulting tree G might need I/Os to be executed. The final schedule is computed as in FullRecExpand, by running OptMinMem on this tree G. We show in the next section that this variant gives results which are very similar to the original version.

Numerical results
In this section, we compare the performance of the two existing strategies, OptMinMem and PostOrderMinIO, and the two proposed heuristics, FullRecExpand and RecExpand. All algorithms are compared through simulations on the three datasets described below. Because of its high computational complexity, FullRecExpand is only tested on the synthetic datasets.

Algorithm 2: FullRecExpand
Input: tree G, root of exploration r
Output: a tree G_r which can be executed without any I/O, obtained from G by expanding several nodes
1 for each child i of r do
2     G_i ← FullRecExpand(G, i)
3 G_r ← tree formed by the root r and the G_i subtrees
4 while OptMinMem(G_r, r) needs more than a memory M do
5     τ ← I/O function obtained from OptMinMem(G_r, r) using the FiF policy
6     i ← node for which τ(i) > 0 whose parent is scheduled the latest in OptMinMem(G_r, r)
7     modify G_r by expanding node i according to τ(i)
8 return G_r

The first dataset, named Synth, is composed of 330 instances of synthetic binary trees of 3000 nodes, generated uniformly at random among all binary trees. As we considered small trees, we used half-Catalan numbers in order to draw a tree, similarly to the method described at the beginning of [21]. The memory weight of each task is uniformly drawn from [1; 100].

The second dataset, named SmallSynth, is composed of 30,000 synthetic binary trees of 30 nodes, generated with the same method as the trees of Synth. This dataset contains trees small enough to allow the determination of the optimal solution by solving the ILP directly.
The last dataset, named Trees, is composed of 329 elimination trees of actual sparse matrices from the University of Florida Sparse Matrix Collection (see [13] for more details on elimination trees and the data set). Our dataset corresponds to the 329 smallest of the 640 trees presented in [13], with trees ranging from 2000 to 40000 nodes.
For each tree of each dataset, we first computed the minimal memory size necessary to process the tree nodes: $LB = \max_i \bar{w}_i$. We also computed the minimal peak memory for an in-core execution, $Peak_{incore}$ (using OptMinMem). We eliminated all trees from the Trees dataset where $Peak_{incore} = LB$ (i.e., trees for which out-of-core execution is useless whatever the memory bound $M$), leaving us with 133 remaining trees in this dataset. In all other cases, note that the possible range for the memory bound $M$ such that some I/Os are necessary is $[LB, Peak_{incore} - 1]$. The main memory bound we use in our simulations is the middle of this interval, $M_{mid} = (LB + Peak_{incore} - 1)/2$. For a more complete analysis, we also perform the same simulations with the two extreme memory bounds $M_{min} = LB$ and $M_{max} = Peak_{incore} - 1$.
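The three memory bounds follow directly from the two quantities computed for each tree. The sketch below takes them as inputs; the function name is hypothetical, and leaving the middle bound unrounded is an assumption, as the text does not say how a non-integer value is handled.

```python
def memory_bounds(lb, peak_incore):
    """Return (M_min, M_mid, M_max) for one tree.

    lb:          minimal memory needed to process any single node
    peak_incore: minimal peak memory of an in-core execution
    """
    if peak_incore <= lb:
        # Such trees never need I/O and are discarded in the experiments.
        raise ValueError("no memory bound forces I/Os on this tree")
    m_min = lb
    m_max = peak_incore - 1
    m_mid = (lb + peak_incore - 1) / 2  # assumption: no rounding applied
    return m_min, m_mid, m_max
```

With `lb = 10` and `peak_incore = 21`, the bounds are 10, 15 and 20.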

Results
Our objective in this study is to minimize the total amount of I/Os needed to process the tree. In order to summarize and compare the performance of the different strategies, we choose here to consider the number of I/Os and the memory bound.

To compare the performance of these algorithms, we use a generic tool called a performance profile [9]. For a given dataset, we compute the relative I/O volume of each algorithm on each tree and for each memory limit. Then, rather than computing an average across all the cases, a performance profile reports a cumulative distribution function. We define the deviation of a heuristic on a given instance as the relative I/O volume of this heuristic divided by the best relative I/O volume achieved for this instance. We then use the deviation to the best heuristic for the datasets Synth and Trees, and the deviation to the optimal solution, which is computed with the ILP, for the dataset SmallSynth. Given a heuristic and a deviation τ expressed as a percentage, we compute the fraction of test cases for which the heuristic has a deviation not larger than τ, and plot these results. Therefore, the higher the curve, the better the method: for instance, for a deviation τ = 5%, the performance profile shows how often a given method lies within 5% of the smallest relative I/O volume obtained.

The left plot of Figure 3 presents the performance profile of the four heuristics for the complete dataset Synth using the memory bound $M_{mid}$. The first result is the poor performance of PostOrderMinIO on this dataset: it almost always has a deviation of at least 50%, and even of 100% in 75% of the cases. Thus, the right plot of the figure presents the performance profiles of OptMinMem, RecExpand, and FullRecExpand only. RecExpand performs far better than OptMinMem: it produces strictly fewer I/Os than OptMinMem on 90% of the instances, and on half of them, OptMinMem has a deviation of at least 4%.
We can also note that FullRecExpand performs only slightly better than RecExpand, while both heuristics are far ahead of OptMinMem: the reduction in complexity offered by RecExpand comes at the cost of only a small loss of performance. For instance, RecExpand has a deviation larger than 2% over FullRecExpand on only 3% of the instances.
We present the performance profiles for the dataset SmallSynth and the memory bound $M_{mid}$ in Figure 4. In this figure, the deviation is computed using the optimal solution obtained via the ILP, which allows us to analyze the quality of the solutions returned by FullRecExpand and RecExpand. As the left graph shows, the heuristics follow the same hierarchy as in the Synth dataset with larger trees, but as one could expect, the differences in performance are less significant: PostOrderMinIO has a deviation of less than 10% over the optimal solution in 75% of the instances. OptMinMem is non-optimal in 3% of the instances. FullRecExpand and RecExpand achieve better performance, as they are non-optimal in respectively 0.72% and 0.74% of the instances. The performance profile on the subset of trees where at least one of these heuristics is non-optimal is presented on the right graph.

The left plot of Figure 5 presents the performance profiles of the three heuristics PostOrderMinIO, RecExpand and OptMinMem for the complete dataset Trees using the memory bound $M_{mid}$. The first remark is that the three heuristics are equal on more than 90% of the 329 instances. Therefore, we now focus on the right plot, which presents the top part of the same performance profile, corresponding to the 25 cases where the heuristics do not all give equal performance. We can see that the hierarchy is the same as in the previous dataset (RecExpand is never outperformed, and OptMinMem performs better than PostOrderMinIO) but with smaller discrepancies between the heuristics. We observe a deviation larger than 5% on only 3% of the instances for PostOrderMinIO and 1% of the instances for OptMinMem.

We now consider the memory bound $M_{min} = LB$, which represents the minimum memory bound for which it is possible to compute a given tree. We plot the corresponding performance profiles for the Synth dataset in Figure 6, the SmallSynth dataset in Figure 7, and the Trees dataset in Figure 8.
The main conclusion that can be drawn in comparison to the previous results is that the difference between OptMinMem and RecExpand is significantly larger with this memory bound. Indeed, in the Synth dataset, OptMinMem has a deviation of at least 10% in 90% of the cases, whereas such a deviation was reached in only 15% of the cases previously. This can be explained by the fact that the memory bound considered here is further from the memory required by OptMinMem. On the other hand, the difference between PostOrderMinIO and RecExpand is smaller in this case: PostOrderMinIO has a deviation of at least 100% in half of the cases, whereas this was true in 75% of the cases with a higher memory bound. The same tendency can be observed for the Trees dataset in Figure 8, even if it is less significant. For the SmallSynth dataset, the proportion of non-optimal cases is around 3.5 times larger than with the previous memory bound for the three heuristics FullRecExpand, RecExpand, and OptMinMem, so they are also further from the optimal, but the modification of the memory bound did not significantly modify the behavior of PostOrderMinIO.
For the sake of completeness, we have also considered the memory bound $M_{max} = Peak_{incore} - 1$, which is the opposite case: the largest memory bound for which I/Os are required in order to compute a tree. With this memory bound, OptMinMem, RecExpand, and FullRecExpand are always equal (and even optimal for the SmallSynth dataset), and only PostOrderMinIO achieves worse performance. This can be explained by the fact that $M_{max}$ is right below the memory required by OptMinMem to compute a tree without I/Os. Therefore, we can argue that OptMinMem is then closer to the optimal algorithm, and FullRecExpand does not improve on the few I/Os performed by OptMinMem. Nevertheless, the deviation of PostOrderMinIO is smaller than with the other memory bounds.
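The performance profiles used throughout this section can be computed along the following lines. This is a sketch under stated assumptions: the function names are illustrative, relative I/O volumes are assumed to be strictly positive, and the deviation is converted to a percentage as (ratio − 1) × 100 so that a deviation of 5% means lying within 5% of the reference.

```python
def deviations(volumes, reference=None):
    """Deviation of each heuristic on each instance, in percent.

    volumes:   dict heuristic -> list of relative I/O volumes, one per instance
    reference: optional list of optimal volumes (e.g. from the ILP);
               defaults to the best volume among the given heuristics
    """
    names = list(volumes)
    n = len(volumes[names[0]])
    if reference is None:
        reference = [min(volumes[h][t] for h in names) for t in range(n)]
    return {h: [100.0 * (volumes[h][t] / reference[t] - 1.0) for t in range(n)]
            for h in names}


def profile(devs, thresholds):
    """Fraction of instances whose deviation is at most each threshold."""
    return {h: [sum(d <= t for d in ds) / len(ds) for t in thresholds]
            for h, ds in devs.items()}


if __name__ == "__main__":
    vols = {"A": [1.0, 2.0, 1.0, 1.0], "B": [1.0, 1.0, 1.5, 1.0]}
    print(profile(deviations(vols), [0, 50, 100]))
```

Plotting each list returned by `profile` against the thresholds gives one curve per heuristic; the higher curve corresponds to the better method.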

Conclusion
In this paper, we revisited the problem of minimizing I/O operations in the out-of-core execution of task trees. We proved that existing solutions allow us to optimally solve the problem when all output data have identical size, but that, in the general case, none of them has a constant competitive factor compared to the optimal solution. In addition to an ILP formulation of the problem, which allows us to compute an optimal solution for small trees, we proposed a novel heuristic solution. Through simulations, we show that this new heuristic is very efficient in practice, achieves better performance than existing solutions, and achieves near optimal performance on small trees. Despite our efforts, the complexity of the problem remains open. Determining this complexity would definitely be a major step, although our findings already lay the basis for more advanced studies. These include moving to parallel out-of-core execution (as was already done for parallel incore execution [10]) as well as designing competitive algorithms for the sequential problem.
First, by Theorem 5, there exists a node $v$ such that $l(v) > M$. Otherwise, Postorder would be able to schedule $T$ without I/Os, which would violate our assumption on $T$. Then, the label of the root $r$ of $T$ also satisfies $l(r) > M$.
Let $s$ be the first node to be stored under $P$. Then, the subtree $T(s)$ has been scheduled without I/Os so, by Theorem 6, we have $l(s) \le M$ and, hence, no node of $T(s)$ has a label larger than $M$. Let $\mu$ be the closest ancestor of $s$ to have a label larger than $M$; $\mu$ exists as $l(r) > M$ and $l(s) \le M$. Let $\mu_1, \dots, \mu_k$ be the children of $\mu$, ordered such that $l(\mu_i) \ge l(\mu_{i+1})$. Let $j$ be such that $\mu_j$ is either $s$ or one of its ancestors. Let $t = \min\{i \in [1; k] \mid l(\mu_i) + i - 1 > M\}$ ($t$ exists because, by definition, $l(\mu) > M$). See Figure 9 for an illustration of the tree. Let $T'$ be the tree obtained from $T$ by replacing $s$ by a leaf, therefore replacing the subtree $T(s)$ by a single node $s$. As $T(s)$ cannot be empty, $T'$ contains fewer nodes than $T$. Consider a schedule $P'$ on $T'$ that executes the same operations as $P$ on $T$ and in the same order, except for the ones concerning $T(s)$.
We use the following notation: as above, $l$, $m$, $c$, $w$ are defined on nodes of the tree $T$, whereas $l'$, $m'$, $c'$, $w'$ refer to the same values on the tree $T'$. The nodes in $T'$ share the same names as their equivalents in $T$.
Putting things together, no node of $T(s)$ has a label $l$ larger than $M$, so none has a positive label $w$. Between $\mu$ and $s$, no node had a label $l$ larger than $M$. Therefore, except $\mu$ and its ancestors, all the nodes satisfy $w'(v) = w(v)$.
By the induction hypothesis, $P'$ executes at least $W(T') = W(T) - 1$ I/Os, so $P$ executes at least $W(T)$ I/Os, which proves the lemma.

Appendix B. Illustration of FullRecExpand on some examples
The left-hand side of Figure 10 provides an example where FullRecExpand performs better than OptMinMem. OptMinMem computes the left branch first until node $a$, then the right branch until node $b$, before completing the left branch. The memory peak reached is 12, but this schedule incurs 4 I/Os with a memory limit of 10: 2 on node $a$ and 2 on node $b$. On this example, FullRecExpand expands node $b$ as specified in the middle diagram. With this expansion, OptMinMem schedules the right branch until $b_2$ first, then the whole left branch, using one more I/O on $b_2$. This node is expanded a second time in the right diagram, without changing the schedule obtained by OptMinMem, yielding 3 I/Os on the original tree, all on $b$.

Figure 11 provides an example where FullRecExpand does not improve on OptMinMem. On this instance, OptMinMem performs 4 I/Os, 2 on node $a$ then 2 on node $b$, whereas PostOrderMinIO first executes the left subtree and consumes only 3 I/Os on node $c$. This instance shows an example where no optimal solution performs an I/O on a node where OptMinMem performs an I/O. So the strategy of FullRecExpand cannot be optimal, even if we used a different priority at Line 6.