Dynamic Interference-Sensitive Run-time Adaptation of Time-Triggered Schedules

Over-approximated Worst-Case Execution Time (WCET) estimations for multi-cores lead to safe, but over-provisioned, systems and underutilized cores. To reduce WCET pessimism, interference-sensitive WCET (isWCET) estimations are used. Although they provide tighter WCET bounds, they are valid only for a specific schedule solution. Existing approaches have to maintain this isWCET schedule solution at run-time, via time-triggered execution, in order to be safe. Hence, any earlier execution of tasks, enabled by adapting the isWCET schedule solution, is not possible. In this paper, we present a dynamic approach that safely adapts isWCET schedules during execution, by relaxing or completely removing isWCET schedule dependencies, depending on the progress of each core. In this way, an earlier task execution is enabled, creating time slack that can be used by safety-critical and mixed-criticality systems to provide higher Quality-of-Service or execute other best-effort applications. The Response-Time Analysis (RTA) of the proposed approach is presented, showing that although the approach is dynamic, it is fully predictable with bounded WCET. To support our contribution, we evaluate the behavior and the scalability of the proposed approach for different application types and execution configurations on the 8-core Texas Instruments


Introduction
The constantly growing processing demand of applications has led the processor manufacturing industry towards multi-/many-core architectures. These architectures have multiple processing elements, called cores, providing massive computing capabilities, by being able to concurrently execute a high volume of tasks. Hard real-time systems have to provide timing guarantees, i.e., guarantee that tasks are completed before their respective latency requirements (deadlines). To rigorously provide such guarantees, deployment approaches schedule tasks on cores considering the Worst-Case Execution Time (WCET) of tasks.
However, in multi-core architectures, several resources are shared among the cores (such as memories and interconnects), introducing timing delays (interferences), changing the timing behavior of tasks and varying their WCET. Indeed, task WCETs that include interferences can be 7.5 times larger than the corresponding estimations without interferences, both experimentally measured [11,13] and analytically computed [23,24]. To account for all possible interferences, the WCET has to be over-approximated. This over-approximation practice has led to the "one-out-of-m processors" problem [8], where the additional processing capacity is negated by the pessimism of the WCET. As a result, the sequential execution (on a single core) may provide better timing guarantees than any parallel execution, which seriously undermines the advantages of utilizing multi-cores. To reduce the WCET pessimism, recent state-of-the-art research [15,16,19,24] has proposed tighter WCETs, called interference-sensitive WCETs (isWCETs). isWCETs are computed by accounting only for the interference that can be caused by the tasks scheduled in parallel. Hence, isWCETs are schedule-dependent, and they are valid only for the schedule solution they have been estimated for.
In order to guarantee a time-safe execution, this isWCET schedule solution has to be maintained during execution. Otherwise, additional interferences may occur, which have not been accounted for. To achieve that, time-triggered execution is usually applied, where the tasks are executed exactly at the start time assigned to them in the isWCET schedule [18,19]. Although time-triggered execution is time-safe, it prohibits any improvement in performance. Performance improvement can create slack that can be used to increase the Quality-of-Service in safety-critical systems or to execute other best-effort applications in mixed-criticality systems. For example, in cruise control systems, the created slack can be used to further improve the quality of the result produced by the control law, whereas in satellite systems less essential functions, such as scientific instrument data collection, can be activated [7]. The means to obtain any performance improvement is run-time adaptation, using information on the actual execution time (AET) of tasks, which becomes available as the execution progresses. However, any adaptation occurring at run-time must be safe.
Existing isWCET run-time adaptation approaches [21,22] allow tasks to be executed earlier than originally scheduled. Despite the potential earlier task execution, these approaches enforce the partial order of all tasks, provided by the isWCET schedule. In this way, additional interference due to earlier task execution cannot be introduced, keeping the isWCET estimations valid. However, this enforced partial order of tasks limits the performance improvements that can be achieved through run-time adaptation. This limitation is illustrated in Figure 1, where arrows denote the partial order of tasks. As depicted in Figure 1a, a static run-time mechanism with enforced partial order can allow an earlier execution of successor tasks ($\tau_2$ and $\tau_3$) only when all their predecessor tasks ($\tau_0$ and $\tau_1$) have finished. However, in permissive cases, $\tau_2$ could be executed even earlier, since $\tau_0$ has already finished execution before $\tau_1$. Static mechanisms that enforce the partial order do not permit this earlier execution of $\tau_2$, as $\tau_2$ would insert interference to $\tau_1$ that has not been accounted for during the creation of the isWCET schedule. Therefore, existing static mechanisms cannot exploit such opportunities created by the varying actual execution time of tasks across cores. However, assuming that $\tau_1$ started earlier than originally scheduled, some time slack has been created at run-time, which can be exploited to further improve performance. As depicted in Figure 1b, if the additional isWCET, due to the interferences inserted by a new task running in parallel ($\tau_2$ in this example, with interference illustrated in a striped pattern), is less than the time slack, then the partial order of tasks can be safely relaxed, and $\tau_2$ can be executed in parallel with $\tau_1$.
In this work, we propose such a dynamic interference-sensitive run-time adaptation approach (isRA-DYN) that safely relaxes the partial order of tasks. The proposed approach exploits the actual execution time of tasks across cores to allow concurrent tasks to sustain more interference than the one computed for the isWCET schedule, as long as the timing guarantees are preserved. Compared to existing approaches, the proposed approach is capable of exploiting the run-time variability due to a shorter task execution compared to the isWCET schedule computed offline. This run-time variability arises because i) fewer interferences occur during execution than the maximum possible interferences used to compute the isWCET schedule offline, and ii) the executed path differs from the worst-case path of the task used to compute the isWCET schedule. We provide the response time analysis of the proposed approach, showing that timing guarantees are satisfied and that run-time adaptation does not alter the timing behavior of the system. To support our contributions, we perform an extensive evaluation of the proposed approach on a real platform (8-core Texas Instruments TMS320C6678) for three applications and several execution configurations. The obtained results show that the proposed dynamic approach is able to provide performance improvements compared to the existing static approaches.

System model
Let $T$ denote the set of tasks of an application to be executed on the set of cores $K$ of the target platform. The tasks of $T$ can be either dependent or independent, and are periodically executed in a non-preemptive manner. The proposed dynamic adaptation uses as input a time-triggered schedule that provides the start/end times of the tasks and their allocation to cores. Formally, we model such a time-triggered schedule $S$ for the task-set $T$ with the tuple (µ, β, ), where $\mu(\tau)$ denotes the core allocation, i.e., the core $k$ at which task $\tau$ is executed, and $\beta(\tau)$, (τ ) are the start and end times of the task, respectively. These times refer to the absolute time elapsed from the start of the period, and shall not be confused with the typical notion of release time. Such a time-triggered schedule can be constructed by a scheduling algorithm providing timing guarantees, applied offline. Since the approach operates upon the input time-triggered schedule, any limitation will stem from the task model and scheduling algorithm used offline to construct the time-triggered schedule. For clarity, we will assume that tasks are released at the start of the period and that their isWCET does not consider restrictions on the length of task overlapping or the timing of the interference (see Section 6). A time-triggered schedule $S$ defines the partial ordering $\prec_S$ of the tasks, i.e., $\tau \prec_S \tau'$, iff task $\tau$ finishes its execution before $\tau'$ starts. Additionally, a schedule $S$ is considered safe, iff it satisfies the system-defined timing constraints, i.e., each task deadline and/or a global deadline must be met. Given a safe time-triggered schedule $S$, let $E_{isRA}$ be the transitive reduction of the tasks' partial order, i.e.
$(E_{isRA})^{*} \equiv \prec_S$. Essentially, $E_{isRA}$ is a set of scheduling dependencies, such that a task $\tau$ depends only on the tasks $\{\tau'\}$ that finished immediately before it, on all cores $K$, according to $\beta(\tau)$, i.e.:

The proposed dynamic adaptation mechanism is divided into four phases, namely ready, relax, execute and notify, which are respectively denoted with $R_\tau$, $S_\tau$, $X_\tau$ and $N_\tau$ for any task $\tau$. Since time-triggered schedules refer to absolute time, we shall denote the absolute response time of the control phases with $R(\cdot)$. Finally, given a set of potentially parallel tasks $T^{E}_{\tau}$ to task $\tau$, given a dependency relation $E \subseteq E_{isRA}$, we assume that the interference $\iota_E(\tau)$ that $\tau$ can cause to, and sustain from, $T^{E}_{\tau}$ is computable and upper-bounded by $\iota_{max}(\tau)$. This is a realistic assumption, e.g., a task $\tau$ with Worst Case Resource Accesses, $WCRA(\tau)$, to an arbitrated resource considering a fair Round-Robin arbiter with arbitration delay of $D_{RR}$ will cause/sustain [18]:

Run-time adaptation mechanisms for hard real-time systems must guarantee that any adaptation decision does not violate the real-time constraints and that the resulting execution is correct, i.e., no concurrency issue can occur. Furthermore, compared to traditional WCET schedules based on pessimistic WCET estimations, adapting interference-sensitive schedules poses an extra challenge: re-scheduling a task can increase the interference that another task sustains, as shown in Figure 1, potentially violating the timing guarantees. Static run-time adaptation [22] can safely adapt interference-sensitive time-triggered schedules by keeping the partial order of task execution fixed, preventing additional task overlaps, and thus unaccounted interference, from occurring. Yet, static approaches miss adaptation opportunities due to this fixed partial order, being unable to provide further performance improvements. To address such cases, we propose a dynamic run-time adaptation mechanism, outlined in Algorithm 1, that is
executed independently on each core and for each task. Each task execution is extended with control phases: ready, relax and notify. During the ready phase (L.3), the controller checks if the current active task is ready, i.e., all previous tasks have finished, and thus, its dependencies have been met. In case the task is not ready, it tries to relax the partial order (L.4) and checks again if the task is ready. To achieve a behavior that allows the partial order to change only when it is safe, a global slack is computed during execution, which is the minimum time-slack among all cores. The time-slack of a core is given by the amount by which the execution of its tasks has been sped up. The partial order, according to $E_{isRA}$, is allowed to be modified, if the interference introduced by any new task running in parallel is less than this global slack. This process continues, alternating a ready phase and a relax phase, until the task becomes ready and is executed (L.6). When the task finishes, the controller performs the notify phase, where it notifies all relevant cores that the task has finished (L.7) and updates the necessary information for slack calculation (L.2). A core $k'$ is called relevant for any task $\tau$ executed on core $k$, when there exists an outgoing edge from task $\tau$ towards a task $\tau'$ on core $k'$. In order to enforce a particular ordering of the tasks (either the original partial ordering of $E_{isRA}$ or any relaxation of it, $E_{dyn}$), each core holds its own status vector (of size $|K|$). Each bit of the status vector corresponds to a core. The status vector of each core represents the notifications received from other cores at any point in time, and it must be updated during execution by all cores.
The following sections explain the controller phases with respect to the dependencies where relaxation can occur, i.e., scheduling dependencies. In case of data-dependent tasks, the data-dependencies are never removed.

Figure 3: Example of control phases for four tasks on two cores. For each task, the ready vector is in curly brackets. The notification vector is in parentheses and illustrated with arrows.

Ready phase
To implement the ready phase, a ready vector (of size $|K|$) is required for each task $\tau$. Each bit in the ready vector represents the core $k$ from which the incoming edge of the scheduling dependencies originates, i.e.: where $readyVector_\tau$ is the ready vector of task $\tau$. The ready vectors are created offline for each task $\tau$, based on the dependency relation $E_{isRA}$, but may be modified during a relax phase of the same core, or by a notify phase of another core, to reflect the dependency relation $E_{dyn}$. For instance, in Figure 3a, the ready vector of task $\tau_2$ is {11}, since it has to wait for i) task $\tau_1$ running on core 1 and ii) task $\tau_0$ running on core 0 to finish before being executed. These dependencies ensure that the number of interferences will not increase due to an earlier execution of $\tau_2$. On the other hand, the ready vector of task $\tau_1$ is {00}, as no dependency exists from another task.

The functionality of the ready phase of the controller is described in Algorithm 2. Initially, the controller tries to gain access to the critical section of the status vector through the protection mechanism related to core $k$ (L.2). Once access has been granted, it checks if all task dependencies, encoded by the task's ready vector, have already been met (L.3). If this is true, the task $\tau$ can be executed. For instance, tasks $\tau_0$ and $\tau_1$ in Figure 3a are considered ready, since the corresponding bits of the status vectors of Core 0 and Core 1 are clear and the status vectors are equal to the ready vectors. Before advancing to the execution phase, the controller has to reset the bits indicated by the ready vector of task $\tau$ in its status (L.4). In this way, any already-met dependencies from other cores to subsequent tasks on core $k$ are preserved. Then, the protection mechanism is released and the task is executed.

Algorithm 2: Ready phase of isRA-DYN mechanism on core $k$.
Input: Task $\tau$, $status_k[\,]$ bit vector.
Output: true if all dependencies $readyVector_\tau$ have been met; otherwise false
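The ready check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CoreState` and `ready_phase`, the use of integers as bit vectors, and a `threading.Lock` standing in for the per-core protection mechanism are all hypothetical. The test `(status & ready) == ready` and the XOR-based reset follow the description in this section and in the deadlock-freedom proof.

```python
import threading


class CoreState:
    """Hypothetical per-core controller state (names are illustrative)."""

    def __init__(self):
        self.status = 0                 # one bit per notifying core
        self.lock = threading.Lock()    # stands in for the protection mechanism


def ready_phase(core, ready_vector):
    """Return True if every dependency encoded in ready_vector has been met.

    On success, the consumed notification bits are reset in the status
    vector. Bits not named by ready_vector are left untouched, so that
    notifications meant for subsequent tasks are preserved.
    """
    with core.lock:
        if (core.status & ready_vector) != ready_vector:
            return False
        core.status ^= ready_vector     # clears exactly the bits that were set
        return True
```

A task with an all-zero ready vector, such as $\tau_1$ in Figure 3a, is ready immediately and leaves the status vector unchanged.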

Notify phase
To implement the notify phase, a notification vector (of size $|K|$) is required for each task $\tau$, describing which cores have to be informed that the task has finished execution. Each bit in the notification vector represents the core $k$ at which the outgoing edge of the scheduling dependencies ends, i.e.: where $notifyVector_\tau$ is the notification vector of task $\tau$. The notification vectors are created offline for each task $\tau$, based on the dependency relation $E_{isRA}$, but may be modified during a relax phase of another core to reflect the dependency relation $E_{dyn}$. For instance, in Figure 3a, the notification vector of task $\tau_1$ is (11); when it finishes execution, it has to notify task $\tau_2$ running on core 0 (bit 0) and task $\tau_3$ running on core 1 (bit 1). Through the notification, the $k$-th bit in the status vector of core $i$ is set by core $k$, when the finished task of core $k$ has an outgoing edge to a task on core $i$. For example, in Figure 3b, the status vector of core 0 is 01, since task $\tau_0$ finished execution and notified only core 0.

Algorithm 3 describes the functionality of the notify phase of the controller on core $k$. After the task $\tau$ on core $k$ completes its execution, the controller has to update the status of all the relevant cores. To do so, for each successor $\tau'$ of task $\tau$, the controller tries to gain access to the critical section of the successor's core protection mechanism (L.3). If access is granted, the controller verifies if the dependency still exists (L.4). If it exists, the controller tests if the previous update from core $k$ has already been consumed by the core $\mu(\tau')$ to which $\tau'$ is mapped (L.5). If this is true, the $k$-th bit in the status of core $\mu(\tau')$ is set; otherwise, the controller clears the $k$-th bit from the ready vector of task $\tau'$, indicating that the dependency from core $k$ has been met. For instance, Figure 3b illustrates this case, where Core 0 updates its own bit after task $\tau_0$ finishes. After the controller has updated all relevant status vectors, it updates the start time of its active task with the time of the next task (L.2, Algorithm 1) and computes the minimum among the active tasks (see Section 3.4).

Algorithm 3: Notify phase of isRA-DYN mechanism on core $k$.
Input: Task $\tau$, array of all status vectors ($status_i[j]$: the $j$-th status bit of the $i$-th core).
1 Function updateStatus($\tau$, $status[\,]$):
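The notification step described above could look roughly as follows in Python. This is only a sketch: the function name `notify_phase`, the argument layout, and the dictionary of ready vectors are assumptions made for illustration, and the listing of Algorithm 3 itself is not reproduced here. Bit vectors are plain integers, and one lock per core stands in for the protection mechanism.

```python
import threading


def notify_phase(k, successors, status, ready_vec, notify_vec, locks):
    """Notify the cores of all successors that the task on core k finished.

    successors : list of (successor task id, core the successor runs on)
    status     : list of per-core status bit vectors (ints)
    ready_vec  : dict mapping task id -> ready bit vector
    notify_vec : notification bit vector of the finished task
    locks      : one lock per core, guarding that core's vectors
    """
    bit_k = 1 << k
    for succ, core in successors:
        with locks[core]:
            if not (notify_vec >> core) & 1:
                continue                      # dependency was relaxed away
            if (status[core] & bit_k) == 0:
                status[core] |= bit_k         # previous update was consumed
            else:
                ready_vec[succ] &= ~bit_k     # record it in the ready vector
```

The second branch covers the case where an earlier notification from core $k$ is still pending in the status vector, so the new one is recorded directly in the successor's ready vector instead.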

Relax phase
In case the task is not ready to be executed, isRA-DYN tries to relax the partial ordering of the tasks, iff the introduced interference is less than the global slack (the amount by which the execution has already advanced). That is, task $\tau$ is allowed to overlap with the active tasks, iff all active tasks started at least $n$ time units before their time-triggered start time $\beta$, and task $\tau$ would introduce interference of less than $n$ time units to each one of them. This is illustrated in Algorithm 4 (L.2), where the global slack has to be greater than the interference that task $\tau$ will introduce, denoted as $\iota_{max}(\tau)$, plus the WCET required for executing the relaxation, denoted as $C_S$.
The relaxation strategy that isRA-DYN follows is an "all-or-nothing" approach, in the sense that either all the incoming scheduling dependencies $E^{-}_{\tau}$ of task $\tau$ (but no data ones) are removed, or the relaxation is postponed to a later invocation. The reasoning behind this design choice is that it provides short alternation between ready and relax phases. This minimizes the worst-case response time from the time a task becomes ready to when the task is executed by the controller. More formally, the result of such relaxation is:

To achieve such relaxation, the controller of core $k$ tries to gain access to its critical section (L.4) and clears the $k$-th bit of the notification vector of each predecessor task $\tau'$ (L.5-8), indicating that the dependency has been removed, as illustrated in Figure 3c. In order to reflect these changes in its own status and ready vectors, it registers which dependencies have been removed in a local variable, i.e., $modMask$ (L.8). By definition, a dependency from a predecessor task $\tau'$ on the same core $k$ is met, i.e., the notification from core $k$ has already occurred. Hence, the local variable is initialized with all bits set, except the $k$-th bit (L.3). For the same reason, the $k$-th bit of that task's notification vector is not reset (L.6). Finally, the controller resets all the bits of its status and ready vector that were modified by the relaxation process, according to the local variable (L.10-11), and tests (L.13) if the task is indeed ready (to avoid re-execution of the ready phase), as shown in Figure 3d. Notice that, in case some of the tasks are data-dependent, the dependency is preserved (L.6), thus ensuring proper ordering of data-dependent tasks.

Algorithm 4: Relax phase of isRA-DYN mechanism on core $k$.
Input: Task $\tau$, $status_k[\,]$ bit vector.
Output: true if task $\tau$ is ready after the relaxation; otherwise false
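One possible reading of the relax phase is sketched below in Python. It is an interpretation of the prose, not the authors' code: the function name `relax_phase`, the tuple layout of `preds`, and the way `mod_mask` is applied (bitwise AND on the status and ready vectors) are assumptions. Locking of the critical section is omitted for brevity.

```python
def relax_phase(k, task, preds, global_slack, iota_max, c_relax,
                status, ready_vec, notify_vec, num_cores):
    """Try to drop all incoming scheduling dependencies of `task` on core k.

    preds       : list of (predecessor id, predecessor core, is_data_dep)
    iota_max    : upper bound on the interference `task` may introduce
    c_relax     : WCET of the relaxation itself (C_S in the text)
    Returns True if the task is ready after the relaxation.
    """
    if global_slack <= iota_max + c_relax:
        return False                          # not enough slack: postpone
    # All bits set except bit k: the same-core dependency is met by definition.
    mod_mask = ((1 << num_cores) - 1) & ~(1 << k)
    for pred, core, is_data_dep in preds:
        if core == k or is_data_dep:
            continue                          # never remove data-dependencies
        notify_vec[pred] &= ~(1 << k)         # predecessor need not notify k
        mod_mask &= ~(1 << core)              # remember the removed dependency
    status[k] &= mod_mask
    ready_vec[task] &= mod_mask
    return (status[k] & ready_vec[task]) == ready_vec[task]
```

With a remaining data-dependency the function returns false until the corresponding notification arrives, which matches the all-or-nothing treatment of scheduling dependencies only.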

Global slack computation
In order to relax the partial order of tasks, it is essential to know at run-time the amount of global slack, i.e., the minimum current time-slack across all cores. The time-slack of a core expresses the amount of time by which the execution of its tasks has been sped up. Speed-up occurs when the actual execution of a task is shorter than its isWCET. Formally, we define time-slack as the difference between the actual response time $R(\tau)$ of a task $\tau$ and its end time (τ ), and it shall not be confused with the typical term slack, meaning laxity. As the actual response time $R(\tau)$ is not known a priori, we use a safe slack approximation $\sigma_\tau$: where $t$ is any time instance between the time-triggered start time and when the task becomes ready, i.e., all its predecessors have finished. Notice that $\sigma_\tau$ is safe, since the difference in response time between two consecutive (in time) tasks cannot be greater than the isWCET of the latter task, i.e., the difference between its end time and its start time $\beta(\tau)$.

This safe approximation enables an efficient computation of the global slack, i.e., the minimum slack of all cores, at any time instance $t$, in a distributed manner, without requiring any sort of synchronisation or explicit exchange of information among cores. This is achieved by subtracting the current time instance $t$ from the minimum start time of all active tasks, as outlined in Algorithm 5 (L.9). To avoid inter-core information exchange, a global array is used to store the start time of the active task on each core, and a global variable holds the minimum value of the array. The start time of an active task is updated every time a core has to execute a new task (L.3 of Algorithm 1). As soon as a core $k$ finishes the notify phase of a task, it proceeds to its next task $\tau$. As a new task is now active, the controller of core $k$ stores the old start time in a local variable and updates the start time of its active task with the new one (L.2-3). If its old start time is equal to the minimum value of the array, this controller was the owner of the minimum value, and thus, it has to recalculate the new minimum of the global array (L.4-7). Otherwise, it delegates this computation to the controller that is the owner of the minimum value of the array.
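The bookkeeping described above can be illustrated with a short Python sketch. The class name `GlobalSlack` and its methods are hypothetical, and the sketch ignores concurrency: it only shows the owner-recomputes-the-minimum rule and the subtraction of the current time from the minimum start time.

```python
class GlobalSlack:
    """Hypothetical tracker of the minimum start time of all active tasks."""

    def __init__(self, start_times):
        self.start = list(start_times)    # start time of the active task per core
        self.min_start = min(self.start)  # global minimum of the array

    def advance(self, k, new_start):
        """Core k moves on to its next task, which starts at new_start."""
        old = self.start[k]
        self.start[k] = new_start
        if old == self.min_start:
            # This core owned the minimum, so it recomputes it.
            self.min_start = min(self.start)
        # Otherwise the recomputation is left to the owner of the minimum.

    def slack(self, now):
        """Global slack: minimum scheduled start time minus the current time."""
        return self.min_start - now
```

For two cores whose active tasks are scheduled to start at 10 and 20, the slack observed at time 4 is 6. Once the core holding the minimum advances to a task starting at 30, the minimum is recomputed over the whole array.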

Deadlock freedom
Since isRA-DYN is a distributed approach, it is important to establish its correctness. Here we prove that isRA-DYN is free from deadlocks, while time-correctness is addressed in Section 4. As a stepping stone, we first prove deadlock freedom for its static behavior, i.e., when no relaxation occurs.

Lemma 1. Assuming that each set/reset operation on the $k$-th bit of the bit-vectors ($status$, $readyVector$, $notifyVector$) is atomic, the static behavior of isRA-DYN is free from deadlocks.
Proof. Consider two dependent tasks, $(\tau, \tau') \in E_{isRA}$; there are two distinct cases when task $\tau$ finishes its execution and notifies core $\mu(\tau')$:
1. the $\mu(\tau)$-bit of the status for core $\mu(\tau')$ is not set ($status_{\mu(\tau')}[\mu(\tau)] = 0$), which results in setting the bit after notification (L.6, Algorithm 3).
2. the $\mu(\tau)$-bit is already set ($status_{\mu(\tau')}[\mu(\tau)] = 1$) by some other task, which results in resetting the $\mu(\tau)$-bit of $readyVector_{\tau'}$ of task $\tau'$ (L.8, Algorithm 3).
At the ready phase of task $\tau'$, either the $\mu(\tau)$-bit of the ready vector is zero, or both the $\mu(\tau)$-bits of the status and the ready vector are set; in both cases task $\tau'$ is considered ready w.r.t. its dependency with task $\tau$. In the former case, the value of that status vector bit is preserved (via XOR with the zero of the ready vector), in order to be reset by the corresponding task, while in the latter case that bit is reset. Since the value of the $\mu(\tau)$-bit of the status vector is the same before the notify phase and after the ready phase, it is straightforward to show that isRA-DYN is deadlock-free, via induction on $E_{isRA}$ on all $|K|$ bits.

Theorem 2. Assuming that each set/reset operation on the $k$-th bit of the bit-vectors ($status$, $readyVector$, $notifyVector$) is atomic, isRA-DYN is free from deadlocks.
Proof. Consider two dependent tasks, $(\tau, \tau') \in E_{dyn}$, where $\tau'$ is about to be executed; there are two distinct cases for task $\tau'$, namely either it is ready or it could be relaxed. The former case corresponds to the static behavior, whose correctness we have established in Lemma 1. In the latter case, there are two options for the $\mu(\tau)$-bit of $status_{\mu(\tau')}$:
1. the $\mu(\tau)$-bit has been set by $\tau$
2. the $\mu(\tau)$-bit is not set
In both cases the $\mu(\tau)$-bit of $status_{\mu(\tau')}$ is reset, resetting also the $\mu(\tau)$-bit of $readyVector_{\tau'}$ (L.10-11, Algorithm 4). Since $\tau'$ is about to execute, if the $\mu(\tau)$-bit is set, it cannot have been set by any task $\tau''$ other than $\tau$, as it would have already been reset by the successive task of $\tau''$ on core $\mu(\tau')$. It is thus straightforward to show that isRA-DYN is deadlock-free, via induction on $E_{dyn}$ on all $|K|$ bits.
In the presented algorithms, the modification of the bit-vectors is protected, satisfying the assumption of atomic set/reset operations on bits. In particular, since a ready and a relax phase cannot be executed simultaneously on the same core, it suffices to use a single protection mechanism per core. Additionally, since data-dependencies are always preserved (L.7 in Algorithm 4), the execution with isRA-DYN cannot introduce race conditions to the application itself. It should be stressed that the global variable holding the minimum start time of all active tasks across cores does not require protection in order to be safe. The minimum start time is monotonically increasing as the execution progresses, and it is used for the calculation of the global slack. Accessing the global variable without protection can result in missing a write from another core. This means that the controller uses an older value, which is smaller than the new one. This only results in a smaller computed global slack, and thus, only in missed opportunities for relaxation. This is done deliberately, in favor of run-time performance.

Response time analysis
Introducing run-time mechanisms into time-critical systems can improve system performance through better utilisation of the system resources. Nevertheless, the controllers themselves require processing time, thus they can alter the timing behavior and potentially violate timing guarantees, if the controller WCET is not properly accounted for. To overcome this issue, either additional tasks are incorporated into the model used to derive the safe time-triggered schedule, or the WCET of the controller is incorporated into the WCET of each task (modulo some timing alignment). Especially for interference-sensitive systems, attention must be paid to potential interference created by the controller, i.e., accesses to variables shared among cores (status, ready and notify vectors). If these variables are placed alongside the task data, additional interferences must be accounted for, due to the parallel execution of a control phase and a task. Multiple approaches exist to mitigate this effect. The control data can be placed in a separate memory accessed by a separate bus, when the platform provides such a feature. Alternatively, shared control data can be placed in the local memories of each core and accessed through remote reads/writes [2]. If such solutions are not possible, the amount of induced interference can be controlled, either by using resource-partitioning techniques common in real-time systems or by bounding the number of invocations of the controller, e.g., using non-interrupting hardware events [21].
In Algorithm 1, the proposed control mechanism is divided into two alternating phases, namely ready and relax, followed by two consecutive phases, execute and notify. We consider the absolute response time of a task $\tau$ to be the time when it finishes execution and the notify phase has performed all the status updates, i.e., the absolute response time of a task $\tau$ is the same as the response time of its update phase:

Notice that, while the execution phase $X_\tau$ has a fixed isWCET (without considering the additional interferences due to relaxation), the ready and notify phases have varying isWCET, which depends on a number of factors. For any task $\tau$, the isWCET of the ready phase depends on: (i) when it will gain access to its critical section, and (ii) when the task is going to be ready, i.e., all previous tasks have finished and all updates have been performed. The WCET of the update phase depends on: (i) when it will gain access to its critical section, (ii) the number of cores to notify, and (iii) when previous tasks, which depended on this core, finish their ready phase (s.t.
In order to make our response time analysis accurate, we derive parametric response times $R$, based on the number of outgoing edges of a task $\tau$, according to the scheduling dependencies $E_{isRA}$. We denote with $C_N[L]$ the WCET of the controller part that corresponds to the snippet $L$, i.e., the sequence of lines $L$ of Algorithm $N$. We perform the RTA for the most restrictive case, i.e., for tasks with data-dependencies. An RTA for independent tasks would provide at least the same response time bounds for the same task set, if not better.

ECRTS 2020
Accessing a critical section. While one core can be in only one control phase at any time instance, different cores can be in different phases. Hence, for a core to enter any particular critical section, it may have to wait for all the other cores to finish their critical sections. The WCETs of the critical sections of the ready, notify and relax phases are $C_2[3{-}6]$, $C_3[3{-}9]$ and $C_4[9{-}12]$, respectively. Thus, the worst-case wait time for a core to access critical section $i$ is:

Ready Phase. Let $t^{\tau}_{R}$ be the time instance at which task $\tau$, running on core $k$, becomes ready, i.e., all predecessor tasks have finished their corresponding notify phases:

The response time of the ready phase, if the controller is invoked precisely at time $t^{\tau}_{R}$, is the WCET of acquiring access to critical section $k$ plus the WCET of executing that section:

where $C^{TA}_{R}$ is a timing alignment constant, analysed below. The response time $R(R_\tau)$ is defined recursively, as it depends on the maximum response time of preceding tasks ($t^{\tau}_{R}$). This will assist us in proving that, under any AET, the execution respects the timing guarantees.

Execute Phase.
As there are no preemptions during the execution of a task, if $\iota_{E_{dyn}}(\tau)$ is the interference that the task sustains because of the relaxed dependency relation $E_{dyn}$, the response time for the execution phase of the controller is:

Notify Phase. Following the execution of task $\tau$ on core $k$, the controller updates each core's status and the minimum start time of the active tasks. For each outgoing dependency, the core $k$ has to gain access to a distinct critical section and perform a write to either the status or the ready vector:

Since the worst-case waiting time in Equation 8 is constant, we have replaced $C_{\mu(\tau')}$ with $C_k$ to derive the final expression.
Relax Phase. Let $t^{\tau}_{S}$ be a time instance at which task $\tau$, running on core $k$, is not ready yet, i.e., $t^{\tau}_{S} < t^{\tau}_{R}$, but the global slack is large enough to accommodate the additional interferences. If the relax phase is invoked at time $t^{\tau}_{S}$, it has to remove all the dependencies (excluding the data-dependencies) and acquire access to its critical section, in order to write the status vector. Hence, its response time is:

$$R(S_\tau) \le t^{\tau}_{S} + C_{S_\tau} \qquad (14)$$

Notice that the loop body (L.9-13) in Algorithm 4 is executed only $deg^{+}(\tau) - 1$ times, since the dependency from the core itself is met by default. This is reflected by the corresponding degree term in the response time $R(S_\tau)$.

Timing alignment. In the RTA of the ready phase, we assumed that the controller starts precisely at the time when all the required status updates for the ready phase have been performed. Nevertheless, since tasks can execute in less time than their isWCET, the controller may already be inside a relax phase when the last status update occurs. In the worst case, there will be a single data dependency that is not removed by the relax phase. Therefore, the timing alignment constant $C_{TA_R}$ for the ready phase is derived from $C_S$, the WCET of the relax phase with the maximum number of dependencies, i.e., $|K|$.

Safety
Having the WCET and the response time of the controller phases, we need to prove that, if these costs are added upfront to the isWCET of the tasks, the timing guarantees of any time-triggered schedule are not violated under any run-time reduction of execution times, i.e., R(τ) ≤ (τ) for all tasks τ. Let $C_{R_\tau}$, $C_{N_\tau}$, $C_{S_\tau}$ denote the WCET of the control phases, and $C_{TA_R}$ the timing alignment constant, as analysed in the previous sections. Assume a safe solution (µ, β, ), derived by a safe scheduling algorithm, that includes the controller WCETs in the isWCET of each task. Before proving the safety of the approach, we establish some properties regarding the impact of relaxations on the isWCET of the tasks and on the control phases.

Property 1. Relaxation does not increase the WCET of the control phases.
Proof. The WCET of the control phases depends on the indegree and outdegree of each task τ according to the dependency relation $E_{isRA}$ (Equations 10, 13, 15). The dependency relation $E_{dyn}$ is a decreasing relation (Equation 5) starting from $E_{isRA}$; thus the WCET of the control phases cannot increase with a relaxation.

Property 2. Given a relaxed dependency relation $E_{dyn} \subseteq E_{isRA}$, a task $\tau$ can suffer additional interference at most equal to its slack $\sigma_\tau$.
Proof. If task $\tau$ is the task with the minimum start time among the active tasks, then $\sigma_\tau \ge \iota_{max}(\tau)$ (L. 2 in Algorithm 4). Otherwise, let $\tau_{min}$ be the task with the minimum start time, i.e., $\beta(\tau) \ge \beta(\tau_{min})$; the claim then follows from Equation 6.

Proof of Theorem 3. By induction on the dependency relation $E_{dyn}$.
We have therefore established that isRA-DYN is timely safe, and that it relaxes the dependency relation when enough global slack exists to accommodate the additional interference. Furthermore, the execution is work-conserving with respect to $E_{dyn}$, which improves run-time performance, as shown in Section 5.
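
The relaxation rule summarized above can be sketched as follows. This is an illustrative simplification and not the controller of Algorithm 4: the function names, the edge-set representation and the way slack and interference are passed in are all assumptions. It only shows the two properties stated in the text, namely that a relaxation happens only when the slack covers the maximum additional interference, and that data dependencies are never removed, so the result is always a subset of the original relation.

```python
def can_relax(slack, max_interference):
    """A relaxation is allowed only if the slack covers the worst-case
    additional interference (illustrative form of the guard)."""
    return slack >= max_interference


def relax(deps, task, data_deps, slack, max_interference):
    """Return the dependency relation after trying to relax `task`.

    deps:      set of (predecessor, successor) edges (hypothetical format).
    data_deps: subset of `deps` holding data dependencies, never removed.
    """
    if not can_relax(slack, max_interference):
        # Not enough slack: the relation is left untouched.
        return set(deps)
    # Drop every incoming edge of `task` except its data dependencies.
    return {(p, s) for (p, s) in deps if s != task or (p, s) in data_deps}


# Illustrative relation: c depends on a (data) and on b (scheduling only).
E_isRA = {("a", "c"), ("b", "c"), ("a", "b")}
data = {("a", "c")}
relaxed = relax(E_isRA, "c", data, slack=5, max_interference=3)
unchanged = relax(E_isRA, "c", data, slack=2, max_interference=3)
```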

Experimental Setup
Platform. A real multi-core COTS platform, the TMS320C6678 chip (TMS in short) of Texas Instruments [25], is used for the experiments. The platform characteristics are depicted in Table 2. The isRA-DYN mechanism is implemented as a bare-metal library, with low-level functions for the controller phases that use the TMS hardware semaphores. The semaphore-based implementation of isRA-DYN is applicable to any platform, since semaphores are a fundamental building block. In the rare case that no such hardware support exists, a software implementation can be used. Moreover, the isRA-DYN approach can also be implemented with other protection mechanisms.
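
As a rough illustration of a software fallback, the following sketch guards a shared status vector with a binary semaphore. It uses Python threads instead of bare-metal code on the TMS cores, and `threading.Semaphore` stands in for a hardware semaphore; the names `status`, `notify` and `run_demo`, the core count and the stored values are hypothetical.

```python
import threading

NUM_CORES = 4                          # illustrative core count
status = [0] * NUM_CORES               # shared controller status vector
status_lock = threading.Semaphore(1)   # binary semaphore guarding the vector


def notify(core_id, finished_task):
    """Record, inside a critical section, the task that `core_id` finished."""
    with status_lock:
        status[core_id] = finished_task


def run_demo():
    """Let every simulated core write its entry, then return a snapshot."""
    threads = [
        threading.Thread(target=notify, args=(core, core + 10))
        for core in range(NUM_CORES)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return list(status)
```

Only one simulated core at a time can be inside the `with status_lock:` block, which mirrors the mutual exclusion that the critical sections of the controller phases rely on.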
Benchmarks. To experimentally evaluate our isRA-DYN approach, we have conducted experiments using three applications from the StreamIT benchmarks [26] that differ in the number of tasks, WCET, and WCRA: i) Discrete Cosine Transformation (DCT), ii) Mergesort (MERGE), and iii) Fast Fourier Transformation (FFT).

WCET and WCRA acquisition. Since no existing static WCET analysis tool supports the TMS platform, a measurement-based approach has been used to acquire the WCET of each task. Obtaining safe and context-independent measurements requires eliminating the sources of timing variability [6], by disabling data caches, removing interferences (i.e., the task is executed alone on one core) and providing input data that enforce the worst-case path. To perform our measurements on the real platform, we used the local timer of the core. To increase the reliability of the measurements, each task has been executed 50 times, which has been shown to provide a small standard deviation [14], and the largest observed value has been retained. The application has been compiled with -O0, i.e., no optimizations, in order to obtain the WCRA of each task from the produced binary. Table 2 depicts the overall WCET, WCRA and number of tasks of each benchmark, used to obtain the offline near-optimal solutions.

Data placement. The controller data are placed on the on-chip Multicore Shared SRAM Memory (MSM), while application data are placed in the off-chip main memory (DDR3), ensuring that isRA-DYN does not interfere with the tasks' execution.
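
The repeated-measurement procedure described above can be sketched as follows. This is a host-side illustration only: `time.perf_counter` stands in for the local timer of the core, the function name `measure_wcet` is hypothetical, and the sketch does not reproduce the isolation steps (disabled caches, single core, worst-case input) that the measurements rely on.

```python
import time


def measure_wcet(task, runs=50):
    """Run `task` `runs` times and return the largest observed duration
    in seconds (a measurement-based estimate, not a static bound)."""
    worst = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        task()
        elapsed = time.perf_counter() - start
        # Keep the maximum over all runs.
        worst = max(worst, elapsed)
    return worst
```

A call such as `measure_wcet(lambda: sum(range(1000)))` returns the largest of the 50 observed durations for that callable.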
Comparison. The proposed approach is evaluated by comparing the run-time execution time of the tasks allocated on each core (a.k.a. makespan) obtained by the proposed dynamic approach (isRA-DYN) with that of the static run-time approach (isRA-LOCK), which enforces the partial order of tasks [22]. The offline isWCET schedule has been generated by [24] and is used as input to both approaches. To attribute the observed gains to isRA-DYN, any system parameter that may lead to timing variability at run-time should be controlled and explored independently, whenever possible. These parameters are mainly the interferences, the different execution paths of the benchmarks and the impact of caches. Therefore, we initially explore the timing variability that each benchmark can exhibit when executed on the TMS platform while a single system parameter is tuned at a time. Then, we present the gains provided by isRA-DYN over isRA-LOCK by comparing the makespan under different levels of timing variability of the benchmarks. Last, but not least, we compare the overhead of the isRA-DYN and isRA-LOCK approaches.

Evaluation results
Characterization of timing variability. The main system parameters that can alter the execution time are the occurring interferences, the diverse execution paths of the benchmarks and the caches. In this first experimental section, we tune each of these parameters independently in order to characterize its impact on the timing variability per benchmark. To compute the timing variability, the execution times of the best observed case and the worst observed case are compared. Table 3 shows the timing variability due to caches and diverse paths (computed without any interferences, i.e., running the benchmark alone on a core), and the timing variability due to interferences, when all cores are running the same benchmark. We observe that, even when the application is executed in isolation, the impact of caches on the execution time is quite high for all applications, with 71.14% on average. The impact of different execution paths depends on the application type: it is higher for DCT, an application that has several execution paths, and much smaller for FFT, which is a single-path application. Last but not least, we observe an important impact of the interferences, with 45.85% on average, with disabled caches. When caches are enabled, the interference impact is reduced, since the caches are large enough to keep the data locally.

Table 3 (excerpt) Timing variability per benchmark, with caches disabled and enabled.
Caches     DCT      MERGE    FFT
Disabled   46.65%   12.84%   0.15%
Enabled    40.51%   14.69%   0.46%

Makespan comparison. We perform an exhaustive set of experiments to explore and quantify the behavior of the proposed approach. We have used three configurations, i.e., two, four and eight application instances on two, four and eight cores, respectively. In addition, in order to explore the behavior of isRA-DYN with respect to the timing variability due to interferences, caches, and multiple execution paths of the application, we have performed experiments in which we insert into each task a timing variability from 0% up to 40%, on average. Each experiment has been executed for twenty consecutive iterations. During the execution, we observed no timing violations with respect to the offline solution. Due to page limitations, we only present the measured makespan of each core for the eight-core configuration in Figure 4, in the form of a box plot. However, we thoroughly analyze the behavior of our approach by providing the gains of the proposed isRA-DYN compared to isRA-LOCK for all experiments. The makespan gain is computed for each core from the makespans of the two approaches as

$$\frac{(\text{isRA-LOCK}) - (\text{isRA-DYN})}{(\text{isRA-LOCK})}$$

Tables 4, 5 and 6 depict the average makespan gain per core and the average makespan gain across all cores, for all configurations.

a) General observations: The first and important observation is that the behavior of isRA-LOCK is similar, in terms of minimum, maximum and average makespan, for all cores and all benchmarks, under any timing variability. This behavior of isRA-LOCK is due to the fixed partial task order, and it motivates the use of a dynamic approach that can exploit the variability occurring at run-time. In contrast to isRA-LOCK, the makespan distribution of isRA-DYN varies among cores. As isRA-DYN performs partial-order relaxations, it allows earlier task execution and additional interference to occur, which varies each core's makespan. When the variability is increased, this variation becomes more important.

b) Minimum timing variability (0%): This experimental set-up is the worst for the proposed approach, since the timing variability of the benchmarks is eliminated as much as possible. To achieve that, the same execution path is used among executions and caches are disabled. However, it is not possible to eliminate the interferences occurring from the parallel execution of tasks. For all the experiments, we observe that the behavior of isRA-DYN improves over the behavior of isRA-LOCK, in all cores, as the number of cores increases. More precisely, for the configuration with two cores and 0% variability, isRA-DYN provides a small gain (from 0.08% for MERGE up to 0.22% for FFT, with an average of 0.145% among all applications). As the number of cores is increased, the gains are also increased, especially for MERGE. Compared to the two-core configuration, the gain is increased on average by a factor of 3.11 for DCT, 36.63 for MERGE and 1.93 for FFT when four cores are used, and by a factor of 2.95 for DCT, 44.64 for MERGE and 4.28 for FFT when eight cores are used. The high gain factor of the MERGE benchmark is due to the low gain when only two cores are used. The lower gain, when only two cores are used, is attributed to the small number of interferences occurring during execution, in combination with a slightly higher run-time overhead, due to the relax phase, compared to isRA-LOCK. However, as the number of cores is increased, the number of occurring interferences is increased, and thus the gain is higher. As the interferences are the only source of timing variability in this experimental set-up, the achieved gain of isRA-DYN verifies that the proposed approach is capable of exploiting the interferences occurring during execution, in contrast to isRA-LOCK.

c) Tuning timing variability (from 5% to 40%): To quantitatively characterize the behavior of the proposed approach when other sources of timing variability occur on top of the interferences, we insert an average variability of 5%, 10%, 20% and 40% in the WCET of the tasks (WCRA remains unchanged). For all the experiments, we observe that, as task variability is inserted, isRA-LOCK fails to take advantage of it during execution, due to its fixed partial order policy. On the other hand, as the variability is increased, the proposed approach provides higher gains. More precisely, we observe that with the two-core configuration (which is the configuration with the minimum possible interferences), the gains of isRA-DYN are significant compared to isRA-LOCK. In particular, with a timing variability of 5%, the average gains are increased to 1.125% for DCT, 0.935% for MERGE, and 0.490% for FFT (with an average of 0.85% for all benchmarks). Tables 4, 5 and 6 show that the gains increase with increasing timing variability. For two cores, the maximum gain for 40% variability is 11.35%, observed for C0 running DCT. As the number of cores is increased, the gains are also increased. This occurs because the proposed approach is able to take advantage of both the inserted timing variability and the occurring interferences. When four cores are used, the average gain over all benchmarks is 1.89%, 3.82%, 8.46% and 15.86%, for 5%, 10%, 30% and 40% variability, respectively. The maximum gain for 40% variability is 19.85%, observed for C1 running MERGE. When eight cores are used, the gains are even higher, i.e., with an average gain over all benchmarks equal to 3.45%, 7.11%, 14.22% and 23.26%, for 5%, 10%, 30% and 40% variability, respectively. The maximum gain for 40% variability is 25.31%, observed for C4 running MERGE.
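
The per-core makespan gain used in this comparison, and its average across cores, can be computed as in the following sketch. The function names and the makespan values in the example are illustrative assumptions, not measurements from the experiments.

```python
def makespan_gain(lock, dyn):
    """Relative gain of isRA-DYN over isRA-LOCK for one core, in percent:
    (LOCK - DYN) / LOCK."""
    return 100.0 * (lock - dyn) / lock


def average_gain(lock_per_core, dyn_per_core):
    """Return the per-core gains and their average across all cores."""
    gains = [makespan_gain(l, d) for l, d in zip(lock_per_core, dyn_per_core)]
    return gains, sum(gains) / len(gains)


# Illustrative makespans (arbitrary time units) for a two-core configuration.
example_gains, example_average = average_gain([100.0, 200.0], [90.0, 150.0])
```

A positive gain means that the core finished earlier under isRA-DYN than under isRA-LOCK; a negative value would indicate that the controller overhead outweighed the benefit of the relaxation.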

Controller cost
Table 7 depicts the corresponding WCET values for the isRA-DYN and isRA-LOCK approaches. Due to the additional relax phase, the overhead of the isRA-DYN controller is higher than that of the isRA-LOCK controller. Despite the increased overhead, isRA-DYN can provide further performance improvements, as shown in the previous paragraphs.

Related Work
The run-time mechanisms are categorized according to whether: i) the considered tasks are only time-critical or also best-effort, ii) the WCET is pessimistic or interference-sensitive, and iii) the adaptation is static or dynamic. A detailed survey of the state of the art is available in [4]. The run-time mechanisms considering only time-critical tasks must guarantee the timely execution of the complete task set. The mechanisms that consider the pessimistic WCET are typically based on static decisions, i.e., the execution of a new task can start as soon as a task finishes earlier than its pessimistic WCET. Typical examples of such approaches come from scheduling theory, e.g., [1,5]. However, the use of pessimistic WCET over-approximates the interferences, with a negative impact on performance and schedulability. To tackle over-approximated WCETs due to interferences, several approaches incorporate interference analysis and provide interference-sensitive WCETs, such as [17,19,24]. In general, these approaches provide improved timing guarantees, since they compute a context-dependent upper bound of the interferences for a particular schedule. To improve the provided upper bounds, some approaches take into account the length of task overlapping, e.g., [17], or the precise timing of the requests, e.g., [20], or even provide contention-free schedules, e.g., [19]; a detailed survey of such approaches can be found in [12]. To further reduce the impact of the inherent pessimism in any kind of WCET estimation, several run-time mechanisms have been proposed. In [21,22,24], the authors provide a run-time approach suitable for interference-sensitive WCET. However, these approaches act upon static decisions and are unable to modify the partial order of tasks created offline during the interference-sensitive scheduling. In contrast, the proposed isRA-DYN approach embraces dynamic decisions, allowing safe modification of the isWCET schedule and leading to performance improvements.
The run-time mechanisms for time-critical and best-effort tasks assume the timely execution of time-critical tasks when they run in isolation. Based on this assumption, they decide the execution of best-effort tasks so as to still guarantee the timely execution of time-critical tasks. Such approaches use different confidence levels in the WCET estimation [3], compute the remaining WCET in isolation for the time-critical tasks [9][10][11], use resource usage capacities [15,16] and partitioning of the memory accesses [27]. The isRA-DYN approach is orthogonal to these approaches, since it focuses on providing timing guarantees for the time-critical tasks.

Conclusion
In this work, we propose isRA-DYN, a dynamic interference-sensitive run-time adaptation technique that alleviates the limitations of the existing isRA-LOCK, since it safely relaxes the partial order of isWCET schedule solutions whenever this is possible. We have presented the corresponding RTA for our technique and have formally argued for its safety under any execution. The obtained results show that isRA-DYN outperforms isRA-LOCK, as it can exploit the variability in the actual execution of tasks. When two cores are used without any inserted variability, the interferences are few and isRA-DYN provides small gains. However, with increasing variability, even under few interferences, isRA-DYN provides significant gains. As the number of cores is increased, isRA-DYN provides better gains.
Figure 1 Motivational example: four tasks running on two cores and their isWCET dependencies. Panel: relaxed partial order.
Figure 2 Construction of $E_{isRA}$ scheduling dependencies, based on a given time-triggered schedule. Panel: scheduling dependencies $E_{isRA}$.

Figure 2 illustrates the construction of $E_{isRA}$ given a time-triggered schedule. Notice that, for any task $\tau$ in the dependency relation $E_{isRA}$, the number of incoming edges (denoted as $deg^{-}(\tau)$) and the number of outgoing edges (denoted as $deg^{+}(\tau)$) are upper-bounded by the number of cores $|K|$, i.e., $deg^{-}(\tau) \le |K|$ and $deg^{+}(\tau) \le |K|$. The proposed approach relaxes, whenever possible, the dependency relation $E_{isRA}$; the relaxed relation is denoted as $E_{dyn} \subseteq E_{isRA}$. The proposed dynamic adaptation mechanism is divided into four phases, namely ready, relax, execute and notify, which are respectively denoted with $R_\tau$, $S_\tau$, $X_\tau$ and $N_\tau$ for any task $\tau$. Since time-triggered schedules refer to absolute time, we shall denote the absolute response time of the control phases with $R(R_\tau)$, $R(S_\tau)$, $R(X_\tau)$, $R(N_\tau)$. Finally, for a dependency relation $E \subseteq E_{isRA}$ and the set $T^{E}_{\tau}$ of tasks that are potentially parallel to task $\tau$ under $E$, we assume that the interference $\iota_{E}(\tau)$ that $\tau$ can cause to, and sustain from, $T^{E}_{\tau}$ is computable and upper-bounded by $\iota_{max}(\tau)$. This is a realistic assumption: e.g., for a task $\tau$ with Worst-Case Resource Accesses $WCRA(\tau)$ to a resource arbitrated by a fair Round-Robin arbiter with arbitration delay $D_{RR}$, the interference that the task causes/sustains is bounded as in [18].

Figure panels: State of the system at time t. After τ0 finishes, the status of core0 is updated. Vector and status updates after τ3 is relaxed. core1 re-schedules τ3.

Theorem 3. For any safe scheduling solution, derived by adding the controller costs ($C_{R_\tau}$, $C_{N_\tau}$, $C_{S_\tau}$, $C_{TA_R}$) to the isWCET ($C_\tau$) of the tasks $T$, the isRA-DYN execution is safe under any AET, i.e.: R(τ) ≤ (τ)

Table 1 Notation summary.
Symbol                                        Meaning
$C^{N}_{L}$                                   The WCET of code snippet L of Algorithm N
$C_{R_\tau}$, $C_{S_\tau}$, $C_{N_\tau}$      Controller WCET for the corresponding phase
$t^{\tau}_{R}$, $t^{\tau}_{S}$, $t^{\tau}_{N}$   Time instance when the corresponding phase can execute successfully (all branches are not taken)

Table 2 Benchmark and platform characteristics.

Table 5 DCT: Average gains (%) per core (C) and among all cores (A).