From Formal Boosted Tree Explanations to Interpretable Rule Sets

The rapid rise of Artificial Intelligence (AI) and Machine Learning (ML) has invoked the need for explainable AI (XAI). One of the most prominent approaches to XAI is to train rule-based ML models, e


Introduction
Rapid development of Artificial Intelligence (AI) and Machine Learning (ML) has revolutionized all aspects of human lives in recent years [30,1]. However, decisions made by most widely used ML models are hard for humans to understand, hence the rising interest in the theory and practice of Explainable AI (XAI). One major approach to XAI is to compute post-hoc explanations for ML predictions to answer a "why" question [34,44], i.e. why the prediction is made. Although heuristic approaches to post-hoc explanations prevail [34,44,43], they suffer from a number of weaknesses [21,16,49,52]. Formal methods [48,20,37] provide alternative approaches to explanations that avoid these weaknesses. Another alternative approach to XAI is to compute interpretable ML models, i.e. logic-based models, including decision trees [40], decision lists [46], and decision sets [29]. These models enable decision makers to obtain succinct explanations from the models directly. In this paper, we focus on the decision set (DS) models.
Decision sets are particularly easy to explain: the rule that fired is an explanation of the decision. This led to an upsurge of interest in decision sets that are both interpretable and accurate. Recent work [50] uses propositional satisfiability (SAT) to generate minimum-size decision sets that are perfectly accurate on the training data, and demonstrates that decision sets that completely agree with the training data outperform others in terms of accuracy. A more scalable maximum satisfiability (MaxSAT) approach [18] to this problem was then proposed. Unfortunately, both of these methods are unable to provide any decision information if a dataset is not completely solved.
Motivated by these works and their limitations, this paper aims at making a bridge between formal post-hoc explainability and interpretable DS models. In particular, the paper focuses on developing a novel anytime approach to computing decision sets that are both interpretable and accurate, by compiling a gradient boosted tree model into a decision set on demand with the use of formal explanations. This is done with the use of the recent approach [17] to compute abductive explanations for gradient boosted trees using maximum satisfiability (MaxSAT). Furthermore, the paper proposes a range of post-hoc model reduction heuristics aiming at enhancing interpretability of the resulting models, done with MaxSAT and integer linear programming (ILP). The experimental results show that, compared with other state-of-the-art methods, decision sets generated by the proposed approach are more accurate, and comparable with the competition in terms of interpretability.

Preliminaries
SAT and MaxSAT. The standard definitions for propositional satisfiability (SAT) and maximum satisfiability (MaxSAT) solving are assumed [3]. A propositional formula ϕ is said to be in conjunctive normal form (CNF) if it is a conjunction of clauses. A clause is a disjunction of literals, where a literal is either a Boolean variable b or its negation ¬b.
A truth assignment µ is a mapping from the set of variables to {0, 1}. A clause is said to be satisfied by truth assignment µ if at least one of the literals in the clause is assigned value 1; otherwise, the clause is falsified. If all clauses in formula ϕ are satisfied by assignment µ, ϕ is satisfied; otherwise, assignment µ falsifies ϕ. A CNF formula ϕ is unsatisfiable if there exists no assignment satisfying ϕ.
In the context of unsatisfiable formulas, the MaxSAT problem consists in finding a truth assignment that maximizes the number of satisfied clauses. Hereinafter, we use a variant of MaxSAT called Partial Weighted MaxSAT [3, Chapters 23 and 24]. The formula ϕ in this variant is represented as a conjunction of hard clauses H, which must be satisfied, and soft clauses S, each of which is associated with a weight representing a preference to satisfy it, i.e. ϕ = H ∧ S. Partial Weighted MaxSAT problems aim at finding a truth assignment µ that satisfies all hard clauses and maximizes the total weight of satisfied soft clauses.
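To make the Partial Weighted MaxSAT definition concrete, the following minimal sketch enumerates all truth assignments of a tiny formula. It is a brute-force illustration only, not the RC2 solver used later in the paper; the function name `solve_partial_weighted_maxsat` and the example clauses are hypothetical.

```python
from itertools import product

def solve_partial_weighted_maxsat(n_vars, hard, soft):
    """Brute-force Partial Weighted MaxSAT over variables 1..n_vars.

    Clauses are lists of non-zero ints: k stands for variable k, -k for its negation.
    `soft` is a list of (clause, weight) pairs.
    Returns (best_weight, assignment), or None if the hard clauses are unsatisfiable.
    """
    def satisfied(clause, mu):
        # A clause is satisfied if at least one of its literals is true under mu.
        return any(mu[abs(lit)] == (lit > 0) for lit in clause)

    best = None
    for bits in product([False, True], repeat=n_vars):
        mu = {i + 1: b for i, b in enumerate(bits)}
        if not all(satisfied(c, mu) for c in hard):
            continue  # hard clauses must all hold
        weight = sum(w for c, w in soft if satisfied(c, mu))
        if best is None or weight > best[0]:
            best = (weight, mu)
    return best

# Hypothetical instance: H = (b1 or b2), S = {(not b1, weight 3), (not b2, weight 2)}.
hard = [[1, 2]]
soft = [([-1], 3), ([-2], 2)]
best_weight, best_mu = solve_partial_weighted_maxsat(2, hard, soft)
```

Here both soft clauses cannot hold together with the hard clause, so the optimum keeps the heavier one and sets b1 to false and b2 to true.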
Classification Problems. We consider classification problems with a set of classes K = {1, …, k} and a set of features F = {1, …, m}. The value of each feature i ∈ F is taken from its corresponding (numeric) domain D_i. As a result, the entire feature space is defined as 𝔽 = D_1 × D_2 × … × D_m (written F below wherever the context rules out confusion with the set of features). A concrete point v = (v_1, …, v_m) ∈ F, where each v_i is a constant value taken by feature i ∈ F, together with its corresponding class c ∈ K, represented by a pair (v, c), indicates a data instance or example. With a slight abuse of notation and whenever convenient, a data point v ∈ F is also referred to as an instance. Finally, x = (x_1, …, x_m) denotes a vector of feature variables x_i ∈ D_i, i ∈ F, used for reasoning over points in F.

Figure 1: Example DS and BT models computed on the well-known Iris classification dataset. (a) DS model. (b) BT model [5] consisting of 2 trees per class, each of depth ≤ 2, adopted from [17].
A classifier defines a classification function τ : F → K. The objective of classification problems is to learn a function τ that generalizes well on unseen data given a training dataset E = {e_1, e_2, …, e_n}, where each instance e_d ∈ E is a pair (v_d, c_d). Classification problems are conventionally posed as an optimization problem, i.e. either to minimize the complexity of τ, or maximize its accuracy, or both.

Rules, Decision Sets and Gradient Boosted Trees. Multiple ways exist to learn classifiers given data E. This paper focuses on arguably one of the most interpretable models, i.e. decision sets, trained by compiling gradient boosted trees.
A decision rule is in the form of "IF antecedent THEN prediction", where the antecedent is a set of feature literals. Informally, a rule is said to classify an instance v ∈ F as class c ∈ K if its antecedent is compatible with v (or matches v) and its prediction is c. A decision set (DS) is an unordered set of decision rules R. An instance (v, c) ∈ E is misclassified by a DS if either there exists no rule in R matching v, or there exists a rule classifying v as a class other than c.
A gradient boosted tree (BT) is a tree ensemble T defining sets of decision trees T_c ⊆ T, one for each class c ∈ K.

Interpretability and Explanations.
Interpretability is not formally defined as it is considered to be a subjective concept [33]. In this paper, interpretability is defined as the overall succinctness of the information offered by an ML model to justify a provided prediction. Moreover, following earlier work [48,20], we equate explanations for ML models with abductive explanations (AXps), which are subset-minimal sets of features sufficient to explain a given prediction. Concretely, given an instance v ∈ F and a prediction c = τ(v) ∈ K, an AXp is a subset-minimal set of features X ⊆ F such that

∀(x ∈ F). ⋀_{i∈X} (x_i = v_i) → (τ(x) = c)    (1)

▶ Example 2. Consider the setup of Example 1. Given instance v_1, observe that for any instance with "petal.length" = 1.4, the BT is guaranteed to predict "setosa" independently of the values of other features, since the weights for "setosa" and "versicolor" are 0.71928 and −0.40253 respectively as before, and the maximal weight for "virginica" is 0.39408 − 0.08968 = 0.30440. Thus, the (only) AXp X for the prediction for e_1 made by the BT model is {"petal.length"}. ⌟

Explanations in BTs. Formal reasoning has been recently applied to computing AXps for BT models, with the key difficulty being how to effectively reason about the aggregation over a large number of trees in a BT model. Recent work applied satisfiability modulo theory (SMT) [21] or mixed integer linear programming (MILP) solvers [42,27] to directly address the linear summations arising in the BT encoding. Hereinafter, we build on the recent MaxSAT approach [17], which maps the aggregation reasoning to a set of MaxSAT queries to avoid a costly encoding of the linear constraints into CNF. Also, [17] demonstrates how a MaxSAT query can be made such that (1) holds if and only if the optimal value of the constructed objective function is negative.
In general, assuming that each feature i ∈ F is numeric (continuous), the approach orders the set of splitting thresholds {d_i1, …, d_ih_i} in a BT T for each feature i, where h_i is the total number of thresholds of feature i in T and d_i1 < … < d_ih_i. Given an instance v = (v_1, …, v_m) ∈ F, the above approach associates each value v_i with a single interval I′_i from the set of disjoint intervals induced by these thresholds. Thus, AXp extraction boils down to finding a subset-minimal subset X ⊆ F s.t.

∀(x ∈ F). ⋀_{i∈X} (x_i ∈ I′_i) → (τ(x) = c)    (2)
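The arithmetic of Example 2 can be replayed as a small check. The sketch below assumes, for illustration only, that the entailment test compares the target class score with the worst-case (maximal) score of each rival class, so that the prediction is guaranteed exactly when every difference is negative; the helper name `prediction_is_entailed` is hypothetical, and the real objective of [17] is built from a MaxSAT encoding rather than from precomputed maxima.

```python
def prediction_is_entailed(target_score, worst_case_rival_scores):
    """Return True if the target class wins against every rival class.

    `worst_case_rival_scores` maps each rival class to the maximal score it can
    reach over all completions of the features left free by the explanation.
    """
    return all(rival - target_score < 0
               for rival in worst_case_rival_scores.values())

# Scores quoted in Example 2 for instances with "petal.length" = 1.4.
setosa = 0.71928
rivals = {
    "versicolor": -0.40253,
    "virginica": 0.39408 - 0.08968,  # maximal weight reachable for "virginica"
}
entailed = prediction_is_entailed(setosa, rivals)
```

With these numbers both differences are negative, which matches the claim that {"petal.length"} alone suffices for the "setosa" prediction.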

Related Work
Interpretable decision sets are logic-based ML models that can be traced back to the 70s and 80s [39,15,4,45]. To the best of our knowledge, [6] proposed the first approach to decision sets, which were introduced as a variant of decision lists [45,7]. The first method making use of logic and optimization to synthesize a disjunction of rules that match a given dataset was proposed in [26]. Recent work [29] argued that decision sets are more interpretable than the other logic-based models, i.e. decision lists and decision trees. This work uses smooth local search to generate a set of rules first and heuristically minimizes a linear combination of criteria afterwards, e.g. the size of a rule, their maximum number, overlap or error.
Since then, a number of works proposed the use of logic reasoning and optimization procedures to train DS models [22,36,12,50,18], claiming to significantly outperform the approach of [29] in terms of accuracy and performance. Among those, the works closest to ours are [22,50,18]. They proposed SAT-based approaches to computing smallest-size decision sets that perfectly agree with the training data by minimizing either the number of rules [22,18] or the number of literals [50,18] used in the model. Additionally, [50] is capable of computing sparse decision sets that trade off training accuracy for model size. Despite the dramatic performance increase achieved in [18], all the approaches above suffer from scalability issues.
Post-hoc explainability is one of the major approaches to XAI. Besides a plethora of heuristic sampling-based methods to post-hoc explainability [43,34,44], a formal reasoning based approach to computing abductive explanations [48,20] stands out. AXps can be related with prime implicants of the decision function (hence an alternative name prime implicant explanations, PI-explanations) associated with ML predictions and are guaranteed to capture the semantics of the ML models in the entire feature space. Although hard to compute in general, AXps were shown to be effectively computable for BT models by an incremental MaxSAT-based approach [17].
Our work aims at making a bridge between interpretable DS models and AXp computation by exploiting the latter for training the former. Given a BT model, it focuses on generating decision rules that agree with the BT. Each rule represents an AXp for the prediction made by the BT model, resulting in a DS model in a way guided by the original BT model. The approach is shown to outperform the prior logic-based approaches to DS inference in terms of test accuracy and performance. Note that despite prior attempts to train sparse models guided by tree ensembles [38], to the best of our knowledge, none of the existing works have applied formal post-hoc explanations to compile interpretable models.

Finally, our approach can be related to the existing line of work on knowledge distillation [11,13], where an interpretable model is trained to approximate a hard-to-interpret black-box model, which is often seen as teacher-to-student knowledge transfer. Note that in contrast to knowledge distillation, our approach is able to compile a BT into an equivalent DS if we consider the entire feature space, as shown below.

Decision Sets by Boosted Tree Compilation
Based on [17], this section details a MaxSAT-based approach to compiling a BT into a DS where each rule in the DS is equivalent to a prime implicant of the BT classification function.

Rule Extraction
Recall that an AXp, as defined in (1) and (2), can be seen as an if-then rule. Given a hard-to-interpret BT model, the AXp extraction approach of [17] can be modified to compute an interpretable DS consisting of a set of AXps for the BT. However, when the features are continuous (numeric), this potential approach suffers from the following issue. Recall that an AXp X ⊆ F indicates a set of concrete feature values that are sufficient to explain a prediction c = τ(v) for a certain instance v ∈ F. Although this same AXp can explain other instances compatible with it, its applicability in general is at the mercy of expressivity of the feature literals used in the AXp, i.e. equality literals and succinct interval membership in the case of (1) and (2), respectively. Motivated by this limitation, we propose to compute AXps over the literals intrinsic to the BT model aiming at getting feature intervals that are as general as possible, as detailed below.
In contrast to the work of [17], which associates each feature value v_i ∈ D_i with a single narrowest interval I′_i covering the value, we exploit all the splitting points used by the BT for feature i and identify all of the corresponding literals satisfied by the feature value v_i. Note that the original MaxSAT encoding [17] represents each threshold condition x_i < d_ij by a Boolean variable o_ij. For instance v, we form the conjunction of literals over these variables in which o_ij appears if v_i < d_ij and is replaced by ¬o_ij otherwise. By construction, this conjunction holds true for instance v. Now, given this conjunction of literals, we can apply the existing approach of [17] to extract a subset-minimal explanation Y. Such an explanation Y may (or may not) define either a lower bound on feature i, an upper bound, or both, aiming to construct the most general interval for each feature i ∈ Y.
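The construction of the initial conjunction over threshold literals, and the way a subset of those literals reads as feature intervals, can be sketched as follows. This is an illustrative sketch under the assumption that o_ij stands for x_i < d_ij; the function names and the threshold values are hypothetical and are not taken from the Iris model of Figure 1.

```python
def init_literals(thresholds, v):
    """For each threshold d_ij emit (i, j, True) for o_ij if v_i < d_ij,
    and (i, j, False) for the negated literal otherwise."""
    return [(i, j, v[i] < d)
            for i, ds in thresholds.items()
            for j, d in enumerate(sorted(ds), start=1)]

def literals_to_intervals(thresholds, literals):
    """Read a set of literals as per-feature bounds [lo, hi); None means unbounded."""
    bounds = {}
    for i, j, positive in literals:
        d = sorted(thresholds[i])[j - 1]
        lo, hi = bounds.get(i, (None, None))
        if positive:   # x_i < d tightens the upper bound
            hi = d if hi is None else min(hi, d)
        else:          # x_i >= d tightens the lower bound
            lo = d if lo is None else max(lo, d)
        bounds[i] = (lo, hi)
    return bounds

# Hypothetical thresholds for features 3 and 4 and a hypothetical instance.
thresholds = {3: [2.5, 4.5, 5.0], 4: [0.8, 1.4, 1.7]}
v = {3: 3.9, 4: 1.1}

full = init_literals(thresholds, v)
narrowest = literals_to_intervals(thresholds, full)
# A subset of the literals (as a generalized AXp would be) yields wider intervals.
generalized = literals_to_intervals(
    thresholds, [(3, 1, False), (3, 3, True), (4, 3, True)])
```

Keeping all literals reproduces the narrowest interval around each value, while dropping literals widens the bounds or removes them altogether, which is what makes the resulting rule more general.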
Hence, we informally refer to such explanations as generalized AXps or simply rules (hereinafter, we use both interchangeably).
The original approach of [17] would instead compute an AXp defining the narrowest intervals for features 3 and 4, representing a rule: ⟨IF 2.60 ≤ "petal.length" < 4.75 ∧ "petal.width" < 1.45 THEN class = "versicolor"⟩, which is far less general than Y. ⌟
A possible rule extraction procedure is outlined in Algorithm 1. (Please ignore line 3 for now; feature sorting is described in Section 4.2.) The input BT model T is encoded into MaxSAT by applying the approach of [17]. Given an instance v ∈ F, the initial set of literals

Boosted Tree Compilation
As mentioned above, generalized AXps can be seen as general decision rules that can be applied to an enormous number of instances. Therefore, it makes little sense to extract such rules for each instance in the feature space F. Instead, one can devise an on-demand compilation process, i.e. given a yet uncovered instance v ∈ F, we can apply Algorithm 1 to extract a rule covering v (and some other instances). Clearly, exhaustive compilation of a BT, i.e. if the target is to cover all the instances in F with generalized AXps of the BT, is computationally expensive given that AXp extraction for tree ensembles is hard for D^P [25]. This can also lead to a large size of the resulting DSes, making them hard to interpret. In practice, local compilation aiming at capturing the behavior of the BT on the training data only is sufficient to generate a DS that is both accurate and interpretable.
The proposed approach to compiling a BT T into a DS R is shown in Algorithm 2. We initialize the set C_u of currently uncovered instances to be equal to C, i.e. the set of examples we wish to cover. We consider two usages of the algorithm: for exhaustive compilation, the coverage set C = F is all possible feature combinations (in practice we model this coverage set implicitly, rather than in its explicit exponential-sized form), and for training set compilation, C = E is the training set. Based on the properties of prime implicants, Proposition 8 states that as a generalized AXp Y ∈ R is a formal explanation for a prediction made by BT T, a compiled DS captures the semantics of the original model T on coverage set C, assuming everything else is a don't care. Furthermore, if the process is applied subject to coverage set C = F, i.e. when we target the entire feature space F, then R and T behave identically, i.e. they compute the same classification function τ(x).
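The on-demand compilation loop can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' implementation: `classify`, `extract_rule` and `matches` are hypothetical callables standing in for the BT, for Algorithm 1 and for rule matching, and the optional `max_rules` budget only mimics the effect of interrupting the loop early.

```python
def compile_to_decision_set(coverage_set, classify, extract_rule, matches,
                            max_rules=None):
    """Generate rules until every instance of the coverage set is covered,
    or until the optional rule budget is exhausted."""
    rules = []
    uncovered = list(coverage_set)                  # C_u starts as C
    while uncovered and (max_rules is None or len(rules) < max_rules):
        v = uncovered[0]                            # pick an uncovered instance
        c = classify(v)                             # prediction of the model
        rule = extract_rule(v, c)                   # stand-in for Algorithm 1
        rules.append(rule)                          # add the rule to R
        uncovered = [u for u in uncovered if not matches(rule, u)]
    return rules

# Toy stand-ins: one integer feature, class 0 below 3 and class 1 otherwise.
def classify(v):
    return 0 if v[0] < 3 else 1

def extract_rule(v, c):
    # Hypothetical rule format (lo, hi, cls) meaning lo <= x_0 < hi.
    return (0, 3, c) if c == 0 else (3, 6, c)

def matches(rule, u):
    return rule[0] <= u[0] < rule[1]

coverage = [(x,) for x in range(6)]
full_ds = compile_to_decision_set(coverage, classify, extract_rule, matches)
partial_ds = compile_to_decision_set(coverage, classify, extract_rule, matches,
                                     max_rules=1)
```

The loop terminates because every extracted rule covers at least the instance it was extracted for; stopping it early still leaves a usable, if less complete, set of rules.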
▶ Corollary 9. Let Algorithm 2 return a DS R for a BT T. Then there is no instance in feature space F covered by two distinct rules of R that predict different classes.
As each generalized AXp for T represents a prime implicant of the decision function τ(x) computed over literals o_ij, the above corollary claims that there are no overlapping rules in the resulting DS R. This contrasts with other modern approaches to DS inference, where rule overlap is known to be a problem [29,22]. Note that this approach still suffers from another common issue of DS models: namely, if DS R is computed for the training data E, there may still be instances in F uncovered by R.
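A simple way to test a set of interval rules for overlap is sketched below. It assumes, as an interpretation, that two rules overlap when they predict different classes and some point of the feature space satisfies both antecedents; the rule representation (a dictionary of half-open intervals per feature plus a class label) and all numeric values are hypothetical.

```python
import math
from itertools import combinations

UNBOUNDED = (-math.inf, math.inf)

def antecedents_intersect(bounds_a, bounds_b):
    """Half-open intervals [lo, hi) per feature; a missing feature is unbounded."""
    for i in set(bounds_a) | set(bounds_b):
        lo_a, hi_a = bounds_a.get(i, UNBOUNDED)
        lo_b, hi_b = bounds_b.get(i, UNBOUNDED)
        if max(lo_a, lo_b) >= min(hi_a, hi_b):
            return False  # empty intersection on feature i
    return True

def conflicting_pairs(rules):
    """Indices of rule pairs with different classes and intersecting antecedents."""
    return [(a, b) for a, b in combinations(range(len(rules)), 2)
            if rules[a][1] != rules[b][1]
            and antecedents_intersect(rules[a][0], rules[b][0])]

# Hypothetical rules over features 3 and 4.
rules = [
    ({3: (2.5, 5.0), 4: (-math.inf, 1.7)}, "versicolor"),
    ({3: (5.0, math.inf)}, "virginica"),
    ({4: (1.5, math.inf)}, "virginica"),
]
conflicts = conflicting_pairs(rules)
```

In this toy set the first and third rules conflict, because feature 4 values between 1.5 and 1.7 satisfy both; a decision set compiled as described above is claimed to produce no such pairs.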
▶ Example 10. Consider the running example BT model shown in Figure 1b. Its compiled DS representation computed by Algorithm 2 is shown in Figure 1a. Observe that there is no rule overlap in the DS computed. In fact, as the DS is computed by taking into account feature space F, it computes the same classification function as the original BT model. ⌟
Feature Sorting. Intuitively, how general and hence how applicable a rule is depends on how frequently the features used in it appear in the training data E labeled with the target class. Thus, a simple heuristic to apply when extracting a rule for prediction
Anytime Property. Most widely used reasoning-based algorithms to infer DSes provide a solution only if the computation is completed; otherwise, no decision set is reported. In contrast to these, the proposed approach is an anytime algorithm, i.e. it can return a valid DS R even if the compilation process is interrupted before all the coverage set instances C are covered. Furthermore, it can generate a more comprehensive DS R, which covers more instances as it keeps going, i.e. after we have covered C ⊆ F we can continue running the algorithm for the (unseen) instances of F.

Post-Hoc Model Reduction
The compiled DS R can be large (in terms of either the number of rules or the total number of literals) since each generalized AXp Y ∈ R may need a significant number of literals to explain a prediction made by BT T, and/or many rules are required to explain all instances of C. Once the target DS is obtained, we can apply post-hoc heuristic methods for reducing its size and so making it more interpretable. The methods below are in a way inspired by the optimization problems studied in [18,50]. Although these ideas are applicable to any DS inference method once the resulting model is devised, they do not look necessary for standard DS inference algorithms as they minimize the model while training. On the contrary, no minimization is applied in the rule enumeration process described above and so post-hoc model reduction plays a vital role in our approach to reduce the size of final DS models.
Reducing the Number of Rules. Given a set of rules R, we can compute a minimum subset R⋆ ⊆ R that is still equivalent to the BT T wrt. the coverage set C using discrete optimization, e.g. integer linear programming (ILP). Concretely, the approach aims at selecting the smallest-size subset R⋆ ⊆ R that covers all instances in C, where R is the compiled DS from T. Here, the size of R⋆ is measured as the total number of literals used. This can be done by solving the following set cover problem [28]. Namely, for each rule Y_j ∈ R, we introduce a Boolean variable u_j such that u_j = 1 iff Y_j is included in R⋆.
CP 2023

Additionally, a Boolean variable y_ij is used to indicate that Y_j covers e_i ∈ C. As a result, the weighted set cover problem for minimizing the total number of literals used is as follows:

minimize Σ_{Y_j ∈ R} |Y_j| · u_j
subject to Σ_{Y_j ∈ R} y_ij · u_j ≥ 1 for each e_i ∈ C

Reducing the Number of Literals. […] are all the literals compatible with v_k then this can be modeled with constraints […] Furthermore, let rule Y predict c ∈ K and let C⊖ ⊆ C contain all instances labeled with any other class. Thus, we can apply the objective below when minimizing rule Y: […] If W is large enough, say |C| + 1, this lexicographically minimizes misclassifications and then literals. If W is small, e.g.
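The rule-reduction step described above, i.e. the weighted set cover over rules, can be illustrated on a tiny instance. The sketch below swaps the ILP solver for plain enumeration of rule subsets, which is only viable for a handful of rules; the function name `reduce_rules` and the example data are hypothetical.

```python
from itertools import combinations

def reduce_rules(rule_sizes, covers, n_instances):
    """Pick a subset of rules covering all instances with minimum total literals.

    rule_sizes[j] is the number of literals of rule j (the weight of u_j);
    covers[j] is the set of instance indices covered by rule j (the y_ij = 1 entries).
    Returns (total_literals, chosen_rule_indices) or None if no cover exists.
    """
    everything = set(range(n_instances))
    best = None
    for k in range(1, len(rule_sizes) + 1):
        for subset in combinations(range(len(rule_sizes)), k):
            covered = set().union(*(covers[j] for j in subset))
            if covered != everything:
                continue  # some instance would be left uncovered
            cost = sum(rule_sizes[j] for j in subset)
            if best is None or cost < best[0]:
                best = (cost, subset)
    return best

# Hypothetical compiled DS: four rules over four instances.
rule_sizes = [4, 1, 1, 2]
covers = [{0, 1, 2, 3}, {0, 1}, {2}, {2, 3}]
best_cost, chosen = reduce_rules(rule_sizes, covers, n_instances=4)
```

Note that the cheapest cover here uses two short rules rather than the single long rule that covers everything, which reflects the choice of measuring size by the total number of literals.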

Experimental Results
This section compares the proposed approach with the state-of-the-art DS learning algorithms on a variety of publicly available datasets in terms of accuracy, scalability, model and explanation size. The experiments are performed on an Intel Xeon 8260 CPU running Ubuntu 20.04.2 LTS, with a time limit of 3600s and a memory limit of 8 GB. Our experiments contain two parts, namely, exhaustive BT compilation and training-set BT compilation.

Prototype implementation. A prototype of the compilation-based approach to generating DSes was developed as a set of Python scripts using C = E, hereinafter referred to as cpl. The implementation of BT compilation exploits [17] and, therefore, makes use of the RC2 MaxSAT solver [19]. The BTs to be compiled are computed by XGBoost [5]; the number of trees per class in a BT model is 50 and the maximum depth of each tree is 3. Post-hoc literal reduction is done again with RC2 [19]. Let cpl_l denote the implementation applying lexicographic optimization while cpl_lλ1 trades off model accuracy for the number of literals used, with λ1 = 0.005. Let cpl_r denote the implementation with post-hoc rule reduction applied using the Gurobi ILP solver [14]. The configuration with both post-hoc lexicographic optimization and rule reduction is denoted cpl_lr. Finally, the proposed approach applying exhaustive compilation C = F is referred to as cpl_f.
Competition. Our approach is compared against: twostg, a two-stage MaxSAT approach [18] for DSes perfectly accurate on the training data; opt, another MaxSAT approach [50] for perfectly accurate DSes; sp_λ1, a sparse alternative to opt by the same authors (with λ1 = 0.005) optimizing like cpl_lλ1; imli_1 and imli_16, using MaxSAT-based IMLI [12] to minimize the number of literals given a predefined number of rules (we use 1 or 16); ids, a state-of-the-art approach [29] based on smooth local search; ripper, a popular heuristic DS algorithm RIPPER [8]; and CN2 (referred to as cn2), another heuristic algorithm [7,6].
Datasets. For the evaluation, 59 publicly available datasets from UCI Machine Learning Repository [9] and Penn Machine Learning Benchmarks [41] are considered. We apply 5-fold cross validation, resulting in 295 pairs of training and test (unseen) data. For the sake of a fair comparison, the datasets used are preprocessed so that each original feature i ∈ F is replaced with a number of non-intersecting feature intervals x_i < d_ij defined by the XGBoost model (see Section 2). This guarantees that all competitors tackle the same problem instances.

Exhaustive BT Compilation
The first experiment compares exhaustive compilation, where C = F is the entire feature space.This is impractical except for 6 small benchmarks.

Results.
Here we compare cpl_f with the competition in terms of accuracy, the total number of literals used and explanation size. We present the results as cactus plots showing the number of datasets that e.g. reach a certain accuracy, or finish in a certain runtime, for each method. These experimental results are shown in Figures 2 and 3, and the average results across folds are given in Table 2, where only the results for the datasets completely solved by the compared competitors are presented. Note that cpl_f is nowhere near as scalable as the approaches described in the later experiments, but it is the most accurate approach to creating DSes we are aware of.
Test accuracy. An instance is considered misclassified if either there exists a rule of a wrong class that covers it, or it is not covered by any rule of the correct class. Thus, the test accuracy in this paper is calculated as (n − g)/n, where n is the total number of instances in the test data and g is the total number of misclassified instances. If an approach fails to train a model within the time limit, we assume its accuracy to be 0% for this dataset.

Figure 2: Accuracy of exhaustive compilation.

The standard interpretation of cactus plots is assumed, i.e. a plot sorts the datapoints for each method by the y-axis value, and then shows them in increasing order independently of other methods. Thus, the order of datasets/folds differs for different methods. Also, the order of datasets for the same method differs in different subplots. As can be seen in Figure 2b and Table 2, the best accuracy is achieved by BTs and cpl_f. In fact, these models share the same accuracy (this is also confirmed in Figure 2a), which should not come as a surprise given that cpl_f replicates the behavior of the BT in the entire feature space F (see Proposition 8).
Model Complexity. In general, complexity of a DS model can be measured by the total number of literals used in this DS. The total number of literals used in DS models is compared in Figure 3a and Table 2. Though the accuracy of DSes trained by cpl_f outperforms the other competitors, these models are significantly larger, which is no surprise given that cpl_f computes many more rules with no post-hoc reduction applied.
Explanation size. Explanation size is defined as the number of literals required to explain an instance. This is arguably more important than the model size, since it defines "how hard" it is to understand an individual explanation. A small DS model tends to provide compact explanations but it is not always accurate. As can be seen in Figure 3b and Table 2, and similar to the total number of literals used in DSes, cpl_f requires more literals to explain an instance than all competitors except ids.
A crucial observation to make here is that we test explanation size for each of the test instances available. Although test data are meant to extrapolate the overall unseen data, such approximation of the unseen feature space is not ideal. As a result, there may be numerous instances in F uncovered by all the approaches but cpl_f, in which case it will be the only approach providing a user not only with a prediction but also with a succinct explanation of the prediction made.

BT Compilation Targeting Training Data
Compilation to cover the training set C = E is much more efficient, and the main usage we expect of our algorithms.
Scalability. Figure 4a depicts scalability of all selected algorithms on the 295 considered datasets. Note that the runtime of our approach includes BT training time. The best performance is demonstrated by the proposed implementation, i.e. cpl and cpl_*, * ∈ {l, r, lr, lλ1}, where all selected datasets are solved within the time limit. This is not surprising since the approach is an anytime algorithm that can always return a valid DS. As for other competitors, the heuristic method ripper and the MaxSAT approaches imli_1 and imli_16 also solve all considered datasets. Next is the heuristic algorithm cn2, which solves 235 datasets within the 3600s time limit, followed by ids, which solves 166 datasets. The two-stage MaxSAT approach twostg successfully addresses 130 datasets, while the other MaxSAT algorithm for perfect decision sets opt and its sparse alternative sp_λ1 solve 65 and 63 datasets respectively.
Test Accuracy. The accuracy among the selected approaches is shown in Figure 4b. The average accuracy among all selected datasets for BTs is 77.34%, beating all DS approaches. The highest accuracy among DSes is achieved by all the configurations of the proposed approach, i.e. cpl and cpl_*, where the average accuracy ranges from 54.01% (cpl_lλ1) to 57.49% (cpl_lr). Unsurprisingly, the accuracy of cpl_lλ1 is lower than that of the other configurations since cpl_lλ1 trades off training accuracy for the number of literals in the computation process.
Next most accurate are the heuristic methods cn2 (48.03%) followed by ripper (44.81%). The average accuracy of imli_16 and imli_1 is 35.47% and 29.7% respectively, while the average accuracy of twostg is 29.6% and that of ids is 26.78%. Finally, the worst accuracy is demonstrated by sp_λ1 and opt (18.84% and 18.27% on average respectively) as these tools fail to provide prediction information for many datasets within the time limit. We will omit further discussion of sp_λ1 and opt since they solve so few datasets.
Figure 4d demonstrates that post-hoc literal reduction not only helps decrease the number of literals required to explain DS models, but also enables DSes to remain accurate, whereas rule reduction does not contribute to smaller explanations. With literal reduction applied, our approaches are very competitive in terms of explanation size.

Detailed Comparison.
While cactus plots allow us to compare many methods over a large suite of benchmarks, they do not allow direct comparison on individual benchmarks. We provide a detailed comparison of cpl_lr versus other decision set inference approaches in Figures 5 and 6, including cn2, ripper, twostg, and imli_16.⁹ The scatter plots depicting explanation size are obtained for the datasets solvable by both competitors. Note that cpl_lr can generate more accurate DSes than the competitors. Also observe that the explanation size of DSes computed by cpl_lr is smaller than that of cn2 and comparable with that of twostg. Although the explanation size of DSes in cpl_lr is larger than that of ripper and imli_16, these two approaches are less interpretable as they compute DSes representing only one class.

Summary.
The experiments were performed on various datasets, demonstrating that our approach computes DSes that outperform the state-of-the-art competitors in terms of accuracy and yield explanation sizes comparable to theirs.

Conclusions
This paper introduced a novel anytime approach to generating decision sets by means of on-demand extraction of generalized abductive explanations for boosted tree models. It can be used for exhaustive compilation of a BT model wrt. the entire feature space, or target a set of training instances. Augmented by a number of post-hoc model reduction techniques, the approach is shown to compute decision sets that are more accurate than decision sets computed by the state-of-the-art algorithms and comparable with them in terms of explanation size.
As the proposed approach targets generating a decision set by compiling a BT, a natural line of future work is to extend the proposed approach to compile BTs into the other interpretable models, i.e. decision trees and decision lists, making use of AXp extraction for BTs. Additionally, another line of future work is to apply AXp extraction to compile other accurate black box models, e.g. neural networks, into decision sets.
⁹ The average results across the folds are given in the appendix.

B Detailed Comparisons Across Folds
In this appendix, we provide a detailed comparison of cpl_lr versus other decision set inference approaches across folds. Figure 9 and Figure 10 detail the comparisons of cpl_lr with CN2, RIPPER, imli_16 and twostg in terms of average accuracy and explanation size across folds. As can be seen in Figure 9a, the accuracy of DSes generated by cpl_lr is higher than the accuracy of CN2, where the average accuracy is 57.49% and 48.03%, respectively. Additionally, Figure 9b demonstrates that the explanation size of DSes produced by CN2 (81.93 on average) can be two orders of magnitude larger than the explanation size of cpl_lr (25.88 on average).
Figure 9c illustrates that the average accuracy of RIPPER is 44.81%, which is 12.68 percentage points lower than the accuracy of cpl_lr. Although Figure 9d depicts that RIPPER is comparable with cpl_lr regarding explanation size (29.08 and 25.34 on average respectively), RIPPER is less interpretable as it computes DSes representing only one class.
As can be observed in Figure 10a, the accuracy of twostg (29.67% on average) is 27.82 percentage points lower than the accuracy of cpl_lr, while Figure 10b illustrates that the explanation size is comparable between the two approaches. Finally, Figure 10c demonstrates that the accuracy of imli_16 is 22.02 percentage points lower than the accuracy of cpl_lr on average. However, as can be seen in Figure 10d, the explanation size of imli_16 is smaller than the explanation size of cpl_lr, but imli_16 generates DSes targeting only a single class, which significantly diminishes the interpretability of computed DSes.

its class is obtained by computing the sum of scores assigned by trees for each class, $w(v, c) = \sum_{t \in T_c} t(v)$, and assigning the class which has the maximum score, i.e. $\operatorname{argmax}_{c \in [|K|]} w(v, c)$. Whenever convenient, $n \in t$ denotes a non-terminal node, where $t \in T$ represents an arbitrary decision tree. Moreover, each such $n$ indicates a feature condition in the form of $x_i < d$, where feature $i \in F$ and splitting threshold $d \in D_i$.
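The scoring rule above can be illustrated with a minimal sketch. It assumes that each tree is a plain Python callable returning a score for an instance and that the trees are grouped by class; the names `class_scores`, `predict`, `stump`, and `toy_bt`, as well as all leaf scores, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

Tree = Callable[[Sequence[float]], float]

def class_scores(trees_per_class: Dict[str, List[Tree]],
                 v: Sequence[float]) -> Dict[str, float]:
    """Compute w(v, c) as the sum of t(v) over the trees t in T_c, for every class c."""
    return {c: sum(t(v) for t in trees) for c, trees in trees_per_class.items()}

def predict(trees_per_class: Dict[str, List[Tree]], v: Sequence[float]) -> str:
    """Return the class with the maximum total score."""
    scores = class_scores(trees_per_class, v)
    return max(scores, key=scores.get)

def stump(i: int, d: float, lo: float, hi: float) -> Tree:
    """Depth-1 tree with condition x_i < d: score `lo` if it holds, `hi` otherwise."""
    return lambda v: lo if v[i] < d else hi

# Toy ensemble over feature index 2 (made-up leaf scores, not taken from the paper).
toy_bt = {
    "setosa": [stump(2, 2.60, 1.0, -1.0)],
    "versicolor": [stump(2, 2.60, -1.0, 0.5), stump(2, 4.95, 0.5, -1.0)],
    "virginica": [stump(2, 4.95, -1.0, 1.0)],
}

label = predict(toy_bt, [5.8, 2.7, 3.9, 1.1])
```

In this toy ensemble the instance with value 3.9 for feature index 2 collects a total score of 1.0 for "versicolor" and -1.0 for the other two classes.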

Algorithm 1 Deletion-based Rule Extraction.
Function: RuleExtract(T, v, c, E)
Input: T: BT defining τ(x), v: Instance, c: Prediction, i.e. c = τ(v), E: Training data
Output: Y: Subset-minimal rule
1: ⟨H, S⟩ ← Encode(T)
2: Y ← Init(T, v)
3: Y ← Sort(Y, E)
4: for l ∈ Y do
5:     if EntCheck(⟨H, S⟩, c, Y \ {l}) then
6:         Y ← Y \ {l}
7: return Y

▶ Example 5. Consider instance $v_3$ predicted as "versicolor" by the BT (observe that $v_3 = 3.9$ and $v_4 = 1.1$) and recall the thresholds for features 3 and 4 discussed in Example 3. We can compute a generalized AXp $Y = \{\neg o_{31}, o_{33}, o_{43}\}$ representing the second rule of the DS shown in Figure [...]

... i.e. the set of examples we wish to cover. The algorithm is a loop that generates rules until the set of computed rules R covers all instances in the coverage set C, i.e. until there are no uncovered instances in C. Each iteration of the algorithm selects an instance v from $C_u$. Afterwards, a generalized AXp Y for the prediction c = τ(v) by the BT T (recall that T is meant to compute classification function τ(x)) is extracted by invoking Algorithm 1. The iteration proceeds by updating the set of rules R and the set of uncovered instances $C_u$. The algorithm terminates when all the instances in the coverage set C are covered and returns a compiled DS R.

▶ Proposition 8. Let T be a BT and R be a DS returned by Algorithm 2 for T. Then R ≡ T with respect to C.
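The deletion loop of Algorithm 1 and the covering loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the entailment check is replaced by brute-force enumeration over a small grid with one representative value per threshold interval (the paper uses MaxSAT oracle calls instead), the sorting step of line 3 is omitted, and `toy_predict`, `grid`, and all function names are hypothetical.

```python
from itertools import product

def holds(lit, x):
    """A literal (i, d, positive) stands for x[i] < d if positive, else x[i] >= d."""
    i, d, positive = lit
    return (x[i] < d) == positive

def matches(rule, x):
    """A rule matches x if all of its literals hold for x."""
    return all(holds(lit, x) for lit in rule)

def ent_check(predict, grid, rule, c):
    """Brute-force stand-in for EntCheck: every grid point matching the rule gets class c."""
    return all(predict(x) == c for x in grid if matches(rule, x))

def rule_extract(predict, thresholds, grid, v, c):
    """Deletion-based extraction: drop each literal whose removal keeps the entailment."""
    # Init: one literal per threshold, with the polarity fixed by the instance v.
    rule = [(i, d, v[i] < d) for i, ds in thresholds.items() for d in ds]
    for lit in list(rule):
        candidate = [l for l in rule if l != lit]
        if ent_check(predict, grid, candidate, c):
            rule = candidate
    return rule

def compile_ds(predict, thresholds, grid, coverage):
    """Generate rules until every instance of the coverage set is matched by some rule."""
    rules, uncovered = [], list(coverage)
    while uncovered:
        v = uncovered[0]
        c = predict(v)
        rule = rule_extract(predict, thresholds, grid, v, c)
        rules.append((rule, c))
        uncovered = [u for u in uncovered if not matches(rule, u)]
    return rules

def toy_predict(x):
    """Hypothetical classifier over a single feature, standing in for the BT."""
    if x[0] < 2.60:
        return "setosa"
    return "versicolor" if x[0] < 4.95 else "virginica"

thresholds = {0: [2.60, 4.75, 4.95]}
grid = list(product([1.0, 3.0, 4.8, 5.5]))  # one point per threshold interval
rules = compile_ds(toy_predict, thresholds, grid, [(1.5,), (3.9,), (5.5,), (4.8,)])
```

Each extracted rule always matches the instance it was extracted for, since it is a subset of the literals satisfied by that instance, so the set of uncovered instances shrinks in every iteration and the loop terminates.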
based on how frequently the corresponding literals $o_{ij}$ apply in examples E labeled with c. This feature sorting, represented by line 3 in Algorithm 1, in practice (according to our experiments) results in significantly more general rules and so overall smaller DSes.
Comparison with the others.
Number of literals used.

Figure Summary of experimental results when the competitors aim at training a DS given training data E (i.e. C = E).

Figure 5 Comparison of cpl lr vs. cn2 and ripper in terms of accuracy and explanation size.

Figure 6 cpl lr vs. imli16 and twostg in terms of accuracy and explanation size.

Figure 7 Experimental results of runtime and accuracy across folds.

Figure 8 Experimental results of model complexity and explanation size across folds.

Figures 7 and 8 show the average experimental results across folds regarding scalability, accuracy, model complexity, and explanation size. Since 5-fold cross validation is used, the results for each dataset are averaged over 5 pairs of training and test data. Here, observations similar to those described in Section 5 can be made, i.e. the best [...]

Figure 9 cpl lr vs. CN2 and RIPPER across folds in terms of accuracy and explanation size.

Figure 10 cpl lr vs. imli16 and twostg across folds in terms of accuracy and explanation size.

Table 1 Several instances extracted from the Iris dataset.
introduces a single Boolean variable $o_{ij}$ for each literal $x_i < d_{ij}$, with $d_{ij}$ being the $j$'th threshold used in the BT for feature $i$, s.t. $o_{ij} = 1$ iff $x_i < d_{ij}$ holds true. This way, each positive $o_{ij}$ represents an upper bound on the value of $x_i$ while each negative $\neg o_{ij}$ represents a lower bound on $x_i$.

▶ Example 4. Feature 3 ("petal.length") from Example 3 has 3 thresholds: $d_{31} = 2.60$, $d_{32} = 4.75$, $d_{33} = 4.95$. Boolean variables $o_{31}$, $o_{32}$, and $o_{33}$ are set to true iff $x_3 < 2.60$, $x_3 < 4.75$, and $x_3 < 4.95$, respectively. Let feature 3 take value 3.9 in the instance we want to explain. Observe how we can immediately assign literals $\neg o_{31}$, $o_{32}$, and $o_{33}$ to true. ⌟

Next, given an instance, the procedure implements the standard deletion-based AXp extraction [20], i.e. it iterates through all literals in Y one by one and checks which of them can be safely removed such that entailment (3) still holds.

▶ Example 6. Consider our running example model and instance $v_2 \in e_2$ from Table 1 predicted as "virginica" by the BT T. Given the thresholds for features 3 and 4 in Example 3, set Y is initialized to $\{\neg o_{31}, \neg o_{32}, \neg o_{33}, \neg o_{41}, \neg o_{42}, \neg o_{43}\}$. The other two features are excluded from Y since they are irrelevant to the classification function in T. Applying Algorithm 1 results in extracting a subset-minimal generalized AXp $Y = \{\neg o_{33}\}$, which represents the rule ⟨IF petal.length ≥ 4.95 THEN class = "virginica"⟩. ⌟

▶ Remark 7. Algorithm 1 relies on deciding whether formula (3) holds for each feature in explanation Y. Here, this is done by means of a series of incremental core-guided MaxSAT oracle calls [19,17]. One may wonder whether or not incomplete anytime MaxSAT solving [31,35,2,32] can be applied in this setting. Although this may look plausible at first glance, time-restricted anytime MaxSAT algorithms can only over-approximate exact MaxSAT solutions, while (3) holds if and only if the exact value of the objective function is negative. Therefore, an over-approximation of a MaxSAT solution is never able to prove the validity of (3), and so none of the features being tested can be discarded in the case of incomplete MaxSAT algorithms, which defeats the purpose of Algorithm 1.
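The threshold encoding of Example 4 can be reproduced with a short sketch. It assumes thresholds are stored per feature in increasing order and indexed from 1, as in the text; the function names `literal_values` and `implied_interval` are hypothetical.

```python
def literal_values(thresholds, v):
    """Map each pair (i, j) to the truth value of o_ij, i.e. whether x_i < d_ij holds in v."""
    return {(i, j): v[i] < d
            for i, ds in thresholds.items()
            for j, d in enumerate(ds, start=1)}

def implied_interval(thresholds, values, i):
    """Tightest bounds on x_i implied by the literals of feature i.

    Negative literals (x_i >= d_ij) give lower bounds, positive ones (x_i < d_ij) upper bounds.
    """
    lower = max((d for j, d in enumerate(thresholds[i], start=1) if not values[(i, j)]),
                default=float("-inf"))
    upper = min((d for j, d in enumerate(thresholds[i], start=1) if values[(i, j)]),
                default=float("inf"))
    return lower, upper

# Feature 3 ("petal.length") with the three thresholds from Example 4.
thresholds = {3: [2.60, 4.75, 4.95]}
values = literal_values(thresholds, {3: 3.9})
interval = implied_interval(thresholds, values, 3)
```

For the value 3.9 this yields $o_{31}$ false and $o_{32}$, $o_{33}$ true, which together confine $x_3$ to the interval from 2.60 (inclusive) to 4.75 (exclusive).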

Algorithm 2 Compile a BT into a DS.
Function: Compile(T, τ, C)
Input: T: BT defining τ(x), τ: Classification function in T, C: Coverage set
Output: R: Set of Rules
1: R ← ∅
2: ...

Number of Literals. Additionally, one can minimize the total number of literals used in the rules of R. Given a rule Y ∈ R, this can be done either lexicographically by maximizing rule accuracy followed by size minimization, or by optimizing both, or by trading off misclassifications for rule size; in either case, a single MaxSAT call per rule to minimize can be made. The intuition is that if a rule Y misclassifies k instances then its optimized version $Y^\star \subseteq Y$ should not result in many more misclassifications on training data E. Recall that a rule misclassifies an instance $v_k \in C$ if it matches $v_k$ but assigns it to a wrong class. Inspired by [18], we introduce a Boolean variable $p_k$, which is true iff rule Y covers $v_k$; this holds if Y does not use any literals incompatible with $v_k$.
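The notions of coverage and misclassification used above can be made concrete with a small sketch. It is only a brute-force stand-in for the MaxSAT formulation: it enumerates subsets of a rule in order of increasing size and returns the first one that stays within a misclassification budget. The parameter `max_extra` and all function names are assumptions introduced for illustration.

```python
from itertools import combinations

def covers(rule, v):
    """Plays the role of p_k: true iff no literal (i, d, positive) of the rule conflicts with v."""
    return all((v[i] < d) == positive for i, d, positive in rule)

def misclassifications(rule, label, data):
    """Count instances the rule matches but whose true class differs from the rule's class."""
    return sum(1 for v, y in data if covers(rule, v) and y != label)

def reduce_rule(rule, label, data, max_extra=0):
    """Smallest subset of the rule with at most `max_extra` additional misclassifications."""
    budget = misclassifications(rule, label, data) + max_extra
    for size in range(len(rule) + 1):
        for subset in combinations(rule, size):
            if misclassifications(subset, label, data) <= budget:
                return list(subset)
    return list(rule)  # not reached: the full rule always fits its own budget

data = [((1.5,), "setosa"), ((3.9,), "versicolor"),
        ((4.8,), "versicolor"), ((5.5,), "virginica")]
rule = [(0, 2.60, False), (0, 4.75, True), (0, 4.95, True)]
exact = reduce_rule(rule, "versicolor", data)
relaxed = reduce_rule(rule, "versicolor", data, max_extra=1)
```

With a budget of zero extra misclassifications the sketch drops one redundant literal, while allowing one extra misclassification shrinks the rule to a single literal, which mirrors the trade-off between rule accuracy and rule size.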

Table 2 Accuracy, number of literals used, and explanation size across folds.