Static Analysis of Graph Database Transformations

We investigate graph transformations, defined using Datalog-like rules based on acyclic conjunctive two-way regular path queries (acyclic C2RPQs), and we study two fundamental static analysis problems: type checking and equivalence of transformations in the presence of graph schemas. Additionally, we investigate the problem of target schema elicitation, which aims to construct a schema that closely captures all outputs of a transformation over graphs conforming to the input schema. We show all these problems are in EXPTIME by reducing them to C2RPQ containment modulo schema; we also provide matching lower bounds. We use cycle reversing to reduce query containment to the problem of unrestricted (finite or infinite) satisfiability of C2RPQs modulo a theory expressed in a description logic.


INTRODUCTION
The growing adoption of graph databases calls for suitable data processing methods. Query languages for graph databases typically define their semantics as a set of tuples, which alone is inadequate for scenarios such as (materialized) graph database views and data migration in the context of schema evolution [11], with the schema describing the expected structure of the graph. A more adequate mechanism is that of a transformation, which takes a graph as input and produces a graph as output.
Example 1.1. Consider a scenario where the schema of a medical knowledge graph undergoes changes due to advances in the understanding of biomolecular processes. The purpose of this knowledge graph is to catalog vaccines based on the antigen they are designed to target and to identify the pathogens that exhibit the antigens, each antigen being exhibited by at least one pathogen. Additionally, some pairs of antigens are known to be cross reacting: if a vaccine targets an antigen that is cross reacting with a second antigen, then the vaccine also targets the second antigen. Thus, the set of all antigens targeted by a vaccine is represented implicitly.
The schema S_0 of the original knowledge graph is presented in Figure 1 as a graph itself. It specifies the allowed node and edge labels, and expresses participation constraints on edges in a manner that is typical for data modeling languages, e.g., A
In the present paper, we study two classical problems of static analysis on graph transformations: type checking, which verifies whether for every graph conforming to the source schema the transformation outputs a graph conforming to the target schema, and equivalence, which verifies whether two transformations produce the same output for every graph conforming to the source schema. Additionally, when the target schema is not known, we investigate the problem of target schema elicitation, which constructs the containment-minimal target schema that captures the graphs produced by the transformation.
We study executable graph transformations defined with Datalog-like rules. The rules specify how to construct the output graph from the results of regular path queries evaluated over the input graph. To allow multiple copies of the same input node, the rules use node constructors, essentially explicit Skolem functions that create nodes. As an example, the cross-reactivity rule from Example 1.1 gives rise to the following graph transformation rule
targets(f_V(x), f_A(y)) ← (designTarget · crossReacting*)(x, y),
where f_V(x) and f_A(y) are constructors of Vaccine and Antigen nodes, respectively. The two constructors can, for instance, have the following definitions: f_V(x) = (Vaccine, x) and f_A(y) = (Antigen, y); essentially, they take the identifiers of the original nodes and decorate them with their type.
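To make the rule above concrete, the following minimal Python sketch evaluates it over a small hand-made graph. The graph contents, the dictionary encoding of edges, and the function names (f_vaccine, f_antigen, design_then_cross, apply_targets_rule) are assumptions made for illustration only; they are not part of the formalism defined in the paper.

```python
from collections import deque

# Hypothetical input graph: edge label -> set of (source, target) pairs.
edges = {
    "designTarget": {("v1", "a1")},
    "crossReacting": {("a1", "a2"), ("a2", "a3")},
}

def f_vaccine(x):
    # Node constructor: decorates an input identifier with the type Vaccine.
    return ("Vaccine", x)

def f_antigen(y):
    # Node constructor: decorates an input identifier with the type Antigen.
    return ("Antigen", y)

def design_then_cross(edges):
    """Pairs (x, y) connected by a path matching designTarget . crossReacting*."""
    succ = {}
    for u, v in edges.get("crossReacting", set()):
        succ.setdefault(u, set()).add(v)
    result = set()
    for x, a in edges.get("designTarget", set()):
        # Breadth-first search along crossReacting edges, starting from a.
        seen, queue = {a}, deque([a])
        while queue:
            u = queue.popleft()
            for v in succ.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        result |= {(x, y) for y in seen}
    return result

def apply_targets_rule(edges):
    # One output targets-edge per answer of the body query.
    return {(f_vaccine(x), f_antigen(y)) for x, y in design_then_cross(edges)}

targets = apply_targets_rule(edges)
```

Because both constructors are injective and their ranges are disjoint (the first component of the pair differs), distinct input nodes never collapse in the output, which mirrors the assumptions the paper places on node constructors.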
We investigate transformations that use only acyclic two-way conjunctive regular path queries (acyclic C2RPQs), which is arguably of practical relevance in the context of graph transformations. For instance, we have found no cyclic queries in the transformations implementing graph data migration between consecutive versions of the FHIR data format [34,53] (Fast Healthcare Interoperability Resources is an international standard for interchange of medical healthcare data). Our constructions rely on acyclicity of C2RPQs to obtain relatively low computational complexity. We argue that the acyclicity assumption cannot be lifted without a significant complexity increase (see Section 7).
Node constructors are closely related to object creating functions [36,37]. Our use of node constructors is inspired by analogous constructions in transformation languages such as R2RML [18,22,55], where node IRIs are typically obtained by concatenation of a URL prefix and the key values of an object represented by the constructed node. Our node constructors can have an arbitrary arity, which makes it possible, for instance, to create nodes in the target graph that represent relationships (edges) between nodes in the source graph. To isolate the concern of possible overlaps between node constructors, we make the natural assumption that node constructors are injective, have pairwise disjoint ranges, and for every node kind (label) a single dedicated node constructor is used. These assumptions allow us to remove the need to analyze the definitions of node constructors, which is out of the scope of the present paper, and they are consistent with how the analogous constructions are used in languages such as R2RML and the FHIR mapping language.
For schemas, we employ a natural formalism of graph schemas with participation constraints, inspired by standard data modeling languages such as Entity-Relationship diagrams [17], and already studied, for instance, in the context of graph database evolution [11]. Such schemas allow one to declare the available labels of nodes and edges and to express participation constraints. In contrast to more expressive languages such as ShEx and SHACL [19,57], our formalism allows a single label per node, which determines the node type. Thus, roughly speaking, our schema formalism is to ShEx and SHACL what DTD is to XML Schema.
The key contributions of the present paper are as follows.
(1) We define graph database transformations and we reduce the problems of interest to containment of C2RPQs in unions of acyclic C2RPQs modulo schemas.
(2) We reduce the query containment problem to the unrestricted (finite or infinite) satisfiability of a C2RPQ modulo a set of constraints expressed in the Horn fragment of a description logic known as ALCIF. The reduction involves an application of the cycle reversing technique [20,38], carefully tailored to our needs.
(3) The unrestricted satisfiability problem for ALCIF can be solved in EXPTIME owing to a simple model property [16], but applying this result directly to the instance obtained via cycle reversing would lead to doubly exponential complexity due to an exponential blow-up inherent to cycle reversing. We provide a new algorithm with improved complexity bounds, which allows us to accommodate the blow-up while keeping the overall complexity in EXPTIME. We also reformulate the simplicity of models in terms of a graph-theoretical notion of (k, ℓ)-sparsity [44], which allows us to streamline the reasoning.
These reductions allow us to solve all problems of interest in EXPTIME, and we also establish the matching lower bounds.
The paper is organized as follows. In Section 2 we discuss related work. In Section 3 we introduce basic notions. In Section 4 we define graph transformations and the problems of interest, which we reduce to query containment modulo schema. In Section 5 we reduce the latter to satisfiability of a query modulo a Horn-ALCIF theory, which we solve in Section 6. In Section 7 we summarize our findings and identify directions of future work. Full proofs and some standard definitions have been moved to the Appendix.

RELATED WORK
Graph-based data models have been proposed in various forms and shapes since the 1980s [4].
The proposals in the 1980s and 1990s included labeled graphs [32], graphs where certain nodes represent complex values [33,43], graphs where nodes have associated complex values [1,2], and graphs where nodes are associated with nested graphs [45].More recently the RDF data model [30] and the Property Graph data model [3] have become popular.RDF graphs are similar to labeled graphs except that nodes are unlabeled and participate in at least one edge, and the labels of edges can be nodes and participate in edges.Property Graphs are also similar to labeled graphs except that nodes and edges have multiple labels and properties, and edges have identity.In our work we assume one of the simplest models, namely, labeled graphs where nodes have multiple labels and edges have a single label; our schemas require exactly one label per node.Since we focus here on transformations of the graph structure, we have no explicit notion of value associated with nodes and edges, but there are straightforward ways of adding this, as is done for example in [32].
The term graph transformations can refer to different formalisms [54]: the purpose of graph grammars is to define graph languages; algebraic graph transformations are mainly used to model systems with infinite behavior and are not functional (they produce multiple outputs on a single input). Therefore, not only are these formalisms ill-suited for defining transformations of graph databases, but also the problems studied for them are unrelated to the problems we study here. Monadic second-order (MSO) graph transductions [21] can capture our transformations only when restricted to unary node constructors; moreover, resorting to MSO logic typically incurs a prohibitive complexity overhead.
Transformation languages for graph databases are often based on Datalog extended with node-creation syntax in the head of the rules.It could be just a variable that is not bound in the body of the rule, like in IQL [2] and G-Log [51]; this ensures a fresh node is created for each valuation that makes the body true.Another option is to replace the unbound variable with a term consisting of a constructor function (sometimes called a Skolem function) applied to bound variables, like in O-logic [46] and F-logic [41]; the constructor creates a fresh node when called for the first time for certain arguments, and after that the same node for the same arguments.We adopt the idea of node constructors because we believe it provides a powerful and intuitive way to control the identity of new nodes.
A different proposal, based on structural recursion, is offered by UnQL [12], but the underlying data model considers graphs equivalent if they are bisimilar, which makes the expressive power quite different.
Graph transformations can also be expressed using query languages such as SPARQL and Cypher.
Nevertheless, we believe that a rule-based transformation language is more convenient for defining transformations and it can co-exist with an expressive query language.For instance, in the XML world, XSLT [40] (rule-based) focuses on transformations, while XQuery [56] is mostly used for querying XML data.
In the context of data exchange, schema mappings provide a declarative way to define database transformations [7,13,24].Our transformations could be simulated by considering canonical solutions for plain SO-tgds [5] extended to allow acyclic C2RPQs in rule bodies.Note, however, that equivalence is undecidable for plain SO-tgds with keys [25], and open for plain SO-tgds [42].
The static type checking problem originates in formal language theory and has been studied for finite state transducers on words and for various kinds of tree transducers, including some designed to capture XML transformation languages [47-50]. Type checking has also been studied for graph transformations. In [33] labelled graphs are transformed using addition, deletion, and reduction operations, and type checking is investigated for schemas similar to ours but without participation constraints. The typing problem for UnQL is studied in [39], but the approach relies on schemas specifying graphs up to bisimulation, which limits their power to express participation constraints. Regarding transformations defined by schema mappings, if the mapping does not define target constraints, then the target schema is simply a relational signature and type checking is reduced to a trivial syntactic check, and as such it is irrelevant. This is most often the case for graph schema mappings [7,13], with rare exceptions such as [10] for mapping relational to graph-shaped data. Their notion of consistency is related to type checking, but is studied for a simpler formalism without path queries. In the context of XML schema mappings, absolute consistency can be seen as a counterpart of type checking for non-functional transformations [8].

PRELIMINARIES
Graphs. We fix an enumerable set N of node identifiers, a recursively enumerable set Γ of node labels, and a recursively enumerable set Σ of edge labels. We work with labeled directed graphs, and in general, a node may have multiple labels while an edge has precisely one label. We allow, however, multiple edges between the same pair of nodes, as long as these edges have different labels. We model graphs as relational structures over unary relation symbols Γ and binary relation symbols Σ. That is, a graph G is a pair (dom(G), ·^G) where dom(G) ⊆ N is the set of nodes of G and the function ·^G maps each A ∈ Γ to a set A^G ⊆ dom(G) and each r ∈ Σ to a binary relation r^G ⊆ dom(G) × dom(G). A graph G is finite if dom(G) is finite and A^G and r^G are empty for all but finitely many A ∈ Γ and r ∈ Σ. In the sequel, we use u, v, . . . to range over node identifiers, A, B, C, . . . to range over node labels, and r, r′, . . . to range over edge labels. Also, we use r^− for inverse edges and let Σ^± = Σ ∪ {r^− | r ∈ Σ}.
Schemas. We consider a class of schemas that constrain the number of edges between nodes of given labels and we express these constraints with the usual symbols: ? for at most one, 1 for precisely one, + for at least one, * for arbitrarily many, and 0 for none. We focus on these basic cardinality constraints that are most commonly used in practice; e.g., Chen's original ER diagrams only used those [17]. In fact, we were unable to find any non-basic cardinality constraints in the FHIR specifications [34], while in the SHACL schemas in Yago 4.0 [58] we found only one: a person may have at most two parents. Now, a schema is a triple S = (Γ_S, Σ_S, δ_S), where Γ_S ⊆ Γ is a finite set of allowed node labels, Σ_S ⊆ Σ is a finite set of allowed edge labels, and δ_S : Γ_S × Σ_S^± × Γ_S → {?, 1, +, *, 0}. Schemas can be presented as graphs themselves, interpreted as illustrated next.
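Example 3.1. Take the schema S_0 in Figure 1 and consider, for instance, the designTarget edge. It indicates that every Vaccine has a single design target Antigen, in symbols δ_{S_0}(Vaccine, designTarget, Antigen) = 1.
A graph G conforms to a schema S if every node of G has exactly one label from Γ_S, every edge of G has a label from Σ_S, and for every node with label A, every r ∈ Σ_S^±, and every B ∈ Γ_S, the number of its r-successors with label B is as specified by δ_S(A, r, B). By L(S) we denote the set of all finite graphs that conform to S.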
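The conformance condition can be illustrated with a short Python sketch. The encoding is an assumption made for illustration: each node carries exactly one label (node_label), edges are triples (u, r, v), an inverse edge label is written with a trailing "-", and any triple missing from delta is treated as the symbol 0, in line with the later remark that some edges are implicitly forbidden. The function and constant names are hypothetical.

```python
def allowed(count, mult):
    """Check a successor count against one of the symbols ?, 1, +, *, 0."""
    return {"?": count <= 1, "1": count == 1, "+": count >= 1,
            "*": True, "0": count == 0}[mult]

def conforms(node_label, edges, node_labels, edge_labels, delta):
    """node_label: node -> its single label; edges: set of (u, r, v) triples."""
    # Only allowed node labels may be used.
    if any(a not in node_labels for a in node_label.values()):
        return False
    # Only allowed edge labels, and only between declared nodes.
    for u, r, v in edges:
        if r not in edge_labels or u not in node_label or v not in node_label:
            return False
    # Successors in both directions: r for forward edges, r- for inverse ones.
    succ = {}
    for u, r, v in edges:
        succ.setdefault((u, r), set()).add(v)
        succ.setdefault((v, r + "-"), set()).add(u)
    for u, a in node_label.items():
        for r in edge_labels:
            for rr in (r, r + "-"):
                for b in node_labels:
                    n = sum(1 for v in succ.get((u, rr), ())
                            if node_label[v] == b)
                    # Assumption: triples absent from delta mean "0".
                    if not allowed(n, delta.get((a, rr, b), "0")):
                        return False
    return True

# Hypothetical fragment of a schema: every Vaccine has exactly one
# designTarget Antigen; an Antigen may be the target of any number of Vaccines.
NODE_LABELS = {"Vaccine", "Antigen"}
EDGE_LABELS = {"designTarget"}
DELTA = {
    ("Vaccine", "designTarget", "Antigen"): "1",
    ("Antigen", "designTarget-", "Vaccine"): "*",
}
labels = {"v1": "Vaccine", "a1": "Antigen"}
good = conforms(labels, {("v1", "designTarget", "a1")},
                NODE_LABELS, EDGE_LABELS, DELTA)
bad = conforms(labels, set(), NODE_LABELS, EDGE_LABELS, DELTA)
```

In this sketch, good is True, while bad is False because the vaccine v1 has no designTarget edge although the schema fragment requires exactly one.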
Queries.We work with conjunctive two-way regular path queries (C2RPQs) that have the form , . . ., , ′ } \ ¯ and for every ∈ {1, . . ., }, and ′ are variables and the formula is a regular expression that follows the grammar where ∈ Γ matches nodes, ∈ Σ ± matches edges, matches empty paths, and matches no path.The semantics of C2RPQs is defined in the standard fashion [15] and we denote the set of Example 3.2.Recall the schema 0 in Figure 1.The following query selects vaccines together with the antigens they are designed to target or target through cross-reaction.
Trivial atoms are of the form ∅(x, x), ε(x, x), and A(x, x), and in the sequel, we abuse notation and write them as unary atoms: ∅(x), ε(x), and A(x), respectively. The multigraph of a C2RPQ q has variables of q as nodes and an edge from x to y for every non-trivial atom φ(x, y). The subclass of acyclic C2RPQs consists of queries whose multigraph is acyclic, i.e., it does not have a path consisting of distinct edges that visits the same node twice. Note that acyclicity for C2RPQs needs to be more restrictive than the classical acyclicity of conjunctive queries based on Gaifman graphs. Indeed, the Gaifman graph of a C2RPQ r(x, y) ∧ r′(x, y) is acyclic but its matches may form nontrivial cycles in the input graph.
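The acyclicity condition on the query multigraph can be checked with a standard union-find pass, sketched below in Python. The representation of a query as a list of variable pairs, one per non-trivial atom, and the name is_acyclic are assumptions made for illustration. Edge direction is ignored, so two parallel atoms between the same pair of variables, or an atom from a variable to itself, already count as a cycle.

```python
def is_acyclic(atoms):
    """atoms: list of (x, y) variable pairs, one per non-trivial atom."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in atoms:
        rx, ry = find(x), find(y)
        if rx == ry:
            # x and y are already connected: this atom closes a cycle.
            return False
        parent[rx] = ry
    return True

chain = is_acyclic([("x", "y"), ("y", "z")])
parallel = is_acyclic([("x", "y"), ("x", "y")])
triangle = is_acyclic([("x", "y"), ("y", "z"), ("z", "x")])
```

Here chain is True, while parallel and triangle are False; the parallel case corresponds to the two-atom query discussed above, whose Gaifman graph is acyclic although the query is not acyclic in the sense used here.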
A Boolean C2RPQ has all its variables existentially quantified, and it may have only a single answer, the empty tuple, in which case, we say that is satisfied in and write |= .We also use unions of C2RPQs (abbreviated as UC2RPQs) represented as sets of C2RPQs ( ¯ ) = { 1 ( ¯ ), . . ., ( ¯ )} and extend the notions of answers, satisfaction, and acyclicity to UC2RPQs in the natural fashion.Given two UC2RPQs ( ¯ ) and ( ¯ ), and a schema , we say that Description logics.We operate on properties of graphs formulated in the description logic ALCIF (and its fragments) [6].In description logics, elements of Γ and Σ are called concept names and role names, respectively.ALCIF allows to build more complex concepts with the following grammar: where ∈ Γ and ∈ Σ ± .We also use additional operators that are redundant but useful when defining fragments; for brevity we introduce them as syntactic sugar: ⊤ := ¬⊥, 1 ⊔ 2 := ¬(¬ 1 ⊓¬ 2 ), ∀ .:= ¬∃ .¬, .:= ¬∃ . .We extend the interpretation function • to complex concepts as follows: Statements in description logics have the form of concept inclusions,

C ⊑ D
where C and D are concepts. A graph G satisfies C ⊑ D, in symbols G |= C ⊑ D, if C^G ⊆ D^G. A set T of concept inclusions is traditionally called a TBox and we extend satisfaction to TBoxes in the canonical fashion: G |= T if G |= C ⊑ D for each C ⊑ D ∈ T.
In the Horn fragment of ALCIF , written Horn-ALCIF , we only allow concept inclusions in the following normal forms: where ∈ Γ, ∈ Σ ± , and , ′ are intersections of concept names (intersection of the empty set of concepts is ⊤).If statements of the form ⊑ 1 ⊔ 2 ⊔ • • • ⊔ are allowed too, then we recover the full power of ALCIF (up to introducing auxiliary concept names).
Participation constraints of schemas can be expressed with simple Horn-ALCIF statements, as illustrated in the following example.
Example 3.3. The assertion in S_0 (Figure 1) that a Pathogen exhibits at least one Antigen is expressed with the statement Pathogen ⊑ ∃exhibits.Antigen. The assertion that an Antigen may be exhibited by an arbitrary number of Pathogens needs no Horn-ALCIF statement. However, statements are needed for implicitly forbidden edges, e.g., Vaccine ⊑ ¬∃exhibits.Antigen.
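The correspondence between participation constraints and concept inclusions can be sketched in Python as follows. The cases for +, * and 0 follow the example above. The cases for 1 and ? are an assumption: they are rendered here with at-most constraints of the form A ⊑ ∃≤1 r.B, which the paper mentions later but does not spell out in this example. The function name schema_to_tbox and the string rendering of statements are hypothetical.

```python
def schema_to_tbox(delta):
    """delta: dict mapping (A, r, B) to one of '?', '1', '+', '*', '0'.

    Returns the concept inclusions as strings, in a deterministic order.
    """
    statements = []
    for (a, r, b), mult in sorted(delta.items()):
        if mult in ("1", "+"):
            # At least one r-successor with label B.
            statements.append(f"{a} ⊑ ∃{r}.{b}")
        if mult in ("1", "?"):
            # Assumed rendering of "at most one r-successor with label B".
            statements.append(f"{a} ⊑ ∃≤1 {r}.{b}")
        if mult == "0":
            # Forbidden edge.
            statements.append(f"{a} ⊑ ¬∃{r}.{b}")
        # '*' imposes no restriction and yields no statement.
    return statements

example_delta = {
    ("Pathogen", "exhibits", "Antigen"): "+",
    ("Antigen", "exhibits-", "Pathogen"): "*",
    ("Vaccine", "exhibits", "Antigen"): "0",
}
tbox = schema_to_tbox(example_delta)
```

For the three entries of example_delta, the sketch produces the two statements of the example, and nothing for the unrestricted inverse direction.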

GRAPH TRANSFORMATIONS
We propose transformations of graphs defined with Datalog-like rules that use acyclic C2RPQs in their bodies. To allow multiple copies of the same source node we use node constructors. Formally, a k-ary node constructor is a function f : N^k → N and we denote the set of node constructors by F. To remove the concern of overlapping node constructors, and the need to analyze their definitions, we assume that for every node label A ∈ Γ we have precisely one node constructor f_A, all node constructors are injective, and their ranges are pairwise disjoint.
Type checking. Given a transformation T, a source schema S, and a target schema S′, check whether for every graph G that conforms to S the output T(G) of the transformation conforms to S′.
Equivalence. Given a source schema S and two transformations T_1 and T_2, check whether T_1 and T_2 agree on every graph that conforms to S.
In settings where the target schema is not known, it might be useful to construct one. Naturally, we wish to preclude a trivial solution that produces the universal schema that accepts all graphs over a given set of node and edge labels. Instead, we propose to construct a schema that offers the tightest fit to the set of output graphs. To define this requirement formally, we define schema containment in the classical fashion: a schema S is contained in S′ if and only if L(S) ⊆ L(S′).
Schema elicitation. Given a transformation T and a source schema S, construct the containment-minimal target schema S′ such that T(G) ∈ L(S′) for every G ∈ L(S). We observe that T(G) may have nodes with no label, which may preclude it from satisfying any schema, and consequently, schema elicitation may also return an error.
We prove the main result by reducing the problems of interest to query containment modulo schema (and vice versa), which we later show to be EXPTIME-complete.Although schema elicitation is not a decision problem, we show EXPTIME-completeness of deciding if the result of schema elicitation is equivalent to a given schema.Should schema elicitation have lesser complexity, so would have the corresponding decision problem since schema equivalence is easily decided in polynomial time.We outline the main ideas of the proof by illustrating how a transformation can be analyzed with a toolbox of methods based on query containment modulo source schema .We formulate these methods with an entailment relation: W.l.o.g.we assume that every rule of transformation is trim i.e., it uses in its body a query ( ¯ ) that is satisfiable modulo , in symbols ∃ ¯ .( ¯ ) ; otherwise, can be trimmed.
First, we group queries from rules of based on the labels of nodes and edges they create.For , ∈ Γ and ∈ Σ we define In essence, ( ¯ ) identifies tuples over the input graph that yield a node constructed with and with label while , , ( ¯ , ¯ ) identifies tuples that yield -edges from a node created with to a node created with .Vaccine ( ) = (Vaccine)( ) , Vaccine,targets,Antigen ( , ) = (designTarget • crossReacting * )( , ) , Vaccine,designTarget,Antigen ( , ) = (designTarget)( , ) .Since an edge rule does not assign labels to nodes it creates, the result of a transformation may be a graph with nodes without a label.Such a situation precludes type checking from passing and prevents schema elicitation from producing meaningful output.Consequently, we first verify that every node in every output graph has exactly one label, in symbols ( , ) |= ⊤ ⊑ Γ , where { 1 , . . ., } is a shorthand for 1 ⊔. ..⊔ .We prove the following (Lemma B.6).
We point out that the restriction of one node constructor per node label ensures that each node of the output has at most one label.Now, to perform type checking against a given target schema ′ , we verify that Γ ⊆ Γ ′ and Σ ⊆ Σ ′ .Then, we take the TBox T ′ of concept inclusions that expresses participation constraints of the target schema ′ and we verify that ( , ) |= T ′ .Type checking succeeds if and only if all the above tests succeed (Lemma B.2).The TBox T ′ consists of statements from a small fragment L 0 of Horn-ALCIF which allows only statements of the forms where , ∈ Γ and ∈ Σ ± .The entailment of such statements is also reduced to query containment (Lemma B.7): Example 4.5.Take the transformation 0 and the schemas 0 and 1 in Figure 1.The schema 1 requires every vaccine to target at least one antigen, in symbols Vaccine ⊑ ∃targets.Antigen.This statement is entailed by 0 and 0 if and only if the following holds For schema elicitation, we use a close correspondence between schemas and L 0 TBoxes.It is sufficient to construct the TBox T containing all L 0 statements that are entailed by and ; T corresponds to the containment-minimal target schema (Lemma B.5).
Finally, the equivalence of two transformations 1 and 2 is essentially the equivalence (modulo ) of the respective queries and , , of both transformations (Lemma B.8). Naturally, query equivalence is reduced to query containment, as usual.
We have shown that type checking, schema elicitation, and equivalence of graph transformations are Turing-reducible in polynomial time to testing containment of UC2RPQs in acyclic UC2RPQs modulo schema.We also show polynomial-time reductions of containment of 2RPQs modulo schema to all above problems of interest (Lemma F.2).With that, Theorem 4.2 follows from Theorem 5.1.

QUERY CONTAINMENT MODULO SCHEMA
The aim of this section is to show the following result.
The lower bound can be derived from the EXPTIME-hardness of unrestricted containment of 2RPQs (using only edge labels) modulo very simple TBoxes. The latter is obtained by reduction from another reasoning task (satisfiability of ALCI TBoxes) and relies on the inner workings of its hardness proof. For completeness, we provide a direct reduction from the acceptance problem for polynomial-space alternating Turing machines (Theorem F.1).
The remainder of this section is devoted to the upper bound. We show it by reduction to unrestricted (finite or infinite) satisfiability of C2RPQs modulo a Horn-ALCIF TBox, which we discuss in Section 6. The principal technique applied in the reduction is cycle reversing [20].
Let S be a schema, p a UC2RPQ, and q an acyclic UC2RPQ. Without loss of generality we may assume that p and q are Boolean (see Lemma D.1). The key idea is to pass from finite to possibly infinite graphs, thus making canonical witnesses for non-containment easier to find. However, as Example 5.2 shows, we cannot pass freely from finite to possibly infinite graphs, as this may affect the answer.
Example 5.2.Consider the schema in Figure 2. Observe that allows infinite graphs that are essentially infinite trees when restricted to -edges, e.g.∞ in Figure 2. In fact, every infinite graph satisfying that is connected when restricted to -edges is an infinite tree.On the other hand, every non-empty finite graph that conforms to is a collection of disjoint cycles when restricted to -edges, e.g., 0 in Figure 2. Clearly, the topology of finite and infinite graphs defined by the schema differs drastically.Now, take the queries = ∃ .( , ), = ∃ , .(• + • )( , ), and observe that ⊆ .However, the containment does not hold over infinite graphs: is satisfied by ∞ while is not.
The reason why we cannot pass directly to infinite models is that finite graphs conforming to schema may display certain additional common properties, detectable by queries, but not shared by infinite graphs conforming to .The cycle reversing technique [20] captures these properties in * such that where by ⊆ ∞ * we mean containment over possibly infinite graphs conforming to * .However, as the following example shows, we cannot obtain * by analysing alone.
Example 5.3.In Example 5.2 we saw that in a finite graph conforming to , each node has exactly one incoming and one outgoing -edge.We can use this observation to tighten the original schema to the schema * (Figure 2).Alas, we still have because there is an infinite graph * ∞ that satisfies but not .
Instead, we first reduce containment modulo schema to finite satisfiability, fusing the schema and the query into a single Horn-ALCIF TBox, and then pass from finite to unrestricted satisfiability by applying cycle reversing to the resulting TBox.We follow closely the approach of Ibáñez-García et al. [38], relying crucially on some of their results.
Let T be a Horn-ALCIF TBox.A finmod cycle is a sequence for 1 ≤ < .The completion T * of a TBox T is obtained from T by exhaustively reversing finmod cycles.The following key result is stated in [38] in terms of sets of ground facts (so-called ABoxes) rather than subgraphs, but our formulation is equivalent.
Non-satisfaction of is captured by TBox T ¬ that consists of Now, suppose that there exists a (finite or infinite) model of T * that satisfies (see Figure 3).must have a node with ( , ) ∈ .It follows already from T that ∈ ( ⊓ ) and that has an -successor ′ ∈ ( ⊓ • + ) .The statement ⊓ • + ⊑ ∃ − .⊓ • + in T * implies that ′ has an − -successor ′′ ∈ ( ⊓ • + ) .As each node has at most one incoming -edge, = ′′ and ∈ ( • + ) .But has an outgoing -edge, which contradicts the last concept inclusion in T ¬ .Thus, is not satisfied in T * .
We are now ready to reduce containment modulo schema to unrestricted satisfiability modulo Horn-ALCIF TBox.Note that the guarantees on the resulting TBox in the statement below are sufficient to conclude Theorem 5.1 using Theorem 6.1.
Theorem 5.6.Given a UC2RPQ , an acyclic UC2RPQ , and a schema , one can compute in EXPTIME a UC2RPQ of polynomial size and a Horn-ALCIF TBox T using linearly many additional concept names and polynomially many at-most constraints, such that ⊆ if and only if is (unrestrictedly) unsatisfiable modulo T .
Let us sketch the proof.Let T be the Horn-ALCIF TBox corresponding to .Note that apart from the explicit restrictions captured in T the schema also ensures that only graphs with exactly one label per node are considered.To ensure at most one label from Γ per node, we use the TBox The concept inclusion ⊤ ⊑ Γ , expressing that each node has at least one label from Γ , is not Horn and cannot be used.Instead, we modify the query .Assuming Γ = { 1 , 2 , . . ., }, we include ( 1 + 2 + • • • + ) before and after each edge label used in an atom of .Additionally, to ensure that uses only labels allowed by , we substitute in each label not in Γ ∪ Σ ± by .Letting be the resulting query, we have Because is acyclic, by adapting the rolling-up technique [35] one can compute in PTIME a Horn-ALCIF TBox T ¬ over an extended set of concept names Γ ∪ Γ such that (see Lemma C.2). Since T ∪ T ¬ is a Horn-ALCIF TBox, we can consider its completion T ∪ T ¬ * .As UC2RPQs are witnessed by finite subgraphs whenever they are satisfied, we can infer from Theorem 5.4 that is finitely satisfiable modulo It remains to compute the completion.Reversing cycles does not introduce new concept names, but it may generate exponentially many concept inclusions.Identifying a finmod cycle involves deciding unrestricted entailment of Horn-ALCIF concept inclusions, which is decidable in EXPTIME [26].However, since the input TBox might grow to an exponential size as more and more cycles are reversed, it is unlikely that the completion can be computed in EXPTIME for every Horn-ALCIF TBox.Our key insight is that T ∪ T ¬ enjoys a particular property, invariant under reversing cycles, that keeps the complexity under control.
A concept inclusion (CI) of the form ⊑ ∃ .′ or ⊑ ∃ ≤1 .′ is relevant for a TBox T if the triple ( , , ′ ) is satisfiable modulo T ; that is, some model of T contains nodes and ′ such that ∈ , ( , ′ ) ∈ , and ′ ∈ ( ′ ) .We say that T is -driven if for each relevant CI in T of the form ⊑ ∃ .′ (resp.⊑ ∃ ≤1 .′ ), T contains ⊑ ∃ .′ (resp.⊑ ∃ ≤1 .′ ) for some , ′ ∈ Γ such that ∈ , ′ ∈ ′ ; here and later we blur the distinction between conjunctions of concept names and sets of labels.Note that T ∪ T ¬ is trivially -driven, as all its existential and at-most constraints are of the form ⊑ ∃ .′ or ⊑ ∃ ≤1 .′ .Lemma 5.7.Every -driven TBox T can be simplified in polynomial time so that it contains at most From our results in Section 6 it follows that unrestricted entailment for a Horn-ALCIF TBox T with concept names and ℓ at-most constraints can be solved in time poly(|T |) • 2 poly( ,ℓ) (Corollary E.7).Hence, it would suffice to show that by reversing a finmod cycle in an -driven TBox, we obtain another -driven TBox.In fact, we prove something weaker, but sufficient to compute the completion in EXPTIME, and conclude that it is -driven.
Based on the obtained invariant we can compute the completion T ∪ T ¬ * in EXPTIME (Lemma D.7).By reducing T ∪ T ¬ * as described above, we obtain the desired TBox T , thus completing the proof of Theorem 5.6.

SATISFIABILITY MODULO TBOX
The last missing piece is to solve the unrestricted satisfiability of C2RPQs modulo Horn-ALCIF .Calvanese et al. show that the problem is in EXPTIME not only for Horn-ALCIF , but even for ALCIF extended with additional features [16].This result is not directly applicable, because our reduction produces a TBox of exponential size.The following theorem gives the more precise complexity bounds that we need.section to the simple model property.We do it to show a connection to an elegant graph-theoretical notion that helps to simplify the reasoning considerably, at least for ALCIF .We begin by illustrating how simple models are obtained for queries satisfiable modulo schemas (rather than arbitrary TBoxes).
Example 6.2.Take the schema in Figure 4 (its two types are represented with a blue square and a red circle), and consider the following satisfiable (cyclic) query Since is satisfiable modulo , we take any graph conforming to where is satisfied, and we choose any 3 paths witnessing each of the regular expressions of .We construct the initial graph 0 consisting of the 3 paths joined at their ends: it might look like the one in Figure 4. We observe that requires every red circle node to have at most one outgoing -edge and at most one incomingedge (to and from a red circle node).The initial graph 0 violates this requirement and to enforce it we exhaustively merge offending nodes.The final graph is a simple model of modulo .
We formalise simple models using a graph-theoretic notion of sparsity proposed by Lee and Streinu [44]. We say that a connected graph with n nodes and m edges is k-sparse if m ≤ n + k. (In Lee and Streinu's terminology this corresponds to (1, −k)-sparsity.) Being k-sparse is preserved under adding and removing nodes of degree 1. By exhaustively removing nodes of degree 1 from a k-sparse graph we arrive at a single node or a connected k-sparse graph in which all nodes have degree at least 2. Assuming k ≥ 1, it is not hard to see that such a graph consists of at most n = 2k distinguished nodes connected by at most m = 3k simple paths disjoint modulo endpoints (see Lemma E.1). We call such a graph an (n, m)-skeleton, and we refer to the graph above as the skeleton of the original graph. Thus, a k-sparse graph consists of a (2k, 3k)-skeleton and a number of attached trees; by attaching a tree to a graph we mean taking their disjoint union and adding a single edge between the root of the tree and some node of the graph.
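The pruning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as a set of nodes and a list of undirected edges (a multigraph), and the function names `sparsity` and `prune_degree_one` are hypothetical, not taken from the paper.

```python
def sparsity(nodes, edges):
    """Return m - n, the least k for which a connected graph is k-sparse."""
    return len(edges) - len(nodes)


def prune_degree_one(nodes, edges):
    """Exhaustively remove nodes of degree 1, one node at a time.

    `edges` is a list of (u, v) pairs; parallel edges and self-loops are
    allowed, and a self-loop contributes 2 to the degree of its node.
    """
    nodes, edges = set(nodes), list(edges)
    while len(nodes) > 1:
        degree = {v: 0 for v in nodes}
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        leaf = next((v for v in nodes if degree[v] == 1), None)
        if leaf is None:
            break  # every remaining node has degree at least 2
        nodes.remove(leaf)
        edges = [e for e in edges if leaf not in e]
    return nodes, edges


# A triangle a-b-c with a pendant path c-d-e attached.
nodes = {"a", "b", "c", "d", "e"}
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
core_nodes, core_edges = prune_degree_one(nodes, edges)
```

Each removal deletes one node and one edge, so the value m − n computed by `sparsity` is the same before and after pruning, which is the preservation property used in the text.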
For the purpose of the simple model property we need to lift the notion of k-sparsity to infinite graphs. We call a (possibly infinite) graph k-sparse if it consists of a finite connected k-sparse graph with finitely many finitely branching trees attached.

Proof. Let k be the difference between the number of atoms and the number of variables of the query. Because the query is connected, k ≥ −1. By definition, the query, understood as a graph with variables as nodes and atoms as edges, is k-sparse.
We write → ′ to indicate that there is a homomorphism from graph to graph ′ ; that is, a function ℎ mapping nodes of to nodes of ′ that preserves node labels and the existence of labelled edges between pairs of nodes. Let be a (possibly infinite) model of and T . We construct a sequence of finite connected k-sparse graphs of strictly decreasing size such that 0 |= and the homomorphism from to is injective over -successors of every node, for each .
To construct 0 let us fix a match of in together with a (finite) witnessing path for each atom of .We construct 0 as follows.For each variable of we include a node whose set of labels is identical to that of the image of in under the fixed match.Next, for each atom of that connects variables and we add a simple path connecting and such that the sequence of edge labels and sets of node labels read off of this path is identical to that of the witnessing path of this atom in .This graph can be seen as a specialization of where each regular expression is replaced by a single concrete word, except that we include full sets of labels of nodes, as they are encountered in the witnessing path in .It follows immediately that 0 |= and that 0 → .To see that 0 is -sparse one can eliminate the internal nodes of the connecting paths one by one, like in the proof of Lemma E.1, until a graph isomorphic to remains.
We define the remaining graphs inductively, maintaining an additional invariant → . Suppose we already have together with a homomorphism ℎ : → for some ≥ 0. If ℎ is injective over -successors of each node of , we are done. If not, there are two different -successors 1 and 2 of a node in that are mapped to the same node ′ in . It follows that 1 and 2 have the same sets of labels. We let +1 be the graph obtained from by merging 1 and 2 into a single node . We include an ′ -edge between and each ′ -successor of 1 or 2 . This decreases the number of nodes by one, and the number of edges by at least one. It follows that +1 is k-sparse and → +1 → . Because the sizes of graphs are strictly decreasing, at some point we will arrive at a graph such that the homomorphism from to is injective over -successors. The graph clearly satisfies . It also satisfies all concept inclusions in T of the forms . ′ , and ⊑ ∃ ≤1 .′ , because ℎ is injective over -successors and |= T . On the other hand, is not guaranteed to satisfy concept inclusions of the form ⊑ ∃ .′ in T . In order to fix it, we exhaustively (ad infinitum) perform the following: whenever a node in is missing an -successor with some set of labels, we add it and map it to some such -successor ′ of the image of in ( ′ exists because |= T ). As k ≤ | |, the resulting (typically infinite) graph is | |-sparse, and it satisfies and T .
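The merging loop of this proof can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the finite graph is a set of nodes plus a set of directed labelled edges `(u, role, v)`, inverse roles are written with a trailing `-`, and the homomorphism into the model is given as a dictionary `h`. All names are hypothetical.

```python
def successors(edges, u):
    """Yield (role, v) for every forward or inverse role leading from u to v."""
    for (a, r, b) in edges:
        if a == u:
            yield (r, b)
        if b == u:
            yield (r + "-", a)


def merge_until_injective(nodes, edges, h):
    """Merge successors of a node that share a role and an image under h.

    Stops when h is injective over the successors of every node.
    """
    nodes, edges = set(nodes), set(edges)
    while True:
        clash = None
        for u in nodes:
            seen = {}
            for r, v in successors(edges, u):
                key = (r, h[v])
                if key in seen and seen[key] != v:
                    clash = (seen[key], v)
                    break
                seen[key] = v
            if clash:
                break
        if clash is None:
            return nodes, edges
        keep, drop = clash
        nodes.discard(drop)
        # Redirect every edge that touched the dropped node to the kept one.
        edges = {
            (keep if a == drop else a, r, keep if b == drop else b)
            for (a, r, b) in edges
        }


# Two r-successors of x that the homomorphism sends to the same node.
nodes = {"x", "y1", "y2", "z"}
edges = {("x", "r", "y1"), ("x", "r", "y2"), ("y1", "s", "z"), ("y2", "s", "z")}
h = {"x": "X", "y1": "Y", "y2": "Y", "z": "Z"}
merged_nodes, merged_edges = merge_until_injective(nodes, edges, h)
```

Every merge removes one node and at least one edge, so the loop terminates, mirroring the strictly decreasing sizes in the argument above.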
The connectedness assumption in Theorem 6.3 is not restrictive, because a witnessing graph for can be obtained by taking the disjoint union of witnesses for its connected components. Hence, it remains to decide for a given connected if there exists a | |-sparse graph that satisfies and T . To get a finer control of the effect different parameters of the input have on the complexity, we side-step two-way alternating tree automata (2ATA) applied by Calvanese et al. and develop a more direct algorithm.
Observe that if is satisfied in a | |-sparse graph , then contains a (4| |, 5| |)-skeleton ′ , extending the skeleton of , such that all variables of are mapped to distinguished nodes of ′ . Indeed, ′ can be obtained by iteratively extending the skeleton of . Suppose that some variable is mapped to a node that is not yet a distinguished node of ′ . If already belongs to ′ , then it is an internal node in a path between two distinguished nodes; we then split the path in two, turning into a distinguished node. If does not belong to ′ , then it belongs to a tree attached to ′ at a node . If is not a distinguished node of ′ , we turn it into one, as above. Then, we add to ′ as a distinguished node, including the path between and into ′ as well. As we start from a (2| |, 3| |)-skeleton and add at most two distinguished nodes and two paths for each variable of , we end up with a (4| |, 5| |)-skeleton.
Thus, the algorithm can guess a (4| |, 5| |)-skeleton ′ with each path represented by a single symbolic edge and check that it can be completed to a suitable graph by materializing symbolic edges into paths and attaching finitely many finitely branching trees in such a way that is a model of T and there is a match of in that maps variables of to distinguished nodes of ′ .This can be done within the required time bounds by means of a procedure that can be seen as a variant of type elimination or an emptiness test for an implicitly represented nondeterministic tree automaton (see Theorem E.3).

DISCUSSION
Summary. In this paper we have studied several static analysis problems for graph transformations defined with Datalog-like rules that use acyclic C2RPQs. When the source schema is given, we studied the equivalence problem of two given transformations, and the problem of target schema elicitation for a given transformation. If the output schema is also given, we have studied the problem of type checking. We have shown that the above problems can be reduced to containment of C2RPQs in acyclic UC2RPQs modulo schema, a problem that we have reduced to the unrestricted (finite or infinite) satisfiability of a C2RPQ modulo Horn-ALCIF TBox using cycle reversing. For the latter problem we have presented an algorithm with sufficiently good complexity to accommodate the exponential blow-up introduced by cycle reversing, which allows us to solve all problems of interest in EXPTIME. We have also shown matching lower bounds by reducing query containment modulo schema to each of the static analysis problems.
Finite containment modulo Horn-ALCIF TBox.In the course of the proof of the upper bound for containment modulo schema, we essentially solved (finite) containment modulo Horn-ALCIF TBox.Indeed, while the EXPTIME upper bound relies on the special shape of the TBox expressing the schema, the method can be applied directly to any Horn-ALCIF TBox, at the cost of an exponential increase in complexity.Thus, we immediately get that finite containment of UC2RPQs in acyclic UC2RPQs modulo Horn-ALCIF TBoxes can be solved in 2EXPTIME.To the best of our knowledge this is the first result on finite containment of C2RPQs in the context of description logics.A related problem of finite entailment has been studied for various logics [27][28][29]31], but while for conjunctive queries the solutions carry over to finite containment, for C(2)RPQs these logics are too weak to allow this.Unrestricted containment of C2RPQs modulo ALCIF TBoxes is known to be in 2EXPTIME [16], but passing from unrestricted to finite structures is typically challenging for such problems.For example, finite entailment of CRPQs for a fundamental description logic ALC has been solved only recently [31], 15 years after the unrestricted version [14].
Extending queries. It is straightforward to extend our methods to two-way nested regular expressions (NREs) [52]. We also intend to investigate introducing negation in filter expressions of NREs. Eliminating the acyclicity assumption, on the other hand, is problematic. Containment of arbitrary C2RPQs is EXPSPACE-complete [15], and we have shown that it reduces to our problems of interest for transformation rules with cyclic queries. Hence, extending our EXPTIME upper bounds to transformations allowing cyclic C2RPQs is highly unlikely. In fact, even establishing decidability would be hard. For acyclic queries we could use the rolling-up technique to reduce containment to satisfiability, which allowed us to apply the cycle reversing technique and pass from finite to unrestricted models. When cyclic queries are allowed, the rolling-up technique is inapplicable and we are left with containment of C2RPQs modulo constraints, which is a major open problem, not only for constraints expressed in description logics. The only positive results we are aware of do not go significantly beyond CQs extended with a binary reachability relation [23].
Extending schemas.Extending the schema formalism with disjunction is also challenging: the corresponding description logic would not be Horn any more and the transition to unrestricted models via cycle reversing would not be possible.Supporting multiple labels on nodes would not be a trivial extension either: we rely on the single label per node assumption in the reduction of the problems of interest to containment of UC2RPQs in acyclic UC2RPQs, and in the EXPTIME upper bound.Supporting more general cardinality constraints, on the other hand, should be possible, but it might affect the complexity upper bounds.
Extending the data model.It is straightforward to encode data values in our graph model, for instance, by using dedicated node labels to designate literal nodes whose identifiers are their data values.Then, one can apply methods similar to type checking to verify that transformations are well-behaved, and in particular, do not attempt to construct literal nodes from non-literal ones.However, the full consequences of allowing literal values in definitions of transformation rules need to be thoroughly investigated.
Finally, we have considered equivalence of transformations based on equality of results but one could also consider a variant based on isomorphism of results.This would be an entirely different problem, probably much harder.

A DETAILS ON QUERIES
A two-way regular expression is an expression defined with the following grammar.
where ∈ Γ and ∈ Σ ± .We define the semantics with the notion of witnessing paths that we formalize next.Given a graph , a path from 0 to in is a sequence such that 0 , . . ., are nodes of , ℓ 1 , . . ., ℓ ∈ Γ ∪ Σ ± , and for every ∈ {1, . . ., } the following conditions are satisfied: ( Given a two-way regular expression we define the corresponding binary relation on nodes of the graph: ( , ) ∈ [ ] iff there is a path from node to node in whose labeling is recognized by .Now, a conjunctive two-way regular path query (C2RPQ) is a formula of the form , where for every ∈ {1, . . ., } the formula is a two-way regular expression and ¯ = { 1 , ′ 1 , . . ., , ′ } \ ¯ .A C2RPQ is Boolean if all of its variables are existentially quantified.
Evaluating a C2RPQ ( ¯ ) over a graph yields a set [ ( ¯ )] of tuples over ¯ i.e., functions that assign nodes of to elements of ¯ .Formally, ∈ [ ( ¯ )] iff there is a tuple ′ over ¯ such that the two tuples combined ′′ = ∪ ′ satisfy all atoms i.e., ( ′′ ( ), ′′ ( ′ )) ∈ [ ] for every ∈ {1, . . ., }.When the query is Boolean, then it may have only a single answer, the empty tuple () i.e., the unique function with the empty domain.If indeed () ∈ [ ] we say that is satisfied in and denote it by |= ; otherwise, when [ ] = ∅, we say that is not satisfied in and we write |= .
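The semantics of a single two-way regular path atom can be made concrete with a small Python sketch that searches the product of the graph with an automaton for the expression. This is a standard technique used here purely for illustration, not a construction taken from the paper. Assumptions: node labels and edge labels are disjoint strings, an inverse edge label is written with a trailing `-`, and the automaton is given explicitly by its initial states, final states, and transition triples. The function name `eval_2rpq` is hypothetical.

```python
from collections import deque


def eval_2rpq(labels, edges, initial, final, delta):
    """Return all pairs (u, v) joined by a path whose labelling is accepted.

    labels: dict mapping each node to its set of node labels
    edges:  set of (source, edge_label, target) triples
    delta:  set of (state, symbol, state) automaton transitions
    """
    answers = set()
    for start in labels:
        frontier = deque((start, q) for q in initial)
        seen = set(frontier)
        while frontier:
            node, state = frontier.popleft()
            if state in final:
                answers.add((start, node))
            for (q, sym, q2) in delta:
                if q != state:
                    continue
                if sym in labels[node]:
                    targets = [node]  # node-label test, stay in place
                elif sym.endswith("-"):
                    targets = [a for (a, r, b) in edges
                               if r == sym[:-1] and b == node]
                else:
                    targets = [b for (a, r, b) in edges
                               if r == sym and a == node]
                for t in targets:
                    if (t, q2) not in seen:
                        seen.add((t, q2))
                        frontier.append((t, q2))
    return answers


# Expression "r followed by the inverse of r" on a graph with two r-edges into node 2.
labels = {1: {"A"}, 2: {"B"}, 3: {"A"}}
edges = {(1, "r", 2), (3, "r", 2)}
delta = {(0, "r", 1), (1, "r-", 2)}
pairs = eval_2rpq(labels, edges, initial={0}, final={2}, delta=delta)
```

A Boolean query consisting of this single atom is satisfied exactly when the returned set is non-empty.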
For defining transformations we employ the subclass of acyclic C2RPQs.Formally, for a query we construct its query multigraph whose nodes are variables and for every atom ( , ) we add an edge ( , ) unless the atom is of the form ( , ), ( , ), or ( , ).
is acyclic if its query multigraph is acyclic.Finally, the semantics of unions of conjunctive two-way regular path queries (UC2RPQs), represented as sets of C2RPQs, is defined simply as: A UC2RPQ is acyclic if all of its components are acyclic.A Boolean UC2RPQ consists of Boolean C2RPQs.
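Acyclicity of the query multigraph can be tested with a union-find pass over its edges. The sketch below assumes the caller has already dropped the atoms that the definition above exempts from the multigraph; the function name and the input encoding (a list of variable pairs, one per remaining atom) are hypothetical.

```python
def is_acyclic_multigraph(variables, atom_edges):
    """Return True iff the undirected multigraph on `variables` is a forest.

    Two parallel edges between the same pair of variables, or a self-loop,
    count as a cycle.
    """
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y in atom_edges:
        root_x, root_y = find(x), find(y)
        if root_x == root_y:
            return False  # this edge closes a cycle
        parent[root_x] = root_y
    return True


path_query = is_acyclic_multigraph("xyz", [("x", "y"), ("y", "z")])
parallel_atoms = is_acyclic_multigraph("xy", [("x", "y"), ("x", "y")])
triangle = is_acyclic_multigraph("xyz", [("x", "y"), ("y", "z"), ("z", "x")])
```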

B PROOFS FOR TRANSFORMATIONS
We begin by introducing elements of useful terminology. Given any finite subsets Γ 0 ⊆ Γ and Σ 0 ⊆ Σ, we say that a schema is over Γ 0 and Σ 0 if Γ = Γ 0 and Σ = Σ 0 . Analogously, we say that an ALCIF TBox T is over Γ 0 and Σ 0 if all base concept names and base role names used in T are from Γ 0 and Σ 0 respectively. Also, we say that a graph is over Γ 0 and Σ 0 if does not use any node or edge label outside of Γ 0 and Σ 0 , and we extend this notion to families of graphs in the canonical fashion: G is a family of graphs over Γ 0 and Σ 0 if every graph in G is over Γ 0 and Σ 0 . Finally, a transformation is over Γ 0 and Σ 0 if all rules in use in their heads only node and edge labels in Γ 0 and Σ 0 respectively. However, for a transformation we shall need to identify tighter sets of node and edge labels when the input schema is known. As such, a transformation rule ← ( ¯ ) is productive modulo a schema if ( ¯ ) . A transformation is trimmed modulo if 1) every rule in is productive modulo , 2) for every ∈ Γ there is an -node rule in , and 3) for every ∈ Σ there is a -edge rule in . Naturally, checking that a transformation is trimmed can be Turing-reduced in polynomial time to testing query containment modulo schema. Moreover, for a given schema we can trim a given transformation by removing all unproductive rules and removing from Γ and Σ any symbols that are not present in the head of any of the remaining rules.
Next, an L 0 TBox over Γ 0 and Σ 0 is a set of statements of the forms where , ∈ Γ 0 and ∈ Σ ± 0 . T is coherent iff 1) T does not contain two contradictory rules ⊑ ∃ . and ⊑ . for any , ∈ Γ and ∈ Σ ± , and 2) T contains ⊑ ∃ ≤1 . whenever it contains ⊑ . . Now, for a given schema the corresponding L 0 TBox T (over Γ and Σ ) is defined as follows.
It is easy to see that there is one-to-one correspondence between schemas and coherent TBoxes.More precisely, given Γ 0 ⊆ Γ and Σ 0 ⊆ Σ, for any schema over Γ 0 and Σ 0 , T is a coherent TBox over Γ and Σ , and for any coherent TBox T over Γ 0 and Σ 0 there is a unique schema over Γ 0 and Σ 0 such that T = T .Naturally, T also captures the semantics of the cardinality constraints of .Later we prove how to reduce entailment of statements to query containment.Before, we address the problem of schema elicitation by observing that the correspondence between schemas and their L 0 TBoxes is tighter.We first need to establish two auxiliary results.The first one characterizes the containment of schemas, which is expressed as an extension of a syntactic containment relation on the symbols used to specify participation constraints.More precisely, we define as the transitive and reflexive closure of the following assertions: 0 ?, 1 ?, ?+, and + *.Proposition B.3.Take finite Γ 0 ⊆ Γ and Σ 0 ⊆ Σ.Given two schemas 1 and 2 over Γ 0 and Σ 0 , ( 1 ) ⊆ ( 2) if and only if 1 ( , , ) 2 ( , , ) for every , ∈ Γ 0 and ∈ Σ ± 0 .P .For the if part, we take any that conforms to 1 and we note first that every node of has exactly one label in Γ 0 .Also, for any , , ∈ Γ 0 and any ∈ Σ ± 0 we observe that 1 ( , , ) 2 ( , , ) implies that any -node in whose number of -successors with label satisfies the participation constraint 1 ( , , ) will also satisfy 2 ( , , ).
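The criterion of Proposition B.3 can be illustrated by reading each participation symbol semantically, as the interval of successor counts it allows. The sketch below is written under that assumption rather than from the syntactic closure defined above: `0` allows exactly zero successors, `1` exactly one, `?` at most one, `+` at least one, and `*` any number. It also assumes that a triple absent from a schema stands for `0`, and that schemas are encoded as dictionaries from triples to symbols. The names `ALLOWED`, `subsumed`, and `schema_contained` are hypothetical.

```python
import math

# Interval of allowed successor counts for each participation symbol.
ALLOWED = {
    "0": (0, 0),
    "1": (1, 1),
    "?": (0, 1),
    "+": (1, math.inf),
    "*": (0, math.inf),
}


def subsumed(c1, c2):
    """True iff every successor count allowed by c1 is also allowed by c2."""
    low1, high1 = ALLOWED[c1]
    low2, high2 = ALLOWED[c2]
    return low2 <= low1 and high1 <= high2


def schema_contained(s1, s2):
    """Pointwise comparison of two schemas over the same label sets.

    Each schema maps (source_label, edge_label, target_label) to a symbol;
    a missing triple is treated as "0" (an assumption of this sketch).
    """
    triples = set(s1) | set(s2)
    return all(subsumed(s1.get(t, "0"), s2.get(t, "0")) for t in triples)


strict = {("Vaccine", "targets", "Antigen"): "1"}
relaxed = {("Vaccine", "targets", "Antigen"): "+"}
```

With this encoding, `schema_contained(strict, relaxed)` holds while the converse does not, matching the intuition that relaxing a constraint from "exactly one" to "at least one" enlarges the set of conforming graphs.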
Next, we establish correspondence between L 0 theories of sets of graphs and their containment-minimal schemas.
Proposition B.4.Take finite Γ 0 ⊆ Γ and Σ 0 ⊆ Σ and take any nonempty family G of graphs over Γ 0 and Σ 0 such that G |= ⊤ ⊑ Γ 0 and G |= ⊓ ⊑ ⊥ for all , ∈ Γ 0 .Let T be the set of all L 0 statements over Γ 0 and Σ 0 that hold in every graph in G.Then, T corresponds to the containment minimal schema over Γ 0 and Σ 0 such that G ⊆ ( ).
We obtain the following result, which allows us to solve the problem of schema elicitation.
Lemma B.5.Take a schema and a transformation that is trimmed modulo and such that ( , ) |= ⊤ ⊑ Γ .Let T be the set of all L 0 statements over Γ and Σ that are satisfied by every graph in the family { ( ) | ∈ ( )}.Then, T corresponds to the containment minimal schema over Γ and Σ that contains { ( ) | ∈ ( )}.

Proof. The proof follows immediately from Proposition B.4 except for the case when is empty. Then, however, Γ and Σ are empty too and so is T . However, the schema that corresponds to T is also empty and it recognizes only empty graphs. As such it is the containment minimal schema over Γ and Σ that contains { ( ) | ∈ ( )} ⊆ {∅}.
Before reducing entailment of statements to query containment, we repeat the definitions of the relevant queries, but in this version we clearly indicate the transformation in question. More precisely, for a transformation , , ∈ Γ , and ∈ Σ we define: Now, we prove that the entailment of ⊤ ⊑ Γ is reduced to query containment.
(2) The proof of this statement is by contradiction and it uses arguments that are analogous to those used in the proof of the above claim, so we only outline it. We take a graph ∈ ( ) such that in ( ) there is a node ( ) with label and an -edge to a node with label . This happens if and only if the intersection of ( ¯ ) and ∃ ¯ ., , ( ¯ , ¯ ) is non-empty.
(3) Similarly, the proof is by contradiction and uses arguments analogous to those in the proof of the first claim, so we only outline it. We take a graph ∈ ( ) such that ( ) ) cannot be an answer to ( , ).
For testing equivalence of two transformations we observe that since a transformation is equivalent to its trimmed version, two transformations 1 and 2 are equivalent modulo if and only if their trimmed versions trim ( 1 ) and trim ( 2 ) are equivalent modulo . In the following lemma, 1 ≡ 2 is short for 1 ⊆ 2 and 2 ⊆ 1 .

Proof. The if part is trivial. We prove the only if part by proving the contrapositive: we show that if one of the conditions (1), (2), and (3) is not satisfied, then 1 ≢ 2 . If (1) is not satisfied, then one of the transformations has at least one rule that generates a node or an edge with a label that is not employed by the other transformation. Since both transformations are trimmed, there exists an input graph such that the rule produces objects on the output. But then 1 ( ) ≠ 2 ( ).
If ( 2) is not satisfied, then there is an input graph such that one of the transformations generates a node that the other does not.Hence, 1 ( ) ≠ 2 ( ).
If (3) is not satisfied, then analogously, there is an input graph such that one of the transformations generates an edge that the other does not.Hence, 1 ( ) ≠ 2 ( ).

C ROLLING UP QUERIES
We next show how to reduce the non-satisfaction of an acyclic UC2RPQ to the satisfaction of a Horn-ALCIF TBox T ¬ .The TBox is basically a recursive program that defines a collection of sets (monadic relations) of nodes.We illustrate this construction with the following example.
Example C.1.We take the following Boolean query.
We construct a TBox that essentially simulates automata for the regular expressions, which are presented in Figure 5.The TBox T ¬ 0 consists of the following constraints.
T ¬ introduces a set of fresh node labels Γ and the satisfaction of T ¬ is defined in terms of the existence of valuations of symbols in Γ . More precisely, given a graph over Γ 0 and Σ 0 and a TBox T over Γ 0 ∪ Γ 1 and Σ 0 , we say that satisfies T if and only if there is an interpretation

Lemma C.2. Given a Boolean acyclic UC2RPQ , one can compute in polynomial time a Horn-ALCIF TBox T ¬ and a reserved set of concept names Γ such that for every that does not use labels in Γ , |= if and only if satisfies T ¬ .

Proof. We prove the lemma for Boolean C2RPQs that are acyclic and connected. The claim extends to unions of Boolean acyclic C2RPQs in a straightforward fashion: it suffices to take the union of the desired TBoxes of all connected components of the union. Consequently, the query can be seen as a tree and we assume that it is defined with the following grammar: where is a two-way regular expression over Σ and Γ. For instance, the query from Example C.1 is represented as 0 = − ( , • * • ). We express the semantics of a query defined in this way as the set of all nodes that satisfy it.
Naturally, a graph satisfies iff [ ] ≠ ∅.Now, fix an acyclic Boolean C2RPQ and let Φ be the set of all two-way regular expressions used in .For any ∈ Φ by = ( , , , ) we denote an -free NDA over the alphabet Σ ∪ Γ that recognizes , where is a finite set of states, ⊆ is the set of initial states, ⊆ is the set of final states, and ⊆ × (Σ∪Γ) × is the transition table.We assume that the size of is polynomial in the size of the expression (such automaton can be obtained for instance with the standard Glushkov technique).We also assume that the sets of states are pair-wise disjoint.
The set of additional node labels consists of the states of automata: Γ = .The constructed TBox consists of two subsets of rules: T ¬ = T 1 ∪ T 0 .The set T 1 encodes transitions of the automata that simulate their execution.
(4) For every ∈ of the root of , T 0 contains ⊑ ⊥; Now, we fix a graph whose node labels do not use any symbol in Γ .We first argue that there is a unique minimal interpretation 0 : Γ → P (dom( )) such that ∪ 0 |= T 1 .Indeed, since the rules are Horn-like, an intersection of two models of T 1 is also a model of T 1 .
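The existence of a unique minimal interpretation for Horn-like rules can be illustrated by naive forward chaining. The sketch below is an assumption-laden simplification and not the TBox T 1 of the proof: it handles only two rule shapes, "a conjunction of labels implies a label at the same node" and "a conjunction of labels implies a label at every successor via a given role", with inverse roles written with a trailing `-`. The rule encoding and the name `least_model` are hypothetical.

```python
def least_model(labels, edges, rules):
    """Saturate node labels under Horn rules and return the least fixpoint.

    labels: dict node -> set of labels initially present
    edges:  set of (source, role, target) triples
    rules:  list of dicts with keys "body" (set of labels), "head" (label)
            and an optional "role"; with a role, the head is propagated to
            all successors via that role (a rule of the shape K ⊑ ∀r.A).
    """
    interp = {v: set(ls) for v, ls in labels.items()}
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for v in interp:
                if not rule["body"] <= interp[v]:
                    continue
                role = rule.get("role")
                if role is None:
                    targets = [v]
                elif role.endswith("-"):
                    targets = [a for (a, r, b) in edges
                               if r == role[:-1] and b == v]
                else:
                    targets = [b for (a, r, b) in edges
                               if r == role and a == v]
                for t in targets:
                    if rule["head"] not in interp[t]:
                        interp[t].add(rule["head"])
                        changed = True
    return interp


labels = {1: {"A"}, 2: {"B"}}
edges = {(1, "r", 2)}
rules = [
    {"body": {"A"}, "head": "q0"},               # A implies q0
    {"body": {"q0"}, "role": "r", "head": "q1"},  # q0 implies q1 at r-successors
    {"body": {"q1", "B"}, "head": "q2"},          # q1 and B imply q2
]
model = least_model(labels, edges, rules)
```

Because rules only ever add labels, the loop reaches the same fixpoint regardless of the order in which rules fire, which is the computational counterpart of closure under intersection.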
Next, we prove the main claim with an inductive argument which requires defining subqueries of .For ∈ Φ and ∈ by we denote the query ( 1 , . . ., ), where 1 , . . ., are children of in and is the two-way regular expression corresponding to the automaton , = ( , , , { }) (essentially, we make the only final state).We claim that for any ∈ Φ, any ∈ , and any ∈ we have In essence, the unary predicate identifies all nodes at which the subquery is satisfied.We prove the above claim with double induction: firstly over the height of the subquery = ( 1 , . . ., ), and secondly, over the length of the witnessing path for ( , . Consequently, is satisfied at a node ∈ iff ∈ 0 for some ∈ {1, . . ., }.As such, is not satisfied at any node of if and only if 0 |= ⊑ ⊥ for every ∈ {1, . . ., } i.e., 0 |= T 0 .We finish the proof by observing that if the minimal model 0 does not satisfy T 0 , then none of supersets of 0 does.
Secondly, the schema • ensures that the original regular expression can be witnessed only by paths that begin and end in nodes with labels in Σ only.

Proof. The construction of • is as in Lemma D.1 and the construction of Boolean RPQs depends on the form of the unary RPQ: • is constructed in the same way.

Proof. Each finite graph falsifying the left-hand side condition falsifies the right-hand side condition as well. For the converse, let be a finite graph falsifying the right-hand side condition. Without loss of generality we can assume that only labels from Γ ∪ Σ are used in . Let ′ be obtained by dropping all nodes without a label, as well as edges incident with these nodes. Because all concept inclusions in T that require a witnessing neighbour specify the label of this neighbour, they are not affected by this modification. Other concept inclusions are always preserved when passing to a subgraph. It follows that ′ conforms to . The RPQs in can only traverse nodes with a label from Γ , so is still satisfied in ′ . Then, is satisfied as well. is not satisfied in ′ , because ′ is a subgraph of .

Lemma D.4. is finitely satisfiable modulo T ∪ T ¬Q iff is satisfiable modulo T ∪ T ¬Q * .

Proof. Suppose that is satisfied in a finite model of T ∪ T ¬ . By Theorem 5.4, there is a (possibly infinite) model of T ∪ T ¬ * containing as a subgraph. This model obviously satisfies .
Conversely, suppose that there is a possibly infinite graph satisfying and T ∪ T ¬ * . Let be the disjunct of that is satisfied in . Let be the image of in , including a finite witnessing path for each RPQ. Note that is finite. By Theorem 5.4, there is a finite model of T ∪ T ¬Q containing as a substructure. This model satisfies as well.
Lemma D.5.Every -driven TBox T can be simplified in polynomial time so that it contains at most |Σ ± | • |Γ | 2 at-most constraints.

Proof. To achieve this, for each such CI of the form ⊑ ∃ ≤1 .′ in T we do one of the following.

Proof. Since all triples in 1 , 1 , . . ., −1 , −1 , are satisfiable, all CIs ⊑ ∃ .+1 and +1 ⊑ ∃ ≤1 − . are relevant for T . We cannot simply apply the fact that T is -driven, because these CIs need not belong to T : they are only entailed by T . The proof will proceed in several steps.
The first step is to see that each contains a label from Γ .Towards contradiction, suppose it does not.We construct a graph witnessing that T does not entail ⊑ ∃ .+1 , which is a contradiction.Let be the tree-shaped graph obtained by unravelling some model of T witnessing that ( , , +1 ) is satisfiable, from a node satisfying .Clearly, is also a model of T , its root satisfies and has an -successor ′ satisfying +1 .We construct as the graph with a single node 0 whose labels are copied from the root of but with any letter from Γ dropped.To see that |= ⊑ ∃ .+1 , note that as ∈ ( ) and contains no labels from Γ , also 0 ∈ ( ) ; but clearly 0 has no -successors at all.Let us check that |= T .
• New CIs of the form ⊑ are not introduced by reversing cycles, so it suffices to look at ones from T ∪ T ¬ .There, such CIs are only present in T ¬ and always satisfy ∉ Γ (see the proof of Lemma C.2). Hence, as they were satisfied in and was obtained by dropping labels from , they still hold in .• CIs of the form ⊑ ⊥ in T were satisfied in and they cannot be violated by dropping labels (recall that does not use negation).• All CIs of the forms ⊑ ∀ .′ , ⊑ .′ , and ⊑ ∃ ≤1 .are trivially satisfied in .
• Consider a CI of the form ⊑ ∃ .′ from T .Suppose that 0 ∈ . Then also ∈ .This means that the CI was "fired" in , which implies that ( , , ′ ) is satisfiable modulo T and ⊑ ∃ .′ is relevant for T .As T is -driven, it follows in particular that contains a label from Γ .But this contradicts the fact that 0 ∈ .Hence, ⊑ ∃ .′ is trivially satisfied in .Thus we have shown that |= T .This concludes the first step.Now, as all contain a label from Γ and all triples ( , , +1 ) are satisfiable modulo T , it follows that for each there exists exactly one label ∈ Γ such that ∈ .It remains to show that ⊑ ∃ .+1 and +1 ⊑ ∃ ≤1 − . .Let us begin with ⊑ ∃ .+1 .Consider graph obtained from (same as above) by removing all subtrees rooted atsuccessors of the root that satisfy +1 .Clearly, |= ⊑ ∃ .+1 .As T |= ⊑ ∃ .+1 , it follows that |= T .Then, some CI of the form ⊑ ∃ .′ from T is violated in , because CIs of other forms are preserved when passing to a subgraph.In particular, it must be the case that the root of satisfies .But then also the root of satisfies and since |= T , the root of has an -successor ′ that satisfies ′ .This means that ⊑ ∃ .′ is relevant for T .Because T is -driven, it must contain ⊑ ∃ .′ for some , ′ ∈ Γ such that ∈ , ′ ∈ ′ .As the root of satisfies both and , and we know that ∈ and ∈ and that labels from Γ are exclusive, it follows that = .We claim that also = and ′ = +1 .If ≠ , then ′ is not an -successor of the root in , and it has not been removed in .That would imply that actually does satisfy ⊑ ∃ .′ .Since we know this is not the case, we conclude that = .Similarly, suppose that ′ ≠ +1 .Because ′ satisfies ′ and ′ ∈ ′ , it must have label ′ .But then ′ cannot have label +1 , which means it cannot satisfy +1 , and has not been removed in .This yields a contradiction just like before and we can conclude that ′ = +1 .Wrapping up, we have seen that ⊑ ∃ .′ belongs to T and that = , = , and ′ = +1 .This means that ⊑ ∃ .+1 belongs to T .
Finally, let us see that +1 ⊑ ∃ ≤1 − .belongs to T .Consider the model but reorganize it so that the root satisfies +1 and has an − -successor ′ satisfying .Let be the graph obtained from by duplicating the whole subtree rooted at ′ , and adding an − -edge from to the root ′′ of the copy.Clearly |= +1 ⊑ ∃ ≤1 − .and since T |= +1 ⊑ ∃ ≤1 − ., we conclude that |= T .It follows immediately that violates some CI of the form ⊑ ∃ ≤1 .′ from T , as CIs of other forms are not affected by the modification turning to .Similarly, it must hold that = − , and that satisfies and ′ and ′′ satisfy ′ .It follows that ⊑ ∃ ≤1 .′ is relevant, +1 ∈ , ∈ ′ , and +1 ⊑ ∃ ≤1 − .belongs to T .
Lemma D.7.For T = T ∪T ¬ , the completion T * can be computed in EXPTIME.

Proof. Construct a graph T over all possible intersections of concept names used in T , including an edge with label ∈ Σ ± from to ′ iff T |= ⊑ ∃ .′ and T |= ′ ⊑ ∃ ≤1 − . .
T has exponential size and can be constructed in EXPTIME, because CI entailment by Horn-ALCIF TBoxes can be tested in exponential time [26].Repeat the following until the graph stops changing.Pick an -edge from to ′ such that there is no −edge from ′ to .Check if there exists a path from ′ to in T .If so, the identified path combined with the -edge from to ′ constitutes a finmod cycle 1 , 1 , . . ., −1 , −1 , in T .Add to T an − -edge from +1 to for all < and extend T with the corresponding concept inclusions.Note that this includes an − -edge from ′ to and concept inclusions ′ ⊑ ∃ − .and T |= ⊑ ∃ ≤1 .′ .
Moreover, if there are unique 1 , 2 , . . ., ∈ Γ such that ∈ for ≤ , check if 1 , 1 , . . ., −1 , −1 , is a cycle in T .If so, add to an − edge from +1 to , and the corresponding CIs to T .By Lemma D.6, this ensures that the extended T is -driven.We can now reduce it and recompute T based on the updated T .Using the complexity bounds for CI entailment given in Corollary E.7, we conclude that this can be done in EXPTIME.Note that we are indeed relying on the more precise complexity bounds here, because at later iterations of the cycle reversing procedure the TBox might well contain exponentially many concept inclusions.However, it has still only the original concept names and, after reducing, only a polynomial number of at-most restrictions.
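The graph-level part of the cycle reversing loop can be sketched as follows. This is a deliberately reduced illustration: it works on a fixed edge set and only adds reversed edges, whereas the procedure above also extends the TBox, reduces it, and recomputes the graph through CI entailment after each step. Edges are triples `(source, role, target)`, inverse roles carry a trailing `-`, and the function names are hypothetical.

```python
from collections import deque


def inverse(role):
    """Toggle the trailing '-' that marks an inverse role."""
    return role[:-1] if role.endswith("-") else role + "-"


def find_path(edges, source, target):
    """Breadth-first search; return a list of edges from source to target, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while parent[u] is not None:
                edge = parent[u]
                path.append(edge)
                u = edge[0]
            return path[::-1]
        for edge in edges:
            if edge[0] == u and edge[2] not in parent:
                parent[edge[2]] = edge
                queue.append(edge[2])
    return None


def reverse_cycles(edges):
    """Add the reverse of every edge lying on a cycle, until nothing changes."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        for (u, r, v) in sorted(edges):
            if (v, inverse(r), u) in edges:
                continue  # already reversed
            path = find_path(edges, v, u)
            if path is None:
                continue  # the edge is not on any cycle
            for (a, s, b) in [(u, r, v)] + path:
                edges.add((b, inverse(s), a))
            changed = True
            break
    return edges


cycle = {("a", "r", "b"), ("b", "s", "c"), ("c", "t", "a")}
completed = reverse_cycles(cycle)
```

Each round adds at least one edge that was not present before, and the set of possible edges is finite, so the loop terminates.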

E PROOFS FOR SATISFIABILITY

E.1 Introductory lemmas
We begin by showing the two lemmas mentioned in the body of the paper.
Lemma E.1. For k ≥ 1, if a finite connected k-sparse graph has only nodes of degree at least 2, then it is a (2k, 3k)-skeleton.

Proof. Let G be a finite connected k-sparse graph without nodes of degree 0 or 1. We claim that G consists of at most 2k nodes connected by at most 3k paths disjoint modulo endpoints. If G is empty, we are done. Otherwise, we eliminate vertices of degree 2 that are incident with two different edges by merging these edges into a single edge. This process results in a k-sparse multigraph G 0 , whose edges represent simple paths in G. This graph is either a single node with a loop or all its nodes have degree at least 3. In the first case it follows that G is a single cycle, and thus a (1, 1)-skeleton. In the second case, assuming that G 0 has n nodes and m edges, the sum of degrees is 2m and is at least 3n, so we have 3n/2 ≤ m ≤ n + k. It follows that k > 0, n ≤ 2k, and m ≤ n + k ≤ 3k.

Lemma E.2. If is satisfied in a | |-sparse graph , then contains a (4| |, 5| |)-skeleton , extending the skeleton of , such that all variables of are mapped to distinguished nodes of and can be obtained by attaching finitely many finitely branching trees to .

Proof. The skeleton 0 of is a (2| |, 3| |)-skeleton. Consider a match of in . Some variables of might well be matched to nodes on the paths connecting the distinguished nodes of 0 or in the attached trees. We define as follows. First, we add to as distinguished nodes all images of variables of that lie on the paths connecting distinguished nodes of 0 . Next, for each attached tree that contains an image of a variable of , we add to as distinguished nodes all the images of variables of that belong to together with all their least common ancestors in , as well as the node of to which the root of is connected. All ancestors (in ) of these nodes are added to as ordinary nodes. The skeleton thus obtained has the required properties.

E.2 The main result
The goal of this section is to prove the following theorem. The proof of Theorem E.3 is not very hard, but it combines several components and requires developing some machinery. Let us begin with a road map.
Relying on Lemma E.2, we guess a (4| |, 5| |)-skeleton . The distinguished nodes of are represented explicitly, together with all their labels, but each of the connecting paths is represented by a single symbolic edge. Note that there might be multiple symbolic edges between the same pair of distinguished nodes, representing different paths. We need to check that can be completed to a graph by materializing the symbolic edges into paths and attaching finitely many finitely branching trees in such a way that is a model of T and there is a match of in that maps variables of to distinguished nodes of .
To achieve this, we guess an annotation of that summarizes how the witnessing paths of can traverse the parts of missing from , and which witnesses of distinguished nodes required by T these parts provide (Section E.3). We then check if these promises of the annotation are sufficient to guarantee that and T are satisfied (Section E.4). Finally, we verify that the promises of the annotation can be fulfilled: we check if we can attach trees to the distinguished nodes and expand the symbolic edges into finite paths with attached trees in a way that matches the promises of the annotation and respects the TBox T (Section E.5).

E.3 Annotated skeleta
Let Γ , Σ , Γ T , Σ T be the sets of edge and node labels used in and T , respectively. In what follows we only consider graphs and skeleta using only node labels from Γ ∪ Γ T and edge labels from Σ ∪ Σ T .
Let Φ be the set of two-way regular expressions used in . For each ∈ Φ we fix an equivalent linear-size nondeterministic automaton A over the alphabet Γ ∪ Σ ± with states , initial states ⊆ , and final states ⊆ . We assume that all are pairwise disjoint and let = ∈Φ . An annotation of skeleton is given by the following functions.
• src and tgt record information about the source and target of the paths represented by each symbolic edge: they both map each symbolic edge to Σ ∪ Σ T ± × 2 Γ ∪Γ T .
• node records how the witnessing paths for may loop in the subtrees attached to the distinguished nodes. Thus, node maps every distinguished node to a subset of ∈Φ × .
• edge records how the witnessing paths for progress along paths (and the trees attached to them) represented by the symbolic edges in the skeleton. Thus, edge maps every edge to a subset of . If is an edge from to , then ( , ′ , →) ∈ edge ( ) indicates that some path enters (the part of the model summarized by) the edge from in state , and exits at node in state ′ . Similarly, ( , ′ , ) ∈ edge ( ) indicates a loop: some path enters from in state , and exits at the same node in state ′ , etc.

E.4 Verifying annotated skeleta
An annotation of is sufficient for TBox T if the witnesses recorded by src and tgt respect T ; that is, for each distinguished node of the graph defined below satisfies the TBox T 0 obtained from T by dropping all concept inclusions of the form ⊑ ∃ . . To construct we begin from with labels inherited from , and then for each symbolic edge incident with we add an -successor of with label set Λ, where ( , Λ) = src ( ) if is the source of and ( , Λ) = tgt ( ) if is the target of .
An annotation is sufficient for C2RPQ if there exists a function mapping variables of to distinguished nodes of such that for each atom ( , ) of , there exists a finite witnessing sequence 0 0 1 1 . . . of states and distinguished nodes of satisfying the following conditions.
Proposition E.4. One can decide if a given annotated skeleton is sufficient for and T in PTIME.

Proof. To check that the annotated skeleton is sufficient for T it is enough to examine the graphs for each distinguished node of the skeleton.
Checking that the annotated skeleton is sufficient for amounts to guessing the function and for each atom ( , ) running a reachability test in the product graph whose nodes combine distinguished nodes of the skeleton with states from , where edges are defined according to the symbolic edges in the skeleton and the triples from edge . In the reachability test we check if there exists a path beginning in { ( )} × and ending in { ( )} × .
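The reachability test in the product graph can be sketched as a breadth-first search over pairs (distinguished node, automaton state). This is a simplified illustration, not the procedure of the paper: the direction tags `"fwd"`, `"bwd"`, `"loop_src"` and `"loop_tgt"`, the dictionary encoding of the annotations, and the function names are all assumptions made for the sketch, and automaton transitions that read labels directly at distinguished nodes are omitted.

```python
from collections import deque


def product_successors(node, state, edges, edge_ann, node_ann):
    """Yield the (node, state) pairs reachable in one step of the product graph.

    edges:    dict edge_id -> (source, target), the symbolic edges (assumed encoding)
    edge_ann: dict edge_id -> set of (q, q2, direction) triples
    node_ann: dict node -> set of (q, q2) pairs, loops through the attached trees
    """
    for q, q2 in node_ann.get(node, ()):
        if q == state:
            yield (node, q2)
    for edge_id, (source, target) in edges.items():
        for q, q2, direction in edge_ann.get(edge_id, ()):
            if q != state:
                continue
            if direction == "fwd" and source == node:
                yield (target, q2)
            elif direction == "bwd" and target == node:
                yield (source, q2)
            elif direction == "loop_src" and source == node:
                yield (source, q2)
            elif direction == "loop_tgt" and target == node:
                yield (target, q2)


def atom_is_witnessed(start, end, initial, final, edges, edge_ann, node_ann):
    """Breadth-first search from {start} x initial to {end} x final."""
    queue = deque((start, q) for q in initial)
    seen = set(queue)
    while queue:
        node, state = queue.popleft()
        if node == end and state in final:
            return True
        for nxt in product_successors(node, state, edges, edge_ann, node_ann):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Toy skeleton: one symbolic edge from u to v, plus a loop promised at v.
toy_edges = {"e": ("u", "v")}
toy_edge_ann = {"e": {(0, 1, "fwd")}}
toy_node_ann = {"v": {(1, 2)}}
print(atom_is_witnessed("u", "v", {0}, {2}, toy_edges, toy_edge_ann, toy_node_ann))
```

The search visits each product node at most once, so for a fixed assignment of variables to distinguished nodes its cost is polynomial in the size of the skeleton and the number of automaton states.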

E.5 Implementing annotated skeleta
Consider an annotated skeleton H = , src , tgt , edge , node . We say that a graph implements H if is obtained from by replacing each symbolic edge with a path connecting the endpoints of and by attaching finitely many finitely branching trees in a way consistent with the annotations, in the following sense.
• For each symbolic edge from to ′ , the subgraph of that consists of and all trees attached to the internal nodes of is correctly summarized in the annotations:
  - for each ( , ′ , ) ∈ edge ( ) with , ′ ∈ there is a path in with endpoints ( , ) if = , ( , ′ ) if = → , ( ′ , ′ ) if = , and ( ′ , ) if = ← , on which A moves from state to state ′ ;
  - if src ( ) = ( 1 , Λ 1 ) and tgt ( ) = ( 2 , Λ 2 ), then the first edge of is an -edge, the last edge of is an − 2 -edge, the second node on has label set Λ 1 , and the penultimate node on has label set Λ 2 .
• For each distinguished node , the trees attached to are summarized correctly in the annotations: for each ( , ′ ) ∈ node ( ) with , ′ ∈ there is a tree , ′ attached to and a path that starts and ends in and otherwise only visits nodes of , ′ , on which A moves from state to state ′ .
• is a model of T .
Note that all the missing pieces of the graph are essentially trees (finitely branching, but typically infinite). Indeed, each , ′ simply is a tree, but also can be viewed as a tree: its root is the source of , the root has exactly one child, the path constitutes a special finite branch ending in the target of which is a leaf in this tree. Importantly, each ( , ′ ) ∈ node ( ) is witnessed by a finite subgraph of , ′ , and each triple ( , ′ , ) ∈ edge ( ) is witnessed by a finite subgraph of . The algorithm to check if there exist such , ′ and can be seen as an emptiness test for a tree automaton, or as a variant of type elimination.
We first define types, which can also be viewed as states of a tree automaton. We assign to each node of the tree a type that records the following information:
• a subset of Γ ∪ Γ T , representing the labels of the current node;
• an element of Σ ± ∪ Σ ± T and a subset of Γ ∪ Γ T , representing the label on the edge to the parent and the parent's label set;
• with ℓ the number of at-most restrictions in T , a list of ≤ ℓ +1 elements of Σ ± ∪Σ ± T and subsets of Γ ∪Γ T , representing labels on the edges to children of the current node and the children's label sets;
• a Boolean flag indicating whether the current node belongs to the special path (not used for , ′ at all);
• a subset of ∈Φ × ×{ , , ↓, ↑} recording the progress on witnessing edge or node :
  - ( , ′ , ) indicates that from state in the current node we can navigate the current subtree and return to the current node in state ′ ,
  - ( , ′ , ) indicates that from state in the current node we can navigate outside of the current subtree and return to the current node in state ′ ,
  - ( , ′ , ↓) indicates that from state in the current node, we can reach the target node of in state ′ ,
  - ( , ′ , ↑) indicates that from state in target node of we can reach the current node in state ′ .
All four kinds of triples are required along the special path, but in the remaining nodes we only need the triples of the form ( , ′ , ). By a pre-type we shall understand a type with the Boolean flag and the progress information dropped; that is, a tuple (Λ, ′ , Λ ′ , 1 , Λ 1 , . . ., , Λ ). In what follows we blur the distinction between conjunctions of concept names and sets Λ of labels, as usual, and write ⊆ Λ.
Lemma E.5. Given T and one can compute the set of pre-types compatible with T within the time bound stated in Theorem E.3.
A pre-type (Λ, ′ , Λ ′ , 1 , Λ 1 , . . ., , Λ ) is said to be compatible with T modulo a set Θ of pre-types if
• Θ contains a pre-type (Λ , − , Λ, . . . ) for each ≤ ;
• the pre-type satisfies all CIs in T not of the form ⊑ ∃ .′ ;
• for each concept inclusion ⊑ ∃ .′ in T with ⊆ Λ, at least one of the following holds:
  - = ′ and ′ ⊆ Λ ′ , or
  - = and ′ ⊆ Λ for some 1 ≤ ≤ , or
  - = 0 and ′ ⊆ Λ 0 for some repeatable (Λ 0 , − 0 , Λ, . . . ) from Θ.
Now, to compute the set of pre-types compatible with T , we start with the set Θ = Θ 0 of all pre-types, and exhaustively remove those pre-types that are not compatible with T modulo Θ. This algorithm terminates after at most |Θ 0 | iterations, since each iteration removes at least one pre-type. The result is the maximum set Θ of pre-types such that each pre-type from Θ is compatible with T modulo Θ. Each pre-type compatible with T will belong to this set, because the graph witnessing the pre-type can be used to argue that the pre-type will not be removed at any iteration. Conversely, each pre-type from Θ is compatible with T , because one can construct a witnessing tree-shaped graph top-down, using the witnesses justifying the presence of pre-types in Θ in the last iteration of the algorithm.
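The elimination procedure above is an instance of a greatest-fixpoint computation, which can be sketched generically as follows. The sketch abstracts away the actual compatibility conditions: `compatible_modulo` is a hypothetical callback standing in for "compatible with T modulo Θ", assumed to be monotone in its second argument, and the toy requirement table is invented purely for illustration.

```python
def eliminate(candidates, compatible_modulo):
    """Return the largest subset theta of candidates in which every element
    is compatible modulo theta itself.

    compatible_modulo(t, theta) is assumed monotone: if it holds for some
    theta, it also holds for every superset of theta.
    """
    theta = set(candidates)
    while True:
        rejected = {t for t in theta if not compatible_modulo(t, theta)}
        if not rejected:
            return theta
        theta -= rejected  # each round removes at least one candidate


# Toy stand-in for compatibility: a candidate survives only while all the
# candidates it requires (think of the required children) are still present.
requirements = {
    "A": {"B"},
    "B": set(),
    "C": {"D"},
    "D": {"C"},  # C and D justify each other, as in an infinite branch
    "E": {"F"},  # F is not a candidate at all, so E is removed
    "G": {"E"},  # G falls one round after E
}


def toy_compatible(name, theta):
    return requirements[name] <= theta


surviving = eliminate(requirements, toy_compatible)
print(sorted(surviving))
```

Removing all incompatible candidates in one round, rather than one at a time, yields the same final set under the monotonicity assumption. The mutually dependent pair shows why the result is a greatest rather than a least fixpoint: such candidates are kept, which matches the use of possibly infinite trees as witnesses.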
Lemma E.6. The existence of a graph implementing a given annotated skeleton is decidable within the time bound from Theorem E.3.

Proof. We call a type (Λ, ′ , Λ ′ , 1 , Λ 1 , . . ., , Λ , , Δ) compatible with T if the underlying pre-type (Λ, ′ , Λ ′ , 1 , Λ 1 , . . ., , Λ ) is compatible with T . Repeatable types are defined analogously, based on the underlying pre-types. Clearly, Lemma E.5 suffices to precompute the set of types compatible with T . Our task is to check if from these types one can construct the witnessing and , ′ . We will build them bottom-up, guaranteeing that each promise related to is fulfilled in a finite fragment.
Each iteration takes time polynomial in |Θ| ℓ and |T |.The promised complexity bounds follow.
Deciding the existence of the witnessing trees for a node of the annotated skeleton is very similar. We can reuse the set Θ computed for any symbolic edge . The only delicate issue is that we need to account for src ( ′ ) for all edges ′ outgoing from and tgt ( ′′ ) for all edges ′′ incoming to . Essentially, we check if there exists a type (Λ, 1 , Λ 1 , . . ., , , Δ) (note the missing ′ and Λ ′ ) with = 0 and ≤ ℓ + deg( ), compatible with T and compatible with modulo Θ, except that for = 1, 2, . . ., deg( ), the components , Λ must be as specified by src ( ′ ) and tgt ( ′′ ) for outgoing ′ and incoming ′′ , and their corresponding types must be (Λ , − , Λ, 0, ∅), not required to belong to Θ. This can be done in time polynomial in |Θ| ℓ , T , and H .

Corollary E.7. Unrestricted entailment of concept inclusions by an ALCIF TBox T using concept names and ℓ at-most constraints can be decided in time poly(|T |) • 2 poly( ,ℓ) .

Proof. The result holds in full generality, but we only sketch the arguments for the two kinds of concept inclusions we need to compute the completion. For existential constraints, note that where and ′ are fresh concept names. For at-most constraints,

We present a reduction of the acceptance problem of an alternating Turing machine with a polynomial bound on space. We begin by defining a special variant of alternating Turing machines. We also present a number of conceptual tools used in the reduction.
Alternating Turing machines. We consider a variant of alternating Turing machine with the following particularities:
• there is a single distinguished initial state that the machine never reenters;
• there are two special states yes and no that are final (no transition allowed to follow) 1 ;
• the transition table has exactly two transitions for any nonfinal state and any symbol;
• there exist three special symbols: for empty tape space, ⊲ for left tape boundary, and ⊳ for right tape boundary; we only assume that the input word does not use those symbols and the transition table handles the boundary symbols appropriately.
It is relatively easy to see that any alternating Turing machine with polynomially bounded space can be converted to the variant above.
1 The state no is not necessary for the purposes of our reduction, but we include it for the sake of completeness of this variant of ATM.
Formally, an alternating Turing machine (ATM) is a tuple = ( , , 0 , 1 , 2 ), where is a finite alphabet, is a finite set of states with two distinguished final states yes and no and partitioned into three pairwise disjoint subsets = ∀ ∪ ∃ ∪{ yes , no }, 0 ∈ is a distinguished initial state, and : ( \ { yes , no }) × → ( \ { 0 }) × × {−1, +1} are two transition tables such that ( , ) = ( ′ , , ) satisfies the following two conditions: (1)
We consider ATMs with polynomially bounded space, a class of Turing machines that defines the class ASPACE known to coincide with EXPTIME. Recall that a binary tree is a finite prefix-closed subset ⊆ {1, 2} * and a labeled tree is a function that assigns a label to every element (node) of a tree.
Given an ATM and a polynomial poly( ), a run of w.r.t. poly on an input ∈ (Σ \ {⊲, ⊳, }) * is a binary tree whose nodes are labeled with configurations of such that: (1) the root node is labeled with ( ) = ⊲• 0

Reduction outline. We present a reduction of the problem of word acceptance by an ATM with polynomial bound on space to the complement of the problem of containment of Boolean 2RPQs in the presence of schema. We point out that the class of ASPACE-complete problems is closed under complement, and consequently, this reduction proves that the query containment problem is EXPTIME-hard.
More precisely, for an ATM , whose space is bounded by poly( ), and an input word we construct a schema and two Boolean 2RPQs and such that ( ) = yes iff iff ∃ ∈ ( ). |= ∧ |= .
In the sequel, we refer to as the positive query and to as the negative query. Naturally, we present a reduction that is polynomial, i.e., the combined size of , , and is bounded by a polynomial in the size of and .
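To make the machine variant used in the reduction more concrete, the following sketch evaluates acceptance of a space-bounded ATM with two transition tables. It relies on the standard acceptance condition (an existential configuration accepts if some successor accepts, a universal one if both do), computed as a least fixpoint over the reachable configurations. The encoding is an assumption of the sketch: configurations are triples (state, head position, tape), the ASCII characters `<` and `>` stand in for the boundary symbols, and the function names and toy machines are hypothetical.

```python
def successors(config, delta1, delta2):
    """Apply both transition tables to a configuration (state, head, tape)."""
    state, head, tape = config
    result = []
    for delta in (delta1, delta2):
        new_state, written, move = delta[(state, tape[head])]
        new_tape = tape[:head] + (written,) + tape[head + 1:]
        result.append((new_state, head + move, new_tape))
    return result


def accepts(initial, delta1, delta2, universal, yes="yes", no="no"):
    """Decide acceptance; every nonfinal state outside `universal` is existential.

    The tables are assumed to keep the head between the boundary markers,
    so the set of reachable configurations is finite.
    """
    succ = {}
    stack = [initial]
    while stack:  # collect all reachable configurations
        config = stack.pop()
        if config in succ:
            continue
        succ[config] = [] if config[0] in (yes, no) else successors(config, delta1, delta2)
        stack.extend(succ[config])
    accepted = {c for c in succ if c[0] == yes}
    changed = True
    while changed:  # least fixpoint: only finite accepting run trees count
        changed = False
        for config in set(succ) - accepted:
            if not succ[config]:
                continue  # final state no
            verdicts = [s in accepted for s in succ[config]]
            ok = all(verdicts) if config[0] in universal else any(verdicts)
            if ok:
                accepted.add(config)
                changed = True
    return initial in accepted


tape = ("<", "a", ">")
start = ("q0", 1, tape)
# One branch rejects and the other accepts.
branch1 = {("q0", "a"): ("no", "a", +1)}
branch2 = {("q0", "a"): ("yes", "a", +1)}
print(accepts(start, branch1, branch2, universal=set()))
print(accepts(start, branch1, branch2, universal={"q0"}))
```

Because the space bound makes the configuration graph finite (of at most exponential size), the fixpoint iteration terminates; a machine that cycles forever without reaching yes is rejected, since no finite accepting run tree exists for it.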

Figure 1: Evolving schema of a medical knowledge graph.

Example 4.1. Below we present rules defining the transformation 0 of the medical database described in Example 1.1. We use three unary node constructors: ( ) for Antigen nodes, ( ) for Pathogen nodes, and ( ) for Vaccine nodes.

Example 4.3. A couple of examples of the above queries for the transformation 0 in Example 4.1 follow.

Figure 2: Query containment over finite and infinite graphs.

Theorem 5.4 (Ibáñez-García et al., 2014). A Horn-ALCIF TBox T has a finite model containing a finite subgraph iff its completion T * has a possibly infinite model containing .

Example 5.5. Schema from Example 5.2 is equivalent to TBox T that consists of

Theorem 6.3. A connected C2RPQ is satisfiable in a possibly infinite model of an ALCIF TBox T iff is satisfiable in a possibly infinite | |-sparse model of T .

Proposition B.1. For any schema and for any graph , conforms to if and only if |= T , |= ⊤ ⊑ Γ , and |= ⊓ ⊑ ⊥ for any , ∈ Γ .

Proof. Straightforward since the ALCIF formulas are translations of the conditions of conformance of a graph to a schema.

We use the above result to reduce type checking to testing entailment of simple ALCIF statements. Recall that for a schema and a transformation we define the entailment relation ( , ) |= ⊑ ′ as ( ) |= ⊑ ′ for every ∈ ( ).

Lemma B.2. Given two schemas and ′ and a transformation , { ( ) | ∈ ( )} ⊆ ( ′ ) if and only if ( , ) |= ⊤ ⊑ Γ and ( , ) |= T ′ .

Proof. Immediate consequence of Proposition B.1 and the fact that transformations must use a single dedicated node constructor for each node label. This ensures that ( , ) |= ⊓ ⊑ ⊥ holds for any , ∈ Γ ′ .

Theorem E.3. Given a C2RPQ and an ALCIF TBox T using concept names and ℓ at-most constraints, one can decide in time poly(|T |) • 2 poly( | |, ,ℓ) if there exists a | |-sparse graph that satisfies and T .
iterations. Each iteration takes time polynomial in |Θ| and |T |.