Complexity and Expressiveness of ShEx for RDF

We study the expressiveness and complexity of Shape Expression Schemas (ShEx), a novel schema formalism for RDF currently under development by W3C. ShEx assigns types to the nodes of an RDF graph and constrains the admissible neighborhoods of nodes of a given type with regular bag expressions (RBEs). We formalize and investigate two alternative semantics, multi-type and single-type, depending on whether or not a node may have more than one type. We study the expressive power of ShEx and the complexity of the validation problem. We show that the single-type semantics is strictly more expressive than the multi-type semantics, that single-type validation is generally intractable, and that multi-type validation is feasible for a small (yet practical) subclass of RBEs. To curb the high computational complexity of validation, we propose a natural notion of determinism and show that multi-type validation for the class of deterministic schemas using single-occurrence regular bag expressions (SORBEs) is tractable.


Introduction
Schemas have a number of important functions in databases. They describe the structure of the database, and knowledge of the schema is essential to any user trying to formulate and execute a query or an update over a database instance. Typically, schemas allow for efficient algorithms for validating the conformance of a given database instance. Schemas also capture the intended meaning of the data stored in database instances and are important for static analysis tasks such as query optimization.
Relational and XML databases have a number of well-established and widely accepted schema formalisms, e.g., the SQL Data Definition Language for relational databases, and W3C XML Schema and RELAX NG for XML databases. One of the reasons why the RDF data model was schema-free at its conception was to promote its use and ensure its widespread adoption. Indeed, a number of existing RDF applications, such as the linked open data initiative, could not have had the same success if the published data had to comply with a rigid schema. However, RDF is slowly but surely becoming an independent database model [1], with applications that were previously considered only in the context of relational and XML databases, for instance data exchange [16,2]. Such classical applications often rely on the conformance of the data with a set of constraints. Not only has the need for a 'declarative definition of the structure of a graph for validation and description' been clearly identified [39], but we also currently see an emergence of approaches to address this need: apart from the existing but somewhat inadequate RDF Schema (RDFS) [9], we have OSLC Resource Shapes (ResSh) [31], integrity constraints expressed in OWL [34], and SPARQL queries generated with SPIN templates [6]. In this paper, we investigate Shape Expression Schemas (ShEx) [29,24], a novel schema formalism for RDF currently under development by W3C [40].
A ShEx defines a set of types that impose structural constraints on nodes and their immediate neighborhood. Figure 1 presents a simple example of a Shape Expression Schema for an RDF database storing bug reports. Essentially, the schema says that a bug report has a description, the user who reported it, and the date of the report. Optionally, a bug report may also have an employee who successfully reproduced the bug, together with the date of the reproduction. Finally, a bug report can have a number of related bug reports. For every user we wish to store his/her name and optionally an email address. For an employee we wish to store his/her name, either as one string or split into the first and last name, and an email address.
A Shape Expression Schema defines a set of types to be associated with graph nodes. Each type defines the admissible collection of outgoing edges and the types of the nodes they lead to. Naturally, such a schema bears a strong resemblance to RELAX NG and DTDs, which also use regular expressions to describe the allowed collections of children of an XML node. The most important difference comes from the fact that in XML the children of a node are ordered, and the regular expressions in DTDs and RELAX NG schemas define admissible sequences of children, whereas for RDF graphs no order on the neighborhood of a given node can be assumed. As the regular expressions used in ShEx define bags (multisets) of symbols rather than sequences, we call them regular bag expressions (RBEs).
The semantics of Shape Expression Schemas is quite natural. An RDF graph is valid if it is possible to assign types to the nodes of the graph in a manner that satisfies all shape expressions of the schema. A natural question arises: can a node be assigned more than one type? In most applications the multi-type semantics, which permits assigning multiple types to a node, seems to be more natural. For instance, the RDF graph in Figure 1 requires assigning both the type User and the type Employee to the node $\mathit{emp}_1$ because $\mathit{emp}_1$ has reported $\mathit{bug}_2$ and reproduced $\mathit{bug}_1$. However, there are applications where the single-type semantics may be more suitable, e.g., when modeling graph transformations that visit every node exactly once.
The requirement of exactly one type per node makes the single-type semantics more restrictive, and therefore, capable of defining more refined families of graphs than the multi-type semantics. We show that the single-type semantics is in fact strictly more powerful than the multi-type semantics. We also show that both semantics are closed under intersection but neither is closed under union or complement. We then compare the expressive power of ShEx with two standard yardstick logics for graphs: first-order logic ($\mathrm{FO}_G$) and existential monadic second-order logic ($\exists\mathrm{MSO}_G$). ShEx is not comparable with $\mathrm{FO}_G$, but if the RBEs use the Kleene closure on symbols only (e.g., $a^*$), then ShEx using such expressions is captured by $\exists\mathrm{MSO}_G$. Finally, we compare the expressive power of ShEx with graph grammars, which are generally incomparable, and graph acceptors/automata, which are typically less expressive.
Next, we study the complexity of the validation problem, i.e., checking if a given graph has a valid typing w.r.t. the given schema. Naturally, this problem comes in two flavors, depending on the chosen semantics, and we show significant computational differences between them. While validation for both semantics is generally intractable, the multi-type semantics admits tractable validation for a subclass $\mathrm{RBE}_0$ of disjunction-free expressions that use the Kleene closure on symbols only. This fragment of RBEs is quite practical as it can, for instance, very easily capture the topology of RDF graphs obtained by exporting relational databases in virtually any of the proposed approaches for this task (for a survey, see [33]). More interestingly, however, we show that the complexity of multi-type validation for ShEx using a class $C$ of RBEs is closely related (Turing reducible) to the complexity of the satisfiability problem for $C$ with intersection. We show that in general this problem is NP-complete, which stands in contrast with its analogue for regular word expressions, known to be PSPACE-complete [23].
To lower the complexity of validation, we introduce the notion of determinism. Essentially, determinism requires that every shape expression uses at most one type with every label. The shape expressions in Figure 1 are deterministic but the following shape expression is not.

<BugReport> {
  descr xsd:string,
  (reportedBy @<User> | reportedBy @<Employee>),
  reportedOn xsd:dateTime,
  (reproducedBy @<Employee>, reproducedOn xsd:dateTime)?,
  related @<BugReport>*
}

This shape expression is not deterministic because reportedBy is used with two types: User and Employee. For deterministic shape expression schemas, we are able to relate the complexity of multi-type validation to the problem of checking membership of a bag of symbols in the language of an RBE. While this problem is known to be NP-complete [22], it is generally simpler than the satisfiability problem, and a number of tractable subclasses have already been identified [7,22]. All known tractable classes of RBEs require the expressions to be single-occurrence, i.e., every symbol of the alphabet is used at most once in an RBE. In the present paper, we show that the full class of single-occurrence regular bag expressions (SORBE) has in fact tractable membership. Finally, we consider the problem of validating only a fragment of a graph with preassigned types for its root nodes. We argue that for deterministic ShEx using SORBEs, multi-type validation can be performed efficiently, and show that single-type validation can be performed with a single pass over the graph.
While checking that the data values satisfy constraints is an important part of database validation, in our study we focus only on the core capacity of Shape Expression Schemas to define the graph topology, and therefore, ignore data values. While the exact array of types of value constraints of ShEx is yet to be elaborated, our results identify important computational obstacles that arise from the topology-shaping properties of ShEx alone. Furthermore, we present tractable algorithms that can serve as a basis for RDF validation with a number of data value constraints. Our methodology can be compared to using automata to model schema languages for XML databases: while virtually all schema languages for XML allow one to define constraints on data values, from simple domain checks in DTDs to key constraints in XML Schema, the results on automata often translate directly to results for schema languages for XML.
The main contributions of the present paper are:
1. We formalize two alternative semantics for ShEx, multi-type and single-type, depending on whether or not a node may have more than one type.
2. We study the expressive power and basic properties of ShEx.
3. We provide a comprehensive understanding of the complexity of validation and show a very close relationship to the complexity of the problem of satisfiability of regular bag expressions with intersection. We show that multi-type validation is tractable for the practical subclass $\mathrm{RBE}_0$.
4. We propose a notion of determinism for ShEx that curbs the complexity of validation and (Turing) reduces validation to checking membership of a bag in the language of an RBE. Additionally, we show that single-occurrence regular bag expressions (SORBE) enjoy a tractable membership problem, which makes them an attractive candidate for use in deterministic shape expression schemas.

Related work.
A number of approaches for validation of RDF have been previously proposed. RDF Schema (RDFS) in essence allows one to define a light ontology consisting of types of objects (classes), inclusion dependencies between types (a hierarchy), and a specification of the domain and the range of the graph edges of a given label. However, the W3C recommendation does not fix a semantics but only suggests two possible usages: 1) as an ontology allowing one to infer types of RDF objects and 2) as a constraint language. It should be noted that only the first use is formalized [5], and it is a common belief that, despite its name, RDF Schema is a basic ontology language rather than a schema language. We point out that the constraints definable with RDFS can be easily captured with ShEx but the converse is not the case. OSLC Resource Shapes (ResSh) [31] essentially extend RDFS by allowing one to specify cardinality constraints ?, *, +, and 1 on types, which renders it equivalent to ShEx using single-occurrence $\mathrm{RBE}_0$.
While OWL [14] is an ontology language geared towards inference, it can enforce certain constraints by capturing constraint violations with rules that trigger an inconsistency. However, the class of constraints that can be enforced by OWL is limited due to the fact that OWL is interpreted under the open world assumption (OWA) and without the unique name assumption (UNA). Consequently, a modified semantics for OWL has been proposed [34] that uses OWA when inferring new facts while employing the closed world assumption (CWA) and UNA when enforcing constraints. Since OWL with its standard semantics is widely accepted, concerns have been raised [31] about the potential confusion arising from the mixed semantics. Such a formalism is, however, very powerful and expressive: it easily captures a rich fragment of ShEx (equivalent to $\exists\mathrm{MSO}_G$), but its computational properties are yet to be characterized. While [36] outlines a method of translating OWL constraints into SPARQL queries, the size of the resulting SPARQL queries seems to depend on the size of the OWL constraints. Currently, we do not know if it is reasonable to assume the size of a schema for RDF to be fixed, as it may involve large vocabularies of types used by ontologies. This gives a PSPACE upper bound while ShEx enjoys much lower (combined) complexity. Similar criticism applies to other solutions based on SPARQL queries, such as SPIN [6]: while they are very powerful and expressive, their use may require significant computational resources.
Finally, we point out an important difference in semantics: while we investigate the existence and construction of a valid typing, all approaches above assume the typing to be given (with rdf:type edges) and only verify that the typing is valid. Our approach is more general: we show how to verify the validity of a given typing and that it is the main source of complexity. In particular, we propose a method for constructing a maximal (multi-type) valid typing and a method for extending a given (possibly invalid) typing to one that is valid.

Organization.
In Section 2 we present basic notions. In Section 3 we introduce Shape Expression Schemas (ShEx) and define the single-type and multi-type semantics. In Section 4 we study the complexity of the validation problem for ShEx. In Section 5 we introduce a natural notion of determinism for ShEx and identify a rich class of single-occurrence RBEs that together render multi-type validation tractable. In Section 6 we analyze the expressive power of ShEx. Finally, we conclude and discuss related and future work in Section 7. Because of space restrictions we omit the proofs; they can be found in the technical report [8].

Preliminaries
Because we wish to investigate only the capacity of ShEx to shape the graph topology, we model RDF databases with standard graphs whose edges are labeled by elements of a finite set. In [8] we show how a more general model can be employed without affecting the results.

Graphs
We assume a finite set $\Sigma$ of edge labels. An edge-labeled graph (or simply a graph) is a pair $G = (V, E)$, where $V$ is a finite set of nodes and $E \subseteq V \times \Sigma \times V$ is the set of edges. In Figure 2 we present a number of examples of edge-labeled graphs.
In our approach, we shape the topology of a graph based on the immediate outbound neighborhood of nodes. The labeled outbound neighborhood of a node $n$ in the graph $G = (V, E)$ is essentially the set of edges outgoing from $n$, that is, the set $\{(a, m) \in \Sigma \times V \mid (n, a, m) \in E\}$. On occasion, we use only the collection of outgoing labels, ignoring their target nodes. Note, however, that this collection can be represented neither as a set nor as a list of labels, because a node may have multiple outgoing edges with the same label and its neighborhood is not ordered. Take for instance the node $n_0$ in the graph $G_0$: it has two outgoing $a$-edges and one outgoing $b$-edge. Consequently, we employ bags, also known as multisets, which essentially specify the number of occurrences of every symbol.

Bags of symbols
Let $\Delta$ be a finite set of symbols (which is not necessarily $\Sigma$). A bag over $\Delta$ is a function $w : \Delta \to \mathbb{N}$ that maps a symbol to the number of its occurrences. The empty bag $\varepsilon$ has 0 occurrences of every symbol, i.e., $\varepsilon(a) = 0$ for every $a \in \Delta$. We write $a \in w$ as shorthand for $w(a) \neq 0$.
We present bags using the notation $\{\!|a, \ldots|\!\}$ with elements possibly being repeated. For example, when $\Delta = \{a, b, c\}$, $w_0 = \{\!|a, a, a, c, c|\!\}$ represents the function $w_0(a) = 3$, $w_0(b) = 0$, and $w_0(c) = 2$. Now, for a given graph $G = (V, E)$ and its node $n \in V$, we define the bag of outbound labels of $n$ in $G$ as the bag $\text{out-lab}_G(n) = \{\!|a \mid (n, a, m) \in E|\!\}$. For instance, for the graph $G_0$ in Figure 2 and the node $n_0$ we have $\text{out-lab}_{G_0}(n_0) = \{\!|a, a, b|\!\}$.
The bag union $w_1 \uplus w_2$ of two bags $w_1$ and $w_2$ is defined as $[w_1 \uplus w_2](a) = w_1(a) + w_2(a)$ for all $a \in \Delta$. For instance, $\{\!|a, c, c|\!\} \uplus \{\!|a, b|\!\} = \{\!|a, a, b, c, c|\!\}$. A bag language is a set of bags. The bag union of two languages $L_1$ and $L_2$ is the language $L_1 \uplus L_2 = \{w_1 \uplus w_2 \mid w_1 \in L_1,\ w_2 \in L_2\}$. Also, for a given bag language $L$, we define $L^0 = \{\varepsilon\}$ and $L^i = L \uplus L^{i-1}$ for $i > 0$.
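As an illustration, bags and the operations just defined map directly onto Python's `collections.Counter`. The sketch below assumes that a graph is encoded as a set of (source, label, target) triples; the toy graph and the function names are hypothetical and only mirror the situation of a node with two outgoing a-edges and one outgoing b-edge.

```python
from collections import Counter

# Hypothetical toy graph: a set of (source, label, target) triples.
edges = {("n0", "a", "n1"), ("n0", "a", "n2"), ("n0", "b", "n3")}

def out_lab(edges, n):
    """Bag of labels on the edges leaving n; multiplicities are kept."""
    return Counter(label for (src, label, _tgt) in edges if src == n)

def bag_union(w1, w2):
    """Bag union: occurrence counts are added symbol by symbol."""
    return w1 + w2

print(out_lab(edges, "n0"))
print(bag_union(Counter("acc"), Counter("ab")))
```

A `Counter` returns 0 for absent symbols, which matches the convention that a bag is a total function from symbols to natural numbers.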

Regular bag expressions
A number of XML schema languages, including DTD, XML Schema, and RELAX NG, use regular expressions to define the content model (local structure) of types. The popularity of using regular expressions (for words) in validation comes from the fact that they are easy to grasp and use by a wide range of potential users. Because in the context of RDF graphs the nodes are not ordered, we employ regular expressions for bags, which replace the ordered concatenation operator with its unordered version ||, and implicitly the Kleene star with the unordered Kleene star. This family of expressions has been successfully employed to model XML with unordered and mixed content models [7,12].
A regular bag expression (RBE) defines bags by using disjunction "|", unordered concatenation "||", and unordered Kleene star "*". Formally, RBEs over $\Delta$ are defined with the following grammar
$$E ::= \varepsilon \mid a \mid E^{*} \mid (E \text{ ``$|$'' } E) \mid (E \text{ ``$||$'' } E),$$
where $a \in \Delta$. Their semantics is defined as follows:
$$L(\varepsilon) = \{\varepsilon\}, \quad L(a) = \{\{\!|a|\!\}\}, \quad L(E_1 \mid E_2) = L(E_1) \cup L(E_2), \quad L(E_1 \mathbin{||} E_2) = L(E_1) \uplus L(E_2),$$
and $L(E^*) = \bigcup_{i \geq 0} L(E)^i$. We use two standard macros: $E^? := (\varepsilon \mid E)$ and $E^+ := (E \mathbin{||} E^*)$. We also use intervals on symbols $a^{[n;m]}$, where $n \in \mathbb{N}$ and $m \in \mathbb{N} \cup \{\infty\}$, with the natural semantics: $L(a^{[n;m]}) = \bigcup_{n \leq i \leq m} L(a)^i$.
We use different syntactic restrictions on RBEs, indicating which of the following syntactic ingredients can be used: $a^M$ means that multiplicities among $\{1, ?, *, +\}$ can be used only on symbols; $a^I$ means that arbitrary intervals on symbols can be used; ||, |, and * mean that the corresponding operator can be used. For instance, $\mathrm{RBE}(a^M, ||, |)$ is the class of RBEs that allows the use of the Kleene star operator * on symbols only; $\mathrm{RBE}(a, ||, |, *)$ is the class of all RBEs. By $\mathrm{RBE}_0$ we denote the class $\mathrm{RBE}(a^I, ||)$, the simplest class that we believe to be of practical relevance. Finally, by $\mathrm{RBE}_1$ we denote the RBEs of the form $(a_{1,1} \mid \cdots \mid a_{1,k_1}) \mathbin{||} \cdots \mathbin{||} (a_{n,1} \mid \cdots \mid a_{n,k_n})$, with $a_{i,j} \in \Delta$. In the sequel, we also use RBE to denote the family of bag languages definable with regular bag expressions, and it should be clear from the context whether "RBE" stands for the class of expressions or for the class of languages.
A number of important facts are known about RBE: it is closed under intersection, union, and complement [26], testing membership $w \in L(E)$ is NP-complete [22], and so is testing the emptiness of $L(E_1) \cap L(E_2)$ [12]. Also, when a bag of symbols is viewed as a vector of natural numbers (obtained by fixing some total order on $\Delta$), RBE is equivalent to the class of semilinear sets and the class of vectors definable with Presburger arithmetic [18,28].
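The semantics of RBEs can be read off almost literally as a brute-force membership test, sketched below. The tuple encoding of expressions (`"eps"`, `"sym"`, `"or"`, `"cat"`, `"star"`) and the function names are assumptions made for this illustration. The procedure enumerates all ways of splitting a bag and is therefore exponential, which is in line with the NP-completeness of membership rather than an efficient algorithm.

```python
from itertools import product

def sub_bags(w):
    """Yield every sub-bag of w, where w maps symbols to positive counts."""
    syms = sorted(w)
    for counts in product(*(range(w[s] + 1) for s in syms)):
        yield {s: c for s, c in zip(syms, counts) if c > 0}

def minus(w, u):
    """Bag difference w - u, assuming u is a sub-bag of w."""
    return {s: w[s] - u.get(s, 0) for s in w if w[s] - u.get(s, 0) > 0}

def member(w, e):
    """Decide w in L(e) by exhaustive search over splits of w."""
    kind = e[0]
    if kind == "eps":
        return not w
    if kind == "sym":
        return w == {e[1]: 1}
    if kind == "or":
        return member(w, e[1]) or member(w, e[2])
    if kind == "cat":   # unordered concatenation: split w into two parts
        return any(member(u, e[1]) and member(minus(w, u), e[2])
                   for u in sub_bags(w))
    if kind == "star":  # peel off one non-empty bag of L(E) at a time
        if not w:
            return True
        return any(u and member(u, e[1]) and member(minus(w, u), e)
                   for u in sub_bags(w))
    raise ValueError(kind)

# (a || b)* | c
E = ("or", ("star", ("cat", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(member({"a": 2, "b": 2}, E), member({"a": 2, "b": 1}, E))
```

Bags are plain dictionaries here so that sub-bags can be enumerated easily; symbols with count 0 are simply omitted.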

Shape Expression Schemas
In this section we formally introduce shape expression schemas and propose two semantics that we study in the remainder of the paper. We assume a finite set of edge labels $\Sigma$ and a finite set of types $\Gamma$. A shape expression is an RBE over $\Sigma \times \Gamma$. In the sequel we write $(a, t) \in \Sigma \times \Gamma$ simply as $a :: t$. A shape expression schema (ShEx), or simply schema, is a tuple $S = (\Sigma, \Gamma, \delta)$, where $\Sigma$ is a finite set of edge labels, $\Gamma$ is a finite set of types, and $\delta$ is a type definition function that maps elements of $\Gamma$ to bag languages over $\Sigma \times \Gamma$. We only use shape expressions for defining bag languages of $\delta$. Typically, we present a ShEx as a collection of rules of the form $t \to E$ to indicate that $\delta(t) = L(E)$, where $t \in \Gamma$ and $E$ is a shape expression over $\Sigma$ and $\Gamma$ (naturally, no two rules shall have the same left-hand side). If for some type $t$ a rule is missing, the default rule is $t \to \varepsilon$. For a class of RBEs $C$, by $\mathrm{ShEx}(C)$ we denote the class of shape expression schemas using only shape expressions in $C$. We use two example schemas, $S_0$ and $S_1$, in the sequel.
The semantics of ShEx is natural: a graph is valid if it is possible to assign types to the nodes of the graph in a manner that satisfies the type definitions of the schema. Two variants of semantics can be envisioned depending on whether or not more than one type can be assigned to a node.

Single-type semantics
We fix a graph $G = (V, E)$ and a schema $S = (\Sigma, \Gamma, \delta)$. A single-type typing (or simply an s-typing) of $G$ w.r.t. $S$ is a function $\lambda : V \to \Gamma$ that associates with every node $n \in V$ its type $\lambda(n)$. In the examples below, $\lambda_0$ denotes an s-typing of $G_0$ w.r.t. $S_0$. Next, we identify the conditions that an s-typing needs to satisfy. Given a typing $\lambda$ and a node $n \in V$, we define the labeled and typed out-neighborhood of $n$ w.r.t. $\lambda$ as the bag over $\Sigma \times \Gamma$
$$\text{out-lab-type}^{\lambda}_{G}(n) = \{\!|a :: \lambda(m) \mid (n, a, m) \in E|\!\}.$$
For instance, for the graph $G_0$ (Figure 2) and the typing $\lambda_0$ we have $\text{out-lab-type}^{\lambda_0}_{G_0}(n_1) = \{\!|a :: t_1, b :: t_2|\!\}$ and $\text{out-lab-type}^{\lambda_0}_{G_0}(n_4) = \{\!|c :: t_1|\!\}$. Now, $\lambda$ is a valid s-typing of $G$ w.r.t. $S$ if and only if every node satisfies the type definition of its associated type, i.e., for every $n \in V$, $\text{out-lab-type}^{\lambda}_{G}(n) \in \delta(\lambda(n))$. By $L_s(S)$ we denote the set of all graphs that have a valid s-typing w.r.t. the shape expression schema $S$. For a class $C$ of bag languages, by $\mathrm{ShEx}_s(C)$ we denote the class of graph languages definable under the single-type semantics with shape expression schemas using shape expressions from $C$ only. Naturally, $\lambda_0$ is a valid typing of $G_0$ w.r.t. $S_0$. $G_1$ also has a valid s-typing w.r.t. $S_1$. $G_2$, however, does not have a valid s-typing w.r.t. $S_1$.

Multi-type semantics
Again, we assume a fixed graph $G = (V, E)$ and a fixed schema $S = (\Sigma, \Gamma, \delta)$. A multi-type typing (or simply an m-typing) of $G$ w.r.t. $S$ is a function $\lambda : V \to 2^{\Gamma}$ that associates with every node of $G$ a set of types. In the examples below, $\lambda_2$ denotes an m-typing of $G_2$ w.r.t. $S_1$. The labeled and typed out-neighborhood of a node is defined in the same way, but note that this time it is a bag over $\Sigma \times 2^{\Gamma}$. For instance, $\text{out-lab-type}^{\lambda_2}_{G_2}(n_1) = \{\!|b :: \{t_1, t_2\}, c :: \{t_3\}|\!\}$. Now, a flattening of a bag over $\Sigma \times 2^{\Gamma}$ is any bag over $\Sigma \times \Gamma$ obtained by choosing one type from every occurrence of every set. For instance, $\text{out-lab-type}^{\lambda_2}_{G_2}(n_1)$ has two flattenings: $\{\!|b :: t_1, c :: t_3|\!\}$ and $\{\!|b :: t_2, c :: t_3|\!\}$. Formally, a flattening of a bag $w$ over $\Sigma \times 2^{\Gamma}$ is any bag in the language of the following $\mathrm{RBE}_1$ expression
$$\mathrm{Flatten}(w) = \big\Vert_{a :: T \,\in\!\!\in\, w} \big(\,\big|_{t \in T}\; a :: t\big),$$
where $a :: T \in\!\!\in w$ indicates that the symbol $a :: T$ is to be considered $w(a :: T)$ times. For instance, for a bag of the form $\{\!|a :: \{t_0, t_1\}, a :: \{t_0, t_1\}, b :: \ldots|\!\}$, the symbol $a :: \{t_0, t_1\}$ contributes the factor $(a :: t_0 \mid a :: t_1)$ twice. By $\text{Out-lab-type}^{\lambda}_{G}(n)$ we denote the set of all flattenings of $\text{out-lab-type}^{\lambda}_{G}(n)$, i.e., the language of $\mathrm{Flatten}(\text{out-lab-type}^{\lambda}_{G}(n))$. For instance, $\text{Out-lab-type}^{\lambda_2}_{G_2}(n_1) = \{\{\!|b :: t_1, c :: t_3|\!\}, \{\!|b :: t_2, c :: t_3|\!\}\}$. Note that while $\mathrm{Flatten}(\text{out-lab-type}^{\lambda}_{G}(n))$ is an expression of size polynomial in the size of $G$ and $S$, the cardinality of the set $\text{Out-lab-type}^{\lambda}_{G}(n)$ may be exponential in the size of $G$ and $S$. Now, $\lambda$ is a valid m-typing of $G$ w.r.t. $S$ if and only if:
1. it assigns at least one type to every node, i.e., $\lambda(n) \neq \emptyset$ for every $n \in V$;
2. every node satisfies the type definition of every type assigned to the node, i.e., for every $n \in V$ and every $t \in \lambda(n)$, $\text{Out-lab-type}^{\lambda}_{G}(n) \cap \delta(t) \neq \emptyset$.
For instance, $\lambda_2$ is a valid multi-type typing of $G_2$ w.r.t. $S_1$. A semi-valid m-typing is an m-typing that satisfies the second condition but might violate the first one. By $L_m(S)$ we denote the set of all graphs that have a valid m-typing w.r.t. $S$.
For a class $C$ of bag languages, by $\mathrm{ShEx}_m(C)$ we denote the class of graph languages definable under the multi-type semantics with shape expression schemas using shape expressions in $C$ only.
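To illustrate flattenings, the sketch below enumerates them for a bag over labels paired with sets of types, choosing one type per occurrence. The encoding (a `Counter` keyed by `(label, frozenset_of_types)`) and the function name are assumptions of this illustration; the enumeration is exponential in general, as noted above.

```python
from collections import Counter
from itertools import product

def flattenings(bag):
    """All bags over (label, type) obtained by picking one type per occurrence.

    Each resulting bag is returned as a frozenset of ((label, type), count)
    pairs so that the results can be collected in a set.
    """
    occurrences = list(bag.elements())  # one entry per occurrence
    result = set()
    for choice in product(*(sorted(types) for (_label, types) in occurrences)):
        flat = Counter((label, t)
                       for (label, _types), t in zip(occurrences, choice))
        result.add(frozenset(flat.items()))
    return result

# The bag {| b :: {t1, t2}, c :: {t3} |} from the running example.
w = Counter({("b", frozenset({"t1", "t2"})): 1, ("c", frozenset({"t3"})): 1})
for flat in flattenings(w):
    print(sorted(flat))
```

If some set of types in the bag is empty, no choice exists and the function returns the empty set, which corresponds to a node having a neighbor with no type.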

Validation
In this section we consider the problem of validation: checking whether a given graph has a valid typing w.r.t. a given ShEx. This problem has two parameters: 1) the kind of typing, either single-type or multi-type, and 2) the class of regular bag expressions used for type definitions in the schema. We first point out that single-type validation for ShEx(RBE) is NP-complete. The NP upper bound follows from the fact that the membership problem for RBE is in NP. The lower bound is shown with a reduction from 3-colorability of graphs.

Theorem 1. Single-type validation for ShEx(RBE) is NP-complete.
Proof (sketch). The reduction uses a schema which, under the single-type semantics, defines the set of graphs with a homomorphism into $K_3$, i.e., all 3-colorable graphs.
For the remainder of this section, we focus on multi-type validation and return briefly to the single-type semantics in the next section.
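To make the connection with 3-colorability concrete, the sketch below brute-forces single-type validation for one schema with the stated property: three types r, g, and b, where a node of a given type may only have outgoing edges to nodes of the other two types. This particular schema, the encoding of type definitions as Python predicates on bags, and all names are assumptions made for illustration. The search tries every s-typing and is exponential in the number of nodes.

```python
from collections import Counter
from itertools import product

def out_lab_type(edges, typing, n):
    """Bag of (label, type of target) pairs for the edges leaving n."""
    return Counter((a, typing[m]) for (src, a, m) in edges if src == n)

def has_valid_s_typing(nodes, edges, delta):
    """Try every assignment of exactly one type per node."""
    nodes = sorted(nodes)
    for types in product(sorted(delta), repeat=len(nodes)):
        typing = dict(zip(nodes, types))
        if all(delta[typing[n]](out_lab_type(edges, typing, n)) for n in nodes):
            return True
    return False

def other_colours_only(colour):
    """Type definition: every outgoing edge leads to a different colour."""
    return lambda bag: all(t != colour for (_a, t) in bag)

delta = {c: other_colours_only(c) for c in ("r", "g", "b")}

# Each undirected edge is encoded once, as a single directed edge.
triangle = {(1, "e", 2), (2, "e", 3), (3, "e", 1)}
k4 = {(i, "e", j) for i in range(1, 5) for j in range(i + 1, 5)}
print(has_valid_s_typing({1, 2, 3}, triangle, delta))
print(has_valid_s_typing({1, 2, 3, 4}, k4, delta))
```

One directed edge per undirected edge suffices here because the constraint that source and target have different types is symmetric.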

Semi-lattice of m-typings
We begin by presenting a downward refinement method that allows one to construct a valid m-typing. Take a graph $G = (V, E)$ and a ShEx $S$, and let $\mathrm{mTyping}(G, S)$ be the set of all valid m-typings of the graph $G$ w.r.t. the schema $S$. $\mathrm{mTyping}(G, S)$ is a semi-lattice whose meet operation is defined pointwise on the sets of types assigned to nodes and whose (induced) partial order is pointwise inclusion: $\lambda_1 \sqsubseteq \lambda_2$ if and only if $\lambda_1(n) \subseteq \lambda_2(n)$ for every $n \in V$. The refinement method works as follows. We begin with a typing that assigns to every node the set of all types, $\lambda^{\bullet}(n) = \Gamma$ for all $n \in V$, and then we iteratively remove the types that are not satisfied. Every iteration is an application of the one-step refinement operator on m-typings defined as follows (with $n \in V$):
$$[\mathrm{Refine}(\lambda)](n) = \{t \in \lambda(n) \mid \text{Out-lab-type}^{\lambda}_{G}(n) \cap \delta(t) \neq \emptyset\}.$$
Clearly, $\mathrm{Refine}(\lambda) \sqsubseteq \lambda$, and therefore, the fix-point $\mathrm{Refine}^*(\lambda)$ is well-defined. We claim that the procedure outlined above indeed constructs the maximal valid m-typing if one exists.

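Lemma 2. For any $\lambda \in \mathrm{mTyping}(G, S)$, $\lambda \sqsubseteq \mathrm{Refine}^*(\lambda^{\bullet})$.
In particular, $G$ satisfies $S$ if and only if $\mathrm{Refine}^*(\lambda^{\bullet})$ is valid, and then, $\mathrm{Refine}^*(\lambda^{\bullet})$ is the $\sqsubseteq$-maximal valid m-typing of $G$ w.r.t. $S$. We point out that there does not necessarily exist a unique $\sqsubseteq$-minimal valid m-typing.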
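The refinement procedure can be sketched as a simple fix-point loop. In the sketch below, type definitions are modelled as Python predicates on bags and flattenings are enumerated by brute force, so each step is exponential in the worst case; the toy schema (a type `chain` with exactly one `next` edge to a `chain` or `end` node, and a type `end` with no outgoing edges) and all names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def flattenings(edges, typing, n):
    """Yield every flattening of the typed out-neighborhood of n."""
    out = [(a, sorted(typing[m])) for (src, a, m) in sorted(edges) if src == n]
    for choice in product(*(types for (_a, types) in out)):
        yield Counter((a, t) for (a, _types), t in zip(out, choice))

def refine(edges, delta, typing):
    """One refinement step: keep a type only if some flattening satisfies it."""
    return {n: {t for t in ts
                if any(delta[t](w) for w in flattenings(edges, typing, n))}
            for n, ts in typing.items()}

def maximal_typing(nodes, edges, delta):
    """Iterate refine from the typing that assigns every type to every node."""
    typing = {n: set(delta) for n in nodes}
    while True:
        refined = refine(edges, delta, typing)
        if refined == typing:
            return typing
        typing = refined

def is_valid(typing):
    """The fix-point is valid iff every node kept at least one type."""
    return all(typing.values())

delta = {
    "end": lambda w: not w,
    "chain": lambda w: w in (Counter({("next", "chain"): 1}),
                             Counter({("next", "end"): 1})),
}
path = {("n0", "next", "n1"), ("n1", "next", "n2")}
print(maximal_typing(["n0", "n1", "n2"], path, delta))
```

Each step only removes types, so the loop terminates after at most as many iterations as there are (node, type) pairs.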

Complexity of Validation
Using the above refinement procedure, we show that multi-type validation is NP-complete for arbitrary regular bag expressions and later identify a tractable fragment.
In essence, performing the refinement procedure requires testing the nonemptiness of the intersection $\text{Out-lab-type}^{\lambda}_{G}(n) \cap \delta(t)$. Recall that $\text{Out-lab-type}^{\lambda}_{G}(n)$ is defined by an $\mathrm{RBE}_1$ expression, i.e., an expression of the form $(a_{1,1} \mid \cdots \mid a_{1,k_1}) \mathbin{||} \cdots \mathbin{||} (a_{n,1} \mid \cdots \mid a_{n,k_n})$. Therefore, for a class of RBEs $C$ we identify the following decision problem:
$$\mathrm{INTER}_1(C) = \{(E_0, E) \mid E_0 \in \mathrm{RBE}_1,\ E \in C,\ L(E_0) \cap L(E) \neq \emptyset\}.$$
Tractability of $\mathrm{INTER}_1(C)$ is a necessary and sufficient condition for tractability of multi-type validation for $\mathrm{ShEx}(C)$. On the one hand, we show that for any class $C$ of RBEs there exists a polynomial-time reduction from $\mathrm{INTER}_1(C)$ to validation for $\mathrm{ShEx}_m(C)$. On the other hand, the refinement procedure performs a polynomial number of $\mathrm{INTER}_1$ tests (with polynomially-sized inputs). This observation allows us to characterize precisely the complexity of multi-type validation for ShEx(RBE): the lower bound follows from an existing complexity result [12] on testing emptiness of RBEs with intersection, and the upper bound follows from a more general result that we prove, namely that testing satisfiability of RBEs with intersection can be reduced to finding integer solutions to a system of linear equations, a problem known to be in NP [27].

The tractable subclass RBE 0
When a ShEx uses only expressions in $\mathrm{RBE}_0 = \mathrm{RBE}(a^I, ||)$, multi-type validation is tractable, which we show by providing a polynomial algorithm for $\mathrm{INTER}_1(\mathrm{RBE}_0)$. We point out that ShEx using $\mathrm{RBE}_0$ is capable, for instance, of capturing the topology of RDF graphs obtained from exporting relational databases to RDF [33].
We reduce $\mathrm{INTER}_1(\mathrm{RBE}_0)$ to the circulation problem in flow networks. Recall that a flow network is a directed graph whose arcs are additionally assigned the amount of flow they require (minimum flow) and the amount of flow they can accept (maximum flow). The circulation problem is to find a valid flow, i.e., an assignment of flow values to arcs of the flow network so that the flow constraints of every arc are satisfied and at every node the sum of incoming flow is equal to the sum of outgoing flow. This problem has been well studied and a number of efficient polynomial algorithms exist (cf. [19]).
We illustrate the reduction on an example (Figure 3), where the maximum flow (the minimum flow) of an arc is indicated above (below, respectively) the arc.
Theorem 4. Multi-type validation for $\mathrm{ShEx}(\mathrm{RBE}_0)$ is in PTIME.

Determinism
Determinism is a classical tool for decreasing the complexity of validation [20], which we explore next. We propose a suitable and natural notion of determinism for ShEx and show that multi-type validation for deterministic ShEx is not harder than membership of a bag in the language of an RBE. This allows us to identify a large and practical class of single-occurrence regular bag expressions (SORBE) that render validation tractable (Section 5.2).
We then investigate the problem of partial validation, where the conformance of only a fragment of the input graph is to be checked (Section 5.3), and which we believe to be an important practical use case of ShEx.We present an optimal algorithm for partial validation which is tractable for classes of deterministic ShEx with tractable membership, for both multi-type and single-type semantics.

Deterministic Shape Expressions
Essentially, the idea of determinism for a ShEx $S = (\Sigma, \Gamma, \delta)$ is that, knowing the type $t$ of a node $n \in V$ and the label $a$ of an outgoing edge $(n, a, m) \in E$, we should know the type $t'$ that must be satisfied by $m$ if $n$ is to satisfy the type $t$. Formally, a shape expression $E$ is deterministic if every label $a \in \Sigma$ is used with at most one type $t \in \Gamma$ in $E$. For instance, $E_1 = a :: t_1 \mathbin{||} a :: t_3 \mathbin{||} c :: t_2$ is not deterministic because the symbol $a$ is used with two different types, $t_1$ and $t_3$. Now, a shape expression schema $S = (\Sigma, \Gamma, \delta)$ is deterministic if it uses only deterministic shape expressions, and then, by $\delta(t, a)$ we denote the unique type used with the symbol $a$ in the expression used to define $\delta(t)$ (if $a$ is used in this expression).
Recall that the tractability of the refinement method for multi-type validation presented in Section 4.1 depends on the tractability of testing that $\text{Out-lab-type}^{\lambda}_{G}(n) \cap \delta(t)$ is nonempty. When the schema is deterministic, $\text{Out-lab-type}^{\lambda}_{G}(n) \cap \delta(t)$ is nonempty if and only if 1) the bag $\text{out-lab-type}^{\delta}_{G}(n, t) = \{\!|a :: \delta(t, a) \mid (n, a, m) \in E|\!\}$ belongs to $\delta(t)$ and 2) for every $(n, a, m) \in E$, $\delta(t, a)$ belongs to $\lambda(m)$. Using this argument, we show that the sufficient and necessary condition for tractable validation against deterministic ShEx using a class $C$ of RBEs is a tractable membership problem for $C$, a decision problem defined formally as
$$\mathrm{MEMB}(C) = \{(w, E) \mid E \in C,\ w \in L(E)\}.$$
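The two conditions for deterministic schemas translate into a direct check that needs a single membership test per node and type. In the sketch below, `delta_type` plays the role of the partial function that maps a type and a label to the unique type used with that label, and `delta_lang` models each type definition as a membership predicate on bags; both encodings, the toy schema, and all names are assumptions made for illustration.

```python
from collections import Counter

def satisfies(edges, typing, n, t, delta_lang, delta_type):
    """Check whether node n can keep type t under the m-typing `typing`."""
    out = [(a, m) for (src, a, m) in edges if src == n]
    # A label that the definition of t does not use can never be matched.
    if any((t, a) not in delta_type for (a, _m) in out):
        return False
    # Condition 1: the bag {| a :: delta(t, a) |} is in the language of t.
    bag = Counter((a, delta_type[(t, a)]) for (a, _m) in out)
    if not delta_lang[t](bag):
        return False
    # Condition 2: every target node carries the type required by its edge.
    return all(delta_type[(t, a)] in typing[m] for (a, m) in out)

# Hypothetical schema: Bug -> reportedBy :: User || related :: Bug*, User -> eps
delta_type = {("Bug", "reportedBy"): "User", ("Bug", "related"): "Bug"}
delta_lang = {
    "Bug": lambda w: w[("reportedBy", "User")] == 1,
    "User": lambda w: not w,
}
edges = {("b1", "reportedBy", "u1"), ("b1", "related", "b2"),
         ("b2", "reportedBy", "u1")}
typing = {"b1": {"Bug"}, "b2": {"Bug"}, "u1": {"User"}}
print(satisfies(edges, typing, "b1", "Bug", delta_lang, delta_type))
```

No flattenings are enumerated here: determinism fixes the only candidate bag, so the cost is dominated by the membership test.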

Single-occurrence RBE (SORBE)
While the problem of membership of a bag in a language defined by an RBE is in general intractable [22], we identify a rich and practical class of RBEs with tractable membership. This class is obtained by disallowing repeated symbols while allowing arbitrary intervals on symbols, a restriction on regular expressions commonly imposed in the context of document content models, with evidence justifying its use [3,11,12,17,25]. Formally, a single-occurrence regular bag expression (SORBE) over $\Delta$ is an RBE that allows at most one occurrence of every symbol of $\Delta$ and allows the use of the Kleene plus $+$ on expressions and arbitrary intervals on symbols $a^I$. Note that this also enables the use of $E^?$ since it can be defined using $\varepsilon$ and the union operator without repeating any symbol of $\Delta$.

Theorem 5. MEMB(SORBE) is in PTIME.
Proof. We fix a bag of symbols $w$ over $\Delta$. For a regular bag expression $E$, by $\Delta(E)$ we denote the subset of $\Delta$ containing exactly the symbols used in $E$. For a subset $X \subseteq \Delta$, by $w_X$ we denote the bag over $X$ obtained from $w$ by removing all occurrences of symbols outside of $X$. W.l.o.g. we assume that the Kleene plus $E^+$ is used only if $\varepsilon \notin L(E)$ (otherwise $E^+$ can be replaced by $E^*$).
The algorithm recursively constructs for an expression $E$ a set of integers $I(E)$ such that $i \in I(E)$ iff $w_{\Delta(E)} \in L(E)^i$. This set is represented by an interval. Recall that an interval $[n; m]$ is a finite representation of the set $\{i \mid n \leq i \leq m\}$. It is empty if $m < n$, and we use $\emptyset$ to denote (the equivalence class of) all empty intervals. Also, the intersection of two intervals can be obtained easily: $[n_1; m_1] \cap [n_2; m_2] = [\max\{n_1, n_2\}; \min\{m_1, m_2\}]$. The algorithm computes $I(E)$ by recursion on the structure of $E$ (with the conventions $0/\infty = 0$ and $i/\infty = 1$ for $i \geq 1$). Note that assigning $I(E^+) = [0; 0]$ when $w_{\Delta(E)} = \varepsilon$ is valid since we assume $\varepsilon \notin L(E)$. Naturally, $w \in L(E)$ if and only if $1 \in I(E)$ and $w$ uses only symbols present in $E$.
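One way to realize the interval computation is sketched below. The case analysis is derived here from the invariant that $i$ belongs to $I(E)$ exactly when the restriction of $w$ to the symbols of $E$ is in $L(E)^i$; it is a reading of the construction under that invariant and may differ in detail from the authors' formulation. The tuple encoding of expressions and all names are assumptions: `("sym", a, n, m)` stands for $a^{[n;m]}$, disjunction adds intervals endpoint-wise, unordered concatenation intersects them, and the Kleene plus lowers the lower bound to 1 (or 0 for the empty bag).

```python
import math

EMPTY = None  # the empty interval

def symbols(e):
    """Set of symbols used in expression e."""
    if e[0] == "eps":
        return set()
    if e[0] == "sym":
        return {e[1]}
    if e[0] == "plus":
        return symbols(e[1])
    return symbols(e[1]) | symbols(e[2])  # "or" / "cat"

def interval(e, w):
    """Interval of all i such that w restricted to symbols(e) is in L(e)^i."""
    kind = e[0]
    if kind == "eps":
        return (0, math.inf)
    if kind == "sym":
        _, a, n, m = e
        k = w.get(a, 0)
        if k == 0:
            lo = 0
        elif m == 0:
            return EMPTY
        else:
            lo = 1 if m == math.inf else -(-k // m)  # ceil(k / m)
        hi = math.inf if n == 0 else k // n          # floor(k / n)
        return (lo, hi) if lo <= hi else EMPTY
    if kind in ("or", "cat"):
        i1, i2 = interval(e[1], w), interval(e[2], w)
        if i1 is EMPTY or i2 is EMPTY:
            return EMPTY
        if kind == "or":  # repetitions are split between the two branches
            return (i1[0] + i2[0], i1[1] + i2[1])
        lo, hi = max(i1[0], i2[0]), min(i1[1], i2[1])
        return (lo, hi) if lo <= hi else EMPTY
    if kind == "plus":
        inner = interval(e[1], w)
        if inner is EMPTY:
            return EMPTY
        touched = any(w.get(a, 0) > 0 for a in symbols(e[1]))
        return (1 if touched else 0, inner[1])
    raise ValueError(kind)

def member(w, e):
    """w is in L(e) iff w uses only symbols of e and 1 lies in I(e)."""
    if any(c > 0 and a not in symbols(e) for a, c in w.items()):
        return False
    i = interval(e, w)
    return i is not EMPTY and i[0] <= 1 <= i[1]

# (a || b^[0;2])+
E1 = ("plus", ("cat", ("sym", "a", 1, 1), ("sym", "b", 0, 2)))
print(member({"a": 2, "b": 3}, E1), member({"a": 1, "b": 3}, E1))
```

The recursion visits every subexpression a constant number of times apart from the calls to `symbols`, so the running time is polynomial in the size of the expression and of the bag. The `plus` case is written so that it also covers an inner expression that accepts the empty bag.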
We, therefore, immediately get the following.
Corollary 6. Multi-type validation for deterministic ShEx using SORBE is in PTIME.
We employ the single-type requirement to reduce the NP-complete problem of exact set cover to single-type validation against a deterministic ShEx using only single-occurrence RBE₀ expressions.
Theorem 7. Single-type validation for deterministic ShEx using SORBE is NP-complete.

Optimal validation algorithm
Some applications might not require testing validity of the whole graph, but rather checking the validity of only a fragment that will be accessed by the application. Such a fragment can be identified by a set of root nodes, entry points for navigating the graph, and typically, the application will require the entry points to satisfy certain types. In this section, we show how this scenario can be modeled with ShEx and present an efficient algorithm that works with deterministic shape expressions. For this, take a schema S = (Σ, Γ, δ) such that Γ contains a special universal type t⊤ with the definition δ(t⊤) = (Σ × Γ)*. The language of S is the universal graph language, as any node of any graph can be typed with t⊤. In essence, the universal type allows us to forgo validation of any node because any node implicitly satisfies t⊤.
To carry out validation on a fragment of a graph identified by the entry points, we introduce the notion of pre-typing, an assignment of required types to a selected set of nodes. Formally, a pre-typing of a graph G (w.r.t. S) is a partial mapping λ_ : V → 2^Γ. Now, the objective is to find a valid extension of λ_, i.e., a valid m-typing λ of G w.r.t. S that extends λ_. The universal type t⊤ combined with pre-typing is a very powerful modelling tool. For instance, the rule t0 → a :: t1* || (||_{b∈Σ, b≠a} b :: t⊤) indicates that we are interested in checking correct typing for nodes reachable by an a-labelled edge, but all remaining nodes can have arbitrary types, and therefore do not need to be typed.
Since we are not interested in typing the whole graph G, we focus on the smallest possible valid extension of a given pre-typing λ_. Interestingly, we can show the following.
Lemma 8. For a deterministic ShEx S = (Σ, Γ, δ) with universal type, a graph G = (V, E), and a pre-typing λ_ : V → 2^Γ, if λ_ admits a valid extension, then it admits a unique minimal valid extension.
We present an algorithm that constructs the minimal valid extension of a given pre-typing λ_ of a given graph G w.r.t. a given deterministic Shape Expression Schema S with universal type. For technical reasons, we represent a typing as a binary relation between the set of nodes and the set of types, and deliberately omit the universal type t⊤. More precisely, we use a relation R_λ ⊆ V × (Γ ∖ {t⊤}) such that λ(n) = {t | (n, t) ∈ R_λ} if (n, t) ∈ R_λ for some t ∈ Γ, and λ(n) = {t⊤} otherwise. Furthermore, we abuse notation and use λ instead of R_λ. Recall that δ(t, a) is the unique type used together with the symbol a in δ(t).
This algorithm is a modified graph flooding algorithm that maintains a frontier set F of pairs (n, t) for which it remains to be verified that the node n satisfies type t. Initially, this set contains only the pairs specified by the pre-typing (line 1). The algorithm fails whenever for some (n, t) ∈ F the outgoing edges of n do not satisfy the structural constraints given by δ(t) (lines 5-7). If, however, the constraints are satisfied, any node m reachable from n is added to F with an appropriate type unless the type is universal. Note that a run of the algorithm considers the pair (n, t) at most once, and therefore the main loop is executed at most |V| × |Γ| times. Once F is empty, the constructed λ represents the minimal valid extension of λ_. This algorithm is optimal in the sense that it constructs the minimal representation of the minimal valid extension and considers assigning a type to a node only if it is required to construct the extension. Naturally, for single-occurrence RBEs the algorithm works in polynomial time.

Algorithm 1 MinValidExt(S, G, λ_)
Input: S = (Σ, Γ, δ) a deterministic ShEx, G = (V, E), λ_ ⊆ V × Γ a pre-typing;
Output: λ ⊆ V × Γ the minimal valid extension of λ_.
...
for (a, m) ∈ out-lab-node G (n) do
F := F ∪ {(m, δ(t, a))}
12: return λ
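The flooding procedure described above can be sketched in Python as follows. This is a minimal sketch reconstructed from the prose description, not the authors' listing. The schema encoding (each type mapped to a membership test on the bag of outgoing labels plus a label-to-type map), the name `TOP` for the universal type, and the function name are all assumptions.

```python
from collections import Counter

TOP = "TOP"  # hypothetical name for the universal type; never stored in the result

def min_valid_ext(delta, edges, pre_typing):
    """delta: type -> (accepts, target), where accepts(Counter of labels) -> bool
    stands in for membership in δ(t) and target[a] plays the role of δ(t, a).
    edges: iterable of (n, a, m). pre_typing: set of (node, type) pairs.
    Returns the set of (node, type) pairs, or None if validation fails."""
    out = {}
    for n, a, m in edges:
        out.setdefault(n, []).append((a, m))
    typing = set()
    frontier = {(n, t) for (n, t) in pre_typing if t != TOP}
    while frontier:
        n, t = frontier.pop()
        typing.add((n, t))
        accepts, target = delta[t]
        labels = Counter(a for a, _ in out.get(n, []))
        # Fail if a label is not allowed by δ(t) or the bag is not accepted.
        if any(a not in target for a in labels) or not accepts(labels):
            return None
        for a, m in out.get(n, []):
            required = target[a]
            # Skip the universal type and pairs that were already verified.
            if required != TOP and (m, required) not in typing:
                frontier.add((m, required))
    return typing

# Toy schema (an assumption): t0 -> a :: t1* || b :: TOP*,  t1 -> c :: t1?
delta = {
    "t0": (lambda bag: True, {"a": "t1", "b": TOP}),
    "t1": (lambda bag: bag["c"] <= 1, {"c": "t1"}),
}
edges = [("n0", "a", "n1"), ("n0", "b", "n2"), ("n1", "c", "n1"),
         ("n2", "c", "n3"), ("n2", "c", "n4")]
result = min_valid_ext(delta, edges, {("n0", "t0")})
```

In this toy run only n0 and n1 receive a type. The node n2 is reached through an edge whose required type is universal, so it is never inspected, even though it would not satisfy t1.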
Theorem 9. Given a deterministic ShEx(SORBE) S, a graph G, and a pre-typing λ _ : V → Γ, the algorithm MinValidExt(S, G, λ _ ) constructs in polynomial time the minimal valid extension of λ _ if it exists, or fails otherwise.
A slight modification of this algorithm works for the single-type semantics too: in the inner loop (lines 9-11) it suffices to add a check that neither F nor λ contains (m, t′) for some t′ ≠ δ(t, a), which prevents assigning two different types to the same node. As a result, the modified algorithm constructs an s-typing λ (with universal type omitted). Also, note that with the single-type modification the while loop is executed at most |V| times, and the algorithm considers each edge of the graph at most once, i.e., the algorithm makes a single pass over the graph.

Expressive power
This section outlines the results of our study of the expressive power of ShEx. First, we consider the first-order logic on graphs (FO G) over the standard signature consisting of relation names (E_a) a∈Σ, and the existential monadic second-order logic on graphs (∃MSO G) allowing only formulas of the form ∃X1, ..., Xn ϕ, where X1, ..., Xn are monadic second-order variables and ϕ is an FO formula using additional atomic formulae of the form x ∈ Xi.
We say that a class of graph languages C separates a graph H from a graph G if there is L ∈ C such that G ∈ L and H ∉ L. Consider the fork G< and the diamond G◇ graphs in Figure 4. We observe that the single-type semantics can easily separate G◇ from G< while it can be easily shown that the multi-type semantics cannot. However, even the single-type semantics cannot separate G< from G◇ while this separation is trivial for FO G and ∃MSO G. Also, let L cycle be the set of graphs labeled with Σ = {a, b} such that for every node n with an incoming b-edge, there is a cycle reachable from n. It is a classic result that FO G cannot define sets of graphs which contain cycles of unbounded size [15]. However, L cycle can be defined in both semantics with the following schema:
The following table presents a comparison of expressive power, indicating whether a language L satisfying the constraints given in the first column can be expressed by each of the formalisms.
It should also be noted that RBEs can express cardinality constraints, e.g., (a || b)* means that the number of a must be equal to the number of b, which cannot be captured by ∃MSO G. However, if we limit the expressive power of the bag languages used in schemas to those that can be captured by ∃MSO G, e.g., any subclass of RBE(a^I, ||, |), the expressive power of schemas is captured by ∃MSO G, a result easily proven with a simple adaptation of the standard translation of an automaton to an existential monadic second-order formula [37]. The restriction on the use of Kleene closure in defining unordered content models has been previously advocated for complexity reasons in the context of XML [12], and we provide yet another reason.
The single-type semantics is in fact very powerful and can easily capture graph languages defined by homomorphism into a fixed graph, as illustrated in the proof of Theorem 3. It seems unlikely that the multi-type semantics is as powerful, as suggested by the complexity arguments in the present paper. Both semantics have the same closure properties (Theorem 10). The closure under intersection can be extended to a powerset technique, similar to the determinisation technique of finite automata [21], that allows us to show that the single-type semantics is in fact more powerful than the multi-type semantics.
Theorem 11. ShEx s properly contains ShEx m.
ShEx can be viewed as an automaton on edge-labeled (possibly infinite) trees obtained by unraveling the input graph and we believe that such a model would correspond closely to Presburger automata [32] if the latter were extended to infinite trees, taking a universal acceptance condition. Interestingly, in this analogy the single-type semantics corresponds to deterministic automata while the multi-type semantics corresponds to nondeterministic ones. More recently, k-Pebble automata on graphs have been proposed [30] but they are not comparable with ShEx because they are capable of expressing arbitrary FO properties. Recognizable sets of graphs as defined in [13] go beyond MSO G, and therefore, capture ShEx(RBE(a^I, ||, |)).
Finally, ShEx are incomparable with both the node replacement (NR) graph grammars and the hyperedge replacement (HR) graph grammars. On the one hand, the language {G◇} is definable by both HR and NR graph grammars with a single initial graph and no rules. On the other hand, ShEx can define languages that are definable by neither HR nor NR grammars: HR grammars can define only languages of graphs of bounded tree-width while NR grammars cannot define a language containing infinitely many square grids.

Conclusions
We have investigated Shape Expression Schemas (ShEx), a novel formalism of schemas for RDF graphs currently under development by W3C. We have proposed two alternative semantics, single-type and multi-type, and studied their expressive power and the complexity of the validation problem. We have also proposed a notion of determinism in order to curb the complexity of validation. While the single-type semantics is in general intractable, for multi-type validation we have identified two essential bottleneck complexity problems on RBEs, membership and satisfiability of RBEs with intersection, depending on whether or not deterministic expressions are used. A summary of the complexity results can be found in Table 1.
Table 1 Summary of main complexity results for the validation problem.
Our results on expressive power suggest that an unrestricted use of the Kleene closure may render the proposed formalism too powerful and so far there exists little evidence of its practical usability in the context of unordered content models [12,7]. Complexity results suggest that the single-type semantics may be too expensive for practical application unless we wish to validate only a fragment of the graph with a given pre-typing. As for the multi-type semantics, validation is tractable for a small yet practical fragment RBE₀ and, if we use determinism, a richer class of SORBEs can be handled efficiently.

Future work.
In the future, we plan to investigate the impact of data value constraints on the complexity of validation. Our preliminary study shows that adding local data value constraints, such as domain checks, does not affect our results. However, the impact of value constraints of a global nature, such as key dependencies, remains to be investigated. We also plan to thoroughly evaluate the proposed algorithms experimentally and compare them with existing validation approaches (SPIN, ICV) on both real-life and synthetically generated data, e.g., an RDF export of TPC-H benchmark data [38]. Our preliminary experiments [8] are very promising. Also, we would like to study the complexity of classical static analysis problems such as schema containment and query validity in the presence of a schema. Finally, we would like to devise inference algorithms for ShEx, drawing inspiration from learning XML twig queries [35] and schemas for XML [4,3,10].

Figure 1 An example of a Shape Expression Schema and a valid RDF graph.

Figure 4 Fork and diamond graphs.

Theorem 10. ShEx s and ShEx m are not closed under union and complement. Both ShEx s and ShEx m are closed under intersection.