Next Article in Journal
The Modeling of Time Series Based on Least Square Fuzzy Cognitive Map
Next Article in Special Issue
On Computational Hardness of Multidimensional Subtraction Games
Previous Article in Journal
Optimal Cooking Procedure Presentation System for Multiple Recipes and Investigating Its Effect
Previous Article in Special Issue
Constructing Minimally 3-Connected Graphs
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Determinization and Minimization of Automata for Nested Words Revisited

Inria, Université de Lille, 59000 Lille, France
*
Author to whom correspondence should be addressed.
Submission received: 11 December 2020 / Revised: 19 February 2021 / Accepted: 19 February 2021 / Published: 24 February 2021
(This article belongs to the Special Issue Selected Algorithmic Papers From CSR 2020)

Abstract

:
We consider the problem of determinizing and minimizing automata for nested words in practice. For this we compile the nested regular expressions ( NREs ) from the usual XPath benchmark to nested word automata ( NWAs ). The determinization of these NWAs , however, fails to produce reasonably small automata. In the best case, huge deterministic NWAs are produced after few hours, even for relatively small NREs of the benchmark. We propose a different approach to the determinization of automata for nested words. For this, we introduce stepwise hedge automata ( SHA s) that generalize naturally on both (stepwise) tree automata and on finite word automata. We then show how to determinize SHA s, yielding reasonably small deterministic automata for the NREs from the XPath benchmark. The size of deterministic SHA s automata can be reduced further by a novel minimization algorithm for a subclass of SHA s. In order to understand why the new approach to determinization and minimization works so nicely, we investigate the relationship between NWAs and SHA s further. Clearly, deterministic SHA s can be compiled to deterministic NWAs in linear time, and conversely NWAs can be compiled to nondeterministic SHA s in polynomial time. Therefore, we can use SHA s as intermediates for determinizing NWAs , while avoiding the huge size increase with the usual determinization algorithm for NWAs . Notably, the NWAs obtained from the SHA s perform bottom-up and left-to-right computations only, but no top-down computations. This NWA behavior can be distinguished syntactically by the (weak) single-entry property, suggesting a close relationship between SHA s and single-entry NWAs . In particular, it turns out that the usual determinization algorithm for NWAs behaves well for single-entry NWAs , while it quickly explodes without the single-entry property. Furthermore, it is known that the class of deterministic multi-module single-entry NWAs enjoys unique minimization. The subclass of deterministic SHA s to which our novel minimization algorithm applies is different though, in that we do not impose multiple modules. As further optimizations for reducing the sizes of the constructed SHA s, we propose schema-based cleaning and symbolic representations based on apply-else rules that can be maintained by determinization. We implemented the optimizations and report the experimental results for the automata constructed for the XPathMark benchmark.

1. Introduction

Nested words are hierarchical structures that are omnipresent in computer science. They were used to represent sequences of data trees, like XML or JSON documents, and to analyze the call structure of recursive programs. The idea of nested words is to generalize on both words and trees, resulting in sequences of unranked trees that are also known as hedges. Otherwise, nested words can be obtained by enriching Dyck words with internal letters, besides opening and closing parentheses. Furthermore, nested words are the elements of the least set containing internal letters from a given alphabet, triples consisting of an opening parenthesis, a nested word, and a closing parenthesis, and all sequences of nested words. Last but not least, nested words can be seen as words over an alphabet with internal letters, opening parentheses, and closing parentheses, under the conditions that the parenthesis are well nested, so that every opening parenthesis is properly closed and every closing parenthesis properly opened.
From the viewpoint of formal language theory, a natural question is how to lift the notions of finite automata and regular expressions, from words and trees to nested words, while preserving their well-known relationships. Nested word automata ( NWAs ) were heavily studied since the 1980s  [1,2,3,4], under the name input-driven automata. They are the same as visibly pushdown automata [5], pushdown forest automata [6], and streaming tree automata [7]. NWAs can recognize the same languages of unranked trees as hedge automata [8], a generalization of tree automata for ranked trees [9]. NWAs are often defined as pushdown automata with visible stacks, meaning that exactly one symbol is pushed when reading an opening parenthesis, and exactly one symbol is popped when reading a closing parenthesis, while the stack is not used otherwise. Their main advantage is a powerful notion of determinism, generalizing both over bottom-up and top-down determinism of tree automata for ranked trees [2,3]. We note that general pushdown automata do not permit determinization in contrast.
Regular expressions for nested words were proposed more recently by Hosoya and Pierce [10] under the name of regular expression types. In the present article, we will call them nested regular expressions ( NREs ) instead. Independently, more complex notions of nested regular expressions were introduced  [11,12] in order to deal with generalizations of nested words with dangling opening and closing parentheses, which are not of interest to us. It was already claimed in [10], that our simpler notion of NREs has the same expressiveness as hedge automata [8,9], which in turn have the same expressiveness as NWAs  [3]. However, the question under which conditions NREs can be compiled to small deterministic NWAs has not been studied. For classes of NREs for which deterministic NWAs can be computed in polynomial time, we can decide language inclusion or equivalence in polynomial time too. For other classes, these problems may not be feasible since language inclusion for nondeterministic NWAs is EXP-complete.
Our concrete interest in the universality of deterministic NWAs is motivated by XML stream processing: we want to compute the certain answers of a navigational XPath query on an XML stream  [13,14], i.e., those elements that are selected in all possible futures of the stream. Whether an answer is certain is computationally hard, even for tiny syntactic fragments of navigational XPath  [14,15], but can be done in polynomial time for queries defined by deterministic NWAs  [16]. A natural question is, therefore, whether it is possible to compile navigational XPath queries as in the usual benchmark [17] to deterministic NWAs of reasonable size. Unfortunately, the existing compilers fail to do so [18], as they are based on NWA determinization for dealing with disjunction, negation, and recursive steps. Thereby, they produce huge deterministic automata even for very simple navigational XPath queries from the benchmark, or do not terminate after some hours.
In this article, we consider NREs for defining queries on nested words. For benchmarking with realistic example, we consider the navigational XPath queries in the XPathMark benchmark with only forwards axis, that we compiled to NREs of the same size up to a constant factor. The question is then whether these NREs can be compiled to reasonably small deterministic NWAs .
As a first approach, we distinguish a subclass of “deterministic” NREs that can be compiled in polynomial time to deterministic NWAs by generalizing on Glushkov’s construction of deterministic finite-state automata (DFAs) from “deterministic” regular expressions  [19,20]. However, the  NREs obtained by compilation from navigational XPath queries are rarely deterministic, so neither are the NWAs compiled from them. Moreover, as we cannot apply NWA determinization to them in practice as argued above, this first approach has a much too low coverage to reach the objective. Therefore, we will report it only at the end in Section 9.
For our second approach, we propose a novel variant of automata for nested words that we call stepwise hedge automata ( SHA s). Even though motivated by the wish to create deterministic automata for the NREs of our benchmark, they are of general interest: they generalize naturally on both (stepwise) tree automata [21] and on finite word automata. In contrast to stepwise tree automata, SHA s can not only recognize unranked trees, but also sequences thereof, i.e., hedges or nested words. Furthermore, SHA s can be determinized in a bottom-up and left-to-right manner by combining in a natural manner the determinization procedures for tree and word automata.
By adapting existing compilers for stepwise tree automata  [21], SHA s can be compiled to NWAs with the same language in linear time while preserving determinism. Conversely, NWAs can be compiled to SHA s in polynomial time, but at the cost of introducing nondeterminism. By compiling NWAs to SHA s, determinizing the SHA , and compiling the obtained deterministic SHA back to a deterministic NWA , we can determinize NWAs by determinizing the corresponding SHA s. This alternative determinization algorithm for NWAs is different from the usual determinization algorithm for NWAs  [2,3,18]. Indeed, it yields reasonably small deterministic NWAs for the NREs from the XPath benchmark.
Yet another alternative algorithm for determinizing NWAs can be obtained by compiling NWAs to SHA s and back, and then determinizing the NWAs obtained in this manner. When applied to back-and-forth converted NWAs , the usual NWA determinization algorithm turns out to be well behaved: it produces deterministic NWAs of reasonable size for all our benchmark NREs . This might be surprising, given that the same determinization algorithm behaved so poorly for the non-converted NWAs that were obtained from the benchmark NREs  directly.
We contribute two further solutions for producing deterministic NWAs for our benchmark NREs . These are both based on a direct compiler from NREs to SHA s. We can then determinize these SHA s, followed by compilation to deterministic NWAs . Otherwise, we can first compile the SHA s to NWAs and then determinize these NWAs.
The next question is why the new determinization algorithms for NWAs that use SHA s as intermediates work so nicely. In order to understand this, we need to investigate the relationship between NWAs and SHA s more deeply. Clearly, the  NWAs obtained via SHA s do all their work in a bottom-up and left-to-right manner, and nothing when moving top-down. We can characterize the subclass of NWAs with this restricted behavior syntactically by the (weak) single-entry property: it requires that all opening rules of the NWA go into into the same target state while popping the sources state onto the stack. Note that our single-entry property is weaker than the (multi-module) single-entry property studied previously [22,23,24], which in addition requires that the automaton can be split into at least 2 modules, one for the top level and one for the nested level. The NWAs obtained by compilation from SHA s all have the (weak) single-entry property (but not necessarily multiple modules). Therefore, when compiling NWAs to SHA s and back, the resulting NWAs also have the (weak) single-entry property. It seems that the usual determinization algorithm from NWAs is well-behaved when applied to NWAs with the (weak) single-entry property. The relationship between SHA s and (weak) single-entry NWAs seems sufficiently close, so that their determinization algorithms seem to operate in somehow similar manners.
It is known that the subclass of deterministic multi-module single-entry NWAs (also called call-driven automata) enjoys unique minimization  [22,23]. The separation of the module for the top level from the module for the nested level can be obtained w.l.o.g., by building a product with the NWA with two hedge states that distinguishes the two levels. In our application, minimization could thus be used to reduce the size of the deterministic NWAs produced by our four algorithms for converting NREs , with the hope to eventually obtain a unique outcome after minimization. However, as we will use some symbolic representations for sets of rules, the uniqueness will hold only for the non-symbolic counterpart. In any case, the number of states of the deterministic minimal NWAs obtained for the same NRE could be expected to become unique.
Motivated by our application, we found it more relevant to minimize deterministic SHA s rather than deterministic (weak) single-entry NWAs (despite of their close correspondence). As for the class of deterministic (weak) single-entry NWAs , a restriction is needed for the class of deterministic SHA s to obtain unique minimization. We could have required the existence of multiple modules as for multi-module single-entry NWAs . Instead, we restrict ourselves to deterministic SHA s for which the initial states for trees and hedges coincide. We then show that minimization for such SHA s can be reduced to the minimization of tree automata up to a novel encoding of hedges to binary trees.
We implemented all four algorithms for compiling our benchmark NREs to deterministic NWAs and report the experimental results. We have also implemented the novel minimization algorithm for SHA s with equal tree and hedge initial states, and used it in our experiments. We propose two further optimization methods for reducing the sizes of the constructed automata.
First, we introduce schema-based cleaning both for SHA s and NWAs . In our application, the schema expresses the Xml data model, stating that hedges must encode valid Xml documents. More generally, an automaton A can be cleaned relative to an automaton S for the schema, if the language of interest is the intersection L ( A ) L ( S ) rather than L ( A ) itself. The idea of schema-based cleaning is to keep only those transition rules of A that are used to recognize some hedge of L ( S ) . These transition rules can be computed from the product of A and S. Note that schema-based cleaning may change the language of the automaton. Only the intersection L ( A ) L ( S ) is preserved, not necessarily  L ( A ) .
Second, we propose a symbolic representations for SHA s based on apply-else rules. They help to represent more compactly a large number of apply rules produced by the determinization of SHA s. Before compiling SHA s to NWAs , however, we need to eliminate the apply-else rules. This is because we have not developed analogous symbolic representations for NWAs so far. A second limitation is that we have not implemented any minimization algorithm for NWAs at the time being.
The main improvement of this journal article compared with the conference version [25] is the addition of the minimization algorithm for the subclass of SHA s with equal tree and hedge initial states. Furthermore, we added the idea of schema-based cleaning and the symbolic representations for SHA s by apply-else rules. The experimental results were enhanced with minimization, symbolic representations of rules, and schema-based cleaning. All the nested regular expressions generated for the XPathMark benchmark queries that we consider, as well as their corresponding automata—when we could produce them—can be found at http://researchers.lille.inria.fr/niehren/complementary-material (accessed on 1 February 2021).
Related Work. In the present article, we restrict ourselves to nested words over signatures with a single opening parenthesis, a single closing parenthesis, and possibly many internal letters a , b . This permits us to simplify the presentation of nested regular expressions, the notions of NWAs and SHA s, their forth and back compilers, as well as the determinization algorithms. Note that any multi-module NWA for such signatures must have exactly 2 modules. From an application perspective, multiple parentheses can be encoded by using internal letters, that is a named opening parenthesis a by the word a · and a named closing parenthesis b by the word b · . When encoding XML documents as nested words, such some encoding is needed anyway in order to deal with the complex information in XML tags, and also to provide symbolic representations with else rules that are able to deal with infinite signatures.   
For the minimization of deterministic NWAs , general signatures with multiple parentheses raise additional problems. Chervet and Walukiewicz  [23] solved such problems by reducing the minimization for expanded CDAs to the minimization of CDAs. Gauwin, Muscholl, and Raskin [24] showed that the minimization for deterministic NWAs is NP-hard in the case with general signatures. Their approach is based on a reduction from the problem of minimal immersion for sequences of Dfas, for which they construct NWAs with an unbounded number of opening parenthesis and an unbounded number of entry states. Weak single-entry NWAs in our setting do not permit this. Neither do NWAs over fixed general signatures with a finite number of opening parenthesis.
Navigational XPath queries on XML documents can be formalized in the language CoreXPath [26], or more generally by nested regular path queries [27] on data trees. Nested regular path queries were introduced earlier under the name of the propositional dynamic logic (PDL) in the 1970s  [28], where they are applied to labeled graphs that generalize on data trees.
As certain query answering for XPath was considered difficult, the currently existing approaches to XPath query evaluation on XML streams [13,18] either approximate certain query answers based on nondeterministic machines or restrict the queries so that answer certainty can be decided without latency [15,29]. This also holds for recent streaming algorithms on words without nesting in the context of complex event processing [30].

2. Nested Words

Nested words are words with parentheses that are well nested. They can be identified with hedges, that is, sequences of internal symbols and unranked trees.
Nested words are constructed with opening and closing parentheses, respectively, 〈 and 〉. An unranked alphabet Σ is a possibly infinite set of so-called “internal” symbols, that does not contain the two parentheses. The set of nested words over Σ is denoted N Σ and is defined by the following abstract syntax:
h , h N Σ : : = ε | a | h | h · h where a Σ
The empty nested word is denoted by ε and assumed to be the neutral element of the composition operator ε · h = h = h · ε , which furthermore is assumed to be associative, i.e., h 1 · ( h 2 · h 3 ) = ( h 1 · h 2 ) · h 3 .
Nested words can be identified with hedges, i.e., words of trees and internal symbols. Seen as a graph, the inner nodes are labeled by the tree constructor and the leaves by symbols in Σ or the tree constructor. For instance, a · b · ε · c · d · ε corresponds to the hedge in Figure 1. A nested word of type tree has the form h .
Variants. Our notion of nested words accepts only well-nested words without dangling opening or closing parentheses in contrast to others  [3,5]. This will lead to simpler notion of regular expressions, avoiding the more complex operators as with visibly rational expressions  [12,31]. A less important difference is that we do not support labeled parentheses.
Labeled unranked trees. Labeled parentheses can be simulated by using internal letters. For instance, the labeled tree a ( b ( ) , c ( ) ) can be represented by the nested word of type tree a · b · c . In this way, the labeled tree a ( ) is represented by the nested word a , which is of type tree (while the internal letter a alone is not). Unranked sequences of subtrees, often called hedges and sometimes forests, can be composed by using the sequence operator.   
XML Documents. Our notion of nested words is sufficiently powerful to express general XML documents. An example of an XML document is given in Figure 2 and the representing nested word in Figure 3.
We use the names of XML elements as labels of the nested word, as well as the letters of UTF8 for the string data values. Further labels such as doc and elem are added to express the types of the XML data model document and element, respectively.
When it comes to querying for nodes in Xml documents, we will be interested in nested words encoding Xml documents, in which a unique node is marked. We will use the label x to mark the selected node and the label ¬ x for all others. When marking the date in the XML document of Figure 2, we obtain the nested word in Figure 4.

3. Nested Regular Expressions

We present nested regular expressions ( NREs ), which were introduced under the name regular expression types in the context of XDuce [10] up to minor details. Note that similar nested regular expressions for ranked trees are folklore in the context of tree automata [32].

3.1. Syntax and Semantics

Let the alphabet Σ be a set. An  NRE over Σ is a term describing a language of nested words. It has the following abstract syntax where a Σ :
E , E : : = ε a _ E · E E + E E & E E * E ¯ E μ a . E
The μ a . E expressions are the same as in μ -calculus [33], except that we restrict them such that all occurrences of a in E are nested below parentheses. Otherwise, nonregular languages could be defined such as with μ a . ( b · a · c + ε ) whose language would be { b n · c n n N } . We also forbid intersections and complements in expression μ a . E on all paths between the μ a -operator and the occurrences of a in E that are bound by this operator. The expressions μ a . E allow for vertical recursion, while the expressions with the Kleene star E * support horizontal recursion.
Our syntax allows for conjunctions E & E and negations E ¯ , which are well known to not add expressiveness if Σ is finite. They are still relevant from the viewpoints of modeling, and for the treatment of infinite signatures. This comes at the price of increasing the complexity, as for the well-known case of words [34].
For infinite signatures, we can define for any finite subset Σ of labels the language of single-letter words Σ \ Σ by some NRE . This can be seen as follows. If Σ = , then the expression _ does the job: it matches exactly the set of all labels in Σ . Moreover, if Σ is nonempty then we can use negation. For instance, if Σ = { a , b } then the expression a + b ¯ & _ describes the language Σ \ Σ .
The sets of free and bound letters fn ( E ) and bn ( E ) are defined as usual. The only binder μ a . E binds the symbol a with scope E. Note that f n ( _ ) = .
There are three differences with respect to the regular expression types from  [10]. First, our NREs treat labels as internal symbols instead of labels of parentheses. Second, they provide recursion through the μ -operator instead of using recursive equation systems. Third, conjunctions and general negations are not considered there.
Any NRE E describes a language L ( E ) of nested words that we define by induction on the structure of E as follows: N Σ is the set of nested words over Σ , as defined in Section 2.
L ( ε ) = { ε } L ( a ) = { a } L ( _ ) = Σ L ( E · E ) = L ( E ) · L ( E ) L ( E ¯ ) = N Σ \ L ( E ) L ( E + E ) = L ( E ) L ( E ) L ( E & E ) = L ( E ) L ( E ) L ( E ) = { h h L ( E ) } L ( μ a . E ) = n 0 L ( μ n a . E ) L ( E * ) = L ( E ) * L ( ) =
For all expressions— E , E 1 , and E 2 , the notation E [ E 1 / E 2 ] stands for the expression E where all the occurrences of E 1 have been replaced by E 2 . The semantics of a μ -operator is then defined using the shortcuts μ 0 a . E = E [ a / ] and μ n a . E = E [ a / μ n 1 a . E ] for all n 1 . In particular L ( μ a . _ ) = L ( _ ) = Σ , so that a L ( μ a . _ ) . The semantics of the complement expression L ( E ¯ ) is the complement of L ( E ) in the set of all nested words, that is N Σ \ L ( E ) .

3.2. XPath  Example

We now show how to express navigational XPath queries by NREs that are restricted to forward axis. The idea is to adapt the spirit of a generate-and-test algorithm for query answering. The generation produces a nested word from Xml documents by guessing a single node and marking it by x. This node is a candidate for a query answer that is to be tested. The test is done by a NRE .
For expressing XPath queries with child and descendant-or-self axes we will use the following NREs where a fn ( E ) :
T = df μ a . ( a + _ ) * c h ( E ) = df T · E · T c h * ( E ) = df μ a . ( E + c h ( a ) ) c h + ( E ) = df μ a . ( c h ( E ) + c h ( a ) )
For instance, consider the XPath query A5 from the XPathMark benchmark  [17]:
/site/closed_auctions/closed_auction[descendant::keyword]/date
Applied to the above Xml document, it selects all date children of closed_auctions nodes that contain at least one keyword descendant. Query A5 can be compiled to the following NRE , which will accept the nested word in Figure 4 in particular:
d o c · _ · e l e m · s i t e · _ · c h ( e l e m · c l o s e d _ a u c t i o n s · _ · c h (
    e l e m · c l o s e d _ a u c t i o n · _ · ( c h + ( e l e m · k e y w o r d · _ · T ) & c h ( e l e m · d a t e · x · T ) ) ) )
The only label that the expression _ may match on a document that is properly annotated with the variable x will be the letter ¬ x Σ . The label x is annotated to the marked node, which is tested for being selected by the query. The label ¬ x is annotated to all nodes except a unique x-marked node.
Note also that the μ -operator of the c h + ( ) -expression expresses the recursion of the descendant axis. Furthermore, the conjunction permits us to connect the main path of A5 with its only filter.

3.3. XPath Benchmark

For testing NREs , we rely on the usual XPathMark benchmark  [17]. We restrict ourselves to navigational path queries with forward axis: child, descendant, and following-sibling. We notice that the following axis is excluded in contrast to following-sibling, as following is not strictly forwards. We can also admit path composition and filters with conjunction, disjunction, and negation.
The XPath queries of the benchmark satisfying these restrictions are the queries A1, , A8 and B3 given in Figure 5. We developed a more general compiler from navigational forward XPath queries to NREs , which yields the NREs in Figure A1 of the Appendix B for the benchmark XPath queries. The  NREs for A1–A3 do have neither conjunctions nor negations, while the queries A4–A8 contain filters, which are mapped to conjunctions in NREs . The compiler uses the μ -operator to capture the recursion of descendant axis as in A2, A3, and A5. Furthermore, nondeterminism is introduced by disjunctions in filters as in A7 and A8. Conjunction in filters appears in A6 which is mapped to conjunctions in NREs too. A detailed description of this compiler is not in the scope of the present article though.

4. Nested Word Automata

Nested word automata ( NWAs ) are pushdown automata reading nested words, whose stacks are visible: they push a single stack symbol when reading an opening parenthesis, pop a single stack symbol when reading a closing parenthesis, and do not alter or inspect the stack otherwise.
Definition 1.
An NWA is a tuple A = ( Q h , Q t , Σ , Γ , Δ , I , F ) consisting of a possibly infinite set Σ of internal symbols; finite sets Q h and Q t of states of type hedge and tree, respectively; sets of initial and final states I , F Q h ; a finite set Γ of stack symbols; and a finite set Δ of transition rules of the forms:
hedge rules a Δ , _ Δ , ε Δ Q h × Q h where a Σ opening rules γ Δ Q h × Q h where γ Γ tree rules T Δ Q h × Q t closing rules γ Δ Q t × Q h
Our NWAs are symbolic, in that they come with else rules, i.e., elements of ( q , q ) _ Δ that we will denote by q _ q , for dealing with large or infinite alphabets.
An example for an NWA is given in a graphical syntax in Figure 6. Tree states are drawn in circles that are filled in light gray Algorithms 14 00068 i001, while hedge states are in unfilled circles q . Initial states are drawn as q and final states as Algorithms 14 00068 i002. Hedge rules that have the form ( q 1 , q 2 ) o Δ where o Σ { _ , ε } are denoted by q 1 o q 2 , while any tree rule ( q 1 , q 2 ) T Δ is denoted q 1 q 2 . Opening rules ( q 1 , q 2 ) γ Δ are represented as q 1 🠆 γ q 2 and closing rules ( q 1 , q 2 ) γ Δ as q 1 🠆 γ q 2 .
Our notion of NWAs supports factorization in the spirit of the work in [35]. It is obtained by distinguishing two types of states, q Q h and p Q t , and adding explicit type coercion rules q p . Semantically, both kinds of states could be merged when replacing the type coercion rules by the epsilon rule q ε p , but at the cost of introducing additional nondeterminism. This may lead to quadratically larger deterministic automata, as we will illustrate at the NWA in Figure 20.
The language of nested words between two states q 1 , q 2 Q h is defined as the least language such that
L q 1 , q 2 ( Δ ) = { ε if q 1 = q 2 or q 1 ε q 2 Δ } q 3 Q h L q 1 , q 3 ( Δ ) · L q 3 , q 2 ( Δ ) { a if q 1 a q 2 Δ or ( q 1 _ q 2 Δ and ¬ q 2 . q 1 a q 2 Δ ) } { h q 1 , q 2 Q h . q 3 Q t . γ Γ . q 1 γ q 1 , h L q 1 , q 2 ( Δ ) , q 2 q 3 Δ and q 3 🠆 γ q 2 Δ } .
The language of the NWA is then L ( A ) = q 1 I , q 2 F L q 1 , q 2 ( Δ ) .

4.1. Determinization of NWAs

Determinization for NWAs was first studied by von Braunmühl and Verbeek [2] in the 1980s, where NWAs are named input-driven pushdown automata. We notice that the determinization algorithm was published only in the journal version of this paper, but not in the conference version. Later on, the same algorithm was rediscovered in the context of visibly pushdown automata and republished for nested word automata.
Definition 2.
An NWA A is called deterministic or equivalently a dNWA if
  • I contains at most one element;
  • there is no epsilon rule, i.e., ε Δ = ,
  • a Δ and _ Δ are partial functions from Q h to Q h for all a Σ , and T Δ is a partial function from Q h to Q t ;
  • for all q Q h and γ Γ there exists a most one q Q h such that q γ Δ ; and
  • γ Δ is a partial function from Q t to Q h for all γ Γ .
Proposition 1
(von Braunmühl and Verbeek [2]). A NWA with n states can be determinized in time O ( 2 n 2 ) .
Many of our results are based on the determinization algorithm going back to von Braunmühl and Verbeek. For self-containedness, we recall the version of this algorithm that we will use in the Appendix A. For illustrations, the determinization of the NWA in Figure 6 is also presented here too, see Figure A1. It has size 271 while the nondeterministic NWA has size 39 (12 states + 2 letters + 3 stack symbols + 22 rules). The blow-up is even worse in general as our experimental results will show and as noticed earlier by [18].

4.2. Multi-Module NWAs

Multi-module NWAs will play a prominent role for our NWA constructions and are relevant for minimization [23]. For signatures with a single opening parenthesis, each multi-module NWAs has exactly two modules, one for the top level and one for the nested level.
We can define multi-module NWAs based on the natural notion of homomorphisms for NWAs . A homomorphism from an NWA A to an NWA A with the same signature is a triple of functions ( α h : Q h Q h , α t : Q t Q t , β : Γ Γ ) that maps all concepts of A to the corresponding concepts of A . These concepts are hedge initial states, final states, opening, closing, internal, and tree transitions. We do not enforce the preservation of epsilon rules by homomorphisms.
Definition 3.
A multi-module NWA A is an NWA for which there exists a homomorphism from A to the NWA T in Figure 7.
The NWA T evaluates all top level positions of a nested word to state 0: all those positions that are not between parentheses. All nested positions are evaluated to state 1. The homomorphism of a multi-module NWA A to T thus partitions the states of A between those that can be assigned to top level positions, and the others that can be assigned to nested positions.

4.3. Compilation of NREs to NWAs

We next discuss a compiler from NREs E to NWAs nwa ( E ) . This compiler extends on the McNaughton–Yamada–Thompson algorithm [36] for regular expressions, which introduces epsilon edges for constructing the automata of composition E · E .
Theorem 1.
For any NRE E, we can construct an NWA A such that L ( A ) = L ( E ) . If E contains neither conjunctions nor negations, then the construction is in time O ( | E | ) .
Proof sketch.
Conjunctions E & E are compiled to products of automata, so repeated conjunctions may lead to an exponential blow up. Negations E ¯ are computed by complementing automata based on determinization. Each complementation may lead to an exponential blow-up, so when this is repeated, the construction may become non-elementary.
For expressions without conjunction and negation, no such blow-up may arise. As stated by the theorem, we have to show that expressions can be compiled in linear time.
Case E = E · E :
We use the McNaughton–Yamada–Thompson algorithm for composing the NWAs of nwa ( E ) and nwa ( E ) .
Case E = E :
Let Q h , Q t , and Γ be, respectively, the set of hedge states, tree states, and stack symbols of nwa ( E ) . We consider new hedge states q i and q f that are not in Q h , a new tree state p not in Q t and a new stack symbol γ not in Γ . Then, nwa ( E ) is constructed by adding to nwa ( E ) opening rules q i 🠆 γ q for all the initial states q of nwa ( E ) , tree rules q p for all the final states q of nwa ( E ) and a closing rule p 🠆 γ q f . Furthermore, we set q i as the only initial state of nwa ( E ) , and q f as its sole final state.
Case E = μ a . E :
Special care has to be given to repeat expression μ a . E . First of all, the naive compilation approach for these expression turns out to be wrong. Second, fixing the problem in the simplest possible manner does not lead to a linear time algorithm.
Note that we can assume w.l.o.g. that a occurs at most once in E by using the golden lemma of the μ calculus [37], stating for all names a 1 , , a n and expressions E in which a 1 , , a n can appear free that μ a 1 . . μ a n . E μ a . E [ a 1 / a , , a n / a ] . Our construction guarantees that all transitions of the form q a q in nwa ( E ) will start with the same state q. The wrong naive construction would remove the transitions q a q from nwa ( E ) and add ε -rules from q to all the initial states of nwa ( E ) , and from all final states of nwa ( E ) to q . Unfortunately, the construction is not correct. For illustration, we consider the NRE E = μ a . a * . The reader should be warned that constructing an NWA for E is less trivial than it might seem at first sight. One has to start from the NWA for a * which is given in Figure 8. Simply adding epsilon edges to capture the operator μ a will not work though. It will lead to the wrong automaton in Figure 9. This automaton will wrongly accept the hedge , as this hedge does not belong to L ( E ) .
If the NWA for E is multi-module, then the naive construction of compiling μ a . E can be made correct. Therefore, the simplest fix is to make the NWA multi-moduled, before applying the naive construction. This can be achieved by typing the states of the automaton, by states of the NWA T in Figure 7. The added types yield the homomorphism of the constructed automaton to T .
The naive algorithm is then adapted as follows. Let P be the multi-module NWA obtained from the product of nwa ( E ) and T . Note that we keep only the accessible top level states (type 0), but all nested states (type 1). In our example, this yields the NWA in Figure 10. We then remove transition ( q , 1 ) a ( q , 1 ) and add ε -rules from state ( q , 1 ) to all states in I × { 1 } , and from all states in F × { 1 } to ( q , 2 ) , where I and F are, respectively, the set of initial and final states of nwa ( E ) . Then, P recognizes L ( μ a . E ) . The result obtained in the example is shown in Figure 11.
The algorithm described so far makes the NWA multi-moduled before compiling a μ -operator. For this, two copies of all states are introduced. This, however, could lead to an exponential construction if multiple μ -operators are nested. This problem can be avoided by preserving multi-moduledness as an invariant. Whenever a new state is created, it is created twice: once for the top level and once for the nested level. This information is maintained by typing the states, so that no further copies of the same state are produced later on.
We omit the correctness proof of this construction. □

4.4. Experimental Results Starting with the NWA  Compiler

In the first two column of Figure 12, we report the sizes of the NWAs obtained from NREs by our compiler, and the size of the deterministic NWAs produced thereof. For each automaton, we give its total size and in parentheses the number of states.
The sizes of the nondeterministic NWAs produced by the compiler for the NREs for A1–A8 and B3 are given in column n w a ( . ) of Figure 12. Note that the NWAs are cleaned so that only accessible and co-accessible states remain. The sizes of the nondeterministic NWAs are acceptable for all NREs , except for A8, for which the NWA has more than 3000 states and an overall size greater than 10 , 000 . This can be partially explained by the fact that the NRE for A8 contains three conjunctions (one for the filter and two for the conjunctions in the filter). Still, the number of states remains surprising.
The determinized NWAs are given in column d e t ( n w a ( . ) ) . It turns out that only A2 and A3 could be determinized successfully with some few hours of computation time on a standard laptop. However, even in the successful cases, the resulting deterministic NWAs are simply huge. This confirms similar problems first noticed in [18] and not solved since then.
The remaining columns of Figure 12 based on the back-and-forth compiler from SHA s to NWAs from the following Section 6. They show that better determinization algorithms can indeed be obtained, yielding NWAs of acceptable size for all benchmark queries, with the exception of A8. The idea of d e t ( n w a ( s t e p ( n w a ( . ) ) ) is to compile the NWAs obtained from the NREs to stepwise hedge automata and back before applying the above algorithm for NWAs . This might be surprising, as this determinization algorithm failed for the original NWAs , while it now proves successful on the forth-and-back transformed NWAs .

5. Stepwise Hedge Automata

We propose SHA s as an extension of stepwise tree automata [21] that allows to recognize not only unranked trees but also hedges. We avoid more classical hedge automata from [9] that were already introduced in 1967 by Thatcher [8], as their notion of determinism is problematic. For instance, it makes unique minimization fail [38] and universality hard.
Our notion of SHA s will be symbolic in using else rules, and factorized as in [35]: there are two types of states for hedges and trees and an operator for explicit type coercion. We also propose a novel treatment of internal letters inspired by nested word automata, so that SHA s generalize both on stepwise tree automata and on Nfas.
Definition 4.
A SHA is a tuple A = ( Q h , Q t , Σ , Δ , I , F ) such that Q t and Q h are finite sets of states of two types: t for tree and h for hedge, respectively; Σ an alphabet of internal letters (that may be infinite); I , F Q h are subsets of hedge initial and final states, respectively; and Δ is a finite set of transition rules such that for all q Q t and a Σ :
hedge rules q Δ , a Δ , _ Δ , ε Δ Q h × Q h tree final rules T Δ Q h × Q t tree initial states Δ Q h
An example for a SHA is given in graphical syntax in Figure 13. It recognizes all hedges that are either just a or b or contain some tree node that contains either just a or b. In the graphical syntax, the states of type tree q Q t are drawn in circles filled in light gray Algorithms 14 00068 i001, while the states of type hedge q Q h are drawn in unfilled circles q . The right part of the graph is an Nfa which uses tree states as additional edge labels, while the left part is a stepwise tree automaton that defines the tree languages of these tree states.
Let Δ h be the restriction of Δ to the hedge rules. Then, ( Q h , Σ Q t , Δ h , I , F ) is a standard Nfa with ε -rules, which is symbolic [39] in providing else rules for dealing with large or infinite alphabets in addition. Therefore, we denote the hedge initial states q I by h q and the final states q F by Algorithms 14 00068 i002. A rule with an internal letter ( q 1 , q 2 ) a Δ is denoted by q 1 a q 2 Δ stating that a hedge in state q 1 can be extended by the internal letter a leading to a hedge in state q 2 . Similarly, an epsilon rule ( q 1 , q 2 ) ε Δ is denoted by q 1 ε q 2 , and an else rule ( q 1 , q 2 ) _ Δ is denoted by q 1 _ q 2 . In the same spirit, a hedge rule ( q 1 , q 2 ) q Δ —also called apply rule—is denoted by q 1 🠆 q q 2 Δ , stating that a hedge in state q 1 can be extended by a tree in state q leading to a hedge in state q 2 .
A tree initial state q Δ is graphically denoted by t q and a tree final rule ( q 1 , q 2 ) T Δ by q 1 q 2 . Intuitively, a tree h can be evaluated to state q if h can be evaluated starting with some tree initial state q 1 Δ to some state q 2 such that q 2 q Δ . More formally, the hedge languages L q 1 , q 2 ( A ) between any two hedge states q 1 , q 2 Q h are defined as
L q 1 , q 2 ( A ) = { ε if q 1 = q 2 or q 1 ε q 2 Δ } q 3 Q h L q 1 , q 3 ( A ) · L q 3 , q 2 ( A ) { a if q 1 a q 2 Δ or ( q 1 _ q 2 Δ and ¬ q 2 . q 1 a q 2 Δ ) } q 1 🠆 q q 2 Δ L q ( A )
This definition is mutually recursive with the definition of the tree languages L q ( A ) of all tree states q Q t :
L q ( A ) = { h t q 1 Δ , h L q 1 , q 2 ( A ) , q 2 q Δ }
The hedge language L ( A ) that is recognized by the automaton is q 1 I , q 2 F L q 1 , q 2 ( S ) . The rules of standard bottom-up tree automata have the form a ( q 1 , , q n ) q where a is a symbol of arity n. With SHA s, this rule can be encoded by the sequence t p 0 a p 1 🠆 q 1 🠆 q n p n q where the states q 1 , , q n , q are all tree states, and p 0 , , p n new hedge states.

5.1. Determinization of SHA s

We formalize the notion of determinism for stepwise hedge automata and show how determinization works.
Definition 5.
A SHA ( Q h , Q t , Σ , Δ , I , F ) is deterministic or equivalently a dSHA , if it satisfies the following conditions:
  • I contains at most one element;
  • Δ contains at most one element;
  • there is no epsilon transition, i.e., ε Δ = ;
  • a Δ , q Δ , _ Δ are partial functions from Q h to Q h for all a Σ and q Q t ; and
  • T Δ is a partial function from Q h to Q t .
Proposition 2.
A SHA of size n can be made deterministic in time O ( 2 n ) while preserving the hedge language.
Proof. 
The determinization procedure for SHA s combines the determinization algorithms of word and tree automata in the natural manner, while eliminating epsilon transitions. Let ε Δ * be the reflexive and transitive closure of ε Δ , and for any subset Q Q h Q t let ε Δ * ( Q ) = q Q ε Δ * ( q ) . Given a SHA A = ( Q h , Q t , Σ , Δ , I , F ) , we define an equivalent deterministic SHA d e t ( A ) = ( Q h det , Q t det , Σ , Δ det , I det , F det ) such that Q h det = 2 Q h , Q t det = 2 Q t , I det = { ε Δ * ( I ) } and F det = { Q Q h Q F } . There is a unique tree initial state in Δ det = { ε Δ * ( Δ ) } and no ε -rule, that is, ε Δ det = . The inference rules in Figure 14 define the missing part of Δ det .
We can show for all Q 1 , Q 2 Q h and P Q t that
L Q 1 , Q 2 ( det ( S ) ) = q 1 Q 1 , q 2 Q 2 L q 1 , q 2 ( S )
so that L P ( det ( S ) ) = q Q L q ( S ) . Therefore, L ( d e t ( S ) ) = Q F det L I , Q ( d e t ( S ) ) and thus L ( d e t ( S ) ) = q 1 I , q 2 F L q 1 , q 2 ( S ) = L ( S ) . □
For illustration, the deterministic SHA in Figure 15 is obtained by determinization of the SHA in Figure 13.

5.2. Compilation of NREs to SHA s

As for NWAs , we introduce the notion of multi-module SHA s for which the sets of hedge states are partitioned between those that can evaluate top level positions and those to which nested positions are assigned. Therefore, multi-module SHA s will have exactly two modules too.
Definition 6.
A SHA A = ( Q h , Q t , Σ , Δ , I , F ) is a multi-module SHA if there is a subset of states Q h 0 Q h , that we call top level states, such that
  • I Q h 0 and
  • the states in Q h 0 can reach only other states in Q h 0 via Δ.
For instance, consider the multi-module SHA in Figure 13. The states of module for the top level are Q 0 = { 1 , 2 , 3 , 4 , 5 , 6 , 12 } . The others belong to the module for the nested level.
Any NRE E can be compiled to a multi-module SHA step ( E ) = ( Q h , Q t , Σ , Δ , I , F ) such that Q t = { E E = E subexpression of E } and L t ( E ) = L ( E ) for all tree states E Q t . The SHA step ( E ) can be partitioned into disjoint SHA s step ( E ) = A t o p E Q t A E such that A t o p = ( Q h t o p , Q t , Σ , Δ t o p , I , F ) and A E = ( Q h E , Q t , Σ , Δ E , , ) for all E Q t and Δ t o p = . Note that the transitions relation Δ is decomposed thereby into independent connected components. The automaton A t o p can be identified with an Nfa with signature Σ Q t given that it has no tree initial states. The automata A E are stepwise tree automata that recognize the tree language L ( E ) when taking E as final state. For this, they may have tree initial states, but will not have any initial nor final states.
Theorem 2.
For any NRE E, we can construct a SHA A such that L ( A ) = L ( E ) . If E contains neither conjunctions nor negations, then the construction is in time O ( | E | ) .
Proof sketch.
For the case of expressions with conjunctions or negations, the construction is analogous to the way it is done for NWAs . We next sketch the construction of SHA s for expressions without conjunction and negation.
Case E = E · E :
We use McNaughton–Yamada–Thompson algorithm [36] for composing the multi-module Nfas of step ( E ) and step ( E ) . The stepwise tree automata A E of the subexpressions of type tree are preserved. For succinctness, if some subexpression E occurs more than once, then only a single copy of A E is kept. References to states of the removed copy should be renamed to their equivalent counterparts.
Case E = E :
We construct step ( E ) from step ( E ) . The initial states of step ( E ) are turned into tree initial states. We then add a new tree state E and connect it to all final states of step ( E ) by a tree final rule q E . Furthermore, the previously final state q becomes non final. Finally we add a new initial state q i , a new final state q f and a transition rule h q i 🠆 E q f .
Case E = μ a . E :
The main idea of the construction is similar to the case of NWAs . The correctness argument relies on the invariant that only multi-module SHA s are built.
Again by the golden lemma of the μ -calculus, we can assume w.l.o.g. that a occurs at most once in E . By using ε -rules, we can preserve the invariant that there will be at most one pair ( q , q ) such that q a q in step ( E ) . Furthermore, these transitions cannot be on top level, given that the occurrence of a in E must be nested below parentheses. The automaton step ( E ) is obtained from step ( E ) by first copying the top level Nfa of step ( E ) , as in Figure 16. We thus obtain two versions for each state of the top level Nfa of step ( E ) : one referred to as the top level copy— q 0 , 0 and q 3 , 0 in Figure 16, and another one as the nested level – q 0 , 1 and q 3 , 1 in Figure 16. Only top level states may be initial or final. Then, we add ε -rules from q to the nested states that correspond to the initial states of step ( E ) , and from the nested states corresponding to the final states of step ( E ) to q . Finally we remove the rule q a q . The resulting automaton is shown in Figure 17.
Note that every transition added for a state–top level or nested—in a subsequent step of the construction—except the ε -rules added for μ -expressions—must also be added for its copy.
The construction is correct as the μ -bound name a is nested below parenthesis in E . Therefore, it can be shown that the ε -edges introduced cannot be used to produce unwanted order in successful runs. Maintaining this invariant in polynomial time requires an additional argument. Instead of copying the top level parts of subexpressions, each state is introduced twice during the construction: one version for nesting, and another one for being part of top level parts. This way the size of the automaton is not doubled at each step, but only once.
We omit the correctness proof of this construction. □
Unlike NWAs , one cannot preserve the determinism of the expressions of nregexp ( c h , T ) in SHA s, even with Glushkov-like constructions. For instance, for the deterministic NRE a 1 · a 2 · · a n , one would have an SHA having a tree initial state for each of the a i subtree, implying nondeterminism.

5.3. Experimental Results Starting with the SHA  Compiler

In the first two columns of Figure 18, we report the size of the SHA s obtained from NREs by our compiler, and the size of the deterministic SHA s produced thereof.
The SHA compiler yields automata of acceptable size from the NREs of all benchmark queries. These sizes are given in the first column s t e p ( . ) of Figure 18. This even holds for A8, in contrast to the case where the produced SHA has overall size 1106 and 267 states.
The determinization of the SHA s in the second column d e t ( s t e p ( . ) ) even yields smaller automata in all cases. For A8, we obtain a deterministic stepwise automaton of overall size 749 and with 123 states. This might be surprising, in that the determinization algorithm may lead to an exponential blow-up in the worst case. However, it may also clean the automaton, keeping only accessible sets of states. This is what seems to happen systematically on the benchmark with the exception of A2, where the size goes up by a factor of four and A5 where the size doubles. For A2 the number of states grows by one third, while for A5 it decreases by one third.
Based on the back-and-forth compiler from SHA s to NWAs from following Section 6, we can obtain deterministic NWAs of acceptable size for all benchmark queries. The method n w a ( d e t ( s t e p ( . ) ) ) yield for A a dNWA of size 2831 and with 124 states. The alternative method d e t ( n w a ( s t e p ( . ) ) ) yields a dNWA of size 2520, which is even smaller, and the same number of states.

6. NWAs versus SHAs

We next show how to compile SHA s to NWAs such that determinism is preserved, and back while introducing nondeterminism. Thereby, we can obtain small NWAs for NREs such as E = c h * ( a + b ) for which d e t ( n w a ( E ) ) blew up in size in a surprising manner (see Figure 12).

6.1. SHAs to NWAs

As a first step, we introduce a transformation on SHA s, so that for any SHA A:
  • if A is deterministic, the transformation returns A, and
  • if A is nondeterministic with set of hedge states Q h and transition relation Δ , the transformation returns a new SHA A with set of hedge states Q h = Q h { q t init } where q t init is a new hedge state, and set of transitions Δ which equals Δ except that Δ = { q t init } and ε Δ = ε Δ { ( q t init , q ) q Δ } .
Then, we compile any SHA A = ( Q h , Q t , Σ , Δ , I , F ) obtained after the above transformation to an NWA n w a ( A ) = ( Q h , Q t , Σ , Γ , Δ , I , F ) such that L ( A ) = L ( n w a ( A ) ) . We set Γ = Q h , _ Δ = _ Δ , a Δ = a Δ for all a Σ , ε Δ = ε Δ , T Δ = T Δ :
q 1 🠆 q 1 Δ p Δ q 1 🠆 q 2 p Δ and q 🠆 q 1 q 2 Δ
Clearly, if S is deterministic then so is n w a ( S ) , as p is unique in this case in particular. One might be tempted to skip the first-step transformation and restrict the above construction rule to states p such that L q ( S [ Δ / { p } ] ) . However, this would lead to a huge blow-up when determinizing these NWAs , basically as this change spoils the single-entry property discussed in Definition 7.
The conversion of step ( c h * ( a + b ) ) in Figure 13 yields the NWA in Figure 19. Note that the opening rules are deterministic (but not the whole NWA ), as for all tree states q there is at most one hedge state p with p such that q is accessible from p. The NWA has size 64, while its determinization has size 159 (see Figure A3 of the Appendix C). The size increase raised by determinization is thus 95 = 159 64 for this NWA .
The size increase for determinization is considerably smaller for the NWA obtained from the regular expressions by indirection via a SHA than for NWAs obtained by direct compilation. Indeed, the determinization of nwa ( c h * ( a + b ) ) blows the size from 39 to 271. The size increase for the determinization of nwa ( c h * ( a + b ) ) is thus 242 = 271 39 while for nwa ( step ( c h * ( a + b ) ) ) it is only 95 = 159 64 .
The experiments will show that this is not an exception but the general rule. Intuitively, the reason is that NWAs obtained from SHA s do all the work bottom-up, where NWAs obtained directly from the regular expression do a considerable amount of work top-down. In terms of the work in [22], this restriction can be characterized syntactically by the single-entry property:
Definition 7.
A (weak) single-entry NWA A = ( Q h , Q t , Σ , Γ , Δ , I , F ) is a NWA for which there exists a single state q e n t r y Q h such that all opening rules in Δ have the form q 🠆 q q e n t r y .
Note that call-driven automata (CDAs) discussed in [23] coincide with multi-module single-entry dNWAs and also with (multi-module) single-entry visibly pushdown automata [22,24].
It can be shown that nwa ( S ) has the (weak) single-entry property for all SHA s S for which the p’s are unique in the above construction rule, i.e., such that Δ = { p } . Note that this was not the case for step ( c h * ( a + b ) ) in Figure 13 but could have been imposed w.l.o.g. leading to a slightly different NWA than in Figure 19.
The conversion of the determinization d e t ( step ( c h * ( a + b ) ) ) in Figure 15 yields the deterministic NWA in Figure 20. The size goes up slightly from 53 to 73. It should be noticed that factorization avoids a quadratic blow up in this case. This can be observed at state 14, which has 3 incoming tree edges and 10 outgoing closing edges. Without factorization, the 3 tree edges could be replaced by 3 ε -edges whose elimination would produce 30 closing edges. This would increase the number 3 + 10 edges to 3 10 edges.

6.2. NWAs to SHAs

Conversely, NWAs can be compiled to stepwise hedge automata, but at the cost of introducing nondeterminism, as an NWA may traverse the branches of a tree top-down, while a stepwise must traverse them bottom-up. For this, the stepwise guesses the state in which the NWA will arrive from above and then evaluates the subtree starting with this state, while verifying the correctness of the guess later on. Let A = ( Q h , Q t , Σ , Δ , I , F ) be an NWA . We build a SHA step ( A ) = ( Q h s , Q t s , Σ , Δ s , I s , F s ) where Q h s = Q h × Q h , Q t s = Q h × Q t , I s = { ( q , q ) q I } , F s = I × F and Δ s is the smallest satisfying the following rules:
Algorithms 14 00068 i003
The construction is such that L ( A ) = L ( step ( A ) ) .
For the NWA nwa ( c h * ( a + b ) ) in Figure 6, we obtain the stepwise in Figure A2 up-to removing useless states and separating the top level. Determinization yields d e t ( step ( n w a ( c h * ( a + b ) ) = d e t ( step ( c h * ( a + b ) ) ) in Figure 15.

7. Optimizations

We will use three optimization methods for constructing smaller dSHA s and thus smaller dNWAs : minimization, symbolic representations of sets of transition rules, and schema-based cleaning.

7.1. Minimization

Our next objective is to reduce the size of deterministic SHA s by developing a minimization algorithm for a subclass of dSHA s. Even though our implementation can deal with them, we consider SHA s without symbolic rules q q for simplicity in this section.
We start with an example that motivates the choice of our subclass. In Figure 21 and Figure 22, two dSHA s are given that both recognize the language of all hedges with signature Σ = { x , y } containing exactly one occurrence of the letter x. The dSHA in Figure 22 is the dSHA recognizing this language which has the minimal number of states. The dSHA in Figure 21 is the minimal multi-module dSHA for this language. The question is how a minimization algorithm for dSHA s could convert the dSHA in Figure 21 to this minimal one in Figure 22. In particular, why would it merge the tree initial state and the hedge initial state? We do not see how this could be done based on some Myhill–Nerode-like equivalence relation. This motivates an a priori restriction to dSHA s imposing that the tree initial state and hedge initial state must be equal.
Note that any SHA can be converted into a dSHA with equal tree and hedge initial states. For this, it is sufficient to “fuse” these states and then to determinize the SHA obtained. When doing so for the dSHA in Figure 21, we indeed obtain the minimal dSHA from Figure 23, so no further minimization is needed in this case.
Given the close relationship between SHA s and weak single-entry NWAs , it is instructive to consider the existing results on minimization for dNWAs . It is known that the class of general dNWAs does not allow for unique minimization [22] and that the minimization becomes NP-hard when admitting general signatures with multiple parenthesis [24].
On the positive side, the best existing minimization algorithm is due to Chervet and Walukiewicz [23]. It applies to the subclass of multi-module single-entry dNWAs , called there call-driven automata (CDA) (Chervet and Walukiewicz [23] permit signatures with multiple opening parenthesis. In the case of a single opening parenthesis, the class of CDAs is equal to their subclass of expanded CDAs for which they develop their minimization algorithm in the first place.). They showed that the subclass of multi-module single-entry dNWAs enjoys unique minimization in polynomial time.
In the case of dSHA s, we believe that unique minimization holds for the following two subclasses, and will show it for the second:
  • the subclass of multi-module dSHA s, and
  • the subclass of dSHA s where the hedge and tree initial state are the same, i.e., Δ = I .
The first subclass of multi-module dSHA s is motivated by the subclass of multi-module single-entry dNWAs . Note, however, that the SHA s that are obtained by compilation from single-entry dNWAs need not be deterministic, so the analogy between both automata classes is not perfect. The dSHA in Figure 21 is minimal for the class of multi-module dSHA s.
The second subclass of dSHA s corresponds to the subclass of single-entry dNWAs in which the single-entry state is equal to the initial state. The dSHA in Figure 22 is minimal for the second subclass. In the remainder of this section, we present a minimization algorithm for the second subclass. For this, we identify dSHA s in which tree and hedge initial state coincide with two-sorted deterministic tree automata, so that we can use a minimization algorithm for the latter. Our automaton translation is based on a novel encoding of hedges into ranked well-sorted trees with monadic and binary function symbols, which is inspired by the previous binary encoding of unranked trees known from stepwise tree automata [21]. For any unranked signature Σ , as for the construction of hedges, we consider two sorts: h for hedges and t for trees. We then consider the following ranked signature with these two sorts:
Σ @ = { a ( h ) a Σ } { @ ( h × t h ) , ε ( h ) , T ( h t ) }
The well-sorted trees over Σ @ of both sorts then have the following abstract syntax:
well-sorted trees of sort h : τ : : = a ( τ ) @ ( τ , τ ) ε well-sorted trees of sort t : τ : : = T ( τ )
Any hedge over Σ can be encoded into a ranked well-sorted tree of sort h with signature Σ @ . For instance, the hedge
h = a · b · c · d · e · f
is encoded into the following ranked well-sorted tree of sort h over Σ @ :
[ [ h ] ] = ( f ( a ( ε ) @ τ ) ) where τ = T ( b ( ε ) @ τ ) and τ = T ( e ( d ( c ( ε ) ) ) )
Any SHA A = ( Q h , Q t , Σ , Δ , I , F ) with equal tree and hedge initial states, that is, Δ = I , can then be encoded into a two sorted tree automaton [ [ A ] ] = ( Q h , Q t , Σ @ , Δ , I , F ) by mapping the transition rules in Δ to those in Δ as follows:
Δ Δ q a q a ( q ) q q p T ( q ) p q 🠆 p q q @ p q q i Δ = I ε q i
We can first note that [ [ L ( A ) ] ] = L ( [ [ A ] ] ) . Second, the translation function is a bijection between SHA s over Σ and two-sorted tree automata over Σ @ . Furthermore, this translation preserves determinism. It follows that if A is a dSHA with a minimal number of states recognizing L ( A ) then [ [ A ] ] is a deterministic two-sorted tree automaton with a minimal number of states recognizing [ [ L ( A ) ] ] . Furthermore, the unique minimization of deterministic two-sorted tree automata implies the unique minimization of the class of dSHA s with equal tree and hedge initial states.
Using this translation back and forth, we can thus lift the minimization algorithm of deterministic two-sorted tree automata to a minimization algorithm for the subclass of dSHA s with equal tree and hedge initial states. This is the minimization algorithm for dSHA s that we have implemented. We then used it in our constructions to reduce the size of the dSHA s obtained by determinization.

7.2. Symbolic SHAs with Apply-Else Rules

The sizes of the dSHA s constructed so far are dominated by the number of transitions. We now propose a class of symbolic dSHA s by adding apply-else rules, in order to represent large numbers of apply rules in a more compact and symbolic manner.
An apply-else rule has the form q 🠆 _ q where q , q Q h . It represents the set of apply rules q 🠆 p q , where p Q t can be chosen arbitrarily from a subset of tree states distinguished by the automaton.
We have also adapted our determinization for dSHA s so that it preserves apply-else rules. What is missing so far is a concept for NWAs that corresponds to the apply-else rules of dSHA s. Therefore, we have to eliminate apply-else rules before translating SHA s to NWAs .

7.3. Schema-Based Cleaning

Automata for XPath queries recognize nested words that can be obtained by encoding Xml documents with a single x-marked position. The class of such nested words is characterized by a schema that we can define as the intersection of the two dSHA s in Figure 23 and Figure 24. The first SHA tests whether there is exactly one occurrence of the internal letter x, and the second one tests that the Xml data model is satisfied, and the node annotations with x and ¬ x are put at the right positions.
The automata constructed for the XPath queries may accept some trees that do not satisfy the schema (but will never be evaluated on such trees when answering the query). The idea of schema-based cleaning is to remove all transition rules and states that are not used for recognizing any nested word satisfying the schema. Schema-based cleaning of an automaton can be performed by constructing the product of the automaton with the schema, which is in our case an intersection of two dSHA s. We then only keep those states of the original SHA that are used in accessible and co-accessible states of the product with the schema.
Note that schema-based cleaning typically changes the language of the automaton. Different languages may be obtained when cleaning different automata for the same query with respect to the schema. If one is interested in a unique language, then one can choose the intersection of the automaton with the schema. This intersection, however, is usually larger than the automaton obtained by schema-based cleaning.

7.4. Experimental Results for Optimizations

The sizes of optimized automata for the benchmark queries are reported in Figure 25.
The function s t e p @ _ used in the first column compiles NREs to SHA s with apply-else rules. This does not change the number of states, but reduces the number of automata transitions. In the case of A8, the size of the stepwise automaton is reduced from 1106 to 894.
An optimized determinizer is applied by the function D ( . ) = d e t ( s t e p @ _ ( . ) ) in the second column. It preserves apply-else rules in particular. For A8, the size is reduced from 749 to 639 while the number of states is preserved.
Schema-based cleaning is applied by the function S ( . ) = s c h e m a c l e a n ( D ( . ) ) in the third column. For A8, the number of rules is reduced further from 639 to 527.
Minimization is applied by the function M ( . ) = m i n i ( S ( . ) ) in the fourth column. In the case of A8, it reduces the number of states from 117 to 101 and the size from 527 to 487.
In order to come back to dNWAs , we have to eliminate the apply-else rules in column six. For A8 this increases the number of rules back from 527 to 1257.
In the final column, we apply the compiler from SHA s to NWAs which preserves determinism. For A 8 , this results in a dNWA of size 1413 and 102 states. This is better than the previous results, in particular with respect to the number of states.

8. Summary of Experimental Results

We now plug the different compilers and optimization methods all together and compare the sizes of deterministic NWAs that we can obtain thereby.
The overall sizes (#states) of the resulting dNWAs are given in Figure 26. We see that the two methods starting with SHA s n w a ( d e t ( s t e p ( . ) ) ) and d e t ( n w a ( s t e p ( . ) ) ) yield reasonably small deterministic NWAs for the NREs of all benchmark queries. The methods starting with NWAs n w a ( d e t ( s t e p ( n w a ( . ) ) ) ) and d e t ( n w a ( s t e p ( n w a ( . ) ) ) ) provide reasonably small deterministic NWAs for queries except for A8.
We also tested our algorithms on collections of XPath queries with a scalable parameter, such as the queries c h n ( a ) for increasing n. This series is known to require automaton with a number of states exponential in n for deterministic bottom-up evaluation. The best methods to produce deterministic NWAs in this case is nwa ( det ( step ) ) . It works until n = 9 , leading to an dNWA of size 134 , 929 with 772 states. The number of states close to doubles when increasing n by 1. The second best method for producing dNWAs for the series c h n ( a ) works only until n = 6 .
For explaining the different size of the dNWAs for the series c h n ( a ) , we first note that no schema-based cleaning was applied in this experiment. As a consequence, unique minimal single-entry dNWAs in which the single-entry state is the initial state should exist. The reason for the larger number of states with the three other methods is that we have not implemented any minimization algorithm for this subclass of single-entry dNWAs . Furthermore, our implementation of the minimization algorithm for our subclass of dSHA s failed for too big dSHA s. In this case, the number of states reported in Figure 27 could not be reduced to the minimum. In addition, the number of rules seems to be increased further by the lack of any symbolic representation for rules of NWAs that could mimic the apply-else rules for SHA s.

9. Deterministic Nested Regular Expressions

We finally show how to distinguish NREs that can be evaluated deterministically in polynomial time, for instance, by compilation to deterministic NWAs . For this, we consider the language of NREs nregexp ( c h , T ) that extends the abstract syntax of NREs by a new constant T and a new unary operator c h .
Definition 8.
An expression of nregexp ( c h , T ) is deterministic if it does not contain a subexpression of any of the forms: E 1 + E 2 , E * , T · E , μ a . E .
Note in particular that c h ( a ) is a deterministic expression of nregexp ( c h , T ) , as the child operator is added as a primitive there. In contrast, the semantically equivalent expression T . a . T is not deterministic. Similarly, T is deterministic since it is a primitive expression of nregexp ( c h , T ) , while the equivalent expression μ x . ( x + _ ) * is nondeterministic for three different reasons: the μ -operator, star *, and disjunction +. The recursive expression c h * ( E ) is nondeterministic: it is not primitive in nregexp ( c h , T ) , and its definition is based on the μ -operator and disjunction.
The only query of the benchmark for which we can provide a deterministic NRE is the query A1. The NRE for query A1 in Figure A1 is nondeterministic nevertheless, as we replaced c h ( E ) with T · E · T . This is not problematic, given that we can use a decent method for determinization of NWAs . For this reason, it does no more seem worth the effort to maintain specialized compilation methods for deterministic NREs . For the same reason, we will not present any experimental results for our specialized compiler from deterministic NREs to deterministic NWAs . Instead we use the more general compiler for general nondeterministic NREs .
The compiler from Theorem 1 introduces epsilon rules, and thus it does not preserve determinism: some deterministic NREs will be compiled to nondeterministic NWAs . This introduction of nondeterminism can be avoided by eliminating epsilon rules on the fly, that is by using Glushkov’s approach rather than that of Thompson.
Theorem 3.
For any deterministic regular expression E of nregexp ( c h , T ) without conjunction and negation, we can construct in time O ( | E | 2 ) a dNWA recognizing the same language.
Proof sketch.
Theorem 3 uses Glushkov’s construction and thus eliminates ε -edges on the fly compared to the McNaughton–Yamada–Thompson algorithm. The Glushkov construction is well-known to preserve determinism when compiling regular expressions without nesting to finite state automata [20]. For the additional deterministic expressions c h ( E ) , we adapt the deterministic compilation from the work in [18]. This quadratic time result generalizes a previous result for the Glushkov construction [19] from regular expressions without conjunctions and negations to NREs without conjunctions and negations. □
Small deterministic NREs without conjunction and negation can thus be compiled to small dNWAs . On the benchmark, however, this construction can be applied to the query A 1 only, so only a few queries can be covered in this manner.

10. Conclusions and Future Work

We presented SHA s and showed how they can be used to compile NREs to deterministic NWAs . When applied to NREs for navigational XPath queries in the usual XPathMark benchmark, we obtained reasonably small deterministic NWAs , in contrast to all previous approaches.
The dNWAs that we obtain by compilation from SHA s all have the weak single-entry property. This property means that the computation of the NWA is done in a purely bottom-up and left-to-right manner, so in the same way as by an SHA . Our experiments show that the usual determinization algorithm for NWAs is well-behaved when applied to weak single-entry NWAs , while it quickly fails without the weak single-entry property.
We have also stated a unique minimization algorithm for dSHA s with the same tree and hedge initial state. It is open whether unique minimization holds for general dSHA s. Neither do we know whether dSHA minimization is NP-hard. The analogous questions remain open for the class of weak single-entry dNWAs .
In future work, one needs to tackle the open questions on the minimization of dSHA s, weak single-entry dNWAs , and dNWAs with fixed general signatures. One has to understand, whether and why unique minimization holds or not, and whether and why minimization is hard or not. Independently, it is interesting to use SHA s in various questions in theory and practice. In particular, we want to develop new algorithms for earliest query answering for dSHA s that are more efficient than the existing algorithms for dNWAs [16] and to see how they behave in practice.

Author Contributions

Conceptualization, J.N.; methodology, J.N.; software, J.N. and M.S.; validation, J.N. and M.S.; formal analysis, J.N. and M.S.; writing—original draft preparation, J.N. and M.S.; writing—review and editing, J.N. and M.S.; visualization, J.N. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015–2020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The nested regular expressions generated for the queries mentioned in this article and their corresponding automata—when we could produce them—can be found at http://researchers.lille.inria.fr/niehren/complementary-material.

Acknowledgments

It is a pleasure to thank Iovka Boneva for her contributions to the conference version [25] onto which this journal article extends. We are equally grateful to Antonio Al Serhali for implementing the determinization algorithm for SHA s with apply-else rules.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Determinization of NWAs

Let us first introduce some notations. For a transition τ Q h × Q h , we write lab ( τ ) = { a Σ ( q , q ) τ , q Q . q a q Δ } . Furthermore, we write ε Δ * to denote the reflexive and transitive closure of ε Δ . Finally, for any set Q, we write i d Q to denote the binary relation that relates every element of Q to itself, that is, i d Q = { ( q , q ) Q 2 } .
We adapt the usual determinization procedure for NWAs [3,18] so that they can account for hedge ending and else rules. Given an NWA A = ( Q h , Q t , Σ , Γ , Δ , I , F ) , the difficulty is to deal with concurrent opening rules q 🠆 γ 1 q 1 and q 🠆 γ 2 q 2 in Δ during determinization without mixing up the stack symbols γ 1 and γ 2 . Therefore, we use transition relations as states of the determinized automaton det ( A ) = ( Q h det , Q t det , Σ , Γ det , Δ det , I det , F det ) , that is, Q h det = 2 Q h × Q h , Q t det = 2 Q h × Q t . The only initial state is the composition of i d I with ε Δ * , i.e., I det = { i d I ε Δ * } . The set of final states is F det = { τ Q h det τ ( I × F ) } . Schemas generating the transition rules in Δ det are given below.
Algorithms 14 00068 i004

Appendix B. NREs for the XPathMark Benchmark

We compiled navigational XPath queries of the XPathMark benchmark to the NREs given in Figure A1.
Figure A1. The NREs of the XPath benchmark queries.
Figure A1. The NREs of the XPath benchmark queries.
Algorithms 14 00068 g0a1

Appendix C. Some More Automata

Figure A2. Deterministic NWA : det ( nwa ( c h * ( a + b ) ) .
Figure A2. Deterministic NWA : det ( nwa ( c h * ( a + b ) ) .
Algorithms 14 00068 g0a2
Figure A3. Stepwise hedge automaton from NWA for step ( nwa ( c h * ( a + b ) ) ) .
Figure A3. Stepwise hedge automaton from NWA for step ( nwa ( c h * ( a + b ) ) ) .
Algorithms 14 00068 g0a3
Figure A4. Determinization of NWA from stepwise hedge automaton: det ( nwa ( step ( c h * ( a + b ) ) ) ) .
Figure A4. Determinization of NWA from stepwise hedge automaton: det ( nwa ( step ( c h * ( a + b ) ) ) ) .
Algorithms 14 00068 g0a4

References

  1. Mehlhorn, K. Pebbling Moutain Ranges and its Application of DCFL-Recognition. In Automata, Languages and Programming, Proceedings of the 7th Colloquium, Noordweijkerhout, The Netherlands, 14–18 July 1980; Lecture Notes in Computer Science; de Bakker, J.W., van Leeuwen, J., Eds.; Springer: Berlin/Heidelberg, Germany, 1980; Volume 85, pp. 422–435. [Google Scholar] [CrossRef]
  2. Von Braunmühl, B.; Verbeek, R. Input Driven Languages are Recognized in log n Space. North Holl. Math. Stud. 1985, 102, 1–19. [Google Scholar] [CrossRef]
  3. Alur, R. Marrying Words and Trees. In Proceedings of the 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Paris, France, 14–16 June 2004; ACM-Press: New York, NY, USA, 2007; pp. 233–242. [Google Scholar]
  4. Okhotin, A.; Salomaa, K. Complexity of input-driven pushdown automata. SIGACT News 2014, 45, 47–67. [Google Scholar] [CrossRef]
  5. Alur, R.; Madhusudan, P. Visibly pushdown languages. In Proceedings of the 36th ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004; ACM-Press: New York, NY, USA, 2004; pp. 202–211. [Google Scholar]
  6. Neumann, A.; Seidl, H. Locating Matches of Tree Patterns in Forests. In Proceedings of the Foundations of Software Technology and Theoretical Computer Science, Chennai, India, 17–19 December 1998; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1530, pp. 134–145. [Google Scholar]
  7. Gauwin, O.; Niehren, J.; Roos, Y. Streaming Tree Automata. Inf. Process. Lett. 2008, 109, 13–17. [Google Scholar] [CrossRef] [Green Version]
  8. Thatcher, J.W. Characterizing derivation trees of context-free grammars through a generalization of automata theory. J. Comput. Syst. Sci. 1967, 1, 317–322. [Google Scholar] [CrossRef] [Green Version]
  9. Comon, H.; Dauchet, M.; Gilleron, R.; Löding, C.; Jacquemard, F.; Lugiez, D.; Tison, S.; Tommasi, M. Tree Automata Techniques and Applications. 2007. Available online: http://tata.gforge.inria.fr (accessed on 1 February 2021).
  10. Hosoya, H.; Pierce, B.C. XDuce: A statically typed XML processing language. ACM Trans. Internet Technol. 2003, 3, 117–148. [Google Scholar] [CrossRef]
  11. Mozafari, B.; Zeng, K.; Zaniolo, C. From Regular Expressions to Nested Words: Unifying Languages and Query Execution for Relational and XML Sequences. PVLDB 2010, 3, 150–161. [Google Scholar] [CrossRef]
  12. Pitcher, C. Visibly Pushdown Expression Effects for XML Stream Processing. Program. Lang. Technol. XML 2005, 1060, 1–14. [Google Scholar]
  13. Olteanu, D. SPEX: Streamed and Progressive Evaluation of XPath. IEEE Trans. Know. Data Eng. 2007, 19, 934–949. [Google Scholar] [CrossRef]
  14. Gauwin, O.; Niehren, J. Streamable Fragments of Forward XPath. In Proceedings of the International Conference on Implementation and Application of Automata, Blois, France, 13–16 July 2011; Lecture Notes in Computer Science; Markhoff, B.B., Caron, P., Champarnaud, J.M., Maurel, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6807, pp. 3–15. [Google Scholar] [CrossRef] [Green Version]
  15. Benedikt, M.; Jeffrey, A.; Ley-Wild, R. Stream Firewalling of XML Constraints. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 10–12 June 2008; ACM-Press: New York, NY, USA, 2008; pp. 487–498. [Google Scholar]
  16. Gauwin, O.; Niehren, J.; Tison, S. Earliest Query Answering for Deterministic Nested Word Automata. In Proceedings of the 17th International Symposium on Fundamentals of Computer Theory, Wroclaw, Poland, 2–4 September 2009; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5699, pp. 121–132. [Google Scholar]
  17. Franceschet, M. XPathMark Performance Test. Available online: https://users.dimi.uniud.it/~massimo.franceschet/xpathmark/PTbench.html (accessed on 25 October 2020).
  18. Debarbieux, D.; Gauwin, O.; Niehren, J.; Sebastian, T.; Zergaoui, M. Early nested word automata for XPath query answering on XML streams. Theor. Comput. Sci. 2015, 578, 100–125. [Google Scholar] [CrossRef]
  19. Brüggemann-Klein, A. Regular Expressions into Finite Automata. Theor. Comput. Sci. 1993, 120, 197–213. [Google Scholar] [CrossRef] [Green Version]
  20. Brüggemann-Klein, A.; Wood, D. One-Unambiguous Regular Languages. Inf. Comput. 1998, 142, 182–206. [Google Scholar] [CrossRef] [Green Version]
  21. Carme, J.; Niehren, J.; Tommasi, M. Querying Unranked Trees with Stepwise Tree Automata. In Proceedings of the 19th International Conference on Rewriting Techniques and Applications, Paris, France, 26–28 June 2004; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3091, pp. 105–118. [Google Scholar]
  22. Alur, R.; Kumar, V.; Madhusudan, P.; Viswanathan, M. Congruences for Visibly Pushdown Languages. In Automata, Languages and Programming, Proceedings of the 32nd International Colloquium, Lisbon, Portugal, 11–15 July 2005; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3580, pp. 1102–1114. [Google Scholar] [CrossRef]
  23. Chervet, P.; Walukiewicz, I. Minimizing Variants of Visibly Pushdown Automata. In Mathematical Foundations of Computer Science 2007; Kučera, L., Kučera, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 135–146. [Google Scholar]
  24. Gauwin, O.; Muscholl, A.; Raskin, M. Minimization of visibly pushdown automata is NP-complete. Log. Methods Comput. Sci. 2020, 16. [Google Scholar] [CrossRef]
  25. Boneva, I.; Niehren, J.; Sakho, M. Nested Regular Expressions Can Be Compiled to Small Deterministic Nested Word Automata. In Computer Science—Theory and Applications, Proceedings of the 15th International Computer Science Symposium in Russia (CSR 2020), Yekaterinburg, Russia, 29 June–3 July 2020; Lecture Notes in Computer Science; Fernau, H., Ed.; Springer: Cham, Switzerland, 2020; Volume 12159, pp. 169–183. [Google Scholar] [CrossRef]
  26. Gottlob, G.; Koch, C.; Pichler, R. The complexity of XPath query evaluation. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, CA, USA, 9–12 June 2003; pp. 179–190. [Google Scholar]
  27. Libkin, L.; Martens, W.; Vrgoč, D. Querying Graph Databases with XPath. In Proceedings of the 16th International Conference on Database Theory (ICDT’13), Genoa, Italy, 18–22 March 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 129–140. [Google Scholar] [CrossRef]
  28. Fischer, M.J.; Ladner, R.E. Propositional Dynamic Logic of Regular Programs. J. Comput. Syst. Sci. 1979, 18, 194–211. [Google Scholar] [CrossRef] [Green Version]
  29. Mozafari, B.; Zeng, K.; Zaniolo, C. High-performance complex event processing over XML streams. In Proceedings of the SIGMOD Conference, Scottsdale, AZ, USA, 20–24 May 2012; Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A., Eds.; ACM: New York, NY, USA, 2012; pp. 253–264. [Google Scholar] [CrossRef] [Green Version]
  30. Grez, A.; Riveros, C.; Ugarte, M. A Formal Framework for Complex Event Processing. In Proceedings of the 22nd International Conference on Database Theory (ICDT 2019), Lisbon, Portugal, 26–28 March 2019; Volume 127, pp. 5:1–5:18. [Google Scholar] [CrossRef]
  31. Bozzelli, L.; Sánchez, C. Visibly Rational Expressions. Acta Inf. 2014, 51, 25–49. [Google Scholar] [CrossRef] [Green Version]
  32. Gécseg, F.; Steinby, M. Tree Automata; Akadémiai Kiadó: Budapest, Hungary, 1984. [Google Scholar]
  33. Scott, D.; de Bakker, J.W. A Theory of Programs; IBM: Vienna, Austria, 1969; Unpublished Manuscript. [Google Scholar]
  34. Stockmeyer, L.J.; Meyer, A.R. Word Problems Requiring Exponential Time. In Proceedings of the 5th ACM Symposium on Theory of Computing, Austin, TX, USA, 30 April–2 May 1973; pp. 1–9. [Google Scholar]
  35. Champavère, J.; Gilleron, R.; Lemay, A.; Niehren, J. Efficient Inclusion Checking for Deterministic Tree Automata and XML Schemas. Inf. Comput. 2009, 207, 1181–1208. [Google Scholar] [CrossRef] [Green Version]
  36. Aho, A.V.; Lam, M.S.; Sethi, R.; Ullman, J.D. Compilers: Principles, Techniques, and Tools, 2nd ed.; Addison Wesley: Reading, MA, USA, 2006. [Google Scholar]
  37. Arnold, A.; Niwiński, D. Complete lattices and fixed-point theorems. In Rudiments of μ-Calculus; Studies in Logic and the Foundations of Mathematics; Elsevier: Amsterdam, The Netherlands, 2001; Volume 146. [Google Scholar]
  38. Martens, W.; Niehren, J. On the Minimization of XML-Schemas and Tree Automata for Unranked Trees. J. Comput. Syst. Sci. 2007, 73, 550–583. [Google Scholar] [CrossRef] [Green Version]
  39. D’Antoni, L.; Alur, R. Symbolic Visibly Pushdown Automata. In Computer Aided Verification, Proceedings of the 26th International Conference (CAV, VSL 2014), Vienna, Austria, 18–22 July 2014; Lecture Notes in Computer Science; Biere, A., Bloem, R., Eds.; Springer: Cham, Switzerland, 2014; Volume 8559, pp. 209–225. [Google Scholar] [CrossRef]
Figure 1. Nested word a · b · ε · c · d · ε seen as a graph.
Figure 1. Nested word a · b · ε · c · d · ε seen as a graph.
Algorithms 14 00068 g001
Figure 2. An Xml document.
Figure 2. An Xml document.
Algorithms 14 00068 g002
Figure 3. The corresponding nested word.
Figure 3. The corresponding nested word.
Algorithms 14 00068 g003
Figure 4. The nested word of the x-marked Xml document from Figure 2.
Figure 4. The nested word of the x-marked Xml document from Figure 2.
Algorithms 14 00068 g004
Figure 5. XPath benchmark queries.
Figure 5. XPath benchmark queries.
Algorithms 14 00068 g005
Figure 6. Nested word automaton n w a ( c h * ( a + b ) ) .
Figure 6. Nested word automaton n w a ( c h * ( a + b ) ) .
Algorithms 14 00068 g006
Figure 7. The NWA T maps top level positions to state 0 and nested positions to 1 or 1 .
Figure 7. The NWA T maps top level positions to state 0 and nested positions to 1 or 1 .
Algorithms 14 00068 g007
Figure 8. Automaton for the a * expression.
Figure 8. Automaton for the a * expression.
Algorithms 14 00068 g008
Figure 9. Wrong naive construction for μ a . a * .
Figure 9. Wrong naive construction for μ a . a * .
Algorithms 14 00068 g009
Figure 10. The multi-module NWA for a * .
Figure 10. The multi-module NWA for a * .
Algorithms 14 00068 g010
Figure 11. The correctly adapted naive construction for μ a . a * .
Figure 11. The correctly adapted naive construction for μ a . a * .
Algorithms 14 00068 g011
Figure 12. The size (#states) of the nested word automata ( NWAs ) for the benchmark nested regular expressions ( NREs ) and the automata derived thereof.
Figure 12. The size (#states) of the nested word automata ( NWAs ) for the benchmark nested regular expressions ( NREs ) and the automata derived thereof.
Algorithms 14 00068 g012
Figure 13. Stepwise hedge automaton step ( c h * ( a + b ) ) : the part with the stepwise tree automaton is on the left and middle, and the Nfa part on the right.
Figure 13. Stepwise hedge automaton step ( c h * ( a + b ) ) : the part with the stepwise tree automaton is on the left and middle, and the Nfa part on the right.
Algorithms 14 00068 g013
Figure 14. Determinization of SHA s.
Figure 14. Determinization of SHA s.
Algorithms 14 00068 g014
Figure 15. The determinized SHA d e t ( step ( c h * ( a + b ) ) ) .
Figure 15. The determinized SHA d e t ( step ( c h * ( a + b ) ) ) .
Algorithms 14 00068 g015
Figure 16. SHA for a * .
Figure 16. SHA for a * .
Algorithms 14 00068 g016
Figure 17. SHA for μ a . a * .
Figure 17. SHA for μ a . a * .
Algorithms 14 00068 g017
Figure 18. The SHA s for the benchmark NREs and the automata derived thereof.
Figure 18. The SHA s for the benchmark NREs and the automata derived thereof.
Algorithms 14 00068 g018
Figure 19. The single-entry NWA n w a ( step ( c h * ( a + b ) ) ) obtained from the SHA .
Figure 19. The single-entry NWA n w a ( step ( c h * ( a + b ) ) ) obtained from the SHA .
Algorithms 14 00068 g019
Figure 20. Deterministic NWA : n w a ( d e t ( step ( c h * ( a + b ) ) ) ) .
Figure 20. Deterministic NWA : n w a ( d e t ( step ( c h * ( a + b ) ) ) ) .
Algorithms 14 00068 g020
Figure 21. A dSHA for hedges over Σ = { x , y } with single occurrence of x. It is minimal in the class of multi-module dSHA s.
Figure 21. A dSHA for hedges over Σ = { x , y } with single occurrence of x. It is minimal in the class of multi-module dSHA s.
Algorithms 14 00068 g021
Figure 22. An equivalent dSHA to that in Figure 21 that is minimal in the class of dSHA s with equal tree and hedge initial states.
Figure 22. An equivalent dSHA to that in Figure 21 that is minimal in the class of dSHA s with equal tree and hedge initial states.
Algorithms 14 00068 g022
Figure 23. A single x-marked position.
Figure 23. A single x-marked position.
Algorithms 14 00068 g023
Figure 24. Nested words of x-marked XML documents.
Figure 24. Nested words of x-marked XML documents.
Algorithms 14 00068 g024
Figure 25. Optimized automata for derived stepwise automata compiled from NREs .
Figure 25. Optimized automata for derived stepwise automata compiled from NREs .
Algorithms 14 00068 g025
Figure 26. Deterministic NWAs computed with optimizations for the XPath benchmark queries. Note that different dNWAs for the same query may recognize different languages, due to schema-based cleaning with respect to the Xml data model. Furthermore, our implementation of the minimization algorithm for the subclass of dSHA s worked successfully only for dSHA s with at most 200 states.
Figure 26. Deterministic NWAs computed with optimizations for the XPath benchmark queries. Note that different dNWAs for the same query may recognize different languages, due to schema-based cleaning with respect to the Xml data model. Furthermore, our implementation of the minimization algorithm for the subclass of dSHA s worked successfully only for dSHA s with at most 200 states.
Algorithms 14 00068 g026
Figure 27. Deterministic NWAs for the queries c h n ( a ) where n = 0 , , 9 : size (#states). There is no schema-based cleaning. Our implementation of the minimization algorithm was applied to all dSHA with at most 200 states, as it failed for larger dSHA s. No minimization algorithm for subclasses of single-entry dNWAs was implemented.
Figure 27. Deterministic NWAs for the queries c h n ( a ) where n = 0 , , 9 : size (#states). There is no schema-based cleaning. Our implementation of the minimization algorithm was applied to all dSHA with at most 200 states, as it failed for larger dSHA s. No minimization algorithm for subclasses of single-entry dNWAs was implemented.
Algorithms 14 00068 g027
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Niehren, J.; Sakho, M. Determinization and Minimization of Automata for Nested Words Revisited. Algorithms 2021, 14, 68. https://0-doi-org.brum.beds.ac.uk/10.3390/a14030068

AMA Style

Niehren J, Sakho M. Determinization and Minimization of Automata for Nested Words Revisited. Algorithms. 2021; 14(3):68. https://0-doi-org.brum.beds.ac.uk/10.3390/a14030068

Chicago/Turabian Style

Niehren, Joachim, and Momar Sakho. 2021. "Determinization and Minimization of Automata for Nested Words Revisited" Algorithms 14, no. 3: 68. https://0-doi-org.brum.beds.ac.uk/10.3390/a14030068

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop