Twig Pattern Minimization Based on XML Schema Constraints

Twig patterns are one of the core components of XQuery. A Twig pattern often contains redundant nodes that can be removed. Schema features are used to judge whether a node of a Twig pattern is redundant. In this paper, we propose a sufficient set of Schema constraints together with specific minimization rules. We define additional determination conditions for optimization so as to obtain the most efficient results. Through a large number of test cases, we identify the practical limits of minimization.


INTRODUCTION
XML (Extensible Markup Language) has become the standard for describing and exchanging information on the network. XML data can be stored in documents and searched with information retrieval methods such as keyword queries. Some existing commercial database systems extend their capabilities to process XML data by transforming XML queries into database query expressions; the XML query results are obtained after these expressions are executed. This approach satisfies the requirements of complex queries.
XQuery is a standard developed by the W3C for extracting information from XML documents. XQuery is to XML what SQL is to databases. XQuery is built on XPath expressions, and the path expression is the core statement of a query. The core parts of XPath and XQuery are generally extracted as a Twig pattern, which can be expressed as a tree structure. A query written by a user is often complicated and includes redundant parts. Twig minimization, an active research topic, improves query efficiency by deleting the unnecessary portions of the query.
Our work, as described in this paper, makes the following contributions.
1) We propose a relatively complete set of Schema constraints, including basic constraints and path constraints, which provide more opportunities for minimization.
2) We propose Twig pattern minimization rules that optimize both leaf nodes and middle nodes, so that the query can be optimized faster and correctly.

BACKGROUND
XML Schema
XML Schema provides a means for defining the structure, content and semantics of XML documents. One of its most important abilities is support for data types, which makes it easy to describe the permissible document content and to verify the correctness of the data.
The indicator nodes in an XML Schema document, as shown in Figure 1, specify the order and the number of occurrences of elements. By examining the element tags and indicators in the XML Schema and how they are nested, we can determine whether an inevitable relationship holds between two elements.
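As an illustration of reading inevitable relationships off the indicators, the following minimal sketch collects "required child" pairs from a small schema. The schema text, the function names and the pair representation are assumptions made for this example and are not taken from Figure 1. A pair (a, b) is read here as "every a element has a b child", which appears to be the kind of constraint the paper later writes as RPC(author, name). The sketch is deliberately conservative: it skips xs:choice, ignores ref= declarations, and identifies elements by name only.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical toy schema, written for this example only.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="email" type="xs:string" minOccurs="0"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:choice>
          <xs:element name="isbn" type="xs:string"/>
          <xs:element name="issn" type="xs:string"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def _required_in_group(group):
    """Yield element declarations that must occur inside a sequence/all group."""
    if int(group.get("minOccurs", "1")) < 1:
        return  # the whole group is optional
    for item in group:
        if int(item.get("minOccurs", "1")) < 1:
            continue  # optional item
        if item.tag == XS + "element":
            yield item
        elif item.tag in (XS + "sequence", XS + "all"):
            yield from _required_in_group(item)
        # xs:choice is skipped: no single branch is guaranteed to occur.


def required_children(elem_decl):
    """Yield child declarations that every instance of elem_decl must contain."""
    ctype = elem_decl.find(XS + "complexType")
    if ctype is None:
        return
    for group in ctype:
        if group.tag in (XS + "sequence", XS + "all"):
            yield from _required_in_group(group)


def rpc_constraints(xsd_text):
    """Collect (parent tag, child tag) pairs for all required children."""
    root = ET.fromstring(xsd_text)
    pairs = set()
    for decl in root.iter(XS + "element"):
        for child in required_children(decl):
            pairs.add((decl.get("name"), child.get("name")))
    return pairs


print(sorted(rpc_constraints(XSD)))
```

The optional email element and both branches of the choice are left out of the result, because the indicators do not guarantee that any one of them occurs.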

Twig Minimization
Twig pattern minimization approaches usually fall into two categories: those based only on analysis of the Twig pattern, and those that use constraints.
Figure 2 a) shows a Twig query that searches the XML document for book elements satisfying two conditions. The first condition is that the book element must have a child element with tag author, which in turn must have a child element with tag name. The second condition is that the book element must have a descendant element with tag name. Obviously, if the first condition is satisfied, the second condition is also satisfied. In other words, the second condition is implied by the first and is redundant, so it can be deleted. The result is shown in Figure 2 b). This method relies only on the Twig itself, without reference to other conditions.
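The Twig-only step can be sketched as a containment test between sibling branches. This is a minimal illustration, not the paper's algorithm: the Node class, the axis labels and the function names are assumptions made for this example, and the test is a simple mapping check that only handles tags and the child and descendant axes.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    tag: str
    axis: str = "child"  # edge from the parent: "child" (/) or "descendant" (//)
    children: list = field(default_factory=list)
    returned: bool = False


def subtree(n):
    """Yield n and every node below it."""
    yield n
    for c in n.children:
        yield from subtree(c)


def embeds(q, p):
    """True if the pattern rooted at q can be mapped onto the pattern rooted at p."""
    if q.tag != p.tag:
        return False
    for c in q.children:
        if c.axis == "child":
            targets = [d for d in p.children if d.axis == "child"]
        else:
            targets = [d for ch in p.children for d in subtree(ch)]
        if not any(embeds(c, d) for d in targets):
            return False
    return True


def implied_by(b, a):
    """True if sibling branch a already guarantees what branch b asks for."""
    if b.axis == "child":
        return a.axis == "child" and embeds(b, a)
    return any(embeds(b, d) for d in subtree(a))


def prune(n):
    """Remove branches that are implied by a sibling and contain no returned node."""
    for c in n.children:
        prune(c)
    kept = list(n.children)
    for b in list(kept):
        if any(x.returned for x in subtree(b)):
            continue
        if any(a is not b and implied_by(b, a) for a in kept):
            kept.remove(b)
    n.children = kept
    return n


# The query described for Figure 2 a): book[author/name][.//name]
twig = Node("book", returned=True, children=[
    Node("author", children=[Node("name")]),
    Node("name", axis="descendant"),
])
prune(twig)
print([c.tag for c in twig.children])  # prints ['author']
```

The descendant branch name is dropped because the author/name branch already contains a name node below book, which matches the step from Figure 2 a) to Figure 2 b).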
Based on an analysis of the XML Schema, suppose that every author element has a descendant element with tag name, and every book element has a child element with tag name. Then the name node can be deleted, and the result is shown in Figure 2 c). This method uses constraint conditions to check whether a node is a redundant part of the Twig. Obviously, Twig c) further reduces the intermediate results of the query and the number of probes into the XML document, so it is more efficient than Twig b). This paper therefore minimizes Twig patterns based on Schema features, seeking the most efficient minimization result.
January 19, 2016

XML SCHEMA CONSTRAINT AND OPTIMIZATION RULES
XML Schema constraint
XML Schema can strictly define the structure of XML documents. The definite relationships between XML elements that can be extracted from the XML Schema are called structural constraints (or feature relationships). This paper minimizes Twig patterns mainly on the basis of two kinds of constraints. Assuming that p is an XPath expression without predicates, the definitions of the extended constraints are as follows:

Optimization rules
Nodes in a Twig fall into two kinds: query nodes, which match XML elements, and logical nodes, which represent the logical relationships between conditions. In this paper, we introduce query node optimization rules based on Schema features. Twig optimization applies not only to leaf nodes but also to middle nodes.
Leaf node optimization rules: suppose that x is a leaf node and y is a query node in the Twig, and p is an XPath expression without predicates. If x is not a returned node and satisfies one of the four groups of conditions on Twig structure and XML Schema features in Table 1, the node x may be deleted from the Twig. For example, suppose the constraint RPC(author, name) can be obtained from the XML Schema. When optimizing the node name of the Twig in Figure 2 b), both it and its parent author are query nodes, and it is a leaf node that is not returned. Processing the Twig according to the rules in Table 1, we find that the first row is satisfied, so the node name can be deleted, giving the Twig in Figure 2 c).
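The leaf example can be sketched as follows. This is a minimal illustration under stated assumptions: only the single condition that the example confirms is encoded (a parent-child edge together with RPC(tag(y), tag(x))), the remaining groups of Table 1 are not modeled, and RPC(a, b) is read as "every a element has a b child". The Node class, the axis labels and the tuple encoding of constraints are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    tag: str
    axis: str = "PC"  # edge from the parent: "PC" (child) or "AD" (descendant)
    returned: bool = False
    children: list = field(default_factory=list)


def prune_leaves(node, constraints):
    """Delete non-returned leaves that a schema constraint already guarantees."""
    for child in list(node.children):
        prune_leaves(child, constraints)  # bottom-up, so new leaves are rechecked
        is_leaf = not child.children
        # Assumed reading: ("RPC", a, b) means every a element has a b child.
        implied = child.axis == "PC" and ("RPC", node.tag, child.tag) in constraints
        if is_leaf and not child.returned and implied:
            node.children.remove(child)
    return node


# Twig of Figure 2 b) as described in the text: book/author/name, book returned.
twig = Node("book", returned=True, children=[
    Node("author", children=[Node("name")]),
])
prune_leaves(twig, {("RPC", "author", "name")})
print([c.tag for c in twig.children], twig.children[0].children)  # prints ['author'] []
```

Because the pass works bottom-up, a node that becomes a leaf after a deletion is tested in the same run; whether the paper's algorithm cascades in this way is not stated in this section.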
Middle node optimization rules: (1) Suppose that y is a middle node and x is a query node in the Twig, p is an XPath expression without predicates, and z is the only descendant node of y. If y is not a returned node and satisfies one of the eight groups of conditions on Twig structure and XML Schema features in Table 2, the node y may be deleted from the Twig and the node z may be connected to the node x with a double line.
(2) Suppose that y is a middle node and x is a query node in the Twig, p is an XPath expression without predicates, and z1, z2, …, zn are descendant nodes of y. If y is the only child node of x, is not a returned node, and satisfies one of the four groups of conditions on Twig structure and XML Schema features in Table 3, the node y may be deleted from the Twig and the nodes z1, z2, …, zn may be connected to the node x with double lines.

Table 2
No.  Twig structure      XML Schema features
1    AD(x,y)  AD(y,z)    RAD(tag(x),tag(y))    !MAD(tag(y),tag(x))  RDA(tag(z),tag(y))
2    PC(x,y)  AD(y,z)    RCP(tag(y),tag(x))    !MAD(tag(x),tag(x))  RDA(tag(z),tag(y))
3    AD(x,y)  PC(y,z)    RAD(tag(x),tag(y))                         RCP(tag(z),tag(y))
4    PC(x,y)  PC(y,z)    RCP(tag(y),tag(x))    !MAD(tag(x),tag(x))  RCP(tag(z),tag(y))
5    AD(x,y)  AD(y,z)    PAD(p,tag(x),tag(y))  !MAD(tag(y),tag(x))  PDA(p,tag(z),tag(y))
6    PC(x,y)  AD(y,z)    PCP(p,tag(y),tag(x))  !MAD(tag(x),tag(x))  PDA(p,tag(z),tag(y))
7    AD(x,y)  PC(y,z)    PAD(p,tag(x),tag(y))                       PCP(p,tag(z),tag(y))
8    PC(x,y)  PC(y,z)    PCP(p,tag(y),tag(x))  !MAD(tag(x),tag(x))  PCP(p,tag(z),tag(y))

For example, suppose the constraints RPC(book, author), RCP(name, author) and !MAD(author, author) can be obtained from the XML Schema. When optimizing the node author of the Twig in Figure 2 b), both it and its parent book are query nodes; it is a middle node that is not returned, and it has a single child node name. Processing the Twig according to the rules in Table 2, we find that the fourth row is satisfied, so the node author can be deleted and the node name can be connected to the node book with a double line, giving the Twig in Figure 2 d).

Table 3 Middle node optimization rules.

No.  Twig structure  XML Schema features
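The row lookup that rule (1) describes for Table 2 can be sketched as a small table-driven check. This is an illustration only: constraints are treated as opaque facts in a set, since their definitions are not reproduced in this section, and "!" is read as "this fact must not be derivable from the Schema". Only rows 1 to 4 are encoded; rows 5 to 8 have the same shape with the path argument p added. The names TABLE2 and matching_row and the tuple encoding are assumptions. The facts in the usage lines instantiate row 4 literally as printed, which differs from the constraint names quoted in the worked example (RPC(book, author) and !MAD(author, author)); that difference is left as it stands in the source.

```python
# Each row: (edge x-y, edge y-z, facts that must hold, facts that must not hold).
# A fact template names the constraint and the roles whose tags are substituted.
TABLE2 = {
    1: ("AD", "AD", [("RAD", "x", "y"), ("RDA", "z", "y")], [("MAD", "y", "x")]),
    2: ("PC", "AD", [("RCP", "y", "x"), ("RDA", "z", "y")], [("MAD", "x", "x")]),
    3: ("AD", "PC", [("RAD", "x", "y"), ("RCP", "z", "y")], []),
    4: ("PC", "PC", [("RCP", "y", "x"), ("RCP", "z", "y")], [("MAD", "x", "x")]),
}


def matching_row(xy, yz, tags, facts):
    """Return the number of the first row whose conditions hold, else None."""

    def inst(template):
        name, a, b = template
        return (name, tags[a], tags[b])

    for no, (e1, e2, required, forbidden) in TABLE2.items():
        if (xy, yz) != (e1, e2):
            continue
        holds = all(inst(c) in facts for c in required)
        blocked = any(inst(c) in facts for c in forbidden)
        if holds and not blocked:
            return no
    return None


# Twig of Figure 2 b) as described in the text: book/author/name.
tags = {"x": "book", "y": "author", "z": "name"}
facts = {("RCP", "author", "book"), ("RCP", "name", "author")}  # hypothetical facts
print(matching_row("PC", "PC", tags, facts))  # prints 4
```

When a row matches, the rewrite itself would remove y and attach z to x with a descendant (double line) edge.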

TEST CASE AND TEST RESULTS
Experiments were carried out with XML Schema documents of different sizes and with different types of Twig patterns. The Twig patterns before and after optimization were executed on an 82M XML document using the TwigList algorithm. If the number of Twig nodes is greatly reduced by optimization, the minimization result is significant. The efficiency ratio is the sum of the optimization time and the query time after optimization, divided by the query time without optimization. The smaller the ratio, the higher the efficiency of the optimization. If the ratio is greater than 1, the optimization is not worthwhile. The test results are shown in Table 4.
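Written as a formula, the ratio described above takes the following form. The symbols are introduced here for illustration and do not appear in the source: T_opt is the optimization time, T_after the query time of the minimized Twig, and T_before the query time of the original Twig.

```latex
\[
  r \;=\; \frac{T_{\mathrm{opt}} + T_{\mathrm{after}}}{T_{\mathrm{before}}},
  \qquad
  r > 1 \;\Longleftrightarrow\; T_{\mathrm{opt}} + T_{\mathrm{after}} > T_{\mathrm{before}} .
\]
```

In words, r exceeds 1 exactly when minimizing and then running the minimized query takes longer than running the original query directly.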

CONCLUSION
XML query optimization is driven by two kinds of problems: those arising from the data structure and those arising from query requirements. In this paper, Twig minimization is based on a relatively complete set of constraints derived from XML Schema. On top of the necessary constraints, path constraints are added, which provide more opportunities for optimization. We also propose specific rules tailored to different kinds of nodes; nodes are optimized according to these rules and the corresponding algorithm. From a large number of test cases, we conclude that the minimization cost is higher and the result is better when the number of Schema nodes is far greater than the number of Twig nodes, whereas the minimization cost is lower but the result is less noticeable when the Twig contains a large number of returned nodes.