Saturday, February 25, 2012

modular XML instances and modular XSD schemas

Lately I have been exploring design options for constructing modular XML instance documents and modular XSD schema documents, and thought I would write up my findings as a blog post here.

I believe the following are the primary concepts related to constructing modular XML documents (and XSD schemas) when XSD validation is involved:
1. Modularize XML documents using the XInclude construct.
2. Modularize an XSD document via <xs:include> and <xs:import>. The <xs:include> construct maps closely to the modularity concepts of XSD schemas, and <xs:import> is needed (necessary in XSD 1.0, and optional in XSD 1.1) to compose (and also to modularize) XSD schemas coming from two or more distinct XML namespaces.

I don't intend to delve much into <xs:include> and <xs:import> in this post, since these constructs are well known within the XSD and XML communities. Instead, I focus primarily on XML document modularization via XInclude, and present a few thoughts on design options for validating such XML instance documents with XSD (I don't claim to cover every design option for these use cases, only a few of the important ones).

What is XInclude?
XInclude is a W3C specification that defines how to compose an XML document from other documents. Its primary construct is the <xi:include> XML element. The following is a small example of an XML document that uses XInclude:

z.xml

<z xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="x.xml"/>
    <xi:include href="y.xml"/>
</z>

x.xml

<x>
    <a>1</a>
    <b>2</b>
</x>

y.xml

<y>
    <p>5</p>
    <q>6</q>
</y>

We'll supply the XML document z.xml above, which is composed from other XML documents via XInclude directives, to an XSD validator for validation.

Here I discuss the XSD schema design options for validating an XML instance document like z.xml above. The following are the design options (each resulting in successful validation of the instance) that currently come to mind, along with the rationale for each:
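To make the composition concrete, here is a minimal Java sketch of what an XInclude-aware parser does with z.xml. It assumes the JAXP parser bundled with the JDK and Java 11 or later; the class name, method name and temporary directory are made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XIncludeExpandDemo {

    // Writes x.xml, y.xml and z.xml (as listed above) into a temp directory,
    // parses z.xml with XInclude processing enabled, and returns the local
    // names of the element children of <z> after expansion.
    static List<String> expandedChildNames() throws Exception {
        Path dir = Files.createTempDirectory("xinclude-demo");
        Files.writeString(dir.resolve("x.xml"), "<x><a>1</a><b>2</b></x>");
        Files.writeString(dir.resolve("y.xml"), "<y><p>5</p><q>6</q></y>");
        Files.writeString(dir.resolve("z.xml"),
            "<z xmlns:xi='http://www.w3.org/2001/XInclude'>"
            + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>");

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // XInclude processing needs namespace awareness
        dbf.setXIncludeAware(true);   // ask the parser to expand <xi:include> elements

        Document doc = dbf.newDocumentBuilder().parse(dir.resolve("z.xml").toFile());

        List<String> names = new ArrayList<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getLocalName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(expandedChildNames());
    }
}
```

With XInclude processing switched on, the children of "z" should be the elements "x" and "y" rather than the two <xi:include> elements. Leaving out the setXIncludeAware call gives the unexpanded view.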

XS1:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
          <xs:complexType>
               <xs:sequence>
                    <xs:any processContents="skip" minOccurs="2" maxOccurs="2"/>
               </xs:sequence>
          </xs:complexType>
    </xs:element>
   
</xs:schema>

This schema is written on the assumption that the XML document (i.e. z.xml) is validated with the XInclude directives unexpanded. The xs:any wildcard validates each child of "z" only weakly: it requires that some XML element be present, but places no other constraint on that element's name, namespace or content. In unexpanded mode the two elements it matches are the <xi:include> elements themselves; the same schema would equally accept the expanded root elements "x" and "y".
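As a rough check of how weak this validation is, the following Java sketch runs a few instance documents through XS1 with the standard JAXP validation API. The validator is fed plain strings, so no XInclude expansion takes place. The class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class Xs1UnexpandedDemo {

    // The XS1 schema from above, as a string.
    static final String XS1 =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='z'><xs:complexType><xs:sequence>"
        + "<xs:any processContents='skip' minOccurs='2' maxOccurs='2'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true if the instance validates against XS1 without XInclude expansion.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(XS1)));
        try {
            // with no ErrorHandler set, a validation error is thrown as a SAXException
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String z = "<z xmlns:xi='http://www.w3.org/2001/XInclude'>"
                 + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>";
        System.out.println(isValid(z));                      // two xi:include children
        System.out.println(isValid("<z><foo/><bar/></z>"));  // two arbitrary children
        System.out.println(isValid("<z><foo/></z>"));        // only one child
    }
}
```

The second call is the interesting one: any two elements satisfy the wildcard, so XS1 cannot tell XInclude directives apart from other content.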

XS2:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

        <xs:element name="z">
                <xs:complexType>
                     <xs:complexContent>
                         <xs:restriction base="T1">
                              <xs:sequence>
                                   <xs:element name="include"  minOccurs="2" maxOccurs="2" targetNamespace="http://www.w3.org/2001/XInclude"/>
                             </xs:sequence>
                         </xs:restriction>
                    </xs:complexContent>
                </xs:complexType>
        </xs:element>
   
    <xs:complexType name="T1" abstract="true">
          <xs:sequence>
               <xs:any processContents="skip" maxOccurs="unbounded"/>
          </xs:sequence>
    </xs:complexType>
   
</xs:schema>

This schema also assumes that the XML document (i.e. z.xml) is validated with the XInclude directives unexpanded, but it specifies slightly stronger validation constraints than XS1: it declares the child element and fixes both its name and its namespace. This schema needs an XSD 1.1 processor, since the local element declaration carries a "targetNamespace" attribute. An XSD 1.0 version of this design approach is possible; it would use an <xs:import> element to import XSD components from the XInclude namespace.
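One possible shape for such an XSD 1.0 variant is sketched below in Java, so that it can be run with the JDK's built-in validator. The two schema documents, their file names (xi.xsd and main.xsd) and the class name are my own assumptions for illustration: a small schema declares a global "include" element in the XInclude namespace, and the main schema imports it and refers to it by "ref".

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class Xs2ImportVariantDemo {

    // Hypothetical schema for the XInclude namespace: one global element of type anyType.
    static final String XI_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " targetNamespace='http://www.w3.org/2001/XInclude'>"
        + "<xs:element name='include'/>"
        + "</xs:schema>";

    // Hypothetical XSD 1.0 main schema: imports the namespace and refers to xi:include.
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
        + " xmlns:xi='http://www.w3.org/2001/XInclude'>"
        + "<xs:import namespace='http://www.w3.org/2001/XInclude' schemaLocation='xi.xsd'/>"
        + "<xs:element name='z'><xs:complexType><xs:sequence>"
        + "<xs:element ref='xi:include' minOccurs='2' maxOccurs='2'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Writes both schema documents to a temp directory so the import can be resolved.
    static Schema loadSchema() throws Exception {
        Path dir = Files.createTempDirectory("xs2-import");
        Files.writeString(dir.resolve("xi.xsd"), XI_XSD);
        Files.writeString(dir.resolve("main.xsd"), MAIN_XSD);
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return sf.newSchema(dir.resolve("main.xsd").toFile());
    }

    // Validates the instance as-is, that is, with XInclude directives unexpanded.
    static boolean isValid(String xml) throws Exception {
        try {
            loadSchema().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<z xmlns:xi='http://www.w3.org/2001/XInclude'>"
            + "<xi:include href='x.xml'/><xi:include href='y.xml'/></z>"));
        System.out.println(isValid("<z><foo/><bar/></z>"));
    }
}
```

Unlike XS1, this variant should reject children of "z" that are not <xi:include> elements, which is the extra strength the XS2 design is after.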

XS3:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

       <xs:element name="z">
              <xs:complexType>
                  <xs:sequence>
                       <xs:any processContents="skip" minOccurs="2" maxOccurs="2" namespace="http://www.w3.org/2001/XInclude"/>
                 </xs:sequence>
                 <xs:assert test="count(*[local-name() = 'include']) = 2"/>
                 <xs:assert test="deep-equal((*[1] | *[2])/@*/name(), ('href','href'))"/>
             </xs:complexType>
      </xs:element>
   
</xs:schema>

This schema also assumes that the XML document (i.e. z.xml) is validated with the XInclude directives unexpanded, but it enforces XSD validation even more strongly than the example "XS2" above: it also requires the XInclude attribute "href" to be present on each XInclude directive, which the previous XSD schema doesn't enforce. The wildcard's "namespace" attribute restricts the children to the XInclude namespace, and the XSD 1.1 <assert> elements check their element names and attributes (this may not be the best XSD validation approach, but such a design idiom is now possible with the XSD 1.1 language).

XS4:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="z">
         <xs:complexType>
               <xs:sequence>
                    <xs:element name="x">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="a" type="xs:integer"/>
                                 <xs:element name="b" type="xs:integer"/>
                            </xs:sequence>
                        </xs:complexType>
                    </xs:element>
                    <xs:element name="y">
                         <xs:complexType>
                             <xs:sequence>
                                  <xs:element name="p" type="xs:integer"/>
                                  <xs:element name="q" type="xs:integer"/>
                             </xs:sequence>
                        </xs:complexType>
                   </xs:element>
              </xs:sequence>
         </xs:complexType>
     </xs:element>
   
</xs:schema>

This schema is written on the assumption that the XML document (i.e. z.xml) is validated with the XInclude directives expanded. It specifies the strongest validation constraints of the four approaches: the internal structure of the XML elements "x" and "y" is now completely specified by the XSD document.

But for this XSD validation approach to work, the XInclude directives need to be expanded, and the expanded XML infoset needs to be supplied to the XSD validator. This requires an XInclude-capable parser (such as Apache Xerces) that expands the <xi:include> elements during the XML parsing stage.

For interested readers, the following are a few Java code snippets (the skeletal class structure and imports are omitted to keep the text shorter) that enable XInclude processing and supply the resulting XML infoset (i.e. after XInclude expansion) to the Xerces XSD validator:

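// Assumption: an XSD 1.0 schema such as XS4, loaded through the standard JAXP factory
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

try {
     // the schema document itself is parsed without XInclude processing
     Schema schema = schemaFactory.newSchema(getSaxSource(xsdUri, false));
     Validator validator = schema.newValidator();
     validator.setErrorHandler(new ValidationErrHandler());
     // the instance document is parsed with XInclude expansion enabled
     validator.validate(getSaxSource(xmlUri, true));
}
catch(SAXException se) {
     se.printStackTrace();
}
catch (IOException ioe) {
     ioe.printStackTrace();
}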

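private SAXSource getSaxSource(String docUri, boolean isInstanceDoc) throws SAXException {

     // let a parser-creation or feature failure propagate to the caller's catch block,
     // rather than continuing with a null or half-configured reader
     XMLReader reader = XMLReaderFactory.createXMLReader();
     if (isInstanceDoc) {
          // expand <xi:include> elements while parsing the instance document
          reader.setFeature("http://apache.org/xml/features/xinclude", true);
          // do not add xml:base attributes to included elements; the schema does not declare them
          reader.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
     }

     return new SAXSource(reader, new InputSource(docUri));

}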
     
class ValidationErrHandler implements ErrorHandler {

      public void error(SAXParseException spe) throws SAXException {
           String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
           System.err.println(formattedMesg);
      }

      public void fatalError(SAXParseException spe) throws SAXException {
             String formattedMesg = getFormattedMesg(spe.getSystemId(), spe.getLineNumber(), spe.getColumnNumber(), spe.getMessage());
             System.err.println(formattedMesg);
      }

      public void warning(SAXParseException spe) throws SAXException {
           // NO-OP           
      }
       
}

private String getFormattedMesg(String systemId, int lineNo, int colNo, String mesg) {
      return systemId + ", line "+lineNo + ", col " + colNo + " : " + mesg;   
}

Summary: Is it worthwhile to devise the various XSD design approaches above for a schema that validates XML instance documents containing <xi:include> directives? My thinking about the validation options presented above was driven by the following concerns:
1) Providing various degrees of XSD validation strength for <xi:include> directives (essentially the unexpanded and expanded modes).
2) Exploring some of the new XML validation idioms offered by the XSD 1.1 language for the use cases presented above (essentially using the "targetNamespace" attribute on xs:element elements, and using <assert> elements).
3) Exploring the Java SAX and JAXP APIs to enable XInclude expansion, and building a SAXSource object that delivers the XInclude-expanded XML infoset to the XSD validation pipeline.

I hope that this post was useful.

Sunday, February 5, 2012

"castable as" vs "instance of" XPath 2.0 expressions for XSD 1.1 assertions

I'm continuing with my thoughts related to my previous blog post (ref, http://mukulgandhi.blogspot.in/2012/01/using-xsd-11-assertions-on-complextype.html). The earlier post used the XPath 2.0 "castable as" expression to do some checks on the 'untyped' data of complexType's mixed content (essentially finding if the string/untyped value in an XML instance document is a lexical representation of an xs:integer value).

This post talks about the use of XPath 2.0 "instance of" vs "castable as" expressions in context of XSD 1.1 assertions -- essentially providing guidance about when it may be necessary to use one of these expressions.

The XSD 1.1 "castable as" use case was discussed in my earlier blog post. Here I essentially talk about "instance of" expression when used with XSD 1.1 assertions.

Let's assume that there is an XML instance document like the following (XML1):

<X>
   <elem>
     <a>20</a>
     <b>30</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML1):
1. The elements "a" and "b" can be typed as an xs:integer or an xs:date (therefore we'll express this with an XSD simpleType with variety 'union').
2. If both the elements "a" and "b" are of type xs:integer (this is allowable as per the simpleType definition described in point 1 above), then the numeric value of element "a" should be less than the numeric value of element "b".
3. If one of the elements "a" or "b" is an xs:integer and the other one is an xs:date, then we would like to express the following constraints:
   - the numeric XML instance value of the xs:integer typed element should be less than 100
   - the xs:date XML instance value should be less than the current date

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML1) provided above:

[XS1]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="union_of_date_and_integer"/>
                       <xs:element name="b" type="union_of_date_and_integer"/>
                    </xs:sequence>
                    <xs:assert test="if ((data(a) instance of xs:integer) and (data(b) instance of xs:integer))
                                              then (data(a) lt data(b))
                                           else if (not(deep-equal(data(a), data(b))))
                                              then (*[data(.) instance of xs:integer]/data(.) lt 100
                                                         and
                                                      *[data(.) instance of xs:date]/data(.) lt current-date())
                                              else true()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
     <xs:simpleType name="union_of_date_and_integer">
        <xs:union memberTypes="xs:date xs:integer"/>
     </xs:simpleType>
   
</xs:schema>

I think it may be interesting for readers to know why I wrote an assertion like the one above. Here are a few thoughts:
1. Since the XML elements "a" and "b" are typed with a simpleType of variety 'union', an assertion that needs the atomic values validated by that simpleType has to apply the XPath 2.0 "data" function to the relevant XDM node (elements "a" and "b" in this case). To determine whether the resulting atomic value is typed as xs:integer, we use the "instance of" expression; "castable as" is not needed in this case, since the instance document's data is already typed.
2. The rest of the assertion implements what is mentioned in the requirements above. One case the requirements leave open is when both elements are xs:date values: as written, the assertion accepts two equal dates, but for two different dates the second branch compares an empty sequence with 100 and so is not satisfied.
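For readers who think more easily in a general-purpose language, here is a loose Java analogy of the three requirements. It is only a sketch and not XPath semantics: the class and method names are invented, java.time.LocalDate stands in for xs:date (time zones are ignored), and the handling of two date values is an assumption of this sketch, since the requirements do not cover that case.

```java
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

public class UnionAssertSketch {

    // Mimics the union "xs:date xs:integer": try the date member first, then the integer.
    // This plays the role of schema validation assigning a type to the value.
    static Object typed(String lexical) {
        try {
            return LocalDate.parse(lexical);
        } catch (DateTimeParseException e) {
            return new BigInteger(lexical);  // NumberFormatException if it is neither
        }
    }

    // Once values are typed, "instanceof" is enough to branch on the type,
    // much like "instance of" in the assertion; no further parsing is needed.
    static boolean holds(Object a, Object b, LocalDate today) {
        if (a instanceof BigInteger && b instanceof BigInteger) {
            // requirement 2: a must be less than b
            return ((BigInteger) a).compareTo((BigInteger) b) < 0;
        }
        if (a instanceof LocalDate && b instanceof LocalDate) {
            // not covered by the requirements; accepting it is an assumption of this sketch
            return true;
        }
        // requirement 3: one integer and one date
        BigInteger number = (BigInteger) (a instanceof BigInteger ? a : b);
        LocalDate date = (LocalDate) (a instanceof LocalDate ? a : b);
        return number.compareTo(BigInteger.valueOf(100)) < 0 && date.isBefore(today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println(holds(typed("20"), typed("30"), today));          // first "elem" of XML1
        System.out.println(holds(typed("10"), typed("2005-10-07"), today));  // second "elem" of XML1
    }
}
```

The typed() helper corresponds to what "castable as" would be used for on untyped strings, while holds() works on values that already carry a type, which is the situation the assertion is in after union validation.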

If you would like more visual or design elegance than the single assertion above offers, feel free to break its rules into two or more assertions.

I would also like to present another XSD 1.1 assertion example, one that uses neither an XPath 2.0 "castable as" nor an "instance of" expression. It demonstrates that if the XDM node an assertion works on is already typed, "castable as" is usually unnecessary (it is mainly useful for checking string/untyped values against a type), and "instance of" is needed only in some cases, such as the union type above.

Following is a slightly modified variant of the XML instance document specified above (XML1):

[XML2]

<X>
   <elem>
     <a>2</a>
     <b>2012-02-04</b>
   </elem>
   <elem>
     <a>10</a>
     <b>2005-10-07</b>
   </elem>
</X>

The XSD schema should express the following constraints with respect to the above XML instance document (XML2):
1. The element "a" is typed as an xs:nonNegativeInteger value, and element "b" is typed as xs:date.
2. Adding the number of days given by the numeric value of element "a" to the xs:date value of element "b" must result in an xs:date value that is less than the current date.

The following XSD (1.1) schema document describes all of the above validation constraints for a sample XML instance document (XML2) provided above:

[XS2]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   
     <xs:element name="X">
        <xs:complexType>
           <xs:sequence>
              <xs:element name="elem" maxOccurs="unbounded">
                 <xs:complexType>
                    <xs:sequence>
                       <xs:element name="a" type="xs:nonNegativeInteger"/>
                       <xs:element name="b" type="xs:date"/>
                    </xs:sequence>
                    <xs:assert test="(b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()"/>
                 </xs:complexType>
              </xs:element>
           </xs:sequence>
        </xs:complexType>
     </xs:element>
   
</xs:schema>
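The date arithmetic in this assertion can be worked through in plain Java. The sketch below is only an analogy of the XPath expression: the class and method names are invented, java.time.LocalDate stands in for xs:date without a time zone, and "today" is passed in as a parameter so that the result is reproducible.

```java
import java.time.LocalDate;

public class DatePlusDaysSketch {

    // Mirrors (b + xs:dayTimeDuration(concat('P', a, 'D'))) lt current-date()
    static boolean holds(String a, String b, LocalDate today) {
        long days = Long.parseLong(a);        // element "a": a non-negative number of days
        LocalDate date = LocalDate.parse(b);  // element "b": a date such as 2012-02-04
        return date.plusDays(days).isBefore(today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println(holds("2", "2012-02-04", today));   // first "elem" of XML2
        System.out.println(holds("10", "2005-10-07", today));  // second "elem" of XML2
    }
}
```

For the first "elem" of XML2, 2012-02-04 plus 2 days is 2012-02-06, so the constraint holds on any day after 2012-02-06 and fails on that day or earlier.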

That's all I had to say today.

I hope this post was useful.