Saturday, August 22, 2009

XML document validation, while parsing with Java DOM API

I spent a few hours discovering this while using the DOM XML parsing API with Xerces-J in a Java program.

I wanted to parse an XML document in Java with a plain DOM parser and validate it at the same time, using either a W3C XML Schema or a DTD.

The following sequence of statements needs to be written for this:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setValidating(true);
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema ...
dbf.setSchema(schema);
DocumentBuilder docBuilder = dbf.newDocumentBuilder();
docBuilder.parse(..


These statements are all that is necessary to accomplish this task. But there are a few catches here, which I wish to share.

1) If dbf.setValidating(true) is specified, then a DTD is mandatory. Even if a W3C XML Schema is provided with dbf.setSchema(..), parsing would fail when the DTD is absent, because dbf.setValidating(true) turns on DTD validation.

2) If we only want to validate against a W3C XML Schema, then we shouldn't call dbf.setValidating(true); it is required only for DTD validation.
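The two catches can be sketched as follows. This is a minimal example, assuming the JAXP implementation bundled with the JDK (Xerces-J is expected to behave the same way). The class name DomValidationModes, the greeting schema and the sample documents are hypothetical and exist only for illustration. The error handler rethrows validation errors so that they surface as exceptions instead of depending on the parser's default handling.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DomValidationModes {

    // Hypothetical schema and documents, used only for illustration.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='greeting' type='xs:string'/>"
      + "</xs:schema>";
    static final String PLAIN_DOC = "<greeting>hi</greeting>";
    static final String DOC_WITH_DTD =
        "<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)>]><greeting>hi</greeting>";

    // Parses xml into a DOM tree. xsd may be null (no schema validation);
    // dtdValidating maps directly to dbf.setValidating(..).
    static Document parse(String xml, String xsd, boolean dtdValidating)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setValidating(dtdValidating); // true requests DTD validation
        if (xsd != null) {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            dbf.setSchema(sf.newSchema(new StreamSource(new StringReader(xsd))));
        }
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns true when parsing and validation complete without an error.
    static boolean isValid(String xml, String xsd, boolean dtdValidating) {
        try {
            parse(xml, xsd, dtdValidating);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, isValid(PLAIN_DOC, XSD, false) exercises schema-only validation, while isValid(PLAIN_DOC, null, true) shows the DTD requirement that setValidating(true) brings in.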

I spent a few hours discovering this, and thought that somebody might benefit from this post.

Saturday, August 8, 2009

XML Schema 1.1: inheritable attributes, and their implementation in Apache Xerces-J

The XML Schema 1.1 language defines a new facility for declaring attributes as inheritable.

In version 1.1 of the XML Schema language, an attribute declaration can specify an additional property, inheritable (of schema type xs:boolean). It indicates that all descendant elements of the element carrying the inheritable attribute can access that attribute by its name.

It could at first appear to a reader of the XML Schema 1.1 spec that an inheritable attribute is something that can be physically present (i.e., as a copy) on descendant elements. But this is not the correct interpretation of the inheritable attributes concept. I'll try to illustrate this point with a few examples in this post.

Please consider the following XML Schema 1.1 fragment:

  <xs:element name="X">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Y" type="xs:string" />
      </xs:sequence>
      <xs:attribute name="attr" type="xs:int" inheritable="true" />
    </xs:complexType>
  </xs:element>

This corresponds to an XML structure like the following:

  <X attr="1">
    <Y>hello</Y>
  </X>

The above XML Schema 1.1 fragment indicates that the attribute "attr" is inheritable. The word inheritable seems to convey that the following XML fragment could be valid as well against the above XML Schema 1.1 fragment:

  <X attr="1">
    <Y attr="1">hello</Y>
  </X>

But this interpretation of inheritable attributes is not correct. Inheritable attributes cannot be physically copied to the descendant elements. In the above examples, the schema type of element "Y" is a simple type (i.e., xs:string). So how could Y have an attribute "attr", since by definition elements with simple types cannot have attributes? Only XML Schema complex types can specify attributes. XML Schema 1.1 inheritable attributes do not change the nature of XML Schema simple types and simple content. The presence of attributes on any XML element is governed only by the attribute declarations on the complex type definition of that element. This meaning of attributes with respect to XSD complex types is preserved in XML Schema 1.1 as well.
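The rule that an element with a simple type cannot carry attributes can be checked with a small sketch. It assumes the XSD 1.0 validator bundled with the JDK, so the schema below omits inheritable="true", which that validator would not accept; the rule being demonstrated is the same in both versions of the language. The class name SimpleTypeAttributeCheck is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SimpleTypeAttributeCheck {

    // The schema fragment from the post, minus inheritable="true"
    // (an assumption made so that an XSD 1.0 processor can load it).
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='X'><xs:complexType>"
      + "<xs:sequence><xs:element name='Y' type='xs:string'/></xs:sequence>"
      + "<xs:attribute name='attr' type='xs:int'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true when xml is valid against XSD, false on any validation error.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            // The default Validator behavior is to throw on validation errors.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Here isValid("<X attr='1'><Y>hello</Y></X>") is expected to accept the document, while the variant with attr copied onto Y is expected to be rejected, because Y has the simple type xs:string.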

It is then interesting to ask what the use of declaring an attribute as inheritable could be, when it cannot be physically present on the descendant elements.

Inheritable attributes are useful in an XML Schema 1.1 facility such as Conditional Type Assignment (CTA), also known as type alternatives.

Please consider the following XML Schema 1.1 example, which defines an XML element and its schema type using CTA and inheritable attributes:

  <xs:element name="X">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Y" type="xs:anyType">
          <xs:alternative test="@attr = 'INT'" type="xs:int" />
          <xs:alternative type="xs:error" />
        </xs:element>
      </xs:sequence>
      <xs:attribute name="attr" type="xs:string" inheritable="true" />
    </xs:complexType>
  </xs:element>

As per the above Schema (using type alternatives), the following XML instance is valid:

  <X attr="INT">
    <Y>100</Y>
  </X>

But the following XML instance would be invalid:

  <X attr="INT">
    <Y>hello</Y>
  </X>
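Checking the two instances above needs an XSD 1.1 processor. The sketch below assumes an XSD 1.1 capable Xerces-J build is on the classpath and uses the schema language URI that Xerces-J documents for its XSD 1.1 SchemaFactory; with only the JDK's built-in XSD 1.0 validator, it reports that no such processor is available. The class name CtaInheritedAttributeCheck is hypothetical, and the sketch declares attr as xs:string so that the value INT is itself a valid attribute value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class CtaInheritedAttributeCheck {

    // Assumption: an XSD 1.1 capable Xerces-J build registers this schema language URI.
    static final String XSD11_URI = "http://www.w3.org/XML/XMLSchema/v1.1";

    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='X'><xs:complexType>"
      + "<xs:sequence>"
      + "<xs:element name='Y' type='xs:anyType'>"
      + "<xs:alternative test=\"@attr = 'INT'\" type='xs:int'/>"
      + "<xs:alternative type='xs:error'/>"
      + "</xs:element>"
      + "</xs:sequence>"
      + "<xs:attribute name='attr' type='xs:string' inheritable='true'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns empty when no XSD 1.1 SchemaFactory can be found,
    // otherwise whether xml is valid against XSD.
    static Optional<Boolean> isValid(String xml) {
        SchemaFactory factory;
        try {
            factory = SchemaFactory.newInstance(XSD11_URI);
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // no XSD 1.1 processor on the classpath
        }
        try {
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return Optional.of(true);
        } catch (SAXException e) {
            return Optional.of(false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Returning an empty Optional keeps the sketch usable on a plain JDK, where it does not pretend to have validated anything.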

Inheritable attributes are also particularly useful for declaring the attribute xml:lang as inheritable on XML elements.

I got to know these facts after raising a query last week on the W3C XML Schema comments forum.

I am thankful to the following gentlemen for answering my queries on the W3C XML Schema forum:

C. M. Sperberg-McQueen
Noah Mendelsohn
Michael Kay

What I really wanted to share in this blog post (other than what XML Schema inheritable attributes are used for) is that I've written an implementation of inheritable attributes for Apache Xerces-J's XML Schema 1.1 validator. I've submitted a patch for this to the Apache Xerces-J JIRA issue tracking system.

This patch currently has a full implementation of the attribute syntax changes (i.e., the presence of the inheritable attribute itself, and its binding to the XML Schema type xs:boolean).

I'm in the process of enhancing the Xerces-J implementation of the Conditional Type Assignment (CTA) facility so that it can use inheritable attributes. I hope to complete the CTA changes in Xerces-J for inheritable attributes in the near future.

After the Xerces-J committers have completed all necessary reviews of this patch, I hope the inheritable attributes implementation will go into the Xerces-J SVN repository, and will in all likelihood subsequently become part of a future official release of Xerces-J.

2009-08-14: Today I submitted all the Conditional Type Assignment (CTA) related changes for inheritable attributes to the Apache Xerces-J JIRA issue tracking system. I would say that XML Schema 1.1 inheritable attributes, and their integration with CTA, are now complete for Xerces-J. I'm feeling good about it :)