Sunday, May 25, 2008

One-based indexes in XPath

Justin Johansson asked an interesting question on the xsl-list.

Would someone please give me advice as to why "1-based" indexes are used in XPath, such as para[1] instead of para[0] for the first para item/element?

Why does the spec for XPath (and its/XQuery operator/function library) go against the norm for modern programming languages in which zero is the base for array-like collections?

An interesting discussion followed on xsl-list.

Colin Adams wrote: Zero is not the norm for modern programming languages. It might well be for ancient ones. It is a very poor choice, justifiable only when trying to squeeze the last ounce of speed in a highly numerically-intensive application.

And even there it is not justified - you simply use data structures that have an unused first element, and so avoid the subtract one operation in that way.

I reasoned:
Let's say we have to select a node as follows:

following-sibling::xx[1]

To me, traversal along a following (or, say, preceding) axis makes more sense if indexes start from 1: the nearest node on the axis is then simply the first one.

I also think 0-based indexes in lower-level languages (I consider Java and C to be lower level than XPath, and I am including assembly languages too) are related to hardware addressing.

For example, a memory might have addresses ranging from 0000 to 1111 (this is just a small amount of memory). This probably has to do with the logic of bits, where 0 has a very important meaning.

A lot of programming languages (including the ones you mentioned) have 0-based indexes (in arrays, strings, etc.), so compilers can easily map them to hardware locations.

Indexes in XPath start from 1 because it's more convenient for the users.

Michael Kay commented on my reasoning as follows, "I don't think hardware addressing is the only benefit of 0-based addressing. It also makes computations easier. If you number the rows and columns on a chessboard from 0-7, and the squares from 0-63, then the square number is row*8+column, whereas with 1-based addressing it is (row-1)*8+column.

And we do sometimes use 0-based logic in real life too. In many countries the "first floor" is the one above where you enter the building; and in many societies a child is "1 year old" between 12 months and 24 months after their birth.

But on balance I do think 1-based logic was the right choice for XPath and XSLT."
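
The chessboard arithmetic in Michael's comment can be checked with a few lines of Python. This is my own sketch, not something from the thread, and the function names are made up for illustration:

```python
def square_zero_based(row, column):
    """Rows and columns numbered 0-7, squares numbered 0-63."""
    return row * 8 + column


def square_one_based(row, column):
    """Rows and columns numbered 1-8, squares numbered 1-64."""
    return (row - 1) * 8 + column


# Enumerate every square of the board under both conventions.
zero_based = [square_zero_based(r, c) for r in range(8) for c in range(8)]
one_based = [square_one_based(r, c) for r in range(1, 9) for c in range(1, 9)]
```

Both lists cover the board without gaps or repeats; the only difference is the extra "- 1" that the 1-based formula has to carry.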

Michael further said, "Because the language was designed for users, not for programmers, and users still have this old-fashioned habit of referring to the first chapter in a book as Chapter One. (Though I did once hear Dijkstra refer to the fourth slide in someone's presentation as the third.)

(I fully agree that when handling tables, or subscripting into strings, zero-based addressing would often be much more convenient. There are arguments both ways, and as always, I can't tell you what the actual history of the decision was; I can only post-rationalize it.)".

Owen Rees shared an interesting point: Dijkstra wrote a note "Why numbering should start at zero": http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html.

The book "Informal introduction to ALGOL 68" numbers its chapters from zero. I remember that as being unusual, amusing, and also appropriate given the audience.

Since XSLT does not have multi-dimensional sequences, the issues that arise when multi-dimensional arrays can be treated as one-dimensional arrays do not arise. Zero-based indexing tends to make the index formulae simpler when accessing a multi-dimensional array with a one-dimensional array access expression but on the whole I think it is best to just not do that sort of thing at all.

I don't think that the age of language or design for programmer or user arguments stand up very well.

FORTRAN is a counterexample to the "old languages use zero" argument, it uses 1. Modern versions of FORTRAN allow the lower bound to be specified (as Algol did) but the default is still 1.

Dartmouth BASIC used zero-based indexing, I have no information about why, but I doubt if it has anything to do with being close to assembly code or concern for the last ounce of speed.

I think that both of these languages were originally aimed at people who did not think of themselves as being or intending to become programmers.

One related point here is that those thinking in terms of other programming languages may be misled by the syntactic similarity of the XPath expression para[1] to an array indexing expression in various other languages. Understanding that para[1] is just a shorthand for para[position()=1] moves the issue to the question of why position() is defined the way it is.

Defining last() to be the context size means that context position has to count from one unless you are going to cause either 'last' or 'size' to have a very counterintuitive meaning.
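
Python's standard xml.etree.ElementTree module supports a small XPath subset and happens to show both conventions side by side: its positional predicates count from 1, while Python lists count from 0. A minimal sketch, with a sample document I made up for the purpose:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc><para>one</para><para>two</para><para>three</para></doc>"
)

# XPath-style positional predicates: position 1 is the first para,
# and last() is the final position.
first_by_xpath = doc.findall("para[1]")[0].text
last_by_xpath = doc.findall("para[last()]")[0].text

# The same nodes through an ordinary Python list, which is 0-based.
paras = doc.findall("para")
first_by_list = paras[0].text
last_by_list = paras[len(paras) - 1].text

# With 1-based positions, the last position equals the number of items.
context_size = len(paras)
```

Here the last para sits at XPath position 3, which is also the number of para elements, whereas the list index of the same node is 2.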

Wednesday, May 14, 2008

Namespace-based validation dispatching language

A recent discussion on the xml-dev list motivated me to learn more about NVDL.

NVDL stands for Namespace-based Validation Dispatching Language.

NVDL is Part 4 of ISO/IEC 19757 DSDL (Document Schema Definition Languages).

If an XML document is composed of sections belonging to multiple namespaces, then each section (differentiated by its namespace) can be dispatched for validation to a different schema processor (e.g., RELAX NG, W3C XML Schema, DTD, or Schematron).

This is an innovative way to validate different parts of an XML document with different schema languages.
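
To make the dispatching idea concrete, here is a rough Python sketch. It is not NVDL syntax and not an NVDL implementation: it only splits a document into sections wherever the namespace changes and looks up which schema language each section would be sent to. The namespaces, the sample document, and the DISPATCH table are all assumptions of mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical dispatch table: namespace URI -> schema language to use.
DISPATCH = {
    "urn:example:order": "RELAX NG",
    "urn:example:html": "W3C XML Schema",
}


def namespace_of(element):
    """Return the namespace URI of an ElementTree element ('' if none)."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def local_name(element):
    """Return the element name without its namespace part."""
    return element.tag.split("}", 1)[-1]


def sections(element, parent_ns=None):
    """Yield (namespace, element) wherever the namespace changes."""
    ns = namespace_of(element)
    if ns != parent_ns:
        yield ns, element
    for child in element:
        yield from sections(child, ns)


doc = ET.fromstring(
    '<order xmlns="urn:example:order" xmlns:h="urn:example:html">'
    "<item>Book</item>"
    "<h:note><h:b>fragile</h:b></h:note>"
    "</order>"
)

# Each section is paired with the schema language it would be validated by.
plan = [(local_name(el), DISPATCH[ns]) for ns, el in sections(doc)]
```

In this sketch the order element starts one section and the note element starts another, so plan holds one entry for each. A real NVDL processor would go on to hand each section to an actual validator; that part is left out here.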

This is what Roger Costello wrote about NVDL on xml-dev list:

Here are the evolutionary changes I envision NVDL bringing about in the marketplace:

1. Opens the marketplace to utilizing a variety of schema languages.

Previously, you and all your trading partners were locked into using one schema language (typically W3C XML Schema) if you wanted interoperability. With NVDL that limitation is lifted and you can achieve interoperability while using a variety of schema languages.

2. Promotes using the right schema language for the right job.

XML Schema and RELAX NG are two schema languages for expressing grammar-based rules. They are both standards, the former a W3C standard, the latter an ISO standard. Although their capabilities are largely overlapping, there are important differences. "Use the right tool for the right job" is an adage that applies to choosing a schema language. Knowing the differences in capabilities is important to making a good decision in choosing a schema language.

3. Encourages the creation of small, simple, independent schemas, written in any schema language.

4. Moves the application developer's focus from:

"using a schema"

to:

"using XML vocabularies"

Sunday, May 11, 2008

Some differences between XQuery 1.0 and XSLT 2.0

I can think of the following differences between XQuery 1.0 and XSLT 2.0:

1. In XQuery 1.0, functions must be declared in the query prolog, before the query body that uses them. In XSLT 2.0, functions may be defined anywhere in the stylesheet (provided that the xsl:function element is a child of xsl:stylesheet).

2. In XQuery 1.0, the XML Schema namespace, http://www.w3.org/2001/XMLSchema, does not need to be declared in order to use the prefix xs:. In XSLT 2.0, the XML Schema namespace must be declared if the stylesheet makes any reference to the prefix xs:.

3. XQuery 1.0 has (it seems to me) stronger static typing than XSLT 2.0. For example, to return an xs:string value from a function in XQuery 1.0, we cannot simply write $people/person[fname = $fName]/lname/text(). Instead we have to write, for example, xs:string($people/person[fname = $fName]/lname/text()).
In XSLT 2.0, an expression like $people/person[fname = $fName]/lname is able to return an xs:string value (if the element 'lname' contains text-only data).

4. Moreover, an XQuery program is not a template-based program description (as an XSLT stylesheet is). The XQuery syntax looks to me like a mix of procedural and declarative syntax.

It's also true that XSLT and XQuery are both functional in nature. But that's a similarity between these two languages!

The following XQuery 1.0 and XSLT 2.0 examples illustrate the above points:

Input XML:
<?xml version="1.0" encoding="UTF-8"?>
  <people>
    <person>
      <fname>Mukul</fname>
      <lname>Gandhi</lname>
    </person>
    <person>
      <fname>Rohit</fname>
      <lname>Rawat</lname>
    </person>
  </people>

XQuery program:
declare namespace my = "http://localhost/functions";

declare function my:getLastName($people as element(), $fName as xs:string)
as xs:string
{
   xs:string($people/person[fname = $fName]/lname/text())  
};

<person>
  <fname>Mukul</fname>
  <lname>{my:getLastName(doc("../Data/test.xml")/people, "Mukul")}</lname>
</person>

XSLT 2.0 stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://localhost/functions"
                exclude-result-prefixes="xs my">

<xsl:output method="xml" indent="yes" />

<xsl:template match="/">
  <person>
    <fname>Mukul</fname>
    <lname><xsl:value-of select="my:getLastName(people,'Mukul')" /></lname>
  </person>
</xsl:template>

<xsl:function name="my:getLastName" as="xs:string">
  <xsl:param name="people" as="element()" />
  <xsl:param name="fName" as="xs:string" />

  <xsl:sequence select="$people/person[fname = $fName]/lname" />

</xsl:function>

</xsl:stylesheet>


Michael Kay shared the following observation on xsl-list:
XQuery tends to work better when you want to extract a small amount of information from a large document and ignore the rest. XSLT tends to work better if you want to keep most things the same and make a few small changes. Of course there's a range of tasks between those extremes.

Tuesday, May 6, 2008

Namespace nodes for literal result elements

A recent discussion on xsl-list taught me something new about the XSLT 2.0 language. Following are my thoughts about it.

Suppose we have this simple stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:a="http://mydomain" version="2.0">

<xsl:output method="xml" indent="yes" />

<xsl:template match="/">
<result>
<x/>
</result>
</xsl:template>

</xsl:stylesheet>

When run, this stylesheet produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<result xmlns:a="http://mydomain">
<x/>
</result>

Please note the xmlns:a namespace declaration in the output.

To get rid of this namespace declaration in the output, we can either:

specify exclude-result-prefixes="a" on the xsl:stylesheet element, or

declare the literal result element in the stylesheet as follows:

<result xsl:exclude-result-prefixes="a">
<x/>
</result>

The question is: Why is the namespace declaration copied to the output?

The answer can be found in the XSLT 2.0 specification, at http://www.w3.org/TR/xslt20/#lre-namespaces. According to the specification, the namespace nodes that are in scope for a literal result element are copied to the output along with it. The XSLT namespace itself (http://www.w3.org/1999/XSL/Transform) is not copied, and a few additional exceptions are specified in the spec, such as namespaces excluded with exclude-result-prefixes.

Saturday, May 3, 2008

Output validation with XSLT 2.0

An interesting example of schema-aware XSLT stylesheet design occurred to me. Below is the code for it.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="2.0">

<xsl:output method="xml" indent="yes" />

<xsl:import-schema>
  <xs:schema>
    <xs:element name="x">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="y" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
</xsl:import-schema>

<xsl:import-schema>
  <xs:schema>
    <xs:element name="p">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="q" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
</xsl:import-schema>

<xsl:template match="/">
  <xsl:variable name="temp1">
    <x xsl:validation="strict">
      <y/>
    </x>
  </xsl:variable>
  <xsl:variable name="temp2">
    <p xsl:validation="strict">
      <q/>
    </p>
  </xsl:variable>
  <result>
    <xsl:copy-of select="$temp1" />
    <xsl:copy-of select="$temp2" />
  </result>
</xsl:template>

</xsl:stylesheet>

This stylesheet imports/declares two inline XSD schemas. In the body of the root template, two variables (temp1 and temp2) request strict validation of the element markup.

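If we run this example with a schema-aware XSLT 2.0 processor, we find that the stylesheet cannot generate invalid content. For example, replacing <y/> inside <x> with some other element should make the transformation fail with a validation error.

An alternative way to write the above example is: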

<xsl:template match="/">
  <xsl:variable name="temp1">
    <x>
      <y/>
    </x>
  </xsl:variable>
  <xsl:variable name="temp2">
    <p>
      <q/>
    </p>
  </xsl:variable>
  <result>
    <xsl:copy-of select="$temp1" validation="strict" />
    <xsl:copy-of select="$temp2" validation="strict" />
  </result>
</xsl:template>

Here we specify the validation="strict" attribute on the xsl:copy-of instruction instead.

The intended meaning is the same in both cases.

To me this is quite a useful XSLT facility. XSLT 2.0 is very flexible about where validation of the output tree occurs.