scala.xml

(draft book, updated for Scala 2.6.1)


Table of Contents

preface
I. Semistructured Syntax and Data
1. Introduction
XML, Types and Objects
Developer Perspectives
Acknowledgements
2. The scala.xml API
Nodes and Attributes
Elements and Text
Embedded expressions
Other nodes
Matching XML
Updates and Queries
Names and Namespaces
Sharing namespace nodes
3. XPath projection
4. XSLT style transformations
5. XQuery style querying
6. Loading and Saving XML
The native Scala parser
Pull parsing (experimental)
II. Library
7. Overview
8. scala.xml runtime classes
scala.xml.Node
scala.xml.NodeSeq
scala.xml.Elem
SpecialNode
Atom
EntityRef
scala.xml.MetaData
scala.xml.Null
scala.xml.PrefixedAttribute
scala.xml.UnprefixedAttribute
scala.xml.NamespaceBinding
scala.xml.TopScope
9. Scala's XML syntax, formally
10. Interpretation of XML expressions and patterns
III. Tools
11. xinc
EHR's SAXIncluder
12. schema2src
Introduction to Data Binding
13. xslt2src
14. xquery2src
IV. Appendix
A. Scala/XML expression grammar
EBNF productions
Summary of changes
Omissions from XML syntax
B. Implementation Chart: Information Set
Bibliography

List of Tables

12.1.
12.2.

preface

Abstract

We shed light on Scala's XML data model and the syntax of literal XML markup in Scala code.

Part I. Semistructured Syntax and Data

Chapter 1. Introduction

Scala [scala] is a programming language that compiles to Java Virtual Machine(tm) bytecode, supports a variety of programming styles, and can call Java libraries. It provides extensive library support for XML processing with functional and object-oriented techniques.

This book aims to inform the reader of Scala's XML facilities. Some basic knowledge of Scala is assumed, as provided by the Scala Overview, a cursory reading of [scala-programming], or any of the fashionable Scala books that are coming out these days (use Google to find them). Before we embark on this journey, let me try to place scala.xml within the big picture:

  • Some consider XML as just syntax: In this view, the core XML specification [xml] merely talks about sequences of characters with some markers (tags) in angle brackets appearing here and there. XML is kind of "meta" because the spec authors do not say which tags. When tags are "instantiated" to concrete structuring elements like <html>, then the XML spec speaks of an XML application (like XHTML, DocBook or Atom).

  • XML is also something like a data model: The nesting of tags in an XML document provides a neat tree structure, which can be used to represent data. Thus, most of this book is concerned with trees and sequences of trees. Thinking in trees is useful, for instance when an XML transformation can be described as a recursive tree traversal. However, it is sometimes too imprecise: we might encounter a string like "23" and have to decide whether we actually want to consider it as an integer, a string, or a day of the month. [1]

The scala.xml library is designed to help with both perspectives, and for the latter, to keep open the option of unmarshalling parts of the XML to object and value representations ("data binding"). In this document I try to promote an understanding of the library classes, programming constructs and design patterns provided to this end. This should help the reader do things like parsing, maybe validating, applying recursive transformations, querying and data binding.

There is a wealth of XML-specific programming languages, which however do not integrate too well with the object-oriented paradigm [2]. Scala is a language that is particularly open to elegant solutions to old problems, because it allows new programming abstractions to be defined easily, providing some opportunities to bridge syntactical gaps and achieve somewhat tighter integration.

XML, Types and Objects

Types in programming can help structure your code, remind you of data invariants and push the compiler to detect errors and apply optimizations. A type system should be considered a simple and effective form of program verification.

Data Types in XML specifications are concerned with assigning meaning to sequences of characters -- not with programming. The types thus introduced express some form of structural invariants of XML documents and fragments. This, and not more, is what standards like "Document Type Definition" (DTD) [xml], the more recent XML Schema Definitions (XSD) [xsd1][xsd2] and Relax NG (RNG) [rng] schemata achieve. Less well-known alternatives are Schematron and Document Structure Description (DSD) [dsd]. An XML document conforming to such a schema is called valid or schema-valid. For the programmer, the job does not end at data definition; it begins there. And then there is a whole range of XML programs that don't need this datatype business at all.

I believe it is wrong to impose one perspective and completely neglect the other. Probably, most users of scala.xml are interested in generating (X)HTML: These users need support for almost all details of the XML and XHTML spec, plus some knowledge about browser incompatibilities. The Scala language thus supports cut-and-paste compatibility for XML literals.

On the other hand, there are benefits of using type information. Today there is some static type checking for XQuery and XSLT, but they are somewhat all-or-nothing, forcing the developer to decide whether to live in a typed or an untyped XML world. The scala.xml library keeps all options open: one can manipulate XML without worrying about a schema, but there are ways to convert to or represent some attribute or element text as a Scala/Java integer.

Developer Perspectives

XML programming is a placeholder for many different approaches developers have for XML processing. We will call such a perspective "generic" if it does not depend on any particular XML application (nothing to do with Java generics). A generic approach can deal with XHTML just as it can deal with a markup language for cooking recipes. There are several points of view that can be taken:

  1. XML is regarded as text. We ignore the tree structure completely. Some text/regular expression search is used to retrieve or manipulate information. This can get you quite far for small tasks. Go away, use perl :-)

  2. XML is parsed into a (mutable) object graph that represents the tree structure in a generic way. The Document Object Model (DOM) [dom-L3] and related programming interfaces [dom4j] [jdom] [xom] provide more or less standard APIs to manipulate such trees in a general-purpose language. Not surprisingly, scala.xml comes with its own API. It is possible to convert to and from others, but this is not yet part of the library.

  3. While parsing XML, a sequence of events is generated. These events either trigger callbacks (push, application is the callee, like in the Simple API for XML (SAX) [sax]) or the application fetches its events itself (pull, implemented in Streaming API for XML (StAX) [stax]). There is an experimental pull API for Scala that allows experimenting with this view (see the scala.xml.pull API documentation).

  4. XML is the communication format to interact with a database -- not so much like MySQL running on the same machine, but more like "Acme Corp has good data and allowed us to send them queries". This would use a query language like XQuery [xquery]. An experimental XQuery-to-Scala-source translator is available to support this view [xquery2src].

  5. XML is transformed by applying style templates (like XSLT [xslt]). This falls under the more general term of "recursive transformations". There are some library classes that achieve the same (see package scala.xml.transform). There is also an XSLT-to-Scala-source translator, which is a bit outdated and does not work with the current version of Scala, but which might be revived one day if anybody asks me. For new developments, it is more straightforward to use the more convenient Scala API rather than the cumbersome XSLT syntax, or (if it really must be XSLT), some Java library.

  6. XML is considered as bare trees, and we want to deal with XML "natively". Then the scala.xml API provides methods to handle these structures, with support for XPath like selection and pattern matching.

Using scala.xml feels somewhere between using some DOM API and having an XML-specific language. Besides the literal syntax, there is actually no language support. In Scala, most "features" are realized not as language extensions but as libraries. Even the literal XML syntax is desugared into code that constructs objects -- so it is possible to express everything that can be expressed in XML literals (and even more) programmatically, without actually using XML syntax.

Acknowledgements

Thanks go to Martin Odersky for giving me the freedom in designing this library. Also, without the past and present LAMP staff, Scala would not be what it is today. Matthias Zenger, Michel Schinz, Philippe Altherr, Vincent Cremet, Erik Stenman, Gilles Dubochet, Stéphane Micheloud, Lex Spoon, Sean McDirmid, Nikolay Mihaylov, Iulian Dragos. Some of these guys were pretty ardent XML detractors, which is sometimes good as it reminds one that no XML API is a silver-bullet.

Jamie Webb and Jon Pretty of Sygneca gave a lot of feedback and a couple of features were suggested by them. Students that took undergraduate projects helped to weed out bugs and improve performance and usability -- thank you to Simon Barbey, Fatemeh Borran, Susann Bucher, Badr Hejira, Florian Hof, Clément Hongler and Lukas Rytz.

Update: For the latest iteration of this draft's release, Jonas Bonér, David Pollak, David Hall, Michael Fortson deserve thanks for reporting bugs in the code and the document.



[1] The tree view is somewhat encouraged by the XML InfoSet specification [info], and by common sense.

[2] Mary Fernandez aptly described the problem as "throwing your data over the wall" [mf-wall]

Chapter 2. The scala.xml API

Nodes and Attributes

The Scala programming language offers a wide range of constructions and library routines that make dynamic XML processing simple and effective. This section contains an overview of the most common ways to construct XML.

Elements and Text

Probably the easiest way to put XML data in your program is to copy and paste it into your program. The following code demonstrates this. To make the matter more interesting, it is spiced with an HTML description.

/* examples/xml/phonebook/phonebook1.scala */
package phonebook  

object phonebook1 {

  val labPhoneBook = 
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the 
        <a href="http://acme.org">ACME</a> corporation.
      </descr>
      <entry>
        <name>Burak Emir</name> 
        <phone where="work">+41 21 693 68 67</phone>
      </entry>
    </phonebook>;

  def main(args: Array[String]) =
    Console.println( labPhoneBook )

}

The Scala parser recognizes the full XML grammar. Further down, we shall see that it actually recognizes a superset, allowing it to parse mixed and nested Scala and XML expressions. As a principle, everything allowed in XML is allowed in Scala, with the only exceptions being motivated by the fact that some aspects of the XML spec just don't make sense for the source code of a program. In return, the extensions to the syntax have been made in order to make programming easier.


$ scalac -d /tmp examples/xml/phonebook/phonebook1.scala
$ scala -classpath /tmp phonebook.phonebook1
<phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        <a href="http://acme.org">ACME</a> corporation.
      </descr>
      <entry>
        <name>Burak Emir</name>
        <phone where="work">+41 21 693 68 67</phone>
      </entry>
    </phonebook>
    

XML nodes in Scala are always instances of some subclass of scala.xml.Node. The library uses an immutable representation (no parts of an XML node can be changed), but programmers may provide their own mutable subclasses of scala.xml.Node if required.

By default, elements are represented using scala.xml.Elem and scala.xml.Text. These are case classes, so they can be constructed without having to write new and can be used as patterns in a match expression.

The Elem class looks roughly like this:

case class Elem(val prefix: String,          // namespace prefix
                val label: String,           // (local) tag name
                val attributes: MetaData,
                val scope: NamespaceBinding, // namespace bindings
                val child: Node*) extends Node { ... }

From the constructor, we can see what constitutes an XML element (we shall treat namespaces later). The last formal parameter definition child: Node* indicates that an arbitrary number of nodes (including zero) may be passed to the Elem constructor. In fact, a phonebook like the one above can equivalently be written without any XML syntax, like this:

/* examples/xml/phonebook/verboseBook.scala */
package phonebook 

object verboseBook {

  import scala.xml.{ UnprefixedAttribute, Elem, Node, Null, Text, TopScope } 

  val pbookVerbose = 
    Elem(null, "phonebook", Null, TopScope,
       Elem(null, "descr", Null, TopScope,
            Text("This is a "), 
            Elem(null, "b", Null, TopScope, Text("sample")),
            Text("description")
          ),
       Elem(null, "entry", Null, TopScope,
            Elem(null, "name", Null, TopScope, Text("Burak Emir")),
            Elem(null, "phone", new UnprefixedAttribute("where","work", Null), TopScope, 
                 Text("+41 21 693 68 67"))
          )
       )

  def main(args: Array[String]) = 
    Console.println( pbookVerbose )
}

This code does almost the same as the code above. However, the outputs of the two programs differ:

$ scalac -d /tmp examples/xml/phonebook/verboseBook.scala
$ scala -classpath /tmp phonebook.verboseBook
<phonebook><descr>This is a <b>sample</b>description</descr><entry><name>Burak Emir</name><phone where="work">+41 21 693 68 67</phone></entry></phonebook>

Why did the former output look somewhat better, although still not perfect? The answer lies in the whitespace contained in the former program source. Scala's XML parser adopts the simple rule that within XML expressions, whitespace is preserved everywhere. In verboseBook, we did not care to construct superfluous nodes containing only whitespace, so there was no whitespace when we printed it. In most cases, this does not matter (flamewars on xml-dev notwithstanding). A pretty printer is available to obtain more human-readable output -- try changing main to:

  def main(args: Array[String]) = 
    Console.println( new scala.xml.PrettyPrinter(80 /*width*/, 3 /*indent*/).format(pbookVerbose) )
      

Three things are worth remembering:

  • Mixed content has to be expressed by juxtaposing Text and Elem.

  • Attributes are an immutable linked list of UnprefixedAttribute objects, terminated with the Null object.

  • The Elem constructor is special in that it can deal with an arbitrary number of arguments.

The mysterious occurrences of null (lowercase) and TopScope are for namespace handling. They will be explained later, together with PrefixedAttribute, in the section on namespaces.

Passing sequences to Elem

Sometimes, we want to call a constructor with a sequence parameter, but the sequence of arguments is computed dynamically. The Elem constructor can deal with a sequence as long as you tell the compiler that it is one. You do this by annotating the sequence with _*, like this:

val myElem = Elem(null, "baz", Null, TopScope, computeList(42,"froz"):_* );

Assuming that the result computeList(42,"froz") will be List(Elem(null, "foo", Null, TopScope), Elem(null, "bar", Null, TopScope)), then the above code has the same effect as

val myElem = Elem(null, "baz", Null, TopScope, 
                  Elem(null, "foo", Null, TopScope), 
                  Elem(null, "bar", Null, TopScope) )
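
The sequence escape is not specific to Elem; it works for any method with a repeated parameter. The following sketch illustrates this with a hypothetical helper wrap and a stand-in for computeList (both are assumptions made for illustration, not library code). It is written against a current Scala 2 compiler with the scala-xml module on the classpath, and sidesteps the Elem constructor, whose signature has changed since 2.6.1.

```scala
import scala.xml.{Elem, Node}

object SeqArgs {
  // Hypothetical helper with a repeated parameter, like Elem's child: Node*
  def wrap(children: Node*): Elem = <baz>{ children }</baz>

  // Stand-in for the dynamically computed list mentioned in the text
  def computeList(n: Int, s: String): List[Node] = List(<foo/>, <bar/>)

  // Arguments written out one by one
  val direct = wrap(<foo/>, <bar/>)

  // The same call, with the arguments supplied as a sequence via _*
  val viaSeq = wrap(computeList(42, "froz"): _*)
}
```

Both values denote the same element with two children; without the _* annotation the second call would not type-check, because a List[Node] is not a Node.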

Embedded expressions

These syntactic considerations are not very exciting yet (because we have not looked into the things that one can do with those objects). For the developer, the fun starts when some XML fragment can be parameterized or can include computed parts. This is achieved by embedded expressions, which allow Scala code to be mixed freely into XML. The following program produces the same output as phonebook1:

/* examples/xml/phonebook/embeddedBook.scala */
package phonebook  

object embeddedBook {

  val company  = <a href="http://acme.org">ACME</a>
  val first    = "Burak"
  val last     = "Emir"
  val location = "work"

  val embBook = 
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the 
        {company} corporation.
      </descr>
      <entry>
        <name>{ first+" "+last }</name> 
        <phone where={ location }>+41 21 693 68 {val x = 60 + 7; x}</phone>
      </entry>
    </phonebook>;

  def main(args: Array[String]) =
    Console.println( embBook )

}

Scala expressions are embedded within an XML fragment using single braces { } [3]. To get a literal brace character in the XML, you have to double it: {{ and }}.

Between the braces is an embedded block, which means not only expressions, but also statements, function and class definitions and pretty much everything else is allowed. The last expression in a block determines its "value" -- what will appear in the XML after evaluating preceding code.

The compiler accepts various types of values within embedded nodes -- everything that is either a scala.xml.Node or something that has a toString method is welcome. For embedded attributes, a string or a sequence of nodes will do: UnprefixedAttribute(key, string, next) is mostly equivalent to UnprefixedAttribute(key, Text(string), next).
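
A minimal sketch of these rules follows. The object and value names are made up for illustration, and the code assumes a current Scala 2 compiler with the scala-xml module on the classpath rather than Scala 2.6.1.

```scala
object EmbeddedDemo {
  val n = 3

  // An embedded block: the definition runs first, the last expression
  // becomes the content of the element.
  val sum = <sum>{ val xs = List(1, 2, n); xs.foldLeft(0)(_ + _) }</sum>

  // Doubled braces yield literal brace characters in the text.
  val css = <style>body {{ margin: 0 }}</style>

  // A value that is already a Node is inserted as a child element,
  // not as its string representation.
  val link = <p>see { <a href="http://acme.org">ACME</a> }</p>
}
```

Here the integer result of the block ends up as the text of the sum element, while the embedded a element stays a proper child node that can be selected later on.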

Nullable attributes

Often, whether a particular attribute is present depends on some condition, leading to code like this:

  if(cond)
    <foo bar="pizza">{ /*lots of code*/ }</foo>
  else
    <foo>{ /*lots of code*/ }</foo>

To simplify life in such a scenario, Scala lets you make the addition of an attribute conditional: an attribute value of null means the attribute is omitted.

    <foo bar={if (cond) "pizza" else null}>{ /*lots of code*/ }</foo>

Type-safety is a nice property, and having a compiler check options for you is often much better than using null. This is why you can also use Option types for nullable attributes, provided the option wraps an instance of Seq[Node].

    val z = if (cond) { Some(Text("pizza")) } else { None }
    <foo bar={z}>{ /*lots of code*/ }</foo>

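Both variants can be tried out with the following sketch. The object and method names are hypothetical, and the code assumes a current Scala 2 compiler with the scala-xml module on the classpath.

```scala
import scala.xml.{Elem, Text}

object NullableAttr {
  // A null string value drops the attribute.
  def withNull(cond: Boolean): Elem =
    <foo bar={ if (cond) "pizza" else null }/>

  // None drops the attribute; Some(nodes) sets it.
  def withOption(cond: Boolean): Elem = {
    val z = if (cond) Some(Text("pizza")) else None
    <foo bar={ z }/>
  }
}
```

In both cases, calling attribute("bar") on the result for cond == false yields None, so code that consumes the element cannot tell the two variants apart.
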
Other nodes

Although the above is sufficient for most purposes, there are a couple of other nodes that can be used.

  • EntityRef, ProcInstr and Comment - for entity references, processing instructions and comments.

  • Group - for grouping nodes.

  • Unparsed - for including verbatim text, e.g. when generating non-XHTML hypertext.

  • Atom - for nodes containing data of any type, e.g. int, Date.

Why do attributes contain sequences of nodes?

At first sight, it appears that attributes should only be strings and nothing else. However, there are two reasons to allow the same kind of nodes (other than element nodes) that can appear within XML: data values and entity references. For instance, the element

<foo name="s&uuml;ss" life={Atom(42)}/>

corresponds roughly to

Elem(null, 
     "foo",  
     new UnprefixedAttribute("name", List(Text("s"), EntityRef("uuml"), Text("ss")), 
       new UnprefixedAttribute("life", Atom(42), Null)),
     TopScope)
          

Fortunately, a single node always behaves as if it were a sequence of nodes, so there is no need to wrap elements in singleton lists.

Matching XML

Scala provides pattern matching to search and decompose sequences. Pattern matching can also be used to decompose XML.

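For instance, to find out whether a variable contains an "entry" element whose last child is a "foo" with no children, the following will do. Since the sequence wildcard _* may only appear in the last position of a pattern, the children are bound to a variable and the last one is inspected in a nested match:

x match {
  case Elem(_, "entry", _, _, ch @ _*) if !ch.isEmpty =>
    ch.last match {
      case Elem(_, "foo", _, _) => true   // a "foo" without children
      case _ => false
    }
  case _ => false
}

This also works using XML syntax:

x match {
  case <entry>{ ch @ _* }</entry> if !ch.isEmpty =>
    ch.last match {
      case <foo/> => true
      case _ => false
    }
  case _ => false
}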

However, XML patterns offer no support for testing the presence or values of attributes. This can be achieved using guards, as in the following example:

x match {
          case link @ <a>{ _* }</a> if link.attribute("href").isEmpty => "href missing"
}
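
The guard technique extends naturally to several cases. The sketch below wraps the match in a hypothetical function describe (the name and the returned strings are assumptions for illustration) and assumes a current Scala 2 compiler with the scala-xml module on the classpath.

```scala
import scala.xml.Node

object LinkCheck {
  def describe(x: Node): String = x match {
    // The pattern only checks the label; the guard inspects the attribute.
    case link @ <a>{ _* }</a> if link.attribute("href").isEmpty =>
      "href missing"
    // Same pattern without guard: reached only when href is present.
    case link @ <a>{ _* }</a> =>
      "link to " + (link \ "@href").text
    case other =>
      "not a link: " + other.label
  }
}
```

Unlike the single-case match above, this version has a catch-all case and therefore never throws a MatchError.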

Updates and Queries

The Scala XML API takes a functional approach to representing data, eschewing imperative updates where possible. Since nodes as used by the library are immutable, updating an XML tree can be a bit verbose, as the XML tree has to be copied. Here is an example of how this could be done.

/* examples/xml/phonebook/phonebook2.scala */
package phonebook;

object phonebook2 {

  import scala.xml.Node

  /** adds an entry to a phonebook */
  def add( p: Node, newEntry: Node ): Node = p match {

      case <phonebook>{ ch @ _* }</phonebook> => 

        <phonebook>{ ch }{ newEntry }</phonebook>
  }

  val pb2 = 
    add( phonebook1.labPhoneBook, 
         <entry>
           <name>Kim</name> 
           <phone where="work">+41 21 111 11 11</phone>
         </entry> );

  def main( args: Array[String] ) = 
    Console.println( pb2 )
}

The add method throws a MatchError if the node passed to it is not a phonebook element. It is also possible to express the same update using only method calls:

          def add( p: Node, e: Node ) = Elem(null, p.label, Null, TopScope, (p.child ++ e):_*)
        

Here we assume that our element representing a phonebook will never have a namespace prefix (null), never have attributes (Null) and never define namespace bindings (TopScope). Without these assumptions, we would have to copy p.prefix, p.attributes and p.scope over to the new element as well. The _* ("sequence escape") has been explained before: see Passing sequences to Elem.

Changing the phone number of an entry is similar. First we look up the entry by traversing the tree and copying it. Then we provide an updated copy of the element we wish to change.
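
For comparison, here is a sketch of a variant that keeps prefix, attributes and scope. It relies on the copy method that Elem offers in current scala-xml releases; this is an assumption about the library version, since copy is not available in Scala 2.6.1, where the four values would be passed to the Elem constructor by hand. The object name is hypothetical.

```scala
import scala.xml.{Elem, Node}

object AddEntry {
  // Returns a copy of p with newEntry appended as last child.
  // Prefix, label, attributes and scope are taken over from p unchanged.
  def add(p: Elem, newEntry: Node): Elem =
    p.copy(child = (p.child ++ newEntry).toList)
}
```

Note that the parameter is typed as Elem rather than Node here, so a text node or comment is rejected at compile time instead of failing at run time.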

package phonebook;

object phonebook3 {

  import scala.xml.{Elem, Node, Text} ;
  import scala.xml.PrettyPrinter ;
  import Node.NoAttributes ;

  /* this method "changes" (returns an updated copy) of the phonebook when the
   *   entry for Name exists. If it has an attribute "where" whose value is equal to the
   *   parameter Where, it is changed, otherwise, it is added.
   */
  def change ( phonebook:Node, Name:String, Where:String, newPhone:String ) = {

    /** this nested function walks through tree, and returns an updated copy of it  */
    def copyOrChange ( ch: Iterator[Node] ) = {

      import xml.Utility.{trim,trimProper} //removes whitespace nodes, which are annoying in matches

      for( val c <- ch ) yield 
        trimProper(c) match {

          // if the node is the particular entry we are looking for, return an updated copy

          case x @ <entry><name>{ Text(Name) }</name>{ ch1 @ _* }</entry> => 

            var updated = false;
            val ch2 = for(val c <- ch1) yield c match { // does it have the phone number?

              case y @ <phone>{ _* }</phone> if y \ "@where" == Where => 
                updated = true
                <phone where={ Where }>{ newPhone }</phone>
              
              case y => y
              
            }
            if( !updated ) { // no, so we add as first entry
            
              <entry>
                <name>{ Name }</name>
                <phone where={ Where }>{ newPhone }</phone>
                { ch1 }
              </entry>
              
            } else {         // yes, and we changed it as we should
              
              <entry>
                { ch2 }
              </entry>
        
            } 
          // end case x @ <entry>...
          
          // other entries are copied without changing them

          case x =>           
            x
          
        }
    } ; // for ... yield ... returns an Iterator[Node]
    
    // decompose phonebook, apply updates
    phonebook match {
      case <phonebook>{ ch @ _* }</phonebook> =>
        <phonebook>{ copyOrChange( ch.elements ) }</phonebook>
    }
    
  }

  val pb2 = 
    change( phonebook1.labPhoneBook, "John", "work", "+41 55 555 55 55" );

  val pp = new PrettyPrinter( 80, 5 );

  def main( args:Array[String] ) = {
    Console.println("---before---");
    Console.println( pp.format( phonebook1.labPhoneBook ));
    Console.println("---after---");
    Console.println( pp.format( pb2 ));
  }
}

Names and Namespaces

Namespaces [names1.0][names1.1] were introduced into extensible markup long after the XML specification was out. The intention is to provide a means of 'packaging' related names by associating them with a URI. The association happens indirectly by (1) binding URIs to prefixes and (2) prefixing names using the syntax 'prefix:localname', i.e. using the colon as a separator. Consequently, the colon is not a part of names anymore.

why namespace prefixes?

Namespace prefixes have to be taken into account (Binding,Scope) because they are used whenever QNames live in content (for example, in XML Schema Definitions).

To avoid clutter, the standard allows a 'default namespace' to be declared, which implicitly associates unprefixed names with a certain URI. Finally, it is possible to undeclare a namespace by binding its prefix to the empty string. (v1.0 only allowed undeclaring the default namespace, but in v1.1 this has been generalized to any prefix.)

The empty string is allowed in a binding

The meaning of an empty string is to undeclare the namespace-prefix mapping. In the past, this has caused considerable headaches: The Namespaces in XML recommendation allowed the empty string only for the default namespace binding, i.e. xmlns="" was allowed, but xmlns:foo="" was not. However, this unnecessary distinction between default and other namespace bindings (those with a prefix) was removed in Namespaces in XML 1.1. Now "undeclarations" are allowed for both kinds.

How does this look in Scala? Namespace bindings are treated in a class aptly named NamespaceBinding, which is a linked list of prefix-URI pairs. A default namespace is synonymous with a namespace for the null prefix (not the empty string), and undeclaring a namespace binding is done by assigning the empty string as URI. The TopScope is the most common top-level scope, the empty prefix-URI mapping that does not contain any binding.

Here is an example of what the compiler does with scala.xml.NamespaceBinding. Assume we had an internal variable $scope containing the active bindings at each element. Then for the following fragment

val foobar = <foo:bar xmlns:foo="http://foo.com" foo:key="value" xmlns="urn:default" attr="42"><a/></foo:bar>

the compiler has to take the following steps to update the scope, translating everything roughly into:

val foobar = { // add bindings to scope
  $scope = new NamespaceBinding(null,  "urn:default", 
              new NamespaceBinding("foo", "http://foo.com", $scope))

  // make attributes
  val md = new UnprefixedAttribute("attr","42", 
              new PrefixedAttribute("foo","key","value", Null))

  // make element
  Elem("foo","bar", md, $scope, Elem(null, "a", Null, $scope))
}
  

The element labeled bar uses a prefix which tells us it is in the namespace http://foo.com. The element a is nested under bar, so it is affected by the same namespace bindings. Since it has no prefix, it is in the default namespace urn:default.

Namespace binding is properly scoped over the child nodes: Unless a descendant undeclares a prefix, the prefix is bound to a URI according to the bindings defined for the parent. As can be seen, namespace bindings are treated differently from regular attributes -- this seems a good compromise since they are shared, have different properties and there is an important class of users that is simply not concerned with namespaces. The library is designed to handle namespace bindings by itself, and where namespace manipulations are needed, they are effected on the scope members and NamespaceBinding classes.

Attributes without a prefix are not implicitly put in the same namespace as the element in which they occur. This is the reason why there is UnprefixedAttribute and a PrefixedAttribute class. UnprefixedAttributes have no namespace at all.
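
The effect of these rules can be observed on the fragment from above. The following sketch (object and value names are made up for illustration) assumes a current Scala 2 compiler with the scala-xml module on the classpath and uses only the prefix, namespace, scope and attribute members of Node.

```scala
object NamespaceDemo {
  val foobar =
    <foo:bar xmlns:foo="http://foo.com" foo:key="value" xmlns="urn:default" attr="42"><a/></foo:bar>

  // The nested element shares the bindings of its parent.
  val a = (foobar \ "a")(0)

  // Prefixed attributes are looked up by namespace URI and local name.
  val key = foobar.attribute("http://foo.com", "key")

  // Unprefixed attributes are looked up by name only.
  val attr = foobar.attribute("attr")
}
```

The element a has the null prefix and therefore resolves to the default namespace, whereas the unprefixed attribute attr is found without any reference to a namespace.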

Sharing namespace nodes

Implementations of XML infrastructure routines typically share namespace nodes in the data model. This accounts for the lexical scoping which is prescribed by the spec.

Some requirements are expected of such XML infrastructure routines. It is for instance absolutely necessary to preserve namespace bindings as they are given in source documents (because some documents, like XSD schemata, refer to prefixes not only in XML names but also in content). Then it is often desirable that identical namespace bindings are not repeated in each node, i.e. the number of namespace bindings xmlns:prefix="..." should be minimized. This in turn becomes more tricky when sharing namespaces -- we might mix fragments from different trees, in which case namespace nodes might convey identical information and yet have different object identity.

Namespace sharing

The current implementation will not properly stratify namespace bindings when elements from different scopes are combined. This is not a problem when querying or processing XML data, but it might lead to wrong namespace bindings when serializing XML. A modified version of the serializing algorithm can solve the problem by introducing namespace declarations and undeclarations in the right place. Since it seems a rare problem and developers can stratify namespaces themselves in a given XML application, your humble author and scala.xml maintainer did not consider this issue a priority.



[3] The same convention is used in XQuery, Xtatic and maybe Java.

Chapter 3. XPath projection

The XML Path Language (XPath) [xml] is a language for expressing simple queries on XML documents. This example illustrates how XPath projection can be used in Scala:

package bib;

object bib {

  import scala.xml.{Node,NodeSeq};
  import scala.xml.PrettyPrinter;

  val biblio = 
    <bib>
      <book>
         <author>Peter Buneman</author>
         <author>Dan Suciu</author>
         <title>Data on ze web</title>
      </book>
      <book>
         <author>John Mitchell</author>
         <title>Foundations of Programming Languages</title>
      </book>
    </bib> ;

  val pp = new PrettyPrinter(80, 5);

  def main(args:Array[String]):Unit = {
    Console.println( pp.formatNodes( biblio \ "book" \ "title" ));

    // prints 
    // <title>Data on ze web</title><title>Foundations ...</title>

    Console.println( pp.formatNodes( biblio \\ "title" ));   // prints the same

    Console.println( pp.formatNodes( biblio \\ "_" ));   // prints node and all descendants

    Console.println( pp.formatNodes( biblio.descendant_or_self )); // prints the same

  }
}
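
Since the result of a projection is an ordinary sequence of nodes, it combines with the usual collection methods and with for-comprehensions. The following sketch reuses the bibliography data from above; the object and value names are hypothetical, and the code assumes a current Scala 2 compiler with the scala-xml module on the classpath.

```scala
object BibQueries {
  val biblio =
    <bib>
      <book>
        <author>Peter Buneman</author>
        <author>Dan Suciu</author>
        <title>Data on ze web</title>
      </book>
      <book>
        <author>John Mitchell</author>
        <title>Foundations of Programming Languages</title>
      </book>
    </bib>

  // Child steps: authors of all books, in document order
  val authors = (biblio \ "book" \ "author").toList.map(_.text)

  // Descendant step: titles at any depth
  val titles = (biblio \\ "title").toList.map(_.text)

  // A child step only looks one level down, so this is empty
  val direct = biblio \ "title"

  // Projection plus filter: titles of books with a given author
  val suciuTitles =
    for {
      book  <- (biblio \ "book").toList
      if (book \ "author").exists(_.text == "Dan Suciu")
      title <- (book \ "title").toList
    } yield title.text
}
```

The last query plays the role of an XPath predicate: the filter is plain Scala code operating on the nodes selected by the projection.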

Chapter 4. XSLT style transformations

Here is a sample program to convert Docbook to some other format:

object transform {
  import scala.xml._ ;
  import scala.xml.dtd._ ;
  import org.xml.sax.InputSource ;

  /* a former version of Scala used regular expression patterns, like
   * in the following code. In the future, we plan to reactivate some
   * well-behaved regular expressions again
  // gimmick: text replacement "scalac" => &lt;code&gt;scalac&lt;/code&gt;
  def cook(s: String): Seq[Node] = cook1(s) ;
  def cook1(s: Seq[Char]):List[Node] = s match {

    case Seq( a @ _*, 's','c','a','l','a','c', b @ _* ) => 

      Text(cook2( a )) :: <code>scalac</code> :: cook1( b )
    case _ => List( Text( cook2( s )))
  }
  def cook2(s: Seq[Char]): String = {
    val r = new StringBuffer();
    s.foreach { c:char => val _ = r.append(c); };
    r.toString()
  }
   */

  def transform1(ns: Iterable[Node]): Seq[Node] = {
    val zs = new NodeBuffer();
    for(val z <- ns) { zs &+ transform( z ) }
    zs
  }

  /** this function holds "templates" that transform nodes of an input tree 
   *  into an iterable representation of a sequence of nodes of the output 
   *  tree.
   *
   *  It is ok to return a single node, since each node is at the same
   *  time a singleton sequence. Likewise, the pattern variable x will be
   *  of type Seq[Node], although here it is always binding a single node.
   */
  def transform(n: Node):Iterable[Node] = n match {
    case x @ <article>{ ns @ _ * }</article> => 
      <source active="ant" title={ (x \ "title" \ "_").toString() }>
      <header>
        <author>Burak Emir</author>
        <keywords>Scala4Ant</keywords>
        <style type="text/css"></style>
        </header>
        <content>
          <title><scala/> Ant Task</title>
          { transform1( x \ "_" ) }
        </content>
      </source>
    case x @ <sect1>{ _* }</sect1> => 
      <section>{ transform1( x \ "_" ) }</section>
    case x @ <title>{ _* }</title> => 
      <h>{ x \ "_" }</h>
    case x @ <para>{ _* }</para> => 
      <p>{ transform1( x \ "_" ) }</p>
    case x @ <itemizedlist>{ _* }</itemizedlist> => 
      <ul>{ transform1( x \ "_" ) }</ul>
    case x @ <listitem>{ _* }</listitem> => 
      <li>{ transform1( x \ "_" ) }</li>
    case x @ <constant>{ _* }</constant> => 
      // an xml:group is a sequence that appears to the scala type system
      //  as a single node. Here it is used to append a text node with a space
      <xml:group><code>{ x \ "_" }</code> </xml:group>
    case x @ <programlisting>{ _* }</programlisting> => 
      <pre>{ x \ "_" }</pre> 
    case Elem(namespace, label, attrs, scp, ns @ _*) => 
      Elem(namespace, label, attrs, scp, transform1( ns ):_* )
    case z => 
      z 
  };

  def main(args:Array[String]) = {
    if( args.length == 1 ) { // must have one arg
      object ConsoleWriter extends java.io.Writer {
        def close() = {}
        def flush() = Console.flush
        def write(cbuf:Array[char], off:int , len:int ): unit = {
          var i = 0
          while(i < len) {
            Console.print(cbuf(off + i))
            i = i + 1 // advance, otherwise the loop never terminates
          }
        }
      }

      val src = XML.load(new InputSource(  args( 0 ))); //use Java parser

      // transform returns an iterable, but we know it is a singleton
      //  sequence, so we get its first element
      val n = transform( src ).elements.next
      
      val doctpe = DocType("html",PublicID("-//W3C//DTD XHTML 1.1//EN","../default.dtd"), Nil)

        /** write document to console, with encoding latin1, xml declaration
         * and doctype 
         */
      XML.write(ConsoleWriter, n, "iso-8859-1", true, doctpe)

    }
    else error("need one arg");
  }
}
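
The template idea can also be shown in a much smaller form. The following is a minimal sketch, not the program above: the object name MiniTransform is hypothetical, only three DocBook elements are handled, and it assumes a scala.xml version in which Elem offers a copy method with a child parameter.

```scala
import scala.xml.{Elem, Node}

object MiniTransform {
  // Each case acts as one "template": it matches an input element by label
  // and rebuilds it with a different label, transforming the children first.
  def transform(n: Node): Node = n match {
    case <para>{ kids @ _* }</para> =>
      <p>{ kids.map(transform) }</p>
    case <itemizedlist>{ kids @ _* }</itemizedlist> =>
      <ul>{ kids.map(transform) }</ul>
    case <listitem>{ kids @ _* }</listitem> =>
      <li>{ kids.map(transform) }</li>
    case e: Elem =>
      // Unknown elements are kept, but their children are still transformed.
      e.copy(child = e.child.map(transform).toList)
    case other =>
      // Text and other non-element nodes are passed through unchanged.
      other
  }
}
```

Calling MiniTransform.transform on a para element that contains an itemizedlist yields a p element with a nested ul, while text nodes are copied as they are.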

Chapter 5. XQuery style querying

This example illustrates XQuery-style querying:

package bib;

object bibq {

  val theBib = bib.biblio ;

  for( val b <- theBib \ "book" )
    for( val a <- b \ "author"  ) {
      Console.println( a )
    }

}
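
A for-comprehension can also filter and construct new elements, much like the where and return clauses of an XQuery FLWOR expression. The sketch below is an illustration under assumptions: the object name FlworSketch and the result element names results and shared are invented, and the bibliography is repeated so that the example is self-contained.

```scala
import scala.xml.Elem

object FlworSketch {
  val theBib =
    <bib>
      <book>
        <author>Peter Buneman</author>
        <author>Dan Suciu</author>
        <title>Data on ze web</title>
      </book>
      <book>
        <author>John Mitchell</author>
        <title>Foundations of Programming Languages</title>
      </book>
    </bib>

  // For every book with more than one author, emit its title in a new element.
  val result: Elem =
    <results>{
      for {
        b <- theBib \ "book"
        if (b \ "author").length > 1
        t <- b \ "title"
      } yield <shared>{ t.text }</shared>
    }</results>
}
```

The generators play the role of XQuery's for clauses, the guard corresponds to where, and the yield expression corresponds to return.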

Chapter 6. Loading and Saving XML

If you just want to load XML, without using databinding, try this:

object Foo extends Application {
  val x = scala.xml.XML.loadFile("myfile.xml");
  Console.println(x);
}

The value x will be of type scala.xml.Elem, which in turn is an implementation of the scala.xml.Node interface. The parser used for parsing the XML is currently the XML parser that comes with the underlying JDK.

There is also a save method defined there:

object Foo extends Application {
  val y: Elem = ...
  scala.xml.XML.save("myfile.xml", y);
}

There is also a write method that writes XML to any java.io.Writer.
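
As a hedged sketch of loading and writing without touching the file system, the following parses a string and writes the result to a java.io.StringWriter. The object name LoadWriteSketch, the document content and the system identifier bib.dtd are made up; the call to XML.write follows the five-argument form used in the transformation example earlier, and XML.loadString is assumed to be available in the reader's library version.

```scala
import java.io.StringWriter
import scala.xml.XML
import scala.xml.dtd.{DocType, SystemID}

object LoadWriteSketch {
  // Parse an in-memory document with the JDK parser.
  val parsed = XML.loadString("<bib><book><title>Data on ze web</title></book></bib>")

  // Serialize to a string, with XML declaration and a (hypothetical) doctype.
  def render(): String = {
    val w = new StringWriter()
    XML.write(w, parsed, "UTF-8", true, DocType("bib", SystemID("bib.dtd"), Nil))
    w.toString
  }
}
```

Any other Writer, for instance one wrapping a socket or a file, can be passed in place of the StringWriter.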

The native Scala parser

Scala has an XML parser of its own, which can be invoked like this:

import scala.xml.parsing.ConstructingParser

val p = ConstructingParser.fromFile(file, true /*preserve whitespace*/)
val d: xml.Document = p.document
  

The advantage of this parser is that the developer has more fine-grained control over what to parse. It is, for instance, possible to parse a sequence of elements from a stream (the XML spec allows only one), or to obtain the entity declarations from the internal subset of the DTD.
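
The parser does not have to read from a file. The following minimal sketch feeds it an in-memory scala.io.Source instead; it assumes that ConstructingParser.fromSource and Document.docElem are available in the library version at hand, and the object name NativeParserSketch is hypothetical.

```scala
import scala.io.Source
import scala.xml.parsing.ConstructingParser

object NativeParserSketch {
  val input = "<bib><book><title>Data on ze web</title></book></bib>"

  // The second argument controls whether whitespace is preserved.
  val parser = ConstructingParser.fromSource(Source.fromString(input), false)

  // document() consumes the source and returns the whole document.
  val doc = parser.document()

  // docElem is the root element of the parsed document.
  val root = doc.docElem
}
```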

Pull parsing (experimental)

The native XML parser can also be used for pull parsing. An experimental API is accessible via scala.xml.pull.XMLEventReader. You need to provide a scala.io.Source, just like for the constructing parser.

Part II. Library

Chapter 7. Overview

This part provides a more detailed account of classes in the XML library.

Chapter 8. scala.xml runtime classes

This chapter describes the classes in scala.xml.

scala.xml.Node

The abstract superclass of all XML nodes as represented in the Scala library. Nodes have an optional prefix (null = no prefix), a namespace binding scope, a list of metadata (attributes), and a sequence of children. A node can be considered as a singleton sequence containing the node, because it inherits from NodeSeq.

scala.xml.NodeSeq

Sequences of nodes are pretty common in XML processing. The main use of this class is to add XPath methods \ and \\ to any sequence of nodes, regardless of its concrete representation. It is a wrapper class, which gets automatically created by means of Scala's view mechanism.
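
The relationship between Node and NodeSeq can be sketched as follows. The names are invented for illustration, and NodeSeq.fromSeq is assumed to be present in the library version in use.

```scala
import scala.xml.{Node, NodeSeq}

object NodeSeqSketch {
  val single: Node = <book><title>A</title></book>

  // A Node can be used wherever a NodeSeq is expected: it is a sequence of length one.
  val asSeq: NodeSeq = single

  // An ordinary list of nodes wrapped as a NodeSeq gains the projection methods.
  val many: NodeSeq =
    NodeSeq.fromSeq(List(<book><title>A</title></book>, <book><title>B</title></book>))

  val titles: List[String] = (many \ "title").toList.map(_.text)
}
```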

scala.xml.Elem

A case class implementing scala.xml.Node. XML literals embedded in Scala code will get turned into Elem instances. Also, most default parsing factories will produce Elem instances. By contrast, most library routines (such as the PrettyPrinter) expect instances of Node, so it is possible to call them with custom XML representations.

SpecialNode

Atom

To store data values like ints and dates.

EntityRef

To represent entity references. It is possible to output entity declarations using the classes in scala.xml.dtd.

scala.xml.MetaData

The abstract superclass of attribute nodes. Attributes are realized as an immutable linked list. Since the attribute order does not matter in XML, default parser factories may actually change (typically reverse) the order when they parse XML.

scala.xml.Null

This object is used to 'ground' linked attribute lists. It is also the representation of empty attribute lists.

scala.xml.PrefixedAttribute

A prefixed attribute has a prefix, a name, a value and a pointer to the tail of the attribute list. It answers getValue(uri,scope,key) calls with its value if its own prefix matches the uri in the given scope (typically the scope of the parent element). It will not answer getValue(key) calls, because the Namespaces spec considers it distinct from an unprefixed attribute.

scala.xml.UnprefixedAttribute

An unprefixed attribute has a name, a value and a pointer to the tail of the attribute list. It answers getValue(key) calls, but not the namespace-aware ones described above.
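
The linked-list structure of attributes can be sketched like this. The example does not use the getValue calls mentioned above; it reads the value back through the "@id" projection instead, on the assumption that this form and the % method of Elem are available in the reader's version. The object name AttributeSketch and the attribute values are made up.

```scala
import scala.xml.{Elem, MetaData, Null, UnprefixedAttribute}

object AttributeSketch {
  // Two attributes chained together; Null terminates the list.
  val attrs: MetaData =
    new UnprefixedAttribute("id", "42",
      new UnprefixedAttribute("status", "fixed", Null))

  // % returns a copy of the element with the given attributes added.
  val report: Elem = <bugReport/> % attrs

  val id: String = (report \ "@id").text

  // MetaData can be traversed like any other collection of attribute nodes.
  val keys: Set[String] = report.attributes.map(_.key).toSet
}
```

Collecting the keys into a set avoids relying on any particular attribute order.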

scala.xml.NamespaceBinding

This class is for representing namespace bindings using a linked list of namespace binding nodes.

scala.xml.TopScope

This class is used to 'ground' a linked list of namespace binding nodes. It also stands for a top-level scope in which no namespaces are bound.
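
A short sketch of a namespace binding chain follows. It assumes that NamespaceBinding can be constructed from a prefix, a URI and a parent scope, and that getURI looks a prefix up along the chain; the object name ScopeSketch and the prefix h are chosen for illustration.

```scala
import scala.xml.{NamespaceBinding, TopScope}

object ScopeSketch {
  val xhtml = "http://www.w3.org/1999/xhtml"

  // One binding node whose parent is the empty top-level scope.
  val scope = NamespaceBinding("h", xhtml, TopScope)

  val bound: String = scope.getURI("h")

  // A prefix that is bound nowhere in the chain yields null.
  val unbound: String = scope.getURI("svg")

  // An XML literal with an xmlns declaration carries such a scope itself.
  val elem = <h:p xmlns:h="http://www.w3.org/1999/xhtml">text</h:p>
}
```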

Chapter 9. Scala's XML syntax, formally

The following changes were made to the Scala syntax in order to accommodate literal XML and XML expressions:

  • Lexical syntax (Chapter 1)

    Programming languages are usually defined in terms of lexical syntax, handled by a scanner, and context-free syntax, handled by a parser. Scala is no exception to this rule, adopting a lexical syntax similar to Java's but with more freedom for definition of operators etc.

    The lexical syntax of XML documents cannot be reconciled with the lexical syntax of Scala code. Therefore, in addition to the Java-like lexical syntax, a Scala parser needs to treat every input character differently and in conformance with the XML specification when entering a literal XML element. This happens when the following character sequence is encountered:

    ( S | '(' | '{' ) '<' (Letter | '_' )

    Thus, whenever a < is immediately preceded by whitespace, '(' or '{', and immediately followed by an XML name start character, the scanner is forced to interpret the following characters as XML input. In the following, this will be referred to as the scanner being 'in XML mode'. The scanner changes from XML mode to Scala mode when one of the following conditions holds:

    • The XML expression or XML pattern started by the initial '<' has been successfully parsed.

    • The parser encounters an embedded Scala expression or pattern, indicated by a '{'. This changes the scanner back to normal mode, until the closing '}' is found, which puts the scanner into XML mode again.

      Since the nested Scala expression can contain nested XML expressions/patterns, the parser thus has to maintain a stack that reflects the nesting of XML and Scala expressions adequately.

    Note that Scala comments are interpreted as text (parseable character data) in XML mode.

  • Expression (Ch.4) and pattern (Ch.7) syntax

    The following two productions are added to the Scala grammar (see below for XML expression and pattern grammar)

                  xmlExpr ::= Element (Element*)
                  xmlPat  ::= ElementPattern
                

    As said before, they indicate that the scanner is in XML mode.
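
The switching between XML mode and Scala mode described above can be observed in a small example. This is only a sketch with invented names: each '{' inside XML content switches to Scala code, and each '<' that follows whitespace, '(' or '{' inside that code switches back to XML.

```scala
object NestingSketch {
  val items = List("scalac", "scaladoc")

  val list =
    <ul>{
      // Scala mode: an ordinary expression over a list.
      items.map { name =>
        // XML mode again, with two further levels of embedded Scala code.
        <li>{ if (name == "scalac") <code>{ name }</code> else name }</li>
      }
    }</ul>
}
```

The nesting here is four levels deep (XML, Scala, XML, Scala), which is why a stack of modes is needed.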

Chapter 10. Interpretation of XML expressions and patterns

The meaning of XML expressions and patterns is given using equivalent Scala expressions and patterns.

  • An element <pre:name att1=val1 pre2:att2=val2 ... attN=valN> content </pre:name> is interpreted as scala.xml.Elem("pre", "name", UnprefixedAttribute(att1, val1, PrefixedAttribute(pre2, att2, val2, ... UnprefixedAttribute(attN, valN, Null))), content)

  • A sequence of elements e1...eN is interpreted as (a concrete representation of) Seq(e1...eN)

  • Embedded Scala expressions are interpreted by themselves.

  • An element pattern '<name> contentPattern </name>' is interpreted as 'scala.xml.Elem("name", contentPattern )'

  • Embedded Scala patterns are interpreted by themselves.

Note that this implies that an XML expression consisting of one element will be of type 'scala.xml.Elem' whereas an XML expression consisting of two or more elements will be of type 'Seq[scala.xml.Elem]'.
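
The following sketch illustrates the interpretation from the program's point of view. It assumes that the Elem extractor yields prefix, label, attributes, scope and children, as in the transformation example earlier. The concrete static type of a multi-element expression may differ between compiler versions, so the sketch only relies on it being a sequence with a length. All names are hypothetical.

```scala
import scala.xml.{Elem, Node}

object InterpretationSketch {
  // One element: an Elem.
  val one = <entry id="1">text</entry>

  // Two adjacent elements: a sequence of nodes.
  val two = <a/><b/>

  // The pattern mirrors the constructor-like interpretation of an element.
  def describe(n: Node): String = n match {
    case Elem(prefix, label, attrs, scope, children @ _*) =>
      label + " with " + children.length + " child node(s)"
    case _ =>
      "not an element"
  }
}
```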

Part III. Tools

Chapter 11. xinc

EHR's SAXIncluder

This tool is an adaptation of Elliotte Rusty Harold's SAXXIncluder to Scala. It builds on top of the relevant Java API classes and was crucial for including the code samples in this document.

At this point some information (the Scaladoc description) is available at the xinc homepage.

Chapter 12. schema2src

Introduction to Data Binding

Despite great APIs, data represented in XML tends to be converted to and from object representations. This task is called data binding. It can in principle be coded manually if the data representations are intricate. But often, conversion has to bridge fairly straightforward XML types and fairly basic "pure data" classes. The latter scenario is a case for automation.

For the sake of an example, consider the following way to represent bug reports:


<bugReport id="42">
  <dateSubmitted>2003-06-25</dateSubmitted>
  <status>     fixed   </status>
  <submitter>  Matthias</submitter>
  <assignedTo> Michel  </assignedTo>
  <code>        ...    </code>
  <whatHappened>...</whatHappened>
  <whatExpected>...</whatExpected>
</bugReport>


<!ELEMENT bugReport (dateSubmitted,
                     status,
                     submitter,
                     assignedTo,
                     code,
                     whatHappened,
                     whatExpected)>
<!ELEMENT dateSubmitted (#PCDATA)>
<!ELEMENT status        (#PCDATA)>
<!ELEMENT submitter     (#PCDATA)>
<!ELEMENT assignedTo    (#PCDATA)>
<!ELEMENT code          (#PCDATA)>
<!ELEMENT whatHappened  (#PCDATA)>
<!ELEMENT whatExpected  (#PCDATA)>

bugReport.dtd

There are many scenarios where we would like to programmatically manipulate bug reports in ways that cannot be handled by XML tools. We might want to store and retrieve them in a relational database, access and compile the source in the code element, or notify the submitter of changes by email.
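
To make the idea of data binding concrete, here is what a hand-written binding for a few of the fields might look like. This is a sketch only: it is not the output of schema2src, the class BugReport and its methods fromXml and toXml are invented, and only four of the fields are covered.

```scala
import scala.xml.{Elem, Node}

// Hypothetical "pure data" class for part of a bug report.
case class BugReport(id: String, dateSubmitted: String, status: String, submitter: String)

object BugReport {
  // XML to object: project onto the attribute and child elements, trimming padding.
  def fromXml(n: Node): BugReport =
    BugReport(
      (n \ "@id").text,
      (n \ "dateSubmitted").text.trim,
      (n \ "status").text.trim,
      (n \ "submitter").text.trim)

  // Object to XML: an XML literal with embedded Scala expressions.
  def toXml(b: BugReport): Elem =
    <bugReport id={ b.id }>
      <dateSubmitted>{ b.dateSubmitted }</dateSubmitted>
      <status>{ b.status }</status>
      <submitter>{ b.submitter }</submitter>
    </bugReport>
}
```

Writing such code for every element of every document type quickly becomes repetitive, which is the motivation for generating it from the DTD.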

Using the data binding tool schema2src, it is possible to generate classes from the DTD above. We can invoke schema2src with its DTD module in the following way:

scala -cp ... schema2src.Main dtd -d /tmp bugReport.dtd BugReportDefs

This will result in a source file being generated. The source file mainly contains one object, one case class and one method definition for each element declaration found in the DTD.

object BugReportDefs {
  import scala.xml.{MetaData, Node};

  object bugReport { def validate(atts: MetaData, child: Node*)    ... }
  case class bugReport    (atts: MetaData, child: Node*)         { ... }
  def bugReport(s:String) = new bugReport(s);

  object dateSubmitted { def validate(atts: MetaData, child: Node*) ...}
  case class dateSubmitted(atts: MetaData, child: Node*)          { ... }
  def dateSubmitted(s:String) = new dateSubmitted(s);

  ...
}

Importing the types from BugReportDefs has several benefits for the programmer working with bug reports. She can

  • construct elements

    val n = scala.xml.Null; // empty attribute list
    bugReport(UnprefixedAttribute("id","42",n),
      dateSubmitted("2003-06-25"),
      status ("fixed"),
      submitter ("Matthias"),
      assignedTo ("Michel"),
      code ("..."),
      whatHappened ("..."),
      whatExpected ("...")
    )
    

  • load elements

todo...

At this point some information (including the Scaladoc description) is available at the schema2src homepage.

Chapter 13. xslt2src

todo...

Chapter 14. xquery2src

todo...

Part IV. Appendix

Appendix A. Scala/XML expression grammar

EBNF productions

The following grammar shows equal (=) and different (X) productions of XML literals embedded in Scala. (DRAFT-TODO) Check parser (?) and add link to namespace recommendation.

    (=)element       ::=    EmptyElemTag
                       |    STag content ETag                                       WFC: Element Type Match

    (=)EmptyElemTag  ::=    '<' Name (S Attribute)* S? '/>'                         WFC: Unique Att Spec

    (=)STag          ::=    '<' Name (S Attribute)* S? '>'                          WFC: Unique Att Spec
    (=)ETag          ::=    '</' Name S? '>'
    (=)content       ::=    (CharData)? ((element | Reference | Comment | PI | ScalaExp ) (CharData)?)*

    (=)Attribute     ::=    Name Eq AttValue                                        WFC: No < in Attribute Values

    (X)AttValue      ::=    '"' ([^<&"] | CharRef)* '"'
                       |    "'" ([^<&'] | CharRef)* "'"
                       |    ScalaExp

    ScalaExp      ::=    '{' expr '}'

    (X)CharData      ::=   [^<&]* - ([^<&]* '{'[^<&{] [^<&]*) 
                               - ([^<&]* '{']]> [^<&]*)

    (=)Reference     ::=   CharRef | EntityRef
    (=)CharRef       ::=   '&#' [0-9]+ ';'
                       |   '&#x' [0-9a-fA-F]+ ';'                                  WFC: Legal Character
    (=)EntityRef     ::=   '&' Name ';'

    (=)Char          ::=    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    (=)S             ::=   (#x20 | #x9 | #xD | #xA)+ 
    (=)Comment       ::=   '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

    (=)PI            ::=   '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'


    (=)NameChar      ::=   Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
    (?)Name          ::=   (Letter | '_' | ':') (NameChar)*


    ElemPattern   ::=    EmptyElemTagP
                    |    STagP ContentP ETagP                                       WFC: Element Type Match

    EmptyElemTagP ::=    '<' Name S? '/>'
    STagP         ::=    '<' Name S? '>'
    ETagP         ::=    '</' Name S? '>'
    ContentP      ::=    (CharData)? ((ElemPattern | ScalaPatterns ) (CharData)?)*
    ScalaPatterns ::=    '{' patterns '}'

Summary of changes

This list reflects the differences from the XML productions, with justifications:

  • (missing prolog). XML expressions do not need a prolog. This implies that there are no doctype or entity declarations.

  • (changed CharData) No single left brace '{'. These have to be written using '{{'.

  • (changed EntityRef) No well-formedness constraint for checking entity references. Since entities cannot be declared, any entities that appear in an XML expression are parsed, but never expanded.

  • (changed AttValue) Attributes can have Scala expressions as values.
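
Three of these changes can be seen in the following sketch. The object name and the element contents are made up. The example assumes that a doubled closing brace is handled in the same way as the doubled opening brace.

```scala
object SyntaxChangesSketch {
  val width = 80

  // (changed AttValue) a Scala expression used as an attribute value
  val rule = <hr width={ width.toString }/>

  // (changed CharData) literal braces in character data are written doubled
  val braces = <code>object Foo {{ }}</code>

  // (changed EntityRef) the undeclared entity is parsed but not expanded
  val entity = <p>a&nbsp;b</p>
}
```

The text of the braces element contains single braces, and printing the entity element shows the reference unchanged.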

Omissions from XML syntax

For simplicity, Scala omits some parts of XML. There are two kinds of omission: items omitted from the syntax cannot be used in any Scala program, whereas items omitted in the representation lack some treatment described in the XML spec.

Thus Scala syntax does not include:

  • prolog: XML fragments in Scala programs do not need a prolog

  • (changed AttValue) Attributes cannot refer to parsed entities.

Appendix B. Implementation Chart: Information Set

Bibliography

[scala] Martin Odersky et al. Scala language specification.

[dom4j] dom4j.

[jdom] JDOM.

[stax] JCP. StAX Streaming API for XML.

[fxt] fxt - The Functional XML Transformation Tool (written in SML / NJ).

[xquery2src] http://code.google.com/xquery2src/xquery2src (written in Scala).

[fxt-transf] Alexander Berlea. Helmut Seidl. Transforming XML documents using fxt. Computing and Information Technology, Special Issue on Domain-Specific Languages, vol. 10, no. 1, pp. 19--35, 2002.

[fxt-binary] Alexander Berlea. Helmut Seidl. Binary Queries for Document Trees. Nordic Journal of Computing Volume 11, Number 1, Spring 2004.

[mf-wall] Mary Fernandez. XQuery: A Query Language for XML (or...Memoir of a W3C Standards Hacker). invited talk ECOOP'03 Darmstadt.

[scala-programming] Martin Odersky. Programming in Scala.