Table of Contents
Abstract
We shed light on Scala's XML data model and the syntax of literal XML markup in Scala code.
Scala [scala] is a programming language that compiles to Java Virtual Machine(tm) bytecode, supports a variety of programming styles and can call Java libraries. It provides extensive library support for XML processing with functional and object-oriented techniques.
This book aims to inform the reader of Scala's XML facilities. Some basic knowledge of Scala is assumed, as provided by the Scala Overview, a cursory reading of [scala-programming], or any of the fashionable Scala books that are coming out these days (use Google to find them). Before we embark on this journey, let me try to place scala.xml within the big picture:
Some consider XML as just syntax: in this view, the core XML specification [xml] merely talks about sequences of characters with some markers (tags) in angle brackets appearing here and there. XML is kind of "meta" because the spec authors do not say which tags. When tags are "instantiated" to concrete structuring elements like <html>, then the XML spec speaks of an XML application (like XHTML, DocBook or Atom).
XML is also something like a data model: The nesting of tags in an XML document provides a neat tree structure, which can be used to represent data. Thus, most of this book is concerned with trees and sequences of trees. Thinking in trees is useful, for instance when an XML transformation can be described by applying recursive tree traversals. However, it is sometimes too imprecise: we might encounter a string like "23" and have to decide whether we actually want to consider it as an integer, a string, or a day of the month. [1]
The scala.xml library is designed to help with both perspectives, and for the latter, to keep options open of unmarshalling parts of the XML to object and value representations ("data binding"). In this document I try to promote an understanding of the library classes, programming constructs and design patterns provided to this end. This should help the reader do things like parsing, maybe validating, applying recursive transformations, querying and data binding.
There is a wealth of XML specific programming languages which however do not integrate too well with the object-oriented paradigm [2]. Scala is a language that is particularly open to elegant solutions to old problems, because it allows new programming abstractions to be defined easily, providing some opportunities to bridge syntactical gaps and achieving somewhat tighter integration.
Types in programming can help structure your code, remind you of data invariants and push the compiler to detect errors and apply optimizations. A type system should be considered a simple and effective form of program verification.
Data Types in XML specifications are concerned with assigning meaning to sequences of characters -- not with programming. The types thus introduced express some form of structural invariants of XML documents and fragments. This, and not more, is what standards like "Document Type Definition" (DTD)[xml], the more recent XML Schema Definitions (XSD) [xsd1][xsd2] and Relax NG (RNG) [rng] schemata achieve. Less well-known alternatives are schematron and Document Structure Description (DSD) [dsd]. An XML document conforming to such a schema is called valid or schema-valid. For the programmer, the job does not end at data definition, it begins there. And then there are a whole number of XML programs that don't need this datatype business at all.
I believe it is wrong to impose one perspective and completely neglect the other. Probably, most users of scala.xml are interested in generating (X)HTML: These users need support for almost all details of the XML and XHTML spec, plus some knowledge about browser incompatibilities. The Scala language thus supports cut-and-paste compatibility for XML literals.
On the other hand, there are benefits of using type information. Today there is some static type checking for XQuery and XSLT, but they are somewhat all-or-nothing, forcing the developer to decide whether he wants to live in a typed or an untyped XML world. The scala.xml library keeps all options open: one can manipulate XML without worrying about a schema, but there are ways to convert to or represent some attribute or element text as a Scala/Java integer.
XML programming is a placeholder for many different approaches developers have for XML processing. We will call such a perspective "generic" if it does not depend on any particular XML application (nothing to do with Java generics). A generic approach can deal with XHTML just as it can deal with a markup language for cooking recipes. There are several points of view that can be taken:
XML is regarded as text. We ignore the tree structure completely. Some text/regular expression search is used to retrieve or manipulate information. This can get you quite far for small tasks. Go away, use perl :-)
XML is parsed into a (mutable) object graph that represents the tree structure in a generic way. The Document Object Model (DOM) [dom-L3] and related programming interfaces [dom4j] [jdom] [xom] provide more or less standard APIs to manipulate such trees in a general-purpose language. Not surprisingly, scala.xml comes with its own API. It is possible to convert to and from others, but this is not yet part of the library.
While parsing XML, a sequence of events is generated. These events either trigger callbacks (push, application is the callee, like in the Simple API for XML (SAX) [sax]) or the application fetches its events itself (pull, implemented in Streaming API for XML (StAX) [stax]). There is an experimental pull API for Scala that makes it possible to experiment with this view (see the scala.xml.pull API documentation).
XML is the communication format to interact with a database -- not so much like MySQL running on the same machine, but more like "Acme Corp has good data and allowed us to send them queries". This would use a query language like XQuery [xquery]. An experimental XQuery-to-Scala-source translator is available to support this view [xquery2src].
XML is transformed by applying style templates (like XSLT [xslt]). This falls under the more general term of "recursive transformations". There are some library classes that achieve the same (see package scala.xml.transform). There is also an XSLT-to-Scala-source translator, which is a bit outdated and does not work with the current version of Scala, but which might be revived one day if anybody asks me. For new developments, it is more straightforward to use the more convenient Scala API rather than the cumbersome XSLT syntax, or (if it really must be XSLT), some Java library.
XML is considered as bare trees, and we want to deal with XML "natively". Then the scala.xml API provides methods to handle these structures, with support for XPath like selection and pattern matching.
Using scala.xml feels somewhere between using some DOM API and having an XML specific language. Besides the literal syntax, there is actually no language support. In Scala, most "features" are realized not as language extensions but as libraries. Even the literal XML syntax is desugared into code that constructs objects -- so it is possible to express everything that can be expressed in XML literals (and even more) programmatically, without actually using XML syntax.
Thanks go to Martin Odersky for giving me the freedom in designing this library. Also, without the past and present LAMP staff, Scala would not be what it is today. Matthias Zenger, Michel Schinz, Philippe Altherr, Vincent Cremet, Erik Stenman, Gilles Dubochet, Stéphane Micheloud, Lex Spoon, Sean McDirmid, Nikolay Mihaylov, Iulian Dragos. Some of these guys were pretty ardent XML detractors, which is sometimes good as it reminds one that no XML API is a silver-bullet.
Jamie Webb and Jon Pretty of Sygneca gave a lot of feedback and a couple of features were suggested by them. Students that took undergraduate projects helped to weed out bugs and improve performance and usability -- thank you to Simon Barbey, Fatemeh Borran, Susann Bucher, Badr Hejira, Florian Hof, Clément Hongler and Lukas Rytz.
Update: For the latest iteration of this draft's release, Jonas Bonér, David Pollak, David Hall, Michael Fortson deserve thanks for reporting bugs in the code and the document.
The Scala programming language offers a wide range of constructions and library routines that make dynamic XML processing simple and effective. This section contains an overview of the most common ways to construct XML.
Probably the easiest way to put XML data in your program is to copy and paste it into your program. The following code demonstrates this. To make the matter more interesting, it is spiced with an HTML description.
/* examples/phonebook/phonebook1.scala */
package phonebook

object phonebook1 {

  val labPhoneBook =
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        <a href="http://acme.org">ACME</a> corporation.
      </descr>
      <entry>
        <name>Burak Emir</name>
        <phone where="work">+41 21 693 68 67</phone>
      </entry>
    </phonebook>;

  def main(args: Array[String]) =
    Console.println( labPhoneBook )
}
The Scala parser recognizes the full XML grammar. Further down, we shall see that it actually recognizes a superset, allowing it to parse mixed and nested Scala and XML expressions. As a principle, everything allowed in XML is allowed in Scala, with the only exceptions being motivated by the fact that some aspects of the XML spec just don't make sense for the source code of a program. In return, the extensions to the syntax have been made in order to make programming easier.
$ scalac -d /tmp examples/xml/phonebook/phonebook1.scala
$ scala -classpath /tmp phonebook.phonebook1
<phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        <a href="http://acme.org">ACME</a> corporation.
      </descr>
      <entry>
        <name>Burak Emir</name>
        <phone where="work">+41 21 693 68 67</phone>
      </entry>
    </phonebook>
XML nodes in Scala are always instances of some subclass of scala.xml.Node. The library uses an immutable representation (no parts of an XML node can be changed), but the programmer may provide their own mutable subclasses of scala.xml.Node if required.
By default, elements and text are represented using scala.xml.Elem and scala.xml.Text. These are case classes, so they can be constructed without having to write new and can be used as patterns in a match expression.
The Elem class looks roughly like this:
case class Elem(val prefix: String,           // namespace prefix
                val label: String,            // (local) tag name
                val attributes: MetaData,
                val scope: NamespaceBinding,  // namespace bindings
                val child: Node*) extends Node { ... }
From the constructor, we can see what constitutes an XML element (we shall treat namespaces later). The last formal parameter definition child: Node* indicates that an arbitrary number of nodes (including zero) may be passed to the Elem constructor.
In fact, the above phonebook code can equivalently be written like this:
/* examples/xml/phonebook/verboseBook.scala */
package phonebook

object verboseBook {

  import scala.xml.{ UnprefixedAttribute, Elem, Node, Null, Text, TopScope }

  val pbookVerbose =
    Elem(null, "phonebook", Null, TopScope,
      Elem(null, "descr", Null, TopScope,
        Text("This is a "),
        Elem(null, "b", Null, TopScope, Text("sample")),
        Text("description")
      ),
      Elem(null, "entry", Null, TopScope,
        Elem(null, "name", Null, TopScope, Text("Burak Emir")),
        Elem(null, "phone", new UnprefixedAttribute("where","work", Null), TopScope,
          Text("+41 21 693 68 67"))
      )
    )

  def main(args: Array[String]) =
    Console.println( pbookVerbose )
}
This code does almost the same as the code above. However, the outputs of the two programs differ:
$ scalac -d /tmp examples/xml/phonebook/verboseBook.scala
$ scala -classpath /tmp phonebook.verboseBook
<phonebook><descr>This is a <b>sample</b>description</descr><entry><name>Burak Emir</name><phone where="work">+41 21 693 68 67</phone></entry></phonebook>
Why did the former output look somewhat better, although still not perfect? The answer lies in the whitespace contained in the former program source. Scala's XML parser adopts the simple rule that within XML expressions, whitespace is preserved everywhere. In verboseBook, we did not care to construct superfluous nodes containing only whitespace, so there was no whitespace when we printed it. In most cases, this does not matter (flamewars on xml-dev notwithstanding).
A pretty printer is available to obtain more human-readable output -- try to change the main to:
def main(args: Array[String]) =
  Console.println( new scala.xml.PrettyPrinter(80 /*width*/, 3 /*indent*/).format(pbookVerbose) )
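The difference between raw and pretty-printed output can be seen on a smaller tree. The following is only a sketch, assuming a recent scala-xml release (2.x), where Elem.apply takes an additional minimizeEmpty flag that the Elem class shown in this text does not have; the object name PrettyDemo is made up for the illustration.

```scala
import scala.xml.{Elem, Null, PrettyPrinter, Text, TopScope}

object PrettyDemo {
  // Built without any whitespace-only text nodes, like verboseBook.
  // The fifth argument (minimizeEmpty) is an assumption about newer scala-xml versions.
  val entry: Elem =
    Elem(null, "entry", Null, TopScope, true,
      Elem(null, "name", Null, TopScope, true, Text("Burak Emir")),
      Elem(null, "phone", Null, TopScope, true, Text("+41 21 693 68 67")))

  // Plain serialization: everything ends up on one line.
  val compact: String = entry.toString

  // Pretty-printed with a width of 80 columns and an indentation step of 3.
  val pretty: String = new PrettyPrinter(80, 3).format(entry)
}
```

Printing PrettyDemo.pretty should place the nested elements on separate, indented lines, while PrettyDemo.compact contains no line breaks at all.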
Three things are worth remembering:
Mixed content has to be expressed by juxtaposing Text and Elem.
Attributes are an immutable linked list of UnprefixedAttribute objects, terminated with the Null object.
The Elem constructor is special in that it can deal with an arbitrary number of arguments.
The mysterious occurrences of null (lowercase) and TopScope are for namespace handling. They will be explained later, together with PrefixedAttribute, in the section on namespaces.
Passing sequences to Elem
Sometimes, we want to call a constructor with a sequence parameter, but the sequence of arguments is computed dynamically. The Elem constructor can deal with a sequence as long as you tell the compiler that it is one. You do this by annotating the sequence with _*, like this:
val myElem = Elem(null, "baz", Null, TopScope, computeList(42,"froz"):_* );
Assuming that the result of computeList(42,"froz") is List(Elem(null, "foo", Null, TopScope), Elem(null, "bar", Null, TopScope)), the above code has the same effect as
val myElem = Elem(null, "baz", Null, TopScope, Elem(null, "foo", Null, TopScope), Elem(null, "bar", Null, TopScope) )
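The equivalence of the two forms can be checked directly. This is a minimal sketch, assuming a recent scala-xml release (2.x) in which Elem.apply takes an extra minimizeEmpty flag; the helper computeList below is a hypothetical stand-in with a simplified signature, not the function from the text.

```scala
import scala.xml.{Elem, Node, Null, TopScope}

object SeqArgDemo {
  // Hypothetical helper: builds one empty element per label.
  def computeList(labels: String*): List[Node] =
    labels.toList.map(l => Elem(null, l, Null, TopScope, true))

  val children: List[Node] = computeList("foo", "bar")

  // The sequence is passed as a vararg by annotating it with _*.
  val viaSeq: Elem = Elem(null, "baz", Null, TopScope, true, children: _*)

  // The same element with the children spelled out one by one.
  val spelledOut: Elem = Elem(null, "baz", Null, TopScope, true,
    Elem(null, "foo", Null, TopScope, true),
    Elem(null, "bar", Null, TopScope, true))
}
```

Since nodes are compared structurally, SeqArgDemo.viaSeq == SeqArgDemo.spelledOut holds.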
These syntactic considerations are not very exciting yet (because we have not looked into the things that one can do with those objects). For the developer, the fun starts when he can parameterize some XML fragment or include computed parts in it. This is achieved by embedded expressions, which allow Scala code to be mixed in freely. The following program produces the same output as phonebook1:
/* examples/phonebook/embeddedBook.scala */
package phonebook

object embeddedBook {

  val company = <a href="http://acme.org">ACME</a>
  val first = "Burak"
  val last = "Emir"
  val location = "work"

  val embBook =
    <phonebook>
      <descr>
        This is the <b>phonebook</b> of the
        {company} corporation.
      </descr>
      <entry>
        <name>{ first+" "+last }</name>
        <phone where={ location }>+41 21 693 68 {val x = 60 + 7; x}</phone>
      </entry>
    </phonebook>;

  def main(args: Array[String]) =
    Console.println( embBook )
}
Scala expressions are embedded within an XML fragment using single braces { } [3]. In order to get a single brace character, you have to double it: {{ }}.
Between the braces is an embedded block, which means not only expressions, but also statements, function and class definitions and pretty much everything else is allowed. The last expression in a block determines its "value" -- what will appear in the XML after evaluating preceding code.
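Both rules, the value of an embedded block and the doubling of braces, can be tried out in isolation. The snippet below is a small sketch that assumes the scala-xml library is on the classpath; the object and value names are invented for the illustration.

```scala
object EmbedDemo {
  // An embedded block: the last expression (x) is what appears in the XML.
  val phone = <phone>+41 21 693 68 { val x = 60 + 7; x }</phone>

  // Doubled braces produce literal brace characters instead of starting a block.
  val braces = <code>{{ x }}</code>
}
```

Here EmbedDemo.phone.text yields the complete phone number, and EmbedDemo.braces.text yields the characters { x } without evaluating anything.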
The compiler accepts various types of values within embedded nodes -- everything that is either a scala.xml.Node or something that has a toString method is welcome. For embedded attributes, a string or a sequence of nodes will do: the attribute constructor call UnprefixedAttribute(key, string, next) is mostly equivalent to UnprefixedAttribute(key, Text(string), next).
Often, whether a particular attribute is present depends on some condition, leading to code like this
if(cond) <foo bar="pizza">{ /*lots of code*/ }</foo> else <foo>{ /*lots of code*/ }</foo>
In order to simplify life in such a scenario, Scala allows attribute addition to be made conditional: an attribute value of null means the attribute is omitted.
<foo bar={if (cond) "pizza" else null}>{ /*lots of code*/ }</foo>
Type-safety is a nice property, and having a compiler check options for you is often much better than using null. This is why you can also use Option types for nullable attributes, provided the option wraps an instance of Seq[Node].
val z = if (cond) { Some(Text("pizza")) } else { None }
<foo bar={z}>{ /*lots of code*/ }</foo>
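The two ways of making an attribute conditional can be compared side by side. This is a sketch only, assuming a recent scala-xml release; the object CondAttrDemo and its method names are made up for the example.

```scala
import scala.xml.{Elem, Text}

object CondAttrDemo {
  // A null attribute value means the attribute is left out.
  def withNull(cond: Boolean): Elem =
    <foo bar={ if (cond) "pizza" else null }/>

  // The same with an Option; Text is a Node and therefore also a Seq[Node].
  def withOption(cond: Boolean): Elem = {
    val z: Option[Text] = if (cond) Some(Text("pizza")) else None
    <foo bar={ z }/>
  }
}
```

In both variants, looking up the attribute with attribute("bar") on the element built for a false condition gives None.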
Although the above is sufficient for most purposes, there are a couple of other nodes that can be used.
EntityRef, ProcInstr and Comment - for entity references, processing instructions and comments.
Group - for grouping nodes.
Unparsed - for including verbatim text, e.g. when generating non-XHTML hypertext.
Atom - for nodes containing data of any type, e.g. int, Date.
At first sight, it appears that attributes should only be strings and nothing else. However, there are two reasons to allow the same kind of nodes (other than element nodes) that can appear within XML: data values and entity references.
<foo name="s&uuml;ss" life={Atom(42)}/>

// is translated to

Elem(null, "foo",
     new UnprefixedAttribute("name", List(Text("s"), EntityRef("uuml"), Text("ss")),
       new UnprefixedAttribute("life", Atom(42), Null)),
     TopScope)
Fortunately, a single node always behaves as if it were a sequence of nodes, so there is no need to wrap elements in a singleton list.
Scala provides pattern matching to search and decompose sequences. Pattern matching can also be used to decompose XML.
For instance to find out whether a variable contains an "entry" element which has as last child a "foo" with no children, this pattern will do:
x match { case Elem(_,"entry", _, _, _*, Elem(_, "foo", _)) => true case _ => false }
This also works using XML syntax:
x match { case <entry>{ _* }<foo/></entry> => true case _ => false }
However, there is no support for testing presence or values of attributes. This can be achieved using guards, for instance like in the following example
x match { case link @ <a>{ _* }</a> if link.attribute("href").isEmpty => "href missing" }
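A slightly fuller version of the guard technique might look as follows. It is a sketch under the assumption of a recent scala-xml release; the object MatchDemo, the method check and the result strings are invented for the illustration.

```scala
import scala.xml.{Elem, Node}

object MatchDemo {
  def check(n: Node): String = n match {
    // XML patterns ignore attributes, so the guard inspects them explicitly.
    case link @ <a>{ _* }</a> if link.attribute("href").isEmpty => "href missing"
    case <a>{ _* }</a>                                          => "link ok"
    // The extractor form gives access to the label of any other element.
    case Elem(_, label, _, _, _*)                               => "other element: " + label
    case _                                                      => "not an element"
  }
}
```

For instance, MatchDemo.check(<a>home</a>) reports the missing attribute, whereas an a element carrying an href attribute falls through to the second case.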
The Scala XML API takes a functional approach to representing data, eschewing imperative updates where possible. Since nodes as used by the library are immutable, updating an XML tree can be a bit verbose, as the XML tree has to be copied. Here is an example of how this could be done.
/* examples/xml/phonebook/phonebook2.scala */
package phonebook;

object phonebook2 {

  import scala.xml.Node

  /** adds an entry to a phonebook */
  def add( p: Node, newEntry: Node ): Node = p match {
    case <phonebook>{ ch @ _* }</phonebook> =>
      <phonebook>{ ch }{ newEntry }</phonebook>
  }

  val pb2 =
    add( phonebook1.labPhoneBook,
         <entry>
           <name>Kim</name>
           <phone where="work">+41 21 111 11 11</phone>
         </entry> );

  def main( args: Array[String] ) =
    Console.println( pb2 )
}
The add method will throw a MatchError if the node does not have a phonebook tag. It is also possible to express it using only method calls:
def add( p: Node, e: Node ) = Elem(null, p.label, Null, TopScope, (p.child ++ e):_*)
Here we assume that our element representing a phonebook will never have a namespace prefix (null), never have attributes (Null) and never define namespace bindings (TopScope). Without these assumptions, we would have copied p.prefix, p.attributes and p.scope over to the new element as well. The _* ("sequence escape") has been explained before: see Passing sequences to Elem.
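When prefix, attributes and scope should be carried over, newer versions of the library make this less verbose. The following sketch assumes a recent scala-xml release (2.x), in which Elem offers a copy method with named, defaulted parameters; this method is not part of the Elem class as shown in this text, and the object name AddEntryDemo is made up.

```scala
import scala.xml.{Elem, Node}

object AddEntryDemo {
  /** Returns a copy of p with e appended as last child.
    * Prefix, label, attributes and scope are taken over from p unchanged.
    */
  def add(p: Elem, e: Node): Elem =
    p.copy(child = p.child :+ e)
}
```

The original element is not modified: add returns a new tree that shares the existing children with the old one.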
Changing the phone number of an entry is similar. First we look up an entry by traversing the tree and copying it. Then we provide an updated copy of the element we wish to change.
package phonebook;

object phonebook3 {

  import scala.xml.{Elem, Node, Text} ;
  import scala.xml.PrettyPrinter ;
  import Node.NoAttributes ;

  /* this method "changes" (returns an updated copy) of the phonebook when the
   * entry for Name exists. If it has an attribute "where" whose value is equal to the
   * parameter Where, it is changed, otherwise, it is added.
   */
  def change ( phonebook:Node, Name:String, Where:String, newPhone:String ) = {

    /** this nested function walks through tree, and returns an updated copy of it */
    def copyOrChange ( ch: Iterator[Node] ) = {

      import xml.Utility.{trim,trimProper} //removes whitespace nodes, which are annoying in matches

      for( val c <- ch ) yield
        trimProper(c) match {

          // if the node is the particular entry we are looking for, return an updated copy
          case x @ <entry><name>{ Text(Name) }</name>{ ch1 @ _* }</entry> =>

            var updated = false;
            val ch2 = for(val c <- ch1) yield c match {

              // does it have the phone number?
              case y @ <phone>{ _* }</phone> if y \ "@where" == Where =>
                updated = true
                <phone where={ Where }>{ newPhone }</phone>

              case y => y
            }
            if( !updated ) { // no, so we add as first entry
              <entry>
                <name>{ Name }</name>
                <phone where={ Where }>{ newPhone }</phone>
                { ch1 }
              </entry>
            } else {         // yes, and we changed it as we should
              <entry>
                { ch2 }
              </entry>
            }
          // end case x @ <entry>...

          // other entries are copied without changing them
          case x =>
            x
        }
    } ; // for ... yield ... returns an Iterator[Node]

    // decompose phonebook, apply updates
    phonebook match {
      case <phonebook>{ ch @ _* }</phonebook> =>
        <phonebook>{ copyOrChange( ch.elements ) }</phonebook>
    }
  }

  val pb2 = change( phonebook1.labPhoneBook, "John", "work", "+41 55 555 55 55" );

  val pp = new PrettyPrinter( 80, 5 );

  def main( args:Array[String] ) = {
    Console.println("---before---");
    Console.println( pp.format( phonebook1.labPhoneBook ));
    Console.println("---after---");
    Console.println( pp.format( pb2 ));
  }
}
Namespaces [names1.0][names1.1] were introduced into extensible markup long after the XML specification was out. The intention is to provide a means of 'packaging' related names by associating them with a URI. The association happens indirectly by (1) binding URIs to prefixes and (2) prefixing names using the syntax 'prefix:localname', i.e. using the colon as a separator. Consequently, the colon is not a part of names anymore.
Namespace prefixes have to be taken into account (Binding,Scope) because they are used whenever QNames live in content (for example, in XML Schema Definitions).
To avoid clutter, the standard allows a 'default namespace' to be declared, which implicitly associates unprefixed names with a certain URI. Finally, it is possible to undeclare namespaces by binding their prefix to the empty string. (v1.0 only allowed undeclaring the default namespace, but in v1.1 this has been generalized to any prefix.)
The meaning of an empty string is to undeclare the namespace-prefix mapping. In the past, this has caused considerable headache: The Namespaces in XML recommendation allowed the empty string only for the default namespace binding, i.e. xmlns="" was allowed, but xmlns:foo="" was not. However, this unnecessary distinction between default and other namespace bindings (those with a prefix) was removed in Namespaces in XML 1.1. Now "undeclarations" are allowed for both kinds.
How does this look in Scala? Namespace bindings are treated in a class aptly named NamespaceBinding, which is a linked list of prefix-URI pairs. A default namespace is synonymous with a namespace for the null prefix (not the empty string), and undeclaring a namespace binding is done by assigning the empty string as URI.
The TopScope is the most common top-level scope, the empty prefix-URI mapping that does not contain any binding.
Here is an example of what the compiler does with scala.xml.NamespaceBinding.
Assume we had an internal variable $scope containing the active bindings at each element. Then for the following fragment
val foobar = <foo:bar xmlns:foo="http://foo.com" foo:key="value" xmlns="urn:default" attr="42"><a/></foo:bar>
the compiler has to take the following steps to update the scope, translating everything roughly into:
val foobar = {
  // add bindings to scope
  scope = new NamespaceBinding(null, "urn:default",
            NamespaceBinding("foo", "http://foo.com", scope))
  // make attributes
  val md = new UnprefixedAttribute("attr","42",
             new PrefixedAttribute("foo","key","value", Null))
  // make element
  Elem("foo","bar", md, scope,
       Elem(null, "a", Null, scope))
}
The element labeled bar uses a prefix which tells us it is in the namespace http://foo.com. The element a is nested under bar, so it is affected by the same namespace bindings. It is in the namespace urn:default.
Namespace binding is properly scoped over the child nodes: Unless a descendant undeclares a prefix, the prefix is bound to a URI according to the bindings defined for the parent. As can be seen, namespace bindings are treated differently from regular attributes -- this seems a good compromise since they are shared, have different properties and there is an important class of users that is simply not concerned with namespaces. The library is designed to handle namespace bindings by itself, and where namespace manipulations are needed, they are effected on the scope members and NamespaceBinding classes.
Attributes without a prefix are not implicitly put in the same namespace as the element in which they occur. This is the reason why there is an UnprefixedAttribute and a PrefixedAttribute class. UnprefixedAttributes have no namespace at all.
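The namespace rules described above can be observed on the same fragment. This is a sketch that assumes a recent scala-xml release, using the namespace and attribute accessors of Node; the object name NamespaceDemo is invented.

```scala
import scala.xml.Elem

object NamespaceDemo {
  val foobar: Elem =
    <foo:bar xmlns:foo="http://foo.com" foo:key="value"
             xmlns="urn:default" attr="42"><a/></foo:bar>

  // The prefix foo resolves to the URI bound in the element's scope.
  val barNs: String = foobar.namespace

  // The unprefixed child picks up the default namespace from the shared scope.
  val childNs: String = (foobar \ "a").head.namespace

  // A prefixed attribute is found by namespace URI and local name ...
  val prefixedAttr = foobar.attribute("http://foo.com", "key")

  // ... but not by an unprefixed lookup, which only sees attributes without a prefix.
  val unprefixedLookup = foobar.attribute("key")
  val plainAttr = foobar.attribute("attr")
}
```

The lookup of "key" without a namespace comes back empty, which reflects the distinction between prefixed and unprefixed attributes.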
Implementations of XML infrastructure routines typically share namespace nodes in the data model. This accounts for the lexical scoping which is prescribed by the spec.
Some requirements are expected of such XML infrastructure routines. It is for instance absolutely necessary to preserve namespace bindings as they are given in source documents (because some documents, like XSD schemata, refer to prefixes not only in XML names but also in content). Then it is often desirable that identical namespace bindings are not repeated in each node, i.e. the number of namespace bindings xmlns:prefix="..." should be minimized. This in turn becomes more tricky when sharing namespaces -- we might mix fragments from different trees, in which case namespace nodes might convey identical information and yet have different object identity.
The current implementation will not properly stratify namespace bindings when elements from different scopes are combined. This is not a problem when querying or processing XML data, but it might lead to wrong namespace bindings when serializing XML. A modified version of the serializing algorithm can solve the problem by introducing namespace declarations and undeclarations in the right place. Since it seems a rare problem and developers can stratify namespaces themselves in a given XML application, your humble author and scala.xml maintainer did not consider this issue a priority.
The XML Path Language (XPath) [xml] is a language expressing simple queries on XML documents. This example illustrates how XPath projection can be used in Scala
package bib;

object bib {

  import scala.xml.{Node,NodeSeq};
  import scala.xml.PrettyPrinter;

  val biblio =
    <bib>
      <book>
        <author>Peter Buneman</author>
        <author>Dan Suciu</author>
        <title>Data on ze web</title>
      </book>
      <book>
        <author>John Mitchell</author>
        <title>Foundations of Programming Languages</title>
      </book>
    </bib> ;

  val pp = new PrettyPrinter(80, 5);

  def main(args:Array[String]):Unit = {
    Console.println( pp.formatNodes( biblio \ "book" \ "title" ));
    // prints
    // <title>Data on ze web</title><title>Foundations ...</title>

    Console.println( pp.formatNodes( biblio \\ "title" ));
    // prints the same

    Console.println( pp.formatNodes( biblio \\ "_" ));
    // prints node and all descendant

    Console.println( pp.formatNodes( biblio.descendant_or_self ));
    // prints the same
  }
}
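The projection operators also combine well with ordinary collection methods. The following sketch assumes a recent scala-xml release; the year attributes are an addition made up for this illustration and are not part of the bibliography above.

```scala
import scala.xml.Elem

object ProjectionDemo {
  // Same shape as the bibliography above, plus hypothetical year attributes.
  val biblio: Elem =
    <bib>
      <book year="2000">
        <author>Peter Buneman</author>
        <author>Dan Suciu</author>
        <title>Data on ze web</title>
      </book>
      <book year="1996">
        <author>John Mitchell</author>
        <title>Foundations of Programming Languages</title>
      </book>
    </bib>

  // Child projection, step by step.
  val titles: List[String] = (biblio \ "book" \ "title").toList.map(_.text)

  // Descendant projection reaches the same nodes in one step.
  val allTitles: List[String] = (biblio \\ "title").toList.map(_.text)

  // An @ prefix selects an attribute instead of a child element.
  val years: List[String] = (biblio \ "book").toList.map(b => (b \ "@year").text)
}
```

Whitespace-only text nodes between the elements do not get in the way, because the projection only selects children with the given label.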
Here is a sample program to convert Docbook to some other format:
object transform {

  import scala.xml._ ;
  import scala.xml.dtd._ ;
  import org.xml.sax.InputSource ;

  /* a former version of Scala used regular expression patterns, like
   * in the following code. In the future, we plan to reactivate some
   * well-behaved regular expressions again

  // gimmick: text replacement "scalac" => <code>scalac</code>
  def cook(s: String): Seq[Node] = cook1(s) ;

  def cook1(s: Seq[Char]):List[Node] = s match {
    case Seq( a @ _*, 's','c','a','l','a','c', b @ _* ) =>
      Text(cook2( a )) :: <code>scalac</code> :: cook1( b )
    case _ =>
      List( Text( cook2( s )))
  }

  def cook2(s: Seq[Char]): String = {
    val r = new StringBuffer();
    s.foreach { c:char => val _ = r.append(c); };
    r.toString()
  }
  */

  def transform1(ns: Iterable[Node]): Seq[Node] = {
    val zs = new NodeBuffer();
    for(val z <- ns) { zs &+ transform( z ) }
    zs
  }

  /** this function holds "templates" that transform nodes of an input tree
   * into an iterable representation of a sequence of nodes of the output
   * tree.
   *
   * It is ok to return a single node, since each node is at the same
   * time a singleton sequence. Likewise, the pattern variable x will be
   * of type Seq[Node], although here it is always binding a single node.
   */
  def transform(n: Node):Iterable[Node] = n match {

    case x @ <article>{ ns @ _ * }</article> =>
      <source active="ant" title={ (x \ "title" \ "_").toString() }>
        <header>
          <author>Burak Emir</author>
          <keywords>Scala4Ant</keywords>
          <style type="text/css"></style>
        </header>
        <content>
          <title><scala/> Ant Task</title>
          { transform1( x \ "_" ) }
        </content>
      </source>

    case x @ <sect1>{ _* }</sect1> =>
      <section>{ transform1( x \ "_" ) }</section>

    case x @ <title>{ _* }</title> =>
      <h>{ x \ "_" }</h>

    case x @ <para>{ _* }</para> =>
      <p>{ transform1( x \ "_" ) }</p>

    case x @ <itemizedlist>{ _* }</itemizedlist> =>
      <ul>{ transform1( x \ "_" ) }</ul>

    case x @ <listitem>{ _* }</listitem> =>
      <li>{ transform1( x \ "_" ) }</li>

    case x @ <constant>{ _* }</constant> =>
      // an xml:group is a sequence that appears to the scala type system
      // as a single node. Here it is used to append a text node with a space
      <xml:group><code>{ x \ "_" }</code> </xml:group>

    case x @ <programlisting>{ _* }</programlisting> =>
      <pre>{ x \ "_" }</pre>

    case Elem(namespace, label, attrs, scp, ns @ _*) =>
      Elem(namespace, label, attrs, scp, transform1( ns ):_* )

    case z => z
  };

  def main(args:Array[String]) = {
    if( args.length == 1 ) { // must have one arg
      object ConsoleWriter extends java.io.Writer {
        def close() = {}
        def flush() = Console.flush
        def write(cbuf:Array[char], off:int , len:int ): unit = {
          var i = 0
          while(i < len) {
            Console.print(cbuf(off + i))
            i = i + 1 // advance, otherwise the loop never terminates
          }
        }
      }
      val src = XML.load(new InputSource( args( 0 ))); //use Java parser
      // transform returns an iterable, but we know it is a singleton
      // sequence, so we get its first element
      val n = transform( src ).elements.next
      val doctpe = DocType("html",PublicID("-//W3C//DTD XHTML 1.1//EN","../default.dtd"), Nil)
      /** write document to console, with encoding latin1, xml declaration
       * and doctype
       */
      XML.write(ConsoleWriter, n, "iso-8859-1", true, doctpe)
    } else
      error("need one arg");
  }
}
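A similar template style can be expressed with the classes of package scala.xml.transform. The following is only a sketch, assuming a recent scala-xml release (2.x) with RewriteRule, RuleTransformer and the copy method of Elem; it renames a few DocBook elements and is not a replacement for the full program above. The names RewriteDemo, rename and toHtml are made up.

```scala
import scala.xml.{Elem, Node}
import scala.xml.transform.{RewriteRule, RuleTransformer}

object RewriteDemo {
  // One rule that renames selected elements and leaves all other nodes untouched.
  val rename: RewriteRule = new RewriteRule {
    override def transform(n: Node): Seq[Node] = n match {
      case e: Elem if e.label == "para"         => e.copy(label = "p")
      case e: Elem if e.label == "itemizedlist" => e.copy(label = "ul")
      case e: Elem if e.label == "listitem"     => e.copy(label = "li")
      case other                                => other
    }
  }

  // The transformer walks the tree recursively and applies the rule to every node.
  val toHtml = new RuleTransformer(rename)
}
```

Calling RewriteDemo.toHtml on an itemizedlist element returns a new tree in which the list, its items and their paragraphs carry the HTML names.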
This example illustrates XQuery style querying
package bib;

object bibq {

  val theBib = bib.biblio ;

  for( val b <- theBib \ "book" )
    for( val a <- b \ "author" ) {
      Console.println( a )
    }
}
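With yield, such a nested iteration produces a result instead of printing, which comes closer to a query. This sketch assumes a recent Scala and scala-xml version (note that the val keyword inside for is no longer used there); the data and the name QueryDemo are made up for the illustration.

```scala
object QueryDemo {
  val theBib =
    <bib>
      <book><author>Peter Buneman</author><author>Dan Suciu</author><title>Data on ze web</title></book>
      <book><author>John Mitchell</author><title>Foundations of Programming Languages</title></book>
    </bib>

  // One (author, title) pair per author of each book.
  val authorTitle: List[(String, String)] =
    for {
      b <- (theBib \ "book").toList
      a <- (b \ "author").toList
    } yield (a.text, (b \ "title").text)
}
```

The result is an ordinary Scala list, so it can be filtered, sorted or grouped with the usual collection methods.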
If you just want to load XML, without using databinding, try this:
object Foo extends Application {
  val x = scala.xml.XML.loadFile("myfile.xml");
  Console.println(x);
}
The value x will be of type scala.xml.Elem, which in turn is an implementation of the scala.xml.Node interface.
The parser used for parsing the XML is currently the XML parser that comes with the underlying JDK.
There is also a save method defined there:
object Foo extends Application {
  val y: Elem = ...
  scala.xml.XML.save("myfile.xml", y);
}
There is also a write method that outputs XML to anything implementing the java.io.Writer class.
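Loading and writing can be tried without touching the file system. The following sketch assumes a recent scala-xml release, where XML.loadString parses a string and XML.write takes a writer, the node, an encoding name, a flag for the XML declaration and a doctype (null here, meaning none); the names LoadWriteDemo and serialize are invented.

```scala
import java.io.StringWriter
import scala.xml.{Elem, XML}

object LoadWriteDemo {
  // Parsed with the JDK parser, like loadFile, but from a string.
  val x: Elem = XML.loadString("<entry><name>Kim</name></entry>")

  // Writes the element with an XML declaration into an in-memory writer.
  def serialize(e: Elem): String = {
    val w = new StringWriter()
    XML.write(w, e, "UTF-8", true, null) // encoding, xmlDecl, doctype
    w.toString
  }
}
```

Any other java.io.Writer, for example a FileWriter, can be passed in the same way.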
Scala has an XML parser of its own, which can be invoked like this:
import scala.xml.parsing.ConstructingParser

val p = ConstructingParser.fromFile(file, true /*preserve whitespace*/)
val d: xml.Document = p.document
The advantage of this parser is that the developer has more fine-grained control over what to parse. It is for instance possible to parse a sequence of elements from a stream (the XML spec allows only one), or to obtain the entity declarations from the internal subset of the DTD.
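The native parser also accepts any scala.io.Source, which is convenient for small experiments. This is a sketch assuming a recent scala-xml release, where ConstructingParser.fromSource exists next to fromFile and document is called with parentheses; the object name NativeParserDemo is made up.

```scala
import scala.io.Source
import scala.xml.Node
import scala.xml.parsing.ConstructingParser

object NativeParserDemo {
  val src: Source = Source.fromString("<entry><name>Kim</name></entry>")

  // The second argument asks the parser to preserve whitespace.
  val parser = ConstructingParser.fromSource(src, true)

  // The document carries the root element (and, if present, the DTD).
  val doc = parser.document()
  val root: Node = doc.docElem
}
```

From here on, root behaves like any other node, so (root \ "name").text gives access to the parsed content.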
The native XML parser can also be used for pull parsing. An experimental API is accessible via scala.xml.pull.XMLEventReader. You need to provide a scala.io.Source, just like for the constructing parser.
This section describes the classes in scala.xml.
scala.xml.Node is the abstract superclass of all XML nodes as represented in the Scala library. Nodes have an optional prefix (null = no prefix), a namespace binding scope, a list of metadata (attributes), and a sequence of children. A node can be considered as a singleton sequence containing the node, because it inherits from NodeSeq.
Sequences of nodes are pretty common in XML processing. The main use of the NodeSeq class is to add the XPath-like methods \ and \\ to any sequence of nodes, regardless of its concrete representation. It is a wrapper class, which gets automatically created by means of Scala's view mechanism.
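The difference between the two projection methods can be sketched as follows, assuming a current Scala with the scala-xml module on the classpath. The document and the name PathDemo are made up for illustration.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.2.0"
import scala.xml.{Elem, NodeSeq}

object PathDemo {
  val doc: Elem = <a><b><c>1</c></b><c>2</c></a>

  // \ looks only at the direct children of each node in the sequence.
  val children: NodeSeq = doc \ "c"

  // \\ looks at the nodes themselves and all of their descendants.
  val descendants: NodeSeq = doc \\ "c"

  // A single Node is itself a NodeSeq, so \ works on it as well.
  val nested: NodeSeq = (doc \ "b").head \ "c"
}
```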
Elem is a case class implementing scala.xml.Node. XML literals embedded in Scala code will get turned into Elem instances. Also, most default parsing factories will produce Elem instances. By contrast, most library routines (e.g. the PrettyPrinter) expect instances of Node, so it is possible to call them with custom XML representations.
EntityRef represents entity references. It is possible to output entity declarations using the classes in scala.xml.dtd.
MetaData is the abstract superclass of attribute nodes. Attributes are realized as an immutable linked list. Since attribute order does not matter in XML, default parser factories may actually change (typically reverse) the order when they parse XML.
The Null object is used to 'ground' linked attribute lists. It is also the representation of empty attribute lists.
A prefixed attribute has a prefix, a name, a value and a pointer to the tail of the attribute list. It answers getValue(uri,scope,key) calls with its value if its own prefix matches the uri in the given scope (typically the scope of the parent element). It will not answer getValue(key) calls, because the Namespaces spec considers it distinct from an unprefixed attribute.
An unprefixed attribute has a name, a value and a pointer to the tail of the attribute list. It answers getValue(key) calls, but not the namespace-aware ones described above.
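The two kinds of lookup can be sketched as follows. This assumes a current Scala with the scala-xml module on the classpath, where (as far as I can tell) the corresponding lookups are exposed on Node as attribute(key) and attribute(uri, key) rather than under the getValue name used above. The namespace URI, the element and the helper textOf are made up for illustration.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.2.0"
import scala.xml.{Elem, Node}

object AttrDemo {
  // Hypothetical namespace URI, bound to the prefix p below.
  val ns = "http://example.org/ns"

  val e: Elem = <item xmlns:p="http://example.org/ns" p:id="1" id="2"/>

  // Illustrative helper: concatenate the text of an attribute value.
  private def textOf(v: Option[scala.collection.Seq[Node]]): Option[String] =
    v.map(_.map(_.text).mkString)

  // Plain lookup: answered only by the unprefixed attribute.
  val unprefixed: Option[String] = textOf(e.attribute("id"))

  // Namespace-aware lookup: answered only by the prefixed attribute,
  // whose prefix p resolves to ns in the scope of e.
  val prefixed: Option[String] = textOf(e.attribute(ns, "id"))
}
```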
The NamespaceBinding class represents namespace bindings using a linked list of namespace binding nodes.
The following changes were made to the Scala syntax in order to accommodate literal XML and XML expressions:
Lexical syntax (Chapter 1)
Programming languages are usually defined in terms of lexical syntax, handled by a scanner, and context-free syntax, handled by a parser. Scala is no exception to this rule, adopting a lexical syntax similar to Java's but with more freedom for definition of operators etc.
The lexical syntax of XML documents cannot be reconciled with the lexical syntax of Scala code. Therefore, in addition to the Java-like lexical syntax, a Scala parser needs to treat every input character differently and in conformance with the XML specification when entering a literal XML element. This happens when the following character sequence is encountered:
( S | '(' | '{' ) '<' (Letter | '_' )
Thus, whenever a < is immediately preceded by whitespace, '(' or '{', and immediately followed by an XML name start character, the scanner is forced to interpret the following characters as XML input. In the following, this will be referred to as the scanner being 'in XML mode'. The scanner changes from XML mode to Scala mode when one of the following conditions holds:
The XML expression or an XML pattern started by the initial '<' has been successfully parsed.
The parser encounters an embedded Scala expression or pattern, indicated by a '{'. This changes the scanner back to normal mode, until the closing '}' is found, which puts the scanner into XML mode again.
Since the nested Scala expression can contain nested XML expressions/patterns, the parser thus has to maintain a stack that reflects the nesting of XML and Scala expressions adequately.
Note that Scala comments are interpreted as text (parseable character data) in XML mode.
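The mode switches described above can be observed in a small sketch, assuming a current Scala with the scala-xml module on the classpath. The names are illustrative only.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.2.0"
import scala.xml.Elem

object ModeDemo {
  val items = List("a", "b")

  // XML mode -> '{' Scala mode -> nested XML mode -> '{' Scala mode again.
  val list: Elem = <ul>{ items.map(i => <li>{ i }</li>) }</ul>

  // In XML mode a Scala comment is just character data.
  val notAComment: Elem = <a>// just text</a>
}
```

The nesting of literals inside embedded expressions is what requires the stack mentioned above.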
Expression (Ch.4) and pattern (Ch.7) syntax
The following two productions are added to the Scala grammar (see below for the XML expression and pattern grammar):
xmlExpr ::= Element (Element*)
xmlPat  ::= ElementPattern
As said before, they indicate that the scanner is in XML mode.
The meaning of XML expressions and patterns is given using equivalent Scala expressions and patterns.
An element <pre:name att1=val1 pre2:att2=val2 ... attN=valN> content </pre:name>
is interpreted as
scala.xml.Elem("pre", "name", UnprefixedAttribute(att1, val1, PrefixedAttribute(pre2, att2, val2, ... UnprefixedAttribute(attN, valN, Null))), content)
A sequence of elements e1...eN is interpreted as (a concrete representation of) Seq(e1...eN)
Embedded Scala expressions are interpreted by themselves.
An element pattern '<name> contentPattern </name>' is interpreted as 'scala.xml.Elem("name", contentPattern )'
Embedded Scala patterns are interpreted by themselves.
Note that this implies that an XML expression consisting of one element will be of type 'scala.xml.Elem' whereas an XML expression consisting of two or more elements will be of type 'Seq[scala.xml.Elem]'.
This tool is an adaptation of Elliotte Rusty Harold's SAXXIncluder to Scala. It builds on top of the relevant Java API classes and was crucial for including the code samples in this document.
At this point some information (the Scaladoc description) is available at the xinc homepage.
Despite great APIs, data represented in XML tends to be converted to and from object representations. This task is called data binding. It can in principle be coded manually if the data representations are intricate. But often, conversion has to bridge fairly straightforward XML types and fairly basic "pure data" classes. The latter scenario is a case for automation.
For the sake of an example, consider the following way to represent bug reports:
<bugReport id="42">
  <dateSubmitted>2003-06-25</dateSubmitted>
  <status> fixed </status>
  <submitter> Matthias</submitter>
  <assignedTo> Michel </assignedTo>
  <code> ... </code>
  <whatHappened>...</whatHappened>
  <whatExpected>...</whatExpected>
</bugReport>

bugReport.dtd:
<!ELEMENT bugReport (dateSubmitted, status, submitter, assignedTo, code, whatHappened, whatExpected)>
<!ELEMENT dateSubmitted (#PCDATA)>
<!ELEMENT status (#PCDATA)>
<!ELEMENT submitter (#PCDATA)>
<!ELEMENT assignedTo (#PCDATA)>
<!ELEMENT code (#PCDATA)>
<!ELEMENT whatHappened (#PCDATA)>
<!ELEMENT whatExpected (#PCDATA)>
There are many scenarios in which we would like to programmatically manipulate bug reports in ways that cannot be handled by XML tools. We might want to store and retrieve them in a relational database, access and compile the source in the code element, or notify the submitter of changes by email.
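To make concrete what data binding has to do, here is a hand-written sketch for a few fields of the bug report. It is not the output of any tool: the BugReport class and the functions fromXml and toXml are made up for illustration, and the code assumes a current Scala with the scala-xml module on the classpath.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.2.0"
import scala.xml.{Elem, Node}

object ManualBinding {
  // Hypothetical "pure data" class covering only part of the bug report.
  final case class BugReport(id: Int, dateSubmitted: String,
                             status: String, submitter: String)

  // XML -> object: pick out the attribute and child elements by name.
  def fromXml(n: Node): BugReport =
    BugReport(
      id            = (n \ "@id").text.toInt,
      dateSubmitted = (n \ "dateSubmitted").text.trim,
      status        = (n \ "status").text.trim,
      submitter     = (n \ "submitter").text.trim
    )

  // Object -> XML: an XML literal with embedded Scala expressions.
  def toXml(b: BugReport): Elem =
    <bugReport id={ b.id.toString }>
      <dateSubmitted>{ b.dateSubmitted }</dateSubmitted>
      <status>{ b.status }</status>
      <submitter>{ b.submitter }</submitter>
    </bugReport>
}
```

Writing such conversions by hand for every element type quickly becomes repetitive, which is the case for automation made above.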
Using the data binding tool schema2src, it is possible to generate the following classes from the DTD above. We can invoke schema2src with its DTD module in the following way
This will result in a source file being generated. The source file mainly contains one object, one case class and one method definition for each element declaration found in the DTD.
object BugReportDefs {
  import scala.xml.{MetaData, Node};

  object bugReport {
    def validate(atts: MetaData, child: Node*) ...
  }
  case class bugReport (atts: MetaData, child: Node*) { ... }
  def bugReport(s:String) = new bugReport(s);

  object dateSubmitted {
    def validate(atts: MetaData, child: Node*) ...
  }
  case class dateSubmitted(atts: MetaData, child: Node*) { ... }
  def dateSubmitted(s:String) = new dateSubmitted(s);

  ...
}
Importing the types from BugReportDefs has several benefits for the programmer working with bug reports. She can
construct elements
val n = scala.xml.Null; // empty attribute list
bugReport(UnprefixedAttribute("id","42",n),
  dateSubmitted("2003-06-25"),
  status       ("fixed"),
  submitter    ("Matthias"),
  assignedTo   ("Michel"),
  code         ("..."),
  whatHappened ("..."),
  whatExpected ("...")
)
load elements
todo...
At this point some information (including the Scaladoc description) is available at the schema2src homepage.
The following grammar shows equal (=) and different (X) productions of XML literals embedded in Scala. (DRAFT-TODO) Check parser (?) and add link to namespace recommendation.
(=)element      ::= EmptyElemTag
                  | STag content ETag                      WFC: Element Type Match
(=)EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'        WFC: Unique Att Spec
(=)STag         ::= '<' Name (S Attribute)* S? '>'         WFC: Unique Att Spec
(=)ETag         ::= '</' Name S? '>'
(=)content      ::= (CharData)? ((element | Reference | Comment | PI | ScalaExp ) (CharData)?)*
(=)Attribute    ::= Name Eq AttValue                       WFC: No < in Attribute Values
(X)AttValue     ::= '"' ([^<&"] | CharRef)* '"'
                  | "'" ([^<&'] | CharRef)* "'"
                  | ScalaExp
   ScalaExp     ::= '{' expr '}'
(X)CharData     ::= [^<&]* - ([^<&]* '{'[^<&{] [^<&]*) - ([^<&]* '{']]> [^<&]*)
(=)Reference    ::= CharRef | EntityRef
(=)CharRef      ::= '&#' [0-9]+ ';'
                  | '&#x' [0-9a-fA-F]+ ';'                 WFC: Legal Character
(=)EntityRef    ::= '&' Name ';'
(=)Char         ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
(=)S            ::= (#x20 | #x9 | #xD | #xA)+
(=)Comment      ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
(=)PI           ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
(=)NameChar     ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
(?)Name         ::= (Letter | '_' | ':') (NameChar)*

   ElemPattern   ::= EmptyElemTagP
                   | STagP ContentP ETagP                  WFC: Element Type Match
   EmptyElemTagP ::= '<' Name S? '/>'
   STagP         ::= '<' Name S? '>'
   ETagP         ::= '</' Name S? '>'
   ContentP      ::= (CharData)? ((ElemPattern | ScalaPatterns ) (CharData)?)*
   ScalaPatterns ::= '{' patterns '}'
This list reflects the differences from the XML productions, with justifications:
(missing prolog). XML expressions do not need a prolog. This implies that there are no doctype or entity declarations.
(changed Char) No single left brace '{'. These have to be written using '{{'
(changed EntityRef) No well-formedness constraint for checking entity references. Since entities cannot be declared, any entities that appear in an XML expression are parsed, but never expanded.
(changed AttValue) Attributes can have Scala expressions as values.
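Three of these differences can be seen in a short sketch, assuming a current Scala with the scala-xml module on the classpath. The element names are made up for illustration.

```scala
//> using dep "org.scala-lang.modules::scala-xml:2.2.0"
import scala.xml.{Elem, EntityRef}

object Differences {
  val n = 42

  // Attribute value given as a Scala expression; literal braces in
  // character data are written doubled, a single '{' starts Scala code.
  val e: Elem = <v id={ n.toString }>{{ n }} = { n }</v>

  // An entity reference is parsed into an EntityRef node, not expanded.
  val p: Elem = <p>a&nbsp;b</p>
  val entityNames: Seq[String] =
    p.child.collect { case EntityRef(name) => name }.toList
}
```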
For simplicity, Scala omits some parts of XML. There are two kinds of omission: items omitted from the syntax cannot be used in any Scala program, whereas items omitted in the representation lack some treatment described in the XML spec.
Thus Scala syntax does not include:
prolog: XML fragments in Scala programs do not need a prolog
(changed AttValue) Attributes cannot refer to parsed entities.
[dsd] BRICS. Document Structure Description 2.0 .
[schematron] ISO/IEC. Draft International Standard 19757-3 Document Schema Definition Languages Part 3: Rule-based validation .
[rng] OASIS. 2001. Committee Specification: RELAX NG Specification .
[xml] W3C. Recommendation: Extensible Markup Language (XML) 1.0 .
[xpath] W3C. Recommendation: XML Path Language (XPath) 1.0 .
[xsd0] W3C. Recommendation: XML Schema Part 0: Primer Second Edition .
[xsd1] W3C. Recommendation: XML Schema Part 1: Structures .
[xsd2] W3C. Recommendation: XML Schema Part 2: Datatypes .
[names1.0] W3C. Recommendation: Namespaces in XML .
[names1.1] W3C. Recommendation: Namespaces in XML 1.1 .
[info] W3C. Recommendation: XML Information Set (Second Edition) .
[xquery] W3C. Recommendation: XQuery 1.0: An XML Query Language .
[xslt] W3C. XSL Transformations (XSLT) .
[dom4j] dom4j.
[jdom] JDOM.
[xom] XML Object Model.
[sax] Simple API for XML.
[stax] JCP. StAX Streaming API for XML.
[xquery2src] http://code.google.com/xquery2src/xquery2src (written in Scala).
[fxt-transf] Transforming XML documents using fxt. Computing and Information Technology, Special Issue on Domain-Specific Languages, vol. 10, no. 1, pp. 19--35, 2002.
[fxt-binary] Binary Queries for Document Trees. Nordic Journal of Computing Volume 11, Number 1, Spring 2004 [pdf] .
[mf-wall] XQuery: A Query Language for XML (or...Memoir of a W3C Standards Hacker). invited talk ECOOP'03 Darmstadt.
[scala-programming] Programming in Scala.