Streaming path filter node factory for continuous queries and/or transformations
over very large or infinitely long XML input.
Background
The W3C XQuery and XPath languages often require the entire input
document to be buffered in memory for a query to be executed in its full
generality [Background Paper, More Papers].
In other words, XQuery and XPath are hard to stream over very large or
infinitely long XML inputs without violating some aspects of the W3C
specifications. However, subsets of these languages (or simplified cousins)
can easily support streaming.
In fact, most use cases dealing with very large XML input documents do not
require the full forward and backward navigational capabilities of
XQuery and XPath across independent element subtrees. Rather, those
use cases are record oriented, treating element subtrees (i.e. records)
independently, individually selecting/projecting/transforming record after
record, one record at a time. For example, consider an XML document with one
million records, each describing a published book, music album or web server
log entry. A query to find the titles of books that have more than three
authors looks at each record individually, hence can easily be streamed.
Another use case is splitting a document into several sub-documents based on
the content of each record.
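The record-at-a-time idea can be sketched with nothing but the JDK's StAX pull parser. This is not how this class is implemented; it is a minimal illustration, assuming a document of the shape /books/book with text-only title children and author children. The class name RecordScanSketch and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not part of this class or its library.
final class RecordScanSketch {

    /** Returns the titles of all book records having more than minAuthors authors. */
    static List<String> titlesWithMoreAuthorsThan(String xml, int minAuthors) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            int depth = 0;          // nesting depth of the current element (root = 1)
            boolean inBook = false; // true while inside a depth-2 "book" record
            String title = null;    // state of the current record only
            int authors = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    String name = reader.getLocalName();
                    if (depth == 2 && name.equals("book")) {
                        inBook = true;  // a new record starts: reset per-record state
                        title = null;
                        authors = 0;
                    } else if (inBook && depth == 3 && name.equals("author")) {
                        authors++;
                    } else if (inBook && depth == 3 && name.equals("title")) {
                        // assumes a text-only title; this call consumes the end tag,
                        // so the depth is adjusted here
                        title = reader.getElementText();
                        depth--;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (inBook && depth == 2) {
                        // the record is complete: decide, then forget it
                        if (title != null && authors > minAuthors) {
                            titles.add(title);
                        }
                        inBook = false;
                    }
                    depth--;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML input", e);
        }
        return titles;
    }
}
```

Only the state of the current record (one title and one counter) is held at any time, which is why the memory footprint does not grow with the number of records.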
More interestingly, consider a P2P XML content messaging router, network
transducer, transcoder, proxy or message queue that continuously
filters, transforms, routes and dispatches messages from infinitely long
streams, with the behaviour defined by deeply inspecting rules (i.e. queries)
based on content, network parameters or other metadata.
This class provides a convenient solution for such common use cases operating
on very large or infinitely long XML input. The solution uses a strongly
simplified location path language (which is modelled after XPath but not
XPath compliant), in combination with a nu.xom.NodeFactory and an optional
XQuery. The solution is not necessarily faster than
building the full document tree, but it consumes much less main memory.
Here is how it works
You specify a simple "location path" such as /books/book or
/weblogs/_2004/_05/entry. The path may contain wildcards and
indicates which elements should be retained. All elements not matching the
path will be thrown away during parsing. Each retained element is fully
built (including its ancestors and descendants) and then made available to
the application via a callback to an application-provided
StreamingTransform object.
The StreamingTransform can operate on the fully built element (subtree)
in arbitrary ways. For example, it can simply print the element to screen or
disk and then forget about it. Or it can add the element (subtree) to the
document currently being built by the nu.xom.Builder. In addition, a
transform can check conditions such as whether a book has more than three
authors. A transform can also replace the element with a different element
or a list of arbitrary generated nodes. For example, if a book has more than
three authors, just the book title with an authorCount attribute
can be added to the document, instead of the entire book element subtree.
Typically, simple StreamingTransforms are formulated in custom
Java code, whereas complex ones are formulated as an XQuery.
Streaming Location Path Syntax
locationPath := {'/'step}...
step := [prefix':']localName
prefix := '*' | '' | XMLNamespacePrefix
localName := '*' | XMLLocalName
A location path consists of zero or more location steps separated by "/".
A step consists of an optional XML namespace prefix followed by a local name.
The wildcard symbol '*' means: match anything.
An empty prefix ('') means: match if in no namespace (i.e. null namespace).
Example legal location steps are:
book (Match elements named "book" in no namespace)
:book (Match elements named "book" in no namespace)
bib:book (Match elements named "book" in "bib" namespace)
bib:* (Match elements with any name in "bib" namespace)
*:book (Match elements named "book" in any namespace, including no namespace)
*:* (Match elements with any name in any namespace, including no namespace)
:* (Match elements with any name in no namespace)
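The grammar and matching rules above can be sketched in plain Java. This is a minimal illustration, not the library's actual implementation. It assumes that a path matches only elements at exactly the depth given by its steps, and that namespace prefixes are resolved through a prefix-to-URI map. The class LocationPathSketch and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library.
final class LocationPathSketch {

    /** One parsed location step; a null field stands for the '*' wildcard. */
    static final class Step {
        final String namespaceURI; // null = any namespace, "" = no namespace
        final String localName;    // null = any local name

        Step(String namespaceURI, String localName) {
            this.namespaceURI = namespaceURI;
            this.localName = localName;
        }

        boolean matches(String uri, String name) {
            return (namespaceURI == null || namespaceURI.equals(uri))
                && (localName == null || localName.equals(name));
        }
    }

    /** Parses a path such as "/books/bib:book" into its steps. */
    static List<Step> parse(String path, Map<String, String> prefixes) {
        List<Step> steps = new ArrayList<>();
        if (path.isEmpty()) {
            return steps; // zero steps are legal per the grammar
        }
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("path must start with '/': " + path);
        }
        for (String raw : path.substring(1).split("/", -1)) {
            int colon = raw.indexOf(':');
            String prefix = colon < 0 ? "" : raw.substring(0, colon);
            String local = raw.substring(colon + 1);
            if (local.isEmpty()) {
                throw new IllegalArgumentException("missing local name in step '" + raw + "'");
            }
            String uri;
            if (prefix.equals("*")) {
                uri = null;                  // any namespace
            } else if (prefix.isEmpty()) {
                uri = "";                    // no namespace
            } else {
                uri = prefixes.get(prefix);  // declared prefix
                if (uri == null) {
                    throw new IllegalArgumentException("undeclared prefix: " + prefix);
                }
            }
            steps.add(new Step(uri, local.equals("*") ? null : local));
        }
        return steps;
    }

    /**
     * Tests a path against an element, given as {namespaceURI, localName}
     * pairs from the root element down to the element itself.
     */
    static boolean matches(List<Step> steps, String[][] elementPath) {
        if (steps.size() != elementPath.length) {
            return false; // child axis only: depths must agree
        }
        for (int i = 0; i < elementPath.length; i++) {
            if (!steps.get(i).matches(elementPath[i][0], elementPath[i][1])) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, "/books/bib:book" matches a "book" element in the namespace bound to "bib" directly below a no-namespace "books" root, whereas "/books/book" does not match it, because an omitted prefix means no namespace.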
The location path language is deliberately simple and supports the "child" axis only.
Axes such as descendant ("//"), ancestor, following and preceding, as well as
predicates and other XPath features, are not supported. Typically this does not matter,
because a full XQuery can still be run on each element (subtree) matching the
location path, as the following examples show.
Example Usage
The following is complete and efficient code for parsing and iterating through millions of
"person" records in a database-like XML document, printing all residents of "San Francisco",
while never allocating more memory than needed to hold one person element:
    StreamingTransform myTransform = new StreamingTransform() {
        public Nodes transform(Element person) {
            Nodes results = XQueryUtil.xquery(person, "name[../address/city = 'San Francisco']");
            if (results.size() > 0) {
                System.out.println("name = " + results.get(0).getValue());
            }
            return new Nodes(); // mark current element as subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    Builder builder = new Builder(new StreamingPathFilter("/persons/person", null)
        .createNodeFactory(null, myTransform));
    builder.build(new File("/tmp/persons.xml"));
To find the title of all books that have more than three authors
and have 'Monterey' and 'Aquarium' somewhere in the title:
    String path = "/books/book";
    Map prefixes = new HashMap();
    prefixes.put("bib", "http://www.example.org/bookshelve/records");
    prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");

    StreamingTransform myTransform = new StreamingTransform() {
        private Nodes NONE = new Nodes();

        // execute XQuery against each element matching location path
        public Nodes transform(Element subtree) {
            Nodes results = XQueryUtil.xquery(subtree,
                "title[matches(., 'Monterey') and matches(., 'Aquarium') and count(../author) > 3]");
            for (int i = 0; i < results.size(); i++) {
                // do something useful with query results; here we just print them
                System.out.println(XOMUtil.toPrettyXML(results.get(i)));
            }

            // returning an empty node list removes the current subtree from the
            // document being built.
            // returning new Nodes(subtree) retains the current subtree.
            // returning new Nodes(some other nodes) replaces the current subtree with
            // some other nodes.
            // if you want (SAX) parsing to terminate at this point, simply throw an exception
            return NONE; // current subtree becomes subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
    Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
    System.out.println(XOMUtil.toPrettyXML(doc));
Here is a similar snippet that takes a filtering Builder from a
thread-safe pool with an optimized parser configuration:
...
... same as above
...
    final StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);

    BuilderPool pool = new BuilderPool(100, new BuilderFactory() {
        protected Builder newBuilder(XMLReader parser, boolean validate) {
            return new Builder(parser, validate, filter.createNodeFactory(null, myTransform));
        }
    });

    Builder builder = pool.getBuilder(false);
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
Applicability
This class is well suited for a P2P XML content messaging router, network
transducer, transcoder, proxy or message queue that continuously
filters, transforms, routes and dispatches messages from infinitely long
streams.
However, this class is less suited for classic database-oriented use cases.
There, scalability is limited because the input stream is scanned sequentially, without
exploiting the indexing and random access properties typical of (relational) database
environments. For such database-oriented use cases, consider using the Saxon SQL
extension functions for XQuery, building your own mixed
relational/XQuery integration layer, or using a database technology
with native XQuery support.