http://swpat.ffii.org/
Action against software patents
http://www.gnome.org/
Gnome2 Logo
http://www.w3.org/Status
W3C logo
http://www.redhat.com
Red Hat Logo
http://xmlsoft.org/XSLT/
Made with Libxslt Logo
The XSLT C library for Gnome
Library internals
Main Menu
index.html
Home
http://xmlsoft.org/wiki
Wiki
intro.html
Introduction
docs.html
Documentation
bugs.html
Reporting bugs and getting help
help.html
How to help
downloads.html
Downloads
FAQ.html
FAQ
news.html
News
xsltproc2.html
The xsltproc tool
docbook.html
DocBook
API.html
The programming API
python.html
Python and bindings
internals.html
Library internals
extensions.html
Writing extensions
contribs.html
Contributions
EXSLT/index.html
libexslt
xslt.html
flat page
,
site.xsl
stylesheet
html/index.html
API Menu
ChangeLog.html
ChangeLog
Related links
tutorial/libxslttutorial.html
Tutorial
,
tutorial2/libxslt_pipes.html
Tutorial2
xsltproc.html
Man page for xsltproc
http://mail.gnome.org/archives/xslt/
Mail archive
http://xmlsoft.org/
XML libxml2
ftp://xmlsoft.org/
FTP
http://www.zlatkovic.com/projects/libxml/
Windows binaries
http://garypennington.net/libxml2/
Solaris binaries
http://www.zveno.com/open_source/libxml2xslt.html
MacOsX binaries
http://bugzilla.gnome.org/buglist.cgi?product=libxslt
Bug Tracker
http://www.zend.com/php5/articles/php5-xmlphp.php#Heading17
XSLT with PHP
http://www.mod-xslt2.com/
Apache module
http://sourceforge.net/projects/libxml2-pas/
Pascal bindings
http://xsldbg.sourceforge.net/
Xsldbg Debugger
API Indexes
APIchunk0.html
Alphabetic
APIconstructors.html
Constructors
APIfunctions.html
Functions/Types
APIfiles.html
Modules
APIsymbols.html
Symbols
Table  of contents
internals.html#Introducti
Introduction
internals.html#Basics
Basics
internals.html#Keep
Keep it simple stupid
internals.html#libxml
The libxml nodes
internals.html#XSLT
The XSLT processing steps
internals.html#XSLT1
The XSLT stylesheet compilation
internals.html#XSLT2
The XSLT template compilation
internals.html#processing
The processing itself
internals.html#XPath
XPath expressions compilation
internals.html#XPath1
XPath interpretation
internals.html#Descriptio
Description of XPath
Objects
internals.html#XPath3
XPath functions
internals.html#stack
The variables stack frame
internals.html#Extension
Extension support
internals.html#Futher
Further reading
internals.html#TODOs
TODOs
Introduction
This document describes the processing of
http://xmlsoft.org/XSLT/
libxslt
, the
http://www.w3.org/TR/xslt
XSLT
C library developed for the
http://www.gnome.org/
Gnome
project.
Note: this documentation is by definition incomplete and I am not good at
spelling, grammar, so patches and suggestions are
mailto:veillard@redhat.com
really welcome
.
Basics
XSLT is a transformation language. It takes an input document and a
stylesheet document and generates an output document:
the XSLT processing model
Libxslt is written in C. It relies on
http://www.xmlsoft.org/
libxml
, the XML C library for Gnome, for
the following operations:
parsing files
building the in-memory DOM structure associated with the documents
handled
the XPath implementation
serializing back the result document to XML and HTML. (Text is handled
directly.)
Keep it simple stupid
Libxslt is not very specialized. It is built under the assumption that all
nodes from the source and output document can fit in the virtual memory of
the system. There is a big trade-off there. It is fine for reasonably sized
documents but may not be suitable for large sets of data. The gain is that it
can be used in a relatively versatile way. The input or output may never be
serialized, but the size of documents it can handle are limited by the size
of the memory available.
More specialized memory handling approaches are possible, like building
the input tree from a serialization progressively as it is consumed,
factoring repetitive patterns, or even on-the-fly generation of the output as
the input is parsed but it is possible only for a limited subset of the
stylesheets. In general the implementation of libxslt follows the following
pattern:
KISS (keep it simple stupid)
when there is a clear bottleneck optimize on top of this simple
framework and refine only as much as is needed to reach the expected
result
The result is not that bad, clearly one can do a better job but more
specialized too. Most optimization like building the tree on-demand would
need serious changes to the libxml XPath framework. An easy step would be to
serialize the output directly (or call a set of SAX-like output handler to
keep this a flexible interface) and hence avoid the memory consumption of the
result.
The libxml nodes
DOM-like trees, as used and generated by libxml and libxslt, are
relatively complex. Most node types follow the given structure except a few
variations depending on the node type:
description of a libxml node
Nodes carry a
name
and the node
type
indicates the kind of node it represents, the most common ones are:
document nodes
element nodes
text nodes
For the XSLT processing, entity nodes should not be generated (i.e. they
should be replaced by their content). Most nodes also contains the following
"navigation" informations:
the containing
doc
ument
the
parent
node
the first
children
node
the
last
children node
the
prev
ious sibling
the following sibling (
next
)
Elements nodes carries the list of attributes in the properties, an
attribute itself holds the navigation pointers and the children list (the
attribute value is not represented as a simple string to allow usage of
entities references).
The
ns
points to the namespace declaration for the
namespace associated to the node,
nsDef
is the linked list
of namespace declaration present on element nodes.
Most nodes also carry an
_private
pointer which can be
used by the application to hold specific data on this node.
The XSLT processing steps
There are a few steps which are clearly decoupled at the interface
level:
parse the stylesheet and generate a DOM tree
take the stylesheet tree and build a compiled version of it (the
compilation phase)
take the input and generate a DOM tree
process the stylesheet against the input tree and generate an output
tree
serialize the output tree
A few things should be noted here:
the steps 1/ 3/ and 5/ are optional
the stylesheet obtained at 2/ can be reused by multiple processing 4/
(and this should also work in threaded programs)
the tree provided in 2/ should never be freed using xmlFreeDoc, but by
freeing the stylesheet.
the input tree 4/ is not modified except the _private field which may
be used for labelling keys if used by the stylesheet
The XSLT stylesheet compilation
This is the second step described. It takes a stylesheet tree, and
"compiles" it. This associates to each node a structure stored in the
_private field and containing information computed in the stylesheet:
a compiled XSLT stylesheet
One xsltStylesheet structure is generated per document parsed for the
stylesheet. XSLT documents allow includes and imports of other documents,
imports are stored in the
imports
list (hence keeping the
tree hierarchy of includes which is very important for a proper XSLT
processing model) and includes are stored in the
doclist
list. An imported stylesheet has a parent link to allow browsing of the
tree.
The DOM tree associated to the document is stored in
doc
.
It is preprocessed to remove ignorable empty nodes and all the nodes in the
XSLT namespace are subject to precomputing. This usually consist of
extracting all the context information from the context tree (attributes,
namespaces, XPath expressions), and storing them in an xsltStylePreComp
structure associated to the
_private
field of the node.
A couple of notable exceptions to this are XSLT template nodes (more on
this later) and attribute value templates. If they are actually templates,
the value cannot be computed at compilation time. (Some preprocessing could
be done like isolation and preparsing of the XPath subexpressions but it's
not done, yet.)
The xsltStylePreComp structure also allows storing of the precompiled form
of an XPath expression that can be associated to an XSLT element (more on
this later).
The XSLT template compilation
A proper handling of templates lookup is one of the keys of fast XSLT
processing. (Given a node in the source document this is the process of
finding which templates should be applied to this node.) Libxslt follows the
hint suggested in the
http://www.w3.org/TR/xslt#patterns
5.2
Patterns
section of the XSLT Recommendation, i.e. it doesn't evaluate it
as an XPath expression but tokenizes it and compiles it as a set of rules to
be evaluated on a candidate node. There usually is an indication of the node
name in the last step of this evaluation and this is used as a key check for
the match. As a result libxslt builds a relatively more complex set of
structures for the templates:
The templates related structure
Let's describe a bit more closely what is built. First the xsltStylesheet
structure holds a pointer to the template hash table. All the XSLT patterns
compiled in this stylesheet are indexed by the value of the the target
element (or attribute, pi ...) name, so when a element or an attribute "foo"
needs to be processed the lookup is done using the name as a key.
Each of the patterns is compiled into an xsltCompMatch structure. It holds
the set of rules based on the tokenization of the pattern stored in reverse
order (matching is easier this way). It also holds some information about the
previous matches used to speed up the process when one iterates over a set of
siblings. (This optimization may be defeated by trashing when running
threaded computation, it's unclear that this is a big deal in practice.)
Predicate expressions are not compiled at this stage, they may be at run-time
if needed, but in this case they are compiled as full XPath expressions (the
use of some fixed predicate can probably be optimized, they are not yet).
The xsltCompMatch are then stored in the hash table, the clash list is
itself sorted by priority of the template to implement "naturally" the XSLT
priority rules.
Associated to the compiled pattern is the xsltTemplate itself containing
the information required for the processing of the pattern including, of
course, a pointer to the list of elements used for building the pattern
result.
Last but not least a number of patterns do not fit in the hash table
because they are not associated to a name, this is the case for patterns
applying to the root, any element, any attributes, text nodes, pi nodes, keys
etc. Those are stored independently in the stylesheet structure as separate
linked lists of xsltCompMatch.
The processing itself
The processing is defined by the XSLT specification (the basis of the
algorithm is explained in
http://www.w3.org/TR/xslt#section-Introduction
the Introduction
section). Basically it works by taking the root of the input document and
applying the following algorithm:
Finding the template applying to it. This is a lookup in the template
hash table, walking the hash list until the node satisfies all the steps
of the pattern, then checking the appropriate(s) global templates to see
if there isn't a higher priority rule to apply
If there is no template, apply the default rule (recurse on the
children)
else walk the content list of the selected templates, for each of them:
if the node is in the XSLT namespace then the node has a _private
field pointing to the preprocessed values, jump to the specific
code
if the node is in an extension namespace, look up the associated
behavior
otherwise copy the node.
The closure is usually done through the XSLT
apply-templates
construct recursing by applying the
adequate template on the input node children or on the result of an
associated XPath selection lookup.
Note that large parts of the input tree may not be processed by a given
stylesheet and that on the opposite some may be processed multiple times.
(This often is the case when a Table of Contents is built).
The module
transform.c
is the one implementing most of this
logic.
xsltApplyStylesheet()
is the entry point, it
allocates an xsltTransformContext containing the following:
a pointer to the stylesheet being processed
a stack of templates
a stack of variables and parameters
an XPath context
the template mode
current document
current input node
current selected node list
the current insertion points in the output document
a couple of hash tables for extension elements and functions
Then a new document gets allocated (HTML or XML depending on the type of
output), the user parameters and global variables and parameters are
evaluated. Then
xsltProcessOneNode()
which implements the
1-2-3 algorithm is called on the root element of the input. Step 1/ is
implemented by calling
xsltGetTemplate()
, step 2/ is
implemented by
xsltDefaultProcessOneNode()
and step 3/ is
implemented by
xsltApplyOneTemplate()
.
XPath expression compilation
The XPath support is actually implemented in the libxml module (where it
is reused by the XPointer implementation). XPath is a relatively classic
expression language. The only uncommon feature is that it is working on XML
trees and hence has specific syntax and types to handle them.
XPath expressions are compiled using
xmlXPathCompile()
.
It will take an expression string in input and generate a structure
containing the parsed expression tree, for example the expression:
/doc/chapter[title='Introduction']
will be compiled as
Compiled Expression : 10 elements
SORT
COLLECT  'child' 'name' 'node' chapter
COLLECT  'child' 'name' 'node' doc
ROOT
PREDICATE
SORT
EQUAL =
COLLECT  'child' 'name' 'node' title
NODE
ELEM Object is a string : Introduction
COLLECT  'child' 'name' 'node' title
NODE
This can be tested using the
testXPath
command (in the
libxml codebase) using the
--tree
option.
Again, the KISS approach is used. No optimization is done. This could be
an interesting thing to add.
http://www-106.ibm.com/developerworks/library/x-xslt2/?dwzone=x?open&amp;l=132%2ct=gr%2c+p=saxon
Michael
Kay describes
a lot of possible and interesting optimizations done in
Saxon which would be possible at this level. I'm unsure they would provide
much gain since the expressions tends to be relatively simple in general and
stylesheets are still hand generated. Optimizations at the interpretation
sounds likely to be more efficient.
XPath interpretation
The interpreter is implemented by
xmlXPathCompiledEval()
which is the front-end to
xmlXPathCompOpEval()
the function
implementing the evaluation of the expression tree. This evaluation follows
the KISS approach again. It's recursive and calls
xmlXPathNodeCollectAndTest()
to collect nodes set when
evaluating a
COLLECT
node.
An evaluation is done within the framework of an XPath context stored in
an
xmlXPathContext
structure, in the framework of a
transformation the context is maintained within the XSLT context. Its content
follows the requirements from the XPath specification:
the current document
the current node
a hash table of defined variables (but not used by XSLT)
a hash table of defined functions
the proximity position (the place of the node in the current node
list)
the context size (the size of the current node list)
the array of namespace declarations in scope (there also is a namespace
hash table but it is not used in the XSLT transformation).
For the purpose of XSLT an
extra
pointer has been added
allowing to retrieve the XSLT transformation context. When an XPath
evaluation is about to be performed, an XPath parser context is allocated
containing and XPath object stack (this is actually an XPath evaluation
context, this is a remain of the time where there was no separate parsing and
evaluation phase in the XPath implementation). Here is an overview of the set
of contexts associated to an XPath evaluation within an XSLT
transformation:
The set of contexts associated
Clearly this is a bit too complex and confusing and should be refactored
at the next set of binary incompatible releases of libxml. For example the
xmlXPathCtxt has a lot of unused parts and should probably be merged with
xmlXPathParserCtxt.
Description of XPath Objects
An XPath expression manipulates XPath objects. XPath defines the default
types boolean, numbers, strings and node sets. XSLT adds the result tree
fragment type which is basically an unmodifiable node set.
Implementation-wise, libxml follows again a KISS approach, the
xmlXPathObject is a structure containing a type description and the various
possibilities. (Using an enum could have gained some bytes.) In the case of
node sets (or result tree fragments), it points to a separate xmlNodeSet
object which contains the list of pointers to the document nodes:
An Node set object pointing to
The
http://xmlsoft.org/html/libxml-xpath.html
XPath API
(and
its
http://xmlsoft.org/html/libxml-xpathinternals.html
'internal'
part
) includes a number of functions to create, copy, compare, convert or
free XPath objects.
XPath functions
All the XPath functions available to the interpreter are registered in the
function hash table linked from the XPath context. They all share the same
signature:
void xmlXPathFunc (xmlXPathParserContextPtr ctxt, int nargs);
The first argument is the XPath interpretation context, holding the
interpretation stack. The second argument defines the number of objects
passed on the stack for the function to consume (last argument is on top of
the stack).
Basically an XPath function does the following:
check
nargs
for proper handling of errors or functions
with variable numbers of parameters
pop the parameters from the stack using
obj =
valuePop(ctxt);
do the function specific computation
push the result parameter on the stack using
valuePush(ctxt,
res);
free up the input parameters with
xmlXPathFreeObject(obj);
return
Sometime the work can be done directly by modifying in-situ the top object
on the stack
ctxt->value
.
The XSLT variables stack frame
Not to be confused with XPath object stack, this stack holds the XSLT
variables and parameters as they are defined through the recursive calls of
call-template, apply-templates and default templates. This is used to define
the scope of variables being called.
This part seems to be the most urgent attention right now, first it is
done in a very inefficient way since the location of the variables and
parameters within the stylesheet tree is still done at run time (it really
should be done statically at compile time), and I am still unsure that my
understanding of the template variables and parameter scope is actually
right.
This part of the documentation is still to be written once this part of
the code will be stable.
TODO
Extension support
There is a separate document explaining
extensions.html
how the
extension support works
.
Further reading
Michael Kay wrote
http://www-106.ibm.com/developerworks/library/x-xslt2/?dwzone=x?open&amp;l=132%2ct=gr%2c+p=saxon
a
really interesting article on Saxon internals
and the work he did on
performance issues. I wishes I had read it before starting libxslt design (I
would probably have avoided a few mistakes and progressed faster). A lot of
the ideas in his papers should be implemented or at least tried in
libxslt.
The
http://xmlsoft.org/
libxml documentation
, especially
http://xmlsoft.org/xmlio.html
the I/O interfaces
and the
http://xmlsoft.org/xmlmem.html
memory management
.
TODOs
redesign the XSLT stack frame handling. Far too much work is done at
execution time. Similarly for the attribute value templates handling, at
least the embedded subexpressions ought to be precompiled.
Allow output to be saved to a SAX like output (this notion of SAX like API
for output should be added directly to libxml).
Implement and test some of the optimization explained by Michael Kay
especially:
static slot allocation on the stack frame
specific boolean interpretation of an XPath expression
some of the sorting optimization
Lazy evaluation of location path. (this may require more changes but
sounds really interesting. XT does this too.)
Optimization of an expression tree (This could be done as a completely
independent module.)
Error reporting, there is a lot of case where the XSLT specification
specify that a given construct is an error are not checked adequately by
libxslt. Basically one should do a complete pass on the XSLT spec again and
add all tests to the stylesheet compilation. Using the DTD provided in the
appendix and making direct checks using the libxml validation API sounds a
good idea too (though one should take care of not raising errors for
elements/attributes in different namespaces).
Double check all the places where the stylesheet compiled form might be
modified at run time (extra removal of blanks nodes, hint on the
xsltCompMatch).
bugs.html
Daniel Veillard
