pxdom is a W3C DOM Level 3 implementation for XML 1.0/1.1 with/without namespaces, using Python and OMG-style (_get/_set) bindings. All features described in the Core and LS Recommendations are supported, with the following exceptions:
pxdom runs on Python 1.5.2 or later, and has been tested up to 2.7. It will not currently run on Python 3. Certain features are dependent on Python version:
Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python26\Lib\site-packages. Pre-compile bytecode version with ‘import pxdom’ if necessary.
pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the versions of Python and/or PyXML installed on users’ machines; the only dependencies are the standard library string-handling and URL-related modules.
The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core, so to parse a document from a file you can use, for example:
dom= pxdom.getDOMImplementation('')
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
document= parser.parseURI('file:///f|/data/doc.xml')
And to serialise and save a document to a file, try:
serialiser= document.implementation.createLSSerializer()
serialiser.writeToURI(document, 'file:///f|/data/doc.xml')
These interfaces take URIs; you can convert a local filepath to a URI using the standard library urllib module:
import urllib
uri= 'file:'+urllib.pathname2url(path)
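The conversion above can be wrapped in a small helper. This is only a sketch: the function name path_to_file_uri is hypothetical, and the import fallback is there purely so the snippet can be tried on its own under either Python 2 or Python 3 (pxdom itself needs Python 2). The exact number of slashes after 'file:' depends on the platform and Python version.

```python
import os

try:
    from urllib import pathname2url          # Python 2
except ImportError:
    from urllib.request import pathname2url  # Python 3, for standalone testing only


def path_to_file_uri(path):
    """Convert a local filesystem path to a file: URI (hypothetical helper)."""
    # Make the path absolute first so the URI does not depend on the
    # current directory; pathname2url takes care of separators and
    # %-escaping of characters such as spaces.
    return 'file:' + pathname2url(os.path.abspath(path))


uri = path_to_file_uri(os.path.join('data', 'my doc.xml'))
```

The resulting string can then be passed to parseURI or writeToURI as in the examples above.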
Many features of parsing and serialisation can be set using the domConfig objects of LSParser and LSSerializer. You can also create LSInput and LSOutput objects for more control over the source and destination of these operations.
For example to serialise a document explicitly to the Latin-1 encoding:
output= document.implementation.createLSOutput()
output.systemId= 'file:///f|/data/doc.xml'
output.encoding= 'iso-8859-1'
serialiser= document.implementation.createLSSerializer()
serialiser.write(document, output)
For full details on using these standard features, see the DOM Level 3 LS specification.
As a slightly less verbose alternative to the W3C standard parser interface, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:
doc= pxdom.parse(r'F:\data\doc.xml')
doc= pxdom.parseString('<el attr="val">content</el>')
You can also get a quick character serialisation by accessing the pxdomContent property of any node.
The result of the parse operation depends on the parameters set on the LSParser.domConfig mapping. By default, in accordance with the DOM specification, all CDATA sections will be replaced with plain text nodes and all bound entity references will be replaced by the contents of the entity referred to. This includes external entity references and the external subset.
If you use the parse and parseString functions, pxdom will default the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document, and the parameter ‘pxdom-resolve-resources’ to False, so external entities and the external subset are left alone. This is to emulate the behaviour of the Python standard library’s minidom module.
If you prefer also to receive EntityReference nodes in your document, set the ‘entities’ parameter to a true value. For example:
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.domConfig.setParameter('cdata-sections', 1)
parser.domConfig.setParameter('entities', 1)
doc= parser.parseURI('file:///home/data/doc.xml')
Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:
doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})
(Of course, this usage would no longer be minidom-compatible.)
See the DOM 3 Core and LS specifications for more DOMConfiguration parameters.
pxdom supports some supplemental non-standard features. Their names are always prefixed with ‘pxdom’ to avoid confusion with the standard.
Configuration parameters in DOM Level 3 may affect parsing, serialisation and normalisation operations. pxdom adds a few new parameters not defined in the specification.
If you want to set a pxdom extra parameter to a non-default value but still be compatible with any other DOM Level 3 implementation, you can use the DOMConfiguration.canSetParameter method to ensure that the parameter is supported first.
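A minimal sketch of that check follows. The helper name set_if_supported is hypothetical; canSetParameter and setParameter are the standard DOM Level 3 DOMConfiguration methods, so the same code should behave sensibly on implementations that do not know the pxdom parameters.

```python
def set_if_supported(config, name, value):
    """Set a DOMConfiguration parameter only if the implementation accepts it.

    Returns True if the parameter was set, False if it was skipped.
    """
    if config.canSetParameter(name, value):
        config.setParameter(name, value)
        return True
    return False
```

With a parser created as in the earlier examples, this might be called as set_if_supported(parser.domConfig, 'pxdom-resolve-resources', 0).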
Applies to: parsing. Default: True (except with the parse/parseString functions).
Dictates whether resources external to the document file will be resolved and used. This affects external entities and the DTD external subset.
pxdom uses only the SYSTEM identifier in fetching an external resource, so parsing an XHTML document, for example, would make many requests to the W3C server to grab the document type information. This is quite slow. Note also that at the time of writing the DTD referenced by XHTML 1.1 documents has acknowledged bugs in it, which pxdom is unable to parse. (This has been corrected for the forthcoming XHTML Modularization Second Edition specification.)
To do something with PUBLIC identifiers, such as supply local copies of DTDs, you would have to provide a standard DOM LSResourceResolver object to the configuration parameter ‘resource-resolver’. Resource resolvers will never be called if ‘pxdom-resolve-resources’ is set to false.
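As an illustration, a resolver that maps PUBLIC identifiers to local copies might look like the sketch below. The class name LocalDTDResolver and its catalogue dictionary are hypothetical; the resolveResource signature and the use of createLSInput follow the DOM Level 3 LS interfaces, and it is assumed here that pxdom calls the resolver in that standard way.

```python
class LocalDTDResolver(object):
    """Sketch of an LSResourceResolver backed by a publicId -> URI mapping."""

    def __init__(self, implementation, catalogue):
        self.implementation = implementation  # used to create LSInput objects
        self.catalogue = catalogue            # {publicId: URI of local copy}

    def resolveResource(self, type, namespaceURI, publicId, systemId, baseURI):
        local_uri = self.catalogue.get(publicId)
        if local_uri is None:
            # No local copy known: returning None asks the parser to fall
            # back to fetching the SYSTEM identifier itself.
            return None
        source = self.implementation.createLSInput()
        source.publicId = publicId
        source.systemId = local_uri
        return source
```

An instance would be installed with parser.domConfig.setParameter('resource-resolver', resolver), where the catalogue passed to the constructor points at wherever the local DTD copies actually live.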
When the convenience functions parse and parseString are called, ‘pxdom-resolve-resources’ will be false by default, instead of true, for minidom compatibility. This is also the safest option for parsing simple standalone XML.
Applies to: normalisation. Default: True.
Dictates whether text node normalisation (as performed by the DOM Level 1 Core Node.normalize method) will take place when the DOM Level 3 Core Document.normalizeDocument method is called.
By default, matching the DOM specification, text node normalisation does occur, but pxdom allows this to be turned off if unwanted.
Applies to: normalisation. Default: True.
Dictates whether entity reference nodes have their content child nodes updated from the declaration stored in the doctype. This may result in descendants with different namespaces when the entity reference has been moved, if the entity contains prefixes whose namespaces are not declared in the entity.
By default, matching the DOM specification, entities are updated, but pxdom allows this to be turned off if unwanted.
Applies to: normalisation. Default: True.
Dictates whether attributes should have their user-specified-IDness (as set by the setAttributeId etc. methods) reset to false during document normalisation.
By default, matching the DOM specification, this does occur, but pxdom allows this to be turned off if unwanted.
Applies to: parsing, normalisation, serialisation. Default: True.
When enabled, pxdom attempts to preserve the base URI context whenever a node that changes base URI is replaced by its contents. This can happen when an element with an xml:base attribute is SKIPped by a DOM 3 LS filter, or when an entity reference with a different base URI to its parent is flattened.
By default, matching the DOM specification, base URIs are preserved. However, the extra xml:base attributes added to child elements may be unwanted if you are working with entities (especially external entities) but do not wish to use XML Base, so pxdom allows it to be turned off. If you do so, the DOMError warning ‘pi-base-uri-lost’ will also not be generated.
Applies to: parsing, normalisation, serialisation, isElementContentWhitespace. Default: False.
In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which might have been omitted or not read.
Normally, following the XML Information Set specification, pxdom will guess that elements with unknown content models do not contain ‘element content’, so Text.isElementContentWhitespace will always return False for elements not mentioned in the DOCTYPE internal subset.
However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is True, it will guess that unknown elements do contain element content, and so whitespace nodes inside them will be ‘element content whitespace’ (often referred to as ‘ignorable whitespace’).
This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.domConfig.setParameter('element-content-whitespace', 0)
parser.domConfig.setParameter('pxdom-assume-element-content', 1)
doc= parser.parseURI('file:///data/foo.xml')
Applies to: serialisation. Default: False.
Optionally ensures serialisation operations return markup that is as far as possible compatible with legacy HTML parsers. In particular, satisfies XHTML 1.0’s HTML compatibility guidelines C.2, C.3 and C.10.
Read-only property giving a DOM Level 3 DOMLocator object for any Node. If the Node was created by a parsing operation this will reveal the file and row/column number in which the node was found: particularly useful for error-reporting purposes.
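For error reporting, the locator can be turned into a conventional source:line:column string. The sketch below takes the DOMLocator object itself, so it does not depend on how the locator was obtained; the helper name format_location is hypothetical, while uri, lineNumber and columnNumber are the standard DOMLocator attributes (the numeric ones are -1 when no position is known).

```python
def format_location(locator):
    """Render a DOMLocator as 'uri:line:column' for use in error messages."""
    uri = locator.uri or '<unknown source>'
    if locator.lineNumber < 0:
        # Nodes not created by a parse operation have no position.
        return uri
    return '%s:%d:%d' % (uri, locator.lineNumber, locator.columnNumber)
```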
A convenience property to get the markup for a node, or replace the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.
All nodes have a readable pxdomContent, but only those at content level are writable (attribute nodes, for instance, are not). The document’s domConfig is used to give parameters for parse and serialise operations invoked by pxdomContent.
The value read from pxdomContent is a character string, not a byte string, so it is not suitable for writing directly to a file. Use an LSSerializer to serialise a document to a byte stream.
pxdomContent is an extended replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.
A flag indicating whether the entity’s replacement content is available in the childNodes property. Internal entities are always available; unparsed external entities never are; for parsed external entities it depends on whether external resources were resolved at parse-time.
On external entities, gives the actual URI the entity was read from, after applying the systemId to the baseURI and going through any LSResourceResolver redirection. For internal and unavailable entities this property is null.
In addition to entities and notations, pxdom includes NamedNodeMaps in the DocumentType for the other two types of declaration that might occur in the DTD. They can be read to get more information on content models than the DOM Level 3 TypeInfo interface makes available.
ElementDeclarations can be obtained from the DocumentType.pxdomElements map. The nodeName of each is the element name given in the corresponding DTD <!ELEMENT> declaration.
ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.
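A rough sketch of reading these declarations is shown below. It assumes that the enum keys are reachable as attributes of each ElementDeclaration node (the text above does not say where they live), and that the map supports the usual NamedNodeMap length and item members; the function name content_models is hypothetical.

```python
def content_models(doctype):
    """Map each declared element name to a label for its content type."""
    result = {}
    declarations = doctype.pxdomElements
    for i in range(declarations.length):
        decl = declarations.item(i)
        # Assumption: the constants are available on the declaration node.
        labels = {
            decl.EMPTY_CONTENT: 'empty',
            decl.ANY_CONTENT: 'any',
            decl.MIXED_CONTENT: 'mixed',
            decl.ELEMENT_CONTENT: 'element',
        }
        result[decl.nodeName] = labels.get(decl.contentType, 'unknown')
    return result
```

Called as content_models(document.doctype), this would give a dictionary keyed by element name, which can be handy when deciding how to treat whitespace inside each element.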
AttributeListDeclarations can be obtained from the DocumentType.pxdomAttlists map. The nodeName of each is the name of the element whose attributes it is defining, as given in the <!ATTLIST> declaration.
AttributeListDeclarations hold a NamedNodeMap in their declarations property, mapping attribute names from the declaration to corresponding AttributeDeclaration nodes.
AttributeDeclaration nodes have an integer attributeType property with enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR, NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR.
In the case of enumeration and notation attribute types, the typeValues property holds a list of possible string values. There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE, DEFAULT_VALUE and FIXED_VALUE.
In the case of fixed and defaulting attributes, the childNodes property holds any text and/or entity reference nodes that make up the default value.
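As an example of walking these maps, the sketch below collects every attribute declared with type ID. It assumes that ID_ATTR is reachable as an attribute of each AttributeDeclaration node and that the node's nodeName is the attribute name; neither detail is spelled out above. The function name id_attributes is hypothetical.

```python
def id_attributes(doctype):
    """Return (element name, attribute name) pairs for ID-typed attributes."""
    found = []
    attlists = doctype.pxdomAttlists
    for i in range(attlists.length):
        attlist = attlists.item(i)
        declarations = attlist.declarations
        for j in range(declarations.length):
            decl = declarations.item(j)
            # Assumption: the enum constant lives on the declaration node.
            if decl.attributeType == decl.ID_ATTR:
                found.append((attlist.nodeName, decl.nodeName))
    return found
```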
xmlns:something="") causing namespaceURIs to become empty strings instead of unbound/null at parse-time.
Fixed namespace fixup to stop extra redundant declarations being added.
Additional thanks to all responsible for the DOM Test Suite (which has caught many gotchas in previous pxdom versions, regardless of the bugs I keep filing against it), particularly Curt Arnold (for fixing many of them).
Copyright © 2008, Andrew Clover. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.