pxdom

pxdom 1.0
A Python DOM implementation

pxdom is a W3C DOM Level 3 Core/XML and Load/Save implementation with Python and OMG-style (_get/_set) bindings. All features in the February 2004 Proposed Recommendations are supported, with the following exceptions:

validation;
inclusion of external entities and ResourceResolvers;
asynchronous LSParsers.

Additionally, some features depend on the Python version: Unicode encodings are supported only on Python 1.6 and later, serialising to an HTTP URI only on Python 2.0 and later, and Unicode character normalisation only on Python 2.3 and later.

Installation

Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python23\Lib\site-packages.

pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the version of Python or other XML tools installed.
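
One way to arrange this is to look for a bundled copy first and fall back to a system-wide one. The following is only a sketch: load_pxdom and the package name myapp are hypothetical, and importlib assumes a much newer Python than the versions mentioned above.

```python
import importlib

def load_pxdom(candidates=('myapp.pxdom', 'pxdom')):
    """Return the first importable module among candidates, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not bundled or not installed under this name
    return None
```

Listing the bundled name first means the application keeps using the copy it was tested with, even if a different pxdom is installed on the system.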

The only dependencies are the standard library string-handling and URL-related modules.

Usage

The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core. To parse a document from a file, use for example:

    import pxdom
    dom= pxdom.getDOMImplementation('')
    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    doc= parser.parseURI('file:///f|/data/doc.xml')

For more on using DOM Level 3 Load to create documents from various sources, see the DOM Level 3 Load/Save specification.

Alternatively, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:

    doc= pxdom.parse('F:\\data\\doc.xml')
    doc= pxdom.parseString('<el attr="val">content</el>')
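
For comparison, this is the minidom call that parseString mirrors. The sketch uses only the standard library's xml.dom.minidom; that the same lines behave identically with pxdom substituted is the compatibility claim made above, not something the sketch checks.

```python
from xml.dom import minidom

# Parse the same markup as in the pxdom example, using the standard library.
doc = minidom.parseString('<el attr="val">content</el>')
root = doc.documentElement

attr_value = root.getAttribute('attr')  # value of the 'attr' attribute
text = root.firstChild.data             # character data of the only child node
```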

DOMConfiguration parameters

The result of the parse operation depends on the parameters set on the LSParser.domConfig mapping. By default, according to the DOM 3 spec, all bound entity references will be replaced by the contents of the entity referred to, and all CDATA sections will be replaced with plain text nodes.

If you use the parse/parseString functions, pxdom will set the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document. This is to emulate the behaviour of minidom.
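
The minidom behaviour being emulated can be seen with the standard library alone. In this sketch (which does not use pxdom), the CDATA section survives parsing as its own node type rather than becoming a plain text node:

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<el><![CDATA[a < b]]></el>')
child = doc.documentElement.firstChild

# minidom keeps the section as a CDATASection node instead of a Text node.
kept_as_cdata = (child.nodeType == Node.CDATA_SECTION_NODE)
```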

If you prefer to receive entity reference nodes too, set the ‘entities’ parameter to a true value. For example:

    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    parser.domConfig.setParameter('cdata-sections', 1)
    parser.domConfig.setParameter('entities', 1)
    doc= parser.parseURI('file:///home/data/doc.xml')

Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:

doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})

(Of course, this usage would no longer be minidom-compatible.) See the DOM 3 Core and Load/Save specifications for more DOMConfiguration parameters.
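
When parameters come from elsewhere (a settings file, say), it can be useful to check each one before setting it. The helper below is a sketch and not part of pxdom: apply_parameters is a hypothetical name, and it relies only on the canSetParameter and setParameter methods that DOM Level 3 Core defines for DOMConfiguration.

```python
def apply_parameters(config, parameters):
    """Set each supported parameter on a DOMConfiguration-like object.

    Returns the names of the parameters that were refused.
    """
    refused = []
    for name, value in parameters.items():
        if config.canSetParameter(name, value):
            config.setParameter(name, value)
        else:
            refused.append(name)  # left untouched rather than raising
    return refused
```

With an LSParser this would be called as apply_parameters(parser.domConfig, {'entities': 1}).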

Extensions

pxdom supports a few features which aren’t available in the DOM standard. Their names are always prefixed with ‘pxdom’.

Node.pxdomContent

A convenience property to get the markup for a node, or replace the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.

All nodes have a readable pxdomContent, but only nodes at content level are writable (so attribute nodes, for example, are not). The document’s domConfig is used to set parameters for parse and serialise operations invoked by pxdomContent.

pxdomContent is an extended replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.
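
As an illustration of reading and writing the property together, the sketch below wraps a node in a new element by serialising it, adding tags around the markup and assigning the result back. wrap_in_element is a hypothetical helper, and it assumes the node is at content level so that pxdomContent is writable.

```python
def wrap_in_element(node, tag_name):
    """Replace node with <tag_name>...</tag_name> containing its own markup."""
    markup = node.pxdomContent  # serialise the node
    # Assigning parses the new markup and replaces the node with the result.
    node.pxdomContent = '<%s>%s</%s>' % (tag_name, markup, tag_name)
```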

Node.pxdomLocation

Read-only property giving a DOMLocator for any Node.
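
A typical use is building error messages that point back into the source file. The sketch below uses the uri, lineNumber and columnNumber attributes that DOM Level 3 Core defines for DOMLocator; describe_location is a hypothetical name, and whether pxdom fills in every field for every node is not stated here.

```python
def describe_location(node):
    """Format the node's DOMLocator as 'uri:line:column'."""
    locator = node.pxdomLocation
    return '%s:%s:%s' % (locator.uri, locator.lineNumber, locator.columnNumber)
```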

pxdom-assume-element-content

In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which pxdom doesn’t read.

Normally pxdom will (as per spec) guess that elements with unknown content models do not contain ‘element content’ — so Text.isElementContentWhitespace will always return False for elements not defined in the internal subset. However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is set to a true value, it will guess that unknown elements do contain element content, and so whitespace nodes inside them will be ‘element content whitespace’ (aka ‘ignorable whitespace’).

This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:

    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    parser.domConfig.setParameter('element-content-whitespace', 0)
    parser.domConfig.setParameter('pxdom-assume-element-content', 1)
    doc= parser.parseURI('file:///data/foo.xml')

pxdom-resolve-resources

pxdom is a non-validating, non-external-entity-including DOM implementation. However, it is likely that future versions could support external entities. When this is implemented, it will be turned on by default in new LSParser objects.

If you wish to be sure external entities will never be used in future versions of pxdom, set the LSParser.domConfig parameter ‘pxdom-resolve-resources’ to a false value. Alternatively, use the parse/parseString functions, which won’t resolve external entities (because minidom does not).
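
Setting the parameter might look like the sketch below. create_local_only_parser is a hypothetical helper; the dom argument is expected to be the object returned by pxdom.getDOMImplementation.

```python
def create_local_only_parser(dom):
    """Create a synchronous LSParser that will not resolve external resources."""
    parser = dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    # Opt out now so that a later pxdom version does not change behaviour.
    parser.domConfig.setParameter('pxdom-resolve-resources', 0)
    return parser
```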

Entity.pxdomAvailable

A boolean flag indicating whether the entity’s replacement content is available in the childNodes property. This is always true for internal entities, always false for unbound entities and currently false for external entities (though this may change in future versions).
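
For example, the flag can be used to list the entities whose content cannot be inspected. This is a sketch: unavailable_entities is a hypothetical name, and it walks the standard DocumentType.entities map through the length and item() members of NamedNodeMap.

```python
def unavailable_entities(doctype):
    """Return the names of declared entities with no replacement content."""
    entities = doctype.entities
    names = []
    for index in range(entities.length):
        entity = entities.item(index)
        if not entity.pxdomAvailable:
            names.append(entity.nodeName)
    return names
```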

DocumentType.pxdomElements, pxdomAttlists

In addition to the DocumentType NamedNodeMaps ‘entities’ and ‘notations’, pxdom includes maps for the other two types of declaration that might occur in the DTD internal subset. They can be read to get more information on the content models than the DOM 3 TypeInfo interface makes available.

pxdomElements is a NamedNodeMap of element content declaration nodes (as created by the <!ELEMENT> declaration). ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.

pxdomAttlists is a NamedNodeMap of elements’ declared attribute lists (as created by the <!ATTLIST> declaration). AttributeListDeclarations hold a NamedNodeMap in their declarations property of attribute names to AttributeDeclaration nodes.

AttributeDeclaration nodes have an integer attributeType property with enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR, NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR. In the case of ENUMERATIONs and NOTATIONs, the typeValues property holds a list of possible string values. There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE, DEFAULT_VALUE and FIXED_VALUE. In the case of FIXED and DEFAULT, the childNodes property holds any Text and/or EntityReference nodes that make up the default value.
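
The sketch below shows one way to walk both maps and collect the declared types. The function names are hypothetical, and it assumes that each declaration node reports the declared element or attribute name as its nodeName, which is not spelled out above.

```python
def named_items(node_map):
    """Yield each node of a NamedNodeMap via the standard length/item() API."""
    for index in range(node_map.length):
        yield node_map.item(index)

def summarise_declarations(doctype):
    """Return ({element: contentType}, {element: {attribute: attributeType}})."""
    content_types = {}
    for declaration in named_items(doctype.pxdomElements):
        content_types[declaration.nodeName] = declaration.contentType

    attribute_types = {}
    for attlist in named_items(doctype.pxdomAttlists):
        attribute_types[attlist.nodeName] = {
            attr.nodeName: attr.attributeType
            for attr in named_items(attlist.declarations)
        }
    return content_types, attribute_types
```

The integer values returned can then be compared against the enum keys listed above, such as ELEMENT_CONTENT or ID_ATTR.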

Changelog

Updates from 0.9 to 1.0

Tracking changes in the new DOM 3 Proposed Recommendations: renamed LS config properties, added LSException, changed default newLine behaviour, removed prefix from previously-non-standard pxdom-no-input-specified parameter, allowed LS namespace parameters to be set False, changed output filter call order
Added support for DOMConfiguration parameters ‘format-pretty-print’ and ‘supported-media-types-only’
Following discussion on www-dom list, changed encoding-to-string to use the string’s native encoding, unless overridden by output.encoding
Added extra error checks for cases in the L3 DOM Test Suite
Fixed recursive readonlyness of entities, notations, entity references
Fixed setting textContent on non-Text-containing nodes
Fixed very silly canSetParameter bug causing occasional erroneous return-false
Added compareDocumentPosition to public interface, and fixed fault in comparison of non-child nodes
Renamed parameterNameList parameterNames and made it return a proper DOM-style List object instead of a Python one
Made namespace/prefix lookup results match the reference algorithm more closely
Reorganised parse/serialisation, allowing application-side LSInput and LSOutput objects to be used
Made isElementContentWhitespace cope with nodes inside entity references
Fixed possibly-incorrect namespaceURI of unprefixed default attributes
Fixed baseURI for entity references and doctype

Updates from 0.8 to 0.9

Lots of interface alterations and renamings to track changes in the new DOM 3 Candidate Recommendations.
Node.pxdomContent replaces ElementLS.markupContent (removed from CR). Other old DocumentLS, ElementLS interfaces removed.
Module code rearranged into separate aspects to cut down on some of the ‘monster-class’ readability problems.
Serialisation mostly rewritten to conform better to specification, particularly the escaping of characters that can't be reproduced in the current encoding.
Normalisation partially rewritten, support for Unicode character normalisation added.
Support for DOMConfiguration parameter ‘canonical-form’.
Parameter pxdom-resolve-resources added as placeholder for future external entity support.
Made PIs with no data part parse and serialise correctly.
Many changes to LSFilters, which were a bit broken.
Allow multiple attributes with the same namespaceURI and localName (but different prefix) to be parsed. (For support of non-namespace-well-formed docs that use attribute names with colons, and unbound namespaces in entities.)
Renamed DocumentType.elements and .attlists to pxdom-prefixed versions, as they are non-standard extensions.
Fixed parsing of <!ATTLIST>s with NMTOKENS, IDREF, IDREFS (whoops!).
Made attribute value normalization happen in more places it should and fixed entref/charref whitespace-char replacement issues.
Fixed normalizeDocument namespace-declarations=false option.
Support for ‘well-formed’ parameter, tightened up invalid character checks at DOM level too.
Made splitTexting a CDATASection correctly create a new CDATASection node, not text.

Updates from 0.7 to 0.8

Tracking forthcoming changes to spec: getDOMImplementations renamed getDOMImplementationList, isWhitespaceInElementContext method becomes isElementContentWhitespace property, isId method becomes property, DOMLocator.offset becomes byteOffset/utf16Offset (non-functional).
Don’t claim to support DOM Core 1.0 — following discussion on www-dom-ts, there is no such feature.
Allow getDOMImplementation[List] to be called with no argument, as a shortcut.
Allow empty string to be passed in to namespaceURI arguments, meaning same as None.
Added NODE_ADOPTED UserDataHandler event, compliance fixes to AdoptNode (ents, default attrs).
Added DOMConfig.parameterNameList.
Added minidom-style NamedNodeMap dictionary accessors for compatibility (hat tip: Paul Boddie).
Implemented element-content-whitespace option, added pxdom-assume-element-content to make it more useful.
Refuse to parse invalid < in attribute values (makes finding well-formedness errors easier).

Updates from 0.6 to 0.7

Tracking forthcoming changes to spec: DOMSerialiser.writeURI renamed to writeToURI.
Fixed typos in Document.isDefaultNamespace and Text.replaceWholeText that caused exceptions to be raised (oops).
Made renameNode and writes to Node.prefix update NodeListByTagName objects correctly.
Made ParseError return non-Unicode string for easier debugging.

Future updates

Add support for external entities
Preserve namespaces when their declarations are on an element which has been rejected by an LSParserFilter or LSSerializerFilter. Spec is unclear on whether this should happen but it seems sensible.
Possibly support DOM Level 3 Events and/or Level 2 Traversal/Range.

Licence (new-BSD-style)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
The name of the copyright holder may not be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.