pxdom is a W3C DOM Level 3 Core/XML and Load/Save implementation with Python and OMG-style (_get/_set) bindings. All features in the February 2004 Proposed Recommendations are supported, with the following exceptions:
Additionally, Unicode encodings are only supported on Python 1.6 and later, serialising to an HTTP URI in Python 2.0 and later, and Unicode character normalisation in Python 2.3 onwards.
Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python23\Lib\site-packages.
pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the version of Python or other XML tools installed.
The only dependencies are the standard library string-handling and URL-related modules.
The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core. To parse a document from a file, use, for example:
import pxdom
dom= pxdom.getDOMImplementation('')
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
doc= parser.parseURI('file:///f|/data/doc.xml')
For more on using DOM Level 3 Load to create documents from various sources, see the DOM Level 3 Load/Save specification.
Alternatively, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:
doc= pxdom.parse('F:\\data\\doc.xml')
doc= pxdom.parseString('<el attr="val">content</el>')
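Because parse and parseString are described as working like minidom's functions, calling code can be written to run against either module. The following is a minimal sketch under that assumption; the alias dom_module and the fallback to xml.dom.minidom are illustrative choices, not part of pxdom.

```python
# Use pxdom when it can be imported; otherwise fall back to the standard
# library's minidom, which offers functions of the same names.
# (dom_module is a hypothetical alias used only in this sketch.)
try:
    import pxdom as dom_module
except (ImportError, SyntaxError):
    from xml.dom import minidom as dom_module

doc = dom_module.parseString('<el attr="val">content</el>')
root = doc.documentElement

print(root.tagName)               # el
print(root.getAttribute('attr'))  # val
print(root.firstChild.data)       # content
```

Only the standard DOM properties are used after parsing, so the rest of the program does not need to know which module produced the document.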
The result of the parse operation depends on the parameters set on the LSParser.domConfig mapping. By default, according to the DOM 3 spec, all bound entity references will be replaced by the contents of the entity referred to, and all CDATA sections will be replaced with plain text nodes.
If you use the parse/parseString functions, pxdom will set the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document. This is to emulate the behaviour of minidom.
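The minidom behaviour being emulated can be observed with the standard library alone. This small sketch uses xml.dom.minidom rather than pxdom and shows that a CDATA section survives parsing as its own node type instead of becoming a plain text node.

```python
from xml.dom import minidom, Node

# minidom keeps CDATA sections as CDATASection nodes.
doc = minidom.parseString('<el><![CDATA[a < b]]></el>')
child = doc.documentElement.firstChild

is_cdata = child.nodeType == Node.CDATA_SECTION_NODE
print(is_cdata)    # True
print(child.data)  # a < b
```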
If you prefer to receive entity reference nodes too, set the ‘entities’ parameter to a true value. For example:
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.domConfig.setParameter('cdata-sections', 1)
parser.domConfig.setParameter('entities', 1)
doc= parser.parseURI('file:///home/data/doc.xml')
Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:
doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})
(Of course, this usage would no longer be minidom-compatible.) See the DOM 3 Core and Load/Save specifications for more DOMConfiguration parameters.
pxdom supports a few features which aren’t available in the DOM standard. Their names are always prefixed with ‘pxdom’.
The pxdomContent property is a convenience for getting the markup for a node, or replacing the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.
All nodes have a readable pxdomContent, but only those at content level are writable (i.e. attribute nodes are not). The document’s domConfig is used to set parameters for parse and serialise operations invoked by pxdomContent.
pxdomContent is an extended replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.
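Code that has to work with both pxdom and minidom nodes could read markup through a small helper. This is a sketch only: markup_of is a hypothetical name, the fallback to minidom's toxml() is an assumption of this example, and the two serialisers are not guaranteed to produce identical output.

```python
from xml.dom import minidom

def markup_of(node):
    """Return serialised markup for a node.

    Prefers the pxdomContent property when the node provides it and
    falls back to minidom's toxml() otherwise.
    """
    content = getattr(node, 'pxdomContent', None)
    if content is not None:
        return content
    return node.toxml()

# Demonstrated here with a minidom node, which has no pxdomContent.
doc = minidom.parseString('<el attr="val">content</el>')
result = markup_of(doc.documentElement)
print(result)
```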
Read-only property giving a DOMLocator for any Node.
In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which pxdom doesn’t read.
Normally pxdom will (as per spec) guess that elements with unknown content models do not contain ‘element content’, so Text.isElementContentWhitespace will always return False for elements not defined in the internal subset. However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is set to a true value, it will guess that unknown elements do contain element content, and so whitespace nodes inside them will be ‘element content whitespace’ (aka ‘ignorable whitespace’).
This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.domConfig.setParameter('element-content-whitespace', 0)
parser.domConfig.setParameter('pxdom-assume-element-content', 1)
doc= parser.parseURI('file:///data/foo.xml')
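To illustrate what a tree without superfluous whitespace nodes looks like, the sketch below approximates the result using only the standard library's minidom. It is not pxdom's implementation: strip_whitespace_nodes is a hypothetical helper that removes every whitespace-only text node after parsing, regardless of any content model declared in a DTD.

```python
from xml.dom import minidom, Node

def strip_whitespace_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    # Iterate over a copy, since children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == '':
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)

doc = minidom.parseString('<a>\n  <b>x</b>\n  <c/>\n</a>')
strip_whitespace_nodes(doc.documentElement)

cleaned = doc.documentElement.toxml()
print(cleaned)  # <a><b>x</b><c/></a>
```

A post-processing pass like this also strips whitespace that is significant in mixed content, so it is best suited to data-oriented documents.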
pxdom is a non-validating, non-external-entity-including DOM implementation. However, it is likely that future versions could support external entities. When this is implemented, it will be turned on by default in new LSParser objects.
If you wish to be sure external entities will never be used in future versions of pxdom, set the LSParser.domConfig parameter ‘pxdom-resolve-resources’ to a false value. Alternatively, use the parse/parseString functions, which won’t resolve external entities (because minidom does not).
A boolean flag indicating whether the entity’s replacement content is available in the childNodes property. This is always true for internal entities, always false for unbound entities and currently false for external entities (though this may change in future versions).
In addition to the DocumentType NamedNodeMaps ‘entities’ and ‘notations’, pxdom includes maps for the other two types of declaration that might occur in the DTD internal subset. They can be read to get more information on the content models than the DOM 3 TypeInfo interface makes available.
pxdomElements is a NamedNodeMap of element content declaration nodes (as created by the <!ELEMENT> declaration). ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.
pxdomAttlists is a NamedNodeMap of elements’ declared attribute lists (as created by the <!ATTLIST> declaration). AttributeListDeclarations hold a NamedNodeMap in their declarations property of attribute names to AttributeDeclaration nodes.
AttributeDeclaration nodes have an integer attributeType property with
enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR,
NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR.
In the case of ENUMERATIONs and NOTATIONs, the typeValues property holds a list of possible string values.
There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE,
DEFAULT_VALUE and FIXED_VALUE. In the case of FIXED and DEFAULT, the childNodes
property holds any Text and/or EntityReference nodes that make up the default value.
Copyright © 2004, Andrew Clover. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.