form.py 1.6 [final]

Contents

  1. Introduction
    1. form vs cgi
    2. Example
  2. Using fdefs
    1. Datatypes
      1. form.STRING
      2. form.TEXT
      3. form.ENUM
      4. form.BOOL
      5. form.LIST
      6. form.MAP
      7. form.FILE
      8. form.INT
      9. form.FLOAT
    2. Embedding values in names
  3. Functions
    1. Initialisation
      1. version
      2. memorylimit
      3. filelimit
      4. listlimit
      5. europe
      6. charset
      7. xhtml
      8. catch
    2. Reading submissions
      1. readForm
      2. readUrlEncoded
      3. readUrlEncodedStream
      4. readFormData
      5. readFormDataStream
    3. Writing submitted values back
      1. writeForm
      2. writeFormStream
      3. writeUrlEncoded
      4. writeUrlEncodedStream
      5. writeFormData
      6. writeFormDataStream
    4. String coding
      1. encH
      2. encU
      3. encI
      4. encJ
      5. encHJ
      6. decU
      7. decI
    5. Utility functions
      1. checked
      2. selected
      3. randomSafeString
      4. makeSafe
      5. makeSafeIsh
      6. bool
  4. Exceptions
    1. cgiError
    2. fdefError
    3. httpError
  5. Notes
    1. Working with Unicode
    2. Deviations from RFC 2388
  6. About
    1. History
    2. Licence

1. Introduction

The form module is an extended replacement for the standard Python cgi module, providing robust form-handling features in order to make writing secure form-handling CGIs in Python less work.

The idea is to define the kind of data you want returned for each field of the form. This definition is done using a mapping of form field names to datatypes (fdefs), which is passed to the main function, readForm. This call reads CGI input and interprets it, returning a mapping of field names to values.

form also fully supports Unicode, [multiple] file-upload fields, image-submit fields and embedding values in names, protects against some denial-of-service problems common to CGI scripting, and provides miscellaneous utility functions useful to CGI programmers. It has been proven to cope with very large input.

1.1. form vs cgi

form and cgi have completely different interfaces and are not compatible. form works at a somewhat higher level than cgi. Its ease of use comes at the expense of disallowing direct access to the exact submitted data.

The main advantage is that the returned values from reading a form submission are guaranteed to conform to your specifications, regardless of how malformed the submission may have been. This reduces the error-checking necessary to produce error-free scripts. The abstraction of datatype from submission data also allows some elements in an HTML form to be changed without having to re-write the corresponding CGI.

cgi is part of the standard distribution and so is guaranteed to be available without having to add any modules. It easily suffices for writing simple forms. form is more complicated than cgi so it may be more likely to have bugs in it, although none are currently known. form is also not suitable for applications where you don't know the names of the submitted fields in advance, eg. generic form-to-mail scripts. (A feature will probably be added to allow this at some point but it's not a priority.)

1.2. Example

A user sign-up form might be read like this:

import form

form.initialise(1.4, charset='utf-8')

fdefs= {
  'email': (form.STRING, 128),
  'username': (form.STRING, 16),
  'password': (form.STRING, 16),
  'sex': (form.ENUM, ['m', 'f'], 'f'),
  'age': form.INT,
  'sendmespam': form.BOOL
}
fvals= form.readForm(fdefs)

# errorPage and allUsers are assumed to be defined elsewhere in your script
if fvals.username=='':
  errorPage('You forgot to enter a user name.')
if allUsers.has_key(fvals.username):
  errorPage('Sorry, someone has already had that user name')

# and so on

2. Using fdefs

Each item in an fdefs dictionary defines one form field. The key should be the same as the name property in the HTML form, which should not normally contain a period or colon (see 2.2). The value of the item dictates the datatype to be returned.

readForm returns a dictionary-like object with the names of the fields as keys. The type of the values depends on which type was requested for that field in the fdefs. You can read the returned object like a dictionary (fvals['address']) or like an object (fvals.address); it makes no difference.

In the case where a field is included more than once in a submission but a list-of-values submission (form.LIST) was not expected, the last field in the input takes precedence.

2.1. Datatypes

The following field types are available. Some of them take parameters, which you can specify by putting the type in a tuple, with the parameters following. If you are not passing parameters, you can use the type name on its own or in a singleton tuple; it doesn't matter which.

(form.STRING, length, exclude)

For input type="text" or "password". Return a string of at most length characters, with all characters in the string exclude removed. You can omit the exclude string to allow all characters. You can omit length or set it to 0 to allow a string of any length; it's mostly there so you can use the value in SQL without having to worry about it being too big to fit the relevant field.

Control characters are always removed from the returned string regardless of the setting of exclude, for safety. These are 0x00-0x1F and 0x7F in ASCII, and additionally if Unicode strings are being used, the C1 control codes 0x80-0x9F, the deprecated codes 0x206A-0x206F, the specials 0xFFFC-0xFFFF and the BOM 0xFEFF (since its function as ZWNBSP has been taken over by 0x2060).
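The filtering described above can be sketched in a few lines. This is not form's own code, just a minimal illustration written for current Python; the name clean_string is hypothetical, and the order of operations (strip control characters, strip excluded characters, then truncate) is an assumption.

```python
import re

# Control characters listed above: C0, DEL, C1, the deprecated
# 0x206A-0x206F codes, the BOM and the specials.
_CONTROLS = re.compile(r'[\x00-\x1f\x7f-\x9f\u206a-\u206f\ufeff\ufffc-\uffff]')

def clean_string(text, length=0, exclude=''):
    """Hypothetical stand-in for what (form.STRING, length, exclude) returns."""
    text = _CONTROLS.sub('', text)          # always remove control characters
    if exclude:
        text = ''.join(ch for ch in text if ch not in exclude)
    if length:                              # 0 means "any length"
        text = text[:length]
    return text
```

For instance, clean_string('ab\x00c;d', 3, ';') gives 'abc' under these assumptions.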

(form.TEXT, length)

For textarea. As form.STRING, but single newlines are converted to space, and double newlines are converted to a Python '\n'. Other control characters are still removed.
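As a rough sketch of the newline handling (again not form's own code): the function name fold_newlines is hypothetical, and treating a run of more than two newlines as a single paragraph break is an assumption, as is normalising CR/LF pairs first.

```python
import re

def fold_newlines(text):
    """Single newlines become spaces, blank-line breaks become one '\\n'."""
    # Browsers normally submit CR LF line endings; normalise them first.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Assumption: two or more consecutive newlines count as one break.
    paragraphs = re.split(r'\n{2,}', text)
    return '\n'.join(p.replace('\n', ' ') for p in paragraphs)
```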

(form.ENUM, [value, value, ...], default)

For select and input type="radio". Return one of the list of string values passed if it matches the input, else return the default value, which can be of any type. If the default is not supplied, '' is used as the default.

form.BOOL

For input type="checkbox" with no value property. The value returned evaluates True if the input value for this field was 'on', else False. (On Python 2.2 and later it is a native Python boolean; in earlier versions it is an object that behaves like one.)

form.LIST

For select multiple and multiple fields with the same name (especially checkboxes). Return a list of the non-empty input strings given for this field.

(form.MAP, (width, height))

For input type="image". Return a tuple (x, y) of position of the click, clipped to within (0:width, 0:height) if the (width, height) tuple is supplied. Returns (0, 0) if the input field was supplied but without x and y co-ords, or (-1, -1) if the field was not in the input at all.
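The clipping can be pictured with the following sketch. The name clip_click is hypothetical, and treating both ends of the 0:width range as inclusive is an assumption; the text above does not say whether width itself is a permitted value.

```python
def clip_click(x, y, size=None):
    """Clip a click position to the image bounds, if bounds were supplied."""
    if size is None:
        return (x, y)                       # no (width, height) given
    width, height = size
    # Assumption: 0 and width (or height) are both allowed values.
    return (min(max(x, 0), width), min(max(y, 0), height))
```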

(form.FILE, directory)

For input type="file". Fills the given directory with files uploaded through the field, and return a list of tuples (storedFile, suppliedFile, mimeType, length). The suppliedFile filename may be '' if no filename was specified. storedFile is the full pathname of the stored file. The list is empty if no files were uploaded, and is unlikely to be longer than one entry since few browsers support multiple-file upload.

(form.INT, default)

Parse the input as a decimal (possibly negative) integer. Returns the default value if no parsable number could be read. If default is omitted, zero is used as the default. Returns sys.maxint if the number is higher than Python can represent as an integer. Note! Future versions of form may return a long integer for form.INT. I might restrict this to Python 1.6 and later, where str doesn't add an 'L' to the end of the number, to avoid problems.

(form.FLOAT, default)

Parse the input as a simple floating point number, which may contain a decimal point, but not 'E' notation. Returns 0.0, or, if supplied, the default if the input is not a valid number or not supplied.

2.2. Embedding values in names

In HTML, there are some kinds of form fields where you can't use the value attribute to pass information to the CGI script. These are input type="image", where the value is always a pair of co-ordinates, and submit, where the value is used as the text for the button.

So if you wanted to detect which of a set of identically-labelled buttons was pressed, you'd have to give them all a different name, and include a check for each one in your script. This would be especially tedious for an order form with a hundred "Buy It!" buttons, for example.

For this reason, form allows you to make a group of controls where the value submitted for each is taken from the name of the control instead of the value, when such a control is included in a submission. The actual value submitted is ignored.

To use the feature, put both the name and the desired value together in the HTML name of the field, separated by a colon. (Which is a valid character for name, albeit a seldom-used one).

<input type="submit" name="b:left" value="Click me!">
<input type="submit" name="b:middle" value="Click me!">
<input type="submit" name="b:right" value="Click me!">

In this example, a call to form.readForm({'b': form.STRING}) would return 'left', 'middle' or 'right' as the value of b, depending on which button was used to submit the form. This is not limited to STRING: values of all types except FILE may be embedded in names.

(You can still use names with colons if you do not wish to use the value-embedding feature. form only tries to separate a name with a colon in it if it can't find the whole name as a key in your fdefs. The same goes for periods, which are special characters used by HTML in image maps.)
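The lookup rule in the last paragraph (whole name first, then split at the colon) might look something like this. It is only a sketch of the rule as described, not form's implementation; split_name is a hypothetical helper and the handling of periods for image maps is left out.

```python
def split_name(name, fdefs):
    """Return (key, embedded_value) for a submitted control name.

    embedded_value is None when the whole name is itself a key in fdefs.
    Returns (None, None) when the name matches nothing in fdefs.
    """
    if name in fdefs:
        return (name, None)                 # whole name wins
    key, colon, embedded = name.partition(':')
    if colon and key in fdefs:
        return (key, embedded)              # value comes from the name
    return (None, None)
```

With fdefs of {'b': 'STRING'}, the name 'b:left' gives ('b', 'left'), whereas with a key 'b:left' present in fdefs the name is taken whole.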

3. Functions

3.1. Initialisation

initialise(version, memorylimit, filelimit, listlimit, europe, charset, xhtml, catch)

Calling this function is not compulsory, but it allows you to set some of form's internal variables easily.

form.py includes features to protect against certain kinds of denial-of-service attacks in POST requests. They are turned off by default, but passing non-zero values in the "limit" parameters enables them.

The arguments you can set are:

version
The lowest version of form your script is happy running with. If you request a newer version than the module, an exception will be raised.
memorylimit
Guards against a request containing enormous parts that fill available memory and cause the web server thread to swap like crazy. Nothing larger than this value in bytes will be stored in memory. Longer values are not truncated but simply skipped, and if some headers grow larger than this, no values will be parsable at all.
filelimit
Guards against a request including an enormous file upload, filling available disc space. Files will be truncated at this number of bytes.
listlimit
Guards against a request including the same input field over and over again, filling memory, or the same file upload field repeatedly, filling disc space. Fields of type LIST or FILE can then contain no more than this number of entries.
europe
If true, form.INT and form.FLOAT will read numbers using European-style punctuation (where "." is a thousands-separator and "," is the decimal point). If false (the default), it's the other way around.
charset
Specifies the character encoding the browser will be using to submit the form, or None (the default) to return plain 8-bit strings without attempting to decode them. See Working with Unicode for more details.
xhtml
If true, functions that return HTML markup will use XHTML syntax. This includes writeForm, checked and selected.
catch
If true, any unhandled exceptions that happen in the script will be caught and displayed in a readable HTML format. (Note, syntax errors in the script itself cannot be caught as they will happen before form is loaded.) This feature is especially useful for testing scripts on servers that do not show errors at all, for example Apache. However, due to problems with sys.excepthook, it won't work in scripts handled by PyApache.
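To make the europe setting above concrete, here is a small sketch of how a FLOAT-style parser could honour it. It is not taken from form; parse_float is a hypothetical name, and simply discarding every thousands-separator before parsing is an assumption about how lenient the real parser is.

```python
import re

# A plain decimal number: optional sign, digits, optional decimal point.
# 'E' notation is deliberately not accepted.
_SIMPLE_FLOAT = re.compile(r'[+-]?(\d+\.?\d*|\.\d+)$')

def parse_float(text, europe=False, default=0.0):
    """Parse a simple decimal number using either punctuation convention."""
    separator, point = ('.', ',') if europe else (',', '.')
    cleaned = text.strip().replace(separator, '')   # drop thousands-separators
    cleaned = cleaned.replace(point, '.')           # normalise decimal point
    if not _SIMPLE_FLOAT.match(cleaned):
        return default
    return float(cleaned)
```

So '1.234,5' reads as 1234.5 when europe is true, and '1,234.5' reads as the same number when it is false.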

3.2. Reading form-based data

All read functions take submitted form data and parse it, returning a dictionary-like object containing the values that have been posted to the form, standardised according to the fdefs argument passed to the function. The returned object may be read like a dictionary or like an object.

Typically, a script calls readForm at the start of its code. Scripts do not normally need to call the other read functions directly.

readForm(fdefs)
This function is normally used to read form data. It works out which of the other read functions is appropriate, and calls that.
readUrlEncoded(fdefs, query)
Reads a query string (without leading "?") passed directly to the function. form understands ';' separators as well as '&'.
readUrlEncodedStream(fdefs, stream, length)
Same as readUrlEncoded, but takes its input from a stream object (must support read()) instead of a string.
readFormData(fdefs, data, parameters)

Decodes fields encoded in a multipart/form-data formatted string. parameters is a dictionary of MIME headers, lower-cased keys, containing at least a 'boundary' key.

Currently this function is no more efficient than readFormDataStream, since it is not commonly needed.

readFormDataStream(fdefs, stream, length, parameters)
As readFormData, but input is taken from a stream object instead of a string. The length is the number of bytes that should be read from the stream.
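For readers unfamiliar with the URL-encoded format, the following sketch shows the two behaviours mentioned in this document: accepting ';' as well as '&' as a separator, and letting the last occurrence of a repeated field win when a list was not requested. It uses the standard library's unquote_plus from current Python rather than anything in form, and parse_query is a hypothetical name.

```python
import re
from urllib.parse import unquote_plus

def parse_query(query):
    """Split a query string (no leading '?') into a name -> value dict."""
    values = {}
    for pair in re.split('[&;]', query):    # both separators are accepted
        if not pair:
            continue
        name, _, value = pair.partition('=')
        # Later occurrences overwrite earlier ones: last field wins.
        values[unquote_plus(name)] = unquote_plus(value)
    return values
```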

3.3. Writing submitted values back

The write functions take form values from a dictionary (or dictionary-like object returned by the read functions), and convert them into encoded text sent to a string or a stream.

File upload fields only work for writeFormData and writeFormDataStream since it does not make much sense to try to upload a file to a query string or hidden form. File upload values need not have a valid length value in the tuple as the length is read directly from the file specified.

Currently, the string-returning functions are no more efficient than the stream-writing versions.

writeForm(fvals)
Returns a string containing HTML input type="hidden" controls for each field in the fvals dictionary. This is useful for writing a follow-up-form that retains all the information posted into a previous form.
writeFormStream(fvals, stream)
As writeForm, but send output to a stream object (or anything supporting write) instead of returning a string.
writeUrlEncoded(fvals)
Return a &-separated list of URL-encoded key=value pairs representing the values. The query-string separator '?' is not included in the returned string. If you're including the query string in, for example, an <a href="...">, remember to HTML-encode the whole URL, or those & characters could confuse a browser.
writeUrlEncodedStream(fvals, stream)
As writeUrlEncoded, except that the output is sent to the nominated stream.
writeFormData(fvals)
Return a MIME multipart/form-data message from the given values. form will work out a suitable boundary value for you.
writeFormDataStream(fvals, stream)
As writeFormData, except that the output is sent to the nominated stream.

3.4. String coding

These convenience functions are available for coding text for representation in HTML, URLs and JavaScript strings. If you have user input anywhere in your scripts, you'll need to do this a lot, or you're likely to make a site susceptible to security problems. (See this CERT advisory for an example of this.)

encH(text)

Encode text as HTML and return as string. ", &, <, > and control characters are replaced with HTML entities. This assumes you use the double-quote rather than single-quote for attribute strings, which is advisable. Obviously quotes do not need to be escaped outside of attribute values, but it does no harm.

encU(text)

Encode text as a URL part (replacing spaces with '+' and many symbols with %-encoded entities), and return as a string.

Note: you should not pass entire URLs through encU, only separate parts, for example a directory name in a path, or a key or value string in a query. Once encoded you can combine these parts using '/', '?' and so on. When writing HTML, remember to encode the complete URL if it has characters like '&' in.
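The order of operations in that note (encode each part, join the parts, then HTML-encode the whole URL) is sketched below. The standard library's quote_plus and html.escape from current Python are used here purely as stand-ins for encU and encH; they are not the same functions and may differ in exactly which characters they escape. build_link and the /search/ path are hypothetical.

```python
from html import escape
from urllib.parse import quote_plus

def build_link(directory, query_value, page):
    """Build an <a> start tag from separately encoded URL parts."""
    # 1. URL-encode each part on its own (stand-in for encU).
    url = ('/search/' + quote_plus(directory) +
           '?q=' + quote_plus(query_value) +
           '&page=' + str(page))
    # 2. HTML-encode the complete URL (stand-in for encH), so the '&'
    #    between parameters cannot be mistaken for an entity.
    return '<a href="' + escape(url, quote=True) + '">'
```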

encI(text)

Encode text so it can be included in HTML id or name attributes. This is especially useful when you need to include arbitrary strings in name-embedded values.

This encoding is not a web standard, it's specific to form. Technically it simply replaces all disallowed characters with ':xx' where xx is the hex encoding of the character.
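A sketch of the ':xx' scheme follows. The set of characters left unescaped (ASCII letters, digits, underscore and hyphen) is an assumption, since the text above does not list the disallowed characters, and the sketch only copes with characters below 0x100. enc_i and dec_i are hypothetical names, not form's own functions.

```python
import re

_ALLOWED = re.compile(r'[A-Za-z0-9_-]')     # assumed safe set

def enc_i(text):
    """Replace every character outside the safe set with ':xx' (hex)."""
    out = []
    for ch in text:
        if _ALLOWED.match(ch):
            out.append(ch)
        else:
            out.append(':%02x' % ord(ch))   # two hex digits for ord < 0x100
    return ''.join(out)

def dec_i(text):
    """Reverse enc_i by turning each ':xx' back into its character."""
    return re.sub(r':([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
```

Note that the colon itself has to be escaped (as ':3a') for the round trip to work.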

encJ(text)

Encode text suitable for inclusion in a JavaScript string. This escapes single and double quotation marks, and the ETAGO (</) marker, making the result safe to include in a string in a script block.
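The escaping described could be approximated as below. This is a sketch rather than form's code: enc_j is a hypothetical name, and escaping the backslash itself first is an assumption added so that the other escapes cannot be undone by input that already contains backslashes.

```python
def enc_j(text):
    """Escape text for use inside a quoted JavaScript string literal."""
    text = text.replace('\\', '\\\\')       # assumption: backslashes too
    text = text.replace('"', '\\"').replace("'", "\\'")
    # Break up '</' so an embedded '</script>' cannot end the script block.
    return text.replace('</', '<\\/')
```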

encHJ(text)

Shorthand for encH(encJ(text)), useful for writing Javascript inside of an HTML attribute, especially event handlers.

decU(urlPart)

Decodes text encoded as part of a URL, converting the %-encoded entities back into plain text.

decI(urlPart)

Decodes text escaped with encI.

3.5. CGI utility functions

These functions are of general use to CGI scripts and are provided together as a convenience, as well as being used internally by form.

checked(condition)

Simply returns the string ' checked' if the condition is true or '' if false. This often saves writing an if statement whilst outputting a form. If you called initialise setting xhtml to be true, you will get the non-minimised XML form ' checked="checked"' instead.

Example

print '<input type="checkbox" name="spam"'+form.checked(f.sendmespam)+'>'

selected(condition)

Like checked, but on true outputs ' selected', for select fields.

Example

print '<option value="m"'+form.selected(f.sex=='m')+'>'

randomSafeString(length)
Returns a pseudo-randomly-generated string of a given length, built only from letters, numbers and underscores.
makeSafe(text)
Filters everything but ASCII letters, numbers and underscores from a string, and adds an underscore if the string is empty. The resulting string should be safe to use as a filename.
makeSafeIsh(text)
As makeSafe, but also allows single periods and slashes, though not combinations of them together, nor strings starting or ending with them. Also allows characters in the range C0-FF.
bool(condition)

Evaluates its parameter for truth and returns a value that will be understood by the write functions as representing a boolean True or False.

On Python 2.2 and later this function behaves identically to the built-in function bool and returns the native boolean values True and False. On earlier versions of Python, which do not have a native boolean type, it returns an object that behaves like a boolean.

If you want your code to be compatible across Python versions, you should use form.bool instead of bool, and form.bool(1) and form.bool(0) instead of True and False when constructing new form value dictionaries to be passed to the write functions.
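Going back to makeSafe and randomSafeString described earlier in this section, their documented behaviour is simple enough to sketch. These are hypothetical stand-ins written for current Python, not form's code; in particular, the use of the random module reflects the "pseudo-randomly-generated" wording and means the result should not be treated as a secret.

```python
import random
import re
import string

_SAFE_CHARS = string.ascii_letters + string.digits + '_'

def make_safe(text):
    """Keep only ASCII letters, digits and underscores; never return ''."""
    return re.sub(r'[^A-Za-z0-9_]', '', text) or '_'

def random_safe_string(length):
    """Pseudo-random string built only from letters, digits and '_'."""
    return ''.join(random.choice(_SAFE_CHARS) for _ in range(length))
```

A name such as '../etc/passwd' comes out as 'etcpasswd', which is why the result is reasonable to use as a filename.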

4. Exceptions

The input-reading functions may throw the following exceptions:

cgiError

Some aspect of the CGI environment is broken, for example environment variables not being correctly set by the script's caller.

cgiErrors are the fault of the web server, and should not happen in working web sites.

fdefError

An fdefs dictionary was passed to readForm which included unknown fdef values or unexpected parameters. Alternatively you passed a set of fields to writeForm or writeUrlEncoded (or the stream versions) which included a file-upload field. Note, readForm may also raise a TypeError, if some of the parameters in the fdefs were of the wrong type.

fdefErrors are your script's fault, and should not happen in working web sites.

httpError

The HTTP request or the MIME message in a HTTP POST request is malformed in some way.

httpErrors are the user-agent's fault, so could happen in a working web site, but only if either:

  1. the user's web browser is badly bugged, or
  2. someone is deliberately sending your script odd input to confuse it.

Finally, initialise may throw a NotImplementedError if it is called with a version number higher than the version of form being used, or Unicode features are requested which cannot be provided.

5. Notes

5.1. Working with Unicode

Since version 1.4, form can return Unicode strings from its form-reading functions. This must be specially requested by passing a charset argument with a non-None value to the initialise function. form will then interpret the submitted data as being encoded in that character set, and will return Unicode strings with that data represented in the standard encoding used by Python, UTF-16.

If charset is specified as None, not specified at all, or initialise is not called, returned strings will be plain 8-bit, undecoded strings as submitted by the browser. This is the easiest option if you are using an 8-bit character set such as 'iso-8859-1' or 'windows-1252', or if you don't care about non-ASCII characters being consistently interpreted in different locales.

The setting of charset will also be used to encode any unicode strings passed to the enc functions.

The charset a browser uses to submit form data is typically the same as the charset of the document containing the form. (You should be able to override this behaviour using the accept-charset attribute on <form>, but this does not work in Internet Explorer.)

The charset used must be one that Python understands. This means your web pages can be in, for example, 'iso-8859-1' or 'utf-8' (the obvious choice if you're working with international forms), but not 'windows-1252' or 'shift_jis'. If you specify an encoding Python doesn't know about, or you try to set a charset on a version of Python that does not support Unicode (ie. before 1.6), you will get an exception on initialisation.

Do remember to tell the browser what character encoding your web pages with forms in are using, or it will submit the form in whatever default encoding the user has chosen, and you have no chance of finding out what that is. The best way to specify the charset of a page is to have your web server send it with a Content-Type HTTP header with a charset:

Content-Type: text/html;charset=utf-8

- rather than just plain 'text/html'. If you're not in a position to control what headers your web server sends, you can set the charset by using an HTML meta hack, such as the following, in your <head>.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

(There's also an encoding attribute in the <?xml?> preamble if you are using an XML document, but browser support is not brilliant, yet.)

Once you have Unicode strings in your script you can do all the normal string operations with them. But if you need to write them back out to a non-Unicode-aware storage like a text file or a database, you will probably need to encode them again, usually as UTF-8.

Beware: if you had a field described as (form.STRING, 8), the eight characters in the Unicode string may expand to more than eight bytes when encoded as UTF-8. In fact it can theoretically be up to four times as many bytes as there were characters. You can either allocate four times as much room in your database to ensure it will always fit (which can be quite wasteful, especially if you are using non-VARYING fields), or truncate.

If you are going to use truncation, make sure you do not break the string in the middle of a UTF-8 code sequence, or when you retrieve it and try to turn it back into a Unicode string Python will give you a UnicodeError exception. Also, add some checks to make sure truncation does not happen to any field being used as a primary key, or your scripts could get very confused trying to SELECT a row whose primary key is unexpectedly shorter than when it was inserted!
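One way to truncate safely is sketched here, written for current Python. The helper name truncate_utf8 is hypothetical. It relies on the codec's 'ignore' error handler to discard the incomplete sequence left at the end of a byte slice; since the input was valid UTF-8 to begin with, that trailing fragment is the only thing that can be dropped.

```python
def truncate_utf8(text, max_bytes):
    """Encode text as UTF-8, cut to at most max_bytes on a character boundary."""
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return encoded
    # Slicing may cut a multi-byte sequence in half; decoding with
    # 'ignore' drops that partial sequence, then we re-encode.
    return encoded[:max_bytes].decode('utf-8', 'ignore').encode('utf-8')
```

Comparing the length of the result with len(text.encode('utf-8')) tells you whether truncation happened, which is the check you would want before using the value as a primary key.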

5.2. Deviations from RFC 2388

According to the standard for POST submission of multipart/form-data, user agents should specify the character set of form data in a Content-Type header with charset parameter on each successful control submitted, when there are non-ASCII characters present. No browser does this, so form does not even bother to look for the header. In practice they simply submit the form in whatever charset the original document was encoded in.

Also according to the standard, the 'name' parameter in the Content-Disposition header should be encoded according to RFC2047 if it contains out-of-bounds characters. No browser does this; instead, they simply surround the name with quotes. They don't even care if the name contains more quotes or control characters, so avoid putting these in your field control names!

The writeFormData functions duplicate both these universal-but-wrong behaviours.

6. About

form was written by Andrew Clover and is available under the GNU General Public Licence. There is no warranty. (GPL is chosen as a good default licence; if, for some reason, this doesn't match your requirements, get in touch.)

Bugs, queries, comments to: and@doxdesk.com.

6.1. History

0.1 [dev] (6 January 2000)
First apparently working version.
0.2 [dev] (27 January 2000)
form.NUMBER becomes form.FLOAT; form.INT added; form.BOOLEAN changed to a class of its own, to distinguish it from form.INT. To support European number formatting conventions, added built-in functions to replace int() and float(), controlled by form.sepChars and form.decChars.
0.3 [dogfood] (2 February 2000)
Fixed bug in assigning default values (mutables confusion). form.MAP non-submission value changed to (-1, -1)
0.4 [dogfood] (6 March 2000)
form.INT now clips when the number goes above maxint instead of throwing an exception. Not sure whether this is good behaviour but it follows the idea of not throwing exceptions due to bad user input. form.BOOL class replaces old BOOLEAN kludge. form.writeUrlEncoded[Stream] no longer prepends a '?'.
0.5 [dogfood] (31 March 2000)
Removed embarrassingly bad stream parsing bugs. Fixed ENUM so that '' can be a non-default value
0.6 [dogfood] (11 June 2000)
Added name-encoding system. Cleaned up image map detection. Added initialise call to avoid having to access limits and other module variables manually, and to allow me to make more interface changes like those in version 0.4 without breaking backwards compatibility.
0.7 [beta] (15 June 2000)
Fixed bug in multipart parsing affecting multiple file uploads. All major features have now been tested, so I'm taking form.py to beta.
0.8 [beta] (13 September 2000)
Added optional default to INT and FLOAT types. Added EitherMapping object replacing plain dictionaries to allow slightly cleaner-looking access to form values. Hopefully this will not cause any incompatibilities.
0.9 [beta] (9 November 2000)
Safeish strings may now not begin with '/' or '.'. Code comments wrapped to 80 columns. Documentation finally brought up-to-date.
1.0 [final] (7 December 2000)
Added trivial encJ and encHJ functions. Cleaned up EitherMapping so it's safe to use in other scripts.
1.1 [beta] (21 December 2000)
Added encI and decI functions, made decI happen automatically on name:value separation. Removed pointless encHU call from documentation. Disallowed the remaining top-bit-set characters from makeSafe strings, in case somehow they lead to the Unicode-parsing security breaches that turned up in IIS. They're still allowed in makeSafeIsh though.
1.2 [final] (29 January 2001)
EitherMapping now allows entries to be removed using del.
1.3 [final] (11 April 2002)
encH, encU and encI made more strict about what things they escape, to expand their usefulness a bit. '+'-encoding in encU fixed (can't believe I let that slip through after deliberately remembering to get it right). encH no longer attempts to HTML-encode top-bit-set characters, so they are left in whatever character set the document is declared as rather than becoming references to the Unicode characters they might not be. This is as a prelude to proper Unicode support coming in the next release.
1.4 [beta] (7 May 2002)
Added Unicode functionality - this was surprisingly easy, which means either Python's Unicode interface is quite well thought out, or I've missed loads of bugs - time will tell. Added XHTML output. Added exception-catching feature. Fixed minor bug in readUrlEncodedStream. Name-encoded values are no longer unescaped with decI (as control names, unlike IDs, are really CDATA so don't need escaping). Since this could cause breakage it will only happen if you call initialise() with a version 1.4 or up.
1.5 [final] (8 Oct 2002)
makeSafeIsh now disallows trailing dots and slashes (dots because Windows ignores a single trailing dot in object names). writeFormData[Stream] now allows file upload fields to be input properly (previously you had to put the parameters in the wrong order). form.py will now also use the built-in boolean type True/False instead of its own internal hack classes where available (ie. the upcoming Python 2.3 and later). Chunking values increased, resulting in large improvements in file upload speed on the Win32 platform.
1.6 [final] (17 Apr 2003)
Minor changes to Unicode and boolean support for older versions of Python: in particular, 'bool' is now always available, and documented. encH fixed to not encode to the target charset until the last moment. (The early encoding problem could conceivably have caused a few output characters to be garbled in some double-byte character sets not natively supported by Python.)

Future: expect a new, more object-based interface in future versions, to be finalised for 2.0. This will have a side-effect of allowing parameters to be read from superfluous pathname parts on servers that allow this (eg. Apache).

6.2. Licence

Copyright © 2002 Andrew Clover. Released under the GNU General Public License.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.