The form module is an extended replacement for the standard Python cgi module, providing robust form-handling features in order to make writing secure form-handling CGIs in Python less work.
The idea is to define the kind of data you want returned for each field of the form. This definition is done using a mapping of form field names to datatypes (fdefs), which is passed to the main function, readForm. This call reads CGI input and interprets it, returning a mapping of field names to values.
form also fully supports Unicode, [multiple] file-upload fields, image-submit fields and embedding values in names, protects against some denial-of-service problems common to CGI scripting, and provides miscellaneous utility functions useful to CGI programmers. It has been proven to cope with very large input.
form and cgi have completely different interfaces and are not compatible. form works at a somewhat higher level than cgi. Its ease of use comes at the expense of disallowing direct access to the exact submitted data.
The main advantage is that the returned values from reading a form submission are guaranteed to conform to your specifications, regardless of how malformed the submission may have been. This reduces the error-checking necessary to produce error-free scripts. The abstraction of datatype from submission data also allows some elements in an HTML form to be changed without having to re-write the corresponding CGI.
cgi is part of the standard distribution and so guaranteed available without having to add any modules. It easily suffices for writing simple forms. form is more complicated than cgi so it may be more likely to have bugs in it, although none are currently known. form is also not suitable for applications where you don't know the names of the submitted fields in advance, e.g. generic form-to-mail scripts. (A feature will probably be added to allow this at some point but it's not a priority.)
A user sign-up form might be read like this:
import form
form.initialise(1.4, charset='utf-8')
fdefs= {
    'email': (form.STRING, 128),
    'username': (form.STRING, 16),
    'password': (form.STRING, 16),
    'sex': (form.ENUM, ['m', 'f'], 'f'),
    'age': form.INT,
    'sendmespam': form.BOOL
}
fvals= form.readForm(fdefs)
if fvals.username=='':
    errorPage('You forgot to enter a user name.')
if allUsers.has_key(fvals.username):
    errorPage('Sorry, someone has already had that user name')
# and so on
Each item in an fdefs dictionary defines one form field. The key should be the same as the name property in the HTML form, which should not normally contain a period or colon (see 2.2). The value of the item dictates the datatype to be returned.
readForm returns a dictionary-like object with the names of the fields as keys. The type of the values depends on which type was requested for that field in the fdefs. You can read the returned object like a dictionary (fvals['address']), or like an object (fvals.address); it makes no difference.
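The two access styles can be pictured with a small dictionary subclass. This is only a sketch of the behaviour described above, written in modern Python; the class name FormValues is hypothetical and is not form's actual return type.

```python
class FormValues(dict):
    """Hypothetical stand-in for the object returned by readForm."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dictionary
        # methods such as keys() keep working as usual.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


fvals = FormValues({'address': '1 High Street', 'age': 30})
same = fvals['address'] == fvals.address
```

Both spellings read the same stored value, so a script can use whichever is more readable at that point.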
In the case where a field is included more than once in a submission but a list-of-values submission (form.LIST) was not expected, the last field in the input takes precedence.
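A minimal sketch of the "last field wins" rule, using the standard library's query-string parser rather than form itself. The helper name last_value_wins is an assumption made for illustration.

```python
from urllib.parse import parse_qsl


def last_value_wins(query):
    """Collapse repeated fields so that the final occurrence is kept."""
    values = {}
    for name, value in parse_qsl(query, keep_blank_values=True):
        # A later pair with the same name simply overwrites the earlier one.
        values[name] = value
    return values


result = last_value_wins('a=1&b=2&a=3')
```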
The following field types are available. Some of them take parameters, which you can specify by putting the type in a tuple, with the parameters following. If you are not passing parameters, you can use the type name on its own or in a singleton tuple, it doesn't matter which.
For input type=text or password. Return a string of maximum length length characters, with all characters in the string exclude removed. You can omit the exclude string to allow all characters. You can omit length or set it to 0 to allow any length string; it's mostly there so you can use the value in SQL without having to worry about it being too big to fit the relevant field.
Control characters are always removed from the returned string regardless of the setting of exclude, for safety. These are 0x00-0x1F and 0x7F in ASCII, and additionally if Unicode strings are being used, the C1 control codes 0x80-0x9F, the deprecated codes 0x206A-0x206F, the specials 0xFFFC-0xFFFF and the BOM 0xFEFF (since its function as ZWNBSP has been taken over by 0x2060).
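The cleaning rules for string fields can be sketched roughly as follows. This is an illustration in modern Python, not form's own code: the helper name clean_string is hypothetical, and the order of the steps (strip control characters, apply exclude, then truncate) is an assumption.

```python
import re

# The character ranges listed above, written as one character class.
CONTROL_CHARS = re.compile(
    r'[\x00-\x1f\x7f'      # ASCII control characters
    r'\x80-\x9f'           # C1 control codes
    r'\u206a-\u206f'       # deprecated formatting codes
    r'\ufffc-\uffff'       # specials
    r'\ufeff]'             # byte order mark
)


def clean_string(text, length=0, exclude=''):
    """Hypothetical helper mimicking the documented STRING behaviour."""
    text = CONTROL_CHARS.sub('', text)
    if exclude:
        text = ''.join(ch for ch in text if ch not in exclude)
    if length:
        # A length of 0 means "no limit", as described above.
        text = text[:length]
    return text
```

For example, clean_string(value, 16, "'\"") would give a value of at most 16 characters with both kinds of quote removed.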
For textarea. As form.STRING, but single newlines are converted to space, and double newlines are converted to a Python '\n'. Other control characters are still removed.
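The newline folding might look something like the sketch below. It assumes that CR LF and bare CR line endings are normalised first, and that a run of more than two newlines is treated the same as a double newline; neither detail is stated above, and the function name is hypothetical.

```python
import re


def fold_newlines(text):
    """Sketch: single newline -> space, blank line -> one '\\n'."""
    # Browsers usually submit CR LF line endings; normalise them first.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # Replace each run of newlines according to its length.
    return re.sub(
        r'\n+',
        lambda match: '\n' if len(match.group()) >= 2 else ' ',
        text,
    )


folded = fold_newlines('first line\r\nstill first\r\n\r\nsecond paragraph')
```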
For select and input type="radio". Return one of the list of string values passed if it matches the input, else return the default value, which can be of any type. If the default is not supplied, '' is used as the default.
For input type="checkbox" with no value property. The value returned evaluates True if the input value for this field was 'on', else False. (From Python 2.3 it is a native Python boolean; in earlier versions it is an object that behaves like one.)
For select multiple and multiple fields with the same name (especially checkboxes). Return a list of the non-empty input strings given for this field.
For input type="image". Return a tuple (x, y) giving the position of the click, clipped to within (0:width, 0:height) if the (width, height) tuple is supplied. Returns (0, 0) if the input field was supplied but without x and y co-ords, or (-1, -1) if the field was not in the input at all.
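The clipping step can be sketched as a simple clamp. Whether the upper bounds are inclusive is not stated above, so this sketch assumes they are; the function name clip_click is hypothetical.

```python
def clip_click(x, y, size=None):
    """Clamp a click position into the range 0..width, 0..height."""
    if size is not None:
        width, height = size
        x = min(max(x, 0), width)
        y = min(max(y, 0), height)
    return (x, y)


clipped = clip_click(150, -3, (100, 50))
```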
For input type="file". Fill the given directory with files uploaded through the field, and return a list of tuples (storedFile, suppliedFile, mimeType, length). The suppliedFile filename may be '' if no filename was specified. storedFile is the full pathname of the stored file. The list is empty if no files were uploaded, and is unlikely to be longer than one entry since few browsers support multiple-file upload.
Parse the input as a decimal (possibly negative) integer. Returns the default value if no parsable number could be read. If default is omitted, zero is used as the default. Returns sys.maxint if the number is higher than Python can represent as an integer.
Note! Future versions of form may return a long integer for form.INT. I might restrict this to Python 1.6 and later, where str doesn't add an 'L' to the end of the number, to avoid problems.
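The integer rules can be sketched as below. Modern Python has no sys.maxint, so sys.maxsize stands in for it here; the handling of surrounding whitespace and the function name read_int are assumptions, not documented behaviour.

```python
import re
import sys

# Optional minus sign followed by ASCII digits, with optional whitespace.
_INTEGER = re.compile(r'\s*(-?[0-9]+)\s*$')


def read_int(text, default=0):
    """Sketch of the INT rules: default on failure, clamp when too large."""
    match = _INTEGER.match(text)
    if match is None:
        return default
    # sys.maxsize is used as a stand-in for the old sys.maxint.
    return min(int(match.group(1)), sys.maxsize)
```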
Parse the input as a simple floating point number, which may contain a decimal point, but not 'E' notation. Returns 0.0, or, if supplied, the default if the input is not a valid number or not supplied.
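A rough sketch of "simple floating point" parsing: digits with an optional decimal point are accepted and 'E' notation is rejected. The exact grammar form accepts is not given above, so the pattern and the name read_float are assumptions.

```python
import re

# Digits with an optional decimal point; no exponent part is allowed.
_SIMPLE_FLOAT = re.compile(r'-?(?:[0-9]+\.?[0-9]*|\.[0-9]+)$')


def read_float(text, default=0.0):
    """Return the parsed number, or the default if the input is not valid."""
    text = text.strip()
    if _SIMPLE_FLOAT.match(text) is None:
        return default
    return float(text)
```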
In HTML, there are some kinds of form fields where you can't use the value attribute to pass information to the CGI script. These are input type="image", where the value is always a pair of co-ordinates, and submit, where the value is used as the text for the button.
So if you wanted to detect which of a set of identically-labelled buttons was pressed, you'd have to give them all a different name, and include a check for each one in your script. This would be especially tedious for an order form with a hundred "Buy It!" buttons, for example.
For this reason, form allows you to make a group of controls where the value submitted for each is taken from the name of the control instead of the value, when such a control is included in a submission. The actual value submitted is ignored.
To use the feature, put both the name and the desired value together in the HTML name of the field, separated by a colon (which is a valid character for name, albeit a seldom-used one).
<input type="submit" name="b:left" value="Click me!">
<input type="submit" name="b:middle" value="Click me!">
<input type="submit" name="b:right" value="Click me!">
In this example, a call to form.readForm({'b': form.STRING}) would return either 'left', 'middle' or 'right', depending on which button was used to submit the form. This is not limited to STRING: values of all types except FILE may be embedded in names.
(You can still use names with colons if you do not wish to use the value-embedding feature. form only tries to separate a name with a colon in it if it can't find the whole name as a key in your fdefs. The same goes for periods, which are special characters used by HTML in image maps.)
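The lookup order described above (whole name first, then the part before the colon) can be sketched like this. It is an illustration only: the function name is hypothetical, and splitting at the first colon is an assumption.

```python
def split_embedded(name, fdefs):
    """Return (field, embedded_value) for a submitted control name."""
    if name in fdefs:
        # The whole name is a known field, so nothing is embedded in it.
        return name, None
    if ':' in name:
        field, value = name.split(':', 1)
        if field in fdefs:
            return field, value
    # Not a field the script asked for.
    return None, None


pressed = split_embedded('b:left', {'b': 'STRING'})
literal = split_embedded('b:left', {'b:left': 'STRING'})
```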
Calling this function is not compulsory, but it allows you to set some of form's internal variables easily.
form.py includes features to protect against certain kinds of denial-of-service attacks in POST requests. They are turned off by default, but passing non-zero values in the "limit" parameters enables them.
The arguments you can set are:
form.INT and form.FLOAT will read numbers using European-style punctuation (where "." is a thousands-separator and "," is the decimal point). If false (the default), it's the other way around.
None (the default) to return plain 8-bit strings without attempting to decode them. See Working with Unicode for more details.
writeForm, checked and selected.
All read functions take submitted form data and parse it, returning a dictionary-like object containing the values that have been posted to the form, standardised according to the fdefs argument passed to the function. The returned object may be read like a dictionary or like an object.
Typically, a script calls readForm at the start of its code. Scripts do not normally need to call the other read functions directly.
read functions is appropriate, and calls that.
readUrlEncoded, but takes its input from a stream object (must support read()) instead of a string.
Decodes fields encoded in a multipart/form-data formatted string. parameters is a dictionary of MIME headers, lower-cased keys, containing at least a 'boundary' key.
Currently this function is no more efficient than readFormDataStream, since it is not commonly needed.
readFormData, but input is taken from a stream object instead of a string. The length is the number of bytes that should be read from the stream.
The write functions take form values from a dictionary (or dictionary-like object returned by the read functions), and convert them into encoded text sent to a string or a stream.
File upload fields only work for writeFormData and writeFormDataStream, since it does not make much sense to try to upload a file to a query string or hidden form. File upload values need not have a valid length value in the tuple as the length is read directly from the file specified.
Currently, the string-returning functions are no more efficient than the stream-writing versions.
input type="hidden" controls for each field in the fvals dictionary. This is useful for writing a follow-up form that retains all the information posted into a previous form.
writeForm, but send output to a stream object (or anything supporting write) instead of returning a string.
<a href="...">, remember to HTML-encode the whole URL, or those & characters could confuse a browser.
writeUrlEncoded, except that the output is sent to the nominated stream.
These convenience functions are available for coding text for representation in HTML, URLs and JavaScript strings. If you have user input anywhere in your scripts, you'll need to do this a lot, or you're likely to make a site susceptible to security problems. (See this CERT advisory for an example of this.)
Encode text as HTML and return as string. ", &, <, > and control characters are replaced with HTML entities. This assumes you use the double-quote rather than single-quote for attribute strings, which is advisable. Obviously quotes do not need to be escaped outside of attribute values, but it does no harm.
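A minimal sketch of this kind of HTML encoding follows. It is not form's implementation: the name enc_h is hypothetical, and the choice of named entities for the four symbols and numeric references for control characters is an assumption.

```python
_NAMED = {'"': '&quot;', '&': '&amp;', '<': '&lt;', '>': '&gt;'}


def enc_h(text):
    """Replace the four special symbols and control characters."""
    out = []
    for ch in text:
        if ch in _NAMED:
            out.append(_NAMED[ch])
        elif ord(ch) < 32 or ord(ch) == 127:
            # Control characters become numeric character references.
            out.append('&#%d;' % ord(ch))
        else:
            out.append(ch)
    return ''.join(out)


encoded = enc_h('<a href="x">&')
```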
Encode text as a URL part (replacing spaces with '+' and many symbols with %-encoded entities), and return as a string.
Note: you should not pass entire URLs through encU, only separate parts, for example a directory name in a path, or a key or value string in a query. Once encoded you can combine these parts using '/', '?' and so on.
When writing HTML, remember to encode the complete URL if it has characters like '&' in.
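The same discipline can be shown with the standard library's equivalents (urllib.parse.quote_plus and html.escape from modern Python) standing in for form's own functions. The script path /cgi-bin/find.py is a made-up example.

```python
from html import escape
from urllib.parse import quote_plus

# Encode each part separately, then join the parts with '?', '=' and '&'.
key = quote_plus('search term')
value = quote_plus('fish & chips')
url = '/cgi-bin/find.py?' + key + '=' + value + '&page=2'

# HTML-encode the complete URL before placing it in an attribute.
link = '<a href="%s">next</a>' % escape(url, quote=True)
```

Note that the '&' inside the value was %-encoded at the URL stage, while the '&' that separates the two query fields is only escaped at the HTML stage.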
Encode text so it can be included in HTML id or name attributes. This is especially useful when you need to include arbitrary strings in name-embedded values.
This encoding is not a web standard, it's specific to form. Technically it simply replaces all disallowed characters with ':xx' where xx is the hex encoding of the character.
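The ':xx' scheme can be sketched as below, together with a matching decoder. The set of allowed characters (ASCII letters, digits, '-' and '_') and the decision to work on UTF-8 bytes are assumptions made for this illustration; form's real rules may differ, and the function names are hypothetical.

```python
import string

# Assumed set of characters that may pass through unchanged.
_ALLOWED = set(string.ascii_letters + string.digits + '-_')


def enc_i(text):
    """Replace every disallowed byte with ':xx' (two hex digits)."""
    out = []
    for byte in text.encode('utf-8'):
        ch = chr(byte)
        out.append(ch if ch in _ALLOWED else ':%02x' % byte)
    return ''.join(out)


def dec_i(text):
    """Reverse enc_i; expects well-formed input produced by it."""
    raw = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ':':
            raw.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(text[i]))
            i += 1
    return raw.decode('utf-8')
```

Because the colon itself is always encoded, a value produced this way can safely follow the separating colon in an embedded name.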
Encode text suitable for inclusion in a JavaScript string. This escapes single and double quotation marks, and the ETAGO (</) marker, making the result safe to include in a string in a script block.
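A sketch of this kind of escaping is shown below. Escaping backslashes and line breaks as well is an assumption added here (without it the result would not be a valid JavaScript string literal); the name enc_j is hypothetical.

```python
def enc_j(text):
    """Escape text for use inside a quoted JavaScript string."""
    # Backslashes first, so later escapes are not doubled up.
    text = text.replace('\\', '\\\\')
    text = text.replace('"', '\\"').replace("'", "\\'")
    text = text.replace('\n', '\\n').replace('\r', '\\r')
    # Break up '</' so an HTML parser cannot see an end tag.
    return text.replace('</', '<\\/')


script_safe = enc_j('</script>')
```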
Shorthand for encH(encJ(text)), useful for writing JavaScript inside an HTML attribute, especially event handlers.
Decodes text encoded into part of a URL, replacing the %-encoded entities with plain text.
Decodes text escaped with encI.
These functions are of general use to CGI scripts and are provided together as a convenience, as well as being used internally by form.
Simply returns the string ' checked' if the condition is true or '' if false. This often saves writing an if statement whilst outputting a form. If you called initialise setting xhtml to be true, you will get the non-minimised XML form ' checked="checked"' instead.
print '<input type="checkbox" name="spam"'+form.checked(f.sendmespam)+'>'
Like checked, but on true outputs ' selected', for select fields.
print '<option value="m"'+form.selected(f.sex=='m')+'>'
makeSafe, but allow single periods and slashes, but not combinations of them together or strings starting or ending with them. Also allows the range of characters between C0-FF.
Evaluates its parameter for truth and returns a value that will be understood by the write functions as representing a boolean True or False.
From Python 2.3 this function behaves identically to the built-in function bool and returns the native boolean values True and False. On earlier versions of Python which do not have a native boolean type, it returns an object that behaves like a boolean.
If you want your code to be compatible across Python versions, you should use form.bool instead of bool, and form.bool(1) and form.bool(0) instead of True and False when constructing new form value dictionaries to be passed to the write functions.
The input-reading functions may throw the following exceptions:
Some aspect of the CGI environment is broken, for example environment variables not being correctly set by the script's caller.
cgiErrors are the fault of the web server, and should not happen in working web sites.
An fdefs dictionary was passed to readForm which included unknown fdef values or unexpected parameters. Alternatively you passed a set of fields to writeForm or writeUrlEncoded (or the stream versions) which included a file-upload field. Note, readForm may also raise a TypeError, if some of the parameters in the fdefs were of the wrong type.
fdefErrors are your script's fault, and should not happen in working web sites.
The HTTP request or the MIME message in a HTTP POST request is malformed in some way.
httpErrors are the user-agent's fault, so could happen in a working web site, but only if either:
Finally, initialise may throw a NotImplementedError if it is called with a version number higher than the version of form being used, or Unicode features are requested which cannot be provided.
Since version 1.4, form can return Unicode strings from its form-reading functions. This must be specially requested by passing a charset argument with a non-None value to the initialise function. form will then interpret the submitted data as being encoded in that character set, and will return Unicode strings with that data represented in the standard encoding used by Python, UTF-16.
If charset is specified as None, not specified at all, or initialise is not called, returned strings will be plain 8-bit, undecoded strings as submitted by the browser. This is the easiest option if you are using an 8-bit character set such as 'iso-8859-1' or 'windows-1252', or if you don't care about non-ASCII characters being consistently interpreted in different locales.
The setting of charset will also be used to encode any unicode strings passed to the enc functions.
The charset a browser uses to submit form data is typically the same as the charset of the document containing the form. (You should be able to override this behaviour using the accept-charset attribute on <form>, but this does not work in Internet Explorer.)
The charset used must be one that Python understands. This means your web pages can be in, for example, 'iso-8859-1' or 'utf-8' (the obvious choice if you're working with international forms), but not 'windows-1252' or 'shift_jis'. If you specify an encoding Python doesn't know about, or you try to set a charset on a version of Python that does not support Unicode (ie. before 1.6), you will get an exception on initialisation.
Do remember to tell the browser what character encoding your web pages with forms in are using, or it will submit the form in whatever default encoding the user has chosen, and you have no chance of finding out what that is. The best way to specify the charset of a page is to have your web server send it with a Content-Type HTTP header with a charset:
Content-Type: text/html;charset=utf-8
- rather than just plain 'text/html'. If you're not in a position to control what headers your web server sends, you can set the charset by using an HTML meta hack, such as the following, in your <head>.
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
(There's also an encoding attribute in the <?xml?> preamble if you are using an XML document, but browser support is not brilliant, yet.)
Once you have Unicode strings in your script you can do all the normal string operations with them. But if you need to write them back out to a non-Unicode-aware storage like a text file or a database, you will probably need to encode them again, usually as UTF-8.
Beware: if you had a field described as (form.STRING, 8), the eight characters in the Unicode string may expand to more than eight bytes when encoded as UTF-8. In fact it can theoretically be up to four times as many. You can either allocate four times as much room in your database to ensure it will always fit (which can be quite wasteful, especially if you are using non-VARYING fields), or truncate.
If you are going to use truncation, make sure you do not break the string in the middle of a UTF-8 code sequence, or when you retrieve it and try to turn it back into a Unicode string Python will give you a UnicodeError exception. Also, add some checks to make sure truncation does not happen to any field being used as a primary key, or your scripts could get very confused trying to SELECT a row whose primary key is unexpectedly shorter than when it was inserted!
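One way to truncate safely is sketched below in modern Python. The helper name truncate_utf8 is hypothetical. It cuts the encoded bytes at the limit and then discards any incomplete sequence left at the end.

```python
def truncate_utf8(text, max_bytes):
    """Encode text as UTF-8, keeping at most max_bytes whole-character bytes."""
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return encoded
    # The input was valid UTF-8, so the only bytes 'ignore' can drop are
    # those of a sequence that was cut in half at the end.
    return encoded[:max_bytes].decode('utf-8', 'ignore').encode('utf-8')


short = truncate_utf8('h\u00e9llo', 2)
```

The result always decodes cleanly, though it may be a byte or more shorter than the limit.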
According to the standard for POST submission of multipart/form-data, user agents should specify the character set of form data in a Content-Type header with charset parameter on each successful control submitted, when there are non-ASCII characters present. No browser does this, so form does not even bother to look for the header. In practice they simply submit the form in whatever charset the original document was encoded in.
Also according to the standard, the 'name' parameter in the Content-Disposition header should be encoded according to RFC2047 if it contains out-of-bounds characters. No browser does this; instead, they simply surround the name with quotes. They don't even care if the name contains more quotes or control characters, so avoid putting these in your field control names!
The writeFormData functions duplicate both these universal-but-wrong behaviours.
form was written by Andrew Clover and is available under the GNU General Public Licence. There is no warranty. (GPL is chosen as a good default licence; if, for some reason, this doesn't match your requirements, get in touch.)
Bugs, queries, comments to: and@doxdesk.com.
Future: expect a new, more object-based interface in future versions, to be finalised for 2.0. This will have a side-effect of allowing parameters to be read from superfluous pathname parts on servers that allow this (eg. Apache).
Copyright © 2002 Andrew Clover. Released under the GNU General Public License.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.