Tuesday, 2 February 2010

Single file Python apps

Various applications exist to ease software redistribution and packaging, for example:
klik
cpan
easy_install
gems

Yet redistribution is even easier when the application is platform-independent (a script) and directly executable.

A naive approach for Python software is to concatenate all the dependent scripts into a single script. In the case of Waf, however, the tools are plugins that are not meant to be loaded all at once, and the resulting script would be very large.

The best idea so far is to create an archive (in the tar file format) and to embed it as a base64 string in the final script. The script, which contains only a few routines for decompressing the archive, decodes the string and unpacks the library into a hidden folder when executed. All the program logic then comes from the unpacked library files.
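As a rough illustration of this idea, here is a minimal sketch that packs a few files into an in-memory tar archive, turns it into base64 text, and unpacks it again. The function names `pack_b64` and `unpack_b64` and the file names are made up for this example; this is not the actual Waf code.

```python
import base64
import io
import tarfile


def pack_b64(files):
    """Tar a {name: bytes} mapping in memory and return it as base64 text."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return base64.b64encode(buf.getvalue()).decode("ascii")


def unpack_b64(text, dest):
    """Decode the base64 text and extract the archive into the folder dest."""
    raw = base64.b64decode(text)
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tar:
        tar.extractall(dest)
```

A single-file script would carry the output of `pack_b64` as a string constant and call `unpack_b64` on it at startup, with a hidden folder next to the script as `dest`.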

The base64 encoding is safe in the sense that any binary input is transformed into a string containing only letters, digits and a few symbols (64 characters in total). Yet this operation increases the file size by about 33%.

The ascii85 encoding (used in PDF and PostScript files) uses 85 characters in total, including less safe symbols such as quotes and backslashes, but has a lower overhead (25%) and needs only a few more lines of Python code.
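The two overheads can be checked with a few lines. This sketch relies on `base64.a85encode`, which the standard library of recent Python 3 versions provides; the input data is arbitrary and chosen so that neither encoding needs padding.

```python
import base64

# 3072 bytes: divisible by 3 (base64 groups) and by 4 (ascii85 groups)
data = bytes(range(256)) * 12

b64 = base64.b64encode(data)   # 4 output characters per 3 input bytes
a85 = base64.a85encode(data)   # 5 output characters per 4 input bytes

b64_overhead = len(b64) / len(data) - 1   # about 0.33
a85_overhead = len(a85) / len(data) - 1   # 0.25

# The ascii85 alphabet runs from '!' to 'u' and therefore includes
# both kinds of quotes and the backslash.
A85_ALPHABET = bytes(range(ord("!"), ord("u") + 1))
```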

The best encoding of all would be no encoding, but the Python interpreter does not accept arbitrary binary data in a script. The CPython interpreter does, however, ignore all characters in a comment, between # and the end of the line (\n or \r), which makes it possible to store binary data in commented lines as long as the newline characters are escaped. With this scheme, the size increase for the binary data is about 2%.

The encoding scheme used for Waf is therefore:
1. make a compressed archive of all the files and obtain a binary string
2. find suitable 2-character escape sequences for the newline characters by scanning the binary string
3. replace the forbidden characters by the escape sequences
4. store the escape sequences, and write the binary string in a commented line
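The four steps above could look roughly like the following. This is a sketch and not the Waf implementation: the helper names are invented, and the choice of '#' followed by a letter or digit as the escape format is an assumption made for this example. It only escapes the two newline bytes mentioned in the text.

```python
import io
import string
import tarfile


def build_archive(files):
    """Step 1: pack a {name: bytes} mapping into a bzip2-compressed tar in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def find_escapes(blob, count=2):
    """Step 2: pick `count` two-byte sequences ('#' + letter or digit) absent from blob."""
    found = []
    for ch in (string.ascii_letters + string.digits).encode("ascii"):
        seq = b"#" + bytes([ch])
        if seq not in blob:
            found.append(seq)
            if len(found) == count:
                return found
    raise ValueError("no unused escape sequences found")


def encode_payload(blob):
    """Steps 3 and 4: escape the newline bytes; return (comment_line, escapes)."""
    esc_nl, esc_cr = find_escapes(blob)
    body = blob.replace(b"\n", esc_nl).replace(b"\r", esc_cr)
    # The escapes must be stored in the script so the unpacker can reverse them.
    return b"#" + body + b"\n", (esc_nl, esc_cr)
```

Because the second byte of each escape is never '#', no accidental escape sequence can appear at the boundary between original bytes and inserted ones, so the replacement can be reversed exactly.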

Upon execution, the library will be unpacked by following these steps:
1. open the script being executed, and read the binary string
2. replace the escape sequences by the newline characters
3. unpack the files from the binary string to obtain the library
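A sketch of the unpacking side follows. The marker lines `#==>` and `#<==` around the payload, the function names, and passing the escape sequences as arguments are assumptions of this example; in a real script the escapes would be stored as constants when the file is generated.

```python
import io
import tarfile

BEGIN = b"#==>\n"  # assumed marker line placed before the payload
END = b"#<==\n"    # assumed marker line placed after the payload


def read_payload(script_path, esc_nl, esc_cr):
    """Steps 1 and 2: read the commented payload line and undo the escaping."""
    with open(script_path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                raise ValueError("no payload found")
            if line == BEGIN:
                body = f.readline()
                if f.readline() != END:
                    raise ValueError("corrupt archive")
                break
    # Drop the leading '#' and the trailing newline, then restore \n and \r.
    return body[1:-1].replace(esc_nl, b"\n").replace(esc_cr, b"\r")


def unpack(script_path, dest, esc_nl, esc_cr):
    """Step 3: extract the archive found in the script into the folder dest."""
    blob = read_payload(script_path, esc_nl, esc_cr)
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        tar.extractall(dest)
```

The script passes its own path (for example `sys.argv[0]`) as `script_path`. The file is opened in binary mode so that the payload bytes are read back unchanged.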

In the case of Jython, the interpreter behaves a bit differently: it first reads the whole file before parsing the Python code (the parser used by Jython seems to require that). Jython then validates the characters present in the comments and throws a syntax error when trying to execute the Waf file.

Fortunately, changing the encoding declaration from 'utf-8' to 'iso8859-1' solved the problem entirely. While quite a few character sequences are forbidden in 'utf-8', binary data causes no such problem in 'iso8859-1'.
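The difference between the two encodings is easy to see with Python's own codecs. This small check is only an illustration of why arbitrary bytes are valid in 'iso8859-1' but not in 'utf-8'; it does not involve Jython itself.

```python
all_bytes = bytes(range(256))

# Every byte value maps to exactly one character in iso8859-1,
# so any binary string decodes and round-trips unchanged.
text = all_bytes.decode("iso8859-1")
round_trip_ok = text.encode("iso8859-1") == all_bytes

# In utf-8, many byte sequences are invalid; 0xff can never appear.
try:
    b"\xff\xfe".decode("utf-8")
    utf8_accepts_binary = True
except UnicodeDecodeError:
    utf8_accepts_binary = False
```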

The last problem with Jython is the lack of bzip2 support. For now, to obtain a compatible waf script, it is necessary to build Waf with the gzip compression algorithm:
./waf-light --make-waf --zip-type=gz

Although the waf directory is only meant for preparing the Waf files, replacing the contents of the wafadmin folder and changing the name of the final application may help in building and redistributing other Python applications as well.

2 comments:

  1. Nice summary - I hope I'll get to generalize the process for myself, so I can easily use it.

    For example I thought about the option to include precompiled modules in architecture dependent comment strings similar to the following (just a concept):

    # py: [bzipped python tree tar]
    # x86: [bzipped x86 compiled module tree tar]
    # x86_64: …
    # …

    This would require a cross-compiler on the release machine, though.

  2. I finally finished the code to turn arbitrary pure python projects into single file python projects: http://bitbucket.org/ArneBab/waffles/

    But beware: untested and potentially dangerous.

    Also, this is far simpler than your concept: it uses only base64 for encoding, and none of the features I wrote about are present. Furthermore, the format of the binary part is still subject to change (currently I additionally store there a list of all included packages, though I don't yet know if that will prove necessary).
