Monday, 17 December 2012

Linux filesystems for build workloads

Linux distributions provide several filesystems by default: XFS, JFS, Ext3, Ext4 and ReiserFS 3. Each has its own characteristics: some are known to be better at handling small files (ReiserFS 3), others at handling large files (XFS), and some feature annoying quirks (long filesystem verification times on Ext3).

I tend to prefer XFS because Ext2/Ext3 filesystem verification (fsck) can take a very long time, which is simply unacceptable in production environments. After seeing XFS perform poorly on a file server (extremely long file deletions), I decided to take actual measurements to form an informed opinion.

The scenarios below represent typical operations on servers in a build farm: file writes (building the software), file deletions (clean builds), and filesystem verification (unexpected shutdowns).

The numbers were obtained on a freshly installed Ubuntu 12.10 (Quantal Quetzal) workstation with two mechanical hard drives. A large 55GB build folder containing source code and build artifacts was used in the tests below (350000 files spread across 19000 folders). The data was first copied to a freshly created filesystem, the filesystem was then unmounted and verified (fsck -f where applicable), and finally all the files were removed. The very large fileset was essential to obtain relevant data; the best time of 2 runs was recorded.
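
For reference, the per-step timings can be captured with a small helper along these lines (the devices and paths below are placeholders, not the actual test setup):

```python
import subprocess, time

def timed(label, cmd):
    """Run a command and print its wall-clock duration."""
    start = time.time()
    subprocess.check_call(cmd)
    elapsed = time.time() - start
    print("%-8s %.1fs" % (label, elapsed))
    return elapsed

# Placeholder devices/paths -- adjust for the machine under test,
# and keep the best of two runs as described above:
# timed("write",  ["cp", "-a", "/data/build", "/mnt/test/"])
# timed("fsck",   ["fsck", "-f", "/dev/sdb1"])
# timed("remove", ["rm", "-rf", "/mnt/test/build"])
```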

File writes

This test represents the time to copy all the files to the initially empty filesystem from a separate hard drive:

Filesystem verification

A weak point of Ext3 on servers is that verifying the filesystem can take a long time. This verification can be triggered when the system is not switched off properly, and can cause unwanted downtime. I suspected that Ext4 would also have long verification times, but I was pleasantly surprised:

File removal

File removal has been a weak point of XFS for a long time. Removing a few terabytes of data can take so long that I have sometimes considered replacing rm with mkfs. I was hoping that the version of XFS in kernel 3.2 would perform much better thanks to the recent optimizations. The following represents the time to remove the directory copied previously:


For build servers and related file servers, it makes sense to prefer Ext4 over other filesystem types. XFS was a good alternative to Ext3, but this is no longer the case.

Sunday, 16 December 2012

Caching object files for the build

An interesting idea for accelerating builds is to cache already generated object files. The Waf library provides a simple cache system by intercepting task execution and retrieving files from the cache. Extensions are even provided to limit directory growth or to share the files over the network.

In practice, implementing a cache layer at the build system level does not work very well. The following points are the conclusions of years of experimentation on both open and closed-source projects:

  1. The task signatures used for identifying tasks make poor keys for accessing the cache. Platform-specific command-line flags, path separator characters (/ or \), and absolute paths severely limit cache re-use.
  2. Implementing different task signatures to work around the previous limitations (overriding BuildContext.hash_env_vars for example) will cause at best only performance issues (long startup time), and at worst mysterious cache reuse errors.
  3. Because of the two previous points, the build system can become too brittle and too complex.
  4. The Python runtime is essentially single-threaded. The build process is therefore unable to launch more tasks when retrieving files from the cache.
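
To illustrate point 1, here is a hypothetical key function (not Waf's actual BuildContext.hash_env_vars) showing the kind of flag and path normalization such workarounds require, with all the fragility that implies:

```python
import hashlib

def task_key(flags, paths, root):
    """Hypothetical cache key: normalize flags and make paths
    relative so that two checkouts can share cache entries."""
    h = hashlib.sha1()
    for flag in sorted(flags):           # order-independent flags
        h.update(flag.encode("utf-8"))
    for p in paths:
        rel = p.replace("\\", "/")       # unify path separators
        if rel.startswith(root):
            rel = rel[len(root):]        # strip the checkout prefix
        h.update(rel.encode("utf-8"))
    return h.hexdigest()
```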

The best system so far is to wrap the compilers or the commands in the manner of ccache. While this requires some more work up front, the resulting builds are faster and more robust.

The ccache application is limited to C/C++ compilation, but it is easy to write similar command-line wrappers. Such wrappers can then access custom low-latency TCP servers, for example.
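
A minimal wrapper in the spirit of ccache might look like this sketch. Unlike the real ccache, it hashes the raw source instead of the preprocessor output, so header changes are not detected; CACHE_DIR is an arbitrary choice:

```python
import hashlib, os, shutil, subprocess

CACHE_DIR = os.path.expanduser("~/.objcache")  # arbitrary cache location

def cache_key(cmd, source):
    """Hash the command line and the source file contents."""
    h = hashlib.sha1()
    h.update(" ".join(cmd).encode("utf-8"))
    with open(source, "rb") as f:
        h.update(f.read())
    return h.hexdigest()

def cached_compile(cmd, source, obj):
    """Run cmd (expected to produce obj), unless a cached copy exists.
    Returns True on a cache hit, False when the command was executed."""
    slot = os.path.join(CACHE_DIR, cache_key(cmd, source))
    if os.path.exists(slot):
        shutil.copy(slot, obj)      # hit: reuse the cached object file
        return True
    subprocess.check_call(cmd)      # miss: run the real compiler
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    shutil.copy(obj, slot)
    return False
```

The wrapper is then installed in front of the real compiler (for example CC="wrapper gcc"), so any build system benefits from the cache without modification.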

Saturday, 8 December 2012

Running Waf on Pypy 2.0

Is Pypy an option for running Waf builds now? While Pypy 2.0 beta 1 still hangs on simple parallel builds, the Pypy nightly (59365-f2f4cb496c1c) seems to work much better.

The numbers below represent the best times of 10 runs on a 64-bit Ubuntu 12.10 laptop. The typical benchmark project was used for this purpose (./utils/ /tmp/build 50 100 15 5):

             cPython 2.7.3   pypy-c-jit   pypy-c-nojit
no-op build  0.76s           6.5s         7.7s
full build   39s             45.4s        48.3s

The no-op build times represent the time taken to load the serialized Python data without executing any command. Pypy still uses a pure-Python implementation of pickle, which is likely to take much more time than the C extension present in cPython.

This can also explain the difference in the full build times. If we subtract the no-op values, the Pypy runtime appears to be getting nearly as fast as cPython.
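
On modern cPython the pickle gap can be approximated by timing the C pickler against the pure-Python one (pickle._Pickler is an internal name, used here only for illustration):

```python
import io, pickle, time

# A sample payload roughly resembling serialized build data
data = {i: ("node%d" % i, [i, i + 1]) for i in range(20000)}

def dump_with(pickler_cls):
    """Serialize the sample data with the given Pickler class."""
    buf = io.BytesIO()
    pickler_cls(buf, pickle.HIGHEST_PROTOCOL).dump(data)
    return buf.getvalue()

for name, cls in [("C pickler", pickle.Pickler),
                  ("pure-Python pickler", pickle._Pickler)]:
    start = time.time()
    blob = dump_with(cls)
    print("%-20s %.3fs  (%d bytes)" % (name, time.time() - start, len(blob)))
```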

Saturday, 1 September 2012

KDE 4.9

Waf was originally created to ease the creation of KDE applications, but that has not worked out so well in practice. The first versions of KDE 4 were terrible, and I think they discouraged many people from ever using it again.

Fortunately, version 4.9 has changed things for the better, and it finally provides a pleasant development environment. At least, after the stability fixes (the plasma desktop does not crash anymore, the network manager just works), there are fewer annoyances than on other desktop environments. In particular, the focus stealing prevention policy helps with concentration, and applications no longer pop up password/keyring windows all the time.

If Qt5 and KDE5 do not break the API too much, we should see more applications for KDE appearing over time.

Monday, 13 August 2012

Computed gotos in python 2.7

Since Pypy does not work too well for multithreaded applications at the moment, I am stuck with cPython for now.

Since Python 2.7.3 is about as fast as Python 3.2 for my applications, I wondered what Python 3 optimizations could be backported to 2.7. The computed gotos patch did not look too complicated to adapt, so I have created my own version. Here are two files to add to build a computed-gotos-enabled cPython 2.7.3 interpreter: Python/ceval.c and Python/opcode_targets.h.

The optimization does not seem to make a visible difference on my applications though, even after recompiling with -fno-gcse/-fno-crossjumping.

Thursday, 22 March 2012

Listing files efficiently on win32 with ctypes

Listing files on Windows platforms is not particularly fast, but detecting whether files are folders, or obtaining their last modification times, is extremely slow (os.path.isfile, os.stat). Such function calls become major bottlenecks on very large Windows builds. An effective workaround is to use the functions FindFirstFile and FindNextFile to list files and their properties in a single pass. The results can then be added to a cache for later use.

Though cPython provides access to these functions through ctypes, finding a good example is fairly difficult. Here is a short code snippet that works with Python 2:

import ctypes, ctypes.wintypes

BAN = (u'.', u'..')
FILE_ATTRIBUTE_DIRECTORY = 0x10
INVALID_HANDLE_VALUE = -1

FindFirstFile = ctypes.windll.kernel32.FindFirstFileW
FindNextFile  = ctypes.windll.kernel32.FindNextFileW
FindClose     = ctypes.windll.kernel32.FindClose

out  = ctypes.wintypes.WIN32_FIND_DATAW()
fldr = FindFirstFile(u"C:\\Windows\\*", ctypes.byref(out))

if fldr == INVALID_HANDLE_VALUE:
    raise ValueError("invalid handle!")
try:
    while True:
        if out.cFileName not in BAN:
            isdir = bool(out.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            ts = out.ftLastWriteTime
            # FILETIME: the high part holds the upper 32 bits
            timestamp = (ts.dwHighDateTime << 32) | ts.dwLowDateTime
            print out.cFileName, isdir, timestamp
        if not FindNextFile(fldr, ctypes.byref(out)):
            break
finally:
    FindClose(fldr)
To learn more about the attributes available on the "out" object, consult the MSDN documentation on WIN32_FIND_DATAW.

Saturday, 28 January 2012

Escaping the Google cave

1. The problem with user tracking

A few years ago, a user complained that the Waf project was using Googlecode, and threatened to stop using Waf if it remained hosted on the Google servers. I thought the request was paranoid at the time, and I just forgot about it...

Now a few years have passed, and it is a bit late to move to Github. Also, tons of websites are now hosted by Google, and it would be impossible to avoid all of them. But this is not my main concern. Rather, I have picked up the bad habit of logging in to my Google or Facebook accounts more often than I clear my cookies, and I am getting targeted content too often.

For instance I started to notice that it was much harder to obtain information from Google search. I would frequently find Python in all my search results. Searching for Java programming techniques would lead me to more Python sites. Searching for Scons or CMake would only lead me back to Waf. I would also get ads related to the contents of my emails. In other words, the Google tracking had started to create a convenient place where I would always find familiar information.

Since I was a child, I have held the belief that there exists a world independent of me, a reality worth exploring (solipsists may disagree). It is a virtue to try to know the world as it is, and not as one would like it to be. The web is interesting because it makes it easy to encounter other views and to explore a world that is not bound by any particular one, and I would like to keep it this way.

2. Filtering tracking websites

First of all, I believe that the "do not track" header is one of the most idiotic inventions of recent years. If I were running a website and someone set that header, I would really love to track that person in the sneakiest way possible. It is equivalent to wandering around with a big "kick me" sign stuck on your back.

One of the first things I tried was to avoid tracking by blocking the scripts that report the pages I am visiting, for example Google Analytics. For this, it is simple to edit the file /etc/hosts:
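
An illustrative subset (the real list of tracking hosts is much longer):

```
127.0.0.1   www.google-analytics.com
127.0.0.1   ssl.google-analytics.com
127.0.0.1   stats.g.doubleclick.net
```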

There are many more addresses to exclude, however, and this does not prevent Google from reading your mail: it is enough to log in to Googlemail once to get targeted ads and personalized content again.

The filtering approach is also imperfect: if it becomes widespread, a few websites will start breaking when tracking is blocked, and it will be easy to test in JavaScript which hosts are blocked in order to build a fingerprint of the user. This goes back to the principles of information theory: if you have a secret, it will eventually leak, however hard you try to keep it.

3. Setting up multiple identities

Trying to filter the websites is just too complicated, and removing HTTP cookies, Flash cookies, visited pages, website preferences and the user-agent is just too much of a hassle. Websites may also try cache timing attacks to get more information about you anyway.

Tor is nice but limited in terms of features and speed (no Flash, use the Tor browser, etc). Virtual machines are convenient but use a lot of resources; for example, Flash and JavaScript become too slow to be usable. I keep them for untrusted websites (with Flash disabled anyway).

The best success I have had so far is by setting up multiple Linux user accounts and multiple identities. I keep my current account for normal stateless activities, and use the other accounts for stateful operations. For example, I created a user account named "google" for all googlemail and googlecode-related activities:

# useradd -g users google
# mkdir /home/google
# echo "export DISPLAY=:0" >> /home/google/.bashrc
# chown google /home/google/.bashrc /home/google

The current user account must allow windows from the other user accounts to be displayed on the current Xorg session; make sure to always use xhost +local:accountname:

echo "xhost +local:google" >> ~/.profile

To obtain any sound, it is necessary to tweak the pulseaudio settings. First, the file /etc/pulse/ must be copied to ~/.pulse/ and modified to allow connections from other user accounts:

> diff -urN /etc/pulse/ ~/.pulse/ 
--- /etc/pulse/       2011-10-30 03:59:03.000000000 +0100
+++ .pulse/   2011-12-01 00:34:34.537118644 +0100
@@ -158,3 +158,6 @@
 ### Make some devices default
 #set-default-sink output
 #set-default-source input
+load-module module-native-protocol-tcp auth-ip-acl=

Then a new file must be added to each other user account, for example /home/google/.pulse/client.conf:
default-server =

After that, a web browser can be started easily:
sudo su
su - google

To make certain that I do not confuse the accounts (web browser completion already helps a lot), I also use different web browsers with different versions, different extensions (noflash, firebug, etc), different themes and different language settings. For example, looking at ads in Serbian or in Russian is fun. The Firefox themes (Personas) even have animations to help remember which window is which.