Flying Circus at EuroPython 2014

If you’re attending EP14, be sure to visit our Flying Circus booth at BCC level A! We’re here to discuss web operations. Managed hosting is only as good as the people behind it. So just walk over, test us, ask any question related to web operations! Additionally, we have some demo VMs readily available so you can get hands-on experience with a walk-through from our developers.

Flying Circus booth at EP14

Follow up actions after the filesystem corruption incident

On 2014-06-07, the Flying Circus experienced a quite unfortunate filesystem corruption incident. Most of the VMs have been cleaned up since then, but a few defective files are still around. In the following article, I’ll provide some background information on what types of corruption we saw, what you (as our customer) can expect from platform management to rectify the situation, and what everyone can do to check his/her own applications.

Observed types of filesystem corruption

The incident resulted in lost updates on the block layer. This means that some filesystem blocks were reverted to an older state. Depending on what kind of information has been saved in the affected blocks, this may lead to different effects:

  • files show old content, as a file update got lost;
  • files show random content, as updates to the file’s extent list got lost;
  • files have disappeared completely, as updates to their containing directory got lost.

On most VMs which experienced filesystem corruption, filesystem metadata has been rendered invalid as well. We were able to identify these VMs quickly and contacted the affected customers immediately. However, there are still some cases left where filesystem metadata has not been affected (so the automated checks did not find anything), but file contents have been affected. Generally, files that have been updated in the time range between 2014-06-02 and 2014-06-07 or live in directories that saw changes during that time are at risk.

These cases of corruption are impossible to detect via filesystem checks. To make sure that all VMs are in a reasonably good state, twofold action is necessary: First, we will check the OS and all managed components as part of our platform management. Second, we ask you to take a look at your applications to uncover previously hidden cases of filesystem corruption.

Platform-wide checks

After taking short-term action to ensure that we will not run into a similar problem again, we are currently in the process of performing a deep scan of all installed OS files and managed components. In particular, we are going to:

  • perform a consistency check on all files installed from OS packages;
  • perform integrity checks on all managed databases (PostgreSQL, LDAP, …);
  • reboot all VMs to ensure that there is no stale cached content.

Found inconsistencies will be repaired automatically if possible (e.g., OS files). As far as application data is concerned (e.g., databases), we will contact you to work out available options to restore consistency. VM reboots will take place during announced maintenance periods as usual.

Application-specific checks

It is not possible to perform an automated deep check of project files, as we do for OS files. Too much context knowledge is necessary to judge what is OK and what looks suspicious. So we ask you to take a critical look at your applications yourself. Our experience so far shows that signs of filesystem corruption reveal themselves quite quickly once one starts to look for them.

Two areas that need to be checked are static project files, like application software and configuration, and application-specific databases.

Checking application installations

Our first and most important recommendation is to restart all applications and check for obvious signs of trouble. This is both easy and points to most problems right away.

Additionally, the applications’ log files should be inspected. Filesystem corruption sometimes causes “illogical” errors to show up in the log files. We recommend looking through the log files for unusual error messages.

If corrupted installation or configuration files are found, the best way out is usually to re-deploy affected applications. This is easy if the deployment is controlled via an automated tool like batou or zc.buildout. Restoring installation files from backup is also an option.
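To narrow down which project files deserve a closer look, one could list everything that was modified during the at-risk window, or that lives in a directory modified during that window. The following is only a rough sketch: the function name files_at_risk is made up for illustration, the window boundaries are taken from the dates given above and interpreted in local time, and modification times are only a hint, not proof of corruption.

```python
import datetime
import os

# At-risk window from the incident description (local time is an assumption).
WINDOW_START = datetime.datetime(2014, 6, 2).timestamp()
WINDOW_END = datetime.datetime(2014, 6, 8).timestamp()  # exclusive end


def files_at_risk(root, start=WINDOW_START, end=WINDOW_END):
    """Yield files below root that changed, or whose directory changed, in the window."""
    for dirpath, _dirnames, filenames in os.walk(root):
        dir_changed = start <= os.stat(dirpath).st_mtime < end
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is not accessible; skip it
            if dir_changed or start <= mtime < end:
                yield path
```

The resulting list is a starting point for manual inspection or for a comparison against a fresh deployment.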

Checking application data stores

Some applications bring in their own data store, for example ZODB or SOLR. Procedures depend on the specific software, but we can give some general suggestions here:

  • Some data stores have their own integrity checking or even repair tools on board. For example, ZODB complains about inconsistencies during packing.
  • Some data stores have means to dump their contents to an external file. During dumping, all pieces of data will be traversed and inconsistencies are likely to be discovered.
  • Some data stores can easily be rebuilt from scratch, for example caches, indexes, or session stores.

Please contact our support if you discover inconsistencies and need help to recover.


We are sorry for all the trouble the filesystem corruption incident has caused. We care about customer data and will do our best to get VMs as clean as possible. With the guidelines mentioned above, it should be possible to uncover a good portion of the corrupted files that have not been identified yet.

Improving HTTP security at the Flying Circus

We now know that the secret services employ extended eavesdropping techniques to scan and analyze nearly all Internet traffic. This worries us since we want to keep our customers’ data confidential. We get a lot of questions about how secure sites hosted at the Flying Circus are. As security has many aspects, I would like to focus on one question in this post: How secure is our HTTPS encryption? In other words, is it likely that some third party sitting in the transmission path is able to decrypt traffic between our server and the user’s browser?

We have checked everything twice to ensure a good level of security with the default configuration of our web server role. Of course no-one can guarantee absolute security, but this is what we do currently:

  • We have improved the web server configuration so that HTTPS web sites still maintain an ‘A’ rating at SSL Labs. They have recently tightened their check criteria in the light of Snowden’s revelations on NSA practices. An ‘A’ rating means that the encryption is still very hard to break.
  • We use only open-source software. There are reports that secret services try to get back doors into security products to intercept traffic after it has been decrypted. Commercial security devices are a black box: You must trust them not to forward your data elsewhere. In contrast, open source software uses only published source code. The sources are read and used by a world-wide community of developers, who are in general very security aware. We compile the source code ourselves. Although it might be possible to hide a cuckoo’s egg in the source code so advanced that it does not even get recognized by experts, this is highly unlikely.
  • We are in the process of switching on HTTP Strict Transport Security (HSTS) for all HTTPS-only sites. This means that web browsers are told to reject unencrypted connections to such a site.
  • We employ perfect forward secrecy (PFS). This means that even if captured (encrypted) traffic is stored and a decryption attack becomes available in the future, past traffic will still be undecipherable. Note that not all browsers support PFS; for example, some old IE versions on Windows XP feature only insufficient crypto. We think that it is better to reject encrypted connections from broken systems than to lull users into a false sense of security.
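At the protocol level, HSTS is nothing more than a response header named Strict-Transport-Security that the browser remembers for the given number of seconds. The post does not show how the Flying Circus web servers are configured, so the following is merely an illustrative sketch of the same idea as WSGI middleware. The function name hsts_middleware and the max-age of one year are assumptions made for this example.

```python
def hsts_middleware(app, max_age=31536000):
    """Wrap a WSGI app so that every response carries an HSTS header."""
    header = ('Strict-Transport-Security', 'max-age={}'.format(max_age))

    def wrapped(environ, start_response):
        def start_with_hsts(status, headers, exc_info=None):
            # Drop an HSTS header set by the app itself, then add ours.
            headers = [(k, v) for k, v in headers
                       if k.lower() != 'strict-transport-security']
            headers.append(header)
            return start_response(status, headers, exc_info)
        return app(environ, start_with_hsts)

    return wrapped
```

Browsers honour this header only when it arrives over HTTPS, so it makes sense only for sites that are served exclusively via HTTPS.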

What is not so good currently:

  • We are not yet able to support the newest protocol version, Transport Layer Security 1.2 (TLS 1.2). To get this running, we must upgrade some shared libraries which are central to our OS deployment. This will probably take place during our next major OS upgrade at the end of the year. TLS 1.2 is more resistant against some advanced attacks but is not supported by all browsers.

To summarize: we have implemented decent security measures to prevent third parties from deciphering encrypted web traffic. Our ‘A’ rating with SSL Labs is better than that of the majority of web sites today. There is still a library upgrade pending, but it is already on our list.

Reliable file updates with Python

Programs need to update files. Although most programmers know that unexpected things can happen while performing I/O, I often see code that has been written in a surprisingly naïve way. In this article, I would like to share some insights on how to improve I/O reliability in Python code.

Consider the following Python snippet. Some operation is performed on data coming from and going back into a file:

with open(filename) as f:
   input = f.read()
output = do_something(input)
with open(filename, 'w') as f:
   f.write(output)

Pretty simple? Probably not as simple as it looks at first glance. I often debug applications that show strange behaviour on production servers. Here are examples of failure modes I have seen:

  • A runaway server process spills out huge amounts of logs and the disk fills up. write() raises an exception right after truncating the file, leaving the file empty.
  • Several instances of our application happen to run in parallel. After they have finished, the file contains garbage because it intermingles output from multiple instances.
  • The application triggers some follow-up action after completing the write. Seconds later, the power goes off. After we have restarted the server, we see the old file contents again. The data already passed to other applications does not correspond to what we see in the file anymore.

Nothing of what follows is really new. My goal is to present common approaches and techniques to Python developers who are less experienced in system programming. I will provide code examples to make it easy for developers to incorporate these approaches into their own code.

What does “reliability” mean anyway?

In the broadest sense, reliability means that an operation performs its required function under all stated conditions. With regard to file updates, the function in question is to create, replace or extend the contents of a file. It might be rewarding to seek inspiration from database theory here. The ACID properties of the classic transaction model will serve as guidelines to improve reliability.

To get started, let’s see how the initial example can be rated against the four ACID properties:

  • Atomicity requires that a transaction either succeeds or fails completely. In the example shown above, a full disk will likely result in a partially written file. Additionally, if other programs read the file while it is being written, they get a half-finished version even in the absence of write errors.
  • Consistency denotes that updates must bring the system from one valid state to another. Consistency can be subdivided into internal and external consistency: Internal consistency means that the file’s data structures are consistent. External consistency means that the file’s contents is aligned with other data related to it. In this example, it is hard to reason about consistency since we don’t know enough about the application. But since consistency requires atomicity, we can say at least that internal consistency is not guaranteed.
  • Isolation is violated if running transactions concurrently yields different results from running the same transactions sequentially. It is clear that the code above has no protection against lost updates or other isolation failures.
  • Durability means that changes need to be permanent. Before we signal success to the user, we must be sure that our data hits non-volatile storage and not just a write cache. Perhaps the code above has been written with the assumption in mind that disk I/O takes place immediately when we call write(). This assumption is not warranted by POSIX semantics.

Use a database system if you can

If we were able to achieve all four ACID properties, we would have come a long way towards increased reliability. But this requires significant coding effort. Why reinvent the wheel? Most database systems already have ACID transactions.

Reliable data storage is a solved problem. If you need reliable storage, use a database. Chances are high that you will not do it yourself as well as those who have been working on it for years, if not decades. If you do not want to set up a “big” database server, you can use SQLite, for example. It has ACID transactions, it’s small, it’s free, and it’s included in Python’s standard library (as the sqlite3 module).
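As a rough illustration of how little code this takes, here is a minimal sketch of a persistent counter kept in an SQLite file. The table layout and the function name increment_counter are made up for this example; the connection is used as a context manager, which commits the transaction on success and rolls it back if an exception escapes.

```python
import sqlite3


def increment_counter(dbfile, name):
    """Increment a named counter inside one transaction and return its new value."""
    conn = sqlite3.connect(dbfile)
    try:
        with conn:  # commit on success, roll back on exception
            conn.execute(
                'CREATE TABLE IF NOT EXISTS counters '
                '(name TEXT PRIMARY KEY, value INTEGER NOT NULL)')
            conn.execute(
                'INSERT OR IGNORE INTO counters (name, value) VALUES (?, 0)',
                (name,))
            conn.execute(
                'UPDATE counters SET value = value + 1 WHERE name = ?',
                (name,))
            (value,) = conn.execute(
                'SELECT value FROM counters WHERE name = ?',
                (name,)).fetchone()
        return value
    finally:
        conn.close()
```

Calling increment_counter('state.db', 'hits') twice on a fresh database file returns 1 and then 2.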

The article could finish here. But there are valid reasons not to use a database. They are often tied to file format or file location constraints. Neither is easily controllable with database systems. Reasons include:

  • we must process files generated by other applications, which are in a fixed format or at a fixed location
  • we must write files for consumption by other applications (and the same restrictions apply)
  • our files must be human-readable or human-editable

…and so on. You get the point.

If we set out to implement reliable file updates on our own, there are some programming techniques to consider. In the following, I will present four common patterns of performing file updates. After that, I will discuss what steps can be taken to establish ACID properties with each file update pattern.

File update patterns

Files can be updated in a multitude of ways, but I see at least four common patterns. These will serve as a basis for the rest of this article.


Truncate-write

This is probably the most basic pattern. In the following example, hypothetical domain model code reads data, performs some computation, and re-opens the existing file in write mode:

with open(filename, 'r') as f:
   model.read(f)      # model: hypothetical domain object
model.process()
with open(filename, 'w') as f:
   model.write(f)

A variant of this pattern opens the file in read-write mode (the “plus” modes in Python), seeks to the start, issues an explicit truncate() call and rewrites the contents:

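with open(filename, 'a+') as f:
   f.seek(0)
   model.read(f)      # model: hypothetical domain object
   model.process()
   f.seek(0)
   f.truncate()
   model.write(f)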

An advantage of this variant is that we open the file only once and keep it open all the time. This simplifies locking, for example.


Write-replace

Another widely used pattern is to write new contents into a temporary file and replace the original file after that:

with tempfile.NamedTemporaryFile(
      'w', dir=os.path.dirname(filename), delete=False) as tf:
   tf.write(model.output())   # model: hypothetical domain object
   tempname = tf.name
os.rename(tempname, filename)

This method is more robust against errors than the truncate-write method. See below for a discussion of atomicity and consistency properties. It is used by many applications.
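For readers who want something to paste into a project, here is a self-contained sketch of the write-replace idea as a helper function. The name replace_file is made up for illustration. Compared to the snippet above, it also removes the temporary file if writing or renaming fails; this cleanup is my addition and not part of the pattern itself.

```python
import os
import tempfile


def replace_file(filename, contents):
    """Write contents to a temporary file, then rename it over filename."""
    # The temporary file must live in the same directory (and thus on the
    # same filesystem) as the target, otherwise rename cannot be atomic.
    dirname = os.path.dirname(os.path.abspath(filename))
    tf = tempfile.NamedTemporaryFile('w', dir=dirname, delete=False)
    try:
        with tf:
            tf.write(contents)
        os.rename(tf.name, filename)
    except BaseException:
        try:
            os.unlink(tf.name)  # do not leave a half-written file behind
        except OSError:
            pass
        raise
```

Note that NamedTemporaryFile creates the file with restrictive permissions, so the replaced file may end up with a different mode than the original. Adjust it with os.chmod() before renaming if that matters.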

These first two patterns are so common that the ext4 filesystem in the Linux kernel even detects them and fixes some reliability shortcomings automatically. But don’t depend on it: you are not always using ext4, and the administrator might have disabled this feature.


Append

The third pattern is to append new data to an existing file:

with open(filename, 'a') as f:
   f.write(model.output())   # model: hypothetical domain object

This pattern is used for writing log files and other cumulative data processing tasks. Technically, its outstanding feature is its extreme simplicity. An interesting extension is to perform append-only updates during regular operation and to reorganize the file into a more compact form periodically.


Spooldir

Here we treat a directory as a logical data store and create a new uniquely named file for each record:

with open(unique_filename(), 'w') as f:
   f.write(model.output())   # model: hypothetical domain object

This pattern shares its cumulative nature with the append pattern. A big advantage is that we can put a small amount of metadata into the file name. This can be used, for example, to convey information about the processing status. A particularly clever implementation of the spooldir pattern is the maildir format. Maildirs use a naming scheme with additional subdirectories to perform update operations in a reliable and lock-free way. The md and gocept.filestore libraries provide convenient wrappers for maildir operations.

If your file name generation is not guaranteed to give unique results, you can even demand that the file must actually be new. Use the low-level os.open() call with proper flags:

fd = os.open(filename, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
with os.fdopen(fd, 'w') as f:
   f.write(model.output())   # model: hypothetical domain object

After opening the file with O_EXCL, we use os.fdopen to convert the raw file descriptor into a regular Python file object.
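To show how O_EXCL and name generation can work together, here is a small sketch that retries with a fresh name whenever the chosen one is already taken. The helper name create_new_file, the use of uuid4 for name generation, and the retry limit are assumptions made for this example.

```python
import os
import uuid


def create_new_file(dirname, data, attempts=10):
    """Create a new uniquely named file in dirname without ever overwriting one."""
    for _ in range(attempts):
        path = os.path.join(dirname, uuid.uuid4().hex)
        try:
            # O_EXCL makes the call fail if the path already exists.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        except FileExistsError:
            continue  # name already taken: try another one
        with os.fdopen(fd, 'w') as f:
            f.write(data)
        return path
    raise RuntimeError('could not find an unused file name')
```

The existence check and the creation happen in a single system call, so there is no window in which another process could sneak in a file of the same name.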

Applying ACID properties to file updates

In the following, I will try to enhance the file update patterns. Let’s see what we can do to meet each ACID property in turn. I will keep this as simple as possible, since we are not planning to write a complete database system. Please note that the material presented in this section is not exhaustive, but it may give you a good starting point for your own experimentation.


Atomicity

The write-replace pattern gives you atomicity for free since the underlying os.rename() function is atomic. This means that at any given point in time, any process sees either the old or the new file. This pattern has a natural robustness against write errors: if the write operation triggers an exception, the rename operation is never performed and thus we are not in danger of overwriting a good old file with a damaged new one.

The append pattern is not atomic by itself, because we risk appending incomplete records. But there is a trick to make updates appear atomic: Annotate each written record with a checksum. When reading the log later on, discard all records that do not have a valid checksum. This way, only complete records will be processed. In the following example, an application makes periodic measurements and appends a one-line JSON record each time to a log. We compute a CRC32 checksum of the record’s byte representation and append it to the same line:

with open(logfile, 'ab') as f:
    for i in range(3):
        measure = {'timestamp': time.time(), 'value': random.random()}
        record = json.dumps(measure).encode()
        checksum = '{:08x}'.format(zlib.crc32(record)).encode()  # zero-padded: no stray spaces
        f.write(record + b' ' + checksum + b'\n')

This example code simulates the measurements by creating a random value for each of three records.

$ cat log
{"timestamp": 1373396987.258189, "value": 0.9360123151217828} 9495b87a
{"timestamp": 1373396987.25825, "value": 0.40429005476999424} 149afc22
{"timestamp": 1373396987.258291, "value": 0.232021160265939} d229d937

To process the log file, we read one record per line, split off the checksum, and compare it to the read record:

with open(logfile, 'rb') as f:
    for line in f:
        record, checksum = line.strip().rsplit(b' ', 1)
        if checksum.decode() == '{:08x}'.format(zlib.crc32(record)):
            print('read measure: {}'.format(json.loads(record.decode())))
        else:
            print('checksum error for record {}'.format(record))

Now we simulate a truncated write by chopping the last line:

$ cat log
{"timestamp": 1373396987.258189, "value": 0.9360123151217828} 9495b87a
{"timestamp": 1373396987.25825, "value": 0.40429005476999424} 149afc22
{"timestamp": 1373396987.258291, "value": 0.23202

When the log is read, the last incomplete line is rejected:

$ log
read measure: {'timestamp': 1373396987.258189, 'value': 0.9360123151217828}
read measure: {'timestamp': 1373396987.25825, 'value': 0.40429005476999424}
checksum error for record b'{"timestamp": 1373396987.258291, "value":'

The checksummed log record approach is used by a large number of applications including many database systems.

Individual files in the spooldir can likewise feature a checksum in each file. Another, probably easier, approach is to borrow from the write-replace pattern: first write the file aside and move it to its final location afterwards. Devise a naming scheme that protects work-in-progress files from being processed by consumers. In the following example, all file names ending with .tmp are ignored by readers and are thus safe to use during write operations:

newfile = generate_id()
with open(newfile + '.tmp', 'w') as f:
   f.write(model.output())   # model: hypothetical domain object
os.rename(newfile + '.tmp', newfile)

Finally, truncate-write is not atomic. I am sorry that I am not able to offer you an atomic variant. Right after performing the truncate operation, the file is empty and no new content has been written yet. If a concurrent program reads the file now or, worse yet, an exception occurs and our program gets aborted, we see neither the old nor the new version.


Consistency

Most things I have said about atomicity can be applied to consistency as well. In fact, atomic updates are a prerequisite for internal consistency. External consistency means updating several files in sync. As this cannot easily be done, lock files can be used to ensure that read and write access do not interfere. Consider a directory where files need to be consistent with each other. A common pattern is to designate a lock file, which controls access for the whole directory.

Example writer code:

with open(os.path.join(dirname, '.lock'), 'a+') as lockfile:
   fcntl.flock(lockfile, fcntl.LOCK_EX)
   model.update(dirname)    # hypothetical: modify files while holding the lock

Example reader code:

with open(os.path.join(dirname, '.lock'), 'a+') as lockfile:
   fcntl.flock(lockfile, fcntl.LOCK_SH)
   model.readall(dirname)   # hypothetical: read files while holding the lock

This method only works if we have control over all readers. Since there may be only one writer active at a time (the exclusive lock is blocking all shared locks), the scalability of this method is limited.

To take it one step further, we can apply the write-replace pattern to whole directories. This involves creating a new directory for each update generation and changing a symlink once the update is complete. For example, a mirroring application maintains a directory of tarballs together with an index file, which lists file name, file size, and a checksum. When the upstream mirror gets updated, it is not enough to implement an atomic file update for every tarball and the index file in isolation. Instead, we need to flip both the tarballs and the index file at the same time to avoid checksum mismatches. To solve this problem, we maintain a subdirectory for each generation and symlink the active generation:

|-- 483
|   |-- a.tgz
|   |-- b.tgz
|   `-- index.json
|-- 484
|   |-- a.tgz
|   |-- b.tgz
|   |-- c.tgz
|   `-- index.json
`-- current -> 483

Here, the new generation 484 is in the process of being updated. When all tarballs are present and the index file is up to date, we can switch the current symlink atomically. Note that os.symlink() refuses to overwrite an existing link, so we create the new symlink under a temporary name and move it over current with a single, atomic os.rename() call. Other applications always see either the complete old or the complete new generation. It is important that readers os.chdir() into the current directory and refer to files without their full path names. Otherwise, there is a race condition when a reader first opens current/index.json and then opens current/a.tgz, but in the meantime the symlink target has been changed.
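Since os.symlink() fails if the link name already exists, a common way to flip a symlink without a gap is to create the new link under a temporary name and rename it over the old one. The following is a minimal sketch of that approach; the function name switch_symlink and the temporary naming scheme are assumptions for this example.

```python
import os


def switch_symlink(link, target):
    """Point the symlink `link` at `target`, replacing the old link atomically."""
    tmp = '{}.tmp-{}'.format(link, os.getpid())
    os.symlink(target, tmp)
    try:
        # rename() replaces the old symlink itself; it does not follow it.
        os.rename(tmp, link)
    except BaseException:
        os.unlink(tmp)
        raise
```

For the directory layout shown above, switch_symlink('current', '484') would activate the new generation once it is complete.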


Isolation

Isolation means that concurrent updates to the same file are serializable: there exists a serial schedule that gives the same results as the parallel schedule actually performed. “Real” database systems use advanced techniques like MVCC to maintain serializability while allowing for a great degree of parallelism. Back on our own, we had better use locks to serialize file updates.

Locking truncate-write updates is easy. Just acquire an exclusive lock prior to all file operations. The following example code reads an integer from a file, increments it, and updates the file:

def update():
   with open(filename, 'r+') as f:
      fcntl.flock(f, fcntl.LOCK_EX)
      n = int(f.read())
      n += 1
      f.seek(0)
      f.truncate()
      f.write('{}\n'.format(n))

Locking updates using the write-replace pattern can be tricky. Using a lock the same way as in truncate-write can lead to update conflicts. A naïve implementation could look like this:

def update():
   with open(filename) as f:
      fcntl.flock(f, fcntl.LOCK_EX)
      n = int(f.read())
      n += 1
      with tempfile.NamedTemporaryFile(
            'w', dir=os.path.dirname(filename), delete=False) as tf:
         tf.write('{}\n'.format(n))
         tempname = tf.name
      os.rename(tempname, filename)

What is wrong with this code? Imagine two processes compete to update a file. The first process just goes ahead, but the second process is blocked in the fcntl.flock() call. When the first process replaces the file and releases the lock, the already open file descriptor in the second process now points to a “ghost” file (not reachable by any path name) with old contents. To avoid this conflict, we must check that our open file is still the same after returning from fcntl.flock(). So I have written a new LockedOpen context manager to replace the built-in open context. It ensures that we actually open the right file:

class LockedOpen(object):

    def __init__(self, filename, *args, **kwargs):
        self.filename = filename
        self.open_args = args
        self.open_kwargs = kwargs
        self.fileobj = None

    def __enter__(self):
        f = open(self.filename, *self.open_args, **self.open_kwargs)
        while True:
            fcntl.flock(f, fcntl.LOCK_EX)
            fnew = open(self.filename, *self.open_args, **self.open_kwargs)
            if os.path.sameopenfile(f.fileno(), fnew.fileno()):
                # the path still refers to the file we have locked
                fnew.close()
                break
            # the file was replaced while we waited for the lock: retry
            f.close()
            f = fnew
        self.fileobj = f
        return f

    def __exit__(self, _exc_type, _exc_value, _traceback):
        # closing the file also releases the lock
        self.fileobj.close()


def update():
    with LockedOpen(filename, 'r+') as f:
        n = int(f.read())
        n += 1
        with tempfile.NamedTemporaryFile(
                'w', dir=os.path.dirname(filename), delete=False) as tf:
            tf.write('{}\n'.format(n))
            tempname = tf.name
        os.rename(tempname, filename)

Locking append updates is as easy as locking truncate-write updates: acquire an exclusive lock, append, done. Long-running processes, which leave a file permanently open, may need to release locks between updates to let others in.
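A locked append can be sketched in a few lines. The helper name locked_append is made up for this example. The explicit flush before unlocking is there so that the data has actually been handed to the kernel while we still hold the lock.

```python
import fcntl


def locked_append(filename, line):
    """Append one line to filename while holding an exclusive advisory lock."""
    with open(filename, 'a') as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(line + '\n')
            f.flush()  # hand the data to the kernel before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

As with all flock-based schemes, this only protects against other writers that take the same lock.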

The spooldir pattern has the elegant property that it does not require any locking. Again, it depends on using a clever naming scheme and a robust unique file name generation. The maildir specification is a good example for a spooldir design. It can be easily adapted to other cases, which have nothing to do with mail.


Durability

Durability is a bit special because it depends not only on the application, but also on the OS and hardware configuration. In theory, we can assume that os.fsync() or os.fdatasync() calls do not return until data has reached permanent storage. In practice, we may run into several problems: we may be facing incomplete fsync implementations or awkward disk controller configurations, which never give any persistence guarantee. A talk from a MySQL dev goes into great detail of what can go wrong. Some database systems like PostgreSQL even offer a choice of persistence mechanisms so that the administrator can select the best suited one at runtime. The poor man’s option, though, is to just use os.fsync() and hope that it has been implemented correctly.

With the truncate-write pattern, we have to issue an fsync after finishing write operations but before closing the file. Note that there is usually another level of write caching involved. A userspace buffer (glibc’s stdio buffer in Python 2, the io module’s buffer in Python 3) holds back writes inside the process even before they are passed to the kernel. To get this buffer empty as well, we have to flush() it before fsync’ing:

with open(filename, 'w') as f:
   f.write(model.output())   # model: hypothetical domain object
   f.flush()
   os.fsync(f.fileno())

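Alternatively, you can open the file in binary mode with buffering=0 to bypass the userspace buffer altogether. Note that Python’s -u flag does not help here: it only affects the standard streams, not files opened with open().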

I prefer os.fdatasync() over os.fsync() most of the time to avoid synchronous metadata updates (ownership, size, mtime, …). Metadata updates can result in seeky disk I/O, which slows things down quite a bit.

Applying the same trick to write-replace style updates is only half of the story. We make sure that the newly written file has been pushed to non-volatile storage before replacing the old file, but what about the replace operation itself? We have no guarantee that the directory update is performed right away. There are lengthy discussions on the net about how to sync a directory update, but in our case (old and new file are in the same directory) we can get away with this rather simple solution:

os.rename(tempname, filename)
dirfd = os.open(os.path.dirname(filename), os.O_DIRECTORY)
os.fsync(dirfd)
os.close(dirfd)

We open the directory with the low-level os.open() call (Python’s built-in open() does not support opening directories) and perform an os.fsync() on the directory’s file descriptor.

Persisting append updates is again quite similar to what I have said about truncate-write.

The spooldir pattern has the same directory sync problems as the write-replace pattern. Fortunately, the same solution applies here as well: first sync the file, then sync the directory.
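Putting the pieces of this section together, a durable write-replace update could look roughly like the sketch below. The function name durable_replace is made up for this example, and the code assumes a POSIX system where directories can be opened with os.O_DIRECTORY. Whether the data really survives a power failure still depends on the OS and hardware, as discussed above.

```python
import os
import tempfile


def durable_replace(filename, contents):
    """Replace filename with contents and sync both the file and its directory."""
    dirname = os.path.dirname(os.path.abspath(filename))
    with tempfile.NamedTemporaryFile('w', dir=dirname, delete=False) as tf:
        tf.write(contents)
        tf.flush()             # userspace buffer -> kernel
        os.fsync(tf.fileno())  # kernel -> storage device
        tempname = tf.name
    os.rename(tempname, filename)
    dirfd = os.open(dirname, os.O_DIRECTORY)
    try:
        os.fsync(dirfd)        # persist the changed directory entry
    finally:
        os.close(dirfd)
```

The order matters: the file contents are synced before the rename, so a crash can never expose a renamed but incompletely written file.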


Summary

It is possible to update files reliably. I have shown that all four ACID properties can be met. The code examples presented above may serve as a toolbox. Pick the programming techniques that match your needs best. At times, you don’t need all four ACID properties but only one or two. I hope that this article helps you to make an informed decision about what to implement and what to leave out.

Python 2 and 3 compatible builds with zc.buildout

Creating a single-source build environment with zc.buildout that works for both Python 2 and 3 is a bit of a hassle. This blog post shows how to do it for a minimal demo project.

During the sprints at PyCon DE 2012, we tried to make the upcoming 1.0 release of the nagiosplugin library compatible with both Python 2.7 and Python 3.2. Going for a single code base (without preprocessing steps like 3to2) was not too hard. The only thing left was a single-source zc.buildout setup suited for both Python 2.7 and 3.2. It worked out in the end, but currently it needs two buildout configurations. This is a little bit kludgy. I hope that things will improve in the near future so that a single-source build environment with zc.buildout will be possible.

In the following, I will demonstrate the steps with a simple demo project called MultiVersion. It contains nothing more than a single class that is supposed to run under both Python 2 and 3. There is also a unit test to verify that the code works. We use zope.testrunner to run the unit tests. The code’s functionality is irrelevant for the examples, so I left it out. You can download the full source if you are interested.

1. Use a recent enough virtualenv

Older versions of virtualenv are generally not suited since they ship with obsolete releases of distribute and pip. Check if the virtualenv included in your GNU/Linux distribution is too old. Anything below 1.8 reduces the chance of success, so in that case it is better to install a current virtualenv locally. Likewise, our bootstrap script must be recent enough to support both Python 2 and 3. The standard one does not currently work with Python 3.

Now we are ready to create a virtualenv in a fresh source checkout.

Python 3.2:

$ virtualenv -p python3.2 .
Running virtualenv with interpreter /usr/bin/python3.2
New python executable in ./bin/python3.2
Installing distribute.....done.
Installing pip.....done.

Python 2.7:

$ virtualenv -p python2.7 .
Running virtualenv with interpreter /usr/bin/python2.7
New python executable in ./bin/python2.7
Not overwriting existing python script ./bin/python (you must use ./bin/python2.7)
Installing setuptools.....done.
Installing pip.....done.

2. Running buildout with Python 3.2

I will discuss the steps for Python 3.2 first, since main development will concentrate on newer Python versions. After that, I will describe the necessary steps to make the build environment backward compatible.

To run zc.buildout, we need a buildout.cfg file. I prefer to pin package versions in all projects to ensure reliable builds. As of writing this blog post, there is just an alpha release of zc.buildout that supports Python 3.2. Unfortunately, this version of zc.buildout supports Python 3.2 only, so don’t try this with Python 3.3.

My basic buildout.cfg looks like this:

[buildout]
allow-picked-versions = false
develop = .
newest = false
package = multiversion
parts = multiversion test
versions = versions

[versions]
distribute = 0.6.28
z3c.recipe.scripts = 1.0.1
zc.buildout = 2.0.0a2
zc.recipe.egg = 2.0.0a2
zc.recipe.testrunner = 1.4.0
zope.exceptions = 4.0.1
zope.interface = 4.0.1
zope.testrunner = 4.0.4

[multiversion]
recipe = zc.recipe.egg
eggs = ${buildout:package}
interpreter = py

[test]
recipe = zc.recipe.testrunner
eggs = ${buildout:package}
defaults = ['--auto-color']

In my experience, it is best to pin distribute to exactly the same version that is included in virtualenv’s support files. While differing versions are possible, they may trigger hard-to-find bugs since it is not always clear which version is used in which step.
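To double-check that nothing has slipped through unpinned, the configuration can be read with Python's standard configparser module. The following is a rough sketch, not part of the demo project: `read_pins` and `unpinned` are hypothetical helpers, and the sample text assumes the conventional `[buildout]` and `[versions]` section headers.

```python
import configparser

# Shortened sample configuration; section names are assumed, see above.
SAMPLE_CFG = """\
[buildout]
allow-picked-versions = false
parts = multiversion test
versions = versions

[versions]
distribute = 0.6.28
zc.buildout = 2.0.0a2
zc.recipe.egg = 2.0.0a2
"""


def read_pins(cfg_text):
    """Return the pinned versions as a dict of package name -> version."""
    # Interpolation is switched off so that ${...} references stay untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep package names exactly as written
    parser.read_string(cfg_text)
    section = parser["buildout"]["versions"]
    return dict(parser[section])


def unpinned(required, pins):
    """List the required packages that have no version pin."""
    return sorted(name for name in required if name not in pins)
```

For example, `unpinned(["distribute", "zope.interface"], read_pins(SAMPLE_CFG))` reports `zope.interface`, because the shortened sample leaves it out.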

I use the Python interpreter from my virtualenv’s bin directory while creating the buildout executable. This saves me from using activate/deactivate scripts which are slightly cumbersome in my opinion.

$ bin/python3.2 bootstrap.py
Creating directory 'blog-python-2-3/parts'.
Creating directory 'blog-python-2-3/develop-eggs'.
Generated script 'blog-python-2-3/bin/buildout'.

$ bin/buildout
Develop: 'blog-python-2-3/.'
Installing multiversion.
Generated interpreter 'blog-python-2-3/bin/py'.
Installing test.
Generated script 'blog-python-2-3/bin/test'.

Now we have a working build for Python 3.2:

$ bin/test
Running zope.testrunner.layer.UnitTests tests:
  Set up zope.testrunner.layer.UnitTests in 0.000 seconds.
  Ran 1 tests with 0 failures and 0 errors in 0.002 seconds.
Tearing down left over layers:
  Tear down zope.testrunner.layer.UnitTests in 0.000 seconds.

3. Running buildout with Python 2.7

Unfortunately, the current zc.buildout alpha release does not work with anything except Python 3.2. Running bootstrap.py fails:

$ bin/python2.7 bootstrap.py
Getting distribution for 'zc.buildout==2.0.0a2'.
  Getting distribution for 'zc.buildout==2.0.0a2'.
Error: Couldn't find a distribution for 'zc.buildout==2.0.0a2'.

There is no single zc.buildout distribution that fits both Python 2.7 and 3.2. To get around this, I need to create a special-case buildout.cfg that changes version pinnings for incompatible packages. Besides zc.buildout, zc.recipe.egg needs different versions for Python 2.7 and 3.2 as well.

I create buildout-2.x.cfg (slightly grumbling):

[buildout]
extends = buildout.cfg

[versions]
zc.buildout = 1.6.3
zc.recipe.egg = 1.3.2

This one does the job when used with both bootstrap and buildout:

$ bin/python2.7 bootstrap.py -c buildout-2.x.cfg
Generated script 'blog-python-2-3/bin/buildout'.

$ bin/buildout -c buildout-2.x.cfg
Develop: 'blog-python-2-3/.'
Installing multiversion.
Generated interpreter 'blog-python-2-3/bin/py'.
Installing test.
Generated script 'blog-python-2-3/bin/test'.
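The effect of the special-case file can be pictured with plain dictionaries: values from the extending file win over those from buildout.cfg. The sketch below only models that idea and does not use zc.buildout's own code; `effective_versions` and `config_for` are illustrative names, and the pins are a shortened copy of the ones shown above.

```python
def effective_versions(base_pins, override_pins):
    """Return the pins in effect: base values first, then the overrides."""
    merged = dict(base_pins)
    merged.update(override_pins)
    return merged


def config_for(python_version):
    """Pick the config file for an interpreter given as (major, minor)."""
    return "buildout-2.x.cfg" if python_version[0] == 2 else "buildout.cfg"


# Shortened copies of the version pins from the two config files.
BASE = {
    "zc.buildout": "2.0.0a2",
    "zc.recipe.egg": "2.0.0a2",
    "zope.interface": "4.0.1",
}
PY2_OVERRIDES = {"zc.buildout": "1.6.3", "zc.recipe.egg": "1.3.2"}
```

A wrapper script could call `config_for(sys.version_info[:2])` to decide which file to pass to `-c`, so nobody has to remember the right one for each interpreter.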

We now have a build environment that builds single-source code for both Python 2.7 and 3.2 using zc.buildout. Of course, this technique could be extended to support even more versions. But I hope that the incompatible packages will be updated in the near future so that the need for special-case buildout.cfg files will go away. What seems to be most missing: a release of zc.buildout that supports all major Python versions.


  • Use a current virtualenv version.
  • Use a compatible bootstrap.py.
  • Pin your package versions.
  • Versions for some packages (including zc.buildout) must be special-cased.


I would like to thank Andrei Chirila and Michael Howitz for a great sprint session.

Don’t stop PostgreSQL’s autovacuum with your application

The problem

Some weeks ago, we received a complaint from a customer about bad PostgreSQL performance for a specific application. I took a look into the database and found strange things going on: the query planner was executing “interesting” query plans, tables were bloated with lots of dead rows (one was 6 times as big as it should be), and so on.

The cause revealed itself when looking at pg_stat_user_tables:

abc-hans=# SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;
        relname        | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
-----------------------+-------------+-----------------+--------------+------------------
 object_ref            |             |                 |              |
 archived_current      |             |                 |              |
 archived_item         |             |                 |              |
 pgtextindex           |             |                 |              |
 archived_container    |             |                 |              |
 archived_chunk        |             |                 |              |
 commit_lock           |             |                 |              |
 archived_blob_info    |             |                 |              |
 archived_state        |             |                 |              |
 archived_blob_link    |             |                 |              |
 archived_item_deleted |             |                 |              |
 pack_object           |             |                 |              |
 archived_class        |             |                 |              |
 archived_object       |             |                 |              |
 object_state          |             |                 |              |
 object_refs_added     |             |                 |              |
 blob_chunk            |             |                 |              |

Despite heavy write activity on the database, no table had ever seen autovacuum or autoanalyze. But why?

As I delved into it, I noticed that the application effectively stopped PostgreSQL’s autovacuum/autoanalyze in two ways. I’d like to share our findings to help other programmers avoid getting trapped in situations like this.

Unfinished transactions

It turned out that the application had one component which connected to the database and opened a transaction right after startup, but never finished that transaction:

abc-hans=# SELECT procpid, current_timestamp - xact_start AS xact_runtime, current_query
FROM pg_stat_activity ORDER BY xact_start;
 procpid |  xact_runtime   | current_query
---------+-----------------+-----------------------
   18915 | 11:46:20.8783   | <IDLE> in transaction
   21289 | 11:18:20.07042  | <IDLE> in transaction

Note that the database server had been started about 11 ¾ hours earlier in this example. Vacuuming (whether automatic or manual) stops at the oldest transaction ID that is still in use. Otherwise it would remove row versions that a still-running transaction may need to see, which is not sensible at all. In our example, vacuuming is stopped right away since the oldest running transaction is almost as old as the running server instance. At least this is easy to resolve: we got the developers to fix the application. Now it finishes every transaction in a sensible amount of time with either COMMIT or ABORT.
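The pattern of always finishing a transaction can be sketched as follows. To keep the example self-contained, it uses the sqlite3 module from Python's standard library as a stand-in for a PostgreSQL driver; the `docs` table and the `record_document` function are invented for illustration and have nothing to do with the customer's application.

```python
import sqlite3


def record_document(conn, docid, text):
    """Insert one row and always leave the transaction finished."""
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the connection never sits idle inside an open transaction.
    with conn:
        conn.execute(
            "INSERT INTO docs (docid, text) VALUES (?, ?)", (docid, text)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (docid INTEGER PRIMARY KEY, text TEXT)")

record_document(conn, 1, "first version")
try:
    record_document(conn, 1, "duplicate key")  # violates the primary key
except sqlite3.IntegrityError:
    pass  # the failed transaction has already been rolled back
```

Whatever driver is used, the point is the same: every code path, including the error path, ends with a commit or a rollback.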

Exclusive table locks

Unfortunately, this was not all of it: autovacuum was working now, but only sporadically. A little bit of research revealed that autovacuum will abort if it is not able to obtain a table lock within one second, and guess what: the application made quite heavy use of table locks. We found a hint that something suspicious was going on in the PostgreSQL log:

postgres[13251]: [40-1] user=,db= ERROR:  canceling autovacuum task

Searching the application source brought up several places where table locks were used. Example:

stmt = """
DELETE FROM %(table)s WHERE docid = %%s;
INSERT INTO %(table)s (docid, coefficient, marker, text_vector)
VALUES (%%s, %%s, %%s, %(clause)s)
""" % {'table': self.table, 'clause': clause}

The textindex code was particularly problematic as it often dealt with large documents. Statements like the one above could easily place enough load on the database server to cause frequent autovacuum aborts.

The developers said that they had introduced the locks because of concurrency issues. As we could not get rid of them, I installed a nightly cron job to force-vacuum the database. PostgreSQL has shown much improved query responses since then. Some queries’ completion times even improved by a factor of 10. I’ve been told that in the meantime they have found a way to remove the locks, so the cron job is not necessary anymore.


PostgreSQL shows good auto-tuning and is a pretty low-maintenance database server if you allow it to perform its autovacuum/autoanalyze tasks regularly. We have seen that application programs may effectively put autovacuum out of business. In this particular case, unfinished transactions and extensive use of table locks were the show-stoppers. Since we identified and removed these causes, our PostgreSQL database has been running smoothly again.

We are currently in the process of integrating some of the most obvious signs of trouble into the standard database monitoring on our managed hosting platform to catch those problems quickly as they show up.
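Such checks could look roughly like the sketch below. It is not our actual monitoring code: the function names and the ten-minute threshold are assumptions, and the rows are expected as plain dictionaries keyed by the column names from the pg_stat_activity and pg_stat_user_tables queries shown earlier.

```python
from datetime import timedelta


def stale_transactions(activity_rows, max_age=timedelta(minutes=10)):
    """Return the PIDs of sessions idling in a transaction for too long."""
    return [
        row["procpid"]
        for row in activity_rows
        if row["current_query"] == "<IDLE> in transaction"
        and row["xact_runtime"] > max_age
    ]


def never_maintained(table_rows):
    """Return tables that have seen neither manual nor automatic vacuum."""
    return [
        row["relname"]
        for row in table_rows
        if row["last_vacuum"] is None and row["last_autovacuum"] is None
    ]
```

Fed with the two result sets from this article, the first function would report both idle sessions and the second one would list every table.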

gocept talks at PyCon DE

No blog post for quite a while… part of the reason is that we gocept developers were busy preparing talks for PyCon DE 2011. As a result, we presented an impressive 7 talks/tutorials at this lovely conference.

Curious? Here is a list of all sessions (most with video recordings). Please be aware that nearly all of this stuff is in German.



No luck with glusterfs

Recently, we’ve been experimenting with glusterfs as an alternative network storage backing our VM hosting. It looked like a very promising candidate to replace our current iSCSI stack: scale-out with decent performance, mostly self-configuring, self-replicating, self-healing. And all of this out-of-the-box without complex setup. In contrast, the conventional architecture with a complex layering of iSCSI targets, DRBD, and Linux-HA glued together with a pack of shell scripts looks rather 90’s.

We played with glusterfs for a while. Setting up and configuring the software went quite smoothly compared to the traditional stuff. But after some stress testing in a replicated scenario, we found severe problems.


On the storage, each virtual machine is basically represented as one big image file. This image can grow to several hundred gigabytes. This is OK as long as the replicated file servers are in sync. But once one goes offline and online again, the versions of the image may differ and the self-healing algorithm is triggered. Due to glusterfs’ architecture, this happens entirely on the filesystem client (i.e., the KVM host). After re-connecting a file server, all VM I/O is paused until self-healing is complete. The live VM is stuck for anything between several seconds and more than a minute. A considerable portion of our hosting cluster could freeze for minutes. This is clearly unacceptable. Re-connecting a previously disconnected file server would be a risky operation: quite the opposite of what replication is good for.

No global state

Another feature of glusterfs is that replication is handled entirely on the filesystem client and not on the server. This leads to an orthogonal and modular approach which has a lot of advantages. But it makes it hard to determine when a file server can be disconnected safely: Given that self-healing takes a considerable amount of time, we cannot be sure if there is still some self-heal operation in progress. But disconnecting a replicated file server which had the newer copy of a VM image before the other file server has caught up would render the VM unusable. Unfortunately, there seems to be no easy way to query a glusterfs file server for active self-healing operations. This makes disconnecting a file server a risky operation, too.

Good for its intended use

In summary, we learned that glusterfs’ architecture is a good fit for the use case it was originally designed for: an NFS replacement with lots of small files. But for our scenario, where continuously running processes need uninterrupted access to a few large image files, glusterfs does not seem to be the best fit.

So we will stick to the good ol’ iSCSI stack for now. Perhaps Ceph or Sheepdog will become viable alternatives in the future once they stabilise.