Flying Circus at EuroPython 2014

If you’re attending EP14, be sure to visit our Flying Circus booth at BCC level A! We’re here to discuss web operations. Managed hosting is only as good as the people behind it. So just walk over, test us, ask any question related to web operations! Additionally, we have some demo VMs readily available so you can get hands-on experience with a walk-through from our developers.

Flying Circus booth at EP14


Follow up actions after the filesystem corruption incident

On 2014-06-07, the Flying Circus experienced a rather unfortunate filesystem corruption incident. Most of the VMs have been cleaned up since then, but a few defective files are still around. In this article, I’ll provide some background information on what types of corruption we saw, what you (as our customer) can expect from platform management to rectify the situation, and what everyone can do to check their own applications.

Observed types of filesystem corruption

The incident resulted in lost updates on the block layer. This means that some filesystem blocks were reverted to an older state. Depending on what kind of information has been saved in the affected blocks, this may lead to different effects:

  • files show old content, as a file update got lost;
  • files show random content, as updates to the file’s extent list got lost;
  • files have disappeared completely, as updates to their containing directory got lost.

On most VMs which experienced filesystem corruption, filesystem metadata has been rendered invalid as well. We were able to identify these VMs quickly and contacted the affected customers immediately. However, there are still some cases left where filesystem metadata has not been affected (so the automated checks did not find anything), but file contents have been affected. Generally, files that were updated between 2014-06-02 and 2014-06-07, or that live in directories that saw changes during that time, are at risk.
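
As a starting point for narrowing down candidates, here is a minimal sketch that lists files whose modification time, or whose directory's modification time, falls into the time range given above. The function name `find_at_risk` and the interpretation of the dates in local time are assumptions made for this example. Modification times are only a heuristic: a lost inode update can revert the timestamp as well, so a file that is not listed is not proven intact.

```python
import datetime
import os


def find_at_risk(root, start, end):
    """Return paths below root whose own mtime, or whose directory's mtime,
    lies within [start, end] (both given as Unix timestamps)."""
    at_risk = []
    for dirpath, dirnames, filenames in os.walk(root):
        dir_changed = start <= os.lstat(dirpath).st_mtime <= end
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                file_changed = start <= os.lstat(path).st_mtime <= end
            except OSError:
                # The entry cannot be inspected at all: worth a manual look.
                file_changed = True
            if dir_changed or file_changed:
                at_risk.append(path)
    return at_risk


# Time range from the article, interpreted in the machine's local time zone
# (an assumption of this sketch).
WINDOW_START = datetime.datetime(2014, 6, 2).timestamp()
WINDOW_END = datetime.datetime(2014, 6, 7, 23, 59, 59).timestamp()
```

Calling `find_at_risk("/path/to/project", WINDOW_START, WINDOW_END)` returns a list of paths to review by hand; the project path is a placeholder.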

These cases of corruption are impossible to detect via filesystem checks. To make sure that all VMs are in a reasonably good state, twofold action is necessary: First, we will check the OS and all managed components as part of our platform management. Second, we ask you to take a look at your applications to uncover previously hidden cases of filesystem corruption.

Platform-wide checks

After taking short-term action to ensure that we will not run into a similar problem again, we are currently in the process of performing a deep scan of all installed OS files and managed components. In particular, we are going to:

  • perform a consistency check on all files installed from OS packages;
  • perform integrity checks on all managed databases (PostgreSQL, LDAP, …);
  • reboot all VMs to ensure that there is no stale cached content.

Inconsistencies that we find will be repaired automatically if possible (e.g., OS files). As far as application data is concerned (e.g., databases), we will contact you to work out available options to restore consistency. VM reboots will take place during announced maintenance periods as usual.
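
The idea behind a consistency check on installed files can be sketched as follows. This is not the tooling used on the platform; it assumes a hypothetical manifest that maps file paths to known-good SHA-256 digests, whereas a real check would take its reference checksums from the package manager's own records.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file without loading it at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Compare files against a manifest (dict: path -> expected hex digest).

    Returns a dict mapping each problematic path to a short description.
    """
    problems = {}
    for path, expected in sorted(manifest.items()):
        try:
            actual = sha256_of(path)
        except (IOError, OSError):
            problems[path] = "missing or unreadable"
            continue
        if actual != expected:
            problems[path] = "checksum mismatch"
    return problems
```

An empty result means every listed file matches its recorded checksum. Files that are not in the manifest are not covered at all, which is one reason why application data needs the separate checks described next.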

Application-specific checks

It is not possible to perform an automated deep check of project files as we do for OS files. Too much context knowledge is necessary to judge what is OK and what looks suspicious. So we ask you to take a critical look at your applications yourself. Our experience so far shows that signs of filesystem corruption reveal themselves quite quickly once one starts to look for them.

Two areas that need to be checked are static project files, like application software and configuration, and application-specific databases.

Checking application installations

Our first and most important recommendation is to restart all applications and check for obvious signs of trouble. This is easy and points to most problems right away.

Additionally, the applications’ log files should be inspected. Filesystem corruption sometimes causes “illogical” errors to show up in the log files. We recommend looking through the log files for unusual error messages.
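
A first pass over large log files can be automated. The following is a minimal sketch; the keyword list in `SUSPICIOUS` is an assumption and should be adapted to what counts as unusual for the application at hand.

```python
import re

# Hypothetical keyword list; tune it to the application's log format.
SUSPICIOUS = re.compile(r"traceback|error|corrupt|unexpected|invalid", re.IGNORECASE)


def suspicious_lines(lines, pattern=SUSPICIOUS):
    """Yield (line_number, text) for every log line that matches the pattern."""
    for number, line in enumerate(lines, 1):
        if pattern.search(line):
            yield number, line.rstrip("\n")
```

Since an open file is an iterable of lines, it can be passed in directly. The matches still need a human eye: the point is to shorten the list, not to decide what is corrupt.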

If corrupted installation or configuration files are found, the best way out is usually to re-deploy affected applications. This is easy if the deployment is controlled via an automated tool like batou or zc.buildout. Restoring installation files from backup is also an option.

Checking application data stores

Some applications bring their own data store, for example ZODB or Solr. Procedures depend on the specific software, but we can give some general suggestions here:

  • Some data stores have their own integrity checking or even repair tools on board. For example, ZODB complains about inconsistencies during packing.
  • Some data stores have means to dump their contents to an external file. During dumping, all pieces of data will be traversed and inconsistencies are likely to be discovered.
  • Some data stores can easily be rebuilt from scratch, for example caches, indexes, or session stores.
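
To illustrate the first two suggestions, here is a sketch that uses SQLite as a stand-in data store, simply because it ships with Python. It is not a recipe for ZODB or Solr. The helper `check_sqlite` is hypothetical: it runs SQLite's built-in integrity check and then dumps the whole database, which forces every row to be read.

```python
import sqlite3


def check_sqlite(path):
    """Run the built-in integrity check, then traverse all data by dumping it.

    Returns (healthy, number_of_dumped_statements).
    """
    connection = sqlite3.connect(path)
    try:
        # On-board integrity checker: a single row containing "ok" means healthy.
        result = [row[0] for row in connection.execute("PRAGMA integrity_check")]
        # Dumping touches every table and row; read errors surface as exceptions.
        statements = sum(1 for _ in connection.iterdump())
    finally:
        connection.close()
    return result == ["ok"], statements
```

An exception during the dump, or a result other than "ok", is a reason to look closer or to restore from backup.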

Please contact our support if you discover inconsistencies and need help to recover.

Summary

We are sorry for all the trouble the filesystem corruption incident has caused. We care about customer data and will do our best to get VMs as clean as possible. With the guidelines mentioned above, it should be possible to uncover a good portion of the corrupted files that have not been identified yet.


Heartbleed bug and the Flying Circus

tl;dr: The Flying Circus is not affected by the Heartbleed bug

As several media outlets have reported, there is a serious bug in the OpenSSL library, widely known as the Heartbleed bug. The bug was introduced in the OpenSSL development tree on January 1st, 2012 and was first released with OpenSSL version 1.0.1.

The Flying Circus platform makes use of the Gentoo Linux distribution. The OpenSSL version maintained for the Flying Circus is 1.0.0j, so it is not affected by the Heartbleed bug.

To be sure we are not affected by the bug, for example through possible backports by the Gentoo maintainer, we also audited the OpenSSL sources in use and confirmed that they do not implement the vulnerable heartbeat function.
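
For a quick check of a version string, the affected release range (1.0.1 through 1.0.1f; 1.0.1g contains the fix) can be encoded as in the sketch below. The function names are made up for this example, and the 1.0.2 beta releases are deliberately not covered. As noted above, a version check alone says nothing about backports or build options, so it does not replace an audit of the sources.

```python
import re


def parse_openssl_version(version):
    """Split a version such as '1.0.1f' into ((1, 0, 1), 'f')."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)([a-z]*)", version)
    if match is None:
        raise ValueError("unrecognized OpenSSL version: %r" % version)
    numbers = tuple(int(part) for part in match.group(1, 2, 3))
    return numbers, match.group(4)


def version_in_heartbleed_range(version):
    """True for the affected releases 1.0.1 through 1.0.1f."""
    numbers, letter = parse_openssl_version(version)
    # An empty letter (plain 1.0.1) sorts before "g" and counts as affected.
    return numbers == (1, 0, 1) and letter < "g"
```

With the version mentioned above, `version_in_heartbleed_range("1.0.0j")` is false. The version that a Python interpreter is linked against is available as `ssl.OPENSSL_VERSION`; its second whitespace-separated field is usually the version string.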

The Flying Circus is not affected by Heartbleed and there was no time in the past when we had rolled out a vulnerable version.

There is no need to replace your certificates or keys.


Improving HTTP security at the Flying Circus

We now know that the secret services employ extended eavesdropping techniques to scan and analyze nearly all Internet traffic. This worries us since we want to keep our customers’ data confidential. We get a lot of questions about how secure sites hosted at the Flying Circus are. As security has many aspects, I would like to focus on one question in this post: How secure is our HTTPS encryption? In other words, is it likely that some third party sitting in the transmission path is able to decrypt traffic between our server and the user’s browser?

We have checked everything twice to ensure a good level of security with the default configuration of our web server role. Of course no-one can guarantee absolute security, but this is what we do currently:

  • We have improved the web server configuration so that HTTPS web sites still maintain an ‘A’ rating at SSL Labs. They have recently tightened their check criteria in the light of Snowden’s revelations on NSA practices. An ‘A’ rating means that the encryption is still very hard to break.
  • We use only open-source software. There are reports that secret services try to get back doors into security products to intercept traffic after it has been decrypted. Commercial security devices are a black box: you must trust them not to forward your data elsewhere. In contrast, open-source software consists only of published source code. The sources are read and used by a world-wide community of developers, who are in general very security-aware. We compile the source code ourselves. Although it might be possible to hide a cuckoo’s egg in the source code that is so advanced that even experts do not recognize it, this is highly unlikely.
  • We are in the process of switching on HTTP Strict Transport Security (HSTS) for all HTTPS-only sites. This means that web browsers are told to reject unencrypted connections to such a site.
  • We employ perfect forward secrecy (PFS). This means that even if captured (encrypted) traffic is stored and the server’s private key is compromised later, past traffic will still be undecipherable. Note that not all browsers support PFS; for example, some old IE versions on Windows XP offer only insufficient crypto. We think that it is better to reject encrypted connections from broken systems than to lull users into a false sense of security.
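
Two of the points above can be checked from the outside: the HSTS response header and the negotiated cipher suite. The helpers below are a minimal sketch with made-up names. `parse_hsts` splits a Strict-Transport-Security header value into its directives, and `offers_forward_secrecy` looks at an OpenSSL-style cipher suite name to see whether it uses an ephemeral Diffie-Hellman key exchange.

```python
def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a dict.

    Directives without a value (such as includeSubDomains) map to True.
    """
    directives = {}
    for part in header_value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or True
    return directives


def offers_forward_secrecy(cipher_name):
    """True if an OpenSSL-style suite name uses ephemeral (EC)DH key exchange.

    Only the classic naming scheme is handled; TLS 1.3 suite names
    (TLS_...) follow a different scheme and are outside this sketch.
    """
    return cipher_name.startswith(("ECDHE-", "DHE-", "EDH-"))
```

For example, a suite name such as ECDHE-RSA-AES128-GCM-SHA256 counts as forward-secret, while AES128-SHA (plain RSA key exchange) does not.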

What is not so good currently:

  • We are not able to support the newest protocol version, Transport Layer Security 1.2 (TLS 1.2). To get this running, we must upgrade some shared libraries which are central to our OS deployment. This will probably take place during our next major OS upgrade at the end of the year. TLS 1.2 is more resistant against some advanced attacks but is not supported by all browsers.

To summarize: we have implemented decent security measures to prevent third parties from deciphering encrypted web traffic. Our ‘A’ rating with SSL Labs is better than that of the majority of web sites today. There is still a library upgrade pending, but it is already on our list.


Run tests using layers with py.test

TL;DR

Long Story

We have many test suites which use test layers (e.g. the ones from plone.testing). We want to use py.test and all its fancy features to have a modern test runner. There was no way to convert such tests partially: either you had to port the whole project or you were stuck with zope.testrunner.

At our Pyramid sprint, Godefroid Chapelle, Thomas Lotze and I wrote a package which wraps layers as py.test fixtures. The result is gocept.pytestlayer.

Implementation

For each layer it creates two fixtures: one for the layer setUp/tearDown and one for the testSetUp/testTearDown. The layer fixture is configured for class scope, but the plug-in orders the tests and knows about the next test, so the layer is only torn down if the next test needs another fixture.
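
The nesting of the two fixtures can be modeled with plain context managers. This is not the code of gocept.pytestlayer, and it needs neither py.test nor plone.testing; `RecordingLayer`, `layer_fixture`, `per_test_fixture` and `run_tests` are hypothetical names that only demonstrate the order in which the layer hooks run.

```python
from contextlib import contextmanager


class RecordingLayer(object):
    """Hypothetical layer that records the order in which its hooks run."""

    def __init__(self):
        self.calls = []

    def setUp(self):
        self.calls.append("setUp")

    def tearDown(self):
        self.calls.append("tearDown")

    def testSetUp(self):
        self.calls.append("testSetUp")

    def testTearDown(self):
        self.calls.append("testTearDown")


@contextmanager
def layer_fixture(layer):
    """Outer scope: set the layer up once, tear it down after its last test."""
    layer.setUp()
    try:
        yield layer
    finally:
        layer.tearDown()


@contextmanager
def per_test_fixture(layer):
    """Inner scope: wraps every single test."""
    layer.testSetUp()
    try:
        yield layer
    finally:
        layer.testTearDown()


def run_tests(layer, tests):
    """Run all tests that share a layer inside one outer scope."""
    with layer_fixture(layer):
        for test in tests:
            with per_test_fixture(layer):
                test(layer)
```

Running two tests on one layer calls setUp once, wraps each test in testSetUp/testTearDown, and calls tearDown once at the end. Grouping the tests by layer is what makes the single outer scope possible, which corresponds to the test ordering done by the plug-in.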

Usage

You only have to add a new section to your package’s buildout. Running the tests via

bin/py.test -x

detects the layers and displays the needed setup code. See the PyPI page of the package for details.

Future

Maybe it is possible to get rid of the fixture setup code, so running tests using layers gets even easier.


Viewing scales metrics from Pyramid

We’ve recently started experimenting with the excellent scales library to collect in-process metrics (see Coda Hale’s CodeConf talk “Metrics everywhere” among many others for reasons why one definitely wants to do that).

Scales comes with a flask-based HTTP server that allows viewing the collected measurements and dumping them as JSON. But if you already are in a web application, there’s no real need to spin up yet another thread, open another port etc. to do this. In our case, we’re using Pyramid, so here’s a quick recipe to get the same view that greplin.scales.flaskhandler provides:

Update 2013-11-06: This code is now released as pyramid_scales.

# in your Pyramid setup
config.add_route('scales', '/scales/*prefix')

from StringIO import StringIO
from pyramid.view import view_config
import greplin.scales
import greplin.scales.formats
import greplin.scales.util  # provides lookup(), which the view uses below

@view_config(route_name='scales', renderer='string')
def scales_stats(request):
    parts = request.matchdict.get('prefix')
    path = '/'.join(parts)
    stats = greplin.scales.util.lookup(greplin.scales.getStats(), parts)

    output = StringIO()
    outputFormat = request.params.get('format', 'html')
    query = request.params.get('query', None)
    if outputFormat == 'json':
        request.response.content_type = 'application/json'
        greplin.scales.formats.jsonFormat(output, stats, query)
    elif outputFormat == 'prettyjson':
        request.response.content_type = 'application/json'
        greplin.scales.formats.jsonFormat(output, stats, query, pretty=True)
    else:
        request.response.content_type = 'text/html'
        # XXX Dear pyramid.renderers.string_renderer_factory,
        # you can't be serious
        request.response.default_content_type = 'not-text/html'
        output.write('<html>')
        greplin.scales.formats.htmlHeader(output, '/' + path, __name__, query)
        greplin.scales.formats.htmlFormat(output, tuple(parts), stats, query)
        output.write('</html>')

    return output.getvalue()
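
The view hands the path segments captured by *prefix to greplin.scales.util.lookup in order to select a subtree of the collected stats. The idea of such a lookup can be illustrated with a stand-in; `lookup_path` is a hypothetical function written for this example and not the library's implementation.

```python
def lookup_path(stats, parts, fallback=None):
    """Walk nested dicts along the path segments in parts.

    Returns the fallback as soon as a segment cannot be resolved.
    """
    node = stats
    for part in parts:
        if not isinstance(node, dict) or part not in node:
            return fallback
        node = node[part]
    return node
```

With stats such as `{"web": {"requests": 42}}`, the segments `("web", "requests")` select the single value and an empty tuple selects the whole tree, which matches what a request to /scales/web/requests and to /scales/ would be expected to show.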

Reproducible automated deployments on Raspberry Pi with batou

For continuous integration during development, we use Jenkins to automatically run tests for all projects we maintain. Some time ago we wanted to increase visibility of the results, so we set up a Raspberry Pi driving a few meters of LPD8806-based LED strip on which we can address single LEDs to represent the status of individual or aggregated builds.

Automating deployments is a good idea…

After an SD card failure, we were painfully reminded how hard it can be to set up a service where all parts were deployed manually. Fortunately, we had written at least some minimal documentation on how to set everything up, so after a few days we were presented with many broken builds. Of course nobody cared about the build status with all LEDs being dark. :(

Let’s automate!

Today we wondered if we could use our deployment tool batou to make reproducible deployments to a Raspberry Pi, and did some tests on a vanilla Raspbian image (2013-07-26 “Wheezy”).

Preparing your Raspberry Pi

Of course, you cannot deploy to it without some simple preparations. First, batou needs to be able to log in to the target host with a public SSH key, so we copied our public key to the Raspi, which has the address 192.168.0.5 in this example:

local> ssh-copy-id pi@192.168.0.5
pi@192.168.0.5's password:
Type password of user pi, default: "raspberry"

(If you don’t have ssh-copy-id, you have to manually append your SSH public key to /home/pi/.ssh/authorized_keys, which you will need to create on a plain installation.)
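
The manual route can be scripted as well. The following sketch is meant to run on the Raspi itself; `add_authorized_key` is a hypothetical helper that creates the .ssh directory if necessary, appends the key only if it is not there yet, and restricts the file permissions, since SSH servers commonly ignore key files that other users can write to.

```python
import os


def add_authorized_key(home, public_key):
    """Append public_key to <home>/.ssh/authorized_keys unless already present.

    Returns True if the key was added, False if it was already there.
    """
    ssh_dir = os.path.join(home, ".ssh")
    if not os.path.isdir(ssh_dir):
        os.mkdir(ssh_dir, 0o700)
    path = os.path.join(ssh_dir, "authorized_keys")
    key = public_key.strip()
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    if key in (line.strip() for line in content.splitlines()):
        return False
    with open(path, "a") as handle:
        # Do not glue the new key onto an unterminated last line.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(key + "\n")
    os.chmod(path, 0o600)
    return True
```

On the Raspi, this would be called as `add_authorized_key("/home/pi", key_text)`, where `key_text` is the single-line content of your public key file.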

Manually install minimal requirements

Batou also has a few requirements which are needed to bootstrap the environment:

  • mercurial – to pull the buildout which sets up batou
  • python-virtualenv – to create a clean python environment for the buildout
  • python-dev – to compile libcrypto against

Note: We are currently working on batou 1.0 which most likely will no longer need any of these.

You can install all the requirements at once with the following command on your raspi:

pi> sudo aptitude install mercurial python-virtualenv python-dev

Prepare batou

Now you are ready to do your first batou deployment to a Raspberry Pi. For our experiments we created a small hello-world batou deployment. It contains a test component which deploys a file /tmp/test with the content “foo” to a Raspberry Pi specified by its IP address.

To begin, clone the repository on your local machine:

local> hg clone https://bitbucket.org/gocept/batou-on-raspberrypi
local> cd batou-on-raspberrypi

Now, edit environments/pi.cfg and set the IP-address of your Raspi.

To create the necessary scripts for the deployment, run buildout. It creates a sandbox containing all dependencies of batou and the scripts you can use to deploy:

local> python bootstrap.py
…
local> bin/buildout

Deploy!

After some minutes, your batou deployment sandbox will be ready for use. You most likely modified environments/pi.cfg, so you need to check in that change first, because batou refuses to deploy a dirty working copy.
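
Whether a working copy counts as dirty can be checked up front. The sketch below is not batou's own check; it assumes that any line printed by hg status (modified, added, removed or untracked files) makes the working copy dirty, which may be stricter than what batou requires.

```python
import subprocess


def is_clean(status_output):
    """True if the given `hg status` output lists no files at all."""
    return not any(line.strip() for line in status_output.splitlines())


def working_copy_is_clean(repository="."):
    """Run `hg status` in the repository and evaluate its output."""
    output = subprocess.check_output(["hg", "status"], cwd=repository)
    return is_clean(output.decode("utf-8", "replace"))
```

If `working_copy_is_clean()` returns false in the deployment checkout, commit the pending changes as shown in the next step before deploying.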

local> hg ci -m 'change ip of my raspi'

To run the deployment, call batou-remote with the name of the environment (“pi”, which corresponds to environments/pi.cfg). Because the ssh user you use to connect with the target host differs from your local user, you have to specify it with --ssh-user.

local> bin/batou-remote pi --ssh-user=pi

Batou will now set itself up on the remote side and deploy all components specified in pi.cfg. To show that it worked, check whether the deployed file contains the correct content:

pi> cat /tmp/test
foo

Further readings

To learn more about batou, check http://batou.readthedocs.org.

If you want to deploy your real-life, mission-critical Python applications into a fully automated environment using batou, head over to the Flying Circus.

TL;DR

  • Creating reproducible automated deployments for your software is great fun.
  • Preparing a raspi to be a target host for batou 0.2.12 based deployments is easy:
    • Install python-virtualenv, mercurial and python-dev.
    • Put your ssh public key on the raspi.
  • Example deployment can be found on bitbucket.