Announcing Zope Autumn Sprint 2020

Earl Zope was delighted that in May 2020, owing to the pandemic situation, a few of his principal supporters gathered around the virtual campfire to improve the welfare of Python 3 Wonderland. At this very campfire the supporters agreed to meet again in 2020 to crown the new Earl Zope V. Hopes are high for the upcoming relationship between Earl Zope V and Prince Plone VI, namely that these two will bring a prosperous future to their respective countries.

As the final release of Zope 5 was roughly scheduled for September 2020, we consulted the time schedule at gocept and found the 28th of September to be a good date. As the pandemic situation is still with us, we propose another remote sprint. As discussed earlier, a remote sprint has less organizational overhead, so even a single day is a valid option and should help us with the release process.

Goals

The main goal is to release Zope 5 final, so that Plone 6 can be released later this year on top of this version of Zope. There is a project on GitHub with the relevant tickets. If you want to start working on some of those tickets already, feel free to contribute beforehand, or add further ideas to the list.

In addition to this big goal, the unification of the testing environment remains an open task.

Organisation

In May, we lit our campfires on Slack and Zoom. To keep up this good, low-barrier solution, we ask you to register via Meetup so we can prepare the invitations in time.

haproxy load-balancing for PHP applications with sticky sessions

We like applications that are written with a shared-nothing approach: it greatly simplifies running multiple instances on multiple hosts and allows for simple, robust load-balancer configuration.

Recently, we had to deploy a PHP application that – at the last minute – turned out to use PHP sessions and thus required sticky sessions.

We haven’t used sticky sessions in a while and the amount of reading required to find the specific working setup was substantial, so we’ll repeat here what a post at networkinghowtos.com already figured out:

backend default
    # reuse the PHP session cookie to stick clients to the backend that created it
    appsession PHPSESSID len 64 timeout 3h request-learn prefix

As you can see there isn’t much magic to it – the haproxy manual has a good, detailed explanation of the appsession option. The main point of this option is that haproxy does not inject yet another session identifier but simply piggybacks on the existing one that PHP sets. Also, this option combines nicely with “leastconn” balancing if your application only uses cookies on a few selected pages and many users never trigger getting a session cookie.

September 18th–20th: DevOps Sprint

Since we have a strong history in web development, but have also been involved in operating the web applications we developed, the DevOps movement struck a chord with us.

Under the brand name “Flying Circus” we are establishing a platform that respects the DevOps principles.

A large portion of our day-to-day work is dedicated to DevOps-related topics. We like to collaborate by sharing ideas and working on tools we all need to make operations and development of web applications a smooth experience. A guiding question: how can we improve the operability of web applications?

A large field of sprintable topics comes to our mind:

Logging

Enable web application developers to integrate logging mechanisms into their apps easily. By using modern tools like Logstash for collecting and analyzing the data, operators are able to find the causes of performance and other problems efficiently.
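
As a small illustration of the direction we have in mind: an application can emit structured, single-line JSON log records that a collector like Logstash can ingest without custom parsing. This is only a sketch using the Python standard library; the field names are arbitrary.

import json
import logging


class JSONFormatter(logging.Formatter):
    """Render log records as single-line JSON, easy to ship to Logstash."""

    def format(self, record):
        return json.dumps({
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
log = logging.getLogger('myapp')
log.addHandler(handler)
log.warning('slow request: %s took %.1fs', '/search', 3.2)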

Live-Debugging and Monitoring

Monitoring is a must when operating software. At least for some people (including ourselves), Nagios is not the best fit for DevOps teams.

Deploying

We have always wanted reproducible, automated deployments. Coming from the Zope world, we started with zc.buildout and eventually developed our own deployment tool, batou. More recently, projects such as Ansible have emerged, as well as tools (more or less) bound to cloud services like Heroku.

Backup

After using bacula for a while, we started to work on backy, which aims to work directly on volume files of virtual machines.

and more…

Join us to work on these things and help make DevOps better! The sprint will take place at our office, Forsterstraße 29, Halle (Saale), Germany. On September 20th we will have a great party in the evening.

If you want to attend, please sign up on http://www.meetup.com/DevOps-Sprint/events/191582682/.

 

Accommodation

For your stay in Halle, we can recommend the following hotels: “City Hotel am Wasserturm”, “Dorint Hotel Charlottenhof”, “Dormero Hotel Rotes Ross”. For those on a budget, there is the youth hostel Halle (http://halle.djh-sachsen-anhalt.de/). Everything is within walking distance of our office.

How we organize large-scale roll-outs

In the coming week we will deploy an extensive OS update to our production environment, which currently consists of 41 physical hosts running 195 virtual machines.

Updates like this are prepared very carefully in many small steps using our development and staging setups, which reflect exactly the same environment as our production systems in the data center.

Nevertheless, we learned to expect the unexpected when deploying to our production environment. This is why we established the one/few/many paradigm for large updates. The remainder of this post talks about our scheduling mechanism to determine which machines are updated at what point in time.

Automated maintenance scheduling

The Flying Circus configuration management database (CMDB) keeps track of the times that are acceptable to each customer for scheduled maintenance. When a machine determines that a particular automated activity will be disruptive (e.g. because it makes the system temporarily unstable or reboots it), it requests a maintenance period from the CMDB based on the customer’s preferences and the estimated duration of the downtime. Customers are then automatically notified about what will happen at what time.
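
Conceptually, the request a machine makes boils down to something like the following sketch. This is purely illustrative; the function and attribute names here are made up and do not reflect our actual CMDB API.

def request_maintenance(cmdb, machine, activity, estimated_duration):
    """Ask the CMDB for a maintenance window for a disruptive activity."""
    # The CMDB knows which time windows the customer considers acceptable.
    preferences = cmdb.maintenance_preferences(machine.customer)
    window = cmdb.schedule(machine=machine.name,
                           duration=estimated_duration,
                           within=preferences)
    # The customer is notified automatically about the assigned slot.
    cmdb.notify(machine.customer, activity=activity, window=window)
    return window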

This alone is not enough to handle a large update that affects all machines (and thus all customers), but it is the mechanical foundation for the next step.

Maintenance weeks

When we roll out a large update we prefer to add additional padding for errors, and so we invented the “maintenance week”: we can ask the CMDB to proactively schedule relatively large maintenance windows for all machines in a given pattern.

Here’s a short version of how this schedule is built when an administrator pushes the “Schedule maintenance week” button in our CMDB (all times in UTC):

  1. Monday 09:00 – automation management, monitoring, and binary package compilation get updated
  2. Monday 13:00 – the first router and one storage server are updated
  3. Monday 17:00 – internal test machines (our litmus machines) and a small but representative set of customer machines that are marked as test environments get updated
  4. Tuesday 17:00 – the remainder of customer test machines, up to 5% of untested production VMs, and 20% of the storage servers are updated
  5. Wednesday 17:00 – 30% of the production VMs get updated and 30% of the storage servers are updated
  6. Thursday 17:00 – the remaining production VMs and storage servers get updated
  7. Saturday 09:00 – KVM hosts are updated and rebooted
  8. Saturday 13:00 – the second router is updated
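
To give a rough idea of the logic behind that button, here is a condensed sketch. The actual implementation lives in our CMDB and additionally handles machine selection, locations and notifications; the code below only computes the time slots listed above.

import datetime

# Offsets from Monday 00:00 UTC, mirroring the list above.
PHASES = [
    (datetime.timedelta(days=0, hours=9), 'automation management, monitoring, packages'),
    (datetime.timedelta(days=0, hours=13), 'first router, one storage server'),
    (datetime.timedelta(days=0, hours=17), 'litmus machines, representative test VMs'),
    (datetime.timedelta(days=1, hours=17), 'remaining test VMs, 5% untested prod VMs, 20% storage'),
    (datetime.timedelta(days=2, hours=17), '30% prod VMs, 30% storage'),
    (datetime.timedelta(days=3, hours=17), 'remaining prod VMs and storage servers'),
    (datetime.timedelta(days=5, hours=9), 'KVM hosts (with reboot)'),
    (datetime.timedelta(days=5, hours=13), 'second router'),
]


def maintenance_week(monday):
    """Yield (start, description) pairs for a week starting on `monday` at 00:00 UTC."""
    for offset, description in PHASES:
        yield monday + offset, description


# Example: maintenance_week(datetime.datetime(2014, 6, 2)) yields the slots
# for the week starting on Monday, June 2nd 2014.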

Once the schedule has been established, customers are informed by email about the assigned slots. An internal cross-check ensures that all machines in the affected location do have a window assigned for this week.

Maintenance week schedule

This procedure causes the number of machines that get updated to rise from Monday (22 machines) to Thursday (about 100 machines). Any problem we find on Monday can be fixed on a small number of machines, and we can provide a bugfix that avoids the issue completely on later days.

However, if you read the list carefully you are probably asking yourself: Why are customer VMs without tests updated early? Doesn’t this force customers without tests to experience outages more heavily?

Yes. And in our opinion this is a good thing: First, in the earlier phases we have a smaller number of machines to deal with. Any breakage that occurs on Monday or Tuesday can be dealt with in a more timely manner than unexpected breakage on Wednesday or Thursday, when many machines are updated at once. Second, if your service is critical then you should feel the pain of not having tests (similar to the pain you experience if you don’t write unit tests and try to refactor). We believe that “herd immunity” gives you a false sense of security; we would rather have unexpected errors occur early and visibly, so they can be approached with a good fix, instead of hiding them as long as possible.

We’re looking forward to our updates next week. Obviously we’re preparing for unexpected surprises, but what will they have in store for us this time?

We also appreciate feedback: How do you prepare updates for many machines? Is there anything we’re missing? Anything that sounds like a good idea to you? Let us know – leave a comment!

News from the toolbox: gocept.selenium and our plans for its future

For a couple of years, we at gocept have been developing a Python library, gocept.selenium, whose goal is to integrate testing web sites in real browsers with the Python unittest framework. There are a number of approaches to doing this; when first starting real-browser tests, we opted for selenium. Back then, it had not been integrated with webdriver yet (more on webdriver below).

There turned out to be multiple aspects to selenium integration: setting up the web server under test, starting a browser to run selenium and pointing it at the server, but also designing a wrapper around the selenium testing API to bring it in line with unittest’s way of defining specialised assertions.

We came up with the gocept.selenium package which includes both a selenese module defining such an API wrapper and a bunch of modules for integration with those web-server frameworks that we happen to use in our work, among them generic WSGI and a number of Zope-related servers. The integration mechanism is implemented in terms of test layers, so all of this requires the Zope test runner to be used. We released a 1.0 version of gocept.selenium in November 2012, marking the selenese API as stable.

The description of the package given so far already indicates two aspects that have yet to be addressed: Firstly, the selenium project is based on webdriver nowadays, with the old selenium implementation being kept for backwards compatibility for the moment. Secondly, collecting all those server integration modules in the same package that implements the actual selenium integration makes for rather complex (albeit optional) package dependencies and poses a maintainability problem.

We have dealt with the latter in December 2012, extracting all those integration modules from gocept.selenium into a new package, gocept.httpserverlayer. From the package’s documentation:
»This package provides an HTTP server for testing your application with normal HTTP clients (e.g. a real browser). This is done using test layers, which are a feature of zope.testrunner. gocept.httpserverlayer uses plone.testing for the test layer implementation, and exposes the following resources (accessible in your test case as self.layer[RESOURCE_NAME]):

  • http_host: The hostname of the HTTP server (Default: localhost)
  • http_port: The port of the HTTP server (Default: 0, which means chosen automatically by the operating system)
  • http_address: hostname:port, convenient to use in URLs (e.g. 'http://user:password@%s/path' % self.layer['http_address'])

In addition to generic WSGI and static-file serving, the server frameworks supported at this point (i.e. gocept.httpserverlayer 1.0.1) include Zope3/ZTK (both using zope.app.testing and zope.app.wsgi with the latter supporting Grok) as well as Zope2 and Plone (using ZopeTestCase, WSGI or plone.testing.z2).
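
To illustrate how the resources are used, here is a minimal sketch of a test against a trivial WSGI application. The exact constructor of the WSGI layer is an assumption here; see the package documentation for the integration flavour you need.

import unittest
import urllib.request  # urllib2 on Python 2

import gocept.httpserverlayer.wsgi  # assumed module path for the WSGI flavour


def application(environ, start_response):
    # Trivial WSGI app standing in for the system under test.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


# Assumption: the WSGI integration exposes a Layer that wraps the application.
HTTP_LAYER = gocept.httpserverlayer.wsgi.Layer(application)


class SmokeTest(unittest.TestCase):

    layer = HTTP_LAYER  # picked up by zope.testrunner

    def test_server_answers(self):
        # http_address is one of the resources described above.
        url = 'http://%s/' % self.layer['http_address']
        self.assertEqual(b'hello', urllib.request.urlopen(url).read())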

After the creation of gocept.httpserverlayer, we released the 1.1 series of gocept.selenium which no longer brings its own integration code. For the sake of backwards compatibility, though, it still implements separate TestCase classes for each of the integration flavours.

This leaves webdriver support to be dealt with. Originally, we had hoped to simply sneak it in, having to change very little client code, if any at all. Our plan was to implement the old API (both for test setup and selenese) in terms of webdriver which should allow us to benefit from webdriver immediately, as some issues with the old selenium were causing trouble in our daily work (including the behaviour of type and typeKeys as well as drag-and-drop). We started a branch of gocept.selenium where we switched from integrating legacy selenium to talking to webdriver and changed the selenese implementation to use webdriver commands.

However, it turned out that a number of details couldn’t be completely hidden, and webdriver brought its own share of problems (including, sadly, new issues with drag-and-drop). We tried out our branch in a real project to the point that all tests would pass again, and ended up with a long list of upgrade notes describing incompatibilities, some temporary and some not, that cause semantic differences in behaviour or necessitate changes to the test code. We identified a number of pieces of the old selenese API that we wouldn’t bother implementing, and we still had a few large projects that would help discover more things to watch out for.

It became clear that sneaking webdriver into an existing selenium test suite wasn’t the way to get to use it soon. So, instead of continuing to develop the branch and replacing the selenium-based implementation in gocept.selenium 2, we merged the branch now, in such a way that we have two different selenium integrations available at the same time, usable simultaneously in the same project. That way, new browser tests can be added using the webdriver integration layer, and existing tests can be migrated to using webdriver test case by test case, as needed.

We have made alpha releases of gocept.selenium 2 so people may experiment with the webdriver integration. Note that while the current implementation of the test layer (gocept.selenium.webdriver.Layer) contains some code to deal with Firefox, we have successfully run it against Chrome as well. While the integration layer exposes a raw webdriver object as the seleniumrc resource, there is also the WebdriverSeleneseLayer which offers a resource named selenium, which is the old selenese API implemented in terms of webdriver and can be used together with the base layer.
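
As a rough sketch of how the two layers fit together (layer composition and exact import paths are assumptions; in a real project the webdriver layer also needs to know about the HTTP server under test):

import unittest

import gocept.selenium.webdriver


# Assumption: layers can be composed via plone.testing-style `bases`.
WD_LAYER = gocept.selenium.webdriver.Layer(name='WebdriverLayer')
SELENESE_LAYER = gocept.selenium.webdriver.WebdriverSeleneseLayer(
    name='WebdriverSeleneseLayer', bases=(WD_LAYER,))


class FrontPageTest(unittest.TestCase):

    layer = SELENESE_LAYER

    def test_via_raw_webdriver(self):
        # 'seleniumrc' is the raw webdriver object exposed by the base layer.
        driver = self.layer['seleniumrc']
        driver.get('http://localhost:8080/')
        assert 'My application' in driver.title

    def test_via_selenese(self):
        # 'selenium' is the old selenese-style API implemented on top of
        # webdriver; method names follow the classic Selenium RC commands.
        sel = self.layer['selenium']
        sel.open('/')
        sel.assertTitle('My application')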

We are currently working towards a stable gocept.selenium 2 release that includes webdriver support at the level described, while at the same time thinking about how our ideal testing API might be structured in order to integrate with the unittest API concepts while making better use of the object-oriented raw webdriver API than the current selenese does. If you are interested in using webdriver in conjunction with the Python unittest framework, you are very welcome to try out the current state of gocept.selenium 2 and get back to us with ideas and suggestions.

Happy new year – cleaning up the server room!

Welcome to 2013!

Alex and I are using this time of the year when most of our colleagues are still on holidays to perform maintenance on our office infrastructure.

To prepare for all the goodness we have planned for the Flying Circus in 2013 we decided to upgrade our internet connectivity (switching from painful consumer-grade DSL/SDSL connections to fibre, yeah!) and also to clean up our act in our small private server room. For that we decided to buy a better UPS and PDUs and a new rack to get some space between the servers, and to clean up the wiring.

Yesterday we prepared the parts we can do ourselves in preparation for the electricians coming in on Friday to start installing that nice Eaton 9355 8kVA UPS.

So, while the office was almost empty, the two of us used our experience with the data center setups we build to turn a single rack (pictures of which we’re too ashamed to post) into this:

 

Although the office was almost abandoned, those servers do serve a real purpose and we had to be careful to avoid massive interruptions, as they handle:

  • our phone system and office door bell
  • secondary DNS for our infrastructure and customer domains
  • chat and support systems
  • monitoring with business-critical alerting

Here’s how we did it:

  • Power down all the components that are not production-related and move them from the existing rack (the right one in the picture above) to the new one. For this we had already split our rack logically between “infrastructure development” and “office business” machines.
  • Move the development components (1 switch, 7 servers, 1 UPS) to the new rack. Wire everything up again (nicely!) and power it up. Use the power-up cycle to verify that IPMI remote control works. Also note which machines don’t boot cleanly (which we only found on machines whose kernels are under development anyway, yay).
  • Notice that the old UPS isn’t actually able to carry the load of all those servers and keep one of them turned off until the new UPS is installed.
  • Now that we had space in the existing rack, we re-distributed the servers there as well to make the arrangement more logical (routers, switches, and other telco-related stuff at the top). Turn off individual servers one by one and keep everyone in the office informed about short outages.
  • Install new PDUs in the space we got after removing superfluous cables. Get lots of scratches while taking stuff out and in.
  • Update our inventory databases, take pictures, write blog post. 🙂

As the existing setup was quite old and had grown over time, we were pretty happy to be able to apply the lessons we have learned in the years in between and get everything cleaned up in less than 10 hours. We noticed the following things that we did differently this time (and have been doing so in the data center for a while already):

  • Create bundles of the network cables for each server (we use 4 per server) and plug them into the switch in a systematic pattern; label them once with the server name at each end. Colors indicate VLANs.
  • Use real PDUs for both IEC and Schuko equipment. Avoid consumer-grade power distribution.
  • Leave a rack unit between components to allow working on them without hurting yourself, to have the flexibility to pass wires (KVM) to the front, and to avoid temperature peaks within the rack.
  • Having over-capacity makes it easier to keep things clean, which in turn makes you more flexible and gives you the peace of mind to focus on the important stuff.

As the pictures indicate we aren’t completely done installing all the shiny new things, so here’s what’s left for the next days and weeks:

  • Wait for the electricians and Eaton to install and activate our new UPS.
  • Wire up the new PDUs with the new UPS and clean up the power wiring for each server.
  • Wait for the telco to finish digging and putting fibre into the street and get their equipment installed so we can enjoy a “real” internet connection.

All in all, we had a very productive and happy first working day in 2013. If this pace keeps up then we should encounter the singularity sometime in April.

yafowil in a Pyramid project

In a new Pyramid project we used deform to render forms. We did not really like it. (The reasons might be detailed in another post.)

To see whether other form libraries do better, I gave yafowil a try at our gocept Developer Punsch 3. yafowil comes with written documentation, but to get a form into our Pyramid application I had to figure out some things which are not so clearly documented:

  • Let the project depend on yafowil.webob via setup.py as it contains the necessary WebOb integration.
  • Import the loader from yafowil as shown below to allow yafowil to register all its known components (including all the packages in the yafowil.widget namespace). Otherwise I got strange errors. (The loader symbol itself is not needed in the rest of the form code.)
from yafowil import loader  # the import itself registers all known components
  • To get a value displayed in the rendered form, use the value keyword parameter in the factory like this:
form['name'] = factory('field:label:text',
                       props=dict(label=u'name', required=True),
                       value=value_getter)

value can be a plain value or a function which gets the widget and the runtime data of the widget as parameters (see the sketch after this list).

  • Some widgets need JavaScript libraries. Their integration with Pyramid or Fanstatic is not part of the framework; yafowil.base.Factory.resources_for could be a starting point. (I did not do this yet, so it might be wrong.)
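
For illustration, such a value getter could look like this (a minimal sketch; the example data is made up):

def get_name(widget, data):
    # yafowil calls this with the widget and its runtime data and expects
    # the value to display in return.
    person = {'name': u'Alice'}  # stand-in for the object being edited
    return person['name']


form['name'] = factory('field:label:text',
                       props=dict(label=u'name', required=True),
                       value=get_name)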

Conclusion: yafowil looks like an interesting framework, and after finding a starting point it should be usable in Pyramid, too. Maybe this post can help to ease the start a bit.