How we organize large-scale roll-outs

In the coming week we will deploy an extensive OS update to our production environment, which currently consists of 41 physical hosts running 195 virtual machines.

Updates like this are prepared carefully in many small steps using our development and staging setups, which mirror the environment of our production systems in the data center exactly.

Nevertheless, we have learned to expect the unexpected when deploying to our production environment. This is why we established a one/few/many paradigm for large updates. The remainder of this post describes the scheduling mechanism that determines which machines are updated at what point in time.

Automated maintenance scheduling

The Flying Circus configuration management database (CMDB) keeps track of the times that are acceptable to each customer for scheduled maintenance. When a machine determines that a particular automated activity will be disruptive (e.g. because it makes the system temporarily unstable or requires a reboot), it requests a maintenance window from the CMDB based on the customer's preferences and the estimated duration of the downtime. Customers are then automatically notified about what will happen at what time.
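
To make this concrete, here is a minimal, self-contained sketch of what such a window request boils down to. The data layout (per-customer acceptable hours) and the function name are assumptions for illustration only, not our actual CMDB API:

    from datetime import datetime, timedelta

    # Illustrative only: the real CMDB stores acceptable maintenance hours
    # per customer and per machine; a small dict stands in for it here.
    ACCEPTABLE_HOURS = {
        # customer -> list of (weekday, start_hour, end_hour) in UTC; Monday == 0
        "customer-a": [(1, 22, 24), (2, 22, 24)],  # Tue/Wed late evenings
        "customer-b": [(5, 8, 18)],                # Saturdays during the day
    }

    def request_window(customer, estimated_duration, now=None):
        """Return the earliest acceptable start time that fits the duration."""
        now = now or datetime.utcnow()
        candidates = []
        for weekday, start_hour, end_hour in ACCEPTABLE_HOURS[customer]:
            for day_offset in range(1, 15):  # look two weeks ahead
                day = (now + timedelta(days=day_offset)).date()
                if day.weekday() != weekday:
                    continue
                start = datetime(day.year, day.month, day.day, start_hour)
                end = start + timedelta(hours=end_hour - start_hour)
                if start + estimated_duration <= end:
                    candidates.append(start)
        return min(candidates) if candidates else None

    # Example: a disruptive update with an estimated 45 minutes of downtime.
    print(request_window("customer-a", timedelta(minutes=45)))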

This alone is not enough to handle a large update that affects all machines (and thus all customers), but it is the mechanical foundation of the next step.

Maintenance weeks

When we roll out a large update we prefer to add extra padding for errors, which is why we invented the “maintenance week”: we ask the CMDB to proactively schedule relatively large maintenance windows for all machines in a given pattern.

Here’s a short version of how this schedule is built when an administrator pushes the “Schedule maintenance week” button in our CMDB (all times in UTC; a small sketch of the scheduling logic follows the list):

  1. Monday 09:00 – automation management, monitoring, and binary package compilation get updated
  2. Monday 13:00 – the first router and one storage server are updated
  3. Monday 17:00 – internal test machines (our litmus machines) and a small but representative set of customer machines that are marked as test environments get updated
  4. Tuesday 17:00 – the remainder of customer test machines, up to 5% of untested production VMs, and 20% of the storage servers are updated
  5. Wednesday 17:00 – 30% of the production VMs get updated and 30% of the storage servers are updated
  6. Thursday 17:00 – the remaining production VMs and storage servers get updated
  7. Saturday 09:00 – KVM hosts are updated and rebooted
  8. Saturday 13:00 – the second router is updated
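
As referenced above, here is a deliberately reduced sketch of how such a pattern could be turned into concrete windows. The group names, percentages, and pattern table are illustrative assumptions; the real implementation lives in the CMDB and works on the live inventory:

    from datetime import datetime, timedelta

    # (day offset from Monday, hour UTC, description, machine group)
    PATTERN = [
        (0, 9,  "management, monitoring, package build",  "infrastructure"),
        (0, 13, "first router, one storage server",        "router+storage-1"),
        (0, 17, "litmus and selected customer test VMs",   "test-sample"),
        (1, 17, "rest of test VMs, 5% prod, 20% storage",  "wave-1"),
        (2, 17, "30% of prod VMs, 30% of storage",         "wave-2"),
        (3, 17, "remaining prod VMs and storage servers",  "wave-3"),
        (5, 9,  "KVM hosts",                               "kvm"),
        (5, 13, "second router",                           "router-2"),
    ]

    def schedule_maintenance_week(monday, machines_by_group):
        """Return machine -> (window start, description) for one week."""
        schedule = {}
        for day_offset, hour, description, group in PATTERN:
            start = monday + timedelta(days=day_offset, hours=hour)
            for machine in machines_by_group.get(group, []):
                schedule[machine] = (start, description)
        return schedule

    # Toy inventory, just to show the shape of the result.
    machines = {"infrastructure": ["mgmt00", "mon00"], "kvm": ["kvm01"]}
    week = schedule_maintenance_week(datetime(2013, 4, 22), machines)
    for machine, (start, what) in sorted(week.items()):
        print(machine, start.isoformat(), "-", what)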

Once the schedule has been established, customers are informed by email about their assigned slots. An internal cross-check ensures that every machine in the affected location has a window assigned for the week.
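
The cross-check itself is conceptually tiny. A minimal sketch (with made-up machine names) of what it verifies:

    def unscheduled_machines(machines_in_location, schedule):
        """Return machines that do not have a maintenance window assigned."""
        return sorted(set(machines_in_location) - set(schedule))

    # Toy example: "vm17" has no window and would block the roll-out.
    missing = unscheduled_machines(
        ["mgmt00", "kvm01", "vm17"],
        {"mgmt00": "Mon 09:00", "kvm01": "Sat 09:00"},
    )
    print("machines without a window:", missing)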

Maintenance week schedule

This procedure causes the number of machines updated per day to rise from Monday (22 machines) to Thursday (about 100 machines). Any problem we find on Monday affects only a small number of machines, and we can provide a bugfix so that the issue is avoided completely on later days.

However, if you read the list carefully you are probably asking yourself: why are customer VMs without tests updated early? Doesn’t this expose customers without tests to outages more heavily?

Yes. And in our opinion this is a good thing. First, in the earlier phases we have fewer machines to deal with: any breakage that occurs on Monday or Tuesday can be handled more promptly than unexpected breakage on Wednesday or Thursday, when many machines are updated at once. Second, if your service is critical then you should feel the pain of not having tests (similar to the pain you experience when you try to refactor without unit tests). We believe that “herd immunity” gives you a false sense of security; we would rather have unexpected errors occur early and visibly, so they can be met with a proper fix, than have them hidden for as long as possible.

We’re looking forward to our updates next week. Obviously we’re preparing for unexpected surprises, but what will they have in store for us this time?

We also appreciate feedback: How do you prepare updates for many machines? Is there anything we’re missing? Anything that sounds like a good idea to you? Let us know – leave a comment!

developer & admin BBQ IV

For the fourth time, the “developer & admin BBQ” will take place on 30 April 2013 at 14:00. The event offers software developers and administrators a forum to exchange ideas, problems, and solutions. In a format based on “Open Space”, every participant has the opportunity to bring their own topics, which can then be worked on in smaller groups.

In the past, concrete technical problems have been covered, for example “cross-platform development for mobile devices” or “Pymp your (vim|emacs) – useful editor extensions for Python development”. But more conceptual topics around agile development processes (“Test Driven Development”) or application operations (“Deploying applications and the 12-factor app”) have been discussed as well.

As for the previous events, there is once again a list of suggested topics this time:

  • Raspberry Pi – possibilities, limits, alternatives?
  • Practice makes perfect: code katas.
  • The “CMS zoo” – in search of a reasonable CMS.
  • Test first, then code – Test Driven Development
  • Ceph – a fast, stable, and scalable distributed file system in production use

From about 19:00 there will be a shared dinner (weather permitting, around a campfire with a grill) and the opportunity to keep exchanging ideas and get to know each other in a relaxed atmosphere.

As before, the BBQ will once again take place at gocept gmbh & co. kg, Forsterstraße 29 in Halle.

Registrations as well as suggestions for further topics can be entered here. The event is also listed on meetup. The Facebook event may likewise be used to register.

P.S.: This post was originally written in German since we are addressing a local audience. Basically we want developers and admins in our area to meet up, exchange ideas, and enjoy BBQ.

Happy new year – cleaning up the server room!

Welcome to 2013!

Alex and I are using this time of the year, when most of our colleagues are still on holiday, to perform maintenance on our office infrastructure.

To prepare for all the goodness we have planned for the Flying Circus in 2013 we decided to upgrade our internet connectivity (switching from painful consumer-grade DSL/SDSL connections to fibre, yeah!) and also to clean up our act in our small private server room. For that we decided to buy a better UPS, new PDUs, and a new rack to get some space between the servers, and to clean up the wiring.

Yesterday we took care of the parts we can do ourselves, in preparation for the electricians coming in on Friday to start installing that nice Eaton 9355 8kVA UPS.

So, while the office was almost empty, the two of us managed to use our experience with the data center setups we build to turn a single rack (pictures of which we’re too ashamed to post) into this:

Although the office was almost abandoned, those servers do serve a real purpose, and we had to be careful to avoid overly massive interruptions, as they handle:

  • our phone system and office door bell
  • secondary DNS for our infrastructure and customer domains
  • chat and support systems
  • monitoring with business-critical alerting

Here’s how we did it:

  • Power down all the components that are not production-related and move them from the existing rack (the right one in the front picture) to the new one. Our rack was already logically split between “infrastructure development” and “office business” machines, which made this easy.
  • Move the development components (1 switch, 7 servers, 1 UPS) to the new rack. Wire everything up again (nicely!) and power it up. Use the power-up cycle to verify that IPMI remote control works. Also note which machines don’t boot cleanly (which we only found on machines whose kernels are under development anyway, yay).
  • Notice that the old UPS isn’t actually able to carry all those servers’ load, and keep one server turned off until the new UPS is installed.
  • Now that there was space in the existing rack, redistribute the servers there as well to make the arrangement more logical (routers, switches, and other telco-related equipment at the top). Turn off individual servers one by one and keep everyone in the office informed about short outages.
  • Install new PDUs in the space we got after removing superfluous cables. Get lots of scratches while taking stuff out and in.
  • Update our inventory databases, take pictures, write blog post. 🙂

As the existing setup was quite old and had grown over time, we were pretty happy to be able to apply the lessons we learned in the years in between and to get everything cleaned up in less than 10 hours. We noted the following things that we did differently this time (and have been doing in the data center for a while already):

  • Create a bundle of network cables per server (we use 4), plug them into the switch in a systematic pattern, and label them once with the server name at each end. Colors indicate the VLAN.
  • Use real PDUs for both IEC and Schuko equipment. Avoid consumer-grade power distribution.
  • Leave a rack unit between components to allow working on them without hurting yourself, to keep the flexibility to pass wires (KVM) to the front, and to avoid temperature peaks within the rack.
  • Having over-capacity makes it easier to keep things clean, which in turn makes you more flexible and gives you the peace of mind to focus on the important stuff.

As the pictures indicate we aren’t completely done installing all the shiny new things, so here’s what’s left for the next days and weeks:

  • Wait for the electricians and Eaton to install and activate our new UPS.
  • Wire up the new PDUs with the new UPS and clean up the power wiring for each server.
  • Wait for the telco to finish digging and putting fibre into the street and get their equipment installed so we can enjoy a “real” internet connection.

All in all, we had a very productive and happy first working day in 2013. If this pace keeps up then we should encounter the singularity sometime in April.

gocept Developer Punsch 3

Having already held two “Developer BBQs” this year, we invite you to our office once more on Friday, 7 December 2012, starting at 14:00. In keeping with the season, this time under the title “Developer-Punsch”!

As always, the event is aimed at all (web) developers and sysadmins who, like us, enjoy looking beyond their own horizon. In a format based on Open Space we would like to work on topics collected in advance, which you are also welcome to bring along yourself [1].

We would be delighted to welcome many interested people from the region and beyond.

From about 19:00 there will be food, and the punch will be served.

Please register by email (mail@gocept.com) or on the Etherpad [1].

P.S.: This post was originally written in German since we are addressing a local audience. Basically we want developers and admins in our area to meet up, exchange ideas, and enjoy hot punch.

[1] Etherpad for registration and topic collection

Introducing the “Flying Circus”

We have been busy over the last months improving the presentation of our hosting and operations services a lot – and if you attended the Plone Conference in Arnhem, you may have noticed some bits and pieces already: T-shirts, nice graphics, a new logo, etc.

When pondering how to name our product we quickly decided that just using the old “gocept.net” domain wasn’t good enough. As we are also ambivalent about the whole “cloud hype”, we were looking for something else: something specific, something with technology, something where people who know their trade do awesome stuff, something not for the fearful but for people with vision and grand ideas.

What we found was this:

We call it the “Flying Circus” – for fearless people doing exactly what is needed to boost the performance, security, and reliability of your web application!

All this is just getting started, and we will show a lot more at PyConDE next week. Or, if you cannot make it there, register for more information on flyingcircus.io!

gocept-Developer-BBQ 2

Our first BBQ (see the recap of BBQ 1) was a complete success, both in content and cuisine, and will now be continued on Friday, 14 September, starting at 14:00.

So all (web) developers and sysadmins who, like us, enjoy looking beyond their own horizon are cordially invited. In a format based on Open Space we would like to work on topics collected in advance, which you are also welcome to bring along yourself [1].

We have already collected a few topic suggestions:

  • Maintainable and testable JS code
  • Resource inclusion, server vs. client
  • CSS preprocessors (SASS, …)
  • Logging and ubiquitous graphing
  • Layout tests; do needle or sikuli help?
  • Python 3
  • Automated deployment with “batou”
  • “Is django the new trend?”
  • Meteor
  • IPv6

We would be delighted to welcome many interested people from the region and beyond in our beautiful garden (or, depending on the weather, in our office).

Once again, there will be tasty food from the grill starting at about 19:00.

Please register by email (mail@gocept.com) or on the Etherpad [1].

P.S.: This post was originally written in German since we are addressing a local audience. Basically we want developers and admins in our area to meet up, exchange ideas, and enjoy BBQ and beer.

[1] Etherpad for registration and topic collection

Surprising experience with DELL support

Background: we have had terrible support experiences with DELL over the last 4-5 years or so, and today I had a single really good one. We have started slowly moving to a different vendor and won’t change that decision because of this one experience.

Our situation: we are currently fighting a subtle issue in our data center: spontaneous reboots of physical servers. It happens only rarely but is a bit of an annoyance. We have now experienced 10 cases over the last year and are starting to investigate. The problem is that almost all machines rebooted only once, and we can never find an actual cause.

While getting an overview of all restarts (machine, time, hardware model, role, BIOS version) we had to contact DELL ProSupport to clarify a contradictory statement about new BIOS versions.

First, I got through directly to the technician, and he actually (for once) did have our machine’s service tag on his desk. I explained to him that I needed a specific piece of information and that I was investigating a broader issue that didn’t seem to be related to a single machine. He took that up, passed me the information, followed along as I built and corrected our model of the fault, and gave helpful comments and additional data from their support experience with those machines.

What surprised me is that he gave me information which I would have expected to be one of the selling points of DELL’s machines: management features, and access to real support experience instead of scripted, technologically challenged call-center zombies. Again: kudos to the support engineer who helped me today.

Here are the positive surprises:

  • The DELL R610 and R510 iDRAC express cards have SSH and web UIs for accessing some of the fancier features. I even finally found the power meter! (A small sketch of reading such data over IPMI follows this list.)
  • There seems to be a tool called “repository manager” which can create a bootable ISO that includes all firmware updates for all the machines that you select. Cool! However, it seems to need Windows 2008 (WTF?). Even on Windows
  • Maybe (I didn’t understand this fully) the lifecycle controller can perform all required firmware/BIOS updates via FTP directly when entering it during boot time. (Unfortunately you need to reboot just to find out whether you need updates.)
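
As an aside, here is a small sketch of pulling such readings over the network with ipmitool (which we already use for serial-over-LAN anyway). The host name and credentials are placeholders, and whether a given card actually answers the DCMI power-reading query depends on the model and firmware:

    import subprocess

    # Placeholder connection details for the BMC/iDRAC -- adjust as needed.
    HOST, USER, PASSWORD = "idrac.example.com", "root", "secret"

    def ipmi(*args):
        """Run an ipmitool command against the remote BMC and return its output."""
        cmd = ["ipmitool", "-I", "lanplus",
               "-H", HOST, "-U", USER, "-P", PASSWORD]
        return subprocess.check_output(cmd + list(args), text=True)

    # All sensor readings (temperatures, fans, voltages, ...).
    print(ipmi("sensor"))

    # Power consumption -- only if the card supports DCMI.
    print(ipmi("dcmi", "power", "reading"))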

Recapitulating this phone call and the information I got, I reached some conclusions:

  • Big, big personal thanks to the DELL support engineer, you made my day! (And you know who you are!)
  • Why do I get huge amounts of stupid manuals that I just throw away, but no readable, accessible information that the iDRAC Express supports HTTP and SSH?
  • Why are all Linux updates, for no good reason, wrapped into binaries that require Red Hat-specific stuff? All the tools are there on other distributions. Can you please release things so that grown-ups can use them?
  • Can we please have an accessible, platform-independent way to find out whether firmware updates are pending? And whether any update in the chain is urgent?
  • I feel confirmed in my view that hardware vendors are just terrible at software. Even your support engineer is by now trained to think that having to hit a button twice isn’t a bug but a feature. Come on!
  • We knew that the express cards do not support VGA redirection (we generally use IPMI SOL), but AFAICT that leaves only “mount a remote disk” and “redirect VGA” as features of the bigger iDRAC option. And that thing, AFAICT, costs around 300 EUR more.
  • Given the trouble of updating firmware if you are on a truly free platform, I wonder why those features cost extra. It seems DELL supports Microsoft’s and Red Hat’s business models by forcing customers into those options.

Lastly, it’s nice to have a genuinely good experience with DELL support for once, but given our overall experience we’re more than happy to be migrating to Thomas Krenn now.