{"id":1246,"date":"2013-03-03T20:10:49","date_gmt":"2013-03-03T19:10:49","guid":{"rendered":"http:\/\/blog.gocept.com\/?p=1246"},"modified":"2013-04-02T23:16:43","modified_gmt":"2013-04-02T21:16:43","slug":"how-we-organize-large-scale-roll-outs","status":"publish","type":"post","link":"https:\/\/blog.gocept.com\/2013\/03\/03\/how-we-organize-large-scale-roll-outs\/","title":{"rendered":"How we organize large-scale roll-outs"},"content":{"rendered":"

In the coming week we will deploy an extensive OS update to our production environment, which currently consists of 41 physical hosts running 195 virtual machines.<\/p>\n

Updates like this are prepared very carefully in many small steps using our\u00a0development and staging setups<\/a> that reflect exactly the same environment as our production systems in the data center.<\/p>\n

Nevertheless, we learned to expect the unexpected when deploying to our production environment. This is why we established the one\/few\/many<\/strong> paradigm for large updates. The remainder of this post talks about our scheduling mechanism to determine which machines are updated at what point in time.<\/p>\n

Automated maintenance scheduling<\/strong><\/p>\n

The Flying Circus<\/a>\u00a0configuration management database (CMDB) keeps track of times that are acceptable to each customer for scheduled maintenance. When a machine determines that a particular automated activity will be disruptive (e.g. because it makes the system temporarily unstable or reboots), it requests a maintenance period from the CMDB based on the customers’ preferences and the estimated duration of the downtime. Customers are then automatically notified of what will happen and when.<\/p>\n
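To make the idea concrete, here is a minimal sketch of how such a request could be resolved. It is not the CMDB's actual code: the `Window` class, the `request_maintenance` function and the sample dates are assumptions made up for illustration. It simply picks the earliest start time at which the estimated downtime fits completely into one of the customer's acceptable periods.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Window:
    # Hypothetical record: a period the customer accepts for maintenance (UTC).
    start: datetime
    end: datetime


def request_maintenance(windows: List[Window], earliest: datetime,
                        duration: timedelta) -> Optional[datetime]:
    # Return the earliest start at which the whole estimated downtime fits
    # into a single acceptable window, or None if no window is long enough.
    for window in sorted(windows, key=lambda w: w.start):
        start = max(window.start, earliest)
        if start + duration <= window.end:
            return start
    return None


# Made-up customer preferences: a short window and a longer one.
windows = [
    Window(datetime(2013, 3, 4, 22, 0), datetime(2013, 3, 4, 22, 20)),
    Window(datetime(2013, 3, 5, 22, 0), datetime(2013, 3, 6, 2, 0)),
]
slot = request_maintenance(windows, datetime(2013, 3, 4, 12, 0),
                           timedelta(minutes=30))
print(slot)  # 2013-03-05 22:00:00
```

The first window is skipped because a 30 minute downtime does not fit into 20 minutes, so the request lands in the second one.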

This alone is not enough to carry out a large update that affects all machines (and thus all customers), but it’s the mechanical foundation for the next step.<\/p>\n

Maintenance weeks<\/strong><\/p>\n

When we roll out a large update we prefer to add extra padding for errors, which is why we invented the “maintenance week”. For this we can ask the CMDB to proactively schedule relatively large maintenance windows for all machines in a given pattern.<\/p>\n

Here’s a short version of how this schedule is built when an administrator pushes the “Schedule maintenance week” button in our CMDB (all times in UTC):<\/p>\n

    \n
  1. Monday 09:00 – automation management, monitoring, and binary package compilation get updated<\/li>\n
  2. Monday 13:00 – the first router and one storage server are updated<\/li>\n
  3. Monday 17:00 – internal test machines (our litmus machines) and a small but representative set of customer machines that are marked as test environments get updated<\/li>\n
  4. Tuesday 17:00 – the remainder of customer test machines, up to 5% of untested production VMs, and 20% of the storage servers are updated<\/li>\n
  5. Wednesday 17:00 – 30% of the production VMs and 30% of the storage servers are updated<\/li>\n
  6. Thursday 17:00 – the remaining production VMs and storage servers get updated<\/li>\n
  7. Saturday 09:00 – KVM hosts are updated and rebooted<\/li>\n
  8. Saturday 13:00 – the second router is updated<\/li>\n<\/ol>\n
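The growing batches for production VMs in steps 4 to 6 can be sketched roughly as follows. This is an illustration under assumptions, not our CMDB code: `PRODUCTION_PHASES`, `assign_phases` and the VM names are hypothetical, and the sketch ignores how the early 5% are actually selected. Percentages are rounded down, which keeps the first batch at up to 5%.

```python
# Hypothetical phase table: (slot, percentage of all production VMs).
# None means: everything that has not been assigned yet.
PRODUCTION_PHASES = [
    ('Tuesday 17:00', 5),
    ('Wednesday 17:00', 30),
    ('Thursday 17:00', None),
]


def assign_phases(vms, phases=PRODUCTION_PHASES):
    # Split the list of VMs into consecutive batches, one per phase.
    total = len(vms)
    schedule = {}
    position = 0
    for slot, percent in phases:
        if percent is None:
            size = total - position
        else:
            size = total * percent // 100  # rounds down
        schedule[slot] = list(vms[position:position + size])
        position += size
    return schedule


# Made-up inventory of 40 production VMs.
vms = ['vm%02d' % i for i in range(40)]
schedule = assign_phases(vms)
for slot, batch in schedule.items():
    print(slot, len(batch))
# Tuesday 17:00 2
# Wednesday 17:00 12
# Thursday 17:00 26
```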

    Once the schedule has been established, customers are informed by email about the assigned slots. An internal cross-check ensures that all machines in the affected location do have a window assigned for this week.<\/p>\n
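A cross-check of this kind can be as simple as a set difference. The sketch below is an assumption about its shape, not the real implementation; `machines_without_window` and the sample data are made up for illustration.

```python
def machines_without_window(machines, assigned):
    # machines: names of all machines in the affected location
    # assigned: mapping of machine name to its maintenance slot
    # Returns the machines that have no window yet, sorted by name.
    return sorted(set(machines) - set(assigned))


# Made-up example data.
machines = ['vm01', 'vm02', 'vm03']
assigned = {'vm01': 'Wednesday 17:00', 'vm02': 'Thursday 17:00'}
missing = machines_without_window(machines, assigned)
print(missing)  # ['vm03']
```

An empty result means that every machine has a slot for the week.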

    \"Maintenance<\/a><\/p>\n

    This procedure causes the number of machines that get updated to rise from Monday (22 machines) to Thursday (about 100 machines). Any problems we find on Monday can be fixed on a small number of machines, and we can provide a bugfix that avoids the issue completely on later days.<\/p>\n

    However, if you read the list carefully you are probably asking yourself: Why are customer VMs without tests updated early? Doesn’t this force customers without tests to experience outages more heavily?<\/p>\n

    Yes. And in our opinion this is a good thing: First, in earlier phases we have smaller numbers of machines to deal with. Any breakage that occurs on Monday or Tuesday can be dealt with more promptly than unexpected breakage on Wednesday or Thursday, when many machines are updated at once. Second, if your service is critical then you should feel the pain of not having tests (similar to the pain you experience if you don’t write unit tests and try to refactor). We believe that “herd immunity” will give you a false sense of security, and we would rather have unexpected errors occur early and clearly visible so they can be approached with a good fix instead of hiding them as long as possible.<\/p>\n

    We’re looking forward to our updates next week. Obviously we’re preparing for surprises, but what will they have in store for us this time?<\/p>\n

    We also appreciate feedback: How do you prepare updates for many machines? Is there anything we’re missing? Anything that sounds like a good idea to you? Let us know – leave a comment!<\/p>\n","protected":false},"excerpt":{"rendered":"

    In the coming week we will deploy an extensive OS update to our production environment which (right now) currently consists of 41 physical hosts running 195 virtual machines. Updates like this are prepared very carefully in many small steps using our\u00a0development and staging setups that reflect the exactly same environment as our production systems in … Continue reading “How we organize large-scale roll-outs”<\/span><\/a><\/p>\n","protected":false},"author":12391367,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_newsletter_tier_id":0,"footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[10221,1],"tags":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pFP3y-k6","jetpack-related-posts":[{"id":1340,"url":"https:\/\/blog.gocept.com\/2013\/07\/31\/reproducable-automated-deployments-on-raspberrypi-with-batou\/","url_meta":{"origin":1246,"position":0},"title":"Reproducable automated deployments on RaspberryPi with batou","author":"Daniel Havlik","date":"July 31, 2013","format":false,"excerpt":"For continuous integration during development, we use Jenkins to automatically run tests for all projects we maintain. 
Some time ago we wanted to increase visibility of the results, so we set up a Raspberry Pi driving a few meters of LPD8806-based LED strip on which we can address single LEDs\u2026","rel":"","context":"In "en"","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":179,"url":"https:\/\/blog.gocept.com\/2012\/05\/28\/sprint-report-deploying-python-web-applications-platforms-and-applications\/","url_meta":{"origin":1246,"position":1},"title":"Sprint report: Deploying Python web applications – platforms and applications","author":"Daniel Havlik","date":"May 28, 2012","format":false,"excerpt":"Last week I met Stephan Diehl, Michael Hierweck, Veit Schiele, and Jens Vagelpohl\u00a0in Berlin for a sprint. Our chosen topic was \"Python web application\u00a0deployment\". In this post I'd like to recap our discussions, gocept's perspective on those, and the deployment tool \"batou\" that we have been incubating in the last\u2026","rel":"","context":"In "en"","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3350,"url":"https:\/\/blog.gocept.com\/2019\/11\/13\/union-cms-released-on-python-3\/","url_meta":{"origin":1246,"position":2},"title":"union.cms released on Python 3","author":"Michael Howitz","date":"November 13, 2019","format":false,"excerpt":"union.cms is a content management system which was once developed on Zope 2. It was one of the early adopters of the Five technology aka using Zope 3 components in Zope 2. Now it is one of the proud early adopters of Zope 4 on Python 3. 
It is used\u2026","rel":"","context":"In "en"","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"Green tree python","src":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2019\/11\/green-tree-python-1312700.jpg?fit=1200%2C863&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2019\/11\/green-tree-python-1312700.jpg?fit=1200%2C863&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2019\/11\/green-tree-python-1312700.jpg?fit=1200%2C863&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2019\/11\/green-tree-python-1312700.jpg?fit=1200%2C863&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2019\/11\/green-tree-python-1312700.jpg?fit=1200%2C863&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":3392,"url":"https:\/\/blog.gocept.com\/2020\/06\/08\/we-have-nearly-one-million-lines-of-python-2-code-in-production-and-now\/","url_meta":{"origin":1246,"position":3},"title":"We have nearly one million lines of Python 2 code in production \u2013 and now?","author":"Michael Howitz","date":"June 8, 2020","format":false,"excerpt":"How to successfully migrate a Python 2 project to Python 3.","rel":"","context":"In "en"","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"Python Web Conf","src":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/06\/michaelhowitz.jpg?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/06\/michaelhowitz.jpg?fit=1200%2C600&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/06\/michaelhowitz.jpg?fit=1200%2C600&ssl=1&resize=525%2C300 1.5x, 
https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/06\/michaelhowitz.jpg?fit=1200%2C600&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/06\/michaelhowitz.jpg?fit=1200%2C600&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":199,"url":"https:\/\/blog.gocept.com\/2012\/07\/10\/surprising-experience-with-dell-support\/","url_meta":{"origin":1246,"position":4},"title":"Surprising experience with DELL support","author":"Daniel Havlik","date":"July 10, 2012","format":false,"excerpt":"Background: we had terrible support experiences with DELL over the last 4-5 years or so and I just had a single really good one today. We started moving slowly to a different vendor and won't change our decision because of this one experience. Our situation: we are currently fighting a\u2026","rel":"","context":"In "en"","block_context":{"text":"en","link":"https:\/\/blog.gocept.com\/category\/en\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3398,"url":"https:\/\/blog.gocept.com\/2020\/08\/03\/announcing-zope-autumn-sprint-2020\/","url_meta":{"origin":1246,"position":5},"title":"Announcing Zope Autumn Sprint 2020","author":"Steffen Allner","date":"August 3, 2020","format":false,"excerpt":"Earl Zope was very delighted that in May 2020 a few of his principal supporters gathered around the virtual campfire due to the pandemic situation and improve the welfare in Python 3 Wonderland. 
The supporters agreed at this very campfire to meet again in 2020 to crown the newly Earl\u2026","rel":"","context":"Similar post","block_context":{"text":"Similar post","link":""},"img":{"alt_text":"Road with trees and leaves in autumn","src":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/08\/pexels-pixabay-235721.jpg?fit=1200%2C800&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/08\/pexels-pixabay-235721.jpg?fit=1200%2C800&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/08\/pexels-pixabay-235721.jpg?fit=1200%2C800&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/08\/pexels-pixabay-235721.jpg?fit=1200%2C800&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/blog.gocept.com\/wp-content\/uploads\/2020\/08\/pexels-pixabay-235721.jpg?fit=1200%2C800&ssl=1&resize=1050%2C600 3x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1246"}],"collection":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/users\/12391367"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/comments?post=1246"}],"version-history":[{"count":3,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1246\/revisions"}],"predecessor-version":[{"id":1257,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/posts\/1246\/revisions\/1257"}],"wp:attachment":[{"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/media?parent=1246"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/categories?post=1246"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.gocept.com\/wp-json\/wp\/v2\/tags?post=1246"}],"curies":[{"name":"wp
","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}