Surprising experience with DELL support

Background: we had terrible support experiences with DELL over the last 4-5 years or so and I just had a single really good one today. We started moving slowly to a different vendor and won’t change our decision because of this one experience.

Our situation: we are currently fighting a subtle issue in our data center: spontaneous reboots of physical servers. It happens only rarely but is a bit of an annoyance. We have now experienced 10 cases over the last year and are starting to investigate. The problem is that almost all machines rebooted only once and we can never find an actual cause.

While compiling an overview of all restarts (machine, time, hardware model, role, BIOS version) we had to contact DELL ProSupport to clear up a contradictory statement on new BIOS versions.

First, I got directly to the technician and he actually (for once) did have our machine’s service tag on his desk. I explained to him that I needed a specific piece of information and that I’m currently investigating a broader issue that doesn’t seem to be related to a single machine. He picked up on that, passed me the information, followed along as I built and corrected our model of the fault, and gave helpful comments and additional data from their support experience with those machines.

What I wondered about is that he gave me information which I expected to be one of the selling points of DELL’s machines: management features, access to support experience instead of scripted/technologically challenged call-center zombies. Again: kudos to the supporter who helped me today.

Here are the positive surprises:

  • The DELL R610 and R510 iDRAC express cards have SSH and WEB UIs for accessing some of the fancier features. I even finally found the power meter!
  • There seems to be a tool called “repository manager” which can create a bootable ISO that includes all firmware updates for all the machines that you select. Cool! However, it seems to need Windows 2008 (WTF?). Even on Windows
  • Maybe (I didn’t understand this fully) the lifecycle controller can perform all required firmware/BIOS updates via FTP directly when entering it during boot time. (Unfortunately you need to reboot just to find out whether you need updates.)

Recapitulating this phone call and the information I got, I reached some conclusions:

  • Big, big personal thanks to the DELL supporter, you made my day! (And you know who you are!)
  • Why do I get huge amounts of stupid manuals that I just throw away, but no readable, accessible information that the iDRAC Express has HTTP and SSH support?
  • Why are all Linux updates for no reason wrapped into binaries that require Red Hat stuff? All the tools are there on other distributions. Can you please release things so that grown-ups can use them?
  • Can we please have an accessible, platform-independent way to retrieve the information whether firmware updates are pending? And whether any update in the chain is urgent?
  • I see myself confirmed that hardware vendors are just terrible at software. Even your supporter is trained by now to think that having to hit a button twice isn’t a bug but a feature. Come on!
  • We knew that the express cards do not support VGA redirection (we use ipmi sol generally) but that leaves AFAICT only the “mount a remote disk” and “redirect VGA” as features of the bigger iDRAC option. And that thing AFAICT costs around 300 EUR more.
  • Given how hard it is to update firmware if you are on a truly free platform, I wonder why those cost extra. It seems like DELL supports MS’s and RedHat’s business models by forcing customers into those options.

Lastly, it’s nice to have an actually good experience with DELL support for once, but given our overall experience, we’re more than happy to be migrating to Thomas Krenn now.

Our first developer BBQ

We invited developers and sysadmins to join us for talking shop and barbecuing last Friday. Even though several people had signed up and said they wanted to come, at first none of the guests seemed to arrive. But after an hour and braving some ugly traffic jams on the way here, a few did make it. We’re happy you came to visit us, guys!

In the unconference part, these were the topics we talked about:

We had a good time (and the BBQ was tasty), so we’ll definitely want to do something like this again; here’s to hoping some more people will join us next time!

Sprint report: Deploying Python web applications – platforms and applications

Last week I met Stephan Diehl, Michael Hierweck, Veit Schiele, and Jens Vagelpohl in Berlin for a sprint. Our chosen topic was “Python web application deployment”. In this post I’d like to recap our discussions, gocept’s perspective on those, and the deployment tool “batou” that we have been incubating in the last months.


Don’t stop PostgreSQL’s autovacuum with your application

The problem

Some weeks ago, we received a complaint from a customer about bad PostgreSQL performance for a specific application. I took a look into the database and found strange things going on: the query planner was executing “interesting” query plans, tables were bloated with lots of dead rows (one was 6 times as big as it should be), and so on.

The cause revealed itself when looking at pg_stat_user_tables:

abc-hans=# SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;
        relname        | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
-----------------------+-------------+-----------------+--------------+------------------
 object_ref            |             |                 |              |
 archived_current      |             |                 |              |
 archived_item         |             |                 |              |
 pgtextindex           |             |                 |              |
 archived_container    |             |                 |              |
 archived_chunk        |             |                 |              |
 commit_lock           |             |                 |              |
 archived_blob_info    |             |                 |              |
 archived_state        |             |                 |              |
 archived_blob_link    |             |                 |              |
 archived_item_deleted |             |                 |              |
 pack_object           |             |                 |              |
 archived_class        |             |                 |              |
 archived_object       |             |                 |              |
 object_state          |             |                 |              |
 object_refs_added     |             |                 |              |
 blob_chunk            |             |                 |              |

Despite heavy write activity on the database, no table had ever seen autovacuum or autoanalyze. But why?

As I delved into it, I noticed that the application was practically stopping PostgreSQL’s autovacuum/autoanalyze in two ways. I’d like to share our findings to help other programmers avoid getting trapped in situations like this.

Unfinished transactions

It turned out that the application had one component which connected to the database and opened a transaction right after startup, but never finished that transaction:

abc-hans=# SELECT procpid, current_timestamp - xact_start AS xact_runtime, current_query
FROM pg_stat_activity ORDER BY xact_start;
 procpid |  xact_runtime   | current_query
---------+-----------------+-----------------------
   18915 | 11:46:20.8783   | <IDLE> in transaction
   21289 | 11:18:20.07042  | <IDLE> in transaction

Note that the database server had been started about 11 ¾ hours earlier in this example. Vacuuming (whether automatic or manual) stops at the oldest transaction id that is still in use. Otherwise it would be removing rows that active transactions may still need to see, which is not sensible at all. In our example, vacuuming is stopped right away since the oldest running transaction was opened only about one minute after the server instance started. At least this is easy to resolve: we got the developers to fix the application. Now it finishes every transaction in a sensible amount of time with either COMMIT or ABORT.
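One way to make sure that every transaction gets finished is to wrap each unit of work in a small helper. The following is only a sketch of that idea, not the fix the developers actually applied: the helper name `transaction` is made up, and sqlite3 merely stands in for PostgreSQL so that the example runs without a server. Any DB-API connection with `commit()` and `rollback()` should behave the same way.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Run a block of work and always end the transaction afterwards."""
    try:
        yield conn
    except Exception:
        conn.rollback()  # give up the transaction on any error
        raise
    else:
        conn.commit()  # make the changes permanent


# sqlite3 is only a stand-in here; the table and its contents are made up.
conn = sqlite3.connect(":memory:")
with transaction(conn):
    conn.execute("CREATE TABLE docs (docid INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO docs (body) VALUES (?)", ("hello",))

try:
    with transaction(conn):
        conn.execute("INSERT INTO docs (body) VALUES (?)", ("lost",))
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass  # the helper has already rolled the transaction back

row_count = conn.execute("SELECT count(*) FROM docs").fetchone()[0]
```

The point of the helper is that no code path leaves the `with` block while a transaction is still open, so a session cannot sit idle in a transaction for hours.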

Exclusive table locks

Unfortunately, this was not all of it: autovacuum was working now, but only sporadically. A little bit of research revealed that autovacuum will abort if it is not able to obtain a table lock within one second, and guess what: the application made quite heavy use of table locks. The PostgreSQL log gave a hint that something suspicious was going on:

postgres[13251]: [40-1] user=,db= ERROR:  canceling autovacuum task
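To see how often this happens, the log can be scanned for that message. The snippet below is a rough sketch: the function name and the regular expression are my own, the pattern assumes the syslog-style prefix shown above, and every sample line except the first is made up for illustration.

```python
import re
from collections import Counter

# Assumes the "postgres[PID]:" prefix seen in the log line above.
CANCEL_RE = re.compile(r"postgres\[(\d+)\]:.*canceling autovacuum task")


def count_autovacuum_cancels(lines):
    """Map each logging process id to its number of canceled autovacuum tasks."""
    cancels = Counter()
    for line in lines:
        match = CANCEL_RE.search(line)
        if match:
            cancels[int(match.group(1))] += 1
    return cancels


sample_log = [
    "postgres[13251]: [40-1] user=,db= ERROR:  canceling autovacuum task",
    # The following two lines are invented sample data.
    "postgres[13251]: [41-1] user=,db= ERROR:  canceling autovacuum task",
    "postgres[14002]: [7-1] user=app,db=abc LOG:  duration: 12.3 ms",
]
cancels = count_autovacuum_cancels(sample_log)
total_cancels = sum(cancels.values())
```

A steadily growing count is a good reason to go looking for explicit table locks in the application.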

Searching the application source brought up several places where table locks were used. Example:

stmt = """
LOCK %(table)s IN EXCLUSIVE MODE;
DELETE FROM %(table)s WHERE docid = %%s;
INSERT INTO %(table)s (docid, coefficient, marker, text_vector)
VALUES (%%s, %%s, %%s, %(clause)s)
""" % {'table': self.table, 'clause': clause}

The textindex code was particularly problematic as it often dealt with large documents. Statements like the one above could easily place enough load on the database server to cause frequent autovacuum aborts.

The developers said that they had introduced the locks because of concurrency issues. As we could not get rid of them, I installed a nightly cron job to force-vacuum the database. PostgreSQL has shown much improved query response times since then. Some queries’ completion times even improved by a factor of 10. I’ve been told that in the meantime they have found a way to remove the locks, so the cron job is not necessary anymore.

Summary

PostgreSQL shows good auto-tuning and is a pretty low-maintenance database server if you allow it to perform its autovacuum/autoanalyze tasks regularly. We have seen that application programs may effectively put autovacuum out of business. In this particular case, unfinished transactions and extensive use of table locks were the show-stoppers. Since we identified and removed these causes, our PostgreSQL database has been running smoothly again.

We are currently in the process of integrating some of the most obvious signs of trouble into the standard database monitoring on our managed hosting platform to catch those problems quickly as they show up.
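As an illustration of what such a check could look like (this is not our actual monitoring code), the sketch below flags sessions that have been idle inside a transaction for too long. It expects rows shaped like the pg_stat_activity query shown earlier, with `xact_runtime` as a `timedelta`. The function name, the ten-minute threshold, and the last two sample rows are assumptions. Newer PostgreSQL releases (9.2 and later) renamed these columns, so the query itself would need adjusting there.

```python
from datetime import timedelta

IDLE_MARKER = "<IDLE> in transaction"


def long_idle_transactions(rows, max_age=timedelta(minutes=10)):
    """Return (procpid, xact_runtime) for sessions idling in a transaction too long.

    rows: iterable of (procpid, xact_runtime, current_query) tuples.
    """
    return [
        (procpid, runtime)
        for procpid, runtime, query in rows
        if runtime is not None and query == IDLE_MARKER and runtime > max_age
    ]


rows = [
    (18915, timedelta(hours=11, minutes=46, seconds=20), "<IDLE> in transaction"),
    (21289, timedelta(hours=11, minutes=18, seconds=20), "<IDLE> in transaction"),
    (30001, timedelta(seconds=2), "SELECT 1"),  # invented: a healthy session
    (30002, None, "<IDLE>"),  # invented: idle without an open transaction
]
offenders = long_idle_transactions(rows)
offender_pids = [procpid for procpid, _ in offenders]
```

With the two sessions from the example above, both process ids would be reported, which is exactly the situation we want to be alerted about.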

Custom widgets in zope.formlib

zope.formlib allows customizing the widget used for a field like this:

class KeywordsManagementForm(five.formlib.formbase.SubPageForm):
    form_fields = zope.formlib.form.Fields(IKeywords)
    form_fields['keywords'].custom_widget = KWSelectWidgetFactory

I do not like this approach for two reasons:

  • the widget has to be set manually every time the specific field is used
  • there is no easy way to get a display widget if the form or field is not editable for the user

Defining a new schema field and registering the widget for this field seems a bit heavy, so I came up with providing a marker interface on the field:

class IHaveSelectableKeywords(zope.interface.Interface):
    """Marker interface to get a special keywords widget."""

class IKeywords(zope.interface.Interface):
    keywords = zope.schema.List(
        title = _("Edit Keywords"),
        value_type = zope.schema.Choice(
            vocabulary=u"uc.keywords.Keywords"))
    zope.interface.alsoProvides(keywords, IHaveSelectableKeywords)

I registered the edit widget and the display widget for the IHaveSelectableKeywords interface, so the custom widget no longer has to be set in the form. The registration of the edit widget looks like this:

<adapter
    for=".IHaveSelectableKeywords
         zope.publisher.interfaces.browser.IBrowserRequest"
    provides="zope.app.form.browser.interfaces.ISimpleInputWidget"
    factory=".KWSelectWidgetFactory"
    permission="zope.Public" />

Python Barcamp Cologne

The Python BarCamp Cologne 2012 happened last weekend. It was well organized by Reimar Bauer and the Cyrus office space is just very well suited for this kind of event: lots of space, rooms, equipment, drinks, …

The proceedings of Saturday and Sunday are available on Etherpads.

My favorite discovery was Sentry, an open-source exception logging tool that has a nice UI and is simple to set up. Kudos to the Disqus crew! I’m looking forward to making this available as a managed component in gocept.net as soon as I get time to do so. 🙂

Other interesting sessions that I joined covered WSGI servers, parallelization, template engines, and debugging, plus the infamous lightning talks.

Obviously I couldn’t restrain myself and so I offered a session on service deployment trying to answer some questions that people had and presenting some of the code we wrote when extracting our knowledge into batou.

Another session that I tried to foster was about #failure: in addition to talking about the cool things that we found working I wanted to hear about stuff that doesn’t work: software, organisational issues, etc. We kind of got stuck on bashing anything with the label “Enterprise” and the standard library.

On enterprise: the weirdest experience I had lately boils down to this video by RedHat about their JBoss offering. Say what?

The stdlib bashing wasn’t aggressive at all: we found some specific quirks and tried to get some understanding of why things are the way they are. For me, basically, the standard library is what comes out of “batteries included”: it will have something in there that helps you accomplish what you want (like a pack of AA batteries), but if you’re serious about it you might need to roll some different module (like a car battery). I don’t think dropping the standard library would be a wise choice and I also don’t think that “one size fits all”.

I also got a surprising invitation to present at the GUUG meeting next year and I’m pretty excited about that!

So, thanks again to Reimar and the other people organizing and sponsoring this event!

Profiling class-based views

Just a quick note on profiling class-based views, e.g. Zope views:

import cProfile

# In real code, MyView subclasses the view class whose __call__ is profiled.
class MyView(object):
    def __call__(self):
        result = {}
        cProfile.runctx('result["x"] = super(MyView, self).__call__()',
                        globals(), locals(), '/tmp/viewprofile')
        return result['x']

Even though “exec 'result = super(…)' in globals(), locals()” works, it seems that cProfile does something a little different here, so that writing to a local variable is not possible. Storing the return value in a dictionary works around that.
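For completeness, here is a self-contained variant that can be run outside of Zope, together with a way to look at the recorded data using the standard library’s pstats module. `BaseView` is a made-up stand-in for a real view class, and the profile is written to the system’s temporary directory. Binding the parent’s `__call__` to a local name first keeps the profiled statement short.

```python
import cProfile
import os
import pstats
import tempfile


class BaseView(object):
    """Hypothetical stand-in for the real view base class."""

    def __call__(self):
        return sum(i * i for i in range(1000))


class MyView(BaseView):
    def __call__(self):
        result = {}
        call = super(MyView, self).__call__
        self.profile_path = os.path.join(tempfile.gettempdir(), "viewprofile")
        # The return value goes into a dict because assigning to a plain
        # local inside the profiled statement would not be visible here.
        cProfile.runctx('result["x"] = call()',
                        globals(), locals(), self.profile_path)
        return result["x"]


view = MyView()
value = view()

# Load the dump and print the five most expensive entries.
stats = pstats.Stats(view.profile_path)
stats.sort_stats("cumulative")
stats.print_stats(5)
```

Sorting by cumulative time is usually a good first view, since it shows which calls the view spends most of its time in, including everything they call themselves.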