Recently, we’ve been experimenting with glusterfs as an alternative network storage backend for our VM hosting. It looked like a very promising candidate to replace our current iSCSI stack: scale-out with decent performance, mostly self-configuring, self-replicating, self-healing. And all of this out of the box without complex setup. In contrast, the conventional architecture with its complex layering of iSCSI targets, DRBD, and Linux-HA glued together with a pack of shell scripts looks rather 90’s.
We played with glusterfs for a while. Setting up and configuring the software went quite smoothly compared to the traditional stuff. But after some stress testing in a replicated scenario, we found severe problems.
Synchronisation
On the storage, each virtual machine is basically represented as one big image file. This image can grow to several hundred gigabytes. That is fine as long as the replicated file servers are in sync. But once one goes offline and comes back online, the versions of the image may differ and the self-healing algorithm is triggered. Due to glusterfs’ architecture, this happens entirely on the filesystem client (i.e., the KVM host). After re-connecting a file server, all VM I/O is paused until self-healing is complete. The live VM is stuck for anywhere between several seconds and more than a minute. A considerable portion of our hosting cluster could freeze for minutes. This is clearly unacceptable. Re-connecting a previously disconnected file server would be a risky operation: quite the opposite of what replication is good for.
No global state
Another feature of glusterfs is that replication is handled entirely on the filesystem client and not on the server. This leads to an orthogonal and modular approach which has a lot of advantages. But it makes it hard to determine when a file server can be disconnected safely: given that self-healing takes a considerable amount of time, we cannot be sure whether some self-heal operation is still in progress. Disconnecting the replicated file server that holds the newer copy of a VM image before the other file server has caught up would render the VM unusable. Unfortunately, there seems to be no easy way to query a glusterfs file server for active self-healing operations. This makes disconnecting a file server a risky operation, too.
Good for its intended use
In summary, we learned that glusterfs’ architecture is a good fit for the use case it was originally designed for: an NFS replacement with lots of small files. But for our scenario, where continuously running processes need uninterrupted access to a few large image files, glusterfs does not seem to be the best fit.
So we will stick to the good ol’ iSCSI stack for now. Perhaps Ceph or Sheepdog will become viable alternatives in the future once they stabilise.
Your findings are correct. Self-healing of large VM images requires significant performance improvements. We are addressing them in our upcoming v3.3 release:
* Granular self-heal: A lock will be held only on a byte range during healing. Currently the entire VM image is locked.
* Background self-heal: Healing will happen entirely on the server side, from server to server. Currently the servers compute checksums, but the clients handle healing by moving inconsistent data between servers.
* Pro-active self-heal: Storage servers will automatically verify the affected files upon recovery from a crash. Currently self-healing is passive (triggered when an application opens the file).
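With 3.3, pending heals will also be queryable per volume from the CLI, roughly along these lines (the volume name is just a placeholder):

    gluster volume heal vm-images info
    gluster volume heal vm-images full    # trigger a full sweep instead of waiting for file access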
Thanks for the post.
Good to hear that these issues are actively worked on. 🙂
I’ve just set up a gluster replication test bench. My Citrix XenServer mounts one of the ‘bricks’ via NFS. The second brick gets the replicated data in no time. There is no gluster client involved.
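Roughly, the mount on the XenServer side looks like this (host and volume names are placeholders; gluster’s built-in NFS server speaks NFSv3 over TCP):

    mount -t nfs -o vers=3,proto=tcp storage1:/vm-volume /mnt/vm-volume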
For now, we’ll do brick re-connection after business hours, so self-healing isn’t an issue right now.
Speed tests show gluster is pretty close to iSCSI + DRBD, unless a heal is in progress. I’m thinking of trying to heal across a separate NIC to help with that.
While I think your points are valid, you could easily circumvent them: if you use gluster as NFS and boot your VM not from an image file but via nfsroot, you would a) move the replicating part back to the server, b) self-heal only what really changed, file-wise, c) self-heal the important files first (because they are accessed first). You can also trigger complete self-heals externally through another NFS mount.
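For the external trigger, something along these lines does the job (the mount point is just an example): stat-ing every file over the mount forces the replicate translator on the server to check and, if necessary, heal it.

    find /mnt/heal-trigger -noleaf -print0 | xargs --null stat > /dev/null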
And glusterfs is horrible for lots of small files… I have 5 million photos using 125 GB of space and it’s slow as anything, especially when I’m rsyncing from the source: it takes 1.5 hours to run (down from 3.5 hours).
I am now trying to tune it. I am using the NFS client, as suggested by many around the place, and I have set the following:
performance.flush-behind: on
performance.write-behind-window-size: 1024MB
performance.cache-size: 512MB
performance.io-thread-count: 64
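(These are applied per volume with the usual gluster volume set commands; the volume name below is just a placeholder:)

    gluster volume set photos performance.flush-behind on
    gluster volume set photos performance.cache-size 512MB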
Any other suggestions?