The importance of being available

December 24, 2008 at 1:21 pm

First off – this isn’t about crackberry ‘instant-replies-to-emails-day-or-night’ availability – I’ll talk about that another time.

This is about how to build a high availability telephony service using open source components. I’m writing this with a specific recent project in mind, but most of the lessons are applicable to many services.

I’m going to focus on the hot-standby style cluster solution, on the basis that (as Lady Bracknell didn’t say in ‘The Importance of Being Earnest’):

To lose one parent server, Mr. Worthing, may be regarded as a misfortune. To lose both looks like carelessness.

The crucial thing about hot-standby is that either of the two servers can take the full service load, and that only one is active at a time. (This is the simplest case of the more general n+1 architecture, where you have one more server than you need for peak load.)
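
To make the hot-standby idea concrete, here is a minimal sketch (hypothetical addresses and timings, not lifted from the project) of the kind of heartbeat check a standby server runs before deciding to take over:

    # Minimal hot-standby heartbeat sketch (illustrative only, hypothetical values).
    # The standby polls the active server; after a few consecutive misses it
    # promotes itself by taking over the service address and starting the daemons.

    import socket
    import time

    ACTIVE_PEER = ("192.0.2.10", 5060)   # hypothetical active server and port
    MISS_LIMIT = 3                       # consecutive failed checks before takeover
    INTERVAL = 2                         # seconds between checks

    def peer_alive(addr, timeout=1.0):
        """Return True if a TCP connection to the active server succeeds."""
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def promote_to_active():
        """Placeholder: claim the floating service IP, start the telephony stack."""
        print("taking over as the active server")

    misses = 0
    while True:
        misses = 0 if peer_alive(ACTIVE_PEER) else misses + 1
        if misses >= MISS_LIMIT:
            promote_to_active()
            break
        time.sleep(INTERVAL)

In practice you would lean on an existing tool such as Heartbeat or keepalived rather than rolling your own, but the takeover logic is the same shape.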

Step 1: Find out what your customer needs.

People (customers included) tend to have absurdly high expectations of what high availability can achieve, compared with how much they are willing to pay for it.

There is a huge temptation on the part of technologists to try to deliver what the customer says they want, as opposed to what they need! I had a painful education in the pitfalls of this tactic when the first cluster I built (many years ago) failed twice in its first year.

The customer had gone for a belt-and-braces, best-of-everything set-up. The first failure came when the journaling filesystem locked up because (I still wince at this) its license had expired.

The second failure was a hardware fault – the memory in a router went bad – but thanks to the redundant set-up, the bit error in the routing table was propagated to the standby device, and then back to the replacement(s) as they were swapped in.

In both cases a cold start was required, resulting in a few hours of downtime. The cluster cost about three times what a single system would have, and delivered lower uptime.

Ask about the business case. How much would it cost the business if the service were unavailable for 4 hours a year? That’s better than 99.9% uptime, but it is also enough time to get a tech out of bed, a spare server out of a cupboard, a back-up restored and service resumed. Cold standby systems are easy to build, test and maintain. The hardest part is making sure that no-one re-purposes the apparently unused hardware.

Here’s how I evaluate it: what is the cost of a single server that can handle the load (including software, licenses, set-up time and so on)? If the estimated loss to the business from a 4 (or 8) hour outage is greater than three times that figure, go for a cluster of servers. If not, buy spare hardware, keep good backups and have your tech’s home number.
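
To put numbers on that rule of thumb, here is a tiny sketch (the figures are invented purely for illustration):

    # Back-of-envelope check: cluster or cold standby?
    # All figures are invented for illustration; plug in your own numbers.

    single_server_cost = 10000   # hardware, software, licenses, set-up time
    outage_hours = 4             # realistic recovery time with a cold spare
    loss_per_hour = 5000         # what the business loses per hour of downtime

    outage_cost = outage_hours * loss_per_hour

    if outage_cost > 3 * single_server_cost:
        print("Estimated outage cost %d > 3 x %d: build a cluster" % (outage_cost, single_server_cost))
    else:
        print("Estimated outage cost %d: buy a spare, keep backups, keep the tech's number" % outage_cost)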

Step 2:

You have decided you need a cluster; now decide what technology to use.

Requirements

Again this requires discussion with the customer to ensure that you understand their requirements and that they understand the specific features you are going to build.

We always write a specification at this stage: we state what the user wants and how, in general terms, we will deliver it. We make an effort not to get too deep into the technology at this point; the specification needs to be understood by business people, not just database (or web) experts. The proposed solution needs to cover the following:

Sessions

There are quite a few technologies that really don’t get on with clusters. Basically, anything with a long-running session needs special treatment.

The best way of dealing with sessions is to eliminate them if at all possible. In our case the customer accepted that phone calls in progress when a component fails can be dropped, but that if the user calls back, the call must go through. This spares us from having to migrate live calls to the standby server.

Database

The other persistent problem is how to ensure that the database remains sufficiently up-to-date and consistent across the two (or more) database servers.

I’ll talk some more in the next post about the solution we selected, but in general there are two ways to do this:

  1. Have two separate datastores and use regular synchronization to keep them in step (see the sketch after this list).
  2. Have two separate database servers accessing the same datastore, but make that datastore itself highly redundant (using RAID).
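
As a rough illustration of option 1 (hypothetical host and database names, and a deliberately crude dump-and-restore rather than the database’s own replication), the ‘regular synchronization’ can be as simple as:

    # Option 1, crudely: copy the primary's data to the standby on a schedule.
    # Hypothetical host and database names; assumes the PostgreSQL client tools
    # (pg_dump, psql) are installed and that a short sync window is acceptable.

    import subprocess
    import time

    PRIMARY = "db-primary.example.com"
    STANDBY = "db-standby.example.com"
    DBNAME = "calls"
    SYNC_INTERVAL = 300   # seconds between synchronizations

    def sync_once():
        """Dump the primary database and load it into the standby."""
        dump = subprocess.run(
            ["pg_dump", "--host", PRIMARY, "--clean", DBNAME],
            check=True, capture_output=True, text=True)
        subprocess.run(
            ["psql", "--host", STANDBY, DBNAME],
            input=dump.stdout, check=True, text=True)

    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL)

A real deployment would more likely use the database’s own replication; the point is simply that the two copies are allowed to drift by at most one sync interval.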

The option you choose depends on your problem space. I’ll explore the choices we made in the next post.

Migration

How will users be moved from one server to the next? Some common ways are:

  1. DNS – Here you change the IP address associated with your service’s name to direct the users to the correct machine. You need to ensure that the Time-To-Live on the DNS entry is short enough that the old value won’t be cached (somewhere out in the DNS cloud) for longer than the acceptable downtime.
  2. IP address migration – Here you move the IP address associated with the service from one server to another by bringing up an interface on the new server (and taking down the interface on the old server); see the sketch after this list.
  3. Content redirectors – Here you move the problem (always popular) to a dedicated box which receives traffic from the users and then forwards it to the appropriate ‘real’ server.
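
For the curious, here is roughly what option 2 looks like on Linux. This is a sketch with a hypothetical address and interface name; it assumes the iproute2 ‘ip’ and iputils ‘arping’ tools, root privileges, and that the failed server has already released the address:

    # IP address migration sketch: attach the floating service address to this
    # server and announce it with gratuitous ARP so neighbours update quickly.
    # Hypothetical address and interface name; needs root privileges.

    import subprocess

    SERVICE_IP = "192.0.2.50"
    PREFIX = "24"
    INTERFACE = "eth0"

    # Bring the floating address up on our interface.
    subprocess.run(
        ["ip", "addr", "add", "%s/%s" % (SERVICE_IP, PREFIX), "dev", INTERFACE],
        check=True)

    # Send unsolicited ARP replies so switches and routers learn the new home
    # of the address without waiting for their ARP caches to expire.
    subprocess.run(
        ["arping", "-U", "-c", "3", "-I", INTERFACE, SERVICE_IP],
        check=True)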

I’ll talk about the technical choices we made on our recent project in the next post, but for those who can’t wait, we used the following:
