Open all hours
These days, your systems are expected to be up and running all the time. Pete Worlock finds out what’s involved in delivering High Availability.

Information technology is now so ubiquitous that it must rank as mission-critical to most organisations of any size. Just how critical may depend on time and circumstance, but on the wrong day of the year a failure in a Web server, email system, database or network could be disastrous.
In one survey of US and European organisations the majority of respondents said 24 hours of downtime could be ‘potentially fatal’ to the business, while nearly a third put the threshold at just four hours. That fear of failure is the driving force behind the development of High Availability (HA) computing.
The need for HA systems is not new but the rise of global businesses that are operating literally around the clock has led to an explosion of competing solutions in recent years, matched by an increase in general confusion. The vocabulary of HA systems now includes such things as clustering and virtualisation, redundancy, fault tolerance and failover, high availability and continuous availability. Increasingly, ‘the cloud’ is getting in on the act, too. Much of the HA discussion applies to large-scale systems such as networks or server farms, but software vendors such as SAP and Oracle are now listing it as a feature of their products.
How high is ‘high’?
All discussions of HA computing begin with ‘the nines’. Whereas a system offering 99 per cent uptime might once have been considered HA, today HA requires a system that is available between 99.9 and 99.99 per cent of the time. The older definition means that your system could be down for a total of nearly four days per year. ‘Three nines’ or 99.9 per cent gives you a downtime of just under nine hours per year, and ‘four nines’ or 99.99 per cent brings that down to less than 53 minutes. Any system that promises the gold standard of ‘five nines’ or 99.999 per cent availability is considered a continuous availability system, with an annual downtime of little more than five minutes.
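The downtime behind each of ‘the nines’ is simple arithmetic, and it can be useful to run the numbers for your own availability target. The following is a minimal Python sketch, assuming a 365-day year and treating availability as a plain percentage of total time; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year, in hours, implied by an availability figure."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% available: {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Note that this says nothing about how the downtime is distributed: one long outage and many short ones produce the same annual figure.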
The second aspect of the ‘how high?’ question is, how high does your availability need to be? Obviously the answer varies considerably. For a business that only uses its Web site for marketing purposes, some downtime may not matter much; for a business that relies exclusively on Web-based transactions, a failure translates directly into lost sales and lost revenue. Within larger organisations an application such as email may not count as mission-critical most of the time, but in the last few days of the business quarter it may well be vital. For yet other organisations, such as a financial company making real-time stock trades, any interruption at all could be fatal. Perhaps the ultimate requirement for HA computing is an air traffic control system.
It is also useful to note that availability isn’t an absolute. For example, it is possible that a server may be up and running while its services are unavailable because of a network failure. Or if server traffic is high, the service may be available yet result in users becoming frustrated because of unacceptably slow performance.
Interchangeable terms
Much of the confusion in the HA discussion arises from the fact that many vendors and consultants use the terms ‘redundancy’, ‘failover’ and ‘fault tolerance’ interchangeably. More accurately, however, they are merely aspects of a High Availability solution. For example, fault tolerance enables a system to provide error-free, non-stop availability in the event of a failure. Components such as mirrored disks and uninterruptible power supplies (UPSs) provide fault tolerance, and redundancy is an aspect of this as the failure of one storage device does not leave the service unavailable. The concept can be expanded to larger and more complex systems, such as duplicate servers operating in lock-step, but with a commensurate increase in costs. Full fault-tolerance requires that all of the resources needed by an application, including CPU, memory, storage and network, should be replicated (including, perhaps, the application software licence).
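To see why redundancy raises availability, consider a rough calculation. The sketch below is in Python and assumes that component failures are independent and that the service survives as long as one copy is working; real components often share failure causes (power, cooling, software bugs), so treat the result as an upper bound rather than a prediction.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of identical, redundant components.

    Assumes failures are independent and that a single working copy
    is enough to keep the service running.
    """
    if copies < 1:
        raise ValueError("at least one copy is required")
    probability_all_fail = (1 - component_availability) ** copies
    return 1 - probability_all_fail


if __name__ == "__main__":
    # Hypothetical figure: a single disk that is available 99 per cent of the time.
    for copies in (1, 2, 3):
        print(copies, redundant_availability(0.99, copies))
```

Under these assumptions each extra copy multiplies the chance of total failure by the failure probability of a single component, which is why mirroring is such a cost-effective first step.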
Failover is subtly different. When a component or system fails, the service ‘fails over’ to a backup. Fault-tolerant systems provide for continuous processing; failover systems are failure-recovery systems and by definition allow for an interruption in service. Any data or transaction not already written to storage will be lost at the point of the failure, and depending on the nature of the system it may take minutes or hours to bring the secondary system up to the point at which service was lost.
Clusters
Further clouding the issue is the concept of clusters, often called High Availability Clusters. While clustering is undeniably an HA solution, it should be clear from the previous discussion that it is possible to implement an HA system without clustering. A cluster links multiple computers, or nodes, to provide redundancy so that if one node fails another provides the service. If a server crashes, the cluster detects the failure and immediately restarts the application on another system. This is usually a form of failover, and bringing the application up on the new node may require reconfiguring network hardware, importing a file system, and loading applications.
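The detect-and-restart behaviour of a cluster can be illustrated with a toy model. The Python sketch below is not how any particular cluster product works: the class names, the five-second timeout and the use of a simple ‘last heartbeat’ timestamp are all assumptions made for illustration, and a real cluster would also have to move storage and network addresses to the new node.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time the node was last heard from, in seconds


@dataclass
class Cluster:
    nodes: List[Node]
    active: str           # name of the node currently running the application
    timeout: float = 5.0  # hypothetical: silence longer than this means failure

    def is_healthy(self, node: Node, now: float) -> bool:
        return now - node.last_heartbeat <= self.timeout

    def check(self, now: float) -> str:
        """Return the node that should run the application at time `now`.

        If the active node has gone silent, fail over to the first
        healthy standby node.
        """
        by_name = {node.name: node for node in self.nodes}
        if self.is_healthy(by_name[self.active], now):
            return self.active
        for node in self.nodes:
            if node.name != self.active and self.is_healthy(node, now):
                # A real cluster would restart the application here.
                self.active = node.name
                return self.active
        raise RuntimeError("no healthy node available")


if __name__ == "__main__":
    cluster = Cluster(nodes=[Node("node-a", 0.0), Node("node-b", 9.0)],
                      active="node-a")
    print(cluster.check(now=4.0))   # node-a is still within the timeout
    print(cluster.check(now=10.0))  # node-a has gone silent, so fail over
```

The gap between the last heartbeat and the moment the standby takes over is the interruption in service that distinguishes failover from fault tolerance.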
Also, not every application can be run on a cluster. Limitations include the ability to start, stop and check the status of the application, and the need for the application to use shared storage. Critically, the application must not corrupt data if it crashes or restarts.
Virtualisation
An increasingly promoted HA solution is virtualisation. Since applications are running on virtual machines, the reasoning goes, the ability to switch between VMs provides a ready increase in availability. However, a virtual machine must run on a physical server of some kind, and a hardware failure means that all of the hosted virtual machines stop and must be restarted. In fact, virtualisation can actually reduce availability: consolidating multiple VMs on one server creates a single point of failure for workloads that would otherwise run on independent hardware.
Choosing a solution
Two useful metrics in assessing the need for HA systems are recovery point objective (RPO) and recovery time objective (RTO). RPO is the point to which data must be restored after a failure and might be the start of the business day, the last backup, or the last transaction processed. RTO is the maximum acceptable length of time between the failure and the restoration of the process. For example, an email ordering system can tolerate a longer RTO because it isn’t a real-time system, but it has a short RPO because data loss has a significant impact. At the other end of the scale, a stock trading system has an RPO and RTO of effectively zero because it can tolerate neither loss of service nor loss of data.
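One way to make these objectives concrete is to write them down as numbers and test a proposed solution against them. The Python sketch below assumes the RPO is expressed as a maximum window of lost data in seconds, which is one common interpretation; the class, function and all of the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_seconds: float  # maximum tolerable window of lost data
    rto_seconds: float  # maximum tolerable time to restore service


def meets_objectives(objectives: RecoveryObjectives,
                     data_loss_seconds: float,
                     recovery_seconds: float) -> bool:
    """Return True if an outage stays within both the RPO and the RTO."""
    return (data_loss_seconds <= objectives.rpo_seconds
            and recovery_seconds <= objectives.rto_seconds)


if __name__ == "__main__":
    # Hypothetical targets: an ordering system that can wait four hours
    # to come back but can afford to lose no more than a minute of orders.
    ordering = RecoveryObjectives(rpo_seconds=60, rto_seconds=4 * 3600)

    # A candidate solution that loses 30 seconds of data and recovers in an hour.
    print(meets_objectives(ordering, data_loss_seconds=30, recovery_seconds=3600))
```

Setting both values to zero models the stock trading case, which only a fully fault-tolerant system can satisfy.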
At this point in most buyers’ guides it is customary to point out that there is no ‘one size fits all’ solution. However, in HA systems there is no one solution of any size: this isn’t something you can simply order up from a single vendor. While there are single-vendor solutions that can take you a long way up the availability curve, including virtualisation and clustering, more commonly you will have to choose from a menu of tools and technologies that are dependent on your hardware, software and networking infrastructure.
At the simplest and most affordable level, you can increase availability by building redundancy into areas such as power supplies, storage systems and network hardware. Further up the curve are virtualisation and clustering solutions, and for many IT professionals the support for both from Microsoft in Windows Server 2008 will be an obvious solution.
Virtualisation support in Windows Server 2008 comes from Microsoft’s Hyper-V technology which provides several HA features including Quick Migration between virtual machines to minimise downtime in the event of a failure. Volume Shadow Copy Services enables backups of running VMs without interruption, and health monitoring features allow automated recovery tasks.
Clustering is supported in Windows Server 2008 Enterprise and Windows Server 2008 Datacenter editions, with automated switching between nodes in the event of a failure. Microsoft also provides clustering support at the application level in both Exchange Server 2007 and 2010, and in SQL Server 2005 and 2008, for example. Both provide data replication between multiple instances of the applications to maintain service and data availability.
Clustering support at the application server level is also provided in Oracle WebLogic. The Node Manager within WebLogic provides graceful recovery from software faults, while load-balancing features can detect hardware failures and redirect transactions to other servers. In the event of a significant failure, WebLogic also provides tools for server migration.
Additional clustering support can be provided by third-party solutions such as GeoCluster from Double-Take Software, which builds on the features in Windows Server 2008 to allow the building of failover clusters without a shared storage system, so that cluster nodes can be sited in different locations (even in different countries). Other Double-Take solutions such as Double-Take for Windows provide additional HA support by allowing real-time data replication and failover for physical and virtual servers. Double-Take for Windows is application and hardware independent, and therefore provides HA support for a variety of applications including Microsoft Exchange, BlackBerry Enterprise Server, Oracle, SAP and others.
Similar features are available from Symantec Replication Exec which provides continuous data replication between Microsoft SQL and Exchange servers; and from the CA XOsoft family providing data replication, assured recovery and other failover features for systems including the Microsoft Server range, Oracle and BlackBerry Enterprise Server.
In virtualisation, the market leader is VMware with a complete suite of solutions for server, desktop and application virtualisation. Significantly, VMware has focussed attention on HA functionality, including fault tolerance features in vSphere which let you move workloads dynamically to different physical servers as well as among virtual machines; teaming of network interfaces so that the physical failure of a network card does not impede overall availability; and storage multi-pathing for tolerance of storage failures.
An additional module, VMware HA, is specifically designed to provide HA functionality to any application running in a virtual machine, regardless of operating system or underlying hardware. VMs are monitored for faults and restarted automatically when a failure is detected.
Other virtualisation solutions include Virtual Iron, which provides server partitioning for single and multi-server environments with support for HA features including automated management, virtual server migration with no downtime, and rapid recovery from failure.

The everRun Availability Center gives you control over Marathon Technologies’ highly fault tolerant solution.
While many of these solutions constitute failover systems, the ultimate in HA solutions is the fully fault-tolerant server. Here Marathon Technologies provides a simple approach in its everRun range of software. Available in 64-bit (everRun 2G) and 32-bit (everRun HA and FT) versions, the system combines two standard Windows servers into a single fault-tolerant environment. The software detects any problems on one server and redirects transactions to the other in real time, without operator intervention. A key feature is that it runs on any off-the-shelf Intel or AMD x86 server, and supports the older Windows Server 2003.
Finally, for those who prefer an open source approach to High Availability, the Linux community has a ten-year track record on the subject, largely through the Linux-HA project. The deliverable from Linux-HA is Heartbeat, a daemon that provides clustering infrastructure, which was further developed into the Pacemaker project. Heartbeat now ships as part of many leading distributions, including SUSE, Mandriva, Debian, Ubuntu and Red Hat Linux. Pacemaker currently ships with openSUSE and as part of the High Availability Extension for SUSE Linux Enterprise Server 11. Additional HA support provided by the High Availability Extension includes continuous data replication, automatic re-synchronising of data after a failure, and extensive support for clustering and virtualisation.
PETER WORLOCK

Peter Worlock has been a journalist and author for 30 years, and has written about the IT industry for more than two decades. As an antidote he is currently learning the mysteries of woodworking without the use of machines.
peterw@hardcopymag.com
Find out more...
You can find out more about the products mentioned in this Buyers Guide using the search facility at www.greymatter.com. For more information about Windows Server 2008 see the Buyers Guide which is listed on the Grey Matter home page.
A useful introduction to HA Computing, albeit with a Microsoft bias, can be downloaded from Microsoft at http://tinyurl.com/motcpt.
Copyright © 1983-2010 Grey Matter Ltd. All rights reserved.