Users have managed and measured the effectiveness of their computer systems in several ways to date. Access time and uptime are the classic criteria, and both relate directly to how users experience their systems: how responsive the systems are and whether they are always available.
It is only when things start to go wrong that the value of service levels is realised. For example, when a system fails, a copy of the last backup must be recovered. If this backup was completed 12 hours previously, there must be a way to bring the system back to its current state, or all the business activity over the intervening 12 hours will be lost. This can be done by replaying log files to reconstruct the data; however, that might take hours to complete, leaving the business at risk or at least unable to service users as required.
A decision has to be made as to what the business will accept as its main goals. The point of recovery goes hand in hand with the time to recover; for example, if it is unacceptable to have a system out of operation for over 15 minutes, this means that any failure must be recovered within that timeframe.
Defining the terms
Two factors are used to assist in evaluating the processes and setting service level objectives: Recovery Point Objective (RPO) and Recovery Time Objective (RTO):
Recovery Point Objective (RPO): The point in time to which systems and data must be recovered after an outage. RPOs are used as the basis for backup strategies, and as a determinant of the amount of data that may need to be recreated after the systems or functions have been recovered.
Recovery Time Objective (RTO): The period of time within which systems, applications, or functions must be recovered after an outage. RTOs are used as the basis for recovery strategies.
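The two definitions can be read as simple checks against a backup regime. The following is a minimal sketch, not a prescribed method: the class and function names are hypothetical, the 15-minute RPO is an assumed figure, and the 12-hour backup interval and 15-minute recovery limit echo the examples used earlier in this article.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceLevelObjective:
    """Hypothetical container for the two objectives defined above."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def meets_rpo(backup_interval: timedelta, slo: ServiceLevelObjective) -> bool:
    # Worst case: the failure happens just before the next backup,
    # so a full backup interval of data is lost.
    return backup_interval <= slo.rpo


def meets_rto(estimated_recovery: timedelta, slo: ServiceLevelObjective) -> bool:
    # The estimated recovery time must fit inside the objective.
    return estimated_recovery <= slo.rto


# Assumed objectives: lose at most 15 minutes of data, be back within 15 minutes.
slo = ServiceLevelObjective(rpo=timedelta(minutes=15), rto=timedelta(minutes=15))

print(meets_rpo(timedelta(hours=12), slo))    # 12-hourly backups alone
print(meets_rto(timedelta(minutes=10), slo))  # a 10-minute recovery estimate
```

Under these assumptions, a 12-hourly backup on its own fails the RPO check, while a 10-minute recovery estimate passes the RTO check.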
Defining the operational goals
Users expect systems to be continuously available, and to an increasing extent this is being achieved with built-in reliability and operational features such as snapshots, mirroring and RAID. But careful consideration must be given to all applications in order to meet the expectations and demands of the users. When an email server crashes and it takes a day to rebuild, the exposure to users is obvious.
Implementing continuous system operations requires sound consideration of the recovery and data protection practices layered into the elements of the system infrastructure. Of the issues to consider, the time to recover (or RTO) is perhaps the most obvious pain point. Ways to reduce the time to recover include taking more frequent backups, reducing the amount of data that must be replayed from log files, or taking snapshots every few minutes. Each of these measures places its own overhead on a system, and that overhead is usually addressed with further investment in high-performance systems. This trade-off is a critical decision, one that needs to be carefully addressed.
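The effect of backup frequency on the time to recover can be illustrated with a rough estimate. This is a sketch under stated assumptions: the function name is hypothetical, recovery is modelled simply as restoring the last backup and then replaying the logs written since, and the restore time (20 minutes) and replay rate (10 minutes per hour of log) are made-up figures rather than measurements.

```python
def worst_case_recovery_minutes(restore_minutes: float,
                                backup_interval_hours: float,
                                replay_minutes_per_log_hour: float) -> float:
    """Estimate recovery time when the failure occurs just before the next backup."""
    # Step 1: restore the last backup.
    # Step 2: replay a full backup interval's worth of log files.
    return restore_minutes + backup_interval_hours * replay_minutes_per_log_hour


# Compare three assumed backup intervals using the same illustrative figures.
for interval in (12, 4, 1):
    minutes = worst_case_recovery_minutes(20, interval, 10)
    print(f"backup every {interval:>2} h -> about {minutes:.0f} min to recover")
```

With these illustrative figures the estimate falls from about 140 minutes with 12-hourly backups to about 30 minutes with hourly backups. The model ignores the extra load that the more frequent backups themselves place on the system.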
The point to which a system will be recovered (or RPO) is also a delicate question and one that will differ according to each application. For example, with online bookings users do not want to go through the whole process again for one transaction, let alone a day’s worth of transactions. Thus online systems must be treated very carefully, while other applications such as the creation of management reports may start again from scratch if there is a system failure.
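Because the acceptable recovery point differs by application, it can help to record an objective per application and compare each against the backup regime. The sketch below is illustrative only: the application names follow the examples in this article, but the RPO values and the function name are assumptions.

```python
from datetime import timedelta

# Hypothetical per-application recovery point objectives (assumed values).
RPO_BY_APPLICATION = {
    "online_bookings": timedelta(0),            # no lost transactions tolerated
    "email": timedelta(hours=1),
    "management_reports": timedelta(hours=24),  # can be regenerated from scratch
}


def applications_at_risk(backup_interval, rpo_by_application):
    """Return, in alphabetical order, the applications whose RPO is tighter
    than the backup interval and so need more than periodic backups."""
    return sorted(
        app for app, rpo in rpo_by_application.items()
        if rpo < backup_interval
    )


print(applications_at_risk(timedelta(hours=12), RPO_BY_APPLICATION))
```

With a 12-hourly backup, the booking and email systems are flagged under these assumed objectives, while the management reports are not.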
Building the practices to support the business goals
Data protection practices must be carefully developed. From backup routines to recovery processes, all must be carefully considered. The result may be as simple as opting for a disk-to-disk backup and a recovery regime with offline storage to tape. But some sensitive online systems need to be recovered and up and running again within 30 minutes. This requires a more sophisticated approach, including the quiescing of databases in order to have a defined point of recovery.
With networked systems and the need to establish centralised policies that reach out to the remote offices, the practices need to be planned so they can protect the data in transit between sites as well as the data at each site. The products supporting these activities include Cisco’s Wide Area File Services and StorageTek’s Echoview. And when this is extended to include disaster recovery services, remote copy tools are available from all the major players.
The range of issues and tools does not stop here. Why not assess whether a system is under pressure and at risk of failure before it goes down? Performance management tools are now available to do this: they report on the performance of system and network components and also track transactions by application.
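The idea of spotting a system under pressure before it fails can be sketched as a threshold check on recent utilisation samples. This is not how any particular performance management product works; the function name, the 85% threshold, the five-sample window and the sample data are all assumptions chosen for illustration.

```python
from statistics import mean


def at_risk(samples, threshold=0.85, window=5):
    """Return True if the mean of the most recent `window` utilisation
    samples (as fractions between 0 and 1) exceeds `threshold`."""
    if len(samples) < window:
        return False  # not enough data to judge
    return mean(samples[-window:]) > threshold


# Made-up CPU utilisation readings, oldest first.
cpu = [0.60, 0.72, 0.88, 0.91, 0.93, 0.95, 0.97]
print(at_risk(cpu))
```

Averaging over a window rather than reacting to a single reading avoids raising an alarm on a brief spike; a real tool would track many more signals, including per-application transaction times.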
The discussion on service levels is often simplified. But as with all good, highly reliable systems, beneath the skin lies a sophisticated set of tools and components that deliver continuous systems operations.