Steps to developing an effective disaster recovery process

There are 10 steps to developing an effective disaster recovery process.

1. Select a process owner. The process owner is the most important person in the disaster recovery process because of the many key roles he or she plays. The process owner must assemble and lead the cross-functional team in preparing the business impact analysis, identifying and prioritizing requirements, developing business continuity strategies, selecting an outside service provider, and conducting realistic tests of the process. The process owner should exhibit several key attributes and be selected very carefully. Potential candidates include an operations supervisor, the data center manager, and even the infrastructure manager.

2. Obtain executive support. Executive support, particularly in the form of an executive sponsor, is necessary for developing a truly robust disaster recovery process. You need funding approval from senior management for the resources required to design and maintain an effective disaster recovery program. This support is also important because managers are typically the first to be notified when a disaster occurs. That notification sets off a chain of events involving management decisions about deploying the IT recovery team, declaring an emergency to the disaster recovery service provider, notifying facilities and physical security, and taking whatever emergency preparedness actions may be necessary. By involving management early in the design process, you secure their emotional and financial buy-in, increasing the likelihood that management will understand and fulfill its role in the disaster recovery process. The executive sponsor has several other responsibilities. One is selecting a process owner. Another is getting the support of other managers to ensure that participants are properly chosen and committed to the program.
These other managers may be direct reports, peers within IT, or, in the case of facilities, outside of IT. Finally, the executive sponsor needs to demonstrate ongoing support by requesting and reviewing frequent progress reports, offering suggestions for improvement, questioning unclear elements of the plan, and resolving conflicts.

3. Identify and prioritize requirements. One of the cross-functional team’s first activities is to identify the requirements for each process, such as business, technical, and logistical requirements. Business requirements include defining the specific criteria for declaring a disaster and determining which processes are to be recovered and in what time frames. Technical requirements include which types of platforms will be eligible as recovery devices for servers, disks, and desktops, and how much bandwidth will be needed. Logistical requirements include the amount of time allowed to declare a disaster and transportation arrangements at both the disaster site and the recovery site.

4. Assemble a cross-functional team. The process owner must assemble representatives from appropriate departments into a cross-functional design team. Departments typically represented on this team include computer operations, applications development, server and systems administration, facilities, key customer departments, data security, physical security, and network operations. This team will work on requirements, conduct a business impact analysis, select an outside service provider, design the final overall recovery process, identify members of the recovery team, conduct tests of the recovery process, and document the plan.

5. Conduct a business impact analysis. Even the most thorough disaster recovery plan won’t be able to justify the expense of including every business process and application in the recovery. It’s important to inventory and prioritize critical business processes for the entire company.
Key IT customers should help the process owner coordinate this effort to ensure that all critical processes are included. Processes that need to be resumed within 24 hours to prevent serious business impact, such as loss of revenue or major impact to customers, are rated A. Those that need to be resumed within 72 hours are rated B, and those that can take more than 72 hours are rated C. These identifications and prioritizations will be used to propose business continuity strategies.

6. Assess possible business continuity strategies. Based on the business impact analysis and the list of prioritized requirements, the cross-functional team should propose and assess several alternative business continuity strategies. These will likely include alternative remote sites within the company and geographic hot sites supplied by an outside provider.

7. Choose participants and clarify their roles for the recovery team. The cross-functional team chooses the individuals who will participate in the recovery activities after any declared disaster. The recovery team may be similar to the cross-functional team but should not be identical. Additional members should include the executive sponsor, key customer representatives, and representatives from any outside service providers. Once the recovery team is selected, it’s imperative that each individual’s role and responsibility be clearly defined, documented, and communicated.

8. Document the disaster recovery plan. The last official activity of the cross-functional team is to document the disaster recovery plan for use by the recovery team, which will then have responsibility for maintaining its accuracy, accessibility, and distribution. Documentation of the plan must also include up-to-date configuration diagrams of the hardware, software, and network components involved in the recovery.

9. Plan and execute regularly scheduled tests of the plan.
Disaster recovery plans should be tested a minimum of once a year. Progressive companies test three or four times annually. Maintain a checklist during the test to record the disposition and duration of every task, and compare it to the list of planned tasks. Consider developing a test plan that spans up to three years. Every six months the tests can become progressively more involved, starting with program and data restores and followed by processing loads and print tests, then initial network connectivity tests, and eventually full network and desktop load and functionality tests.

10. Conduct a lessons-learned postmortem after each test. The intent of the lessons-learned postmortem is to review exactly how the test was executed as well as to identify what went well, what needs to be improved, and what enhancements or efficiencies could be added to improve future tests.

Nightmare incidents

During many years of managing and consulting on IT infrastructures, I’ve encountered a number of nightmarish disaster recovery incidents. Some are humorous, some are “head-scratching,” and some are just plain bizarre. In all cases, they totally undermined what would have been a successful recovery from either a real or simulated disaster. Fortunately, no single client or employer with whom I was associated ever experienced more than two of these, but in their eyes, even one was unacceptable. These incidents, listed below, illustrate how critical planning, preparation, and performance are to a good disaster recovery:

- Backup tapes have no data on them.
- Restore process has never been tested.
- Restore tapes are mislabeled.
- Restore tapes can’t be found.
- Offsite tape supplier hasn’t been paid and can’t retrieve tapes.
- Graveyard-shift operator doesn’t know how to contact recovery service.
- Recovery service to a classified defense program is not cleared.
- Recovery service to a classified defense program is cleared, but individual personnel aren’t cleared.
- Operator can’t carry tape canister onto the airplane.
- Tape canisters are mislabeled.

The first four incidents all involve the handling of the backup tapes required to restore copies of data rendered inaccessible or damaged by a disaster. Verifying that the backup and, more importantly, the restore process complete successfully should be one of the first requirements of any disaster recovery program. While most shops verify the backup portion of the process, more than a handful of shops don’t test to verify that the restore process also works. Labels and locations can also cause problems when tapes are marked or stored improperly. Although it is a rare case, I do know of a company that was unable to retrieve a tape because the offsite tape storage supplier hadn’t been paid in months. Fortunately, it was not during a critical recovery.

All shifts need clear communication, documentation, and training on the proper recovery procedures. Graveyard-shift operators often receive the least of these because of their off hours and higher-than-normal turnover. These operators need to know whom to call and how to contact offsite recovery services.

Classified environments can present their own brand of recovery nightmares. One of our classified clients had applied for a security clearance for its offsite tape storage supplier and had begun using the service prior to the clearance being granted. When the client’s military customer found out, the tapes were confiscated. In a related issue, a separate defense contractor cleared its offsite vendor to a secured program but failed to clear the one individual who worked nights when a tape was requested for retrieval. The uncleared worker couldn’t retrieve the classified tape that night, delaying the restoration of the data for at least a day.

The last two incidents involve tape canisters used during a full dry-run test of restoring and running critical applications at a remote hot site 3,000 miles away.
The airline in question had just changed its carry-on baggage policy, which meant the recovery team couldn’t keep the tape canisters with them. Making matters worse was the fact that the canisters were mislabeled, which cost over six hours of restore time. There was much to talk about during the marathon postmortem session that followed this incident.
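
The A, B, and C ratings described in step 5 can be expressed as a small rule. The following is a minimal Python sketch, not part of the original process description. The 24-hour and 72-hour thresholds come from the text; the `BusinessProcess` type, the `rate_priority` and `prioritize` functions, and the sample inventory are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from step 5 of the article.
A_LIMIT_HOURS = 24   # must resume within 24 hours -> priority A
B_LIMIT_HOURS = 72   # must resume within 72 hours -> priority B


@dataclass(frozen=True)
class BusinessProcess:
    """Hypothetical inventory entry for one business process."""
    name: str
    max_outage_hours: float  # longest outage the business can tolerate


def rate_priority(max_outage_hours: float) -> str:
    """Return 'A', 'B', or 'C' for a tolerable outage given in hours."""
    if max_outage_hours < 0:
        raise ValueError("max_outage_hours must not be negative")
    if max_outage_hours <= A_LIMIT_HOURS:
        return "A"
    if max_outage_hours <= B_LIMIT_HOURS:
        return "B"
    return "C"


def prioritize(processes):
    """Pair each process with its rating, most urgent first."""
    rated = [(rate_priority(p.max_outage_hours), p) for p in processes]
    # Sort by rating letter, then by how soon the process must be back.
    return sorted(rated, key=lambda pair: (pair[0], pair[1].max_outage_hours))


# Illustrative inventory; the names and hours are made up.
inventory = [
    BusinessProcess("payroll", 96),
    BusinessProcess("order entry", 4),
    BusinessProcess("monthly reporting", 72),
]

for rating, process in prioritize(inventory):
    print(rating, process.name)
```

A list sorted this way gives the cross-functional team a starting point for step 6, since the A-rated processes are the ones most likely to justify a hot site or similar strategy.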
Images and content are copyright Lindengrove, 2003.