Friday, October 16, 2009

Disasters

It’s been some time since I’ve posted, and for good reason. We had a real-life disaster exercise last week. A hardware issue rendered our system lifeless, and although we did everything we could, we were still down for approximately 24 hours. There was no data loss (thanks to some creative knowledge of how to recover missing files), but I still went without sleep for 40 straight hours to complete the recovery process.

In the spirit of last week, I figured now was as good a time as any for a review of what should have been in place. No one should be without a DR plan.

  • Step 1 in disaster recovery planning: organize the disaster recovery planning team. The team should represent all functions of the organization, with a primary representative and an alternate from each participating department. It must also include a high-level manager, or the CEO, to endorse the plan and eliminate obstacles. Once assembled, the team starts an awareness campaign and creates a schedule of its anticipated activities.
  • Step 2 in disaster recovery planning: assess the risk in the enterprise. The goal in this step is to assess the potential economic loss that could result from the identified risks. The team uses a business impact analysis to do this. In the analysis, every business process should be identified, analyzed, and ranked as critical, essential, necessary, or desirable. Legal and contractual requirements should also be assessed for the consequences of business disruption.
  • Step 3 in disaster recovery planning: establish roles across department organizations. The disaster recovery planning team determines the role each department and external party must play in disaster recovery. This ensures that all resources and expertise are properly utilized.
  • Step 4 in disaster recovery planning: develop policies and procedures. Policies are the guidelines, while procedures are the step-by-step methods. Both are very important in recovering from a disaster. This step requires attention to detail. Procedures must be in place for every step in disaster recovery and response. Each function must be spelled out in black and white to ensure continuity. In our case, having all the procedures in my head does not count as documentation. Yes, I can do it from memory and experience, but should I not be here, what are they going to do?
  • Step 5 in disaster recovery planning: document disaster recovery procedures. Policies and procedures must be documented and sent through the proper channels for approval before being stored for future implementation. Each policy and procedure must be drafted, reviewed, and approved by management and by all departments and organizations responsible for implementation. The plan must be available at all times during the testing phase, and especially during disaster response. Again - not something you want only one person to know.
  • Step 6 in disaster recovery planning: prepare to handle disasters. “Information campaign” is the phrase that works here. Get the information out, make everyone aware, and ensure they all know the plan. All parties must be aware of the plan, from executives to general staff. There is nothing worse than closed-door communication - especially with your IT staff. Make sure they are well informed.
  • Step 7 in disaster recovery planning: train, test, and rehearse. Practice makes perfect! During this step, the organization conducts a live simulation that includes all departments and supporting organizations, as if a real disaster were taking place. Observers are in place to monitor and evaluate the procedures being implemented. Weaknesses are identified so updates and modifications can be made.
  • Step 8 in disaster recovery planning: ongoing management. Maintenance is the key here. This step requires continual assessment of threats, changes in organizational structure, new technologies, system changes, and their impact on recovery procedures. Any changes are documented, and updated training is given.
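To make Step 2 a little more concrete, here is a minimal sketch of how a business impact analysis could be tabulated. It assumes each process gets one of the four ranks named above plus a rough hourly loss estimate. The names `BusinessProcess`, `recovery_order`, and `estimated_loss`, and all the dollar figures, are made up for illustration and are not from our actual environment.

```python
from dataclasses import dataclass

# The four ranks from Step 2, most urgent first.
TIERS = ["critical", "essential", "necessary", "desirable"]


@dataclass
class BusinessProcess:
    name: str
    tier: str             # one of TIERS
    loss_per_hour: float  # hypothetical estimate, in dollars


def recovery_order(processes):
    """Sort by rank first, then by the larger hourly loss within a rank."""
    return sorted(
        processes,
        key=lambda p: (TIERS.index(p.tier), -p.loss_per_hour),
    )


def estimated_loss(processes, hours_down):
    """Rough total loss if every listed process is down for hours_down."""
    return sum(p.loss_per_hour for p in processes) * hours_down


# Illustrative inventory only; the processes and figures are invented.
inventory = [
    BusinessProcess("reporting", "desirable", 50.0),
    BusinessProcess("order entry", "critical", 2000.0),
    BusinessProcess("payroll", "essential", 300.0),
    BusinessProcess("billing", "critical", 500.0),
]

for process in recovery_order(inventory):
    print(f"{process.tier:<10} {process.name}")

print(f"Estimated loss for a 24-hour outage: ${estimated_loss(inventory, 24):,.2f}")
```

Even a rough table like this gives the team a written recovery order to argue about before a disaster, instead of one person deciding from memory during it.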
Long story short - we survived and everything is heading back to normal. It will take time to document and make things right. I just hope we can avoid another hardware failure or disaster in the meantime. I was really hoping my days of pulling all-nighters as an operations specialist were over. Yes, I CAN do it, and probably will if you ask me to, but I don't enjoy it as much as I once did, and I think it's time for someone else to share the role of know-it-all recovery guru.

1 comment:

Scott said...

Ugh, very scary.

Thanks, I won't be able to sleep tonight >.<