Ongoing Operations has participated in hundreds of Disaster Recovery tests (and real events!) with clients. Although everyone calls the effort a DR “test”, we treat each one as an “exercise” because it is also a learning experience and a great time to identify gaps in planning so improvements can be made. In today’s post we review our DR methodology and the steps we take to ensure each client has a successful test.
SOW – We start with a Statement of Work to define the test/exercise. The SOW is critically important to the overall success of the entire exercise. It sets expectations, coordinates vendors and defines roles at the level of detail necessary to ensure a successful test.
A member of our Professional Services BCP team initiates the SOW process once a Credit Union has requested a test date. This consultant will work with you from the beginning stages all the way through the end of the test.
The SOW covers:
- When the test is to occur – Test date(s)
- Who is involved in the testing – OGO, Client, and any 3rd parties
- How many people are testing – information to provide connectivity
- Where are they testing from – location of branches or other locations
- What is being tested – systems, processes
- What is the expected RTO/RPO?
- Contact information of CU participants and 3rd party vendors
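For teams that track the SOW electronically, the items above can be sketched as a simple record with a completeness check. This is only an illustration: the class name, field names and sample values below are our own assumptions, not an official OGO template.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class StatementOfWork:
    """Hypothetical SOW record; field names mirror the list above."""
    test_dates: list = field(default_factory=list)        # when
    participants: list = field(default_factory=list)      # who (OGO, client, 3rd parties)
    tester_count: Optional[int] = None                     # how many people need connectivity
    test_locations: list = field(default_factory=list)     # where
    systems_in_scope: list = field(default_factory=list)   # what
    expected_rto_hours: Optional[float] = None
    expected_rpo_hours: Optional[float] = None
    contacts: list = field(default_factory=list)


def missing_items(sow):
    """Return the names of SOW fields that have not been filled in yet."""
    def is_blank(value):
        return value is None or (isinstance(value, list) and not value)
    return [f.name for f in fields(sow) if is_blank(getattr(sow, f.name))]


# Sample values are made up for illustration.
draft = StatementOfWork(
    test_dates=["2024-06-15"],
    participants=["OGO", "Client"],
    tester_count=5,
)
print(missing_items(draft))
```

A draft that still reports missing items is a good candidate for one of the preparation calls described below.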
Depending on the client’s needs, we may schedule a call or two to assist in preparing the SOW. When the SOW has been completed, your BCP consultant will arrange a “kick off call” between the Credit Union and an OGO engineer. During this call, the SOW is reviewed and no stone is left unturned in establishing the technical requirements of the upcoming test. Our clients have direct access to our engineering staff to set the appropriate steps in motion, and they rely on our experience to help them avoid known pitfalls in performing these tests.
Three critical documents are required for this call:
- Completed SOW – Some modification could result from the call with our engineer.
- Server Matrix – The server matrix is a detailed list of your servers/systems being recovered ranked by phases. The criticality of the servers and their position in the recovery queue should be carefully and deliberately documented.
- Network Diagrams – Nothing shuts down a potential DR test faster than lack of routing knowledge. Your network diagrams will be reviewed by OGO engineers prior to the actual test.
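To illustrate what “ranked by phases” means for the server matrix, here is a minimal sketch that turns a matrix into a recovery queue. The server names, phases and RTO values are hypothetical, and ordering by tightest RTO within a phase is our own assumption for the example.

```python
from dataclasses import dataclass


@dataclass
class ServerEntry:
    name: str          # hypothetical server name
    phase: int         # recovery phase; lower phases are restored first
    rto_hours: float   # expected Recovery Time Objective for this server


def recovery_queue(matrix):
    """Order servers by phase, then by tightest RTO within a phase."""
    return sorted(matrix, key=lambda s: (s.phase, s.rto_hours))


matrix = [
    ServerEntry("file-server", phase=2, rto_hours=24),
    ServerEntry("core-banking", phase=1, rto_hours=4),
    ServerEntry("domain-controller", phase=1, rto_hours=2),
]

for position, server in enumerate(recovery_queue(matrix), start=1):
    print(position, server.name, f"(phase {server.phase})")
```

Writing the order down this explicitly makes it easier to spot a critical dependency that has been placed too late in the queue.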
We wrap up the session and then have the Credit Union “Sign off” on the agreement for scope of the exercise. Copies are attached to our ticketing system and the SOW is used throughout the test experience to make sure everyone is on target.
Scheduling – We understand how critical it is that these tests go well, not just from a technology standpoint but also from a compliance perspective. The last thing you want during your tenure is a DOR (Document of Resolution) citing that your recovery efforts are not meeting expectations.
So we go one step further in our preparation process and require the SOW ahead of scheduling to make sure that the right engineer is available on your test day. Nothing is more frustrating than getting bogged down in a test and needing escalation!
The actual scheduling is pretty simple. Once the SOW has been completed, we book your dates or coordinate an alternate date according to available resources. Specifically, scheduling requires:
- Completed SOW and supporting documentation (Server Matrix, Network Diagram) – this is to be submitted to the DR Test Coordinator at OGO a minimum of 30 days prior to the test date(s).
- Coordination among OGO, the client and any 3rd party vendors, starting months prior to the actual test date
- Confirm dates with all parties involved
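The 30-day minimum for submitting the SOW and supporting documentation is easy to check with a few lines of date arithmetic. The sketch below is an illustration only; the function names and the sample test date are hypothetical.

```python
from datetime import date, timedelta

SUBMISSION_LEAD_DAYS = 30  # minimum lead time stated in the list above


def submission_deadline(test_date):
    """Latest date the SOW, Server Matrix and Network Diagram can be submitted."""
    return test_date - timedelta(days=SUBMISSION_LEAD_DAYS)


def is_on_time(submitted, test_date):
    """True if the documentation arrived at least 30 days before the test."""
    return submitted <= submission_deadline(test_date)


test_date = date(2024, 6, 15)  # hypothetical first test day
print(submission_deadline(test_date))             # 2024-05-16
print(is_on_time(date(2024, 5, 20), test_date))   # False
```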
Day(s) of testing – Depending on your particular test scope, your IT team may either be onsite at one of our locations or participating remotely. Your BCP planner will be actively engaged during the entire test and will serve as a third party observer of your testing efforts (NCUA likes that!). A typical test day goes something like this:
- Kick-off call
- Start server restores
- Start DR test network setup
- Schedule status calls every 2 hours (as done normally during a declaration)
- End of day wrap up – server restores kicked off to run overnight, list of any outstanding items
- Day 2 – same as day 1
- Wrap up call
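As a small illustration of the two-hour status call cadence, the sketch below lists call times between the kick-off and the end-of-day wrap up. The start and end times are hypothetical, and we assume the kick-off and wrap-up calls themselves are not counted as status calls.

```python
from datetime import datetime, timedelta


def status_call_times(kickoff, wrap_up, interval_hours=2):
    """Return status call times strictly between kick-off and wrap-up."""
    step = timedelta(hours=interval_hours)
    times = []
    next_call = kickoff + step
    while next_call < wrap_up:
        times.append(next_call)
        next_call += step
    return times


kickoff = datetime(2024, 6, 15, 8, 0)   # hypothetical 08:00 kick-off call
wrap_up = datetime(2024, 6, 15, 17, 0)  # hypothetical 17:00 end of day wrap up

for call in status_call_times(kickoff, wrap_up):
    print(call.strftime("%H:%M"))
```

With these sample times the calls fall at 10:00, 12:00, 14:00 and 16:00.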
After The Test/Post Test Analysis
During the actual test exercise, your BCP consultant and the OGO engineer capture information about the testing activities in our ticketing system. Key metrics captured include:
- Actual RTO/RPO of servers outlined in the server matrix
- Observation of CU Staff readiness
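Comparing actual RTO/RPO against the targets in the SOW is simple arithmetic, and a sketch can show how gaps fall out of it. The record layout, server names and hour values below are hypothetical; real results come from the ticketing notes taken during the test.

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    server: str
    target_rto_hours: float  # from the SOW / server matrix
    actual_rto_hours: float  # measured during the exercise
    target_rpo_hours: float
    actual_rpo_hours: float


def find_gaps(results):
    """Return (server, metric, hours over target) for every missed objective."""
    gaps = []
    for r in results:
        if r.actual_rto_hours > r.target_rto_hours:
            gaps.append((r.server, "RTO", r.actual_rto_hours - r.target_rto_hours))
        if r.actual_rpo_hours > r.target_rpo_hours:
            gaps.append((r.server, "RPO", r.actual_rpo_hours - r.target_rpo_hours))
    return gaps


results = [
    RecoveryResult("core-banking", 4.0, 6.0, 1.0, 0.5),    # RTO missed by 2 hours
    RecoveryResult("file-server", 24.0, 20.0, 24.0, 24.0),  # both objectives met
]

for server, metric, hours_over in find_gaps(results):
    print(f"{server}: {metric} exceeded by {hours_over} hours")
```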
The regularly scheduled calls are a strength of our recovery methodology, as they ensure the Credit Union, BCP consultant and OGO engineer have touch points many times during the testing day(s). This keeps the test on schedule and within scope.
The information captured during the test is then used (along with client notes) to develop the official DR Test Report. Gaps that have been identified will be communicated to the Credit Union along with suggestions on how to resolve them. Other areas that will be reported on include:
- Overall evaluation of the exercise
- Staff preparedness
- Scenario development and quality
- Infrastructure Recovery Success (Systems, Network, Data, Phones)
- Prioritization of business processes
- Ability to meet RTO (Recovery Time Objectives)
- Ability to meet RPO (Recovery Point Objectives)
- Process recovery (as defined in the Credit Union’s Statement of Work)
- Business Continuity Plan functionality
- Ability to handle unusual circumstances
- How can recovery times/objectives be improved?
The key to a successful test is to fully scope what’s to be tested and set expectations with everyone involved. Any front end testers should have scripts to follow, with the processes they are testing fully defined. Results of the test should be noted and any issues/resolutions fully documented. All documentation should be forwarded to the CU DR Coordinator to file along with all other documentation for the test. Any issues encountered should be included in the scope of future tests.
We take your Credit Union disaster recovery test very seriously (as if it were a real event!) because we understand the bottom line: when push comes to shove, you HAVE to be there for your members during a crisis.