June 10, 2014

No! You Don't Want Automatic DR!

IT people often say that they want the ultimate automatic disaster recovery solution that money can buy. Some vendors will even sell you their product as an automatic disaster recovery solution simply because you asked for one. But do you really want an automatic disaster recovery solution?

We are often victims of our own loose use of words, and in technology you need to be very careful what you ask for. If you ask for an automatic disaster recovery solution you may get something that you do not expect. Here is the scenario.

An automatic disaster recovery solution would be a feature-rich solution that is able to recognize all kinds of "disaster" symptoms, like bad weather, increased humidity around the data center, low power voltage or increased traffic, and would automatically trigger the disaster recovery plan. There are a few problems with that, though. Executing the DR plan is not as simple as flipping a light switch. The implications are much bigger:

  • You need to redirect all your users to another data center that may be a few hundred or even a thousand miles away
  • If you use a cold DR strategy, your second data center may need some time to go live
  • The data may not be current in your second data center

All this will have a significant impact on your users' experience. In the case of a real disaster that may be warranted, but if the "disaster" is just a fluctuation in the voltage or a stronger wind, the failover will do more harm than good.

Some of you may argue that you can develop a very smart decision engine that will be able to determine whether this is a real disaster symptom or a false one, but I think those people live in the future.

Therefore you should not look for an automatic disaster recovery solution, and achieving one should not be among your goals. What you need is an automated solution that automates your DR runbook while leaving control in the hands of humans, who are able to judge properly whether the symptoms are, or will lead to, a disaster.
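
To make the distinction concrete, here is a minimal Python sketch of an automated runbook with a human approval gate. It is an illustration only, not any product's API: the names `RunbookStep`, `DrRunbook` and `ask_operator` are hypothetical, and the steps are placeholders for whatever your own runbook contains.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunbookStep:
    """One automated step of the DR runbook (hypothetical structure)."""
    name: str
    action: Callable[[], None]


@dataclass
class DrRunbook:
    steps: List[RunbookStep]
    log: List[str] = field(default_factory=list)

    def execute(self, symptoms: List[str],
                approve: Callable[[List[str]], bool]) -> bool:
        """Run every step, but only after a human approves the failover."""
        if not approve(symptoms):
            self.log.append("failover declined by operator")
            return False
        for step in self.steps:
            step.action()
            self.log.append(f"done: {step.name}")
        return True


def ask_operator(symptoms: List[str]) -> bool:
    """Assumed approval mechanism: a simple prompt on the console."""
    print("Observed symptoms:", ", ".join(symptoms))
    return input("Declare a disaster and fail over? [y/N] ").strip().lower() == "y"
```

The monitoring can be as sophisticated as you like; it only supplies the `symptoms`. Calling `runbook.execute(symptoms, ask_operator)` then runs the automated steps only if the person on call confirms that this is a disaster.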

This post was first published on our company's blog as No! You Don't Want Automatic DR!

June 04, 2014

What are good RTO and RPO?

Experiencing downtime is not something that companies wish for, but as we have seen lately it is something that we hear about quite often. Interestingly enough, very few enterprises, especially in the Small and Medium Business area, spend enough time working out good procedures for the recovery of their IT systems and applications. The recovery procedures should always be driven by the business needs, and this is where a lot of IT departments fail. As a result the recovery turns out to be a reactive procedure that is triggered by the issue, results in chaotic recovery activities and ends with a post-mortem but no improvements after that. Putting more initial thought into the Business Impact Analysis (BIA) is a prerequisite for good recovery procedures, and defining the two main characteristics, RTO and RPO, is a crucial part of this process.

Let's start with the first one - Recovery Time Objective (RTO). RTO is defined as the duration of time within which the system or the service must be restored after a disruption in order to avoid unacceptable consequences related to a break in business continuity. The first thing that you need to keep in mind about RTO is that it is an objective - this means that it is a target that you may not be able to achieve all the time. There are certain activities that you need to do during this time that may have variable duration. At a high level those are grouped into:

  1. Recognizing that there is a disruption - this may depend on your level of monitoring or lack of it and may involve manual checking of each system or service that participates in the business process
  2. Troubleshooting and identifying the failing system and/or service - this will depend on the level of diagnostics you have implemented and may also involve different people or teams
  3. Fixing the issue - depending on the root cause this can range from something as simple as rebooting the system to something as complex as requiring code changes or even ordering new hardware
  4. Testing the fix - last but not least you need to make sure that the fix actually resolves the issue

In all four of those activities the human factor is the most variable part. People need to be notified and updated, and they need time to understand the issue, troubleshoot, code etc. The more automation you provide, the less the human factor affects the recovery time.

Once the system or service is brought back into operation, though, you need to determine the state of the data. This is where the next characteristic becomes important - Recovery Point Objective (RPO). RPO is defined as the maximum period for which data might be lost from the system due to a disruption without major impact on business continuity. Although this is also an objective, you need to be more careful with this one. There are a few things to think about here:

  1. Is data loss acceptable at all? In a lot of cases the answer is no, but there are situations in which you can tolerate loss of data.
  2. How to recover the data? Does it require copying, shipping backup tapes or manual entry of the data?
  3. How long will it take to recover the data? The two extremes are the few seconds required to repoint the system to a replica of the data on another server and the time required to request an off-site backup copy of the data
  4. How to test that the data is recovered? This can vary from automated tests to manual tests

Depending on your RPO your time to recover the business operations for your system may vary.
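
A small Python sketch can make the recovery point concrete. It assumes that everything written after the last good copy of the data (backup or replica) is lost when the disruption hits; the function names and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy: datetime, disruption: datetime) -> timedelta:
    """Period for which data is lost: from the last good copy to the disruption."""
    return disruption - last_good_copy


def meets_rpo(last_good_copy: datetime, disruption: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost stays within the Recovery Point Objective."""
    return data_loss_window(last_good_copy, disruption) <= rpo


# Hypothetical example: nightly backup at 02:00, disruption at 09:30.
last_backup = datetime(2014, 6, 4, 2, 0)
outage = datetime(2014, 6, 4, 9, 30)
lost = data_loss_window(last_backup, outage)            # 7 hours 30 minutes
ok_with_4h_rpo = meets_rpo(last_backup, outage, timedelta(hours=4))
```

With a nightly backup and an RPO of four hours, this outage misses the objective, which tells you that the backup or replication interval has to shrink before the RPO is realistic.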

When thinking about Business Continuity (BC) you need to think about both components - recovering the operation of the system or service (RTO) and recovering the data to a point at which it is usable for the business (RPO). Together those actions need to take less time than the Maximum Tolerable Downtime (MTD) as we defined it in Determining the Cost of Downtime. In general, though, you should set your RTO and RPO so that you have a buffer of time for unexpected issues that may occur during recovery.
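
The buffer is simple arithmetic, sketched below in Python. The sketch assumes the worst case in which restoring the service and recovering the data happen one after the other; the function name and the durations are hypothetical.

```python
from datetime import timedelta


def recovery_buffer(mtd: timedelta, service_recovery: timedelta,
                    data_recovery: timedelta) -> timedelta:
    """Time left before the MTD is exceeded; negative means the plan does not fit."""
    return mtd - (service_recovery + data_recovery)


# Hypothetical targets: MTD of 8 hours, 3 hours to restore the service,
# 2 hours to recover the data.
buffer = recovery_buffer(timedelta(hours=8), timedelta(hours=3), timedelta(hours=2))
```

Here three hours remain for unexpected issues. If the result is zero or negative, the targets leave no room for surprises and should be tightened.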

This post was first published on our company's blog.

October 28, 2013

What exactly is a Service?

With the advancement in cloud technologies more and more companies are getting on the Anything-as-a-Service train, but over the years the term "services" has become so overloaded that people are having a hard time understanding what it means. As with any other technology term you hear lately, some clarification may be required to understand what the person in front of you means by "I sell services".

According to Wikipedia's definition of service (as a system architecture component), it is a set of related software functionalities that can be reused for different purposes, together with the policies that should control its usage. In today's cloud environment I would add two more things to the definition of services:

  • Those functionalities must be exposed either through interoperable APIs or accessible via browser (i.e. must not be bound to a particular implementation platform)
  • And they must be accessible over the network (i.e. can be accessed remotely)

Although those characteristics should be enough to define what a service is, we really complicate the matter by thinking that everything that can be accessed over the network is a service. Well, for decades we've been accessing databases over the network - is it true to say that traditional databases are services? Comparing with the definition above, the answer is "yes": a database can be used for storing data for different purposes, one can use ODBC to access it from various platforms and languages, and it is accessible over the network. Does running a single-instance DB on my home computer make me a Database-as-a-Service (DBaaS) provider? Not really! Here are a few more things that we need to consider when we talk about services:

  • Services are normally exposed to the "external" world. What this means is that you offer the services outside your organization. Whether this is outside your team, your department or your company is up to you, but you should treat services as the offering that generates business value for your organization.
  • They are also multi-tenant - this means that the services you offer can be consumed by multiple external entities at the same time without any modifications.
  • They are always up - third-party businesses will depend on your services and you cannot afford to fail them, hence avoiding a single point of failure is crucial for the success of services
  • Last but not least, services must be adopted - if you do not drive adoption through evangelizing, partnerships, good documentation, SDKs etc., the services you offer will not add value for your organization

Transitioning from a traditional software product organization to a services organization requires a lot of effort and cultural change, and the best way to approach it is to clearly define the basics from the beginning.