September 17, 2015

What is wrong with Cloud Management Platforms?

In the last few months we’ve spent a considerable amount of time investigating Cloud Management Platforms (CMPs) and trying to figure out their role within the broader cloud strategy that companies should implement. An interesting observation is that there is very little differentiation between the products, and the functionality mostly boils down to spinning virtual machines (and/or containers, if you take into account the new players) up and down in the cloud.

There are a lot of promises and expectations set for CMPs:

  • Ability to easily configure application stacks in a WYSIWYG (think visual) environment
  • Ability to create an application (or service) catalog that can be used by internal teams to spin up new environments
  • Ability to smoothly migrate deployments between clouds (private to public, public to public, public to private)
  • Ability to define flexible policies for governance
  • Ability to integrate with external systems like Continuous Integration/Continuous Delivery (CI/CD), ITSM, etc.
  • Ability to apply financial control over the use of cloud infrastructure

And many more.


Unfortunately, despite all the superlatives from analysts and researchers, almost all of the CMPs we looked at fail to satisfy even the basic requirements above. Here are some of the gaps:

  • Many CMP vendors do not take into account that there is already some level of automation in the enterprise, built with tools like CloudFormation, Chef, or Puppet, and reusing it is challenging. Most of the time you need to start from scratch and rebuild your automation in the CMP itself using its proprietary technology (scripts or UI)
  • Although most of the CMPs offer application catalogs, the management of those is weak – most of the time there is no hierarchical way to organize the catalogs, and the role-based access management is too basic to be applicable in an enterprise
  • Multi-cloud support is lacking. While most CMPs support AWS for public cloud and OpenStack and/or VMware for private clouds, support for Azure, Google Compute Engine (GCE), and other public clouds is very basic or non-existent
  • In most if not all of the CMP products, policies are not first-class citizens, which means that you are mostly stuck with what the CMP vendor thought you would need
  • External integrations are mostly limited to exporting and importing data. Most of the CMP vendors concentrate on offering Jenkins and ServiceNow integrations; however, the approaches require work in both tools, and some even require third-party tools in between
  • Financial control is limited to counting the hours a VM is running; additional costs like storage and traffic are not captured. Financial modeling and projections are not possible in any of the products we looked at


In addition to all of the above, the licensing model that all CMP vendors use requires you to pay not only an initial licensing fee but also an ongoing per-VM (or, with the new players in the market, per-container) fee, which is comparable to the running cost of the VM at the cloud provider.
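
To put that in perspective, here is a back-of-the-envelope sketch in Python. Every number in it (the hourly VM rate, the per-VM management fee, and the fleet size) is a hypothetical assumption used for illustration, not a figure from any particular vendor or provider:

    # Back-of-the-envelope comparison of cloud running cost vs. CMP licensing cost.
    # All values below are hypothetical assumptions used purely for illustration.
    HOURS_PER_MONTH = 730

    vm_hourly_rate = 0.10          # assumed cloud provider price per VM-hour
    cmp_fee_per_vm_month = 50.00   # assumed CMP management fee per VM per month
    vms_under_management = 200     # assumed fleet size

    cloud_cost = vms_under_management * vm_hourly_rate * HOURS_PER_MONTH
    cmp_cost = vms_under_management * cmp_fee_per_vm_month

    print(f"Cloud infrastructure cost: ${cloud_cost:,.2f} per month")
    print(f"CMP management fees:       ${cmp_cost:,.2f} per month")
    print(f"CMP fees as a share of infrastructure spend: {cmp_cost / cloud_cost:.0%}")

Even with assumptions like these, the management fees land in the same order of magnitude as the infrastructure itself, which is exactly the problem with the business case.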


All things considered, the lack of differentiation and the ongoing licensing costs make it very hard to create a compelling business case for CMPs in the enterprise. There is a high initial implementation cost (which with all vendors involves Professional Services) and constant if not growing ongoing costs based on the number of VMs under management, for value that seems to be mostly limited to automating deployments on one or two clouds. In summary, the value delivered does not justify the ongoing costs that vendors ask for.

June 10, 2014

No! You Don't Want Automatic DR!

It is quite often that you will hear IT people say that they want the ultimate automatic disaster recovery solution that money can buy. You can also find vendors who will sell you their solution as an automatic disaster recovery solution only because you asked for one. But do you really want an automatic disaster recovery solution?

We are often victims of our loose understanding of words, but in technology you need to be very careful what you ask for. If you ask for an automatic disaster recovery solution, you may get something that you do not expect. Here is the scenario.

An automatic disaster recovery solution would be a feature-rich solution that is able to recognize all kinds of “disaster” symptoms, like bad weather, increased humidity around the data center, low power voltage, or increased traffic, and automatically trigger the disaster recovery plan. There are a few problems with that, though. Executing the DR plan is not as simple as flipping a switch and turning on the light. The implications are much bigger:

  • You need to redirect all your users to another data center that may be a few hundred or even a few thousand miles away
  • If you use a cold DR strategy, your second data center may require some time to come online
  • The data may not be current in your second data center

All this will have a significant impact on your users’ experience, which in the case of a real disaster may be warranted, but if the “disaster” is just a fluctuation in the voltage or a stronger wind, the net impact will be negative.

Some of you may argue that you can develop a very smart decision engine that is able to determine whether a symptom signals a real disaster or a false alarm, but I think those people live in the future.

Therefore you should not look for an automatic disaster recovery solution, and striving to achieve one should not be among your goals. What you need is an automated solution that automates your DR runbook but still leaves the control in the hands of humans, who are able to determine properly whether the symptoms are, or will lead to, a disaster.
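
To make the distinction concrete, here is a minimal sketch in Python of what “automated, not automatic” can look like. The step functions and their names are hypothetical placeholders for your own recovery scripts; the point is that the runbook itself is scripted end to end, while a human still makes the call to execute it:

    # Minimal sketch of an automated (not automatic) DR runbook.
    # The steps below are hypothetical placeholders for your own recovery scripts.

    def start_secondary_environment():
        print("Bringing up servers at the secondary data center...")

    def restore_latest_data():
        print("Promoting the replica / restoring the latest backup...")

    def redirect_traffic_to_secondary():
        print("Repointing DNS and load balancers to the secondary site...")

    RUNBOOK = [
        start_secondary_environment,
        restore_latest_data,
        redirect_traffic_to_secondary,
    ]

    def execute_runbook():
        # The human in the loop: automation prepares and executes every step,
        # but a person decides whether this is a real disaster.
        answer = input("Confirmed disaster. Execute the DR runbook? [yes/no] ")
        if answer.strip().lower() != "yes":
            print("Runbook not executed; continuing to monitor.")
            return
        for step in RUNBOOK:
            step()

    if __name__ == "__main__":
        execute_runbook()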

This post was first published on our company's blog as No! You Don't Want Automatic DR!

June 04, 2014

What are good RTO and RPO?

Experiencing downtime is not something that companies wish for, but as we have seen lately, it is something we hear about quite often. Interestingly enough, very few enterprises, especially in the Small and Medium Business segment, spend enough time to work out good procedures for recovering their IT systems and applications. The recovery procedures should always be driven by the business needs, and this is the part where a lot of IT departments are failing; as a result, the recovery turns out to be a reactive procedure that is triggered by the issue, results in chaotic recovery activities, and ends with a post-mortem but no improvements after it. Putting more initial thought into the Business Impact Analysis (BIA) is a prerequisite for good recovery procedures, and defining the two main characteristics - RTO and RPO - is a crucial part of this process.

Let's start with the first one - Recovery Time Objective (RTO). RTO is defined as the duration of time within which the system or the service must be restored after disruption in order to avoid unacceptable consequences related to break in business continuity. The first thing that you need to have in mind about RTO is that it is an objective - this means that it is a target that you may not be able to achieve all the time. There are certain activities that you need to do during this time that may have variable duration. At a high level those are grouped in:

  1. Recognizing that there is a disruption - this may depend on your level of monitoring (or lack of it) and may involve manually checking each system or service that participates in the business process
  2. Troubleshooting and identifying the failing system and/or service - this will depend on the level of diagnostics you have implemented and may also involve different people or teams
  3. Fixing the issue - depending on the root cause this can be as simple as rebooting the system or as complex as requiring code changes or even ordering new hardware
  4. Testing the fix - last but not least you need to make sure that the fix actually resolves the issue

In all four activities the human factor is the most variable part. People need to be notified and updated, and they need time to understand the issue, troubleshoot, write code, etc. The more automation you provide, the less impact the human factor has on the recovery time.
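
As an illustration, here is a small Python sketch that treats those four activities as a time budget and checks whether their sum fits within the RTO. The durations (and the 120-minute RTO) are hypothetical assumptions:

    # Hypothetical RTO budget; all durations are illustrative assumptions.
    rto_minutes = 120  # the objective agreed with the business

    recovery_phases_minutes = {
        "recognize the disruption": 10,   # depends on monitoring coverage
        "troubleshoot and identify": 30,  # depends on diagnostics and the people involved
        "fix the issue": 60,              # reboot vs. code change vs. new hardware
        "test the fix": 15,               # confirm the issue is really resolved
    }

    total = sum(recovery_phases_minutes.values())
    print(f"Estimated recovery time: {total} min against an RTO of {rto_minutes} min")
    if total > rto_minutes:
        print("Over budget: invest in monitoring/automation or renegotiate the RTO.")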

Once the system or service is brought back into operation, though, you need to determine the state of the data. This is where the next characteristic becomes important - Recovery Point Objective (RPO). RPO is defined as the period of data that might be lost from the system due to a disruption without major impact on business continuity. Although this is also an objective, you need to be more careful with this one. There are a few things to think about here:

  1. Is data loss acceptable at all? In a lot of cases the answer is no, but there are situations in which you can tolerate some loss of data.
  2. How do you recover the data? Does it require copying, shipping backup tapes, or manual re-entry of the data?
  3. How long will it take to recover the data? The two extremes range from the few seconds required to repoint the system to a replica of the data on another server, to requesting an off-site backup copy of the data
  4. How do you test that the data is recovered? This can vary from automated tests to manual verification

Depending on your RPO, the time to recover business operations for your system may vary.
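
For example, with periodic backups the worst-case data loss is roughly the time since the last successful backup, so the backup (or replication) interval has to fit within the RPO. A small sketch with hypothetical numbers:

    # Hypothetical RPO check; the interval and lag values are illustrative assumptions.
    rpo_minutes = 60  # the business can tolerate losing at most one hour of data

    backup_interval_minutes = 240   # backups taken every four hours
    replication_lag_minutes = 2     # asynchronous replica lag

    print(f"Backups every {backup_interval_minutes} min -> up to {backup_interval_minutes} min of data lost")
    print(f"Replication with {replication_lag_minutes} min lag -> up to {replication_lag_minutes} min of data lost")

    if backup_interval_minutes > rpo_minutes:
        print("Backups alone cannot meet the RPO; add replication or back up more often.")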

When thinking about Business Continuity (BC) you need to think about both components - recovering the operation of the system or service (RTO) and recovering the data to a point at which it is usable for the business (RPO). Both of those activities together need to take less time than the Maximum Tolerable Downtime (MTD) as we defined it in Determining the Cost of Downtime. In general, though, you should set your RTO and RPO in a way that leaves a buffer of time for unexpected issues that may occur during recovery.
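
Putting it all together, a simple sanity check (again with hypothetical numbers) is to verify that the time to restore the service plus the time to bring the data to a usable point, plus a safety buffer, stays below the MTD:

    # Hypothetical sanity check of the recovery objectives against the MTD.
    mtd_minutes = 240            # Maximum Tolerable Downtime agreed with the business
    rto_minutes = 120            # target time to restore the system or service
    data_recovery_minutes = 45   # target time to bring the data to a usable point
    buffer_minutes = 30          # reserve for unexpected issues during recovery

    planned_worst_case = rto_minutes + data_recovery_minutes + buffer_minutes
    print(f"Planned worst case: {planned_worst_case} min against an MTD of {mtd_minutes} min")
    assert planned_worst_case <= mtd_minutes, "The objectives do not fit within the MTD"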

This post was first published on our company's blog.