Downtime in the clouds! Are there lessons to learn?

A large part of the web community has been talking about the unexpected downtime that occurred in Rackspace’s cloud today. Some reports say that the service was degraded, while others report an outage in the Dallas NOC. T-Mobile experienced a similar hiccup a few days back, with some 2 million users left without any service! Customers are affected and cursing the cloud, although it does not seem to be all that much of a ‘cloud problem’.


While large companies continue to enhance their hosting capabilities and service offerings, there continue to be looming issues of downtime, security breaches and the like. Although these companies did a decent job of incident handling, providing regular status updates to their customers and a hotline for them to call, the moot question still lingers in customers’ minds: is their investment in cloud worthwhile after all? There remains a gap in expectations and some level of ambiguity over what cloud computing has to offer.


As companies build their cloud brand, it is essential to do some basic groundwork to see whether they are ready to take off just yet. Unfortunately, most companies venture out first and then do a SWOT analysis in the midst of a cloud storm. On the customer’s side, especially for paying customers, there is an expectation of backup and resiliency when all of the data and devices are under the control of the vendor.


One way to resolve this nervousness is to amplify the brand in the event of such outages. The larger the outage, the more feverishly teams work to ensure that it is resolved. What’s more, there is root cause analysis to understand why it happened and corrective action to resolve it. There is a lot of visibility with analysts, bloggers, PR and customer communities, and this can be used as a chance to steer clear of reputational damage and strengthen the brand.


Although there is less human interaction in cloud architectures, the processes are not automatically any more stringent. So guidelines and policies need to be laid down explicitly as best practices, with complete encryption details and mechanisms of segmentation between tenants wherever multi-tenancy exists.


Needless to say, there is a need to manage innovation alongside operational excellence. Operational excellence in this context means scalability, uptime, availability and ease of use. All of these must be exceptional to compete in the cloud services market.