Simon and Rob discuss AWS uptime

Hi Rob,

This is not important, just some points I was thinking about during the team call this morning. You always have an interesting point of view, so I thought I’d share and get your feedback.

In particular, I was thinking about the discussion of how long AWS would need to survive without another outage in order to achieve 5x9s availability. It occurred to me that it was a completely fatuous measurement, because I believe we’re at that DevOps crossroads where we can no longer depend on hardware to save us from data loss. The AWS Service Level Agreement basically says “we won’t charge you if we can’t provide a service”. To quote them directly, they say “AWS will use commercially reasonable efforts…”, which is not a commitment to service that I’d like to have my salary dependent upon.
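For a sense of scale, the arithmetic behind that discussion can be sketched roughly as below. The four-hour outage is an assumed figure picked purely for illustration, not one taken from the call or from AWS, and the function names are made up for this sketch.

```python
# Rough availability arithmetic. The outage length used below is an
# assumption for illustration only, not a measured AWS figure.
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years

FIVE_NINES = 0.99999


def allowed_downtime_minutes_per_year(target: float) -> float:
    """Downtime budget per year, in minutes, for a given availability target."""
    return (1.0 - target) * HOURS_PER_YEAR * 60


def clean_hours_needed(outage_hours: float, target: float) -> float:
    """Hours of uninterrupted service needed so that a single outage of
    outage_hours still averages out to the target availability."""
    total_window = outage_hours / (1.0 - target)
    return total_window - outage_hours


budget_minutes = allowed_downtime_minutes_per_year(FIVE_NINES)
years_to_recover = clean_hours_needed(4.0, FIVE_NINES) / HOURS_PER_YEAR

print(f"5x9s allows about {budget_minutes:.2f} minutes of downtime a year")
print(f"An assumed 4-hour outage needs about {years_to_recover:.1f} clean years")
```

Under those assumptions the budget comes out at a little over five minutes a year, and a single four-hour outage would need roughly 45 years of clean running to average back to 5x9s, which is why the measurement looks so unhelpful.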

Over the past 20 years or so there has been a creeping paradigm of hardware resilience: RAID on disks, remote replication, clustered failover, hardware error checking and many other such technologies that have allowed developers to ignore resilience in their systems. But in the world of cloud, all of the resilience that has been baked into hardware is now gone. If you don’t own your own datacenter and haven’t ensured resilience and failover capability yourself then, as a developer, you’d better start thinking about it again, because your service provider isn’t going to put it in. It would make their whole charging model unsupportable.

There is evidence that developers are already thinking about this and have done something about it. I’m probably not on the bleeding edge of ideas here. For example, one of my bugbears with Docker is its lack of state. But in an environment where there isn’t any resilience it makes absolute sense not to preserve any state: if a container doesn’t hold state then you can start another session on another piece of hardware anywhere you like with minimal disruption to your applications. Complete statelessness is hard to achieve, though, so somewhere there has to be some DB which needs hardware resilience.
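A minimal sketch of that idea, with all names hypothetical and an in-memory dictionary standing in for the one resilient database: the worker keeps nothing between requests, so losing its host and starting a replacement elsewhere loses no data.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalStore:
    """Stand-in for the single resilient DB (hypothetical, in-memory here)."""
    data: dict = field(default_factory=dict)


class StatelessWorker:
    """Holds no session state of its own; everything lives in the store."""

    def __init__(self, host: str, store: ExternalStore) -> None:
        self.host = host
        self.store = store

    def handle(self, session_id: str) -> int:
        # Read, update and write back the per-session counter on every call.
        count = self.store.data.get(session_id, 0) + 1
        self.store.data[session_id] = count
        return count


store = ExternalStore()

first = StatelessWorker("host-a", store)
first.handle("simon")
first.handle("simon")
del first  # simulate losing the original host

replacement = StatelessWorker("host-b", store)
result = replacement.handle("simon")
print(result)  # the count carries on because the state was never on host-a
```

The catch is the one already noted: the store itself still has to live somewhere that is resilient.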

I think my point really is that we have to fundamentally change our thoughts about the way systems work. Talking about 5x9s resilience is no longer relevant.

Comments, thoughts, ideas?

So I replied:

Hi Simon

I’m seeing a couple of data points as the market view on public cloud matures.

It’s not cheaper.
It’s an opportunity risk.

There’s a growing market for tools and professional services for public cloud to help customers constrain their costs. I was amused to hear that Amazon itself manages a second-hand market for reserved instances (RIs), where customers can resell their unused RIs to other AWS customers to recoup some of those costs. And there are consultancies that help customers either buy those RIs from the marketplace at the lowest price or sell their RIs at the highest price.

Seems to conflict with the idea that cloud is cheap.

The opportunity risk piece comes from the fact that organisations can implement products faster with public cloud, without going through the CAPEX procurement cycle. And if those new products or services fail then they aren’t left with infrastructure they don’t need. So cloud isn’t a cheaper way to deliver infrastructure, but a faster way to deliver it in the short term.

How is that related to your point above? I think these are examples of how the perception of public cloud is changing. I personally like to take the long view. When TVs were first introduced they were supposed to be the death knell for the cinema. When the internet came around it was supposed to be the end for books and newspapers. The introduction of new technology always brings predictions that it will supersede existing tools. Don’t get me wrong; CDs are no more (although interestingly vinyl lives on), and newspapers are in a dwindling market. But the cinema experience is fundamentally different to the TV experience. The book experience is different to reading websites.

Those examples show that different technologies enable us to consume things in different ways. There’s no doubt the x86 server market is in decline. But that decline starts from a position where 100% of technology was on premises. As we learn more about the benefits of public cloud, we also learn that it has drawbacks in the way it’s been implemented. I therefore believe that a balance will be found where some workloads will need to sit within a customer’s physical location and some will benefit from being hosted with a hyper-scale cloud provider. IT will be consumed in a different way. I don’t believe on-premises infrastructure will go away though.

I childishly take some glee from AWS outages. I take umbrage at the assertion that infrastructure is irrelevant and the developer is the new kingmaker. Software development is hard. Infrastructure is hard. So when AWS or Azure or AWS (again) has an outage it brings that reality into stark relief. The glee I get is from the humble pie being eaten by so-called pundits who portray cloud as the next evolution of technology when really it’s the emperor’s new clothes. It’s somewhat unfair to expect developers to be experts in making their code efficient, well documented and modular, and in delivering services to market quickly, and then also to be experts in high availability, disaster recovery, data lifecycle management and so on. Both skill sets are important, but they are fundamentally different. To expect public cloud to be able to deliver the same service without the same level of expertise is patently ridiculous. It won’t achieve 5x9s. But then I don’t believe it should. It’s something different, and should be architected and understood as such.

Tying the various threads together, I strongly believe there is a use case for agile, DevOps-style methodologies, and for an agile infrastructure to support them. But I firmly believe that one size doesn’t fit all, and that infrastructure architects and their expertise have a strong role to play, in conjunction with software development expertise, in delivering the right solutions for the future.

Well, that was a long-winded response. I can see a copy and paste blog post coming :)