Optimizing apps for cloud

Description

I've been getting a lot of questions lately about how to cloud-optimize an app, essentially moving beyond architectures that look more like old-school hosting, and closer to a true realization of the utility computing dream that everyone is signing up for these days. That tells me we're still not clear on the whole "what is cloud computing" thing, since some of the people asking me have actually built and deployed services, presumably with a little devil on their shoulder casting doubt about whether or not their app would pass muster against some canonical cloud reference architecture somewhere.

I remember speaking at a number of industry conferences when this whole thing was getting started circa 2007-2008-ish, and just about everyone – private sector companies, industry experts, luminaries, vendors (myself included) – would kick off our talks with a slide that asked, "what is cloud computing?", followed by 20 minutes of mind-numbingly complex techno-goo on SaaS and PaaS and IaaS and just-about-everything-you-can-think-of-as-a-service. To make matters worse, the big-thinker analysts, pundits, and researchers jumped into the fray to put their unique perspective on things, presumably in the interest of selling even more research and analysis to explain the double-click down on said perspective for those who were confused by it, which was pretty much everyone. No wonder people are still scratching their heads on this thing.

But the best part is that from this cacophony of editorial opinions about what the cloud is, the voice of reason emerged from the most unlikely of places … that's right, you guessed it: the US government. You may (or may not) have heard of the National Institute of Standards and Technology (NIST, for short), which a couple of years ago came up with a working definition for cloud computing that's impressive for how concisely it nails a set of essential attributes of cloud computing (on-demand self-service, network accessible, pooled resources, elastic, and metered/measured). The definition paper itself is only 3 pages, and it's a government document. See here. Lots of people (yes, we vendors included) have a tendency to tweak this definition in weird, self-interested ways, but at the end of the day, it's a pretty bulletproof list of attributes that gets to the heart of what the cloud actually is.

Why should we even care about this? Because the shift going on in the industry at the moment has a major impact on the developers who build these apps, and on the design and architecture decisions that ultimately determine whether or not the apps are really optimized for cloud computing. At the risk of oversimplification, n-tier apps are really yesterday's design point. The new design point is cloud. So how do you get there? How do you optimize design & architecture around this new design point? Here's an admittedly incomplete list, but it represents a set of big-ticket best practices that should help get folks down this path …


Design for scale

Whenever people talk about cloud computing, they talk about scale. The platform you build on has a lot to do with that, but an app design that doesn't allow for scale renders the platform capabilities irrelevant. To deal with this, there are some design patterns that are well-understood and broadly used today. Statelessness is something that web developers have used for years to get scale, and it still holds true for cloud apps as well. Cloud apps run on commodity servers, any one of which could fail and get recycled, and you don't want your app affected when (not if) that happens. Writing asynchronous apps is another approach to getting scale – the idea here is that instead of relying on server availability to respond to multiple front-end requests, you can use things like message queues that can be scaled independently to process requests so users aren't waiting on synchronous responses from a slammed server. Another example of designing for scale involves using role concepts (web roles, worker roles, for example) to create "scale units", which are effectively units of work that you can consistently scale. In the world of "testing in production", this is important … it's simply not practical to test a service for hundreds of millions of users, but you can and should build and test a scalable unit of work that you know you can grow horizontally.
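
By way of illustration, here's a minimal Python sketch of that queue-based, stateless pattern (the queue is just an in-process stand-in for whatever hosted queue service your platform provides, and the role names are assumptions for the example): the front end accepts a request, drops it on a queue, and hands back a ticket, while identical workers – the scale units – drain the queue independently.

```python
import queue
import threading
import time
import uuid

# In-process stand-in for a durable, independently scalable cloud queue service.
work_queue = queue.Queue()

def handle_request(payload):
    """Front-end 'web role': keep no per-user state on the server, just enqueue
    the work and return a ticket the caller can use to pick up the result later."""
    ticket = str(uuid.uuid4())
    work_queue.put({"ticket": ticket, "payload": payload})
    return ticket  # the caller isn't left waiting on a synchronous response

def worker(worker_id):
    """Back-end 'worker role': a scale unit you can run N identical copies of."""
    while True:
        job = work_queue.get()
        time.sleep(0.1)  # simulate the actual work
        print(f"worker {worker_id} finished job {job['ticket']}")
        work_queue.task_done()

# Growing capacity horizontally means adding more identical workers.
for i in range(3):
    threading.Thread(target=worker, args=(i,), daemon=True).start()

for n in range(5):
    print("accepted ticket", handle_request({"order": n}))

work_queue.join()
```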

Design for failure

Resiliency is an attribute that's talked about quite a bit in the context of cloud computing, and it's grounded in the reality that stuff happens: hardware fails, human error comes into play, the list goes on and on. Designing for failure means cloud apps should absorb these failures, re-route workloads to running instances, and drive recovery time down toward zero. You're going to fail. Embrace it, and focus your energy on mean time to recovery (MTTR) rather than over-engineering for mean time to failure (MTTF). Included in the approach of designing for failure is geo-redundancy. When problems come up, they can often take down an entire datacenter. Even if you've replicated instances across multiple isolation zones or availability zones within a single datacenter, the unit of failure is still the physical datacenter. If you lose that, your service goes with it, so multiple instances across multiple geos not only provide high availability, but also solve the really hard problem of business continuity, which is now table stakes for a cloud app. What was once a serious piece of planning and orchestration becomes much simpler. The funny thing here is that if you talk to someone in enterprise IT about multi-instance and geo-redundancy, the response is often something along the lines of, "Yeah … no kidding." It's been a best practice in big IT for decades … and a lot of developers and cloud startups are learning why that is.
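
As a rough sketch of what that looks like in code (the endpoint names are purely hypothetical), the pattern is to assume any single instance or region can disappear at any time, retry with a little backoff, and fail over to an instance somewhere else rather than letting the failure surface to the user.

```python
import random
import time

# Hypothetical regional endpoints for the same service (names are illustrative).
ENDPOINTS = ["us-east.example.com", "eu-west.example.com", "asia-se.example.com"]

def call_endpoint(endpoint, request):
    """Simulate a remote call that fails some of the time."""
    if random.random() < 0.4:
        raise ConnectionError(f"{endpoint} unavailable")
    return f"{endpoint} handled {request}"

def resilient_call(request, attempts_per_endpoint=2):
    """Try each region in turn, with simple backoff, instead of assuming the
    primary will always be up; the goal is to keep MTTR near zero for the caller."""
    for endpoint in ENDPOINTS:
        for attempt in range(attempts_per_endpoint):
            try:
                return call_endpoint(endpoint, request)
            except ConnectionError as err:
                print(f"retrying after: {err}")
                time.sleep(0.1 * (attempt + 1))  # back off before the next try
    raise RuntimeError("all regions failed; degrade gracefully here")

print(resilient_call({"order_id": 42}))
```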

Decompose by workloads

A lot of applications are made up of workloads – seemingly individual pieces, each of which has a specific job to do. An online store, for example, is composed of search and checkout functionality, among other things. Each of these workloads may have unique availability requirements, costs, security requirements, capacity constraints, scalability needs, etc. For apps in the cloud, decomposing by workload means assuming more granular control over each workload, and optimizing each of them around what matters for that specific workload ... for some it might be scale, for others it might be resiliency or graceful degradation, for others it might be security. Even failure and recovery are dealt with at the workload level. You can make specific technology decisions at the workload level ... you might want to use a relational store for one workload, and a key-value store for another. You're basically optimizing the app on a workload-by-workload basis, which is a much more adaptable approach than a tightly coupled system. By the way, if any of this sounds like SOA circa the early 2000s, it's not a coincidence. This was one of its basic principles.
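
To make that a bit more concrete, here's a small, hypothetical sketch of what per-workload decisions might look like if you captured them as configuration: each workload of the online store declares its own data store, scaling range, and availability target instead of inheriting one set of choices for the whole app (all names and numbers are illustrative, not prescriptive).

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    name: str
    data_store: str             # relational, key-value, blob, ...
    min_instances: int
    max_instances: int
    availability_target: float  # fraction of time this workload must be up
    degrade_gracefully: bool    # can it serve a reduced experience on failure?

# Hypothetical profiles for an online store; each workload is optimized separately.
WORKLOADS = [
    WorkloadProfile("search",    "key-value",  min_instances=4, max_instances=50,
                    availability_target=0.999,  degrade_gracefully=True),
    WorkloadProfile("checkout",  "relational", min_instances=2, max_instances=10,
                    availability_target=0.9999, degrade_gracefully=False),
    WorkloadProfile("reporting", "blob",       min_instances=1, max_instances=2,
                    availability_target=0.99,   degrade_gracefully=True),
]

for w in WORKLOADS:
    print(f"{w.name}: store={w.data_store}, scale {w.min_instances}-{w.max_instances}, "
          f"target {w.availability_target:.2%}")
```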

Design for interoperability

Connecting multiple components across services running on the web is not a new idea, as composite apps have been around for decades. What's different now is that app composition/mash-up is no longer done in the confines of a walled garden or a proprietary, single-vendor stack. It's now done in the cloud, and interoperability and standards-based approaches matter more than ever before. Cloud development requires people to "think more like the web", and build apps with a mix of platform services, languages, runtimes, frameworks, and protocols that work together. This means that identity federation becomes pretty important, as having a composite app in which each piece has its own unique identity/auth system is unwieldy to say the least. A common set of REST APIs also makes life easier from a composition standpoint, as does OData for data access. The underlying assumption here is that religion about one stack to rule them all is a thing of the past, and we hear this from customers all the time … heterogeneous environments, whether on-prem or in the cloud, are the norm. The apps that run in these environments are simply nodes in a network of services, and those nodes need to interoperate without a lot of architectural gymnastics.
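
As a hedged sketch of what federated identity plus plain REST buys you in a composite app, the snippet below acquires one bearer token and presents it to two different services (a REST API and an OData feed) instead of juggling a separate credential scheme for each piece – the URLs, client ID, and endpoints are all hypothetical stand-ins, not a real service.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL    = "https://login.example.com/oauth2/token"          # hypothetical
CATALOG_API  = "https://catalog.example.com/api/products"        # hypothetical
ORDERS_ODATA = "https://orders.example.com/odata/Orders?$top=5"  # hypothetical

def get_federated_token():
    """Ask the identity provider for one token that every downstream service trusts."""
    form = urllib.parse.urlencode({"grant_type": "client_credentials",
                                   "client_id": "my-app",
                                   "client_secret": "..."}).encode()
    with urllib.request.urlopen(urllib.request.Request(TOKEN_URL, data=form)) as resp:
        return json.load(resp)["access_token"]

def call_service(url, token):
    """Each composed REST/OData service accepts the same bearer token."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}",
                                               "Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

token = get_federated_token()
print(call_service(CATALOG_API, token))
print(call_service(ORDERS_ODATA, token))
```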

Design for operations

There's a fair amount of energy today around the idea of "dev/ops" as a new org model for a services business – a much tighter integration between the building and running of apps that's more aligned with the services world of continuous development and deployment. But the organizational construct doesn't matter if the app itself doesn't facilitate it and unlock its potential. The attributes that support this are things like measurability, and the ability to isolate, detect, and roll back. Apps need to provide health information, and the implementation of versioned interfaces for doing diagnostics, drilling into issues, and applying fixes & remediation is a design-time decision. Taking it a step further, there is the issue of automation, and the use of these interfaces to automate creating, provisioning, de-provisioning, and restoring services. The more of this that's manual, the less reliable the app will be, so automation is another important thing to optimize around. Testing also plays a huge role here … you don't know how reliable your app is unless you're stressing it with failures as part of regular operations. Netflix's use of Chaos Monkey is probably the best example I've seen of how to go all-in on tuning your infrastructure to absorb and withstand failures.
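
A minimal, hypothetical sketch of that idea in Python: the app exposes a versioned health/diagnostics endpoint that monitoring and deployment automation can poll to decide whether an instance stays in rotation, rather than anyone logging into a box to find out what's going on (the path, port, and version string are illustrative).

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
APP_VERSION = "1.4.2"  # illustrative; in practice stamped in at build time

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A versioned diagnostics path that tooling can rely on across releases.
        if self.path == "/health/v1":
            body = json.dumps({
                "status": "healthy",
                "version": APP_VERSION,
                "uptime_seconds": round(time.time() - START_TIME, 1),
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Automation (deployment, monitoring, rollback) polls this endpoint to decide
    # whether an instance should stay in rotation.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```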


As I mentioned earlier, this is not by any means an all-inclusive drill-down into prescriptive architectural guidance on cloud apps … it's intended to be more of an introduction to the principle: there are a lot of developers these days putting single-instance, n-tier apps onto hosted VMs, proudly hanging the "cloud" shingle on their door, and then wondering why their apps are impacted by component failures, why they don't scale, why they have to manually look after their VMs, and why the services dream isn't being realized. I guess that's to be expected, given where we are in what is effectively a generational shift in computing, but we're moving toward something very different from the apps we know today. It's a new design point, a new set of app patterns, a whole new approach to designing, building, and running apps.

The Discussion

    MiZu

    Thanx Tim for the comprehensive collection!

    When we started out to architect Sentinel Cloud 2 ½ years ago, Brewer's CAP theorem helped a lot to get people's mindset straight. We knew the network does fail and customers need high availability, so we drilled down on what weak consistency means for our offering. And that really drove a revolution in how to think about licensing and authorization in the cloud versus licensing for on-premise. Traditionally, people think about licensing as one central control point that has all the data and always has full visibility. As an example, take a typical enterprise concurrency license model, where a software publisher wants to know how many concurrent licenses are in use – now take this model to the cloud... boom! You are doomed to fail. The service consumers will complain about slow response times, as all geos will need to talk to one global instance that manages all sessions to get to a strictly consistent global view. In addition, this one system is the perfect example of a single point of failure – boom! again.

    So we took the path of completely replacing 'old technology' and built our system to scale horizontally and globally, and to support weak, "eventual" consistency (as Amazon calls it). We have transformed the traditional enterprise license models into eventually consistent license models, employing asynchronous communication to propagate state to stateless service nodes that perform the decomposed functions required to collect, aggregate, prepare, and distribute the information. On top of this, local intelligence with smart caching enables us to deliver snappy response times and the means to cope with load and failures – something we never could have gotten to with the traditional architecture.

    But I think there is another very important takeaway. From an offering perspective, every company moving to the cloud needs to consider whether its traditional offering fits a cloud model – if you have relied on strict consistency, you will need to take a careful look at whether you really need it, as it makes achieving all the design goals Tim mentioned very difficult, and in some cases impossible. So moving to the cloud is not only about us doing the right things in technology; our business counterparts also need to understand the implications of walking in the clouds...
    Michael "MiZu" Zunke, CTO SRM, SafeNet Inc.
