Building Awesome Games with Service Fabric


Description

Well, we have some seriously exciting news for game developers around Microsoft Service Fabric's General Availability.

Service Fabric's General Availability was announced at Build 2016 by Scott Guthrie, where Scott Hanselman demonstrated the GA with live code from UK indie game Age of Ascent, showing how the small indie team has used Service Fabric to support over 50,000 players.

So what is Age of Ascent?

When Illyriad wanted to create a game on a massive scale never before achieved, they turned to Azure, ASP.NET and Azure Service Fabric to engage over 50,000 players through the browser on any device. In this video you'll hear how they developed Age of Ascent on Microsoft's cloud platform.

Age of Ascent Infrastructure


So what is Service Fabric?

Service Fabric was born from years of experience at Microsoft delivering mission-critical cloud services and is production-proven since 2010. It's the foundational technology on which we run our Azure core infrastructure, powering services including Skype for Business, Intune, Azure Event Hubs, Azure Data Factory, Azure DocumentDB, Azure SQL Database, and Cortana.

This experience allowed us to design a platform that intrinsically understands the available infrastructure resources and needs of applications, enabling an automatically updating, self-healing behavior that is essential to delivering highly available and durable services at hyperscale.

Service Fabric's stateful services are out-of-the-box resilient and highly reliable in the face of multiple VM failures. That said, the levels of resiliency and durability of data can be increased further, and that's what we'll run through below.

Setting up Service Fabric

The default deployment template for Service Fabric uses the VMs' fast temp drive for storage, which is stable across reboots and guest OS updates. However, the temp drive isn't stable across events that cause a VM migration, e.g. host OS updates, host failures, VM resizing, etc.

Luckily, this is exactly the scenario Service Fabric replicates its state for, and why there is a primary and multiple secondaries. The update and fault domains should prevent any ordinary data loss as long as you are using the recommended cluster size of 5 or higher.

Multiple Node Fault Reduction Requirements

If you shrink your cluster to 3 nodes, the risks are higher: if one node is taken down for an upgrade you are at the bare quorum minimum, and if a further node then fails you lose quorum and are down to a single node. With 5 nodes you'd be down to 3 and still maintain a quorum. It also follows that more, smaller nodes are better than fewer, larger nodes.
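The quorum arithmetic above can be sketched as follows (a minimal illustration: a write must be acknowledged by a strict majority of the replica set, so the survivable loss is whatever is left over):

```python
# Quorum arithmetic for a replica set spread across n nodes.
# A strict majority must remain reachable, so the number of
# simultaneous node losses you can absorb is n minus that majority.

def quorum(n: int) -> int:
    """Smallest strict majority of n replicas."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes you can lose at once and still keep quorum."""
    return n - quorum(n)

for n in (3, 5, 7, 9):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} simultaneous losses")
```

With 3 nodes a single loss leaves you exactly at quorum with no margin, while 5 nodes tolerate 2 simultaneous losses, which is why 5 is the recommended floor.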

Don't forget that microservice data will also migrate between nodes depending on load and resource usage anyway, so handling this is part of the system's design.

Azure Dashboard

Assuming you are at 5 nodes or above, for disaster recovery using temp drives in this replicated manner you need to consider what could actually put the data at risk: it would require failure of at least 3 VMs for a 5-node cluster, and more for a cluster with a higher replica count.

Total outage of a datacentre is one circumstance; however, it's very rare and getting rarer as lessons are learnt. It has happened, though, so it can't be discounted. A much more likely scenario is a capped subscription running out of credit, which will eventually de-provision the VMs.

Otherwise it's likely to be manual intervention, e.g. manually shutting down and de-provisioning the VMs in the cluster. When you start it up again, the VMs will have migrated hosts.

Expected reboots/upgrades are handled via the upgrade domains; resilience is provided by the fault domains (and Service Fabric). If you want to further insulate yourself from Azure infrastructure issues, you can choose a durability tier of Gold and run your services on full-node VMs like D15 or G5. This allows the system to push back against any Azure infrastructure changes that cause migrations (with the exception of host failure) for 2 hours, giving the replica set time to build itself a brand new copy, if needed.
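As a sketch, the durability tier is set per node type in the cluster's ARM template (property names follow the published Microsoft.ServiceFabric/clusters resource schema; the node type name and instance count here are illustrative):

```json
{
  "type": "Microsoft.ServiceFabric/clusters",
  "properties": {
    "nodeTypes": [
      {
        "name": "NodeType0",
        "isPrimary": true,
        "vmInstanceCount": 5,
        "durabilityLevel": "Gold"
      }
    ]
  }
}
```

The same durability value is also set in the ServiceFabricNode VM extension settings on the corresponding scale set, and the two need to agree.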

By default, Service Fabric uses the temp drive. However, you can override the default data folder location via an Azure Resource Manager (ARM) template deployment, changing the permanence of the storage with a performance/pricing trade-off. You can also raise the reliability of the cluster by increasing the number of replicas from the default of 3, though it needs to be an odd number for the quorum. The replica counts you can choose are 9, 7, 5 or 3, also referred to as Platinum, Gold, Silver and Bronze respectively.
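In the ARM template, that replica count maps onto the cluster's reliability level property, roughly as follows (a sketch against the Microsoft.ServiceFabric/clusters schema; the value shown is illustrative):

```json
{
  "type": "Microsoft.ServiceFabric/clusters",
  "properties": {
    "reliabilityLevel": "Silver"
  }
}
```

Here "Silver" corresponds to a replica set size of 5 for the cluster's system services, per the tier names above.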

Reliability of Service Fabric

Using memory-only replication (not currently available) at the Bronze level will give you general resilience across 2 VM failures (in a 5-node cluster), protection against unexpected reboots/migrations, and the fastest performance.

What VMs Do I need to use?

D Series

D-series virtual machines feature solid state drives (SSDs). This series is ideal for applications that demand faster CPUs, better local disk performance, or more memory.

Performance & Reliability Considerations

Using the temp drive SSDs (the default) on D/Dv2 will give you the next-highest-performance drive, and will additionally survive a total cluster reboot.

To survive a total cluster failure/migration/datacentre outage, the next highest performance option is the DS/DSv2 series using a custom ARM template with the data on multiple RAID'd premium storage drives. This is also the highest cost, since premium storage is both charged by reserved space (rather than used space) and more expensive than regular storage.

You shouldn't use DS/DSv2 over D/Dv2 VMs unless you are using premium storage, as 86% of the local SSD is given over to premium storage cache whether you are using it or not, so they have much less local SSD available. E.g. a D1 has 50GB of SSD, but a DS1 only has 7GB, with 43GB of premium storage cache.
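The 86% figure follows directly from the sizes quoted; a quick check:

```python
# Figures from the text: a D1 has 50GB of local SSD; on a DS1,
# 43GB of that class of SSD is reserved for premium storage cache.
d1_ssd = 50
ds1_cache = 43
print(f"{ds1_cache / d1_ssd:.0%} of the local SSD goes to cache")
```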

The next highest performance would be with data on multiple regular storage drives RAID'd, then on a single data drive, and finally on the OS drive.

So, there is a performance/disaster recovery/pricing trade off to be made in this whole calculation.

Using the temp drive for data will be highly resilient and survive node outages and host OS failures; however, there are some disaster circumstances where it won't be enough, and it depends on your data.

Using Azure Resource Manager Templates

Do you want zero lost transactions (which may happen with backup-based recovery), and are you willing to take a performance hit by moving to permanent storage, at various performance/cost levels? You can change the locations of the data and logs via ARM template deployment.
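As a sketch, the data and log locations are settings of the ServiceFabricNode VM extension on the scale set in the ARM template (the F:\ path is an illustrative attached data disk; the node type name is also illustrative):

```json
{
  "name": "ServiceFabricNodeVmExt",
  "properties": {
    "publisher": "Microsoft.Azure.ServiceFabric",
    "type": "ServiceFabricNode",
    "settings": {
      "clusterEndpoint": "...",
      "nodeTypeRef": "NodeType0",
      "dataPath": "F:\\SvcFab",
      "durabilityLevel": "Bronze"
    }
  }
}
```

Pointing dataPath at an attached (premium or standard) data disk instead of the temp drive is what trades performance and cost for permanence.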

Also don't forget you can have multiple node types:

So you could have some node types on faster storage and some on maximum disaster-resilient storage, then use placement constraints to say where the data is allowed to reside based on how critical it is.
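A sketch of how that looks: give the durable node type a placement property in the cluster template, then constrain the critical service to it in its service manifest (the property name storageTier and the service type name GameStateServiceType are hypothetical):

```xml
<!-- ServiceManifest.xml: pin this stateful service to nodes whose
     node type declares storageTier == premium, a hypothetical property
     defined via placementProperties on that node type in the ARM template. -->
<StatefulServiceType ServiceTypeName="GameStateServiceType" HasPersistedState="true">
  <PlacementConstraints>(storageTier == premium)</PlacementConstraints>
</StatefulServiceType>
```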

Interested in more about Age of Ascent and the use of Service Fabric?

Build Keynote Presentation including Age of Ascent

https://channel9.msdn.com/Events/Build/2016/KEY02

Gaming at Cloud Scale

https://channel9.msdn.com/Events/Build/2016/T622

Azure Service Fabric for Developers

https://channel9.msdn.com/Events/Build/2016/B874

Hosted Web Apps Myth #3: Hosted Web Apps Aren't Good for Games:

https://channel9.msdn.com/Events/Build/2016/P446?ocid=player

Age of Ascent from Illyriad Powered by Azure Service Fabric and ASP.NET

https://youtu.be/XaN-eXkIEbM

Deploying ASP.NET Core Applications

https://channel9.msdn.com/Events/Build/2016/B811
