Well, we have some seriously exciting news around Microsoft Service Fabric's General Availability and what it means for game developers.
Service Fabric's General Availability was announced at Build 2016 by Scott Guthrie, and Scott Hanselman demonstrated the GA release with live code from Age of Ascent, showing how the small UK indie team behind it has used Service Fabric to support over 50,000 players.
So what is Age of Ascent?
When Illyriad wanted to create a game at a scale never before achieved, they turned to Azure, ASP.NET and Azure Service Fabric to engage over 50,000 players through the browser on any device. In this video you'll hear how they developed Age of Ascent on Microsoft's cloud platform.
So what is Service Fabric?
Service Fabric was born from years of experience at Microsoft delivering mission-critical cloud services and is production-proven since 2010. It's the foundational technology on which we run our Azure core infrastructure, powering services including Skype for Business, Intune, Azure Event Hubs, Azure Data Factory, Azure DocumentDB, Azure SQL Database, and Cortana.
This experience allowed us to design a platform that intrinsically understands the available infrastructure resources and needs of applications, enabling an automatically updating, self-healing behavior that is essential to delivering highly available and durable services at hyperscale.
- Fast time to market: Service Fabric lets developers focus on building features that add business value to their application, without the overhead of designing and writing additional code to deal with issues of reliability, scalability, or latency in the underlying infrastructure.
- Choose your architecture: Build stateless or stateful microservices—an architectural approach where complex applications are composed of small, independently versioned services—to power the most complex, low-latency, data-intensive scenarios and scale them into the cloud.
- Microservice agility: Architecting fine-grained microservice applications allows continuous integration and development practices and accelerates delivery of new functions into the application.
- Visual Studio integration: Includes Visual Studio tooling, as well as command line support, so developers can quickly and easily build, test, debug, deploy, and update their Service Fabric applications on single-box, test, and production deployments.
Service Fabric's stateful services are resilient out of the box and highly reliable in the face of multiple VM failures. That said, the levels of resiliency and durability of data can be increased further, and that's what we'll run through below.
Setting up Service Fabric
The default deployment template for Service Fabric uses the VMs' fast temp drive for storage, which is stable across reboots and guest OS updates. However, the temp drive isn't stable across events that cause a VM migration, e.g. host OS updates, host failures, VM resizing, etc.
Luckily, those are exactly the failures Service Fabric replicates its state to protect against, and why there is a primary and multiple secondaries. The update and fault domains should prevent any standard data loss as long as you are using the recommended cluster size of 5 nodes or higher.
Fault Reduction Requirements
If you shrink your cluster to 3 nodes then the risks are higher: take one node out for an upgrade and you are sitting right at the quorum boundary (quorum is a majority, i.e. floor(n/2) + 1, so 2 of 3), and if a further node fails you are down to a single node and out of quorum. With 5 or more nodes you'd be down to 3 and still maintain quorum. It also follows that more, smaller nodes are better than fewer, larger nodes.
Don't forget that microservice data will also migrate between nodes depending on load and resource usage anyway, so handling this movement is already part of the system's design.
Assuming you are at 5 nodes or above, for disaster recovery with temp drives used in this replicated manner you need to consider what would actually put data at risk – it would take the failure of at least 3 VMs in a 5-node cluster, and more for a cluster with a higher replica count.
Total outage of a datacentre is one such circumstance; however, it's very rare and getting rarer as lessons are learnt. It has happened, though, so it can't be discounted. A much more likely scenario is a capped subscription running out of balance, which will eventually de-provision the VMs.
Otherwise, it's likely to be manual intervention, e.g. manually shutting down and de-provisioning the VMs in the cluster. When you start them up again, the VMs will have migrated hosts.
Expected reboots/upgrades are handled via the upgrade domains; resilience is provided by the fault domains (and Service Fabric). If you want to further insulate yourself from Azure infrastructure issues, you can choose the Gold durability tier and run your services on full-node VMs like D15 or G5. This allows the system to push back against Azure infrastructure changes that cause migrations (with the exception of host failure) for 2 hours, giving the replica set time to build itself a brand new copy if needed.
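As a rough sketch, the durability tier is declared per node type inside the Microsoft.ServiceFabric/clusters ARM resource. The node type name below is invented for illustration, and property names should be checked against the current ARM schema:

```json
"nodeTypes": [
  {
    // Illustrative name; Gold durability needs full-node VM sizes
    // such as Standard_D15_v2 or Standard_G5 in the matching scale set
    "name": "GoldNodes",
    "durabilityLevel": "Gold",
    "isPrimary": true,
    "clientConnectionEndpointPort": 19000,
    "httpGatewayEndpointPort": 19080,
    "vmInstanceCount": 5
  }
]
```

(ARM templates accept //-style comments even though plain JSON does not.)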
By default, Service Fabric uses the temp drive. However, you can override the default data folder location via an Azure Resource Manager (ARM) template deployment, changing the permanence of the storage with a performance/pricing trade-off. You can also raise the reliability of the cluster by increasing the replica set size from the default, though it needs to be an odd number for quorum. The replica counts you can choose are 9, 7, 5 or 3, also referred to as Platinum, Gold, Silver and Bronze respectively.
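The reliability tier itself is a single property on the cluster resource. A minimal, hedged sketch (the other required cluster properties are omitted for brevity):

```json
"properties": {
  // Bronze = 3 replicas, Silver = 5, Gold = 7, Platinum = 9
  "reliabilityLevel": "Silver"
  // ...plus managementEndpoint, nodeTypes, etc.
}
```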
Using memory-only replication (not currently available) at the Bronze level would give you general resilience across 2 VM failures (in a 5-node cluster) and protection against unexpected reboots/migrations – with the fastest performance.
What VMs do I need to use?
D-series virtual machines feature solid-state drives (SSDs). This series is ideal for applications that demand faster CPUs, better local disk performance, or more memory.
| Instance | Cores | RAM | Disk sizes |
| --- | --- | --- | --- |
| D1 | 1 | 3.5 GB | 50 GB |
| D2 | 2 | 7 GB | 100 GB |
| D3 | 4 | 14 GB | 200 GB |
| D4 | 8 | 28 GB | 400 GB |
| D11 | 2 | 14 GB | 100 GB |
| D12 | 4 | 28 GB | 200 GB |
| D13 | 8 | 56 GB | 400 GB |
| D14 | 16 | 112 GB | 800 GB |
We have a new variant of the D-series sizes called "DS" that is specifically targeted at Premium Storage.
The pricing and billing meters for the DS sizes are the same as D-series. SSD storage included in D-series VMs is local temporary storage. For persistent storage, use DS VMs instead and purchase Premium Storage separately.
- These sizes are equivalent to the D-series Virtual Machine sizes in memory, processor, and pricing.
- The one difference that sets the DS-Series apart from the D-series is how the local SSD space is used.
- The D-series VMs provide 50GB – 800GB of local SSD space to be used as temporary storage.
- The DS-Series VMs will use 25% of the local SSD to provide temporary storage for the Virtual Machine.
- The remaining 75% of the local SSD available for the VM will be used to provide an efficient data cache for persistent disks using premium storage.
- The DS-Series of VMs are currently the only VM sizes that support persistent disks backed by Premium Storage.
Dv2-series instances are the next generation of D-series instances that can be used as Virtual Machines or Cloud Services.
| Instance | Cores | RAM | Disk sizes |
| --- | --- | --- | --- |
| D1 v2 | 1 | 3.5 GB | 50 GB |
| D2 v2 | 2 | 7 GB | 100 GB |
| D3 v2 | 4 | 14 GB | 200 GB |
| D4 v2 | 8 | 28 GB | 400 GB |
| D5 v2 | 16 | 56 GB | 800 GB |
| D11 v2 | 2 | 14 GB | 100 GB |
| D12 v2 | 4 | 28 GB | 200 GB |
| D13 v2 | 8 | 56 GB | 400 GB |
| D14 v2 | 16 | 112 GB | 800 GB |
| D15 v2 | 20 | 140 GB | 1,000 GB |
- Dv2-series instances carry more powerful CPUs, on average about 35% faster than D-series instances, and have the same memory and disk configurations as the D-series.
- Dv2-series instances are based on the latest generation 2.4 GHz Intel Xeon® E5-2673 v3 (Haswell) processor, and with Intel Turbo Boost Technology 2.0 can go to 3.2 GHz.
- Dv2-series and D-series are ideal for applications that demand faster CPUs, better local disk performance, or more memory, and offer a powerful combination for many enterprise-grade applications.
Performance & Reliability Considerations
Using the temp drive SSDs (the default) on D/Dv2-series VMs gives you the next highest performance after memory-only replication, and will additionally survive a total cluster reboot.
To survive a total cluster failure/migration/datacentre outage, the next highest performance option is the DS/DSv2 series using a custom ARM template with the data on multiple RAID'd Premium Storage drives (a sketch of the disk declaration follows). This is also the highest-cost option, because Premium Storage is charged by reserved space (rather than used space) and is more expensive than regular storage.
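As an illustrative sketch, the extra Premium Storage disks would be declared on the scale set's storage profile and then striped (e.g. RAID-0 via Storage Spaces) inside the guest OS. This assumes managed disks, and sizes and property names should be checked against your API version:

```json
"storageProfile": {
  "dataDisks": [
    // Two empty premium disks, to be striped inside the VM
    { "lun": 0, "createOption": "Empty", "diskSizeGB": 128,
      "managedDisk": { "storageAccountType": "Premium_LRS" } },
    { "lun": 1, "createOption": "Empty", "diskSizeGB": 128,
      "managedDisk": { "storageAccountType": "Premium_LRS" } }
  ]
}
```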
You shouldn't use DS/DSv2 over D/Dv2 VMs unless you are actually using Premium Storage: the bulk of the local SSD (86% in the case of the DS1) is given over to the Premium Storage cache whether you are using it or not, so much less local SSD is available to you. E.g. a D1 has 50 GB of SSD, but a DS1 only has 7 GB, with 43 GB of Premium Storage cache.
The next highest performance would be with data on multiple regular storage drives RAID'd together, then on a single data drive, and finally on the OS drive.
So, there is a performance/disaster recovery/pricing trade-off to be made in this whole calculation.
The temp drive for data will be highly resilient, surviving node outages and host OS failures thanks to replication; however, there are some disaster circumstances where it won't be enough, and whether that matters depends on your data.
Using Azure Resource Manager Templates
If you want zero lost transactions (loss which can happen when restoring from backup) and are willing to take a performance hit, you can move to permanent storage at various performance/cost levels. You can change the locations of the data and logs via ARM template deployment, as sketched below.
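A hedged sketch of that override in the ServiceFabricNode VM extension on the scale set; the drive letter and node type name are illustrative, and setting names should be checked against the extension schema for your SDK version:

```json
{
  "name": "ServiceFabricNodeVmExt",
  "properties": {
    "type": "ServiceFabricNode",
    "publisher": "Microsoft.Azure.ServiceFabric",
    "typeHandlerVersion": "1.0",
    "autoUpgradeMinorVersion": true,
    "settings": {
      "clusterEndpoint": "[reference(parameters('clusterName')).clusterEndpoint]",
      "nodeTypeRef": "NodeType0",
      // Move Service Fabric's data root off the default temp drive (D:)
      // onto a persistent data disk
      "dataPath": "F:\\SvcFab",
      "durabilityLevel": "Silver"
    }
  }
}
```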
Also, don't forget you can have multiple node types:
So you could have some on faster storage and some on maximum disaster-resilient storage, then use placement constraints to say where data is allowed to reside, based on how critical it is – see the sketch below.
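As a sketch of that idea (node type names and the placement property are invented for illustration), each node type declares placement properties in the cluster ARM resource, and a service can then be constrained with an expression such as (storageTier == premium) so that its replicas only land on the durable nodes:

```json
"nodeTypes": [
  {
    "name": "FastNodes",
    "placementProperties": { "storageTier": "tempssd" },
    "isPrimary": true,
    "clientConnectionEndpointPort": 19000,
    "httpGatewayEndpointPort": 19080,
    "vmInstanceCount": 5
  },
  {
    "name": "DurableNodes",
    "placementProperties": { "storageTier": "premium" },
    "isPrimary": false,
    "clientConnectionEndpointPort": 19000,
    "httpGatewayEndpointPort": 19080,
    "vmInstanceCount": 5
  }
]
```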
Interested in more about Age of Ascent and its use of Service Fabric?
Build Keynote Presentation including Age of Ascent