Guest Post: Monitoring Microsoft Azure VMs with Datadog
- Posted: Jan 19, 2016 at 8:00AM
The following is a guest post by John Matson, a content developer at Datadog, whose cloud monitoring-as-a-service solution lets users view metrics from all of their apps, tools, and services in one place. This post is Part 1 of a three-part series on monitoring Azure Virtual Machines.
Whether you run Linux or Windows on Azure, you will want to monitor certain basic VM-level metrics to make sure your servers and services are healthy. Four of the most generally relevant metric types are CPU usage, disk I/O, memory utilization and network traffic. Below we'll explore the available metrics in these areas and highlight key metrics to monitor.
Azure users can monitor the following metrics using the Azure web portal or can access the raw data directly via the Azure diagnostics extension. Azure also integrates seamlessly with Datadog, providing additional monitoring functionality such as drag-and-drop dashboards, correlation of performance metrics across technologies, and sophisticated alerting mechanisms, including automated outlier detection. With our streamlined Azure integration setup, you can begin collecting all of the metrics listed below and visualizing them in Datadog within minutes.
This article references metric terminology introduced in Datadog's Monitoring 101 series, which provides a framework for metric collection and alerting.
CPU usage is one of the most commonly monitored host-level metrics in any setting. Whenever an application's performance starts to slide, one of the first metrics an operations engineer will usually check is the CPU usage on the machines running that application.
CPU percentage: Percentage of time CPU utilized
CPU user time: Percentage of time CPU in user mode
CPU privileged time: Percentage of time CPU in kernel mode
CPU metrics allow you to determine not only how utilized your processors are (via CPU percentage) but also how much of that utilization is accounted for by user applications. The CPU user time metric tells you how much time the processor spent in the restricted "user" mode, in which applications run, as opposed to the privileged kernel mode, in which the processor has direct access to the system's hardware. The CPU privileged time metric captures the latter portion of CPU activity.
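To make the relationship between these three metrics concrete, here is a minimal Python sketch. It assumes that total CPU utilization is approximately the sum of user time and privileged time, which is a simplification that may not hold exactly on every OS. The function name and the sample values are hypothetical and are not part of any Azure or Datadog API.

```python
def cpu_breakdown(user_pct: float, privileged_pct: float) -> dict:
    """Split CPU utilization into user-mode and kernel-mode shares.

    Assumption: total utilization ~= user time + privileged time.
    """
    total = user_pct + privileged_pct
    if total == 0:
        # An idle CPU has no utilization to attribute to either mode.
        return {"total_pct": 0.0, "user_share": 0.0, "privileged_share": 0.0}
    return {
        "total_pct": total,
        "user_share": user_pct / total,
        "privileged_share": privileged_pct / total,
    }


# Hypothetical reading: 60% user time, 20% privileged time.
sample = cpu_breakdown(user_pct=60.0, privileged_pct=20.0)
print(sample)
```

A high privileged share on a busy host suggests the load comes from kernel work such as I/O or networking rather than from application code.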
Although a system in good health can run with consistently high CPU utilization, you will want to be notified if your hosts' CPUs are nearing saturation. Below is a heatmap of CPU utilization across a large cluster of VMs — in Datadog you can easily set up a single alert to monitor CPU usage across an entire cluster, whether in aggregate or at the individual VM level. For instance, you may want to be notified if the average CPU utilization for the cluster serving a specific application surpasses 75 percent, or if any one host in that cluster is over 90 percent utilization for a prolonged period.
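The two alert conditions described above can be sketched in a few lines of Python. This is an illustration of the logic only, not Datadog's alerting implementation: the host names and sample data are made up, and treating "a prolonged period" as three consecutive samples is an assumption.

```python
from statistics import mean

CLUSTER_AVG_THRESHOLD = 75.0  # percent, from the example in the text
HOST_THRESHOLD = 90.0         # percent, from the example in the text
SUSTAINED_SAMPLES = 3         # assumption: "prolonged" means 3 samples in a row


def cluster_average_breached(latest_by_host: dict) -> bool:
    """True if the average of the latest readings exceeds the cluster threshold."""
    return mean(latest_by_host.values()) > CLUSTER_AVG_THRESHOLD


def hosts_sustained_high(history_by_host: dict) -> list:
    """Hosts whose last SUSTAINED_SAMPLES readings all exceed HOST_THRESHOLD."""
    flagged = []
    for host, samples in history_by_host.items():
        recent = samples[-SUSTAINED_SAMPLES:]
        if len(recent) == SUSTAINED_SAMPLES and all(s > HOST_THRESHOLD for s in recent):
            flagged.append(host)
    return sorted(flagged)


# Hypothetical CPU utilization history (percent), oldest reading first.
history = {
    "vm-a": [55.0, 60.0, 58.0],
    "vm-b": [92.0, 95.0, 97.0],
    "vm-c": [40.0, 91.0, 45.0],  # a brief spike, not sustained
}
latest = {host: samples[-1] for host, samples in history.items()}

print(cluster_average_breached(latest))
print(hosts_sustained_high(history))
```

Requiring several consecutive readings above the threshold keeps a momentary spike, like the one on the hypothetical `vm-c`, from triggering a notification.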
Datadog's color-coded host maps can also provide a high-level view of CPU usage. With host maps, you can view your entire infrastructure on one screen or filter and aggregate your VMs on the fly by region, instance size, or OS. Below is a breakdown of Azure VMs by region, color-coded by CPU usage.
Datadog host map
Monitoring disk I/O is critical for understanding how your applications are impacting your hardware, and vice versa. For additional visibility beyond the VM-level metrics covered here, you can also collect metrics from your Azure storage accounts to determine if your storage is being throttled or has availability issues that could impact performance.
Data read from disk, per second
Data written to disk, per second
Monitoring the amount of data read from disk can help you understand your application's dependence on disk. If the application is reading from disk more often than expected, you may want to add a caching layer or switch to faster disks to relieve any bottlenecks.
Monitoring the amount of data written to disk can help you identify bottlenecks caused by I/O. If you are running a write-heavy application, you may wish to upgrade the size of your VM to increase the maximum number of IOPS (input/output operations per second).
Datadog visualizes the read and write throughput from all your instances so you can ensure that disk I/O does not hinder performance.
Monitoring memory usage can help identify low-memory conditions and performance bottlenecks.
Free memory, in bytes/MB/GB
Number of pages written to or retrieved from disk, per second
Paging events occur when a program requests a page that is not available in memory and must be retrieved from disk, or when a page is written to disk to free up working memory. Excessive paging can introduce slowdowns in an application. A low level of paging can occur even when the VM is underutilized — for instance, when the virtual memory manager automatically trims a process' working set to maintain free memory. But a sudden spike in paging can indicate the VM needs more memory to operate efficiently. Datadog alerts can be set to trigger on the absolute number of pages, on sudden changes in the rate of paging, or on anomalous paging rates from a particular host as compared to its peers.
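To illustrate what comparing a host to its peers can look like, here is a small Python sketch that flags hosts whose paging rate sits far above the group's median. It uses a simple median-absolute-deviation check chosen for illustration; it is not Datadog's outlier detection algorithm, and the function name, tolerance, and sample rates are all hypothetical.

```python
from statistics import median


def paging_outliers(pages_per_sec_by_host: dict, tolerance: float = 3.0) -> list:
    """Return hosts whose paging rate is unusually high relative to their peers.

    A host is flagged when its rate exceeds the group median by more than
    `tolerance` times the median absolute deviation (MAD).
    """
    rates = list(pages_per_sec_by_host.values())
    center = median(rates)
    mad = median([abs(rate - center) for rate in rates])
    if mad == 0:
        # Peers are too uniform for this simple check to be meaningful.
        return []
    return sorted(
        host
        for host, rate in pages_per_sec_by_host.items()
        if rate - center > tolerance * mad
    )


# Hypothetical paging rates (pages per second) across five peer VMs.
rates = {"vm-a": 10, "vm-b": 12, "vm-c": 11, "vm-d": 9, "vm-e": 250}
print(paging_outliers(rates))
```

The median is used instead of the mean so that a single misbehaving host does not drag the baseline toward itself.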
Azure's default metric set provides data on network traffic in and out of a VM. Depending on your OS, the network metrics may be available in bytes per second or via the number of TCP segments sent and received. Because TCP segments are typically limited to about 1,500 bytes each (the standard Ethernet MTU), the number of segments sent and received provides a reasonable proxy for the overall volume of network traffic.
Bytes sent, per second
Bytes received, per second
TCP segments sent: Segments sent, per second
TCP segments received: Segments received, per second
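When only segment counts are available, a rough ceiling on traffic volume follows from multiplying the segment rate by the largest payload a segment can carry. The Python sketch below shows that arithmetic. The segment size is passed in as a parameter because it depends on the network path; the 1,460-byte value in the usage line is an assumption (a common payload size on a 1,500-byte Ethernet MTU), and the function name and sample rate are hypothetical.

```python
def max_bytes_per_sec(segments_per_sec: float, max_segment_bytes: int) -> float:
    """Upper bound on throughput implied by a TCP segment rate.

    Real traffic will usually be lower, since many segments
    (ACKs, small requests) carry far less than the maximum payload.
    """
    return segments_per_sec * max_segment_bytes


# Hypothetical reading: 2,000 segments sent per second,
# with an assumed maximum payload of 1,460 bytes per segment.
estimate = max_bytes_per_sec(2000, 1460)
print(estimate)
```

Because this is an upper bound rather than a measurement, it is better suited to spotting trends and sudden changes than to capacity accounting.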
You may wish to generate a low-urgency alert when your network traffic nears saturation. In Datadog, such alerts will not necessarily notify anyone directly but will record the occurrence in a durable, searchable event stream in case it becomes useful for investigating a performance issue. Such alerts can provide invaluable context for investigations without unnecessarily interrupting anyone's work, sleep, or personal time.
If your network traffic suddenly plummets, your application or network may be overloaded. Change alerts, which trigger on relative changes over a user-defined timeframe, are ideal for detecting such issues without firing off useless alerts as network traffic naturally ebbs and flows due to the time of day, the day of the week, or even the time of year.
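The idea behind a change alert can be sketched as a comparison between two adjacent time windows. The Python below is an illustration under stated assumptions, not Datadog's implementation: the function names, the 50 percent drop threshold, and the sample values are all hypothetical.

```python
from statistics import mean


def relative_change(previous_window: list, current_window: list) -> float:
    """Fractional change of the current window's average versus the previous one."""
    baseline = mean(previous_window)
    if baseline == 0:
        # No baseline traffic, so a relative change is undefined; report none.
        return 0.0
    return (mean(current_window) - baseline) / baseline


def traffic_dropped(previous_window: list, current_window: list,
                    max_drop: float = 0.5) -> bool:
    """True if average traffic fell by at least `max_drop` (0.5 means 50 percent)."""
    return relative_change(previous_window, current_window) <= -max_drop


# Hypothetical bytes-per-second samples from two consecutive windows.
previous = [1000.0, 1100.0, 900.0]
sudden = [300.0, 200.0, 400.0]    # sharp fall
gradual = [900.0, 950.0, 850.0]   # ordinary ebb

print(traffic_dropped(previous, sudden))
print(traffic_dropped(previous, gradual))
```

Because the check is relative to the recent baseline rather than to a fixed number, the same rule works at peak hours and during quiet periods.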
In this post we've explored several general-purpose metrics you should monitor to keep tabs on your Azure Virtual Machines. Monitoring the four metric types covered above (CPU usage, disk I/O, memory utilization, and network traffic) will give you a high-level view of your VMs' health and performance.
When you integrate Azure with Datadog, your metrics will immediately begin to populate a pre-built, fully customizable Azure dashboard that displays these key metrics to provide a high-level view of your deployment.
Over time you will recognize additional, specialized metrics that are relevant to your applications. Part 2 of this series provides step-by-step instructions for collecting any metric you may need on an ad hoc basis, and Part 3 covers how you can easily implement continuous, comprehensive Azure monitoring with Datadog.
Many thanks to reviewers from Microsoft for providing important additions and clarifications prior to publication.