Carlos Nunez

22 November 2017

The 3 Stages of Enterprise Monitoring (and 3 Ways to Improve)

Bzzzt. Bzzzt. Bzzzt. Your phone vibrates. It's 4:30am. You only got home a few hours before after spending 12 hours in the office with your team getting to the bottom of some data corruption on a file share used by one of your critical applications. That was the third 12-hour bender this month and it’s starting to take its toll. You're longing for a vacation and thinking of perusing LinkedIn again.

You answer the phone. It's your VP. Some folks in London are unable to view inventory reports from their reporting application, and they don't know why. You don't know why either. You're the on-call person, so you have to wake up and figure this out. Again. You take to your phone to see which alerts landed in your inbox, but there are so many of them that it's hard to keep track. It doesn't help that 80% of those alerts are useless. Looks like you'll have to VPN in and hope that it's just a stale DNS record again.

Has this happened to you? If so, then your monitoring system has failed you, and this post aims to help explain why and hopefully show you a way out.

What Is Monitoring?

Like containers, machine learning and “put it all on the blockchain”, ‘monitoring’ is one of those words that gets tossed around in DevOps conversations with many different meanings and in disparate contexts. Some use it in the context of building a logging system. Others use it as code for “we need better alerting”. With the word being used in so many different contexts, it's easy to forget what it really means.

Dictionary.com tells us that the act of monitoring is to "observe and check the progress or quality of (something) over a period of time; (to) keep under systematic review." This is all well and good, but I'd like to narrow this definition down a little further.

Monitoring is the means by which you keep the technology that powers your business from negatively impacting cash flow.

Defining monitoring this way gives us three advantages.

1) It (rightly!) re-frames a technology problem as a business problem

As I'll describe later in this post, a common problem I've seen amongst operations teams is “too much monitoring”, where there are metrics being collected and alerts being generated for everything and anything that seems important, but very, very few of them are actually relevant to the wider business context.

The point of monitoring is, first and foremost, to serve business goals!

2) It helps define Service Level Agreements that your technology will adhere to

Many Service Level Agreements (SLAs) and indicators are made in a vacuum. If you've ever heard the phrase "this system needs nine nines," then you know what I'm talking about. Creating them in the context of cash flow puts dollar values against those agreements and helps your team calculate exactly how much it will cost to implement "nine nines" worth of uptime.
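To make the "dollar values" point concrete, here is a minimal sketch of the arithmetic behind an availability target. The function names and the revenue-per-minute figure are made up for illustration; plug in numbers from your own business.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability target of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return MINUTES_PER_YEAR * unavailability


def downtime_cost(nines: int, revenue_per_minute: float) -> float:
    """Revenue at risk per year if the whole downtime budget is used."""
    return allowed_downtime_minutes(nines) * revenue_per_minute


if __name__ == "__main__":
    # Hypothetical business that earns $1,000 per minute of uptime.
    for nines in (2, 3, 5, 9):
        budget = allowed_downtime_minutes(nines)
        cost = downtime_cost(nines, 1000.0)
        print(f"{nines} nines: {budget:.6f} min/year, ${cost:,.2f} at risk")
```

Three nines leaves a budget of about 525.6 minutes (8.76 hours) per year, while nine nines leaves roughly 0.03 seconds. Seeing the budget next to the cost of achieving it tends to make the SLA conversation much more honest.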

3) It places the responsibility of monitoring on the entire company, not just operations

Operations teams can't know how everything in an organization should run. In mature organizations, every team is responsible for defining the metrics, alerts and dashboards that describe what it means for their business domain to be “healthy”.

The Three Stages of Monitoring Maturity

Getting the monitoring problem right doesn't happen overnight. My experience has shown me that there are three distinct stages of monitoring maturity that enterprises go through. Let's take a deeper look into each one.

Stage Zero: No monitoring at all

Stage Zero is the greenfield of monitoring maturity. Your enterprise or business unit either doesn't have a monitoring suite at all or you have bought one, but other, more pressing deliverables have caused its implementation to fall by the wayside. SLAs or indicators either don't exist or they exist, but aren't enforced (are you getting “nine nines”?). Consequently, your operations teams use gut feel and past tickets instead of real-time data to gain a grip on your infrastructure. Repeat problems are addressed manually either by runbooks or by ad-hoc automation (i.e. scripts that someone wrote, with that someone usually being the person that everyone calls when something goes wrong).

Ultimately, your customers generally tell you what's wrong with your infrastructure before your infrastructure does.

This is not where you want to be!

There are several competitive disadvantages that result from this scenario.

1. Your engineers are wasted on fire-fighting

In almost every case I've seen, not having a monitoring platform leads to fragile infrastructure that's constantly broken. Fragile infrastructure requires constant attention and fire-fighting and, like oxygen to a flame, it sucks up all of your good talent on keeping the fires in check (and building an ever-growing mountain of technical debt) instead of taking your business to the next level. As said in The Phoenix Project, “Left unchecked, technical debt will ensure that the only work that gets done is unplanned work!”

2. It promotes a hero culture

You know who I'm talking about. The guy or gal that ultimately gets every ticket. The one that solves all of the "hard" problems. The "hero" that comes through when the fires get really big. This doesn't happen intentionally. This is always the result of unplanned work getting in the way of building the culture of learning and knowledge sharing needed to build awesome products that your customers will love. Worse, if that person ever leaves, years and years of institutional knowledge evaporate, and getting out of that black hole is incredibly hard.

3. It's expensive

The devil is always in the details. While not having monitoring or not planning for it from the onset seems cheap and easy, it creates technical debt that accumulates with interest. The more moving parts there are in your infrastructure, the higher that interest rate is. Paying that debt off by implementing monitoring “later” is often much more expensive than getting monitoring right from the start.

Stage 1: Some Monitoring

The next stage in the monitoring journey is having some monitoring in place. My experience has shown me that most enterprises fall into this category and that there are two variations: too little monitoring or too much of it.

Engineering organizations with too little monitoring have a product in place that was often the result of an early proof-of-concept that never progressed further. Knowledge of the product is poor, but there is just enough experience in the team to keep it running.

Engineering organizations with too much monitoring, on the other hand, have implemented multiple products, and knowledge of those tools typically resides within a single team, or in some cases, even a single person. Metrics, indicators, dashboards and alerts from your infrastructure and applications are everywhere, but separating the wheat from the chaff is difficult. This often leads to many duplicate alerts that get ignored, a condition that Google's Andrea Spadaccini calls ‘pager fatigue’.
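One common way to cut down on duplicate alerts is to group them by a fingerprint before anyone gets paged. The sketch below is a minimal illustration, not a feature of any particular product: the field names (service, check, source) are assumptions about what your alert payloads contain.

```python
from typing import Dict, Iterable, List, Sequence


def deduplicate(
    alerts: Iterable[Dict[str, str]],
    key_fields: Sequence[str] = ("service", "check"),
) -> List[dict]:
    """Collapse alerts that share the same fingerprint into one entry.

    Each entry keeps the first alert's fields plus a count of how many
    alerts were merged and the list of tools that reported them.
    """
    grouped: Dict[tuple, dict] = {}
    for alert in alerts:
        fingerprint = tuple(alert.get(field, "") for field in key_fields)
        if fingerprint not in grouped:
            grouped[fingerprint] = {**alert, "count": 0, "sources": []}
        entry = grouped[fingerprint]
        entry["count"] += 1
        source = alert.get("source", "unknown")
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return list(grouped.values())


if __name__ == "__main__":
    # Hypothetical alerts for the same problem raised by three tools.
    raw = [
        {"service": "reports", "check": "http_5xx", "source": "tool-a"},
        {"service": "reports", "check": "http_5xx", "source": "tool-b"},
        {"service": "reports", "check": "http_5xx", "source": "tool-c"},
        {"service": "db", "check": "disk_full", "source": "tool-a"},
    ]
    for entry in deduplicate(raw):
        print(entry["service"], entry["check"], entry["count"], entry["sources"])
```

Grouping like this treats the symptom rather than the cause. The longer-term fix is still fewer overlapping tools.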

There are problems common to both situations. SLAs and indicators are usually established and agreed upon at this stage, but they are often more a product of gut feel than data about your business. One or more engineering teams have some semblance of an on-call rotation at this stage, but most of the big problems still require ‘heroes’ to fly into the scene when things get really bad. While your infrastructure sometimes tells you what's going on, your customers are, more often than not, still the first to let you know when something is amiss.

Over time, organizations that stay at this stage experience the same, albeit less drastic, impacts as those exhibited in Stage Zero: fire-fighting, hero culture and expensive outages. Even with these good intentions, trust in the engineering organization dwindles, and the war between the ‘business’ and the ‘IT department’ intensifies.

You don’t really want to be here, either.

Stage 2: Great Monitoring

This stage is the benchmark that every organization looks to, but that few actually dedicate the time and resources to reach. This happens for good reason: buying the right tools isn't enough to get here; you need cultural change as well. And that’s hard.

In this stage, monitoring is not just an operations responsibility; it is a company-wide responsibility. From project and product management to application feature teams, everyone is responsible for knowing how to create metrics, alerts and dashboards for their business domain on a common monitoring platform.

Additionally, SLAs are formed by well-documented service level indicators and objectives, both of which are created and refined by the data retrieved from your monitoring platform. Infrastructure metrics that generate alarms are tied to these indicators, and when alarms are triggered, an automated runbook is executed to silence them. In a more advanced scenario, these alarms work in conjunction with time-series data exposed by their complementary metrics to predict and address infrastructure problems before they become incidents. When manual intervention is needed, there is enough documentation on the application and its infrastructure for any engineer working the on-call shift to be able to address the problem themselves.
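As a small illustration of using time-series data to get ahead of a problem, the sketch below fits a straight line through recent disk-usage samples and estimates how long until the disk fills. It is a deliberately simple model (ordinary least squares, linear growth assumed), and the function name and sample values are hypothetical. Real platforms usually offer more robust forecasting.

```python
from typing import Optional, Sequence


def hours_until_full(
    hours: Sequence[float],
    used_pct: Sequence[float],
    limit: float = 100.0,
) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches `limit`.

    Fits a least-squares line through (hour, used %) samples. Returns None
    when usage is flat or shrinking, since no fill time can be projected.
    """
    n = len(hours)
    if n < 2 or n != len(used_pct):
        raise ValueError("need at least two samples of equal length")
    mean_x = sum(hours) / n
    mean_y = sum(used_pct) / n
    var_x = sum((x - mean_x) ** 2 for x in hours)
    if var_x == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(hours, used_pct)
    ) / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    time_at_limit = (limit - intercept) / slope
    # Never report a negative time if the limit has already been crossed.
    return max(0.0, time_at_limit - hours[-1])


if __name__ == "__main__":
    # Hypothetical samples: disk grows 2 percentage points per hour.
    remaining = hours_until_full([0, 1, 2, 3], [70, 72, 74, 76])
    print(f"Projected hours until full: {remaining}")
```

An alarm driven by this kind of projection can trigger the cleanup runbook hours before the disk actually fills, which is the difference between a ticket and an outage.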

This is where you want to be!

There are several advantages that make getting to this stage worthwhile.

1. Less toil means better products

Operational duties in this model are largely reduced to improving process (ensuring documentation and automated runbooks are up to date, ensuring metrics, indicators and dashboards haven't gone stale) instead of churning through toil. As a result, your organization's engineers are spending their time on capital projects that help your business stay on the cutting edge and ahead of your competition.

2. Decisions become data-driven, enabling you to truly move fast and break things

Great monitoring enables a data-driven culture that allows decisions to be made based on what your customers are actually doing (and how your infrastructure responds to it) instead of what you think your customers are doing. Because your product indicators are tracked and monitored just like free disk space or web page requests per second, your product teams can iterate and experiment significantly more quickly than before.

3. Less toil means better talent retention

Paul Glen, co-author of The Geek Leader's Handbook, put it best: "We love to tackle hard (but not impossible) problems. Simple things are boring, but juicy problems are a joy." While fighting fires can yield interesting problems (tracking down an unusually slow database query or failed login into a web service can be fun in a masochistic kind of way), implementing Kubernetes or architecting your organization's next-generation cross-cloud platform are much more fun and rewarding projects that help to attract and retain top engineering talent more easily.

4. Your technology spend becomes more efficient

A common example I've seen of this is improper compute sizing. This problem typically manifests as one or more application teams making requests for more computing power than they actually need. The only way to properly fix this is by having monitoring data that can either support or refute that request.
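Here is a rough sketch of how monitoring data can check a sizing request. It takes CPU utilization samples from the existing machines and suggests a core count that would put 95th-percentile usage at a target utilization. The 60% target, the nearest-rank percentile and the function names are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of `samples` (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommended_vcpus(
    current_vcpus: int,
    cpu_util_pct: Sequence[float],
    target_util_pct: float = 60.0,
) -> int:
    """Cores needed so that p95 usage sits at the target utilization."""
    p95 = percentile(cpu_util_pct, 95)
    busy_vcpus = current_vcpus * p95 / 100
    return max(1, math.ceil(busy_vcpus * 100 / target_util_pct))


if __name__ == "__main__":
    # Hypothetical: a 16-core host that rarely climbs above 20% CPU.
    samples = [10.0] * 18 + [20.0, 30.0]
    print("Recommended vCPUs:", recommended_vcpus(16, samples))
```

If a team asks for 32 cores and the data says 6 would do, you now have something better than an opinion to bring to the conversation. The same data can just as easily prove that the request is justified.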

Three Ways of Getting to Monitoring Maturity

Bringing your organization to the ‘Great Monitoring’ stage is not impossible. However, much like your journey towards embracing the DevOps methodology, embracing a monitoring-first culture takes time and discipline to get right.

1. Use One Monitoring Platform Across All Your Teams

The data team uses ELK. The application teams use New Relic and a separate ELK cluster from the data team. The platform team uses Splunk.

If this situation seems familiar to you, then the next thing you need to consider is having every team use a common set of tools and using only those tools in your monitoring stack. This reduces the training requirements and number of knowledge silos needed to maintain this stack and centralizes knowledge of what makes your platform actually run into a single source of truth.

If this situation is not familiar to you because you're at Stage Zero and do not have any tooling, then you will need to pick a tool. There are two classes of monitoring tools that I've observed: SaaS-based and on-premise tools.

SaaS-based tools, such as New Relic, Datadog and Sumo Logic, are meant to provide a large set of capabilities in a simple and easy to learn/discover interface and a mechanism for exporting metrics data from your infrastructure into theirs. On-premise tools, on the other hand, give you ultimate flexibility in how you employ your monitoring framework and keep all of your data in your datacenter or private cloud. Sensu, Nagios and Zenoss are examples of some products in this category.

While SaaS tools tend to be more expensive in the long run than an on-premise solution, their low-friction onboarding process makes it very easy to implement a high-quality monitoring platform in very little time. They can also give you a blueprint for building effective monitoring for your organization that you can take with you into an on-premise solution.

2. Break Down Your Monitoring Silo with Monitoring-as-Code

The first step in this journey is moving monitoring responsibilities away from operations and into every team in your engineering organization. Infrastructure engineers know how to maintain core infrastructure. Database engineers know the right metrics to track to ensure that their databases remain high-performing. Product Analysts and Customer Success Managers know the indicators of their product that are key to customer engagement and satisfaction. Letting each of these teams create the metrics, alerts and dashboards for their domains as well as write the runbooks behind each alert will significantly improve the reliability and knowledge of your platform and get you quite far into your monitoring maturity journey.

Monitoring as code makes this task significantly easier. Implementing your alarms, indicators and dashboards with code using a common language provides a common ground for teams to accomplish the above while giving you the accountability, auditing and recovery benefits of having code in source control. Even better, application teams can update their monitoring stack at deployment time within their delivery pipelines. This way, documentation and maintenance of each component of your platform will almost never go stale.

You can see an example of this with the Barkdog Monitoring DSL for Datadog. In this scenario, application and product teams would write Barkdog DSL code specifying the metrics to alert on, the query to evaluate and test, the condition to alert on based on the result of that query and the action to take if the alert condition is met. Your engineers might need to write a DSL of their own depending on the monitoring platform used by your organization.
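To show the general shape of monitoring as code without tying it to any one product, here is a minimal sketch in Python. It is not Barkdog syntax and it does not call the Datadog API. The Monitor fields, the query string and the runbook URL are all hypothetical. The idea is that each team keeps definitions like these in source control, and a pipeline step validates and renders them before pushing them to whatever platform you use.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List

VALID_COMPARISONS = {">", ">=", "<", "<="}


@dataclass(frozen=True)
class Monitor:
    name: str
    owner_team: str
    query: str        # opaque string in your platform's query language
    comparison: str   # how the query result is compared to the threshold
    threshold: float
    runbook_url: str  # what the on-call engineer should do when it fires

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the monitor is OK."""
        problems = []
        if self.comparison not in VALID_COMPARISONS:
            problems.append(f"{self.name}: bad comparison {self.comparison!r}")
        if not self.owner_team:
            problems.append(f"{self.name}: every alert needs an owning team")
        if not self.runbook_url:
            problems.append(f"{self.name}: every alert needs a runbook")
        return problems


def render(monitors: Iterable[Monitor]) -> str:
    """Validate all monitors, then emit stable JSON for the pipeline."""
    monitors = list(monitors)
    errors = [problem for m in monitors for problem in m.validate()]
    if errors:
        raise ValueError("; ".join(errors))
    ordered = sorted(monitors, key=lambda m: m.name)
    return json.dumps([asdict(m) for m in ordered], indent=2, sort_keys=True)


if __name__ == "__main__":
    checkout_latency = Monitor(
        name="checkout-latency-high",
        owner_team="checkout",
        query="avg(checkout.latency_ms) over 5m",  # placeholder query
        comparison=">",
        threshold=750.0,
        runbook_url="https://example.com/runbooks/checkout-latency",
    )
    print(render([checkout_latency]))
```

Failing the build when an alert has no owner or runbook is a cheap way to enforce the "every team writes the runbook behind each alert" rule described above.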

3. Implement a Highly-Scalable and Highly-Efficient On-Call Process with Interrupts

Operating systems are the bridge between the hardware inside of our computers, servers, routers, switches, and other computing devices and the software that does stuff with that hardware. There are two primary components to every operating system: an operating system kernel that does the things and an operating system userland that users of these computing devices actually see and touch (like the desktop and browser you're reading this on). Furthermore, despite the many advancements we have made in computer processor technology, at their core (no pun intended), processor cores can only do one thing at a time. The consequence of this is simple: every request made by software to do something to or with underlying hardware has to wait in line. However, some events, like a key being pressed on a keyboard or a sound card needing more audio to play, are more urgent than others. In order to ensure that these get serviced quickly instead of waiting in line, hardware devices can raise interrupts that tell the operating system kernel to pause what it is doing because “this needs to get done right now”.

Traditional operations teams work in much the same fashion. Engineers within engineering teams are in one of three states: working on capital projects that advance the business platform, improving current processes that make administering that platform simpler and more efficient, or servicing requests from consumers of the platform to address issues. However, some consumer requests, like "our customer-facing website is down in production", take precedence over others, like "this database query for my custom tool is unusually slow". In most cases, these interrupts are given to anyone with free cycles to service the request.

The problem with interrupts in both cases is the same: the more time an operating system or engineering team spends addressing interrupts, the less time they are able to spend on other things that might be more beneficial in the long term. Google calls this situation ‘toil’; most engineers call it ‘fire-fighting’.

To address this, Google SRE teams adopted a new approach to servicing interrupts. Instead of interrupts getting serviced by anyone with free bandwidth, the primary on-call person is tasked with dealing with every service request that comes into the team's queue with the secondary on-call person serving as a backup when the queue gets sufficiently large. The key element here is that the person on "interrupts" is not to do any project or story work that does not pertain to a request or improving process. This way, if your engineer that's on interrupts is spending a lot of time working on similar requests, this gets treated as a "bug" that requires automation to fix.
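The routing rule described above can be sketched in a few lines. This is my own simplification, not Google's tooling: the overflow threshold, the repeat threshold and the ticket categories are all assumptions you would tune for your team.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Request = Tuple[str, str]  # (ticket_id, category)


def route(
    queue: Sequence[Request],
    primary: str,
    secondary: str,
    overflow_at: int = 5,
) -> Dict[str, List[str]]:
    """Give the primary on-call every request until the queue gets too deep.

    Requests beyond position `overflow_at` spill over to the secondary.
    Nobody else on the team is interrupted.
    """
    assignments: Dict[str, List[str]] = {primary: [], secondary: []}
    for position, (ticket_id, _category) in enumerate(queue):
        owner = primary if position < overflow_at else secondary
        assignments[owner].append(ticket_id)
    return assignments


def automation_candidates(
    handled: Sequence[Request], min_repeats: int = 3
) -> List[str]:
    """Categories seen often enough to be treated as a 'bug' to automate."""
    counts = Counter(category for _ticket_id, category in handled)
    return sorted(c for c, seen in counts.items() if seen >= min_repeats)


if __name__ == "__main__":
    # Hypothetical week of requests for a platform team.
    week = [
        ("T-1", "password-reset"),
        ("T-2", "disk-full"),
        ("T-3", "password-reset"),
        ("T-4", "password-reset"),
        ("T-5", "dns-stale"),
        ("T-6", "disk-full"),
    ]
    print(route(week, primary="ana", secondary="raj"))
    print("Automate next:", automation_candidates(week))
```

The second function is the important one. A category that keeps showing up in the interrupt queue is a signal to write automation, not a reason to add more people to the rotation.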

Dave O'Connor, a Storage SRE at Google, helped pioneer this framework. You can read more about the mechanics of it in Site Reliability Engineering, which is available online for free; see the chapter on "Dealing with Interrupts".

Monitoring is a Lifestyle, Not Just a KPI!

In conclusion, implementing effective monitoring is incredibly important for the business. A mature monitoring culture helps your organization make data-driven decisions that delight your customers, break down communication silos between engineering and business teams, and be more efficient with your technology spend.

However, much like becoming a DevOps-driven organization, building a mature monitoring platform doesn't happen overnight. The journey is a marathon, not a sprint, and picking the right tool for all of your teams, implementing monitoring as code and enacting an interrupts-driven support framework are checkpoints in the right direction. Having a digital platform that largely takes care of itself is not impossible or solely a privilege of startup unicorns; all you have to do is commit to begin!
