Our job is to build the best product on the market for incident management by using what we know and learn about IT teams and IT incidents.
We’ve conducted dozens of interviews with SRE teams around the world and processed millions of incidents. The first 150,000 incidents were crucial. Why? That was the point at which we gained vital insight into monitoring, types of teams, types of incidents and problems, and potential solutions.
In our research, we found that some SRE and DevOps teams don’t always share their struggles or problems with the public while other teams do not implement practices to assist in overcoming those struggles to achieve success. This completely changed the game for us. (To learn more about these struggles and insights, stay tuned for an upcoming series.)
Although we used numbers to verify our hypothesis, these statistics are not enough to draw an entirely accurate conclusion. The numbers lack data diversity because our product targets a limited audience. This presents a vast opportunity for deeper research, which we plan to conduct and publish here in the future.
For now, we hope this overview will help you identify problems in your incident management process and implement practical solutions to achieve greater success.
"Alert Pressure" on SREs
We quickly noticed a variety of patterns from different monitoring systems. Surprisingly, my colleagues and I never thought about these differences while we were members of SRE teams ourselves. We gained this insight as we started processing alerts from our clients. I’ll show you how it works with Grafana, AlertManager, Sentry, and self-made monitoring.
Take a look at the screenshot below from Grafana:
The world is always changing. Monitored metrics tend to cross thresholds, as you see here. Everything was fine at first, but after the metric crossed the threshold, Grafana sent an alert and the Site Reliability Engineer checked that it was nothing critical. Grafana sent the same alert again later, but it was ignored because the first one had been a false positive.
Grafana continued sending alerts and, within a few weeks, the team was trained to ignore the 10-20 alerts Grafana sent daily. It only takes 15 minutes to adjust the threshold, but the team was interrupted around 20 times each day. Focus and team performance dramatically declined.
SRE teams are generally quick to fall into the rabbit hole of a constant, meaningless alert flow and some level of distraction.
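To make the bounce pattern concrete, here is a minimal Python sketch. It is not Grafana's actual evaluation logic; the samples, the `count_alerts` helper, and the thresholds are made up for illustration. It counts how many times a metric crosses a threshold upward, assuming each upward crossing fires one alert.

```python
from typing import Sequence


def count_alerts(samples: Sequence[float], threshold: float) -> int:
    """Count upward threshold crossings; each one is assumed to fire an alert."""
    alerts = 0
    above = False
    for value in samples:
        if value > threshold and not above:
            alerts += 1  # the metric just went from OK to alerting
        above = value > threshold
    return alerts


# Hypothetical CPU-load samples hovering around 80%.
samples = [78, 81, 79, 82, 80, 83, 79, 81, 78, 84]

print(count_alerts(samples, threshold=80))  # 5 alerts from a bouncing metric
print(count_alerts(samples, threshold=85))  # 0 alerts after adjusting the threshold
```

The same ten samples produce five interruptions or none, depending only on where the threshold sits.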
Let’s take a look at the AlertManager pattern from the Prometheus stack. Unfortunately, most AlertManager installations present a reproducible issue: they generate a constant alert flow during a long-lasting incident.
This is completely different from what we see with Grafana. AlertManager generates alerts during the incident. This is likely caused by its stateless nature and its somewhat confusing grouping configuration:
group_by: ['instance', 'job']
group_wait: 45s
group_interval: 2m # Usually ~5 mins or more.
repeat_interval: 4h
Read more about Alertmanager's grouping in our docs.
This load can be hugely distracting for on-call engineers. I met an engineer who connected Twilio to AlertManager and received a phone call every five minutes during an incident. It is hard to concentrate when your phone rings constantly.
The exciting thing we experience with AlertManager is the insane number of alerts that hit our API the moment a client adopts it. The record is 29,000 alerts per hour! This was the result of a misconfigured grouping that generated alerts about the metrics from all nodes, pods, and services.
- “Hey, our backend is under some excessive load. I’m worried.”
- “It’s Client X experimenting with k8s and AlertManager.”
- “Oh, ok. Hope they figure out the grouping soon.”
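A rough way to see why grouping matters so much: alerts that share the same values for the `group_by` labels land in one group, so grouping by a per-node label like `instance` creates one group per node. The Python sketch below is a simplified model, not Alertmanager's implementation; the `group_key` and `count_groups` helpers and the sample alerts are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]


def group_key(labels: Labels, group_by: Iterable[str]) -> Tuple[str, ...]:
    """Build a grouping key from the chosen labels (missing labels become "")."""
    return tuple(labels.get(name, "") for name in group_by)


def count_groups(alerts: List[Labels], group_by: List[str]) -> int:
    """Number of distinct groups, i.e. separate notification streams."""
    return len({group_key(alert, group_by) for alert in alerts})


# Hypothetical incident: the same alert fires on 50 nodes at once.
alerts = [
    {"alertname": "HighLatency", "job": "api", "instance": f"node-{i}"}
    for i in range(50)
]

print(count_groups(alerts, ["instance", "job"]))   # 50 groups, one per node
print(count_groups(alerts, ["alertname", "job"]))  # 1 group for the whole incident
```

With per-instance grouping, every extra node, pod, or service multiplies the number of notification streams.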
Pattern of crash monitors like Sentry
We’ve tried to uncover why Sentry users generate these patterns—the flat rate of alerts followed by a spike and a small period of silence. An interview led us to this insight: engineers who monitor exceptions work in two different modes.
The first mode involves monitoring all new exceptions that show up in the system. They mostly look for new types of exceptions, something rare and unusual that could be a signal of a third-party outage or a security issue.
The second mode is usually enabled after a new release is deployed. There are many new exceptions, so it’s much easier to sort them by frequency and then fix them beginning with the first exception.
Since our product is about notifying engineers, the reason is as simple as:
Self-written monitoring pattern
Every client has its own self-made monitoring tool. Although these tools differ from one another, they usually share a few traits:
- They could bounce around the threshold and spam
- They reflect more releases and spam
- They are usually stateless and spam during the incident timeframe
Their superpower is to fail unnoticed.
AlertManager has heartbeats, Sentry is a SaaS, and Grafana has a web interface that is visible to everyone. Self-made monitoring tools are usually based on cron, so it’s hard to notice when they stop working.
Action Items We All Love
Different patterns generate different problems that require different solutions.
For Grafana and other bounce-like monitoring, I suggest using “no flood tolerance.” This is hard to achieve since it is counter-intuitive in a daily routine. Production Meetings could help tremendously, but they need to have strong records and analytics of all the alerts from an observed period of time and the amount of work they caused.
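As an illustration of the kind of record a Production Meeting could start from, here is a minimal Python sketch that ranks alert rules by how often they fired. The log format (day, rule name), the rule names, and the `noisiest_rules` helper are assumptions made for the example, not something our product or Grafana exports in this shape.

```python
from collections import Counter
from typing import List, Tuple


def noisiest_rules(alert_log: List[Tuple[str, str]], top: int = 3) -> List[Tuple[str, int]]:
    """Return the rules that fired most often, with their counts."""
    return Counter(rule for _day, rule in alert_log).most_common(top)


# Hypothetical alert log: (day, rule name).
alert_log = [
    ("mon", "cpu-high"),
    ("mon", "cpu-high"),
    ("mon", "disk-full"),
    ("tue", "cpu-high"),
    ("tue", "latency"),
    ("tue", "latency"),
]

print(noisiest_rules(alert_log, top=2))  # [('cpu-high', 3), ('latency', 2)]
```

A list like this gives the meeting a concrete starting point: fix or retune the rule at the top first.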
For AlertManager and similar stateless instruments, it’s good to have additional services that will track the state of an incident and allow an on-call engineer or incident commander to focus on important action items instead of a flood of alerts and issues.
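Such a state-tracking service can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any particular product works: the `IncidentTracker` class is hypothetical, each alert is assumed to carry a stable key and a status of either "firing" or "resolved", and a notification goes out only when an incident opens or closes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IncidentTracker:
    """Remembers which incidents are open so repeated alerts stay quiet."""
    open_incidents: Dict[str, int] = field(default_factory=dict)
    notifications: List[str] = field(default_factory=list)

    def handle(self, key: str, status: str) -> bool:
        """Process one alert; return True if a notification was sent."""
        if status == "firing":
            if key in self.open_incidents:
                self.open_incidents[key] += 1  # repeat: count it, do not notify
                return False
            self.open_incidents[key] = 1
            self.notifications.append(f"OPENED {key}")
            return True
        if status == "resolved" and key in self.open_incidents:
            repeats = self.open_incidents.pop(key)
            self.notifications.append(f"RESOLVED {key} after {repeats} alerts")
            return True
        return False  # unknown status, or resolve for an incident we never saw


tracker = IncidentTracker()
for _ in range(10):
    tracker.handle("db-down", "firing")
tracker.handle("db-down", "resolved")
print(tracker.notifications)
# ['OPENED db-down', 'RESOLVED db-down after 10 alerts']
```

Ten incoming alerts become two notifications, and the on-call engineer still sees how noisy the incident was.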
For any self-made monitoring system, monitoring is also required. This could be a simple heartbeat from cron scripts or a set of multiple monitoring instruments.
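A heartbeat check can be very small. In the Python sketch below, each cron script is assumed to report in after every run, and anything silent for longer than a chosen limit is flagged. The `HeartbeatMonitor` class, the check names, and the 120-second limit are all hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat of each check and reports the silent ones."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence = max_silence_seconds
        self.last_seen: Dict[str, float] = {}

    def beat(self, name: str, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected for testing."""
        self.last_seen[name] = time.time() if now is None else now

    def stale(self, now: Optional[float] = None) -> List[str]:
        """Return the checks that have been silent for too long."""
        current = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if current - seen > self.max_silence
        )


monitor = HeartbeatMonitor(max_silence_seconds=120)
monitor.beat("disk-check", now=0)     # last ran at t=0
monitor.beat("queue-check", now=100)  # last ran at t=100
print(monitor.stale(now=150))         # ['disk-check']
```

The checker itself should run somewhere other than the cron host it watches, otherwise both can fail together unnoticed.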
Topics Coming Soon:
- Practical Advice on Limiting the Number of Alerts with Toil Budget
- The Terrifying Cost of Unbalanced Incident Management in a Small Team