Most people get annoyed when they are stuck doing boring chores that could be handled more efficiently, and over time this leads to burnout and exhaustion.
A couple of years ago Google introduced the SRE practice. One of its core terms is toil, which at first glance can be mistaken for any boring, repetitive task.
However, toil is not just work you don’t like to do. Let me elaborate on that and try to explain why it is important for DevOps and SRE.
What is toil
In the SRE discipline, toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of long-term value.
Thus, each time an operator has to touch a production system by hand, that time counts as toil.
Why is that important
Toil tends to scale linearly as the service grows. As such, the SRE discipline strives to reduce toil as much as possible. This approach lets engineers work on real engineering tasks, so they can spend their time making their services better.
Measuring toil helps your team defend its SLOs as efficiently as possible.
Toil and toil budgets are closely tied to the SRE principles of “measure everything” and “leverage tooling and automation”. By giving operators a quantitative measurement, toil and toil budgets help keep a balance between administering the system and improving it.
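To make the idea of a quantitative measurement concrete, here is a minimal sketch of a toil budget check. The function names are hypothetical. The default 50% cap follows the guideline in Google's SRE book of keeping toil below half of each engineer's time. Treat it as an assumption to tune for your own team.

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Return the share of working time spent on toil (0.0 to 1.0)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def within_budget(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """Check whether toil stays at or below the budget cap.

    The 0.5 default is an assumption based on Google's published guideline,
    not a universal rule.
    """
    return toil_fraction(toil_hours, total_hours) <= cap
```

For example, an engineer who spent 10 of 40 working hours on toil is at 25% and within a 50% budget, while 30 of 40 hours would exceed it.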
How do we calculate toil
For instance, at 7 am the main website goes down, and one of the monitoring systems raises a critical alert and sends it to Amixr Incident Management.
An engineer starts working on the incident and marks it as Acknowledged.
At 10 am the engineer gets the service back up and running and marks the incident as Resolved. The time spent fixing the problem counts as toil, and Amixr reports how much budget is left for further incidents.
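The arithmetic behind this example can be sketched in a few lines. This is an illustration only, not Amixr's actual API: the function names, the calendar date, and the 20-hour weekly budget are all hypothetical, and it assumes the engineer acknowledged the page right at 7 am so that the whole outage counts as toil.

```python
from datetime import datetime, timedelta


def incident_toil(acknowledged: datetime, resolved: datetime) -> timedelta:
    """Toil for one incident: time from acknowledgement to resolution."""
    if resolved < acknowledged:
        raise ValueError("resolved must not be earlier than acknowledged")
    return resolved - acknowledged


def remaining_budget(budget: timedelta,
                     incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Subtract the toil of every incident from the toil budget."""
    spent = sum((incident_toil(ack, res) for ack, res in incidents), timedelta())
    return budget - spent


# The outage from the text: acknowledged at 7 am, resolved at 10 am.
# The date itself is made up for the example.
outage = (datetime(2024, 1, 15, 7, 0), datetime(2024, 1, 15, 10, 0))
weekly_budget = timedelta(hours=20)  # hypothetical budget

left = remaining_budget(weekly_budget, [outage])
print(left)  # 17:00:00
```

Here the three-hour outage consumes 3 of the 20 budgeted hours, leaving 17 hours for further incidents that week.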
It’s impossible to avoid toil entirely, but you can manage it with the proper tools and proper toil budgeting.