Friday, January 11, 2013

Rethinking the Role of Incidents in Service Management

I once had my accounts at a bank where customer service was very good at resolving errors in my account. However, I ended up needing to call them almost every single month to get an error resolved. I don't bank there anymore. Imagine this bank using the slogan, "We fix account errors faster than you may expect!" Do you want to invest in that bank? Then why does ITSM's primary message often sound similar?
  • We outperform Incident SLAs 90% of the time
  • We've reduced the negative impact of changes by 60%
Do you realize that what we're saying in business terms is, essentially, "We fail less frequently"? Is that the message you want to send?

I'm not saying these measures are bad, or that we shouldn't tell our business partners about them. It's just that focusing our metrics around these types of measures implies that the reason we get a paycheck is that we can fix the problems created by the very systems we deployed. In other words: we're good at fixing our failures. I know the reality is more complicated than that, but is your business partner wrong to arrive at that conclusion?

Service value is based on business value. Period. Business value means increasing revenue, decreasing costs, increasing goodwill, or improving outcomes around a corporate strategic plan. Even at the non-profit where I work, value is based on one or more of these four factors. You might replace "goodwill" with "mission impact", but it is effectively the same thing.

Why, then, do we put so much emphasis on self-reported issues as a proxy for value? Self-reported issues have been the easiest way to collect data regarding the value of our services. That doesn't make it a better way, or even a good way; just an easy way. Let me ask it another way: are self-reported incidents a good way to measure the effectiveness of service management? Of course not, and for three very good reasons:
  1. Self-reported incidents only tell us about things that hurt service consumers to the point that they have no choice but to contact us. Most people don't care enough to reach out, until the pain is so great that it cannot be carried any longer. What about all the non-reported defects?
  2. Service management is about ensuring the service consumer is getting what they need from the service. Incidents only tell us about broken stuff, which barely scratches the surface of ensuring the service consumer is getting what they need.
  3. Self-reported incidents tell us almost nothing about service value. How do they tell us about increased revenue, decreased costs, or increased goodwill? They might tell us a little bit about increasing or decreasing costs, but even that is a stretch. Even if tracking self-reported defects could tell us something useful about cost control, it's still effectively telling the business that IT is getting better at fixing its own failures. Again, not a bad thing, but nothing to crow about, either.
I propose that incidents should not be limited to service interruptions, and even if they are, they should cover ALL service interruptions, not just the self-reported ones. I want to take it further, however. Service Management needs to be more closely tied to Customer Experience Management. The book "Outside In: The Power of Putting Customers at the Center of Your Business", written by analysts from Forrester, provides an excellent overview and model for implementing customer experience management. Forrester provides a mini overview on their blog. One characteristic is mapping the intended customer experience for your product or service.

My idea is to take the intended experience for a service and compare it to the actual customer/user service experience. An "incident" then becomes ANY variance between the expected and actual experience. These variances could be good or bad. I want to know about errors, of course. I also want to know about the ways the users of my service do things differently from, and possibly better than, the way we expected the service to be used. I'll know a lot more about the value of the service this way.

This requires a bit of imagining the future, but it can be a near-term future. Developers can focus efforts on gathering and reporting customer experience with their apps. Would there be a market for applications that allow configuration of customer experience standards? I know I would be interested in purchasing an app that, in addition to its user functionality, included this sort of capability.

In the meantime, we can use many of the tools we already use for measuring customer experience. First, however, we need to document what we expect that experience to be and what it should provide.


  1. Love it. 100% agree.

    Except :) don't overwork the Incident. Incident management is about making users happy again, restoring service, or dealing with a perceived loss of service. It is user-focused: that is the expertise, the goal, the metrics.

    By all means track all interruptions to service or deviations from service levels. I've long argued we need an Interruption entity. Just don't use the Incident for it. I know ITIL does, but that's damaging to our ability to be good at restoring a user's experience. I agree that's not what we should measure, but it is still a really important activity.

    Put another way, currently Incident Management is outward facing - let's not focus it on internal issues.

  2. Great post, Dan.

    Reminds me of a post I did in 2008 on how I reported outages. I called it the "Service Outage Avoidance" metric. It turned the focus from resolving to avoiding. It had significant results for my team and changed the dynamic in how we accepted releases and planned architectures. We started with HA and worked backwards.

    Take a read and let me know what you think.


  3. Matt, I like your post, and agree that there are some similarities. The biggest difference is that what I'm looking at is impacted by technology, but is not necessarily the technology (CIs) itself. I've already seen some confusion between what I mean by customer experience vs. service levels. Service levels usually mean things like "how long did it take for the transaction to process?" Customer experience is about the customer/consumer/user touchpoints with the service.

    A simple example would be a customer placing an order for a widget through our app. Our experience design indicates that the process of placing the order should take 3 clicks, over an average of 90 seconds, taking into account pieces of information that the customer may need to gather from other sources (credit card info, shipping address, etc.). A variance from this experience would be a customer transaction that takes 7 clicks over a 15-minute period. The variance may be due to technology, and it may be due to customer issues. Either way, I'm interested. A variance could also be a customer transaction that involves only 2 clicks over 25 seconds. That variance may be good or bad, but either way I want to get to the cause.
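To make the widget-order example concrete, here is a minimal Python sketch of how such a variance check might look. Everything in it is an assumption for illustration: the names `ExperienceTarget`, `Transaction`, and `experience_variance` are hypothetical, and the 25% tolerance band is an arbitrary choice, not something the experience design above specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperienceTarget:
    """The designed experience (hypothetical structure)."""
    clicks: int
    seconds: float


@dataclass(frozen=True)
class Transaction:
    """One observed customer transaction (hypothetical structure)."""
    transaction_id: str
    clicks: int
    seconds: float


def experience_variance(
    target: ExperienceTarget,
    observed: Transaction,
    tolerance: float = 0.25,  # assumed band: 25% either side of the target
) -> Optional[dict]:
    """Return the deviations if the transaction falls outside the
    tolerance band on clicks or time, in either direction; else None."""
    click_delta = observed.clicks - target.clicks
    second_delta = observed.seconds - target.seconds
    clicks_off = abs(click_delta) > tolerance * target.clicks
    time_off = abs(second_delta) > tolerance * target.seconds
    if not (clicks_off or time_off):
        return None
    return {
        "transaction_id": observed.transaction_id,
        "click_delta": click_delta,
        "second_delta": second_delta,
    }


# Target from the example: 3 clicks over about 90 seconds.
order_target = ExperienceTarget(clicks=3, seconds=90)

slow = Transaction("t1", clicks=7, seconds=900)    # 7 clicks over 15 minutes
fast = Transaction("t2", clicks=2, seconds=25)     # 2 clicks over 25 seconds
typical = Transaction("t3", clicks=3, seconds=100)

for tx in (slow, fast, typical):
    print(tx.transaction_id, experience_variance(order_target, tx))
```

Note that the check flags both the slow and the fast transaction. It makes no judgment about whether a variance is good or bad; it only surfaces the transaction so someone can look into the cause.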

    This is an oversimplified example, but I hope it makes the point. I want to focus service management on the customer experience, of which the technology is one, hugely important, part.