Most IT teams measure everything and act on nothing. The average enterprise ITSM platform can generate 200 or more distinct reports. Senior leaders who ask "how is IT doing?" get a dashboard full of green lights — and then a P1 the next morning. The metrics are being collected. The dashboards are being maintained. The QBRs are being scheduled. And almost none of it is driving decisions.

This is not a technology problem. It is a selection problem. Organizations track what is easy to measure — ticket volume, SLA compliance percentages, response time averages — rather than what is actually predictive. Easy metrics tell you what happened. The five metrics below tell you what is going to happen, and whether the structural health of your IT operation is improving or eroding.

I have sat in rooms where IT leaders presented 40-metric dashboards to their executive teams and walked out without a single decision being made. I have also seen a single CFO conversation turn on one number — a change failure rate that no one had ever calculated before that meeting. What follows are not the most common metrics tracked in ITSM environments. They are the ones that actually predict whether IT is delivering value, preventing problems, and improving over time.

8.40 business hours: industry average MTTR. Elite teams: under 2 hours for standard incidents.

~5%: change failure rate for elite performers (DORA 2024). Most organizations: 15–25%.

74%: industry average first contact resolution rate. Top quartile: 80%+.

Metric 1 — Mean Time to Resolve (MTTR)

What it is

Mean Time to Resolve is the average elapsed time from when an incident is opened to when it is closed — measured in business hours, not clock hours. That distinction matters. A ticket opened at 4:45 PM on a Friday and resolved at 9:00 AM Monday is not a 64-hour resolution. In business hours, it is roughly a one-hour resolution. Measuring in clock hours inflates MTTR in ways that obscure the actual performance of your team.
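The business-hours calculation is easy to get wrong in ad hoc reporting. A minimal sketch, assuming a Monday–Friday, 9:00–5:00 business window (the window is an assumption; adjust the constants to match your own support hours):

```python
from datetime import datetime, time, timedelta

BUSINESS_START = time(9, 0)   # assumed start of the business day
BUSINESS_END = time(17, 0)    # assumed end of the business day

def business_hours_between(start: datetime, end: datetime) -> float:
    """Elapsed business hours between two timestamps (Mon-Fri, 9:00-17:00)."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            window_open = datetime.combine(day, BUSINESS_START)
            window_close = datetime.combine(day, BUSINESS_END)
            overlap_start = max(start, window_open)
            overlap_end = min(end, window_close)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 3600

# Ticket opened Friday 4:45 PM, resolved Monday 9:30 AM:
opened = datetime(2024, 3, 1, 16, 45)   # a Friday
resolved = datetime(2024, 3, 4, 9, 30)  # the following Monday
print(business_hours_between(opened, resolved))  # 0.75 business hours vs ~64.75 clock hours
```

The same ticket scores wildly differently under the two conventions, which is exactly why the measurement basis has to be stated alongside the number.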

Why it matters

MTTR is the single most direct measure of how fast IT restores value to the business after something breaks. Jeff Rumburg at MetricNet has called it "one of the key drivers of customer satisfaction for desktop support" — and in practice, it is also the metric that most directly predicts whether end users trust the service desk or route around it. An IT org with consistently low MTTR builds organizational credibility over time. One with high MTTR trains the business to solve problems without calling IT.

The benchmark

According to MetricNet's global benchmarking database (published via HDI SupportWorld, May 2018), the industry average MTTR is 8.40 business hours, with a range spanning from 0.67 to 33.67 hours across measured organizations. A common performance target is resolving 80% of incidents within 8 business hours or 24 clock hours. Elite organizations consistently achieve MTTR under 2 hours for standard incidents.

The Number Behind the Number

The average hides the distribution. A team that resolves 90% of tickets in 1 hour and 10% in 80 hours has a misleading "average" MTTR of around 9 hours. Always track P90 — the 90th percentile resolution time — alongside the mean. That tail is where the end user frustration lives.
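The gap between mean and tail is easy to demonstrate with the numbers above. A sketch using one simple convention for P90 (the first value in the slowest 10% of a sorted list; other percentile definitions interpolate and will give slightly different results):

```python
# 100 tickets: 90 resolved in 1 hour, 10 stuck for 80 hours
resolution_hours = [1.0] * 90 + [80.0] * 10

mean_mttr = sum(resolution_hours) / len(resolution_hours)  # 8.9 -- looks tolerable

# P90: the resolution time at which the slowest 10% of tickets begin
sorted_times = sorted(resolution_hours)
p90 = sorted_times[int(0.9 * len(sorted_times))]  # 80.0 -- the real story

print(f"Mean MTTR: {mean_mttr:.1f} h, P90: {p90:.1f} h")
```

A dashboard showing only the 8.9-hour mean would never reveal that one ticket in ten takes two full work weeks.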

The action

If your MTTR is above 10 business hours, the likely culprits are: no formal escalation path (tickets stall at tier 1 with no defined handoff trigger), an insufficient or unfindable knowledge base (agents are re-solving problems from scratch), or incidents being routed to the wrong tier on first assignment and bouncing. Pull your 20 longest-running tickets from last quarter. The patterns will be obvious within the first read.


Metric 2 — Change Failure Rate (CFR)

What it is

Change Failure Rate is the percentage of changes to your IT environment — deployments, configuration updates, infrastructure changes, patches — that result in a degraded service, require rollback, or cause an unplanned incident. A change that goes in clean and causes no downstream impact is a success. A change that triggers an incident, requires an emergency rollback, or consumes unplanned remediation hours is a failure, regardless of whether it was eventually "resolved."

Why it matters

Most IT teams track change volume. Almost none track what happens after the change goes in. Change failure is the single largest preventable source of incidents in mature ITSM environments. Industry analysts estimate that 65–80% of IT incidents can be traced back to change activity — a range widely cited across Gartner research and industry benchmarking, though the precise figure varies by organization type and environment complexity. Whatever the number is in your environment, the implication is the same: you are not primarily an incident management shop. You are a change management shop that keeps generating its own incidents.

"Every change that causes an incident is evidence of a gap in impact assessment. That gap has a calculable cost — and it compounds every quarter."

The benchmark

According to the DORA Accelerate State of DevOps Report (2024), elite performers achieve a change failure rate of approximately 5% or below. Low performers see rates up to 40%. The majority of measured organizations fall in the 15–25% range. If your organization has never measured CFR, a reasonable starting assumption is that you are in the majority.

| Performance Tier | Change Failure Rate | What It Means |
|---|---|---|
| Elite | ~5% or below | Changes are well-scoped, impact-assessed, and tested. Incidents from changes are the exception. |
| High (majority) | 15–25% | 1 in 5 to 1 in 4 changes causes a problem. Incident load is partly self-inflicted. |
| Low | Up to 40% | As many as 2 in 5 changes cause incidents. Firefighting is not exceptional — it is structural. |

The action

Start tracking CFR in your ITSM tool this week. If you do not have a field for "did this change cause an incident?", add one. It takes 15 minutes to configure. Two weeks of data will show you where the problem lives — which services, which change types, which change owners. The data almost always points to a small number of high-velocity change categories that account for a disproportionate share of failures.
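Once that field exists, the analysis is a one-screen script. A sketch with hypothetical change records (the record shape and category names are invented for illustration; substitute an export from your own ITSM tool):

```python
from collections import defaultdict

# Hypothetical change records: (change_id, category, caused_incident)
changes = [
    ("CHG-101", "firewall-rule", False),
    ("CHG-102", "app-deploy", True),
    ("CHG-103", "app-deploy", False),
    ("CHG-104", "patch", True),
    ("CHG-105", "app-deploy", True),
    ("CHG-106", "firewall-rule", False),
]

def change_failure_rate(records):
    """Share of changes that caused a degraded service, rollback, or incident."""
    failures = sum(1 for _, _, failed in records if failed)
    return failures / len(records)

def cfr_by_category(records):
    """Failure rate per change category, to find where failures concentrate."""
    buckets = defaultdict(list)
    for _, category, failed in records:
        buckets[category].append(failed)
    return {cat: sum(flags) / len(flags) for cat, flags in buckets.items()}

print(f"Overall CFR: {change_failure_rate(changes):.0%}")
for cat, rate in sorted(cfr_by_category(changes).items(), key=lambda kv: -kv[1]):
    print(f"  {cat}: {rate:.0%}")
```

The per-category breakdown is the part that drives action: an overall CFR blends a clean category like the firewall changes above with a failure-prone one like the deployments.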

The Firefighting Trap

A low CFR combined with very low change volume often means changes are being avoided, not managed well. A high CFR combined with high change volume is the firefighting trap: changes create incidents, incidents require emergency fixes, emergency fixes are changes that create more incidents. Tracking CFR breaks the cycle by making it visible.


Metric 3 — First Contact Resolution Rate (FCR)

What it is

First Contact Resolution Rate is the percentage of incidents and service requests resolved on the first customer interaction — without escalation, a scheduled callback, or any follow-up contact. A user submits a ticket, the service desk resolves it, the user confirms, the ticket closes. That is a first contact resolution. Any ticket that requires a second touch — a follow-up call, an escalation to tier 2, a "let me get back to you" — is not.
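The computation is a straight ratio once each ticket carries flags for escalation and follow-up touches. A sketch with hypothetical ticket data (the field names are illustrative, not from any particular ITSM tool):

```python
# Hypothetical ticket log: escalation and follow-up flags per ticket
tickets = [
    {"id": "INC-1", "escalated": False, "follow_ups": 0},  # resolved first contact
    {"id": "INC-2", "escalated": True,  "follow_ups": 0},  # went to tier 2
    {"id": "INC-3", "escalated": False, "follow_ups": 2},  # needed callbacks
    {"id": "INC-4", "escalated": False, "follow_ups": 0},  # resolved first contact
]

def first_contact_resolution_rate(tickets):
    """Share of tickets closed on the first interaction with no second touch."""
    resolved_first = sum(
        1 for t in tickets if not t["escalated"] and t["follow_ups"] == 0
    )
    return resolved_first / len(tickets)

print(f"FCR: {first_contact_resolution_rate(tickets):.0%}")
```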

Why it matters

FCR is a cost metric disguised as a satisfaction metric. Every ticket that escalates from tier 1 to tier 2 costs 2–4 times more to resolve. Escalation consumes senior engineer time, introduces handoff delays, and degrades the user experience at every step. FCR also directly predicts whether end users trust the service desk or route around it — calling a colleague directly, solving the problem themselves, or simply living with the degraded service rather than opening a ticket.

The benchmark

The service desk industry average FCR is approximately 74%, based on MetricNet's global benchmarking database (cited via HDI SupportWorld). Top-quartile organizations achieve 80% or above. If your FCR is below 65%, your knowledge base and tier-1 training are almost certainly the bottleneck — not your people.

FCR Degradation Patterns

FCR degrades for three predictable reasons: agents do not have access to good knowledge articles, tickets are being miscategorized and misrouted on intake, or the service catalog does not map to how users actually describe their problems. All three are fixable without new technology.

The action

Pull your 20 most-escalated ticket categories from the last 90 days. For each one, ask three questions: Does a knowledge article exist for this issue? Is that article findable in the time pressure of a live interaction? Is it accurate enough to actually solve the problem? That exercise alone typically surfaces 3–5 categories where a new or updated knowledge article would immediately improve FCR. That is a measurable improvement achievable in days, not quarters.


Metric 4 — Incident Recurrence Rate

What it is

Incident Recurrence Rate is the percentage of incidents that are repeat occurrences of the same root cause — tickets that should have been permanently resolved by a prior fix but weren't. If the same server crashes under the same load conditions for the third time in 90 days, that is not three separate incidents. That is one unresolved problem consuming three times the labor, three times the user impact, and three times the escalation overhead — while the root cause sits untouched in the backlog.
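If incidents are tagged with a root-cause or problem identifier, the recurrence rate falls out of a simple count: every incident beyond the first occurrence of a given root cause is a repeat. A sketch with hypothetical problem IDs:

```python
from collections import Counter

# Hypothetical incidents, each tagged with a root-cause / problem ID
incident_root_causes = [
    "PRB-disk-full", "PRB-cert-expiry", "PRB-disk-full",
    "PRB-dns", "PRB-disk-full", "PRB-cert-expiry", "PRB-auth",
]

def recurrence_rate(root_causes):
    """Share of incidents that are repeat occurrences of a known root cause."""
    counts = Counter(root_causes)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(root_causes)

print(f"Recurrence rate: {recurrence_rate(incident_root_causes):.0%}")
```

In this toy dataset, 3 of 7 incidents (about 43%) are repeats — and the `Counter` immediately names the worst offender, which is the same move as sorting the queue by configuration item.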

Why it matters

Recurring incidents are the clearest signal that problem management is broken or absent in your organization. They represent the opposite of continual improvement: the same work, performed the same way, producing the same failure, indefinitely. Organizations without a mature problem management practice often do not know their recurrence rate — they know their ticket volume, but they have not connected tickets to root causes or tracked how many tickets are reopened problems versus genuinely new issues.

"If you are not linking incidents to problems and tracking repeat offenders, you do not know your recurrence rate — you just know your ticket volume."

The benchmark

No single universal benchmark exists for recurrence rate — it is one of the most undertracked metrics in ITSM. The directional target is clear: any rate above 15% warrants investigation. World-class organizations drive recurrence below 5% through formal root cause analysis, known error documentation, and regular problem review meetings where repeat CIs are treated as owned problems, not ongoing firefighting items.

The action

Sort your incident queue by configuration item or service. The top 10 CIs by ticket volume are your problem management backlog — even if no one has formally called them that. For each of those top 10, ask: Has a root cause ever been formally identified? Has a permanent fix ever been scoped? Is there a known error record that captures the current workaround? That list is where your problem management practice starts.

The Problem Management Shortcut

You do not need a mature ITIL problem management process to start reducing recurrence. You need one weekly meeting, one owner per repeat CI, and a simple question: "Did this happen before, and if so, why is it happening again?" That meeting, held consistently, will surface the top 5 recurrence drivers in your environment within 30 days.


Metric 5 — Cost Per Ticket (by Channel)

What it is

Cost Per Ticket is your total monthly service desk operating expense divided by your total ticket volume for the same period. The blended number is useful. The channel-specific number is where the insight lives. Breaking cost per ticket down by channel — self-service portal, email/web form, chat, phone, and walk-up — shows you exactly where your cost structure is concentrated and where the leverage is.
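Both the blended and the channel-specific numbers come from the same division, applied at different granularity. A sketch with entirely illustrative volumes and cost allocations (these are not benchmarks; plug in your own monthly figures):

```python
# Hypothetical monthly figures per channel: (ticket volume, allocated cost in $)
channels = {
    "portal":  (2000, 4_000),
    "email":   (1500, 15_000),
    "chat":    (800, 12_000),
    "phone":   (1200, 36_000),
    "walk_up": (100, 8_000),
}

total_tickets = sum(volume for volume, _ in channels.values())
total_cost = sum(cost for _, cost in channels.values())
print(f"Blended cost per ticket: ${total_cost / total_tickets:.2f}")

# The channel breakdown is where the insight lives
for name, (volume, cost) in channels.items():
    print(f"  {name}: ${cost / volume:.2f} per ticket")
```

Even with made-up numbers, the structure of the insight is visible: here a walk-up ticket costs 40x a portal ticket, so the channel mix, not the blended average, determines where deflection pays off.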

Why it matters

Jeff Rumburg at MetricNet has written that "Cost per Ticket and Customer Satisfaction are often referred to as the foundation metrics in service and support." The reason is simple: cost per ticket is the number that connects everything else. MTTR improvement shows up in cost per ticket. FCR improvement shows up in cost per ticket. Recurrence reduction shows up in cost per ticket. It is the financial summary of operational performance.

The channel breakdown matters because channel cost varies dramatically. According to MetricNet's research (HDI SupportWorld, December 2021), channel cost varies by more than two orders of magnitude — 100x — from the lowest-cost self-service ticket to the highest-cost walk-up interaction. Every ticket that moves from a high-cost channel to a low-cost channel is direct, calculable cost reduction. This is not a future-state aspiration. It is an arithmetic fact about how your current ticket distribution maps to your cost structure.

| Channel | Relative Cost | Deflection Opportunity |
|---|---|---|
| Self-Help / Portal | Lowest | Knowledge base quality + portal discoverability drive self-service adoption |
| Email / Web Form | Low | Automation and templated workflows reduce handling time |
| Chat | Medium | AI-assisted chat can deflect to self-service in real time |
| Voice / Phone | High | A 10% shift from phone to self-service typically reduces total cost per ticket 8–12% |
| Walk-Up | Highest | Walk-up volume is the most expensive and most controllable cost driver |

The benchmark

There is no universal cost-per-ticket benchmark that applies across all organizations — the number varies by industry, team size, and service scope. The value is in your own trend over time and in channel comparison. Shifting 10% of voice contacts to self-service typically reduces total blended cost per ticket by 8–12%, according to MetricNet's channel cost research.

One additional data point worth carrying into budget conversations: it costs approximately $12,000 to replace a single North American service desk agent when recruiting, onboarding, and productivity loss are fully accounted (Jeff Rumburg, MetricNet, HDI SupportWorld, December 2021). Turnover, not technology, is often the real driver of rising cost per ticket — and turnover is highest in environments where agents spend their days handling repetitive, escalation-prone tickets that a better knowledge base would have resolved at tier 1.

The action

Calculate your blended cost per ticket for last month. Then calculate it separately for your top three volume channels. The gap between your highest-cost channel and your lowest-cost channel tells you where to invest in self-service deflection. If 30% of your volume is coming in by phone and phone costs 10x more than portal, that math produces a clear investment case for knowledge base and portal improvement — in dollars, not percentages.
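The investment-case arithmetic itself is short. A sketch with illustrative per-ticket costs and volumes, modeling "10% shift" as 10% of phone contacts moving to the portal (one plausible reading; the cited benchmark does not specify the base, and the realized reduction depends entirely on your channel mix and cost spread):

```python
# Illustrative (not benchmark) per-ticket costs in dollars
PHONE_COST, PORTAL_COST, OTHER_COST = 30.0, 2.0, 12.0

def blended_cost(phone_vol, portal_vol, other_vol):
    """Blended cost per ticket for a given channel mix."""
    total_cost = (phone_vol * PHONE_COST
                  + portal_vol * PORTAL_COST
                  + other_vol * OTHER_COST)
    return total_cost / (phone_vol + portal_vol + other_vol)

before = blended_cost(1200, 2000, 2400)
shift = 120  # 10% of the 1200 phone contacts move to the portal
after = blended_cost(1200 - shift, 2000 + shift, 2400)

print(f"Before: ${before:.2f}  After: ${after:.2f}  "
      f"Reduction: {(before - after) / before:.1%}")
```

With this particular mix the reduction is about 5%; mixes where phone carries a larger share of total cost land closer to the cited 8–12% range. Either way, the output is a dollar figure you can put in front of a CFO.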


The Dashboard That Actually Works

Most IT leaders do not need more data. They need fewer metrics tracked consistently, reviewed by the right people, and connected to decisions that actually get made.

These five numbers — MTTR, Change Failure Rate, First Contact Resolution Rate, Incident Recurrence Rate, and Cost Per Ticket by Channel — reviewed monthly in a 30-minute operations review with the right stakeholders, will surface problems before they become incidents and improvements before they become emergencies. That is a realistic standard. Most organizations can stand up this review cadence without new tools, new headcount, or a consulting engagement.

1. Is MTTR trending down?

Not whether it hit a target this month, but whether it is improving quarter over quarter. A team moving from 12 hours to 9 hours to 7 hours is a team that is learning. A team flat at 10 hours for 18 months has a systemic constraint that the metric is not surfacing.

2. Is CFR improving quarter over quarter?

Not zero — that is not a realistic target for any organization operating at meaningful change velocity. But is it 22% this quarter versus 28% last quarter? That direction is the signal. A change failure rate that does not move is a change management practice that is not improving.

3. Is FCR holding above 70%?

This is the floor. Below 70%, something structural is broken in tier-1 capability, routing accuracy, or knowledge quality. Above 70%, the question is whether you are moving toward 80%. The top quartile did not arrive there by accident — they invested specifically in knowledge management as a capability, not a byproduct.

4. Is the recurrence rate visible and declining?

If you cannot answer this question because you have never measured it, that is the answer. Start the measurement. The act of making recurrence visible — naming the top repeat CIs, assigning problem ownership, tracking whether they recur — is itself a step toward reducing it.

5. Is cost per ticket stable or improving as volume grows?

If ticket volume increases 20% and cost per ticket also increases 20%, you have a linear cost structure with no leverage. If ticket volume increases 20% and cost per ticket holds flat or declines, your self-service and tier-1 resolution investments are working. That is the number to show a CFO who asks what IT did with its budget this year.

The goal is not a perfect score on every metric. The goal is a direction. Motion matters more than momentary snapshots. An IT organization that is consistently moving these five numbers in the right direction — even slowly — is an IT organization that is structurally improving. That is a fundamentally different story to tell at a QBR than a dashboard of green SLA compliance lights that no one can act on.

The Board-Level Version

If you have 10 minutes with a senior executive who does not want to hear about ITSM, lead with two numbers: your change failure rate and your cost per ticket trend. CFR tells them how much of your incident load is self-inflicted. Cost per ticket trend tells them whether their investment in IT is producing leverage. Everything else is context.

Start With One. Track It for 90 Days.

If none of these five metrics are currently being tracked in your environment, do not try to stand up all five at once. Pick the one most relevant to the conversation you need to have in the next quarter. If your CFO is asking about IT efficiency, start with cost per ticket by channel. If your business is frustrated with how long incidents take to resolve, start with MTTR and P90. If your ops team is burned out from firefighting, start with change failure rate.

Ninety days of clean data on a single metric will tell you more about your environment than two years of 200-report dashboards. It will also give you a specific, credible number to bring into the next executive conversation — not an operational summary, but a data point with a direction and an action attached to it.

That is what changes the conversation. Not more metrics. Better ones.


Citations: Jeff Rumburg, MetricNet — "Metric of the Month: Incident Mean Time to Resolve," HDI SupportWorld / ThinkHDI, May 8, 2018 (thinkhdi.com) · Jeff Rumburg, MetricNet — "Understanding the Service Desk Metric of Cost per Ticket," HDI SupportWorld / ThinkHDI, December 28, 2021 (thinkhdi.com) · DORA Accelerate State of DevOps Report 2024, Google DORA Research Program; summary via Octopus Deploy (octopus.com/blog/2024-devops-performance-clusters) · MetricNet global benchmarking database — First Contact Resolution Rate benchmark via metricnet.com · Gartner / industry research — 65–80% of IT incidents attributable to change activity; this range is widely cited in industry research and should be treated as an estimate, not a single authoritative figure