Manager Capability and Performance Calibration: What the Research Shows

The capability gap

Most managers were never taught to manage

The promotion path in most companies rewards strong individual work with a management title, then leaves the new manager to figure out the people part alone. The data show how common this is. Gallup's State of the Global Workplace report finds that only 44% of managers have ever received formal management training, even though managers account for roughly 70% of the variance in team engagement. The same research found that even basic training cuts active disengagement among managers by about half, so the gap is not just common; it is also the highest-leverage place to intervene.

The cost of leaving it unaddressed has been rising. Gallup recorded manager engagement falling from 31% in 2022 to 22% in 2025, the steepest drop of any group in the workforce, with the largest single-year fall between 2024 and 2025. Managers have lost what Gallup calls the engagement premium and are now about as engaged as the people they lead. When the person who sets the tone for a team is stretched and unsupported, the team feels it.

The hardest task

Judging people is where capability gets tested

A manager does many things, but the one that carries the most weight, and the most risk, is evaluating performance. A rating drives pay, promotion, and who stays. It is also the task where an untrained manager is most exposed, because human judgment of other people runs on predictable shortcuts. These are not character flaws. They are systematic errors that every rater makes, and the people making them are usually unaware they are doing so. Naming them is the first step to controlling them.

The rater errors that distort reviews

Recency

The last few weeks outweigh the year. A strong finish or a recent stumble carries far more weight than the eleven months before it, because recent events are easier to recall.

Halo or horns

One trait colors everything. A standout strength inflates every score, or a single weakness drags them all down, instead of each area being rated on its own.

Leniency or severity

The whole scale shifts. One manager rates everyone high to avoid hard conversations, another rates everyone low to look demanding, so the same number means different things depending on which manager assigned it.

Central tendency

Everyone lands in the middle. The rater avoids the ends of the scale, so genuine high and low performers are flattened into the same indistinct band.

Similar to me

Familiarity reads as competence. People whose style, background, or approach mirror the manager's get the benefit of the doubt, which is also where discrimination risk enters the review.

The fix

Calibration is the established correction

Calibration is a structured session where managers who supervise comparable groups compare their proposed ratings with each other, guided by HR or a neutral facilitator, before any review reaches an employee. The purpose is consistency: to make a 4 from one manager carry the same weight as a 4 from another. It is, in plain terms, a review of the reviews. SHRM describes the core sequence as managers posting names and proposed ratings for all to see, discussing each, and adjusting to assure accuracy and consistency before final appraisals are prepared.

The reason it works is that the errors above are hard to catch from inside your own head but easy to spot from outside. When one manager's team is rated uniformly higher than a comparable team, the discrepancy is visible in the room and gets examined. Performance-management specialist Dick Grote has noted that calibration also makes it easier for managers to deliver honest but negative appraisals, because the standard is shared rather than personal, and it exposes strong performers to a wider set of senior leaders.
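The comparison described above can be roughed out before a session as a simple screen over proposed ratings. The following is a minimal sketch, not a method taken from Gallup or SHRM: the function name, the sample data, the 1-to-5 scale, and both thresholds are hypothetical assumptions chosen for illustration. A flag is only a prompt for discussion, since a team may genuinely be stronger or weaker than its peers.

```python
from statistics import mean, pstdev

# Hypothetical proposed ratings on an assumed 1-5 scale, keyed by manager.
proposed = {
    "manager_a": [4, 5, 5, 4, 5],
    "manager_b": [3, 3, 3, 3, 3],
    "manager_c": [2, 4, 3, 5, 3],
}


def flag_rating_patterns(ratings_by_manager, scale_midpoint=3.0,
                         shift_threshold=0.75, spread_threshold=0.5):
    """Return a list of discussion flags per manager.

    All thresholds are illustrative assumptions, not published standards.
    """
    # Pool every proposed rating to get a common reference point.
    all_ratings = [r for rs in ratings_by_manager.values() for r in rs]
    overall = mean(all_ratings)

    flags = {}
    for manager, rs in ratings_by_manager.items():
        notes = []

        # A team mean far from the pooled mean may indicate a shifted scale.
        shift = mean(rs) - overall
        if shift >= shift_threshold:
            notes.append("possible leniency")
        elif shift <= -shift_threshold:
            notes.append("possible severity")

        # Very little spread, centred on the scale midpoint, may indicate
        # that the rater is avoiding both ends of the scale.
        narrow = pstdev(rs) < spread_threshold
        near_middle = abs(mean(rs) - scale_midpoint) <= 0.5
        if narrow and near_middle:
            notes.append("possible central tendency")

        flags[manager] = notes
    return flags


if __name__ == "__main__":
    for manager, notes in flag_rating_patterns(proposed).items():
        print(manager, notes or "no flags")
```

With the sample data, manager_a is flagged for possible leniency, manager_b for possible central tendency, and manager_c gets no flags. The screen only decides which teams the group looks at first; the evidence each manager brings settles whether a rating stands.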

What good looks like

What separates a useful session from a political one

Calibration done badly is worse than none at all. The common failure, documented by SHRM, is that sessions defer to the loudest or highest-ranking person in the room and end up calibrating one set of biased ratings against another. A few conditions keep a session honest.

Conditions for a calibration that works

Evidence

Ratings come with documentation, not impressions. Grote's framing is that the facilitator's job is to make sure managers bring data, not just favorable or unfavorable views.

Facilitation

A skilled, neutral facilitator runs it. Especially the first time, someone has to keep the discussion on the standard and off rank and volume.

Job-based

People are rated on what the job requires. Ratings tied to a job analysis keep the conversation on valid criteria rather than personality.

Not a curve

The goal is consistency, not a forced distribution. The point is a shared standard, not slotting people into a predetermined bell curve.

For a manager who has never run one, the gap between knowing calibration matters and being able to facilitate it well is exactly the capability gap this note opened with. The skills are learnable, which is the encouraging part: structured preparation, evidence-based discussion, and a clear rubric turn a vague exercise into a defensible one.

Sources

Where these figures come from

Primary sources

Gallup, State of the Global Workplace 2026. The source for managers driving roughly 70% of the variance in team engagement, the 44% who have received management training, the halving of active disengagement among trained managers, and manager engagement falling from 31% in 2022 to 22% in 2025. gallup.com. Checked 24 June 2026.
SHRM, Improving Performance Evaluations Using Calibration. The source for the calibration sequence, where managers post and discuss proposed ratings then adjust for consistency, and for Dick Grote's points on honest appraisals, skilled facilitation, and bringing data rather than views. shrm.org. Checked 24 June 2026.
SHRM Labs, Fixing Performance Reviews. The source for the failure mode where calibration defers to the loudest or highest-ranking manager and ends up calibrating biased ratings against other biased ratings. shrm.org. Checked 24 June 2026.
SHRM Certified Professional, rater errors in performance measurement. The source for the taxonomy of rater errors: halo and horns, leniency and severity, central tendency, recency, and similar-to-me bias. SHRM-CP reference. Checked 24 June 2026.
Dartmouth College HR, Common Rater Errors. A university HR reference confirming the standard rater-error definitions and the point that observers are usually unaware they are making them. dartmouth.edu. Checked 24 June 2026.

These figures describe patterns across organizations, not a rule for yours. The right rating scale, calibration cadence, and manager-development approach depend on your size, your roles, and how your performance process is built. This note is general information to support better management practice, not a mandate.

Questions

Common questions

How many managers receive formal management training?

Gallup's State of the Global Workplace research finds that only 44% of managers report ever receiving formal management training. The same research finds that managers drive about 70% of the variance in team engagement, so most companies are leaving their single biggest engagement lever undeveloped.

What is performance calibration?

It is a structured session where managers who supervise comparable groups compare their proposed performance ratings before any review reaches an employee, guided by HR or a neutral facilitator. The purpose is to make a given rating mean the same thing across teams, so a 4 from one manager carries the same weight as a 4 from another.

Which rater errors distort performance reviews?

The well-documented ones are recency bias (recent events outweigh the full period), the halo or horns effect (one trait colors every score), leniency or severity (the whole scale shifts up or down by manager), central tendency (everyone clusters in the middle), and similar-to-me bias (familiarity reads as competence). Most raters make them without realizing it.

Is calibration the same as a forced distribution?

No. The goal of calibration is consistency, not a forced distribution. A good session aligns what each rating means across managers so the standard is shared. Slotting people into a predetermined curve is a different practice, and calibrating biased ratings against each other or against a curve is the common way the process goes wrong.
