How to Build a Call Review Scorecard (Gong or Modjo)

Nils Brosch

B2B SaaS Sales Consultant · Benelux & DACH

One out of fifty sales managers cares about the quality of a call. Fifty out of fifty count calls. They count because counting is easy. A scorecard is the cheapest tool that forces the harder question, and most teams that own Gong or Modjo still don't have one worth using.

This is for the VP of Sales or sales manager who has the recordings, has the dashboards, and still can't tell you whether their reps run good discovery. After seven years and 1,400+ manually reviewed B2B SaaS calls across Benelux and DACH, which form the dataset behind my 2026 EU Sales Call Benchmark, the pattern is consistent. Teams are far better at building the infrastructure of sales than at inspecting what actually happens inside a call. The scorecard is how you close that gap.

A call review scorecard is a structured rubric a manager uses to rate a recorded sales call against a small set of defined skills, scoring each 1–5 and capturing one concrete, timestamped observation per skill. It is not a compliance checklist, and it is not your activity dashboard. That distinction is the whole game, because most "scorecards" sold as templates are really just call-counting in a prettier table.

Most KPIs are tracked because they are trackable. If a number doesn't provoke a coaching action, it's a vanity metric, and counting calls is the most popular vanity metric in sales.

1. Score Quality, Not Volume

Here is the test for any metric on your scorecard: if it does not provoke a coaching action, it doesn't belong. Dials made, talk-to-listen ratio, calls logged. These get tracked because they are trackable, not because they change what a rep does next Monday. An A call is profoundly different from a B call, and no amount of counting tells you which one you just listened to.

My EU benchmark scores each core skill out of ten across the teams I review. Those averages are sobering, and they double as the reference points for your own scorecard. A rep's score isn't just a number; it's a number against a known European baseline. Because the benchmark is out of ten and the scorecard described here uses 1–5, put both on the same scale before you compare.

EU Sales Call Benchmark, average score out of 10 per skill (1,400+ calls, 140+ teams, Benelux & DACH):

Closing: 3.8 (the lowest of the four)
Discovery: 4.2 (second lowest)
Prospecting: 5.5
Demo: 6.5 (the strongest skill)

2. The Five Criteria That Actually Predict Deals

Resist the urge to score everything. A scorecard with twenty-five line items gets filled in mechanically and never coached on. The criteria that earn their place are the ones where my benchmark data shows reps consistently fail and where improvement visibly changes win rates. Five skills, not twenty-five:

🔍 Pre-call research & relevance. Only 1 in 3 reps does proper pre-call research. The rep who opens with "tell me about your business" and the rep who opens with "I saw you expanded into Germany, I came with specific questions about that" are selling the same product at the same price. The buying experience is not remotely the same.

⚡ Discovery depth (impact & consequence). Score whether the rep moved past surface pain into business consequence: "What's the cost to your region of not having this?" In my data, only 23% of reps reach impact discovery at all.

🎯 Vision change. Did the rep introduce a problem the buyer hadn't framed, shifting the evaluation criteria? 26% attempt it; fewer than 10% manage both impact discovery and vision change. When you only respond to what the buyer already knows, you become comparable, and comparable means competing on price.

🔗 Decision process & DMU. Only 50% of reps explore the decision-making unit and 57% the decision process, and I scored that generously, awarding a point to anyone who merely asked. Score whether the rep proposed a buying journey rather than passively requesting one.

✅ Next step & commitment. A specific, dated next step with a buyer-owned action item, or a vague "let me think about it"? This is where the 3.8 closing score is won or lost.
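If you keep the scorecard in a script or a notebook rather than a spreadsheet, the five criteria can be captured in a small record type. This is a minimal sketch, not a Gong or Modjo format: the identifiers (`CRITERIA`, `SkillScore`, `CallReview`) are made up for illustration, and the only rules it encodes are the ones described in this article (a 1–5 score and one timestamped observation per skill).

```python
from dataclasses import dataclass, field

# The five criteria from this section; the identifier names are illustrative.
CRITERIA = (
    "pre_call_research",
    "discovery_depth",
    "vision_change",
    "decision_process_dmu",
    "next_step_commitment",
)


@dataclass
class SkillScore:
    criterion: str
    score: int        # 1-5
    observation: str  # one concrete observation from the call
    timestamp: str    # position in the recording, e.g. "14:05"

    def __post_init__(self):
        if self.criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {self.criterion}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        if not self.observation.strip() or not self.timestamp.strip():
            raise ValueError("each score needs a timestamped observation")


@dataclass
class CallReview:
    rep: str
    call_id: str
    scores: list = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when every criterion has exactly one score."""
        return sorted(s.criterion for s in self.scores) == sorted(CRITERIA)

    def weakest(self) -> SkillScore:
        """Lowest-scoring skill on this call (first one wins on a tie)."""
        return min(self.scores, key=lambda s: s.score)
```

Rejecting a score that has no observation or timestamp is the design choice that matters here: it keeps the record from degrading into a row of bare numbers.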

3. Score 1 to 5, Never Tick-Box

Binary scoring, where pain is captured yes/no and the DMU is understood yes/no, flattens the nuance in what discovery actually surfaced. A rep can technically "ask about pain" and learn nothing. The fix is a 1–5 scale per criterion, with the bands defined in advance so two managers reviewing the same call land in the same place.

Define each band concretely. For discovery, a 1 is "asked no pain questions," a 3 is "surfaced a stated pain but no impact," and a 5 is "quantified impact and tied it to a business consequence the buyer confirmed." Without written bands, scoring drifts, calibration collapses, and the scorecard is worthless within a quarter. Be honest: does your current review have written bands, or does "good discovery" mean whatever the manager felt that day?
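One way to check whether two managers really do land in the same place is to have both score the same call and flag any criterion where they differ by more than a set tolerance. The sketch below is illustrative only. The three discovery bands are the ones quoted above; bands 2 and 4 are left out on purpose because each team has to write its own. The one-point tolerance (`MAX_GAP`) and the function name are assumptions, not figures from the benchmark.

```python
# Written bands for one criterion. Bands 1, 3 and 5 are quoted from the text;
# bands 2 and 4 are deliberately left for the team to write.
DISCOVERY_BANDS = {
    1: "asked no pain questions",
    3: "surfaced a stated pain but no impact",
    5: "quantified impact and tied it to a business consequence the buyer confirmed",
}

MAX_GAP = 1  # assumed tolerance: reviewers may differ by at most one point


def calibration_gaps(scores_a: dict, scores_b: dict, max_gap: int = MAX_GAP) -> dict:
    """Return the criteria where two reviewers of the same call differ by more than max_gap."""
    return {
        criterion: (scores_a[criterion], scores_b[criterion])
        for criterion in scores_a
        if criterion in scores_b
        and abs(scores_a[criterion] - scores_b[criterion]) > max_gap
    }


# Hypothetical scores from two managers reviewing the same call.
manager_1 = {"discovery_depth": 2, "next_step_commitment": 4}
manager_2 = {"discovery_depth": 4, "next_step_commitment": 4}
print(calibration_gaps(manager_1, manager_2))  # {'discovery_depth': (2, 4)}
```

A flagged criterion usually points at a band that is too loosely worded rather than at a careless manager, so the follow-up is to rewrite the band.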

4. Can AI Score Your Calls? Watch for AI Sales Psychosis

AI Sales Psychosis is the false sense of control that appears when Gong, Modjo, and AI notetakers tell reps and managers "don't worry, you've captured the pain" when the buyer gave only the faintest signal. I named it because I kept watching it happen. When I compared AI analysis of calls against the manual reviews that formed the basis of my 2026 benchmark, the AI's verdict disagreed with the manual one on more than 30% of calls across all criteria.

AI vs Manual Call Scoring: Where the AI Got It Wrong

Share of reviewed calls where the AI's verdict disagreed with a manual review:

Pain captured: 50%+
DMU understood: 35%+
All criteria (aggregate): 30%+

On the two criteria that matter most for qualification, the AI disagreed with a manual review on a third to half of calls, almost always by marking a deal more qualified than it was. Source: 2026 EU Sales Call Benchmark.

The failures were specific. The AI marked Pain criteria green because the customer nodded along while the rep recited common pains. It marked the DMU "well understood" when the buyer's actual answer was "yeah, I probably need to involve my manager at some point." Large language models are pleasing by nature. They soften gaps, make weak discovery look structured, and turn assumptions into CRM data. Suddenly everyone believes the deal is better qualified than it is, right up until the buyer chooses the status quo, a competitor, or simply vanishes.
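To measure the same gap on your own calls, put the AI's verdict and a manager's verdict side by side for a sample of calls and count the disagreements per criterion. The sketch below assumes each comparison is a simple met / not met pair. The function names and the input shape are illustrative and do not describe how the benchmark itself was computed.

```python
from collections import defaultdict


def disagreement_rates(reviews):
    """reviews: list of (criterion, ai_verdict, manual_verdict) with boolean verdicts.

    Returns {criterion: share of calls where AI and manual review disagreed}.
    """
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for criterion, ai_verdict, manual_verdict in reviews:
        totals[criterion] += 1
        if ai_verdict != manual_verdict:
            disagreements[criterion] += 1
    return {criterion: disagreements[criterion] / totals[criterion] for criterion in totals}


def inflation_share(reviews):
    """Among disagreements, the share where the AI said 'met' and the manager said 'not met'."""
    diffs = [(ai, manual) for _, ai, manual in reviews if ai != manual]
    if not diffs:
        return 0.0
    return sum(1 for ai, manual in diffs if ai and not manual) / len(diffs)
```

Tracking the inflation share separately matters because the direction of the error is the real risk: an AI that is wrong in both directions adds noise, while one that is wrong mostly upward makes the pipeline look better qualified than it is.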

Use AI to support your thinking. Don't outsource your judgment.

You can make AI scoring far more honest with one prompt instruction: require the model to quote the exact transcript evidence behind every judgment, both the question the seller asked and the information the buyer actually gave. If the AI can't produce the line, the criterion isn't met. That single rule turns an AI scorecard from a confidence machine into a genuine reviewer.
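The "no quote, no credit" rule can also be enforced after the model answers, by checking that every quote it returns actually appears in the transcript. The sketch below is a minimal version under stated assumptions: the wording of `EVIDENCE_RULE` is an example, not a tested prompt, and the `judgment` dictionary is a hypothetical output shape, not something Gong, Modjo, or any notetaker is known to produce.

```python
import re

# Example wording for the prompt instruction; adapt it to your own tool.
EVIDENCE_RULE = (
    "For every criterion, quote the exact question the seller asked and the exact "
    "answer the buyer gave. If you cannot quote both, mark the criterion as not met."
)


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def evidence_holds(transcript: str, judgment: dict) -> bool:
    """A criterion counts as met only if both quotes appear verbatim in the transcript.

    `judgment` is a hypothetical structure with the keys
    "criterion", "met", "seller_quote" and "buyer_quote".
    """
    if not judgment.get("met"):
        return False
    haystack = _normalize(transcript)
    quotes = (judgment.get("seller_quote") or "", judgment.get("buyer_quote") or "")
    return all(bool(q.strip()) and _normalize(q) in haystack for q in quotes)
```

This check only catches quotes the model invented or omitted. A quote that is real but weak, such as "I probably need to involve my manager at some point," will pass, which is why a manager still has to read the evidence before accepting the verdict.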

5. The Tool Is Only as Good as the Team's Discipline

Conversational intelligence platforms like Gong can supercharge coaching, but they carry three failure modes worth naming before you build inside one. First, the big-brother effect: reps quietly switch off recording during sensitive moments, which skews the very data you're scoring. Second, vanity-metric drowning, where managers have so many numbers they trust none of them. Third, the maintenance burden, since scorecards in these tools need constant upkeep and neglected ones rot.

None of this is an argument against Gong or Modjo. It's an argument that the platform doesn't create the discipline; the manager does. A scorecard in a spreadsheet that a manager actually uses every week beats a beautifully configured Gong scorecard nobody opens. Start where your discipline is, not where the software is.

6. Sampling, Not Surveillance: The Cadence That Makes It Stick

You do not need to score every call. To get a real grasp of call quality, take a random sample every week or every two weeks and score it in the scorecard. Sampling is sustainable for a manager carrying seven reps, and it sidesteps the surveillance dynamic that makes reps perform for the recording rather than for the buyer. The cadence runs on a weekly loop:

Sample. Pull two or three random calls per rep each week, not the rep's hand-picked best one.
Score. Rate the five skills 1–5 against the written bands. Capture one specific observation per skill, with the timestamp.
Coach one thing. Bring a single skill to the 1:1. The fastest way to make coaching useless is to dump all five scores on a rep at once.
Make them practise it. Don't tell the rep what to do; have them do it. The scorecard finds the gap; role-play closes it. AI role-play tools like Jam (wejam.ai), which I co-founded for exactly this, let reps rehearse the weak skill at volume without burning manager time.

That last step is the one most teams skip, and it's the difference between feedback and coaching. Telling a rep "your discovery scored a 2" changes nothing. Running a five-minute role-play where they re-run the impact question until it lands is what moves the next call.
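The sampling step of the weekly loop is easy to automate, which takes the temptation to pick favourites out of the manager's hands. This is a minimal sketch that assumes you can export a list of call IDs per rep. The function name and parameters are illustrative, and the default of three calls follows the "two or three per rep" guidance above.

```python
import random


def sample_calls(calls_by_rep: dict, per_rep: int = 3, seed=None) -> dict:
    """Pick up to `per_rep` random calls for each rep, without replacement.

    calls_by_rep maps a rep name to a list of call IDs from the review period.
    Pass a seed only if you need the draw to be reproducible.
    """
    rng = random.Random(seed)
    return {
        rep: rng.sample(calls, min(per_rep, len(calls)))
        for rep, calls in calls_by_rep.items()
    }


# Hypothetical export of this week's calls.
this_week = {
    "rep_a": ["call_101", "call_102", "call_103", "call_104", "call_105"],
    "rep_b": ["call_201"],
}
to_review = sample_calls(this_week, per_rep=3)
```

A rep with fewer calls than the sample size simply has all of their calls reviewed, so a quiet week does not break the loop.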

Where the Scorecard Fits a Coaching System: COMPASS

I built COMPASS after giving my own team genuinely boring coaching. I mostly banged on about one issue until they zoned out and answered Slack messages mid-session. The framework is a corrective for managers like the one I was. Good coaching is Continuous, Organized, Measured, Practical, Appealing, Specialized, and Strategic.

The scorecard is the Measured pillar: it is what makes coaching objective rather than a matter of a manager's gut feeling. But a scorecard without the Practical pillar (role-play, not lecture) and the Specialized pillar (tied to each rep's individual goals) is just measurement, and measurement on its own develops nobody. This is also why AI sales coaching tools that score calls but never drive practice tend to disappoint: they're all M, no P. The cadence and the 1:1 structure that surround the scorecard get their own treatment in a separate piece on the manager coaching cadence.

The scorecard tells you which call was a B and which was an A. Coaching is the only thing that turns the B into an A, and you can't coach what you never scored.

Frequently Asked Questions

How many criteria should a call review scorecard have?

Five to seven skills, scored 1 to 5. More than that and managers fill the scorecard in mechanically and coach on none of it. Pick the dimensions where your reps most consistently fail. For most B2B SaaS teams that's discovery depth, decision-process mapping, and securing a concrete next step.

Should I score every call or a sample?

Sample. Pull two or three random calls per rep every week or every two weeks. Scoring every call is unsustainable for a manager carrying seven reps, and it pushes reps to perform for the recording. Random sampling gives you an honest read on quality without turning the scorecard into surveillance.

Can AI score sales calls accurately?

Partly, and only with supervision. When I compared AI call scoring against manual reviews, the two disagreed on more than 30% of calls. The AI inflated discovery and qualification because language models soften gaps. Require the AI to quote transcript evidence for every judgment, and treat its output as a draft a manager checks, not a verdict.

Do I need Gong or Modjo to run a scorecard?

No. A spreadsheet a manager uses every week beats a Gong scorecard nobody opens. Conversational intelligence tools help with recording and retrieval, but the platform doesn't create the coaching discipline; the manager does. Start with the cadence, then move it into whichever tool your team will actually maintain.

What is the difference between counting calls and scoring calls?

Counting measures activity: how many calls happened. Scoring measures quality: whether those calls were any good. Only 1 in 50 managers scores rather than counts, because counting is easy and scoring requires a rubric and judgment. But activity without quality just means reaching the wrong outcome faster.

Where to Start

The scorecard is the easy part. You could build the five-skill rubric above in ten minutes. The hard part is the manager habit: sample, score, coach one thing, make the rep practise it. If you want help turning a scorecard into a coaching system your managers actually run, that's what my coaching work is built to do. And if you're not sure where your team's biggest quality gap sits, start with a baseline.

Start with the baseline

My free Gap Analysis scores two of your reps' real calls against the same criteria above, so you know exactly which skill to put on the scorecard before you build anything. Benelux and DACH, in person or remote, built on your actual calls.
