SUSE refuses to measure its engineers by how much code their agents write
Rick Spencer on why output, tokens, and lines of code tell you nothing, and what an open-source enterprise tracks instead
As AI agents move into engineering workflows, new leaderboard metrics are tracking lines of code submitted, tokens consumed, and per-developer utilization. The logic behind them is simple: if agents are generating output, then output should be measured, compared across engineers, and ranked.
Rick Spencer, General Manager for Technology and Product at SUSE, has looked hard at how the industry is measuring AI’s effect on engineering. “I consider that garbage vanity metrics,” he says, calling them unhelpful.
His argument for what to track instead is one of the more clarifying things an engineering leader can hear right now, because it separates the numbers that look like progress from the numbers that actually represent it.
Output is cheap; impact is what counts
The core of Spencer’s position is a distinction between output and impact, and it matters because the two come apart precisely when AI enters the picture. AI makes output cheap: lines of code, pull requests, and token counts all mount up when agents are doing the writing, which makes them exactly the wrong things to measure if what you care about is value delivered. “We’re really tending away from measurements that measure output and utilization, and we’re trying to focus on impact,” he says. A leaderboard that ranks engineers by how much their agents produced does not tell you who is solving the hardest problems or keeping customers safe. It tells you who is generating the most volume, and in an AI-assisted world, that number is close to meaningless.
There is also a structural reason the standard tooling does not fit SUSE, and it applies to more organizations than it might first appear. Much of the available measurement tooling assumes a particular shape of company. “They really assume you’re a proprietary software company where everyone’s working on a single code base,” Spencer explains, “which is just not how an open-source enterprise works.” His engineers work across hundreds, sometimes thousands, of repositories, where the maintenance work on each one differs enormously. A per-developer comparison across that landscape measures the shape of the work far more than it measures the contribution of the engineer, which is why he treats developer-to-developer comparison as fundamentally low value rather than merely imperfect.
The reporting burden itself is part of his objection, and it is a point leaders setting up AI dashboards should sit with carefully. A measurement regime that requires engineers to generate weekly utilization reports spends the very time it claims to be optimizing. “I’d rather have them working than reporting,” Spencer says. The instrument meant to measure productivity eventually becomes a tax on it.
What SUSE tracks instead
Rejecting vanity metrics only helps if there is something better to put in their place. Spencer describes how SUSE measures business impact in terms that connect directly to what customers actually receive. “How fast are CVEs being addressed, how fast are patches being backported, how fast are our L3 responses getting closed while maintaining the same NPS score,” he says, listing what his teams track. The common thread is that each one is an outcome the customer feels, not an activity the engineer performs. AI has been applied to exactly these areas, so measuring the speed and quality of those outcomes tells you whether the AI is doing anything worth its cost, which is the actual question worth asking.
This shift from output to outcome reframes what a metric is for in the first place. CVE response time captures whether the organization is keeping customers safe faster than it used to. Backport speed captures whether stable releases are getting their fixes without the manual grind that used to gate them. These numbers move because the underlying work got genuinely better, not because more text was generated, and that is the property that makes them trustworthy. They are also far harder to game, because the only way to improve them is to actually improve the thing the customer depends on.
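As a rough illustration of what an outcome metric of this kind could look like in practice, the sketch below compares the median time to close a set of issues (CVEs, backports, or L3 tickets) across two periods and only counts the change as an improvement if NPS held. Everything in it is an assumption made for illustration: the `ResolvedIssue` record, the function names, the `nps_tolerance` parameter, and the sample dates are hypothetical, and nothing here describes SUSE's actual tooling.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class ResolvedIssue:
    """One closed CVE, backport, or L3 ticket (hypothetical record shape)."""
    opened: date
    closed: date


def median_days_to_close(issues):
    """Median number of days between opening and closing an issue."""
    return median((issue.closed - issue.opened).days for issue in issues)


def outcome_improved(before, after, nps_before, nps_after, nps_tolerance=0.0):
    """True only if issues close faster AND NPS did not drop beyond the tolerance."""
    faster = median_days_to_close(after) < median_days_to_close(before)
    nps_held = nps_after >= nps_before - nps_tolerance
    return faster and nps_held


# Invented sample data: three issues per period.
before = [
    ResolvedIssue(date(2024, 1, 1), date(2024, 1, 11)),   # 10 days
    ResolvedIssue(date(2024, 1, 5), date(2024, 1, 19)),   # 14 days
    ResolvedIssue(date(2024, 2, 1), date(2024, 2, 9)),    # 8 days
]
after = [
    ResolvedIssue(date(2025, 1, 1), date(2025, 1, 5)),    # 4 days
    ResolvedIssue(date(2025, 1, 10), date(2025, 1, 17)),  # 7 days
    ResolvedIssue(date(2025, 2, 1), date(2025, 2, 6)),    # 5 days
]

print(median_days_to_close(before), median_days_to_close(after))
print(outcome_improved(before, after, nps_before=40, nps_after=41))
```

The point of pairing the speed measure with an NPS check is the one Spencer makes: closing tickets faster only counts if customer satisfaction is maintained, so a speed gain bought with a quality drop is not reported as an improvement.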
Give managers visibility, not a leaderboard
None of this means SUSE ignores cost or utilization entirely, and the distinction Spencer draws here is the one that keeps the approach from collapsing into either negligence or surveillance. The company is building dashboards that give engineering managers visibility into their team’s cost and utilization, but the purpose is coaching rather than ranking. The unit of analysis is the team, and the question it answers is diagnostic. Spencer gives the example of a manager with an eight-person team noticing the numbers and asking the right kind of question. “We’re burning a lot of tokens. What are we actually doing that’s burning that many tokens? I’m not sure we’re getting value out of that.” The inverse case matters just as much: when purchased seats for a code assistant sit unused, the manager can ask whether there are places the team should be drawing value that it is currently leaving on the table.
The governance side of that picture, including how SUSE keeps agents and their costs inside a boundary it can stand behind, is covered in a companion piece, How SUSE Runs AI Without Losing Control.
The difference between this and a leaderboard is not subtle, and it is the heart of the leadership lesson. A leaderboard exposes individuals and turns measurement into a game engineers play against each other, a game that Spencer is explicit has nothing to do with customer value. Team-level cost visibility used for coaching does the opposite. It gives a manager the information to guide the team toward better use of the tools without making any individual engineer feel watched. “We’re really trying to decentralize and allow engineering managers to guide their teams on getting the most value out of the AI,” he says, “without it becoming like a leaderboard game where developers feel like they’re exposed.” The data exists to help the manager help the team, not to rank the team against itself.
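One way to picture the team-level view described in this section is an aggregation step that drops individual identity before anything reaches a dashboard, and then produces questions rather than rankings. The sketch below is a minimal illustration under stated assumptions: the `UsageRecord` and `TeamSummary` shapes, the one-record-per-seat convention, the `token_budget` threshold, and the sample numbers are all hypothetical and are not drawn from SUSE's dashboards.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Raw usage for one purchased seat (hypothetical shape, one record per seat)."""
    team: str
    engineer: str    # present in raw data, deliberately never reported
    tokens: int
    seat_used: bool


@dataclass(frozen=True)
class TeamSummary:
    """What the manager sees: team totals only, no per-engineer rows."""
    team: str
    total_tokens: int
    seats: int
    unused_seats: int


def summarize_by_team(records):
    """Collapse per-seat records into team totals, discarding engineer names."""
    totals = defaultdict(lambda: [0, 0, 0])  # tokens, seats, unused seats
    for record in records:
        entry = totals[record.team]
        entry[0] += record.tokens
        entry[1] += 1
        entry[2] += 0 if record.seat_used else 1
    return [TeamSummary(team, *values) for team, values in sorted(totals.items())]


def coaching_questions(summary, token_budget):
    """Turn team totals into diagnostic questions (token_budget is an assumed threshold)."""
    questions = []
    if summary.total_tokens > token_budget:
        questions.append("What are we doing that burns this many tokens?")
    if summary.unused_seats > 0:
        questions.append("Where could the team be drawing value from unused seats?")
    return questions


# Invented sample data for two teams.
records = [
    UsageRecord("kernel", "a", 900_000, True),
    UsageRecord("kernel", "b", 400_000, True),
    UsageRecord("packaging", "c", 50_000, True),
    UsageRecord("packaging", "d", 0, False),
]

for summary in summarize_by_team(records):
    print(summary.team, coaching_questions(summary, token_budget=1_000_000))
```

The design choice that matters is that `engineer` never appears in `TeamSummary`, so nothing downstream can be sorted into a per-person ranking; the output is a prompt for a conversation, which is the coaching use Spencer describes.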
The principle holding it together
What makes Spencer’s approach more than a list of preferred numbers is the principle holding it together, which is that measurement should serve the work rather than distort it. Every choice he describes follows from that one idea. Impact comes before output because output is the thing AI inflates. Team-level diagnostics come before individual leaderboards, because the goal is coaching rather than competition. Business outcomes come before activity counts, because outcomes are what customers actually receive. The decentralization to engineering managers reflects the same conviction that the people closest to the work are best placed to judge whether the AI is helping, given the right information and trusted to use it well.
The deeper point for any leader standing up AI measurement is that the easy numbers and the useful numbers are not the same, and AI has widened the gap between them. The figures that are simplest to collect (lines of code, tokens, and per-head utilization) are the ones AI has made least meaningful. The figures that matter (the speed and quality of the outcomes customers depend on) take more thought to define and more care to track. Spencer’s argument is that the effort is the job. “Let’s focus on the impact,” he says, “the business impact, not on the utilization.” For engineering leaders deciding what belongs on a dashboard as agents reshape their teams, that is the distinction worth getting right before the vanity metrics calcify into the way the organization sees itself.


