DealBench / retention_analysis

retention_analysis: pre-IPO growth equity diligence

Real data room from a $50B AUM Growth-Equity firm.

The task was built by a real associate from the fund and validated by experts from Insight Partners, Partners Group, and WestCap.

1.Introduction

In 2020, Datadog and New Relic looked like twins. Both sold observability and APM software to engineering teams, and both were roughly the same size: Datadog brought in about $603 million that year, New Relic around $600 million. Yet the public markets valued them as two entirely different businesses. Datadog traded at 44x sales at a ~$30B valuation, while New Relic traded at 7x at a ~$4B valuation.

The difference came down to net revenue retention. That year Datadog disclosed dollar-based net retention of 130%, while New Relic's sat closer to 100% — meaning Datadog's existing customers kept spending more every year, while New Relic's mostly stayed put.

As Bessemer Growth puts it, net dollar retention is "the single most important indicator of long-term success for B2B SaaS companies": it shows whether a company can sustainably grow its base without constantly buying that growth.

1a.What is the task?

retention_waterfall is the senior-associate workflow that follows an associate's commercial-diligence pass. The model takes a complete growth-equity diligence pack (bookings master, customer key, ARR-by-product-by-customer, revenue-by-customer, churn-risk register, top-25 churn/downsell reasons, deferred-revenue rolloff, Retention Summary template) and produces a completed Retention Analysis tab with gross retention, net retention, logo retention, and an ARR bridge walking beginning ARR through new, expansion, contraction, and churn to ending ARR. The underlying engagement is a real growth-equity diligence on a cybersecurity software target structured around a ~$120M equity round co-led by Microsoft.
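
As a minimal sketch of the bridge described above, assuming nothing more than two customer-level ARR snapshots, the four flow categories can be derived and footed as follows. The function name `arr_bridge` and the customer figures are hypothetical, and the sketch omits the Late Renewal and Recovered categories that the task's own notes add.

```python
def arr_bridge(begin: dict[str, float], end: dict[str, float]) -> dict[str, float]:
    """Classify each customer's ARR change between two snapshots (hypothetical helper)."""
    bridge = {"beginning": sum(begin.values()), "new": 0.0, "expansion": 0.0,
              "contraction": 0.0, "churn": 0.0}
    for cust in set(begin) | set(end):
        b, e = begin.get(cust, 0.0), end.get(cust, 0.0)
        if b == 0 and e > 0:
            bridge["new"] += e                 # customer absent at the start
        elif b > 0 and e == 0:
            bridge["churn"] -= b               # losses are stored as negatives
        elif e > b:
            bridge["expansion"] += e - b
        elif e < b:
            bridge["contraction"] -= b - e
    # The bridge must foot: beginning plus all signed flows equals ending.
    bridge["ending"] = (bridge["beginning"] + bridge["new"] + bridge["expansion"]
                        + bridge["contraction"] + bridge["churn"])
    return bridge


# Made-up sample data, for illustration only.
begin = {"A": 100.0, "B": 50.0, "C": 80.0}
end = {"A": 130.0, "B": 40.0, "D": 60.0}
bridge = arr_bridge(begin, end)

# Gross retention ignores expansion; net retention adds it back.
gross_retention = (bridge["beginning"] + bridge["contraction"] + bridge["churn"]) / bridge["beginning"]
net_retention = gross_retention + bridge["expansion"] / bridge["beginning"]
```

On the sample data the bridge foots to 230 and gross retention comes out at 140/230, roughly 61%.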

1b.Why is it economically important?

Retention analysis is a universal building block in growth-stage investing. From a Series B round through to IPO, virtually every private capital fund runs a retention summary on every deal. Two things make it matter so much:

First, there's relatively little variance in how it's done. The process is fairly standardized across almost every growth-stage deal.

Second, the stakes are high. As the Datadog–New Relic gap shows, being off by even a few percentage points of retention can move the check size and the company's terminal valuation by hundreds of millions of dollars.

1c.How do models fail?

~25% of all private-market AUM is now in growth-stage strategies, yet there has been almost no targeted post-training for it and there are almost no public benchmarks. As a result, the models lack a real growth-equity lens and miss reasoning that would be second nature to a first-year associate. For example, they report a wall of 98–100% retention even as the monthly churn directly above it keeps climbing, an internal contradiction that basic fundamentals would immediately flag as impossible. They also fail in real data rooms, where every file exists in 5–15 versions (sell-side drafts, auditor revisions, and so on); the model not only grabs the wrong version but rarely bothers cross-checking the others, so one bad assumption cascades into every downstream number. And when a model does sense that a figure looks off, it "fixes" it at the most local level by simply rerunning the same calculation, rather than questioning the flawed underlying assumption that caused the error in the first place.

1d.Task: Prompt + Rubric

Prompt:

You're an associate at a growth equity firm running diligence on Project Sentinel, an identity / fraud-prevention software business. The senior associate has given you the template for the Retention Summary tab in Project Sentinel - Financial Model (Starter Pack).xlsx. Fill in every empty cell in the Retention Summary tab.

The analysis window runs monthly from December 2019 (column D) through March 2023, followed by one separating column (whitespace) and then year-end columns for 2020, 2021, 2022, and YTD 2023. Do not modify any other cells, columns, or rows in the Retention Summary tab.

Notes provided to the agent:

A customer who drops to $0 and returns within 3 months is a Late Renewal, and the return is Recovered; a gap longer than 3 months is Churn, and the eventual return, if any, is New. Measure the gap in months between the drop and the return, rounded to the nearest whole month (30.5 days).

Do not look at activity prior to December 2019 when applying the 3-month rule. If a drop within three months of the data window's end is followed by a return within the data window, classify it normally; otherwise treat as Churn.

Recovered is grossed up to the pre-gap amount.

When a single customer has contracts across multiple segments, treat each segment as a separate customer instance.

Customer 14 is Middle Market and Customer 626 is Enterprise.

In the Total Logos block (row 32), drop in the Consolidation entries: two logos collapsed in September 2020, one in December 2020, one in December 2021, and one in February 2023.

Display formatting (via Excel number formats only — do not round underlying stored values): dollar amounts as $14,000 (integer with comma separator, parentheses for negatives); percentages as 95% (whole percent, parentheses for negatives); logo / customer counts as 14,000 (integer with comma separator, parentheses for negatives). Zeros render as a dash (-) in all three format types.

Right-align all numeric values. Any row that begins with "Beginning" or "Ending" in each segment block should be bolded.
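
The gap rule in the notes above can be sketched as follows. This is one reading of the rule, not the oracle's implementation: the function names are hypothetical, the gap is measured from the drop date to the return date, and a missing return (including a drop near the window's end with no return inside the window) is treated as Churn.

```python
from datetime import date
from typing import Optional


def gap_months(drop: date, ret: date) -> int:
    """Gap between drop and return, rounded to the nearest whole month (30.5 days)."""
    return round((ret - drop).days / 30.5)


def classify_gap(drop: date, ret: Optional[date],
                 pre_gap_arr: float, return_arr: float) -> dict:
    """Label a drop-to-$0 event and its return (hypothetical helper)."""
    if ret is None:
        # No return inside the data window: treated as Churn.
        return {"drop": "Churn", "return": None, "return_amount": 0.0}
    if gap_months(drop, ret) <= 3:
        # Late Renewal; the Recovered amount is grossed up to the pre-gap ARR.
        return {"drop": "Late Renewal", "return": "Recovered", "return_amount": pre_gap_arr}
    # Gap longer than 3 months: Churn, and the eventual return counts as New.
    return {"drop": "Churn", "return": "New", "return_amount": return_arr}


# Made-up dates and amounts, for illustration only.
late = classify_gap(date(2021, 1, 31), date(2021, 4, 30), 1000.0, 1200.0)
churned = classify_gap(date(2021, 1, 31), date(2021, 6, 30), 1000.0, 1200.0)
open_gap = classify_gap(date(2023, 2, 28), None, 1000.0, 0.0)
```

The first example spans 89 days, which rounds to 3 months and stays a Late Renewal; the second spans 150 days, which rounds to 5 months and becomes Churn followed by New.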

Rubric distribution: what the task actually grades

60 atomic criteria across 20 sequential steps, grouped into seven skill areas. Weight share is each cluster's fraction of the 103.5 max-possible rubric weight. Useful for a lab deciding where in the SaaS retention-diligence pipeline to put gradient signal.

Skill cluster	Steps	n criteria	Weight share
Canonical-source reconciliation + customer-segment mapping	1–3	3	3.8%
MRR aggregation + ARR conversion + monthly panel build	4–6	4	5.7%
Late-Renewal classification (3-month rule) + flow construction	7–9	14	20.1%
Ending ARR + Ending Logos × 7 segment blocks	10–12	14	26.7%
Consolidation overlay + monthly spot-checks	13–14	8	11.4%
Retention metrics (Gross + Net + Logo Ret)	15–18	14	26.7%
ARR YoY Growth + hygiene + segment rollup reconciliation	19–20	3	5.6%
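
Because weight share is defined against the 103.5 maximum, approximate cluster weights and a normalized score can be recovered with simple arithmetic. The sketch below is illustrative only: the shares are the rounded values from the table, so the implied weights are approximate, and the short cluster labels are abbreviations introduced here.

```python
MAX_WEIGHT = 103.5  # max-possible rubric weight stated above

# Rounded weight shares from the table (labels abbreviated for this sketch).
WEIGHT_SHARE = {
    "source reconciliation + segment mapping": 0.038,
    "MRR aggregation + ARR conversion": 0.057,
    "late-renewal classification + flows": 0.201,
    "ending ARR + ending logos": 0.267,
    "consolidation overlay + spot-checks": 0.114,
    "retention metrics": 0.267,
    "YoY growth + hygiene + rollup": 0.056,
}

# Approximate weight carried by each cluster.
implied_weight = {name: share * MAX_WEIGHT for name, share in WEIGHT_SHARE.items()}


def normalized_reward(earned_weight: float) -> float:
    """Convert earned rubric weight into a 0-1 reward."""
    return earned_weight / MAX_WEIGHT


# The sample trajectory later in this document earned 57 of 103.5.
sample_reward = normalized_reward(57)
```

Read this way, the two largest clusters (ending levels and retention metrics) each carry roughly 27.6 of the 103.5 points.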

2.Frontier Panel Results (capability-only)

[Figure: Model Performance Leaderboard. Mean reward with 95% confidence intervals across capability-only rollouts, 8 models.]

2a.Commentary by model

[Figure: Median vs Mean Reward by Model. The gap between median and mean indicates reward variance across rollouts.]

1. GPT-5.5: mean 0.563, median 0.565, n=16

GPT-5.5's runs are tightly banded, with most rollouts between 0.55 and 0.61. The model excels at building the ARR tape: it reads the source cleanly, reconstructs ending ARR by segment, and gets the New/Upsell/Downsell/Churn/Recovered flow bridge right. One recurring failure is source discipline: a number of rollouts build the waterfall off the Bookings Master, a deal-level bookings file, rather than the ARR-by-Product tape of recurring-revenue actuals, reintroducing non-SaaS products, discounts, and escalators that don't cleanly reconcile to ARR. Runs that anchor on the right tape reproduce the bridge cleanly; those that don't degrade across it.

The signature failure, however, is one layer up: deriving the retention metrics that sit on top of the bridge. Gross, net, and logo retention come out inflated (98-100%) where the business actually runs in the high-70s to low-90s, producing a retention line that contradicts the churn directly above it; the model also does not pull the prior-year data that an opening-year growth figure requires, seeding the first year flat. What's most telling is the model's number sense: it senses that uniformly high retention is uninformative, and on at least one run it computes a correct ~77% gross-retention figure — but it flags it as implausibly low, and moves on while leaving the genuinely inflated values in place.

This is a reasoning gap. Extraction, classification, and reconciliation are all there; what's missing is the discipline to pick the right source and, more tellingly, the instinct to interrogate its own outputs when they don't look right: faced with a retention line plainly too high to square with its own churn, the model rationalizes rather than rechecks, which is the first thing an analyst would do.

2. Claude Opus 4.8: mean 0.514, median 0.536, n=16

The defining feature of the Opus cohort is its spread: the same model on the same task and data produces scores from 0.22 to 0.89. The runs are best read as a dependency stack that each rollout climbs until something breaks. Ending ARR levels are the stable floor, passing in essentially every run. Above them sit the flow decomposition and the logo waterfall, which hold completely in the strongest runs, partially in the middle, and collapse in the weakest, and this layer drives most of the range below the top run, from 0.22 up to 0.61. Above that sits retention, which is cleared by only two runs, separating those two near 0.87 and 0.89 from the cluster near 0.55. Where a rollout lands is governed by which layer breaks on that trajectory, and effort does not track the outcome: the run that issued the most tool calls in the cohort, more than ninety, and verified most heavily landed at 0.33 with its flow decomposition collapsed, below runs that reached 0.55 in a third the number of calls.

Retention is the clearest window into why. The oracle computes it as a trailing-twelve-month figure at every period, so the monthly row runs in the high-70s to low-90s; fourteen runs instead compute it period-specifically, one month or one quarter at a time, where almost nothing churns, and post a flat line near 98 percent across the monthly cells and the YTD column alike, the full-calendar-year columns surviving only because a year coincides with a twelve-month window. A flat 98 percent retention line should be self-evidently wrong to anyone filling in this tab: it says nothing ever churns, it contradicts the Churn rows the model populated directly above it, and a retention summary whose every cell reads the same near-perfect number carries no information and defeats its own purpose. The model has the signal it needs to reject it and writes it anyway. The implausibility is even noticed and then overridden in one run, which observes that a 110 percent net retention over three months is suspect, concludes the figures "must be trailing twelve-month calculations rather than period-specific," and ships the period-specific values regardless, 0.970 and 0.979 where the answer is 0.768 and 0.871.

The verification in these runs is real but ultimately local: it confirms that the computed numbers are internally consistent, never that they are plausible or that the right quantity was set up, so a rollout that begins from a wrong framing reinforces it through checking rather than escaping it. The same blind spot produces the universal YoY failure, where every run leaves the opening-year monthly columns blank because the year-over-year figure needs the prior-year ARR that sits before the window and the model never reaches back for it.

3. Claude Opus 4.7: mean 0.394, median 0.377, n=16

Opus 4.7 has no single signature; it commits qualitatively different errors from one rollout to the next, and which one it lands on sets the score. The base tape is reliable everywhere, with ending ARR matching the oracle in every run, but the decomposition on top of it scatters. The cleanest runs classify the flows correctly and sit near 0.62. A cluster collapses the recovery classification toward nothing, treating the customers who lapse and return as if they never existed, and lands around 0.38. Another cluster over-recovers, grossing up far more than the data supports and carrying a tape that runs slightly hot as a result, and lands in the high 0.20s. Three runs flip the sign convention on the loss rows, entering Downsell, Late Renewal, and Churn as positive magnitudes so the waterfall no longer foots, and the single worst run does that on top of the over-recovery. The same model on the same data produces this entire spread.

Retention is the ceiling every run meets, and the way it fails is the revealing part. Twelve of the sixteen runs raise the trailing-twelve-month reading explicitly in their reasoning, so the model plainly knows the figure is not a single-period number, yet not one run carries that through to the YTD-2023 column, which every rollout reports near 98 percent off the literal final quarter. The sharpest case is the top run, which computes the monthly retention correctly on a trailing window and then reverts to the literal quarter for the annual column sitting directly beside it. The gap is not knowledge of the metric but the discipline to apply the reading it has already articulated, and because no run closes it, 4.7 tops out near 0.62, exactly where 4.8's two strongest runs carry the same insight all the way through and clear it.

4. Claude Sonnet 4.6: mean 0.389, median 0.329, n=16

Sonnet's base work is the cleanest of the mid-panel models, as most ending levels land. The model's losses, however, fall in several places. The first is the lapse-and-return classification, where the model is inconsistent about which customers drop and return: some runs under-detect them and collapse recovery toward nothing, others over-detect and inflate the recovery and late-renewal pair, and the same model swings both ways across rollouts, which makes the flow decomposition the cohort's main source of variance. Retention is another layer the analysis does not reach. The oracle measures it over a trailing twelve months, so the monthly row moves through the high-70s to low-90s; Sonnet computes each period in isolation, where almost nothing churns, and writes a flat line near 98 percent, a figure asserting that nothing ever leaves while the churn it booked sits one row above. What sets Sonnet apart is that the trailing-window reading is never even entertained: across all sixteen runs not one raises the possibility, where other models surface it and occasionally act on it.

5. GLM-5.1: mean 0.335, median 0.278, n=16

GLM sits in the middle of the panel, rollouts spread wide from 0.039 to 0.594. It reliably reproduces the coarsest output, the year-end ARR levels, which land on the oracle in nearly every run. The failures concentrate in the analysis above ending ARR, and a few recur across the cohort. The first is the flow decomposition itself: the model routinely mis-sizes the categories the gap rule is supposed to produce, undersizing New and Downsell and zeroing out Recovered or Late Renewal where the source shows real gross-ups and lapses, so the bridge that should explain the change in ARR does not reconcile to the deals beneath it. The second is more striking: in some runs the ending level is correct while the beginning balance and the flows feeding it are wrong, reverse-engineered to foot to a known endpoint rather than derived, to the point that a year's beginning does not even equal the prior year's ending. The third is presentational but revealing: the loss rows, Downsell, Late Renewal, and Churn, are frequently recorded as positive magnitudes rather than signed, a given instruction the model has the right numbers to satisfy and simply does not.

Retention is the fourth failure. Gross and net come out inflated, sitting near 100 percent where the business runs in the high-70s to low-90s, a retention line that cannot be squared with the churn the model itself booked one row above. The common thread across all four is that GLM produces output that looks like a finished model and matches at the headline without the cross-checks an analyst applies by reflex: that the bridge foots, that the beginning ties to last year's ending, that retention is consistent with churn. It builds the shell of the analysis and does not interrogate the contents.
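
The reflex cross-checks named above (the bridge foots, the beginning ties to the prior ending, retention is consistent with churn) are cheap to automate. The following is a minimal sketch under stated assumptions: the row names, tolerances, and sample figures are hypothetical, losses are stored as negatives, and the gross-retention check applies the single-period formula, which is only appropriate for a full twelve-month period.

```python
FLOW_ROWS = ("new", "upsell", "recovered", "downsell", "late_renewal", "churn")
LOSS_ROWS = ("downsell", "late_renewal", "churn")


def cross_check(periods, tol=0.5, ratio_tol=0.005):
    """Return a list of issues found across consecutive annual periods."""
    issues = []
    for i, p in enumerate(periods):
        label = p["label"]
        # 1. The bridge must foot to the ending balance.
        if abs(p["beginning"] + sum(p[r] for r in FLOW_ROWS) - p["ending"]) > tol:
            issues.append(f"{label}: bridge does not foot")
        # 2. Each beginning must tie to the prior period's ending.
        if i > 0 and abs(p["beginning"] - periods[i - 1]["ending"]) > tol:
            issues.append(f"{label}: beginning does not tie to prior ending")
        # 3. Loss rows must be signed, not positive magnitudes.
        if any(p[r] > 0 for r in LOSS_ROWS):
            issues.append(f"{label}: loss row entered as a positive magnitude")
        # 4. Reported gross retention must agree with the booked flows.
        reported = p.get("gross_retention")
        if reported is not None and p["beginning"]:
            implied = (p["beginning"] + p["recovered"]
                       + sum(p[r] for r in LOSS_ROWS)) / p["beginning"]
            if abs(implied - reported) > ratio_tol:
                issues.append(f"{label}: gross retention {reported:.0%} "
                              f"vs {implied:.0%} implied by flows")
    return issues


# Made-up figures: FY2021 foots to its ending but is wrong underneath.
fy2020 = dict(label="FY2020", beginning=100.0, new=30.0, upsell=10.0, recovered=5.0,
              downsell=-5.0, late_renewal=-5.0, churn=-15.0, ending=120.0,
              gross_retention=0.98)
fy2021 = dict(label="FY2021", beginning=110.0, new=20.0, upsell=0.0, recovered=0.0,
              downsell=0.0, late_renewal=0.0, churn=10.0, ending=140.0)
issues = cross_check([fy2020, fy2021])
```

In the sample, FY2021 foots to its ending balance yet still fails two checks, which is the pattern of a bridge reverse-engineered to a known endpoint.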

6. Gemini 3.1 Pro: mean 0.240, median 0.244, n=16

Gemini lands in the lower-middle of the panel, rollouts banded tightly between roughly 0.20 and 0.30 with one run reaching 0.473 and one that produces almost nothing. The year-end ARR levels come out broadly right, so its failures sit in the classification work beneath the totals. Almost no run strips non-recurring revenue such as one-time PoCs, ignoring the clear instruction in the source file to do so. Half the cohort also builds the waterfall off the Bookings Master, a deal-level record of bookings at signing, rather than the tape of recurring-revenue actuals. Because the model's classification choices offset, the ending level survives while the decomposition meant to explain it is wrong, so these runs pass the totals and fail the bridge. The second recurring mode is logos: the count drifts a few high on churn-boundary calls and compounds through each year's beginning, so even the strongest run never lands them exactly.

The thread is that Gemini does not interrogate its inputs or its outputs: it applies the three-month rule correctly but to the wrong series and without removing non-recurring revenue, never asking whether the lapses it books are real, and it posts a retention line it never tests against its own churn. The capability is visible in the single 0.473 run, the one that treats the starter file as a template, finds the recurring-revenue actuals, and excludes the one-time PoCs. Having gotten the source right, it is left with the harder residuals, the exact logo count and the trailing retention column, that the rest of the cohort never reaches.

7. GPT-5.4-mini: mean 0.111, median 0.082, n=16

GPT-5.4-mini sits near the bottom of the panel (0.111 mean, n=16, median 0.082, best run 0.319). Rollouts run long — well over 100 messages and 50-plus tool calls in most cases — and are complete, but the budget goes to re-clarifying the task and simplifying its own approach rather than converging on an answer. The consequence is that the model never establishes the ARR tape in the first place: ending ARR lands roughly an order of magnitude low (around $1M against a ~$14M FY2020 base), consistent with failing to annualize monthly MRR and aggregating only part of the book. Because the tape doesn't hold, everything built on top of it degrades downstream, so the retention misses here aren't diagnostic the way they are for the larger models — the model never gets far enough up the stack for them to mean anything. The signature is agency rather than arithmetic: it has the tool access and runs the steps, but spends the run hedging assumptions and narrowing scope instead of committing to and verifying a base. A few rollouts even note that ARR needs annualizing, then ship the mis-scaled tape anyway.

8. MiniMax M2.7: mean 0.012, median 0.010, n=16

MiniMax floors the panel: every rollout lands within a hair of zero, the best at 0.039, with the typical run earning nothing beyond surface hygiene — a well-formed, correctly formatted, error-free sheet — while every substantive figure misses. The failure is upstream of anything diagnostic. The model reads the source and correctly identifies the monthly columns as recurring revenue ("a value of 99 means $99 MRR"), then carries those figures straight into the ARR rows without annualizing them, producing a tape that is uniformly an order of magnitude light: Total ending ARR builds from roughly $1.15M in FY2020 to $3.0M at YTD-2023 against an oracle running $13.7M to $35.8M — a clean ~12x gap that is the missing MRR-to-ARR conversion, compounded by building off the deal-level bookings file rather than the recurring-revenue actuals.
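
The missing conversion is a single multiplication, and the ~12x gap is itself a usable diagnostic. Here is a minimal sketch, assuming ARR is simply twelve times MRR; the helper names and the 10% tolerance are hypothetical, and in a live run the reference figure would have to come from an independent source in the data room rather than from the oracle.

```python
MONTHS_PER_YEAR = 12


def annualize(mrr: float) -> float:
    """Convert monthly recurring revenue to ARR."""
    return mrr * MONTHS_PER_YEAR


def looks_unannualized(candidate_arr: float, reference_arr: float,
                       rel_tol: float = 0.10) -> bool:
    """True when the candidate is about 1/12 of the reference."""
    ratio = reference_arr / candidate_arr
    return abs(ratio - MONTHS_PER_YEAR) <= MONTHS_PER_YEAR * rel_tol


arr_for_99_mrr = annualize(99)   # the "$99 MRR" cell quoted above
fy2020_ratio = 13.7 / 1.15       # oracle vs. model, $M, from the text
ytd2023_ratio = 35.8 / 3.0       # oracle vs. model, $M, from the text
```

Both ratios land just under 12, which is the signature of monthly values carried into annual rows.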

The signature is an absence of scale sense: the model runs the full procedure — segments the book, applies the gap rule, grosses up Recovered, formats to spec — and closes by presenting a fraud-prevention business carrying ~$3M of ARR in its own summary without the number giving it any pause, where an analyst's first instinct would be to distrust a base that small. It commits confidently to a wrong-scale tape and ships it rather than interrogating it. One rollout drops out entirely, never writing the Retention Summary and leaving its work in the bookings workbook; the single run that annualizes the total correctly still cannot decompose it, posting a Commercial Sales block that alone nearly equals the whole company and segments that neither reconcile to one another nor sum to the total.

2b.Why this task is hard

retention_summary sits at the hard end of the suite. Mean weighted reward runs from 56% at the top (GPT 5.5) into the 30s (Opus 4.7, Sonnet 4.6, GLM 5.1) and below. The distribution is what matters: rollouts run from near zero (MiniMax's mean is 1.2%) to 89%, and the single 89% run came from Opus 4.8, whose mean sits at 51%. A capability that no model expresses reliably but that one model executes cleanly on its best attempt is latent, not absent, and the gap between a 51% mean and an 89% ceiling is the headroom RL has to work with.

The difficulty is not arithmetic. The math is sums, ratios, and a gap rule — nothing an analyst would call hard. What the task evaluates is retrieval and disentanglement: identifying the canonical file among several versions, sourcing from the reconciled ARR summaries rather than the rawer Bookings Master, keying customers by ID rather than display name, applying the consolidation adjustments the prompt specifies. These are general agentic capabilities, not finance trivia, and the task is built to exercise the skills that transfer to any document-dense environment rather than an esoteric calculation whose failure would say nothing about underlying capability.

Where finance knowledge does matter, the gaps trace to the training corpus. The mechanics of a private growth-equity diligence are not the mechanics of a public-company transaction, and it is public-company material — filings, transcripts, sell-side research — that saturates pretraining. Cohort-level ARR bridges, segment rollups, Late Renewal and recovery treatment, and retention summaries live inside proprietary data rooms rather than public disclosure, and the models reason accordingly: fluent on public-market framing, unreliable on private-capital convention. The more telling gap, though, is passivity rather than knowledge. A model will produce a wall of 98–100% monthly retention and never register what an analyst reads instantly as a methodology artifact; it sums a year of revenue, lands on a figure that does not reconcile against the summary, and proceeds without asking why, never tracing the discrepancy to the proof-of-concept accounts it failed to exclude. What is missing is the reflex to interrogate a number that should look wrong, and across the panel it is largely absent.

This mirrors the live job. A sell-side data room is 500–600 documents handed to an associate under deadline, with several versions of company documents built at different stages. The task is a byproduct of an actual $100M+ growth-equity diligence rather than a constructed exercise, so the disambiguation problems are the ones the deal itself produced.

One practical consequence follows from the grading: because reward is value-based and resolves against cells in the output workbook, the rubric can run with no LLM judge in the loop, so judge cost is effectively zero.
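
A value-based grader of this kind can be sketched in a few lines. Everything below is an assumption made for illustration: the `Criterion` structure, the cell addresses, the weights, and the 1% relative tolerance are hypothetical and not taken from the actual rubric, and reading cells out of the .xlsx file is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    cell: str        # hypothetical cell address in the output workbook
    expected: float  # oracle value for that cell
    weight: float    # rubric weight earned if the cell matches


def grade(cells: dict[str, float], rubric: list[Criterion],
          rel_tol: float = 0.01) -> float:
    """Return earned weight over total weight; no LLM judge involved."""
    earned = sum(
        c.weight for c in rubric
        if c.cell in cells
        and math.isclose(cells[c.cell], c.expected, rel_tol=rel_tol, abs_tol=1e-9)
    )
    return earned / sum(c.weight for c in rubric)


# Hypothetical two-criterion rubric; the first value pair mirrors a
# 97.99% model output against an 87% oracle figure.
rubric = [Criterion("AX10", 0.87, 2.0), Criterion("AX11", 1.05, 1.0)]
output_cells = {"AX10": 0.9799, "AX11": 1.05}
score = grade(output_cells, rubric)
```

Only the second cell matches, so the sketch returns one third of the available weight.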

2c.Compute Cost per Rollout

[Figure: Average Cost per Rollout. Mean USD spend per single retention_summary attempt (n=16 per model).]

3.Sample Trajectory

Sample Trajectory: GPT 5.5 (57/103.5, run e78285)

Oracle: 2.1.1_Project Sentinel – Financial Model (New Retention Summary), Clean.xlsx

GPT-5.5 Sample Trajectory — Commented Excerpts

Run e78285 (57 / 103.5). Excerpts below are the exact passages each review comment is anchored to, in document order. Long tool-output tables in the first excerpt are elided for length; everything else is verbatim from the trajectory.

Excerpt 1 — Source selection: never opens the ARR cube

Comment

The agent rejects 9.79 Revenue by Customer, jumps straight to bookings, and never opens 2.3.3 ARR by Product despite having listed it in its own file-discovery output (the early find results show 2.3.1–2.3.4 ARR by Product). An analyst would have known that any retention summary pulls directly from the ARR customer cube.

Referenced snippet — reasoning, then the sequence of files the agent actually reads:

reasoning:
**Calculating ARR retention**

I need to compute annual recurring revenue (ARR) retention. First, I should identify the source of monthly ARR by customer or segment, possibly from the booking master or revenue data. It seems like I need to read the headers to get the information I need. Gotta make sure I grab the right data to ensure accurate calculations. Let's dive into that and see what I find!

spreadsheets_read_tab calls, in order:
  - 9.79_Project Sentinel - Revenue By Customer.xlsx
  - 9.82_Project Sentinel - Revenue by Customer (2020).xlsx
  - 9.60_Project Sentinel - BD ARR by Customer (Fixed and Variable Split) .xlsx
  - 9.2_Project Sentinel - Customer Key (05.18.2023).xlsx
  - 9.83_Project Sentinel - Bookings Master File (2019 - June 19th 2023).xlsx

[tool-output tables elided]

Excerpt 2 — YTD 2023 window is only 3 months, never annualized

Comment

For 2020/2021/2022 the window is a full 12 months so the ratio is annual and correct, but for 2023 the period is only 3 months (37,39 = Jan–Mar) — so three months of churn/downsell are divided into a full-year base with no annualization or TTM treatment. This compresses every YTD retention metric toward 100% (Total GRR 97.99% vs. oracle 87%, Federal GRR 100% vs. 72%, SMB Logo Ret 91% vs. 68%) — rendering these metrics worthless.

Referenced snippet:

periods=[('2020',1,12),('2021',13,24),('2022',25,36),('YTD 2023',37,39)]
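
One possible repair, assuming the YTD column is meant to be read on a trailing-twelve-month basis, is to widen the partial-year window so that it always spans twelve columns. The sketch below keeps the excerpt's indexing (index 0 is Dec 2019 and index 39 is Mar 2023); the helper name `ttm_window` is hypothetical.

```python
# Full calendar years, using the same inclusive (label, start, end) tuples
# as the excerpt.
FULL_YEARS = [("2020", 1, 12), ("2021", 13, 24), ("2022", 25, 36)]


def ttm_window(label: str, end_idx: int, months: int = 12) -> tuple[str, int, int]:
    """Inclusive window covering the `months` columns that end at end_idx."""
    start_idx = end_idx - months + 1
    if start_idx < 0:
        raise ValueError(f"{label}: fewer than {months} columns of history")
    return (label, start_idx, end_idx)


# YTD 2023 now spans Apr 2022 (index 28) through Mar 2023 (index 39).
periods = FULL_YEARS + [ttm_window("YTD 2023", 39)]
```

For a December year-end the helper reproduces the calendar-year tuple unchanged, which is why the full-year columns are unaffected by the windowing choice.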

Excerpt 3 — YoY first 12 months silently zero-filled

Comment

YoY growth for Dec-2019 through Nov-2020 requires prior-year ARR (Dec-2018 onward), which doesn't exist in the chosen source file because it starts exactly at Dec-2019. The initial m['arr_yoy']=[0.0]*n therefore silently zero-fills the first 12 months instead of treating the failed lookback as a signal to go find the missing history. The data was reachable in the same source file (e.g. the Bookings Master File labeled "2019 – June 2023"), but the model let the output window (Dec-2019 onward) define its input universe and never went back for it. The oracle populates these cells with real nonzero values (D23 ≈ +112%), so all 12 are wrong. A YoY column whose first year is structurally un-computable from the source on hand is a cue to widen the data search, not a cell to fill with zero.

Referenced snippet:

m['arr_yoy']=[0.0]*n
m['gross_retention']=[0.0]*n
m['net_retention']=[0.0]*n
m['logo_retention']=[0.0]*n
for a in range(n):
    if a>=12 and abs(m['ending_arr'][a-12])>1e-7
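
A safer pattern, sketched here under assumptions, is to leave a failed lookback as `None` and count the gaps, so that missing history surfaces as a prompt to widen the search rather than as a zero. The function name and the optional `history` argument (ending ARR for the months immediately before the window) are hypothetical.

```python
from typing import Optional


def yoy_growth(ending_arr: list[float],
               history: Optional[list[float]] = None,
               lag: int = 12) -> list[Optional[float]]:
    """YoY growth per month; None where the prior-year value is unavailable."""
    pre = list(history or [])
    series = pre + list(ending_arr)
    offset = len(pre)
    out: list[Optional[float]] = []
    for i in range(len(ending_arr)):
        j = offset + i - lag
        if j < 0 or abs(series[j]) <= 1e-7:
            out.append(None)  # unknown, not zero
        else:
            out.append(series[offset + i] / series[j] - 1.0)
    return out


# Made-up series: 12 flat months, then one month 50% higher.
window = [100.0] * 12 + [150.0]
without_history = yoy_growth(window)
missing = sum(v is None for v in without_history)   # signal to go find history
with_history = yoy_growth(window, history=[50.0] * 12)
```

With twelve months of earlier history supplied, the first cell becomes computable instead of defaulting to zero.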

Excerpt 4 — Monthly retention computed single-month, not trailing-twelve

Comment

Each monthly retention cell is computed as a single-month ratio (one month of flows over that month's beginning), but the oracle reports every monthly column on a trailing-twelve-month basis (the trailing 12 months ending that month). Using [a] instead of summing [a-11:a+1] over a beginning-of-TTM base makes every monthly GRR/NRR sit near 99–100% (e.g. Apr-21 GRR reads ~98.6% here vs. the oracle's 78%). This isn't a cosmetic difference: a one-month ratio divides tiny 30-day movements into a large base, so it pins near 100% regardless of what's actually happening, and the 0.99–1.00 string across the monthly GRR/NRR cells is the tell. It looks precise but carries almost no signal; it's measuring "did anything move in 30 days" (usually no), not retention, which is fundamentally an annual contract-cycle question.

Referenced snippet:

beg=m['begin_arr'][a]
if abs(beg)>1e-7:
    m['gross_retention'][a]=(beg+m['recovered_arr'][a]+m['downsell_arr'][a]+m['late_arr'][a]+m['churn_arr'][a])/beg
    m['net_retention'][a]=(beg+m['recovered_arr'][a]+m['upsell_arr'][a]+m['downsell_arr'][a]+m['late_arr'][a]+m['churn_arr'][a])/beg
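
For contrast, a trailing-twelve-month version of the same ratios might look like the sketch below. It reuses the excerpt's row names and signed-flow convention, but the function itself and the toy data are hypothetical, and it assumes the TTM base is the beginning ARR of the first month in the window.

```python
def ttm_retention(m, a, months=12):
    """Return (gross, net) retention for the window ending at month a, or None."""
    start = a - months + 1
    if start < 0:
        return None                      # not enough history for a full window
    base = m["begin_arr"][start]         # beginning ARR of the first window month
    if abs(base) <= 1e-7:
        return None

    def ttm(row):
        return sum(m[row][start:a + 1])  # flows over the full trailing window

    retained = (base + ttm("recovered_arr") + ttm("downsell_arr")
                + ttm("late_arr") + ttm("churn_arr"))
    return retained / base, (retained + ttm("upsell_arr")) / base


# Toy data: $20 of churn every month on a $1,200 opening base, nothing else.
n = 12
m = {
    "begin_arr": [1200.0 - 20.0 * k for k in range(n)],
    "recovered_arr": [0.0] * n,
    "upsell_arr": [0.0] * n,
    "downsell_arr": [0.0] * n,
    "late_arr": [0.0] * n,
    "churn_arr": [-20.0] * n,
}
single_month_grr = (m["begin_arr"][11] + m["churn_arr"][11]) / m["begin_arr"][11]
ttm_grr, ttm_nrr = ttm_retention(m, 11)
```

In this toy series the same steady churn reads as roughly 98% on a single-month basis and 80% on a trailing-twelve-month basis.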

4.Per-flagship analysis

4a.Cross-provider task patterns

Anthropic (Opus 4.7, Opus 4.8, Sonnet 4.6)

For Anthropic models, all three reproduced ARR and logo counts well but collapsed on derived ratios, and the failures matched across models rather than scattering: each over-applied a narrow "no data before Dec 2019" instruction to the whole calculation and mishandled the trailing-twelve-month column identically. This suggests shared inductive biases in reading an ambiguous spec, not random error. The models also defaulted to silent assumptions, and picked source files inconsistently run-to-run. Tellingly, self-verification reinforced errors instead of catching them: they confirmed internal consistency but never questioned their definitions, yielding well-formed wrong answers. The one behavior correlated with success was reconciling the data source to a known total before building.

OpenAI (GPT 5.5, GPT 5.4 Mini)

Both OpenAI models show the same core pattern: they get the additive numbers right (ARR levels, flows, balances) but fail almost every calculated ratio (retention and growth metrics) in the exact same way. The shared mistakes are conceptual, not arithmetic: both misread how retention should be windowed over time and both zero out year-over-year growth for the early months by over-applying a prompt caveat. Because two models of very different sizes make the identical errors, this looks like a provider-level assumption about how these metrics work rather than random noise. Size mainly affects reliability, not understanding: GPT-5.5 is clean and consistent, while GPT-5.4-mini is much noisier. Overall, these models are good at mechanical aggregation but share a lack of finance subject-matter expertise, and the smaller one adds output-stability failures the larger one has outgrown.

Google (Gemini 3.1 Pro)

Gemini 3.1 Pro behaves like a confident, single-pass executor: it succeeds in creating the mechanical scaffolding but doesn't interrogate its own reasoning. It commits to a plausible interpretation early and never revisits it, so the same conceptual errors recur near-identically across runs, clustered at the ambiguous edges of the task while the well-specified middle stays clean. It never self-verifies before declaring done, then closes with a polished summary asserting full correctness regardless of accuracy. The takeaway: its failures are systematic and reproducible rather than random, rooted in how it commits to instructions, and its stated confidence doesn't track correctness.

Z-AI (GLM 5.1)

GLM-5.1 reliably gets the raw values right (like ending balances) but consistently fails at the calculations built on top of them. It misreads what the metrics actually mean, using a single-period measure where the task asks for a rolling one, and leaving some fields blank instead of going to find the data they need. The most telling pattern is how it checks its own work: it confirms the numbers are internally consistent (flows add up, segments total correctly), sees that they pass, and calls it done, without ever checking against what the task actually asked for. So the model ends up confident and self-consistent but still wrong. This looks less like a limit on its ability than a grounding problem: it trusts its own internal checks instead of the instructions, which makes results vary significantly between runs and hides the real errors behind clean-looking output.

MiniMax (M2.7)

MiniMax M2.7 reliably produces fluent, confident work that looks complete but is almost entirely wrong. Its mistakes are systematic, not random: it misunderstands the core ideas behind the task rather than just slipping up. Two behaviors stand out. It never checks its answers against reality, accepting clearly impossible results without pause, and it makes things up, inventing specific identities for data that it can't find. Across repeated runs, its approach keeps shifting, but it never treats that inconsistency as a sign something is wrong. The common thread is confidence that has come loose from correctness: the model mistakes sounding right for being right.

5.Capability failures (economically-valued)

[Figure: Output Tokens vs Mean Reward. Scatter plot; x-axis: mean output tokens per rollout (20k–120k); y-axis: mean reward (10%–70%). Models in the upper-left quadrant achieve higher reward with fewer tokens.]

Model	Mean reward	Output tokens
GPT 5.5	58.25%	28,612
Opus 4.8	53.22%	80,727
Opus 4.7	40.81%	54,469
Sonnet 4.6	40.25%	71,247
GLM 5.1	34.63%	94,635
Gemini 3.1 Pro	24.81%	37,767
GPT 5.4-mini	11.53%	113,510
MiniMax M2.7	1.25%	57,531

Across the panel models that complete retention_summary, residual failure modes cluster into eight recurring gaps. Each failure mode is paired with a quoted reviewer annotation from the GPT-5.5, Opus 4.7, or GLM-5.1 trajectory where it manifests, plus the rubric cluster that catches it. Stakes are scoped to the $200M growth-equity check at the ~$1B valuation framed in the deal CIM, sized by the growth-equity associate who ran the live diligence.

1.Treats a standard annual figure as if it's measured monthly

Stakes: $5–20M of equity at the IC, since implied NRR anchors the term-sheet multiple.

Description: Retention is conventionally measured year over year: it tracks how many customers (or how much revenue) carry across renewal cycles, which are annual. The model computes it month over month instead. A single-month ratio barely moves, so the output is a wall of 98–100% rows that an analyst would immediately read as a methodology artifact, not a finding.

Example: GPT-5.5's actual Gross Retention computation, written to code in the trajectory:

m['gross_retention'][i] = (b + m['recovered_arr'][i] + m['downsell_arr'][i] + m['late_arr'][i] + m['churn_arr'][i]) / b

One month of flows over that month's Beginning ARR, instead of summing [i-11:i+1] over a TTM base. The model never deliberates on monthly-vs-annual at all; it goes straight from picking the flow components to writing the single-month ratio.
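A minimal sketch of the contrast, assuming a hypothetical layout in which each month carries a Beginning ARR and a single signed flow total (churn and downsell negative, standing in for the four flow fields in the quoted line). The TTM variant sums the flows over [i-11:i+1] and divides by the Beginning ARR at the start of that window; whether the task's template defines the base exactly this way is an assumption.

```python
def monthly_gross_retention(begin_arr, flows, i):
    """Single-month ratio, as in the quoted line: one month of flows over that month's base."""
    return (begin_arr[i] + flows[i]) / begin_arr[i]


def ttm_gross_retention(begin_arr, flows, i):
    """Trailing-twelve-month ratio: 12 months of flows over the base at the window start."""
    if i < 11:
        return None  # not enough history for a full TTM window
    base = begin_arr[i - 11]              # Beginning ARR of the first month in the window
    window = sum(flows[i - 11 : i + 1])   # flows for months i-11 .. i inclusive
    return (base + window) / base


# Hypothetical data: $1,200 of ARR losing a flat $10 per month (no new or expansion ARR).
flows = [-10.0] * 12                      # signed: churn and downsell are negative
begin_arr = [1200.0 - 10.0 * k for k in range(12)]

monthly = monthly_gross_retention(begin_arr, flows, 11)
ttm = ttm_gross_retention(begin_arr, flows, 11)
print(f"monthly: {monthly:.1%}, TTM: {ttm:.1%}")  # monthly: 99.1%, TTM: 90.0%
```

On the same inputs the single-month ratio reads about 99% while the trailing-twelve-month ratio reads 90%, which is the gap the rubric is catching.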

2.Struggles with what counts as recurring revenue

Stakes: 4–8% ARR overstatement → $20–40M valuation bias on a $1B-EV round.

Description: ARR / MRR is supposed to capture committed, recurring contracts only. The model folds proof-of-concept accounts into the base despite the file's explicit exclusion note, inflating ARR and distorting every downstream number.

Example: Opus 4.7's trajectory output of the bookings header — the model literally printed this exclusion rule into its own context:

(2) MRR and ARR excluded "One-Time Paid PoC's" as they're non-recurring revenue.

Opus then enumerates every column it sees in that same output, including the one that flags whether a deal is Recurring or Non-Recurring: (19, 'Payment Classification'). Despite having both the rule and the flagging column in hand, Opus's next step sums every row's monthly MRR with no Payment Classification filter — pulling the POC rows it just said should be excluded back into the ARR base:

This is the bookings master file. Each row is a deal with monthly MRR values from Jan-19 to Jun-23. To compute ARR by customer-segment by month, I need to: Sum monthly MRR across all deals for each customer per segment per month, ARR = MRR × 12.
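The missing step can be sketched as a filter applied before the sum. Only the Payment Classification column and its Recurring / Non-Recurring values come from the trajectory; the row layout, the customer IDs, and the `mrr` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical bookings rows; only 'Payment Classification' is named in the source.
bookings = [
    {"customer_id": "C001", "Payment Classification": "Recurring", "mrr": 1000.0},
    {"customer_id": "C001", "Payment Classification": "Non-Recurring", "mrr": 250.0},  # paid PoC
    {"customer_id": "C002", "Payment Classification": "Recurring", "mrr": 400.0},
]


def arr_by_customer(rows, recurring_only=True):
    """Sum MRR per customer and annualize (ARR = MRR x 12), optionally dropping non-recurring rows."""
    totals = defaultdict(float)
    for row in rows:
        if recurring_only and row["Payment Classification"] != "Recurring":
            continue  # one-time paid PoCs stay out of the ARR base
        totals[row["customer_id"]] += row["mrr"] * 12
    return dict(totals)


filtered = arr_by_customer(bookings)
unfiltered = arr_by_customer(bookings, recurring_only=False)
# Share by which the unfiltered base overstates ARR on this toy data.
overstatement = sum(unfiltered.values()) / sum(filtered.values()) - 1
```

Comparing `filtered` with `unfiltered` makes the overstatement visible before it propagates into the retention ratios.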

3.Struggles with file versioning in a working data room

Stakes: 2–5% ARR drift → $10–25M valuation bias from stale inputs.

Description: Data rooms accumulate versions and the latest "(Updated)" file is canonical. The model pulls from the superseded source instead, producing a consistent but wrong answer built on stale inputs.
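One way to make the rule mechanical, sketched under the assumption that versions of a file differ only by an "(Updated)" marker in the name. The file names below are hypothetical, and a real data room may need modified dates or explicit version suffixes as well.

```python
import re

# Hypothetical file names; only the "(Updated)" marker comes from the text above.
files = [
    "ARR by Customer.xlsx",
    "ARR by Customer (Updated).xlsx",
    "Churn Risk Register.xlsx",
]


def canonical_files(names):
    """Group files by base name and prefer the '(Updated)' copy when one exists."""
    chosen = {}
    for name in names:
        is_updated = "(updated)" in name.lower()
        # Strip the marker to get a key shared by all versions of the same file.
        base = re.sub(r"\s*\(updated\)", "", name, flags=re.IGNORECASE).lower()
        if base not in chosen or is_updated:
            chosen[base] = name
    return sorted(chosen.values())


selected = canonical_files(files)
print(selected)  # ['ARR by Customer (Updated).xlsx', 'Churn Risk Register.xlsx']
```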

4.Chooses the wrong file to base its analysis on

Stakes: $15–40M of EV depending on which contract escalators flow through to ARR.

Description: ARR summaries are the cleaned, reconciled view; the Bookings Master is a rawer upstream layer that still has non-SaaS products, discounts, and escalators in it. The model builds off Bookings instead of the summaries, reintroducing everything the firm has already stripped out. A partner-promotable senior associate starts from the summaries on the first pass, every time.

Example: Opus 4.7 talking itself out of the cleaned ARR cube and into the raw bookings layer in a single thought:

This is recognized revenue with start/end dates. This is too transactional — we need ARR data. Let me look for the bookings master file which often has ARR.

(The 2.3.3 ARR by Product file was sitting one click away in the same trajectory's earlier file-discovery output.)

5.Edits cells it was told not to fill

Stakes: deliverable integrity. Breaks the senior associate's locked-formula chain.

Description: The prompt locks every cell outside the empty target range. The model still modifies cells, columns, or rows it wasn't supposed to touch, breaking deliverable integrity and clean grading. This is one of the cheapest skills to teach with RL because it is pure constraint adherence, gradable directly off the diff against the source workbook.
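A minimal sketch of such a diff-based check, assuming each sheet has already been loaded into a plain `{cell address: value}` mapping. The cell addresses, values, and target range below are hypothetical.

```python
def locked_cell_violations(source, output, target_cells):
    """Return cells outside the target range whose value differs from the source workbook."""
    violations = []
    for cell in sorted(set(source) | set(output)):
        if cell in target_cells:
            continue  # the model is allowed to write here
        if source.get(cell) != output.get(cell):
            violations.append(cell)
    return violations


# Hypothetical sheets as {cell address: value}; B2:B3 is the empty target range.
source = {"A1": "Gross Retention", "B1": "=SUM(B2:B3)", "B2": None, "B3": None}
output = {"A1": "Gross Retention", "B1": 0.93, "B2": 0.95, "B3": 0.91}
target = {"B2", "B3"}

violations = locked_cell_violations(source, output, target)
print(violations)  # ['B1']: the locked formula was overwritten with a hardcoded value
```

An empty list is the passing case, so the check doubles as a binary reward signal.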

6.Filters by name instead of ID, double-counting case differences

Stakes: logo over-count distorts Logo Retention by 1–3pp on the headline exhibit.

Description: "MegaPlan IT" and "megaplan IT" share a customer ID; the ID is the authoritative key, the name is a display string. The model treats the two casings as separate customers and double-counts the logo.
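The double count is easy to reproduce. In this sketch the two name casings come from the example above, while the customer IDs, the third customer, and the ARR values are hypothetical.

```python
# Hypothetical rows; the two casings of the name come from the example above.
rows = [
    {"customer_id": "C-1041", "customer_name": "MegaPlan IT", "arr": 120_000},
    {"customer_id": "C-1041", "customer_name": "megaplan IT", "arr": 30_000},
    {"customer_id": "C-2087", "customer_name": "Acme Security", "arr": 80_000},
]

# Counting distinct display names treats the two casings as separate logos.
logos_by_name = len({r["customer_name"] for r in rows if r["arr"] > 0})
# Counting distinct IDs uses the authoritative key.
logos_by_id = len({r["customer_id"] for r in rows if r["arr"] > 0})
print(logos_by_name, logos_by_id)  # 3 2
```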

7.Struggles with ratio direction even when inputs are right

Stakes: sign-flip on the headline NRR / GRR figure, which the partner reads as a red flag.

Description: Retention has a conventional direction: what fraction of the start carries over, not how much it grew. The model gets the right beginning and ending values but builds the ratio wrong: flipped denominator, growth / retention confusion, or sign error on contraction. This is a high-leverage RL target because the directionality is rule-based but requires the model to commit to a single convention and verify it against the waterfall identity.
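A sketch of committing to one convention and checking it against the waterfall identity (beginning ARR plus new, expansion, contraction, and churn equals ending ARR). The sign convention, the function name, and the figures are assumptions for illustration, not the task's template.

```python
def retention_metrics(begin, new, expansion, contraction, churn, end):
    """Compute GRR and NRR on beginning ARR after verifying the bridge identity.

    Sign convention (an assumption): contraction and churn are passed as negative numbers.
    """
    if contraction > 0 or churn > 0:
        raise ValueError("contraction and churn must be non-positive")
    if abs(begin + new + expansion + contraction + churn - end) > 1e-6:
        raise ValueError("ARR bridge does not tie to ending ARR")
    grr = (begin + contraction + churn) / begin               # excludes expansion and new
    nrr = (begin + expansion + contraction + churn) / begin   # excludes new only
    return grr, nrr


grr, nrr = retention_metrics(
    begin=100.0, new=20.0, expansion=15.0, contraction=-5.0, churn=-10.0, end=120.0
)
print(f"GRR {grr:.0%}, NRR {nrr:.0%}")  # GRR 85%, NRR 100%
```

In this toy case ending over beginning ARR is 120%, a growth figure; reporting it as retention is exactly the growth / retention confusion described above.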

8.Struggles with applying adjustments stated in the prompt

Stakes: Logo Retention over-stated by 4–6pp from Sep 2020 forward.

Description: The prompt spells out the consolidation entries (two logos collapsed in Sep 2020, one each in Dec 2020, Dec 2021, Feb 2023). The model sums raw logos in the Total Logos block and never applies the adjustments, overstating the count from the first consolidation date forward.

Example: Opus 4.7 deciding how to apply the prompt-stated Consolidation entries:

Without knowing which specific customers were consolidated, I'll go with the simpler interpretation: the Consolidation row is purely informational / additional and shown alongside (not subtracted from) the ending count.

The decision to treat the prompt-stated adjustment as informational rather than apply it is the failure mode.
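Applying the entries does not require knowing which customers were merged. A minimal sketch, assuming each collapsed logo reduces the count by one from its effective month forward (the prompt's exact netting rule may differ) and using a hypothetical raw count:

```python
# Consolidation entries as stated in the prompt: (year, month) -> logos collapsed.
CONSOLIDATIONS = {(2020, 9): 2, (2020, 12): 1, (2021, 12): 1, (2023, 2): 1}


def adjusted_logo_count(raw_count, year, month):
    """Subtract every consolidation dated on or before the given month from the raw count."""
    cumulative = sum(n for (y, m), n in CONSOLIDATIONS.items() if (y, m) <= (year, month))
    return raw_count - cumulative


# Hypothetical raw count of 100 logos held flat to isolate the adjustment.
print(adjusted_logo_count(100, 2020, 8))   # 100
print(adjusted_logo_count(100, 2020, 12))  # 97
print(adjusted_logo_count(100, 2023, 6))   # 95
```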

6.Why labs should hillclimb here

The global SaaS-focused growth-investment associate layer runs ~$2–4B in standalone compensation across ~$800B–1.2T in growth-stage SaaS AUM globally. The associate work tested here (the live retention bridge built inside a working data room) is the highest-frequency quantitative deliverable in SaaS investing and the one no panel model currently produces reliably above 55% across rollouts. A model that can reliably produce 0.7+ retention_summary reward could compress the partner-to-associate ratio at a growth-equity fund from 5:1 toward 2:1, which is roughly $40–60M of annual compensation savings at a $3B-AUM fund.

The capability gap isn't in arithmetic or dense formulas; nearly every model performs the calculations this task requires. It is in judgment. The first gap is analytical: reading a number against the rest of the picture rather than in isolation. A 99% retention reading sitting beside a steadily climbing monthly churn should register as not reconciling. A YoY column whose prior-year lookback comes up empty is a cue to widen the search, not a cell to zero out. The second is source selection: drawing the bridge from the reconciled ARR summaries rather than the rawer bookings master, and taking the most current vintage when five-to-seven versioned copies of the same file sit side by side. The third is attention to detail: not booking POC as ARR when the workbook expressly forbids it.
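The first of these checks can be approximated mechanically. The sketch below compounds twelve monthly churn rates into the annual retention they imply and flags a reported figure that sits far from it. The churn series and the tolerance are hypothetical, and the compounding ignores recoveries and late renewals, so it is a sanity bound rather than a substitute for the bridge.

```python
def implied_annual_retention(monthly_churn_rates):
    """Compound twelve monthly churn rates into the retention they imply over the year."""
    retained = 1.0
    for rate in monthly_churn_rates:
        retained *= 1.0 - rate
    return retained


def reconciles(reported_retention, monthly_churn_rates, tolerance=0.02):
    """True when the reported figure sits within `tolerance` of the churn-implied one."""
    implied = implied_annual_retention(monthly_churn_rates)
    return abs(reported_retention - implied) <= tolerance


# Hypothetical series: monthly churn climbing from 0.5% to 1.6%.
churn = [0.005 + 0.001 * k for k in range(12)]
implied = implied_annual_retention(churn)
print(reconciles(0.99, churn))  # False
```

A `False` here is the cue to revisit the methodology before committing the number.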

Transferable capability: reconciliation across versioned data layers and applying SaaS metric conventions reliably, against banker-exaggerated framing. This is the core associate skill in growth equity, late-stage SaaS VC, tech / SaaS M&A advisory, and SaaS equity research. A model that does it reliably on retention_summary transfers to the full GE / SaaS-VC / tech-IB / SaaS-equity-research associate day-one workflow, which represents roughly $4–7B globally in addressable associate-and-senior-associate compensation across those four verticals.

Three RL targets. One, growth-equity fundamentals. The domain substrate the rest of the task sits on: the classification logic that sorts a return into late renewal versus churn and grosses recovered back to its pre-gap amount, and retention read as a trailing-twelve-month construct rather than a single month or a partial-year stub. Two, verification and iteration discipline. Train the model to read each output against the rest of the picture and to re-pull and re-read before committing, so a number that doesn't reconcile becomes a signal to revise rather than a value to lock in. Three, source and version selection. Every real data room looks like this: the sell side, its bankers, and third-party accountants each drop their own copies of the same underlying file, and those copies get revised across the deal's life, so five-to-seven versions sit side by side with nothing marking which is canonical or current. A model that can identify the right artifact and the right vintage, rather than anchoring on the first file it opens, has learned search and retrieval that generalizes to any diligence environment.

Appendix

Per-model rollout economics

Model	n	Cost/run	Total cost	Latency	Tool calls	Input tokens	Distinct tokens	Output tokens	Total tokens	W.Score (%)	R/$
Gemini 3.1 Pro	16	$1.77	$28.34	577.38s	47.31	1,721,904.8	541,023.4	37,766.75	1,759,671.6	24.81	15.29
GLM 5.1	16	$2.08	$33.34	2,230.52s	50.25	3,059,833.8	764,769.8	94,635.25	3,154,469.0	34.63	17.79
GPT 5.4-mini	16	$0.60	$9.66	1,095.93s	58.25	2,443,238.1	113,510.1	76,363.00	2,519,601.1	11.53	18.68
GPT 5.5	16	$2.16	$34.50	607.11s	34.38	1,212,536.1	153,656.1	28,612.25	1,241,148.3	58.25	27.70
MiniMax M2.7	16	$0.28	$4.47	1,743.41s	57.19	4,496,926.0	752,799.6	57,530.50	4,554,456.5	1.25	6.68
Opus 4.7	16	$2.81	$44.93	708.11s	31.25	1,936,146.9	83,225.6	54,469.44	1,990,616.4	40.81	15.14
Opus 4.8	16	$4.65	$74.47	1,553.95s	46.06	3,938,379.3	116,005.4	80,726.75	4,019,106.1	53.22	12.06
Sonnet 4.6	16	$3.22	$51.49	1,079.46s	65.13	5,667,767.8	130,125.8	71,247.31	5,739,015.1	40.25	12.99