DealBench / consent_schedule

pe_consent_schedule: mid-market PE real-estate close-out

Real data room from a $2B AUM Healthcare firm.

Task is built by a real associate from the fund. Task is validated by experts from Shore Capital, Houlihan Lokey, Audax Private Equity.

1.Introduction

pe_consent_schedule evaluates frontier models on one deal close-out workflow: reconciling landlord consents and lease assignments across a private-equity data room.

A McKinsey survey of 2,500 M&A transactions found that one in ten dies in the close-out phase. Close-out is the buyer's final lap of diligence, roughly six months into a deal, when deal associates and counsel work through hundreds, sometimes thousands, of files: customer contracts, IP agreements, lease assignments, debt schedules. The job is to confirm that the target genuinely owns what it claims to own, has paid what it owes, keeps the goodwill of its people, and will keep operating as a going concern. Any single discrepancy could be existential for the deal.

1a.What is the task?

In Project Grizzly, a $2B AUM healthcare PE firm is acquiring a physical-therapy platform with 18 leased clinic locations for $75M. When a PE buyer takes over a platform with leased operating sites, each landlord has to sign off on the change of lease control, and the deal cannot be signed until those consents are reconciled site by site. The model takes a 657-file data room and the closed real-estate diligence binder (landlord consents, lease assignments, master lease portfolio, data-room index, broker correspondence, tail-coverage invoices) and produces an Excel schedule that lists every operating clinic with its consent-track and assignment-track signer status, flags every outstanding or conditional item on a separate Open Items tab for deal counsel, and ships a one-paragraph cover memo for the deal lead. The deal team needs that reconciled before signing, because a single unresolved landlord consent can derail the transaction.

pe_consent_schedule is built by the associate who led diligence for Project Grizzly.

1b.Why is it economically important?

Firm IRR: By this point the fund has already spent months on diligence, legal review, management meetings, and outside-counsel support. If the consent schedule is wrong, it can mean a delayed signing, a last-minute waiver negotiation, a landlord dispute, or a site-level transfer defect that surfaces after close. The errors land at the point in the deal where the fund has the most capital and time committed.

The work falls on lean teams: Firms with less than $5B in AUM typically rely on a single general counsel, often leaving close-out document verification to deal associates.

The capability generalizes well past finance: The PDF reasoning capabilities tested in pe_consent_schedule extend far beyond PE. The same layered reasoning (across typed text, images, embedded objects, and metadata) underpins claims adjudication, prior authorization, billing, mortgage-packet review, and contract diligence. In fact, 1 in 5 white-collar Americans, on the order of 20 million people, spend ~70% of their day wrangling these messy documents. Getting this primitive right solves one of the most basic building blocks of document-reasoning work and is a prerequisite for every higher-order task built on top of it. The unlock here is large: an estimated $200 billion in annual labor spend.

1c.How do models fail?

When you give a model real-world deal PDFs that mix searchable text with images, scanned pages, DocuSign overlays, wet-ink signatures, missing attachments, and redacted content, reliability drops sharply.

Frontier models fail in two ways. First, they over-index on textual reasoning: they treat the PDF text layer, filename, or signature-block cues as enough evidence, instead of verifying what is actually present on the rendered image layer. Second, they misapply business logic even when the right evidence is available; they may find the correct signature, but still classify the signer status incorrectly, confuse unsigned with missing, or fail to carry the issue into the final Open Items tab.

1d.Task: Prompt + Rubric

Prompt:

From: J. Patel, VP — P. Capital

To: [Associate]

Subject: Grizzly real estate close-out — need by tomorrow AM

Quick turn for me by 9am.

For each in-scope clinic, break out where we stand on tracks: the consent and the assignment agreement. Use exactly these columns, one row per site: site, consent_landlord_signer, consent_assignor_signer, consent_assignee_signer, assignment_assignor_signer, assignment_assignee_signer, notes. Signer columns get the name, OR blank (doc on file, party didn't sign), OR n/a (doc not on file). Anything outstanding, missing, or conditional — surface it on a separate "Open Items" tab (columns: site, category, description) so I can walk it through with deal counsel before signing.

Save the schedule (xlsx, two tabs: Schedule and Open Items) and a one-paragraph cover memo (docx, under 250 words — keep it skim-length so I can read it in the elevator) to the Grizzly data room when done.

Rubric distribution: what the task actually grades

22 atomic criteria across 13 steps, grouped into seven skill clusters. Weight share is each cluster's fraction of the 32 max-possible rubric weight. Useful for a lab deciding where in the close-out / consent-reconciliation pipeline to put gradient signal.

| Skill cluster | Steps | n criteria | Weight share |
|---|---|---|---|
| Site coverage & data-room inventory | 1–2 | 1 | 6.3% |
| Fully-executed signature reads | 3–4 | 8 | 37.5% |
| Landlord-missing classifications | 5–6 | 4 | 18.8% |
| Missing-document inference (Santa Ana, Temecula) | 7–8 | 2 | 9.4% |
| Cross-page / cross-track discrepancy (West LA, Riverside) | 9 | 2 | 7.8% |
| Open Items synthesis + n/a convention | 10–11 | 3 | 14.1% |
| Deliverable hygiene (formulas + cover memo) | 12–13 | 2 | 6.3% |

Signature reads (37.5%) and landlord-missing classifications (18.8%) together carry 56% of the weight, because the task is fundamentally about verifying who actually signed, on both the text layer and the image layer, and reconciling any discrepancies. The rest grades scope, business conventions (missing-document inference, discrepancy-catching, etc.), and Open Items synthesis: converting correct reads into a complete, counsel-ready schedule.
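As a quick check on the rubric table, the weight shares can be converted back into rubric points, using the 32-point maximum stated in the text. This is only a sketch: the point values are back-derived from the published percentages rather than taken from the rubric itself, the half-point granularity is an assumption, and the cluster keys are shorthand labels invented here.

```python
# Weight shares from the rubric table, as fractions of the 32-point maximum.
MAX_WEIGHT = 32
SHARES = {
    "site_coverage": 0.063,
    "signature_reads": 0.375,
    "landlord_missing": 0.188,
    "missing_document": 0.094,
    "discrepancy": 0.078,
    "open_items": 0.141,
    "hygiene": 0.063,
}

def implied_points(shares, max_weight=MAX_WEIGHT, step=0.5):
    """Convert shares to points, rounded to the nearest `step` (assumed granularity)."""
    return {name: round(share * max_weight / step) * step
            for name, share in shares.items()}

points = implied_points(SHARES)
# Signature-state weight: fully-executed reads plus landlord-missing classifications.
sig_points = points["signature_reads"] + points["landlord_missing"]
print(points)
print(sig_points, sum(points.values()))
```

Under these assumptions the two signature clusters come to 18 of the 32 points.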

2.Frontier Panel Results (capability-only)

Model Performance Leaderboard
Capability-only mean reward with 95% confidence intervals across capability-only rollouts (8 models).

| Model | Score | 95% CI | n | Top / Bottom | pass@1 ≥ 0.5 | pass@1 ≥ 0.75 |
|---|---|---|---|---|---|---|
| GPT 5.5 | 61.3% | [58.3%, 69.2%] | 16 | 73.4% / 43.8% | 87.5% | 6.2% |
| Opus 4.7 | 60.4% | [55.2%, 65.5%] | 16 | 81.2% / 40.6% | 87.5% | 12.5% |
| Opus 4.8 | 50.2% | [46.2%, 54.0%] | 16 | 67.2% / 31.2% | 43.8% | 0% |
| Sonnet 4.6 | 30.5% | [20.4%, 40.5%] | 16 | 76.6% / -12.5% | 12.5% | 6.2% |
| Gemini 3.1 Pro | 30.4% | [21.9%, 38.9%] | 16 | 53.1% / 3.1% | 25% | 0% |
| GLM 5.1 | 22.8% | [17.8%, 28.3%] | 16 | 48.4% / 12.5% | 0% | 0% |
| GPT 5.4-mini | 8.6% | [4.9%, 12.2%] | 16 | 26.6% / -10.9% | 0% | 0% |
| MiniMax M2.7 | 3.1% | [-0.4%, 7.2%] | 16 | 21.9% / -9.4% | 0% | 0% |
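Leaderboard statistics of this kind can be computed from per-rollout rewards along the following lines. This is a sketch under stated assumptions: the report does not say how its intervals were computed, so the normal-approximation interval is a stand-in, and the reward list is made-up illustrative data, not panel output.

```python
import math
import statistics

def summarize(rewards, thresholds=(0.5, 0.75)):
    """Mean, median, an approximate 95% CI, extremes, and pass rates per threshold."""
    n = len(rewards)
    mean = statistics.fmean(rewards)
    # Normal approximation; the report's actual CI method is not stated.
    half_width = 1.96 * statistics.stdev(rewards) / math.sqrt(n)
    pass_rates = {t: sum(r >= t for r in rewards) / n for t in thresholds}
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(rewards),
        "ci95": (mean - half_width, mean + half_width),
        "top": max(rewards),
        "bottom": min(rewards),
        "pass_at_1": pass_rates,
    }

# Hypothetical rewards, for illustration only.
demo_rewards = [0.25, 0.5, 0.5, 0.75]
stats = summarize(demo_rewards)
print(stats)
```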

Background: tool access and document verification

The close-out workflow requires agents to reason across dense PDFs containing scanned execution pages, operating-site schedules, and partially structured legal documents. To navigate these materials, models are given multiple retrieval tools.

Via MCPs, the agent can search a PDF's text layer (pdfs_search_pdf), read pages in bulk (pdfs_read_pdf_pages), and render a single page as an image (pdfs_read_page_as_image).

The agent also has access to command-line utilities (grep, glob, ls, etc.) and pre-installed PDF packages (PyMuPDF, pypdf, pdfplumber). Together, these tools cover the entire retrieval surface required for the workflow: digitally generated pages can be read through text extraction, while scanned execution pages, handwritten signatures, and image-only content can be inspected directly through page rendering and image-based reasoning.

Multiple tool trajectories can recover the relevant evidence successfully, but the most efficient verification path is as follows: (1) localize the likely execution page, (2) render it visually, and (3) confirm the signature directly from the image. If the expected signature or execution block is not present, the agent can expand to surrounding pages or broader document search.
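The localize, render, confirm, then expand path can be sketched as a search order over page numbers. This is a minimal illustration, not the harness's logic: `looks_signed` is a hypothetical callback standing in for "render the page and confirm the signature visually", and the page numbers in the demo are invented.

```python
def verification_order(hit_pages, page_count, max_radius=2):
    """Yield 1-based pages to render: text hits first, then neighbours at growing distance."""
    seen = set()
    for radius in range(max_radius + 1):
        for hit in hit_pages:
            for page in (hit - radius, hit + radius):
                if 1 <= page <= page_count and page not in seen:
                    seen.add(page)
                    yield page

def confirm_signature(hit_pages, page_count, looks_signed, max_radius=2):
    """Return the first page where the visual check succeeds, or None if none does."""
    for page in verification_order(hit_pages, page_count, max_radius):
        if looks_signed(page):  # hypothetical render-and-inspect step
            return page
    return None

# Hypothetical document: text hits on pages 4 and 9, ink signature on page 3.
order = list(verification_order([4, 9], page_count=10, max_radius=1))
found = confirm_signature([4, 9], page_count=10, looks_signed=lambda p: p == 3)
print(order, found)
```

Returning None when nothing is confirmed keeps "not found within the radius" distinct from "confirmed blank", which still needs a rendered execution block to support it.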

Failures arise elsewhere. Some models rely on text alone without visually confirming signatures, while others retrieve the correct document region but misattribute signatures or anchor to the wrong entity in dense visual layouts. And even when the correct inputs and evidence are successfully retrieved, models still overwrite, collapse, or contradict their own intermediate reasoning when producing the final output.

The result is that even with sufficient access to the underlying documents — and numerous valid retrieval paths available — models still struggle to execute verification workflows reliably across messy financial and legal document environments.

2a.Commentary by model


Median vs Mean Reward by Model
Gap between median and mean indicates reward variance across rollouts.

| Model | Median | Mean | Gap (mean − median) |
|---|---|---|---|
| GPT 5.5 | 62.50% | 61.30% | -1.20 pts |
| Opus 4.7 | 62.50% | 60.40% | -2.10 pts |
| Opus 4.8 | 50.00% | 50.20% | 0.20 pts |
| Sonnet 4.6 | 31.20% | 30.50% | -0.70 pts |
| Gemini 3.1 Pro | 28.10% | 30.40% | 2.30 pts |
| GLM 5.1 | 21.90% | 22.80% | 0.90 pts |
| GPT 5.4-mini | 7.80% | 8.60% | 0.80 pts |
| MiniMax M2.7 | 3.10% | 3.10% | 0.00 pts |

1.GPT-5.5: mean 0.613, median 0.625, n=16

GPT-5.5 is the top scorer of the three frontier models, though also the highest-variance: runs span 0.438 to 0.734, with most landing in the low- to mid-0.60s.

GPT 5.5 works in three steps: pull everything, then narrow, then look at images if needed.

1. Pull everything: it reads whole PDFs as text in large batches (often every page, 40 to 55 pages at once), not one signature page at a time.
2. Narrow: from that dump, it decides which pages might matter (signature blocks, "IN WITNESS WHEREOF," etc.).
3. Look if needed: only then does it render pages as images or, in some runs, crop pages with a Python script.

The text read is lossy, and reasoning over it degrades. This shows up in 3 ways:

2.Opus 4.7: mean 0.604, median 0.586, n=16

Second-best model on the panel, with the best scoring run of all. Across runs, scores cluster between 0.53 and 0.67, with one outlier at 0.81 and one at 0.41. 4.7's main strength is that it doesn't fabricate signatures. Like GPT-5.5, it closed all 16 rollouts with no hallucinated signatures and had a false-positive rate of 0.5%, the lowest measured. On the four partial-execution traps (Baldwin Park, Chatsworth, Garden Grove HQ, Montclair), where a typed name sits above an unsigned block, it correctly returned blank every time. Its hit rate was 89%, second only to GPT-5.5.

The main failure is on fully-executed consents where the landlord signs on a separate page. 4.7 renders a lot (about 26 page-as-image calls per run), but it tends to render at a fixed interval (e.g. pages 1, 5, 9) and skips the actual signature page. It often lands on the typed name block instead ("Name: Dominick Alicastro") and records the cell as blank. The pattern is consistent: Hollywood was blank in all 16 runs, Anaheim in 15, San Bernardino in 14, Long Beach in 11, Riverside in 8. The predictor is simply whether the signature page was rendered: on correct cells it had rendered that page 61% of the time, on incorrect cells 3%.

The second problem lies with what reaches the final schedule. The six sites with no consent on file should appear as n/a across signer columns with an Open Items entry, but they drop out entirely. Some runs also delete Garden Grove, which has a consent, and add Laguna Niguel, a closed site. And while 4.7 catches the Riverside landlord-name discrepancy and the West LA signer swap, those findings are not flagged a majority of the time.

So its inspection is broad but its bookkeeping is weak: correct findings often fail to land in the schedule, and it rarely re-opens the page behind a blank.

3.Opus 4.8: mean 0.502, median 0.484, n=16

The same family profile as Opus 4.7: Opus 4.8 is a strong visual reader but a weak finisher. It renders heavily enough to see the relevant evidence, with roughly 70% of document reads as image renders, and performs well on the hardest extraction cases: it catches the empty landlord blocks in the partial-execution traps 92% of the time and is the cohort leader on Santa Ana, correctly marking the missing assignment document as n/a. The failure is not perception but scope and override behavior. Across all 16 runs, it drops the six no-consent sites from both the schedule and Open Items, treating the Consents folder as the operative universe rather than enumerating all operating clinics. More concerning, it sometimes reads the page correctly and then rationalizes the evidence away: West LA's signer mismatch is normalized, and Montebello's assignment cells are blanked despite the filename and rendered page indicating a consent. Compared with Opus 4.7, the eyes are still there, but the output discipline is worse: 4.8 is more willing to overrule what it has already seen.

4.Sonnet 4.6: mean 0.305, median 0.328, n=16

Sonnet 4.6 averages 0.305 (n=16, median 0.328) and has the lowest single run on the panel, at −0.125. It is the heaviest reader in the cohort, yet that volume does not translate into accuracy. The dominant failure mode is signature over-claim: on the four trap sites (a typed landlord name sitting above an unsigned line) it reports a signature 33 of 64 times, a full-point penalty each. The cause is a brittle heuristic fixed early and never re-checked. From the first file it inspects, Sonnet locks in a document model (page 4 is the landlord's signature, page 5 the assignor's and assignee's) and reuses it across every file instead of re-deriving it per document. Its orchestration then runs open-loop: later renders are aimed at the pages the heuristic predicts rather than the pages the call actually needs. As a result, it misses every site where the signatures are not on page 4 or page 5. Where Opus 4.8 second-guesses its own render and GPT-5.5 errs toward blank, Sonnet's gaps are over-reliance on heuristics and a lack of self-verification.

5.Gemini 3.1 Pro: mean 0.187, median 0.086, n=16

Gemini 3.1 Pro is one of the lowest "frontier" scorers on the panel. Its defining trait is that it barely looks. Instead of orchestrating the document-vision tools, it reaches for bash (python3, PyMuPDF, pypdf, pdftotext, even tesseract OCR) and tries to script its way through what is fundamentally a visual adjudication. That tool choice is the root error: signed-versus-blank lives in the ink, not the text layer, so reading the extracted text (or noisy OCR) cannot resolve the signature traps. The model both over-claims (19 penalty cells) and leaves cells empty without ever having seen the page. Where Sonnet renders heavily but ignores what it sees, and Opus 4.8 second-guesses a good render, Gemini's failure is one of tool selection and follow-through: it never gathers the visual evidence the task requires, and often never ships an answer at all.

6.GPT-5.4-mini: mean 0.121, median 0.172

Near the floor of the panel. It runs an essentially text-only pipeline and is blind to scanned signature pages, failing the blank-landlord pages outright. Exact-name reads are its weakest measure, and site coverage and Open Items listing both drop on unresolved and blank-landlord sites. The only criteria it clears reliably are the ones that need no rendering, such as formula hygiene.

2b.Why this task is medium

consent_schedule sits at the medium end of the suite. Mean weighted reward runs from 61.3% at the top (GPT 5.5), through 60.4% (Opus 4.7) and 50.2% (Opus 4.8), down to a 3.1% floor (MiniMax M2.7). The top-to-bottom range is 58.2pp, the widest spread of any measured task. The distribution is the point: the three frontier models cluster at 50–61% while the rest fall away. A capability that the strongest models reach about three-fifths of and the weakest cannot touch is one with real headroom, and the roughly 58-point spread is the room RL has to work in.

Three factors push this harder than public-finance or document-retrieval benchmarks. First, the weight concentrates on rendering, not parsing: SIG, the per-clinic signature state, carries 18 of 32 points, and nine of twelve rubrics sit on long, dense documents (30-plus pages, redacted images, DocuSign overlays). These reward rendering the page and reasoning about what a missing signature means, not extracting text, and this is where the panel separates. Second, scoring shifts from detection to contextualization: BIZ carries one of the largest mean-score gaps, averaging 79 points, because it grades not whether a model spots a discrepancy but whether it explains why the discrepancy exists. A decent model flags a local anomaly; a strong one explains why it is there. A flag without a cause just sends a human back to retrace the work, and the time savings the task exists to deliver vanish. For example, on BIZ-05 a model can see two different landlord names across pages and never flag it, a clear miss on something paramount for counsel. Third, four SIG rubrics carry an asymmetric penalty: reporting a blank landlord as signed scores -1.0 rather than 0.0, a 10-point swing across the four, which punishes over-indexing on partial heuristics about whether a signature is present.
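The asymmetric penalty can be sketched as a per-cell scoring function. The -1.0 penalty is from the text; the 1.5-point credit per criterion is an assumption, back-derived from the landlord-missing cluster's 18.8% share of 32 points spread evenly over four criteria, and the 0.0 score for an n/a answer is likewise assumed. The signer name is a placeholder.

```python
CREDIT = 1.5    # assumed: 6 cluster points / 4 criteria
PENALTY = -1.0  # stated: blank landlord reported as signed

def score_blank_landlord_cell(reported):
    """Score one cell whose ground truth is blank (doc on file, landlord did not sign)."""
    value = reported.strip().lower()
    if value == "":
        return CREDIT   # correctly left blank
    if value == "n/a":
        return 0.0      # assumed: wrong class, but no signature was claimed
    return PENALTY      # a name was reported: over-claimed signature

# Four trap cells, all correct versus all over-claimed.
all_correct = sum(score_blank_landlord_cell("") for _ in range(4))
all_overclaimed = sum(score_blank_landlord_cell("Jane Doe") for _ in range(4))
swing = all_correct - all_overclaimed
print(all_correct, all_overclaimed, swing)
```

Under these assumptions the gap between getting all four traps right and over-claiming all four is 10 points, matching the swing quoted in the text.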

The capability under test is multimodal verification against a real data room. GPT-5.5 tops the panel (mean 61.3%) on the strength of quantitative signature verification: the best signature recall (91.9%) and zero landlord-blank hallucinations. Opus 4.7 is a close second (60.4%) and leads the panel on qualitative synthesis (0.72), the iterative re-rendering and "why is this discrepancy here" reasoning where Anthropic's tool-call-patience training pays off; like Opus 4.8, it never hallucinates a landlord signature. Opus 4.8 (50.2%) shows the same family signature (good at registering an empty signature box on a rendered scan rather than inventing one, with name transcription improving over 4.7) but finishes lower because its residual error migrates downstream of perception: it drops operating sites from the schedule and misses the stalled Temecula consent. In other words, it is stronger where the page must be read and weaker where the answer must be reasoned or formatted. Gemini 3.1 Pro (30.4%) fails on a different axis entirely: it leaves most of the schedule empty (41.7% hit rate, driven by omitted sites). What separates the panel is therefore not document knowledge (every model can parse lease language) but a handful of specific behaviors that are RL-tractable on this rubric and largely out of reach for text SFT.

For a lab, this is the highest-discrimination rubric in the suite: 95% of criteria (21 of 22) show a ≥40pp pass-rate gap between the top and bottom panel model, and 18 of 22 sit in the productive 0.20-0.80 band.

3.Capability failures

Stakes are scoped to the ~$75M EV middle-market healthcare-services acquisition implied by the data room, where landlord consent on each leased clinic is a closing condition before signing.

Output Tokens vs Mean Reward
Models in the upper-left quadrant achieve higher reward with fewer tokens.

| Model | Mean reward | Median output tokens |
|---|---|---|
| Opus 4.8 | 50.2% | 26,737 |
| Opus 4.7 | 60.4% | 33,825 |
| Sonnet 4.6 | 30.5% | 15,539 |
| GPT 5.5 | 61.3% | 17,702 |
| GPT 5.4-mini | 8.6% | 34,649 |
| Gemini 3.1 Pro | 30.4% | 15,184 |
| GLM 5.1 | 22.8% | 25,587 |
| MiniMax M2.7 | 3.1% | 8,957 |

1.Decides a signed/unsigned call on a page by looking at textual cues

Description: The evidence the model needs is already in front of it: a handle to the scanned signature page's image, or a file sitting in the directory listing it just searched. The model declines to open it because a text-level cue suggests there is nothing worth finding. For example, a keyword search comes back empty because the scanned page has no extractable text, so the search can't "see" what is in the image. In each case the cue is misleading, and the model stops just short of the one action, opening the artifact, that would surface the defect.

Models with this behavior: GPT 5.5, Opus 4.7, Opus 4.8

Examples:

This GPT 5.5 trace shows the model letting a text search stand in for opening the page. On 4901-4077-3969.1.pdf it searches the text layer instead of rendering:

tool: pdfs_search_pdf
  input:  { query: "IN WITNESS WHEREOF", file: 4901-4077-3969.1.pdf }
  output: Found 2 match(es) for "IN WITNESS WHEREOF"

Those two text hits anchor where it looks next. It reads pages 1, 71, and 75, and never opens page 70, the actual signature page:

tool: pdfs_read_pdf_pages
  input:  { pages: [1, 71, 75] }

The signature page carries no extractable "IN WITNESS WHEREOF" text, so the keyword search cannot surface it, and the model concludes there is no signature. It never checks the i ± 1 neighbours of its own hits: it reads page 71 but not the adjacent page 70.
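One way to guard against this is to flag pages a text search cannot vouch for before trusting an empty result. The sketch below assumes per-page summaries of (extractable character count, embedded image count), which could be gathered with PyMuPDF's `page.get_text()` and `page.get_images()`. The 20-character threshold and the demo numbers are illustrative assumptions.

```python
def image_only_pages(page_summaries, min_text_chars=20):
    """Return 1-based page numbers that carry images but almost no extractable text.

    page_summaries: list of (text_chars, image_count) tuples in page order.
    A keyword search is blind to these pages, so they should be rendered instead.
    """
    return [
        number
        for number, (text_chars, image_count) in enumerate(page_summaries, start=1)
        if image_count > 0 and text_chars < min_text_chars
    ]

# Hypothetical five-page PDF: pages 3 and 5 are flat scans.
summaries = [(1200, 0), (900, 0), (0, 1), (1100, 1), (5, 1)]
blind_spots = image_only_pages(summaries)
print(blind_spots)
```

Page 4 in the demo has an image but also a full text layer, so it is not flagged; only pages where the text layer is effectively empty are treated as scans.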

2.Reaches the right answer in reasoning, then contradicts it in the cell

Description: Every upstream step works: page rendered, signatures read, correct conclusion stated in the trace. The final cell contradicts it. Three variants. Writedown: writing n/a in a cell whose signature the model just identified. Concatenation: collapsing two names into one signer cell as if both signed when only one did. Normalize-away: cleaning up a real mismatch between the signature block and the agreement body to the familiar name instead of flagging it.

Models with this behavior: Opus 4.7, GPT 5.5

Examples:

Normalize-away on Opus 4.7, West LA (4935-3305-3779). The trace catches the discrepancy explicitly and even calls it an open item:

West LA (11870 Santa Monica Blvd, Suite 208) — Landlord = George Town Plaza, LLC (Carol Ann George Manager) BLANK. Assignor and Assignee BOTH show "John Yang" as printed name on page 5 — This is a discrepancy — the typed name says John Yang but the visible signature appears to be Geoff Sundstrom. Open item: name/signature mismatch on West LA consent. Assignment page 9: Names are "Geoff Sundstrom" as expected. So the consent page 5 typed names are wrong. Definitely an open item.

Having flagged the mismatch as "definitely an open item," it then normalizes both signer cells to the expected name — "Geoff Sundstrom" — and the discrepancy never reaches the schedule.

Concatenation on GPT-5.5, Riverside. The trace works through the landlord pages and concludes Rajiv Kumar's signature is missing:

Reviewing consent documents. I'm looking at the Riverside documents. On page 4, Gurbax Bhasin signed, but I'm unsure if Rajiv Kumar signed on page 5. The images show two assignor/assignee signatures, both Geoff, but no Rajiv's signature. It seems there are two landlord signature pages, but one is blank for Rajiv. I need to ensure if Rajiv Kumar's consent signature is necessary.

Having identified Bhasin as the only signer and Rajiv's page as blank, the output collapses both into one landlord cell as though both had signed:

Riverside | Gurbax Bhasin; Rajiv Kumar | Geoff Sundstrom | Geoff Sundstrom | Geoff Sundstrom | Geoff Sundstrom
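A pre-commit check comparing what the trace concluded with what is about to be written would catch all three variants. The following is a minimal sketch under assumed data shapes: the `findings` structure, the substring match against Open Items, and the discrepancy wording are hypothetical, not the benchmark's grader.

```python
def commit_violations(findings, row, open_items):
    """List mismatches between trace conclusions and the cells about to be written.

    findings: {column: {"signed_by": [names], "discrepancy": str or None}}
    row: {column: cell text}; open_items: list of description strings.
    """
    problems = []
    for column, finding in findings.items():
        cell = row.get(column, "").strip()
        signers = finding["signed_by"]
        # Writedown: a confirmed signer exists but the cell says blank or n/a.
        if signers and cell.lower() in ("", "n/a"):
            problems.append(f"{column}: writedown, trace found {signers}")
        # Concatenation: the cell names someone the trace never confirmed.
        extra = [n.strip() for n in cell.split(";")
                 if n.strip() and n.strip() not in signers]
        if extra and cell.lower() != "n/a":
            problems.append(f"{column}: unconfirmed signer(s) {extra}")
        # Normalize-away: a noted discrepancy must reach the Open Items tab.
        note = finding.get("discrepancy")
        if note and not any(note in item for item in open_items):
            problems.append(f"{column}: discrepancy not carried to Open Items")
    return problems

riverside = commit_violations(
    {"consent_landlord_signer": {"signed_by": ["Gurbax Bhasin"], "discrepancy": None}},
    {"consent_landlord_signer": "Gurbax Bhasin; Rajiv Kumar"},
    open_items=[],
)
print(riverside)
```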

3.Reasons about a file's contents from its filename alone, never opening it

Description: The directory listing returns a file, the model names that file in its scratchpad, and then never opens it. Conclusions get anchored to what the filename implies rather than what the file contains. A title that seems to announce the contents stands in for the one action — opening the artifact — that would surface the defect.

Models with this behavior: GPT 5.5

Examples:

This GPT-5.5 trace shows the model letting a filename stand in for opening the file. The directory listing surfaces the document and the model logs it in its scratchpad:

tool: directory listing
  output: ... Project Grizzly - Notice of Temecula Lease Transfer (5-28-25).pdf ...

From the title alone, it concludes:

"Temecula currently only has a notice of lease transfer without any assignment or consent"

No read against that file ever appears in the trace. The file is actually an email chain documenting a non-standard transfer process — exactly the condition that justifies flagging Temecula as an open item. The model pattern-matches on the title and misses it, so the open item never surfaces.
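This failure is mechanically detectable from the trace: any file a conclusion leans on should also appear in the read log. The sketch below assumes two hypothetical inputs, a mapping from site to the files its conclusion cites and a list of files actually opened. The filename is the one from the trace above.

```python
def unopened_citations(conclusions, opened_files):
    """Return (site, filename) pairs where a conclusion cites a file that was never opened."""
    opened = set(opened_files)
    return [
        (site, name)
        for site, cited in conclusions.items()
        for name in cited
        if name not in opened
    ]

notice = "Project Grizzly - Notice of Temecula Lease Transfer (5-28-25).pdf"
gaps = unopened_citations({"Temecula": [notice]}, opened_files=[])
print(gaps)
```

An empty result does not prove the file was read carefully, only that it was opened; it is a floor, not a guarantee.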

4.Comes up with flawed heuristics after reading just 1 file

Description: Instead of deriving where the signature lives in each document, the model fixes a page-location rule early, from the first file it inspects, and reuses it across every PDF. Real data-room PDFs don't share a layout: an execution page lands on page 4 in one file, page 70 in another, or on an image-only flat scan with no predictable offset. A rule that worked on the first file silently fails on the rest. The orchestration then runs open-loop: later renders are aimed at the pages the heuristic predicts, not the pages the specific document actually needs. The model then either misses the signature page outright (and writes blank) or reports the typed name on the page it happened to land on as a signature (over-claim). The tell is that render targets stay roughly constant while document structure varies.

Models with this behavior: Sonnet 4.6, Opus 4.7
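The tell described above (constant render targets, varying documents) can be checked directly on a render log. This is a rough sketch: the log shape, the filenames, and the strict "identical page sets" criterion are assumptions, and a real detector would likely tolerate near-matches.

```python
def looks_open_loop(render_log, min_docs=3):
    """Flag a run that renders the same page numbers in documents of different lengths.

    render_log: {filename: (page_count, [rendered page numbers])}
    """
    if len(render_log) < min_docs:
        return False  # too few documents to call it a pattern
    page_counts = {count for count, _ in render_log.values()}
    target_sets = {tuple(sorted(set(pages))) for _, pages in render_log.values()}
    return len(page_counts) > 1 and len(target_sets) == 1

# Hypothetical logs: one run reuses pages 4 and 5 everywhere, the other adapts.
fixed_rule = {"a.pdf": (6, [4, 5]), "b.pdf": (75, [4, 5]), "c.pdf": (23, [5, 4])}
adaptive = {"a.pdf": (6, [4, 5]), "b.pdf": (75, [70, 71]), "c.pdf": (23, [5, 11])}
print(looks_open_loop(fixed_rule), looks_open_loop(adaptive))
```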

4.Top-Performing Rollout

An annotated walkthrough of the best agent run on a due-diligence task (Opus 4.7), showing how it planned, explored the file tree, read signatures as images instead of text, delegated to subagents with the rubric baked in, reconstructed the full site list, recovered from errors, and verified its own outputs before declaring done.

4a.Annotated Walkthrough

1.It planned before touching anything

[2] ASSISTANT -> todowrite:
  1. Locate Grizzly data room and relevant site documents   [in_progress]
  2. Extract consent and assignment signer status per clinic [pending]
  3. Create two-tab XLSX schedule in data room               [pending]
  4. Draft under-250-word DOCX cover memo in data room        [pending]
  5. Verify generated files and summarize deliverables        [pending]

Commentary: First action is a plan that mirrors the deliverable contract (two tabs, <250 words, "to the data room") — and crucially includes a verify step at the end. This decomposition is what keeps it from drifting later.

2.It explored the tree top-down instead of guessing paths

[4-12] read /app/workspace -> input/ -> Due Diligence/ -> (00..10 folders)
[9]    glob **/* over "Due Diligence"
[12]   read "10. Closing Docs" -> Consents/, Wire Instructions/

Commentary: It discovers structure empirically rather than assuming where consents live. This is what lets it later find that the Consents folder only covers part of the site universe.

3.The decisive move — it read signatures as images, not text

[13] pdfs_read_pdf_pages -> "[pdf: pages=23, read=23] [images: count=15]"
[16] ASSISTANT(think): "...the signatures are likely in those images.
    I need to inspect them visually... The signatures are indicated on
    pages 5 and 11, but there's no direct indication of them in the text."
[16] pdfs_read_page_as_image(page_number=5)

Commentary: This is the single behavior that separates the 81.25% run from the pack. The typed signature blocks say who is supposed to sign; only the rendered image reveals who actually did (or left it blank). Most failing runs trusted the text and mislabeled executed vs. blank. This run recognized the gap and switched modality.

4.It baked that lesson into parallel subagents

[22] task x3 (parallel): "Review these PDF files ... extract ... which parties
    ACTUALLY signed ... Pay special attention to signature pages/images, not
    just typed names. Use 'blank' if doc exists but party signature blank,
    'n/a' if doc not on file. Do not modify files."
[19] subagent result: "Reviewed signature-page images, not just typed
    signature blocks. | ...Baldwin Park... | blank | Geoff Sundstrom | ...
    Landlord signature block for Robin Levander/Ramona Star LLC is blank..."

Commentary: It fanned 13 PDFs across 3 subagents to parallelize, but encoded the exact rubric semantics into the delegation prompt — the blank vs n/a distinction, the "actually signed" framing, and read-only safety ("Do not modify files"). The subagents returned in the precise column schema it needed. This is delegation done right: the orchestrator pushes down the standard, not just the workload.
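The blank versus n/a distinction pushed down to the subagents is small enough to state as code. A minimal sketch of the convention from the prompt follows; the function name is hypothetical.

```python
def signer_cell(doc_on_file, signer_name=None):
    """Encode one signer column using the prompt's three-way convention."""
    if not doc_on_file:
        return "n/a"           # document not on file
    return signer_name or ""   # name if signed, blank if the party did not sign

print(repr(signer_cell(True, "Geoff Sundstrom")),
      repr(signer_cell(True)),
      repr(signer_cell(False)))
```

Checking document presence first means a missing document can never be recorded as a blank signature.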

5.It reconstructed the implicit site universe

[29-36] glob **/*Lease*, read "Real Estate Agreements", grep site names,
       task: "Identify the unique clinic/real-estate sites ... Focus on
       sites with no consent/assignment package so they can be included
       with n/a/open items if needed."
[34] grep "Burbank|Moreno Valley|Temecula|Van Nuys|Hawthorne|Downtown"

Commentary: The prompt never lists the 18 sites. The Consents folder only had ~11–13. The agent cross-referenced the lease folder against the consents folder to find the sites that have no package, then represented them as full n/a rows. This is exactly the COVERAGE-01 requirement (all 18, no extras), which it nailed — and which the lower-scoring runs missed by only listing the 13 sites they had docs for.
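The cross-reference amounts to iterating over the lease-derived site list rather than the consents folder. A minimal sketch, assuming the site list and per-site reads are already extracted: the column names come from the prompt, while the site names, note text, and Open Items category are placeholders.

```python
SIGNER_COLUMNS = [
    "consent_landlord_signer", "consent_assignor_signer", "consent_assignee_signer",
    "assignment_assignor_signer", "assignment_assignee_signer",
]

def build_rows(lease_sites, consent_rows):
    """One row per operating site; sites with no package get n/a in every signer column."""
    rows, open_items = [], []
    for site in sorted(lease_sites):
        if site in consent_rows:
            rows.append({"site": site, **consent_rows[site]})
        else:
            blanks = {column: "n/a" for column in SIGNER_COLUMNS}
            rows.append({"site": site, **blanks, "notes": "No consent package on file"})
            open_items.append({
                "site": site,
                "category": "Missing document",  # placeholder category label
                "description": "No consent or assignment on file",
            })
    return rows, open_items

# Placeholder sites: A has a package, B does not.
rows, open_items = build_rows(
    {"Site A", "Site B"},
    {"Site A": {"consent_landlord_signer": "", "notes": "Landlord block unsigned"}},
)
print(rows, open_items)
```

Driving the loop from the lease list is what keeps no-package sites in the schedule; driving it from the consents folder would silently drop them.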

6.It recovered from sandbox/path errors instead of stalling

[56] create_spreadsheet -> "Internal error: Resolved path '/input/...' is
    outside the sandbox root '/app/workspace'"
[58] create_spreadsheet(directory="/app/workspace/input/...") -> success
[60] edit_spreadsheet -> "Invalid operations payload: 12 validation errors"
[62] edit_spreadsheet(corrected ops) -> "operations_applied: 14"
[70] documents_delete_document(the duplicate created at /app/workspace/app/...)

Commentary: Three separate tool failures (absolute-path rejection, malformed edit payload, accidental nested-path duplicate doc), and it self-corrected each on the next turn — including deleting the stray duplicate file it created. No spiral, no giving up.
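The first and third errors are two sides of the same path-handling problem: an absolute path outside the sandbox is rejected, while blindly prefixing the root onto an already-rooted path creates a nested duplicate. The sketch below shows one way to re-root paths safely. The root comes from the error message; the function itself is illustrative and not part of the tool suite.

```python
from pathlib import PurePosixPath

SANDBOX_ROOT = "/app/workspace"  # root named in the tool's error message

def resolve_in_sandbox(path, root=SANDBOX_ROOT):
    """Map a path into the sandbox root without double-prefixing or escaping it."""
    root_path = PurePosixPath(root)
    candidate = PurePosixPath(path)
    if ".." in candidate.parts:
        raise ValueError(f"refusing path containing '..': {path}")
    if candidate.is_relative_to(root_path):
        return str(candidate)  # already inside the sandbox, leave untouched
    if candidate.is_absolute():
        candidate = candidate.relative_to("/")  # drop the leading slash before re-rooting
    return str(root_path / candidate)

print(resolve_in_sandbox("/input/Due Diligence"))
print(resolve_in_sandbox("/app/workspace/input/schedule.xlsx"))
```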

7.It actually executed its own verify step

[77-84] read dir, list_tabs (-> Schedule + Open Items), read_tab A1:G19,
       read_tab A1:C38, read_document_content (memo: "Paragraphs: 1")
[87] "I verified the workbook has the required two tabs and exact requested
    schedule/open-items column structure, and the cover memo is one
    paragraph under 250 words."

Commentary: It re-opened its own outputs to confirm tab names, the exact 7-column / 3-column schemas, and the memo's paragraph/word constraints — closing the loop on todo #5 rather than declaring victory blind.
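The same closing check can be written as a small validator over the outputs once they have been read back. This sketch takes already-extracted header rows and memo text as inputs, since how they are read depends on the tooling. The column lists and the 250-word limit come from the prompt; everything else is an assumption.

```python
SCHEDULE_COLUMNS = [
    "site", "consent_landlord_signer", "consent_assignor_signer",
    "consent_assignee_signer", "assignment_assignor_signer",
    "assignment_assignee_signer", "notes",
]
OPEN_ITEMS_COLUMNS = ["site", "category", "description"]

def verify_deliverables(tabs, memo_text, max_words=250):
    """tabs: {tab name: header row}. Returns a list of problems; empty means all checks pass."""
    problems = []
    if set(tabs) != {"Schedule", "Open Items"}:
        problems.append(f"unexpected tabs: {sorted(tabs)}")
    if tabs.get("Schedule") != SCHEDULE_COLUMNS:
        problems.append("Schedule columns do not match the requested schema")
    if tabs.get("Open Items") != OPEN_ITEMS_COLUMNS:
        problems.append("Open Items columns do not match the requested schema")
    paragraphs = [p for p in memo_text.split("\n\n") if p.strip()]
    if len(paragraphs) != 1:
        problems.append(f"memo has {len(paragraphs)} paragraphs, expected 1")
    if len(memo_text.split()) >= max_words:
        problems.append("memo is not under the word limit")
    return problems

good_tabs = {"Schedule": SCHEDULE_COLUMNS, "Open Items": OPEN_ITEMS_COLUMNS}
print(verify_deliverables(good_tabs, "One short paragraph for the deal lead."))
```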

What it got wrong (the one miss — for honesty)

The Temecula row correctly said the consent was missing, but the note only said "only a lease-transfer notice was identified." BIZ-02 wanted the reason it's stalled (landlord changed / building sold / new owner → re-application needed). It surfaced the what but not the why — the only point it left on the table.

4b.What the agent did well — the rollout profile we want to reward

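This run is a clean example of the skills that should transfer through RL. Rewarding trajectories like this reinforces the behaviors in the walkthrough: planning against the deliverable contract, exploring the file tree empirically, reading signatures as images rather than text, delegating with the rubric semantics attached, reconstructing the full site list, recovering from tool errors, and verifying outputs before declaring done.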

The throughline worth rewarding: the agent optimized for the underlying business truth (actual execution state, complete site coverage, counsel-ready exceptions) rather than the surface artifact — and it verified itself before claiming completion.

5.Why labs should hillclimb here

The global deal close-out layer (deal associates, the external counsel they retain, and the audit associates downstream) runs into the tens of billions in standalone compensation across ~$2T of mid-market PE AUM. The associate work tested here (the document-by-document reconciliation) is one of the most time-consuming steps in a deal; automating it frees deal teams to interpret findings and negotiate terms rather than comb through paperwork. A model that reliably reaches 0.8+ consent_schedule reward compresses associate hours per task from 6–8 to under 1. At a $3B-AUM fund, that touches roughly $15–20M of annual close-out spend.

The capability gap here is in tool-orchestration discipline (whether to keep searching past the first plausible hit, and which revised figure to commit to), output commitment (reading a signature correctly, then writing n/a into that cell anyway), and self-verification (cross-checking the text layer against the image layer). All three are RL-tractable on the rubric. Each is gradeable near-exactly: binary pass/fail against the oracle, a high-agreement GPT 5.5 judge on the signature cells, and a −1.0 penalty on any false-positive signature.
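
To make the asymmetry of that penalty concrete, here is a minimal sketch of how a score over signature cells could be computed. The −1.0 value and the binary pass/fail against an oracle come from the text; how the components are actually combined is not specified, so the aggregation below (mean of binary passes, with the penalty applied once per false-positive cell) is an assumption, as is the function name `score_signature_cells`.

```python
# Hypothetical scoring sketch. Only the binary pass/fail idea and the -1.0
# false-positive penalty come from the text; the aggregation is assumed.
FALSE_POSITIVE_PENALTY = -1.0


def score_signature_cells(predicted, oracle):
    """Score predicted signature cells against oracle cells.

    predicted / oracle: dict mapping cell id -> True (signed) or False (not signed).
    A cell missing from `predicted` counts as incorrect but not as a false positive.
    """
    if not oracle:
        raise ValueError("oracle must contain at least one cell")

    # Binary pass/fail per cell, averaged over all oracle cells.
    correct = sum(1 for cell, truth in oracle.items() if predicted.get(cell) == truth)

    # A false positive is claiming a signature the oracle says is absent.
    false_positives = sum(
        1 for cell, truth in oracle.items()
        if predicted.get(cell) is True and truth is False
    )

    return correct / len(oracle) + FALSE_POSITIVE_PENALTY * false_positives
```

Under these assumptions, a single invented signature costs more than the run can earn from the other cells, which is the intended incentive: an n/a is recoverable by counsel, a false "signed" is not.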

The same retrieve-render-verify-reconcile loop is the substrate of nearly every document-heavy enterprise function (finance, compliance, claims, legal), wherever embedded images or non-OCR-able scans complement the text layer. A model that does it reliably on consent_schedule transfers across all of them. Take medical billing, for example: the diagnosis code extracts straight from the claim text, but whether it is substantiated sits in a scanned chart. In mortgage underwriting, the borrower's stated income extracts cleanly from the application text, but the figure that actually qualifies the loan sits in a scanned paystub. Roughly $200B+ in annual analyst-to-associate document work spans these verticals, but the gap is trust: the worry that AI will get trivial things wrong, as shown here. Solving this is a prerequisite to working reliably in PDF-dominated enterprise workflows at all.

Five RL targets:

  1. Neighbor-page expansion. Train the model to sweep adjacent pages when search returns a hit, since image-only pages never match a keyword and have to be reached structurally rather than by search.
  2. Exact transcription. Train the model to record the retrieved value exactly as it appears, without truncation or substitution.
  3. Vision-tool invocation. Highest-leverage target and the one most models skip. Train the model to render the relevant page and to treat absence of text as uninformative rather than as evidence the content isn't there.
  4. Attribution and discrepancy reasoning. Train the model to keep correctly-read items distinct, assign each to its record, and surface genuine mismatches rather than normalizing them away.
  5. Delegation discipline. Train the model to carry the method into a sub-agent's prompt: render the page, the schema, the convention, rather than delegating vaguely.
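
Targets 1 and 3 can be illustrated together. The sketch below is a hedged example of the behavior, not the harness's tooling: `pages_to_inspect`, its parameters, and the one-page default radius are all assumptions. Given the pages a keyword search hit, it widens to adjacent pages and separates out the ones whose text layer is empty, which are the pages that need to be rendered and read visually.

```python
# Hypothetical sketch of neighbor-page expansion plus a render decision.
# The function name, inputs, and default radius are assumptions.
def pages_to_inspect(hit_pages, page_texts, radius=1):
    """Expand search hits to neighboring pages and split by text availability.

    hit_pages: 0-based page indices returned by keyword search.
    page_texts: extracted text layer per page ('' for image-only pages).
    Returns (pages_to_read, pages_to_render), both sorted.
    """
    page_count = len(page_texts)

    # Sweep `radius` pages on either side of every hit, clamped to the document.
    expanded = set()
    for hit in hit_pages:
        for page in range(hit - radius, hit + radius + 1):
            if 0 <= page < page_count:
                expanded.add(page)

    # An empty text layer is uninformative, not evidence of absence:
    # those pages go to the vision tool instead of being skipped.
    to_render = sorted(p for p in expanded if not page_texts[p].strip())
    to_read = sorted(expanded - set(to_render))
    return to_read, to_render
```

For a four-page file where search hits only page 1 and page 2 is an image-only signature page, the call returns pages 0 and 1 to read and page 2 to render, a page that keyword search alone would never have surfaced.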

Appendix

Compute cost per rollout

Per-rollout wall time and token usage on capability-only single rollouts, no judge work. ToolEnv adds overhead because each tool call is a roundtrip. Sonnet and Opus iterate on tool outputs, so they take more turns and accrue higher input tokens.

Compute cost per rollout (per-flagship)
Mean USD spend per single consent_schedule attempt (n=16 per model).

  Model             Mean cost (USD)
  Opus 4.7          $5.00
  Opus 4.8          $4.19
  GPT 5.5           $3.19
  Gemini 3.1 Pro    $1.57
  GLM 5.1           $1.46
  Sonnet 4.6        $1.42
  GPT 5.4-mini      $0.50
  MiniMax M2.7      $0.19

Median output is ~12–25k tokens in 2–7 minutes of wall time. GPT 5.5 is the heaviest at 25k median output tokens because it tends to write long, methodology-first memos under tool orchestration. GLM 5.1 is harness-blocked (ToolEnv compat). Per-rollout judge cost is ~$0.35–0.60 cold and is cached on replay.