Tool use is bolted on after the model is trained. The model never actually learned these APIs, so it guesses.
Functional overlap
When tools expose similar schemas, the model blends them and invents parameters that do not exist.
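To make the failure concrete, here is a minimal sketch of two tools with overlapping schemas and a check that catches a blended call. The tool name `refund_invoice` and every parameter name are hypothetical, invented for illustration; the helper `unknownParams` is likewise an assumption, not part of any real tool-calling API.

```typescript
// Hypothetical tool schemas: two tools that share "invoice_id" and differ elsewhere.
type ToolSchema = { name: string; params: readonly string[] };

const tools: Record<string, ToolSchema> = {
  charge_invoice: { name: "charge_invoice", params: ["invoice_id", "amount_cents"] },
  refund_invoice: { name: "refund_invoice", params: ["invoice_id", "reason"] },
};

// Return the argument names in a call that the named tool does not declare.
function unknownParams(tool: string, args: Record<string, unknown>): string[] {
  const schema = tools[tool];
  if (!schema) throw new Error(`unknown tool: ${tool}`);
  return Object.keys(args).filter((key) => !schema.params.includes(key));
}

// A blended call: "reason" belongs to refund_invoice, not charge_invoice.
const blended = { invoice_id: "inv_1", amount_cents: 500, reason: "overdue" };
```

Calling `unknownParams("charge_invoice", blended)` flags `reason` as a parameter the tool never declared, which is the kind of invented argument the slide describes.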
But LLMs are very good at one thing: writing code.
The pivot
They are bad at calling tools. They are great at one thing: writing code.
Calling tools directly
model→charge_invoice({…})
✕ wrong parameter, the task breaks
Writing code
// charge every overdue invoice
const overdue = await db.query("due < now() AND !paid")
for (const inv of overdue) {
  const r = await billing.charge(inv)
  if (!r.ok) await retry(inv)
}
✓ runs, retries, completes
Deterministic: loops, conditions, retries, error handling.
One program does what dozens of brittle tool calls could not.
The tools stay out of the context window.
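As a rough illustration of what "loops, conditions, retries, error handling" can look like in one program, here is a self-contained sketch. The slide does not define `db`, `billing`, or `retry`, so the `Invoice` and `ChargeResult` shapes, the `Charge` function type, and the three-attempt limit are all assumptions made for this example.

```typescript
// Hypothetical shapes; the slide does not define these.
interface Invoice { id: string; paid: boolean }
interface ChargeResult { ok: boolean }
type Charge = (inv: Invoice) => Promise<ChargeResult>;

// Try a charge up to maxAttempts times; resolve true once an attempt succeeds.
async function chargeWithRetry(charge: Charge, inv: Invoice, maxAttempts = 3): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await charge(inv)).ok) return true;
    } catch {
      // A thrown error counts as a failed attempt; fall through and try again.
    }
  }
  return false;
}

// Charge every overdue invoice in order and return the ids that still failed.
async function chargeOverdue(overdue: Invoice[], charge: Charge): Promise<string[]> {
  const failed: string[] = [];
  for (const inv of overdue) {
    if (!(await chargeWithRetry(charge, inv))) failed.push(inv.id);
  }
  return failed;
}
```

The model writes this once; the loop and the retry policy then run as ordinary code, and only the final list of failed ids needs to come back into the conversation.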
So where does all of this run, for hours or even days?
One layer for cost, caching, and compliance on every model call.
Beyond the demo
Each answer leads to the next.
Context windows overflow. Tool calls break. → Have the model write code.
Code must run somewhere. The agent must remember. → A durable execution environment: sandboxes and state.
Now it can reach your tools and internal systems. → Zero Trust and MCP governance.
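One way to picture "durable execution" with state is a step runner that checkpoints each result, so a restarted run skips work it already finished. This is only a sketch under assumptions: `StateStore` and `step` are hypothetical names, and an in-memory `Map` stands in for storage that would have to persist outside the process in a real system.

```typescript
// Hypothetical state store; a real one would persist outside the process.
type StateStore = Map<string, unknown>;

// Run a named step once: reuse the saved result if the step already completed.
async function step<T>(store: StateStore, name: string, run: () => Promise<T>): Promise<T> {
  if (store.has(name)) return store.get(name) as T;
  const result = await run();
  store.set(name, result); // checkpoint the result before moving on
  return result;
}
```

If a later step throws and the workflow is re-run with the same store, earlier steps return their saved results instead of executing again.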
“I know what hyperscalers will look like in 10 years: exactly the same as they do now. I'm looking to Cloudflare to define what the next generation cloud looks like.”