2026

The FDE Fork: Platform or Outcomes

In If LEGO Had Forward Deployed Engineers, I ended with a wrinkle I promised to write up properly: AI keeps handing FDEs new bricks, the line between "forward deployed engineer" and "software engineer" is blurring, and at some point you have to ask whether you even need to productise the dragon at all.

This is that piece. It's now a real choice. There are two coherent ways to run a forward deployed company, AI made the second one viable, and a founder who hasn't consciously picked one is going to build a confused org with a confused FDE role. The nature of the FDE role isn't a fixed thing you can look up. It's downstream of a strategic decision most founders don't realise they're making.

Engineers building messy tools, a polished platform emerging behind them

Foundry was an accident

Start with where the role comes from, because the origin explains everything that follows.

Nobody at Palantir sat in a room and decided to build Foundry. Here's how I described it on a call last year:

The way Foundry as a product actually happened is very interesting. No one said, "Oh, let's build Foundry." It was literally forward deployed engineers working with customers, almost a consulting shop. And the difference between an FDE and a consultant is the alignment: we get paid to solve problems, not to spend hours solving them. Given they were engineers, what they would do is build tools to bootstrap themselves — mainly for the data integration piece. And some customers noticed this and said, "If you just license the tools, we'll pay licence fees for them." That's how Foundry happened. A couple of FDEs went away and said, "We're going to take a few months and build Foundry."

So the product and the FDE motion were entangled from day one. The FDEs weren't there to deliver Foundry. They were there to solve customer problems, and Foundry fell out of the residue — the tools they kept rebuilding to make themselves faster.

That's worth holding onto, because it means the FDE role was never defined by the platform. The platform was defined by the FDEs.

The 2015–2020 thesis: forward deployment in service of a platform

For most of the decade that followed, though, the relationship ran the other way. Once Foundry existed, the FDE motion had a job: feed the platform.

This is the role I described with the LEGO dragon. The customer wants a dragon, there's no brick for the curve of its neck, so the FDE drills holes and glues bricks and builds a Frankenstein scaffold and then builds the dragon. The deliverable to the customer is the dragon. The deliverable to your own company is the bag of weird bricks — the custom hacks you walk back to the product team so they can decide what to manufacture. That loop, customer problem → bespoke build → product signal → real product, was the whole game. The FDE was a product R&D function dressed up as a delivery function.

And the thing that made that motion sustainable — that justified years of low-margin, labour-heavy services work — was the ambition behind it. Palantir wasn't running a consultancy that happened to write software. It was building the operating system for the world's largest and most important institutions, and the services were the cost of discovering the shape of that operating system. You can't find the shape of a platform from a conference room. You find it by embedding engineers in twenty messy customers and seeing which weird bricks keep showing up.

You can see the platform doing its job in the staffing numbers. Early on, a single use case took something like three to five FDEs. By a few years in, the ratio had inverted — one FDE could carry two or three customers, because the platform had absorbed enough of the weird bricks that each new deployment needed less hand-building. The role was, by design, bending towards its own obsolescence. Every brick you productised was a brick the next FDE didn't have to drill.

That's the tell. In the platform model, a healthy FDE org is one that slowly needs fewer FDEs per dollar. The role is a scaffold. The building is the product.
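To put rough numbers on that inversion, here is a back-of-envelope sketch. It uses only the approximate figures quoted above, and it assumes (my simplification, not a measured fact) that one use case corresponds to one customer deployment.

```python
from fractions import Fraction

# Rough figures from the text. Assumption: one use case == one customer deployment.
early_fdes_per_deployment = (Fraction(3), Fraction(5))        # three to five FDEs per use case
later_fdes_per_deployment = (Fraction(1, 3), Fraction(1, 2))  # one FDE spread over three or two customers


def improvement_range(early, later):
    """Return the (smallest, largest) factor by which FDEs per deployment fell."""
    smallest = min(early) / max(later)
    largest = max(early) / min(later)
    return smallest, largest


low, high = improvement_range(early_fdes_per_deployment, later_fdes_per_deployment)
print(f"FDEs per deployment fell by roughly {low}x to {high}x")
```

On those assumptions the hand-building needed per deployment dropped by somewhere between 6x and 15x, which is what "bending towards its own obsolescence" looks like in headcount terms.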

A LEGO tower with its scaffolding being removed

What AI changes

Now the wrinkle.

The reason the platform model made sense wasn't just ambition, it was economics. Services scale linearly with headcount and carry consultancy margins. The only way to escape that gravity was to productise — to convert hand-built dragons into licensable bricks so that, eventually, the customer's own engineers build dragons on top of your platform while you collect licence fees. Productisation was the only exit from the margin trap.

AI weakens that constraint. When a single engineer with Claude Code, Skills, MCP servers, and a stack of internal agents can build a credible dragon in an afternoon, the cost of bespoke delivery collapses. And once bespoke delivery is cheap enough, you can run a services business at margins that used to require a product.

One engineer building a dragon, helped by automated brick machines

This is the "AI-powered services" thesis that Sequoia and YC have been talking about — services companies with software economics. The mechanism is exactly the brick library getting powerful enough that you no longer need to manufacture and sell bricks to make the unit economics work. You just keep the brick-shaping machine internal, point your FDEs at it, and sell the dragons.

So a second model becomes available, and notice it's not new: it's the original model with the economics fixed. Palantir started as "almost a consulting shop." The reason it couldn't stay one was margins. AI is, in effect, an offer to remove that reason.

The fork

Which gives founders a genuine fork. Two coherent strategies, and you have to pick.

A LEGO road forking towards a platform on one side and a delivery workshop on the other

Path A — Platform. You productise. The FDE is a product scout. The compounding asset is the product, and you sell licences. The bet is that the weird bricks generalise — that the abstraction you extract from twenty customers is good enough that the twenty-first buys the platform instead of the service. You also have to be able to survive the valley: the years of unprofitable services before the licence revenue compounds. This is the Palantir-to-Foundry path, and it works when the abstraction is real and the market is large enough to be worth the wait.

Path B — Outcomes. You don't productise — or rather, you productise internally and only internally. You build the brick-shaping machine, the agents, the deployment tooling, and you never sell any of it. The FDE is a delivery superpower wielding private tooling no competitor can buy. The compounding asset is that internal toolchain plus the accumulated muscle memory of having deployed into a hundred messy environments. You sell outcomes, priced as outcomes. The bet is that AI keeps your margins healthy enough that you never need the licence-fee exit at all.

The honest tension between them is the one I raised in the LEGO piece. I argued there that walking the weird bricks back to product is "the step that makes it engineering" — skip it and you're just a very expensive consultancy in a t-shirt. Path B looks, from the Path A vantage point, exactly like collapsing that tension and building Accenture.

But I don't think that's quite right anymore. On Path B you still walk the weird bricks back; you just walk them back to your own internal platform team instead of to a product you'll sell. The loop still runs. It's just that the flywheel is your internal capability, and it compounds without ever being packaged, priced, documented, or supported for an external buyer. Whether that's a worse flywheel or a better one is genuinely open. It's worse because you forgo licence-fee leverage and the discipline that selling a product imposes. It's better because you skip the brutal productisation tax (the years spent making a thing general, supportable, and sellable) and you keep your best tricks proprietary.

That's the fork: two different theories of where the compounding asset lives, in a product you sell or in a capability you hoard.

What the fork does to the FDE role

Here's why this matters for the role specifically, and not just the cap table.

On Path A, the FDE is a product scout, and that has hard consequences. The incentives have to live at the company level — revenue per forward-deployed person across the whole company, never revenue per engagement. Measure engagements and your FDEs quietly become account managers: they optimise for charging more for each dragon and stop bringing back the bricks. And the role bends towards obsolescence, on purpose. The honest thing I'll say here: I left Palantir partly because I couldn't find an FDE role I'd still enjoy. The platform had matured enough that the discovery work — the actual reason I liked the job — had thinned out. That's not a failure of the model. That's the model working. On Path A, the role is supposed to eat itself.

On Path B, the opposite. There's a line I keep coming back to: when you don't have a product, the FDE is the product. On Path A that's a phase — true in the early days, less true every year. On Path B it's the steady state. The FDE never becomes a scaffold for something else, because there is nothing else; the forward deployed engineer, augmented by internal tooling, is the entire company. The role doesn't sunset. The risk is different and real: without the discipline of an external product to feed, FDEs can drift into pure delivery, and "we shape bricks internally" decays into "we don't shape bricks, we just bill." Path B without a strong internal platform culture really does become Accenture-with-better-margins.

This, by the way, is why "FDE" has become such a confused title. It's become a big fat umbrella — solutions engineer, solutions architect, the Palantir thing, all crammed under one acronym — and people complain it means different things to different people. It does. But a lot of that confusion isn't sloppy language. It's that the companies using the title haven't decided which fork they're on. A Path A FDE and a Path B FDE genuinely are different jobs, with different incentives, different career arcs, and different definitions of success. Of course the word means different things. The companies do.

The choice founders have to make

So the instruction is simple, even if the decision is hard: pick.

The Palantir FDE motion was sustainable because the ambition carried it — the operating system for the world's largest institutions was a vision big enough to justify a decade of unprofitable services. If you're running Path B, you can't borrow that vision, because you're explicitly choosing not to build the sellable operating system. You need your own sustaining story and your own scoreboard: outcomes delivered, margin per FDE, the rate at which your internal tooling makes the next deployment cheaper. Those are different KPIs than "licence revenue" and they reward different behaviour.

What you cannot do is stay ambiguous. An org that hires Path A product scouts, measures them on Path B engagement outcomes, and tells investors a platform story while running a services business will tear its FDE role apart. The FDEs will feel the contradiction first — they always do — and the best ones will leave, because the role they were sold isn't the role the incentives are paying for.

The FDE role was never one fixed thing. It's a function of strategy. In 2015 the strategy was "find the shape of the platform," and the role was a product scout. The strategy could now just as legitimately be "sell outcomes forever, keep the machine internal," and then the role is a permanent, AI-amplified delivery superpower. Both are real companies. Both can be great companies.

AI is what made the second one viable. It didn't make the choice for you. Pick on purpose.

If LEGO Had Forward Deployed Engineers

Forward Deployed Engineer is having a moment. Anthropic and OpenAI are pouring billions into the model. Lots of articles are getting written about it. Founders are spinning up "FDE" titles before they've really worked out what the role does. And in nearly every conversation I have, whether with founders, recruiters, or curious engineers, the same question comes up: what actually makes an FDE different from a really good consultant?

Here's my attempt at explaining the role with a thought experiment.

Messy Lego

The customer wants a dragon

Imagine LEGO decides to spin up a Forward Deployed Engineering org tomorrow. The first customer walks in and says: I want a dragon.

You can solve that request two ways.

The solutions engineer path. A solutions engineer at LEGO has a beautifully organised inventory of every brick the company makes. They read the dragon brief, pick the right 287 pieces, write a clean instruction booklet, maybe even build the model themselves, and hand it over. The customer looks at it and goes, "Hmm — kind of looks like a parrot, but yeah, I can see the dragon. Thanks." Successful delivery. On to the next request.

The forward deployed engineer path. An FDE starts the same way, picking through the existing brick library — but they get stuck. There's no brick that gives them the right curve for the dragon's neck. The wing pieces don't articulate the way they need to. So they grab a brick, drill a hole through it, glue two together, sand a third one flat. They build a Frankenstein scaffolding of custom-modified pieces, and then they build the dragon.

Solutions Engineer vs FDE

The customer looks at the FDE's dragon and says, "Actually, I wanted a Lord of the Rings dragon, not a Zog from Julia Donaldson." Fine. The FDE iterates. More custom hacks. Eventually they hand over a dragon the customer actually loves.

The end state for the two paths looks identical: both delivered a dragon. But the FDE has one more step, and that step is the entire point of the role.

The step that makes it engineering

The FDE walks back to the LEGO brick R&D team holding a bag of weird, hacked-together bricks and says: "These are the pieces I had to invent to build that dragon. We don't make any of them. Should we?"

The product team looks at those custom bricks and decides what to do. Maybe they manufacture one of them as a new SKU. Maybe they don't manufacture any specific brick, but the patterns suggest they should build a machine that lets customers shape their own bricks. Maybe they conclude the dragon use case itself isn't worth investing in, but the technique unlocked something else entirely.

That loop — customer problem → bespoke build → product signal → real product — is the whole game. Without it, you are simply a very expensive consultancy in a t-shirt.

This is where the "E" in FDE earns its keep. A forward deployed engineer has to be technical enough to actually drill the hole, glue the bricks, build the thing. Otherwise the signal that comes back is mush. "The customer wanted a dragon and we couldn't build one" is useless. "The customer wanted a dragon, and the only way I could approximate it was by violating the structural integrity of these four standard bricks in this very specific way" is a product roadmap.

Weird Brick

Misaligned incentives, by design

The dirty secret of running a healthy FDE org is that the incentives between FDEs and product teams are deliberately misaligned.

The FDE wakes up every morning thinking: how do I win this customer? What do I have to violate, hack, or hand-build to ship the dragon they want? The product team wakes up every morning thinking: what's the general abstraction that lets us serve a thousand customers without hand-building anything?

Those two incentives pull against each other constantly, and the tension is genuinely uncomfortable. It's also where good product comes from. Collapse the tension by making everyone think like a product manager, and you stop getting customer signal. Collapse it the other way, by making everyone think like a delivery engineer, and you build Accenture. The job of leadership is to hold the rope taut.

One important corollary: FDE KPIs have to live at the company level, not the engagement level. The moment you start measuring revenue per engagement, your FDEs stop being product scouts and start being account managers. They optimise for charging more for the dragon. They stop bringing back the weird bricks. The healthier metric is something like revenue per forward-deployed person, company-wide — because the win isn't the FDE working harder on each engagement, it's the FDE identifying which bricks are worth productising, so that eventually customers' own engineers end up building dragons on top of those bricks while you collect licence fees. That's the flywheel. Engagement-level metrics short-circuit it.
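A toy comparison makes the difference between the two metrics concrete. Everything below is hypothetical: the `Year` record, the two example orgs, and every number are invented for illustration, and the revenue split into services and licence fees is my simplification of the flywheel described above.

```python
from dataclasses import dataclass


@dataclass
class Year:
    fdes: int                # forward-deployed headcount, company-wide
    engagements: int         # customer engagements delivered
    services_revenue: float  # fees billed for hand-built dragons
    licence_revenue: float   # fees from customers building on productised bricks


def revenue_per_engagement(year: Year) -> float:
    """Engagement-level metric: rewards charging more for each dragon."""
    return year.services_revenue / year.engagements


def revenue_per_fde(year: Year) -> float:
    """Company-level metric: counts all revenue against all forward-deployed people."""
    return (year.services_revenue + year.licence_revenue) / year.fdes


# Hypothetical org whose FDEs maximise each engagement and bring back no bricks.
account_managers = Year(fdes=10, engagements=10,
                        services_revenue=6_000_000, licence_revenue=0)

# Hypothetical org whose FDEs bill less per dragon but feed bricks that now earn licences.
product_scouts = Year(fdes=10, engagements=10,
                      services_revenue=4_000_000, licence_revenue=5_000_000)

for name, year in [("account managers", account_managers),
                   ("product scouts", product_scouts)]:
    print(f"{name}: {revenue_per_engagement(year):,.0f} per engagement, "
          f"{revenue_per_fde(year):,.0f} per FDE")
```

With these made-up numbers the account-manager org looks better per engagement (600,000 against 400,000) and worse per forward-deployed person (600,000 against 900,000). The engagement-level view can't see the licence revenue at all, which is the short circuit.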

So what is an FDE?

If you take only one thing away: a Forward Deployed Engineer is a product R&D function dressed up as a delivery function. The deliverable to the customer is a dragon. The deliverable to your own company is the bag of weird bricks you had to invent to build it.

If no one is walking back to product with that bag of bricks, you don't have an FDE org. You have a really good services team. Both are valuable, both can be lucrative — but they're different jobs, and conflating them is how organisations end up confused about why their "FDEs" don't seem to be moving the product forward.

There's a wrinkle worth flagging, which I'll write up properly another time: AI keeps handing you new bricks. Claude Code, Skills, MCP servers, agent frameworks — the brick library itself is now powerful enough that a single engineer can build a credible dragon in an afternoon. As the bricks get more capable, the line between "forward deployed engineer" and "software engineer" starts to blur, and you have to ask whether you even need to productise the dragon at all, or whether the right move is to just keep delivering bespoke ones forever. That's a whole other topic.

For now, the LEGO test is enough: drill the brick, build the dragon, bring back the brick. That's the job.

What Claude Code's Creator Validated About Forward Deployed Engineering

Last week, Boris Cherny appeared on Lenny's Podcast to talk about Claude Code — the AI coding assistant that now generates 4% of all GitHub commits and helped Anthropic achieve a 200% productivity increase.

Every major insight Boris described — from building for future capability rather than current constraints to finding product-market fit through "latent demand" — perfectly captured what we'd been doing as Forward Deployed Engineers at Palantir for over a decade. It was like listening to someone independently discover gravity.

For seven years at Palantir, I lived the FDE model: small, elite teams embedded directly with customers, building systems ahead of their current capability, discovering requirements by watching how people actually worked rather than what they said they needed. Boris validated that this approach isn't legacy — it's the future of how we'll build AI products.

Building Six Months Ahead

"Build for the model 6 months from now, not today," Boris said. The Claude Code team deliberately avoided over-scaffolding. They gave the model tools and goals and got out of the way, betting on rapid capability improvements rather than constraining the system to current limitations.

This hit me because it's exactly how FDEs approach customer deployments. We don't build systems for where an organization is today; we build for where they'll be in six months. When I was deploying Palantir at large enterprises, the worst mistake was to over-constrain the system to current workflows. The org would evolve, their data would grow, their processes would mature, and suddenly the system we'd carefully tailored to their "requirements" became a straitjacket.

The best FDE deployments gave customers slightly more capability than they could immediately use. We'd build data pipelines that could handle 10x their current volume. We'd create analysis workflows that assumed they'd eventually want to ask more sophisticated questions. We'd design user interfaces that didn't require retraining when their team grew from 5 to 50 people.

This wasn't over-engineering — it was under-constraining. Just like Claude Code bet on the model getting smarter, we bet on the customer getting more sophisticated.

Latent Demand as Product Compass

One of Boris's revelations was that Claude Code's breakthroughs came from watching how people used it in creative ways. Data scientists running SQL queries in terminal windows. Non-technical users asking it to help grow tomatoes. People recovering corrupted wedding photos. This "latent demand" led to Cowork, built in just 10 days because they could see exactly what users were trying to do.

"Latent demand" is just Boris's term for what FDEs do every single day: sit with users and watch how they actually work.

The best features I ever built came from observing workarounds. An analyst who'd manually copy-paste data between systems every morning because the official integration was too rigid. An operations team that kept a separate spreadsheet to track what the official dashboard couldn't show them. An investigator who'd screenshot charts to paste into Word documents because the export function didn't capture what they needed to communicate.

These weren't feature requests — they were organizational antibodies. Users working around the system to get their real job done. Standard product management would survey these users and ask what features they wanted. They'd probably say "better export" or "more integration options." But that's not what they actually needed.

FDEs watch the workflow, not the words. We see the person taking screenshots and realize they don't need better export — they need a way to tell stories with data. We see the manual copy-pasting and realize the issue isn't integration — it's that the official process doesn't match how the work actually flows.

The most successful FDE engagements came from finding latent demand the customer couldn't articulate themselves. Not because they were dumb, but because they were so close to their daily work they couldn't see the pattern.

Ideas Over Engineering Capacity

"Coding is largely solved," Boris said. "The bottleneck is ideas and prioritisation." In the AI era, the new scarcity isn't engineering capacity — it's knowing what to build.

This validates everything FDEs were designed to solve. We were never primarily about coding. Yes, we could write software — often quite quickly — but our real value was understanding the problem deeply enough to know what to build.

The best FDE I worked with at Palantir wasn't the best coder on the team. They were the person who could sit in a room with a counter-terrorism unit and figure out what actually mattered. Who could distinguish between what the organization said it needed and what would actually make their mission successful. Who could see patterns across different deployments and recognize when a specific customer's problem was actually a general case of something we'd solved before.

This person would often write less code than junior engineers on the team. But they'd save months of work by solving the right problem the first time.

Now that AI can generate most code, this skill becomes even more critical. If Claude can write your functions, your value is knowing which functions need to exist.

The Death of "Software Engineer"

"The title 'software engineer' is dying," Boris observed. "Builder is the new reality." As AI democratizes coding, roles blur between engineering, product, design, and deployment.

FDEs were always "builders." Our job description was impossible to write because we did everything: product research, technical architecture, user interface design, data engineering, deployment operations, user training, and ongoing support. We didn't fit neatly into any org chart because the role was defined by outcome, not function.

Traditional software engineering was about implementing specifications. Someone else figured out what to build; engineers built it. But FDE work was always end-to-end ownership. We figured out what to build, built it, deployed it, and lived with the consequences.

This frustrated a lot of people who wanted clean role boundaries. But it was extraordinarily effective for solving complex, novel problems where the solution space wasn't well-defined.

The industry is catching up to what Palantir figured out years ago: when you're building something truly new, you need people who can think across the entire stack — technical, product, and operational. "Builder" is a better word than "engineer" because it captures the full scope.

What This Means for AI Companies

If you're building AI products today, the FDE model isn't legacy — it's your playbook.

Stop organizing around functional silos. Find people who can think across the whole problem, give them access to the best tools, and embed them directly with customers. Build slightly ahead of current capability rather than constraining to current workflows. Watch how people actually use your product, especially the ways they "misuse" it.

Most importantly: understand that your bottleneck isn't engineering capacity anymore. It's product judgment. It's knowing what to build. It's finding the latent demand that customers can't articulate themselves.

The companies that figure this out first will eat everyone else's lunch. Not because they have better AI models, but because they'll build the right things.

Boris Cherny just validated fifteen years of FDE practice. The future belongs to builders who can think end-to-end, work embedded with customers, and see patterns others miss.

The tooling has changed. The principles haven't.

Interviewing in the age of AI

Interviews have always been a bad proxy. You get maybe an hour with someone and you're supposed to figure out whether they'll be effective in a role that plays out over months and years. You can't replicate real working conditions—the codebase they'd actually work in, the team dynamics, the ambiguity of real problems. So you construct artificial scenarios and hope the signal transfers.

That fundamental challenge hasn't changed. However, AI has made the gaps in our proxies impossible to ignore.

The signal problem

The core question in any interview is: can this person actually do the job? Everything else—the whiteboard problems, the take-homes, the system design rounds—is just scaffolding to get at that question indirectly.

But with AI, it's now possible to offload much of the thinking and problem-solving itself, making the assessment even harder.

When someone submits a clean take-home with sensible architecture and thorough tests, one used to be able to assume a baseline of understanding behind it. This is no longer true. Not because the code is bad—it's often excellent. But GitHub Copilot, Claude, and ChatGPT have converged on identical patterns. A few years ago, messy but functional code suggested a real engineer working under pressure. Now, too-perfect code could be the tell, but penalising clean code is obviously absurd.

At the same time, banning AI isn't the answer. I want engineers using AI. It's the most significant productivity tool to hit software engineering in decades, and anyone not using it is leaving value on the table. The question I'm actually trying to answer in an interview is "can you think with AI, or are you just deferring to it?"

Old formats, honest limitations

These interview formats didn't suddenly break. They always had limitations as proxies for real work. AI just made those limitations undeniable.

Long take-home exercises were always a noisy signal. A four-hour project tells you someone can deliver polished work with unlimited resources and no time pressure—which is rarely what the actual job looks like. AI turned the noise up to eleven: now the output mostly tells me the candidate has access to coding tools. Table stakes.

LeetCode-style problems were always testing a narrow skill—pattern recognition and algorithmic recall—that correlates weakly with day-to-day engineering. AI happens to be exceptionally good at exactly this narrow skill, so now I can't even get the weak signal I used to.

Anything with a "correct answer" has this problem. The clearer the specification, the easier it is for AI to solve. Which is ironic—we used to think clear specs made for fair interviews, which they did. They also made for easy prompts.

I'm not saying these formats are worthless. But the signal they produce has shifted from "can this person solve problems?" to something murkier. And rather than trying to salvage them, I'd ask: what formats actually test the thing I care about?

What I'm looking for now

The formats I've been experimenting with share a common thread: they test whether someone can think, not whether they can produce output. AI is great at producing output, but thinking, judging and validating are still human jobs.

Shorter exercises + longer conversations

Instead of a four-hour take-home, a thirty-minute exercise followed by forty-five minutes of discussion. The code is a starting point, not the deliverable.

Why did you structure it this way? What would you change if the requirements shifted to X? Where would this break at scale? What's the ugliest part of this code?

AI can generate code. It can't explain the tradeoffs you considered and rejected. It can't tell me about the moment you started down one path, realised it was wrong, and backed out. And I also just ask directly: how did you use AI? That question alone is surprisingly revealing. Someone who used AI well can articulate what they delegated, what they modified, and what they rejected. Someone who deferred to it entirely tends to get vague.

Those conversations reveal thinking—including whether someone used AI effectively as a tool versus blindly accepting its first suggestion.

Live investigation instead of live coding

Writing code from scratch under interview pressure was always a weird skill to test. It didn't map well to real work even before AI.

Investigation is different. I give candidates a system that's misbehaving—not a syntax error, something behavioural. A race condition. A caching issue. A misunderstood API contract. And yes, they can use whatever tools they want, including AI.
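For a sense of scale, a behavioural bug of the caching kind can be very small. The sketch below is purely illustrative and is not an exercise from my actual interviews; `PriceService` and everything in it are invented names.

```python
class PriceService:
    """Hypothetical service: reads go through a cache, writes forget to update it."""

    def __init__(self):
        self._store = {"widget": 100}  # source of truth
        self._cache = {}               # read-through cache

    def get_price(self, sku):
        # Populate the cache on first read, then always serve from it.
        if sku not in self._cache:
            self._cache[sku] = self._store[sku]
        return self._cache[sku]

    def set_price(self, sku, price):
        # Behavioural bug: the store is updated but the cached entry is never invalidated.
        self._store[sku] = price


service = PriceService()
before = service.get_price("widget")  # first read warms the cache
service.set_price("widget", 120)      # the write succeeds without error
after = service.get_price("widget")   # the read is still served from the stale cache
print(before, after)
```

Nothing crashes and every call returns a plausible value, so the only symptom is that a price change "doesn't take". Noticing that reads and writes follow different paths is the investigation.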

What I'm watching isn't whether they can find the bug. Claude Code can find bugs. What I'm watching is everything around the finding: how they scope the problem, what questions they ask before touching the code, which hypotheses they form first, what they choose to validate versus take on faith.

The person directing the investigation matters more than the investigation itself. A strong engineer will use AI to speed up the search but still decide where to search. They'll sanity-check the AI's suggestion against their own understanding rather than blindly applying a fix. They'll know when the tool is confidently wrong.

Someone who's genuinely thinking will say things like "that can't be the issue because X" or "let me verify this assumption first." Someone who's outsourced the thinking will paste the error into a chat window and accept whatever comes back.

Watching someone use AI well during an interview is actually one of the strongest positive signals I've found. When a candidate uses AI to quickly test a hypothesis, then critically evaluates the result and adjusts course — that's exactly the workflow I want to see on the job.

System design with real constraints

AI is great at generating architecture diagrams and textbook answers. It's less great at navigating the messy reality of your specific situation.

When I ask about system design, I focus on constraints. What if we need to support 10x the traffic? What if the team is two people? What if we need to ship in four weeks? What if this has to run in air-gapped environments?

Good engineers make different choices in different contexts. They can explain why this context changes the answer. Someone who's genuinely thinking—whether or not they used AI to explore options—will navigate these pivots fluidly. Someone who's outsourced the thinking will flounder when the constraints shift.

Roleplay scenarios

This is the most experimental—but possibly the most promising.

FDE and customer-facing engineering roles need skills that are fundamentally about human judgment: real-time conversation, reading the room, managing frustrated stakeholders, diagnosing problems under pressure.

I've started using roleplay scenarios where I play a customer with a problem, and the candidate has to figure out what's actually wrong—not what I say is wrong.

Concrete example: The Broken Dashboard

Here's the shape of an interview I've been running lately.

I play a frustrated stakeholder—a senior executive at a large organisation. Something is wrong with a dashboard that feeds into a critical business process. The numbers don't match what another team is reporting. There's time pressure. The candidate's job is to help me figure out what's going on.

The scenario is designed so that the obvious explanation ("the dashboard is broken, fix the code") is wrong. The real root causes are subtler—the kind of thing you'd only uncover by asking careful questions about data sources, definitions, and upstream processes. There's no bug to patch. The discrepancy is fully explainable, but only if you resist the urge to jump to conclusions.

I won't give away more than that—I plan to continue using this interview.

What this tests

Problem diagnosis under pressure. Does the candidate immediately promise to "fix" the thing, or do they slow down and figure out what's actually happening?

Customer communication. Can they manage a frustrated stakeholder while still asking clarifying questions? Do they resist the urge to commit to solutions before understanding the problem?

Data literacy. Do they think to ask about how the numbers are generated? Or do they assume the system is broken because that's what the customer said?

Ownership. When they figure out the root cause, do they offer a path forward—both for the immediate crisis and the longer-term fix?

Why this works

This isn't a prompt you can give to ChatGPT. There's no code to generate. The "answer" emerges through conversation—through noticing that the stakeholder doesn't actually understand how the system works, through asking the right diagnostic questions, through realising that different teams might be measuring different things.

It tests the thing that actually matters in deployment roles: can you figure out what's really going on, communicate clearly under pressure, and move towards a resolution? Those skills don't change regardless of what tools you're using—and they're the hardest to fake.

What I'm still figuring out

I won't pretend I've cracked this. Some open questions:

Consistency. Roleplay scenarios are harder to evaluate objectively than coding tests. Different interviewers might reach different conclusions about the same conversation.

Fairness. These approaches favour candidates who are comfortable thinking out loud, explaining their reasoning, and engaging in back-and-forth. That might disadvantage candidates who are brilliant but less verbally fluent.

Scalability. A forty-five-minute investigation exercise with live observation doesn't scale like a take-home. You need more interviewers, more coordination.

AI will keep improving. Maybe next year there's an AI that can roleplay its way through a customer scenario. Maybe debugging exercises become as compromised as LeetCode. I expect to keep iterating on this—the target is always moving.

The goal hasn't changed

I want to hire people who can do the job. The job now includes using AI effectively—so I'm not designing interviews to exclude AI. Instead, I'm designing interviews that reveal whether someone can think, whether or not they have AI in the room.

The best engineers I work with use AI constantly. They also know when to override it, when to dig deeper, when the AI's confident answer is confidently wrong. That judgment is what I'm interviewing for.

I think interviews are heading somewhere interesting. The formats that survive will be the ones that test what AI can't fake: genuine understanding, real-time judgment, and the ability to navigate ambiguity with another human. The interviews of five years from now will look less like exams and more like working sessions — because that's what they should have been all along.

Setting Up Jeeves: What Actually Made My AI Assistant Useful

I've been running a persistent AI assistant for a few weeks now. Named it Jeeves. Here's what I learned getting it to be genuinely useful rather than just a novelty.

The Baseline

I'm using OpenClaw — an open-source framework for persistent AI agents. Out of the box, it gives you a workspace with markdown files for memory, personality, and proactive task scheduling. The agent reads these files each session, so it has continuity. I won't explain the whole architecture here (their docs do that), but the key point is: the foundation is just text files. What you put in them is what makes it work.

GitHub as External Memory

My assistant's workspace lives at ~/.openclaw/workspace. That's fine for running locally, but I wanted:

  1. Version control — if something breaks, I can roll back
  2. Access from anywhere — especially my phone when I'm out

Solution: Jeeves syncs its workspace to a private GitHub repo daily. The morning heartbeat includes:

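# Copy the workspace into the private repo checkout (assumed to be ./private-notes), then commit and push
cp -R ~/.openclaw/workspace/. private-notes/jeeves/
cd private-notes && git add -A && git commit -m "Daily workspace sync" && git push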

Now when I'm at a conference and want to check my assistant's notes on someone I'm about to meet, I just open GitHub on my phone.

Meeting Transcripts with Granola

Most of my work meetings happen over Zoom or Google Meet. I use Granola to capture transcripts — it runs locally and records what's said without needing bot attendees.

The challenge: Granola stores everything locally on my laptop. Jeeves runs separately and needs access to that context.

I built two tools to bridge this:

  • granola-py-client: A Python client for the Granola API. Uses async httpx, Pydantic validation, and can authenticate using the local Granola app's token.

  • granola-archiver: Automated system that polls for new transcripts, formats them as markdown with YAML frontmatter, and commits them to a GitHub repo organised by date (YYYY/MM/YYYY-MM-DD-title.md). Runs on a schedule via launchd.
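The archiver's output layout can be sketched roughly as follows. This is a minimal illustration, not the actual granola-archiver code: the `Transcript` dataclass, its fields, the `slugify` helper, and the frontmatter keys are all hypothetical. Only the `YYYY/MM/YYYY-MM-DD-title.md` layout and the "markdown with YAML frontmatter" format come from the description above.

```python
import json
import re
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath


@dataclass
class Transcript:
    # Hypothetical shape; the real Granola data model may differ.
    title: str
    held_on: date
    attendees: list[str]
    text: str


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def archive_path(t: Transcript) -> PurePosixPath:
    """Build YYYY/MM/YYYY-MM-DD-title.md relative to the repo root."""
    d = t.held_on
    return PurePosixPath(f"{d:%Y}", f"{d:%m}", f"{d:%Y-%m-%d}-{slugify(t.title)}.md")


def render_markdown(t: Transcript) -> str:
    """Render a transcript as markdown with a small YAML frontmatter header."""
    # json.dumps yields a double-quoted string, which is also valid YAML.
    lines = ["---", f"title: {json.dumps(t.title)}", f"date: {t.held_on.isoformat()}"]
    if t.attendees:
        lines.append("attendees:")
        lines += [f"  - {name}" for name in t.attendees]
    else:
        lines.append("attendees: []")
    lines += ["---", "", f"# {t.title}", "", t.text.strip(), ""]
    return "\n".join(lines)
```

Keeping the date in both the directory and the filename means the files sort chronologically whether you browse by folder or search the whole repo.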

Now I have months of meeting transcripts in a git repo. When I need context from a past conversation, Jeeves can grep through them in seconds. "What did we discuss with <redacted client> about pricing?" gets answered in a moment.
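In practice the agent can simply shell out to `grep`, but the lookup is easy to picture as a few lines of Python. A minimal sketch, assuming the transcripts are `.md` files under a single root directory; `search_transcripts` and its parameters are hypothetical names, not part of any of the tools above.

```python
from pathlib import Path


def search_transcripts(root: Path, term: str, context: int = 1) -> list[tuple[str, int, str]]:
    """Case-insensitive search over archived markdown transcripts.

    Returns (relative path, 1-based line number, snippet) per matching line,
    where the snippet includes `context` lines either side of the match.
    """
    needle = term.lower()
    hits = []
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if needle in line.lower():
                start = max(0, i - context)
                end = min(len(lines), i + context + 1)
                snippet = "\n".join(lines[start:end])
                hits.append((path.relative_to(root).as_posix(), i + 1, snippet))
    return hits
```

Because the paths start with the date, the sorted results come back in chronological order, which is usually what you want when reconstructing how a conversation evolved.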

Integrations That Matter

The assistant is only as useful as what it can touch. Here's what actually gets used daily:

Email + Calendar (via gog CLI):

  • Morning briefing pulls calendar and flags urgent unread emails
  • School emails get parsed automatically → calendar events created with both me and my wife as attendees
  • I have notifications off; Jeeves is my notification layer

GitHub Issues as Daily Planner:

  • Each day gets an issue in a daily-planner repo
  • Jeeves drafts tomorrow's plan based on calendar and pending items
  • I check things off throughout the day

Telegram:

  • Primary interface: I message Jeeves like a person
  • It can reach out proactively (heartbeat system)
  • When I'm at an event and need a quick lookup on someone, I just ask

Time Tracking:

  • After client meetings, Jeeves prompts me to log time
  • Points me to the right spreadsheet for each client

A note on access: For anything sensitive, Jeeves has read-only access. It can check my email and calendar. This is intentional.

None of these integrations are complex. They're just the right hooks into how I already work.

What We Changed After a Few Weeks

After running Jeeves for a bit, I did a retro. Some adjustments:

Tone calibration: Early on, it was too enthusiastic. Lots of "Great question!" and affirmation. I updated the personality file to be drier, more direct. I want a sparring partner, not a cheerleader. When I told it "be more like the Wodehousian Jeeves," it briefly started calling me "sir" — had to walk that back.

Proactive but not annoying: The heartbeat system can easily become spam. Key principle: only surface things that need attention. If nothing's urgent, stay quiet. Late night? Stay quiet. I'm clearly busy? Stay quiet. The goal is an assistant that helps when needed and disappears otherwise.
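In OpenClaw these rules live as plain-language instructions in the workspace files, not as code, but the decision they describe is simple enough to write down. A minimal sketch under assumed names: `Item`, `in_quiet_hours`, `items_to_surface`, and the 22:00 to 07:00 quiet window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Item:
    summary: str
    urgent: bool


def in_quiet_hours(now: datetime, start: time = time(22, 0), end: time = time(7, 0)) -> bool:
    """True if `now` falls in the quiet window, which may wrap past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def items_to_surface(items: list[Item], now: datetime, user_busy: bool) -> list[Item]:
    """Return what the heartbeat should send; an empty list means stay quiet."""
    if in_quiet_hours(now) or user_busy:
        return []
    return [item for item in items if item.urgent]
```

The important design choice is that silence is the default: every branch returns nothing unless an item is explicitly urgent and the moment is appropriate.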

School calendar automation: Initially I was manually adding school events. Now any email from the school gets parsed, events extracted, and both parents added to the calendar invite automatically. Small thing, but it removed recurring friction.

Memory hygiene: Daily notes accumulate fast. I added a periodic task for Jeeves to review recent daily files and distill anything important into long-term memory. Raw notes are a journal; curated memory is what matters for continuity.
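The distilling itself is a judgment call left to the model, but selecting which files to review is mechanical. A minimal sketch, assuming daily notes are named `YYYY-MM-DD.md` inside one directory; the naming scheme, `recent_daily_notes`, and the seven-day default are assumptions for illustration.

```python
from datetime import date, timedelta
from pathlib import Path


def recent_daily_notes(memory_dir: Path, today: date, days: int = 7) -> list[Path]:
    """Return daily note files dated within the last `days` days, oldest first."""
    cutoff = today - timedelta(days=days)
    recent = []
    for path in memory_dir.glob("*.md"):
        try:
            note_date = date.fromisoformat(path.stem)
        except ValueError:
            continue  # not a dated daily note, e.g. a long-term memory file
        if cutoff < note_date <= today:
            recent.append((note_date, path))
    return [path for _, path in sorted(recent)]
```

The resulting list is what gets handed to the assistant for review; anything it judges worth keeping is then written into the curated long-term file.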

Explicit boundaries: Jeeves has access to a lot, including my email, calendar, and files. But in group contexts, it shouldn't speak as me or share my private context freely. I added explicit rules: monitor but don't respond without checking with me first. Ask before sending anything external.

What's Next

Things I'm still figuring out:

  • Multi-account support: I have separate Google accounts for personal and work. Currently only personal is connected.
  • Voice interface: For quick capture when I'm walking or driving.
  • Better handoff to sub-agents: Some tasks should spawn a background worker and report back. The plumbing exists but I'm not using it much yet.

The Actual Value

The surprising thing isn't any single capability. It's the compound effect of continuity.

Jeeves knows my kids' school schedule, my client pricing, the context of conversations I had months ago, and what I'm trying to get done today. It doesn't ask me to re-explain things. It builds on what it already knows.

That's the difference between a chatbot and an assistant.


Jeeves runs on OpenClaw + Claude. Named after the original gentleman's gentleman, though at my request, it has stopped calling me "sir."