
Writing

Writing a Physics Paper with Claude: What Actually Happened

In a previous post, I documented building a plasma turbulence solver with Claude—3,000 lines of JAX I never read, validated entirely through physics outputs. That post ended with: "we're writing a proper journal paper."

The paper is now live on arXiv. This post covers what happened next: writing the paper itself. The failure modes were completely different.

Current sheets and vortex structures forming in an Orszag-Tang simulation—the kind of physics GANDALF captures.

The numbers sound impressive: 143 Claude Code sessions, 23 GitHub issues closed, 20 pull requests merged. But the headline obscures the real story. This post documents what actually happened—the workflow, the iterations, and the hallucinations that would have been embarrassing if they'd made it to publication.

The Workflow Architecture

The approach that made this possible wasn't magic. It was infrastructure.

GitHub Issues for everything. Each paper section got an issue. Each benchmark got an issue. Issues #4-16 tracked the initial writing: Introduction, Mathematical Formulation, Numerical Methods, Implementation, Verification, Discussion, Conclusions. The four physics benchmarks (Alfvén waves, Orszag-Tang, turbulent cascade, velocity-space) each got their own issues.

23 GitHub issues tracked every section and benchmark.

Section-by-section PRs. Each section was a separate pull request. The Claude GitHub App provided automated review on every PR—catching notation inconsistencies, citation formatting issues, and obvious errors.

Human review issues. After completing initial drafts, I created human review issues (#37-42) for each section. This is where I sat down and actually read what Claude had written. Issue #55 was a final comprehensive review. These reviews were not optional polish.

External AI review. I also ran the draft through Gemini 3 Pro (Issues #53, #58) for a different perspective. Different models catch different errors.

The git history tells the iteration story better than I can:

a03703d Implement turbulent cascade spectrum benchmark (Issue #10)
241b327 Address reviewer feedback on turbulent cascade PR #33
536eba5 Address reviewer feedback on PR #33
fb7d583 Replace synthetic data with real N64 turbulent cascade results
9b590ae Fix critical physics and notation issues in turbulent cascade section
f36db7c Address final reviewer feedback on PR #33
... (14+ iterations on this single PR)

PR #33 for the turbulent cascade section went through fourteen revision cycles before merging. This was not "Claude writes a paper." This was iteration.

What Claude Did Well

Credit where it's due. Claude was genuinely useful for:

Initial drafting. Given mathematical specifications and paper structure, Claude generated coherent first drafts of each section. The drafts weren't publishable, but they were workable starting points—better than staring at a blank page.

LaTeX formatting. Equations, figures, notation consistency, bibliography formatting. The mechanical aspects of scientific LaTeX were handled reliably.

Addressing specific feedback. This is where AI assistance shines. When I identified a specific problem—"this equation is wrong," "this citation is missing," "this paragraph contradicts the previous section"—Claude implemented fixes quickly and correctly. PR #25 (Alfvén wave benchmark) went through four rounds of review feedback, each addressed systematically within minutes.

Literature integration. Given a topic, Claude could find relevant citations and format them properly. It knew the key papers in plasma turbulence.

The Hallucination Problem

Now for the part that matters.

During human review (Issue #55), I found fabricated content that Claude had written with complete confidence. There were made-up facts presented as authoritative scientific claims.

Issue #55: Human review caught fabricated Princeton cluster claims, false timelines, and invented benchmarks.

Fabricated benchmark timings:

"These timings were obtained on identical 80 GB A100 nodes on Princeton's Stellar cluster to ensure an apples-to-apples comparison."

Every benchmark in this paper ran on my M1 MacBook Pro. I have never had access to Princeton's Stellar cluster. Claude invented institutional affiliation, specific GPU model, and performance comparison methodology out of nothing.

False development timeline:

"Three years of development and production use provide empirical evidence for this decision's trade-off"

"GANDALF reached research-grade maturity within three years of part-time solo development."

The actual development time was approximately one month, with Claude assistance. Claude inflated this by a factor of 36.

Invented GPU runtimes:

"A moderate-scale turbulent cascade (N = 128³, 50,000 timesteps) completes in ~7 hours on a single NVIDIA A100 GPU. An optimized CUDA code might complete in ~2.5 hours"

No GPU simulations were performed. These runtime numbers were fabricated. The comparison to "optimized CUDA code" was invented.

Made-up community claims:

Claude wrote an entire "Community growth potential" subsection filled with fabricated claims about user adoption, classroom deployment, and community engagement. None of it had happened.

Physics errors:

Beyond fabrication, there were physics mistakes that required domain expertise to catch:

  • Wrong definitions of g± (combinations of density and magnetic fluctuations)
  • Incorrect cascade direction claims
  • Misinterpretation of gyrokinetic orderings
  • Missing discussion of the velocity-space benchmark

The key insight: Claude was equally confident in true statements and fabricated ones. The prose read identically. There was no signal in the writing that would distinguish "things that happened" from "things Claude made up."

The Human Review Cycle

Issue #55 alone contained 40+ specific corrections across all sections. The pattern:

Physics errors requiring domain expertise. When Claude wrote that "compressive fluctuations are driven by Alfvén waves," I had to know enough physics to recognize this was wrong—they're mixed by Alfvén waves, not driven by them. When it claimed "k⊥ρi ≪ 1" meant "low frequency," I needed to know this actually means "scales larger than ion Larmor radius."

Notation inconsistencies. Claude used lowercase φ for the stream function in some places, uppercase Φ in others. The Elsasser fields were sometimes ξ±, sometimes z±. These required systematic correction.
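One way to make that kind of check systematic is a small script that counts competing symbols across the LaTeX source. The sketch below is a hypothetical helper, not something from the paper repo; the directory name and symbol patterns are illustrative stand-ins for the conventions mentioned above.

import re
from collections import Counter
from pathlib import Path

# Hypothetical helper: count competing notations across the LaTeX source so
# mixed conventions surface mechanically. Patterns below are illustrative.
NOTATION_VARIANTS = {
    "stream function": [r"\\phi\b", r"\\Phi\b"],
    "Elsasser fields": [r"\\xi\^\{?\\pm", r"z\^\{?\\pm"],
}

def scan_notation(tex_dir):
    counts = {name: Counter() for name in NOTATION_VARIANTS}
    for tex_file in Path(tex_dir).glob("*.tex"):
        text = tex_file.read_text()
        for name, patterns in NOTATION_VARIANTS.items():
            for pattern in patterns:
                counts[name][pattern] += len(re.findall(pattern, text))
    for name, counter in counts.items():
        used = {p: n for p, n in counter.items() if n > 0}
        if len(used) > 1:
            print(f"Mixed notation for {name}: {used}")

scan_notation("sections")  # "sections" is an illustrative directory name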

Missing content. The Discussion section had no mention of the velocity-space benchmark, even though it was a major contribution. Claude simply forgot to include it.

Fabricated quantitative claims. Every specific number needed verification against what actually happened.

These reviews weren't polish; they were the difference between a publishable paper and an embarrassing one.

The Real Workflow

143 Claude Code conversation sessions for this paper. What did that actually look like?

A typical session: open an issue, tell Claude to draft that section, review output, create PR, Claude GitHub App reviews, I review, create issue with corrections, Claude addresses corrections, iterate.

The Claude GitHub App provided automated review on every PR.

The "speed" of AI-assisted writing was iteration speed, not magic. Each round of feedback could be addressed in minutes instead of hours. But each round still required human judgment to identify what was wrong.

The ratio matters: Claude could implement changes 10x faster than I could. But identifying what changes to make remained 100% human.

Lessons Learned

AI drafting ≠ AI writing. Claude can draft. But drafting is maybe 20% of writing a paper. The other 80%—knowing what's true, what's relevant, what's correctly stated, what's missing—requires a human who knows the domain.

Hallucination risk is highest for quantitative claims. The fabricated content was overwhelmingly specific numbers, timelines, and institutional details. Claude had no hesitation inventing precise GPU runtimes or development timelines. Every quantitative claim needs verification.

Structured workflow creates an audit trail. Issues, PRs, and review cycles meant I could trace every change. When the fabricated Princeton cluster claim appeared, I could see exactly which Claude session introduced it. This transparency matters.

AI excels at iteration on specific feedback. Tell Claude exactly what's wrong, and it fixes it correctly. Ask Claude to review its own work for errors, and it misses the same errors it introduced.

Domain expertise cannot be delegated. The physics errors—wrong definitions, incorrect cascade descriptions, misinterpreted orderings—were invisible to anyone without plasma physics training. AI assistance amplifies what you know. It doesn't replace knowing things.

The Numbers

For the record:

  • ~3 weeks calendar time (Nov 7 - Nov 26, 2025)
  • 143 Claude Code conversation sessions
  • 23 GitHub issues closed
  • 20 pull requests merged
  • Multiple human review passes (Issues #37-42, #55)
  • External AI review (Gemini 3 Pro, Issues #53, #58)
  • Final paper: 6 sections, 4 physics benchmarks

Conclusion

This post is the honest version of "I wrote a paper with AI assistance."

Claude helped. The iteration speed was real. The infrastructure—issues, PRs, reviews—made it manageable. But the fabrications were also real. Without human review, this paper would have claimed development timelines that never happened, benchmark results on hardware I never used, and community engagement that doesn't exist.

The paper is correct now because I caught those errors. Not because Claude didn't make them.


Paper: arxiv:2511.21891

Code: github.com/anjor/gandalf

Paper repo: github.com/anjor/gandalf-paper

The FDE Manifesto: What Would Stokes Do?

This is a version of a document I had written during my time at Palantir.

Stokes was a legendary FDE at Palantir, and I learnt a lot by emulating how he operated. Asking "what would Stokes do?" wasn't hero worship—it was shorthand for a specific way of operating that separated exceptional FDEs from the rest.

Here are the tactical principles that define that approach.

Never Take Shortcuts That Compound

  • Don't make config changes that aren't product defaults
  • Never restart services "just to see if it fixes things"—you're destroying evidence
  • Never deploy dirty builds. If you're on a branch, getting back to mainline is P0
  • Every custom modification is technical debt with compound interest

Own the Product Stack

  • Clone the repos. Make them build. Start submitting PRs
  • When you get a stacktrace, read it. Form a hypothesis before opening tickets
  • Know the architecture cold—when telemetry fails, it's your only map
  • If a workflow is blocked on a feature, scope it and build it yourself

Treat Information as Infrastructure

  • Ensure logs and metrics are collected properly from day one
  • Master telemetry tools—they're not optional
  • Read everything: docs, runbooks, release notes, support tickets, Stack Overflow
  • Monitor what other teams are encountering—their problems will be yours soon

Root Cause Everything

  • Never accept "it's working now" as resolution
  • Gather data and form hypotheses before implementing fixes
  • You must be able to explain why the product broke and why your fix worked
  • Document your debugging process for future you

Build Strategic Relationships

  • Every support ticket is a relationship-building opportunity with core engineers
  • Contribute field signal to product direction—you're their eyes and ears
  • Know the product team's roadmap and weigh in based on deployment reality
  • Getting your tickets prioritized is a function of relationships, not just severity

The Strategic Thread

These aren't random best practices. They're behaviours that compound into a strategic advantage: radical ownership of outcomes.

When you refuse shortcuts, you're choosing long-term system health over short-term wins. When you master the product stack, you're becoming a peer to the product team, not a consumer. When you root cause everything, you're building institutional knowledge that makes future problems trivial.

This is what makes the FDE model powerful. You're not there to implement solutions—you're there to own outcomes completely. That ownership manifests in these specific, tactical behaviours that seem small but fundamentally change how you operate.

The question isn't whether you can do these things. It's whether you will.


Building an FDE organization? These principles are your hiring rubric. Look for engineers who already think this way.

The Unreasonable Effectiveness of Hiring Assholes

I've seen this pattern play out dozens of times. The brilliant engineer who tears apart your architecture in design reviews. The physics Nobel laureate who's a complete dick but moves research forward. The Palantir FDE who makes people uncomfortable but somehow always ships.

They're assholes. And they get results.

So there's this tempting logic: Maybe we need to hire more of them?


A friend at a startup just told me something interesting. "Everyone here is so humble," she said. Then she paused. "Maybe too humble. Nobody fights for their ideas. Sometimes it feels like we're not ambitious enough."

We've created a false choice:

Option A: Hire nice, humble people → Get a room full of diffidence

Option B: Tolerate brilliant assholes → Get results but destroy morale

Is this a true dichotomy?


Were the assholes effective because they were assholes? Or despite being assholes?

What if the causality is backwards?

They had conviction. They were direct. They were ambitious. They pushed back on bad ideas.

They also happened to be assholes.

We saw correlation and assumed causation. We thought the cruelty was necessary for the conviction.


Let me give you a counter-example.

Paul Mustiere spent 8 years at Palantir. I was his mentor when he interned. A few months ago, he joined Comand AI as Head of Engineering.

Paul is one of the highest-agency people I've ever worked with. He gets things done. He'll challenge your technical approach. He'll push back on bad ideas. He sets ambitious visions and rallies teams around them.

He's also genuinely humble. Low ego. Makes people around him better.

You don't have to choose between conviction and decency. Paul is living proof.


But even if you believe tolerating assholes gets short-term results, what's the actual cost?

The good people who quietly leave. The collaborative culture you never build. The institutional knowledge that walks out the door. The junior engineers who learn that being right matters more than being decent.

You're not being pragmatic by ignoring human cost. You're taking on technical debt in your culture. And like all technical debt, it compounds.


So here's the reframe:

Stop asking: "Should we hire assholes?"

Start asking: "How do we hire for conviction, directness, and ambition without hiring for cruelty, ego, and disrespect?"

Because these are separate traits.

You can have the physicist who challenges every assumption AND treats grad students with respect.

You can have the engineer who rewrites your architecture AND makes you feel good about the collaboration.

You can have the leader who sets an ambitious vision AND brings people along.


The framework for hiring:

Must-haves:

  • High conviction (will fight for what they believe)
  • Intellectual honesty (will change their mind when wrong)
  • Directness (will tell you the truth)
  • Ambition (wants to build something great)

Deal-breakers:

  • Making it personal
  • Cruelty for cruelty's sake
  • Ego-driven (caring more about being right than finding truth)
  • Disrespecting people even while disagreeing with ideas

In an interview, this looks like:

  • Someone who challenges your technical approach → good signal
  • Someone who challenges it and makes you feel stupid → red flag

  • Someone who says "I think you're wrong about this architecture" → high conviction

  • Someone who says "I can't believe you'd even consider that approach" → asshole

The unreasonable effectiveness of hiring assholes? It's a myth.

What's actually effective is hiring people with conviction.

Some of them happen to be assholes. That's not the part that makes them effective. That's the part that will eventually destroy your company.

Don't confuse the two.


What's your experience? Have you seen companies successfully separate conviction from toxicity? Or is this just naive optimism?

Building a Gyrokinetics Code Without Reading a Single Line: The Development Log

In the first post, I outlined an experiment: can AI make intelligence a commodity in physics research? Two weeks of intensive work later, I have a modernized gyrokinetics code running on my laptop. The catch? I haven't read a single line of the ~3000 lines of JAX it contains.

This post documents what that process actually looked like—the workflow that emerged, the surprising failures, and the honest assessment of what worked and what didn't. If you're a physicist considering AI-assisted development, this is what you should know.

The Constraints

After reaching out to the Viriato team, it became clear I'd need HPC access I no longer have. So I decided to revive and modernize GANDALF, my PhD-era code. The constraint was simple: it needs to run on my M1 Pro MacBook.

I pointed Claude Code at the original GANDALF repository and the relevant chapter from my PhD thesis. I asked it to draft a plan and file GitHub issues for each step in that plan. It created a comprehensive set of issues covering everything from basic spectral methods to turbulence diagnostics.

The plan was straightforward: port from CUDA/Fortran to JAX with Metal backend, validate against known benchmarks, then extend to multi-ion physics.

I am not familiar with JAX. I also haven't written Fortran or CUDA in a decade. This would be a pure test of whether AI could bridge that gap.

The Workflow That Emerged

The process settled into a rhythm:

  1. I ask Claude Code to pick the next issue from the GitHub tracker
  2. Local Claude Code works on the issue and opens a PR
  3. GitHub Claude (I installed Claude on the repo) reviews the PR
  4. I selectively decide which feedback matters and what to ignore
  5. Repeat

The dual Claude setup wasn't planned—it emerged from necessity. I needed something different to review the code to keep it honest and prevent drift. Think of it as having two smart undergraduates check each other's work.

My role was purely validation through physics outputs. I modeled myself as a PhD advisor: I don't read the student's code, I look at their plots and ask if the physics makes sense. When something was wrong, I'd start by showing the plot. Often Claude would say something incorrect, and I'd need to push back with physics insights until we converged on the right answer.

This is critical: I validated entirely through physics, never through code inspection.

What Worked Surprisingly Well

Getting basic physics running was shockingly easy. Within the first week:

  • Alfvén wave dispersion relations matched theory
  • Energy conservation held to machine precision
  • The Orszag-Tang vortex benchmark reproduced correctly

Some of the more advanced benchmarks are still in progress—getting clean turbulent spectra with the expected -5/3 scaling has proven trickier and I'm still working on it.

Figure 1: Orszag-Tang vortex at t=4.0 Alfvén times, showing the emergence of complex turbulent structures. The code correctly captures the vorticity filaments, current sheets, and magnetic field topology characteristic of 2D MHD turbulence.

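For readers who haven't met the benchmark: the Orszag-Tang vortex starts from a smooth, divergence-free initial condition that rapidly steepens into current sheets. Below is a minimal sketch of the textbook 2D incompressible setup; it is illustrative, not code taken from GANDALF, and the actual run may use a different normalisation.

import numpy as np

# Standard 2D incompressible Orszag-Tang initial condition on a [0, 2*pi)^2 box.
def orszag_tang_ic(n=256):
    x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    X, Y = np.meshgrid(x, x, indexing="ij")

    # Velocity field: a large-scale vortex
    vx = -np.sin(Y)
    vy = np.sin(X)

    # Magnetic field: sheared at twice the wavenumber in x, which seeds current sheets
    bx = -np.sin(Y)
    by = np.sin(2.0 * X)

    return vx, vy, bx, by

vx, vy, bx, by = orszag_tang_ic()
# Both fields are divergence-free by construction: vx and bx depend only on y,
# while vy and by depend only on x.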

Figure 2: Energy conservation over 4 Alfvén times. Total energy (black) remains constant to better than 0.01%, while kinetic (red) and magnetic (blue) energy exchange through turbulent dynamics. This level of conservation validates the spectral time-stepping algorithm.

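The diagnostic behind a plot like Figure 2 is conceptually simple: sum the kinetic and magnetic energies every few steps and track the relative drift. Here is a minimal sketch in terms of a generic stream function and flux function on a spectral grid; the function and variable names are illustrative, not GANDALF's.

import numpy as np

def perp_grad_energy(field_hat, kx, ky):
    # Energy of 0.5 * |grad_perp f|^2 computed in Fourier space via Parseval.
    # field_hat is the 2D FFT of f (numpy.fft conventions); kx, ky are wavenumber grids.
    k2 = kx**2 + ky**2
    return 0.5 * np.sum(k2 * np.abs(field_hat) ** 2) / field_hat.size

def total_energy(phi_hat, psi_hat, kx, ky):
    # Kinetic (stream function) plus magnetic (flux function) energy.
    return perp_grad_energy(phi_hat, kx, ky) + perp_grad_energy(psi_hat, kx, ky)

# Usage: record total_energy every few steps and report the relative drift
# |E(t) - E(0)| / E(0); "better than 0.01%" corresponds to a drift below 1e-4.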

Figure 3: Performance scaling on M1 Pro MacBook. A 128³ 3D simulation completes each Poisson solve in 28ms, putting useful turbulence simulations (hundreds of time steps) within reach of laptop hardware. The practical working range (green) shows what's actually feasible for iterative physics exploration.

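For context on the 28 ms Poisson solves quoted above: in a Fourier spectral code, inverting the Laplacian reduces to a division by k² mode by mode. A minimal JAX sketch of that operation, assuming a periodic cubic box (illustrative, not the GANDALF implementation):

import jax
import jax.numpy as jnp

def make_wavenumbers(n, length=2.0 * jnp.pi):
    # Wavenumber grids for a cubic periodic box.
    k1d = 2.0 * jnp.pi * jnp.fft.fftfreq(n, d=length / n)
    return jnp.meshgrid(k1d, k1d, k1d, indexing="ij")

@jax.jit
def poisson_solve(omega, kx, ky, kz):
    # Solve laplacian(phi) = omega spectrally: phi_hat = -omega_hat / k^2.
    k2 = kx**2 + ky**2 + kz**2
    k2_safe = jnp.where(k2 > 0.0, k2, 1.0)  # avoid dividing by zero at k = 0
    omega_hat = jnp.fft.fftn(omega)
    phi_hat = jnp.where(k2 > 0.0, -omega_hat / k2_safe, 0.0)  # zero-mean solution
    return jnp.real(jnp.fft.ifftn(phi_hat))

# Usage sketch: a 128^3 solve. The first call triggers JIT compilation;
# subsequent calls are the ones worth timing.
n = 128
kx, ky, kz = make_wavenumbers(n)
omega = jax.random.normal(jax.random.PRNGKey(0), (n, n, n))
phi = poisson_solve(omega, kx, ky, kz)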

Claude wrote 100% of this code. Not 90%, not 95%—literally every line. I provided physics corrections when needed—catching things like KRMHD vs KREHM orderings, explaining why slow modes should be treated as passive scalars, and designing the validation tests themselves. But I never wrote a single line of code.

The speed was remarkable. Tasks that would have taken me days as a PhD student (debugging FFT boundary conditions, implementing spectral methods, setting up proper diagnostics) were done in hours.

Where It Struggled: The Physics-Numerics Boundary

Advanced benchmarks proved much trickier. The problem wasn't coding—it was understanding the deep connection between physics and numerics.

The Spectral Integrator Problem

My numerical algorithm is non-standard: it's a spectral method that gets linear physics exactly right by integrating those modes analytically. Claude saw "time integration" in the thesis, found "RK4" somewhere in the literature, and implemented bog-standard Runge-Kutta.

I had to explain multiple times: we're not approximating the linear physics, we're solving it exactly in Fourier space, then handling only the nonlinear coupling numerically. This is the whole point of the algorithm—it eliminates spurious damping of weakly damped kinetic modes.

Eventually it got there, but it took persistent correction. The AI didn't have the physical intuition for why this matters.
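One standard way to realise "exact linear physics, numerical nonlinearity" is an integrating-factor step: the linear term is advanced with its exact exponential propagator in Fourier space, and only the nonlinear coupling is stepped explicitly. The sketch below shows a single step for a generic mode equation du_hat/dt = i*omega*u_hat + N_hat; it is a schematic of the general technique, not GANDALF's actual integrator.

import jax.numpy as jnp

def integrating_factor_step(u_hat, omega, nonlinear_hat, dt):
    # One step of an integrating-factor scheme for du_hat/dt = i*omega*u_hat + N_hat.
    # The linear phase rotation exp(i*omega*dt) is applied exactly, so the scheme
    # introduces no spurious damping of the linear modes; only the nonlinear term
    # N_hat (evaluated by the caller, e.g. pseudo-spectrally) is treated with a
    # low-order explicit update. Schematic only.
    propagator = jnp.exp(1j * omega * dt)
    # Exact linear advance applied to (current state + explicit Euler nonlinear
    # increment); higher-order variants apply the propagator inside RK stages.
    return propagator * (u_hat + dt * nonlinear_hat)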

The Forcing Coordinate Confusion

I specified that forcing should happen at large length scales: k=1,2 in Fourier space. Claude applied this condition to k_perp (because k_perp matters more than k_z in RMHD), but ended up forcing all k_z modes at those perpendicular wavenumbers. This caused immediate numerical instability—the simulation would blow up within a few time steps.

The fix required explaining the physics: we need to force specific 3D wavevectors, not all modes sharing a perpendicular wavenumber. This seems obvious in hindsight, but demonstrates how the AI can misunderstand the dimensional structure of the problem.
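Concretely, the difference between the two behaviours is just which wavevectors the forcing mask selects. A schematic version of the fix, with illustrative names and shapes rather than the actual GANDALF code:

import jax.numpy as jnp

def forcing_mask(kx, ky, kz, k_perp_force=(1, 2), kz_force=(1,)):
    # Select the specific 3D wavevectors to force.
    # Wrong version: use perp_band alone, which drives every kz mode sharing a
    # forced perpendicular wavenumber and blows the run up within a few steps.
    # Right version: also restrict to a handful of low |kz| values, so only a
    # few discrete 3D wavevectors are driven.
    k_perp = jnp.sqrt(kx**2 + ky**2)

    perp_band = jnp.zeros_like(k_perp, dtype=bool)
    for k in k_perp_force:
        perp_band = perp_band | (jnp.round(k_perp) == k)

    kz_band = jnp.zeros_like(kz, dtype=bool)
    for k in kz_force:
        kz_band = kz_band | (jnp.abs(jnp.round(kz)) == k)

    return perp_band & kz_band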

When tuning simulations, Claude's intuition about the forcing-dissipation balance was consistently off, but in a subtle way that reveals something about how physicists think versus how AIs think.

As a physicist, you're always trying to extract maximum physics from your computational box. You want to maximize the inertial range to get a clean power law spectrum. This means running as close to the edge of numerical instability as possible. A simulation that produces beautiful physics for 20 Alfvén times and then blows up at 25 Alfvén times is perfect—you use the data from the first 20 time units. The code is a tool to do physics; it's not important on its own.

Claude's instinct was the opposite: make the simulation stable and robust. When it saw signs of instability, it would suggest increasing dissipation (which kills your inertial range) or reducing forcing amplitude (which weakens the physics you're trying to study). These are technically valid numerical choices, but they optimize for the wrong thing.

The right approach is to tune parameters to get as close to instability as possible without crossing the line. This requires physical intuition about what's actually happening in the simulation, not just numerical stability analysis.

November 9: The $40 Day

The usage data tells a story. Most days cost $2-10. November 9 cost $40.

That was the day I tried to get nonlinear turbulence running properly. The simulation would run, but the physics was wrong in subtle ways. Energy would cascade, but not to the right scales. Heating rates would be off by factors of 2-3. Spectra would show the right scaling but wrong amplitudes.

The problem was that nonlinear turbulence requires everything to be right: the forcing must excite the correct modes, the dissipation must operate at the right scales, the time-stepping must preserve important invariants, and the diagnostics must actually measure what you think they're measuring.

I shifted from Sonnet to Opus hoping for better physics reasoning. It helped marginally, but I kept hitting limits. The AI could implement each piece correctly in isolation, but struggled to see how they fit together into a coherent physical picture.

We're still working on this. Some problems just take time, even with AI assistance.

The Skills That Actually Mattered

Here's what surprised me: I didn't use my tech background at all. I didn't debug code, suggest algorithms, or catch Python syntax errors.

What I did use:

Physics intuition: Knowing when results are physically wrong, even if numerically stable. Understanding that spectral pile-up means one thing while energy conservation violations mean something else entirely. Recognizing that a simulation optimized for stability is often a simulation optimized away from interesting physics.

Applied AI intuition: Designing the dual-Claude review pattern. Structuring the workflow around incremental validation through physics benchmarks. Understanding AI failure modes and building guardrails around them. Knowing when to push the AI harder versus when to step in with physics corrections.

This second skill is crucial and under-discussed. It's not prompt engineering—it's something closer to understanding how to architect human-AI collaboration at the systems level.

The Replicability Question

A friend asked: how much of an "n of 1" are you? Could a physics PhD with zero coding background do this?

Honest answer: not yet, at least not with the current setup.

The bottleneck isn't coding ability—the AI handles that. The bottleneck is catching physics-numerics errors before they compound. By the time you see wrong results, you're often many commits deep into a wrong path.

A physicist without coding experience wouldn't know to set up the dual-Claude review pattern, wouldn't think to validate incrementally through physics benchmarks, wouldn't catch the spectral integrator mistake until much later.

Could this be taught? Could I package the workflow into something a pure physicist could use? I genuinely don't know. That's an open question.

The Honest Productivity Assessment

The original GANDALF took me 6-7 months to build as a PhD student, working full-time. The new version took 30 days as a side project.

But this isn't quite an apples-to-apples comparison:

  • PhD me was less experienced, had never written serious scientific code before
  • Current me could probably write this faster by hand than PhD me could
  • This is part-time work vs full-time

Even accounting for these factors, the productivity gain is real. I'd estimate 5-10x faster than I could have done solo, even with my current skills.

But it's not "intelligence as a commodity" yet. It's more like having an exceptionally capable research assistant who never gets tired, never forgets papers they've read, and can implement complex numerics at 2am without complaint.

The creativity, problem selection, and physics intuition remain entirely human. The AI amplifies what you already know; it doesn't replace knowing things.

What's Next

The code is ready. The benchmarks are passing (mostly). The parameter space is mapped.

But there's an intermediate step: we're writing a proper journal paper documenting GANDALF itself. Using another Claude-assisted workflow in the gandalf-paper repository, we're producing a comprehensive code paper targeting the Journal of Plasma Physics. This uses a different set of AI agents specialized for scientific writing—latex-equations, literature-curator, benchmark-analyst, physics-narrator, and code-documentor—working together to produce publication-quality text.

Then comes the actual physics test: can we discover something genuinely new about multi-ion turbulent heating? Can this AI-augmented approach produce insights worthy of publication as a second paper?

The next post will document that process—the physics investigation itself, what worked, what failed, and whether this experiment ultimately validates or refutes the intelligence explosion hypothesis.

The Data

For transparency, here's what this cost in Claude API usage:

  • Total: $307.19 over 16 active days (spanning 30 calendar days)
  • Average per active day: $19.20
  • Peak day (Nov 9, wrestling with nonlinear turbulence): $41.88
  • Total tokens processed: 522M

Compared to my computational budget ($10K), this is negligible. Compared to the cost of hiring a programmer for a month, this is absurdly cheap. The constraint isn't money—it's my time to direct the work and validate the physics.


Acknowledgements

Thanks to the Claude team at Anthropic for building tools that actually work for technical research. And to everyone who's been following along with skeptical but curious questions—you're helping me think through what this means.

Testing the Intelligence Explosion: Can AI Turn One Physicist Into a Research Team?

The intelligence explosion hypothesis claims that AI will make intelligence a commodity—as accessible as electricity or compute. If true, this fundamentally changes how science is done. A single PI could effectively command dozens or even hundreds of smart undergraduates, limited only by their ability to direct rather than execute research.

I decided to test this claim in the domain I know something about: plasma astrophysics. And it was a fun excuse to do some physics again :).

The Experiment

After a decade away from active physics research, I'm attempting something that would typically require a small research group: identify an unsolved problem in gyrokinetic turbulence, develop computational tools to attack it, and produce publishable results. The difference? Instead of an advisor and collaborators, I am working with Claude.

The mental model is crucial here. I'm not expecting the AI to be creative or to have deep physics intuition. Instead, I'm using it as an exceptionally capable undergraduate—one who can implement complex numerical schemes at 2am, never forgets a paper they've read, and can iterate on code without getting frustrated. The creativity, problem selection, and physics intuition remain human responsibilities.

The Process So Far

The journey began with a comprehensive literature survey. Claude and I reviewed ~50 papers from 2019-2024 on gyrokinetic turbulence, identifying several promising research directions. The key criteria: numerically tractable, genuinely unsolved, and building on recent breakthroughs.

I selected the problem: How do multiple ion species affect the helicity barrier and heating partition in collisionless plasmas? This extends recent work by Meyrand et al. (2019) on plasma echoes and the helicity barrier mechanism (Squire et al. 2022) to the astrophysically relevant case of solar wind with H⁺, He²⁺, and trace heavy ions. This is a natural extension of my own PhD research, and therefore seemed like fertile testing ground.

Next came tool selection. After discussions with the Viriato team, it became clear that modernizing my PhD-era code GANDALF was the right approach. Not because it was the best code, but because I understood its physics assumptions deeply enough to guide the AI effectively.

This is where things got interesting. Using Claude Code, we rebuilt GANDALF from scratch in JAX, targeting Apple Silicon's Metal backend. In two weeks, we had:

  • Reproduced the Orszag-Tang vortex benchmark
  • Confirmed the -5/3 turbulent spectrum
  • Validated energy conservation to machine precision
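To make "confirmed the -5/3 spectrum" concrete: the check amounts to binning spectral energy into perpendicular-wavenumber shells and fitting a power law across the inertial range. A minimal sketch with generic names (not GANDALF's diagnostics, and with an assumed inertial range):

import numpy as np

def perp_energy_spectrum(field_hat, kx, ky):
    # Bin 0.5*|field_hat|^2 into integer shells of k_perp = sqrt(kx^2 + ky^2).
    k_perp = np.sqrt(kx**2 + ky**2)
    shells = np.rint(k_perp).astype(int).ravel()
    energy = 0.5 * np.abs(field_hat).ravel() ** 2
    spectrum = np.bincount(shells, weights=energy)
    return np.arange(spectrum.size), spectrum

def fit_slope(k, spectrum, k_min, k_max):
    # Least-squares power-law slope over the chosen inertial range.
    mask = (k >= k_min) & (k <= k_max) & (spectrum > 0)
    slope, _ = np.polyfit(np.log(k[mask]), np.log(spectrum[mask]), 1)
    return slope  # compare against the expected -5/3

# Usage (k_min and k_max are illustrative choices for the inertial range):
# slope = fit_slope(*perp_energy_spectrum(phi_hat, kx, ky), k_min=4, k_max=20)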

The AI wrote ~90% of the code. I provided physics corrections, caught subtle errors (KRMHD vs KREHM orderings), and designed the validation tests. My original PhD thesis provided the theoretical framework.

This entire journey—from literature survey to working code—has taken just two weeks (I started a month ago, but took a two-week holiday in between). To put this in context, it took me ~6 months to write the original version of GANDALF. I did have a head start on the literature review, since I already knew some of the field from the last time I surveyed it.

What This Means

If this experiment succeeds—if we can produce a legitimate physics result worthy of publication—it suggests the intelligence explosion hypothesis has merit, at least for well-defined technical domains. The bottleneck shifts from execution to direction, from coding to physics insight.

But there are caveats. This only works because I can recognize when the physics is wrong, design meaningful computational experiments, and interpret results in context. The AI amplifies expertise; it doesn't replace it.

What's Next

We're now approaching the critical test: discovering something genuinely new about multi-ion turbulent heating. The computational framework is ready. The parameter space is mapped. The next posts will document whether an AI-augmented physicist can produce real scientific insights, and what that process actually looks like when physics intuition meets artificial intelligence.

Stay tuned for the story of writing a modern gyrokinetics code with an AI partner, complete with the failures, surprises, and occasional moments when the machine suggests something I hadn't considered.

Acknowledgements

I would like to thank Alex Schekochihin and Nuno Loureiro for helping me brainstorm this project, and pushing me to actually spend some cycles on it. I am forever indebted to Bill Dorland for teaching me to push the boundaries of physics research using new and improved computing capabilities.

Reflection

Career decisions are hard.

I have pivoted my career a few times now. The first time was about a decade ago, when I switched from academia to industry and from physics to tech. It was a difficult decision. I was leaving behind something I had obsessed over for 17 years for the complete unknown. In hindsight it seems like such a crazy decision -- I didn't know much about Palantir. I really enjoyed all my interviews (I remember the questions and the interviewers a decade later) - this was the main positive signal. And the culture (as much as I could glean from the interviews) had a lot of similarities with academia; it felt familiar.

Boy was I fortunate. The 7 years I spent at Palantir defined the professional me. It taught me so many things -- technical, organisational, traits in people one should value and much more. It gave me the opportunity to wear many hats and grow stochastically.

Palantir gave me the opportunity and the courage to try new things. Leaving after 7 years was bittersweet - it was time for something new, but I missed my home.

The last 15-16 months have been very different. I have been trying to find the thing that keeps me engaged, makes me obsessed. I wouldn't say I have found it yet, but I am working on it.

The Database Selection Trap: Why Your Technical Interviews Might Be Testing the Wrong Things

I recently watched a talented engineer fail a system design interview, and it made me question everything I thought I knew about technical hiring.

The candidate was asked to design a data model for a food delivery platform. They chose PostgreSQL. When the requirements evolved—millions of drivers, real-time location updates, flexible schemas—they couldn't pivot to NoSQL. Despite perfect nudges from the interviewer, they remained stuck.

Here's what haunted me: In any real engineering role, this person would have thrived. They'd have teammates suggesting alternatives. They'd have design reviews. They'd have documentation and prior art to reference.

But in that interview room, artificially isolated from every resource that makes modern engineering possible, they failed.

This isn't a story about lowering the bar. It's about recognizing that many of our "standard" technical interviews are testing the wrong things entirely.

The Comfort of Cargo Cult Interviews

We've all been there. You're tasked with building a hiring process, so you do what seems logical: look at what successful companies do and copy it. Google does system design interviews? So do we. Facebook does algorithm challenges? Add it to the list.

But here's the problem: we copy the form without understanding the function.

That database selection question? It made perfect sense... until I asked myself what we were actually testing:

  • Can this person independently choose the right database in isolation?
  • Or can this person build great systems in a collaborative environment?

These are fundamentally different skills. And only one of them matters for the job.

The Three Interview Traps That Filter Out Great Engineers

After auditing dozens of hiring processes, I've identified three common traps that eliminate potentially excellent engineers for the wrong reasons:

1. The Isolation Trap

The Setup: Candidate must solve everything alone, from first principles, without any external resources.

The Problem: This isn't how engineering works. Ever. Modern engineering is collaborative, iterative, and builds on existing knowledge. The best engineers aren't those who can reinvent everything in isolation—they're those who can leverage their team and tools effectively.

Real Example: A senior engineer with 10 years of experience couldn't remember the exact syntax for a specific PostgreSQL window function. In reality, they'd look it up in 30 seconds. In the interview, they struggled for 10 minutes and lost confidence.

2. The Perfection Trap

The Setup: One significant stumble means failure, regardless of overall performance.

The Problem: Engineering is about recovery and iteration, not perfection. Some of the best engineers I've worked with are great precisely because they recognize mistakes quickly and course-correct effectively. But our interviews often punish any deviation from the "perfect" answer.

Real Example: A candidate designed 90% of an excellent solution but made one architectural decision that would have caused scaling issues. Instead of seeing if they could identify and fix it with feedback (like they would in a real design review), they were marked down significantly.

3. The Specific Knowledge Trap

The Setup: Testing specific technical knowledge rather than fundamental thinking.

The Problem: Technology changes. What matters is engineering judgment, learning ability, and problem-solving approach. But we often test whether someone memorized the specific technologies we happen to use today.

Real Example: A brilliant engineer "failed" because they weren't familiar with Kafka. They understood event-driven architectures perfectly and had used RabbitMQ extensively. Given a week on the job, they'd be productive with Kafka. But the interview didn't capture that.

A Better Way: Design Interviews That Mirror Reality

The solution isn't to make interviews easier. It's to make them more realistic. Here's a framework I use with my clients:

Step 1: Start With Role Reality

Before designing any interview, answer these questions:

  • What does a typical day look like for this engineer?
  • What resources do they have access to?
  • How do they collaborate with others?
  • What does "great performance" actually look like?

Step 2: Map Backwards to Interview Signals

For each critical skill, ask:

  • What's the minimal signal we need to assess this?
  • How can we test this in a way that mirrors reality?
  • What support would they have in the real role?

Step 3: Build in Collaboration and Iteration

Instead of testing isolated perfection, test realistic excellence:

  • Allow candidates to ask clarifying questions (like they would with stakeholders)
  • Provide feedback and see how they incorporate it (like in code review)
  • Let them reference documentation for syntax (like they would with Google)
  • Focus on their thinking process, not memorized solutions

Case Study: Redesigning the System Design Interview

Here's how we transformed that problematic database interview:

Old Version: "Design a data model for a food delivery system. Choose your database and justify it."

New Version: "Let's design a data model for a food delivery system together. Here's our current scale and requirements. As we go, I'll play the role of your teammate and share what we've learned from our existing systems."

The key changes:

  1. Collaborative framing - "together" and "teammate" set the tone
  2. Living requirements - Requirements evolve during the discussion, like real projects
  3. Historical context - They can ask about existing systems and constraints
  4. Focus on reasoning - We care more about how they think through trade-offs than their initial choice

The result? We started identifying engineers who would excel in our actual environment, not those who could perform in artificial interview conditions.

The Hidden Cost of Bad Interviews

Every time we filter out a great engineer because they stumbled on an artificial constraint, we're not just losing a potential hire. We're:

  • Reinforcing biases toward certain backgrounds (those who've practiced these specific interview formats)
  • Extending our hiring timeline as we search for unicorns who excel at interviews AND engineering
  • Building teams that optimize for interview performance over actual job performance

Your Next Step: The One-Question Audit

Pick one question from your current interview process. Just one. Now ask yourself:

"If a strong engineer failed this specific question but excelled at everything else, would I bet they'd fail in the actual role?"

If the answer is no, you're testing the wrong thing.

The Path Forward

Great hiring isn't about finding engineers who can solve puzzles in isolation. It's about identifying those who will thrive in your specific environment, collaborate effectively with your team, and deliver value to your customers.

That means designing interviews that test for reality, not ritual.

Start with one interview. Make it 10% more realistic. See what changes.

Because somewhere out there is an engineer who would be fantastic on your team but can't remember if MongoDB uses documents or collections in the heat of an interview.

Do you really want to miss out on them because of that?

2024 Wrapped

Pivot

This year was a huge pivot for me. At the beginning of the year I was working on an AI startup founded by two of my friends. It was my first introduction to the world of AI, and a great learning experience. It developed my interest in applied AI - specifically, that area in between research and practical applications. Keeping up with the latest research, and then figuring out how to apply it to real-world problems.

However, by April I realised I needed a break. My partner pointed out that I had not taken a proper break since starting grad school, back in August 2008. And since then I had only worked at intense places - a PhD, Palantir, a crypto startup, and then the AI startup.

I took about 6 weeks off to reflect on what I wanted to do next.

Reflection and Experiment

One key realisation was that identifying what would drive me was not easy, especially looking forward. The last time I felt that level of narrow focus was when I chose to do my PhD - there was no doubt whatsoever in my mind that I wanted to do it. In fact there were many people who tried to talk me out of it, but I was convinced. And since finishing my PhD, I have oscillated between being driven by the impact, the people I work with, the day-to-day work, and the learning opportunities. At different times, each of these has been the primary driver.

Given this, I decided to do an experiment with two key constraints:

  1. I would explicitly not try to identify what would drive me, but instead set up transactional contracts. Tactically, this meant I would not take equity in any company I worked with. This would force me to be thoughtful about the time I spent on a project.
  2. I would work at a slower pace - 4 days a week.

Early Results

Initially I was worried that I would not be able to find enough work. But I was pleasantly surprised. I managed to land a couple of consulting gigs - one with a boutique consulting firm, and another with a startup. The consulting firm has been a great experience. I have been working with former Palantirians, and there's almost a sense of homecoming. The startup scratched the AI itch, giving me a chance to work on a real-world problem using AI.

The first couple of months went quickly - a honeymoon period of sorts. But one of the gigs ended, and I realised that I needed to be more proactive about finding work. As someone who has no idea about sales, this was a challenge. Honestly, I freaked out a bit, but then I found my own way of generating leads.

Sales

I started writing more. I wrote about my experiences at Palantir, especially the hiring process. This resonated with a lot of people, and I started getting inbound leads. Interestingly, not all were hiring related. I started working with another AI startup, a crypto startup, and a couple of hiring related projects.

So far this is the only way I have tried to generate leads, and it has worked well. I still don't feel confident that I will consistently manage to bring in business, but I will cross that bridge when I get there.

Hiring

The hiring projects have been the most non-trivial. I have worked with a few companies on their hiring, the most notable being Comand AI. The level of trust they have placed in me has been humbling. Hiring is one of the most high-stakes things an early stage startup does, and I am grateful for the opportunity to help them.

It has also pushed me out of my comfort zone. Yes, I learnt a lot about hiring at Palantir, but fully owning the end to end outcome is a different beast. Additionally, I now also have to articulate my approach and be methodical about it - forcing me to rely less on instinct.

Highlights and Conclusion

Over the course of this year, I have worked on the following tech:

  • Palantir Foundry
  • Python: to build LLM-based applications for the AI startups
  • Golang: to build tools in the crypto space for the crypto startup

And I have learnt about:

  • Golang concurrency
  • AI tools: Claude, Zed, Gemini, etc.
  • RAG architectures
  • Being a freelancer
  • Designing a hiring process

I am happy with how the year has gone. I have managed to find work, and I have enjoyed the work I have done. Interestingly, even though the experiment started off as transactional, I have found myself getting attached to the work. I have been invested in the outcomes, and I have cared about the people I have worked with. This doesn't come as a surprise - I have always needed to care about the work I do, and/or the people I work with. But it's interesting to see how this has played out with the transactional initial conditions.

I am looking forward to 2025. I am excited about the work I have lined up - specifically in the AI space, as well as the hiring work. The AI projects should give me the opportunity to learn and grow as an engineer, and the hiring work will give me the chance to learn how to build a business. I am not going to force my hand either way - we'll see how it goes.

Criticality and Engagement

Hiring is hard. It's difficult to figure out what makes a good hire.

If nothing else I have found these two qualities to be the most important in a hire:

  1. Critical thinking
  2. Engagement

I have often rejected candidates who were technically strong but lacked these two qualities.

Critical Thinking

"The day you don't feel comfortable disagreeing with me is the day we have lost our culture" - Shyam Sankar

Shyam said this in one of the all hands at Palantir and it has stuck with me since then.

You want to hire people who are not afraid to disagree with you. You want to hire people who will challenge you. Independent thinkers are what make a company.

Engagement

Anyone who is genuinely interested in the work will naturally be a high performer. They will care about getting it right. It aligns incentives.

This attitude is infectious and additive. It will rub off on the rest of the team.

Conclusion

Of course there are other qualities that are important, and I have written about them in the past. But these two are, to some degree, "must haves". Be on the lookout for them.

Hiring for a mission-driven early-stage startup

I have recently had the privilege of working with the team at Comand AI. Comand is a mission-driven startup that is building a platform to bolster NATO and NATO-aligned countries' defense capabilities by building products that help make operations more efficient and effective.

The company is at a very early stage and is currently sprinting towards finding product-market fit. At the same time, they have a very clear mission and vision, and have seen early signs of traction in the market. This means that as they continue to gather user feedback and iterate on their product, they need to build a team that can move quickly and adapt to the changing needs of their customers.

The Challenge

The founding team at Comand AI is really strong on the technical side. It is exactly the kind of team you would expect to see at a mission-driven startup - highly motivated, technically strong, and deeply passionate about the problem they are solving. However, they needed help in building out the team further. How do you identify the traits that would make someone successful in a mission-driven startup? How do you build a hiring process that can help you identify these traits?

At the pre-product-market-fit stage you generally want people who have spikes in at least one of the following two areas:

  1. Highly Creative: They need to be someone who can explore the product space and come up with innovative solutions.
  2. Strong Execution: They need to be someone who can take a vague idea and turn it into a product really quickly.

You either need someone who can chart out uncharted territories and come up with innovative ideas, or someone who can quickly build a prototype and test it out with users. Ideally both.

Separately, you need to be mindful of the mission-driven aspect of the company. You need people who are deeply passionate about the problem you are solving, and who are willing to go the extra mile to make sure you succeed. And you need to think about the kind of culture you want to build. You want people who are collaborative, who are willing to take ownership of the outcome, and who are able to work effectively with others.

Designing a hiring process that can help identify these traits is crucial to building a team that can help you achieve your mission.

The Process

I started by understanding the current team makeup, their values, and the business goals they were trying to achieve. This was mapped to the hiring goals for the next 6 months. I also did some groundwork by shadowing a few interviews and understanding the current process. This revealed the different interviewing styles, their preferences, the kind of questions they were asking, and the synthesis process.

One of the main gaps I noticed was something I have written about before - being ok with the unfairness of interviews. The team has high empathy for the candidates, and at times this meant they were not making the hard decisions that were needed. An empathetic interviewer is a good thing - being empathetic helps build a genuine connection with the candidate, but while synthesising the feedback, it is important to be objective. This is especially important when evaluating candidates who are good, but not great. At such an early stage, you want to hire great people.

I also worked with the internal recruiting lead to design a process that would help identify the traits we were looking for. This included:

  • A screening call with the internal officer and me to understand the candidate's motivations and values.
  • (Optional) Another domain-specific call with the technical lead to understand the candidate's technical capabilities.
  • Coding assessment.
  • Onsite interviews that included a mix of technical and behavioural interviews.
  • A founder interview to understand the candidate's alignment with the mission and vision of the company.

The first screening call was designed to protect the team's time and ensure that only candidates who were deeply aligned with the mission and vision of the company were brought onsite.

We are still in the process of iterating on the process, but the early signs are promising. The team is excited about the candidates they are seeing, and the candidates are excited about the opportunity to work at Comand AI. As we get more reps under our belt, we hope to add more rigor and accountability by introducing hiring theses to have a historical record of why we made the decisions we did, as well as to have an understanding of how and where to staff new hires.

Looking Forward

As Comand AI continues to grow, they will need to continue to iterate on their hiring process. They will need to think about how to scale the process, how to ensure that the process is fair and unbiased, and how to ensure that they are hiring the right people for the right roles. And all of this without losing sight of the mission and vision of the company. But the passion and drive of the founding team was apparent since the beginning, and I am confident that they will be able to build a team that can help them achieve their goals.

Get in touch if you would like to work at Comand AI, or if you would like to know more about the hiring process we are building. I am always happy to chat about hiring, startups, and everything in between. If you are currently in the process of hiring for your startup, and would like some help, feel free to reach out to me at me@anjor.xyz.