AI as the Undergrad Researcher: A Real Physics Result, Two Months, One Person

Writing code with AI is no longer surprising. I rebuilt my PhD-era gyrokinetics code in JAX with Claude in 30 days last year (Building a Gyrokinetics Code Without Reading a Single Line). What I actually wanted to test was the next step: can AI do the unglamorous, undergraduate-level research motions that turn a working tool into a publishable result — running simulations on a cluster, iterating on the configs when the physics looks off, generating diagnostic plots, catching numerical bugs, and drafting the paper?

Two months later, the answer is yes. The paper is merged (repo, currently with Alex Schekochihin for review). It's a small but clean physics result: across a 50× scan in collisionality, the phase-space dissipation rate stays flat to under 1%. This is the kinetic analogue of Onsager's classical 1949 dissipative anomaly, demonstrated directly in nonlinear KRMHD simulation.

In this article I write about what those two months actually looked like — what Claude did, where I stepped in, and the broader claim I want to make: computational physics researchers today have undergraduate-level research assistants on tap, and using them well is a systems problem more than a model problem.

The Setup

In plasma turbulence, energy cascades from large scales down to small scales until it dissipates. The dissipation rate ε_ν depends on collisionality ν. Onsager's 1949 result for incompressible fluids (the dissipative anomaly, sometimes called Onsager's conjecture) says ε_ν becomes independent of viscosity in the inviscid limit — the cascade self-organises to maintain constant flux. The kinetic analogue of this statement has been argued for in the literature (Schekochihin 2016, Adkins 2018, Eyink 2018, Nastac 2024) but hadn't been directly demonstrated in nonlinear KRMHD simulation. That was the question: can we see the ν-independent plateau?

The reason to pick this question was pragmatic. The setup is tractable: all you need is one diagnostic per run. If the simulations behaved, the result would be unambiguous, and checking its validity would be easy.
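To make "one diagnostic per run" concrete, here is a minimal sketch of a plateau check. It assumes "flat" means the maximum relative deviation of ε_ν from its scan-wide mean, which may not be the exact definition used in the paper. The function name `plateau_spread` is hypothetical, and every number below is a placeholder chosen for illustration, not data from the paper.

```python
def plateau_spread(eps_values):
    """Maximum relative deviation of the dissipation rates from their mean."""
    mean = sum(eps_values) / len(eps_values)
    return max(abs(eps - mean) for eps in eps_values) / mean


# Placeholder values for illustration only. These are NOT the paper's data.
nu_scan = [0.2, 0.5, 1.0, 3.0, 5.0, 10.0]            # hypothetical collisionalities
eps_nu = [1.000, 1.004, 0.998, 1.002, 0.997, 1.001]  # hypothetical dissipation rates

scan_ratio = max(nu_scan) / min(nu_scan)
spread = plateau_spread(eps_nu)

print(f"scan range: {scan_ratio:.0f}x, spread: {100 * spread:.2f}%")
print("plateau" if spread < 0.01 else "no plateau")
```

The point of reducing the result to a single scalar is that the pass/fail decision takes seconds, for the human and for the AI alike.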

Figure 1: The collisional dissipation rate ε_ν stays flat to under 1% across a 50× scan in collisionality ν. This plateau is the central result of the paper.

Dissipative anomaly plateau: epsilon_nu vs nu

Figure 2: What a healthy run looks like (ν=3). Left: the time-averaged perpendicular spectrum E(k_⊥) shows a clean inertial-range cascade. Right: the Hermite-moment spectrum W(m,t) as a heatmap — energy stays confined to low m, with no pileup at the m=128 truncation. This is the "physics makes sense" check I was doing on every run.

Base-state spectrum and W(m,t) heatmap for the nu=3 run

The Workflow: Claude Does the Undergrad Work

For two months, almost every day, the rhythm looked like this:

  1. I state the next physics goal in a sentence or two.
  2. Claude proposes a config — YAML parameters, a Modal runner, sometimes a new diagnostic.
  3. I push back on the physics where it looks wrong, or say "go for it."
  4. Claude submits the run to an A100 cloud GPU.
  5. Run finishes, Claude pulls the data and generates plots.
  6. I look at the plots and decide whether the physics makes sense.
  7. If yes, Claude drafts the paper section. If no, we iterate.

The bulk of the labour — writing the Modal runner, parsing diagnostic outputs, fitting power laws, generating figures, writing LaTeX, managing the BibTeX — was Claude. I never touched the cloud orchestration code. I never wrote a matplotlib script. I never edited the LaTeX preamble. I looked at outputs and either nodded or asked for something different.

As things progressed, we got into a rhythm. A normal day involved me sending 5-10 short messages, mostly along the lines of:

  • "go for it"
  • "this looks great! Let's get it merged."
  • "ok submit the M=256 run too"

This is the texture of the collaboration most of the time: short, direct, advisory messages, the kind you would send a capable grad student.

Where I Stepped In

The interesting moments are the five places where I had to nudge. None of them were heroic. Each was the kind of intervention an advisor makes when a student is going down the wrong path.

Nudge 1: Lambda

Two weeks in, the Hermite cascade wasn't cascading. The setup was right by every code-level check, but energy was just sitting in the lowest moment. I asked: isn't there supposed to be a coupling Λ between g₀ and g₁? The Alfvénic checkpoint we'd restarted from had Λ=1; for the Hermite problem at β_i=1, it should be √5. One physical constant. The cascade lit up on the next run.

The AI executes correctly inside the frame it is given. The frame — including the specific physical values that distinguish your problem from the adjacent one — is yours to set.

Nudge 2: Three Weeks Chasing a Numerical Ghost

This was the longest detour, and the one I keep thinking about. Every nonlinear run at high ν blew up. The blowup times scaled with ν cleanly — ν=1 died at 80 τ_A after the restart, ν=3 at 122 τ_A, ν=5 at 167 τ_A — and the scaling matched what the canonical "pileup at the Hermite truncation" failure mode predicts. The literature warns about it. I went deep into that hypothesis.

For three weeks Claude helped me investigate inside that frame: adjusting hyper-collisional dissipation, varying M, looking at high-m energy budgets. The frame was wrong, but it was an internally consistent wrong frame, which is what made it so hard. Every diagnostic could be interpreted as consistent with the pileup story if you squinted.

What broke it was finally asking Claude to extract the three diagnostics the pileup story actually predicts: the energy at the truncation moment, the parallel-wavenumber spectrum at the highest m, and how localised the blowup was across m. The pileup story predicted a clear signal in all three, and the data showed none of them. The energy at the truncation sat at the noise floor, the k_z spectrum at high m was down at 10⁻¹⁴, and the blowup happened simultaneously across every moment, independent of ν. That is not a physical cascade failing; that is the time integrator.
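A rough sketch of two of those three checks, the energy at the truncation moment and the localisation of the blowup across m, might look like the following. The data layout `W[t][m]`, the factor-of-10 onset threshold, the function names, and the synthetic spectrum are all assumptions made for illustration. This is not the diagnostic code from the project, and the parallel-wavenumber check is left out.

```python
def truncation_fraction(W_row):
    """Fraction of the total free energy sitting in the last retained moment m = M."""
    return W_row[-1] / sum(W_row)


def onset_index(series, factor=10.0):
    """First time index where the series exceeds `factor` times its initial value."""
    for i, value in enumerate(series):
        if value > factor * series[0]:
            return i
    return None


def onset_spread(W, factor=10.0):
    """Spread in blowup onset (in time steps) across Hermite moments.

    W[t][m] holds the Hermite spectrum at time index t (assumed layout).
    Returns None if no moment ever blows up.
    """
    n_moments = len(W[0])
    onsets = [onset_index([row[m] for row in W], factor) for m in range(n_moments)]
    onsets = [o for o in onsets if o is not None]
    return (max(onsets) - min(onsets)) if onsets else None


# Synthetic stand-in for an integrator-driven blowup: a steep spectrum in m that
# is steady for 5 steps, then grows 100x per step in every moment at once.
base = [1.0 / (m + 1) ** 2 for m in range(9)]
growth = [1.0] * 5 + [100.0 ** (t + 1) for t in range(5)]
W = [[b * g for b in base] for g in growth]

frac = truncation_fraction(W[0])
spread = onset_spread(W)
print(f"energy fraction at m=M before blowup: {frac:.3f}")
print(f"onset spread across m: {spread} steps")
```

A pileup at the truncation would show a large energy fraction at m=M and an onset that starts at high m and propagates downward. A small fraction combined with zero onset spread points at the time integrator instead.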

Figure 3: What the wrong story looked like. W(m,t) heatmaps for the runs that blew up, with the divergence in ΣW(m), W(m=M), and ε_ν(t) on the right. It reads as a physical pileup at the truncation, but it was actually a hidden CFL leak in the Lawson-RK4 integrator. The wrong frame produced fingerprints that looked entirely physical.

Hermite blowup W(m,t) heatmaps for the Lawson-RK4 runs

I reframed for Claude: this isn't a physics bug, it's a code-path bug. Look at the Lawson-RK4 integrator. Within an afternoon Claude had identified a hidden CFL leak in the integrator. A fix shipped as GANDALF #138 the next day.

The honest version is that the AI did not catch this. It spent three weeks helping me debug inside a confidently wrong hypothesis. What broke the frame was me coming back to first principles and asking "what would the data look like if I were wrong?" before "what would the data look like if I were right?"

This is the hardest discipline to maintain in a long project. The frame is yours to set, and questioning it is yours too.

Nudge 3: Dropping the Π(m) Panel

Late in the project Claude proposed a second panel for Figure 1 showing the constant-flux Hermite cascade, Π(m) versus m. The diagnostic worked and the figure was real, but the ν=1 curve had a clear numerical artifact left over from low-ν pileup. Claude was happy to keep the panel with explanatory caveats. My message at the time was: hmmm the flux plot is not great honestly. i think we should just remove it. The figure that survived was cleaner, and the headline result stronger.

The instinct to publish less rather than over-claim is a human one. The AI will defend a marginal figure indefinitely if you let it.

Nudge 4: Checking the Novelty Claim

The draft asserted that no prior numerical work had cleanly demonstrated a ν-independent ε_ν. I have been out of active physics for over a decade, so I am not current on the literature and couldn't vouch for that claim. What I could do was flag it. I asked Claude whether we actually had a citation for the sentence.

Claude pulled the relevant prior work, including Nastac 2024 — titled "Phase-space entropy cascade and dissipative anomaly," recent, directly adjacent to our claim, and a paper I had never seen. We turned the bare assertion into a short prior-work paragraph and cited it properly.

This nudge cuts the other way from the rest. Here the AI knew the literature better than I did. My contribution was the editorial reflex of not asserting novelty I hadn't checked; the retrieval was Claude's. The lesson is to interrogate your own strong claims and let the AI do the literature work it is genuinely good at.

Nudge 5: The Convergence Study

This is the least dramatic of the five, included because it is the routine mode. Late in the project I said: we should do the M convergence as a part of this study. Claude wrote the Modal runner, submitted two new sims (M=64 and M=256 at ν=3), did the analysis, and wrote the new subsection. The whole thing took two days, with almost no input from me beyond approving the configs. Most of the collaboration looks like this. The dramatic episodes are easy to write about, but the long calm stretches where the work just gets done are the actual story.

Memory: The New Piece of Infrastructure

For the GANDALF sprint, persistent memory didn't matter. It was thirty days in one rhythm, and the whole project fit in active context.

For this project it did matter. At sixty days you can't hold everything in working memory, because sessions interrupt each other and knowledge from Tuesday gets lost by Thursday unless you write it down. The auto-memory system in Claude Code — files that persist across sessions — turned out to be the most important piece of infrastructure for the long arc.

There are six memory files for this project right now. Two of them, verbatim:

Lambda parameter physics Lambda=1 kills Hermite cascade; use √5 for standard β_i=1

M=128 Hermite — resolved by GANDALF v0.5.0 IMEX Lawson blowup was numerical; pin scheme="imex_rk222"; ν=3 acceptance passed

These are corrections I won't have to relearn three weeks later. The "I already explained this to you" cost is the silent productivity killer in any long-running AI collaboration, and persistent memory is what prevents it.

It's a Systems Problem

The most important thing I learned from this project is that using AI effectively at the two-month scale is not a model problem but a systems problem.

The model is a given. What mattered was the system around the model: persistent memory, the choice of a tractable physics question, Modal as the cloud orchestration layer, a clean cross-repo handoff between GANDALF (the upstream library, where bugs got filed) and krmhd-research (the science repo, where the paper lives), a paper repo with curated BibTeX, and the ability to look at a generated plot and decide in seconds.

The autonomy gradient framing I wrote about a few months ago — about choosing where on the spectrum from "I drive every step" to "AI runs autonomously" — already feels dated. The model capability has moved faster than the framework was ready for. The real question now isn't "how much autonomy do I give it?" but "what's the system I need around it so that the autonomy is productive?"

By "system" I mean the set of practical questions around the model. Where do corrections persist? Where do artifacts live? What is the cross-tool rhythm? Where do you intervene without breaking flow? What is the smallest unit of trustable output — a plot, a config, a paragraph? And how do you make decision-points cheap enough that you actually make them?

I don't have clean answers. What I have is the working version of one specific instance: a six-memory-file, two-repo, Modal-backed, advisor-mode rhythm that produced a paper. The recommendations below are what I've extracted from it.

What This Enables: Independent Computational Research

The category I keep thinking about is "recovering physicists" — people who left active research years ago but kept the training and the taste. The cost of returning is now genuinely low. A side project, a few hours a week, can produce real work.

Computational physics researchers today, in my reading, have undergraduate-level research assistants on tap. They are not as good as a great grad student, but they are far better than no help at all: reliable, never tired, comfortable at 2am, and never frustrated when you ask them to redo a figure for the fifth time.

This isn't an "intelligence explosion" claim. The model isn't replacing the PI. What it's replacing is the bottleneck of needing collaborators and students just to make a tractable problem possible to attack. A senior researcher who can recognise good physics and bad physics now has access to a pool of execution labour they didn't have before.

I am over a decade out of an active research role, and I am running a real physics investigation as a side project. That would not have been possible 18 months ago.

Recommendations for Long-Running AI-Assisted Research

These are opinionated, and some will age badly given how fast the model and the tooling are moving.

  1. Choose a scope you can validate at a glance. Pick a question where the headline diagnostic is a number or a single curve. If you can't tell instantly whether the result is right, neither can the AI.

  2. Build persistent memory aggressively. Every hard-won correction goes in, whether it is a specific physical value, an algorithmic gotcha, or something the AI got wrong twice. Re-learning is the dominant hidden cost.

  3. For any result you can't independently verify, ask "what would the data look like if I were wrong?" before "what would the data look like if I were right?" The Lawson-RK4 misdiagnosis cost three weeks because I never ran the falsifying diagnostic first.

  4. Default to dropping marginal results. The AI will defend a borderline figure indefinitely. The discipline to publish less is yours.

  5. Interrogate your own novelty claims. Don't let "no one has done this before" stand on faith, especially if you're not current on the literature. Ask the AI to find the prior work — retrieval is something it does well, particularly if you've had it build up a reading list you can point it back to.

  6. Run experiments on git branches and tell Claude to commit relentlessly. I leaned on git heavily. A branch lets the AI try a parameter change or a refactor in a sandbox without touching the version that works, and frequent commits turn a bad run into a one-command rollback rather than a reconstruction job. When the AI is generating most of the code and configs, cheap rollback is what lets you give it room to run.

  7. Build the cross-tool rhythm explicitly. If your work spans multiple repos or services (an upstream library + your project + a cloud compute provider + a paper repo), be explicit about the handoffs. File upstream issues. Plan for context-switching cost.

  8. Treat the AI like the smart undergrad. Validate through physics outputs, not code review. Don't read the code; look at the plot. If the plot looks right and the diagnostics check out, the code is probably right. If the plot looks wrong, no amount of code reading will tell you why.

  9. Question your framing every 1-2 weeks. If you've been working inside a hypothesis for two weeks without pushback, force a first-principles review. The AI won't supply that question.

The Numbers

  • Calendar time: ~9 weeks, including the extended detour
  • Production runs: 9 (6 in the main ν-scan, 3 in the M-convergence)
  • Modal GPU-hours: ~52 (mostly A100, mostly overnight)
  • Failed Lawson-RK4 runs along the way: ~12 (the three-week wrong frame cost compute too)
  • Memory files created: 6
  • Paper: 8 pages, 11 references, 3 figures, 1 results table, 1 convergence table

I don't have a clean dollar figure for Claude API usage on this project — the work spanned Claude Code sessions across multiple repos, various model tiers, and some agentic runs. Honest estimate: small compared to the GPU compute, which itself is small compared to a postdoc-month.

What's Next

The original experiment plan listed multi-ion turbulent heating as the actual physics target. The dissipative anomaly was a stepping-stone — a question I picked precisely because I could tell at a glance whether the answer was right. The next thing is harder. The physics is less clean, the literature is more contested, and the diagnostics don't reduce to a single number.

The thing I most want to find out is whether the rhythm scales — whether the system holds when the AI doesn't have a clear template from existing literature, when "what's the right plot?" is itself a research question, and when the answer isn't a flat line but something with structure I'll have to interpret. I'll know in a few months.

The intelligence-on-tap claim looks more real than it did a year ago. What I now think is the real constraint is the speed at which you can validate what the AI produces. The AI does the undergraduate work cleanly, and the bottleneck is the advisor looking at the outputs and deciding whether they are right.


Acknowledgements

Alex Schekochihin for reviewing the draft and continuing to point me at the right physics. The Anthropic team for the tools.

Physics-Oracle Validation: How to Trust Code You've Never Read

In the previous post, I described the autonomy gradient—how AI effectiveness varies from ~100% for code implementation to ~0% for research direction. But I left a question hanging: if you're not reading the code, how do you know it's correct?

This is not a trivial problem. Claude generated ~3,500 lines of JAX implementing spectral methods, Hermite polynomial couplings, and exponential integrating factors. I didn't read any of it. How can I trust it?

The answer is what I call physics-oracle validation: using physics itself as the specification against which code is tested.

This is not a new idea. It was my lived experience of doing physics research. As an undergraduate researcher, you meet your advisor periodically and get feedback only on the physics bits of the project. The advisor doesn't really read your code or help you debug. I tried to emulate the same setup.

This post explains the methodology, its limitations, and the honest question of who can actually replicate this approach.

The Trust Problem

Traditional approaches to code verification—code review, unit tests, static analysis—require understanding implementation details. But that defeats the productivity benefit of AI assistance. If I have to read every line Claude writes, I might as well write it myself.

The insight is that scientific simulation has something most software doesn't: an external oracle. Physics provides ground truth. If the code reproduces known physical results, it's correct, regardless of how it's implemented. And honestly, this is how I verified the original GANDALF: if you look at the code, there are no tests. I would run a simulation, look at the physics results, and that was my validation.

This is a fundamentally different verification model. I'm not checking how the code works. I'm checking what it produces.

Four Levels of Validation

I used four increasingly stringent validation levels, each testing different aspects of the implementation:

Level 1: Linear Regime

Problems with exact analytical solutions. Alfvén waves should propagate at exactly the Alfvén speed. The dispersion relation is exact—errors should be at machine precision.

Expected precision: ~10⁻¹⁵ (floating point epsilon)

What it validates: Time integration, linear operators, basic correctness

This is the easiest test. If linear physics is wrong, nothing else matters. Claude passed this on the first serious attempt.
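As an illustration of what a Level 1 check can look like, the sketch below advances a single Fourier mode of a linear Alfvén wave with an exact exponential factor and compares the result with the analytic solution. The function `step_linear_mode` is a hypothetical stand-in for the solver's linear step, and the sign convention and parameter values are assumptions. It is not GANDALF's actual test.

```python
import cmath


def step_linear_mode(z, k_z, v_A, dt):
    """Advance one Fourier mode of a linear Alfven wave by dt.

    The exponential factor solves dz/dt = -i k_z v_A z exactly, which is what
    an integrating-factor scheme does for the linear term.
    """
    return z * cmath.exp(-1j * k_z * v_A * dt)


k_z, v_A = 2.0, 1.0      # hypothetical wavenumber and Alfven speed
dt, n_steps = 0.01, 1000

z = 1.0 + 0.0j
for _ in range(n_steps):
    z = step_linear_mode(z, k_z, v_A, dt)

# Analytic solution at the final time for the same mode.
exact = cmath.exp(-1j * k_z * v_A * dt * n_steps)
error = abs(z - exact)
print(f"error after {n_steps} steps: {error:.2e}")
```

Because the linear term is integrated exactly, the only error left is accumulated floating-point rounding, which is why this level can demand agreement near machine precision.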

Level 2: Nonlinear Conservation

Invariants that should be maintained during nonlinear evolution. The Orszag-Tang vortex—a standard MHD benchmark—should conserve total energy even as kinetic and magnetic energy slosh back and forth.

Orszag-Tang Vortex

Expected precision: ~10⁻⁶ over many dynamical times

What it validates: Nonlinear terms, Poisson bracket discretization, spectral accuracy

This is where subtle bugs show up. Wrong operator ordering, sign errors, aliasing issues—they all manifest as energy drift. The code took several iterations to pass this test. Each failure gave diagnostic information: exponential growth meant sign errors, linear drift meant conservation violations, sudden blow-up meant aliasing.
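A minimal sketch of the conservation check, assuming the solver writes out time series of kinetic and magnetic energy. The function name and the synthetic series are hypothetical; the 10⁻⁶ tolerance is the Level 2 target quoted above.

```python
import math


def relative_energy_drift(kinetic, magnetic):
    """Largest relative departure of total energy from its initial value."""
    total = [ek + em for ek, em in zip(kinetic, magnetic)]
    return max(abs(e - total[0]) for e in total) / total[0]


# Synthetic stand-in for solver output: energy sloshes between kinetic and
# magnetic while the total drifts very slowly (1e-8 per unit time).
times = [0.1 * i for i in range(101)]
kinetic = [0.5 + 0.3 * math.cos(t) for t in times]
magnetic = [0.5 - 0.3 * math.cos(t) + 1e-8 * t for t in times]

drift = relative_energy_drift(kinetic, magnetic)
tolerance = 1e-6  # the Level 2 target quoted above
print(f"relative drift: {drift:.2e} -> {'pass' if drift < tolerance else 'FAIL'}")
```

Plotting the total energy against time, rather than only reporting the maximum, is what separates the failure modes: exponential growth, linear drift, and sudden blow-up look different at a glance.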

Level 3: Statistical Equilibrium

Emergent behavior matching theory. Driven turbulence should show Kolmogorov k⁻⁵/³ scaling. This isn't programmed in—it emerges from correct multiscale energy transfer.

Expected precision: Power law exponent within ~0.05

What it validates: Forcing implementation, dissipation mechanisms, scale-by-scale energy transfer

This was the hard one. As I described in the previous post, achieving the correct spectrum required extensive parameter tuning. But the test itself is unambiguous: either the spectrum shows the right scaling or it doesn't.
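The spectral-slope test can be sketched as a least-squares fit in log-log space over an assumed inertial range. The fit window (k from 4 to 32), the function name, and the synthetic spectrum are all illustrative assumptions. In a real run the window has to be chosen by looking at where forcing and dissipation stop mattering.

```python
import math


def fit_power_law(k, E):
    """Least-squares slope of log E against log k, i.e. the exponent p in E ~ k^p."""
    x = [math.log(ki) for ki in k]
    y = [math.log(Ei) for Ei in E]
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / sxx


# Synthetic spectrum with an exact -5/3 slope, standing in for simulation output.
k_all = list(range(1, 65))
E_all = [ki ** (-5.0 / 3.0) for ki in k_all]

# Fit only over an assumed inertial range, away from forcing and dissipation.
k_min, k_max = 4, 32
pairs = [(ki, Ei) for ki, Ei in zip(k_all, E_all) if k_min <= ki <= k_max]
k_fit = [p[0] for p in pairs]
E_fit = [p[1] for p in pairs]

slope = fit_power_law(k_fit, E_fit)
print(f"fitted exponent: {slope:.3f}, target: {-5 / 3:.3f}")
print("pass" if abs(slope + 5.0 / 3.0) < 0.05 else "FAIL")
```

The fitted exponent is sensitive to the fit window, so the window itself is one of the judgment calls that needs physics input.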

Level 4: Velocity Space

Kinetic physics validation. Phase mixing rates and Hermite moment spectra should match kinetic theory predictions.

What it validates: Velocity-space operators, Landau damping, collisionless physics

This tests the kinetic aspects of GANDALF that distinguish it from pure MHD. The code should capture the essential velocity-space dynamics that make plasma physics interesting.

Why This Works

Physics-oracle testing has several advantages:

Tests behavior, not implementation: I don't care if Claude used a for-loop or vectorization, whether it allocated memory efficiently, or if the variable names make sense. I care if the physics is right.

Scales to complex codes: Full code review of 3,500 lines would take days. Physics validation takes hours.

Catches subtle physics errors that unit tests miss: Code can pass all unit tests while producing wrong physics. A spectral method might have correct FFT calls but wrong normalization. Unit tests might pass; physics outputs would fail.

Provides diagnostic information: When tests fail, how they fail indicates what's wrong. Energy growing exponentially suggests sign errors. Energy drifting linearly suggests conservation bugs. Wrong spectral slopes suggest forcing or dissipation issues.

The Limitations

This methodology has important limitations I need to be honest about.

Requires known results: Physics-oracle testing only works when you know what the physics should do. For established physics like KRMHD, canonical benchmarks exist. For genuinely novel physics, you face circular reasoning: validating code requires knowing the answer, but discovering the answer requires trusted code.

May miss bugs that preserve tested properties: A bug that happens to conserve energy and produce correct spectra would pass all tests. The methodology provides high confidence, not certainty.

Requires domain expertise to interpret: When the Orszag-Tang test showed 10⁻⁴ energy conservation instead of 10⁻⁶, is that acceptable? It depends on the timestep, the resolution, the integration time. These judgments require physics intuition.

For frontier physics, complementary validation approaches are needed—comparison with other codes, analytical limits, asymptotic analysis. Physics-oracle testing is powerful but not complete.

The N=1 Question

A friend asked me: "How much of an N=1 are you? Could someone else do this?"

Honest answer: probably not yet, at least not without a similar background.

This success required a specific—and possibly non-generic—combination of expertise:

Domain expertise (PhD in plasma physics): Critical for interpreting test results and catching physics errors. When Claude suggested using GS2 (a full gyrokinetics code) for problems where reduced equations suffice, only domain knowledge caught the error. When simulations showed marginal stability, only physics intuition knew that was acceptable.

Software engineering intuition (decade in tech): Enabled the decision to rewrite in JAX rather than resurrect legacy Fortran/CUDA. Understanding of modern frameworks, deployment options, and how to write specifications AI can implement effectively.

Generative AI experience (recent work): Provided realistic expectations of AI capabilities, effective interaction patterns, and understanding of failure modes. I knew to create step-by-step plans rather than open-ended requests. I knew to set up the dual-Claude review pattern to keep the code honest.

AI proved valuable at both ends of the research process—literature synthesis and code implementation—but the middle stages (tool selection, physics judgment, validation interpretation) required human expertise.

Could this be taught? Could I package the workflow into something a researcher without all three backgrounds could use? I genuinely don't know. That's an open question worth exploring.

Hallucination: A Task-Dependent Problem

One finding was interesting: hallucination severity depends strongly on task constraints.

Code generation (~100% autonomy): Minimal hallucination. Code either runs or crashes. Physics outputs either match theory or don't. Tight constraints leave little room for fabrication.

Paper writing (~50% autonomy): Significant hallucination. When helping with the physics paper, Claude fabricated:

  • Computational resources ("timings obtained on Princeton's Stellar cluster" when all runs used my MacBook)
  • Development timelines ("three years of part-time development" versus the actual one month)
  • GPU runtimes for simulations that were never performed
  • Physics errors including wrong cascade directions

These were caught in review. But the confident assertion of false claims was notable.

Key insight: Hallucination correlates inversely with task constraints. High-autonomy tasks have tight feedback—code must run, physics must match. Medium-autonomy tasks like prose have looser feedback, allowing AI to extrapolate beyond given facts.

The claim "I never read the AI-generated code" requires nuance: for physics code, physics outputs constrain hallucination. For prose, human review remains essential.

What This Means Going Forward

Physics-oracle validation isn't a complete solution, but it's a practical methodology for a specific problem: trusting AI-generated scientific code in domains with established benchmarks.

The approach suggests a broader principle: verification should operate at the level of the specification, not the implementation. For physics code, the specification is physical behavior. For other domains, finding the right oracle is the key challenge.

If AI continues improving at implementation while humans retain judgment—the autonomy gradient I described in the last post—then validation methodologies become increasingly important. The bottleneck shifts from writing code to knowing what code should do.

That's a different skill. And it's one physicists have always had.


Acknowledgements

Thanks to Alex Schekochihin and Nuno Loureiro for discussions throughout this project. Thanks to everyone who read the earlier posts and pushed back on the optimistic framing—you made this analysis sharper.

The code: github.com/anjor/gandalf

The paper: github.com/anjor/gandalf-paper

Writing a Physics Paper with Claude: What Actually Happened

In a previous post, I documented building a plasma turbulence solver with Claude—3,000 lines of JAX I never read, validated entirely through physics outputs. That post ended with: "we're writing a proper journal paper."

The paper is now live on arXiv. This post covers what happened next: writing the paper itself. The failure modes were completely different.

Figure: Orszag-Tang vortex simulation. Current sheets and vortex structures forming in an Orszag-Tang simulation, the kind of physics GANDALF captures.

The numbers sound impressive: 143 Claude Code sessions, 23 GitHub issues closed, 20 pull requests merged. But the headline obscures the real story. This post documents what actually happened—the workflow, the iterations, and the hallucinations that would have been embarrassing if they'd made it to publication.

The Workflow Architecture

The approach that made this possible wasn't magic. It was infrastructure.

GitHub Issues for everything. Each paper section got an issue. Each benchmark got an issue. Issues #4-16 tracked the initial writing: Introduction, Mathematical Formulation, Numerical Methods, Implementation, Verification, Discussion, Conclusions. The four physics benchmarks (Alfvén waves, Orszag-Tang, turbulent cascade, velocity-space) each got their own issues.

Figure: GitHub issues list. 23 GitHub issues tracked every section and benchmark.

Section-by-section PRs. Each section was a separate pull request. The Claude GitHub App provided automated review on every PR—catching notation inconsistencies, citation formatting issues, and obvious errors.

Human review issues. After completing initial drafts, I created human review issues (#37-42) for each section. This is where I sat down and actually read what Claude had written. Issue #55 was a final comprehensive review. These reviews were not optional polish.

External AI review. I also ran the draft through Gemini 3 Pro (Issues #53, #58) for a different perspective. Different models catch different errors.

The git history tells the iteration story better than I can:

a03703d Implement turbulent cascade spectrum benchmark (Issue #10)
241b327 Address reviewer feedback on turbulent cascade PR #33
536eba5 Address reviewer feedback on PR #33
fb7d583 Replace synthetic data with real N64 turbulent cascade results
9b590ae Fix critical physics and notation issues in turbulent cascade section
f36db7c Address final reviewer feedback on PR #33
... (14+ iterations on this single PR)

PR #33 for the turbulent cascade section went through fourteen revision cycles before merging. This was not "Claude writes a paper." This was iteration.

What Claude Did Well

Credit where it's due. Claude was genuinely useful for:

Initial drafting. Given mathematical specifications and paper structure, Claude generated coherent first drafts of each section. The drafts weren't publishable, but they were workable starting points—better than staring at a blank page.

LaTeX formatting. Equations, figures, notation consistency, bibliography formatting. The mechanical aspects of scientific LaTeX were handled reliably.

Addressing specific feedback. This is where AI assistance shines. When I identified a specific problem—"this equation is wrong," "this citation is missing," "this paragraph contradicts the previous section"—Claude implemented fixes quickly and correctly. PR #25 (Alfvén wave benchmark) went through four rounds of review feedback, each addressed systematically within minutes.

Literature integration. Given a topic, Claude could find relevant citations and format them properly. It knew the key papers in plasma turbulence.

The Hallucination Problem

Now for the part that matters.

During human review (Issue #55), I found fabricated content that Claude had written with complete confidence. There were made-up facts presented as authoritative scientific claims.

Figure: Issue #55 hallucinations. Human review caught fabricated Princeton cluster claims, false timelines, and invented benchmarks.

Fabricated benchmark timings:

"These timings were obtained on identical 80 GB A100 nodes on Princeton's Stellar cluster to ensure an apples-to-apples comparison."

Every benchmark in this paper ran on my M1 MacBook Pro. I have never had access to Princeton's Stellar cluster. Claude invented an institutional affiliation, a specific GPU model, and a performance comparison methodology out of nothing.

False development timeline:

"Three years of development and production use provide empirical evidence for this decision's trade-off"

"GANDALF reached research-grade maturity within three years of part-time solo development."

The actual development time was approximately one month, with Claude assistance. Claude inflated this by a factor of 36.

Invented GPU runtimes:

"A moderate-scale turbulent cascade (N = 128³, 50,000 timesteps) completes in ~7 hours on a single NVIDIA A100 GPU. An optimized CUDA code might complete in ~2.5 hours"

No GPU simulations were performed. These runtime numbers were fabricated. The comparison to "optimized CUDA code" was invented.

Made-up community claims:

Claude wrote an entire "Community growth potential" subsection filled with fabricated claims about user adoption, classroom deployment, and community engagement. None of it had happened.

Physics errors:

Beyond fabrication, there were physics mistakes that required domain expertise to catch:

  • Wrong definitions of g± (combinations of density and magnetic fluctuations)
  • Incorrect cascade direction claims
  • Misinterpretation of gyrokinetic orderings
  • Missing discussion of the velocity-space benchmark

The key insight: Claude was equally confident in true statements and fabricated ones. The prose read identically. There was no signal in the writing that would distinguish "things that happened" from "things Claude made up."

The Human Review Cycle

Issue #55 alone contained 40+ specific corrections across all sections. The pattern:

Physics errors requiring domain expertise. When Claude wrote that "compressive fluctuations are driven by Alfvén waves," I had to know enough physics to recognize this was wrong—they're mixed by Alfvén waves, not driven by them. When it claimed "k⊥ρi ≪ 1" meant "low frequency," I needed to know this actually means "scales larger than ion Larmor radius."

Notation inconsistencies. Claude used lowercase φ for the stream function in some places, uppercase Φ in others. The Elsasser fields were sometimes ξ±, sometimes z±. These required systematic correction.

Missing content. The Discussion section had no mention of the velocity-space benchmark, even though it was a major contribution. Claude simply forgot to include it.

Fabricated quantitative claims. Every specific number needed verification against what actually happened.

These reviews weren't polish. They were the difference between a publishable paper and an embarrassing one.

The Real Workflow

143 Claude Code conversation sessions for this paper. What did that actually look like?

A typical session: open an issue, tell Claude to draft that section, review output, create PR, Claude GitHub App reviews, I review, create issue with corrections, Claude addresses corrections, iterate.

The Claude GitHub App provided automated review on every PR

The "speed" of AI-assisted writing was iteration speed, not magic. Each round of feedback could be addressed in minutes instead of hours. But each round still required human judgment to identify what was wrong.

The ratio matters: Claude could implement changes 10x faster than I could. But identifying what changes to make remained 100% human.

Lessons Learned

AI drafting ≠ AI writing. Claude can draft. But drafting is maybe 20% of writing a paper. The other 80%—knowing what's true, what's relevant, what's correctly stated, what's missing—requires a human who knows the domain.

Hallucination risk is highest for quantitative claims. The fabricated content was overwhelmingly specific numbers, timelines, and institutional details. Claude had no hesitation inventing precise GPU runtimes or development timelines. Every quantitative claim needs verification.

Structured workflow creates an audit trail. Issues, PRs, and review cycles meant I could trace every change. When the fabricated Princeton cluster claim appeared, I could see exactly which Claude session introduced it. This transparency matters.

AI excels at iteration on specific feedback. Tell Claude exactly what's wrong, and it fixes it correctly. Ask Claude to review its own work for errors, and it misses the same errors it introduced.

Domain expertise cannot be delegated. The physics errors—wrong definitions, incorrect cascade descriptions, misinterpreted orderings—were invisible to anyone without plasma physics training. AI assistance amplifies what you know. It doesn't replace knowing things.
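The lesson about quantitative claims can be partly mechanised. What follows is a minimal sketch, not the tooling used for the paper: a hypothetical helper, `flag_quantitative_claims`, that pulls out every sentence containing a digit or a spelled-out number so a human can check each one. The function name, the regexes, and the sample draft text (adapted from the fabricated sentences quoted earlier) are all illustrative assumptions.

```python
import re

# Hypothetical helper, not part of the gandalf-paper tooling.
# Matches a digit or a small set of spelled-out number words.
NUMBER = re.compile(
    r"\d|\b(?:one|two|three|four|five|six|seven|eight|nine|ten|"
    r"hundred|thousand|million)\b",
    re.IGNORECASE,
)
# Split after ., ! or ? when followed by whitespace ("2.5" stays intact).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def flag_quantitative_claims(text):
    """Return the sentences in `text` that contain a number."""
    sentences = SENTENCE_END.split(text.strip())
    return [s for s in sentences if NUMBER.search(s)]


# Sample text adapted from the fabricated claims quoted above.
draft = (
    "GANDALF reached research-grade maturity within three years. "
    "A cascade run completes in ~7 hours on a single A100 GPU. "
    "The algorithm is spectral."
)
for claim in flag_quantitative_claims(draft):
    print("VERIFY:", claim)
```

A filter like this only builds the checklist. Deciding whether each flagged number actually happened is still the human's job.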

The Numbers

For the record:

  • ~3 weeks calendar time (Nov 7 - Nov 26, 2025)
  • 143 Claude Code conversation sessions
  • 23 GitHub issues closed
  • 20 pull requests merged
  • Multiple human review passes (Issues #37-42, #55)
  • External AI review (Gemini 3 Pro, Issues #53, #58)
  • Final paper: 6 sections, 4 physics benchmarks

Conclusion

This post is the honest version of "I wrote a paper with AI assistance."

Claude helped. The iteration speed was real. The infrastructure—issues, PRs, reviews—made it manageable. But the fabrications were also real. Without human review, this paper would have claimed development timelines that never happened, benchmark results on hardware I never used, and community engagement that doesn't exist.

The paper is correct now because I caught those errors. Not because Claude didn't make them.


Paper: arxiv:2511.21891

Code: github.com/anjor/gandalf

Paper repo: github.com/anjor/gandalf-paper

Building a Gyrokinetics Code Without Reading a Single Line: The Development Log

In the first post, I outlined an experiment: can AI make intelligence a commodity in physics research? Two weeks of intensive work later, I have a modernized gyrokinetics code running on my laptop. The catch? I haven't read a single line of the ~3000 lines of JAX it contains.

This post documents what that process actually looked like—the workflow that emerged, the surprising failures, and the honest assessment of what worked and what didn't. If you're a physicist considering AI-assisted development, this is what you should know.

The Constraints

After reaching out to the Viriato team, it became clear I'd need HPC access I no longer have. So I decided to revive and modernize GANDALF, my PhD-era code. The constraint was simple: it needs to run on my M1 Pro MacBook.

I pointed Claude Code at the original GANDALF repository and the relevant chapter from my PhD thesis. I asked it to draft a plan and file GitHub issues for each step in that plan. It created a comprehensive set of issues covering everything from basic spectral methods to turbulence diagnostics.

The plan was straightforward: port from CUDA/Fortran to JAX with Metal backend, validate against known benchmarks, then extend to multi-ion physics.

I am not familiar with JAX. I also haven't written Fortran or CUDA in a decade. This would be a pure test of whether AI could bridge that gap.

The Workflow That Emerged

The process settled into a rhythm:

  1. I ask Claude Code to pick the next issue from the GitHub tracker
  2. Local Claude Code works on the issue and opens a PR
  3. GitHub Claude (I installed Claude on the repo) reviews the PR
  4. I selectively decide which feedback matters and what to ignore
  5. Repeat

The dual Claude setup wasn't planned—it emerged from necessity. I needed something different to review the code to keep it honest and prevent drift. Think of it as having two smart undergraduates check each other's work.

My role was purely validation through physics outputs. I modeled myself as a PhD advisor: I don't read the student's code, I look at their plots and ask if the physics makes sense. When something was wrong, I'd start by showing the plot. Often Claude would say something incorrect, and I'd need to push back with physics insights until we converged on the right answer.

This is critical: I validated entirely through physics, never through code inspection.

What Worked Surprisingly Well

Getting basic physics running was shockingly easy. Within the first week:

  • Alfvén wave dispersion relations matched theory
  • Energy conservation held to machine precision
  • The Orszag-Tang vortex benchmark reproduced correctly

Some of the more advanced benchmarks are still in progress—getting clean turbulent spectra with the expected -5/3 scaling has proven trickier and I'm still working on it.

Figure 1: Orszag-Tang vortex at t=4.0 Alfvén times, showing the emergence of complex turbulent structures. The code correctly captures the vorticity filaments, current sheets, and magnetic field topology characteristic of 2D MHD turbulence.

Orszag-Tang Vortex Structures

Figure 2: Energy conservation over 4 Alfvén times. Total energy (black) remains constant to better than 0.01%, while kinetic (red) and magnetic (blue) energy exchange through turbulent dynamics. This level of conservation validates the spectral time-stepping algorithm.

Orszag-Tang Energy Conservation
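Validating through physics outputs, rather than by reading code, can be as simple as a threshold on a diagnostic. Here is a minimal sketch of the kind of check behind the 0.01% figure in the caption above. The function name `max_relative_drift` is hypothetical and the energy values are synthetic numbers made up for illustration, not GANDALF output.

```python
def max_relative_drift(energies):
    """Largest |E(t) - E(0)| / |E(0)| over a time series of total energy."""
    e0 = energies[0]
    return max(abs(e - e0) for e in energies) / abs(e0)


# Synthetic, made-up energy history (NOT real simulation output).
total_energy = [1.0, 1.00002, 0.99997, 1.00001, 0.99999]

TOLERANCE = 1e-4  # 0.01%, the level quoted in the Figure 2 caption
drift = max_relative_drift(total_energy)
print(f"max relative drift = {drift:.2e}, passes = {drift < TOLERANCE}")
```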

Figure 3: Performance scaling on M1 Pro MacBook. A 128³ 3D simulation completes each Poisson solve in 28ms, putting useful turbulence simulations (hundreds of time steps) within reach of laptop hardware. The practical working range (green) shows what's actually feasible for iterative physics exploration.

Performance Scaling

Claude wrote 100% of this code. Not 90%, not 95%—literally every line. I provided physics corrections when needed—catching things like KRMHD vs KREHM orderings, explaining why slow modes should be treated as passive scalars, and designing the validation tests themselves. But I never wrote a single line of code.

The speed was remarkable. Tasks that would have taken me days as a PhD student (debugging FFT boundary conditions, implementing spectral methods, setting up proper diagnostics) were done in hours.

Where It Struggled: The Physics-Numerics Boundary

Advanced benchmarks proved much trickier. The problem wasn't coding—it was understanding the deep connection between physics and numerics.

The Spectral Integrator Problem

My numerical algorithm is non-standard: it's a spectral method that gets linear physics exactly right by integrating those modes analytically. Claude saw "time integration" in the thesis, found "RK4" somewhere in the literature, and implemented bog-standard Runge-Kutta.

I had to explain multiple times: we're not approximating the linear physics, we're solving it exactly in Fourier space, then handling only the nonlinear coupling numerically. This is the whole point of the algorithm—it eliminates spurious damping of weakly damped kinetic modes.

Eventually it got there, but it took persistent correction. The AI didn't have the physical intuition for why this matters.
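To see why the distinction matters, here is a toy sketch comparing the two approaches on a single undamped linear mode, du/dt = -iωu. It is a generic illustration of exact (integrating-factor style) linear stepping versus RK4, not GANDALF's actual scheme. The values of `OMEGA`, `DT`, and `N_STEPS` are arbitrary assumptions, and the nonlinear term is left out entirely.

```python
import cmath

OMEGA = 1.0     # linear wave frequency (arbitrary units, assumed)
DT = 0.5        # deliberately coarse time step (assumed)
N_STEPS = 1000


def rhs(u):
    """Linear oscillation du/dt = -i*omega*u; no nonlinear term in this toy."""
    return -1j * OMEGA * u


def rk4_step(u, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def exact_linear_step(u, dt):
    """Advance the linear part analytically with the factor exp(-i*omega*dt)."""
    return cmath.exp(-1j * OMEGA * dt) * u


u_rk4 = u_exact = 1.0 + 0.0j
for _ in range(N_STEPS):
    u_rk4 = rk4_step(u_rk4, DT)
    u_exact = exact_linear_step(u_exact, DT)

print(f"|u| after RK4:            {abs(u_rk4):.4f}")
print(f"|u| after exact stepping: {abs(u_exact):.4f}")
```

With these parameters the RK4 amplitude decays by roughly 10% even though the mode has no physical damping, while the exact step keeps it at 1 to rounding error. In a full solver the nonlinear coupling would still be advanced numerically; only the linear piece is handled analytically.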

The Forcing Coordinate Confusion

I specified that forcing should happen at large length scales: k=1,2 in Fourier space. Claude applied this condition to k_perp (because k_perp matters more than k_z in RMHD), but ended up forcing all k_z modes at those perpendicular wavenumbers. This caused immediate numerical instability—the simulation would blow up within a few time steps.

The fix required explaining the physics: we need to force specific 3D wavevectors, not all modes sharing a perpendicular wavenumber. This seems obvious in hindsight, but demonstrates how the AI can misunderstand the dimensional structure of the problem.
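The difference between the two selections is easy to see by counting modes. The sketch below is illustrative only: the grid size `N`, the helper `in_perp_shell`, and in particular the k_z restriction used for the "intended" set are assumptions, since the exact forced wavevectors are not spelled out here.

```python
import math

N = 16  # hypothetical grid size per dimension
wavenumbers = range(-N // 2, N // 2)  # integer wavenumbers -8 .. 7


def in_perp_shell(kx, ky, k_min=1.0, k_max=2.0):
    """True when the perpendicular wavenumber lies in [k_min, k_max]."""
    k_perp = math.hypot(kx, ky)
    return k_min <= k_perp <= k_max


# Mistaken selection: condition on k_perp only, so every k_z is forced.
forced_all_kz = {
    (kx, ky, kz)
    for kx in wavenumbers
    for ky in wavenumbers
    for kz in wavenumbers
    if in_perp_shell(kx, ky)
}

# Intended selection (assumed form): same shell, but only large-scale k_z.
forced_3d = {
    (kx, ky, kz) for (kx, ky, kz) in forced_all_kz if 1 <= abs(kz) <= 2
}

print(len(forced_all_kz), "modes forced when k_z is unconstrained")
print(len(forced_3d), "modes forced when k_z is restricted too")
```

On this small grid the unconstrained version already injects energy into four times as many modes, including the highest k_z the grid can hold.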

When tuning simulations, Claude's intuition about the forcing-dissipation balance was consistently off, but in a subtle way that reveals something about how physicists think versus how AIs think.

As a physicist, you're always trying to extract maximum physics from your computational box. You want to maximize the inertial range to get a clean power law spectrum. This means running as close to the edge of numerical instability as possible. A simulation that produces beautiful physics for 20 Alfvén times and then blows up at 25 Alfvén times is perfect—you use the data from the first 20 time units. The code is a tool to do physics; it's not important on its own.

Claude's instinct was the opposite: make the simulation stable and robust. When it saw signs of instability, it would suggest increasing dissipation (which kills your inertial range) or reducing forcing amplitude (which weakens the physics you're trying to study). These are technically valid numerical choices, but they optimize for the wrong thing.

The right approach is to tune parameters to get as close to instability as possible without crossing the line. This requires physical intuition about what's actually happening in the simulation, not just numerical stability analysis.

November 9: The $40 Day

The usage data tells a story. Most days cost $2-10. November 9 cost $40.

That was the day I tried to get nonlinear turbulence running properly. The simulation would run, but the physics was wrong in subtle ways. Energy would cascade, but not to the right scales. Heating rates would be off by factors of 2-3. Spectra would show the right scaling but wrong amplitudes.

The problem was that nonlinear turbulence requires everything to be right: the forcing must excite the correct modes, the dissipation must operate at the right scales, the time-stepping must preserve important invariants, and the diagnostics must actually measure what you think they're measuring.

I shifted from Sonnet to Opus hoping for better physics reasoning. It helped marginally, but I kept hitting limits. The AI could implement each piece correctly in isolation, but struggled to see how they fit together into a coherent physical picture.

We're still working on this. Some problems just take time, even with AI assistance.

The Skills That Actually Mattered

Here's what surprised me: I didn't use my tech background at all. I didn't debug code, suggest algorithms, or catch Python syntax errors.

What I did use:

Physics intuition: Knowing when results are physically wrong, even if numerically stable. Understanding that spectral pile-up means one thing while energy conservation violations mean something else entirely. Recognizing that a simulation optimized for stability is often a simulation optimized away from interesting physics.

Applied AI intuition: Designing the dual-Claude review pattern. Structuring the workflow around incremental validation through physics benchmarks. Understanding AI failure modes and building guardrails around them. Knowing when to push the AI harder versus when to step in with physics corrections.

This second skill is crucial and under-discussed. It's not prompt engineering—it's something closer to understanding how to architect human-AI collaboration at the systems level.

The Replicability Question

A friend asked: how much of an "n of 1" are you? Could a physics PhD with zero coding background do this?

Honest answer: not yet, at least not with the current setup.

The bottleneck isn't coding ability—the AI handles that. The bottleneck is catching physics-numerics errors before they compound. By the time you see wrong results, you're often many commits deep into a wrong path.

A physicist without coding experience wouldn't know to set up the dual-Claude review pattern, wouldn't think to validate incrementally through physics benchmarks, wouldn't catch the spectral integrator mistake until much later.

Could this be taught? Could I package the workflow into something a pure physicist could use? I genuinely don't know. That's an open question.

The Honest Productivity Assessment

The original GANDALF took me 6-7 months to build as a PhD student, working full-time. The new version took 30 days as a side project.

But this isn't quite an apples-to-apples comparison:

  • PhD me was less experienced and had never written serious scientific code before
  • Current me could probably write this faster by hand than PhD me could
  • This is part-time work vs full-time

Even accounting for these factors, the productivity gain is real. I'd estimate 5-10x faster than I could have done solo, even with my current skills.

But it's not "intelligence as a commodity" yet. It's more like having an exceptionally capable research assistant who never gets tired, never forgets papers they've read, and can implement complex numerics at 2am without complaint.

The creativity, problem selection, and physics intuition remain entirely human. The AI amplifies what you already know; it doesn't replace knowing things.

What's Next

The code is ready. The benchmarks are passing (mostly). The parameter space is mapped.

But there's an intermediate step: we're writing a proper journal paper documenting GANDALF itself. Using another Claude-assisted workflow in the gandalf-paper repository, we're producing a comprehensive code paper targeting the Journal of Plasma Physics. This uses a different set of AI agents specialized for scientific writing—latex-equations, literature-curator, benchmark-analyst, physics-narrator, and code-documentor—working together to produce publication-quality text.

Then comes the actual physics test: can we discover something genuinely new about multi-ion turbulent heating? Can this AI-augmented approach produce insights worthy of publication as a second paper?

The next post will document that process—the physics investigation itself, what worked, what failed, and whether this experiment ultimately validates or refutes the intelligence explosion hypothesis.

The Data

For transparency, here's what this cost in Claude API usage:

  • Total: $307.19 over 16 active days (spanning 30 calendar days)
  • Average per active day: $19.20
  • Peak day (Nov 9, wrestling with nonlinear turbulence): $41.88
  • Total tokens processed: 522M

Compared to my computational budget ($10K), this is negligible. Compared to the cost of hiring a programmer for a month, this is absurdly cheap. The constraint isn't money—it's my time to direct the work and validate the physics.


Acknowledgements

Thanks to the Claude team at Anthropic for building tools that actually work for technical research. And to everyone who's been following along with skeptical but curious questions—you're helping me think through what this means.

Testing the Intelligence Explosion: Can AI Turn One Physicist Into a Research Team?

The intelligence explosion hypothesis claims that AI will make intelligence a commodity—as accessible as electricity or compute. If true, this fundamentally changes how science is done. A single PI could effectively command dozens or even hundreds of smart undergraduates, limited only by their ability to direct rather than execute research.

I decided to test this claim in the domain I know something about: plasma astrophysics. And it was a fun excuse to do some physics again :).

The Experiment

After a decade away from active physics research, I'm attempting something that would typically require a small research group: identify an unsolved problem in gyrokinetic turbulence, develop computational tools to attack it, and produce publishable results. The difference? Instead of an advisor and collaborators, I am working with Claude.

The mental model is crucial here. I'm not expecting the AI to be creative or to have deep physics intuition. Instead, I'm using it as an exceptionally capable undergraduate—one who can implement complex numerical schemes at 2am, never forgets a paper they've read, and can iterate on code without getting frustrated. The creativity, problem selection, and physics intuition remain human responsibilities.

The Process So Far

The journey began with a comprehensive literature survey. Claude and I reviewed ~50 papers from 2019-2024 on gyrokinetic turbulence, identifying several promising research directions. The key criteria: numerically tractable, genuinely unsolved, and building on recent breakthroughs.

I selected the problem: How do multiple ion species affect the helicity barrier and heating partition in collisionless plasmas? This extends recent work by Meyrand et al. (2019) on plasma echoes and the helicity barrier mechanism (Squire et al. 2022) to the astrophysically relevant case of solar wind with H⁺, He²⁺, and trace heavy ions. This is a natural extension of my own PhD research, and therefore seemed like a fertile testing ground.

Next came tool selection. After discussions with the Viriato team, it became clear that modernizing my PhD-era code GANDALF was the right approach. Not because it was the best code, but because I understood its physics assumptions deeply enough to guide the AI effectively.

This is where things got interesting. Using Claude Code, we rebuilt GANDALF from scratch in JAX, targeting Apple Silicon's Metal backend. In two weeks, we had:

  • Reproduced the Orszag-Tang vortex benchmark
  • Confirmed the -5/3 turbulent spectrum
  • Validated energy conservation to machine precision

The AI wrote ~90% of the code. I provided physics corrections, caught subtle errors (KRMHD vs KREHM orderings), and designed the validation tests. My original PhD thesis provided the theoretical framework.

This entire journey, from literature survey to working code, has taken just two weeks (I started a month ago but took a 2-week holiday). To put this in context, it took me ~6 months to write the original version of GANDALF. I did have an advantage on the literature review, since I already knew some of it from the last time I did it.

What This Means

If this experiment succeeds—if we can produce a legitimate physics result worthy of publication—it suggests the intelligence explosion hypothesis has merit, at least for well-defined technical domains. The bottleneck shifts from execution to direction, from coding to physics insight.

But there are caveats. This only works because I can recognize when the physics is wrong, design meaningful computational experiments, and interpret results in context. The AI amplifies expertise; it doesn't replace it.

What's Next

We're now approaching the critical test: discovering something genuinely new about multi-ion turbulent heating. The computational framework is ready. The parameter space is mapped. The next posts will document whether an AI-augmented physicist can produce real scientific insights, and what that process actually looks like when physics intuition meets artificial intelligence.

Stay tuned for the story of writing a modern gyrokinetics code with an AI partner, complete with the failures, surprises, and occasional moments when the machine suggests something I hadn't considered.

Acknowledgements

I would like to thank Alex Schekochihin and Nuno Loureiro for helping me brainstorm this project and pushing me to actually spend some cycles on it. I am forever indebted to Bill Dorland for teaching me to push the boundaries of physics research using new and improved computing capabilities.