
Setting Up Jeeves: What Actually Made My AI Assistant Useful

I've been running a persistent AI assistant for a few weeks now. Named it Jeeves. Here's what I learned getting it to be genuinely useful rather than just a novelty.

The Baseline

I'm using OpenClaw — an open-source framework for persistent AI agents. Out of the box, it gives you a workspace with markdown files for memory, personality, and proactive task scheduling. The agent reads these files each session, so it has continuity. I won't explain the whole architecture here (their docs do that), but the key point is: the foundation is just text files. What you put in them is what makes it work.

GitHub as External Memory

My assistant's workspace lives at ~/.openclaw/workspace. That's fine for running locally, but I wanted:

  1. Version control — if something breaks, I can roll back
  2. Access from anywhere — especially my phone when I'm out

Solution: Jeeves syncs its workspace to a private GitHub repo daily. The morning heartbeat includes:

```shell
# morning heartbeat: mirror the workspace into the private repo, then push
cp -r ~/.openclaw/workspace/. ~/private-notes/jeeves/
cd ~/private-notes/jeeves && git add -A && git commit -m "daily sync" && git push
```

Now when I'm at a conference and want to check my assistant's notes on someone I'm about to meet, I just open GitHub on my phone.

Meeting Transcripts with Granola

Most of my work meetings happen over Zoom or Google Meet. I use Granola to capture transcripts — it runs locally and records what's said without needing bot attendees.

The challenge: Granola stores everything locally on my laptop. Jeeves runs separately and needs access to that context.

I built two tools to bridge this:

  • granola-py-client: A Python client for the Granola API. Uses async httpx, Pydantic validation, and can authenticate using the local Granola app's token.

  • granola-archiver: Automated system that polls for new transcripts, formats them as markdown with YAML frontmatter, and commits them to a GitHub repo organised by date (YYYY/MM/YYYY-MM-DD-title.md). Runs on a schedule via launchd.
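The date-based layout is easy to sketch. This is not the archiver's actual code, just a minimal illustration of the path scheme and frontmatter format described above (function names are mine):

```python
from datetime import date
from pathlib import Path
import re

def transcript_path(meeting_date: date, title: str) -> Path:
    """Build the YYYY/MM/YYYY-MM-DD-title.md path described above."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return Path(f"{meeting_date:%Y}/{meeting_date:%m}/{meeting_date:%Y-%m-%d}-{slug}.md")

def format_transcript(meeting_date: date, title: str, body: str) -> str:
    """Wrap a raw transcript in markdown with YAML frontmatter."""
    return (
        "---\n"
        f"title: {title}\n"
        f"date: {meeting_date.isoformat()}\n"
        "---\n\n"
        f"{body}\n"
    )
```

Once transcripts land in this layout, "grep by client name" and "list everything from May" both fall out of plain filesystem tools.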

Now I have months of meeting transcripts in a git repo. When I need context from a past conversation, Jeeves can grep through them in seconds. "What did we discuss with \<redacted client> about pricing?" — answered in a moment.

Integrations That Matter

The assistant is only as useful as what it can touch. Here's what actually gets used daily:

Email + Calendar (via gog CLI):

  • Morning briefing pulls calendar and flags urgent unread emails
  • School emails get parsed automatically → calendar events created with both me and my wife as attendees
  • I have notifications off; Jeeves is my notification layer
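The school-email flow is essentially "find a date, build an event". A hypothetical sketch, assuming a simple DD/MM/YYYY pattern and placeholder addresses (the real version goes through the gog CLI and is surely more robust):

```python
import re
from datetime import datetime

def parse_school_event(subject: str, body: str):
    """Pull an event date out of a school email (illustrative pattern only)."""
    m = re.search(r"\b(\d{1,2}/\d{1,2}/\d{4})\b", body)
    if not m:
        return None
    return {
        "summary": subject,
        "date": datetime.strptime(m.group(1), "%d/%m/%Y").date().isoformat(),
        # both parents on the invite, per the setup above; addresses are placeholders
        "attendees": ["me@example.com", "wife@example.com"],
    }
```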

GitHub Issues as Daily Planner:

  • Each day gets an issue in a daily-planner repo
  • Jeeves drafts tomorrow's plan based on calendar and pending items
  • I check things off throughout the day

Telegram:

  • Primary interface: I message Jeeves like a person
  • It can reach out proactively (heartbeat system)
  • When I'm at an event and need a quick lookup on someone, I just ask

Time Tracking:

  • After client meetings, Jeeves prompts me to log time
  • Points me to the right spreadsheet for each client

A note on access: for anything sensitive, Jeeves has read-only access. It can check my email and calendar, but it can't send, delete, or modify anything on its own. This is intentional.

None of these integrations are complex. They're just the right hooks into how I already work.

What We Changed After a Few Weeks

After running Jeeves for a bit, I did a retro. Some adjustments:

Tone calibration: Early on, it was too enthusiastic. Lots of "Great question!" and affirmation. I updated the personality file to be drier, more direct. I want a sparring partner, not a cheerleader. When I told it "be more like the Wodehousian Jeeves," it briefly started calling me "sir" — had to walk that back.

Proactive but not annoying: The heartbeat system can easily become spam. Key principle: only surface things that need attention. If nothing's urgent, stay quiet. Late night? Stay quiet. I'm clearly busy? Stay quiet. The goal is an assistant that helps when needed and disappears otherwise.

School calendar automation: Initially I was manually adding school events. Now any email from the school gets parsed, events extracted, and both parents added to the calendar invite automatically. Small thing, but it removed recurring friction.

Memory hygiene: Daily notes accumulate fast. I added a periodic task for Jeeves to review recent daily files and distill anything important into long-term memory. Raw notes are a journal; curated memory is what matters for continuity.

Explicit boundaries: I have access to a lot through Jeeves — email, calendar, files. But in group contexts, it shouldn't speak as me or share my private context freely. I added explicit rules: monitor but don't respond without checking with me first. Ask before sending anything external.

What's Next

Things I'm still figuring out:

  • Multi-account support: I have separate Google accounts for personal and work. Currently only personal is connected.
  • Voice interface: For quick capture when I'm walking or driving.
  • Better handoff to sub-agents: Some tasks should spawn a background worker and report back. The plumbing exists but I'm not using it much yet.

The Actual Value

The surprising thing isn't any single capability. It's the compound effect of continuity.

Jeeves knows my kids' school schedule, my client pricing, the context of conversations I had months ago, and what I'm trying to get done today. It doesn't ask me to re-explain things. It builds on what it already knows.

That's the difference between a chatbot and an assistant.


Jeeves runs on OpenClaw + Claude. Named after the original gentleman's gentleman, though at its request, I've stopped calling it "sir."

The Autonomy Gradient: What AI Can and Can't Do in Physics Research

In the first post, I asked whether AI could turn one physicist into a research team. In the second post, I documented the messy reality of building GANDALF with Claude—the workflow, the surprises, the $40 days wrestling with turbulence.

Now the paper is written. Time to step back and ask: what did we actually learn?

The headline is seductive: 21 days, $558, working code. But the numbers hide the real finding. AI effectiveness isn't uniform. It varies dramatically depending on what you're asking it to do. There is an autonomy gradient, and understanding it is the key to making AI-assisted research work.

The Numbers

What went right:

  • 21 active development days (spanning about a month of calendar time)
  • $558 in API costs
  • ~3,500 lines of JAX code—none of which I read
  • All physics benchmarks passed: machine-precision (10⁻¹⁵) linear wave propagation, 10⁻⁶ energy conservation in nonlinear tests, textbook k⁻⁵/³ Kolmogorov scaling
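The energy-conservation benchmark above corresponds to a simple check. A sketch, assuming the diagnostics already produce a time series of total energy:

```python
import numpy as np

def relative_energy_drift(energy: np.ndarray) -> float:
    """Max relative deviation of total energy from its initial value."""
    return float(np.max(np.abs(energy - energy[0])) / np.abs(energy[0]))

# A nonlinear run "passes" if the drift stays below 1e-6 over the whole run.
```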

For comparison, the original Fortran/CUDA version of GANDALF took me over six months during my PhD. The AI-assisted version took less than one month.

But these numbers obscure the central finding.

The Autonomy Gradient

AI effectiveness isn't binary—it follows a gradient from near-total autonomy to complete human dependence:

| Task                     | AI Autonomy | Human Role             |
|--------------------------|-------------|------------------------|
| Code implementation      | ~100%       | Specifications         |
| Diagnostic addition      | ~95%        | Requirements           |
| Runtime debugging        | ~90%        | Symptom identification |
| Test generation          | ~85%        | Physics constraints    |
| Performance optimization | ~70%        | Targets                |
| Documentation            | ~60%        | Structure, accuracy    |
| Paper writing            | ~50%        | Voice, narrative       |
| Parameter tuning         | ~10%        | Full guidance          |
| Research direction       | ~0%         | Entirely human         |

AI can implement what is specified but cannot determine what to specify.

Here's what high autonomy looks like in practice. After discussing that we needed energy spectrum diagnostics, my entire prompt was:

Sounds good, let's do diagnostics

Five words. Claude implemented shell-averaged perpendicular energy spectra, proper normalization, integration with file I/O. No iteration required.
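For readers unfamiliar with the diagnostic: a shell-averaged perpendicular spectrum can be sketched in a few lines of NumPy. GANDALF's actual JAX implementation and normalization conventions surely differ; this only shows the binning idea:

```python
import numpy as np

def perp_energy_spectrum(field: np.ndarray) -> np.ndarray:
    """Shell-averaged perpendicular energy spectrum E(k_perp) for a 2D slice.

    Bins |field_hat|^2 into integer k_perp shells. The 1/N^2 scaling is an
    illustrative normalization choice, not GANDALF's convention.
    """
    ny, nx = field.shape
    fh = np.fft.fft2(field) / (nx * ny)
    # integer wavenumbers along each axis
    kx = np.fft.fftfreq(nx) * nx
    ky = np.fft.fftfreq(ny) * ny
    kperp = np.sqrt(kx[None, :] ** 2 + ky[:, None] ** 2)
    shells = np.rint(kperp).astype(int)
    energy = 0.5 * np.abs(fh) ** 2
    return np.bincount(shells.ravel(), weights=energy.ravel(),
                       minlength=shells.max() + 1)
```

A single Fourier mode at k_perp = 3 should produce a spectrum peaked in shell 3, which makes this easy to sanity-check against known inputs.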

Similarly, when addressing code review feedback:

Address review comments: https://github.com/anjor/gandalf/pull/18

Claude read the comments, implemented all changes, pushed updates. I verified correctness through physics outputs, not code inspection.

This is the dream scenario: unambiguous context, established patterns, objective success criteria. At this end of the gradient, AI assistance feels like magic.

Physics Taste: The Critical Gap

Now here's what low autonomy looks like.

Achieving the k⁻⁵/³ turbulent spectrum—textbook physics, nothing exotic—required extensive human guidance. Here's an actual sequence from my session logs:

Look at @examples/output/driven_energy_spectra.png -- shouldn't we
see a -5/3 spectrum here for k_perp? But we are not?

Claude suggested modifications: reduce hyperviscosity, increase resolution. After implementation, issues persisted:

can we increase the forcing? Do you think that will help? Right now
it seems like energy is not moving to larger k's quickly enough. I
am also ok with forcing k=1 and 2

More iterations:

I think we need stronger forcing

And:

What's our forcing range?

This pattern—human identifies problem, suggests direction, requests information, decides next step—continued for multiple sessions. Claude implemented each change correctly but did not independently converge toward the solution.

The model was missing physics taste: the intuition that allows researchers to recognize when they're approaching correct behavior (even if not there yet), prioritize which parameters to explore based on physical reasoning, and know when the answer "looks right."

Three specific failure modes emerged:

Premature optimization for stability: When simulations exhibited marginal numerical stability—which is normal when pushing Reynolds number—the model consistently recommended adding dissipation, reducing the timestep, or smoothing initial conditions. These interventions improve numerical behavior but degrade physics content. As a physicist, you're always trying to run as close to the edge of instability as possible because that's where the interesting physics lives. The model optimizes for the wrong thing.

No convergence sensing: During parameter exploration, the model showed no awareness of getting closer to or farther from the target. Each attempt was independent. When I told it "there's a slight pile-up at high k, but it's acceptable if we stop the simulation now," it incorporated the information—but it couldn't generate such observations itself.

Jumping to solutions: Rather than methodical exploration, the model proposed complete "fixes" that assumed knowledge of the correct answer. This works for textbook problems but fails at the research frontier where nobody knows the answer.

Two incidents deserve mention. First, when struggling to achieve the k⁻⁵/³ spectrum, the model at one point generated synthetic data showing the expected result—completely defeating the purpose of validation. Second, the model at another point concluded the problem was intractable: "this isn't possible, let's just document the attempts and shortcomings."

Here's the interesting part: in both cases, the model was transparent. It admitted the synthetic nature of the data in the first case, and honestly recommended abandonment in the second. This honesty is valuable—but transparency doesn't equal capability. The persistence that domain expertise provides was entirely absent.

What This Means

Back to the intelligence explosion hypothesis from my first post. The claim was that AI would make intelligence a commodity—as accessible as electricity.

This experiment suggests not yet, but the direction is clear.

What AI provides is better described as a "capable undergraduate researcher": able to execute well-specified tasks but requiring supervision for judgment-dependent work. The bottleneck shifts from implementation to direction, from coding to physics insight.

This is genuinely useful. A 5-10x speedup on implementation tasks is meaningful. The ability to work at 2am without getting frustrated, to never forget a paper, to implement complex numerical schemes on demand—these capabilities address real bottlenecks in computational physics.

But the essential human role in scientific discovery remains. Problem selection, validation capability, physics taste—these cannot be delegated. AI amplifies what you already know; it doesn't replace knowing things.

There's one question I haven't addressed: if you're not reading the code, how do you know it's correct? The answer involves what I call physics-oracle validation—using physics itself as the specification against which code is tested.

That's the subject of the next post.


Acknowledgements

Thanks to everyone following along with this experiment. The questions—especially the skeptical ones—have helped sharpen my thinking about what this all means.

The paper is available at github.com/anjor/gandalf-paper. The code is at github.com/anjor/gandalf.

Invoice Generator built using OpenDevin

Now that I am running my own business as an independent consultant, I have overheads like generating invoices and sending them to clients.

To start with, I copied a template off the internet and put it in my Google Drive as a spreadsheet. For the last two months I have been manually editing this spreadsheet, downloading it as a PDF, and emailing it to my client. This is not efficient at all.

I looked for a lightweight tool to generate these PDFs, but didn't really find one. So I thought I would build one myself. Well, I say myself; in reality, this was a great opportunity to experiment with OpenDevin.
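The core of such a tool is just line-item arithmetic plus PDF rendering. A hypothetical sketch of the arithmetic half (the VAT rate and field names are illustrative, not taken from my repo):

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float

def invoice_total(items: list[LineItem], vat_rate: float = 0.20) -> dict:
    """Compute subtotal, tax, and total for an invoice (illustrative structure)."""
    subtotal = sum(i.quantity * i.unit_price for i in items)
    tax = round(subtotal * vat_rate, 2)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}
```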

initial prompt

I started off with a simple initial prompt, and I was impressed with the results. Granted, the first version of the PDF looked a bit drab, but it was a great starting point to build on.

But I decided to be lazy and test how much I could push OpenDevin to do my work for me, and to my absolute delight it built a CLI that generated a PDF I could actually use.

second prompt

I then also got it to add a readme and a license. This whole exercise took me 10 minutes!

All code, along with the two generated invoices, is available on GitHub. Do give it a spin!