Marty Roadmap

Where marty goes from here. Expand, don’t rebuild. The framework landscape (May 2026) doesn’t justify a rewrite for a single-persona Discord bot with ~10 tools — every serious framework would ask us to port src/tools/ into their abstractions and then write the same Discord/persistence/RAG glue on top. Custom Python + Anthropic SDK is the lowest-lock-in option on the table.

This roadmap is what changes about Marty over ~6 weeks of (fractional) solo work to (a) close the gaps the Carrie interview surfaced and (b) set up the user-facing surface for agent-coordinated operations (agent-coordinated-operations).

Reframe: Marty is platform, not application

Earlier framing treated Marty as a Dungeon-specific bot. That was wrong. The right framing:

  • Marty (the burnt-out wizard book bot) is Dungeon-specific. Stays Dungeon-only. Does not ship to partner shops.
  • Luna (young wizard apprentice) is Dungeon’s instance of the cozy/recommender + support persona. Same world as Marty (both wizards), different audience (cozy/romantasy/literary, the audience Marty’s heavy-metal voice doesn’t reach). Luna stays Dungeon-only. Decided 2026-05-02 in response to Carrie’s brand audit (continued-luna-young-wizard-apprentice). Deferred 2026-05-03 — see “Interim: Marty support mode” below.
  • Shopkeeper is the platform persona-template (the role). Every partner shop instantiates their own Shopkeeper with their own name, character, and voice. Phil at Victory Point picks his shop’s character; Sarap Shop picks theirs. The template ships; the named instances are per-shop. Luna is Dungeon’s instance of this template.
  • The agent infrastructure (Claude Agent SDK loop + tool dispatch + Langfuse + escalation primitive + persona-as-YAML) is the platform layer. Marty, Luna, and every partner shop’s Shopkeeper instance are all consumers of it.

Interim: Marty support mode (decided 2026-05-03)

Defer Luna. Ship customer support as a second register inside Marty rather than a second persona. Lower complexity, faster to ship, accepts the known cost: weirdly formal Marty when the topic is policy/ops.

  • One persona, one prompt, two registers. Default is current voice (lowercase, chill, books). Support register is capitalized, formal, professional and triggers on intent (refund, ticket, hours, return, special order, member lookup, escalation).
  • Register switch is announced (“switching to support mode…”) so the user sees the handoff.
  • Marty in support mode never names staff. He refers to “staff” or “the team.” No “Carrie will get back to you,” no “ask Panat,” no “the DM said no” — those names are internal. Customer-facing language is staff/team only.
  • Marty cannot refund, comp, override policy, or commit to outcomes. Those paths escalate via open_issue + notify_human.
  • This is explicitly the failure mode the original Luna argument warned against (mode-mixing in one voice). Accepted as a deliberate trade for shipping speed. Revisit when (a) Carrie has bandwidth to draft Luna’s voice spec, or (b) support volume makes the formal-Marty register an active brand cost.
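
A minimal sketch of the intent trigger and announced handoff described above. It assumes a plain keyword match, which is cruder than model-side intent detection; the pattern list and the function names `pick_register` and `handoff_prefix` are hypothetical and only mirror the intents listed in this section.

```python
import re

# Hypothetical trigger list, copied from the intents named in this section.
SUPPORT_PATTERNS = [
    r"\brefunds?\b", r"\btickets?\b", r"\bhours\b", r"\breturns?\b",
    r"\bspecial orders?\b", r"\bmember(ship)?\b", r"\bescalat\w*\b",
]
SUPPORT_RE = re.compile("|".join(SUPPORT_PATTERNS), re.IGNORECASE)


def pick_register(message: str) -> str:
    """Return 'support' when a support intent keyword appears, else 'default'."""
    return "support" if SUPPORT_RE.search(message) else "default"


def handoff_prefix(previous: str, current: str) -> str:
    """Announce the switch only when entering support mode."""
    if previous != "support" and current == "support":
        return "switching to support mode…\n"
    return ""
```

Keyword matching will misfire on phrases like "return of the king", so a real version would likely let the model classify intent and keep a list like this only as a fallback.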

Persona-as-YAML is the per-shop tenancy primitive on the agent layer. Same role the Payload globals + multi-tenant plugin play on the data layer. Architecturally consistent.

Partner-shop onboarding will include: “configure your shop’s Shopkeeper persona.” Carrie helps Phil define his Shopkeeper voice during onboarding. This is the platform’s first AI-native feature shipping to partner shops, and probably the most differentiated.

Architectural primitive: escalation as liberal-paternalism, instantiated

The escalation tool (open_issue + notify_human) is the most important thing in this roadmap, not a footnote. Worth naming as the pattern:

Agents can read everything. Mutating actions of consequence escalate to a human-reviewed issue. Path of least resistance for any consequential action is to file a tracked work-unit a human reviews. This is the architectural form of liberal paternalism applied to agent operations — agency preserved (operator can override, redirect, approve faster paths over time), paved path is low-friction (agent does the prep work, human reviews).

This is also what makes the agent layer safe to ship to partner shops. Without escalation, agents either (a) mutate live store data with no audit, or (b) refuse to do anything useful. Escalation collapses both failure modes.

Document this pattern explicitly in agent-coordinated-operations alongside the scheduled-research-job pattern (see below).

Framework decision

Keep custom Python+FastAPI stack. Adopt Claude Agent SDK as the agent loop. Add Langfuse for observability. Defer dedicated memory framework.

What we considered and rejected:

  • LangGraph 2.0 — earns its complexity above ~5 agents and stateful workflows; overkill for Marty.
  • Pydantic AI — closest “if starting fresh” candidate; not worth a rewrite for what we have.
  • Mastra / Vercel AI SDK — TypeScript-only; rewrite cost is wrong trade.
  • OpenAI Agents SDK / AgentKit — ties us to OpenAI infra; we’re Anthropic-first.
  • CrewAI — wrong shape (sequential pipelines, not interactive chat).
  • Letta (memory-first) — keep as future option for cross-session memory if/when needed.

What we keep:

  • discord.py for transport.
  • Custom src/tools/ dispatch (it’s already shaped right).
  • Anthropic SDK as model layer.
  • Postgres + SQLAlchemy + Alembic.
  • Railway deploy.

The agent loop upgrade

Replace hand-rolled tool dispatch with claude-agent-sdk (Python). Get for free:

  • Context compaction across long sessions.
  • max_budget_usd cost caps (matches “not full-send 24/7” posture).
  • MCP tool support — our src/tools/ modules port either as MCP tools or native tool definitions.
  • Community-maintained loop logic.

1-2 day refactor. Single highest-leverage move because every other capability stacks on top.

Known coupling, called out honestly: adopting claude-agent-sdk couples us to Anthropic’s loop semantics. If the SDK shape changes or patterns deprecate, we eat migration. This is a fine trade vs. rolling our own — but mitigate by keeping src/tools/ portable and not depending on SDK-internal types. If the dispatch layer stays clean, swapping the loop is bounded work. Verify during the refactor: every tool’s interface should be expressible as plain function signatures + JSON schemas, not as SDK-specific decorators or types. If we find ourselves importing internals from the SDK into src/tools/, push back.
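
One way to picture the portability check: each tool is a plain function plus a JSON schema, held in a registry that imports nothing from the SDK. This is a sketch under assumptions; the `TOOLS` registry, the `dispatch` helper, and the stubbed `get_doc` body are illustrative names, not the current src/tools/ layout.

```python
from typing import Any, Callable


def get_doc(slug: str) -> str:
    """Hypothetical tool body; the real one would fetch from the docs repo."""
    return f"(contents of {slug}.md)"


# Plain JSON schema describing the tool input. No SDK types involved.
GET_DOC_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "slug": {"type": "string", "description": "Doc path, e.g. 'events'"},
    },
    "required": ["slug"],
    "additionalProperties": False,
}

# name -> (callable, schema). Any agent loop can adapt this mapping.
TOOLS: dict[str, tuple[Callable[..., Any], dict[str, Any]]] = {
    "get_doc": (get_doc, GET_DOC_SCHEMA),
}


def dispatch(name: str, args: dict[str, Any]) -> Any:
    """Look up a tool by name, check required args, and call it."""
    fn, schema = TOOLS[name]
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"{name}: missing required args {missing}")
    return fn(**args)
```

If every tool fits this shape, wiring it into a different loop is an adapter over `TOOLS` rather than a rewrite of the tools themselves.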

Multi-persona architecture

Today: one persona (Marty the burnt-out wizard, book recommendations).

Decision: two personas at Dungeon, one Discord bot user, one service, intent-routed with observable handoff.

  • Marty stays the burnt-out wizard book bot. Voice: lowercase, chill, sells books. Audience: OSR / D&D / metal / sword-and-sorcery / RPG zines. Tools: Hardcover, Scryfall, Manapool, RSS digests, book-research jobs (see below).
  • Luna is the young wizard apprentice — Dungeon’s Shopkeeper instance. Audience: cozy fantasy, romantasy, book club, slice-of-life, literary fiction, the audience Marty doesn’t reach. Handles both cozy book recs and support/policies/events/member questions. Tools: policy lookup, Square read, Hi.Events read, escalation, plus her own book tools (Hardcover with cozy-genre preferences). Lore: not Marty’s apprentice — she trained elsewhere, works at Dungeon as a colleague. Different specialty.

Both run on the same agent SDK, same tool dispatch, same observability. Persona definitions live in YAML (system prompt, allowed tools, allowed channels, escalation rules). New persona = new YAML file.
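
A sketch of what a loaded persona definition might look like once the YAML has been parsed into a mapping. The field names, the `Persona` class, and the channel and tool values are assumptions for illustration; only the four categories (system prompt, allowed tools, allowed channels, escalation rules) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    system_prompt: str
    allowed_tools: frozenset[str]
    allowed_channels: frozenset[str]
    escalation_repo: str


def persona_from_mapping(data: dict) -> Persona:
    """Build a Persona from an already-parsed YAML mapping (parser not shown)."""
    return Persona(
        name=data["name"],
        system_prompt=data["system_prompt"],
        allowed_tools=frozenset(data.get("allowed_tools", [])),
        allowed_channels=frozenset(data.get("allowed_channels", [])),
        escalation_repo=data["escalation"]["repo"],
    )


def can_use(persona: Persona, tool: str, channel: str) -> bool:
    """A tool call is allowed only if both the tool and the channel are listed."""
    return tool in persona.allowed_tools and channel in persona.allowed_channels
```

Usage, with made-up values: `persona_from_mapping({"name": "luna", "system_prompt": "...", "allowed_tools": ["get_doc"], "allowed_channels": ["support"], "escalation": {"repo": "dungeonbooks/ops"}})`.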

One bot user, not two. Discord users will mention @marty regardless of intent. Forcing them to pick the right bot is a tax on the user. Routing happens inside the service. Mode-confusion risk is solved by making the persona-switch visible in the response, e.g., “Marty here. Luna’s better at romantasy, let me grab her…”, then Luna takes over. Persona-switch is observable to the user and to Langfuse. Reversible decision: splitting one bot user into two later is easier than consolidating two into one.

Possible refinement once Luna ships: rename the Discord bot user from @marty to something neutral (the store’s name, a building, a place) so neither persona owns the bot identity. Defer until Luna is in production and we see how users actually mention.

Why two voices, not one: Marty’s character is real value. Mixing modes risks both — book recs become formal, support becomes whimsical. “Your refund is denied per the 48-hour policy” is wrong in Marty’s voice. And Marty’s heavy-metal voice actively repels the cozy/romantasy audience Luna is built to serve.

Luna voice spec (Carrie writes the first draft)

“Warmer than corporate, less in-character than Marty” was too thin. Luna sharpens it: a young wizard apprentice, cozy-fantasy register, in-universe with Marty but a deliberate counterweight to his heavy-metal voice.

Audience target (Carrie’s brand audit): Marty’s wizard logo and voice skew masculine, heavy metal, OSR — and we’re losing the cozy/romantasy/literary audience as a result. Luna is the second door into the same store, designed for the audience Marty doesn’t reach. Specifically: cozy fantasy, romantasy, book club, slice-of-life, picture books, illustrated middle grade, literary fiction.

Luna’s role is broader than ops. She is also a book-recommendation persona for a different audience. Brand expansion that happens to also handle policies, not the other way around.

Lore: Luna trained elsewhere (offstage mentor, “head librarian at the magical archives” or similar — Carrie’s call). She is not Marty’s apprentice. She and Marty are colleagues at Dungeon Books with different specialties. This matters: putting Luna in a subordinate position to a male character would undermine the brand-diversification move.

Foundational books / canon (Luna’s equivalent of Marty’s SICP and Three Hearts and Three Lions): the cozy-fantasy canon. Carrie picks. Candidates to draw from: Tamora Pierce, Diana Wynne Jones, Patricia C. Wrede, T. Kingfisher, Naomi Novik, Margaret Rogerson, Olivia Atwater, Travis Baldree (Legends & Lattes), Studio Ghibli, Little Witch Academia, The Owl House.

Visual identity: Luna is a wizard, but not a heavy-metal wizard. Watercolor / illuminated-manuscript / cottagecore. Cozy magical-girl tradition (think Kiki’s Delivery Service, Howl’s Moving Castle, A Wizard’s Guide to Defensive Baking) rather than sword-and-sorcery. Get a visual direction (not a finished logo) locked before Luna ships in week 3-4.

Carrie writes the first draft because she has the customer-facing instinct. Spec should match the granularity of Marty’s:

  • Tone descriptors (specific adjectives: “warm, direct, calm”).
  • Sentence length norms.
  • Capitalization and punctuation rules.
  • Actual phrases to use (greetings, acknowledgments, escalations).
  • No-go phrases (“unfortunately our policy states,” “as per,” etc.).
  • How to handle a frustrated customer vs. a confused one vs. a one-line factual question.
  • Escalation triggers — what kinds of asks always escalate to a human, regardless of policy clarity.
  • How to introduce itself (vs. Marty’s wizard-persona intros).
  • What the persona is not (not Marty, not a corporate support bot, not pretending to be human).

Save as references/luna-voice.md (Dungeon’s instance). The generic template lives separately as references/shopkeeper-template-voice.md once we have a second shop to abstract from. First version doesn’t need to be polished. Iterate from real interactions captured in Langfuse.

Tool families to add

src/tools/docs/

Source of truth: dungeonbooks/docs (public GitHub repo). A separate Obsidian vault (sibling to the private notes/ vault) hosts the operational documentation of Dungeon Books. Quartz renders it to a public help center; Marty fetches from it at runtime.

Why a separate repo, not the private notes/ vault:

  • Trust boundary is a repo boundary, not a config check. One private vault filtered by path-whitelist or visibility flag is one frontmatter typo away from leaking strategy notes into a customer answer.
  • The public docs are the same source the customer-facing help center renders from. One artifact, two consumers (humans browsing, Marty fetching).
  • Drafts of operational knowledge live in the public docs repo with publish: false. Drafts of internal thinking stay in notes/ and never enter the public repo. The split tracks audience.

Fetch mechanism: raw GitHub URLs over HTTPS. No GitHub API auth, no PAT, no filesystem writes. Marty fetches:

https://raw.githubusercontent.com/dungeonbooks/docs/main/{slug}.md

Where {slug} is a path like events, orders, store, or policies/event-ticket-policy. Content lives at the repo root, no content/ prefix.

In-memory cache with TTL (~15 min). On boot Marty pulls the root index and the listed slugs; on /marty reload he refreshes immediately. A read-only Railway filesystem is fine: the cache is an in-process Python dict.
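
The fetch and cache behavior could be sketched roughly as follows. The URL pattern and the ~15 minute TTL come from the text; the `DocCache` class, the slug validation rule, and the injectable `fetch` and `clock` parameters are assumptions added to keep the sketch testable without network access.

```python
import re
import time
from typing import Callable
from urllib.request import urlopen

RAW_BASE = "https://raw.githubusercontent.com/dungeonbooks/docs/main"
TTL_SECONDS = 15 * 60
# Assumed slug shape: lowercase path segments, no dots, so no path traversal.
SLUG_RE = re.compile(r"[a-z0-9-]+(/[a-z0-9-]+)*")


def doc_url(slug: str) -> str:
    if not SLUG_RE.fullmatch(slug):
        raise ValueError(f"bad slug: {slug!r}")
    return f"{RAW_BASE}/{slug}.md"


def http_fetch(url: str) -> str:
    """Plain HTTPS GET, no auth."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


class DocCache:
    """In-process dict cache: slug -> (fetched_at, text)."""

    def __init__(self, fetch: Callable[[str], str] = http_fetch,
                 ttl: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl, clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get_doc(self, slug: str) -> str:
        now = self._clock()
        hit = self._entries.get(slug)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        text = self._fetch(doc_url(slug))
        self._entries[slug] = (now, text)
        return text

    def reload(self) -> None:
        """Drop everything so the next lookup refetches (the /marty reload path)."""
        self._entries.clear()
```

Nothing here writes to disk, which is why the read-only filesystem is not a constraint.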

Publication gate: publish: true frontmatter. Mirrors Quartz’s ExplicitPublish plugin. Same one-bit flag controls Quartz render and Marty visibility. Forgetting the flag means a finished policy doesn’t render (annoying, recoverable). The opposite default would mean leaking unfinished pages (not recoverable).

Audience within a published page: customer-facing prose visible, agent-only directives in HTML comments. Quartz strips comments from rendered HTML; Marty parses them from raw markdown. Examples of comment content: when to escalate, what not to promise, never-name-individuals rules.
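
A minimal sketch of how the raw markdown might be split into the publish flag, customer-facing text, and agent-only directives. It assumes simple `key: value` frontmatter and hand-rolls the parse with regular expressions; a real version would probably use a proper YAML frontmatter parser. The function name `parse_doc` and the returned keys are hypothetical.

```python
import re

FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def parse_doc(raw: str) -> dict:
    """Split raw markdown into publish flag, customer text, and agent directives."""
    published = False  # fail closed: no flag means not visible
    body = raw
    match = FRONTMATTER_RE.match(raw)
    if match:
        body = raw[match.end():]
        for line in match.group(1).splitlines():
            key, _, value = line.partition(":")
            if key.strip() == "publish":
                published = value.strip().lower() == "true"
    directives = [c.strip() for c in COMMENT_RE.findall(body)]
    customer_text = COMMENT_RE.sub("", body).strip()
    return {
        "published": published,
        "customer_text": customer_text,
        "agent_directives": directives,
    }
```

A page with no frontmatter, or with a misspelled flag, comes back as unpublished, which matches the recoverable failure direction described above.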

Index file pattern. Root index.md is the table of contents (topic and policy slugs with one-line summaries, plus an agent_index HTML comment). Marty inlines this index into his system prompt at startup (prompt-cached). When a customer asks a question, Marty picks a slug from the index and calls get_doc(slug) to fetch the full file. No vector store, no embeddings, no chunking.

Why this shape: transparent (Langfuse spans log which slug was fetched for which question), low infra (one HTTPS GET per lookup against a CDN), and edits don’t require Marty redeploys.

Topic vs policy split. Two file shapes live in the repo:

  • Topic files (root): customer-shaped, one per area of interest. Self-contained for the common case. Currently store.md, events.md, orders.md. Topic files describe offerings and procedures. When a question hits a rule, the topic file references the relevant policy.
  • Policy files (policies/): atomic rules with conditions and exceptions. Currently event-ticket-policy.md, return-policy.md. Smallest unit of a rule. Referenced from topic files, never duplicated.

Distinction test: descriptive/procedural answer → topic. Conditional yes/no → policy.

Live in repo (as of 2026-05-03):

  • index.md — root index, agent-readable TOC.
  • store.md — hours, location, contact (publish: true).
  • events.md — formats, booking via hi.events, event-day expectations (publish: false, draft pending Carrie input).
  • orders.md — online orders, pickups, special orders (publish: false, draft).
  • policies/event-ticket-policy.md — cancellation, transfer, late-arrival, DM-cancellation rules.
  • policies/return-policy.md — 7-day window, condition, non-returnable items, online orders, gift returns.

TBD (Carrie audit):

  • Flesh out events.md and orders.md from drafts. Both have todo blocks in HTML comments listing the questions.
  • membership.md (topic) — guild, programs, tier benefits. Links to membership-platform.
  • policies/membership-policy.md — tier rules and signup, only if rules become formal.
  • Hi-events discount mechanics — confirm rules, decide whether it’s a section in events.md or its own policy file.
  • Vendor/zine submissions — currently low volume, defer.
  • DM contact protocol — defer until Carrie has a stock answer.

src/tools/square/

Square API, read-only. Member lookup, recent orders, inventory check. Read-only is the right v1 — bot quotes facts, doesn’t mutate. Mutations stay human-reviewed.

This was the original Marty vision. Still correct.

Member-lookup must abstract over the cross-merchant identity layer. When cross-merchant-identity eventually ships, Shopkeeper at any partner shop should recognize members from other participating shops (“I see you’re a Mithril member at Dungeon Books — welcome to Victory Point, your tier benefits transfer here”). This is one of the network’s value props. The member-lookup tool’s signature should accept a member identifier and return resolved member data — today it returns shop-local data, tomorrow it can return network-wide data. Don’t hardcode shop-local assumptions in the tool API.
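
One possible shape for that signature: the tool depends on a small interface, and the shop-local implementation is just the first thing plugged into it. The names `Member`, `MemberDirectory`, and `ShopLocalDirectory` are hypothetical, and the in-memory dict stands in for the Square read.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Member:
    member_id: str
    home_shop: str  # where the membership was issued
    tier: str


class MemberDirectory(Protocol):
    """Anything that can resolve a member identifier to member data."""

    def lookup(self, member_id: str) -> Optional[Member]: ...


class ShopLocalDirectory:
    """Today's behavior: only knows this shop's members (dict stands in for Square)."""

    def __init__(self, shop: str, tiers: dict[str, str]):
        self._shop = shop
        self._tiers = tiers

    def lookup(self, member_id: str) -> Optional[Member]:
        tier = self._tiers.get(member_id)
        return Member(member_id, self._shop, tier) if tier is not None else None


def member_lookup(directory: MemberDirectory, member_id: str) -> Optional[Member]:
    """Tool entry point: takes an identifier, returns resolved member data."""
    return directory.lookup(member_id)
```

A network-wide directory would implement the same `lookup` method, so the tool signature and the callers would not need to change.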

src/tools/hi_events/

Ticket lookup. “Did this user buy a ticket? When? Refund window status?” The full end-to-end automation of yesterday’s case. See hi-events-api.

src/tools/escalation/

  • open_issue(repo, title, context) → opens an issue in the persona’s configured escalation repo with conversation context attached.
  • notify_human(channel, message) → pings the persona’s configured human(s) in Discord.

This is the bridge to the Symphony loop (symphony-pattern-agent-control-plane) and the architectural primitive of agent-coordinated operations. Marty/Luna can’t refund, but they can open a tracked work-unit a human reviews.

Escalation target is per-persona, per-shop. Luna at Dungeon escalates to dungeonbooks/ops. Phil’s Shopkeeper instance at Victory Point escalates to victorypoint/ops (or whatever Phil sets up). Not a centralized platform issue queue — Carrie should not end up triaging Victory Point’s customer issues. The platform integrates the escalation infrastructure; each shop owns its issue queue.
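
A sketch of per-persona escalation routing. `open_issue` keeps the `(repo, title, context)` signature from above but only builds the issue rather than submitting it; the `ESCALATION_REPOS` mapping, its keys, the `IssueDraft` type, and the `escalate` helper are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical persona key -> escalation repo. Each shop owns its own queue.
ESCALATION_REPOS: dict[str, str] = {
    "dungeon/luna": "dungeonbooks/ops",
    "victory-point/shopkeeper": "victorypoint/ops",
}


@dataclass(frozen=True)
class IssueDraft:
    repo: str
    title: str
    body: str


def open_issue(repo: str, title: str, context: list[str]) -> IssueDraft:
    """Build the issue; a real version would submit it to the issue tracker."""
    transcript = "\n".join(f"> {line}" for line in context)
    body = (
        "Escalated by agent. Needs human review.\n\n"
        f"Conversation context:\n{transcript}"
    )
    return IssueDraft(repo=repo, title=title, body=body)


def escalate(persona_key: str, title: str, context: list[str]) -> IssueDraft:
    """Resolve the persona's own repo. Unknown personas raise KeyError
    instead of falling back to a shared platform queue."""
    repo = ESCALATION_REPOS[persona_key]
    return open_issue(repo, title, context)
```

Failing loudly on an unconfigured persona is a deliberate choice in this sketch: a silent default would route one shop's customer issues into another shop's queue.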

This is also the first piece of Option B from three-forward-options-post-pilot (platform-native UI on top of GitHub Issues). Partner-shop staff need access to operational issues without access to engineering issues or code. The escalation surface is where that permission boundary first matters. Worth designing it now to be wrappable in a friendlier UI later.

Observability

Self-host Langfuse on Railway alongside Marty. Pipe OTel traces from the Anthropic SDK. After this, we can answer “what did Marty do at 9pm last Tuesday when that customer asked about refunds?” — which we currently cannot.

Wire this before Marty/Luna touches Square or Hi.Events. Audit trail is a precondition for letting agents read live store data.

Instrument tool side effects, not just LLM calls. Most observability stacks miss this — they trace the model and skip the thing the model triggered. Every escalation tool invocation (open_issue, notify_human) and every read-tool call (Square query, Hi.Events lookup) needs a span with: persona, user, channel, conversation id, tool args, tool result, latency, cost. The audit question is “what did the agent do,” not “what did the agent say.” Verify this end-to-end during week 1: trigger a fake escalation, confirm the span shows up with full context.
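
To make the span contents concrete, here is a sketch of a wrapper that records one span per tool call with the fields listed above. It writes to an in-memory list as a stand-in for the tracing backend; the `traced_tool` decorator, the `SPANS` sink, the `ctx` keys, and the stubbed `ticket_lookup` tool are hypothetical. Cost is left as `None` because tool-level cost attribution is not modeled here.

```python
import functools
import time
from typing import Any, Callable

SPANS: list[dict[str, Any]] = []  # stand-in sink; a real one would export traces


def traced_tool(tool_name: str) -> Callable:
    """Wrap a tool so every call, including failures, leaves a span."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(ctx: dict[str, Any], **args: Any) -> Any:
            start = time.perf_counter()
            span: dict[str, Any] = {
                "tool": tool_name,
                "persona": ctx["persona"],
                "user": ctx["user"],
                "channel": ctx["channel"],
                "conversation_id": ctx["conversation_id"],
                "args": args,
                "result": None,
                "error": None,
                "cost_usd": None,  # not modeled in this sketch
            }
            try:
                span["result"] = fn(**args)
                return span["result"]
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                span["latency_ms"] = (time.perf_counter() - start) * 1000
                SPANS.append(span)
        return wrapper
    return decorator


@traced_tool("hi_events.ticket_lookup")
def ticket_lookup(email: str) -> dict[str, Any]:
    """Hypothetical stub standing in for the real Hi.Events read."""
    return {"email": email, "has_ticket": False}
```

The week-1 verification then amounts to calling a wrapped tool with a fake context and checking that the recorded span carries every field.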

Evals

Build the lowest-cost version of evals that catches the highest-impact failures. Marty’s voice drift is recoverable. A Luna response that states the wrong refund policy to an angry customer is not. Without evals, persona changes ship blind.

V1 (cheap, ships in weeks 3-4 alongside Luna):

  • Capture last N real interactions per persona via Langfuse.
  • Replay harness: rerun those interactions against persona changes, diff outputs, flag regressions.
  • Golden test set per persona — known questions with known correct response patterns. For Luna, start with the cases from the Carrie interview (refund question, exception ask, hours, member lookup).
  • Pre-deploy gate: any persona YAML change runs against the golden set + replay set before merge.

Doesn’t need sophistication. Needs to exist before Luna (or any Shopkeeper instance) handles customer-facing conversations at any scale.
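
A golden-set check can be as small as pattern assertions over a response. The sketch below assumes a `respond` callable that wraps whatever produces the persona's reply; `GoldenCase`, `check_case`, `gate`, and the example patterns are hypothetical, and the refund case is only an illustration of the "never names staff, never commits to outcomes" rules.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    question: str
    must_match: list[str]      # regexes the reply has to contain
    must_not_match: list[str]  # regexes the reply must never contain


def check_case(case: GoldenCase, respond: Callable[[str], str]) -> list[str]:
    """Return a list of failure descriptions; empty means the case passed."""
    reply = respond(case.question)
    failures = []
    for pattern in case.must_match:
        if not re.search(pattern, reply, re.IGNORECASE):
            failures.append(f"missing: {pattern}")
    for pattern in case.must_not_match:
        if re.search(pattern, reply, re.IGNORECASE):
            failures.append(f"forbidden: {pattern}")
    return failures


def gate(cases: list[GoldenCase], respond: Callable[[str], str]) -> bool:
    """Pre-deploy gate: every golden case has to pass."""
    return all(not check_case(case, respond) for case in cases)


REFUND_CASE = GoldenCase(
    question="Can I get a refund for my event ticket?",
    must_match=[r"\b(staff|team)\b"],
    must_not_match=[r"\bCarrie\b", r"\bPanat\b",
                    r"refund (is|has been) (approved|issued)"],
)
```

The replay set can reuse the same `respond` hook: feed it captured questions, then diff the new replies against the stored ones.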

Memory

Defer Letta or any memory framework. V1: Postgres conversations table keyed by (channel, user_id) holding last N turns. Sufficient.
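
A sketch of the "last N turns per (channel, user_id)" table. It uses the standard-library `sqlite3` module as a stand-in so it runs anywhere; the production version would live in Postgres behind SQLAlchemy and Alembic. The column names, the helper functions, and N = 20 are assumptions.

```python
import sqlite3

MAX_TURNS = 20  # assumed value for "last N turns"


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS conversations (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               channel TEXT NOT NULL,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL)"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS ix_conv_key "
        "ON conversations (channel, user_id, id)"
    )


def append_turn(conn: sqlite3.Connection, channel: str, user_id: str,
                role: str, content: str, keep: int = MAX_TURNS) -> None:
    """Insert a turn, then trim this (channel, user_id) to its newest `keep` rows."""
    conn.execute(
        "INSERT INTO conversations (channel, user_id, role, content) "
        "VALUES (?, ?, ?, ?)",
        (channel, user_id, role, content),
    )
    conn.execute(
        """DELETE FROM conversations
           WHERE channel = ? AND user_id = ? AND id NOT IN (
               SELECT id FROM conversations
               WHERE channel = ? AND user_id = ?
               ORDER BY id DESC LIMIT ?)""",
        (channel, user_id, channel, user_id, keep),
    )


def recent_turns(conn: sqlite3.Connection, channel: str,
                 user_id: str) -> list[tuple[str, str]]:
    """Return (role, content) pairs, oldest first."""
    return conn.execute(
        "SELECT role, content FROM conversations "
        "WHERE channel = ? AND user_id = ? ORDER BY id",
        (channel, user_id),
    ).fetchall()
```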

Revisit if cross-session preference memory becomes valuable (e.g., “this user is in book club, prefers RPG zines, hates urban fantasy”). Probably worth it eventually for the membership angle. Not yet.

Cost ceilings (numbers, not vibes)

“Worth setting separate budgets” is correct but unspecified. Set actual numbers from first principles. Estimate, then validate against Langfuse data after week 1.

Initial estimates (refine after observing real traffic):

  • Marty (book chat): assume ~50 conversations/day × ~6 turns × 2k input + 500 output tokens per turn → roughly 600k input + 150k output tokens/day. At Sonnet pricing ($3/M input, $15/M output) ≈ $120/month. Alert at $250/mo.
  • Shopkeeper (support + tool calls): assume ~20 conversations/day × ~4 turns × ~3k input (more context, policies, member data) + 600 output tokens per turn, plus 1-2 tool calls per turn averaging ~500 tokens each → roughly 320k input + 60k output + tool overhead ≈ $80/month. Alert at $160/mo.
  • Research jobs (scheduled, see below): ~1 run/week × ~50k input + 5k output tokens per run ≈ $2/month. Alert at $25/mo.

Total platform agent budget at Dungeon scale: ~$310/mo, hard cap at ~$435/mo. Aligns with the cost-conscious posture. Numbers are guesses until we have a week of Langfuse data; revise.
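
The arithmetic behind the Marty estimate can be reproduced in a few lines, which also makes it easy to rerun once real traffic numbers arrive. The per-token prices ($3 per million input tokens, $15 per million output tokens) and the 30-day month are assumptions of this sketch; `monthly_cost` is a hypothetical helper.

```python
INPUT_USD_PER_MTOK = 3.0    # assumed Sonnet input price per million tokens
OUTPUT_USD_PER_MTOK = 15.0  # assumed Sonnet output price per million tokens
DAYS_PER_MONTH = 30         # assumed


def monthly_cost(conversations_per_day: int, turns_per_conversation: int,
                 input_tokens_per_turn: int, output_tokens_per_turn: int) -> float:
    """Monthly USD spend for a persona, ignoring caching and tool overhead."""
    turns_per_day = conversations_per_day * turns_per_conversation
    input_mtok = turns_per_day * input_tokens_per_turn * DAYS_PER_MONTH / 1_000_000
    output_mtok = turns_per_day * output_tokens_per_turn * DAYS_PER_MONTH / 1_000_000
    return input_mtok * INPUT_USD_PER_MTOK + output_mtok * OUTPUT_USD_PER_MTOK


# Marty: ~50 conversations/day, ~6 turns, 2k input + 500 output tokens per turn.
marty = monthly_cost(50, 6, 2_000, 500)
print(f"{marty:.2f}")  # 121.50
```

That is 18M input tokens ($54) plus 4.5M output tokens ($67.50) per month, which lines up with the roughly 120/month figure above. Swapping in measured conversation counts from Langfuse after week 1 is a one-line change.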

Per-shop budgets at partner scale: Shopkeeper-Victory-Point gets its own budget, billed/capped independently. This is part of the platform pricing structure (platform-pricing) — agent spend is a real input cost that needs to be visible per tenant.

RSS as the entry point for agent research jobs

Today: marty/src/discord_bot/feeds.py polls 3 RPG RSS feeds (Questing Beast, Sabre Games OSR News, Ten Foot Pole) weekly, dedupes via Redis, posts to rpg-news. Pure RSS digest, no AI.

This is the right shape to extend into AI-driven research jobs. Not rip out — extend.

Curate a second RSS feed set for trending/upcoming books. Candidates:

  • Publishers Weekly bestseller lists (RSS where available).
  • Locus Magazine (SF/F focused).
  • Tor.com new releases.
  • Hardcover trending (already accessible via API).
  • The Bookseller (UK trade pub).
  • /r/Fantasy and /r/printSF — community-driven, more chaotic.
  • Substack feeds we already trust (e.g. expand the QB-shaped curation).

Post weekly to a new channel (#trending-books or similar) using the same plumbing as feeds.py.

Research-job upgrade

Once the feed exists, layer an agent on top:

  • Cron-triggered: “review this week’s feed entries, identify books worth ordering.”
  • Agent tool: cross-reference against current inventory (Square read), past sales velocity, customer book-club picks, Carrie’s curation principles.
  • Output: a digest of “we should consider ordering N copies of these M titles, here’s why.”
  • Not auto-ordering. Output goes to a Discord channel + opens an issue in dungeonbooks/ops for Carrie/Panat to review.

This is the Symphony loop applied to inventory: durable input (RSS feed) → agent execution (research) → tracked output (issue) → human review → action.

The same shape generalizes: RSS-of-RPG-news → “what new RPG zines should we stock?”, RSS-of-event-listings → “what events should we host?”, etc.

Why RSS as the entry point

  • Already battle-tested infra (feeds.py works).
  • Deterministic, deduped input source — easier to debug than open-ended web search.
  • Cost-bounded — agent only runs once per cycle, not per user message.
  • Fits the cost-conscious posture: agent budget is predictable.
  • Carrie can add/remove feeds without touching code (pull feed list out to config).

Sequencing

Don’t build the research-job layer until: (a) Agent SDK refactor done, (b) Langfuse wired, (c) Square read tool exists. Otherwise the agent has no way to ground its recommendations and we have no way to audit what it did.

Sequencing (~6 weeks at fractional time)

The original three-week plan assumed full-time solo work. At fractional capacity (Pync or equivalent consuming 20-25 hours/week of Panat’s time), treat it as ~6 weeks. Build against actual available hours, not the optimistic version. Underestimating ramp is the most common failure mode of solo platform work.

Weeks 1-2:

  • Claude Agent SDK refactor (port tool dispatch). Verify src/tools/ stays portable, no SDK-internal imports.
  • Langfuse self-hosted on Railway, OTel wired. Confirm tool-side-effect spans, not just LLM calls.

Weeks 3-4:

  • src/tools/docs/ with get_doc(slug) — fetches raw markdown from dungeonbooks/docs, parses frontmatter + agent-guidance comments, in-memory TTL cache. Inline root index.md into the system prompt at boot.
  • src/tools/escalation/ with open_issue and notify_human. Escalation target: dungeonbooks/ops.
  • dungeonbooks/ops repo created (or alternative escalation surface).
  • Marty support-mode prompt slot — capitalized/formal register, intent-triggered, never names staff. (Replaces the deferred persona-YAML + Luna voice work.)
  • V1 evals harness: golden set + replay set + pre-deploy gate. Golden set seeded with the Carrie-interview cases.
  • Discord auto-reply / canned response routing for inbound staff DMs.

Weeks 5-6:

  • src/tools/square/ (read-only). Member-lookup signature abstracted for cross-merchant identity.
  • src/tools/hi_events/ (read-only ticket lookup).
  • Trending-books RSS feed, mirroring feeds.py shape.

Week 7+ (decision after measuring):

  • Research-job agent over the trending-books feed.
  • Letta or richer memory if cross-session preferences become valuable.
  • Email/SMS adapters if support volume warrants.
  • Shopkeeper-VictoryPoint persona, configured during Phil onboarding.

Every step shippable independently. No big-bang. No framework lottery.

Open questions

  • Shopkeeper persona name. Carrie’s call.
  • dungeonbooks/ops as a new repo for escalation issues, vs. using an existing repo. A new repo is cleanest for permission scoping: Carrie and future partner-shop staff need access to ops issues without code-repo access. This is the seed of the Option B build (platform-native UI on top of GitHub Issues) from three-forward-options-post-pilot.
  • How tightly does Shopkeeper integrate with guild member data? Member tier lookup is high-value but introduces multi-system auth. Probably routes through the same abstraction as the cross-merchant-identity tool signature.
  • Per-shop persona repo strategy: when partner shops onboard, do their persona YAML files live in a central platform repo, in their tenant repo, or via an admin UI that writes to a database? Affects how partner-shop staff edit their Shopkeeper voice without engineer help.
  • Permissions UX: GitHub’s “give Carrie access to one repo” works at small scale. At 5+ partner shops with non-engineer staff, it doesn’t. When does the platform need its own permission model on top of GitHub?

Cross-references