Chapter 04

Service Topology

Decomposing a self-hosted LLM partner system into independently replaceable services — memory, portal, and workers — without falling into microservice theater.

Updated 2026-05-12

Why services, not a monolith

The previous chapter argued for storing the partnership's persistent state in a plain-text repository. That works as long as the only consumer is a human or an agent reading at session start.

The moment retrieval becomes non-trivial — semantic search over years of conversation, graph queries over related memories, sub-second response time during chat — flat-file scanning is no longer enough. A runtime layer is needed on top of the substrate.

The temptation is to fold all of that runtime into the chat application itself. A single Next.js process holding the chat UI, the memory store, the embedding model, the LLM client, the auth. It works for a weekend. It collapses the moment you want to swap the LLM, change the embedding model, run a long ingestion job, or expose the memory to another client.

The fix is the boring one: split the runtime into services with explicit interfaces, and let each service have one job.

The minimum viable partition

For a personal partner stack, three services are sufficient:

  • Memory owns the retrieval substrate — structured records, the vector index, the embedding pipeline — and speaks an HTTP API for search, list, write, and delete.
  • Portal owns the UI, the auth, the conversation history, the human-facing experience; it speaks HTTP to the memory service, plus an LLM provider.
  • Workers own background ingestion, vectorization, summarization, and graph computation; they speak to the same stores the memory service reads, but only to write.

Three services is the floor. More than three is usually premature. The split has to map to different lifecycles, not to different concepts — if two pieces always change together, they belong in one service.

Memory service

The memory service has two stores, not one:

  • A structured store (a relational database or even SQLite) for the canonical record. Every memory atom has a row here with full content, metadata, timestamps, ownership, summary, and a stable ID.
  • A vector store (a vector database such as LanceDB, pgvector, or similar) for embeddings. It holds the same atoms by ID, plus their embedding vectors and any derived graph structure.

The structured store is the source of truth. The vector store is a derived index that can be rebuilt at any time from the structured store. Writes go to the structured store first, then a background worker propagates them to the vector store. The memory service never lets the vector store drift undetected — every record has a vectorized_at timestamp that the worker reconciles.
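As a sketch, the write path is small. The TypeScript below assumes SQLite-style SQL and a table named memory_atoms; the interface and field names are illustrative, not a schema prescription:

    // Minimal sketch of the structured-store write path. The Db interface,
    // table name, and columns are assumptions for illustration; any
    // relational client with parameterized queries works the same way.
    interface Db {
      run(sql: string, params: unknown[]): Promise<void>;
    }

    interface MemoryAtom {
      id: string;              // stable ID, shared with the vector store
      content: string;         // full canonical content
      summary: string | null;  // filled in later by the summarization worker
      createdAt: string;
    }

    async function writeAtom(db: Db, atom: MemoryAtom): Promise<void> {
      // Writes touch only the structured store. vectorized_at starts NULL,
      // which marks the row as pending for the vectorization worker, so
      // drift between the two stores is always detectable.
      await db.run(
        `INSERT INTO memory_atoms (id, content, summary, created_at, vectorized_at)
         VALUES (?, ?, ?, ?, NULL)`,
        [atom.id, atom.content, atom.summary, atom.createdAt]
      );
    }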

The service exposes a small HTTP surface:

    Method    Path             Purpose
    POST      /search          Semantic search by query embedding
    POST      /graph_search    Walk the related-memory graph from a seed
    POST      /list            Paginated listing with filters
    PUT       /update/{id}     Update a record (re-embed)
    DELETE    /delete/{id}     Delete from both stores

The shape is deliberately RPC-flat. There is no GraphQL, no nested REST. The memory service does not need to be RESTfully pretty; it needs to be cheap to call from an async request handler in the portal.
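From the portal's side, the whole surface is a handful of JSON calls. A sketch, assuming a base URL in a MEMORY_API_URL variable and request bodies shaped like the table above; the field names are illustrative:

    // Hypothetical portal-side client for the memory API. The base URL,
    // request fields, and response shapes are assumptions, not a spec.
    const MEMORY_API = process.env.MEMORY_API_URL ?? "http://memory:8080";

    async function searchMemory(query: string, limit = 10): Promise<unknown[]> {
      const res = await fetch(`${MEMORY_API}/search`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query, limit }),
      });
      if (!res.ok) throw new Error(`memory search failed: ${res.status}`);
      return res.json();
    }

    async function deleteMemory(id: string): Promise<void> {
      // One call; the memory service deletes from both stores behind it.
      const res = await fetch(`${MEMORY_API}/delete/${id}`, { method: "DELETE" });
      if (!res.ok) throw new Error(`memory delete failed: ${res.status}`);
    }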

Portal

The portal owns everything the user touches and nothing about how memory is stored.

  • A UI layer: chat window, conversation list, settings.
  • An auth layer: sessions, password hashing, per-user API keys.
  • A conversation store: its own database, separate from the memory store, holding the literal turn-by-turn transcripts.
  • An LLM client: the place where prompts are assembled and sent to whichever model provider is configured.

The portal is the only service that talks to humans. The memory service has no human surface of its own: there is an admin UI, but it is a separate small client of the memory API, not part of the memory process.

This separation is what makes the LLM provider swappable. The portal can switch from one provider to another by editing one configuration value, and neither the memory service nor the workers care.
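Concretely, that one configuration value can be a base URL plus a model name. A sketch, assuming an OpenAI-compatible chat endpoint; LLM_BASE_URL, LLM_MODEL, and LLM_API_KEY are illustrative variable names, not part of the stack:

    // Sketch of the portal's LLM client. Swapping providers means editing
    // the environment, not this file. The endpoint shape assumes an
    // OpenAI-compatible API; the env variable names are assumptions.
    interface ChatMessage {
      role: "system" | "user" | "assistant";
      content: string;
    }

    async function chat(messages: ChatMessage[]): Promise<string> {
      const res = await fetch(`${process.env.LLM_BASE_URL}/v1/chat/completions`, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.LLM_API_KEY}`,
        },
        body: JSON.stringify({ model: process.env.LLM_MODEL, messages }),
      });
      if (!res.ok) throw new Error(`LLM call failed: ${res.status}`);
      const data = await res.json();
      return data.choices[0].message.content;
    }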

Workers

Anything that takes more than 100ms is a worker, not a request handler.

  • Ingestion worker. Reads new files from the raw input directory, runs the normalization and chunking pipeline, writes atoms to the structured store.
  • Vectorization worker. Polls the structured store for atoms with vectorized_at null, embeds them in batches, writes them to the vector store, updates the timestamp.
  • Summarization worker. Runs a local or remote LLM to produce summaries for atoms that lack them.
  • Graph worker. Computes a K-nearest-neighbor graph over the vector store, writes edge weights back into the structured store.

Workers run on their own loops with their own intervals and batch sizes. They are configured by environment variables, not by code changes. Each worker has a single source of work and a single sink, no branching logic.

The reason for splitting workers from the API process is operational: a worker that hangs on a slow embedding call should not also be the process that is supposed to answer search requests. They share data but not their failure modes.
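The vectorization worker is representative of the pattern: one source, one sink, interval and batch size from the environment. A sketch, with the store interfaces and SQLite-flavored SQL as illustrative assumptions:

    // Sketch of a worker loop: one source (rows with vectorized_at NULL),
    // one sink (the vector store plus a timestamp update). The interfaces,
    // table name, and SQL dialect are assumptions.
    const POLL_INTERVAL_MS = Number(process.env.POLL_INTERVAL_MS ?? 5000);
    const BATCH_SIZE = Number(process.env.BATCH_SIZE ?? 32);

    interface Db {
      all(sql: string, params: unknown[]): Promise<{ id: string; content: string }[]>;
      run(sql: string, params: unknown[]): Promise<void>;
    }
    interface VectorStore {
      upsert(rows: { id: string; vector: number[] }[]): Promise<void>;
    }

    async function runVectorizationWorker(
      db: Db,
      vectors: VectorStore,
      embedBatch: (texts: string[]) => Promise<number[][]>
    ): Promise<void> {
      for (;;) {
        const pending = await db.all(
          "SELECT id, content FROM memory_atoms WHERE vectorized_at IS NULL LIMIT ?",
          [BATCH_SIZE]
        );
        if (pending.length > 0) {
          const embeddings = await embedBatch(pending.map((p) => p.content));
          await vectors.upsert(
            pending.map((p, i) => ({ id: p.id, vector: embeddings[i] }))
          );
          // Stamp the source rows only after the vectors are written, so a
          // crash mid-batch leaves rows pending rather than silently dropped.
          for (const p of pending) {
            await db.run(
              "UPDATE memory_atoms SET vectorized_at = datetime('now') WHERE id = ?",
              [p.id]
            );
          }
        }
        await new Promise((r) => setTimeout(r, POLL_INTERVAL_MS));
      }
    }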

The pluggable model layer

Both the embedding model and the LLM are runtime configuration, not code. The memory service reads EMBEDDING_PROVIDER (local, OpenAI-compatible, etc.) and MODEL_NAME from the environment. The portal reads its LLM provider the same way.

This is mundane but load-bearing. Models churn faster than anything else in the stack. Anything that hardcodes a model name will be the first thing that breaks when a provider deprecates a checkpoint. Treat the model layer like a database driver: an interface with multiple implementations, swapped at boot.
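In code, "like a database driver" means one interface with implementations chosen once at boot. A sketch, assuming an OpenAI-compatible /v1/embeddings endpoint; the provider name strings and any env variables beyond EMBEDDING_PROVIDER and MODEL_NAME are illustrative:

    // Driver-style model layer: one interface, several implementations,
    // selected once at boot. Only the OpenAI-compatible branch is shown;
    // a local model would be another class behind the same interface.
    interface EmbeddingProvider {
      embed(texts: string[]): Promise<number[][]>;
    }

    class OpenAICompatibleEmbeddings implements EmbeddingProvider {
      constructor(
        private baseUrl: string,
        private model: string,
        private apiKey: string
      ) {}

      async embed(texts: string[]): Promise<number[][]> {
        const res = await fetch(`${this.baseUrl}/v1/embeddings`, {
          method: "POST",
          headers: {
            "Content-Type": "application/json",
            Authorization: `Bearer ${this.apiKey}`,
          },
          body: JSON.stringify({ model: this.model, input: texts }),
        });
        if (!res.ok) throw new Error(`embedding call failed: ${res.status}`);
        const data = await res.json();
        return data.data.map((d: { embedding: number[] }) => d.embedding);
      }
    }

    function embeddingProviderFromEnv(): EmbeddingProvider {
      switch (process.env.EMBEDDING_PROVIDER) {
        case "openai-compatible":
          return new OpenAICompatibleEmbeddings(
            process.env.EMBEDDING_BASE_URL ?? "http://localhost:8000",
            process.env.MODEL_NAME ?? "",
            process.env.EMBEDDING_API_KEY ?? ""
          );
        default:
          throw new Error(`unknown EMBEDDING_PROVIDER: ${process.env.EMBEDDING_PROVIDER}`);
      }
    }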

Topology, sketched

[ User ] ──► [ Portal ] ──► [ Memory API ] ──► [ Structured store ]
                 │                │                     ▲
                 ▼                ▼                     │
         [ LLM provider ]  [ Vector store ] ◄───── [ Workers ]
                                                        ▲
                                                        │
                                                  [ Substrate ]

The substrate from the previous chapter sits at the bottom. Workers read it, push into the stores, and the memory API serves the stores to the portal. The portal never reads the substrate or the stores directly.

Deployment shape

The three services compose well under a single docker-compose.yml:

  • Each service is its own container.
  • The structured store gets a bind mount so it can be inspected and backed up with normal filesystem tools.
  • The vector store gets a named volume so the container's native filesystem semantics apply — bind-mounting a vector database into a non-native filesystem (e.g., across a VM boundary) causes pathological I/O. Discover this once.
  • A reverse-proxy or tunnel container (Cloudflare Tunnel, Caddy, Tailscale Funnel) exposes only the portal to the outside world. The memory API and the workers stay on the internal Docker network.

The result is a stack that can be brought up on a new machine with docker compose up -d plus an environment file. No installation steps, no service-specific tooling.
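A compose file matching this shape might look like the following. The service names, build paths, and tunnel image are placeholders, not a reference configuration:

    # Illustrative docker-compose.yml; names and paths are placeholders.
    services:
      portal:
        build: ./portal
        env_file: .env
        depends_on: [memory]
      memory:
        build: ./memory
        env_file: .env
        volumes:
          - ./data/structured:/data/structured  # bind mount: back up with normal tools
          - vectors:/data/vectors               # named volume: native FS semantics
      workers:
        build: ./workers
        env_file: .env
        depends_on: [memory]
      tunnel:
        image: cloudflare/cloudflared:latest    # or Caddy / Tailscale; exposes only the portal
        command: tunnel run
        env_file: .env
    volumes:
      vectors: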

What this is not

This is not microservices for scale. There is one user, possibly two, possibly a small group. The split exists for replaceability and isolation of failure, not for horizontal scaling. There is no service mesh, no API gateway, no Kubernetes.

It is also not SaaS-replaceable. The whole point of running this stack yourself is that your conversation history, your memory index, your embedding choices, and your model selection do not live on someone else's server. Once you accept the operational cost, the design space opens up considerably.

Tradeoffs

  • Operational complexity is real. Three services and several workers mean more things that can fail. The mitigation is to make every failure observable: structured logs out of every container, a single command to tail them all, a single command to redeploy.
  • Schema migrations get awkward. Changing the structured store schema requires updating the memory API, the workers, and anything else that reads the same database. The right move is to put the schema in one place — a migrations directory inside the memory service — and never edit the database from outside it.
  • First boot is slow. Embedding the first years of history takes hours on consumer hardware. Plan for it; do not interleave it with development work.

The reward, when it works, is a stack you fully understand, fully own, and can carry forward as model providers come and go.