Architecture

Waymark is a Postgres-backed instance queue plus an in-memory runner. The high-level flow:

Python SDK
   │
   ▼
waymark-bridge (gRPC)
   │
   ▼
Postgres
   │
   ▼
waymark-start-workers runloop
   │
   ▼
Python worker pool

Everything durable lives in Postgres. Everything that has to be fast lives in the Rust runloop's in-memory state, with carefully ordered writes back to Postgres so that a worker crash never loses progress.

Runtime components

Component              Role
waymark-bridge         Accepts workflow registrations and queue requests from the Python SDK.
waymark-start-workers  Owns the runloop, scheduler, worker bridge, status reporter, and embedded web UI.
RemoteWorkerPool       Dispatches action calls to Python worker processes over gRPC.
PostgresBackend        Shared persistence layer for core runtime, schedules, and web UI queries.

Persistence model

Six tables carry all durable state:

Table                 Holds
workflow_versions     Immutable workflow IR payloads (program_proto) keyed by (workflow_name, workflow_version).
queued_instances      The claim table: scheduled_at, lock_uuid, lock_expires_at.
runner_instances      Per-instance state snapshot, plus terminal result / error.
runner_actions_done   Append-only completed action results, keyed by (execution_id, attempt).
workflow_schedules    Recurring schedule definitions and run metadata.
worker_status         Per-pool throughput and latency snapshots.

State payloads in runner_instances.state are serialized as MessagePack (rmp_serde) byte blobs, not protobuf - small, fast to load, and easy to evolve.

Workflow registration

WorkflowRegistryBackend::upsert_workflow_version:

  1. INSERT ... ON CONFLICT (workflow_name, workflow_version) DO NOTHING RETURNING id.
  2. On conflict, SELECT id, ir_hash.
  3. Reject if ir_hash changed for the same key.

Version keys are immutable. Re-registering with an unchanged IR is a no-op; re-registering with a changed IR but the same version is rejected loudly. This keeps "what does version 3 look like?" answerable for any historical instance.
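The three-step contract above can be sketched with an in-memory dict standing in for the workflow_versions table. This is an illustrative Python model of the behavior, not the actual Rust API; the names upsert_workflow_version's signature and DuplicateVersionError are assumptions:

```python
class DuplicateVersionError(Exception):
    """Raised when a (name, version) key is reused with a different IR."""


def upsert_workflow_version(table, name, version, ir_hash, program_proto):
    """Insert-or-verify a workflow version; version keys are immutable.

    `table` maps (workflow_name, workflow_version) -> row dict, standing
    in for the workflow_versions table.
    """
    key = (name, version)
    row = table.get(key)
    if row is None:
        # No conflict: behaves like INSERT ... RETURNING id.
        row = {"id": len(table) + 1, "ir_hash": ir_hash, "program_proto": program_proto}
        table[key] = row
        return row["id"]
    # Conflict path: behaves like SELECT id, ir_hash and compare.
    if row["ir_hash"] != ir_hash:
        raise DuplicateVersionError(f"{name} v{version} re-registered with different IR")
    return row["id"]  # unchanged IR: no-op
```

Re-registering the same IR returns the existing id; a changed IR under the same version key fails loudly, matching step 3.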

Queueing & claiming

Queueing writes the queue row and the runner state row in a single transaction:

  1. Batch insert into queued_instances(instance_id, scheduled_at, payload).
  2. Batch insert into runner_instances(instance_id, entry_node, workflow_version_id, state).

The dual write prevents the bad case of a queue row that has no runner state to hydrate.
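The all-or-nothing property of the dual write can be modeled in Python with two dicts standing in for the tables (queue_instances, the row shapes, and the snapshot-rollback mechanics are illustrative; the real system gets atomicity from a single Postgres transaction):

```python
def queue_instances(queued, runner, batch):
    """Insert the queue row and the runner state row together, or neither.

    If any insert fails partway through, roll both dicts back to their
    pre-batch snapshots, mimicking a transaction abort.
    """
    q_snap, r_snap = dict(queued), dict(runner)
    try:
        for inst in batch:
            queued[inst["instance_id"]] = {
                "scheduled_at": inst["scheduled_at"],
                "payload": inst["payload"],
            }
            runner[inst["instance_id"]] = {
                "entry_node": inst["entry_node"],
                "workflow_version_id": inst["workflow_version_id"],
                "state": inst["state"],
            }
    except KeyError:
        # "Transaction" aborts: restore both snapshots so no orphan
        # queue row survives without its runner state.
        queued.clear(); queued.update(q_snap)
        runner.clear(); runner.update(r_snap)
        raise
```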

Claiming uses FOR UPDATE SKIP LOCKED:

  • Selects rows where scheduled_at <= now and the lock is missing or expired.
  • Stamps lock_uuid and lock_expires_at on the row.
  • Joins runner_instances to hydrate the graph state in the same trip.
  • Backfills action_results from runner_actions_done by selecting the latest row per execution_id (DISTINCT ON ... ORDER BY attempt DESC, id DESC).

Multiple runloop replicas can race for due rows and Postgres serializes them - the standard pattern for Postgres-backed queues.
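The DISTINCT ON backfill keeps exactly one row per execution_id, preferring the highest attempt and then the highest id. An equivalent Python reduction (row shapes illustrative):

```python
def latest_per_execution(rows):
    """Equivalent of DISTINCT ON (execution_id) ...
    ORDER BY execution_id, attempt DESC, id DESC:
    keep the newest completed result per execution_id."""
    best = {}
    for row in rows:
        key = row["execution_id"]
        cur = best.get(key)
        if cur is None or (row["attempt"], row["id"]) > (cur["attempt"], cur["id"]):
            best[key] = row
    return best


rows = [
    {"execution_id": "a", "attempt": 0, "id": 1, "result": "first"},
    {"execution_id": "a", "attempt": 1, "id": 3, "result": "retry"},
    {"execution_id": "b", "attempt": 0, "id": 2, "result": "ok"},
]
# latest_per_execution(rows)["a"] is the attempt-1 retry, not the
# original attempt-0 result.
```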

Persisting progress

The runloop persists deltas in a deliberate order:

  1. Insert into runner_actions_done first (append-only history).
  2. Update runner_instances.state for the claimed row.
  3. Update queued_instances.scheduled_at to whatever the graph reports as its next deadline.

State updates are lock-gated: the UPDATE includes WHERE lock_uuid = $lock_uuid AND lock_expires_at > now(), so a stale owner whose lock has expired can never overwrite live state. If the lease has lapsed, the write is rejected and the runloop drops the in-memory executor.
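The lease check reduces to a predicate over the lock columns. A minimal sketch, with one dict carrying both the lock fields and the state (in the real schema the UPDATE targets runner_instances.state and joins the lock columns from queued_instances; the function name and row shape here are illustrative):

```python
from datetime import datetime, timedelta


def write_state_if_owner(row, lock_uuid, new_state, now):
    """Mimics UPDATE ... WHERE lock_uuid = $lock_uuid AND
    lock_expires_at > now(): the write lands only for the live owner.
    Returns True if the row was updated (rows-affected == 1)."""
    if row["lock_uuid"] == lock_uuid and row["lock_expires_at"] > now:
        row["state"] = new_state
        return True
    return False  # stale owner: caller must evict its in-memory executor
```

A caller that sees False knows its lease lapsed and must not trust its in-memory copy of the instance any further.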

When an instance reaches a terminal state, save_instances_done:

  1. Batch-updates runner_instances.result / error.
  2. Deletes the matching queued_instances rows.

The completed history stays queryable in runner_instances; the active queue stays small.
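A dict-backed sketch of save_instances_done's two steps (the Python shape and the finished mapping are illustrative; the real batch runs as SQL):

```python
def save_instances_done(runner_instances, queued_instances, finished):
    """Record terminal results, then shrink the active queue.

    `finished` maps instance_id -> (result, error); one of the pair is
    expected to be None depending on how the instance ended.
    """
    for instance_id, (result, error) in finished.items():
        row = runner_instances[instance_id]
        row["result"] = result
        row["error"] = error
        # Drop the claim row: completed history stays in runner_instances,
        # the active queue stays small.
        queued_instances.pop(instance_id, None)
```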

Locking & recovery

Locks are leases on queued_instances rows. A few moving parts keep them honest:

  • A periodic heartbeat refreshes lock_expires_at for actively-running instances.
  • An expired-lock reclaimer sweeps the table periodically and clears leases that have lapsed (FOR UPDATE SKIP LOCKED, in batches).
  • If the lock changes hands while the runloop holds the in-memory executor, the runloop notices on its next state write, evicts the executor, and releases its local state.

The end result: a hard worker crash drops the lease at the lock_expires_at deadline. The instance is reclaimed by another runloop, hydrated from the last persisted snapshot in runner_instances.state, and resumes from there. There is no replay of the original Python run() body.
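The expired-lock sweep amounts to clearing lapsed leases so another runloop can claim the rows. A minimal sketch over row dicts (in Postgres this runs as batched FOR UPDATE SKIP LOCKED updates; the function name is illustrative):

```python
from datetime import datetime, timedelta


def reclaim_expired(queue_rows, now):
    """Clear every lease whose lock_expires_at deadline has passed,
    returning the ids of reclaimed instances."""
    reclaimed = []
    for row in queue_rows:
        if row["lock_uuid"] is not None and row["lock_expires_at"] <= now:
            row["lock_uuid"] = None
            row["lock_expires_at"] = None
            reclaimed.append(row["instance_id"])
    return reclaimed
```

Rows with a live lease or no lease at all are untouched; the reclaimed rows become claimable on the next scheduler pass.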

Instance lifecycle

End to end:

  1. The Python SDK compiles a workflow class to IR and calls bridge registration.
  2. The bridge upserts workflow_versions and queues a QueuedInstance.
  3. The runloop claims due instances, hydrates the DAG and state, and fans work into executor shards.
  4. Action completions update in-memory state; durable deltas flow back through PostgresBackend in the order described above.
  5. Finished instances are written to runner_instances.result / error and removed from queued_instances.

This page tracks the Architecture doc in the repo; the repo copy is the canonical reference.