Architecture
Waymark is a Postgres-backed instance queue plus an in-memory runner. The high-level flow:
```
Python SDK
    │
    ▼
waymark-bridge (gRPC)
    │
    ▼
Postgres
    │
    ▼
waymark-start-workers runloop
    │
    ▼
Python worker pool
```
Everything durable lives in Postgres. Everything that has to be fast lives in the Rust runloop's in-memory state, with carefully ordered writes back to Postgres so that a worker crash never loses progress.
Runtime components
| Component | Role |
|---|---|
| `waymark-bridge` | Accepts workflow registrations and queue requests from the Python SDK. |
| `waymark-start-workers` | Owns the runloop, scheduler, worker bridge, status reporter, and embedded web UI. |
| `RemoteWorkerPool` | Dispatches action calls to Python worker processes over gRPC. |
| `PostgresBackend` | Shared persistence layer for the core runtime, schedules, and web UI queries. |
Persistence model
Six tables carry all durable state:
| Table | Holds |
|---|---|
| `workflow_versions` | Immutable workflow IR payloads (`program_proto`) keyed by `(workflow_name, workflow_version)`. |
| `queued_instances` | The claim table: `scheduled_at`, `lock_uuid`, `lock_expires_at`. |
| `runner_instances` | Per-instance state snapshot, plus terminal result / error. |
| `runner_actions_done` | Append-only completed action results, keyed by `(execution_id, attempt)`. |
| `workflow_schedules` | Recurring schedule definitions and run metadata. |
| `worker_status` | Per-pool throughput and latency snapshots. |
State payloads in `runner_instances.state` are serialized as MessagePack (`rmp_serde`) byte blobs, not protobuf: small, fast to load, and easy to evolve.
Workflow registration
`WorkflowRegistryBackend::upsert_workflow_version`:

- `INSERT ... ON CONFLICT (workflow_name, workflow_version) DO NOTHING RETURNING id`.
- On conflict, `SELECT id, ir_hash`.
- Reject if `ir_hash` changed for the same key.
Version keys are immutable. Re-registering with an unchanged IR is a no-op; re-registering with a changed IR but the same version is rejected loudly. This keeps "what does version 3 look like?" answerable for any historical instance.
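The registration rule can be sketched in plain Python. This is a hypothetical in-memory model — the real backend runs the `INSERT`/`SELECT` against Postgres, and the function and exception names here are illustrative:

```python
# In-memory model of the version-immutability rule. The dict stands in for
# the workflow_versions table, keyed by (workflow_name, workflow_version).
class VersionConflict(Exception):
    pass

def upsert_workflow_version(store, name, version, ir_hash, program_proto):
    """Insert-if-absent, then verify ir_hash on conflict."""
    key = (name, version)
    if key not in store:
        store[key] = {"ir_hash": ir_hash, "program_proto": program_proto}
        return "inserted"
    if store[key]["ir_hash"] == ir_hash:
        return "no-op"  # unchanged IR: re-registration is idempotent
    raise VersionConflict(f"IR changed for existing version key {key}")

store = {}
upsert_workflow_version(store, "billing", 3, "abc123", b"<proto>")  # inserted
upsert_workflow_version(store, "billing", 3, "abc123", b"<proto>")  # no-op
# A different ir_hash under the same key raises VersionConflict.
```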
Queueing & claiming
Queueing writes the queue row and the runner state row in a single transaction:
- Batch insert into `queued_instances` (`instance_id`, `scheduled_at`, `payload`).
- Batch insert into `runner_instances` (`instance_id`, `entry_node`, `workflow_version_id`, `state`).
The dual write prevents the bad case of a queue row that has no runner state to hydrate.
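A minimal sketch of the dual write, with a dict standing in for Postgres and copy-then-publish standing in for the transaction (all names are illustrative):

```python
# Both rows land atomically or neither does: mutate copies of both tables,
# then publish them together only if every insert succeeded.
def queue_instances(db, rows):
    queued = dict(db["queued_instances"])  # work on copies so a failure
    runner = dict(db["runner_instances"])  # "rolls back" both inserts
    for r in rows:
        iid = r["instance_id"]
        if iid in queued or iid in runner:
            raise ValueError(f"duplicate instance {iid}")  # aborts the batch
        queued[iid] = {"scheduled_at": r["scheduled_at"], "payload": r["payload"],
                       "lock_uuid": None, "lock_expires_at": None}
        runner[iid] = {"entry_node": r["entry_node"],
                       "workflow_version_id": r["workflow_version_id"],
                       "state": r["state"], "result": None, "error": None}
    db["queued_instances"] = queued  # "commit": publish both tables at once
    db["runner_instances"] = runner
```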
Claiming uses `FOR UPDATE SKIP LOCKED`:

- Selects rows where `scheduled_at <= now` and the lock is missing or expired.
- Stamps `lock_uuid` and `lock_expires_at` on the row.
- Joins `runner_instances` to hydrate the graph state in the same trip.
- Backfills `action_results` from `runner_actions_done` by selecting the latest row per `execution_id` (`DISTINCT ON ... ORDER BY attempt DESC, id DESC`).

Multiple runloop replicas can race for due rows; Postgres serializes them. This is the standard pattern for Postgres-backed queues.
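The selection and backfill logic can be sketched as a pure-Python model. The real claim is a single SQL round trip with `FOR UPDATE SKIP LOCKED`; this function name and the row shapes are assumptions:

```python
import uuid
from datetime import timedelta

# Model of the claim's selection logic: rows are dicts mirroring the
# queued_instances columns named in the doc.
def claim_due(queued, actions_done, now, lease=timedelta(seconds=30)):
    lock = str(uuid.uuid4())
    claimed = []
    for iid, row in queued.items():
        due = row["scheduled_at"] <= now
        free = row["lock_uuid"] is None or row["lock_expires_at"] <= now
        if due and free:
            row["lock_uuid"] = lock  # stamp the lease
            row["lock_expires_at"] = now + lease
            claimed.append(iid)
    # Backfill: latest attempt per execution_id, mirroring
    # DISTINCT ON (execution_id) ... ORDER BY attempt DESC, id DESC.
    latest = {}
    for a in sorted(actions_done, key=lambda a: (a["attempt"], a["id"])):
        latest[a["execution_id"]] = a["result"]
    return lock, claimed, latest
```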
Persisting progress
The runloop persists deltas in a deliberate order:
- Insert into `runner_actions_done` first (append-only history).
- Update `runner_instances.state` for the claimed row.
- Update `queued_instances.scheduled_at` to whatever the graph reports as its next deadline.
State updates are lock-gated: the `UPDATE` includes `WHERE qi.lock_uuid = $lock_uuid AND lock_expires_at > now()`, so a stale owner whose lock has expired can never overwrite live state. If the lease has lapsed, the write is rejected and the runloop drops the in-memory executor.
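A sketch of the lease gate, assuming rows are dicts; the actual check is the SQL `WHERE` clause above, and this function signature is hypothetical:

```python
# Lease-gated write: the state update only lands while the caller's
# lock_uuid matches and the lease has not expired.
def save_state(queued, runner, instance_id, lock_uuid, new_state, now):
    row = queued.get(instance_id)
    if row is None or row["lock_uuid"] != lock_uuid or row["lock_expires_at"] <= now:
        return False  # stale owner: reject; caller evicts its in-memory executor
    runner[instance_id]["state"] = new_state
    return True
```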
When an instance reaches a terminal state, `save_instances_done`:

- Batch-updates `runner_instances.result` / `error`.
- Deletes the matching `queued_instances` rows.
The completed history stays queryable in `runner_instances`; the active queue stays small.
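A sketch of the terminal write, with dicts standing in for the two tables (the `save_instances_done` name comes from the doc; the signature here is assumed):

```python
# Persist result/error on runner_instances and drop the queue row, so
# history stays queryable while the active queue stays small.
def save_instances_done(queued, runner, done):
    """done maps instance_id -> (result, error)."""
    for iid, (result, error) in done.items():
        runner[iid]["result"] = result
        runner[iid]["error"] = error
        queued.pop(iid, None)  # remove from the active queue
```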
Locking & recovery
Locks are leases on `queued_instances` rows. A few moving parts keep them honest:
- A periodic heartbeat refreshes `lock_expires_at` for actively running instances.
- An expired-lock reclaimer sweeps the table periodically and clears leases that have lapsed (`FOR UPDATE SKIP LOCKED`, in batches).
- If the lock changes hands while the runloop holds the in-memory executor, the runloop notices on its next state write, evicts the executor, and releases its local state.
The end result: a hard worker crash drops the lease at the `lock_expires_at` deadline. The instance is reclaimed by another runloop, hydrated from the last persisted snapshot in `runner_instances.state`, and resumes from there. There is no replay of the original Python `run()` body.
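The reclaimer's sweep can be sketched as follows (illustrative only; the real sweep runs batched SQL with `SKIP LOCKED`, and this function name is an assumption):

```python
# Clear lapsed leases so another runloop can claim the row and resume the
# instance from its last persisted snapshot.
def reclaim_expired(queued, now):
    reclaimed = []
    for iid, row in queued.items():
        if row["lock_uuid"] is not None and row["lock_expires_at"] <= now:
            row["lock_uuid"] = None
            row["lock_expires_at"] = None
            reclaimed.append(iid)
    return reclaimed
```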
Instance lifecycle
End to end:
- The Python SDK compiles a workflow class to IR and calls bridge registration.
- The bridge upserts `workflow_versions` and queues a `QueuedInstance`.
- The runloop claims due instances, hydrates the DAG and state, and fans work into executor shards.
- Action completions update in-memory state; durable deltas flow back through `PostgresBackend` in the order described above.
- Finished instances are written to `runner_instances.result` / `error` and removed from `queued_instances`.
This page tracks the Architecture doc in the repo, which remains the canonical reference.