Architecture

Waymark is a Postgres-backed instance queue plus an in-memory runner. The high-level flow:

Python SDK
   │
   ▼
waymark-bridge (gRPC)
   │
   ▼
Postgres
   │
   ▼
waymark-start-workers runloop
   │
   ▼
Python worker pool

Everything durable lives in Postgres. Everything that has to be fast lives in the Rust runloop's in-memory state, with carefully ordered writes back to Postgres so that a worker crash never loses progress.

Runtime components

Component              Role
waymark-bridge         Accepts workflow registrations and queue requests from the Python SDK.
waymark-start-workers  Owns the runloop, scheduler, worker bridge, status reporter, and embedded web UI.
RemoteWorkerPool       Dispatches action calls to Python worker processes over gRPC.
PostgresBackend        Shared persistence layer for core runtime, schedules, and web UI queries.

Persistence model

Six tables carry all durable state:

Table                 Holds
workflow_versions     Immutable workflow IR payloads (program_proto) keyed by (workflow_name, workflow_version).
queued_instances      The claim table: scheduled_at, lock_uuid, lock_expires_at.
runner_instances      Per-instance state snapshot, plus terminal result / error.
runner_actions_done   Append-only completed action results, keyed by (execution_id, attempt).
workflow_schedules    Recurring schedule definitions and run metadata.
worker_status         Per-pool throughput and latency snapshots.

State payloads in runner_instances.state are serialized as MessagePack (rmp_serde) byte blobs, not protobuf - small, fast to load, and easy to evolve.

Workflow registration

WorkflowRegistryBackend::upsert_workflow_version:

  1. INSERT ... ON CONFLICT (workflow_name, workflow_version) DO NOTHING RETURNING id.
  2. On conflict, SELECT id, ir_hash.
  3. Reject if ir_hash changed for the same key.

Version keys are immutable. Re-registering with an unchanged IR is a no-op; re-registering with a changed IR but the same version is rejected loudly. This keeps "what does version 3 look like?" answerable for any historical instance.
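The three-step contract above can be sketched with an in-memory dict standing in for the workflow_versions table. This is an illustrative Python model of the behavior, not the actual Rust API; the names upsert_workflow_version's signature and DuplicateVersionError are assumptions:

```python
class DuplicateVersionError(Exception):
    """Raised when a (name, version) key is reused with a different IR."""


def upsert_workflow_version(table, name, version, ir_hash, program_proto):
    """Insert-or-verify a workflow version; version keys are immutable.

    `table` maps (workflow_name, workflow_version) -> row dict, standing
    in for the workflow_versions table.
    """
    key = (name, version)
    row = table.get(key)
    if row is None:
        # No conflict: behaves like INSERT ... RETURNING id.
        row = {"id": len(table) + 1, "ir_hash": ir_hash, "program_proto": program_proto}
        table[key] = row
        return row["id"]
    # Conflict path: behaves like SELECT id, ir_hash and compare.
    if row["ir_hash"] != ir_hash:
        raise DuplicateVersionError(f"{name} v{version} re-registered with different IR")
    return row["id"]  # unchanged IR: no-op
```

Re-registering the same IR returns the existing id; a changed IR under the same version key fails loudly, matching step 3.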

Queueing & claiming

Queueing writes the queue row and the runner state row in a single transaction:

  1. Batch insert into queued_instances(instance_id, scheduled_at, payload).
  2. Batch insert into runner_instances(instance_id, entry_node, workflow_version_id, state).

The dual write prevents the bad case of a queue row that has no runner state to hydrate.
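The all-or-nothing property of the dual write can be modeled in Python with two dicts standing in for the tables (queue_instances, the row shapes, and the snapshot-rollback mechanics are illustrative; the real system gets atomicity from a single Postgres transaction):

```python
def queue_instances(queued, runner, batch):
    """Insert the queue row and the runner state row together, or neither.

    If any insert fails partway through, roll both dicts back to their
    pre-batch snapshots, mimicking a transaction abort.
    """
    q_snap, r_snap = dict(queued), dict(runner)
    try:
        for inst in batch:
            queued[inst["instance_id"]] = {
                "scheduled_at": inst["scheduled_at"],
                "payload": inst["payload"],
            }
            runner[inst["instance_id"]] = {
                "entry_node": inst["entry_node"],
                "workflow_version_id": inst["workflow_version_id"],
                "state": inst["state"],
            }
    except KeyError:
        # "Transaction" aborts: restore both snapshots so no orphan
        # queue row survives without its runner state.
        queued.clear(); queued.update(q_snap)
        runner.clear(); runner.update(r_snap)
        raise
```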

Claiming uses FOR UPDATE SKIP LOCKED:

  • Selects rows where scheduled_at <= now and the lock is missing or expired.
  • Stamps lock_uuid and lock_expires_at on the row.
  • Joins runner_instances to hydrate the graph state in the same trip.
  • Backfills action_results from runner_actions_done by selecting the latest row per execution_id (DISTINCT ON ... ORDER BY attempt DESC, id DESC).

Multiple runloop replicas can race for due rows and Postgres serializes them - the standard pattern for Postgres-backed queues.
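The DISTINCT ON backfill keeps exactly one row per execution_id, preferring the highest attempt and then the highest id. An equivalent Python reduction (row shapes illustrative):

```python
def latest_per_execution(rows):
    """Equivalent of DISTINCT ON (execution_id) ...
    ORDER BY execution_id, attempt DESC, id DESC:
    keep the newest completed result per execution_id."""
    best = {}
    for row in rows:
        key = row["execution_id"]
        cur = best.get(key)
        if cur is None or (row["attempt"], row["id"]) > (cur["attempt"], cur["id"]):
            best[key] = row
    return best


rows = [
    {"execution_id": "a", "attempt": 0, "id": 1, "result": "first"},
    {"execution_id": "a", "attempt": 1, "id": 3, "result": "retry"},
    {"execution_id": "b", "attempt": 0, "id": 2, "result": "ok"},
]
# latest_per_execution(rows)["a"] is the attempt-1 retry, not the
# original attempt-0 result.
```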

Persisting progress

The runloop persists deltas in a deliberate order:

  1. Insert into runner_actions_done first (append-only history).
  2. Update runner_instances.state for the claimed row.
  3. Update queued_instances.scheduled_at to whatever the graph reports as its next deadline.

State updates are lock-gated: the UPDATE includes WHERE lock_uuid = $lock_uuid AND lock_expires_at > now(), so a stale owner whose lock has expired can never overwrite live state. If the lease has lapsed, the write is rejected and the runloop drops the in-memory executor.
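The lease check reduces to a predicate over the lock columns. A minimal sketch, with one dict carrying both the lock fields and the state (in the real schema the UPDATE targets runner_instances.state and joins the lock columns from queued_instances; the function name and row shape here are illustrative):

```python
from datetime import datetime, timedelta


def write_state_if_owner(row, lock_uuid, new_state, now):
    """Mimics UPDATE ... WHERE lock_uuid = $lock_uuid AND
    lock_expires_at > now(): the write lands only for the live owner.
    Returns True if the row was updated (rows-affected == 1)."""
    if row["lock_uuid"] == lock_uuid and row["lock_expires_at"] > now:
        row["state"] = new_state
        return True
    return False  # stale owner: caller must evict its in-memory executor
```

A caller that sees False knows its lease lapsed and must not trust its in-memory copy of the instance any further.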

When an instance reaches a terminal state, save_instances_done:

  1. Batch-updates runner_instances.result / error.
  2. Deletes the matching queued_instances rows.

The completed history stays queryable in runner_instances; the active queue stays small.
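A dict-backed sketch of save_instances_done's two steps (the Python shape and the finished mapping are illustrative; the real batch runs as SQL):

```python
def save_instances_done(runner_instances, queued_instances, finished):
    """Record terminal results, then shrink the active queue.

    `finished` maps instance_id -> (result, error); one of the pair is
    expected to be None depending on how the instance ended.
    """
    for instance_id, (result, error) in finished.items():
        row = runner_instances[instance_id]
        row["result"] = result
        row["error"] = error
        # Drop the claim row: completed history stays in runner_instances,
        # the active queue stays small.
        queued_instances.pop(instance_id, None)
```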

Locking & recovery

Locks are leases on queued_instances rows. A few moving parts keep them honest:

  • A periodic heartbeat refreshes lock_expires_at for actively-running instances.
  • An expired-lock reclaimer sweeps the table periodically and clears leases that have lapsed (FOR UPDATE SKIP LOCKED, in batches).
  • If the lock changes hands while the runloop holds the in-memory executor, the runloop notices on its next state write, evicts the executor, and releases its local state.

The end result: a hard worker crash drops the lease at the lock_expires_at deadline. The instance is reclaimed by another runloop, hydrated from the last persisted snapshot in runner_instances.state, and resumes from there. There is no replay of the original Python run() body.
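The expired-lock sweep amounts to clearing lapsed leases so another runloop can claim the rows. A minimal sketch over row dicts (in Postgres this runs as batched FOR UPDATE SKIP LOCKED updates; the function name is illustrative):

```python
from datetime import datetime, timedelta


def reclaim_expired(queue_rows, now):
    """Clear every lease whose lock_expires_at deadline has passed,
    returning the ids of reclaimed instances."""
    reclaimed = []
    for row in queue_rows:
        if row["lock_uuid"] is not None and row["lock_expires_at"] <= now:
            row["lock_uuid"] = None
            row["lock_expires_at"] = None
            reclaimed.append(row["instance_id"])
    return reclaimed
```

Rows with a live lease or no lease at all are untouched; the reclaimed rows become claimable on the next scheduler pass.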

Instance lifecycle

End to end:

  1. The Python SDK compiles a workflow class to IR and calls bridge registration.
  2. The bridge upserts workflow_versions and queues a QueuedInstance.
  3. The runloop claims due instances, hydrates the DAG and state, and fans work into executor shards.
  4. Action completions update in-memory state; durable deltas flow back through PostgresBackend in the order described above.
  5. Finished instances are written to runner_instances.result / error and removed from queued_instances.

This page tracks the Architecture doc in the repo; the repo copy is the canonical reference.