Production Deployment
A production Waymark deployment is three pieces: your application, a worker pool, and Postgres. It should be close to plug and play on any stack, whether self-hosted or in the cloud. There's no broker to host, no scheduler service, and no replay servers. Postgres holds every piece of shared state, and everything else is a stateless container you can kill and restart at will.
Topology
- Your app is whatever serves traffic today: a FastAPI service, a script, a cron job. It queues workflows when it wants to delegate processing to the job cluster. The SDK boots a singleton waymark-bridge inside the container on first use, so the app needs nothing beyond a connection to the database.
- The worker pool is one container running waymark-start-workers. It pulls queued actions from Postgres, executes them across a pool of Python processes, and persists results.
- Postgres is the only stateful piece. Anything 14+ works, including the managed instance you already have.
Because both app and workers coordinate purely through Postgres, they don't need to reach each other over the network. Scale them, restart them, and deploy them independently.
Share an image
Build a single image that contains your application code and the waymark package. Both services import the same workflow and action code, so they should run the same image. The app service and the worker service differ only in the command they run.
If you already have a Dockerfile, you just need a new service that runs the same image with waymark-start-workers as its command. If you don't have one yet, this basic one should get you started:
FROM python:3.12-slim
RUN pip install uv
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY src ./src
ENV PATH="/app/.venv/bin:${PATH}"
# The default command serves the app; the worker service
# overrides it with `waymark-start-workers`.
CMD ["uvicorn", "yourapp.web:app", "--host", "0.0.0.0", "--port", "8000"]
Compose the services
The same shape works in Docker Compose, Kubernetes, or anything that runs containers. In Compose:
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app"]
      interval: 5s
    volumes:
      - postgres-data:/var/lib/postgresql/data
  workers:
    build: .
    command: ["waymark-start-workers"]
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      WAYMARK_DATABASE_URL: postgresql://app:app@postgres:5432/app
      WAYMARK_USER_MODULE: yourapp.workflows
      WAYMARK_WEBAPP_ENABLED: "true"
    ports:
      - "24119:24119"
  app:
    build: .
    depends_on:
      workers:
        condition: service_started
    environment:
      WAYMARK_DATABASE_URL: postgresql://app:app@postgres:5432/app
    ports:
      - "8000:8000"
volumes:
  postgres-data: {}
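Both services get their connection settings purely from environment variables, so a misspelled or missing variable tends to show up only when the first workflow is queued. As a rough sketch, a container entrypoint could run a preflight check like the one below. `check_env` is a hypothetical helper, not part of Waymark; the variable names come from the Compose file above, and accepting both `postgresql://` and `postgres://` schemes is an assumption based on what libpq accepts.

```python
from urllib.parse import urlsplit


def check_env(env: dict[str, str], required: tuple[str, ...]) -> list[str]:
    """Return human-readable problems; an empty list means the env looks sane."""
    problems = []
    for name in required:
        if not env.get(name):
            problems.append(f"{name} is not set")

    url = env.get("WAYMARK_DATABASE_URL", "")
    if url:
        parts = urlsplit(url)
        # ASSUMPTION: both spellings of the scheme are acceptable, as with libpq.
        if parts.scheme not in ("postgresql", "postgres"):
            problems.append(f"unexpected URL scheme: {parts.scheme!r}")
        if not parts.hostname:
            problems.append("database URL has no host")
    return problems
```

In the worker container this might be called as `check_env(dict(os.environ), ("WAYMARK_DATABASE_URL", "WAYMARK_USER_MODULE"))`, exiting non-zero if the list is not empty; the app container only needs the database URL.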
Production settings
The defaults are tuned for a single host. In production, tuning usually comes down to two variables:
| Variable | What it controls |
|---|---|
| WAYMARK_WORKER_COUNT | Python worker processes in the pool. Defaults to the host CPU count, so the pool can keep every core busy. |
| WAYMARK_CONCURRENT_PER_WORKER | Concurrent actions per worker. Raise it for IO-bound actions such as API requests. |
If a third-party library leaks memory inside your actions, set
WAYMARK_MAX_ACTION_LIFECYCLE to recycle workers after a fixed number of
actions. In-flight work completes before the old process exits.
One retention default to know: finished workflow instances are garbage
collected after 24 hours. If you want longer run history in the webapp
or for audits, increase the value of WAYMARK_GARBAGE_COLLECTOR_RETENTION_HOURS.
The full list of environment variables, including queue polling, lock TTLs, and scheduler intervals, lives in Configuration.
The webapp
Setting WAYMARK_WEBAPP_ENABLED=true on the worker pool serves the
built-in webapp on port 24119: live workflow progress, run history,
and per-action inputs, retries, and outputs. The
Webapp guide covers everything it can do.
The webapp binds to 0.0.0.0 by default and has no built-in auth.
Either keep the port on your private network, or firewall your worker service
so it is not exposed to the open Internet. We suggest the latter, since there's
usually no reason for workers to be externally accessible.
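One way to confirm the port really is closed off is to probe it from a machine outside your private network. The snippet below is a generic TCP check using only the standard library; `port_is_reachable` is a hypothetical helper, and the host you pass it is whatever public address your worker service might be reachable on.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

Run from outside, `port_is_reachable("<your worker's public address>", 24119)` should return `False`. A `True` result means the webapp is exposed and the firewall rules need another look.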
Scaling out
The first way to scale is to run another worker pool on a fresh host. Each pool claims queued work through Postgres row locks with heartbeats, so pools on different hosts never double-execute an action, and a crashed host's locks expire and get reclaimed automatically. Your app tier scales the same way it always has: queueing a workflow is just a database write.
The same mechanism is what makes deploys safe. When a worker container is killed mid-run, its lock expires (15 seconds by default) and another pool picks the instance up from the last persisted state. No work is lost; the worst case is that a partially completed action attempt runs again, which is the retry behavior durable execution is designed around.
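Since an interrupted attempt may run a second time, actions with external side effects are easiest to reason about when repeating them is harmless. A common general-purpose pattern, sketched below, is to key each side effect on a stable identifier so a re-run becomes a no-op. Waymark does not mandate this; SQLite stands in for whatever store your action writes to, and `record_payment`, `create_schema`, and the key format are hypothetical.

```python
import sqlite3


def create_schema(db: sqlite3.Connection) -> None:
    """Create the demo table; the primary key is what enforces at-most-once."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "idempotency_key TEXT PRIMARY KEY, amount_cents INTEGER)"
    )


def record_payment(db: sqlite3.Connection, idempotency_key: str, amount_cents: int) -> bool:
    """Apply the write at most once per key; return True if this call applied it."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "INSERT OR IGNORE INTO payments (idempotency_key, amount_cents) VALUES (?, ?)",
            (idempotency_key, amount_cents),
        )
    # rowcount is 0 when the key already existed and the insert was ignored.
    return cur.rowcount == 1
```

If the first attempt wrote the row and was then killed before reporting success, the retried attempt finds the key already present and leaves the data untouched.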
When Postgres is the bottleneck, you'll see it in queue latency long before correctness suffers. Until then, one database is all the infrastructure Waymark asks for. On very large deployments, we've seen some users benefit from a separate Postgres instance dedicated to Waymark jobs. Either way, make sure Postgres itself is tuned properly, which we can also help with.
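If you want to watch queue latency yourself, one rough approach is to sample how long actions wait between being queued and being started, then track a high percentile over time. The helper below is a sketch under the assumption that you collect those two timestamps on your own, for example from application logs. It does not read any Waymark table, and `latency_percentile` is a hypothetical name.

```python
import math
from datetime import datetime, timedelta


def latency_percentile(samples: list[tuple[datetime, datetime]], pct: float) -> timedelta:
    """Nearest-rank percentile of (started_at - queued_at) over the samples."""
    if not samples:
        raise ValueError("no samples")
    waits = sorted(started - queued for queued, started in samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(waits) / 100))
    return waits[rank - 1]
```

A p95 that climbs while worker CPU stays flat is the kind of signal that points toward the database rather than the pool.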