# Retries & Timeouts
Background tasks fail. Sometimes they raise; sometimes they hang. Waymark gives you a single place to express how each action should respond to either failure mode, applied per call.
## Default behavior
If you don't configure anything, you get the conservative defaults:
- A bare `await some_action(...)` runs the action once. If it raises, the exception propagates out of the workflow as a normal Python error. No automatic retries.
- An action that exceeds its dispatch timeout is treated as a runtime failure (not a user exception) and is retried until it eventually runs to completion. Timeouts usually mean a worker died mid-execution, which is something we should recover from automatically.
These defaults are deliberately tame: you have to opt in to retries for your own code, but cross-worker coordination failures retry indefinitely under the hood.
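For example, a minimal sketch of the bare form inside a workflow's `run` (`some_action` and `payload` are placeholders):

```python
# Inside a workflow's run(): one execution, no retries.
try:
    result = await some_action(payload)
except ValueError:
    # An exception from the action surfaces here as a normal
    # Python error; nothing is retried automatically.
    result = None
```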
## Configuring with `run_action`
To override the defaults, wrap the action call in `self.run_action(...)` and supply a `RetryPolicy`:
```python
from waymark import RetryPolicy
from waymark.workflow import Workflow

# The @workflow decorator, Result, and call_third_party_api are
# assumed to be imported from elsewhere in your application.


@workflow
class FlakyApiWorkflow(Workflow):
    async def run(self, payload: dict) -> Result:
        return await self.run_action(
            call_third_party_api(payload),
            retry=RetryPolicy(attempts=5),
        )
```
`RetryPolicy(attempts=N)` counts total executions: five attempts means up to four retries after the initial try. `RetryPolicy()` with no `attempts` set defaults to a generous retry budget (100 internally), suitable for external calls that you really expect to succeed eventually.
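Concretely:

```python
RetryPolicy(attempts=5)  # 5 total executions: the initial try plus up to 4 retries
RetryPolicy()            # no attempts set: generous default budget (100 executions)
```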
## Filtering by exception
You usually don't want to retry on every exception: a `ValueError` from a bad input is permanent, but a `ConnectionError` is worth another shot. Pass `exception_types` to scope retries to specific classes (referenced by name, since the workflow body doesn't import the exception):
```python
await self.run_action(
    call_third_party_api(payload),
    retry=RetryPolicy(
        attempts=5,
        exception_types=["ConnectionError", "TimeoutError"],
    ),
)
```
If the action raises something not in the list, the policy doesn't match and the failure propagates immediately. If multiple `RetryPolicy` configurations could match (e.g., one for `RateLimitError`, one for generic network errors), pass a list of policies; the runtime picks the first matching one, as in the sketch below.
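A sketch of the list form, reusing the illustrative `RateLimitError` name from above (this assumes `retry` accepts a sequence of policies, per the prose):

```python
await self.run_action(
    call_third_party_api(payload),
    retry=[
        # Checked first: rate limits get a bigger budget.
        RetryPolicy(attempts=10, exception_types=["RateLimitError"]),
        # Fallback for generic transient network failures.
        RetryPolicy(attempts=3, exception_types=["ConnectionError", "TimeoutError"]),
    ],
)
```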
## Timeouts
A per-call timeout caps how long an individual action attempt is
allowed to run before the dispatcher reclaims the lease and treats the
attempt as failed:
```python
from datetime import timedelta

await self.run_action(
    slow_summary(report_id),
    retry=RetryPolicy(attempts=3),
    timeout=timedelta(minutes=2),
)
```
A timeout that fires is an infrastructure-level failure, not a user exception. The runtime always retries timeouts until the action either completes or your retry budget is exhausted.
## What is enforced today
Waymark is alpha. The retry-policy surface is broader than what the runtime currently honors, and we want to be honest about the difference:
| Field | Parsed | Enforced today |
|---|---|---|
| `attempts` | ✓ | ✓ |
| `exception_types` | ✓ | ✓ |
| `backoff_seconds` | ✓ | ✗ (retries are immediate) |
| `timeout` | ✓ | ✗ (not yet applied to dispatch) |
In other words: retry count and exception filtering work as documented today. Backoff intervals and per-call timeouts are accepted by the API and persisted in IR, but the runloop scheduler doesn't yet read them when deciding when to dispatch a retry. Both are on the near-term path to enforcement; track the source notes for the latest status.
If you're depending on backoff or per-action timeouts in the meantime, the safest pattern is to encode the wait into the action itself (an explicit `await asyncio.sleep(...)` between attempts) and to time-box your own outbound calls inside the action.
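As a sketch of that pattern with plain asyncio (nothing Waymark-specific; `fetch_summary` is a hypothetical helper, and the re-raise as `ConnectionError` assumes that type is listed in your policy's `exception_types`):

```python
import asyncio


async def slow_summary(report_id: str) -> str:
    # Approximate backoff: each attempt waits before doing the work,
    # spacing out the otherwise-immediate retries.
    await asyncio.sleep(2.0)
    try:
        # Time-box the outbound call ourselves, since per-call
        # timeouts aren't applied to dispatch yet.
        return await asyncio.wait_for(fetch_summary(report_id), timeout=120)
    except asyncio.TimeoutError:
        # Surface as a retryable type from the policy's exception_types.
        raise ConnectionError(f"summary backend timed out for {report_id}")
```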
## API reference
For the precise signatures and field types, see the Python `RetryPolicy` reference.