Retries & Timeouts

Background tasks fail. Sometimes they raise; sometimes they hang. Waymark gives you a single place to express how each action should respond to either failure mode, applied per call.

Default behavior

If you don't configure anything, you get the conservative defaults:

  • A bare await some_action(...) runs the action once. If it raises, the exception propagates out of the workflow as a normal Python error. No automatic retries.
  • An action that exceeds its dispatch timeout is treated as a runtime failure (not a user exception) and is retried until it eventually runs to completion. Timeouts usually mean a worker died mid-execution, which is something we should recover from automatically.

These defaults are deliberately tame: you have to opt in to retries for your own code, but cross-worker coordination failures retry indefinitely under the hood.
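The run-once default can be pictured in plain asyncio, independent of Waymark. `some_action` here is a stand-in, not a Waymark API:

```python
import asyncio

# Illustrative sketch of the default: an un-wrapped action runs exactly once,
# and an exception propagates like any other Python error.

async def some_action() -> str:
    raise ConnectionError("remote unavailable")

async def run_once() -> str:
    # no retry wrapper: the first failure escapes immediately
    return await some_action()

try:
    asyncio.run(run_once())
    outcome = "succeeded"
except ConnectionError:
    outcome = "propagated"
```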

Configuring with run_action

To override the defaults, wrap the action call in self.run_action(...) and supply a RetryPolicy:

from waymark import RetryPolicy
from waymark.workflow import Workflow

# `workflow`, `call_third_party_api`, and `Result` are assumed to be
# defined or imported elsewhere in your project.

@workflow
class FlakyApiWorkflow(Workflow):
    async def run(self, payload: dict) -> Result:
        return await self.run_action(
            call_third_party_api(payload),
            retry=RetryPolicy(attempts=5),
        )

RetryPolicy(attempts=N) counts total executions - five attempts means up to four retries after the initial try. RetryPolicy() with no attempts argument defaults to a generous retry budget (100 internally), suitable for external calls that you really expect to succeed eventually.
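The attempt-counting semantics can be sketched in plain Python (this is illustrative, not Waymark's actual implementation):

```python
# `attempts` is the total number of executions, so attempts=5 allows the
# initial try plus up to four retries.

def run_with_attempts(fn, attempts):
    """Call fn() up to `attempts` times total, re-raising the last error."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
    raise last_exc

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 4:
        raise ConnectionError("transient")
    return "ok"

result = run_with_attempts(flaky, attempts=5)
# flaky succeeds on its 4th execution: 1 initial try + 3 retries
```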

Filtering by exception

You usually don't want to retry on every exception - a ValueError from a bad input is permanent, but a ConnectionError is worth another shot. Pass exception_types to scope retries to specific classes (referenced by name, since the workflow body doesn't import the exception):

await self.run_action(
    call_third_party_api(payload),
    retry=RetryPolicy(
        attempts=5,
        exception_types=["ConnectionError", "TimeoutError"],
    ),
)

If the action raises something not in the list, the policy doesn't match - the failure propagates immediately. If multiple RetryPolicy configurations could match (e.g., one for RateLimitError, one for generic network errors), pass a list of policies; the runtime picks the first matching one.
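First-match selection over several policies can be pictured like this. This is a sketch of the semantics described above, not Waymark's runtime code; the name-based matching follows from exceptions being referenced by class name, and the default budget of 100 from the note above:

```python
# Illustrative sketch: given several policies, the first whose
# exception_types list names the raised exception's class wins.

class RetryPolicy:
    def __init__(self, attempts=100, exception_types=None):
        self.attempts = attempts
        self.exception_types = exception_types  # None matches any exception

    def matches(self, exc):
        if self.exception_types is None:
            return True
        return type(exc).__name__ in self.exception_types

def select_policy(policies, exc):
    for policy in policies:
        if policy.matches(exc):
            return policy
    return None  # no match: the failure propagates immediately

policies = [
    RetryPolicy(attempts=10, exception_types=["RateLimitError"]),
    RetryPolicy(attempts=3, exception_types=["ConnectionError", "TimeoutError"]),
]

chosen = select_policy(policies, ConnectionError("connection reset"))
```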

Timeouts

A per-call timeout caps how long an individual action attempt is allowed to run before the dispatcher reclaims the lease and treats the attempt as failed:

from datetime import timedelta

await self.run_action(
    slow_summary(report_id),
    retry=RetryPolicy(attempts=3),
    timeout=timedelta(minutes=2),
)

A timeout that fires is an infrastructure-level failure, not a user exception. The runtime always retries timeouts until the action either completes or your retry budget is exhausted.

What is enforced today

Waymark is alpha. The retry-policy surface is broader than what the runtime currently honors, and we want to be honest about the difference:

Field              Parsed   Enforced today
attempts           ✓        ✓
exception_types    ✓        ✓
backoff_seconds    ✓        ✗ - retries are immediate
timeout            ✓        ✗ - not yet applied to dispatch

In other words: retry count and exception filtering work as documented today. Backoff intervals and per-call timeouts are accepted by the API and persisted in IR, but the runloop scheduler doesn't yet read them when deciding when to dispatch a retry. Both are on the near-term path to enforcement; track the source notes for the latest status.

If you're depending on backoff or per-action timeouts in the meantime, the safest pattern is to encode the wait into the action itself (an explicit await asyncio.sleep(...) between work) and to time-box your own outbound calls inside the action.
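That workaround pattern can be sketched with standard asyncio. `fetch_remote` is a hypothetical stand-in for your real outbound call; the delays and attempt count are arbitrary:

```python
import asyncio

# Workaround sketch until backoff and timeouts are enforced by the runloop:
# do the waiting and the time-boxing inside the action body itself.

async def fetch_remote() -> str:
    await asyncio.sleep(0.01)  # simulate a quick network call
    return "ok"

async def call_with_backoff(attempts=3, base_delay=0.05) -> str:
    last_exc = None
    for attempt in range(attempts):
        if attempt > 0:
            # explicit exponential backoff between tries, since
            # RetryPolicy.backoff_seconds is not yet enforced
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        try:
            # time-box the outbound call ourselves, since per-call
            # timeouts are not yet applied to dispatch
            return await asyncio.wait_for(fetch_remote(), timeout=1.0)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
    raise last_exc

result = asyncio.run(call_with_backoff())
```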

API reference

For the precise signatures and field types, see the Python RetryPolicy reference.