Retrying Failed or Errored Steps¶
You can specify a retryStrategy that will dictate how failed or errored steps are retried:
# This example demonstrates the use of retry back offs
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: retry-backoff-
spec:
entrypoint: retry-backoff
templates:
- name: retry-backoff
retryStrategy:
limit: 10
retryPolicy: "Always"
backoff:
duration: "1" # Must be a string. Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
factor: 2
maxDuration: "1m" # Must be a string. Default unit is seconds. Could also be a Duration, e.g.: "2m", "6h", "1d"
affinity:
nodeAntiAffinity: {}
container:
image: python:alpine3.6
command: ["python", -c]
# fail with a 66% probability
args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
limitis the maximum number of times the container will be retried.retryPolicyspecifies if a container will be retried on failure, error, both, or only transient errors (e.g. i/o or TLS handshake timeout). "Always" retries on both errors and failures. Also available:OnFailure(default), "OnError", and "OnTransientError" (available after v3.0.0-rc2).backoffis an exponential back-offnodeAntiAffinityprevents running steps on the same host. Current implementation allows only emptynodeAntiAffinity(i.e.nodeAntiAffinity: {}) and by default it uses labelkubernetes.io/hostnameas the selector.
Providing an empty retryStrategy (i.e. retryStrategy: {}) will cause a container to retry until completion.