onExit steps executed after deadline exceeded are stopped if an onExit pod spends >=10 seconds in Pending state #13060
I am pretty sure the bug is that this check should be aware of exit pods: argo-workflows/workflow/controller/operator.go, line 1276 in adef075.
@jswxstw, do you want to own this issue then? I think you're volunteering to fix it.
Yes, I’ll fix it.
jswxstw pushed a commit to jswxstw/argo-workflows that referenced this issue on May 21, 2024
Signed-off-by: jswxstw <jswxstw@gmail.com>
Pre-requisites
:latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
What happened/what did you expect to happen?
Expected Behavior: On timeout of a workflow, the onExit steps are expected to ignore the deadline and execute after the main steps are stopped.
Actual Behavior: On timeout of a workflow, if an onExit step takes 10 seconds or more to go from Pending to Running (10 seconds is the default requeue time configured in the workflow-controller deployment), the onExit step fails immediately with the message "Step exceeded its deadline" and the remaining onExit steps are not executed.
Link to original message on Slack: https://cloud-native.slack.com/archives/C01QW9QSSSK/p1715787843656389
This issue has also been verified on the latest Argo Workflows release, v3.5.6, by another user who replied to the above Slack message.
The 10 second limit for onExit pods after the deadline is reached seems to be tied to the controller environment variable DEFAULT_REQUEUE_TIME: I can get the issue to occur more frequently with lower requeue time values, or less frequently with higher values, applied to the workflow-controller deployment.
Version
v3.4.10
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container