More resilient approach for in-flight requests on terminating Pods #10730

Open
2 tasks done
alen-z opened this issue May 16, 2024 · 5 comments
Labels
  • contributor/wanted: Participation from an external contributor is highly requested
  • kind/enhancement: a new or improved feature.
  • priority/P2: need to be fixed in the future

Comments

alen-z commented May 16, 2024

Welcome!

  • Yes, I've searched similar issues on GitHub and didn't find any.
  • Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What did you expect to see?

Is Traefik going to rely on the state of EndpointSlices to decide when to keep sending traffic to terminating Pods?

KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/1669-proxy-terminating-endpoints/README.md

The ProxyTerminatingEndpoints Kubernetes flags are already live in GKE, for example.

This would make Traefik much more reliable in cases where a preStop hook is not implemented.

kevinpollet added the kind/enhancement, priority/P2, and contributor/wanted labels and removed the status/0-needs-triage label on May 23, 2024
kevinpollet self-assigned this on May 23, 2024
@kevinpollet
Member

Hello @alen-z and thanks for your interest in Traefik,

This is an interesting idea. There is already a pull request, #10664, which brings EndpointSlice support and might fix this issue.

@jnoordsij wdyt?

@jnoordsij
Contributor

I've looked into the mentioned KEP, but I just don't see how exactly it should benefit Traefik. In both the existing handling of endpoints and the one I implemented with the EndpointSlice API in #10664, any terminating endpoints should already be considered invalid as they should be marked as not ready by Kubernetes.

The preStop hook helps cover the possible time overlap between Traefik becoming aware that the endpoint is no longer available (whether through the Endpoints or EndpointSlice API) and the pod actually being terminated; this is described in various places, see e.g. https://itnext.io/how-do-you-gracefully-shut-down-pods-in-kubernetes-fb19f617cd67. AFAICS, the mentioned proposal does not offer any alternative to that.

What could be done after #10664 is to create some kind of configuration flag (or even, later, default behavior) that does broadly the same thing as the mentioned KEP: rather than relying on the ready flag, Traefik could allow all serving=true endpoints and then use those with terminating=true only if there are no endpoints with terminating=false, to have a "theoretical" better chance of finding suitable endpoints.

However, I'm not sure if this has any real-world benefit, as the motivation mentions the problem at hand being "When using Service Type=LoadBalancer w/ externalTrafficPolicy=Local" while Traefik should typically be sending traffic to ClusterIP (or NodePort) services. Moreover, the Non-Goals section lists "Handling terminating endpoints for other consumers of the EndpointSlice API, such as ingress controllers or external load balancers.", leading me to believe this is not relevant for Traefik?
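
For illustration only, the selection rule described above could be sketched as follows. This is not code from #10664 or from Traefik: the `Endpoint` struct and `selectEndpoints` function are hypothetical names, and the struct is a simplified stand-in for the EndpointSlice conditions (in the Kubernetes API the conditions are optional pointers, which this sketch flattens to plain booleans).

```go
package main

// Endpoint is a simplified, hypothetical stand-in for one EndpointSlice entry.
// The real API exposes ready/serving/terminating as optional *bool values.
type Endpoint struct {
	Address     string
	Serving     bool
	Terminating bool
}

// selectEndpoints ignores the ready flag and looks only at serving endpoints.
// It returns the serving endpoints that are not terminating; only when there
// are none does it fall back to serving endpoints that are terminating.
func selectEndpoints(all []Endpoint) []Endpoint {
	var active, draining []Endpoint
	for _, ep := range all {
		if !ep.Serving {
			continue // never route to an endpoint that cannot serve
		}
		if ep.Terminating {
			draining = append(draining, ep)
		} else {
			active = append(active, ep)
		}
	}
	if len(active) > 0 {
		return active
	}
	return draining
}
```

With one healthy and one terminating endpoint, only the healthy one is returned; the terminating one is used only once nothing else is left.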

@alen-z
Author

alen-z commented May 28, 2024

Hey @jnoordsij, great gist and appreciate the time invested. Also, pretty nice to see #10664. I wasn't aware, pretty new from the Traefik kitchen.

Ultimate goal: remove the need for a preStop hook, with no new requests sent to a terminating Pod, while keeping the Pod alive to serve in-flight requests (basically an automated preStop). This means finding a way to close the gap between Traefik being notified and the Pod being terminated (if no preStop hook exists).

I have one thing in mind: can Traefik intervene in the EndpointSlice process and the state that controls endpoint termination? Basically, have Traefik as one of the validation gates before the state changes for termination? This puts Traefik in a position to acknowledge and influence the termination action. Example: the Pod would not terminate until Traefik responds that it has no in-flight requests; only then does the EndpointSlice state change. Or maybe Traefik can influence the Pod lifecycle itself to say when it's ready to allow the Pod to terminate, finishing in-flight requests even before the Pod starts terminating?

How to involve Traefik in a way similar to how kube-proxy is involved is a good question. I'd need to look more into it to propose implementation details, but off the top of my head: finalizers, maybe. Add a Traefik finalizer, then remove it when ready to terminate after finishing in-flight requests. Or an Admission Controller of some sort, though I don't know whether the relevant EndpointSlice events pass through it.
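
As a rough sketch of the finalizer idea only: count in-flight requests per backend and drop a finalizer once the count reaches zero. Everything here is an assumption for illustration: the finalizer name `traefik.io/drain-in-flight`, the `inflight` type, and `releaseIfDrained` do not exist in Traefik, and the sketch does not talk to the Kubernetes API. It also says nothing about whether a finalizer would actually delay container shutdown.

```go
package main

import "sync"

// traefikFinalizer is a hypothetical finalizer name, not one Traefik defines.
const traefikFinalizer = "traefik.io/drain-in-flight"

// inflight counts open requests per backend address.
type inflight struct {
	mu     sync.Mutex
	counts map[string]int
}

func newInflight() *inflight {
	return &inflight{counts: make(map[string]int)}
}

// begin records that a request to addr has started.
func (f *inflight) begin(addr string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.counts[addr]++
}

// end records that a request to addr has finished.
func (f *inflight) end(addr string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.counts[addr] > 0 {
		f.counts[addr]--
	}
}

// drained reports whether addr has no open requests.
func (f *inflight) drained(addr string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.counts[addr] == 0
}

// releaseIfDrained returns the finalizer list unchanged while requests to addr
// are still open, and without traefikFinalizer once they have all finished.
func releaseIfDrained(f *inflight, addr string, finalizers []string) []string {
	if !f.drained(addr) {
		return finalizers
	}
	out := make([]string, 0, len(finalizers))
	for _, name := range finalizers {
		if name != traefikFinalizer {
			out = append(out, name)
		}
	}
	return out
}
```

A controller loop would call `begin` and `end` around each proxied request and write the result of `releaseIfDrained` back to the object; that write, and the watch that triggers it, are left out here.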

@jnoordsij
Contributor

I see! I do like the idea, although I think it might be very challenging to achieve such a thing in practice.

But as far as I can see, that goal does not have any direct relation to KEP-1669 or my PR #10664, given they're really about something else (namely the endpoint part) and do not actively attempt to alter Pod termination logic itself in any way.

kevinpollet removed their assignment on May 29, 2024
@alen-z
Author

alen-z commented May 29, 2024

Yes, you are right. The KEP was just the initial spark, and the idea evolved from there.

Glad you find it interesting. If there is a way to try this, maybe we can put it behind a flag to start with.
