[Serve] Support customizable readiness probe timeout #3472

Open
wants to merge 1 commit into
base: master

cblmemo
Collaborator

@cblmemo cblmemo commented Apr 24, 2024

Expose the readiness probe timeout to users. Example:

# test.yaml
service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 20
    timeout_seconds: 100
  replicas: 1

resources:
  cpus: 2+
  ports: 8081

run: python -m http.server 8081

$ sky serve up test.yaml
Service from YAML spec: test.yaml
Service Spec:
Readiness probe method:           GET /health
Readiness initial delay seconds:  20
Readiness probe timeout seconds:  100
Replica autoscaling policy:       Fixed 1 replica
Spot Policy:                      No spot policy
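
For illustration, the `readiness_probe` fields in the YAML above could be read and validated roughly as follows. This is a minimal sketch, not the code in this PR: `ReadinessProbe`, `parse_readiness_probe`, and both default constants are hypothetical names and placeholder values, not SkyPilot's real ones.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Placeholder defaults for this sketch only; NOT SkyPilot's actual defaults.
ASSUMED_DEFAULT_INITIAL_DELAY_SECONDS = 60
ASSUMED_DEFAULT_TIMEOUT_SECONDS = 15


@dataclass(frozen=True)
class ReadinessProbe:
    """Hypothetical container for the fields shown in test.yaml."""
    path: str
    initial_delay_seconds: int
    timeout_seconds: int


def parse_readiness_probe(service: Dict[str, Any]) -> ReadinessProbe:
    """Read the `readiness_probe` mapping from a parsed `service` section."""
    probe = service.get("readiness_probe")
    if not isinstance(probe, dict) or "path" not in probe:
        raise ValueError("readiness_probe must be a mapping with a 'path' key")
    delay = probe.get("initial_delay_seconds",
                      ASSUMED_DEFAULT_INITIAL_DELAY_SECONDS)
    timeout = probe.get("timeout_seconds", ASSUMED_DEFAULT_TIMEOUT_SECONDS)
    if delay < 0:
        raise ValueError("initial_delay_seconds must be non-negative")
    if timeout <= 0:
        raise ValueError("timeout_seconds must be positive")
    return ReadinessProbe(path=probe["path"],
                          initial_delay_seconds=delay,
                          timeout_seconds=timeout)
```

With the spec above, `parse_readiness_probe` would return a probe whose `timeout_seconds` is 100; omitting the field falls back to the placeholder default.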

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • The YAML above
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

timeout=serve_constants.READINESS_PROBE_TIMEOUT_SECONDS)
response = requests.post(readiness_path,
                         json=post_data,
                         timeout=timeout)
Collaborator

Should we use the timeout for the POST request directly, or should we instead handle the timeout ourselves when deciding whether to set the replica to NOT_READY? Reason: raising the timeout on the POST request lets a single slow request block probing.
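
To make the blocking concern concrete, here is a simplified model of one probe round. It assumes probes are issued one after another in a single loop, which may not match how the controller actually schedules them; `simulate_probe_round` and its inputs are hypothetical and no real HTTP request is made.

```python
from typing import Dict, Tuple


def simulate_probe_round(
        replica_latencies: Dict[str, float],
        timeout_seconds: float) -> Tuple[Dict[str, bool], float]:
    """Model a sequential probe round without doing any network I/O.

    replica_latencies maps a replica name to how long it would take to
    answer. A replica slower than the timeout counts as failed and holds
    the loop for the full timeout.
    """
    elapsed = 0.0
    ready: Dict[str, bool] = {}
    for name, latency in replica_latencies.items():
        if latency > timeout_seconds:
            # The request is abandoned only once the timeout expires.
            elapsed += timeout_seconds
            ready[name] = False
        else:
            elapsed += latency
            ready[name] = True
    return ready, elapsed


# One healthy replica and one that never answers in time.
latencies = {"replica-1": 0.5, "replica-2": 1000.0}
_, short_round = simulate_probe_round(latencies, timeout_seconds=5.0)
_, long_round = simulate_probe_round(latencies, timeout_seconds=100.0)
```

Under this model the round takes 5.5 seconds with a 5-second timeout and 100.5 seconds with the 100-second timeout from the example YAML, so the hung replica dominates the round.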

Collaborator Author

Could you elaborate on "handle the timeout ourselves for when to set the replica to be NOT_READY"? Are you referring to letting the user configure the consecutive failure timeout?

Collaborator Author

Though this does come from a user's request.

Collaborator

I mean, should we let the user customize _CONSECUTIVE_FAILURE_THRESHOLD_TIMEOUT instead of changing the request timeout?
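
One way to picture this alternative: keep the per-request timeout short and only mark a replica NOT_READY once probes have kept failing for longer than a user-configurable window. The sketch below is an assumption about how such a window could work, not the existing controller logic; `ConsecutiveFailureTracker` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional


class ConsecutiveFailureTracker:
    """Tracks how long a replica's probes have been failing in a row."""

    def __init__(self,
                 failure_threshold_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Would be the user-configurable value in this hypothetical design.
        self._threshold = failure_threshold_seconds
        self._clock = clock
        self._first_failure_time: Optional[float] = None

    def record_probe(self, succeeded: bool) -> bool:
        """Record one probe result.

        Returns True if the replica should now be marked NOT_READY.
        """
        if succeeded:
            # Any success resets the failure window.
            self._first_failure_time = None
            return False
        now = self._clock()
        if self._first_failure_time is None:
            self._first_failure_time = now
        return now - self._first_failure_time >= self._threshold
```

With this shape, a slow endpoint is tolerated by widening the failure window rather than by letting each individual request wait longer.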
