[k8s] Update k8s FAQ on autoscaling #3437

Open
wants to merge 1 commit into
base: master

romilbhardwaj
Collaborator

Follow up to #3415.

Collaborator

concretevitamin left a comment

LGTM

To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned!
Support for autoscaling clusters is experimental.
To run on autoscaling clusters, set the :code:`provision_timeout` key in :code:`~/.sky/config.yaml` to a large value to give enough time for the cluster autoscaler to provision new nodes.
This will allow SkyPilot to wait for the cluster to scale up before launching the task. Example:
Collaborator

Suggested change
- This will allow SkyPilot to wait for the cluster to scale up before launching the task. Example:
+ This will allow SkyPilot enough time to wait for the cluster to scale up before failing over to the next candidate resource (e.g., next cloud). Example:


# ~/.sky/config.yaml
kubernetes:
  provision_timeout: 900 # Wait 15 minutes for nodes to get provisioned before fail over
Collaborator

Suggested change
-   provision_timeout: 900 # Wait 15 minutes for nodes to get provisioned before fail over
+   provision_timeout: 900 # Wait 15 minutes for nodes to get provisioned before failover.
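
For readers of the FAQ wording discussed above, here is a minimal sketch of the "wait up to the timeout, then fail over" behavior the suggested text describes. It is not SkyPilot's implementation: `provision_with_failover`, `is_ready`, `poll_interval`, and the injectable `clock`/`sleep` parameters are hypothetical names introduced only for illustration. Only `provision_timeout` and its 900-second example value come from the diff.

```python
import time
from typing import Callable, Optional, Sequence


def provision_with_failover(
    candidates: Sequence[str],
    is_ready: Callable[[str], bool],
    provision_timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the first candidate that becomes ready within the timeout.

    Hypothetical illustration only: each candidate (e.g. an autoscaling
    Kubernetes cluster, then another cloud) gets up to `provision_timeout`
    seconds before the loop moves on to the next one.
    """
    for candidate in candidates:
        deadline = clock() + provision_timeout
        while True:
            if is_ready(candidate):
                return candidate
            if clock() >= deadline:
                break  # Timed out: fail over to the next candidate.
            sleep(poll_interval)
    return None  # No candidate became ready in time.
```

Under these assumptions, a larger timeout gives a cluster autoscaler more time to add nodes before the loop gives up on that candidate, which is the trade-off the FAQ text is trying to convey.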
