-
Notifications
You must be signed in to change notification settings - Fork 406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][AWS] Allow specification of Security Groups for resources. #3501
base: master
Are you sure you want to change the base?
Conversation
ae9633d
to
e644f58
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback for resolving the following would be appreciated:
- ports / security_group_name for serve
- cloud resources only receive
cluster_name_on_cloud
formake_deploy_variables
instead ofcluster_name
@@ -920,12 +920,6 @@ def _try_validate_ports(self) -> None: | |||
""" | |||
if self.ports is None: | |||
return | |||
if skypilot_config.get_nested(('aws', 'security_group_name'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how best to resolve this. For Serve, ports are automatically specified for the load balancer which conflicts with being able to set SGs for it.
#3473
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a customized security group, it is possible that the security group does not the required ports opened and the user does not have enough permission to modify the security group to add those rules.
Also, checking the ports that are already open for a security group is a bit tricky, as it is possible that some of the rules open up some ports but only expose them to a specific range of IP.
skypilot/sky/provision/aws/instance.py
Lines 718 to 729 in 05aafb8
existing_ports: Set[int] = set() | |
for existing_rule in sg.ip_permissions: | |
# Skip any non-tcp rules. | |
if existing_rule['IpProtocol'] != 'tcp': | |
continue | |
# Skip any rules that don't have a FromPort or ToPort. | |
if 'FromPort' not in existing_rule or 'ToPort' not in existing_rule: | |
continue | |
existing_ports.update( | |
range(existing_rule['FromPort'], existing_rule['ToPort'] + 1)) | |
ports_to_open = resources_utils.port_set_to_ranges( | |
resources_utils.port_ranges_to_set(ports) - existing_ports) |
Tagging @cblmemo for anything to chime in here.
In the long run, we may have to handle those, but it might be fine for now to print a WARNING / hint here to let a user know that the security group is specified and it is not guaranteed that the ports will be correctly opened on AWS.
if self.cloud is None or instance(self.cloud, clouds.AWS):
security_group_name = skypilot_config.get_nested(('aws', 'security_group_name'), None)
if security_group_name is not None:
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are specified: {security_group_name}. '
'It is not guaranteed that the ports will be opened if the specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out `aws.security_group_name` in `~/.sky/config.yaml` to allow SkyPilot to automatically create and configure the security group.')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning looks good to me. Random idea: could also consider remove the port specification for load balancer if we use security groups as resoruces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the suggested warning.
sky/utils/schemas.py
Outdated
@@ -567,7 +569,7 @@ def get_config_schema(): | |||
# Validation may fail if $schema is included. | |||
if k != '$schema' | |||
} | |||
resources_schema['properties'].pop('ports') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how best to resolve this. For Serve, ports are automatically specified for the load balancer which conflicts with being able to set SGs for it.
#3473
@@ -102,11 +102,27 @@ Available fields and semantics: | |||
|
|||
# Security group (optional). | |||
# | |||
# The name of the security group to use for all instances. If not specified, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to match the format of remote identity.
sky/clouds/aws.py
Outdated
if user_security_group is not None and not isinstance( | ||
user_security_group, str): | ||
for profile in user_security_group: | ||
if fnmatch.fnmatchcase(cluster_name_on_cloud, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function does not currently have access to cluster_name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch! A proposal would be we refactor the make_deploy_resources_variable
to take ClusterName
dataclass in all cloud classes, instead of a single string for the cluster name:
skypilot/sky/provision/provisioner.py
Lines 41 to 50 in 05aafb8
@dataclasses.dataclass | |
class ClusterName: | |
display_name: str | |
name_on_cloud: str | |
def __repr__(self) -> str: | |
return repr(self.display_name) | |
def __str__(self) -> str: | |
return self.display_name |
To do so, we can move the ClusterName
class out of provision
module to utils.resources_utils
, and let the cloud class and the functions under provision
module to use ClusterName
object as input.
@Michaelvll @concretevitamin do you have any suggestions for the concerns I raised? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support for specifying security group names for different cluster @JGSweets! This is awesome. I left some comments, mainly concerning that specifying security group name will cause open ports not reliable if the user does not have the security group set up correctly. For now, a good warning or hint message can be enough.
sky/clouds/aws.py
Outdated
if user_security_group is not None and not isinstance( | ||
user_security_group, str): | ||
for profile in user_security_group: | ||
if fnmatch.fnmatchcase(cluster_name_on_cloud, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch! A proposal would be we refactor the make_deploy_resources_variable
to take ClusterName
dataclass in all cloud classes, instead of a single string for the cluster name:
skypilot/sky/provision/provisioner.py
Lines 41 to 50 in 05aafb8
@dataclasses.dataclass | |
class ClusterName: | |
display_name: str | |
name_on_cloud: str | |
def __repr__(self) -> str: | |
return repr(self.display_name) | |
def __str__(self) -> str: | |
return self.display_name |
To do so, we can move the ClusterName
class out of provision
module to utils.resources_utils
, and let the cloud class and the functions under provision
module to use ClusterName
object as input.
sky/utils/schemas.py
Outdated
@@ -567,7 +569,7 @@ def get_config_schema(): | |||
# Validation may fail if $schema is included. | |||
if k != '$schema' | |||
} | |||
resources_schema['properties'].pop('ports') | |||
resources_schema['properties'].pop('port', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ports
should be the correct field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ports is correct as it pops this from
_get_single_resources_schema`
Line 659 in 8a0a34d
'ports': { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I mean, it seems we changed it to port
now, which seems not a valid property to pop? It seems the issue is fixed in the latest commit : )
sky/utils/schemas.py
Outdated
{ | ||
# A list of single-element dict to pretain the | ||
# order. | ||
# Example: | ||
# security_group_name: | ||
# - my-cluster1-*: my-security-group-1 | ||
# - my-cluster2-*: my-security-group-2 | ||
# - "*"": my-security-group-3 | ||
'type': 'array', | ||
'items': { | ||
'type': 'object', | ||
'additionalProperties': { | ||
'type': 'string' | ||
}, | ||
'maxProperties': 1, | ||
'minProperties': 1, | ||
}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: Since this pattern has been used by the remote_identity
and security_group_name
, we can probably have a constant or a function to generate this dict.
@@ -920,12 +920,6 @@ def _try_validate_ports(self) -> None: | |||
""" | |||
if self.ports is None: | |||
return | |||
if skypilot_config.get_nested(('aws', 'security_group_name'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a customized security group, it is possible that the security group does not the required ports opened and the user does not have enough permission to modify the security group to add those rules.
Also, checking the ports that are already open for a security group is a bit tricky, as it is possible that some of the rules open up some ports but only expose them to a specific range of IP.
skypilot/sky/provision/aws/instance.py
Lines 718 to 729 in 05aafb8
existing_ports: Set[int] = set() | |
for existing_rule in sg.ip_permissions: | |
# Skip any non-tcp rules. | |
if existing_rule['IpProtocol'] != 'tcp': | |
continue | |
# Skip any rules that don't have a FromPort or ToPort. | |
if 'FromPort' not in existing_rule or 'ToPort' not in existing_rule: | |
continue | |
existing_ports.update( | |
range(existing_rule['FromPort'], existing_rule['ToPort'] + 1)) | |
ports_to_open = resources_utils.port_set_to_ranges( | |
resources_utils.port_ranges_to_set(ports) - existing_ports) |
Tagging @cblmemo for anything to chime in here.
In the long run, we may have to handle those, but it might be fine for now to print a WARNING / hint here to let a user know that the security group is specified and it is not guaranteed that the ports will be correctly opened on AWS.
if self.cloud is None or instance(self.cloud, clouds.AWS):
security_group_name = skypilot_config.get_nested(('aws', 'security_group_name'), None)
if security_group_name is not None:
with ux_utils.print_exception_no_traceback():
logger.warning(
f'Ports {self.ports} and security group name are specified: {security_group_name}. '
'It is not guaranteed that the ports will be opened if the specified security group is not correctly set up. '
'Please try to specify `ports` only and leave out `aws.security_group_name` in `~/.sky/config.yaml` to allow SkyPilot to automatically create and configure the security group.')
4be366f
to
9c71b97
Compare
@@ -154,7 +154,11 @@ | |||
# we need to take this field from the new yaml. | |||
('provider', 'tpu_node'), | |||
('provider', 'security_group', 'GroupName'), | |||
('available_node_types', 'ray.head.default', 'node_config', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixes an error with IAM roles if the cluster previously existed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we want to always update the IamInstanceProfile
for an old cluster when a user update the ~/.sky/config.yaml
? Does this work with launching an existing cluster with a different remote_identity
? It would be nice to include some tests in the PR description : )
if security_group_name is not None: | ||
with ux_utils.print_exception_no_traceback(): | ||
logger.warning( | ||
f'Ports {self.ports} and security group name are ' | ||
f'specified: {security_group_name}. It is not ' | ||
'guaranteed that the ports will be opened if the ' | ||
'specified security group is not correctly set up. ' | ||
'Please try to specify `ports` only and leave out ' | ||
'`aws.security_group_name` in `~/.sky/config.yaml` to ' | ||
'allow SkyPilot to automatically create and configure ' | ||
'the security group.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the ideal place as it seems to print the warning 10+
times per launch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch! Let's remove the warning here and move it to the aws.make_deploy_resources_variables
so that the warning will only shown when launching an exact launchable resource. (check out the comment in aws.py above)
@Michaelvll @cblmemo I've updated the PR with the Do we want to further refactor |
9c71b97
to
0b2d5b2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @JGSweets and sorry for the delay! This PR and the ClusterName
refactoring looks mostly good to me. Left several additional comments. I don't think we need further refactoring for provisioners at the moment.
sky/utils/schemas.py
Outdated
@@ -567,7 +569,7 @@ def get_config_schema(): | |||
# Validation may fail if $schema is included. | |||
if k != '$schema' | |||
} | |||
resources_schema['properties'].pop('ports') | |||
resources_schema['properties'].pop('port', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I mean, it seems we changed it to port
now, which seems not a valid property to pop? It seems the issue is fixed in the latest commit : )
@@ -154,7 +154,11 @@ | |||
# we need to take this field from the new yaml. | |||
('provider', 'tpu_node'), | |||
('provider', 'security_group', 'GroupName'), | |||
('available_node_types', 'ray.head.default', 'node_config', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we want to always update the IamInstanceProfile
for an old cluster when a user update the ~/.sky/config.yaml
? Does this work with launching an existing cluster with a different remote_identity
? It would be nice to include some tests in the PR description : )
if user_security_group is None and resources.ports is not None: | ||
# Already checked in Resources._try_validate_ports | ||
assert user_security_group is None | ||
security_group = USER_PORTS_SECURITY_GROUP_NAME.format( | ||
cluster_name_on_cloud) | ||
elif user_security_group is not None: | ||
assert resources.ports is None | ||
security_group = user_security_group | ||
else: | ||
cluster_name.name_on_cloud) | ||
elif user_security_group is None: | ||
security_group = DEFAULT_SECURITY_GROUP_NAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the warning from aws.py to here so that it will only be printed once whenever it is trying to launch an AWS cluster with the security group specified.
security_group = user_security_group
if security_group is None:
if resources.ports is not None:
# Already checked in Resources._try_validate_ports
security_group = USER_PORTS_SECURITY_GROUP_NAME.format(
cluster_name.name_on_cloud)
elif user_security_group is None:
security_group = DEFAULT_SECURITY_GROUP_NAME
elif resources.ports is not None:
# resources.ports will only take effect when the security group is not specified
# by user.
logger.warning(
f'{colorama.Fore.YELLOW}AWS security group name is specified '
f'({security_group}) with aws.security_group_name in ~/.sky/config.yaml, '
f'while ports ({resources.ports}) is also specified in task resources. The '
'ports are not guaranteed to be opened in the specified security group, '
'due to the complex rules setup. Please manually make sure it.'
f'{colorama.Style.RESET_ALL}')
Unfortunately, since we call make_deploy_resources_variables
in the following place, the logging will still twice for each failover. The resources_vars
retrieved is just for getting the custom_resources
later, so we can add a custom_resources
field in the provision_record
returned by the bulk_provision
.
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 1557 to 1562 in 0b2d5b2
resources_vars = ( | |
to_provision.cloud.make_deploy_resources_variables( | |
to_provision, | |
resources_utils.ClusterName( | |
cluster_name, handle.cluster_name_on_cloud), | |
region, zones)) |
skypilot/sky/backends/cloud_vm_ray_backend.py
Lines 1544 to 1553 in 0b2d5b2
provision_record = provisioner.bulk_provision( | |
to_provision.cloud, | |
region, | |
zones, | |
resources_utils.ClusterName( | |
cluster_name, handle.cluster_name_on_cloud), | |
num_nodes=num_nodes, | |
cluster_yaml=handle.cluster_yaml, | |
prev_cluster_ever_up=prev_cluster_ever_up, | |
log_dir=self.log_dir) |
I am ok to leave the change for bulk_provision
and provision_record
to a future PR if we think the current PR involves too many things.
raise ValueError( | ||
'Cannot specify ports when AWS security group name is ' | ||
'specified.') | ||
if self.cloud is None or isinstance(self.cloud, clouds.AWS): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to:
if self.cloud is None or isinstance(self.cloud, clouds.AWS): | |
if isinstance(self.cloud, clouds.AWS): |
with ux_utils.print_exception_no_traceback(): | ||
logger.warning( | ||
f'Ports {self.ports} and security group name are ' | ||
f'specified: {security_group_name}. It is not ' | ||
'guaranteed that the ports will be opened if the ' | ||
'specified security group is not correctly set up. ' | ||
'Please try to specify `ports` only and leave out ' | ||
'`aws.security_group_name` in `~/.sky/config.yaml` to ' | ||
'allow SkyPilot to automatically create and configure ' | ||
'the security group.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with ux_utils.print_exception_no_traceback(): | |
logger.warning( | |
f'Ports {self.ports} and security group name are ' | |
f'specified: {security_group_name}. It is not ' | |
'guaranteed that the ports will be opened if the ' | |
'specified security group is not correctly set up. ' | |
'Please try to specify `ports` only and leave out ' | |
'`aws.security_group_name` in `~/.sky/config.yaml` to ' | |
'allow SkyPilot to automatically create and configure ' | |
'the security group.') | |
logger.warning( | |
f'Ports {self.ports} and security group name are ' | |
f'specified: {security_group_name}. It is not ' | |
'guaranteed that the ports will be opened if the ' | |
'specified security group is not correctly set up. ' | |
'Please try to specify `ports` only and leave out ' | |
'`aws.security_group_name` in `~/.sky/config.yaml` to ' | |
'allow SkyPilot to automatically create and configure ' | |
'the security group.') |
def _get_cloud_name_property_mapping(field: str): | ||
return { | ||
field: { | ||
'oneOf': [ | ||
{ | ||
'type': 'string' | ||
}, | ||
{ | ||
# A list of single-element dict to pretain the | ||
# order. | ||
# Example: | ||
# property_name: | ||
# - my-cluster1-*: my-property-1 | ||
# - my-cluster2-*: my-property-2 | ||
# - "*"": my-property-3 | ||
'type': 'array', | ||
'items': { | ||
'type': 'object', | ||
'additionalProperties': { | ||
'type': 'string' | ||
}, | ||
'maxProperties': 1, | ||
'minProperties': 1, | ||
}, | ||
} | ||
] | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can use a constant instead and let the callers use the constant.
_PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY = {
'oneOf': [
{
'type': 'string'
},
{
# A list of single-element dict to pretain the
# order.
# Example:
# property_name:
# - my-cluster1-*: my-property-1
# - my-cluster2-*: my-property-2
# - "*"": my-property-3
'type': 'array',
'items': {
'type': 'object',
'additionalProperties': {
'type': 'string'
},
'maxProperties': 1,
'minProperties': 1,
},
}
]
}
In the callers, we just use:
'remote_identity': _PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY
'security_group_name': _PRORPERTY_NAME_OR_CLUSTER_NAME_TO_PROPERTY
if security_group_name is not None: | ||
with ux_utils.print_exception_no_traceback(): | ||
logger.warning( | ||
f'Ports {self.ports} and security group name are ' | ||
f'specified: {security_group_name}. It is not ' | ||
'guaranteed that the ports will be opened if the ' | ||
'specified security group is not correctly set up. ' | ||
'Please try to specify `ports` only and leave out ' | ||
'`aws.security_group_name` in `~/.sky/config.yaml` to ' | ||
'allow SkyPilot to automatically create and configure ' | ||
'the security group.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good catch! Let's remove the warning here and move it to the aws.make_deploy_resources_variables
so that the warning will only shown when launching an exact launchable resource. (check out the comment in aws.py above)
Similar to #3488 but for SGs
This PR:
Addresses: #3473
In
~/.sky/config.yaml
:security_group_name
via stringTested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh
EDITED: