Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting online domains using peers health check #8795

Closed

Conversation

khoaguin
Copy link
Member

@khoaguin khoaguin commented May 9, 2024

Description

Now, we are running peers health check periodically in background task (this PR).
This PR tries to get the list of online domains using the results of this peers health check feature instead of trying to login the domains as guest clients to check if they are online every time we call sy.domains

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

khoaguin and others added 3 commits May 9, 2024 14:25
…e a guess client at a gateway, just ignore it and move on to another one

Co-authored-by: Shubham Gupta <shubhamgupta3121@gmail.com>
@khoaguin
Copy link
Member Author

khoaguin commented May 10, 2024

NetworkRegistry._repr_html_ currently looks like

image

@khoaguin

This comment was marked as resolved.

@khoaguin

This comment was marked as resolved.

- improve repr for domain and network registry to show online domains and networks
- `sy.gateways` and `sy.domains` work with python webserver nodes
@khoaguin
Copy link
Member Author

khoaguin commented May 10, 2024

TODO:

  • Run again tests/integration/local/gateway_local_test.py in CI
  • Make peer heath check works with k8s nodes

khoaguin and others added 2 commits May 13, 2024 08:46
- add a wait in `gateway_local_test` before checking for online domains

Co-authored-by: Shubham Gupta <shubhamgupta3121@gmail.com>
@khoaguin khoaguin changed the title [WIP] Getting online domains using peers health check Getting online domains using peers health check May 13, 2024
- add some wait times before checking node connection status
@khoaguin

This comment was marked as resolved.

@khoaguin
Copy link
Member Author

khoaguin commented May 14, 2024

Trying to test with 2 k8s domains failed

Reproduce: First, launch a gateway and domain1 with

CLUSTER_NAME=testgateway1 CLUSTER_HTTP_PORT=9081 tox -e dev.k8s.launch.gateway
CLUSTER_NAME=testdomain1 CLUSTER_HTTP_PORT=9082 tox -e dev.k8s.launch.domain

which works. Then trying to launch a second domain with CLUSTER_NAME=testdomain2 CLUSTER_HTTP_PORT=9083 tox -e dev.k8s.launch.domain which also show that it's successful
image

Login to the gateway and first domain work
image

However, failed to login the second domain
image

@khoaguin khoaguin requested a review from shubham3121 May 14, 2024 14:06
@khoaguin
Copy link
Member Author

khoaguin commented May 15, 2024

Note on some flaky integration k8s network tests after merging the Peers Health Check PR:

Since we now check the peers' health by making a connection and create a guest client on the peer node (using peer_client = peer.client_with_context(context=context) in PeerHealthCheckTask.peer_route_healthcheck which is run in a background thread, the tests in tests/integration/network/gateway_test.py that add and update priority of routes with invalid ports may result in error Failed to establish a connection with. However, this error will not be seen in the main thread that run the test. Furthermore, the flakiness of these tests is because we only run peer_route_healthcheck periodically, so they will only fail when they run at the moment when the invalid routes have the highest priority.

Another issue was that we updated the whole peer object in PeerHealthCheckTask.peer_route_heathcheck which can update the outdated node_routes to the node's store in the background thread. This is fixed in this PR #8851.

TODO: checking route's validity (e.g. by pinging) before adding it to a list of node routes?

@khoaguin khoaguin force-pushed the peer-health-check-online-domains branch from 376810a to 3e42aad Compare May 17, 2024 02:08
@khoaguin
Copy link
Member Author

Screenshot from 2024-05-27 13-30-55

@khoaguin
Copy link
Member Author

khoaguin commented May 27, 2024

Moved the PR here #8851 with some bug fixes and improvements

@khoaguin khoaguin closed this May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants