Watchdog mechanism for network threads #108710

Open
DaveCTurner opened this issue May 16, 2024 · 2 comments
Assignees
Labels
:Distributed/Network Http and internode communication implementations >enhancement Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

We need to be able to detect situations where a network thread spends too long doing non-network work. Today we log some warnings in this area, but they are not always useful: for example, the OutboundHandler warnings include the time spent doing other things while the outbound channel is unwritable. Making these warnings more granular is hard, especially if we don't want to hurt the performance of these performance-critical threads.

Rather than pushing more timing and logging work onto these threads, a better approach would be to build a separate watchdog mechanism that runs periodically (say, every 15s) and checks that every network thread is either idle or has completed at least one task since the last time the watchdog ran. Built right, I reckon each thread could report its progress by simply updating a volatile long field (maybe reserving one bit as an idle flag), which should be adequately performant.
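
To make the idea concrete, here is a minimal sketch of how the per-thread progress field and the watchdog check might fit together. It is only an illustration under stated assumptions: the names ThreadWatchdogSketch, ProgressTracker, onTaskStart, onTaskEnd and check are all hypothetical and do not come from the Elasticsearch codebase or from the WIP change, and the bit layout (bit 0 as the idle flag, the remaining bits as a completed-task count) is just one possible encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: none of these names come from the Elasticsearch codebase.
final class ThreadWatchdogSketch {

    /** Progress state for one network thread: written only by that thread, read by the watchdog. */
    static final class ProgressTracker {
        private static final long IDLE_BIT = 1L;

        final String threadName;
        // Assumed layout: bit 0 is the idle flag, the remaining bits count completed tasks.
        private volatile long state = IDLE_BIT;
        // Touched only by the watchdog thread, so it needs no synchronization.
        private long lastObserved = IDLE_BIT;

        ProgressTracker(String threadName) {
            this.threadName = threadName;
        }

        /** Called by the owning thread just before it runs a task: clears the idle flag. */
        void onTaskStart() {
            state = state & ~IDLE_BIT;
        }

        /** Called by the owning thread just after a task finishes: bumps the count, sets the idle flag. */
        void onTaskEnd() {
            // There is a single writer, so a plain read-modify-write of the volatile field is safe.
            state = (state + 2) | IDLE_BIT;
        }

        /** Called by the watchdog: true if the thread is busy and unchanged since the previous call. */
        boolean isStuckSinceLastCheck() {
            long current = state;
            boolean stuck = (current & IDLE_BIT) == 0 && current == lastObserved;
            lastObserved = current;
            return stuck;
        }
    }

    private final List<ProgressTracker> trackers = new CopyOnWriteArrayList<>();

    /** Registers a tracker for a network thread; the thread keeps the returned object. */
    ProgressTracker register(String threadName) {
        ProgressTracker tracker = new ProgressTracker(threadName);
        trackers.add(tracker);
        return tracker;
    }

    /** One watchdog pass: returns the names of threads that are busy and made no progress. */
    List<String> check() {
        List<String> stuck = new ArrayList<>();
        for (ProgressTracker tracker : trackers) {
            if (tracker.isStuckSinceLastCheck()) {
                stuck.add(tracker.threadName);
            }
        }
        return stuck;
    }
}
```

In this sketch the hot path costs one volatile write per task start and one per task end, and all comparison and logging work happens on the watchdog side. A real version would presumably call check() from a scheduled task (for instance via ScheduledExecutorService.scheduleWithFixedDelay with a 15s delay) and log or capture a stack trace for each thread it reports.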

@DaveCTurner DaveCTurner added >enhancement :Distributed/Network Http and internode communication implementations Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. labels May 16, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label May 16, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

WIP solution at #109204
