Kubernetes metadata overwhelms memory limits in the Agent process #4729

Open
3 tasks
faec opened this issue May 9, 2024 · 8 comments
Labels: bug, Team:Elastic-Agent-Data-Plane

@faec
Contributor

faec commented May 9, 2024

Diagnostics from production Agents running on Kubernetes show:

  • The elastic-agent process itself uses more memory than all its configured inputs combined.
  • Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers: about 70% of the total comes from elastic-agent-autodiscover and the other 20% from helpers internal to elastic-agent.

We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.

Definition of done

  • Provide steps for a reproducible setup that can demonstrate the aforementioned memory usage with an Agent diagnostic
  • Attach Agent diagnostic to this issue to use as a baseline, so we can compare against it when improvements are made
  • Reduce memory use by Kubernetes helpers from 90% to TBD% (TBD, at the moment, until we've done more investigation)
faec added the bug and Team:Elastic-Agent-Control-Plane labels on May 9, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented May 9, 2024

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests (#4730).

@faec
Contributor Author

faec commented May 16, 2024

> Possible related, an increase starting in 8.14.0 was detected by the ECK integration tests

FWIW the diagnostics described by this issue were from 8.13.3.

jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the Team:Elastic-Agent-Control-Plane label on May 21, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23
Contributor

jlind23 commented May 21, 2024

After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.

@bturquet

cc @gizas

@faec
Contributor Author

faec commented May 22, 2024

Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't place any constraints on which variables might be requested, so the Kubernetes helpers make (and cache) very large, verbose state queries. #2887 is related: a possible Agent-side solution is to implement better policy parsing that validates the full configuration and gives variable providers like Kubernetes a list of the variables that are actually used.

@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?

@gizas
Contributor

gizas commented May 23, 2024

@faec trying to understand here how we can combine those pieces. So let's say the parsing changes and there is a list of variables that the provider will need to populate.
In the Kubernetes provider here, we start the watchers, but with general arguments.

The other metadata enrichment that we do with enrichers is, again, unrelated to the flow you describe here.

Maybe we can sync offline for me to understand more about this?

cc @MichaelKatsoulis


6 participants