Kubernetes metadata overwhelms memory limits in the Agent process #4729

faec · 2024-05-09T19:10:48Z

Diagnostics from production Agents running on Kubernetes show:

The elastic-agent process itself uses more memory than all its configured inputs combined.
Within the elastic-agent process, more than 90% of memory use is in Kubernetes helpers: roughly 70% of the process total comes from elastic-agent-autodiscover, and the other 20% comes from helpers internal to elastic-agent.

We need to understand why the Kubernetes helpers are using so much memory, and find a way to mitigate it.

Definition of done

Provide steps for a reproducible setup that can demonstrate the aforementioned memory usage with an Agent diagnostic
Attach Agent diagnostic to this issue to use as a baseline, so we can compare against it when improvements are made
Reduce memory use by Kubernetes helpers from 90% to TBD% (TBD, at the moment, until we've done more investigation)


elasticmachine · 2024-05-09T19:10:50Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz · 2024-05-09T20:24:11Z

Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests (#4730).

faec · 2024-05-16T11:46:18Z

> Possibly related: an increase starting in 8.14.0 was detected by the ECK integration tests

FWIW the diagnostics described by this issue were from 8.13.3.

elasticmachine · 2024-05-21T14:04:43Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

jlind23 · 2024-05-21T14:05:13Z

After chatting with @cmacknz and @pierrehilbert, assigning this to you @faec and making it a high priority for the next sprint.

bturquet · 2024-05-21T16:58:28Z

cc @gizas

faec · 2024-05-22T20:32:52Z

Agent's variable provider API is very opaque, which is probably a big part of this. Agent's Coordinator doesn't provide any constraints on which variables might be requested, so the Kubernetes helpers make (and cache) very large, verbose state queries. #2887 is related: a possible Agent-side solution is to implement better policy parsing that validates the full configuration and gives variable providers like Kubernetes a list of the variables that are actually used.

@bturquet / @gizas, if we add hooks to the variable provider API for the Coordinator to give a list of possible variables, what work would be needed to restrict Kubernetes queries to those variables?

gizas · 2024-05-23T06:51:20Z

@faec I'm trying to understand how we can combine those pieces. So let's say the parsing changes and there is a list of variables that the provider will need to populate.
In the Kubernetes provider, here, we start the watchers, but with general arguments.

The other metadata enrichment we do with enrichers is, again, unrelated to the flow you describe here.

Maybe we can sync offline for me to understand more about this?

cc @MichaelKatsoulis

faec added the bug and Team:Elastic-Agent-Control-Plane labels May 9, 2024

cmacknz mentioned this issue May 9, 2024

ECK TestFleet* is failing elastic/cloud-on-k8s#7790

Open

jlind23 added the Team:Elastic-Agent-Data-Plane label and removed the Team:Elastic-Agent-Control-Plane label May 21, 2024

jlind23 assigned faec May 21, 2024
