
[SURE-8169] [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange #45509

Open
moio opened this issue May 16, 2024 · 1 comment
Assignees
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments


moio commented May 16, 2024

Internal reference: SURE-8169

Rancher Server Setup

  • Rancher version: SUSE Rancher 2.8.3
  • Installation option (Docker install/Helm Chart): Helm on AKS

Information about the Cluster

  • Kubernetes version: AKS 1.27
  • Cluster Type (Local/Downstream): local

Describe the bug
Heap profile shows high memory usage on aks.(*aksOperatorController).onClusterChange.

[heap profile screenshot: over 40% of usage attributed to this function]

To Reproduce
TBD

Expected Result
No significant memory usage from aks.(*aksOperatorController).onClusterChange in flame graphs.

@moio moio added the kind/bug Issues that are defects reported by users or that we know have reached a real release label May 16, 2024

moio commented May 20, 2024

Reproducer

Prerequisites

  • Rancher 2.8.3 with a publicly accessible Internet address (e.g. in AKS)
  • an Azure account
  • its subscription ID
  • a resource group in that Azure account

Setup

  • import any number of AKS clusters as managed clusters
    • add Azure cloud credentials to Rancher (full docs). TL;DR:
      • run:
        export SUBSCRIPTION_ID=XXXXXXXX-YYYY-ZZZZ-AAAA-BBBBBBBB
        export GROUP=st-rg
        
        az ad sp create-for-rbac \
          --scope /subscriptions/$SUBSCRIPTION_ID/resourceGroups/$GROUP \
          --role Contributor
      • note the JSON output
      • go to ☰ -> Cluster Management -> Cloud Credentials -> Create -> Azure
      • paste appId from the JSON output into Client ID, password into Secret, and $SUBSCRIPTION_ID into Subscription ID
    • import the clusters as managed clusters: ☰ -> Cluster Management -> Import Existing -> Azure AKS
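The relevant fields can also be pulled out of the service-principal JSON non-interactively; here is a minimal sketch using python3 for the parsing (the sample JSON below is a placeholder standing in for your real az output):

```shell
# Placeholder JSON, shaped like the output of `az ad sp create-for-rbac`
SP_JSON='{"appId":"11111111-2222-3333-4444-555555555555","displayName":"rancher-sp","password":"example-secret","tenant":"00000000-0000-0000-0000-000000000000"}'

# appId goes into the Client ID field, password into the Secret field
CLIENT_ID=$(echo "$SP_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["appId"])')
SECRET=$(echo "$SP_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["password"])')
echo "$CLIENT_ID"
```

python3 is used here only to avoid an extra dependency; jq would work equally well if installed.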

Test (of base version)

In one terminal with kubeconfig pointing to the upstream cluster run:

while true; do
  for cluster in $(kubectl get clusters.management.cattle.io --no-headers -o custom-columns=name:.metadata.name | grep -v local); do
    kubectl patch clusters.management.cattle.io ${cluster} --type='json' -p='[{"op": "replace", "path": "/status/serviceAccountTokenSecret", "value":""}]'
  done
done

This script forces Rancher to continuously execute a core function of AKS cluster deployment (generateAndSetServiceAccount). It emulates the reported environment, where clusters are frequently deployed and redeployed, without actually redeploying AKS clusters, which would be time-intensive.

In another terminal with kubeconfig pointing to the upstream cluster run:

while true; do
	for pod in $(kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name); do
		kubectl exec -n cattle-system ${pod} -c rancher -- curl -s http://localhost:6060/debug/pprof/heap -o heap
		kubectl cp -n cattle-system -c rancher ${pod}:heap ./heap

		go tool pprof -top --trim=false --show_from=generateAndSetServiceAccount ./heap | grep kubernetes.NewForConfig
	done
done

This script monitors one important source of memory usage by generateAndSetServiceAccount (kubernetes.NewForConfig).

Expected output is either:

  1. a continuous stream of "ShowFrom expression matched no samples", meaning no memory was allocated, or
  2. "ShowFrom expression matched no samples" interleaved with lines like the following:
         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig

The fourth column (cum, the cumulative allocated size) indicates memory usage. It should increase only slightly, if at all, after several minutes. Importantly, stopping the first script in the other terminal should bring usage back down to zero within a few minutes.
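As a sketch of extracting just that fourth column programmatically (assuming the line shape shown above, where pprof's -top columns are flat, flat%, sum%, cum, cum%):

```shell
# A sample line in the shape produced by `go tool pprof -top | grep`
LINE='         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig'

# awk splits on runs of whitespace; field 4 is the cumulative ("cum") size
CUM=$(echo "$LINE" | awk '{print $4}')
echo "$CUM"
```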

Actual output is "ShowFrom expression matched no samples" interleaved with lines like the following:

         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig

Here the fourth column keeps increasing, and stopping the first script in the other terminal does not bring usage back down to zero.
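To compare successive values of that fourth column numerically rather than by eye, the human-readable sizes can be converted to bytes; a minimal sketch, handling only the MB, kB, and B suffixes pprof commonly emits:

```shell
# Convert a pprof size string (e.g. "7.51MB", "512kB", "128B") to bytes
to_bytes() {
    case "$1" in
        *MB) awk -v v="${1%MB}" 'BEGIN { printf "%d\n", v * 1024 * 1024 }' ;;
        *kB) awk -v v="${1%kB}" 'BEGIN { printf "%d\n", v * 1024 }' ;;
        *B)  printf '%s\n' "${1%B}" ;;
    esac
}

to_bytes 7.51MB   # a steadily growing series of these values indicates the leak
```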

Test (of patched version)

Once the faulty behavior above has been reproduced:

  • stop both scripts
  • swap the Rancher image with the patched one: kubectl set image -n cattle-system deployment/rancher rancher=rancher/rancher:v2.8.3-debug-45509-1
  • wait for the clusters to come back as fully available in the UI; refresh the homepage to double-check that this is the case
  • re-run the scripts; the expected behavior described above should now be observed

@moio moio changed the title [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange [SURE-8169] [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange May 20, 2024
@moio moio self-assigned this May 20, 2024