
[SURE-8169] [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange #45509

Open
moio opened this issue May 16, 2024 · 1 comment
Assignees
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments


moio commented May 16, 2024

Internal reference: SURE-8169

Rancher Server Setup

  • Rancher version: SUSE Rancher 2.8.3
  • Installation option (Docker install/Helm Chart): Helm on AKS

Information about the Cluster

  • Kubernetes version: AKS 1.27
  • Cluster Type (Local/Downstream): local

Describe the bug
Heap profile shows high memory usage on aks.(*aksOperatorController).onClusterChange.

[heap profile screenshot: over 40% of usage attributed to this function]

To Reproduce
TBD

Expected Result
No significant memory usage from aks.(*aksOperatorController).onClusterChange in flame graphs.

@moio moio added the kind/bug Issues that are defects reported by users or that we know have reached a real release label May 16, 2024

moio commented May 20, 2024

Reproducer

Prerequisites

  • Rancher 2.8.3 with a publicly accessible Internet address (e.g. in AKS)
  • an Azure account
  • its subscription ID
  • a resource group in that Azure account

Setup

  • import any number of AKS clusters as managed clusters
    • add Azure cloud credentials to Rancher (full docs). TL;DR:
      • run:
        export SUBSCRIPTION_ID=XXXXXXXX-YYYY-ZZZZ-AAAA-BBBBBBBB
        export GROUP=st-rg
        
        az ad sp create-for-rbac \
          --scope /subscriptions/$SUBSCRIPTION_ID/resourceGroups/$GROUP \
          --role Contributor
      • note the JSON output
      • go to ☰ -> Cluster Management -> Cloud Credentials -> Create -> Azure
      • paste appId from the JSON output into Client ID, password into Secret, and $SUBSCRIPTION_ID into Subscription ID
    • import the clusters as managed clusters: ☰ -> Cluster Management -> Import Existing -> Azure AKS
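The relevant fields can also be pulled out of the service-principal JSON non-interactively; here is a minimal sketch using python3 for the parsing (the sample JSON below is a placeholder standing in for your real az output):

```shell
# Placeholder JSON, shaped like the output of `az ad sp create-for-rbac`
SP_JSON='{"appId":"11111111-2222-3333-4444-555555555555","displayName":"rancher-sp","password":"example-secret","tenant":"00000000-0000-0000-0000-000000000000"}'

# appId goes into the Client ID field, password into the Secret field
CLIENT_ID=$(echo "$SP_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["appId"])')
SECRET=$(echo "$SP_JSON" | python3 -c 'import json,sys; print(json.load(sys.stdin)["password"])')
echo "$CLIENT_ID"
```

python3 is used here only to avoid an extra dependency; jq would work equally well if installed.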

Test (of base version)

In one terminal with kubeconfig pointing to the upstream cluster run:

while true; do
  for cluster in $(kubectl get clusters.management.cattle.io --no-headers -o custom-columns=name:.metadata.name | grep -v local); do
    kubectl patch clusters.management.cattle.io ${cluster} --type='json' -p='[{"op": "replace", "path": "/status/serviceAccountTokenSecret", "value":""}]'
  done
done

This script forces Rancher to continuously execute a core function of AKS cluster deployment (generateAndSetServiceAccount). It emulates the reported environment, where clusters are frequently deployed and redeployed, without actually redeploying AKS clusters, which would be time-intensive.

In another terminal with kubeconfig pointing to the upstream cluster run:

while true; do
	for pod in $(kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name); do
		kubectl exec -n cattle-system ${pod} -c rancher -- curl -s http://localhost:6060/debug/pprof/heap -o heap
		kubectl cp -n cattle-system -c rancher ${pod}:heap ./heap

		go tool pprof -top --trim=false --show_from=generateAndSetServiceAccount ./heap | grep kubernetes.NewForConfig
	done
done

This script monitors one important source of memory usage by generateAndSetServiceAccount (kubernetes.NewForConfig).

Expected output is either:

  1. a continuous stream of "ShowFrom expression matched no samples", meaning no memory was allocated, or
  2. "ShowFrom expression matched no samples" interleaved with lines like the following:
         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig

The fourth column (cum, the cumulative allocated size) indicates memory usage. It should increase only slightly, if at all, after several minutes. Importantly, stopping the first script in the other terminal should bring usage back down to zero within a few minutes.
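As a sketch of extracting just that fourth column programmatically (assuming the line shape shown above, where pprof's -top columns are flat, flat%, sum%, cum, cum%):

```shell
# A sample line in the shape produced by `go tool pprof -top | grep`
LINE='         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig'

# awk splits on runs of whitespace; field 4 is the cumulative ("cum") size
CUM=$(echo "$LINE" | awk '{print $4}')
echo "$CUM"
```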

Actual output is "ShowFrom expression matched no samples" interleaved with lines like the following:

         0     0%  7.72%     7.51MB  5.79%  k8s.io/client-go/kubernetes.NewForConfig

Here the fourth column keeps increasing, and stopping the first script in the other terminal does not bring usage back down to zero.
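To compare successive values of that fourth column numerically rather than by eye, the human-readable sizes can be converted to bytes; a minimal sketch, handling only the MB, kB, and B suffixes pprof commonly emits:

```shell
# Convert a pprof size string (e.g. "7.51MB", "512kB", "128B") to bytes
to_bytes() {
    case "$1" in
        *MB) awk -v v="${1%MB}" 'BEGIN { printf "%d\n", v * 1024 * 1024 }' ;;
        *kB) awk -v v="${1%kB}" 'BEGIN { printf "%d\n", v * 1024 }' ;;
        *B)  printf '%s\n' "${1%B}" ;;
    esac
}

to_bytes 7.51MB   # a steadily growing series of these values indicates the leak
```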

Test (of patched version)

Once the faulty behavior above has been reproduced:

  • stop both scripts
  • swap the Rancher image with the patched one: kubectl set image -n cattle-system deployment/rancher rancher=rancher/rancher:v2.8.3-debug-45509-1
  • wait for the clusters to come back as fully available in the UI; refresh the homepage to double-check that this is the case
  • re-run the scripts; the expected behavior described above should now be observed

@moio moio changed the title [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange [SURE-8169] [BUG] high memory usage for the aks.(*aksOperatorController).onClusterChange May 20, 2024
@moio moio self-assigned this May 20, 2024