Using Dial function with client-go kubernetes.NewForConfig triggers a memory leak #118703

Open
dmitry-irtegov opened this issue Jun 16, 2023 · 20 comments · May be fixed by #124894
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dmitry-irtegov

What happened?

After we upgraded the k8s.io/client-go package to v0.26.0, we noticed memory leaks in our components that access the Kubernetes API. For some components we managed to remove the leak by rolling back client-go to 0.25.4, but for other components this creates dependency hell.

After some profiling and debugging I managed to write mockup code that is roughly based on our production code and reliably triggers the leak in client-go 0.26.0 to 0.27.3. The code is uploaded to https://github.com/kublr/memory_leak_demo . It includes a pprof endpoint so you can verify the leak. The memory allocations happen in the transport.(*tlsTransportCache).get method.

The leak is triggered by the presence of a custom Dial function in the rest.Config object. As I infer from commits like #117311, this usage is now considered incorrect, but we need some guidance or advice on how to adapt our production code.
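
A minimal sketch of the kind of usage that triggers the leak (a hypothetical wrapper, roughly mirroring the linked mockup; names are illustrative only):

    package k8swrap

    import (
        "context"
        "net"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    type k8sAPIWrapper struct {
        cfg *rest.Config
    }

    // clientset is called repeatedly during the life of the service; every call
    // causes client-go to allocate a fresh DialHolder internally, which in turn
    // creates a new TLS cache entry.
    func (w *k8sAPIWrapper) clientset() (*kubernetes.Clientset, error) {
        cfg := *w.cfg // shallow copy, as in our code
        cfg.Dial = func(ctx context.Context, network, addr string) (net.Conn, error) {
            // hypothetical custom dialing logic (e.g. through a tunnel)
            return (&net.Dialer{}).DialContext(ctx, network, addr)
        }
        return kubernetes.NewForConfig(&cfg)
    }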

What did you expect to happen?

We expected to use public interfaces without triggering memory leaks in the package.

How can we reproduce it (as minimally and precisely as possible)?

Compile and run the sample code in https://github.com/kublr/memory_leak_demo . Using pprof you can observe a growing amount of memory allocated in transport.(*tlsTransportCache).get(), and using the debugger you can observe that every call to kubernetes.NewForConfig() creates a new entry in the TLS cache.

Anything else we need to know?

As I understand it, the real culprit is this line:

conf.DialHolder = &transport.DialHolder{Dial: c.Dial}

where client-go creates a DialHolder structure for configs that do not have one but do have a Dial hook.

Because the DialHolder pointer is added to a copy of the original config (at least in the code path originating from kubernetes.NewForConfig), subsequent calls to kubernetes.NewForConfig result in new DialHolder instances.

Because the pointer to the DialHolder instance is part of the TLS cache key, this creates a sequence of unique map keys that produce useless cache entries (they will never be a cache hit), plus a memory leak.

I reiterate that the problem is not the memory leak itself, but a broken cache with unique cache keys. I think it would be better to mark configs with a custom Dial hook as non-cacheable.
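
A standalone illustration (not client-go code) of why a freshly allocated pointer inside a map key can never produce a cache hit:

    package main

    import "fmt"

    type holder struct{ name string }

    // cacheKey mimics a cache key that embeds a pointer to a holder struct.
    type cacheKey struct{ h *holder }

    func main() {
        cache := map[cacheKey]int{}
        for i := 0; i < 3; i++ {
            // Each iteration allocates a new holder, so the key is always unique.
            k := cacheKey{h: &holder{name: "same logical config"}}
            cache[k] = i
        }
        fmt.Println(len(cache)) // prints 3, not 1 - every lookup was a miss
    }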

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.10", GitCommit:"e770bdbb87cccdc2daa790ecd69f40cf4df3cc9d", GitTreeState:"archive", BuildDate:"2023-05-19T08:42:33Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

We also observed the leak with Kubernetes versions from 1.22 to 1.26.

Cloud provider

We observed the leak on vSphere, Azure, and on-prem installations. Most likely all clouds are affected.

OS version

# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux dvi-7696-1686743539-vsp1-master-0 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

custom

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@dmitry-irtegov dmitry-irtegov added the kind/bug Categorizes issue or PR as related to a bug. label Jun 16, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 16, 2023
@aojea
Member

aojea commented Jun 16, 2023

/sig apimachiner
/cc @aojea @enj

Mo can you please take a look?

@k8s-ci-robot
Contributor

@aojea: The label(s) sig/apimachiner cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachiner
/cc @aojea @enj

Mo can you please take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Member

aojea commented Jun 16, 2023

/sig apimachinery

@k8s-ci-robot
Contributor

@aojea: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@enj
Member

enj commented Jun 16, 2023

@dmitry-irtegov you will need to update your code to use a transport.Config with a shared dial holder in the same way as #117258 to avoid the leak (or just create a single clientset instead of reconstructing it over and over again).

We need the caching behavior to be the default, otherwise clients like kubectl break by creating too many clientsets, which then run out of file descriptors.

We can try to make the rest.Config to transport.Config logic better (maybe by updating it to also use a holder?), but we are unlikely to catch all mistakes (unless we make the fields private and force compile-time errors on all users...).

@dmitry-irtegov
Author

update your code to use a transport.Config with a shared dial holder in the same way as #117258

I do not really understand how the #117258 approach can be applied to our situation. It might not be clear from my mockup, but in our application there are multiple instances of "MyK8sApiWrapper" for access to different clusters, and they are accessed from different contexts and threads. For the same reason, keeping a single clientset instance is not a realistic option.

To avoid the leak, the transport.Config and DialHolder instances must be tied to the MyK8sApiWrapper instance. I thought about this, but I am not sure it is a good idea, though I cannot formulate why. Maybe you can dispel my doubts.

Also, most of our components are long-running services, so they will recycle "MyK8sApiWrapper" instances from time to time. Because every instance will have its own cache entry, the cache entries would still leak, just not as fast as now.

What really makes me wonder is that my mockup (and our applications) do not seem to leak connections. I have never seen it open a second TCP socket (let alone multiple connections or running out of sockets), and in pprof I did not see any sign of multiple clientset or net/http objects being created. So the connection pooling in client-go seems to do its job: with repeated NewForConfig calls it returns the same object or, at least, objects pointing to the same HTTP socket.

It is only the TLS cache that is leaking, and this leak does not break the connection pooling it is supposed to enhance. Which makes me wonder whether we even need the TLS cache in its existing form.

@enj
Member

enj commented Jun 16, 2023

To avoid leak, transport.Config and DialHolder instances must be tied to MyK8sApiWrapper instance.

This is what I would expect you to do (though not exactly as mentioned below).

Also, most of our components are long running services so they will recycle "MyK8sApiWrapper" instances from time to time. Because every instance will have its own cache entry, the cache entries still would leak, just not so fast as now.

This is true of the API server code in #117258 as well (basically, a known limitation). You would solve this by holding onto the *DialHolder for a given cluster even if you recycle the MyK8sApiWrapper.

It is only the TLS cache that is leaking. And this leak does not break the connection pooling it supposed to enhance. Which makes me wonder, do we ever need TLS cache in the existing form.

It does leak connections, as I had to fix them: kubernetes/kubectl#1128 (comment)


For now, I would probably just set the Proxy func to utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment) so that the TLS cache is skipped for your code. This isn't guaranteed behavior and may stop working at any time, but it will at least solve the issue for you for now (by restoring the old behavior).
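
Roughly, that workaround could look like this on the consumer side (a sketch, with the caveat above that skipping the cache this way is not guaranteed behavior):

    package k8sclient

    import (
        "net/http"

        utilnet "k8s.io/apimachinery/pkg/util/net"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // newClientset builds a clientset from a config that carries a custom Dial
    // function. A non-nil Proxy currently makes the config uncacheable, so no
    // TLS cache entry is created for it.
    func newClientset(config *rest.Config) (*kubernetes.Clientset, error) {
        config.Proxy = utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment)
        return kubernetes.NewForConfig(config)
    }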

For long term improvements to client-go, some options I can see are:

  1. Giving rest.Config more control over the transport.Config's DialHolder
  2. Giving rest.Config the ability to opt out of TLS caching in a dedicated way
  3. Giving users the ability to inform client-go that they are done with a particular clientset, so that the cache can be cleaned up correctly
  4. Using a runtime.Finalizer to automatically clean up the cache when all clientset references have been GC'd

@dmitry-irtegov
Author

You would solve this by holding onto the *DialHolder for a given cluster even if you recycle the MyK8sApiWrapper.

Well, this is why I thought the approach of #117258 is not applicable to us (at least, not directly). Our Dial function is not really a function, it is a closure referencing the MyK8sApiWrapper instance. A single DialHolder instance shared across different wrapper instances would immediately break everything.

I even managed to demonstrate this behavior in my mockup, though there it might not seem logical.

Am I correct that to properly convert a rest.Config to a transport.Config I need to call rest.Config.TransportConfig() and then manually set the DialHolder field?

@enj
Member

enj commented Jun 19, 2023

Am I correct, that to properly convert rest.Config to transport.Config I need to call rest.Config.TransportConfig() and then manually set DialHolder field?

Yes.
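
For example, a minimal sketch (assuming dialHolder is a *transport.DialHolder that you keep alive for the lifetime of the wrapper, or of the cluster it represents, and whose Dial field wraps your cluster-specific closure):

    package k8sclient

    import (
        "net/http"

        "k8s.io/client-go/rest"
        "k8s.io/client-go/transport"
    )

    func roundTripperFor(restCfg *rest.Config, dialHolder *transport.DialHolder) (http.RoundTripper, error) {
        tc, err := restCfg.TransportConfig()
        if err != nil {
            return nil, err
        }
        // Reusing the same *DialHolder keeps the TLS cache key stable, so
        // repeated calls hit the same cache entry instead of creating new ones.
        tc.DialHolder = dialHolder
        return transport.New(tc)
    }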

@dmitry-irtegov
Author

Thank you for the answer, but now I have another question: the transport functions return an http.RoundTripper, but I need a kubernetes.Interface or an unversioned rest.RESTClient.

I guess I need to use kubernetes.NewForConfigAndClient() and rest.UnversionedRESTClientForConfigAndClient() respectively, but how do I properly convert an http.RoundTripper into an http.Client?

@enj
Member

enj commented Jun 20, 2023

http.Client just takes transport as a field:

func NewForConfig(c *rest.Config) (*Clientset, error) {
    configShallowCopy := *c
    if configShallowCopy.UserAgent == "" {
        configShallowCopy.UserAgent = rest.DefaultKubernetesUserAgent()
    }
    // share the transport between all clients
    httpClient, err := rest.HTTPClientFor(&configShallowCopy)
    if err != nil {
        return nil, err
    }

func HTTPClientFor(config *Config) (*http.Client, error) {
    transport, err := TransportFor(config)
    if err != nil {
        return nil, err
    }
    var httpClient *http.Client
    if transport != http.DefaultTransport || config.Timeout > 0 {
        httpClient = &http.Client{
            Transport: transport,
            Timeout:   config.Timeout,
        }


Alternatively, you could copy the subset of the transport building logic from:

func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
    key, canCache, err := tlsConfigKey(config)
    if err != nil {
        return nil, err
    }

    if canCache {
        // Ensure we only create a single transport for the given TLS options
        c.mu.Lock()
        defer c.mu.Unlock()
        defer metrics.TransportCacheEntries.Observe(len(c.transports))

        // See if we already have a custom transport for this config
        if t, ok := c.transports[key]; ok {
            metrics.TransportCreateCalls.Increment("hit")
            return t, nil
        }
        metrics.TransportCreateCalls.Increment("miss")
    } else {
        metrics.TransportCreateCalls.Increment("uncacheable")
    }

    // Get the TLS options for this client config
    tlsConfig, err := TLSConfigFor(config)
    if err != nil {
        return nil, err
    }
    // The options didn't require a custom TLS config
    if tlsConfig == nil && config.DialHolder == nil && config.Proxy == nil {
        return http.DefaultTransport, nil
    }

    var dial func(ctx context.Context, network, address string) (net.Conn, error)
    if config.DialHolder != nil {
        dial = config.DialHolder.Dial
    } else {
        dial = (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext
    }

    // If we use are reloading files, we need to handle certificate rotation properly
    // TODO(jackkleeman): We can also add rotation here when config.HasCertCallback() is true
    if config.TLS.ReloadTLSFiles && tlsConfig != nil && tlsConfig.GetClientCertificate != nil {
        dynamicCertDialer := certRotatingDialer(tlsConfig.GetClientCertificate, dial)
        tlsConfig.GetClientCertificate = dynamicCertDialer.GetClientCertificate
        dial = dynamicCertDialer.connDialer.DialContext
        go dynamicCertDialer.Run(DialerStopCh)
    }

    proxy := http.ProxyFromEnvironment
    if config.Proxy != nil {
        proxy = config.Proxy
    }

    transport := utilnet.SetTransportDefaults(&http.Transport{
        Proxy:               proxy,
        TLSHandshakeTimeout: 10 * time.Second,
        TLSClientConfig:     tlsConfig,
        MaxIdleConnsPerHost: idleConnsPerHost,
        DialContext:         dial,
        DisableCompression:  config.DisableCompression,
    })

    if canCache {
        // Cache a single transport for these options
        c.transports[key] = transport
    }

    return transport, nil
}

and then set that transport on rest.Config.Transport so that you can use this branch (be careful to avoid double wrapping via HTTPWrappersForConfig):

    if config.Transport != nil {
        rt = config.Transport
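
Putting the pieces together, one possible shape (a sketch of the above, not a guaranteed usage pattern; instead of the rest.Config.Transport branch it hands a prebuilt http.Client to NewForConfigAndClient, which, as far as I can tell, avoids the double-wrapping concern because the client is used as-is):

    package k8sclient

    import (
        "net/http"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/transport"
    )

    func clientsetFor(restCfg *rest.Config, dialHolder *transport.DialHolder) (*kubernetes.Clientset, error) {
        tc, err := restCfg.TransportConfig()
        if err != nil {
            return nil, err
        }
        // Shared per cluster, so the TLS cache key stays stable across calls.
        tc.DialHolder = dialHolder

        // transport.New also applies the auth/user-agent wrappers from the config.
        rt, err := transport.New(tc)
        if err != nil {
            return nil, err
        }

        httpClient := &http.Client{
            Transport: rt,
            Timeout:   restCfg.Timeout,
        }
        return kubernetes.NewForConfigAndClient(restCfg, httpClient)
    }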

@enj
Member

enj commented Jun 26, 2023

/assign
(so I remember to do something about this)
/sig api-machinery
/triage accepted

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2023
@2uasimojo

@dmitry-irtegov Thank you for this, we're hitting the exact same issue in hive. Any chance you can share your implementation of the workaround so I can follow your example?

2uasimojo added a commit to 2uasimojo/hive that referenced this issue Jul 13, 2023
When ClusterDeployment.Spec.ControlPlaneConfig.APIServerIPOverride is in
use, we experience the memory leak described by
kubernetes/kubernetes#118703. That issue has
not been resolved at the time of this writing, but describes a couple of
workarounds. Try the one described in [this comment](kubernetes/kubernetes#118703 (comment)):

> For now, I would probably just set the Proxy func to
> utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment) so that
> the TLS cache is skipped for your code. This isn't guaranteed behavior
> and may stop working at any time, but it will at least solve the issue
> for you for now (by restoring the old behavior).

HIVE-2272
@olegch
Contributor

olegch commented Jul 18, 2023

Hello,

@enj

We need the caching behavior to be the default otherwise clients like kubectl break by creating too many clientsets which then run out of file descriptors.

It seems that in the current code tlsCache is never cleaned, so even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.
It is easy to imagine a service like that - ArgoCD is a good example. Users can register and de-register Kubernetes clusters via the UI or API, so while at any given moment ArgoCD "knows" about a limited number of clusters, tlsCache will over time collect an unlimited number of entries.

Does it maybe make sense to implement a time-limited cache so that entries that are not accessed for a certain period of time are removed?
We use this strategy in our code where we use static process/package-scoped caches.

If I am not mistaken, the same issue is discussed in #109289.

@2uasimojo

even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.

I can corroborate this. When we deployed code with the workaround, it fixed the envs using DialContext -- but we also saw a long-standing "sawtooth" memory pattern disappear in the envs that weren't using it.

@enj
Member

enj commented Jul 20, 2023

It seems that in the current code tlsCache is never cleaned, so even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.

Correct, though this has been true since long before any of the dial holder changes.

Does it maybe make sense to implement a time-limited cache so that entries that are not accessed for a certain period of time are removed?

In general we need an approach that makes it clear to client-go that consumers are done with a particular cache entry. It is unrelated to time, because a client could be in use but idle.

@2uasimojo

I'm sure we're shooting for something clever, but I would be happy with either

config.UseCache = false  // the normally-cached thing is GCed when the client goes out of scope

or

client := GimmeClient(...)
defer client.Scrub()  // explicit cleanup when this function completes

Again, my use case is in the context of Reconcile loops in controllers that manage many spoke clusters that come and go. Each invocation is idempotent and isolated, and we can't (reliably) inject logic that knows when a specific spoke is deleted. Thus

done with a particular cache entry

is not really possible for us. We would rather simply not use the cache, and accept whatever performance hit that entails.

@enj
Member

enj commented Jul 20, 2023

You can already do the former by setting the proxy func or constructing the transport directly.

@2uasimojo

You can already do the former by setting the proxy func

Right, cool, I'm doing that and it's working for me, but I'm concerned about this:

This isn't guaranteed behavior and may stop working at any time

If we can declare this a permanent/supported path, great 👍

That said, it would be neat if there were a way to spell it that was more intuitive. For networking dunces like me, "set a proxy function to utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment)" doesn't naturally map to "don't use a cache for an internal thing I can't see".

Similar for

constructing the transport directly

(which I didn't even try to understand, tbf)

@olegch
Contributor

olegch commented Jul 20, 2023

In general we need an approach that make it clear to client-go that consumers are done with a particular cache entry. It is unrelated to time because a client could be in-use but idle.

Yes, this is true, but it is probably the easiest fix that would remove the leak and at the same time keep most of the benefits of the cache.
Removing an entry from the cache after some idle time will not change the external behavior of the client, because the cache entry will just be reinstated when used again.
At the same time, it would not affect performance much: if an item was removed, it was not used often, and re-creating a cache entry is not critical for performance; if it was used often, it would not be removed.
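
A rough standalone sketch of that idea (hypothetical code, not client-go API): entries idle for longer than maxIdle are dropped and simply rebuilt on their next use.

    package idlecache

    import (
        "net/http"
        "sync"
        "time"
    )

    type entry struct {
        rt       http.RoundTripper
        lastUsed time.Time
    }

    type idleCache struct {
        mu      sync.Mutex
        entries map[string]*entry
        maxIdle time.Duration
    }

    func newIdleCache(maxIdle time.Duration) *idleCache {
        return &idleCache{entries: map[string]*entry{}, maxIdle: maxIdle}
    }

    // get returns the cached round tripper for key, building it if it is missing
    // and evicting any entries that have been idle for longer than maxIdle.
    func (c *idleCache) get(key string, build func() http.RoundTripper) http.RoundTripper {
        c.mu.Lock()
        defer c.mu.Unlock()

        now := time.Now()
        for k, e := range c.entries {
            if now.Sub(e.lastUsed) > c.maxIdle {
                delete(c.entries, k)
            }
        }

        e, ok := c.entries[key]
        if !ok {
            e = &entry{rt: build()}
            c.entries[key] = e
        }
        e.lastUsed = now
        return e.rt
    }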
