Using Dial function with client-go kubernetes.NewForConfig triggers a memory leak #118703

Open
dmitry-irtegov opened this issue Jun 16, 2023 · 20 comments · May be fixed by #124894
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@dmitry-irtegov

What happened?

After we upgraded the k8s.io/client-go package to v0.26.0, we noticed memory leaks in our components that access the Kubernetes API. For some components we managed to remove the leak by rolling back client-go to 0.25.4, but for other components this creates dependency hell.

After some profiling and debugging I managed to write mockup code that is roughly based on our production code and reliably triggers the leak in client-go 0.26.0 to 0.27.3. The code is uploaded to https://github.com/kublr/memory_leak_demo . It includes a pprof endpoint so you can verify the leak. The memory allocations happen in the transport.(*tlsTransportCache).get method.

The leak is triggered by the presence of a custom Dial function in the rest.Config object. As I infer from commits like #117311, this usage is now considered incorrect, but we need some guidance or advice on how to adapt our production code.
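
A minimal sketch of the kind of usage that triggers the leak (a hypothetical wrapper, roughly mirroring the linked mockup; names are illustrative only):

    package k8swrap

    import (
        "context"
        "net"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    type k8sAPIWrapper struct {
        cfg *rest.Config
    }

    // clientset is called repeatedly during the life of the service; every call
    // causes client-go to allocate a fresh DialHolder internally, which in turn
    // creates a new TLS cache entry.
    func (w *k8sAPIWrapper) clientset() (*kubernetes.Clientset, error) {
        cfg := *w.cfg // shallow copy, as in our code
        cfg.Dial = func(ctx context.Context, network, addr string) (net.Conn, error) {
            // hypothetical custom dialing logic (e.g. through a tunnel)
            return (&net.Dialer{}).DialContext(ctx, network, addr)
        }
        return kubernetes.NewForConfig(&cfg)
    }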

What did you expect to happen?

We expected to use public interfaces without triggering memory leaks in the package.

How can we reproduce it (as minimally and precisely as possible)?

Compile and run the sample code in https://github.com/kublr/memory_leak_demo . Using pprof you can observe a growing amount of memory allocated in transport.(*tlsTransportCache).get(), and using the debugger you can observe that every call to kubernetes.NewForConfig() creates a new entry in the TLS cache.

Anything else we need to know?

As I understand it, the real culprit is this line:

conf.DialHolder = &transport.DialHolder{Dial: c.Dial}

where client-go creates a DialHolder structure for configs that do not have one but do have a Dial hook.

Because the DialHolder pointer is added to a copy of the original config (at least in the code path originating from kubernetes.NewForConfig), subsequent calls to kubernetes.NewForConfig result in new DialHolder instances.

Because the pointer to the DialHolder instance is part of the TLS cache key, this creates a sequence of unique map keys that produce useless cache entries (they will never be a cache hit), plus a memory leak.

I reiterate that the problem is not the memory leak itself, but a broken cache with unique cache keys. I think it would be better to mark configs with a custom Dial hook as non-cacheable.
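
A standalone illustration (not client-go code) of why a freshly allocated pointer inside a map key can never produce a cache hit:

    package main

    import "fmt"

    type holder struct{ name string }

    // cacheKey mimics a cache key that embeds a pointer to a holder struct.
    type cacheKey struct{ h *holder }

    func main() {
        cache := map[cacheKey]int{}
        for i := 0; i < 3; i++ {
            // Each iteration allocates a new holder, so the key is always unique.
            k := cacheKey{h: &holder{name: "same logical config"}}
            cache[k] = i
        }
        fmt.Println(len(cache)) // prints 3, not 1 - every lookup was a miss
    }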

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.10", GitCommit:"e770bdbb87cccdc2daa790ecd69f40cf4df3cc9d", GitTreeState:"archive", BuildDate:"2023-05-19T08:42:33Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.9", GitCommit:"a1a87a0a2bcd605820920c6b0e618a8ab7d117d4", GitTreeState:"clean", BuildDate:"2023-04-12T12:08:36Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

We also observed the leak with Kubernetes versions from 1.22 to 1.26.

Cloud provider

We observed the leak on vSphere, Azure, and on-prem installations. Most likely all clouds are affected.

OS version

# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux dvi-7696-1686743539-vsp1-master-0 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

custom

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@dmitry-irtegov dmitry-irtegov added the kind/bug Categorizes issue or PR as related to a bug. label Jun 16, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 16, 2023
@aojea
Member

aojea commented Jun 16, 2023

/sig apimachiner
/cc @aojea @enj

Mo can you please take a look?

@k8s-ci-robot
Contributor

@aojea: The label(s) sig/apimachiner cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachiner
/cc @aojea @enj

Mo can you please take a look?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Member

aojea commented Jun 16, 2023

/sig apimachinery

@k8s-ci-robot
Contributor

@aojea: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig apimachinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@enj
Member

enj commented Jun 16, 2023

@dmitry-irtegov you will need to update your code to use a transport.Config with a shared dial holder in the same way as #117258 to avoid the leak (or just create a single clientset instead of reconstructing it over and over again).

We need the caching behavior to be the default, otherwise clients like kubectl break by creating too many clientsets, which then run out of file descriptors.

We can try to make the rest.Config to transport.Config logic better (maybe by updating it to also use a holder?), but we are unlikely to catch all mistakes (unless we make the fields private and force compile-time errors on all users...).

@dmitry-irtegov
Author

update your code to use a transport.Config with a shared dial holder in the same way as #117258

I do not really understand how the #117258 approach can be applied to our situation. It might not be clear from my mockup, but in our application there are multiple instances of "MyK8sApiWrapper" for access to different clusters, and they are accessed from different contexts and threads. For the same reason, keeping a single clientset instance is not a realistic option.

To avoid the leak, the transport.Config and DialHolder instances must be tied to the MyK8sApiWrapper instance. I thought about this, but I am not sure it is a good idea, though I cannot formulate why. Maybe you can dispel my doubts.

Also, most of our components are long-running services, so they will recycle "MyK8sApiWrapper" instances from time to time. Because every instance will have its own cache entry, the cache entries would still leak, just not as fast as now.

What really makes me wonder is that my mockup (and our applications) do not seem to leak connections. I have never seen it open a second TCP socket (let alone multiple connections or running out of sockets), and in pprof I did not see any sign of multiple clientset or net/http objects being created. So the connection pooling in client-go seems to do its job: with repeated NewForConfig calls it returns the same object or, at least, objects pointing to the same HTTP socket.

It is only the TLS cache that is leaking, and this leak does not break the connection pooling it is supposed to enhance. Which makes me wonder whether we even need the TLS cache in its existing form.

@enj
Member

enj commented Jun 16, 2023

To avoid leak, transport.Config and DialHolder instances must be tied to MyK8sApiWrapper instance.

This is what I would expect you to do (though not exactly as mentioned below).

Also, most of our components are long running services so they will recycle "MyK8sApiWrapper" instances from time to time. Because every instance will have its own cache entry, the cache entries still would leak, just not so fast as now.

This is true of the API server code in #117258 as well (basically, a known limitation). You would solve this by holding onto the *DialHolder for a given cluster even if you recycle the MyK8sApiWrapper.

It is only the TLS cache that is leaking. And this leak does not break the connection pooling it supposed to enhance. Which makes me wonder, do we ever need TLS cache in the existing form.

It does leak connections, as I had to fix them: kubernetes/kubectl#1128 (comment)


For now, I would probably just set the Proxy func to utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment) so that the TLS cache is skipped for your code. This isn't guaranteed behavior and may stop working at any time, but it will at least solve the issue for you for now (by restoring the old behavior).
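
Roughly, that workaround could look like this on the consumer side (a sketch, with the caveat above that skipping the cache this way is not guaranteed behavior):

    package k8sclient

    import (
        "net/http"

        utilnet "k8s.io/apimachinery/pkg/util/net"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // newClientset builds a clientset from a config that carries a custom Dial
    // function. A non-nil Proxy currently makes the config uncacheable, so no
    // TLS cache entry is created for it.
    func newClientset(config *rest.Config) (*kubernetes.Clientset, error) {
        config.Proxy = utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment)
        return kubernetes.NewForConfig(config)
    }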

For long term improvements to client-go, some options I can see are:

  1. Giving rest.Config more control over the transport.Config's DialHolder
  2. Giving rest.Config the ability to opt out of TLS caching in a dedicated way
  3. Giving users the ability to inform client-go that they are done with a particular clientset, so that the cache can be cleaned up correctly
  4. Using a runtime.Finalizer to automatically clean up the cache when all clientset references have been GC'd

@dmitry-irtegov
Author

You would solve this by holding onto the *DialHolder for a given cluster even if you recycle the MyK8sApiWrapper.

Well, this is why I thought the approach of #117258 is not applicable to us (at least, not directly). Our Dial function is not really a function, it is a closure referencing the MyK8sApiWrapper instance. A single DialHolder instance shared across different wrapper instances would immediately break everything.

I even managed to demonstrate this behavior in my mockup, though there it might not seem logical.

Am I correct that to properly convert a rest.Config to a transport.Config I need to call rest.Config.TransportConfig() and then manually set the DialHolder field?

@enj
Member

enj commented Jun 19, 2023

Am I correct, that to properly convert rest.Config to transport.Config I need to call rest.Config.TransportConfig() and then manually set DialHolder field?

Yes.
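
For example, a minimal sketch (assuming dialHolder is a *transport.DialHolder that you keep alive for the lifetime of the wrapper, or of the cluster it represents, and whose Dial field wraps your cluster-specific closure):

    package k8sclient

    import (
        "net/http"

        "k8s.io/client-go/rest"
        "k8s.io/client-go/transport"
    )

    func roundTripperFor(restCfg *rest.Config, dialHolder *transport.DialHolder) (http.RoundTripper, error) {
        tc, err := restCfg.TransportConfig()
        if err != nil {
            return nil, err
        }
        // Reusing the same *DialHolder keeps the TLS cache key stable, so
        // repeated calls hit the same cache entry instead of creating new ones.
        tc.DialHolder = dialHolder
        return transport.New(tc)
    }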

@dmitry-irtegov
Author

Thank you for the answer, but now I have another question: the transport functions return an http.RoundTripper, but I need a kubernetes.Interface or an unversioned rest.RESTClient.

I guess I need to use kubernetes.NewForConfigAndClient() and rest.UnversionedRESTClientForConfigAndClient() respectively, but how do I properly convert an http.RoundTripper into an http.Client?

@enj
Member

enj commented Jun 20, 2023

http.Client just takes transport as a field:

func NewForConfig(c *rest.Config) (*Clientset, error) {
    configShallowCopy := *c
    if configShallowCopy.UserAgent == "" {
        configShallowCopy.UserAgent = rest.DefaultKubernetesUserAgent()
    }
    // share the transport between all clients
    httpClient, err := rest.HTTPClientFor(&configShallowCopy)
    if err != nil {
        return nil, err
    }

func HTTPClientFor(config *Config) (*http.Client, error) {
    transport, err := TransportFor(config)
    if err != nil {
        return nil, err
    }
    var httpClient *http.Client
    if transport != http.DefaultTransport || config.Timeout > 0 {
        httpClient = &http.Client{
            Transport: transport,
            Timeout:   config.Timeout,
        }


Alternatively, you could copy the subset of the transport building logic from:

func (c *tlsTransportCache) get(config *Config) (http.RoundTripper, error) {
    key, canCache, err := tlsConfigKey(config)
    if err != nil {
        return nil, err
    }

    if canCache {
        // Ensure we only create a single transport for the given TLS options
        c.mu.Lock()
        defer c.mu.Unlock()
        defer metrics.TransportCacheEntries.Observe(len(c.transports))

        // See if we already have a custom transport for this config
        if t, ok := c.transports[key]; ok {
            metrics.TransportCreateCalls.Increment("hit")
            return t, nil
        }
        metrics.TransportCreateCalls.Increment("miss")
    } else {
        metrics.TransportCreateCalls.Increment("uncacheable")
    }

    // Get the TLS options for this client config
    tlsConfig, err := TLSConfigFor(config)
    if err != nil {
        return nil, err
    }
    // The options didn't require a custom TLS config
    if tlsConfig == nil && config.DialHolder == nil && config.Proxy == nil {
        return http.DefaultTransport, nil
    }

    var dial func(ctx context.Context, network, address string) (net.Conn, error)
    if config.DialHolder != nil {
        dial = config.DialHolder.Dial
    } else {
        dial = (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext
    }

    // If we use are reloading files, we need to handle certificate rotation properly
    // TODO(jackkleeman): We can also add rotation here when config.HasCertCallback() is true
    if config.TLS.ReloadTLSFiles && tlsConfig != nil && tlsConfig.GetClientCertificate != nil {
        dynamicCertDialer := certRotatingDialer(tlsConfig.GetClientCertificate, dial)
        tlsConfig.GetClientCertificate = dynamicCertDialer.GetClientCertificate
        dial = dynamicCertDialer.connDialer.DialContext
        go dynamicCertDialer.Run(DialerStopCh)
    }

    proxy := http.ProxyFromEnvironment
    if config.Proxy != nil {
        proxy = config.Proxy
    }

    transport := utilnet.SetTransportDefaults(&http.Transport{
        Proxy:               proxy,
        TLSHandshakeTimeout: 10 * time.Second,
        TLSClientConfig:     tlsConfig,
        MaxIdleConnsPerHost: idleConnsPerHost,
        DialContext:         dial,
        DisableCompression:  config.DisableCompression,
    })

    if canCache {
        // Cache a single transport for these options
        c.transports[key] = transport
    }

    return transport, nil
}

and then set that transport on rest.Config.Transport so that you can use this branch (be careful to avoid double wrapping via HTTPWrappersForConfig):

    if config.Transport != nil {
        rt = config.Transport
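
Putting the pieces together, one possible shape (a sketch of the above, not a guaranteed usage pattern; instead of the rest.Config.Transport branch it hands a prebuilt http.Client to NewForConfigAndClient, which, as far as I can tell, avoids the double-wrapping concern because the client is used as-is):

    package k8sclient

    import (
        "net/http"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/transport"
    )

    func clientsetFor(restCfg *rest.Config, dialHolder *transport.DialHolder) (*kubernetes.Clientset, error) {
        tc, err := restCfg.TransportConfig()
        if err != nil {
            return nil, err
        }
        // Shared per cluster, so the TLS cache key stays stable across calls.
        tc.DialHolder = dialHolder

        // transport.New also applies the auth/user-agent wrappers from the config.
        rt, err := transport.New(tc)
        if err != nil {
            return nil, err
        }

        httpClient := &http.Client{
            Transport: rt,
            Timeout:   restCfg.Timeout,
        }
        return kubernetes.NewForConfigAndClient(restCfg, httpClient)
    }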

@enj
Member

enj commented Jun 26, 2023

/assign
(so I remember to do something about this)
/sig api-machinery
/triage accepted

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2023
@2uasimojo

@dmitry-irtegov Thank you for this, we're hitting the exact same issue in hive. Any chance you can share your implementation of the workaround so I can follow your example?

2uasimojo added a commit to 2uasimojo/hive that referenced this issue Jul 13, 2023
When ClusterDeployment.Spec.ControlPlaneConfig.APIServerIPOverride is in
use, we experience the memory leak described by
kubernetes/kubernetes#118703. That issue has
not been resolved at the time of this writing, but describes a couple of
workarounds. Try the one described in [this comment](kubernetes/kubernetes#118703 (comment)):

> For now, I would probably just set the Proxy func to
> utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment) so that
> the TLS cache is skipped for your code. This isn't guaranteed behavior
> and may stop working at any time, but it will at least solve the issue
> for you for now (by restoring the old behavior).

HIVE-2272
@olegch
Contributor

olegch commented Jul 18, 2023

Hello,

@enj

We need the caching behavior to be the default otherwise clients like kubectl break by creating too many clientsets which then run out of file descriptors.

It seems that in the current code tlsCache is never cleaned, so even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.
It is easy to imagine a service like that - ArgoCD is a good example. Users can register and de-register Kubernetes clusters via the UI or API, so while at any given moment ArgoCD "knows" about a limited number of clusters, tlsCache will over time collect an unlimited number of entries.

Does it maybe make sense to implement a time-limited cache so that entries that are not accessed for a certain period of time are removed?
We use this strategy in our code where we use static process/package-scoped caches.

If I am not mistaken, the same issue is discussed in #109289.

@2uasimojo

even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.

I can corroborate this. When we deployed code with the workaround, it fixed the envs using DialContext -- but we also saw a long-standing "sawtooth" memory pattern disappear in the envs that weren't using it.

@enj
Member

enj commented Jul 20, 2023

It seems that in the current code tlsCache is never cleaned, so even if one is not using DialHolder, a long-running service that works with multiple k8s API endpoints may still run into a memory leak.

Correct, though this has been true since long before any of the dial holder changes.

Does it maybe make sense to implement a time-limited cache so that entries that are not accessed for a certain period of time are removed?

In general we need an approach that makes it clear to client-go that consumers are done with a particular cache entry. It is unrelated to time, because a client could be in use but idle.

@2uasimojo

I'm sure we're shooting for something clever, but I would be happy with either

config.UseCache = false  // the normally-cached thing is GCed when the client goes out of scope

or

client := GimmeClient(...)
defer client.Scrub()  // explicit cleanup when this function completes

Again, my use case is in the context of Reconcile loops in controllers that manage many spoke clusters that come and go. Each invocation is idempotent and isolated, and we can't (reliably) inject logic that knows when a specific spoke is deleted. Thus

done with a particular cache entry

is not really possible for us. We would rather simply not use the cache, and accept whatever performance hit that entails.

@enj
Member

enj commented Jul 20, 2023

You can already do the former by setting the proxy func or constructing the transport directly.

@2uasimojo

You can already do the former by setting the proxy func

Right, cool, I'm doing that and it's working for me, but I'm concerned about this:

This isn't guaranteed behavior and may stop working at any time

If we can declare this a permanent/supported path, great 👍

That said, it would be neat if there were a way to spell it that was more intuitive. For networking dunces like me, "set a proxy function to utilnet.NewProxierWithNoProxyCIDR(http.ProxyFromEnvironment)" doesn't naturally map to "don't use a cache for an internal thing I can't see".

Similar for

constructing the transport directly

(which I didn't even try to understand, tbf)

@olegch
Contributor

olegch commented Jul 20, 2023

In general we need an approach that make it clear to client-go that consumers are done with a particular cache entry. It is unrelated to time because a client could be in-use but idle.

Yes, this is true, but it is probably the easiest fix that would remove the leak and at the same time keep most of the benefits of the cache.
Removing an entry from the cache after some idle time will not change the external behavior of the client, because the cache entry will just be reinstated when used again.
At the same time, it would not affect performance much: if an item was removed, it was not used often, and re-creating a cache entry is not critical for performance; if it was used often, it would not be removed.
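
A rough standalone sketch of that idea (hypothetical code, not client-go API): entries idle for longer than maxIdle are dropped and simply rebuilt on their next use.

    package idlecache

    import (
        "net/http"
        "sync"
        "time"
    )

    type entry struct {
        rt       http.RoundTripper
        lastUsed time.Time
    }

    type idleCache struct {
        mu      sync.Mutex
        entries map[string]*entry
        maxIdle time.Duration
    }

    func newIdleCache(maxIdle time.Duration) *idleCache {
        return &idleCache{entries: map[string]*entry{}, maxIdle: maxIdle}
    }

    // get returns the cached round tripper for key, building it if it is missing
    // and evicting any entries that have been idle for longer than maxIdle.
    func (c *idleCache) get(key string, build func() http.RoundTripper) http.RoundTripper {
        c.mu.Lock()
        defer c.mu.Unlock()

        now := time.Now()
        for k, e := range c.entries {
            if now.Sub(e.lastUsed) > c.maxIdle {
                delete(c.entries, k)
            }
        }

        e, ok := c.entries[key]
        if !ok {
            e = &entry{rt: build()}
            c.entries[key] = e
        }
        e.lastUsed = now
        return e.rt
    }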
