client-go/transport: fix memory leak when using rest.Config Dial function #124894

Open
wants to merge 1 commit into
base: master

@lizardruss lizardruss commented May 15, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When using a rest.Config Dial function and constructing a client with client.New() or kubernetes.NewForConfig(), the tlsConfigCache is ineffective because the Dial function is wrapped in a new DialHolder instance every time. This leads to a memory leak since the tlsConfigCache grows unbounded with unique keys. This change allows skipping the tlsConfigCache for this case.
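To illustrate the mechanism described above, here is a minimal, self-contained sketch. It is a simplified model, not the actual client-go code: `dialHolder`, `cacheKey`, `transportCache`, and `newClient` are hypothetical stand-ins. The only assumption carried over from the description is that the cache key includes a pointer to the holder, so a fresh holder per client produces a fresh key per client.

```go
package main

import "fmt"

// dialFunc stands in for the Dial field on rest.Config (simplified signature).
type dialFunc func(network, addr string) (string, error)

// dialHolder is a hypothetical stand-in for transport.DialHolder:
// it wraps a function in a pointer so the pointer can be part of a map key.
type dialHolder struct {
	dial dialFunc
}

// cacheKey is a simplified, hypothetical cache key. It is compared by
// value, so two keys match only if they carry the same holder pointer.
type cacheKey struct {
	serverName string
	holder     *dialHolder
}

// transportCache is a toy cache that never evicts entries.
type transportCache struct {
	entries map[cacheKey]string
}

func newTransportCache() *transportCache {
	return &transportCache{entries: make(map[cacheKey]string)}
}

// get returns the cached value for key, creating it on a miss.
func (c *transportCache) get(key cacheKey) string {
	if t, ok := c.entries[key]; ok {
		return t
	}
	t := fmt.Sprintf("transport-%d", len(c.entries))
	c.entries[key] = t
	return t
}

// newClient mimics constructing a client from a config: the same dial
// function is wrapped in a brand-new holder on every call.
func newClient(c *transportCache, dial dialFunc) string {
	key := cacheKey{serverName: "example", holder: &dialHolder{dial: dial}}
	return c.get(key)
}

func main() {
	cache := newTransportCache()
	dial := func(network, addr string) (string, error) { return addr, nil }
	for i := 0; i < 1000; i++ {
		newClient(cache, dial)
	}
	// Every call missed the cache, so there is one entry per client.
	fmt.Println(len(cache.entries)) // prints 1000
}
```

Although the dial function is identical on every call, no lookup ever hits, and the map keeps one entry per constructed client.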

Which issue(s) this PR fixes:

Fixes #118703

Special notes for your reviewer:

Here are graphs showing the change in memory usage before & after this fix. I have marked this as not having a user facing change, since unless the user has taken care to reuse DialHolder instances when creating clients, they would not have seen the benefits of the tlsConfigCache.

before
[Image: memory usage graph before the fix]

after
[Image: memory usage graph after the fix]

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


…tion with client.New and kubernetes.NewForConfig
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 15, 2024
@k8s-ci-robot
Contributor

Welcome @lizardruss!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 15, 2024
@k8s-ci-robot
Contributor

Hi @lizardruss. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 15, 2024
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 15, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lizardruss
Once this PR has been reviewed and has the lgtm label, please assign smarterclayton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lizardruss lizardruss marked this pull request as ready for review May 15, 2024 16:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2024
@lizardruss lizardruss changed the title client-go/transport: fix memory leak when using rest.Config Dial func… client-go/transport: fix memory leak when using rest.Config Dial function May 15, 2024
@seans3
Contributor

seans3 commented May 21, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2024
@liggitt
Member

liggitt commented May 21, 2024

/assign @enj

@k8s-ci-robot
Contributor

@lizardruss: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kubernetes-verify | 74ff449 | link | true | /test pull-kubernetes-verify |
| pull-kubernetes-e2e-gce | 74ff449 | link | true | /test pull-kubernetes-e2e-gce |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aojea
Member

aojea commented May 21, 2024

I didn't check in detail, but is this related to #117258? We have added some metrics to give more visibility into these problems (#117295). @lizardruss, just out of curiosity, do you know if the metrics are useful in this case?

@@ -112,7 +112,7 @@ func (c *Config) TransportConfig() (*transport.Config, error) {
 	}

 	if c.Dial != nil {
-		conf.DialHolder = &transport.DialHolder{Dial: c.Dial}
+		conf.DialHolder = &transport.DialHolder{Dial: c.Dial, DisableCache: true}
Member


I think the problem with all of these types of changes is that they make assumptions about how the caller was using this code. A caller of this method could be re-using the return value, meaning they would correctly be getting the TLS cache benefits, which this PR would now disable. We allow for infinite flexibility, so every change has the potential to break someone in a subtle way.

Author


I agree with your perspective. Originally I wanted to expose the DialHolder on the rest.Config struct, but that is a larger change. It would provide a more intentional way of re-using the DialHolder when the caching behavior is expected.

As things exist now, unless the caller takes care to re-use the returned DialHolder, it's not obvious that a memory leak occurs when the Dial function is configured. The places we most often encounter this issue are client.New() and kubernetes.NewForConfig(), which make re-using the DialHolder non-obvious.
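The trade-off discussed in this thread can be sketched with a small model. This is an illustration under stated assumptions, not client-go code: all types and functions below are hypothetical, and `disableCache` only mirrors the intent of the `DisableCache` field proposed in the diff. It compares three cases: a fresh holder per client, one holder reused by a careful caller, and fresh holders that opt out of the cache.

```go
package main

import "fmt"

// dialFunc stands in for the Dial field on rest.Config (simplified signature).
type dialFunc func(network, addr string) (string, error)

// dialHolder is a hypothetical stand-in for transport.DialHolder.
// disableCache models the opt-out proposed in this PR.
type dialHolder struct {
	dial         dialFunc
	disableCache bool
}

// cacheKey is a simplified key that includes the holder pointer.
type cacheKey struct {
	serverName string
	holder     *dialHolder
}

// transportCache is a toy cache that never evicts entries.
type transportCache struct {
	entries map[cacheKey]string
}

// get returns a transport for the holder. Holders that opt out of
// caching get a new transport each time and leave the map untouched.
func (c *transportCache) get(serverName string, h *dialHolder) string {
	if h != nil && h.disableCache {
		return "uncached-transport"
	}
	key := cacheKey{serverName: serverName, holder: h}
	if t, ok := c.entries[key]; ok {
		return t
	}
	t := fmt.Sprintf("cached-transport-%d", len(c.entries))
	c.entries[key] = t
	return t
}

// cacheSize builds n clients and reports how many entries the cache holds.
// reuse selects one shared holder; disable sets the opt-out flag.
func cacheSize(n int, reuse, disable bool) int {
	cache := &transportCache{entries: make(map[cacheKey]string)}
	dial := func(network, addr string) (string, error) { return addr, nil }
	shared := &dialHolder{dial: dial, disableCache: disable}
	for i := 0; i < n; i++ {
		h := shared
		if !reuse {
			h = &dialHolder{dial: dial, disableCache: disable}
		}
		cache.get("example", h)
	}
	return len(cache.entries)
}

func main() {
	fmt.Println(cacheSize(100, false, false)) // fresh holder per client: prints 100
	fmt.Println(cacheSize(100, true, false))  // one reused holder: prints 1
	fmt.Println(cacheSize(100, false, true))  // cache disabled: prints 0
}
```

In this model, the reused holder is the only case that benefits from caching, which is the scenario the reviewer is concerned about losing. The opt-out keeps the map from growing but also gives up cache hits for every caller that goes through that path.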

@cici37
Contributor

cici37 commented May 28, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using Dial function with client-go kubernetes.NewForConfig triggers a memory leak
7 participants