
[Fleet] Prevent concurrent runs of Fleet setup #183636

Merged
46 commits merged into elastic:main on May 31, 2024

Conversation

@juliaElastic (Contributor) commented May 16, 2024

Closes https://github.com/elastic/ingest-dev/issues/3346

  • Unit and integration tests are created or updated
  • Turn down info logging

The linked issue seems to be caused by multiple Kibana instances running Fleet setup at the same time and trying to create the preconfigured cloud policy concurrently. When this fails, the agent policy is left with a revision that has no inputs, which prevents fleet-server from starting properly.

See the concurrent errors in the logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/tUpMP

This fix introduces a fleet-setup-lock SO type. Fleet setup creates a document of this type as a lock and deletes it when setup completes. Concurrent calls to Fleet setup return early if this document exists.
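
A minimal sketch of that lock flow, assuming a saved objects client and a lock type registered as fleet-setup-lock; the helper name, document id, and attributes below are illustrative, not the PR's actual implementation:

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

const FLEET_SETUP_LOCK_TYPE = 'fleet-setup-lock';

// Hypothetical helper: run setup only if we can create the lock document.
export async function runSetupWithLock(
  soClient: SavedObjectsClientContract,
  doSetup: () => Promise<void>
): Promise<boolean> {
  try {
    // A fixed id with overwrite: false acts as a mutex: a second concurrent
    // caller gets a 409 conflict instead of a second lock document.
    await soClient.create(
      FLEET_SETUP_LOCK_TYPE,
      { started_at: new Date().toISOString() },
      { id: FLEET_SETUP_LOCK_TYPE, overwrite: false }
    );
  } catch (error) {
    if (SavedObjectsErrorHelpers.isConflictError(error)) {
      return false; // another instance holds the lock, so return early
    }
    throw error;
  }
  try {
    await doSetup();
  } finally {
    // Release the lock even if setup failed, so later runs are not blocked.
    await soClient.delete(FLEET_SETUP_LOCK_TYPE, FLEET_SETUP_LOCK_TYPE);
  }
  return true;
}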

To verify:
Run ./run_fleet_setup_parallel.sh from a local Kibana checkout and verify in the generated logs that only one instance ran Fleet setup.

@juliaElastic juliaElastic added the ci:cloud-deploy Create or update a Cloud deployment label May 16, 2024
@apmmachine
Contributor

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic changed the title add logging to debug [Fleet] Prevent concurrent runs of Fleet setup May 17, 2024
@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic self-assigned this May 17, 2024
@juliaElastic juliaElastic marked this pull request as ready for review May 17, 2024 14:48
@juliaElastic juliaElastic requested review from a team as code owners May 17, 2024 14:48
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label May 17, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

    return await awaitIfPending(async () => createSetupSideEffects(soClient, esClient));
  } catch (error) {
    apm.captureError(error);
    t.setOutcome('failure');
    throw error;
  } finally {
    t.end();
    try {
      await settingsService.saveSettings(soClient, {
        fleet_setup_status: 'completed',
Contributor Author

Added writing out the completed status in the finally branch to make sure it is written even in case of errors.
Leaving this flag as in_progress would prevent any subsequent Fleet setup calls from running.

We might want an escape hatch if this happens, e.g. a force flag to be able to run Fleet setup when the status is stuck in in_progress.

Member

Adding a force flag makes sense for edge cases and troubleshooting, imo

// check if fleet setup is already started
const settings = await settingsService.getSettingsOrUndefined(soClient);

if (settings && settings.fleet_setup_status === 'in_progress') {
Member
@nchaulet commented May 17, 2024

I think there is still a possibility of a race condition, no? I am wondering if we should introduce a real distributed lock for the setup.

Not sure how easy it will be to implement this with saved objects; maybe something like this, with no retry on conflict. Happy to discuss this more.

if (!settings.fleet_setup_status) {
  const setupId = uuid();
  await settingsService.saveSettings(
    soClient,
    {
      fleet_setup_status: 'in_progress',
      fleet_setup_id: setupId, // to check who got the lock
      fleet_setup_started_at: Date.now(), // to get a TTL for the lock
    },
    { retryOnConflict: false }
  );
}

Member

+1, what are the exact concurrency semantics here? Is this actually atomic? A distributed lock would work if you had a perfect one, but those can also run into problems around retries depending on exactly how it works.

A better solution would be to make it so that concurrent attempts to create the policy don't matter by making the action idempotent. What if the requests to create the policy went through a queue as a way of serializing the action and getting a consistent result each time?

What if each instance of Kibana writes its own copy of the same object with a unique name, and we pick the first one (or an arbitrary one) as the one that gets used, assuming they are all the same?

There might be other options as well.
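
As a rough illustration of the idempotency idea (the type and id names below are placeholders, not Fleet's real saved object names): if every instance creates the same document under a deterministic id and treats a version conflict as success, concurrent attempts converge on a single policy; unlike a lock, a conflict here is not a reason to back off.

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

// Hypothetical helper: create-if-absent with a deterministic id.
async function ensurePreconfiguredPolicy(
  soClient: SavedObjectsClientContract,
  attributes: { name: string; namespace: string }
): Promise<void> {
  try {
    await soClient.create('example-agent-policy', attributes, {
      id: 'preconfigured-cloud-policy', // same id on every Kibana instance
      overwrite: false,
    });
  } catch (error) {
    // A conflict means another instance already created the policy, which is
    // the desired end state, so it is not treated as a failure.
    if (!SavedObjectsErrorHelpers.isConflictError(error)) {
      throw error;
    }
  }
}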

Contributor Author

My thinking with the in_progress flag was to avoid the unnecessary work of each Kibana instance creating the preconfigured policies. We use a similar approach in package install, where the package is set to installing status and can't be installed concurrently until that completes.

Something like electing a leader Kibana instance could work, though I'm not sure how easy it is to implement.

Using a queue would need a separate component that executes it, wouldn't it?

Using a unique policy id for each attempt would potentially generate a lot of documents, as Fleet setup runs every time Kibana restarts or the Fleet UI is visited. We also rely on a specific name for the cloud policy.

I think a less intrusive change would be to catch version conflict errors and check for the existence of the policy before retrying or rolling back.
Also, the way we bump the revision field now is not atomic: it is read and updated separately. We could improve that with a painless update script, to make sure two instances don't overwrite the same revision.
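
A sketch of that atomic bump, assuming the agent policy lives in an Elasticsearch document with a numeric revision field; the index name, document id, and helper name are placeholders:

import type { Client } from '@elastic/elasticsearch';

// Hypothetical helper: increment the revision server-side with a painless
// script, so two instances cannot read the same value and write back the
// same revision number.
async function bumpPolicyRevision(esClient: Client, index: string, policyDocId: string) {
  await esClient.update({
    index,
    id: policyDocId,
    retry_on_conflict: 3, // retry if a concurrent write bumps the doc version
    script: {
      lang: 'painless',
      source: 'ctx._source.revision += 1',
    },
  });
}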

Member

@juliaElastic If we want to avoid concurrent calls, could we catch the version conflict when updating the settings object (and use that document as the lock)?

Using a queue would need a separate component that executes it, wouldn't it?

We could eventually use the task manager for that; it is the queue available in Kibana. But it would add some delay to bumping the policy, and I'm not sure that's acceptable for us. It could also solve a few scalability issues when bumping agent policy revisions.
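
For illustration, using the settings document itself as the lock could rely on the saved object version for optimistic concurrency; the type name, attributes, and helper below are assumptions, not Fleet's actual code:

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

// Hypothetical helper: flip the status to in_progress, but only if nobody
// else has modified the settings document since we read it.
async function tryAcquireSetupLock(
  soClient: SavedObjectsClientContract,
  settingsId: string,
  versionWhenRead: string | undefined
): Promise<boolean> {
  try {
    await soClient.update(
      'fleet-settings-example', // placeholder type name
      settingsId,
      {
        fleet_setup_status: 'in_progress',
        fleet_setup_started_at: new Date().toISOString(),
      },
      // Passing the version we read makes the update conditional: if another
      // instance updated the document first, this call fails with a conflict.
      { version: versionWhenRead }
    );
    return true;
  } catch (error) {
    if (SavedObjectsErrorHelpers.isConflictError(error)) {
      return false; // another instance already grabbed the lock
    }
    throw error;
  }
}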

@kc13greiner kc13greiner self-requested a review May 20, 2024 18:14
@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

// Retry Fleet setup w/ backoff
await backOff(
  async () => {
    await setupFleet(
      new SavedObjectsClient(core.savedObjects.createInternalRepository()),
      core.elasticsearch.client.asInternalUser,
      { useLock: true }
Contributor Author

Using this option so the lock is only used when setup is called from the plugin start logic, not from the API.

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic marked this pull request as ready for review May 30, 2024 15:07
/**
 * Verifies that multiple Kibana instances running in parallel will not create duplicate preconfiguration objects.
 */
describe.skip('Fleet setup preconfiguration with multiple instances Kibana', () => {
Contributor Author

I've used these tests locally to start ES and run 2 Kibana instances as separate processes.
I've tried to run them as a CI task, but didn't see any output; I'm not sure how to check the result of processes running in the background.

Alternatively, we can use multiple zones in scheduled scale tests to make sure the bug doesn't surface again (the test would fail if fleet-server doesn't come online). This is done in the daily 10k and daily 50vm runs, which use 3 zones.

Member

Can we check the log files of these Kibana instances maybe?

Contributor Author

Another issue is that the jest integration step picks up these tests if they are not skipped, and they fail because fleet_setup.test.ts requires ES to be started separately in es.test.ts. I tried moving the files to another folder, but then the jest_integration script doesn't find them, and it doesn't seem easy to add a new script.

@juliaElastic juliaElastic requested review from cmacknz, kpollich and a team May 31, 2024 10:56
@juliaElastic juliaElastic removed the ci:cloud-deploy Create or update a Cloud deployment label May 31, 2024
Member
@nchaulet left a comment

A few nitpick comments, but otherwise LGTM 🚀

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
fleet 1026 1027 +1

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1201 1202 +1
Unknown metric groups

API count

id before after diff
fleet 1322 1323 +1

ESLint disabled line counts

id before after diff
fleet 43 44 +1

Total ESLint disabled count

id before after diff
fleet 56 57 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 464f797 into elastic:main May 31, 2024
20 checks passed
@kibanamachine kibanamachine added v8.15.0 backport:skip This commit does not require backporting labels May 31, 2024
Labels
backport:skip, release_note:fix, Team:Fleet, v8.15.0
9 participants