
[Fleet] Prevent concurrent runs of Fleet setup #183636

Merged
46 commits merged into elastic:main on May 31, 2024

Conversation

@juliaElastic (Contributor) commented May 16, 2024

Closes https://github.com/elastic/ingest-dev/issues/3346

  • Unit and integration tests are created or updated
  • Turn down info logging

The linked issue seems to be caused by multiple Kibana instances running Fleet setup at the same time and trying to create the preconfigured cloud policy concurrently. When this fails, the agent policy is left with a revision that has no inputs, which prevents fleet-server from starting properly.

See the concurrent errors in the logs: https://platform-logging.kb.us-west2.gcp.elastic-cloud.com/app/r/s/tUpMP

This fix introduces a fleet-setup-lock SO type. Fleet setup creates a document of this type as a lock and deletes it when setup completes. Concurrent calls to Fleet setup return early if this document exists.
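
A minimal sketch of that lock flow, assuming a saved objects client and a lock type registered as fleet-setup-lock; the helper name, document id, and attributes below are illustrative, not the PR's actual implementation:

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

const FLEET_SETUP_LOCK_TYPE = 'fleet-setup-lock';

// Hypothetical helper: run setup only if we can create the lock document.
export async function runSetupWithLock(
  soClient: SavedObjectsClientContract,
  doSetup: () => Promise<void>
): Promise<boolean> {
  try {
    // A fixed id with overwrite: false acts as a mutex: a second concurrent
    // caller gets a 409 conflict instead of a second lock document.
    await soClient.create(
      FLEET_SETUP_LOCK_TYPE,
      { started_at: new Date().toISOString() },
      { id: FLEET_SETUP_LOCK_TYPE, overwrite: false }
    );
  } catch (error) {
    if (SavedObjectsErrorHelpers.isConflictError(error)) {
      return false; // another instance holds the lock, so return early
    }
    throw error;
  }
  try {
    await doSetup();
  } finally {
    // Release the lock even if setup failed, so later runs are not blocked.
    await soClient.delete(FLEET_SETUP_LOCK_TYPE, FLEET_SETUP_LOCK_TYPE);
  }
  return true;
}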

To verify:
Run ./run_fleet_setup_parallel.sh from a local Kibana checkout and verify in the generated logs that only one instance ran Fleet setup.

@juliaElastic juliaElastic added the ci:cloud-deploy Create or update a Cloud deployment label May 16, 2024
@apmmachine
Contributor

🤖 GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic changed the title add logging to debug [Fleet] Prevent concurrent runs of Fleet setup May 17, 2024
@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic self-assigned this May 17, 2024
@juliaElastic juliaElastic marked this pull request as ready for review May 17, 2024 14:48
@juliaElastic juliaElastic requested review from a team as code owners May 17, 2024 14:48
@botelastic botelastic bot added the Team:Fleet Team label for Observability Data Collection Fleet team label May 17, 2024
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

    return await awaitIfPending(async () => createSetupSideEffects(soClient, esClient));
  } catch (error) {
    apm.captureError(error);
    t.setOutcome('failure');
    throw error;
  } finally {
    t.end();
    try {
      await settingsService.saveSettings(soClient, {
        fleet_setup_status: 'completed',
Contributor Author

Added writing out the completed status in the finally branch to make sure it is written even in case of errors.
Leaving this flag as in_progress would prevent any subsequent Fleet setup calls from running.

We might want an escape hatch if this happens, e.g. a force flag to be able to run Fleet setup when the status is stuck in in_progress.

Member

Adding a force flag makes sense for edge cases and troubleshooting, imo

// check if fleet setup is already started
const settings = await settingsService.getSettingsOrUndefined(soClient);

if (settings && settings.fleet_setup_status === 'in_progress') {
Member
@nchaulet commented May 17, 2024

I think there is still a possibility of a race condition, no? I am wondering if we should introduce a real distributed lock for the setup.

Not sure how easy it will be to implement this with saved objects; maybe something like this, with no retry on conflict. Happy to discuss this more.

if (!settings.fleet_setup_status) {
  const setupId = uuid();
  await settingsService.saveSettings(
    soClient,
    {
      fleet_setup_status: 'in_progress',
      fleet_setup_id: setupId, // to check who got the lock
      fleet_setup_started_at: Date.now(), // to get a TTL for the lock
    },
    { retryOnConflict: false }
  );
}

Member

+1, what are the exact concurrency semantics here? Is this actually atomic? A distributed lock would work if you had a perfect one, but those can also run into problems around retries depending on exactly how it works.

A better solution would be to make it so that concurrent attempts to create the policy don't matter by making the action idempotent. What if the requests to create the policy went through a queue as a way of serializing the action and getting a consistent result each time?

What if each instance of Kibana writes its own copy of the same object with a unique name, and we pick the first one (or an arbitrary one) as the one that gets used, assuming they are all the same?

There might be other options as well.
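
As a rough illustration of the idempotency idea (the type and id names below are placeholders, not Fleet's real saved object names): if every instance creates the same document under a deterministic id and treats a version conflict as success, concurrent attempts converge on a single policy; unlike a lock, a conflict here is not a reason to back off.

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

// Hypothetical helper: create-if-absent with a deterministic id.
async function ensurePreconfiguredPolicy(
  soClient: SavedObjectsClientContract,
  attributes: { name: string; namespace: string }
): Promise<void> {
  try {
    await soClient.create('example-agent-policy', attributes, {
      id: 'preconfigured-cloud-policy', // same id on every Kibana instance
      overwrite: false,
    });
  } catch (error) {
    // A conflict means another instance already created the policy, which is
    // the desired end state, so it is not treated as a failure.
    if (!SavedObjectsErrorHelpers.isConflictError(error)) {
      throw error;
    }
  }
}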

Contributor Author

My thinking with the in_progress flag was to avoid the unnecessary work of each Kibana instance creating the preconfigured policies. We use a similar approach in package install, where the package is set to installing status and can't be installed concurrently until that completes.

Something like electing a leader Kibana instance could work, though I'm not sure how easy it is to implement.

Using a queue would need a separate component that executes it, wouldn't it?

Using a unique policy id for each attempt would potentially generate a lot of documents, as Fleet setup runs every time Kibana restarts or the Fleet UI is visited. We also rely on a specific name for the cloud policy.

I think a less intrusive change would be to catch version conflict errors and check for the existence of the policy before retrying or rolling back.
Also, the way we bump the revision field now is not atomic: it is read and updated separately. We could improve that with a painless update script, to make sure two instances don't overwrite the same revision.
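
A sketch of that atomic bump, assuming the agent policy lives in an Elasticsearch document with a numeric revision field; the index name, document id, and helper name are placeholders:

import type { Client } from '@elastic/elasticsearch';

// Hypothetical helper: increment the revision server-side with a painless
// script, so two instances cannot read the same value and write back the
// same revision number.
async function bumpPolicyRevision(esClient: Client, index: string, policyDocId: string) {
  await esClient.update({
    index,
    id: policyDocId,
    retry_on_conflict: 3, // retry if a concurrent write bumps the doc version
    script: {
      lang: 'painless',
      source: 'ctx._source.revision += 1',
    },
  });
}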

Member

@juliaElastic If we want to avoid concurrent calls, could we catch the version conflict when updating the settings object (and use that document as the lock)?

Using a queue would need a separate component that executes it, wouldn't it?

We could eventually use the task manager for that; it is the queue available in Kibana. But it would add some delay to bumping the policy, and I'm not sure that's acceptable for us. It could also solve a few scalability issues when bumping agent policy revisions.
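
For illustration, using the settings document itself as the lock could rely on the saved object version for optimistic concurrency; the type name, attributes, and helper below are assumptions, not Fleet's actual code:

import type { SavedObjectsClientContract } from '@kbn/core/server';
import { SavedObjectsErrorHelpers } from '@kbn/core/server';

// Hypothetical helper: flip the status to in_progress, but only if nobody
// else has modified the settings document since we read it.
async function tryAcquireSetupLock(
  soClient: SavedObjectsClientContract,
  settingsId: string,
  versionWhenRead: string | undefined
): Promise<boolean> {
  try {
    await soClient.update(
      'fleet-settings-example', // placeholder type name
      settingsId,
      {
        fleet_setup_status: 'in_progress',
        fleet_setup_started_at: new Date().toISOString(),
      },
      // Passing the version we read makes the update conditional: if another
      // instance updated the document first, this call fails with a conflict.
      { version: versionWhenRead }
    );
    return true;
  } catch (error) {
    if (SavedObjectsErrorHelpers.isConflictError(error)) {
      return false; // another instance already grabbed the lock
    }
    throw error;
  }
}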

@kc13greiner kc13greiner self-requested a review May 20, 2024 18:14
@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

@juliaElastic
Contributor Author

/ci

// Retry Fleet setup w/ backoff
await backOff(
  async () => {
    await setupFleet(
      new SavedObjectsClient(core.savedObjects.createInternalRepository()),
      core.elasticsearch.client.asInternalUser,
      { useLock: true }
Contributor Author

Using this option so the lock is only used when setup is called from the plugin start logic, not from the API.

@juliaElastic
Contributor Author

/ci

@juliaElastic juliaElastic marked this pull request as ready for review May 30, 2024 15:07
/**
 * Verifies that multiple Kibana instances running in parallel will not create duplicate preconfiguration objects.
 */
describe.skip('Fleet setup preconfiguration with multiple instances Kibana', () => {
Contributor Author

I've used these tests locally to start ES and run 2 Kibana instances as separate processes.
I've tried to run them as a CI task, but didn't see any output; I'm not sure how to check the result of processes running in the background.

Alternatively, we can use multiple zones in scheduled scale tests to make sure the bug doesn't surface again (the test would fail if fleet-server doesn't come online). This is done in the daily 10k and daily 50vm runs, which use 3 zones.

Member

Can we check the log files of these Kibana instances maybe?

Contributor Author

Another issue is that the jest integration step picks up these tests if they are not skipped, and they fail because fleet_setup.test.ts requires ES to be started separately in es.test.ts. I tried moving the files to another folder, but then the jest_integration script doesn't find them, and it doesn't seem easy to add a new script.

@juliaElastic juliaElastic requested review from cmacknz, kpollich and a team May 31, 2024 10:56
@juliaElastic juliaElastic removed the ci:cloud-deploy Create or update a Cloud deployment label May 31, 2024
Member
@nchaulet left a comment

A few nitpick comments, but otherwise LGTM 🚀

@kibana-ci
Collaborator

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
fleet 1026 1027 +1

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
fleet 1201 1202 +1
Unknown metric groups

API count

id before after diff
fleet 1322 1323 +1

ESLint disabled line counts

id before after diff
fleet 43 44 +1

Total ESLint disabled count

id before after diff
fleet 56 57 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @juliaElastic

@juliaElastic juliaElastic merged commit 464f797 into elastic:main May 31, 2024
20 checks passed
@kibanamachine kibanamachine added v8.15.0 backport:skip This commit does not require backporting labels May 31, 2024
Labels
backport:skip, release_note:fix, Team:Fleet, v8.15.0
9 participants