
Separate expensive test in elastic-agent CI to a separate pipeline #4710

Closed
pchila opened this issue May 8, 2024 · 21 comments · Fixed by #4739
Labels: enhancement, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

pchila (Contributor) commented May 8, 2024

To have a bit more control over how many concurrent serverless/extended leak/integration tests we run in parallel at any given time, we could create a separate pipeline that runs only those tests.

For example, we could create a pipeline named elastic-agent-extended-testing that creates a build for a PR or a release-branch commit, or is perhaps even invoked by the current elastic-agent pipeline (I am not very knowledgeable about all the Buildkite features and settings). It could start from a commit (or even from already-built packages) and run our longer tests.
We could configure this pipeline to only run a certain number of builds in parallel, giving us one more knob to turn to limit the impact we have on cloud environments (especially with respect to exhausting ARM VMs or other scarce resources in a zone/region).

Edit: if we implement a separate pipeline for extended testing we will need to adjust the PR checks (we will need an additional required check from the new build) and also our build indicator for main and release branches (the small green tick or red X next to our commits in GitHub).
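For illustration only, the existing pipeline could hand off to such a pipeline with a Buildkite trigger step. This is a sketch under the assumption that the new pipeline is registered under the proposed elastic-agent-extended-testing slug; it is not an actual pipeline definition from the repo:

steps:
  # Trigger the extended-testing pipeline for the same commit and branch as this build.
  - trigger: "elastic-agent-extended-testing"
    label: "Trigger extended testing"
    build:
      commit: "${BUILDKITE_COMMIT}"
      branch: "${BUILDKITE_BRANCH}"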

pchila added the enhancement, Team:Elastic-Agent-Control-Plane, and Team:Elastic-Agent labels on May 8, 2024
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pchila (Contributor, Author) commented May 8, 2024

cc @ycombinator @rdner

ycombinator (Contributor) commented:

@pchila @blakerouse is there some easy way already to invoke these tests separately, e.g. a mage command with an environment variable? Or do we need to build out some additional functionality in the integration testing framework first to make this separation happen?

I imagine once we have an easy way to invoke these tests separately, it's a matter of editing the buildkite pipeline definition to run both test jobs.

pchila (Contributor, Author) commented May 8, 2024

@ycombinator wrote:

> @pchila @blakerouse is there some easy way already to invoke these tests separately, e.g. a mage command with an environment variable? Or do we need to build out some additional functionality in the integration testing framework first to make this separation happen?
>
> I imagine once we have an easy way to invoke these tests separately, it's a matter of editing the buildkite pipeline definition to run both test jobs.

We already have mage targets that run the tests... they expect the necessary packages to be present, though, so we could do either of the following:

  1. have the new pipeline package elastic-agent in all the necessary formats starting from the source code (very similar to what we are doing today)
  2. have the new pipeline download already-built packages from another pipeline or from a URL specific to the commit we are testing (this is interesting as it opens up the possibility of running tests not just for PRs or main)

I believe option 1 is easier and quicker... and we can always evolve it toward option 2 later if we see value in it.

rdner (Member) commented May 10, 2024

I talked to @pazone and he confirmed that it's possible; he asked for this ticket to be assigned to him.

blakerouse (Contributor) commented:

The integration testing framework already supports executing specific groups: https://github.com/elastic/elastic-agent/blob/main/docs/test-framework-dev-guide.md#selecting-specific-group

That could be used to execute only a set of groups per job type. It might be too granular, though, as it requires selecting the groups explicitly, and any time a new group is added the job would need adjusting.

Adding something like define.Require({Extended: true}) might be more desirable. That would allow the selection of extended testing versus non-extended testing.

pazone (Contributor) commented May 10, 2024

We can also avoid running ITs on each PR commit. For example, we could set up a special PR comment that triggers the separate extended-testing pipeline.

cmacknz (Member) commented May 10, 2024

We should introduce the monorepo plugin so we stop pointlessly triggering integration tests. That can be done separately from this issue. #3674

cmacknz (Member) commented May 10, 2024

I have wanted two things for a while as ergonomic improvements that I think would fit under this:

  1. Introducing a Buildkite step that publishes all the packages. This would save me from having to recompile PRs to test them locally. Buildkite needs a 'mage package' step separate from integration tests #4142. This could feed into the integration test job(s). Perhaps instead of uploading the packages via SSH we could have the test VMs download them from blob storage in the same GCP region to speed this step up as well.
  2. Introducing separate Buildkite steps for the individual integration test architectures. For example instead of seeing "Integrations Tests" failed and having to triage and retry all of them I'd rather see "Integration Tests - Windows X86" failed and have the ability to retry only the Windows step. The primary con of this is the increased number of mage integration:test targets running so I think this fits under here.

Finally, for the concurrency issue, are there any built-in Buildkite features we can use to help with this quickly? https://buildkite.com/docs/pipelines/controlling-concurrency for example?
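For reference, the feature behind that link is the concurrency_group/concurrency pair of step attributes. A minimal sketch, where the label, command, group name, and limit are placeholders rather than our actual configuration:

steps:
  - label: "Integration tests"
    command: "mage integration:test"
    # At most 8 jobs sharing this named group may run at once, across all builds.
    concurrency_group: "elastic-agent/integration-tests"
    concurrency: 8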

blakerouse (Contributor) commented:

> I have wanted two things for a while as ergonomic improvements that I think would fit under this:
>
> 1. Introducing a Buildkite step that publishes all the packages. This would save me from having to recompile PRs to test them locally. Buildkite needs a 'mage package' step separate from integration tests #4142. This could feed into the integration test job(s). Perhaps instead of uploading the packages via SSH we could have the test VMs download them from blob storage in the same GCP region to speed this step up as well.

As long as this still works from a local developer system. I do not want to have to think about having to upload to GCP before running the tests or anything like that.

> 2. Introducing separate Buildkite steps for the individual integration test architectures. For example instead of seeing "Integrations Tests" failed and having to triage and retry all of them I'd rather see "Integration Tests - Windows X86" failed and have the ability to retry only the Windows step. The primary con of this is the increased number of mage integration:test targets running so I think this fits under here.

TEST_PLATFORMS="windows/amd64" works for this use case and is already built into the framework.
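As a sketch of how that could look as separately retryable Buildkite steps, assuming one step per TEST_PLATFORMS value (the labels, command, and exact platform strings below are assumptions, not the current pipeline definition):

steps:
  - label: "Integration Tests - Windows x86_64"
    command: "mage integration:test"
    env:
      # Restrict this step to a single platform so it can be retried on its own.
      TEST_PLATFORMS: "windows/amd64"
  - label: "Integration Tests - Linux arm64"
    command: "mage integration:test"
    env:
      TEST_PLATFORMS: "linux/arm64"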

rdner (Member) commented May 10, 2024

@pazone for now the important part, in order to stabilize the CI, is to extract the following steps into a separate pipeline:

  • Serverless Beats Tests
  • Serverless integration test
  • Integration tests

This separate pipeline should have a limited concurrency level. Let's start with 4 maximum concurrent builds. So, we cannot run more than 4 concurrent steps of the type listed above.

Once we have this, we can iterate and extend functionality / granularity as suggested in the comments here.

Unless @cmacknz has any objection.

cmacknz (Member) commented May 10, 2024

> So, we cannot run more than 4 concurrent steps of the type listed above.

Why 4? I don't want us all to be waiting around for more hours than necessary. I know we have to pick a number, but what problem are we specifically trying to solve with this? Reducing number of VMs below some quota? What quota? Where is it set?

Another option here is to reduce the number of unique test groups in the integration tests, since each group creates its own VM instances, or to refactor the three separate targets into one target so they all run sequentially.

rdner (Member) commented May 10, 2024

> Why 4? I don't want us all to be waiting around for more hours than necessary. I know we have to pick a number, but what problem are we specifically trying to solve with this? Reducing number of VMs below some quota? What quota? Where is it set?

@cmacknz we need to start with some number and iterate. We could start with 16, 10, any number, and then see how it affects the CI. 4 is the lowest number I think we can go with; we would then see how stable the CI becomes. If the CI is fully stable, we increase the limit the next week, and keep increasing until we start to see instabilities again.

Changing this number is relatively easy; the important part is to have some limit in place that we can adjust.

cmacknz (Member) commented May 10, 2024

Is there a way we can avoid guessing? Can we observe how many active VMs we have at a time in GCP and go off of that? If we have to guess, that works, but it would be faster to know what scale we are already operating at.

How many VMs does a single integration test run require right now?

For example, counting manually it looks like we have 9 unique test groups, which means that something like at least 9 VMs of each machine type needs to be scheduled simultaneously.

I see:

  1. container
  2. Upgrade
  3. Default
  4. FleetPrivileged
  5. FleetAirgapped
  6. FleetAirgappedPrivileged
  7. DEB
  8. RPM
  9. FQDN
rg 'Group\:' testing/integration
testing/integration/container_cmd_test.go
37:             Group: "container",

testing/integration/upgrade_rollback_test.go
44:             Group: Upgrade,
159:            Group: Upgrade,

testing/integration/diagnostics_test.go
131:            Group: Default,
157:            Group: Default,
183:            Group: Default,
206:            Group: Default,

testing/integration/fake_test.go
76:             Group: Default,
137:            Group: Default,

testing/integration/upgrade_gpg_test.go
27:             Group: Upgrade,
96:             Group: Upgrade,

testing/integration/delay_enroll_test.go
29:             Group: Fleet,

testing/integration/upgrade_fleet_test.go
44:             Group: Fleet,
58:             Group: FleetPrivileged,
108:            Group: FleetAirgapped,
120:            Group: FleetAirgappedPrivileged,

testing/integration/upgrade_standalone_retry_test.go
31:             Group: Upgrade,

testing/integration/upgrade_downgrade_test.go
29:             Group: Upgrade,

testing/integration/logs_ingestion_test.go
43:             Group: Fleet,
105:            Group: Deb,
173:            Group: RPM,

testing/integration/package_version_test.go
36:             Group: Default,
64:             Group: Fleet,

testing/integration/upgrade_standalone_same_commit_test.go
40:             Group: Upgrade,

testing/integration/monitoring_probe_reload_test.go
45:             Group: "fleet",

testing/integration/fqdn_test.go
36:             Group: FQDN,

testing/integration/upgrade_standalone_test.go
27:             Group: Upgrade,

testing/integration/upgrade_uninstall_test.go
28:             Group: Upgrade,

testing/integration/endpoint_security_test.go
81:             Group: Fleet,
107:            Group: Fleet,
135:            Group: Fleet,
504:            Group: Fleet,
575:            Group: Fleet,
655:            Group: Fleet,

testing/integration/proxy_url_test.go
98:             Group: Fleet,
146:            Group: Fleet,
196:            Group: Fleet,
260:            Group: Fleet,
328:            Group: Fleet,

testing/integration/install_test.go
29:             Group: Default,
96:             Group: Default,
175:            Group: Default,

testing/integration/upgrade_standalone_inprogress_test.go
30:             Group: Upgrade,

testing/integration/upgrade_broken_package_test.go
30:             Group: Upgrade,

testing/integration/agent_long_running_leak_test.go
65:             Group: "fleet",

testing/integration/apm_propagation_test.go
56:             Group: Default,

testing/integration/otel_test.go
112:            Group: Default,
228:            Group: Default,

testing/integration/metrics_monitoring_test.go
36:             Group: Fleet,

testing/integration/install_privileged_test.go
27:             Group: Default,
79:             Group: Default,

testing/integration/beats_serverless_test.go
53:             Group: Default,

testing/integration/monitoring_probe_preserve_text_cfg_test.go
80:             Group: "fleet",

rdner (Member) commented May 13, 2024

@cmacknz even if we calculated the precise number of VMs needed per build, we still would not know the failure/instability point of the GCP API. I don't have historical data for the number of VMs we had running when GCP started failing for us.

If it's ~9 VMs per running build, 4 concurrent builds would already result in 36 VMs, plus we can guarantee their shutdown only after 4 hours, which makes the number of VMs running at the same time somewhat unpredictable. This makes me think we have to guess.

The number can be anything; we could start with 8, it does not matter. But we must start with something and then iterate, as the "Progress, SIMPLE perfection" principle in our Source Code says. I'm afraid we might spend too much time perfecting the number instead of trying some number and moving on from there.

cmacknz (Member) commented May 13, 2024

> we still would not know the failure/instability point of the GCP API

My main concern is that without understanding in detail what the limit is, or exactly what problem we are working around, we may permanently make CI slower without an explanation or a path to fixing the underlying problem.

I do not want us to arbitrarily limit the number of CI jobs and permanently cap developer productivity without an actual reason. Are we requesting too many machines at once? Is the type of machine we are requesting not as available as others we could use instead? Are we using the API wrong, and are we supposed to overflow into another AZ in the same region when we get capacity-related errors? Do we need a capacity reservation with GCP itself to guarantee availability (and would this also be cheaper)?

If you don't want to try and dig into the root cause, at least do an analysis of whether the reduced flakiness from capping concurrent builds actually saves us time over running more builds in parallel with the risk that some of them fail.

If we can make CI 100% reliable by executing a single build at a time, we have solved the flakiness problem at the expense of greatly limiting our throughput as a team. I will happily trade off one flaky build per day for unlimited concurrency for example at the other end of this spectrum.

cmacknz (Member) commented May 13, 2024

For comparison, the Beats CI buildkite jobs look to have at least as many concurrent jobs on Buildkite agents if not more. How are they avoiding capacity issues there? That repository is also much more active than agent.

pazone linked pull request #4739 on May 13, 2024 that will close this issue
rdner (Member) commented May 13, 2024

> If we can make CI 100% reliable by executing a single build at a time, we have solved the flakiness problem at the expense of greatly limiting our throughput as a team.

@cmacknz but I didn't suggest keeping this low number forever; we start with some number and then gradually increase it until we hit instability issues again. It's going to be a bit slower in the beginning, yes.

> I will happily trade off one flaky build per day for unlimited concurrency for example at the other end of this spectrum.

Well, that's our current situation, so why are we even considering changing anything? We might as well keep it as it is.

> For comparison, the Beats CI buildkite jobs look to have at least as many concurrent jobs on Buildkite agents if not more. How are they avoiding capacity issues there? That repository is also much more active than agent.

I don't think we can compare the two here: every agent build creates a large number of VMs. To my knowledge, Beats does not do that.

My understanding is that creating a massive number of VMs concurrently through the GCP API is the source of our problems. We can have a lot of builds (main, 8.14, 8.13, daily, PRs) running at the same time, and every build has 3 different steps that create VMs. I suppose we just overload the GCP scheduler in this zone, but I don't have any data to support this claim.

cmacknz (Member) commented May 13, 2024

> Well, that's our current situation, so why are we even considering changing anything? We might as well keep it as it is.

Flaky tests are a net negative to developer productivity and PR throughput. We want to drive them to zero as a way of optimizing developer productivity. I view the primary goal of eliminating flaky tests as increasing developer productivity.

This means that a solution that eliminates flaky tests at the cost of making developer productivity and throughput worse than they already are is a net negative to this goal (with room for nuance; eliminating flakiness by being 1 minute slower is obviously fine). This is why I don't like the idea of just limiting build parallelism without a strategy for eventually removing the limit.

If we learn something from briefly reducing parallelism, for example to confirm it fixes the problem or to find exactly the point where it starts failing, then I'm OK with that as a temporary measure to help move the solution forward.

I just don't want us to permanently reduce parallelism and flakiness at the expense of developer productivity, because we are really using flakiness as a proxy measurement of wasted developer time. Overall developer productivity is the result we actually care about.

pazone (Contributor) commented May 17, 2024

Now it's separated. Each test type has a dedicated concurrency group limited to 8. The values can be adjusted in integration.pipeline.yml.
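For context, an entry in integration.pipeline.yml along these lines achieves that; the labels, commands, and group names below are illustrative placeholders rather than the exact contents of the file:

steps:
  - label: "Serverless integration test"
    command: "mage integration:test"   # placeholder command
    # Each test type gets its own named group, each capped at 8 concurrent jobs.
    concurrency_group: "elastic-agent/serverless-integration"
    concurrency: 8
  - label: "Integration tests"
    command: "mage integration:test"   # placeholder command
    concurrency_group: "elastic-agent/integration"
    concurrency: 8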
