
Separate expensive test in elastic-agent CI to a separate pipeline #4710

Closed
pchila opened this issue May 8, 2024 · 21 comments · Fixed by #4739
Labels: enhancement, Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane

pchila (Contributor) commented May 8, 2024

To have a bit more control over how many concurrent serverless/extended leak/integration tests we run in parallel at any given time, we could create a separate pipeline that runs only those tests.

For example, we could create a pipeline named elastic-agent-extended-testing that creates a build for a PR or a release-branch commit, or is perhaps even invoked by the current elastic-agent pipeline (I am not very knowledgeable about all the Buildkite features and settings). It could start from a commit (or even from already-built packages) and run our longer tests.
We could configure this pipeline to only run a certain number of builds in parallel, giving us one more knob to turn to limit the impact we have on cloud environments (especially with respect to exhausting ARM VMs or other scarce resources in a zone/region).

Edit: if we implement a separate pipeline for extended testing we will need to adjust the PR checks (we will need an additional required check from the new build) and also our build indicator for main and release branches (the small green tick or red X next to our commits in GitHub).
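For illustration only, the existing pipeline could hand off to such a pipeline with a Buildkite trigger step. This is a sketch under the assumption that the new pipeline is registered under the proposed elastic-agent-extended-testing slug; it is not an actual pipeline definition from the repo:

steps:
  # Trigger the extended-testing pipeline for the same commit and branch as this build.
  - trigger: "elastic-agent-extended-testing"
    label: "Trigger extended testing"
    build:
      commit: "${BUILDKITE_COMMIT}"
      branch: "${BUILDKITE_BRANCH}"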

pchila added the enhancement, Team:Elastic-Agent-Control-Plane, and Team:Elastic-Agent labels on May 8, 2024
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

pchila (Contributor, Author) commented May 8, 2024

cc @ycombinator @rdner

ycombinator (Contributor) commented:

@pchila @blakerouse is there some easy way already to invoke these tests separately, e.g. a mage command with an environment variable? Or do we need to build out some additional functionality in the integration testing framework first to make this separation happen?

I imagine once we have an easy way to invoke these tests separately, it's a matter of editing the buildkite pipeline definition to run both test jobs.

pchila (Contributor, Author) commented May 8, 2024

@ycombinator wrote:

> @pchila @blakerouse is there some easy way already to invoke these tests separately, e.g. a mage command with an environment variable? Or do we need to build out some additional functionality in the integration testing framework first to make this separation happen?
>
> I imagine once we have an easy way to invoke these tests separately, it's a matter of editing the buildkite pipeline definition to run both test jobs.

We already have mage targets that run the tests... they expect the necessary packages to be present, though, so we could do either of the following:

  1. have the new pipeline package elastic-agent in all the necessary formats starting from the source code (very similar to what we are doing today)
  2. have the new pipeline download already-built packages from another pipeline or from a URL specific to the commit we are testing (this is interesting as it opens up the possibility of running tests not just for PRs or main)

I believe option 1 is easier and quicker... and we can always evolve it toward option 2 later if we see value in it.

rdner (Member) commented May 10, 2024

I talked to @pazone and he confirmed that it's possible; he asked for this ticket to be assigned to him.

blakerouse (Contributor) commented:

The integration testing framework already supports executing specific groups: https://github.com/elastic/elastic-agent/blob/main/docs/test-framework-dev-guide.md#selecting-specific-group

That could be used to execute only a set of groups per job type. It might be too granular, though, as it requires selecting the groups explicitly, and any time a new group is added the job would need adjusting.

Adding something like define.Require({Extended: true}) might be more desirable. That would allow the selection of extended testing versus non-extended testing.

pazone (Contributor) commented May 10, 2024

We can also avoid running ITs on each PR commit. For example, we could set up a special PR comment that triggers the separate extended-testing pipeline.

cmacknz (Member) commented May 10, 2024

We should introduce the monorepo plugin so we stop pointlessly triggering integration tests. That can be done separately from this issue. #3674

cmacknz (Member) commented May 10, 2024

I have wanted two things for a while as ergonomic improvements that I think would fit under this:

  1. Introducing a Buildkite step that publishes all the packages. This would save me from having to recompile PRs to test them locally. Buildkite needs a 'mage package' step separate from integration tests #4142. This could feed into the integration test job(s). Perhaps instead of uploading the packages via SSH we could have the test VMs download them from blob storage in the same GCP region to speed this step up as well.
  2. Introducing separate Buildkite steps for the individual integration test architectures. For example instead of seeing "Integrations Tests" failed and having to triage and retry all of them I'd rather see "Integration Tests - Windows X86" failed and have the ability to retry only the Windows step. The primary con of this is the increased number of mage integration:test targets running so I think this fits under here.

Finally, for the concurrency issue, are there any built-in Buildkite features we can use to help with this quickly? https://buildkite.com/docs/pipelines/controlling-concurrency for example?
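For reference, the feature behind that link is the concurrency_group/concurrency pair of step attributes. A minimal sketch, where the label, command, group name, and limit are placeholders rather than our actual configuration:

steps:
  - label: "Integration tests"
    command: "mage integration:test"
    # At most 8 jobs sharing this named group may run at once, across all builds.
    concurrency_group: "elastic-agent/integration-tests"
    concurrency: 8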

blakerouse (Contributor) commented:

> I have wanted two things for a while as ergonomic improvements that I think would fit under this:
>
> 1. Introducing a Buildkite step that publishes all the packages. This would save me from having to recompile PRs to test them locally. Buildkite needs a 'mage package' step separate from integration tests #4142. This could feed into the integration test job(s). Perhaps instead of uploading the packages via SSH we could have the test VMs download them from blob storage in the same GCP region to speed this step up as well.

As long as this still works from a local developer system. I do not want to have to think about having to upload to GCP before running the tests or anything like that.

> 2. Introducing separate Buildkite steps for the individual integration test architectures. For example instead of seeing "Integrations Tests" failed and having to triage and retry all of them I'd rather see "Integration Tests - Windows X86" failed and have the ability to retry only the Windows step. The primary con of this is the increased number of mage integration:test targets running so I think this fits under here.

TEST_PLATFORMS="windows/amd64" works for this use case and is already built into the framework.
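As a sketch of how that could look as separately retryable Buildkite steps, assuming one step per TEST_PLATFORMS value (the labels, command, and exact platform strings below are assumptions, not the current pipeline definition):

steps:
  - label: "Integration Tests - Windows x86_64"
    command: "mage integration:test"
    env:
      # Restrict this step to a single platform so it can be retried on its own.
      TEST_PLATFORMS: "windows/amd64"
  - label: "Integration Tests - Linux arm64"
    command: "mage integration:test"
    env:
      TEST_PLATFORMS: "linux/arm64"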

rdner (Member) commented May 10, 2024

@pazone for now the important part, in order to stabilize the CI, is to extract the following steps into a separate pipeline:

  • Serverless Beats Tests
  • Serverless integration test
  • Integration tests

This separate pipeline should have a limited concurrency level. Let's start with 4 maximum concurrent builds. So, we cannot run more than 4 concurrent steps of the type listed above.

Once we have this, we can iterate and extend functionality / granularity as suggested in the comments here.

Unless @cmacknz has any objection.

cmacknz (Member) commented May 10, 2024

> So, we cannot run more than 4 concurrent steps of the type listed above.

Why 4? I don't want us all to be waiting around for more hours than necessary. I know we have to pick a number, but what problem are we specifically trying to solve with this? Reducing number of VMs below some quota? What quota? Where is it set?

Another option here is to reduce the number of unique test groups in the integration tests, since each group creates its own VM instances, or to refactor the three separate targets into one target so they all run sequentially.

rdner (Member) commented May 10, 2024

> Why 4? I don't want us all to be waiting around for more hours than necessary. I know we have to pick a number, but what problem are we specifically trying to solve with this? Reducing number of VMs below some quota? What quota? Where is it set?

@cmacknz we need to start with some number and iterate. We could start with 16, 10, any number, and then see how it affects the CI. 4 is the lowest number I think we can go with; we would then see how stable the CI becomes. If the CI is fully stable, we increase the limit the next week, and keep increasing until we start to see instabilities again.

Changing this number is relatively easy; the important part is to have some limit in place that we can adjust.

cmacknz (Member) commented May 10, 2024

Is there a way we can avoid guessing? Can we observe how many active VMs we have at a time in GCP and go off of that? If we have to guess, that works, but it would be faster to know what scale we are already operating at.

How many VMs does a single integration test run require right now?

For example, counting manually it looks like we have 9 unique test groups, which means that something like at least 9 VMs of each machine type needs to be scheduled simultaneously.

I see:

  1. container
  2. Upgrade
  3. Default
  4. FleetPrivileged
  5. FleetAirgapped
  6. FleetAirgappedPrivileged
  7. DEB
  8. RPM
  9. FQDN
rg 'Group\:' testing/integration
testing/integration/container_cmd_test.go
37:             Group: "container",

testing/integration/upgrade_rollback_test.go
44:             Group: Upgrade,
159:            Group: Upgrade,

testing/integration/diagnostics_test.go
131:            Group: Default,
157:            Group: Default,
183:            Group: Default,
206:            Group: Default,

testing/integration/fake_test.go
76:             Group: Default,
137:            Group: Default,

testing/integration/upgrade_gpg_test.go
27:             Group: Upgrade,
96:             Group: Upgrade,

testing/integration/delay_enroll_test.go
29:             Group: Fleet,

testing/integration/upgrade_fleet_test.go
44:             Group: Fleet,
58:             Group: FleetPrivileged,
108:            Group: FleetAirgapped,
120:            Group: FleetAirgappedPrivileged,

testing/integration/upgrade_standalone_retry_test.go
31:             Group: Upgrade,

testing/integration/upgrade_downgrade_test.go
29:             Group: Upgrade,

testing/integration/logs_ingestion_test.go
43:             Group: Fleet,
105:            Group: Deb,
173:            Group: RPM,

testing/integration/package_version_test.go
36:             Group: Default,
64:             Group: Fleet,

testing/integration/upgrade_standalone_same_commit_test.go
40:             Group: Upgrade,

testing/integration/monitoring_probe_reload_test.go
45:             Group: "fleet",

testing/integration/fqdn_test.go
36:             Group: FQDN,

testing/integration/upgrade_standalone_test.go
27:             Group: Upgrade,

testing/integration/upgrade_uninstall_test.go
28:             Group: Upgrade,

testing/integration/endpoint_security_test.go
81:             Group: Fleet,
107:            Group: Fleet,
135:            Group: Fleet,
504:            Group: Fleet,
575:            Group: Fleet,
655:            Group: Fleet,

testing/integration/proxy_url_test.go
98:             Group: Fleet,
146:            Group: Fleet,
196:            Group: Fleet,
260:            Group: Fleet,
328:            Group: Fleet,

testing/integration/install_test.go
29:             Group: Default,
96:             Group: Default,
175:            Group: Default,

testing/integration/upgrade_standalone_inprogress_test.go
30:             Group: Upgrade,

testing/integration/upgrade_broken_package_test.go
30:             Group: Upgrade,

testing/integration/agent_long_running_leak_test.go
65:             Group: "fleet",

testing/integration/apm_propagation_test.go
56:             Group: Default,

testing/integration/otel_test.go
112:            Group: Default,
228:            Group: Default,

testing/integration/metrics_monitoring_test.go
36:             Group: Fleet,

testing/integration/install_privileged_test.go
27:             Group: Default,
79:             Group: Default,

testing/integration/beats_serverless_test.go
53:             Group: Default,

testing/integration/monitoring_probe_preserve_text_cfg_test.go
80:             Group: "fleet",

rdner (Member) commented May 13, 2024

@cmacknz even if we calculated the precise number of VMs needed per build, we still would not know the failure/instability point of the GCP API. I don't have historical data for the number of VMs we had running when GCP started failing for us.

If it's ~9 VMs per running build, 4 concurrent builds would already result in 36 VMs, plus we can guarantee their shutdown only after 4 hours, which makes the number of VMs running at the same time somewhat unpredictable. This makes me think we have to guess.

The number can be anything; we could start with 8, it does not matter. But we must start with something and then iterate, as the "Progress, SIMPLE perfection" principle in our Source Code says. I'm afraid we might spend too much time perfecting the number instead of trying some number and moving on from there.

cmacknz (Member) commented May 13, 2024

> we still would not know the failure/instability point of the GCP API

My main concern is that without understanding in detail what the limit is, or exactly what problem we are working around, we may permanently make CI slower without an explanation or a path to fixing the underlying problem.

I do not want us to arbitrarily limit the number of CI jobs and permanently cap developer productivity without an actual reason. Are we requesting too many machines at once? Is the type of machine we are requesting not as available as others we could use instead? Are we using the API wrong, and are we supposed to overflow into another AZ in the same region when we get capacity-related errors? Do we need a capacity reservation with GCP itself to guarantee availability (and would this also be cheaper)?

If you don't want to try and dig into the root cause, at least do an analysis of whether the reduced flakiness from capping concurrent builds actually saves us time over running more builds in parallel with the risk that some of them fail.

If we can make CI 100% reliable by executing a single build at a time, we have solved the flakiness problem at the expense of greatly limiting our throughput as a team. I will happily trade off one flaky build per day for unlimited concurrency for example at the other end of this spectrum.

cmacknz (Member) commented May 13, 2024

For comparison, the Beats CI buildkite jobs look to have at least as many concurrent jobs on Buildkite agents if not more. How are they avoiding capacity issues there? That repository is also much more active than agent.

pazone linked pull request #4739 on May 13, 2024 that will close this issue
rdner (Member) commented May 13, 2024

> If we can make CI 100% reliable by executing a single build at a time, we have solved the flakiness problem at the expense of greatly limiting our throughput as a team.

@cmacknz but I didn't suggest keeping this low number forever; we start with some number and then gradually increase it until we hit instability issues again. It's going to be a bit slower in the beginning, yes.

> I will happily trade off one flaky build per day for unlimited concurrency for example at the other end of this spectrum.

Well, that's our current situation, so why are we even considering changing anything? We might as well keep it as it is.

> For comparison, the Beats CI buildkite jobs look to have at least as many concurrent jobs on Buildkite agents if not more. How are they avoiding capacity issues there? That repository is also much more active than agent.

I don't think we can compare the two here: every agent build creates a large number of VMs. To my knowledge, Beats does not do that.

My understanding is that creating a massive number of VMs concurrently through the GCP API is the source of our problems. We can have a lot of builds (main, 8.14, 8.13, daily, PRs) running at the same time, and every build has 3 different steps that create VMs. I suppose we just overload the GCP scheduler in this zone, but I don't have any data to support this claim.

cmacknz (Member) commented May 13, 2024

> Well, that's our current situation, so why are we even considering changing anything? We might as well keep it as it is.

Flaky tests are a net negative to developer productivity and PR throughput. We want to drive them to zero as a way of optimizing developer productivity. I view the primary goal of eliminating flaky tests as increasing developer productivity.

This means that a solution that eliminates flaky tests at the cost of making developer productivity and throughput worse than they already are is a net negative to this goal (with room for nuance; eliminating flakiness by being 1 minute slower is obviously fine). This is why I don't like the idea of just limiting build parallelism without a strategy for eventually removing the limit.

If we learn something from briefly reducing parallelism, for example to confirm it fixes the problem or to find exactly the point where it starts failing, then I'm OK with that as a temporary measure to help move the solution forward.

I just don't want us to permanently reduce parallelism and flakiness at the expense of developer productivity, because we are really using flakiness as a proxy measurement of wasted developer time. Overall developer productivity is the result we actually care about.

pazone (Contributor) commented May 17, 2024

Now it's separated. Each test type has a dedicated concurrency group limited to 8. The values can be adjusted in integration.pipeline.yml.
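For context, an entry in integration.pipeline.yml along these lines achieves that; the labels, commands, and group names below are illustrative placeholders rather than the exact contents of the file:

steps:
  - label: "Serverless integration test"
    command: "mage integration:test"   # placeholder command
    # Each test type gets its own named group, each capped at 8 concurrent jobs.
    concurrency_group: "elastic-agent/serverless-integration"
    concurrency: 8
  - label: "Integration tests"
    command: "mage integration:test"   # placeholder command
    concurrency_group: "elastic-agent/integration"
    concurrency: 8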
