Make delayed enrollment try indefinitely #4727

Merged
merged 4 commits into elastic:main on May 23, 2024

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented May 9, 2024

What does this PR do?

Make delayed enrollment try indefinitely to enroll into Fleet.

Why is it important?

In some cases, when a new machine comes online with delayed enrollment it doesn't have network access immediately, so the enrollment needs to keep retrying until it is able to successfully enroll into Fleet.
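As an illustration of the intended behavior, here is a minimal sketch of an indefinite enrollment retry loop, assuming a hypothetical enrollOnce helper and a fixed retry interval (neither is the actual agent code); the loop exits only on successful enrollment or when the service's context is cancelled:

```go
package enrollment

import (
	"context"
	"log"
	"time"
)

// enrollWithRetry is a hypothetical sketch of indefinite delayed enrollment:
// keep calling enrollOnce until it succeeds or the service context is cancelled.
// The name, signature, and fixed backoff are illustrative, not the agent's real code.
func enrollWithRetry(ctx context.Context, enrollOnce func(context.Context) error) error {
	const backoff = 10 * time.Second
	for {
		err := enrollOnce(ctx)
		if err == nil {
			return nil // enrolled into Fleet; normal startup can continue
		}
		log.Printf("failed to perform delayed enrollment (will try again): %v", err)

		select {
		case <-ctx.Done():
			// The service is being stopped; exit instead of retrying forever.
			return ctx.Err()
		case <-time.After(backoff):
			// Wait before the next attempt.
		}
	}
}
```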

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Doesn't disrupt the user; it provides more reliability for delayed enrollment.

How to test this PR locally

  1. Install with --delay-enroll
  2. Manually start the Elastic Agent service
  3. The agent should enroll and then run.

Related issues

  • Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) and backport-v8.14.0 labels on May 9, 2024
@blakerouse blakerouse self-assigned this May 9, 2024
@blakerouse blakerouse requested a review from a team as a code owner May 9, 2024 15:02
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@strawgate

One thing that we might want to consider is having the service report "started" earlier in the process, so that the service flaps on failure instead of showing as "running" but not actually doing anything?

@cmacknz
Member

cmacknz commented May 9, 2024

Yeah, I am concerned that this will make true failed enrollments harder to debug and notice, as the agent process will just appear deadlocked. I am thinking of failures like the "we forgot we had a proxy" case or an invalid enrollment token or something.

I think we should look at solutions that behave better in the case of failure first:

  1. Doing the delayed enrollment asynchronously so that the agent is still responsive, but if we do it late enough I think we have to re-exec, which potentially runs into this problem of exiting before Windows thinks we've started again.
  2. Moving the point where we report ourselves as running to Windows out of the goroutine so that it is guaranteed to happen first before we can exit, and seeing if Windows then properly handles restarts. Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716 (comment). I do wonder if we also need a way to detect that Windows has received+processed the svc.Running event before we exit, in case it is not enough to just have sent it and assume it gets processed in time.

There is a chance we need a combination of both of those things.

@blakerouse
Contributor Author

Yeah, I am concerned that this will make true failed enrollments harder to debug and notice, as the agent process will just appear deadlocked. I am thinking of failures like the "we forgot we had a proxy" case or an invalid enrollment token or something.

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

I think we should look at solutions that behave better in the case of failure first:

  1. Doing the delayed enrollment asynchronously so that the agent is still responsive, but if we do it late enough I think we have to re-exec, which potentially runs into this problem of exiting before Windows thinks we've started again.

Why is it being changed to asynchronous? The Elastic Agent cannot do anything other than enroll at this point. All problems with enrollment in delayed enrollment should be retried. This is a VM image being deployed; there is no command and control until it's enrolled.

  1. Moving the point where we report ourselves as running to Windows out of the goroutine so that it is guaranteed to happen first before we can exit, and seeing if Windows then properly handles restarts. Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716 (comment). I do wonder if we also need a way to detect that Windows has received+processed the svc.Running event before we exit, in case it is not enough to just have sent it and assume it gets processed in time.

I am confused about how this is in any way related to delayed enrollment. Where in the code does delayed enrollment prevent the Windows code from telling the service manager that it is running? The code clearly shows that it is started in its own goroutine before delayed enrollment begins. Enrollment is not preventing the reporting of status to the Windows service manager.

The Elastic Agent as a service is running, and it's trying to perform enrollment. It should do nothing other than continue to try to enroll.
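For illustration, a rough sketch of that ordering using the golang.org/x/sys/windows/svc package; the handler below is a simplified placeholder with a hypothetical run callback, not the actual agent service code. The Running state is reported to the service manager up front, independently of the delayed enrollment retry loop:

```go
package service

import (
	"context"

	"golang.org/x/sys/windows/svc"
)

// agentService is a hypothetical Windows service handler. run stands in for the
// agent's startup work, including the delayed enrollment retry loop.
type agentService struct {
	run func(ctx context.Context) error
}

func (s *agentService) Execute(args []string, requests <-chan svc.ChangeRequest, status chan<- svc.Status) (bool, uint32) {
	// Report Running to the Windows service manager before any enrollment work begins.
	status <- svc.Status{State: svc.Running, Accepts: svc.AcceptStop | svc.AcceptShutdown}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- s.run(ctx) }()

	for {
		select {
		case req := <-requests:
			if req.Cmd == svc.Stop || req.Cmd == svc.Shutdown {
				status <- svc.Status{State: svc.StopPending}
				cancel() // this also cancels the enrollment retry loop
				<-done
				return false, 0
			}
		case <-done:
			return false, 0
		}
	}
}
```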

There is a chance we need a combination of both of those things.

I am at a loss as to what real problem requires any such combination, and how service manager control communication is in any way related to delayed enrollment.

@strawgate

strawgate commented May 9, 2024

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

In my imagination this would likely keep agents installed by remote management tools or an MSI from returning and showing the administrator the deployment error, since they wait for enroll before returning when not using delay-enroll?

Having it retry forever would result in an install that never finishes?

@blakerouse
Contributor Author

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

In my imagination this would likely keep agents installed by remote management tools or an MSI from returning and showing the administrator the deployment error.

Having it retry forever would show as an install that never finishes?

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

@strawgate

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

Right but most installs don't delay enrollment and those would be impacted by enrollment retrying forever?

@blakerouse
Contributor Author

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

Right but most installs don't delay enrollment and those would be impacted by enrollment retrying forever?

No, this PR only affects delayed enrollment.

@cmacknz
Member

cmacknz commented May 9, 2024

I am going off of this comment #4716 (comment) where it appears:

  1. Delayed enrollment failing causes the agent to exit.
  2. The agent exiting before the Windows service manager knows it is in the running state causes it to never be restarted.

I think retrying forever would solve this, as the agent keeps running. All remaining concerns are now about how users will be able to detect that their agent is stuck retrying forever. Will the elastic-agent diagnostics or elastic-agent logs commands work while the agent is in this retry loop?

My biggest concern at this point is that we release this and end up with a support case where the user can't get us the information we or the support engineers can use to explain to them what has gone wrong.

@blakerouse
Contributor Author

Delayed enrollment always happens before anything else the Elastic Agent does. There is no control protocol running at that point, and there never has been. Either of those commands should work, except diagnostics will not collect items from the daemon because the control protocol is not running yet.

Clear log messages will be present for all enrollment failures.

@ycombinator ycombinator requested review from michel-laterman and removed request for andrzej-stencel May 9, 2024 21:49
Contributor

@pchila pchila left a comment


Having the elastic-agent stuck in an infinite loop, without a real way of getting out other than either enrolling successfully or being killed, makes me a bit nervous: at this point we are not even listening for changes in the configuration file or anything else.

Am I missing something, or can we only exit the delayed enroll loop when we enroll successfully (the context cannot really be cancelled at this point)?

Maybe we can give up after 10 attempts or so and rely on the service manager to restart the elastic-agent service if it's not deactivated?

@blakerouse
Contributor Author

Having the elastic-agent stuck in an infinite loop, without a real way of getting out other than either enrolling successfully or being killed, makes me a bit nervous: at this point we are not even listening for changes in the configuration file or anything else.

It should not even be reading a configuration file; it should only have read the delayed enrollment information file at this point. In this mode it either enrolls or it does nothing; there are no other modes or things the Elastic Agent should be doing.

Am I missing something, or can we only exit the delayed enroll loop when we enroll successfully (the context cannot really be cancelled at this point)?

The context can be cancelled when the service is stopped. Again, remember this is for sysprepped VM images that are being deployed. The Elastic Agent's goal in this mode is to enroll; that is the only goal it has in this state.

Maybe we can give up after 10 attempts or so and rely on the service manager to restart the elastic-agent service if it's not deactivated?

Why? If we operate in this mode the service manager will just restart the Elastic Agent and it will do the same thing again.

Contributor

@michel-laterman michel-laterman left a comment


lgtm

@cmacknz
Member

cmacknz commented May 21, 2024

I wanted to see how this behaves, so I hard-coded the func (e *enrollCmdOption) remoteConfig() (remote.Config, error) implementation to unconditionally return an error and tested this out in a VM.
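For reference, a hedged sketch of that kind of change; the body shown here is a guessed minimal way to force the failure, not necessarily the exact edit that was tested:

```go
// Hypothetical testing hack: make every enrollment attempt fail so the indefinite
// delayed-enrollment retry loop can be observed from the service manager.
func (e *enrollCmdOption) remoteConfig() (remote.Config, error) {
	return remote.Config{}, errors.New("fail to enroll")
}
```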

If I start the service, it appears as running when I check systemctl status elastic-agent. I can also see the error logs if I scroll horizontally far enough.

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo systemctl status elastic-agent
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-05-21 15:08:13 EDT; 2min 30s ago
   Main PID: 2871 (elastic-agent)
      Tasks: 6 (limit: 1060)
     Memory: 106.2M (peak: 110.6M)
        CPU: 2min 8.859s
     CGroup: /system.slice/elastic-agent.service
             └─2871 /opt/Elastic/Agent/elastic-agent

May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
lines 1-20/20 (END)

The elastic-agent status command does not work when we are stuck in this restart loop:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent status
Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /opt/Elastic/Agent/elastic-agent.sock: connect: no such file or directory"
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.15/fleet-troubleshooting.html

The elastic-agent logs command does work and shows the error:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent logs
{"log.level":"error","@timestamp":"2024-05-21T19:13:27.203Z","log.origin":{"file.name":"cmd/run.go","file.line":558},"message":"failed to perform delayed enrollment (will try again): fail to enroll","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

I can successfully uninstall the service while in this failure loop:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent uninstall
Elastic Agent will be uninstalled from your system at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
[=== ] Done  [0s]
Elastic Agent has been uninstalled.

All of this is enough to be able to tell that something is wrong.

@cmacknz
Member

cmacknz commented May 21, 2024

I tried this on Windows as well with the same results, except that the Windows Service manager and commands like Get-Service don't give you any logs.

I'm going to approve this given you can tell it's not working using the elastic-agent logs command and uninstalling is still possible.

Keeping the service restarts on failure might have made this easier to detect for some users, but it is still detectable.

@blakerouse blakerouse enabled auto-merge (squash) May 22, 2024 13:36
@blakerouse blakerouse merged commit c14df02 into elastic:main May 23, 2024
14 checks passed
mergify bot pushed a commit that referenced this pull request May 23, 2024
* Make delay enrollment indefinite.

* Add changelog entry.

(cherry picked from commit c14df02)