Make delayed enrollment try indefinitely #4727

Merged
merged 4 commits into elastic:main on May 23, 2024

Conversation

blakerouse
Contributor

@blakerouse blakerouse commented May 9, 2024

What does this PR do?

Make delayed enrollment try indefinitely to enroll into Fleet.

Why is it important?

In some cases, when a new machine comes online with delayed enrollment it doesn't have network access immediately, so the enrollment needs to keep retrying until it is able to successfully enroll into Fleet.
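As an illustration of the intended behavior, here is a minimal sketch of an indefinite enrollment retry loop, assuming a hypothetical enrollOnce helper and a fixed retry interval (neither is the actual agent code); the loop exits only on successful enrollment or when the service's context is cancelled:

```go
package enrollment

import (
	"context"
	"log"
	"time"
)

// enrollWithRetry is a hypothetical sketch of indefinite delayed enrollment:
// keep calling enrollOnce until it succeeds or the service context is cancelled.
// The name, signature, and fixed backoff are illustrative, not the agent's real code.
func enrollWithRetry(ctx context.Context, enrollOnce func(context.Context) error) error {
	const backoff = 10 * time.Second
	for {
		err := enrollOnce(ctx)
		if err == nil {
			return nil // enrolled into Fleet; normal startup can continue
		}
		log.Printf("failed to perform delayed enrollment (will try again): %v", err)

		select {
		case <-ctx.Done():
			// The service is being stopped; exit instead of retrying forever.
			return ctx.Err()
		case <-time.After(backoff):
			// Wait before the next attempt.
		}
	}
}
```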

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Doesn't disrupt the user; it provides more reliability for delayed enrollment.

How to test this PR locally

  1. Install with --delay-enroll
  2. Manually start the Elastic Agent service
  3. The agent should enroll and then run.

Related issues

  • Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) and backport-v8.14.0 labels on May 9, 2024
@blakerouse blakerouse self-assigned this May 9, 2024
@blakerouse blakerouse requested a review from a team as a code owner May 9, 2024 15:02
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@strawgate

One thing that we might want to consider is having the service report "started" earlier in the process, so that the service flaps on failure instead of showing as "running" but not actually doing anything?

@cmacknz
Member

cmacknz commented May 9, 2024

Yeah, I am concerned that this will make true failed enrollments harder to debug and notice, as the agent process will just appear deadlocked. I am thinking of failures like the "we forgot we had a proxy" case or an invalid enrollment token or something.

I think we should look at solutions that behave better in the case of failure first:

  1. Doing the delayed enrollment asynchronously so that the agent is still responsive, but if we do it late enough I think we have to re-exec, which potentially runs into this problem of exiting before Windows thinks we've started again.
  2. Moving the point where we report ourselves as running to Windows out of the goroutine so that it is guaranteed to happen first before we can exit, and seeing if Windows then properly handles restarts. Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716 (comment). I do wonder if we also need a way to detect that Windows has received+processed the svc.Running event before we exit, in case it is not enough to just have sent it and assume it gets processed in time.

There is a chance we need a combination of both of those things.

@blakerouse
Contributor Author

Yeah, I am concerned that this will make true failed enrollments harder to debug and notice, as the agent process will just appear deadlocked. I am thinking of failures like the "we forgot we had a proxy" case or an invalid enrollment token or something.

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

I think we should look at solutions that behave better in the case of failure first:

  1. Doing the delayed enrollment asynchronously so that the agent is still responsive, but if we do it late enough I think we have to re-exec, which potentially runs into this problem of exiting before Windows thinks we've started again.

Why is it being changed to asynchronous? The Elastic Agent cannot do anything other than enroll at this point. All problems with enrollment in delayed enrollment should be retried. This is a VM image being deployed; there is no command and control until it's enrolled.

  1. Moving the point where we report ourselves as running to Windows out of the goroutine so that it is guaranteed to happen first before we can exit, and seeing if Windows then properly handles restarts. Elastic Agent service configured with delayed enrollment exits when it cannot connect. #4716 (comment). I do wonder if we also need a way to detect that Windows has received+processed the svc.Running event before we exit, in case it is not enough to just have sent it and assume it gets processed in time.

I am confused about how this is in any way related to delayed enrollment. Where in the code does delayed enrollment prevent the Windows code from telling the service manager that it is running? The code clearly shows that it is started in its own goroutine before delayed enrollment begins. Enrollment is not preventing the reporting of status to the Windows service manager.

The Elastic Agent as a service is running, and it's trying to perform enrollment. It should do nothing other than continue to try to enroll.
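For illustration, a rough sketch of that ordering using the golang.org/x/sys/windows/svc package; the handler below is a simplified placeholder with a hypothetical run callback, not the actual agent service code. The Running state is reported to the service manager up front, independently of the delayed enrollment retry loop:

```go
package service

import (
	"context"

	"golang.org/x/sys/windows/svc"
)

// agentService is a hypothetical Windows service handler. run stands in for the
// agent's startup work, including the delayed enrollment retry loop.
type agentService struct {
	run func(ctx context.Context) error
}

func (s *agentService) Execute(args []string, requests <-chan svc.ChangeRequest, status chan<- svc.Status) (bool, uint32) {
	// Report Running to the Windows service manager before any enrollment work begins.
	status <- svc.Status{State: svc.Running, Accepts: svc.AcceptStop | svc.AcceptShutdown}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- s.run(ctx) }()

	for {
		select {
		case req := <-requests:
			if req.Cmd == svc.Stop || req.Cmd == svc.Shutdown {
				status <- svc.Status{State: svc.StopPending}
				cancel() // this also cancels the enrollment retry loop
				<-done
				return false, 0
			}
		case <-done:
			return false, 0
		}
	}
}
```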

There is a chance we need a combination of both of those things.

I am at a loss as to what real problem requires any such combination, and how service manager control communication is in any way related to delayed enrollment.

@strawgate

strawgate commented May 9, 2024

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

In my imagination this would likely keep agents installed by remote management tools or an MSI from returning and showing the administrator the deployment error, since they wait for enroll before returning when not using delay-enroll?

Having it retry forever would result in an install that never finishes?

@blakerouse
Contributor Author

If any of that is wrong, then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.

In my imagination this would likely keep agents installed by remote management tools or an MSI from returning and showing the administrator the deployment error.

Having it retry forever would show as an install that never finishes?

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

@strawgate

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

Right but most installs don't delay enrollment and those would be impacted by enrollment retrying forever?

@blakerouse
Contributor Author

This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until the first boot of a sysprepped image. At that point there is no visibility, and there cannot be visibility until it enrolls.

Right but most installs don't delay enrollment and those would be impacted by enrollment retrying forever?

No, this PR only affects delayed enrollment.

@cmacknz
Member

cmacknz commented May 9, 2024

I am going off of this comment #4716 (comment) where it appears:

  1. Delayed enrollment failing causes the agent to exit.
  2. The agent exiting before the Windows service manager knows it is in the running state causes it to never be restarted.

I think retrying forever would solve this, as the agent keeps running. All remaining concerns are now about how users will be able to detect that their agent is stuck retrying forever. Will the elastic-agent diagnostics or elastic-agent logs commands work while the agent is in this retry loop?

My biggest concern at this point is that we release this and end up with a support case where the user can't get us the information we or the support engineers can use to explain to them what has gone wrong.

@blakerouse
Contributor Author

Delayed enrollment always happens before anything else the Elastic Agent does. There is no control protocol running at that point, and there never has been. Either of those commands should work, except diagnostics will not collect items from the daemon because the control protocol is not running yet.

Clear log messages will be present for all enrollment failures.

@ycombinator ycombinator requested review from michel-laterman and removed request for andrzej-stencel May 9, 2024 21:49
Contributor

@pchila pchila left a comment


Having the elastic-agent stuck in an infinite loop, without a real way of getting out other than either enrolling successfully or being killed, makes me a bit nervous: at this point we are not even listening for changes in the configuration file or anything else.

Am I missing something, or can we only exit the delayed enroll loop when we enroll successfully (the context cannot really be cancelled at this point)?

Maybe we can give up after 10 attempts or so and rely on the service manager to restart the elastic-agent service if it's not deactivated?

@blakerouse
Contributor Author

Having the elastic-agent stuck in an infinite loop, without a real way of getting out other than either enrolling successfully or being killed, makes me a bit nervous: at this point we are not even listening for changes in the configuration file or anything else.

It should not even be reading a configuration file; it should only have read the delayed enrollment information file at this point. In this mode it either enrolls or it does nothing; there are no other modes or things the Elastic Agent should be doing.

Am I missing something, or can we only exit the delayed enroll loop when we enroll successfully (the context cannot really be cancelled at this point)?

The context can be cancelled when the service is stopped. Again, remember this is for sysprepped VM images that are being deployed. The Elastic Agent's goal in this mode is to enroll; that is the only goal it has in this state.

Maybe we can give up after 10 attempts or so and rely on the service manager to restart the elastic-agent service if it's not deactivated?

Why? If we operate in this mode the service manager will just restart the Elastic Agent and it will do the same thing again.

Contributor

@michel-laterman michel-laterman left a comment


lgtm

@cmacknz
Member

cmacknz commented May 21, 2024

I wanted to see how this behaves, so I hard-coded the func (e *enrollCmdOption) remoteConfig() (remote.Config, error) implementation to unconditionally return an error and tested this out in a VM.
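For reference, a hedged sketch of that kind of change; the body shown here is a guessed minimal way to force the failure, not necessarily the exact edit that was tested:

```go
// Hypothetical testing hack: make every enrollment attempt fail so the indefinite
// delayed-enrollment retry loop can be observed from the service manager.
func (e *enrollCmdOption) remoteConfig() (remote.Config, error) {
	return remote.Config{}, errors.New("fail to enroll")
}
```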

If I start the service, it appears as running when I check systemctl status elastic-agent. I can also see the error logs if I scroll horizontally far enough.

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo systemctl status elastic-agent
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
     Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; preset: enabled)
     Active: active (running) since Tue 2024-05-21 15:08:13 EDT; 2min 30s ago
   Main PID: 2871 (elastic-agent)
      Tasks: 6 (limit: 1060)
     Memory: 106.2M (peak: 110.6M)
        CPU: 2min 8.859s
     CGroup: /system.slice/elastic-agent.service
             └─2871 /opt/Elastic/Agent/elastic-agent

May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
lines 1-20/20 (END)

The elastic-agent status command does not work when we are stuck in this restart loop:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent status
Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /opt/Elastic/Agent/elastic-agent.sock: connect: no such file or directory"
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.15/fleet-troubleshooting.html

The elastic-agent logs command does work and shows the error:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent logs
{"log.level":"error","@timestamp":"2024-05-21T19:13:27.203Z","log.origin":{"file.name":"cmd/run.go","file.line":558},"message":"failed to perform delayed enrollment (will try again): fail to enroll","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

I can successfully uninstall the service while in this failure loop:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent uninstall
Elastic Agent will be uninstalled from your system at /opt/Elastic/Agent. Do you want to continue? [Y/n]:y
[=== ] Done  [0s]
Elastic Agent has been uninstalled.

All of this is enough to be able to tell that something is wrong.

@cmacknz
Member

cmacknz commented May 21, 2024

I tried this on Windows as well with the same results, except that the Windows Service manager and commands like Get-Service don't give you any logs.

I'm going to approve this given you can tell it's not working using the elastic-agent logs command and uninstalling is still possible.

Keeping the service restarts on failure might have made this easier to detect for some users, but it is still detectable.

@blakerouse blakerouse enabled auto-merge (squash) May 22, 2024 13:36
@blakerouse blakerouse merged commit c14df02 into elastic:main May 23, 2024
14 checks passed
mergify bot pushed a commit that referenced this pull request May 23, 2024
* Make delay enrollment indefinite.

* Add changelog entry.

(cherry picked from commit c14df02)