Make delayed enrollment try indefinitely #4727
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
One thing that we might want to consider is having the service report "started" later in the process, so that the service flaps on a failed enrollment instead of showing as "running" while not actually doing anything?
Yeah, I am concerned that this will make true failed enrollments harder to debug and notice, as the agent process will just appear deadlocked. I am thinking of failures like the "we forgot we had a proxy" case or an invalid enrollment token. I think we should look at solutions that behave better in the case of failure first:
There is a chance we need a combination of both of those things.
If any of that is wrong then it should still keep retrying. It will be logged that it cannot enroll, and the Elastic Agent will stay running according to the Windows service manager.
Why is it being changed to asynchronous? The Elastic Agent cannot do anything other than enroll at this point. All problems with enrollment in delayed enrollment should be retried. This is a VM image being deployed; there is no command and control until it's enrolled.
I am confused about how this is in any way related to delayed enrollment. Where in the code does delayed enrollment prevent the Windows code from telling the service manager that it is running? The code clearly shows that this starts in its own goroutine before the start of delayed enrollment. Enrollment is not preventing the reporting of status to the Windows service manager. The Elastic Agent as a service is running, and it's trying to perform enrollment. It should do nothing other than continue to try to enroll.
I am at a loss as to the real problem that requires any such combination, and how service manager control communication is in any way related to delayed enrollment.
In my imagination this would likely prevent agents installed by remote management tools or MSI from returning and showing the administrator the deployment error, since they wait for enrollment before returning when not using delayed enrollment. Wouldn't having it retry forever result in an install that never finishes?
This does not affect install. Install is always successful with delayed enrollment. Delayed enrollment will not enroll until first boot of a sysprepped image. At that point there is no visibility, and there can be no visibility until it enrolls.
Right, but most installs don't delay enrollment; wouldn't those be impacted by enrollment retrying forever?
No, this PR only affects delayed enrollment.
I am going off of this comment #4716 (comment) where it appears:
I think retrying forever would solve this, as the agent keeps running. All remaining concerns are now about how users will be able to detect that their agent is stuck retrying forever. Will the status and diagnostics commands still work? My biggest concern at this point is that we release this and end up with a support case where the user can't get us the information we or the support engineers need to explain to them what has gone wrong.
Delayed enrollment always happens before anything else the Elastic Agent does. There is no control protocol running at that point, and there never has been. Either of those commands should work, except diagnostics will not collect items from the daemon because the control protocol is not running yet. Clear log messages will be present for all enrollment failures.
Having the elastic-agent stuck in an infinite loop without a real way of getting out, other than enrolling successfully or being killed, kinda makes me nervous: at this point we are not even listening for changes in the configuration file or anything else.
Am I missing something, or can we only exit the delayed enroll loop when we enroll successfully (the context cannot really be cancelled at this point)?
Maybe we can give up after 10 attempts or so and rely on the service manager to restart the elastic-agent service if it's not deactivated?
It should not even be reading a configuration file; it should only have read the delayed enrollment information file at this point. In this mode it either enrolls or it doesn't do anything; there are no other modes or things the Elastic Agent should be doing.
The context can be cancelled when the service is stopped. Again, remember this is for sysprepped VM images that are being deployed. The Elastic Agent's goal in this mode is to enroll; that is the only goal it has in this state.
Why? If we operated in that mode, the service manager would just restart the Elastic Agent and it would do the same thing again.
lgtm
I wanted to see how this behaves, so I hard coded the enrollment attempt to fail. If I start the service, it appears as running when I check its status:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo systemctl status elastic-agent
● elastic-agent.service - Elastic Agent is a unified agent to observe, monitor and protect your system.
Loaded: loaded (/etc/systemd/system/elastic-agent.service; enabled; preset: enabled)
Active: active (running) since Tue 2024-05-21 15:08:13 EDT; 2min 30s ago
Main PID: 2871 (elastic-agent)
Tasks: 6 (limit: 1060)
Memory: 106.2M (peak: 110.6M)
CPU: 2min 8.859s
CGroup: /system.slice/elastic-agent.service
└─2871 /opt/Elastic/Agent/elastic-agent
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
May 21 15:10:44 admirable-pigfish elastic-agent[2871]: {"log.level":"error","@timestamp":"2024-05-21T15>
The elastic-agent logs command shows the repeated enrollment failures:

ubuntu@admirable-pigfish:~/elastic-agent-8.15.0-SNAPSHOT-linux-arm64$ sudo elastic-agent logs
{"log.level":"error","@timestamp":"2024-05-21T19:13:27.203Z","log.origin":{"file.name":"cmd/run.go","file.line":558},"message":"failed to perform delayed enrollment (will try again): fail to enroll","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

I can successfully uninstall the service while in this failure loop:
All of this is enough to tell that something is wrong.
I tried this on Windows as well with the same results, except for how the Windows Service Manager and the equivalent service commands present the status. I'm going to approve this given you can tell it's not working using the elastic-agent logs command. Keeping the service restarts on failure might have made this easier to detect for some users, but it is still detectable.
* Make delayed enrollment indefinite.
* Add changelog entry.

(cherry picked from commit c14df02)
What does this PR do?
Make delayed enrollment try indefinitely to enroll into Fleet.
Why is it important?
In some cases, when a new machine comes online with delayed enrollment, it doesn't have network access immediately, so enrollment needs to keep trying until it is able to successfully enroll into Fleet.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added an entry in ./changelog/fragments using the changelog tool

Disruptive User Impact
Doesn't disrupt the user; it provides more reliability for delayed enrollment.
How to test this PR locally
Install the Elastic Agent using --delay-enroll.
Related issues