Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unprivileged and privileged subcommand to Elastic Agent #4621

Open
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Apr 25, 2024

What does this PR do?

Adds a new unprivileged and privileged subcommand that converts an installed Elastic Agent from privileged to unprivileged or back again.

Why is it important?

To allow users to change the permission execution mode without having to re-install the Elastic Agent.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works (will have integration tests)
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

  • can convert from privileged to unprivileged
  • can convert from unprivileged to privileged

How to test this PR locally

$ sudo elastic-agent unprivileged
$ sudo elastic-agent privileged

Related issues

@blakerouse blakerouse added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-skip labels Apr 25, 2024
@blakerouse blakerouse self-assigned this Apr 25, 2024
@blakerouse blakerouse changed the title Work on privileged/unprivileged command. Add unprivileged and privileged subcommand to Elastic Agent Apr 25, 2024
@blakerouse blakerouse marked this pull request as ready for review May 9, 2024 14:50
@blakerouse blakerouse requested a review from a team as a code owner May 9, 2024 14:50
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse
Copy link
Contributor Author

SonarQube you are correct about unit tests, but you are very wrong about integration tests...

This has integration tests for both privileged and unprivileged going in both directions.

@ycombinator ycombinator removed the request for review from andrzej-stencel May 9, 2024 21:50
Comment on lines 69 to 70
// cannot switch to unprivileged when Elastic Defend exists in the policy
err = ensureNoElasticDefend()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally osquery is running as privileged. Quite a lot of functionality will be unavailable if running unprivileged or returning different results.

Osqueryd also checks for "safe permissions" when it loads our custom extension.

 When osquery launches, each path is evaluated for "safe permissions" 
(extension executable files must be owned by root or Administrator) 
and executed as a monitored child process

https://osquery.readthedocs.io/en/stable/deployment/extensions/#autoloading-extensions

If you observe a runtime error from osquery, Extension binary has unsafe permissions, 
you have to lock down the filesystem permissions on the extension executable. 
See the steps in "Extensions Binary Permissions," above.

https://osquery.readthedocs.io/en/stable/deployment/extensions/#troubleshooting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For osquery, you should mark the integration as requiring root privileges like defend does in https://github.com/elastic/endpoint-package/blob/ce921805467ab55953adc3721e7db8581915bd35/package/endpoint/manifest.yml#L23-L25. You can also do this for individual data streams but I don't think that applies to OSquery.

You could also add a prevention to the spec file to prevent us from even starting osquery if you aren't root, but since it doesn't fundamentally not work when not root, this might not be necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do update the spec, you would want to update the agentbeat spec.


By default this command will ask for a confirmation before making this change. You can bypass the confirmation request
using the -f flag. This will always stop the running Elastic Agent (if running) before performing the switch. It is
possible that loss of metrics could occur during this small window of time. The command also always starts the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible that loss of metrics could occur during this small window of time

Loss of metrics or data in general, if this is intended to be a warning that this is not a zero downtime operation you could just say that directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For both commands you may want to caution about what happens if their execution is interrupted.

return fmt.Errorf("failed to create component model from policy: %w", err)
}
for _, comp := range comps {
if comp.InputSpec != nil && comp.InputSpec.InputType == endpoint {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to actually evaluate the runtime preventions and see if something has the root one?

That would be a bit more future proof. The constrain isn't really "is defend running" it's "is there something that requires root to function" running, which today is only defend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to determine if that is the behavior that we actually want. An input specification that requires root, when switched to unprivileged mode will still allow the Elastic Agent to run and the input will just report as failed. This would then be viewable to the user in status as well as in Fleet and would allow them to adjust/remove the integration.

In the case of Endpoint it is more detrimental to switch the mode with it enabled. Endpoint being a separate piece of software that is installed, once switched it cannot be uninstalled. That leaves the Elastic Agent in a completely broken state, unless switched back to privileged and then remove the integration.

I believe that we have 4 options for handling this:

  1. Prevent switch if any input has a runtime protection for requiring root.
  2. Prevent switch if any service input has a runtime protection for requiring root.
  3. Add new option to input specification that says it doesn't support unprivileged mode directly. This is a little different then a runtime protection, being that it could be used in switch to prevent the switch. Endpoint specification can set this specifically and then future inputs can as well.
  4. Keep it how it is at the moment and hard-code it to endpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a bit more future proof. The constrain isn't really "is defend running" it's "is there something that requires root to function" running, which today is only defend.

Osquery as well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Add new option to input specification that says it doesn't support unprivileged mode directly.

Let's avoid building something new for information we already have, so this option is out.

This would then be viewable to the user in status as well as in Fleet and would allow them to adjust/remove the integration.

Agree with this, along the lines of not building anything new, we can rely on this to inform the user after the fact. When this is exposed from Fleet, Fleet will have information about the installed packages and will likely warn or prevent the user in the same way. Most users are going to do this en-mass at won't be reading warning text anyway.

Prevent switch if any service input has a runtime protection for requiring root.
Keep it how it is at the moment and hard-code it to endpoint.

Given the distinction that a service input may be totally broken if we do this, it makes sense to only add additional functionality for service components to deal with this distinction.

If it isn't a lot of effort to check the spec file so we don't have to hard code this case, then doing this in a general way is preferable.

Otherwise given endpoint is the only service component we expect to have for the foreseeable future, we can leave this as is and add a comment explaining why it specifically is special cased.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a bit more future proof. The constrain isn't really "is defend running" it's "is there something that requires root to function" running, which today is only defend.

Osquery as well

@aleksmaus Root is required to function or it doesn't work well without root? Trying to understand the difference here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to allow the switch with Endpoint enabled, but Elastic Agent uninstall Endpoint just like the uninstall command does. It can perform uninstall and then switch to non-root. This provides more of a way of not blocking the user from switching, we could even prompt if they want to remove it an switch. [y/N] could be the default with being a force -f switch where it always uninstalls it.

I like this option a little more because it doesn't prevent the switch just because a service component that requires root is enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksmaus Root is required to function or it doesn't work well without root? Trying to understand the difference here.

We could run osqueryd binary standalone without root permissions and some of the functionality will not be available and some of the queries will return different results (with no errors).

So there is no indication from the user point of view that something is wrong with permissions, the data just would be missing, which creates "blind spots" from security point of view without user being aware.

The file system searches would be missing some valid results because of the lack of the permissions, and osquery doesn't return the error in these cases, it just returns a smaller resultset.

Here are just a few examples of running osquery with sudo and without on the same query:

Screenshot 2024-05-20 at 9 52 11 AM Screenshot 2024-05-20 at 10 02 48 AM Screenshot 2024-05-20 at 10 04 02 AM Screenshot 2024-05-20 at 10 13 00 AM

Also the file permissions for our extensions should be set to root or Administrator:
https://osquery.readthedocs.io/en/stable/deployment/extensions/#autoloading-extensions

(extension executable files must be owned by root or Administrator)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksmaus I think it's best to move this to its own issue and discussion on osquerybeat as non-root. Its could be possible to allow it to run or not run. I believe that discussion is better to be had in a specific issue and then come to a conclusion. We can adjust in a follow up PR based on that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to allow the switch with Endpoint enabled, but Elastic Agent uninstall Endpoint just like the uninstall command does. It can perform uninstall and then switch to non-root. This provides more of a way of not blocking the user from switching, we could even prompt if they want to remove it an switch. [y/N] could be the default with being a force -f switch where it always uninstalls it.

You would have to include and test that tamper protection still applies on this path which will be extra testing we have to keep around. I think it is much simpler if we force users to put unprivileged agents in policies that do not include Defend. There are no edge cases around Defend being conditionally uninstalled or removed that way. The composable agent policies work in the UI will help users deal with any shared content between the privileged and privileged policies.

So I'd say we should continue to refuse the conversion if defend is in the policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I was just worrying that it prevents issues with different personas in an organization. A persona that manages the policies and a persona that manages installation of the endpoints. That is probably rare and not something we need to worry about.

Copy link
Contributor

mergify bot commented May 20, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b to-unprivileged upstream/to-unprivileged
git merge upstream/main
git push upstream to-unprivileged

internal/pkg/agent/cmd/install.go Outdated Show resolved Hide resolved
internal/pkg/agent/cmd/privileged.go Outdated Show resolved Hide resolved
internal/pkg/agent/cmd/privileged.go Outdated Show resolved Hide resolved
pkg/control/v2/client/wait/agent.go Show resolved Hide resolved
pkg/control/v2/client/wait/common.go Outdated Show resolved Hide resolved
testing/integration/switch_privileged_test.go Show resolved Hide resolved
return nil
}

func ensureNoServiceComponentIssues() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this has any test coverage. You could add a check in the endpoint test suite that the unprivileged command fails, or add a unit test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True I will add an integration test, as I would prefer to see a true scenario be tested.

@blakerouse
Copy link
Contributor Author

I have validated manually that this works on macOS, because we do not have integration testing coverage on macOS. I was able to switch between the modes successfully and the vaults are handled properly.

Copy link

Quality Gate failed Quality Gate failed

Failed conditions
1.5% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

return cmd
}

func unprivilegedCmd(streams *cli.IOStreams, cmd *cobra.Command) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems like there's quite some repetition meaning this and privilegedCmd function

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants