Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: return error if agent init script fails to download valid binary #13280

Merged
merged 7 commits into from
May 30, 2024

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented May 15, 2024

Partially addresses #6711

This change runs the downloaded binary with a --version flag and checks if the command responds with text which matches Coder. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new --verify command, some agents would not have this yet by definition).

We considered using SHA1 hash comparisons but that was a heavier lift and gets us effectively to the same point.

@dannykopping dannykopping changed the title Throw an error if agent init script fails to download valid binary WIP: Throw an error if agent init script fails to download valid binary May 15, 2024
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping dannykopping changed the title WIP: Throw an error if agent init script fails to download valid binary fix: throw an error if agent init script fails to download valid binary May 16, 2024
@dannykopping dannykopping marked this pull request as ready for review May 16, 2024 08:23
provisionersdk/scripts/bootstrap_linux.sh Show resolved Hide resolved
provisionersdk/scripts/bootstrap_linux.sh Outdated Show resolved Hide resolved
provisionersdk/scripts/bootstrap_linux.sh Outdated Show resolved Hide resolved
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -31,4 +31,12 @@ fi

export CODER_AGENT_AUTH="${AUTH_TYPE}"
export CODER_AGENT_URL="${ACCESS_URL}"
exec ./$BINARY_NAME agent

output=$(./${BINARY_NAME} --version | head -n1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't recall now, but I worry if ${} can be interpreted as a terraform variable here. This is good practice but I think we should avoid it in the bootstrap scripts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good call-out, but BINARY_NAME is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376

IMHO I think we should use a different template replacement syntax than Bash's variable expansion, to make it very clear that these are replaced by a script and not accepted into the script as env vars.

provisionersdk/scripts/bootstrap_windows.ps1 Show resolved Hide resolved
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

@@ -86,4 +86,12 @@ fi

export CODER_AGENT_AUTH="${AUTH_TYPE}"
export CODER_AGENT_URL="${ACCESS_URL}"
exec ./$BINARY_NAME agent

output=$(./${BINARY_NAME} --version | head -n1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?

❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output
./hi: line 1: syntax error near unexpected token `newline'
./hi: line 1: `<html>'
typeset output=''

@github-actions github-actions bot added the stale This issue is like stale bread. label May 24, 2024
@mtojek mtojek removed the stale This issue is like stale bread. label May 24, 2024
@dannykopping
Copy link
Contributor Author

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus.

Alternatively @mtojek offered a hand and he might pick this up.

@mtojek
Copy link
Member

mtojek commented May 24, 2024

I tested the PR and have some observations:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
  2. UI does not indicate what is wrong (see below). Did I mess up something?
Screenshot 2024-05-24 at 14 04 53 Screenshot 2024-05-24 at 14 03 49

@dannykopping dannykopping changed the title fix: throw an error if agent init script fails to download valid binary fix: error out if agent init script fails to download a valid binary May 30, 2024
@@ -71,8 +71,9 @@ fi
gh_auth

# get branch name and pr number
branchName=$(gh pr view --json headRefName | jq -r .headRefName)
prNumber=$(gh pr view --json number | jq -r .number)
info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]')
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't support forks before, but you cannot specify --repo in pr view so I changed it to pr status.

@dannykopping
Copy link
Contributor Author

@mtojek thanks for taking a look! I'm going to look into this today and answer your questions.

@dannykopping
Copy link
Contributor Author

Created #13408 so I can deploy this in the preview environment.

@dannykopping
Copy link
Contributor Author

Testing Linux

Using the preview environment, I spun up a workspace with the kubernetes template.
I shelled into the coder pod and replaced coder-linux-amd64 with the following shell script:

coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64
#!/usr/bin/env bash
echo "I am not the agent you are loooking for"

The agent produced this:

+ curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder
+ break
+ chmod +x coder
+ [ -n  ]
+ export CODER_AGENT_AUTH=token
+ export CODER_AGENT_URL=https://pr13408.test.cdr.dev/
+ ./coder --version
+ head -n1
+ output=I am not the agent you are loooking for
+ echo I am not the agent you are loooking for
+ grep -q Coder
+ echo ERROR: Downloaded agent binary returned unexpected version output
ERROR: Downloaded agent binary returned unexpected version output
+ echo coder --version output: "I am not the agent you are loooking for"
coder --version output: "I am not the agent you are loooking for"
+ exit 2
+ waitonexit
+ echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
+ sleep 86400

image

@bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case.

I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly.

@johnstcn
Copy link
Member

Troubleshooting failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment.

There are a number of scenarios that can cause a bootstrapping a workspace to fail, including but not limited to:

  1. Init script fails because of bad syntax (developer error, or possibly a very strange execution environment)
  2. Init scripts fails due to missing dependencies (e.g. wget, curl)
  3. Init script fails to download agent (DNS resolution failure etc.)
  4. Init script fails to execute agent (missing libs, bad arch, etc.)

The above error you caused would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effort curl -XPOST back to the control plane to send some troubleshooting information. I would consider it outside of the scope of this PR though.

@dannykopping
Copy link
Contributor Author

That's a cool idea @johnstcn 👍
Agreed it's out of scope for this PR. It's definitely in scope for the attached issue this PR is trying to fix, though, so I think we can merge this regardless once the testing is complete and once we've agreed on the mechanism forward we can address that.

@dannykopping
Copy link
Contributor Author

I've tried my best to set up the preview environment to test on Windows (see this thread), but it's quickly turning into more trouble than it's worth IMHO.

I'm going to merge this PR and test in dogfood.
If there are any problems there I'll revert.

@dannykopping dannykopping changed the title fix: error out if agent init script fails to download a valid binary fix: return error if agent init script fails to download valid binary May 30, 2024
@dannykopping
Copy link
Contributor Author

Not sure why https://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the title does match the regex...

@dannykopping dannykopping merged commit 59ab505 into coder:main May 30, 2024
64 of 66 checks passed
@dannykopping dannykopping deleted the dk/verify-agent branch May 30, 2024 11:33
@github-actions github-actions bot locked and limited conversation to collaborators May 30, 2024
@dannykopping
Copy link
Contributor Author

I tested the PR and have some observations:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
  2. UI does not indicate what is wrong (see below). Did I mess up something?

@mtojek to answer your questions:

  1. To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?

Nope, I tested with the same template and the changes were present.

  1. UI does not indicate what is wrong (see below). Did I mess up something?

See #13280 (comment).

@dannykopping
Copy link
Contributor Author

Testing Darwin

Pretty much the same as Linux...

Testing Windows

Setting the coder-windows-amd64.exe file to a simple helloworld.exe file leads to this outcome:

image

We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above).

I used the following command to retrieve those logs:

$ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1

All looks fine to me 👍
This didn't require any changes from the workspace owner or the admin.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants