fix: return error if agent init script fails to download valid binary #13280

dannykopping · 2024-05-15T15:24:20Z

Partially addresses #6711

This change runs the downloaded binary with the --version flag and checks whether the output contains the text Coder. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new --verify command, older agent binaries would by definition not have it yet).

We considered using SHA1 hash comparisons, but that would be a heavier lift and would get us to effectively the same point.
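
As a rough illustration of the check described above, here is a minimal sketch rather than the exact script in this PR. The function name verify_agent_binary is hypothetical; the messages and the exit code 2 mirror the Linux test trace further down in this conversation.

```shell
#!/usr/bin/env sh
# Sketch only: verify_agent_binary is a hypothetical helper name, not part of
# the real bootstrap scripts.
verify_agent_binary() {
    binary="$1"

    # Only the first line of the version output is inspected.
    output=$("$binary" --version | head -n1)

    # Loose check: any first line containing "Coder" is accepted.
    if ! echo "$output" | grep -q Coder; then
        echo "ERROR: Downloaded agent binary returned unexpected version output"
        echo "$binary --version output: \"$output\""
        return 2
    fi
}
```

A bootstrap script could call `verify_agent_binary ./coder || exit $?` just before `exec ./coder agent`, so that an HTML error page or other stray download stops the script instead of being executed as the agent.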

Signed-off-by: Danny Kopping <danny@coder.com>

provisionersdk/scripts/bootstrap_windows.ps1

scripts/check_site_icons.sh

provisionersdk/scripts/bootstrap_linux.sh

mafredri · 2024-05-16T09:29:40Z

provisionersdk/scripts/bootstrap_darwin.sh

@@ -31,4 +31,12 @@ fi

 export CODER_AGENT_AUTH="${AUTH_TYPE}"
 export CODER_AGENT_URL="${ACCESS_URL}"
-exec ./$BINARY_NAME agent
+
+output=$(./${BINARY_NAME} --version | head -n1)


I can't recall now, but I worry that ${} could be interpreted as a Terraform variable here. Braced expansion is good practice, but I think we should avoid it in the bootstrap scripts.

It's a good call-out, but BINARY_NAME is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376

IMHO we should use a template replacement syntax that differs from Bash's variable expansion, to make it very clear that these values are substituted into the script ahead of time and not passed into the script as env vars.

provisionersdk/scripts/bootstrap_windows.ps1

mafredri

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

mafredri · 2024-05-16T12:12:51Z

provisionersdk/scripts/bootstrap_linux.sh

@@ -86,4 +86,12 @@ fi

 export CODER_AGENT_AUTH="${AUTH_TYPE}"
 export CODER_AGENT_URL="${ACCESS_URL}"
-exec ./$BINARY_NAME agent
+
+output=$(./${BINARY_NAME} --version | head -n1)


I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?

❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output
./hi: line 1: syntax error near unexpected token `newline'
./hi: line 1: `<html>'
typeset output=''
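
One way to act on this suggestion is to merge stderr into the captured output, so that a binary which fails to run still leaves something to print. This is a sketch of the idea, not necessarily the change that was merged; capture_version_output is a hypothetical name.

```shell
#!/usr/bin/env sh
# Sketch only: capture_version_output is a hypothetical helper name.
capture_version_output() {
    # 2>&1 sends the command's stderr into the pipe, so error messages
    # written while trying to run the file are captured as well.
    "$1" --version 2>&1 | head -n1
}
```

With the `<html>` file from the example above, `output=$(capture_version_output ./hi)` should then hold the shell's syntax error message instead of an empty string, which gives the bootstrap script something meaningful to report.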

dannykopping · 2024-05-24T09:13:27Z

I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.

Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus.

Alternatively @mtojek offered a hand and he might pick this up.

mtojek · 2024-05-24T12:10:51Z

I tested the PR and have some observations:

To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?
UI does not indicate what is wrong (see below). Did I mess up something?

dannykopping · 2024-05-30T06:30:07Z

scripts/deploy-pr.sh

@@ -71,8 +71,9 @@ fi
 gh_auth

 # get branch name and pr number
-branchName=$(gh pr view --json headRefName | jq -r .headRefName)
-prNumber=$(gh pr view --json number | jq -r .number)
+info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]')


This didn't support forks before, but you cannot specify --repo in pr view so I changed it to pr status.

dannykopping · 2024-05-30T06:31:08Z

@mtojek thanks for taking a look! I'm going to look into this today and answer your questions.

dannykopping · 2024-05-30T08:10:42Z

Created #13408 so I can deploy this in the preview environment.

dannykopping · 2024-05-30T09:28:05Z

Testing Linux

Using the preview environment, I spun up a workspace with the kubernetes template.
I shelled into the coder pod and replaced coder-linux-amd64 with the following shell script:

coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64
#!/usr/bin/env bash
echo "I am not the agent you are loooking for"

The agent produced this:

+ curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder
+ break
+ chmod +x coder
+ [ -n  ]
+ export CODER_AGENT_AUTH=token
+ export CODER_AGENT_URL=https://pr13408.test.cdr.dev/
+ ./coder --version
+ head -n1
+ output=I am not the agent you are loooking for
+ echo I am not the agent you are loooking for
+ grep -q Coder
+ echo ERROR: Downloaded agent binary returned unexpected version output
ERROR: Downloaded agent binary returned unexpected version output
+ echo coder --version output: "I am not the agent you are loooking for"
coder --version output: "I am not the agent you are loooking for"
+ exit 2
+ waitonexit
+ echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
+ sleep 86400

@bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case.

I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly.

johnstcn · 2024-05-30T10:16:04Z

Failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment.

There are a number of scenarios that can cause bootstrapping a workspace to fail, including but not limited to:

Init script fails because of bad syntax (developer error, or possibly a very strange execution environment)
Init script fails due to missing dependencies (e.g. wget, curl)
Init script fails to download agent (DNS resolution failure etc.)
Init script fails to execute agent (missing libs, bad arch, etc.)

The above error you caused would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effort curl -XPOST back to the control plane to send some troubleshooting information. I would consider it outside of the scope of this PR though.
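
The best-effort report mentioned above might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the function name report_bootstrap_failure and the /hypothetical/bootstrap-failure path are invented, and no such endpoint is known to exist on the control plane.

```shell
#!/usr/bin/env sh
# Sketch only. The endpoint path below is hypothetical and is NOT a real
# Coder API route; the function name is also invented for illustration.
report_bootstrap_failure() {
    url="$1"      # control plane base URL, e.g. the value of CODER_AGENT_URL
    message="$2"  # troubleshooting text to send

    # Best effort: a short timeout, all output discarded, and "|| true" so
    # that a failed report never masks the original bootstrap error.
    curl -fsS --max-time 10 -X POST \
        -H "Content-Type: text/plain" \
        --data-binary "$message" \
        "${url%/}/hypothetical/bootstrap-failure" >/dev/null 2>&1 || true
}
```

The `${url%/}` expansion strips a trailing slash, since the access URL in the test trace above ends with one. Because the call always returns success, the script can still exit with its original non-zero code afterwards.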

dannykopping · 2024-05-30T10:20:48Z

That's a cool idea @johnstcn 👍
Agreed, it's out of scope for this PR. It's definitely in scope for the issue this PR partially addresses, though, so I think we can merge this once the testing is complete and address error reporting once we've agreed on the mechanism going forward.

dannykopping · 2024-05-30T10:55:17Z

I've tried my best to set up the preview environment to test on Windows (see this thread), but it's quickly turning into more trouble than it's worth IMHO.

I'm going to merge this PR and test in dogfood.
If there are any problems there I'll revert.

dannykopping · 2024-05-30T10:59:17Z

Not sure why https://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the title does match the regex...

dannykopping · 2024-05-30T13:48:52Z

I tested the PR and have some observations:

To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?

UI does not indicate what is wrong (see below). Did I mess up something?

@mtojek to answer your questions:

To refresh the init script I had to re-push the template version (Docker template). Is this inevitable?

Nope, I tested with the same template and the changes were present.

UI does not indicate what is wrong (see below). Did I mess up something?

See #13280 (comment).

dannykopping · 2024-05-30T15:04:09Z

Testing Darwin

Pretty much the same as Linux...

Testing Windows

Setting the coder-windows-amd64.exe file to a simple helloworld.exe file leads to this outcome:

We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above).

I used the following command to retrieve those logs:

$ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1

All looks fine to me 👍
This didn't require any changes from the workspace owner or the admin.

github-actions bot assigned dannykopping May 15, 2024

dannykopping force-pushed the dk/verify-agent branch from 8eb3abe to 60d4e14 Compare May 15, 2024 15:40

dannykopping changed the title ~~Throw an error if agent init script fails to download valid binary~~ WIP: Throw an error if agent init script fails to download valid binary May 15, 2024

dannykopping added 5 commits May 16, 2024 09:55

Update bootstrap scripts to check for executable correctness

6705c9e

Signed-off-by: Danny Kopping <danny@coder.com>

Add comment to more easily find string replacements

d11c2d3

Signed-off-by: Danny Kopping <danny@coder.com>

Appease shellcheck

b63b479

Signed-off-by: Danny Kopping <danny@coder.com>

Make lint script more portable

5438b65

Signed-off-by: Danny Kopping <danny@coder.com>

Add tests

8f08e00

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/verify-agent branch from 60d4e14 to 8f08e00 Compare May 16, 2024 08:00

dannykopping changed the title ~~WIP: Throw an error if agent init script fails to download valid binary~~ fix: throw an error if agent init script fails to download valid binary May 16, 2024

dannykopping commented May 16, 2024

provisionersdk/scripts/bootstrap_windows.ps1

scripts/check_site_icons.sh

dannykopping marked this pull request as ready for review May 16, 2024 08:23

dannykopping requested review from mafredri, johnstcn and kylecarbs May 16, 2024 08:24

johnstcn reviewed May 16, 2024

provisionersdk/scripts/bootstrap_linux.sh

provisionersdk/scripts/bootstrap_linux.sh Outdated

provisionersdk/scripts/bootstrap_linux.sh Outdated

Use more expressive error, double-quote output

70e3091

Signed-off-by: Danny Kopping <danny@coder.com>

mafredri reviewed May 16, 2024

dannykopping requested a review from johnstcn May 16, 2024 09:44

johnstcn approved these changes May 16, 2024

mafredri approved these changes May 16, 2024

github-actions bot added the stale This issue is like stale bread. label May 24, 2024

mtojek removed the stale This issue is like stale bread. label May 24, 2024

dannykopping changed the title ~~fix: throw an error if agent init script fails to download valid binary~~ fix: error out if agent init script fails to download a valid binary May 30, 2024

Merge branch 'main' of github.com:/coder/coder into dk/verify-agent

3e121e0

dannykopping commented May 30, 2024

dannykopping force-pushed the dk/verify-agent branch from 36ad275 to 7f4de67 Compare May 30, 2024 06:34

dannykopping mentioned this pull request May 30, 2024

chore: modify preview deployment script to work with forks #13404

Closed

dannykopping force-pushed the dk/verify-agent branch from 7f4de67 to 3e121e0 Compare May 30, 2024 08:03

dannykopping mentioned this pull request May 30, 2024

In-repo clone of #13280 #13408

Draft

dannykopping changed the title ~~fix: error out if agent init script fails to download a valid binary~~ fix: return error if agent init script fails to download valid binary May 30, 2024

dannykopping merged commit 59ab505 into coder:main May 30, 2024
64 of 66 checks passed

dannykopping deleted the dk/verify-agent branch May 30, 2024 11:33

github-actions bot locked and limited conversation to collaborators May 30, 2024
