-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: return error if agent init script fails to download valid binary #13280
Conversation
8eb3abe
to
60d4e14
Compare
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
60d4e14
to
8f08e00
Compare
Signed-off-by: Danny Kopping <danny@coder.com>
@@ -31,4 +31,12 @@ fi | |||
|
|||
export CODER_AGENT_AUTH="${AUTH_TYPE}" | |||
export CODER_AGENT_URL="${ACCESS_URL}" | |||
exec ./$BINARY_NAME agent | |||
|
|||
output=$(./${BINARY_NAME} --version | head -n1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't recall now, but I worry if ${}
can be interpreted as a terraform variable here. This is good practice but I think we should avoid it in the bootstrap scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good call-out, but BINARY_NAME
is not replaced by the provider, it seems:
https://github.com/coder/terraform-provider-coder/blob/7815596401d6e69aebb4ceefe1e84369cb63c4ac/provider/agent.go#L345-L376
IMHO I think we should use a different template replacement syntax than Bash's variable expansion, to make it very clear that these are replaced by a script and not accepted into the script as env vars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see that some manual e2e tests are performed against example templates, at least Docker and Windows just so we're sure we don't break anything. I'm worried that since these are rarely touched there's a high possibility of breaking fringe use-cases.
@@ -86,4 +86,12 @@ fi | |||
|
|||
export CODER_AGENT_AUTH="${AUTH_TYPE}" | |||
export CODER_AGENT_URL="${ACCESS_URL}" | |||
exec ./$BINARY_NAME agent | |||
|
|||
output=$(./${BINARY_NAME} --version | head -n1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see stderr redirected here so we can give a sensible output. Thoughts?
❯ output=$(echo '<html>' >hi; chmod +x hi; ./hi --version); declare -p output
./hi: line 1: syntax error near unexpected token `newline'
./hi: line 1: `<html>'
typeset output=''
Absolutely agreed; I will get around to testing this as soon as I have a couple hours to focus. Alternatively @mtojek offered a hand and he might pick this up. |
scripts/deploy-pr.sh
Outdated
@@ -71,8 +71,9 @@ fi | |||
gh_auth | |||
|
|||
# get branch name and pr number | |||
branchName=$(gh pr view --json headRefName | jq -r .headRefName) | |||
prNumber=$(gh pr view --json number | jq -r .number) | |||
info=$(gh pr status --repo=coder/coder --json headRefName,number --jq '.createdBy[0]') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't support forks before, but you cannot specify --repo
in pr view
so I changed it to pr status
.
@mtojek thanks for taking a look! I'm going to look into this today and answer your questions. |
36ad275
to
7f4de67
Compare
7f4de67
to
3e121e0
Compare
Created #13408 so I can deploy this in the preview environment. |
Testing LinuxUsing the preview environment, I spun up a workspace with the kubernetes template. coder-54c9c56d67-7ddbc:~/.cache/coder/site/bin$ cat coder-linux-amd64
#!/usr/bin/env bash
echo "I am not the agent you are loooking for" The agent produced this: + curl -fsSL --compressed https://pr13408.test.cdr.dev/bin/coder-linux-amd64 -o coder
+ break
+ chmod +x coder
+ [ -n ]
+ export CODER_AGENT_AUTH=token
+ export CODER_AGENT_URL=https://pr13408.test.cdr.dev/
+ ./coder --version
+ head -n1
+ output=I am not the agent you are loooking for
+ echo I am not the agent you are loooking for
+ grep -q Coder
+ echo ERROR: Downloaded agent binary returned unexpected version output
ERROR: Downloaded agent binary returned unexpected version output
+ echo coder --version output: "I am not the agent you are loooking for"
coder --version output: "I am not the agent you are loooking for"
+ exit 2
+ waitonexit
+ echo === Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
=== Agent script exited with non-zero code (2). Sleeping 24h to preserve logs...
+ sleep 86400 @bpmct what do you think we should do here? I don't think we currently stream the agent logs to the workspace detail page, only the provisioner logs, so I'm not sure how we'll display a specific error in this case. I will continue testing on both Mac (Darwin) and Windows to ensure the changes to the scripts work correctly. |
Troubleshooting failed workspace agent bootstrapping has historically been one of the more difficult issues to troubleshoot, and tends to require manually inspecting the execution environment. There are a number of scenarios that can cause a bootstrapping a workspace to fail, including but not limited to:
The above error you caused would fall under the last category. At this point, we should have reasonable confidence that we can connect to the control plane, and we have all the dependencies needed to download the agent binary. If executing the binary fails, we could potentially do a best-effort |
That's a cool idea @johnstcn 👍 |
I've tried my best to set up the preview environment to test on Windows (see this thread), but it's quickly turning into more trouble than it's worth IMHO. I'm going to merge this PR and test in dogfood. |
Not sure why https://github.com/coder/coder/pull/13280/checks?check_run_id=25592484459 is failing, because the title does match the regex... |
@mtojek to answer your questions:
Nope, I tested with the same template and the changes were present.
See #13280 (comment). |
Testing DarwinPretty much the same as Linux... Testing WindowsSetting the We deploy our Windows VMs in GCP, and the init script output goes to one of the serial ports which GCP keeps the logs of (the screenshot above). I used the following command to retrieve those logs: $ gcloud compute --project=<project> instances get-serial-port-output coder-danny-windows-rdp --zone=europe-west4-b --port=1 All looks fine to me 👍 |
Partially addresses #6711
This change runs the downloaded binary with a
--version
flag and checks if the command responds with text which matchesCoder
. This is not the strictest of checks, but it's the most pragmatic in terms of backwards-compatibility (i.e. if we added a new--verify
command, some agents would not have this yet by definition).We considered using SHA1 hash comparisons but that was a heavier lift and gets us effectively to the same point.