Scheduled upgrade fails with error parsing version "" if agent restarts before the upgrade starts #3912

Open
4 of 6 tasks
AndersonQ opened this issue Dec 14, 2023 · 5 comments · May be fixed by #4441
Assignees
Labels
bug Something isn't working Team:Elastic-Agent Label for the Agent team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@AndersonQ
Member

AndersonQ commented Dec 14, 2023

  • Version: 8.12/main
  • Operating System: all
  • Discuss Forum URL: N/A
  • Steps to Reproduce:
    • schedule an upgrade in Fleet for 2 minutes from now
    • check that the agent reports the upgrade as scheduled
    • restart the agent
    • wait for the upgrade to start
    • it'll fail:
# elastic-agent status                                                                                                 

┌─ fleet
│  └─ status: (HEALTHY) Connected
├─ elastic-agent
│  └─ status: (HEALTHY) Running
└─ upgrade_details
   ├─ target_version:
   ├─ state: UPG_FAILED
   ├─ action_id: 9c06cd82-50e5-41bc-8a68-ed73b55c1912
   └─ metadata
      ├─ failed_state: UPG_DOWNLOADING
      └─ error_msg: could not download artifact: error parsing version "": version string does not match expected format

The issue seems to be in how the actions are loaded from disk. Using the patch patch.txt, we can see the test fail because the upgrade action's version isn't loaded.
If you stop the test and check the store file, it contains all the data, but the data isn't loaded correctly:

  • state.yml:
action_queue:
- action_id: test
  type: UPGRADE
  start_time: "2023-12-14T08:09:59Z"
  version: 1.2.3
  source_uri: https://example.com
  retry_attempt: 1
- action_id: abc123
  type: POLICY_CHANGE
  policy:
    hello: world

Bug root cause

TL;DR: we deserialise the action from one schema (fleetapi.FleetAction) and serialise it with a different schema (our concrete action types, such as fleetapi.ActionUpgrade). When reading from disk, we lose data, such as the version of upgrade actions.
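
To illustrate the mismatch described above, here is a minimal sketch using only the Go standard library. The types `storedUpgrade` and `wireAction` are hypothetical stand-ins, not the real `fleetapi` types, and JSON is used in place of the YAML state store purely to keep the example self-contained. It only shows the general failure mode: a record written with one shape and read with another decodes without error, yet the version comes back empty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storedUpgrade is a hypothetical stand-in for a concrete action type as it is
// written to the store: the version sits at the top level.
type storedUpgrade struct {
	ActionID string `json:"action_id"`
	Type     string `json:"type"`
	Version  string `json:"version"`
}

// wireAction is a hypothetical stand-in for the generic schema used when
// reading: action-specific fields are expected under "data".
type wireAction struct {
	ActionID string `json:"action_id"`
	Type     string `json:"type"`
	Data     struct {
		Version string `json:"version"`
	} `json:"data"`
}

// roundTrip serialises with one schema and deserialises with the other.
func roundTrip(in storedUpgrade) (wireAction, error) {
	raw, err := json.Marshal(in)
	if err != nil {
		return wireAction{}, err
	}
	var out wireAction
	// The top-level "version" key has no matching field in wireAction,
	// so encoding/json silently ignores it.
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	out, err := roundTrip(storedUpgrade{ActionID: "test", Type: "UPGRADE", Version: "1.2.3"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("action_id=%q version=%q\n", out.ActionID, out.Data.Version)
}
```

Because no error is returned, the loss only surfaces later, when the empty version string is parsed during the download step.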

Proposed fix

  • use only JSON to serialise/deserialise the actions
  • make all fleetapi.ActionTYPE models match the schema we receive from Fleet, and nest any action-specific properties under data
  • migrate the current YAML state store to a new state store using JSON
    • properly deserialise the old store so that no data is lost during the migration
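
The proposal above could look roughly like the following sketch. Everything here is an assumption for illustration: `upgradeAction`, `upgradeData`, `legacyUpgrade`, `migrate`, and `saveAndLoad` are hypothetical names, the placement of `retry_attempt` under `data` is a guess, and the legacy entry is represented as an already-decoded Go struct because the standard library has no YAML decoder. The point is only that a single model used for both writing and reading, with a field-by-field migration from the old flat shape, keeps the version intact.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// upgradeData holds the upgrade-specific properties nested under "data".
// Placing retry_attempt here is an assumption made for this sketch.
type upgradeData struct {
	Version      string `json:"version"`
	SourceURI    string `json:"source_uri,omitempty"`
	RetryAttempt int    `json:"retry_attempt,omitempty"`
}

// upgradeAction is a hypothetical model used for BOTH writing and reading.
type upgradeAction struct {
	ActionID  string      `json:"action_id"`
	Type      string      `json:"type"`
	StartTime string      `json:"start_time,omitempty"`
	Data      upgradeData `json:"data"`
}

// legacyUpgrade mirrors the flat entry from state.yml, already decoded.
type legacyUpgrade struct {
	ActionID     string
	Type         string
	StartTime    string
	Version      string
	SourceURI    string
	RetryAttempt int
}

// migrate copies every legacy field into the new nested shape.
func migrate(old legacyUpgrade) upgradeAction {
	return upgradeAction{
		ActionID:  old.ActionID,
		Type:      old.Type,
		StartTime: old.StartTime,
		Data: upgradeData{
			Version:      old.Version,
			SourceURI:    old.SourceURI,
			RetryAttempt: old.RetryAttempt,
		},
	}
}

// saveAndLoad round-trips an action through JSON using the same model.
func saveAndLoad(a upgradeAction) (upgradeAction, error) {
	raw, err := json.Marshal(a)
	if err != nil {
		return upgradeAction{}, err
	}
	var out upgradeAction
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	old := legacyUpgrade{
		ActionID:     "test",
		Type:         "UPGRADE",
		StartTime:    "2023-12-14T08:09:59Z",
		Version:      "1.2.3",
		SourceURI:    "https://example.com",
		RetryAttempt: 1,
	}
	loaded, err := saveAndLoad(migrate(old))
	if err != nil {
		panic(err)
	}
	fmt.Printf("version after reload: %q\n", loaded.Data.Version)
}
```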

Fix implementation phases (using a feature branch):

@AndersonQ AndersonQ added the bug Something isn't working label Dec 14, 2023
@AndersonQ AndersonQ changed the title Scheduled upgrade fails with error parsing version ""if agent restarts before the upgrade starts Scheduled upgrade fails with error parsing version "" if agent restarts before the upgrade starts Dec 14, 2023
@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Dec 14, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Dec 14, 2023

@pierrehilbert we should get this one fixed; it likely leads to failed upgrades or agents stuck in updating.

@pierrehilbert
Contributor

Agreed. I already added this to the current sprint but haven't assigned it yet.
We need to see what we can postpone to take care of it.

AndersonQ added a commit to AndersonQ/elastic-agent that referenced this issue Feb 15, 2024
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests, they will be fixes on a follow-up PR as part of elastic#3912
@AndersonQ AndersonQ mentioned this issue Feb 19, 2024
3 tasks
AndersonQ added a commit that referenced this issue Feb 23, 2024
* simplify fleetapi.Actions.UnmarshalJSON
* add test to ensure the state store is correctly loaded from disk
* skip state store migration tests, they will be fixes on a follow-up PR as part of #3912
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label May 5, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 17, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator ycombinator removed the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 21, 2024