
aws_ecr_assets: produces invalid tasks by linking to empty "attestation" image layer #30258

Open
anentropic opened this issue May 17, 2024 · 6 comments
Labels
@aws-cdk/aws-ecr-assets Related to AWS CDK Docker Image Assets bug This issue is a bug. p2

Comments


anentropic commented May 17, 2024

Describe the bug

I started getting the following error when trying to run my Fargate tasks:

"StoppedReason": "CannotPullContainerError: ref pull has been retried 1 time(s): failed to unpack image on snapshotter overlayfs: failed to extract layer sha256:c02342326b04a05fa0fc4c703c4aaa8ffb46cc0f2eda269c4a0dee53c8296182: failed to get stream processor for application/vnd.in-toto+json: no processor for media-type: unknown"

If I go into the AWS web UI for the task definition, I can find the ID of the ECR image that it points to.

Then if I look at that ECR image, I can see it has a size of 0:

[Screenshot 2024-05-17 at 17 12 22: ECR console showing the image with 0 size]

I can see in my ECR images list that since 10 May every CDK deployment has pushed a zero size image to ECR instead of the expected one:
[Screenshot 2024-05-17 at 17 14 48: ECR images list with zero-size images pushed since 10 May]
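As a rough way to spot these from a script rather than the console, something like the following boto3 sketch could be used (illustrative only; the repository name below is the default CDK bootstrap container-assets repo pattern with account/region placeholders):

# Sketch: list images in the CDK assets ECR repository and flag any whose
# reported size is effectively zero (the "0 MB" entries seen in the console).
import boto3

ecr = boto3.client("ecr")
paginator = ecr.get_paginator("describe_images")
for page in paginator.paginate(repositoryName="cdk-hnb659fds-container-assets-ACCOUNT-REGION"):
    for image in page["imageDetails"]:
        if image["imageSizeInBytes"] < 10_000:  # effectively zero-size
            print(image.get("imageTags"), image["imagePushedAt"], image["imageSizeInBytes"])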

I have the following CDK code:

        django_command_img = ecr_assets.DockerImageAsset(
            self,
            "Django Command Image",
            directory="./",
            target="fargate-task",
            build_args={
                "python_version": global_task_config.python_version,
            },
            platform=ecr_assets.Platform.LINUX_ARM64
            if is_arm64
            else ecr_assets.Platform.LINUX_AMD64,
        )

        _task = ecs.FargateTaskDefinition(
            self,
            "Django Command Task",
            cpu=config.cpu,
            memory_limit_mib=config.memory_size,
            runtime_platform=ecs.RuntimePlatform(
                cpu_architecture=ecs.CpuArchitecture.ARM64
                if is_arm64
                else ecs.CpuArchitecture.X86_64,
                operating_system_family=ecs.OperatingSystemFamily.LINUX,
            ),
            family=f"{resource_prefix}-task-django-command",
        )
        container_name = f"{resource_prefix}-container-django-command"
        _task.add_container(
            "Django Command Container",
            image=ecs.ContainerImage.from_docker_image_asset(django_command_img),
            container_name=container_name,
            environment={
                "ETL": "true",
                **common_django_env,
            },
            # health_check=ecs.HealthCheck(),
            logging=ecs.LogDrivers.aws_logs(
                stream_prefix="containers",
                log_group=log_group,
            ),
        )

(Before 10 May I had deployed and run Fargate tasks successfully from this definition.)

Expected Behavior

A usable ECS task definition is deployed

Current Behavior

An inscrutable error message.

It appears that CDK has created the task definition against an invalid ECR image.

Reproduction Steps

See above

Additional Information/Context

I have located what seems to be the cause, with help from this issue thread: moby/moby#45600

Using aws ecr batch-get-image I can see the following manifest for my problematic zero-size image:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:1b22b41bc8846dc11d2718b173aa7481e06fe4e3893b410032e02f858da7d165",
    "size": 167
  },
  "layers": [
    {
      "mediaType": "application/vnd.in-toto+json",
      "digest": "sha256:875402577bcf06a0681056a91feeb0ce68f41fa30ad735ae802e555f1519351d",
      "size": 1464,
      "annotations": {
        "in-toto.io/predicate-type": "https://slsa.dev/provenance/v0.2"
      }
    }
  ]
}

This seems to relate to the error message and fit with the details in the moby issue linked above.

Basically, when cdk deploy builds the image locally (via docker buildx), extra "attestation" entries get added to the root manifest list.

I guess by themselves these aren't harmful (they are part of the OCI standard), but CDK may not be expecting them and ends up pushing and tagging the wrong thing into ECR.
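To double-check what a given tag actually points at, a small boto3 sketch along these lines can be used (repository name and tag are placeholders; it mirrors the aws ecr batch-get-image call above):

# Sketch: fetch the manifest behind a tag and check whether it only contains
# in-toto attestation layers rather than runnable image layers.
import json

import boto3

ecr = boto3.client("ecr")
resp = ecr.batch_get_image(
    repositoryName="my-cdk-assets-repo",  # placeholder
    imageIds=[{"imageTag": "my-asset-tag"}],  # placeholder
)
manifest = json.loads(resp["images"][0]["imageManifest"])
layers = manifest.get("layers", [])
if layers and all(l["mediaType"] == "application/vnd.in-toto+json" for l in layers):
    print("This tag points at an attestation manifest, not a runnable image")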

Possible Solution

BUILDX_NO_DEFAULT_ATTESTATIONS=1 cdk deploy worked for me (after adding an arbitrary change to my Dockerfile to force a rebuild)

I think it would be better if CDK explicitly added --provenance=false to its calls to docker buildx.

See https://docs.docker.com/reference/cli/docker/buildx/build/#provenance and https://docs.docker.com/build/attestations/attestation-storage/
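In the meantime, for anyone driving cdk deploy from a script, a minimal workaround sketch (the wrapper itself is hypothetical; only the env var is what I actually tested):

# Workaround sketch: run `cdk deploy` with buildx default attestations disabled,
# until/unless CDK passes --provenance=false itself.
import os
import subprocess

env = {**os.environ, "BUILDX_NO_DEFAULT_ATTESTATIONS": "1"}
subprocess.run(["cdk", "deploy", "--all"], env=env, check=True)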

CDK CLI Version

2.142.0 (build 289a1e3)

Framework Version

No response

Node.js Version

v18.18.0

OS

macOS 14.4.1

Language

Python

Language Version

3.11.5

Other information

No response

@anentropic anentropic added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 17, 2024
@github-actions github-actions bot added the @aws-cdk/aws-ecr-assets Related to AWS CDK Docker Image Assets label May 17, 2024

anentropic commented May 17, 2024

What is a bit strange is that it would make the most sense if this resulted from a recent change in, say, Docker Desktop.

But AFAICT attestations have been added by default for much longer than the last week (since maybe Jan 2023: https://www.docker.com/blog/highlights-buildkit-v0-11-release/#1-slsa-provenance).

So maybe I just got lucky the first couple of times I deployed my Fargate task.


anentropic commented May 17, 2024

I think this issue is also affecting Lambda functions that use the Docker image runtime

I had a bunch of deployment issues the last couple of days where I would get an error like:

18:14:44 | UPDATE_FAILED | AWS::Lambda::Function | LambdasFromDockerImageCommand996BC9A4
Resource handler returned message: "Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize." (RequestToken: 654bcf62-b81e-bf4c-1eff-408af63620cc, HandlerErrorCode: NotStabilized)

At first I thought there was something wrong in the meaningful part of the Dockerfile they were built from; I made a change, redeployed, and the problem seemed to go away.

But then later when deploying from another branch I made the same 'fix' and it didn't work.

Subsequently I made a connection between my ECS/Fargate issue above and the fact that only my 'Docker image' Lambda functions seemed to have deployment problems, while the 'Python runtime' one was doing OK.

I then tried BUILDX_NO_DEFAULT_ATTESTATIONS=1 cdk deploy and it failed.

But that was because I hadn't changed the Dockerfile, so no new image was pushed (?)

Then I added a RUN echo "hello" to the Dockerfile and tried again, and this time they deployed OK.

To be clear: I think that without the BUILDX_NO_DEFAULT_ATTESTATIONS=1 flag it is random whether we end up with a usable ECR image or not.

If attestations are important, then I think something in CDK needs to be aware of them to avoid this issue (pushing and tagging the wrong part of the OCI manifest as the image to use). Otherwise, just force buildx not to create them in the first place.
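For illustration, "being aware of them" could look roughly like this when reading the OCI image index that buildx produces (a sketch only; per the attestation-storage docs linked earlier, buildx marks attestation manifests with a vnd.docker.reference.type annotation):

# Sketch: given a buildx-produced OCI image index (manifest list), keep the platform
# image manifests and skip attestation manifests, which are annotated with
# "vnd.docker.reference.type": "attestation-manifest".
def runnable_manifests(index: dict) -> list[dict]:
    return [
        m
        for m in index.get("manifests", [])
        if m.get("annotations", {}).get("vnd.docker.reference.type") != "attestation-manifest"
    ]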


pahud commented May 20, 2024

(Before 10 May I had previously deployed and ran Fargate tasks successfully from this definition)

Did you redeploy it after that time? What did you change, given that it seemed to be working before?

And are you able to provide a sample Dockerfile and CDK code snippets so that we can reproduce this issue in our environment?

@pahud pahud added p2 response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels May 20, 2024
anentropic commented May 20, 2024

I believe it's random whether the right image part gets tagged and pushed,

so any reproduction attempt is going to need some way to repeatedly force the Docker image to be rebuilt (see the sketch below).
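One way to force that without editing the Dockerfile might be the asset's extra_hash option, e.g. (a sketch, assuming the same stack context as the snippet in the description; I haven't verified this is the cleanest trigger):

# Sketch: bump extra_hash so the asset hash changes and CDK rebuilds/pushes the image,
# instead of adding throwaway RUN lines to the Dockerfile.
import time

from aws_cdk import aws_ecr_assets as ecr_assets

repro_img = ecr_assets.DockerImageAsset(
    self,  # the surrounding Stack/Construct
    "Repro Image",
    directory="./",
    extra_hash=str(time.time()),  # any changing value invalidates the asset hash
)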

I am puzzled why it started happening now; I assume some update to either CDK or Docker Desktop is responsible.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 20, 2024

anentropic commented May 21, 2024

Another possibility is that it relates to build/deploy failures

All of the below omits the BUILDX_NO_DEFAULT_ATTESTATIONS=1 env var.

I just had this failure from a cdk deploy:

#10 DONE 40.3s

#11 exporting to image
#11 exporting layers
#11 exporting layers 10.8s done
#11 exporting manifest sha256:560ce72bba99f87f99b5a72af0501d6d04fe6abd49f31585a4c9efc4b6cf8f37 0.0s done
#11 exporting config sha256:4a44d9a6522f7abf267fb08a4dd7892f78cd3782a3044e7defe305731fec6c0e done
#11 exporting attestation manifest sha256:dc2b75a2fe487b379f48f6dc7ee4700084e396689201e122f7ebed55214a1c4a 0.0s done
#11 exporting manifest list sha256:41f53826a2731afc38744f496350aab4e2a225d935690c12c671e34b612d3ccd 0.0s done
#11 naming to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest done
#11 unpacking to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest
#11 unpacking to docker.io/library/cdkasset-6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:latest 3.4s done
#11 DONE 14.3s

View build details: docker-desktop://dashboard/build/desktop-linux/desktop-linux/hnu3dvjn0b1qsrdl77umb7xyq

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview
my-project-website-dev-eu:  success: Built 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1

 ❌ Deployment failed: Error: Failed to build asset 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1
    at Deployments.buildSingleAsset (/Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:11302)
    at async Object.buildAsset (/Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:197148)
    at async /Users/anentropic/.nvm/versions/node/v18.18.0/lib/node_modules/aws-cdk/lib/index.js:443:181290

Failed to build asset 6c0f67b12e981df085f476d81628516616283374442f7afb80c034078279e264:570110252051-eu-west-1

(at approx 16:07 local time)

This fails without reaching the "IAM Statement Changes" confirmation step of the deployment.

And if I look in ECR:
[Screenshot 2024-05-21 at 16 09 21: ECR showing a single 0-size image from the failed build]

So the single 0-size image corresponds to the failed build I just experienced.

Then, if I don't update the content of my Dockerfile, perhaps a subsequent successful deploy will not push a new image?

I try again and get the same error as above. Another 0-size image seems to get pushed:
[Screenshot 2024-05-21 at 16 48 38: ECR showing a second 0-size image]

I try a third time, and this time the deploy proceeds and the IAM confirmation is reached, but it eventually fails with the original error from this issue:

"Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize."

No further images were pushed by this third failed deployment.


anentropic commented May 21, 2024

I try again, this time setting BUILDX_NO_DEFAULT_ATTESTATIONS=1:

"Resource of type 'AWS::Lambda::Function' with identifier 'docker-lambda-command' did not stabilize."

as expected (the Dockerfile was unchanged, so no new image was built).

I add a RUN echo "hello" to the Dockerfile and try again with BUILDX_NO_DEFAULT_ATTESTATIONS=1

It deployed successfully, with two full-size images:
[Screenshot 2024-05-21 at 17 38 21: ECR showing two full-size images]

which I believe are one for my Lambdas (which all share the same code but a different cmd) and one for the Fargate task.
