v3.5.5 unknown name groups during SSO #13061

Open
4 tasks done
Freddybob4244 opened this issue May 16, 2024 · 4 comments
Labels

  • area/sso-rbac
  • P1: High priority. All bugs with >=5 thumbs up that aren’t P0, plus any other bugs deemed high priority
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Comments

@Freddybob4244

Freddybob4244 commented May 16, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Summary

All non-admin users assume a read-only role. After v3.5.5 is applied, non-admin users can no longer use argo-workflows and are prompted with a generic error in the UI:

[Two screenshots of the generic UI error; images not preserved]

Admin users can navigate normally.

Argo-server pod logs provide a bit more useful insight:

time="2024-05-15T17:49:33.746Z" level=error msg="failed to perform RBAC authorization" error="failed to evaluate rule: unknown name groups (1:43)\n | 'ac1ef805-ac6a-4ce9-854a-1cb406aa7121' in groups\n | ..........................................^"

time="2024-05-15T17:49:33.746Z" level=warning msg="finished unary call with code PermissionDenied" error="rpc error: code = PermissionDenied desc = not allowed" grpc.code=PermissionDenied grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2024-05-15T17:49:33Z" grpc.time_ms=1.682 span.kind=server system=grpc

time="2024-05-15T17:49:33.746Z" level=info duration=2.267779ms method=GET path=/api/v1/workflows/argo size=34 status=403

There aren't any additional logs that are related to this error that I can find.

Testing Performed

To confirm 3.5.5 introduced the issue I tested a few different versions with a read-only test user in a sandbox installation.

  • Installed 3.5.3 - No issues
  • Installed 3.5.4 - No issues
  • Installed 3.5.5 - Issue introduced
  • Installed 3.5.6 - Issue persists
  • Installed latest - Issue persists
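
A version sweep like the one above can be repeated by pinning the image tag in a single place. The following is only a sketch, assuming the sandbox is deployed with Kustomize (the report does not say how it is installed); the resource file name is hypothetical, and the tag format mirrors the image reference used in the Deployment in this report.

```yaml
# kustomization.yaml (hypothetical; assumes a Kustomize-managed install)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - argo-server-deployment.yaml   # hypothetical file holding the Deployment
images:
  - name: quay.io/argoproj/argocli
    # Change this one value per test run (3.5.3, 3.5.4, 3.5.5, 3.5.6)
    # and re-apply to see which version introduces the error.
    newTag: "3.5.4"
```

After each change, logging in as the read-only test user and checking the argo-server logs for "unknown name groups" shows whether that version is affected.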

Configurations

Argo-Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-server
  namespace: argo
spec:
  selector:
    matchLabels:
      app: argo-server
  template:
    metadata:
      labels:
        app: argo-server
    spec:
      containers:
        - args:
            - server
            - "--auth-mode"
            - sso
            - "--auth-mode"
            - client
            - "--namespaced"
            - "--secure=false"
          env: []
          image: quay.io/argoproj/argocli:3.5.6
          name: argo-server
          ports:
            - containerPort: 2746
              name: web
          readinessProbe:
            httpGet:
              path: /
              port: 2746
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
          volumeMounts:
            - mountPath: /tmp
              name: tmp
      nodeSelector:
        kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
      serviceAccountName: argo-server
      volumes:
        - emptyDir: {}
          name: tmp

workflow-controller-configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: logs-devops
      endpoint: minio:9000
      insecure: true
      accessKeySecret:
        name: my-minio-cred
        key: accesskey
      secretKeySecret:
        name: my-minio-cred
        key: secretkey
  links: |
    - name: Example Workflow Link
      scope: workflow
      url: http://logging-facility?namespace=${metadata.namespace}&workflowName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}
    - name: Example Pod Link
      scope: pod
      url: http://logging-facility?namespace=${metadata.namespace}&podName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}
  metricsConfig: |
    disableLegacy: true
    enabled: true
    path: /metrics
    port: 9090
  persistence: |
    connectionPool:
      maxIdleConns: 100
      maxOpenConns: 0
      connMaxLifetime: 0s
    nodeStatusOffLoad: true
    archive: true
    archiveTTL: 30d
    postgresql:
      ssl: true
      sslmode: require
      host: ****.postgres.database.azure.com
      port: 5432
      database: argodevops
      tableName: argo_workflows
      userNameSecret:
        name: argo-postgres-creds
        key: username
      passwordSecret:
        name: argo-postgres-creds
        key: password
  sso: >
    issuer:
    https://login.microsoftonline.com/****/v2.0

    clientId:
      name: client-id-secret
      key: client-id-key
    clientSecret:
      name: client-secret-secret
      key: client-secret-key
    redirectUrl: https://argo-devops.dev.int.****.cloud/oauth2/callback 

    scopes:
      - https://graph.microsoft.com/Group.Read.All
    rbac:
      enabled: true
  workflowDefaults: |
    spec:
      activeDeadlineSeconds: 14400
      ttlStrategy:
        secondsAfterCompletion: 604800
      podGC:
        strategy: OnWorkflowCompletion
      nodeSelector:
        agentpool: "argoeph"
      tolerations:
        - key: workload-type
          operator: Equal
          value: argo
          effect: NoSchedule

Server Service Account

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-server
  namespace: argo

Server ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argo-server-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: argo-server-cluster-role
subjects:
  - kind: ServiceAccount
    name: argo-server
    namespace: argo

Server ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-server-cluster-role
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - create
  - apiGroups:
      - ""
    resources:
      - pods
      - pods/exec
      - pods/log
    verbs:
      - get
      - list
      - watch
      - delete
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - watch
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - serviceaccounts
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - argoproj.io
    resources:
      - eventsources
      - sensors
      - workflows
      - workfloweventbindings
      - workflowtemplates
      - cronworkflows
      - clusterworkflowtemplates
    verbs:
      - create
      - get
      - list
      - watch
      - update
      - patch
      - delete

Additional Info

In a Slack convo Anton suggested that #12573 may be suspect.

Link to slack message thread

People engaged already:

Happy to provide any more information on request or connect to experiment/work through the issue

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

This isn't an issue with workflow execution.

Logs from the workflow controller

This isn't an issue with workflow execution.

Logs from your workflow's wait container

This isn't an issue with workflow execution.
@Freddybob4244 Freddybob4244 added type/bug type/regression Regression from previous behavior (a specific type of bug) labels May 16, 2024
@agilgur5 agilgur5 added the P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority label May 16, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone May 16, 2024
@agilgur5
Member

agilgur5 commented May 17, 2024

Thanks for the detailed report!

In a Slack convo Anton suggested that #12573 may be suspect.

Specifically, the only SSO change in v3.5.5 was #12318, which is unrelated in this case (different provider, different claim). My suspicion is therefore that a dependency change caused this: #12573 upgraded expr, which is used for rbac-rule evaluation here and is the component that produces the error:

[...] failed to evaluate rule: unknown name groups (1:43)\n | 'ac1ef805-ac6a-4ce9-854a-1cb406aa7121' in groups\n | ..........................................^"

@Freddybob4244 could you provide a few more details that I had asked for in my last message:

  • The SA you were trying to match with the rbac-rule annotation you were using here
  • To clarify, is ac1ef805-ac6a-4ce9-854a-1cb406aa7121 actually one of your groups?
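
For readers following along, a ServiceAccount matched through SSO RBAC generally has the shape sketched below. This is not the reporter's actual manifest: the name and the precedence value are hypothetical, and only the rule expression is copied from the error log above.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: read-only-user        # hypothetical name
  namespace: argo
  annotations:
    # Expression evaluated against the claims of the logged-in user.
    # It must evaluate to a boolean. This one is taken from the log line.
    workflows.argoproj.io/rbac-rule: "'ac1ef805-ac6a-4ce9-854a-1cb406aa7121' in groups"
    # Hypothetical value; numerically higher precedence wins when
    # several rules match.
    workflows.argoproj.io/rbac-rule-precedence: "1"
```

The position in the error, (1:43), is the column where the identifier groups begins in this expression, which suggests the evaluator is rejecting the name itself rather than the value being compared.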

Admin users can navigate normally.

Also can you provide the admin SA with its rbac-rule as well? I'm curious how it differs since it applies correctly.

Happy to provide any more information on request or connect to experiment/work through the issue

From the same message, could you try some of these options and check what the logs say:

  • claims other than groups
  • expressions other than in
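
As a rough sketch of what such experiments could look like, the two test ServiceAccounts below use a rule on a claim other than groups and a constant rule with no claim at all. The names, the subject value, and the precedence values are hypothetical placeholders, and the first rule assumes the token carries a usable sub claim.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-sub-rule          # hypothetical name
  namespace: argo
  annotations:
    # Uses the sub claim and == instead of groups and in.
    # Replace the placeholder with the test user's actual subject.
    workflows.argoproj.io/rbac-rule: "sub == '00000000-0000-0000-0000-000000000000'"
    workflows.argoproj.io/rbac-rule-precedence: "2"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-default-rule      # hypothetical name
  namespace: argo
  annotations:
    # Constant rule that references no claim; matches every user.
    workflows.argoproj.io/rbac-rule: "true"
    workflows.argoproj.io/rbac-rule-precedence: "0"
```

If the sub rule evaluates without error while the groups rule still fails, that would point at the groups name specifically rather than at rule evaluation in general.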

I'm not sure if we have a robust test environment for SSO issues; we do have an optional Dex but I'm not sure if it's configured and used for SSO tests (I don't think it is IIRC). That would be great to have for reproducibility and to add test cases for the pieces of SSO that are not provider specific.

@agilgur5 agilgur5 changed the title from "Read-Only SSO Breaks moving to v3.5.5" to "unknown name groups during SSO in v3.5.5" May 17, 2024
@agilgur5
Member

agilgur5 commented May 22, 2024

From DMs:

So I am back from my long weekend - turns out that v3.4.17 has the same problem. I checked the release and it seems to have the expr change. I pushed v3.4.16 (which doesn't have that change) and it works.

That seems to confirm that #12573 is the source of the issue here. Now we need to figure out exactly why it causes this and how to fix it (and whether that requires another upstream fix in expr).

Also that expr upgrade seems to have been erroneously backported to Argo v3.4.17 per #13043 (comment)

@agilgur5
Member

Confirmed with a different user in another Slack thread that this is a regression in 3.5.5 and that reverting to 3.5.4 fixed it

@agilgur5
Member

@isubasinghe any chance you could take a look at this? Related to the expr upgrade and changes you made in #12573. It's a P1 given that it breaks SSO RBAC

@agilgur5 agilgur5 changed the title from "unknown name groups during SSO in v3.5.5" to "v3.5.5 unknown name groups during SSO" Jun 5, 2024