
Max and min pointed at Sidecars not working on 0.35 #7368

Closed
AlexDCraig opened this issue May 16, 2024 · 15 comments
@AlexDCraig commented May 16, 2024

Thanos, Prometheus and Golang version used: docker.io/bitnami/thanos:0.35.0-debian-12-r4, sidecar version v0.34.1, Prometheus version v2.50.1

Object Storage Provider: Azure

What happened:

Using Thanos Query, I can no longer use the max() or min() aggregations against my sidecars. The query sent from Query to the Sidecar has fundamentally changed. For instance, when I run:

max(jvm_gc_pause_seconds_max{cluster="dev", pod=~"podname.*"}) by (pod)

It yields this query on the sidecar:

[prometheus-k-prom-prometheus-operator-prometheus-0 thanos-sidecar] ts=2024-05-16T22:43:15.966821191Z caller=promclient.go:547 level=debug msg="range query" url="http://127.0.0.1:9090/api/v1/query_range?analyze=false&dedup=false&end=1715899348&engine=&explain=false&partial_response=true&query=max+by+%28pod%29+%28%7Bcluster%3D%22dev%22%2C+pod%3D~%22podname.%2A%22%2C+__name__%3D%22jvm_gc_pause_seconds_max%22%7D%29&start=1715877462&step=86"

This is new behavior: on Thanos 0.34 this log line doesn't appear at all. The logged query above will not return anything from the sidecar's Prometheus, because the local series have no "cluster" label; that label is an external label added in transit.

This seems to happen with aggregations like max() and min(); it doesn't happen with avg() or sum().
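For readability, the query parameter in that log line can be URL-decoded; a quick sketch using only Python's standard library, with the URL string copied verbatim from the debug log above:

```python
from urllib.parse import urlparse, parse_qs

# Range-query URL copied from the sidecar's debug log above.
log_url = (
    "http://127.0.0.1:9090/api/v1/query_range?analyze=false&dedup=false"
    "&end=1715899348&engine=&explain=false&partial_response=true"
    "&query=max+by+%28pod%29+%28%7Bcluster%3D%22dev%22%2C+pod%3D~%22podname.%2A%22"
    "%2C+__name__%3D%22jvm_gc_pause_seconds_max%22%7D%29"
    "&start=1715877462&step=86"
)

# parse_qs decodes percent-escapes and treats '+' as spaces.
params = parse_qs(urlparse(log_url).query)
print(params["query"][0])
# -> max by (pod) ({cluster="dev", pod=~"podname.*", __name__="jvm_gc_pause_seconds_max"})
```

The decoded PromQL shows the cluster="dev" matcher being pushed down to the sidecar's Prometheus, which has no such label on its local series.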

What you expected to happen:

Query can reach the Sidecar and, as in past versions, load recent data from it and aggregate using max() or min().

How to reproduce it (as minimally and precisely as possible):

Use the versions above. Ship Prometheus data via a sidecar every 2 hours. Use external labels such as cluster when shipping the data out. Notice that the most recent 2 hours of data are missing when running a max() query, while data from object storage still loads.

Full logs to relevant components: The interesting log is shared above.

Anything else we need to know:

@MichaHoffmann (Contributor) commented May 17, 2024

Hey,

Can you please share the configuration of the related components? Something is odd: Thanos sidecars usually don't issue range queries.

@AlexDCraig (Author) commented May 17, 2024

@MichaHoffmann Just want to highlight that 0.34 with the exact same config doesn't have this problem. Here's what I'm supplying to the various components.

Query:

- args:
        - query
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --query.replica-label=prometheus_replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-bitnami-storegateway-headless.thanos-bitnami.svc.cluster.local
        - --endpoint=t-1-0.thanos.mydomain.com:443
        - --endpoint=t-1-1.thanos.mydomain.com:443
        - --endpoint=s2-0.thanos.mydomain.com:443
        - --endpoint=s2-1.thanos.mydomain.com:443
        - --endpoint=p2-0.thanos.mydomain.com:443
        - --endpoint=p2-1.thanos.mydomain.com:443
        - --endpoint=ci-0.thanos.mydomain.com:443
        - --endpoint=d3-0.thanos.mydomain.com:443
        - --endpoint=d3-1.thanos.mydomain.com:443
        - --endpoint=pt-0.thanos.mydomain.com:443
        - --endpoint=pt-1.thanos.mydomain.com:443
        - --endpoint=ps-0.thanos.mydomain.com:443
        - --endpoint=ps-1.thanos.mydomain.com:443
        - --endpoint=ss-0.thanos.mydomain.com:443
        - --endpoint=ss-1.thanos.mydomain.com:443
        - --endpoint=i0.thanos.mydomain.com:443
        - --endpoint=i1.thanos.mydomain.com:443
        - --endpoint=cu-0.thanos.mydomain.com:443
        - --endpoint=cu-1.thanos.mydomain.com:443
        - --endpoint=ee-0.thanos.mydomain.com:443
        - --endpoint=ee-1.thanos.mydomain.com:443
        - --endpoint=l2-0.thanos.mydomain.com:443
        - --endpoint=l2-1.thanos.mydomain.com:443
        - --endpoint=dnssrv+_grpc._tcp.thanos-receiver-headless.thanos-receiver.svc.cluster.local
        - --alert.query-url=https://thanos-query-frontend-bitnami.mydomain.com
        - --query.auto-downsampling
        - --grpc-client-tls-secure
        - --grpc-client-tls-skip-verify
        - --grpc-client-tls-cert=/etc/certs/client.crt
        - --grpc-client-tls-key=/etc/certs/client.key
        - --grpc-client-tls-ca=/etc/certs/ca.crt

Query Frontend:

- args:
        - query-frontend
        - --log.level=info
        - --log.format=logfmt
        - --http-address=0.0.0.0:9090
        - --query-frontend.downstream-url=http://thanos-bitnami-query:9090
        - --query-range.split-interval=12h
        - --query-frontend.compress-responses
        - |
          --query-range.response-cache-config=
          type: IN-MEMORY
          config:
            max_size: 2GB

Store:

- args:
        - store
        - --log.level=info
        - --log.format=logfmt
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --data-dir=/data
        - --objstore.config-file=/conf/objstore.yml
        - --sync-block-duration=3m
        - --grpc-server-tls-cert=/etc/certs/server.crt
        - --grpc-server-tls-key=/etc/certs/server.key
        - --grpc-server-tls-client-ca=/etc/certs/ca.crt

Let me know if this is sufficient, or if there's more config you'd like to see. Thanks!

@MichaHoffmann (Contributor) commented:

Can you also please share the configuration of the sidecar that is logging the error?

@AlexDCraig (Author) commented:

Sidecar:

- args:
        - sidecar
        - --prometheus.url=http://127.0.0.1:9090/
        - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
        - --grpc-address=:10901
        - --http-address=:10902
        - --objstore.config=$(OBJSTORE_CONFIG)
        - --tsdb.path=/prometheus
        - --log.level=debug
        - --log.format=logfmt

@MichaHoffmann (Contributor) commented:

The thing that is really weird to me is that the only component that actually runs that code path (QueryRange in promclient.go) is the Thanos Ruler, but the log statement you shared indicates that it's from a container named "thanos-sidecar". Do you by chance run a Ruler too?

@MichaHoffmann (Contributor) commented:

Could it be that some sidecars are on a version before 0.34.0 and still use the query pushdown feature? We removed all raw PromQL queries from sidecars in f29b338.

@AlexDCraig (Author) commented May 20, 2024

@MichaHoffmann No, all Thanos sidecars are version 0.34.1:

- --thanos-default-base-image=quay.io/thanos/thanos:v0.34.1

Also, we don't use Thanos Ruler; at least, we don't have a Thanos Ruler deployment running, nor do we intend to. The Thanos sidecars on the remote clusters are configured via the Prometheus Operator.

@MichaHoffmann (Contributor) commented:

Sorry, it changes nothing, but to correct myself: that change was released in 0.34.1. The only way I can explain this is if you were running a sidecar on a version before 0.34.1. Something is pretty weird here; can you spot-check the Thanos version of the sidecar that logs that line, just to be extra sure?

@AlexDCraig (Author) commented May 20, 2024

k get pod prometheus-k-prom-prometheus-operator-prometheus-0 -n monitoring -o yaml

apiVersion: v1
kind: Pod
metadata:
  name: prometheus-k-prom-prometheus-operator-prometheus-0
  namespace: monitoring
spec:
  containers:
 ...
  - args:
    - sidecar
    - --prometheus.url=http://127.0.0.1:9090/
    - '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
    - --grpc-address=:10901
    - --http-address=:10902
    - --objstore.config=$(OBJSTORE_CONFIG)
    - --tsdb.path=/prometheus
    - --log.level=debug
    - --log.format=logfmt
    image: quay.io/thanos/thanos:v0.34.1
    imagePullPolicy: IfNotPresent
    name: thanos-sidecar

@MichaHoffmann (Contributor) commented:

I mean, can you run something like `thanos --version` inside the container?

@AlexDCraig (Author) commented:

Certainly:

k exec -it prometheus-k-prom-prometheus-operator-prometheus-0 -c thanos-sidecar -n monitoring -- /bin/sh

~ $ thanos --version
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

@pvlltvk commented May 23, 2024

Hi guys!
I can confirm that we have the same issue in our environment. We use 0.34.0 for the sidecars, and after upgrading Thanos Query to 0.35.0 the min/max aggregations stopped working, in the same way @AlexDCraig described.

@MichaHoffmann (Contributor) commented:

@pvlltvk Does it work again if you upgrade the sidecars?

@pvlltvk commented May 31, 2024

@MichaHoffmann Yes, I can confirm that after upgrading the sidecars to 0.35.0 it works again.

@MichaHoffmann (Contributor) commented:

> Yes, I can confirm that after upgrading the sidecars to 0.35.0 it works again.

Awesome, thanks for confirming


3 participants