Corrupting data written to remote storage when sample_age_limit is hit #13979
Attached is the docker compose we use to replicate the issue.

@marctc haven't you experienced similar issues? You might have the context as the author of this feature 🙏

Attempt to fix this: #14078
FUSAKLA added a commit to FUSAKLA/prometheus that referenced this issue on May 28, 2024 (Signed-off-by: Martin Chodur <m.chodur@seznam.cz>).
What did you do?
We've started experiencing data corruption since we started using `remote_write.queue_config.sample_age_limit`. We want to drop old samples after our disaster recovery tests, in order to see fresh data as soon as possible (we do not care much about samples scraped during the test itself). After these tests, which is basically the only time `sample_age_limit` applies, since we try to make sure we do not hit it during normal operation, we have seen unexpected counter resets as well as new, unexpected time series.
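For context, a minimal sketch of where this option sits in the Prometheus configuration; the endpoint URL and the 5m value are illustrative, not our production settings:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push  # illustrative endpoint
    queue_config:
      # Samples older than this are dropped instead of being retried;
      # 0s (the default) disables the limit.
      sample_age_limit: 5m
```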
What did you expect to see?

Hitting `sample_age_limit` should only drop old data.

What did you see instead? Under which circumstances?
Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not accessible for more than `sample_age_limit`, we have seen corruption in the data ingested into remote storage. In both cases it affected just 2-3 samples right after the remote write endpoint resumed operation, when Prometheus dropped the old data and started ingesting again.

So far we have noticed two cases:

```
go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}
```

Note the systemd unit labels (`name`, `state`, `type`) attached to a Go runtime metric; the label set appears to be stitched together from unrelated series.

We have been able to reproduce the above in a setup with a locally running Prometheus binary with remote_write configured towards our staging Mimir cluster. The outage was simulated with `iptables -I OUTPUT -d __IP__ -j DROP`.

We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote_write receiver (see the attached docker-compose; a sketch of the idea follows below). In this scenario we haven't seen any corrupted data in the receiving Prometheus, but we are getting logs complaining about corrupted data.
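The original compose attachment is not reproduced here; the following is a minimal sketch of the two-instance setup described above, with illustrative image tags, ports, and file names:

```yaml
# docker-compose.yml (illustrative sketch, not the original attachment)
services:
  receiver:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver  # accept pushes on /api/v1/write
    ports:
      - "9090:9090"
  sender:
    image: prom/prometheus:latest
    volumes:
      - ./sender.yml:/etc/prometheus/prometheus.yml  # hypothetical file name
    ports:
      - "9091:9090"
```

The sender's `sender.yml` would point `remote_write` at `http://receiver:9090/api/v1/write` with `queue_config.sample_age_limit` set as in the sketch above; the outage can then be simulated by pausing or firewalling the receiver container.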
System information
Linux 6.5.0-27-generic x86_64
Prometheus version
Prometheus configuration file
Alertmanager version
No response
Alertmanager configuration file
No response
Logs