Corrupting data written to remote storage when sample_age_limit is hit #13979
Attached is the docker compose we use to replicate the issue.

@marctc haven't you experienced similar issues? You might have the context as the author of this feature 🙏

Attempt to fix this: #14078
FUSAKLA added a commit to FUSAKLA/prometheus that referenced this issue on May 28, 2024 (Signed-off-by: Martin Chodur <m.chodur@seznam.cz>).
What did you do?
We've started experiencing data corruption since we started using `remote_write.queue_config.sample_age_limit`. We want to drop old samples after our disaster recovery tests, in order to see fresh data as soon as possible (we do not care much about samples scraped during the test itself). After these tests, which is basically the only time `sample_age_limit` applies, since we try to make sure we do not hit it during normal operation, we have seen unexpected counter resets as well as new, unexpected time series.
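For context, a minimal sketch of where this option sits in the Prometheus configuration; the endpoint URL and the 5m value are illustrative, not our production settings:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push  # illustrative endpoint
    queue_config:
      # Samples older than this are dropped instead of being retried;
      # 0s (the default) disables the limit.
      sample_age_limit: 5m
```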
What did you expect to see?

Hitting `sample_age_limit` should only drop old data.

What did you see instead? Under which circumstances?
Our production setup is Prometheus with remote write directed towards a Mimir cluster. When the remote write endpoint is not accessible for more than `sample_age_limit`, we have seen corruption in the data ingested into remote storage. In both cases it affected just 2-3 samples right after the remote write endpoint resumed operation, when Prometheus dropped the old data and started ingesting again.

So far we have noticed two cases:

```
go_gc_duration_seconds_count{cluster="local-test", instance="localhost:9100", job="node-exporter", name="systemd-networkd.service", state="activating", type="notify-reload"}
```

Note the systemd unit labels (`name`, `state`, `type`) attached to a Go runtime metric; the label set appears to be stitched together from unrelated series.

We have been able to reproduce the above in a setup with a locally running Prometheus binary with remote_write configured towards our staging Mimir cluster. The outage was simulated with `iptables -I OUTPUT -d __IP__ -j DROP`.

We have also tried to reproduce it in a setup with two Prometheus instances, one serving as a remote_write receiver (see the attached docker-compose; a sketch of the idea follows below). In this scenario we haven't seen any corrupted data in the receiving Prometheus, but we are getting logs complaining about corrupted data.
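The original compose attachment is not reproduced here; the following is a minimal sketch of the two-instance setup described above, with illustrative image tags, ports, and file names:

```yaml
# docker-compose.yml (illustrative sketch, not the original attachment)
services:
  receiver:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver  # accept pushes on /api/v1/write
    ports:
      - "9090:9090"
  sender:
    image: prom/prometheus:latest
    volumes:
      - ./sender.yml:/etc/prometheus/prometheus.yml  # hypothetical file name
    ports:
      - "9091:9090"
```

The sender's `sender.yml` would point `remote_write` at `http://receiver:9090/api/v1/write` with `queue_config.sample_age_limit` set as in the sketch above; the outage can then be simulated by pausing or firewalling the receiver container.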
System information
Linux 6.5.0-27-generic x86_64
Prometheus version
Prometheus configuration file
Alertmanager version
No response
Alertmanager configuration file
No response
Logs