Duplication of data #47

komw · 2020-11-25T18:48:47Z

I have a problem with duplicated data.
I have two InfluxDB instances (in1, in2), with influxdb-srelay in an HA configuration in front of them. Every write command is executed on both in1 and in2.
On in1 and in2 I also run two syncflux instances, configured as follows: on in1, master:in1, slave:in2; on in2, master:in2, slave:in1.
When I execute some write queries, everything is OK: the queries are executed on both instances.
Then I shut down the in2 instance while still sending write commands. Next I restart in2, and syncflux starts processing chunks from in1 and writing them to in2. The problem is that some of the data that was already in in2 before the shutdown is also retrieved from in1 and added as duplicates while the chunks are processed.

My configs are as simple as the examples from GitHub: srelay uses the HA example, and syncflux uses the default HA configuration with initial-replication = "both" (changing it to none doesn't help).
Why does syncflux duplicate the data? Why doesn't it check whether the data is already present in the database?

In the screenshot there is an example:
19.28 - The servers were started.
19.29 - I executed one write command.
19.30 - I stopped the second instance and executed two write commands.
19.32 - I started the secondary database and syncflux rebuilt it, but it added a duplicate of the write command from 19.29, so the secondary graph shows 2 instead of 1.

komw · 2020-11-25T19:07:17Z

And another bug with duplication: if I restart the first instance, it gets data from the second one, so I'll have duplicated data in the first instance as well.

toni-moreno · 2020-11-25T21:56:15Z

I can't understand what the problem is.

Could you please tell me, with an example, what exactly "duplicate data" means for you?

komw · 2020-11-25T22:09:07Z

Please follow my example from above:
I have 2 InfluxDB instances in HA (syncflux + srelay). If I lose one of my servers for some time, the recovery process doesn't recover only the data that server missed, but also data that server already has, so it duplicates those entries in the DB.

Simple as that:
2 servers are available and get writes from srelay. One of them stops (for whatever reason), and when it joins the stack again, the recovery process doesn't recover only the missing data: it also fetches some data that already exists in that database and inserts it as new entries, so some of the data is duplicated.

Please take a look: the right side is a dump from the server that was online all the time, the left side is from the server that was offline for some time. Syncflux downloaded not only the missed data but also entries that were already available in that DB.

If you need my configs or more info, please let me know.

toni-moreno · 2020-11-25T22:36:41Z

This HA system only works with data sent to srelay with a timestamp. If no timestamp is sent, each node sets its own local timestamp, and the data can differ between the two nodes.

If data is equal (tags, fields, and timestamp), it cannot be "duplicated"; it will be overwritten.
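
To illustrate the point above, here is a minimal Python sketch of the behaviour. It is a toy model, not syncflux or InfluxDB code: `Node`, `sync`, and `run` are hypothetical names, and the only assumption is that a point is identified by measurement, tag set, and timestamp, so a second write with the same identity overwrites instead of adding a row.

```python
from typing import Dict, Optional, Tuple

# Identity of a point: (measurement, sorted tag set, timestamp in ns).
PointKey = Tuple[str, Tuple[Tuple[str, str], ...], int]


class Node:
    """Toy stand-in for one InfluxDB instance (hypothetical, not a real client)."""

    def __init__(self, clock_ns: int) -> None:
        self.clock_ns = clock_ns  # this node's local clock
        self.points: Dict[PointKey, Dict[str, float]] = {}

    def write(self, measurement: str, tags: Dict[str, str],
              fields: Dict[str, float], ts_ns: Optional[int] = None) -> None:
        # Without a client timestamp, the node stamps the point itself.
        if ts_ns is None:
            ts_ns = self.clock_ns
        key = (measurement, tuple(sorted(tags.items())), ts_ns)
        # Same key: fields are overwritten. New key: a new point is stored.
        self.points.setdefault(key, {}).update(fields)


def sync(master: Node, slave: Node) -> None:
    """Copy every point from master to slave, keeping the stored timestamps."""
    for (measurement, tags, ts_ns), fields in master.points.items():
        slave.write(measurement, dict(tags), fields, ts_ns)


def run(with_client_timestamp: bool) -> int:
    # The two nodes' clocks differ by a few nanoseconds.
    in1, in2 = Node(clock_ns=1_000), Node(clock_ns=1_007)
    ts = 1_000 if with_client_timestamp else None
    for node in (in1, in2):  # the relay sends the same write to both nodes
        node.write("requests", {"host": "a"}, {"count": 1.0}, ts)
    sync(in1, in2)  # recovery copies in1's data into in2
    return len(in2.points)


print(run(with_client_timestamp=False))  # 2: in2 keeps its own point plus in1's
print(run(with_client_timestamp=True))   # 1: the copied point overwrites in place
```

Without a client timestamp the two nodes store the "same" write under different timestamps, so the copy from in1 lands next to in2's own point instead of replacing it.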

Could you show me the original data you sent, in ILP format?
Could you please compare the same query, without GROUP BY time, on both nodes with nanosecond resolution?

komw · 2020-11-26T10:26:35Z

"This HA system only works with data sent to srelay with a timestamp. If no timestamp is sent, each node sets its own local timestamp, and the data can differ between the two nodes."

You are right!
That was the problem: the time was generated by the DB, but it should be generated by the client, so that srelay writes the point to all DBs with the same timestamp.
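
For anyone hitting the same issue, a minimal sketch of stamping the point on the client side could look like this. `to_line` is a hypothetical helper, not part of srelay or any client library, and it assumes plain numeric float fields and names without characters that need escaping in line protocol.

```python
import time
from typing import Dict, Optional


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, float], ts_ns: Optional[int] = None) -> str:
    """Build one line-protocol line that always carries a timestamp.

    Assumption: no escaping is done, so names and values must not contain
    spaces, commas, or equals signs.
    """
    if ts_ns is None:
        # Stamp once on the client, in nanoseconds since the Unix epoch.
        ts_ns = time.time_ns()
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


# Fixed, arbitrary timestamp so the output is reproducible.
print(to_line("requests", {"host": "a"}, {"count": 1.0}, 1606330140000000000))
# requests,host=a count=1.0 1606330140000000000
```

Because the timestamp is part of the line itself, every backend that receives this line stores the point at the same time.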

What do you think, is there any situation where we shouldn't create the time at the client and should let the databases create it?
Maybe srelay should add the current time to any write operation that doesn't have a time field?

I think this information should also be at the beginning of the documentation ;) I also found your diagram of how the HA setup should look; adding it to the documentation would be helpful for developers.
