Duplication of data #47

Open
komw opened this issue Nov 25, 2020 · 5 comments

Comments

@komw

komw commented Nov 25, 2020

I have a problem with duplicated data.
I have two InfluxDB instances (in1, in2), fronted by influxdb-srelay with the HA config. Every write is executed on both in1 and in2.
On in1 and in2 I also run two instances of syncflux, configured as follows: on in1, master:in1, slave:in2; on in2, master:in2, slave:in1.
When I execute some write queries, everything is OK: the queries are executed on both instances.
Now I shut down the in2 instance while still sending writes, then restart in2. syncflux starts processing chunks from in1 and writes them to in2. The problem is that some of the data that was already in in2 before the shutdown is also retrieved from in1 and added as duplicates during chunk processing.

My configs are as simple as the examples from GitHub: srelay uses the HA example, and syncflux uses the default HA configuration with initial-replication = "both" (changing it to none doesn't help).
Why does syncflux duplicate the data? Why doesn't it check whether the data is already present in the database?

The screenshot shows an example:
19.28 - The servers were started
19.29 - I executed one write command
19.30 - I stopped the second instance and executed two write commands
19.32 - I started the secondary database and syncflux rebuilt it, but it added a duplicate of the 19.29 write, so the secondary graph shows 2 instead of 1

Screenshot 2020-11-25 at 19 38 41

@komw
Author

komw commented Nov 25, 2020

And another bug with deduplication: if I restart the first instance, it gets data from the second one, so the first instance also ends up with duplicated data.

@toni-moreno
Owner

I can't understand what the problem is.

Could you please tell me, with an example, what exactly "duplicate data" means for you?

@komw
Author

komw commented Nov 25, 2020

Please follow my example from above:
I have two InfluxDB instances in HA (syncflux + srelay). If I lose one of my servers for some time, the recovery process recovers not only the data that server missed but also data the server already has, so it duplicates those entries in the DB.

Simple as that:
Two servers are available and get writes from srelay. One of them stops (for whatever reason), and when it rejoins the stack, the recovery process doesn't recover only the missing data: it also fetches some data that already exists in that database and inserts it as new entries, so some data is duplicated.

Please take a look: the right side is a dump from the server that was online the whole time; the left side is from the server that was offline for some time. Syncflux downloaded not only the missed data but also entries that were already available in that DB.
Screenshot 2020-11-25 at 23 06 05

If you need my configs or more info, please let me know.

@toni-moreno
Owner

toni-moreno commented Nov 25, 2020

This HA system only works with data sent to the srelay with a timestamp. If none is sent, each node sets its own local timestamp, and the data can differ between the nodes.

If data is equal (tags, fields, and timestamp), it cannot be "duplicated"; it will be overwritten.
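To make the point above concrete, here is a minimal sketch in Python of why missing timestamps produce apparent duplicates after a resync. It is a toy model, not InfluxDB or syncflux code: `Node`, `run`, the clock offset, and the "copy every point from in1 to in2" resync step are all assumptions made for illustration. The only behavior taken from InfluxDB is that a point is identified by its measurement, tag set, and timestamp, so rewriting the same key overwrites rather than duplicates.

```python
from typing import Dict, Optional, Tuple

# A point is identified by (measurement, sorted tag pairs, timestamp in ns).
PointKey = Tuple[str, Tuple[Tuple[str, str], ...], int]


class Node:
    """Toy stand-in for one InfluxDB instance (hypothetical, not the real engine)."""

    def __init__(self, clock_offset_ns: int = 0) -> None:
        self.clock_offset_ns = clock_offset_ns  # local clocks never match exactly
        self.points: Dict[PointKey, dict] = {}
        self.online = True

    def write(self, measurement: str, tags: dict, fields: dict,
              ts_ns: Optional[int], now_ns: int) -> None:
        if not self.online:
            return  # writes to a stopped node are lost
        if ts_ns is None:
            # No timestamp in the write: the node assigns its own local time.
            ts_ns = now_ns + self.clock_offset_ns
        key = (measurement, tuple(sorted(tags.items())), ts_ns)
        # Same key: fields are merged/overwritten, no second point is created.
        self.points.setdefault(key, {}).update(fields)


def run(client_timestamps: bool) -> int:
    """Replay the scenario from this issue; return the point count on in2."""
    in1, in2 = Node(), Node(clock_offset_ns=1500)

    def relay_write(now_ns: int) -> None:
        # The relay fans the same write out to both nodes.
        ts = now_ns if client_timestamps else None
        for node in (in1, in2):
            node.write("requests", {"host": "web1"}, {"count": 1}, ts, now_ns)

    relay_write(1_000_000)   # both nodes up
    in2.online = False
    relay_write(2_000_000)   # in2 misses these two writes
    relay_write(3_000_000)
    in2.online = True

    # Simplified resync: copy in1's points (with in1's timestamps) to in2.
    for (measurement, tags, ts), fields in list(in1.points.items()):
        in2.write(measurement, dict(tags), fields, ts, now_ns=4_000_000)
    return len(in2.points)
```

In this model `run(client_timestamps=True)` leaves in2 with three points, the same as in1, while `run(client_timestamps=False)` leaves it with four: the first write exists once under in2's local timestamp and once under in1's.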

Could you show me the original data you sent, in ILP format?
Could you please compare the same query, without GROUP BY time, on both nodes with nanosecond resolution?
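One way to run that comparison is sketched below in Python, using the InfluxDB 1.x HTTP `/query` endpoint with `epoch=ns` so that timestamps come back as integer nanoseconds. The helper names `fetch_timestamps` and `diff_timestamps` are hypothetical, as are any URL, database, and measurement names you pass in; the sketch also assumes no authentication is required.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Tuple


def fetch_timestamps(base_url: str, db: str, measurement: str) -> List[int]:
    """Return the raw point timestamps (ns) of one measurement on one node."""
    query = f'SELECT * FROM "{measurement}"'  # no GROUP BY time: raw points only
    params = urllib.parse.urlencode({"db": db, "q": query, "epoch": "ns"})
    with urllib.request.urlopen(f"{base_url}/query?{params}") as resp:
        body = json.load(resp)
    series = body["results"][0].get("series", [])
    # "time" is the first column of every returned row.
    return [row[0] for s in series for row in s["values"]]


def diff_timestamps(a: List[int], b: List[int]) -> Tuple[List[int], List[int]]:
    """Return (timestamps only in a, timestamps only in b), each sorted."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return only_a, only_b
```

For example, `diff_timestamps(fetch_timestamps("http://in1:8086", "mydb", "requests"), fetch_timestamps("http://in2:8086", "mydb", "requests"))` would list the timestamps that exist on only one node. Points that look identical in a graph but differ by a few nanoseconds or milliseconds show up here as separate entries.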

@komw
Author

komw commented Nov 26, 2020

> This ha system only works with data sent to the srelay with timestamp, if not sent each node will set the local timestamp and data could be different in both nodes.

You are right!
That was the problem: the time was generated by the DB, but it should be generated by the client, so that srelay writes the point to every DB with the same timestamp.

What do you think: is there any situation where we shouldn't create the time at the client and should let the databases create it instead?
Maybe srelay should add the current time to any write that doesn't have a timestamp?
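As a rough illustration of that idea, here is a Python sketch that appends one shared timestamp to a line-protocol line that lacks one, before the line is sent to any node. This is not something srelay is stated to do in this thread; `split_line` and `ensure_timestamp` are hypothetical helpers, and the sketch assumes nanosecond precision and that double quotes appear only around string field values.

```python
import time
from typing import List, Optional


def split_line(line: str) -> List[str]:
    """Split a line-protocol line on spaces that are neither escaped nor quoted."""
    parts: List[str] = []
    cur: List[str] = []
    in_quotes = False
    escaped = False
    for ch in line:
        if escaped:
            cur.append(ch)          # character after a backslash is literal
            escaped = False
        elif ch == "\\":
            cur.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            cur.append(ch)
        elif ch == " " and not in_quotes:
            if cur:                 # a separator ends the current section
                parts.append("".join(cur))
                cur = []
        else:
            cur.append(ch)
    if cur:
        parts.append("".join(cur))
    return parts


def ensure_timestamp(line: str, now_ns: Optional[int] = None) -> str:
    """Return the line unchanged if it has a timestamp, else append one."""
    line = line.strip()
    # Sections are: measurement+tags, fields, optional timestamp.
    if len(split_line(line)) >= 3:
        return line
    if now_ns is None:
        now_ns = time.time_ns()     # one value, reused for every target node
    return f"{line} {now_ns}"
```

For instance, `ensure_timestamp("cpu,host=web1 value=1", 42)` gives `cpu,host=web1 value=1 42`, while a line that already ends in a timestamp is returned as-is. The key design point is that the timestamp is chosen once, before fan-out, so both nodes store the point under the same key.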

I think this information should also be at the beginning of the documentation ;) I also found your diagram of how the HA setup should look; adding it to the documentation would be helpful for developers:
[Image: HA setup diagram]
