Read-repair re-adds objects that should have been deleted #4945

Open
etiennedi opened this issue May 16, 2024 · 0 comments

etiennedi commented May 16, 2024

How to reproduce this bug?

All commands assume the use of this reproduction script (a rough sketch of what such a script could look like follows the steps below).

  1. Set up a three-node cluster locally (the script assumes the HTTP ports are 8080, 8081, 8082 and the gRPC ports are 50051, 50052, 50053, respectively)
  2. Import data using python3 read_repair_bug.py import
    • You can verify that the data was imported correctly using python3 read_repair_bug.py query.
    • This should show 10 objects on each node
  3. Kill node 2 or 3 (I recommend not killing node 1 because the local scripts don't like it when the "root" memberlist node dies)
  4. Run a batch delete using python3 read_repair_bug.py delete
  5. Restart the dead node
  6. Verify that the nodes are now out of sync using python3 read_repair_bug.py query
    • You should see 6 objects on the healthy nodes, but 10 objects on the node that missed the delete
  7. Query with consistency level ALL using python3 read_repair_bug.py query --consistency-level ALL
    • Note: It may take more than one iteration for the bug to show up. I had to run this command 3 times in my last attempt.
    • EDIT: This step may depend on timing. Right now it seemed as though I needed to wait ~60s before the repair messed things up. If this is correct, it could mean the issue is related to flushing memtables, as idle memtables would be flushed about 60s later.
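
For context, here is a rough sketch of what such a reproduction script could look like. This is not the actual read_repair_bug.py referenced above; the collection name, property names, replication settings, and delete filter are assumptions chosen only to match the object counts in the steps (10 imported, 4 deleted, 6 remaining), using the Weaviate Python client v4:

```python
# Hedged sketch only -- this is NOT the actual read_repair_bug.py linked above.
# Assumes the weaviate-client v4 package and the three local nodes from step 1.
# Collection name, properties, and the delete filter are made up for illustration.
import sys

import weaviate
from weaviate.classes.config import Configure, ConsistencyLevel
from weaviate.classes.query import Filter

NODES = [(8080, 50051), (8081, 50052), (8082, 50053)]  # (http, grpc) per node
COLLECTION = "ReadRepairBug"  # assumed name


def do_import() -> None:
    with weaviate.connect_to_local(port=NODES[0][0], grpc_port=NODES[0][1]) as client:
        if client.collections.exists(COLLECTION):
            client.collections.delete(COLLECTION)
        client.collections.create(
            COLLECTION,
            replication_config=Configure.replication(factor=3),
            vectorizer_config=Configure.Vectorizer.none(),
        )
        col = client.collections.get(COLLECTION)
        # 10 objects; 4 are flagged for deletion so the delete step leaves 6,
        # matching the counts described in the steps above.
        for i in range(10):
            col.data.insert({"idx": i, "to_delete": i < 4})


def do_delete() -> None:
    # Run this while one node is down so that node misses the deletions.
    with weaviate.connect_to_local(port=NODES[0][0], grpc_port=NODES[0][1]) as client:
        col = client.collections.get(COLLECTION)
        col.data.delete_many(where=Filter.by_property("to_delete").equal(True))


def do_query(level: ConsistencyLevel) -> None:
    # Count objects as seen through each node's coordinator.
    for http_port, grpc_port in NODES:
        with weaviate.connect_to_local(port=http_port, grpc_port=grpc_port) as client:
            col = client.collections.get(COLLECTION).with_consistency_level(level)
            res = col.query.fetch_objects(limit=100)
            print(f"node :{http_port} -> {len(res.objects)} objects")


if __name__ == "__main__":
    cmd = sys.argv[1] if len(sys.argv) > 1 else "query"
    if cmd == "import":
        do_import()
    elif cmd == "delete":
        do_delete()
    else:
        level = (
            ConsistencyLevel.ALL
            if "--consistency-level" in sys.argv and "ALL" in sys.argv
            else ConsistencyLevel.ONE
        )
        do_query(level)
```

Note that querying each node separately is only an approximation of per-node counts: with a replication factor of 3, a coordinator may fan out to other replicas depending on the consistency level.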

What is the expected behavior?

The node that missed the update is repaired, and eventually all nodes show 6 objects.

What is the actual behavior?

Instead of replicating the delete, we seem to replicate the inconsistent state from the out-of-sync node and end up with 10 objects on all nodes. In other words, the objects that should have been deleted were incorrectly recreated.

Supporting information

No response

Server Version

So far only tested on v1.23.9; I will test more versions.

Code of Conduct
