Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexError on slots_cache during reshard #3238

Open
gordo-veda opened this issue May 17, 2024 · 2 comments
Open

IndexError on slots_cache during reshard #3238

gordo-veda opened this issue May 17, 2024 · 2 comments

Comments

@gordo-veda
Copy link

It seems that the slots cache throws an IndexError sometimes during an online reshard.

redis-py 5.0.4

  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1115, in execute_command
    raise e
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1101, in execute_command
    res[node.name] = self._execute_command(node, *args, **kwargs)
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1210, in _execute_command
    raise e
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1138, in _execute_command
    target_node = self.nodes_manager.get_node_from_slot(
  File \"/usr/local/lib/python3.9/site-packages/redis/cluster.py\", line 1425, in get_node_from_slot
    return self.slots_cache[slot][node_idx]
IndexError: list index out of range

please let me know if i can provide any additional details, or help reproduce this.

@gordo-veda
Copy link
Author

This seems related: #2575

@gordo-veda
Copy link
Author

this is a disgusting hack to avoid forking redis-py, but it definitely improves stability while under load during scaling events when using elasticache.

basically we are just runtime-patching NodesManager to try the first node if an IndexError happens, as demonstrated in the related MR: #2575

def wrap_get_node_from_slot(cls):
    """Work around redis-py cluster incompatibility with elasticache."""
    if hasattr(cls, "original_get_node_from_slot"):
        return cls

    original_get_node_from_slot = cls.get_node_from_slot

    @wraps(original_get_node_from_slot)
    def get_node_from_slot_with_fallback(self, slot, read_from_replicas=False, server_type=None):
        """Fall-back to primary when elasticache scales and breaks the redis-py cluster client."""
        try:
            return original_get_node_from_slot(self, slot, read_from_replicas, server_type)
        except IndexError:
            # https://github.com/redis/redis-py/pull/2575
            return self.slots_cache[slot][0]

    cls.original_get_node_from_slot = original_get_node_from_slot
    cls.get_node_from_slot = get_node_from_slot_with_fallback

wrap_get_node_from_slot(NodesManager)

i understand there are some politics around AWS/Redis and forks and such, but regardless of anyone's opinions, i hope this is helpful for anyone else encountering this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant