
fix(emqx_router_helper): make sure late routes cleanup cannot remove new routes of the restarted node #12294

Open
wants to merge 4 commits into master from eliminate-new-routes-cleanup-if-node-restarts

Conversation

@SergeTupchiy (Contributor) commented Jan 10, 2024

Release version: v/e5.?

Summary

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip this PR until all of the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If the package build workflow was changed, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@SergeTupchiy SergeTupchiy requested review from lafirest and a team as code owners January 10, 2024 18:10
@SergeTupchiy SergeTupchiy marked this pull request as draft January 10, 2024 18:12
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/src/emqx_router_helper.erl (outdated review thread, resolved)
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from 572e626 to 7c62c03 Compare January 11, 2024 19:14
2
);
replicant when Node =:= node() ->
%% Mnesia is down locally, but emqx_router_helper is still running and holds the lock,
@SergeTupchiy (author):

This whole concept effectively treats any crash too seriously...
If mria_membership crashes on a replicant:

  1. emqx_router_helper on this replicant node will delete the lock.
  2. Another node will receive the {mria, down} event, acquire the lock and clean up the routes.
     However, the mria_membership process can be restarted and the replicant will continue its operation as though nothing happened...

It's probably better to do route cleanups only when the node is really down, rather than as soon as Mria (or even Mnesia) goes down? 🤔
This has its own drawback: we won't clean up routes to a node that is still connected but not running (but this scenario is probably less likely to happen).
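
For orientation, here is a rough sketch of the cleanup path being discussed, pieced together from the hunks quoted in this PR; cleanup_routes/1 and the exact shape of the membership message are assumptions, and the retry count of 2 is read off the fragment quoted above:

handle_info({membership, {mria, down, Node}}, State) ->
    case mria_rlog:role() of
        replicant when Node =:= node() ->
            %% Mria went down locally: emqx_router_helper may still hold the
            %% lock, so release it and let a healthy node do the cleanup.
            global:del_lock({?LOCK(Node), self()});
        core ->
            %% Mria went down on a remote node: compete for the lock and
            %% clean up the routes that node left behind.
            global:trans(
                {?LOCK(Node), self()},
                fun() -> cleanup_routes(Node) end,
                locking_nodes(),
                2
            );
        _ ->
            %% Replicants don't clean up routes for other nodes.
            ok
    end,
    {noreply, State};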

@SergeTupchiy (author):

Changed {mria, down} events to be sent only when mria_membership terminates with the shutdown reason, so that if it crashes and restarts due to some error, it doesn't cause any cleanups.
If Mria is shut down on a remote node, it makes sense to clean up its routes even if that node stays connected at the Erlang distribution level.
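
Roughly the behaviour described, as a purely illustrative sketch; this is not the actual mria code, and announce_mria_down/0 is a made-up helper standing in for whatever broadcasts the event:

terminate(Reason, _State) ->
    case Reason of
        shutdown ->
            %% Only a deliberate shutdown announces {mria, down, node()} to
            %% the other members, allowing them to clean up our routes.
            announce_mria_down();
        _Crash ->
            %% A crash is followed by a supervisor restart, so stay silent
            %% and don't trigger any remote route cleanup.
            ok
    end.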

@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch 3 times, most recently from 9acbc68 to 12a8128 Compare January 12, 2024 21:33
@keynslug (Contributor) left a comment:

🎉

Comment on lines 159 to 180
replicant when Node =:= node() ->
%% Mnesia (or Mria) is down locally,
%% but emqx_router_helper is still running and holds the lock,
%% remove it so that other healthy nodes can do cleanup.
global:del_lock({?LOCK(Node), self()});
Reviewer (Contributor):

Q: Hm, is it really something that can happen during normal operation? What would be the circumstances that trigger it? I'm under the impression that the mria / mnesia app is started in permanent mode in production releases, so it going down should cause the whole node to terminate anyway.

@SergeTupchiy (author):

Indeed, not expected normally...
The idea was to release the lock as soon as the node detects that something is wrong with it, with no regard to any other conditions (like permanent apps). I wanted to release the lock as soon as a local Mria/Mnesia shutdown is detected, because otherwise (if router_helper is still alive and holds the lock) other core nodes may fail to win the lock and clean up the routes, since they only do a limited number of retries (see the sketch after this comment).
I think it would be simpler if router_helper only reacted to {membership, {node, down, Node}} events to do route cleanup 🤔 If so, we can be sure that the lock held by the down node is released...

BTW, I've just noticed that mria is actually running as a temporary app; I think this is the reason.

Looks like it's worth fixing, so that it respects the emqx_machine restart type...
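
To spell out the "limited number of retries" point: global:trans/4 returns the atom aborted when the lock cannot be taken within the given number of retries. A sketch reusing names from the quoted hunks (try_cleanup/1 is a hypothetical wrapper; the retry count of 2 matches the fragment quoted earlier):

try_cleanup(Node) ->
    case global:trans(
             {?LOCK(Node), self()},
             fun() -> cleanup_routes(Node) end,
             locking_nodes(),
             2
         ) of
        aborted ->
            %% The dead node's router_helper still holds ?LOCK(Node): the
            %% lock was not won within 2 retries and the stale routes stay
            %% behind.
            aborted;
        Res ->
            Res
    end.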

@SergeTupchiy (author):

@zmstone what do you think about ignoring mnesia/mria down events in the router_helper and only doing cleanups when nodedown is received?

@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from 12a8128 to 197033a Compare January 15, 2024 15:55
@SergeTupchiy SergeTupchiy marked this pull request as ready for review January 15, 2024 15:56
locking_nodes() ->
locking_nodes(mria_rlog:role()).

locking_nodes(core) ->
@SergeTupchiy (author):

The idea is to avoid acquiring the lock on all the replicants, since the replicants don't do any cleanup for other nodes, and if the cluster is really large, it can become a noticeable overhead...
However, I was quite stuck on getting a list of core nodes from a replicant; I'm not entirely happy with any of the Mria functions we currently have (mria_membership:running_core_nodelist is async and can probably return an empty list on start, mria_lb:core_nodes can block/time out...) 🤔
Perhaps this can be improved after https://emqx.atlassian.net/browse/EMQX-11740

ok.

locking_nodes() ->
locking_nodes(mria_rlog:role()).
Reviewer (Member):

Between this and mria_config:whoami(), which is the correct API?

@SergeTupchiy (author):

I think both are equally usable. whoami/0 will tell if rlog is disabled and will return mnesia in that case.
role/0 always returns core or replicant.

locking_nodes(core) ->
mria_mnesia:running_nodes();
locking_nodes(replicant) ->
%% No reliable source to get the list of core nodes, unfortunately.
Reviewer (Member):

Maybe not critical to this PR, but it would be good to have Mria broadcast core nodes to all replicants?

@SergeTupchiy (author):

For sure, it has this mechanism, but I find it not very suitable, especially when called at app start in init/1; please see my comment: #12294 (comment)

@SergeTupchiy (author):

The replicant node inserts info about core nodes by doing async pings, so at the point when emqx_router_helper is started, it may well get an empty core node list, unless enough time has passed...

@SergeTupchiy (author):

@keynslug suggested a good alternative for now:

mria:async_dirty(?ROUTE_SHARD, fun mria_mnesia:running_nodes/0, []).
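
With that, the replicant clause could look roughly like this (a sketch built on the hunks quoted above; error handling for the case where the shard is unavailable is deliberately left out):

locking_nodes(core) ->
    mria_mnesia:running_nodes();
locking_nodes(replicant) ->
    %% A replicant has no reliable local view of the core nodes, so
    %% evaluate running_nodes/0 through the route shard, which runs it
    %% on a core node on our behalf.
    mria:async_dirty(?ROUTE_SHARD, fun mria_mnesia:running_nodes/0, []).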

This functionality was redundant and caused duplicated `nodedown` messages,
as calling `ekka:monitor(membership)` is enough to get notified about
up/down events of all nodes (both core and replicant).
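
Presumably the monitoring setup in emqx_router_helper then reduces to a single subscription, along these lines (a sketch, not the module's actual init/1; the empty state map is just a placeholder):

init([]) ->
    process_flag(trap_exit, true),
    %% One membership subscription is enough: it delivers up/down events
    %% for both core and replicant nodes, so no separate nodedown
    %% monitoring is needed.
    ekka:monitor(membership),
    {ok, #{}}.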
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch 2 times, most recently from 1391753 to cc6be1b Compare January 15, 2024 20:19
mria 0.8.2 adds `{mria, down, Node}` membership events,
which help detect that Mria/Mnesia is down on a replicant node
while the node is still connected at the Erlang distribution level.
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from cc6be1b to 834ea5d Compare January 15, 2024 20:38