
fix(emqx_router_helper): make sure late routes cleanup cannot remove new routes of the restarted node #12294

Open
wants to merge 4 commits into master from eliminate-new-routes-cleanup-if-node-restarts

Conversation

@SergeTupchiy (Contributor) commented Jan 10, 2024

Release version: v/e5.?

Summary

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip this PR until all of the items are checked:

  • Added tests for the changes
  • Added property-based tests for code which performs user input validation
  • Changed lines covered in coverage report
  • Change log has been added to changes/(ce|ee)/(feat|perf|fix|breaking)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • Created PR to emqx-docs if documentation update is required, or link to a follow-up jira ticket
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If the package build workflow was changed, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@SergeTupchiy SergeTupchiy requested review from lafirest and a team as code owners January 10, 2024 18:10
@SergeTupchiy SergeTupchiy marked this pull request as draft January 10, 2024 18:12
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/test/emqx_routing_SUITE.erl (outdated review thread, resolved)
apps/emqx/src/emqx_router_helper.erl (outdated review thread, resolved)
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from 572e626 to 7c62c03 Compare January 11, 2024 19:14
2
);
replicant when Node =:= node() ->
%% Mnesia is down locally, but emqx_router_helper is still running and holds the lock,
@SergeTupchiy (author):

This whole concept effectively treats any crash too seriously...
If mria_membership crashes on a replicant:

  1. emqx_router_helper on this replicant node will delete the lock.
  2. Another node will receive the {mria, down} event, acquire the lock and clean up the routes.
     However, the mria_membership process can be restarted and the replicant will continue its operation as though nothing happened...

It's probably better to do route cleanups only when the node is really down, rather than as soon as Mria (or even Mnesia) goes down? 🤔
This has its own drawback: we won't clean up routes to a node that is still connected but not running (but this scenario is probably less likely to happen).
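
For orientation, here is a rough sketch of the cleanup path being discussed, pieced together from the hunks quoted in this PR; cleanup_routes/1 and the exact shape of the membership message are assumptions, and the retry count of 2 is read off the fragment quoted above:

handle_info({membership, {mria, down, Node}}, State) ->
    case mria_rlog:role() of
        replicant when Node =:= node() ->
            %% Mria went down locally: emqx_router_helper may still hold the
            %% lock, so release it and let a healthy node do the cleanup.
            global:del_lock({?LOCK(Node), self()});
        core ->
            %% Mria went down on a remote node: compete for the lock and
            %% clean up the routes that node left behind.
            global:trans(
                {?LOCK(Node), self()},
                fun() -> cleanup_routes(Node) end,
                locking_nodes(),
                2
            );
        _ ->
            %% Replicants don't clean up routes for other nodes.
            ok
    end,
    {noreply, State};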

@SergeTupchiy (author):

Changed {mria, down} events to be sent only when mria_membership terminates with the shutdown reason, so that if it crashes and restarts due to some error, it doesn't cause any cleanups.
If Mria is shut down on a remote node, it makes sense to clean up its routes even if that node stays connected at the Erlang distribution level.
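
Roughly the behaviour described, as a purely illustrative sketch; this is not the actual mria code, and announce_mria_down/0 is a made-up helper standing in for whatever broadcasts the event:

terminate(Reason, _State) ->
    case Reason of
        shutdown ->
            %% Only a deliberate shutdown announces {mria, down, node()} to
            %% the other members, allowing them to clean up our routes.
            announce_mria_down();
        _Crash ->
            %% A crash is followed by a supervisor restart, so stay silent
            %% and don't trigger any remote route cleanup.
            ok
    end.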

@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch 3 times, most recently from 9acbc68 to 12a8128 Compare January 12, 2024 21:33
@keynslug (Contributor) left a comment:

🎉

Comment on lines 159 to 180
replicant when Node =:= node() ->
%% Mnesia (or Mria) is down locally,
%% but emqx_router_helper is still running and holds the lock,
%% remove it so that other healthy nodes can do cleanup.
global:del_lock({?LOCK(Node), self()});
Reviewer (Contributor):

Q: Hm, is it really something that can happen during normal operation? What would be the circumstances that trigger it? I'm under the impression that the mria / mnesia app is started in permanent mode in production releases, so it going down should cause the whole node to terminate anyway.

@SergeTupchiy (author):

Indeed, not expected normally...
The idea was to release the lock as soon as the node detects that something is wrong with it, with no regard to any other conditions (like permanent apps). I wanted to release the lock as soon as a local Mria/Mnesia shutdown is detected, because otherwise (if router_helper is still alive and holds the lock) other core nodes may fail to win the lock and clean up the routes, since they only do a limited number of retries (see the sketch after this comment).
I think it would be simpler if router_helper only reacted to {membership, {node, down, Node}} events to do route cleanup 🤔 If so, we can be sure that the lock held by the down node is released...

BTW, I've just noticed that mria is actually running as a temporary app; I think this is the reason.

Looks like it's worth fixing, so that it respects the emqx_machine restart type...
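
To spell out the "limited number of retries" point: global:trans/4 returns the atom aborted when the lock cannot be taken within the given number of retries. A sketch reusing names from the quoted hunks (try_cleanup/1 is a hypothetical wrapper; the retry count of 2 matches the fragment quoted earlier):

try_cleanup(Node) ->
    case global:trans(
             {?LOCK(Node), self()},
             fun() -> cleanup_routes(Node) end,
             locking_nodes(),
             2
         ) of
        aborted ->
            %% The dead node's router_helper still holds ?LOCK(Node): the
            %% lock was not won within 2 retries and the stale routes stay
            %% behind.
            aborted;
        Res ->
            Res
    end.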

@SergeTupchiy (author):

@zmstone what do you think about ignoring mnesia/mria down events in the router_helper and only doing cleanups when nodedown is received?

@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from 12a8128 to 197033a Compare January 15, 2024 15:55
@SergeTupchiy SergeTupchiy marked this pull request as ready for review January 15, 2024 15:56
locking_nodes() ->
locking_nodes(mria_rlog:role()).

locking_nodes(core) ->
@SergeTupchiy (author):

The idea is to avoid acquiring the lock on all the replicants, since the replicants don't do any cleanup for other nodes, and if the cluster is really large, it can become a noticeable overhead...
However, I was quite stuck on getting a list of core nodes from a replicant; I'm not entirely happy with any of the Mria functions we currently have (mria_membership:running_core_nodelist is async and can probably return an empty list on start, mria_lb:core_nodes can block/time out...) 🤔
Perhaps this can be improved after https://emqx.atlassian.net/browse/EMQX-11740

ok.

locking_nodes() ->
locking_nodes(mria_rlog:role()).
Reviewer (Member):

Between this and mria_config:whoami(), which is the correct API?

@SergeTupchiy (author):

I think both are equally usable. whoami/0 will tell if rlog is disabled and will return mnesia in that case.
role/0 always returns core or replicant.

locking_nodes(core) ->
mria_mnesia:running_nodes();
locking_nodes(replicant) ->
%% No reliable source to get the list of core nodes, unfortunately.
Reviewer (Member):

Maybe not critical to this PR, but it would be good to have Mria broadcast core nodes to all replicants?

@SergeTupchiy (author):

For sure, it has this mechanism, but I find it not very suitable, especially when called at app start in init/1; please see my comment: #12294 (comment)

@SergeTupchiy (author):

The replicant node inserts info about core nodes by doing async pings, so at the point when emqx_router_helper is started, it may well get an empty core node list, unless enough time has passed...

@SergeTupchiy (author):

@keynslug suggested a good alternative for now:

mria:async_dirty(?ROUTE_SHARD, fun mria_mnesia:running_nodes/0, []).
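
With that, the replicant clause could look roughly like this (a sketch built on the hunks quoted above; error handling for the case where the shard is unavailable is deliberately left out):

locking_nodes(core) ->
    mria_mnesia:running_nodes();
locking_nodes(replicant) ->
    %% A replicant has no reliable local view of the core nodes, so
    %% evaluate running_nodes/0 through the route shard, which runs it
    %% on a core node on our behalf.
    mria:async_dirty(?ROUTE_SHARD, fun mria_mnesia:running_nodes/0, []).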

This functionality was redundant and caused duplicated `nodedown` messages,
as calling `ekka:monitor(membership)` is enough to get notified about
up/down events of all nodes (both core and replicant).
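
Presumably the monitoring setup in emqx_router_helper then reduces to a single subscription, along these lines (a sketch, not the module's actual init/1; the empty state map is just a placeholder):

init([]) ->
    process_flag(trap_exit, true),
    %% One membership subscription is enough: it delivers up/down events
    %% for both core and replicant nodes, so no separate nodedown
    %% monitoring is needed.
    ekka:monitor(membership),
    {ok, #{}}.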
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch 2 times, most recently from 1391753 to cc6be1b Compare January 15, 2024 20:19
mria 0.8.2 adds `{mria, down, Node}` membership events,
which help detect that Mria/Mnesia is down on a replicant node
while the node is still connected at the Erlang distribution level.
@SergeTupchiy SergeTupchiy force-pushed the eliminate-new-routes-cleanup-if-node-restarts branch from cc6be1b to 834ea5d Compare January 15, 2024 20:38