
Events being lost in worker mode #7223

Open
mmoorfield opened this issue May 3, 2024 · 4 comments

Comments

@mmoorfield

Describe the bug

When a new medusa instance is started in worker mode, it consumes messages that were already on redis before the subscribers are initialized.

The start order of the server and worker instances therefore matters: the worker must be active before new events arrive. This is problematic because it means auto-scaling of workers is not really possible without introducing the potential to lose events.
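The race can be illustrated with a minimal in-memory sketch (simplified, not Medusa code; `NaiveBus`, `dispatch`, and `subscribe` are hypothetical names). Like the worker, it runs whatever handlers exist at the moment an event is consumed, so an event pulled before subscribers load is simply gone:

```typescript
type Handler = (event: string) => void;

class NaiveBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Mirrors the "Processing <event> which has N subscribers" log line:
  // dispatch runs whatever handlers exist right now and returns the count.
  dispatch(event: string): number {
    const list = this.handlers.get(event) ?? [];
    for (const handler of list) handler(event);
    return list.length;
  }
}

// A freshly started worker consumes a queued event before its
// subscribers are loaded: 0 handlers run, and the event is lost.
const bus = new NaiveBus();
const early = bus.dispatch("order.placed"); // 0 subscribers, event lost
bus.subscribe("order.placed", () => {});
const late = bus.dispatch("order.placed"); // now 1 subscriber
```

Nothing re-delivers the first `order.placed`, which is exactly the failure mode in the logs below.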

System information

Medusa version (including plugins): 1.20.4
Node.js version: 18
Database: Postgres
Operating system: Linux and Mac
Browser (if relevant): N/A
Event Bus: Redis

Steps to reproduce the behavior

  1. Start a medusa instance in worker_mode: server
  2. Submit a new order (or any other activity that triggers an event on the redis event bus)
  3. Start a second medusa instance in worker_mode: worker
  4. Observe the logs: the events are processed before the subscribers are initialized, so they are effectively lost
```
✔ Models initialized – 19ms
✔ Plugin models initialized – 11ms
✔ Strategies initialized – 18ms
✔ Database initialized – 56ms
✔ Repositories initialized – 22ms
✔ Services initialized – 6ms
⠋ Initializing modules
info:    Connection to Redis in module 'event-bus-redis' established
info:    Connection to Redis in module 'cache-redis' established
info:    Processing cart.created which has 0 subscribers
info:    Processing cart.updated which has 0 subscribers
info:    Processing cart.updated which has 0 subscribers
info:    Processing cart.updated which has 0 subscribers
info:    Processing payment.updated which has 0 subscribers
info:    Processing order.placed which has 0 subscribers
info:    Processing cart.updated which has 0 subscribers
✔ Modules initialized – 90ms
✔ Express intialized – 1ms
✔ Plugins intialized – 554ms
✔ Subscribers initialized – 5ms
✔ API initialized – 29ms
⠋ Initializing defaults
✔ Defaults initialized – 67ms
⠋ Initializing search engine indexing
✔ Indexing event emitted – 3ms
✔ Server is ready on port: 9000 – 19ms
```

Expected behavior

Multiple worker instances should be able to subscribe to events, and all custom subscribers should be initialised before any worker starts processing.

@olivermrbl
Contributor

@mmoorfield, thanks for submitting the issue.

We hadn't thought about this scenario, but you are right. This can indeed happen with the current way our subscribers are loaded, which is after the worker starts processing events.

The solution to this is likely not a quick fix – is this something you need urgently?

I have some ideas about how we can solve this, but those are more comprehensive changes, e.g. introducing a new application life cycle hook that is executed after the entire application has started. There, we would be able to tell the worker to start processing without worrying about whether subscribers are registered or not.
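A minimal sketch of what such a hook could look like (a sketch under assumptions: `onApplicationStart`, `bootstrap`, and the step names are hypothetical, not actual Medusa API). A module registers a callback during loading, and the boot sequence only fires the callbacks after every loader step has completed:

```typescript
// Hypothetical life cycle hook: callbacks registered here run only
// after the entire application has started.
type Hook = () => void;

const postStartHooks: Hook[] = [];

// A module loader would call this, e.g. event-bus-redis could register
// `() => bullWorker.run()` instead of letting its worker autorun.
function onApplicationStart(hook: Hook): void {
  postStartHooks.push(hook);
}

// Boot sequence: run every loader step (modules, plugins, subscribers,
// API...), and only then fire the post-start hooks.
function bootstrap(steps: Array<() => void>): void {
  for (const step of steps) step();
  for (const hook of postStartHooks) hook();
}

// Illustration: the hook always fires after all startup steps.
const order: string[] = [];
onApplicationStart(() => order.push("worker.run()"));
bootstrap([
  () => order.push("modules"),
  () => order.push("subscribers"),
]);
// order is ["modules", "subscribers", "worker.run()"]
```

By construction the worker cannot begin processing before subscribers are registered, since its start is deferred to the hook phase.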

@mmoorfield
Author

Thanks @olivermrbl - it's quite a significant problem for us, as we deploy the API and workers separately on AWS ECS (Fargate) with auto-scaling of the containers. The split deployment is a great improvement to the scalability of the stack.

But with this new approach we can't guarantee the start order of things, and under high load a new container introduced to the environment can consume events with 0 subscribers registered and lose them.

We have implemented a temporary workaround using the BullMQ worker autorun = false config option, which stops the worker from starting automatically on creation. We then modified the redis event bus module to add the ability to start the worker explicitly via bullWorker_.run(), which we invoke from a custom API route.

In our docker start script we wait for the worker instance to fully start before invoking this custom API route.

It works, but forking the event bus module is not ideal. Welcome any other ideas you have.
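For reference, the gating pattern behind the workaround can be sketched without BullMQ or Redis (a minimal sketch under assumptions: `GatedWorker`, `receive`, and `start` are hypothetical names, and the in-memory buffer stands in for jobs that, with autorun disabled, would simply remain on the Redis queue until the worker runs):

```typescript
type Handler = (event: string) => void;

class GatedWorker {
  private started = false;
  private pending: string[] = [];
  private handlers = new Map<string, Handler[]>();

  subscribe(event: string, handler: Handler): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  // Events received before start() are buffered, not dropped
  // (the in-memory analogue of jobs waiting on the queue).
  receive(event: string): void {
    if (!this.started) {
      this.pending.push(event);
      return;
    }
    this.dispatch(event);
  }

  // Invoked once all subscribers are registered, e.g. from the custom
  // API route hit by the docker start script.
  start(): void {
    this.started = true;
    for (const event of this.pending.splice(0)) {
      this.dispatch(event);
    }
  }

  private dispatch(event: string): void {
    for (const handler of this.handlers.get(event) ?? []) {
      handler(event);
    }
  }
}

// Usage: an event arriving during startup is held until start().
const worker = new GatedWorker();
const seen: string[] = [];
worker.receive("order.placed"); // arrives before subscribers load
worker.subscribe("order.placed", (e) => seen.push(e));
worker.start(); // invoked after full startup
// seen is now ["order.placed"]
```

The key design point is that "consume" and "process" are decoupled: nothing is handed to handlers until the explicit start signal, so start order no longer matters.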

@olivermrbl
Contributor

> But with this new approach we can't guarantee the start order of things and under high load introducing a new container to the environment does have the potential to consume events with 0 subscribers registered and lose events.

Can I get you to elaborate on this? At first glance, with a life cycle hook, we are guaranteed that a specific instance won't pick up events before subscribers are registered.

@mmoorfield
Author

Sorry for the confusion. Your proposed approach of a life cycle hook is a good one and would address this.

I was referring to the new approach of running medusa instances in server and worker mode separately where we can't control the start sequence.
