federation: parallel sending per instance #4623
base: main
Conversation
crates/federate/src/worker.rs
Outdated
```rust
let domain = self.instance.domain.clone();
tokio::spawn(async move {
  let mut report = report;
  if let Err(e) = InstanceWorker::send_retry_loop(
```
Store this in a variable first to make it more readable.
crates/federate/src/worker.rs
Outdated
```rust
activity: &SentActivity,
object: &SharedInboxActivities,
inbox_urls: Vec<Url>,
report: &mut UnboundedSender<SendActivityResult>,
initial_fail_count: i32,
```
Would make more sense to use a single shared AtomicI32 for fail count, instead of passing it back and forth like this.
It's not passed back and forth, it's only passed one way, from the worker to the send task. The issue with not passing it in explicitly is that `send_retry_loop` would then access `self`, and thus the whole `InstanceWorker` struct would have to be `Send + Sync`, which means wrapping everything in locks.
That's why I create those local variables before the lambda calling this.
Then you can share around an `Arc<AtomicI32>` instead of `self`.
The point is we don't need any atomics, so adding atomics without any real reason is bad. This is a read-only, immutable variable that is passed only one way, from the main task to each send task, so the task knows how long to sleep initially if the request fails.
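A minimal, self-contained sketch of that pattern (all names invented, not the PR's actual code): the fail count is a plain `i32` copied into the spawned task, and results flow back only through the channel.

```rust
use tokio::sync::mpsc::unbounded_channel;

#[tokio::main]
async fn main() {
  let (report, mut receive_send_result) = unbounded_channel::<i32>();
  // Read-only copy handed to the task; since it is never written from two
  // places, a plain i32 suffices and no Arc<AtomicI32> is needed.
  let initial_fail_count: i32 = 3;

  tokio::spawn(async move {
    // The task owns its copies/clones, so it never borrows the worker
    // struct, and the worker therefore doesn't need to be Send + Sync.
    let backoff_secs = 2_u64.pow(initial_fail_count as u32);
    println!("a failed send would first sleep {backoff_secs}s");
    // Report upwards; only the main task decides what to persist.
    let _ = report.send(initial_fail_count + 1);
  });

  if let Some(new_fail_count) = receive_send_result.recv().await {
    println!("send task reported fail count {new_fail_count}");
  }
}
```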
Just realizing you're referring to the channel send when you talk about "passing it back".
I see that might be an option, but I don't like it. The send task is cleanly separated and has no direct interaction with the main thread. It's nicely compartmentalized: you can now look purely at the send.rs file to understand how the send-retry loop of one activity works. It only sends notifications upwards; it never receives updates downwards after it is started.
The failure event needs to be transferred to the main task in any case so it can update the database. The main fail_count should only be updated under certain conditions, and those conditions belong in the main task, since the logic that updates the database needs the same information.
In addition, if we changed this to an atomic for the fails, then we should also change the "success" sends to direct writes into a RwLock instead of writing to a channel. Otherwise there's a weird mix where some data is sent via channel and some via atomics. But changing that would mean you'd again have to figure out how to signal the main task when to continue, and again make all interactions more complex to understand, since they can happen with arbitrary concurrency.
Alright, it seems a bit weird at first but makes sense with your explanation. Would be good if you can copy that into a code comment on `SendActivityResult`.
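Something like this sketch could carry that explanation as a doc comment (the variant shapes here are assumptions, not the actual definition):

```rust
/// Result of one activity send, reported from a send task to the instance
/// worker over a channel. Data flows strictly one way: send tasks report
/// upwards and never receive updates after being started. The worker alone
/// decides whether a failure increments the stored fail count, since the
/// same conditions also gate the database update.
pub(crate) enum SendActivityResult {
  /// The activity was delivered to all inboxes (id of the sent activity).
  Success(i64),
  /// Delivery failed; the worker applies its retry/fail-count logic.
  Failure { activity_id: i64, fail_count: i32 },
}
```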
```rust
// which can only happen by an event sent into the channel
self
  .handle_send_results(&mut receive_send_result, &mut successfuls, &mut in_flight)
  .await?;
```
I would launch a separate background task for this instead of manually checking here. Can also potentially be moved into a separate struct/file.
Maybe; that's what I originally wanted to do, but the problem then is that you need some way for the "read activities" loop to pause and unpause depending on the state of the results-handling task. That would require either another arbitrary sleep() or another channel for sending around values.
I intentionally put this inline here because then the "read activities" loop knows exactly when to continue, and nothing needs any thread synchronization. Nothing in this struct needs atomics or locks because there's only a single task accessing anything; the only thing running in parallel is the actual sends.
I agree that passing `&mut`s to local variables is pretty... weird. I originally had this part inlined; I moved it out so it's easier to grok the whole loop.
recv_many blocks automatically if there are no items. Then you can also sleep for `WORK_FINISHED_RECHECK_DELAY` if there are fewer than 4 items or so. Anyway, `UnboundedReceiver` works across threads, so I don't see why it would need any locks.
You should also be able to handle `print_stats()` from the same receiver; I don't see any reason why it should need a separate channel. Additionally, it should be possible to use a single receiver worker to write data for all instances, with no need to do it separately per instance.
Some of this may be too much effort to be worth implementing now, but at least mention the possibilities in a comment for future work.
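A rough sketch of the suggested background task, with a placeholder result type; `recv_many` is tokio's `UnboundedReceiver::recv_many`, and this illustrates the reviewer's idea rather than what the PR implements:

```rust
use tokio::sync::mpsc::UnboundedReceiver;

// Placeholder result type for illustration only.
enum SendActivityResult {
  Success(i64),
  Failure(i64),
}

async fn handle_results(mut rx: UnboundedReceiver<SendActivityResult>) {
  let mut buf = Vec::new();
  // recv_many waits until at least one item arrives, so this task blocks
  // cheaply instead of busy-polling; it returns 0 once all senders drop.
  while rx.recv_many(&mut buf, 128).await > 0 {
    for result in buf.drain(..) {
      match result {
        // update last_successful_id, fail counts, stats, etc. here
        SendActivityResult::Success(id) => println!("sent {id}"),
        SendActivityResult::Failure(id) => println!("failed {id}"),
      }
    }
  }
}
```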
> recv_many blocks automatically if there are no items

Yes, but the problem is that you want the recv to be in a separate thread from the thread that queues new sends. So how would the sending queuer know when it's supposed to send new items? It's inline here exactly because it blocks when there are no items, and thus pauses the send loop for exactly the right time until a new item should be sent. The alternatives are hacky and would require more arbitrarily chosen delays. And this event is expected to happen 10+ times per second per instance, so waiting would basically be a busy loop (sleep 10 ms, recheck, sleep 10 ms, recheck, ...).
I don't really see how this would be a possibility; moving it to a separate thread just seems all-around worse to me.
Makes sense. What about using only a single `unbounded_channel` for activity send completion and `print_stats()`?
I've split the instance worker into three separate files.
```diff
@@ -0,0 +1,149 @@
+use crate::util::LEMMY_TEST_FAST_FEDERATION;
```
As a note, I've not made any changes to this code, just moved it into a separate struct.
As mentioned in the dev chat, it would be very useful to have some unit tests here, to ensure it works as expected.
Any ideas on how best to do that? The only clean way I can think of would be abstracting all DB and HTTP interactions (I guess around 5-10 functions?) into a trait so the whole federate crate code is pure, and then mocking the DB and HTTP interactions with data from memory.
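A rough sketch of what that trait could look like, with entirely hypothetical method names (this illustrates the proposal being discussed, not actual federate crate code):

```rust
use url::Url;

#[async_trait::async_trait]
trait FederationDeps {
  /// DB side: load the next batch of activities that still need sending.
  async fn next_activities(&self, after_id: i64) -> anyhow::Result<Vec<String>>;
  /// HTTP side: deliver one serialized activity to an inbox.
  async fn send_to_inbox(&self, inbox: &Url, body: &str) -> anyhow::Result<()>;
}
// A test implementation would return canned activities from memory and
// record which inboxes it was asked to deliver to.
```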
Mocking would be too complicated and could introduce problems of its own. Look at how other tests are implemented, e.g. for db views: basically write some test data to the db, then call functions and see if they behave as expected. You can start a local server with an inbox route to check that activities are received (with an instance like
I realized that there is a better criterion for parallel sending than grouping by community id or post id: we can send activities in parallel as long as they have different actors. So the algorithm would be something like this:
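As a rough sketch of that criterion (all types and names here are assumptions for illustration, not the PR's actual code):

```rust
use std::collections::HashSet;

struct Activity {
  id: i64,
  actor_apub_id: String,
}

/// Pick activities that may go out concurrently: activities from different
/// actors commute, but two activities from the same actor must stay in
/// order, so an actor that already has a send in flight is skipped.
fn next_parallel_batch<'a>(
  queue: &'a [Activity],
  in_flight_actors: &HashSet<String>,
  limit: usize,
) -> Vec<&'a Activity> {
  let mut batch = Vec::new();
  let mut claimed: HashSet<&str> = HashSet::new();
  for a in queue {
    if batch.len() >= limit {
      break;
    }
    if in_flight_actors.contains(&a.actor_apub_id) || !claimed.insert(&a.actor_apub_id) {
      continue; // same actor already sending: preserve per-actor order
    }
    batch.push(a);
  }
  batch
}
```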
So after updating to the main branch and a few smaller changes, the federation tests now actually pass (seemingly reliably) with the default configuration of concurrent_sends_per_instance=1. I've marked this PR as ready for review. In the latest commits I've changed the CI to actually run the federation tests twice (with a few necessary changes to make that work). The reason is that it's pretty tedious to always have to fully reset the databases in order to run the federation tests, so I think it makes sense to make them not dependent on running on fully empty databases.
```ts
let user = await registerUser(
  alpha,
  alphaUrl,
  "تجريب" + Math.random().toString().slice(2, 5),
```
Use the milliseconds since the epoch to make it simpler and more likely to produce a unique username:
"تجريب" + Math.random().toString().slice(2, 5), | |
"تجريب" + Date.now(), |
```diff
@@ -214,6 +229,7 @@ mod test {
     .app_data(context.clone())
     .build()
     .await?;
+  let federation_worker_config = FederationWorkerConfig::default(); // TODO
```
What
```rust
// handle_send_results does not guarantee that we are now in a condition where we want to
// send a new one, so repeat this check until the if no longer applies
continue;
} else {
```
This `else` is redundant because of the `continue`.
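A tiny standalone illustration of why (a made-up loop, not the PR's code):

```rust
fn drain(mut pending: Vec<i64>) {
  while let Some(id) = pending.pop() {
    if id % 2 == 0 {
      // `continue` already jumps back to the loop head, so everything
      // below is skipped and no `else` block is needed around it
      continue;
    }
    println!("sending activity {id}");
  }
}
```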
```rust
let last_successful_id = self.state.last_successful_id.map(|e| e.0).context(
  "impossible: id is initialized in get_latest_ids and never returned to None",
)?;
let expected_next_id = last_successful_id + (successfuls.len() as i64) + in_flight + 1;
// compare to next id based on incrementing
if expected_next_id != next_id_to_send.0 {
```
```suggestion
let expected_next_id = self
  .state
  .last_successful_id
  .map(|e| e.0 + (successfuls.len() as i64) + in_flight + 1);
// compare to next id based on incrementing
if expected_next_id != Some(next_id_to_send.0) {
```
```rust
if let Some(last) = self.state.last_successful_id {
  Ok((last, latest_id))
```
```suggestion
let last = if let Some(last) = self.state.last_successful_id {
  last
```
```rust
// instance skip all past activities:
```
```suggestion
// instance
// skip all past activities:
```
```rust
Ok((latest_id, latest_id))
}
```
```suggestion
latest_id
}
Ok((last, latest_id))
```
Currently, with the implementation of the federation queue from #3605 that was enabled in 0.19, federation is parallel across instances but sequential per receiving instance.
This means that the maximum throughput of activities is limited by the network latency between instances, as well as the internal latency of processing a single activity.
There is an extensive discussion here: #4529 (comment), though that issue itself is only about one sub-problem.
This PR changes the federation sending component to send activities in parallel, with a configurable maximum concurrency of 8. The implementation is more complex than I expected, since we need to keep track of last_successful_id (which must be the highest activity id below which every single activity has been sent successfully), and we need to track failure counts without immediately jumping to hour-long retry delays when 8 concurrent sends fail simultaneously.
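A sketch of the bookkeeping that makes last_successful_id safe under out-of-order completions (assuming dense activity ids; names are illustrative, not the PR's actual code):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct SendState {
  last_successful_id: i64,
  // completed ids that are not yet contiguous with last_successful_id
  successfuls: BinaryHeap<Reverse<i64>>,
}

impl SendState {
  /// Record a completed send and advance last_successful_id as far as the
  /// run of contiguous ids allows. Only this value may be persisted: a
  /// higher id with gaps below it could skip unsent activities on restart.
  fn mark_success(&mut self, id: i64) {
    self.successfuls.push(Reverse(id));
    while self.successfuls.peek() == Some(&Reverse(self.last_successful_id + 1)) {
      self.successfuls.pop();
      self.last_successful_id += 1;
    }
  }
}
```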
The implementation roughly works as follows:
In order for the results of this to still be correct, fixes need to be applied to make all activities commutative (as discussed above).
It should also be possible to make the concurrency happen only when necessary, since for most instance-to-instance connections it is not needed, which would reduce the ordering issue. That is not implemented here though.
Currently, this PR fails the federation tests. I think this is due both to a bug somewhere and to the ordering problem.