[rfc] new scheduler #32

Open
alkis opened this issue Jan 27, 2018 · 2 comments

@alkis
Contributor

alkis commented Jan 27, 2018

The current scheduler has the following problems:

  1. it needs configuration (workers, io_workers, run_on_io): it would be nicer if the user did not need to configure anything and yet got maximum performance all the time
  2. it has a global ready list: when scaling beyond a few cores this will become a bottleneck
  3. the timer thread is also global: another point of contention
  4. the event loop is on a separate thread
    Both 3) and 4) cause unnecessary OS context switches whenever there is a timer expiry or an I/O poll.

I propose the following design for a new scheduler, which I plan to implement. This is a request for comments. My understanding of may is not that deep, so it is possible that some things won't work :-)

  • may has N schedulers (S) where N is the number of CPUs of the machine
  • each S runs on its own kernel thread
  • each S has a single threaded eventloop (coros waiting for io) and a timerlist (coros waiting on timer)
  • each S has its own readylist which contains coroutines ready to run (crossbeam-deque: work-stealing)
  • each S has its own yieldlist (coros that yielded and can resume immediately)
  • may has a single list of parked threads (parked)
  • may has a counter of stealing threads (num_stealing); see the sketch after this list
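To make that layout concrete, here is a rough sketch of the per-scheduler state and the shared globals. The Coroutine, EventLoop and TimerList types are placeholders of mine, and the deque types follow crossbeam-deque's Worker/Stealer split, so treat this as illustration rather than the final shape:

use std::sync::Mutex;
use std::sync::atomic::AtomicUsize;
use std::thread::Thread;
use crossbeam_deque::{Stealer, Worker};

struct Coroutine;  // placeholder for may's coroutine handle
struct EventLoop;  // placeholder: single-threaded epoll/kqueue wrapper
struct TimerList;  // placeholder: coroutines waiting on a timer

// per-scheduler state: one instance per CPU, pinned to its own kernel thread
struct Scheduler {
    id: usize,
    eventloop: EventLoop,
    timerlist: TimerList,
    readylist: Worker<Coroutine>,       // local end of the work-stealing deque
    stealers: Vec<Stealer<Coroutine>>,  // handles into every other scheduler's readylist
    yieldlist: Vec<Coroutine>,          // coroutines that yielded and can resume immediately
}

// process-wide shared state
struct Globals {
    parked: Mutex<Vec<Thread>>,  // threads with nothing to do
    num_stealing: AtomicUsize,   // schedulers currently trying to steal
}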

The scheduling loop will look like this:

loop {
  // when the yieldlist is not empty we have coroutines that yielded but are ready to run.
  // in that case we only poll the eventloop and return as fast as possible
  let timeout = if !sched.yieldlist.is_empty() { 0 } else { sched.next_deadline() - now() };
  while let Some(co) = sched.yieldlist.pop_back() {
    sched.readylist.push(co);
  }
  // select() moves newly ready coroutines to the local readylist, each pushed to the front
  sched.eventloop.select(timeout);
  // if we have more than 1 coroutine ready to run, no threads are stealing,
  // and we have parked threads, unpark one to increase parallelism
  if sched.readylist.len() > 1 && num_stealing.load(Ordering::Acquire) == 0 && !parked.is_empty() {
    if let Some(t) = parked.pop() {
      t.unpark();
    }
  }
  // run coroutines until the readylist is empty
  while let Some(co) = sched.readylist.pop() {
    run_coroutine(co);
  }
  // we have cpu-bound coroutines that yielded: restart the loop to make more coroutines ready, expire timers, etc.
  if !sched.yieldlist.is_empty() {
    continue;
  }
  // we have no ready coroutines to run. time to steal!
  assert!(sched.readylist.is_empty());
  num_stealing.fetch_add(1, Ordering::Release);
  // see implementation below
  if sched.steal() {
    num_stealing.fetch_sub(1, Ordering::Release);  // done stealing
    continue;
  }
  // we didn't manage to steal anything, which means there is nothing to do
  num_stealing.fetch_sub(1, Ordering::Release);
  parked.push(thread::current());
  thread::park();
}
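One subtlety in the last two lines: between parked.push(...) and thread::park() another scheduler may already pop us and call unpark(). That is safe because std::thread::park() consumes a token that unpark() sets, so the park returns immediately instead of sleeping forever. A self-contained sketch of that handoff (the parked list here is just a Mutex<Vec<Thread>> for illustration):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let parked: Arc<Mutex<Vec<thread::Thread>>> = Arc::new(Mutex::new(Vec::new()));

    let worker = {
        let parked = parked.clone();
        thread::spawn(move || {
            // nothing to steal: register ourselves, then park
            parked.lock().unwrap().push(thread::current());
            // even if unpark() lands between push and park, the stored
            // token makes this park() return immediately instead of hanging
            thread::park();
            println!("worker woken up");
        })
    };

    // simulate the "readylist.len() > 1" branch: wake one parked thread
    loop {
        if let Some(t) = parked.lock().unwrap().pop() {
            t.unpark();
            break;
        }
        thread::yield_now(); // the worker may not have registered yet
    }

    worker.join().unwrap();
}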

To make stealing fast and avoid spurious park()/unpark(), we spin for a while trying to steal and then give up.

fn steal(&mut self) -> bool {
  let deadline = now() + 100ms;  // spin budget, needs tuning
  loop {
    let id = rand() % N;  // pick a random victim
    if id == self.id {  // stealing from ourselves is silly :-p
      continue;
    }
    let stolen = loop {
      match schedulers[id].readylist.steal() {
        Steal::Empty => break None,
        Steal::Data(co) => break Some(co),
        Steal::Retry => {}
      }
    };
    if let Some(co) = stolen {
      self.readylist.push(co);
      return true;
    }
    if now() > deadline {
      return false;
    }
  }
}
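For reference, the inner retry loop maps directly onto crossbeam-deque; note that the variant names differ between crate versions (the success variant is Steal::Data in some releases and Steal::Success in others), so adjust to whatever is in Cargo.toml. A self-contained example against the Worker/Stealer API:

use crossbeam_deque::{Steal, Worker};

fn main() {
    // the victim's local queue plus a Stealer handle into it
    let victim: Worker<u32> = Worker::new_fifo();
    let stealer = victim.stealer();
    victim.push(42);

    // retry on contention, exactly as in steal() above
    let stolen = loop {
        match stealer.steal() {
            Steal::Empty => break None,
            Steal::Success(co) => break Some(co),
            Steal::Retry => {}
        }
    };
    assert_eq!(stolen, Some(42));
}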
@Xudong-Huang
Owner

Xudong-Huang commented Jan 28, 2018

Thanks for the proposal! Here are some comments on your points.

it needs configuration (workers, io_workers, run_on_io): it would be nicer if the user did not need to configure anything and yet got maximum performance all the time

We still need configuration for the coroutine stack size, so we can't avoid configuration entirely. And it's not a bad thing that users can adjust and tune for their own applications.
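For example, today the tuning looks roughly like this (from memory, please check the docs for the exact method names):

use may::go;

fn main() {
    // global knobs, set once before any coroutine is spawned
    may::config()
        .set_workers(4)          // could disappear under the proposed design
        .set_io_workers(2)       // likewise
        .set_stack_size(0x2000); // the knob that has to stay

    // spawn a coroutine with the configured stack size
    go!(|| println!("hello from a coroutine")).join().unwrap();
}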

it has a global ready list: when scaling beyond a few cores this will become a bottleneck

Yes, there is a contention point here and it does not scale well. I used to have a queue for each worker thread, but the result was not as good as I expected: each worker needed to steal from all the other worker queues, which is somewhat time consuming. Maybe my implementation was just not good enough. We can definitely improve here 😄

the timer thread is also global: another point of contention

The global timer thread just throws timed-out coroutines onto the ready list; it does not run any coroutines itself. The delay is not significant here since the coroutine has already timed out.
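Conceptually the timer thread only does something like this each tick (a paraphrase with stand-in types, not the actual may code):

use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

type Coroutine = u32; // stand-in for may's coroutine handle

// the timer thread never runs a coroutine: it only moves expired
// ones onto the shared ready list
fn timer_tick(
    timers: &mut BinaryHeap<Reverse<(Instant, Coroutine)>>,
    readylist: &Mutex<Vec<Coroutine>>,
) {
    let now = Instant::now();
    while let Some(Reverse((deadline, _))) = timers.peek().copied() {
        if deadline > now {
            break; // nothing expired yet
        }
        let Reverse((_, co)) = timers.pop().unwrap();
        readylist.lock().unwrap().push(co); // just hand it off
    }
}

fn main() {
    let mut timers = BinaryHeap::new();
    let readylist = Mutex::new(Vec::new());
    timers.push(Reverse((Instant::now() - Duration::from_millis(1), 7u32)));
    timer_tick(&mut timers, &readylist);
    assert_eq!(*readylist.lock().unwrap(), vec![7]);
}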

the event loop is on a separate thread
Both 3) and 4) cause unnecessary OS context switches whenever there is a timer expiry or an I/O poll

Each I/O worker thread has its own timeout checking, so it does not contend with any other threads. And it runs coroutines directly on its own instead of pushing them to the ready list, so there is no contention here either.

It's good to have a more advanced scheduler. I'd like to see that happen; then we can test and compare whether it's better.

@alkis
Contributor Author

alkis commented Jan 28, 2018
