
Adjust io-threads number in runtime to fully utilize multi threads and make CPU more efficient. #111

Open · wants to merge 22 commits into base: unstable

Conversation

@lipzhu lipzhu commented Apr 1, 2024

Description

This patch addresses an issue where, even when an io-threads number is configured, the IO threads are stopped by the logic below, leaving them unused and increasing latency. In the existing logic, the IO threads are only activated past a hardcoded threshold, so QPS regresses depending on the configured io-threads number even under the same client load.

int stopThreadedIOIfNeeded(void) {
    int pending = listLength(server.clients_pending_write);

    /* Return ASAP if IO threads are disabled (single threaded mode). */
    if (server.io_threads_num == 1) return 1;

    if (pending < (server.io_threads_num*2)) {
        if (server.io_threads_active) stopThreadedIO();
        return 1;
    } else {
        return 0;
    }
}

With this patch, the number of active IO threads is adjusted dynamically at runtime according to the real workload, trading off latency against CPU efficiency. I use a decay-rate (EMA) formula based on clients_pending_write to keep throughput stable.
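The decay-rate idea can be sketched roughly as follows. This is an illustrative toy, not the patch's actual code: the function name, the DECAY weight, and the pending-writes-per-thread ratio are all assumptions.

```c
#include <assert.h>

/* Illustrative sketch of the EMA/decay idea described above; the function
 * name, DECAY, and PENDING_PER_THREAD are hypothetical, not the patch's
 * actual values. The smoothed pending-write count rises quickly under load
 * and decays gradually when load drops, so the active IO thread count
 * quiesces smoothly instead of cutting off at a hard threshold. */
#define IO_THREADS_MAX 4     /* configured io-threads (includes main thread) */
#define PENDING_PER_THREAD 2 /* pending writes each extra IO thread is worth */
#define DECAY 0.8            /* weight given to the previous estimate */

static double ema_pending = 0.0;

int adjustIOThreadsIfNeeded(int pending_writes) {
    /* Exponential moving average of the clients_pending_write length. */
    ema_pending = DECAY * ema_pending + (1.0 - DECAY) * pending_writes;
    int target = (int)(ema_pending / PENDING_PER_THREAD) + 1; /* main thread counts as 1 */
    if (target > IO_THREADS_MAX) target = IO_THREADS_MAX;
    if (target < 1) target = 1;
    return target;
}
```

Under a sustained burst the target climbs to the configured maximum within a few event-loop iterations; once the burst ends it decays back toward 1 over tens of iterations, which is the smoothing behavior that avoids the QPS cliff of a hard threshold.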

Benchmark Result

Test Environment

  • OS: CentOS Stream 8
  • Kernel: 6.2.0
  • Platform: Intel Xeon Platinum 8380
  • Base commit: b2a3973

Test Steps

Start Valkey-Server

Start valkey-server with taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf using the config below.

port 9001
bind * -::*
daemonize yes
protected-mode no
save ""
io-threads 4
io-threads-do-reads yes

Test from throughput perspective

Start valkey-benchmark on localhost with the command below:
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 10
to measure SET and GET performance. Results are listed below: the IO threads are not used because clients_pending_write < (server.io_threads_num*2), so throughput is limited to a single thread even though more CPU resources were allocated.

NOTE: QPS is reported as a relative value instead of an absolute value.
CPU utilized: perf stat -p `pidof valkey-server` sleep 5

        Base CPU utilized   Opt CPU utilized   QPS Gain (Opt/Base-1)
SET     1                   2                  23%
GET     1                   2                  25%

Test from CPU efficiency perspective

If we increase the number of requests with the command below so that the original io-threads logic kicks in, throughput increases as expected, but all CPUs are fully utilized. With this patch, we can make CPU usage more efficient by reducing the number of active IO threads.
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 15

        Base CPU utilized   Opt CPU utilized   QPS Gain (Opt/Base-1)
SET     3.999               2.908              8%
GET     3.999               2.918              3%

One test case issue is related to #397

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

lipzhu commented Apr 4, 2024

@valkey-io/core-team Could you help to review this patch?

@madolson madolson self-requested a review April 4, 2024 15:50

madolson commented Apr 4, 2024

@lipzhu I noticed there wasn't a test; we should at least add one for that.


lipzhu commented Apr 7, 2024

@madolson I am thinking about how to add unit tests for io-threads. Since this is for performance purposes, I am uncertain how to simulate it in unit tests; can you provide some guidance?

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

lipzhu commented Apr 11, 2024

Ping @madolson .

}
}

test "Enable io-threads" {
Member:

Can we add a test case that validates

  1. the active thread count can increase to the max
  2. the active thread count can decay back to 1

Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks but I think it will be worth the effort.

Contributor Author:

Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks but I think it will be worth the effort

Can you provide more guidance on how to introduce additional DEBUG hooks? Locally, I used memtier_benchmark -c 25 -t 4 to simulate multiple clients/requests. I am not sure whether the TCL test framework has a similar API, or where I can test the correctness; should I debug-print the io_threads_active_num?

Contributor Author:

Kindly ping @PingXie

Member:

I was thinking about adding a new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c, but it doesn't look like you've completed it). After seeing #344, I think a real unit test would be preferable.

Contributor Author:

@PingXie Thanks for your effort to review this.

I was thinking about adding a new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c but it doesn't look like you've completed it)

I just added a monitor for the DEBUG io-threads command. Sorry, I didn't get your point about modifying the engine behavior; can you be more specific?

After seeing #344, I think a real unit test would be more preferable.

Do you want the unit test added in this PR, or can we add it once the new test framework is merged?

Member:

can you be more specific?

No worries. I was thinking of manipulating a data structure like clients_pending_write in a new DEBUG command to simulate different workloads. Now that we will have a true unit test framework, I don't think we should bother with this hack at all. Do you mind reverting the changes in debug.c?

Do you want the unit test added in this PR, or can we add it once the new test framework is merged?

I think @madolson is actively working on #344 and it is close to completion. That said, I am fine with adding the tests in a separate PR; if you decide to do so, could you open an issue and assign it to yourself?

Contributor:

I like the idea of adding some DEBUG commands so we can properly do TCL tests of IO threading features. It probably doesn't make sense to play with the pending writes directly, but something like DEBUG set-active-io-thread-count could be a good DEBUG command, even outside of just TCL testing, so that we can force IO threads on and off. E.g. I remember debugging some TLS related issues with IO threading that only happened when IO threads turned off, and it required me to run and stop valkey-benchmark repeatedly.

So I would suggest:

  1. Unit testing the "pending writes" to "active IO threads" logic
  2. E2E testing the whole IO thread flow with a DEBUG command that sends commands while adjusting IO thread count up and down

But we can follow up on those improvements

WDYT?
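As a rough sketch of what the suggested DEBUG set-active-io-thread-count argument handling could look like (the function and variable names here are hypothetical illustrations, not part of this PR or of Valkey):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of argument validation for a hypothetical
 * "DEBUG set-active-io-thread-count <n>" subcommand; all names are
 * invented for illustration. It rejects non-numeric input and counts
 * outside the configured [1, io-threads] range before forcing the
 * active thread count. */
static int io_threads_num = 4;        /* configured io-threads (includes main thread) */
static int io_threads_active_num = 1; /* currently active, minimum is 1 */

int debugSetActiveIOThreadCount(const char *arg) {
    char *end;
    long n = strtol(arg, &end, 10);
    if (*end != '\0' || n < 1 || n > io_threads_num) return -1; /* reject */
    io_threads_active_num = (int)n;
    return 0;
}
```

A TCL test could then drive the active count up and down deterministically instead of depending on benchmark-generated load.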

Member:

That is a good point. I think it would be good to address it separately though. Let's open a separate issue?

For this change, I am mostly interested in proving that the EMA logic is working properly (it can scale up and also scale back down) so I think a real unit test would be a lot less flaky to maintain in the long run.

Contributor Author:

@PingXie #396 #397 were created to track the debug commands and tests. @murphyjacob4 please feel free to add. BTW, I have no permission to assign the issues.


PingXie commented Apr 24, 2024

My understanding of this change is that it helps with volatile workloads where request rates fluctuate quite a bit. This change doesn't allocate/activate more IO worker threads at runtime, but it allows the IO worker threads to stay active a bit longer using an Exponential Moving Average (EMA). This trades a bit of CPU efficiency for more consistent latency through these workload spikes/troughs, which can be significant.

Overall I am on board with this change. The only thing I am still looking for is a test that proves we can still activate all the threads configured via io-threads and that the IO threads can indeed quiesce.


PingXie commented Apr 24, 2024

I am thinking about how to add unit tests for io-threads. Since this is for performance purposes, I am uncertain how to simulate it in unit tests; can you provide some guidance?

I don't think we need a (unit) test for performance. We will use release candidates to get feedback. What we really need is a correctness test as I mentioned above.

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@lipzhu lipzhu requested a review from PingXie April 26, 2024 02:24

codecov bot commented Apr 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.47%. Comparing base (6fb90ad) to head (a0848a8).
Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #111      +/-   ##
============================================
+ Coverage     70.02%   70.47%   +0.45%     
============================================
  Files           109      109              
  Lines         59957    59955       -2     
============================================
+ Hits          41985    42254     +269     
+ Misses        17972    17701     -271     
Files Coverage Δ
src/networking.c 91.80% <100.00%> (+6.44%) ⬆️
src/server.c 88.93% <100.00%> (+<0.01%) ⬆️

... and 11 files with indirect coverage changes

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

lipzhu commented Apr 28, 2024

@PingXie I added an integration test to validate that the io_threads_active_num EMA logic works properly. Could you review again?

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@lipzhu lipzhu requested a review from PingXie April 29, 2024 07:26
@PingXie PingXie (Member) left a comment:
LGTM. Some final touches and this PR is good to go.

Comment on lines 40 to 45
regexp {io_threads_active:([0-9]+)} $server_info -> \
io_threads_active
regexp {io_threads_maximum_num:([0-9]+)} $server_info -> \
io_threads_maximum_num
regexp {io_threads_active_num:([0-9]+)} $server_info -> \
io_threads_active_num
Member:

Suggested change
regexp {io_threads_active:([0-9]+)} $server_info -> \
io_threads_active
regexp {io_threads_maximum_num:([0-9]+)} $server_info -> \
io_threads_maximum_num
regexp {io_threads_active_num:([0-9]+)} $server_info -> \
io_threads_active_num
set io_threads_active [get_info_field $server_info io_threads_active]
set io_threads_maximum_num [get_info_field $server_info io_threads_maximum_num]
set io_threads_active_num [get_info_field $server_info io_threads_active_num]

Contributor Author:

Thanks. proc get_info_field is a common proc in the cluster module; I just moved it to support/util.tcl.


# Scale up io_threads_active_num by sending large number requests to server.
set bench_pid [start_benchmark 10 50]
wait_for_condition 1000 500 {
Member:

This is quite a loop but judging by your test runs, this test doesn't seem to increase the test execution time.

Another concern of mine is the potential flakiness of this test, since it depends on non-deterministic behavior by nature.

That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.

Contributor Author:

this test doesn't seem to increase the test execution time

It depends on the write speed; the condition of reaching a total count of 10000 should be met very quickly.

Another concern of mine is the potential flakiness of this test, since it depends on non-deterministic behavior by nature.

That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.

Got it.

@PingXie PingXie (Member) left a comment:

Thanks @lipzhu!

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

lipzhu commented May 7, 2024

What is the next step? Is it ready to merge? @PingXie

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@madolson madolson changed the title Adjust io-threads number in runtime to fully utilize multi threads and make CPU more efficiency. Adjust io-threads number in runtime to fully utilize multi threads and make CPU more efficient. May 23, 2024
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@madolson madolson (Member) left a comment:

LGTM, just two minor comments.

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@madolson madolson added major-decision-pending Major decision pending by TSC team needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. labels May 28, 2024
@madolson (Member):

@valkey-io/core-team Need approval for adding a single new metric, io_threads_active_num, which will indicate the number of IO threads that are actively doing work. Some of the work in this PR overlaps with the other performance work that is ongoing, but the tests should all largely still be applicable. Asking for an async vote for the new metric.

The real world usage of the metric is probably so an operator can understand how many of the CPUs are actually doing useful work when debugging.

@zuiderkwast (Contributor):

For some context, we already have io_threads_active 0 or 1 (on/off). I suppose we need to keep this one as is...

The config io-threads includes the main thread in the count, so the minimum is 1. Will io_threads_active_num include the main thread too?


lipzhu commented May 29, 2024

For some context, we already have io_threads_active 0 or 1 (on/off). I suppose we need to keep this one as is...

The io_threads_active is kept.

The config io-threads includes the main thread in the count, so the minimum is 1. Will io_threads_active_num include the main thread too?

Yes, io_threads_active_num starts from 1; the main thread is included.

Comment on lines -4259 to -4333
void startThreadedIO(void) {
serverAssert(server.io_threads_active == 0);
for (int j = 1; j < server.io_threads_num; j++) pthread_mutex_unlock(&io_threads_mutex[j]);
server.io_threads_active = 1;
}
Contributor:

These changes will just cause merge conflicts for the Async IO threads feature. It's better that we don't merge it now.

Contributor Author:

I saw that the Async IO threads work also removes startThreadedIO and uses a dynamic thread count, which is similar to this patch. The difference is that Async IO threads keep a threshold based on the real-time event count, while this patch uses EMA to achieve more consistent latency through request spikes/troughs.

Maybe we could merge this first and then rebase Async IO threads on top? If needed, I could help with the rebase.

Contributor:

@uriyage WDYT?

Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>

lipzhu commented Jun 3, 2024

So we are still waiting for approval from @soloestoy and @enjoy-binbin; could you take a look at this when you are free?

@enjoy-binbin enjoy-binbin (Member) left a comment:

LGTM, did not review the tests
