poor usage of (?) rio slower than sequential read(2) i/o #5

Open
sirupsen opened this issue Feb 1, 2020 · 12 comments

sirupsen commented Feb 1, 2020

Hey Tyler, been eagerly following your progress with rio and sled on Twitter. Been tracking io_uring for a while, decided to finally spend some time programming in it today.

I've been working on moving my I/O benchmarks in napkin-math to use io_uring on Linux. I started with the sequential test case, as it seemed the easiest place to start (I skimmed through Axboe's io-uring-cp(1) too, since it's …).

I had some trouble allocating the buffers for all these reads, FWIW; maybe that's something rio could help with. After poking around the API docs, I found a way to get a mutable slice without upsetting the borrow checker, with unsafe (🙈).
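
Roughly the kind of thing I ended up with for the buffers (a simplified sketch, not the exact code from the gist; the constants are just illustrative):

```rust
// Sketch: carve one backing allocation into QUEUE_DEPTH disjoint &mut [u8]
// buffers so each in-flight read gets its own slice, without the borrow
// checker complaining about multiple mutable borrows of the same Vec.
const QUEUE_DEPTH: usize = 64;
const BUFFER_SIZE: usize = 256 * 1024;

fn main() {
    let mut backing = vec![0u8; QUEUE_DEPTH * BUFFER_SIZE];
    let base = backing.as_mut_ptr();

    let buffers: Vec<&mut [u8]> = (0..QUEUE_DEPTH)
        .map(|i| unsafe {
            // Safety: the chunks are disjoint and `backing` outlives them here.
            std::slice::from_raw_parts_mut(base.add(i * BUFFER_SIZE), BUFFER_SIZE)
        })
        .collect();

    // Each slice can now be handed to a separate read submission.
    assert_eq!(buffers.len(), QUEUE_DEPTH);
}
```

(For what it's worth, chunks_mut on the backing Vec would give the same disjoint slices without any unsafe.)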

To my surprise, io_uring kept performing worse than simple read(2), even though I'm re-using the buffers. I tried tuning the size of the buffers and the depth of the queue, but to no avail (the settings in the script below are the ones Axboe uses in his cp example, so they seem sensible). Do note that I'm fairly inexperienced with Rust, so I may be missing things 👀

I've tried to make this very easy to reproduce. If you throw this gist into examples/sequential_io.rs and run it with cargo run --release --example sequential_io, you should see something along the lines of:

curl https://gist.githubusercontent.com/sirupsen/dca5e3483fec7cd06f2f98e1934f5b85/raw/1150c7a3d7a98127ee1111a810a63fb2459a7913/sequantial_io.rs > examples/sequential_io.rs
cargo run --release --example sequential_io
# 512mb, sequential, summing
read(2) done in 250ms, checksum: 536870912
io_uring(2) done in 295ms, checksum: 536870912

Edit: there were some concerns about the summing overhead (see my comment below), but the same pattern shows up when it's eliminated.

I'm on 5.3.0-29-generic. I know I could be on something more recent, but that's the newest Ubuntu kernel I could dig up that doesn't require recompiling perf etc.

Edit: I upgraded to 5.5.1, and results are the same.

sirupsen changed the title from "rio slower than sequential read(2) i/o" to "poor usage of (?) rio slower than sequential read(2) i/o" on Feb 1, 2020

Licenser commented Feb 2, 2020

As an outside observer who finds this very fascinating, I would love to throw in a few thoughts. That said, feel free to disregard them :)

The two algorithms seem very different. Let me try to phrase them to make sure I didn't misunderstand what happens.

The sequential one alternates between reading and computation, reading and computation, and so on. The io_uring one does all the reading first, then all the computation.

More precisely, the io_uring one submits QUEUE_DEPTH parallel reads on consecutive chunks, does the computation, then repeats with the next batch of consecutive chunks, etc.

So, to throw out a few theories (from normal to far out there):

File systems and disks (even NVMe disks) are spectacularly good at 100% sequential I/O. Depending on the FS (I don't know Apple's well enough, but I think ZFS does this, for example), they will at times pre-read the next few chunks of a sequential access and have them ready and cached if they think you might need them. So the 100% sequential access in the first example might have an edge here: it gets recognized as such and optimized at the FS level. I don't know enough about io_uring's internals, but filing QUEUE_DEPTH requests at once might run them 'out of order', and while NVMe disks are fairly good at that, they still aren't as fast at random access as at sequential access. If I remember right, that has to do with them being forced to read a certain number of bytes at a time; if the requests don't align, that causes extra (hardware-level) reads which can be optimized out in a purely sequential access.
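
One cheap way to poke at that theory: explicitly hint sequential access on the fd in both variants, so the page-cache readahead at least gets the same information either way (a sketch using the libc crate; the file path is a placeholder):

```rust
// Hint the kernel that this file will be read sequentially, which makes the
// page-cache readahead more aggressive. Linux-only; uses the raw libc binding.
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn advise_sequential(file: &File) -> std::io::Result<()> {
    // offset 0, len 0 means "the whole file".
    let ret = unsafe {
        libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_SEQUENTIAL)
    };
    if ret != 0 {
        return Err(std::io::Error::from_raw_os_error(ret));
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let file = File::open("some_large_file")?; // placeholder path
    advise_sequential(&file)
}
```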

The boring theory: parallel requests in a purely sequential workload like this are likely to always lose unless you can recover some of the cost of the parallelism by multiplexing work, and the example still waits for all requests to be completed. Perhaps moving the sum loop inside the wait loop would give a better result? Basically, what I fear is that the way the code is structured, the process has to .wait for the slowest request to finish and just idles in that time instead of performing work it could already do. It seems to have all the cost of serial code (having to wait) plus all the overhead of parallel code.
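
Something along these lines is what I mean: sum each chunk as soon as its completion is reaped, while the remaining reads are still in flight. This is only a sketch; it assumes rio's read_at / wait API as shown in its README, and that Vec<u8> satisfies its buffer traits.

```rust
use std::fs::File;

const QUEUE_DEPTH: usize = 64;
const BUFFER_SIZE: usize = 256 * 1024;

fn main() -> std::io::Result<()> {
    let ring = rio::new()?;
    let file = File::open("some_large_file")?; // placeholder path
    let buffers: Vec<Vec<u8>> =
        (0..QUEUE_DEPTH).map(|_| vec![0u8; BUFFER_SIZE]).collect();

    let mut offset = 0u64;
    let mut sum = 0u64;

    // Submit one batch of QUEUE_DEPTH reads...
    let completions: Vec<_> = buffers
        .iter()
        .map(|buf| {
            let completion = ring.read_at(&file, buf, offset);
            offset += BUFFER_SIZE as u64;
            completion
        })
        .collect();

    // ...then do the computation per chunk as each read completes, instead of
    // waiting for the whole batch before touching any data.
    for (completion, buf) in completions.into_iter().zip(buffers.iter()) {
        let read = completion.wait()?;
        sum += buf[..read].iter().map(|&b| u64::from(b)).sum::<u64>();
    }

    println!("checksum: {}", sum);
    Ok(())
}
```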

On this one I might just be overly wary because I recently got bitten by a huge performance drop from exactly this, but the BUFFER_SIZE chunk being summed will probably reside entirely in the CPU cache between reading and calculation and never have to hit main memory. The io_uring example reads 32 times as much before summing, so you might hit some cache limitations, especially with the access being non-linear and other things in the background (while reading) also wanting their chunk of the cache. Then again, as I said, I just got scared by this, so I see cache issues everywhere now 😂

Last, and really far out there: thermals? Read, calculate, read, calculate might give the components a bit more time to cool down and prevent throttling (both the NVMe and the CPU), especially since this is a laptop? OTOH it feels like a very short period, so I'd be surprised if this were it.

EDITED:

Perhaps an interesting secondary comparison would be to calculate the sum over N files, once in succession and once in parallel using io_uring?

sirupsen commented Feb 2, 2020

I absolutely agree with you. I didn't necessarily expect io_uring to win by much on a sequential benchmark, but I did expect it to win by some margin by avoiding the read(2) syscall overhead. It'd likely still be slower than mmap(2)'ing the file, and I know io_uring really shines on out-of-order operations. This also seems more likely to be an issue with my implementation or with rio than with io_uring itself?
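
For reference, the mmap(2) baseline I have in mind would be something like this (a sketch using the memmap crate; the file path is a placeholder):

```rust
use memmap::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("some_large_file")?; // placeholder path
    // Safety: we assume the file isn't truncated while it's mapped.
    let mmap = unsafe { Mmap::map(&file)? };

    // Same trivial checksum as the other variants, over the mapped bytes.
    let sum: u64 = mmap.iter().map(|&b| u64::from(b)).sum();
    println!("mmap checksum: {}", sum);
    Ok(())
}
```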

It turns out the summing cost was much higher than I expected! I dramatically underestimated it; I only had it there to ensure that both variants were reading the same number of bytes. Without it, they're both much faster on larger inputs, but io_uring still loses:

# 512mb, sequential, minimal summing
read(2) done in 70ms
io_uring(2) done in 123ms

I'm still doing some summing to prevent the compiler from optimizing the reads out; the black_box approach slows down compilation like crazy.

FWIW, I tried to use poll in the summing above, via the futures trait, to do something epoll-like and not block, but I kept getting "poll not defined". It was as if it wasn't picking up the trait 🤷‍♂
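
What I was going for was roughly this (a sketch; it assumes Completion implements Future, as rio's async examples suggest, and uses the futures crate's executor rather than calling poll by hand):

```rust
use futures::executor::block_on;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let ring = rio::new()?;
    let file = File::open("some_large_file")?; // placeholder path
    let buf = vec![0u8; 4096];

    // Drive the completion with an executor: poll() lives on
    // std::future::Future and needs a Pin + Context to call directly,
    // so .await inside an async block is the easier path.
    let read = block_on(async { ring.read_at(&file, &buf, 0).await })?;
    println!("read {} bytes", read);
    Ok(())
}
```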

twissel commented Feb 2, 2020

Here are some numbers from fio:

fio --name=read --ioengine=sync --iodepth=1 --rw=read  --bs=256k --direct=0 --size=512M --numjobs=1 --runtime=10 --time_based=1
read: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=sync, iodepth=1
fio-3.16-52-gb0484
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2727MiB/s][r=10.9k IOPS][eta 00m:00s]
read: (groupid=0, jobs=1): err= 0: pid=10406: Sun Feb  2 21:36:35 2020
  read: IOPS=10.8k, BW=2710MiB/s (2842MB/s)(26.5GiB/10001msec)
    clat (usec): min=43, max=7176, avg=81.19, stdev=30.10
     lat (usec): min=43, max=7177, avg=81.24, stdev=30.10
    clat percentiles (usec):
     |  1.00th=[   74],  5.00th=[   77], 10.00th=[   77], 20.00th=[   78],
     | 30.00th=[   78], 40.00th=[   78], 50.00th=[   79], 60.00th=[   79],
     | 70.00th=[   79], 80.00th=[   80], 90.00th=[   81], 95.00th=[   90],
     | 99.00th=[  167], 99.50th=[  231], 99.90th=[  371], 99.95th=[  429],
     | 99.99th=[  437]
   bw (  MiB/s): min= 2605, max= 2832, per=99.80%, avg=2704.92, stdev=72.39, samples=19
   iops        : min=10420, max=11328, avg=10819.68, stdev=289.57, samples=19
  lat (usec)   : 50=0.04%, 100=96.55%, 250=3.00%, 500=0.41%
  lat (msec)   : 2=0.01%, 10=0.01%
  cpu          : usr=0.45%, sys=71.91%, ctx=107529, majf=0, minf=73
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=108429,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2710MiB/s (2842MB/s), 2710MiB/s-2710MiB/s (2842MB/s-2842MB/s), io=26.5GiB (28.4GB), run=10001-10001msec

Disk stats (read/write):
  nvme0n1: ios=107162/2, merge=0/3, ticks=11835/8, in_queue=8, util=89.52%

fio --name=read --ioengine=io_uring  --iodepth=64 --rw=read  --bs=256k --direct=0 --size=512M --numjobs=1 --runtime=10  --time_based=1
read: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T) 256KiB-256KiB, ioengine=io_uring, iodepth=64
fio-3.16-52-gb0484
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=2752MiB/s][r=11.0k IOPS][eta 00m:00s]
read: (groupid=0, jobs=1): err= 0: pid=10516: Sun Feb  2 21:38:15 2020
 read: IOPS=11.2k, BW=2806MiB/s (2942MB/s)(27.4GiB/10005msec)
   slat (nsec): min=391, max=326652, avg=1650.22, stdev=1577.09
   clat (usec): min=336, max=38025, avg=5687.37, stdev=4112.83
    lat (usec): min=342, max=38027, avg=5689.10, stdev=4112.85
   clat percentiles (usec):
    |  1.00th=[ 2409],  5.00th=[ 4752], 10.00th=[ 4817], 20.00th=[ 4948],
    | 30.00th=[ 4948], 40.00th=[ 4948], 50.00th=[ 4948], 60.00th=[ 4948],
    | 70.00th=[ 5014], 80.00th=[ 5014], 90.00th=[ 5211], 95.00th=[ 6652],
    | 99.00th=[29492], 99.50th=[30540], 99.90th=[32375], 99.95th=[33817],
    | 99.99th=[36963]
  bw (  MiB/s): min= 2735, max= 2914, per=99.98%, avg=2805.28, stdev=61.34, samples=20
  iops        : min=10940, max=11658, avg=11221.20, stdev=245.22, samples=20
 lat (usec)   : 500=0.05%, 750=0.07%, 1000=0.08%
 lat (msec)   : 2=0.56%, 4=1.26%, 10=94.93%, 20=0.02%, 50=3.03%
 cpu          : usr=1.73%, sys=17.98%, ctx=108553, majf=0, minf=9
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=112287,0,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  READ: bw=2806MiB/s (2942MB/s), 2806MiB/s-2806MiB/s (2942MB/s-2942MB/s), io=27.4GiB (29.4GB), run=10005-10005msec

Disk stats (read/write):
 nvme0n1: ios=110927/90, merge=0/43, ticks=504330/157, in_queue=400516, util=89.92%

io_uring is slightly faster, at least in this test case.

Licenser commented Feb 3, 2020

I'll throw this up here: the flamegraph of running the test (100 times).

[flamegraph image]

Notice that it is still going through the syscalls. After some digging: io_uring requires SQPOLL to bypass them. I added an updated gist that uses that option (note: it needs to run as a privileged user), but it fails to run and I'm not sure why.

https://gist.github.com/Licenser/45786698fc8a3ad78a957d8618882d0d

Perhaps someone can help with a hint as to why it ends up with:

❯ sudo target/release/examples/seq_read
read(2) done in 39ms, checksum: 536887297
thread '<unnamed>' panicked at 'error in cqe reaper: Os { code: 6, kind: Other, message: "No such device or address" }', src/io_uring/cq.rs:96:17

twissel commented Feb 3, 2020

Because you need to register files for SQPOLL:

To  successfully use this feature, the application must register
a set of files to be used for  IO  through  io_uring_register(2)
using  the  IORING_REGISTER_FILES  opcode. Failure to do so will
result in submitted IO being errored with EBADF.
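
At the syscall level that registration looks roughly like this (a sketch, not rio's API; it needs access to the raw ring fd, and submissions then have to reference the registered index with IOSQE_FIXED_FILE set):

```rust
use std::os::unix::io::RawFd;

// x86_64 syscall number and opcode for io_uring_register(2); see the man page.
const SYS_IO_URING_REGISTER: libc::c_long = 427;
const IORING_REGISTER_FILES: libc::c_ulong = 2;

/// Register a fixed file set on an existing ring so SQPOLL submissions
/// aren't rejected with EBADF.
unsafe fn register_files(ring_fd: RawFd, fds: &[RawFd]) -> std::io::Result<()> {
    let ret = libc::syscall(
        SYS_IO_URING_REGISTER,
        ring_fd as libc::c_ulong,
        IORING_REGISTER_FILES,
        fds.as_ptr(),
        fds.len() as libc::c_uint,
    );
    if ret < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(())
    }
}
```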

Licenser commented Feb 3, 2020

Thanks!

twissel commented Feb 3, 2020

Why are you waiting for all (QUEUE_DEPTH) cqes to complete here? https://gist.github.com/Licenser/45786698fc8a3ad78a957d8618882d0d#file-black_box-rs-L90

Licenser commented Feb 3, 2020

I had a different implementation that used a VecDeque, with no performance difference. I wanted to introduce SQPOLL with as little change to the original as possible, to prevent cross-contamination between changes. I'll try to get the poll to work tomorrow if sirupsen doesn't beat me to it ;)

twissel commented Feb 3, 2020

SQPOLL won't change anything; I'm easily hitting my NVMe disk's read limit without it.

Licenser commented Feb 4, 2020

It's less about outperforming the plain file read than about getting into the same ballpark. At the moment the code above is about 2x slower at reading the file with rio than the old-fashioned way.

Licenser commented Feb 4, 2020

So I looked into how to register a file; it sounds a lot easier than it is 😂

SQPOLL doesn't seem to be supported by rio at the moment, so I tried to add it in #8, but I am somewhat lost as to why it doesn't work.

https://github.com/axboe/liburing/blob/master/man/io_uring_register.2 is the best explanation I've found so far, and I think I followed the requirements, but I'm probably missing something sneaky.

Licenser mentioned this issue Feb 8, 2020
chrisvest commented

I noticed when working on chrisvest/xxv#18 that if you have small buffers and/or a small queue depth, then rio spends a fair amount of time on the futex calls servicing wait()/await. Using big buffers amortises those futex system calls and brings performance closer to what one sees in fio with smaller buffers. I suspect that a design that instead puts the completion queue behind a thin queue-endpoint-like veneer might perform better, since no locking might be necessary when completions are readily available, and it would also naturally let you process completions in arrival order instead of submission order. That's quite a departure from the current design, though.
