
optimize Stream flatMapPar #8863

Merged · 14 commits · May 22, 2024
Conversation

eyalfa (Contributor) commented May 16, 2024

I've introduced a benchmark for flatMapPar; initial results:

Benchmark                          (chunkCount)  (chunkSize)  (parChunkSize)   Mode  Cnt  Score   Error  Units
StreamParBenchmark.akkaFlatMapPar         10000         5000              50  thrpt   15  1.208 ± 0.011  ops/s
StreamParBenchmark.fs2FlatMapPar          10000         5000              50  thrpt   15  0.124 ± 0.009  ops/s
StreamParBenchmark.zioFlatMapPar          10000         5000              50  thrpt   15  0.201 ± 0.002  ops/s

This PR introduces two alternate implementations of ZChannel.mergeAllWith; one of them is roughly 50% faster than the original:

Benchmark                          (chunkCount)  (chunkSize)  (parChunkSize)   Mode  Cnt  Score   Error  Units
StreamParBenchmark.zioFlatMapPar         10000         5000              50  thrpt   15  0.303 ± 0.002  ops/s

The other is over four times faster:

Benchmark                                   (chunkCount)  (chunkSize)  (parChunkSize)   Mode  Cnt   Score   Error  Units
StreamParBenchmark.zioFlatMapPar                   10000         5000              50  thrpt   15   0.866 ± 0.013  ops/s
  • The slower implementation is still required to support MergeStrategy.BufferSliding, hence it could not be discarded.

The first implementation basically reuses the techniques from #8819; it also uses multiple Refs to reduce contention when updating the OutDone value.
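The multiple-Refs idea can be sketched like this (a hypothetical helper, not the PR's actual code): shard the in-progress OutDone across several Refs so that concurrently completing fibers rarely contend on the same cell, and combine the shards only once, at the end.

```scala
import zio._

// Hypothetical sketch: the merged OutDone is sharded across multiple Refs.
// Each fiber folds its completion value into its own shard, so concurrent
// updates hit different cells; the shards are combined once at the end.
final case class ShardedDone[OutDone](cells: Chunk[Ref[Option[OutDone]]]) {

  // Fold a fiber's completion value into its own shard.
  def update(shard: Int, done: OutDone)(f: (OutDone, OutDone) => OutDone): UIO[Unit] =
    cells(shard).update {
      case None       => Some(done)
      case Some(prev) => Some(f(prev, done))
    }

  // Combine all shards into the final OutDone once everything has completed.
  def result(f: (OutDone, OutDone) => OutDone): UIO[Option[OutDone]] =
    ZIO.foreach(cells)(_.get).map(_.flatten.reduceOption(f))
}

object ShardedDone {
  def make[OutDone](n: Int): UIO[ShardedDone[OutDone]] =
    ZIO
      .foreach(Chunk.fromIterable(0 until n))(_ => Ref.make(Option.empty[OutDone]))
      .map(ShardedDone(_))
}
```

With one shard per worker fiber, each update is effectively uncontended, at the cost of a small combine step at completion.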

The faster implementation takes a different approach: instead of spawning a fiber per nested stream, it spawns n worker fibers.
The workers compete over upstream elements using a queue and write the nested streams' elements and completions to a second queue.
Upstream completion places a special message in the second queue, consisting of upstream's completion value and the number of nested streams. The channel draining the second queue keeps track of the number of completions and of the special upstream-completion message, in order to aggregate the final OutDone value and detect completion of the merged stream.
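A rough, self-contained sketch of this scheme (all names are hypothetical; the actual PR works at the ZChannel level and handles errors and interruption, which this sketch omits):

```scala
import zio._
import zio.stream._

object WorkerMergeSketch {
  // Message protocol on the second (output) queue.
  sealed trait Msg[+A]
  final case class Elem[A](a: A)            extends Msg[A]       // element of a nested stream
  case object StreamDone                    extends Msg[Nothing] // one nested stream finished
  final case class UpstreamDone(total: Int) extends Msg[Nothing] // upstream finished; total stream count

  def mergeWorkers[A](
    upstream: ZStream[Any, Nothing, ZStream[Any, Nothing, A]],
    n: Int
  ): ZStream[Any, Nothing, A] =
    ZStream.unwrapScoped {
      for {
        input  <- Queue.bounded[ZStream[Any, Nothing, A]](n)
        output <- Queue.bounded[Msg[A]](n * 2)
        // Feed nested streams to the workers, then announce the total count.
        _ <- upstream
               .runFoldZIO(0)((count, s) => input.offer(s).as(count + 1))
               .flatMap(count => output.offer(UpstreamDone(count)))
               .forkScoped
        // n workers compete over the input queue; no fiber per nested stream.
        _ <- ZIO.foreachDiscard(1 to n) { _ =>
               input.take
                 .flatMap(s => s.runForeach(a => output.offer(Elem(a))) *> output.offer(StreamDone))
                 .forever
                 .forkScoped
             }
      } yield {
        // Drain the output queue, counting completions until all streams are done.
        def drain(done: Int, total: Int): ZStream[Any, Nothing, A] =
          if (total >= 0 && done >= total) ZStream.empty
          else
            ZStream.fromZIO(output.take).flatMap {
              case Elem(a)             => ZStream(a) ++ drain(done, total)
              case StreamDone          => drain(done + 1, total)
              case UpstreamDone(count) => drain(done, count)
            }
        drain(done = 0, total = -1) // -1 = upstream still running
      }
    }
}
```

Once the drain loop terminates, the enclosing scope closes and the worker fibers are interrupted, so the pool's size is fixed at n regardless of how many nested streams pass through.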

This PR also adds a couple of benchmarks showing the effect of chunking on the flatMapPar operator. The first is a bit unfair, as it replaces some of the stream operations with Chunk's equivalent ones, while the second is closer to flatMapPar. Both techniques can only live in user code, as they may cause indefinite blocking for effectful streams.

The complete benchmark results:

Benchmark                                   (chunkCount)  (chunkSize)  (parChunkSize)   Mode  Cnt   Score   Error  Units
StreamParBenchmark.zioFlatMapPar                   10000         5000              50  thrpt   15   0.866 ± 0.013  ops/s
StreamParBenchmark.zioFlatMapParChunks             10000         5000              50  thrpt   15  29.721 ± 0.266  ops/s
StreamParBenchmark.zioFlatMapParChunksFair         10000         5000              50  thrpt   15   2.529 ± 0.033  ops/s

eyalfa (Contributor, Author) commented May 16, 2024

cc @jdegoes

eyalfa (Contributor, Author) commented May 18, 2024

Still working on this. I was able to add another benchmark that (IMHO) shows the bottleneck in this benchmark is actually forking. The slightly modified benchmark works at the chunk level instead of the element level (implementing the same logic), resulting in the same number of enqueue/dequeue operations going through the queue, but a 48x(!) improvement factor over the 'naive' benchmark. I suspect this factor of 48 comes from the chunk size of 50 used by the streams benchmark, and it implies that the heavy lifting here is starting and managing fibers.

benchmark code:

@Benchmark
def zioFlatMapParChunks: Long = {
  val result = ZStream
    .fromIterable(zioChunks)
    .flatMapPar(4) { c =>
      val cc = c.flatMap(i => Chunk(i, i + 1))
      ZStream.fromChunk(cc)
    }
    .runCount

  unsafeRun(result)
}

Benchmark results:

Benchmark                               (chunkCount)  (chunkSize)  (parChunkSize)   Mode  Cnt   Score   Error  Units
StreamParBenchmark.zioFlatMapPar               10000         5000              50  thrpt   15   0.309 ± 0.003  ops/s
StreamParBenchmark.zioFlatMapParChunks         10000         5000              50  thrpt   15  14.463 ± 0.316  ops/s

eyalfa (Contributor, Author) commented May 19, 2024

I rushed a bit into conclusions; the benchmark wasn't a fair comparison to flatMapPar, though it did inspire a very fast alternative implementation (see the updated PR description).

jdegoes (Member) commented May 21, 2024

@eyalfa Have you thought about re-using the same n fibers, rather than continuously forking more?

eyalfa (Contributor, Author) commented May 22, 2024

> @eyalfa Have you thought about re-using the same n fibers, rather than continuously forking more?

@jdegoes, I did 😎
See the PR's description for details, including the final benchmark results. What happened is that I understood the cost of forking only after adding the 'batched' benchmarks; this sparked the idea of using a 'fiber pool', which proved to be very performant.
See my comments on the two implementations in the diff.

varshith257 commented:

@eyalfa Can you review #8879?

I have been playing with the tapSink behaviour for two days and finally posted my results there 😅

],
n: => Int,
bufferSize: => Int /* = 16*/,
mergeStrategy: MergeStrategy.BackPressure.type
eyalfa (Contributor, Author) commented on this diff:

@jdegoes
This implementation reuses the fibers; it gradually allocates a 'fiber pool'.
I experimented with ways to avoid fiber allocation, but at least the approaches I attempted came with some performance penalty. The unbounded use case will avoid allocating a new fiber if the queue becomes empty immediately after the offer, but this approach seemed too expensive for the common case (TODO: add benchmarks for the unbounded scenario).
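The gradual allocation might be pictured like this (a deliberately simplified, hypothetical sketch, not the PR's code): a new worker fiber is forked only while the pool is below its cap, so a lightly loaded channel never pays for all the forks up front.

```scala
import zio._

// Hypothetical sketch of a gradually allocated fiber pool: each submit enqueues
// the task, and a new worker is forked only while fewer than `max` exist. Once
// the cap is reached, existing workers absorb all further tasks.
final class GrowingPool(max: Int, size: Ref[Int], queue: Queue[UIO[Any]]) {

  def submit(task: UIO[Any]): URIO[Scope, Unit] =
    queue.offer(task) *>
      size.modify { n =>
        if (n < max) (true, n + 1) // claim a slot for a new worker
        else (false, n)            // pool is full; rely on existing workers
      }.flatMap {
        case true  => queue.take.flatMap(t => t).forever.forkScoped.unit
        case false => ZIO.unit
      }
}

object GrowingPool {
  def make(max: Int): UIO[GrowingPool] =
    for {
      size  <- Ref.make(0)
      queue <- Queue.unbounded[UIO[Any]]
    } yield new GrowingPool(max, size, queue)
}
```

The workers are forked into the surrounding scope, so they are interrupted when the channel's scope closes.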

resSch
}

mergeStrategy match {
eyalfa (Contributor, Author) commented on this diff:

@jdegoes, this is where the 'fiber pool' approach is selected. Notice that the BufferSliding strategy effectively relies on a fiber per operation.
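A self-contained sketch of the selection logic (all names hypothetical; the diff suggests the original implementation was kept under the name mergeAllWith0):

```scala
// Hypothetical sketch: BufferSliding keeps the original fiber-per-stream
// implementation it depends on, while BackPressure can use the fiber pool.
sealed trait MergeStrategy
object MergeStrategy {
  case object BackPressure  extends MergeStrategy
  case object BufferSliding extends MergeStrategy
}

def selectImpl(mergeStrategy: MergeStrategy): String =
  mergeStrategy match {
    case MergeStrategy.BackPressure  => "fiber-pool implementation"
    case MergeStrategy.BufferSliding => "fiber-per-stream implementation (mergeAllWith0)"
  }
```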

@@ -1848,7 +1848,7 @@ object ZChannel {
): ZChannel[Env, InErr, InElem, InDone, OutErr, OutElem, OutDone] =
mergeAllWith(channels, Int.MaxValue)(f)

-  def mergeAllWith[Env, InErr, InElem, InDone, OutErr, OutElem, OutDone](
+  def mergeAllWith0[Env, InErr, InElem, InDone, OutErr, OutElem, OutDone](
eyalfa (Contributor, Author) commented on this diff:

@jdegoes, the original implementation is temporarily kept around; it will be dropped before merge (if the PR is accepted).

jdegoes (Member) commented on this diff:

Was it dropped? 🤔

jdegoes (Member) commented:

@eyalfa Was it dropped? 🤔

eyalfa (Contributor, Author) commented May 22, 2024

> @eyalfa Can you review #8879?
>
> I have been playing with the tapSink behaviour for two days and finally posted my results there 😅

Sure, I'll have a look, but next time please comment on the relevant issue/PR; this is a different PR, unrelated to the tapSink issue.

varshith257 commented May 22, 2024

> I'll have a look, but next time comment on the relevant issue/PR; this is a different PR, unrelated to the tapSink issue

I saw your reply on mobile and replied from there, and didn't notice this was a different PR :)

jdegoes (Member) commented May 22, 2024

@eyalfa Strong work! 💪

@jdegoes jdegoes merged commit 3e50286 into zio:series/2.x May 22, 2024
21 checks passed