[Feature] Reduce redundant shuffle for spark dynamic bucket writes #3222

Open · 2 tasks done
wForget opened this issue Apr 17, 2024 · 2 comments

Labels: enhancement (New feature or request)

Comments

wForget (Member) commented Apr 17, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Dynamic bucket writing does two shuffles. The first one, repartitionByKeyPartitionHash, seems unnecessary: it appears to be used only to determine assignId. However, assignId can be calculated from partitionHash/keyHash/numParallelism/numAssigners, so we do not need the extra shuffle. Can we remove it?
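
For illustration, here is a minimal Java sketch of the idea. The names computeAssignId, partitionHash, keyHash, numParallelism, and numAssigners follow the wording above; they are assumptions for this sketch, not Paimon's actual API.

```java
// Hedged sketch: derive the assigner task for a record from values already
// known on the write side, so no preliminary shuffle would be needed.
// All names here mirror the issue text and are not Paimon's real API.
public final class AssignIdSketch {

    /**
     * Hash the partition to a starting channel, then spread the keys of that
     * partition over its {@code numAssigners} channels.
     */
    static int computeAssignId(
            int partitionHash, int keyHash, int numParallelism, int numAssigners) {
        int startChannel = Math.floorMod(partitionHash, numParallelism);
        int offset = Math.floorMod(keyHash, numAssigners);
        return (startChannel + offset) % numParallelism;
    }

    public static void main(String[] args) {
        // Two records of the same partition but with different keys can map to
        // different assigners (channels 3 and 2 of 8 here), which relates to
        // the shared-bucket-data concern raised in the comment below.
        System.out.println(computeAssignId(42, 7, 8, 2));  // 42%8=2, 7%2=1 -> 3
        System.out.println(computeAssignId(42, 12, 8, 2)); // 42%8=2, 12%2=0 -> 2
    }
}
```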

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
wForget added the enhancement (New feature or request) label on Apr 17, 2024
wForget (Member, Author) commented Apr 17, 2024

@YannByron could you please take a look?

wForget changed the title from "[Feature] Reduce redundant shuffle for dynamic bucket writes" to "[Feature] Reduce redundant shuffle for spark dynamic bucket writes" on Apr 17, 2024
JingsongLi (Contributor) commented

It is hard. Without the first shuffle, different assigners may end up with data for the same bucket.
