Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of distinct aggregations #21907

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Dith3r
Copy link
Member

@Dith3r Dith3r commented May 10, 2024

Description

Improve performance of distinct aggregations by defining additional strategies based on source properties (for example, NDV).

The strategy to use for multiple distinct aggregations.
SINGLE_STEP Computes distinct aggregations in single-step without any pre-aggregations.
This strategy will perform poorly if the number of distinct grouping keys is small.
MARK_DISTINCT uses MarkDistinct for multiple distinct aggregations
or for mix of distinct and non-distinct aggregations.
PRE_AGGREGATE Computes distinct aggregations using a combination of aggregation
and pre-aggregation steps.
AUTOMATIC chooses the strategy automatically.

Single-step strategy is preferred. However, for cases with limited concurrency due to
a small number of distinct grouping keys, it will choose an alternative strategy
based on input data statistics.

Strategy Duration Query
MARK_DISTINCT 5353ms select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE 2468ms select count(ss_customer_sk), count(distinct ss_ticket_number) from hive.tpcds_sf1000_orc.store_sales;
MARK_DISTINCT 10109ms select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;
PRE_AGGREGATE 8253ms select count(ss_quantity), count(distinct ss_item_sk), count(distinct ss_store_sk) from hive.tpcds_sf1000_orc.store_sales;

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Improve performance of distinct aggregations. ({issue}`21907`)

Improve performance of distinct aggregations by defining additional strategies based on source properties (for example, NDV). Rename corresponding config property optimizer.mark-distinct-strategy to
optimizer.distinct-aggregations-strategy and values to NONE -> SINGLE_STEP and ALWAYS -> MARK_DISTINCT. Replace optimizer.optimize-mixed-distinct-aggregations with a new
optimizer.distinct-aggregations-strategy `pre_aggregate`.

@cla-bot cla-bot bot added the cla-signed label May 10, 2024
@github-actions github-actions bot added docs hudi Hudi connector iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels May 10, 2024
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 4 times, most recently from 0d37869 to abce86f Compare May 10, 2024 10:35
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from a8e0691 to 1dbce97 Compare May 13, 2024 10:19
Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % comments

@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from 3c9b688 to e1575ae Compare June 4, 2024 07:44
@Dith3r Dith3r requested a review from raunaqmorarka June 4, 2024 09:20
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 4 times, most recently from 9045fda to 12631c0 Compare June 4, 2024 13:52
@Dith3r Dith3r requested a review from raunaqmorarka June 4, 2024 13:52
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from b1b21b8 to e313705 Compare June 4, 2024 14:16
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comments

@raunaqmorarka
Copy link
Member

Please update the release notes section to specify which system session properties and configs were removed/defunct and what are the new alternatives.

@Dith3r Dith3r requested a review from martint June 6, 2024 07:26
@Dith3r Dith3r force-pushed the ke/dist-aggr branch 3 times, most recently from ab2d632 to 76ed7d1 Compare June 6, 2024 07:38
lukasz-stec and others added 4 commits June 6, 2024 09:45
Extract the logic to determine whether the direct distinct aggregation applicability,
which can be reused in multiple optimiser rules.
…oupBy

Rename class to before refactoring to preserve history
The rule replaces `OptimizeMixedDistinctAggregations`,
and adds support for multiple distinct aggregations.
Replace optimizer.optimize-mixed-distinct-aggregations with a new
optimizer.distinct-aggregations-strategy `pre_aggregate`. Also rename corresponding config property optimizer.mark-distinct-strategy to
optimizer.distinct-aggregations-strategy and values to NONE -> SINGLE_STEP and ALWAYS -> MARK_DISTINCT
Use estimated aggregation source NDV and the number
of grouping keys to decide if pre-aggregate strategy
should be used for a given aggregation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

None yet

4 participants