[SPARK-47353][SQL] Enable collation support for the Mode expression #46597
Conversation
@uros-db This is all cleaned up. Let's get some of the other reviewers to look at it?
Since the Mode expression works with any child expression, and you special-cased handling of Strings, how do we handle Array(String), Struct(String), etc.?
In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch! @uros-db There are several strategies we might adopt to handle structs with collated fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas. Do you feel I should solve that in a separate PR or in this one? I assume you prefer that this gets solved in this PR and not a follow-up PR, right?
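To illustrate the mismatch (a minimal sketch, with `equalsIgnoreCase` standing in for a case-insensitive collation): two values can differ byte-by-byte yet be equal under the collation, so byte-wise grouping splits what should be one group.

```scala
// Byte-wise comparison distinguishes values that a case-insensitive
// collation treats as equal, so byte-based grouping over-counts groups.
val a = "spark".getBytes("UTF-8")
val b = "SPARK".getBytes("UTF-8")
println(java.util.Arrays.equals(a, b))     // false: raw bytes differ
println("spark".equalsIgnoreCase("SPARK")) // true: equal ignoring case
```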
I have added an implementation for Mode to support structs with fields of the various collations. Performance is not great so far.
I will add the benchmark results from GHA once I get your feedback. I haven't yet added support for collation for Mode on array types: the "Collation Support in Spark" design doc says support for that is TBD. So I wanted to check in as to whether you think I should add support for that now or as a follow-up.
What I would really like to try is to move from this implementation to an approach that has the collation-support logic in the PartialAggregation stage. But as it has already been a couple of weeks of development on this, I believe we should, for this PR, confine all the collation logic to the stage that can't be serialized and deserialized -- the `eval` stage.
I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs. However, I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc. With that in mind, it seems that we should adopt one of two approaches:
Also note that covering StringTypes which are fields of StructType is not by itself enough - suppose there's a field of StructType that is another StructType that has a field of collated StringType, etc. The same goes for arrays: handling ArrayType(StringType) is not enough by itself - we also need ArrayType(ArrayType(StringType)). In short, I would say that we need a recursive approach to properly handle all possible collated string instances.
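A hedged sketch of such a recursive check (the helper name is illustrative, not from the PR; it assumes the `StringType.collationId` and `CollationFactory` APIs quoted elsewhere in this thread):

```scala
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.sql.types._

// Sketch: recursively detect any non-binary-collated StringType nested
// anywhere inside a DataType (structs, arrays, maps, at any depth).
def hasCollatedString(dt: DataType): Boolean = dt match {
  case s: StringType =>
    !CollationFactory.fetchCollation(s.collationId).supportsBinaryEquality
  case ArrayType(elementType, _) => hasCollatedString(elementType)
  case MapType(keyType, valueType, _) =>
    hasCollatedString(keyType) || hasCollatedString(valueType)
  case StructType(fields) => fields.exists(f => hasCollatedString(f.dataType))
  case _ => false
}
```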
As for changing the aggregation itself: there's of course the problem of preserving one of the actual values - you correctly noticed that we can't just return the collationKey, as that value might not be present in the original array. I suppose a separate map might do the trick here (mapping collationKey to the original string value), and since we don't have a preference towards which value gets returned, simply returning the first one that appeared is considered correct behaviour.
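A minimal, self-contained sketch of that bookkeeping (lowercasing stands in for a real collation key such as `CollationFactory.getCollationKey`; names are illustrative):

```scala
import scala.collection.mutable

// Lowercasing stands in for the real collation key function.
def collationKey(s: String): String = s.toLowerCase

val counts = mutable.Map.empty[String, Long]      // collation key -> count
val firstSeen = mutable.Map.empty[String, String] // collation key -> first original value

def update(value: String): Unit = {
  val key = collationKey(value)
  firstSeen.getOrElseUpdate(key, value) // remember the first value that appeared
  counts(key) = counts.getOrElse(key, 0L) + 1L
}

// eval: return an original input value, never the synthetic collation key.
def evalMode(): Option[String] =
  counts.maxByOption(_._2).map { case (key, _) => firstSeen(key) }
```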
@uros-db if you are fine with me splitting it into two PRs, that's what I will do! I will modify this PR to fail for complex types that have collated strings. And I will get the PR to implement full (recursive) support for said complex types ready to be reviewed right after this one is merged. I appreciate your flexibility!
Force-pushed with updates: added `checkInputDataTypes` to reject complex types containing non-binary collations, added struct tests (tests pass), and fixed scalastyle.
@uros-db Should I also add collation support to […]? The only difference will be […]
@uros-db ?
We can leave […] now that you've explored various options and finished the […].
@uros-db When should I add back support for complex types? Should I wait until we have buy-in for the current approach from @dbatomic @nikolamand-db @stefankandic @stevomitric, or should I do it now?
(I no longer think the code for support for complex types needs to be a separate PR.)
@dbatomic have you had a chance to look at this?
@uros-db I haven't heard back from anyone. Is there some other PR this should wait for? E.g., if you are implementing getBinaryKey for complex types in a separate PR, that would make sense. Just keep me informed as to what is going on. Thanks!
Hey @GideonPotok, thanks for the ping and sorry for the delay! I'll make sure to remind folks from the SQL team to take a look at this and give some feedback themselves.

I'd say it's fine if you want to proceed with covering all complex types with collated strings, as we don't currently have any other open tickets within the collation effort. On the other hand, I'd advise some more patience while we gather input from @dbatomic @stefankandic @nikolamand-db @mihailom-db @stevomitric on whether this is the correct general approach. From where I see it, this is good enough for a starting point. But the team may have other ideas, or they may prefer the approach of using collationKeys for aggregation with a separate map to preserve the original strings so they don't get lost, so I think it's best to hear them out.
```scala
case c: StringType if
    !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
  val collationId = c.collationId
  val modeMap = buffer.toSeq.groupMapReduce {
```
I am not an expert in this part of the code, but I wonder if we could do better than this. I see that most of the logic is in `OpenHashMap` and `OpenHashSet`. In `OpenHashSet`, the hash calculation is usually done like `hashcode(hasher.hash(k))`. If we could just get the hash to respect collation, the problem might be solved. At the collation level we do have `Collation.hashFunction`. Can we somehow pass this to the `OpenHashSet`?
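A rough sketch of that idea (assumptions: `OpenHashSet.Hasher` can be subclassed, which may require opening it up, and `Collation.hashFunction` is a `ToLongFunction[UTF8String]`; the class name is illustrative):

```scala
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.collection.OpenHashSet

// Sketch: hash UTF8Strings through the collation's hash function, so strings
// that are equal under the collation land in the same bucket.
class CollationAwareHasher(collationId: Int)
    extends OpenHashSet.Hasher[UTF8String] {
  @transient private lazy val collation =
    CollationFactory.fetchCollation(collationId)

  // Assumed API: Collation.hashFunction is a ToLongFunction[UTF8String].
  override def hash(o: UTF8String): Int =
    collation.hashFunction.applyAsLong(o).toInt
}
```

Wiring it in is the open question: `OpenHashSet` currently picks its hasher internally, so it would need a way to accept a caller-supplied one, and the equality check in `keyExistsAtPos` would need a matching collation-aware comparison.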
@dbatomic What you are proposing would make sense. It increases the complexity, but I can whip up a draft PR and we can see whether it makes sense to proceed.
@dbatomic @uros-db here is a mockup/proof of concept of this proposal: GideonPotok#2.
The relevant unit test has passed, which indicates that this approach is viable! Now we need to consider whether to advance, and determine how to integrate the relevant information about the key's datatype into the OpenHashMap. What are your thoughts on the feasibility of moving forward?
I'm primarily concerned about the risks involved: integrating collation with specialized types and complex hash functions might lead to subtle bugs. Considering the crucial nature of this data structure, we should approach any changes with caution and a detailed plan for validation. It may be wise to consider less invasive modifications, such as the one proposed in this PR (#46597).
Despite these concerns, this approach is functioning, and it touches on a particularly intriguing part of the codebase that I am eager to work on. If you think it's a promising route, I'm ready to complete the implementation and perform further benchmarks. However, I would appreciate some design suggestions, as mentioned below.
To effectively implement this, I see two possible directions:
- Is there a benefit to using `AnyRef` (as in `OpenHashMap[AnyRef, ...]`) in `TypedAggregateWithHashMapAsBuffer`? This was introduced in https://github.com/apache/spark/pull/37216/files without a clear explanation of why `AnyRef` was preferred over generics. Should `TypedAggregateWithHashMapAsBuffer` remain unchanged, or should it evolve to rely on (pseudocode) `OpenHashMap[childExpression.dataType.getClass, ...]` for more specific typing? @beliefer, although it's been some time since you worked on this, could you advise on whether this component should be modified?
- Assuming `TypedAggregateWithHashMapAsBuffer` remains unchanged, I'm seeking a more effective method to inject the custom hashing logic (and a custom `keyExistsAtPos` method) from `Mode` into the OpenHashMap, depending on the `childExpr.dataType`. I would greatly value ideas on how to best integrate this. At the moment, the proof of concept assumes any object passed into `OpenHashSet` that is not a `Long`, `Int`, `Double`, or `Float` is a `UTF8String` with `UTF8_BINARY_LCASE` collation.
Lastly, while I am eager to complete the implementation, I hope to ensure that this is something you would definitively want to pursue, barring any significant performance setbacks revealed by benchmarking. I've developed this proof of concept and it's operational, but a full implementation should ideally be something you are confident is the right direction.
#46597 would look a lot better if I were to fully implement it. Waiting to hear whether to proceed.
What changes were proposed in this pull request?
SPARK-47353
Comparing experimental approaches (pull requests):
- Scala TreeMap (RB Tree)
- GroupMapReduce <- Most performant
- GroupMapReduce (Cleaned up) (This PR) <- Most performant
Central change to Mode's `eval` algorithm:
- `eval` method: the `eval` method now checks whether the column being aggregated is a string with a non-default collation and, if so, uses a grouping by collation key.

Minor change to Mode:
- `collationId`: a new lazy value computed from the `dataType` of the `child` expression, used to fetch the appropriate collation comparator when `collationEnabled` is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement that support.
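A simplified, self-contained sketch of that grouping idea (lowercasing stands in for `CollationFactory.getCollationKey`; this is not the PR's actual code):

```scala
// Sketch: collation-aware mode over a (value -> count) buffer.
object CollationModeSketch {
  def collationKey(s: String): String = s.toLowerCase // stand-in collation key

  def mode(buffer: Map[String, Long]): String = {
    // Re-group the buffer by collation key, summing counts and keeping the
    // first original value seen for each key.
    val grouped = buffer.toSeq
      .groupMapReduce { case (v, _) => collationKey(v) }(identity)(
        (x, y) => (x._1, x._2 + y._2))
    grouped.values.maxBy(_._2)._1
  }

  def main(args: Array[String]): Unit = {
    val buffer = Map("Spark" -> 2L, "spark" -> 2L, "flink" -> 3L)
    // "Spark"/"spark" merge to a count of 4 under the case-insensitive key.
    println(mode(buffer)) // Spark
  }
}
```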
Unit test enhancements: significant additions to `CollationStringExpressionsSuite` to test the new functionality, including the `Mode` function when handling strings with different collation settings.

Benchmark updates: the `CollationBenchmark` classes now include benchmarks for the new mode functionality with and without collation settings, as well as for numerical types.

Why are the changes needed?
Does this PR introduce any user-facing change?
Yes, this PR introduces the following user-facing changes:
- A new `collationEnabled` property on the `Mode` expression.
- Users can set this property on the `Mode` expression to customize its behavior.
This patch was tested through a combination of new and existing unit and end-to-end SQL tests:
- Unit tests verifying that the `Mode` function correctly handles strings with different collation settings.
- Out of scope: special Unicode cases (higher planes).
- Tests do not need to include null handling.
- Benchmark tests.
- Manual testing.
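A hedged sketch of the shape such a test might take, inside a QueryTest-style suite where `spark`, `checkAnswer`, and `Row` are available (it assumes the `collate()` SQL function and the `UTF8_BINARY_LCASE` collation referenced earlier in this thread; values and the expected answer are illustrative, not the actual suite contents):

```scala
// Hypothetical test shape: mode over a case-insensitively collated column.
val df = spark.sql(
  """SELECT mode(collate(c, 'UTF8_BINARY_LCASE'))
    |FROM VALUES ('a'), ('A'), ('a'), ('B') AS t(c)""".stripMargin)
// Under the case-insensitive collation, 'a'/'A'/'a' form one group of three,
// so the mode is one of them; without collation the counts would split 2/1/1.
checkAnswer(df, Row("a"))
```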
Was this patch authored or co-authored using generative AI tooling?
Nope!