[SPARK-47353][SQL] Enable collation support for the Mode expression #46597
Conversation
@uros-db This is all cleaned up. Let's get some of the other reviewers to look at it?
Since the Mode expression works with any child expression, and you special-cased handling of Strings, how do we handle Array(String), Struct(String), etc.?
In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch! @uros-db There are several strategies we might adopt to handle structs with collated fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas. Do you feel I should solve that in a separate PR or in this one? I assume you prefer that this gets solved in this PR and not a follow-up PR, right?
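To illustrate the mismatch (a minimal sketch, with `equalsIgnoreCase` standing in for a case-insensitive collation): two values can differ byte-by-byte yet be equal under the collation, so byte-wise grouping splits what should be one group.

```scala
// Byte-wise comparison distinguishes values that a case-insensitive
// collation treats as equal, so byte-based grouping over-counts groups.
val a = "spark".getBytes("UTF-8")
val b = "SPARK".getBytes("UTF-8")
println(java.util.Arrays.equals(a, b))     // false: raw bytes differ
println("spark".equalsIgnoreCase("SPARK")) // true: equal ignoring case
```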
I have added an implementation for Mode to support structs with fields of the various collations. Performance is not great so far.
I will add the benchmark results from GHA once I get your feedback. I haven't yet added support for collation for Mode on array types: the "Collation Support in Spark" design doc says support for that is TBD. So I wanted to check in as to whether you think I should add support for that now or as a follow-up.
What I would really like to try is to move from this implementation to an approach that has the collation-support logic in the PartialAggregation stage. But as it has already been a couple of weeks of development on this, I believe we should, for this PR, confine all the collation logic to the stage that can't be serialized and deserialized -- the `eval` stage.
I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs. However, I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc. With that in mind, it seems that we should adopt one of two approaches:
Also note that covering StringTypes which are fields of StructType is not by itself enough - suppose there's a field of StructType that is another StructType that has a field of collated StringType, etc. The same goes for arrays: handling ArrayType(StringType) is not enough by itself - we also need ArrayType(ArrayType(StringType)). In short, I would say that we need a recursive approach to properly handle all possible collated string instances.
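A hedged sketch of such a recursive check (the helper name is illustrative, not from the PR; it assumes the `StringType.collationId` and `CollationFactory` APIs quoted elsewhere in this thread):

```scala
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.sql.types._

// Sketch: recursively detect any non-binary-collated StringType nested
// anywhere inside a DataType (structs, arrays, maps, at any depth).
def hasCollatedString(dt: DataType): Boolean = dt match {
  case s: StringType =>
    !CollationFactory.fetchCollation(s.collationId).supportsBinaryEquality
  case ArrayType(elementType, _) => hasCollatedString(elementType)
  case MapType(keyType, valueType, _) =>
    hasCollatedString(keyType) || hasCollatedString(valueType)
  case StructType(fields) => fields.exists(f => hasCollatedString(f.dataType))
  case _ => false
}
```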
As for changing the aggregation itself: there's of course the problem of preserving one of the actual values - you correctly noticed that we can't just return the collationKey, as that value might not be present in the original array. I suppose a separate map might do the trick here (mapping collationKey to the original string value), and since we don't have a preference towards which value gets returned, simply returning the first one that appeared is considered correct behaviour.
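A minimal, self-contained sketch of that bookkeeping (lowercasing stands in for a real collation key such as `CollationFactory.getCollationKey`; names are illustrative):

```scala
import scala.collection.mutable

// Lowercasing stands in for the real collation key function.
def collationKey(s: String): String = s.toLowerCase

val counts = mutable.Map.empty[String, Long]      // collation key -> count
val firstSeen = mutable.Map.empty[String, String] // collation key -> first original value

def update(value: String): Unit = {
  val key = collationKey(value)
  firstSeen.getOrElseUpdate(key, value) // remember the first value that appeared
  counts(key) = counts.getOrElse(key, 0L) + 1L
}

// eval: return an original input value, never the synthetic collation key.
def evalMode(): Option[String] =
  counts.maxByOption(_._2).map { case (key, _) => firstSeen(key) }
```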
@uros-db if you are fine with me splitting it into two PRs, that's what I will do! I will modify this PR to fail for complex types that have collated strings. And I will get the PR to implement full (recursive) support for said complex types ready to be reviewed right after this one is merged. I appreciate your flexibility!
Force-pushed with updates: added `checkInputDataTypes` to reject complex types containing non-binary collations, added struct tests (tests pass), and fixed scalastyle.
@uros-db Should I also add collation support to […]? The only difference will be […]
@uros-db ?
We can leave […] now that you've explored various options and finished the […].
@uros-db When should I add back support for complex types? Should I wait until we have buy-in for the current approach from @dbatomic @nikolamand-db @stefankandic @stevomitric, or should I do it now?
(I no longer think the code for support for complex types needs to be a separate PR.)
@dbatomic have you had a chance to look at this?
@uros-db I haven't heard back from anyone. Is there some other PR this should wait for? E.g., if you are implementing getBinaryKey for complex types in a separate PR, that would make sense. Just keep me informed as to what is going on. Thanks!
Hey @GideonPotok, thanks for the ping and sorry for the delay! I'll make sure to remind folks from the SQL team to take a look at this and give some feedback themselves.

I'd say it's fine if you want to proceed with covering all complex types with collated strings, as we don't currently have any other open tickets within the collation effort. On the other hand, I'd advise some more patience while we gather input from @dbatomic @stefankandic @nikolamand-db @mihailom-db @stevomitric on whether this is the correct general approach. From where I see it, this is good enough for a starting point. But the team may have other ideas, or they may prefer the approach of using collationKeys for aggregation with a separate map to preserve the original strings so they don't get lost, so I think it's best to hear them out.
```scala
case c: StringType if
    !CollationFactory.fetchCollation(c.collationId).supportsBinaryEquality =>
  val collationId = c.collationId
  val modeMap = buffer.toSeq.groupMapReduce {
```
I am not an expert in this part of the code, but I wonder if we could do better than this. I see that most of the logic is in `OpenHashMap` and `OpenHashSet`. In `OpenHashSet`, the hash calculation is usually done like `hashcode(hasher.hash(k))`. If we could just get the hash to respect collation, the problem might be solved. At the collation level we do have `Collation.hashFunction`. Can we somehow pass this to the `OpenHashSet`?
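A rough sketch of that idea (assumptions: `OpenHashSet.Hasher` can be subclassed, which may require opening it up, and `Collation.hashFunction` is a `ToLongFunction[UTF8String]`; the class name is illustrative):

```scala
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.util.collection.OpenHashSet

// Sketch: hash UTF8Strings through the collation's hash function, so strings
// that are equal under the collation land in the same bucket.
class CollationAwareHasher(collationId: Int)
    extends OpenHashSet.Hasher[UTF8String] {
  @transient private lazy val collation =
    CollationFactory.fetchCollation(collationId)

  // Assumed API: Collation.hashFunction is a ToLongFunction[UTF8String].
  override def hash(o: UTF8String): Int =
    collation.hashFunction.applyAsLong(o).toInt
}
```

Wiring it in is the open question: `OpenHashSet` currently picks its hasher internally, so it would need a way to accept a caller-supplied one, and the equality check in `keyExistsAtPos` would need a matching collation-aware comparison.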
@dbatomic What you are proposing would make sense. It increases the complexity, but I can whip up a draft PR and we can see whether it makes sense to proceed.
@dbatomic @uros-db here is a mockup/proof of concept of this proposal: GideonPotok#2.
The relevant unit test has passed, which indicates that this approach is viable! Now we need to consider whether to advance, and determine how to integrate the relevant information about the key's datatype into the OpenHashMap. What are your thoughts on the feasibility of moving forward?
I'm primarily concerned about the risks involved: integrating collation with specialized types and complex hash functions might lead to subtle bugs. Considering the crucial nature of this data structure, we should approach any changes with caution and a detailed plan for validation. It may be wise to consider less invasive modifications, such as the one proposed in this PR (#46597).
Despite these concerns, this approach is functioning, and it touches on a particularly intriguing part of the codebase that I am eager to work on. If you think it's a promising route, I'm ready to complete the implementation and perform further benchmarks. However, I would appreciate some design suggestions, as mentioned below.
To effectively implement this, I see two possible directions:
- Is there a benefit to using `AnyRef` (as in `OpenHashMap[AnyRef, ...]`) in `TypedAggregateWithHashMapAsBuffer`? This was introduced in https://github.com/apache/spark/pull/37216/files without a clear explanation of why `AnyRef` was preferred over generics. Should `TypedAggregateWithHashMapAsBuffer` remain unchanged, or should it evolve to rely on (pseudocode) `OpenHashMap[childExpression.dataType.getClass, ...]` for more specific typing? @beliefer, although it's been some time since you worked on this, could you advise on whether this component should be modified?
- Assuming `TypedAggregateWithHashMapAsBuffer` remains unchanged, I'm seeking a more effective method to inject the custom hashing logic (and a custom `keyExistsAtPos` method) from `Mode` into the OpenHashMap, depending on the `childExpr.dataType`. I would greatly value ideas on how to best integrate this. At the moment, the proof of concept assumes any object passed into `OpenHashSet` that is not a `Long`, `Int`, `Double`, or `Float` is a `UTF8String` with `UTF8_BINARY_LCASE` collation.
Lastly, while I am eager to complete the implementation, I hope to ensure that this is something you would definitively want to pursue, barring any significant performance setbacks revealed by benchmarking. I've developed this proof of concept and it's operational, but a full implementation should ideally be something you are confident is the right direction.
#46597 would look a lot better if I were to fully implement it. Waiting to hear whether to proceed.
What changes were proposed in this pull request?
SPARK-47353
Comparing experimental approaches (pull requests):
- Scala TreeMap (RB Tree)
- GroupMapReduce <- Most performant
- GroupMapReduce (Cleaned up) (This PR) <- Most performant
Central change to Mode's `eval` algorithm:
- `eval` method: the `eval` method now checks whether the column being aggregated is a string with a non-default collation and, if so, uses a grouping by collation key.

Minor change to Mode:
- `collationId`: a new lazy value computed from the `dataType` of the `child` expression, used to fetch the appropriate collation comparator when `collationEnabled` is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement that support.
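A simplified, self-contained sketch of that grouping idea (lowercasing stands in for `CollationFactory.getCollationKey`; this is not the PR's actual code):

```scala
// Sketch: collation-aware mode over a (value -> count) buffer.
object CollationModeSketch {
  def collationKey(s: String): String = s.toLowerCase // stand-in collation key

  def mode(buffer: Map[String, Long]): String = {
    // Re-group the buffer by collation key, summing counts and keeping the
    // first original value seen for each key.
    val grouped = buffer.toSeq
      .groupMapReduce { case (v, _) => collationKey(v) }(identity)(
        (x, y) => (x._1, x._2 + y._2))
    grouped.values.maxBy(_._2)._1
  }

  def main(args: Array[String]): Unit = {
    val buffer = Map("Spark" -> 2L, "spark" -> 2L, "flink" -> 3L)
    // "Spark"/"spark" merge to a count of 4 under the case-insensitive key.
    println(mode(buffer)) // Spark
  }
}
```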
Unit test enhancements: significant additions to `CollationStringExpressionsSuite` to test the new functionality, including the `Mode` function when handling strings with different collation settings.

Benchmark updates: the `CollationBenchmark` classes now include benchmarks for the new mode functionality with and without collation settings, as well as for numerical types.

Why are the changes needed?
Does this PR introduce any user-facing change?
Yes, this PR introduces the following user-facing changes:
- A new `collationEnabled` property on the `Mode` expression.
- Users can set this property on the `Mode` expression to customize its behavior.
This patch was tested through a combination of new and existing unit and end-to-end SQL tests:
- Unit tests verifying that the `Mode` function correctly handles strings with different collation settings.
- Out of scope: special Unicode cases (higher planes).
- Tests do not need to include null handling.
- Benchmark tests.
- Manual testing.
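A hedged sketch of the shape such a test might take, inside a QueryTest-style suite where `spark`, `checkAnswer`, and `Row` are available (it assumes the `collate()` SQL function and the `UTF8_BINARY_LCASE` collation referenced earlier in this thread; values and the expected answer are illustrative, not the actual suite contents):

```scala
// Hypothetical test shape: mode over a case-insensitively collated column.
val df = spark.sql(
  """SELECT mode(collate(c, 'UTF8_BINARY_LCASE'))
    |FROM VALUES ('a'), ('A'), ('a'), ('B') AS t(c)""".stripMargin)
// Under the case-insensitive collation, 'a'/'A'/'a' form one group of three,
// so the mode is one of them; without collation the counts would split 2/1/1.
checkAnswer(df, Row("a"))
```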
Was this patch authored or co-authored using generative AI tooling?
Nope!