Spark Action to Analyze table #10288

Open
karuppayya wants to merge 5 commits into main from analyze_action
Conversation

karuppayya (Contributor)

This change adds a Spark action to analyze tables.
As part of the analysis, the action generates Apache DataSketches theta sketches for NDV stats and writes them as Puffin files.
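For illustration, an invocation of the action might look like the sketch below. The SparkActions entry-point method name (analyzeTable) and the import paths are assumptions based on the AnalyzeTable interface in this PR, not settled API:

import org.apache.iceberg.actions.AnalyzeTable;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
import org.apache.iceberg.spark.actions.SparkActions;

// Hypothetical usage; the analyzeTable() entry point is an assumption,
// the columns()/stats()/execute() methods follow this PR's interface.
AnalyzeTable.Result result =
    SparkActions.get(spark)
        .analyzeTable(table)
        .columns(ImmutableSet.of("id", "data"))
        .stats(ImmutableSet.of("apache-datasketches-theta-v1"))
        .execute();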

@karuppayya (Contributor Author)

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Computes the statistics of the given columns and stores them as Puffin files. */
Member:
AnalyzeTableSparkAction is a generic name; I see that in the future we want to compute partition stats too, which may not be written as Puffin files.

Either we can change the naming to computeNDVSketches, or make it generic so that any kind of stats can be computed from this.

Member:

Thinking more on this, I think we should just call it computeNDVSketches and not mix it with partition stats.

Contributor Author:

I tried to follow the model of RDBMSes and engines like Trino, which use ANALYZE TABLE <tblName> to collect all table-level stats.
With a procedure-per-stat model, the user has to invoke a procedure/action for every statistic, and with any new stat addition the user needs to update their code to call the new procedure/action.

"not mix it with partition stats."

I think we could have partition stats as a separate action, since it is per partition, whereas this procedure can collect top-level table stats.


@karuppayya
I can see the tests in TestAnalyzeTableAction; it's working fine.
But have we tested in Spark whether it works with a query like
"Analyze table table1 compute statistics"?

Because generally that gives the error
"[NOT_SUPPORTED_COMMAND_FOR_V2_TABLE] ANALYZE TABLE is not supported for v2 tables."

Contributor Author:

Spark does not have the grammar for analyzing these tables.
This PR introduces a Spark action. In a subsequent PR, I plan to introduce an Iceberg procedure to invoke the Spark action.


@rice668 (Contributor) May 30, 2024:

"I see that in the future we want to compute partition stats too, which may not be written as Puffin files."

Hi @ajantha-bhat, I agree with you; otherwise the stats would have a lot of limitations, such as being applicable only to NDV calculated over the entire table.

For example, Trino might want to read the NDV values written by Spark to answer queries. However, if the query has partition filter conditions, then Trino would not be able to use the pre-computed NDV information from Spark. What do you think?

@jeesou May 31, 2024:

Hi @karuppayya, as the above discussion suggests, multiple engines like Spark, Presto, Trino, etc. might want to query the same data, so the sketches generated by Spark (or, say, Presto) must be readable by the other engines.

This question comes up because I ran an ANALYZE query on Presto, and the Puffin file it created looks like this:

{"blobs":[{"type":"apache-datasketches-theta-v1","fields":[2],"snapshot-id":7724902347602477706,
"sequence-number":1,"offset":44,"length":40,"properties":{"ndv":"3"}}],"properties":{"created-by":"presto-testversion"}}

whereas the one created by Iceberg through the changes in this PR looks like this:

{"blobs":[{"type":"apache-datasketches-theta-v1","fields":[3],
"snapshot-id":5334747061548805461,"sequence-number":1,"offset":4,"length":32}],"properties"

Looking closely, the {"ndv":"3"} portion is missing in the Iceberg-created file.

Can we make modifications here, or do you have any suggestions?
As per my understanding, the sketch file should be universal across all engines.

Contributor Author:

@jeesou
Yes, agreed that the sketch needs to be compatible across all engines.
This PR takes care of using the same library (Apache DataSketches) as Trino does. (This was the major concern here.)
Do we need to add the ndv property? Shouldn't engines be reading the value from the sketch?
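For reference, an engine can derive the estimate directly from the serialized sketch; a minimal sketch of that, assuming the raw apache-datasketches-theta-v1 blob bytes have already been read out of the Puffin file:

import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.CompactSketch;

// Derive the NDV estimate from a serialized theta sketch blob.
static long ndvFromBlob(byte[] blobBytes) {
  CompactSketch sketch = CompactSketch.wrap(Memory.wrap(blobBytes));
  return Math.round(sketch.getEstimate());
}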

Contributor:

Hm, this discussion makes me wonder if we're under-spec'd in this regard. According to the spec:

https://iceberg.apache.org/puffin-spec/#blob-types

"The blob metadata for this blob may include following properties:
    ndv: estimate of number of distinct values, derived from the sketch."

It really seems like we should take a stance: either it must be in the sketch or it must be in the properties. "may include" seems a little too loose.
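If the property were to be populated, it could be attached where the blob is written; a hedged sketch, assuming Iceberg's puffin Blob constructor and a computed theta sketch (fieldId, snapshotId, sequenceNumber, and sketch are stand-ins, not names from this PR):

import java.nio.ByteBuffer;
import org.apache.iceberg.puffin.Blob;
import org.apache.iceberg.puffin.StandardBlobTypes;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

// Record the derived NDV as a blob property so engines that do not
// deserialize the sketch can still read the estimate.
Blob blob =
    new Blob(
        StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1,
        ImmutableList.of(fieldId),
        snapshotId,
        sequenceNumber,
        ByteBuffer.wrap(sketch.toByteArray()),
        null, // uncompressed
        ImmutableMap.of("ndv", Long.toString(Math.round(sketch.getEstimate()))));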

spark(), table, columnsToBeAnalyzed.toArray(new String[0]));
table
.updateStatistics()
.setStatistics(table.currentSnapshot().snapshotId(), statisticsFile)
Member:

What if the table's current snapshot is modified concurrently by another client between lines 117 and 120?
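One way to avoid that race is to pin the snapshot ID once and use it both for computing the sketches and for the statistics commit; a rough sketch (computeStatisticsFile is a hypothetical stand-in for this PR's generation step):

// Pin the snapshot so the computed statistics and the commit target
// the same snapshot even if another client commits in between.
long snapshotId = table.currentSnapshot().snapshotId();
StatisticsFile statisticsFile = computeStatisticsFile(spark(), table, snapshotId);
table.updateStatistics().setStatistics(snapshotId, statisticsFile).commit();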


public static Iterator<Tuple2<String, ThetaSketchJavaSerializable>> computeNDVSketches(
SparkSession spark, String tableName, String... columns) {
String sql = String.format("select %s from %s", String.join(",", columns), tableName);
Member:

I think we should also think about incremental updates, updating sketches from a previous checkpoint. Querying the whole table may not be efficient.

Contributor Author:

Yes, incremental updates need to be wired into the write paths.
This procedure could exist in parallel and compute stats for the whole table on demand.
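For what it's worth, theta sketches are mergeable, so an incremental path could union a sketch computed over only the newly written data with the previously persisted one; a conceptual sketch, assuming the previous blob bytes are available as previousBlobBytes:

import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.CompactSketch;
import org.apache.datasketches.theta.SetOperation;
import org.apache.datasketches.theta.Union;

// Merge the previous checkpoint's sketch with one covering only new rows.
Union union = SetOperation.builder().buildUnion();
union.union(CompactSketch.wrap(Memory.wrap(previousBlobBytes))); // previous checkpoint
union.union(newDataSketch.compact());                            // sketch over appended rows
CompactSketch merged = union.getResult();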

assumeTrue(catalogName.equals("spark_catalog"));
sql(
"CREATE TABLE %s (id int, data string) USING iceberg TBLPROPERTIES"
+ "('format-version'='2')",
Member:

The default format version is v2 now, so specifying it again is redundant.

String path = operations.metadataFileLocation(String.format("%s.stats", UUID.randomUUID()));
OutputFile outputFile = fileIO.newOutputFile(path);
try (PuffinWriter writer =
Puffin.write(outputFile).createdBy("Spark DistinctCountProcedure").build()) {
Member:

I like this name instead of "analyze table procedure".

@ajantha-bhat (Member):

there was an old PR on the same: #6582

@huaxingao (Contributor):

"there was an old PR on the same: #6582"

I don't have time to work on this, so karuppayya will take over. Thanks a lot @karuppayya for continuing the work.

@amogh-jahagirdar (Contributor) left a comment:

Thanks @karuppayya @huaxingao @szehon-ho, this is awesome to see! I left a review of the API/implementation; I still have yet to review the tests, which look to be a WIP.

* @param statsToBeCollected set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);
Contributor:

Should these stats be a Set<StandardBlobType> instead of arbitrary Strings? I feel like the API becomes better defined in this case.

Contributor:

Oh I see, StandardBlobTypes defines string constants, not enums.
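For context, the constants live in org.apache.iceberg.puffin.StandardBlobTypes, so callers could still pass them through the String-based API; a small sketch (action is a stand-in for a configured AnalyzeTable):

import org.apache.iceberg.puffin.StandardBlobTypes;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;

// The stats are plain strings, but the well-known values come from StandardBlobTypes.
action.stats(ImmutableSet.of(StandardBlobTypes.APACHE_DATASKETCHES_THETA_V1));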

Comment on lines 89 to 98
private void validateColumns() {
  validateEmptyColumns();
  validateTypes();
}

private void validateEmptyColumns() {
  if (columnsToBeAnalyzed == null || columnsToBeAnalyzed.isEmpty()) {
    throw new ValidationException("No columns to analyze for table %s", table.name());
  }
}
Contributor:

Nit: I think this validation should just happen at the time of setting these on the action rather than at execution time.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Contributor:

I also think this interface should have a snapshot API to allow users to pass in a snapshot to generate the statistics for. If it's not specified then we can generate the statistics for the latest snapshot.

Collaborator:

Should we support branch/tag as well? (I guess in a subsequent PR)

Comment on lines 104 to 106
if (field == null) {
throw new ValidationException("No column with %s name in the table", columnName);
}
Contributor:

Style nit: new line after the if

SparkSession spark, Table table, long snapshotId, String... columnsToBeAnalyzed)
throws IOException {
Iterator<Tuple2<String, ThetaSketchJavaSerializable>> tuple2Iterator =
NDVSketchGenerator.computeNDVSketches(spark, table.name(), snapshotId, columnsToBeAnalyzed);
Contributor:

Does computeNDVSketches need to be public? It seems like it can just be package-private. Also, nit: either way I don't think you need the fully qualified method name.

import org.apache.datasketches.theta.Sketches;
import org.apache.datasketches.theta.UpdateSketch;

public class ThetaSketchJavaSerializable implements Serializable {
Contributor:

Does this need to be public?

Comment on lines +46 to +51
if (sketch == null) {
return null;
}
if (sketch instanceof UpdateSketch) {
return sketch.compact();
}
Contributor:

Style nit: new line after if

null,
ImmutableMap.of()));
}
writer.finish();
Contributor:

Nit: I don't think you need the writer.finish(), because the try-with-resources will close the writer, and close already finishes.

table.currentSnapshot().snapshotId(),
table.currentSnapshot().sequenceNumber(),
ByteBuffer.wrap(sketchMap.get(columns.get(i)).getSketch().toByteArray()),
null,
Contributor:

null means that the file will be uncompressed. I think it makes sense not to compress these files by default since the sketch will be a single long per column, so it'll be quite small already and not worth paying the price of compression/decompression.

Comment on lines +157 to +168
if (sketch1.getSketch() == null && sketch2.getSketch() == null) {
return emptySketchWrapped;
}
if (sketch1.getSketch() == null) {
return sketch2;
}
if (sketch2.getSketch() == null) {
return sketch1;
}
Contributor:

Style nit: new line after if

@karuppayya force-pushed the analyze_action branch 3 times, most recently from 5538f6e to de520fc on June 4, 2024 at 17:55
@szehon-ho (Collaborator) left a comment:

Hi @karuppayya, thanks for the patch. I left a first round of comments.

* @param columns a set of column names to be analyzed
* @return this for method chaining
*/
AnalyzeTable columns(Set<String> columns);
Collaborator:

Nit: how about String... columns (see RewriteDataFiles)? Same for the others.

* @param statsToBeCollected set of statistics to be collected
* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);
Collaborator:

Let's call it statistics? Like StatisticsFile. Per https://iceberg.apache.org/contribute/#java-style-guidelines, I think it can be interpreted differently, but point 3 implies we should use the full spelling if possible, and we don't have abbreviations in API method names in most of the code.

Collaborator:

Also, statsToBeCollected => types?

AnalyzeTable columns(Set<String> columns);

/**
* A set of statistics to be collected on the given columns of the given table
Collaborator:

"The set of statistics to be collected"? (given columns and given table are specified elsewhere)

*/
AnalyzeTable snapshot(String snapshotId);

/** The action result that contains a summary of the Analysis. */
Collaborator:

Plural? "contains summaries of the analysis"?

Also, if capitalized, it can be a Javadoc link.

* @return this for method chaining
*/
AnalyzeTable stats(Set<String> statsToBeCollected);

Collaborator:

Should we support branch/tag as well? (I guess in a subsequent PR)

(PairFlatMapFunction<Iterator<Row>, String, String>)
input -> {
final List<Tuple2<String, String>> list = Lists.newArrayList();
while (input.hasNext()) {
Collaborator:

Can we use flatMap and mapToPair to make this more concise?

data.javaRDD().flatMap(r -> {
      List<Tuple2<String, String>> list = Lists.newArrayListWithExpectedSize(columns.size());
      for (int i = 0; i < r.size(); i++) {
        list.add(new Tuple2<>(columns.get(i), r.get(i).toString()));
      }
      return list.iterator();
    }).mapToPair(t -> t);

return ImmutableAnalyzeTable.Result.builder().analysisResults(analysisResults).build();
}

private boolean analyzeableTypes(Set<String> columns) {
Collaborator:

According to IntelliJ, there is a typo (analyzable).

final JavaPairRDD<String, ThetaSketchJavaSerializable> sketches =
pairs.aggregateByKey(
new ThetaSketchJavaSerializable(),
1, // number of partitions
Collaborator:

Why limit to 1?

Contributor Author:

This code was just copied from a DataSketches example.
This value is used by the HashPartitioner behind the scenes.
Should we set it to spark.sql.shuffle.partitions?
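A minimal sketch of wiring that in, reading the session's shuffle parallelism instead of hard-coding one partition (addFn and combineFn stand in for this PR's aggregation functions):

// Use the session's configured shuffle parallelism for the aggregation.
int numPartitions =
    Integer.parseInt(spark.conf().get("spark.sql.shuffle.partitions", "200"));

JavaPairRDD<String, ThetaSketchJavaSerializable> sketches =
    pairs.aggregateByKey(new ThetaSketchJavaSerializable(), numPartitions, addFn, combineFn);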

return sketches.toLocalIterator();
}

static class Add
Collaborator:

Can we use lambdas here for cleaner code? Like:

 (sketch, val) -> {
   sketch.update(val);
   return sketch;
 },

The next one may be too complex to inline, but maybe we can reduce the ugly Java boilerplate.

final Row row = input.next();
int size = row.size();
for (int i = 0; i < size; i++) {
list.add(new Tuple2<>(columns.get(i), row.get(i).toString()));
@szehon-ho (Collaborator) Jun 6, 2024:

Question: does forcing the string type affect anything? I see the sketch library takes in other types.
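For context, UpdateSketch has typed overloads, and the hash depends on which one is used, so engines must agree on the input representation for their estimates to line up; a small illustration, assuming the DataSketches theta API:

import org.apache.datasketches.theta.UpdateSketch;

UpdateSketch a = UpdateSketch.builder().build();
UpdateSketch b = UpdateSketch.builder().build();
a.update(42L);  // hashes the 8-byte long value
b.update("42"); // hashes the UTF-8 string bytes: a different hash
// The sketches now disagree even though they saw the same logical value.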
