Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-33759] [flink-parquet] Add support for nested array with row/map/array type #24795

Merged
merged 1 commit into from
May 27, 2024

Conversation

ViktorCosenza
Copy link

What is the purpose of the change

This Pull request adds commits from #24029 and #23881 to fully resolve issue https://issues.apache.org/jira/browse/FLINK-33759

Brief change log

Adds support for ParquetWriter nested Array, Map and Row Types. Previously the writer just wrote nothing. Silently failing.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change added tests and can be verified as follows:

  • Added test to parquet writer using compression with all nested possible types.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): don't think so, previously the writer would instantly return for these cases but it would write nothing.
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no, it's a fix
  • If yes, how is the feature documented? na

@ViktorCosenza ViktorCosenza changed the title Flink 33759 [FLINK-33759] [flink-parquet] Add support for nested array with row/map/array type May 15, 2024
@flinkbot
Copy link
Collaborator

flinkbot commented May 15, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Member

@xccui xccui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! The PR looks good to me.
Hi @JingsongLi, please also take a look. I think we can port this to fix apache/paimon#1730

@ViktorCosenza
Copy link
Author

Thanks! The PR looks good to me. Hi @JingsongLi, please also take a look. I think we can port this to fix apache/paimon#1730

Hey @xccui, thanks for the quick review! Just checking in to keep this PR alive.
Is there anything I can do to help get this merged?

@ViktorCosenza
Copy link
Author

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.

After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

Copy link
Contributor

@JingGe JingGe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ViktorCosenza for taking care of it. The PR looks overall good. Just left one small question about the background context.


MapData mapData = row.getMap(ordinal);
@Override
public void write(ArrayData arrayData, int ordinal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Not sure why those write(ArrayData arrayData, int ordinal) methods didn't get implemented. It should be trivial to do it while building those Writers. Do you have any hints about the background info?

@ViktorCosenza
Copy link
Author

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.

After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods

@JingGe
Copy link
Contributor

JingGe commented May 22, 2024

It looks like those methods were skipped on purpose in #17542 and #17542 (comment)

@ViktorCosenza
Copy link
Author

ViktorCosenza commented May 22, 2024 via email

@JingGe
Copy link
Contributor

JingGe commented May 23, 2024

I talked with @JingsongLi offline. It seems that at that time there was no such requirement and therefore skipped those implementation.

Copy link
Contributor

@JingGe JingGe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JingGe
Copy link
Contributor

JingGe commented May 23, 2024

@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!

@JingGe
Copy link
Contributor

JingGe commented May 23, 2024

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.
After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods

Thanks for the hint! Would you mind if I picked up some of your code to paimon?

@ViktorCosenza
Copy link
Author

@flinkbot run azure

@ViktorCosenza
Copy link
Author

@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!

Done! But I wasn't able to trigger the CI

@ViktorCosenza
Copy link
Author

ViktorCosenza commented May 23, 2024

Thanks for the hint! Would you mind if I picked up some of your code to paimon?

Sure, go ahead!

@ViktorCosenza
Copy link
Author

@JingGe Are you able to trigger the CI manually? I think I't wasnt triggered after the squash because no changes were detected.

@JingGe
Copy link
Contributor

JingGe commented May 24, 2024

@flinkbot run azure

@JingGe
Copy link
Contributor

JingGe commented May 24, 2024

@ViktorCosenza did you rebase?

@ViktorCosenza
Copy link
Author

ViktorCosenza commented May 24, 2024

@ViktorCosenza did you rebase?

Yes, i did an interactive rebase and then force-pushed the squashed commit

@ViktorCosenza
Copy link
Author

@JingGe CI Passed, I dindn't realize the bot just editted the same comment

Copy link
Contributor

@JingGe JingGe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@JingGe JingGe merged commit 57b2005 into apache:master May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants