Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Url encode field names for partition paths #10329

Merged
merged 3 commits into from
May 27, 2024

Conversation

danielcweeks
Copy link
Contributor

Field names can contain special characters that result in failure to produce valid paths using path based layout providers and possibly breaks usage of HadoopFileIO based FileSystem implementations.

This PR encodes the field name in addition to the preexisting encoding for the partition values.

Fixes #10279
See also: #10283

cc. @dimas-b

Comment on lines +294 to +295
table.updateSchema().addColumn("data#1", Types.StringType.get()).commit();
table.updateSpec().addField("data#1").commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test for some other atypical characters like "$" or "&"? Some character from the "Characters that might require special handling" section in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They recommend URL encoding the values, which is what we're doing here. I'm not sure how much value there is in using a range of characters since we are just validating that url encoding is happening.

The only one that immediately stands out to me is =, which does not get encoded, but that's been used for partition paths for a long time with both FileIO and Hadoop FileSystem.

Copy link
Contributor

@dimas-b dimas-b May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe nitpicking, but in my reading of the AWS doc linked above I do not think they are recommending URL encoding. The doc's language is pretty vague. In fact, I suspect URL encoding even partition key values here will cause incorrect processing in tools that use java.net.URI for parsing those locations in Iceberg metadata files because the escaped chars will be converted to their proper Unicode chars during parsing and will likely not match existing S3 keys when passed to software.amazon.awssdk.services.s3.S3Utilities.parseUri(URI uri)... I'm not saying there's definitely a bug there, but I strongly suspect interoperability issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example:

  @Test
  void testA() {
    URI uri = URI.create("s3://bucket/path/id%22=1/something.parquet");
    S3Utilities u = S3Utilities.builder().region(Region.EU_CENTRAL_1).build();
    S3Uri s3Uri = u.parseUri(uri);
    soft.assertThat(s3Uri.key().get()).isEqualTo("path/id%22=1/something.parquet");
  }

Outcome:

Expected :"path/id%22=1/something.parquet"
Actual   :"path/id"=1/something.parquet"

Note that Iceberg, by virtue of having its own S3URI class, will use path/id%22=1/something.parquet (with the % sign) as the S3 key at the S3 API level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we'll just make this consistent with what we're doing for the values. I agree we should probably up-level this discussion and figure out a path forward that is more usable with other URI parsing utilities.

@@ -189,7 +189,7 @@ public String partitionToPath(StructLike data) {
if (i > 0) {
sb.append("/");
}
sb.append(field.name()).append("=").append(escape(valueString));
sb.append(escape(field.name())).append("=").append(escape(valueString));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix looks reasonable to me as far as #10279 is concerned.

However, this will cause partition paths to change in existing tables that have special characters that previously did not cause correctness issues, for example quotes ("). So new files under the same partition key will not share a common prefix with old files under that same partition key... Just want to make sure this is intentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's unavoidable. However, the physical path is just for practical/visual validation and isn't necessary for the correct operation of the table. I would also suspect that the number of tables where there's a field name and partitioning that would be affected is rather small.

I looked for cases that would be problematic for existing paths, but usage of this is rather isolated to the location providers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the value of file prefixes is not critical, why not use non-URI encoding and avoid the URI interoperability problems altogether (note my other comment). For example, the escape() method could produce values using only non-special chars according to the the URI RFC (that is avoid even using % for escaping). As far as I understand the escaping method does not even have to be reversible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b In general, I agree with you. It would be good to not include any characters that are problematic.

To give a little context around how we ended up here, this layout is largely to provide similarity with Hive-style pathing so people transitioning over felt somewhat comfortable with the layout. Hive encodes some characters, but in a way that would also cause problems mentioned previously. I feel like most systems at this point are wary of any URI parsing because of this legacy behavior and cloud storage has made it more complicated. Tables that are migrated will likely have these issues, so we're largely stuck with supporting them for a while.

I'm not convinced that we want to move away from a reversible encoding because there's an existing expectation that you could recover by inferring the partitioning from the path structure.

The direction I would like to head overall is to provide the option to omit the path structure entirely. In Iceberg, the physical layout is entirely decoupled from the logical partitioning, so there's really no need for it. The additional pathing has a number of downsides in addition to the character encoding (long keys, type erasure, etc.).

For now, I think the simple path forward is to encode the field names like always has been done with values and then either introduce better encodings or remove them entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I comment above, this PR does fix #10279, which is welcome.

I'm looking forward to further improvements in Iceberg to improve interoperability with location strings stored in its metadata files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that we want to move away from a reversible encoding because there's an existing expectation that you could recover by inferring the partitioning from the path structure.

Doesn't this PR make the recovery situation worse? For example, in an old path, one could have s3://bucket/path/id%20=1/... (no encoding), in the new path it will be s3://bucket/path/id%2520=1/...... How does one figure out what to decode and when?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this PR make the recovery situation worse? For example, in an old path, one could have s3://bucket/path/id%20=1/... (no encoding), in the new path it will be s3://bucket/path/id%2520=1/...... How does one figure out what to decode and when?

I guess one approach to deal with such a case would be to decode the path string first before applying the encoding in order to not cause double-encoding:

private String escape(String string) {
    try {
      return URLEncoder.encode(URLDecoder.decode(string, "UTF-8"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException(e);
    }
  }

But this also isn't an ideal solution and so I agree that long term we might want to consider better encoding or remove them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The %20 could be part of the column name (or value) as in my example in #10279, so decoding the original string may not be applicable to all cases.

create table test.ns.t9(`id%20` string not null, a int) partitioned by (`id%20`);

... but I did not test that particular name in practice :)

@danielcweeks danielcweeks requested a review from nastra May 15, 2024 21:33
@danielcweeks danielcweeks merged commit 795fea9 into apache:main May 27, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unpredictable behaviour with S3FileIO when column names contain #
4 participants