Url encode field names for partition paths #10329

danielcweeks · 2024-05-13T21:15:37Z

Field names can contain special characters that prevent path-based layout providers from producing valid paths and may break HadoopFileIO-based FileSystem implementations.

This PR encodes the field name in addition to the preexisting encoding for the partition values.

Fixes #10279
See also: #10283

cc. @dimas-b

amogh-jahagirdar · 2024-05-13T21:49:55Z

core/src/test/java/org/apache/iceberg/TestLocationProvider.java

+    table.updateSchema().addColumn("data#1", Types.StringType.get()).commit();
+    table.updateSpec().addField("data#1").commit();


Could we add a test for some other atypical characters like "$" or "&"? For example, a character from the "Characters that might require special handling" section in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html.

They recommend URL encoding the values, which is what we're doing here. I'm not sure how much value there is in using a range of characters since we are just validating that url encoding is happening.

The only one that immediately stands out to me is =, which does not get encoded, but that's been used for partition paths for a long time with both FileIO and Hadoop FileSystem.

Maybe nitpicking, but in my reading of the AWS doc linked above I do not think they are recommending URL encoding. The doc's language is pretty vague. In fact, I suspect URL encoding even partition key values here will cause incorrect processing in tools that use java.net.URI for parsing those locations in Iceberg metadata files because the escaped chars will be converted to their proper Unicode chars during parsing and will likely not match existing S3 keys when passed to software.amazon.awssdk.services.s3.S3Utilities.parseUri(URI uri)... I'm not saying there's definitely a bug there, but I strongly suspect interoperability issues.

Example:

    @Test
    void testA() {
      URI uri = URI.create("s3://bucket/path/id%22=1/something.parquet");
      S3Utilities u = S3Utilities.builder().region(Region.EU_CENTRAL_1).build();
      S3Uri s3Uri = u.parseUri(uri);
      soft.assertThat(s3Uri.key().get()).isEqualTo("path/id%22=1/something.parquet");
    }

Outcome:

    Expected :"path/id%22=1/something.parquet"
    Actual   :"path/id"=1/something.parquet"

Note that Iceberg, by virtue of having its own S3URI class, will use path/id%22=1/something.parquet (with the % sign) as the S3 key at the S3 API level.
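The decoding behavior described in this thread can be observed with `java.net.URI` alone, without the AWS SDK. The following is a minimal sketch, not code from this PR; the class and method names are made up for illustration. It only shows that `getPath()` returns the decoded form while `getRawPath()` keeps the percent-escapes, which is the kind of mismatch a consumer can hit when it parses an Iceberg location through `URI`.

```java
import java.net.URI;

// Hypothetical demo class; not part of Iceberg or the AWS SDK.
class UriDecodingSketch {
  // Path exactly as written in the location string (escapes preserved).
  static String rawPath(String location) {
    return URI.create(location).getRawPath();
  }

  // Path after java.net.URI has decoded percent-escaped octets.
  static String decodedPath(String location) {
    return URI.create(location).getPath();
  }

  public static void main(String[] args) {
    String location = "s3://bucket/path/id%22=1/something.parquet";
    System.out.println(rawPath(location));     // /path/id%22=1/something.parquet
    System.out.println(decodedPath(location)); // /path/id"=1/something.parquet
  }
}
```

A tool that derives the object key from the decoded path would look for a key containing a literal `"`, whereas a writer that uses the location string verbatim would have stored the key with `%22`.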

For now, we'll just make this consistent with what we're doing for the values. I agree we should probably up-level this discussion and figure out a path forward that is more usable with other URI parsing utilities.

dimas-b · 2024-05-13T23:24:30Z

api/src/main/java/org/apache/iceberg/PartitionSpec.java

@@ -189,7 +189,7 @@ public String partitionToPath(StructLike data) {
      if (i > 0) {
        sb.append("/");
      }
-      sb.append(field.name()).append("=").append(escape(valueString));
+      sb.append(escape(field.name())).append("=").append(escape(valueString));


This fix looks reasonable to me as far as #10279 is concerned.
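To make the effect of the one-line change concrete, here is a small standalone sketch. It assumes that `escape` is a plain UTF-8 `URLEncoder` call (consistent with the snippet quoted later in this thread, but not copied from `PartitionSpec`); the class and the `oldSegment`/`newSegment` helpers are hypothetical and exist only for this illustration. It uses the `URLEncoder.encode(String, Charset)` overload, which requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it mirrors the shape of the diff, not the real PartitionSpec.
class PartitionPathSketch {
  // Assumption: escape() URL-encodes with UTF-8.
  static String escape(String s) {
    return URLEncoder.encode(s, StandardCharsets.UTF_8);
  }

  // Before the change: only the partition value is escaped.
  static String oldSegment(String fieldName, String value) {
    return fieldName + "=" + escape(value);
  }

  // After the change: the field name is escaped as well.
  static String newSegment(String fieldName, String value) {
    return escape(fieldName) + "=" + escape(value);
  }

  public static void main(String[] args) {
    System.out.println(oldSegment("data#1", "a/b")); // data#1=a%2Fb
    System.out.println(newSegment("data#1", "a/b")); // data%231=a%2Fb
    System.out.println(oldSegment("id\"", "1"));     // id"=1
    System.out.println(newSegment("id\"", "1"));     // id%22=1
  }
}
```

The last two lines show the prefix change for a field name containing a quote: the same partition key yields a different path segment before and after the change.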

However, this will cause partition paths to change in existing tables that have special characters that previously did not cause correctness issues, for example quotes ("). So new files under the same partition key will not share a common prefix with old files under that same partition key... Just want to make sure this is intentional.

Yes, I think that's unavoidable. However, the physical path is just for practical/visual validation and isn't necessary for the correct operation of the table. I would also suspect that the number of tables where there's a field name and partitioning that would be affected is rather small.

I looked for cases that would be problematic for existing paths, but usage of this is rather isolated to the location providers.

If the value of file prefixes is not critical, why not use a non-URI encoding and avoid the URI interoperability problems altogether (note my other comment)? For example, the escape() method could produce values using only non-special chars according to the URI RFC (that is, avoid even using % for escaping). As far as I understand, the escaping method does not even have to be reversible.
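One possible shape for such an escaping scheme is sketched below. This is purely a hypothetical illustration, not something proposed in the PR: it keeps ASCII letters, digits, `-` and `.` as-is and writes every other UTF-8 byte as `_` followed by two hex digits, so the output contains only characters that RFC 3986 lists as unreserved. The class name and the choice of `_` as the escape marker are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical percent-free escaping; not an Iceberg API.
class PercentFreeEscapeSketch {
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      int c = b & 0xFF;
      boolean safe =
          (c >= 'a' && c <= 'z')
              || (c >= 'A' && c <= 'Z')
              || (c >= '0' && c <= '9')
              || c == '-'
              || c == '.';
      if (safe) {
        sb.append((char) c);
      } else {
        // '_' itself is not in the safe set, so it is escaped too ("_5F").
        sb.append('_').append(String.format("%02X", c));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(escape("data#1")); // data_231
    System.out.println(escape("id\""));   // id_22
    System.out.println(escape("id%20"));  // id_2520
  }
}
```

Because the output never contains `%`, a URI parser has nothing to decode, so the stored string and the parsed path stay identical.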

@dimas-b In general, I agree with you. It would be good to not include any characters that are problematic.

To give a little context around how we ended up here, this layout is largely to provide similarity with Hive-style pathing so people transitioning over felt somewhat comfortable with the layout. Hive encodes some characters, but in a way that would also cause problems mentioned previously. I feel like most systems at this point are wary of any URI parsing because of this legacy behavior and cloud storage has made it more complicated. Tables that are migrated will likely have these issues, so we're largely stuck with supporting them for a while.

I'm not convinced that we want to move away from a reversible encoding because there's an existing expectation that you could recover by inferring the partitioning from the path structure.

The direction I would like to head overall is to provide the option to omit the path structure entirely. In Iceberg, the physical layout is entirely decoupled from the logical partitioning, so there's really no need for it. The additional pathing has a number of downsides in addition to the character encoding (long keys, type erasure, etc.).

For now, I think the simple path forward is to encode the field names as has always been done with the values, and then either introduce better encodings or remove them entirely.

As I commented above, this PR does fix #10279, which is welcome.

I'm looking forward to further improvements in Iceberg to improve interoperability with location strings stored in its metadata files.

> I'm not convinced that we want to move away from a reversible encoding because there's an existing expectation that you could recover by inferring the partitioning from the path structure.

Doesn't this PR make the recovery situation worse? For example, in an old path, one could have s3://bucket/path/id%20=1/... (no encoding), in the new path it will be s3://bucket/path/id%2520=1/...... How does one figure out what to decode and when?

> Doesn't this PR make the recovery situation worse? For example, in an old path, one could have s3://bucket/path/id%20=1/... (no encoding), in the new path it will be s3://bucket/path/id%2520=1/...... How does one figure out what to decode and when?

I guess one approach to deal with such a case would be to decode the path string first before applying the encoding in order to not cause double-encoding:

    private String escape(String string) {
      try {
        return URLEncoder.encode(URLDecoder.decode(string, "UTF-8"), "UTF-8");
      } catch (UnsupportedEncodingException e) {
        throw new RuntimeException(e);
      }
    }

But this also isn't an ideal solution and so I agree that long term we might want to consider better encoding or remove them.

The %20 could be part of the column name (or value) as in my example in #10279, so decoding the original string may not be applicable to all cases.

create table test.ns.t9(`id%20` string not null, a int) partitioned by (`id%20`);

... but I did not test that particular name in practice :)
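The ambiguity discussed in this thread can be reproduced with the JDK encoder alone. The sketch below is an illustration under the assumption that escaping is plain UTF-8 `URLEncoder` encoding; the class and method names are hypothetical. It shows that encoding a name that already contains `%20` yields `%2520`, and that the decode-then-encode variant maps a column literally named `id%20` and a column named `id ` (with a trailing space) to the same path segment. It uses the `Charset` overloads, which require Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the double-encoding discussion.
class DoubleEncodingSketch {
  // Plain encoding, as assumed for escape().
  static String encode(String s) {
    return URLEncoder.encode(s, StandardCharsets.UTF_8);
  }

  // The decode-first variant suggested in the thread.
  static String decodeThenEncode(String s) {
    return URLEncoder.encode(
        URLDecoder.decode(s, StandardCharsets.UTF_8), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(encode("id%20"));           // id%2520
    System.out.println(decodeThenEncode("id%20")); // id+
    System.out.println(decodeThenEncode("id "));   // id+
  }
}
```

With decode-then-encode, two distinct field names collapse to one segment, so the encoding is no longer reversible for names that legitimately contain percent sequences.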

Url encode field ids for partition paths

c9c2072

danielcweeks requested review from rdblue and nastra May 13, 2024 21:15

github-actions bot added API core labels May 13, 2024

Cleanup tests

ce302b0

danielcweeks mentioned this pull request May 13, 2024

Support special chars in S3URI #10283

Closed

amogh-jahagirdar approved these changes May 13, 2024

dimas-b reviewed May 13, 2024

nastra reviewed May 14, 2024

Add test for partition paths

3225a07

danielcweeks requested a review from nastra May 15, 2024 21:33

nastra approved these changes May 16, 2024

dimas-b mentioned this pull request May 16, 2024

Do not allow special characters in base table locations projectnessie/nessie#8524

Open

danielcweeks merged commit 795fea9 into apache:main May 27, 2024
42 checks passed
