Hi all,

A discussion on the Iceberg ML [1] recently highlighted that URL path
segments are not being decoded correctly according to RFC 3986,
specifically regarding space encoding.

I investigated the situation in Polaris and found several problems:

TLDR

- Table names with the + sign can be created but cannot be retrieved
- Namespace names with the + sign are OK (can be created and retrieved)
- Table names with spaces cannot be created
- Namespace names with spaces cannot be created

DISCUSSION

Table names such as "foo+bar" can be created (via POST, where the name
is in the request body). But they cannot be retrieved: when reading
tables, the name is part of the URL path. Polaris incorrectly performs
a second decoding step using RESTUtil.decodeString(table), even though
the REST framework has already decoded it. Consequently, a client
sends "foo%2Bbar" which is first decoded to "foo+bar" by the framework
(correct) and then re-decoded by Polaris to "foo bar" (incorrect),
resulting in a "not found" error.
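
To make the double decoding concrete, here is a minimal sketch. It is
not the Polaris code: the class and method names are made up,
java.net.URI stands in for the REST framework's path decoding, and the
second step is assumed to behave like java.net.URLDecoder (form-style
decoding, where "+" means space).

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DoubleDecodeDemo {

  // Stand-in for the framework (assumption): percent-decoding of a path
  // segment, where "+" is an ordinary character.
  static String pathDecode(String rawSegment) {
    return URI.create("/" + rawSegment).getPath().substring(1);
  }

  // Stand-in for the redundant second step (assumption): form-style
  // decoding, which turns "+" into a space.
  static String formDecode(String value) {
    return URLDecoder.decode(value, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String onTheWire = "foo%2Bbar";
    String afterFramework = pathDecode(onTheWire);         // foo+bar
    String afterSecondDecode = formDecode(afterFramework); // foo bar
    System.out.println(afterFramework);
    System.out.println(afterSecondDecode);
  }
}
```

The second call is the one that turns a valid name into a different
one.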

Table and namespace names like "foo bar" cannot be created at all.
This is because in IcebergCatalog.defaultWarehouseLocation() and
other similar places, we create locations merely by joining
identifiers together, without any form of URL encoding: see [2] and [3].
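
To illustrate why unencoded joining is fragile, here is a minimal
sketch. It is not the Polaris code: the class, the method names, and
the s3://bucket/warehouse base are made up for the example, and where
exactly the real failure surfaces is not shown here. It only
demonstrates that a plain string join yields something java.net.URI
rejects, while the multi-argument URI constructor quotes the space.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LocationJoinDemo {

  // Hypothetical naive join: identifiers are concatenated as-is.
  static String naiveJoin(String base, String... parts) {
    return base + "/" + String.join("/", parts);
  }

  // True if the string parses as a URI without errors.
  static boolean isValidUri(String candidate) {
    try {
      new URI(candidate);
      return true;
    } catch (URISyntaxException e) {
      return false;
    }
  }

  // Hypothetical encoding join: the multi-argument URI constructor
  // percent-encodes characters that are illegal in a path, such as spaces.
  static String encodedJoin(String scheme, String host, String basePath, String... parts) {
    try {
      String path = basePath + "/" + String.join("/", parts);
      return new URI(scheme, host, path, null).toString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String naive = naiveJoin("s3://bucket/warehouse", "ns", "foo bar");
    System.out.println(naive + " valid? " + isValidUri(naive));
    System.out.println(encodedJoin("s3", "bucket", "/warehouse", "ns", "foo bar"));
  }
}
```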

And even if tables like "foo bar" could be created, they couldn't be
retrieved by Java clients. This is because current Java clients
incorrectly encode that name as "foo+bar" in the URL path. Since "+"
has no special meaning in a path segment, the REST framework leaves it
unchanged. Consequently, Polaris would look for a table named
"foo+bar" instead and throw a "not found" error. (Other clients would
send "foo%20bar", which the framework would correctly decode as
"foo bar", and the lookup would succeed.)
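
The difference between the two client behaviours can be sketched as
follows. This is an illustration only: the class and method names are
made up, java.net.URLEncoder is assumed to approximate what current
Java clients do, and java.net.URI stands in for the framework's path
decoding.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SpaceEncodingDemo {

  // Form-style encoding (assumed client behaviour): space becomes "+".
  static String formEncode(String name) {
    return URLEncoder.encode(name, StandardCharsets.UTF_8);
  }

  // Path-safe variant: after form encoding, every remaining "+" is an
  // encoded space (a literal "+" is already "%2B"), so it can be
  // rewritten to "%20".
  static String pathEncode(String name) {
    return URLEncoder.encode(name, StandardCharsets.UTF_8).replace("+", "%20");
  }

  // Stand-in for the framework (assumption): "+" is left untouched.
  static String pathDecode(String rawSegment) {
    return URI.create("/" + rawSegment).getPath().substring(1);
  }

  public static void main(String[] args) {
    String name = "foo bar";
    System.out.println(pathDecode(formEncode(name))); // foo+bar
    System.out.println(pathDecode(pathEncode(name))); // foo bar
  }
}
```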

PROPOSAL

To resolve the issue with the + sign in table names, we simply need to
eliminate the redundant decoding step. I can open a PR for that
shortly.

To resolve the issue with spaces in table and namespace names, we
could fix all the methods that incorrectly join together identifiers
without proper URL encoding.

Finally, addressing the Java clients encoding problem is complex, but
we could consider implementing a workaround as follows:

1) If the client is Java and lacks the upcoming Iceberg fix for space
encoding, manually replace "+" with a space to correct the client's
faulty encoding.

2) For non-Java clients or those with the fix, no workaround would be required.
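
One possible shape for this workaround is sketched below. It rests on
assumptions: the class and method names are made up, the raw
(still-encoded) path segment is assumed to be available to the server,
and how a legacy Java client would be detected is left open, so it is
represented by a plain boolean. Working on the raw segment matters
here, because a legacy client sends a literal "+" as "%2B", and that
must not be turned into a space.

```java
import java.net.URI;

public class LegacyClientWorkaround {

  // Decodes a raw path segment. For legacy clients (hypothetical flag),
  // a raw "+" is treated as an encoded space before percent-decoding.
  static String decodeSegment(String rawSegment, boolean legacyJavaClient) {
    String normalized = legacyJavaClient ? rawSegment.replace("+", "%20") : rawSegment;
    return URI.create("/" + normalized).getPath().substring(1);
  }

  public static void main(String[] args) {
    System.out.println(decodeSegment("foo+bar", true));    // foo bar
    System.out.println(decodeSegment("foo%2Bbar", true));  // foo+bar
    System.out.println(decodeSegment("foo+bar", false));   // foo+bar
    System.out.println(decodeSegment("foo%20bar", false)); // foo bar
  }
}
```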

What are your thoughts on this?

Thanks,
Alex

[1]: https://lists.apache.org/thread/c498svln0x18vvm42998b9nm9j6ck5yh
[2]: https://github.com/apache/polaris/blob/e94fdff63852dc41635c9e7eb62b3627ba562b85/runtime/service/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java#L379
[3]: https://github.com/apache/polaris/blob/e94fdff63852dc41635c9e7eb62b3627ba562b85/runtime/service/src/main/java/org/apache/polaris/service/catalog/iceberg/IcebergCatalog.java#L571