Thanks for bringing this up, Brian! In my view, the decision tree for this
would look something like:

1.) Is there anything incorrect about supporting truncate on the basis of
width for binary columns? I can't really think of a reason; it seems
legitimate to me (handling data that isn't valid UTF-8 comes to mind as a
use case), and engines like Spark already handle such a transform (for
Spark, TestSparkTruncateFunction#testTruncateBinary already validates the
behavior). If there were something problematic with supporting this, then
I think we'd want to consider deprecating that behavior, but again I don't
see anything problematic since the transform is well defined.

2.) If there's nothing incorrect, then changing the spec makes sense. I
left a review on the PR; the only thing is that with a spec we want to be
explicit, so we should specify that truncate on binary takes bytes
[0, width), essentially the same as the string case. A rough sketch of
those semantics is below.
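
To make that concrete, here is a minimal sketch of the intended behavior.
This is not the actual Iceberg implementation; the class and method names
are just for illustration:

    import java.nio.ByteBuffer;
    import java.util.Arrays;

    public class TruncateBinarySketch {
      // Keep bytes [0, width); values already shorter than width pass through.
      static ByteBuffer truncateBinary(int width, ByteBuffer value) {
        if (value.remaining() <= width) {
          return value;
        }
        byte[] head = new byte[width];
        value.duplicate().get(head);  // copy without disturbing the input buffer
        return ByteBuffer.wrap(head);
      }

      public static void main(String[] args) {
        ByteBuffer input = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
        ByteBuffer truncated = truncateBinary(3, input);
        byte[] out = new byte[truncated.remaining()];
        truncated.duplicate().get(out);
        System.out.println(Arrays.toString(out));  // prints [1, 2, 3]
      }
    }

That mirrors the string case in spirit: take the leading width units and
leave shorter values alone.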

Right now I'm favoring 2, but I think it's worth leaving this open for a
bit in case others see an issue with supporting this behavior.

Thanks,

Amogh Jahagirdar

On Wed, Apr 3, 2024 at 6:45 PM Brian Hulette <bhule...@apache.org> wrote:

> Hello, this is my first time writing on this list, so I'll introduce
> myself. I'm Brian Hulette, I've been involved with a couple of Apache
> projects in the past (Arrow and Beam), and now I'm working on BigQuery's
> support for Iceberg.
>
> My colleague raised an issue [1] a while ago about a discrepancy between
> the specification and the implementation: truncate is not supposed to work
> on binary columns, but it looks like it does. It seems unlikely that we
> will drop support for something that is working now, so I figured we should
> document the current behavior instead. I drafted a PR for this [2]; could
> someone help review?
>
> Thanks!
> Brian
>
> [1] https://github.com/apache/iceberg/issues/5251
> [2] https://github.com/apache/iceberg/pull/10079
>
