Just to chime in (and add yet another voice into the mix here), I'd have a preference for it being signed integers for the same reasons as most everyone else: consistency with everything else in the spec. Since we use signed integers everywhere, I'd prefer to keep it consistent rather than introduce an exception.
--Matt On Wed, Sep 20, 2023 at 11:19 AM Benjamin Kietzman <bengil...@gmail.com> wrote: > Hello all, > > Thanks for the input! > > @Will > > > Could we name which ones it would be compatible with? > > UmbraDB [1], Velox [2], and DuckDB [3] all use unsigned integers for size. > > > As I understood it originally, the current implementations use raw > pointers > > Yes, these implementations use raw pointers exclusively and for > transmission via IPC > those will require conversion. The compatibility we'd achieve by using > unsigned > integers is mostly one of documentation and validation; the above engines > would > accept a string as long as `2**32 - 1`, but with signed integers arrow > would reject > it. This is a distant edge case so I'd say it's acceptable to exclude it. > (For completeness it should be noted that conversion between signed and > unsigned is > typically a trivial operation and shouldn't be considered as adding to any > runtime cost.) > > @Dewey > > ... the spec specifically discourages their use. > > If the times have changed and this is no longer the case perhaps there > > should be a wider effort to support unsigned values in other places > > I'm not sure which signed integers you're thinking of. At the risk of > constructing a > strawman, I would say that for example we shouldn't extend the offsets in > the List type > to support unsigned integers. This would require adding a new type named > UnsignedOffsetsList or so, and I don't think that would offer a useful > difference from > the existing List type. I would recommend waiting for implementers to > motivate such a > change by proposing new cases where unsigned integers would be useful to > support > (as with unsigned dictionary indices). > > > I would lean towards signed integers because we don't use unsigned > > integers anywhere in the existing specification > > It's true this is the convention in the spec. Since you and @pitrou have > both mentioned > it (and we haven't heard strong opinions in support of unsigned sizes), I > think that's > sufficient grounds to avoid the exception. > > Sincerely, > Ben Kietzman > > [1] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf (section 3.1, > second paragraph) > [2] > > https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L255 > [3] > > https://github.com/duckdb/duckdb/blob/5a29c99891dcc71d19ce222ee3a68bf08686680e/src/include/duckdb/common/types/string_type.hpp#L210 > > On Tue, Sep 19, 2023 at 4:50 PM Will Jones <will.jones...@gmail.com> > wrote: > > > Hi Ben, > > > > I'm open to the idea of using unsigned if it allows compatibility with an > > existing implementation or two. Could we name which ones it would be > > compatible with? Links to implementation code would be very welcome, if > > available. > > > > As I understood it originally, the current implementations use raw > pointers > > rather than offsets into buffers, so these values already weren't > > compatible to begin with. For example, it seems like Velox uses raw > > pointers into buffers [1]. So unless I'm missing something, it seems like > > these implementations will have to map the pointers to offsets anyways, > so > > maybe it's not much more trouble to convert to signed integers on the > way. > > > > Best, > > > > Will Jones > > > > [1] > > > > > https://github.com/facebookincubator/velox/blob/db8edec395288527a7464d17ab86b36b970eb270/velox/type/StringView.h#L46-L78 > > > > On Tue, Sep 19, 2023 at 8:26 AM Benjamin Kietzman <bengil...@gmail.com> > > wrote: > > > > > Hello again all, > > > > > > It seems there hasn't been much interest in this point so I'm leaning > > > toward keeping unsigned integers. If anyone has a concern please > respond > > > here and/or on the PR [1]. > > > > > > Sincerely, > > > Ben Kietzman > > > > > > [1] https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > > > On Thu, Sep 14, 2023 at 9:31 AM David Li <lidav...@apache.org> wrote: > > > > > > > I think Java was usually raised as the odd child out when this has > come > > > up > > > > before. Since Java 8 there are standard library methods to manipulate > > > > signed integers as if they were unsigned, so in principle Java > > shouldn't > > > be > > > > a blocker anymore. > > > > > > > > That said, ByteBuffer is still indexed by int so in practice Java > > > wouldn't > > > > be able to handle more than 2 GB in a single buffer, at least until > we > > > can > > > > use the Java 21+ APIs (MemorySegment is finally indexed by (signed) > > > long). > > > > > > > > On Tue, Sep 12, 2023, at 11:40, Benjamin Kietzman wrote: > > > > > Hello all, > > > > > > > > > > Utf8View was recently accepted [1] and I've opened a PR to add the > > > > > spec/schema changes [2]. In review [3], it was requested that > signed > > 32 > > > > bit > > > > > integers be used for the fields of view structs instead of 32 bit > > > > unsigned. > > > > > > > > > > This divergence has been discussed on the ML previously [4], but in > > > light > > > > > of my reviewer's request for a change it should be raised again for > > > > focused > > > > > discussion. (At this stage, I don't *think* the change would > require > > > > > another vote.) I'll enumerate the motivations for signed and > unsigned > > > as > > > > I > > > > > understand them. > > > > > > > > > > Signed: > > > > > - signed integers are conventional in the arrow format > > > > > - unsigned integers may cause some difficulty of implementation in > > > > > languages which don't natively support them > > > > > > > > > > Unsigned: > > > > > - unsigned integers are used by engines which already implement > > > Utf8View > > > > > > > > > > My own bias is toward compatibility with existing implementers, but > > > using > > > > > signed integers will only affect the case of arrays which include > > data > > > > > buffers larger than 2GB. For reference, the default buffer size in > > > velox > > > > is > > > > > 32KB so such a massive data buffer would only occur when a single > > slot > > > > of a > > > > > string array has 2.1GB of characters. This seems sufficiently > > unlikely > > > > that > > > > > I wouldn't consider it a blocker. > > > > > > > > > > Sincerely, > > > > > Ben Kietzman > > > > > > > > > > [1] > https://lists.apache.org/thread/wt9j3q7qd59cz44kyh1zkts8s6wo1dn6 > > > > > [2] https://github.com/apache/arrow/pull/37526 > > > > > [3] > > https://github.com/apache/arrow/pull/37526#discussion_r1323029022 > > > > > [4] > https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt > > > > > [5] > > > > > > > > > > > > > > > https://github.com/facebookincubator/velox/blob/947d98c99a7cf05bfa4e409b1542abc89a28cb29/velox/vector/FlatVector.h#L46-L50 > > > > > > > > > >