I think it may be helpful to clarify what you mean by dimensions that are
not known in advance. I believe the intention here is that this unknown
dimension is consistent within a record batch, but it is allowed to vary
from batch to batch. Otherwise, I would say you could just delay creating
the schema until you do know the unknown dimension.

This isn't really relevant, but I feel compelled to point it out: the
FixedSizeList type has actually been in the Arrow spec for a while, but it
was initially implemented only in JS and Java. It was implemented in C++
just a few months ago.

On Mon, Jul 29, 2019 at 7:01 AM Edward Loper <edlo...@google.com.invalid>
wrote:

> The FixedSizeList type, which was added to Arrow a few months ago, is an
> array where each slot contains a fixed-size sequence of values.  It is
> specified as FixedSizeList<T>[N], where T is a child type and N is a signed
> int32 that specifies the length of each list.
>
> This is useful for encoding fixed-size tensors.  E.g., if I have a 100x8x10
> tensor, then I can encode it as
> FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[8]>[100].
>
> But I'm also interested in encoding tensors where some dimension sizes are
> not known in advance.  It seems to me that FixedSizeList could be extended
> to support this fairly easily, by simply defining that N=-1 means "each
> array slot has the same length, but that length is not known in advance."
> So, e.g., we could encode a 100x?x10 tensor as
> FixedSizeList<FixedSizeList<FixedSizeList<byte>[10]>[-1]>[100].
>
> Since these N=-1 row-lengths are not encoded in the type, we need some way
> to determine what they are.  Luckily, every Field in the schema has a
> corresponding FieldNode in the message; and those FieldNodes can be used to
> deduce the row lengths.  In particular, the row length must be equal to the
> length of the child node divided by the length of the FixedSizeList.  E.g.,
> if we have a FixedSizeList<byte>[-1] array with the values [[1, 2], [3, 4],
> [5, 6]] then the message representation is:
>
> * Length: 3, Null count: 0
> * Null bitmap buffer: Not required
> * Values array (byte array):
>     * Length: 6, Null count: 0
>     * Null bitmap buffer: Not required
>     * Value buffer: [1, 2, 3, 4, 5, 6, <unspecified padding bytes>]
>
> So we can deduce that the row length is 6/3=2.
>
> It looks to me like it would be fairly easy to add support for this.  E.g.,
> in the FixedSizeListArray constructor in C++, if list_type()->list_size()
> is -1, then set list_size_ to values.length()/length.  There would be no
> changes to the schema.fbs/message.fbs files -- we would just be assigning a
> meaning to something that's currently meaningless (having
> FixedSizeList.listSize=-1).
>
> If there's support for adding this to Arrow, then I could put together a
> PR.
>
> Thanks,
> -Edward
>
> P.S. Apologies if this gets posted twice -- I sent it out a couple days ago
> right before subscribing to the mailing list; but I don't see it on the
> archives, presumably because I wasn't subscribed yet when I sent it out.
>
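
To make the row-length deduction in the quoted proposal concrete, here is a
minimal Python sketch. It does not use the Arrow libraries: the Node class
and resolve_list_size function are hypothetical names made up for
illustration, standing in for a FieldNode (its length) together with the
list size declared in the schema. It assumes the proposed convention that
N=-1 means "deduce the row length from the child length".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a FieldNode plus its declared list size."""
    length: int                       # number of slots at this level
    list_size: Optional[int] = None   # declared N; -1 = deduce; None = leaf
    child: Optional["Node"] = None    # values array of a FixedSizeList


def resolve_list_size(node: Node) -> int:
    """Return the row length of a FixedSizeList node.

    A declared size is returned unchanged. For the proposed N=-1 case the
    row length is child.length / node.length, as in the quoted message.
    """
    if node.list_size is None or node.child is None:
        raise ValueError("not a FixedSizeList node")
    if node.list_size != -1:
        return node.list_size
    if node.length == 0:
        # With zero slots the quotient is 0/0, so nothing can be deduced.
        raise ValueError("cannot deduce row length from an empty array")
    size, remainder = divmod(node.child.length, node.length)
    if remainder != 0:
        raise ValueError("child length is not a multiple of array length")
    return size


# The example from the quoted message: [[1, 2], [3, 4], [5, 6]]
# has 3 slots and a values array of length 6.
example = Node(length=3, list_size=-1, child=Node(length=6))
print(resolve_list_size(example))  # prints 2
```

One thing the sketch surfaces is the empty-array case: when the list array
has length 0, the division gives no information, so a real implementation
would need some rule for that situation.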
