The simplest thing would be to have a "tensor" or "ndarray" type where
each cell has the same shape. This would amount to adding the current
"Tensor" Flatbuffers table to the Type union in
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194

The benefit of having each cell have the same shape is that the
physical representation is FixedSizeBinary.
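To make that concrete, here is a rough sketch (using numpy and a recent
pyarrow -- the names and types here are only illustrative, not a
finished design) of how a column of fixed-shape float64 tensors reduces
to FixedSizeBinary:

    import numpy as np
    import pyarrow as pa

    shape = (2, 3)                        # every cell is a 2x3 tensor
    byte_width = int(np.prod(shape)) * 8  # float64 -> 48 bytes per cell

    # Each cell is the raw bytes of one ndarray
    cells = [np.random.rand(*shape) for _ in range(4)]
    column = pa.array([c.tobytes() for c in cells],
                      type=pa.binary(byte_width))

    # Reading a cell back is just reinterpret-and-reshape
    first = np.frombuffer(column[0].as_py(),
                          dtype=np.float64).reshape(shape)

The shape itself would live in the type metadata (the proposed Tensor
entry in the Type union), not in the data.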
Some caveats / notes:

* We have a prior unresolved discussion about our approach to logical
  types. I could argue that this might fall into the same bucket of
  logical types. I don't think we should merge any patches related to
  this issue until we resolve that discussion.

* Using FixedSizeBinary as the physical representation constrains value
  sizes to 2GB (the product of the shape and the element width),
  because the FixedSizeBinary metadata uses int for the byteWidth. We
  might consider changing this to long (64 bits), but that's a separate
  discussion.

* If we permitted each cell to have a different shape, then we would
  need to use Binary (vs. FixedSizeBinary), which would limit a whole
  column to 2GB of total tensor data. This could be mitigated by
  introducing LargeBinary (64-bit offsets), but that requires
  additional discussion (there is a JIRA about this already from some
  time ago). See the sketch after these notes.
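For contrast, a sketch of that variable-shape alternative, again with
illustrative names: the shape has to travel with each cell (here as a
struct of shape and raw bytes), and it is the Binary child's 32-bit
offsets that cap the column at 2GB:

    import numpy as np
    import pyarrow as pa

    # Hypothetical layout: carry the shape alongside each cell's bytes
    cells = [np.random.rand(2, 3), np.random.rand(4, 4),
             np.random.rand(1, 5)]
    column = pa.array(
        [{'shape': list(c.shape), 'data': c.tobytes()} for c in cells],
        type=pa.struct([
            pa.field('shape', pa.list_(pa.int32())),
            pa.field('data', pa.binary()),  # int32 offsets -> 2GB cap
        ]))

    cell = column[1].as_py()
    arr = np.frombuffer(cell['data'],
                        dtype=np.float64).reshape(cell['shape'])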
Given that we are still falling short of a complete implementation of
other Arrow types (unions, intervals, fixed-size lists), I urge all to
be deliberate about not piling on more technical debt / format
implementation shortfall if it can be avoided -- so a solution to this
might be to have a patch for initial Tensor/Ndarray value support that
is implemented in Java and/or C++.

How about creating a JIRA about this broad topic and creating a Google
doc with a proposed implementation approach for discussion?

Thanks
Wes

On Tue, Apr 10, 2018 at 5:48 PM, Li Jin <ice.xell...@gmail.com> wrote:
> What do people think about whether "shape" should be included as an
> optional part of schema metadata or a required part of the schema
> itself?
>
> I feel having it be required might be too restrictive for interop with
> other systems.
>
> On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
>> My gut feeling is that such a column type should specify both the
>> shape and primitive type of all values in the column. I can’t think
>> of a common use case that requires differently shaped tensors in a
>> single column.
>>
>> Can anyone here come up with such a use case?
>>
>> If not, I can try to draft a proposed change to the spec that adds
>> these types. The next question is whether such a change can make it
>> in (with C++ and Java implementations) before 1.0.
>>
>> On Mon, Apr 9, 2018 at 17:36 Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > > As far as I know, there is an implementation of tensor type in
>> > > C++/Python already. Should we just finalize the spec and add
>> > > implementation to Java?
>> >
>> > There is nothing specified yet as far as a *column* of
>> > ndarrays/tensors. We defined Tensor metadata for the purposes of
>> > IPC/serialization but made no effort to incorporate such data into
>> > the columnar format.
>> >
>> > There are likely many ways to implement a column whose values are
>> > ndarrays, each cell with its own shape. Whether we would want to
>> > permit each cell to have a different ndarray cell type is another
>> > question (i.e. would we want to constrain every cell in a column to
>> > contain ndarrays of a particular type, like float64).
>> >
>> > So there are a couple of questions:
>> >
>> > * How to represent the data using the columnar format
>> > * How to incorporate ndarray metadata into columnar schemas
>> >
>> > - Wes
>> >
>> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <ice.xell...@gmail.com> wrote:
>> > > As far as I know, there is an implementation of tensor type in
>> > > C++/Python already. Should we just finalize the spec and add
>> > > implementation to Java?
>> > >
>> > > On the Spark side, it's probably more complicated, as Vector and
>> > > Matrix are not "first class" types in Spark SQL. Spark ML
>> > > implements them as UDTs (user-defined types), so it's not clear
>> > > how to make the Spark/Arrow converter work with them.
>> > >
>> > > I wonder if Bryan and Holden have some more thoughts on that?
>> > >
>> > > Li
>> > >
>> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I’ve been doing some work lately with Spark’s ML interfaces,
>> > >> which include sparse and dense Vector and Matrix types, backed
>> > >> on the Scala side by Breeze. Using these interfaces, you can
>> > >> construct DataFrames whose column types are vectors and
>> > >> matrices, and though the API isn’t terribly rich, it is possible
>> > >> to run Python UDFs over such a DataFrame and get numpy ndarrays
>> > >> out of each row. However, if you’re using Spark’s Arrow
>> > >> serialization between the executor and Python workers, you get
>> > >> this UnsupportedOperationException:
>> > >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
>> > >>
>> > >> I think it would be useful for Arrow to support something like a
>> > >> column of tensors, and I’d like to see if anyone else here is
>> > >> interested in such a thing. If so, I’d like to propose adding it
>> > >> to the spec and getting it implemented in at least Java and
>> > >> C++/Python.
>> > >>
>> > >> Some initial mildly-scattered thoughts:
>> > >>
>> > >> 1. You can certainly represent these today as List<Double> and
>> > >> List<List<Double>>, but then you need to do some copying to get
>> > >> them back into numpy ndarrays.
>> > >>
>> > >> 2. In some cases it might be useful to know that a column
>> > >> contains 3x3x4 tensors, for example, and not just that there are
>> > >> three dimensions, as you’d get with List<List<List<Double>>>.
>> > >> This could constrain what operations are meaningful (for
>> > >> example, in Spark you could imagine type checking that verifies
>> > >> dimension alignment for matrix multiplication).
>> > >>
>> > >> 3. You could approximate that with a FixedSizeList and metadata
>> > >> about the tensor shape.
>> > >>
>> > >> 4. But I kind of feel like this is generally useful enough that
>> > >> it’s worth having one implementation of it (well, one for each
>> > >> runtime) in Arrow.
>> > >>
>> > >> 5. Or, maybe everyone here thinks Spark should just do this with
>> > >> metadata?
>> > >>
>> > >> Curious to hear what you all think.
>> > >>
>> > >> Thanks,
>> > >> Leif
>> > >>
>> > >> --
>> > >> --
>> > >> Cheers,
>> > >> Leif
>>
>> --
>> --
>> Cheers,
>> Leif
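For completeness, the FixedSizeList-plus-metadata approximation from
Leif's point 3 might look roughly like this in a recent pyarrow (the
metadata key here is illustrative only, not an established convention):

    import pyarrow as pa

    # 3x3x4 float64 tensors flattened into a fixed-size list of 36
    # elements; the logical shape rides along as field-level metadata
    tensor_type = pa.list_(pa.float64(), 36)
    field = pa.field('tensor', tensor_type, metadata={'shape': '3,3,4'})
    schema = pa.schema([field])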