Hi Andrew, I'm copying dev@ just so more folks are in the loop.
On Wed, Jun 19, 2019 at 9:13 AM Andrew Spott <andrew.sp...@gmail.com> wrote:
>
> I was told to post this here, rather than as an issue on Github.
>
> ====
>
> I'm looking to serialize data that looks something like this:
>
> ```
> record<n1> = { "predicted": <tensor with shape n1, m>,
>                "truth": <tensor with shape n1, m>,
>                "loss": <double>,
>                "index": <array with shape n1>}
>
> data = [
>     pa.array([record<n1>, record<n2>, record<n3>]),
>     pa.array([<float>, <float>, <float>])
>     pa.array([<float>, <float>, <float>])
> ]
>
> batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
> ```
>
> But I'm not sure how to do that, or even if what I'm trying to do is the
> right way to do it.

We don't support tensors/ndarrays as first-class value types in the
Python or C++ libraries. This could be done hypothetically using the
new ExtensionType facility. Tensor values would be embedded in an
Arrow Binary column. There is already ARROW-1614 open for this. I also
opened ARROW-5819 about implementing the Python-side plumbing around
this.

Another possible option is to infer list<...> types from ndarrays
(e.g. list<list<double>> from an ndarray of ndim=2 and dtype=float64),
but this has not been implemented.

>
> What is the difference between `pa.array` and `pa.list_`? This formulation
> is an array of structs, but is the struct of arrays formulation of this
> possible? i.e.:
>

* The return value of pa.array is an Array object, which wraps the C++
  arrow::Array type, the base class for value sequences. It's data,
  not metadata.
* pa.list_ returns an instance of ListType, which is a DataType
  subclass. It's metadata, not data.

> ```
> data = [
>     pa.array([ <tensor with shape n1, m>, <tensor with shape n2, m>,
>                <tensor with shape n3, m>]),
>     pa.array([ <tensor with shape n1, m>, <tensor with shape n2, m>,
>                <tensor with shape n3, m>]),
>     pa.array([<float>, <float>, <float>]),
>     ...
> ]
> ```
>
> Which doesn't currently work. It seems that there is a separation between
> '1d arraylike' datatypes and 'pythonlike' datatypes (and 'nd arraylike'
> datatypes), so I can't have a struct of an array.
>

Right. ndarrays as array cell values are not natively part of the
Arrow columnar format. But they could be supported through extensions.
This would be a nice project for someone to take on in the future.

- Wes

> -Andrew