As Micah and Wes pointed out on the PR, this alignment/padding is a
requirement of the format specification. For reference, see:
https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
That's also the reason I said earlier in this thread that the kind of
zero-copy conversion to pandas you want to achieve is "basically
impossible", as far as I know (when not using the split_blocks option).

Micah's suggestion to use a fixed size list column for each set of
columns with the same data type could work to achieve this. But then
of course you need to construct the Arrow table specifically for this
(you no longer have a normal table with columns), and you will have to
write a custom Arrow->pandas conversion that takes the buffer of the
list column, views it as a 2D numpy array, and puts it in a pandas
Block.

I know it is not a satisfactory solution *right now*, but longer term,
I think the best bet for getting this zero-copy conversion (which I
would also like to see!) is the non-consolidating manager that stores
1D arrays under the hood in pandas, which I mentioned earlier in the
thread.

Joris

On Fri, 13 Nov 2020 at 00:21, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Nicholas,
> I don't think allowing flexibility for non-8-byte-aligned types is a
> good idea.  The specification explicitly calls out the alignment
> requirements, and allowing writers to output non-aligned values
> potentially breaks other implementations.
>
> I'm not sure of your exact use-case but another approach to consider is to
> store the values in a single Arrow column as either a list or a fixed size
> list and look into doing zero copy from that to the corresponding pandas
> memory (this is hypothetical, again I don't have enough context on
> pandas/numpy memory layouts).
>
> -Micah
>
> On Thu, Nov 12, 2020 at 3:01 PM Nicholas White <n.j.wh...@gmail.com> wrote:
>
> > OK got everything to work, https://github.com/apache/arrow/pull/8644
> > (part of ARROW-10573 now) is ready for review. I've updated the test case
> > to show it is possible to zero-copy a pandas DataFrame! The next step is to
> > dig into `arrow_to_pandas.cc` to make it work automagically...
> >
> > On Wed, 11 Nov 2020 at 22:52, Nicholas White <n.j.wh...@gmail.com> wrote:
> >
> >> Thanks all, this has been interesting. I've made a patch that sort-of
> >> does what I want[1] - I hope the test case is clear! I made the batch
> >> writer use the `alignment` field that was already in the `IpcWriteOptions`
> >> to align the buffers, instead of fixing their alignment at 8. Arrow then
> >> writes out the buffers consecutively, so you can map them as a 2D memory
> >> array like I wanted. There's one problem though...the test case thinks the
> >> arrow data is invalid as it can't read the metadata properly (error below).
> >> Do you have any idea why? I think it's because Arrow puts the metadata at
> >> the end of the file after the now-unaligned buffers yet assumes the
> >> metadata is still 8-byte aligned (which it probably no longer is).
> >>
> >> Nick
> >>
> >> ````
> >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >> pyarrow/ipc.pxi:494: in pyarrow.lib.RecordBatchReader.read_all
> >>     check_status(self.reader.get().ReadAll(&table))
> >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>
> >> >   raise ArrowInvalid(message)
> >> E   pyarrow.lib.ArrowInvalid: Expected to read 117703432 metadata bytes,
> >> but only read 19
> >> ````
> >>
> >> [1] https://github.com/apache/arrow/pull/8644
> >>
> >>