My position is that we should work with the pandas community toward eliminating the
BlockManager data structure, as this will solve a multitude of problems and also make
things better for Arrow. I am not supportive of the IPC format changes in the PR.
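Context for the BlockManager discussion in this thread: pandas consolidates same-dtype
columns into single 2D blocks, which is why Arrow's one-buffer-per-column layout cannot
generally be mapped onto a DataFrame without copying. A minimal sketch below makes the
layout visible; it peeks at the private `_mgr` attribute, an internal pandas API whose
exact shape varies across versions.

````
import numpy as np
import pandas as pd

# Two float64 columns and one int64 column.
df = pd.DataFrame({
    "a": np.arange(3, dtype=np.float64),
    "b": np.arange(3, dtype=np.float64),
    "c": np.arange(3, dtype=np.int64),
})

# The BlockManager (private API; details vary across pandas versions)
# consolidates same-dtype columns: one (2, 3) float64 block plus one
# (1, 3) int64 block, rather than three separate 1D arrays.
for block in df._mgr.blocks:
    print(block.dtype, block.values.shape)
````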
On Wed, Jan 27, 2021 at 6:27 PM Nicholas White <n.j.wh...@gmail.com> wrote:
>
> Hi all - just pinging this thread given the later discussions on the PR
> <https://github.com/apache/arrow/pull/8644>. I am proposing a backwards
> (but not forwards) compatible change to the spec that strikes out the line
> "When serializing Arrow data for interprocess communication, these
> alignment and padding requirements are enforced", and I want to gauge the
> general reaction to this. The tradeoff is roughly ecosystem fragmentation
> vs. improved performance for a single, albeit common, workflow.
>
> @Micah: the fixed-size list would lose metadata like column headers - and
> my motivating example is saving the IPC-format Arrow files to disk. If you
> have to add your own solution to keep track of the metadata separately
> from the Arrow file, then you don't really gain anything from using Arrow
> and might as well use, e.g., `np.save` on each dtype in the BlockManager
> and keep the metadata in a JSON file.
>
> @Joris: I agree that a non-consolidating block manager is a good solution,
> and am following your progress on 39146
> <https://github.com/pandas-dev/pandas/issues/39146> etc.! However, I see
> you've discussed performance a lot on the mailing list (archive
> <https://mail.python.org/pipermail/pandas-dev/2020-May/001244.html>) - if
> an array manager got you faster data loading but slower analysis, that
> would be a poor trade-off compared to the block manager's slower data
> loading and faster analysis. Your notebook
> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>
> doesn't have any group-by-and-aggregate timings - do you have any yet that
> you think are representative?
>
> Thanks -
>
> Nick
>
> On Fri, 13 Nov 2020 at 10:49, Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > As Micah and Wes pointed out on the PR, this alignment and padding are
> > requirements of the format specification. For reference, see here:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
> >
> > That's also the reason that I said earlier in this thread that the
> > zero-copy conversion to pandas you want to achieve is "basically
> > impossible", as far as I know (when not using the split_blocks option).
> >
> > The suggestion of Micah to use a fixed-size list column for each set
> > of columns with the same data type could work to achieve this. But
> > then of course you need to construct the Arrow table specifically for
> > this (and you no longer have a normal table with columns), and will
> > have to write a custom arrow->pandas conversion to get the buffer of
> > the list column, view it as a 2D numpy array, and put this in a pandas
> > Block.
> >
> > I know it is not a satisfactory solution *right now*, but longer term,
> > I think the best bet for getting this zero-copy conversion (which I
> > would also like to see!) is the non-consolidating manager that stores
> > 1D arrays under the hood in pandas, which I mentioned earlier in the
> > thread.
> >
> > Joris
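A rough sketch of the fixed-size-list idea Joris describes above, in pyarrow/numpy.
This is an illustration only: it assumes a null-free float64 column, the variable names
are ours, the custom arrow->pandas Block conversion he mentions is not shown, and
`pd.DataFrame(..., copy=False)` is best-effort rather than a zero-copy guarantee.

````
import numpy as np
import pandas as pd
import pyarrow as pa

# Pack what would be four float64 columns into one fixed-size-list
# column: each length-4 list is one row of the matrix.
flat = pa.array(np.arange(12, dtype=np.float64))
col = pa.FixedSizeListArray.from_arrays(flat, 4)

# The child values sit in one contiguous buffer, so (absent nulls) they
# can be viewed as a 2D numpy array without copying.
values = col.values.to_numpy(zero_copy_only=True)
matrix = values.reshape(len(col), 4)

# Hand the 2D array to pandas, which can keep it as a single block.
df = pd.DataFrame(matrix, columns=list("abcd"), copy=False)
````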
> > On Fri, 13 Nov 2020 at 00:21, Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Nicholas,
> > > I don't think allowing for the flexibility of non-8-byte-aligned types
> > > is a good idea. The specification explicitly calls out the alignment
> > > requirements, and allowing writers to output non-aligned values
> > > potentially breaks other implementations.
> > >
> > > I'm not sure of your exact use-case, but another approach to consider
> > > is to store the values in a single Arrow column as either a list or a
> > > fixed-size list and look into doing zero copy from that to the
> > > corresponding pandas memory (this is hypothetical; again, I don't have
> > > enough context on pandas/numpy memory layouts).
> > >
> > > -Micah
> > >
> > > On Thu, Nov 12, 2020 at 3:01 PM Nicholas White <n.j.wh...@gmail.com>
> > > wrote:
> > >
> > > > OK, I got everything to work - https://github.com/apache/arrow/pull/8644
> > > > (part of ARROW-10573 now) is ready for review. I've updated the test
> > > > case to show it is possible to zero-copy a pandas DataFrame! The
> > > > next step is to dig into `arrow_to_pandas.cc` to make it work
> > > > automagically...
> > > >
> > > > On Wed, 11 Nov 2020 at 22:52, Nicholas White <n.j.wh...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks all, this has been interesting. I've made a patch that
> > > > > sort-of does what I want [1] - I hope the test case is clear! I
> > > > > made the batch writer use the `alignment` field that was already
> > > > > in the `IpcWriteOptions` to align the buffers, instead of fixing
> > > > > their alignment at 8. Arrow then writes out the buffers
> > > > > consecutively, so you can map them as a 2D memory array like I
> > > > > wanted. There's one problem, though... the test case thinks the
> > > > > Arrow data is invalid, as it can't read the metadata properly
> > > > > (error below). Do you have any idea why? I think it's because
> > > > > Arrow puts the metadata at the end of the file after the
> > > > > now-unaligned buffers, yet assumes the metadata is still 8-byte
> > > > > aligned (which it probably no longer is).
> > > > >
> > > > > Nick
> > > > >
> > > > > ````
> > > > > pyarrow/ipc.pxi:494: in pyarrow.lib.RecordBatchReader.read_all
> > > > >     check_status(self.reader.get().ReadAll(&table))
> > > > > >   raise ArrowInvalid(message)
> > > > > E   pyarrow.lib.ArrowInvalid: Expected to read 117703432 metadata
> > > > > bytes, but only read 19
> > > > > ````
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/8644
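To make the alignment invariant under discussion concrete: a small pyarrow round trip
(a sketch; the variable names are illustrative) shows that every buffer read back from
an IPC file currently lands on an 8-byte boundary - exactly the guarantee the proposed
spec change would strike out.

````
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1.0, 2.0, 3.0]), pa.array([1, 2, 3])],
    names=["a", "b"],
)

# Write a file-format IPC payload into memory.
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Read it back and check buffer addresses: the spec's enforced padding
# means each buffer starts on an 8-byte boundary.
table = pa.ipc.open_file(sink.getvalue()).read_all()
for column in table.columns:
    for chunk in column.chunks:
        for buf in chunk.buffers():
            if buf is not None:
                assert buf.address % 8 == 0
````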