My position is that we should work with the pandas community toward eliminating the
BlockManager data structure, as this will solve a multitude of problems and also make
things better for Arrow. I am not supportive of the IPC format changes in the PR.
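Context for the BlockManager discussion in this thread: pandas consolidates same-dtype
columns into single 2D blocks, which is why Arrow's one-buffer-per-column layout cannot
generally be mapped onto a DataFrame without copying. A minimal sketch below makes the
layout visible; it peeks at the private `_mgr` attribute, an internal pandas API whose
exact shape varies across versions.

````
import numpy as np
import pandas as pd

# Two float64 columns and one int64 column.
df = pd.DataFrame({
    "a": np.arange(3, dtype=np.float64),
    "b": np.arange(3, dtype=np.float64),
    "c": np.arange(3, dtype=np.int64),
})

# The BlockManager (private API; details vary across pandas versions)
# consolidates same-dtype columns: one (2, 3) float64 block plus one
# (1, 3) int64 block, rather than three separate 1D arrays.
for block in df._mgr.blocks:
    print(block.dtype, block.values.shape)
````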
On Wed, Jan 27, 2021 at 6:27 PM Nicholas White <n.j.wh...@gmail.com> wrote:
>
> Hi all - just pinging this thread given the later discussions on the PR
> <https://github.com/apache/arrow/pull/8644>. I am proposing a backwards
> (but not forwards) compatible change to the spec that strikes out the line
> "When serializing Arrow data for interprocess communication, these
> alignment and padding requirements are enforced", and I want to gauge the
> general reaction to this. The tradeoff is roughly ecosystem fragmentation
> vs. improved performance for a single, albeit common, workflow.
>
> @Micah: the fixed-size list would lose metadata like column headers - and
> my motivating example is saving the IPC-format Arrow files to disk. If you
> have to add your own solution to keep track of the metadata separately
> from the Arrow file, then you don't really gain anything from using Arrow
> and might as well use, e.g., `np.save` on each dtype in the BlockManager
> and keep the metadata in a JSON file.
>
> @Joris: I agree that a non-consolidating block manager is a good solution,
> and am following your progress on 39146
> <https://github.com/pandas-dev/pandas/issues/39146> etc.! However, I see
> you've discussed performance a lot on the mailing list (archive
> <https://mail.python.org/pipermail/pandas-dev/2020-May/001244.html>) - if
> an array manager got you faster data loading but slower analysis, that
> would be a poor trade-off compared to the block manager's slower data
> loading and faster analysis. Your notebook
> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>
> doesn't have any group-by-and-aggregate timings - do you have any yet that
> you think are representative?
>
> Thanks -
>
> Nick
>
> On Fri, 13 Nov 2020 at 10:49, Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > As Micah and Wes pointed out on the PR, this alignment and padding are
> > requirements of the format specification. For reference, see here:
> >
> > https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
> >
> > That's also the reason that I said earlier in this thread that the
> > zero-copy conversion to pandas you want to achieve is "basically
> > impossible", as far as I know (when not using the split_blocks option).
> >
> > The suggestion of Micah to use a fixed-size list column for each set
> > of columns with the same data type could work to achieve this. But
> > then of course you need to construct the Arrow table specifically for
> > this (and you no longer have a normal table with columns), and will
> > have to write a custom arrow->pandas conversion to get the buffer of
> > the list column, view it as a 2D numpy array, and put this in a pandas
> > Block.
> >
> > I know it is not a satisfactory solution *right now*, but longer term,
> > I think the best bet for getting this zero-copy conversion (which I
> > would also like to see!) is the non-consolidating manager that stores
> > 1D arrays under the hood in pandas, which I mentioned earlier in the
> > thread.
> >
> > Joris
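A rough sketch of the fixed-size-list idea Joris describes above, in pyarrow/numpy.
This is an illustration only: it assumes a null-free float64 column, the variable names
are ours, the custom arrow->pandas Block conversion he mentions is not shown, and
`pd.DataFrame(..., copy=False)` is best-effort rather than a zero-copy guarantee.

````
import numpy as np
import pandas as pd
import pyarrow as pa

# Pack what would be four float64 columns into one fixed-size-list
# column: each length-4 list is one row of the matrix.
flat = pa.array(np.arange(12, dtype=np.float64))
col = pa.FixedSizeListArray.from_arrays(flat, 4)

# The child values sit in one contiguous buffer, so (absent nulls) they
# can be viewed as a 2D numpy array without copying.
values = col.values.to_numpy(zero_copy_only=True)
matrix = values.reshape(len(col), 4)

# Hand the 2D array to pandas, which can keep it as a single block.
df = pd.DataFrame(matrix, columns=list("abcd"), copy=False)
````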
> > On Fri, 13 Nov 2020 at 00:21, Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Nicholas,
> > > I don't think allowing for the flexibility of non-8-byte-aligned types
> > > is a good idea. The specification explicitly calls out the alignment
> > > requirements, and allowing writers to output non-aligned values
> > > potentially breaks other implementations.
> > >
> > > I'm not sure of your exact use-case, but another approach to consider
> > > is to store the values in a single Arrow column as either a list or a
> > > fixed-size list and look into doing zero copy from that to the
> > > corresponding pandas memory (this is hypothetical; again, I don't have
> > > enough context on pandas/numpy memory layouts).
> > >
> > > -Micah
> > >
> > > On Thu, Nov 12, 2020 at 3:01 PM Nicholas White <n.j.wh...@gmail.com>
> > > wrote:
> > >
> > > > OK, I got everything to work - https://github.com/apache/arrow/pull/8644
> > > > (part of ARROW-10573 now) is ready for review. I've updated the test
> > > > case to show it is possible to zero-copy a pandas DataFrame! The
> > > > next step is to dig into `arrow_to_pandas.cc` to make it work
> > > > automagically...
> > > >
> > > > On Wed, 11 Nov 2020 at 22:52, Nicholas White <n.j.wh...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks all, this has been interesting. I've made a patch that
> > > > > sort-of does what I want [1] - I hope the test case is clear! I
> > > > > made the batch writer use the `alignment` field that was already
> > > > > in the `IpcWriteOptions` to align the buffers, instead of fixing
> > > > > their alignment at 8. Arrow then writes out the buffers
> > > > > consecutively, so you can map them as a 2D memory array like I
> > > > > wanted. There's one problem, though... the test case thinks the
> > > > > Arrow data is invalid, as it can't read the metadata properly
> > > > > (error below). Do you have any idea why? I think it's because
> > > > > Arrow puts the metadata at the end of the file after the
> > > > > now-unaligned buffers, yet assumes the metadata is still 8-byte
> > > > > aligned (which it probably no longer is).
> > > > >
> > > > > Nick
> > > > >
> > > > > ````
> > > > > pyarrow/ipc.pxi:494: in pyarrow.lib.RecordBatchReader.read_all
> > > > >     check_status(self.reader.get().ReadAll(&table))
> > > > > >   raise ArrowInvalid(message)
> > > > > E   pyarrow.lib.ArrowInvalid: Expected to read 117703432 metadata
> > > > > bytes, but only read 19
> > > > > ````
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/8644
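To make the alignment invariant under discussion concrete: a small pyarrow round trip
(a sketch; the variable names are illustrative) shows that every buffer read back from
an IPC file currently lands on an 8-byte boundary - exactly the guarantee the proposed
spec change would strike out.

````
import pyarrow as pa

batch = pa.record_batch(
    [pa.array([1.0, 2.0, 3.0]), pa.array([1, 2, 3])],
    names=["a", "b"],
)

# Write a file-format IPC payload into memory.
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Read it back and check buffer addresses: the spec's enforced padding
# means each buffer starts on an 8-byte boundary.
table = pa.ipc.open_file(sink.getvalue()).read_all()
for column in table.columns:
    for chunk in column.chunks:
        for buf in chunk.buffers():
            if buf is not None:
                assert buf.address % 8 == 0
````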