On Wed, 11 Nov 2020 at 00:52, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Sorry, I should clarify, I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here. But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think relying on exact memory translation from systems that
> aren't used arrow, might require copies.
>
> On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > My question is: why are these addresses not 40 bytes apart from each other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >
> >
> > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > recommended).

That indeed seems to be what is happening here, as the buffers are 64 bytes
apart (except for the first column; I'm not sure what is going on there).

Given this padding, I think such a zero-copy conversion from Arrow to pandas
is basically impossible (even for the primitive types without nulls), at
least with the current consolidated BlockManager, which stores columns of the
same dtype together in a single contiguous 2D array and so leaves no room for
padding between columns.

There is ongoing work on a non-consolidated manager that stores 1D arrays.
With that, it will become easier to experiment with such zero-copy
conversions. See https://github.com/pandas-dev/pandas/pull/36010 (it is
experimental and not expected to be fully working in a pandas release soon,
but contributions to this effort are very welcome).

Joris

> >
> > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > wrote:
> >
> >> I've done a bit more digging. This code:
> >> ````
> >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> >> table = pa.Table.from_pandas(df)
> >> mem = []
> >> for c in table.columns:
> >>     buf = c.chunks[0].buffers()[1]
> >>     mem.append((buf.address, buf.size))
> >> sorted(mem)
> >> ````
> >> ...prints...
> >> ````
> >>
> >> [(140262915478912, 40),
> >> (140262915479232, 40),
> >> (140262915479296, 40),
> >> (140262915479360, 40),
> >> (140262915479424, 40)]
> >>
> >> ````
> >> My question is: why are these addresses not 40 bytes apart from each
> >> other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >>
> >> Nick
> >>
> >
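
For reference, the spacing in the output quoted above can be checked with plain arithmetic. This is only a rough sketch using the (address, size) pairs from Nick's message. The names `ALIGNMENT` and `padded_size` are made up for illustration and are not part of pyarrow, and the 64-byte value is simply the recommendation mentioned in Micah's reply, not something this snippet reads from the allocator.

```python
# (address, size) pairs copied from the output quoted in the thread.
buffers = [
    (140262915478912, 40),
    (140262915479232, 40),
    (140262915479296, 40),
    (140262915479360, 40),
    (140262915479424, 40),
]

# Assumption for illustration: the 64-byte padding recommended by the spec.
ALIGNMENT = 64


def padded_size(size, alignment=ALIGNMENT):
    """Round size up to the next multiple of alignment (hypothetical helper)."""
    return -(-size // alignment) * alignment


# Distance from the start of each buffer to the start of the next one.
gaps = [nxt[0] - cur[0] for cur, nxt in zip(buffers, buffers[1:])]

# True where a buffer starts on a 64-byte boundary.
aligned = [addr % ALIGNMENT == 0 for addr, _ in buffers]

print(gaps)             # [320, 64, 64, 64]
print(all(aligned))     # True
print(padded_size(40))  # 64
```

The last four buffers sit exactly one padded slot (64 bytes) apart, which matches a 40-byte buffer rounded up to 64. The 320-byte gap after the first buffer is still a multiple of 64, but the numbers alone do not say what occupies it.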