On Wed, 11 Nov 2020 at 00:52, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Sorry, I should clarify, I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here.  But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think relying on exact memory translation from systems that
> aren't using Arrow might require copies.
>
> On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > My question is: why are these addresses not 40 bytes apart from each other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >
> >
> > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > recommended).

That indeed seems to be what is happening here, as the buffers are 64
bytes apart (except for the first column; I'm not sure what is going
on there).
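
As a quick sanity check, here is a small pure-Python sketch that takes
the addresses from Nick's output quoted below and computes the distance
between consecutive buffers. The helper name `padded_size` is made up
for illustration, and the 64-byte alignment is the value recommended by
the spec as mentioned above, not something I have checked in the
allocator itself.

```python
# Addresses and sizes copied from the output quoted further down.
mem = [
    (140262915478912, 40),
    (140262915479232, 40),
    (140262915479296, 40),
    (140262915479360, 40),
    (140262915479424, 40),
]


def padded_size(nbytes, alignment=64):
    """Round nbytes up to the next multiple of alignment (hypothetical helper)."""
    return -(-nbytes // alignment) * alignment


# Distance between the start addresses of consecutive buffers.
gaps = [b[0] - a[0] for a, b in zip(mem, mem[1:])]

print(gaps)             # [320, 64, 64, 64]
print(padded_size(40))  # 64
```

So each 40-byte buffer occupies a 64-byte slot, and only the first gap
(320 bytes) is larger than a single slot.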

Given this padding, I think such zero-copy conversion from Arrow to
pandas is basically impossible (even for primitive types without
nulls), at least with the current consolidated BlockManager, which
stores columns of the same dtype together in a single 2D array. There
is ongoing work on a non-consolidated manager that stores 1D arrays,
and with that it will become easier to experiment with such zero-copy
conversions. See https://github.com/pandas-dev/pandas/pull/36010 (it's
experimental and not expected to be fully working in a pandas release
soon, but contributions to this effort are very welcome).
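
To illustrate why a 2D block gets in the way, here is a rough
NumPy-only sketch. It does not touch the actual pandas internals: the
five 1D arrays are just stand-ins for five separately allocated Arrow
buffers, and the names `columns` and `block` are made up for the
example.

```python
import numpy as np

# Five independent 1D columns, standing in for five separate buffers.
columns = [np.arange(5, dtype=np.int64) + 10 * i for i in range(5)]

# A consolidated block wants all same-dtype columns in ONE 2D array,
# so the columns have to be gathered into a fresh allocation.
block = np.stack(columns)  # shape (5, 5), one row per column

# None of the original columns shares memory with the new block,
# i.e. building the block required copying the data.
copied = not any(np.shares_memory(block, col) for col in columns)

print(block.shape, copied)
```

With a manager that keeps one 1D array per column, this gathering step
would not be needed, which is why it should make zero-copy experiments
easier.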

Joris

> >
> > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > wrote:
> >
> >> I've done a bit more digging. This code:
> >> ````
> >> import numpy as np
> >> import pandas as pd
> >> import pyarrow as pa
> >>
> >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> >> table = pa.Table.from_pandas(df)
> >> mem = []
> >> for c in table.columns:
> >>     buf = c.chunks[0].buffers()[1]
> >>     mem.append((buf.address, buf.size))
> >> sorted(mem)
> >> ````
> >> ...prints...
> >> ````
> >>
> >> [(140262915478912, 40),
> >>  (140262915479232, 40),
> >>  (140262915479296, 40),
> >>  (140262915479360, 40),
> >>  (140262915479424, 40)]
> >>
> >> ````
> >> My question is: why are these addresses not 40 bytes apart from each
> >> other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >>
> >> Nick
> >>
> >
