Nick, it appears that converting the ndarray to a DataFrame clears the
contiguous flag even though it doesn't actually change the underlying
array; at least, that is what I'm seeing in my testing.  My guess is
that this is what causes arrow to do a copy (arrow is indeed doing a
new allocation here, which is why you see the 64-byte-padded
differences).  I am not enough of a pandas expert to offer further
guidance, but maybe someone else knows what is happening here.

import numpy as np
import pandas as pd
import pyarrow as pa

arr = np.random.randint(10, size=(5, 5))
old_address = arr.__array_interface__['data'][0]
df = pd.DataFrame(arr)
print(arr.flags)             # flags of the original ndarray
pa_arr = pa.array(df[0].values)
print(df[0].values.flags)    # flags of the column pandas hands back
df_address = df[0].values.__array_interface__['data'][0]
new_address = pa_arr.buffers()[1].address
print(f'Old address={old_address}\nDf  address={df_address}\nNew address={new_address}')

#   C_CONTIGUOUS : True
#   F_CONTIGUOUS : False
#   OWNDATA : True
#   WRITEABLE : True
#   ALIGNED : True
#   WRITEBACKIFCOPY : False
#   UPDATEIFCOPY : False
#
#   C_CONTIGUOUS : False
#   F_CONTIGUOUS : False
#   OWNDATA : False
#   WRITEABLE : True
#   ALIGNED : True
#   WRITEBACKIFCOPY : False
#   UPDATEIFCOPY : False
#
# Old address=2297872094880
# Df  address=2297872094880
# New address=7932699743552
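
For comparison, a plain numpy column slice shows the same flag pattern
with no pandas involved, which makes me suspect pandas is just handing
back a strided column view rather than actually clearing anything.  A
numpy-only sketch (no pandas, no arrow):

```python
import numpy as np

arr = np.random.randint(10, size=(5, 5))  # C-contiguous 2-D block
col = arr[:, 0]   # column view: elements are a full row apart, so not contiguous
row = arr[0]      # row view: elements are adjacent, so C-contiguous

print(col.flags['C_CONTIGUOUS'])  # False
print(row.flags['C_CONTIGUOUS'])  # True

# Both views share the original buffer; only the strides differ.
print(col.base is arr, row.base is arr)  # True True
```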

Converting from the numpy array directly does seem to be a zero-copy
operation:

arr = np.random.randint(10, size=(5,5))
old_address = arr.__array_interface__['data'][0]
pa_arr = pa.array(arr[0])
new_address = pa_arr.buffers()[1].address
print(f'Old address={old_address}\nNew address={new_address}')

# Old address=2297872094096
# New address=2297872094096
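
If a contiguous buffer is what's needed, np.ascontiguousarray will
produce one from a column, but note that it is itself a copy, so at
best it moves where the copy happens rather than avoiding one.  A
quick sketch:

```python
import numpy as np

arr = np.random.randint(10, size=(5, 5))
col = arr[:, 0]                     # strided column view into arr
fixed = np.ascontiguousarray(col)   # compact 1-D copy of the column

print(col.flags['C_CONTIGUOUS'])    # False
print(fixed.flags['C_CONTIGUOUS'])  # True
# fixed owns a fresh buffer, so the copy has just moved, not vanished
print(fixed.flags['OWNDATA'])       # True
```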

As a further oddity, consider:

arr = np.random.randint(10, size=(5,5))
old_address = arr.__array_interface__['data'][0]
df = pd.DataFrame(arr)
print(f'ndarray address: {old_address}')
for i in range(5):
    addr = df[i].values.__array_interface__['data'][0]
    print(f'DF column {i}    : {addr}')

# ndarray address: 2297872094880
# DF column 0    : 2297872094880
# DF column 1    : 2297872094884
# DF column 2    : 2297872094888
# DF column 3    : 2297872094892
# DF column 4    : 2297872094896

The pandas "values" arrays come back at odd-looking addresses (I would
have expected them to be 20 bytes apart, not 4).
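
If the columns really are views into the original row-major block
(which the matching base address suggests), the 4-byte spacing would
be exactly what numpy strides produce: column i starts itemsize bytes
after column i-1, and consecutive elements *within* a column are a
full row apart.  A pandas-free sketch (forcing int32 to match the
4-byte gaps above; the randint default elsewhere may be int64, giving
8-byte gaps):

```python
import numpy as np

arr = np.random.randint(10, size=(5, 5)).astype(np.int32)
base = arr.__array_interface__['data'][0]
for i in range(5):
    col = arr[:, i]   # view: column i starts i * itemsize bytes into the block
    addr = col.__array_interface__['data'][0]
    print(addr - base)   # 0, 4, 8, 12, 16
    print(col.strides)   # (20,) -- elements within a column are a row apart
```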

On Tue, Nov 10, 2020 at 1:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Sorry, I should clarify, I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here.  But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think relying on exact memory translation from systems that
> aren't using arrow might require copies.
>
> On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > My question is: why are these addresses not 40 bytes apart from each other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >
> >
> > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > recommended).
> >
> > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > wrote:
> >
> >> I've done a bit more digging. This code:
> >> ````
> >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> >> table = pa.Table.from_pandas(df)
> >> mem = []
> >> for c in table.columns:
> >>     buf = c.chunks[0].buffers()[1]
> >>     mem.append((buf.address, buf.size))
> >> sorted(mem)
> >> ````
> >> ...prints...
> >> ````
> >>
> >> [(140262915478912, 40),
> >>  (140262915479232, 40),
> >>  (140262915479296, 40),
> >>  (140262915479360, 40),
> >>  (140262915479424, 40)]
> >>
> >> ````
> >> My question is: why are these addresses not 40 bytes apart from each
> >> other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >>
> >> Nick
> >>
> >
