Nick, it appears that converting the ndarray to a DataFrame clears the C_CONTIGUOUS flag even though it doesn't actually change the underlying buffer; at least, that's what I'm seeing in my testing. My guess is that this is what causes Arrow to make a copy (Arrow is indeed doing a new allocation here, which is why you see the 64-byte-padded differences). I'm not enough of a pandas expert to offer further guidance, but perhaps someone else knows what is happening here.
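My working hypothesis for why the flag gets cleared (an assumption on my part, not verified against pandas internals): pandas keeps the original row-major block, and df[i].values hands back a strided column view into it. That wouldn't touch the buffer, but it would clear C_CONTIGUOUS, and it would also line up with the 4-byte column offsets shown further down. A quick strides check along those lines:

import numpy as np
import pandas as pd

# Default ints are 4 bytes on my platform, which matches the 4-byte
# column offsets in the output below.
arr = np.random.randint(10, size=(5, 5))
df = pd.DataFrame(arr)

col = df[0].values
# A contiguous 5-element column would have strides == (itemsize,).
# A stride of one full row (5 * itemsize) instead means the Series
# is a strided view into the original row-major block.
print(col.strides, col.itemsize)      # (20,) 4 on my setup
print(col.flags['C_CONTIGUOUS'])      # False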
import numpy as np
import pandas as pd
import pyarrow as pa

arr = np.random.randint(10, size=(5, 5))
old_address = arr.__array_interface__['data'][0]
df = pd.DataFrame(arr)
print(arr.flags)
pa_arr = pa.array(df[0].values)
print(df[0].values.flags)
df_address = df[0].values.__array_interface__['data'][0]
new_address = pa_arr.buffers()[1].address
print(f'Old address={old_address}\nDf address={df_address}\nNew address={new_address}')

# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# OWNDATA : True
# WRITEABLE : True
# ALIGNED : True
# WRITEBACKIFCOPY : False
# UPDATEIFCOPY : False
#
# C_CONTIGUOUS : False
# F_CONTIGUOUS : False
# OWNDATA : False
# WRITEABLE : True
# ALIGNED : True
# WRITEBACKIFCOPY : False
# UPDATEIFCOPY : False
#
# Old address=2297872094880
# Df address=2297872094880
# New address=7932699743552

Conversion from the numpy array directly does seem to be a zero-copy operation:

arr = np.random.randint(10, size=(5, 5))
old_address = arr.__array_interface__['data'][0]
pa_arr = pa.array(arr[0])  # arr[0] is a row of the C-contiguous array, so it is itself contiguous
new_address = pa_arr.buffers()[1].address
print(f'Old address={old_address}\nNew address={new_address}')

# Old address=2297872094096
# New address=2297872094096

As an even further oddity, consider:

arr = np.random.randint(10, size=(5, 5))
old_address = arr.__array_interface__['data'][0]
df = pd.DataFrame(arr)
print(f'ndarray address: {old_address}')
for i in range(5):
    addr = df[i].values.__array_interface__['data'][0]
    print(f'DF column {i} : {addr}')

# ndarray address: 2297872094880
# DF column 0 : 2297872094880
# DF column 1 : 2297872094884
# DF column 2 : 2297872094888
# DF column 3 : 2297872094892
# DF column 4 : 2297872094896

The pandas "values" arrays have very odd addresses: I would expect consecutive columns to start 20 bytes apart (five 4-byte ints per column), not 4.

On Tue, Nov 10, 2020 at 1:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Sorry, I should clarify: I'm not familiar with zero copy from Pandas to
> Arrow, so there might be something else going on here. But once an arrow
> file is written out, buffers will be padded/aligned to 8 bytes.
>
> In general, I think relying on exact memory translation from systems that
> aren't using arrow might require copies.
>
> On Tue, Nov 10, 2020 at 3:49 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> >> My question is: why are these addresses not 40 bytes apart from each other?
> >> What's in the gaps between the buffers? It's not null bitsets - there's
> >> only one buffer for each column. Thanks -
> >
> > All buffers are padded to at least 8 bytes (and per the spec 64 is
> > recommended).
> >
> > On Tue, Nov 10, 2020 at 3:39 PM Nicholas White <n.j.wh...@gmail.com>
> > wrote:
> >
> >> I've done a bit more digging. This code:
> >> ````
> >> df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
> >> table = pa.Table.from_pandas(df)
> >> mem = []
> >> for c in table.columns:
> >>     buf = c.chunks[0].buffers()[1]
> >>     mem.append((buf.address, buf.size))
> >> sorted(mem)
> >> ````
> >> ...prints...
> >> ````
> >> [(140262915478912, 40),
> >>  (140262915479232, 40),
> >>  (140262915479296, 40),
> >>  (140262915479360, 40),
> >>  (140262915479424, 40)]
> >> ````
> >> My question is: why are these addresses not 40 bytes apart from each
> >> other? What's in the gaps between the buffers? It's not null bitsets -
> >> there's only one buffer for each column. Thanks -
> >>
> >> Nick
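PS: to separate the copy question from pandas entirely, here's a minimal sketch (assuming, as I do above, that pa.array stays zero-copy only for contiguous input, and that strided input is accepted but copied) comparing a contiguous array against a strided view of the same buffer:

import numpy as np
import pyarrow as pa

base = np.arange(10, dtype=np.int64)
strided = base[::2]  # non-contiguous view: C_CONTIGUOUS is False

contig_pa = pa.array(base)
strided_pa = pa.array(strided)

# I'd expect the contiguous input to be wrapped zero-copy (addresses match)...
print(base.__array_interface__['data'][0] == contig_pa.buffers()[1].address)
# ...and the strided view to be copied into a fresh Arrow allocation.
print(strided.__array_interface__['data'][0] == strided_pa.buffers()[1].address)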