Re: memory mapped IPC File of RecordBatches?

Wes McKinney Wed, 22 May 2019 18:42:43 -0700

I tried this locally and am not seeing this behavior:

In [10]: source = pa.memory_map('/tmp/test.batch')


In [11]: reader=pa.ipc.open_stream(source)

In [12]: batch = reader.get_next_batch()
/home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning:
Please use read_next_batch instead of get_next_batch
  #!/home/wesm/miniconda/envs/arrow-3.7/bin/python

In [13]: batch.to_pandas()
Out[13]:
   field1
0     1.0
1     NaN

Then I ran dd to overwrite the file contents:

In [14]: batch.to_pandas()
Out[14]:
        field1
0          NaN
1 -245785081.0

On Wed, May 22, 2019 at 8:34 PM John Muehlhausen <[email protected]> wrote:
>
> I don't think that is it.  I changed my mmap to MAP_PRIVATE in the first
> raw mmap test and the dd changes are still visible.  I also changed to
> storing the stream format instead of the file format and got the same
> result.
>
> Where is the code that constructs a buffer/array by pointing it into the
> mmap space instead of by allocating space?  Sorry I'm so confused about
> this, I just don't see how it is supposed to work.
>
> On Wed, May 22, 2019 at 7:58 PM Wes McKinney <[email protected]> wrote:
>
> > It seems this could be due to our use of MAP_PRIVATE for read-only memory
> > maps
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
> >
> > Some more investigation would be required
> >
> > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <[email protected]> wrote:
> > >
> > > Is there an example somewhere of referring to the RecordBatch data in a
> > > memory-mapped IPC File in a zero-copy manner?
> > >
> > > I tried to do this in Python and must be doing something wrong.  (I
> > > don't really care whether the example is Python or C++)
> > >
> > > In the attached test, when I get to the first prompt and hit return, I
> > > get the same content again.  Likewise when I hit return on the second
> > > prompt I get the same content again.
> > >
> > > However, if before hitting return on the first prompt I issue:
> > >
> > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
> > >
> > > i.e. overwrite the contents of the file, I get a garbled result.
> > > (Replace 478 with the size of your file.)
> > >
> > > However, if I wait until the second prompt to issue the dd command
> > > before hitting return, I do not get an error.  Instead, batch.to_pandas()
> > > works the same both before and after the data is overwritten.  This was not
> > > expected as I thought that the batch object was looking at the file
> > > in-place, i.e. zero-copy?
> > >
> > > Am I tying together the memory-mapping and the batch construction in the
> > > wrong way?
> > >
> > > Thanks,
> > > John
> >
