Re: memory mapped IPC File of RecordBatches?

John Muehlhausen Fri, 24 May 2019 21:03:36 -0700

I'm not seeing it as a bug at this point.  I was only using it to convince
myself that the batch was zero-copy.
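
For anyone who wants to reproduce the mapping-flag behaviour discussed in the quoted messages below without Arrow in the picture, here is a minimal sketch using only Python's standard mmap module (Unix only). The helper name mapped_view_after_overwrite and the 16-byte payload are made up for illustration, and this is not a description of how Arrow's memory-mapped file is implemented. It maps a file read-only, overwrites the file through the file descriptor, and then reads the mapping back.

```python
import mmap
import os
import tempfile


def mapped_view_after_overwrite(flags):
    """Map a small file read-only with the given flags, overwrite the
    file through its descriptor, and return what the mapping shows."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * 16)                 # initial file contents
        view = mmap.mmap(fd, 16, flags=flags, prot=mmap.PROT_READ)
        try:
            os.pwrite(fd, b"B" * 16, 0)         # overwrite in place, like dd conv=notrunc
            return view[:16]                    # bytes currently visible through the mapping
        finally:
            view.close()
    finally:
        os.close(fd)
        os.unlink(path)


shared = mapped_view_after_overwrite(mmap.MAP_SHARED)
private = mapped_view_after_overwrite(mmap.MAP_PRIVATE)
print("MAP_SHARED :", shared)
print("MAP_PRIVATE:", private)
```

On Linux the MAP_SHARED mapping should show the new bytes. For MAP_PRIVATE the result may go either way, which is consistent with the man page wording quoted below.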


As you know I'd like to draw up a proposal for fully pre-allocated but only
partially populated batches in order to facilitate low-latency streaming
appends.  It may come up then?  I'm making a note of it.

On Thu, May 23, 2019 at 8:25 AM Wes McKinney <[email protected]> wrote:

> OK. Can you open a JIRA about fixing this? I don't recall the
> rationale for using MAP_PRIVATE to begin with, and since the behavior
> is unspecified on Linux it would be better to be consistent across
> platforms.
>
> On Wed, May 22, 2019 at 11:02 PM John Muehlhausen <[email protected]> wrote:
> >
> > Well, it works fine on Linux... and the Linux mmap man page seems to
> > indicate you are right about MAP_PRIVATE:
> >
> > "It is unspecified whether changes made to the file after the mmap() call
> > are visible in the mapped region."
> >
> > The Mac man page has no such note.
> >
> > Changing it to MAP_SHARED makes it work as expected on MacOS.  Still odd
> > that the changes are only sometimes visible ... but I guess that is
> > compatible with it being "unspecified."
> >
> > -John
> >
> > On Wed, May 22, 2019 at 8:56 PM John Muehlhausen <[email protected]> wrote:
> >
> > > I'll mess with this on various platforms and report back.  Thanks
> > >
> > > > On Wed, May 22, 2019 at 8:42 PM Wes McKinney <[email protected]> wrote:
> > >
> > >> I tried locally and am not seeing this behavior
> > >>
> > >> In [10]: source = pa.memory_map('/tmp/test.batch')
> > >>
> > >> In [11]: reader=pa.ipc.open_stream(source)
> > >>
> > >> In [12]: batch = reader.get_next_batch()
> > >> /home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning:
> > >> Please use read_next_batch instead of get_next_batch
> > >>   #!/home/wesm/miniconda/envs/arrow-3.7/bin/python
> > >>
> > >> In [13]: batch.to_pandas()
> > >> Out[13]:
> > >>    field1
> > >> 0     1.0
> > >> 1     NaN
> > >>
> > > >> Then I ran dd to overwrite the file contents:
> > >>
> > >> In [14]: batch.to_pandas()
> > >> Out[14]:
> > >>         field1
> > >> 0          NaN
> > >> 1 -245785081.0
> > >>
> > >> On Wed, May 22, 2019 at 8:34 PM John Muehlhausen <[email protected]> wrote:
> > >> >
> > > >> > I don't think that is it.  I changed my mmap to MAP_PRIVATE in the first
> > > >> > raw mmap test and the dd changes are still visible.  I also changed to
> > > >> > storing the stream format instead of the file format and got the same
> > > >> > result.
> > >> >
> > > >> > Where is the code that constructs a buffer/array by pointing it into the
> > > >> > mmap space instead of by allocating space?  Sorry I'm so confused about
> > > >> > this, I just don't see how it is supposed to work.
> > >> >
> > > >> > On Wed, May 22, 2019 at 7:58 PM Wes McKinney <[email protected]> wrote:
> > >> >
> > > >> > > It seems this could be due to our use of MAP_PRIVATE for read-only memory
> > > >> > > maps
> > > >> > >
> > > >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
> > >> > >
> > >> > > Some more investigation would be required
> > >> > >
> > > >> > > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <[email protected]> wrote:
> > >> > > >
> > > >> > > > Is there an example somewhere of referring to the RecordBatch data in a
> > > >> > > > memory-mapped IPC File in a zero-copy manner?
> > >> > > >
> > > >> > > > I tried to do this in Python and must be doing something wrong.  (I
> > > >> > > > don't really care whether the example is Python or C++)
> > >> > > >
> > > >> > > > In the attached test, when I get to the first prompt and hit return, I
> > > >> > > > get the same content again.  Likewise when I hit return on the second
> > > >> > > > prompt I get the same content again.
> > >> > > >
> > >> > > > However, if before hitting return on the first prompt I issue:
> > >> > > >
> > > >> > > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
> > >> > > >
> > >> > > >
> > > >> > > > i.e. overwrite the contents of the file, I get a garbled result.
> > > >> > > > (Replace 478 with the size of your file.)
> > >> > > >
> > > >> > > > However, if I wait until the second prompt to issue the dd command
> > > >> > > > before hitting return, I do not get an error.  Instead, batch.to_pandas()
> > > >> > > > works the same both before and after the data is overwritten.  This was not
> > > >> > > > expected as I thought that the batch object was looking at the file
> > > >> > > > in-place, i.e. zero-copy?
> > >> > > >
> > > >> > > > Am I tying together the memory-mapping and the batch construction in the
> > > >> > > > wrong way?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > John
> > >> > >
> > >>
> > >
>
