Re: memory mapped IPC File of RecordBatches?

John Muehlhausen Wed, 22 May 2019 21:02:47 -0700

Well, it works fine on Linux... and the Linux mmap man page seems to
indicate you are right about MAP_PRIVATE:


"It is unspecified whether changes made to the file after the mmap() call
are visible in the mapped region."

The Mac man page has no such note.

Changing it to MAP_SHARED makes it work as expected on macOS.  Still odd
that the changes are only sometimes visible with MAP_PRIVATE ... but I
guess that is compatible with it being "unspecified."
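
For anyone who wants to reproduce the difference without Arrow, here is a
minimal sketch using only Python's standard mmap module on a Unix system.
The helper names (map_read_only, overwrite_in_place, demo) and the
temporary file are made up for illustration and are not Arrow's code. It
maps the same file once with MAP_SHARED and once with MAP_PRIVATE,
overwrites the file in place (like dd conv=notrunc), and reports what each
mapping sees. Only the MAP_SHARED result is something to rely on; the
MAP_PRIVATE result is whatever the platform happens to do.

```python
import mmap
import os
import tempfile


def map_read_only(path, flags):
    """Map the whole file read-only with the given mmap flags (Unix only)."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. The mapping remains valid after
        # the file object is closed.
        return mmap.mmap(f.fileno(), 0, flags=flags, prot=mmap.PROT_READ)


def overwrite_in_place(path, payload):
    """Overwrite the start of the file without truncating it."""
    with open(path, "r+b") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def demo():
    """Return what a shared and a private mapping see around an overwrite."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"A" * 4096)

        shared = map_read_only(path, mmap.MAP_SHARED)
        private = map_read_only(path, mmap.MAP_PRIVATE)
        before = bytes(shared[:4])

        overwrite_in_place(path, b"B" * 4096)

        result = {
            "before": before,
            # MAP_SHARED: the mapping reflects the file's new contents.
            "shared_after": bytes(shared[:4]),
            # MAP_PRIVATE: unspecified, may be the old or the new bytes.
            "private_after": bytes(private[:4]),
        }
        shared.close()
        private.close()
        return result
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Comparing "private_after" across Linux and macOS is one way to see the
platform difference described above.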

-John

On Wed, May 22, 2019 at 8:56 PM John Muehlhausen <[email protected]> wrote:

> I'll mess with this on various platforms and report back.  Thanks
>
> On Wed, May 22, 2019 at 8:42 PM Wes McKinney <[email protected]> wrote:
>
>> I tried locally and am not seeing this behavior
>>
>> In [10]: source = pa.memory_map('/tmp/test.batch')
>>
>> In [11]: reader=pa.ipc.open_stream(source)
>>
>> In [12]: batch = reader.get_next_batch()
>> /home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning:
>> Please use read_next_batch instead of get_next_batch
>>   #!/home/wesm/miniconda/envs/arrow-3.7/bin/python
>>
>> In [13]: batch.to_pandas()
>> Out[13]:
>>    field1
>> 0     1.0
>> 1     NaN
>>
>> I then ran dd to overwrite the file contents:
>>
>> In [14]: batch.to_pandas()
>> Out[14]:
>>         field1
>> 0          NaN
>> 1 -245785081.0
>>
>> On Wed, May 22, 2019 at 8:34 PM John Muehlhausen <[email protected]> wrote:
>> >
>> > I don't think that is it.  I changed my mmap to MAP_PRIVATE in the first
>> > raw mmap test and the dd changes are still visible.  I also changed to
>> > storing the stream format instead of the file format and got the same
>> > result.
>> >
>> > Where is the code that constructs a buffer/array by pointing it into the
>> > mmap space instead of by allocating space?  Sorry I'm so confused about
>> > this, I just don't see how it is supposed to work.
>> >
>> > On Wed, May 22, 2019 at 7:58 PM Wes McKinney <[email protected]> wrote:
>> >
>> > > It seems this could be due to our use of MAP_PRIVATE for read-only
>> > > memory maps
>> > >
>> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
>> > >
>> > > Some more investigation would be required
>> > >
>> > > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <[email protected]> wrote:
>> > > >
>> > > > Is there an example somewhere of referring to the RecordBatch data
>> > > > in a memory-mapped IPC File in a zero-copy manner?
>> > > >
>> > > > I tried to do this in Python and must be doing something wrong.  (I
>> > > > don't really care whether the example is Python or C++.)
>> > > >
>> > > > In the attached test, when I get to the first prompt and hit
>> > > > return, I get the same content again.  Likewise when I hit return
>> > > > on the second prompt I get the same content again.
>> > > >
>> > > > However, if before hitting return on the first prompt I issue:
>> > > >
>> > > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
>> > > >
>> > > >
>> > > > i.e. overwrite the contents of the file, I get a garbled result.
>> > > > (Replace 478 with the size of your file.)
>> > > >
>> > > > However, if I wait until the second prompt to issue the dd command
>> > > > before hitting return, I do not get an error.  Instead,
>> > > > batch.to_pandas() works the same both before and after the data is
>> > > > overwritten.  This was not expected as I thought that the batch
>> > > > object was looking at the file in-place, i.e. zero-copy?
>> > > >
>> > > > Am I tying together the memory-mapping and the batch construction
>> > > > in the wrong way?
>> > > >
>> > > > Thanks,
>> > > > John
>> > >
>>
>
