I tried locally and am not seeing this behavior

In [10]: source = pa.memory_map('/tmp/test.batch')
In [11]: reader=pa.ipc.open_stream(source)

In [12]: batch = reader.get_next_batch()
/home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning: Please use read_next_batch instead of get_next_batch
  #!/home/wesm/miniconda/envs/arrow-3.7/bin/python

In [13]: batch.to_pandas()
Out[13]:
   field1
0     1.0
1     NaN

Now ran dd to overwrite the file contents

In [14]: batch.to_pandas()
Out[14]:
        field1
0          NaN
1 -245785081.0

On Wed, May 22, 2019 at 8:34 PM John Muehlhausen <j...@jgm.org> wrote:
>
> I don't think that is it. I changed my mmap to MAP_PRIVATE in the first
> raw mmap test and the dd changes are still visible. I also changed to
> storing the stream format instead of the file format and got the same
> result.
>
> Where is the code that constructs a buffer/array by pointing it into the
> mmap space instead of by allocating space? Sorry I'm so confused about
> this, I just don't see how it is supposed to work.
>
> On Wed, May 22, 2019 at 7:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > It seems this could be due to our use of MAP_PRIVATE for read-only memory
> > maps
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
> >
> > Some more investigation would be required
> >
> > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <j...@jgm.org> wrote:
> > >
> > > Is there an example somewhere of referring to the RecordBatch data in a
> > > memory-mapped IPC File in a zero-copy manner?
> > >
> > > I tried to do this in Python and must be doing something wrong. (I
> > > don't really care whether the example is Python or C++)
> > >
> > > In the attached test, when I get to the first prompt and hit return, I
> > > get the same content again. Likewise when I hit return on the second
> > > prompt I get the same content again.
> > >
> > > However, if before hitting return on the first prompt I issue:
> > >
> > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
> > >
> > > i.e. overwrite the contents of the file, I get a garbled result.
> > > (Replace 478 with the size of your file.)
> > >
> > > However, if I wait until the second prompt to issue the dd command
> > > before hitting return, I do not get an error. Instead, batch.to_pandas()
> > > works the same both before and after the data is overwritten. This was not
> > > expected as I thought that the batch object was looking at the file
> > > in-place, i.e. zero-copy?
> > >
> > > Am I tying together the memory-mapping and the batch construction in the
> > > wrong way?
> > >
> > > Thanks,
> > > John
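
For reference, the dd experiment described in this thread can be approximated without Arrow at all. The sketch below is only an illustration of the diagnostic, not Arrow's reader code: it uses Python's standard mmap module, a scratch temp file standing in for /tmp/test.batch, and two hypothetical helpers (overwrite_in_place, run_demo). It assumes a Unix-like system where a shared read-only mapping observes later writes to the same file; a MAP_PRIVATE mapping may behave differently, which is the open question above.

```python
import mmap
import os
import struct
import tempfile


def overwrite_in_place(path, payload):
    """Overwrite the start of the file without truncating it (like dd conv=notrunc)."""
    with open(path, "r+b") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def run_demo():
    # Scratch file standing in for /tmp/test.batch (hypothetical contents:
    # two little-endian float64 values, not a real Arrow IPC payload).
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<2d", 1.0, 2.0))

        # Read-only mapping; mmap duplicates the descriptor, so the file
        # object can be closed right away.
        with open(path, "rb") as f:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        view = memoryview(mapped)  # zero-copy: points into the mapping
        copied = mapped[:]         # eager copy into a bytes object

        before = struct.unpack_from("<2d", view)
        overwrite_in_place(path, struct.pack("<2d", 3.0, 4.0))
        after_view = struct.unpack_from("<2d", view)
        after_copy = struct.unpack_from("<2d", copied)

        # The view must be released before the mapping can be closed.
        view.release()
        mapped.close()
        return before, after_view, after_copy
    finally:
        os.remove(path)


if __name__ == "__main__":
    before, after_view, after_copy = run_demo()
    print("before overwrite:      ", before)
    print("zero-copy view after:  ", after_view)
    print("copied bytes after:    ", after_copy)
```

If the zero-copy view changes after the overwrite while the copied bytes do not, that matches the reasoning behind the dd test: data that still reflects the file after an in-place overwrite is being read through the mapping, and data that stays the same has been copied somewhere along the way.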