I'm not seeing it as a bug at this point. I was only using it to convince myself that the batch was zero-copy.
As you know I'd like to draw up a proposal for fully pre-allocated but only partially populated batches in order to facilitate low-latency streaming appends. It may come up then? I'm making a note of it.

On Thu, May 23, 2019 at 8:25 AM Wes McKinney <wesmck...@gmail.com> wrote:

> OK. Can you open a JIRA about fixing this? I don't recall the
> rationale for using MAP_PRIVATE to begin with, and since the behavior
> is unspecified on Linux it would be better to be consistent across
> platforms
>
> On Wed, May 22, 2019 at 11:02 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > Well, it works fine on Linux... and the Linux mmap man page seems to
> > indicate you are right about MAP_PRIVATE:
> >
> > "It is unspecified whether changes made to the file after the mmap() call
> > are visible in the mapped region."
> >
> > The Mac man page has no such note.
> >
> > Changing it to MAP_SHARED makes it work as expected on MacOS. Still odd
> > that the changes are only sometimes visible ... but I guess that is
> > compatible with it being "unspecified."
> >
> > -John
> >
> > On Wed, May 22, 2019 at 8:56 PM John Muehlhausen <j...@jgm.org> wrote:
> >
> > > I'll mess with this on various platforms and report back.
> > > Thanks
> > >
> > > On Wed, May 22, 2019 at 8:42 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> I tried locally and am not seeing this behavior
> > >>
> > >> In [10]: source = pa.memory_map('/tmp/test.batch')
> > >>
> > >> In [11]: reader=pa.ipc.open_stream(source)
> > >>
> > >> In [12]: batch = reader.get_next_batch()
> > >> /home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning:
> > >> Please use read_next_batch instead of get_next_batch
> > >> #!/home/wesm/miniconda/envs/arrow-3.7/bin/python
> > >>
> > >> In [13]: batch.to_pandas()
> > >> Out[13]:
> > >>    field1
> > >> 0     1.0
> > >> 1     NaN
> > >>
> > >> Now ran dd to overwrite the file contents
> > >>
> > >> In [14]: batch.to_pandas()
> > >> Out[14]:
> > >>         field1
> > >> 0          NaN
> > >> 1 -245785081.0
> > >>
> > >> On Wed, May 22, 2019 at 8:34 PM John Muehlhausen <j...@jgm.org> wrote:
> > >> >
> > >> > I don't think that is it. I changed my mmap to MAP_PRIVATE in the first
> > >> > raw mmap test and the dd changes are still visible. I also changed to
> > >> > storing the stream format instead of the file format and got the same
> > >> > result.
> > >> >
> > >> > Where is the code that constructs a buffer/array by pointing it into the
> > >> > mmap space instead of by allocating space? Sorry I'm so confused about
> > >> > this, I just don't see how it is supposed to work.
> > >> >
> > >> > On Wed, May 22, 2019 at 7:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >> >
> > >> > > It seems this could be due to our use of MAP_PRIVATE for read-only
> > >> > > memory maps
> > >> > >
> > >> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
> > >> > >
> > >> > > Some more investigation would be required
> > >> > >
> > >> > > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <j...@jgm.org> wrote:
> > >> > > >
> > >> > > > Is there an example somewhere of referring to the RecordBatch data
> > >> > > > in a memory-mapped IPC File in a zero-copy manner?
> > >> > > >
> > >> > > > I tried to do this in Python and must be doing something wrong. (I
> > >> > > > don't really care whether the example is Python or C++)
> > >> > > >
> > >> > > > In the attached test, when I get to the first prompt and hit
> > >> > > > return, I get the same content again. Likewise when I hit return
> > >> > > > on the second prompt I get the same content again.
> > >> > > >
> > >> > > > However, if before hitting return on the first prompt I issue:
> > >> > > >
> > >> > > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
> > >> > > >
> > >> > > > i.e. overwrite the contents of the file, I get a garbled result.
> > >> > > > (Replace 478 with the size of your file.)
> > >> > > >
> > >> > > > However, if I wait until the second prompt to issue the dd command
> > >> > > > before hitting return, I do not get an error. Instead,
> > >> > > > batch.to_pandas() works the same both before and after the data is
> > >> > > > overwritten. This was not expected as I thought that the batch
> > >> > > > object was looking at the file in-place, i.e. zero-copy?
> > >> > > >
> > >> > > > Am I tying together the memory-mapping and the batch construction
> > >> > > > in the wrong way?
> > >> > > >
> > >> > > > Thanks,
> > >> > > > John
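
For reference, the MAP_SHARED behavior discussed in this thread can be checked outside of Arrow. The following is a minimal sketch using only Python's standard mmap module on a Unix system; it is not the attached test and not Arrow's code. The helper name shared_mapping_sees_overwrite and the temporary file are illustrative assumptions. It maps a file read-only with MAP_SHARED, overwrites the file in place through a second descriptor (much as dd conv=notrunc does), and reads the mapping again. It deliberately does not exercise MAP_PRIVATE, where the Linux man page quoted above leaves visibility unspecified.

```python
import mmap
import os
import tempfile


def shared_mapping_sees_overwrite(size=64):
    """Map a file read-only with MAP_SHARED, overwrite the file in place,
    and return the mapped bytes as seen before and after the overwrite."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"a" * size)

        # Read-only shared mapping of the whole file (Unix-only flags).
        view = mmap.mmap(fd, size, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        before = view[:size]  # slicing copies the bytes out of the mapping

        # Overwrite the file contents without truncating, via a separate
        # descriptor; leaving the with-block flushes and closes it.
        with open(path, "r+b") as f:
            f.write(b"b" * size)

        after = view[:size]
        view.close()
        return before, after
    finally:
        os.close(fd)
        os.remove(path)


before, after = shared_mapping_sees_overwrite()
print(before[:4], after[:4])
```

If the second read reflects the overwrite, the mapping is looking at the file in place. That is the same property the dd experiment above was probing through batch.to_pandas().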