(new test attached)

On Wed, May 22, 2019 at 8:09 PM John Muehlhausen <j...@jgm.org> wrote:

> I don't think that is it. I changed my mmap to MAP_PRIVATE in the first
> raw mmap test and the dd changes are still visible. I also changed to
> storing the stream format instead of the file format and got the same
> result.
>
> Where is the code that constructs a buffer/array by pointing it into the
> mmap space instead of by allocating space? Sorry I'm so confused about
> this, I just don't see how it is supposed to work.
>
> On Wed, May 22, 2019 at 7:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> It seems this could be due to our use of MAP_PRIVATE for read-only
>> memory maps
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
>>
>> Some more investigation would be required
>>
>> On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > Is there an example somewhere of referring to the RecordBatch data
>> > in a memory-mapped IPC File in a zero-copy manner?
>> >
>> > I tried to do this in Python and must be doing something wrong. (I
>> > don't really care whether the example is Python or C++)
>> >
>> > In the attached test, when I get to the first prompt and hit return,
>> > I get the same content again. Likewise when I hit return on the
>> > second prompt I get the same content again.
>> >
>> > However, if before hitting return on the first prompt I issue:
>> >
>> > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
>> >
>> > i.e. overwrite the contents of the file, I get a garbled result.
>> > (Replace 478 with the size of your file.)
>> >
>> > However, if I wait until the second prompt to issue the dd command
>> > before hitting return, I do not get an error. Instead,
>> > batch.to_pandas() works the same both before and after the data is
>> > overwritten. This was not expected as I thought that the batch
>> > object was looking at the file in-place, i.e. zero-copy?
>> >
>> > Am I tying together the memory-mapping and the batch construction
>> > in the wrong way?
>> >
>> > Thanks,
>> > John
>> >
import mmap
import pyarrow as pa

batch=pa.RecordBatch.from_arrays([ pa.array([1,None],type=pa.int32()) ], [ 'field1' ])

with open('/tmp/test.batch','wb') as sink:
    #writer=pa.RecordBatchFileWriter(sink, batch.schema)
    writer=pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

with open('/tmp/test.batch','rb') as source:
    #reader=pa.ipc.open_stream(source.read()[8:])
    reader=pa.ipc.open_stream(source.read())
    print(reader.read_pandas())
    mm = mmap.mmap(source.fileno(),0,prot=mmap.PROT_READ,flags=mmap.MAP_PRIVATE)
    print(mm[0:6])
    input("run dd, then return to continue")
    print(mm[0:6])

with pa.memory_map('/tmp/test.batch') as source:
    #reader=pa.ipc.open_file(source)
    reader=pa.ipc.open_stream(source)
    #or?
    #reader=pa.RecordBatchFileReader(source)
    # shouldn't this be zero-copy?
    for batch in reader:
        print(batch.to_pandas())
        input("run dd, then return to continue")
        print(batch.to_pandas())
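
For anyone who wants to repeat the raw-mmap half of the test without typing dd at a prompt, here is a minimal sketch that uses only the Python standard library (no pyarrow). The helper name mapped_view_tracks_file and the temporary file are made up for illustration, and the sketch assumes a Unix system with a MAP_SHARED read-only mapping. It says nothing about what pyarrow does internally; it only shows what "seeing the file in place" looks like through a plain mapping. With MAP_PRIVATE, POSIX leaves it unspecified whether later changes to the file show up through the mapping, so this sketch makes no claim about that case.

```python
import mmap
import os
import tempfile


def mapped_view_tracks_file(flags=mmap.MAP_SHARED):
    """Return the bytes seen through a read-only mapping before and after
    the underlying file is overwritten in place.

    Hypothetical helper for illustration only; Unix-specific.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"original")
        # Map the whole file read-only; nothing is copied into Python yet.
        mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ, flags=flags)
        try:
            before = mm[0:8]  # slicing copies these 8 bytes out of the mapping
            # Overwrite the file in place, much like `dd conv=notrunc` would.
            os.pwrite(fd, b"REPLACED", 0)
            after = mm[0:8]  # read the same range through the same mapping
        finally:
            mm.close()
    finally:
        os.close(fd)
        os.unlink(path)
    return before, after


if __name__ == "__main__":
    print(mapped_view_tracks_file())
```

If the second value differs from the first, the mapping is reading the file's pages directly rather than a private snapshot. That is the behaviour the second half of the attached test was expected to show for the batch as well.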