Re: memory mapped IPC File of RecordBatches?

John Muehlhausen Wed, 22 May 2019 18:20:14 -0700

(new test attached)

On Wed, May 22, 2019 at 8:09 PM John Muehlhausen <j...@jgm.org> wrote:


> I don't think that is it.  I changed my mmap to MAP_PRIVATE in the first
> raw mmap test and the dd changes are still visible.  I also changed to
> storing the stream format instead of the file format and got the same
> result.
>
> Where is the code that constructs a buffer/array by pointing it into the
> mmap space instead of by allocating space?  Sorry I'm so confused about
> this, I just don't see how it is supposed to work.
>
> On Wed, May 22, 2019 at 7:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> It seems this could be due to our use of MAP_PRIVATE for read-only memory
>> maps
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
>>
>> Some more investigation would be required
>>
>> On Wed, May 22, 2019 at 7:43 PM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > Is there an example somewhere of referring to the RecordBatch data in a
>> > memory-mapped IPC File in a zero-copy manner?
>> >
>> > I tried to do this in Python and must be doing something wrong.  (I
>> > don't really care whether the example is Python or C++)
>> >
>> > In the attached test, when I get to the first prompt and hit return, I
>> > get the same content again.  Likewise when I hit return on the second
>> > prompt I get the same content again.
>> >
>> > However, if before hitting return on the first prompt I issue:
>> >
>> > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
>> >
>> > i.e. overwrite the contents of the file, I get a garbled result.
>> > (Replace 478 with the size of your file.)
>> >
>> > However, if I wait until the second prompt to issue the dd command
>> > before hitting return, I do not get an error.  Instead, batch.to_pandas()
>> > works the same both before and after the data is overwritten.  This was not
>> > expected as I thought that the batch object was looking at the file
>> > in-place, i.e. zero-copy?
>> >
>> > Am I tying together the memory-mapping and the batch construction in
>> > the wrong way?
>> >
>> > Thanks,
>> > John
>>
>
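
For anyone reproducing the raw mmap observation without dd, here is a minimal stdlib-only sketch (Unix only). It maps a scratch file read-only with MAP_PRIVATE, overwrites the file in place, and returns what the mapping shows before and after. The function name and the scratch file are illustrative and not part of the attached test. POSIX leaves it unspecified whether later changes to the file are visible through a MAP_PRIVATE mapping, so the second value may differ between systems.

```python
import mmap
import os
import tempfile


def private_map_before_after():
    """Return the first 6 mapped bytes before and after an in-place file overwrite.

    Hypothetical helper for illustration; uses a scratch temp file rather
    than /tmp/test.batch.
    """
    fd, path = tempfile.mkstemp()
    try:
        # Fill one page with a known pattern.
        os.write(fd, b"A" * mmap.PAGESIZE)
        mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)
        try:
            before = mm[0:6]
            # Overwrite the start of the file in place, as dd conv=notrunc would.
            os.pwrite(fd, b"B" * 6, 0)
            after = mm[0:6]
        finally:
            mm.close()
    finally:
        os.close(fd)
        os.unlink(path)
    return before, after


if __name__ == "__main__":
    before, after = private_map_before_after()
    print("before:", before)
    print("after: ", after)
```

If `after` differs from `before`, the overwrite was visible through the private mapping, which would match the MAP_PRIVATE behaviour reported above.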

import mmap
import pyarrow as pa

# One int32 column containing a null, so the batch carries a validity bitmap
# as well as a data buffer.
batch = pa.RecordBatch.from_arrays([pa.array([1, None], type=pa.int32())], ['field1'])

# Write the batch to disk in the IPC stream format
# (the file-format writer is left commented out for comparison).
with open('/tmp/test.batch', 'wb') as sink:
    #writer = pa.RecordBatchFileWriter(sink, batch.schema)
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()

# Test 1: raw mmap of the file. Run dd at the prompt and compare the bytes.
with open('/tmp/test.batch', 'rb') as source:
    #reader = pa.ipc.open_stream(source.read()[8:])
    reader = pa.ipc.open_stream(source.read())
    print(reader.read_pandas())
    mm = mmap.mmap(source.fileno(), 0, prot=mmap.PROT_READ, flags=mmap.MAP_PRIVATE)
    print(mm[0:6])
    input("run dd, then return to continue")
    print(mm[0:6])

# Test 2: read the batch through pyarrow's memory map. Run dd at the prompt
# and compare the two to_pandas() results.
with pa.memory_map('/tmp/test.batch') as source:
    #reader = pa.ipc.open_file(source)
    reader = pa.ipc.open_stream(source)
    #or?
    #reader = pa.RecordBatchFileReader(source)

    # shouldn't this be zero-copy?
    for batch in reader:
        print(batch.to_pandas())
        input("run dd, then return to continue")
        print(batch.to_pandas())
