Hi Robert, thank you for this analysis. Having a memory-map interface that supports growing the map sounds useful, so we would welcome this contribution to the project.
Best,
Wes

On Fri, May 11, 2018 at 10:23 AM, Ambalu, Robert <robert.amb...@point72.com> wrote:
> Antoine, fair point. I just ran some perf stats using FileOutputStream vs
> my growing mmap impl.
> It seems in most cases you are correct: their runtimes are basically
> equivalent. The only time mmap beats it significantly is when there are
> many Flush calls. I have a parameter to control how many rows to buffer
> before finishing a record batch and writing it out. Note that my mmap
> impl currently doubles its size every time it is asked to grow.
>
> Testing on writing 5 double columns on 10 million rows I get the
> following:
>
> MMAP:
> BatchSize   Time
> 1           01:24.849
> 10          00:08.980
> 100         00:02.105
> 1000        00:01.081
> 10000       00:01.101
>
> FILE:
> BatchSize   Time
> 1           03:13.982
> 10          00:18.875
> 100         00:03.172
> 1000        00:01.137
> 10000       00:01.104
>
> -----Original Message-----
> From: Antoine Pitrou [mailto:anto...@python.org]
> Sent: Friday, May 11, 2018 4:54 AM
> To: dev@arrow.apache.org
> Subject: Re: Question about streaming to memorymapped files
>
> If you write your own auto-growing memory-mapped file implementation,
> I'd be curious about performance measurements vs. FileOutputStream (and
> possibly BufferedOutputStream).
>
> mremap() and truncate() calls are not free. Also, at some point you'll
> want to unmap data already written to prevent the map from growing
> endlessly.
>
> Regards
>
> Antoine.
>
> Le 09/05/2018 à 17:55, Ambalu, Robert a écrit :
>> I don't use the output stream objects directly though, right? To take a
>> step back a bit, what I'm trying to do is stream rows to a table in
>> real time (with the ability to control how many rows to batch up
>> before writing out a RecordBatch).
>>
>> My understanding is that to properly stream table data I need to:
>> a) create an OutputStream instance
>> b) create a RecordBatchStreamWriter, binding my stream object to it
>> c) create a RecordBatchBuilder.
>> As rows are added, add them to the RecordBatchBuilder. When we're
>> ready to flush, call Flush on the builder to create a RecordBatch and
>> pass the batch to the RecordBatchStreamWriter.
>>
>> I was hoping to use MemoryMappedFile for (a), but since it doesn't
>> support dynamically growing the mmap'd file I'll have to write my own
>> impl.
>>
>> -----Original Message-----
>> From: Antoine Pitrou [mailto:anto...@python.org]
>> Sent: Wednesday, May 09, 2018 11:42 AM
>> To: dev@arrow.apache.org
>> Subject: Re: Question about streaming to memorymapped files
>>
>> As for buffering data before making a call to write(): in Arrow 0.10.0
>> you'll be able to use BufferedOutputStream for this:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.h
>>
>> Regards
>>
>> Antoine.
>>
>> Le 09/05/2018 à 17:39, Ambalu, Robert a écrit :
>>> I don't have any offhand, no, but I would imagine that direct file
>>> writes will at some point need to make a system call, which is
>>> expensive (fwrite might buffer before eventually making the syscall;
>>> it looks like FileOutputStream uses the raw system write() for every
>>> write call).
>>> The current mmap IO interface isn't usable as a streaming output,
>>> unfortunately, though I suppose I could implement my own.
>>>
>>> -----Original Message-----
>>> From: Antoine Pitrou [mailto:solip...@pitrou.net]
>>> Sent: Wednesday, May 09, 2018 11:11 AM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Question about streaming to memorymapped files
>>>
>>> Do you know of any benchmark numbers / performance studies about this?
>>> While it's true that a memory-mapped file avoids explicit system
>>> calls, I've heard file I/O is quite well optimized nowadays, at least
>>> on Linux.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>> On Wed, 9 May 2018 14:47:53 +0000
>>> "Ambalu, Robert" <robert.amb...@point72.com> wrote:
>>>> Antoine, thanks for the quick reply.
>>>> You can actually grow memory-mapped files with an mremap call (and I
>>>> think a seek/write on the file); I do this in my applications and it
>>>> works fine.
>>>> I want the efficiency of writing via memory maps, so I would prefer
>>>> to avoid FileOutputStream.
>>>>
>>>> -----Original Message-----
>>>> From: Antoine Pitrou [mailto:anto...@python.org]
>>>> Sent: Wednesday, May 09, 2018 10:37 AM
>>>> To: dev@arrow.apache.org
>>>> Subject: Re: Question about streaming to memorymapped files
>>>>
>>>> Hi,
>>>>
>>>> If you don't know the output size upfront then you should probably
>>>> use a FileOutputStream instead. By definition, memory-mapped files
>>>> must have a fixed size (since they are mapped to a fixed area in
>>>> virtual memory).
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>> Le 09/05/2018 à 16:31, Ambalu, Robert a écrit :
>>>>> Hey, I'm looking into streaming table updates into a memory-mapped
>>>>> file (C++).
>>>>> I think I have everything I need (MemoryMappedFile output streamer,
>>>>> RecordBatchStreamWriter) but I don't understand how to properly
>>>>> create the memmap file. It looks like it requires you to preset a
>>>>> size for the file when you create it, but since I'll be streaming I
>>>>> don't actually know how big a file I'm going to need...
>>>>> Am I missing some other API point here? Any reason why the size is
>>>>> required up front and the memmap doesn't auto-grow as needed?
>>>>>
>>>>> Thanks in advance
>>>>> - Rob