Hi Robert, thank you for this analysis. Having a memory-map interface that supports growing the map sounds useful, so we would welcome this contribution to the project.
Best,
Wes

On Fri, May 11, 2018 at 10:23 AM, Ambalu, Robert <robert.amb...@point72.com> wrote:
> Antoine, fair point. I just ran some perf stats using FileOutputStream vs
> my growing mmap impl.
> It seems in most cases you are correct: their runtimes are basically
> equivalent. The only time mmap beats it significantly is when there are
> many Flush calls. I have a parameter to control how many rows to buffer
> before finishing a record batch and writing it out. Note that my mmap
> impl currently doubles its size every time it is asked to grow.
>
> Testing on writing 5 double columns on 10 million rows I get the
> following:
>
> MMAP:
> BatchSize   Time
> 1           01:24.849
> 10          00:08.980
> 100         00:02.105
> 1000        00:01.081
> 10000       00:01.101
>
> FILE:
> BatchSize   Time
> 1           03:13.982
> 10          00:18.875
> 100         00:03.172
> 1000        00:01.137
> 10000       00:01.104
>
> -----Original Message-----
> From: Antoine Pitrou [mailto:anto...@python.org]
> Sent: Friday, May 11, 2018 4:54 AM
> To: dev@arrow.apache.org
> Subject: Re: Question about streaming to memorymapped files
>
> If you write your own auto-growing memory-mapped file implementation,
> I'd be curious about performance measurements vs. FileOutputStream (and
> possibly BufferedOutputStream).
>
> mremap() and truncate() calls are not free. Also, at some point you'll
> want to unmap data already written to prevent the map from growing
> endlessly.
>
> Regards
>
> Antoine.
>
> Le 09/05/2018 à 17:55, Ambalu, Robert a écrit :
>> I don't use the output stream objects directly though, right? To take a
>> step back a bit, what I'm trying to do is stream rows to a table in
>> real time (with the ability to control how many rows to batch up
>> before writing out a RecordBatch).
>>
>> My understanding is that to properly stream table data I need to:
>> a) create an OutputStream instance
>> b) create a RecordBatchStreamWriter, binding my stream object to it
>> c) create a RecordBatchBuilder.
>> As rows are added, add them to the RecordBatchBuilder. When we're
>> ready to flush, call Flush on the builder to create a RecordBatch and
>> pass the batch to the RecordBatchStreamWriter.
>>
>> I was hoping to use MemoryMappedFile for (a), but since it doesn't
>> support dynamically growing the mmap'd file I'll have to write my own
>> impl.
>>
>> -----Original Message-----
>> From: Antoine Pitrou [mailto:anto...@python.org]
>> Sent: Wednesday, May 09, 2018 11:42 AM
>> To: dev@arrow.apache.org
>> Subject: Re: Question about streaming to memorymapped files
>>
>> As for buffering data before making a call to write(): in Arrow 0.10.0
>> you'll be able to use BufferedOutputStream for this:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.h
>>
>> Regards
>>
>> Antoine.
>>
>> Le 09/05/2018 à 17:39, Ambalu, Robert a écrit :
>>> I don't have any offhand, no, but I would imagine that direct file
>>> writes will at some point need to make a system call, which is
>>> expensive (fwrite might buffer before eventually making the syscall;
>>> it looks like FileOutputStream uses the raw system write() for every
>>> write call).
>>> The current mmap IO interface isn't usable as a streaming output,
>>> unfortunately, though I suppose I could implement my own.
>>>
>>> -----Original Message-----
>>> From: Antoine Pitrou [mailto:solip...@pitrou.net]
>>> Sent: Wednesday, May 09, 2018 11:11 AM
>>> To: dev@arrow.apache.org
>>> Subject: Re: Question about streaming to memorymapped files
>>>
>>> Do you know of any benchmark numbers / performance studies about this?
>>> While it's true that a memory-mapped file avoids explicit system
>>> calls, I've heard file I/O is quite well optimized nowadays, at least
>>> on Linux.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>> On Wed, 9 May 2018 14:47:53 +0000
>>> "Ambalu, Robert" <robert.amb...@point72.com> wrote:
>>>> Antoine, thanks for the quick reply.
>>>> You can actually grow memory-mapped files with an mremap call (and I
>>>> think a seek/write on the file); I do this in my applications and it
>>>> works fine.
>>>> I want the efficiency of writing via memory maps, so I would prefer
>>>> to avoid FileOutputStream.
>>>>
>>>> -----Original Message-----
>>>> From: Antoine Pitrou [mailto:anto...@python.org]
>>>> Sent: Wednesday, May 09, 2018 10:37 AM
>>>> To: dev@arrow.apache.org
>>>> Subject: Re: Question about streaming to memorymapped files
>>>>
>>>> Hi,
>>>>
>>>> If you don't know the output size upfront then you should probably
>>>> use a FileOutputStream instead. By definition, memory-mapped files
>>>> must have a fixed size (since they are mapped to a fixed area in
>>>> virtual memory).
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>> Le 09/05/2018 à 16:31, Ambalu, Robert a écrit :
>>>>> Hey, I'm looking into streaming table updates into a memory-mapped
>>>>> file (C++).
>>>>> I think I have everything I need (MemoryMappedFile output streamer,
>>>>> RecordBatchStreamWriter) but I don't understand how to properly
>>>>> create the memmap file. It looks like it requires you to preset a
>>>>> size for the file when you create it, but since I'll be streaming I
>>>>> don't actually know how big a file I'm going to need...
>>>>> Am I missing some other API point here? Any reason why the size is
>>>>> required up front and the memmap doesn't auto-grow as needed?
>>>>>
>>>>> Thanks in advance
>>>>> - Rob