RE: cpp Memory Pool Clarification

Ivan Chau Tue, 12 Jul 2022 09:10:23 -0700

Would this also explain the lack of allocations, reallocations or frees when 
creating a pipeline with just a source and a sink?


For example, we do not see logs for a regular source node, a table source 
node, or a streaming file reader (using RecordBatchFileReader and 
MakeReaderGenerator to build the generator for a regular source node).

-----Original Message-----
From: Weston Pace <weston.p...@gmail.com>
Sent: Monday, July 11, 2022 4:37 PM
To: dev@arrow.apache.org
Subject: Re: cpp Memory Pool Clarification

> Is there anything else I'd need to change?

Maybe try something like this:
https://github.com/westonpace/arrow/commit/15ac0d051136c585cda63297e48f17557808d898

> Beyond that, we should also expect to see some allocations from 
> TableSourceNode going through the logging memory pool, even if AsOfJoinNode 
> was using the default memory pool instead of the Exec Plan's pool, but I am 
> not seeing anything come through...

TableSourceNode wouldn't need to allocate since it runs against memory that's 
already been allocated.  It might split input into smaller batches but slicing 
tables / arrays is a zero-copy operation that does not require allocating new 
buffers.
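
To illustrate why slicing produces no pool traffic, here is a minimal 
standalone sketch. It does not use the Arrow API: CountingPool, ArrayView 
and SplitIntoBatches are hypothetical stand-ins that only model the idea 
of a slice being an (offset, length) view over a shared buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a logging memory pool: it only counts calls.
struct CountingPool {
  std::size_t num_allocations = 0;

  std::shared_ptr<std::vector<std::int64_t>> Allocate(std::size_t length) {
    ++num_allocations;
    return std::make_shared<std::vector<std::int64_t>>(length);
  }
};

// Hypothetical stand-in for an array: a view (offset + length) over a
// shared buffer.
struct ArrayView {
  std::shared_ptr<std::vector<std::int64_t>> buffer;
  std::size_t offset;
  std::size_t length;

  // Slicing shares the buffer and adjusts offset/length; the pool is
  // never asked for a new buffer.
  ArrayView Slice(std::size_t slice_offset, std::size_t slice_length) const {
    assert(slice_offset + slice_length <= length);
    return ArrayView{buffer, offset + slice_offset, slice_length};
  }

  std::int64_t Value(std::size_t i) const { return (*buffer)[offset + i]; }
};

// Splits one array into batches of at most batch_size elements, roughly
// what a source node does with already-materialized input.
std::vector<ArrayView> SplitIntoBatches(const ArrayView& input,
                                        std::size_t batch_size) {
  std::vector<ArrayView> batches;
  for (std::size_t start = 0; start < input.length; start += batch_size) {
    const std::size_t len = std::min(batch_size, input.length - start);
    batches.push_back(input.Slice(start, len));
  }
  return batches;
}
```

In this model the only pool allocation is the one that created the input 
buffer before the plan started. The std::vector holding the batches does 
allocate, but from the regular heap rather than the pool, so a logging 
pool attached to the plan would stay silent.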

On Mon, Jul 11, 2022 at 12:46 PM Ivan Chau <ivan.c...@twosigma.com> wrote:
>
> Yeah this behavior is certainly a bit strange then.
>
> The only alteration I am making is changing the way we create the Execution 
> Context in the benchmark file.
>
> Something like:
>
> ```
> auto logging_pool = LoggingMemoryPool(default_memory_pool());
> ExecContext ctx(&logging_pool, ...);
> ```
>
> Is there anything else I'd need to change?
>
> Beyond that, we should also expect to see some allocations from 
> TableSourceNode going through the logging memory pool, even if AsOfJoinNode 
> was using the default memory pool instead of the Exec Plan's pool, but I am 
> not seeing anything come through...
>
> -----Original Message-----
> From: Weston Pace <weston.p...@gmail.com>
> Sent: Monday, July 11, 2022 2:47 PM
> To: dev@arrow.apache.org
> Subject: Re: cpp Memory Pool Clarification
>
> Are you changing the default memory pool to a LoggingMemoryPool?
> Where are you doing this?  For a benchmark I think you would need to change 
> the implementation in the benchmark file itself.
>
> Similarly, is AsOfJoinNode using the default memory pool or the memory pool 
> of the exec plan?  It should be exclusively using the latter, but it's easy 
> to accidentally use the default memory pool instead.  It probably won't make 
> much of a difference at the end of the day, as benchmarks normally 
> configure an exec plan to use the default memory pool and so the two pools 
> would be the same.
>
> > My expectation is that we would see some pretty sizable calls to Allocate 
> > when we begin to read files or to create tables, but that is not evident.
>
> Yes, the materialization step of an asof join uses array builders and those 
> will be allocating buffers from a memory pool.
>
> > 1) To my understanding, only large allocations will call Allocate.
> > Are there allocations (for files, table objects), which despite
> > being of large size, do not call Allocate?
>
> No.  There is no size limit for the allocator.  Instead, when people were 
> talking about "large allocations" and "small allocations" in the previous 
> thread it was more of a general concept.
>
> For example, if I create an array builder, add some items to it, and then 
> create an array then this will always use a memory pool for the allocation.  
> This will be true even if I create an array with a single element in it (in 
> which case the allocation is often padded for alignment purposes).
>
> On the other hand, schemas keep their fields in a std::vector which never 
> uses the memory pool for allocation.  This is true even if I have 10,000 
> columns and the vector's memory is actually quite large.
>
> However, in general, arrays tend to be quite large and schemas tend to be 
> quite small.
>
> > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > calls to Allocate/Reallocate/Free?
>
> I don't think that is possible.
>
> On Mon, Jul 11, 2022 at 10:44 AM Ivan Chau <ivan.m.c...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I've been doing some testing with LoggingMemoryPool to benchmark our
> > AsOfJoin implementation
> > <https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/exec/asof_join_node.cc>.
> > Our underlying memory pool for the LoggingMemoryPool is the
> > default_memory_pool (this is process-wide).
> >
> > Curiously enough, I don't see any allocations, reallocations, or
> > frees when we run our benchmarking code. I also see that the
> > max_memory property of the memory pool (which is documented as the
> > peak memory allocation), is nonzero (1.2e9 bytes).
> >
> > My expectation is that we would see some pretty sizable calls to
> > Allocate when we begin to read files or to create tables, but that is not 
> > evident.
> >
> > 1) To my understanding, only large allocations will call Allocate.
> > Are there allocations (for files, table objects), which despite
> > being of large size, do not call Allocate?
> >
> > 2) How can maximum_peak_memory be nonzero if we have not seen any
> > calls to Allocate/Reallocate/Free?
> >
> > Thank you!
