The way Postgres does it sounds pretty reasonable. I'd like to be able to do the reallocations out of the pre-allocated processing buffer, so that freeing them all at the end of the query is, well, free. (We can just reuse the buffer for the next query without actually doing anything.)

Maybe we carve out some space in the processing buffer for reallocations. Perhaps each aggregator type can give us a hint about the expected amount of additional memory needed beyond the starting size (so, zero for fixed-size aggregators, maybe 10% overhead for ones that need to grow?) and we can use that to size the carve-out.
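For instance, the hint could look something like this (a rough sketch only; expectedGrowthFraction is a made-up name, not an existing Druid method, though getMaxIntermediateSize is the existing per-aggregator size method on AggregatorFactory):

    // Sketch of the hint idea; all names besides getMaxIntermediateSize
    // are hypothetical.
    public abstract class GrowableAggregatorFactory
    {
      // Existing-style hint: worst-case aggregation state size in bytes.
      public abstract int getMaxIntermediateSize();

      // Hypothetical hint: expected extra memory beyond the starting size,
      // as a fraction. Zero for fixed-size aggregators; maybe 0.1 for ones
      // that need to grow.
      public double expectedGrowthFraction()
      {
        return 0.0;
      }

      // Sizing the reallocation carve-out for a set of aggregators.
      public static long carveOutBytes(GrowableAggregatorFactory[] factories)
      {
        long total = 0;
        for (GrowableAggregatorFactory f : factories) {
          total += (long) (f.getMaxIntermediateSize() * f.expectedGrowthFraction());
        }
        return total;
      }
    }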
If aggregators need more space beyond the carve-out, we could allocate memory outside of the processing buffer, perhaps on the Java heap or perhaps by grabbing additional off-heap memory beyond the processing buffer. (DoublesSketchBuildBufferAggregator already does this sneakily: the DirectUpdateDoublesSketch it is based on might bail completely out of the off-heap buffer and allocate new memory on-heap, without Druid knowing about it.)

I don't think we'd be able to reuse the original allocation. I'm not sure whether it'd be valuable to allow the aggregator to continue using it: if so, we could provide an array of positions instead of a single position in the BufferAggregator aggregate method, and an array of arrays of positions in the VectorAggregator one. If not, we can just leave it there unused until the query is over, when it'll get released.

Does this seem like a useful way of doing things? Himanshu, does this line up with what you were thinking about in https://github.com/apache/druid/issues/8126? What are your thoughts?

Gian

On Thu, Apr 16, 2020 at 1:30 PM Alexander Saydakov <sayda...@verizonmedia.com.invalid> wrote:

> PostgreSQL provides its own memory allocation and deallocation functions, palloc and pfree (replacements for malloc and free). The documentation for user-defined functions in C is here: https://www.postgresql.org/docs/current/xfunc-c.html
> I am not sure you need such broad context about writing UDFs in general.
>
> The main point about aggregate functions is that they consist of two parts: a state transition function and a finalizer. The state transition function is called to process each next value in an aggregation (such as a group-by). One of its parameters is a pointer to the current state. It is null on the first call; the function allocates an arbitrary data structure using palloc, updates it with the given value, and returns the pointer. It can choose to reallocate and return a different pointer, or return the same pointer. The finalizer is called at the end to convert this arbitrary state object into whatever return type is specified for the aggregation. When the query (or transaction) is finished, all allocations in this context are freed.
>
> On Wed, Apr 15, 2020 at 11:03 PM Himanshu <g.himan...@gmail.com> wrote:
>
>> Alex, thanks for posting it here.
>>
>> We have definitely thought about the ability to start with a small off-heap buffer and grow to bigger off-heap buffers if needed, and that is what https://github.com/apache/druid/issues/8126 is trying to address. However, we have never considered one sketch being spread across several different contiguous memory blocks. As you pointed out, that would likely require the BufferAggregator interface to have access to a Druid-supplied MemoryAllocator.
>>
>> > ... but we would need a "malloc / free"-like API similar to how PostgreSQL has implemented their aggregations API. ...
>> Can you point to some documentation/code describing what kind of API Postgres provides for implementing custom aggregators?
>>
>> -- Himanshu
>>
>> On Wed, Apr 15, 2020 at 10:11 AM Alexander Saydakov <sayda...@verizonmedia.com.invalid> wrote:
>>
>>> Let's involve Druid developers as Gian suggested.
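To make the transition-function pattern Alexander describes concrete, here is a minimal sketch in Java terms (all names are hypothetical; PostgreSQL's real API is C, per the xfunc-c documentation linked above):

    // Hypothetical PostgreSQL-style aggregation transposed to Java.
    // The state is an opaque handle; 0 plays the role of the null pointer.
    interface Allocator
    {
      long alloc(int bytes);              // like palloc
      long realloc(long ptr, int bytes);  // like repalloc; may move the state
      void freeAll();                     // end of query/transaction frees everything
    }

    interface Aggregate<R>
    {
      // State transition: called once per input value. On the first call the
      // state is 0 (null); the function allocates, updates, and returns the
      // (possibly moved) pointer.
      long transition(long state, double value, Allocator allocator);

      // Finalizer: converts the opaque state into the declared return type.
      R finish(long state);
    }

The engine would run state = agg.transition(state, value, allocator) per row, call agg.finish(state) once at the end, and allocator.freeAll() when the query finishes; nothing forces the state to live in a single contiguous block.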
>>>
>>> On Tue, Apr 14, 2020 at 4:44 PM leerho <lee...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> It is pretty easy to graphically view the various cross-platform discrepancies that Alex is talking about in the Sketch Features Matrix <https://datasketches.apache.org/docs/Architecture/SketchFeaturesMatrix.html>. We still need to add the PostgreSQL System Integration column. Nevertheless, it is easy to see that the sketches where we have implemented "Off Java Heap" capability correlate directly with our Druid integrations.
>>>>
>>>> To underscore Alex's point, many of the other sketches in this matrix that are not currently integrated with Druid have very dynamic memory requirements and may be impractical to implement in a contiguous-memory model. They could still be implemented "off-heap", but we would need a "malloc / free"-like API similar to how PostgreSQL has implemented their aggregations API.
>>>>
>>>> One of the conclusions from this matrix is that Druid would have a much richer and more powerful sketching analytics capability to offer its customers if it had a more flexible aggregations API.
>>>>
>>>> On Tue, Apr 14, 2020 at 1:47 PM Alexander Saydakov <sayda...@verizonmedia.com.invalid> wrote:
>>>>
>>>>> Gian,
>>>>> Thanks for your reply. I did not mean to suggest moving away from off-heap memory. As I see it, the problem is mostly with preallocating a fixed block of some maximum size for each sketch, which, on the one hand, is wasteful because most sketches will be small, and on the other hand, requires operating in a "contiguous block" mode, which is very limiting.
>>>>>
>>>>> Perhaps you might consider something like PostgreSQL does: provide an allocator, so that memory allocation is controlled and can be done in the context of a query or transaction, but does not impose a single contiguous block for a data structure. So a custom aggregation function allocates memory for its own state on the first call, and PostgreSQL passes this state to the next call as a pointer, and so on. At any moment the aggregation can choose to reallocate and return a new pointer. And that state can be a complex data structure with pointers to other blocks allocated using the same mechanism, so it does not have to be a single contiguous block.
>>>>>
>>>>> This will still require some changes to our library to support memory allocation like this, but it seems less challenging than the current Direct memory mode we have.
>>>>>
>>>>> There are some trade-offs here, as usual. For instance, currently (where implemented) wrapping a chunk of bytes, as a fast alternative to creating an on-heap object during deserialization, is the same operation as interpreting this "contiguous block" aggregation state. But if the aggregation state is fragmented, it is not going to be equivalent to a serialized blob anymore. This is how it is in the current C++ implementation: there is no equivalent of wrap() as in the Direct mode in Java. That is a potential performance improvement that can be discussed separately. Some sketches can still choose to operate in this contiguous block mode if it does not go against the algorithm.
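In Druid terms, the allocator idea sketched above might look something like the following (a rough sketch; QueryMemoryAllocator and GrowableBufferAggregator are invented names, not existing Druid interfaces, though the (ByteBuffer, position) arguments mirror the real BufferAggregator methods):

    import java.nio.ByteBuffer;

    // Hypothetical query-scoped allocator that Druid could supply to
    // aggregators. All blocks are released together when the query
    // finishes, like a PostgreSQL memory context.
    interface QueryMemoryAllocator
    {
      ByteBuffer allocate(int bytes);
      ByteBuffer reallocate(ByteBuffer old, int bytes);
      void releaseAll();  // called by the engine at end of query
    }

    // Hypothetical variant of Druid's BufferAggregator that can grow its
    // state beyond the fixed slot by calling back into the allocator, so
    // the state need not be one contiguous block.
    interface GrowableBufferAggregator
    {
      void init(ByteBuffer buf, int position, QueryMemoryAllocator allocator);
      void aggregate(ByteBuffer buf, int position, QueryMemoryAllocator allocator);
      Object get(ByteBuffer buf, int position);
    }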
>>>>>
>>>>> On Mon, Apr 13, 2020 at 10:43 PM Gian Merlino <g...@apache.org> wrote:
>>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> It sounds like integrating with Druid is an important concern here. It might be nice to cross-post the discussion to dev@druid.
>>>>>>
>>>>>> Personally, I don't think it's likely the Druid community will move away from requiring that aggregators be able to work with raw memory. The requirement exists so Druid can allocate an arena for aggregation space, minimizing GC load. But there have been some proposals floating around for adding the ability to dynamically allocate new memory (perhaps out of the same arena, perhaps some other way; the proposals varied). I think it would be useful for Druid devs to understand more deeply what effect the allocator API has on the DataSketches implementation.
>>>>>>
>>>>>> (By the way, the Druid issue you're referring to might be https://github.com/apache/druid/issues/8126.)
>>>>>>
>>>>>> On Mon, Apr 13, 2020 at 5:11 PM Alexander Saydakov <sayda...@verizonmedia.com.invalid> wrote:
>>>>>>
>>>>>>> Hi everyone!
>>>>>>> I would like to discuss some discrepancies we have accumulated as a result of decisions around prioritizing our development efforts in Java and C++, and also around integration with data processing systems.
>>>>>>>
>>>>>>> Druid is one of the most important systems in which our library is used. It has a strong requirement for off-heap memory support and also for operating in a single contiguous block. This was the main motivation behind what we call Direct memory support in our Java library.
>>>>>>>
>>>>>>> Currently some sketches do not support Direct memory, in particular the KLL quantiles sketch, the Frequent Items sketch, and the CPC distinct counting sketch. Therefore they are not integrated with Druid yet.
>>>>>>>
>>>>>>> Quantiles sketch:
>>>>>>> In Java we have ItemsSketch<T> for any type and a specialized numeric DoublesSketch with Direct memory support; this is the one integrated with Druid. We consider this algorithm superseded by the KLL quantiles sketch, but for historical reasons in Java we have implemented KllFloatsSketch only: no generic implementation and no Direct memory support. The contiguous memory block mode is so limiting that we did not think the KLL algorithm could be implemented efficiently in that mode. We might need to reconsider this. Development in C++ happened later, so we did not implement the older algorithm. We may want to do so if a strong use case arises for reading data prepared using the Java library. On the other hand, in C++ kll_sketch is a template, and therefore it can be a container of any user type. In Java we felt the need to have both: a generic implementation and a specialized implementation for some numeric type, to avoid object overhead.
>>>>>>>
>>>>>>> To summarize the situation with quantiles sketches:
>>>>>>> Java: KllFloatsSketch, older quantiles ItemsSketch<T> and DoublesSketch
>>>>>>> C++: kll_sketch<T> (kll_sketch<float> is compatible with Java), no older quantiles sketch.
>>>>>>> Druid: quantiles DoublesSketch
>>>>>>> PostgreSQL: kll_sketch<float>
>>>>>>>
>>>>>>> Regarding distinct counting sketches:
>>>>>>> Druid: Theta, HLL
>>>>>>> PostgreSQL: Theta, HLL, CPC
>>>>>>>
>>>>>>> Regarding frequent items:
>>>>>>> Druid: none
>>>>>>> PostgreSQL: frequent_items_sketch<std::string>
>>>>>>>
>>>>>>> It seems to me that we need to do something to improve the situation. Possible tasks:
>>>>>>> - Discuss alternatives to the Direct memory mode in Druid with Druid developers (again). They were not quite happy with the current approach either: it leads to gross overallocation of BufferAggregator memory. There was an open issue in the Druid repo about this, I believe.
>>>>>>> - Find some way to have the KLL sketch in Druid
>>>>>>> - Find some way to have the Frequent Items (frequent strings?) sketch in Druid
>>>>>>> - Consider KllItemsSketch<T> in Java
>>>>>>> - Consider a legacy quantiles_sketch<T> in C++ (and then in PostgreSQL, with double values, compatible with Druid)
>>>>>>>
>>>>>>> I would appreciate your thoughts about this.
>>>>>>> Thank you.
>>>>>>> Alexander Saydakov
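For reference, the "Direct memory" mode discussed throughout this thread means building a sketch directly on a caller-provided block of memory instead of as an ordinary heap object. A minimal sketch of that mode with the quantiles DoublesSketch (assuming datasketches-java's org.apache.datasketches packages; exact method names can vary between library versions):

    import org.apache.datasketches.memory.WritableMemory;
    import org.apache.datasketches.quantiles.DoublesSketch;
    import org.apache.datasketches.quantiles.UpdateDoublesSketch;

    public class DirectModeExample
    {
      public static void main(String[] args)
      {
        int k = 128;
        // Worst-case updatable size for k and an expected maximum item
        // count: this is the kind of fixed, maximum-size block that Druid
        // preallocates per aggregator slot.
        int bytes = DoublesSketch.getUpdatableStorageBytes(k, 1_000_000L);

        // Heap-backed memory here for simplicity; Druid would slice a
        // region out of a shared off-heap processing buffer instead.
        WritableMemory mem = WritableMemory.allocate(bytes);

        // The sketch lives entirely inside the provided block (Direct
        // mode); updates do not allocate on the Java heap.
        UpdateDoublesSketch sketch = DoublesSketch.builder().setK(k).build(mem);
        for (int i = 0; i < 10_000; i++) {
          sketch.update(i);
        }
        System.out.println(sketch.getQuantile(0.5));
      }
    }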