Wow, it seems I've found it! From the comments in the PostgreSQL executor
source (nodeAgg.c):

 * Beginning in PostgreSQL 8.1, the executor's AggState node is passed as
 * the fmgr "context" value in all transfunc and finalfunc calls.  It is
 * not really intended that the transition functions will look into the
 * AggState node, but they can use code like
 *        if (fcinfo->context && IsA(fcinfo->context, AggState))
 * to verify that they are being called by nodeAgg.c and not as ordinary
 * SQL functions.  The main reason a transition function might want to
 * know that is that it can avoid palloc'ing a fixed-size pass-by-ref
 * transition value on every call: it can instead just scribble on and
 * return its left input.  Ordinarily it is completely forbidden for
 * functions to modify pass-by-ref inputs, but in the aggregate case we
 * know the left input is either the initial transition value or a
 * previous function result, and in either case its value need not be
 * preserved.  See int8inc() for an example.  Notice that
 * advance_transition_function() is coded to avoid a data copy step when
 * the previous transition value pointer is returned.
So, in theory, I could create an intarray_aggregate_push() function which,
when called as an aggregate transition function, does not reallocate and
copy memory on every call. Instead, it could allocate, say, 1 MB up front
(with headroom), or double the buffer whenever it runs out of space
(O(log N) reallocations instead of O(N)); see the sketch after the quoted
thread below.

And here is an example from the source code:

Datum
int8inc(PG_FUNCTION_ARGS)
{
    if (fcinfo->context && IsA(fcinfo->context, AggState))
    {
        /*
         * Special case to avoid palloc overhead for COUNT(): when called
         * from nodeAgg, we know that the argument is modifiable local
         * storage, so just update it in-place.
         *
         * Note: this assumes int8 is a pass-by-ref type; if we ever
         * support pass-by-val int8, this should be ifdef'd out when
         * int8 is pass-by-val.
         */
        int64      *arg = (int64 *) PG_GETARG_POINTER(0);
        int64       result;

        result = *arg + 1;

        /* Overflow check */
        if (result < 0 && *arg > 0)
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("bigint out of range")));

        *arg = result;
        PG_RETURN_POINTER(arg);
    }
    ...
}

On 10/10/07, Dmitry Koterov <[EMAIL PROTECTED]> wrote:
>
> Thanks for your comment.
>
> I see two possible solution directions:
>
> 1. Is it possible to create a C function which could accept something
> like ROWSET(ARRAY[]) in its input parameters?
> E.g. to call it as
>
> SELECT array_rowset_glue((SELECT arrayfield FROM arraytable));
>
> or something like this?
>
> 2. Is it possible to implement in C something like this?
>
> array_buffer_init();
> SELECT array_buffer_push(arrayfield) FROM arraytable;
> ids := array_buffer_get();
> array_buffer_free();
>
> where array_buffer_push() is an aggregate function which returns void
> but, as a side effect, appends arrayfield to a global array buffer for
> later access with array_buffer_get().
>
> On 10/10/07, Filip Rembiałkowski <[EMAIL PROTECTED]> wrote:
> >
> > 2007/10/10, Dmitry Koterov <[EMAIL PROTECTED]>:
> > > Hello.
> > >
> > > I created an aggregate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_push (_int4)
> > > (
> > >   STYPE = _int4,
> > >   SFUNC = intarray_push_array,
> > >   INITCOND = '{}'
> > > );
> > >
> > > (or I may use _int_union instead of intarray_push_array; its speed
> > > is practically the same in my case).
> > > This aggregate merges a list of integer[] arrays into one big array
> > > with all the elements.
> > >
> > > Then I want to use this aggregate:
> > >
> > > SELECT intarray_aggregate_push(arrayfield)
> > > FROM arraytable
> > >
> > > The table arraytable contains a lot of rows (about 5000), each with
> > > an array of 5-10 elements, so the resulting array should contain
> > > about 50000 elements.
> > >
> > > The query works, but its speed is quite bad: about 1 second.
> > >
> > > The main problem is the speed of the intarray_aggregate_push
> > > function: it is quite slow, because intarray_push_array reallocates
> > > the memory each time I merge two arrays.
> > > I am pretty sure that the reallocation and copying is the
> > > bottleneck, because if I use another, dummy aggregate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_dummy (_int4)
> > > (
> > >   STYPE = _int4,
> > >   SFUNC = dummy,
> > >   INITCOND = '{}'
> > > );
> > >
> > > CREATE OR REPLACE FUNCTION "public"."dummy" (a integer [], b integer [])
> > > RETURNS integer [] AS
> > > $body$ BEGIN RETURN a; END; $body$
> > > LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;
> > >
> > > where dummy() is a function which returns its first argument without
> > > any modification, the speed improves dramatically: about 25 ms
> > > (instead of 1000 ms!).
> > >
> > > The question is: how can I optimize this, and is it possible at all
> > > in Postgres? I just want to get one large array glued together from
> > > a lot of smaller arrays...
> >
> > 1. No wonder copying is the bottleneck - this is mostly what the
> > aggregate does.
> >
> > 2. You can use plain array_cat for this; in my test it is a few
> > percent faster.
> >
> > 3. In this case I guess the intarray contrib is not an option; AFAIK
> > it was created only for speeding up searches, that is, int4[] lookups.
> >
> > 4. To have the kind of optimization you talk about, we would need an
> > aggregate operating (in this case, appending) directly on internal
> > state. I'm not sure if this is possible in Postgres.
> >
> > 5. My results:
> > your method (using intarray_push_array): 940 ms
> > using array_cat: 860 ms
> > same in PL/pgSQL (LOOP, append): 800 ms
> > same thing in Perl, no database (push array of arrays into one and
> > print): 18 ms
> >
> > cheers, Filip
> >
> > --
> > Filip Rembiałkowski
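The sketch, as promised. Filip's point 4 (an aggregate appending directly
to internal state) looks doable in C: declare the aggregate with STYPE =
internal, keep a plain doubling buffer in the aggregate's memory context,
and build the int4[] result only once, in the final function. This is the
same technique the later built-in array_agg (8.4) uses. A minimal,
untested sketch - all names here (IntArrayBuffer, intarray_aggregate_push,
intarray_aggregate_final) are hypothetical, it assumes the 8.x executor
API quoted above (direct access to AggState->aggcontext), and it assumes
the input arrays contain no NULL elements:

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "nodes/execnodes.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

/* Hypothetical accumulator: a bare int32 buffer that doubles on demand. */
typedef struct IntArrayBuffer
{
    int         capacity;       /* allocated element slots */
    int         nused;          /* slots actually filled */
    int32      *values;         /* element storage, lives in aggcontext */
} IntArrayBuffer;

PG_FUNCTION_INFO_V1(intarray_aggregate_push);

Datum
intarray_aggregate_push(PG_FUNCTION_ARGS)
{
    IntArrayBuffer *buf;
    ArrayType  *input;
    int         nitems;
    MemoryContext aggcontext;

    /* Refuse to run outside nodeAgg: the state is not a real SQL value. */
    if (!(fcinfo->context && IsA(fcinfo->context, AggState)))
        elog(ERROR, "intarray_aggregate_push called in non-aggregate context");

    aggcontext = ((AggState *) fcinfo->context)->aggcontext;

    if (PG_ARGISNULL(0))
    {
        /* First row: create the state in the long-lived aggregate context. */
        MemoryContext oldcontext = MemoryContextSwitchTo(aggcontext);

        buf = (IntArrayBuffer *) palloc(sizeof(IntArrayBuffer));
        buf->capacity = 1024;   /* initial headroom; tune as needed */
        buf->nused = 0;
        buf->values = (int32 *) palloc(buf->capacity * sizeof(int32));
        MemoryContextSwitchTo(oldcontext);
    }
    else
        buf = (IntArrayBuffer *) PG_GETARG_POINTER(0);

    if (PG_ARGISNULL(1))
        PG_RETURN_POINTER(buf); /* ignore NULL input arrays */

    input = PG_GETARG_ARRAYTYPE_P(1);
    nitems = ArrayGetNItems(ARR_NDIM(input), ARR_DIMS(input));

    if (buf->nused + nitems > buf->capacity)
    {
        /* Double until it fits: O(log N) reallocations over the whole run. */
        while (buf->nused + nitems > buf->capacity)
            buf->capacity *= 2;
        /* repalloc keeps the chunk in its original (aggregate) context. */
        buf->values = (int32 *) repalloc(buf->values,
                                         buf->capacity * sizeof(int32));
    }

    /* Bulk-append; assumes the input array has no NULL elements. */
    memcpy(buf->values + buf->nused, ARR_DATA_PTR(input),
           nitems * sizeof(int32));
    buf->nused += nitems;

    PG_RETURN_POINTER(buf);
}

PG_FUNCTION_INFO_V1(intarray_aggregate_final);

Datum
intarray_aggregate_final(PG_FUNCTION_ARGS)
{
    IntArrayBuffer *buf;
    Datum      *elems;
    int         i;

    if (PG_ARGISNULL(0))
        PG_RETURN_NULL();       /* no input rows at all */

    buf = (IntArrayBuffer *) PG_GETARG_POINTER(0);

    /* The one and only full copy: materialize the int4[] result. */
    elems = (Datum *) palloc(buf->nused * sizeof(Datum));
    for (i = 0; i < buf->nused; i++)
        elems[i] = Int32GetDatum(buf->values[i]);

    PG_RETURN_ARRAYTYPE_P(construct_array(elems, buf->nused, INT4OID,
                                          sizeof(int32), true, 'i'));
}

Wired up roughly like this (the transition function must not be STRICT, so
that it is called with the NULL initial state on the first row; creating
functions over type internal requires superuser):

CREATE FUNCTION intarray_aggregate_push(internal, _int4) RETURNS internal
    AS 'MODULE_PATHNAME' LANGUAGE C CALLED ON NULL INPUT;
CREATE FUNCTION intarray_aggregate_final(internal) RETURNS _int4
    AS 'MODULE_PATHNAME' LANGUAGE C STRICT;

CREATE AGGREGATE intarray_aggregate_push (_int4)
(
    STYPE = internal,
    SFUNC = intarray_aggregate_push,
    FINALFUNC = intarray_aggregate_final
);

Two notes. First, internal is a pass-by-value type, so
advance_transition_function() neither copies nor pfrees the state; the
same pointer is simply handed back on every row. Second, the tempting
shortcut of repalloc'ing the _int4 transition value itself does not look
safe: if repalloc moves the chunk, the returned pointer no longer matches
the old transition value, and (judging from advance_transition_function()
in the 8.x sources) nodeAgg will then pfree the old pointer, which
repalloc has already freed. The scribble-in-place trick is only safe for
fixed-size states like int8inc()'s.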