Wow, it seems I've found it! From the comments in the PostgreSQL executor
source (nodeAgg.c):

 * Beginning in PostgreSQL 8.1, the executor's AggState node is passed as
 * the fmgr "context" value in all transfunc and finalfunc calls.  It is
 * not really intended that the transition functions will look into the
 * AggState node, but they can use code like
 *        if (fcinfo->context && IsA(fcinfo->context, AggState))
 * to verify that they are being called by nodeAgg.c and not as ordinary
 * SQL functions.  The main reason a transition function might want to
 * know that is that it can avoid palloc'ing a fixed-size pass-by-ref
 * transition value on every call: it can instead just scribble on and
 * return its left input.  Ordinarily it is completely forbidden for
 * functions to modify pass-by-ref inputs, but in the aggregate case we
 * know the left input is either the initial transition value or a
 * previous function result, and in either case its value need not be
 * preserved.  See int8inc() for an example.  Notice that
 * advance_transition_function() is coded to avoid a data copy step when
 * the previous transition value pointer is returned.
So, in theory, I could create an intarray_aggregate_push() function which,
when called as an aggregate transition function, does not reallocate and
copy memory on every call. Instead, it could allocate, say, 1 MB up front
(with headroom), or double the buffer whenever it runs out of space
(O(log N) reallocations instead of O(N)); see the sketch after the quoted
thread below.

And here is an example from the source code:

Datum
int8inc(PG_FUNCTION_ARGS)
{
    if (fcinfo->context && IsA(fcinfo->context, AggState))
    {
        /*
         * Special case to avoid palloc overhead for COUNT(): when called
         * from nodeAgg, we know that the argument is modifiable local
         * storage, so just update it in-place.
         *
         * Note: this assumes int8 is a pass-by-ref type; if we ever
         * support pass-by-val int8, this should be ifdef'd out when
         * int8 is pass-by-val.
         */
        int64      *arg = (int64 *) PG_GETARG_POINTER(0);
        int64       result;

        result = *arg + 1;

        /* Overflow check */
        if (result < 0 && *arg > 0)
            ereport(ERROR,
                    (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                     errmsg("bigint out of range")));

        *arg = result;
        PG_RETURN_POINTER(arg);
    }
    ...
}

On 10/10/07, Dmitry Koterov <[EMAIL PROTECTED]> wrote:
>
> Thanks for your comment.
>
> I see two possible solution directions:
>
> 1. Is it possible to create a C function which could accept something
> like ROWSET(ARRAY[]) in its input parameters?
> E.g. to call it as
>
> SELECT array_rowset_glue((SELECT arrayfield FROM arraytable));
>
> or something like this?
>
> 2. Is it possible to implement in C something like this?
>
> array_buffer_init();
> SELECT array_buffer_push(arrayfield) FROM arraytable;
> ids := array_buffer_get();
> array_buffer_free();
>
> where array_buffer_push() is an aggregate function which returns void
> but, as a side effect, appends arrayfield to a global array buffer for
> later access with array_buffer_get().
>
> On 10/10/07, Filip Rembiałkowski <[EMAIL PROTECTED]> wrote:
> >
> > 2007/10/10, Dmitry Koterov <[EMAIL PROTECTED]>:
> > > Hello.
> > >
> > > I created an aggregate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_push (_int4)
> > > (
> > >   STYPE = _int4,
> > >   SFUNC = intarray_push_array,
> > >   INITCOND = '{}'
> > > );
> > >
> > > (or I may use _int_union instead of intarray_push_array; its speed
> > > is practically the same in my case).
> > > This aggregate merges a list of integer[] arrays into one big array
> > > with all the elements.
> > >
> > > Then I want to use this aggregate:
> > >
> > > SELECT intarray_aggregate_push(arrayfield)
> > > FROM arraytable
> > >
> > > The table arraytable contains a lot of rows (about 5000), each with
> > > an array of 5-10 elements, so the resulting array should contain
> > > about 50000 elements.
> > >
> > > The query works, but its speed is quite bad: about 1 second.
> > >
> > > The main problem is the speed of the intarray_aggregate_push
> > > function: it is quite slow, because intarray_push_array reallocates
> > > the memory each time I merge two arrays.
> > > I am pretty sure that the reallocation and copying is the
> > > bottleneck, because if I use another, dummy aggregate:
> > >
> > > CREATE AGGREGATE intarray_aggregate_dummy (_int4)
> > > (
> > >   STYPE = _int4,
> > >   SFUNC = dummy,
> > >   INITCOND = '{}'
> > > );
> > >
> > > CREATE OR REPLACE FUNCTION "public"."dummy" (a integer [], b integer [])
> > > RETURNS integer [] AS
> > > $body$ BEGIN RETURN a; END; $body$
> > > LANGUAGE 'plpgsql' VOLATILE CALLED ON NULL INPUT SECURITY INVOKER;
> > >
> > > where dummy() is a function which returns its first argument without
> > > any modification, the speed improves dramatically: about 25 ms
> > > (instead of 1000 ms!).
> > >
> > > The question is: how can I optimize this, and is it possible at all
> > > in Postgres? I just want to get one large array glued together from
> > > a lot of smaller arrays...
> >
> > 1. No wonder copying is the bottleneck - this is mostly what the
> > aggregate does.
> >
> > 2. You can use plain array_cat for this; in my test it is a few
> > percent faster.
> >
> > 3. In this case I guess the intarray contrib is not an option; AFAIK
> > it was created only for speeding up searches, that is, int4[] lookups.
> >
> > 4. To have the kind of optimization you talk about, we would need an
> > aggregate operating (in this case, appending) directly on internal
> > state. I'm not sure if this is possible in Postgres.
> >
> > 5. My results:
> > your method (using intarray_push_array): 940 ms
> > using array_cat: 860 ms
> > same in PL/pgSQL (LOOP, append): 800 ms
> > same thing in Perl, no database (push array of arrays into one and
> > print): 18 ms
> >
> > cheers, Filip
> >
> > --
> > Filip Rembiałkowski
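The sketch, as promised. Filip's point 4 (an aggregate appending directly
to internal state) looks doable in C: declare the aggregate with STYPE =
internal, keep a plain doubling buffer in the aggregate's memory context,
and build the int4[] result only once, in the final function. This is the
same technique the later built-in array_agg (8.4) uses. A minimal,
untested sketch - all names here (IntArrayBuffer, intarray_aggregate_push,
intarray_aggregate_final) are hypothetical, it assumes the 8.x executor
API quoted above (direct access to AggState->aggcontext), and it assumes
the input arrays contain no NULL elements:

#include "postgres.h"

#include "catalog/pg_type.h"
#include "fmgr.h"
#include "nodes/execnodes.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

/* Hypothetical accumulator: a bare int32 buffer that doubles on demand. */
typedef struct IntArrayBuffer
{
    int         capacity;       /* allocated element slots */
    int         nused;          /* slots actually filled */
    int32      *values;         /* element storage, lives in aggcontext */
} IntArrayBuffer;

PG_FUNCTION_INFO_V1(intarray_aggregate_push);

Datum
intarray_aggregate_push(PG_FUNCTION_ARGS)
{
    IntArrayBuffer *buf;
    ArrayType  *input;
    int         nitems;
    MemoryContext aggcontext;

    /* Refuse to run outside nodeAgg: the state is not a real SQL value. */
    if (!(fcinfo->context && IsA(fcinfo->context, AggState)))
        elog(ERROR, "intarray_aggregate_push called in non-aggregate context");

    aggcontext = ((AggState *) fcinfo->context)->aggcontext;

    if (PG_ARGISNULL(0))
    {
        /* First row: create the state in the long-lived aggregate context. */
        MemoryContext oldcontext = MemoryContextSwitchTo(aggcontext);

        buf = (IntArrayBuffer *) palloc(sizeof(IntArrayBuffer));
        buf->capacity = 1024;   /* initial headroom; tune as needed */
        buf->nused = 0;
        buf->values = (int32 *) palloc(buf->capacity * sizeof(int32));
        MemoryContextSwitchTo(oldcontext);
    }
    else
        buf = (IntArrayBuffer *) PG_GETARG_POINTER(0);

    if (PG_ARGISNULL(1))
        PG_RETURN_POINTER(buf); /* ignore NULL input arrays */

    input = PG_GETARG_ARRAYTYPE_P(1);
    nitems = ArrayGetNItems(ARR_NDIM(input), ARR_DIMS(input));

    if (buf->nused + nitems > buf->capacity)
    {
        /* Double until it fits: O(log N) reallocations over the whole run. */
        while (buf->nused + nitems > buf->capacity)
            buf->capacity *= 2;
        /* repalloc keeps the chunk in its original (aggregate) context. */
        buf->values = (int32 *) repalloc(buf->values,
                                         buf->capacity * sizeof(int32));
    }

    /* Bulk-append; assumes the input array has no NULL elements. */
    memcpy(buf->values + buf->nused, ARR_DATA_PTR(input),
           nitems * sizeof(int32));
    buf->nused += nitems;

    PG_RETURN_POINTER(buf);
}

PG_FUNCTION_INFO_V1(intarray_aggregate_final);

Datum
intarray_aggregate_final(PG_FUNCTION_ARGS)
{
    IntArrayBuffer *buf;
    Datum      *elems;
    int         i;

    if (PG_ARGISNULL(0))
        PG_RETURN_NULL();       /* no input rows at all */

    buf = (IntArrayBuffer *) PG_GETARG_POINTER(0);

    /* The one and only full copy: materialize the int4[] result. */
    elems = (Datum *) palloc(buf->nused * sizeof(Datum));
    for (i = 0; i < buf->nused; i++)
        elems[i] = Int32GetDatum(buf->values[i]);

    PG_RETURN_ARRAYTYPE_P(construct_array(elems, buf->nused, INT4OID,
                                          sizeof(int32), true, 'i'));
}

Wired up roughly like this (the transition function must not be STRICT, so
that it is called with the NULL initial state on the first row; creating
functions over type internal requires superuser):

CREATE FUNCTION intarray_aggregate_push(internal, _int4) RETURNS internal
    AS 'MODULE_PATHNAME' LANGUAGE C CALLED ON NULL INPUT;
CREATE FUNCTION intarray_aggregate_final(internal) RETURNS _int4
    AS 'MODULE_PATHNAME' LANGUAGE C STRICT;

CREATE AGGREGATE intarray_aggregate_push (_int4)
(
    STYPE = internal,
    SFUNC = intarray_aggregate_push,
    FINALFUNC = intarray_aggregate_final
);

Two notes. First, internal is a pass-by-value type, so
advance_transition_function() neither copies nor pfrees the state; the
same pointer is simply handed back on every row. Second, the tempting
shortcut of repalloc'ing the _int4 transition value itself does not look
safe: if repalloc moves the chunk, the returned pointer no longer matches
the old transition value, and (judging from advance_transition_function()
in the 8.x sources) nodeAgg will then pfree the old pointer, which
repalloc has already freed. The scribble-in-place trick is only safe for
fixed-size states like int8inc()'s.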