On Thu, 2018-06-21 at 13:44 -0700, David Gershuni wrote:
> To handle hash collisions, we can do the following:
>
> 1) We track the current hash code we’re processing, in ascending
> order.
>
> 2) Instead of combining one group at a time, we’ll maintain a list of
> all groups we’ve seen that ma
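A minimal standalone sketch of the scheme above (hypothetical names, not
PostgreSQL executor code): spilled records arrive sorted by hash code, and
because equal hash codes do not imply equal grouping keys, all distinct
groups sharing the current hash code are held until the code advances:

    #include <stdio.h>
    #include <string.h>

    typedef struct { unsigned hash; char key[16]; long agg; } Rec;

    static void emit_group(const Rec *g) { printf("%s: %ld\n", g->key, g->agg); }

    static void combine_sorted(const Rec *in, int n)
    {
        Rec cur[8];             /* groups sharing the current hash code;
                                   a real implementation would grow this */
        int ncur = 0;
        unsigned curhash = 0;

        for (int i = 0; i < n; i++)
        {
            if (ncur > 0 && in[i].hash != curhash)
            {
                for (int j = 0; j < ncur; j++)
                    emit_group(&cur[j]);        /* hash advanced: flush */
                ncur = 0;
            }
            curhash = in[i].hash;

            int j;
            for (j = 0; j < ncur; j++)
                if (strcmp(cur[j].key, in[i].key) == 0)
                {
                    cur[j].agg += in[i].agg;    /* combine partial states */
                    break;
                }
            if (j == ncur)
                cur[ncur++] = in[i];            /* new group, same hash */
        }
        for (int j = 0; j < ncur; j++)
            emit_group(&cur[j]);
    }

    int main(void)
    {
        Rec in[] = {
            {11, "apple", 1}, {11, "apple", 2},
            {11, "pear",  5},                   /* collision with apple */
            {42, "plum",  3}, {42, "plum",  4},
        };
        combine_sorted(in, 5);
        return 0;
    }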
On Jun 21, 2018, at 1:04 PM, Jeff Davis wrote:
> On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
>> This approach seems functionally correct, but I don't like the idea of
>> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
>> (in the worst case).
>
> It's the same amount of I/O as the idea you suggested as putting the hash
On Thu, 2018-06-21 at 11:04 -0700, David Gershuni wrote:
> This approach seems functionally correct, but I don't like the idea of
> transforming O(N) tuples of disk I/O into O(S*N) tuples of disk I/O
> (in the worst case).
It's the same amount of I/O as the idea you suggested as putting the hash
> On Jun 19, 2018, at 10:36 PM, Jeff Davis wrote:
>
> But I am worried that I am missing something, because it appears that
> for AGG_MIXED, we wait until the end of the last phase to emit the
> contents of the hash tables. Wouldn't they be complete after the first
> phase?
You're right. They'
On Fri, 2018-06-15 at 12:30 -0700, David Gershuni wrote:
> For example, imagine a tuple that belongs to a group G in grouping
> set A had to be
> spilled because it also belonged to an evicted group from grouping
> set B. Then group
> G would remain in set A’s hash table at the end of the phase, bu
> On Jun 13, 2018, at 12:53 PM, Jeff Davis wrote:
>
>>
>> An adaptive hash agg node would start as a hash agg, and turn into a
>> "partial hash agg + sort + final group agg" when OOM is detected.
>>
>> The benefit over ordinary sort+group agg is that the sort is happening
>> on a potential
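A toy standalone illustration of that "partial hash agg + sort + final
group agg" shape (hypothetical names; a fixed array plus qsort stands in
for a work_mem-bounded hash table and for tuplesort). The point is that
the sort runs over partially aggregated states rather than raw tuples:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int key; long sum; } State;

    static State table[4];      /* stand-in for a bounded hash table */
    static int   nslots = 0;
    static State spill[64];     /* stand-in for a spill file / tuplesort */
    static int   nspill = 0;

    static void accumulate(int key, long val)
    {
        for (int i = 0; i < nslots; i++)
            if (table[i].key == key) { table[i].sum += val; return; }
        if (nslots < 4) { table[nslots++] = (State){key, val}; return; }
        /* "OOM": emit a partial state instead of growing the table */
        spill[nspill++] = (State){key, val};
    }

    static int bykey(const void *a, const void *b)
    { return ((const State *)a)->key - ((const State *)b)->key; }

    int main(void)
    {
        int keys[] = {1, 7, 3, 7, 9, 2, 1, 9, 5, 3};
        for (int i = 0; i < 10; i++) accumulate(keys[i], 1);

        /* flush the table's partial states into the sort as well */
        for (int i = 0; i < nslots; i++) spill[nspill++] = table[i];

        /* sort, then a final grouping pass combines partial states */
        qsort(spill, nspill, sizeof(State), bykey);
        for (int i = 0; i < nspill; )
        {
            long total = 0; int k = spill[i].key;
            while (i < nspill && spill[i].key == k) total += spill[i++].sum;
            printf("key %d -> %ld\n", k, total);
        }
        return 0;
    }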
On Wed, 2018-06-13 at 11:50 -0300, Claudio Freire wrote:
> In essence, the technique I've been using always uses a fixed-size
> hash table to do partial grouping. The table is never grown, when
> collisions do happen, the existing entry in the hash table is flushed
> to disk and the aggregate state
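A small standalone sketch of that fixed-table strategy (assumed names;
flush() stands in for writing to a spill file). The table never grows; on
a slot collision the resident partial state is flushed and the slot is
reused, so a downstream pass must still re-combine the flushed partials:

    #include <stdio.h>

    typedef struct { int key; long sum; int used; } Slot;
    static Slot slots[8];                 /* fixed size, never grown */

    static void flush(Slot *s)
    { printf("partial: key %d sum %ld\n", s->key, s->sum); }

    static void add(int key, long val)
    {
        Slot *s = &slots[(unsigned) key % 8];
        if (s->used && s->key != key)     /* collision: evict resident state */
        {
            flush(s);
            s->used = 0;
        }
        if (!s->used) { s->key = key; s->sum = 0; s->used = 1; }
        s->sum += val;
    }

    int main(void)
    {
        for (int k = 0; k < 20; k++) add(k % 11, 1);
        for (int i = 0; i < 8; i++)       /* drain what remains resident */
            if (slots[i].used) flush(&slots[i]);
        return 0;
    }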
On Wed, 2018-06-13 at 10:08 -0400, Robert Haas wrote:
> I wasn't using the term "average" in a mathematical sense. I suppose
> I really meant "typical". I agree that thinking about cases where the
> group size is nonuniform is a good idea, but I don't think I agree
> that all uniform distributi
On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra wrote:
>
> On 06/05/2018 07:46 AM, Jeff Davis wrote:
> > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> >> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
On Mon, Jun 11, 2018 at 1:50 PM, Jeff Davis wrote:
> I think average group size is the wrong way to look at it, because
> averages are only useful for a normal distribution. The two most
> interesting cases are:
>
> 1. Fairly uniform group size (e.g. normal dist)
> 2. Skew values, power law distri
On 06/11/2018 08:13 PM, Jeff Davis wrote:
> On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
>> For example we hit the work_mem limit after processing 10% tuples,
>> switching to sort would mean spill+sort of 900GB of data. Or we
>> might say - hmm, we're 10% through, so we expect hitting the limit
>> 10x, so let's spill the hash tabl
On Mon, 2018-06-11 at 19:33 +0200, Tomas Vondra wrote:
> For example we hit the work_mem limit after processing 10% tuples,
> switching to sort would mean spill+sort of 900GB of data. Or we might
> say - hmm, we're 10% through, so we expect hitting the limit 10x, so
> let's spill the hash tabl
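Back-of-envelope arithmetic for that example (the constants are assumed,
matching the 1TB / 10% scenario above): spilling the filled hash table at
each expected fill writes transition states rather than raw tuples, which
is the quantity to compare against spill+sort of the remaining input:

    #include <stdio.h>

    int main(void)
    {
        double input_gb    = 1024.0;          /* total input                */
        double seen_frac   = 0.10;            /* read when limit first hit  */
        double work_mem_gb = 1.0;             /* size of one filled table   */
        double fills       = 1.0 / seen_frac; /* expected number of fills   */

        double spill_states_gb = fills * work_mem_gb;        /* ~10 GB  */
        double sort_rest_gb    = input_gb * (1 - seen_frac); /* ~922 GB */

        printf("spill hash table %g times: ~%.0f GB of states written\n",
               fills, spill_states_gb);
        printf("switch to sort: ~%.0f GB of tuples spilled to sort\n",
               sort_rest_gb);
        return 0;
    }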
Jeff Davis writes:
> Also, I am not sure we should really be designing around data types
> where it makes sense to group and then don't supply a btree opclass.
> Seems like they are likely to hit a problem soon anyway.
It's not that unreasonable to have a hash opclass and no btree opclass;
the da
On Mon, 2018-06-11 at 11:55 -0400, Robert Haas wrote:
> performance degradation is not necessarily much better than OOM. I
> suspect that part of the reason why both Andres and David proposed
> algorithms that combined hashing and sorting is out of a desire to
> cater somewhat to both few-tuples-p
On 06/11/2018 07:14 PM, Andres Freund wrote:
Hi,
On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:
It would be great to get something that performs better than just falling
back to sort (and I was advocating for that), but I'm worried we might be
moving the goalposts way too far.
I'm unclear on why that'd have that bad performance in
Hi,
On 2018-06-11 17:29:52 +0200, Tomas Vondra wrote:
> It would be great to get something that performs better than just falling
> back to sort (and I was advocating for that), but I'm worried we might be
> moving the goalposts way too far.
I'm unclear on why that'd have that bad performance in
On Mon, Jun 11, 2018 at 11:29 AM, Tomas Vondra wrote:
> I think the underlying question is whether we want to treat this as an
> emergency safety against OOM (which should be a rare thing in most cases) or
> something allowing us to pick hash aggregate more often.
>
> It would be great to get something that performs better than just
> falling back to sort (and I was advocating for that), but I'm worried
> we might be moving the goalposts way too far.
On 06/11/2018 04:25 PM, Robert Haas wrote:
> ...
Maybe that's not exactly what Tomas (or you) meant by eviction
strategy, though. Not sure. But it does seem to me that we need to
pick the algorithm we're going to use, more or less, in order to
decide what infrastructure we need, and at leas
On 06/11/2018 05:16 PM, Robert Haas wrote:
On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra wrote:
... and this is pretty much what Jeff Davis suggested, I think. The
trouble is, each of those cases behaves nicely/terribly in different
corner cases.
That's a really good point. If the number of groups is pretty small
compared to the num
On Wed, Jun 6, 2018 at 8:16 PM, Tomas Vondra wrote:
> ... and this is pretty much what Jeff Davis suggested, I think. The
> trouble is, each of those cases behaves nicely/terribly in different
> corner cases.
That's a really good point. If the number of groups is pretty small
compared to the num
On Tue, Jun 5, 2018 at 1:46 AM, Jeff Davis wrote:
> On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
>> I expect the eviction strategy to be the primary design challenge of
>> this patch. The other bits will be mostly determined by this one
>> piece.
>
> Not sure I agree that this is the primary challenge.
As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates
at Salesforce. We were hitting OOMs often enough that this became a high
priority for us. However, our current spill implementation is based on dynahash
from 9.6, and we’re not happy with its performance (it was primar
On 06/07/2018 02:18 AM, Andres Freund wrote:
On 2018-06-06 17:17:52 -0700, Andres Freund wrote:
On 2018-06-07 12:11:37 +1200, David Rowley wrote:
On 7 June 2018 at 08:11, Tomas Vondra wrote:
On 06/06/2018 04:11 PM, Andres Freund wrote:
Consider e.g. a scheme where we'd switch from hashed aggregation to
On 6 June 2018 at 01:17, David Rowley wrote:
> On 6 June 2018 at 01:09, Andres Freund wrote:
>> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>>> My concern is that only accounting memory for the group and not the
>>> state is only solving half the problem. It might be fine for
>>> aggregates that don't stray far from their aggtransspace, but
On 2018-06-06 17:17:52 -0700, Andres Freund wrote:
> On 2018-06-07 12:11:37 +1200, David Rowley wrote:
> > On 7 June 2018 at 08:11, Tomas Vondra wrote:
> > > On 06/06/2018 04:11 PM, Andres Freund wrote:
> > >> Consider e.g. a scheme where we'd switch from hashed aggregation to
> > >> sorted aggregation due to memory limits, but already have a number of
On 2018-06-07 12:11:37 +1200, David Rowley wrote:
> On 7 June 2018 at 08:11, Tomas Vondra wrote:
> > On 06/06/2018 04:11 PM, Andres Freund wrote:
> >> Consider e.g. a scheme where we'd switch from hashed aggregation to
> >> sorted aggregation due to memory limits, but already have a number of
> >>
On 06/07/2018 02:11 AM, David Rowley wrote:
> On 7 June 2018 at 08:11, Tomas Vondra wrote:
>> On 06/06/2018 04:11 PM, Andres Freund wrote:
>>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>>> sorted aggregation due to memory limits, but already have a number of
>>> transition values in the hash table. Whenever the size of t
On 7 June 2018 at 08:11, Tomas Vondra wrote:
> On 06/06/2018 04:11 PM, Andres Freund wrote:
>> Consider e.g. a scheme where we'd switch from hashed aggregation to
>> sorted aggregation due to memory limits, but already have a number of
>> transition values in the hash table. Whenever the size of t
On 06/06/2018 04:11 PM, Andres Freund wrote:
> On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote:
>> On 06/06/2018 04:01 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
The other issue is that serialize/deserialize is only a part of a problem -
you also need to know how to do "combine", and not all aggregates can do
that ... (certainly not in universal way).
On 2018-06-06 16:06:18 +0200, Tomas Vondra wrote:
> On 06/06/2018 04:01 PM, Andres Freund wrote:
> > Hi,
> >
> > On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
> > > The other issue is that serialize/deserialize is only a part of a
> > > problem - you also need to know how to do "combine", and not all
> > > aggregates can do that ... (certainly not in universal way).
On 06/06/2018 04:01 PM, Andres Freund wrote:
Hi,
On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
The other issue is that serialize/deserialize is only a part of a problem -
you also need to know how to do "combine", and not all aggregates can do
that ... (certainly not in universal way).
There are several schemes where only serializ
Hi,
On 2018-06-06 15:58:16 +0200, Tomas Vondra wrote:
> The other issue is that serialize/deserialize is only a part of a problem -
> you also need to know how to do "combine", and not all aggregates can do
> that ... (certainly not in universal way).
There are several schemes where only serializ
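A standalone toy showing the distinction at play here, with an avg()-style
sum/count state: serialize/deserialize round-trip one state across a
spill, while combine merges two partial states. Schemes that park a
group's state on disk and reload it before that group sees more input
need only the former:

    #include <stdio.h>
    #include <string.h>

    typedef struct { long sum; long count; } AvgState;

    /* serialize/deserialize: enough to evict one state and resume later */
    static size_t avg_serialize(const AvgState *s, char *buf)
    { memcpy(buf, s, sizeof(*s)); return sizeof(*s); }
    static void avg_deserialize(AvgState *s, const char *buf)
    { memcpy(s, buf, sizeof(*s)); }

    /* combine: needed only if two partial states for the same group can
     * exist at once, e.g. the same group aggregated in separate batches */
    static void avg_combine(AvgState *into, const AvgState *from)
    { into->sum += from->sum; into->count += from->count; }

    int main(void)
    {
        AvgState s = {10, 2};
        char buf[sizeof(AvgState)];

        avg_serialize(&s, buf);      /* evict to "disk"                   */
        AvgState back;
        avg_deserialize(&back, buf); /* reload before more input arrives  */
        back.sum += 5; back.count += 1;

        AvgState other = {7, 3};     /* a second partial state...         */
        avg_combine(&back, &other);  /* ...requires combine to merge      */
        printf("avg = %g\n", (double) back.sum / back.count);
        return 0;
    }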
On 06/05/2018 07:39 PM, David Fetter wrote:
On Tue, Jun 05, 2018 at 01:27:01PM -0400, Tom Lane wrote:
David Fetter writes:
On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
True. Although not all built in aggregates have those defined.
Just out of curiosity, which ones don't
Hi,
On 2018-06-05 10:47:49 -0700, Jeff Davis wrote:
> The thing I don't like about it is that it requires running two memory-
> hungry operations at once. How much of work_mem do we use for sorted
> runs, and how much do we use for the hash table?
Is that necessarily true? I'd assume that we'd us
On Tue, 2018-06-05 at 05:42 -0700, Andres Freund wrote:
> > That's an interesting idea, but it seems simpler to stick to hashing
> > rather than using a combination strategy. It also seems like it would
> > take less CPU effort.
> Isn't the locality of access going to be considerably better with
On Tue, Jun 05, 2018 at 01:27:01PM -0400, Tom Lane wrote:
> David Fetter writes:
> > On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> >> True. Although not all built in aggregates have those defined.
>
> > Just out of curiosity, which ones don't? As of
> > 3f85c62d9e825eedd1315d249
On Tue, 2018-06-05 at 05:57 -0700, Andres Freund wrote:
> But I think my proposal to continue to use a hashtable for the already
> known groups, and sorting for additional groups would largely address
> that, right? We couldn't deal with groups becoming too large,
> but easily with the numb
On Tue, 2018-06-05 at 08:39 -0700, se...@rielau.com wrote:
> Strange. We don't see this overhead and measure a lot more than just
> a single metric:
Let's investigate again. I wasn't able to repro the overhead on x86;
Robert saw it on a POWER machine, and it was somewhat noisy. I don't
think we we
David Fetter writes:
> On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
>> True. Although not all built in aggregates have those defined.
> Just out of curiosity, which ones don't? As of
> 3f85c62d9e825eedd1315d249ef1ad793ca78ed4, pg_aggregate has both of
> those as NOT NULL.
NOT NU
On 2018-06-05 19:04:11 +0200, David Fetter wrote:
> On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> > On 5 June 2018 at 06:52, Andres Freund wrote:
> > > That part has gotten a bit easier since, because we have serialize
> > > / deserialize operations for aggregates these days.
> >
On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund wrote:
> > That part has gotten a bit easier since, because we have serialize
> > / deserialize operations for aggregates these days.
>
> True. Although not all built in aggregates have those de
On Tue, Jun 05, 2018 at 02:56:23PM +1200, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund wrote:
> > That part has gotten a bit easier since, because we have serialize /
> > deserialize operations for aggregates these days.
>
> True. Although not all built in aggregates have those de
Strange. We don't see this overhead and measure a lot more than just a single
metric:
Of the grab below only the context->stats += size is needed.
Could easily be a macro.
We also track overall backend size to cap it, and track memory contexts in
shared memory (plan cache, function cache, messag
On 06/05/2018 03:41 PM, se...@rielau.com wrote:
In our code base we have added a WithStats-Flavor for creating memory
contexts.
This API accepts a pointer to a metric for accounting, and it is inherited
by all subcontexts unless overridden.
So we only needed to change context creation API where we
In our code base we have added a WithStats-Flavor for creating memory contexts.
This API accepts a pointer to a metric for accounting, and it is inherited by
all subcontexts unless overridden.
So we only needed to change context creation API where we wanted (such as
TopTransactionContext, Message Con
On 06/05/2018 03:17 PM, David Rowley wrote:
On 6 June 2018 at 01:09, Andres Freund wrote:
On 2018-06-06 01:06:39 +1200, David Rowley wrote:
My concern is that only accounting memory for the group and not the
state is only solving half the problem. It might be fine for
aggregates that don't str
On 6 June 2018 at 01:09, Andres Freund wrote:
> On 2018-06-06 01:06:39 +1200, David Rowley wrote:
>> My concern is that only accounting memory for the group and not the
>> state is only solving half the problem. It might be fine for
>> aggregates that don't stray far from their aggtransspace, but
On 06/05/2018 02:49 PM, Andres Freund wrote:
Hi,
On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:
My concern is more about what happens when the input tuple ordering is
inherently incompatible with the eviction strategy, greatly increasing the
amount of data written to disk during evictions.
Hi,
On 2018-06-06 01:06:39 +1200, David Rowley wrote:
> On 6 June 2018 at 00:57, Andres Freund wrote:
> > I think it's ok to only handle this gracefully if serialization is
> > supported.
> >
> > But I think my proposal to continue to use a hashtable for the already
> > known groups, and sorting for
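A toy standalone version of the hybrid quoted above, truncated here
(hypothetical names; an array plus qsort stands in for tuplesort): once
the table is full it keeps advancing the groups it already holds, while
tuples of unseen groups go to the sort. As the message notes, this
handles growth in the number of groups, not an individual group's state
becoming too large:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int key; long sum; } Grp;

    static Grp table[4];  static int ntab   = 0;  /* "work_mem" cap = 4  */
    static int extra[64]; static int nextra = 0;  /* tuplesort stand-in  */

    static void feed(int key)
    {
        for (int i = 0; i < ntab; i++)
            if (table[i].key == key) { table[i].sum++; return; } /* known */
        if (ntab < 4) { table[ntab++] = (Grp){key, 1}; return; } /* fits  */
        extra[nextra++] = key;           /* unseen group: send to sort   */
    }

    static int cmpint(const void *a, const void *b)
    { return *(const int *)a - *(const int *)b; }

    int main(void)
    {
        int in[] = {1, 2, 3, 4, 5, 1, 6, 2, 5, 6, 6};
        for (int i = 0; i < 11; i++) feed(in[i]);

        for (int i = 0; i < ntab; i++)
            printf("hashed  key %d -> %ld\n", table[i].key, table[i].sum);

        qsort(extra, nextra, sizeof(int), cmpint);
        for (int i = 0; i < nextra; )
        {
            long n = 0; int k = extra[i];
            while (i < nextra && extra[i] == k) { n++; i++; }
            printf("sorted  key %d -> %ld\n", k, n);
        }
        return 0;
    }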
On 6 June 2018 at 00:57, Andres Freund wrote:
> On 2018-06-06 00:53:42 +1200, David Rowley wrote:
>> On 6 June 2018 at 00:45, Andres Freund wrote:
>> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> >> I wonder if an aggregate might use a custom context
>> >> internally (I don't recall anyt
On 2018-06-06 00:53:42 +1200, David Rowley wrote:
> On 6 June 2018 at 00:45, Andres Freund wrote:
> > On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
> >> I wonder if an aggregate might use a custom context
> >> internally (I don't recall anything like that). The accounting capability
> >> seems
On 6 June 2018 at 00:45, Andres Freund wrote:
> On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
>> I wonder if an aggregate might use a custom context
>> internally (I don't recall anything like that). The accounting capability
>> seems potentially useful for other places, and those might not us
Hi,
On 2018-06-05 10:05:35 +0200, Tomas Vondra wrote:
> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing the
> amount of data written to disk during evictions.
>
> Say for example that we can fit 1000 g
Hi,
On 2018-06-05 09:35:13 +0200, Tomas Vondra wrote:
> But I don't think we've considered copying the whole AllocSet. Maybe that
> would work, not sure.
Don't think you'd need to fully copy it - you can just have a "wrapper"
memory context implementation that does the accounting and then hands
t
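A standalone sketch of that wrapper idea (assumed names; malloc stands in
for the wrapped context): rather than copying the whole AllocSet, a thin
context does the accounting and then hands the request to the underlying
allocator:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        size_t accounted;              /* running total for this subtree */
        void *(*delegate)(size_t);     /* the wrapped allocator          */
    } WrapCtx;

    static void *wrap_alloc(WrapCtx *w, size_t size)
    {
        w->accounted += size;          /* do the accounting...           */
        return w->delegate(size);      /* ...then hand the request down  */
    }

    int main(void)
    {
        WrapCtx w = {0, malloc};
        void *a = wrap_alloc(&w, 128);
        void *b = wrap_alloc(&w, 64);
        printf("accounted %zu bytes\n", w.accounted);  /* prints 192 */
        free(a);
        free(b);
        return 0;
    }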
Hi,
On 2018-06-04 22:18:56 -0700, Jeff Davis wrote:
> On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote:
> > I wonder whether, at least for aggregates, the better fix wouldn't be to
> > switch to feeding the tuples into tuplesort upon memory exhaustion and
> > doing a sort based aggre
On 06/05/2018 07:46 AM, Jeff Davis wrote:
On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
I expect the eviction strategy to be the primary design challenge of
this patch. The other bits will be mostly determined by this one
piece.
Not sure I agree that this is the primary challenge.
The cases that benefit from eviction are p
On 5 June 2018 at 17:04, Tomas Vondra wrote:
> On 06/05/2018 04:56 AM, David Rowley wrote:
>> Isn't there still a problem determining when the memory exhaustion
>> actually happens though? As far as I know, we've still little
>> knowledge how much memory each aggregate state occupies.
>>
>> Jeff
On 06/05/2018 09:22 AM, David Rowley wrote:
On 5 June 2018 at 17:04, Tomas Vondra wrote:
On 06/05/2018 04:56 AM, David Rowley wrote:
Isn't there still a problem determining when the memory exhaustion
actually happens though? As far as I know, we've still little
knowledge how much memory ea
On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> I expect the eviction strategy to be the primary design challenge of
> this patch. The other bits will be mostly determined by this one
> piece.
Not sure I agree that this is the primary challenge.
The cases that benefit from eviction are p
On Mon, 2018-06-04 at 11:52 -0700, Andres Freund wrote:
> I wonder whether, at least for aggregates, the better fix wouldn't be to
> switch to feeding the tuples into tuplesort upon memory exhaustion and
> doing a sort based aggregate. We have most of the infrastructure to do
That's an interesting idea, but it seems simpler to stick to hashing
rather than using a combination strategy. It also seems like it would
take less CPU effort.
On 06/05/2018 04:56 AM, David Rowley wrote:
> On 5 June 2018 at 06:52, Andres Freund wrote:
>> That part has gotten a bit easier since, because we have serialize /
>> deserialize operations for aggregates these days.
>
> True. Although not all built in aggregates have those defined.
Not sure wha
On 5 June 2018 at 06:52, Andres Freund wrote:
> That part has gotten a bit easier since, because we have serialize /
> deserialize operations for aggregates these days.
True. Although not all built in aggregates have those defined.
> I wonder whether, at least for aggregates, the better fix woul
Hi,
On 2018-06-04 10:32:47 +0200, Heikki Linnakangas wrote:
> Hash Aggs and SetOps are currently not spilled to disk. If the planner's
> estimate on the number of entries is badly off, you might run out of memory
> at execution time, if all the entries don't fit in memory.
>
> For HashAggs, this was discussed in depth a couple of years ago at [1].
Hi,
Hash Aggs and SetOps are currently not spilled to disk. If the planner's
estimate on the number of entries is badly off, you might run out of
memory at execution time, if all the entries don't fit in memory.
For HashAggs, this was discussed in depth a couple of years ago at [1].
SetOps h