Re: Parallel grouping sets

2019-09-30 Thread Pengzhou Tang
Hi Richard & Tomas:

I followed the idea of the second approach: add a gset_id to the targetlist
of the first stage of grouping sets and use it to combine the aggregates in
the final stage. The gset_id is still needed because GROUPING() cannot
uniquely identify a grouping set; grouping sets may contain duplicate sets,
e.g. GROUP BY GROUPING SETS ((c1, c2), (c1, c2)).
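
To see why GROUPING() alone is not enough, here is a minimal illustration
against the gstest table used below (my own example, not part of the patch):

select c1, c2, grouping(c1, c2) as g
from gstest
group by grouping sets((c1, c2), (c1, c2));
-- g is 0 for every output row, no matter which copy of (c1, c2) produced
-- it, so the partial stage carries an extra gset_id column to let the
-- final stage combine each set's partial results separately.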

There are some differences in how the second approach is implemented
compared with Richard's original idea: gset_id is not used as an additional
group key in the final stage; instead, we use it to dispatch each input tuple
directly to the grouping set it belongs to and then perform the aggregation.
One advantage of this is that we can handle multiple rollups with better
performance and without an Append node.

The plan now looks like this:

gpadmin=# explain select c1, c2 from gstest group by grouping sets(rollup(c1, c2), rollup(c3));
                                   QUERY PLAN
---------------------------------------------------------------------------------------
 Finalize MixedAggregate  (cost=1000.00..73108.57 rows=8842 width=12)
   Dispatched by: (GROUPINGSETID())
   Hash Key: c1, c2
   Hash Key: c1
   Hash Key: c3
   Group Key: ()
   Group Key: ()
   ->  Gather  (cost=1000.00..71551.48 rows=17684 width=16)
         Workers Planned: 2
         ->  Partial MixedAggregate  (cost=0.00..68783.08 rows=8842 width=16)
               Hash Key: c1, c2
               Hash Key: c1
               Hash Key: c3
               Group Key: ()
               Group Key: ()
               ->  Parallel Seq Scan on gstest  (cost=0.00..47861.33 rows=208 width=12)
(16 rows)

gpadmin=# set enable_hashagg to off;
gpadmin=# explain select c1, c2 from gstest group by grouping sets(rollup(c1, c2), rollup(c3));
                                      QUERY PLAN
------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=657730.66..663207.45 rows=8842 width=12)
   Dispatched by: (GROUPINGSETID())
   Group Key: c1, c2
   Sort Key: c1
     Group Key: c1
     Group Key: ()
     Group Key: ()
   Sort Key: c3
     Group Key: c3
   ->  Sort  (cost=657730.66..657774.87 rows=17684 width=16)
         Sort Key: c1, c2
         ->  Gather  (cost=338722.94..656483.04 rows=17684 width=16)
               Workers Planned: 2
               ->  Partial GroupAggregate  (cost=337722.94..653714.64 rows=8842 width=16)
                     Group Key: c1, c2
                     Group Key: c1
                     Group Key: ()
                     Group Key: ()
                     Sort Key: c3
                       Group Key: c3
                     ->  Sort  (cost=337722.94..342931.28 rows=208 width=12)
                           Sort Key: c1, c2
                           ->  Parallel Seq Scan on gstest  (cost=0.00..47861.33 rows=208 width=12)

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3

On Wed, Jul 31, 2019 at 4:07 PM Richard Guo  wrote:

> On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
>
>> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo  wrote:
>> >
>> >> Hi all,
>> >>
>> >> Paul and I have been hacking recently to implement parallel grouping
>> >> sets, and here we have two implementations.
>> >>
>> >> Implementation 1
>> >> 
>> >>
>> >> Attached is the patch and also there is a github branch [1] for this
>> >> work.
>> >>
>> >
>> >Rebased with the latest master.
>> >
>>
>> Hi Richard,
>>
>> thanks for the rebased patch. I think the patch is mostly fine (at least I
>> don't see any serious issues). A couple minor comments:
>>
>
> Hi Tomas,
>
> Thank you for reviewing this patch.
>
>
>>
>> 1) I think get_number_of_groups() would deserve a short explanation why
>> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
>> same branch. Before these cases were clearly separated, now it seems a bit
>> mixed up and it may not be immediately obvious why it's OK.
>>
>
> Added a short comment in get_number_of_groups() explaining the behavior
> when doing partial aggregation for grouping sets.
>
>
>>
>> 2) There are new regression tests, but they are not added to any schedule
>> (parallel or serial), and so are not executed as part of "make check". I
>> suppose this is a mistake.
>>
>
> Yes, thanks. Added the new regression test in parallel_schedule and
> serial_schedule.
>
>
>>
>> 3) The regression tests do check plan and results like this:
>>
>> EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
>> SELECT ... ORDER BY 1, 2, 3;
>>
>> which however means that the query might easily use a different plan than
>> what's verified in the explain (thanks to the additional ORDER BY clause).
>> So I think this should explain and execute the same query.
>>
>> (In this case the plan

Extend Table AM routine to get total blocks can be analyzed

2019-11-11 Thread Pengzhou Tang
Hi Hackers,

The table AM routines already provide two callbacks to fetch sample blocks
and sample tuples; however, the total number of blocks that ANALYZE can scan
is still restricted to the number of physical blocks in a table. This doesn't
work well for storage that organizes blocks differently from the heap.

Here I propose adding a new method named scan_analyze_total_blocks() to
provide more flexibility: it can return either a physical or a logical block
count, depending on how the table AM implements scan_analyze_next_block() and
scan_analyze_next_tuple().


v1-0001-extend-tableam-scan-analyze-total-blocks.patch
Description: Binary data


Re: Parallel grouping sets

2019-11-28 Thread Pengzhou Tang
Hi Hackers,

Richard pointed out that he got incorrect results with the patch I attached;
there were bugs somewhere. I have fixed them now and attached the newest
version; please refer to [1] for the fix.

Thanks,
Pengzhou

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3

On Mon, Sep 30, 2019 at 5:41 PM Pengzhou Tang  wrote:

> Hi Richard & Tomas:
>
> I followed the idea of the second approach to add a gset_id in the
> targetlist of the first stage of
> grouping sets and uses it to combine the aggregate in final stage. gset_id
> stuff is still kept
> because of GROUPING() cannot uniquely identify a grouping set, grouping
> sets may contain
> duplicated set, eg: group by grouping sets((c1, c2), (c1,c2)).
>
> There are some differences to implement the second approach comparing to
> the original idea from
> Richard, gset_id is not used as additional group key in the final stage,
> instead, we use it to
> dispatch the input tuple to the specified grouping set directly and then
> do the aggregate.
> One advantage of this is that we can handle multiple rollups with better
> performance without APPEND node.
>
> the plan now looks like:
>
> gpadmin=# explain select c1, c2 from gstest group by grouping
> sets(rollup(c1, c2), rollup(c3));
>  QUERY PLAN
>
> 
>  Finalize MixedAggregate  (cost=1000.00..73108.57 rows=8842 width=12)
>Dispatched by: (GROUPINGSETID())
>Hash Key: c1, c2
>Hash Key: c1
>Hash Key: c3
>Group Key: ()
>  Group Key: ()
>->  Gather  (cost=1000.00..71551.48 rows=17684 width=16)
>  Workers Planned: 2
>  ->  Partial MixedAggregate  (cost=0.00..68783.08 rows=8842
> width=16)
>Hash Key: c1, c2
>Hash Key: c1
>Hash Key: c3
>Group Key: ()
>Group Key: ()
>->  Parallel Seq Scan on gstest  (cost=0.00..47861.33
> rows=208 width=12)
> (16 rows)
>
> gpadmin=# set enable_hashagg to off;
> gpadmin=# explain select c1, c2 from gstest group by grouping
> sets(rollup(c1, c2), rollup(c3));
>QUERY PLAN
>
> 
>  Finalize GroupAggregate  (cost=657730.66..663207.45 rows=8842 width=12)
>Dispatched by: (GROUPINGSETID())
>Group Key: c1, c2
>Sort Key: c1
>  Group Key: c1
>  Group Key: ()
>  Group Key: ()
>Sort Key: c3
>  Group Key: c3
>->  Sort  (cost=657730.66..657774.87 rows=17684 width=16)
>  Sort Key: c1, c2
>  ->  Gather  (cost=338722.94..656483.04 rows=17684 width=16)
>Workers Planned: 2
>->  Partial GroupAggregate  (cost=337722.94..653714.64
> rows=8842 width=16)
>  Group Key: c1, c2
>  Group Key: c1
>  Group Key: ()
>  Group Key: ()
>  Sort Key: c3
>Group Key: c3
>  ->  Sort  (cost=337722.94..342931.28 rows=208
> width=12)
>Sort Key: c1, c2
>->  Parallel Seq Scan on gstest
>  (cost=0.00..47861.33 rows=208 width=12)
>
> References:
> [1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3
>
> On Wed, Jul 31, 2019 at 4:07 PM Richard Guo  wrote:
>
>> On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>>> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo  wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> Paul and I have been hacking recently to implement parallel grouping
>>> >> sets, and here we have two implementations.
>>> >>
>>> >> Implementation 1
>>> >> 
>>> >>
>>> >> Attached is the patch and also there is a github branch [1] for this
>>> >> work.
>>> >>
>>> >
>>> >Rebased with the latest master.
>>> >
>>>
>>> Hi Richard,
>>>
>>> thanks for the rebased patch. I think the patch is mostly fine (at least
>>> I
>>>

Errors when update a view with conditional-INSTEAD rules

2019-12-03 Thread Pengzhou Tang
Hi Hackers,

I hit an error when updating a view with a conditional INSTEAD rule; the
steps to reproduce are listed below:

CREATE TABLE t1(a int, b int);
CREATE TABLE t2(a int, b int);
CREATE VIEW v1 AS SELECT * FROM t1 WHERE b > 100;
INSERT INTO v1 VALUES (1, 110);
SELECT * FROM t1;

CREATE OR REPLACE RULE r1 AS
ON UPDATE TO v1
WHERE old.a > new.b
DO INSTEAD (
    INSERT INTO t2 VALUES (old.a, old.b);
);

UPDATE v1 SET b = 2 WHERE a = 1;
ERROR:  no relation entry for relid 2


After some debugging, I found the cause: for a conditional INSTEAD rule, the
original UPDATE operation still needs to be performed on the view (when the
rule condition does not hold), but we skip rewriting the target view for any
view that has INSTEAD rules.

There should be only two cases in which we can skip rewriting the target view:
1) the view has INSTEAD OF triggers for the operation; the operation will be
handled by the trigger;
2) the view has an unconditional INSTEAD rule for the operation; the operation
will be replaced by the rule's actions (see the sketch after this list).
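
For comparison, a sketch using the tables from the reproduction above (my own
example, not part of the patch): if the rule is recreated without the WHERE
condition, the UPDATE is replaced entirely by the rule action, the target view
never needs to be rewritten, and no error occurs:

CREATE OR REPLACE RULE r1 AS
ON UPDATE TO v1
DO INSTEAD
    INSERT INTO t2 VALUES (old.a, old.b);

UPDATE v1 SET b = 2 WHERE a = 1;   -- ok: the unconditional rule replaces the UPDATE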

This seems to trace back to commit a99c42f291421572aef2. The documentation
says:
"There is a catch if you try to use conditional rules
for complex view updates: there must be an unconditional
INSTEAD rule for each action you wish to allow on the view."

Commit a99c42f291421572aef2 explicitly changed that description so the
restriction only applies to complex views, so a conditional INSTEAD rule
should work for a simple view.

I attached a patch to fix it; please take a look.

Thanks,
Pengzhou


0001-Rewrite-the-target-view-if-it-has-conditional-INSTEA.patch
Description: Binary data


[Proposal] Extend TableAM routines for ANALYZE scan

2019-12-05 Thread Pengzhou Tang
When hacking on Zedstore, we need more accurate statistics for zedstore
tables, and we ran into some restrictions:
1) acquire_sample_rows() always uses RelationGetNumberOfBlocks to generate
the sampled block numbers; this is not friendly for zedstore, which wants to
use logical block numbers, and may also not be friendly to other
non-block-oriented table AMs.
2) the columns of a zedstore table are stored separately, so the columns of a
single row have different physical positions; the tid in a tuple is therefore
not meaningful for zedstore, which makes the correlation statistic incorrect.
3) RelOptInfo->pages is not correct for zedstore if we only access some of
the columns, which makes the estimated IO cost much higher than the actual
cost.

For 1) and 2), we propose extending the existing ANALYZE-scan table AM
routines in the patch "0001-ANALYZE-tableam-API-change.patch", which adds
three more APIs: scan_analyze_beginscan(), scan_analyze_sample_tuple() and
scan_analyze_endscan(). This gives table AMs more control over every step of
sampling rows. Meanwhile, with the new structure named "AcquireSampleContext",
the AM can provide extra information (e.g. physical position, physical size)
in addition to the actual column values.

For 3), we hope to have a mechanism similar to RelOptInfo->rows, which is
calculated as (RelOptInfo->tuples * selectivity): we can calculate
RelOptInfo->pages with a page selectivity that is based on the selected
zedstore columns. 0002-Planner-can-estimate-the-pages-based-on-the-columns-.patch
shows one idea: adding a 'stadiskfrac' column to pg_statistic and having the
planner use it to estimate RelOptInfo->pages.

0003-ZedStore-use-extended-ANAlYZE-API.patch is attached only to show how
zedstore uses the previous patches to achieve the following:
1. use logical block ids to acquire the sample rows;
2. acquire sample rows only from a specified column, which is used when the
user analyzes specific columns, e.g. "analyze zs (c1)";
3. when running ANALYZE, the zedstore table AM provides extra disk-size
information, ANALYZE computes the physical-fraction statistic of each column,
and the planner uses it to estimate the IO cost based on the selected columns.

Thanks,
Pengzhou
From 8a8f6d14d1a1ddc0be35582d4a17af50ffce986a Mon Sep 17 00:00:00 2001
From: Pengzhou Tang 
Date: Wed, 20 Nov 2019 06:42:37 -0500
Subject: [PATCH 1/3] ANALYZE tableam API change

Extended three ANALYZE-related tableam APIs so AMs can take more control
of ANALYZE progress:
- scan_analyze_beginscan() : so AMs can has more flexible sampling strategy
- scan_analyze_sample_tuple() : so ANALYZE can get extra info as needed
- scan_analyze_endscan() :

Also use struct AcquireSampleContex to provide more convenience, with it
tableam analyze routines can provide extra info except the real data,
for example: physical size or compression ratio.
---
 contrib/file_fdw/file_fdw.c  |  36 +++-
 contrib/postgres_fdw/postgres_fdw.c  |  22 ++---
 src/backend/access/heap/heapam_handler.c |  98 ++---
 src/backend/access/table/tableam.c   | 109 +++
 src/backend/commands/analyze.c   | 144 +++
 src/include/access/tableam.h | 115 
 src/include/foreign/fdwapi.h |   5 +-
 7 files changed, 367 insertions(+), 162 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 549821c..e8937e8 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_foreign_table.h"
 #include "commands/copy.h"
@@ -158,9 +159,7 @@ static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 		   FileFdwPlanState *fdw_private,
 		   Cost *startup_cost, Cost *total_cost);
 static int	file_acquire_sample_rows(Relation onerel, int elevel,
-	 HeapTuple *rows, int targrows,
-	 double *totalrows, double *totaldeadrows);
-
+	 AcquireSampleContext *context);
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
@@ -1093,30 +1092,27 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  */
 static int
 file_acquire_sample_rows(Relation onerel, int elevel,
-		 HeapTuple *rows, int targrows,
-		 double *totalrows, double *totaldeadrows)
+		 AcquireSampleContext *context)
 {
 	int			numrows = 0;
 	double		rowstoskip = -1;	/* -1 means not set yet */
 	ReservoirStateData rstate;
-	TupleDesc	tupDesc;
-	Datum	   *values;
-	bool	   *nulls;
 	bool		found;
 	char	   *filename;
 	bool		is_program;
 	List	   *options;
 	CopyState	cstate;
+	TupleTableSlot *slot;
 	ErrorContextCallback errcallback;
 	MemoryContext oldcontext = CurrentMemoryContext;
 	MemoryContext tupcontext

Re: Errors when update a view with conditional-INSTEAD rules

2020-01-16 Thread Pengzhou Tang
Thanks a lot, Dean, for looking into this, and sorry for the late reply.

On Sun, Jan 5, 2020 at 12:08 AM Dean Rasheed 
wrote:

> Tracing it through, this seems to be a result of
> cab5dc5daf2f6f5da0ce79deb399633b4bb443b5 which added support for
> updatable views with a mix of updatable and non-updatable columns.
> That included a change to rewriteTargetListIU() to prevent it from
> adding dummy targetlist entries for unassigned-to attributes for
> auto-updatable views, in case they are no longer simple references to
> the underlying relation. Instead, that is left to expand_targetlist(),
> as for a normal table. However, in this case (an UPDATE on a view with
> a conditional rule), the target relation of the original query isn't
> rewritten (we leave it to the executor to report the error), and so
> expand_targetlist() ends up adding a new targetlist entry that
> references the target relation, which is still the original view. But
> when the planner bulds the simple_rel_array, it only adds entries for
> relations referenced in the query's jointree, which only includes the
> base table by this point, not the view. Thus the new targetlist entry
> added by expand_targetlist() refers to a NULL slot in the
> simple_rel_array, and it blows up.
>
That's a great analysis of this issue.


> Given that this is a query that's going to fail anyway, I'm inclined
> to think that the right thing to do is to throw the error sooner, in
> rewriteQuery(), rather than attempting to plan a query that cannot be
> executed.
>

I am wondering whether a simple auto-updatable view can have a conditional
UPDATE INSTEAD rule. For the test case I added, does the plan below look
reasonable?
gpadmin=# explain UPDATE v1 SET b = 2 WHERE a = 1;
QUERY PLAN
---
 Insert on t2  (cost=0.00..49.55 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..49.55 rows=1 width=8)
 Filter: ((b > 100) AND (a > 2) AND (a = 1))

 Update on t1  (cost=0.00..49.55 rows=3 width=14)
   ->  Seq Scan on t1  (cost=0.00..49.55 rows=3 width=14)
 Filter: (((a > 2) IS NOT TRUE) AND (b > 100) AND (a = 1))
(7 rows)

gpadmin=# UPDATE v1 SET b = 2 WHERE a = 1;
UPDATE 1

The documentation also says: "There is a catch if you try to use conditional
rules for *complex view* updates: there must be an unconditional INSTEAD rule
for each action you wish to allow on the view", which makes me think a simple
view can have a conditional INSTEAD rule. And the documentation was
explicitly changed in commit a99c42f291421572aef2:
-   There is a catch if you try to use conditional rules for view
+  There is a catch if you try to use conditional rules for complex view

Does that mean we should support conditional rules for a simple view?

Regards,
Pengzhou Tang


Re: Errors when update a view with conditional-INSTEAD rules

2020-01-21 Thread Pengzhou Tang
> I am wondering whether a simple auto-updatable view can have a
> conditional update instead rule.
>
> Well, the decision reached in [1] was that we wouldn't allow that. We
> could decide to allow it now as a new feature enhancement, but it
> wouldn't be a back-patchable bug-fix, and to be honest I wouldn't be
> particularly excited about adding such a feature now. We already get
> enough reports related to multiple rule actions behaving in
> counter-intuitive ways that trip up users. I don't think we should be
> enhancing the rule system, but rather encouraging users not to use it
> and use triggers instead.
>
Ok, that makes sense, thanks for the explanation.


Additional size of hash table is alway zero for hash aggregates

2020-03-12 Thread Pengzhou Tang
Hi hackers,

When reading the grouping sets code, I noticed that the additional size of
the hash table for hash aggregates is always zero. This seems incorrect to
me; attached is a patch to fix it, please help check it.

Thanks,
Pengzhou


0001-Set-numtrans-correctly-when-building-hash-aggregate-.patch
Description: Binary data


Re: Additional size of hash table is alway zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
>
> On 2020-03-12 16:35:15 +0800, Pengzhou Tang wrote:
> > When reading the grouping sets codes, I find that the additional size of
> > the hash table for hash aggregates is always zero, this seems to be
> > incorrect to me, attached a patch to fix it, please help to check.
>
> Indeed, that's incorrect. Causes the number of buckets for the hashtable
> to be set higher - the size is just used for that.  I'm a bit wary of
> changing this in the stable branches - could cause performance changes?
>
>
Thanks for confirming this.


Re: Additional size of hash table is alway zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
On Fri, Mar 13, 2020 at 8:34 AM Andrew Gierth 
wrote:

> > "Justin" == Justin Pryzby  writes:
>
>  > On Thu, Mar 12, 2020 at 12:16:26PM -0700, Andres Freund wrote:
>  >> Indeed, that's incorrect. Causes the number of buckets for the
>  >> hashtable to be set higher - the size is just used for that. I'm a
>  >> bit wary of changing this in the stable branches - could cause
>  >> performance changes?
>
> I think (offhand, not tested) that the number of buckets would only be
> affected if the (planner-supplied) numGroups value would cause work_mem
> to be exceeded; the planner doesn't plan a hashagg at all in that case
> unless forced to (grouping by a hashable but not sortable column). Note
> that for various reasons the planner tends to over-estimate the memory
> requirement anyway.
>
>
That makes sense, thanks.


Re: Additional size of hash table is alway zero for hash aggregates

2020-03-13 Thread Pengzhou Tang
Thanks, Andres Freund and Andrew Gierth.

On a related note, may I invite you to help review the parallel grouping
sets patches? It would be great to hear comments from you, since you
contributed most of the grouping sets code.

The thread is:
https://www.postgresql.org/message-id/CAG4reAQ8rFCc%2Bi0oju3VjaW7xSOJAkvLrqa4F-NYZzAG4SW7iQ%40mail.gmail.com

Thanks,
Pengzhou

On Fri, Mar 13, 2020 at 3:16 AM Andres Freund  wrote:

> Hi,
>
>
> On 2020-03-12 16:35:15 +0800, Pengzhou Tang wrote:
> > When reading the grouping sets codes, I find that the additional size of
> > the hash table for hash aggregates is always zero, this seems to be
> > incorrect to me, attached a patch to fix it, please help to check.
>
> Indeed, that's incorrect. Causes the number of buckets for the hashtable
> to be set higher - the size is just used for that.  I'm a bit wary of
> changing this in the stable branches - could cause performance changes?
>
> Greetings,
>
> Andres Freund
>


Re: Parallel grouping sets

2020-03-19 Thread Pengzhou Tang
Thank you for reviewing this patch.

On Thu, Mar 19, 2020 at 10:09 AM Tomas Vondra 
wrote:

> Hi,
>
> unfortunately this got a bit broken by the disk-based hash aggregation,
> committed today, and so it needs a rebase. I've started looking at the
> patch before that, and I have it rebased on e00912e11a9e (i.e. the
> commit before the one that breaks it).


I spent the day looking into the details of the hash spill patch and was
finally able to rebase it successfully. I tested the first 5 patches and they
all pass installcheck; the 0006-parallel-xxx patch is not tested yet, and I
also need to make hash spill work in the final stage of parallel grouping
sets. I will do that tomorrow.

The conflicts are mainly located in the handling of hash spill for grouping
sets. The 0004-reorganise- patch also makes the stage that refills the hash
table easier and avoids the null check in that stage.


> Attached is the rebased patch series (now broken), with a couple of
> commits with some minor cosmetic changes I propose to make (easier than
> explaining it on a list, it's mostly about whitespace, comments etc).
> Feel free to reject the changes, it's up to you.


Thanks, I will enhance the comments and take care of the whitespace.

>
> I'll continue doing the review, but it'd be good to have a fully rebased
> version.


I would very much appreciate it.


Thanks,
Pengzhou


Re: Memory-Bounded Hash Aggregation

2020-03-19 Thread Pengzhou Tang
Hi,

I happened to notice that "set enable_sort to false" cannot guarantee that
the planner uses hashagg in the groupingsets.sql test, so the following
comparison of the sortagg and hashagg results seems meaningless.

Thanks,
Pengzhou

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis  wrote:

>
> Committed.
>
> There's some future work that would be nice (some of these are just
> ideas and may not be worth it):
>
> * Refactor MemoryContextMemAllocated() to be a part of
> MemoryContextStats(), but allow it to avoid walking through the blocks
> and freelists.
>
> * Improve the choice of the initial number of buckets in the hash
> table. For this patch, I tried to preserve the existing behavior of
> estimating the number of groups and trying to initialize with that many
> buckets. But my performance tests seem to indicate this is not the best
> approach. More work is needed to find what we should really do here.
>
> * For workloads that are not in work_mem *or* system memory, and need
> to actually go to storage, I see poor CPU utilization because it's not
> effectively overlapping CPU and IO work. Perhaps buffering or readahead
> changes can improve this, or async IO (even better).
>
> * Project unnecessary attributes away before spilling tuples to disk.
>
> * Improve logtape.c API so that the caller doesn't need to manage a
> bunch of tape numbers.
>
> * Improve estimate of the hash entry size. This patch doesn't change
> the way the planner estimates it, but I observe that actual size as
> seen at runtime is significantly different. This is connected to the
> initial number of buckets for the hash table.
>
> * In recursive steps, I don't have a good estimate for the number of
> groups, so I just estimate it as the number of tuples in that spill
> tape (which is pessimistic). That could be improved by doing a real
> cardinality estimate as the tuples are spilling (perhaps with HLL?).
>
> * Many aggregates with pass-by-ref transition states don't provide a
> great aggtransspace. We should consider doing something smarter, like
> having negative numbers represent a number that should be multiplied by
> the size of the group (e.g. ARRAY_AGG would have a size dependent on
> the group size, not a constant).
>
> * If we want to handle ARRAY_AGG (and the like) well, we can consider
> spilling the partial states in the hash table whem the memory is full.
> That would add a fair amount of complexity because there would be two
> types of spilled data (tuples and partial states), but it could be
> useful in some cases.
>
> Regards,
> Jeff Davis
>
>
>
>
>


Re: Memory-Bounded Hash Aggregation

2020-03-19 Thread Pengzhou Tang
On Fri, Mar 20, 2020 at 1:20 PM Pengzhou Tang  wrote:

> Hi,
>
> I happen to notice that "set enable_sort to false" cannot guarantee the
> planner to use hashagg in test groupingsets.sql,
> the following comparing results of sortagg and hashagg seems to have no
> meaning.
>
>
Please ignore my comment; I should have set enable_groupingsets_hash_disk as
well.
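
That is, something along these lines:

set enable_sort to off;
set enable_groupingsets_hash_disk to on;  -- needed as well to keep the planner choosing hashed grouping sets here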


Re: Parallel grouping sets

2020-02-09 Thread Pengzhou Tang
Thanks for reviewing those patches.

Ha, I believe you meant to say a "normal aggregate", because what's
> performed above gather is no longer "grouping sets", right?
>
> The group key idea is clever in that it helps "discriminate" tuples by
> their grouping set id. I haven't completely thought this through, but my
> hunch is that this leaves some money on the table, for example, won't it
> also lead to more expensive (and unnecessary) sorting and hashing? The
> groupings with a few partials are now sharing the same tuplesort with
> the groupings with a lot of groups even though we only want to tell
> grouping 1 *apart from* grouping 10, not neccessarily that grouping 1
> needs to come before grouping 10. That's why I like the multiplexed pipe
> / "dispatched by grouping set id" idea: we only pay for sorting (or
> hashing) within each grouping. That said, I'm open to the criticism that
> keeping multiple tuplesort and agg hash tables running is expensive in
> itself, memory-wise ...
>
> Cheers,
> Jesse


That's something we need to test, thanks. Meanwhile, for the approach of
using a "normal aggregate" with a grouping set id, one concern is that it
cannot use a Mixed Hash strategy, which means that if the grouping sets
contain both non-hashable and non-sortable sets, it will fall back to a
one-phase aggregate.


Re: Extracting only the columns needed for a query

2020-02-14 Thread Pengzhou Tang
> > > On Sat, Jun 15, 2019 at 10:02 AM Tom Lane  wrote:
> > >
> > > Another reason for having the planner do this is that presumably, in
> > > an AM that's excited about this, the set of fetched columns should
> > > play into the cost estimates for the scan.  I've not been paying
> > > enough attention to the tableam work to know if we've got hooks for
> > > the AM to affect scan costing ... but if we don't, that seems like
> > > a hole that needs plugged.
> >
> > AM callback relation_estimate_size exists currently which planner
> leverages.
> > Via this callback it fetches tuples, pages, etc.. So, our thought is to
> extend
> > this API if possible to pass down needed column and help perform better
> costing
> > for the query. Though we think if wish to leverage this function, need
> to know
> > list of columns before planning hence might need to use query tree.
>
> I believe it would be beneficial to add this potential API extension patch
> into
> the thread (as an example of an interface defining how scanCols could be
> used)
> and review them together.
>
Thanks for your suggestion; we paste one potential API extension change
below for zedstore to use scanCols.

The change contains 3 patches to clarify our idea.

0001-ANALYZE.patch is a generic patch for the ANALYZE API extension; we
developed it to make the analysis of zedstore tables more accurate. It is
more flexible now: e.g. a table AM can provide logical block numbers as the
random sampling seed, analyze only the specified columns, and provide extra
info besides the data tuple.

0002-Planner.patch is the real patch showing how we use rte->scanCols for
cost estimation. The main idea is adding a new metric 'stadiskfrac' to the
catalog pg_statistic; 'stadiskfrac' is the physical size ratio of a column
and is calculated when ANALYZE is performed (0001-ANALYZE.patch helps by
providing the extra disk-size info). So when set_plain_rel_size() is called
by the planner, it uses rte->scanCols and 'stadiskfrac' to adjust
rel->pages; please see set_plain_rel_page_estimates().
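
As a rough illustration of the intended effect (made-up numbers, my reading
of the idea rather than the patch's exact formula): if a zedstore relation
occupies 10,000 pages in total and ANALYZE records stadiskfrac = 0.05 for a
narrow column c1 and 0.70 for a wide column c2, a query whose scanCols
contain only c1 would have rel->pages scaled down to roughly 10,000 * 0.05 =
500 pages, so the estimated IO cost reflects only the column data actually
read.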

0003-ZedStore.patch is an example of how zedstore uses the extended ANALYZE
API; I paste it here anyway, in case someone is interested in it.

Thanks,
Pengzhou
From 77d257ab002a5e1b6a2f65e359cbfd7978e3cff5 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang 
Date: Wed, 20 Nov 2019 06:42:37 -0500
Subject: [PATCH 1/3] ANALYZE tableam API change

Extended three ANALYZE-related tableam APIs so AMs can take more control
of ANALYZE progress:
- scan_analyze_beginscan() : so AMs can has more flexible sampling strategy
- scan_analyze_sample_tuple() : so ANALYZE can get extra info as needed
- scan_analyze_endscan() :

Also use struct AnalyzeSampleContext to provide more convenience, with it
tableam analyze routines can provide extra info except the real data,
for example: physical size or compression ratio.
---
 contrib/file_fdw/file_fdw.c  |  35 +++---
 contrib/postgres_fdw/postgres_fdw.c  |  56 +
 src/backend/access/heap/heapam_handler.c | 109 ++--
 src/backend/access/table/tableam.c   | 209 +++
 src/backend/commands/analyze.c   | 181 --
 src/include/access/tableam.h | 138 +---
 src/include/foreign/fdwapi.h |   7 +-
 7 files changed, 530 insertions(+), 205 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 549821c..2344f01 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -19,6 +19,7 @@
 #include "access/reloptions.h"
 #include "access/sysattr.h"
 #include "access/table.h"
+#include "access/tableam.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_foreign_table.h"
 #include "commands/copy.h"
@@ -157,10 +158,8 @@ static void estimate_size(PlannerInfo *root, RelOptInfo *baserel,
 static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
 		   FileFdwPlanState *fdw_private,
 		   Cost *startup_cost, Cost *total_cost);
-static int	file_acquire_sample_rows(Relation onerel, int elevel,
-	 HeapTuple *rows, int targrows,
-	 double *totalrows, double *totaldeadrows);
-
+static void file_acquire_sample_rows(Relation onerel, int elevel,
+	 AnalyzeSampleContext *context);
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
@@ -1091,14 +1090,16 @@ estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
  * may be meaningless, but it's OK because we don't use the estimates
  * currently (the planner only pays attention to correlation for indexscans).
  */
-static int
+static void
 file_acquire_sample_rows(Relation onerel, int elevel,
-		 HeapTuple *rows, int