[HACKERS] A trivial fix on extensiblenode

2016-02-29 Thread Kouhei Kaigai
Hello,

RegisterExtensibleNodeMethods() initializes its hash table
with keysize=NAMEDATALEN, instead of EXTNODENAME_MAX_LEN.

The attached patch fixes it.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-trivial-fix-extensiblenode.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Reworks of CustomScan serialization/deserialization

2016-02-29 Thread Kouhei Kaigai
Hello,

I'd like to adjust a few points of the custom-scan interface prior to the v9.6 feature freeze.

The major point is the serialization/deserialization mechanism.
Currently, an extension has to give a LibraryName and SymbolName to
reproduce the same CustomScanMethods on the background worker process
side. That is indeed sufficient information to pull the table of
function pointers.

On the other hand, we now have a different mechanism to wrap private
information: the extensible node. It requires extensions to register
their ExtensibleNodeMethods, identified by name, usually at _PG_init()
time. This is also a reasonable way to reproduce the same objects on
the background worker side.

However, a mixture of two different ways is not good. My preference is
the extensible-node approach rather than what custom-scan currently
does.
The attached patch allows an extension to register its CustomScanMethods
once; readfuncs.c can then pull this table by its CustomName in string form.


The minor point is the header file location of the CustomMethods
declarations. These are currently declared in relation.h, plannodes.h,
and execnodes.h. These files are very primitive, so we put lines like:

  struct ParallelContext; /* avoid including parallel.h here */
  struct shm_toc; /* avoid including shm_toc.h here */
  struct ExplainState;/* avoid including explain.h here */

to avoid inclusion of other headers here.

It seems to me the CustomMethods declarations should be moved somewhere
more appropriate, as fdwapi.h is for FDW. If we put "struct CustomMethods;"
in these primitive header files instead, it will still work.

I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
but we may need somewhere other than the primitive header files to put
these declarations.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-scan-serialization-reworks.1.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-03 Thread Kouhei Kaigai
I found one other, tiny, problem in implementing the SSD-to-GPU direct
data transfer feature on top of PostgreSQL storage.

An extension cannot know the raw file descriptor opened by smgr.

I expect an extension to issue an ioctl(2) on a special device file
provided by a special kernel driver, to control the P2P DMA.
This ioctl(2) will pack the file descriptor of the DMA source and
various other information (like base position, range, destination
device pointer, ...).

However, the raw file descriptor is wrapped by fd.c behind the
File handle, and thus not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, an extension has to treat the raw
descriptor carefully so as not to have an adverse effect on the storage
manager.)

How about this tiny enhancement?

> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> > Sent: Saturday, February 13, 2016 1:46 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> > Subject: Re: [HACKERS] Way to check whether a particular block is on the
> > shared_buffer?
> >
> > On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai  wrote:
> > > Hmm. In my experience, it is often not a productive discussion whether
> > > a feature is niche or commodity. So, let me change the viewpoint.
> > >
> > > We may utilize OS-level locking mechanism here.
> > >
> > > Even though it depends on filesystem implementation under the VFS,
> > > we may use inode->i_mutex lock that shall be acquired during the buffer
> > > copy from user to kernel, at least, on a few major filesystems; ext4,
> > > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > > acquire the inode->i_mutex lock during P2P DMA transfer.
> > >
> > > Once we can consider the OS buffer is updated atomically by the lock,
> > > we don't need to worry about corrupted pages, but still needs to pay
> > > attention to the scenario when updated buffer page is moved to GPU.
> > >
> > > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > > infrastructure, so I intend to move all-visible pages only.
> > > If someone updates the buffer concurrently, then write out the page
> > > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > > updated tuples should not be visible to the transaction which issued
> > > P2P DMA.
> > >
> > > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > > that indicates CPU to retry this page again. In this case, this page is
> > > likely loaded to the shared buffer already, so retry penalty is not so
> > > much.
> > >
> > > I'll try to investigate the implementation in this way.
> > > Please correct me, if I misunderstand something (especially, treatment
> > > of PD_ALL_VISIBLE).
> >
> > I suppose there's no theoretical reason why the buffer couldn't go
> > from all-visible to not-all-visible and back to all-visible again all
> > during the time you are copying it.
> >
> The backend process that is copying the data to GPU has a transaction
> in-progress (= not committed). Is it possible to get the updated buffer
> page back to the all-visible state again?
> I expect that in-progress transactions works as a blocker for backing
> to all-visible. Right?
> 
> > Honestly, I think trying to access buffers without going through
> > shared_buffers is likely to be very hard to make correct and probably
> > a loser.
> >
> No challenge, no outcome. ;-)
> 
> > Copying the data into shared_buffers and then to the GPU is,
> > doubtless, at least somewhat slower.  But I kind of doubt that it's
> > enough slower to make up for all of the problems you're going to have
> > with the approach you've chosen.
> >
> Honestly, I'm still uncertain whether it works well as I expects.
> However, scan workload on the table larger than main memory is
> headache for PG-Strom, so I'd like to try ideas we can implement.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei 
>



pgsql-v9.6-filegetrawdesc.1.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-03-07 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, March 05, 2016 2:42 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai  wrote:
> > I found one other, but tiny, problem to implement SSD-to-GPU direct
> > data transfer feature under the PostgreSQL storage.
> >
> > Extension cannot know the raw file descriptor opened by smgr.
> >
> > I expect an extension issues an ioctl(2) on the special device file
> > on behalf of the special kernel driver, to control the P2P DMA.
> > This ioctl(2) will pack file descriptor of the DMA source and some
> > various information (like base position, range, destination device
> > pointer, ...).
> >
> > However, the raw file descriptor is wrapped in the fd.c, instead of
> > the File handler, thus, not visible to extension. oops...
> >
> > The attached patch provides a way to obtain raw file descriptor (and
> > relevant flags) of a particular File virtual file descriptor on
> > PostgreSQL. (No need to say, extension has to treat the raw descriptor
> > carefully not to give an adverse effect to the storage manager.)
> >
> > How about this tiny enhancement?
> 
> Why not FileDescriptor(), FileFlags(), FileMode() as separate
> functions like FilePathName()?
>
There is no deep reason. The attached patch adds three individual
functions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-filegetrawdesc.2.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> On 29/02/16 13:07, Kouhei Kaigai wrote:
> >
> > I'd like to adjust a few of custom-scan interface prior to v9.6 freeze.
> >
> > The major point is serialization/deserialization mechanism.
> > Now, extension has to give LibraryName and SymbolName to reproduce
> > same CustomScanMethods on the background worker process side. Indeed,
> > it is sufficient information to pull the table of function pointers.
> >
> > On the other hands, we now have different mechanism to wrap private
> > information - extensible node. It requires extensions to register its
> > ExtensibleNodeMethods identified by name, usually, on _PG_init() time.
> > It is also reasonable way to reproduce same objects on background
> > worker side.
> >
> > However, mixture of two different ways is not good. My preference is
> > what extensible-node is doing rather than what custom-scan is currently
> > doing.
> > The attached patch allows extension to register CustomScanMethods once,
> > then readFunc.c can pull this table by CustomName in string form.
> >
> 
> Agreed, but this will break compatibility right?
>
The manner of passing a library-name/symbol-name pair is a new feature
in v9.6, not present in v9.5, so now is the last chance to fix up the
interface requirement.

> > I'm not 100% certain whether "nodes/custom-apis.h" is the best location,
> > but somewhere we can put these declarations rather than the primitive
> > header files might be needed.
> 
> custom-apis.c does not sound like right name to me, maybe it can be just
> custom.c but custom.h might be bit too generic, maybe custom_node.h
>
OK, custom_node.h may be better.

> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> squished to less defines.
>
Hmm. I just followed the convention in extensible.c: this limit was
initially NAMEDATALEN, then Robert changed it to EXTNODENAME_MAX_LEN.
I guess he avoided applying the same label to different entities;
NAMEDATALEN is a limitation of the NameData type, but the identifiers of
extensible nodes and custom-scan nodes are not restricted by it.

> Also in RegisterCustomScanMethods
> + Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> 
> Shouldn't this be actually "if" with ereport() considering this is
> public API and extensions can pass anything there? (for that matter same
> is true for RegisterExtensibleNodeMethods but that's already committed).
>
Hmm. I don't have a clear answer as to which is better. The reason I put
Assert() here is that only C-binary extensions use this interface, so
the author will fix any too-long-name problem prior to release.
Of course, an if with ereport() would also inform the extension author
that the name is too long.
One downside of Assert() is that the mistake goes unnoticed if
--enable-cassert was not specified.

> Other than that this seems like straight conversion to same basic
> template as extensible nodes so I think it's ok.
> 

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 10, 2016 11:01 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 02:18, Kouhei Kaigai wrote:
> >
> >> I am not sure I like the fact that we have this EXTNODENAME_MAX_LEN and
> >> now the CUSTOM_NAME_MAX_LEN with the same length and also they are both
> >> same lenght as NAMEDATALEN I wonder if this shouldn't be somehow
> >> squished to less defines.
> >>
> > Hmm. I just followed the manner in extensible.c, because this label was
> > initially NAMEDATALEN, then Robert changed it with EXTNODENAME_MAX_LEN.
> > I guess he avoid to apply same label on different entities - NAMEDATALEN
> > is a limitation for NameData type, but identifier of extensible-node and
> > custom-scan node are not restricted by.
> >
> 
> Makes sense.
> 
> >> Also in RegisterCustomScanMethods
> >> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>
> >> Shouldn't this be actually "if" with ereport() considering this is
> >> public API and extensions can pass anything there? (for that matter same
> >> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>
> > Hmm. I don't have clear answer which is better. The reason why I put
> > Assert() here is that only c-binary extension uses this interface, thus,
> > author will fix up the problem of too long name prior to its release.
> > Of course, if-with-ereport() also informs extension author the name is
> > too long.
> > One downside of Assert() may be, it makes oversight if --enable-cassert
> > was not specified.
> >
> 
> Well that's exactly my problem, this should IMHO throw error even
> without --enable-cassert. It's not like it's some performance sensitive
> API where if would be big problem, ensuring correctness of the input is
> more imporant here IMHO.
>
We may need to fix up RegisterExtensibleNodeMethods() first.

Also, the length limitation is (EXTNODENAME_MAX_LEN - 1), because the
last byte is consumed by the '\0' terminator. In fact, the hash, match,
and keycopy functions of an HTAB with string keys deal only with the
first (keysize - 1) bytes.
So strlen(extnodename) == EXTNODENAME_MAX_LEN is not legal.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-extensible-namelen-check-by-ereport.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
Hello,

I'm now checking the new planner implementation to find a way to
integrate CustomPath into the upper planner as well.
The CustomPath node was originally expected to generate various kinds of
plan nodes, not only scan/join, and its interface is designed to support
that. For example, we can imagine a CustomPath that generates a "CustomSort".

On the other hand, upper-path consideration is more varied than
scan/join path consideration. We probably have no centralized point at
which to add custom paths for sort, group-by, and so on.
So, I think we need hooks for each (supported) upper-path workload.

In the case of sorting, for example, the best location for the hook is
just above the Assert() in create_ordered_paths(). That allows comparing
the estimated costs of a SortPath and a CustomPath.
However, it does not allow injecting a CustomPath (for sort) into path
nodes that may involve sorting, like WindowPath or AggPath.
Thus, additional hooks may be needed in create_window_paths and
create_grouping_paths, in my view.

Any other good ideas?

Even though I couldn't check the new planner implementation entirely,
the points below seem to be good candidates for injecting a CustomPath
(and potentially a ForeignScan):

- create_grouping_paths
- create_window_paths
- create_distinct_paths
- create_ordered_paths
- just below create_modifytable_path
  (may be valuable for foreign-update pushdown)

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Saturday, March 05, 2016 3:02 AM
> To: David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> OK, here is a version that I think addresses all of the recent comments:
> 
> * I refactored the grouping-sets stuff as suggested by Robert and David.
> The GroupingSetsPath code is now used *only* when there are grouping sets,
> otherwise what you get is a plain AGG_SORTED AggPath.  This allowed
> removal of a boatload of weird corner cases in the GroupingSets code path,
> so it was a good change.  (Fundamentally, that's cleaning up some
> questionable coding in the grouping sets patch rather than fixing anything
> directly related to pathification, but I like the code better now.)
> 
> * I refactored the handling of targetlists in createplan.c.  After some
> reflection I decided that the disuse_physical_tlist callers fell into
> three separate categories: those that actually needed the exact requested
> tlist to be returned, those that wanted non-bloated tuples because they
> were going to put them into sort or hash storage, and those that needed
> grouping columns to be properly labeled.  The new approach is to pass down
> a "flags" word that specifies which if any of these cases apply at a
> specific plan level.  use_physical_tlist now always makes the right
> decision to start with, and disuse_physical_tlist is gone entirely, which
> should make things a bit faster since we won't uselessly construct and
> discard physical tlists.  The missing logic from make_subplanTargetList
> and locate_grouping_columns is reincarnated in the physical-tlist code.
> 
> * Added explicit limit/offset fields to LimitPath, as requested by Teodor.
> 
> * Removed SortPath.sortgroupclauses.
> 
> * Fixed handling of parallel-query fields in new path node types.
> (BTW, I found what seemed to be a couple of pre-existing bugs of
> the same kind, eg create_mergejoin_path was different from the
> other two kinds of join as to setting parallel_degree.)
> 
> 
> What remains to be done, IMV:
> 
> * Performance testing as per yesterday's discussion.
> 
> * Debug support in outfuncs.c and print_path() for new node types.
> 
> * Clean up unfinished work on function header comments.
> 
> * Write some documentation about how FDWs might use this.
> 
> I'll work on the performance testing next.  Barring unsatisfactory
> results from that, I think this could be committable in a couple
> of days.
> 
>   regards, tom lane





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Friday, March 11, 2016 12:27 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>
> >>>> Also in RegisterCustomScanMethods
> >>>> +Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>
> >>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>> public API and extensions can pass anything there? (for that matter same
> >>>> is true for RegisterExtensibleNodeMethods but that's already committed).
> >>>>
> >>> Hmm. I don't have clear answer which is better. The reason why I put
> >>> Assert() here is that only c-binary extension uses this interface, thus,
> >>> author will fix up the problem of too long name prior to its release.
> >>> Of course, if-with-ereport() also informs extension author the name is
> >>> too long.
> >>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>> was not specified.
> >>>
> >>
> >> Well that's exactly my problem, this should IMHO throw error even
> >> without --enable-cassert. It's not like it's some performance sensitive
> >> API where if would be big problem, ensuring correctness of the input is
> >> more imporant here IMHO.
> >>
> > We may need to fix up RegisterExtensibleNodeMethods() first.
> >
> > Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> > is consumed by '\0' character. In fact, hash, match and keycopy function
> > of HTAB for string keys deal with the first (keysize - 1) bytes.
> > So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >
> 
> Yes, my thoughts as well but that can be separate tiny patch that does
> not have to affect this one. In my opinion if we fixed this one it would
> be otherwise ready to go in, and I definitely prefer this approach to
> the previous one.
>
OK, I split the previous small patch into two tiny patches.
One is a bugfix around the max length of the extnodename.
The other replaces Assert() with ereport(), per the upthread discussion.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-fix-extnodename-max-len.patch


pgsql-v9.6-replace-assert-by-ereport-on-register-extnode.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-13 Thread Kouhei Kaigai
> On 14/03/16 02:53, Kouhei Kaigai wrote:
> >> -Original Message-
> >> From: pgsql-hackers-ow...@postgresql.org
> >> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> >> Sent: Friday, March 11, 2016 12:27 AM
> >> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> >> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> >>
> >> On 10/03/16 08:08, Kouhei Kaigai wrote:
> >>>>
> >>>>>> Also in RegisterCustomScanMethods
> >>>>>> +  Assert(strlen(methods->CustomName) <= CUSTOM_NAME_MAX_LEN);
> >>>>>>
> >>>>>> Shouldn't this be actually "if" with ereport() considering this is
> >>>>>> public API and extensions can pass anything there? (for that matter 
> >>>>>> same
> >>>>>> is true for RegisterExtensibleNodeMethods but that's already 
> >>>>>> committed).
> >>>>>>
> >>>>> Hmm. I don't have clear answer which is better. The reason why I put
> >>>>> Assert() here is that only c-binary extension uses this interface, thus,
> >>>>> author will fix up the problem of too long name prior to its release.
> >>>>> Of course, if-with-ereport() also informs extension author the name is
> >>>>> too long.
> >>>>> One downside of Assert() may be, it makes oversight if --enable-cassert
> >>>>> was not specified.
> >>>>>
> >>>>
> >>>> Well that's exactly my problem, this should IMHO throw error even
> >>>> without --enable-cassert. It's not like it's some performance sensitive
> >>>> API where if would be big problem, ensuring correctness of the input is
> >>>> more imporant here IMHO.
> >>>>
> >>> We may need to fix up RegisterExtensibleNodeMethods() first.
> >>>
> >>> Also, length limitation is (EXTNODENAME_MAX_LEN-1) because the last byte
> >>> is consumed by '\0' character. In fact, hash, match and keycopy function
> >>> of HTAB for string keys deal with the first (keysize - 1) bytes.
> >>> So, strkey(extnodename) == EXTNODENAME_MAX_LEN is not legal.
> >>>
> >>
> >> Yes, my thoughts as well but that can be separate tiny patch that does
> >> not have to affect this one. In my opinion if we fixed this one it would
> >> be otherwise ready to go in, and I definitely prefer this approach to
> >> the previous one.
> >>
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> >
> 
> Okay, it's somewhat akin to hairsplitting but works for me. Do you plan
> to do same thing with the CustomScan patch itself as well?.
>
Yes. I'll fix up the patch to follow the same manner.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] WIP: Upper planner pathification

2016-03-13 Thread Kouhei Kaigai
> -Original Message-
> From: Petr Jelinek [mailto:p...@2ndquadrant.com]
> Sent: Monday, March 14, 2016 12:18 PM
> To: Kaigai Kouhei(海外 浩平); Tom Lane; David Rowley
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> On 14/03/16 02:43, Kouhei Kaigai wrote:
> >
> > CustomPath node is originally expected to generate various kind of plan
> > node, not only scan/join, and its interface is designed to support them.
> > For example, we can expect a CustomPath that generates "CustomSort".
> >
> > On the other hands, upper path consideration is more variable than the
> > case of scan/join path consideration. Probably, we can have no centralized
> > point to add custom-paths for sort, group-by, ...
> > So, I think we have hooks for each (supported) upper path workload.
> >
> > In case of sorting for example, the best location of the hook is just
> > above of the Assert() in the create_ordered_paths(). It allows to compare
> > estimated cost between SortPath and CustomPath.
> > However, it does not allow to inject CustomPath(for sort) into the path
> > node that may involve sorting, like WindowPath or AggPath.
> > Thus, another hook may be put on create_window_paths and
> > create_grouping_paths in my thought.
> >
> > Some other good idea?
> >
> > Even though I couldn't check the new planner implementation entirely,
> > it seems to be the points below are good candidate to inject CustomPath
> > (and potentially ForeignScan).
> >
> > - create_grouping_paths
> > - create_window_paths
> > - create_distinct_paths
> > - create_ordered_paths
> > - just below of the create_modifytable_path
> >(may be valuable for foreign-update pushdown)
> >
> 
> To me that seems too low inside the planning tree, perhaps adding it
> just to the subquery_planner before SS_identify_outer_params would be
> better, that's the place where you see the path for the whole (sub)query
> so you can search and modify what you need from there.
>
Thanks for your idea. Yes, I considered a similar point: the place
where all path consideration is completed. It would indeed allow an
extension to walk down the path tree and replace parts of it.
However, when we want to inject a CustomPath beneath the built-in
paths, the extension has to recalculate the cost of the built-in paths,
which may affect the choice among them.
So I concluded that it is not realistic to re-implement equivalent
upper-planning logic on the extension side, as we would have to if we
put the hook after all the planning work is done.

If an extension can add its CustomPath at create_grouping_paths(), the
later steps, like create_window_paths, build on the estimated cost of
that CustomPath. Thus, the extension doesn't need to know the details
of the entire upper planner.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Use %u to print user mapping's umid and userid

2016-03-14 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Etsuro Fujita
> Sent: Monday, March 14, 2016 4:59 PM
> To: Ashutosh Bapat; Tom Lane
> Cc: pgsql-hackers
> Subject: Re: [HACKERS] Use %u to print user mapping's umid and userid
> 
> Hi,
> 
> On 2016/02/09 14:09, Ashutosh Bapat wrote:
> > Sorry, I was wrong. For public user mapping userid is 0 (InvalidOid),
> > which is returned as is in UserMapping object. I confused InvalidOid
> > with -1.
> 
> I think the following umid handling in postgresGetForeignPlan has the
> same issue:
> 
>  /*
>   * Build the fdw_private list that will be available to the executor.
>   * Items in the list must match order in enum FdwScanPrivateIndex.
>   */
>  fdw_private = list_make4(makeString(sql.data),
>   retrieved_attrs,
>   makeInteger(fpinfo->fetch_size),
>   makeInteger(foreignrel->umid));
> 
> I don't think it's correct to use makeInteger for the foreignrel's umid.
>
BTW, use of ExtensibleNode would let us forget about problems that come
from data format translation.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] WIP: Upper planner pathification

2016-03-14 Thread Kouhei Kaigai




> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Tuesday, March 15, 2016 2:04 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); David Rowley; Robert Haas;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Petr Jelinek  writes:
> > On 14/03/16 02:43, Kouhei Kaigai wrote:
> >> Even though I couldn't check the new planner implementation entirely,
> >> it seems to be the points below are good candidate to inject CustomPath
> >> (and potentially ForeignScan).
> >>
> >> - create_grouping_paths
> >> - create_window_paths
> >> - create_distinct_paths
> >> - create_ordered_paths
> >> - just below of the create_modifytable_path
> >> (may be valuable for foreign-update pushdown)
> 
> > To me that seems too low inside the planning tree, perhaps adding it
> > just to the subquery_planner before SS_identify_outer_params would be
> > better, that's the place where you see the path for the whole (sub)query
> > so you can search and modify what you need from there.
> 
> I don't like either of those too much.  The main thing I've noticed over
> the past few days is that you can't readily generate custom upper-level
> Paths unless you know what PathTarget grouping_planner is expecting each
> level to produce.  So what I was toying with doing is (1) having
> grouping_planner put all those targets into the PlannerInfo, perhaps
> in an array indexed by UpperRelationKind; and (2) adding a hook call
> immediately after those targets are computed, say right before
> the create_grouping_paths() call (approximately planner.c:1738
> in HEAD).  It should be sufficient to have one hook there since
> you can inject Paths into any of the upper relations at that point;
> moreover, that's late enough that you shouldn't have to recompute
> anything you figured out during scan/join planning.
>
Regarding (2), I doubt whether that location is reasonable, because the
pathlist of each upper_rels[] entry is still empty there, isn't it?
That makes it hard for an extension to construct its own CustomPath on
top of the underlying built-in path nodes.

For example, if an extension implements its own sort logic but is not
interested in group-by/window functions, it will want to add its
CustomPath to UPPERREL_ORDERED; however, it does not know which relation
is the input_rel, and no built-in paths have been added yet at the point
of create_upper_paths_hook().

On the other hand, there is another problem if we put a hook after
all the upper paths are done. In that case, the built-in create_*_paths()
functions cannot take a CustomPath added later into account when they
pick up the cheapest path.

So, even though we don't need to define multiple hook declarations,
I think the hook should be invoked just after each create_*_paths()
call. The hook will need to inform the extension of the context of the
invocation, so the argument list should take an UpperRelationKind.

In addition, an extension cannot reference some local variables from
the root structure, such as:
 - rollup_lists
 - rollup_groupclauses
 - wflists
 - activeWindows
 - have_postponed_srfs
As we do with set_join_pathlist_hook, it would be a good idea to define
an UpperPathExtraData structure to pack this miscellaneous information.

So, how about redefining the hook as follows?

typedef void (*create_upper_paths_hook_type) (UpperRelationKind upper_kind,
  PlannerInfo *root,
  RelOptInfo *scan_join_rel,
  UpperPathExtraData *extra);

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai  wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
>
Thanks, and I see the point: ereport() is suggested, instead of elog(),
for error messages that may be visible to users.

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-14 Thread Kouhei Kaigai
Petr,

The attached patch is the revised one that follows the new extensible-
node routine.

It is almost the same as the previous version except that:
- custom-apis.[ch] was renamed to custom-node.[ch]
- the check on the length of the custom-scan-method name follows
  the manner of RegisterExtensibleNodeMethods()

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Tuesday, March 15, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] Reworks of CustomScan
> serialization/deserialization
> 
> On Sun, Mar 13, 2016 at 9:53 PM, Kouhei Kaigai  wrote:
> > OK, I split the previous small patch into two tiny patches.
> > The one is bugfix around max length of the extnodename.
> > The other replaces Assert() by ereport() according to the upthread 
> > discussion.
> 
> Committed, except that (1) I replaced ereport() with elog(), because I
> can't see making translators care about this message; and (2) I
> reworded the error message a bit.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


pgsql-v9.6-custom-scan-serialization-reworks.2.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.2.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-18 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, March 18, 2016 11:44 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> Kouhei Kaigai  writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> >> I do not, however, like the proposal to expose wflists and so forth.
> >> Those are internal data structures in grouping_planner and I absolutely
> >> refuse to promise that they're going to stay stable.
> 
> > Hmm... It's not easy to imagine a case that extension wants own idea
> > to extract window functions from the target list and select active
> > windows, even if extension wants to have own executor and own cost
> > estimation logic.
> > In case when extension tries to add WindowPath + CustomPath(Sort),
> > extension is interested in alternative sort task, but not window
> > function itself. It is natural to follow the built-in implementation,
> > thus, it motivates extension author to take copy & paste the code.
> > select_active_windows() is static, so extension needs to have same
> > routine on their side.
> 
> Well, to be perfectly blunt about it, I have said from day one that this
> notion that a CustomScan extension will be able to cause arbitrary planner
> behavior changes is loony.  We are simply not going to drop a hook into
> every tenth line of the planner for you, nor de-static-ify every internal
> function, nor (almost equivalently) expose the data structures those
> functions produce, because it would cripple core planner development to
> try to keep the implied APIs stable.  And I continue to maintain that any
> actually-generally-useful ideas would be better handled by submitting them
> as patches to the core planner, rather than trying to implement them in an
> arms-length extension.
> 
> In the case at hand, I notice that the WindowFuncLists struct is
> actually from find_window_functions in clauses.c, so an extension
> that needed to get hold of that would be unlikely to do any copying
> and pasting anyhow -- it'd just call find_window_functions again.
> (That only needs to search the targetlist, so it's not that expensive.)
> The other lists you mention are all tightly tied to specific, and not
> terribly well-designed, implementation strategies for grouping sets and
> window functions.  Those are *very* likely to change in the near future;
> and even if they don't, I'm skeptical that an external implementor of
> grouping sets or window functions would want to use exactly those same
> implementation strategies.  Maybe the only reason you're there at all
> is you want to be smarter about the order of doing window functions,
> for example.
> 
> So I really don't want to export any of that stuff.
>
Hmm. I can understand that there is active development around this area,
so these miscellaneous internal data structures may not be stable enough
to expose to extensions.
Would you be open to revisiting this discussion once the implementation
has settled down in a future development cycle?

> As for other details of the proposed patch, I was intending to put
> all the hook calls into grouping_planner for consistency, rather than
> scattering them between grouping_planner and its subroutines.  So that
> would probably mean calling the hook for a given step *before* we
> generate the core paths for that step, not after.  Did you have a
> reason to want the other order?  (If you say "so the hook can look
> at the core-made paths", I'm going to say "that's a bad idea".  It'd
> further increase the coupling between a CustomScan extension and core.)
>
No deep reason. I just followed the manner of the scan/join path hook,
which calls the extension after the core feature has added its built-in
path nodes.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas  writes:
> > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  wrote:
> >> So, even though we don't need to define multiple hook declarations,
> >> I think the hook invocation is needed just after create__paths()
> >> for each. It will need to inform extension the context of hook
> >> invocation, the argument list will take UpperRelationKind.
> 
> > That actually seems like a pretty good point.  Otherwise you can't
> > push anything from the upper rels down unless you are prepared to
> > handle all of it.
> 
> I'm not exactly convinced of the use-case for that.  What external
> thing is likely to handle window functions but not aggregation,
> for example?
>
A WindowPath usually takes a SortPath as input. Even though an extension
may not want to handle the window function itself, it may want to add
alternative sort logic to the built-in one.
Unless it calculates the expected cost, nobody knows whether WindowPath +
SortPath is really cheaper than WindowPath + CustomPath("GpuSort").

The supplied query may require running group-by prior to the window
function, while the extension is not interested in group-by; thus, the
extension needs to get control at the location where the built-in logic
also adds its paths, before the cheapest of the underlying paths is chosen.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> > Robert Haas  writes:
> > > On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  
> > > wrote:
> > >> So, even though we don't need to define multiple hook declarations,
> > >> I think the hook invocation is needed just after create__paths()
> > >> for each. It will need to inform extension the context of hook
> > >> invocation, the argument list will take UpperRelationKind.
> >
> > > That actually seems like a pretty good point.  Otherwise you can't
> > > push anything from the upper rels down unless you are prepared to
> > > handle all of it.
> >
> > I'm not exactly convinced of the use-case for that.  What external
> > thing is likely to handle window functions but not aggregation,
> > for example?
> >
> WindowPath usually takes a SortPath. Even though extension don't want to
> handle window function itself, it may want to add alternative sort logic
> than built-in.
> Unless it does not calculate expected cost, nobody knows whether WindowPath +
> SortPath is really cheaper than WindowPath + CustomPath("GpuSort").
> 
> The supplied query may require to run group-by prior to window function,
> but extension may not be interested in group-by on the other hands, thus,
> extension needs to get control around the location where built-in logic
> also adds paths to fetch the cheapest path of the underlying paths.
>
If I were to design the hook, I would put its entry points at:
- tail of create_grouping_paths(), just before set_cheapest()
- tail of create_window_paths(), just before set_cheapest()
- tail of create_distinct_paths(), just before set_cheapest()
- tail of create_ordered_paths(), just before set_cheapest()
- tail of grouping_planner(), after the loop of create_modifytable_path()

I'm not 100% certain whether the last one is the most straightforward
way to provide alternative writing logic. For example, if an extension
has its own special storage, like a columnar format, we may need more
consideration of whether CustomPath and related machinery are a
suitable tool.

On the other hand, I believe the earlier four entry points are the
right locations.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-upper-custom-path.1.patch
Description: pgsql-v9.6-upper-custom-path.1.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-19 Thread Kouhei Kaigai
> Robert Haas  writes:
> > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> >> Robert Haas  writes:
> >>> On Mon, Mar 14, 2016 at 9:21 PM, Kouhei Kaigai  
> >>> wrote:
> >>>> So, even though we don't need to define multiple hook declarations,
> >>>> I think the hook invocation is needed just after create__paths()
> >>>> for each. It will need to inform extension the context of hook
> >>>> invocation, the argument list will take UpperRelationKind.
> 
> >>> That actually seems like a pretty good point.  Otherwise you can't
> >>> push anything from the upper rels down unless you are prepared to
> >>> handle all of it.
> 
> >> I'm not exactly convinced of the use-case for that.  What external
> >> thing is likely to handle window functions but not aggregation,
> >> for example?
> 
> > I'm not going to say that you're entirely wrong, but I think that
> > attitude is a bit short-sighted.
> 
> Well, I'm prepared to yield to the extent of repeating the hook call
> before each phase with an UpperRelationKind parameter to tell which phase
> we're about to do.  The main concern here is to avoid redundant
> computation, but the hook can check the "kind" parameter and fall out
> quickly if it has nothing useful to do at the current phase.
> 
> I do not, however, like the proposal to expose wflists and so forth.
> Those are internal data structures in grouping_planner and I absolutely
> refuse to promise that they're going to stay stable.  (I had already
> been thinking a couple of weeks ago about revising the activeWindows
> data structure, now that it would be reasonably practical to cost out
> different orders for doing the window functions in.)  I think a hook
> that has its own ideas about window function implementation methods
> can gather its own information about the WFs without that much extra
> cost, and it very probably wouldn't want exactly the same data that
> create_window_paths uses today anyway.
> 
> So what I would now propose is
> 
> typedef void (*create_upper_paths_hook_type) (PlannerInfo *root,
>   UpperRelationKind stage,
>   RelOptInfo *input_rel);
>
Hmm... It is not easy to imagine a case in which an extension wants its
own logic to extract window functions from the target list and select
the active windows, even if it wants its own executor and its own cost
estimation logic.
When an extension tries to add WindowPath + CustomPath(Sort), it is
interested in the alternative sort task, not in the window function
itself. It is natural to follow the built-in implementation, and that
motivates extension authors to copy & paste the code:
select_active_windows() is static, so an extension needs to keep the
same routine on its side.

On the other hand, 'rollup_lists' and 'rollup_groupclauses' need three
static functions (extract_rollup_sets(), reorder_grouping_sets() and
preprocess_groupclause()) to reproduce the equivalent data structures.
That is a larger copy & paste burden if the extension is not interested
in the information related to grouping sets.

I understand it is not "best", but it is better to provide the extra
information an extension needs to reproduce an equivalent path node,
even if the fields of the UpperPathExtraData structure are not stable
right now.

> and have this invoked at each stage right before we call
> create_grouping_paths, create_window_paths, etc.
>
That seems reasonable to me.

> Also, I don't particularly see a need for a corresponding API for FDWs.
> If an FDW is going to do anything in this space, it presumably has to
> build up ForeignPaths for all the steps anyway.  So I'd be inclined
> to leave GetForeignUpperPaths as-is.
>
That seems reasonable to me. An FDW driver that is interested in remote
execution of upper paths can use the hook as it sees fit.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Petr Jelinek
> Sent: Thursday, March 17, 2016 5:06 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On 15/03/16 05:03, Kouhei Kaigai wrote:
> > Petr,
> >
> > The attached patch is the revised one that follows the new extensible-
> > node routine.
> >
> > It is almost same the previous version except for:
> > - custom-apis.[ch] was renamed to custom-node.[ch]
> > - check for the length of custom-scan-method name followed
> >the manner of RegisterExtensibleNodeMethods()
> >
> 
> Hi,
> 
> looks good, only nitpick I have is that it probably should be
> custom_node.h with underscore given that we use underscore everywhere
> (except for libpq and for some reason atomic ops).
>
Sorry for my late response.

The attached patch just renames custom-node.[ch] to custom_node.[ch].
Other portions are unchanged from the previous revision.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-23 Thread Kouhei Kaigai
> >> On 15/03/16 05:03, Kouhei Kaigai wrote:
> >>> Petr,
> >>>
> >>> The attached patch is the revised one that follows the new extensible-
> >>> node routine.
> >>>
> >>> It is almost same the previous version except for:
> >>> - custom-apis.[ch] was renamed to custom-node.[ch]
> >>> - check for the length of custom-scan-method name followed
> >>> the manner of RegisterExtensibleNodeMethods()
> >>>
> >>
> >> Hi,
> >>
> >> looks good, only nitpick I have is that it probably should be
> >> custom_node.h with underscore given that we use underscore everywhere
> >> (except for libpq and for some reason atomic ops).
> >>
> > Sorry for my response late.
> >
> > The attached patch just renamed custom-node.[ch] by custom_node.[ch].
> > Other portions are not changed from the previous revison.
> >
> 
> Forgot to attach?
>
Yes. Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-custom-scan-serialization-reworks.3.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.3.patch



Re: [HACKERS] WIP: Upper planner pathification

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Saturday, March 19, 2016 8:57 AM
> To: Tom Lane
> Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] WIP: Upper planner pathification
> 
> > -Original Message-
> > From: pgsql-hackers-ow...@postgresql.org
> > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> > Sent: Friday, March 18, 2016 11:44 PM
> > To: Kaigai Kouhei(海外 浩平)
> > Cc: Robert Haas; Petr Jelinek; David Rowley; pgsql-hackers@postgresql.org
> > Subject: Re: [HACKERS] WIP: Upper planner pathification
> >
> > Kouhei Kaigai  writes:
> > > On Wed, Mar 16, 2016 at 2:47 PM, Tom Lane  wrote:
> > >> I do not, however, like the proposal to expose wflists and so forth.
> > >> Those are internal data structures in grouping_planner and I absolutely
> > >> refuse to promise that they're going to stay stable.
> >
> > > Hmm... It's not easy to imagine a case that extension wants own idea
> > > to extract window functions from the target list and select active
> > > windows, even if extension wants to have own executor and own cost
> > > estimation logic.
> > > In case when extension tries to add WindowPath + CustomPath(Sort),
> > > extension is interested in alternative sort task, but not window
> > > function itself. It is natural to follow the built-in implementation,
> > > thus, it motivates extension author to take copy & paste the code.
> > > select_active_windows() is static, so extension needs to have same
> > > routine on their side.
> >
> > Well, to be perfectly blunt about it, I have said from day one that this
> > notion that a CustomScan extension will be able to cause arbitrary planner
> > behavior changes is loony.  We are simply not going to drop a hook into
> > every tenth line of the planner for you, nor de-static-ify every internal
> > function, nor (almost equivalently) expose the data structures those
> > functions produce, because it would cripple core planner development to
> > try to keep the implied APIs stable.  And I continue to maintain that any
> > actually-generally-useful ideas would be better handled by submitting them
> > as patches to the core planner, rather than trying to implement them in an
> > arms-length extension.
> >
> > In the case at hand, I notice that the WindowFuncLists struct is
> > actually from find_window_functions in clauses.c, so an extension
> > that needed to get hold of that would be unlikely to do any copying
> > and pasting anyhow -- it'd just call find_window_functions again.
> > (That only needs to search the targetlist, so it's not that expensive.)
> > The other lists you mention are all tightly tied to specific, and not
> > terribly well-designed, implementation strategies for grouping sets and
> > window functions.  Those are *very* likely to change in the near future;
> > and even if they don't, I'm skeptical that an external implementor of
> > grouping sets or window functions would want to use exactly those same
> > implementation strategies.  Maybe the only reason you're there at all
> > is you want to be smarter about the order of doing window functions,
> > for example.
> >
> > So I really don't want to export any of that stuff.
> >
> Hmm. I could understand we have active development around this area
> thus miscellaneous internal data structure may not be enough stable
> to expose the extension.
> Don't you deny recall the discussion once implementation gets calmed
> down on the future development cycle?
> 
> > As for other details of the proposed patch, I was intending to put
> > all the hook calls into grouping_planner for consistency, rather than
> > scattering them between grouping_planner and its subroutines.  So that
> > would probably mean calling the hook for a given step *before* we
> > generate the core paths for that step, not after.  Did you have a
> > reason to want the other order?  (If you say "so the hook can look
> > at the core-made paths", I'm going to say "that's a bad idea".  It'd
> > further increase the coupling between a CustomScan extension and core.)
> >
> No deep reason. I just followed the manner in scan/join path hook; that
> calls extension once the core feature adds built-in path nodes.
>
Ah, I overlooked a deep reason.
ForeignScan/CustomScan may have an alternative execution path if extension
s

Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Friday, March 25, 2016 12:27 AM
> To: Petr Jelinek
> Cc: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Wed, Mar 23, 2016 at 1:36 PM, Petr Jelinek  wrote:
> > Ok, I am happy with it, marked it as ready for committer (it was marked as
> > committed although it wasn't committed).
> 
> Thanks for fixing the status.   I had forgotten about this thread.
> 
> I can't really endorse the naming conventions here.  I mean, we've got
> the main extensible nodes stuff in extensible.h, and then we've got
> this stuff in custom_node.h (BTW, there is a leftover reference to
> custom-node.h).  There's no hint in the naming that this relates to
> scans, and why is it extensible in one place and custom in another?
> 
> I'm not quite sure how to clean this up.  At a minimum, I think we
> should standardize on "custom_scan.h" instead of "custom_node.h".  I
> think that would be clearer.  But I'm wondering if we should bite the
> bullet and rename everything from "custom" to "extensible" and declare
> it all in "extensible.h".
>
I don't have a strong reason to keep this stuff in separate files.
Both cover similar features and the amount of code is small enough.
So, the attached v4 just merges the custom-node.[ch] stuff into extensible.

Once we put similar routines close together, it may be better to
consolidate them. As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN,
both features have identical structure layouts, so it is easy to have
the functions actually called by other modules delegate to an internal
common function that registers or looks up a table of callbacks.

I'm inclined to replace EXTNODENAME_MAX_LEN and CUSTOM_NAME_MAX_LEN
with NAMEDATALEN again, to make the structure layouts match.

> src/backend/nodes/custom_node.c:45: indent with spaces.
> +}
> 
Oops, thanks,

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v9.6-custom-scan-serialization-reworks.4.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.4.patch



Re: [HACKERS] Reworks of CustomScan serialization/deserialization

2016-03-28 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, March 29, 2016 10:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Petr Jelinek; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Reworks of CustomScan serialization/deserialization
> 
> On Mon, Mar 28, 2016 at 9:36 PM, Kouhei Kaigai  wrote:
> > I don't have a strong reason to keep these stuff in separate files.
> > Both stuffs covers similar features and amount of code are enough small.
> > So, the attached v4 just merged custom-node.[ch] stuff into extensible.
> >
> > Once we put similar routines closely, it may be better to consolidate
> > these routines.
> > As long as EXTNODENAME_MAX_LEN == CUSTOM_NAME_MAX_LEN, both features
> > have identical structure layout, so it is easy to call an internal
> > common function to register or find out a table of callbacks according
> > to the function actually called by other modules.
> >
> > I'm inclined to think to replace EXTNODENAME_MAX_LEN and
> > CUSTOM_NAME_MAX_LEN by NAMEDATALEN again, to fit structure layout.
> 
> I don't think that we need both EXTNODENAME_MAX_LEN and
> CUSTOM_NAME_MAX_LEN; we can use EXTNODENAME_MAX_LEN for both.  I'm
> opposed to using NAMEDATALEN for anything unrelated to the size of a
> Name.  If it's not being stored in a catalog, it doesn't need to care.
>
OK, I adjusted the v4 patch to use EXTNODENAME_MAX_LEN for both.

The hash entry structure was revised as follows; entries are registered
via an internal common function, RegisterExtensibleNodeEntry, and looked
up via another internal common function, GetExtensibleNodeEntry.

typedef struct
{
charextnodename[EXTNODENAME_MAX_LEN];
const void *extnodemethods;
} ExtensibleNodeEntry;

ExtensibleNodeMethods and CustomScanMethods are stored separately in
'extensible_node_methods' and 'custom_scan_methods'. The entry-point
functions call the internal common functions above with the appropriate
HTAB variable.

This will be reusable if we add further extensible node types in
future versions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-scan-serialization-reworks.5.patch
Description: pgsql-v9.6-custom-scan-serialization-reworks.5.patch



Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-25 Thread Kouhei Kaigai
Sorry for my late response. I have not had enough time to touch the
code for the last 1.5 months.

The attached patch is a revised one that handles the private data of
foreign/custom scan nodes more gracefully.

The overall consensus upthread was:
- A new ExtensibleNodeMethods structure defines a unique name
  and a set of callbacks to handle node copy, serialization,
  deserialization and equality checks.
- (Foreign|Custom)(Path|Scan|ScanState) are the first hosts of the
  ExtensibleNodeMethods, to allow an extension to define a larger
  structure to store its private fields.
- ExtensibleNodeMethods does not support variable-length
  structures (for a structure with an array on its tail, use a
  separately allocated array).
- ExtensibleNodeMethods shall be registered in _PG_init() of
  extensions.

The 'pgsql-v9.6-custom-private.v3.patch' is the main part of this
feature. As I pointed out before, it uses dynahash instead of a
self-invented hash table.

Interfaces are defined as follows (not changed from v2):

  typedef struct ExtensibleNodeMethods
  {
 const char *extnodename;
 Sizenode_size;
 void  (*nodeCopy)(Node *newnode, const Node *oldnode);
 bool  (*nodeEqual)(const Node *a, const Node *b);
 void  (*nodeOut)(struct StringInfoData *str, const Node *node);
 void  (*nodeRead)(Node *node);
  } ExtensibleNodeMethods;
  
  extern void
  RegisterExtensibleNodeMethods(const ExtensibleNodeMethods *methods);
  
  extern const ExtensibleNodeMethods *
  GetExtensibleNodeMethods(const char *extnodename, bool missing_ok);


Also, 'extensible-node-example-on-pgstrom.patch' is a working
example using its "GpuScan" node.
The code below exercises copy, serialization and deserialization.

gscan = (GpuScan *)stringToNode(nodeToString(copyObject(cscan)));
elog(INFO, "GpuScan: %s", nodeToString(gscan));

Then, I could confirm that the private fields are reproduced correctly.

In addition to this, I'd like to suggest two small improvements.

For the nodeOut callback, extensions will need _outToken() and
_outBitmap(); however, these two functions are static, so entry points
for extensions are needed. (Of course, an extension can copy & paste
these small functions...)

ExtensibleNodeMethods could be registered under a unique pair of its
name and the node tag it is associated with. The current code requires
the name to be globally unique, which is a bit inconvenient.
In the case of CustomScan, an extension needs to define three nodes:
CustomPath, CustomScan and CustomScanState; thus, the
ExtensibleNodeMethods associated with each of these nodes must have an
individually unique name, like "GpuScanPath", "GpuScan" and
"GpuScanState".
If extnodename only had to be unique within a particular node type, we
could use the same name for all three.

What do you think?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Kaigai Kouhei(海外 浩平)
> Sent: Wednesday, December 02, 2015 5:52 PM
> To: 'Robert Haas'
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> > On Thu, Nov 26, 2015 at 5:27 AM, Kouhei Kaigai  wrote:
> > > I'm now implementing. The above design perfectly works on ForeignScan.
> > > On the other hands, I'd like to have deeper consideration for CustomScan.
> > >
> > > My recent patch adds LibraryName and SymbolName on CustomScanMethods
> > > to lookup the method table even if library is not loaded yet.
> > > However, this ExtensibleNodeMethods relies custom scan provider shall
> > > be loaded, by parallel infrastructure, prior to the deserialization.
> > > It means extension has a chance to register itself as well.
> > >
> > > My idea is, redefine CustomScanMethod as follows:
> > >
> > > typedef struct ExtensibleNodeMethods
> > > {
> > > const char *extnodename;
> > > Sizenode_size;
> > > Node *(*nodeCopy)(const Node *from);
> > > bool  (*nodeEqual)(const Node *a, const Node *b);
> > > void  (*nodeOut)(struct StringInfoData *str, const Node *node);
> > > void  (*nodeRead)(Node *node);
> > > } ExtensibleNodeMethods;
> > >
> > > typedef struct CustomScanMethods
> > > {
> > > union {
> > > const char *CustomName;
> > > ExtensibleNodeMethods  xnode;
> > > };
> > > /* Create execution state (CustomScanState) from a CustomScan plan 
> > > node
> > */
> > > Node   *(*CreateCustomScanState) (struct CustomScan *cscan);
> > > } CustomScanMethods;
>

Re: [HACKERS] [Proposal] Table partition + join pushdown

2016-01-25 Thread Kouhei Kaigai
> On Tue, Jan 19, 2016 at 7:59 AM, Greg Stark  wrote:
> > On Mon, Jan 18, 2016 at 5:55 PM, Robert Haas  wrote:
> >> For
> >> example, suppose that x and y are numeric columns and P(x) is
> >> length(x::text) == 3.  Then you could have 1 in one table and 1.0 in
> >> the table; they join, but P(x) is true for one and false for the
> >> other.
> >
> > Fwiw, ages ago there was some talk about having a property on
> > functions "equality preserving" or something like that. If a function,
> > or more likely a  tuple had this property set then
> > x op y => f(x) op f(y). This would be most useful for things like
> > substring or hash functions which would allow partial indexes or
> > partition exclusion to be more generally useful.
> >
> > Of course then you really want  to indicate that "a op1 b
> > => f(a) op2 f(b)" so you can handle things like  so
> > that "a < b => substring(a,n) <= substring(b,n)" and you need some way
> > to represent the extra arguments to substring and the whole thing
> > became too complex and got dropped.
> >
> > But perhaps even a simpler property that only worked for equality and
> > single-argument functions would be useful since it would let us mark
> > hash functions Or perhaps we only need to mark the few functions that
> > expose properties that don't affect equality since I think there are
> > actually very few of them.
> 
> We could certainly mark operators that amount to testing binary
> equality as such, and this optimization could be used for join
> operators so marked.  But I worry that would become a crutch, with
> people implementing optimizations that work for such operators and
> leaving numeric (for example) out in the cold.  Of course, we could
> worry about such problems when and if they happen, and accept the idea
> of markings for now.  However, I'm inclined to think that there's a
> better way to optimize the case Taiki Kondo and Kouhei Kaigai are
> targeting.
>
It seems to me Greg's idea intends to reduce CPU cycles by replacing
the operator in use. I don't deny that there are valuable scenarios;
however, we intend this feature to reduce the size of the inner hash
table, to fit into GPU RAM for example.

> If we get declarative partitioning, an oft-requested feature that has
> been worked on by various people over the years and currently by Amit
> Langote, and specifically if we get hash partitioning, then we'll
> presumably use the hash function for the default operator class of the
> partitioning column's datatype to partition the table.  Then, if we do
> a join against some other table and consider a hash join, we'll be
> using the same hash function on our side, and either the same operator
> or a compatible operator for some other datatype in the same opfamily
> on the other side.  At that point, if we push down the join, we can
> add a filter on the inner side of the join that the hash value of the
> matching column has to map to the partition it's being joined against.
> And we don't get a recurrence of this problem in that case, because
> we're not dealing with an arbitrary predicate - we're dealing with a
> hash function whose equality semantics are defined to be compatible
> with the join operator.
> 
> That approach works with any data type that has a default hash
> operator class, which covers pretty much everything anybody is likely
> to care about, including numeric.
>
Except for the usage of CHECK constraints, the above description is almost
the same as what we are intending. Hash-joinable operators are expected to
check equality of both sides at least; thus, we can predict which inner
columns shall have identical values once both tuples are joined.
Then, we can filter out obviously unmatched rows.

> At that point, if we push down the join, we can
> add a filter on the inner side of the join that the hash value of the
> matching column has to map to the partition it's being joined against.
>
Of course, the implementation is not graceful enough, especially on the
above point, because this extra filter changes the expected number of rows
produced by the inner relation, and the relevant cost.
Right now, his patch calls cost_seqscan() and others by itself, according
to the type of the inner relation. Of course, that is not a portable way
if the inner relation is not a simple relation scan.

Due to the path construction staging, an AppendPath with underlying join
paths has to be constructed during the join path investigation step. So,
what is a reasonable way to build the inner relation's path node with the
filters pushed down?
It is the ugliest part of the current patch.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CustomScan under the Gather node?

2016-01-25 Thread Kouhei Kaigai
Hello,

What enhancements will be necessary to implement a feature similar to
partial seq-scan using the custom-scan interface?

It seems to me callbacks at the three points below are needed.
* ExecParallelEstimate
* ExecParallelInitializeDSM
* ExecParallelInitializeWorker

Anything else?
Does ForeignScan also need an equivalent enhancement?



Background of my motivation is the slides below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)

I'm investigating an SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on an NVMe SSD to GPU RAM using peer-to-peer DMA, prior to
loading the data onto CPU/RAM. (Probably, only all-visible blocks
shall be loaded, as in an index-only scan.)
Once we load the data blocks onto GPU RAM, we can reduce the rows
that would otherwise be filtered out later while consuming CPU RAM.
An expected major bottleneck is the CPU thread that issues the
peer-to-peer DMA requests to the device, rather than the GPU tasks.
So, utilizing parallel execution is a natural thought.
However, a CustomScan node on top of an underlying PartialSeqScan
node is not sufficient, because that would first load the data blocks
onto CPU RAM, and then P2P DMA would not make sense.

The expected "GpuSsdScan" on CustomScan will reference a shared
block index to be incremented by multiple backends, and then it
enqueues a P2P DMA request (if the block is all-visible) to the device
driver. It then receives only the rows visible according to the scan
qualifiers. It is almost equivalent to SeqScan, but wants to bypass
the heap layer to utilize the SSD-to-GPU direct data transfer path.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] CustomScan under the Gather node?

2016-01-26 Thread Kouhei Kaigai
> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Wednesday, January 27, 2016 2:30 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Tue, Jan 26, 2016 at 12:00 PM, Kouhei Kaigai  wrote:
> >
> > Hello,
> >
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> 
> I don't think so.
> 
> > Does ForeignScan also need equivalent enhancement?
> 
> I think this depends on the way ForeignScan is supposed to be
> parallelized, basically if it needs to coordinate any information
> with other set of workers, then it will require such an enhancement.
>
After the post yesterday, I was reminded of a possible scenario around
FDWs that manage their own private storage, like cstore_fdw.

Probably, a ForeignScan node running on a columnar store (for example)
will need coordination information, just as partial seq-scan does.
It is a case very similar to an implementation on local storage.

On the other hand, if we try to parallelize postgres_fdw (or others)
with background workers, I doubt whether we need this coordination
information on the local side. The remote query would have an additional
qualifier to skip blocks already fetched for this purpose.
At least, it does not need any special enhancement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] CustomScan under the Gather node?

2016-01-27 Thread Kouhei Kaigai
> On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai  wrote:
> > What enhancement will be necessary to implement similar feature of
> > partial seq-scan using custom-scan interface?
> >
> > It seems to me callbacks on the three points below are needed.
> > * ExecParallelEstimate
> > * ExecParallelInitializeDSM
> > * ExecParallelInitializeWorker
> >
> > Anything else?
> > Does ForeignScan also need equivalent enhancement?
> 
> For postgres_fdw, running the query from a parallel worker would
> change the transaction semantics.  Suppose you begin a transaction,
> UPDATE data on the foreign server, and then run a parallel query.  If
> the leader performs the ForeignScan it will see the uncommitted
> UPDATE, but a worker would have to make its own connection which not
> be part of the same transaction and which would therefore not see the
> update.  That's a problem.
>
Ah, yes. As long as the FDW driver ensures the remote session has no
uncommitted data, pg_export_snapshot() might provide us an opportunity;
however, once a session writes something, the FDW driver has to prohibit it.

> Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> is that most of the work is being done on the remote side, so doing
> the work in a parallel worker doesn't seem super interesting.  Instead
> of incurring transfer costs to move the data from remote to local, we
> incur two sets of transfer costs: first remote to local, then worker
> to leader.  Ouch.  I think a more promising line of inquiry is to try
> to provide asynchronous execution when we have something like:
> 
> Append
> -> Foreign Scan
> -> Foreign Scan
> 
> ...so that we can return a row from whichever Foreign Scan receives
> data back from the remote server first.
> 
> So it's not impossible that an FDW author could want this, but mostly
> probably not.  I think.
>
Yes, I have the same opinion. Likely, local parallelism is not
valuable for the class of FDWs that obtain data from a remote
server (e.g., postgres_fdw, ...), except for the case when the packing
and unpacking cost over the network is the major bottleneck.

On the other hand, it will be valuable for the class of FDWs that
act as a wrapper around a local data structure, as partial seq-scan
currently does (e.g., file_fdw, ...).
Their data source is not under transaction control, and the 'remote
execution' of these FDWs is eventually performed on local
computing resources.

If I make a proof-of-concept patch with the interface itself, it
seems to me file_fdw may be a good candidate for this enhancement.
It is not territory for postgres_fdw.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
>
The attached patch is an enhancement of the FDW/CSP interface and a PoC
feature of file_fdw to scan the source file partially. It was a smaller
enhancement than I expected.

It works as follows. This query tries to read 20M rows from a CSV file,
using 3 background worker processes.

postgres=# set max_parallel_degree = 3;
SET
postgres=# explain analyze select * from test_csv where id % 20 = 6;
                              QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=1000.00..194108.60 rows=94056 width=52)
         (actual time=0.570..19268.010 rows=200 loops=1)
   Number of Workers: 3
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
         (actual time=0.180..12744.655 rows=50 loops=4)
         Filter: ((id % 20) = 6)
         Rows Removed by Filter: 950
         Foreign File: /tmp/testdata.csv
         Foreign File Size: 1504892535
 Planning time: 0.147 ms
 Execution time: 19330.201 ms
(9 rows)


I'm not 100% certain whether this implementation of file_fdw is reasonable
for partial reads; however, the callbacks located at the following functions
made it possible to implement parallel-aware custom logic based on the
coordination information.

> * ExecParallelEstimate
> * ExecParallelInitializeDSM
> * ExecParallelInitializeWorker

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Kaigai Kouhei(海外 浩平)
> Sent: Thursday, January 28, 2016 9:33 AM
> To: 'Robert Haas'
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Tue, Jan 26, 2016 at 1:30 AM, Kouhei Kaigai  wrote:
> > > What enhancement will be necessary to implement similar feature of
> > > partial seq-scan using custom-scan interface?
> > >
> > > It seems to me callbacks on the three points below are needed.
> > > * ExecParallelEstimate
> > > * ExecParallelInitializeDSM
> > > * ExecParallelInitializeWorker
> > >
> > > Anything else?
> > > Does ForeignScan also need equivalent enhancement?
> >
> > For postgres_fdw, running the query from a parallel worker would
> > change the transaction semantics.  Suppose you begin a transaction,
> > UPDATE data on the foreign server, and then run a parallel query.  If
> > the leader performs the ForeignScan it will see the uncommitted
> > UPDATE, but a worker would have to make its own connection which not
> > be part of the same transaction and which would therefore not see the
> > update.  That's a problem.
> >
> Ah, yes, as long as FDW driver ensure the remote session has no
> uncommitted data, pg_export_snapshot() might provide us an opportunity,
> however, once a session writes something, FDW driver has to prohibit it.
> 
> > Also, for postgres_fdw, and many other FDWs I suspect, the assumption
> > is that most of the work is being done on the remote side, so doing
> > the work in a parallel worker doesn't seem super interesting.  Instead
> > of incurring transfer costs to move the data from remote to local, we
> > incur two sets of transfer costs: first remote to local, then worker
> > to leader.  Ouch.  I think a more promising line of inquiry is to try
> > to provide asynchronous execution when we have something like:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> >
> > ...so that we can return a row from whichever Foreign Scan receives
> > data back from the remote server first.
> >
> > So it's not impossible that an FDW author could want this, but mostly
> > probably not.  I think.
> >
> Yes, I also have same opinion. Likely, local parallelism is not
> valuable for the class of FDWs that obtains data from the remote
> server (e.g, postgres_fdw, ...), expect for the case when packing
> and unpacking cost over the network is major bottleneck.
> 
> On the other hands, it will be valuable for the class of FDW that
> performs as a wrapper to local data structure, as like current
> partial seq-scan doing. (e.g, file_fdw, ...)
> Its data source is not under the transaction control, and 'remote
> execution' of these FDWs are eventually executed on the local
> computing resources.
> 
> If I would make a proof-of-concept patch with interface itself, it
> seems to me file_fdw may be a good candidate for this enhancement.
> It is not a field for postgres_fdw.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei 



pgsql-v9.6-parallel-cspfdw.v1.patch
Description: pgsql-v9.6-parallel-cspfdw.v1.patch



Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
> On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai  wrote:
> >> If I would make a proof-of-concept patch with interface itself, it
> >> seems to me file_fdw may be a good candidate for this enhancement.
> >> It is not a field for postgres_fdw.
> >>
> > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > of file_fdw to scan source file partially. It was smaller enhancement
> > than my expectations.
> >
> > It works as follows. This query tried to read 20M rows from a CSV file,
> > using 3 background worker processes.
> >
> > postgres=# set max_parallel_degree = 3;
> > SET
> > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> >   QUERY PLAN
> >
> 
> 
> >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> >  (actual time=0.570..19268.010 rows=200 loops=1)
> >Number of Workers: 3
> >->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056
> width=52)
> >   (actual time=0.180..12744.655 rows=50
> loops=4)
> >  Filter: ((id % 20) = 6)
> >  Rows Removed by Filter: 950
> >  Foreign File: /tmp/testdata.csv
> >  Foreign File Size: 1504892535
> >  Planning time: 0.147 ms
> >  Execution time: 19330.201 ms
> > (9 rows)
> 
> Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> and post the times for all?
>
The above query has 5% selectivity on the entire CSV file.
Its execution times (total, and ForeignScan only) are below:

             total           ForeignScan          diff
0 workers:   17584.319 ms    17555.904 ms      28.415 ms
1 workers:   18464.476 ms    18110.968 ms     353.508 ms
2 workers:   19042.755 ms    14580.335 ms    4462.420 ms
3 workers:   19318.254 ms    12668.912 ms    6649.342 ms
4 workers:   21732.910 ms    13596.788 ms    8136.122 ms
5 workers:   23486.846 ms    14533.409 ms    8953.437 ms

This workstation has 4 CPU cores, so it is natural that nworkers=3 records
the peak performance on the ForeignScan portion. On the other hand,
nworkers>1 also recorded non-negligible time consumption elsewhere
(probably in the Gather node?)

An interesting observation was that lower selectivity (1% and 0%) didn't
change the result much. Something other than file_fdw consumes CPU time.

* selectivity 1%
   total   ForeignScan   diff
0 workers: 17573.572 ms   17566.875 ms  6.697 ms
1 workers: 18098.070 ms   18020.790 ms 77.280 ms
2 workers: 18676.078 ms   14600.749 ms   4075.329 ms
3 workers: 18830.597 ms   12731.459 ms   6099.138 ms
4 workers: 21015.842 ms   13590.657 ms   7425.185 ms
5 workers: 22865.496 ms   14634.342 ms   8231.154 ms

* selectivity 0% (...so Gather didn't work hard actually)
             total           ForeignScan        diff
0 workers: 17551.011 ms   17550.811 ms  0.200 ms
1 workers: 18055.185 ms   18048.975 ms  6.210 ms
2 workers: 18567.660 ms   14593.974 ms   3973.686 ms
3 workers: 18649.819 ms   12671.429 ms   5978.390 ms
4 workers: 20619.184 ms   13606.715 ms   7012.469 ms
5 workers: 22557.575 ms   14594.420 ms   7963.155 ms

Further investigation will be needed.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

postgres=# explain analyze select * from test_csv where id % 100 = 100;
                              QUERY PLAN
--------------------------------------------------------------------------
 Foreign Scan on test_csv  (cost=0.00..2158874.49 rows=94056 width=52)
   (actual time=17550.811..17550.811 rows=0 loops=1)
   Filter: ((id % 100) = 100)
   Rows Removed by Filter: 2000
   Foreign File: /tmp/testdata.csv
   Foreign File Size: 1504892535
 Planning time: 1.175 ms
 Execution time: 17551.011 ms
(7 rows)

postgres=# SET max_parallel_degree = 1;
SET
postgres=# explain analyze select * from test_csv where id % 100 = 100;
                              QUERY PLAN
--------------------------------------------------------------------------
 Gather  (cost=1000.00..194108.60 rows=94056 width=52)
   (actual time=18054.651..18054.651 rows=0 loops=1)
   Number of Workers: 1
   ->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056 width=52)
         (actual time=18048.975..18048.975 rows=0 loops=2)
         Filter: ((id % 100) = 100)
         Rows Removed by Filter: 2000
         Foreign File: /tmp/testdata.csv
         Foreign File Size: 1504892535
 Planning time: 0.461 ms
 E

Re: [HACKERS] CustomScan under the Gather node?

2016-01-28 Thread Kouhei Kaigai
>  total ForeignScandiff
> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
  :
> Further investigation will need
>
It was a bug in my file_fdw patch. The ForeignScan node in the master
process was also kicked by the Gather node; however, it didn't have the
coordination information due to an oversight in the initialization at the
InitializeDSMForeignScan callback.
As a result, the local ForeignScan node was still executed after the
completion of the coordinated background worker processes, and returned
twice the amount of rows.

In the revised patch, the results seem reasonable:
             total           ForeignScan        diff
0 workers:   17592.498 ms    17564.457 ms     28.041 ms
1 workers:   12152.998 ms    11983.485 ms    169.513 ms
2 workers:   10647.858 ms    10502.100 ms    145.758 ms
3 workers:    9635.445 ms     9509.899 ms    125.546 ms
4 workers:   11175.456 ms    10863.293 ms    312.163 ms
5 workers:   12586.457 ms    12279.323 ms    307.134 ms

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Friday, January 29, 2016 8:51 AM
> To: Robert Haas
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] CustomScan under the Gather node?
> 
> > On Thu, Jan 28, 2016 at 10:50 AM, Kouhei Kaigai  
> > wrote:
> > >> If I would make a proof-of-concept patch with interface itself, it
> > >> seems to me file_fdw may be a good candidate for this enhancement.
> > >> It is not a field for postgres_fdw.
> > >>
> > > The attached patch is enhancement of FDW/CSP interface and PoC feature
> > > of file_fdw to scan source file partially. It was smaller enhancement
> > > than my expectations.
> > >
> > > It works as follows. This query tried to read 20M rows from a CSV file,
> > > using 3 background worker processes.
> > >
> > > postgres=# set max_parallel_degree = 3;
> > > SET
> > > postgres=# explain analyze select * from test_csv where id % 20 = 6;
> > >   QUERY PLAN
> > >
> >
> 
> > 
> > >  Gather  (cost=1000.00..194108.60 rows=94056 width=52)
> > >  (actual time=0.570..19268.010 rows=200 loops=1)
> > >Number of Workers: 3
> > >->  Parallel Foreign Scan on test_csv  (cost=0.00..183703.00 rows=94056
> > width=52)
> > >   (actual time=0.180..12744.655
> rows=50
> > loops=4)
> > >  Filter: ((id % 20) = 6)
> > >  Rows Removed by Filter: 950
> > >  Foreign File: /tmp/testdata.csv
> > >  Foreign File Size: 1504892535
> > >  Planning time: 0.147 ms
> > >  Execution time: 19330.201 ms
> > > (9 rows)
> >
> > Could you try it not in parallel and then with 1, 2, 3, and 4 workers
> > and post the times for all?
> >
> The above query has 5% selectivity on the entire CSV file.
> Its execution time (total, only ForeignScan) are below
> 
>  total ForeignScandiff
> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> 
> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> peak performance on ForeignScan portion. On the other hands, nworkers>1 also
> recorded unignorable time consumption (probably, by Gather node?)
> 
> An interesting observation was, less selectivity (1% and 0%) didn't change the
> result so much. Something consumes CPU time other than file_fdw.
> 
> * selectivity 1%
>total   ForeignScan   diff
> 0 workers: 17573.572 ms   17566.875 ms  6.697 ms
> 1 workers: 18098.070 ms   18020.790 ms 77.280 ms
> 2 workers: 18

Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-28 Thread Kouhei Kaigai
> On Mon, Jan 25, 2016 at 8:06 PM, Kouhei Kaigai  wrote:
> > Sorry for my late response. I've been unavailable to have enough
> > time to touch code for the last 1.5 month.
> >
> > The attached patch is a revised one to handle private data of
> > foregn/custom scan node more gracefully.
> >
> > The overall consensus upthread were:
> > - A new ExtensibleNodeMethods structure defines a unique name
> >   and a set of callbacks to handle node copy, serialization,
> >   deserialization and equality checks.
> > - (Foreign|Custom)(Path|Scan|ScanState) are first host of the
> >   ExtensibleNodeMethods, to allow extension to define larger
> >   structure to store its private fields.
> > - ExtensibleNodeMethods does not support variable length
> >   structure (like a structure with an array on the tail, use
> >   separately allocated array).
> > - ExtensibleNodeMethods shall be registered on _PG_init() of
> >   extensions.
> >
> > The 'pgsql-v9.6-custom-private.v3.patch' is the main part of
> > this feature. As I pointed out before, it uses dynhash instead
> > of the self invented hash table.
> 
> On a first read-through, I see nothing in this patch to which I would
> want to object except for the fact that the comments and documentation
> need some work from a native speaker of English.  It looks like what
> we discussed, and I think it's an improvement over what we have now.
>
Thanks,

Do you think we should allow registering the same extensible node name for
different node types? Like "GpuJoin" for any of CustomPath, CustomScan
and CustomScanState. Or do we avoid this by using a different name for each?

--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-01-28 Thread Kouhei Kaigai
> On Thu, Jan 28, 2016 at 10:18 PM, Kouhei Kaigai  wrote:
> > Do you think we shall allow to register same extensible node name for
> > different node types? Like, "GpuJoin" for any of CustomPath, CustomScan
> > and CustomScanState. Or, do we avoid this using different name for each?
> 
> I'd say a different name for each.  That's our current convention, and
> I don't see much reason to change it.
>
OK, it is not a serious problem, at least for my use cases.
A convention like "GpuJoinPath", "GpuJoin" and "GpuJoinState" is sufficient.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




[HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-01-31 Thread Kouhei Kaigai
Hello,

Do we have a reliable way to check whether a particular heap block
is already in the shared buffers, without modifying them?

Right now, ReadBuffer and ReadBufferExtended are the entry points to the
buffer manager for extensions. However, they try to acquire an available
buffer, evicting a victim buffer if necessary, regardless of the
ReadBufferMode.

It is different from what I want to do:
 1. Check whether the supplied BlockNum is already loaded in the
    shared buffers.
 2. If yes, the caller gets a buffer descriptor as with the usual ReadBuffer.
 3. If not, the caller gets InvalidBuffer, with no modification of the
    shared buffers and no victim buffer allocation.

It would allow extensions (likely a custom scan provider) to take
different strategies for a large table scan, according to the
latest status of individual blocks.
If we don't have such an interface, it seems to me an enhancement
of ReadBuffer_common and (Local)BufferAlloc is the only way
to implement the feature.

Of course, we need to investigate carefully the definition of a 'valid'
buffer. What about a buffer with BM_IO_IN_PROGRESS? What about a buffer
that needs a storage extend (thus, the relevant physical storage does
not exist yet)? ... and so on.


As an aside, background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides in JPUG conference last Dec)

I'm investigating an SSD-to-GPU direct feature on top of
the custom-scan interface. It intends to load a bunch of data
blocks on an NVMe SSD to GPU RAM using P2P DMA, prior to loading
the data onto CPU/RAM, to preprocess the data to be filtered out.
It only makes sense if the target blocks are not loaded into
CPU/RAM yet, because the SSD device is essentially slower than RAM.
So, I'd like a reliable way to check the latest status of
the shared buffers, to know whether a particular block is already
loaded or not.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-01 Thread Kouhei Kaigai
> On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to kwon whether a particular block is already
> > loaded or not.
> 
> That completely ignores the OS cache though... wouldn't that be a major
> issue?
>
Once we can ensure the target block is not cached in the shared buffers,
it is the job of the driver that supports P2P DMA to handle the OS page
cache. Once the driver gets a P2P DMA request from PostgreSQL, it checks
the OS page cache status and determines the DMA source: either the OS
buffer or the SSD block.

> To answer your direct question, I'm no expert, but I haven't seen any
> functions that do exactly what you want. You'd have to pull relevant
> bits from ReadBuffer_*. Or maybe a better method would just be to call
> BufTableLookup() without any locks and if you get a result > -1 just
> call the relevant ReadBuffer function. Sometimes you'll end up calling
> ReadBuffer even though the buffer isn't in shared buffers, but I would
> think that would be a rare occurrence.
>
Thanks. Indeed, an extension can call BufTableLookup(); PrefetchBuffer()
has a good example of this.

If it returns a valid buf_id, we have nothing difficult to do; just call
ReadBuffer() to pin the buffer.
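For reference, here is a minimal sketch of that lookup, mirroring the
pattern used in PrefetchBuffer(). It relies on backend-internal APIs
(INIT_BUFFERTAG, BufTableHashCode, BufMappingPartitionLock), so it only
compiles inside the server; the variable names are my own:

```c
BufferTag   tag;            /* (relation, forknum, blocknum) key */
uint32      hash;
LWLock     *partitionLock;
int         buf_id;

INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
hash = BufTableHashCode(&tag);
partitionLock = BufMappingPartitionLock(hash);

LWLockAcquire(partitionLock, LW_SHARED);
buf_id = BufTableLookup(&tag, hash);
LWLockRelease(partitionLock);

if (buf_id >= 0)
{
    /* already cached: pin it with ReadBuffer() and use the RAM copy */
}
else
{
    /* not cached: enqueue a P2P DMA request for the SSD blocks */
}
```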

Otherwise, when BufTableLookup() returns a negative value, it means the
pair of (relation, forknum, blocknum) does not exist on the shared buffer.
So, the extension enqueues a P2P DMA request for asynchronous transfer,
and the driver processes the P2P DMA shortly afterwards.
Concurrent access may always happen. PostgreSQL uses MVCC, so the backend
which issued the P2P DMA does not need to pay attention to new tuples that
did not exist at executor start time, even if another backend loads and
updates the same buffer just after the above BufTableLookup().

On the other hand, we have to pay attention to whether a fraction of
the buffer page has been partially written to the OS buffer or storage.
That is in the scope of the operating system, so it is not controllable
from our side.

One idea is a temporary suspension of FlushBuffer() for a particular pair
of (relation, forknum, blocknum) until the P2P DMA gets completed. Even if
a concurrent backend updates the buffer page after the BufTableLookup(),
this prevents the OS caches and storage from getting dirty during the
P2P DMA.

What are people's thoughts on this?
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-01 Thread Kouhei Kaigai
> KaiGai-san,
> 
> On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > As an aside, background of my motivation is the slide below:
> > http://www.slideshare.net/kaigai/sqlgpussd-english
> > (LT slides in JPUG conference last Dec)
> >
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to know whether a particular block is already
> > loaded or not.
> 
> Quite interesting stuff, thanks for sharing!
> 
> I'm in no way expert on this but could this generally be attacked from the
> smgr API perspective? Currently, we have only one implementation - md.c
> (the hard-coded RelationData.smgr_which = 0). If we extended that and
> provided end-to-end support so that there would be md.c alternatives to
> storage operations, I guess that would open up opportunities for
> extensions to specify smgr_which as an argument to ReadBufferExtended(),
> provided there is already support in place to install md.c alternatives
> (perhaps in .so). Of course, these are just musings and, perhaps does not
> really concern the requirements of custom scan methods you have been
> developing.
>
Thanks for your idea. Indeed, the smgr hooks are a good candidate for
implementing the feature; however, what I need is a thin intermediation
layer rather than an alternative storage engine.

It has become clear that we need two features here.
1. A feature to check whether a particular block is already on the shared
   buffer pool.
   This is already available: BufTableLookup() under the
   BufMappingPartitionLock gives us the information we want.

2. A feature to suspend i/o write-out towards particular blocks
   that are registered by another concurrent backend, until they are
   unregistered (usually, at the end of the P2P DMA).
   ==> to be discussed.

When we call smgrwrite(), as in FlushBuffer(), it fetches a function
pointer from the 'smgrsw' array, then calls smgr_write:

  void
  smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
  {
      (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                buffer, skipFsync);
  }

If an extension overwrites the smgrsw[] array and then calls the original
function under its own control, it can suspend the call of the original
smgr_write until completion of the P2P DMA.

It may be a minimally invasive way to implement this, and portable to any
future storage layers.

What is your thought? It is a bit different from your original
proposition.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-02 Thread Kouhei Kaigai
> > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> 
> > > To answer your direct question, I'm no expert, but I haven't seen any
> > > functions that do exactly what you want. You'd have to pull relevant
> > > bits from ReadBuffer_*. Or maybe a better method would just be to call
> > > BufTableLookup() without any locks and if you get a result > -1 just
> > > call the relevant ReadBuffer function. Sometimes you'll end up calling
> > > ReadBuffer even though the buffer isn't in shared buffers, but I would
> > > think that would be a rare occurrence.
> > >
> > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
> > has a good example for this.
> >
> > If it returned a valid buf_id, we have nothing difficult; just call
> > ReadBuffer() to pin the buffer.
> 
> Isn't this what (or very similar to)
> ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
>
That operation actually acquires a buffer page and fills it with zeros;
if there is no free buffer page, a valid buffer page gets evicted.
I want to keep the contents of the shared buffer that are already loaded
into main memory. P2P DMA and GPU preprocessing intend to minimize main
memory consumption by the rows that will be filtered out by the scan
qualifiers.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] CustomScan under the Gather node?

2016-02-03 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, February 04, 2016 2:54 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] CustomScan under the Gather node?
> 
> On Thu, Jan 28, 2016 at 8:14 PM, Kouhei Kaigai  wrote:
> >>              total          ForeignScan      diff
> >> 0 workers: 17584.319 ms   17555.904 ms  28.415 ms
> >> 1 workers: 18464.476 ms   18110.968 ms 353.508 ms
> >> 2 workers: 19042.755 ms   14580.335 ms4462.420 ms
> >> 3 workers: 19318.254 ms   12668.912 ms6649.342 ms
> >> 4 workers: 21732.910 ms   13596.788 ms8136.122 ms
> >> 5 workers: 23486.846 ms   14533.409 ms8953.437 ms
> >>
> >> This workstation has 4 CPU cores, so it is natural nworkers=3 records the
> >> peak performance on ForeignScan portion. On the other hands, nworkers>1 
> >> also
> >> recorded unignorable time consumption (probably, by Gather node?)
> >   :
> >> Further investigation will need
> >>
> > It was a bug of my file_fdw patch. ForeignScan node in the master process 
> > was
> > also kicked by the Gather node, however, it didn't have coordinate 
> > information
> > due to oversight of the initialization at InitializeDSMForeignScan callback.
> > In the result, local ForeignScan node is still executed after the completion
> > of coordinated background worker processes, and returned twice amount of 
> > rows.
> >
> > In the revised patch, results seems to me reasonable.
> >              total          ForeignScan      diff
> > 0 workers: 17592.498 ms   17564.457 ms    28.041 ms
> > 1 workers: 12152.998 ms   11983.485 ms   169.513 ms
> > 2 workers: 10647.858 ms   10502.100 ms   145.758 ms
> > 3 workers:  9635.445 ms    9509.899 ms   125.546 ms
> > 4 workers: 11175.456 ms   10863.293 ms   312.163 ms
> > 5 workers: 12586.457 ms   12279.323 ms   307.134 ms
> 
> Hmm.  Is the file_fdw part of this just a demo, or do you want to try
> to get that committed?  If so, maybe start a new thread with a more
> appropriate subject line to just talk about that.  I haven't
> scrutinized that part of the patch in any detail, but the general
> infrastructure for FDWs and custom scans to use parallelism seems to
> be in good shape, so I rewrote the documentation and committed that
> part.
>
Thanks. I expect the file_fdw part is just for demonstration.
Unlike GpuScan of PG-Strom, it does not require any special hardware to
reproduce this parallel execution.

> Do you have any idea why this isn't scaling beyond, uh, 1 worker?
> That seems like a good thing to try to figure out.
>
The hardware I ran the above query on has 4 CPU cores, so it is not
surprising that 3 workers (+ 1 master) recorded the peak performance.

In addition, the enhancement of the file_fdw part is corner-cutting work.

It picks up the next line number to be fetched from the shared memory
segment using pg_atomic_add_fetch_u32(), then reads the input file
until the worker reaches the target line; unrelated lines are ignored.
Each individual worker parses only its own responsible lines, thus
parallel execution makes sense in this part. On the other hand, the total
amount of CPU cycles for the file scan will increase, because all the
workers still have to scan all the lines.

If we simply split the time consumption in the 0-worker case as follows:

  (time to scan file; TSF) + (time to parse lines; TPL)

then the total amount of work when we distribute file_fdw across N
workers is:

  N * (TSF) + (TPL)

Thus, each individual worker has to process the following amount of work:

  (TSF) + (TPL)/N

It is a typical instance of Amdahl's law when the sequential part is not
small. The above results suggest the TSF part is about 7.4s and the TPL
part is about 10.1s.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-03 Thread Kouhei Kaigai
> On Wed, Jan 27, 2016 at 9:36 PM, Robert Haas  wrote:
> > On Mon, Jan 25, 2016 at 8:06 PM, Kouhei Kaigai  wrote:
> >> Sorry for my late response. I've been unavailable to have enough
> >> time to touch code for the last 1.5 month.
> >>
> >> The attached patch is a revised one to handle private data of
> >> foregn/custom scan node more gracefully.
> >>
> >> The overall consensus upthread were:
> >> - A new ExtensibleNodeMethods structure defines a unique name
> >>   and a set of callbacks to handle node copy, serialization,
> >>   deserialization and equality checks.
> >> - (Foreign|Custom)(Path|Scan|ScanState) are first host of the
> >>   ExtensibleNodeMethods, to allow extension to define larger
> >>   structure to store its private fields.
> >> - ExtensibleNodeMethods does not support variable length
> >>   structure (like a structure with an array on the tail, use
> >>   separately allocated array).
> >> - ExtensibleNodeMethods shall be registered on _PG_init() of
> >>   extensions.
> >>
> >> The 'pgsql-v9.6-custom-private.v3.patch' is the main part of
> >> this feature. As I pointed out before, it uses dynhash instead
> >> of the self invented hash table.
> >
> > On a first read-through, I see nothing in this patch to which I would
> > want to object except for the fact that the comments and documentation
> > need some work from a native speaker of English.  It looks like what
> > we discussed, and I think it's an improvement over what we have now.
> 
> Well, looking at this a bit more, it seems like the documentation
> you've written here is really misplaced.  The patch is introducing a
> new facility that applies to both CustomScan and ForeignScan, but the
> documentation is only to do with CustomScan.  I think we need a whole
> new chapter on extensible nodes, or something.  I'm actually not
> really keen on the fact that we keep adding SGML documentation for
> this stuff; it seems like it belongs in a README in the source tree.
> We don't explain nodes in general, but now we're going to have to try
> to explain extensible nodes.  How's that going to work?
>
The details of these callbacks are not for end-users, administrators and
so on, only for core/extension developers. So, I like the idea of not
having such a detailed description in SGML.
How about having more detailed source code comments close to the
definition of ExtensibleNodeMethods?
As for src/backend/nodes/README, it has only 6 updates in its history
since the Postgres95 era. I guess people may forget to update a README
file if the description is located separately from the implementation.

> I think you should avoid the call to GetExtensibleNodeMethods() in the
> case where extnodename is NULL.  On the other hand, I think that if
> extnodename is non-NULL, all four methods should be required, so that
> you don't have to check if (methods && methods->nodeRead) but just if
> (extnodename) { methods = GetExtensibleNodeMethods(extnodename);
> methods->nodeRead( ... ); }.  That seems like it would be a bit
> tidier.
>
OK, I'll fix it up. There is no need to have a 'missing_ok' argument here.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-03 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, February 04, 2016 11:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> On Wed, Feb 3, 2016 at 8:00 PM, Kouhei Kaigai  wrote:
> >> Well, looking at this a bit more, it seems like the documentation
> >> you've written here is really misplaced.  The patch is introducing a
> >> new facility that applies to both CustomScan and ForeignScan, but the
> >> documentation is only to do with CustomScan.  I think we need a whole
> >> new chapter on extensible nodes, or something.  I'm actually not
> >> really keen on the fact that we keep adding SGML documentation for
> >> this stuff; it seems like it belongs in a README in the source tree.
> >> We don't explain nodes in general, but now we're going to have to try
> >> to explain extensible nodes.  How's that going to work?
> >>
> > The detail of these callbacks are not for end-users, administrators and
> > so on except for core/extension developers. So, I loves idea not to have
> > such a detailed description in SGML.
> > How about an idea to have more detailed source code comments close to
> > the definition of ExtensibleNodeMethods?
> > I haven't seen the src/backend/nodes/README yet, and it has only 6 updates
> > history from Postgres95 era. I guess people may forget to update README
> > file if description is separately located from the implementation.
> 
> Hmm, that might work, although that file is so old that it may be
> difficult to add to.  Another idea is: maybe we could have a header
> file for the extensible node stuff and just give it a really long
> header comment.
>
At this moment, I have tried writing up the description in nodes/nodes.h.
The amount of description is about 100 lines. It is on the borderline of
whether we should split this chunk off into another header file, in my
view.


On the other hand, I noticed that we cannot omit the checks for the
individual callbacks on the Custom node type: ExtensibleNodeMethods is
embedded in the CustomMethods structure, thus we may have a Custom node
with no extensible feature.
This manner is beneficial because the extension does not need to register
the library and symbol name for serialization. So, the CustomScan related
code still checks the existence of the individual callbacks.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 


pgsql-v9.6-custom-private.v4.patch
Description: pgsql-v9.6-custom-private.v4.patch



Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-03 Thread Kouhei Kaigai
> > KaiGai-san,
> >
> > On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > > As an aside, background of my motivation is the slide below:
> > > http://www.slideshare.net/kaigai/sqlgpussd-english
> > > (LT slides in JPUG conference last Dec)
> > >
> > > I'm under investigation of SSD-to-GPU direct feature on top of
> > > the custom-scan interface. It intends to load a bunch of data
> > > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > > It only makes sense if the target blocks are not loaded to the
> > > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > > So, I like to have a reliable way to check the latest status of
> > > the shared buffer, to know whether a particular block is already
> > > loaded or not.
> >
> > Quite interesting stuff, thanks for sharing!
> >
> > I'm in no way expert on this but could this generally be attacked from the
> > smgr API perspective? Currently, we have only one implementation - md.c
> > (the hard-coded RelationData.smgr_which = 0). If we extended that and
> > provided end-to-end support so that there would be md.c alternatives to
> > storage operations, I guess that would open up opportunities for
> > extensions to specify smgr_which as an argument to ReadBufferExtended(),
> > provided there is already support in place to install md.c alternatives
> > (perhaps in .so). Of course, these are just musings and, perhaps does not
> > really concern the requirements of custom scan methods you have been
> > developing.
> >
> Thanks for your idea. Indeed, smgr hooks are good candidate to implement
> the feature, however, what I need is a thin intermediation layer rather
> than alternative storage engine.
> 
> It becomes clear we need two features here.
> 1. A feature to check whether a particular block is already on the shared
>buffer pool.
>It is available. BufTableLookup() under the BufMappingPartitionLock
>gives us the information we want.
> 
> 2. A feature to suspend i/o write-out towards a particular blocks
>that are registered by other concurrent backend, unless it is not
>unregistered (usually, at the end of P2P DMA).
>==> to be discussed.
> 
> When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
> from the 'smgrsw' array, then calls smgr_write.
> 
>   void
>   smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
> char *buffer, bool skipFsync)
>   {
>   (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
> buffer, skipFsync);
>   }
> 
> If extension would overwrite smgrsw[] array, then call the original
> function under the control by extension, it allows to suspend the call
> of the original smgr_write until completion of P2P DMA.
> 
> It may be a minimum invasive way to implement, and portable to any
> further storage layers.
> 
> How about your thought? Even though it is a bit different from your
> original proposition.
>
I tried to design a draft of an enhancement to realize the above i/o
write-out suspend/resume, in as minimally invasive a way as possible.

  ASSUMPTION: I intend to implement this feature as part of an extension,
  because these i/o suspend/resume checks are pure overhead for the core
  features unless an extension utilizes them.

Three functions shall be added:

extern int    GetStorageMgrNumbers(void);
extern f_smgr GetStorageMgrHandlers(int smgr_which);
extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);

As the names suggest, GetStorageMgrNumbers() returns the number of
storage managers currently installed. It always returns 1 right now.
GetStorageMgrHandlers() returns the currently configured f_smgr table for
the supplied smgr_which. It allows extensions to know the current
configuration of the storage manager, even if another extension has
already modified it.
SetStorageMgrHandlers() assigns the supplied 'smgr_handlers' instead of
the current one.
If an extension wants to intermediate 'smgr_write', it replaces
'smgr_write' with its own function, then calls the original function,
likely mdwrite, from the alternative function.

In this case, the call chain shall be:

  FlushBuffer, and others...
   +-- smgrwrite(...)
        +-- (extension's own function)
             +-- mdwrite

Once the extension's own function blocks write i/o until the P2P DMA is
completed by the concurrent process, we don't need to care about partial
updates of the OS cache or the storage device.
It is not difficult for extensions to implement a feature like this.

Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
>
Thanks; this part was not clear from my previous description.

At the time when B starts writing the buffer to the OS, the extension
catches this i/o request using a hook around smgrwrite, and the mechanism
registers the block so that P2P DMA requests are blocked during B's write
operation. (Of course, it unregisters the block at the end of smgrwrite.)
So, even if A wants to issue a P2P DMA concurrently, it cannot register
the block until B's write operation finishes.

In practice, this operation shall be a "try lock": B's write operation
implies the existence of the buffer in main memory, so A does not need to
wait for B's write operation if A switches the DMA source from SSD to
main memory.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 
> 




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-04 Thread Kouhei Kaigai
> -Original Message-
> From: Jim Nasby [mailto:jim.na...@bluetreble.com]
> Sent: Friday, February 05, 2016 9:17 AM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas
> Cc: Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >> >that are registered by other concurrent backend, unless it is not
> >> >unregistered (usually, at the end of P2P DMA).
> >> >==> to be discussed.
> 
> I think there's still a race condition here though...
> 
> A
> finds buffer not in shared buffers
> 
> B
> reads buffer in
> modifies buffer
> starts writing buffer to OS
> 
> A
> Makes call to block write, but write is already in process; thinks
> writes are now blocked
> Reads corrupted block
> Much hilarity ensues
> 
> Or maybe you were just glossing over that part for brevity.
> 
> ...
> 
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> >ASSUMPTION: I intend to implement this feature as a part of extension,
> >because this i/o suspend/resume checks are pure overhead increment
> >for the core features, unless extension which utilizes it.
> >
> > Three functions shall be added:
> >
> > extern intGetStorageMgrNumbers(void);
> > extern f_smgr GetStorageMgrHandlers(int smgr_which);
> > extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
> >
> > As literal, GetStorageMgrNumbers() returns the number of storage manager
> > currently installed. It always return 1 right now.
> > GetStorageMgrHandlers() returns the currently configured f_smgr table to
> > the supplied smgr_which. It allows extensions to know current configuration
> > of the storage manager, even if other extension already modified it.
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> >FlushBuffer, and others...
> > +-- smgrwrite(...)
> >  +-- (extension's own function)
> >   +-- mdwrite
> 
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
> 
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
>
# Sorry, I overlooked the latter part of your reply.

I can agree that the smgr hooks should primarily be designed to make
storage systems pluggable, even if we could use these hooks for the
suspend & resume of write i/o.
In addition, "pluggable storage" is a long-standing feature request, even
though it is not certain whether the existing smgr hooks are a good
starting point. It may be a risk if we implement a grand feature on top
of the hooks but outside their primary purpose.

So, my preference is a mechanism that hooks buffer writes to implement
this feature. (Or, maybe a built-in write i/o suspend/resume facility, if
it has nearly zero cost when no extension activates the feature.)
One downside of this approach is the larger number of hook points.
We have to deploy the hook near the existing smgrwrite calls of
LocalBufferAlloc and FlushRelationBuffers, in addition to FlushBuffer, at
least.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-07 Thread Kouhei Kaigai
> On Wed, Feb 3, 2016 at 11:57 PM, Kouhei Kaigai  wrote:
> > At this moment, I tried to write up description at nodes/nodes.h.
> > The amount of description is about 100lines. It is on a borderline
> > whether we split off this chunk into another header file, in my sense.
> >
> >
> > On the other hands, I noticed that we cannot omit checks for individual
> > callbacks on Custom node type, ExtensibleNodeMethods is embedded in
> > the CustomMethods structure, thus we may have Custom node with
> > no extensible feature.
> > This manner is beneficial because extension does not need to register
> > the library and symbol name for serialization. So, CustomScan related
> > code still checks existence of individual callbacks.
> 
> I was looking over this patch yesterday, and something was bugging me
> about it, but I couldn't quite put my finger on what it was.  But now
> I think I know.
> 
> I think of an extensible node type facility as something that ought to
> be defined to allow a user to create new types of nodes.  But this is
> not that.  What this does is allows you to have a CustomScan or
> ForeignScan node - that is, the existing node type - which is actually
> larger than a normal node of that type and has some extra data that is
> part of it.  I'm having a really hard time being comfortable with that
> concept.  Somehow, it seems like the node type should determine the
> size of the node.  I can stretch my brain to the point of being able
> to say, well, maybe if the node tag is T_ExtensibleNode, then you can
> look at char *extnodename to figure out what type of node it is
> really, and then from there get the size.  But what this does is:
> every time you see a T_CustomScan or T_ForeignScan node, it might not
> really be that kind of node but something else else, and tomorrow
> there might be another half-dozen node types with a similar property.
> And every one of those node types will have char *extnodename in a
> different place in the structure, so a hypothetical piece of code that
> wanted to find the extension methods for a node, or the size of a
> node, would need a switch that knows about all of those node types.
> It feels very awkward.
> 
> So I have a slightly different proposal.  Let's forget about letting
> T_CustomScan or T_ForeignScan or any other built-in node type vary in
> size.  Instead, let's add T_ExtensibleNode which acts as a completely
> user-defined node.  It has read/out/copy/equalfuncs support along the
> lines of what this patch implements, and that's it.  It's not a scan
> or a path or anything like that: it's just an opaque data container
> that the system can input, output, copy, and test for equality, and
> nothing else.  Isn't that useless?  Not at all.  If you're creating an
> FDW or custom scan provider and want to store some extra data, you can
> create a new type of extensible node and stick it in the fdw_private
> or custom_private field!  The data won't be part of the CustomScan or
> ForeignScan structure itself, as in this patch, but who cares? The
> only objection I can see is that you still need several pointer
> deferences to get to the data since fdw_private is a List, but if
> that's actually a real performance problem we could probably fix it by
> just changing fdw_private to a Node instead.  You'd still need one
> pointer dereference, but that doesn't seem too bad.
> 
> Thoughts?
>
The new callbacks of T_ExtensibleNode will remove the need for the form
and deform processing of private values, like:
  https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114

That code transforms a bunch of internal data of CustomScan (similar to
the extended fields in T_ExtensibleNode) to/from forms the node functions
understand, for copy, input and output support.
I think it implies your proposition is workable.

I'd basically like to follow this proposition.
On the other hand, there are two points I want to pay attention to.

1. At this moment, an extension is allowed to define a larger structure
that embeds CustomPath or CustomScanState. How do we treat this coding
style in this case? In particular, CustomScanState has no private
pointer to dereference, because it assumes the extension defines a
larger structure. Of course, we don't need to care about node
operations on Path and PlanState nodes right now.

2. I intended to remove the LibraryName and SymbolName fields from the
CustomScanMethods structure by integration with the extensible node type.
We had to supply this pair of identifiers because the custom scan provider
has no registration point at the moment. A minor concern is that the
extension has to assume a particular filename for itself.
But that should probably be a separate discussion.
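To make the registration idea concrete, here is a toy, self-contained C
model of name-keyed method registration in the spirit of
RegisterExtensibleNodeMethods()/GetExtensibleNodeMethods(): register a
methods table once (e.g. at _PG_init() time), then look it up by name in
string form at deserialization time. The names ToyMethods, toy_register
and toy_lookup are invented for illustration and are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy registry of method tables keyed by name.  All names here are
 * hypothetical; PostgreSQL uses a dynahash table instead of an array. */

#define TOY_NAME_MAX    64
#define TOY_MAX_ENTRIES 32

typedef struct ToyMethods
{
    char    name[TOY_NAME_MAX];     /* lookup key, like extnodename */
    size_t  node_size;              /* size of the extension's struct */
} ToyMethods;

static ToyMethods toy_registry[TOY_MAX_ENTRIES];
static int        toy_nentries = 0;

/* Register a methods table once, typically at library load time.
 * Returns 0 on success, -1 on duplicate name or full registry. */
int
toy_register(const char *name, size_t node_size)
{
    int i;

    if (toy_nentries >= TOY_MAX_ENTRIES)
        return -1;
    for (i = 0; i < toy_nentries; i++)
        if (strcmp(toy_registry[i].name, name) == 0)
            return -1;          /* duplicate registration */
    strncpy(toy_registry[toy_nentries].name, name, TOY_NAME_MAX - 1);
    toy_registry[toy_nentries].node_size = node_size;
    toy_nentries++;
    return 0;
}

/* Look up a methods table by name, as a deserializer would do when it
 * only has the name in string form.  Returns NULL if not registered. */
const ToyMethods *
toy_lookup(const char *name)
{
    int i;

    for (i = 0; i < toy_nentries; i++)
        if (strcmp(toy_registry[i].name, name) == 0)
            return &toy_registry[i];
    return NULL;
}
```

The point of the model: a background worker that re-registers the same
name at load time can reproduce the same table without ever knowing the
library filename or symbol name.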

Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-07 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 1:52 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai  wrote:
> > I can agree that smgr hooks shall be primarily designed to make storage
> > systems pluggable, even if we can use this hooks for suspend & resume of
> > write i/o stuff.
> > In addition, "pluggable storage" is a long-standing feature, even though
> > it is not certain whether existing smgr hooks are good starting point.
> > It may be a risk if we implement a grand feature on top of the hooks
> > but out of its primary purpose.
> >
> > So, my preference is a mechanism to hook buffer write to implement this
> > feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> > has nearly zero cost when no extension activate the feature.)
> > One downside of this approach is larger number of hook points.
> > We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> > and FlushRelationBuffers, in addition to FlushBuffer, at least.
> 
> I don't understand what you're hoping to achieve by introducing
> pluggability at the smgr layer.  I mean, md.c is pretty much good for
> read and writing from anything that looks like a directory of files.
> Another smgr layer would make sense if we wanted to read and write via
> some kind of network protocol, or if we wanted to have some kind of
> storage substrate that did internally to itself some of the tasks for
> which we are currently relying on the filesystem - e.g. if we wanted
> to be able to use a raw device, or perhaps more plausibly if we wanted
> to reduce the number of separate files we need, or provide a substrate
> that can clip an unused extent out of the middle of a relation
> efficiently.  But I don't understand what this has to do with what
> you're trying to do here.  The subject of this thread is about whether
> you can check for the presence of a block in shared_buffers, and as
> discussed upthread, you can.  I don't quite follow how we made the
> jump from there to smgr pluggability.
>
Yes. smgr pluggability is not what I want to investigate in this thread.
It is not the purpose of the discussion, but one potential implementation idea.

Through the discussion, it became clear that an extension can check for
the existence of a buffer for a particular block using the existing
infrastructure.

On the other hand, it also became clear that we have to guarantee the
OS buffer or storage block is not partially updated during the P2P DMA.
My motivation is potential utilization of SSD-to-GPU P2P DMA to filter
out unnecessary rows and columns prior to loading them to CPU/RAM.
This needs to ensure PostgreSQL does not write out buffers to OS buffers,
to avoid unexpected data corruption.

What I want to achieve is suspension of buffer writes to a particular
(relnode, forknum, blocknum) pair for a short time, until completion of
the data processing by the GPU (or other external device).
In addition, it is preferable that this works regardless of the choice
of storage manager, even if we may have multiple options on top of a
pluggable smgr in the future.

Data processing close to the storage requires that the OS buffer is not
updated concurrently during the P2P DMA. So, I want the feature below.
1. An extension (that controls the GPU and P2P DMA) can register a particular
   (relnode, forknum, blocknum) pair as a block suspended for write.
2. Once a particular block is suspended, smgrwrite (or its caller) shall
   block until the suspended block is unregistered.
3. The extension unregisters the block when the P2P DMA from it completes;
   suspended concurrent backends then resume their write I/O.
4. Conversely, the extension cannot register a block while some other
   backend is concurrently executing smgrwrite, to avoid potential data
   corruption.

One idea was injecting a thin layer on top of the smgr mechanism to
implement the mechanism above.
However, I'm uncertain whether hooking the entire set of smgr functions
is a straightforward approach to achieve it.

The minimum thing I want is a facility to get control at the head and tail
of smgrwrite(): to suspend a concurrent write prior to smgr_write, and to
be notified when the concurrent smgr_write completes.

It does not need smgr pluggability, but the entry points shall be located
around the smgr functions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Wednesday, February 10, 2016 1:58 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: ##freemail## Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai  wrote:
> > On the other hands, it also became clear we have to guarantee OS buffer
> > or storage block must not be updated partially during the P2P DMA.
> > My motivation is a potential utilization of P2P DMA of SSD-to-GPU to
> > filter out unnecessary rows and columns prior to loading to CPU/RAM.
> > It needs to ensure PostgreSQL does not write out buffers to OS buffers
> > to avoid unexpected data corruption.
> >
> > What I want to achieve is suspend of buffer write towards a particular
> > (relnode, forknum, blocknum) pair for a short time, by completion of
> > data processing by GPU (or other external devices).
> > In addition, it is preferable being workable regardless of the choice
> > of storage manager, even if we may have multiple options on top of the
> > pluggable smgr in the future.
> 
> It seems like you just need to take an exclusive content lock on the
> buffer, or maybe a shared content lock would be sufficient.
>
Unfortunately, it was not sufficient.

By assumption, the buffer page to be suspended does not exist when a
backend process issues a series of P2P DMA commands. (If the block were
already loaded into the shared buffers, there would be no need to issue
P2P DMA; the usual memory<->device DMA would be used instead, because
RAM is much faster than SSD.)
The backend knows the (rel,fork,block) triple, but no BufferDesc for
this block exists. Thus, it cannot acquire locks in the BufferDesc
structure.

Even if the block does not exist at this point, a concurrent process may
load the same page. A BufferDesc shall then be assigned for this page;
however, the process which issues the P2P DMA command has no chance to
lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism shall take the
(rel,fork,block) triple as the identifier of the target block.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-09 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 11:59 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: ##freemail## Re: CustomScan in a larger structure (RE: [HACKERS]
> CustomScan support on readfuncs.c)
> 
> On Sun, Feb 7, 2016 at 7:28 PM, Kouhei Kaigai  wrote:
> > The new callbacks of T_ExtensibleNode will replace the necessity to
> > form and deform process of private values, like as:
> >   https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114
> 
> Yeah.
> 
> > It transforms a bunch of internal data of CustomScan (similar to the
> > extended fields in T_ExtensibleNode) to/from the node functions
> > understandable forms for copy, input and output support.
> > I think it implies you proposition is workable.
> >
> > I'd like to follow this proposition basically.
> > On the other hands, I have two points I want to pay attention.
> >
> > 1. At this moment, it is allowed to define a larger structure that
> > embeds CustomPath and CustomScanState by extension. How do we treat
> > this coding manner in this case? Especially, CustomScanState has no
> > private pointer dereference because it assumes an extension define
> > a larger structure. Of course, we don't need to care about node
> > operations on Path and PlanState nodes, but right now.
> 
> I see no real advantage in letting a CustomPath be larger.  If
> custom_private can include extension-defined node types, that seems
> good enough.  On the other hand, if CustomScanState can be larger,
> that seems fine.   We don't really need any special support for that,
> do we?
>
Yes. Right now, we have no code path that handles PlanState or its
descendants using node operations. So, it is not a real problem.

> > 2. I intended to replace LibraryName and SymbolName fields from the
> > CustomScanMethods structure by integration of extensible node type.
> > We had to give a pair of these identifiers because custom scan provider
> > has no registration points at this moment. A little concern is extension
> > has to assume a particular filename of itself.
> > But, probably, it shall be a separated discussion. My preference is
> > preliminary registration of custom scan provider by its name, as well
> > as extensible node.
> 
> Seems like we could just leave the CustomScan stuff alone and worry
> about this as a separate facility.
>
OK

> > Towards the last question; whether *_private shall be void * or List *,
> > I want to keep fdw_private and custom_private as List * pointer, because
> > a new node type definition is a bit overdone job if this FDW or CSP will
> > take only a few private fields with primitive data types.
> > It is a preferable features when extension defines ten or more private
> > fields.
> 
> Well, I suggested Node *, not void *.  A Node can be a List, but not
> every Node is a List.
>
It is pretty good!

The attached patch (the primary one) implements the above idea.

Now ExtensibleNode works as a base structure for data containers,
independent of CustomScan and ForeignScan.
Also, fdw_private and custom_private are redefined from List * to
Node * type. This affected a few FDW APIs.

The secondary patch is a demonstration of the new ExtensibleNode using
the postgres_fdw extension. Its private data used to be packed in a
list with a particular ordering. A self-defined structure allows keeping
these variables without the ugly packing/unpacking.
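To illustrate the out/read/equal round trip such a self-defined
structure enables, here is a toy, self-contained C sketch. The struct,
the format string and all function names are invented and deliberately
much simpler than PostgreSQL's node text format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical FDW private data, stored as one typed struct instead of
 * a positionally ordered List. */
typedef struct ToyFdwPrivate
{
    char relname[32];
    int  fetch_size;
    bool use_remote_estimate;
} ToyFdwPrivate;

/* "outfunc": serialize to a NUL-terminated text buffer. */
void
toy_out(const ToyFdwPrivate *p, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "TOYFDW :relname %s :fetch_size %d :remote %d",
             p->relname, p->fetch_size, p->use_remote_estimate ? 1 : 0);
}

/* "readfunc": reconstruct the struct from text; false on parse failure.
 * (Real relation names can contain spaces; %31s is good enough here.) */
bool
toy_read(const char *buf, ToyFdwPrivate *p)
{
    int remote;

    if (sscanf(buf, "TOYFDW :relname %31s :fetch_size %d :remote %d",
               p->relname, &p->fetch_size, &remote) != 3)
        return false;
    p->use_remote_estimate = (remote != 0);
    return true;
}

/* "equalfunc": field-by-field comparison. */
bool
toy_equal(const ToyFdwPrivate *a, const ToyFdwPrivate *b)
{
    return strcmp(a->relname, b->relname) == 0 &&
           a->fetch_size == b->fetch_size &&
           a->use_remote_estimate == b->use_remote_estimate;
}
```

The background-worker case is exactly this round trip: the leader runs
the outfunc, the worker runs the readfunc, and both sides see the same
typed structure with no pack/unpack code in between.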

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v9.6-custom-private.v5.demo.patch
Description: pgsql-v9.6-custom-private.v5.demo.patch


pgsql-v9.6-custom-private.v5.patch
Description: pgsql-v9.6-custom-private.v5.patch



Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-11 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, February 11, 2016 1:26 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Andres Freund; Amit Kapila; pgsql-hackers
> Subject: Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan 
> support
> on readfuncs.c)
> 
> On Wed, Feb 10, 2016 at 1:25 AM, Kouhei Kaigai  wrote:
> > It is pretty good!
> >
> > The attached patch (primary one) implements the above idea.
> >
> > Now ExtensibleNode works as a basis structure of data container,
> > regardless of CustomScan and ForeignScan.
> > Also, fdw_private and custom_private are de-defined to Node * type
> > from List * type. It affected to a few FDW APIs.
> 
> I extracted the subset of this that just creates the extensible node
> framework and did some cleanup - the result is attached.  I will
> commit this if nobody objects.
>
I have no objection, of course.

> I think the part about whacking around the FDW API is a little more
> potentially objectionable to others, so I want to hold off doing that
> unless a few more people chime in with +1.  Perhaps we could start a
> new thread to talk about that specific idea.  This is useful even
> without that, though.
>

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-11 Thread Kouhei Kaigai
> On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai  wrote:
> > Unfortunately, it was not sufficient.
> >
> > Due to the assumption, the buffer page to be suspended does not exist
> > when a backend process issues a series P2P DMA command. (If block would
> > be already loaded to the shared buffer, it don't need to issue P2P DMA,
> > but just use usual memory<->device DMA because RAM is much faster than
> > SSD.)
> > It knows the pair of (rel,fork,block), but no BufferDesc of this block
> > exists. Thus, it cannot acquire locks in BufferDesc structure.
> >
> > Even if the block does not exist at this point, concurrent process may
> > load the same page. BufferDesc of this page shall be assigned at this
> > point, however, here is no chance to lock something in BufferDesc for
> > the process which issues P2P DMA command.
> >
> > It is the reason why I assume the suspend/resume mechanism shall take
> > a pair of (rel,fork,block) as identifier of the target block.
> 
> I see the problem, but I'm not terribly keen on putting in the hooks
> that it would take to let you solve it without hacking core.  It
> sounds like an awfully invasive thing for a pretty niche requirement.
>
Hmm. In my experience, it is often not productive to discuss whether
a feature is niche or commodity. So, let me change the viewpoint.

We may utilize OS-level locking mechanism here.

Even though it depends on the filesystem implementation under the VFS,
we may use the inode->i_mutex lock, which is acquired during the buffer
copy from user to kernel space on at least a few major filesystems:
ext4, xfs and btrfs, in my research. As well, the modified NVMe SSD
driver can acquire the inode->i_mutex lock during the P2P DMA transfer.

Once we can consider the OS buffer to be updated atomically under this
lock, we don't need to worry about corrupted pages, but we still need
to pay attention to the scenario where an updated buffer page is moved
to the GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no
MVCC infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and then writes out the page
including invisible tuples, the PD_ALL_VISIBLE flag shall be cleared,
because the updated tuples should not be visible to the transaction
which issued the P2P DMA.

Once the GPU meets a page with !PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In this case, the page is
likely already loaded into the shared buffers, so the retry penalty is
not large.
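The retry protocol described here can be modeled in a few lines of
self-contained C. ToyPage and the status codes are invented stand-ins,
not PostgreSQL or GPU-driver types; the flag merely plays the role of
PD_ALL_VISIBLE.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SCAN_OK, SCAN_RETRY_ON_CPU } ScanStatus;

typedef struct ToyPage
{
    bool all_visible;     /* stands in for PD_ALL_VISIBLE */
} ToyPage;

/* Device-side check: the device is MVCC-unaware, so only pages still
 * flagged all-visible are safe to process directly. */
ScanStatus
device_scan_page(const ToyPage *page)
{
    return page->all_visible ? SCAN_OK : SCAN_RETRY_ON_CPU;
}

/* CPU-side driver: count how many pages must be retried through the
 * ordinary shared-buffer path. */
int
count_retries(const ToyPage *pages, int npages)
{
    int retries = 0;

    for (int i = 0; i < npages; i++)
        if (device_scan_page(&pages[i]) == SCAN_RETRY_ON_CPU)
            retries++;
    return retries;
}
```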

I'll try to investigate the implementation in this way.
Please correct me if I misunderstand something (especially the
treatment of PD_ALL_VISIBLE).

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-13 Thread Kouhei Kaigai




> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, February 13, 2016 1:46 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai  wrote:
> > Hmm. In my experience, it is often not a productive discussion whether
> > a feature is niche or commodity. So, let me change the viewpoint.
> >
> > We may utilize OS-level locking mechanism here.
> >
> > Even though it depends on filesystem implementation under the VFS,
> > we may use inode->i_mutex lock that shall be acquired during the buffer
> > copy from user to kernel, at least, on a few major filesystems; ext4,
> > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > acquire the inode->i_mutex lock during P2P DMA transfer.
> >
> > Once we can consider the OS buffer is updated atomically by the lock,
> > we don't need to worry about corrupted pages, but still needs to pay
> > attention to the scenario when updated buffer page is moved to GPU.
> >
> > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > infrastructure, so I intend to move all-visible pages only.
> > If someone updates the buffer concurrently, then write out the page
> > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > updated tuples should not be visible to the transaction which issued
> > P2P DMA.
> >
> > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > that indicates CPU to retry this page again. In this case, this page is
> > likely loaded to the shared buffer already, so retry penalty is not so
> > much.
> >
> > I'll try to investigate the implementation in this way.
> > Please correct me, if I misunderstand something (especially, treatment
> > of PD_ALL_VISIBLE).
> 
> I suppose there's no theoretical reason why the buffer couldn't go
> from all-visible to not-all-visible and back to all-visible again all
> during the time you are copying it.
>
The backend process that is copying the data to the GPU has an
in-progress (= not committed) transaction. Is it possible for the
updated buffer page to get back to the all-visible state again?
I expect the in-progress transaction works as a blocker against
returning to all-visible. Right?

> Honestly, I think trying to access buffers without going through
> shared_buffers is likely to be very hard to make correct and probably
> a loser.
>
No challenge, no outcome. ;-)

> Copying the data into shared_buffers and then to the GPU is,
> doubtless, at least somewhat slower.  But I kind of doubt that it's
> enough slower to make up for all of the problems you're going to have
> with the approach you've chosen.
>
Honestly, I'm still uncertain whether it works as well as I expect.
However, scan workloads on tables larger than main memory are a
headache for PG-Strom, so I'd like to try the ideas we can implement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




[HACKERS] comment fix for CUSTOMPATH_* flags

2016-08-28 Thread Kouhei Kaigai
Hello,

I noticed the source code comment around the CustomPath structure says
"see above" for the definition of the CUSTOMPATH_* flags. It was
originally right, but the definitions were moved to nodes/extensible.h
during further development, so no comments are above anymore.
The attached patch corrects the comment to point to the right location.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-v9.6-custom-flags-comments-fixup.patch
Description: pgsql-v9.6-custom-flags-comments-fixup.patch



[HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-08-28 Thread Kouhei Kaigai
Hello,

The attached patch adds an optional callback to support a special
optimization when ForeignScan/CustomScan is located under a Limit node
in the plan tree.

Our Sort node wisely switches its behavior when we can know the exact
number of rows to be produced in advance, because all the Sort node has
to return is the top-k rows when it is located under a Limit node.
That is a much lighter workload than sorting the entire input when
nrows is not small.
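To illustrate why a known bound is so much cheaper, here is a toy,
self-contained C sketch of top-k selection: it keeps only the k smallest
values seen so far instead of sorting the whole input. It is O(n*k) for
brevity; a real implementation (such as a bounded heap, which is what
our Sort node effectively uses) would be O(n log k). Function names are
invented.

```c
#include <assert.h>
#include <stddef.h>

/* Insert v into out[0..*nout), kept sorted ascending and never larger
 * than k elements; values beyond the current top-k are discarded. */
void
topk_insert(int *out, int *nout, int k, int v)
{
    int pos = *nout;
    int end;

    /* find insertion point from the right */
    while (pos > 0 && out[pos - 1] > v)
        pos--;
    if (pos >= k)
        return;                     /* larger than the current top-k */
    /* shift tail right, dropping the last element if already full */
    end = (*nout < k) ? *nout : k - 1;
    for (int i = end; i > pos; i--)
        out[i] = out[i - 1];
    out[pos] = v;
    if (*nout < k)
        (*nout)++;
}

/* Return (in out[], ascending) the k smallest values of in[0..n);
 * the return value is the number of results, min(n, k). */
int
topk(const int *in, int n, int k, int *out)
{
    int nout = 0;

    for (int i = 0; i < n; i++)
        topk_insert(out, &nout, k, in[i]);
    return nout;
}
```

The working set is bounded by k regardless of n, which is exactly why a
small known LIMIT lets a GPU sort run in L1-grade memory.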

In my case, this information is very useful because the GPU can complete
its sorting operations mostly in L1-grade memory if we can know in
advance that the top-k value is small enough to fit in the fast memory.

Probably, it is also valuable for Fujita-san's case, because this
information allows attaching a "LIMIT k" clause to the remote query of
postgres_fdw. It will reduce the amount of network traffic and remote
CPU consumption once we get support for sort pushdown.

One thing we need to pay attention to is cost estimation at the planner
stage. In the existing code, only create_ordered_paths() and
create_merge_append_path() consider the limit clause for the cost
estimation of sorting. They use 'limit_tuples' of PlannerInfo; an
extension can reference that structure when it adds a
ForeignPath/CustomPath, so I think we don't need a special enhancement
on the planner stage.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



pgsql-v10-fdw-css-limit-bound.v1.patch
Description: pgsql-v10-fdw-css-limit-bound.v1.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-04 Thread Kouhei Kaigai
> On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai  wrote:
> 
> 
>   Hello,
> 
>   The attached patch adds an optional callback to support special 
> optimization
>   if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> 
>   Our sort node wisely switches the behavior when we can preliminary know
>   exact number of rows to be produced, because all the Sort node has to
>   return is the top-k rows when it is located under the Limit node.
>   It is much lightweight workloads than sorting of entire input rows when
>   nrows is not small.
> 
>   In my case, this information is very useful because GPU can complete its
>   sorting operations mostly on L1-grade memory if we can preliminary know
>   the top-k value is enough small and fits to size of the fast memory.
> 
>   Probably, it is also valuable for Fujita-san's case because this 
> information
>   allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
>   It will reduce amount of the network traffic and remote CPU consumption
>   once we got support of sort pushdown.
> 
> 
> 
>   One thing we need to pay attention is cost estimation on the planner 
> stage.
>   In the existing code, only create_ordered_paths() and
> create_merge_append_path()
>   considers the limit clause for cost estimation of sorting. They use the
>   'limit_tuples' of PlannerInfo; we can reference the structure when 
> extension
>   adds ForeignPath/CustomPath, so I think we don't need a special 
> enhancement
>   on the planner stage.
>
Thanks for your comments.

> I believe this hook is gets called at execution time.
> So to push LIMIT clause like you said above we should use "limit_tuples" at 
> the time
> of planning and then use this hook to optimize at runtime, right?
>
Yes. To be more precise, a valid "limit_tuples" in PlannerInfo is set
only when the LIMIT clause takes constant values; that is true for most
use cases. The hook I added shall then be called at execution time for
more exact optimization.

If an FDW/CSP cannot accept an uncertain number of rows to generate at
planning time, it has no duty to provide its own path optimized for a
small LIMIT.

> Apart from that, attached patch applies cleanly on latest sources and found 
> no issues
> with make or with regressions.
> 
> However this patch is an infrastructure for any possible optimization when
> foreign/customscan is under LIMIT.
> 
> So look good to me.
> 
> I quickly tried adding a hook support in postgres_fdw, and it gets called 
> correctly
> when we have foreignscan with LIMIT (limit being evaluated on local server).
> 
> So code wise no issue. Also add this hook details in documentation.
>
OK, I'll try to write up some detailed documentation; not only the API
specification.

Best regards,

> 
> Thanks
> 
> 
> 
>   Thanks,
>   --
>   NEC Business Creation Division / PG-Strom Project
>   KaiGai Kohei 
> 
> 
> 
>   --
>   Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>   To make changes to your subscription:
>   http://www.postgresql.org/mailpref/pgsql-hackers
> <http://www.postgresql.org/mailpref/pgsql-hackers>
> 
> 
> 
> 
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-05 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, September 05, 2016 12:58 PM
> To: Jeevan Chalke
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> > On Mon, Aug 29, 2016 at 7:25 AM, Kouhei Kaigai  wrote:
> >
> >
> > Hello,
> >
> > The attached patch adds an optional callback to support special 
> > optimization
> > if ForeignScan/CustomScan are located under the Limit node in plan-tree.
> >
> > Our sort node wisely switches the behavior when we can preliminary know
> > exact number of rows to be produced, because all the Sort node has to
> > return is the top-k rows when it is located under the Limit node.
> > It is much lightweight workloads than sorting of entire input rows when
> > nrows is not small.
> >
> > In my case, this information is very useful because GPU can complete its
> > sorting operations mostly on L1-grade memory if we can preliminary know
> > the top-k value is enough small and fits to size of the fast memory.
> >
> > Probably, it is also valuable for Fujita-san's case because this 
> > information
> > allows to attach "LIMIT k" clause on the remote query of postgres_fdw.
> > It will reduce amount of the network traffic and remote CPU consumption
> > once we got support of sort pushdown.
> >
> >
> >
> > One thing we need to pay attention is cost estimation on the planner 
> > stage.
> > In the existing code, only create_ordered_paths() and
> > create_merge_append_path()
> > considers the limit clause for cost estimation of sorting. They use the
> > 'limit_tuples' of PlannerInfo; we can reference the structure when 
> > extension
> > adds ForeignPath/CustomPath, so I think we don't need a special 
> > enhancement
> > on the planner stage.
> >
> Thanks for your comments.
> 
> > I believe this hook is gets called at execution time.
> > So to push LIMIT clause like you said above we should use "limit_tuples" at 
> > the time
> > of planning and then use this hook to optimize at runtime, right?
> >
> Yes. For more correctness, a valid "limit_tuples" of PlannerInfo is set only 
> when
> LIMIT clause takes constant values; it is true for most of use case.
> Then, the hook I added shall be called at execution time for more exact 
> optimization.
> 
> If FDW/CSP cannot accept uncertain number of rows to generate on planning 
> time,
> it is not a duty to provide its own path which is optimized for small number 
> of
> LIMIT clause.
> 
> > Apart from that, attached patch applies cleanly on latest sources and found 
> > no issues
> > with make or with regressions.
> >
> > However this patch is an infrastructure for any possible optimization when
> > foreign/customscan is under LIMIT.
> >
> > So look good to me.
> >
> > I quickly tried adding a hook support in postgres_fdw, and it gets called 
> > correctly
> > when we have foreignscan with LIMIT (limit being evaluated on local server).
> >
> > So code wise no issue. Also add this hook details in documentation.
> >
> OK, I'll try to write up some detailed documentation stuff; not only API 
> specification.
>
The v2 patch is attached. It describes the role of this hook and how an
extension can utilize the LIMIT clause for further optimization at
planning and execution time.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


pgsql-v10-fdw-css-limit-bound.v2.patch
Description: pgsql-v10-fdw-css-limit-bound.v2.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
Sorry for my late reply.

The attached patch fixed the wording problems on SGML part.

Best regards,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Jeevan Chalke [mailto:jeevan.cha...@enterprisedb.com]
> Sent: Tuesday, September 06, 2016 11:22 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Etsuro Fujita
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> 
> Hi,
> 
> Changes look good to me.
> 
> However, there are a couple of minor issues that need to be fixed.
> 
> 1.
> "under" repeated on second line. Please remove.
> +if and when CustomScanState is located under
> +under LimitState; which implies the underlying node is not
> 
> 2.
> Typo: dicsussion => discussion
> Please fix.
> 
> Apart from this I see no issues.
> 
> 
> Thanks
> 
> 
> --
> 
> Jeevan B Chalke
> Principal Software Engineer, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
> 



pgsql-v10-fdw-css-limit-bound.v3.patch
Description: pgsql-v10-fdw-css-limit-bound.v3.patch



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

2016-09-13 Thread Kouhei Kaigai
> On Tue, Sep 13, 2016 at 3:48 AM, Kouhei Kaigai  wrote:
> > Sorry for my late.
> >
> > The attached patch fixed the wording problems on SGML part.
> 
> I agree that we should have some way for foreign data wrappers and
> custom scans and perhaps also other executor nodes to find out whether
> there's a known limit to the number of tuples that they might need to
> produce, but I wonder if we should be doing something more general
> than this.  For example, suppose we add a new PlanState member "long
> numTuples" where 0 means that the number of tuples that will be needed
> is unknown (so that most node types need not initialize it), a
> positive value is an upper bound on the number of tuples that will be
> fetched, and -1 means that it is known for certain that we will need
> all of the tuples.  This might be relevant to the executor batching
> stuff that Andres has been working on, because you could for example
> set ps->numTuples == 1 on the inner side of a semi-join, warning the
> executor node that it shouldn't bother trying to batch anything.
>
I also think the generic approach is a preferable direction.

The current implementation calls recompute_limits() on the first
invocation of ExecLimit and ExecReScanLimit. Do we expect that
ps->numTuples will also be passed down to the child nodes at the same
timing?
I also think this new executor contract should be considered a hint
(not a requirement) for the child nodes, because it allows the
parent nodes to redistribute the upper limit regardless of the type
of the child nodes, as long as the parent node works correctly and
benefits even if the child node returns only part of the tuples. It
makes the decision of whether the upper limit should be passed down
much simpler. The child node "can" ignore the hint, but it can also
utilize it for further optimization.

> On a more practical level, I notice that you haven't adapted
> postgres_fdw or file_fdw to benefit from this new callback.  It seems
> like postgres_fdw could benefit, because it could fetch only the
> required number of tuples if that happens to be a smaller number than
> the configured fetch_size.
>
That was just due to time pressure around the patch submission deadline.
I'll try to enhance postgres_fdw as a user of this run-time optimization.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



[HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-13 Thread Kouhei Kaigai
Hello,

It looks to me like pg_buffercache tries to allocate more than 1GB using
palloc() when shared_buffers is more than 256GB.

# show shared_buffers ;
 shared_buffers
----------------
 280GB
(1 row)

# SELECT buffers, d.datname, coalesce(c.relname, '???')
FROM (SELECT count(*) buffers, reldatabase, relfilenode
FROM pg_buffercache group by reldatabase, relfilenode) b
   LEFT JOIN pg_database d ON d.oid = b.reldatabase
   LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
 WHERE datname = current_database())
   AND b.relfilenode = pg_relation_filenode(c.oid)
   ORDER BY buffers desc;
ERROR:  invalid memory alloc request size 1174405120

This is a situation where MemoryContextAllocHuge() should be used instead
of palloc(). Also, it may need back-patching?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




pgsql-fix-pg_buffercache-palloc-huge.patch
Description: pgsql-fix-pg_buffercache-palloc-huge.patch



Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers

2016-09-14 Thread Kouhei Kaigai
> On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai  wrote:
> > It looks to me pg_buffercache tries to allocate more than 1GB using
> > palloc(), when shared_buffers is more than 256GB.
> >
> > # show shared_buffers ;
> >  shared_buffers
> > ----------------
> >  280GB
> > (1 row)
> >
> > # SELECT buffers, d.datname, coalesce(c.relname, '???')
> > FROM (SELECT count(*) buffers, reldatabase, relfilenode
> > FROM pg_buffercache group by reldatabase, relfilenode) b
> >LEFT JOIN pg_database d ON d.oid = b.reldatabase
> >LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
> >  WHERE datname = current_database())
> >AND b.relfilenode = pg_relation_filenode(c.oid)
> >ORDER BY buffers desc;
> > ERROR:  invalid memory alloc request size 1174405120
> >
> > It is a situation to use MemoryContextAllocHuge(), instead of palloc().
> > Also, it may need a back patching?
> 
> I guess so.  Although it's not very desirable for it to use that much
> memory, I suppose if you have a terabyte of shared_buffers you
> probably have 4GB of memory on top of that to show what they contain.
>
Exactly. I found this problem when someone asked me why shared_buffers=280GB
was slower than shared_buffers=128MB for scanning a 350GB table.
As I expected, most of the shared buffers were not in use, and they also
reduced the amount of free memory usable for the page cache.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



[HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
Hello,

I'm now trying to carry extra performance statistics on CustomScan
(like DMA transfer rate, execution time of GPU kernels, etc...)
from parallel workers to the leader process using the DSM segment
attached by the parallel-context.
An extension can request an arbitrary length of DSM using the
ExecCustomScanEstimate hook, and then the leader and workers can share
the DSM area. However, there is a problem with this design.
Below is the implementation of ExecEndGather().

  void
  ExecEndGather(GatherState *node)
  {
  ExecShutdownGather(node);
  ExecFreeExprContext(&node->ps);
  ExecClearTuple(node->ps.ps_ResultTupleSlot);
  ExecEndNode(outerPlanState(node));
  }

It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
The DSM segment is released by this call, so the child nodes cannot
reference the DSM at the time of ExecEndNode().

Is there some technical reason why the parallel context needs to be released
prior to ExecEndNode() of the child nodes? Or is it just a coding convention?

I don't think I'm the only person who wants to use the DSM of a CustomScan
to write back some extra status from the parallel workers.
How about the idea of moving ExecShutdownGather() after ExecEndNode()?
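A minimal sketch of that reordering (hypothetical, not a tested patch; the error-handling consequences of ending children before shutdown are not considered here) could look like:

```c
/* Hypothetical reordering of ExecEndGather(): end the child nodes
 * first, while the parallel context and its DSM are still alive,
 * then shut down the Gather node and release the DSM. */
void
ExecEndGather(GatherState *node)
{
    ExecEndNode(outerPlanState(node));  /* children may still read the DSM */
    ExecShutdownGather(node);           /* releases the parallel context */
    ExecFreeExprContext(&node->ps);
    ExecClearTuple(node->ps.ps_ResultTupleSlot);
}
```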

To work around this problem, right now, I allocate another DSM segment and
inform the parallel workers of its handle. This segment can survive until
ExecEndCustomScan(), but that is not the most effective way, of course.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> I'm now trying to carry extra performance statistics on CustomScan
> (like DMA transfer rate, execution time of GPU kernels, etc...)
> from parallel workers to the leader process using the DSM segment
> attached by the parallel-context.
> We can require an arbitrary length of DSM using ExecCustomScanEstimate
> hook by extension, then it looks leader/worker can share the DSM area.
> However, we have a problem on this design.
> 
> Below is the implementation of ExecEndGather().
> 
>   void
>   ExecEndGather(GatherState *node)
>   {
>   ExecShutdownGather(node);
>   ExecFreeExprContext(&node->ps);
>   ExecClearTuple(node->ps.ps_ResultTupleSlot);
>   ExecEndNode(outerPlanState(node));
>   }
> 
> It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> The DSM segment shall be released on this call, so child node cannot
> reference the DSM at the time of ExecEndNode().
> 
> Is there some technical reason why parallel context needs to be released
> prior to ExecEndNode() of the child nodes? Or, just convention of coding?
> 
> I think I'm not an only person who wants to use DSM of CustomScan to write
> back something extra status of parallel workers.
> How about an idea to move ExecShutdownGather() after the ExecEndNode()?
> 
> To avoid this problem, right now, I allocate an another DSM then inform
> its handle to the parallel workers. This segment can be survived until
> ExecEndCustomScan(), but not best effective way, of course.
>
My analysis was not quite correct.

ExecShutdownNode(), called from ExecutePlan(), is the primary point where
ExecShutdownGather() is invoked; thus, the parallel context does not survive
to the point of ExecEndPlan() regardless of the implementation of
ExecEndGather().

Hmm, what is the best way to do this...? Or is it a complete abuse of the
DSM that is set up by the parallel context?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Steps inside ExecEndGather

2016-10-16 Thread Kouhei Kaigai
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai  wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >   ExecShutdownGather(node);
> >   ExecFreeExprContext(&node->ps);
> >   ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >   ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
>
Thanks for the suggestion.
Hmm. Indeed, that is a more straightforward way to do it, although a new
hook is needed for the CSP/FDW.

What I want to collect is: the DMA transfer rate between RAM and GPU, the
execution time of GPU kernels, etc. These are obviously outside the standard
Instrumentation structure, so only the CSP/FDW can know their size and format.

If we had a callback just before planstate_tree_walker() when the planstate
is either CustomScanState or ForeignScanState, it looks to me like the
problem could be solved very cleanly.

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 



ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch implements the suggestion by Amit before.

What motivates me is collecting extra run-time statistics specific
to a particular ForeignScan/CustomScan, beyond the standard
Instrumentation; for example the DMA transfer rate or execution time
of GPU kernels in my case.

The per-node DSM toc is one of the best ways to return run-time statistics
to the master backend, because the FDW/CSP can assign an arbitrary length
of the region according to its needs, and it is quite easy to request.
However, one problem is that the per-node DSM toc is already released when
ExecEndNode() is called on the child node of Gather.

This patch allows extensions to get control in the master backend's
context after all the workers have finished but prior to the release
of the DSM segment. If the FDW/CSP has its own statistics in the
segment, it can move them to private memory for EXPLAIN output or
some other purpose.

One design consideration is whether the hook should be called from
ExecParallelRetrieveInstrumentation() or ExecParallelFinish().
The former is a function to retrieve the standard Instrumentation
information, and thus is reached only under EXPLAIN ANALYZE.
On the other hand, if we put the entry point at ExecParallelFinish(),
the extension can get control regardless of EXPLAIN ANALYZE; however,
it also needs an extra planstate_tree_walker().

Right now, we make no assumptions about the requirements of the FDW/CSP.
It may want run-time statistics regardless of EXPLAIN ANALYZE; thus, the
hook is always invoked once the Gather node has confirmed termination of
the worker processes.
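As a rough illustration of the shape of such a hook on the custom-scan side (the member name and signature below are assumptions for illustration, following the patch's naming convention, and may differ from the actual patch):

```c
/* Hypothetical addition to CustomExecMethods: invoked in the leader
 * once all workers have terminated, while the per-node DSM area
 * ("coordinate") is still mapped, so the extension can copy its
 * private run-time statistics into backend-local memory for later
 * EXPLAIN output. */
void (*ParallelFinishCustomScan) (CustomScanState *node,
                                  struct ParallelContext *pcxt,
                                  void *coordinate);
```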

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Sent: Monday, October 17, 2016 11:22 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers
> Subject: ##freemail## Re: [HACKERS] Steps inside ExecEndGather
> 
> On Mon, Oct 17, 2016 at 6:22 AM, Kouhei Kaigai  wrote:
> > Hello,
> >
> > I'm now trying to carry extra performance statistics on CustomScan
> > (like DMA transfer rate, execution time of GPU kernels, etc...)
> > from parallel workers to the leader process using the DSM segment
> > attached by the parallel-context.
> > We can require an arbitrary length of DSM using ExecCustomScanEstimate
> > hook by extension, then it looks leader/worker can share the DSM area.
> > However, we have a problem on this design.
> >
> > Below is the implementation of ExecEndGather().
> >
> >   void
> >   ExecEndGather(GatherState *node)
> >   {
> >   ExecShutdownGather(node);
> >   ExecFreeExprContext(&node->ps);
> >   ExecClearTuple(node->ps.ps_ResultTupleSlot);
> >   ExecEndNode(outerPlanState(node));
> >   }
> >
> > It calls ExecShutdownGather() prior to the recursive call of ExecEndNode().
> > The DSM segment shall be released on this call, so child node cannot
> > reference the DSM at the time of ExecEndNode().
> >
> 
> Before releasing DSM, we do collect all the statistics or
> instrumentation information of each node.  Refer
> ExecParallelFinish()->ExecParallelRetrieveInstrumentation(), so I am
> wondering why can't you collect the additional information in the same
> way?
> 
> 
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


parallel-finish-fdw_csp.v1.patch
Description: parallel-finish-fdw_csp.v1.patch



[HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-10-31 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of the pass-down-bounds feature.
Its functionality is unchanged from the previous version; however, the
implementation was revised according to the discussion at the last CF.

This patch adds a new field (ps_numTuples) to PlanState. This is
a hint for optimization when the parent node needs only the first N tuples.
It shall be set by the parent node prior to ExecProcNode(), after
ExecInitNode() or ExecReScan(); the child node can then adjust its
execution behavior (e.g., Sort will use a top-N heapsort if ps_numTuples
is set) and pass the hint further down to its own children.

As an example, I enhanced postgres_fdw to understand ps_numTuples
if it is set. When a remote ORDER BY is pushed down, the current
code sorts the entire remote table because it does not know how many
rows will be returned, so execution takes longer.
The patched version, on the other hand, runs the remote query with a
LIMIT clause according to ps_numTuples, which is supplied by the Limit
node on top of the ForeignScan node.

* without patch
=
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
 QUERY PLAN
----------------------------------------------------------------------------
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=2332.547..2332.548 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.177 ms
 Execution time: 2445.590 ms
(7 rows)

* with patch
==
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
QUERY PLAN
--
 Limit  (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 
rows=10 loops=1)
   Output: id, x, y, z
   ->  Foreign Scan on public.ft  (cost=100.00..146.46 rows=1077 width=52) 
(actual time=579.468..579.469 rows=10 loops=1)
 Output: id, x, y, z
 Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS 
LAST, y ASC NULLS LAST
 Planning time: 0.123 ms
 Execution time: 579.858 ms
(7 rows)


Right now, I have a few concerns about this patch.
1. Because a LIMIT clause can contain an expression, not only a constant
   value, we cannot determine the value of ps_numTuples until execution
   time. So it is not possible to adjust the remote query at planning
   time, and EXPLAIN does not show the exact remote query even if a
   LIMIT clause was actually attached.

2. Where is the best location to document the interface contract for
   setting the ps_numTuples field? It has to be set prior to the first
   ExecProcNode() after ExecInitNode() or ExecReScan().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Friday, September 16, 2016 12:39 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund
> Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for 
> ForeignScan/CustomScan
> 
> On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai  wrote:
> > In the current implementation calls recompute_limits() on the first
> > invocation of ExecLimit and ExecReScanLimit. Do we expect the
> > ps->numTuples will be also passed down to the child nodes on the same
> > timing?
> 
> Sure, unless we find some reason why that's not good.
> 
> > I also think this new executor contract shall be considered as a hint
> > (but not a requirement) for the child nodes, because it allows the
> > parent nodes to re-distribute the upper limit regardless of the type
> > of the child nodes as long as the parent node can work correctly and
> > has benefit even if the child node returns a part of tuples. It makes
> > the decision whether the upper limit should be passed down much simple.
> > The child node "can" ignore the hint but can utilize for more optimization.
> 
> +1.
> 
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


passdown-limit-fdw.v1.patch
Description: passdown-limit-fdw.v1.patch



Re: [HACKERS] Proposal: scan key push down to heap [WIP]

2016-11-01 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Dilip Kumar
> Sent: Saturday, October 29, 2016 3:48 PM
> To: Andres Freund
> Cc: Tom Lane; Alvaro Herrera; pgsql-hackers
> Subject: Re: [HACKERS] Proposal: scan key push down to heap [WIP]
> 
> On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund  wrote:
> > The gains are quite noticeable in some cases. So if we can make it work
> > without noticeable downsides...
> >
> > What I'm worried about though is that this, afaics, will quite
> > noticeably *increase* total cost in cases with a noticeable number of
> > columns and a not that selective qual. The reason for that being that
> > HeapKeyTest() uses heap_getattr(), whereas upper layers use
> > slot_getattr(). The latter "caches" repeated deforms, the former
> > doesn't... That'll lead to deforming being essentially done twice, and
> > it's quite often already a major cost of query processing.
> 
> What about putting slot reference inside HeapScanDesc ?. I know it
> will make ,heap layer use executor structure but just a thought.
> 
> I have quickly hacked this way where we use slot reference in
> HeapScanDesc and directly use
>  slot_getattr inside HeapKeyTest (only if we have valid slot otherwise
> use _heap_getattr) and measure the worst case performance (what you
> have mentioned above.)
> 
> My Test: (21 column table with varchar in beginning + qual is on last
> few column + varying selectivity )
> 
> postgres=# \d test
>   Table "public.test"
>  Column |   Type| Modifiers
> +---+---
>  f1 | integer   |
>  f2 | character varying |
>  f3 | integer   |
>  f4 | integer   |
>  f5 | integer   |
>  f6 | integer   |
>  f7 | integer   |
>  f8 | integer   |
>  f9 | integer   |
>  f10| integer   |
>  f11| integer   |
>  f12| integer   |
>  f13| integer   |
>  f14| integer   |
>  f15| integer   |
>  f16| integer   |
>  f17| integer   |
>  f18| integer   |
>  f19| integer   |
>  f20| integer   |
>  f21| integer   |
> 
> tuple count : 1000 (10 Million)
> explain analyze select * from test where f21< $1 and f20 < $1 and f19
> < $1 and f15 < $1 and f10 < $1; ($1 vary from 1Million to 1Million).
> 
> Target code base:
> ---
> 1. Head
> 2. Heap_scankey_pushdown_v1
> 3. My hack for keeping slot reference in HeapScanDesc
> (v1+use_slot_in_HeapKeyTest)
> 
> Result:
> Selectivity Head   scan_key_pushdown_v1 v1+use_slot_in_HeapKeyTest
> 0.1 3880  2980 2747
> 0.2 4041  3187 2914
> 0.5 5051  4921 3626
> 0.8 5378  7296 3879
> 1.0 6161  8525 4575
> 
> Performance graph is attached in the mail..
> 
> Observation:
> 
> 1. Heap_scankey_pushdown_v1, start degrading after very high
> selectivity (this behaviour is only visible if table have 20 or more
> columns, I tested with 10 columns but with that I did not see any
> regression in v1).
> 
> 2. (v1+use_slot_in_HeapKeyTest) is always winner, even at very high 
> selectivity.
> 
Prior to this interface change, a starting point may be to restrict scan-key
pushdown to cases where the OpExpr references a column with a static
attcacheoff. This type of column does not require walking the tuple from the
head, so the tuple-deforming cost will not be a downside.
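A sketch of such a gating test (illustrative only, and an assumption rather than anything in the patch; a real implementation would likely compute eligibility from the attlen/attnotnull of all preceding columns, since attcacheoff may not yet be populated at planning time):

```c
/*
 * Sketch: a scan-key column is "cheap" when its offset within the
 * tuple is fixed, i.e. every preceding attribute is fixed-length and
 * NOT NULL, so the attribute can be fetched without walking the tuple
 * from the head (the condition under which attcacheoff can be set).
 */
static bool
scankey_column_is_cheap(TupleDesc tupdesc, AttrNumber attno)
{
    int     i;

    for (i = 0; i < attno - 1; i++)
    {
        Form_pg_attribute attr = tupdesc->attrs[i];

        if (attr->attlen < 0 || !attr->attnotnull)
            return false;   /* varlena or nullable => offset not static */
    }
    return true;
}
```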

By the way, I'm a bit skeptical about whether this enhancement is really
worth the effort, because we can now easily increase the number of processor
cores to run a seq-scan with a qualifier, especially when it has high
selectivity.
What do you think?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke; Etsuro Fujita; Andres Freund
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai  wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in
> theory know that at planner time and optimize accordingly.  If the
> user says LIMIT twelve(), however, we will need to wait until
> execution time unless twelve() happens to be capable of being
> simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at
> execution (regardless of whether those values were also available
> during planning).  Those are, basically, two separate things, and this
> patch has enough to do just focusing on one of them.
>
OK, we need a private value in postgres_fdw to indicate whether LIMIT
and OFFSET were supplied at the planner stage. If so, it has to be
matched against the ps_numTuples supplied at the executor stage.

I'll revise the patch.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-09 Thread Kouhei Kaigai
> On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
>  wrote:
> > 1. ps_numTuples is declared as long, however offset and count members in
> > LimitState struct and bound member in SortState struct is int64.  However
> > long on 32 bit machine may be 32 bits and thus I think tuples_needed which
> > is long may have overflow hazards as it may store int64 + int64.  I think
> > ps_numTuples should be int64.
> 
> I suggested long originally because that's what ExecutorRun() was
> using at the time.  It seems that it got changed to uint64 in
> 23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
> probably use uint64.
> 
> > 2. Robert suggested following in the previous discussion:
> > "For example, suppose we add a new PlanState member "long
> > numTuples" where 0 means that the number of tuples that will be needed
> > is unknown (so that most node types need not initialize it), a
> > positive value is an upper bound on the number of tuples that will be
> > fetched, and -1 means that it is known for certain that we will need
> > all of the tuples."
> >
> > We should have 0 for the default case so that we don't need to initialize it
> > at most of the places.  But I see many such changes in the patch.  I think
> > this is not possible here since 0 can be a legal user provided value which
> > cannot be set as a default (default is all rows).
> >
> > However do you think, can we avoid that? Is there any other way so that we
> > don't need every node having ps_numTuples to be set explicitly?
> 
> +1.
>
I thought we had to distinguish the case where LIMIT 0 is supplied.
However, in that case ExecLimit() never descends to the child nodes,
so their ps_numTuples will not be referenced anywhere.

OK, I'll use uint64 for ps_numTuples, and 0 will be the usual default
value, meaning that no specific number of rows is given.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2016-11-21 Thread Kouhei Kaigai
Hello,

The attached patch is a revised version of pass-down LIMIT to FDW/CSP.

Below is the updates from the last version.

'ps_numTuples' of PlanState is now declared as uint64 instead of long,
to avoid problems on 32-bit machines when a large LIMIT value is
supplied.

'ps_numTuples' is also reinterpreted: 0 means that the upper node wants
to fetch all the tuples. This eliminates tedious initialization in the
ExecInit handler of each executor node.

Even though it was not suggested, estimate_path_cost_size() of postgres_fdw
now adjusts the number of rows if the foreign path sits on top of the
base relations and the LIMIT clause takes a constant value.
It produces a more adequate plan, as follows:

* WITHOUT this patch

postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
   QUERY PLAN
----------------------------------------------------------------------------
 Limit  (cost=261.17..274.43 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Hash Join  (cost=261.17..581.50 rows=2416 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Hash Cond: (t_a.id = t_b.id)
 Join Filter: (t_a.x < t_b.x)
 ->  Foreign Scan on public.t_a  (cost=100.00..146.12 rows=1204 
width=44)
   Output: t_a.id, t_a.x, t_a.y
   Remote SQL: SELECT id, x, y FROM public.t
 ->  Hash  (cost=146.12..146.12 rows=1204 width=44)
   Output: t_b.id, t_b.x, t_b.y
   ->  Foreign Scan on public.t_b  (cost=100.00..146.12 rows=1204 
width=44)
 Output: t_b.id, t_b.x, t_b.y
 Remote SQL: SELECT id, x, y FROM public.t
(14 rows)

* WITH this patch
-
postgres=# explain verbose select * from t_a, t_b where t_a.id = t_b.id and 
t_a.x < t_b.x LIMIT 100;
  QUERY PLAN
--
 Limit  (cost=100.00..146.58 rows=100 width=88)
   Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
   ->  Foreign Scan  (cost=100.00..146.58 rows=100 width=88)
 Output: t_a.id, t_a.x, t_a.y, t_b.id, t_b.x, t_b.y
 Relations: (public.t_a) INNER JOIN (public.t_b)
 Remote SQL: SELECT r1.id, r1.x, r1.y, r2.id, r2.x, r2.y FROM (public.t 
r1 INNER JOIN public.t r2 ON (((r1.x < r2.x)) AND ((r1.id = r2.id
(6 rows)


On the other hand, I noticed it is not safe to attach the LIMIT clause at
the planner stage, because root->limit_tuples is declared as double.
Even if the LIMIT clause takes a constant value, it is potentially larger
than 2^53, which is the largest range we can represent accurately with
the float64 data type, while a LIMIT clause allows values up to 2^63-1.
So postgres_fdw now attaches the LIMIT clause to the remote query at
execution time only.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Thursday, November 10, 2016 3:08 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: pgsql-hackers@postgresql.org; Jeevan Chalke
> ; Etsuro Fujita
> ; Andres Freund 
> Subject: ##freemail## Re: PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai 
> wrote:
> > As an example, I enhanced postgres_fdw to understand the ps_numTuples
> > if it is set. If and when remote ORDER BY is pushed down, the latest
> > code tries to sort the entire remote table because it does not know
> > how many rows to be returned. Thus, it took larger execution time.
> > On the other hands, the patched one runs the remote query with LIMIT
> > clause according to the ps_numTuples; which is informed by the Limit
> > node on top of the ForeignScan node.
> 
> So there are two cases here.  If the user says LIMIT 12, we could in theory
> know that at planner time and optimize accordingly.  If the user says LIMIT
> twelve(), however, we will need to wait until execution time unless twelve()
> happens to be capable of being simplified to a constant by the planner.
> 
> Therefore, it's possible to imagine having two mechanisms here. In the
> simple case where the LIMIT and OFFSET values are constants, we could
> implement a system to get hold of that information during planning and
> use it for whatever we like.   In addition, we can have an
> execution-time system that optimizes based on values available at execution
> (regardless of whether those values were also available during planning).
> Those are, basically, two separate things, and this patch has enough to
> do just focusing on one of them.
> 
> --
> Robert Haas
> EnterpriseDB: htt

Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-16 Thread Kouhei Kaigai
Hello,

The attached patch is revised one.

Invocation of Exec(Foreign|Custom)ParallelFinish was moved into
ExecParallelRetrieveInstrumentation(), so that we do not walk the
plan-state tree twice.
One (hypothetical) downside is that FDW/CSP can retrieve its own
run-time statistics only when the query is executed under EXPLAIN
ANALYZE.

This enhancement allows FDW/CSP to collect run-time statistics
beyond what Instrumentation covers, then show them in the output
of EXPLAIN. Examples I expect are a GPU's kernel execution time,
DMA transfer ratio, and so on. These statistics will never appear
in the Instrumentation structure, yet can be the hot spot of a
performance bottleneck if CustomScan works on background workers.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Claudio Freire
> Sent: Monday, February 06, 2017 3:37 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Amit Kapila ; Robert Haas
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Mon, Feb 6, 2017 at 1:42 AM, Kouhei Kaigai  wrote:
> > I also had thought an idea to have extra space to Instrumentation
> > structure, however, it needs to make Instrumentation flexible-length
> > structure according to the custom format by CSP/FDW. Likely, it is not
> a good design.
> > As long as extension can retrieve its custom statistics on DSM area
> > required by ExecParallelEstimate(), I have no preference on the hook
> location.
> 
> That's what I had in mind: the hook happens there, but the extension
> retrieves the information from some extension-specific DSM area, just as
> it would on the ParallelFinish hook.
> 
> > One thing we may pay attention is, some extension (not mine) may want
> > to collect worker's statistics regardless of Instrumentation (in other
> > words, even if plan is not under EXPLAIN ANALYZE).
> > It is the reason why I didn't put a hook under the
> ExecParallelRetrieveInstrumentation().
> 
> I don't think you should worry about that as long as it's a hypothetical
> case.
> 
> If/when some extension actually needs to do that, the design can be discussed
> with a real use case at hand, and not a hypothetical one.
> 
> 
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
> changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


parallel-finish-fdw_csp.v2.patch
Description: parallel-finish-fdw_csp.v2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-19 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Monday, February 20, 2017 2:20 AM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Claudio Freire ; Amit Kapila
> ; pgsql-hackers 
> Subject: Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside
> ExecEndGather)
> 
> On Fri, Feb 17, 2017 at 12:46 PM, Kouhei Kaigai 
> wrote:
> > The attached patch is revised one.
> >
> > Invocation of Exec(Foreign|Custom)ParallelFinish was moved to
> > ExecParallelRetrieveInstrumentation() not to walk on the plan- state
> > tree twice.
> > One (hypothetical) downside is, FDW/CSP can retrieve its own run-time
> > statistics only when query is executed under EXPLAIN ANALYZE.
> >
> > This enhancement allows FDW/CSP to collect its specific run- time
> > statistics more than Instrumentation, then show them as output of
> > EXPLAIN. My expected examples are GPU's kernel execution time, DMA
> > transfer ratio and so on. These statistics will never appear in the
> > Instrumentation structure, however, can be a hot- point of performance
> > bottleneck if CustomScan works on background workers.
> 
> Would gather_shutdown_children_first.patch from
> https://www.postgresql.org/message-id/CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbn
> b8eb10c-6aywjdxb...@mail.gmail.com
> help with this problem also?  Suppose we did that, and then also added an
> ExecShutdownCustom method.  Then you'd definitely be able to get control
> before the DSM went away, either from ExecEndNode() or ExecShutdownNode().
> 
Ah, yes, I couldn't find any problem with the above approach.
ExecShutdownGather() can be called by either ExecShutdownNode() or
ExecEndGather(). This patch gives CSP/FDW an entry point prior to
the release of the DSM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 




Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-02-28 Thread Kouhei Kaigai
The attached patch is a rebased version of the pass-down LIMIT clause
patch, which I forgot to register in the last CF.

It allows the required number of rows to be passed down to sub-plans,
not only the ones we have individually handled in pass_down_bound().
Its primary target is controlling the number of remote tuples
transferred over the network connection by postgres_fdw.

According to the past discussion, we add a new field, ps_numTuples, to
the PlanState to represent the required number of tuples.
The Limit node assigns a particular number to this field of its sub-plan;
the sub-plan then knows its upper node does not require the entire
tuple set, and can adjust its execution strategy.
Like MergeAppend, the sub-plan can also pass the bound down to its own
sub-plans again, if it makes sense and works correctly.

This feature is potentially a basis for GPU-based sorting on top of
CustomScan, because it has an advantage for workloads that pick up the
top-N tuples, if the data size is small enough to load onto GPU RAM.

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
> Sent: Tuesday, January 03, 2017 12:07 PM
> To: Kaigai Kouhei(海外 浩平) 
> Cc: Jeevan Chalke ; Robert Haas
> ; pgsql-hackers@postgresql.org; Etsuro Fujita
> ; Andres Freund 
> Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan
> [take-2]
> 
> Oops, I oversight this patch was marked as "returned with feedback", not
> "moved to the next CF".
> 
> Its status has not been changed since the last update. (Code was revised
> according to the last comment by Jeevan, but CF-Nov was time up at that
> time.)
> 
> How do I handle the patch?
> 
> 2016-12-05 16:49 GMT+09:00 Kouhei Kaigai :
> > Hello,
> >
> > Sorry for my late response.
> > The attached patch reflects your comments.
> >
> >> Here are few comments on latest patch:
> >>
> >>
> >> 1.
> >> make/make check is fine, however I am getting regression failure in
> >> postgres_fdw contrib module (attached regression.diff).
> >> Please investigate and fix.
> >>
> > It was an incorrect interaction when postgres_fdw tries to push down
> > sorting to the remote side. We cannot attach LIMIT clause on the plain
> > scan path across SORT, however, the previous version estimated the
> > cost for the plain scan with LIMIT clause even if local sorting is needed.
> > If remote scan may return just 10 rows, estimated cost of the local
> > sort is very lightweight, thus, this unreasonable path was chosen.
> > (On the other hands, its query execution results were correct because
> > ps_numTuples is not delivered across Sort node, so ForeignScan
> > eventually scanned all the remote tuples. It made correct results but
> > not optimal from the viewpoint of performance.)
> >
> > The v3 patch estimates the cost with remote LIMIT clause only if
> > supplied pathkey strictly matches with the final output order of the
> > query, thus, no local sorting is expected.
> >
> > Some of the regression test cases still have different plans but due
> > to the new optimization by remote LIMIT clause.
> > Without remote LIMIT clause, some of regression test cases preferred
> > remote-JOIN + local-SORT then local-LIMIT.
> > Once we have remote-LIMIT option, it allows to discount the cost for
> > remote-SORT by choice of top-k heap sorting.
> > It changed the optimizer's decision on some test cases.
> >
> > Potential one big change is the test case below.
> >
> >  -- CROSS JOIN, not pushed down
> >  EXPLAIN (VERBOSE, COSTS OFF)
> >  SELECT t1.c1, t2.c1 FROM ft1 t1 CROSS JOIN ft2 t2 ORDER BY t1.c1,
> > t2.c1 OFFSET 100 LIMIT 10;
> >
> > It assumed CROSS JOIN was not pushed down due to the cost for network
> > traffic, however, remote LIMIT reduced the estimated number of tuples
> > to be moved. So, all of the CROSS JOIN + ORDER BY + LIMIT became to
> > run on the remote side.
> >
> >> 2.
> >> + *
> >> + * MEMO: root->limit_tuples is not attached when query
> >> contains
> >> + * grouping-clause or aggregate functions. So, we don's
> adjust
> >> + * rows even if LIMIT  is supplied.
> >>
> >> Can you please explain why you are not doing this for grouping-clause
> >> or aggregate functions.
> >>
> > See grouping_planner() at optimizer/plan/planner.c It puts an invalid
> > value on the root->limit_tuples if query has GROUP BY clause,

Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-01 Thread Kouhei Kaigai
> Hello all,
> 
> as this is my first mail to pgsql-hackers, please be gentle :)
>
Welcome to pgsql-hackers,

> I've looked at the patch, and as I'm not that familiar with the pg-sourcecode,
> customs and so on, this isn't a review, more like food for thought and all
> should be taken with a grain of salt. :)
> 
> So here are a few questions and remarks:
> 
> >+double  limit_tuples = -1.0;
> 
> Surely the limit cannot be fractional, and must be an integer. So wouldn't
> it be better the same type as say:
> 
> >+if (root->limit_tuples >= 0.0 &&
> 
> Than you could also compare with ">= 0", not ">= 0.0".
>
The above variable represents the "estimated" number of rows at the
planning stage, not execution time.
You can see that the Path structure declares its "rows" field as
double. It makes sense to consider even stupid paths during planning,
though they are eventually rejected. For example, if a cross join of
two large tables appears during planning, a 64-bit integer would
overflow easily.

> node->ss.ps.ps_numTuples is f.i. an uint64.
> 
> Or is there a specific reason the limit must be a double?
>
The above variable represents the "actual" number of rows at execution
time. The hardware replacement cycle will likely come before int64 overflows.

> And finally:
> 
> >+if (node->ss.ps.ps_numTuples > 0)
> 
> >+appendStringInfo(&buf, " LIMIT %ld",
> node->ss.ps.ps_numTuples);
> 
> vs.
> 
> >+appendStringInfo(&buf, "%s LIMIT %lu",
> >+ sql,
> node->ss.ps.ps_numTuples);
> 
> It seems odd to have two different format strings here for the same variable.
>
Ah, yes, %lu is right because ps_numTuples is uint64.

> A few comments miss "." at the end, like these:
> 
> >+ * Also, pass down the required number of tuples
> 
> >+ * Pass down the number of required tuples by the upper node
> 
OK,

> And this comment might be better "were we already called?"
> 
> >+boolrs_started; /* are we already
> called? */
> 
Other variables in ResultState use the present form, like:

+   boolrs_started; /* are we already called? */
boolrs_done;/* are we done? */
boolrs_checkqual;   /* do we need to check the qual? */
 } ResultState;

Thanks,

PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 


passdown-limit-fdw.v5.patch
Description: passdown-limit-fdw.v5.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

2017-03-12 Thread Kouhei Kaigai
> Hello,
> 
> On Wed, March 1, 2017 7:21 pm, Kouhei Kaigai wrote:
> >> I've looked at the patch, and as I'm not that familiar with the
> >> pg-sourcecode, customs and so on, this isn't a review, more like food
> >> for thought and all should be taken with a grain of salt. :)
> >>
> >> So here are a few questions and remarks:
> >>
> >> >+ double  limit_tuples = -1.0;
> >>
> >> Surely the limit cannot be fractional, and must be an integer. So
> >> wouldn't it be better the same type as say:
> >>
> >> >+ if (root->limit_tuples >= 0.0 &&
> >>
> >> Than you could also compare with ">= 0", not ">= 0.0".
> >>
> > The above variable represents the "estimated" number of rows at the
> > planning stage, not execution time.
> > You may be able to see Path structure has "rows" field declared as
> > double type. It makes sense to consider stupid paths during planning,
> > even if it is eventually rejected. For example, if a cross join with
> > two large tables appear during planning, 64bit integer will make
> > overflow easily.
> 
> Hm, ok. Not related to your patch, just curious: Is there a mechanism in
> place that automatically rejects plans where the limit would overflow the
> double to uint64 conversation? Or is this more of a "there would be hopefully
> a plan with a better limit so we do not use the bad one"?
> 
> Would it possible to force a plan where such overflow would occur?
>
We have no such mechanism, and little need for one.
The estimated number of rows at plan time is stored in the plan_rows
field of the Plan structure, as FP64. Once the plan tree gets
constructed, the estimated number of rows no longer affects execution.
(Some plans might use it to estimate resource consumption at execution
time.)
On the other hand, the actual number of rows processed is saved in the
instrument field of the PlanState structure. It is counted up from
zero, one by one. So people will want to replace the hardware prior to
uint64 overflow. If 1B rows are processed per second, uint64 overflow
happens only after roughly 585 years(!).

> >> And this comment might be better "were we already called?"
> >>
> >> >+ boolrs_started; /* are we already
> >> called? */
> >>
> > Other variables in ResultState uses present form, like:
> >
> > +   boolrs_started; /* are we already called? */
> > boolrs_done;/* are we done? */
> > boolrs_checkqual;   /* do we need to check the qual? */
> >  } ResultState;
> 
> Yes, I noted that, but still "are" and "called" and "already" don't read
> well together for me:
> 
>   are - present form
>   called - past form like "were we called?", or "are we called bob?" an
> ongoing process
>   already - it has started
> 
> So "are we already called" reads like someone is waiting for being called.
> 
> Maybe to mirror the comment on "rs_done":
> 
>   /* have we started yet? */
> 
> Also, maybe it's easier for the comment to describe what is happening in
> the code because of the flag, not just to the flag itself:
> 
>   /* To do things once when we are called */
> 
> Anyway, it is a minor point and don't let me distract you from your work,
> I do like the feature and the patch :)
>
Fixed to "have we started yet?"


PG-Strom Project / NEC OSS Promotion Center
KaiGai Kohei 



passdown-limit-fdw.v6.patch
Description: passdown-limit-fdw.v6.patch



Re: [HACKERS] [v9.5] Custom Plan API

2014-07-03 Thread Kouhei Kaigai
ll -wno-unused-param -wno-empty -wfully-tagged -D . -D .
> >> -d stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
> >> openjade:catalogs.sgml:2525:45:X: reference to non-existent ID
> >> "SQL-CREATECUSTOMPLAN"
> >> make: *** [HTML.index] Error 1
> >> make: *** Deleting file `HTML.index'
> >>
> >> I'll review another part of the patch, including the design.
> >>
> >>
> >> 2014-06-14 10:59 GMT+09:00 Kohei KaiGai :
> >>> According to the discussion upthread, I revised the custom-plan
> >>> patch to focus on regular relation scan but no join support right
> >>> now, and to support DDL command to define custom-plan providers.
> >>>
> >>> Planner integration with custom logic to scan a particular relation
> >>> is enough simple, unlike various join cases. It's almost similar to
> >>> what built-in logic are doing now - custom-plan provider adds a path
> >>> node with its cost estimation if it can offer alternative way to
> >>> scan referenced relation. (in case of no idea, it does not need to
> >>> add any paths)
> >>>
> >>> A new DDL syntax I'd like to propose is below:
> >>>
> >>>   CREATE CUSTOM PLAN  FOR  PROVIDER ;
> >>>
> >>>  is as literal, put a unique identifier.
> >>>  is workload type to be offered by this custom-plan provider.
> >>> "scan" is the only option right now, that means base relation scan.
> >>>  is also as literal; it shall perform custom-plan
> provider.
> >>>
> >>> A custom-plan provider function is assumed to take an argument of
> >>> "internal" type to deliver a set of planner information that is
> >>> needed to construct custom-plan pathnode.
> >>> In case of "scan" class, pointer towards an customScanArg object
> >>> shall be delivered on invocation of custom-plan provider.
> >>>
> >>> typedef struct {
> >>> uint32custom_class;
> >>> PlannerInfo*root;
> >>> RelOptInfo *baserel;
> >>> RangeTblEntry  *rte;
> >>> } customScanArg;
> >>>
> >>> In case when the custom-plan provider function being invoked thought
> >>> it can offer an alternative scan path on the relation of "baserel",
> >>> things to do is (1) construct a CustomPath (or its inherited data
> >>> type) object with a table of callback function pointers (2) put its
> >>> own cost estimation, and (3) call add_path() to register this path as
> an alternative one.
> >>>
> >>> Once the custom-path was chosen by query planner, its
> >>> CreateCustomPlan callback is called to populate CustomPlan node based
> on the pathnode.
> >>> It also has a table of callback function pointers to handle various
> >>> planner's job in setrefs.c and so on.
> >>>
> >>> Similarly, its CreateCustomPlanState callback is called to populate
> >>> CustomPlanState node based on the plannode. It also has a table of
> >>> callback function pointers to handle various executor's job during
> >>> quey execution.
> >>>
> >>> Most of callback designs are not changed from the prior proposition
> >>> in
> >>> v9.4 development cycle, however, here is a few changes.
> >>>
> >>> * CustomPlan became to inherit Scan, and CustomPlanState became to
> >>>   inherit ScanState. Because some useful routines to implement scan-
> >>>   logic, like ExecScan, expects state-node has ScanState as its base
> >>>   type, it's more kindness for extension side. (I'd like to avoid each
> >>>   extension reinvent ExecScan by copy & paste!)
> >>>   I'm not sure whether it should be a union of Join in the future,
> however,
> >>>   it is a reasonable choice to have compatible layout with
> Scan/ScanState
> >>>   to implement alternative "scan" logic.
> >>>
> >>> * Exporting static functions - I still don't have a graceful answer
> here.
> >>>   However, it is quite natural that extensions to follow up interface
> updates
> >>>   on the future version up of PostgreSQL.
> >>>   Probably, it shall become clear what class of functions shall be
> >>>   exported and what class of functions shall be re-implemented withi

Re: [HACKERS] RLS Design

2014-07-03 Thread Kouhei Kaigai
Sorry for my late response; I'm now catching up on the discussion.

> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed 
> wrote:
> > > If RLS quals are instead regarded as constraints on access, and
> > > multiple policies apply, then it seems that the quals should now be
> > > combined with AND rather than OR, right?
> 
> I do feel that RLS quals are constraints on access, but I don't see how
> it follows that multiple quals should be AND'd together because of that.
> I view the RLS policies on each table as being independent and "standing
> alone" regarding what can be seen.  If you have access to a table today
> through policy A, and then later policy B is added, using AND would mean
> that the set of rows returned is less than if only policy A existed.
> That doesn't seem correct to me.
> 
It seems to me the direction in which the constraints (RLS policies) work is the reverse.

When we have no RLS policy, 100% of rows are visible, aren't they?
Adding a constraint usually reduces the number of visible rows, or at
least keeps the number the same. A constraint should never work in the
direction of increasing the number of visible rows.

If multiple RLS policies are connected with the OR operator, the first
policy works to reduce the number of visible rows, but the second
policy works in the reverse direction.

If RLS policies were OR'd, how would they be merged with user-given
qualifiers?
For example, if the RLS policy of t1 is (t1.credential < get_user_credential)
and the user's query is:
  SELECT * FROM t1 WHERE t1.x = t1.x;
do you think the RLS policy should be merged in OR'd form?


> > Yeah, maybe.  I intuitively feel that OR would be more useful, so it
> > would be nice to find a design where that makes sense.  But it depends
> > a lot, in my view, on what syntax we end up with.  For example,
> > suppose we add just one command:
> >
> > ALTER TABLE table_name FILTER [ role_name | PUBLIC ] USING qual;
> >
> > If the given role inherits from multiple roles that have different
> > filters, I think the user will naturally expect all of the filters to
> > be applied.
> 
> Agreed.
> 
> > But you could do it other ways.  For example:
> >
> > ALTER TABLE table_name [ NO ] ROW LEVEL SECURITY; ALTER TABLE
> > table_name GRANT ROW ACCESS TO role_name USING qual;
> >
> > If a table is set to NO ROW LEVEL SECURITY then it behaves just like
> > it does now: anyone who accesses it sees all the rows, restricted to
> > those columns for which they have permission.  If the table is set to
> > ROW LEVEL SECURITY then the default is to show no rows.  The second
> > command then allows access to a subset of the rows for a give role
> > name.  In this case, it is probably logical for access to be combined
> > via OR.
> 
> I can see value is having a table-level option to indicate if RLS is applied
> for that table or not, but I had been thinking we'd just automatically manage
> that.  That is to say that once you define an RLS policy for a table, we
> go look and see what policy should be applied in each case.  With the user
> able to control that, what happens if they say "row security" on the table
> and there are no policies?  All access would show the table as empty?  What
> if policies exist and they decide to 'turn off' RLS for the table- suddenly
> everyone can see all the rows?
> 
> My answers to the above (which are making me like the idea more,
> actually...) would be:
> 
> Yes, if they turn on RLS for the table and there aren't any policies, then
> the table appears empty for anyone with normal SELECT rights (table owner
> and superusers would still see everything).
> 
> If policies exist and the user asks to turn off RLS, I'd throw an ERROR
> as there is a security risk there.  We could support a CASCADE option which
> would go and drop the policies from the table first.
> 
Hmm... This approach starts from empty permissions, then adds permission
to reference a particular range of the configured table. It's one approach.

However, I think it has a dark side we cannot ignore. Usually, the purpose
of a security mechanism is to ensure what is readable/writable according to
the rules. Once multiple RLS policies are merged in OR'd form, the results
become unpredictable.
Suppose there are two independent applications that use RLS on table X.
Even if application-1 wants only "public" rows to be visible, "credential"
or "secret" rows may be exposed through the interaction with an orthogonal
policy configured by application-2 (which may configure its policy according
to the source IP address). It seems to me that application-2 has partially
invalidated the RLS policy configured by application-1.

I think an important characteristic is that things which should be invisible
stay invisible even when multiple rules are configured.

> Otherwise, I'm generally liking Dean's thoughts in
> http://www.postgresql.org/message-id/CAEZATCVftksFH=X+9mVmBNMZo5KsUP+R
> k0kb4oro92jofjo...@mail.gmail.com
> along with 

Re: [HACKERS] RLS Design

2014-07-04 Thread Kouhei Kaigai
> Kaigai,
> 
> On Thursday, July 3, 2014, Kouhei Kaigai  wrote:
> 
> 
>   Sorry for my late responding, now I'm catching up the discussion.
> 
>   > * Robert Haas (robertmh...@gmail.com  ) wrote:
>   > > On Tue, Jul 1, 2014 at 3:20 PM, Dean Rasheed
>  >
>   > wrote:
>   > > > If RLS quals are instead regarded as constraints on access,
> and
>   > > > multiple policies apply, then it seems that the quals should
> now be
>   > > > combined with AND rather than OR, right?
>   >
>   > I do feel that RLS quals are constraints on access, but I don't
> see how
>   > it follows that multiple quals should be AND'd together because
> of that.
>   > I view the RLS policies on each table as being independent and
> "standing
>   > alone" regarding what can be seen.  If you have access to a table
> today
>   > through policy A, and then later policy B is added, using AND
> would mean
>   > that the set of rows returned is less than if only policy A existed.
>   > That doesn't seem correct to me.
>   >
>   It seems to me direction of the constraints (RLS-policy) works to
> is reverse.
> 
>   In case when we have no RLS-policy, 100% of rows are visible isn't
> it?
> 
> 
> No, as outlined later, the table would appear empty if no policies exist
> and RLS is enabled for the table.
> 
> 
>   Addition of a constraint usually reduces the number of rows being
> visible,
>   or same number of rows at least. Constraint shall never work to
> the direction
>   to increase the number of rows being visible.
> 
> 
> Can you clarify where this is coming from..?  It sounds like you're
> referring to an existing implementation and, if so, it'd be good to get
> more information on how that works exactly.
> 

Oracle VPD - Multiple Policies for Each Table, View, or Synonym
http://docs.oracle.com/cd/B19306_01/network.102/b14266/apdvpoli.htm#i1008351

It says - Note that all policies applied to a table are enforced with AND 
syntax.

Not only Oracle VPD; it also fits the attitude of defense in depth.
Consider a system with a network firewall, Unix permissions, and SELinux
installed. If somebody wants to reference an information asset within a
file, he has to connect to the server from a network address allowed by
the firewall configuration, AND both DAC and MAC have to allow his access.
Usually, we have to pass all of the access controls to reference the target
information, not just one of the installed mechanisms.

>   For example, if RLS-policy of t1 is (t1.credential <
> get_user_credential)
>   and user's query is:
> SELECT * FROM t1 WHERE t1.x = t1.x;
>   Do you think RLS-policy shall be merged with OR'd form?
> 
> 
> Only the RLS policies are OR'd together, not user provided quals. The above
> would result in:
> 
> Where t1.x = t1.x and (t1.credential < get_user_credential)
> 
> If another policy also applies for this query, such as t1.cred2 <
> get_user_credential then we would have:
> 
> Where t1.x = t1.x and (t1.credential < get_user_credential OR t1.cred2 <
> get_user_credential)
> 
> This is similar to how roles work- your overall access includes all access
> granted to any roles you are a member of. You don't need SELECT rights granted
> to every role you are a member of to select from the table. Additionally,
> if an admin wants to AND the quals together then they can simply create
> a policy which does that rather than have 2 policies.
> 
It seems to me a pain for database administration if we have to take care
that RLS policies do not conflict with each other. I expect 90% of RLS
policies will be configured for the PUBLIC user, to apply the same access
rules to everybody. In that case, the DBA has to ensure the target table
has no policy, or that the existing policies do not conflict with the new
policy to be set.
I don't think it is a good idea to force these checks on DBAs.


>   Please assume here are two individual applications that use RLS
> on table-X.
>   Even if application-1 want only rows being "public" become visible,
> it may
>   expose "credential" or "secret" rows by interaction of orthogonal
> policy
>   configured by application-2 (that may configure the policy
> according to the
>   source ip-address). It seems to me application-2 partially
> invalidated the
>   RLS-policy configured by application-1.
> 
> 
>  You are suggesting instead that if application 2 sets up policies on the
> table and then application 1 adds another policy that it should reduce what
> applic

Re: [HACKERS] [v9.5] Custom Plan API

2014-07-17 Thread Kouhei Kaigai
> Alvaro Herrera  writes:
> > I haven't followed this at all, but I just skimmed over it and noticed
> > the CustomPlanMarkPos thingy; apologies if this has been discussed
> > before.  It seems a bit odd to me; why isn't it sufficient to have a
> > boolean flag in regular CustomPlan to indicate that it supports
> > mark/restore?
> 
> Yeah, I thought that was pretty bogus too, but it's well down the list of
> issues that were there last time I looked at this ...
> 
IIRC, CustomPlanMarkPos was suggested to keep the interface of
ExecSupportsMarkRestore(), which takes a plan node tag to determine
whether the node supports Mark/Restore.
As my original proposal did, a flag field in the CustomPlan structure
seems more straightforward to me, if we don't hesitate to change
ExecSupportsMarkRestore().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] [v9.5] Custom Plan API

2014-07-17 Thread Kouhei Kaigai
> I personally don't see how this patch is 'ready for committer'. I realize
> that that state is sometimes used to denote that review needs to be
> "escalated", but it still seemspremature.
> 
> Unless I miss something there hasn't been any API level review of this?
> Also, aren't there several open items?
> 
Even though some interface specifications were revised according to the
comments from Tom in the last development cycle, the current set of
interfaces has not been reviewed by committers. I really want that review.

Here are two open items that we want to wait for committers comments.

* Whether set_cheapest() is called for all relkind?

This patch moved set_cheapest() to the end of set_rel_pathlist(),
to consolidate the entry point of the custom-plan-provider handler function.
It also implies a CPP can provide alternative paths for non-regular
relations (like sub-queries, functions, ...).
Hanada-san wonders whether we really have a case for running alternative
sub-query code. Even though I don't have use cases for alternative
sub-query execution logic, we also don't have a reason to restrict it.

* How should the arguments of the add_path handler be delivered?

The handler function (which adds a custom-path to the required
relation scan if it can provide one) is declared with an argument of
INTERNAL data type. An extension needs to cast the supplied pointer
to the customScanArg data type (or potentially customHashJoinArg and
so on...) according to the custom plan class.
I think this is a more extensible design than strict argument
definitions, but Hanada-san wonders whether it is the best design.

> Perhaps there needs to be a stage between 'needs review' and 'ready for
> committer'?
>
It needs a clarification of 'ready for committer'. I think interface
specification is the kind of task to be discussed with committers,
because the preferences/viewpoints of reviewers are not always the
same as theirs.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Andres Freund [mailto:and...@2ndquadrant.com]
> Sent: Friday, July 18, 2014 3:12 AM
> To: Shigeru Hanada
> Cc: Kaigai Kouhei(海外 浩平); Kohei KaiGai; Simon Riggs; Tom Lane; Stephen
> Frost; Robert Haas; PgHacker; Jim Mlodgenski; Peter Eisentraut
> Subject: Re: [HACKERS] [v9.5] Custom Plan API
> 
> On 2014-07-16 10:43:08 +0900, Shigeru Hanada wrote:
> > Kaigai-san,
> >
> > 2014-07-15 21:37 GMT+09:00 Kouhei Kaigai :
> > > Sorry, expected result of sanity-check test was not updated on
> > > renaming to pg_custom_plan_provider.
> > > The attached patch fixed up this point.
> >
> > I confirmed that all regression tests passed, so I marked the patch as
> > "Ready for committer".
> 
> I personally don't see how this patch is 'ready for committer'. I realize
> that that state is sometimes used to denote that review needs to be
> "escalated", but it still seems premature.
> 
> Unless I miss something there hasn't been any API level review of this?
> Also, aren't there several open items?
> 
> Perhaps there needs to be a stage between 'needs review' and 'ready for
> committer'?
> 
> Greetings,
> 
> Andres Freund
> 
> --
>  Andres Freund   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ testing, doesn't it?

2015-07-22 Thread Kouhei Kaigai
Fujita-san,

Sorry for my late response.

> The latest foreign-join pushdown patch allows fdw_scan_tlist to be set
> to a targetlist even for simple foreign table scans.  However, since I
> think we assume that the test tuple of a foreign table for an EPQ
> testing, whether it may be copied from the whole-row var or returned by
> the RefetchForeignRow routine, has the rowtype declared for the foreign
> table, ISTM that EPQ testing doesn't work properly in such a case since
> that the targetlist and qual are adjusted to reference fdw_scan_tlist in
> such a case.  Maybe I'm missing something though.
>
Let me confirm step-by-step.
For EPQ testing, the whole-row reference or RefetchForeignRow pulls a
record with a row type compatible with the base foreign table. Then,
this record is stored in es_epqTuple[], indexed by the base relation.

According to the previous discussion, I expect these tuples to be
re-checked by a built-in execution plan that is equivalent to the
sub-plan entirely pushed out to the remote side.
Do we share the same assumption?

If so, the next step is an enhancement of ExecScanFetch() to run the
alternative built-in plans against each es_epqTuple[] record, if given
scanrelid==0. In this case, the expression nodes adjusted to
fdw_scan_tlist are never called, so it should not lead to any
problems...?

> I don't understand custom scans/joins exactly, but I have a similar
> concern for the simple-custom-scan case too.
>
In the case of a custom scan/join, it fetches a record using
heap_fetch(), identified by ctid, and saves it to es_epqTuple[].
Then, EvalPlanQual() walks down the plan tree. Once a custom-join
node (scanrelid==0) appears, it shall call the equivalent alternative
if possible, or call ExecProcNode() on the underlying nodes and then
re-construct its result according to the custom_scan_tlist definition.

It does not look problematic to me.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Asynchronous execution on FDW

2015-07-22 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro HORIGUCHI
> Sent: Wednesday, July 22, 2015 4:10 PM
> To: robertmh...@gmail.com
> Cc: hlinn...@iki.fi; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Asynchronous execution on FDW
> 
> Hello, thank you for the comment.
> 
> At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas  wrote
> in 
> > On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas  wrote:
> > > At a quick glance, I think this has all the same problems as starting the
> > > execution at ExecInit phase. The correct way to do this is to kick off the
> > > queries in the first IterateForeignScan() call. You said that "ExecProc
> > > phase does not fit" - why not?
> >
> > What exactly are those problems?
> >
> > I can think of these:
> >
> > 1. If the scan is parametrized, we probably can't do it for lack of
> > knowledge of what they will be.  This seems easy; just don't do it in
> > that case.
> 
> We can put an early kick to foreign scans only for the first shot
> if we do it outside (before) ExecProc phase.
> 
> Nestloop
> -> SeqScan
> -> Append
>-> Foreign (Index) Scan
>-> Foreign (Index) Scan
>..
> 
> This plan premises precise (even to some extent) estimate for
> remote query but async execution within ExecProc phase would be
> in effect for this case.
> 
> 
> > 2. It's possible that we're down inside some subtree of the plan that
> > won't actually get executed.  This is trickier.
> 
> As for current postgres_fdw, it is done simply abandoning queued
> result then close the cursor.
> 
> > Consider this:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> > -> Foreign Scan
> > 
> >
> > If we don't start each foreign scan until the first tuple is fetched,
> > we will not get any benefit here, because we won't fetch the first
> > tuple from query #2 until we finish reading the results of query #1.
> > If the result of the Append node will be needed in its entirety, we
> > really, really want to launch of those queries as early as possible.
> > OTOH, if there's a Limit node with a small limit on top of the Append
> > node, that could be quite wasteful.
> 
> It's the nature of speculative execution, but the Limit will be
> pushed down onto every Foreign Scans near future.
> 
> > We could decide not to care: after all, if our limit is
> > satisfied, we can just bang the remote connections shut, and if
> > they wasted some CPU, well, tough luck for them.  But it would
> > be nice to be smarter.  I'm not sure how, though.
> 
> Appropriate fetch size will cap the harm and the case will be
> handled as I mentioned above as for postgres_fdw.
>
Horiguchi-san,

Let me ask an elemental question.

If we have ParallelAppend node that kicks a background worker process for
each underlying child node in parallel, does ForeignScan need to do something
special?

Expected waste of CPU or I/O is common problem to be solved, however, it does
not need to add a special case handling to ForeignScan, I think.
How about your opinion?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ testing, doesn't it?

2015-07-22 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Wednesday, July 22, 2015 7:05 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ 
> testing,
> doesn't it?
> 
> Hi KaiGai-san,
> 
> On 2015/07/22 16:44, Kouhei Kaigai wrote:
> >> The latest foreign-join pushdown patch allows fdw_scan_tlist to be set
> >> to a targetlist even for simple foreign table scans.  However, since I
> >> think we assume that the test tuple of a foreign table for an EPQ
> >> testing, whether it may be copied from the whole-row var or returned by
> >> the RefetchForeignRow routine, has the rowtype declared for the foreign
> >> table, ISTM that EPQ testing doesn't work properly in such a case since
> >> that the targetlist and qual are adjusted to reference fdw_scan_tlist in
> >> such a case.  Maybe I'm missing something though.
> >>
> > Let me confirm step-by-step.
> > For EPQ testing, whole-row-reference or RefetchForeignRow pulls a record
> > with row-type compatible to the base foreign table. Then, this record
> > is stored in the es_epqTuple[] indexed by the base relation.
> >
> > According to the previous discussion, I expect these tuples are re-checked
> > by built-in execution plan, but equivalent to the sub-plan entirely pushed
> > out to the remote side.
> > Do we see the same assumption?
> 
> No, what I'm concerned about is the case when scanrelid > 0.
>
Hmm. If scanrelid > 0, then fdw_scan_tlist should be NIL.
I want to put Assert(scanrelid == 0 || fdw_scan_tlist == NIL) just
after the GetForeignPlan() call in createplan.c.

I'm curious why you tried to put a valid fdw_scan_tlist for
scanrelid > 0. It's unusual.

> > If so, next step is enhancement of ExecScanFetch() to run the alternative
> > built-in plans towards each es_epqTuple[] records, if given scanrelid==0.
> > In this case, expression nodes adjusted to fdw_scan_tlist never called,
> > so it should not lead any problems...?
> 
> When scanrelid = 0, I think we should run the alternative plans in
> ExecScanFetch or somewhere, as you mentioned.
>
OK,

> >> I don't understand custom scans/joins exactly, but I have a similar
> >> concern for the simple-custom-scan case too.
> >>
> > In case of custom scan/join, it fetches a record using heap_fetch()
> > identified by ctid, and saved to es_epqTuple[].
> > Then, EvalPlanQual() walks down the plan-tree. Once it appears a node
> > of custom-join (scanrelid==0), it shall call the equivalent alternatives
> > if possible, or calls ExecProcNode() towards the underlying nodes then
> > re-construct its result according to the custom_scan_tlist definition.
> >
> > It does not look to me problematic.
> 
> Sorry, I don't understand what you mean.  Maybe I have to learn more
> about custom scans/joins, but thanks for the explanation!
>
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ testing, doesn't it?

2015-07-22 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Wednesday, July 22, 2015 11:19 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Etsuro Fujita; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ 
> testing,
> doesn't it?
> 
> On Wed, Jul 22, 2015 at 8:13 AM, Kouhei Kaigai  wrote:
> >> No, what I'm concerned about is the case when scanrelid > 0.
> >>
> > Hmm. if scanrelid > 0, then fdw_scan_tlist should be NIL.
> > I want to put Assert(scanrelid==0 || fdw_scan_tlist == NIL) just after
> > the GetForeignPlan() in createplan.c.
> >
> > I'm curious why you tried to put valid fdw_scan_tlist for scanrelid > 0.
> > It's unusual.
> 
> Allowing that was part of the point of Tom Lane's commit
> 1a8a4e5cde2b7755e11bde2ea7897bd650622d3e.  See the second bullet
> point, after the comma.
>
Indeed, this commit allows ForeignScan to have fdw_scan_tlist even if
scanrelid > 0; however, I'm uncertain about its reason/intention.
Is it a preparation for the upcoming target-list pushdown??

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Queries runs slow on GPU with PG-Strom

2015-07-22 Thread Kouhei Kaigai
Hi Yang,

> I've performed some tests on pg_strom according to the wiki. But it seems that
> queries run slower on GPU than CPU. Can someone shed a light on what's wrong
> with my settings. My setup was Quadro K620 + CUDA 7.0 (For Ubuntu 14.10) +
> Ubuntu 15.04. And the results was
>
  :
> ,
> | LOG:  CUDA Runtime version: 7.0.0
> | LOG:  NVIDIA driver version: 346.59
> | LOG:  GPU0 Quadro K620 (384 CUDA cores, 1124MHz), L2 2048KB, RAM 
> 2047MB
> (128bits, 900KHz), capability 5.0
> | LOG:  NVRTC - CUDA Runtime Compilation vertion 7.0
> | LOG:  redirecting log output to logging collector process
> | HINT:  Future log output will appear in directory "pg_log".
> `
>
It looks to me like your GPU has poor memory access capability;
thus, the preprocessing of aggregation (which heavily uses atomic
operations on global memory) consumes the majority of the processing
time. Please try the query with:
  SET pg_strom.enable_gpupreagg = off;
GpuJoin uses fewer atomic operations, so it has an advantage.

A GPU's two major advantages are a massive number of cores and higher
memory bandwidth than a CPU, so, fundamentally, I'd recommend using a
better GPU board...
According to NVIDIA, the K620 uses DDR3 DRAM, so there is no advantage
in memory access speed here. How about a GTX750Ti (which needs no
external power, like the K620) or AWS's g2.2xlarge instance type?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of YANG
> Sent: Thursday, July 23, 2015 12:16 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] Queries runs slow on GPU with PG-Strom
> 
> 
> Hello,
> 
> I've performed some tests on pg_strom according to the wiki. But it seems that
> queries run slower on GPU than CPU. Can someone shed a light on what's wrong
> with my settings. My setup was Quadro K620 + CUDA 7.0 (For Ubuntu 14.10) +
> Ubuntu 15.04. And the results was
> 
> with pg_strom
> =
> 
> explain SELECT count(*) FROM t0 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 10;
> 
> 
> QUERY PLAN
> 
> 
> ---
>  Aggregate  (cost=190993.70..190993.71 rows=1 width=0) (actual
> time=18792.236..18792.236 rows=1 loops=1)
>->  Custom Scan (GpuPreAgg)  (cost=7933.07..184161.18 rows=86 width=108)
> (actual time=4249.656..18792.074 rows=77 loops=1)
>  Bulkload: On (density: 100.00%)
>  Reduction: NoGroup
>  Device Filter: (sqrtx - '25.6'::double precision) ^ '2'::double
> precision) + ((y - '12.8'::double precision) ^ '2'::double precision))) <
> '10'::double precision)
>  ->  Custom Scan (BulkScan) on t0  (cost=6933.07..182660.32
> rows=1060 width=0) (actual time=139.399..18499.246 rows=1000 loops=1)
>  Planning time: 0.262 ms
>  Execution time: 19268.650 ms
> (8 rows)
> 
> 
> 
> explain analyze SELECT cat, AVG(x) FROM t0 NATURAL JOIN t1 GROUP BY cat;
> 
> QUERY
> PLAN
> 
> --
>  HashAggregate  (cost=298541.48..298541.81 rows=26 width=12) (actual
> time=11311.568..11311.572 rows=26 loops=1)
>Group Key: t0.cat
>->  Custom Scan (GpuPreAgg)  (cost=5178.82..250302.07 rows=1088 width=52)
> (actual time=3304.727..11310.021 rows=2307 loops=1)
>  Bulkload: On (density: 100.00%)
>  Reduction: Local + Global
>  ->  Custom Scan (GpuJoin)  (cost=4178.82..248541.18 rows=1060
> width=12) (actual time=923.417..2661.113 rows=1000 loops=1)
>Bulkload: On (density: 100.00%)
>Depth 1: Logic: GpuHashJoin, HashKeys: (aid), JoinQual: (aid =
> aid), nrows_ratio: 1.
>->  Custom Scan (BulkScan) on t0  (cost=0.00..242858.60
> rows=1060 width=16) (actual time=6.980..871.431 rows=1000 loops=1)
>->  Seq Scan on t1  (cost=0.00..734.00 rows=4 width=4)
> (actual time=0.204..7.309 rows=4 loops=1)
>  Planning time: 47.834 ms
>  Execution time: 11355.103 ms
> (12 rows)
> 
> 
> without pg_strom
> 
> 
> test=# explain analyze SELECT count(*) FROM t0 WHERE sqrt((x-25.6)^2 +
> (y-12.8)^2) < 10;
> 
> QUERY PLAN
> 
> 
> 
>  Aggregate  (cost=426193.03..426193.04 rows=1 width=0) (actual
> time=3880.379..3880.379 rows=1 loops=1)
>->  Seq Scan on t0  (cost=0.00..417859.65 rows=353 width=0) (actual
> time=0.075..3859.200 rows=314063 loops=1)
>  Filter: (sqrt

Re: [HACKERS] Queries runs slow on GPU with PG-Strom

2015-07-22 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Josh Berkus
> Sent: Thursday, July 23, 2015 2:49 AM
> To: YANG; pgsql-hackers@postgresql.org; KaiGai Kohei
> Subject: Re: [HACKERS] Queries runs slow on GPU with PG-Strom
> 
> On 07/22/2015 08:16 AM, YANG wrote:
> >
> > Hello,
> >
> > I've performed some tests on pg_strom according to the wiki. But it seems 
> > that
> > queries run slower on GPU than CPU. Can someone shed a light on what's wrong
> > with my settings. My setup was Quadro K620 + CUDA 7.0 (For Ubuntu 14.10) +
> > Ubuntu 15.04. And the results was
> 
> I believe that pgStrom has its own mailing list.  KaiGai?
>
Sorry, I replied to this earlier.

We have no own mailing list, but issue tracker on GitHub is a proper
way to report problems to the developers.

  https://github.com/pg-strom/devel/issues

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] Asynchronous execution on FDW

2015-07-23 Thread Kouhei Kaigai
> > If we have ParallelAppend node that kicks a background worker process for
> > each underlying child node in parallel, does ForeignScan need to do 
> > something
> > special?
> 
> Although I don't see the point of the background worker in your
> story but at least for ParalleMergeAppend, it would frequently
> discontinues to scan by upper Limit so one more state, say setup
> - which mans a worker is allocated but not started- would be
> useful and the driver node might need to manage the number of
> async execution. Or the driven nodes might do so inversely.
>
I expected workloads like a single-shot scan on a large partitioned
fact table on a DWH system. Yep, if a workload is expected to rescan
so frequently, its expected cost shall be higher (by the cost to
launch bgworkers) than the existing Append, and then the planner will
kick out this path.

Regarding the interaction between Limit and ParallelMergeAppend, it
is probably the best scenario, isn't it? If Limit picks up the least
1000 rows from a partitioned table consisting of 20 child tables,
ParallelMergeAppend can launch 20 parallel jobs, each of which picks
up the least 1000 rows from its child relation.
Probably, it is the same job as done in pass_down_bound() of
nodeLimit.c.
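
That bound push-down can be sketched in miniature: each child only
needs to produce its own least N rows, since the global least N are
guaranteed to be among those candidates. This toy (plain arrays in
place of executor nodes; all names are hypothetical) shows the idea:

```c
#include <stdlib.h>
#include <string.h>

/* Each "child" input is already sorted; the parent only needs the
 * least `bound` rows, so every child can stop after producing `bound`
 * rows -- the analogue of pass_down_bound() in nodeLimit.c. */

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

/* Merge the first `bound` rows of every child and keep the global
 * least `bound`.  Returns the number of rows written to `out`. */
static int
merge_with_bound(int *children[], const int nrows[], int nchildren,
                 int bound, int *out)
{
    int     candidates[256];        /* assume nchildren * bound <= 256 */
    int     ncand = 0;

    for (int c = 0; c < nchildren; c++)
    {
        int take = nrows[c] < bound ? nrows[c] : bound;

        memcpy(candidates + ncand, children[c], take * sizeof(int));
        ncand += take;
    }
    qsort(candidates, ncand, sizeof(int), cmp_int);

    int nout = ncand < bound ? ncand : bound;

    memcpy(out, candidates, nout * sizeof(int));
    return nout;
}
```

Every row beyond the per-child bound can never appear in the final
result, which is why the children may stop early without changing the
answer.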

> As for ForeignScan, it is merely an API for FDW and does nothing
> substantial so it would have nothing special to do. As for
> postgres_fdw, current patch restricts one execution per one
> foreign server at once by itself. We would have to provide
> another execution management if we want to have two or more
> simultaneous scans per one foreign server at once.
>
Yep, your 4th patch defines a new callback in FdwRoutine, and the
5th patch implements the postgres_fdw-specific portion.
It shall work well for distributed / sharded database environments;
however, its benefit is limited to ForeignScan.
Once the management node kicks the underlying SeqScan, ForeignScan or
other nodes in parallel, it also enables local heap scans to run
asynchronously.

> Sorry for the focusless discussion but does this answer some of
> your question?
>
Hmm... Its advantage is still unclear to me. However, it is not
fair to hijack this thread with my idea.
I'll submit my design proposal about ParallelAppend for the
next commit-fest. Please comment on it.

> > Expected waste of CPU or I/O is common problem to be solved, however, it 
> > does
> > not need to add a special case handling to ForeignScan, I think.
> > How about your opinion?
> 
> I agree with you that ForeignScan as the wrapper for FDWs don't
> need anything special for the case. I suppose for now that
> avoiding the penalty from abandoning too many speculatively
> executed scans (or other works on bg worker like sorts) would be
> a business of the upper node of FDWs, or somewhere else.
> 
> However, I haven't dismissed the possibility that some common
> works related to resource management could be integrated into
> executor (or even into planner), but I see none for now.
>
I also agree that it is "eventually" needed, but it may not be
supported in the first version.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Parallel Seq Scan

2015-07-23 Thread Kouhei Kaigai
Hi Amit,

The latest v16 patch cannot be applied to the latest
master as is.
434873806a9b1c0edd53c2a9df7c93a8ba021147 changed various
lines in heapam.c, so it probably conflicts with this.

[kaigai@magro sepgsql]$ cat ~/patch/parallel_seqscan_v16.patch | patch -p1
patching file src/backend/access/common/printtup.c
patching file src/backend/access/heap/heapam.c
Hunk #4 succeeded at 499 (offset 10 lines).
Hunk #5 succeeded at 533 (offset 10 lines).
Hunk #6 FAILED at 678.
Hunk #7 succeeded at 790 (offset 10 lines).
Hunk #8 succeeded at 821 (offset 10 lines).
Hunk #9 FAILED at 955.
Hunk #10 succeeded at 1365 (offset 10 lines).
Hunk #11 succeeded at 1375 (offset 10 lines).
Hunk #12 succeeded at 1384 (offset 10 lines).
Hunk #13 succeeded at 1393 (offset 10 lines).
Hunk #14 succeeded at 1402 (offset 10 lines).
Hunk #15 succeeded at 1410 (offset 10 lines).
Hunk #16 succeeded at 1439 (offset 10 lines).
Hunk #17 succeeded at 1533 (offset 10 lines).
2 out of 17 hunks FAILED -- saving rejects to file 
src/backend/access/heap/heapam.c.rej
 :

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Amit Kapila
> Sent: Thursday, July 23, 2015 8:43 PM
> To: Robert Haas
> Cc: Haribabu Kommi; Gavin Flower; Jeff Davis; Andres Freund; Kaigai Kouhei(海
> 外 浩平); Amit Langote; Amit Langote; Fabrízio Mello; Thom Brown; Stephen Frost;
> pgsql-hackers
> Subject: Re: [HACKERS] Parallel Seq Scan
> 
> On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas  wrote:
> >
> > One thing I noticed that is a bit dismaying is that we don't get a lot
> > of benefit from having more workers.  Look at the 0.1 data.  At 2
> > workers, if we scaled perfectly, we would be 3x faster (since the
> > master can do work too), but we are actually 2.4x faster.  Each
> > process is on the average 80% efficient.  That's respectable.  At 4
> > workers, we would be 5x faster with perfect scaling; here we are 3.5x
> > faster.   So the third and fourth worker were about 50% efficient.
> > Hmm, not as good.  But then going up to 8 workers bought us basically
> > nothing.
> >
> 
> I think the improvement also depends on how costly is the qualification,
> if it is costly, even for same selectivity the gains will be shown till higher
> number of clients and for simple qualifications, we will see that cost of
> having more workers will start dominating (processing data over multiple
> tuple queues) over the benefit we can achieve by them.
> 
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 



Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ testing, doesn't it?

2015-07-23 Thread Kouhei Kaigai
> On Wed, Jul 22, 2015 at 8:24 PM, Kouhei Kaigai  wrote:
> > Indeed, this commit allows ForeignScan to have fdw_scan_tlist, even if
> > scanrelid > 0, however, I'm uncertain about its reason/intention.
> > Does it a preparation for the upcoming target-list-pushdown??
> 
> I guess Tom would have to comment on whether it could be used for that
> purpose.  I assume that omitting columns could be interesting for some
> FDWs, if nothing else.
>
Indeed. As current postgres_fdw does, the FDW driver puts dummy NULLs
in unreferenced columns for network optimization; however, this shall
become unnecessary if we can change the definition of the expected
record type of the foreign table. Its advantages are a more
human-readable remote query and fewer CPU cycles spent on projection.

A dark side is, as discussed in this thread, the complexity of
EvalPlanQual. RefetchForeignRow() returns a tuple based on the foreign
table definition; on the other hand, a whole-row var points to a tuple
based on fdw_scan_tlist if it exists.
An alternative host-only plan node and the relevant expressions will
be constructed against the definition of the base foreign table. So,
we need to transform the tuple into the layout based on the foreign
table definition if we allow fdw_scan_tlist with scanrelid > 0.

However, I'm skeptical whether this solution is valid for the long
term. Once we support pushing down expensive expressions in the target
list to the remote side, fdw_scan_tlist will contain expression nodes
rather than simple Var nodes. In this case, it is not obvious how to
reproduce a tuple according to the foreign table definition from a
record based on fdw_scan_tlist.

So, I'm inclined to prohibit setting fdw_scan_tlist/custom_scan_tlist
for actual scan nodes (scanrelid > 0), at present.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



Re: [HACKERS] Asynchronous execution on FDW

2015-07-24 Thread Kouhei Kaigai
Hello Horiguchi-san,

> > > As for ForeignScan, it is merely an API for FDW and does nothing
> > > substantial so it would have nothing special to do. As for
> > > postgres_fdw, current patch restricts one execution per one
> > > foreign server at once by itself. We would have to provide
> > > another execution management if we want to have two or more
> > > simultaneous scans per one foreign server at once.
> > >
> > Yep, your 4th patch defines a new callback to FdwRoutines and
> > 5th patch implements postgres_fdw specific portion.
> > It shall work for distributed / shaded database environment well,
> > however, its benefit is around ForeignScan only.
> > Once management node kicks underlying SeqScan, ForeignScan or
> > others in parallel, it also enables to run local heap scan
> > asynchronously.
> 
> I suppose SeqScan don't need async kick since its startup cost is
> extremely low as nothing. (fetching first several pages would
> boost seqscans?) On the other hand sort/hash would be a field
> where asynchronous execution is in effect.
>
Startup cost is not the only advantage of asynchronous execution.
If a background worker prefetches the records to be read soon, while
other tasks are in progress, the latency to fetch the next record is
much lower than on the usual execution path.
Suppose the next record is on neither the shared buffer nor the
operating system's page cache.
First, the upper node calls heap_getnext() to fetch the next record;
it then looks up the target block in the shared buffer, issues a
read(2) system call, and the operating system puts the caller process
to sleep until the block has been read from storage.
If an asynchronous worker has already gone through the above painful
code path and the records to be read are ready at the top of the
queue, it will reduce the I/O wait time dramatically.
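
The mechanism can be sketched as a toy producer/consumer queue: a
worker thread walks the slow read path ahead of time, so the consumer
blocks only when the queue is empty. All names here are illustrative,
not PostgreSQL APIs:

```c
#include <pthread.h>

#define QUEUE_CAP 4
#define NBLOCKS   16

typedef struct
{
    int             items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} PrefetchQueue;

static void
queue_put(PrefetchQueue *q, int blkno)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = blkno;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int
queue_get(PrefetchQueue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int blkno = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return blkno;
}

/* The background worker: stands in for the buffer lookup + read(2)
 * path that would otherwise block the upper node on every cold block. */
static void *
prefetch_worker(void *arg)
{
    PrefetchQueue *q = arg;

    for (int blkno = 0; blkno < NBLOCKS; blkno++)
        queue_put(q, blkno);    /* imagine a slow read(2) before this */
    return NULL;
}
```

The consumer's heap_getnext() analogue then pays the read latency only
when the worker has fallen behind, instead of on every cold block.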

> > > Sorry for the focusless discussion but does this answer some of
> > > your question?
> > >
> > Hmm... Its advantage is still unclear for me. However, it is not
> > fair to hijack this thread by my idea.
> 
> It would be more advantageous if join/sort pushdown on fdw comes,
> where start-up cost could be extremely high...
>
Not only FDW. I intend to combine ParallelAppend with another idea
I previously posted, to run table joins in parallel.
In the case of partitioned foreign tables, the planner probably needs
to consider (1) FDW scan + local serial join, (2) FDW scan + local
parallel join, or (3) FDW remote join, according to the cost.

* [idea] table partition + hash join:
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f6...@bpxm15gp.gisp.nec.co.jp

Anyway, let's have a further discussion in another thread.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] We need to support ForeignRecheck for late row locking, don't we?

2015-07-24 Thread Kouhei Kaigai
Fujita-san,

> On 2015/07/22 19:10, Etsuro Fujita wrote:
> > While working on the issue "Foreign join pushdown vs EvalPlanQual", I
> > happened to notice odd behaviors of late row locking in FDWs.
> 
> > I think the reason for that is because we don't check pushed-down quals
> > inside an EPQ testing even if what was fetched by RefetchForeignRow was
> > an updated version of the tuple rather than the same version previously
> > obtained.  So, to fix this, I'd like to propose that pushed-down quals
> > be checked in ForeignRecheck.
> 
> Attached is a patch for that.
> 
> * I've modified ForeignRecheck so as to check pushed-down quals whether
> doing late locking or early locking.  I think we could probably make
> ForeignRecheck do so only when doing late locking, but I'm not sure it's
> worth complicating the code.
> 
> * I've made the above change only for simple foreign table scans that
> have scanrelid > 0 and fdw_scan_tlist = NIL.  As for simple foreign
> table scans that have scanrelid > 0 and *fdw_scan_tlist is non-NIL*, I
> think we are under discussion in another thread I started.  Will update
> as necessary.
> 
> * Sorry, I've not fully updated comments and docs yet.  Will update.
> 
> I'd be happy if I could get feedback earlier.
>
Isn't it an option to put a new callback in ForeignRecheck?

The FDW driver knows that its private data structure includes the
expression node that was pushed down to the remote side. So, it seems
to me the best way is to consult the FDW driver on whether the
supplied tuple should be visible according to the pushed-down
qualifier.

More or less, this fix needs a new interface contract around the
EvalPlanQual logic. It is better to give the FDW driver more
flexibility in its private data structure and in the way it processes
the recheck logic, rather than a special-purpose variable.

If the FDW driver manages the pushed-down expression in its own
format, the pushedDownQual requirement would make it hold the
qualifier redundantly.
The callback approach does not have that kind of concern.
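
A hypothetical shape of that callback, with minimal mocks standing in
for the executor state (none of these names are the actual FDW API):

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal mocks -- not the real PostgreSQL structures */
typedef struct { int x; } TupleTableSlot;

typedef struct FdwRoutine
{
    /* Hypothetical: let the driver judge visibility against whatever
     * representation of the pushed-down quals it keeps privately. */
    bool (*RecheckForeignScan)(void *fdw_private, TupleTableSlot *slot);
} FdwRoutine;

typedef struct ForeignScanState
{
    FdwRoutine *fdwroutine;
    void       *fdw_private;    /* driver-specific pushed-down quals */
} ForeignScanState;

/* ForeignRecheck consults the driver when the callback is provided;
 * otherwise it keeps the historical behaviour of accepting the tuple. */
static bool
ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
{
    if (node->fdwroutine->RecheckForeignScan)
        return node->fdwroutine->RecheckForeignScan(node->fdw_private,
                                                    slot);
    return true;
}

/* A mock driver whose "pushed-down qual" is simply x > threshold */
static bool
mock_recheck(void *fdw_private, TupleTableSlot *slot)
{
    int threshold = *(int *) fdw_private;

    return slot->x > threshold;
}
```

A driver that keeps its pushed-down quals in an opaque compiled form
could implement such a callback without ever materializing a separate
pushedDownQual expression list.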

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] fdw_scan_tlist for foreign table scans breaks EPQ testing, doesn't it?

2015-07-24 Thread Kouhei Kaigai




> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Saturday, July 25, 2015 2:59 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Etsuro Fujita; pgsql-hackers@postgresql.org
> Subject: ##freemail## Re: [HACKERS] fdw_scan_tlist for foreign table scans 
> breaks
> EPQ testing, doesn't it?
> 
> On Thu, Jul 23, 2015 at 8:27 PM, Kouhei Kaigai  wrote:
> > A dark side is, as discussed in this thread, complexity of EvalPlanQual.
> > RefetchForeignRow() returns a tuple based on foreign table definition,
> > on the other hands, whole-row var points a tuple based on fdw_scan_tlist
> > if exists.
> > An alternative host-only plan-node and relevant expression will be
> > constructed towards the definition of base foreign-table. So, we need to
> > transform the tuple to the layout based on foreign table definition if
> > we allow fdw_scan_tlist with scanrelid > 0.
> >
> > However, I'm skeptical whether this solution is valid for long term.
> > Once we support to push down expensive expression in target-list to
> > remote side, fdw_scan_tlist will contain expression node rather than
> > simple Var node. In this case, it is not obvious to reproduce a tuple
> > according to the foreign table definition from a record based on the
> > fdw_scan_tlist.
> 
> I don't think we can realistically make a decision that pushing down
> target list expressions to the remote side is forever off the table.
> 
> Is the problem here that it's not *possible* for an FDW to do the
> right thing, or just that it might be difficult to code in practice?
> I'm fuzzy on why this isn't just a matter of having
> RefetchForeignRow() return a row with the correct tuple descriptor.
>
RefetchForeignRow() does not take a ForeignScanState argument, which
would know how the remote data is represented on the local side when a
valid fdw_scan_tlist is configured.
Don't we have a facility to look up the ScanState object by scanrelid
at execution time?

On the other hand, I'm inclined to think the FDW driver should provide
an alternative whole-row reference (according to the base foreign-table
definition) if it has a valid fdw_scan_tlist.
This is more suitable for the join pushdown cases, because the
alternative subplan (to be executed instead of the remote query)
assumes all the EPQ tuples follow the base table definitions as usual.

Wouldn't it be easy to inject a junk TLE that references a whole-row
variable based on the foreign table definition?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



[HACKERS] [DESIGN] ParallelAppend

2015-07-25 Thread Kouhei Kaigai
Hello,

I'm recently working/investigating on ParallelAppend feature
towards the next commit fest. Below is my design proposal.

1. Concept
--
Its concept is quite simple; anybody might have considered it more
than once. The ParallelAppend node kicks background worker processes
to execute its child nodes in parallel / asynchronously.
It intends to improve the performance of scanning large partitioned
tables from the standpoint of overall throughput; however, the latency
of the first multi-hundred rows is not in the scope of this project.
From the standpoint of technology trends, it primarily tries to utilize
the multi-core capability within a system, but it also enables expansion
to a distributed database environment using the foreign-table
inheritance features.
Its behavior is very similar to the Funnel node except for several
points; thus, we can reuse the infrastructure on which we have had a
long-standing discussion through the v9.5 development cycle.

2. Problems to be solved
-
Typical OLAP workloads involve joins across many tables and scans of
large tables that are often partitioned; their KPI is query response
time, while only a very small number of sessions are active
simultaneously. So we are required to run a single query as rapidly as
possible, even if it consumes more computing resources than typical
OLTP workloads.

The current implementation of heap scans is painful when we look at its
behavior from that standpoint - how many rows we can read within a
certain time - because of its synchronous manner.
In the worst case, when a SeqScan node tries to fetch the next tuple,
heap_getnext() looks up a block in the shared buffers, then ReadBuffer()
calls the storage manager to read the target block from the filesystem
if it is not in the buffers. Next, the operating system puts the calling
process to sleep until the required I/O is completed.
Most cases are resolved at an earlier stage than this worst case;
however, the best scenario we can expect is that the next tuple already
appears on top of the message queue (of course, with visibility checks
already done) with no fall-through to the buffer manager or deeper.
If we can run multiple scans in parallel / asynchronously, CPU cores
can be assigned to other processes by the operating system, which
eventually improves the I/O density and enables higher processing
throughput.
The Append node is an ideal point to parallelize because
- child nodes can have physically different locations by tablespace,
  so further tuning is possible according to the system landscape.
- it can control whether a subplan is actually executed on a background
  worker, on a per-subplan basis. If the subplans contain large tables
  and small tables, ParallelAppend may kick background workers to scan
  the large tables only, and scan the small tables by itself.
- like the Funnel node, we don't need to care about enhancing
  individual node types. SeqScan, IndexScan, ForeignScan, and others
  can perform as usual, but actually in parallel.


3. Implementation
--
* Plan & Cost

ParallelAppend shall appear where Append can appear, except for the
usage as a dummy. So I'll enhance set_append_rel_pathlist() to add
both an AppendPath and a ParallelAppendPath, with a cost for each.
The cost estimation logic shall take further discussion; however,
I expect the logic below to estimate the cost of ParallelAppend.
  1. Sum startup_cost and run_cost for each child pathnode, but
     distinguish between synchronous and asynchronous ones.
     A pathnode whose total cost is less than:
      (parallel_setup_cost + its total cost / parallel_append_degree
       + number of rows * cpu_tuple_comm_cost)
     is nonsense to run on a background worker, so it is treated as
     synchronous.
  2. parallel_setup_cost * (# of asynchronous nodes) is added to the
     sum of the startup_cost of the asynchronous nodes.
  3. The sum of the run_cost of the asynchronous nodes is divided by
     parallel_append_degree, then cpu_tuple_comm_cost * (total # of
     rows from asynchronous nodes) is added.
  4. The synchronous and asynchronous costs are then added together;
     the result becomes the cost of ParallelAppend.
Obviously, this stands on the viewpoint that cost reflects the response
time of the underlying plan. So the cost of ParallelAppend can be
smaller than the sum of the underlying child nodes.
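Steps 2-4 of the cost model can be sanity-checked with a small
self-contained C sketch. The parameter names (parallel_setup_cost,
cpu_tuple_comm_cost, parallel_append_degree) follow the proposal, but
the structure, constants, and the fact that each child is pre-classified
by an is_async flag (step 1 is assumed to have happened already) are
illustrative assumptions, not real planner code.

```c
#include <assert.h>
#include <math.h>

/* Assumed GUC-like parameters; the values here are invented. */
static const double parallel_setup_cost = 1000.0;
static const double cpu_tuple_comm_cost = 0.1;

typedef struct ChildCost {
    double startup_cost;
    double run_cost;
    double rows;
    int    is_async;            /* 1 = run on a background worker */
} ChildCost;

/* Sketch of steps 2-4 of the proposed ParallelAppend cost model. */
static void
cost_parallel_append(const ChildCost *children, int nchildren,
                     int parallel_append_degree,
                     double *startup_cost, double *total_cost)
{
    double sync_startup = 0.0, sync_run = 0.0;
    double async_startup = 0.0, async_run = 0.0, async_rows = 0.0;
    int nasync = 0;

    for (int i = 0; i < nchildren; i++) {
        if (children[i].is_async) {
            async_startup += children[i].startup_cost;
            async_run += children[i].run_cost;
            async_rows += children[i].rows;
            nasync++;
        } else {
            sync_startup += children[i].startup_cost;
            sync_run += children[i].run_cost;
        }
    }
    /* step 2: worker launch cost per asynchronous node */
    async_startup += parallel_setup_cost * nasync;
    /* step 3: divide async run cost by the degree, add transfer cost */
    async_run = async_run / parallel_append_degree
              + cpu_tuple_comm_cost * async_rows;
    /* step 4: combine synchronous and asynchronous parts */
    *startup_cost = sync_startup + async_startup;
    *total_cost = *startup_cost + sync_run + async_run;
}
```

Because the asynchronous run cost is divided by the degree, the total
can indeed come out smaller than the plain sum of the children, matching
the response-time view of cost described above.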

* Execution

Like the Funnel node, it kicks background workers in the ExecProcNode
handler; thus, its startup may come later than in Fujita-san's approach
if the first call of ParallelAppend is late - for example, when
ParallelAppend is located under a HashJoin whose inner Hash loads
billions of rows.
Even though I expect ExecParallelAppend to use, at least, simple round-
robin scheduling like funnel_getnext(), we may prefer synchronous nodes
over asynchronous ones just after the background worker startup.

4. Further challenges
--
* Serialization of CustomScan via outfuncs.c/readfuncs.c
  Because the methods field is basically a set of per-process pointers,
  we need an infrastructure to reproduce the same table on the background
  worker process, identified by the name.

[HACKERS] CustomScan and readfuncs.c

2015-07-26 Thread Kouhei Kaigai
Hello,

Under the investigation of ParallelAppend, I noticed there are a few
problems in CustomScan that prevent reproducing an equivalent plan
node on a background worker from its serialized string.

1. CustomScanMethods->TextOutCustomScan callback

This callback allows outputting custom information in nodeToString().
Originally, we intended this callback for debugging only, because
CustomScan must be copyObject()-safe; thus, all the private data also
must be stored in custom_exprs or custom_private.
However, it leads to another problem when we try to reproduce a
CustomScan node from the string form generated by outfuncs.c.
If TextOutCustomScan prints something, the upcoming _readCustomScan
has to deal with an unexpected number of tokens in an unexpected
format.
I'd like to propose omitting this callback prior to the v9.5 release,
to minimize compatibility issues.

2. Reproduce method table on background worker
--
The methods field of CustomPath/Scan/ScanState is expected to be a
reference to a static structure; thus, copyObject() does not copy the
entire table, only the pointers.
However, we have no way to guarantee that the callback functions have
the same entry-point addresses on the background workers. So we may
need an infrastructure to reproduce the same CustomScan node with the
same callback function table, probably identified by name.
We have a few ways to solve the problem.

* Add a system catalog; a function returns the pointers
The simplest way, like FDW. A system catalog holds a name and a
function that returns the callback pointers. It also needs SQL
statement support, which is a small downside.

* Register by name, during shared_preload_libraries only
Like an early version of the CustomScan interface, this requires
custom-scan providers to register a pair of name and callbacks, but
only while shared_preload_libraries is processed, to guarantee the
callbacks are registered in the background workers also.
(Is this assumption right on Windows?)
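A minimal sketch of the register-by-name idea in plain C. The names
(CustomScanMethodsStub, register_custom_scan_methods, and so on) are
hypothetical; a real implementation would use a dynahash table and
elog(ERROR, ...) on lookup failure instead of a fixed array and NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CUSTOMNAME_MAX_LEN 64
#define MAX_PROVIDERS 16

typedef struct CustomScanMethodsStub {
    const char *CustomName;
    /* ... the callback function pointers would live here ... */
} CustomScanMethodsStub;

static const CustomScanMethodsStub *providers[MAX_PROVIDERS];
static int nproviders = 0;

/* Called at shared_preload_libraries time (e.g. from _PG_init()), so
 * the same registration also happens in every background worker. */
static void
register_custom_scan_methods(const CustomScanMethodsStub *methods)
{
    assert(strlen(methods->CustomName) < CUSTOMNAME_MAX_LEN);
    assert(nproviders < MAX_PROVIDERS);
    providers[nproviders++] = methods;
}

/* Used by a _readCustomScan()-style reader to turn the serialized
 * name back into the local process's pointer table. */
static const CustomScanMethodsStub *
get_custom_scan_methods(const char *name)
{
    for (int i = 0; i < nproviders; i++)
        if (strcmp(providers[i]->CustomName, name) == 0)
            return providers[i];
    return NULL;                /* real code would raise an error here */
}
```

As long as the registration code runs in both the leader and the
workers, the name round-trips through nodeToString() and yields the
same method table on the other side.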

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] CustomScan and readfuncs.c

2015-07-26 Thread Kouhei Kaigai
> Kouhei Kaigai  writes:
> > Under the investigation of ParallelAppend, I noticed here is a few
> > problems in CustomScan, that prevents to reproduce an equivalent
> > plan node on the background worker from serialized string.
> 
> > 1. CustomScanMethods->TextOutCustomScan callback
> > 
> > This callback allows to output custom information on nodeToString.
> > Originally, we intend to use this callback for debug only, because
> > CustomScan must be copyObject() safe, thus, all the private data
> > also must be stored in custom_exprs or custom_private.
> > However, it will lead another problem when we try to reproduce
> > CustomScan node from the string form generated by outfuncs.c.
> > If TextOutCustomScan prints something, upcoming _readCustomScan
> > has to deal with unexpected number of tokens in unexpected format.
> 
> Um ... wait a second.  There is no support in readfuncs for any
> plan node type, and never has been, and I seriously doubt that there
> ever should be.  I do not think it makes sense to ship plans around
> in the way you seem to have in mind.  (Also, I don't think the
> problems you mention are exactly unique to CustomScan.  There's no
> reason to assume that FDW plans could survive this treatment either,
> since we do not know what's in the fdw_private stuff; certainly no
> one has ever suggested that it should not contain pointers to static
> data.)
>
Yep, no Plan node types are supported at this moment; however, some
will appear soon with the Funnel + PartialSeqScan nodes.
That mechanism serializes a partial plan subtree using nodeToString(),
then gives the flattened PlannedStmt to the background workers.
I'm now investigating applying the same structure to the Append node
to kick its child nodes in parallel.
Once various plan node types appear in readfuncs.c, we have to care
about this problem, don't we? I'm working on the patch submission of
ParallelAppend for the next commit-fest, so I'd like to build a
consensus on how to treat this matter.

> > I'd like to propose to omit this callback prior to v9.5 release,
> > for least compatibility issues.
> 
> I regard our commitment to cross-version compatibility for the
> custom scan APIs as being essentially zero, for reasons previously
> discussed.  So if this goes away in 9.6 it will not matter, but we
> might as well leave it in for now for debug support.
>
I don't argue this point strongly. If TextOutCustomScan is to be
obsoleted in v9.6, it is simply a kindness to developers not to use
this callback.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Combining Aggregates

2015-07-26 Thread Kouhei Kaigai
> On 04/01/2015 06:28 PM, Robert Haas wrote:
> > On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
> >  wrote:
> >>> I've been thinking of bumping this patch to the June commitfest as the
> >>> patch only exists to provide the basic infrastructure for things like
> >>> parallel aggregation, aggregate before join, and perhaps auto updating
> >>> materialised views.
> >>>
> >>> It seems unlikely that any of those things will happen for 9.5.
> >>
> >> Yeah, I guess so...
> >>
> >>> Does anybody object to me moving this to June's commitfest?
> >>
> >> Not from my side FWIW. I think it actually makes sense.
> >
> > +1.  I'd like to devote some time to looking at this, but I don't have
> > the time right now.  The chances that we can do something useful with
> > it in 9.6 seem good.
> 
> And the June commitfest is now in progress.
> 
> This patch seems sane to me, as far as it goes. However, there's no
> planner or executor code to use the aggregate combining for anything.
> I'm not a big fan of dead code, I'd really like to see something to use
> this.
>
+1, this patch itself looks good to me, but...

> The main use case people have been talking about is parallel query, but
> is there some other case this would be useful right now, without the
> parallel query feature? You and Simon talked about this case:
>
> > 2. Queries such as:
> >
> > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> > = p.product_id GROUP BY p.name;
> >
> > Such a query could be transformed into:
> >
> > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> > GROUP BY product_id) s
> > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
> >
> > Of course the outer query's SUM and GROUP BY would not be required if there
> > happened to be a UNIQUE index on product(name), but assuming there's not
> > then the above should produce the results faster. This of course works ok
> > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > aggregate functions would be required to process those intermediate
> > aggregate results that were produced by the sub-query.
> 
> Any chance you could implement that in the planner?
>
It likely needs a planner enhancement prior to other applications...
http://www.postgresql.org/message-id/ca+tgmobgwkhfzc09b+s2lxjtword5ht-avovdvaq4+rpwro...@mail.gmail.com

Once the planner is allowed to have both the normal path and partial-
aggregation paths, to be compared according to their costs, that is
the straightforward way to do it.

There is various academic research on this; for example, the paper
below is a good starting point to clarify which aggregate queries we
can run with a 2-phase approach.
http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] We need to support ForeignRecheck for late row locking, don't we?

2015-07-27 Thread Kouhei Kaigai
> On 2015/07/24 23:51, Kouhei Kaigai wrote:
> >> On 2015/07/22 19:10, Etsuro Fujita wrote:
> >>> While working on the issue "Foreign join pushdown vs EvalPlanQual", I
> >>> happened to notice odd behaviors of late row locking in FDWs.
> >>
> >>> I think the reason for that is because we don't check pushed-down quals
> >>> inside an EPQ testing even if what was fetched by RefetchForeignRow was
> >>> an updated version of the tuple rather than the same version previously
> >>> obtained.  So, to fix this, I'd like to propose that pushed-down quals
> >>> be checked in ForeignRecheck.
> 
> >> * I've modified ForeignRecheck so as to check pushed-down quals whether
> >> doing late locking or early locking.
> 
> > Isn't it an option to put a new callback in ForeignRecheck?
> >
> > FDW driver knows its private data structure includes expression node
> > that was pushed down to the remote side. So, it seems to me the best
> > way to consult FDW driver whether the supplied tuple should be visible
> > according to the pushed down qualifier.
> >
> > More or less, this fix need a new interface contract around EvalPlanQual
> > logic. It is better to give FDW driver more flexibility of its private
> > data structure and the way to process recheck logic, rather than special
> > purpose variable.
> >
> > If FDW driver managed pushed-down expression in its own format, requirement
> > to pushedDownQual makes them to have qualifier redundantly.
> > The callback approach does not have such kind of concern.
> 
> That might be an idea, but is there any performance disadvantage as
> discussed in [1]?; it looks like that that needs to perform another
> remote query to see if the supplied tuple satisfies the pushed-down
> quals during EPQ testing.
>
I expect the ForeignRecheck callback to run ExecQual() on the
qualifier expression that was pushed down but saved in the private
data of the ForeignScanState. It does not need to kick another remote
query (unless the FDW driver is designed to do so), so the performance
disadvantage is none or quite limited.

Also, let's consider the case when scanrelid == 0 (join pushdown).
It is easy to add a special code path for scanrelid == 0, which
implies the ScanState is either a ForeignScan or a CustomScan.
If ForeignRecheck (= recheckMtd) is called instead of the if-block
below the Assert() in ExecScanFetch, the FDW driver will be able to
put its own special code path there to run an alternative sub-plan.
How does this alternative sub-plan work? It walks down the sub-plan
tree, which typically consists of NestLoop + ForeignScan, for example;
ExecScanFetch() is then called again on the ScanStates with
scanrelid > 0 at that time.
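The dispatch described above can be sketched as follows. Every type and
name here is a stand-in for illustration (a "tuple" is just an int), not
the actual executor code; the point is only the control flow: during an
EPQ recheck, a scanrelid == 0 node hands control to recheckMtd, which
may run an alternative local sub-plan, while ordinary base-relation
scans take the normal fetch path.

```c
#include <assert.h>

typedef struct ScanStateStub ScanStateStub;
typedef int Tuple;              /* pretend a tuple is just an int */

struct ScanStateStub {
    int   scanrelid;                        /* 0 for pushed-down joins */
    Tuple (*accessMtd)(ScanStateStub *);    /* normal fetch */
    Tuple (*recheckMtd)(ScanStateStub *);   /* EPQ recheck / alt. sub-plan */
    int   epq_active;                       /* inside an EPQ recheck? */
};

/* Dummy fetch methods so the dispatch result is observable. */
static Tuple fetch_normal(ScanStateStub *n) { (void) n; return 1; }
static Tuple fetch_epq(ScanStateStub *n)    { (void) n; return 2; }

/* Sketch of the proposed ExecScanFetch behavior. */
static Tuple
exec_scan_fetch(ScanStateStub *node)
{
    if (node->epq_active && node->scanrelid == 0)
        return node->recheckMtd(node);      /* provider's special path */
    return node->accessMtd(node);           /* usual scan path */
}
```

When the alternative sub-plan itself contains base-relation scans
(scanrelid > 0), those nodes fall through to the normal branch, which is
exactly the re-entry described in the paragraph above.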

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
> Hello, can I ask some questions?
>
> I suppose we can take this as the analog of ParalleSeqScan.  I
> can see not so distinction between Append(ParalleSeqScan) and
> ParallelAppend(SeqScan). What difference is there between them?
>
Append does not start to execute the second or later nodes until the
first node reaches the end of its scan.
On the other hand, ParallelAppend will kick all the child nodes
(almost) simultaneously.

> If other nodes will have the same functionality as you mention at
> the last of this proposal, it might be better that some part of
> this feature is implemented as a part of existing executor
> itself, but not as a deidicated additional node, just as my
> asynchronous fdw execution patch patially does. (Although it
> lacks planner part and bg worker launching..) If that is the
> case, it might be better that ExecProcNode is modified so that it
> supports both in-process and inter-bgworker cases by the single
> API.
> 
> What do you think about this?
>
Its downside is that we need to adjust all the existing nodes to
follow the new executor capability. At this moment, we have 38 node
types derived from Plan. I think it is not an easy job to review a
patch that changes several dozen files.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> regards,
> 
> > Hello,
> >
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > --
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had long-
> > standing discussion through the v9.5 development cycle.
> >
> > 2. Problems to be solved
> > -
> > Typical OLAP workloads takes tons of tables join and scan on large
> > tables which are often partitioned, and its KPI is query response
> > time but very small number of sessions are active simultaneously.
> > So, we are required to run a single query as rapid as possible even
> > if it consumes larger computing resources than typical OLTP workloads.
> >
> > Current implementation to scan heap is painful when we look at its
> > behavior from the standpoint - how many rows we can read within a
> > certain time, because of synchronous manner.
> > In the worst case, when SeqScan node tries to fetch the next tuple,
> > heap_getnext() looks up a block on shared buffer, then ReadBuffer()
> > calls storage manager to read the target block from the filesystem
> > if not on the buffer. Next, operating system makes the caller
> > process slept until required i/o get completed.
> > Most of the cases are helped in earlier stage than the above worst
> > case, however, the best scenario we can expect is: the next tuple
> > already appear on top of the message queue (of course visibility
> > checks are already done also) with no fall down to buffer manager
> > or deeper.
> > If we can run multiple scans in parallel / asynchronous, CPU core
> > shall be assigned to another process by operating system, thus,
> > it eventually improves the i/o density and enables higher processing
> > throughput.
> > Append node is an ideal point to be parallelized because
> > - child nodes can have physically different location by tablespace,
> >   so further tuning is possible according to the system landscape.
> > - it can control whether subplan is actually executed on background
> >   worker, per subplan basis. If subplan contains large tables and
> >   small tables, ParallelAppend may kick background worker to scan
> >   large tables only, but scan on small tables are by itself.
> > - Like as Funnel node, we don't need to care about enhancement of
> >   individual node types. SeqScan, IndexScan, ForeignScan or others
> >   can perform as usual, but actually in parallel.
> >
> >
> > 3. Implementation
> > --
> > * Plan & Cost
> >
> > ParallelAppend shall appear where Appen can appear except for the
> > usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
> > both of AppendPath and ParallelAppendPath with cost for each.
> > Cost estimation logic shall take further discussions, however,
> > I expect the logic below to estimate the cost for ParallelAppend.
> >   1. Sum startup_cost and run_cost for each child pathnode, but
> >  distinguish according to synchronous or asynchronous

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
> On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai  wrote:
> >
> > Hello,
> >
> > I'm recently working/investigating on ParallelAppend feature
> > towards the next commit fest. Below is my design proposal.
> >
> > 1. Concept
> > --
> > Its concept is quite simple anybody might consider more than once.
> > ParallelAppend node kicks background worker process to execute
> > child nodes in parallel / asynchronous.
> > It intends to improve the performance to scan a large partitioned
> > tables from standpoint of entire throughput, however, latency of
> > the first multi-hundred rows are not scope of this project.
> > From standpoint of technology trend, it primarily tries to utilize
> > multi-cores capability within a system, but also enables to expand
> > distributed database environment using foreign-tables inheritance
> > features.
> > Its behavior is very similar to Funnel node except for several
> > points, thus, we can reuse its infrastructure we have had long-
> > standing discussion through the v9.5 development cycle.
> >
> > 2. Problems to be solved
> > -
> > Typical OLAP workloads takes tons of tables join and scan on large
> > tables which are often partitioned, and its KPI is query response
> > time but very small number of sessions are active simultaneously.
> > So, we are required to run a single query as rapid as possible even
> > if it consumes larger computing resources than typical OLTP workloads.
> >
> > Current implementation to scan heap is painful when we look at its
> > behavior from the standpoint - how many rows we can read within a
> > certain time, because of synchronous manner.
> > In the worst case, when SeqScan node tries to fetch the next tuple,
> > heap_getnext() looks up a block on shared buffer, then ReadBuffer()
> > calls storage manager to read the target block from the filesystem
> > if not on the buffer. Next, operating system makes the caller
> > process slept until required i/o get completed.
> > Most of the cases are helped in earlier stage than the above worst
> > case, however, the best scenario we can expect is: the next tuple
> > already appear on top of the message queue (of course visibility
> > checks are already done also) with no fall down to buffer manager
> > or deeper.
> > If we can run multiple scans in parallel / asynchronous, CPU core
> > shall be assigned to another process by operating system, thus,
> > it eventually improves the i/o density and enables higher processing
> > throughput.
> > Append node is an ideal point to be parallelized because
> > - child nodes can have physically different location by tablespace,
> >   so further tuning is possible according to the system landscape.
> > - it can control whether subplan is actually executed on background
> >   worker, per subplan basis. If subplan contains large tables and
> >   small tables, ParallelAppend may kick background worker to scan
> >   large tables only, but scan on small tables are by itself.
> > - Like as Funnel node, we don't need to care about enhancement of
> >   individual node types. SeqScan, IndexScan, ForeignScan or others
> >   can perform as usual, but actually in parallel.
> >
> >
> > 3. Implementation
> > --
> > * Plan & Cost
> >
> > ParallelAppend shall appear where Appen can appear except for the
> > usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
> > both of AppendPath and ParallelAppendPath with cost for each.
> >
> 
> Is there a real need to have new node like ParallelAppendPath?
> Can't we have Funnel node beneath AppendNode and then each
> worker will be responsible to have SeqScan on each inherited child
> relation.  Something like
> 
> Append
>---> Funnel
>   --> SeqScan rel1
>   --> SeqScan rel2
>
If Funnel can handle both horizontal and vertical parallelism, that
is a great simplification. I don't insist on a new node.

Once Funnel gets the capability to have multiple child nodes, the
Append node above will probably be gone. I expect
set_append_rel_pathlist() to add two paths based on Append and Funnel;
the planner will then choose the cheaper one according to its cost.

We will need to pay attention to other issues that come up when Funnel
kicks background workers toward asymmetric relations.

If the numbers of rows of the individual child nodes vary, we may
want to assign 10 background workers to scan rel1 with PartialSeqScan.
On the other hand, rel2 may have a very small number of rows, so its
total_cost may be smaller than the cost to launch a background worker.

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-27 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> Sent: Monday, July 27, 2015 11:07 PM
> To: Amit Kapila
> Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI
> Subject: Re: [HACKERS] [DESIGN] ParallelAppend
> 
> > On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai  wrote:
> > >
> > > Hello,
> > >
> > > I'm recently working/investigating on ParallelAppend feature
> > > towards the next commit fest. Below is my design proposal.
> > >
> > > 1. Concept
> > > --
> > > Its concept is quite simple anybody might consider more than once.
> > > ParallelAppend node kicks background worker process to execute
> > > child nodes in parallel / asynchronous.
> > > It intends to improve the performance to scan a large partitioned
> > > tables from standpoint of entire throughput, however, latency of
> > > the first multi-hundred rows are not scope of this project.
> > > From standpoint of technology trend, it primarily tries to utilize
> > > multi-cores capability within a system, but also enables to expand
> > > distributed database environment using foreign-tables inheritance
> > > features.
> > > Its behavior is very similar to Funnel node except for several
> > > points, thus, we can reuse its infrastructure we have had long-
> > > standing discussion through the v9.5 development cycle.
> > >
> > > 2. Problems to be solved
> > > -
> > > Typical OLAP workloads takes tons of tables join and scan on large
> > > tables which are often partitioned, and its KPI is query response
> > > time but very small number of sessions are active simultaneously.
> > > So, we are required to run a single query as rapid as possible even
> > > if it consumes larger computing resources than typical OLTP workloads.
> > >
> > > Current implementation to scan heap is painful when we look at its
> > > behavior from the standpoint - how many rows we can read within a
> > > certain time, because of synchronous manner.
> > > In the worst case, when SeqScan node tries to fetch the next tuple,
> > > heap_getnext() looks up a block on shared buffer, then ReadBuffer()
> > > calls storage manager to read the target block from the filesystem
> > > if not on the buffer. Next, operating system makes the caller
> > > process slept until required i/o get completed.
> > > Most of the cases are helped in earlier stage than the above worst
> > > case, however, the best scenario we can expect is: the next tuple
> > > already appear on top of the message queue (of course visibility
> > > checks are already done also) with no fall down to buffer manager
> > > or deeper.
> > > If we can run multiple scans in parallel / asynchronous, CPU core
> > > shall be assigned to another process by operating system, thus,
> > > it eventually improves the i/o density and enables higher processing
> > > throughput.
> > > Append node is an ideal point to be parallelized because
> > > - child nodes can have physically different location by tablespace,
> > >   so further tuning is possible according to the system landscape.
> > > - it can control whether subplan is actually executed on background
> > >   worker, per subplan basis. If subplan contains large tables and
> > >   small tables, ParallelAppend may kick background worker to scan
> > >   large tables only, but scan on small tables are by itself.
> > > - Like as Funnel node, we don't need to care about enhancement of
> > >   individual node types. SeqScan, IndexScan, ForeignScan or others
> > >   can perform as usual, but actually in parallel.
> > >
> > >
> > > 3. Implementation
> > > --
> > > * Plan & Cost
> > >
> > > ParallelAppend shall appear wherever Append can appear, except for
> > > its usage as a dummy. So, I'll enhance set_append_rel_pathlist() to
> > > add both an AppendPath and a ParallelAppendPath, with a cost for each.
> > >
> >
> > Is there a real need to have new node like ParallelAppendPath?
> > Can't we have Funnel node beneath AppendNode and then each
> > worker will be responsible to have SeqScan on each inherited child
> > relation.  Something like
> >
> > Append
> >---> Funnel
> >   --> SeqScan rel1
> >   --> SeqScan rel2
> >
> I

Re: [HACKERS] CustomScan and readfuncs.c

2015-07-27 Thread Kouhei Kaigai
> 2. Reproduce method table on background worker
> --
> The methods field of CustomPath/Scan/ScanState is expected to be
> a reference to a static structure. Thus, copyObject() does not
> copy the entire table, only the pointer.
> However, we have no way to guarantee the callback functions have
> the same entrypoint addresses on background workers. So, we may need
> an infrastructure to reproduce the same CustomScan node with the same
> callback function table, probably identified by name.
> We may have a few ways to solve the problem.
> 
> * Add system catalog, function returns pointer
> The simplest way, like FDW. A system catalog holds a name and a function
> that returns the callback pointers. It also needs SQL statement support,
> which is a small downside.
>
I tried to design a DDL statement and relevant system catalog as follows.

  #define CustomPlanRelationId    3999

  CATALOG(pg_custom_plan,3999)
  {
      NameData    custom_name;
      regproc     custom_handler;
  } FormData_pg_custom_plan;

This simple catalog saves a pair of name and handler function of custom
plan provider. Like FDW, this handler function returns pointers to the
entrypoint to be called by set_(rel|join)_pathlist_hook and relevant
CustomXXXMethods table.

User can register a custom plan provider using the following statement:
  CREATE CUSTOM PLAN <name> HANDLER <handler_function>;

And unregister:
  DROP CUSTOM PLAN <name>;

This enhancement allows background workers to reproduce a CustomScan node
that was serialized by nodeToString(), as long as the provider is
identified by its name.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


> -Original Message-
> From: Kaigai Kouhei(海外 浩平)
> Sent: Monday, July 27, 2015 8:42 AM
> To: 'Tom Lane'
> Cc: pgsql-hackers@postgresql.org
> Subject: RE: [HACKERS] CustomScan and readfuncs.c
> 
> > Kouhei Kaigai  writes:
> > > During the investigation of ParallelAppend, I noticed there are a few
> > > problems in CustomScan that prevent reproducing an equivalent
> > > plan node on the background worker from its serialized string.
> >
> > > 1. CustomScanMethods->TextOutCustomScan callback
> > > 
> > > This callback allows custom information to be output by nodeToString.
> > > Originally, we intended to use this callback for debugging only, because
> > > CustomScan must be copyObject() safe; thus, all the private data
> > > also must be stored in custom_exprs or custom_private.
> > > However, it leads to another problem when we try to reproduce a
> > > CustomScan node from the string form generated by outfuncs.c.
> > > If TextOutCustomScan prints something, the upcoming _readCustomScan
> > > has to deal with an unexpected number of tokens in an unexpected format.
> >
> > Um ... wait a second.  There is no support in readfuncs for any
> > plan node type, and never has been, and I seriously doubt that there
> > ever should be.  I do not think it makes sense to ship plans around
> > in the way you seem to have in mind.  (Also, I don't think the
> > problems you mention are exactly unique to CustomScan.  There's no
> > reason to assume that FDW plans could survive this treatment either,
> > since we do not know what's in the fdw_private stuff; certainly no
> > one has ever suggested that it should not contain pointers to static
> > data.)
> >
> Yep, no Plan node types are supported at this moment; however, they will
> appear soon with the Funnel + PartialSeqScan nodes.
> Funnel serializes a partial plan subtree using nodeToString(), then gives
> the flattened PlannedStmt to background workers.
> I'm now investigating how to apply the same structure to Append to kick
> child nodes in parallel.
> Once various plan node types appear in readfuncs.c, we have to care
> about this problem, don't we? I'm working on the patch submission
> of ParallelAppend for the next commit-fest, so I'd like to build a
> consensus on how to treat this matter.
> 
> > > I'd like to propose omitting this callback prior to the v9.5 release,
> > > to minimize compatibility issues.
> >
> > I regard our commitment to cross-version compatibility for the
> > custom scan APIs as being essentially zero, for reasons previously
> > discussed.  So if this goes away in 9.6 it will not matter, but we
> > might as well leave it in for now for debug support.
> >
> I don't argue this point strongly. If TextOutCustomScan is to be
> obsoleted in v9.6, it is simply a kindness to developers to advise
> them not to use this callback.
> 
> Thanks,
> --
> NEC Business Creation Division / PG-Strom Project
> KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
> On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai  wrote:
> >
> > > -Original Message-
> > > From: pgsql-hackers-ow...@postgresql.org
> > > [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kouhei Kaigai
> > > Sent: Monday, July 27, 2015 11:07 PM
> > > To: Amit Kapila
> > > >
> > > > Is there a real need to have new node like ParallelAppendPath?
> > > > Can't we have Funnel node beneath AppendNode and then each
> > > > worker will be responsible to have SeqScan on each inherited child
> > > > relation.  Something like
> > > >
> > > > Append
> > > >---> Funnel
> > > >   --> SeqScan rel1
> > > >   --> SeqScan rel2
> > > >
> > > If Funnel can handle both horizontal and vertical parallelism,
> > > it is a great simplification. I don't insist on a new node.
> > >
> > > Once Funnel gets the capability to have multiple child nodes, the
> > > Append node above will probably be gone. I expect set_append_rel_pathlist()
> > > to add two paths based on Append and Funnel; the planner will then choose
> > > the cheaper one according to its cost.
> > >
> > In the latest v16 patch, Funnel is declared as follows:
> >
> >   typedef struct Funnel
> >   {
> >       Scan    scan;
> >       int     num_workers;
> >   } Funnel;
> >
> > If we try to add the Append capability here, I expect the structure
> > would be adjusted as follows, for example:
> >
> >   typedef struct Funnel
> >   {
> >       Scan    scan;
> >       List   *funnel_plans;
> >       List   *funnel_num_workers;
> >   } Funnel;
> >
> > As the names suggest, funnel_plans saves the underlying Plan nodes
> > instead of the lefttree. Also, funnel_num_workers saves the number of
> > workers expected to be assigned to each individual child plan.
> >
> 
> or shall we have a node like the above and name it FunnelAppend or
> AppendFunnel?
>
It is better to have a smaller number of node types capable of kicking
background workers, because it simplifies path construction.

Let's assume the case below. When the planner considers a path to append
child scans on rel1, rel2 and rel3, but the cheapest path for rel2 is
Funnel+PartialSeqScan, we cannot put a Funnel on top unless we pull up
the Funnel of rel2, can we?

  (Append? or Funnel)
   --> SeqScan on rel1
   --> Funnel
        --> PartialSeqScan on rel2
   --> IndexScan on rel3

If we pull up the Funnel here, I think the plan shall be as follows:
  Funnel
   --> SeqScan on rel1
   --> PartialSeqScan on rel2
   --> IndexScan on rel3

If the Funnel node is all we have to pay attention to, the code around
path construction and pull-up logic becomes much simpler than if
multiple node types can kick background workers.

> > Even though create_parallelscan_paths() in v16 sets num_workers no
> > larger than parallel_seqscan_degree, the total number of concurrent
> > background workers may exceed this configuration if more than two
> > PartialSeqScan nodes are underlying.
> > It is a different configuration from max_worker_processes, so it is
> > not a problem as long as we have another restriction.
> > However, how do we control the cap on the number of worker processes
> > per "appendable" Funnel node? For example, suppose a parent table has
> > 200 child tables but max_worker_processes is configured to 50.
> > It is obviously impossible to launch all the background workers
> > simultaneously. One idea I have is to suspend the launch of some plans
> > until earlier ones are completed.
> >
> 
> Okay, but I think in that idea you need to re-launch the workers again for
> each new set of relation scans, which could turn out to be costly. How about
> designing some way where workers, after completing their assigned work,
> check for a new set of tasks (which in this case would be to scan a new
> relation) and then execute them?  I think in this way we can achieve dynamic
> allocation of work and maximum parallelism with the available set of workers.
> We have achieved this in ParallelSeqScan by scanning at the block level: once
> a worker finishes a block, it checks for a new block to scan.
>
It is possible to put multiple PlannedStmts on the TOC, isn't it?
If each background worker picks up an uncompleted PlannedStmt first
(likely based on round-robin?), it may achieve the maximum
parallelism. Yep, it seems to me a good idea which I want to try.
If (num of workers) > (num of sub-plans), some sub-plans can
have multiple workers from the beginning; then other workers
also help to exe

Re: [HACKERS] [DESIGN] ParallelAppend

2015-07-28 Thread Kouhei Kaigai
> KaiGai-san,
> 
> On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:
> >
> >   Append
> >--> Funnel
> > --> PartialSeqScan on rel1 (num_workers = 4)
> >--> Funnel
> > --> PartialSeqScan on rel2 (num_workers = 8)
> >--> SeqScan on rel3
> >
> >  shall be rewritten to
> >   Funnel
> > --> PartialSeqScan on rel1 (num_workers = 4)
> > --> PartialSeqScan on rel2 (num_workers = 8)
> > --> SeqScan on rel3  (num_workers = 1)
> >
> 
> In the rewritten plan, are respective scans (PartialSeq or Seq) on rel1,
> rel2 and rel3 asynchronous w.r.t each other? Or does each one wait for the
> earlier one to finish? I would think the answer is no because then it
> would not be different from the former case, right? Because the original
> premise seems that (partitions) rel1, rel2, rel3 may be on different
> volumes so parallelism across volumes seems like a goal of parallelizing
> Append.
> 
> From my understanding of parallel seqscan patch, each worker's
> PartialSeqScan asks for a block to scan using a shared parallel heap scan
> descriptor that effectively keeps track of division of work among
> PartialSeqScans in terms of blocks. What if we invent a PartialAppend
> which each worker would run in case of a parallelized Append. It would use
> some kind of shared descriptor to pick a relation (Append member) to scan.
> The shared structure could be the list of subplans including the mutex for
> concurrency. It doesn't sound as effective as proposed
> ParallelHeapScanDescData does for PartialSeqScan but any more granular
> might be complicated. For example, consider (current_relation,
> current_block) pair. If there are more workers than subplans/partitions,
> then multiple workers might start working on the same relation after a
> round-robin assignment of relations (but of course, a later worker would
> start scanning from a later block in the same relation). I imagine that
> might help with parallelism across volumes if that's the case.
>
I initially thought ParallelAppend would kick a fixed number of background
workers towards the sub-plans, according to the cost estimated at the
planning stage. However, I'm now inclined to have each background worker
pick up an uncompleted PlannedStmt first. (For more details, please see the
reply to Amit Kapila.)
It looks like a less fine-grained job distribution among workers.
Once the number of workers gets larger than the number of volumes /
partitions, more than two workers begin to be assigned to the same
PartialSeqScan, and then fine-grained job distribution takes over using
the shared parallel heap scan.

> MergeAppend
> parallelization might involve a bit more complication but may be feasible
> with a PartialMergeAppend with slightly different kind of coordination
> among workers. What do you think of such an approach?
>
Do we need anything special for ParallelMergeAppend?
If the individual child nodes are designed to return sorted results,
what we have to do seems the same to me.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



