Re: [PATCH] Add missing type conversion functions for PL/Python
+1. I also think that we should not change the previous behavior of PL/Python. @Nikita Glukhov, maybe we could check whether the PyObject is int or long only in the related conversion functions, and fall back otherwise?

On Fri, Jul 13, 2018 at 12:09 AM Heikki Linnakangas wrote:
> On 12/07/18 18:06, Nikita Glukhov wrote:
> > I have added some cross-type test cases and now almost all new code is covered
> > (excluding several error cases which can be triggered only by custom numeric
> > type implementations).
>
> Thanks! Some of those new tests actually fail, if you run them against
> unpatched master. For example:
>
>   SELECT * FROM test_type_conversion_float8_int2(100::float8);
>   INFO: (100.0, )
>  - test_type_conversion_float8_int2
>  ---
>  - 100
>  -(1 row)
>  -
>  +ERROR: invalid input syntax for integer: "100.0"
>  +CONTEXT: while creating return value
>  +PL/Python function "test_type_conversion_float8_int2"
>
> So this patch is making some subtle changes to behavior. I don't think
> we want that.
>
> - Heikki

--
Regards,
Haozhou
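For illustration, a minimal sketch of the "check the Python type in the integer conversion path and fall back otherwise" idea; this is not taken from the patch, and the helper name is hypothetical:

    #include <Python.h>
    #include <stdbool.h>

    /*
     * Hypothetical helper: return true only if the object is a Python
     * integer, so the caller can use a direct numeric conversion and
     * otherwise fall back to the pre-existing string-based path,
     * preserving the current behavior for all other Python types.
     */
    static bool
    pyobject_is_integer(PyObject *obj)
    {
        if (PyLong_Check(obj))
            return true;
    #if PY_MAJOR_VERSION < 3
        if (PyInt_Check(obj))          /* Python 2 "int" */
            return true;
    #endif
        return false;
    }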
Re: Make foo=null a warning by default.
Hello David,

> Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make
> this happen, please find attached a patch which makes it possible to get SQL
> standard behavior for "= NULL", which is an error. It's been upgraded to a
> warning, and can still be downgraded to silence (off) and MS-SQL-compatible
> behavior (on).

That's nice, and good for students, and probably others as well :-)

A few comments:

Patch applies & compiles, "make check" ok.

 #backslash_quote = safe_encoding   # on, off, or safe_encoding
 [...]
 #transform_null_equals = warn

I think it would be nice to have the possible values as a comment in
"postgresql.conf".

* Code:

 -bool operator_precedence_warning = false;
 +bool operator_precedence_warning = false;

Is this reindentation useful/needed?

 + parser_errposition(pstate, a->location)));
 + if (need_transform_null_equals && transform_null_equals == TRANSFORM_NULL_EQUALS_ON)

For consistency, maybe skip a line before the if?

 transform_null_equals_options[] = { [...]

I do not really understand the sort order of this array. Maybe it could be
alphabetical, or per value? Or maybe it is standard values and then synonyms,
in which case maybe a comment at the end of the list would help.

GUC help text: ISTM that it should spell out the possible values and their
behavior. Maybe something like:

"""
When turned on, turns expr = NULL into expr IS NULL. With off, warn or error,
do not do so and be silent, issue a warning or an error. The standard behavior
is not to perform this transformation. Default is warn.
"""

Or anything in better English.

* Test

 +select 1=null;
 + ?column?

Maybe use AS to describe the expected behavior, so that it is easier to check,
and because I think that "?column?" looks silly, e.g.:

 select 1=null as expect_{true,false}[_and_a_warning/error];

 create table cnullchild (check (f1 = 1 or f1 = null)) inherits(cnullparent);
 +WARNING: = NULL can only produce a NULL

I'm not sure of this test case. Should it be turned into "is null"? Maybe the
behavior could be retested after the reset?

* Documentation:

The "error" value is not documented. More generally, the value is said to be
an enum, but the list of values is not listed anywhere in the documentation.
ISTM that the doc would be clearer if the behavior were spelled out for each
of the four values of the parameter. Maybe "warn" in the doc should be marked
up as a literal, or something.

--
Fabien.
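For illustration, here is what the "standard values first, then synonyms, with a comment" suggestion could look like for the options array. The entry layout follows PostgreSQL's struct config_enum_entry (name, value, hidden); the exact set of entries and enum names beyond TRANSFORM_NULL_EQUALS_ON is an assumption, not something quoted from the patch:

    static const struct config_enum_entry transform_null_equals_options[] = {
        /* primary values */
        {"off", TRANSFORM_NULL_EQUALS_OFF, false},
        {"on", TRANSFORM_NULL_EQUALS_ON, false},
        {"warn", TRANSFORM_NULL_EQUALS_WARN, false},
        {"error", TRANSFORM_NULL_EQUALS_ERROR, false},
        /* hidden synonyms, kept for backward compatibility */
        {"true", TRANSFORM_NULL_EQUALS_ON, true},
        {"false", TRANSFORM_NULL_EQUALS_OFF, true},
        {NULL, 0, false}                /* end of list */
    };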
Re: Make foo=null a warning by default.
On 16/07/18 04:40, David Fetter wrote:
> Folks,
>
> Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make
> this happen, please find attached a patch which makes it possible to get SQL
> standard behavior for "= NULL", which is an error. It's been upgraded to a
> warning, and can still be downgraded to silence (off) and MS-SQL-compatible
> behavior (on).

I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly
legal SQL, even if it's not very useful in practice.

The use case for transform_null_equals='warn' that I see is to wean off an
application that uses transform_null_equals='on', to find all the queries that
rely on it. But I'm not too excited about that either. The documented case for
using transform_null_equals='on' is compatibility with Microsoft Access, and
this wouldn't help with that. Besides, it's easy enough to grep the application
or the server log for " = NULL", to find all such queries.

- Heikki
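For readers following along, a quick SQL illustration of why "foo = NULL" is legal but rarely useful: under standard semantics the comparison yields NULL, so the WHERE clause never selects anything, whereas IS NULL does what people usually intend.

    CREATE TABLE t (foo int);
    INSERT INTO t VALUES (1), (NULL);

    SELECT * FROM t WHERE foo = NULL;    -- 0 rows: "= NULL" evaluates to NULL
    SELECT * FROM t WHERE foo IS NULL;   -- 1 row: the NULL row
    SELECT 1 = NULL AS result;           -- returns NULL, not false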
Parallel queries in single transaction
Hello,

I am working with a data warehouse based on PostgreSQL and would like to propose
a feature. The idea is to give the developer control and the ability to execute
queries in parallel within a single transaction.

The usual flow is: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION.
However, sometimes QUERY1 and QUERY2 are independent and could be executed in
parallel, e.g.:

START_TRANSACTION -> DEFINE_QUERY1 (no execution) -> DEFINE_QUERY2 (no execution)
  -> EXECUTE_QUERY1_AND_QUERY2 (in parallel) -> QUERY3 -> END

Of course QUERY1 and QUERY2 can be dependent, and then this would not work, but
sometimes it is useful, especially when you are bound by e.g. CPU and a query is
stuck. If we go further, the PostgreSQL engine could possibly find such cases by
itself and run the queries in parallel, but that sounds like going too far.

Here is also an example in Scala of how you can wait for several futures:
https://stackoverflow.com/a/16257851/2439539

Syntax:

BEGIN;

-- create two temp tables because PREPARE does not support CTAS

-- query1
CREATE TEMP TABLE QUERY1_RESULT (...) ON COMMIT DROP;
PREPARE query1 (id int, val varchar) AS INSERT INTO QUERY1_RESULT(...) SELECT ...;

-- query2
CREATE TEMP TABLE QUERY2_RESULT (...) ON COMMIT DROP;
PREPARE query2 (id int, val varchar) AS INSERT INTO QUERY2_RESULT(...) SELECT ...;

-- execute in parallel
execute parallel (query1, query2);

-- query3
...

END;

-Paul
Re: Make foo=null a warning by default.
Hello Heikki,

> The use case for transform_null_equals='warn' that I see is to wean off an
> application that uses transform_null_equals='on', to find all the queries
> that rely on it. But I'm not too excited about that either. The documented
> case for using transform_null_equals='on' is compatibility with Microsoft
> Access, and this wouldn't help with that. Besides, it's easy enough to grep
> the application or the server log for " = NULL", to find all such queries.

I'm in favor of having this feature, even if off by default, because I could
enable it and it would help my students when learning SQL in interactive
sessions, where "grep '= *NULL'" cannot be issued.

--
Fabien.
RE: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
I get it. Thank you for the clarification.

Regards,
Phil

From: David Rowley
Sent: Monday, 16 July 2018 07:48
To: Phil Florent
Cc: Tom Lane; Robert Haas; Amit Langote; PostgreSQL Hackers
Subject: Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

On 16 July 2018 at 16:56, Phil Florent <philflor...@hotmail.com> wrote:
> I should post that in the general section, but I am confused by the sentence
> "A parent partition is always going to have a lower relid than its children".

It's a little confusing since RelOptInfo has a relid field and so does
RangeTblEntry. They both have completely different meanings.

RelOptInfo's relid is a number starting at 1 that continues in a gapless
sequence, increasing by 1 with each RelOptInfo. These relids are completely
internal to the server and don't appear in the system catalog tables.
RangeTblEntry's relid is what's in pg_class.oid. I was talking about
RelOptInfo's relid.

Using relids starting at 1 is quite convenient for allowing direct array
lookups in various data structures in the planner. However, it's also required
to uniquely identify a relation, as a single table may appear many times in a
query, so trying to identify relations by their OID could be ambiguous. Also,
some RTEKinds don't have storage, e.g. a VALUES() clause.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
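For reference, a self-contained sketch of the two fields being contrasted; the real struct definitions in the PostgreSQL headers have many more fields, and the comments here only paraphrase the explanation above:

    typedef unsigned int Index;     /* as in the PostgreSQL headers */
    typedef unsigned int Oid;       /* as in the PostgreSQL headers */

    /* abridged sketch: planner's per-relation working data */
    typedef struct RelOptInfo
    {
        Index   relid;      /* 1-based, gapless, planner-internal */
        /* ... many other fields elided ... */
    } RelOptInfo;

    /* abridged sketch: parse-tree range table entry */
    typedef struct RangeTblEntry
    {
        Oid     relid;      /* pg_class.oid of the referenced relation, if any */
        /* ... many other fields elided ... */
    } RangeTblEntry;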
Re: Libpq support to connect to standby server as priority
On Wed, Jul 11, 2018 at 6:00 PM Haribabu Kommi wrote:
> On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe wrote:
>> Haribabu Kommi wrote:
>>
>> - I think the construction with "read_write_host_index" makes the code even
>> more complicated than it already is.
>>
>> What about keeping the first successful connection open and storing it in a
>> variable if we are in "prefer-read" mode? If we get the read-only connection
>> we desire, close that cached connection; otherwise use it.
>
> Even if we add a variable to cache the connection, I don't think the logic of
> checking the next host for a read-only server will change, but the extra
> connection request to the read-write host will be removed.

I evaluated your suggestion of caching the connection and reusing it when no
read-only server is found, but I think it will add more complexity, and it also
delays the connections to the other servers; the cached connection may also be
closed by the server because of a timeout. I feel the extra connection time may
be acceptable when the user asks for prefer-read mode, instead of adding more
complexity to handle the cached connection. Comments?

Regards,
Haribabu Kommi
Fujitsu Australia
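For context, assuming the proposed prefer-read behavior extends libpq's existing multi-host connection handling via the target_session_attrs parameter (an assumption; the thread discusses the mode by name only), client usage would look roughly like this: libpq tries the listed hosts, prefers one that reports itself as read-only, and falls back to a read-write server only if no standby accepts the connection.

    postgresql://host1:5432,host2:5432/mydb?target_session_attrs=prefer-read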
Re: patch to allow disable of WAL recycling
On 07/16/2018 04:54 AM, Stephen Frost wrote: Greetings, * Tom Lane (t...@sss.pgh.pa.us) wrote: I think that the right basic idea is to have a GUC that chooses between the two implementations, but whether it can be set automatically is not clear to me. Can initdb perhaps investigate what kind of filesystem the WAL directory is sitting on, and set the default value from hard-wired knowledge about that? Maybe.. but I think we'd still need a way to change it because people often start with their database system minimally configured (including having WAL in the default location of the data directory) and only later realize that was a bad idea and change it later. I wouldn't be at all surprised if that "change it later" meant moving it to a different filesystem, and having to re-initdb to take advantage of that would be particularly unfriendly. I'm not sure the detection can be made sufficiently reliable for initdb. For example, it's not that uncommon to do initdb and then move the WAL to a different filesystem using symlink. Also, I wonder how placing the filesystem on LVM with snapshotting (which kinda makes it CoW) affects the system behavior. But maybe those are not issues, as long as the result is predictable. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
Hi, On 07/16/2018 09:45 AM, Paul Muntyanu wrote: Hello, I am working with data warehouse based on postgresql and would like to propose a feature. The idea is to give control and ability for developer to execute queries in parallel within single transaction. Usual flow is next: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION. However sometimes QUERY1 and QUERY2 are independent and can be executed in parallel mode. E.g.: START_TRANSACTION -> DEFINE_QUERY1(no execution) -> DEFINE_QUERY2(no_execution) -> EXECUTE_QUERY1_AND_QUERY2(in parallel) -> QUERY3 -> END Of course QUERY1 and QUERY2 can be dependent and then this would not work, but sometimes it is useful, especially when you have bound to e.g. CPU and query stuck. I'm struggling to understand what would be the possible benefits. Either the queries are CPU-bound or stuck (e.g. waiting for I/O), they can't be both at the same time. If a query is stuck, running it concurrently is pointless. If they are CPU-bound, we can run them in parallel (which should produce the results faster). I'd even dare to say that running the queries concurrently can easily hinder performance, because the queries will compete for parallel workers, preventing some of them from running in parallel mode. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
Hi Tomas, thanks for looking into this. I am talking more about queries which
cannot be optimized, e.g.:

* a full scan of one table and heavy calculations for another one;
* queries through FDWs (e.g. one query fetches data from Kafka and another one
  fetches from a remote Postgres; there are no bounds on either query other
  than local CPU, network and the remote machine).

Being I/O-bound is not a problem if you have multiple tablespaces. And being
CPU-bound need not be the case when you have 32 cores and 6 max workers per
query: during the nightly ETL I do not have anything except a single query
running, so only 6 cores are occupied. If I could run queries in parallel, I
would occupy two I/O stacks (two tablespaces) + 12 cores instead of
sequentially 6 and then again 6.

Hope that makes sense.
-P

On 16 Jul 2018, at 11:44, Tomas Vondra wrote:
> Hi,
>
> On 07/16/2018 09:45 AM, Paul Muntyanu wrote:
> > Hello, I am working with data warehouse based on postgresql and would like
> > to propose a feature. The idea is to give control and ability for developer
> > to execute queries in parallel within single transaction. Usual flow is
> > next: START_TRANSACTION -> QUERY1 -> QUERY2 -> QUERY3 -> END_TRANSACTION.
> > However sometimes QUERY1 and QUERY2 are independent and can be executed in
> > parallel mode. E.g.: START_TRANSACTION -> DEFINE_QUERY1(no execution) ->
> > DEFINE_QUERY2(no_execution) -> EXECUTE_QUERY1_AND_QUERY2(in parallel) ->
> > QUERY3 -> END Of course QUERY1 and QUERY2 can be dependent and then this
> > would not work, but sometimes it is useful, especially when you have bound
> > to e.g. CPU and query stuck.
>
> I'm struggling to understand what would be the possible benefits. Either the
> queries are CPU-bound or stuck (e.g. waiting for I/O), they can't be both at
> the same time. If a query is stuck, running it concurrently is pointless. If
> they are CPU-bound, we can run them in parallel (which should produce the
> results faster).
>
> I'd even dare to say that running the queries concurrently can easily hinder
> performance, because the queries will compete for parallel workers,
> preventing some of them from running in parallel mode.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix some error handling for read() and errno
On Sat, Jul 14, 2018 at 03:37:56PM +0900, Michael Paquier wrote:
> For now, I think that just moving forward with 0001, and then revisiting
> 0002 once the other 2PC patch is settled, makes the most sense. On the
> other thread, the current 2PC behavior can create silent data loss, so
> I would like to back-patch it, so that would be less work.

Are there any objections to this plan? If none, then I would like to move on
with 0001, as there is clearly a consensus to simplify the work of translators
and to clean up the error code paths for read() calls. Let's sort out the rest
after the 2PC code paths are addressed.
--
Michael
Re: [WIP PATCH] Index scan offset optimisation using visibility map
Hello.

Thanks a lot for your feedback. I'll try to update the patch in a few days
(I'm currently stuck on a small performance regression in an unknown place).

Regarding the issue with DELETE: yes, it is a valid point, but removing a
record should clear the visibility map bit, so the tuple will be fetched from
the heap to test its existence. In that case the expressions are not evaluated
at all. I'm not sure about a DELETE and a query in the same transaction - I'll
check. Also, I need to recheck possible issues with EvalPlanQual.

PS. Updated link in case someone wants to briefly see the code until a git
patch is ready:
https://github.com/michail-nikolaev/postgres/compare/e3eb8be77ef82ccc8f87c515f96d01bf7c726ca8...michail-nikolaev:index_only_fetch?ts=4

On Sat, 14 Jul 2018 at 0:17, Heikki Linnakangas wrote:
> On 21/05/18 18:43, Michail Nikolaev wrote:
> > Hello everyone.
> > This letter relates to "Extended support for index-only-scan" from my
> > previous message in the thread.
> >
> > The WIP version of the patch has been ready for a while now and I think it
> > is time to resume the work on the feature. BTW, I found a small article
> > about Oracle vs Postgres focusing on this issue -
> > https://blog.dbi-services.com/postgres-vs-oracle-access-paths-viii/
> >
> > The current WIP version of the patch is located here -
> > https://github.com/postgres/postgres/compare/88ba0ae2aa4aaba8ea0d85c0ff81cc46912d9308...michail-nikolaev:index_only_fetch,
> > passing all checks. In addition, the patch includes a small optimization
> > for caching of amcostestimate results.
>
> Please submit an actual patch, extracted e.g. with "git format-patch -n1",
> rather than a link to an external site. That is a requirement for archival
> purposes, so that people reading the email archives later on can see what
> was being discussed. (And that link doesn't return a proper diff, anyway.)
>
> > For now, I decided to name the plan "Index Only Fetch Scan". Therefore:
> > * In case of "Index Scan" - we touch the index and heap for EVERY tuple we
> >   need to test
> > * For "Index Only Scan" - we touch the index for every tuple and NEVER
> >   touch the heap
> > * For "Index Only Fetch Scan" - we touch the index for every tuple and
> >   touch the heap only for those tuples we need to RETURN
>
> Hmm. IIRC there was some discussion on doing that, when index-only scans
> were implemented. It's not generally OK to evaluate expressions based on
> data that has already been deleted from the table. As an example of the
> kind of problems you might get:
>
> Imagine that a user does "DELETE FROM table WHERE div = 0; SELECT * FROM
> table WHERE 100 / div < 10". Would you expect the query to throw a
> "division by zero" error? If there was an index on 'div', you might
> evaluate the "100 / div" expression based on values from the index,
> which still includes entries for the just-deleted tuples with div = 0.
> They would be filtered out later, after performing the visibility
> checks, but it's too late if you already threw an error.
>
> Now, maybe there's some way around that, but I don't know what. Without
> some kind of a solution, this won't work.
>
> - Heikki
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On Sat, Jul 14, 2018 at 2:41 AM, Alvaro Herrera wrote: > On 2018-Jul-13, Ashutosh Bapat wrote: > >> On Fri, Jul 13, 2018 at 1:15 PM, Amit Langote >> wrote: >> >> >> >> I don't think this is true. When equality conditions and IS NULL clauses >> >> cover >> >> all partition keys of a hash partitioned table and do not have >> >> contradictory >> >> clauses, we should be able to find the partition which will remain >> >> unpruned. >> > >> > I was trying to say the same thing, but maybe the comment doesn't like it. >> > How about this: >> > >> > + * For hash partitioning, if we found IS NULL clauses for some keys >> > and >> > + * OpExpr's for others, gen_prune_steps_from_opexps() will generate >> > the >> > + * necessary pruning steps if all partition keys are taken care of. >> > + * If we found only IS NULL clauses, then we can generate the pruning >> > + * step here but only if all partition keys are covered. >> > >> >> It's still confusing a bit. I think "taken care of" doesn't convey the >> same meaning as "covered". I don't think the second sentence is >> necessary, it talks about one special case of the first sentence. But >> this is better than the prior version. > > Hmm, let me reword this comment completely. How about the attached? > > I also changed the order of the tests, putting the 'generate_opsteps' > one in the middle instead of at the end (so the not-null one doesn't > test that boolean again). AFAICS it's logically the same, and it makes > more sense to me that way. That looks much better. However it took me a small while to understand that (1), (2) and (3) correspond to strategies. > > I also changed the tests so that they apply to the existing mc2p table, > which is identical to the one Amit was adding. +1 for that. > >> > How about if we explicitly spell out the strategies like this: >> > >> > +if (!bms_is_empty(nullkeys) && >> > +(part_scheme->strategy == PARTITION_STRATEGY_LIST || >> > + part_scheme->strategy == PARTITION_STRATEGY_RANGE || >> > + (part_scheme->strategy == PARTITION_STRATEGY_HASH && >> > + bms_num_members(nullkeys) == part_scheme->partnatts))) >> >> I still do not understand why do we need (part_scheme->strategy == >> PARTITION_STRATEGY_HASH && bms_num_members(nullkeys) == >> part_scheme->partnatts) there. I thought that this will be handled by >> code, which deals with null keys and op_exprs together, somewhere >> else. > > I'm not sure I understand this question. This strategy applies to hash > partitioning only if we have null tests for all keys, so there are no > op_exprs. If there are any, there's no point in checking them. With the rearranged code, it's much more simple to understand this code. No change needed. >> >> Hmm. Ok. May be it's better to explicitly say that it's only useful in >> case of list partitioned table. > > Yeah, maybe. No need with the re-written code. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
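As a concrete illustration of the case under discussion (equality and IS NULL clauses together covering all keys of a hash-partitioned table), here is a hedged example; the table and partition names are made up rather than taken from the regression tests being edited, and the expectation is that, with the fix, pruning can pick exactly one partition:

    CREATE TABLE mchash (a int, b text) PARTITION BY HASH (a, b);
    CREATE TABLE mchash_p0 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE mchash_p1 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE mchash_p2 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE mchash_p3 PARTITION OF mchash FOR VALUES WITH (MODULUS 4, REMAINDER 3);

    -- one key constrained by equality, the other by IS NULL: all partition
    -- keys are covered, so only the single matching partition should be scanned
    EXPLAIN (COSTS OFF) SELECT * FROM mchash WHERE a = 1 AND b IS NULL;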
Re: PATCH: psql tab completion for SELECT
(trimmed CC list to evade gmail's spam filter, sorry)

On 21/03/18 08:51, Edmund Horner wrote:
> Hi all,
>
> I haven't heard anything for a while and so assume you're beavering away on
> real features. :)
>
> I've been dogfooding this patch at work, and I am personally pretty happy
> with it. I still think the number of completions on an empty string is a bit
> too big, but I don't know what to omit. There are around 1700 completions on
> the empty "postgres" database in my testing, and we show the first 1000
> (arbitrarily limited, as the generated SQL has LIMIT 1000 but no ORDER BY).
> Should we just leave it as is?
>
> Whether we improve the filtering or not, I'm optimistic the feature will be
> committed in this CF or the next. I've also been working on adding support
> for completions after commas, but I really want to see the current feature
> finished first.

Playing around with this a little bit, I'm not very satisfied with the
completions. Sometimes this completes too much, and sometimes too little. All
of this has been mentioned in this and the other thread [1] already; this is
just my opinion on all this.

Too much:

postgres=# \d lp
              Table "public.lp"
  Column  |  Type   | Collation | Nullable | Default
----------+---------+-----------+----------+---------
 id       | integer |           |          |
 partkey  | text    |           |          |
 partcol1 | text    |           |          |
 partcol2 | text    |           |          |
Partition key: LIST (partkey)
Number of partitions: 1000 (Use \d+ to list them.)

postgres=# select pa[TAB]
pad_attribute      parameter_default  parameter_style    partattrs      partcol2       partexprs   partrelid
page               parameter_mode     parameter_types    partclass      partcollation  partkey     partstrat
pageno             parameter_name     parse_ident(       partcol1       partdefid      partnatts   passwd

Too little:

postgres=# select partkey, p[TAB]
[no completions]

Then there's the multi-column case, which seems weird (to a user - I know the
implementation and understand why):

postgres=# select partkey, partc[TAB]
[no completions]

And I'd love this case, where you go back to edit the SELECT list after already
typing the FROM part, to be smarter:

postgres=# select p[TAB] FROM lp;
Display all 370 possibilities? (y or n)

There's something weird going on with system columns, from a user's point of
view:

SELECT oi[TAB]
oid             oidvectortypes(

postgres=# select xm[TAB]
xmin                          xmlcomment(                   xml_is_well_formed_content(  xmlvalidate(
xmlagg(                       xml_is_well_formed(           xml_is_well_formed_document(

So oid and xmin are completed, but "xmax" and "ctid" are not. I think this is
because in fact none of the system columns are completed, but there happen to
be some tables with regular columns named "oid" and "xmin". So it makes sense
once you know that, but it's pretty confusing to a user. Tab-completion is a
great way for a user to explore what's available, so it's weird that some
system columns are effectively completed while others are not.

I'd like to not include set-returning functions in the list. Although you can
do "SELECT generate_series(1,10)", I'd like to discourage people from doing
that, since using set-returning functions in the target list is a PostgreSQL
extension to the SQL standard, and IMHO the "SELECT * FROM
generate_series(1,10)" syntax is easier to understand and works more sanely
with joins etc. Conversely, it would be really nice to include set-returning
functions in the completions after FROM.

I understand that there isn't much you can do about some of those things,
short of adding a ton of new context information about previous queries and
heuristics. I think the completion needs to be smarter to be acceptable. I
don't know exactly what to do, but perhaps someone else has ideas.
I'm also worried about performance, especially of the query to get all the
column names. It's pretty much guaranteed to perform a SeqScan + Sort + Unique
on pg_attribute, and pg_attribute can be very large. In my little test
database, with the above 1000-partition table, hitting tab after "SELECT "
takes about 1 second until you get the "Display all 1000 possibilities"
prompt. And that's not a particularly large schema.

- Heikki

PS. All the references to "pg_attribute" and other system tables, and
functions, need to be schema-qualified, as "pg_catalog.pg_attribute" and so
forth.

[1] https://www.postgresql.org/message-id/1328820579.11241.4.ca...@vanquo.pezone.net
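To make the performance concern concrete, the column-name completion has to run something shaped roughly like the query below against pg_attribute; this is an illustrative sketch, not the exact SQL the patch generates. Without a suitable index path it degenerates into the SeqScan + Sort/Unique described above.

    SELECT DISTINCT a.attname
      FROM pg_catalog.pg_attribute a
     WHERE a.attnum > 0
       AND NOT a.attisdropped
       AND a.attname LIKE 'par%'      -- text typed so far
     LIMIT 1000;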
Re: [HACKERS] PoC plpgsql - possibility to force custom or generic plan
On 12.07.18 11:12, Pavel Stehule wrote: > This appears to be the patch of record in this thread. I think there is > general desire for adding a setting like this, and the implementation is > simple enough. > > One change perhaps: How about naming the default setting "auto" instead > of "default". That makes it clearer what it does. > > > I changed "default" to "auto" > > updated patch attached committed with some editing -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
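For readers who want to try the committed feature: assuming the setting discussed here is the plan_cache_mode GUC with values auto, force_generic_plan and force_custom_plan (an assumption, since the thread does not name the GUC), usage looks like this:

    -- per-session override; "auto" restores the default heuristic
    SET plan_cache_mode = force_custom_plan;
    PREPARE q(int) AS SELECT count(*) FROM pg_class WHERE relpages > $1;
    EXECUTE q(10);   -- always planned with the actual parameter value
    RESET plan_cache_mode;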
Re: Finding database for pg_upgrade missing library
On Sat, Jul 14, 2018 at 12:14:51AM +0200, Daniel Gustafsson wrote: > > On 13 Jul 2018, at 18:28, Bruce Momjian wrote: > > > > I received a private pg_upgrade feature request to report the database > > name for missing loadable libraries. Currently we report "could not > > load library" and the library file name, e.g. $libdir/pgpool-regclass. > > > > The request is that we report the _database_ name that contained the > > loadable library reference. However, that isn't easy to do because we > > gather all loadable library file names, sort them, and remove > > duplicates, for reasons of efficiency and so we check libraries in a > > predictable alphabetical order. > > > > Is it worth modifying pg_upgrade to report the first or all databases > > that contain references to missing loadable libraries? I don't think > > so, but I wanted to ask here. > > I ran into that very problem when upgrading 50+ clusters during a long night a > while back, and patched pg_upgrade to report the database which helped a lot > (or at least helped me be a bit lazy). So, +1 on the feature from me. If > there is interest I can see if I can dig out the patch and polish it up. Yes, please post the patch. Seems we now have three people who want this. Even though it is related to reporting errors, I think this is a new feature so will only be in PG 12. Looking at the code, we do a qsort(), then detect (since they are all now adjacent) and remove the duplicate references to the library. What I think should be done is to move the duplicate detection down to the place where we check for the library, then print all the database names of the duplicates if we don't find the library. I guess we either need a separate array for the database name, or a 'struct' that holds the library name and database name. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
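A minimal sketch of the second option Bruce mentions (a struct holding both names); the type and field names are hypothetical rather than anything from pg_upgrade's actual code:

    /* one entry per library reference found in a database */
    typedef struct LibraryInfo
    {
        char   *libname;    /* e.g. "$libdir/pgpool-regclass" */
        char   *dbname;     /* database the reference was found in */
    } LibraryInfo;

With an array of these, qsort() can order entries by library name (with database name as a tie-breaker), duplicate detection can move down to check time, and a missing library can then report every database that references it.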
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/16/2018 12:16 AM, Tomas Vondra wrote:
> On 07/15/2018 04:43 PM, Dean Rasheed wrote:
>> On 15 July 2018 at 14:29, Tomas Vondra wrote:
>>> It's quite unclear to me how this algorithm could reliably end up with
>>> hist_sel=0 (in cases where we already don't end up with that). I mean, if a
>>> bucket matches the conditions, then the only way to eliminate it is by
>>> deducing that the MCV already contains all the matches - and that's rather
>>> difficult for complex clauses ...
>>
>> Ah, I didn't realise that you were using histograms for equality clauses as
>> well. I had assumed that they would only use the MCV stats, as in the
>> univariate case. Using histograms for equality seems problematic -- if
>> bucket_contains_value() returns STATS_MATCH_PARTIAL, as things stand that
>> would end up with an estimate of half the bucket's frequency, which seems
>> excessive.
>
> Yeah, I think you're right - this is likely to produce over-estimates and
> needs rethinking. With top-level equality clauses it should be fairly trivial
> to use an approach similar to the univariate case, i.e. computing ndistinct
> and using
>
>     (1 - mcv_totalsel) / ndistinct
>
> If there are ndistinct coefficients this might be a pretty good estimate of
> the non-MCV part, I think. But it only works for top-level equalities, not
> for complex clauses as in your examples.

On further thought, it's a bit more complicated, actually. Firstly, we already
do that when there's no histogram (as in your example), and clearly it does not
help. I initially thought it's a mistake to use the histogram in this case, but
I can think of cases where it helps a lot.

1) when the equality clauses match nothing

In this case we may not find any buckets possibly matching the combination of
values, producing selectivity estimate 0.0, while using 1/ndistinct would give
us something else.

2) when there are equality and inequality clauses

Similarly to the previous case, the equality clauses are useful in eliminating
some of the buckets.

Now, I agree estimating equality clauses using the histogram is tricky, so
perhaps what we should do is use them as "conditions" to eliminate histogram
buckets, but use ndistinct to estimate the selectivity. That is, something like
this:

    P(a=1 & b=1 & c<10 & d>=100)
        = P(a=1 & b=1) * P(c<10 & d>=100 | a=1 & b=1)
        = 1/ndistinct(a,b) * P(c<10 & d>=100 | a=1 & b=1)

where the second part is estimated using the histogram. Of course, this still
only works for the top-level equality clauses :-(

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/15/2018 11:36 AM, Dean Rasheed wrote:
> On 13 July 2018 at 18:27, Tomas Vondra wrote:
>> I'm not so sure. The issue is that a lot of the MCV deductions depend on
>> whether we can answer questions like "Is there a single match?" or "If we
>> got a match in the MCV, do we need to look at the non-MCV part?" This is not
>> very different from the single-column estimates, except of course here we
>> need to look at multiple columns.
>>
>> The top-level clauses allow us to make such deductions; with deeper clauses
>> it's much more difficult (perhaps impossible). Because for example with
>> (a=1 AND b=1) there can be just a single match, so if we find it in the MCV
>> we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that
>> simple, because there may be multiple combinations and so a match in the MCV
>> does not guarantee anything.
>
> Actually, it guarantees a lower bound on the overall selectivity, and maybe
> that's the best that we can do in the absence of any other stats.

Hmm, is that actually true? Let's consider a simple example, with two columns,
each with just 2 values, and a "perfect" MCV list:

    a | b | frequency
   ---+---+-----------
    1 | 1 | 0.5
    2 | 2 | 0.5

And let's estimate sel(a=1 & b=2). Your proposed algorithm does this:

1) sel(a=1) = 0.5
2) sel(b=2) = 0.5
3) total_sel = sel(a=1) * sel(b=2) = 0.25
4) mcv_sel = 0.0
5) total_sel = Max(total_sel, mcv_sel) = 0.25

How is that a lower bound? Or what is it lower than?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Parallel queries in single transaction
On 07/16/2018 12:03 PM, Paul Muntyanu wrote: Hi Tomas, thanks for looking into. I am more talking about queries which can not be optimized, e.g. * fullscan of the table and heavy calculations for another one. * query through FDW for both queries(e.g. one query fetches data from Kafka and another one is fetching from remote Postgres. There are no bounds for both queries for anything except local CPU, network and remote machine) IO bound is not a problem in case if you have multiple tablesapces. But it was you who mentioned "query stuck" not me. I merely pointed out that in such cases running queries concurrently won't help. And CPU bound can be not the case when you have 32 cores and 6 max workers per query. Then, during nigtly ETL, I do not have anything except single query running) == 6 cores are occupied. If I can run queries in parallel, I would occupy two IO stacks(two tablespaces) + 12 cores instead of sequentially 6 and then again 6. Well, sure. But you could just as well open multiple connections and make the queries concurrent that way. Or change the GUC to increase the number of workers for the nightly ETL. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Finding database for pg_upgrade missing library
On Fri, Jul 13, 2018 at 10:15:34PM -0500, Justin T Pryzby wrote: > On Fri, Jul 13, 2018 at 12:28:15PM -0400, Bruce Momjian wrote: > > I received a private pg_upgrade feature request to report the database > > name for missing loadable libraries. Currently we report "could not > > load library" and the library file name, e.g. $libdir/pgpool-regclass. > > > > The request is that we report the _database_ name that contained the > > loadable library reference. However, that isn't easy to do because we > > gather all loadable library file names, sort them, and remove > > duplicates, for reasons of efficiency and so we check libraries in a > > predictable alphabetical order. > > > > Is it worth modifying pg_upgrade to report the first or all databases > > that contain references to missing loadable libraries? I don't think > > so, but I wanted to ask here. > > Yes please, with a preference for the "all databases" option. > > We typically have only 4 DBs, including postgres and template1,2. It's > annoying enough when an upgrade process breaks because pg_repack or > pg_stat_buffercache installed into postgres DB. But it's a veritable pain > when > you discover in the middle of an upgrade that postgis had been somehow loaded > into template1, needs to be uninstalled (or upgraded from 22 to 23 to allow > upgrade), old postgis package was already removed.. Maybe you find that one > library was installed one place, fix it and restart the upgrade process. Then > it fails because the old library was also installed some other place.. > > When I've had to figure this out in the past, I ended up grepping the dumps to > figure out what old library was where. OK, we now have three people who want this so we will get it into PG 12. Thanks. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 16 July 2018 at 13:23, Tomas Vondra wrote:
>>> The top-level clauses allow us to make such deductions, with deeper
>>> clauses it's much more difficult (perhaps impossible). Because for
>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>> b=2)) it's not that simple, because there may be multiple combinations
>>> and so a match in MCV does not guarantee anything.
>>
>> Actually, it guarantees a lower bound on the overall selectivity, and
>> maybe that's the best that we can do in the absence of any other
>> stats.
>
> Hmm, is that actually true? Let's consider a simple example, with two
> columns, each with just 2 values, and a "perfect" MCV list:
>
>     a | b | frequency
>    ---+---+-----------
>     1 | 1 | 0.5
>     2 | 2 | 0.5
>
> And let's estimate sel(a=1 & b=2).

OK. In this case, there are no MCV matches, so there is no lower bound (it's
0). What we could do though is also impose an upper bound, based on the sum of
non-matching MCV frequencies. In this case, the upper bound is also 0, so we
could actually say the resulting selectivity is 0.

Regards,
Dean
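Spelling the two bounds out for this example, in the thread's own notation:

    lower_bound = sum of MCV frequencies matching (a=1 & b=2)        = 0
    upper_bound = 1 - sum of MCV frequencies that cannot match
                = 1 - (0.5 + 0.5)                                    = 0

so the estimate can be clamped to 0, overriding the independence-based
total_sel = 0.25 computed above.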
Re: cursors with prepared statements
On 11.07.18 19:07, Heikki Linnakangas wrote:
>> One point worth pondering is how to pass the parameters of the prepared
>> statements. The actual SQL standard syntax would be
>>
>>     DECLARE cursor_name CURSOR FOR prepared_statement_name;
>>     OPEN cursor_name USING param, param;
>>
>> But since we don't have the OPEN statement in direct SQL, it made sense
>> to me to attach the USING clause directly to the DECLARE statement.
>
> Hmm. I'm not excited about adding PostgreSQL-extensions to the SQL
> standard.

Isn't that what we do all the time?

> It's confusing, and risks conflicting with future additions to
> the standard. ECPG supports the actual standard syntax, with OPEN,
> right? So this wouldn't be consistent with ECPG, either.

It would be consistent for the case of no parameters.

>> Curiously, the direct EXECUTE statement uses the non-standard syntax
>>
>>     EXECUTE prep_stmt (param, param);
>>
>> instead of the standard
>>
>>     EXECUTE prep_stmt USING param, param;
>>
>> I tried to consolidate this. But using
>>
>>     DECLARE c CURSOR FOR p (foo, bar)
>>
>> leads to parsing conflicts (and looks confusing?),
>
> How about
>
>     DECLARE c CURSOR FOR EXECUTE p (foo, bar)

That's not the standard syntax for the case of no parameters.

> The attached patch seems to do the trick, of allowing EXECUTE + USING.
> I'm not sure this is worth the trouble, though, since EXECUTE as a plain
> SQL command is a PostgreSQL-extension anyway.

I think it's a PostgreSQL extension that we allow just about anything to be
executed directly. So we should still use the standard syntax either way. It
would be weird if EXECUTE or any other command had different syntax in direct
SQL, ECPG, PL/pgSQL, etc. We have some differences already, but we shouldn't
create more.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
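Putting the two variants being debated side by side, in direct SQL; the USING clause on DECLARE is the proposed PostgreSQL extension under discussion, not something that runs today, and OPEN is not available in direct SQL at all:

    PREPARE stmt(int, text) AS
        SELECT * FROM pg_class WHERE relpages > $1 AND relname LIKE $2;

    -- proposed form: attach USING directly to DECLARE
    BEGIN;
    DECLARE c CURSOR FOR stmt USING 10, 'pg_%';
    FETCH 5 FROM c;
    COMMIT;

    -- what the SQL standard actually specifies:
    --   DECLARE c CURSOR FOR stmt;
    --   OPEN c USING 10, 'pg_%';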
Re: Usage of epoch in txid_current
Hi, On 2018-07-15 16:41:35 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2018-07-09 19:56:25 -0400, Tom Lane wrote: > >> Or, perhaps, use a struct in assert builds and int64 otherwise? > >> You could hide the ensuing notational differences in macros. > > > That should be doable. But I'd like to check if it's necessary > > first. Optimizing passing an 8 byte struct shouldn't be hard for > > compilers these days - especially when most things dealing with them are > > inline functions. If we find that it's not a problem on contemporary > > compilers, it might be worthwhile to use a bit more type safety in other > > places too. > > [ bunch of test results ] > Offhand it would seem that we can get away with struct wrappers > on any platform where performance is really of concern today. Cool, thanks for checking! Greetings, Andres Freund
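A minimal sketch of the struct-wrapper approach being benchmarked here, with illustrative names only; the point is that the wrapper adds type safety without changing the representation, and with inline accessors the tested compilers pass it around like a plain 64-bit integer:

    #include <stdint.h>

    typedef struct FullTransactionId
    {
        uint64_t    value;      /* epoch in the high 32 bits, xid in the low */
    } FullTransactionId;

    static inline uint32_t
    XidFromFullTransactionId(FullTransactionId fxid)
    {
        return (uint32_t) fxid.value;
    }

    static inline uint32_t
    EpochFromFullTransactionId(FullTransactionId fxid)
    {
        return (uint32_t) (fxid.value >> 32);
    }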
Re: patch to allow disable of WAL recycling
On 2018-07-15 20:32:39 -0400, Robert Haas wrote: > On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund wrote: > > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE* > > performance impact of non COW filesystems, and very likely even negative > > impacts in a number of COWed scenarios (when there's enough memory to > > keep all WAL files in memory). > > > > I still think that fixing this another way would be preferrable. This'll > > be too much of a magic knob that depends on the fs, hardware and > > workload. > > I tend to agree with you, but unless we have a pretty good idea what > that other way would be, I think we should probably accept the patch. I don't think I've argued against that - I just want there to be sufficient caveats to make clear it's going to hurt on very common OS & FS combinations. > I think part of the problem here is that whether a WAL segment is > likely to be cached depends on a host of factors which we don't track > very carefully, like whether it's been streamed or decoded recently. > If we knew when that a particular WAL segment hadn't been accessed for > any purpose in 10+ minutes, it would probably be fairly safe to guess > that it's no longer in cache; if we knew that it had been accessed <15 > seconds ago, that it is probably still in cache. But we have no idea. True. Additionally we don't know whether, even if cold cache, re-initializing files isn't worse performance-wise than recycling files. Greetings, Andres Freund
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Hi, On 2018-07-15 18:48:43 -0400, Tom Lane wrote: > So basically, WAL replay hits an error while holding a buffer pin, and > nothing is done to release the buffer pin, but AtProcExit_Buffers thinks > something should have been done. I think there's a few other cases where we hit this. I've seen something similar from inside checkpointer / BufferSync(). I'd be surprised if bgwriter couldn't be triggered into the same. > What seems like a better answer for now is to adjust AtProcExit_Buffers > so that it doesn't cause an assertion failure in this case. I think we > can define "this case" as being "failure exit from the startup process": > if that happens, the postmaster is going to throw up its hands and force > a panic shutdown anyway, so the failure to free a buffer pin shouldn't > have any serious consequences. > The attached one-liner patch does that, and is enough to get through > Michael's test case without an assertion. This is just proof of concept > though --- my inclination, if we go this route, is to make a slightly > longer patch that would fix CheckForBufferLeaks to still print the WARNING > messages if any, but not die with an assertion afterwards. > > Another question is whether this is really a sufficient band-aid. It > looks to me like AtProcExit_Buffers will be called in any auxiliary > process type, not only the startup process. We could just replace the Assert() with a PANIC? > Do, or should we, force a panic restart for nonzero-exit-code failures > of all aux process types? If not, what are we going to do to clean up > similar failures in other aux process types? I'm pretty sure that we do *not* force a panic on all nonzero-exit-code cases for other subprocesses. > BTW, this assertion has been there since fe548629; before that, there > was code that would release any leaked buffer pins, relying on the > PrivateRefCount data for that. So another idea is to restore some > version of that capability, although I think we really really don't > wanna scan the PrivateRefCount array if we can avoid it. I don't think scanning PrivateRefCount would be particularly problematic - in almost all cases it's going to be tiny. These day it's not NBuffers sized anymore, so I can't forsee any performance problems? I think we could invent something like like enter/leave "transactional mode" wherein we throw a PANIC when a buffer leaked. Processes that don't enter it - like startup, checkpointer, etc - would just WARN? Greetings, Andres Freund
Re: patch to allow disable of WAL recycling
Hi, On 2018-07-15 20:55:38 -0400, Tom Lane wrote: > That's not the way to think about it. On a COW file system, we don't > want to "create 16MB files" at all --- we should just fill WAL files > on-the-fly, because the pre-fill activity isn't actually serving the > intended purpose of reserving disk space. It's just completely useless > overhead :-(. So we can't really make a direct comparison between the > two approaches; there's no good way to net out the cost of constructing > the WAL data we need to write. We probably should still allocate them in 16MB segments. We rely on the size being fixed in a number of places. But it's probably worthwhile to just do a posix_fadvise or such. Also, if we continually increase the size with each write we end up doing a lot more metadata transactions, which'll essentially serve to increase journalling overhead further. Greetings, Andres Freund
Re: Pluggable Storage - Andres's take
Hi, I've pushed up a new version to https://github.com/anarazel/postgres-pluggable-storage which now passes all the tests. Besides a lot of bugfixes, I've rebased the tree, moved TriggerData to be primarily slot based (with a conversion roundtrip when calling trigger functions), and a lot of other small things. On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote: > On Tue, Jul 3, 2018 at 5:06 PM Andres Freund wrote: > > The current state of my version of the patch is *NOT* ready for proper > > review (it doesn't even pass all tests, there's FIXME / elog()s). But I > > think it's getting close enough to it's eventual shape that more eyes, > > and potentially more hands on keyboards, can be useful. > > > > I will try to update it to make sure that it passes all the tests and also > try to > reduce the FIXME's. Cool. Alexander, Haribabu, if you give me (privately) your github accounts, I'll give you write access to that repository. > > The most fundamental issues I had with Haribabu's last version from [2] > > are the following: > > > > - The use of TableTuple, a typedef from void *, is bad from multiple > > fronts. For one it reduces just about all type safety. There were > > numerous bugs in the patch where things were just cast from HeapTuple > > to TableTuple to HeapTuple (and even to TupleTableSlot). I think it's > > a really, really bad idea to introduce a vague type like this for > > development purposes alone, it makes it way too hard to refactor - > > essentially throwing the biggest benefit of type safe languages out of > > the window. > > > > My earlier intention was to remove the HeapTuple usage entirely and replace > it with slot everywhere outside the tableam. But it ended up with TableTuple > before it reach to the stage because of heavy use of HeapTuple. I don't think that's necessary - a lot of the system catalog accesses are going to continue to be heap tuple accesses. And the conversions you did significantly continue to access TableTuples as heap tuples - it was just that the compiler didn't warn about it anymore. A prime example of that is the way the rewriteheap / cluster integreation was done. Cluster continued to sort tuples as heap tuples - even though that's likely incompatible with other tuple formats which need different state. > > Additionally I think it's also the wrong approach architecturally. We > > shouldn't assume that a tuple can efficiently be represented as a > > single palloc'ed chunk. In fact, we should move *away* from relying on > > that so much. > > > > I've thus removed the TableTuple type entirely. > > > > Thanks for the changes, I didn't check the code yet, so for now whenever the > HeapTuple is required it will be generated from slot? Pretty much. > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to > > a higher memory usage, because the resulting tuples weren't freed or > > anything. There might be a reason for doing such a change - we've > > certainly discussed that before - but I'm *vehemently* against doing > > that at the same time we introduce pluggable storage. Analyzing the > > performance effects will be hard enough without changes like this. > > > > How about using of slot instead of tuple and reusing of it? I don't know > yet whether it is possible everywhere. Not quite sure what you mean? > > Tasks / Questions: > > > > - split up patch > > > > How about generating refactoring changes as patches first based on > the code in your repo as discussed here[1]? 
Sure - it was just at the moment too much work ;) > > - Change heap table AM to not allocate handler function for each table, > > instead allocate it statically. Avoids a significant amount of data > > duplication, and allows for a few more compiler optimizations. > > > > Some kind of static variable handlers for each tableam, but need to check > how can we access that static handler from the relation. I'm not sure what you mean by "how can we access"? We can just return a pointer from the constant data from the current handler? Except that adding a bunch of consts would be good, the external interface wouldn't need to change? > > - change scan level slot creation to use tableam function for doing so > > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid > > > > so with this there shouldn't be a way from slot to tid mapping or there > should be some other way. I'm not sure I follow? > - bitmap index scans probably need a new tableam.h callback, abstracting > > bitgetpage() > > > > OK. Any chance you could try to tackle this? I'm going to be mostly out this week, so we'd probably not run across each others feet... Greetings, Andres Freund
Re: Pluggable Storage - Andres's take
Hi, On 2018-07-05 15:25:25 +0300, Alexander Korotkov wrote: > > - I think the move of the indexing from outside the table layer into the > > storage layer isn't a good idea. It lead to having to pass EState into > > the tableam, a callback API to perform index updates, etc. This seems > > to have at least partially been triggered by the speculative insertion > > codepaths. I've reverted this part of the changes. The speculative > > insertion / confirm codepaths are now exposed to tableam.h - I think > > that's the right thing because we'll likely want to have that > > functionality across more than a single tuple in the future. > > I agree that passing EState into tableam doesn't look good. But I > believe that tableam needs way more control over indexes than it has > in your version patch. If even tableam-independent insertion into > indexes on tuple insert is more or less ok, but on update we need > something smarter rather than just insert index tuples depending > "update_indexes" flag. Tableam specific update method may decide to > update only some of indexes. For example, when zheap performs update > in-place then it inserts to only indexes, whose fields were updated. > And I think any undo-log based storage would have similar behavior. > Moreover, it might be required to do something with existing index > tuples (for instance, as I know, zheap sets "deleted" flag to index > tuples related to previous values of updated fields). I agree that we probably need more - I'm just inclined to think that we need a more concrete target to work against. Currently zheap's indexing logic still is fairly naive, I don't think we'll get the interface right without having worked further on the zheap side of things. > > - The visibility functions relied on the *caller* performing buffer > > locking. That's not a great idea, because generic code shouldn't know > > about the locking scheme a particular AM needs. I've changed the > > external visibility functions to instead take a slot, and perform the > > necessary locking inside. > > Makes sense for me. But would it cause extra locking/unlocking and in > turn performance impact? I don't think so - nearly all the performance relevant cases do all the visibility logic inside the AM. Where the underlying functions can be used, that do not do the locking. Pretty all the converted places just had manual LockBuffer calls. > > - HOT was encoded in the API in a bunch of places. That doesn't look > > right to me. I tried to improve a bit on that, but I'm not yet quite > > sure I like it. Needs written explanation & arguments... > > Yes, HOT is heapam specific feature. Other tableams might not have > HOT. But it appears that we still expose hot_search_buffer() function > in tableam API. But that function have no usage, so it's just > redundant and can be removed. Yea, that was a leftover. > > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to > > a higher memory usage, because the resulting tuples weren't freed or > > anything. There might be a reason for doing such a change - we've > > certainly discussed that before - but I'm *vehemently* against doing > > that at the same time we introduce pluggable storage. Analyzing the > > performance effects will be hard enough without changes like this. > > I think once we switched to slots, doing heap_copytuple() do > frequently is not required anymore. It's mostly gone now. > > - I've for now backed out the heap rewrite changes, partially. 
Mostly > > because I didn't like the way the abstraction looks, but haven't quite > > figured out how it should look like. > > Yeah, it's hard part, but we need to invent something in this area... I agree. But I really don't yet quite now what. I somewhat wonder if we should just add a cluster_rel() callback to the tableam and let it deal with everything :(. As previously proposed the interface wouldn't have worked with anything not losslessly encodable into a heaptuple, which is unlikely to be sufficient. FWIW, I plan to be mostly out until Thursday this week, and then I'll rebase onto the new version of the abstract slot patch and then try to split up the patchset. Once that's done, I'll do a prototype conversion of zheap, which I'm sure will show up a lot of weaknesses in the current abstraction. Once that's done I hope we can collaborate / divide & conquer to get the individual pieces into commit shape. If either of you wants to get a head start separating something out, let's try to organize who would do what? The EPQ and trigger slotification are probably good candidates. Greetings, Andres Freund
Re: Usage of epoch in txid_current
Andres Freund writes: > On 2018-07-15 16:41:35 -0400, Tom Lane wrote: >> Andres Freund writes: >>> On 2018-07-09 19:56:25 -0400, Tom Lane wrote: Or, perhaps, use a struct in assert builds and int64 otherwise? You could hide the ensuing notational differences in macros. >> [ bunch of test results ] >> Offhand it would seem that we can get away with struct wrappers >> on any platform where performance is really of concern today. > Cool, thanks for checking! BTW, independently of any performance questions, these results show that my idea above was untenable anyway. On those platforms where there is a codegen difference, doing it like that would have resulted in an ABI difference between regular and assert builds. That's something to avoid, at least in any API that's visible to extension modules --- we've had project policy for some time that it should be possible to use non-assert extensions with assert-enabled core and vice versa. Conceivably we could have used the struct API only under a special devel flag that few people use except a buildfarm animal or two. But that's just a pain in the rear. regards, tom lane
Re: Parallel queries in single transaction
> Well, sure. But you could just as well open multiple connections and
> make the queries concurrent that way. Or change the GUC to increase the
> number of workers for the nightly ETL.

That is an option right now, using permanent staging tables for a future join.
I mistakenly said ETL while it is ELT, which means that most of the operations
happen in the database, so we try to keep all changes in database code instead
of changing the execution engine. In PG11 we have parallel CTAS, which is a
dramatic improvement for us, but there will still be operations (query plans)
which are not parallel.

Having PostgreSQL completely ACID is an amazing feature, so having to run the
ELT operation outside a transaction - and guarantee that the ELT job completed
successfully by checking that all steps (multiple transactions with staging
tables) succeeded, with graceful rollback and cleanup in case of failure -
makes things more complex. Indeed, I still agree that it is possible to work
around this at the application level.

-P

On Mon, Jul 16, 2018 at 2:28 PM Tomas Vondra wrote:
> On 07/16/2018 12:03 PM, Paul Muntyanu wrote:
> > Hi Tomas, thanks for looking into. I am more talking about queries which
> > can not be optimized, e.g.
> > * fullscan of the table and heavy calculations for another one.
> > * query through FDW for both queries(e.g. one query fetches data from
> > Kafka and another one is fetching from remote Postgres. There are no
> > bounds for both queries for anything except local CPU, network and
> > remote machine)
> >
> > IO bound is not a problem in case if you have multiple tablesapces.
>
> But it was you who mentioned "query stuck" not me. I merely pointed out
> that in such cases running queries concurrently won't help.
>
> > And CPU bound can be not the case when you have 32 cores and 6 max workers
> > per query. Then, during nigtly ETL, I do not have anything except single
> > query running) == 6 cores are occupied. If I can run queries in
> > parallel, I would occupy two IO stacks(two tablespaces) + 12 cores
> > instead of sequentially 6 and then again 6.
>
> Well, sure. But you could just as well open multiple connections and
> make the queries concurrent that way. Or change the GUC to increase the
> number of workers for the nightly ETL.
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: patch to allow disable of WAL recycling
Andres Freund writes: > On 2018-07-15 20:55:38 -0400, Tom Lane wrote: >> That's not the way to think about it. On a COW file system, we don't >> want to "create 16MB files" at all --- we should just fill WAL files >> on-the-fly, because the pre-fill activity isn't actually serving the >> intended purpose of reserving disk space. It's just completely useless >> overhead :-(. So we can't really make a direct comparison between the >> two approaches; there's no good way to net out the cost of constructing >> the WAL data we need to write. > We probably should still allocate them in 16MB segments. We rely on the > size being fixed in a number of places. Reasonable point. I was supposing that it'd be okay if a partially written segment were shorter than 16MB, but you're right that that would require vetting a lot of code to be sure about it. > But it's probably worthwhile to > just do a posix_fadvise or such. Also, if we continually increase the > size with each write we end up doing a lot more metadata transactions, > which'll essentially serve to increase journalling overhead further. Hm. What you're claiming is that on these FSen, extending a file involves more/different metadata activity than allocating new space for a COW overwrite of an existing area within the file. Is that really true? The former case would be far more common in typical usage, and somehow I doubt the FS authors would have been too stupid to optimize things so that the same journal entry can record both the space allocation and the logical-EOF change. But anyway, this means we have two nearly independent issues to investigate: whether recycling/renaming old files is cheaper than constantly creating and deleting them, and whether to use physical file zeroing versus some "just set the EOF please" filesystem call when first creating a file. The former does seem like it's purely a performance question, but the latter involves a tradeoff of performance against an ENOSPC-panic protection feature that in reality only works on some filesystems. regards, tom lane
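To make the two approaches concrete, here is a rough standalone sketch (not the actual xlog.c code) of the difference under discussion: physically zero-filling a 16MB segment versus asking the filesystem to reserve the space via posix_fallocate(), which is cheap but only gives a real ENOSPC guarantee on filesystems that honor the reservation.

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)

/* Variant 1: physical zero-fill.  Writing 16MB of zeros really allocates the
 * blocks on a conventional filesystem, so later WAL writes cannot fail with
 * ENOSPC; on a COW filesystem the zeros are just wasted work. */
static int
prefill_with_zeros(int fd)
{
    char        buf[8192];

    memset(buf, 0, sizeof(buf));
    for (off_t off = 0; off < SEG_SIZE; off += (off_t) sizeof(buf))
    {
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            return -1;
    }
    return fsync(fd);
}

/* Variant 2: "just set the EOF".  posix_fallocate() asks the filesystem to
 * reserve the space without writing data; whether that reservation means
 * anything on a COW filesystem is exactly the question raised above. */
static int
prefill_with_fallocate(int fd)
{
    return posix_fallocate(fd, 0, SEG_SIZE);
}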
Re: cursors with prepared statements
On Mon, Jul 16, 2018 at 8:56 AM, Peter Eisentraut wrote: >> The attached patch seems to do the trick, of allowing EXECUTE + USING. >> I'm not sure this is worth the trouble, though, since EXECUTE as a plain >> SQL command is a PostgreSQL-extension anyway. > > I think it's a PostgreSQL extension that we allow just about anything to > be executed directly. So we should still use the standard syntax either > way. It would be weird if EXECUTE or any other command had different > syntax in direct SQL, ECPG, PL/pgSQL, etc. We have some differences > already, but we shouldn't create more. Hmm. Your proposal to attach a USING clause to DECLARE .. CURSOR FOR rather than inventing an OPEN command is an argument for a PostgreSQL syntax extension, but your proposal to write DECLARE .. CURSOR FOR rather than DECLARE .. CURSOR FOR EXECUTE is an argument for standard syntax over and against a PostgreSQL extension. That sounds a little contradictory, but I think I agree with it. If we allow a USING clause for DECLARE .. CURSOR FOR, that doesn't prevent somebody from inventing an OPEN command in the future. As part of introducing such an OPEN command, DECLARE .. CURSOR FOR could be made not to fail if the prepared statement has parameters but no USING commands. On the other hand, if we insist on injecting the word EXECUTE into the syntax as Heikki proposes, that's purely and simply an incompatibility with the standard's syntax, as well as with what ECPG already does. So +1 for your position. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > On 13/07/18 01:39, Andrew Dunstan wrote: > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > >>> On 2018-Jul-12, Andrew Dunstan wrote: > >>> > I fully understand. I think this needs to go back to "Waiting on > Author". > >>> Why? Heikki's patch applies fine and passes the regression tests. > >> > >> Well, I understood Claudio was going to do some more work (see > >> upthread). > > > > Claudio raised a good point, that doing small pallocs leads to > > fragmentation, and in particular, it might mean that we can't give > > back the memory to the OS. The default glibc malloc() implementation > > has a threshold of 4 or 32 MB or something like that - allocations > > larger than the threshold are mmap()'d, and can always be returned to > > the OS. I think a simple solution to that is to allocate larger > > chunks, something like 32-64 MB at a time, and carve out the > > allocations for the nodes from those chunks. That's pretty > > straightforward, because we don't need to worry about freeing the > > nodes in retail. Keep track of the current half-filled chunk, and > > allocate a new one when it fills up. > > > Google seems to suggest the default threshold is much lower, like 128K. > Still, making larger allocations seems sensible. Are you going to work > on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
Re: patch to allow disable of WAL recycling
On Mon, Jul 16, 2018 at 10:12 AM, Tom Lane wrote: > But anyway, this means we have two nearly independent issues to > investigate: whether recycling/renaming old files is cheaper than > constantly creating and deleting them, and whether to use physical > file zeroing versus some "just set the EOF please" filesystem call > when first creating a file. The former does seem like it's purely > a performance question, but the latter involves a tradeoff of > performance against an ENOSPC-panic protection feature that in > reality only works on some filesystems. It's been a few years since I tested this, but my recollection is that if you fill up pg_xlog, the system will PANIC and die on a vanilla Linux install. Sure, you can set max_wal_size, but that's a soft limit, not a hard limit, and if you generate WAL faster than the system can checkpoint, you can overrun that value and force allocation of additional WAL files. So I'm not sure we have any working ENOSPC-panic protection today. Given that, I'm doubtful that we should prioritize maintaining whatever partially-working protection we may have today over raw performance. If we want to fix ENOSPC on pg_wal = PANIC, and I think that would be a good thing to fix, then we should do it either by finding a way to make the WAL insertion ERROR out instead of panicking, or throttle WAL generation as we get close to disk space exhaustion so that the checkpoint has time to complete, as previously proposed by Heroku. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
Andres Freund writes: > On 2018-07-15 18:48:43 -0400, Tom Lane wrote: >> So basically, WAL replay hits an error while holding a buffer pin, and >> nothing is done to release the buffer pin, but AtProcExit_Buffers thinks >> something should have been done. > I think there's a few other cases where we hit this. I've seen something > similar from inside checkpointer / BufferSync(). I'd be surprised if > bgwriter couldn't be triggered into the same. Hm, yeah, on reflection it's pretty obvious that those are hazard cases. > I'm pretty sure that we do *not* force a panic on all nonzero-exit-code > cases for other subprocesses. That's my recollection as well -- mostly, we just start a new one. So I said I didn't want to do extra work on this, but I am looking into fixing it by having these aux process types run a ResourceOwner that can be told to clean up any open buffer pins at exit. We could be sure the coverage is complete by dint of removing the special-case code in resowner.c that allows buffer pins to be taken with no active resowner. Then CheckForBufferLeaks can be left as-is, ie something we do only in assert builds. regards, tom lane
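A minimal sketch of that idea (an outline only, not the actual fix; AuxProcessMain is a hypothetical stand-in for the startup, checkpointer, or bgwriter main routines):

#include "postgres.h"
#include "utils/resowner.h"

/* Sketch: give an auxiliary process its own ResourceOwner so that any buffer
 * pins it takes are tracked, and release them wholesale on the way out (or in
 * an error-recovery path), instead of relying on the special case in
 * resowner.c that lets pins be taken with no owner at all. */
static void
AuxProcessMain(void)
{
    CurrentResourceOwner = ResourceOwnerCreate(NULL, "AuxiliaryProcess");

    /* ... the process's normal work: WAL replay, BufferSync(), etc. ... */

    /* Buffer pins are released in the BEFORE_LOCKS phase. */
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_LOCKS, false, true);
    ResourceOwnerRelease(CurrentResourceOwner,
                         RESOURCE_RELEASE_AFTER_LOCKS, false, true);
    ResourceOwnerDelete(CurrentResourceOwner);
    CurrentResourceOwner = NULL;
}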
Re: Libpq support to connect to standby server as priority
Haribabu Kommi wrote: > > On Wed, Jul 4, 2018 at 11:14 PM Laurenz Albe > > wrote: > > > Haribabu Kommi wrote: > > > > > > - I think the construction with "read_write_host_index" makes the code > > > even more > > > complicated than it already is. > > > > > > What about keeping the first successful connection open and storing it > > > in a > > > variable if we are in "prefer-read" mode. > > > If we get the read-only connection we desire, close that cached > > > connection, > > > otherwise use it. > > > > Even if we add a variable to cache the connection, I don't think the logic > > of checking > > the next host for the read-only host logic may not change, but the extra > > connection > > request to the read-write host again will be removed. > > I evaluated your suggestion of caching the connection and reuse it when there > is no > read only server doesn't find, but I am thinking that it will add more > complexity and also > the connection to the other servers delays, the cached connection may be > closed by > the server also because of timeout. > > I feel the extra time during connection may be fine, if user is preferring > the prefer-read > mode, instead of adding more complexity in handling the cached connection? > > comments? I tested the new patch, and it works as expected. I don't think that time-out of the cached session is a valid concern, because that would have to be a really short timeout. On the other hand, establishing the connection twice (first to check if it is read-only, then again because no read-only connection is found) can be quite costly. But that is a matter of debate, as is the readability of the code. Since I don't think I can contribute more to this patch, I'll mark it as ready for committer. Yours, Laurenz Albe
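For reference, the tradeoff being debated can be approximated from the client side with standard libpq calls; a rough sketch of the logic (error handling trimmed, and this illustrates the idea rather than the patch itself):

#include <string.h>
#include <libpq-fe.h>

/* Returns true if the server reports itself as read-only. */
static int
is_read_only(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "SHOW transaction_read_only");
    int         ro = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                      strcmp(PQgetvalue(res, 0, 0), "on") == 0);

    PQclear(res);
    return ro;
}

/* Try each host in turn, preferring a read-only server but caching the first
 * read-write connection as a fallback instead of reconnecting to it later. */
PGconn *
connect_prefer_read(const char *const *conninfos, int n)
{
    PGconn     *fallback = NULL;

    for (int i = 0; i < n; i++)
    {
        PGconn     *conn = PQconnectdb(conninfos[i]);

        if (PQstatus(conn) != CONNECTION_OK)
        {
            PQfinish(conn);
            continue;
        }
        if (is_read_only(conn))
        {
            if (fallback)
                PQfinish(fallback);     /* found what we prefer */
            return conn;
        }
        if (fallback == NULL)
            fallback = conn;            /* cache the read-write connection */
        else
            PQfinish(conn);
    }
    return fallback;                    /* may be NULL if nothing worked */
}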
Re: GiST VACUUM
On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas wrote: > I'm still a bit scared about using pd_prune_xid to store the XID that > prevents recycling the page too early. Can we use some field in > GISTPageOpaqueData for that, similar to how the B-tree stores it in > BTPageOpaqueData? What's your reason for being scared? It seems to me that the alternatives being proposed (altering the size of the special space, or sometimes repurposing a field for some other purpose) have their own associated scariness. If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece of seemingly heap-specific data that is kept in the generic page header rather than in the heap's special space. Other AMs like btree or zheap may have different needs; one size does not fit all. That said, since getting rid of pd_prune_xid seems impractical, maybe it makes more sense to reuse it than to insist on leaving it idle and consuming space elsewhere in the page. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] advanced partition matching algorithm for partition-wise join
On Sun, Jul 15, 2018 at 1:43 PM, Dmitry Dolgov <9erthali...@gmail.com> wrote: > Partitions: test11 FOR VALUES FROM (0) TO (100), > test12 FOR VALUES FROM (100) TO (200), > test13 FOR VALUES FROM (200) TO (300) > > Partitions: test21 FOR VALUES FROM (10) TO (110), > test22 FOR VALUES FROM (110) TO (210), > test23 FOR VALUES FROM (210) TO (310) > > I'm confused, since there is only one-to-one mapping of partitions. No, test21 would have to be joined to both test11 and test12, since either could contain matching rows. Also, test22 would have to be joined to both test12 and test13. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 3:49 AM, Heikki Linnakangas wrote: > I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly > legal SQL, even if it's not very useful in practice. +1. I think that will probably generate lots of annoying warnings in server logs compared to the value it adds. I don't object to having an optional feature to generate warnings, but I think it should be something a particular user needs to activate, not something we enable by default. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Make foo=null a warning by default.
Heikki Linnakangas writes: > On 16/07/18 04:40, David Fetter wrote: >> Per a discussion with Andrew Gierth and Vik Fearing, both of whom >> helped make this happen, please find attached a patch which makes it >> possible to get SQL standard behavior for "= NULL", which is an error. >> It's been upgraded to a warning, and can still be downgraded to >> silence (off) and MS-SQL-compatible behavior (on). > I don't agree with changing the default to 'warn'. "foo = NULL" is > perfectly legal SQL, even if it's not very useful in practice. I think that there's a very narrow argument to be made that SQL doesn't allow a NULL literal with context-determined type in this context. But we decided to generalize that restriction long ago, and suddenly deciding to enforce it only in this one context makes no sense to me. The idea that we would ever decide that it's an error seems flat out ridiculous. TBH I'm not really excited about investing any work in this area at all. Considering how seldom we hear any questions about transform_null_equals anymore[1], I'm wondering if we couldn't just rip the "feature" out entirely. But to the extent that we want to leave it in place, it's 100% for backwards compatibility, and that is a strong argument against changing anything about it. I'm also pretty dubious that issuing a warning here will accomplish much to help anyone find bugs. There are too many small variants of the same problem that it will not catch[2]. regards, tom lane [1] In a quick search, the most recent discussion I could find was from 2011: https://www.postgresql.org/message-id/flat/201110070729.p977tdcx075...@wwwmaster.postgresql.org Before that it came up maybe once a year or so, but it's just fallen off a cliff since then. I wonder whether Microsoft fixed Access to not need it. [2] Compare for instance the discussion about bug #6064, https://www.postgresql.org/message-id/flat/4DFE481F02250003E8EA%40gw.wicourts.gov A very large fraction of the pre-2011 threads mentioning transform_null_equals are essentially of the form "why didn't/can't transform_null_equals fix this bad code for me?". Each of those cases would also be a case that the proposed warning doesn't catch.
Re: Make foo=null a warning by default.
On 16/07/18 18:10, Tom Lane wrote: Heikki Linnakangas writes: On 16/07/18 04:40, David Fetter wrote: Per a discussion with Andrew Gierth and Vik Fearing, both of whom helped make this happen, please find attached a patch which makes it possible to get SQL standard behavior for "= NULL", which is an error. It's been upgraded to a warning, and can still be downgraded to silence (off) and MS-SQL-compatible behavior (on). I don't agree with changing the default to 'warn'. "foo = NULL" is perfectly legal SQL, even if it's not very useful in practice. I think that there's a very narrow argument to be made that SQL doesn't allow a NULL literal with context-determined type in this context. But we decided to generalize that restriction long ago, and suddenly deciding to enforce it only in this one context makes no sense to me. The idea that we would ever decide that it's an error seems flat out ridiculous. TBH I'm not really excited about investing any work in this area at all. Considering how seldom we hear any questions about transform_null_equals anymore[1], I'm wondering if we couldn't just rip the "feature" out entirely. Yeah, I was wondering about that too. But Fabien brought up a completely new use-case for this: people learning SQL. For beginners who don't understand the behavior of NULLs yet, I can see a warning or error being useful training wheels. Perhaps a completely new "training_wheels=on" option, which could check may for many other beginner errors, too, would be better for that. - Heikki
Re: [HACKERS] logical decoding of two-phase transactions
Hi Nikhil, I've been looking at this patch series, and I do have a bunch of comments and questions, as usual ;-) Overall, I think it's clear the main risk associated with this patch is the decode group code - it touches PROC entries, so a bug may cause trouble pretty easily. So I've focused on this part, for now. 1) LogicalLockTransaction does roughly this ... if (MyProc->decodeGroupLeader == NULL) { leader = AssignDecodeGroupLeader(txn->xid); if (leader == NULL || !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid)) goto lock_cleanup; } leader = BackendXidGetProc(txn->xid); if (!leader) goto lock_cleanup; leader_lwlock = LockHashPartitionLockByProc(leader); LWLockAcquire(leader_lwlock, LW_EXCLUSIVE); pgxact = &ProcGlobal->allPgXact[leader->pgprocno]; if(pgxact->xid != txn->xid) { LWLockRelease(leader_lwlock); goto lock_cleanup; } ... I wonder why we need the BackendXidGetProc call after the first if block. Can we simply grab MyProc->decodeGroupLeader at that point? 2) InitProcess now resets decodeAbortPending/decodeLocked flags, while checking decodeGroupLeader/decodeGroupMembers using asserts. Isn't that a bit strange? Shouldn't it do the same thing with both? 3) A comment in ProcKill says this: * Detach from any decode group of which we are a member. If the leader * exits before all other group members, its PGPROC will remain allocated * until the last group process exits; that process must return the * leader's PGPROC to the appropriate list. So I'm wondering what happens if the leader dies before other group members, but the PROC entry gets reused for a new connection. It clearly should not be a leader for that old decode group, but it may need to be a leader for another group. 4) strange hunk in ProcKill There seems to be some sort of merge/rebase issue, because this block of code (line ~880) related to lock groups /* Return PGPROC structure (and semaphore) to appropriate freelist */ proc->links.next = (SHM_QUEUE *) *procgloballist; *procgloballist = proc; got replaced by code relared to decode groups. That seems strange. 5) ReorderBufferCommitInternal I see the LogicalLockTransaction() calls in ReorderBufferCommitInternal have vastly variable comments. Some calls have no comment, some calls have "obvious" comment like "Lock transaction before catalog access" and one call has this very long comment /* * Output plugins can access catalog metadata and we * do not have any control over that. We could ask * them to call * LogicalLockTransaction/LogicalUnlockTransaction * APIs themselves, but that leads to unnecessary * complications and expectations from plugin * writers. We avoid this by calling these APIs * here, thereby ensuring that the in-progress * transaction will be around for the duration of * the apply_change call below */ I find that rather inconsistent, and I'd say those comments are useless. I suggest to remove all the per-call comments and instead add a comment about the locking into the initial file-level comment, which already explains handling of large transactions, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)
On Sun, Jul 15, 2018 at 10:10 PM, Michael Paquier wrote: > On Fri, Jul 13, 2018 at 04:57:59PM -0500, Robert Haas wrote: >> On Mon, Jul 9, 2018 at 4:41 PM, Michael Paquier wrote: >>> Another idea that I have here, is to rework the page for monitoring >>> stats so as we create one sub-section for each system view, and also one >>> for the table of wait events. For the wait events, we could then >>> completely remove the first category column which has morerows and >>> divide the section into on table per event category. >> >> +1 from me. I think I proposed that before. > > Attached is a proof of concept of that. I have divided the "Viewing > Statistics" section into a subset for each catalog, and each wait event > type gains its sub-section as well. There is a bit more to do with the > indentation and some xreflabels, but I think that this is enough to > begin a discussion. > > Thoughts? This doesn't seem to get rid of the morerows stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Logical decoding from promoted standby with same replication slot
On Fri, Jul 13, 2018 at 2:30 PM, Jeremy Finzel wrote: > Hello - > > We are working on several DR scenarios with logical decoding. Although we > are using pglogical the question we have I think is generally applicable to > logical replication. > > Say we have need to drop a logical replication slot for some emergency > reason on the master, but we don't want to lose the data permanently. We > can make a point-in-time-recovery snapshot of the master to use in order to > recover the lost data in the slot we are about to drop. Then we drop the > slot on master. > > We can then point our logical subscription to pull from the snapshot to > get the lost data, once we promote it. > > The question is that after promotion, logical decoding is looking for a > timeline 2 file whereas the file is still at timeline 1. > > The WAL file is 000108FD003C, for example. After promotion, > it is still 000108FD003C in pg_wal. But logical decoding says > ERROR: segment 000208FD003C has already been removed (it is > looking for a timeline 2 WAL file). Simply renaming the file actually > allows us to stream from the replication slot accurately and recover the > data. > > But all of this begs the question of an easier way to do this - why > doesn't logical decoding know to look for a timeline 1 file? It is really > helpful to have this ability to easily recover logical replicated data from > a snapshot of a replication slot, in case of disaster. > > All thoughts very welcome! > > Thanks, > Jeremy > I'd like to bump this question with some elaboration on my original question: is it possible to do a *controlled* failover reliably with logical decoding, assuming there are unconsumed changes in the replication slot that client still needs? It is rather easy to do a controlled failover if we can verify there are no unconsumed changes in the slot before failover. Then, we just recreate the slot on the promoted standby while clients are locked out, and we have not missed any data changes. I am trying to figure out if the problem of following timelines, as per this wiki for example: https://wiki.postgresql.org/wiki/Failover_slots, can be worked around in a controlled scenario. One additional part of this is that after failover I have 2 WAL files with the same walfile name but on differing timelines, and the promoted standby is only going to decode from the latter. Does that mean I am likely to lose data? Part of the reason I ask is because in testing, I have NOT lost data in doing a controlled failover as described above (i.e. with unconsumed changes in the slot that I need to replay on promoted standby). I am trying to figure out if I've gotten lucky or if this method is actually reliable. That is, renaming the WAL files to bump the timeline, since these WAL files are simply identical to the ones that were played on the master, and thus ought to show the same logical decoding information to be consumed. Thank you! Jeremy
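For context on why the expected file name changes at all: the segment name is just three 8-digit hex fields, with the timeline ID first, as in the sketch below, which mirrors the naming convention (the in-tree XLogFileName macro does the equivalent, though its exact arguments vary between versions). So a slot replaying on timeline 2 constructs a 0002... name even though identical bytes exist under the 0001... name.

#include <stdint.h>
#include <stdio.h>

/* Sketch of the WAL segment naming scheme: timeline ID, then the high and low
 * halves of the segment number, each as 8 hex digits.  With 16MB segments
 * there are 256 segments per "xlogid", hence the divisor. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t timeline, uint64_t segno)
{
    const uint64_t segs_per_xlogid = UINT64_C(0x100000000) / (16 * 1024 * 1024);

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) timeline,
             (unsigned int) (segno / segs_per_xlogid),
             (unsigned int) (segno % segs_per_xlogid));
}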
Re: New GUC to sample log queries
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing wrote: > Hmm. Not sure if that last word should be _sample, _sampling, _rate, or > a combination of those. +1 for rate or sample_rate. I think "sample" or "sampling" without "rate" will not be very clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PATCH: psql tab completion for SELECT
Hello, postgres=# select partkey, partc[TAB] > [no completions] > From the thread, I believe that this feature will be implemented in a later patch. > > And I'd love this case, where go back to edit the SELECT list, after > already typing the FROM part, to be smarter: > > postgres=# select p[TAB] FROM lp; > Display all 370 possibilities? (y or n) > I believe this would be a very interesting feature indeed. After playing around a little bit with the patch, I noticed that a comma was missing in line 1214 + 1202 /* min_server_version */ + 1203 9, + 1204 /* catname */ + 1205 "pg_catalog.pg_proc p", + 1206 /* selcondition */ + 1207 "p.prorettype NOT IN ('trigger'::regtype, 'internal'::regtype) " + 1208 "AND 'internal'::regtype != ALL (p.proargtypes) " + 1209 "AND p.oid NOT IN (SELECT unnest(array[typinput,typoutput,typreceive,typsend,typmodin,typmodout,typanalyze]) FROM pg_type) " + 1210 "AND p.oid NOT IN (SELECT unnest(array[aggtransfn,aggfinalfn]) FROM pg_aggregate) " + 1211 "AND p.oid NOT IN (SELECT unnest(array[oprcode,oprrest,oprjoin]) FROM pg_operator) " + 1212 "AND p.oid NOT IN (SELECT unnest(array[lanplcallfoid,laninline,lanvalidator]) FROM pg_language) " + 1213 "AND p.oid NOT IN (SELECT castfunc FROM pg_cast) " + 1214 "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) " + 1215 /* viscondition */ + 1216 "pg_catalog.pg_function_is_visible(p.oid)", To catch these typos, it would be good if we could get some testing around psql. (Newbie question: do we have any kind of testing around tools like psql?) Thanks Joao
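As an aside on why this class of typo compiles cleanly, here is a small standalone C illustration (not the psql code itself): adjacent string literals are concatenated by the compiler, so the missing comma silently merges two entries and shifts every later initializer by one.

#include <stdio.h>

int
main(void)
{
    const char *parts[] = {
        "p.prorettype <> 0 ",                                 /* selcondition */
        "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) "    /* missing comma here... */
        "pg_catalog.pg_function_is_visible(p.oid)",           /* ...so this is glued onto the previous literal */
    };

    /* Prints 2, not the 3 entries the author intended. */
    printf("%zu\n", sizeof(parts) / sizeof(parts[0]));
    return 0;
}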
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is (fwd)
On 12/07/18 21:27, Fabien COELHO wrote: For the testing, we just need to make sure that at least one progress report is always printed, if -P is used. Right? Yep. That is the first condition above: the last_report is set to thread_start, meaning that there has been no report. So where does the 0.5 second rule come in? Can't we just do "if (no progress reports were printed) { print progress report; }" at the end? The second 0.5s condition is to print a closing report if some significant time passed since the last one, but we do not want to print a report at the same second. pgbench -T 5 -P 2 would then print reports at 2, 4 and 5. 0.5 ensures that we are not going to do "2 4.0[00] 4.0[01]" on -t whatever -P 2, which would look silly. If you do not like it then the second condition can be removed, fine with me. As the code stands, you would get reports at "2 4.0[00]", right? Let's keep it that way. I think the only change we need to make in the logic is to check at the end, if *any* progress reports at all have been printed, and print one if not. And do that only when the -P option is smaller than the -T option, I suppose. It also adds a small feature which is that there is always a final progress report when the run is completed, which can be useful when computing progress statistics, otherwise some transactions could not be reported in any progress at all. Any transactions in the last 0.5 seconds might still not get reported in any progress reports. Yep. I wanted a reasonable threshold, given that both -T and -P are in seconds anyway, so it probably could only happen with -t. Oh. I'm a bit surprised we don't support decimals, i.e. -P 0.5. Actually, it seems to be accepted, but it's truncated down to the nearest integer. That's not very nice :-(. But it's a separate issue. Indeed… but then throttling would not be tested:-) The point of the test is to exercise all time-related options, including throttling with a reasonably small value. Ok. I don't think that's really worthwhile. If we add some code that only runs in testing, then we're not really testing the real thing. I wouldn't trust the test to tell much. Let's just leave out that magic environment variable thing, and try to get the rest of the patch finished. If you remove the environment, then some checks need to be removed, because the 2 second run may be randomly shortened when there is nothing to do. If not, the test will fail nondeterministically, which is not acceptable. Hence the hack. I agree that it is not beautiful. The more reasonable alternative could be to always last 2 seconds under -T 2, even if the execution could be shortened because there is nothing to do at all, i.e. remove the environment-based condition but keep the sleep. That sounds reasonable. It's a bit silly to wait when there's nothing to do, but it's also weird if the test exits before the specified time is up. Seems less surprising to always sleep. - Heikki
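A minimal sketch of the end-of-run progress-report logic converged on above, using made-up names rather than the real pgbench variables:

#include <stdbool.h>
#include <stdio.h>

static bool printed_any_progress = false;

/* Stand-in for the real progress report printer. */
static void
print_progress_line(double now)
{
    printf("progress: %.1f s ...\n", now);
}

static void
maybe_print_progress(double now, double *last_report, int progress_interval)
{
    if (now - *last_report >= progress_interval)
    {
        print_progress_line(now);
        *last_report = now;
        printed_any_progress = true;
    }
}

static void
finish_run(double now, int progress_interval, int duration)
{
    /* Guarantee at least one report when -P is used and is smaller than -T. */
    if (progress_interval > 0 && progress_interval < duration &&
        !printed_any_progress)
        print_progress_line(now);
}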
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > wrote: > > > > > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > > On 13/07/18 01:39, Andrew Dunstan wrote: > > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > >>> On 2018-Jul-12, Andrew Dunstan wrote: > > >>> > > I fully understand. I think this needs to go back to "Waiting on > > Author". > > >>> Why? Heikki's patch applies fine and passes the regression tests. > > >> > > >> Well, I understood Claudio was going to do some more work (see > > >> upthread). > > > > > > Claudio raised a good point, that doing small pallocs leads to > > > fragmentation, and in particular, it might mean that we can't give > > > back the memory to the OS. The default glibc malloc() implementation > > > has a threshold of 4 or 32 MB or something like that - allocations > > > larger than the threshold are mmap()'d, and can always be returned to > > > the OS. I think a simple solution to that is to allocate larger > > > chunks, something like 32-64 MB at a time, and carve out the > > > allocations for the nodes from those chunks. That's pretty > > > straightforward, because we don't need to worry about freeing the > > > nodes in retail. Keep track of the current half-filled chunk, and > > > allocate a new one when it fills up. > > > > > > Google seems to suggest the default threshold is much lower, like 128K. > > Still, making larger allocations seems sensible. Are you going to work > > on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > the dynamic threshold is set to the size of the freed block. > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else.
Re: Make foo=null a warning by default.
Heikki Linnakangas writes: > On 16/07/18 18:10, Tom Lane wrote: >> TBH I'm not really excited about investing any work in this area at all. >> Considering how seldom we hear any questions about transform_null_equals >> anymore[1], I'm wondering if we couldn't just rip the "feature" out >> entirely. > Yeah, I was wondering about that too. But Fabien brought up a completely > new use-case for this: people learning SQL. For beginners who don't > understand the behavior of NULLs yet, I can see a warning or error being > useful training wheels. Perhaps a completely new "training_wheels=on" > option, which could check may for many other beginner errors, too, would > be better for that. Agreed --- but what we'd want that to do seems only vaguely related to the existing behavior of transform_null_equals. As an example, we intentionally made transform_null_equals *not* trigger on CASE x WHEN NULL THEN ... but a training-wheels warning for that would likely be reasonable. For that matter, many of the old threads about this are complaining about nulls that aren't simple literals in the first place. I wonder whether a training-wheels feature that whined *at runtime* about null WHERE-qual or case-test results would be more useful than a parser check. regards, tom lane
Re: Vacuum: allow usage of more than 1GB of work mem
On 16/07/18 18:35, Claudio Freire wrote: On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else. Yeah, we basically want to be well above whatever the threshold is. I don't think we should try to check for any specific constant, just make it large enough. Different libc implementations might have different policies, too. There's little harm in overshooting, and making e.g. 64 MB allocations when 1 MB would've been enough to trigger the mmap() behavior. It's going to be more granular than the current situation, anyway, where we do a single massive allocation. (A code comment to briefly mention the thresholds on common platforms would be good, though). - Heikki
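To make the suggestion concrete, here is a minimal bump-allocator sketch of the idea (not the actual vacuum code): grab memory in chunks well above any malloc mmap threshold, carve node-sized pieces out of the current chunk, and free everything in one pass at the end.

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE  (64 * 1024 * 1024)   /* comfortably above any mmap threshold */

typedef struct Chunk
{
    struct Chunk *next;
    size_t        used;
    char          data[];                /* CHUNK_SIZE bytes follow */
} Chunk;

static Chunk *current_chunk = NULL;

/* Carve a node out of the current chunk, starting a new chunk when it fills
 * up.  Individual nodes are never freed; see chunk_free_all(). */
static void *
chunk_alloc(size_t size)
{
    size = (size + 7) & ~(size_t) 7;     /* keep 8-byte alignment */

    if (current_chunk == NULL || current_chunk->used + size > CHUNK_SIZE)
    {
        Chunk *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

        if (c == NULL)
            return NULL;
        c->next = current_chunk;
        c->used = 0;
        current_chunk = c;
    }

    void *result = current_chunk->data + current_chunk->used;

    current_chunk->used += size;
    return result;
}

/* Release every chunk at once; allocations this large are typically
 * mmap()'d by glibc, so the memory goes straight back to the OS. */
static void
chunk_free_all(void)
{
    while (current_chunk != NULL)
    {
        Chunk *next = current_chunk->next;

        free(current_chunk);
        current_chunk = next;
    }
}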
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 09:49:14AM +0200, Fabien COELHO wrote: > Hello David, > > >Per a discussion with Andrew Gierth and Vik Fearing, both of whom > >helped make this happen, please find attached a patch which makes it > >possible to get SQL standard behavior for "= NULL", which is an error. > >It's been upgraded to a warning, and can still be downgraded to > >silence (off) and MS-SQL-compatible behavior (on). > > That's nice, and good for students, and probably others as well:-) > > A few comments: > > Patch applies & compiles, "make check" ok. > > #backslash_quote = safe_encoding# on, off, or safe_encoding > [...] > #transform_null_equals = warn Fixed. > I think it would be nice to have the possible values as a comment in > "postgresql.conf". > > * Code: > > -bool operator_precedence_warning = false; > +bool operator_precedence_warning = false; > > Is this reindentation useful/needed? I believe so because it fits with the line just below it. > + parser_errposition(pstate, a->location))); > + if (need_transform_null_equals && transform_null_equals == > TRANSFORM_NULL_EQUALS_ON) > > For consistency, maybe skip a line before the if? Fixed. > transform_null_equals_options[] = { [...] > > I do not really understand the sort order of this array. Maybe it could be > alphabetical, or per value? Or maybe it is standard values and then > synonyms, in which is case maybe a comment on the end of the list. I've changed it to per value. Is this better? > Guc help text: ISTM that it should spell out the possible values and their > behavior. Maybe something like: > > """ > When turned on, turns expr = NULL into expr IS NULL. > With off, warn or error, do not do so and be silent, issue a warning or an > error. > The standard behavior is not to perform this transformation. > Default is warn. > """ > > Or anything in better English. I've changed this, and hope it's clearer. > * Test > > +select 1=null; > + ?column? > > Maybe use as to describe the expected behavior, so that it is easier to > check, and I think that "?column?" looks silly eg: > > select 1=null as expect_{true,false}[_and_a_warning/error]; Changed to more descriptive headers. >create table cnullchild (check (f1 = 1 or f1 = null)) > inherits(cnullparent); > +WARNING: = NULL can only produce a NULL > > I'm not sure of this test case. Should it be turned into "is null"? This was just adjusting the existing test output to account for the new warning. I presume it was phrased that way for a reason. > Maybe the behavior could be retested after the reset? > > * Documentation: > > The "error" value is not documented. > > More generally, The value is said to be an enum, but the list of values is > not listed anywhere in the documentation. > > ISTM that the doc would be clearer if for each of the four values of the > parameter the behavior is spelled out. > > Maybe "warn" in the doc should be warn or something. Fixed. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From 33b9f51b1374ede71a6479d44adfa0588c7fb4c3 Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 15 Jul 2018 16:11:08 -0700 Subject: [PATCH] Make foo=null a warning by default v2 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.17.1" This is a multi-part message in MIME format. --2.17.1 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit The transform_null_equals GUC is now an enum consisting of warn, error, off, on. 
--- doc/src/sgml/config.sgml | 20 ++--- src/backend/parser/parse_expr.c | 30 ++--- src/backend/utils/misc/guc.c | 44 +-- src/backend/utils/misc/postgresql.conf.sample | 2 +- src/include/parser/parse_expr.h | 11 - src/test/regress/expected/guc.out | 30 + src/test/regress/expected/inherit.out | 1 + src/test/regress/sql/guc.sql | 18 8 files changed, 129 insertions(+), 27 deletions(-) --2.17.1 Content-Type: text/x-patch; name="0001-Make-foo-null-a-warning-by-default.patch" Content-Transfer-Encoding: 8bit Content-Disposition: attachment; filename="0001-Make-foo-null-a-warning-by-default.patch" diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e307bb4e8e..626a2d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -8036,7 +8036,7 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' - transform_null_equals (boolean) + transform_null_equals (enum) IS NULL transform_null_equals configuration parameter @@ -8044,15 +8044,23 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' -When on,
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra wrote: > Overall, I think it's clear the main risk associated with this patch is the > decode group code - it touches PROC entries, so a bug may cause trouble > pretty easily. So I've focused on this part, for now. I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. In particular: - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes it interruptible, then it's possible for the abort to complete before the decoding processes have aborted. If that can happen, then this whole mechanism is completely pointless, because it fails to actually achieve the guarantee which is its central goal. On the other hand, if you don't make this abort interruptible, then you are significantly increase the risk that a backend could get stuck in the abort path for an unbounded period of time. If the aborting backend holds any significant resources at this point, such as heavyweight locks, then you risk creating a deadlock that cannot be broken until the decoding process manages to abort, and if that process is involved in the deadlock, then you risk creating an unbreakable deadlock. - BackendXidGetProc() seems to be called in multiple places without any lock held. I don't see how that can be safe, because AFAICS it must inevitably introduce a race condition: the answer can change after that value is returned but before it is used. There's a bunch of recheck logic that looks like it is trying to cope with this problem, but I'm not sure it's very solid. For example, AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding any lock; we have historically avoided assuming that pointer-width reads cannot be torn. (We have assumed this only for 4-byte reads or narrower.) There are no comments about the locking hazards here, and no real explanation of how the recheck algorithm tries to patch things up: + leader = BackendXidGetProc(xid); + if (!leader || leader != proc) + { + LWLockRelease(leader_lwlock); + return NULL; + } Can be non-NULL yet unequal to proc? I don't understand how that can happen: surely once the PGPROC that has that XID aborts, the same XID can't possibly be assigned to a different PGPROC. - The code for releasing PGPROCs in ProcKill looks completely unsafe to me. With locking groups for parallel query, a process always enters a lock group of its own volition. It can safely use (MyProc->lockGroupLeader != NULL) as a race-free test because no other process can modify that value. But in this implementation of decoding groups, one process can put another process into a decoding group, which means this test has a race condition. If there's some reason this is safe, the comments sure don't explain it. I don't want to overplay my hand, but I think this code is a very long way from being committable, and I am concerned that the fundamental approach of blocking transaction aborts may be unsalvageably broken or at least exceedingly dangerous. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Another usability issue with our TAP tests
Since "make check-world" is rather chatty, if you get a failure while running it under high parallelism, the location of the failure has often scrolled off the terminal window by the time all the other subjobs exit. This is not a huge problem for tests using our traditional infrastructure, because you can just run "git status" to look for regression.diffs files. But a TAP test failure leaves nothing behind that git will consider unusual. I've repeatedly had to run check-world with no parallelism (wasting many minutes) in order to locate which test actually failed. I'm not sure about a good way to improve this. One idea that comes to mind is to tweak the "make check" rules so that the tmp_check subdirectories are automatically deleted on successful completion, but not on failure, and then remove tmp_check from the .gitignore lists. But the trouble with that is sometimes you want to look at the test logs afterwards, even when make thought the test succeeded. Ideas? regards, tom lane
Re: [HACKERS] logical decoding of two-phase transactions
Robert Haas writes: > I agree. As a general statement, I think the idea of trying to > prevent transactions from aborting is really scary. It's almost an > axiom of the system that we're always allowed to abort, and I think > there could be a lot of unintended and difficult-to-fix consequences > of undermining that guarantee. I think it will be very difficult to > create a sound system for delaying transactions, and I doubt very much > that the proposed system is sound. Ugh, is this patch really dependent on such a thing? TBH, I think the odds of making that work are indistinguishable from zero; and even if you managed to commit something that did work at the instant you committed it, the odds that it would stay working in the face of later system changes are exactly zero. I would reject this idea out of hand. regards, tom lane
Re: GiST VACUUM
> On 16 July 2018, at 18:58, Robert Haas wrote: > > On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas wrote: >> I'm still a bit scared about using pd_prune_xid to store the XID that >> prevents recycling the page too early. Can we use some field in >> GISTPageOpaqueData for that, similar to how the B-tree stores it in >> BTPageOpaqueData? > > What's your reason for being scared? It seems to me that the > alternatives being proposed (altering the size of the special space, > or sometimes repurposing a field for some other purpose) have their > own associated scariness. Thanks, that's exactly the question I've been thinking about: where to store this xid. Here's v9 of the patch; it uses pd_prune_xid, but it is abstracted behind GistPageGetDeleteXid() / GistPageSetDeleteXid() so that we can change the way we store it easily. Best regards, Andrey Borodin. 0002-Physical-GiST-scan-during-VACUUM-v9.patch Description: Binary data 0001-Delete-pages-during-GiST-VACUUM-v9.patch Description: Binary data
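A guess at the shape such an abstraction could take (the actual patch may well define these differently): a pair of thin macros over the page header field.

#include "postgres.h"
#include "storage/bufpage.h"

/* Hypothetical sketch of the accessors: wrappers over the page header's
 * pd_prune_xid field, so callers never touch the field directly and the
 * storage location can be moved later without changing them. */
#define GistPageGetDeleteXid(page) \
    (((PageHeader) (page))->pd_prune_xid)
#define GistPageSetDeleteXid(page, xid) \
    (((PageHeader) (page))->pd_prune_xid = (xid))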
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 06:15 PM, Robert Haas wrote: On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra wrote: Overall, I think it's clear the main risk associated with this patch is the decode group code - it touches PROC entries, so a bug may cause trouble pretty easily. So I've focused on this part, for now. I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. In particular: - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes it interruptible, then it's possible for the abort to complete before the decoding processes have aborted. If that can happen, then this whole mechanism is completely pointless, because it fails to actually achieve the guarantee which is its central goal. On the other hand, if you don't make this abort interruptible, then you are significantly increase the risk that a backend could get stuck in the abort path for an unbounded period of time. If the aborting backend holds any significant resources at this point, such as heavyweight locks, then you risk creating a deadlock that cannot be broken until the decoding process manages to abort, and if that process is involved in the deadlock, then you risk creating an unbreakable deadlock. I'm not sure I understand. Are you suggesting the process might get killed or something, thanks to the CHECK_FOR_INTERRUPTS() call? - BackendXidGetProc() seems to be called in multiple places without any lock held. I don't see how that can be safe, because AFAICS it must inevitably introduce a race condition: the answer can change after that value is returned but before it is used. There's a bunch of recheck logic that looks like it is trying to cope with this problem, but I'm not sure it's very solid. But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's true there are a few places where we do != NULL checks on the result without holding any lock, but I don't see why that would be a problem? And before actually inspecting the contents, the code always does LockHashPartitionLockByProc. But I certainly agree this would deserve comments explaining why this (lack of) locking is safe. (The goal why it's done this way is clearly an attempt to acquire the lock as infrequently as possible, in an effort to minimize the overhead.) For example, AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding any lock; we have historically avoided assuming that pointer-width reads cannot be torn. (We have assumed this only for 4-byte reads or narrower.) There are no comments about the locking hazards here, and no real explanation of how the recheck algorithm tries to patch things up: + leader = BackendXidGetProc(xid); + if (!leader || leader != proc) + { + LWLockRelease(leader_lwlock); + return NULL; + } Can be non-NULL yet unequal to proc? I don't understand how that can happen: surely once the PGPROC that has that XID aborts, the same XID can't possibly be assigned to a different PGPROC. Yeah. I have the same question. - The code for releasing PGPROCs in ProcKill looks completely unsafe to me. With locking groups for parallel query, a process always enters a lock group of its own volition. 
It can safely use (MyProc->lockGroupLeader != NULL) as a race-free test because no other process can modify that value. But in this implementation of decoding groups, one process can put another process into a decoding group, which means this test has a race condition. If there's some reason this is safe, the comments sure don't explain it. I don't follow. How could one process put another process into a decoding group? I don't think that's possible. I don't want to overplay my hand, but I think this code is a very long way from being committable, and I am concerned that the fundamental approach of blocking transaction aborts may be unsalvageably broken or at least exceedingly dangerous. I'm not sure about the 'unsalvageable' part, but it needs more work, that's for sure. Unfortunately, all previous attempts to make this work in various other ways failed (see past discussions in this thread), so this is the only approach left :-( So let's see if we can make it work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 07:21 PM, Tom Lane wrote: Robert Haas writes: I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. Ugh, is this patch really dependent on such a thing? Unfortunately it does :-( Without it the decoding (or output plugins) may see catalogs broken in various ways - the catalog records may get vacuumed, HOT chains are broken, ... There were attempts to change that part, but that seems an order of magnitude more invasive than this. TBH, I think the odds of making that work are indistinguishable from zero; and even if you managed to commit something that did work at the instant you committed it, the odds that it would stay working in the face of later system changes are exactly zero. I would reject this idea out of hand. Why? How is this significantly different from other patches touching ProcArray and related bits? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] logical decoding of two-phase transactions
On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra wrote: > I'm not sure I understand. Are you suggesting the process might get killed > or something, thanks to the CHECK_FOR_INTERRUPTS() call? Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local transfer of control. > But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's > true there are a few places where we do != NULL checks on the result without > holding any lock, but I don't see why that would be a problem? And before > actually inspecting the contents, the code always does > LockHashPartitionLockByProc. I think at least some of those cases are a problem. See below... > I don't follow. How could one process put another process into a decoding > group? I don't think that's possible. Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks up the process that currently has that XID, then turns that process into a decode group leader. Then after that function returns, the caller adds itself to the decode group as well. So it seems entirely possible for somebody to swing the decodeGroupLeader pointer for a PGPROC from NULL to some other value at an arbitrary point in time. > I'm not sure about the 'unsalvageable' part, but it needs more work, that's > for sure. Unfortunately, all previous attempts to make this work in various > other ways failed (see past discussions in this thread), so this is the only > approach left :-( So let's see if we can make it work. I think that's probably not going to work out, but of course it's up to you how you want to spend your time! After thinking about it a bit more, if you want to try to stick with this design, I don't think that this decode group leader/members thing has much to recommend it. In the case of parallel query, the point of the lock group stuff is to treat all of those processes as one for purposes of heavyweight lock acquisition. There's no similar need here, so the design that makes sure the "leader" is in the list of processes that are members of the "group" is, AFAICS, just wasted code. All you really need is a list of processes hung off of the PGPROC that must abort before the leader is allowed to abort; the leader itself doesn't need to be in the list, and there's no need to consider it as a "group". It's just a list of waiters. That having been said, I still don't see how that's really going to work. Just to take one example, suppose that the leader is trying to ERROR out, and the decoding workers are blocked waiting for a lock held by the leader. The system has no way of detecting this deadlock and resolving it automatically, which certainly seems unacceptable. The only way that's going to work is if the leader waits for the worker by trying to acquire a lock held by the worker. Then the deadlock detector would know to abort some transaction. But that doesn't really work either - the deadlock was created by the foreground process trying to abort, and if the deadlock detector chooses that process as its victim, what then? We're already trying to abort, and the abort code isn't supposed to throw further errors, or fail in any way, lest we break all kinds of other things. Not to mention the fact that running the deadlock detector in the abort path isn't really safe to begin with, again because we can't throw errors when we're already in an abort path. If we're only ever talking about decoding prepared transactions, we could probably work around all of these problems: have the decoding process take a heavyweight lock before it begins decoding. 
Have a process that wants to execute ROLLBACK PREPARED take a conflicting heavyweight lock on the same object. The net effect would be that ROLLBACK PREPARED would simply wait for decoding to finish. That might be rather lousy from a latency point of view since the transaction could take an arbitrarily long time to decode, but it seems safe enough. Possibly you could also design a mechanism for the ROLLBACK PREPARED command to SIGTERM the processes that are blocking its lock acquisition, if they are decoding processes. The difference between this and what the current patch is doing is that nothing complex or fragile is happening in the abort pathway itself. The complicated stuff in both the worker and in the main backend happens while the transaction is still good and can still be rolled back at need. This kind of approach won't work if you want to decode transactions that aren't yet prepared, so if that is the long term goal then we need to think harder. I'm honestly not sure that problem has any reasonable solution. The assumption that a running process can abort at any time is deeply baked into many parts of the system and for good reasons. Trying to undo that is going to be like trying to push water up a hill. I think we need to install interlocks in such a way that any waiting happens before we enter the abort path, not while we're actually trying to perform the abort. But I don't
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 10:34 AM, Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: On 13/07/18 01:39, Andrew Dunstan wrote: On 07/12/2018 06:34 PM, Alvaro Herrera wrote: On 2018-Jul-12, Andrew Dunstan wrote: I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. Well, I understood Claudio was going to do some more work (see upthread). Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html That page says: M_MMAP_MAX This parameter specifies the maximum number of allocation requests that may be simultaneously serviced using mmap(2). This parameter exists because some systems have a limited number of internal tables for use by mmap(2), and using more than a few of them may degrade performance. The default value is 65,536, a value which has no special significance and which serves only as a safeguard. Setting this parameter to 0 disables the use of mmap(2) for servicing large allocation requests. I'm confused about the relevance. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan wrote: > > > > On 07/16/2018 10:34 AM, Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > > wrote: > >> > >> > >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > >>> On 13/07/18 01:39, Andrew Dunstan wrote: > On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > On 2018-Jul-12, Andrew Dunstan wrote: > > > >> I fully understand. I think this needs to go back to "Waiting on > >> Author". > > Why? Heikki's patch applies fine and passes the regression tests. > Well, I understood Claudio was going to do some more work (see > upthread). > >>> Claudio raised a good point, that doing small pallocs leads to > >>> fragmentation, and in particular, it might mean that we can't give > >>> back the memory to the OS. The default glibc malloc() implementation > >>> has a threshold of 4 or 32 MB or something like that - allocations > >>> larger than the threshold are mmap()'d, and can always be returned to > >>> the OS. I think a simple solution to that is to allocate larger > >>> chunks, something like 32-64 MB at a time, and carve out the > >>> allocations for the nodes from those chunks. That's pretty > >>> straightforward, because we don't need to worry about freeing the > >>> nodes in retail. Keep track of the current half-filled chunk, and > >>> allocate a new one when it fills up. > >> > >> Google seems to suggest the default threshold is much lower, like 128K. > >> Still, making larger allocations seems sensible. Are you going to work > >> on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > > the dynamic threshold is set to the size of the freed block. > > > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. > > > > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html > > > That page says: > > M_MMAP_MAX >This parameter specifies the maximum number of allocation >requests that may be simultaneously serviced using mmap(2). >This parameter exists because some systems have a limited >number of internal tables for use by mmap(2), and using more >than a few of them may degrade performance. > >The default value is 65,536, a value which has no special >significance and which serves only as a safeguard. Setting >this parameter to 0 disables the use of mmap(2) for servicing >large allocation requests. > > > I'm confused about the relevance. It isn't relevant. See my next message, it should have read DEFAULT_MMAP_THRESHOLD_MAX.
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 11:35 AM, Claudio Freire wrote: On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: On 13/07/18 01:39, Andrew Dunstan wrote: On 07/12/2018 06:34 PM, Alvaro Herrera wrote: On 2018-Jul-12, Andrew Dunstan wrote: I fully understand. I think this needs to go back to "Waiting on Author". Why? Heikki's patch applies fine and passes the regression tests. Well, I understood Claudio was going to do some more work (see upthread). Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up. Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else. Ah, ok. Thanks. ignore the email I just sent about that. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
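To make the chunking idea from this thread concrete, here is a minimal standalone sketch of "carve node allocations out of large blocks": grab memory in blocks well above the mmap threshold so glibc can hand them back to the OS, never free individual nodes, and release everything at once. The sizes and names are made up for illustration and are not taken from any posted patch.

/*
 * Sketch of a bump allocator: allocate 32 MB chunks (large enough to be
 * mmap()'d by glibc) and carve node-sized pieces out of the current chunk.
 * Individual nodes are never freed; chunk_reset() releases everything.
 */
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE (32UL * 1024 * 1024)	/* illustrative; above the mmap threshold */

typedef struct Chunk
{
	struct Chunk *next;		/* linked list of all chunks, for bulk free */
	size_t		used;		/* bytes already carved out of data[] */
	char		data[];		/* the allocation arena */
} Chunk;

static Chunk *current_chunk = NULL;

/* Hand out one node of 'size' bytes; assumes size is much smaller than CHUNK_SIZE. */
static void *
chunk_alloc(size_t size)
{
	void	   *p;

	size = (size + 7) & ~(size_t) 7;	/* keep 8-byte alignment */

	if (current_chunk == NULL || current_chunk->used + size > CHUNK_SIZE)
	{
		Chunk	   *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

		if (c == NULL)
			return NULL;
		c->next = current_chunk;
		c->used = 0;
		current_chunk = c;
	}

	p = current_chunk->data + current_chunk->used;
	current_chunk->used += size;
	return p;
}

/* Free every chunk at once, returning the memory to the OS. */
static void
chunk_reset(void)
{
	while (current_chunk)
	{
		Chunk	   *next = current_chunk->next;

		free(current_chunk);
		current_chunk = next;
	}
}

In the vacuum case the same pattern would sit behind palloc-style wrappers inside a memory context, but the behaviour with respect to glibc's mmap threshold is the same.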
Re: [HACKERS] WAL logging problem in 9.4.3?
On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas wrote: > Doesn't have to be a trigger, could be a CHECK constraint, datatype input > function, etc. Admittedly, having a datatype input function that inserts to > the table is worth a "huh?", but I'm feeling very confident that we can > catch all such cases, and some of them might even be sensible. Is this sentence missing a "not"? i.e. "I'm not feeling very confident"? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WAL logging problem in 9.4.3?
On 16 July 2018 21:38:39 EEST, Robert Haas wrote: >On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas >wrote: >> Doesn't have to be a trigger, could be a CHECK constraint, datatype >input >> function, etc. Admittedly, having a datatype input function that >inserts to >> the table is worth a "huh?", but I'm feeling very confident that we >can >> catch all such cases, and some of them might even be sensible. > >Is this sentence missing a "not"? i.e. "I'm not feeling very >confident"? Yes, sorry. - Heikki
Alter index rename concurrently to
Dear hackers! I have an idea to facilitate work with index rebuilding. Usually if we want to rebuild an index without table locks we should do the queries below: 1. create index concurrently... (with different name on the same columns) 2. drop index concurrently 3. alter index rename to As you can see, only the last query in this simple algorithm locks table. Commonly this steps are included to a more complex script and run by cron or manually. When you have high load databases, maybe some problems appear: 1. it’s dangerous and unacceptable for production to wait unknown number of minutes or even hours while a table is locked 2. we must write a complex script that sets statement timeout to rename and implement several checks there to prevent stopping of production at nights 3. it’s uncomfortable to find indexes that couldn’t be renamed during script executing Then we must execute renaming manually and interrupt it again and again if it can’t execute in allowable time. I made a patch to solve this issue (see the attachment). It allows to avoid locks by a query like this: “alter index rename CONCURRENTLY to ”. Briefly, I did it by using similar way in the index_create() and index_drop() functions that set ShareUpdateExclusiveLock instead of AccessShareLock when the structure DropStmt/CreateStmt has “true” in the stmt->concurrent field. (In other words, when it "see" "concurretly" in query). In all other cases (alter table, sequence, etc) I initialized this field as “false” respectively. I ran it when another transactions to the same table are started but not finished. Also I used a script that made SELECT\DML queries in a loop to that test tables and "ALTER INDEX ... RENAME CONCURRENTLY TO ..." works without any errors and indexes were valid after renaming. Maybe it’s interesting for someone. I’ll be very thankful for any ideas! -- Kind regards, Andrew Klychkov# Author: Andrew Klychkov # 16-07-2018 diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index 482d463..47e20d5 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -1658,14 +1658,14 @@ finish_heap_swap(Oid OIDOldHeap, Oid OIDNewHeap, snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u", OIDOldHeap); RenameRelationInternal(newrel->rd_rel->reltoastrelid, - NewToastName, true); + NewToastName, true, false); /* ... and its valid index too. */ snprintf(NewToastName, NAMEDATALEN, "pg_toast_%u_index", OIDOldHeap); RenameRelationInternal(toastidx, - NewToastName, true); + NewToastName, true, false); } relation_close(newrel, NoLock); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 22e81e7..ead4f26 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3012,7 +3012,7 @@ rename_constraint_internal(Oid myrelid, || con->contype == CONSTRAINT_UNIQUE || con->contype == CONSTRAINT_EXCLUSION)) /* rename the index; this renames the constraint as well */ - RenameRelationInternal(con->conindid, newconname, false); + RenameRelationInternal(con->conindid, newconname, false, false); else RenameConstraintById(constraintOid, newconname); @@ -3082,7 +3082,9 @@ RenameRelation(RenameStmt *stmt) { Oid relid; ObjectAddress address; +LOCKMODElockmode; +lockmode = stmt->concurrent ? 
ShareUpdateExclusiveLock : AccessExclusiveLock; /* * Grab an exclusive lock on the target table, index, sequence, view, * materialized view, or foreign table, which we will NOT release until @@ -3091,7 +3093,7 @@ RenameRelation(RenameStmt *stmt) * Lock level used here should match RenameRelationInternal, to avoid lock * escalation. */ - relid = RangeVarGetRelidExtended(stmt->relation, AccessExclusiveLock, + relid = RangeVarGetRelidExtended(stmt->relation, lockmode, stmt->missing_ok ? RVR_MISSING_OK : 0, RangeVarCallbackForAlterRelation, (void *) stmt); @@ -3105,7 +3107,7 @@ RenameRelation(RenameStmt *stmt) } /* Do the work */ - RenameRelationInternal(relid, stmt->newname, false); + RenameRelationInternal(relid, stmt->newname, false, stmt->concurrent); ObjectAddressSet(address, RelationRelationId, relid); @@ -3122,20 +3124,22 @@ RenameRelation(RenameStmt *stmt) * sequence, AFAIK there's no need for it to be there. */ void -RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal) +RenameRelationInternal(Oid myrelid, const char *newrelname, bool is_internal, bool concurrent) { Relation targetrelation; Relation relrelation; /* for RELATION relation */ HeapTuple reltup; Form_pg_class relform; Oid namespaceId; +LOCKMODElockmode; /* * Grab an exclusive lock on the target table, index, sequence, view, * materialized view, or foreign table, which we will NOT release until * end of transaction. */ - targetrelation = re
Re: Alter index rename concurrently to
Hi! > 16 Jul 2018, at 22:58, Andrey Klychkov wrote: > Dear hackers! > > I have an idea to facilitate work with index rebuilding. > > "ALTER INDEX ... RENAME CONCURRENTLY TO ..." The idea seems useful. I'm not an expert in CIC, but your post does not cover one very important topic. When we use CREATE INDEX CONCURRENTLY we pay for the less intrusive lock by scanning the data twice. What is the price of RENAME CONCURRENTLY? Should this mode be the default? Best regards, Andrey Borodin.
Re: [HACKERS] logical decoding of two-phase transactions
On 07/16/2018 08:09 PM, Robert Haas wrote: > On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra > wrote: >> I'm not sure I understand. Are you suggesting the process might get killed >> or something, thanks to the CHECK_FOR_INTERRUPTS() call? > > Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local > transfer of control. > >> But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's >> true there are a few places where we do != NULL checks on the result without >> holding any lock, but I don't see why that would be a problem? And before >> actually inspecting the contents, the code always does >> LockHashPartitionLockByProc. > > I think at least some of those cases are a problem. See below... > >> I don't follow. How could one process put another process into a decoding >> group? I don't think that's possible. > > Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks > up the process that currently has that XID, then turns that process > into a decode group leader. Then after that function returns, the > caller adds itself to the decode group as well. So it seems entirely > possible for somebody to swing the decodeGroupLeader pointer for a > PGPROC from NULL to some other value at an arbitrary point in time. > Oh, right, I forgot the patch also adds the leader into the group, for some reason (I agree it's unclear why that would be necessary, as you pointed out later). But all this is happening while holding the partition lock (in exclusive mode). And the decoding backends do synchronize on the lock correctly (although, man, the rechecks are confusing ...). But now I see ProcKill accesses decodeGroupLeader in multiple places, and only the first one is protected by the lock, for some reason (interestingly enough the one in lockGroupLeader block). Is that what you mean? FWIW I suspect the ProcKill part is borked due to incorrectly resolved merge conflict or something, per my initial response from today. >> I'm not sure about the 'unsalvageable' part, but it needs more work, that's >> for sure. Unfortunately, all previous attempts to make this work in various >> other ways failed (see past discussions in this thread), so this is the only >> approach left :-( So let's see if we can make it work. > > I think that's probably not going to work out, but of course it's up > to you how you want to spend your time! > Well, yeah. I'm sure I could think of more fun things to do, but OTOH I also have patches that require the capability to decode in-progress transactions. > After thinking about it a bit more, if you want to try to stick with > this design, I don't think that this decode group leader/members thing > has much to recommend it. In the case of parallel query, the point of > the lock group stuff is to treat all of those processes as one for > purposes of heavyweight lock acquisition. There's no similar need > here, so the design that makes sure the "leader" is in the list of > processes that are members of the "group" is, AFAICS, just wasted > code. All you really need is a list of processes hung off of the > PGPROC that must abort before the leader is allowed to abort; the > leader itself doesn't need to be in the list, and there's no need to > consider it as a "group". It's just a list of waiters. > But the way I understand it, it pretty much *is* a list of waiters, along with a couple of flags to allow the processes to notify the other side about lock/unlock/abort. It does resemble the lock groups, but I don't think it has the same goals. 
The thing is that the lock/unlock happens for each decoded change independently, and it'd be silly to modify the list all the time, so instead it just sets the decodeLocked flag to true/false. Similarly, when the leader decides to abort, it marks decodeAbortPending and waits for the decoding backends to complete. Of course, that's my understanding/interpretation, and perhaps Nikhil as a patch author has a better explanation. > That having been said, I still don't see how that's really going to > work. Just to take one example, suppose that the leader is trying to > ERROR out, and the decoding workers are blocked waiting for a lock > held by the leader. The system has no way of detecting this deadlock > and resolving it automatically, which certainly seems unacceptable. > The only way that's going to work is if the leader waits for the > worker by trying to acquire a lock held by the worker. Then the > deadlock detector would know to abort some transaction. But that > doesn't really work either - the deadlock was created by the > foreground process trying to abort, and if the deadlock detector > chooses that process as its victim, what then? We're already trying > to abort, and the abort code isn't supposed to throw further errors, > or fail in any way, lest we break all kinds of other things. Not to > mention the fact that running the deadlock detector in the abort path > isn't really safe to begin
Re: Alter index rename concurrently to
On Mon, 16 Jul 2018 at 21:58, Andrey Klychkov wrote: > I made a patch to solve this issue (see the attachment). > It allows to avoid locks by a query like this: > “alter index rename CONCURRENTLY to ”. > Please, have a look at previous discussions on the subject: - 2012 https://www.postgresql.org/message-id/cab7npqtys6juqdxuczbjb0bnw0kprw8wdzuk11kaxqq6o98...@mail.gmail.com - 2013 https://www.postgresql.org/message-id/cab7npqstfkuc7dzxcdx4hotu63yxhrronv2aouzyd-zz_8z...@mail.gmail.com - https://commitfest.postgresql.org/16/1276/ -- Victor Yegorov
Re: Fix some error handling for read() and errno
On 2018-Jul-16, Michael Paquier wrote: > On Sat, Jul 14, 2018 at 03:37:56PM +0900, Michael Paquier wrote: > > For now, I think that just moving forward with 0001, and then revisit > > 0002 once the other 2PC patch is settled makes the most sense. On the > > other thread, the current 2PC behavior can create silent data loss so > > I would like to back-patch it, so that would be less work. > > Are there any objections with this plan? If none, then I would like to > move on with 0001 as there is clearly a consensus to simplify the work > of translators and to clean up the error code paths for read() calls. > Let's sort out the rest after the 2PC code paths are addressed. No objection here -- incremental progress is better than none. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On 2018-Jul-12, Heikki Linnakangas wrote: > > > Thanks for the pointer. My tap test has been covering two out of > > > the three scenarios you have in your script. I have been able to > > > convert the extra as the attached, and I have added as well an > > > extra test with TRUNCATE triggers. So it seems to me that we want > > > to disable the optimization if any type of trigger are defined on > > > the relation copied to as it could be possible that these triggers > > > work on the blocks copied as well, for any BEFORE/AFTER and > > > STATEMENT/ROW triggers. What do you think? > > > > Yeah, this seems like the only sane approach. > > Doesn't have to be a trigger, could be a CHECK constraint, datatype > input function, etc. Admittedly, having a datatype input function that > inserts to the table is worth a "huh?", but I'm feeling very confident > that we can catch all such cases, and some of them might even be > sensible. A counterexample could be a JSON compression scheme that uses a catalog for a dictionary of keys. Hasn't this been described already? Also not completely out of the question for GIS data, I think. (Not sure if PostGIS does this already.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: New GUC to sample log queries
On 07/16/2018 05:24 PM, Robert Haas wrote: > On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing > wrote: >> Hmm. Not sure if that last word should be _sample, _sampling, _rate, or >> a combination of those. > > +1 for rate or sample_rate. I think "sample" or "sampling" without > "rate" will not be very clear. > Thanks, After reading this mail: https://www.postgresql.org/message-id/20180710183828.GB3890%40telsasoft.com I wonder if this GUC could be used for these logging methods? log_statement_stats log_parser_stats log_planner_stats log_executor_stats -- Adrien
Re: ENOSPC FailedAssertion("!(RefCountErrors == 0)"
I wrote: > So I said I didn't want to do extra work on this, but I am looking into > fixing it by having these aux process types run a ResourceOwner that can > be told to clean up any open buffer pins at exit. We could be sure the > coverage is complete by dint of removing the special-case code in > resowner.c that allows buffer pins to be taken with no active resowner. > Then CheckForBufferLeaks can be left as-is, ie something we do only > in assert builds. That turned out to be a larger can of worms than I'd anticipated, as it soon emerged that we'd acquired a whole lot of cargo-cult programming around ResourceOwners. Having a ResourceOwner in itself does nothing for you: there has to be code someplace that ensures we'll call ResourceOwnerRelease at an appropriate time. There was basically noplace outside xact.c and portalmem.c that got this completely right. bgwriter.c and a couple of other places at least tried a little, by doing a ResourceOwnerRelease in their sigsetjmp-catching stanzas, but that didn't account for the process-exit code path. Other places just created a ResourceOwner and did *nothing* about cleaning it up. I decided that the most expedient way of dealing with this was to create a self-contained facility in resowner.c that would create a standalone ResourceOwner and register a shmem-exit callback to clean it up. That way it's easier (less code) to do it right than to do it wrong. That led to the attached, which passes check-world as well as Michael's full-disk test script. I'm mostly pretty happy with this, but I think there are a couple of loose ends in logicalfuncs.c and slotfuncs.c: those are creating non-standalone ResourceOwners (children of whatever the active ResourceOwner is) and doing nothing much to clean them up. That seems pretty bogus. It's not a permanent resource leak, because sooner or later the parent ResourceOwner will get cleaned up and that will recurse to the child ... but then why bother with a child ResourceOwner at all? I added asserts to these calls to verify that there was a parent resowner (if there isn't, the code is just broken), but I think that we should either add more code to clean up the child resowner promptly, or just not bother with a child at all. Not very sure what to do with this. I don't think we should try to backpatch it, because it's essentially changing the API requirements around buffer access in auxiliary processes --- but it might not be too late to squeeze it into v11. It is a bug fix, in that the current code can leak buffer pins in some code paths and we won't notice in non-assert builds; but we've not really seen any field reports of that happening, so I'm not sure how important this is. regards, tom lane diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index 4e32cff..3a5cd0e 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -1221,8 +1221,7 @@ ParallelWorkerMain(Datum main_arg) memcpy(&ParallelWorkerNumber, MyBgworkerEntry->bgw_extra, sizeof(int)); /* Set up a memory context and resource owner. 
*/ - Assert(CurrentResourceOwner == NULL); - CurrentResourceOwner = ResourceOwnerCreate(NULL, "parallel toplevel"); + CreateAuxProcessResourceOwner(); CurrentMemoryContext = AllocSetContextCreate(TopMemoryContext, "Parallel worker", ALLOCSET_DEFAULT_SIZES); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4049deb..5d01edd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6356,6 +6356,15 @@ StartupXLOG(void) struct stat st; /* + * We should have an aux process resource owner to use, and we should not + * be in a transaction that's installed some other resowner. + */ + Assert(AuxProcessResourceOwner != NULL); + Assert(CurrentResourceOwner == NULL || + CurrentResourceOwner == AuxProcessResourceOwner); + CurrentResourceOwner = AuxProcessResourceOwner; + + /* * Verify XLOG status looks valid. */ if (ControlFile->state < DB_SHUTDOWNED || @@ -8467,6 +8476,15 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch) void ShutdownXLOG(int code, Datum arg) { + /* + * We should have an aux process resource owner to use, and we should not + * be in a transaction that's installed some other resowner. + */ + Assert(AuxProcessResourceOwner != NULL); + Assert(CurrentResourceOwner == NULL || + CurrentResourceOwner == AuxProcessResourceOwner); + CurrentResourceOwner = AuxProcessResourceOwner; + /* Don't be chatty in standalone mode */ ereport(IsPostmasterEnvironment ? LOG : NOTICE, (errmsg("shutting down"))); diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index 7e34bee..cdd71a9 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -403,6 +403,13 @@ AuxiliaryProcessMain(int argc, char *argv[]) /* finish settin
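For readers who haven't poked at resowner.c, the "standalone owner plus exit callback" pattern described above boils down to something like the sketch below. This is not the committed code (the real facility is CreateAuxProcessResourceOwner() and friends in resowner.c); the function names here are invented, but the APIs used (ResourceOwnerCreate, ResourceOwnerRelease, on_shmem_exit) are the existing ones.

/*
 * Sketch only: give an auxiliary process a standalone ResourceOwner and
 * make sure it is released (dropping any buffer pins) at process exit.
 */
#include "postgres.h"
#include "storage/ipc.h"
#include "utils/resowner.h"

static ResourceOwner MyAuxResourceOwner = NULL;

/* shmem-exit callback: run the usual three release phases. */
static void
ReleaseMyAuxResourceOwner(int code, Datum arg)
{
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_LOCKS, false, true);
	ResourceOwnerRelease(MyAuxResourceOwner,
						 RESOURCE_RELEASE_AFTER_LOCKS, false, true);
}

static void
CreateMyAuxResourceOwner(void)
{
	Assert(MyAuxResourceOwner == NULL);
	MyAuxResourceOwner = ResourceOwnerCreate(NULL, "aux process (sketch)");
	CurrentResourceOwner = MyAuxResourceOwner;
	on_shmem_exit(ReleaseMyAuxResourceOwner, (Datum) 0);
}

Because the release is registered as a shmem-exit callback, the process-exit path is covered without each auxiliary process having to remember to clean up, which is exactly the gap described in the mail above.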
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 07/16/2018 02:54 PM, Dean Rasheed wrote: > On 16 July 2018 at 13:23, Tomas Vondra wrote: The top-level clauses allow us to make such deductions, with deeper clauses it's much more difficult (perhaps impossible). Because for example with (a=1 AND b=1) there can be just a single match, so if we find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR b=2)) it's not that simple, because there may be multiple combinations and so a match in MCV does not guarantee anything. >>> >>> Actually, it guarantees a lower bound on the overall selectivity, and >>> maybe that's the best that we can do in the absence of any other >>> stats. >>> >> Hmmm, is that actually true? Let's consider a simple example, with two >> columns, each with just 2 values, and a "perfect" MCV list: >> >> a | b | frequency >>--- >> 1 | 1 | 0.5 >> 2 | 2 | 0.5 >> >> And let's estimate sel(a=1 & b=2). > > OK.In this case, there are no MCV matches, so there is no lower bound (it's > 0). > > What we could do though is also impose an upper bound, based on the > sum of non-matching MCV frequencies. In this case, the upper bound is > also 0, so we could actually say the resulting selectivity is 0. > Hmmm, it's not very clear to me how would we decide which of these cases applies, because in most cases we don't have MCV covering 100% rows. Anyways, I've been thinking about how to modify the code to wort the way you proposed (in a way sufficient for a PoC). But after struggling with it for a while it occurred to me it might be useful to do it on paper first, to verify how would it work on your examples. So let's use this data create table foo(a int, b int); insert into foo select 1,1 from generate_series(1,5); insert into foo select 1,2 from generate_series(1,4); insert into foo select 1,x/10 from generate_series(30,25) g(x); insert into foo select 2,1 from generate_series(1,3); insert into foo select 2,2 from generate_series(1,2); insert into foo select 2,x/10 from generate_series(30,50) g(x); insert into foo select 3,1 from generate_series(1,1); insert into foo select 3,2 from generate_series(1,5000); insert into foo select 3,x from generate_series(3,60) g(x); insert into foo select x,x/10 from generate_series(4,75) g(x); Assuming we have perfectly exact statistics, we have these MCV lists (both univariate and multivariate): select a, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a order by 2 desc; a| count | frequency ++--- 3 | 614998 |0.2727 2 | 549971 |0.2439 1 | 339971 |0.1508 1014 | 1 |0. 57220 | 1 |0. ... select b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by b order by 2 desc; b| count | frequency +---+--- 1 | 90010 |0.0399 2 | 65010 |0.0288 3 |31 |0. 7 |31 |0. ... select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency from foo group by a, b order by 3 desc; a| b| count | frequency ++---+--- 1 | 1 | 5 |0.0222 1 | 2 | 4 |0.0177 2 | 1 | 3 |0.0133 2 | 2 | 2 |0.0089 3 | 1 | 1 |0.0044 3 | 2 | 5000 |0.0022 2 | 12445 |10 |0. ... And let's estimate the two queries with complex clauses, where the multivariate stats gave 2x overestimates. 
SELECT * FROM foo WHERE a=1 and (b=1 or b=2); -- actual 9, univariate: 24253, multivariate: 181091 univariate: sel(a=1) = 0.1508 sel(b=1) = 0.0399 sel(b=2) = 0.0288 sel(b=1 or b=2) = 0.0673 multivariate: sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770) The second multivariate estimate comes from assuming only the first 5 items make it to the multivariate MCV list (covering 6.87% of the data) and extrapolating the selectivity to the non-MCV data too. (Notice it's about 2x the actual selectivity, so this extrapolation due to not realizing the MCV already contains all the matches is pretty much responsible for the whole over-estimate). So, how would the proposed algorithm work? Let's start with "a=1": sel(a=1) = 0.1508 I don't see much point in applying the two "b" clauses independently (or how would it be done, as it's effectively a single clause): sel(b=1 or b=2) = 0.0673 And we get total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101 >From the multivariate MCV we get mcv_sel = 0.0399 And finally total_sel = Max(total_sel, mcv_sel) = 0.0399 Which is great, but I don't see how that actually helped anything? We still only have the estimate for the ~7% covered by the MCV list, and not the remaining non-MCV part. I could do the same thing for the second qu
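For what it's worth, the clamping Dean proposes can be written down in a few lines. The helper below is only a toy illustration of the bounds involved (matching MCV frequency as a lower bound, matching frequency plus the non-MCV remainder as an upper bound); it is not code from the patch, which works with per-item match bitmaps.

/*
 * Toy illustration: clamp an independence-based selectivity estimate using
 * a multivariate MCV list.  mcv_match_freq is the summed frequency of MCV
 * items matching the whole clause list, mcv_total_freq the summed frequency
 * of all MCV items.
 */
static double
clamp_with_mcv(double independent_sel,	/* product of per-clause selectivities */
			   double mcv_match_freq,
			   double mcv_total_freq)
{
	double		lower = mcv_match_freq;
	double		upper = mcv_match_freq + (1.0 - mcv_total_freq);

	if (independent_sel < lower)
		independent_sel = lower;
	if (independent_sel > upper)
		independent_sel = upper;

	return independent_sel;
}

In the "perfect MCV" example above (mcv_total_freq = 1.0 and no matching items) both bounds collapse to 0, which is how the upper bound recovers the right answer for sel(a=1 and b=2).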
Re: New GUC to sample log queries
On 07/16/2018 05:24 PM, Robert Haas wrote: > On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing > wrote: >> Hmm. Not sure if that last word should be _sample, _sampling, _rate, or >> a combination of those. > > +1 for rate or sample_rate. I think "sample" or "sampling" without > "rate" will not be very clear. > +1 to sample_rate. It's what auto_explain and pgbench use, so let's keep the naming consistent. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: AtEOXact_ApplyLauncher() and subtransactions
On Mon, Jul 16, 2018 at 2:36 AM, Amit Khandekar wrote: > 0001 patch contains the main fix. In this patch I have used some > naming conventions and some comments that you used in your patch, > plus, I used your method of lazily allocating new stack level. The > stack is initially Null. Committed and back-patched to v11 and v10. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: patch to allow disable of WAL recycling
There have been quite a few comments since last week, so at this point I am uncertain how to proceed with this change. I don't think I saw anything concrete in the recent emails that I can act upon. I would like to respond to the comment about trying to "self-tune" the behavior based on inferences made about caching during setup. I can't speak for many other filesystems, but for ZFS, the ARC size is not fixed and will vary based on the memory demands against the machine. Also, what files are cached will vary based upon the workloads running on the machine. Thus, I do not think there is a valid way to make inferences about future caching behavior based upon a point-in-time observation. I am still happy to update the man pages to explain the new tunable better if that is acceptable. Thanks, Jerry On Sun, Jul 15, 2018 at 6:32 PM, Robert Haas wrote: > On Thu, Jul 5, 2018 at 4:39 PM, Andres Freund wrote: > > This is formulated *WAY* too positive. It'll have dramatic *NEGATIVE* > > performance impact of non COW filesystems, and very likely even negative > > impacts in a number of COWed scenarios (when there's enough memory to > > keep all WAL files in memory). > > > > I still think that fixing this another way would be preferrable. This'll > > be too much of a magic knob that depends on the fs, hardware and > > workload. > > I tend to agree with you, but unless we have a pretty good idea what > that other way would be, I think we should probably accept the patch. > > Could we somehow make this self-tuning? On any given > filesystem/hardware/workload, either creating a new 16MB file is > faster, or recycling an old file is faster. If the old file is still > cached, recycling it figures to win on almost any hardware. If not, > it seems like something of a toss-up. I suppose we could try to keep > a running average of how long it is taking us to recycle WAL files and > how long it is taking us to create new ones; if we do each one of > those things at least sometimes, then we'll eventually get an idea of > which one is quicker. But it's not clear to me that such data would > be very reliable unless we tried to make sure that we tried both > things fairly regularly under circumstances where we could have chosen > to do the other one. > > I think part of the problem here is that whether a WAL segment is > likely to be cached depends on a host of factors which we don't track > very carefully, like whether it's been streamed or decoded recently. > If we knew when that a particular WAL segment hadn't been accessed for > any purpose in 10+ minutes, it would probably be fairly safe to guess > that it's no longer in cache; if we knew that it had been accessed <15 > seconds ago, that it is probably still in cache. But we have no idea. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
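For what Robert sketches in the quoted mail, the self-tuning could in principle be as simple as keeping running averages of the two costs and preferring the cheaper path. The snippet below is purely illustrative (no such code exists in any posted patch), and Jerry's objection still applies: a point-in-time measurement may say little about future cache behaviour.

/*
 * Illustrative only: exponentially weighted moving averages of how long
 * recycling an old WAL segment vs. creating a new one has been taking,
 * used to pick the (currently) cheaper option.
 */
#include <stdbool.h>

#define EWMA_ALPHA 0.2

static double avg_recycle_us = 0.0;
static double avg_create_us = 0.0;

static void
record_timing(double *avg, double elapsed_us)
{
	if (*avg == 0.0)
		*avg = elapsed_us;		/* first sample */
	else
		*avg = EWMA_ALPHA * elapsed_us + (1.0 - EWMA_ALPHA) * (*avg);
}

static bool
prefer_recycling(void)
{
	/* Until both paths have been sampled, keep the historical default. */
	if (avg_recycle_us == 0.0 || avg_create_us == 0.0)
		return true;
	return avg_recycle_us <= avg_create_us;
}

As Robert notes, such averages are only trustworthy if both paths are exercised regularly, which is itself a policy decision.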
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On 2018-Jul-16, Ashutosh Bapat wrote: > > Hmm, let me reword this comment completely. How about the attached? > That looks much better. However it took me a small while to understand > that (1), (2) and (3) correspond to strategies. You're right. Amended again and pushed. I also marked the open item closed. Thanks everyone -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Jul 16, 2018 at 09:41:51PM +0300, Heikki Linnakangas wrote: > On 16 July 2018 21:38:39 EEST, Robert Haas wrote: >>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas >>wrote: >>> Doesn't have to be a trigger, could be a CHECK constraint, datatype >>input >>> function, etc. Admittedly, having a datatype input function that >>inserts to >>> the table is worth a "huh?", but I'm feeling very confident that we >>can >>> catch all such cases, and some of them might even be sensible. >> >>Is this sentence missing a "not"? i.e. "I'm not feeling very >>confident"? > > Yes, sorry. This explains a lot :p I doubt that we'd be able to catch all the holes, as the conditions under which the optimization can run safely are basically impossible to identify beforehand. I'd like to vote for getting rid of this optimization for COPY, as it can hurt more than it helps. Per the lack of complaints, this could happen only in HEAD? -- Michael
Re: Fix error message when trying to alter statistics on included column
On 2018-Jun-28, Yugo Nagata wrote: > According to the error message, it is not allowed to alter statistics on > an included column because it is a "non-expression column". > > postgres=# create table test (i int, d int); > CREATE TABLE > postgres=# create index idx on test(i) include (d); > CREATE INDEX > postgres=# alter index idx alter column 2 set statistics 10; > ERROR: cannot alter statistics on non-expression column "d" of index "idx" > HINT: Alter statistics on table column instead. > > However, I think this should be forbidden in that this is not a key column > but an included column. Even if we decide to support expressions in included > columns in the future, it is meaningless to do this because any statistics on > an included column are never used by the planner. I agree with this reasoning, so I pushed this patch. Thanks! I added a couple of lines in the regress file for this feature also. Teodor, Alexander, now would be the time to express dissent :-) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian
On 16 July 2018 at 06:55, Tom Lane wrote: > I started to look at this patch. I think this is basically the right > direction to go in, but I'm not terribly happy with the details of the > data structure design. I've made an attempt at addressing the issues that I understood. I've not done anything about your Bitmapset for non-matching partitions. I fail to see how that would improve the code. But please feel free to provide details of your design and I'll review it and let you know what I think about it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services [Attachment: v2-0001-Fix-run-time-partition-pruning-for-UNION-ALL-pare.patch]
Re: partition pruning doesn't work with IS NULL clause in multikey range partition case
On 2018/07/17 8:17, Alvaro Herrera wrote: > On 2018-Jul-16, Ashutosh Bapat wrote: > >>> Hmm, let me reword this comment completely. How about the attached? > >> That looks much better. However it took me a small while to understand >> that (1), (2) and (3) correspond to strategies. > > You're right. Amended again and pushed. I also marked the open item > closed. > > Thanks everyone Thank you Ashutosh for further review and Alvaro for improving it quite a bit before committing. Thanks, Amit
Re: patch to allow disable of WAL recycling
On Mon, Jul 16, 2018 at 10:38:14AM -0400, Robert Haas wrote: > It's been a few years since I tested this, but my recollection is that > if you fill up pg_xlog, the system will PANIC and die on a vanilla > Linux install. Sure, you can set max_wal_size, but that's a soft > limit, not a hard limit, and if you generate WAL faster than the > system can checkpoint, you can overrun that value and force allocation > of additional WAL files. So I'm not sure we have any working > ENOSPC-panic protection today. Given that, I'm doubtful that we > should prioritize maintaining whatever partially-working protection we > may have today over raw performance. If we want to fix ENOSPC on > pg_wal = PANIC, and I think that would be a good thing to fix, then we > should do it either by finding a way to make the WAL insertion ERROR > out instead of panicking, or throttle WAL generation as we get close > to disk space exhaustion so that the checkpoint has time to complete, > as previously proposed by Heroku. I would personally prefer seeing max_wal_size switched to a hard limit, and making that tunable. I am wondering if that's the case for other people on this list, but I see from time to time, every couple of weeks, people complaining that Postgres is not able to maintain a hard guarantee behind the value of max_wal_size. In some upgrade scenarios, I had to tell such folks to throttle their insert load and also manually issue checkpoints to allow the system to stay up and continue with the upgrade process. So there are definitely cases where throttling is useful, and if the hard limit is reached I would rather see WAL generation from other backends simply stopped so that the system can finish its checkpoint, instead of risking the system going down. And sometimes this happens also with a SQL dump, where throttling the load at the application level means a more complex dump strategy, for example splitting things between multiple files. -- Michael
Re: PATCH: psql tab completion for SELECT
On 17 July 2018 at 00:00, Heikki Linnakangas wrote: > Playing around with this a little bit, I'm not very satisfied with the > completions. Sometimes this completes too much, and sometimes too little. > All of this has been mentioned in this and the other thread [1] already, > this just my opinion on all this. Hi Heikki, thanks for getting this thread active again. I do actually have another patch, which I didn't post yet because everyone was busy with v11, and I wanted to wait for more feedback about inclusions/exclusions. Holding it back now seems like a mistake (especially since the last patch had a bug) so here it is. There's also a patch, to be applied on top, that adds completion after commas. > Too much: > > postgres=# \d lp >Table "public.lp" >Column | Type | Collation | Nullable | Default > --+-+---+--+- > id | integer | | | > partkey | text| | | > partcol1 | text| | | > partcol2 | text| | | > Partition key: LIST (partkey) > Number of partitions: 1000 (Use \d+ to list them.) > > postgres=# select pa[TAB] > pad_attribute parameter_default parameter_stylepartattrs partcol2 > partexprs partrelid > page parameter_mode parameter_typespartclass > partcollation partkeypartstrat > pageno parameter_name parse_ident( partcol1 partdefid > partnatts passwd I agree that there is too much. I don't know what the best way to reduce it is, short of specifically excluding certain things. In theory, I think the query could be sensitive to how much text is entered, for example, let pa[TAB] not show partrelid and other columns from pg_partition, but part[TAB] show them; the general rule could be "only show attributes for system tables if 3 or more letters entered". But I'm wary of overcomplicating a patch that is (IMHO) a useful improvement even if simple. > Too little: > > postgres=# select partkey, p[TAB] > [no completions] > > > Then there's the multi-column case, which seems weird (to a user - I know > the implementation and understand why): > > postgres=# select partkey, partc[TAB] > [no completions] Yep, there was no completion after commas in the previous patches. > And I'd love this case, where go back to edit the SELECT list, after already > typing the FROM part, to be smarter: > > postgres=# select p[TAB] FROM lp; > Display all 370 possibilities? (y or n) I don't know enough about readline to estimate how easy this would be. I think all our current completions use only the text up to the cursor. > There's something weird going on with system columns, from a user's point of > view: > > SELECT oi[TAB] > oid oidvectortypes( > > postgres=# select xm[TAB] > xmin xmlcomment( xml_is_well_formed_content( > xmlvalidate( > xmlagg( xml_is_well_formed( > xml_is_well_formed_document( > > So oid and xmin are completed. But "xmax" and "ctid" are not. I think this > is because in fact none of the system columns are completed, but there > happen to be some tables with regular columns named "oid" and "xmin". So it > makes sense once you know that, but it's pretty confusing to a user. > Tab-completion is a great way for a user to explore what's available, so > it's weird that some system columns are effectively completed while others > are not. You are correct, system columns are excluded, but of course there are always the same column names in other tables. Do you think we should just include all columns? > I'd like to not include set-returning functions from the list. 
Although you > can do "SELECT generate_series(1,10)", I'd like to discourage people from > doing that, since using set-returning functions in the target list is a > PostgreSQL extension to the SQL standard, and IMHO the "SELECT * FROM > generate_series(1,10)" syntax is easier to understand and works more sanely > with joins etc. Conversely, it would be really nice to include set-returning > function in the completions after FROM. I'm happy to exclude SRFs after SELECT if others agree. Including them after FROM seems worthwhile, but best left for a different patch? > I understand that there isn't much you can do about some of those things, > short of adding a ton of new context information about previous queries and > heuristics. I think the completion needs to be smarter to be acceptable. I > don't know what exactly to do, but perhaps someone else has ideas. > > I'm also worried about performance, especially of the query to get all the > column names. It's pretty much guaranteed to do perform a > SeqScan+Sort+Unique on pg_attribute. pg_attribute can be very large. In my > little test database, with the above 1000-partition table, hitting tab after > "SELECT " takes about 1 second, until you get the "Display all 1000 > possibili
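For reference, the column-name completion in this patch boils down to a catalog query of roughly the following shape, written here as a tab-complete.c-style C string constant. It is illustrative only (psql additionally filters on the text typed so far, and the posted patch's conditions differ in detail), but it shows where the pg_attribute SeqScan + Sort + Unique that Heikki measured comes from.

/*
 * Approximate shape of the column-name completion query: distinct column
 * names of visible relations, skipping dropped and system columns.
 */
static const char *const hypothetical_column_query =
	"SELECT DISTINCT a.attname "
	"  FROM pg_catalog.pg_attribute a "
	"  JOIN pg_catalog.pg_class c ON c.oid = a.attrelid "
	" WHERE a.attnum > 0 "
	"   AND NOT a.attisdropped "
	"   AND c.relkind IN ('r', 'v', 'm', 'f', 'p') "
	"   AND pg_catalog.pg_table_is_visible(c.oid)";

A DISTINCT over all of pg_attribute ends up as the SeqScan+Sort+Unique Heikki describes, which is why the 1000-partition case pays about a second on the first tab press.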
Re: PATCH: psql tab completion for SELECT
On 17 July 2018 at 03:27, Joao De Almeida Pereira wrote: > After playing a little bit around with the patch I noticed that a comma was > missing in line 1214 > + 1202 /* min_server_version */ > + 1203 9, > + 1204 /* catname */ > + 1205 "pg_catalog.pg_proc p", > + 1206 /* selcondition */ > + 1207 "p.prorettype NOT IN ('trigger'::regtype, > 'internal'::regtype) " > + 1208 "AND 'internal'::regtype != ALL (p.proargtypes) " > + 1209 "AND p.oid NOT IN (SELECT > unnest(array[typinput,typoutput,typreceive,typsend,typmodin,typmodout,typanalyze]) > FROM pg_type) " > + 1210 "AND p.oid NOT IN (SELECT > unnest(array[aggtransfn,aggfinalfn]) FROM pg_aggregate) " > + 1211 "AND p.oid NOT IN (SELECT > unnest(array[oprcode,oprrest,oprjoin]) FROM pg_operator) " > + 1212 "AND p.oid NOT IN (SELECT > unnest(array[lanplcallfoid,laninline,lanvalidator]) FROM pg_language) " > + 1213 "AND p.oid NOT IN (SELECT castfunc FROM pg_cast) " > + 1214 "AND p.oid NOT IN (SELECT amproc FROM pg_amproc) " > + 1215 /* viscondition */ > + 1216 "pg_catalog.pg_function_is_visible(p.oid)", > > To catch these typos it would be good if we could get some testing around psql. > (Newbie question: do we have any kind of testing around tools like psql?) > > Thanks > Joao Hi Joao, Ah yes, the embarrassing missing comma. I kind of wish my IDE or compiler had pointed it out to me, but how was it to know that I was foolishly combining two struct fields? Sadly, I think the best defence right now is to have others scrutinise the code. I attached a new version of the patch in reply to Heikki. Would you care to take a careful look for me? I think some automated test framework for the tab completion queries is possible, by calling psql_completion and examining the results. Maybe someone will volunteer... Edmund
Re: How to make partitioning scale better for larger numbers of partitions
On 2018/07/13 22:10, Ashutosh Bapat wrote: > On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho wrote: >>> I wondered if you compared to PG10 or to inheritence-partitioning (parent >>> with relkind='r' and either trigger or rule or >INSERT/UPDATE directly into >>> child) ? >> >> Thank you for your reply. >> >> I compared to PG11beta2 with non-partitioned table. >> >> Non-partitioned table has 1100 records in one table. >> Partitioned table has one record on each leaf partitions. >> > > I don't think partitioning should be employed this way even for the > sake of comparison. Depending upon the size of each tuple, 1100 tuples > are inserted into a single table, they will probably occupy few > hundred pages. In a partitioned table with one tuple per partition > they will occupy 1100 pages at least. There is other space, locking > overheads to maintain 1100 tables. I think the right way to compare is > to have really large that which really requires 1100 partitions and > then compare performance by putting that data in 1100 partitions and > in an unpartitioned table. Even with that kind of data, you will see > some difference in performance, but that won't be as dramatic as you > report. > > I might be missing something though. Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table. It is and has been clear that that's because the planning time explodes as the number of partitions increases. If there's lots of data in it, then the result will look completely different as you say, because scanning a single partition (of the 1100 total) will spend less time than scanning a non-partitioned table containing 1100 partitions worth of data. But the planning time would still be more for the partitioned table, which seems to be the point of this benchmark. Thanks, Amit
Re: Make foo=null a warning by default.
On Mon, Jul 16, 2018 at 11:37:28AM -0400, Tom Lane wrote: > Heikki Linnakangas writes: > > On 16/07/18 18:10, Tom Lane wrote: > >> TBH I'm not really excited about investing any work in this area at all. > >> Considering how seldom we hear any questions about transform_null_equals > >> anymore[1], I'm wondering if we couldn't just rip the "feature" out > >> entirely. > > > Yeah, I was wondering about that too. But Fabien brought up a completely > > new use-case for this: people learning SQL. For beginners who don't > > understand the behavior of NULLs yet, I can see a warning or error being > > useful training wheels. Perhaps a completely new "training_wheels=on" > > option, which could check may for many other beginner errors, too, would > > be better for that. > > Agreed --- but what we'd want that to do seems only vaguely related to > the existing behavior of transform_null_equals. As an example, we > intentionally made transform_null_equals *not* trigger on > > CASE x WHEN NULL THEN ... > > but a training-wheels warning for that would likely be reasonable. > > For that matter, many of the old threads about this are complaining > about nulls that aren't simple literals in the first place. I wonder > whether a training-wheels feature that whined *at runtime* about null > WHERE-qual or case-test results would be more useful than a parser > check. Am I understanding correctly that this would be looking for bogus conditions in the plan tree? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
RE: How to make partitioning scale better for larger numbers of partitions
On 2018/07/17 10:49, Amit Langote wrote: >Perhaps, Kato-san only intended to report that the time that planner spends >for a partitioned table with 1100 partitions is just too high compared to the >time it spends on a non-partitioned table. yes, It is included for the purposes of this comparison. The purpose of this comparison is to find where the partitioning bottleneck is. Using the bottleneck as a hint of improvement, I'd like to bring the performance of partitioned table closer to the performance of unpartitioned table as much as possible. -Original Message- From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] Sent: Tuesday, July 17, 2018 10:49 AM To: Ashutosh Bapat ; Kato, Sho/加藤 翔 Cc: Justin Pryzby ; PostgreSQL mailing lists Subject: Re: How to make partitioning scale better for larger numbers of partitions On 2018/07/13 22:10, Ashutosh Bapat wrote: > On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho wrote: >>> I wondered if you compared to PG10 or to inheritence-partitioning (parent >>> with relkind='r' and either trigger or rule or >INSERT/UPDATE directly into >>> child) ? >> >> Thank you for your reply. >> >> I compared to PG11beta2 with non-partitioned table. >> >> Non-partitioned table has 1100 records in one table. >> Partitioned table has one record on each leaf partitions. >> > > I don't think partitioning should be employed this way even for the > sake of comparison. Depending upon the size of each tuple, 1100 tuples > are inserted into a single table, they will probably occupy few > hundred pages. In a partitioned table with one tuple per partition > they will occupy 1100 pages at least. There is other space, locking > overheads to maintain 1100 tables. I think the right way to compare is > to have really large that which really requires 1100 partitions and > then compare performance by putting that data in 1100 partitions and > in an unpartitioned table. Even with that kind of data, you will see > some difference in performance, but that won't be as dramatic as you > report. > > I might be missing something though. Perhaps, Kato-san only intended to report that the time that planner spends for a partitioned table with 1100 partitions is just too high compared to the time it spends on a non-partitioned table. It is and has been clear that that's because the planning time explodes as the number of partitions increases. If there's lots of data in it, then the result will look completely different as you say, because scanning a single partition (of the 1100 total) will spend less time than scanning a non-partitioned table containing 1100 partitions worth of data. But the planning time would still be more for the partitioned table, which seems to be the point of this benchmark. Thanks, Amit
Re: How to make partitioning scale better for larger numbers of partitions
On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho wrote: > On 2018/07/17 10:49, Amit Langote wrote: >>Perhaps, Kato-san only intended to report that the time that planner spends >>for a partitioned table with 1100 partitions is just too high compared to the >>time it spends on a non-partitioned table. > > yes, It is included for the purposes of this comparison. > > The purpose of this comparison is to find where the partitioning bottleneck > is. > Using the bottleneck as a hint of improvement, I'd like to bring the > performance of partitioned table closer to the performance of unpartitioned > table as much as possible. > That's a good thing, but may not turn out to be realistic. We should compare performance where partitioning matters and try to improve in those contexts. Else we might improve performance in scenarios which are never used. In this case, even if we improve the planning time by 100%, it hardly matters since planning time is negligible compared to the execution time because of the huge data volumes for which partitioning is useful. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Let's remove DSM_IMPL_NONE.
Hello. At Tue, 10 Jul 2018 18:52:54 +0200, Peter Eisentraut wrote in > committed Thank you for committing this! regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Refactor documentation for wait events (Was: pgsql: Add wait event for fsync of WAL segments)
On Mon, Jul 16, 2018 at 11:22:07AM -0400, Robert Haas wrote: > This doesn't seem to get rid of the morerows stuff. The troubling ones are in monitoring.sgml: LWLock, Lock, Activity, Client, IPC, Timeout, and IO. And the patch previously sent removes them, but perhaps I am missing your point? -- Michael
Re: How to make partitioning scale better for larger numbers of partitions
On 2018/07/17 12:14, Ashutosh Bapat wrote: > On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho wrote: >> On 2018/07/17 10:49, Amit Langote wrote: >>> Perhaps, Kato-san only intended to report that the time that planner spends >>> for a partitioned table with 1100 partitions is just too high compared to >>> the time it spends on a non-partitioned table. >> >> yes, It is included for the purposes of this comparison. >> >> The purpose of this comparison is to find where the partitioning bottleneck >> is. >> Using the bottleneck as a hint of improvement, I'd like to bring the >> performance of partitioned table closer to the performance of unpartitioned >> table as much as possible. >> > > That's a good thing, but may not turn out to be realistic. We should > compare performance where partitioning matters and try to improve in > those contexts. Else we might improve performance in scenarios which > are never used. > > In this case, even if we improve the planning time by 100%, it hardly > matters since planning time is neglegible compared to the execution > time because of huge data where partitioning is useful. While I agree that it's a good idea to tell users to use partitioning only if the overhead of having the partitioning in the first place is bearable, especially the planner overhead, this benchmark shows us that even for what I assume might be fairly commonly occurring queries (select .. from / update .. / delete from partitioned_table where partkey = ?), planner spends way too many redundant cycles. Some amount of that overhead will always remain and planning with partitioning will always be a bit slower than without partitioning, it's *too* slow right now for non-trivial number of partitions. Thanks, Amit
Re: [HACKERS] Restricting maximum keep segments by repslots
On Fri, Jul 13, 2018 at 5:40 PM, Kyotaro HORIGUCHI wrote: > Hello. > > At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada > wrote in >> On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI >> wrote: > .. >> Here is review comments of v4 patches. >> >> + if (minKeepLSN) >> + { >> + XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN(); >> + Assert(!XLogRecPtrIsInvalid(slotPtr)); >> + >> + tailSeg = GetOldestKeepSegment(currpos, slotPtr); >> + >> + XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, >> wal_segment_size); >> + } >> >> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the >> destination at 4th argument the wal_segment_size will be changed in >> the above expression. The regression tests by PostgreSQL Patch Tester > > I'm not sure I get this correctly, the definition of the macro is > as follows. > > | #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \ > | (dest) = (segno) * (wal_segsz_bytes) + (offset) > > The destination is the *3rd* parameter and the forth is segment > size which is not to be written. Please see commit a22445ff0b which flipped input and output arguments. So maybe you need to rebase the patches to current HEAD. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
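To spell out the argument-order change being referred to: the destination moved from the third position to the fourth. The "after" form below is my reading of commit a22445ff0b and should be treated as an assumption to verify against current HEAD.

/*
 * Before a22445ff0b the destination was the 3rd argument:
 *     XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes)
 * After it the destination is the 4th argument (assumed; please verify):
 *     XLogSegNoOffsetToRecPtr(segno, offset, wal_segsz_bytes, dest)
 *
 * So the call quoted above would need to become something like:
 */
XLogSegNoOffsetToRecPtr(tailSeg, 0, wal_segment_size, *minKeepLSN);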
Re: pgsql: Allow UNIQUE indexes on partitioned tables
On 2018-Feb-19, David G. Johnston wrote: > As an aside, adding a link to "Data Definition/Table Partitioning" from at > least CREATE TABLE ... PARTITION BY; and swapping "PARTITION BY" and > "PARTITION OF" in the Parameters section of that page - one must partition > by a table before one can partition it (and also the synopsis lists them in > the BY before OF order), would be helpful. A little late, I have done these two changes. -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Allow UNIQUE indexes on partitioned tables
On 2018/07/17 13:57, Alvaro Herrera wrote: > On 2018-Feb-19, David G. Johnston wrote: > >> As an aside, adding a link to "Data Definition/Table Partitioning" from at >> least CREATE TABLE ... PARTITION BY; and swapping "PARTITION BY" and >> "PARTITION OF" in the Parameters section of that page - one must partition >> by a table before one can partition it (and also the synopsis lists them in >> the BY before OF order), would be helpful. > > A little late, I have done these two changes. A belated +1. Thanks, Amit
untrusted PLs should be GRANTable
Hi all, A user has raised the point that our refusal to GRANT rights to untrusted PLs is counterproductive and inconsistent with how we behave elsewhere. Yes, untrusted PLs can be escaped to gain superuser rights, often trivially. But we allow this: CREATE ROLE superme SUPERUSER NOINHERIT; GRANT superme TO me; and really, GRANTing an untrusted PL is similar. Forcing users to create their PLs as a superuser increases the routine use of superuser accounts. Most users' DDL deploy scripts will get run as a superuser if they have to use a superuser for PL changes; they're not going to SET ROLE and RESET ROLE around the function changes. It also encourages users to make their untrusted functions SECURITY DEFINER while still owned by a superuser, which we really don't want them doing unnecessarily. In the name of making things more secure, we've made them less secure. Untrusted PLs should be GRANTable, with a NOTICE or WARNING telling the admin that GRANTing an untrusted PL effectively gives the user the ability to escape to superuser. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: How to make partitioning scale better for larger numbers of partitions
On 2018/07/16 13:16, Tsunakawa-san wrote: >Thanks. And what does the comparison look like between the unpartitioned case >and various partition counts? What's the performance characteristics in terms >of the latency and partition count? I thought that's what you tried to >reveal first?

For the unpartitioned table, the latency of SELECT/UPDATE statements is close to O(n), where n is the number of records, and the latency of INSERT statements is close to O(1). For the partitioned table, up to 400 partitions the latency of SELECT/INSERT statements is close to O(log n), where n is the number of partitions; between 400 and 6400 partitions it is close to O(n). Up to 400 partitions the latency of UPDATE statements is close to O(n); between 400 and 6400 partitions it seems to be O(n^2). Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------
 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
-------+-------------+-------------+----------------+----------------+----------------
   100 |       0.24  | 4174.395738 |          0.056 |          0.051 |          0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |          0.04
   400 |       0.29  | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |          0.18  |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |          0.54  |          0.531 |          0.043

partitioned table result (prepared mode)
----------------------------------------
 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |       1.65  |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |          0.19  |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |          4.21  |          4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642

regards,

-----Original Message----- From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] Sent: Monday, July 16, 2018 1:16 PM To: Kato, Sho/加藤 翔 Cc: 'Amit Langote' ; PostgreSQL mailing lists Subject: RE: How to make partitioning scale better for larger numbers of partitions From: Kato, Sho [mailto:kato-...@jp.fujitsu.com] > I did pgbench -M prepared and perf record. > > UPDATE latency in prepared mode is 95% shorter than in simple mode. > SELECT latency in prepared mode is 54% shorter than in simple mode. > INSERT latency in prepared mode is 8% shorter than in simple mode. Thanks. And what does the comparison look like between the unpartitioned case and various partition counts? What's the performance characteristics in terms of the latency and partition count? I thought that's what you tried to reveal first? Regards Takayuki Tsunakawa
Re: Add SKIP LOCKED to VACUUM and ANALYZE
On Thu, Jul 12, 2018 at 02:37:28PM +0900, Michael Paquier wrote: > On Wed, Jun 13, 2018 at 08:29:12PM +, Bossart, Nathan wrote: >> Previous thread: >> https://postgr.es/m/4BC0F3CD-F4B5-4F23-AADB-80607F9E4B4E%40amazon.com >> >> This is a new thread for tracking the work to add SKIP LOCKED to >> VACUUM and ANALYZE. I've attached a rebased set of patches. > > I am beginning to look at this stuff, and committed patch 4 and 5 on the > way as they make sense independently. I am still reviewing the rest, > which applies with some fuzz, and the tests proposed still pass. I have been looking at the rest of the patch set, and I find the way SKIP_LOCKED is handled a bit inconsistent. First, if the manual VACUUM does not specify a list of relations, which can never happen for autovacuum, then we use get_all_vacuum_rels to decide the list of relations to look at, and it is up to vacuum_rel() to decide whether a relation can be skipped. If a list of relations is given by the caller, then it is up to expand_vacuum_rel() to do the call. In expand_vacuum_rel(), an OID could be defined for a relation, which is something used by autovacuum. If no OID is defined, which happens for a manual VACUUM, then we use RangeVarGetRelidExtended() directly at this stage and see if the relation is available. If the relation listed in the manual VACUUM is a partitioned table, then its full set of partition OIDs (including lower layers) is included in the list of relations to vacuum, and we definitely want to take, and then release, AccessShareLock so that the partition tree lookup can happen. This also uses find_all_inheritors() with NoLock, relying on vacuum_rel() to decide whether to skip each relation. The first thing which strikes me is that we may actually *not* want to check for lock skipping within expand_vacuum_rel(), as that is mainly a function aimed at building the list of relations which are going to be vacuumed, and all the lock skipping operations should happen at the beginning of vacuum_rel(). This way, even if the parent of a partition tree is listed but cannot have its RangeVar opened immediately, we still have a chance to vacuum some of its partitions after listing them. This point is debatable as there are pros and cons. I would be of the opinion not to change expand_vacuum_rel()'s mission to build the list of relations to VACUUM, and to actually complain about a lock not being available when the relation is ready to be processed individually. It is also perfectly possible to skip locks for partitions by listing them individually in the VACUUM command used. I am pretty sure that Andres would complain at the sight of this paragraph as this is backwards with the reason behind the refactoring of RangeVarGetRelidExtended, but I like the new API better :p For this part, it seems to me that we can do better than what is in HEAD (a rough sketch follows at the end of this message): - Call RangeVarGetRelidExtended without lock. - Check for has_subclass(), which should be extended so that it does not complain about a missing relation, leaving the error handling to vacuum.c. - Call RangeVarGetRelidExtended a second time if a lookup with find_all_inheritors is necessary. If the relation goes missing in between, then we just get an error, as with the current behavior. I am also questioning whether it is worth complicating vac_open_indexes() by skipping a full relation vacuum if one of its indexes cannot be opened, which would be the case for an ALTER INDEX for example.
It seems to me that skipping locks should just happen for the parent relation, so I would not bother much about this case, and just document the behavior. If somebody argues for this option, we could always add a SKIP_INDEXES or such. "SKIP LOCKED" should be named "SKIP_LOCKED", with an underscore, for consistency with the other options.

-void
-cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose)
+bool
+cluster_rel(Oid tableOid, Oid indexOid, bool recheck, bool verbose,
+            bool skip_locked)
 {

Some refactoring of CLUSTER on the way here? It would be nice to avoid having three boolean arguments here, and it would open the door for an extended CLUSTER grammar with parentheses as well as a SKIP_LOCKED option for it. And this could be refactored first. VACUUM FULL being a variant of CLUSTER, we could just reuse the same options... Now, using separate option names could be cleaner. The logic in get_elevel_for_skipped_relation could be refactored into something used by cluster_rel as well, as the same logic is used in three places (vacuum_rel, analyze_rel and cluster_rel with try_relation_open). I think that it would be most interesting to get the refactoring around get_elevel_for_skip_locked and cluster_rel done first. The improvement for RangeVarGetRelidExtended with relations not having subclasses could be worth doing separately as well. -- Michael
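For illustration only, here is a hypothetical sketch of the two-step lookup proposed above for expand_vacuum_rel(): resolve the name without taking a lock, and only lock and expand the partition tree when the relation actually has children, leaving it to vacuum_rel() to complain about (or skip) unavailable locks later. The function name, structure and omitted error handling are invented for this example; RangeVarGetRelidExtended(), has_subclass(), find_all_inheritors() and UnlockRelationOid() are existing backend functions. This is not the patch's code, just one way the flow could look.

#include "postgres.h"

#include "catalog/namespace.h"
#include "catalog/pg_inherits.h"
#include "nodes/pg_list.h"
#include "storage/lmgr.h"
#include "storage/lockdefs.h"

/*
 * Hypothetical sketch: resolve the relation name without a lock, and only
 * lock and walk the partition tree when the relation has children.
 */
static List *
expand_one_vacuum_rel_sketch(RangeVar *rv)
{
	List	   *vacrels;
	Oid			relid;

	/* First lookup: no lock taken, just resolve the name if possible. */
	relid = RangeVarGetRelidExtended(rv, NoLock, RVR_MISSING_OK, NULL, NULL);
	if (!OidIsValid(relid))
		return NIL;				/* let the caller handle the error */

	/* No children: nothing more to expand. */
	if (!has_subclass(relid))
		return list_make1_oid(relid);

	/*
	 * Second lookup, this time holding AccessShareLock long enough to walk
	 * the inheritance tree, released right after.  If the relation went
	 * away in between, we simply get the usual error.
	 */
	relid = RangeVarGetRelidExtended(rv, AccessShareLock, 0, NULL, NULL);
	vacrels = find_all_inheritors(relid, NoLock, NULL);
	UnlockRelationOid(relid, AccessShareLock);

	return vacrels;
}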
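And a similarly hypothetical sketch of the cluster_rel() refactoring hinted at above: folding the boolean arguments into a flags bitmask, so that adding SKIP_LOCKED (or a future CLUSTER option) does not keep growing the parameter list. The CLUOPT_* names, the prototype and the helper below are made up for this example and are not the committed API.

#include "postgres.h"

/* Illustrative option bits; a real patch may choose different names. */
#define CLUOPT_RECHECK		0x01	/* recheck relation state before work */
#define CLUOPT_VERBOSE		0x02	/* emit progress messages */
#define CLUOPT_SKIP_LOCKED	0x04	/* silently skip if lock unavailable */

/* Hypothetical signature; returns false when the relation was skipped. */
extern bool cluster_rel(Oid tableOid, Oid indexOid, int options);

/* How a VACUUM FULL caller might build its flags. */
static void
vacuum_full_one_rel_sketch(Oid tableOid, bool verbose, bool skip_locked)
{
	int			options = CLUOPT_RECHECK;

	if (verbose)
		options |= CLUOPT_VERBOSE;
	if (skip_locked)
		options |= CLUOPT_SKIP_LOCKED;

	(void) cluster_rel(tableOid, InvalidOid, options);
}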
Re: missing toast table for pg_policy
On Sat, Jul 14, 2018 at 03:47:38PM +0700, John Naylor wrote: > I hope you don't mind, but since it might be tedious to piece together > the addenda I left behind in this thread, I thought it might be useful > to update Joe's patch. The attached was rebased over the new > regression test, passes the pg_upgrade test, and has a draft commit > message. I was just having a second look at this patch, and ran a few more tests with pg_upgrade, which passed.

+-- 2. pg_largeobject and pg_largeobject_metadata, to avoid problems
+-- with pg_upgrade

John, what's actually the failure that was seen here? It would be nice to see this patch committed, but the reason here should be more explicit about why this cannot happen. -- Michael
Re: file cloning in pg_upgrade and CREATE DATABASE
On Wed, Jun 06, 2018 at 11:58:14AM -0400, Peter Eisentraut wrote: > > The setting "always" requires the use of reflinks. If > they are not supported, the pg_upgrade run > will abort. Use this in production to limit the upgrade run time. > The setting "auto" uses reflinks when available, > otherwise it falls back to a normal copy. This is the default. The > setting "never" prevents use of reflinks and always > uses a normal copy. This can be useful to ensure that the upgraded > cluster has its disk space fully allocated and not shared with the old > cluster. > Hm... I am wondering if we actually want the "auto" mode, where we make the option smarter automatically. I am afraid of users trying to use it and being surprised that there is no gain when they expected some. I would rather switch that to an on/off switch which defaults to "off", or simply keep what is available now. huge_pages=try was a bad idea as the result is not deterministic, so I would not have more of that... Putting CloneFile and check_reflink in a location that other frontend binaries, like pg_rewind, could use would be nice. -- Michael
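For context on what a reusable CloneFile()/check_reflink() helper boils down to on Linux, here is a minimal standalone sketch using the FICLONE ioctl. This is illustrative only and not taken from the patch: macOS/APFS would instead use copyfile() with COPYFILE_CLONE, and a real helper would fall back to a normal copy rather than bailing out when reflinks are unsupported.

/*
 * Minimal standalone sketch of a reflink-style copy on Linux: the
 * destination file ends up sharing the source's blocks copy-on-write,
 * which is what makes cloning nearly instantaneous.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>			/* FICLONE */

int
main(int argc, char **argv)
{
	int			srcfd;
	int			dstfd;

	if (argc != 3)
	{
		fprintf(stderr, "usage: %s source destination\n", argv[0]);
		return 1;
	}

	srcfd = open(argv[1], O_RDONLY);
	dstfd = open(argv[2], O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (srcfd < 0 || dstfd < 0)
	{
		perror("open");
		return 1;
	}

	/* Ask the filesystem to share blocks; fails if reflinks are unsupported. */
	if (ioctl(dstfd, FICLONE, srcfd) < 0)
	{
		perror("ioctl(FICLONE)");	/* a real tool would fall back to copying */
		return 1;
	}

	close(srcfd);
	close(dstfd);
	return 0;
}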