Re: proposal: variadic argument support for least, greatest function

2019-02-12 Thread Pavel Stehule
Hi

On Tue, 12 Feb 2019 at 2:00, Chapman Flack 
wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, failed
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> The argument for consistency with other functions that are variadic makes
> sense to me: although they are different from ordinary variadic functions
> in the type unification they do, their description in the manual simply
> calls them functions without calling attention to any way that they are
> different beasts. So, a reader might reasonably be surprised that VARIADIC
> doesn't work in the usual way.
>
> This patch applies (with some offsets) but the build produces several
> incompatible pointer type assignment warnings, and fails on errors where
> fcinfo->arg is no longer a thing (so should be rebased over the
> variable-length function call args patch).
>
> It does not yet add regression tests, or update the documentation in
> func.sgml.


here is a fixed patch

Regards

Pavel
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index db3777d15e..01d7f0e02c 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1854,13 +1854,18 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		case T_MinMaxExpr:
 			{
 MinMaxExpr *minmaxexpr = (MinMaxExpr *) node;
-int			nelems = list_length(minmaxexpr->args);
+int			nelems;
 TypeCacheEntry *typentry;
 FmgrInfo   *finfo;
 FunctionCallInfo fcinfo;
 ListCell   *lc;
 int			off;
 
+if (minmaxexpr->args)
+	nelems = list_length(minmaxexpr->args);
+else
+	nelems = 1;
+
 /* Look up the btree comparison function for the datatype */
 typentry = lookup_type_cache(minmaxexpr->minmaxtype,
 			 TYPECACHE_CMP_PROC);
@@ -1897,16 +1902,29 @@ ExecInitExprRec(Expr *node, ExprState *state,
 scratch.d.minmax.finfo = finfo;
 scratch.d.minmax.fcinfo_data = fcinfo;
 
-/* evaluate expressions into minmax->values/nulls */
-off = 0;
-foreach(lc, minmaxexpr->args)
+if (minmaxexpr->args)
 {
-	Expr	   *e = (Expr *) lfirst(lc);
+	scratch.d.minmax.variadic_arg = false;
 
-	ExecInitExprRec(e, state,
-	&scratch.d.minmax.values[off],
-	&scratch.d.minmax.nulls[off]);
-	off++;
+	/* evaluate expressions into minmax->values/nulls */
+	off = 0;
+	foreach(lc, minmaxexpr->args)
+	{
+		Expr	   *e = (Expr *) lfirst(lc);
+
+		ExecInitExprRec(e, state,
+		&scratch.d.minmax.values[off],
+		&scratch.d.minmax.nulls[off]);
+		off++;
+	}
+}
+else
+{
+	scratch.d.minmax.variadic_arg = true;
+
+	ExecInitExprRec(minmaxexpr->variadic_arg, state,
+		&scratch.d.minmax.values[0],
+		&scratch.d.minmax.nulls[0]);
 }
 
 /* and push the final comparison */
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a018925d4e..748c950885 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2819,6 +2819,55 @@ ExecEvalMinMax(ExprState *state, ExprEvalStep *op)
 	/* default to null result */
 	*op->resnull = true;
 
+	if (op->d.minmax.variadic_arg)
+	{
+		ArrayIterator array_iterator;
+		ArrayType  *arr;
+		Datum	value;
+		bool	isnull;
+
+		if (nulls[0])
+			return;
+
+		arr = DatumGetArrayTypeP(values[0]);
+
+		array_iterator = array_create_iterator(arr, 0, NULL);
+		while (array_iterate(array_iterator, &value, &isnull))
+		{
+			if (isnull)
+continue;
+
+			if (*op->resnull)
+			{
+/* first nonnull input */
+*op->resvalue = value;
+*op->resnull = false;
+			}
+			else
+			{
+int			cmpresult;
+
+Assert(fcinfo->nargs == 2);
+
+/* apply comparison function */
+fcinfo->args[0].value = *op->resvalue;
+fcinfo->args[1].value = value;
+
+fcinfo->isnull = false;
+cmpresult = DatumGetInt32(FunctionCallInvoke(fcinfo));
+if (fcinfo->isnull) /* probably should not happen */
+	continue;
+
+if (cmpresult > 0 && operator == IS_LEAST)
+	*op->resvalue = value;
+else if (cmpresult < 0 && operator == IS_GREATEST)
+	*op->resvalue = value;
+			}
+		}
+
+		return;
+	}
+
 	for (off = 0; off < op->d.minmax.nelems; off++)
 	{
 		/* ignore NULL inputs */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index b44ead269f..4d1c209c6a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -1942,6 +1942,7 @@ _copyMinMaxExpr(const MinMaxExpr *from)
 	COPY_SCALAR_FIELD(inputcollid);
 	COPY_SCALAR_FIELD(op);
 	COPY_NODE_FIELD(args);
+	COPY_NODE_FIELD(variadic_arg);
 	COPY_LOCATION_FIELD(location);
 
 	return newnode;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 1e169e0b9c..aebc0115ea 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c

Re: [PATCH] xlogreader: do not read a file block twice

2019-02-12 Thread Arthur Zakirov

On 12.02.2019 07:23, Michael Paquier wrote:

On 02/11/2019 07:25 PM, Arthur Zakirov wrote:

Grigory noticed that one of our utilities has very slow performance when
xlogreader reads zlib archives. We found out that xlogreader sometimes
reads a WAL file block twice.

What do you think?


I think that such things, even if they look simple, need a careful
lookup, and I have not looked at the proposal yet.  Could you add it
to the next commit fest so as we don't lose track of it?
https://commitfest.postgresql.org/22/


Of course. Agreed, it may be a non-trivial case. Added as a bug fix:
https://commitfest.postgresql.org/22/1994/

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Problems with plan estimates in postgres_fdw

2019-02-12 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/02/08 2:04), Antonin Houska wrote:
> 
> > First, I wonder if the information on LIMIT needs to be passed to the
> > FDW. Cannot the FDW routines use root->tuple_fraction?
> 
> I think it's OK for the FDW to use the tuple_fraction, but I'm not sure the
> tuple_fraction should be enough.  For example, I don't think we can re-compute
> from the tuple_fraction the LIMIT/OFFSET values.
> 
> > I think (although have
> > not verified experimentally) that the problem reported in [1] is not limited
> > to the LIMIT clause. I think the same problem can happen if client uses 
> > cursor
> > and sets cursor_tuple_fraction very low. Since the remote node is not aware 
> > of
> > this setting when running EXPLAIN, the low fraction can also make 
> > postgres_fdw
> > think that retrieval of the whole set is cheaper than it will appear to be
> > during the actual execution.
> 
> I think it would be better to get a fast-start plan on the remote side in such
> a situation, but I'm not sure we can do that as well with this patch, because
> in the cursor case, the FDW couldn't know in advance exactly how may rows will
> be fetched from the remote side using that cursor, so the FDW couldn't push
> down LIMIT.  So I'd like to leave that for another patch.

root->tuple_fraction (possibly derived from cursor_tuple_fraction) should be
enough to decide whether fast-startup plan should be considered. I just failed
to realize that your patch primarily aims at the support of UPPERREL_ORDERED
and UPPERREL_FINAL relations by postgres_fdw. Yes, another patch would be
needed to completely resolve the problem reported by Andrew Gierth upthread.

> > Some comments on coding:
> >
> > 0001-postgres_fdw-Perform-UPPERREL_ORDERED-step-remotely-v3.patch
> > -
> >
> > * I'm not sure if UPPERREL_ORDERD should to deal with LIMIT at all. Note 
> > that
> > grouping_planner() does not call create_limit_path() until it's done with
> > create_ordered_paths(), and create_ordered_paths() is where the FDW is
> > requested to add its paths to UPPERREL_ORDERED relation.
> 
> Maybe I'm missing something, but the 0001 patch doesn't deal with LIMIT at
> all.  I added the parameter limit_tuples to PgFdwPathExtraData so that we can
> pass limit_tuples=-1, which means no LIMIT, to cost_sort(), though.

Yes, the limit_tuples field confused me. But if cost_sort() is the only
reason, why not simply pass -1 in the 0001 patch, and use the limit_tuples
argument only in 0002? I see several other calls of cost_sort() in the core
where -1 is hard-coded.
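
For illustration (a sketch only, not code from either patch; sort_path,
input_path and pathkeys are made-up names here), such a call with no known
LIMIT looks like:

    /*
     * Sketch only: costing a sort with no LIMIT pushed down, i.e.
     * passing limit_tuples = -1, as several existing core callers do.
     */
    cost_sort(&sort_path, root, pathkeys,
              input_path->total_cost,          /* cost of producing the input */
              input_path->rows,                /* number of input tuples */
              input_path->pathtarget->width,   /* average row width */
              0.0,                             /* comparison_cost */
              work_mem,                        /* sort_mem */
              -1.0);                           /* limit_tuples: no LIMIT known */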

> > Both patches:
> > 
> >
> > One thing that makes me confused when I try to understand details is that 
> > the
> > new functions add_foreign_ordered_paths() and add_foreign_final_paths() both
> > pass "input_rel" for the "foreignrel" argument of estimate_path_cost_size():
> >
> > estimate_path_cost_size(root, input_rel, ...)
> >
> > whereas the existing add_foreign_grouping_paths() passes "grouped_rel":
> >
> > estimate_path_cost_size(root, grouped_rel, ...)
> >
> > Can you please make this consistent? If you do, you'll probably need to
> > reconsider your changes to estimate_path_cost_size().
> 
> This is intended and I think I should add more comments on that.  Let me
> explain.  These patches modified estimate_path_cost_size() so that we can
> estimate the costs of a foreign path for the UPPERREL_ORDERED or
> UPPERREL_FINAL rel, by adding the costs of the upper-relation processing (ie,
> ORDER BY and/or LIMIT/OFFSET, specified by PgFdwPathExtraData) to the costs of
> implementing the *underlying* relation for the upper relation (ie, scan, join
> or grouping rel, specified by the input_rel). So those functions call
> estimate_path_cost_size() with the input_rel, not the ordered_rel/final_rel,
> along with PgFdwPathExtraData.

I think the same already happens for the UPPERREL_GROUP_AGG relation:
estimate_path_cost_size()

...
else if (IS_UPPER_REL(foreignrel))
{
...
ofpinfo = (PgFdwRelationInfo *) fpinfo->outerrel->fdw_private;

Here "outerrel" actually means input relation from the grouping perspective.
The input relation (i.e. result of scan / join) estimates are retrieved from
"ofpinfo" and the costs of aggregation are added. Why should this be different
for other kinds of upper relations?

> > And maybe related problem: why should FDW care about the effect of
> > apply_scanjoin_target_to_paths() like you claim to do here?
> >
> > /*
> >  * If this is UPPERREL_ORDERED and/or UPPERREL_FINAL steps on the
> >  * final scan/join relation, the costs obtained from the cache
> >  * wouldn't yet contain the eval cost for the final scan/join target,
> >  * which would have been updated by apply_scanjoin_target_to_paths();
> >  * add the eval cost now.
> >  */
> > if (fpextra&&  !IS_UPPER_REL(foreignrel))
> > {
> >   

Re: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Kyotaro HORIGUCHI
At Fri, 8 Feb 2019 09:42:20 +, "Ideriha, Takeshi" 
 wrote in 
<4E72940DA2BF16479384A86D54D0988A6F41EDD1@G01JPEXMBKW04>
> >From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> >I reran the benchmark using "-S -T 30" on a server built with no assertions
> >and -O2. The numbers are the best of three successive attempts.  The patched
> >version is running with cache_target_memory = 0, cache_prune_min_age = 600 and
> >cache_entry_limit = 0, but pruning doesn't happen with this workload.
> >
> >master: 13393 tps
> >v12   : 12625 tps (-6%)
> >
> >Significant degradation is found.
> >
> >Reducing the frequency of dlist_move_tail by taking a 1ms interval between
> >two successive updates on the same entry let the degradation disappear.
> >
> >patched  : 13720 tps (+2%)
> 
> It would be good to introduce some interval.
> I followed your benchmark (initialized with scale factor=10, other options the
> same) and found the same tendency.
> These are averages of 5 trials.
> master:   7640.000538
> patch_v12:7417.981378 (3% down against master)
> patch_v13:7645.071787 (almost the same as master)

Thank you for cross checking.

> These cases are workloads where pruning doesn't happen, as you mentioned.
> I'd like to benchmark the cache-pruning case as well.
> To demonstrate the cache-pruning case, right now I'm creating hundreds of
> partitioned tables and running a select query against each of them
> using a pgbench custom file. Maybe using a small cache_prune_min_age or a
> hard limit would be better.
> Is there any good model?

As per Tomas' benchmark, it doesn't seem to hurt in that case.

> ># I'm not sure the name LRU_IGNORANCE_INTERVAL makes sense..
> How about MIN_LRU_UPDATE_INTERVAL?

Looks fine. Fixed in the next version. Thank you for the suggestion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg11.1: dsa_area could not attach to segment

2019-02-12 Thread Sergei Kornilov
Hi

> I think this is tentatively confirmed..I ran 20 loops for over 90 minutes with
> no crash when parallel_leader_participation=off.
>
> On enabling parallel_leader_participation, crash within 10min.
>
> Sergei, could you confirm ?

I still get the error with parallel_leader_participation = off. One difference is
the time: with parallel_leader_participation = on I get the error after a minute
or two; with it off, the error appeared after 20 minutes.

My desktop is:
- Debian testing with current updates, 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 
(2019-01-17) x86_64 GNU/Linux
- gcc version 8.2.0 (Debian 8.2.0-16)
- I built fresh REL_11_STABLE PostgreSQL with ./configure --enable-cassert 
--enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" 
--enable-tap-tests --prefix=/...

I can't provide dsa_dump(area) output due to recursion. With this dirty hack:

 fprintf(stderr,
-"segment bin %zu (at least %d contiguous pages 
free):\n",
-i, 1 << (i - 1));
-segment_index = area->control->segment_bins[i];
-while (segment_index != DSA_SEGMENT_INDEX_NONE)
-{
-dsa_segment_map *segment_map;
-
-segment_map =
-get_segment_by_index(area, segment_index);
-
-fprintf(stderr,
-"  segment index %zu, usable_pages = %zu, "
-"contiguous_pages = %zu, mapped at %p\n",
-segment_index,
-segment_map->header->usable_pages,
-fpm_largest(segment_map->fpm),
-segment_map->mapped_address);
-segment_index = segment_map->header->next;
-}
+"segment bin %zu (at least %d contiguous pages free), 
segment_index=%zu\n",
+i, 1 << (i - 1), area->control->segment_bins[i]);

I get this result:

dsa_area handle 0:
  max_total_segment_size: 18446744073709551615
  total_segment_size: 2105344
  refcnt: 2
  pinned: f
  segment bins:
segment bin 0 (at least -2147483648 contiguous pages free), segment_index=0
segment bin 3 (at least 4 contiguous pages free), segment_index=1
segment bin 8 (at least 128 contiguous pages free), segment_index=2
  pools:
pool for blocks of span objects:
  fullness class 0 is empty
  fullness class 1:
span descriptor at 01001000, superblock at 01001000, 
pages = 1, objects free = 54/72
  fullness class 2 is empty
  fullness class 3 is empty
pool for large object spans:
  fullness class 0 is empty
  fullness class 1:
span descriptor at 010013b8, superblock at 02009000, 
pages = 8, objects free = 0/0
span descriptor at 01001380, superblock at 02001000, 
pages = 8, objects free = 0/0
span descriptor at 01001348, superblock at 010f2000, 
pages = 8, objects free = 0/0
span descriptor at 01001310, superblock at 010ea000, 
pages = 8, objects free = 0/0
span descriptor at 010012d8, superblock at 010e2000, 
pages = 8, objects free = 0/0
span descriptor at 010012a0, superblock at 010da000, 
pages = 8, objects free = 0/0
span descriptor at 01001268, superblock at 010d2000, 
pages = 8, objects free = 0/0
span descriptor at 01001230, superblock at 010ca000, 
pages = 8, objects free = 0/0
span descriptor at 010011f8, superblock at 010c2000, 
pages = 8, objects free = 0/0
span descriptor at 010011c0, superblock at 010ba000, 
pages = 8, objects free = 0/0
span descriptor at 01001188, superblock at 010b2000, 
pages = 8, objects free = 0/0
span descriptor at 01001150, superblock at 010aa000, 
pages = 8, objects free = 0/0
span descriptor at 01001118, superblock at 010a2000, 
pages = 8, objects free = 0/0
span descriptor at 010010e0, superblock at 0109a000, 
pages = 8, objects free = 0/0
span descriptor at 010010a8, superblock at 01092000, 
pages = 8, objects free = 0/0
span descriptor at 01001070, superblock at 01012000, 
pages = 128, objects free = 0/0
  fullness class 2 is empty
  fullness class 3 is empty
pool for size class 32 (object size 3640 bytes):
  fullness class 0 is empty
  fullness class 1:
span descriptor at 01001038, superblock at 01002000, 
pages = 16, objects free = 17/18
  fullness class 2 is empty
  fullness class 3 is empty

"at least -2147483648" seems surprise.


melkij@melkij:~$ LANG=C ls -lt /dev/shm
total 0
-rw--- 1 melkij melkij 4194304 Feb 12 11:56 PostgreSQL.1822959854

Only one segment; restart_after_crash = off and no other PostgreSQL instances 
are running.

regards, Sergei



RE: [PROPOSAL]a new data type 'bytea' for ECPG

2019-02-12 Thread Matsumura, Ryo
Meskes-san

Thank you for your review.

> There is one thing that I don't understand right now. YOu
> change ecpg_store_input() to handle the bytea data type, yet you also
> change ECPGset_desc() to not use ecpg_store_input() in case of an
> bytea. This looks odd to me. Can you please explain that to me?

I'll try to explain below, and I would like to receive your comments.

The current architecture of descriptor data copying walks through the 
following path.
The important point is that it goes through ecpg_store_input() twice.

 step 1. ECPGset_desc
   Store to descriptor_item with ecpg_store_input().

 step 2. ecpg_build_params(setup for tobeinserted)
   Store to tobeinserted with ecpg_store_input().

 step 3. ecpg_build_params(building stmt->param*)
   Set tobeinserted to stmt->paramvalues.

On the other hand, the stmt->param*-building part of ecpg_build_params()
for bytea needs two pieces of information, is_binary and binary_length.
But data copying with ecpg_store_input() loses them.

There are two ideas for passing the information to the stmt->param*-building
part of ecpg_build_params(), but they are the same in that each copies the
data without ecpg_store_input() at least once.
I selected Idea-1. (A rough sketch of Idea-1's step 1 follows the two
outlines below.)

Idea-1.

 step 1. ECPGset_desc
   Set descriptor_item.is_binary.
   Memcpy both bytea.length and bytea.array to descriptor_item.data.

 step 2. ecpg_build_params(setup for tobeinserted)
   Store bytea.array to tobeinserted with ecpg_store_input(bytea route).
   Set is_binary(local) from descriptor_item.is_binary.
   Set binary_length(local) from descriptor_item.data.

 step 3. ecpg_build_params(building stmt->param*)
   Set stmt->paramvalues from tobeinserted.
   Set stmt->formats from is_binary(local).
   Set stmt->lengths from binary_length(local).


Idea-2.

 step 1. ECPGset_desc
   Set descriptor_item.is_binary.
   Set bytea.length to descriptor_item.data_len. (different!)
   Set bytea.array to descriptor_item.data. (different!)

 step 2. ecpg_build_params(setup for tobeinserted)
   Memcpy bytea.array to tobeinserted by using alloc and memcpy, without 
store_input. (different!)
   Set is_binary(local) from descriptor_item.is_binary.
   Set binary_length(local) from descriptor_item.data_len. (different!)

 step 3. ecpg_build_params(building stmt->param*)
   Set stmt->paramvalues with tobeinserted.
   Set stmt->formats from is_binary(local).
   Set stmt->lengths from binary_length(local).
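
To make Idea-1's step 1 concrete, here is a rough sketch (illustration only;
the struct layouts and set_desc_bytea() below are assumptions, not the real
ECPG definitions):

    /*
     * Sketch of Idea-1, step 1: pack both the length and the payload of a
     * bytea host variable into descriptor_item.data, so that step 2 can
     * recover binary_length later.  These struct layouts are assumptions
     * for illustration only, not the real ECPG definitions.
     */
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    struct bytea_var
    {
        long    len;            /* number of bytes in arr */
        char    arr[1];         /* payload follows */
    };

    struct descriptor_item
    {
        bool    is_binary;
        char   *data;           /* here: length word followed by payload */
    };

    static bool
    set_desc_bytea(struct descriptor_item *item, const struct bytea_var *var)
    {
        /* assumes arr immediately follows len (no padding) */
        size_t  nbytes = sizeof(var->len) + var->len;

        item->is_binary = true;
        item->data = malloc(nbytes);
        if (item->data == NULL)
            return false;

        /* copy the length word first, then the binary payload */
        memcpy(item->data, var, nbytes);
        return true;
    }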

Regards
Ryo Matsumura 



Re: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Kyotaro HORIGUCHI
Thank you for testing and the commits, Tomas.

At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra  
wrote in <74386116-0bc5-84f2-e614-0cff19aca...@2ndquadrant.com>
> On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
> > At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20190207.152418.139132570.horiguchi.kyot...@lab.ntt.co.jp>
> I've done a bunch of benchmarks on v13, and I don't see any serious
> regression either. Each test creates a number of tables (100, 1k, 10k,
> 100k and 1M) and then runs SELECT queries on them. The tables are
> accessed randomly - with either uniform or exponential distribution. For
> each combination there are 5 runs, 60 seconds each (see the attached
> shell scripts, it should be pretty obvious).
> 
> I've done the tests on two different machines - small one (i5 with 8GB
> of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
> almost exactly the same (with the exception of 1M tables, which does not
> fit into RAM on the smaller one).
> 
> On the xeon, the results (throughput compared to master) look like this:
> 
> 
> uniform          100     1000    10000   100000  1000000
> ---------------------------------------------------------
> v13           105.04%  100.28%  102.96%  102.11%  101.54%
> v13 (nodata)   97.05%   98.30%   97.42%   96.60%  107.55%
> 
> 
> exponential      100     1000    10000   100000  1000000
> ---------------------------------------------------------
> v13           100.04%  103.48%  101.70%   98.56%  103.20%
> v13 (nodata)   97.12%   98.43%   98.86%   98.48%  104.94%
> 
> The "nodata" case means the tables were empty (so no files created),
> while in the other case each table contained 1 row.
> 
> Per the results it's mostly break even, and in some cases there is
> actually a measurable improvement.

Great! I guess it comes from reduced size of hash?

> That being said, the question is whether the patch actually reduces
> memory usage in a useful way - that's not something this benchmark
> validates. I plan to modify the tests to make pgbench script
> time-dependent (i.e. to pick a subset of tables depending on time).

Thank you.

> A couple of things I've happened to notice during a quick review:
> 
> 1) The sgml docs in 0002 talk about "syscache_memory_target" and
> "syscache_prune_min_age", but those options were renamed to just
> "cache_memory_target" and "cache_prune_min_age".

I'm at a loss as to what to call the syscache for users. I think it is the
"catalog cache". The most basic component is called catcache, which is
covered by the syscache layer; neither of them is revealed to
users, and it is shown to users as the "catalog cache".

Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if it
exists) "catalog_cache_entry_limit" and
"catalog_cache_prune_ratio" make sense?

> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
> defined three times in guc.c for some reason.

It is just a PoC, added to show how it looks. (The multiple
instances must be a result of a convulsion of my fingers..) I
think this is not useful unless it can be specified on a per-relation
or per-cache basis. I'll remove the GUC and add reloptions for
the purpose. (But it won't work for pg_class and pg_attribute
for now).

> 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
> of just using two bool variables prune_by_age and prune_by_number doing
> the same thing.

Agreed. It's a kind of memory stinginess, which is useless there.

> 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
> pretty much mean long-running statements will set the lastaccess to very
> old timestamp? Also, it means that long-running statements (like a PL
> function accessing a bunch of tables) won't do any eviction at all, no?
> AFAICS we'll set the timestamp only once, at the very beginning.
> 
> I wonder whether using some other timestamp source (like a timestamp
> updated regularly from a timer, or something like that).

I didn't consider planning that happens within a function. If
5min is the default for catalog_cache_prune_min_age, 10% of it
(30s) seems enough, and gettimeofday() at such intervals wouldn't
affect foreground jobs. I'd choose catalog_cache_prune_min_age/10 rather
than a fixed value of 30s, with 1s as the minimum.

I observed significant degradation from setting up a timer at every
statement start. The patch does the following to get rid of
the degradation (a rough sketch follows the list):

(1) Every statement updates the catcache timestamp as it currently
does.  (SetCatCacheClock)

(2) The timestamp is also updated periodically using a timer,
   separately from (1). The timer is started at (1) if it is not
   running yet.  (SetCatCacheClock, UpdateCatCacheClock)

(3) Statement end and transaction end don't stop the timer, to
   avoid the overhead of setting up a timer. (

(4) But it is stopped on error. I chose not to change the behavior in
PostgresMain that kills all timers on error.

(5) Also changing the GUC cat
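
To make (1) and (2) concrete, a rough sketch -- not the actual patch; the
variable names, set_catcache_clock()/update_catcache_clock() and the
commented-out timer calls are only illustrative (the real patch uses
SetCatCacheClock()/UpdateCatCacheClock()):

    /*
     * Rough sketch of the clock scheme in (1) and (2) above, not the patch.
     */
    #include "postgres.h"
    #include "utils/timestamp.h"

    static TimestampTz catcacheclock;            /* timestamp used for LRU checks */
    static bool catcache_clock_timer_on = false;

    /* (1) called at statement start; no gettimeofday() of its own */
    static inline void
    set_catcache_clock(TimestampTz stmt_start_ts)
    {
        catcacheclock = stmt_start_ts;

        if (!catcache_clock_timer_on)
        {
            /* (2) arm a periodic timer, e.g. catalog_cache_prune_min_age / 10 */
            catcache_clock_timer_on = true;
            /* enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, interval_ms); */
        }
    }

    /* (2) timer callback: refresh the clock during long-running statements */
    static void
    update_catcache_clock(void)
    {
        catcacheclock = GetCurrentTimestamp();
        /* re-arm: enable_timeout_after(CATCACHE_CLOCK_TIMEOUT, interval_ms); */
    }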

Re: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Kyotaro HORIGUCHI
At Tue, 12 Feb 2019 01:02:39 +, "Tsunakawa, Takayuki" 
 wrote in 
<0A3221C70F24FB45833433255569204D1FB972A6@G01JPEXMBYT05>
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > Reducing the frequency of dlist_move_tail by taking a 1ms interval between
> > two successive updates on the same entry let the degradation disappear.
> > 
> > patched  : 13720 tps (+2%)
> 
> What do you think contributed to this performance increase?  Or do you think 
> this is just a measurement variation?
> 
> Most of my previous comments also seem to apply to v13, so let me repost them 
> below:
> 
> 
> (1)
> 
> (1)
> +/* GUC variable to define the minimum age of entries that will be cosidered 
> to
> + /* initilize catcache reference clock if haven't done yet */
> 
> cosidered -> considered
> initilize -> initialize

Fixed. I found "databsae", "temprary", "resturns",
"If'force'"(missing space), "aginst", "maintan". And all fixed.

> I remember I saw some other wrong spelling and/or missing words, which I 
> forgot (sorry).

Thank you for pointing some of them.

> (2)
> Only the doc prefixes "sys" to the new parameter names.  Other places don't 
> have it.  I think we should prefix sys, because relcache and plancache should 
> be configurable separately because of their different usage 
> patterns/lifecycle.

I tend to agree. They are already removed in this patchset. The
names are changed to "catalog_cache_*" in the new version.

> (3)
> The doc doesn't describe the unit of syscache_memory_target.  Kilobytes?

Added.

> (4)
> + hash_size = cp->cc_nbuckets * sizeof(dlist_head);
> + tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
> + tupsize = sizeof(CatCTup);
> 
> GetMemoryChunkSpace() should be used to include the memory context overhead.  
> That's what the files in src/backend/utils/sort/ do.

Thanks. Done. It now includes the bucket and cache header parts but still
excludes the clist.  Renamed tupsize to memusage.

> (5)
> + if (entry_age > cache_prune_min_age)
> 
> ">=" instead of ">"?

I didn't think it was a serious issue, but it is better. Fixed.

> (6)
> + if (!ct->c_list || ct->c_list->refcount 
> == 0)
> + {
> + CatCacheRemoveCTup(cp, ct);
> 
> It's better to write "ct->c_list == NULL" to follow the style in this file.
> 
> "ct->refcount == 0" should also be checked prior to removing the catcache 
> tuple, just in case the tuple hasn't been released for a long time, which 
> might hardly happen.

Yeah, I fixed it in v12. This no longer removes an entry in
use. (if (c_list) is used in the file.)

> (7)
> CatalogCacheCreateEntry
> 
> + int tupsize = 0;
>   if (ntp)
>   {
>   int i;
> + int tupsize;
> 
> tupsize is defined twice.

The second tupsize was bogus, but the first is removed in this
version. Now memory usage of an entry is calculated as a chunk
size.

> (8)
> CatalogCacheCreateEntry
> 
> In the negative entry case, the memory allocated by CatCacheCopyKeys() is not 
> counted.  I'm afraid that's not negligible.

Right. Fixed.

> (9)
> The memory for CatCList is not taken into account for syscache_memory_target.

Yeah, this is intentional since CatCacheList is short-lived. A comment was added.

| * Don't waste a time by counting the list in catcache memory usage,
| * since a list doesn't persist for a long time
| */
|cl = (CatCList *)
|  palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));


Please find the attached new version v14, addressing
Tomas', Ideriha-san's and your comments.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/3] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
 	dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+	/* fast path if it's already at the tail */
+	if (head->head.prev == node)
+		return;
+
+	dlist_delete(node);
+	dlist_push_tail(head, node);
+
+	dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

>From 5031833af1777c4c6a6bf8daf32b6a3f428c

Re: Problems with plan estimates in postgres_fdw

2019-02-12 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/02/08 21:35), Etsuro Fujita wrote:
> > (2019/02/08 2:04), Antonin Houska wrote:
> >> * add_foreign_ordered_paths()
> >>
> >> Exprssion root->upper_targets[UPPERREL_FINAL] is used to access the
> >> target.
> >>
> >> I think create_ordered_paths() should actually set the target so that
> >> postgresGetForeignUpperPaths() can simply find it in
> >> output_rel->reltarget. This is how postgres_fdw already retrieves the
> >> target
> >> for grouped paths. Small patch is attached that makes this possible.
> >
> > Seems like a good idea. Thanks for the patch! Will review.
> 
> Here are my review comments:
> 
>   root->upper_targets[UPPERREL_FINAL] = final_target;
> + root->upper_targets[UPPERREL_ORDERED] = final_target;
>   root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
>   root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;
> 
> I think that's a good idea.  I think we'll soon need the PathTarget for
> UPPERREL_DISTINCT, so how about saving that as well like this?
> 
> root->upper_targets[UPPERREL_FINAL] = final_target;
> +   root->upper_targets[UPPERREL_ORDERED] = final_target;
> +   root->upper_targets[UPPERREL_DISTINCT] = sort_input_target;
> root->upper_targets[UPPERREL_WINDOW] = sort_input_target;
> root->upper_targets[UPPERREL_GROUP_AGG] = grouping_target;

I see nothing wrong with this.

> Another is about this:
> 
>   /*
> +  * Set reltarget so that it's consistent with the paths. Also it's more
> +  * convenient for FDW to find the target here.
> +  */
> + ordered_rel->reltarget = target;
> 
> I think this is correct if there are no SRFs in the targetlist, but otherwise
> I'm not sure this is really correct because the relation's reltarget would not
> match the target of paths added to the relation after adjust_paths_for_srfs().

I wouldn't expect a problem here. create_grouping_paths() also sets the
reltarget, although adjust_paths_for_srfs() can be called on grouped_rel
afterwards.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: Logical replication and restore from pg_basebackup

2019-02-12 Thread Alexey Kondratov

Hi Dmitry,

On 11.02.2019 17:39, Dmitry Vasiliev wrote:
What is the scope of logical replication if I cannot make recovery 
from pg_basebackup?



No, you can, but there are some things to keep in mind:

1) I could be wrong, but using pgbench in such a test seems to be a 
bad idea, since it drops and creates tables from scratch when -i is 
passed. However, if I recall correctly, pub/sub slots use relation 
OIDs, so I expect that you would get only the initial sync data on the 
replica and the latest pgbench results on the master.


2) Next, the 'srsubstate' check works only for the initial sync. After that 
you should poll the master's replication slot LSN for 'pg_current_wal_lsn() <= 
replay_lsn'.


Please find attached a slightly modified version of your test (also in the gist 
[1]), which works just fine. You should replace %username% with your 
current username, since I did not run it as the postgres user.


[1] https://gist.github.com/ololobus/a8a11f11eb67dfa1b6a95bff5e8f0096


Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



logical-replication-test.sh
Description: application/shellscript


Re: Connection slots reserved for replication

2019-02-12 Thread Oleksii Kliukin



> On 12. Feb 2019, at 05:12, Michael Paquier  wrote:
> 
> On Mon, Feb 11, 2019 at 08:57:41PM -0700, Kevin Hale Boyes wrote:
>> I think there's a small problem with the commit.
>> The position of xlrec.max_wal_senders (line 117) should be below
>> max_worker_processes.
> 
> Fixed, thanks!

Wonderful, thank you very much for committing it, and thanks to everyone 
involved for getting it through!

Cheers,
Oleksii


Re: Planning counters in pg_stat_statements (using pgss_store)

2019-02-12 Thread Sergei Kornilov
Hello

Thank you for picking this up! Did you register the patch in the CF app? I did 
not find an entry.

Currently we have pg_stat_statements version 1.7 and this patch does not 
apply... A quick, brief review:

> -  errmsg("could not read file \"%s\": %m",
> +  errmsg("could not read pg_stat_statement file \"%s\": 
> %m",

Not sure this is needed for this patch. Usually refactoring and new features are 
different topics.

> +#define PG_STAT_STATEMENTS_COLS_V1_4 25

Shouldn't this be the actual version? I think the version in the names corresponds 
to the extension version.

And this patch does not have documentation changes.

> "I agree with the sentiment on the old thread that
> {total,min,max,mean,stddev}_time now seem badly named, but adding
> execution makes them so long...  Thoughts?"
>
> What would you think about:
> - userid
> - dbid
> - queryid
> - query
> - plans
> - plan_time
> - {min,max,mean,stddev}_plan_time
> - calls
> - exec_time
> - {min,max,mean,stddev}_exec_time
> - total_time (being the sum of plan_time and exec_time)
> - rows
> - ...

Do we have consensus about backward-incompatible changes in this function? 
The *plan_time + *exec_time naming is OK with me.

regards, Sergei



Re: Too rigorous assert in reorderbuffer.c

2019-02-12 Thread Alvaro Herrera
On 2019-Feb-11, Alvaro Herrera wrote:

> On 2019-Feb-07, Arseny Sher wrote:
> 
> > Alvaro Herrera  writes:
> 
> > > Ah, okay.  Does the test still fail when run without the code fix?
> > 
> > Yes. The problem here is overriding cmax of catalog (pg_attribute in the
> > test) tuples, so it fails without any data at all.
> 
> Makes sense.
> 
> I thought the blanket removal of the assert() was excessive, and we can
> relax it instead; what do you think of the attached?

More precisely, my question was: with this change, does the code still
work correctly in your non-toy case?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15548: Unaccent does not remove combining diacritical characters

2019-02-12 Thread Ramanarayana
Hi Michael,
The issue was that the Python script was working in Python 2 but not in
Python 3 on Windows. This is because the script writes its final output to
stdout, and the stdout encoding is set to utf-8 only for Python 2, not for
Python 3. If no encoding is set for stdout, it takes the encoding from the
operating system, and the default encodings on Linux and Windows might
differ. Hence this issue.
Regards,
Ram.

On Tue, 12 Feb 2019 at 09:48, Michael Paquier  wrote:

> On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
> > I tested the script in python 2.7 and it works perfect. The problem is in
> > python 3.7(and may be only in windows as you were not getting the issue)
> > and I was getting the following error
> >
> > UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
> > position 0: character maps to 
> >
> >  I went through the python script and found that the stdout encoding is
> set
> > to utf-8 only  if python version is <=2.
> >
> > I have made the same change for python version 3 as well. Please find the
> > patch for the same.Let me know if it makes sense
>
> Isn't that because Windows encoding becomes cp1252, utf16 or such?
> FWIW, on Debian SID with Python 3.7, I get the correct output, and no
> diffs on HEAD.  Perhaps it would make sense to use open() on the
> different files with encoding='utf-8' to avoid any kind of problems?
> --
> Michael
>


-- 
Cheers
Ram 4.0


Re: libpq compression

2019-02-12 Thread Andreas Karlsson

On 2/11/19 5:28 PM, Peter Eisentraut wrote:

One mitigation is to not write code like that, that is, don't put secret
parameters and user-supplied content into the same to-be-compressed
chunk, or at least don't let the end user run that code at will in a
tight loop.

The difference in CRIME is that the attacker supplied the code.  You'd
trick the user to go to http://evil.com/ (via spam), and that site
automatically runs JavaScript code in the user's browser that contacts
https://bank.com/, which will then automatically send along any secret
cookies the user had previously saved from bank.com.  The evil
JavaScript code can then stuff the requests to bank.com with arbitrary
bytes and run the requests in a tight loop, only subject to rate
controls at bank.com.


Right, CRIME is worse since it cannot be mitigated by the application 
developer. But even so I do not think that my query is that odd. I do 
not think that it is obvious to most application developers that putting 
user-supplied data close to sensitive data is potentially dangerous.


Will this attack ever be useful in practice? No idea, but I think we 
should be aware of what risks we expose our end users to.


Andreas



Re: [PROPOSAL]a new data type 'bytea' for ECPG

2019-02-12 Thread Michael Meskes
Matsumura-san,

> I try to explain as follows. I would like to receive your comment.
> ...

I'm afraid I don't really understand your explanation. Why is handling
a bytea so different from handling a varchar? I can see some
differences due to its binary nature, but I do not understand why it
needs so much special handling for stuff like its length? There is a
length field in the structure but instead of using it the data field is
used to store both the length and the data. What am I missing?

Please keep in mind that I did not write the descriptor code, so I may
very well not see the obvious.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: use Getopt::Long for catalog scripts

2019-02-12 Thread Alvaro Herrera
On 2019-Feb-08, John Naylor wrote:

> On Thu, Feb 7, 2019 at 6:53 PM David Fetter  wrote:
> > [some style suggestions]
> 
> I've included these in v2, and also some additional tweaks for consistency.

I noticed that because we have this line in the scripts,
  BEGIN  { use lib File::Spec->rel2abs(dirname(__FILE__)); }
the -I switch to Perl is no longer needed in the makefiles.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: use Getopt::Long for catalog scripts

2019-02-12 Thread Alvaro Herrera
On 2019-Feb-12, Alvaro Herrera wrote:

> On 2019-Feb-08, John Naylor wrote:
> 
> > On Thu, Feb 7, 2019 at 6:53 PM David Fetter  wrote:
> > > [some style suggestions]
> > 
> > I've included these in v2, and also some additional tweaks for consistency.
> 
> I noticed that because we have this line in the scripts,
>   BEGIN  { use lib File::Spec->rel2abs(dirname(__FILE__)); }
> the -I switch to Perl is no longer needed in the makefiles.

Pushed, thanks.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Should we still have old release notes in docs?

2019-02-12 Thread Greg Stark
I was just perusing our PDF docs for the first time in ages and
realized that of the 3,400+ pages of docs there's about 1,000 pages of
release notes in it. That seems like a bit of overkill.

I love having the old release notes online but perhaps they can be
somewhere other than the main docs? We could limit the current docs to
including the release notes for just the supported versions -- after
all you can always get the old release notes in the old docs
themselves.

-- 
greg



Re: Should we still have old release notes in docs?

2019-02-12 Thread Andres Freund
Hi,

On 2019-02-12 11:00:51 -0500, Greg Stark wrote:
> I was just perusing our PDF docs for the first time in ages and
> realized that of the 3,400+ pages of docs there's about 1,000 pages of
> release notes in it. That seems like a bit of overkill.
> 
> I love having the old release notes online but perhaps they can be
> somewhere other than the main docs? We could limit the current docs to
> including the release notes for just the supported versions -- after
> all you can always get the old release notes in the old docs
> themselves.

You're behind the times, that just happened (in a pretty uncoordinated
manner).

Greetings,

Andres Freund



Re: Too rigorous assert in reorderbuffer.c

2019-02-12 Thread Arseny Sher


Alvaro Herrera  writes:

>> I thought the blanket removal of the assert() was excessive, and we can
>> relax it instead; what do you think of the attached?
>
> More precisely, my question was: with this change, does the code still
> work correctly in your non-toy case?

Yes, it works. I thought for a moment that some obscure cases where cmax
on a single tuple is not strictly monotonic might exist, but it looks like
they don't. So your change is OK with me; reshaping the assert is better than
removing it. make check is also good on all supported branches.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Should we still have old release notes in docs?

2019-02-12 Thread Tom Lane
Andres Freund  writes:
> On 2019-02-12 11:00:51 -0500, Greg Stark wrote:
>> I love having the old release notes online but perhaps they can be
>> somewhere other than the main docs? We could limit the current docs to
>> including the release notes for just the supported versions -- after
>> all you can always get the old release notes in the old docs
>> themselves.

> You're behind the times, that just happened (in a pretty uncoordinated
> manner).

Yeah, see 527b5ed1a et al.

The part about having a unified release-note archive somewhere else is
still WIP.  The ball is in the web team's court on that, I think.

regards, tom lane



Re: use Getopt::Long for catalog scripts

2019-02-12 Thread John Naylor
On 2/12/19, Alvaro Herrera  wrote:
> Pushed, thanks.

Thank you.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: BUG #15548: Unaccent does not remove combining diacritical characters

2019-02-12 Thread Hugh Ranalli
On Tue, 12 Feb 2019 at 08:54, Ramanarayana  wrote:

> Hi Michael,
> The issue was that the python script was working in python 2 but not in
> python 3 in Windows. This is because the python script writes the final
> output to stdout and stdout encoding is set to utf-8 only for python 2 but
> not python 3.If no encoding is set for stdout it takes the encoding from
> the Operating system.Default encoding in linux and windows might be
> different.Hence this issue.
> Regards,
> Ram.
>
> On Tue, 12 Feb 2019 at 09:48, Michael Paquier  wrote:
>
>> On Tue, Feb 12, 2019 at 02:27:31AM +0530, Ramanarayana wrote:
>> > I tested the script in python 2.7 and it works perfect. The problem is
>> in
>> > python 3.7(and may be only in windows as you were not getting the issue)
>> > and I was getting the following error
>> >
>> > UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in
>> > position 0: character maps to 
>> >
>> >  I went through the python script and found that the stdout encoding is
>> set
>> > to utf-8 only  if python version is <=2.
>> >
>> > I have made the same change for python version 3 as well. Please find
>> the
>> > patch for the same.Let me know if it makes sense
>>
>> Isn't that because Windows encoding becomes cp1252, utf16 or such?
>> FWIW, on Debian SID with Python 3.7, I get the correct output, and no
>> diffs on HEAD.  Perhaps it would make sense to use open() on the
>> different files with encoding='utf-8' to avoid any kind of problems?
>> --
>> Michael
>
>
I can't look at this today, but I will fire up Windows and Python tomorrow,
look at Ram's patch, and see what is going on. I'll also look at how we
open the input files, to see if we should supply an encoding. Those input
files will only make sense in UTF-8 anyway.

Ram, thanks for catching this issue.

Hugh


Re: Should we still have old release notes in docs?

2019-02-12 Thread Dave Page



> On 12 Feb 2019, at 16:00, Greg Stark  wrote:
> 
> I was just perusing our PDF docs for the first time in ages and
> realized that of the 3,400+ pages of docs there's about 1,000 pages of
> release notes in it. That seems like a bit of overkill.
> 
> I love having the old release notes online but perhaps they can be
> somewhere other than the main docs? We could limit the current docs to
> including the release notes for just the supported versions -- after
> all you can always get the old release notes in the old docs
> themselves

+1. It does seem excessive.



Re: Should we still have old release notes in docs?

2019-02-12 Thread Jonathan S. Katz


> On Feb 12, 2019, at 5:12 PM, Tom Lane  wrote:
> 
> The part about having a unified release-note archive somewhere else is
> still WIP.  The ball is in the web team's court on that, I think.

Yes - we are going to try one thing with the existing way we load docs, to 
avoid additional structural changes. If that doesn't work, we will reconvene
and see what we need to do.

Thanks,

Jonathan




Re: Should we still have old release notes in docs?

2019-02-12 Thread Thomas Kellerer

Tom Lane wrote on 12.02.2019 at 17:12:

Yeah, see 527b5ed1a et al.

The part about having a unified release-note archive somewhere else is
still WIP.  The ball is in the web team's court on that, I think.


The Bucardo team has already done that:

https://bucardo.org/postgres_all_versions.html





Re: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Tomas Vondra


On 2/12/19 12:35 PM, Kyotaro HORIGUCHI wrote:
> Thank you for testing and the commits, Tomas.
> 
> At Sat, 9 Feb 2019 19:09:59 +0100, Tomas Vondra 
>  wrote in 
> <74386116-0bc5-84f2-e614-0cff19aca...@2ndquadrant.com>
>> On 2/7/19 1:18 PM, Kyotaro HORIGUCHI wrote:
>>> At Thu, 07 Feb 2019 15:24:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>>>  wrote in 
>>> <20190207.152418.139132570.horiguchi.kyot...@lab.ntt.co.jp>
>> I've done a bunch of benchmarks on v13, and I don't see any serious
>> regression either. Each test creates a number of tables (100, 1k, 10k,
>> 100k and 1M) and then runs SELECT queries on them. The tables are
>> accessed randomly - with either uniform or exponential distribution. For
>> each combination there are 5 runs, 60 seconds each (see the attached
>> shell scripts, it should be pretty obvious).
>>
>> I've done the tests on two different machines - small one (i5 with 8GB
>> of RAM) and large one (e5-2620v4 with 64GB RAM), but the behavior is
>> almost exactly the same (with the exception of 1M tables, which does not
>> fit into RAM on the smaller one).
>>
>> On the xeon, the results (throughput compared to master) look like this:
>>
>>
>> uniform          100     1000    10000   100000  1000000
>> ---------------------------------------------------------
>> v13           105.04%  100.28%  102.96%  102.11%  101.54%
>> v13 (nodata)   97.05%   98.30%   97.42%   96.60%  107.55%
>>
>>
>> exponential      100     1000    10000   100000  1000000
>> ---------------------------------------------------------
>> v13           100.04%  103.48%  101.70%   98.56%  103.20%
>> v13 (nodata)   97.12%   98.43%   98.86%   98.48%  104.94%
>>
>> The "nodata" case means the tables were empty (so no files created),
>> while in the other case each table contained 1 row.
>>
>> Per the results it's mostly break even, and in some cases there is
>> actually a measurable improvement.
> 
> Great! I guess it comes from reduced size of hash?
> 

Not sure about that. I haven't actually verified that it reduces the
cache size at all - I was measuring the overhead of the extra work. And
I don't think the syscache actually shrunk significantly, because the
throughput was quite high (~15-30k tps, IIRC) so pretty much everything
was touched within the default 600 seconds.

>> That being said, the question is whether the patch actually reduces
>> memory usage in a useful way - that's not something this benchmark
>> validates. I plan to modify the tests to make pgbench script
>> time-dependent (i.e. to pick a subset of tables depending on time).
> 
> Thank you.
> 
>> A couple of things I've happened to notice during a quick review:
>>
>> 1) The sgml docs in 0002 talk about "syscache_memory_target" and
>> "syscache_prune_min_age", but those options were renamed to just
>> "cache_memory_target" and "cache_prune_min_age".
> 
> I'm at a loss as to what to call the syscache for users. I think it is the
> "catalog cache". The most basic component is called catcache, which is
> covered by the syscache layer; neither of them is revealed to
> users, and it is shown to users as the "catalog cache".
> 
> Do "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if it
> exists) "catalog_cache_entry_limit" and
> "catalog_cache_prune_ratio" make sense?
> 

I think "catalog_cache" sounds about right, although my point was simply
that there's a discrepancy between sgml docs and code.

>> 2) "cache_entry_limit" is not mentioned in sgml docs at all, and it's
>> defined three times in guc.c for some reason.
> 
> It is just a PoC, added to show how it looks. (The multiple
> instances must be a result of a convulsion of my fingers..) I
> think this is not useful unless it can be specified on a per-relation
> or per-cache basis. I'll remove the GUC and add reloptions for
> the purpose. (But it won't work for pg_class and pg_attribute
> for now).
> 

OK, although I'd just keep it as simple as possible. TBH I can't really
imagine users tuning limits for individual caches in any meaningful way.

>> 3) I don't see why to define PRUNE_BY_AGE and PRUNE_BY_NUMBER, instead
>> of just using two bool variables prune_by_age and prune_by_number doing
>> the same thing.
> 
> Agreed. It's a kind of memory stinginess, which is useless there.
> 
>> 4) I'm not entirely sure about using stmtStartTimestamp. Doesn't that
>> pretty much mean long-running statements will set the lastaccess to very
>> old timestamp? Also, it means that long-running statements (like a PL
>> function accessing a bunch of tables) won't do any eviction at all, no?
>> AFAICS we'll set the timestamp only once, at the very beginning.
>>
>> I wonder whether using some other timestamp source (like a timestamp
>> updated regularly from a timer, or something like that).
> 
> I didn't consider planning that happens within a function. If
> 5min is the default for catalog_cache_prune_min_age, 10% of it
> (30s) seems enough, and gettimeofday() at such intervals wouldn't
> affect

Re: [PATCH] xlogreader: do not read a file block twice

2019-02-12 Thread Andrey Lepikhov




On 11.02.2019 21:25, Arthur Zakirov wrote:

Hello hackers,

Grigory noticed that one of our utilities has very slow performance when 
xlogreader reads zlib archives. We found out that xlogreader sometimes 
reads a WAL file block twice.


zlib has slow performance when you read an archive out of sequential 
order. I think reading a block twice at the same position isn't sequential, 
because gzread() moves the current position forward and the next gzseek() 
call to the same position moves it back.


It seems that the attached patch solves the issue. I think when reqLen 
== state->readLen the requested block already is in the xlogreader's 
buffer.


What do you think?

I looked at the history of the code changes:

---
7fcbf6a405f (Alvaro Herrera 2013-01-16 16:12:53 -0300 539)      reqLen < state->readLen)

1bb2558046c (Heikki Linnakangas 2010-01-27 15:27:51 + 9349)     targetPageOff == readOff && targetRecOff < readLen)

eaef111396e (Tom Lane 2006-04-03 23:35:05 + 3842)      len = XLOG_BLCKSZ - RecPtr->xrecoff % XLOG_BLCKSZ;
4d14fe0048c (Tom Lane 2001-03-13 01:17:06 + 3843)      if (total_len > len)
---


In Tom Lane's original code, the condition (total_len > len) caused a 
page reread from disk. As I understand it, this is equivalent to your 
proposal.
The code line in commit 1bb2558046c looks equivalent to the corresponding 
line in commit 7fcbf6a405f but has different semantics: the targetPageOff 
value can't be greater than or equal to XLOG_BLCKSZ, but the reqLen value 
can be. That may be how the possible mistake introduced by commit 
7fcbf6a405f appeared.
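
For reference, a simplified sketch of the cache check in ReadPageInternal()
being discussed -- based on the description above, not on the patch text
itself:

    /*
     * Simplified sketch, not the patch text: reuse the already-buffered
     * page when the request targets the same page and enough of it has
     * been read.  The proposal amounts to also treating
     * reqLen == state->readLen as "already read".
     */
    if (targetSegNo == state->readSegNo &&
        targetPageOff == state->readOff &&
        reqLen <= state->readLen)   /* the current code uses '<' here */
        return state->readLen;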


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company



Re: [PATCH v20] GSSAPI encryption support

2019-02-12 Thread Robbie Harwood
Stephen Frost  writes:

> * Robbie Harwood (rharw...@redhat.com) wrote:
>> Stephen Frost  writes:
>>> * Robbie Harwood (rharw...@redhat.com) wrote:
>>>
 Attached please find version 20 of the GSSAPI encryption support.
 This has been rebased onto master (thanks Stephen for calling out
 ab69ea9).
>>>
>>> I've looked over this again and have been playing with it off-and-on
>>> for a while now.  One thing I was actually looking at implementing
>>> before we push to commit this was to have similar information to
>>> what SSL provides on connection, specifically what printSSLInfo()
>>> prints by way of cipher, bits, etc for the client-side and then
>>> something like pg_stat_ssl for the server side.
>>>
>>> I went ahead and added a printGSSENCInfo(), and then
>>> PQgetgssSASLSSF(), which calls down into
>>> gss_inquire_sec_context_by_oid() for GSS_C_SEC_CONTEXT_SASL_SSF to
>>> get the bits (which also required including gssapi_ext.h), and now
>>> have psql producing:
>> 
>> Also, this is an MIT-specific extension: Heimdal doesn't support it.
>
> Is that simply because they've not gotten around to it..?  From what I
> saw, this approach is in an accepted RFC, so if they want to implement
> it, they could do so..?

Yes.  Heimdal is less active these days than MIT; they were dormant for
a while, though are active again.  They do have an issue open to track
this feature: https://github.com/heimdal/heimdal/issues/400

>> While I (a krb5 developer) am fine with a hard MIT dependency, the
>> project doesn't currently have this.  (And if we went that road, I'd
>> like to use gss_acquire_cred_from() instead of the setenv() in
>> be-secure-gssapi.c.)
>
> We certainly don't want a hard MIT dependency, but if it's following an
> RFC and we can gracefully fall-back if it isn't available then I think
> it's acceptable.  If there's a better approach which would work with
> both MIT and Heimdal and ideally is defined through an RFC, that'd be
> better, but I'm guessing there isn't...

While I think the concept of SASL SSF is standardized, I don't think the
semantics of this OID are - the OID itself is in the MIT gssapi
extension space.

As a historical note, the reason this interface exists in krb5 is of
course for SASL.  In particular, cyrus was reporting the SSF of
all krb5 types as if they were DES.  So while technically "assume DES"
could be a fallback... we probably shouldn't do that.

>>> ---
>>> psql (12devel)
>>> GSS Encrypted connection (bits: 256)
>>> Type "help" for help.
>>> ---
>>>
>>> I was curious though- is there a good way to get at the encryption
>>> type used..?  I haven't had much luck in reading the RFPs and poking
>>> around, but perhaps I'm just not looking in the right place.
>> 
>> A call to gss_inquire_context() produces mech type, so you could
>> print something like "GSS Encrypted connection (Kerberos 256 bits)"
>> or "GSS encrypted connection (NTLM)".  I think this'd be pretty
>> reasonable.
>
> I'm... not impressed with that approach, since "Kerberos" as an
> encryption algorithm doesn't mean squat- that could be DES or RC4 for
> all we know.

This is a fair concern.

>> For Kerberos-specific introspection, MIT krb5 and Heimdal both
>> support an interface called lucid contexts that allows this kind of
>> introspection (and some other gross layering violations) for use with
>> the kernel.  Look at gssapi_krb5.h (it's called that in both).  I
>> don't think it's worth supporting rfc1964 at this point (it's
>> 1DES-only).
>
> I agree that there's no sense in supporting 1DES-only.
>
> I also, frankly, don't agree with the characterization that wanting to
> know what the agreed-upon encryption algorithm is constitutes a gross
> layering violation, but that's really an independent discussion that we
> can have in a pub somewhere. ;)

Yeah I won't defend it very hard :)

> To get where I'd like us to be, however, it sounds like we need to:
>
> a) determine if we've got encryption (that's easy, assuming I've done it
>correctly so far)

Yes.

> b) Figure out if the encryption is being provided by Kerberos (using
>gss_inquire_context() would do that, right?)

Right - check and log mech_out there, then proceed if it's
gss_mech_krb5.  Probably not worth handling _old and _wrong, especially
since Heimdal doesn't have them and they shouldn't be in use anyway.

> c) If we're using Kerberos, use the lucid contexts to ask what the
>actual encryption algorithm used is

Yes.  This'll involve some introspection into krb5.h (for enctype
names), but that's not too hard.

> That's a bit obnoxious but it at least seems doable.  I don't suppose
> you know of any example code out there which provides this?  Any idea
> if there's work underway on an RFC to make this, well, suck less?

MIT has some in our test suite (t_enctypes).  nfs-utils (GPLv2) also
uses this interface (NFS is the intended consumer).

There isn't IETF effort in this regard that I'm aware of, but I'll add
it to the kitten backlog.
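
For illustration, step (b) above might look roughly like this (a sketch assuming
MIT's gssapi_krb5.h for gss_mech_krb5; not code from the patch), with step (c)'s
lucid-context enctype lookup following the same pattern:

#include <string.h>
#include <gssapi/gssapi.h>
#include <gssapi/gssapi_krb5.h>     /* for gss_mech_krb5 (MIT naming) */

/*
 * Illustrative check: is the established context using the Kerberos mech?
 * Only then would it make sense to dig further (via the lucid-context
 * interface) for the actual enctype.
 */
static int
context_is_krb5(gss_ctx_id_t ctx)
{
    OM_uint32   major,
                minor;
    gss_OID     mech = GSS_C_NO_OID;

    major = gss_inquire_context(&minor, ctx,
                                NULL, NULL,     /* src/targ names not needed */
                                NULL,           /* lifetime */
                                &mech,
                                NULL, NULL, NULL);  /* flags, initiator, open */
    if (major != GSS_S_COMPLETE || mech == GSS_C_NO_OID)
        return 0;

    return mech->length == gss_mech_krb5->length &&
        memcmp(mech->elements, gss_mech_krb5->elements, mech->length) == 0;
}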

Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-02-12 Thread Alvaro Herrera
I added metrics for the validate_index phases; patch attached.  This is
still a bit raw, but it looks much better now.  Here's a sample
concurrent index build on a 100M tuple table.  There are no concurrent
transactions, so phases that wait for lockers don't appear.  I'm not
making any effort to report metrics during phase 6, note.  In phase 7
I'm currently reporting only tuple counts; I think it'd be better to
report block numbers.  

I don't show it here, but when an index is built on a partitioned table,
the "partitions done" number goes up all the way to "partitions total"
and the phases come and go for each partition.

One thing is clear: given that the phases have varying durations, it
seems hard to use these numbers to predict total index build time.


      now      |                          phase                          | blocks total | blocks done | tuples total | tuples done | partitions total | partitions done
---------------+---------------------------------------------------------+--------------+-------------+--------------+-------------+------------------+-----------------
 15:56:33.792  | building index (3 of 8): initializing (1/5)             |       442478 |          32 |            0 |           0 |                0 |               0
 15:56:33.805  | building index (3 of 8): initializing (1/5)             |       442478 |         188 |            0 |           0 |                0 |               0
 [snip about 500 lines]
 15:56:41.345  | building index (3 of 8): initializing (1/5)             |       442478 |      439855 |            0 |           0 |                0 |               0
 15:56:41.356  | building index (3 of 8): initializing (1/5)             |       442478 |      440288 |            0 |           0 |                0 |               0
 15:56:41.367  | building index (3 of 8): initializing (1/5)             |       442478 |      440778 |            0 |           0 |                0 |               0
 15:56:41.378  | building index (3 of 8): initializing (1/5)             |       442478 |      441706 |            0 |           0 |                0 |               0
 15:56:41.389  | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:41.4    | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0

 [snip 300 lines]   ... I'm not sure what happened in this 3 seconds period.  No metrics were being updated.

 15:56:44.65   | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:44.661  | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:44.672  | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:44.682  | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:44.694  | building index (3 of 8): initializing (1/5)             |       442478 |      442399 |            0 |           0 |                0 |               0
 15:56:44.705  | building index (3 of 8): sorting tuples, spool 1 (3/5)  |       442478 |      442399 |            1 |           0 |                0 |               0
 15:56:44.716  | building index (3 of 8): sorting tuples, spool 1 (3/5)  |       442478 |      442399 |            1 |           0 |                0 |               0
 15:56:44.727  | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |       79057 |                0 |               0
 15:56:44.738  | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |      217018 |                0 |               0
 15:56:44.75   | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |      353804 |                0 |               0
 15:56:44.761  | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |      487892 |                0 |               0
 15:56:44.773  | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |      634136 |                0 |               0
 [snip 660 lines]
 15:56:52.47   | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |    99111337 |                0 |               0
 15:56:52.482  | building index (3 of 8): final btree sort & load (5/5)  |       442478 |      442399 |            1 |    99285701 |                0 |               0
 15:56:52.493  | buildi

Re: [PATCH v20] GSSAPI encryption support

2019-02-12 Thread Stephen Frost
Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> > * Robbie Harwood (rharw...@redhat.com) wrote:
> >> Stephen Frost  writes:
> >>> * Robbie Harwood (rharw...@redhat.com) wrote:
> >>>
>  Attached please find version 20 of the GSSAPI encryption support.
>  This has been rebased onto master (thanks Stephen for calling out
>  ab69ea9).
> >>>
> >>> I've looked over this again and have been playing with it off-and-on
> >>> for a while now.  One thing I was actually looking at implementing
> >>> before we push to commit this was to have similar information to
> >>> what SSL provides on connection, specifically what printSSLInfo()
> >>> prints by way of cipher, bits, etc for the client-side and then
> >>> something like pg_stat_ssl for the server side.
> >>>
> >>> I went ahead and added a printGSSENCInfo(), and then
> >>> PQgetgssSASLSSF(), which calls down into
> >>> gss_inquire_sec_context_by_oid() for GSS_C_SEC_CONTEXT_SASL_SSF to
> >>> get the bits (which also required including gssapi_ext.h), and now
> >>> have psql producing:
> >> 
> >> Also, this is an MIT-specific extension: Heimdal doesn't support it.
> >
> > Is that simply because they've not gotten around to it..?  From what I
> > saw, this approach is in an accepted RFC, so if they want to implement
> > it, they could do so..?
> 
> Yes.  Heimdal is less active these days than MIT; they were dormant for
> a while, though are active again.  They do have an issue open to track
> this feature: https://github.com/heimdal/heimdal/issues/400

Ok, good, when they add it then hopefully it'll be picked up
more-or-less 'for free'.

> >> While I (a krb5 developer) am fine with a hard MIT dependency, the
> >> project doesn't currently have this.  (And if we went that road, I'd
> >> like to use gss_acquire_cred_from() instead of the setenv() in
> >> be-secure-gssapi.c.)
> >
> > We certainly don't want a hard MIT dependency, but if it's following an
> > RFC and we can gracefully fall-back if it isn't available then I think
> > it's acceptable.  If there's a better approach which would work with
> > both MIT and Heimdal and ideally is defined through an RFC, that'd be
> > better, but I'm guessing there isn't...
> 
> While I think the concept of SASL SSF is standardized, I don't think the
> semantics of this OID are - the OID itself is in the MIT gssapi
> extension space.

We can adjust for that pretty easily, presumably, if another OID ends up
being assigned (though if one already exists, isn't it something that
Heimdal, for example, could potentially decide to just support..?).

> As a historical note, the reason this interface exists in krb5 is of
> course for SASL.  In particular, cyrus was reporting the SSF of
> all krb5 types as if they were DES.  So while technically "assume DES"
> could be a fallback... we probably shouldn't do that.

Agreed, we shouldn't do that.

> >> A call to gss_inquire_context() produces mech type, so you could
> >> print something like "GSS Encrypted connection (Kerberos 256 bits)"
> >> or "GSS encrypted connection (NTLM)".  I think this'd be pretty
> >> reasonable.
> >
> > I'm... not impressed with that approach, since "Kerberos" as an
> > encryption algorithm doesn't mean squat- that could be DES or RC4 for
> > all we know.
> 
> This is a fair concern.

I will say that I'm alright with listing Kerberos in the message if we
think it's useful- but we should also list the actual encryption
algorithm and bits, if possible (which it sounds like it is, so that's
definitely good).

> > To get where I'd like us to be, however, it sounds like we need to:
> >
> > a) determine if we've got encryption (that's easy, assuming I've done it
> >correctly so far)
> 
> Yes.
> 
> > b) Figure out if the encryption is being provided by Kerberos (using
> >gss_inquire_context() would do that, right?)
> 
> Right - check and log mech_out there, then proceed if it's
> gss_mech_krb5.  Probably not worth handling _old and _wrong, especially
> since Heimdal doesn't have them and they shouldn't be in use anyway.
> 
> > c) If we're using Kerberos, use the lucid contexts to ask what the
> >actual encryption algorithm used is
> 
> Yes.  This'll involve some introspection into krb5.h (for enctype
> names), but that's not too hard.

Sounds good.

> > That's a bit obnoxious but it at least seems doable.  I don't suppose
> > you know of any example code out there which provides this?  Any idea
> > if there's work underway on an RFC to make this, well, suck less?
> 
> MIT has some in our test suite (t_enctypes).  nfs-utils (GPLv2) also
> uses this interface (NFS is the intended consumer).
> 
> There isn't IETF effort in this regard that I'm aware of, but I'll add
> it to the kitten backlog.

Fantastic, thanks.

> >> (With my downstream maintainer hat on, I'd further ask that any such
> >> check not just be a version check in order to allow backporting; in
> >> particular, the el7 krb5-1.15 branch does incl

Re: [PATCH v20] GSSAPI encryption support

2019-02-12 Thread Robbie Harwood
Stephen Frost  writes:

> * Robbie Harwood (rharw...@redhat.com) wrote:
>> Stephen Frost  writes:
>>> * Robbie Harwood (rharw...@redhat.com) wrote:
 Stephen Frost  writes:
> * Robbie Harwood (rharw...@redhat.com) wrote:
>
>> Attached please find version 20 of the GSSAPI encryption support.
>> This has been rebased onto master (thanks Stephen for calling out
>> ab69ea9).
>
> I've looked over this again and have been playing with it
> off-and-on for a while now.  One thing I was actually looking at
> implementing before we push to commit this was to have similar
> information to what SSL provides on connection, specifically what
> printSSLInfo() prints by way of cipher, bits, etc for the
> client-side and then something like pg_stat_ssl for the server
> side.
>
> I went ahead and added a printGSSENCInfo(), and then
> PQgetgssSASLSSF(), which calls down into
> gss_inquire_sec_context_by_oid() for GSS_C_SEC_CONTEXT_SASL_SSF to
> get the bits (which also required including gssapi_ext.h), and now
> have psql producing:
 
 While I (a krb5 developer) am fine with a hard MIT dependency, the
 project doesn't currently have this.  (And if we went that road,
 I'd like to use gss_acquire_cred_from() instead of the setenv() in
 be-secure-gssapi.c.)
>>>
>>> We certainly don't want a hard MIT dependency, but if it's following
>>> an RFC and we can gracefully fall-back if it isn't available then I
>>> think it's acceptable.  If there's a better approach which would
>>> work with both MIT and Heimdal and ideally is defined through an
>>> RFC, that'd be better, but I'm guessing there isn't...
>> 
>> While I think the concept of SASL SSF is standardized, I don't think
>> the semantics of this OID are - the OID itself is in the MIT gssapi
>> extension space.
>
> We can adjust for that pretty easily, presumably, if another OID ends
> up being assigned (though if one already exists, isn't it something
> that Heimdal, for example, could potentially decide to just
> support..?).

Yes, exactly - Heimdal would probably just use the MIT OID with the
existing semantics.

Thanks,
--Robbie


signature.asc
Description: PGP signature


Re: Planning counters in pg_stat_statements (using pgss_store)

2019-02-12 Thread legrand legrand
Hi Sergei,

Thank you for this review !

>Did you register the patch in the CF app? I did not find an entry. 
I think it is related to https://commitfest.postgresql.org/16/1373/,
but I don't know how to link it with that entry.

Yes, there are many things to improve, but before going deeper, 
I would like to know whether this way of doing it (with 3 accesses to the pgss hash)
has a chance of reaching consensus?

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: Too rigorous assert in reorderbuffer.c

2019-02-12 Thread Alvaro Herrera
On 2019-Feb-12, Arseny Sher wrote:

> 
> Alvaro Herrera  writes:
> 
> >> I thought the blanket removal of the assert() was excessive, and we can
> >> relax it instead; what do you think of the attached?
> >
> > More precisely, my question was: with this change, does the code still
> > work correctly in your non-toy case?
> 
> Yes, it works. I thought for a moment that some obscure cases where cmax
> on a single tuple is not strictly monotonic might exist, but it looks like
> they don't. So your change is OK with me; reshaping the assert is better than
> removing it. make check is also good on all supported branches.

Thanks for checking!  I also ran it on all branches; everything passes.
Pushed now.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Statement-level rollback

2019-02-12 Thread Alvaro Herrera
On 2019-Feb-04, Michael Paquier wrote:

> On Thu, Jan 31, 2019 at 04:38:27AM -0800, Andres Freund wrote:
> > Is this still in development? Or should we mark this as returned with
> > feedback?
> 
> Marked as returned with feedback, as it has already been two months.
> If you have an updated patch set, please feel free to resubmit.

It was reasonable to close this patch as returned-with-feedback in
January commitfest, but what you did here was first move it to the March
commitfest and then close it as returned-with-feedback in the March
commitfest, which has not even started.  That makes no sense.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Reaping Temp tables to avoid XID wraparound

2019-02-12 Thread James Sewell
Hi all,

I hit an issue yesterday where I was quickly nearing XID wraparound on a
database due to a temp table being created in a session which was then left
IDLE out of transaction for 6 days.

I solved the issue by tracing the owner of the temp table back to a session
and terminating it - in my case I was just lucky that there was one session
for that user.

I'm looking for a way to identify the PID of the backend which owns a temp
table identified as being an issue so I can terminate it.

From the temp table namespace I can get the backend ID using a regex - but
I have no idea how I can map that to a PID - any thoughts?

Cheers,

James Sewell,



Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099



RE: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> I'm at a loss as to what to call syscache for users. I think it is "catalog
> cache". The most basic component is called catcache, which is
> covered by the syscache layer; neither of them is revealed to
> users, and it is shown to users as "catalog cache".
> 
> "catalog_cache_prune_min_age", "catalog_cache_memory_target", (if
> exists) "catalog_cache_entry_limit" and
> "catalog_cache_prune_ratio" make sense?

The PostgreSQL documentation uses "system catalog" in its table of contents, so 
perhaps syscat_cache_xxx would be a bit more familiar?  I'm fine with either catalog_ or 
syscat_, but what name shall we use for the relation cache?  catcache and 
relcache have different element sizes and possibly different usage patterns, so 
they may as well have different parameters, just like MySQL does.  If we follow 
that idea, then the name would be relation_cache_xxx.  However, from the user's 
viewpoint, the relation cache is also built from system catalogs like 
pg_class and pg_attribute...


Regards
Takayuki Tsunakawa







RE: Reaping Temp tables to avoid XID wraparound

2019-02-12 Thread Tsunakawa, Takayuki
From: James Sewell [mailto:james.sew...@jirotech.com]
> From the temp table namespace I can get the backend ID using a regex - but
> I have no idea how I can map that to a PID - any thoughts?
> 

SELECT pg_stat_get_backend_pid(backendid);

https://www.postgresql.org/docs/devel/monitoring-stats.html

This mailing list is for PostgreSQL development.  You can post questions as a 
user to pgsql-gene...@lists.postgresql.org.


Regards
Takayuki Tsunakawa





Re: Reaping Temp tables to avoid XID wraparound

2019-02-12 Thread Andrew Gierth
> "Tsunakawa" == Tsunakawa, Takayuki  
> writes:

 >> From the temp table namespace I can get the backend ID using a regex
 >> - but I have no idea how I can map that to a PID - any thoughts?

 Tsunakawa> SELECT pg_stat_get_backend_pid(backendid);

Doesn't work - that function's idea of "backend id" doesn't match the
real one, since it's looking at a local copy of the stats from which
unused slots have been removed.

postgres=# select pg_my_temp_schema()::regnamespace;
 pg_my_temp_schema 
-------------------
 pg_temp_5
(1 row)

postgres=# select pg_stat_get_backend_pid(5);
 pg_stat_get_backend_pid 
-------------------------
                    4730
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          21086
(1 row)

-- 
Andrew (irc:RhodiumToad)



Re: dsa_allocate() faliure

2019-02-12 Thread Thomas Munro
On Mon, Feb 11, 2019 at 10:33 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > So I'll wait until after the release, and we'll have
> > to live with the bug for another 3 months.
>
> Check.  Please hold off committing until you see the release tags
> appear, probably late Tuesday my time / Wednesday noonish yours.

Pushed.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Statement-level rollback

2019-02-12 Thread Michael Paquier
On Tue, Feb 12, 2019 at 07:13:45PM -0300, Alvaro Herrera wrote:
> It was reasonable to close this patch as returned-with-feedback in
> January commitfest, but what you did here was first move it to the March
> commitfest and then close it as returned-with-feedback in the March
> commitfest, which has not even started.  That makes no sense.

Except that the CF app does not leave any room for error, particularly
as it is not possible to move a patch back to a previous commit fest
once it has been moved to the next one.  In this case, it looks like I
made a mistake in moving the patch first; my apologies for that.  There
were many patches to classify last week, so it is possible that I
made similar mistakes for some other patches.
--
Michael


signature.asc
Description: PGP signature


Re: Reaping Temp tables to avoid XID wraparound

2019-02-12 Thread Michael Paquier
On Wed, Feb 13, 2019 at 12:38:51AM +, Andrew Gierth wrote:
> Doesn't work - that function's idea of "backend id" doesn't match the
> real one, since it's looking at a local copy of the stats from which
> unused slots have been removed.

The temporary namespace OID is added to PGPROC since v11, so it could
be easy enough to add a system function which maps a temp schema to a
PID.  Now, it could actually make sense to add this information into
pg_stat_get_activity() and that would be cheaper.
--
Michael


signature.asc
Description: PGP signature


Re: pg_basebackup ignores the existing data directory permissions

2019-02-12 Thread Michael Paquier
On Tue, Feb 12, 2019 at 06:03:37PM +1100, Haribabu Kommi wrote:
> During the testing allow group access permissions on the standby database
> directory, one of my colleague found the issue, that pg_basebackup
> doesn't verify whether the existing data directory has the required
> permissions or not?  This issue is not related allow group access
> permissions. This problem was present for a long time, (I think from
> the time the pg_basebackup was introduced).

In which case this would cause the postmaster to fail to start.

> The attached patch fixes the problem, similarly to initdb, by changing the
> permissions of the data directory to the required permissions.

It looks right to me and takes care of the case where group access is
allowed.  Still, we have not seen any complaints about the current
behavior either, and pg_basebackup has been around for some time already, so
I would tend not to back-patch that.  Any thoughts from others?
--
Michael


signature.asc
Description: PGP signature


RE: Reaping Temp tables to avoid XID wraparound

2019-02-12 Thread Tsunakawa, Takayuki
From: Andrew Gierth [mailto:and...@tao11.riddles.org.uk]
>  Tsunakawa> SELECT pg_stat_get_backend_pid(backendid);
> 
> Doesn't work - that function's idea of "backend id" doesn't match the
> real one, since it's looking at a local copy of the stats from which
> unused slots have been removed.

Ouch, the argument of pg_stat_get_backend_pid() and the number in pg_temp_N are 
both backend IDs, but they are allocated from two different data structures.  
Confusing.


From: Michael Paquier [mailto:mich...@paquier.xyz]
> The temporary namespace OID is added to PGPROC since v11, so it could be
> easy enough to add a system function which maps a temp schema to a PID.
> Now, it could actually make sense to add this information into
> pg_stat_get_activity() and that would be cheaper.

That sounds good.


Regards
Takayuki Tsunakawa





RE: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Tsunakawa, Takayuki
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> > I didn't consider planning that happens within a function. If
> > 5min is the default for catalog_cache_prune_min_age, 10% of it
> > (30s) seems enough, and gettimeofday() at such intervals wouldn't
> > affect foreground jobs. I'd choose catalog_c_p_m_age/10 rather
> > than a fixed value of 30s, with 1s as the minimum.
> >
> 
> Actually, I see CatCacheCleanupOldEntries contains this comment:
> 
> /*
>  * Calculate the duration from the time of the last access to the
>  * "current" time. Since catcacheclock is not advanced within a
>  * transaction, the entries that are accessed within the current
>  * transaction won't be pruned.
>  */
> 
> which I think is pretty much what I've been saying ... But the question
> is whether we need to do something about it.

Hmm, I'm surprised at the v14 patch regarding this.  I remember that previous patches 
renewed the cache clock on every statement, which is correct.  If the cache 
clock is only updated at the beginning of a transaction, the following TODO 
item would not be solved:

https://wiki.postgresql.org/wiki/Todo

" Reduce memory use when analyzing many tables in a single command by making 
catcache and syscache flushable or bounded."

Also, Tom mentioned pg_dump in this thread (protect syscache...).  pg_dump runs 
in a single transaction, touching all system catalogs.  That may result in OOM, 
and this patch can rescue it.


Regards
Takayuki Tsunakawa






Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-02-12 Thread Simon Riggs
On Wed, 13 Feb 2019 at 00:46, Alvaro Herrera 
wrote:

> Here's a sample
> concurrent index build on a 100M tuple table.


Cool


> There are no concurrent
> transactions, so phases that wait for lockers don't appear.


Would prefer to see them, so we know they are zero.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


2019-02-14 Press Release Draft

2019-02-12 Thread Jonathan S. Katz
Hi,

Attached is a draft of the press release for the upcoming 2019-02-14
release. Feedback & suggestions are welcome.

Thanks!

Jonathan
2019-02-14 Cumulative Update Release


The PostgreSQL Global Development Group has released an update to all supported versions of our database system, including 11.2, 10.7, 9.6.12, 9.5.16, and 9.4.21. This release changes the behavior in how PostgreSQL interfaces with `fsync()` and includes fixes for partitioning and over 70 other bugs that were reported over the past three months.

Users should plan to apply this update at the next scheduled downtime.

Highlight: Change in behavior with `fsync()`
--

When available in an operating system and enabled in the configuration file (which it is by default), PostgreSQL uses the kernel function `fsync()` to help ensure that data is written to a disk. In some operating systems that provide `fsync()`, when the kernel is unable to write out the data, it returns a failure and flushes the data that was supposed to be written from its data buffers.

This flushing operation has an unfortunate side-effect for PostgreSQL: if PostgreSQL tries again to write the data to disk by again calling `fsync()`, `fsync()` will report back that it succeeded, but the data that PostgreSQL believed to be saved to the disk would not actually be written. This presents a possible data corruption scenario.

This update modifies how PostgreSQL handles a `fsync()` failure: PostgreSQL will no longer retry calling `fsync()` but instead will panic. In this case, PostgreSQL can then replay the data from the write-ahead log (WAL) to help ensure the data is written. While this may appear to be a suboptimal solution, there are presently few alternatives and, based on reports, the problem case occurs extremely rarely.

Bug Fixes and Improvements
--

This update introduces a change in how release notes are packaged. As of this update, all currently supported versions of PostgreSQL will only contain their major version-specific release notes. For example, PostgreSQL 11 only packages the release notes for versions 11.2, 11.1, and 11.0. The release notes for unsupported versions (PostgreSQL 9.3 and older) can be viewed from the [archive](https://www.postgresql.org/docs/release/) on the PostgreSQL website.

This update also fixes over 70 bugs that were reported in the last several months. Some of these issues affect only version 11, but many affect all supported versions.

Some of these fixes include:

* Fix handling of unique indexes with `INCLUDE` columns on partitioned tables
* Ensure that `NOT NULL` constraints of a partitioned table are honored within its partitions
* Several fixes for constraints on partitioned tables
* Fix problems with applying `ON COMMIT DROP` and `ON COMMIT DELETE ROWS` to partitioned tables and tables with inheritance children
* Disallow `COPY FREEZE` on partitioned tables
* Several fixes for the `ALTER TABLE .. ADD COLUMN` with a non-nullable default feature, including a possible index corruption case
* Several fixes in GIN indexes, including avoiding a deadlock with vacuuming and concurrent index insertions (which partially reverts a performance improvement introduced in PostgreSQL 10)
* Fix possible crashes in logical replication when index expressions or predicates are in use
* Several fixes for the write-ahead log (WAL)
* Fix possible crash in `UPDATE` with a multiple `SET` clause using a sub-`SELECT`
* Fix crash when zero rows are provided to `json[b]_populate_recordset()` or `json[b]_to_recordset()`
* Several fixes related to collation handling, including the parsing of collation-sensitive expressions in the arguments of a `CALL` statement
* Several fixes for the query planner, including an improvement to planning speed for large inheritance or partitioning table groups
* Several fixes for `TRUNCATE`
* Ensure `ALTER TABLE ONLY ADD COLUMN IF NOT EXISTS` is processed correctly
* Allow `UNLISTEN` in hot-standby (replica) mode
* Fix parsing of space-separated lists of host names in the ldapserver parameter of LDAP authentication entries in `pg_hba.conf`
* Several fixes for ecpg
* Several fixes for psql, including having `\g target` work with `COPY TO STDOUT`
* The random number generation for `pgbench` is now fully deterministic and platform-independent when `--random-seed=N` is specified
* Fix `pg_basebackup` and `pg_verify_checksums` to appropriately ignore temporary files
* Several fixes for `pg_dump`, including having `ALTER INDEX SET STATISTICS` commands present
* Prevent false index-corruption reports from contrib/amcheck caused by inline-compressed data
* Support new Makefile variables to help with building extensions

This update also contains tzdata release 2018i for DST law changes in Kazakhstan, Metlakatla, and Sao Tome and Principe. Kazakhstan's Qyzylorda zone is split in two, creating a new zone Asia/Qostanay, as some areas did n

Re: Should we still have old release notes in docs?

2019-02-12 Thread Jonathan S. Katz
On 2/12/19 5:45 PM, Jonathan S. Katz wrote:
> 
>> On Feb 12, 2019, at 5:12 PM, Tom Lane  wrote:
>>
>> The part about having a unified release-note archive somewhere else is
>> still WIP.  The ball is in the web team's court on that, I think.
> 
> Yes - we are going to try one thing with the existing way we load docs to try 
> to avoid
> additional structural changes. If that doesn’t work, we will reconvene
> and see what we need to do.

I've proposed a patch for this here:

https://www.postgresql.org/message-id/e0f09c9a-bd2b-862a-d379-601dfabc8969%40postgresql.org

Thanks,

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: 2019-02-14 Press Release Draft

2019-02-12 Thread Chapman Flack
On 02/12/19 22:00, Jonathan S. Katz wrote:
> Attached is a draft of the press release for the upcoming 2019-02-14
> release. Feedback & suggestions are welcome.


Users on PostgreSQL 9.4 should plan to upgrade to a supported version of
PostgreSQL as the community will stop releasing fixes for it on February 12,
2019. Please see our [versioning
policy](https://www.postgresql.org/support/versioning/)


Should that be February 13, 2020? That's what the linked page says for 9.4.

February 12, 2019 would be (a) today, and (b) in the past for this press
release.

Cheers,
-Chap



Re: 2019-02-14 Press Release Draft

2019-02-12 Thread Jonathan S. Katz
On 2/13/19 4:13 AM, Chapman Flack wrote:
> On 02/12/19 22:00, Jonathan S. Katz wrote:
>> Attached is a draft of the press release for the upcoming 2019-02-14
>> release. Feedback & suggestions are welcome.
> 
> 
> Users on PostgreSQL 9.4 should plan to upgrade to a supported version of
> PostgreSQL as the community will stop releasing fixes for it on February 12,
> 2019. Please see our [versioning
> policy](https://www.postgresql.org/support/versioning/)
> 
> 
> Should that be February 13, 2020? That's what the linked page says for 9.4.
> 
> February 12, 2019 would be (a) today, and (b) in the past for this press
> release.

Yes, good catch. Fixed.

Thanks,

Jonathan



signature.asc
Description: OpenPGP digital signature


Re: Fast path for empty relids in check_outerjoin_delay()

2019-02-12 Thread Richard Guo
Yes sure. Thanks Tom.

On Fri, Feb 1, 2019 at 1:04 AM Tom Lane  wrote:

> Since we've reached the end of the January commitfest, and it's pretty
> clear that this patch isn't going to get committed in anything like
> its current form, I'm going to close the CF entry as returned with
> feedback.  If you come up with a follow-on patch, please create a
> new CF entry.
>
> regards, tom lane
>


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-02-12 Thread Amit Langote
On 2019/02/13 11:59, Simon Riggs wrote:
> On Wed, 13 Feb 2019 at 00:46, Alvaro Herrera 
> wrote:
> 
>> Here's a sample
>> concurrent index build on a 100M tuple table.
> 
> 
> Cool

+1

Looking at the "([phase] x of x)" in the sample output, I thought
pg_stat_progress_vacuum's output should've had it too (not a concern of
Alvaro's patch though.)

Thanks,
Amit




Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-02-12 Thread Amit Langote
Hi Alvaro,

On 2019/02/12 12:18, Alvaro Herrera wrote:
> I ended up defining phases for the index_build phase in the AM itself;
> the code reports a phase number using the regular API, and then the
> pgstat_progress view obtains the name of each phase using a support
> method.

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 05102724ead..189179a5667 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -127,6 +127,7 @@ typedef struct IndexAmRoutine
 amcostestimate_function amcostestimate;
 amoptions_function amoptions;
 amproperty_function amproperty; /* can be NULL */
+amphasename_function amphasename;   /* can be NULL */

Doesn't the name amphasename sound a bit too generic, given that it can
only describe the phases of ambuild?  Maybe ambuildphase?

Thanks,
Amit




obsolete comment above index_pages_fetched

2019-02-12 Thread Amit Langote
Hi,

I think the following commit:

commit c6e4133fae1fde93769197379ffcc2b379845113
Author: Tom Lane 
Date:   Wed Nov 7 12:12:56 2018 -0500

Postpone calculating total_table_pages until after pruning/exclusion.
...


obsoleted a sentence in the comment above index_pages_fetched(), which says:

 * "index_pages" is the amount to add to the total table space, which was
 * computed for us by query_planner.

total_table_pages is computed by make_one_rel as of the aforementioned
commit.  Attached fixes this.

Thanks,
Amit
diff --git a/src/backend/optimizer/path/costsize.c 
b/src/backend/optimizer/path/costsize.c
index beee50ec13..4b9be13f08 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -815,7 +815,7 @@ extract_nonindex_conditions(List *qual_clauses, List 
*indexclauses)
  * product rather than calculating it here.  "pages" is the number of pages
  * in the object under consideration (either an index or a table).
  * "index_pages" is the amount to add to the total table space, which was
- * computed for us by query_planner.
+ * computed for us by make_one_rel.
  *
  * Caller is expected to have ensured that tuples_fetched is greater than zero
  * and rounded to integer (see clamp_row_est).  The result will likewise be


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-02-12 Thread Tatsuro Yamada

On 2019/02/13 4:16, Alvaro Herrera wrote:

I added metrics for the validate_index phases; patch attached.  This is
still a bit raw, but it looks much better now.  Here's a sample
concurrent index build on a 100M tuple table.  There are no concurrent


Great!

+   s.pid AS pid, S.datid AS datid, D.datname AS datname,
+   S.relid AS relid,
+   CASE s.param2 WHEN 0 THEN 'initializing (phase 1 of 8)'
+                 WHEN 1 THEN 'waiting for lockers 1 (phase 2 of 8)'
+                 WHEN 2 THEN 'building index (3 of 8): ' ||
+                     pg_indexam_progress_phasename(s.param1::oid, s.param3)

It would be better to replace "s" with "S".

s/s.pid/S.pid/
s/s.param2/S.param2/
s/s.param1::oid, s.param3/S.param1::oid, S.param3/

These are not important comments but for readability. :)

Thanks,
Tatsuro Yamada




Better error messages when lacking connection slots for autovacuum workers and bgworkers

2019-02-12 Thread Michael Paquier
Hi all,

When lacking connection slots, the backend returns a simple "sorry,
too many clients", which is something that has been tweaked by recent
commit ea92368 for WAL senders.  However, the same message would show
up for autovacuum workers and bgworkers.  Based on the way autovacuum
workers are spawned by the launcher, and the way bgworkers are added,
it seems that this cannot actually happen.  Still, in my opinion,
providing more context can be helpful for people trying to work on
features related to such code, so that they can get more information than
would normally be reported for normal backends.

I am wondering as well if this could not help for issues like this
one, which has popped out today and makes me wonder if we don't have
race conditions with the way dynamic bgworkers are spawned:
https://www.postgresql.org/message-id/CAONUJSM5X259vAnnwSpqu=vnrecfgsj-cgrhys4p5yyrvwk...@mail.gmail.com

There are four code paths which report the original "sorry, too many
clients", and the one modified by the attached patch can easily track the
type of connection slot used, which allows better context messages
when a process is initialized.

Would that be helpful for at least debugging purposes?

Thanks,
--
Michael
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 0da5b19719..418363000b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -349,15 +349,25 @@ InitProcess(void)
 		/*
 		 * If we reach here, all the PGPROCs are in use.  This is one of the
 		 * possible places to detect "too many backends", so give the standard
-		 * error message.  XXX do we need to give a different failure message
-		 * in the autovacuum case?
+		 * error message.
 		 */
 		SpinLockRelease(ProcStructLock);
-		if (am_walsender)
+		if (IsAnyAutoVacuumProcess())
+			ereport(FATAL,
+	(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
+	 errmsg("number of requested autovacuum workers exceeds autovacuum_max_workers (currently %d)",
+			autovacuum_max_workers)));
+		else if (IsBackgroundWorker)
+			ereport(FATAL,
+	(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
+	 errmsg("number of requested worker processes exceeds max_worker_processes (currently %d)",
+			max_worker_processes)));
+		else if (am_walsender)
 			ereport(FATAL,
 	(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
 	 errmsg("number of requested standby connections exceeds max_wal_senders (currently %d)",
 			max_wal_senders)));
+
 		ereport(FATAL,
 (errcode(ERRCODE_TOO_MANY_CONNECTIONS),
  errmsg("sorry, too many clients already")));


signature.asc
Description: PGP signature


Re: Refactoring the checkpointer's fsync request queue

2019-02-12 Thread Thomas Munro
On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath  wrote:
> On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
> > I wonder if it might be better to introduce two different functions
> > catering to the two different use cases for forcing an immediate sync:
> >
> > - sync a relation
> > smgrimmedsyncrel(SMgrRelation, ForkNumber)
> > - sync a specific segment
> > smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
> >
> > This will avoid having to specify InvalidSegmentNumber for majority of
> > the callers today.
>
> I have gone ahead and rebased the refactor patch so it could cleanly
> apply on heapam.c, see patch v7.
>
> I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
> smgrimmedsyncseg() as I mentioned in the previous email. It avoids
> callers to pass in InvalidSegmentNumber when they just want to sync the
> whole relation. As a side effect, I was able to get rid of some extra
> checkpointer.h includes.

Hi Shawn,

Thanks!  And sorry for not replying sooner -- I got distracted by
FOSDEM (and the associated 20 thousand miles of travel).  On that trip
I had a chance to discuss this patch with Andres Freund in person, and
he opined that it might be better for the fsync request queue to work
in terms of pathnames.  Instead of the approach in this patch, where a
backend sends an fsync request for { relfilenode, segno } inside
mdwrite(), and then the checkpointer processes the request by calling
smgrimmedsyncrel(), he speculated that it'd be better to have
mdwrite() send an fsync request for a pathname, and then the
checkpointer would just open that file by name and fsync() it.  That
is, the checkpointer wouldn't call back into smgr.

One of the advantages of that approach is that there are probably
other files that need to be fsync'd for each checkpoint that could
benefit from being offloaded to the checkpointer.  Another is that you
break the strange cycle mentioned above.

Here's a possible problem with it.  The fsync request messages would
have to be either large (MAXPGPATH) or variable sized and potentially
large.  I am a bit worried that such messages would be problematic for
the atomicity requirement of the (future) fd-passing patch that passes
it via a Unix domain socket (which is a bit tricky because
SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably
have to use SOCK_STREAM, but there is no formal guarantee like
PIPE_BUF; we know that in practice small messages will be atomic, but
certainty decreases with larger messages.  This needs more study...).
You need to include the path even in a message containing an fd,
because the checkpointer will use that as a hashtable key to merge
received requests.  Perhaps you'd solve that by using a small tag that
can be converted back to a path (as you noticed my last patch had some
leftover dead code from an experiment along those lines), but then I
think you'd finish up needing an smgr interface to convert it back to
the path (implementation different for md.c, undofile.c, slru.c).  So
you don't exactly break the cycle mentioned earlier.  Hmm, or perhaps
you could avoid even thinking about atomicity by passing 1 byte
fd-bearing messages via the pipe, and pathnames via shared memory, in
the same order.

Another consideration if we do that is that the existing scheme has a
kind of hierarchy that allows fsync requests to be cancelled in bulk
when you drop relations and databases.  That is, the checkpointer
knows about the internal hierarchy of tablespace, db, rel, seg.  If we
get rid of that and have just paths, it seems like a bad idea to teach
the checkpointer about the internal structure of the paths (even
though we know they contain the same elements encoded somehow).  You'd
have to send an explicit cancel for every key; that is, if you're
dropping a relation, you need to generate a cancel message for every
segment, and if you're dropping a database, you need to generate a
cancel message for every segment of every relation.  Once again, if
you used some kind of tag that is passed back to smgr, you could
probably come up with a way to do it.
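
To make the "small tag" idea concrete, here is a sketch of what such a fixed-size
request could carry (purely illustrative; the struct and field names are made up,
not from any patch):

#include <stdint.h>

/*
 * Illustrative only: a fixed-size fsync request that the checkpointer could
 * hash and merge, and that the owning storage manager (md, undo, slru, ...)
 * could later turn back into a pathname.
 */
typedef struct FsyncRequestTag
{
    int16_t     handler;        /* which smgr-like module owns the file */
    int16_t     forknum;        /* fork number, where applicable */
    uint32_t    dbid;           /* database OID */
    uint32_t    relid;          /* relfilenode / relation identifier */
    uint32_t    segno;          /* segment number */
} FsyncRequestTag;

/*
 * The checkpointer would key its pending-ops table on the whole tag, and a
 * "forget all segments of this relation" request could be honored by
 * matching every field except segno (likewise for whole databases).
 */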

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com



RE: Cache relation sizes?

2019-02-12 Thread Jamison, Kirk
On February 6, 2019, 8:57 AM +, Andres Freund wrote:
> Maybe I'm missing something here, but why is it actually necessary to
> have the sizes in shared memory, if we're just talking about caching
> sizes?  It's pretty darn cheap to determine the filesize of a file that
> has been recently stat()/lseek()/ed, and introducing per-file shared
> data adds *substantial* complexity, because the amount of shared memory
> needs to be pre-determined.  The reason I want to put per-relation data
> into shared memory is different, it's about putting the buffer mapping
> into shared memory, and that, as a prerequisite, also need per-relation
> data. And there's a limit of the number of relations that need to be
> open (one per cached page at max), and space can be freed by evicting
> pages.

Ahh.. You are right about the logic of putting it in shared memory.
As for Thomas' toy patch, multiple files share one counter in shmem.
Although it currently works, it is likely to miss.
His eventual plan for the idea is to use an array of N counters
and map relation OIDs onto them.
But, as your point about complexity says, in shared memory we cannot
let multiple files share the same area, so an area needs to be
allocated depending on the number of files.
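
For illustration, the "array of N counters" idea might look roughly like this (a
sketch only; the size, names, and mapping are made up):

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative sketch: several relations may map to the same slot, so a hit
 * is only a hint and a miss must fall back to lseek(SEEK_END).
 */
#define NREL_SIZE_SLOTS 1024

typedef struct RelSizeSlot
{
    uint32_t    relid;          /* relation that last wrote this slot */
    uint32_t    nblocks;        /* cached size in blocks */
    bool        valid;
} RelSizeSlot;

static RelSizeSlot rel_size_slots[NREL_SIZE_SLOTS];     /* would live in shmem */

static RelSizeSlot *
rel_size_slot_for(uint32_t relid)
{
    /* trivial modulo mapping; a real patch would need to handle collisions */
    return &rel_size_slots[relid % NREL_SIZE_SLOTS];
}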

Regarding the allocation of per-relation data in shared memory, I
thought at first that it could be a separate component, so I asked about
the validity of the idea.  But now I see the point you raised.

Regards,
Kirk Jamison




Re: Protect syscache from bloating with negative cache entries

2019-02-12 Thread Kyotaro HORIGUCHI
At Tue, 12 Feb 2019 20:36:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190212.203628.118792892.horiguchi.kyot...@lab.ntt.co.jp>
> > (4)
> > +   hash_size = cp->cc_nbuckets * sizeof(dlist_head);
> > +   tupsize = sizeof(CatCTup) + MAXIMUM_ALIGNOF + dtp->t_len;
> > +   tupsize = sizeof(CatCTup);
> > 
> > GetMemoryChunkSpace() should be used to include the memory context 
> > overhead.  That's what the files in src/backend/utils/sort/ do.
> 
> Thanks. Done. Include bucket and cache header part but still
> excluding clist.  Renamed from tupsize to memusage.

It is too complex, as I was afraid.  The indirect calls cause
significant degradation.  (Anyway, the previous code was bogus in
that it passed a CACHELINEALIGN'ed pointer to get_chunk_size.)

Instead, I added an accounting(?) interface function.

| MemoryContextGetConsumption(MemoryContext cxt);

The API returns the current consumption in this memory
context. This allows "real" memory accounting almost without
overhead.

(1) New patch v15-0002 adds an accounting feature to MemoryContext.
  (It adds this feature only to AllocSet; if this is acceptable,
  it can be extended to other allocators.)

(2) Another new patch v15-0005, on top of the previous design of the
  limit-by-number-of-a-cache feature, converts it to a
  limit-by-size-on-all-caches feature, which I think is what
  Tsunakawa-san wanted.

As far as I can see, no significant degradation is found in the usual
code paths (as long as pruning doesn't happen).

About the new global-size-based eviction (2), cache entry
creation becomes slow after the total size reaches the limit,
since every new entry evicts one or more old (=
not-recently-used) entries.  Because it doesn't need knobs for each
cache, it becomes far more realistic.  So I added documentation for
"catalog_cache_max_size" in 0005.

About the age-based eviction, the bulk eviction seems to take a
bit of a long time, but it happens instead of hash resizing, so the user
doesn't observe additional slowdown.  On the contrary, the pruning
can avoid the rehashing that scans the whole cache.  I think that is the
gain seen in Tomas' experiment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 3b24233b1891b967ccac65a4d21ed0207037578b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/5] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
 	dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+	/* fast path if it's already at the tail */
+	if (head->head.prev == node)
+		return;
+
+	dlist_delete(node);
+	dlist_push_tail(head, node);
+
+	dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3

From ade1f6bf389d834cd4428f302a5cc4deaf66be9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 13 Feb 2019 13:36:38 +0900
Subject: [PATCH 2/5] Memory consumption report feature of memorycontext

This adds a feature that counts memory consumption (in other words, the
internally allocated size for a chunk) and allows it to be read by others.  This
allows other features to know the "(almost) real" consumption of memory.
---
 src/backend/utils/mmgr/aset.c | 13 +
 src/backend/utils/mmgr/mcxt.c |  1 +
 src/include/nodes/memnodes.h  |  4 
 3 files changed, 18 insertions(+)

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 08aff333a4..3c5798734c 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -614,6 +614,9 @@ AllocSetReset(MemoryContext context)
 
 	/* Reset block size allocation sequence, too */
 	set->nextBlockSize = set->initBlockSize;
+
+	/* Reset consumption account */
+	set->header.consumption = 0;
 }
 
 /*
@@ -778,6 +781,8 @@ AllocSetAlloc(MemoryContext context, Size size)
 		/* Disallow external access to private part of chunk header. */
 		VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
+		context->consumption += chunk_size;
+
 		return AllocChunkGetPointer(chunk);
 	}
 
@@ -817,6 +822,8 @@ AllocSetAlloc(MemoryContext context, Size size)
 		/* Disallow external access to private part of chunk header. */
 		VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
 
+		context->consumption += chunk->size;
+
 		return AllocChunkGetPointer(chunk);
 	}
 
@@ -976,6 +983,8 @@ AllocSetAlloc(MemoryContext context, Size size)
 	/* Disallow extern

Re: pg_dump multi VALUES INSERT

2019-02-12 Thread Surafel Temesgen
On Mon, Feb 11, 2019 at 10:20 AM David Rowley 
wrote:

> Reviewing pg_dump-rows-per-insert-option-v14.
>
> Also, maybe one for Fabien (because he seems keen on keeping the
> --rows-per-insert validation code)
>
> strtol() returns a long. dump_inserts is an int, so on machines where
> sizeof(long) == 8 and sizeof(int) == 4 (most machines, these days) the
> validation is not bulletproof.  This could lead to:
>
> $ pg_dump --rows-per-insert=2147483648
> pg_dump: rows-per-insert must be a positive number
>

fixed


> For me, I was fine with the atoi() code that the other options use,
> but maybe Fabien has a problem with the long vs int?
>


The main issue with atoi() is that it does not detect errors: it returns 0 for
both invalid input and an input value of 0.  In our case that doesn't cause a
problem, because we error out for the value 0; but for compressLevel, for example,
if invalid input is supplied it silently proceeds as if an input value of 0 had been supplied.
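
For illustration, a strtol()-based check of the kind being discussed might look
like this (a sketch; the helper name is made up and this is not the patch's code):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Sketch of a bulletproof parse for --rows-per-insert: reject garbage,
 * trailing junk, zero or negative values, and values that overflow int.
 */
static bool
parse_rows_per_insert(const char *arg, int *result)
{
    char       *endptr;
    long        val;

    errno = 0;
    val = strtol(arg, &endptr, 10);
    if (errno != 0 || endptr == arg || *endptr != '\0')
        return false;           /* not a valid number */
    if (val <= 0 || val > INT_MAX)
        return false;           /* outside the accepted range */

    *result = (int) val;
    return true;
}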

regards
Surafel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 9e0bb93f08..0ab57067a8 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -661,9 +661,9 @@ PostgreSQL documentation
 ...).  This will make restoration very slow; it is mainly
 useful for making dumps that can be loaded into
 non-PostgreSQL databases.
-However, since this option generates a separate command for each row,
-an error in reloading a row causes only that row to be lost rather
-than the entire table contents.
+However, since by default the option generates a separate command
+for each row, an error in reloading a row causes only that row to be
+lost rather than the entire table contents.

   
  
@@ -764,11 +764,10 @@ PostgreSQL documentation
 than COPY).  This will make restoration very slow;
 it is mainly useful for making dumps that can be loaded into
 non-PostgreSQL databases.
-However, since this option generates a separate command for each row,
-an error in reloading a row causes only that row to be lost rather
-than the entire table contents.
-Note that
-the restore might fail altogether if you have rearranged column order.
+However, since by default the option generates a separate command
+for each row, an error in reloading a row causes only that row to be
+lost rather than the entire table contents.  Note that the restore
+might fail altogether if you have rearranged column order.
 The --column-inserts option is safe against column
 order changes, though even slower.

@@ -914,8 +913,9 @@ PostgreSQL documentation

 Add ON CONFLICT DO NOTHING to
 INSERT commands.
-This option is not valid unless --inserts or
---column-inserts is also specified.
+This option is not valid unless --inserts,
+--column-inserts or
+--rows-per-insert is also specified.

   
  
@@ -938,6 +938,18 @@ PostgreSQL documentation
   
  
 
+ 
+  --rows-per-insert=nrows
+  
+   
+Dump data as INSERT commands (rather than
+COPY).  Controls the maximum number of rows per
+INSERT statement. The value specified must be a
+number greater than zero.
+   
+  
+ 
+
  
--section=sectionname

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 4a2e122e2d..7ab27391fb 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -140,10 +140,10 @@ typedef struct _dumpOptions
 	int			dumpSections;	/* bitmask of chosen sections */
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
+	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
-	int			dump_inserts;
 	int			column_inserts;
 	int			if_exists;
 	int			no_comments;
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 3a89ad846a..c7403f4e40 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -307,6 +307,8 @@ main(int argc, char **argv)
 	const char *dumpencoding = NULL;
 	const char *dumpsnapshot = NULL;
 	char	   *use_role = NULL;
+	char   *rowPerInsertEndPtr;
+	long			rowPerInsert;
 	int			numWorkers = 1;
 	trivalue	prompt_password = TRI_DEFAULT;
 	int			compressLevel = -1;
@@ -358,7 +360,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
-		{"inserts", no_argument, &dopt.dump_inserts, 1},
+		{"inserts", no_argument, NULL, 8},
 		{"lock-wait-timeout", required_argument, NULL, 2},
 		{"no-tablespaces", no_argument, &dopt.outputNoTablespaces, 1},
 		{"quote-all-identifiers", no_argument, "e_all_identifiers, 1},
@@ -377,6 +379,7 @@ mai

Re: obsolete comment above index_pages_fetched

2019-02-12 Thread Michael Paquier
On Wed, Feb 13, 2019 at 01:57:09PM +0900, Amit Langote wrote:
> total_table_pages is computed by make_one_rel as of the aforementioned
> commit.  Attached fixes this.

Thanks, fixed.
--
Michael


signature.asc
Description: PGP signature


Re: obsolete comment above index_pages_fetched

2019-02-12 Thread Amit Langote
On 2019/02/13 16:33, Michael Paquier wrote:
> On Wed, Feb 13, 2019 at 01:57:09PM +0900, Amit Langote wrote:
>> total_table_pages is computed by make_one_rel as of the aforementioned
>> commit.  Attached fixes this.
> 
> Thanks, fixed.

Thank you.

Regards,
Amit




Re: pg_basebackup ignores the existing data directory permissions

2019-02-12 Thread Haribabu Kommi
On Wed, Feb 13, 2019 at 12:42 PM Michael Paquier 
wrote:

> On Tue, Feb 12, 2019 at 06:03:37PM +1100, Haribabu Kommi wrote:
> > During testing of the allow-group-access permissions on the standby database
> > directory, one of my colleagues found an issue: pg_basebackup
> > doesn't verify whether the existing data directory has the required
> > permissions or not.  This issue is not specific to the allow-group-access
> > permissions; the problem has been present for a long time (I think from
> > the time pg_basebackup was introduced).
>
> In which case this would cause the postmaster to fail to start.
>

Yes, the postmaster fails to start, but I feel that if pg_basebackup takes care
of correcting the directory permissions automatically, like initdb does, that will be
good.


> > The attached patch fixes the problem, similarly to initdb, by changing the
> > permissions of the data directory to the required permissions.
>
> It looks right to me and takes care of the case where group access is
> allowed.  Still, we have not seen any complaints about the current
> behavior either, and pg_basebackup has been around for some time already, so
> I would tend not to back-patch that.  Any thoughts from others?
>

This should be back-patched to 11, where group access was introduced.
Given the lack of complaints, I agree with you that there is no need for
further back-patching.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Planning counters in pg_stat_statements (using pgss_store)

2019-02-12 Thread Sergei Kornilov
Hi

> I think it is related to https://commitfest.postgresql.org/16/1373/
> but I don't know how to link it with.

You can just add a new entry in the open commitfest and then attach the previous thread. 
This is the usual way to pick up old patches. For example, as I did here: 
https://commitfest.postgresql.org/20/1711/

> Yes, there are many things to improve, but before to go deeper,
> I would like to know if that way to do it (with 3 access to pgss hash)
> has a chance to get consensus ?

I cannot say much here; I am not an experienced contributor.
Can you post some performance test results comparing the slowdown between the 
master branch and the proposed patch?

regards, Sergei



Re: reducing isolation tests runtime

2019-02-12 Thread Andres Freund
Hi,

On 2018-01-25 17:34:15 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera  writes:
> 
> > > I think we could solve this by putting in the same parallel group only
> > > slow tests that mostly sleeps, ie. nothing that would monopolize CPU for
> > > long enough to cause a problem.  Concretely:
> > > test: timeouts tuplelock-update deadlock-hard deadlock-soft-2
> > 
> > OK, but there'd better be a comment there explaining the concern
> > very precisely, or somebody will break it.
> 
> Here's a concrete proposal.  Runtime is 45.7 seconds on my laptop.  It
> can be further reduced, but not by more than a second or two unless you
> get in the business of modifying other tests.  (I only modified
> deadlock-soft-2 because it saves 5 seconds).

I'm working an updated version of this. Adding the new tests is a bit
painful because of conflicting names making it harder than necessary to
schedule tests. While it's possible to work out a schedule that doesn't
conflict, it's pretty annoying to do and more importantly seems fragile
- it's very easy to create schedules that succeed on one machine, and
not on another, depending on how slow particular tests are.

I'm more inclined to be a bit more aggressive in renaming tables -
there's not much point in having a lot of "foo"s around.  So I'm
inclined to rename some of the names that are more likely to
conflict. If we agree on doing that, I'd like to do that first, and
commit that more aggressively than the schedule itself.

An alternative approach would be to have isolationtester automatically
create a schema with the specfile's name, and place it in the search
path. But that'd make it impossible to use isolationtester against a
standby - which I think we currently don't do, but which probably would
be a good idea.


With regard to the schedule, I'm inclined to order it so that faster
test groups are earlier on, just to make it more likely to reach the
tests one is debugging faster.  Does that sound sane?

Do we want to maintain a serial version of the schedule too? I'm
wondering if we should just generate both the isolationtester and plain
regression test schedule by either adding an option to pg_regress that
serializes test groups, or by generating the serial schedule file in a
few lines of perl.

Greetings,

Andres Freund