Re: pg_stat_statements and "IN" conditions

2023-02-11 Thread David Geier

Hi,

On 2/9/23 16:02, Dmitry Dolgov wrote:

Unfortunately, rebase is needed again due to recent changes in queryjumblefuncs 
( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc )

I reviewed the last patch applied to some commit from Feb. 4th.

It seems a little strange to me that with const_merge_threshold = 1, the 
following test case gives the same result as with const_merge_threshold = 2:

select pg_stat_statements_reset();
set const_merge_threshold to 1;
select * from test where i in (1,2,3);
select * from test where i in (1,2);
select * from test where i in (1);
select query, calls from pg_stat_statements order by query;

                query                 | calls
--------------------------------------+-------
 select * from test where i in (...)  |     2
 select * from test where i in ($1)   |     1

Probably const_merge_threshold = 1 should produce only "i in (...)"?
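The boundary behavior in question can be sketched as follows (illustrative Python, not the patch's C code; the function names are hypothetical):

```python
# Hypothetical sketch of the boundary condition under discussion; these
# names are illustrative and not taken from the patch.

def should_merge(num_elements, const_merge_threshold):
    # A strict ">" merges only lists *longer* than the threshold, which
    # reproduces the reported behavior: with a threshold of 1, IN (1) is
    # kept as "($1)" while IN (1, 2) becomes "(...)".
    return num_elements > const_merge_threshold

def should_merge_fixed(num_elements, const_merge_threshold):
    # ">=" also merges lists of exactly the threshold length, so a
    # threshold of 1 collapses every IN list, including IN (1).
    return num_elements >= const_merge_threshold
```

Under the ">=" variant, all three queries in the test case above would collapse to "i in (...)".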

Well, it's not intentional; I probably need to be more careful with the
off-by-one. Although I agree to a certain extent with Peter questioning


Please add tests for all the corner cases. At least for (1) an IN list that 
contains only a single element and (2) const_merge_threshold = 1.


Beyond that:

- There's a comment about find_const_walker(). I cannot find that 
function anywhere. What am I missing?


- What about renaming IsConstList() to IsMergableConstList()?

- Don't you intend to use the NUMERIC data column in SELECT * FROM 
test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? 
Otherwise, the test is identical to previous test cases and you're not 
checking for what happens with NUMERICs which are wrapped in FuncExpr 
because of the implicit coercion.


- Don't we want to extend IsConstList() to allow a list of all 
implicitly coerced constants? It's inconsistent that otherwise e.g. 
NUMERICs don't work.


- Typo in /* The firsts merged constant */ (first not firsts)

- Prepared statements are not supported as they contain INs with Param 
instead of Const nodes. While less likely, I've seen applications that 
use prepared statements in conjunction with queries generated through a 
UI, which ended up with tons of prepared queries with different numbers 
of elements in the IN clause. Not necessarily something that must go into 
this patch, but maybe worth thinking about.


- The setting name const_merge_threshold is not very telling without 
knowing the context. While being a little longer, what about 
jumble_const_merge_threshold or queryid_const_merge_threshold or similar?


- Why do we actually only want to merge constants? Why don't we ignore 
the type of element in the IN and merge whatever is there? Is this 
because the original jumbling logic as of now only has support for 
constants?


- Ideally we would even remove duplicates. That would even improve 
cardinality estimation but I guess we don't want to spend the cycles on 
doing so in the planner?


- Why did you change palloc() to palloc0() for clocations array? The 
length is initialized to 0 and FWICS RecordConstLocation() initializes 
all members. Seems to me like we don't have to spend these cycles.


- Can't the logic at the end of IsConstList() be simplified to:

    foreach(temp, elements)
    {
        if (!IsA(lfirst(temp), Const))
            return false;
    }

    /* All elements are of type Const */
    *firstConst = linitial_node(Const, elements);
    *lastConst = llast_node(Const, elements);
    return true;

--
David Geier
(ServiceNow)





Re: pg_stat_statements and "IN" conditions

2023-02-11 Thread Dmitry Dolgov
> On Sat, Feb 11, 2023 at 11:03:36AM +0100, David Geier wrote:
> Hi,
>
> On 2/9/23 16:02, Dmitry Dolgov wrote:
> > > Unfortunately, rebase is needed again due to recent changes in 
> > > queryjumblefuncs ( 9ba37b2cb6a174b37fc51d0649ef73e56eae27fc )
> I reviewed the last patch applied to some commit from Feb. 4th.

Thanks for looking. A few quick answers about the high-level questions below;
the rest I'll incorporate in the new version.

> - There's a comment about find_const_walker(). I cannot find that function
> anywhere. What am I missing?
>
> [...]
>
> - Don't you intend to use the NUMERIC data column in SELECT * FROM
> test_merge_numeric WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)? Otherwise,
> the test is identical to previous test cases and you're not checking for
> what happens with NUMERICs which are wrapped in FuncExpr because of the
> implicit coercion.
>
> - Don't we want to extend IsConstList() to allow a list of all implicitly
> coerced constants? It's inconsistent that otherwise e.g. NUMERICs don't
> work.
>
> [...]
>
> - Prepared statements are not supported as they contain INs with Param
> instead of Const nodes. While less likely, I've seen applications that use
> prepared statements in conjunction with queries generated through a UI which
> ended up with tons of prepared queries with different number of elements in
> the IN clause. Not necessarily something that must go into this patch but
> maybe worth thinking about.

The original version of the patch was doing all of this, i.e. handling
numerics, Param nodes, RTE_VALUES. The commentary about
find_const_walker in the tests refers to a part of that which dealt
with evaluating expressions to see whether they could be reduced to a
constant.

Unfortunately there was significant pushback from reviewers because
of those features. That's why I've reduced the patch to its minimally
useful version, with the intention of re-implementing them as follow-up
patches in the future. This is also why I left tests covering all
this missing functionality -- as breadcrumbs to already-discovered
cases that are important for future extensions.

> - Why do we actually only want to merge constants? Why don't we ignore the
> type of element in the IN and merge whatever is there? Is this because the
> original jumbling logic as of now only has support for constants?
>
> - Ideally we would even remove duplicates. That would even improve
> cardinality estimation but I guess we don't want to spend the cycles on
> doing so in the planner?

I believe these points are beyond the patch goals, as it's less clear
(at least to me) if it's safe or desirable to do so.




Re: pg_stat_statements and "IN" conditions

2023-02-11 Thread Dmitry Dolgov
> On Sat, Feb 11, 2023 at 11:47:07AM +0100, Dmitry Dolgov wrote:
>
> The original version of the patch was doing all of this, i.e. handling
> numerics, Param nodes, RTE_VALUES. The commentary about
> find_const_walker in the tests refers to a part of that which dealt
> with evaluating expressions to see whether they could be reduced to a
> constant.
>
> Unfortunately there was significant pushback from reviewers because
> of those features. That's why I've reduced the patch to its minimally
> useful version, with the intention of re-implementing them as follow-up
> patches in the future. This is also why I left tests covering all
> this missing functionality -- as breadcrumbs to already-discovered
> cases that are important for future extensions.

I'd like to elaborate on this a bit and remind about the origins of the
patch, as it's lost somewhere at the beginning of the thread. The idea
is not pulled out of thin air; everything comes from our attempts to
improve one particular monitoring infrastructure in a real commercial
setting. Every covered use case and test in the original proposal was the
result of field trials, where some application-side library or ORM was
responsible for gigabytes of data in pgss, choking the monitoring agent.




Sort optimizations: Making in-memory sort cache-aware

2023-02-11 Thread Ankit Kumar Pandey

Hi all,

While working on sort optimization for window functions, we saw that the
performance of sorts where all tuples are in memory was bad when the number
of tuples was very large [1].

Eg: work_mem = 4 GB, sort on 4 int columns on a table having 10 million
tuples.

Issues we saw were as follows:

1. The comparetup function re-compares the first key again in case of
tie-break.

2. Frequent cache misses.

Issue #1 is being looked at in a separate patch. I am currently looking at #2.

A possible solution is to batch tuples into groups (which can fit into the L3
cache) before pushing them to the sort function.

After looking at different papers on this (multi-Quicksort, memory-tuned
quicksort, Samplesort and various distributed sorts), although they look
promising (especially samplesort), I would like to get more input, as the
changes look a bit too steep and may or may not be in scope for solving the
actual problem at hand.

Please let me know your opinions: do we really need to re-think quicksort for
this use case, or can we optimize without a major change to the core sorting
algorithm? Are we open to trying new algorithms for sort?

Any suggestions to narrow down the search space for this problem are welcome.

[1] 
https://www.postgresql.org/message-id/CAApHDvqh+qOHk4sbvvy=Qr2NjPqAAVYf82oXY0g=z2hrpc2...@mail.gmail.com

Thanks,

Ankit





Re: UUID v7

2023-02-11 Thread Peter Eisentraut

On 11.02.23 02:14, Andres Freund wrote:

On 2023-02-10 15:57:50 -0800, Andrey Borodin wrote:

As you may know there's a new version of UUID being standardized [0].
These new algorithms of UUID generation are very promising for
database performance.


I agree it's very useful to have.



[0] 
https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format-04


That looks to not be the current version anymore, it's superseded by:
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis


Yes, this means that the draft that an individual had uploaded has now 
been taken on by a working group for formal review.  If there is a 
prototype implementation, this is a good time to provide feedback.  But 
it's too early to ship a production version.
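For reference, the draft's UUIDv7 layout (48-bit Unix millisecond timestamp, 4-bit version, then 12 + 62 random bits with the RFC 4122 variant) can be sketched like this; an illustrative prototype only, not a PostgreSQL implementation:

```python
import os
import time
import uuid

# Minimal UUIDv7 sketch following the field layout in the draft:
# unix_ts_ms (48) | ver (4) | rand_a (12) | var (2) | rand_b (62).

def uuid7():
    ts_ms = time.time_ns() // 1_000_000
    rand_a = int.from_bytes(os.urandom(2), "big") & 0xFFF            # 12 bits
    rand_b = int.from_bytes(os.urandom(8), "big") & ((1 << 62) - 1)  # 62 bits

    value = (ts_ms & ((1 << 48) - 1)) << 80  # 48-bit ms timestamp on top
    value |= 0x7 << 76                       # version = 7
    value |= rand_a << 64                    # rand_a
    value |= 0b10 << 62                      # RFC 4122 variant bits
    value |= rand_b                          # rand_b
    return uuid.UUID(int=value)
```

Because the most significant bits are a timestamp, values generated in sequence are approximately ordered, which is what makes them promising for b-tree insertion locality.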






Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-08 21:03:19 -0800, Andres Freund wrote:
> Pushed the first (and biggest) commit. More tomorrow.

Just pushed the actual pg_stat_io view, the splitting of the tablespace test,
and the pg_stat_io tests.

Yay!

Thanks all for patch and review!


> Already can't wait to see incremental improvements of this version of
> pg_stat_io ;). Tracking buffer hits. Tracking Wal IO. Tracking relation IO
> bypassing shared buffers. Per connection IO statistics. Tracking IO time.

That's still the case.

Greetings,

Andres Freund




Improving inferred query column names

2023-02-11 Thread Andres Freund
Hi,

A common annoyance when writing ad-hoc analytics queries is column naming once
aggregates are used.

Useful column names:
SELECT reads, writes FROM pg_stat_io;
column names: reads, writes

Not useful column names:
SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
column names: sum, sum

So I often end up manually writing:
SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM pg_stat_io;


Of course we can't infer useful column names for everything, but for something
like this, it shouldn't be too hard to do better. E.g. by combining the
function name with the column name of the argument, if a single plain column
is the argument.
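The rule described above could be sketched like so (illustrative Python, not PostgreSQL code; the helper name is hypothetical):

```python
# When an aggregate's only argument is a plain column reference, combine
# the function name with the column name; otherwise keep today's behavior
# of using the bare function name.

def infer_column_name(func_name, arg_exprs):
    # arg_exprs: textual argument expressions of the function call
    if len(arg_exprs) == 1 and arg_exprs[0].isidentifier():
        return f"{func_name}_{arg_exprs[0]}"
    return func_name
```

So SELECT SUM(reads), SUM(writes) FROM pg_stat_io would come out as sum_reads, sum_writes instead of sum, sum.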

I think on a green field it'd be clearly better to do something like the
above.  What does give me pause is that it seems quite likely to break
existing queries, and to a lesser degree, might break applications relying on
inferred column names.

Can anybody think of a good way out of that? It's not like that problem is
going to go away at some point...

Greetings,

Andres Freund




Re: Sort optimizations: Making in-memory sort cache-aware

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-11 17:49:02 +0530, Ankit Kumar Pandey wrote:
> 2. Frequent cache misses
>
> Issue #1 is being looked in separate patch. I am currently looking at #2.
>
> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.
>
> After looking at different papers on this (multi-Quicksort, memory-tuned
> quicksort, Samplesort and various distributed sorts), although they look
> promising (especially samplesort), I would like to get more input as the
> changes look a bit too steep and may or may not be in scope for solving the
> actual problem at hand.
>
> Please let me know your opinions: do we really need to re-think quicksort
> for this use case, or can we optimize without a major change to the core
> sorting algorithm? Are we open to trying new algorithms for sort?

I think it'll require some experimentation to know what we can and should
do. Clearly we're not going to do anything fundamental if the gains are a few
percent. But it's not hard to imagine that the gains will be substantially
larger.


I believe that a significant part of the reason we have low cache hit ratios
once the input gets larger, is that we kind of break the fundamental benefit
of qsort:

The reason quicksort is a good sorting algorithm, despite plenty of downsides, is
that it has pretty decent locality, due to its divide and conquer
approach. However, tuplesort.c completely breaks that for > 1 column
sorts. While spatial locality for accesses to the ->memtuples array is decent
during sorting, due to qsort's subdividing of the problem, the locality for
access to the tuples is *awful*.

The memtuples array is reordered while sorting, but the tuples themselves
aren't. Unless the input data is vaguely presorted, the access pattern for the
tuples has practically zero locality.

The only reason that doesn't completely kill us is that SortTuple contains
datum1 inline and that abbreviated keys reduce the cost of by-reference datums
in the first column.


There are things we could do to improve upon this that don't require swapping
out our sorting implementation wholesale.


One idea is to keep track of the distinctness of the first sorted column and
to behave differently if it's significantly lower than the number of
to-be-sorted tuples.  E.g. by doing a first sort solely on the first column,
then reordering the MinimalTuples in memory, and then continuing normally.

There's two main problems with that idea:
1) It's hard to re-order the tuples in memory, without needing substantial
   amounts of additional memory
2) If the second column also is not very distinct, it doesn't buy you much, if
   anything.

But it might provide sufficient benefits regardless. And a naive version,
requiring additional memory, should be quick to hack up.


I have *not* looked at a whole lot of papers of cache optimized sorts, and the
little I did was not recently. Partially because I am not sure that they are
that applicable to our scenarios: Most sorting papers don't discuss
variable-width data, nor a substantial amount of cache-polluting work while
gathering the data that needs to be sorted.

I think:

> Possible solution was to batch tuples into groups (which can fit into L3
> cache) before pushing them to sort function.

is the most general solution to the issue outlined above. I wouldn't try to
implement this via a full new sorting algorithm though.

My suggestion would be to collect a roughly ~L3 sized amount of tuples, sort
just those using the existing code, allocate new memory for all the
corresponding MinimalTuples in one allocation, and copy the MinimalTuples into
that, obviously in ->memtuples order.

Even if we just use the existing code for the overall sort after that, I'd
expect that to yield noticeable benefits.

It's very likely we can do better than just doing a plain sort of everything
after that.

You effectively end up with a bounded number of pre-sorted blocks, so the most
obvious thing to try is to build a heap of those blocks and effectively do a
heapsort between the presorted blocks.
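The scheme described above can be sketched as follows (illustrative Python, not tuplesort.c code): sort cache-sized batches independently, then treat each sorted batch as a presorted run and heap-merge the runs, much like the final pass of an external merge sort.

```python
import heapq

def cache_aware_sort(tuples, batch_size):
    # Sort each roughly-cache-sized batch independently; within a batch,
    # both the index array and the tuples stay hot in cache.
    runs = [sorted(tuples[i:i + batch_size])
            for i in range(0, len(tuples), batch_size)]
    # heapq.merge keeps one head element per run on a heap and consumes
    # each presorted run sequentially, so accesses within a run stay local.
    return list(heapq.merge(*runs))
```

The heap holds only one element per run, so its size is bounded by the number of presorted blocks rather than by the total tuple count.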



A related, but separate, improvement is to reduce / remove the memory
allocation overhead. The switch to GenerationContext helped some, but still
leaves a bunch of overhead. And it's not used for bounded sorts right now.

We don't palloc/pfree individual tuples during normal sorts, but we do have
some, for bounded sorts.  I think with a reasonable amount of work we could
avoid that for all tuples in ->tuplecontext. And switch to a trivial bump
allocator, getting rid of all allocator overhead.
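A toy model of such a bump allocator (illustrative only; the class is hypothetical): allocation is a single pointer increment into a fixed arena, there is no per-allocation free, and everything is released at once, which is exactly why it only works if tuples are never pfree'd individually.

```python
class BumpArena:
    def __init__(self, size):
        self.buf = bytearray(size)  # the fixed arena
        self.used = 0               # the "bump pointer"

    def alloc(self, nbytes):
        if self.used + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        off = self.used
        self.used += nbytes   # the whole allocator: one addition
        return off            # offset standing in for a pointer

    def reset(self):
        self.used = 0         # "free" every allocation in O(1)
```

No per-chunk headers, no free lists: the entire allocator overhead is one bounds check and one addition per allocation.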

The biggest source of individual pfree()s in the bounded case is that we
unconditionally copy the tuple into base->tuplecontext during puttuple. Even
though quite likely we'll immediately free it in the "case TSS_BOUNDED" block.

We could instead pre-check that the tuple won't immediately be discarded,
before copying it into tuplecontext. Only in the TSS_BOUNDED case, of
course.

I think we also can re

Re: Improving inferred query column names

2023-02-11 Thread Vladimir Churyukin
That is a good idea for simple cases; I'm just curious what it would look
like for more complex cases (you can have all kinds of expressions as
parameters for aggregate function calls).
If it works only for simple cases, I think it would be confusing and not
very helpful.
Wouldn't it make more sense to just deduplicate the names by adding
numerical postfixes, like sum_1, sum_2?
For backwards compatibility I guess you can have a GUC flag controlling
that behavior that can be set into backwards compatibility mode if required.
The previous functionality can be declared deprecated and removed (with the
flag) once the current version becomes unsupported.
(or with a different deprecation policy; I'm not sure what the general
rule for breaking changes and deprecation currently is).
If there is a clearly defined deprecation policy and a backwards
compatibility option, it should be good, no? Just my 2 cents.
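The suffix idea could be sketched like this (hypothetical helper, not psql code): leave unique names alone and number only the colliding ones.

```python
from collections import Counter

def dedupe_names(names):
    total = Counter(names)   # how often each inferred name occurs
    seen = Counter()         # running count per colliding name
    result = []
    for name in names:
        if total[name] == 1:
            result.append(name)          # unique name: keep as-is
        else:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")  # sum -> sum_1, sum_2
    return result
```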

-Vladimir Churyukin

On Sat, Feb 11, 2023 at 11:24 AM Andres Freund  wrote:

> Hi,
>
> A common annoyance when writing ad-hoc analytics queries is column naming
> once
> aggregates are used.
>
> Useful column names:
> SELECT reads, writes FROM pg_stat_io;
> column names: reads, writes
>
> Not useful column names:
> SELECT SUM(reads), SUM(writes) FROM pg_stat_io;
> column names: sum, sum
>
> So I often end up manually writing:
> SELECT SUM(reads) AS sum_reads, SUM(writes) AS sum_writes, ... FROM
> pg_stat_io;
>
>
> Of course we can't infer useful column names for everything, but for
> something
> like this, it shouldn't be too hard to do better. E.g. by combining the
> function name with the column name in the argument, if a single plain
> column
> is the argument.
>
> I think on a green field it'd be clearly better to do something like the
> above.  What does give me pause is that it seems quite likely to break
> existing queries, and to a lesser degree, might break applications relying
> on
> inferred column names.
>
> Can anybody think of a good way out of that? It's not like that problem is
> going to go away at some point...
>
> Greetings,
>
> Andres Freund
>
>
>


Re: proposal: psql: psql variable BACKEND_PID

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-09 10:11:21 +0100, Pavel Stehule wrote:
> first and main (for me) - I can use psql variables tab complete - just
> :B - it is significantly faster
> second - I can see all connection related information by \set
> third - there is no hook on reconnect in psql - so if you implement
> BACKEND_PID yourself, you have to remember to re-run the query with
> pg_backend_pid() after any reconnect or connection change.
> 
> It is clean, so you can run "select pg_backend_pid() AS "BACKEND_PID" \gset"
> and store it in .psqlrc. But most of the time I am in a customer's
> environment, and I don't have the time or possibility to do a complete
> setup of .psqlrc. It looks (for me) like a generally useful feature to
> have everywhere.

I personally just solved this by using %p in PROMPT*. Not that that serves
quite the same niche.

I guess the fact that we have %p is a minor precedent of psql special casing
backend pid in psql.

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Heikki Linnakangas

v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
This can be applied separately from the rest of the patches, which is 
nice. Some small comments on it:


* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

* GetVictimBuffer needs a comment to explain what it does. In 
particular, mention that it returns a buffer that's pinned and known 
!BM_TAG_VALID.


* I suggest renaming 'cur_buf' and other such local variables in 
GetVictimBuffer to just 'buf'. The 'cur' prefix suggests that there is some 
other buffer involved too, but there is no 'prev' or 'next' or 'other' 
buffer. The old code called it just 'buf' too, and before this patch it 
actually was a bit confusing because there were two buffers involved. 
But with this patch, GetVictimBuffer only deals with one buffer at a time.


* This FIXME:


/* OK, do the I/O */
/* FIXME: These used the wrong smgr before afaict? */
{
    SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
                                 InvalidBackendId);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
                                              buf_hdr->tag.blockNum,
                                              smgr->smgr_rlocator.locator.spcOid,
                                              smgr->smgr_rlocator.locator.dbOid,
                                              smgr->smgr_rlocator.locator.relNumber);

    FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
    LWLockRelease(content_lock);

    ScheduleBufferTagForWriteback(&BackendWritebackContext,
                                  &buf_hdr->tag);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
                                             buf_hdr->tag.blockNum,
                                             smgr->smgr_rlocator.locator.spcOid,
                                             smgr->smgr_rlocator.locator.dbOid,
                                             smgr->smgr_rlocator.locator.relNumber);
}


I believe that was intentional. The probes previously reported the block 
and relation whose read *caused* the eviction. It was not just the smgr 
but also the blockNum and forkNum that referred to the block that was 
being read. There's another pair of probe points, 
TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that 
indicate the page that is being flushed.


I see that reporting the evicted page is more convenient with this 
patch, otherwise you'd need to pass the smgr and blocknum of the page 
that's being read to InvalidateVictimBuffer(). IMHO you can just remove 
these probe points. We don't need to bend over backwards to maintain 
specific probe points.


* InvalidateVictimBuffer reads the buffer header with an atomic read op, 
just to check if BM_TAG_VALID is set. If it's not, it does nothing 
(except for a few Asserts). But the caller has already read the buffer 
header. Consider refactoring it so that the caller checks BM_TAG_VALID, 
and only calls InvalidateVictimBuffer if it's set, saving one atomic 
read in InvalidateVictimBuffer. I think it would be just as readable, so 
no loss there. I doubt the atomic read makes any measurable performance 
difference, but it looks redundant.


* I don't understand this comment:


/*
 * Clear out the buffer's tag and flags and usagecount.  We must do
 * this to ensure that linear scans of the buffer array don't think
 * the buffer is valid.
 *
 * XXX: This is a pre-existing comment I just moved, but isn't it
 * entirely bogus with regard to the tag? We can't do anything with
 * the buffer without taking BM_VALID / BM_TAG_VALID into
 * account. Likely doesn't matter because we're already dirtying the
 * cacheline, but still.
 *
 */
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);


What exactly is wrong with clearing the tag? What does dirtying the 
cacheline have to do with the correctness here?


* pgindent

- Heikki




Re: proposal: psql: psql variable BACKEND_PID

2023-02-11 Thread Andres Freund
On 2023-02-04 15:35:58 -0500, Corey Huinker wrote:
> This effectively makes the %p prompt (which I use in the example above) the
> same as %:BACKEND_PID: and we may want to note that in the documentation.

I don't really see much of a point in noting this in the doc.  I don't know in
what situation a user would be helped by reading

+ This substitution is almost equal to using 
%:BACKEND_PID:,
+ but it is safer, because psql variable can be overwriten or unset.

or just about any reformulation of that?




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
> > v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
> This can be applied separately from the rest of the patches, which is nice.
> Some small comments on it:

Thanks for looking at these!


> * Needs a rebase, it conflicted slightly with commit f30d62c2fc.

Will work on that.


> * GetVictimBuffer needs a comment to explain what it does. In particular,
> mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

Will add.


> * I suggest renaming 'cur_buf' and other such local variables in
> GetVictimBuffer to just 'buf'. The 'cur' prefix suggests that there is some
> other buffer involved too, but there is no 'prev' or 'next' or 'other'
> buffer. The old code called it just 'buf' too, and before this patch it
> actually was a bit confusing because there were two buffers involved. But
> with this patch, GetVictimBuffer only deals with one buffer at a time.

Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.


> * This FIXME:
>
> > /* OK, do the I/O */
> > /* FIXME: These used the wrong smgr before afaict? */
> > {
> >     SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
> >                                  InvalidBackendId);
> >
> >     TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
> >                                               buf_hdr->tag.blockNum,
> >                                               smgr->smgr_rlocator.locator.spcOid,
> >                                               smgr->smgr_rlocator.locator.dbOid,
> >                                               smgr->smgr_rlocator.locator.relNumber);
> >
> >     FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
> >     LWLockRelease(content_lock);
> >
> >     ScheduleBufferTagForWriteback(&BackendWritebackContext,
> >                                   &buf_hdr->tag);
> >
> >     TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
> >                                              buf_hdr->tag.blockNum,
> >                                              smgr->smgr_rlocator.locator.spcOid,
> >                                              smgr->smgr_rlocator.locator.dbOid,
> >                                              smgr->smgr_rlocator.locator.relNumber);
> > }
>
> I believe that was intentional. The probes previously reported the block and
> relation whose read *caused* the eviction. It was not just the smgr but also
> the blockNum and forkNum that referred to the block that was being read.

You're probably right.  It's certainly not understandable from our docs
though:


 buffer-write-dirty-start
 (ForkNumber, BlockNumber, Oid, Oid, Oid)
 Probe that fires when a server process begins to write a dirty
  buffer.  (If this happens often, it implies that
   is too
  small or the background writer control parameters need adjustment.)
  arg0 and arg1 contain the fork and block numbers of the page.
  arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
  identifying the relation.



> I see that reporting the evicted page is more convenient with this patch,
> otherwise you'd need to pass the smgr and blocknum of the page that's being
> read to InvalidateVictimBuffer(). IMHO you can just remove these probe
> points. We don't need to bend over backwards to maintain specific probe
> points.

Agreed.


> * InvalidateVictimBuffer reads the buffer header with an atomic read op,
> just to check if BM_TAG_VALID is set.

It's not a real atomic op, in the sense of being a special instruction. It does
force the compiler to actually read from memory, but that's it.

But you're right, even that is unnecessary. I think it ended up that way
because I also wanted the full buf_hdr, and it seemed somewhat error prone to
pass in both.



> * I don't understand this comment:
>
> > /*
> >  * Clear out the buffer's tag and flags and usagecount.  We must do
> >  * this to ensure that linear scans of the buffer array don't think
> >  * the buffer is valid.
> >  *
> >  * XXX: This is a pre

Re: Improving inferred query column names

2023-02-11 Thread Corey Huinker
On Sat, Feb 11, 2023 at 3:47 PM Vladimir Churyukin 
wrote:

> For backwards compatibility I guess you can have a GUC flag controlling
> that behavior that can be set into backwards compatibility mode if required.
> The previous functionality can be declared deprecated and removed (with
> the flag) once the current version becomes unsupported.
>

Seems more like a per-session setting than a GUC.

Here's a suggestion off the top of my head.

We create a session setting inferred_column_name_template.

The template takes a formatting directive %N which is just a counter

SET inferred_column_name_template = 'col_%N'


which would give you col_1, col_2, regardless of what kind of expression
the columns were

We could introduce another directive, %T

SET inferred_column_name_template = '%T_%N'


which prints the datatype short name of the column. In this case, %N would
increment per datatype, so text_1, integer_1, text_2, timestamptz_1, text_3

Getting fancier, we could introduce something less datatype centric, %F

SET inferred_column_name_template = '%F_%N'


Which would walk the following waterfall and stop on the first match

   1. the datatype short name if the expression is explicitly cast
(either CAST or ::)
   2. the name of the function if the outermost expression was a function
(aggregate, window, or scalar), so sum_1, substr_1
   3. 'case' if the outermost expression was a CASE
   4. 'expr' if the expression was effectively an operator (SELECT 3+4,
'a' || 'b', etc.)
   5. the datatype short name for anything that doesn't match any of the
previous, and for explicit casts


Keeping track of all the %N counters could get silly, so maybe a %P which
is simply the numeric column position of the column, so your result set
would go like:  id, name, col_3, last_login, col_5.

We would have to account for the case where the user left either %N or %P
out of the template, so one of them would be an implied suffix if both were
absent, or maybe we go with

SET inferred_column_name_prefix = '%F_';
SET inferred_column_name_counter = 'position';   /* position, counter,
per_type_counter */

Or we just cook up a few predefined naming schemes, and let the user pick
from those.

One caution I have is that I have seen several enterprise app database
designs that have lots of user-customizable columns with names like
varchar1, numeric4, etc. Presumably the user would know their environment
and not pick a confusing template.
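A hypothetical expansion of the template directives sketched above (%T = datatype short name, %N = counter, incrementing per datatype when %T is present); this is a sketch of the mechanism only, not a concrete design:

```python
def render_names(template, datatypes):
    # datatypes: datatype short name for each result column, in order
    per_type = {}
    result = []
    for pos, dtype in enumerate(datatypes, start=1):
        per_type[dtype] = per_type.get(dtype, 0) + 1
        # %N counts per datatype when %T is in the template, per the
        # proposal; otherwise it's just the column position.
        n = per_type[dtype] if "%T" in template else pos
        result.append(template.replace("%T", dtype).replace("%N", str(n)))
    return result
```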


Re: Transparent column encryption

2023-02-11 Thread Mark Dilger



> On Jan 25, 2023, at 10:44 AM, Peter Eisentraut 
>  wrote:
> 
> Here is a new patch.  Changes since v14:
> 
> - Fixed some typos (review by Justin Pryzby)
> - Fixed backward compat. psql and pg_dump (review by Justin Pryzby)
> - Doc additions (review by Jacob Champion)
> - Validate column_encryption option in libpq (review by Jacob Champion)
> - Handle column encryption in inheritance
> - Change CEKs and CMKs to live inside 
> schemas

Thanks Peter.  Here are some observations about the documentation in patch 
version 15.

In acronyms.sgml, the CEK and CMK entries should link to documentation, perhaps 
linkend="glossary-column-encryption-key" and 
linkend="glossary-column-master-key".  These glossary entries should in turn 
link to linkend="ddl-column-encryption".

In ddl.sgml, the sentence "forcing encryption of certain parameters in the 
client library (see its documentation)" should link to 
linkend="libpq-connect-column-encryption".

Did you intend the use of "transparent data encryption" (rather than 
"transparent column encryption") in datatype.sgml?  If so, what's the 
difference?

Is this feature intended to be available from ecpg?  If so, can we maybe 
include an example in 36.3.4. Prepared Statements showing how to pass the 
encrypted values securely.  If not, can we include verbiage about that 
limitation, so folks don't waste time trying to figure out how to do it?

The documentation for pg_dump (and pg_dumpall) now includes a 
--decrypt-encrypted-columns option, which I suppose requires cmklookup to first 
be configured, and for PGCMKLOOKUP to be exported.  There isn't anything in the 
pg_dump docs about this, though, so maybe a link to section 5.5.3 with a 
warning about not running pg_dump this way on the database server itself?

How does a psql user mark a parameter as having forced encryption?  A libpq 
user can specify this in the paramFormats array, but I don't see any syntax for 
doing this from psql.  None of the perl tap tests you've included appear to do 
this (except indirectly when calling test_client); grep'ing for the libpq error 
message "parameter with forced encryption is not to be encrypted" in the tests 
has no matches.  Is it just not possible?  I thought you'd mentioned some 
syntax for this when we spoke in person, but I don't see it now.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Andres Freund
On 2023-02-11 13:36:51 -0800, Andres Freund wrote:
> Even though it's not a correctness issue, it seems to me that
> DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
> before doing anything further.  Particularly in DropRelationsAllBuffers(), the
> check we do for each buffer isn't cheap. Doing it for buffers that don't even
> have a tag seems .. not smart.

There's a small regression for a single relation, but after that it's a clear
benefit.

32GB shared buffers, empty. The test creates N new relations and then rolls
back.

num relations   HEAD (tps)   precheck (tps)
1               46.11        45.22
2               43.24        44.87
4               35.14        44.20
8               28.72        42.79

I don't understand the regression at 1, TBH.  I think it must be a random code
layout issue, because the same pre-check in DropRelationBuffers() (exercised
via TRUNCATE of a newly created relation), shows a tiny speedup.




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:
> I'll continue reviewing this, but here's some feedback on the first two
> patches:
> 
> v2-0001-aio-Add-some-error-checking-around-pinning.patch:
> 
> I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
> won't add anything without assertions, of course, but still. No objections
> if you think it's worth it.

It's so easy to get confused about local/non-local buffers that I think it is
useful.  I think we really need to consider cleaning up the separation
further. Having half the code for local buffers in bufmgr.c and the other half
in localbuf.c, without any scheme that I can recognize, is not a good state of
affairs.


It bothers me somewhat that ConditionalLockBufferForCleanup() silently accepts
multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not.  And yes, I did encounter a bug hidden by that when making vacuumlazy use
AIO as part of that patchset.  That's why I made BufferCheckOneLocalPin()
externally visible.



> v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
> 
> Looks good as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
> which zeroes the page, and then we call PageInit to zero the page again.
> RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
> previously, but with P_NEW, that is always true.

It is quite silly, and it shows up noticeably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.

Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series instead,
though.

Greetings,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Andres Freund
Hi,

On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> On Tue, 10 Jan 2023 at 15:08, Andres Freund  wrote:
> > Thanks for letting me know. Updated version attached.
> 
> I'm not too sure I've qualified for giving a meaningful design review
> here, but I have started looking at the patches and so far only made
> it as far as 0006.

Thanks!


> I noted down the following while reading:
> 
> v2-0001:
> 
> 1. BufferCheckOneLocalPin needs a header comment
> 
> v2-0002:
> 
> 2. The following comment and corresponding code to release the
> extension lock has been moved now.
> 
> /*
> * Release the file-extension lock; it's now OK for someone else to extend
> * the relation some more.
> */
> 
> I think it's worth detailing out why it's fine to release the
> extension lock in the new location. You've added detail to the commit
> message but I think you need to do the same in the comments too.

Will do.


> v2-0003
> 
> 3. FileFallocate() and FileZero() should likely document what they
> return, i.e zero on success and non-zero on failure.

I guess I just tried to fit in with the rest of the file :)


> 4. I'm not quite clear on why you've modified FileGetRawDesc() to call
> FileAccess() twice.

I do not have the faintest idea what happened there... Will fix.


> v2-0004:
> 
> 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> the usage count and one that does not? Couldn't the version that does
> not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.


Thanks,

Andres Freund




Re: refactoring relation extension and BufferAlloc(), faster COPY

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-11 14:25:06 -0800, Andres Freund wrote:
> On 2023-01-20 13:40:55 +1300, David Rowley wrote:
> > v2-0004:
> > 
> > 5. Is it worth having two versions of PinLocalBuffer() one to adjust
> > the usage count and one that does not? Couldn't the version that does
> > not adjust the count skip doing pg_atomic_read_u32()?
> 
> I think it'd be nicer to just move the read inside the if
> (adjust_usagecount). That way the rest of the function doesn't have to be
> duplicated.

Ah, no, we need it for the return value. No current users of
  PinLocalBuffer(adjust_usagecount = false)
need the return value, but I don't think that will necessarily remain the case.

I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.

Greetings,

Andres Freund




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2023-02-11 Thread Andres Freund
Hi,

On 2022-11-01 08:32:48 +0530, Bharath Rupireddy wrote:
> +/*
> + * pg_pwrite_zeros
> + *
> + * Writes zeros to a given file. Input parameters are "fd" (file descriptor 
> of
> + * the file), "size" (size of the file in bytes).
> + *
> + * On failure, a negative value is returned and errno is set appropriately so
> + * that the caller can use it accordingly.
> + */
> +ssize_t
> +pg_pwrite_zeros(int fd, size_t size)
> +{
> + PGAlignedBlock  zbuffer;
> + size_t  zbuffer_sz;
> + struct ioveciov[PG_IOV_MAX];
> + int blocks;
> + size_t  remaining_size = 0;
> + int i;
> + ssize_t written;
> + ssize_t total_written = 0;
> +
> + zbuffer_sz = sizeof(zbuffer.data);
> +
> + /* Zero-fill the buffer. */
> + memset(zbuffer.data, 0, zbuffer_sz);

I previously commented on this - why are we memseting a buffer on every call
to this?  That's not at all free.

Something like
static const PGAlignedBlock zerobuf = {0};
would do the trick.  You do need to cast the const away, to assign to
iov_base, but that's not too ugly.

Greetings,

Andres Freund




Re: Importing pg_bsd_indent into our source tree

2023-02-11 Thread Andres Freund
Hi,

On 2023-02-11 18:54:00 -0500, Tom Lane wrote:
> I ended up converting the test infrastructure into a TAP test,
> which kind of feels like overkill; but the Meson system doesn't
> seem to provide any lower-overhead way to run a test.

FWIW, The default way to indicate failures in a test is the exit
code. Obviously that allows less detailed reporting, but other than that, it
works (that's how we test pg_regress today).


> Also, for now both build systems *will* run tests on it,
> although I'm not sure if plugging it into "make check-world"
> is enough to cause the cfbot to do so, and I'm pretty sure
> that the buildfarm won't notice that.

That's sufficient for cfbot, on the CI task still using autoconf. And for
meson it'll also suffice.

It actually already ran:
https://cirrus-ci.com/build/5984572702195712

The windows test failure is a transient issue independent of the patch
(something went wrong with image permissions). However the linux autoconf one
isn't:
https://api.cirrus-ci.com/v1/artifact/task/5482952532951040/log/src/tools/pg_bsd_indent/tmp_check/log/regress_log_001_pg_bsd_indent

# Running: pg_bsd_indent --version
Command 'pg_bsd_indent' not found in 
/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin, 
/tmp/cirrus-ci-build/src/tools/pg_bsd_indent, /usr/local/sbin, /usr/local/bin, 
/usr/sbin, /usr/bin, /sbin, /bin at 
/tmp/cirrus-ci-build/src/tools/pg_bsd_indent/../../../src/test/perl/PostgreSQL/Test/Utils.pm
 line 832.

I guess there might be a missing dependency? PATH looks sufficient.


> I'll let the cfbot loose on this, and if it runs the tests
> successfully I plan to go ahead and push.  We can resolve
> the installation question later.  We might want to back off
> testing too once we're satisfied about portability.

> (I left out the 0004 declaration-formatting patch for now, btw.)

Makes sense.

Greetings,

Andres Freund




Re: Importing pg_bsd_indent into our source tree

2023-02-11 Thread Tom Lane
Andres Freund  writes:
> # Running: pg_bsd_indent --version
> Command 'pg_bsd_indent' not found in 
> /tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin, 
> /tmp/cirrus-ci-build/src/tools/pg_bsd_indent, /usr/local/sbin, 
> /usr/local/bin, /usr/sbin, /usr/bin, /sbin, /bin at 
> /tmp/cirrus-ci-build/src/tools/pg_bsd_indent/../../../src/test/perl/PostgreSQL/Test/Utils.pm
>  line 832.

> I guess there might be a missing dependency? PATH looks sufficient.

Yeah, I expected that "check" would have a dependency on "all",
but apparently it doesn't (and I'd missed this because I had
pg_bsd_indent installed elsewhere in my PATH :-().  New build
running now.

regards, tom lane




Re: Importing pg_bsd_indent into our source tree

2023-02-11 Thread Tom Lane
Hmmm ... ci autoconf build is now happy, but the Windows run complains
that none of the output files match.  I'm betting that this is a
Windows-newline problem, since I now see that indent.c opens both the
input and output files in default (text) mode.  I'm inclined to
change it to open the output file in binary mode while leaving the
input in text, which should have the effect of stripping \r if it's
present.

regards, tom lane




Re: Importing pg_bsd_indent into our source tree

2023-02-11 Thread Thomas Munro
On Sun, Feb 12, 2023 at 2:44 PM Tom Lane  wrote:
> I wrote:
> > Hmmm ... ci autoconf build is now happy, but the Windows run complains
> > that none of the output files match.  I'm betting that this is a
> > Windows-newline problem, since I now see that indent.c opens both the
> > input and output files in default (text) mode.  I'm inclined to
> > change it to open the output file in binary mode while leaving the
> > input in text, which should have the effect of stripping \r if it's
> > present.
>
> So let's see if that theory is correct at all ...

(Since I happened to be tinkering on cfbot while you posted these, I
noticed that cfbot took over 50 minutes to start processing the v4.
The problem was upstream: the time in the second-last column of
https://commitfest.postgresql.org/42/ didn't change for that whole
time, even though the archives had your new email.  Cf castles, sand;
I should probably get a better trigger mechanism :-)  I like that page
because it lets me poll one single end point once per minute to learn
about changes across all threads, but I am not sure what sort of
technology connects the archives to the CF app, and how it can fail.)




Making aggregate deserialization (and WAL receive) functions slightly faster

2023-02-11 Thread David Rowley
While working on 16fd03e95, I noticed that in each aggregate
deserialization function, in order to "receive" the bytea value that
is the serialized aggregate state, appendBinaryStringInfo is used to
append the bytes of the bytea value onto a temporary StringInfoData.
Using  appendBinaryStringInfo seems a bit wasteful here. We could
really just fake up a StringInfoData and point directly to the bytes
of the bytea value.

The best way I could think of to do this was to invent
initStringInfoFromString() which initialises a StringInfoData and has
the ->data field point directly at the specified buffer.  This will
mean that it would be unsafe to do any appendStringInfo* operations on
the resulting StringInfoData as enlargeStringInfo would try to
repalloc the data buffer, which might not even point to a palloc'd
string.  I thought it might be fine just to mention that in the
comments for the function, but we could probably do a bit better and
set maxlen to something like -1 and Assert() we never see -1 in the
various append functions.  I wasn't sure it was worth it, so didn't do
that.

I had a look around for other places that might be following the same
pattern. I only found range_recv() and XLogWalRcvProcessMsg().  I
didn't adjust the range_recv() one as I couldn't see how to do that
without casting away a const.  I did adjust the XLogWalRcvProcessMsg()
one and got rid of a global variable in the process.

I've attached the benchmark results I got after testing how the
modification changed the performance of string_agg_deserialize().

I was hoping this would have a slightly more impressive performance
impact, especially for string_agg() and array_agg() as the aggregate
states of those can be large.  However, in the test I ran, there's
only a very slight performance gain. I may just not have found the
best case, however.

David
#!/bin/bash

psql -c "drop table if exists t;" postgres
psql -c "create table t (a text not null);" postgres
psql -c "insert into t select x::text from generate_series(1,100)x;" postgres
psql -c "vacuum freeze t;" postgres
psql -c "select pg_prewarm('t');" postgres
for q in "select length(string_agg(a,',')) from t;"
do
for i in {0..32}
do
echo Testing with $i parallel workers: $q
psql -c "alter table t set (parallel_workers = $i);" postgres
echo $q > bench.sql
pgbench -f bench.sql -n -T 10 postgres | grep latency
done
done


From 75d97a066ac81f5a50b1b7618ad9e240f5497860 Mon Sep 17 00:00:00 2001
From: David Rowley 
Date: Sat, 11 Feb 2023 10:05:32 +1300
Subject: [PATCH v1 1/2] Optimize various aggregate deserialization functions

The serialized representation of an internal aggregate state is a bytea
value.  In each deserial function, in order to "receive" the bytea value
we appended it onto a short-lived StringInfoData using
appendBinaryStringInfo.  This was a little wasteful as it meant having to
palloc memory, copy a (possibly long) series of bytes then later pfree
that memory.  Instead of going to this extra trouble, we can just fake up
a StringInfoData and point the data directly at the bytea's payload.  This
should help increase the performance of internal aggregate
deserialization.

To make this work, we introduce a function named initStringInfoFromString.
This just makes the given StringInfoData point directly to the input
buffer and sets the length to the given length.
---
 src/backend/utils/adt/array_userfuncs.c | 20 +
 src/backend/utils/adt/numeric.c | 40 ++---
 src/backend/utils/adt/varlena.c | 10 +++
 src/common/stringinfo.c | 20 +
 src/include/lib/stringinfo.h| 13 
 5 files changed, 61 insertions(+), 42 deletions(-)

diff --git a/src/backend/utils/adt/array_userfuncs.c 
b/src/backend/utils/adt/array_userfuncs.c
index c8a8d33ca3..78fd0d2340 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -723,12 +723,11 @@ array_agg_deserialize(PG_FUNCTION_ARGS)
sstate = PG_GETARG_BYTEA_PP(0);
 
/*
-* Copy the bytea into a StringInfo so that we can "receive" it using 
the
-* standard recv-function infrastructure.
+* Fake up a StringInfo pointing to the bytea's value so we can 
"receive"
+* the serialized aggregate state value.
 */
-   initStringInfo(&buf);
-   appendBinaryStringInfo(&buf,
-  VARDATA_ANY(sstate), 
VARSIZE_ANY_EXHDR(sstate));
+   initStringInfoFromString(&buf, VARDATA_ANY(sstate),
+
VARSIZE_ANY_EXHDR(sstate));
 
/* element_type */
element_type = pq_getmsgint(&buf, 4);
@@ -825,7 +824,6 @@ array_agg_deserialize(PG_FUNCTION_ARGS)
}
 
pq_getmsgend(&buf);
-   pfree(buf.data);
 
PG_RETURN_POINTER(r

Re: Making aggregate deserialization (and WAL receive) functions slightly faster

2023-02-11 Thread Tom Lane
David Rowley  writes:
> While working on 16fd03e95, I noticed that in each aggregate
> deserialization function, in order to "receive" the bytea value that
> is the serialized aggregate state, appendBinaryStringInfo is used to
> append the bytes of the bytea value onto a temporary StringInfoData.
> Using  appendBinaryStringInfo seems a bit wasteful here. We could
> really just fake up a StringInfoData and point directly to the bytes
> of the bytea value.

Perhaps, but ...

> The best way I could think of to do this was to invent
> initStringInfoFromString() which initialises a StringInfoData and has
> the ->data field point directly at the specified buffer.  This will
> mean that it would be unsafe to do any appendStringInfo* operations on
> the resulting StringInfoData as enlargeStringInfo would try to
> repalloc the data buffer, which might not even point to a palloc'd
> string.

I find this patch horribly dangerous.

It could maybe be okay if we added the capability for StringInfoData
to understand (and enforce) that its "data" buffer is read-only.
However, that'd add overhead to every existing use-case.

> I've attached the benchmark results I got after testing how the
> modification changed the performance of string_agg_deserialize().
> I was hoping this would have a slightly more impressive performance
> impact, especially for string_agg() and array_agg() as the aggregate
> states of those can be large.  However, in the test I ran, there's
> only a very slight performance gain. I may just not have found the
> best case, however.

I do not think we should even consider this without solid evidence
for *major* performance improvements.  As it stands, it's a
quintessential example of a loaded foot-gun, and it seems clear
that making it safe enough to use would add more overhead than
it saves.

regards, tom lane