Re: Question on corruption (PostgreSQL 9.6.1)
On Thu, Mar 15, 2018 at 8:16 AM, Andy Halsall wrote: > Thanks for the advice. I re-indexed and reloaded a pg_dumpall into a spare > server - no errors. Will run pg_catcheck asap. You can also run amcheck. Get the version targeting earlier Postgres releases off Github (there are packages for most Linux systems). This can verify that the heap is consistent with indexes. -- Peter Geoghegan
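For reference, a minimal invocation against a single index looks something like the following ('my_index' is only a placeholder, and the extension may install under the name amcheck_next with the Github packaging):

CREATE EXTENSION amcheck;
SELECT bt_index_check('my_index'::regclass);
-- newer amcheck versions also accept a heapallindexed argument, which
-- additionally verifies that every indexed heap tuple has a matching index entry:
-- SELECT bt_index_check('my_index'::regclass, heapallindexed => true);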
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel wrote: > SELECT heap_page_items(get_raw_page('pg_authid', 7)); Can you post this? SELECT * FROM page_header(get_raw_page('pg_authid', 7)); -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Mon, Mar 19, 2018 at 1:55 PM, Jeremy Finzel wrote:
> @Peter :
>
> staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
>       lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> ----------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
> (1 row)
Thanks. That looks normal. I wonder if the contents of that page look consistent with the rest of the table following manual inspection, though. I recently saw system catalog corruption on a 9.5 instance where an entirely different relation's page ended up in pg_attribute and pg_depend. They were actually pristine index pages from an application index. I still have no idea why this happened. This is very much a guess, but it can't hurt to check whether the contents of the tuples themselves are actually sane by inspecting them with "SELECT * FROM pg_authid". heap_page_items() doesn't actually care about the shape of the tuples in the page, so this might have been missed. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Wed, Mar 21, 2018 at 1:38 PM, Jeremy Finzel wrote: > A server restart and upgrade to 9.5.12 (at the same time), as expected, made > the issue go away. Still doesn't give us any answers as to what happened or > if it would happen again! Thanks for the feedback. You may still want to use amcheck to look for problems. The version on Github works with 9.5, and there are Redhat and Debian pgdg packages. See: https://github.com/petergeoghegan/amcheck The "heapallindexed" option will be of particular interest to you - that option verifies that the table has matching rows for a target index (in addition to testing the structure of a target B-Tree index itself). This is probably the best general test for corruption that is available. There is a fair chance that this will reveal new information. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel wrote: > Thank you for the recommendation. I ran both amcheck functions on all 4 > indexes of those 2 tables with heapallindexed = true, but no issues were > found. Probably wouldn't hurt to run it against all indexes, if you can make time for that. You can generalize from the example query that calls the bt_index_check() function: set "heapallindexed=>i.indisprimary" and remove "n.nspname = 'pg_catalog'", as well as "LIMIT 10". This will test tables and indexes from all schemas, which might be interesting. -- Peter Geoghegan
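A rough sketch of the modified query, adapted from the example in the amcheck documentation (worth double-checking against the docs for the version you have installed):

SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),
       c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;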
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Thu, Mar 22, 2018 at 2:24 PM, Jeremy Finzel wrote: > I am running this on a san snapshot of our production system. I assume that > this will give me a valid check for file-system-level corruption. I am > going to kick it off and see if I find anything interesting. It might. Note that SAN snapshots might have corruption hazards, though. Postgres expects crash consistency across all filesystems, so you might run into trouble if you had a separate filesystem for WAL, for example. I know that LVM snapshots only provide a consistent view of a single logical volume, even though many LVM + Postgres setups will involve multiple logical volumes. This makes it possible for a small inconsistency across logical volumes to corrupt data. I don't know anything about your SAN snapshotting, but this is at least something to consider. -- Peter Geoghegan
Re: Troubleshooting a segfault and instance crash
On Thu, Mar 8, 2018 at 9:40 AM, Blair Boadway wrote: > Mar 7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip > 00302f32868a sp 7ffcf1547498 error 4 in > libc-2.12.so[302f20+18a000] > > Mar 7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG: > server process (PID 29351) was terminated by signal 11: Segmentation fault > It crashes the database, though it starts again on its own without any > apparent issues. This has happened 3 times in 2 months and each time the > segfault error and memory address is the same. We had a recent report of a segfault on a Redhat-compatible system that seemed like it might originate from within its glibc [1]. Although the versions there didn't exactly match what you have, it's worth considering as a possibility. Maybe you can't install debuginfo packages because you don't yet have the necessary debuginfo repos set up. Just a guess. That is sometimes a required extra step. [1] https://postgr.es/m/7369.1520528...@sss.pgh.pa.us -- Peter Geoghegan
Re: Autovacuum behavior with rapid insert/delete 9.6
On Thu, Mar 29, 2018 at 4:01 PM, Ted Filmore wrote: > What I am really asking to confirm is after describing the situation is it > reasonable to focus on (in the short term) tuning autovacuum to increase > performance or does this not make sense given the workload and I should look > elsewhere? I would look into this suspected 9.5 regression, if that's possible: https://postgr.es/m/CAH2-Wz=sfakvmv1x9jh19ej8am8tzn9f-yecips9hrrrqss...@mail.gmail.com -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 7:01 AM, Tomas Vondra wrote: > The bigger question is whether this can actually detect the issue. If > it's due to a storage issue, then perhaps yes. But if you only see > multixact issues consistently and nothing else, it might easily be a > PostgreSQL bug (in which case the checksum will be correct). You can also run amcheck. Get the version targeting earlier Postgres releases off Github (there are packages for most Linux systems). This can verify that the heap is consistent with indexes. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 5:55 PM, Alexandre Arruda wrote: > I ran amcheck in all index of a table and I only get empty returns. Did you try doing so with the "heapallindexed" option? That's what's really interesting here. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda wrote: > (... and all other indexes returns null too) > > I tried with bt_index_check too. Same results. That's interesting, because it tells me that you have a table that appears to not be corrupt, despite the CLUSTER error. Also, the error itself comes from sanity checking added to MultiXact freezing fairly recently, in commit 699bf7d0. You didn't say anything about regular VACUUM being broken. Do you find that it works without any apparent issue? I have a suspicion that this could be a subtle bug in CLUSTER/freezing. The only heap_freeze_tuple() caller is code used by CLUSTER, so it's not that hard to imagine a MultiXact freezing bug that is peculiar to CLUSTER. Though I haven't thought about it in much detail. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 7:53 PM, Andres Freund wrote: > I've not followed this thread. Possible it's the overeager check for pg > upgraded tuples from before 9.3 that Alvaro fixed recently? I was aware of commit 477ad05e, which must be what you're referring to. I don't think that that's what this is, since this error occurs within heap_freeze_tuple() -- it's not the over-enforced HEAP_XMAX_IS_LOCKED_ONLY() error within heap_prepare_freeze_tuple(). And, because this database wasn't pg_upgraded. I should wait until tomorrow before doing any further analysis, though. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda wrote: > Actualy, I first notice the problem in logs by autovacuum: > > 2018-04-10 08:22:15.385 -03 [55477] CONTEXT: automatic vacuum of > table "production.public.fn06t" > 2018-04-10 08:22:16.815 -03 [55477] ERROR: found multixact 68834765 > from before relminmxid 73262006 > > production=# vacuum analyze verbose fn06t; > INFO: vacuuming "public.fn06t" > ERROR: found multixact 76440919 from before relminmxid 122128619 Do you think that CLUSTER was run before regular VACUUM/autovacuum showed this error, though? Have you noticed any data loss? Things look okay when you do your dump + restore, right? The problem, as far as you know, is strictly that CLUSTER + VACUUM refuse to finish/raise these multixactid errors? -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Tue, Apr 10, 2018 at 7:54 PM, Alexandre Arruda wrote: > pg_control version number:1002 Andres was also asking about his check_rel() function, from https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyo...@alap3.anarazel.de. Can you check that out as well, please? You'll need to be able to install the pageinspect contrib module. -- Peter Geoghegan
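For reference, assuming the contrib package for your Postgres version is present on the server, installing it is just a matter of running the following as a superuser in the affected database:

CREATE EXTENSION IF NOT EXISTS pageinspect;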
Re: New website
On Wed, Apr 18, 2018 at 3:03 PM, Adrian Klaver wrote: > I would contact the Webmaster but Contact goes to a big image of an elephant > head. That is also where Downloads, Support and Donate lands. Might have > been a good idea to roll out a demo site for testing first. Will reserve > judgment on the site design until it is functioning. I really don't think it's practical to give everyone a veto on a website design. It took years to launch this website redesign. Honestly, I was beginning to think that it would never happen. Anyway, I did notice something myself, which is that the git logo links to https://git.postgresql.org/gitweb/, which lists many not-very-important git repos. Instead, it should point to the main PostgreSQL repository's gitweb page, which is at https://git.postgresql.org/gitweb/?p=postgresql.git. -- Peter Geoghegan
Re: New website
On Wed, Apr 18, 2018 at 3:40 PM, Adrian Klaver wrote: > And to get to the mailing list archives (arguably one of the most important > links) you have to: > > Click on Community, click on Mailing Lists on the left sidebar, then scroll > to the bottom of the page to find the search box. I look at the mailing list archives as much as anyone else does, and I don't think that it's very important for them to have prominent placement. Even so, the only difference here is the scrolling. It has fundamentally the same structure as before. > I am one of the No votes in the survey. This sounds pretty far from constructive to me, which automatically detracts from what you're saying. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Fri, May 25, 2018 at 1:38 PM, Andres Freund wrote: >> BTW I think the definition of HeapTupleHeaderXminFrozen is seriously >> confusing, by failing to return true if the xmin is numerically >> FrozenXid (which it'll be if the database was pg_upgraded). I wonder >> about this one in HeapTupleSatisfiesMVCC: > > I suggest raising this on -hackers. I agree that it's unfortunate. I wonder if BootstrapTransactionId also needs to be considered here. -- Peter Geoghegan
Re: Syndicating PostgreSQL mailing list to Discourse
On Sun, May 27, 2018 at 6:13 AM, Stephen Frost wrote: > I'm curious what would be different here from what our archives provide. > We could certainly have a single "all lists" archive page but that seems > more likely to be just completely confusing than actually useful at all. Any replacement for our own archives will need to provide access to mail from over 20 years ago to be in any way usable. It's not uncommon to have to go back that far. Personally, I don't buy the idea that the need to use a mailing list rather than a web forum is a notable obstacle for new contributors. PGLister seems pretty slick to me. It has eliminated all of the frustrations that I had. Maybe we need to do a better job when it comes to communicating what the benefits of a mailing list are, though. There are real, practical reasons to prefer a mailing list; that preference isn't just due to ingrained habit. I'm pleased that there has been a drive to modernize some of the community's infrastructure in recent years, but only because those changes turned out to be unalloyed improvements (at least in my view). Besides, while mailing lists may seem antiquated to a lot of people, aren't web forums almost as antiquated? Sites like Stack Overflow are very clearly not designed to work as discussion forums. They do not allow subjective questions, and it's common for moderators to swiftly delete new threads. Stack Overflow is popular because it provides the fastest possible access to a crowdsourced answer, without requiring or even encouraging participation. -- Peter Geoghegan
Re: Code of Conduct plan
On Mon, Jun 4, 2018 at 11:41 AM, Jason Petersen wrote: > Ultimately, the important thing this CoC provides is some concrete language > to point at when a party is aggrieved and explicit avenues of redress > available when one refuses to address one’s own behavior. We’re adults here, > the strawmen of people being harangued out of the community because they > said a bad word are unlikely to materialize. > > +1 This seems like a good summary of the purpose of the CoC. It is of course possible that a member of the committee could act in bad faith for any number of reasons. You can say the same thing about any position of leadership or authority within the community, though. That hasn't really been much of a problem in my experience, and I see no reason for particular concern about it here. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 8:49 AM, Benjamin Scherrey wrote: > I thought the same thing as a member of the Django community. It adopted a > CoC that I vocally warned was dangerous and far more likely to be abused > than provide any benefit. I was shocked when the very first time it was ever > invoked it was by one of the founders of the project (whom I previously > personally respected) and it was absolutely used in the manner that I had > feared which was to shut someone up whose opinion he did not like rather > than any legitimate concern. Unfortunately this is not such an unusual > circumstance as one might hope in these projects or conferences. It is > impossible to separate the concept of political correctness from these CoCs > I find and they are much more dangerous things than they appear. We should > tread with extreme cautious about adopting such a thing. It's impossible for me to know what really happened in that situation, but it doesn't seem like the CoC was likely to have been much of a factor in any telling. If this individual was in a position of influence and decided to act maliciously, they would no doubt have found another way to do so in the absence of a CoC. On the other hand, it's easy to imagine a newer non-influential community member finding no recourse against abusive behavior because that isn't explicitly provided for; they might simply not know where to start, and become totally discouraged. Nobody is claiming that the CoC is perfect, or that it can anticipate every situation; it's just a framework for handling disputes about abusive and/or antisocial behavior. The core team have had exclusive responsibility for "Handling disciplinary issues" as part of their charter, at least until now. You can make exactly the same slippery slope argument against that. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 9:51 AM, James Keener wrote: > To be honest, this is a bigger problem. Why would someone not feel > comfortable contacting the core team? Why would they feel better contacting > the CoC board who is probably mostly core team or otherwise self-selected > community members who have a strong belief in the CoC (and I don't mean that > kindly)? The CoC states that the committee's members cannot come from the core team. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 10:08 AM, Tom Lane wrote: > So publishing a formal CoC at all is mainly meant to deal with weak > points 1 and 2, and then the details of the process are there to try > to fix point 3. > > Yeah, managing the committee is a lot of overhead that in an ideal > world we wouldn't need, but I think we have to accept it to have a > process people will have confidence in. It's worth pointing out that the community has grown considerably in the last ten years. I assume that adding a bit of process to deal with these kinds of disputes is related to that. We have a pretty good track record through totally informal standards for behavior. Setting a good example is absolutely essential. While that's still the most important thing, it doesn't seem particularly scalable on its own. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 12:20 PM, Benjamin Scherrey wrote: > I keep hearing this claim. I've followed up and tried to verify them. Sorry > but "trust me" doesn't cut it here any more than "trust me this will make > Postgres go faster" would on a code change. What's the context for this? > What evidence do we have that indicates this CoC would have likely resulted > in a different outcome? Without that then your claim does not even rise up > to the standard of theoretical. Frankly this claim does not seem very > plausible to me at all. Let's try to keep our standards here. Whose standards are these? By my count, the majority of e-mails you've ever sent to a PostgreSQL mailing list have been sent in the last 2 days, to this code of conduct thread. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 2:06 PM, Sven R. Kunze wrote: > 1) CoC might result in developers leaving projects > http://lists.llvm.org/pipermail/llvm-dev/2018-May/122922.html This guy left LLVM for several reasons. The pertinent reason for us was that he had to agree to a code of conduct in order to attend conferences, which he found to be unacceptable. He did not have to agree that the idea of a code of conduct was a good one, though. It would have been perfectly possible for him to be opposed in principle to the idea of a CoC, while also formally agreeing to it and attending those conferences. I gather that his objections were around questions of unintended consequences, the role of a certain authority to assess violations of the CoC, and so on (I surmise that he was not actually opposed to or constrained by any of the specific rules around content in technical presentations and so on). I for one accept that these may have been reasonable concerns, even though I don't really agree, since the LLVM CoC seems quite reasonable. Anybody that participates in an open source community soon learns that their opinion on almost any matter may not be the one that prevails. There are often differences of opinion on -hackers that seem to fundamentally be down to a difference in values. We still manage to make it work, somehow. > 2) CoC might result in not so equal peers and friends, might result in a > committee which feels above their peers, and might promote conceit and > denunciation. I think that having a code of conduct is better than not having one, and I think that the one that we came up with is appropriate and proportionate. We could speculate all day about specific unintended consequences that may or may not follow. That doesn't seem very constructive, though. Besides, the time for that has passed. > In related discussions, people recurringly ask not to establish a secondary > judicial system but to use the already existing ones. I don't follow. Practically any organized group has rules around conduct, with varying degrees of formality, means of enforcement, etc. Naturally, the rules across disparate groups vary widely for all kinds of reasons. Formalizing and being more transparent about how this works seems like the opposite of paternalism to me. -- Peter Geoghegan
Re: Replica string comparsion issue
On Wed, Jun 13, 2018 at 1:10 PM, Andrey Lizenko wrote: > I'm observing strange behaviour on comparing ::text field with string while > querying replica. The first thing to do here is to install amcheck on the replica, and verify that the indexes are consistent with the replica's own notion of how text sorts for the collation that is implicitly in use (the database default): https://www.postgresql.org/docs/current/static/amcheck.html You can modify the example query to check the indexes that you're interested in. I think that there is a very strong chance that the replica has incompatible collation rules, given that it uses a totally different OS. -- Peter Geoghegan
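A minimal sketch of that check on the replica ('idx_on_text_col' is only a placeholder for whichever index backs these comparisons):

CREATE EXTENSION amcheck;
SELECT bt_index_check('idx_on_text_col'::regclass);

If the replica's collation rules really are incompatible with the ones the index was built under, this will likely report an index ordering violation.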
Re: What to do when dynamic shared memory control segment is corrupt
On Mon, Jun 18, 2018 at 1:03 PM, Tom Lane wrote: > Hm, I supposed that Sherrylyn would've noticed any PANIC entries in > the log. The TRAP message from an assertion failure could've escaped > notice though, even assuming that her logging setup captured it. Unhandled C++ exceptions end up calling std::abort(). I've seen bugs in modules like PL/V8 that were caused by this. The symptom was a mysterious message in the logs about SIGABRT. Perhaps that's what happened here? What extensions are installed, if any? -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Wed, Jun 27, 2018 at 11:35 AM, Luca Ferrari wrote: > If I then disconnect and reconnect I'm able to issue the select and > get back the results. But if I issue a reindex I got the same error > and the table "becames unreadable" for the whole session. > On 10.3 the table is never locked for the session, that is I can > create the index, I can query the table and get the results, but I > cannot reindex. However, even after a reindex, it does allow me to > select data from the table. > > So my question is: why this behavior in later PostgreSQL? It might have something to do with the changes to parallel CREATE INDEX. It changed how we tracked whether or not an index could be used because it was currently undergoing reindexing. This is supposed to make no difference at all, but there was one bug that could cause us to consider an index irrevocably unusable. Do you find that the issue goes away if you set max_parallel_maintenance_workers=0 on v11/master? -- Peter Geoghegan
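A minimal way to check that, in the same session/script that reproduces the problem:

SET max_parallel_maintenance_workers = 0;
-- then re-run the CREATE INDEX / REINDEX sequence that triggered the error

If the table no longer becomes unreadable with parallelism disabled, that points at the parallel CREATE INDEX changes.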
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Thu, Jun 28, 2018 at 8:02 AM, Andres Freund wrote: > Peter, looks like you might be involved specifically. Seems that way. > This however seems wrong. Clearly the relation's index list is out of > date. > > I believe this happens because there's currently no relcache > invalidation registered for the main relation, until *after* the index > is built. Normally it'd be the CacheInvalidateRelcacheByTuple(tuple) in > index_update_stats(), which is called at the bottom of index_build(). > But we never get there, because of the earlier error. That's bad, because > any relcache entry built *after* the CommandCounterIncrement() in > index_create() will now be outdated. > > In the olden days we most of the time didn't build a relcache entry > until after the index was built - but plan_create_index_workers() now > does. I suspect there are other ways to trigger that earlier, too. Note that there is a kludge within plan_create_index_workers() that has us treat the heap relation as an inheritance parent, just to get a RelOptInfo for the heap relation without running into similar trouble with the index in get_relation_info(). Perhaps there's an argument to be made for refactoring plan_create_index_workers() as a fix for this. > Putting in a CacheInvalidateRelcache(heapRelation); before the CCI in > index_create() indeed makes the "borked relcache" problem go away. > > > I wonder why we don't just generally trigger invalidations to an > index's "owning" relation in CacheInvalidateHeapTuple()? I don't know, but that seems like a good question. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Mon, Jul 9, 2018 at 11:32 AM, Andres Freund wrote: >> Note that there is a kludge within plan_create_index_workers() that >> has us treat the heap relation as an inheritance parent, just to get a >> RelOptInfo for the heap relation without running into similar trouble >> with the index in get_relation_info(). Perhaps there's an argument to >> be made for refactoring plan_create_index_workers() as a fix for this. > > Maybe I'm missing something, but what has this got to do with the issue > at hand? Nothing. It might be worthwhile to find a way to not do that as part of fixing this issue, though. Just a suggestion. > I assume we'll have to backpatch this issue, so I think it'd probably a > good idea to put a specific CacheInvalidateHeapTuple() in there > explicitly in the back branches, and do the larger fix in 12. ISTM > there's some risks that it'd cause issues. Will you tackle this? Okay. -- Peter Geoghegan
Re: sorting/comparing column values in non-alphanumeric sorting ways ?
On Wed, Jul 11, 2018 at 2:44 PM, David Gauthier wrote: > I want to load the 'highestver' column with the highest version of tcfg1-3. > > This won't work... > update tv set greatest = greatest(tcfg1,tcfg2,tcfg3) > ...because it thinks 1.0.9 is greater than 1.0.10 > > Is there a way to get this to work right ? If you're using v10 with ICU, then you can create a custom ICU collation for this, with "natural" sort order. Something like this should work: CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true'); See the docs -- "23.2.2.3.2. ICU collations". -- Peter Geoghegan
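A minimal sketch of how it might be used, assuming tcfg1-3 are text/varchar columns (untested):

UPDATE tv SET highestver = greatest(tcfg1 COLLATE "numeric",
                                    tcfg2 COLLATE "numeric",
                                    tcfg3 COLLATE "numeric");

With the numeric-ordering ('kn') keyword in the collation, '1.0.10' compares higher than '1.0.9'.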
Re:
On Thu, Jul 19, 2018 at 11:43 AM, Torsten Förtsch wrote: > is there a way to find if a certain page in a data file is referenced by a > btree index? > > I found a few completely empty pages in one of my tables. I am not sure if > that's corruption or just bloat. Now I am thinking I could use an index, the > PK for instance, and see if it references these pages. It's probably not serious, but you may want to try amcheck's heapallindexed check. You'll have to use the non-contrib packages for that right now, but those are available from the PGDG repos. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Wed, Jul 25, 2018 at 4:03 PM, Andres Freund wrote: > Peter, given that your patch made this more likely, and that you're a > committer these days, I'm opening an open items entry, and assign it to > you. Does that sound ok? I intend to follow through on this soon. I have been distracted by project work. I accept responsibility for the open item, though. -- Peter Geoghegan
Re: Restore relhaspkey in PostgreSQL Version 11 Beta
On Tue, Jul 31, 2018 at 7:47 AM, Melvin Davidson wrote: > I was hoping that at least one other person would see my point of view, but > by the > harsh replies I've been getting, I feel more like a whistle blower that > insiders > think I also should be made to "go away". You were bellicose from almost the beginning of this thread. And, yes, that does detract from your argument. Just as it would in almost any other sphere or arena. > Well, you are right. This old Viet Vet shall now end this conversation and > his career. > I just need a way to do so quietly and painlessly. > The truth is absolute and cannot be changed. > Perception is not the truth. > Flerp! I cannot imagine what reaction you were expecting to this. In all sincerity, I suggest reflecting on your words. You don't seem to have realistic expectations about how the community works, or could ever work. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Mon, Jul 9, 2018 at 11:32 AM, Andres Freund wrote: > I assume we'll have to backpatch this issue, so I think it'd probably be a > good idea to put a specific CacheInvalidateHeapTuple() in there > explicitly in the back branches, and do the larger fix in 12. ISTM > there's some risks that it'd cause issues. What do you think of the attached? There is a new CacheInvalidateRelcache() call, rather than a new call to CacheInvalidateRelcacheByTuple(), but those two things are equivalent (I assume that you actually meant to say CacheInvalidateRelcacheByTuple(), not CacheInvalidateHeapTuple()). Since nobody seems to be that excited about the CacheInvalidateHeapTuple() idea, I haven't pursued it. -- Peter Geoghegan 0001-Add-table-relcache-invalidation-to-index-builds.patch Description: Binary data
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Tue, Jul 31, 2018 at 7:02 PM, Andres Freund wrote: > Maybe expand a bit on this by saying that it's more likely "because > plan_create_index_workers() triggers a relcache entry to be (re-)built, > which previously did only happen in edge cases" or such? Okay. > Not a fan of this comment. It doesn't really explain that well why it's > needed here, but then goes on to a relatively general explanation of why > cache invalidation is necessary. Why not just go for something like > "register relcache invalidation on the indexes' heap relation, to > maintain consistency of its index list"? That seems much more generic to me! The comment is supposed to convey that the stuff within index_update_stats() isn't enough because of xact abort specifically. SI invalidation is very much part of the index_update_stats() contract already. > I wonder if it wouldn't be more appropriately placed closer to the > UpdateIndexRelation(), given that that's essentially what necessitates > the relcache flush? That makes sense. I'll do it that way. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Tue, Jul 31, 2018 at 9:00 PM, Andres Freund wrote: > I don't think that's particularly relevant. We should always register an > invalidation before the relevant CommandCounterIncrement(), because that > is what makes catalog changes visible, and therefore requires > registering invalidations for coherency. Fair enough. How about the attached revision? -- Peter Geoghegan v2-0001-Add-table-relcache-invalidation-to-index-builds.patch Description: Binary data
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 7:19 AM, Joshua D. Drake wrote: > Sure and that is unfortunate but isn't it up to the individual to deal with > it through appropriate channels for whatever platform they are on? All of > these platforms are: > > 1. Voluntary to use > 2. Have their own Terms of Use and complaint departments > 3. If it is abuse there are laws > > I agree that within Postgresql.org we must have a professional code of > conduct but the idea that an arbitrary committee appointed by an unelected > board can decide the fate of a community member based on actions outside of > the community is a bit authoritarian don't you think? The choice of the committee members is hardly arbitrary. Having committee members be appointed by core is more or less consistent with how the community has always dealt with disciplinary issues. The criteria used by core were discussed quite openly. While the risk that the committee will wield their power in an "authoritarian" way seems very small, it cannot be ruled out entirely. In fact, it hasn't been ruled out by the draft CoC itself. No CoC can possibly provide for every conceivable situation. Somebody has to interpret the rules, and it has to be possible to impose sanctions when the CoC is violated -- otherwise, what's the point? There are several checks and balances in place, and I for one have confidence in the process as outlined. It's imperfect, but quite a lot better than either the status quo, or a platitude about inclusivity. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 10:31 AM, Dimitri Maziuk wrote: > So let me get this straight: you want to have a "sanctioned" way to deny > people access to postgresql community support channel? Yes. > "Because > somebody who may or may not be the same person, allegedly said something > somewhere that some other tweet disagreed with on faceplant"? > > Great plan if you do for-pay postgresql support for the living. You can make your own conclusions about my motivations, just as I'll make my own conclusions about yours. I'm not going to engage with you on either, though. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 11:06 AM, Dimitri Maziuk wrote: > Personally I would like that. Others might prefer an invitation to > unsubscribe or forever hold their peace, I could live with that too, but > I believe explicit opt-ins are preferable to opt-outs. I think that it's a legitimate position to be opposed to a CoC like this. I also think it's legitimate to feel so strongly about it, on philosophical or political grounds, that you are compelled to avoid participating while subject to the CoC. FWIW, the latter position seems rather extreme to me personally, but I still respect it. In all sincerity, if you're compelled to walk away from participating in mailing list discussions on a point of principle, then I wish you well. That is your right. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 9:23 AM John Lumby wrote: > Overall, pg-12beta2 yielded a 6.7% reduction in sizes (total pages) of > indexes, which was most noticable with the 6 non-unique ones. > In fact the primary-key index was larger with pg-12. The avg_leaf_density was actually higher for the primary key index, so it looks like it really came out slightly ahead on v12. Perhaps you didn't take deleted_pages into account -- there must be free space that is reusable by the index that has yet to be reused. It would probably make sense to subtract that across the board. > Would you have expected better than 6.7%? I don't think that a test case that runs VACUUM when there are only 4300 deletions and 4300 insertions is particularly realistic, in general. You might see a larger difference if there was more churn between each VACUUM run. > Although a welcome improvement, I think it is not enough to justify stopping > use of setting a lower explicit FILLFACTOR. Which then brings me back to > thinking there is a case for the subject of this thread, an automatic way to > preserve density. I don't think that such an option would make much sense. The "waves of misery" paper is about smoothing out the frequency of page splits following bulk loading and a CREATE INDEX. It is not about making splits occur less often. It's well understood that a certain amount of free space is the overhead of B-Tree indexes, albeit an overhead that can be avoided in certain specific instances. > And one question : > I notice that in some pg-11 release, a new config parameter appeared : > vacuum_cleanup_index_scale_factor > I have not researched this at all and nor did I set it to anything for my > pg-12beta2 run, but it sounds as though maybe it could be relevant to > this kind of workload - Is that so? You seem to be worried about keeping indexes as small as possible. vacuum_cleanup_index_scale_factor won't help with that. -- Peter Geoghegan
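Both of those fields are visible through the pgstattuple extension, e.g. ('my_pkey' is just a placeholder for the index in question):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_pages, deleted_pages, empty_pages
FROM pgstatindex('my_pkey');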
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:10 PM John Lumby wrote: > Actually the test workload does not run any explicit VACUUM command, > it relies on autovacuum with these settings > (same settings for 9.4 and 12beta2) > To correspond to your " more churn between each VACUUM" > Would you then suggest increasing > autovacuum_vacuum_cost_delay and/or autovacuum_vacuum_scale_factor? Well, you're still running autovacuum very aggressively here. It'll easily keep up when run on a relatively small table such as this. BTW, you should definitely run the latest point release of 9.4 -- not 9.4.6. You're missing years of bug fixes by sticking to such an old point release, including some rather nasty ones -- 9.4.23 is the current 9.4 point release. Actually, 9.4 is going to lose support this year, as the oldest stable version that's currently supported by the community. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:19 PM Peter Geoghegan wrote: > Well, you're still running autovacuum very aggressively here. It'll > easily keep up when run on a relatively small table such as this. Also, an exactly equal number of insertions and deletions is rather likely to result in bloated indexes in a way that probably isn't representative of many workloads. Even if the number of insertions was only slightly greater than the number of deletions, then the overall pattern would be one of continual growth, which is generally considered much more interesting. For far far more information on the topic than you want, see the paper "B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than Merge-at-Half": https://www.sciencedirect.com/science/article/pii/00229390020W The salient point made by the paper is that good space utilization rests on the assumption that there are fewer deletes than inserts, though maybe only slightly fewer: "The tendency of the utilization to remain near 69% can be explained by the following arguments: If there are even just a few more inserts than deletes, the B-tree will grow at the net insert rate (the rate of inserts minus the rate of deletes)." If the volume of data never grows past a certain point, then it's unlikely that the space utilization is very important. This may even be premature optimization. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 10:31 AM John Lumby wrote: > Yes, I see that. But surely "making splits occur less often" is a > desirable > objective in itself, is it not? And I believe that a parameter to > preserve the "steady-state" > density in high-traffic indexes would help achieve that goal, wouldn't you > agree? Anything that reliably reduces page splits without hurting space utilization is well worthwhile. I can't see how what you describe could have that effect, though. If you expect the leaf density to be the same after a REINDEX, then why bother at all? There is no reason to think that that will be more effective than simple vacuuming. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 11:27 AM John Lumby wrote: > And the point of the REINDEX at that point (below) is to remove dead tuple > keys-tids > and reorganize those split pages back into physical order without losing the > freespace. VACUUM already removes the tuples, accounting for all overhead. You are right that it would be possible for us to "defragment" the pages, so that they'd be in sequential order on disk from the point of view of a whole index scan -- this is what the "leaf_fragmentation" statistic from pgstatindex() reports on. We could in principle come up with a way of moving pages around, which would have some modest benefit for certain types of queries (it wouldn't improve the heap/index correlation, though, which is far more important). That would either necessitate that the command acquire a disruptive lock on the index (i.e. no writes, just like regular REINDEX), or that we drastically rearchitect the B-Tree code to make it support this. Neither of which seem particularly appealing. I believe that this is a lot more important in systems that generally use clustered indexes, such as MS SQL Server. This kind of "fragmentation" isn't usually much of a problem when using Postgres. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 12:29 PM John Lumby wrote: > I was not thinking of a new command, just an extension of the existing > REINDEX > which would apply a fillfactor equal to current average page density, > by adding a preliminary step to sample that first. That would be a very different thing to REINDEX no matter how you spelt it, though. REINDEX creates a new index, from scratch, whereas you're talking about restructuring what's already there. > > I believe that this is a lot more important in systems that generally > > use clustered indexes, such as MS SQL Server. This kind of > > "fragmentation" isn't usually much of a problem when using Postgres. > > > We have found that, for an index which has both experienced large number of > page splits > and whose table has a large number of dead tuples (despite autovacuum), > REINDEX with FILLFACTOR set to current page_density does produce a > performance improvement, > and also does reduce future growth in number of pages.I don't have > numbers to > hand, and in fact not sure if any catalog view or pgstattuple tells me about > the proportion > of dead key-tids in the index itself (do you know of any source?) as opposed > to the table, > but based on that recollection, yes, REINDEX can reduce fragmentation. This could help the old "getting tired" behavior with many duplicates, by making the free space available in earlier leaf pages (those further to the left) that are full of duplicates -- the original average space utilization may reflect a very uneven distribution of free space overall. Or, it could be that range scan performance benefitted from reduced fragmentation, because your workload happened to be bottlenecked on large range scans. Though that seems unlikely. I believe that the effect that you identified is real, but at a minimum it's not clear why a REINDEX with a fillfactor to match the original leaf space utilization helped. It would be fairly difficult to figure it out for sure. If it was a problem with duplicates/"getting tired", then I'd expect the new v12 code will help a lot. > However we did not run a VACUUM command first. Maybe if we had run VACUUM > instead of > the REINDEX commands, we might have obtained the same degree of > improvement, I don't know. > I think this was Tom's point earlier on in this thread. It was. Tom's intuition about that matches my own, though I acknowledge that the old behavior with duplicates muddies the waters. > Correct me if I'm wrong but I believe whether an index is "clustered" or not > is not relevant for > this discussion because the clustering in that context is referring to > ordering of the > table pages, not the index pages. Right. > I believe it is quite possible to have a perfectly > "clustered" table whose clustering index is itself badly disorganized. Technically the two things are separate metrics, so that is theoretically possible, but it doesn't seem all that likely. It could happen with lots of non-HOT updates, where all new index tuples relate to the same logical row as some existing index tuple, causing many page splits despite there being no real change in the logical contents of the index. Even then, the table will itself lose much of its original order, so the index will become "unclustered" as it becomes fragmented. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 3:18 PM John Lumby wrote: > > Or, it could be that range scan performance benefitted from reduced > > fragmentation, > > > > Yes, I think so. ISTM that the simplest explanation here is that index fragmentation (and even index size) is a red herring, and the real issue is that you're suffering from problems similar to those that are described in these old threads: https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com There have been numerous reports from users with problems involving low cardinality indexes that gradually became less correlated with the underlying table over time. At least a couple of these users found that a periodic REINDEX temporarily fixed the problem -- see the first thread for an example. Postgres 12 maintains the heap/table sort order among duplicates by treating heap TID as a tiebreaker column, which may make REINDEXing totally unnecessary for you. It's harder to model this issue because the problem with heap TID order will only be seen when there is at least a moderate amount of churn. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 9:04 PM Peter Geoghegan wrote: > ISTM that the simplest explanation here is that index fragmentation > (and even index size) is a red herring, and the real issue is that > you're suffering from problems similar to those that are described in > these old threads: > > https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com > https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com I can imagine why you found you needed to reduce fillfactor to get much of any benefit from a REINDEX. Page splits are inherently expensive, for one thing. Also, in this specific scenario a succession of page splits might hasten the index returning to having little correlation with the underlying table within each large group of duplicates. Splits on fillfactor 90 pages would make new space available for future insertions on earlier duplicate pages, mixing old and new rows together before long. -- Peter Geoghegan
Re: after restore the size of the database is increased
On Mon, Jul 15, 2019 at 6:22 AM Luca Ferrari wrote: > What am I missing here? Sometimes B-Tree indexes can be *larger* after a REINDEX (or after they're recreated with a CREATE INDEX). It's not that common, but it does happen. There isn't actually a very large size difference here, so it seems worth comparing index size in detail. -- Peter Geoghegan
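One way to make that comparison, run against both the original database and the restored copy (nothing here is specific to your schema):

SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;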
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz wrote: > To me it looks like a *single* corrupt index held up autovacuums across our > entire server, even other in other databases on the same server. Am I > interpreting this correctly? Yes -- that is correct. What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11? > Would love guidance on diagnosing this type of thing and strategies for > preventing it. It's possible that amcheck would have given you an accurate diagnosis of the problem -- especially if you used bt_index_parent_check(): https://www.postgresql.org/docs/current/amcheck.html -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane wrote: > It looks that way, but how would a broken non-shared index have held up > autovacuuming in other databases? Maybe, as this one's xmin horizon > got further and further behind, the launcher eventually stopped > considering launching workers into any other databases? That seems > like a bad thing; it's postponing work that will need to be done > eventually. I don't know exactly how the launcher would behave offhand, but it's clear that not being able to VACUUM one table in one database (because it has a corrupt index) ultimately risks the availability of every database in the cluster. Many installations receive little to no supervision, so it may just be a matter of time there. That is certainly a bad thing. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan wrote: > It's possible that amcheck would have given you an accurate diagnosis > of the problem -- especially if you used bt_index_parent_check(): > > https://www.postgresql.org/docs/current/amcheck.html BTW, be sure to use the 'heapallindexed' option with bt_index_parent_check() to detect missing downlinks, which is exactly the problem that VACUUM complained about. Hopefully the problem will be limited to the single index that you've already REINDEXed. In theory the same problem could be hiding in other indexes, though I don't consider that particularly likely. Note that bt_index_parent_check() requires a lock on tables that effectively blocks writes, but not reads, so verification may require planning or coordination. bt_index_check() doesn't have any of these problems, but also won't detect missing downlinks specifically. -- Peter Geoghegan
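A sketch of the kind of invocation meant here ('my_index' is just a placeholder):

SELECT bt_index_parent_check('my_index'::regclass, heapallindexed => true);

The same call can be repeated against any other index you want to rule out, subject to the locking caveat above.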
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane wrote: > Right, you're eventually going to get to a forced shutdown if vacuum never > succeeds on one table; no question that that's bad. It occurs to me that we use operator class/insertion scankey comparisons within page deletion, to relocate a leaf page that looks like a candidate for deletion. Despite this, README.hot claims: "Standard vacuuming scans the indexes to ensure all such index entries are removed, amortizing the index scan cost across as many dead tuples as possible; this approach does not scale down well to the case of reclaiming just a few tuples. In principle one could recompute the index keys and do standard index searches to find the index entries, but this is risky in the presence of possibly-buggy user-defined functions in functional indexes. An allegedly immutable function that in fact is not immutable might prevent us from re-finding an index entry" That probably wasn't the problem in Aaron's case, but it is worth considering as a possibility. > My concern here is > that if we have blinders on to the extent of only processing that one > table or DB, we're unnecessarily allowing bloat to occur in other tables, > and causing that missed vacuuming work to pile up so that there's more of > it to be done once the breakage is cleared. If the DBA doesn't notice the > problem until getting into a forced shutdown, that is going to extend his > outage time --- and, in a really bad worst case, maybe make the difference > between being able to recover at all and not. The comment about "...any db at risk of Xid wraparound..." within do_start_worker() hints at such a problem. Maybe nbtree VACUUM should do something more aggressive than give up when there is a "failed to re-find parent key" or similar condition. Perhaps it would make more sense to make the index inactive (for some value of "inactive") instead of just complaining. That might be the least worst option, all things considered. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera wrote: > This might make things worse operationally, though. If searches aren't > failing but vacuum is, we'd break a production system that currently > works. If searches aren't failing and VACUUM works, then that's probably down to dumb luck. The user's luck could change at any time (actually, it's quite possible that the index is already giving wrong answers without anybody realizing). That's not always true, of course -- you could have an OOM condition in VACUUM, where it really does make sense to retry. But it should be true for the category of errors where we behave more aggressively than just giving up, such as "failed to re-find parent key" error Aaron noticed. > Well, vacuum knows what index is being processed. Maybe you're thinking > that autovac can get an out-of-memory condition or something like that; > perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED > condition is reported (and make sure all such conditions do that. As > far as I remember we have a patch for this particular error to be > reported as such.) I don't think that it would be that hard to identify errors that nbtree VACUUM could throw that clearly indicate corruption, without any hope of the problem self-correcting without the DBA running a REINDEX. There will be a small amount of gray area, perhaps, but probably not enough to matter. > > (c) automatically disabling constraint indexes seems less than desirable. > > Disabling them for writes, yeah. I think that it's fair to say that all bets are off once you see the "failed to re-find parent key" error, or any other such error that indicates corruption. Admittedly it isn't 100% clear that disabling constraint enforcement to unblock autovacuum for the whole cluster is better than any available alternative; it's really hard to reason about things when we already know that the database has corruption. I think that it's okay that almost anything can break when somebody creates an index on a non-immutable expression (including VACUUM), provided that all the problems only affect the table with the broken index. OTOH, taking down the entire Postgres cluster as an indirect consequence of one person's ill-considered CREATE INDEX really sucks. That distinction seems important to me. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan wrote: > > It's possible that amcheck would have given you an accurate diagnosis > > of the problem -- especially if you used bt_index_parent_check(): > > > > https://www.postgresql.org/docs/current/amcheck.html > > BTW, be sure to use the 'heapallindexed' option with > bt_index_parent_check() to detect missing downlinks, which is exactly > the problem that VACUUM complained about. Can you tell us more about this index? Can you share its definition (i.e. what does \d show in psql)? Is it an expression index, or a partial index? A composite? What datatypes are indexed? Thanks -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz wrote: > It's a simple btree expression on a geometry(Point,4326) , no expression no > partial no composite. The cause of the corruption may be a bug in a Postgis B-Tree operator class. I reported a bug in the Geography type that could lead to corrupt B-Tree indexes (not Geometry): https://trac.osgeo.org/postgis/ticket/3841 Though I also see what could be a comparable bug in Geometry: https://trac.osgeo.org/postgis/ticket/3777 These bugs are from about 3 years ago. If I'm right you should be able to isolate the bug using amcheck. -- Peter Geoghegan
Re: Question from someone who is not trained in computer sciences
On Thu, Sep 5, 2019 at 2:00 PM Judith Lacoste wrote: > I think PostgreSQL is the solution for my needs, but I am not a > programmer/coder. I don't think that it's restricted to people that are computer scientists. At least, I certainly hope it isn't. SQL was originally supposed to be something that is usable by domain experts/analysts, rather than by computer people (that was at a time when the divide was far larger than it is today). > I plan to install the database on a server in the office. Me and my four > colleagues will occasionally connect to this database when we are working in > other locations (usually hospitals or universities). In such remote > locations, we often do not have internet/network, yet we still need to access > the database. Currently, we use a system where a copy of the database lives > on each of our laptops. We can access all the information in the database > despite being offline. This local copy of the database is synchronized with > the server once network becomes available again. > > My question is whether or not such set up is possible with PostgreSQL? Since you're a biologist, you may like to play around with the Mouse Genome database using PostgreSQL: http://www.informatics.jax.org/downloads/database_backups/ Any supported version of PostgreSQL will work. You'll need to use pg_restore to restore the databases. Something like this will do it: pg_restore -d mgd /path/to/mgd.postgres.dump (I'm not sure what operating system you'll use -- something similar to this invocation ought to work on Windows through cmd.exe, though.) From there, you can play around with the database using a GUI tool such as pgAdmin. I sometimes use this database to test certain things, since it's the only example of a living, breathing PostgreSQL database that you can just download that I am aware of. Its schema is probably not an exemplar of good design, but it does seem reasonably well thought out. I'm not a domain expert, though, so I can't really be sure how good it is. The nice thing about this approach is that you can figure it out using a "top down" approach, by first understanding how the database is used in practical terms, and then filling in the details of how the application that it backs uses the database. Last I checked, restoring this database will take about 30GB of disk space on top of the dump file itself. -- Peter Geoghegan
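A minimal sketch of the full sequence from a command prompt, assuming the PostgreSQL command line tools are on your PATH (the target database has to exist before pg_restore can load into it):

createdb mgd
pg_restore -d mgd /path/to/mgd.postgres.dump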
Re: Whan is it safe to mark a function PARALLEL SAFE?
On Sun, Sep 8, 2019 at 12:27 PM Tom Lane wrote: > > If an error is raised in one parallel worker, does this > > cause the other parallel workers to be immediately terminated? > > I think not, though I didn't work on that code. The error will > be reported to the parent backend, which will cause it to fail > the query ... but I think it just waits for the other worker > children to exit first. That's not something to rely on of course. > Even if we don't make an attempt to cancel the other workers today > we probably will in future. But the cancel attempt would certainly > be asynchronous, so I'm not sure how "immediate" you are worried > about it being. If workers call CHECK_FOR_INTERRUPTS() frequently, which they should, then it should appear to users as if raising an error in one worker kills everything almost immediately, or immediately. For example, if a parallel CREATE INDEX has a worker that raises a unique violation error, that must work in a way that at least *appears* to be very similar to what the user would get with a serial CREATE INDEX. (The worst that can happen is that they'll very occasionally get two unique violation errors instead of one, or something like that.) That said, there are theoretical failures where it could take rather a long time for the parent/leader to get the memo -- see WaitForParallelWorkersToAttach() and its caller (note that anything using a Gather node is subject to the same kind of failure that WaitForParallelWorkersToAttach() handles, even though it won't call the function itself). These failures (e.g. fork() failure) are generally assumed to be rare to non-existent, though. Users will surely be upset if parallel queries cannot be cancelled almost immediately. If it happened with any regularity, somebody would have complained by now. As Tom said, it's hard to give a useful answer without more context -- how do you define "immediate"? -- Peter Geoghegan
Re: PG11 Parallel Thanks!!
On Wed, Oct 2, 2019 at 8:41 AM Jason Ralph wrote: > Since pg11 on both the target and source, the run time has decreased a lot, I > chalk it up to the parallel index creations in pg11 which was a very time > consuming process on pg9.3. > The process has finished almost 10 hours earlier than pg93. So thank you for > your hard work and dedication to this awesome piece of software. How long did it take on 9.3? I am the author of the parallel CREATE INDEX feature. It's good to get feedback like this. -- Peter Geoghegan
Re: Clarification on the release notes of postgresql 12 regarding pg_upgrade
On Fri, Oct 4, 2019 at 9:09 AM Tom Lane wrote: > > You can't REINDEX safely regarding that note. > > Actually running into that problem is quite unlikely; and if you did > hit it, it'd just mean that the REINDEX fails, not that you have any > urgent problem to fix. I'd encourage you to just go ahead and REINDEX, > if you have indexes that could benefit from the other changes. Right. It is hard to imagine an application that evolved to fully rely on the previous slightly higher limit, and cannot tolerate a reduction in the limit by only 8 bytes. The limit applies to a tuple *after* TOAST compression has been applied. -- Peter Geoghegan
Re: PG11 Parallel Thanks!!
On Thu, Oct 3, 2019 at 10:31 AM Jason Ralph wrote: > The end of month process that we run at my company was a pg_dump and > pg_restore of 3 tables, these tables are around ~(400GB) each. The entire > process on pg93 took 29 hours. > > The index creation portion of the restore on the target pg9.3 database took: > 5) time: -15 hours -4 minute ((-54264 % 60)) seconds > > The index creation of the restore on the target db after pg11 upgrade on > source and target took: > 5) time: -5 hours -7 minute ((-18434 % 60)) seconds > > We saved 10 hours!! The sort code received many improvements over the years, really starting in 9.5, and continuing in 9.6, 10 and 11. FWIW, I think that that was probably the biggest factor here. Though parallel CREATE INDEX will have helped as well. -- Peter Geoghegan
Re: Rows violating Foreign key constraint exists
On Fri, Nov 29, 2019 at 7:23 AM Tom Lane wrote: > The most likely "corruption" explanation is something wrong with the > indexes on the referenced and/or referencing column, causing rows to > not be found when referential actions should have found them. Random > querying of the tables wouldn't necessarily expose that --- you'd need > to be sure that your queries use the questionable indexes, and maybe > even search for some of the specific rows that seem mis-indexed. Or try using contrib/amcheck, which is available in Postgres 10. Perhaps try the query here, modified to verify all B-Tree indexes (not just those indexes in the pg_catalog schema): https://www.postgresql.org/docs/10/amcheck.html -- Peter Geoghegan
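Concretely, that means something like the documentation's example query with the pg_catalog filter and the LIMIT removed (a sketch; on 10, contrib's bt_index_check only takes the index argument): SELECT bt_index_check(c.oid), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' AND c.relpersistence != 't' AND c.relkind = 'i' AND i.indisready AND i.indisvalid ORDER BY c.relpages DESC; That checks every B-Tree index in the database, largest first.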
Re: Query returns no results until REINDEX
On Fri, Feb 7, 2020 at 3:52 PM Colin Adler wrote: > Seems to work now. My question is, is this something I should report to the > maintainers? I am one of the people that maintains the B-Tree code. You didn't mention what version of Postgres you're using here. That could be important. Please let us know. Mention the minor component of the release version, too (i.e. say 12.2, not just 12). > I took a snapshot of the data folder before the reindex in case it > would be helpful. Is index corruption something that should be actively looked > out for? Yes -- look for corruption. If I had to guess, I'd say that this has something to do with upgrading the operating system to use a different, incompatible glibc. Or perhaps it has something to do with streaming replication between machines with different glibc versions. You should try running contrib/amcheck, which should be able to isolate index corruption, and give you a specific complaint. You may then be able to inspect the exact index page with the problem using contrib/pageinspect. Something like this ought to do it on Postgres 11 or 12: CREATE EXTENSION IF NOT EXISTS amcheck; SELECT bt_index_check('my_index', true); If that doesn't show any errors, then perhaps try this: SELECT bt_index_parent_check('my_index', true); If you're on Postgres 10, then you should leave out the second argument, "true", since that version doesn't have the extra heapallindexed check. Let us know what you see. -- Peter Geoghegan
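If amcheck does report an error naming a particular block, something along these lines lets you look at the raw items on that page (a sketch; the index name and block number are placeholders): CREATE EXTENSION IF NOT EXISTS pageinspect; SELECT * FROM bt_page_items('my_index', 123);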
Re: Query returns no results until REINDEX
On Sun, Feb 9, 2020 at 12:50 PM Colin Adler wrote: > Looks like it found something. I checked out the contrib/pageinspect docs but > wasn't too sure what to run. Are incompatible libc versions causing btree > corruption something you consider a bug? If it's something you'd like to look > into further I can gladly send over the database files. No, this is not considered a bug. It's unfortunate that there is no built in collation versioning mechanism, or something like that -- that might have alerted you to the problem before any real damage occurred. We have that for the ICU collations, but it currently isn't possible to use ICU as the default collation provider. You really have to go out of your way to use ICU collations. -- Peter Geoghegan
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 1:26 PM Darren Lafreniere wrote: > We've read that PG 12 has improved btree index support, and that the latest > internal btree version was bumped from 3 to 4. Is it possible to query the > btree version that a particular index is using? We'd like to automatically > start a concurrent re-index if we detect any btree indexes are still on > version 3. It's possible, but you have to install the superuser-only pageinspect extension. Here is how you'd determine that an index called 'pg_aggregate_fnoid_index' is on version 4: regression=# create extension pageinspect; CREATE EXTENSION regression=# select version from bt_metap('pg_aggregate_fnoid_index'); version - 4 (1 row) -- Peter Geoghegan
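Something along these lines should list any B-Tree indexes still on version 3 (a sketch; pageinspect is superuser-only, and the column alias is mine): SELECT * FROM ( SELECT c.oid::regclass AS index_name, (bt_metap(c.oid::regclass::text)).version AS btree_version FROM pg_class c JOIN pg_index i ON i.indexrelid = c.oid JOIN pg_am am ON am.oid = c.relam WHERE c.relkind = 'i' AND am.amname = 'btree' AND c.relpersistence != 't' AND i.indisvalid ) AS v WHERE btree_version < 4;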
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 2:13 PM Darren Lafreniere wrote: > when you restore a DB from a backup, does the restored index use the old > format or the latest one? If you use pg_restore, it uses the latest index format. If you're using pg_upgrade, the version won't change unless and until you REINDEX. This includes cases where you're running pg_upgrade against a restored physical backup. -- Peter Geoghegan
Re: create index insist on 2 workers only
On Fri, Apr 24, 2020 at 7:32 AM Radoslav Nedyalkov wrote: > We 're rebuilding a big table which has set parallel_workers = 6 > system has > max_parallel_maintenance_workers | 6 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers | 16 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers_per_gather | 4 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > > Also session level on index restore there is > set max_parallel_maintenance_workers = 6; > > Still we get only 2 parallel processes in a free of any other load system. > It is postgres 11.7 Try increasing maintenance_work_mem from the default of 64MB. MWM constrains the number of parallel workers used. -- Peter Geoghegan
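For example, at the session level before the index build (the values and names here are only illustrative): SET maintenance_work_mem = '2GB'; SET max_parallel_maintenance_workers = 6; CREATE INDEX my_idx ON my_big_table (some_col); With the default 64MB of maintenance_work_mem, the worker count is held down regardless of the other parallel settings, since each participant needs a minimum share of that memory.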
Re: 12.2: Why do my Redo Logs disappear??
On Mon, Jun 8, 2020 at 5:17 PM Peter wrote: > Loosing a RedoLog is very bad, because there is no redundancy, > loosing a single one of them makes the timeline disappear and it > will only reappear after another Base Backup. Very very bad. >In this case, it seems, Postgres will delete the current log >without archiving it. :( I strongly suspect that you were hit by the bug fixed in commit 4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix: "Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais) WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived." Sorry that you were affected by this bug -- it really sucks. -- Peter Geoghegan
Re: 12.3 replicas falling over during WAL redo
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi wrote: > All of the cited log lines seem suggesting relation with deleted btree > page items. As a possibility I can guess, that can happen if the pages > were flushed out during a vacuum after the last checkpoint and > full-page-writes didn't restored the page to the state before the > index-item deletion happened(that is, if full_page_writes were set to > off.). (If it found to be the cause, I'm not sure why that didn't > happen on 9.5.) There is also a Heap/HOT_UPDATE log line with similar errors. -- Peter Geoghegan
Re: 12.3 replicas falling over during WAL redo
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera wrote: > You can use pageinspect's page_header() function to obtain the page's > LSN. You can use dd to obtain the page from the file, > > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 Ben might find this approach to dumping out a single page image easier, since it doesn't involve relfilenodes or filesystem files: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent wrote: > But, if the ordering of operator-class equal tuples is already > system-defined, could the physical ordering of index tuples in a btree > (with deduplication enabled for "unsafe" opclasses) be updated from > [index_columns, tid] to [index_columns, > image_compare(non_datum_equal_columns), tid], giving a stable sorting > of opclass-equal and image-equal values and enabling safe consistent > deduplication? The issue isn't the physical ordering. The issue is that we cannot allow the implementation to destroy semantic differences among equal datums. We avoid deduplication with cases where two equal datums are visibly different. For example, it would not be okay if we forgot that your numeric datum was originally input as '5.000', and output '5' later on. If we wanted to fix this for numeric, we'd have to invent a new numeric datatype (called numeric2, say). That probably isn't as hard as it sounds, since it could be part of the same B-Tree operator family as numeric. It could also be implicitly cast to numeric. -- Peter Geoghegan
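To make the display scale point concrete: SELECT '5.000'::numeric = '5'::numeric; -- true, equal per the opclass SELECT '5.000'::numeric; -- still prints 5.000, not 5 The two inputs are equal to the B-Tree operator class, but they are visibly different values, which is exactly what deduplication must not throw away.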
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan wrote: > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. I forgot to say: numeric2 would be just like numeric, except in one specific way: it wouldn't care about display scale. The user would be giving up on display scale by choosing numeric2 over numeric. The "5 vs 5.000" information would always be lost by design, so there'd be nothing for deduplication to break. Deduplication could then be enabled. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Deduplication does not need to destroy semantic differences? 'equal' > can (in my book) mean: > - 'opclass-equal', that is the opclass returns true for an equality check > - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is > the unprocessed on-disk representations (datum image is the right term > I believe?) of the compared values are indistinguishable. > > Runs of 'binary equal' datums can be freely deduplicated [0] when found. > [0] > Inserting a row in a deduplicated index with in, with TID ntid, can > encounter a posting list of a opclass-equal but not datum image-equal > tuples where the lowest TID of the posting list is less than ntid, and > ntid is less than the highest TID of the posting list. This would > require a posting list split to accomodate the new tuples' index entry > in order to not lose data. But you can't do that easily, because it breaks subtle assumptions about posting list splits and space utilization. In particular, it means that you can no longer think of a posting list split as rewriting an incoming new item such that you can more or less pretend that there was no overlap in the first place -- code like _bt_split and nbtsplitloc.c relies on this. Introducing new special cases to nbtsplitloc.c is very unappealing. More concretely, if you introduce a posting list split like this then you need three copies of the key -- the original, the new, and a second copy of the original. That's much more complicated. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Given that the above could work, the current btree tuple ordering is > not optimized for opclass-equal but datum image-distinct values: > ordering of opclass-equal values is currently determined only by tid, > with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', > '0']. It would be more optimized for deduplication if that was stored > as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I > suggested to add an ordering by the datum image before the tid > ordering. Additionally, this extra ordering also prevents the problem > of [0] by never attempting an insertion of non-equal image datums in a > posting list of otherwise equal values, as it would be ordered either > before or after the posting list, never inside the list. Yeah, that would work, but at the cost of making numeric totally unusable. Now you cannot rely on unique enforcement detecting that '0' is a duplicate of '0.0'. In fact, even the most trivial use of the = operator will be broken in the presence of different display scales. It's a non-starter. The numeric2 design that I sketched is a bit ugly, but I can see no better way. A three-way posting list split (i.e. the other design that you sketched) is a special case that is very hard to test, very complicated, and of little value in the grand scheme of things. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent wrote: > Would this extra ordering not effectively be an extra tiebreaker in > the ordering, applied before the TID? I do not know the full > implications of that, but I believe that would not result in the > limitations that you are mentioning. You could probably do it that way, but again you end up with a lot of new complexity. Not to mention overhead that would have to be paid by everyone. It would require code that supported the old way (even if it was added to Postgres 13) for pg_upgrade, that would also be hard to test. And it might defeat certain future optimizations based on heap TID being the only tiebreaker. Having two types of equality might have to bleed into the optimizer. It's a question of engineering trade-offs. I don't think that it's worth it. -- Peter Geoghegan
Re: Need explanation on index size
On Thu, Sep 24, 2020 at 6:55 AM Guillaume Luchet wrote: > I don’t understand why after the update where I only update a non indexed > column the indexes size is growing. Is it something someone can explain ? If you reduce the table fillfactor then these updates will all be HOT updates. That will make the table larger initially, but leaving enough space behind on the same heap pages for successor tuples makes it possible to use HOT updates. -- Peter Geoghegan
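For example (the table name is a placeholder; the setting only affects newly written pages, so existing full pages are only repacked by something like VACUUM FULL or CLUSTER): ALTER TABLE my_table SET (fillfactor = 90);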
Re: autovacuum verbose?
On Thu, Jan 21, 2021 at 12:55 PM Tommy Li wrote: > Is there any way to configure autovacuum to log the same information as > VACUUM VERBOSE? No. Though there really should be. -- Peter Geoghegan
Re: oldest xmin is far in the past :: BUT xmin is not available in system
On Mon, Apr 18, 2022 at 11:37 PM bhargav kamineni wrote: > executing the vacuum on the entire cluster is also giving the same HINTS and > WARNING's You're using Aurora, not PostgreSQL. Perhaps this is actually a bug, but there is no way for anybody here to know. -- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Wed, Apr 20, 2022 at 4:06 PM senor wrote: > I'm attempting to mimic a new feature in version 13 where INSERTS will > trigger vacuum for an append-only table. The problem with that idea is that you need to express the idea that the table needs to be vacuumed now in terms of its "age", denominated in XIDs -- but XIDs consumed by the entire system, not just those XIDs that happen to modify your append-only table. It will likely be very hard for you to figure out a way to relate these logical units (XIDs) to some kind of physical cost that captures how far behind you are on freezing (like blocks, or even tuples). Maybe you'll find something that works through trial and error, but I wouldn't count on it. > I'm apparently needing an education on how this "to avoid wraparound" vacuum > differs from any other. I've seen it referenced as "more aggressive" but I'd > like details. An upgrade to 13 is "right around the corner". It's complicated -- more complicated than it really should be. Technically an anti-wraparound autovacuum and an aggressive vacuum are two different things. In practice anti-wraparound autovacuums are virtually guaranteed to be aggressive, though an aggressive autovacuum may not be an antiwraparound VACUUM (sometimes we do aggressive vacuuming because autovacuum launched a worker before age(relfrozenxid) reached autovacuum_freeze_max_age, but after age(relfrozenxid) reached vacuum_freeze_table_age). See my recent response to a similar question here: https://postgr.es/m/CAH2-WzkFQ-okvVXizpy4dCEVq75N-Qykh=crhzao-eajflv...@mail.gmail.com -- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Thu, Apr 21, 2022 at 8:14 PM Senor wrote: > Are the autovacuum_vacuum_cost_* settings handled any differently for > 'to avoid wraparound' vacuums? I understand that it won't give up a lock > but I was expecting it to still back off due to cost and allow the query > with conflicting lock to proceed. In general, no. For the most part an antiwraparound autovacuum does exactly the same work as any other autovacuum. Or any other aggressive VACUUM, at least. But even the extra work that it does beyond what a non-aggressive VACUUM would do is still work that any VACUUM might do, if the circumstances were right. We still freeze in regular VACUUMs, provided we scan pages with XIDs that are sufficiently old. The most important difference between a non-aggressive VACUUM and an aggressive VACUUM is that the former can skip all-visible pages that have unfrozen XIDs, putting that work off. This can sometimes lead to a big balloon payment later on, when you finally have an aggressive VACUUM. I think that that's a design flaw that ought to be fixed. Currently non-aggressive VACUUMs always skip all-visible pages. They should probably freeze some older all-visible pages eagerly, rather than skipping them, so that the system never gets too far behind on freezing. > Is there any benefit to manually running a vacuum every so many inserts > as opposed to using autovacuum_freeze_max_age. And in this case should > it be a vacuum freeze. Given your restrictions, this is probably the best option available. But maybe you should just set vacuum_freeze_min_age to 0 at the table level, instead of using vacuum freeze (so you freeze more without doing aggressive vacuuming all the time, which FREEZE also forces). Users understandably think that there are several different flavors of vacuum, but that's not really true (apart from VACUUM FULL, which really is quite different). The difference between aggressive and non-aggressive can be big in practice due to an accumulation of unfrozen pages over multiple non-aggressive vacuums. -- Peter Geoghegan
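For example, something along these lines (the table name is a placeholder): ALTER TABLE append_only_table SET (autovacuum_freeze_min_age = 0); -- have autovacuum freeze eagerly for this table. For a manually scheduled VACUUM, the session GUC has the same effect: SET vacuum_freeze_min_age = 0; VACUUM append_only_table;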
Re: Automatic autovacuum to prevent wraparound - PG13.5
On Wed, Jun 15, 2022 at 4:13 AM Mauro Farracha wrote: > The scenario: > - Out of nowhere (during the weekend), without database activity load or > batches running, with previous nightly run of vacuum freeze, in the middle of > the day, with xids and mxids below 20M we are seeing autovacuum being > triggered to prevent wraparound. > > My question is why this is occurring, which condition might be responsible > for this behaviour? There is a behavior that seems like it might be relevant: VACUUM interprets autovacuum_multixact_freeze_max_age in a way that accounts for both MultiXactId consumption and the consumption of "member space" by MultiXacts. Technically there are 2 SLRUs for MultiXacts, either of which can wraparound. This behavior was established by commit 53bb309d2d. It is documented. Admittedly this whole area of the documentation is in dire need of an overhaul. :-( -- Peter Geoghegan
Re: Monitoring multixact members growth
On Fri, Aug 19, 2022 at 8:40 AM Vido Vlahinic wrote: > However it is not quite clear to me how I can interpret results from above > and e.g. conclude: my_table accumulates x more multixact members since I > measured last. You can't. And not just because nobody got around to implementing it yet -- it's quite a fundamental restriction. VACUUM must always make sure of that, and must always scan all unfrozen pages to safely determine that much (it may or may not have to freeze *any* MultiXacts as part of that process, but it must always be sure that no Multis < its final relminmxid remain). > My goal here is to predict where multixact members are growing the fastest so > I can perform manual VACUUM FREEZE only on those tables The problem with that strategy is that you still have to do anti-wraparound autovacuums when the mxid_age(relminmxid) of a table crosses the usual threshold, even when in reality there are *zero* MultiXacts in the table (often the case with the largest tables). That's just how it works, unfortunately. There is one piece of good news, though: work in Postgres 15 taught VACUUM to track the oldest extant XID and MXID in the table, and set relfrozenxid and relminmxid to those oldest values (rather than using the cutoffs for freezing, which in general might be much older than the oldest remaining unfrozen XID/MXID). I expect that this will make it much less likely that anti-wraparound autovacuums affecting many tables will all stampede, hurting performance. This will be possible because VACUUM will now be able to set relminmxid to a value that actually tells us something about what's really going on in each table, MultiXact-wise (not just what you set vacuum_multixact_freeze_min_age and autovacuum_multixact_freeze_max_age to in postgresql.conf, which is pretty far removed from what matters most of the time). Simply by noticing that there are no remaining MultiXacts (and probably never were any in the first place) with the larger tables. The timeline for anti-wraparound autovacuums will tend to make a lot more sense for *your* workload, where huge differences in the rate of MultiXact consumption among tables is likely the norm. This still isn't perfect (far from it), but it has the potential to make things far better here. -- Peter Geoghegan
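What you can usefully monitor today is MultiXactId age per table, not per-table member space consumption -- something like this sketch (the alias and LIMIT are arbitrary): SELECT c.oid::regclass AS table_name, mxid_age(c.relminmxid) AS relminmxid_age FROM pg_class c WHERE c.relkind IN ('r', 'm', 't') ORDER BY relminmxid_age DESC LIMIT 20;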
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera wrote: > I remember having an hypothesis, upon getting a report of this exact > problem on a customer system once, that it could be due to killtuple not > propagating to standbys except by FPIs. I do not remember if we proved > that true or not. I do not remember observing that tables were being > read, however. That's true, but it doesn't matter whether or not there are LP_DEAD bits set on the standby, since in any case they cannot be trusted when in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false on the standby. -- Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane wrote: > Do we propagate visibility-map bits to standbys? Yes. -- Peter Geoghegan
Re: Weird planner issue on a standby
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane wrote: > However, that doesn't explain the downthread report that a > VACUUM on the primary fixed it. What I suspect is that that > caused some in-fact-dead index entries to get cleaned out. Seems likely. > But ... if the primary is allowed to vacuum away an index > entry that it thinks is dead, exactly what is the point of > making standbys ignore LP_DEAD bits? There's no additional > interlock that guarantees the tuple will be there at all. The interlock doesn't really protect the leaf page or its index tuples so much as the referenced TIDs themselves. In other words it's a TID recycling interlock. That's why we don't need a cleanup lock to perform index tuple deletions, even though the WAL records for those are almost identical to the WAL records used by index vacuuming (in the case of nbtree the only difference is the extra latestRemovedXid field in the deletion variant WAL record). We know that there is no VACUUM process involved, and no question of heap vacuuming going ahead for the same TIDs once index vacuuming is allowed to complete. We can get away with not having the interlock at all in the case of nbtree index scans with MVCC snapshots -- but *not* with index-only scans. See "Making concurrent TID recycling safe" in the nbtree README. I only got around to documenting all of the details here quite recently. The index-only scan thing dates back to 9.5. -- Peter Geoghegan
Re: ON CONFLICT and WHERE
On Sun, Nov 13, 2022 at 1:07 PM Tom Lane wrote: > A WHERE placed there is an index_predicate attachment to the ON CONFLICT > clause. It doesn't have any run-time effect other than to allow partial > indexes to be chosen as arbiter indexes. TFM explains > > index_predicate > > Used to allow inference of partial unique indexes. Any indexes > that satisfy the predicate (which need not actually be partial > indexes) can be inferred. > > This strikes me as a bit of a foot-gun. I wonder if we should make > it safer by insisting that the resolved index be partial when there's > a WHERE clause here. I don't think that it would be safer. Adrian has asked why it's possible to attach an arbitrary index_predicate type WHERE clause to an ON CONFLICT query, without that really changing the behavior of the statement. That *is* a little odd, so it's certainly a fair question (I can recall perhaps as many as 5 similar questions over the years). But it's not the end of the world, either -- there are far worse things. I think that it would be a lot worse (just for example) to have your ON CONFLICT query suddenly start throwing an ERROR in production, just because you replaced a partial unique index with a unique constraint. If we have a suitable unique index or constraint, why wouldn't we use it in ON CONFLICT? Maybe it won't work out that way (maybe there won't be any suitable unique index or constraint), but why not do our utmost to insulate the user from what might be a serious production issue? That was the guiding principle. Overall I'm quite happy with the amount of foot-guns ON CONFLICT has, especially compared to other comparable features in other DB systems (which had plenty). There are one or two ostensibly odd things about the syntax that are downstream consequences of trying to make the constraint/unique index inference process maximally forgiving. I'm pretty happy with that trade-off. > (This documentation text is about as clear as > mud, too. What does "inferred" mean here? I think it means "chosen as > arbiter index", but maybe I misunderstand.) Unique index/constraint inference is the process by which we choose an arbiter index. See the second paragraph of the "ON CONFLICT Clause" section of the INSERT docs. -- Peter Geoghegan
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer wrote: > Both do a parallel index only scan. Both perform 0 heap fetches. > But one reads 27336 buffers (or about 22 bytes per index entry, which > sounds reasonable) while the other reads 9995216 buffers (or almost one > full buffer per row). Why? The entries should be dense in the index in > both cases and since it's an index only scan (and explain says there > were 0 heap fetches) I would not expect extra accesses. Where do these > buffer reads come from? The index-only scan processes an index leaf page at a time. When there is naturally a high correlation (or some kind of clustering) in how we access VM pages, we'll naturally be able to do more visibility checks covering more index tuples per VM page accessed. This is a less severe problem here than it would be with an equivalent pair of plain index scans, just because there are so few VM pages relative to heap pages. But it's more or less an analogous problem. You're really noticing it here because these index scans have very low selectivity -- which is kinda unusual in most environments. Also worth bearing in mind that it's unusual to have a perfectly random and uniformly distributed clustering of index tuples, which is what the index built via hashing exhibits. Even a v4 UUID index could easily have plenty of duplicates, which would probably do significantly better on the metric you've focussed on. -- Peter Geoghegan
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 1:50 PM Peter J. Holzer wrote: > There should be about 27000 of them, same as for the othe index, right? There aren't that many. The point I'm making is that you can access each VM page approximately once (and check relatively many index tuples' TIDs all in one go), or many times. The total number of VM pages may be constant, but the access patterns are quite different owing to differences in how the data is clustered in each index. > > When there is naturally a high correlation (or some kind of > > clustering) in how we access VM pages, we'll naturally be able to do > > more visibility checks covering more index tuples per VM page > > accessed. > > So you are saying that these are accesses to the visibility map, not the > base table? Yes. I see "Heap Fetches: 0" for both plans, which each query the same table and scan approximately the same number of index pages. So VM accesses are the only explanation that makes any sense. > > Also worth bearing in mind that it's unusual to have a perfectly > > random and uniformly distributed clustering of index tuples, > > Sure. This is a highly contrived example. FWIW I think that it could be a lot less bad, even with indexes that you'd think would be almost as bad as the bad one from your test case. Even things that appear to be random aren't usually nearly as random as what you've shown. -- Peter Geoghegan
Re: Question regarding UTF-8 data and "C" collation on definition of field of table
On Sun, Feb 5, 2023 at 4:19 PM Tom Lane wrote: > If there's a predominant language in the data, selecting a collation > matching that seems like your best bet. Otherwise, maybe you should > just shrug your shoulders and stick with C collation. It's likely > to be faster than any alternative. FWIW there are certain "compromise locales" supported by ICU/CLDR. These include "English (Europe)", and, most notably, EOR (European Ordering Rules): https://en.wikipedia.org/wiki/European_ordering_rules I'm not sure how widely used those are; EOR seems to have been standardized by the EU or by an adjacent institution. It's also possible to use a custom collation with ICU, which is almost infinitely flexible: http://www.unicode.org/reports/tr10/#Customization As an example, the rules about the relative ordering of each script can be changed this way. There is also something called merged tailorings. The OP should see the Postgres ICU docs for hints on how to use these facilities to make a custom collation that matches whatever their requirements are: https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING -- Peter Geoghegan
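For example, this customization (taken from the kind of ICU tailoring described in the Postgres docs; the collation name is arbitrary) sorts Greek letters before Latin ones: CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');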
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Wed, Feb 8, 2023 at 11:54 PM Paul McGarry wrote: > But if it is the problem, why did the update start working after I recreated > the other index? There is no reason why reindexing another index ought to have had that effect. The likely explanation is that subsequent updates used a successor version heap TID that didn't overlap with some existing posting list in whatever way. If you repeat the update again and again, and get an error each time, the incoming TID will differ each time. Eventually you won't get an error, because at some point there won't be a posting-list-TID range overlap for some new successor TID that leads to the insert/posting list split code detecting a problem. It's also possible that a concurrent autovacuum "fixed" the issue. The amcheck error shows a problem in an internal page, which cannot have posting list tuples -- which suggests broad corruption. An issue with collation instability due to an OS update does seem likely. Note that the hardening/defensive checks in this area have increased. I added an additional defensive check to 13.4, and followed up with another similar check in 13.5. It looks like the error you've seen ("ERROR: posting list tuple with 2 items cannot be split at offset 17") comes from the initial 13.4 hardening, since I'd expect the additional 13.5 hardening to catch the same issue sooner, with a different error message (something like "table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\""). > I think I should now: > - recreate the widget_name_idx on the problem servers > - run bt_index_check across all other indexes > > Any suggestions on what else I should look into, in particular anything I > should check before upgrading the remaining 13.8 DB to 13.9? I recommend running amcheck on all indexes, or at least all possibly-affected text indexes. -- Peter Geoghegan
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Thu, Feb 9, 2023 at 3:55 PM Paul McGarry wrote: > Will the amcheck reliably identify all issues that may arise from a collation > change? Theoretically it might not. In practice I'd be very surprised if it ever failed to detect such an inconsistency. If you want to be extra careful, and can afford to block concurrent writes, then I suggest using bt_index_parent_check instead of bt_index_check. > and therefore I can just use that to identify bad indexes and recreate them, > or should I recreate all btree indexes involving text fields? It might be easier to just reindex them all. Hard to say. > We planned to do a dump/restore upgrade to PG14 in a month or so (already > done in dev). > I am wondering whether it will be less work overall to bring that forward > than rebuild these indexes... pg_amcheck is available on 14. It offers a much simpler interface for running the amcheck routines, so maybe look into that once you upgrade. -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 7:44 AM Adrian Klaver wrote: > > That is, if I understand it correctly, it says that there were (and > > actually are) 2013128 pages of which 2008230 were skipped, which leaves > > 4898 blocks to be scanned. I.e. it seems that the allocated 1GB > > (autovacuum_work_mem) should be enough to handle that amount of blocks > > and to avoid multiple scans of the indexes. It's quite clear that there is only one pass over the indexes, since "index scans: 1" says exactly that. > > But, based on buffer usage, one can see that a huge amount of data is > > read, greatly exceeding not only the number of remaining unfrozen > > blocks, but also the size of the table and indexes taken together: 2 > > billion blocks, more than 15TB. > > > > Is this a bug in Postgresql or am I interpreting the log data wrong? I think that it might be a bug in Postgres. I addressed a similar issue in the same "buffer" instrumentation in commit d3609dd2, but that wasn't backpatched because I imagined that it only applied to the new VACUUM VERBOSE case (VACUUM VERBOSE only started sharing the same instrumentation code as log_autovacuum in Postgres 15). It's not immediately obvious how you could see a problem like the one you've shown in the autovacuum log output. Even still, the information about buffers that you've shown does indeed appear to be total nonsense (while everything else we can see looks plausible). There has to be some explanation for that. The only other explanation I can think of is a pathological case where an index scan by some particular ambulkdelete routine scans a number of buffers that vastly exceeds the total size of the index. That does seem just about possible with an access method like GIN. Do you have any non-btree indexes on the table? Can you show us the details of the table, including all of its indexes? In other words, can you show "\d applications" output from psql? -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 5:40 PM Mikhail Balayan wrote: >> >> Do you have any non-btree indexes on the table? Can you show us the >> >> details of the >> >> table, including all of its indexes? In other words, can you show "\d >> >> applications" output from psql? > > Only btree indexes. Please find the full table schema below: It's possible that VACUUM had to wait a long time for a cleanup lock on one individual heap page here, which could have added a long delay. But...that doesn't seem particularly likely. Can you run amcheck's bt_index_check() routine against some of the indexes you've shown? There is perhaps some chance that index corruption exists and causes VACUUM to take a very long time to delete index pages. This is pretty much a wild guess, though. -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Mon, Feb 20, 2023 at 9:43 PM Mikhail Balayan wrote: > What catches my eye: scanning indexes smaller than 3.1GB is fast, larger ones > are slow. For example: > idx_applications2_policy_id is 3131 MB took just 5 seconds (DETAIL: CPU: > user: 2.99 s, system: 1.65 s, elapsed: 5.32 s) > but idx_applications2_deleted_at with 3264 MB took 1 minute 22 seconds > (DETAIL: CPU: user: 67.93 s, system: 3.41 s, elapsed: 82.75 s) I think that I know what this is. If you delete many index pages during VACUUM, and those pages are all full of duplicate values, the deletion operation can sometimes be slower due to the need to relocate a downlink to each to-be-deleted leaf page. When there are thousands of matches, you'll start to notice O(n^2) behavior due to the way in which the B-Tree VACUUM code must grovel through the parent level, which is full of duplicate keys. If you were on Postgres 12+, then this wouldn't happen, because the heap TID is treated as a part of the key space there, affecting sort order. The implementation would immediately relocate the matching parent downlink using a unique key (unique because heap TID would act as a unique-ifier on that version). And if you were on 14+, things in this area would be much better still. -- Peter Geoghegan
Re: PG16devel - vacuum_freeze_table_age seems not being taken into account
On Fri, Mar 3, 2023 at 2:43 AM Simon Elbaz wrote: > I expected it not to be processed by vacuum freeze. > However it has been entirely frozen. > Moreover, among the 51 rows, only 1 was eligible for freeze because its XID > was older than vacuum_freeze_min_age. The effect that you noticed is a consequence of page-level freezing, which is new to Postgres 16. VACUUM will now freeze all of the tuples on a page whenever it needs to freeze any tuples at all (barring any tuples that are fundamentally ineligible due to being after the removable/freezable cutoff). This is justified by the cost profile. Once we decide to freeze at least one tuple of a page, the added cost in WAL is low enough that it really doesn't make sense to not just freeze everything. The page that gets frozen by your test case is also set all-frozen in the visibility map. Without the optimization, we'd have frozen that one tuple and then set the page all-visible. The page would likely be frozen again by the next aggressive VACUUM, which is usually much more expensive. -- Peter Geoghegan
Re: UPSERT in Postgres
On Thu, Apr 6, 2023 at 1:21 PM Louis Tian wrote: > An implicit assumption behind this definition is that table must have a > primary key for the upsert operation to make sense since it's the primary key > that uniquely identifies a row. It could just be a unique index or a unique constraint. So you can upsert on any individual unique constraint/index, or the primary key. Of course there might be several on a given table, but you can only use one as the "conflict arbiter" per statement. > But can we achieve the same effect with "INSERT ... ON CONFLICT DO UPDATE" > like a lot of references on the internet seems to suggest. > > insert into person (id, name) values (0, 'foo') on conflict ("id") do update > set id=excluded.id, name=excluded.name > insert into person (id, is_active) values (0, true) on conflict ("id") do > update set id=excluded.id, is_active=excluded.is_active > > Unfortunately. the second statement will fail due to violation of the not > null constraint on the "name" column. > PostgreSQL will always try to insert the row into the table first. and only > fallback to update when the uniqueness constraint is violated. > Is this behavior wrong? maybe not, I think it is doing what it reads quite > literally. It sort of has to work that way, though. In general your example might *not* fail, due to a row-level before trigger in the insert path. Why doesn't your proposed upsert syntax have the same problem? I mean, how could it not? I guess it doesn't if you assume that it'll never take the insert path with your not NULL constraint example? But if you know that for sure, why not just use a regular update statement? On the other hand, if you're not sure if the insert path can be taken, then why is it actually helpful to not just throw an error at the earliest opportunity? Surely upsert means "update or insert", so why wouldn't the user expect to see an error like this, independent of the specifics of the row in question? Isn't the user tacitly saying "I don't specifically know if the update or insert path will be taken in respect of any given row" by using ON CONFLICT DO UPDATE in the first place? > That being said, I have never had a need for the ON CONFLICT DO UPDATE > statement other than where I need upsert. > But using it as "upsert" is only valid when the table is absent of any NOT > NULL constraint on it's non primary key columns. I don't know what you mean by that. "Valid"? > The MERGE command introduced in PG15 in theory can be used to do UPSERT > properly that is void of the aforementioned limitation. > The downside is it is rather verbose. The MERGE command has various race conditions that are particularly relevant to UPSERT type use cases. See the wiki page you referenced for a huge amount of information on this. > *Feature Request* > Given that UPSERT is an *idempotent* operator it is extremely useful. In general UPSERT (or any definition of it that I can think of) does not imply idempotency. -- Peter Geoghegan
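To illustrate the earlier point about arbiter indexes: any unique constraint or unique index can be the conflict arbiter, not just the primary key (hypothetical table and columns): CREATE TABLE person (id int PRIMARY KEY, email text UNIQUE, name text); INSERT INTO person (id, email, name) VALUES (1, 'foo@example.com', 'foo') ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;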
Re: CREATE COLLATION to match pg_collation data
On Sat, Jan 12, 2019 at 7:13 AM Tom Lane wrote: > That's an ICU collation, so you're out of luck: there is no ICU > support in 9.6. FWIW, there is some form of FreeBSD Postgres support for ICU that predates the officially supported ICU feature. The FreeBSD Postgres packages have been built with ICU support for many years -- they modify the source code minimally to make this work. It may well still be impossible to use "az-x-icu" on a FreeBSD installation of 9.6, though. -- Peter Geoghegan
Re: Pg analog to DBCC CCHECKDB?
On Fri, Apr 5, 2019 at 7:33 AM Ron wrote: > In 9.6, does such a thing exist? (We just restored a VM from snapshot and I > want to verify the cluster sanity.) amcheck is available for versions 9.4+, though it only appears in contrib in Postgres 10. There are both yum and deb packages available. See: https://github.com/petergeoghegan/amcheck -- Peter Geoghegan
Re: Upgrading locale issues
On Mon, Apr 29, 2019 at 7:45 AM rihad wrote: > Hi. Today we run pg_ctl promote on a slave server (10.7) and started > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD > 11.2. And you guessed it, most varchar indexes got corrupted because > system local changed in subtle ways. So I created the extension amcheck > and reindexed all bad indexes one by one. Is there any way to prevent > such things in the future? Will switching to ICU fix all such issues? Not necessarily, but it will detect the incompatibility more or less automatically, making it far more likely that the problem will be caught before it does any harm. ICU versions collations, giving Postgres a way to reason about their compatibility over time. The libc collations are not versioned, though (at least not in any standard way that Postgres can take advantage of). > The problem with it is that ICU collations are absent in pg_collation, > initdb should be run to create them, but pg_basebackup only runs on an > empty base directory, so I couldn't run initdb + pg_basebackup to > prepare the replica server. I believe I can run the create collation > command manually, but what would it look like for en-x-icu? It is safe to call pg_import_system_collations() directly, which is all that initdb does. This is documented, so you wouldn't be relying on a hack. -- Peter Geoghegan
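For example, run in the database that's missing the ICU collations (a sketch; requires superuser): SELECT pg_import_system_collations('pg_catalog'); That imports whatever collations the operating system and ICU report into the pg_catalog schema, just as initdb would have done, and returns the number of collations added.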
Re: Upgrading locale issues
On Wed, May 1, 2019 at 3:09 PM Thomas Munro wrote: > As discussed over on -hackers[1], I think it's worth pursuing that > though. FWIW I've proposed locale versioning for FreeBSD's libc[2]. > The reason I haven't gone further with that yet even though the code > change has been accepted in principle by FreeBSD reviewers is because > I got stuck on the question of how exactly to model the versions. If, > say, just Turkish changes, I don't want to be rebuilding my French > indexes, which means that I don't think you can use the CLDR version > string. The ICU versions can handle that, though. Importantly, ICU decouples implementation details from actual versioning. I must say that I am not enthused about the idea of trying to get libc people of any variety on board. I don't have an objection to it if it can work for FreeBSD, but I don't think it can scale. ICU is built around a culture that takes our concerns seriously already, which is what it boils down to. Also, we can imagine a package manager taking it upon themselves to vendor their own ICU, with platform-specific guarantees around stability. That seems like a nice option to have, at least. > There is also the question of how PostgreSQL should model versions, > and as I've argued in [1], I think we should track them at the level > of database object dependencies. I think you're probably right about that. > I'm hoping to reopen this can of worms for PostgreSQL 13 (and the > corresponding support could in theory be in FreeBSD 13... coincidence, > or a sign!?) Maybe we should do what Oracle did, and call it PostgreSQL 18c instead. Actually, any number that isn't of interest to numerologists will do. > Unfortunately you can't use ICU collations as a database default yet > (though there was some WIP code[3]), so ICU only saves you from > versioning problems if you explicitly set collations for columns or > expressions, and even then the version tracking is currently just a > warning that you clear manually with a command, not a mechanism that > really tracks which database objects were last rebuilt/validated with > a given version. Yes, that does seem like a big remaining weakness. -- Peter Geoghegan
Re: Questions about btree_gin vs btree_gist for low cardinality columns
On Sun, Jun 2, 2019 at 4:07 PM Tom Lane wrote: > Anyway, I said to Jeremy in the hallway that it might not be that > hard to bolt IOS support onto GIN for cases where the opclass is > a non-subdividing one, but after looking at the code I'm less sure > about that. GIN hasn't even got an "amgettuple" code path, just > "amgetbitmap", and a big part of the reason why is the need to merge > results from the fastupdate pending list with results from the main > index area. Not sure how we could deal with that. I suspect that GIN also avoids several other problems by only offering an "amgetbitmap", and not an "amgettuple". For example, it doesn't have to worry about things like numeric display scale, where a datum is substantively different to another datum, while still being equal according to opclass semantics (this is an example that I end up citing in many discussions about indexing). I bet that there are a few more of those beyond those two, that I haven't thought about. > Anyway, the larger point here is that right now btree_gin is just a quick > hack, and it seems like it might be worth putting some more effort into > it, because the addition of duplicate-compression changes the calculus > for whether it's useful. There was also discussion about making nbtree support deduplication during the hallway track. Jim Finnerty reminded me that there is a patch from Anastasia that did deduplication in nbtree that didn't go anywhere. Heikki independently talked about the possibility that he would work on this project in the next release, without being prompted by me. I think that the fact that nbtree sorts duplicate entries in heap TID order these days makes that worth looking into again. We can use something like GIN's varbyte encoding process to compress duplicates effectively. A lot of the problems (the numeric display scale problem, pg_upgrade) can be avoided by defining deduplication as something that happens on a best-effort basis. This is not the case within GIN, where it's impossible for the main entry tree to have duplicates without at least storing them in a posting list. -- Peter Geoghegan