Index with new opclass not used for sorting

2019-06-21 Thread Ancoron Luciferis
Hi,

I am creating a new operator class for version 1 UUIDs with an
extension and thought I was almost done by implementing everything
including SortSupport and creating a new opclass as follows:

CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid
    USING btree AS
        OPERATOR        1       <*,
        OPERATOR        1       <~ (uuid, timestamp with time zone),
        OPERATOR        2       <=*,
        OPERATOR        2       <=~ (uuid, timestamp with time zone),
        OPERATOR        3       =,
        OPERATOR        3       =~ (uuid, timestamp with time zone),
        OPERATOR        4       >=*,
        OPERATOR        4       >=~ (uuid, timestamp with time zone),
        OPERATOR        5       >*,
        OPERATOR        5       >~ (uuid, timestamp with time zone),
        FUNCTION        1       uuid_timestamp_cmp(uuid, uuid),
        FUNCTION        1       uuid_timestamp_only_cmp(uuid, timestamp with time zone),
        FUNCTION        2       uuid_timestamp_sortsupport(internal)
;

And I checked that, after installing the extension, the
pg_amproc entries are there:

 family | left | right | num |           amproc
--------+------+-------+-----+-----------------------------
 623810 | 2950 |  2950 |   1 | uuid_timestamp_cmp
 623810 | 2950 |  1184 |   1 | uuid_timestamp_only_cmp
 623810 | 2950 |  2950 |   2 | uuid_timestamp_sortsupport
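
(For reference, a catalog query along these lines produces that
listing; this is a reconstruction rather than the exact query I ran,
and it assumes the implicitly created operator family has the same
name as the opclass:)

SELECT amprocfamily AS family,
       amproclefttype AS "left",
       amprocrighttype AS "right",
       amprocnum AS num,
       amproc
  FROM pg_amproc
 WHERE amprocfamily = (SELECT oid FROM pg_opfamily
                        WHERE opfname = 'uuid_timestamp_ops');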

...but when sorting on a (unique) index column, I still get a separate
sort that does not use the index, e.g.:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Sort (actual rows=934567 loops=1)
   Sort Key: id
   Sort Method: external merge  Disk: 23784kB
   ->  Index Only Scan using idx_uuid_v1_ext on uuid_v1_ext (actual rows=934567 loops=1)
         Index Cond: (id <* '2b55fb04-33d8-11e9-9cfa-e03f494ffcf7'::uuid)
         Heap Fetches: 934567

...but I was expecting a plan similar to the one produced with the standard UUID SortSupport:

                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using uuid_v1_pkey on uuid_v1 (actual rows=1692025 loops=1)
   Index Cond: (id < '2b55fb04-33d8-11e9-9cfa-e03f494ffcf7'::uuid)
   Heap Fetches: 1692025

Am I missing something obvious here?

Cheers,

Ancoron




Re: Index with new opclass not used for sorting

2019-06-21 Thread Ancoron Luciferis
On 21/06/2019 15:36, Tom Lane wrote:
> Ancoron Luciferis  writes:
>> I am creating a new operator class for version 1 UUID's with an
>> extension and thought I was almost done by implementing everything
>> including SortSupport and creating a new opclass as follows:
> 
>> CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid
>> USING btree AS
>> OPERATOR        1       <*,
>> ...
> 
>> ...but when sorting on an (unique) index column, I still get a separate
>> sort, not using the index, e.g.:
> 
> You did not show your test query, but I imagine it just asked for the
> type's ordinary sort order, which is not what this opclass is claiming
> to provide.  To rely on the index's sort order you'd need something like
> 
>   select id from uuid_v1_ext
> where id <* '2b55fb04-33d8-11e9-9cfa-e03f494ffcf7'
> order by id using <* ;
> 
> If you want this opclass to become the default sort order for uuid
> you'd have to remove the opcdefault marking from uuid_ops and attach
> it to this opclass instead.  No, I'm not sure that that wouldn't have
> unpleasant side-effects.
> 
>   regards, tom lane
> 

OK, I somehow feared I'd be getting such an answer.
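
(If I understand correctly, that would boil down to flipping opcdefault
directly in the catalogs, roughly like the untested sketch below - a
direct catalog hack, and given the warning about side-effects probably
not something I should actually do:)

UPDATE pg_opclass SET opcdefault = false
 WHERE opcname = 'uuid_ops'
   AND opcmethod = (SELECT oid FROM pg_am WHERE amname = 'btree');

UPDATE pg_opclass SET opcdefault = true
 WHERE opcname = 'uuid_timestamp_ops'
   AND opcmethod = (SELECT oid FROM pg_am WHERE amname = 'btree');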

But thanks a lot for the "using <*" trick, which I somehow didn't even
know existed. That solves the problem for me, at least.
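
For the record, a sketch of how this fits together now (the exact DDL
may differ; the index and table names are the ones from my plan above):

CREATE INDEX idx_uuid_v1_ext ON uuid_v1_ext (id uuid_timestamp_ops);

SELECT id
  FROM uuid_v1_ext
 WHERE id <* '2b55fb04-33d8-11e9-9cfa-e03f494ffcf7'
 ORDER BY id USING <*;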

Because of this issue I was already thinking about creating a new data
type which is basically just a new name for the existing "uuid" but
enforces version 1, for which I could then provide the opclass as the
default. Or is that a bad idea?
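
The cheapest variant I can think of would be a domain over uuid that
rejects anything but version 1 - just a sketch with a made-up name, and
I'm aware a domain reuses uuid's operators, so it presumably wouldn't
let me attach a different default opclass anyway:

CREATE DOMAIN uuidv1 AS uuid
    CHECK ((get_byte(uuid_send(VALUE), 6) >> 4) = 1);

-- accepted, since the version nibble of this UUID is 1:
SELECT '2b55fb04-33d8-11e9-9cfa-e03f494ffcf7'::uuidv1;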

Btw., is there some example of how to create derived types for PG
properly without copy/pasting a lot of existing code?

Thanx a lot and cheers,

Ancoron




Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

2025-04-08 Thread Ancoron Luciferis

On 2025-04-07 15:21, Joe Conway wrote:

On 4/5/25 07:53, Ancoron Luciferis wrote:

I've been investigating this topic every now and then but to this day
have not come to a setup that consistently leads to a PostgreSQL backend
process receiving an allocation error instead of being killed externally
by the OOM killer.

Why is this a problem for me? Because while applications are accessing
their DBs (multiple services having their own DBs, some high-frequency),
the whole server goes into recovery and kills all backends/connections.

While my applications are written to tolerate that, it also means that
at that time, esp. for the high-frequency apps, events are piling up,
which then leads to a burst as soon as connectivity is restored. This in
turn leads to peaks in resource usage in other places (event store,
in-memory buffers from apps, ...), which sometimes leads to a series of
OOM killer events being triggered, just because some analytics query
went overboard.

Ideally, I'd find a configuration that only terminates one backend but
leaves the others working.

I am wondering whether there is any way to receive a real ENOMEM inside
a cgroup as soon as I try to allocate beyond its memory.max, instead of
relying on the OOM killer.

I know the recommendation is to have vm.overcommit_memory set to 2, but
then that affects all workloads on the host, including critical infra
like the kubelet, CNI, CSI, monitoring, ...

I have already gone through and tested the obvious:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT


Importantly vm.overcommit_memory set to 2 only matters when memory is 
constrained at the host level.


As soon as you are running in a cgroup with a hard memory limit, 
vm.overcommit_memory is irrelevant.


You can have terabytes of free memory on the host, but if cgroup memory 
usage exceeds memory.limit (cgv1) or memory.max (cgv2) the OOM killer 
will pick the process in the cgroup with the highest oom_score and whack 
it.


Unfortunately there is no equivalent to vm.overcommit_memory within the 
cgroup.



And yes, I know that Linux cgroups v2 memory.max is not an actual hard
limit:

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files


Read that again -- memory.max *is* a hard limit (same as memory.limit in 
cgv1).


   "memory.max

     A read-write single value file which exists on non-root cgroups. The
     default is “max”.

     Memory usage hard limit. This is the main mechanism to limit memory
     usage of a cgroup. If a cgroup’s memory usage reaches this limit and
     can’t be reduced, the OOM killer is invoked in the cgroup."


Yes, I know it says "hard limit", but any app can still go beyond it
(it may just be on me here to assume that a "hard limit" implies an
actual error when trying to exceed it). The OOM killer will kick in
eventually, but not in any way that a process inside the cgroup could
prevent. So there is no signal the app could react to, saying "hey, you
just went beyond what you're allowed, please adjust before I kill you".





If you want a soft limit use memory.high.

   "memory.high

     A read-write single value file which exists on non-root cgroups. The
     default is “max”.

     Memory usage throttle limit. If a cgroup’s usage goes over the high
     boundary, the processes of the cgroup are throttled and put under
     heavy reclaim pressure.

     Going over the high limit never invokes the OOM killer and under
     extreme conditions the limit may be breached. The high limit should
     be used in scenarios where an external process monitors the limited
     cgroup to alleviate heavy reclaim pressure."

You want to be using memory.high rather than memory.max.


Hm, so solely relying on reclaim? I think that'll just put the whole
cgroup into ultra-slow mode and would not actually prevent allocating
too much memory. While this may work out just fine for the PostgreSQL
instance, it'll for sure have effects on the other workloads on the
same node (which, in my case, include more PG instances).


I also don't see a way to even try this out in a Kubernetes
environment, since there doesn't seem to be a way to set this value
through any workload manifest field.




Also, I don't know what kubernetes recommends these days, but it used to 
require you to disable swap. In more recent versions of kubernetes you 
are able to run with swap enabled but I have no idea what the default is 
-- make sure you run with swap enabled.


Yes, this is what I wanna try out next.



The combination of some swap being available, and the throttling under 
heavy reclaim will likely mitigate your problems.




Thank you for your insights, I have something to think about.

Cheers,

Ancoron






Kubernetes, cgroups v2 and OOM killer - how to avoid?

2025-04-07 Thread Ancoron Luciferis

Hi,

I've been investigating this topic every now and then but to this day 
have not come to a setup that consistently leads to a PostgreSQL backend 
process receiving an allocation error instead of being killed externally 
by the OOM killer.


Why is this a problem for me? Because while applications are accessing
their DBs (multiple services having their own DBs, some high-frequency),
the whole server goes into recovery and kills all backends/connections.


While my applications are written to tolerate that, it also means that 
at that time, esp. for the high-frequency apps, events are piling up, 
which then leads to a burst as soon as connectivity is restored. This in 
turn leads to peaks in resource usage in other places (event store, 
in-memory buffers from apps, ...), which sometimes leads to a series of 
OOM killer events being triggered, just because some analytics query 
went overboard.


Ideally, I'd find a configuration that only terminates one backend but 
leaves the others working.


I am wondering whether there is any way to receive a real ENOMEM inside 
a cgroup as soon as I try to allocate beyond its memory.max, instead of 
relying on the OOM killer.


I know the recommendation is to have vm.overcommit_memory set to 2, but 
then that affects all workloads on the host, including critical infra 
like the kubelet, CNI, CSI, monitoring, ...


I have already gone through and tested the obvious:

https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

And yes, I know that Linux cgroups v2 memory.max is not an actual hard 
limit:


https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files


Any help is greatly appreciated!


Cheers,

Ancoron