Re: a misbehavior of partition row movement (?)

2021-02-18 Thread Rahila Syed
Hi Amit,


>
> Here is an updated version of the patch with some cosmetic changes
> from the previous version.  I moved the code being added to
> AfterTriggerSaveEvent() and ExecUpdate() into separate subroutines to
> improve readability, hopefully.
>
I tested these patches. They work as expected in the case of cross-partition
updates, correctly updating the referencing table.  They also work fine for the
ON UPDATE SET NULL and SET DEFAULT options.
I also tested having a table reference only a partition and not the parent.
In this case, the delete trigger is correctly fired when the row is moved out
of the referenced partition.

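For reference, a minimal sketch of the kind of scenario I tested (table and
constraint names here are illustrative, not taken from the regression tests):

CREATE TABLE foo_parted (a int PRIMARY KEY) PARTITION BY RANGE (a);
CREATE TABLE foo_part1 PARTITION OF foo_parted FOR VALUES FROM (1) TO (10);
CREATE TABLE foo_part2 PARTITION OF foo_parted FOR VALUES FROM (10) TO (20);

CREATE TABLE referencing_tab
  (a int REFERENCES foo_parted (a) ON UPDATE SET NULL ON DELETE CASCADE);

INSERT INTO foo_parted VALUES (7);
INSERT INTO referencing_tab VALUES (7);

-- Cross-partition update: the row moves from foo_part1 to foo_part2.
-- With the patches, the referencing table sees this as an UPDATE of the
-- referenced key (so ON UPDATE SET NULL fires here), rather than as the
-- DELETE+INSERT that row movement is implemented with internally.
UPDATE foo_parted SET a = 11 WHERE a = 7;
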
The partition-key-update-1.spec test fails with the following error message
appearing in the diffs.

 step s1u3pc: UPDATE foo_range_parted SET a=11 WHERE a=7;
+ERROR:  cannot move row being updated to another partition

I think the documentation update is missing from the patches.

Thank you,
Rahila Syed


Re: a misbehavior of partition row movement (?)

2021-02-23 Thread Rahila Syed
Hi Amit,

Sorry for the late reply.

> I assume these are comments for the v3-0001 & v3-0002 patches...

Yes, those were comments for patches on master.


> > The partition-key-update-1.spec test fails with the following error
> message appearing in the diffs.
> >
> >  step s1u3pc: UPDATE foo_range_parted SET a=11 WHERE a=7;
> > +ERROR:  cannot move row being updated to another partition
>
> ...whereas, this error happens with the patch I posted in my last
> email (prevent-row-movement-on-pk-table.patch) that is not meant to be
> considered for HEAD, but for back-branches (if at all).  I also see
> that cfbot got triggered by it and shows the same failure.  I am not
> going to try to take care of these failures unless we want to do
> something in the back-branches.
>
> To be clear, patches for HEAD do pass make check-world.

OK.


> > I think the documentation update is missing from the patches.
>
> Hmm, I don't think we document the behavior that is improved by the v3
> patches as a limitation of any existing feature, neither of foreign
> keys referencing partitioned tables nor of the update row movement
> feature.  So maybe there's nothing in the existing documentation that
> is to be updated.
>
> However, the patch does add a new error message for a case that the
> patch doesn't handle, so maybe we could document that as a limitation.
> Not sure if in the Notes section of the UPDATE reference page which
> has some notes on row movement or somewhere else.  Do you have
> suggestions?
>
You are right, I could not find any direct explanation of the impact of row
movement during UPDATE on a referencing table in the PostgreSQL docs.

The two documents that come close are:

1. https://www.postgresql.org/docs/13/trigger-definition.html
The paragraph starting with "If an UPDATE on a partitioned table causes a row
to move to another partition".  However, this does not describe the behaviour
of the internal triggers, which is the focus of this patch.

2. The one you mentioned, https://www.postgresql.org/docs/11/sql-update.html
This explains the row-movement behaviour for partitioned tables, but does not
explain the impact of such behaviour on a referencing table.  I think it is
worth adding some explanation of the impact on referencing tables here, since
this page already describes the behaviour of UPDATE on a partitioned table.


Thank you.
Rahila Syed


Re: row filtering for logical replication

2021-03-09 Thread Rahila Syed
Hi Euler,

Please find some comments below:

1. If the WHERE clause contains non-replica-identity columns, a delete
performed on a replicated row using
   DELETE FROM pub_tab WHERE repl_ident_col = n;
is not replicated, as logical replication has no information about whether the
row should be filtered or not.
Shouldn't a warning be thrown in this case to notify the user that the delete
is not replicated?

2. Same for UPDATE: even if I update a row to match the quals on the
publisher, it is still not replicated to the subscriber (if the quals contain
non-replica-identity columns).  I think for UPDATE at least, the new value of
the non-replica-identity column is available, which can be used to filter and
replicate the update.

3. 0001.patch,
Why is the name of the existing ExclusionWhereClause node being changed, if
the exact same definition is being used?

For 0002.patch,
4.   +
     +   memset(lrel, 0, sizeof(LogicalRepRelation));

Is this needed?  Apart from the above, the patch does not use or update lrel
at all in that function.

5.  PublicationRelationQual and PublicationTable have similar fields, can
PublicationTable
be used in place of PublicationRelationQual instead of defining a new
struct?

Thank you,
Rahila Syed


Re: [PATCH] Add `truncate` option to subscription commands

2020-10-28 Thread Rahila Syed
Hi David,

The feature seems useful to me.  The code will need to be refactored due to
changes in commit b05fe7b442.

Please see the following comments.
1. Is there a specific reason behind having a new relstate for truncate?
The current state flow is
INIT->DATASYNC->SYNCWAIT->CATCHUP->SYNCDONE->READY.
Can we accommodate the truncate in either INIT or DATASYNC?

2.   +   StartTransactionCommand();
     +   rel = table_open(MyLogicalRepWorker->relid, RowExclusiveLock);
     +
     +   rels = lappend(rels, rel);
     +   relids = lappend_oid(relids, MyLogicalRepWorker->relid);
     +
     +   ExecuteTruncateGuts(rels, relids, NIL, DROP_RESTRICT, false);
     +   CommitTransactionCommand();

The truncate is performed in a separate transaction from the data copy; I
think that leaves a window open for concurrent transactions to modify the data
after the truncate and before the copy.

3. Regarding the truncate of the referenced table, I think one approach can be
to perform the following:
i. lock the referencing and referenced tables against writes
ii. drop the foreign key constraints
iii. truncate
iv. sync
v. recreate the constraints
vi. release the locks
However, I am not sure of the implications of locking these tables on the main
apply process.

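A rough sketch of that sequence, just to make the idea concrete (relation and
constraint names are hypothetical, and locking details and error handling are
glossed over):

BEGIN;
LOCK TABLE referenced_tab, referencing_tab IN ACCESS EXCLUSIVE MODE;
ALTER TABLE referencing_tab DROP CONSTRAINT referencing_tab_ref_id_fkey;
TRUNCATE referenced_tab;
-- ... initial data copy / sync of referenced_tab happens here ...
ALTER TABLE referencing_tab
  ADD CONSTRAINT referencing_tab_ref_id_fkey
  FOREIGN KEY (ref_id) REFERENCES referenced_tab (id);
COMMIT;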

Thank you,


On Mon, Oct 12, 2020 at 11:31 PM David Christensen 
wrote:

> > On Oct 11, 2020, at 10:00 PM, Amit Kapila 
> wrote:
> >
> > On Mon, Oct 12, 2020 at 3:44 AM David Christensen 
> wrote:
> >>
> >>
> >> On Oct 11, 2020, at 1:14 PM, Euler Taveira <
> euler.tave...@2ndquadrant.com> wrote:
> >>
> >> 
> >> On Fri, 9 Oct 2020 at 15:54, David Christensen 
> wrote:
> >>>
> >>>
> >>> Enclosed find a patch to add a “truncate” option to subscription
> commands.
> >>>
> >>> When adding new tables to a subscription (either via `CREATE
> SUBSCRIPTION` or `REFRESH PUBLICATION`), tables on the target which are
> being newly subscribed will be truncated before the data copy step.  This
> saves explicit coordination of a manual `TRUNCATE` on the target tables and
> allows the results of the initial data sync to be the same as on the
> publisher at the time of sync.
> >>>
> >>> To preserve compatibility with existing behavior, the default value
> for this parameter is `false`.
> >>>
> >>
> >> Truncate will fail for tables whose foreign keys refer to it. If such a
> feature cannot handle foreign keys, the usefulness will be restricted.
> >>
> >>
> >> This is true for existing “truncate” with FKs, so doesn’t seem to be
> any different to me.
> >>
> >
> > What would happen if there are multiple tables and truncate on only
> > one of them failed due to FK check? Does it give an error in such a
> > case, if so will the other tables be truncated?
>
> Currently each SyncRep relation is sync’d separately in its own worker
> process; we are doing the truncate at the initialization step of this, so
> it’s inherently in its own transaction. I think if we are going to do any
> sort of validation on this, it would have to be at the point of the CREATE
> SUBSCRIPTION/REFRESH PUBLICATION where we have the relation list and can do
> sanity-checking there.
>
> Obviously if someone changes the schema at some point between when it does
> this and when relation syncs start there is a race condition, but the same
> issue would affect other data sync things, so I don’t care to solve that as
> part of this patch.
>
> >> Hypothetically if you checked all new tables and could verify if there
> were FK cycles only already in the new tables being added then “truncate
> cascade” would be fine. Arguably if they had existing tables that were part
> of an FK that wasn’t fully replicated they were already operating brokenly.
> >>
> >
> > I think if both PK_table and FK_table are part of the same
> > subscription then there should be a problem as both them first get
> > truncated? If they are part of a different subscription (or if they
> > are not subscribed due to whatever reason) then probably we need to
> > deal such cases carefully.
>
> You mean “should not be a problem” here?  If so, I agree with that.
> Obviously if we determine this features is only useful with this support
> we’d have to chase the entire dependency graph and make sure that is all
> contained in the set of newly-subscribed tables (or at least FK referents).
>
> I have not considered tables that are part of more than one subscription
> (is that possible?); we presumably should error out if any table exists
> already in a separate subscription, as we’d want to avoid truncating tables
> already part of an existing subscription.
>
> While I’m happy to take a stab at fixing some of the FK/PK issues, it
> seems easy to go down a rabbit hole.  I’m not convinced that we couldn’t
> just detect FK issues and choose to not handle this case without decreasing
> the utility for at least so

Re: Added schema level support for publication.

2021-01-12 Thread Rahila Syed
Hi Vignesh,

I had a look at the patch, please consider following comments.

On Thu, Jan 7, 2021 at 10:03 PM vignesh C  wrote:

> Hi,
>
> This feature adds schema option while creating publication. Users will
> be able to specify one or more schemas while creating publication,
> when the user specifies schema option, then the data changes for the
> tables present in the schema specified by the user will be replicated
> to the subscriber. Few examples have been listed below:
>
> Create a publication that publishes all changes for all the tables
> present in production schema:
> CREATE PUBLICATION production_publication FOR ALL TABLES SCHEMA production;
>
Should it be FOR TABLES IN SCHEMA instead of FOR ALL TABLES SCHEMA?


> Create a publication that publishes all changes for all the tables
> present in marketing and sales schemas:
> CREATE PUBLICATION sales_publication FOR ALL TABLES SCHEMA marketing,
> sales;
>
> Add some schemas to the publication:
> ALTER PUBLICATION sales_publication ADD SCHEMA marketing_june, sales_june;
>
As per the current implementation, this command fails even if one of the
schemas does not exist.  I think this is counterintuitive; it should throw a
warning and continue adding the rest.


> Drop some schema from the publication:
> ALTER PUBLICATION production_quarterly_publication DROP SCHEMA
> production_july;
>
Similarly for DROP SCHEMA: if one of the schemas does not exist in the
publication, the entire DROP operation is aborted.

Thank you,
Rahila Syed


Re: Added schema level support for publication.

2021-01-20 Thread Rahila Syed
Hi Vignesh,


> I have handled the above scenario(drop schema should automatically
> remove the schema entry from publication schema relation) & addition
> of tests in the new v2 patch attached.
> Thoughts?
>

Please see some initial comments:

1. I think there should be more tests to show that the schema data is
actually replicated
to the subscriber.  Currently, I am not seeing the data being replicated
when I use FOR SCHEMA.

2. How does replication behave when a table is added or removed from a
subscribed schema
using ALTER TABLE SET SCHEMA?
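
For instance, the case I have in mind is something like this (names are
illustrative; the publication syntax is the one proposed by the patch and may
differ between patch versions):

CREATE SCHEMA sales;
CREATE PUBLICATION pub_sales FOR ALL TABLES SCHEMA sales;

-- Does orders start being replicated once it moves into the published schema,
-- and stop being replicated when it moves back out?
ALTER TABLE public.orders SET SCHEMA sales;
ALTER TABLE sales.orders SET SCHEMA public;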

3. Can we have a default schema, like public or the current schema, that gets
replicated in case the user doesn't specify one?  This could be handy for
replicating the current schema's tables.

4. +   The fourth, fifth and sixth variants change which schemas are part of the
   +   publication.  The SET TABLE clause will replace the list
   +   of schemas in the publication with the specified one.  The ADD

There is a typo above: s/SET TABLE/SET SCHEMA/

Thank you,
Rahila Syed


Re: Column Filtering in Logical Replication

2021-12-17 Thread Rahila Syed
Hi,

Thank you for updating the patch. The regression tests and tap tests pass
with v9 patch.


>
> After working on this a little bit more, I realized that this is a bad
> idea overall.  It causes lots of complications and it's just not worth
> it.  So I'm back at my original thought that we need to throw an ERROR
> at ALTER TABLE .. DROP COLUMN time if the column is part of a
> replication column filter, and suggest the user to remove the column
> from the filter first and reattempt the DROP COLUMN.
>
> This means that we need to support changing the column list of a table
> in a publication.  I'm looking at implementing some form of ALTER
> PUBLICATION for that.
>
>
I think right now the patch contains support only for ALTER PUBLICATION ...
ADD TABLE with column filters.  In order to support changing the column list
of a published table, I think we can extend the ALTER PUBLICATION ... SET
TABLE syntax to allow specifying a column list.

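Purely as a hypothetical sketch of what that extended syntax might look like
(this is not what the current patch implements):

-- Replace the column list published for an already-published table.
ALTER PUBLICATION pub1 SET TABLE tab1 (a, b, c);
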
> So this whole thing can be reduced to just this:
>
>     if (att_map != NULL && !bms_is_member(att->attnum, att_map))
>         continue;       /* that is, don't send this attribute */

I agree the condition can be shortened now.  The long if condition was
included because initially the feature allowed specifying filters without the
replica identity columns (those columns were sent internally, without the user
having to specify them).

>  900 +    * the table is partitioned.  Run a recursive query to iterate through all
>  901 +    * the parents of the partition and retrieve the record for the parent
>  902 +    * that exists in pg_publication_rel.
>  903 +    */


The above comment in fetch_remote_table_info() can be changed as the
recursive query
is no longer used.

Thank you,
Rahila Syed


Re: row filtering for logical replication

2021-03-18 Thread Rahila Syed
Hi Euler,

Please find below some review comments,

1.
   +
   + 
   +  prqual
   +  pg_node_tree
   +  
   +  Expression tree (in nodeToString()
   +  representation) for the relation's qualifying condition
   + 
I think the docs are being incorrectly updated to add a column to
pg_partitioned_table
instead of pg_publication_rel.

2.   +typedef struct PublicationRelationQual
     +{
     +    Oid         relid;
     +    Relation    relation;
     +    Node       *whereClause;
     +} PublicationRelationQual;

Can this be given a more generic name like PublicationRelationInfo, so that
the same struct
can be used to store additional relation information in future, for ex.
column names, if column filtering is introduced.

3. Also, in the above structure, it seems that we can do with storing just
relid and derive relation information from it
using table_open when needed. Am I missing something?

4. Currently in logical replication, I noticed that an UPDATE is applied on
the subscriber even if the column values are unchanged.  Can the row-filtering
feature be used to change this such that, when all the OLD column values equal
the NEW column values, the row is filtered out and not sent to the subscriber?
I understand this would need REPLICA IDENTITY FULL to work, but it would be an
improvement over the existing state.

On subscriber:

postgres=# select xmin, * from tab_rowfilter_1;
 xmin | a |  b
--+---+-
  555 | 1 | unfiltered
(1 row)

On publisher:
postgres=# ALTER TABLE tab_rowfilter_1 REPLICA IDENTITY FULL;
ALTER TABLE
postgres=# update tab_rowfilter_1 SET b = 'unfiltered' where a = 1;
UPDATE 1

On Subscriber:  The xmin has changed indicating the update from the
publisher was applied
even though nothing changed.

postgres=# select xmin, * from tab_rowfilter_1;
 xmin | a |  b
--+---+-
  556 | 1 | unfiltered
(1 row)

5. Currently, any existing rows that were not replicated won't be applied when
they are updated to match the publication quals using
   UPDATE tab SET pub_qual_column = 'not_filtered' WHERE a = 1;
as the row does not exist on the subscriber.  It would be good if ALTER
SUBSCRIPTION ... REFRESH PUBLICATION could fetch such existing rows from the
publisher that match the qual now (either because the row changed or because
the qual changed).

Thank you,
Rahila Syed


Re: row filtering for logical replication

2021-03-29 Thread Rahila Syed
Hi Euler,

While running some tests on the v13 patches, I noticed that, in case the
published table data already exists on the subscriber database before creating
the subscription, at the time of CREATE SUBSCRIPTION / table synchronization
an error is seen, as follows:

With the patch:

2021-03-29 14:32:56.265 IST [78467] STATEMENT:  CREATE_REPLICATION_SLOT
"pg_16406_sync_16390_6944995860755251708" LOGICAL pgoutput USE_SNAPSHOT
2021-03-29 14:32:56.279 IST [78467] LOG:  could not send data to client:
Broken pipe
2021-03-29 14:32:56.279 IST [78467] STATEMENT:  COPY (SELECT aid, bid,
abalance, filler FROM public.pgbench_accounts WHERE (aid > 0)) TO STDOUT
2021-03-29 14:32:56.279 IST [78467] FATAL:  connection to client lost
2021-03-29 14:32:56.279 IST [78467] STATEMENT:  COPY (SELECT aid, bid,
abalance, filler FROM public.pgbench_accounts WHERE (aid > 0)) TO STDOUT
2021-03-29 14:33:01.302 IST [78470] LOG:  logical decoding found consistent
point at 0/4E2B8460
2021-03-29 14:33:01.302 IST [78470] DETAIL:  There are no running
transactions.

Without the patch:

2021-03-29 15:05:01.581 IST [79029] ERROR:  duplicate key value violates
unique constraint "pgbench_branches_pkey"
2021-03-29 15:05:01.581 IST [79029] DETAIL:  Key (bid)=(1) already exists.
2021-03-29 15:05:01.581 IST [79029] CONTEXT:  COPY pgbench_branches, line 1
2021-03-29 15:05:01.583 IST [78538] LOG:  background worker "logical
replication worker" (PID 79029) exited with exit code 1
2021-03-29 15:05:06.593 IST [79031] LOG:  logical replication table
synchronization worker for subscription "test_sub2", table
"pgbench_branches" has started

Without the patch the COPY command throws an ERROR, but with the patch a
similar scenario results in the client connection being lost.

I didn't investigate it further, but it looks like we should maintain the
existing behaviour when table synchronization fails due to duplicate data.

Thank you,
Rahila Syed


Re: row filtering for logical replication

2021-03-29 Thread Rahila Syed
Hi,

>
> While running some tests on v13 patches, I noticed that, in case the
> published table data
> already exists on the subscriber database before creating the
> subscription, at the time of
> CREATE subscription/table synchronization, an error as seen as follows
>
> With the patch:
>
> 2021-03-29 14:32:56.265 IST [78467] STATEMENT:  CREATE_REPLICATION_SLOT
> "pg_16406_sync_16390_6944995860755251708" LOGICAL pgoutput USE_SNAPSHOT
> 2021-03-29 14:32:56.279 IST [78467] LOG:  could not send data to client:
> Broken pipe
> 2021-03-29 14:32:56.279 IST [78467] STATEMENT:  COPY (SELECT aid, bid,
> abalance, filler FROM public.pgbench_accounts WHERE (aid > 0)) TO STDOUT
> 2021-03-29 14:32:56.279 IST [78467] FATAL:  connection to client lost
> 2021-03-29 14:32:56.279 IST [78467] STATEMENT:  COPY (SELECT aid, bid,
> abalance, filler FROM public.pgbench_accounts WHERE (aid > 0)) TO STDOUT
> 2021-03-29 14:33:01.302 IST [78470] LOG:  logical decoding found
> consistent point at 0/4E2B8460
> 2021-03-29 14:33:01.302 IST [78470] DETAIL:  There are no running
> transactions.
>
> Rahila, I tried to reproduce this issue with the attached script but no
> luck. I always get
>
OK, sorry for the confusion.  Actually, the two errors are happening on
different servers: the *broken pipe* error on the publisher, and the following
error on the subscriber end.  And the behaviour is consistent with or without
row filtering.

> Without the patch:
>
> 2021-03-29 15:05:01.581 IST [79029] ERROR:  duplicate key value violates
> unique constraint "pgbench_branches_pkey"
> 2021-03-29 15:05:01.581 IST [79029] DETAIL:  Key (bid)=(1) already exists.
> 2021-03-29 15:05:01.581 IST [79029] CONTEXT:  COPY pgbench_branches, line 1
> 2021-03-29 15:05:01.583 IST [78538] LOG:  background worker "logical
> replication worker" (PID 79029) exited with exit code 1
> 2021-03-29 15:05:06.593 IST [79031] LOG:  logical replication table
> synchronization worker for subscription "test_sub2", table
> "pgbench_branches" has started
>
> ... this message. The code that reports this error is from the COPY
> command.
> Row filter modifications has no control over it. It seems somehow your
> subscriber close the replication connection causing this issue. Can you
> reproduce it consistently? If so, please share your steps.
>
Please ignore the report.

Thank you,
Rahila Syed


Re: logical replication empty transactions

2020-07-29 Thread Rahila Syed

Hi,

Please see below review of the 
0001-Skip-empty-transactions-for-logical-replication.patch


The make check passes.


 +               /* output BEGIN if we haven't yet */
 +               if (!data->xact_wrote_changes)
 +                       pgoutput_begin(ctx, txn);
 +
 +               data->xact_wrote_changes = true;
 +

IMO, the xact_wrote_changes flag is better set inside the if condition, as it
does not need to be set repeatedly in subsequent calls to the same function.



* Stash BEGIN data in plugin's LogicalDecodingContext.output_plugin_private
  when plugin's begin callback called, don't write anything to the outstream
* Write out BEGIN message lazily when any other callback generates a message
  that does need to be written out
* If no BEGIN written by the time COMMIT callback called, discard the COMMIT
  too. Check if sync rep enabled. If it is, call
  LogicalDecodingContext.update_progress from within the output plugin commit
  handler, otherwise just ignore the commit totally. Probably by calling
  OutputPluginUpdateProgress().




I think the code in the patch is similar to what has been described by Craig
in the above snippet, except that instead of stashing the BEGIN message and
sending it lazily, it simply maintains a flag in
LogicalDecodingContext.output_plugin_private which defers calling the output
plugin's begin callback until any other callback actually generates a remote
write.

Also, the patch does not contain the last part, where he describes having
OutputPluginUpdateProgress() for synchronous-replication-enabled transactions.
However, some basic testing suggests that the patch does not have any notable
adverse effect on either the replication lag or the sync rep performance.

I performed tests by setting up the publisher and subscriber on the same
machine with synchronous_commit = on and ran pgbench -c 12 -j 6 -T 300 on
unpublished pgbench tables.

I see that confirmed_flush_lsn is catching up just fine, without any notable
delay as compared to the test results without the patch.
Also, the TPS for synchronous replication of empty txns with and without the
patch remains similar.

Having said that, these are initial findings, and I understand that better
performance tests are required to measure the reduction in network bandwidth
consumption and the impact on synchronous replication and replication lag.

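For reference, the lag was observed with a query along these lines on the
publisher (this is illustrative, not the exact query used in the tests):

SELECT slot_name,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS flush_lag_bytes
  FROM pg_replication_slots
 WHERE slot_type = 'logical';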

Thank you,
Rahila Syed





Re: More tests with USING INDEX replident and dropped indexes

2020-08-19 Thread Rahila Syed

Hi,

I couldn't test the patch as it does not apply cleanly on master.

Please find below some review comments:

1. Would it be better to throw a warning at the time of dropping the REPLICA
IDENTITY index, saying that it would also drop the REPLICA IDENTITY of the
parent table?


2. CCI is used after calling SetRelationReplIdent from index_drop(), but not
after the following call:

relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
                               bool is_internal)

+   /* update relreplident if necessary */
+   SetRelationReplIdent(RelationGetRelid(rel), ri_type);

3. I think the former variable names `pg_class_tuple` and `pg_class_form`
provided better clarity than:

 +   HeapTuple     tuple;
 +   Form_pg_class classtuple;


> - relreplident is switched to REPLICA_IDENTITY_NOTHING, which is the
> behavior we have now after an index is dropped, even if there is a
> primary key.

4. I am not aware of the reason for the current behavior; however, it seems
better to switch to REPLICA_IDENTITY_DEFAULT.  I can't think of a reason why a
user would not be fine with using the primary key as the replica identity when
the replica identity index is dropped.

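For context, the scenario behind comments 1 and 4 is along these lines (names
are illustrative):

CREATE TABLE t (a int PRIMARY KEY, b int NOT NULL);
CREATE UNIQUE INDEX t_b_idx ON t (b);
ALTER TABLE t REPLICA IDENTITY USING INDEX t_b_idx;

DROP INDEX t_b_idx;

-- Per the patch being reviewed, relreplident ends up as 'n' (NOTHING) even
-- though a primary key exists; the suggestion above is to fall back to
-- 'd' (DEFAULT) instead.
SELECT relreplident FROM pg_class WHERE relname = 't';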


Thank you,



Re: More tests with USING INDEX replident and dropped indexes

2020-09-02 Thread Rahila Syed
>
>
>
> On the other hand, it looks appealing to make index_set_state_flags()
> transactional.  This would solve the consistency problem, and looking
> at the code scanning pg_index, I don't see a reason why we could not
> do that.  What do you think?
>

TBH, I am not sure.  I think it is a reasonable change; it is even indicated
in the comment above index_set_state_flags() that it can be made
transactional.  At the same time, we could probably just go ahead with the
current inconsistent update of the relisreplident and indisvalid flags; I
can't see what would break with that.

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-04 Thread Rahila Syed
Hi Alvaro,


> On 2019-Feb-13, Amit Langote wrote:
>
> > Doesn't the name amphasename sound a bit too generic, given that it can
> > only describe the phases of ambuild?  Maybe ambuildphase?
>
> Hmm, yeah, maybe it does.  I renamed it "ambuildphasename", since it's
> not about reporting the phase itself -- it's about translating the phase
> number to the string that's reported to the user.
>
> The attached patch does it that way.  Also, when an index build uses an
> AM that doesn't support progress reporting, it no longer reports a NULL
> phase name while building.  I also changed it to report the progress of
> phase 7 (heap scan validation) using block numbers rather than tuple
> counts.  I also tweaked the strings reported in the view.  They're
> clearer now IMO.
>
> One slight annoyance is that when parallel workers are used, the last
> block number reported in phase 3/subphase 2 (IndexBuildHeapScan stuff)
> is not necessarily accurate, since the tail of the table could well be
> scanned by a worker that's not the leader, and we only report in the
> leader when it gets a new block.
>
> When the AM does not support progress reporting, everything stays as
> zeros during the index build phase.  It's easy to notice how slow hash
> indexes are to build compared to btrees this way!  Maybe it'd be
> better fallback on reporting block numbers in IndexBuildHeapScan when
> this happens.  Thoughts?
>
> I added docs to the monitoring section -- that's the bulkiest part of
> the patch.
>

1. Thank you for incorporating review comments.
Can you please rebase the latest
0001-Report-progress-of-CREATE-INDEX-operations.patch on master? Currently
it does not apply on 754b90f657bd54b482524b73726dae4a9165031c


>  15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 |     0 | 0 | 0
>  15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 79057 | 0 | 0
>

2. In the above report, even though we are reporting progress in terms of
tuples_done for the final btree sort & load phase, we have not cleared the
blocks_done entry from the previous phases.  I think this can be confusing, as
the blocks_done does not correspond to the tuples_done in the current phase.

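For anyone following along, reports like the one above can be obtained by
sampling the view from another session while the CREATE INDEX runs, e.g.
(column names as in the patch under review):

SELECT phase, blocks_total, blocks_done, tuples_total, tuples_done
  FROM pg_stat_progress_create_index;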

-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-11 Thread Rahila Syed
Hi Alvaro,

On Tue, 5 Mar 2019 at 08:32, Alvaro Herrera 
wrote:

> Hi Rahila,
>
> Thanks for looking.
>
> On 2019-Mar-04, Rahila wrote:
>
> > 1. Thank you for incorporating review comments.
> > Can you please rebase the latest 0001-Report-progress-of-
> > CREATE-INDEX-operations.patch on master? Currently it does not apply on
> > 754b90f657bd54b482524b73726dae4a9165031c
>
> Hmm, rebased to current master.  Didn't conflict at all when rebasing,
> so it's strange that it didn't apply.


Thanks for updating the patch. Sorry, I think it wasn't that the patch
needed rebasing but
I failed to apply it correctly last time. I can apply it now.


> +extern char *btbuildphasename(int64 phasenum);


1. I think int64 is too large a datatype for phasenum.
Also int32 is used for phasenum in  pg_indexam_progress_phasename().
Can we have it as int8?

2.

+   if ((previous_blkno == InvalidBlockNumber) ||
+       (scan->rs_cblock != previous_blkno))
+   {
+       pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
+                                    scan->rs_cblock);
+       previous_blkno = scan->rs_cblock;
+   }

In validate_index_heapscan, we don't calculate blocks_done the way
IndexBuildHeapScan does, i.e. blocks_done += scan->rs_cblock - previous_blkno,
which IMO is more accurate.  Reporting scan->rs_cblock as blocks_done might
give slightly inaccurate results, as we are still processing that block.

3. There is no if (progress) check in the validate_index() /
validate_index_heapscan() code.  Won't that be a problem if it is called from
other index methods which don't support reporting progress at the moment?

4. Just to clarify my understanding, can you please see the comment below?

Quoting your following comment in the cluster command progress monitor thread,
while referring to progress reporting from IndexBuildHeapScan:

"One, err, small issue with that idea is that we need the param numbers
not to conflict for any "progress update providers" that are to be used
simultaneously by any command."

Does that mean that no other INDEX progress monitoring can use the
PROGRESS_SCAN_BLOCKS_TOTAL and PROGRESS_SCAN_BLOCKS_DONE parameter numbers to
report anything but the metrics they report now?

5.

> 15:56:44.682 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 | 0 | 0 | 0
> 15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 | 0 | 0 | 0
> 15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 | 0 | 0 | 0
> 15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 | 0 | 0 | 0


I wonder why the phase is 'building index (3 of 8): initializing (1/5)' while
the blocks_done count is increasing.  Shouldn't it have changed to reflect
PROGRESS_BTREE_PHASE_INDEXBUILD_HEAPSCAN, i.e. building index (3 of 8): table
scan (2/5)?  Although I think this has been rectified in the latest patch, as
I now get the 'table scan' phase in the output when I do CREATE INDEX on a
table with 100 records.

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-20 Thread Rahila Syed
Hi Haribabu,

The latest patch fails to apply in the header files part.  Kindly rebase.

The patch looks good to me.  However, I wonder what other scenarios increment
xact_commit, because even if I commit a single transaction with your patch
applied, the increment in xact_commit is > 1.  As you mentioned upthread, we
need to check what internal operation resulted in the increase.  But the
increase in xact_commit is certainly smaller with the patch.

postgres=# BEGIN;
BEGIN
postgres=# select xact_commit from pg_stat_database where datname =
'postgres';
 xact_commit
-
 158
(1 row)

postgres=# explain analyze select * from foo where i = 1000;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..136417.85 rows=1 width=37) (actual time=4.596..3792.710 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on foo  (cost=0.00..135417.75 rows=1 width=37) (actual time=2448.038..3706.009 rows=0 loops=3)
         Filter: (i = 1000)
         Rows Removed by Filter: 333
 Planning Time: 0.353 ms
 Execution Time: 3793.572 ms
(8 rows)

postgres=# commit;
COMMIT
postgres=# select xact_commit from pg_stat_database where datname =
'postgres';
 xact_commit
-
     161
(1 row)

-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-27 Thread Rahila Syed
On Mon, 25 Mar 2019 at 22:23, Alvaro Herrera 
wrote:

> Here's v6 of this patch.  I have rebased on top of today's CLUSTER
> monitoring, as well as on table AM commits.  The latter caused a bit of
> trouble, as now the number of blocks processed by a scan is not as easy
> to get as before; I added a new entry point heapscan_get_blocks_done on
> heapam.c to help with that.  (I suppose this will need some fixups later
> on.)
>
> I removed the "M of N" phase labels that Robert didn't like; those were
> suggested by Rahila and upvoted by Amit L.  I'm of two minds about
> those.  If you care about those and want them back, please speak up.
>
I see value in reporting those numbers because it gives the user insight into
where we are in the whole process, without having to refer to documentation or
code.  Besides, here also we are reporting facts, as we do for other metrics.

I agree that reporting is most effective when the phases are carried out in
succession, which is not the case every time, and that its relevance varies
for each command, as mentioned upthread by Alvaro and Robert.  But I feel that
as long as the documentation states that some phases overlap, some are
mutually exclusive and hence may be skipped, etc., reporting `phase number
versus total phases` does provide valuable information.
We are able to give the user a whole picture in addition to reporting progress
within phases.

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-31 Thread Rahila Syed
Hi Alvaro,

Please see few comments below:

1. make check currently fails, as the view definition in the expected
rules.out does not reflect the latest changes in the progress metrics
numbering.

2. +  
+   When creating an index on a partitioned, this column is set to the
+   total number of partitions on which the index is to be created.
+  
+ 
+ 
+  partitions_done
+  bigint
+  
+   When creating an index on a partitioned, this column is set to the

I think there is a typo here 's/partitioned/partitioned table/'

3.
+   if (hscan->rs_base.rs_parallel != NULL)
+   {
+       ParallelBlockTableScanDesc bpscan;
+
+       bpscan = (ParallelBlockTableScanDesc) hscan->rs_base.rs_parallel;
+       startblock = bpscan->phs_startblock;
+   }
+   else
+       startblock = hscan->rs_startblock;
+
+   /*
+    * Might have wrapped around the end of the relation, if startblock was
+    * not zero.
+    */
+   if (hscan->rs_cblock > startblock)
+       blocks_done = hscan->rs_cblock - startblock;
+   else
+       blocks_done = hscan->rs_nblocks - startblock +
+           hscan->rs_cblock;
+
+   return blocks_done;

I think the parallel-scan equivalent bpscan->phs_nblocks should be used along
with hscan->rs_nblocks, similar to the startblock computation above.

Thank you,
Rahila Syed

On Fri, 29 Mar 2019 at 23:46, Alvaro Herrera 
wrote:

> On 2019-Mar-29, Alvaro Herrera wrote:
>
> > So, CLUSTER and ALTER TABLE rewrites only do non-concurrent index
> > builds; and REINDEX can reuse pretty much the same wait-for metrics
> > columns as CIC.  So I think it's okay if I move only the metrics that
> > conflict for index_build.
>
> The attached version does it that way.  I had to enlarge the param set a
> bit more.  (I suspect those extra columns will be useful to reindex.)
> Also, rebased for recent conflicting changes.
>
>
> I think we should consider a new column of an array type, where we could
> put things like the list of PIDs to be waited for, the list of OIDs of
> index to rebuild, or the list of partitions to build the index on.
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Minimal logical decoding on standbys

2019-11-07 Thread Rahila Syed
Hi Amit,

I am reading about this feature and reviewing it.
To start with, I reviewed the patch:
0005-Doc-changes-describing-details-about-logical-decodin.patch.

> prevent VACUUM from removing required rows from the system catalogs,
> hot_standby_feedback should be set on the standby. In spite of that,
> if any required rows get removed on standby, the slot gets dropped.

IIUC, you mean `if any required rows get removed on *the master*, the slot
gets dropped`, right?

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Minimal logical decoding on standbys

2019-12-12 Thread Rahila Syed
Hi Amit,

Please see following comments:

1.  0002-Add-info-in-WAL-records-in-preparation-for-logical-s.patch

--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,6 +17,7 @@

 #include "access/hash.h"
 #include "access/hash_xlog.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"

The above header inclusion is not necessary as the code compiles fine
without it.
Also, this patch does not apply cleanly on latest master due to the above
line.

2.  Following test fails with error.
make -C src/test/recovery/ check PROVE_TESTS=t/
018_standby_logical_decoding_xmins.pl
#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 120.
#  got: ''
# expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 141.
#  got: ''
# expected: anything else

#   Failed test 'physical catalog_xmin not null'
#   at t/018_standby_logical_decoding_xmins.pl line 159.
#  got: ''
# expected: anything else
t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed
out executing this query:
#

Physical catalog_xmin is NULL on the master after logical slot creation on the
standby.

Due to this, the below command in the test fails with a syntax error, as it
constructs the SQL query using the catalog_xmin value obtained above:

# SELECT catalog_xmin::varchar::int >
# FROM pg_catalog.pg_replication_slots
# WHERE slot_name = 'master_physical';
#
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR:  syntax error at or near "FROM"
# LINE 3:   FROM pg_catalog.pg_replication_slots

Thank you,
-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Minimal logical decoding on standbys

2019-12-18 Thread Rahila Syed
Hi,

> Hi, do you consistently get this failure on your machine? I am not
> able to get this failure, but I am going to analyze when/how this can
> fail. Thanks
>
Yes, I am getting it each time I run make -C src/test/recovery/ check
PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
Also, there aren't any errors in the logs indicating the cause.

-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Minimal logical decoding on standbys

2020-01-10 Thread Rahila Syed
Hi Amit,

Can you please rebase the patches as they don't apply on latest master?

Thank you,
Rahila Syed


On Thu, 26 Dec 2019 at 16:36, Amit Khandekar  wrote:

> On Tue, 24 Dec 2019 at 14:02, Amit Khandekar 
> wrote:
> >
> > On Thu, 19 Dec 2019 at 01:02, Rahila Syed 
> wrote:
> > >
> > > Hi,
> > >
> > >> Hi, do you consistently get this failure on your machine ? I am not
> > >> able to get this failure, but I am going to analyze when/how this can
> > >> fail. Thanks
> > >>
> > > Yes, I am getting it each time I run make -C src/test/recovery/ check
> PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > > Also, there aren't any errors in logs indicating the cause.
> >
> > Thanks for the reproduction. Finally I could reproduce the behaviour.
> > It occurs once in 7-8 runs of the test on my machine. The issue is :
> > on master, the catalog_xmin does not immediately get updated. It
> > happens only after the hot standby feedback reaches on master. And I
> > haven't used wait_for_xmins() for these failing cases. I should use
> > that. Working on the same ...
>
> As mentioned above, I have used wait_for_xmins() so that we can wait
> for the xmins to be updated after hot standby feedback is processed.
> In one of the 3 scenarios where it failed for you, I removed the check
> at the second place because it was redundant. At the 3rd place, I did
> some appropriate changes with detailed comments. Please check.
> Basically we are checking that the master's phys catalog_xmin has
> advanced but not beyond standby's logical catalog_xmin. And for making
> sure the master's xmins are updated, I call txid_current() and then
> wait for the master's xmin to advance after hot-standby_feedback, and
> in this way I make sure the xmin/catalog_xmins are now up-to-date
> because of hot-standby-feedback, so that we can check whether the
> master's physical slot catalog_xmin has reached the value of standby's
> catalog_xmin but not gone past it.
>
> I have also moved the "wal_receiver_status_interval = 1" setting from
> master to standby. It was wrongly kept in master. This now reduces the
> test time by half, on my machine.
>
> Attached patch set v5 has only the test changes. Please check if now
> the test fails for you.
>
> >
> > --
> > Thanks,
> > -Amit Khandekar
> > EnterpriseDB Corporation
> > The Postgres Database Company
>
>
>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company
>


-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-04-01 Thread Rahila Syed
Hi,

On Mon, 1 Apr 2019 at 21:40, Alvaro Herrera 
wrote:

> Hi Rahila, thanks for reviewing.
>
> On 2019-Mar-25, Rahila Syed wrote:
>
> > Please see few comments below:
> >
> > 1. Makecheck fails currently as view definition of expected rules.out
> does
> > not reflect latest changes in progress metrics numbering.
>
> Ouch ... fixed.
>
> > 2. +  
> > I think there is a typo here 's/partitioned/partitioned table/'
>
> Ah, so there is.  Fixed.
>
> > 3.
> > I think parallel scan equivalent bpscan->phs_nblocks along with
> > hscan->rs_nblocks should be used similar to startblock computation above.
>
> Hmm, yeah.  I think the way things are setup currently it doesn't matter
> much, but it seems fragile to rely on that.
>
>
Thank you for incorporating the review comments.


> I also moved the reporting of total blocks to scan in the initial table
> scan so that it occurs wholly in heapam_index_build_range_scan; I had
> originally put that code in _bt_spools_heapscan, but that was a
> modularity violation I think.  (It didn't make a practical difference,
> but it made no sense to have the two cases report the number in wildly
> different locations.)  Also added a final nblocks metric update after
> the scan is done.
>
> I think this patch is done.
>

I tested the v8 patch by running plain CREATE INDEX, CIC, and CREATE INDEX on
partitioned tables, and the results are as expected.  Please see a few
observations below.

1. FWIW, the results below for CIC show that either blocks_done does not
become equal to blocks_total at the end of the phase, or the last 800-odd
blocks are processed so fast that the update is not visible at a
less-than-one-second sampling interval.

*Mon Mar 25 11:06:31 IST 2019*
  pid  | datid | datname  | relid |                  phase                   | lockers_total | lockers_done | current_locker_pid | blocks_total | blocks_done | tuples_total | tuples_done | partitions_total | partitions_done
-------+-------+----------+-------+------------------------------------------+---------------+--------------+--------------------+--------------+-------------+--------------+-------------+------------------+-----------------
 10630 | 13533 | postgres | 16384 | building index: table scan               |             0 |            0 |                  0 |      1293366 |     1292535 |            0 |           0 |                0 |               0
(1 row)

*Mon Mar 25 11:06:31 IST 2019*
  pid  | datid | datname  | relid |                  phase                   | lockers_total | lockers_done | current_locker_pid | blocks_total | blocks_done | tuples_total | tuples_done | partitions_total | partitions_done
-------+-------+----------+-------+------------------------------------------+---------------+--------------+--------------------+--------------+-------------+--------------+-------------+------------------+-----------------
 10630 | 13533 | postgres | 16384 | building index: sorting tuples, spool 1  |             0 |            0 |                  0 |            0 |           0 |            2 |           0 |                0 |               0
(1 row)

2. However, in the case of partitioned tables, the following difference
between blocks_done and blocks_total at the end of the phase is notably high
for the first partition.  Subsequent partitions show a negligible difference.
Partition 1:
Mon Mar 25 14:27:57 IST 2019
  pid  | datid | datname  | relid |                  phase                   | lockers_total | lockers_done | current_locker_pid | blocks_total | blocks_done | tuples_total | tuples_done | partitions_total | partitions_done
-------+-------+----------+-------+------------------------------------------+---------------+--------------+--------------------+--------------+-------------+--------------+-------------+------------------+-----------------
 10630 | 13533 | postgres | 16394 | building index: table scan               |             0 |            0 |                  0 |       381342 |      221233 |            0 |           0 |                3 |               0
(1 row)

Mon Mar 25 14:27:57 IST 2019
  pid  | datid | datname  | relid |                  phase                   | lockers_total | lockers_done | current_locker_pid | blocks_total | blocks_done | tuples_total | tuples_done | partitions_total | partitions_done
-------+-------+----------+-------+------------------------------------------+---------------+--------------+--------------------+--------------+-------------+--------------+-------------+------------------+-----------------
 10630 | 13533 | postgres | 16394 | building index: sorting tuples, spool 1  |             0 |            0 |                  0 |            0 |           0 |         4999 |           0 |                3 |               0

The partitions are equal in size, and the other two partitions have
blocks_done and blocks_total both approximately 221233.  The blocks_total for
partition 1 is reported higher.

3. Sorry for nitpicking, I think the following phase name can be made more
consistent with the
Re: Allow single table VACUUM in transaction block

2022-11-03 Thread Rahila Syed
Hi Simon,

On Thu, Nov 3, 2022 at 3:53 PM Simon Riggs 
wrote:

> On Tue, 1 Nov 2022 at 23:56, Simon Riggs 
> wrote:
>
> > > I haven't checked the rest of the patch, but +1 for allowing VACUUM
> FULL
> > > within a user txn.
> >
> > My intention was to prevent that. I am certainly quite uneasy about
> > changing anything related to CLUSTER/VF, since they are old, complex
> > and bug prone.
> >
> > So for now, I will block VF, as was my original intent.
> >
> > I will be guided by what others think... so you may yet get your wish.
> >
> >
> > > Maybe the error message needs to be qualified "...when multiple
> > > relations are specified".
> > >
> > > ERROR:  VACUUM cannot run inside a transaction block
> >
> > Hmm, that is standard wording based on the statement type, but I can
> > set a CONTEXT message also. Will update accordingly.
> >
> > Thanks for your input.
>
> New version attached, as described.
>
> Other review comments and alternate opinions welcome.
>
>
I applied and did some basic testing of the patch; it works as described.

I would like to bring up a few points that I came across while looking into
the vacuum code.

1. As a result of this change to allow VACUUM inside a user transaction, I
think there is some chance of blocking/delaying concurrent VACUUMs if a VACUUM
is run under a long-running transaction.
2. Also, if a user runs VACUUM in a transaction, performance optimizations
like PROC_IN_VACUUM won't work.
3. Also, if the VACUUM happens towards the end of a long-running transaction,
the snapshot will be old, and the xmin horizon for the vacuum would be
somewhat old as compared to the current lazy vacuum, which acquires a new
snapshot just before scanning the table.

So, while I understand the need for the feature, I am wondering if there
should be some mention of the above caveats in the documentation, with the
recommendation that VACUUM should, in general, be run outside a transaction.

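To summarize my understanding of the behaviour with the patch applied (table
names are illustrative):

BEGIN;
VACUUM my_table;        -- now allowed: single table, not FULL
COMMIT;

BEGIN;
VACUUM FULL my_table;   -- still rejected inside a transaction block
VACUUM t1, t2;          -- still rejected: multiple relations specified
ROLLBACK;
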
Thank you,
Rahila Syed


Re: Allow single table VACUUM in transaction block

2022-11-04 Thread Rahila Syed
Hi Simon,

On Fri, Nov 4, 2022 at 10:15 AM Rahila Syed  wrote:

> Hi Simon,
>
> On Thu, Nov 3, 2022 at 3:53 PM Simon Riggs 
> wrote:
>
>> On Tue, 1 Nov 2022 at 23:56, Simon Riggs 
>> wrote:
>>
>> > > I haven't checked the rest of the patch, but +1 for allowing VACUUM
>> FULL
>> > > within a user txn.
>> >
>> > My intention was to prevent that. I am certainly quite uneasy about
>> > changing anything related to CLUSTER/VF, since they are old, complex
>> > and bug prone.
>> >
>> > So for now, I will block VF, as was my original intent.
>> >
>> > I will be guided by what others think... so you may yet get your wish.
>> >
>> >
>> > > Maybe the error message needs to be qualified "...when multiple
>> > > relations are specified".
>> > >
>> > > ERROR:  VACUUM cannot run inside a transaction block
>> >
>> > Hmm, that is standard wording based on the statement type, but I can
>> > set a CONTEXT message also. Will update accordingly.
>> >
>> > Thanks for your input.
>>
>> New version attached, as described.
>>
>> Other review comments and alternate opinions welcome.
>>
>>
> I applied and did some basic testing on the patch, it works as described.
>
> I would like to bring up a few points that I came across while looking
> into the vacuum code.
>
> 1.  As a result of this change to allow VACUUM inside a user transaction,
> I think there is some chance of causing
> a block/delay of concurrent VACUUMs if a VACUUM is being run under a long
> running transaction.
> 2. Also, if a user runs VACUUM in a transaction, performance optimizations
> like PROC_IN_VACUUM won't work.
> 3. Also, if VACUUM happens towards the end of a long running transaction,
> the snapshot will be old
> and xmin horizon for vacuum would be somewhat old as compared to current
> lazy vacuum which
> acquires a new snapshot just before scanning the table.
>
> So, while I understand the need of the feature, I am wondering if there
> should be some mention
> of above caveats in documentation with the recommendation that VACUUM
> should be run outside
> a transaction, in general.
>
>
Sorry, I just noticed that you have already mentioned some of these in the
documentation as follows, so it seems
it is already taken care of.

+VACUUM cannot be executed inside a transaction
block,
+unless a single table is specified and FULL is not
+specified.  When executing inside a transaction block the vacuum scan
can
+hold back the xmin horizon and does not update the database
datfrozenxid,
+as a result this usage is not useful for database maintenance, but is
provided
+to allow vacuuming in special circumstances, such as temporary or
private
+work tables.

Thank you,
Rahila Syed


Re: Allow single table VACUUM in transaction block

2022-11-04 Thread Rahila Syed
Hi,

On Fri, Nov 4, 2022 at 2:39 PM Simon Riggs 
wrote:

> Hi Rahila,
>
> Thanks for your review.
>
> On Fri, 4 Nov 2022 at 07:37, Rahila Syed  wrote:
>
> >> I would like to bring up a few points that I came across while looking
> into the vacuum code.
> >>
> >> 1.  As a result of this change to allow VACUUM inside a user
> transaction, I think there is some chance of causing
> >> a block/delay of concurrent VACUUMs if a VACUUM is being run under a
> long running transaction.
> >> 2. Also, if a user runs VACUUM in a transaction, performance
> optimizations like PROC_IN_VACUUM won't work.
> >> 3. Also, if VACUUM happens towards the end of a long running
> transaction, the snapshot will be old
> >> and xmin horizon for vacuum would be somewhat old as compared to
> current lazy vacuum which
> >> acquires a new snapshot just before scanning the table.
> >>
> >> So, while I understand the need of the feature, I am wondering if there
> should be some mention
> >> of above caveats in documentation with the recommendation that VACUUM
> should be run outside
> >> a transaction, in general.
> >>
> >
> > Sorry, I just noticed that you have already mentioned some of these in
> the documentation as follows, so it seems
> > it is already taken care of.
> >
> > +VACUUM cannot be executed inside a transaction
> block,
> > +unless a single table is specified and FULL is
> not
> > +specified.  When executing inside a transaction block the vacuum
> scan can
> > +hold back the xmin horizon and does not update the database
> datfrozenxid,
> > +as a result this usage is not useful for database maintenance, but
> is provided
> > +to allow vacuuming in special circumstances, such as temporary or
> private
> > +work tables.
>
> Yes, I wondered whether we should have a NOTICE or WARNING to remind
> people of those points?
>

+1.  My vote is for NOTICE over WARNING, because I think it is useful
information for the user rather than an indication of a potential problem.

Thank you,
Rahila Syed


Column Filtering in Logical Replication

2021-06-30 Thread Rahila Syed
Hi,

Filtering of columns at the publisher node will allow for selective
replication of data between publisher and subscriber.  In case the updates on
the publisher are targeted only towards specific columns, the user will have
an option to reduce network consumption by not sending the new values of
columns that do not change.  Note that replica identity values will always be
sent, irrespective of the column filtering settings.
The column values that are not sent by the publisher will be populated using
local values on the subscriber.  For the INSERT command, non-replicated column
values will be NULL or the default.
If column names are not specified while creating or altering a publication,
all the columns are replicated, as per the current behaviour.

The proposal for the syntax to add a table with column names to a publication
is as follows:

Create publication:

CREATE PUBLICATION pub_name [ FOR TABLE [ONLY] table_name [(colname [, ...])] | FOR ALL TABLES ]

Alter publication:

ALTER PUBLICATION pub_name ADD TABLE [ONLY] table_name [(colname [, ...])]

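A concrete example of the proposed syntax (table and column names are only for
illustration):

CREATE TABLE tab1 (id int PRIMARY KEY, payload text, updated_at timestamptz);

-- Publish only id and payload; the replica identity column (id) is always sent.
CREATE PUBLICATION pub1 FOR TABLE tab1 (id, payload);

-- Add another table with its own column list later on.
ALTER PUBLICATION pub1 ADD TABLE tab2 (id, status);
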
Please find attached a patch that implements the above proposal.
While the patch contains basic implementation and tests, several
improvements
and sanity checks are underway. I will post an updated patch with those
changes soon.

Kindly let me know your opinion.


Thank you,

Rahila Syed


0001-Add-column-filtering-to-logical-replication.patch
Description: Binary data


Re: Column Filtering in Logical Replication

2021-07-12 Thread Rahila Syed
Hi Peter,

> Hi, I was wondering if/when a subset of cols is specified then does
> that mean it will be possible for the table to be replicated to a
> *smaller* table at the subscriber side?
>
> e.g. Can a table with 7 cols be replicated to a table with 2 cols?
>
> table tab1(a,b,c,d,e,f,g) --> CREATE PUBLICATION pub1 FOR TABLE
> tab1(a,b)  --> table tab1(a,b)
>
> ~~


> I thought maybe that should be possible, but the expected behaviour
> for that scenario was not very clear to me from the thread/patch
> comments. And the new TAP test uses the tab1 table created exactly the
> same for pub/sub, so I couldn't tell from the test code either.
>

Currently, this capability is not included in the patch. If the table on
the subscriber server has fewer attributes than the one on the publisher
server, an error is thrown at the time of CREATE SUBSCRIPTION.

As for having such functionality, I don't immediately see any issue with
it, as long as we make sure replica identity columns are always present on
both instances.
However, we need to carefully consider situations in which a server
subscribes to multiple publications, each publishing a different subset of
columns of a table.


Thank you,
Rahila Syed


Re: Column Filtering in Logical Replication

2021-07-12 Thread Rahila Syed
Hi Alvaro,

Thank you for comments.

The patch adds a function get_att_num_by_name; but we have a lsyscache.c
> function for that purpose, get_attnum.  Maybe that one should be used
> instead?
>
> Thank you for pointing that out, I agree it makes sense to reuse the
existing function.
Changed it accordingly in the attached patch.


> get_tuple_columns_map() returns a bitmapset of the attnos of the columns
> in the given list, so its name feels wrong.  I propose
> get_table_columnset().  However, this function is invoked for every
> insert/update change, so it's going to be far too slow to be usable.  I
> think you need to cache the bitmapset somewhere, so that the function is
> only called on first use.  I didn't look very closely, but it seems that
> struct RelationSyncEntry may be a good place to cache it.
>
> Makes sense, changed accordingly.


> The patch adds a new parse node PublicationTable, but doesn't add
> copyfuncs.c, equalfuncs.c, readfuncs.c, outfuncs.c support for it.
> Maybe try a compile with WRITE_READ_PARSE_PLAN_TREES and/or
> COPY_PARSE_PLAN_TREES enabled to make sure everything is covered.
> (I didn't verify that this actually catches anything ...)
>

I will test this and include these changes in the next version.


> The new column in pg_publication_rel is prrel_attr.  This name seems at
> odds with existing column names (we don't use underscores in catalog
> column names).  Maybe prattrs is good enough?  prrelattrs?  We tend to
> use plurals for columns that are arrays.
>
> Renamed it to prattrs as per suggestion.

It's not super clear to me that strlist_to_textarray() and related
> processing will behave sanely when the column names contain weird
> characters such as commas or quotes, or just when used with uppercase
> column names.  Maybe it's worth having tests that try to break such
> cases.
>
> Sure, I will include these tests in the next version.


> You seem to have left a debugging "elog(LOG)" line in OpenTableList.
>
> Removed.


> I got warnings from "git am" about trailing whitespace being added by
> the patch in two places.
>
> Should be fixed now.

Thank you,
Rahila Syed


v1-0001-Add-column-filtering-to-logical-replication.patch
Description: Binary data


Re: Column Filtering in Logical Replication

2021-09-01 Thread Rahila Syed
Hi,


>> Another point is what if someone drops the column used in one of the
>> publications? Do we want to drop the entire relation from publication
>> or just remove the column filter or something else?
>>
>
After thinking about this, I think it is best to remove the entire table
from the publication if a column specified in the column filter is dropped
from the table. If we drop the entire filter without dropping the table,
all the columns will be replicated, and the downstream table might not have
those columns. If we drop only the column from the filter, we might have to
recreate the filter and re-check the replica identity; that means that if a
replica identity column is dropped, it can't simply be removed from the
filter, and we might have to drop the entire publication-table mapping anyway.

Thus, I think it is cleanest to drop the entire relation from the publication.

This has been implemented in the attached version.


> Do we want to consider that the columns specified in the filter must
>> not have NOT NULL constraint? Because, otherwise, the subscriber will
>> error out inserting such rows?
>>
>> I think you mean columns *not* specified in the filter must not have NOT
> NULL constraint
> on the subscriber, as this will break during insert, as it will try to
> insert NULL for columns
> not sent by the publisher.
> I will look into fixing this. Probably this won't be a problem in
> case the column is auto generated or contains a default value.
>
>
I am not sure if this needs to be handled. Ideally, we would prevent the
subscriber tables from having a NOT NULL constraint on non-filter columns
if the publisher uses column filters for the table. There is no way to do
this at the time of creating a table on the subscriber. Since it involves
querying the publisher for this information, it could be done at the time
of initial table synchronization, i.e., error out if any of the subscribed
tables has a NOT NULL constraint on a non-filter column. That would force
the user to drop and recreate the subscription after removing the NOT NULL
constraint from the table. I think much the same can be achieved by doing
nothing and letting the subscriber error out while inserting rows.
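
To illustrate the failure mode, a minimal sketch (all object names here are hypothetical):

```
-- Publisher: only column a of t is published.
CREATE TABLE t (a int PRIMARY KEY, b int);
CREATE PUBLICATION pub_t FOR TABLE t (a);

-- Subscriber: b is not published but is declared NOT NULL.
CREATE TABLE t (a int PRIMARY KEY, b int NOT NULL);

-- An INSERT replicated from the publisher supplies only a, so the apply
-- worker fails with a not-null violation on b (unless b has a default).
```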

Minor comments:
>> 
>>   pq_sendbyte(out, flags);
>> -
>>   /* attribute name */
>>   pq_sendstring(out, NameStr(att->attname));
>>
>> @@ -953,6 +1000,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel)
>>
>>   /* attribute mode */
>>   pq_sendint32(out, att->atttypmod);
>> +
>>   }
>>
>>   bms_free(idattrs);
>> diff --git a/src/backend/replication/logical/relation.c
>> b/src/backend/replication/logical/relation.c
>> index c37e2a7e29..d7a7b00841 100644
>> --- a/src/backend/replication/logical/relation.c
>> +++ b/src/backend/replication/logical/relation.c
>> @@ -354,7 +354,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid,
>> LOCKMODE lockmode)
>>
>>   attnum = logicalrep_rel_att_by_name(remoterel,
>>   NameStr(attr->attname));
>> -
>>   entry->attrmap->attnums[i] = attnum;
>>
>> There are quite a few places in the patch that contains spurious line
>> additions or removals.
>>
>>
> Fixed these in the attached patch.

Having said that, I'm not sure I agree with this design decision; what I
> think this is doing is hiding from the user the fact that they are
> publishing columns that they don't want to publish.  I think as a user I
> would rather get an error in that case:


  ERROR:  invalid column list in published set
>   DETAIL:  The set of published commands does not include all the replica
> identity columns.


Added this.

Also added some more tests. Please find attached a rebased and updated
patch.

Thank you,
Rahila Syed


v4-0001-Add-column-filtering-to-logical-replication.patch
Description: Binary data


Re: Column Filtering in Logical Replication

2021-09-06 Thread Rahila Syed
Hi,

On Mon, Sep 6, 2021 at 8:53 AM Amit Kapila  wrote:

> On Sat, Sep 4, 2021 at 8:11 PM Alvaro Herrera 
> wrote:
> >
> > On 2021-Sep-04, Amit Kapila wrote:
> >
> > > On Thu, Sep 2, 2021 at 2:19 PM Alvaro Herrera 
> wrote:
> > > >
> > > > On 2021-Sep-02, Rahila Syed wrote:
> > > >
> > > > > After thinking about this, I think it is best to remove the entire
> table
> > > > > from publication,
> > > > > if a column specified in the column filter is dropped from the
> table.
> > > >
> > > > Hmm, I think it would be cleanest to give responsibility to the
> user: if
> > > > the column to be dropped is in the filter, then raise an error,
> aborting
> > > > the drop.
> > >
> > > Do you think that will make sense if the user used Cascade (Alter
> > > Table ... Drop Column ... Cascade)?
> >
> > ... ugh.  Since CASCADE is already defined to be a potentially-data-loss
> > operation, then that may be acceptable behavior.  For sure the default
> > RESTRICT behavior shouldn't do it, though.
> >
>
> That makes sense to me.
>
However, the default (RESTRICT) behaviour of DROP TABLE does allow removing
the table from the publication, and I have implemented the removal of the
table from the publication on DROP COLUMN (RESTRICT) along the same lines.

Although it would make sense not to allow dropping tables from a publication
in the RESTRICT case, it makes me wonder why DROP TABLE (RESTRICT) is allowed
to cascade the drop to the publication.

Did you give any thoughts to my earlier suggestion related to syntax [1]?


> [1] -
> https://www.postgresql.org/message-id/CAA4eK1J9b_0_PMnJ2jq9E55bcbmTKdUmy6jPnkf1Zwy2jxah_g%40mail.gmail.com


For future support for replicating all columns except (x, y, z), I think an
optional keyword phrase like COLUMNS NOT IN could be inserted between the
table name and the column list, as follows:

ALTER PUBLICATION ADD TABLE tab_name [COLUMNS NOT IN] (x, y, z)

I think this should be possible as a future addition to the syntax proposed
in the patch.
Please let me know your opinion.

Thank you,
Rahila Syed


Re: Column Filtering in Logical Replication

2021-09-27 Thread Rahila Syed
Hi,


>
> I don't think there is one.  I think the latest is what I posted in
> https://postgr.es/m/202109061751.3qz5xpugwx6w@alvherre.pgsql (At least I
> don't see any reply from Rahila with attachments after that), but that
> wasn't addressing a bunch of review comments that had been made; and I
> suspect that Amit K has already committed a few conflicting patches
> after that.
>
> Yes, the v5 version of the patch attached by Alvaro is the latest one.
IIUC, the review comments that are yet to be addressed apart from the
ongoing grammar
discussion, are as follows:

1. Behaviour on dropping a column that is part of a column filter.
In the latest patch, the entire table is dropped from the publication when
a column that is part of the column filter is dropped. However, there is a
preference for another approach: drop just the column from the filter on
DROP COLUMN ... CASCADE (and continue to filter the other columns), and
raise an error for DROP ... RESTRICT.

2. Instead of WITH RECURSIVE query to find the topmost parent of the
partition
in fetch_remote_table_info, use pg_partition_tree and pg_partition_root.

3. Report of memory leakage in get_rel_sync_entry().

4. Missing documentation

5. Latest comments(last two messages) by Peter Smith.

Thank you,
Rahila Syed


Re: Column Filtering in Logical Replication

2021-07-13 Thread Rahila Syed
Hi Tomas,

Thank you for your comments.


>
> >
> > Currently, this capability is not included in the patch. If the table on
> > the subscriber
> > server has lesser attributes than that on the publisher server, it
> > throws an error at the
> > time of CREATE SUBSCRIPTION.
> >
>
> That's a bit surprising, to be honest. I do understand the patch simply
> treats the filtered columns as "unchanged" because that's the simplest
> way to filter the *data* of the columns. But if someone told me we can
> "filter columns" I'd expect this to work without the columns on the
> subscriber.
>
> OK, I will look into adding this.


>
> > However, need to carefully consider situations in which a server
> > subscribes to multiple
> > publications,  each publishing a different subset of columns of a table.

Isn't that pretty much the same situation as for multiple subscriptions
> each with a different set of I/U/D operations? IIRC we simply merge
> those, so why not to do the same thing here and merge the attributes?
>
>
Yeah, I agree with the solution of merging the attributes, similar to how
operations are merged. My concern was also from an implementation point of
view: whether it would be a very drastic change. I have now had a look at
how remote relation attributes are acquired for comparison with local
attributes on the subscriber.
It seems that the publisher will need to send the information about the
filtered columns for each publication specified during CREATE SUBSCRIPTION.
This will be read on the subscriber side, which in turn updates its cache
accordingly.
Currently, the subscriber expects all attributes of a published relation to
be present.
I will add code for this in the next version of the patch.

 To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's

not really a list ;-)


I will make this change with the next version



>  FWIW "make check" fails for me with this version, due to segfault in
> OpenTableLists. Apparenly there's some confusion - the code expects the
> list to contain PublicationTable nodes, and tries to extract the
> RangeVar from the elements. But the list actually contains RangeVar, so
> this crashes and burns. See the attached backtrace.
>
>
Thank you for the report. This is fixed in the attached version; now all
publication function calls accept the PublicationTableInfo list.

Thank you,
Rahila Syed


v2-0001-Add-column-filtering-to-logical-replication.patch
Description: Binary data


Re: Added schema level support for publication.

2021-07-21 Thread Rahila Syed
On Mon, Jul 19, 2021 at 2:41 PM Greg Nancarrow  wrote:

> On Fri, Jul 16, 2021 at 8:13 PM vignesh C  wrote:
> >
> > Modified.
> >
> > Thanks for the comments, these issues are fixed as part of the v12 patch
> posted at [1].
> > [1]  -
> https://www.postgresql.org/message-id/CALDaNm3V9ny5dJM8nofLGJ3zDuDG0gS2dX%2BAhDph--U5y%2B4VbQ%40mail.gmail.com
> >
>
> There seems to be a problem with ALTER PUBLICATION ... SET TABLE ...
> After that command, it still regards it as an empty (e) publication,
> so I can then ALTER PUBLICATION ... ADD SCHEMA ...
>
>
One issue here is that the code to update the publication type is missing
in AlterPublicationTables for the SET TABLE command.

More broadly, I am not clear about the behaviour of the patch when a
publication is created to publish only certain tables and is later altered
to publish a whole schema. I think such behaviour is legitimate. However,
AFAIU, as per the current code we can't update the publication type
from PUBTYPE_TABLE to PUBTYPE_SCHEMA.
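
A minimal sequence illustrating the scenario (object names are illustrative;
syntax as proposed in the patch):

```
CREATE SCHEMA sch1;
CREATE TABLE t1 (a int);

CREATE PUBLICATION pub1 FOR TABLE t1;      -- publication type starts as PUBTYPE_TABLE
ALTER PUBLICATION pub1 ADD SCHEMA sch1;    -- later altered to also publish a whole schema
```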

I have some review comments as follows:
1.
In ConvertSchemaSpecListToOidList(List *schemas) function:
 +   search_path = fetch_search_path(false);
 +   nspname = get_namespace_name(linitial_oid(search_path));
 +   if (nspname == NULL)        /* recently-deleted namespace? */
 +       ereport(ERROR,
 +               errcode(ERRCODE_UNDEFINED_SCHEMA),
 +               errmsg("no schema has been selected"));
 +
 +   schemoid = get_namespace_oid(nspname, false);
 +   break;

The call to get_namespace_oid() is perhaps not needed, as fetch_search_path
already fetches OIDs; simply doing schemoid = linitial_oid(search_path);
should be enough.

2. In the same function, should there be an if-else block instead of a
switch, since there are only two cases?


Thank you,
Rahila Syed


Re: row filtering for logical replication

2021-07-23 Thread Rahila Syed
On Fri, Jul 23, 2021 at 8:36 AM Amit Kapila  wrote:

> On Fri, Jul 23, 2021 at 8:29 AM Amit Kapila 
> wrote:
> >
> > On Thu, Jul 22, 2021 at 8:06 PM Dilip Kumar 
> wrote:
> > >
> > > On Thu, Jul 22, 2021 at 5:15 PM Amit Kapila 
> wrote:
> > > >
> > > > On Tue, Jul 20, 2021 at 4:33 PM Dilip Kumar 
> wrote:
> > > > >
> > > > > On Tue, Jul 20, 2021 at 3:43 PM Tomas Vondra
> > > > >  wrote:
> > > > > >
> > > > > > Do we log the TOAST-ed values that were not updated?
> > > > >
> > > > > No, we don't, I have submitted a patch sometime back to fix that
> [1]
> > > > >
> > > >
> > > > That patch seems to log WAL for key unchanged columns. What about if
> > > > unchanged non-key columns? Do they get logged as part of the new
> tuple
> > > > or is there some other way we can get those? If not, then we need to
> > > > probably think of restricting filter clause in some way.
> > >
> > > But what sort of restrictions? I mean we can not put based on data
> > > type right that will be too restrictive,
> > >
> >
> > Yeah, data type restriction sounds too restrictive and unless the data
> > is toasted, the data will be anyway available. I think such kind of
> > restriction should be the last resort but let's try to see if we can
> > do something better.
> >
> > > other option is only to allow
> > > replica identity keys columns in the filter condition?
> > >
> >
> > Yes, that is what I had in mind because if key column(s) is changed
> > then we will have data for both old and new tuples. But if it is not
> > changed then we will have it probably for the old tuple unless we
> > decide to fix the bug you mentioned in a different way in which case
> > we might either need to log it for the purpose of this feature (but
> > that will be any way for HEAD) or need to come up with some other
> > solution here. I think we can't even fetch such columns data during
> > decoding because we have catalog-only historic snapshots here. Do you
> > have any better ideas?
> >
>
> BTW, I wonder how pglogical can handle this because if these unchanged
> toasted values are not logged in WAL for the new tuple then how the
> comparison for such columns will work? Either they are forcing WAL in
> some way or don't allow WHERE clause on such columns or maybe they
> have dealt with it in some other way unless they are unaware of this
> problem.
>
>
The column comparison for row filtering happens before the unchanged TOAST
columns are filtered out; unchanged TOAST columns are filtered just before
writing the tuple to the output stream. I think this is the case both for
pglogical and for the proposed patch.
So I can't see why not logging unchanged TOAST columns would be a problem
for row filtering. Am I missing something?


Thank you,
Rahila Syed


Re: Column Filtering in Logical Replication

2021-08-08 Thread Rahila Syed
Hi,



> >
>> > Currently, this capability is not included in the patch. If the table on
>> > the subscriber
>> > server has lesser attributes than that on the publisher server, it
>> > throws an error at the
>> > time of CREATE SUBSCRIPTION.
>> >
>>
>> That's a bit surprising, to be honest. I do understand the patch simply
>> treats the filtered columns as "unchanged" because that's the simplest
>> way to filter the *data* of the columns. But if someone told me we can
>> "filter columns" I'd expect this to work without the columns on the
>> subscriber.
>>
>> OK, I will look into adding this.
>

This has been added in the attached patch. Now, instead of treating the
columns excluded by the column filter as unchanged and sending a byte with
that information, those columns are not sent to the subscriber server at
all. Along with saving network bandwidth, this allows logical replication
to work even between tables with different numbers of columns, i.e., with
the table on the subscriber server containing only the filtered (published)
columns. Currently, replica identity columns are replicated irrespective of
the presence of column filters, so the table on the subscriber side must
contain the replica identity columns.

The patch adds a new parse node PublicationTable, but doesn't add
> copyfuncs.c, equalfuncs.c, readfuncs.c, outfuncs.c support for it.


Thanks, added this.


> Looking at OpenTableList(), I think you forgot to update the comment --
> it says "open relations specified by a RangeVar list",


Thank you for the review, Modified this.

To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's
> not really a list ;-)


Changed this.

>
>  It's not super clear to me that strlist_to_textarray() and related
> processing will behave sanely when the column names contain weird
> characters such as commas or quotes, or just when used with uppercase
> column names.  Maybe it's worth having tests that try to break such
> cases.


Added a few test cases for this.

In AlterPublicationTables() I was confused by some code that seemed
> commented a bit too verbosely


Modified this as per the suggestion.

: REPLICA IDENTITY columns are always replicated
> : irrespective of column names specification.


... for which you don't have any tests


I have added these tests.

Having said that, I'm not sure I agree with this design decision; what I
> think this is doing is hiding from the user the fact that they are
> publishing columns that they don't want to publish.  I think as a user I
> would rather get an error in that case:


  ERROR:  invalid column list in published set
>   DETAIL:  The set of published commands does not include all the replica
> identity columns.


or something like that.  Avoid possible nasty surprises of security-
> leaking nature.


Ok, Thank you for your opinion. I agree that giving an explicit error in
this case will be safer.
I will include this, in case there are no counter views.

Thank you for your review comments. Please find attached the rebased and
updated patch.


Thank you,
Rahila Syed


v3-0001-Add-column-filtering-to-logical-replication.patch
Description: Binary data


Re: Column Filtering in Logical Replication

2021-08-11 Thread Rahila Syed
Hi Amit,

Thanks for your review.


> Can you please explain why you have the restriction for including
> replica identity columns and do we want to put a similar restriction
> for the primary key? As far as I understand, if we allow default
> values on subscribers for replica identity, then probably updates,
> deletes won't work as they need to use replica identity (or PK) to
> search the required tuple. If so, shouldn't we add this restriction
> only when a publication has been defined for one of these (Update,
> Delete) actions?
>

Yes, as you mentioned, they are needed for updates and deletes to work.
The restriction requiring replica identity columns to be included in column
filters exists because, when the replica identity column values do not
change, the old row's replica identity columns are not sent to the
subscriber; thus, the new replica identity columns need to be sent in order
to identify the row that is to be updated or deleted.
I haven't tested whether it would break INSERT as well, though. I will
update the patch accordingly.


> Another point is what if someone drops the column used in one of the
> publications? Do we want to drop the entire relation from publication
> or just remove the column filter or something else?
>
>
Thanks for pointing this out. Currently, this is not handled in the patch.
I think dropping the column from the filter would make sense, along the
lines of the table being dropped from the publication in the case of DROP TABLE.


> Do we want to consider that the columns specified in the filter must
> not have NOT NULL constraint? Because, otherwise, the subscriber will
> error out inserting such rows?
>
> I think you mean columns *not* specified in the filter must not have NOT
NULL constraint
on the subscriber, as this will break during insert, as it will try to
insert NULL for columns
not sent by the publisher.
I will look into fixing this. Probably this won't be a problem in
case the column is auto generated or contains a default value.


> Minor comments:
> 
>   pq_sendbyte(out, flags);
> -
>   /* attribute name */
>   pq_sendstring(out, NameStr(att->attname));
>
> @@ -953,6 +1000,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel)
>
>   /* attribute mode */
>   pq_sendint32(out, att->atttypmod);
> +
>   }
>
>   bms_free(idattrs);
> diff --git a/src/backend/replication/logical/relation.c
> b/src/backend/replication/logical/relation.c
> index c37e2a7e29..d7a7b00841 100644
> --- a/src/backend/replication/logical/relation.c
> +++ b/src/backend/replication/logical/relation.c
> @@ -354,7 +354,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid,
> LOCKMODE lockmode)
>
>   attnum = logicalrep_rel_att_by_name(remoterel,
>   NameStr(attr->attname));
> -
>   entry->attrmap->attnums[i] = attnum;
>
> There are quite a few places in the patch that contains spurious line
> additions or removals.
>
>
Thank you for your comments, I will fix these.

Thank you,
Rahila Syed


Re: New PostgreSQL Contributors

2023-07-30 Thread Rahila Syed
Hi,

On Fri, Jul 28, 2023 at 8:59 PM Christoph Berg  wrote:

> The PostgreSQL contributors team has been looking over the community
> activity and, over the first half of this year, has been recognizing
> new contributors to be listed on
>
> https://www.postgresql.org/community/contributors/
>
> New PostgreSQL Contributors:
>
>   Ajin Cherian
>   Alexander Kukushkin
>   Alexander Lakhin
>   Dmitry Dolgov
>   Hou Zhijie
>   Ilya Kosmodemiansky
>   Melanie Plageman
>   Michael Banck
>   Michael Brewer
>   Paul Jungwirth
>   Peter Smith
>   Vigneshwaran C
>
> New PostgreSQL Major Contributors:
>
>   Julien Rouhaud
>   Stacey Haysler
>   Steve Singer
>
> Congratulations to all the new contributors!

Thank you,
Rahila syed


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-01-18 Thread Rahila Syed
total number of
phases as well.

+   holders = lappend(holders,
+                     GetLockConflicts(locktag, lockmode, &count));
+   total += count;
2. IIUC, the above code in WaitForLockersMultiple can be placed under the
if (progress) condition, like the rest of the progress-checking code in that
function, passing NULL for count otherwise.

3. Would it be useful to report the PIDs of the backends running the
transactions that CREATE INDEX CONCURRENTLY is waiting for?
That could be used to debug long-running transactions (see the example query
after the code excerpt below).

4. Should we have an additional statistics-update phase before
index_update_stats, like PROGRESS_VACUUM_PHASE_FINAL_CLEANUP?

5. I think it may be useful to report the number of parallel workers
requested and the number of workers actually running the index build,
in the btree case.

6. Also, the exclusion-constraint check in the index_build function could be
reported as an additional validation phase, as it involves scanning all live
tuples of the heap relation.

    /*
     * If it's for an exclusion constraint, make a second pass over the heap
     * to verify that the constraint is satisfied.  We must not do this until
     * the index is fully valid.  (Broken HOT chains shouldn't matter, though;
     * see comments for IndexCheckExclusion.)
     */
    if (indexInfo->ii_ExclusionOps != NULL)
        IndexCheckExclusion(heapRelation, indexRelation, indexInfo);
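
Regarding point 3, a hypothetical monitoring query, assuming the progress view
is exposed as pg_stat_progress_create_index and reports the PID of the
transaction currently being waited for (say, in a current_locker_pid column;
both the view name and the column are assumptions here):

```
SELECT p.pid, p.phase, p.current_locker_pid, a.query AS blocking_query
  FROM pg_stat_progress_create_index p
  LEFT JOIN pg_stat_activity a ON a.pid = p.current_locker_pid;
```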

Thank you,
Rahila Syed


Re: [PATCH] Automatic HASH and LIST partition creation

2020-09-30 Thread Rahila Syed
Hi Anastasia,

I tested the syntax with some basic commands and it works fine, regression
tests also pass.

Couple of comments:
1. The syntax used omits the { IMMEDIATE | DEFERRED} keywords suggested in
the earlier discussions. I think it is intuitive to include IMMEDIATE with
the current implementation
so that the syntax can be extended with a  DEFERRED clause in future for
dynamic partitions.

>   CREATE TABLE tbl_lst (i int) PARTITION BY LIST (i)
>  CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default);


2. One suggestion for generation of partition names is to append a unique
id to
avoid conflicts.

3. Probably, here you mean to write list and hash instead of range and list
as
per the current state.

 
>  Range and list partitioning also support automatic creation of
> partitions
>   with an optional CONFIGURATION clause.
> 


4. Typo in default_part_name

> +VALUES IN ( partition_bound_expr [, ...] ) [, ( partition_bound_expr [, ...] )] [, ...] [DEFAULT PARTITION defailt_part_name]
> +MODULUS numeric_literal



Thank you,
Rahila Syed


Re: [PATCH] Automatic HASH and LIST partition creation

2020-10-04 Thread Rahila Syed
Hi,

Couple of comments:
> 1. The syntax used omits the { IMMEDIATE | DEFERRED} keywords suggested in
> the earlier discussions. I think it is intuitive to include IMMEDIATE with
> the current implementation
> so that the syntax can be extended with a  DEFERRED clause in future for
> dynamic partitions.
>
>>   CREATE TABLE tbl_lst (i int) PARTITION BY LIST (i)
>>  CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default);
>
>
>
> After some consideration, I decided that we don't actually need to
> introduce IMMEDIATE | DEFERRED keyword. For hash and list partitions it
> will always be immediate, as the number of partitions cannot change after
> we initially set it. For range partitions, on the contrary, it doesn't make
> much sense to make partitions immediately, because in many use-cases one
> bound will be open.
>
>
As per discussions on this thread:
https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1907150711080.22273%40lancre
DEFERRED clause refers to creating partitions on the fly, while the data is
being inserted.
The number of partitions and partition bounds can be the same as specified
initially
during partitioned table creation, but the actual creation of
partitions can be deferred.
This seems like a potential extension to statically created partitions even
in the case of
hash and list partitions, as it won't involve moving any existing data.

 2. One suggestion for generation of partition names is to append a
> unique id to

avoid conflicts.
>
> Can you please give an example of such a conflict? I agree that current
> naming scheme is far from perfect, but I think that 'tablename'_partnum
> provides unique name for each partition.
>
>
Sorry for not being clear earlier. I mean that the partition name
'tablename_partnum' can conflict with an existing table name.
As per the current implementation, the following results in a table name
conflict:

postgres=# create table tbl_test_5_1(i int);
CREATE TABLE
postgres=# CREATE TABLE tbl_test_5 (i int) PARTITION BY LIST((tbl_test_5))
CONFIGURATION (values in ('(1)'::tbl_test_5), ('(3)'::tbl_test_5) default partition tbl_default_5);
ERROR:  relation "tbl_test_5_1" already exists

Thank you,
Rahila Syed



Enhancing Memory Context Statistics Reporting

2024-10-21 Thread Rahila Syed

[2] PostgreSQL: Re: Get memory contexts of an arbitrary backend process
<https://www.postgresql.org/message-id/bea016ad-d1a7-f01d-a7e8-01106a1de77f%40oss.nttdata.com>


Thank you,
Rahila Syed


0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2024-10-22 Thread Rahila Syed
Hi Michael,

Thank you for the review.

On Tue, Oct 22, 2024 at 12:18 PM Michael Paquier 
wrote:

> On Mon, Oct 21, 2024 at 11:54:21PM +0530, Rahila Syed wrote:
> > On the other hand, [2] provides the statistics for all backends but logs
> > them in a file, which may not be convenient for quick access.
>
> To be precise, pg_log_backend_memory_contexts() pushes the memory
> context stats to LOG_SERVER_ONLY or stderr, hence this is appended to
> the server logs.
>
> > A fixed-size shared memory block, currently accommodating 30 records,
> >  is used to store the statistics. This number was chosen arbitrarily,
> >  as it covers all parent contexts at level 1 (i.e., direct children of
> the
> > top memory context)
> > based on my tests.
> > Further experiments are needed to determine the optimal number
> > for summarizing memory statistics.
>
> + * Statistics are shared via fixed shared memory which
> + * can hold statistics for 29 contexts. The rest of the
> [...]
> +   MemoryContextInfo memctx_infos[30];
> [...]
> +   memset(&memCtxState->memctx_infos, 0, 30 *
> sizeof(MemoryContextInfo));
> [...]
> +   size = add_size(size, mul_size(30, sizeof(MemoryContextInfo)));
> [...]
> +   memset(&memCtxState->memctx_infos, 0, 30 * sizeof(MemoryContextInfo));
> [...]
> +   memset(&memCtxState->memctx_infos, 0, 30 *
> sizeof(MemoryContextInfo));
>
> This number is tied to MemoryContextState added by the patch.  Sounds
> like this would be better as a constant properly defined rather than
> hardcoded in all these places.  This would make the upper-bound more
> easily switchable in the patch.
>
>
Makes sense. Fixed in the attached patch.


> +   Datum   path[128];
> +   chartype[128];
> [...]
> +   charname[1024];
> +   charident[1024];
> +   chartype[128];
> +   Datum   path[128];
>
> Again, constants.  Why these values?  You may want to use more
> #defines here.
>
> I added the #defines for these in the attached patch.
The size of the path array should match the number of levels in the memory
context tree, and type is a constant string.

For the name and ident, I have used the existing #define
MEMORY_CONTEXT_IDENT_DISPLAY_SIZE as the size limit.


> > Any additional statistics that exceed the shared memory capacity
> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> > backend
> >  first reads from the shared memory, and if necessary, retrieves the
> > remaining data from the file,
> > combining everything into a unified view. The files are cleaned up
> > automatically
> > if a backend crashes or during server restarts.
>
> Is the addition of the file to write any remaining stats really that
> useful?  This comes with a heavy cost in the patch with the "in_use"
> flag, the various tweaks around the LWLock release/acquire protecting
> the shmem area and the extra cleanup steps required after even a clean
> restart.  That's a lot of facility for this kind of information.
>

The rationale behind using the file is to cater to the unbounded
number of memory contexts.
The "in_use" flag is used to govern access to the shared memory,
since I am reserving enough memory for only one backend;
it ensures that another backend does not overwrite the statistics
in shared memory before they are read by the client backend.


> Another thing that may be worth considering is to put this information
> in a DSM per the variable-size nature of the information, perhaps cap
> it to a max to make the memory footprint cheaper, and avoid all
> on-disk footprint because we don't need it to begin with as this is
> information that makes sense only while the server is running.
>
> Thank you for the suggestion.  I will look into using DSMs especially
if there is a  way to limit the statistics dump, while still providing a
user
with enough information to debug memory consumption.

In this draft, I preferred using a file over DSMs, as a file can provide
ample space for dumping a large number of memory context statistics
without the risk of DSM creation failure due to insufficient memory.

Also, why the single-backend limitation?


To reduce the memory footprint, the shared memory is created for only
one backend; each backend has to wait for the previous operation to
finish before it can write.

I think a good use case for this would be a background process
periodically running the monitoring function on each of the
backends sequentially to fetch the statistics.
This way there will be little contention for the shared memory.

In case the shared memory is not available, a backend immediately
returns from the interrupt handler without blocking its normal
operations.

  One could imagine a

Re: Using read_stream in index vacuum

2024-10-24 Thread Rahila Syed
Hi Andrey,

I ran the following test with
v7-0001-Prototype-B-tree-vacuum-streamlineing.patch
to measure the performance improvement.

--Table size of approx 2GB (Fits in RAM)
postgres=# create unlogged table x_big as select i from
generate_series(1,6e7) i;
SELECT 6000
postgres=# create index on x_big(i);
CREATE INDEX
-- Perform updates to create dead tuples.
postgres=# do $$
declare
var int := 0;
begin
  for counter in 1 .. 1e7 loop
var := (SELECT floor(random() * (1e7 - 1 + 1) * 1));
UPDATE x_big SET i = i + 5 WHERE i = var;
  end loop;
end;
$$;
postgres=# CREATE EXTENSION pg_buffercache;
CREATE EXTENSION
-- Evict Postgres buffer cache for this relation.
postgres=# SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('x_big');
 pg_buffercache_evict
--
 t
(1 row)

postgres=# \timing on
Timing is on.
postgres=# vacuum x_big;
VACUUM

The timing does not seem to have improved with the patch.
Timing with the patch:  Time: 9525.696 ms (00:09.526)
Timing without the patch:  Time: 9468.739 ms (00:09.469)

While writing this email, I realized I evicted buffers for the table
and not the index. I will perform the test again. However,
I would like to know your opinion on whether this looks like
a valid test.
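
For reference, a sketch of the eviction step adjusted to target the index
rather than the table (x_big_i_idx is the default name PostgreSQL generates
for the index created above; adjust if it differs):

```
SELECT DISTINCT pg_buffercache_evict(bufferid)
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('x_big_i_idx');
```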

Thank you,
Rahila Syed


On Thu, Oct 24, 2024 at 4:45 PM Andrey M. Borodin 
wrote:

>
>
> > On 24 Oct 2024, at 10:15, Andrey M. Borodin 
> wrote:
> >
> > I've also added GiST vacuum to the patchset.
>
> I decided to add up a SP-GiST while having launched on pgconf.eu.
>
>
> Best regards, Andrey Borodin.
>


Re: Enhancing Memory Context Statistics Reporting

2024-11-12 Thread Rahila Syed
Hi,

Thank you for the review.

>
> Hmm, would it make sene to use dynamic shared memory for this?  The
> publishing backend could dsm_create one DSM chunk of the exact size that
> it needs, pass the dsm_handle to the consumer, and then have it be
> destroy once it's been read.  That way you don't have to define an
> arbitrary limit of any size.  (Maybe you could keep a limit to how much
> is published in shared memory and spill the rest to disk, but I think
> such a limit should be very high[1], so that it's unlikely to take
> effect in normal cases.)


> [1] This is very arbitrary of course, but 1 MB gives enough room for
> some 7000 contexts, which should cover normal cases.
>

I used one DSA area per process to share statistics. Currently,
the size limit for each DSA is 16 MB, which can accommodate
approximately 6,700 MemoryContextInfo structs. Any additional
statistics will spill over to a file. I opted for DSAs over DSMs to
enable memory reuse by freeing segments for subsequent
statistics copies of the same backend, without needing to
recreate DSMs for each request.

The dsa_handle for each process is stored in an array,
indexed by the procNumber, within the shared memory.
The maximum size of this array is defined as the sum of
 MaxBackends and the number of auxiliary processes.

As requested earlier, I have renamed the function to
pg_get_process_memory_contexts(pid, get_summary).
Suggestions for a better name are welcome.
When the get_summary argument is set to true, the function provides
statistics for memory contexts only up to level 2, that is, the
top memory context and all its children.
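
For example (the pid value is illustrative; passing true requests the
summarized view described above):

```
SELECT * FROM pg_get_process_memory_contexts('237244', true);
```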

Please find attached a rebased patch that includes these changes.
I will work on adding a test for the function and some code refactoring
suggestions.

Thank you,
Rahila Syed


v2-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: [HACKERS] Allow INSTEAD OF DELETE triggers to modify the tuple for RETURNING

2024-09-19 Thread Rahila Syed
Hi,

> So I'm suggesting to change this behavior and allow INSTEAD OF DELETE
> triggers to modify the OLD tuple and use that for RETURNING instead of
> returning the tuple in planSlot. Attached is a WIP patch implementing
that.
>
I am trying to understand the basic idea behind the change.  What is the
use-case for allowing the OLD tuple to be modified in the INSTEAD of DELETE
triggers.
AFAIU no matter what we return from the trigger OLD/NEW or NULL, the delete
operation is skipped, so what is the point of modifying the OLD row in your
test
for example.

If the requirement is to show the recent row in the returning clause, is
fetching the
row from heap and copying into the returning slot not enough?

In the tests shown as follows, inspite of the output showing 'DELETE 1',
no row is actually deleted from the table or the view. Probably this
is an existing behavior but I think this is misleading.

```
DELETE FROM test_ins_del_view where a = 1 returning a;
NOTICE:  old.a 4
 a
---
 4
(1 row)

DELETE 1
```

FWIW, it might be good to include in tests the example
to demonstrate the problem regarding the concurrent transactions,
one of which is modifying the row and the other returning it
after attempting DELETE.



Thank you,
Rahila Syed




On Fri, Mar 15, 2024 at 7:57 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi,
>
> > it took me a while to figure out why the doc build fails.
> >
> > [...]
> >
> > Anyway, based on your patch, I modified it, also added a slightly more
> complicated test.
>
> Thank you for keeping the patch up-to-date. Unfortunately, it seems to
> be needing another rebase, according to cfbot.
>
> Best regards,
> Aleksander Alekseev (wearing co-CFM hat)
>
>
>


Re: Enhancing Memory Context Statistics Reporting

2024-10-24 Thread Rahila Syed
Hi Torikoshia,

Thank you for reviewing the patch!

On Wed, Oct 23, 2024 at 9:28 AM torikoshia 
wrote:

> On 2024-10-22 03:24, Rahila Syed wrote:
> > Hi,
> >
> > PostgreSQL provides following capabilities for reporting memory
> > contexts statistics.
> > 1. pg_get_backend_memory_contexts(); [1]
> > 2. pg_log_backend_memory_contexts(pid); [2]
> >
> > [1]  provides a view of memory context statistics for a local backend,
> > while [2] prints the memory context statistics of any backend or
> > auxiliary
> > process to the PostgreSQL logs. Although [1] offers detailed
> > statistics,
> > it is limited to the local backend, restricting its use to PostgreSQL
> > client backends only.
> > On the other hand, [2] provides the statistics for all backends but
> > logs them in a file,
> > which may not be convenient for quick access.
> >
> > I propose enhancing memory context statistics reporting by combining
> > these
> > capabilities and offering a view of memory statistics for all
> > PostgreSQL backends
> > and auxiliary processes.
>
> Thanks for working on this!
>
> I originally tried to develop something like your proposal in [2], but
> there were some difficulties and settled down to implement
> pg_log_backend_memory_contexts().
>
> Yes. I am revisiting this problem :)


> > Attached is a patch that implements this functionality. It introduces
> > a SQL function
> > that takes the PID of a backend as an argument, returning a set of
> > records,
> > each containing statistics for a single memory context. The underlying
> > C function
> > sends a signal to the backend and waits for it to publish its memory
> > context statistics
> >  before returning them to the user. The publishing backend copies
> > these statistics
> > during the next CHECK_FOR_INTERRUPTS call.
>
> I remember waiting for dumping memory contexts stats could cause trouble
> considering some erroneous cases.
>
> For example, just after the target process finished dumping stats,
> pg_get_remote_backend_memory_contexts() caller is terminated before
> reading the stats, calling pg_get_remote_backend_memory_contexts() has
> no response any more:
>
> [session1]$ psql
> (40699)=#
>
> $ kill -s SIGSTOP 40699
>
> [session2] psql
>(40866)=# select * FROM
> pg_get_remote_backend_memory_contexts('40699', false); -- waiting
>
> $ kill -s SIGSTOP 40866
>
> $ kill -s SIGCONT 40699
>
> [session3] psql
> (47656) $ select pg_terminate_backend(40866);
>
> $ kill -s SIGCONT 40866 -- session2 terminated
>
> [session3] (47656)=# select * FROM
> pg_get_remote_backend_memory_contexts('47656', false); -- no response
>
> It seems the reason is memCtxState->in_use is now and
> memCtxState->proc_id is 40699.
> We can continue to use pg_get_remote_backend_memory_contexts() after
> specifying 40699, but it'd be hard to understand for users.
>
> Thanks for testing and reporting. While I am not able to reproduce this
problem,
I think this may be happening because the requesting backend/caller is
terminated
before it gets a chance to mark  memCtxState->in_use as false.

In this case memCtxState->in_use should be marked as
'false' possibly during the processing of ProcDiePending in
ProcessInterrupts().

> This approach facilitates on-demand publication of memory statistics
> > for a specific backend, rather than collecting them at regular
> > intervals.
> > Since past memory context statistics may no longer be relevant,
> > there is little value in retaining historical data. Any collected
> > statistics
> > can be discarded once read by the client backend.
> >
> > A fixed-size shared memory block, currently accommodating 30 records,
> >  is used to store the statistics. This number was chosen arbitrarily,
> >  as it covers all parent contexts at level 1 (i.e., direct children of
> > the top memory context)
> > based on my tests.
> > Further experiments are needed to determine the optimal number
> > for summarizing memory statistics.
> >
> > Any additional statistics that exceed the shared memory capacity
> > are written to a file per backend in the PG_TEMP_FILES_DIR. The client
> > backend
> >  first reads from the shared memory, and if necessary, retrieves the
> > remaining data from the file,
> > combining everything into a unified view. The files are cleaned up
> > automatically
> > if a backend crashes or during server restarts.
> >
> > The statistics are reported in a breadth-first search order of the

Re: Enhancing Memory Context Statistics Reporting

2024-11-28 Thread Rahila Syed
Hi Tomas,

Thank you for the review.

>
>
> 1) I read through the thread, and in general I agree with the reasoning
> for removing the file part - it seems perfectly fine to just dump as
> much as we can fit into a buffer, and then summarize the rest. But do we
> need to invent a "new" limit here? The other places logging memory
> contexts do something like this:
>
>MemoryContextStatsDetail(TopMemoryContext, 100, 100, false);
>
> Which means we only print the 100 memory contexts at the top, and that's
> it. Wouldn't that give us a reasonable memory limit too?
>
I think this prints more than 100 memory contexts, since 100 denotes the
max_level, and contexts at each level can have up to 100 children. This
limit seems much higher than what I am currently storing in the DSA, which
is approx. 7000 contexts. I will verify this again.


> 2) I see the function got renamed to pg_get_process_memory_contexts(),
> bu the comment still says pg_get_remote_backend_memory_contexts().
>
> Fixed

>
> 3) I don't see any SGML docs for this new function. I was a bit unsure
> what the "summary" argument is meant to do. The comment does not explain
> that either.
>
> Added docs.
The intention behind adding a summary argument is to report statistics for
contexts at levels 0 and 1, i.e., TopMemoryContext and its immediate children.

4) I wonder if the function needs to return PID. I mean, the caller
> knows which PID it is for, so it seems rather unnecessary.
>
> Perhaps it can be used to ascertain that the information indeed belongs to
the requested pid.

5) In the "summary" mode, it might be useful to include info about how
> many child contexts were aggregated. It's useful to know whether there
> was 1 child or 1 children. In the regular (non-summary) mode it'd
> always be "1", probably, but maybe it'd interact with the limit in (1).
> Not sure.
>
> Sure,  I will add this in the next iteration.

>
> 6) I feel a bit uneasy about the whole locking / communication scheme.
> In particular, I'm worried about lockups, deadlocks, etc. So I decided
> to do a trivial stress-test - just run the new function through pgbench
> with many clients.
>
> The memstats.sql script does just this:
>
>   SELECT * FROM pg_get_process_memory_contexts(
> (SELECT pid FROM pg_stat_activity
>   WHERE pid != pg_backend_pid()
>   ORDER BY random() LIMIT 1)
> , false);
>
> where the inner query just picks a PID for some other backend, and asks
> for memory context stats for that.
>
> And just run it like this on a scale 1 pgbench database:
>
>   pgbench -n -f memstats.sql -c 10 test
>
> And it gets stuck *immediately*. I've seen it to wait for other client
> backends and auxiliary processes like autovacuum launcher.
>
> This is absolutely idle system, there's no reason why a process would
> not respond almost immediately.


In my reproduction, this issue occurred because the process was terminated
while the requesting backend was waiting on the condition variable to be
signaled by it. I don't see any solution other than having the waiting
client backend time out using ConditionVariableTimedSleep.

In the patch, since the timeout was set to a high value, pgbench ended up
stuck
waiting for the timeout to occur. The failure happens less frequently after
I added an
additional check for the process's existence, but it cannot be entirely
avoided. This is because a process can terminate after we check for its
existence but
before it signals the client. In such cases, the client will not receive
any signal.

I wonder if e.g. autovacuum launcher may
> not be handling these requests, or what if client backends can wait in a
> cycle.


I did not see a cyclic wait among client backends in the pgbench stress test.


>
> 7) I've also seen this error:
>
>   pgbench: error: client 6 script 0 aborted in command 0 query 0: \
>   ERROR:  can't attach the same segment more than once
>
I haven't investigated it, but it seems like a problem handling errors,
> where we fail to detach from a segment after a timeout.


Thanks for the hint, fixed by adding a missing call to dsa_detach after
timeout.


>
>   > I opted for DSAs over DSMs to enable memory reuse by freeing
>   > segments for subsequent statistics copies of the same backend,
>   > without needing to recreate DSMs for each request.
>
> I feel like this might be a premature optimization - I don't have a
> clear idea how expensive it is to create DSM per request, but my
> intuition is that it's cheaper than processing the contexts and
> generating the info.
>
> I'd just remove that, unless someone demonstrates it really matters. I
> don't really worry about how expensive it is to process a request
> (within reason, of course) - it will happen only very rarely. It's more
> important to make sure there's no overhead when no one asks the backend
> for memory context info, and simplicity.
>
> Also, how expensive it is to just keep the DSA "just in case"? Imagine
> someone asks for the memory context info once - isn't

Re: Enhancing Memory Context Statistics Reporting

2024-11-20 Thread Rahila Syed
Hi,

To achieve both completeness and avoid writing to a file, I can consider
> displaying the numbers for the remaining contexts as a cumulative total
> at the end of the output.
>
> Something like follows:
> ```
> postgres=# select * from pg_get_process_memory_contexts('237244', false);
>  name                          | ident |   type   | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
> -------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
>  TopMemoryContext              |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
>  search_path processing cache  |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244
>
> Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11,111 (free_bytes)
> ```
>

Please find attached an updated patch with this change. The file previously
used to
store spilled statistics has been removed. Instead, a cumulative total of
the
remaining/spilled context statistics is now stored in the DSM segment,
which is
displayed as follows.

postgres=# select * from pg_get_process_memory_contexts('352966', false);
 name             | ident |   type   |  path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
------------------+-------+----------+--------+-------------+---------------+------------+-------------+------------+--------
 TopMemoryContext |       | AllocSet | {0}    |       97696 |             5 |      14288 |          11 |      83408 | 352966
 .
 .
 .
 MdSmgr           |       | AllocSet | {0,18} |        8192 |             1 |       7424 |           0 |        768 | 352966
 Remaining Totals |       |          |        |     1756016 |           188 |     658584 |         132 |    1097432 | 352966
(7129 rows)

I believe this serves as a good compromise between completeness
and avoiding the overhead of file handling. However, I am open to
reintroducing file handling if displaying the complete statistics of the
remaining contexts proves to be more important.

All the known bugs in the patch have been fixed.

In summary, one DSA  per PostgreSQL process is used to share
the statistics of that process. A DSA is created by the first client
backend that requests memory context statistics, and it is pinned
for all future requests to that process.
A handle to this DSA is shared between the client and the publishing
process using fixed shared memory. The fixed shared memory consists
of an array of size MaxBackends + auxiliary processes, indexed
by procno. Each element in this array is less than 100 bytes in size.

A PostgreSQL process uses a condition variable to signal a waiting client
backend once it has finished publishing the statistics. If, for some
reason,
the signal is not sent, the waiting client backend will time out.

When statistics for the local backend are requested, this function returns
the following WARNING and exits, since that case can be handled by an
existing function which doesn't require a DSA.

WARNING:  cannot return statistics for local backend
HINT:  Use pg_get_backend_memory_contexts instead
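
For the local backend, the existing function mentioned in the hint can be
used directly, for example:

```
SELECT name, total_bytes, used_bytes
  FROM pg_get_backend_memory_contexts()
 LIMIT 5;
```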

Looking forward to your review.

Thank you,
Rahila Syed


v4-Function-to-report-memory-context-stats-of-a-process.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2024-11-15 Thread Rahila Syed
Hi,

On Thu, Nov 14, 2024 at 5:18 PM Alvaro Herrera 
wrote:

> On 2024-Nov-14, Michael Paquier wrote:
>
> > Already mentioned previously at [1] and echoing with some surrounding
> > arguments, but I'd suggest to keep it simple and just remove entirely
> > the part of the patch where the stats information gets spilled into
> > disk.  With more than 6000-ish context information available with a
> > hard limit in place, there should be plenty enough to know what's
> > going on anyway.
>
> Functionally-wise I don't necessarily agree with _removing_ the spill
> code, considering that production systems with thousands of tables would
> easily reach that number of contexts (each index gets its own index info
> context, each regexp gets its own memcxt); and I don't think silently
> omitting a fraction of people's memory situation (or erroring out if the
> case is hit) is going to make us any friends.
>
>
While I agree that removing the spill-to-file logic will simplify the code,
I also understand the rationale for retaining it to ensure completeness.
To achieve both completeness and avoid writing to a file, I can consider
displaying the numbers for the remaining contexts as a cumulative total
at the end of the output.

Something like follows:
```
postgres=# select * from pg_get_process_memory_contexts('237244', false);
 name                          | ident |   type   | path  | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes |  pid
-------------------------------+-------+----------+-------+-------------+---------------+------------+-------------+------------+--------
 TopMemoryContext              |       | AllocSet | {0}   |       97696 |             5 |      14288 |          11 |      83408 | 237244
 search_path processing cache  |       | AllocSet | {0,1} |        8192 |             1 |       5328 |           7 |       2864 | 237244

Remaining contexts total: 23456 bytes (total_bytes), 12345 (used_bytes), 11,111 (free_bytes)
```


> That said, it worries me that we choose a shared memory size so large
> that it becomes impractical to hit the spill-to-disk code in regression
> testing.  Maybe we can choose a much smaller limit size when
> USE_ASSERT_CHECKING is enabled, and use a test that hits that number?
> That way, we know the code is being hit and tested, without imposing a
> huge memory consumption on test machines.
>

Makes sense. I will look into writing such a test, if we finalize the
approach
of spill-to-file.

Please find attached a rebased and updated patch with a basic test
and some fixes. Kindly let me know your thoughts.

Thank you,
Rahila Syed


v3-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Separate memory contexts for relcache and catcache

2024-11-26 Thread Rahila Syed
Hi Melih, Jeff,

I tested the v4 patch along with the memory statistics reporting patch from
the thread at [1].

After running a plpgsql procedure periodically and querying its memory
consumption, I see various child contexts of CacheMemoryContext created,
as shown in the image below and at link [2].


[image: Memorycontext.drawio.png]
Observations (see also the query sketch below):
1. While RelCacheContext has a number of child contexts, such as index info
contexts, CatCacheContext does not have any children.
2. While there is a fair amount of used memory in RelCacheContext and
CatCacheContext, SPICacheContext and PlanCacheContext do not have any
allocations of their own and serve only as parents for the SPI- and
CachedPlan-related contexts, respectively.
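
For reference, a minimal sketch of how these contexts can be inspected
locally (context names as they appear in the tested patches):

```
SELECT name, total_bytes, used_bytes
  FROM pg_backend_memory_contexts
 WHERE name IN ('CacheMemoryContext', 'RelCacheContext', 'CatCacheContext',
                'SPICacheContext', 'PlanCacheContext');
```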

Having reviewed the discussion regarding potential fragmentation issues
caused by
creating a large number of new contexts in each backend, I would like to
take a step
back and better understand the motivation behind separating these contexts.

IIUC, segregating cache memory allocations into RelCacheContext and
CatCacheContext
allows for grouping a large number of memory allocations under a
common context, which, in turn, aids in monitoring memory consumption.
However,
I believe this reasoning does not apply to SPICacheContext and
PlanCacheContext,
as these contexts do not have any allocations of their own.

How, then, does separating these contexts from CacheMemoryContext improve
monitoring?
 Additionally, IIUC, these contexts are created as long-lived contexts, so
they are not designed
 to optimize deletion of all their children via MemoryContextDelete on the
parent.


> Attached a separate patch to change initial sizes for relcache and
> catcache contexts as they grow
>


> large from the start. This was suggested in the thread previously [1].
> Also changed CacheMemoryContext to use ALLOCSET_START_SMALL_SIZES, so it
> starts from 1KB.
>

Applying the same change to use ALLOCSET_START_SMALL_SIZES would be
beneficial for
SPICacheContext and PlanCacheContext contexts as well.

On the documentation front, the newly added contexts would require a mention
in src/backend/utils/mmgr/README.

[1] PostgreSQL: Enhancing Memory Context Statistics Reporting
<https://www.postgresql.org/message-id/flat/cah2l28v8mc9hdt8qosj8trmkau_8fm_hks41neo9-6zakuz...@mail.gmail.com>
[2] Memorycontext.png - Page-1
<https://viewer.diagrams.net/index.html?tags=%7B%7D&lightbox=1&highlight=ff&edit=_blank&layers=1&nav=1&title=Memorycontext.png#R%3Cmxfile%3E%3Cdiagram%20id%3D%22prtHgNgQTEPvFCAcTncT%22%20name%3D%22Page-1%22%3E7Ztbc9o4FMc%2FDY9kLF%2FhMSHptjNNhy17aZ92hBFYHWN5ZRFgP%2F1KtoSvgCE2mAwvrXWs6%2Fn%2Fjmwdk54xWm5%2BozD0XskM%2BT1dm216xnNP152Bzf8Vhm1iMG0nMSwoniUmkBom%2BD8kjZq0rvAMRbmKjBCf4TBvdEkQIJflbJBSss5XmxM%2FP2oIF6hkmLjQL1v%2FxjPmJdaB7qT2zwgvPDUysIfJnSVUleVKIg%2FOyDpjMl56xogSwpKr5WaEfOE75Zek3ac9d3cToyhgdRqQyVibPgZgYn3%2FFv6Fwx9%2FzmEfyG7eoL%2BSK5azZVvlAjTjHpFFQplHFiSA%2FktqfaJkFcyQGEfjpbTOV0JCbgTc%2BAsxtpXywhUj3OSxpS%2FvJmOKgfYuTpoisqIuOrAiOX8G6QKxA%2FWcnQQcXUSWiNEtb0eRDxl%2By88DSogWu3qpn%2FmFdPUJbtcrvG77fLpPUQiDnPvtf1eCkKc5CVg%2Fih34yCsAELLYcZqPA9T3JIPxLY2vzYpvqsb8aiH%2BH0HXQ69oSeh2xPtDG6bG5ctIhk4qliDIS7z2MEOTEMZKrHngV8n5higf4bCgZQFkA0OFjdw3bFlcp0EIlM3LBKCyNS6ZsVcyoUwtycxwU6nKH9swFqYsSdJ1RyQpKAKcCkn0S0pinh5FpwRLQd85dPN9fUb%2BG2LYhdmm78BgBFktDLoZsMDqGh5WexE79mFwEyE76Joo9l5RpkXCp%2FQcC59VqadQvCD6MIpyYr9G5BuhS%2BgfDf3zQSnP8zvyT4vysNRXM4463ZnvgRn6eBHwa5eTi2gzdFud23Kciz6RzsdyMv5yw48aE5gd033YrO5arOMcLrG%2FTW6OePxMKe7pI9EZDKJ%2BhCiet0tJhCB1vX9CcZLNbBGfMgPBpZA3mEZhnpcqjN49n5ASF0URDhY9oW67g7kiPm4sMOxh1zZElcY5mF8IZo8iUyOeDuIhjd28Y%2FJeRBvMfojrB0uWfsp64vp5k6n2vFWFgK8laWQaQ2X4GRs0y1aGtHFcyrUe83DjLhGPrtjYu3zGwqrWPqOtVSGtstVObMgRxgTHEaTO4wW0Ssgky5StsmmoQkemWegIFDpK%2FFDqKMZvt%2Bx3EFmVe2mISDAY1GeSF4pQZTi1cpQeIfQKMJp3GJuAsSqr1BCMORQfNPChcbTvODaBY1VGrRkcMzBqD451mMWUOzvLXV9TGN%2FA09q4E9kEkVVJvEaIBDkiawNp2KCIpGneCJLDayJpa9UfdE4l0nD0XD9ANy9LZFUG86onGuHyPJGWcWUirZpEgsGH2CUNxaDqyLrwLjmoYLKhPEicJ5yJDx5d%2FcgBCh85DPWMylA0PEBR80mP%2FenABuWYJEHYhihtJOsNs2MqqTxYKxu5Uzh9JXv0uacv2ypt8eBmtvirvnV8mBdhtfQ29pTJ%2BAu%2F3eUtvvjFZ4fHscS22druoZf0wNwlG9EumJOre8yyteMeAxf1WIu5rrNenK%2BV03Jq7pz7FL7vnKdx115S67a4s%2Btyp9%2B5a4K71lJX4itm4fXSOfn1skGwPsgrnuEMcx2Z2oWB2f%2FbuNq%2FCj9%2BbPx9hYQ3W%2FgBwyVOjTtNmj818mL6xxiJpulftBgv%2FwM%3D%3C%2Fdiagram%3E%3C%2Fmxfile%3E#%7B%22pageId%22%3A%22prtHgNgQTEPvFCAcTncT%22%7D>


Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-01-06 Thread Rahila Syed
Hi Torikoshia,

Thank you for the review.

>
>
>=# select path FROM pg_get_process_memory_contexts('20271', false);
>path
>---
> {0}
> {0,1}
> {0,2}
> ..
>
> =# select path from pg_backend_memory_contexts;
> path
>---
> {1}
> {1,2}
> {1,3}
> ..
>
> Would it be better to begin with 1 to make them consistent?
>
> Makes sense, fixed in the attached patch.

pg_log_backend_memory_contexts() does not allow non-superusers to
> execute by default since it can peek at other session information.
> pg_get_process_memory_contexts() does not have this restriction, but
> wouldn't it be necessary?
>
> Yes. I added the restriction to only allow super users and
users with pg_read_all_stats privileges to query the memory context
statistics of another process.


> When the target pid is the local backend, the HINT suggests using
> pg_get_backend_memory_contexts(), but this function is not described in
> the manual.
> How about suggesting pg_backend_memory_contexts view instead?
>
>=# select pg_get_process_memory_contexts('27041', false);
>WARNING:  cannot return statistics for local backend
>HINT:  Use pg_get_backend_memory_contexts instead
>
>
> There are no explanations about 'num_agg_contexts', but I thought the
> explanation like below would be useful.
>
> Ok. I added an explanation of this column in the documentation.


> > I have added this information as a column named "num_agg_contexts",
> > which indicates
> > the number of contexts whose statistics have been aggregated/added for
> > a particular output.
>
> git apply caused some warnings:
>
> Thank you for reporting. They should be gone now.

PFA the patch with above updates.

Thank you,
Rahila Syed


v9-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-02-03 Thread Rahila Syed
Hi,


> >
> > Just idea; as an another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> pending?
> > Users can simply retry the request if it fails. IMO failing quickly
> > seems preferable to getting stuck for a while in cases with
> concurrent
> > requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >


Please find attached a patch that adds a request_pending field in
shared memory. This allows us to detect concurrent requests early
and return a WARNING message immediately, avoiding unnecessary
waiting and potential timeouts. This is added in v12-0002* patch.

I imagined it would work something like this:
>
> requesting backend:
> ---
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wait all the backends waiting for the stats (through CV)
>

The attached v12-0002* patch implements this. We determine whether the
statistics are up to date based on the stats timestamp: if it is greater
than the timestamp at which the request was sent, the statistics are
considered current and are returned immediately. Otherwise, the client
waits for the latest statistics to be published, until the timeout is
reached.
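
In rough pseudocode, the freshness check on the requesting side amounts to
the following; the signal name, struct fields and wait event here are
illustrative only, not the identifiers used in the v12 patch.
```
/* Requesting-backend side, sketch only -- identifiers are assumptions. */
TimestampTz request_ts = GetCurrentTimestamp();

SendProcSignal(target_pid, PROCSIG_GET_MEMORY_CONTEXT, procNumber);

while (entry->stats_ts <= request_ts)	/* published stats predate our request */
{
	if (ConditionVariableTimedSleep(&entry->cv, remaining_ms,
									WAIT_EVENT_MEM_CXT_PUBLISH))
		break;							/* timed out, give up */
}
ConditionVariableCancelSleep();
/* unless we timed out, stats_ts > request_ts and the entry is fresh */
```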

With the latest changes, I don't see a dip in tps even when
concurrent requests are run in pgbench script.

 pgbench -n -f monitoring.sql -P 1 postgres -T 60
pgbench (18devel)
progress: 1.0 s, 816.9 tps, lat 1.218 ms stddev 0.317, 0 failed
progress: 2.0 s, 821.9 tps, lat 1.216 ms stddev 0.177, 0 failed
progress: 3.0 s, 817.1 tps, lat 1.224 ms stddev 0.209, 0 failed
progress: 4.0 s, 791.0 tps, lat 1.262 ms stddev 0.292, 0 failed
progress: 5.0 s, 780.8 tps, lat 1.280 ms stddev 0.326, 0 failed
progress: 6.0 s, 675.2 tps, lat 1.482 ms stddev 0.503, 0 failed
progress: 7.0 s, 674.0 tps, lat 1.482 ms stddev 0.387, 0 failed
progress: 8.0 s, 821.0 tps, lat 1.217 ms stddev 0.272, 0 failed
progress: 9.0 s, 903.0 tps, lat 1.108 ms stddev 0.196, 0 failed
progress: 10.0 s, 886.9 tps, lat 1.128 ms stddev 0.160, 0 failed
progress: 11.0 s, 887.1 tps, lat 1.126 ms stddev 0.243, 0 failed
progress: 12.0 s, 871.0 tps, lat 1.147 ms stddev 0.227, 0 failed
progress: 13.0 s, 735.0 tps, lat 1.361 ms stddev 0.329, 0 failed
progress: 14.0 s, 655.9 tps, lat 1.522 ms stddev 0.331, 0 failed
progress: 15.0 s, 674.0 tps, lat 1.484 ms stddev 0.254, 0 failed
progress: 16.0 s, 659.0 tps, lat 1.517 ms stddev 0.289, 0 failed
progress: 17.0 s, 641.0 tps, lat 1.558 ms stddev 0.281, 0 failed
progress: 18.0 s, 707.8 tps, lat 1.412 ms stddev 0.324, 0 failed
progress: 19.0 s, 746.3 tps, lat 1.341 ms stddev 0.219, 0 failed
progress: 20.0 s, 659.9 tps, lat 1.513 ms stddev 0.372, 0 failed
progress: 21.0 s, 651.8 tps, lat 1.533 ms stddev 0.372, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
progress: 22.0 s, 635.2 tps, lat 1.574 ms stddev 0.519, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
progress: 23.0 s, 730.0 tps, lat 1.369 ms stddev 0.408, 0 failed
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again
WARNING:  cannot process the request at the moment
HINT:  Another request is pending, try again

where monitoring.sql is as follows:
SELECT * FROM pg_get_process_memory_contexts(
  (SELECT pid FROM pg_stat_activity
WHERE pid != pg_backend_pid()
ORDER BY random() LIMIT 1)
  , false);

I have split the patch into 2 patches with v12-0001* consisting of fixes
needed to allow using the MemoryContextStatsInternals for this
feature and
v12-0002* containing all the remaining changes for the feature.

A few outstanding issues are as follows:

1.  Currently one DSA  is created per backend when the first request for
statistics is made and remains for the lifetime of the server.
I think I should add logic to periodically destroy DSAs, when memory
context statistics are not being *actively* queried from the backend,
as determined by the statistics timestamp.
2. The two issues reported by Fujii-san here: [1].
i. I have proposed a fix for the first issue here [2].
ii. I am able to reproduce the second issue. This happens when we try
to query statistics of a backend running infinite_recurse.sql. While I am
working on finding a root-cause, I think it happens due to some memory
being overwritten 

Re: Enhancing Memory Context Statistics Reporting

2024-11-22 Thread Rahila Syed
Hi,

How does the process know that the client backend has finished reading
> stats and it can be refreshed? What happens, if the next request for
> memory context stats comes before first requester has consumed the
> statistics it requested?
>
> A process that's copying its statistics does not need to know that.
Whenever it receives a signal to copy statistics, it goes ahead and
copies the latest statistics to the DSA after acquiring an exclusive
lwlock.

A requestor takes a lock before it starts consuming the statistics.
If the next request comes while the first requestor is consuming the
statistics, the publishing process will wait on lwlock to be released
by the consuming process before it can write the statistics.
If the next request arrives before the first requester begins consuming
the statistics, the publishing process will acquire the lock and overwrite
the earlier statistics with the most recent ones.
As a result, both the first and second requesters will consume the
updated statistics.
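
As a rough sketch of the locking described above (the lock and the two
helper calls are placeholders, not the patch's actual identifiers):
```
/* Publishing process, e.g. when it handles the signal -- sketch only */
LWLockAcquire(memstats_lock, LW_EXCLUSIVE);		/* waits while a requestor reads */
publish_stats_to_dsa(area);						/* placeholder for the copy step */
LWLockRelease(memstats_lock);

/* Requesting backend */
LWLockAcquire(memstats_lock, LW_SHARED);
read_stats_from_dsa(area);						/* placeholder */
LWLockRelease(memstats_lock);
```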

Does the shared memory get deallocated when the backend which
> allocated it exits?
>
> Memory in the DSA is allocated by a postgres process and deallocated
by the client backend for each request. Both the publishing postgres process
and the client backend detach from the DSA at the end of each request.
However, the DSM segment(s) persist even after all the processes exit
and are only destroyed upon a server restart. Each DSA is associated
with the procNumber of a postgres process and
can be re-used by any future process with the same procNumber.

> >
> > When statistics of a local backend is requested, this function returns
> the following
> > WARNING and exits, since this can be handled by an existing function
> which
> > doesn't require a DSA.
> >
> > WARNING:  cannot return statistics for local backend
> > HINT:  Use pg_get_backend_memory_contexts instead
>
> How about using pg_get_backend_memory_contexts() for both - local as
> well as other backend? Let PID argument default to NULL which would
> indicate local backend, otherwise some other backend?
>
I don't see much value in combining the two, especially since with
pg_get_process_memory_contexts() we can query both a postgres backend and a
background process; the name pg_get_backend_memory_context() would therefore
be inaccurate, and I am not sure whether a change to rename the existing
function would be welcome.

Please find an updated patch which fixes an issue seen in CI runs.

Thank you,
Rahila Syed


v5-Function-to-report-memory-context-stats-of-a-process.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2024-12-03 Thread Rahila Syed
ter
>
> OTOH if I modify the script to only look at client backends, and wait
> until the processes get "stuck" (i.e. waiting on the condition variable,
> consuming 0% CPU), I get this:
>
> $ pgbench -n -f select.sql -c 4 -T 10 test
> pgbench (18devel)
> WARNING:  Wait for 107146 process to publish stats timed out, try again
> WARNING:  Wait for 107144 process to publish stats timed out, try again
> WARNING:  Wait for 107147 process to publish stats timed out, try again
> transaction type: select.sql
> ...
>
> but when it gets 'stuck', most of the processes are still very much
> running (but waiting for contexts from some other process). In the above
> example I see this:
>
>  107144 ?Ss 0:02 postgres: user test [local] SELECT
>  107145 ?Ss 0:01 postgres: user test [local] SELECT
>  107147 ?Ss 0:02 postgres: user test [local] SELECT
>
> So yes, 107146 seems to be gone. But why would that block getting info
> from 107144 and 107147?
>
> Most likely 107144 and/or 107147 must also be waiting for 107146 which is
gone. Something like 107144 -> 107147 -> 107146(dead) or 107144
->107146(dead)
and 107147->107146(dead).


Maybe that's acceptable, but couldn't this be an issue with short-lived
> connections, making it hard to implement the kind of automated
> collection of stats that you envision. If it hits this kind of timeouts
> often, it'll make it hard to reliably collect info. No?


Yes, if there is a chain of waiting clients due to a process no longer
existing, the waiting time to receive information will increase. However,
as long as a failed request caused by a non-existent process is detected
promptly, the wait time should remain manageable, allowing other waiting
clients to obtain the requested information from the existing processes.

In such cases, it might be necessary to experiment with the waiting times
at the receiving
client. Making the waiting time user-configurable, as you suggested, by
passing it as an
argument to the function, could help address this scenario.
Thanks for highlighting this, I will test this some more.


> >
> >   > I opted for DSAs over DSMs to enable memory reuse by freeing
> >   > segments for subsequent statistics copies of the same backend,
> >   > without needing to recreate DSMs for each request.
> >
> > I feel like this might be a premature optimization - I don't have a
> > clear idea how expensive it is to create DSM per request, but my
> > intuition is that it's cheaper than processing the contexts and
> > generating the info.
> >
> > I'd just remove that, unless someone demonstrates it really matters.
> I
> > don't really worry about how expensive it is to process a request
> > (within reason, of course) - it will happen only very rarely. It's
> more
> > important to make sure there's no overhead when no one asks the
> backend
> > for memory context info, and simplicity.
> >
> > Also, how expensive it is to just keep the DSA "just in case"?
> Imagine
> > someone asks for the memory context info once - isn't it a was to
> still
> > keep the DSA? I don't recall how much resources could that be.
> >
> > I don't have a clear opinion on that, I'm more asking for opinions.
> >
> >
> > Imagining a tool that periodically queries the backends for statistics,
> > it would be beneficial to avoid recreating the DSAs for each call.
>
> I think it would be nice if you backed this with some numbers. I mean,
> how expensive is it to create/destroy the DSA? How does it compare to
> the other stuff this function needs to do?
>
> After instrumenting the code with timestamps, I observed that DSA creation
accounts for approximately 17% to 26% of the total execution time of the
function
pg_get_process_memory_contexts().

> Currently,  DSAs of size 1MB per process
> > (i.e., a maximum of 1MB * (MaxBackends + auxiliary processes))
> > would be created and pinned for subsequent reporting. This size does
> > not seem excessively high, even for approx 100 backends and
> > auxiliary processes.
> >
>
> That seems like a pretty substantial amount of memory reserved for each
> connection. IMHO the benefits would have to be pretty significant to
> justify this, especially considering it's kept "forever", even if you
> run the function only once per day.
>
> I can reduce the initial segment size to DSA_MIN_SEGMENT_SIZE, which is
256KB per process. If needed, this could grow up to 16MB based on the
current settings.

However, for the scenario you mentioned, it would be ideal to have a
mechanism
to mark a pinned DSA (using dsa_pin()) for deletion if it is not
used/attached within a
specified duration. Alternatively, I could avoid using dsa_pin()
altogether, allowing the
DSA to be automatically destroyed once all processes detach from it, and
recreate it
for a new request.

At the moment, I am unsure which approach is most feasible. Any suggestions
would be
greatly appreciated.

Thank you,
Rahila Syed


v7-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-01-08 Thread Rahila Syed
Hi Fujii-san,

Thank you for testing the feature.


> Issue 1: Error with pg_get_process_memory_contexts()
> When I used pg_get_process_memory_contexts() on the PID of a backend
> process
> that had just caused an error but hadn’t rolled back yet,
> the following error occurred:
>
>Session 1 (PID=70011):
>=# begin;
>=# select 1/0;
>ERROR:  division by zero
>
>Session 2:
>=# select * from pg_get_process_memory_contexts(70011, false);
>
>Session 1 terminated with:
>ERROR:  ResourceOwnerEnlarge called after release started
>FATAL:  terminating connection because protocol synchronization was lost
>
> In this scenario, a DSM segment descriptor is created and associated with
the
CurrentResourceOwner, which is set to the aborting transaction's resource
owner.
This occurs when ProcessGetMemoryContextInterrupts is called by the backend
while a transaction is still open and about to be rolled back.

I believe this issue needs to be addressed in the DSA and DSM code by
adding
a check to ensure that the CurrentResourceOwner is not about to be released
before
creating a DSM under the CurrentResourceOwner.

The attached fix resolves this issue. However, for a more comprehensive
solution,
I believe the same change should be extended to other parts of the DSA and
DSM
code where CurrentResourceOwner is referenced.

Issue 2: Segmentation Fault
> When I ran pg_get_process_memory_contexts() every 0.1 seconds using
> \watch command while running "make -j 4 installcheck-world",
> I encountered a segmentation fault:
>
>LOG:  client backend (PID 97975) was terminated by signal 11:
> Segmentation fault: 11
>DETAIL:  Failed process was running: select infinite_recurse();
>LOG:  terminating any other active server processes
>
> I have not been able to reproduce this issue. Could you please clarify
which process you ran
pg_get_process_memory_context() on, with the interval of 0.1? Was it a
backend process
created by make installcheck-world, or some other process?

Thank you,
Rahila Syed


fix_for_resource_owner_error.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-01-21 Thread Rahila Syed
 Moreover, with the second
approach,
determining an appropriate value for STALE_STATS_LIMIT is challenging, as
it
depends on the server's load.

Kindly let me know your preference. I have attached a patch which
implements the
2nd approach for testing, the 3rd approach being implemented in the v10
patch.


Thank you,
Rahila Syed


v11-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-01-24 Thread Rahila Syed
Hi,

>
> Just idea; as an another option, how about blocking new requests to
> the target process (e.g., causing them to fail with an error or
> returning NULL with a warning) if a previous request is still pending?
> Users can simply retry the request if it fails. IMO failing quickly
> seems preferable to getting stuck for a while in cases with concurrent
> requests.
>
> Thank you for the suggestion. I agree that it is better to fail early
and avoid
waiting for a timeout in such cases. I will add a "pending request" tracker
for
this in shared memory. This approach will help prevent sending a concurrent
request if a request for the same backend is still being processed.
IMO, one downside of throwing an error in such cases is that users might
wonder whether they need to take corrective action, even though the issue
will resolve itself and they just need to retry. Therefore, issuing a
warning or displaying previously updated statistics might be a better
alternative to throwing an error.

Thank you,
Rahila Syed


Re: per backend WAL statistics

2025-01-28 Thread Rahila Syed
Hi,

Thank you for the patchset. Having per-backend WAL statistics,
in addition to cluster-wide ones, is useful.

I had a few comments while looking at v6-0003-* patch.

+ /*
+ * This could be an auxiliary process but these do not report backend
+ * statistics due to pgstat_tracks_backend_bktype(), so there is no need
+ * for an extra call to AuxiliaryPidGetProc().
+ */
+ if (!proc)
+ PG_RETURN_NULL();

Maybe an explicit call to AuxiliaryPidGetProc() followed by a check for
pgstat_tracks_backend_bktype() would be more maintainable, since the set of
processes tracked by AuxiliaryPidGetProc() and pgstat_tracks_backend_bktype()
might diverge in the future.
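
A sketch of the structure I had in mind (illustrative only; the helper for
obtaining the process's backend type is a placeholder, not an existing API):
```
PGPROC	   *proc = BackendPidGetProc(pid);

if (proc == NULL)
	proc = AuxiliaryPidGetProc(pid);	/* also consider auxiliary processes */

if (proc == NULL ||
	!pgstat_tracks_backend_bktype(GetBackendTypeOfProc(proc)))	/* placeholder helper */
	PG_RETURN_NULL();
```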

On that note, it is not clear to me why the WAL writer statistics are not
included in the per-backend WAL statistics. I understand the same limitation
currently exists in pgstat_tracks_io_bktype(), but why does it need to be
extended to the WAL statistics?

+ pg_stat_get_backend_wal
+
+pg_stat_get_backend_wal (
integer )
+record
+   
Should the naming describe what is being returned more clearly?
Something like pg_stat_get_backend_wal_activity()? Currently it
suggests that it returns a backend's WAL, which is not the case.

+ if (pgstat_tracks_backend_bktype(MyBackendType))
+ {
+ PendingBackendStats.pending_wal.wal_write++;
+
+ if (track_wal_io_timing)
+ INSTR_TIME_ACCUM_DIFF(PendingBackendStats.pending_wal.wal_write_time,
+  end, start);
+ }
At the risk of nitpicking, may I suggest moving the above code, which is
under the
track_wal_io_timing check, to the existing check before this added chunk?
This way, all code related to track_wal_io_timing will be grouped together,
closer to where the "end" variable is computed.
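
Roughly, I was picturing a grouping like the following; this is a sketch of
the suggestion, not the actual patch hunk.
```
if (track_wal_io_timing)
{
	INSTR_TIME_SET_CURRENT(end);
	INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);

	if (pgstat_tracks_backend_bktype(MyBackendType))
		INSTR_TIME_ACCUM_DIFF(PendingBackendStats.pending_wal.wal_write_time,
							  end, start);
}

PendingWalStats.wal_write++;
if (pgstat_tracks_backend_bktype(MyBackendType))
	PendingBackendStats.pending_wal.wal_write++;
```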


Thank you,
Rahila Syed



On Tue, Jan 21, 2025 at 12:50 PM Bertrand Drouvot <
bertranddrouvot...@gmail.com> wrote:

> Hi,
>
> On Fri, Jan 17, 2025 at 08:43:57AM +0900, Michael Paquier wrote:
> > On Thu, Jan 16, 2025 at 12:44:20PM -0500, Andres Freund wrote:
> > > On 2025-01-16 17:11:09 +, Bertrand Drouvot wrote:
> > >> So, do you think that the initial proposal that has been made here
> (See R1. in
> > >> [2]) i.e make use of a new PendingBackendWalStats variable:
> > >
> > > Well, I think this first needs be fixed for for the IO stats change
> made in
> > >
> > > Once we have a pattern to model after, we can apply the same scheme
> here.
> >
> > Okay, thanks for the input.  I was not sure what you intended
> > originally with all this part of the backend code, and how much would
> > be acceptable.  The line is clear now.
> >
> > >> 0003 does not rely on pgstat_prep_backend_pending() for its pending
> statistics
> > >> but on a new PendingBackendWalStats variable. The reason is that the
> pending wal
> > >> statistics are incremented in a critical section (see XLogWrite(),
> and so
> > >> a call to pgstat_prep_pending_entry() could trigger a failed
> assertion:
> > >> MemoryContextAllocZero()->"CritSectionCount == 0 ||
> (context)->allowInCritSection"
> > >> "
> > >>
> > >> and implemented up to v4 is a viable approach?
> > >
> > > Yes-ish.  I think it would be better to make it slightly more general
> than
> > > that, handling this for all types of backend stats, not just for WAL.
> >
> > Agreed to use the same concept for all these parts of the backend
> > stats kind rather than two of them.  Will send a reply on the original
> > backend I/O thread as well.
>
> PFA v6 that now relies on the new PendingBackendStats variable introduced
> in
> 4feba03d8b9.
>
> Remark: I moved PendingBackendStats back to pgstat.h because I think that
> the
> "simple" pending stats increment that we are adding in xlog.c are not worth
> an extra function call overhead (while it made more sense for the more
> complex IO
> stats handling). So PendingBackendStats is now visible to the outside
> world like
> PendingWalStats and friends.
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
>


Re: Enhancing Memory Context Statistics Reporting

2025-01-29 Thread Rahila Syed
Hi,

On Sat, Jan 25, 2025 at 3:50 AM Tomas Vondra  wrote:

>
>
> On 1/24/25 14:47, Rahila Syed wrote:
> >
> > Hi,
> >
> >
> > Just idea; as an another option, how about blocking new requests to
> > the target process (e.g., causing them to fail with an error or
> > returning NULL with a warning) if a previous request is still
> pending?
> > Users can simply retry the request if it fails. IMO failing quickly
> > seems preferable to getting stuck for a while in cases with
> concurrent
> > requests.
> >
> > Thank you for the suggestion. I agree that it is better to fail
> > early and avoid waiting for a timeout in such cases. I will add a
> > "pending request" tracker for this in shared memory. This approach
> > will help prevent sending a concurrent request if a request for the
> > same backend is still being processed.
> >
>
> AFAIK these failures should be extremely rare - we're only talking about
> that because the workload I used for testing is highly concurrent, i.e.
> it requests memory context info extremely often. I doubt anyone sane is
> going to do that in practice ...

Yes, that makes sense.

>
>
> IMO, one downside of throwing an error in such cases is that the
> > users might wonder if they need to take a corrective action, even
> > though the issue is actually going to solve itself and they just
> > need to retry. Therefore, issuing a warning or displaying previously
> > updated statistics might be a better alternative to throwing an
> > error.
> >
>
> Wouldn't this be mostly mitigated by adding proper detail/hint to the
> error message? Sure, the user can always ignore that (especially when
> calling this from a script), but well ... we can only do so much.
>

 OK.

All this makes me think about how we shared pgstat data before the shmem
> approach was introduced in PG15. Until then the process signaled pgstat
> collector, and the collector wrote the statistics into a file, with a
> timestamp. And the process used the timestamp to decide if it's fresh
> enough ... Wouldn't the same approach work here?
>
> I imagined it would work something like this:
>
> requesting backend:
> ---
> * set request_ts to current timestamp
> * signal the target process, to generate memory context info
> * wait until the DSA gets filled with stats_ts > request_ts
> * return the data, don't erase anything
>
> target backend
> --
> * clear the signal
> * generate the statistics
> * set stats_ts to current timestamp
> * wait all the backends waiting for the stats (through CV)
>
> I see v11 does almost this, except that it accepts somewhat stale data.
>
That's correct.


> But why would that be necessary? I don't think it's needed, and I don't
> think we should accept data from before the process sends the signal.
>
This is done in an attempt to prevent concurrent requests from timing out.
In such cases, data in response to another request is likely to already be
in the
dynamic shared memory. Hence instead of waiting for the latest data and
risking a
timeout, the approach displays available statistics that are newer than a
defined
threshold. Additionally, since we can't distinguish between sequential and
concurrent requests, we accept somewhat stale data for all requests.

I realize this approach has some issues, mainly regarding how to determine
an appropriate threshold value or a limit for old data.

Therefore, I agree that it makes sense to display the data that is
published
after the request is made. If such data can't be published due to
concurrent
requests or other delays, the function should detect this and return as
soon as
possible.


Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-01-12 Thread Rahila Syed
tion
in dsa and some extra computation to populate all the paths in hash table
inspite of
get_summary being true.


>
> Attached are two patches - 0001 is the original patch, 0002 has most of
> my review comments (mentioned above), and a couple additional changes to
> comments/formatting, etc. Those are suggestions rather than issues.
>
> Thank you, applied the 0002 patch and made the changes mentioned in XXX.

Answering some of your questions in the 0002 patch below:

Q. * XXX Also, what if we fill exactly this number of contexts? Won't we
 * lose the last entry because it will be overwritten by the summary?
A. We fill slots 0 to max_stats - 2 with memory context stats in the loop
   foreach_ptr(MemoryContextData, cur, contexts) in
   ProcessGetMemoryContextInterrupt(). Slot max_stats - 1 is reserved for
   the summary statistics.

Q. /* XXX I don't understand why we need to check get_summary here? */
A. The get_summary check ensures that the context_id is inserted into the
   hash table when get_summary is true. In that case the loop breaks after
   the first iteration, so the main list of contexts is not traversed and
   the context_ids would otherwise not be inserted; hence it is handled
   separately inside a get_summary check.

Q. /* XXX What if the memstats_dsa_pointer is not valid? Is it even
possible?
 * If it is, we have garbage in memctx_info. Maybe it should be an
Assert()? */
A . Agreed. Changed it to an assert.

Q./*
 * XXX isn't 2 x 1kB for every context a bit too much? Maybe better
to
 * make it variable-length?
 */
A. I don't know how to do this for a variable in shared memory; wouldn't
   that mean allocating from the heap, so the pointer would become invalid
   in another process?



Thank you,
Rahila Syed


v10-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

2025-01-13 Thread Rahila Syed
Hi,

If a DSM is created or attached from an interrupt handler while a
transaction is being
rolled back, it may result in the following error.
"ResourceOwnerEnlarge called after release started"
This was found during the testing of Enhancing Memory Context Reporting
feature
by Fujii Masao [1].

I propose a fix to fall back to creating a DSM or DSA with a NULL resource
owner if
CurrentResourceOwner is set to "releasing".
Since a DSM or DSA can already be created with a NULL resource owner—
meaning it persists for the lifetime of the backend or until explicitly
detached—
this approach should be acceptable.

Please find attached a patch which does that. Kindly let me know your views.
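
Conceptually, the fallback amounts to something like the sketch below
(simplified, not the exact patch code; the check is along the lines of the
IsResourceOwnerReleasing() test used in the patch):
```
ResourceOwner res = CurrentResourceOwner;

if (res != NULL && IsResourceOwnerReleasing(res))
	res = NULL;		/* then lives until backend exit or explicit detach */

/* ... register the DSM segment / DSA area under 'res' ... */
```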


[1].
 
https://www.postgresql.org/message-id/a1a7e2b7-8f33-4313-baff-42e92ec14fd3%40oss.nttdata.com



0001-Prevent-the-error-on-creating-a-dsm-segment-from-an-.patch
Description: Binary data


Re: question about relation_open

2025-01-14 Thread Rahila Syed
Hi,

>
>
> I want to call the function ReadBufferExtended
> to get the raw data of a given buffer
> and I've read in some examples that I need to call relation_open
> first, in order to get a Relation variable and also lock the relation.
>
> The function relation_open returns a non NULL pointer in my extension, but
> then the last line in the following snippet crashes postgres.
>

You mentioned that the postgres crashes. Did you mean it caused a
Segmentation fault or something else like PANIC?
If possible, can you share a core dump in that case?

Also, can you share any errors that you see in logs when postgres crashes?

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2024-12-24 Thread Rahila Syed
ue of request timeouts in scenarios with concurrent
requests.
I have included this in the attached patch.



> and the other processes were waiting for 107145 in some way. But yeah,
> detecting the dead process would improve this, although it also shows
> the issues can "spread" easily.
>
> OTOH it's unlikely to have multiple pg_get_process_memory_contexts()
> queries pointing at each other like this - monitoring will just to that
> from one backend, and that's it. So not a huge issue.
>
> Makes sense.


> .
> >
> > In such cases, it might be necessary to experiment with the waiting
> > times at the receiving
> > client. Making the waiting time user-configurable, as you suggested, by
> > passing it as an
> > argument to the function, could help address this scenario.
> > Thanks for highlighting this, I will test this some more.
> >
>
> I think we should try very hard to make this work well without the user
> having to mess with the timeouts. These are exceptional conditions that
> happen only very rarely, which makes it hard to find good values.
>
> OK.


> I'm entirely unconcerned about the pg_get_process_memory_contexts()
> performance, within some reasonable limits. It's something executed
> every now and then - no one is going to complain it takes 10ms extra,
> measure tps with this function, etc.
>
> 17-26% seems surprisingly high, but Even 256kB is too much, IMHO. I'd
> just get rid of this optimization until someone complains and explains
> why it's worth it.
>
> Yes, let's make it fast, but I don't think we should optimize it at the
> expense of "regular workload" ...
>
>
After debugging the concurrent requests timeout issue, it appears there is
yet another argument in favor of avoiding the recreation of DSAs for every
request: we get to retain the last reported statistics for a given postgres
process, which can help prevent certain requests from failing in the case
of concurrent requests to the same process.

Thank you,
Rahila Syed


v8-0001-Function-to-report-memory-context-stats-of-any-backe.patch
Description: Binary data


Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

2025-03-24 Thread Rahila Syed
Hi,

Please find the attached updated and rebased patch.
I have added a test in the test_dsa module that uses a function
to create a dsa area. This function is called after
the resowner->releasing is set to true, using an injection point.

Thank you,
Rahila Syed


v2-0001-Prevent-the-error-on-creating-a-dsm-segment-from-an-.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-03-17 Thread Rahila Syed
Hi Alexander,

Thank you for the review.


> It looks like we're increasing *num_contexts twice per child memory
> context.  First, it gets increased with a recursive
> MemoryContextStatsInternal() call, then by adding an ichild.  I might
> be wrong, but I think these calculations at least deserve more
> comments.
>

I believe that's not the case. The recursive calls only work for children
encountered up to max_level and less than max_children per context.
The rest of the children are handled using MemoryContextTraverseNext,
without recursive calls. Thus, num_contexts is incremented for those
children separately from the recursive call counter.

I will add more comments around this.


> v17-0002-Function-to-report-memory-context-statistics.patch
>
> +   if (procNumber == MyProcNumber)
> +   {
> +   ereport(WARNING,
> +   errmsg("cannot return statistics for local backend"),
> +   errhint("Use pg_backend_memory_contexts view instead."));
> +   PG_RETURN_NULL();
> +   }
>
> Is it worth it to keep this restriction?  Can we fetch info about
> local memory context for the sake of generality?
>
>
I think that could be done, but using pg_backend_memory_context would
be more efficient in this case.


> I know there have been discussions in the thread before, but the
> mechanism of publishing memory context stats via DSA looks quite
> complicated.  Also, the user probably intends to inspect memory
> contexts when there is not a huge amount of free memory.  So, failure
> is probable on DSA allocation.  Could we do simpler?  For instance,
> allocate some amount of static shared memory and use it as a message
> queue between processes.  As a heavy load is not supposed to be here,
> I think one queue would be enough.
>
>
There could be other uses for such a function, such as a monitoring
dashboard
that periodically queries all running backends for memory statistics. If we
use a
single queue shared between all the backends, they will need to wait for
the queue
to become available before sharing their statistics, leading to processing
delays at
the publishing backend.

Even with separate queues for each backend or without expecting concurrent
use,
publishing statistics could be delayed if a message queue is full. This is
because a
backend needs to wait for a client process to consume existing messages or
statistics before publishing more.
If a client process exits without consuming messages, the publishing
backend will
experience timeouts when trying to publish stats. This will impact backend
performance
as statistics are published during CHECK_FOR_INTERRUPTS.

In the current implementation, the backend publishes all the statistics in
one go
without waiting for clients to read any statistics.

In addition, allocating complete message queues in static shared memory can
be
expensive, especially since these static structures need to be created even
if memory
context statistics are never queried.
In contrast, a DSA is created for the feature only when statistics are first
queried. We are not preallocating shared memory for this feature, except for
small structures to store the dsa_handle and dsa_pointers for each backend.

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-03-20 Thread Rahila Syed
Hi,

>>
> >> +   if (procNumber == MyProcNumber)
> >> +   {
> >> +   ereport(WARNING,
> >> +   errmsg("cannot return statistics for local backend"),
> >> +   errhint("Use pg_backend_memory_contexts view
> instead."));
> >> +   PG_RETURN_NULL();
> >> +   }
> >>
> >> Is it worth it to keep this restriction?  Can we fetch info about
> >> local memory context for the sake of generality?
> >>
> >
> > I think that could be done, but using pg_backend_memory_context would
> > be more efficient in this case.
> >
>
> I have raised a similar concern before. Having two separate functions
> one for local backend and other for remote is going to be confusing.
> We should have one function doing both and renamed appropriately.
>
>
This is a separate concern from what has been raised by Alexander.
He has suggested removing the restriction and fetching local backend
statistics also
with the proposed function.
I've removed this restriction in the latest version of the patch. Now, the
proposed
function can be used to fetch local backend statistics too.

Regarding your suggestion on merging these functions, although they both
report memory
context statistics, they differ in how they fetch these statistics—locally
versus from dynamic
shared memory. Additionally, the function signatures are different: the
proposed function
takes three arguments (pid, get_summary, and num_tries), whereas
pg_get_backend_memory_contexts does not take any arguments. Therefore, I
believe
these functions can remain separate as long as we document their usages
correctly.

Please find attached rebased and updated patches. I have added some more
comments and fixed an issue caused by registering a before_shmem_exit
callback from the interrupt processing routine. To address this issue, I am
registering this callback in the InitProcess() function.

This happened because interrupt processing could be triggered from a
PG_ENSURE_ERROR_CLEANUP block. This block operates under the assumption that
the before_shmem_exit callback registered at the beginning of the block will
be the last one in the registered callback list at the end of the block,
which would not be the case if the before_shmem_exit callback were
registered in the interrupt handling routine.

Thank you,
Rahila Syed


v18-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


v18-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-04-05 Thread Rahila Syed
Hi,

I think it's almost committable. Attached is v8 with some minor review
> adjustments, and updated commit messages. Please read through those and
> feel free to suggest changes.
>
>
The changes look good to me.
About the following question:
   /* XXX what about segment size? should check have HASH_SEGMENT? */
Do you mean that, for a shared hash table, the caller should have specified
HASH_SEGMENT in flags?
It appears that the current code does not require this change; all the
shared hash tables seem to use the default segment size.
I left the comment as it is, since I am not sure whether you intend to
remove it or not.


> I still found the hash_get_init_size() comment unclear, and it also
> referenced init_size, which is no longer relevant. I improved the
> comment a bit (I find it useful to mimic comments of nearby functions,
> so I did that too here). The "initial_elems" name was a bit confusing,
> as it seemed to suggest "number of elements", but it's a simple flag. So
> I renamed it to "prealloc", which seems clearer to me. I also tweaked
> (reordered/reformatted) the conditions a bit.


I appreciate your edits; the comment and code are clearer now.

PFA the patches after merging the review patches.

Thank you,
Rahila Syed


v9-0001-Improve-acounting-for-memory-used-by-shared-hash-tab.patch
Description: Binary data


v9-0002-Improve-accounting-for-PredXactList-RWConflictPool-a.patch
Description: Binary data


v9-0003-Add-cacheline-padding-between-heavily-accessed-array.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-04-06 Thread Rahila Syed
Hi Daniel,


>
> After a bit more polish I landed with the attached, which I most likely
> will go
> ahead with after another round in CI.
>

Thank you for refining the code. The changes look good to me.
Regression tests ran smoothly in parallel with the memory monitoring
function, and pgbench results with the following custom script also show
good performance.
```
SELECT * FROM pg_get_process_memory_contexts(
  (SELECT pid FROM pg_stat_activity
ORDER BY random() LIMIT 1)
  , false, 5);
```

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-03-26 Thread Rahila Syed
Hi Daniel,

Thank you for your review.

I have incorporated all your changes in v20 patches and ensured that the
review comments
corresponding to 0001 patch are included in that patch and not in 0002.


>
> +MEM_CTX_PUBLISH"Waiting for backend to publish memory information."
> I wonder if this should really be "process" and not backend?
>
> Fixed.


>
> +   default:
> +   context_type = "???";
> +   break;
> In ContextTypeToString() I'm having second thoughts about this, there
> shouldn't
> be any legitimate use-case of passing a nodetag this function which would
> fail
> MemoryContextIsValid().  I wonder if we aren't helping callers more by
> erroring
> out rather than silently returning an unknown?  I haven't changed this but
> maybe we should to set the API right from the start?
>

I cannot think of any legitimate scenario where the context type would be
unknown. However, if we were to throw an error, it would prevent us from
reporting any memory usage information when the context type is
unidentified. Perhaps it would be more informative and less restrictive to
label it as "Unrecognized" or "Unknown". I wonder if this was the reasoning
behind doing it this way when it was added with the
pg_backend_memory_contexts() function.


>
> +   /*
> +* Recheck the state of the backend before sleeping on the
> condition
> +* variable
> +*/
> +   proc = BackendPidGetProc(pid);
> Here we are really rechecking that the process is still alive, but I
> wonder if
> we should take the opportunity to ensure that the type is what we expect
> it to
> be?  If the pid has moved from being a backend to an aux proc or vice
> versa we
> really don't want to go on.
>
>
The reasoning makes sense to me. For periodic monitoring of all processes,
any PID that reincarnates into a different type could be queried in
subsequent
function calls. Regarding targeted monitoring of a specific process, such a
reincarnated
process would exhibit a completely different memory consumption,
likely not aligning with the user's original intent behind requesting the
statistics.



>
> + ereport(WARNING,
> + errmsg("PID %d is not a PostgreSQL server process",
> +pid));
> I wonder if we should differentiate between the warnings?  When we hit
> this in
> the loop the errmsg is describing a slightly different case.  I did leave
> it
> for now, but it's food for thought if we should perhaps reword this one.
>
>
Changed it to  "PID %d is no longer the same PostgreSQL server process".


>
> +   ereport(LOG,
> +   errmsg("hash table corrupted, can't construct path
> value"));
> I know you switched from elog(LOG..  to ereport(LOG..  but I still think a
> LOG
> entry stating corruption isn't helpful, it's not actionable for the user.
> Given that it's a case that shouldn't happen I wonder if we should
> downgrade it
> to an Assert(false) and potentially a DEBUG1?
>
How about changing it to ERROR, in accordance with the current occurrences
of the same message? I did that in the attached version; however, I am open
to changing it to an Assert(false) and DEBUG1.

Apart from the above, I made the following improvements.

1. Eliminated the unnecessary creation of an extra memory context before
calling hash_create.
The hash_create function already generates a memory context containing the
hash table,
enabling easy memory deallocation by simply deleting the context via
hash_destroy.
Therefore, the patch relies on hash_destroy for memory management instead
of manual freeing.

2. Optimized memory usage by storing the path as an array of integers
rather than as an array of
Datums.
This approach conserves DSA memory allocated for storing this information.

3. Miscellaneous comment cleanups and introduced macros to simplify
calculations.


Thank you,
Rahila Syed


v20-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


v20-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-04-07 Thread Rahila Syed
Hi Daniel, Andres,


>
> > >> +} MemoryContextState;
> > >
> > > IMO that's too generic a name for something in a header.
> > >
> > >> +} MemoryContextId;
> > >
> > > This too.  Particularly because MemoryContextData->ident exist but is
> > > something different.
> >
> > Renamed both to use MemoryContextReporting* namespace, which leaves
> > MemoryContextReportingBackendState at an unwieldly long name.  I'm
> running out
> > of ideas on how to improve and it does make purpose quite explicit at
> least.
>
> How about
>
> MemoryContextReportingBackendState -> MemoryStatsBackendState
> MemoryContextReportingId -> MemoryStatsContextId
> MemoryContextReportingSharedState -> MemoryStatsCtl
> MemoryContextReportingStatsEntry -> MemoryStatsEntry
>
>
>
Fixed accordingly.


> > >> + /* context id starts with 1 */
> > >> + entry->context_id = ++(*stats_count);
> > >
> > > Given that we don't actually do anything here relating to starting
> with 1, I
> > > find that comment confusing.
> >
> > Reworded, not sure if it's much better tbh.
>
> I'd probably just remove the comment.
>
>
Reworded to mention that we pre-increment stats_count to make sure
id starts with 1.

>
> > > Hm. First I thought we'd leak memory if this second (and subsequent)
> > > dsa_allocate failed. Then I thought we'd be ok, because the memory
> would be
> > > memory because it'd be reachable from
> memCtxState[idx].memstats_dsa_pointer.
> > >
> > > But I think it wouldn't *quite* work, because
> memCtxState[idx].total_stats is
> > > only set *after* we would have failed.
> >
> > Keeping a running total in .total_stats should make the leak window
> smaller.
>
> Why not just initialize .total_stats *before* calling any fallible code?
> Afaict it's zero-allocated, so the free function should have no problem
> dealing with the entries that haven't yet been populated/
>
>
Fixed accordingly.

PFA a v28 which passes all local and github CI tests.

Thank you,
Rahila Syed


v28-0001-Add-function-to-get-memory-context-stats-for-process.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-03-28 Thread Rahila Syed
 as per hash_create() requirements and
CACHELINEALIGN
will be taken care of in ShmemInitStruct at the time of allocating the
entire chunk.


> But more importantly, it adds alignment to all hctl field, and to every
> element of those arrays. But that's not what the alignment was supposed
> to do - it was supposed to align arrays, not individual elements. Not
> only would this waste memory, it would actually break direct access to
> those array elements.
>

I think the existing code has occurrences of both, i.e., aligning individual
elements and aligning arrays.
A similar precedent exists in hash_estimate_size(), which applies max
alignment not only to individual structs like HASHHDR, HASHELEMENT, and
entrysize, but also to an array of HASHBUCKET headers.

I agree with you that perhaps we don't need max alignment for all of these
structures.
For example, HASHBUCKET is a pointer to a linked list of elements; it might
not require alignment if the elements it points to are already aligned.


> But there's another detail - even before this patch, most of the stuff
> was allocated at once by ShmemInitStruct(). Everything except for the
> elements, so to replicate the alignment we only need to worry about that
> last part. So I think this should do:
>


> +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
> +CACHELINEALIGN(sizeof(HASHHDR) + \
> + ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
> + ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
>
> This is what the 0003 patch does. There's still one minor difference, in
> that we used to align each segment independently - each element_alloc()
> call allocated a new CACHELINEALIGN-ed chunk, while now have just a
> single chunk. But I think that's OK.
>
>
Before this patch, the following structures were allocated separately using
ShmemAllocRaw: the directory, each segment (seg_alloc), and a chunk of
elements (element_alloc). Hence, I don't understand why the 0003 patch
applies CACHELINEALIGN in the manner it does.

I think if we want to emulate the current behaviour we should do something
like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT))
  + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
  + CACHELINEALIGN(init_size * elementSize);

Like you mentioned the only difference would be that we would be aligning
all elements
at once instead of aligning individual partitions of elements.


3) I find the comment before hash_get_init_size a bit unclear/confusing.
> It says this:
>
>  * init_size should match the total number of elements allocated during
>  * hash table creation, it could be zero for non-shared hash tables
>  * depending on the value of nelem_alloc. For more explanation see
>  * comments within this function.
>  *
>  * nelem_alloc parameter is not relevant for shared hash tables.
>
> What does "should match" mean here? Doesn't it *determine* the number of
> elements allocated? What if it doesn't match?
>

By "should match" I mean that init_size here *is* equal to nelem in
hash_create().

>
> AFAICS it means the hash table is sized to expect init_size elements,
> but only nelem_alloc elements are actually pre-allocated, right?


No. All the init_size elements are pre-allocated for a shared hash table,
irrespective of the nelem_alloc value.
For non-shared hash tables, init_size elements are pre-allocated only if
init_size is less than nelem_alloc; otherwise they are allocated as part of
expansion.
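
In other words, the rule is roughly the following; this is a sketch of the
logic described above, not the dynahash code itself.
```
if (flags & HASH_SHARED_MEM)
	nprealloc = nelem;			/* shared table: preallocate everything up front */
else
	nprealloc = (nelem < nelem_alloc) ? nelem : 0;	/* small private tables only */
```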


> But the
> comment says it's init_size which determines the number of elements
> allocated during creation. Confusing.
>
> It says "it could be zero ... depending on the value of nelem_alloc".
> Depending how? What's the relationship.
>
>
The relationship is defined in this comment:
 /*
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc.  This avoids wasting
* space if the caller correctly estimates a small table size.
*/

The hash_create() code is confusing because the variable named nelem_alloc
is used in two different cases. In the above case, nelem_alloc refers to the
value returned by the choose_nelem_alloc() function.

The other nelem_alloc determines the number of elements in each partition of
a partitioned hash table. This is not what is being referred to in the above
comment.

The bit "For more explanation see comments within this function" is not
> great, if only because there are not many comments within the function,
> so there's no "more explanation". But if there's something important, it
> should be in the main comment, preferably.
>
>
I will improve the comment in the next version.

Thank you,
Rahila Syed


Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

2025-04-11 Thread Rahila Syed
Hi Daniel,

Thank you for your review and code improvements.

Please find below some observations.

1. dsm_unpin_mapping(dsm_segment *seg)
+   if (CurrentResourceOwner &&
IsResourceOwnerReleasing(CurrentResourceOwner))
+   return;

Given that the function can return without setting resowner to a
CurrentResourceOwner that is not NULL, shall we change the function
signature to return true when the "unpin" is successful and false when it
is not? This behavior existed earlier too, but adding the check has made it
explicit. Although this function is not currently in use, it would be
beneficial to make the API more explicit about the outcome.

2. If the value of IsResourceOwnerReleasing() changes between
dsm_create_descriptor() and attach_internal(), the DSM segment and the DSA
area will end up with different resource owners.
Earlier, the chances of CurrentResourceOwner changing between the two
functions were zero.
Maybe something can be done to keep the resowner assignments in both these
functions in sync.

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-04-07 Thread Rahila Syed
Hi,

Please see some responses below.

On Mon, Apr 7, 2025 at 9:13 PM Andres Freund  wrote:

> Hi,
>
> On 2025-04-07 15:41:37 +0200, Daniel Gustafsson wrote:
> > I think this function can be a valuable debugging aid going forward.
>
> What I am most excited about for this is to be able to measure server-wide
> and
> fleet-wide memory usage over time. Today I have actually very little idea
> about what memory is being used for across all connections, not to speak
> of a
> larger number of servers.
>
>
> > diff --git a/src/backend/postmaster/auxprocess.c
> b/src/backend/postmaster/auxprocess.c
> > index 4f6795f7265..d3b4df27935 100644
> > --- a/src/backend/postmaster/auxprocess.c
> > +++ b/src/backend/postmaster/auxprocess.c
> > @@ -84,6 +84,13 @@ AuxiliaryProcessMainCommon(void)
> >   /* register a before-shutdown callback for LWLock cleanup */
> >   before_shmem_exit(ShutdownAuxiliaryProcess, 0);
> >
> > + /*
> > +  * The before shmem exit callback frees the DSA memory occupied by
> the
> > +  * latest memory context statistics that could be published by
> this aux
> > +  * proc if requested.
> > +  */
> > + before_shmem_exit(AtProcExit_memstats_dsa_free, 0);
> > +
> >   SetProcessingMode(NormalProcessing);
> >  }
>
> How about putting it into BaseInit()?  Or maybe just register it when its
> first used?
>
>
The problem with registering it when the DSA is first used is that the DSA is
used in an interrupt handler. The handler could be called from within a
PG_ENSURE_ERROR_CLEANUP block. That block operates under the assumption that
the before_shmem_exit callback registered at its beginning is still the last
one in the callback list at the end of the block. However, this won't hold if
another callback is registered from an interrupt handler invoked inside the
PG_ENSURE_ERROR_CLEANUP block.


> I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
> something that makes sense to use here?
>
To determine the per-backend memory limit in multiples of
DSA_DEFAULT_INIT_SEGMENT_SIZE. Currently it is set to
1 * DSA_DEFAULT_INIT_SEGMENT_SIZE. Since a call to dsa_create would create a
DSA segment of this size, I thought it made sense to define the limit in terms
of the segment size.



> > +/*
>
> + /* If the dsm mapping could not be found, attach to the area */
> > + if (dsm_seg != NULL)
> > + return;
>
> I don't understand what we do here with the dsm?  Why do we not need
> cleanup
> if we are already attached to the dsm segment?
>

I am not expecting to hit this case, since we always detach from the DSA.
This could be an assert, but since it is cleanup code, I thought returning
would be a harmless step.

Thank you,
Rahila Syed


Re: Add pg_get_injection_points() for information of injection points

2025-04-28 Thread Rahila Syed
Hi Michael,

Thank you for the updated patches.

> > Would it be useful to put the logic of the above function under #ifdef
> > USE_INJECTION_POINTS?  This approach would make it possible to
> > distinguish between cases where no injection points are attached and
> > instances where the build does not support injection points.
>
> For this one, I've found that InjectionPointList() was the incorrect
> bit: we can make it issue an elog(ERROR) if !USE_INJECTION_POINTS.
> This way, none of its callers will be confused between the case of a
> NIL List meaning either !USE_INJECTION_POINTS or that there are no
> points attached if the build uses USE_INJECTION_POINTS.
>

The changes LGTM.

Should the execution privileges on the function be restricted to a role
like pg_monitor?

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-04-10 Thread Rahila Syed
>
>
>
> That's not an argument against moving it to BaseInit() though, as that's
> called before procsignal is even initialized and before signals are
> unmasked.
>

Yes, OK.

> I don't really understand why DSA_DEFAULT_INIT_SEGMENT_SIZE is
> >
> > something that makes sense to use here?
> > >
> > >
> > To determine the memory  limit per backend in multiples of
> > DSA_DEFAULT_INIT_SEGMENT_SIZE.
> > Currently it is set to  1 * DSA_DEFAULT_INIT_SEGMENT_SIZE.
> > Since a call to dsa_create would create a DSA segment of this size, I
> > thought it makes sense
> > to define a limit related to the segment size.
>
> I strongly disagree.  The limit should be in an understandable unit, not on
> another subystems's defaults that might change at some point.
>

OK, makes sense.


>
>
> > > + /* If the dsm mapping could not be found, attach to the area */
> > > > + if (dsm_seg != NULL)
> > > > + return;
> > >
> > > I don't understand what we do here with the dsm?  Why do we not need
> > > cleanup
> > > if we are already attached to the dsm segment?
> > >
> >
> > I am not expecting to hit this case, since we are always detaching from
> the
> > dsa.
>
> Pretty sure it's reachable, consider a failure of dsa_allocate(). That'll
> throw an error, while attached to the segment.
>
>
You are right, I did not think of this scenario.


>
> > This could be an assert but since it is a cleanup code, I thought
> returning
> > would be a harmless step.
>
> The problem is that the code seems wrong - if we are already attached we'll
> leak the memory!
>
>
I understand your concern. One issue I recall is that we do not have a
dsa_find_mapping() function similar to dsm_find_mapping(). If I understand
correctly, the only way to access an already attached DSA is to store the DSA
area mapping in a global variable. I'm considering using a global variable and
accessing it from the cleanup function when the area is already mapped.
Does that sound fine?
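
A minimal sketch of what I mean (the variable name matches what the patch
uses elsewhere; the callback shape is only illustrative):

    /* file-level pointer, set when this process attaches to the area */
    static dsa_area *MemoryStatsDsaArea = NULL;

    /* in the before_shmem_exit cleanup path */
    if (MemoryStatsDsaArea != NULL)
        dsa_detach(MemoryStatsDsaArea);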


> As I also mentioned, I don't understand why we're constantly
> attaching/detaching from the dsa/dsm either. It just seems to make things
> more
> complicated an dmore expensive.
>

OK, I see that this could be expensive if a process is being queried for
statistics periodically. However, in scenarios where a process is queried for
memory statistics only once, keeping the area mapped would consume memory
resources, correct?


Thank you,
Rahila Syed


Re: Align memory context level numbering in pg_log_backend_memory_contexts()

2025-04-17 Thread Rahila Syed
 Hi,


> The attached patch is how I think we should do it.
>

Thank you for the patch.
I tested this patch and it works fine. I agree with the changes made in it.

Regarding the v2 patch:
-   int level = 0;

Retaining the level variable will enhance the code readability, IMO.

Thank you,
Rahila Syed


Re: Prevent an error on attaching/creating a DSM/DSA from an interrupt handler.

2025-05-01 Thread Rahila Syed
Hi,

On Thu, May 1, 2025 at 4:26 AM Michael Paquier  wrote:

> On Wed, Apr 30, 2025 at 06:03:49PM -0400, Robert Haas wrote:
> > Sorry to turn up late here, but I strongly disagree with the notion
> > that this is a bug in the DSM or DSA code. It seems to me that it is
> > the caller's responsibility to provide a valid resource owner, not the
> > job of the called code to ignore the resource owner when it's
> > unusable. I suspect that there are many other parts of the system that
> > rely on the ResourceOwner machinery which likewise assume that the
> > ResourceOwner that they are passed is valid.
>
> Yeah, dshash would be one, I think.  It feels to me that if you want
> to enforce this kind of policy to be checked, this is something that
> should be done in the shape of one or more assertion based the state
> of the resource owner expected in these low-level paths rather than
> tweaking the DSA and DSM code to do what you are expecting here, and
> only enforce such new policies on HEAD to avoid disruption with
> existing systems.
>

I believe it would be particularly useful to add an assertion in
dsm_unpin_mapping(). This function relies on CurrentResourceOwner being
non-NULL and not in the process of being released in order to successfully
unpin a mapping.

>
> I'm actually rather scared of the patch, isn't there a risk of
> breaking existing patterns that worked out of the box by forcing the
> resowner to not be set?


In this context, resowner not being set means that the dsm segment
would remain attached until the session ends or until it is explicitly
detached.  IIUC, the key difference is that a segment will stay mapped
for longer than expected in cases where the mapping was created
when a transaction was aborting.

Thank you for the review comments, it makes sense to include this
check in the callers of the DSA API.

Wrapping the dsa_create/dsa_attach call in the following snippet helps
resolve the issue in case of ProcessGetMemoryContextInterrupt:

+   ResourceOwner   current_owner;
+   bool            res_releasing = false;
+
+   if (CurrentResourceOwner &&
+       IsResourceOwnerReleasing(CurrentResourceOwner))
+   {
+       current_owner = CurrentResourceOwner;
+       CurrentResourceOwner = NULL;
+       res_releasing = true;
+   }
+
    MemoryStatsDsaArea = dsa_create(memCxtArea->lw_lock.tranche);

+   if (res_releasing)
+       CurrentResourceOwner = current_owner;

Kindly let me know your views.

Thank you,
Rahila Syed


Re: Add pg_get_injection_points() for information of injection points

2025-04-15 Thread Rahila Syed
Hi,

Thank you for this enhancement to injection points.

Following are a few comments.

+ * This overestimates the allocation size, including slots that may be
+ * free, for simplicity.
+ */
+ result = palloc0(sizeof(InjectionPointData) * max_inuse);
+

The result variable name is a bit generic. How about renaming it to
inj_points/injpnt_list or something similar?

+typedef struct InjectionPointData
+{
+ const char *name;
+ const char *library;
+ const char *function;
+} InjectionPointData;

+/*
+ * Retrieve a list of all the injection points currently attached.
+ *
+ * This list is palloc'd in the current memory context.
+ */
+InjectionPointData *
+InjectionPointList(uint32 *num_points)

The function actually retrieves an array, not a list, so the comment and the
function name may be misleading. The function not only retrieves an array of
injection points but also the number of injection points. Would it be more
convenient to define a single struct that includes both the array of
injection points and the number of points? This way, both values can be
returned from the function.
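
Roughly like the following, where the struct name and layout are only a
suggestion and not the committed API:

    typedef struct InjectionPointsArray
    {
        uint32               num_points;    /* number of attached points */
        InjectionPointData  *points;        /* palloc'd array of num_points entries */
    } InjectionPointsArray;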

> The second patch is still something folks seem to be meh about, but if
> the end consensus is to put the new SQL function in the module
> injection_points we are going to need the first patch anyway.
>

FWIW, I believe it would be beneficial to incorporate the new SQL function
into the core. This would eliminate the need for users to install the
injection_points module solely to monitor injection points attached in the
other modules or within the core itself.

+Datum
+pg_get_injection_points(PG_FUNCTION_ARGS)

Would it be useful to put the logic of the above function under #ifdef
USE_INJECTION_POINTS?  This approach would make it possible to
distinguish between cases where no injection points are attached and
instances where the build does not support injection points.

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-04-29 Thread Rahila Syed
Hi,

Please find attached a patch with some comment and documentation changes.
Additionally, it adds a missing '\0' termination to the "Remaining Totals"
string. I think this became necessary after we replaced dsa_allocate0() with
dsa_allocate() in the latest version.

Thank you,
Rahila Syed


0001-Fix-typos-and-modify-few-comments.patch
Description: Binary data


Re: Memory context can be its own parent and child in replication command

2025-04-21 Thread Rahila Syed
Hi,


> So here's a v4 with the test restored.
>

I tested this patch and it fixes the reported issue. It passes the GitHub CI tests.


> already does that, so the only new assert would be in
> MemoryContextCreate.


+1 for adding the assertion to increase the chances of this bug being
caught by memory context infrastructure.

I had the following comment.

Why do we do this:

-   MemoryContext old_context;
+   MemoryContext old_context = CurrentMemoryContext;

instead of implementing it as in the previous version of this code, i.e.

    old_context = MemoryContextSwitchTo(cmd_context);

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-02-18 Thread Rahila Syed
Hi,

>
> Thanks for updating the patch!
>
> The below comments would be a bit too detailed at this stage, but I’d
> like to share the points I noticed.
>
> Thanks for sharing the detailed comments. I have incorporated some of them
into the new version of the patch. I will include the rest when I refine and
comment the code further.

Meanwhile, I have fixed the following outstanding issues:

1.  Currently one DSA  is created per backend when the first request for
> statistics is made and remains for the lifetime of the server.
> I think I should add logic to periodically destroy DSAs, when memory
> context statistics are not being *actively* queried from the backend,
> as determined by the statistics timestamp.
>

After an offline discussion with Andres and Tomas, I have changed this to use
only one DSA for all the publishing backends/processes. Each backend allocates
smaller chunks of memory within the DSA while publishing statistics. These
chunks are tracked independently by each backend, ensuring that two publishing
backends/processes do not block each other despite using the same DSA. This
approach eliminates the overhead of creating one DSA per backend.
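
Roughly, each publishing backend does the following (required_size and
prev_chunk are illustrative names, not necessarily what the patch uses):

    dsa_pointer chunk;

    chunk = dsa_allocate(area, required_size);
    /* copy this backend's memory context statistics into the chunk */

    /* once the new statistics are in place, free the previously published ones */
    if (DsaPointerIsValid(prev_chunk))
        dsa_free(area, prev_chunk);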

I am not destroying the DSA area because it stores the previously published
statistics for each process. This allows the system to display older
statistics when the latest data cannot be retrieved within a reasonable time.
Only the most recently updated statistics are kept, while all earlier ones are
freed using dsa_free by each backend when they are no longer needed.

> 2. The two issues reported by Fujii-san here: [1].
> i. I have proposed a fix for the first issue here [2].
> ii. I am able to reproduce the second issue. This happens when we try
> to query statistics of a backend running infinite_recurse.sql. While I am
> working on finding a root-cause, I think it happens due to some memory
> being overwritten due to to stack-depth violation, as the issue is not
> seen
> when I reduce the max_stack_depth to 100kb.
>  }
>  }
>

The second issue is also resolved by using smaller allocations within a
DSA.
Previously, it occurred because a few statically allocated strings were
placed
within a single large chunk of DSA allocation. I have changed this to use
dynamically allocated chunks with dsa_allocate0 within the same DSA.

Please find attached updated and rebased patches.

Thank you,
Rahila Syed


v13-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


v13-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


Improve monitoring of shared memory allocations

2025-03-01 Thread Rahila Syed
Hi,

The 0001* patch improves the accounting of the shared memory allocated for a
hash table during hash_create.
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct, which,
for shared hash tables, only covers the memory allocated for the hash
directory and control structure via ShmemInitHash. The hash segments and
buckets are allocated using ShmemAllocNoError, which neither attributes the
allocation to the hash table nor adds it to ShmemIndex.

Therefore, these allocations are not tracked in pg_shmem_allocations. To
improve this, the patch includes the allocation of segments and buckets in the
initial shared memory allocation for the hash table, in ShmemInitHash.

As a result, pg_shmem_allocations will show the total size of the initial
hash table, including all the buckets and elements, instead of just the
directory size.

Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls to
ShmemAlloc with ShmemInitStruct to associate a name with the allocations and
ensure that they get tracked by pg_shmem_allocations.

I observed an improvement in total memory allocation by consolidating the
initial shared memory allocations for the hash table. For example, the
allocated size for the LOCK hash table created via hash_create decreased from
801664 bytes to 799616 bytes. Please find the attached patches, which I will
add to the March Commitfest.

Thank you,
Rahila Syed


0001-Account-for-initial-shared-memory-allocated-during-h.patch
Description: Binary data


0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-03-03 Thread Rahila Syed
n't intended for user facing errors.
>
I agree that this should be addressed. I added a check of the path value
before storing it in shared memory. If the path is NIL, the path pointer in
the DSA will point to InvalidDsaPointer. When a client encounters an
InvalidDsaPointer, it will print NULL in the path column. A partial-path
scenario is unlikely IMO, and I am not sure it warrants additional checks.


> +static void
> +compute_num_of_contexts(List *contexts, HTAB *context_id_lookup,
> +   int *stats_count, bool get_summary)
> This function does a lot than compute the number of contexts so the name
> seems
> a bit misleading.  Perhaps a rename to compute_contexts() or something
> similar?
>
> Renamed to compute_contexts_count_and_ids.


>
> +   memctx_info[curr_id].name = dsa_allocate0(area,
> + strlen(clipped_ident) + 1);
> These calls can use idlen instead of more strlen() calls no?  While there
> is no
> performance benefit, it would increase readability IMHO if the code first
> calculates a value, and then use it.
>
> Done.


>
> +   strncpy(name,
> +   clipped_ident, strlen(clipped_ident));
> Since clipped_ident has been nul terminated earlier there is no need to use
> strncpy, we can instead use strlcpy and take the target buffer size into
> consideration rather than the input string length.
>
> Replaced with the strlcpy calls.


>
> PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts
> */
> +   PROCSIG_GET_MEMORY_CONTEXT, /* ask backend to log the memory contexts
> */
> This comment should be different from the LOG_MEMORY_xx one.
>
> Fixed.

+#define MEM_CONTEXT_SHMEM_STATS_SIZE   30
> +#define MAX_TYPE_STRING_LENGTH 64
> These are unused, from an earlier version of the patch perhaps?
>
> Removed

+ * Singe DSA area is created and used by all the processes,
> s/Singe/Since/
>

Fixed.

+typedef struct MemoryContextBackendState
> This is only used in mcxtfuncs.c and can be moved there rather than being
> exported in the header.
>

This is being used in mcxt.c too in the form of the variable memCtxState.


>

+}  MemoryContextId;
> This lacks an entry in the typedefs.list file.
>
> Added.

Please find attached the updated patches with the above fixes.

Thank you,
Rahila Syed


v16-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


v16-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-03-06 Thread Rahila Syed
States = (XidCacheStatus *)
> ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> > + ProcGlobal->subxidStates = (XidCacheStatus *)
> ShmemInitStruct("Proc Sub-transaction id states", TotalProcs *
> sizeof(*ProcGlobal->subxidStates), &found);
> >   MemSet(ProcGlobal->subxidStates, 0, TotalProcs *
> sizeof(*ProcGlobal->subxidStates));
> > - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs *
> sizeof(*ProcGlobal->statusFlags));
> > + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status
> Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> >   MemSet(ProcGlobal->statusFlags, 0, TotalProcs *
> sizeof(*ProcGlobal->statusFlags));
> >
> >   /*
>
> Same.
>
> Although here I'd say it's worth padding the size of each separate
> "allocation" by PG_CACHE_LINE_SIZE.
>

Made this change.


> > - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> > + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs *
> (fpLockBitsSize + fpRelIdSize), &found);
> >   MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
> >
> >   /* For asserts checking we did not overflow. */
>
> This one might actually make sense to keep separate, depending on the
> configuration it can be reasonably big (max_connection = 1k,
> max_locks_per_transaction=1k results in ~5MB)..
>
> OK

PFA the rebased patches with the above changes.

Kindly let me know your views.

Thank you,
Rahila Syed


v2-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


v2-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-03-13 Thread Rahila Syed
Hi,

Please find attached updated and rebased patches. They include the following
changes:

1. To prevent memory leaks, ensure that the latest statistics published by a
process are freed before it exits. This is achieved by calling dsa_free in the
before_shmem_exit callback (see the sketch after this list).
2. Add a level column to maintain consistency with the output of
pg_backend_memory_contexts.
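
The callback is roughly shaped like this (AtProcExit_memstats_dsa_free and
MemoryStatsDsaArea are the names used elsewhere in the patch series;
my_stats_dsp stands in for however the backend tracks its most recently
published chunk):

    static void
    AtProcExit_memstats_dsa_free(int code, Datum arg)
    {
        /* free this process's most recently published statistics, if any */
        if (DsaPointerIsValid(my_stats_dsp))
            dsa_free(MemoryStatsDsaArea, my_stats_dsp);
    }

    ...
    before_shmem_exit(AtProcExit_memstats_dsa_free, 0);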

Thank you,
Rahila Syed

On Tue, Mar 4, 2025 at 12:30 PM Rahila Syed  wrote:

> Hi Daniel,
>
> Thanks for the rebase, a few mostly superficial comments from a first
>> read-through.
>>
> Thank you for your comments.
>
>
>> The documentation refers to attributes in the return row but the format
>> of that
>> row isn't displayed which makes following along hard.  I think we should
>> include a table or a programlisting showing the return data before this
>> paragraph.
>>
>> I included the sql function call and its output in programlisting format
> after the
> function description.
> Since the description was part of a table, I added this additional
> information at the
> end of that table.
>
>
>> +const char *
>> +AssignContextType(NodeTag type)
>> This function doesn't actually assign anything so the name is a bit
>> misleading,
>> it would be better with ContextTypeToString or something similar.
>>
>> Done.
>
>
>>
>> + * By default, only superusers or users with PG_READ_ALL_STATS are
>> allowed to
>> This sentence is really long and should probably be broken up.
>>
>> Fixed.
>
>>
>> + * The shared memory buffer has a limited size - it the process has too
>> many
>> s/it/if/
>>
>> Fixed.
>
>
>> + * If the publishing backend does not respond before the condition
>> variable
>> + * times out, which is set to MEMSTATS_WAIT_TIMEOUT, retry for max_tries
>> + * number of times, which is defined by user, before giving up and
>> + * returning previously published statistics, if any.
>> This comment should mention what happens if the process gives up and
>> there is
>> no previously published stats.
>>
>> Done.
>
>
>>
>> +   int i;
>> ...
>> +   for (i = 0; i < memCtxState[procNumber].total_stats; i++)
>> This can be rewritten as "for (int i = 0; .." since we allow C99.
>>
>> Done.
>
>
>>
>> +* process running and consuming lots of memory, that it might end on
>> its
>> +* own first and its memory contexts are not logged is not a problem.
>> This comment is copy/pasted from pg_log_backend_memory_contexts and while
>> it
>> mostly still apply it should at least be rewritten to not refer to
>> logging as
>> this function doesn't do that.
>>
>> Fixed.
>
>
>>
>> +   ereport(WARNING,
>> +   (errmsg("PID %d is not a PostgreSQL server process",
>> No need to add the extra parenthesis around errmsg anymore, so I think
>> new code
>> should omit those.
>>
>> Done.
>
>
>>
>> +  errhint("Use pg_backend_memory_contexts view instead")));
>> Super nitpick, but errhints should be complete sentences ending with a
>> period.
>>
>> Done.
>
>
>>
>> +* statitics have previously been published by the backend. In which
>> case,
>> s/statitics/statistics/
>>
>> Fixed.
>
>
>>
>> +* statitics have previously been published by the backend. In which
>> case,
>> +* check if statistics are not older than curr_timestamp, if they are
>> wait
>> I think the sentence around the time check is needlessly confusing, could
>> it be
>> rewritten into something like:
>> "A valid DSA pointer isn't proof that statistics are available, it
>> can be
>> valid due to previously published stats.  Check if the stats are
>> updated by
>> comparing the timestamp, if the stats are newer than our previously
>> recorded timestamp from before sending the procsignal they must by
>> definition be updated."
>>
>> Replaced accordingly.
>
>
>>
>> +   /* Assert for dsa_handle to be valid */
>> Was this intended to be turned into an Assert call? Else it seems better
>> to remove.
>>
>
> Added an assert and removed the comment.
>
>
>> +   if (print_location != PRINT_STATS_NONE)
>> This warrants a comment stating why it makes sense.
>>
>
>> +* Do not print the statistics if print_to_stderr is PRINT_STATS_NONE,
>> s/print_to_stderr/print_location/.  Also, do we really need
>

Re: Improve monitoring of shared memory allocations

2025-03-12 Thread Rahila Syed
Hi Andres,



>> > + if (hashp->isshared)
>> > + {
>> > + int nsegs;
>> > + int nbuckets;
>> > + nsegs = find_num_of_segs(nelem, &nbuckets,
>> hctl->num_partitions, hctl->ssize);
>> > +
>> > + curr_offset =  (((char *) hashp->hctl) +
>> sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
>> > ++ (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
>> > + }
>> > +
>>
>> Why only do this for shared hashtables? Couldn't we allocate the elments
>> together with the rest for non-share hashtables too?
>>
>
>
I have now made the changes uniformly across shared and non-shared hash
tables,
making the code fix look cleaner. I hope this aligns with your suggestions.
Please find attached updated and rebased versions of both patches.

Kindly let me know your views.

Thank you,
Rahila Syed


v3-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data


v3-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-02-24 Thread Rahila Syed
> I think something is not quite right, because if I try running a simple
> pgbench script that does pg_get_process_memory_contexts() on PIDs of
> random postgres process (just like in the past), I immediately get this:
>
Thank you for testing. This issue occurs when a process that previously
attached to a DSA area for publishing its own context statistics tries to
attach to it again while querying statistics from another backend. Previously,
I was not detaching at the end of publishing the statistics. I have now
changed it to detach from the area after the statistics are published. The fix
is included in the updated patch.


> Perhaps the backends need to synchronize creation of the DSA?
>

This has been implemented in the patch.


> Sounds good. Do you have any measurements how much this reduced the size
> of the entries written to the DSA? How many entries will fit into 1MB of
> shared memory?


The size of the entries has approximately halved after dynamically allocating
the strings and a datum array. Also, previously I was allocating the entire
memory for all contexts in one large chunk from the DSA. I have now separated
them into smaller allocations per context. The integer counters are still
allocated at once for all contexts, but the size of an allocated chunk will
not exceed approximately 128 bytes * total_num_of_contexts. The average total
number of contexts is in the hundreds.

PFA the updated and rebased patches.

Thank you,
Rahila Syed


v15-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


v15-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


Re: Enhancing Memory Context Statistics Reporting

2025-02-20 Thread Rahila Syed
Hi,

Please find attached the updated patches after some cleanup and test
fixes.

Thank you,
Rahila Syed

On Tue, Feb 18, 2025 at 6:35 PM Rahila Syed  wrote:

> Hi,
>
>>
>> Thanks for updating the patch!
>>
>> The below comments would be a bit too detailed at this stage, but I’d
>> like to share the points I noticed.
>>
>> Thanks for sharing the detailed comments. I have incorporated some of them
> into the new version of the patch. I will include the rest when I refine
> and
> comment the code further.
>
> Meanwhile, I have fixed the following outstanding issues:
>
> 1.  Currently one DSA  is created per backend when the first request for
>> statistics is made and remains for the lifetime of the server.
>> I think I should add logic to periodically destroy DSAs, when memory
>> context statistics are not being *actively* queried from the backend,
>> as determined by the statistics timestamp.
>>
>
> After an offline discussion with Andres and Tomas, I have fixed this to
> use
> only one DSA for all the publishing backends/processes. Each backend
>  allocates smaller chunks of memory within the DSA while publishing
> statistics.
> These chunks are tracked independently by each backend, ensuring that two
> publishing backends/processes do not block each other despite using the
> same
> DSA. This approach eliminates the overhead of creating multiple DSAs,
> one for each backend.
>
> I am not destroying the DSA area because it stores the previously
> published
> statistics for each process. This allows the system to display older
> statistics
> when the latest data cannot be retrieved within a reasonable time.
> Only the most recently updated statistics are kept, while all earlier ones
> are freed using dsa_free by each backend when they are no longer needed.
> .
>
>> 2. The two issues reported by Fujii-san here: [1].
>> i. I have proposed a fix for the first issue here [2].
>> ii. I am able to reproduce the second issue. This happens when we try
>> to query statistics of a backend running infinite_recurse.sql. While I am
>> working on finding a root-cause, I think it happens due to some memory
>> being overwritten due to to stack-depth violation, as the issue is not
>> seen
>> when I reduce the max_stack_depth to 100kb.
>>  }
>>  }
>>
>
> The second issue is also resolved by using smaller allocations within a
> DSA.
> Previously, it occurred because a few statically allocated strings were
> placed
> within a single large chunk of DSA allocation. I have changed this to use
> dynamically allocated chunks with dsa_allocate0 within the same DSA.
>
> Please find attached updated and rebased patches.
>
> Thank you,
> Rahila Syed
>


v14-0002-Function-to-report-memory-context-statistics.patch
Description: Binary data


v14-0001-Preparatory-changes-for-reporting-memory-context-sta.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-03-30 Thread Rahila Syed
Hi Tomas,


>
> Right. I'm still not convinced if this makes any difference, or whether
> this alignment was merely a consequence of using ShmemAlloc(). I don't
> want to make this harder to understand unnecessarily.
>

Yeah, it makes sense.


> Let's keep this simple - without additional alignment. I'll think about
> it a bit more, and maybe add it before commit.
>

OK.


>
>
> > I will improve the comment in the next version.
> >
>
> OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
> really used except for this bit:
>
> +if (init_size > nelem_alloc)
> +element_alloc = false;
>
> Can't we determine before calling the function, to make it a bit less
> confusing?
>

Yes, we could determine whether the pre-allocated elements are zero before
calling the function; I have fixed it accordingly in the attached 0001 patch.
Now there's no need to pass `nelem_alloc` as a parameter. Instead, I've passed
this information as a boolean variable, initial_elems. If it is false, no
elements are pre-allocated.

Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.

Thank you,
Rahila Syed


v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patch
Description: Binary data


v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-04-04 Thread Rahila Syed
Hi,

Analysis of the bug in 0002 reported by David Rowley
<https://www.postgresql.org/message-id/3d36f787-8ba5-4aa2-abb1-7ade129a7b0e%40vondra.me>:

The 0001* patch allocates memory for the hash header, directory, segments,
and elements collectively for both shared and non-shared hash tables. While
this approach works well for shared hash tables, it presents an issue for
non-shared hash tables. Specifically, during expand_table(), non-shared hash
tables may allocate a new directory and free the old one. Since the
directory's memory is no longer allocated individually, it cannot be freed
separately. This results in the following error:

ERROR: pfree called with invalid pointer 0x60a15edc44e0 (header 0x0028)

These allocations are done together to improve the reporting of shared memory
allocations for shared hash tables. A similar change was made for non-shared
hash tables only to maintain consistency, since the hash_create code is shared
between both types of hash tables.

One solution could be to separate the directory allocation from the rest of
the allocations for non-shared hash tables, but that would undermine the
purpose of making the change for non-shared hash tables in the first place.

A better/safer solution is to make this change only for shared hash tables and
exclude the non-shared ones.

I believe it's acceptable to allocate everything in a single block provided we
do not try to free any of these pieces individually, which happens only for
the directory pointer in dir_realloc. Additional segment allocations go
through seg_alloc and element allocations through element_alloc, which do not
free existing chunks but instead allocate new ones and store their pointers in
the existing directory and segments. Thus, as long as we don't reallocate the
directory, which we avoid in the case of shared hash tables, it should be safe
to proceed with this change.

Please find attached the patch, which removes the changes for non-shared hash
tables and keeps them for shared hash tables.

I tested this by running make check, make check-world, and the reproducer
script shared by David. I also ran pgbench to test the creation and expansion
of some of the shared hash tables.

Thank you,
Rahila Syed


v10-0001-Improve-accounting-for-memory-used-by-shared-hash-ta.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-03-23 Thread Rahila Syed
Hi Bilal,


I have a couple of comments, I have only reviewed 0001 so far.
>

Thank you for reviewing!


>
> You may need to run pgindent, it makes some changes.
>

Attached v4-patch has been updated after running pgindent.


+ * If table is shared, calculate the offset at which to find the
> the
> + * first partition of elements
> + */
> +
> +nsegs = compute_buckets_and_segs(nelem, &nbuckets,
> hctl->num_partitions, hctl->ssize);
>
> Blank line between the comment and the code.
>

Removed this.


>  /*
>   * allocate some new elements and link them into the indicated free list
>   */
> -static bool
> -element_alloc(HTAB *hashp, int nelem, int freelist_idx)
> +static HASHELEMENT *
> +element_alloc(HTAB *hashp, int nelem)
>
> Comment needs an update. This function no longer links elements into
> the free list.
>

Updated this and few other comments in the attached v4-patch.


>
> +static int
> +compute_buckets_and_segs(long nelem, int *nbuckets, long
> num_partitions, long ssize)
> +{
> ...
> +/*
> + * In a partitioned table, nbuckets must be at least equal to
> + * num_partitions; were it less, keys with apparently different
> partition
> + * numbers would map to the same bucket, breaking partition
> independence.
> + * (Normally nbuckets will be much bigger; this is just a safety
> check.)
> + */
> +while ((*nbuckets) < num_partitions)
> +(*nbuckets) <<= 1;
>
> I have some worries about this function, I am not sure what I said
> below has real life implications as you already said 'Normally
> nbuckets will be much bigger; this is just a safety check.'.
>
> 1- num_partitions is long and nbuckets is int, so could there be any
> case where num_partition is bigger than MAX_INT and cause an infinite
> loop?
> 2- Although we assume both nbuckets and num_partition initialized as
> the same type, (*nbuckets) <<= 1 will cause an infinite loop if
> num_partition is bigger than MAX_TYPE / 2.
>
> So I think that the solution is to confirm that num_partition <
> MAX_NBUCKETS_TYPE / 2, what do you think?
>
>
Your concern is valid. This is already addressed in the existing code by
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function; I am simply moving part of the init_htab() code into a separate
function for reuse.

Please find attached the updated and rebased patches.

Thank you,
Rahila Syed


v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data


Re: Improve monitoring of shared memory allocations

2025-03-27 Thread Rahila Syed
Hi Tomas,


> I did a review on v3 of the patch. I see there's been some minor changes
> in v4 - I haven't tried to adjust my review, but I assume most of my
> comments still apply.
>
>
Thank you for your interest and review.


> 1) I don't quite understand why hash_get_shared_size() got renamed to
> hash_get_init_size()? Why is that needed? Does it cover only some
> initial allocation, and there's additional memory allocated later? And
> what's the point of the added const?
>

Earlier, this function was used to calculate the size for shared hash tables
only. Now it also calculates the size for non-shared hash tables, hence the
change of name.

If I don't change the argument to const, I get a compilation error:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier from
pointer target type."
This is because the function is called from hash_create, which treats the
HASHCTL as const.


>
> 5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
> along with calculating the per-element requestSize, but then still does
>
> memset(PredXact->element, 0, requestSize);
>
> and
>
> memset(RWConflictPool->element, 0, requestSize);
>
> with requestSize for the whole allocation? I haven't seen any crashes,
> but this seems wrong to me. I believe we can simply zero the whole
> allocation right after ShmemInitStruct(), there's no need to do this for
> each individual element.
>

Good catch.  Thanks for updating.


>
> 5) InitProcGlobal is another example of hard-to-read code. Admittedly,
> it wasn't particularly readable before, but making the lines even longer
> does not make it much better ...
>
> I guess adding a couple macros similar to hash_create() would be one way
> to improve this (and there's even a review comment in that sense), but I
> chose to just do maintain a simple pointer. But if you think it should
> be done differently, feel free to do so.
>
>
LGTM, long lines have been reduced by the ptr approach.


> 6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
> which seems to be doing a very similar thing, and to not require the
> prototype. Minor detail, feel free to undo.
>
> LGTM.


>
> 7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
> the patch, and I see no reason to do it in the same commit. So 0005
> removes this change, and 0006 reintroduces it separately.
>
> OK.


> FWIW I see no justification for why the cache line padding is useful,
> and it seems quite unlikely this would benefit anything, consiering it
> adds padding between fairly long arrays. What kind of contention can we
> get there? I don't get it.
>

I did that to address a review comment upthread by Andres, and because the
comment above that code block also mentions the need for padding.


> Also, why is the patch adding padding after statusFlags (the last array
> allocated in InitProcGlobal) and not between allProcs and xids?
>

I agree it makes more sense this way, so changing accordingly.



>  *
> +* XXX can we rely on this matching the calculation in
> hash_get_shared_size?
> +* or could/should we add some asserts? Can we have at
> least some sanity checks
> +* on nbuckets/nsegs?
>
>
Both places rely on compute_buckets_and_segs() to calculate nbuckets and
nsegs, hence the probability of a mismatch is low. I am open to adding some
asserts to verify this. Do you have any suggestions in mind?

Please find attached updated patches after merging all your review comments
except
a few discussed above.

Thank you,
Rahila Syed


v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patch
Description: Binary data


v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data


v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data


Re: Improve LWLock tranche name visibility across backends

2025-07-15 Thread Rahila Syed
Hi,


>
> Since DSM is not available during postmaster, if we were to create a DSA
> segment in place, similar to what's done in StatsShmemInit(), we would also
> need to ensure that the initial shared memory is sized appropriately. This
> is
> because it would need to be large enough to accommodate all user-defined
> tranches registered during postmaster, without having to rely on new
> dsm segments.
> From my experimentation, this sizing is not as straightforward as simply
> calculating # of tranches * size of a tranche entry.
>
> I still think we should create the dsa during postmaster, as we do with
> StatsShmemInit, but it would be better if postmaster keeps its hands off
> this
> dshash and only normal backends can use them.
>
> Thoughts?
>
>
Since creating a DSA segment in place during postmaster startup still
requires calculating the size of the tranche names table, including the
user-defined tranches, why not use static shared memory to pre-allocate a
fixed-size table with arrays of size NAMEDATALEN to store the names?
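
Purely as an illustration (the struct name and fields below are made up, not
part of any patch):

    typedef struct TrancheNameEntry
    {
        int     tranche_id;
        char    tranche_name[NAMEDATALEN];
    } TrancheNameEntry;

    /* a fixed-size array of these could live in static shared memory */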

If a dshash table is used to store tranche names and IDs, where would the
tranche name for this table itself be registered?

Thank you,
Rahila Syed


Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread Rahila Syed
Hi,

>> Yes, for example:
>>
>> low_mask: 0x011, high_mask: 0x111, old_bucket: 0x010, new_bucket: 0x110
>>
>> The old_bucket's hash value like 0x***010 or 0x***110, the later is in
>> the old_bucket is because we didn't have new_bucket before, so only hash
>> value like 0x***110 needs relocation: hashvalue & (low_mask + 1) != 0
>>
>>
> Thanks for explaining, that clarifies things for me.
> It may be worthwhile to check if this change has led to any performance
> improvements.
>
>

One thing to note is that in this scenario there is no safeguard if the hash
value is 0x111 and new_bucket is 0x110. This means max_bucket is 0x110, but
with your change even 0x111 would meet the condition for relocation to the new
bucket, which would be incorrect. Although it's unlikely that 0x111 would be
passed into this check, if it were, there is currently nothing to prevent it
from being relocated to the new bucket. The current code, however, does have
such a check in calc_bucket, specifically: if (bucket > hctl->max_bucket).

Thank you,
Rahila Syed


Re: Small optimization with expanding dynamic hash table

2025-07-08 Thread Rahila Syed
Hi,


> Yes, for example:
>
> low_mask: 0x011, high_mask: 0x111, old_bucket: 0x010, new_bucket: 0x110
>
> The old_bucket's hash value like 0x***010 or 0x***110, the later is in the
> old_bucket is because we didn't have new_bucket before, so only hash value
> like 0x***110 needs relocation: hashvalue & (low_mask + 1) != 0
>
>
Thanks for explaining; that clarifies things for me. It may be worthwhile to
check whether this change leads to any performance improvement.

Thank you,
Rahila Syed


Re: Small optimization with expanding dynamic hash table

2025-07-07 Thread Rahila Syed
Hi,


> If I understand correctly, we only need to check the specific bit
> to determine whether a hash element needs relocation:
>

> diff --git a/src/backend/utils/hash/dynahash.c
> b/src/backend/utils/hash/dynahash.c
> index 1ad155d446e..32fbae71995 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -1626,7 +1626,7 @@ expand_table(HTAB *hashp)
>  currElement = nextElement)
> {
> nextElement = currElement->link;
> -   if ((long) calc_bucket(hctl, currElement->hashvalue) ==
> old_bucket)
> +   if (!(currElement->hashvalue & (hctl->low_mask + 1)))
> {
> *oldlink = currElement;
> oldlink = &currElement->link;
>
>
Will this still work if new_bucket is not equal to hctl->low_mask + 1?

The above situation seems possible because we increment new_bucket every time
expand_table is called, but we only update low_mask when new_bucket exceeds
high_mask.

This change has passed the tests on GitHub CI. According to [1], the code has
decent test coverage, but I'm not certain that the specific case mentioned
above is covered by the tests.

[1] LCOV - PostgreSQL 19devel - src/backend/utils/hash/dynahash.c
<https://coverage.postgresql.org/src/backend/utils/hash/dynahash.c.gcov.html>

Thank you,
Rahila Syed


Re: Enhancing Memory Context Statistics Reporting

2025-07-11 Thread Rahila Syed
Hi,

Please find attached the latest memory context statistics monitoring patch.
It has been redesigned to address several issues highlighted in the threads
[1] and [2].

Here are some key highlights of the new design:

- All DSA processing has been moved out of the CFI handler function. Now all
the dynamic shared memory needed to store the statistics is created and
deleted in the client function. This change addresses concerns that the DSA
APIs are too high level to be safely called from interrupt handlers. There was
also a concern that DSA API calls might not be re-entrant, which could cause
issues if CFI is invoked from a DSA function in the future.

- The static shared memory array has been replaced with a DSHASH table, which
now holds metadata such as pointers to the actual statistics for each process.

- dsm_registry.c APIs are used for creating and attaching to the DSA and the
DSHASH table, which helps prevent code duplication.

- To address the memory leak concern, we create an exclusive memory context
under the NULL context, which does not fall under the TopMemoryContext tree,
to handle all the memory allocations in ProcessGetMemoryContextInterrupt.
This ensures that the memory context created by the function does not affect
its outcome. The memory context is reset at the end of the function, which
helps prevent any memory leaks.

- Changes made to the mcxt.c file have been relocated to mcxtfuncs.c, which now
contains all the existing memory statistics-related functions along with the
code for the proposed function.

The overall flow of a request is as follows:

1. A client backend running the pg_get_process_memory_contexts function
creates a DSA and allocates memory to store the statistics, tracked by a DSA
pointer. This pointer is stored in a DSHASH entry for each client querying the
statistics of any process. The client shares its DSHASH table key with the
server process using a static shared array of keys indexed by the server's
procNumber. It notifies the server process to publish statistics by using
SendProcSignal.

2. When a PostgreSQL server process handles the request for memory statistics,
the CFI function accesses the client hash key stored in its procNumber slot of
the shared keys array. The server process then retrieves the DSHASH entry to
obtain the DSA pointer allocated by the client for storing the statistics.
After storing the statistics, it notifies the client through its condition
variable.

3. Although the DSA is created just once, the memory inside the DSA is
allocated and released by the client process as soon as it finishes reading
the statistics. If it fails to do so, the memory is freed by the
before_shmem_exit callback when the client exits. The client's entry in the
DSHASH table is also deleted when the client exits.

4. The DSA and DSHASH table are not created until the
pg_get_process_memory_contexts function is called. Once created, any client
backend querying statistics and any PostgreSQL process publishing statistics
will attach to the same area and table.
Please let me know your thoughts.

Thank you,
Rahila Syed

[1]. PostgreSQL: Re: pgsql: Add function to get memory context stats for
processes
<https://www.postgresql.org/message-id/CA%2BTgmoaey-kOP1k5FaUnQFd1fR0majVebWcL8ogfLbG_nt-Ytg%40mail.gmail.com>
[2]. PostgreSQL: Re: Prevent an error on attaching/creating a DSM/DSA from
an interrupt handler.
<https://www.postgresql.org/message-id/flat/8B873D49-E0E5-4F9F-B8D6-CA4836B825CD%40yesql.se#7026d2fe4ab0de6dd5decd32eb9c585a>

On Wed, Apr 30, 2025 at 4:13 PM Daniel Gustafsson  wrote:

> > On 30 Apr 2025, at 12:14, Peter Eisentraut  wrote:
> >
> > On 29.04.25 15:13, Rahila Syed wrote:
> >> Please find attached a patch with some comments and documentation
> changes.
> >> Additionaly, added a missing '\0' termination to "Remaining Totals"
> string.
> >> I think this became necessary after we replaced dsa_allocate0()
> >> with dsa_allocate() is the latest version.
> >
> > >   strncpy(nameptr, "Remaining Totals", namelen);
> > > + nameptr[namelen] = '\0';
> >
> > Looks like a case for strlcpy()?
>
> True.  I did go ahead with the strncpy and nul terminator assignment,
> mostly
> out of muscle memory, but I agree that this would be a good place for a
> strlcpy() instead.
>
> --
> Daniel Gustafsson
>
>


v30-0001-Add-pg_get_process_memory_context-function.patch
Description: Binary data


Re: Add pg_get_injection_points() for information of injection points

2025-07-02 Thread Rahila Syed
Hi Michael,

 #include "miscadmin.h"
+#include "nodes/execnodes.h"
 #include "nodes/pg_list.h"
 #include "nodes/value.h"

Do we need to include this? I did not see any compilation error when I
removed this.



> Now for the second part with the SRF making the injection point
> information available at SQL level.  The point about adding the new
> function in the core engine has been sticky, and I'm coming down to
> the conclusion that I'd still want to make this stuff backpatchable if
> need be in the future.  So let's just add a new function in the test
> module injection_points and call it a day.
>
>
Do you plan to document this function anywhere where it would be more
visible to those
who might want to use it in their tests?

Thank you,
Rahila Syed


Re: Improve error message for duplicate labels in enum types

2025-07-03 Thread Rahila Syed
Hi Yugo,



> Currently, when creating an enum type, duplicate labels are caught by a
> unique
> index on pg_enum, resulting in a generic error message.
>
>  postgres=# create type t as enum ('a','b','a');
>  ERROR:  duplicate key value violates unique constraint
> "pg_enum_typid_label_index"
>  DETAIL:  Key (enumtypid, enumlabel)=(16418, a) already exists.
>
> I propose adding an explicit check for duplicate labels during enum
> creation,
> so that a more user-friendly and descriptive error message can be produced,
> similar to what is already done in ALTER TYPE ... ADD VALUE
> or ALTER TYPE ... RENAME VALUE .. TO 
>
> With the attached patch applied, the error message becomes:
>
> ERROR:  label "a" used more than once
>

 Thank you for sharing the patch.
+1 to the idea of improving the error message.

Please take the following points mentioned into consideration.

1. I’m considering whether there might be a more efficient way to handle
this. The current method adds an extra loop to check for duplicates, in
addition to the existing duplicate-index check, even when no duplicates are
present. Would it be possible to achieve this by wrapping the following insert
call in a PG_TRY() and PG_CATCH() block and reporting a more descriptive error
in the PG_CATCH() block (see the sketch after these comments)?

CatalogTuplesMultiInsertWithInfo(pg_enum, slot, slotCount, indstate);

2. If we choose to include the check in the 0001 patch you provided, would it
make more sense to place it earlier in the function, before assigning OIDs to
the labels and running qsort? This way, we could catch duplicates sooner and
prevent unnecessary processing.
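
For point 1, roughly the shape I have in mind (the error codes and message
text are only illustrative, not a finished proposal):

    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        CatalogTuplesMultiInsertWithInfo(pg_enum, slot, slotCount, indstate);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* copy the error data out of ErrorContext before inspecting it */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        if (edata->sqlerrcode == ERRCODE_UNIQUE_VIOLATION)
            ereport(ERROR,
                    (errcode(ERRCODE_DUPLICATE_OBJECT),
                     errmsg("enum label used more than once")));

        /* some other error: propagate it unchanged */
        ReThrowError(edata);
    }
    PG_END_TRY();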

Thank you,
Rahila Syed

