About replication minimal disk space usage
Hello all,

I have a self-hosted Postgres server on AWS with 16 GB of disk space attached to it. For our ML and analysis work we use Vertex AI, so I have set up live replication from Postgres to a BigQuery table using the Datastream service. We use BigQuery as our data warehouse because we have many different data sources, and this way all our data analysis and ML can happen in one place.

The problem is that after I start replication, pg_wal consumes almost the whole disk, about 15.8 GB, within a few days.

Question: how can I set things up so that disk space is used optimally and old pg_wal data that is no longer needed gets deleted? I was thinking of creating a cron job to take care of all that, but I don't know the right approach. Can you please guide me?

In the future, as the data grows, I will attach more disk space to the instance, but I want an optimal setup so the disk is never completely full and my server doesn't crash again.
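[Editor's note: a first diagnostic step is to confirm it really is WAL, rather than table data, filling the disk. A sketch, assuming PostgreSQL 10 or newer (pg_ls_waldir() was added in v10):]

```sql
-- Total size and number of WAL segment files currently in pg_wal.
SELECT pg_size_pretty(sum(size)) AS wal_size,
       count(*)                  AS wal_segments
FROM pg_ls_waldir();

-- For comparison, the size of the databases themselves.
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database;
```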
Re: About replication minimal disk space usage
On 8/24/24 14:18, Manan Kansara wrote:
> [...] When I start replication, pg_wal takes almost the whole disk,
> about 15.8 GB, within a few days. How can I set things up so that old
> pg_wal data that is no longer needed gets deleted? [...]

Why don't you just give it more disk space? I'm not a fan of blindly throwing hardware at an issue, but 16GB is tiny these days, especially if it's shared by both data and WAL, and the time you spend optimizing this is likely more expensive than any savings.

If you really want to keep this on 16GB, I think we'll need more details about what exactly you see on the instance / how it runs out of disk space. AFAIK Datastream relies on logical replication, and there are a couple of ways that may consume disk space.

For example, the Datastream replication may pause for a while, in which case the replication slot will block removal of still-needed WAL, and if the pause is long enough, that may be an issue. Of course, we have no idea how much data you're dealing with (clearly not much, if it fits onto 16GB of disk space with everything else).
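[Editor's note: the slot-retention behavior described above can be inspected directly. A sketch, assuming PostgreSQL 13 or newer (the wal_status and safe_wal_size columns were added in v13):]

```sql
-- Find replication slots holding back WAL removal. An inactive slot
-- with wal_status 'extended' or 'unreserved' is retaining WAL;
-- 'lost' means needed WAL has already been removed.
SELECT slot_name,
       active,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_size,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

On v13+ the max_slot_wal_keep_size setting can also cap how much WAL a slot may retain, at the cost of invalidating the slot (and forcing a resync of the consumer) if it falls too far behind.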
Another option is that you have a huge transaction (inserting and/or modifying a lot of data at once), and logical decoding ends up spilling the decoded transaction to disk.

If you want a better answer, I think you'll have to provide a lot more details. For example, which PostgreSQL version are you using, and how is it configured? Which config parameters have non-default values?

regards

--
Tomas Vondra
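[Editor's note: the transaction-spill case can be observed in the pg_stat_replication_slots view, available since PostgreSQL 14. A sketch:]

```sql
-- Cumulative statistics on transactions spilled to disk during
-- logical decoding, i.e. transactions whose decoded changes exceeded
-- logical_decoding_work_mem.
SELECT slot_name,
       spill_txns,
       spill_count,
       pg_size_pretty(spill_bytes) AS spilled
FROM pg_stat_replication_slots;
```

If spill_bytes grows quickly, raising logical_decoding_work_mem (v13+) or batching large transactions into smaller ones reduces the on-disk spill, though the spilled files are temporary and are cleaned up once the transaction is decoded.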
On exclusion constraints and validity dates
Hello,

I have an issue that on the surface seems orthogonal to existing functionality. I'm trying to dynamically update validity ranges as new rows replace old ones. In a nutshell the problem looks like this:

psqlprompt=# select * from rangetest;
 id |                      rangecol
----+-----------------------------------------------------
  0 | empty
  0 | ["2024-05-05 00:00:00+00","2024-05-06 00:00:00+00")
  0 | ["2024-05-06 00:00:00+00","-03-31 00:00:00+00")
  1 | ["2024-05-06 00:00:00+00",)

psqlprompt=# insert into rangetest values (1, '["2024-06-07 00:00:00+0",)')
  on conflict on constraint rangetest_id_rangecol_excl
  do update set rangecol = concat('[', lower(rangetest.rangecol), ',',
    lower(excluded.rangecol), ')')::tstzrange;
ERROR:  ON CONFLICT DO UPDATE not supported with exclusion constraints

So I'm not sure if I'm after a feature request, a workaround, or contribution advice. Maybe someone can point me in the right direction.

1. A 'currently valid' item that becomes invalid and is replaced by a new 'currently valid' item seems like such a real-world use case that there should be explicit support for it.
   * Unfortunately, the temporal tables extensions currently seem too immature for my needs.
2. Barring that, an exclusion constraint arbiter for ON CONFLICT would be a lovely solution.
3. Barring either of those, at least a 'select all conflicts' type feature that makes it easy to pair up the offending rows.

Currently I'm looking at working around this in the application, or in a stored procedure/insert trigger with essentially the same logic, whichever seems easier to maintain.

Advice on how to submit a feature request, or a better workaround that I haven't discovered, would be most welcome. Even more welcome would be insight from someone familiar with these parts of the code who can tell me whether I'd be biting off more than I can chew (or violating a design principle) by trying to submit one of the three options above as a feature.

Thank you
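[Editor's note: one workaround along the lines of the stored-procedure idea above: since ON CONFLICT cannot arbitrate on an exclusion constraint, close the currently-valid range explicitly before inserting its replacement. A sketch, assuming the constraint excludes overlapping rangecol values per id, and that an unbounded upper end marks the 'currently valid' row:]

```sql
BEGIN;

-- Truncate the open-ended "currently valid" range for this id so the
-- replacement row no longer overlaps it.
UPDATE rangetest
   SET rangecol = tstzrange(lower(rangecol), '2024-06-07 00:00:00+00')
 WHERE id = 1
   AND upper_inf(rangecol);

-- Insert the replacement row, valid from the cut-off point onward.
INSERT INTO rangetest VALUES (1, '["2024-06-07 00:00:00+00",)');

COMMIT;
```

Packaged in a function or an insert trigger, both statements run in one transaction, so the exclusion constraint never sees the old and new open-ended ranges overlapping.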