On 20/05/25 12:58, Achilleas Mantzios wrote:
On 20/5/25 12:17, Moreno Andreo wrote:
On 19/05/25 20:49, Achilleas Mantzios wrote:
On 19/5/25 17:38, Moreno Andreo wrote:
On 19/05/25 14:41, Achilleas Mantzios wrote:
On 5/19/25 09:14, Moreno Andreo wrote:
On 16/05/25 21:33, Achilleas Mantzios wrote:
On 16/5/25 18:45, Moreno Andreo wrote:
Hi,
we are moving away from our old binary data approach, migrating the data
from bytea fields in a table to external storage (making the
database smaller and related operations faster and easier to manage).
In short, we have a job that runs in the background, copies the data
from the table to an external file, and then sets the bytea
field to NULL.
(UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id =
<uuid>)
At the end of the operation, this results in a table that's
less than one tenth of its original size.
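In pseudo-SQL, each step of the job looks roughly like this (one row per
transaction; names as in the example above, and the object upload itself
happens outside SQL):

    BEGIN;
    -- read the bytea and upload it to external storage here
    SELECT blob FROM tbl WHERE id = <uuid> FOR UPDATE;
    -- then clear the field and record the object path
    UPDATE tbl SET blob = NULL, ref = 'path/to/file' WHERE id = <uuid>;
    COMMIT;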
We have a multi-tenant architecture (100s of schemas with
identical structure, all inheriting from public) and we are
performing the task on one table per schema.
So? TOASTed data are kept in separate TOAST tables; unless those
bytea columns are selected, you won't even touch them. I can't
understand what you are trying to achieve here.
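You can see the split for yourself with the standard size functions (the
table name here is just an example): pg_relation_size counts only the main
fork, pg_table_size adds TOAST, and the difference is what lives out-of-line.

    SELECT pg_size_pretty(pg_relation_size('tenant1.blob_table'))       AS main_fork,
           pg_size_pretty(pg_table_size('tenant1.blob_table'))          AS incl_toast,
           pg_size_pretty(pg_total_relation_size('tenant1.blob_table')) AS incl_indexes;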
Years ago, when I made the mistake of going for a coffee and letting my
developers "improvise", the result was a design similar to what
you are trying to achieve. Years later, I am seriously
considering moving those data back to PostgreSQL.
The "related operations" I was talking about are backups and
database maintenance when needed, cluster/replica management,
etc. With a smaller database size they would be easier in timing
and effort, right?
OK, but you'll lose replication for those blobs, which
means you don't care about them; correct me if I am wrong.
I'm not saying I don't care about them, quite the opposite: they are
protected with Object Versioning and soft deletion, which should
ensure good protection against e.g. ransomware, if someone
manages to get in there (and if that happens, we'll have bigger
troubles than this).
PostgreSQL has become very popular because of people who care about
their data.
Yeah, it's always been famous for its robustness, and that's why I
chose PostgreSQL more than 10 years ago; in spite of how a
"normal" user treats his PC, we never had corruption (only where
the FS/disk was failing, but that's not PostgreSQL's fault).
We are mostly talking about costs here. To call things by their
name: I'm moving bytea contents (85% of total data) to files
in Google Cloud Storage buckets, which cost a fraction of the
disks holding my database (on GCE, to be clear).
May I ask the size of the bytea data (uncompressed)?
Single records vary from 150 kB to 80 MB; the grand total is more
than 8.5 TB out of a circa 10 TB data footprint.
This data is not accessed frequently (just by its owner when he
needs it), so there's no need to keep it on expensive hardware.
I've read over the years that keeping many big bytea
fields in a database is not recommended, but I might have
misunderstood this.
OK, I assume those are unimportant data, but let me ask: what is
the longevity or expected lifetime of those? I haven't worked
with those, just reading:
https://cloud.google.com/storage/pricing#storage-pricing
would you choose e.g. "Anywhere Cache storage"?
Absolutely not, this is *not* unimportant data, and we are using
Standard Storage at $0.02/GB/month plus operations, which, compared to
the $0.17/GB/month of an SSD (or even more for the Hyperdisks we are
using), is a good price drop.
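Back-of-the-envelope, taking 8.5 TB as roughly 8,700 GB: about
8,700 GB x $0.02 = ~$174/month in Standard Storage versus
8,700 GB x $0.17 = ~$1,479/month on SSD, before operation charges.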
How about hosting your data in your own storage and spending $0/GB/month?
If we could host on our own hardware I wouldn't be here talking. Maybe
we would have a 10-node full-mesh multimaster architecture with
Barman backups on 2 separate SANs.
But we are a small company that has to balance performance,
consistency, security and, last but not least, costs. And margins
are tightening.
Another way would have been to move these tables to a different
tablespace on cheaper storage, but that would still have been 3
times the bucket cost.
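For reference, that alternative would be a couple of statements per table;
a minimal sketch with made-up names, keeping in mind that
ALTER TABLE ... SET TABLESPACE rewrites the table under an
ACCESS EXCLUSIVE lock:

    CREATE TABLESPACE cheap_store LOCATION '/mnt/cheap_disk/pgdata';
    ALTER TABLE tenant1.blob_table SET TABLESPACE cheap_store;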
Can you actually mount those Cloud Storage buckets under a
supported FS in Linux and just move them to tablespaces backed by
that storage?
Never tried; I mounted one via FUSE and ran some simple operations
in the past, but I'm not sure it can handle database workloads in
terms of I/O bandwidth.
Why are you considering getting the data back into database tables?
Because now, if we need to migrate from cloud to on-premise, or
just upgrade or move the specific server which holds those data, I
will have an extra headache. It is also a single point of
failure or, at best, a source of fragmented technology introduced
just for the sake of keeping things out of the DB.
This is managed as a hierarchical disk structure, so the calling
server may be literally anywhere; it just needs an account (or a
service account) to get in there.
And you are locked into a proprietary solution, at the mercy of
any future increases in cost.
Since we cannot host on our own hardware, the only thing to do is keep an
eye on costs and migrate (yeah, more work) when it gets
expensive. Every solution is proprietary if you want to run in the
cloud, even the VMs where PostgreSQL runs.
The problem is: this is generating BIG table bloat, as you may
imagine.
Running a VACUUM FULL on a formerly-22 GB table on a standalone test
server is almost immediate.
If I had only one server, I'd process one table at a time with a
nightly script and issue a VACUUM FULL on the tables that have
already been processed.
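Such a script could pick its targets from the statistics views; a minimal
sketch, with hypothetical schema/table names:

    -- list the tables carrying the most dead tuples
    SELECT schemaname, relname, n_dead_tup, n_live_tup
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;

    -- then reclaim the space one table at a time
    VACUUM (FULL, ANALYZE) tenant1.blob_table;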
But I'm in a logical replication architecture (we are using a
multimaster system called pgEdge, but I don't think that makes
a big difference, since it's based on logical replication),
and I'm building a test cluster.
So you use pgEdge, but you want to lose all the benefits of
multi-master, since your binary data won't be replicated ...
I don't think I need it to be replicated, since this data cannot
be "edited": either it's there or it's been deleted. Buckets
have protections against data deletion and events like ransomware
attacks and such.
Also, multi-master was an absolute requirement one year ago
because of a project we were building, but that project has been
abandoned and now simple logical replication would be enough; but
let's do one thing at a time.
Multi-master is cool, you can configure your pooler/clients to
take advantage of it for a fully load-balanced architecture, but if
it is not a strict requirement you can live without it, as so many of
us do, and employ other means of load balancing the reads.
That's what we are doing. It's a really cool feature, but I have
experienced (maybe because it uses the old pglogical extension) that
the replication is a bit fragile, especially when dealing with
those bytea fields: when I ingest big loads, say 25-30 GB or more,
it has happened that replication broke. Recreating a replica from
scratch is not a big deal with "normal size" tables, since it can
be done automatically (they normally fit in shared memory and can
be transferred by the replicator), but you can imagine the effort
and the downtime needed to create a base backup, transfer it to the
replica, rebuild the DB and restart a 10 TB database (at the moment
we are running a 2-node cluster).
Break this into batches; use modern techniques for robust data
loading, in smaller transactions, if you have to.
Normally it's run via COPY commands; I can throttle COPY or break it
into batches. At the moment, while the schema is offline, we disconnect
replication for the bytea tables, feed them, wait for checkpoints to
return to normal, and then resume replication on those tables before
putting the schema back online. This is safe, even if far from
optimized. It's a migration tool; it won't be used forever, just to
move customers from their current architecture to the new cloud one.
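For what it's worth, the detach/feed/reattach cycle in vanilla logical
replication terms would look roughly like this (pgEdge/Spock has its own
replication-set functions, so this is only the native analogue; all names
and the batch file path are hypothetical):

    ALTER PUBLICATION tenant_pub DROP TABLE tenant1.blob_table;  -- detach before the load
    COPY tenant1.blob_table FROM '/path/to/batch.dat' WITH (FORMAT binary);
    ALTER PUBLICATION tenant_pub ADD TABLE tenant1.blob_table;   -- reattach afterwards
    -- on the subscriber, avoid re-copying rows that are already there:
    ALTER SUBSCRIPTION tenant_sub REFRESH PUBLICATION WITH (copy_data = false);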
I've been instructed to issue VACUUM FULL on both nodes,
nightly, but before proceeding I read in the docs that VACUUM FULL
can disrupt logical replication, so I'm a bit concerned about how
to proceed. Rows are cleared one at a time (one transaction, one
row, to confine errors to the record that caused them).
Mind sharing the specific doc?
Obviously I can't find it with a quick search; I'll search deeper,
I don't think I dreamed it :-).
pgEdge is based on the old pglogical, the old 2ndQuadrant
extension, not the native logical replication we have had since
PostgreSQL 10. But I might be mistaken.
I don't know about this; it keeps running on the latest PG versions
(we are about to upgrade to 17.4, if I'm not wrong), but I'll ask.
I read about extensions like pg_squeeze, but I wonder whether they,
too, are dangerous for replication.
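For context, a one-off pg_squeeze run would look roughly like this,
assuming the extension is installed and listed in
shared_preload_libraries; the names are hypothetical and the exact
function signature varies between versions:

    CREATE EXTENSION IF NOT EXISTS pg_squeeze;
    -- rebuilds the table with minimal locking, using logical decoding internally
    SELECT squeeze.squeeze_table('tenant1', 'blob_table', null, null, null);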
What's pgEdge's take on that, I mean on the bytea thing you are
trying to achieve here?
They are positive; it's they who suggested doing VACUUM FULL on
both nodes... I'm quite new to replication, so I'm looking for some
advice here.
As I told you, pgEdge logical replication (the old 2ndQuadrant BDR) !=
native logical replication. You may look here:
https://github.com/pgEdge/spock
If multi-master is not a must you could convert to vanilla
PostgreSQL and focus on standard physical and logical replication.
No, multimaster is cool, but as I said, the project has been
discontinued and it's not a must anymore. This is the first step,
actually. We are planning to return to plain PostgreSQL, or
Cloud SQL for PostgreSQL, using logical replication (which seems the
more reliable of the two). We created a test case for both
options, and they seem OK for now, even if I still have to do
adequate stress tests. And when I do the migration, I'd like to
migrate plain data only and leave the blobs where they are.
As you wish. But this design has inherent data-infrastructure
fragmentation, as you understand.
Personally I like to let the DB take care of the data, and I take
care of the DB, not a plethora of extra systems that we need to keep
connected and consistent.
We followed this idea when the application (old version) was on
customer premises, so backups and operations were simple, and getting
into trouble (e.g. a customer deleting a directory from their PC)
happened only a very few times, usually when they ran Disk Cleanup on
Windows :-)
Now we host a full cloud solution, so we got rid of many potential
problems generated by the end user, but bumped into others, as you
can certainly imagine. We have to keep it consistent, fast and
reliable, while keeping an eye on costs.
You are right, but the more I work with this solution, the
more I have the impression of dealing with something heavy, hard
to maintain because of these rarely-accessed files that make up most of
my data. Maybe it's just my impression; maybe I need some expertise
in an area that's still quite new to me.
At the moment this seems a good compromise between stability and
costs. Maybe in a year I'll be in your position (considering
moving everything back), but for now we are going forward this
way.
Makes perfect sense.
This being said, the original question :-)
Would VACUUM FULL be a risky operation? Does it have to be done on all
nodes, obviously in a low-traffic, low-access window (at night)?
VACUUM affects the physical blocks. In a physical streaming
replication scenario that might (or might not) affect read-only
queries on the hot standby (depending on usage and settings). Normally
I cannot see how a VACUUM (plain or FULL) would interact with logical
replication in any way. But again, since you run pgEdge-specific
software, you have to ask them.
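One sanity check you can run regardless of pgEdge specifics is the
logical slot lag on each node before and after the nightly VACUUM FULL
pass, using the standard catalog views; a minimal sketch:

    SELECT slot_name, active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                          confirmed_flush_lsn)) AS lag
    FROM pg_replication_slots
    WHERE slot_type = 'logical';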
Thanks. This makes me think I misread or misinterpreted something. They
had already suggested that I use VACUUM FULL on both nodes, but that
"thing" I read (or am convinced I read) made me think twice before
breaking everything. Two experts' concurring opinions are quite enough for me.
I will start this evening and see what happens.
Thanks for the help and the very interesting discussion.
Thanks for your help.
Moreno.-