What you describe doesn’t look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?
If your “single node” can handle tens or hundreds of thousands of requests per second, still has very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something very weird like this; those are just not the use cases I want to address, because I believe they are few and far between.

Best,
Pierre

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
> You will have to have functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation. That alone is not that simple. You will have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres was just not built with the assumption that the storage can be shared.
>
> -Vladimir
>
> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>> Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.
>>
>> But when PostgreSQL instances share storage rather than replicate:
>> - Consistency seems maintained (same data)
>> - Availability seems maintained (client can always promote an accessible node)
>> - Partitions between PostgreSQL nodes don't prevent the system from functioning
>>
>> It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.
>>
>> How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?
>>
>> Client with awareness of both PostgreSQL nodes
>>          |                          |
>>          ↓ (partition here)         ↓
>> PostgreSQL Primary         PostgreSQL Standby
>>          |                          |
>>          └────────────┬─────────────┘
>>                       ↓
>>               Shared ZFS Pool
>>                       |
>>          6 Global ZeroFS instances
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>> > Hi Seref,
>> >
>> > For the benchmarks, I used Hetzner's cloud service with the following setup:
>> >
>> > - A Hetzner S3 bucket in the FSN1 region
>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>> > - 3 ZeroFS NBD devices (same S3 bucket)
>> > - A ZFS striped pool with the 3 devices
>> > - 200 GB ZFS L2ARC
>> > - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off
>> >
>> > Best,
>> > Pierre
>> >
>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> >> Sorry, this was meant to go to the whole group:
>> >>
>> >> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>> >>
>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>> >>>
>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>> >>>
>> >>> # The Architecture
>> >>>
>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>> >>>
>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> >>>
>> >>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>> >>>
>> >>> ## Performance Results
>> >>>
>> >>> Here are pgbench results from PostgreSQL running on this setup:
>> >>>
>> >>> ### Read/Write Workload
>> >>>
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: TPC-B (sort of)>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.943 ms
>> >>> initial connection time = 48.043 ms
>> >>> tps = 53041.006947 (without initial connection time)
>> >>> ```
>> >>>
>> >>> ### Read-Only Workload
>> >>>
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: select only>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.121 ms
>> >>> initial connection time = 53.358 ms
>> >>> tps = 413436.248089 (without initial connection time)
>> >>> ```
>> >>>
>> >>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>> >>>
>> >>> ## How It Works
>> >>>
>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>> >>> 2. Multiple cache layers hide S3 latency:
>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>> >>>    b. ZeroFS memory cache for metadata and hot data
>> >>>    c. Optional local disk cache
>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>> >>>
>> >>> ## Geo-Distributed PostgreSQL
>> >>>
>> >>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
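>> >>>
>> >>> For reference, a single ZeroFS-backed stack goes together roughly like this. This is a sketch rather than the exact benchmark commands: device paths, the pool name, the L2ARC partition and the nbd-client flags are illustrative, and starting the ZeroFS NBD servers themselves is left out (see the repository linked above for that part).
>> >>>
>> >>> ```
>> >>> # Attach the NBD devices exported by ZeroFS (ports as in the diagrams below);
>> >>> # exact nbd-client invocation depends on how the ZeroFS exports are configured
>> >>> sudo nbd-client 127.0.0.1 10809 /dev/nbd0
>> >>> sudo nbd-client 127.0.0.1 10810 /dev/nbd1
>> >>> sudo nbd-client 127.0.0.1 10811 /dev/nbd2
>> >>>
>> >>> # Striped ZFS pool across the three devices, plus a local NVMe partition
>> >>> # as L2ARC (the 200 GB cache in the benchmark setup; device is illustrative)
>> >>> sudo zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
>> >>> sudo zpool add tank cache /dev/nvme0n1p4
>> >>> sudo zfs create -o mountpoint=/var/lib/postgresql tank/pgdata
>> >>>
>> >>> # WAL-related settings used for the benchmarks
>> >>> psql -c "ALTER SYSTEM SET synchronous_commit = off;"
>> >>> psql -c "ALTER SYSTEM SET wal_init_zero = off;"
>> >>> psql -c "ALTER SYSTEM SET wal_recycle = off;"
>> >>> psql -c "SELECT pg_reload_conf();"
>> >>> ```
>> >>>
>> >>> The example architectures below then either give each PostgreSQL node its own pool built this way (Architecture 1) or share a single pool between nodes (Architecture 2).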
>> >>>
>> >>> Example architectures:
>> >>>
>> >>> Architecture 1:
>> >>>
>> >>>                  PostgreSQL Client
>> >>>                         |
>> >>>                         | SQL queries
>> >>>                         |
>> >>>                  +--------------+
>> >>>                  |   PG Proxy   |
>> >>>                  |  (HAProxy/   |
>> >>>                  |  PgBouncer)  |
>> >>>                  +--------------+
>> >>>                    /          \
>> >>>                   /            \
>> >>>        Synchronous              Synchronous
>> >>>        Replication              Replication
>> >>>                 /                \
>> >>>                /                  \
>> >>>   +---------------+        +---------------+
>> >>>   | PostgreSQL 1  |        | PostgreSQL 2  |
>> >>>   |  (Primary)    |◄------►|  (Standby)    |
>> >>>   +---------------+        +---------------+
>> >>>           |                        |
>> >>>           |  POSIX filesystem ops  |
>> >>>           |                        |
>> >>>   +---------------+        +---------------+
>> >>>   |  ZFS Pool 1   |        |  ZFS Pool 2   |
>> >>>   | (3-way mirror)|        | (3-way mirror)|
>> >>>   +---------------+        +---------------+
>> >>>      /    |    \              /    |    \
>> >>>     /     |     \            /     |     \
>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>> >>>     |         |         |         |         |         |
>> >>> +--------++--------++--------++--------++--------++--------+
>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>> >>> +--------++--------++--------++--------++--------++--------+
>> >>>     |         |         |         |         |         |
>> >>>     |         |         |         |         |         |
>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>> >>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
>> >>>
>> >>> Architecture 2:
>> >>>
>> >>> PostgreSQL Primary (Region 1)  ←→  PostgreSQL Standby (Region 2)
>> >>>               \                          /
>> >>>                \                        /
>> >>>                  Same ZFS Pool (NBD)
>> >>>                          |
>> >>>                   6 Global ZeroFS
>> >>>                          |
>> >>>                      S3 Regions
>> >>>
>> >>> The main advantages I see are:
>> >>> 1. Dramatic cost reduction for large datasets
>> >>> 2. Simplified geo-distribution
>> >>> 3. Infinite storage capacity
>> >>> 4. Built-in encryption and compression
>> >>>
>> >>> Looking forward to your feedback and questions!
>> >>>
>> >>> Best,
>> >>> Pierre
>> >>>
>> >>> P.S. The full project includes a custom NFS filesystem too.
>> >>>
>> >
>>