What you describe doesn't look very useful for the vast majority of projects 
that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per 
second, while still having very durable, highly available storage and fast 
recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something very 
weird like this; those just aren’t the use cases I want to address, because I 
believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
> Shared storage would require a lot of extra work. That's essentially what 
> AWS Aurora does.
> You will need functionality to sync in-memory state between nodes, because 
> all the instances will have cached data that can easily become stale on any 
> write operation.
> That alone is not that simple. You will have to modify some locking logic, 
> and most likely make many other changes in many places; Postgres just wasn't 
> built with the assumption that the storage can be shared.
> 
> -Vladimir
> 
> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>> Now, I'm trying to understand how the CAP theorem applies here. Traditional 
>> PostgreSQL replication has clear CAP trade-offs - you choose between 
>> consistency and availability during partitions.
>> 
>> But when PostgreSQL instances share storage rather than replicate:
>> - Consistency seems maintained (same data)
>> - Availability seems maintained (client can always promote an accessible 
>> node)
>> - Partitions between PostgreSQL nodes don't prevent the system from 
>> functioning
>> 
>> It seems that CAP assumes specific implementation details (like nodes 
>> maintaining independent state) without explicitly stating them.
>> 
>> How should we think about CAP theorem when distributed nodes share storage 
>> rather than coordinate state? Are the trade-offs simply moved to a different 
>> layer, or does shared storage fundamentally change the analysis?
>> 
>> Client with awareness of both PostgreSQL nodes
>>     |                               |
>>     ↓ (partition here)              ↓
>> PostgreSQL Primary              PostgreSQL Standby
>>     |                               |
>>     └───────────┬───────────────────┘
>>                 ↓
>>          Shared ZFS Pool
>>                 |
>>          6 Global ZeroFS instances
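>> 
>> To make the "client can always promote an accessible node" part concrete: 
>> failover in this picture is just a stock PostgreSQL promotion, e.g. (the 
>> data directory path here is illustrative):
>> 
>> ```
>> # On the standby that the client can still reach:
>> pg_ctl promote -D /var/lib/postgresql/16/main
>> # or, from a SQL session on that standby:
>> #   SELECT pg_promote();
>> ```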
>> 
>> Best,
>> Pierre
>> 
>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>> > Hi Seref,
>> > 
>> > For the benchmarks, I used Hetzner's cloud service with the following 
>> > setup:
>> > 
>> > - A Hetzner S3 bucket in the FSN1 region
>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>> > - 3 ZeroFS NBD devices (same S3 bucket)
>> > - A ZFS striped pool across the 3 devices
>> > - 200 GB of ZFS L2ARC
>> > - Postgres configured accordingly memory-wise, as well as with 
>> > synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
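>> > 
>> > For reference, the Postgres side boils down to a handful of settings; a 
>> > minimal sketch (the memory values are illustrative, not the exact ones 
>> > from the benchmark):
>> > 
>> > ```
>> > # postgresql.conf excerpt (values illustrative; size to your RAM)
>> > shared_buffers = 48GB            # assumption: ~1/4 of the 192 GB machine
>> > effective_cache_size = 144GB     # assumption: accounts for ARC/L2ARC + OS cache
>> > synchronous_commit = off         # trade a small durability window for latency
>> > wal_init_zero = off              # don't zero-fill new WAL segments
>> > wal_recycle = off                # always create fresh WAL segments
>> > ```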
>> > 
>> > Best,
>> > Pierre
>> > 
>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> >> Sorry, this was meant to go to the whole group:
>> >> 
>> >> Very interesting! Great work. Can you clarify how exactly you're running 
>> >> postgres in your tests? A specific AWS service? What's the test 
>> >> infrastructure that sits above the file system?
>> >> 
>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>> >>> Hi everyone,
>> >>> 
>> >>> I wanted to share a project I've been working on that enables PostgreSQL 
>> >>> to run on S3 storage while maintaining performance comparable to local 
>> >>> NVMe. The approach uses block-level access rather than trying to map 
>> >>> filesystem operations to S3 objects.
>> >>> 
>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>> >>> 
>> >>> # The Architecture
>> >>> 
>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3 
>> >>> storage as raw block devices. PostgreSQL runs unmodified on ZFS pools 
>> >>> built on these block devices:
>> >>> 
>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> >>> 
>> >>> By providing block-level access and leveraging ZFS's caching 
>> >>> capabilities (L2ARC), we can achieve microsecond latencies despite the 
>> >>> underlying storage being in S3.
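>> >>> 
>> >>> For illustration, wiring the stack together looks roughly like this 
>> >>> (ports, device names and pool layout are assumptions for the sketch, 
>> >>> not commands shipped by ZeroFS):
>> >>> 
>> >>> ```
>> >>> # Attach NBD block devices exported by ZeroFS instances
>> >>> # (hosts/ports are hypothetical; exact nbd-client flags depend on your version)
>> >>> nbd-client 127.0.0.1 10809 /dev/nbd0
>> >>> nbd-client 127.0.0.1 10810 /dev/nbd1
>> >>> nbd-client 127.0.0.1 10811 /dev/nbd2
>> >>> 
>> >>> # Build a ZFS pool on top of them, with a local NVMe partition as L2ARC
>> >>> zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
>> >>> zpool add tank cache /dev/nvme0n1p4
>> >>> 
>> >>> # PostgreSQL then runs unmodified on a dataset in that pool
>> >>> zfs create -o mountpoint=/var/lib/postgresql tank/pgdata
>> >>> ```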
>> >>> 
>> >>> ## Performance Results
>> >>> 
>> >>> Here are pgbench results from PostgreSQL running on this setup:
>> >>> 
>> >>> ### Read/Write Workload
>> >>> 
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: TPC-B (sort of)>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.943 ms
>> >>> initial connection time = 48.043 ms
>> >>> tps = 53041.006947 (without initial connection time)
>> >>> ```
>> >>> 
>> >>> ### Read-Only Workload
>> >>> 
>> >>> ```
>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S 
>> >>> example
>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> >>> starting vacuum...end.
>> >>> transaction type: <builtin: select only>
>> >>> scaling factor: 50
>> >>> query mode: simple
>> >>> number of clients: 50
>> >>> number of threads: 15
>> >>> maximum number of tries: 1
>> >>> number of transactions per client: 100000
>> >>> number of transactions actually processed: 5000000/5000000
>> >>> number of failed transactions: 0 (0.000%)
>> >>> latency average = 0.121 ms
>> >>> initial connection time = 53.358 ms
>> >>> tps = 413436.248089 (without initial connection time)
>> >>> ```
>> >>> 
>> >>> These numbers are with 50 concurrent clients and the actual data stored 
>> >>> in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, 
>> >>> while cold data comes from S3.
>> >>> 
>> >>> ## How It Works
>> >>> 
>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can 
>> >>> use like any other block device
>> >>> 2. Multiple cache layers hide S3 latency:
>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>> >>>    b. ZeroFS memory cache for metadata and hot data
>> >>>    c. Optional local disk cache
>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
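>> >>> 
>> >>> If you want to check which of those layers is actually serving reads, the 
>> >>> standard OpenZFS tooling applies as usual (the pool name "tank" is 
>> >>> illustrative):
>> >>> 
>> >>> ```
>> >>> arc_summary                # one-shot report of ARC and L2ARC statistics
>> >>> arcstat 5                  # rolling ARC hit/miss rates every 5 seconds
>> >>> zpool iostat -v tank 5     # per-vdev I/O, including the cache (L2ARC) device
>> >>> ```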
>> >>> 
>> >>> ## Geo-Distributed PostgreSQL
>> >>> 
>> >>> Since each region can run its own ZeroFS instance, you can create 
>> >>> geographically distributed PostgreSQL setups.
>> >>> 
>> >>> Example architectures:
>> >>> 
>> >>> Architecture 1
>> >>> 
>> >>> 
>> >>>                          PostgreSQL Client
>> >>>                                    |
>> >>>                                    | SQL queries
>> >>>                                    |
>> >>>                             +--------------+
>> >>>                             |  PG Proxy    |
>> >>>                             | (HAProxy/    |
>> >>>                             |  PgBouncer)  |
>> >>>                             +--------------+
>> >>>                                /        \
>> >>>                               /          \
>> >>>                    Synchronous            Synchronous
>> >>>                    Replication            Replication
>> >>>                             /              \
>> >>>                            /                \
>> >>>               +---------------+        +---------------+
>> >>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>> >>>               | (Primary)     |◄------►| (Standby)     |
>> >>>               +---------------+        +---------------+
>> >>>                       |                        |
>> >>>                       |  POSIX filesystem ops  |
>> >>>                       |                        |
>> >>>               +---------------+        +---------------+
>> >>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>> >>>               | (3-way mirror)|        | (3-way mirror)|
>> >>>               +---------------+        +---------------+
>> >>>                /      |      \          /      |      \
>> >>>               /       |       \        /       |       \
>> >>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>> >>>              |        |        |           |        |        |
>> >>>         +--------++--------++--------++--------++--------++--------+
>> >>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>> >>>         +--------++--------++--------++--------++--------++--------+
>> >>>              |         |         |         |         |         |
>> >>>              |         |         |         |         |         |
>> >>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>> >>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>> >>> 
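>> >>> In Architecture 1, the two servers are ordinary PostgreSQL primary/standby 
>> >>> instances, so the replication part is stock configuration; a minimal sketch 
>> >>> (host names and the application_name are made up):
>> >>> 
>> >>> ```
>> >>> # Primary, postgresql.conf: commits wait for the named standby
>> >>> synchronous_standby_names = 'pg2'
>> >>> 
>> >>> # Standby, postgresql.conf (plus an empty standby.signal file in PGDATA)
>> >>> primary_conninfo = 'host=pg1.example.com user=replicator application_name=pg2'
>> >>> ```
>> >>> 
>> >>> HAProxy or PgBouncer in front simply routes clients to whichever node is 
>> >>> currently the primary.
>> >>> 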
>> >>> Architecture 2:
>> >>> 
>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>> >>>                 \                    /
>> >>>                  \                  /
>> >>>                   Same ZFS Pool (NBD)
>> >>>                          |
>> >>>                   6 Global ZeroFS
>> >>>                          |
>> >>>                       S3 Regions
>> >>> 
>> >>> 
>> >>> The main advantages I see are:
>> >>> 1. Dramatic cost reduction for large datasets
>> >>> 2. Simplified geo-distribution
>> >>> 3. Effectively unlimited storage capacity (backed by S3)
>> >>> 4. Built-in encryption and compression
>> >>> 
>> >>> Looking forward to your feedback and questions!
>> >>> 
>> >>> Best,
>> >>> Pierre
>> >>> 
>> >>> P.S. The full project includes a custom NFS filesystem too.
>> >>> 
>> >
>> 
