Joachim Worringen wrote:
> Greetings,
>
> we are running a few databases of currently 200GB (growing) in total
> for data warehousing:
> - new data via INSERTs for (up to) millions of rows per day;
>   sometimes with UPDATEs
> - most data in a single table (=> 10 to 100s of millions of rows)
> - queries SELECT subsets of this table via an index
> - for effective parallelisation, queries create (potentially large)
>   non-temporary tables which are deleted at the end of the query
>   => lots of simple INSERTs and SELECTs during queries
> - large transactions: they may contain millions of INSERTs/UPDATEs
> - running version PostgreSQL 8.4.2
>
> We are moving all this to a larger system - the hardware is
> available, therefore fixed:
> - Sun X4600 (16 cores, 64GB)
> - external SAS JBOD with 24 2,5" slots:
>   o 18x SAS 10k 146GB drives
>   o 2x SAS 10k 73GB drives
>   o 4x Intel SLC 32GB SATA SSD
>   JBOD connected to Adaptec SAS HBA with BBU
> - internal storage via on-board RAID HBA:
>   o 2x 73GB SAS 10k for OS (RAID1)
>   o 2x Intel SLC 32GB SATA SSD for ZIL (RAID1) (?)
> - OS will be Solaris 10 to have ZFS as filesystem (and dtrace)
> - 10GigE towards client tier (currently, another X4600 with 32 cores
>   and 64GB)
>
> What would be the optimal storage/ZFS layout for this? I checked
> solarisinternals.com and some PostgreSQL resources and came to the
> following concept - asking for your comments:
> - run the JBOD without HW-RAID, but let all redundancy be done by
>   ZFS for maximum flexibility
> - create separate ZFS pools for tablespaces (data, index, temp) and
>   WAL on separate devices (LUNs)
> - use the 4 SSDs in the JBOD as Level-2 ARC cache (can I use a
>   single cache for all pools?) w/o redundancy
> - use the 2 SSDs connected to the on-board HBA as RAID1 for the ZFS ZIL
>
> Potential issues that I see:
> - the ZFS ZIL will not benefit from a BBU (as it is connected to the
>   backplane, driven by the onboard RAID), and might be too small
>   (32GB for ~2TB of data with lots of writes)?
> - the pools on the JBOD might have the wrong size for the
>   tablespaces - like: using the 2 73GB drives as RAID1 for temp
>   might become too small, but adding a 146GB drive might not be a
>   good idea?
> - with 20 spindles, does it make sense at all to use dedicated
>   devices for the tablespaces, or will the load be distributed well
>   enough across the spindles anyway?
>
> thanks for any comments & suggestions,
>
> Joachim
I'll chime in based on some tuning experience I had under UFS with Pg 7.x, coupled with some experience with ZFS, but no experience with later Pg on ZFS. Take this with a grain of salt.

Pg loves to push everything to the WAL and then dribble the changes back to the datafiles when convenient. At a checkpoint, all of the changes are flushed in bulk to the tablespace. Since writes to the WAL and the datafiles are synchronous, the ZIL is used, which I believe translates to all data being written four times under ZFS: once to the WAL ZIL, then to the WAL, then to the tablespace ZIL, then to the tablespace.

For writes, I would break the WAL into its own pool and then put an SSD ZIL mirror on that (rough zpool sketches are in the P.S. below). That would let all WAL writes complete almost instantly and would keep the ZIL needs to the size of the WAL, which probably won't exceed the size of your SSDs. The ZIL on the WAL will especially help with large index updates, which can cause cascading b-tree splits and result in large amounts of small synchronous I/O, bringing Pg to a crawl. Checkpoints will still slow things down when the data is flushed to the tablespace pool, but that happens with coalesced writes, so iops are less of a concern there.

For reads, I would either keep indexes and tables in the same pool and back them with as much L2ARC as needed for the working set, or, if you lack sufficient L2ARC, break the indexes into their own pool and L2ARC those instead, because index reads are generally more random and more heavily used, at least for well-tuned queries. Full table scans on well-vacuumed tables are largely sequential, so table iops are again less of a concern. If you do break the indexes into their own pool for dedicated SSD L2ARC, you might consider adding some smaller or short-stroked 15K drives as L2ARC for the table pool.

For geometry, figure out the redundancy you need, e.g. +1, +2 or +3, then decide which is more important, space or iops. If L2ARC and ZIL reduce your need for iops, go with RAIDZ[123]; if you still need the iops, stripe a bunch of [123]-way mirrors together (again, see the P.S.). Yes, I would avoid HW RAID and run pure JBOD, and I would be tempted to keep the temp tables on the index or table pool.

Like I said above, take this with a grain of salt and feel free to throw out, disagree with or lampoon me for anything that does not resonate with you. Whatever you do, make sure you stress-test the configuration with production-size data and workloads before you deploy it.

Good luck,

Marty
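P.S. To make the write side concrete, here is roughly what I have in mind for the WAL: a small mirrored pool built from two of the 10k drives, with the two onboard SSDs as a mirrored log (slog) device. This is only a sketch; the cXtYdZ device names are made up, so substitute your own:

    # dedicated WAL pool: mirrored 10k disks plus a mirrored SSD slog
    # (all device names below are placeholders)
    zpool create walpool mirror c2t0d0 c2t1d0 log mirror c1t2d0 c1t3d0
    zfs create walpool/pg_xlog
    # then point the WAL at it, e.g. by symlinking $PGDATA/pg_xlog
    # to /walpool/pg_xlog before starting the cluster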
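For the read side, the four JBOD SSDs would become cache (L2ARC) devices on the pool holding the tables and indexes, e.g. striped mirrors of the 146GB drives (again, invented device names, and the number of mirrors is only for illustration). One caveat relevant to your original question: as far as I know, a cache device belongs to exactly one pool, so a single L2ARC cannot be shared across pools; if you split tables and indexes into separate pools, you would have to divide the SSDs between them.

    # main pool for tables and indexes: striped mirrors of 146GB drives,
    # with the four JBOD SSDs attached as L2ARC cache devices
    zpool create datapool \
        mirror c3t0d0 c3t1d0  mirror c3t2d0 c3t3d0 \
        mirror c3t4d0 c3t5d0  mirror c3t6d0 c3t7d0 \
        cache c3t20d0 c3t21d0 c3t22d0 c3t23d0
    # cache devices can also be attached to an existing pool later:
    #   zpool add datapool cache c3t22d0 c3t23d0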
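And the geometry trade-off in command form: if the slog and L2ARC take enough iops pressure off the spindles, the same drives could instead go into RAIDZ2 vdevs for more usable space at the cost of random iops, something like:

    # space over iops: two 8-disk raidz2 vdevs instead of striped mirrors
    zpool create datapool \
        raidz2 c3t0d0 c3t1d0 c3t2d0  c3t3d0  c3t4d0  c3t5d0  c3t6d0  c3t7d0 \
        raidz2 c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0 \
        cache c3t20d0 c3t21d0 c3t22d0 c3t23d0

Whichever way you slice it, the stress test with production-size data will tell you more than any of us can.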