Andrew Robb wrote:
Richard Elling wrote:
Andrew Robb wrote:
Richard Elling wrote:
Andrew Robb wrote:
I had to let this go and get on with testing DB2 on Solaris. I had to abandon zfs on local discs in x64 Solaris 10 5/08.

This version does not have the modern write throttle code, which
should explain much of what you experience.  The fix is available
in Solaris 10 10/08.  For more info, see Roch's excellent blog
http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle

One CR to reference is
http://bugs.opensolaris.org/view_bug.do?bug_id=6429205

IMHO, if you are trying to make performance measurements on such
old releases, then you are at great risk of wasting your time.  You
would be better served to look at more recent releases, within the
constraints of your business, of course.
-- richard

This still misses the BIG point - DIRECTIO primarily tries to avoid data entering the file system cache (the database already caches it in its own much larger buffer pools). For this big-iron cluster, once written, the table data is only read back from file if the database is restarted. I suppose the same is also true of transaction logs, which are only replayed as part of data recovery. Typically, a database will be fastest if it avoids the file system altogether. However, this is difficult to manage and we will benefit greatly if file systems are nearly as fast as raw devices.


There are a lot of misunderstandings surrounding directio.
UFS directio offers the following 4 features:
   1. no buffer cache  (ZFS: primarycache property)
   2. concurrent I/O  (ZFS: concurrent by design)
   3. async I/O code path (ZFS: more modern code path)
   4. long urban myth history (ZFS: forgetaboutit ;-)

The following pointers might be useful for you.
http://blogs.sun.com/bobs/entry/one_i_o_two_i
http://blogs.sun.com/roch/entry/people_ask_where_are_we

What is missing from the above (note to self: blog this :-) is that
you can limit the size of the buffer cache and control what sort
of data is cached via the "primarycache" parameter. It really
doesn't make a lot of sense to have zero buffer cache since
*any* disk I/O is going to be much, much, much more painful than
any bcopy/memcpy.
If you really don't want the ARC resizing on your behalf, then you
can cap it, as we describe in the Evil Tuning Guide.
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
But I think you'll find that the write throttle change is the big win
and that primarycache gives you fine control of cache behaviour.
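To make the two knobs above concrete, here is a hedged sketch (dataset name `dbpool/tables` and the cap size are placeholders, not from the thread):

```shell
# Cap the ARC so it cannot resize on your behalf, per the Evil Tuning
# Guide: add the line below to /etc/system and reboot.
# The 4 GB value (0x100000000) is an example only.
#   set zfs:zfs_arc_max = 0x100000000

# Per-dataset cache policy: keep only metadata in the ARC for a
# table-space dataset ("dbpool/tables" is a hypothetical name).
zfs set primarycache=metadata dbpool/tables

# Verify the property took effect.
zfs get primarycache dbpool/tables
```

Note that `primarycache` is per-dataset, so other filesystems in the same pool keep the default `all` behaviour.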
-- richard


Thanks for that information, Richard. It still doesn't explain what method an application has available to avoid caching on a particular file handle. On a given file system, most applications will benefit from caching most files. However, some applications will want to give a hint to the OS that it is REALLY a BAD idea to cache its file read/write operations on a file handle. The directio() call is a standard mechanism to achieve this on Solaris.

I'd like to explore this a bit with real use cases.  There are many
reasons for the design of directio() that have to do with the limitations
of UFS and NFS that may not exist in ZFS.


In my opinion, cache is good for directories and small files. Cache is bad for sequential access to files larger than physical memory (e.g. appending to transaction logs).

Eh?  Methinks you are getting a little confused with write caching vs
read caching. If I append a lot of little 512-byte blocks to a file, then
batching those up into a bigger record is a big win.  Having to block
while waiting for them to commit to disk is a big loss.  But most
transaction logs are written O_DSYNC, which follows a completely
different code path than normal I/O in both UFS and ZFS.

In between, it should be up to an application to give a hint to the OS as to whether cache is worthwhile or not. *The problem is that pointlessly caching large files can push small files out of the cache.*

This is particularly true for MRU caches, such as UFS.  But ZFS
implements an ARC, so the effect is mitigated.  While I'm sure
there are cases where the performance will be very bad, I don't
think that is generally true.  Again, a decent use case would be
helpful.

NB, this was a much bigger problem in the bad old days when
1 GByte of RAM cost $1M.  Today, 1 GByte of RAM costs orders of
magnitude less.


Your 'primarycache' parameter suggestion sounds like it applies to a whole file system. (I hope that 'metadata' includes directories.) This would be inadequate. If we have to set up a separate file system just for table space files, we might as well use ufs ;-). Actually, your suggestion is pretty good. We would naturally have separate zfs file systems for table spaces, in order to match zfs record size to table space block size.

From the use cases, we could determine the best match of workload
to dataset, something which is not possible in UFS or NFS.

I hope I understand you correctly. Let me summarise your suggestions:

1. upgrade Solaris to 10/08 to enable zfs write-throttling
2. set zfs 'primarycache' to 'metadata' for table space and transaction log file systems
3. ensure that SAN complies with 'flush volatile cache' semantics
4. tune ARC not to break large memory pages
5. tune ARC not to grow into database buffer pools (which are typically 90% of RAM)
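Items 4-5 of the summary above can be sketched as follows (the cap value is an example, not a recommendation from the thread):

```shell
# Cap the ARC so it cannot grow into the database buffer pools.
# In /etc/system (takes effect at next boot); 2 GB shown as an example:
#   set zfs:zfs_arc_max = 0x80000000

# After reboot, confirm the cap and the current ARC size from the
# kernel statistics:
kstat -p zfs:0:arcstats:c_max
kstat -p zfs:0:arcstats:size
```

With the buffer pools taking ~90% of RAM, the cap would be chosen to fit inside the remaining headroom.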

I haven't looked at the recommendations for when an app uses 90% of RAM
lately. It may be time to revisit them to make sure they still make good
sense. In the past, there have been issues with large-page availability,
but that is a generic problem, not ZFS-specific, so the recommendations
may not have turned up if you were only searching for ZFS. There is, no
doubt, opportunity for improvement in the docs here.


Also, for table space file systems, I would:
6. set the zfs record size to match the database block size
7. turn off atime
8. turn off checksum calculation (if not using raidz)

Don't do that!  Originally there was some concern that the CPU cycles needed
for the checksum would be wasted when an app also checksummed its own
data.  But the measurements didn't support that claim.  IMHO you are much
better off allowing ZFS to detect problems in the data *in addition* to any
verification done by an app.
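Putting the table-space settings together, a per-dataset setup might look like the sketch below (dataset name and record size are placeholders; DB2 page sizes vary, so match the actual table space page size):

```shell
# Create a dedicated dataset per table-space page size, with the
# record size matched to the page size and atime disabled.
# "dbpool/ts8k" and 8k are hypothetical examples.
zfs create -o recordsize=8k -o atime=off dbpool/ts8k

# Leave checksum at its default (on) -- per the reply above, the CPU
# cost is small and end-to-end detection is worth keeping.
zfs get recordsize,atime,checksum dbpool/ts8k
```

Note that `recordsize` only affects files written after it is set, which is another reason to configure the dataset before loading data.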


Note: when creating a DB2 database, the database is typically restarted several times. It would only be worthwhile changing 'primarycache' from 'all' once the database is finally running. (Use ARC to cache the empty database between restarts during configuration.)

The use cases may reveal interesting opportunities for the L2ARC
as well.  I think the predominant feeling is that L2ARC is a big win
for databases.  This win will be much bigger than turning off all
caching.
-- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
