Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Wed, Mar 17, 2010 at 9:15 AM, Edward Ned Harvey wrote: >> I think what you're saying is: Why bother trying to backup with "zfs >> send" >> when the recommended practice, fully supportable, is to use other tools >> for >> backup, such as tar, star, Amanda, bacula, etc. Right? >> >> The answer to this is very simple. >> #1 ... >> #2 ... > > Oh, one more thing. "zfs send" is only discouraged if you plan to store the > data stream and do "zfs receive" at a later date. > > If instead, you are doing "zfs send | zfs receive" onto removable media, or > another server, where the data is immediately fed through "zfs receive" then > it's an entirely viable backup technique. Richard Elling made an interesting observation that suggests that storing a zfs send data stream on tape is a quite reasonable thing to do. Richard's background makes me trust his analysis of this much more than I trust the typical person that says that zfs send output is poison. http://opensolaris.org/jive/thread.jspa?messageID=465973&tstart=0#465861 I think that a similar argument could be made for storing the zfs send data streams on a zfs file system. However, it is not clear why you would do this instead of just zfs send | zfs receive. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Fri, Mar 19, 2010 at 11:57 PM, Edward Ned Harvey wrote: >> 1. NDMP for putting "zfs send" streams on tape over the network. So > > Tell me if I missed something here. I don't think I did. I think this > sounds like crazy talk. > > I used NDMP up till November, when we replaced our NetApp with a Solaris Sun > box. In NDMP, to choose the source files, we had the ability to browse the > fileserver, select files, and specify file matching patterns. My point is: > NDMP is file based. It doesn't allow you to spawn a process and backup a > data stream. > > Unless I missed something. Which I doubt. ;-) 5+ years ago the variety of NDMP that was available with the combination of NetApp's OnTap and Veritas NetBackup did backups at the volume level. When I needed to go to tape to recover a file that was no longer in snapshots, we had to find space on a NetApp to restore the volume. It could not restore the volume to a Sun box, presumably because the contents of the backup used a data stream format that was proprietary to NetApp. An expired Internet Draft for NDMPv4 says: butype_name Specifies the name of the backup method to be used for the transfer (dump, tar, cpio, etc). Backup types are NDMP Server implementation dependent and MUST match one of the Data Server implementation specific butype_name strings accessible via the NDMP_CONFIG_GET_BUTYPE_INFO request. http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt It seems pretty clear from this that an NDMP data stream can contain most anything and is dependent on the device being backed up. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff
On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams wrote:
> One really good use for zfs diff would be: as a way to index zfs send
> backups by contents.

Or to generate the list of files for incremental backups via NetBackup or
similar. This is especially important for file systems with millions of
files with relatively few changes.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is it safe to disable the swap partition?
On Sun, May 9, 2010 at 7:40 PM, Edward Ned Harvey wrote: > > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > > boun...@opensolaris.org] On Behalf Of Richard Elling > > > > For a storage server, swap is not needed. If you notice swap being used > > then your storage server is undersized. > > Indeed, I have two solaris 10 fileservers that have uptime in the range of a > few months. I just checked swap usage, and they're both zero. > > So, Bob, rub it in if you wish. ;-) I was wrong. I knew the behavior in > Linux, which Roy seconded as "most OSes," and apparently we both assumed the > same here, but that was wrong. I don't know if solaris and opensolaris both > have the same swap behavior. I don't know if there's *ever* a situation > where solaris/opensolaris would swap idle processes. But there's at least > evidence that my two servers have not, or do not. If Solaris is under memory pressure, pages may be paged to swap. Under severe memory pressure, entire processes may be swapped. This will happen after freeing up the memory used for file system buffers, ARC, etc. If the processes never page in the pages that have been paged out (or the processes that have been swapped out are never scheduled) then those pages will not consume RAM. The best thing to do with processes that can be swapped out forever is to not run them. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
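For anyone who wants to verify this on their own machines, a few quick
checks (a sketch; exact output formats vary by Solaris release):

  # Summary of reserved, allocated and available swap
  swap -s

  # Per-device swap usage; "free" close to "blocks" means little has been used
  swap -l

  # Paging activity broken out by type; a non-zero "apo" (anonymous
  # page-outs) column is the sign of real pressure pushing pages to swap
  vmstat -p 5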
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote:
> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>> There are multiple factors at work. Your OpenSolaris should be new
>> enough to have the fix in which the zfs I/O tasks are run in in a
>> scheduling class at lower priority than normal user processes.
>> However, there is also a throttling mechanism for processes which
>> produce data faster than can be consumed by the disks. This
>> throttling mechanism depends on the amount of RAM available to zfs and
>> the write speed of the I/O channel. More available RAM results in
>> more write buffering, which results in a larger chunk of data written
>> at the next transaction group write interval. The maximum size of a
>> transaction group may be configured in /etc/system similar to:
>>
>> * Set ZFS maximum TXG group size to 2684354560
>> set zfs:zfs_write_limit_override = 0xa000
>>
>> If the transaction group is smaller, then zfs will need to write more
>> often. Processes will still be throttled but the duration of the
>> delay should be smaller due to less data to write in each burst. I
>> think that (with multiple writers) the zfs pool will be "healthier"
>> and less fragmented if you can offer zfs more RAM and accept some
>> stalls during writing. There are always tradeoffs.
>>
>> Bob
>
> well it seems like when messing with the txg sync times and stuff like
> that it did make the transfer more smooth but didn't actually help with
> speeds as it just meant the hangs happened for a shorter time but at a
> smaller interval and actually lowering the time between writes just
> seemed to make things worse (slightly).
>
> I think I have came to the conclusion that the problem here is CPU due
> to the fact that its only doing this with parity raid. I would think if
> it was I/O based then it would be the same as if anything its heavier on
> I/O on non parity raid due to the fact that it is no longer CPU
> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
> parity raidz2).

To see if the CPU is pegged, take a look at the output of:

  mpstat 1
  prstat -mLc 1

If mpstat shows that the idle time reaches 0 or the process' latency column
is more than a few tenths of a percent, you are probably short on CPU.

It could also be that interrupts are stealing cycles from rsync. Placing it
in a processor set with interrupts disabled in that processor set may help.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
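A minimal sketch of the processor set suggestion above (the CPU IDs, set ID,
and rsync PID are made-up examples; all of this requires root):

  # Create a processor set from two CPUs and keep interrupts off of them
  psrset -c 2 3          # prints the new set id, e.g. 1
  psradm -i 2 3          # disable interrupt handling on those CPUs
  psrset -b 1 1234       # bind the rsync process (PID is an example) to set 1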
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping.

On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness wrote:
> On 05/31/2010 02:52 PM, Mike Gerdts wrote:
> > On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote:
> >> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
> >>> There are multiple factors at work. Your OpenSolaris should be new
> >>> enough to have the fix in which the zfs I/O tasks are run in in a
> >>> scheduling class at lower priority than normal user processes.
> >>> However, there is also a throttling mechanism for processes which
> >>> produce data faster than can be consumed by the disks. This
> >>> throttling mechanism depends on the amount of RAM available to zfs and
> >>> the write speed of the I/O channel. More available RAM results in
> >>> more write buffering, which results in a larger chunk of data written
> >>> at the next transaction group write interval. The maximum size of a
> >>> transaction group may be configured in /etc/system similar to:
> >>>
> >>> * Set ZFS maximum TXG group size to 2684354560
> >>> set zfs:zfs_write_limit_override = 0xa000
> >>>
> >>> If the transaction group is smaller, then zfs will need to write more
> >>> often. Processes will still be throttled but the duration of the
> >>> delay should be smaller due to less data to write in each burst. I
> >>> think that (with multiple writers) the zfs pool will be "healthier"
> >>> and less fragmented if you can offer zfs more RAM and accept some
> >>> stalls during writing. There are always tradeoffs.
> >>>
> >>> Bob
> >>
> >> well it seems like when messing with the txg sync times and stuff like
> >> that it did make the transfer more smooth but didn't actually help with
> >> speeds as it just meant the hangs happened for a shorter time but at a
> >> smaller interval and actually lowering the time between writes just
> >> seemed to make things worse (slightly).
> >>
> >> I think I have came to the conclusion that the problem here is CPU due
> >> to the fact that its only doing this with parity raid. I would think if
> >> it was I/O based then it would be the same as if anything its heavier on
> >> I/O on non parity raid due to the fact that it is no longer CPU
> >> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
> >> parity raidz2).
> >
> > To see if the CPU is pegged, take a look at the output of:
> >
> >   mpstat 1
> >   prstat -mLc 1
> >
> > If mpstat shows that the idle time reaches 0 or the process' latency
> > column is more then a few tenths of a percent, you are probably short
> > on CPU.
> >
> > It could also be that interrupts are stealing cycles from rsync.
> > Placing it in a processor set with interrupts disabled in that
> > processor set may help.
>
> Unfortunately none of these utilies make it possible to ge values for <1
> second which is what the hang is (its happening for about 1/2 of a second).
>
> Here is with mpstat:
>
> Here is what i get with prstat:
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>    604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
>    604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
>   1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
>    604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
>    604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
>    604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
>    604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
>    604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
>    604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
>    604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
>    604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
>   1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17

In the interval abo
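For stalls that last well under a second, kernel profiling can show where
the time goes even though mpstat and prstat cannot sample that finely. A
sketch, using the common lockstat profiling idiom (run it while the hang is
happening):

  # Sample the kernel with the profiling interrupt for one second and
  # show the top 20 call sites; a txg sync or parity computation that
  # dominates the sample points at the culprit
  lockstat -kIW -D 20 sleep 1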
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin wrote:
> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote:
>>
>> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>>>
>>> I would be interested to know how many IOPS an OS like Solaris is able to
>>> push through a single device interface. The normal driver stack is likely
>>> limited as to how many IOPS it can sustain for a given LUN since the driver
>>> stack is optimized for high latency devices like disk drives. If you are
>>> creating a driver stack, the design decisions you make when requests will be
>>> satisfied in about 12ms would be much different than if requests are
>>> satisfied in 50us. Limitations of existing software stacks are likely
>>> reasons why Sun is designing hardware with more device interfaces and more
>>> independent devices.
>>
>> Open Solaris 2009.06, 1KB READ I/O:
>>
>> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0&
>
> /dev/null is usually a poor choice for a test lie this. Just to be on the
> safe side, I'd rerun it with /dev/random.
>
> Regards,
> Andrey

(aside from other replies about read vs. write and /dev/random...)

Testing performance of disk by reading from /dev/random and writing to disk
is misguided. From random(7d):

     Applications retrieve random bytes by reading /dev/random or
     /dev/urandom. The /dev/random interface returns random bytes only
     when sufficient amount of entropy has been collected.

In other words, when the kernel doesn't think that it can give high quality
random numbers, it stops providing them until it has gathered enough
entropy. It will pause your reads.

If instead you use /dev/urandom, the above problem doesn't exist, but the
generation of random numbers is CPU-intensive. There is a reasonable chance
(particularly with slow CPU's and fast disk) that you will be testing the
speed of /dev/urandom rather than the speed of the disk or other I/O
components.

If your goal is to provide data that is not all 0's to prevent ZFS
compression from making the file sparse or want to be sure that compression
doesn't otherwise make the actual writes smaller, you could try something
like:

  # create a file just over 100 MB
  dd if=/dev/random of=/tmp/randomdata bs=513 count=204401

  # repeatedly feed that file to dd
  while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=...

The above should make it so that it will take a while before there are two
blocks that are identical, thus confounding deduplication as well.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda wrote:
> On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote:
>
>> I think dedup may have its greatest appeal in VDI environments (think
>> about a environment with 85% if the data that the virtual machine needs is
>> into ARC or L2ARC... is like a dream...almost instantaneous response... and
>> you can boot a new machine in a few seconds)...
>
> This may also be accomplished by using snapshots and clones of data sets. At
> least for OS images: user profiles and documents could be something else
> entirely.

It all depends on the nature of the VDI environment. If the VMs are
regenerated on each login, the snapshot + clone mechanism is sufficient.
Deduplication is not needed. However, if VMs have a long life and get
periodic patches and other software updates, deduplication will be required
if you want to remain at somewhat constant storage utilization.

It probably makes a lot of sense to be sure that swap or page files are on a
non-dedup dataset. Executables and shared libraries shouldn't be getting
paged out to it and the likelihood that multiple VMs page the same thing to
swap or a page file is very small.

> Another situation that comes to mind is perhaps as the back-end to a mail
> store: if you send out a message(s) with an attachment(s) to a lot of
> people, the attachment blocks could be deduped (and perhaps compressed as
> well, since base-64 adds 1/3 overhead).

It all depends on how this is stored. If the attachments are stored like
they were in 1990 as part of an mbox format, you will be very unlikely to
get the proper block alignment. Even storing the message body (including
headers) in the same file as the attachment may not align the attachments
because the mail headers may be different (e.g. different recipients'
messages took different paths, some were forwarded, etc.).

If the attachments are stored in separate files or a database format is used
that stores attachments separate from the message (with matching database +
zfs block size) things may work out favorably. However, a system that
detaches messages and stores them separately may just as well store them in
a file that matches the SHA256 hash, assuming that file doesn't already
exist. If it does exist, it can just increment a reference count. In other
words, an intelligent mail system should already dedup. Or at least that is
how I would have written it for the last decade or so...

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
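A minimal shell sketch of the content-addressed attachment store described
above (the paths are placeholders and the digest(1) usage is illustrative;
this is not how any particular mail system is implemented):

  # Store an attachment under the name of its SHA256 hash. If a file by
  # that name already exists, the content is already in the store.
  hash=`digest -a sha256 /tmp/attachment.pdf`
  store=/var/mailstore/attachments/$hash
  if [ ! -f "$store" ]; then
          cp /tmp/attachment.pdf "$store"
  fi
  # Messages reference $store rather than carrying their own copy; a
  # reference count decides when the attachment can be removed.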
Re: [zfs-discuss] VXFS to ZFS Quota
On Fri, Jun 18, 2010 at 8:09 AM, David Magda wrote:
> You could always split things up into groups of (say) 50. A few jobs ago,
> I was in an environment where we have a /home/students1/ and
> /home/students2/, along with a separate faculty/ (using Solaris and UFS).
> This had more to do with IOps than anything else.

A decade or so ago I managed similar environments and had (I think) 6 file
systems handling about 5000 students. Each file system had about 1/6 of the
students. Challenges I found in this were:

- Students needed to work on projects together. The typical way to do this
  was for them to request a group, then create a group writable directory in
  one of their home directories. If all students in the group had home
  directories on the same file system, there was nothing special to
  consider. If they were on different file systems then at least one would
  need to have a non-zero quota (that is, not 0 blocks soft, 1 block hard)
  on the file system where the group directory resides.

- Despite your best efforts things will get imbalanced. If you are tight on
  space, this means that you will need to migrate users. This will become
  apparent only at the times of the semester where even per-user outages are
  most inconvenient (i.e. at 6 and 13 weeks when big projects tend to be
  due).

It's probably a good idea to consider these types of situations in the
transition plan, or at least determine they don't apply. I was working in a
college of engineering where group projects were common and CAD, EDA, and
simulation tools could generate big files very quickly.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn wrote: >> >> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having >> one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS >> (according to zpool iostat) and 20-30MB/sec in random read on a big >> database. Does that sounds right? > > I am not sure who wrote the above text since the attribution quoting is all > botched up (Gmail?) in this thread. Regardless, it is worth pointing out > that 'zpool iostat' only reports the I/O operations which were actually > performed. It will not report the operations which did not need to be > performed due to already being in cache. A quite busy system can still > report very little via 'zpool iostat' if it has enough RAM to cache the > requested data. > > Bob Very good point. You can use a combination of "zpool iostat" and fsstat to see the effect of reads that didn't turn into physical I/Os. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
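For example (a sketch; the pool name and mount point are placeholders),
watching both layers side by side makes cache hits visible as the gap
between logical and physical activity:

  # Physical I/O actually issued to the pool, once per second
  zpool iostat tank 1

  # Logical read/write operations seen at the file system layer
  fsstat /tank/db 1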
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 10:08 AM, Ian D wrote: > What I don't understand is why, when I run a single query I get <100 IOPS > and <3MB/sec. The setup can obviously do better, so where is the > bottleneck? I don't see any CPU core on any side being maxed out so it > can't be it... In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPU's. "prstat -mL" gives a nice picture of how busy each LWP (thread) is. When viewed with "prstat -mL", A thread that has usr+sys at 100% cannot go any faster, unless you can get the CPU to go faster, as I suggest below. From my understanding (perhaps not 100% correct on the rest of this paragraph): The time spent in TRP may be reclaimed by running the application in a processor set with interrupts disabled on all of its processors. If TFL or DFL are high, optimizing the use of cache may be beneficial. Examples of how you can optimize the use of cache include using the FX scheduler with a priority that gives relatively long time slices, using processor sets to keep other processes off of the same caches (which are often shared by multiple cores), or perhaps disabling CPU's (threads) to ensure that only a single core is using each cache. With current generation Intel CPU's, this can allow the CPU clock rate to increase, thereby allowing more work to get done. > The database is MySQL, it runs on a Linux box that connects to the Nexenta Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis. > server through 10GbE using iSCSI. Have you done any TCP tuning? Based on the numbers you cite above, it looks like you are doing about 32 KB I/O's. I think you can perform a test that involves mainly the network if you use netperf with options like: netperf -H $host -t TCP_RR -r 32768 -l 30 That is speculation based on reading http://www.netperf.org/netperf/training/Netperf.html. Someone else (perhaps on networking or performance lists) may have better tests to run. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
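A sketch of the FX scheduling class suggestion above (the PID, priority, and
time quantum are illustrative values, not recommendations):

  # Move the process into the fixed-priority class with a long time
  # quantum so it gets relatively long, uninterrupted runs on a CPU
  priocntl -s -c FX -m 30 -p 30 -t 200 -i pid 12345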
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 2:08 PM, Ian D wrote:
> Mem:  74098512k total, 73910728k used,   187784k free,    96948k buffers
> Swap:  2104488k total,      208k used,  2104280k free, 63210472k cached
>
>   PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
> 17652 mysql  20   0 3553m 3.1g 5472 S   38  4.4 247:51.80 mysqld
> 16301 mysql  20   0 4275m 3.3g 5980 S    4  4.7   5468:33 mysqld
> 16006 mysql  20   0 4434m 3.3g 5888 S    3  4.6   5034:06 mysqld
> 12822 root   15  -5     0    0    0 S    2  0.0  22:00.50 scsi_wq_39

Is that 38% of one CPU or 38% of all CPU's? How many CPU's does the Linux
box have? I don't mean the number of sockets, I mean number of sockets *
number of cores * number of threads per core.

My recollection of top is that the CPU percentage is:

  (pcpu_t2 - pcpu_t1) / (interval * ncpus)

Where pcpu_t* is the process CPU time at a particular time. If you have a
two socket quad core box with hyperthreading enabled, that is 2 * 4 * 2 = 16
CPU's. 38% of 16 CPU's can be roughly 6 CPU's running as fast as they can
(and 10 of them idle) or 16 CPU's each running at about 38%. In the "I don't
have a CPU bottleneck" argument, there is a big difference.

If PID 16301 has a single thread that is doing significant work, on the
hypothetical 16 CPU box this means that it is spending about 2/3 of the time
on CPU. If the workload does:

  while ( 1 ) {
      issue I/O request
      get response
      do cpu-intensive work
  }

It is only trying to do I/O 1/3 of the time. Further, it has put a single
high latency operation between its bursts of CPU activity.

One other area of investigation that I didn't mention before: Your stats
imply that the Linux box is getting data 32 KB at a time. How does 32 KB
compare to the database block size? How does 32 KB compare to the block size
on the relevant zfs filesystem or zvol? Are blocks aligned at the various
layers?

http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hashing files rapidly on ZFS
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen wrote: > Daniel Carosone wrote: >> Something similar would be useful, and much more readily achievable, >> from ZFS from such an application, and many others. Rather than a way >> to compare reliably between two files for identity, I'ld liek a way to >> compare identity of a single file between two points in time. If my >> application can tell quickly that the file content is unaltered since >> last time I saw the file, I can avoid rehashing the content and use a >> stored value. If I can achieve this result for a whole directory >> tree, even better. > > This would be great for any kind of archiving software. Aren't zfs checksums > already ready to solve this? If a file changes, it's dnodes' checksum changes, > the checksum of the directory it is in and so forth all the way up to the > uberblock. > There may be ways a checksum changes without a real change in the files > content, > but the other way round should hold. If the checksum didn't change, the file > didn't change. > So the only missing link is a way to determine zfs's checksum for a > file/directory/dataset. Am I missing something here? Of course atime update > should be turned off, otherwise the checksum will get changed by the archiving > agent. What is the likelihood that the same data is re-written to the file? If that is unlikely, it looks as though znode_t's z_seq may be useful. While it isn't a checksum, it seems to be incremented on every file change. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote:
> On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
>>
>> I think there may be very good reason to use iSCSI, if you're limited
>> to gigabit but need to be able to handle higher throughput for a
>> single client. I may be wrong, but I believe iSCSI to/from a single
>> initiator can take advantage of multiple links in an active-active
>> multipath scenario whereas NFS is only going to be able to take
>> advantage of 1 link (at least until pNFS).
>
> There are other ways to get multiple paths. First off, there is IP
> multipathing. which offers some of this at the IP layer. There is also
> 802.3ad link aggregation (trunking). So you can still get high
> performance beyond single link with NFS. (It works with iSCSI too,
> btw.)

With both IPMP and link aggregation, each TCP session will go over the same
wire. There is no guarantee that load will be evenly balanced between links
when there are multiple TCP sessions. As such, any scalability you get using
these configurations will be dependent on having a complex enough workload,
wise configuration choices, and a bit of luck.

Note that with Sun Trunking there was an option to load balance using a
round robin hashing algorithm. When pushing high network loads this may
cause performance problems with reassembly.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
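As a concrete illustration (a sketch; the link names are placeholders and
dladm syntax differs between Solaris 10 and OpenSolaris releases), the
aggregation's hashing policy only chooses a link per flow, so a single TCP
session still rides one wire:

  # 802.3ad aggregation over two NICs, hashing on L4 (port) headers so
  # that different sessions have a chance of landing on different links
  dladm create-aggr -P L4 -l e1000g0 -l e1000g1 aggr0
  dladm show-aggr aggr0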
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote: >> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: >> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> >> >> I think there may be very good reason to use iSCSI, if you're limited >> >> to gigabit but need to be able to handle higher throughput for a >> >> single client. I may be wrong, but I believe iSCSI to/from a single >> >> initiator can take advantage of multiple links in an active-active >> >> multipath scenario whereas NFS is only going to be able to take >> >> advantage of 1 link (at least until pNFS). >> > >> > There are other ways to get multiple paths. First off, there is IP >> > multipathing. which offers some of this at the IP layer. There is also >> > 802.3ad link aggregation (trunking). So you can still get high >> > performance beyond single link with NFS. (It works with iSCSI too, >> > btw.) >> >> With both IPMP and link aggregation, each TCP session will go over the >> same wire. There is no guarantee that load will be evenly balanced >> between links when there are multiple TCP sessions. As such, any >> scalability you get using these configurations will be dependent on >> having a complex enough workload, wise cconfiguration choices, and and >> a bit of luck. > > If you're really that concerned, you could use UDP instead of TCP. But > that may have other detrimental performance impacts, I'm not sure how > bad they would be in a data center with generally lossless ethernet > links. Heh. My horror story with reassembly was actually with connectionless transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks via UDP by default, or LLT when used in the Veritas + Oracle RAC certified configuration from 5+ years ago. The use of Sun trunking with round robin hashing and the lack of use of jumbo packets made every cache fusion block turn into 6 LLT or UDP packets that had to be reassembled on the other end. This was on a 15K domain with the NICs spread across IO boards. I assume that interrupts for a NIC are handled by a CPU on the closest system board (Solaris 8, FWIW). If that assumption is true then there would also be a flurry of inter-system board chatter to put the block back together. In any case, performance was horrible until we got rid of round robin and enabled jumbo frames. > Btw, I am not certain that the multiple initiator support (mpxio) is > necessarily any better as far as guaranteed performance/balancing. (It > may be; I've not looked closely enough at it.) I haven't paid close attention to how mpxio works. The Veritas analog, vxdmp, does a very good job of balancing traffic down multiple paths, even when only a single LUN is accessed. The exact mode that dmp will use is dependent on the capabilities of the array it is talking to - many arrays work in an active/passive mode. As such, I would expect that with vxdmp or mpxio the balancing with iSCSI would be at least partially dependent on what the array said to do. > I should look more closely at NFS as well -- if multiple applications on > the same client are access the same filesystem, do they use a single > common TCP session, or can they each have separate instances open? > Again, I'm not sure. It's worse than that. A quick experiment with two different automounted home directories from the same NFS server suggests that both home directories share one TCP session to the NFS server. The latest version of Oracle's RDBMS supports a userland NFS client option. 
It would be very interesting to see if this does a separate session per data file, possibly allowing for better load spreading. >> Note that with Sun Trunking there was an option to load balance using >> a round robin hashing algorithm. When pushing high network loads this >> may cause performance problems with reassembly. > > Yes. Reassembly is Evil for TCP performance. > > Btw, the iSCSI balancing act that was described does seem a bit > contrived -- a single initiator and a COMSTAR server, both client *and > server* with multiple ethernet links instead of a single 10GbE link. > > I'm not saying it doesn't happen, but I think it happens infrequently > enough that its reasonable that this scenario wasn't one that popped > immediately into my head. :-) It depends on whether the people that control the network gear are the same ones that control servers. My experience suggests that if there is a disconnect, it seems rather likely that each group's standardization efforts, procurement cycles, and capacity plans will work against any attempt t
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin wrote:
>>>>>> "mg" == Mike Gerdts writes:
>
>     mg> it is rather common to have multiple 1 Gb links to
>     mg> servers going to disparate switches so as to provide
>     mg> resilience in the face of switch failures. This is not unlike
>     mg> (at a block diagram level) the architecture that you see in
>     mg> pretty much every SAN. In such a configuation, it is
>     mg> reasonable for people to expect that load balancing will
>     mg> occur.
>
> nope. spanning tree removes all loops, which means between any two
> points there will be only one enabled path. An L2-switched network
> will look into L4 headers for splitting traffic across an aggregated
> link (as long as it's been deliberately configured to do that---by
> default probably only looks to L2), but it won't do any multipath
> within the mesh.

I was speaking more of IPMP, which is at layer 3.

> Even with an L3 routing protocol it usually won't do multipath unless
> the costs of the paths match exactly, so you'd want to build the
> topology to achieve this and then do all switching at layer 3 by
> making sure no VLAN is larger than a switch.

By default, IPMP does outbound load spreading. Inbound load spreading is not
practical with a single (non-test) IP address. If you have multiple virtual
IP's you can spread them across all of the NICs in the IPMP group and get
some degree of inbound spreading as well. This is the default behavior of
the OpenSolaris IPMP implementation, last I looked. I've not seen any
examples (although I can't say I've looked real hard either) of the Solaris
10 IPMP configuration set up with multiple IP's to encourage inbound load
spreading as well.

> There's actually a cisco feature to make no VLAN larger than a *port*,
> which I use a little bit. It's meant for CATV networks I think, or
> DSL networks aggregated by IP instead of ATM like maybe some European
> ones? but the idea is not to put edge ports into vlans any more but
> instead say 'ip unnumbered loopbackN', and then some black magic they
> have built into their DHCP forwarder adds /32 routes by watching the
> DHCP replies. If you don't use DHCP you can add static /32 routes
> yourself, and it will work. It does not help with IPv6, and also you
> can only use it on vlan-tagged edge ports (what? arbitrary!) but
> neat that it's there at all.
>
> http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html

Interesting... however this seems to limit you to < 4096 edge ports per VTP
domain, as the VID field in the 802.1q header is only 12 bits. It is also
unclear how this works when you have one physical host with many guests. And
then there is the whole thing that I don't really see how this helps with
resilience in the face of a switch failure. Cool technology, but I'm not
certain that it addresses what I was talking about.

> The best thing IMHO would be to use this feature on the edge ports,
> just as I said, but you will have to teach the servers to VLAN-tag
> their packets. not such a bad idea, but weird.
>
> You could also use it one hop up from the edge switches, but I think
> it might have problems in general removing the routes when you unplug
> a server, and using it one hop up could make them worse. I only use
> it with static routes so far, so no mobility for me: I have to keep
> each server plugged into its assigned port, and reconfigure switches
> if I move it. Once you have ``no vlan larger than 1 switch,'' if you
> actually need a vlan-like thing that spans multiple switches, the new
> word for it is 'vrf'.

There was some other Cisco dark magic that our network guys were touting a
while ago that would make each edge switch look like a blade in a 6500
series. This would then allow them to do link aggregation across edge
switches. At least two of "organizational changes", "personnel changes", and
"roadmap changes" happened so I've not seen this in action.

> so, yeah, it means the server people will have to take over the job of
> the networking people. The good news is that networking people don't
> like spanning tree very much because it's always going wrong, so
> AFAICT most of them who are paying attention are already moving in
> this direction.
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving /export to another zpool
On Fri, Aug 13, 2010 at 1:07 PM, Handojo wrote:
>> Are the old /opt and /expore still listed in your
>> vfstab(4) file?
>
> I cant access /etc/vfstab because I can't even log in as my username. I can't
> even log in as root from the Login Screen
>
> And when I boot on using LiveCD, how can I mount my first drive that has
> opensolaris installed ?

To list the zpools it can see:

  zpool import

To import one called rpool at an alternate root:

  zpool import -R /mnt rpool

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
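To go on and inspect /etc/vfstab from the live CD, something along these
lines should work (a sketch; rpool/ROOT/opensolaris is a guess at the boot
environment name, and the root dataset normally has canmount=noauto so it
must be mounted explicitly):

  zpool import -R /mnt rpool
  zfs list -r rpool/ROOT              # find the boot environment dataset
  zfs mount rpool/ROOT/opensolaris    # root fs now appears under /mnt
  vi /mnt/etc/vfstab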
Re: [zfs-discuss] bigger zfs arc
On Fri, Oct 2, 2009 at 1:45 PM, Rob Logan wrote: >> zfs will use as much memory as is "necessary" but how is "necessary" >> calculated? > > using arc_summary.pl from > http://www.cuddletech.com/blog/pivot/entry.php?id=979 > my tiny system shows: > Current Size: 4206 MB (arcsize) > Target Size (Adaptive): 4207 MB (c) That looks a lot like ~ 4 * 1024 MB. Is this a 64-bit capable system that you have booted from a 32-bit kernel? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
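A couple of quick checks that would answer this (a sketch; the kstat name
assumes the usual zfs:0:arcstats kstat):

  isainfo -kv                      # 32-bit or 64-bit kernel?
  prtconf | grep Memory            # installed RAM
  kstat -p zfs:0:arcstats:c_max    # the ARC's configured upper bound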
Re: [zfs-discuss] dedupe is in
On Mon, Nov 2, 2009 at 7:20 AM, Jeff Bonwick wrote: >> Terrific! Can't wait to read the man pages / blogs about how to use it... > > Just posted one: > > http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup > > Enjoy, and let me know if you have any questions or suggestions for > follow-on posts. > > Jeff On systems with crypto accelerators (particularly Niagara 2) does the hash calculation code use the crypto accelerators, so long as a supported hash is used? Assuming the answer is yes, have performance comparisons been done between weaker hash algorithms implemented in software and sha256 implemented in hardware? I've been waiting very patiently to see this code go in. Thank you for all your hard work (and the work of those that helped too!). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe is in
On Mon, Nov 2, 2009 at 11:58 AM, Dennis Clarke wrote:
>
>>> Terrific! Can't wait to read the man pages / blogs about how to use
>>> it...
>>
>> Just posted one:
>>
>> http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
>>
>> Enjoy, and let me know if you have any questions or suggestions for
>> follow-on posts.
>
> Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3 I was thinking that the
> major leap from SHA256 to SHA512 was a 32-bit to 64-bit step.
>
> If the implementation of the SHA256 ( or possibly SHA512 at some point )
> algorithm is well threaded then one would be able to leverage those
> massively multi-core Niagara T2 servers. The SHA256 hash is based on six
> 32-bit functions whereas SHA512 is based on six 64-bit functions. The CMT
> Niagara T2 can easily process those 64-bit hash functions and the
> multi-core CMT trend is well established. So long as context switch times
> are very low one would think that IO with a SHA512 based de-dupe
> implementation would be possible and even realistic. That would solve the
> hash collision concern I would think.
>
> Merely thinking out loud here ...

And my out loud thinking on this says that the crypto accelerator on a T2
system does hardware acceleration of SHA256.

  NAME
       n2cp - Ultra-SPARC T2 crypto provider device driver

  DESCRIPTION
       The n2cp device driver is a multi-threaded, loadable hardware
       driver supporting hardware assisted acceleration of the following
       cryptographic operations, which are built into the Ultra-SPARC T2
       CMT processor:

       DES:     CKM_DES_CBC, CKM_DES_ECB
       DES3:    CKM_DES3_CBC, CKM_DES3_ECB,
       AES:     CKM_AES_CBC, CKM_AES_ECB, CKM_AES_CTR
       RC4:     CKM_RC4
       MD5:     CKM_MD5, CKM_MD5_HMAC, CKM_MD5_HMAC_GENERAL, CKM_SSL3_MD5_MAC
       SHA-1:   CKM_SHA_1, CKM_SHA_1_HMAC, CKM_SHA_1_HMAC_GENERAL,
                CKM_SSL3_SHA1_MAC
       SHA-256: CKM_SHA256, CKM_SHA256_HMAC, CKM_SHA256_HMAC_GENERAL

According to page 35 of
http://www.slideshare.net/ramesh_r_nagappan/wirespeed-cryptographic-acceleration-for-soa-and-java-ee-security,
a T2 CPU can do 41 Gb/s of SHA256. The implication here is that this keeps
the MAU's busy but the rest of the core is still idle for things like
compression, TCP, etc.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup question
On Mon, Nov 2, 2009 at 2:16 PM, Nicolas Williams wrote:
> On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:
>> forgive my ignorance, but what's the advantage of this new dedup over
>> the existing compression option? Wouldn't full-filesystem compression
>> naturally de-dupe?
>
> If you snapshot/clone as you go, then yes, dedup will do little for you
> because you'll already have done the deduplication via snapshots and
> clones. But dedup will give you that benefit even if you don't
> snapshot/clone all your data. Not all data can be managed
> hierarchically, with a single dataset at the root of a history tree.
>
> For example, suppose you want to create two VirtualBox VMs running the
> same guest OS, sharing as much on-disk storage as possible. Before
> dedup you had to: create one VM, then snapshot and clone that VM's VDI
> files, use an undocumented command to change the UUID in the clones,
> import them into VirtualBox, and setup the cloned VM using the cloned
> VDI files. (I know because that's how I manage my VMs; it's a pain,
> really.) With dedup you need only enable dedup and then install the two
> VMs.

The big difference here is when you consider a life cycle that ends long
after provisioning is complete. With clones, the images will diverge. If a
year after you install each VM you decide to do an OS upgrade, they will
still be linked but are quite unlikely to both reference many of the same
blocks. However, with deduplication, the similar changes (e.g. same patch
applied, multiple of the same application installed, upgrade to the same
newer OS) will result in fewer stored copies.

This isn't a big deal if you have 2 VM's. It becomes quite significant if
you have 5000 (e.g. on a ZFS-based file server). Assuming that the deduped
blocks stay deduped in the ARC, it means that it is feasible for every block
that is accessed with any frequency to be in memory. Oh yeah, and you save a
lot of disk space.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CIFS shares being lost
On Fri, Nov 20, 2009 at 7:55 PM, Emily Grettel wrote: > Well I took the plunge updating to the latest dev version. (snv_127) and I > don't seem to be able to remotely login via ssh via putty: > > Using username "emilytg". > Authenticating with public key "dsa-pub" from agent > Server refused to allocate pty > Sun Microsystems Inc. SunOS 5.11 snv_127 November 2008 This looks like... http://defect.opensolaris.org/bz/show_bug.cgi?id=12380 But that was supposed to be fixed in snv_126. Can you check /etc/minor_perm for this entry: clone:ptmx 0666 root sys Mike > > Not good :( > > Cheers, > Em > > > From: emilygrettelis...@hotmail.com > To: zfs-discuss@opensolaris.org > Date: Sat, 21 Nov 2009 12:30:45 +1100 > Subject: Re: [zfs-discuss] CIFS shares being lost > >> by a Win7 client was crashing our CIFS server within 5-10 seconds. > Hmmm thats probably it then. Most of our users have been using Windows 7 and > people put their machines on standby when they leave the office for the day. > Maybe this is why we've had issues and having to restart on a daily basis. > It works fine during the day with no downtime. > > Is it safe to update to the latest dev version? I know it creates a BE and > we can revert to 2009.06 later but is there a way of just updating ZFS > instead of downloading 800Mb more and updating the entire OS? > > Cheers, > Em > > >> Date: Fri, 20 Nov 2009 18:14:06 -0700 >> From: edmud...@bounceswoosh.org >> To: emilygrettelis...@hotmail.com >> CC: zfs-discuss@opensolaris.org >> Subject: Re: [zfs-discuss] CIFS shares being lost >> >> On Sat, Nov 21 at 11:41, Emily Grettel wrote: >> > >> > Wow that was mighty quick Tim! >> > >> > Sorry, I have to reboot the server. I can SSH into the box, VNC etc >> > but no CIFS shares are visible. >> >> I found 2009.06 to be unusable for CIFS due to hangs that weren't >> resolved until b114/b116. We had to revert to 2008.11, as any access >> by a Win7 client was crashing our CIFS server within 5-10 seconds. >> >> These were the suspected culprits: >> >> >> http://mail.opensolaris.org/pipermail/indiana-discuss/2009-June/015711.html >> >> though I think there was another issue in b114 that wasn't resolved >> until b116. >> >> Unfortunately, the zpool default version in 2009.06 is 1 iteration >> ahead of the one in 2008.11, so there's no smooth downgrade if you >> created your zpools with a 2009.06 image. We had started with 2008.11 >> and not updated our zpool, so a re-install of 2008.11 worked. >> >> The latest dev branch (b126-or-thereabouts, 2010.02 preview) is >> reportedly good for CIFS based on traffic from this list. >> >> --eric >> >> -- >> Eric D. Mudama >> edmud...@mail.bounceswoosh.org >> > > > Check out The Great Australian Pay Check now Want to know what your boss is > paid? > > View photos of singles in your area! Looking for a date? > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
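For reference, the check suggested above can be done with (a sketch):

  grep '^clone:ptmx' /etc/minor_perm

If the line is missing or shows more restrictive permissions than the entry
quoted in the message, that would line up with sshd being unable to allocate
a pty.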
[zfs-discuss] Best practices for zpools on zfs
Suppose I have a storage server that runs ZFS, presumably providing file
(NFS) and/or block (iSCSI, FC) services to other machines that are running
Solaris. Some of the use will be for LDoms and zones[1], which would create
zpools on top of zfs (fs or zvol). I have concerns about variable block
sizes and the implications for performance.

1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

Suppose that on the storage server, an NFS shared dataset is created without
tuning the block size. This implies that when the client (ldom or zone v12n
server) runs mkfile or similar to create the backing store for a vdisk or a
zpool, the file on the storage server will be created with 128K blocks. Then
when Solaris or OpenSolaris is installed into the vdisk or zpool, files of a
wide variety of sizes will be created. At this layer they will be created
with variable block sizes (512B to 128K).

The implications for a 512 byte write in the upper level zpool (inside a
zone or ldom) seem to be:

- The 512 byte write turns into a 128 KB write at the storage server (256x
  multiplication in write size).
- To write that 128 KB block, the rest of the block needs to be read to
  recalculate the checksum. That is, a read/modify/write process is forced.
  (Less impact if block already in ARC.)
- Deduplication is likely to be less effective because it is unlikely that
  the same combination of small blocks in different zones/ldoms will be
  packed into the same 128 KB block.

Alternatively, the block size could be forced to something smaller at the
storage server. Setting it to 512 bytes could eliminate the
read/modify/write cycle, but would presumably be less efficient (less
performant) with moderate to large files. Setting it somewhere in between
may be desirable as well, but it is not clear where. The key competition in
this area seems to have a fixed 4 KB block size.

Questions:

Are my basic assumptions correct about a given file consisting only of a
single sized block, except for perhaps the final block?

Has any work been done to identify the performance characteristics in this
area?

Is there less to be concerned about from a performance standpoint if the
workload is primarily read?

To maximize the efficacy of dedup, would it be best to pick a fixed block
size and match it between the layers of zfs?

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
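As an illustration of forcing a smaller block size at the storage server (a
sketch; the pool and dataset names are placeholders and the right values
depend on the client workload):

  # File-based backing store shared over NFS: cap the record size
  zfs create -o recordsize=8k tank/vdisks

  # iSCSI backing store on a zvol: the block size is fixed at creation
  zfs create -V 20g -o volblocksize=4k tank/ldom1-disk0

Note that recordsize can be changed later but only affects files written
afterwards, while volblocksize cannot be changed once the zvol exists.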
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling wrote: > Good question! Additional thoughts below... > > On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: > >> Suppose I have a storage server that runs ZFS, presumably providing >> file (NFS) and/or block (iSCSI, FC) services to other machines that >> are running Solaris. Some of the use will be for LDoms and zones[1], >> which would create zpools on top of zfs (fs or zvol). I have concerns >> about variable block sizes and the implications for performance. >> >> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss >> >> Suppose that on the storage server, an NFS shared dataset is created >> without tuning the block size. This implies that when the client >> (ldom or zone v12n server) runs mkfile or similar to create the >> backing store for a vdisk or a zpool, the file on the storage server >> will be created with 128K blocks. Then when Solaris or OpenSolaris is >> installed into the vdisk or zpool, files of a wide variety of sizes >> will be created. At this layer they will be created with variable >> block sizes (512B to 128K). >> >> The implications for a 512 byte write in the upper level zpool (inside >> a zone or ldom) seems to be: >> >> - The 512 byte write turns into a 128 KB write at the storage server >> (256x multiplication in write size). >> - To write that 128 KB block, the rest of the block needs to be read >> to recalculate the checksum. That is, a read/modify/write process >> is forced. (Less impact if block already in ARC.) >> - Deduplicaiton is likely to be less effective because it is unlikely >> that the same combination of small blocks in different zones/ldoms >> will be packed into the same 128 KB block. >> >> Alternatively, the block size could be forced to something smaller at >> the storage server. Setting it to 512 bytes could eliminate the >> read/modify/write cycle, but would presumably be less efficient (less >> performant) with moderate to large files. Setting it somewhere in >> between may be desirable as well, but it is not clear where. The key >> competition in this area seems to have a fixed 4 KB block size. >> >> Questions: >> >> Are my basic assumptions about a given file consisting only of a >> single sized block, except for perhaps the final block? > > Yes, for a file system dataset. Volumes are fixed block size with > the default being 8 KB. So in the iSCSI over volume case, OOB > it can be more efficient. 4KB matches well with NTFS or some of > the Linux file systems OOB is missing from my TLA translator. Help, please. > >> Has any work been done to identify the performance characteristics in >> this area? > > None to my knowledge. The performance teams know to set the block > size to match the application, so they don't waste time re-learning this. That works great for certain workloads, particularly those with a fixed record size or large sequential I/O. If the workload is "installing then running an operating system" the answer is harder to define. > >> Is there less to be concerned about from a performance standpoint if >> the workload is primarily read? > > Sequential read: yes > Random read: no I was thinking that random wouldn't be too much of a concern either assuming that the things that are commonly read are in cache. I guess this does open the door for a small chunk of useful code in the middle of a largely useless shared library to force lot of that shared library into the ARC, among other things. 
> >> To maximize the efficacy of dedup, would it be best to pick a fixed >> block size and match it between the layers of zfs? > > I don't think we know yet. Until b128 arrives in binary, and folks get > some time to experiment, we just don't have much data... and there > are way too many variables at play to predict. I can make one > prediction, though, dedupe for mkfile or dd if=/dev/zero will scream :-) We already have that optimization with compression. Dedupe just messes up my method of repeatedly writing the same smallish (<1MB) chunk of random or already compressed data to avoid the block-of-zeros compression optimization. Pretty soon filebench is going to need to add statistical methods to mimic the level of duplicate data it is simulating. Trying to write simple benchmarks to test increasingly smart systems looks to be problematic. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling wrote: > On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote: > >> On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling >> wrote: >>> >>> Good question! Additional thoughts below... >>> >>> On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: >>> >>>> Suppose I have a storage server that runs ZFS, presumably providing >>>> file (NFS) and/or block (iSCSI, FC) services to other machines that >>>> are running Solaris. Some of the use will be for LDoms and zones[1], >>>> which would create zpools on top of zfs (fs or zvol). I have concerns >>>> about variable block sizes and the implications for performance. >>>> >>>> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss >>>> >>>> Suppose that on the storage server, an NFS shared dataset is created >>>> without tuning the block size. This implies that when the client >>>> (ldom or zone v12n server) runs mkfile or similar to create the >>>> backing store for a vdisk or a zpool, the file on the storage server >>>> will be created with 128K blocks. Then when Solaris or OpenSolaris is >>>> installed into the vdisk or zpool, files of a wide variety of sizes >>>> will be created. At this layer they will be created with variable >>>> block sizes (512B to 128K). >>>> >>>> The implications for a 512 byte write in the upper level zpool (inside >>>> a zone or ldom) seems to be: >>>> >>>> - The 512 byte write turns into a 128 KB write at the storage server >>>> (256x multiplication in write size). >>>> - To write that 128 KB block, the rest of the block needs to be read >>>> to recalculate the checksum. That is, a read/modify/write process >>>> is forced. (Less impact if block already in ARC.) >>>> - Deduplicaiton is likely to be less effective because it is unlikely >>>> that the same combination of small blocks in different zones/ldoms >>>> will be packed into the same 128 KB block. >>>> >>>> Alternatively, the block size could be forced to something smaller at >>>> the storage server. Setting it to 512 bytes could eliminate the >>>> read/modify/write cycle, but would presumably be less efficient (less >>>> performant) with moderate to large files. Setting it somewhere in >>>> between may be desirable as well, but it is not clear where. The key >>>> competition in this area seems to have a fixed 4 KB block size. >>>> >>>> Questions: >>>> >>>> Are my basic assumptions about a given file consisting only of a >>>> single sized block, except for perhaps the final block? >>> >>> Yes, for a file system dataset. Volumes are fixed block size with >>> the default being 8 KB. So in the iSCSI over volume case, OOB >>> it can be more efficient. 4KB matches well with NTFS or some of >>> the Linux file systems >> >> OOB is missing from my TLA translator. Help, please. > > Out of box. Looky there, it was in my TLA translator after all. Not sure how I missed it the first time. > >>> >>>> Has any work been done to identify the performance characteristics in >>>> this area? >>> >>> None to my knowledge. The performance teams know to set the block >>> size to match the application, so they don't waste time re-learning this. >> >> That works great for certain workloads, particularly those with a >> fixed record size or large sequential I/O. If the workload is >> "installing then running an operating system" the answer is harder to >> define. > > running OSes don't create much work, post boot Agreed, particularly if backups are pushed to the storage server. 
I suspect that most apps that shuffle bits between protocols but do little disk I/O can piggy back on this idea. That is, a J2EE server that just talks to the web and database tier, with some log entries and occasional app deployments should be pretty safe too. > >>>> Is there less to be concerned about from a performance standpoint if >>>> the workload is primarily read? >>> >>> Sequential read: yes >>> Random read: no >> >> I was thinking that random wouldn't be too much of a concern either >> assuming that the things that are commonly read are in cache. I guess >> this does open the door for a small chunk of useful code in the middle >> of a largely useless sh
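A short sketch of the recordsize tuning discussed above, with hypothetical pool and dataset names: for file-backed vdisks the dataset recordsize can be forced down before the backing files are created, while an iSCSI-exported volume gets its block size fixed at creation time via volblocksize.
# zfs create -o recordsize=8k tank/vdisks          # files created here are written with 8 KB blocks
# zfs set recordsize=8k tank/vdisks                # or change it later; only files created afterward are affected
# zfs create -V 20g -o volblocksize=8k tank/vol1   # zvol for iSCSI/FC; volblocksize cannot be changed after creation
Which value actually matches an "installing then running an operating system" workload is the part that, as noted above, still lacks hard performance data.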
Re: [zfs-discuss] ZFS Random Read Performance
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus wrote: >> You're peaking at 658 256KB random IOPS for the 3511, or ~66 >> IOPS per drive. Since ZFS will max out at 128KB per I/O, the disks >> see something more than 66 IOPS each. The IOPS data from >> iostat would be a better metric to observe than bandwidth. These >> drives are good for about 80 random IOPS each, so you may be >> close to disk saturation. The iostat data for IOPS and svc_t will >> confirm. > > But ... if I am saturating the 3511 with one thread, then why do I get > many times that performance with multiple threads ? I'm having troubles making sense of the iostat data (I can't tell how many threads at any given point), but I do see lots of times where asvc_t * reads is in the range 850 ms to 950 ms. That is, this is as fast as a single threaded app with a little bit of think time can issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms think_time = 1 sec). The %busy shows that 90+% of the time there is an I/O in flight (100 reads * 9ms = 900/1000 = 90%). However, %busy isn't aware of how many I/O's could be in flight simultaneously. When you fire up more threads, you are able to have more I/O's in flight concurrently. I don't believe that the I/O's per drive is really a limiting factor at the single threaded case, as the spec sheet for the 3511 says that it has 1 GB of cache per controller. Your working set is small enough that it is somewhat likely that many of those random reads will be served from cache. A dtrace analysis of just how random the reads are would be interesting. I think that hotspot.d from the DTrace Toolkit would be a good starting place. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
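For anyone who wants to attempt that analysis, here are a couple of hedged dtrace(1M) one-liners using the stock io provider (device names will differ; hotspot.d from the DTrace Toolkit remains the more polished option):
# dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_blkno); }'
# dtrace -n 'io:::start { @n[args[1]->dev_statname] = count(); } tick-1sec { printa(@n); trunc(@n); }'
The first shows the distribution of block offsets per device (a flat spread suggests genuinely random reads that will defeat the 3511's cache); the second prints per-second I/O counts per device, so the effect of adding threads is directly visible.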
Re: [zfs-discuss] proposal partial/relative paths for zfs(1)
Is there still any interest in this? I've done a bit of hacking (then searched for this thread - I picked -P instead of -c)... $ zfs get -P compression,dedup /var NAME PROPERTY VALUE SOURCE rpool/ROOT/zfstest compression on inherited from rpool/ROOT rpool/ROOT/zfstest dedup off default $ pfexec zfs snapshot -P @now Creating snapshot Of course create/mkdir would make it into the eventual implementation as well. For those missing this thread in their mailboxes, the conversation is archived at http://mail.opensolaris.org/pipermail/zfs-discuss/2008-July/019762.html. Mike
On Thu, Jul 10, 2008 at 4:42 AM, Darren J Moffat wrote: > I regularly create new zfs filesystems or snapshots and I find it > annoying that I have to type the full dataset name in all of those cases. > > I propose we allow zfs(1) to infer the part of the dataset name up to the > current working directory. For example: > > Today: > > $ zfs create cube/builds/darrenm/bugs/6724478 > > With this proposal: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs create 6724478 > > Both of these would result in a new dataset cube/builds/darrenm/bugs/6724478 > > This will need some careful thought about how to deal with cases like this: > > $ pwd > /cube/builds/ > $ zfs create 6724478/test > > What should that do ? should it create cube/builds/6724478 and > cube/builds/6724478/test ? Or should it fail ? -p already provides > some capabilities in this area. > > Maybe the easiest way out of the ambiguity is to add a flag to zfs > create for the partial dataset name eg: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs create -c 6724478 > > Why "-c" ? -c for "current directory" "-p" partial is already taken to > mean "create all non existing parents" and "-r" relative is already used > consistently as "recurse" in other zfs(1) commands (as well as lots of > other places). > > Alternately: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs mkdir 6724478 > > Which would act like mkdir does (including allowing a -p and -m flag > with the same meaning as mkdir(1)) but creates datasets instead of > directories. > > Thoughts ? Is this useful for anyone else ? My above examples are some > of the shorter dataset names I use, ones in my home directory can be > even deeper. > > -- > Darren J Moffat > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On Thu, Nov 26, 2009 at 8:53 PM, Toby Thain wrote: > > On 26-Nov-09, at 8:57 PM, Richard Elling wrote: > >> On Nov 26, 2009, at 1:20 PM, Toby Thain wrote: >>> >>> On 25-Nov-09, at 4:31 PM, Peter Jeremy wrote: >>> >>>> On 2009-Nov-24 14:07:06 -0600, Mike Gerdts wrote: >>>>> >>>>> ... fill a 128 >>>>> KB buffer with random data then do bitwise rotations for each >>>>> successive use of the buffer. Unless my math is wrong, it should >>>>> allow 128 KB of random data to be used to write 128 GB of data with very >>>>> little deduplication or compression. A much larger data set could be >>>>> generated with the use of a 128 KB linear feedback shift register... >>>> >>>> This strikes me as much harder to use than just filling the buffer >>>> with 8/32/64-bit random numbers >>> >>> I think Mike's reasoning is that a single bit shift (and propagation) is >>> cheaper than generating a new random word. After the whole buffer is >>> shifted, you have a new very-likely-unique block. (This seems like overkill >>> if you know the dedup unit size in advance.) >> >> You should be able to get a unique block by shifting one word, as long >> as the shift doesn't duplicate the word. > > That is true, but you will run out of permutations sooner. Rather than shifting a word, you could just increment it. In a multi-threaded test, each thread picks the word corresponding to the thread that is executing. Assuming 32-bit words (64-bit is overkill), this allows up to 128 threads with 512 byte blocks. It also allows up to 2 TB per thread per 512 bytes in a block. That is, if 50 threads are used and the block size is 8 KB, there should be no duplicates in 2 * 50 * 8192 / 512 = 1600 TB. But... this leads us back to the point that the workload generators are too good at generating unique data. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
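A single-threaded sketch of that incrementing-word generator as a perl one-liner (block size, block count, and output path are arbitrary choices): the buffer is filled with random bytes once, then a 32-bit counter is stamped into the first word before each write, so no two blocks are identical and the payload stays essentially incompressible. A multi-threaded generator would stamp thread N's counter into word N, as described above.
$ perl -e '$bs=8192; $n=1024; $buf .= chr(int(rand(256))) for 1..$bs;
    for $i (0..$n-1) { substr($buf,0,4) = pack("N",$i); print $buf }' > /pool/fs/unique-data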
Re: [zfs-discuss] ZFS send | verify | receive
On Sat, Dec 5, 2009 at 11:32 AM, Bob Friesenhahn wrote: > On Sat, 5 Dec 2009, dick hoogendijk wrote: > >> On Sat, 2009-12-05 at 09:22 -0600, Bob Friesenhahn wrote: >> >>> You can also stream into a gzip or lzop wrapper in order to obtain the >>> benefit of incremental CRCs and some compression as well. >> >> Can you give an example command line for this option please? > > Something like > > zfs send mysnapshot | gzip -c -3 > /somestorage/mysnap.gz > > should work nicely. Zfs send sends to its standard output so it is just a > matter of adding another filter program on its output. This could be > streamed over ssh or some other streaming network transfer protocol. > > Later, you can do 'gzip -t mysnap.gz' on the machine where the snapshot file > is stored to verify that it has not been corrupted in storage or transfer. > > lzop (not part of Solaris) is much faster than gzip but can be used in a > similar way since it is patterned after gzip. It seems as though a similar filter could be written to compute and inject an error correcting code into the stream. That is: zfs send $snap | ecc -i > /somestorage/mysnap.ecc ecc -o < /somestorage/mysnap.ecc | zfs receive ... I'm not aware of an existing ecc program, but I can't imagine it would be hard to create one. There seems to already be an implementation of Reed-Solomon encoding in ON that could likely be used as a starting point. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
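If writing a dedicated ecc filter is more effort than it is worth, an existing tool such as par2(1) from the par2cmdline package (not part of Solaris) can add recovery data alongside the stored stream. Something along these lines should work; the 10% redundancy figure is only an example:
$ zfs send mypool/fs@snap | gzip -c -3 > /somestorage/mysnap.gz
$ par2 create -r10 /somestorage/mysnap.gz.par2 /somestorage/mysnap.gz   # create 10% recovery data
$ par2 verify /somestorage/mysnap.gz.par2                               # detect corruption later
$ par2 repair /somestorage/mysnap.gz.par2                               # rebuild mysnap.gz if it is damaged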
Re: [zfs-discuss] ZFS - how to determine which physical drive to replace
On Sat, Dec 12, 2009 at 9:58 AM, Edward Ned Harvey wrote: > I would suggest something like this: While the system is still on, if the > failed drive is at least writable *a little bit* … then you can “dd > if=/dev/zero of=/dev/rdsk/FailedDiskDevice bs=1024 count=1024” … and then > after the system is off, you could plug the drives into another system > one-by-one, and read the first 1M, and see if it’s all zeros. (Or instead > of dd zero, you could echo some text onto the drive, or whatever you think > is easiest.) > How about reading instead? dd if=/dev/rdsk/$whatever of=/dev/null If the failed disk generates I/O errors that prevent it from reading at a rate that causes an LED to blink, you could read from all of the good disks. The one that doesn't blink is the broken one. You can also get the drive serial number with iostat -En: $ iostat -En c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: Hitachi HTS5425 Revision: Serial No: 080804BB6300HCG Size: 160.04GB <160039305216 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 ... That /should/ be printed on the disk somewhere. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
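A sketch of the "read from all of the good disks" idea, with example device names only; each dd should keep its disk's activity LED lit, so the drive that stays dark is the suspect (adjust the slice names to the actual labels in use):
# for d in c0t0d0 c0t1d0 c0t2d0 c0t3d0; do dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=2048 & done
# wait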
Re: [zfs-discuss] compressratio vs. dedupratio
On Mon, Dec 14, 2009 at 3:54 PM, Craig S. Bell wrote: > I am also accustomed to seeing diluted properties such as compressratio. > IMHO it could be useful (or perhaps just familiar) to see a diluted dedup > ratio for the pool, or maybe see the size / percentage of data used to arrive > at dedupratio. > > As Jeff points out, there is enough data available to calculate this. Would > it be meaningful enough to present a diluted ratio property? IOW, would that > tell me anything than I don't get from simply using "available" as my fuel > gauge? > > This is probably a larger topic: What additional statistics would be > genuinely useful to the admin when there is space interaction between > datasets. As we have seen, some commands are less objective with dedup: I was recently confused when doing mkfile (or was it dd if=/dev/zero ...) and found that even though blocks were compressed away to nothing, the compressratio did not increase. For example: # perl -e 'print "a" x 1' > /test/a # zfs get compressratio test NAME PROPERTY VALUE SOURCE test compressratio 7.87x - However if I put null characters into the same file: # dd if=/dev/zero of=a bs=1 count=1 1+0 records in 1+0 records out # zfs get compressratio test NAME PROPERTY VALUE SOURCE test compressratio 1.00x - I understand that a block is not allocated if it contains all zero's, but that would seem to contribute to a higher compressratio rather than a lower compressratio. If I disable compression and enable dedup, does it count deduplicated blocks of zeros toward the dedupratio? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compressratio vs. dedupratio
On Tue, Dec 15, 2009 at 2:31 AM, Craig S. Bell wrote: > Mike, I believe that ZFS treats runs of zeros as holes in a sparse file, > rather than as regular data. So they aren't really present to be counted for > compressratio. > > http://blogs.sun.com/bonwick/entry/seek_hole_and_seek_data > http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/017565.html But it only does so when compression is enabled, as such I would expect that compression would claim this as a win. Without it, someone may assume that they aren't getting much benefit from compression, turn it off, then run into problems down the road because sparseness that develops in files never turns into free space. Also, I would expect that: - If a file is created via a write to every block, it should be accounted for as non-sparse (regardless of compression=) - If a file is sparse because the program that created the file used seek() or similar to skip past blocks, it should be accounted for as sparse (regardless of compression). - If a program overwrites a block in an existing file with zeros, the file should not be considered sparse. In the below example, I would expect that writing 100 MB of '\0' would contribute as much to compressratio as 100 MB of 'a'. Notice that a block of zeros does not turn into a sparse file with compression=off.
# zfs create test/on # zfs create test/off # zfs set compression=off test/off # zfs get compression test/on test/off NAME PROPERTY VALUE SOURCE test/off compression off local test/on compression on inherited from test # mkfile 100m on/100m off/100m # ls -l o*/100m -rw------T 1 root root 104857600 Dec 15 14:27 off/100m -rw------T 1 root root 104857600 Dec 15 14:27 on/100m # du -h o*/100m 100M off/100m 0K on/100m # perl -e 'print "a" x 100000000' > on/a # perl -e 'print "a" x 100000000' > off/a # sync # ls -l */a -rw-r--r-- 1 root root 100000000 Dec 15 14:35 off/a -rw-r--r-- 1 root root 100000000 Dec 15 14:35 on/a # du -h */a 95M off/a 3.4M on/a # zfs get compressratio test/on test/off NAME PROPERTY VALUE SOURCE test/off compressratio 1.00x - test/on compressratio 28.27x - -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zones on shared storage - a warning
I've been playing around with zones on NFS a bit and have run into what looks to be a pretty bad snag - ZFS keeps seeing read and/or checksum errors. This exists with S10u8 and OpenSolaris dev build snv_129. This is likely a blocker for anyone thinking of implementing parts of Ed's Zones on Shared Storage: http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss The OpenSolaris example appears below. The order of events is: 1) Create a file on NFS, turn it into a zpool 2) Configure a zone with the pool as zonepath 3) Install the zone, verify that the pool is healthy 4) Boot the zone, observe that the pool is sick
root@soltrain19# mount filer:/path /mnt root@soltrain19# cd /mnt root@soltrain19# mkdir osolzone root@soltrain19# mkfile -n 8g root root@soltrain19# zpool create -m /zones/osol osol /mnt/osolzone/root root@soltrain19# zonecfg -z osol osol: No such zone configured Use 'create' to begin configuring a new zone. zonecfg:osol> create zonecfg:osol> info zonename: osol zonepath: brand: ipkg autoboot: false bootargs: pool: limitpriv: scheduling-class: ip-type: shared hostid: zonecfg:osol> set zonepath=/zones/osol zonecfg:osol> set autoboot=false zonecfg:osol> verify zonecfg:osol> commit zonecfg:osol> exit root@soltrain19# chmod 700 /zones/osol root@soltrain19# zoneadm -z osol install Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ http://pkg-na-2.opensolaris.org/dev/). Publisher: Using contrib (http://pkg.opensolaris.org/contrib/). Image: Preparing at /zones/osol/root. Cache: Using /var/pkg/download. Sanity Check: Looking for 'entire' incorporation. Installing: Core System (output follows) DOWNLOAD PKGS FILES XFER (MB) Completed 46/46 12334/12334 93.1/93.1 PHASE ACTIONS Install Phase 18277/18277 No updates necessary for this image. Installing: Additional Packages (output follows) DOWNLOAD PKGS FILES XFER (MB) Completed 36/36 3339/3339 21.3/21.3 PHASE ACTIONS Install Phase 4466/4466 Note: Man pages can be obtained by installing SUNWman Postinstall: Copying SMF seed repository ... done. Postinstall: Applying workarounds. Done: Installation completed in 2139.186 seconds. Next Steps: Boot the zone, then log into the zone console (zlogin -C) to complete the configuration process.
6.3 Boot the OpenSolaris zone root@soltrain19# zpool status osol pool: osol state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM osol ONLINE 0 0 0 /mnt/osolzone/root ONLINE 0 0 0 errors: No known data errors root@soltrain19# zoneadm -z osol boot root@soltrain19# zpool status osol pool: osol state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM osol DEGRADED 0 0 0 /mnt/osolzone/root DEGRADED 0 0 117 too many errors errors: No known data errors root@soltrain19# zlogin osol uptime 5:31pm up 1 min(s), 0 users, load average: 0.69, 0.38, 0.52 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zones on shared storage - a warning
On Tue, Dec 22, 2009 at 8:02 PM, Mike Gerdts wrote: > I've been playing around with zones on NFS a bit and have run into > what looks to be a pretty bad snag - ZFS keeps seeing read and/or > checksum errors. This exists with S10u8 and OpenSolaris dev build > snv_129. This is likely a blocker for anything thinking of > implementing parts of Ed's Zones on Shared Storage: > > http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss > > The OpenSolaris example appears below. The order of events is: > > 1) Create a file on NFS, turn it into a zpool > 2) Configure a zone with the pool as zonepath > 3) Install the zone, verify that the pool is healthy > 4) Boot the zone, observe that the pool is sick [snip] An off list conversation and a bit of digging into other tests I have done shows that this is likely limited to NFSv3. I cannot say that this problem has been seen with NFSv4. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling wrote: > On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: > >> Devzero, >> >> Unfortunately that was my assumption as well. I don't have source level >> knowledge of ZFS, though based on what I know it wouldn't be an easy way to >> do it. I'm not even sure it's only a technical question, but a design >> question, which would make it even less feasible. > > It is not hard, because ZFS knows the current free list, so walking that > list > and telling the storage about the freed blocks isn't very hard. > > What is hard is figuring out if this would actually improve life. The > reason > I say this is because people like to use snapshots and clones on ZFS. > If you keep snapshots, then you aren't freeing blocks, so the free list > doesn't grow. This is a very different use case than UFS, as an example. It seems as though the oft mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be re-written in such a way to minimize the number of chunks of storage that is allocated. The current HDS chunk size is 42 MB. The most benefit would seem to be to have ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txg's old (about 1 day old if 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk. This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers and the sysadmin can control the amount of space really used by quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocate SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides. > > There are a few minor bumps in the road. The ATA PASSTHROUGH > command, which allows TRIM to pass through the SATA drivers, was > just integrated into b130. This will be more important to small servers > than SANs, but the point is that all parts of the software stack need to > support the effort. As such, it is not clear to me who, if anyone, inside > Sun is champion for the effort -- it crosses multiple organizational > boundaries. > >> >> Apart from the technical possibilities, this feature looks really >> inevitable to me in the long run especially for enterprise customers with >> high-end SAN as cost is always a major factor in a storage design and it's a >> huge difference if you have to pay based on the space used vs space >> allocated (for example). > > If the high cost of SAN storage is the problem, then I think there are > better ways to solve that :-) The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used and the "SAN" could reclaim those blocks. 
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling wrote: > If the allocator can change, what sorts of policies should be > implemented? Examples include: > + should the allocator stick with best-fit and encourage more > gangs when the vdev is virtual? > + should the allocator be aware of an SSD's page size? Is > said page size available to an OS? > + should the metaslab boundaries align with virtual storage > or SSD page boundaries? Wandering off topic a little bit... Should the block size be a tunable so that page size of SSD (typically 4K, right?) and upcoming hard disks that sport a sector size > 512 bytes? http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt > And, perhaps most important, how can this be done automatically > so that system administrators don't have to be rocket scientists > to make a good choice? Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, options to those subcommands, evil tuning that is sometimes needed, and effects of redundancy choices then there is no need for any rocket scientists. :) -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi wrote: > Hello, > > As a result of one badly designed application running loose for some time, > we now seem to have over 60 million files in one directory. Good thing > about ZFS is that it allows it without any issues. Unfortunatelly now that > we need to get rid of them (because they eat 80% of disk space) it seems > to be quite challenging. > > Traditional approaches like "find ./ -exec rm {} \;" seem to take forever > - after running several days, the directory size still says the same. The > only way how I've been able to remove something has been by giving "rm > -rf" to problematic directory from parent level. Running this command > shows directory size decreasing by 10,000 files/hour, but this would still > mean close to ten months (over 250 days) to delete everything! > > I also tried to use "unlink" command to directory as a root, as a user who > created the directory, by changing directory's owner to root and so forth, > but all attempts gave "Not owner" error. > > Any commands like "ls -f" or "find" will run for hours (or days) without > actually listing anything from the directory, so I'm beginning to suspect > that maybe the directory's data structure is somewhat damaged. Is there > some diagnostics that I can run with e.g "zdb" to investigate and > hopefully fix for a single directory within zfs dataset? In situations like this, ls will be exceptionally slow partially because it will sort the output. Find is slow because it needs to call lstat() on every entry. In similar situations I have found the following to work. perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }' Replace print with unlink if you wish... > > To make things even more difficult, this directory is located in rootfs, > so dropping the zfs filesystem would basically mean reinstalling the > entire system, which is something that we really wouldn't wish to go. > > > OS is Solaris 10, zpool version is 10 (rather old, I know, but is there > easy path for upgrade that might solve this problem?) and the zpool > consists two 146 GB SAS drivers in a mirror setup. > > > Any help would be appreciated. > > Thanks, > Mikko > > -- > Mikko Lammi | l...@lmmz.net | http://www.lmmz.net > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
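A slightly fleshed-out version of that one-liner which skips '.' and '..', unlinks as it goes, and reports progress to stderr every 100,000 files (the interval is arbitrary):
perl -e 'opendir(D, "."); while ($d = readdir(D)) { next if $d eq "." || $d eq ".."; unlink($d) or warn "$d: $!\n"; print STDERR "$n\n" if ++$n % 100000 == 0 }'
Run it from inside the problem directory; because it never builds the whole file list in memory and never calls lstat(), it avoids the costs that make ls and find crawl in this situation.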
Re: [zfs-discuss] Zones on shared storage - a warning
e error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h0m with 0 errors on Thu Jan 7 21:56:47 2010 config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 109 errors: No known data errors I'm confused as to why this pool seems to be quite usable even with so many checksum errors. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:55 AM, Darren J Moffat wrote: > Frank Batschulat (Home) wrote: >> >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in >> NFS ? > > What are you using for on the wire protection with NFS ? Is it shared using > krb5i or do you have IPsec configured ? If not I'd recommend trying one of > those and see if your symptoms change. Shouldn't a scrub pick that up? Why would there be no errors from "zoneadm install", which under the covers does a pkg image create followed by *multiple* pkg install invocations. No checksum errors pop up there. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:51 AM, James Carlson wrote: > Frank Batschulat (Home) wrote: >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in NFS ? > > One possible cause would be a lack of substantial exercise. The man > page says: > > A regular file. The use of files as a backing store is > strongly discouraged. It is designed primarily for > experimental purposes, as the fault tolerance of a file > is only as good as the file system of which it is a > part. A file must be specified by a full path. > > Could it be that "discouraged" and "experimental" mean "not tested as > thoroughly as you might like, and certainly not a good idea in any sort > of production environment?" > > It sounds like a bug, sure, but the fix might be to remove the option. This unsupported feature is supported with the use of Sun Ops Center 2.5 when a zone is put on a "NAS Storage Library". -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
STATE READ WRITE CKSUM > nfszone DEGRADED 0 0 0 > /nfszone DEGRADED 0 0 462 too many errors > > errors: No known data errors > > == > > now compare this with Mike's error output as posted here: > > http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html > > # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail > > 2 cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 > 0x290cbce13fc59dce > *D 3 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 > 0x7e0aef335f0c7f00 > *E 3 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 > 0xd4f1025a8e66fe00 > *B 4 cksum_actual = 0x0 0x0 0x0 0x0 > 4 cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 > 0x330107da7c4bcec0 > 5 cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 > 0x4e0b3a8747b8a8 > *C 6 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 > 0x280934efa6d20f40 > *A 6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 > 0x89715e34fbf9cdc0 > *F 16 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 > 0x7f84b11b3fc7f80 > *G 48 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 > 0x82804bc6ebcfc0 > > and observe that the values in 'chksum_actual' causing our CHKSUM pool errors > eventually > because of missmatching with what had been expected are the SAME ! for 2 > totally > different client systems and 2 different NFS servers (mine vrs. Mike's), > see the entries marked with *A to *G. > > This just can't be an accident, there must be some coincidence and thus > there's a good chance > that these CHKSUM errors must have a common source, either in ZFS or in NFS ? You saved me so much time with this observation. Thank you! -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 9:11 AM, Mike Gerdts wrote: > I've seen similar errors on Solaris 10 in the primary domain and on a > M4000. Unfortunately Solaris 10 doesn't show the checksums in the > ereport. There I noticed a mixture between read errors and checksum > errors - and lots more of them. This could be because the S10 zone > was a full root SUNWCXall compared to the much smaller default ipkg > branded zone. On the primary domain running Solaris 10... I've written a dtrace script to get the checksums on Solaris 10. Here's what I see with NFSv3 on Solaris 10. # zoneadm -z zone1 halt ; zpool export pool1 ; zpool import -d /mnt/pool1 pool1 ; zoneadm -z zone1 boot ; sleep 30 ; pkill dtrace # ./zfs_bad_cksum.d Tracing... dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x301b363a000) in action #4 at DIF offset 20 dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3037f746000) in action #4 at DIF offset 20 cccdtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3026e7b) in action #4 at DIF offset 20 cc Checksum errors: 3 : 0x130e01011103 0x20108 0x0 0x400 (fletcher_4_native) 3 : 0x220125cd8000 0x62425980c08 0x16630c08296c490c 0x82b320c082aef0c (fletcher_4_native) 3 : 0x2f2a0a202a20436f 0x7079726967687420 0x2863292032303031 0x2062792053756e20 (fletcher_4_native) 3 : 0x3c21444f43545950 0x452048544d4c2050 0x55424c494320222d 0x2f2f5733432f2f44 (fletcher_4_native) 3 : 0x6005a8389144 0xc2080e6405c200b6 0x960093d40800 0x9eea007b9800019c (fletcher_4_native) 3 : 0xac044a6903d00163 0xa138c8003446 0x3f2cd1e100b10009 0xa37af9b5ef166104 (fletcher_4_native) 3 : 0xbaddcafebaddcafe 0xc 0x0 0x0 (fletcher_4_native) 3 : 0xc4025608801500ff 0x1018500704528210 0x190103e50066 0xc34b90001238f900 (fletcher_4_native) 3 : 0xfe00fc01fc42fc42 0xfc42fc42fc42fc42 0xfffc42fc42fc42fc 0x42fc42fc42fc42fc (fletcher_4_native) 4 : 0x4b2a460a 0x0 0x4b2a460a 0x0 (fletcher_4_native) 4 : 0xc00589b159a00 0x543008a05b673 0x124b60078d5be 0xe3002b2a0b605fb3 (fletcher_4_native) 4 : 0x130e010111 0x32000b301080034 0x10166cb34125410 0xb30c19ca9e0c0860 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x104381285501102 0x418016996320408 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x1043812c5501102 0x81802325c080864 (fletcher_4_native) 4 : 0x130e010111 0x3a0001c01080038 0x1383812c550111c 0x818975698080864 (fletcher_4_native) 4 : 0x1f81442e9241000 0x2002560880154c00 0xff10185007528210 0x19010003e566 (fletcher_4_native) 5 : 0xbab10c 0xf 0x53ae 0xdd549ae39aa1ba20 (fletcher_4_native) 5 : 0x130e010111 0x3ab01080038 0x1163812c550110b 0x8180a7793080864 (fletcher_4_native) 5 : 0x61626300 0x0 0x0 0x0 (fletcher_4_native) 5 : 0x8003 0x3df0d6a1 0x0 0x0 (fletcher_4_native) 6 : 0xbab10c 0xf 0x5384 0xdd549ae39aa1ba20 (fletcher_4_native) 7 : 0xbab10c 0xf 0x0 0x9af5e5f61ca2e28e (fletcher_4_native) 7 : 0x130e010111 0x3a201080038 0x104381265501102 0xc18c7210c086006 (fletcher_4_native) 7 : 0x275c222074650a2e 0x5c222020436f7079 0x7269676874203139 0x38392041540a2e5c (fletcher_4_native) 8 : 0x130e010111 0x3a0003101080038 0x1623812c5501131 0x8187f66a4080864 (fletcher_4_native) 9 : 0x8a000801010c0682 0x2eed0809c1640513 0x70200ff00026424 0x18001d16101f0059 (fletcher_4_native) 12 : 0xbab10c 0xf 0x0 0x45a9e1fc57ca2aa8 (fletcher_4_native) 30 : 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe (fletcher_4_native) 47 : 0x0 0x0 0x0 0x0 (fletcher_4_native) 92 : 0x130e01011103 0x10108 0x0 0x200 
(fletcher_4_native) Since I had to guess at what the Solaris 10 source looks like, some extra eyeballs on the dtrace script is in order. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ zfs_bad_cksum.d Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 12:28 PM, Torrey McMahon wrote: > On 1/8/2010 10:04 AM, James Carlson wrote: >> Mike Gerdts wrote: >>> >>> This unsupported feature is supported with the use of Sun Ops Center >>> 2.5 when a zone is put on a "NAS Storage Library". >>> >> >> Ah, ok. I didn't know that. >> >> > Does anyone know how that works? I can't find it in the docs, no one inside of Sun seemed to have a clue when I asked around, etc. RTFM gladly taken. Storage libraries are discussed very briefly at: http://wikis.sun.com/display/OC2dot5/Storage+Libraries Creation of zones is discussed at: http://wikis.sun.com/display/OC2dot5/Creating+Zones I've found no documentation that explains the implementation details. From looking at a test environment that I have running, it seems to go like: 1. The storage admin carves out some NFS space and exports it with the appropriate options to the various hosts (global zones). 2. In the Ops Center BUI, the ops center admin creates a new storage library. He selects type NFS and specifies the hostname and path that was allocated. 3. The ops center admin associates the storage library with various hosts. This causes it to be mounted at /var/mnt/virtlibs/ on those hosts. I'll call this $libmnt. 4. When the sysadmin provisions a zone through ops center, a UUID is allocated and associated with this zone. I'll call it $zuuid. A directory $libmnt/$zuuid is created with a set of directories under it. 5. As the sysadmin provisions the zone, ops center prompts for the virtual disk size. A file of that size is created at $libmnt/$zuuid/virtdisk/data. 6. Ops center creates a zpool: zpool create -m /var/mnt/oc-zpools/$zuuid/ z$zuuid \ $libmnt/$zuuid/virtdisk/data 7. The zonepath dataset is created using a UUID that is unique to the zonepath ($puuid): z$zuuid/$puuid. It has a quota and a reservation set (8G each in the zpool history I am looking at). 8. The zone is configured with zonepath=/var/mnt/oc-zpools/$zuuid/$puuid, then installed. Just in case anyone sees this as the right way to do things, I think it is generally OK with a couple of caveats. The key areas that I would suggest for improvement are: - Mount the NFS space with -o forcedirectio. There is no need to cache data twice. - Never use UUIDs in paths. This makes it nearly impossible for a sysadmin or a support person to look at the output of commands on the system and understand what it is doing. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
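For anyone who wants to reproduce roughly that layout by hand, a sketch follows; the hostname, paths, and sizes are illustrative and it folds in the forcedirectio suggestion, so it is not a transcript of what Ops Center actually runs:
# mkdir -p /var/mnt/virtlibs
# mount -F nfs -o forcedirectio filer:/export/zonelib /var/mnt/virtlibs
# mkdir -p /var/mnt/virtlibs/myzone/virtdisk
# mkfile 10g /var/mnt/virtlibs/myzone/virtdisk/data
# zpool create -m /var/mnt/oc-zpools/myzone zmyzone /var/mnt/virtlibs/myzone/virtdisk/data
# zfs create -o quota=8g -o reservation=8g zmyzone/zonepath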
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain wrote: > On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote: > >>> I am considering building a modest sized storage system with zfs. Some >>> of the data on this is quite valuable, some small subset to be backed >>> up "forever", and I am evaluating back-up options with that in mind. >> >> You don't need to store the "zfs send" data stream on your backup media. >> This would be annoying for the reasons mentioned - some risk of being able >> to restore in future (although that's a pretty small risk) and inability >> to >> restore with any granularity, i.e. you have to restore the whole FS if you >> restore anything at all. >> >> A better approach would be "zfs send" and pipe directly to "zfs receive" >> on >> the external media. This way, in the future, anything which can read ZFS >> can read the backup media, and you have granularity to restore either the >> whole FS, or individual things inside there. > > There have also been comments about the extreme fragility of the data stream > compared to other archive formats. In general it is strongly discouraged for > these purposes. > Yet it is used in ZFS flash archives on Solaris 10 and are slated for use in the successor to flash archives. This initial proposal seems to imply using the same mechanism for a system image backup (instead of just system provisioning). http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup memory overhead
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin wrote: > Looking at dedupe code, I noticed that on-disk DDT entries are > compressed less efficiently than possible: key is not compressed at > all (I'd expect roughly 2:1 compression ratio with sha256 data), A cryptographic hash such as sha256 should not be compressible. A trivial example shows this to be the case: for i in {1..10000} ; do echo $i | openssl dgst -sha256 -binary ; done > /tmp/sha256 $ cd /tmp $ gzip -c sha256 > sha256.gz $ compress -c sha256 > sha256.Z $ bzip2 -c sha256 > sha256.bz2 $ ls -go sha256* -rw-r--r-- 1 320000 Jan 22 04:13 sha256 -rw-r--r-- 1 428411 Jan 22 04:14 sha256.Z -rw-r--r-- 1 321846 Jan 22 04:14 sha256.bz2 -rw-r--r-- 1 320068 Jan 22 04:14 sha256.gz -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Thu, Jan 21, 2010 at 11:28 AM, Richard Elling wrote: > On Jan 21, 2010, at 3:55 AM, Julian Regel wrote: >> >> Until you try to pick one up and put it in a fire safe! >> >> >Then you backup to tape from x4540 whatever data you need. >> >In case of enterprise products you save on licensing here as you need a one >> >client license per x4540 but in fact can >backup data from many clients >> >which are there. >> >> Which brings up full circle... >> >> What do you then use to backup to tape bearing in mind that the Sun-provided >> tools all have significant limitations? > > Poor choice of words. Sun resells NetBackup and (IIRC) that which was > formerly called NetWorker. Thus, Sun does provide enterprise backup > solutions. (Symantec nee Veritas) NetBackup and (EMC nee Legato) Networker are different products that compete in the enterprise backup space. Under the covers NetBackup uses gnu tar to gather file data for the backup stream. At one point (maybe still the case), one of the claimed features of netbackup is that if a tape is written without multiplexing, you can use gnu tar to extract data. This seems to be most useful when you need to recover master and/or media servers and to be able to extract your data after you no longer use netbackup. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk wrote: > Is there a way to zero out unused blocks in a pool? I'm looking for ways to > shrink the size of an opensolaris virtualbox VM and > using the compact subcommand will remove zero'd sectors. I've long suspected that you should be able to just use mkfile or "dd if=/dev/zero ..." to create a file that consumes most of the free space then delete that file. Certainly it is not an ideal solution, but seems quite likely to be effective. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Sat, Jan 23, 2010 at 11:55 AM, John Hoogerdijk wrote: > Mike Gerdts wrote: >> >> On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk >> wrote: >> >>> >>> Is there a way to zero out unused blocks in a pool? I'm looking for ways >>> to >>> shrink the size of an opensolaris virtualbox VM and >>> using the compact subcommand will remove zero'd sectors. >>> >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. >> > > I tried this with mkfile - no joy. Let me ask a couple of the questions that come just after "are you sure your computer is plugged in?" Did you wait enough time for the data to be flushed to disk (or do sync and wait for it to complete) prior to removing the file? You did "mkfile $huge /var/tmp/junk" not "mkfile -n $huge /var/tmp/junk", right? If not, I suspect that "zpool replace" to a thin provisioned disk is going to be your best bet (as suggested in another message). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Mon, Jan 25, 2010 at 2:32 AM, Kjetil Torgrim Homme wrote: > Mike Gerdts writes: > >> John Hoogerdijk wrote: >>> Is there a way to zero out unused blocks in a pool? I'm looking for >>> ways to shrink the size of an opensolaris virtualbox VM and using the >>> compact subcommand will remove zero'd sectors. >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. > > you'll need to (temporarily) enable compression for this to have an > effect, AFAIK. > > (dedup will obviously work, too, if you dare try it.) You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
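Pulling the thread together, a sketch of the whole procedure, first inside the guest and then on the VirtualBox host (the dataset and .vdi names are examples; compression and dedup must be off on the datasets being filled, or the zeros never reach the virtual disk, and it is wise to stop before the pool is completely full):
# zfs get compression,dedup rpool/export         # confirm both are off before filling
# dd if=/dev/zero of=/export/zerofill bs=1024k   # interrupt it while some headroom remains
# sync
# rm /export/zerofill
# sync
and then on the host:
$ VBoxManage modifyhd opensolaris.vdi --compact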
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp wrote: > PS: Is there any way to get a copy of the list since inception > for local client perusal, not via some online web interface? You can get monthly .gz archives in mbox format from http://mail.opensolaris.org/pipermail/zfs-discuss/. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FS Reliability WAS: about btrfs and zfs
On Fri, Oct 21, 2011 at 8:02 PM, Fred Liu wrote: > >> 3. Do NOT let a system see drives with more than one OS zpool at the >> same time (I know you _can_ do this safely, but I have seen too many >> horror stories on this list that I just avoid it). >> > > Can you elaborate #3? In what situation will it happen? Some people have trained their fingers to use the -f option on every command that supports it to force the operation. For instance, how often do you do rm -rf vs. rm -r and answer questions about every file? If various zpool commands (import, create, replace, etc.) are used against the wrong disk with a force option, you can clobber a zpool that is in active use by another system. In a previous job, my lab environment had a bunch of LUNs presented to multiple boxes. This was done for convenience in an environment where there would be little impact if an errant command were issued. I'd never do that in production without some form of I/O fencing in place. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 3:01 PM, Francois Dion wrote: > I've hit an interesting (not) problem. I need to remove a problematic > ld.config file (due to an improper crle...) to boot my laptop. This is > OI 151a, but fundamentally this is zfs, so i'm asking here. > > what I did after booting the live cd and su: > mkdir /tmp/disk > zpool import -R /tmp/disk -f rpool > > export shows up in there and rpool also, but in rpool there is only > boot and etc. > > zfs list shows rpool/ROOT/openindiana as mounted on /tmp/disk and I > see dump and swap, but no var. rpool/ROOT shows as legacy, so I > figured, maybe mount that. > > mount -F zfs rpool/ROOT /mnt/rpool That dataset (rpool/ROOT) should never have any files in it. It is just a "container" for boot environments. You can see which boot environments exist with: zfs list -r rpool/ROOT If you are running Solaris 11, the boot environment's root dataset will show a mountpoint property value of /. Assuming it is called "solaris" you can mount it with: zfs mount -o mountpoint=/mnt/rpool rpool/ROOT/solaris If the system is running Solaris 11 (and was not updated from Solaris 11 Express), it will have a separate /var dataset. zfs mount -o mountpoint=/mnt/rpool/var rpool/ROOT/solaris/var -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 4:40 PM, Francois Dion wrote: > It is on openindiana 151a, no separate /var as far as But I'll have to > test this on solaris11 too when I get a chance. > > The problem is that if I > > zfs mount -o mountpoint=/tmp/rescue (or whatever) rpool/ROOT/openindiana > > i get a cannot mount /mnt/rpool: directory is not empty. > > The reason for that is that I had to do a zpool import -R /mnt/rpool > rpool (or wherever I mount it it doesnt matter) before I could do a > zfs mount, else I dont have access to the rpool zpool for zfs to do > its thing. > > chicken / egg situation? I miss the old fail safe boot menu... You can mount it pretty much anywhere: mkdir /tmp/foo zfs mount -o mountpoint=/tmp/foo ... I'm not sure when the temporary mountpoint option (-o mountpoint=...) came in. If it's not valid syntax then: mount -F zfs rpool/ROOT/solaris /tmp/foo -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Any rhyme or reason to disk dev names?
On Wed, Dec 21, 2011 at 1:58 AM, Matthew R. Wilson wrote: > Hello, > > I am curious to know if there is an easy way to guess or identify the device > names of disks. Previously the /dev/dsk/c0t0d0s0 system made sense to me... > I had a SATA controller card with 8 ports, and they showed up with the > numbers 1-8 in the "t" position of the device name. > > But I just built a new system with two LSI SAS HBAs in it, and my device > names are along the lines of: > /dev/dsk/c0t5000CCA228C0E488d0 > > I could not find any correlation between that identifier and the a) > controller the disk was plugged in to, or b) the port number on the > controller. The only way I could make a mapping of device name to controller > port was to add one drive at a time, reboot the system, and run "format" to > see which new disk name shows up. > > I'm guessing there's a better way, but I can't find any obvious answer as to > how to determine which port on my LSI controller card will correspond with > which seemingly random device name. Can anyone offer any suggestions on a > way to predict the device naming, or at least get the system to list the > disks after I insert one without rebooting? Depending on the hardware you are using, you may be able to benefit from croinfo. $ croinfo D:devchassis-path t:occupant-type c:occupant-compdev - --- - /dev/chassis//SYS/SASBP/HDD0/disk disk c0t5000CCA012B66E90d0 /dev/chassis//SYS/SASBP/HDD1/disk disk c0t5000CCA012B68AC8d0 The text in the left column represents text that should be printed on the corresponding disk slots. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
2012/3/26 ольга крыжановская : > How can I test if a file on ZFS has holes, i.e. is a sparse file, > using the C api? See SEEK_HOLE in lseek(2). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn wrote: > On Mon, 26 Mar 2012, Andrew Gabriel wrote: > >> I just played and knocked this up (note the stunning lack of comments, >> missing optarg processing, etc)... >> Give it a list of files to check... > > > This is a cool program, but programmers were asking (and answering) this > same question 20+ years ago before there was anything like SEEK_HOLE. > > If file space usage is less than file directory size then it must contain a > hole. Even for compressed files, I am pretty sure that Solaris reports the > uncompressed space usage. That's not the case. # zfs create -o compression=on rpool/junk # perl -e 'print "foo" x 10'> /rpool/junk/foo # ls -ld /rpool/junk/foo -rw-r--r-- 1 root root 30 Mar 26 18:25 /rpool/junk/foo # du -h /rpool/junk/foo 16K /rpool/junk/foo # truss -t stat -v stat du /rpool/junk/foo ... lstat64("foo", 0x08047C40) = 0 d=0x02B90028 i=8 m=0100644 l=1 u=0 g=0 sz=30 at = Mar 26 18:25:25 CDT 2012 [ 1332804325.742827733 ] mt = Mar 26 18:25:25 CDT 2012 [ 1332804325.889143166 ] ct = Mar 26 18:25:25 CDT 2012 [ 1332804325.889143166 ] bsz=131072 blks=32fs=zfs Notice that it says it has 32 512 byte blocks. The mechanism you suggest does work for every other file system that I've tried it on. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Strange hang during snapshot receive
On Thu, May 10, 2012 at 5:37 AM, Ian Collins wrote: > I have an application I have been using to manage data replication for a > number of years. Recently we started using a new machine as a staging > server (not that new, an x4540) running Solaris 11 with a single pool built > from 7x6 drive raidz. No dedup and no reported errors. > > On that box and nowhere else is see empty snapshots taking 17 or 18 seconds > to write. Everywhere else they return in under a second. > > Using truss and the last published source code, it looks like the pause is > between a printf and the call to zfs_ioctl and there aren't any other > functions calls between them: For each snapshot in a stream, there is one zfs_ioctl() call. During that time, the kernel will read the entire substream (that is, for one snapshot) from the input file descriptor. > > 100.5124 0.0004 open("/dev/zfs", O_RDWR|O_EXCL) = 10 > 100.7582 0.0001 read(7, "\0\0\0\0\0\0\0\0ACCBBAF5".., 312) = 312 > 100.7586 0. read(7, 0x080464F8, 0) = 0 > 100.7591 0. time() = 1336628656 > 100.7653 0.0035 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040CF0) = 0 > 100.7699 0.0022 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900) = 0 > 100.7740 0.0016 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040580) = 0 > 100.7787 0.0026 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x080405B0) = 0 > 100.7794 0.0001 write(1, " r e c e i v i n g i n".., 75) = 75 > 118.3551 0.6927 ioctl(8, ZFS_IOC_RECV, 0x08042570) = 0 > 118.3596 0.0010 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900) = 0 > 118.3598 0. time() = 1336628673 > 118.3600 0. write(1, " r e c e i v e d 3 1 2".., 45) = 45 > > zpool iostat (1 second interval) for the period is: > > tank 12.5T 6.58T 175 0 271K 0 > tank 12.5T 6.58T 176 0 299K 0 > tank 12.5T 6.58T 189 0 259K 0 > tank 12.5T 6.58T 156 0 231K 0 > tank 12.5T 6.58T 170 0 243K 0 > tank 12.5T 6.58T 252 0 295K 0 > tank 12.5T 6.58T 179 0 200K 0 > tank 12.5T 6.58T 214 0 258K 0 > tank 12.5T 6.58T 165 0 210K 0 > tank 12.5T 6.58T 154 0 178K 0 > tank 12.5T 6.58T 186 0 221K 0 > tank 12.5T 6.58T 184 0 215K 0 > tank 12.5T 6.58T 218 0 248K 0 > tank 12.5T 6.58T 175 0 228K 0 > tank 12.5T 6.58T 146 0 194K 0 > tank 12.5T 6.58T 99 258 209K 1.50M > tank 12.5T 6.58T 196 296 294K 1.31M > tank 12.5T 6.58T 188 130 229K 776K > > Can anyone offer any insight or further debugging tips? I have yet to see a time when zpool iostat tells me something useful. I'd take a look at "iostat -xzn 1" or similar output. It could point to imbalanced I/O or a particular disk that has abnormally high service times. Have you installed any SRUs? If not, you could be seeing: 7060894 zfs recv is excruciatingly slow which is fixed in Solaris 11 SRU 5. If you are using zones and are using any https pkg(5) origins (such as https://pkg.oracle.com/solaris/support), I suggest reading https://forums.oracle.com/forums/thread.jspa?threadID=2380689&tstart=15 before updating to SRU 6 (SRU 5 is fine, however). The fix for the problem mentioned in that forums thread should show up in an upcoming SRU via CR 7157313. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov wrote: > On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote: >> find where your nics are bound too >> >> mdb -k >> ::interrupts >> >> create a processor set including those cpus [ so just the nic code will >> run there ] >> >> andy > > Tried and didn't help, unfortunately. I'm still seeing drops. What's > even funnier is that I'm seeing drops when the machine is sync'ing the > txg to the zpool. So looking at a little UDP receiver I can see the > following input stream bandwidth (the stream is constant bitrate, so > this shouldn't happen): If processing in interrupt context (use intrstat) is dominating cpu usage, you may be able to use pcitool to cause the device generating all of those expensive interrupts to be moved to another CPU. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
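A sketch of the steps Andy describes, with the driver name and CPU numbers as placeholders:
# echo ::interrupts | mdb -k | grep ixgbe     # find which CPUs the NIC's interrupts land on (ixgbe is just an example)
# psrset -c 4 5                               # fence those CPUs into a processor set so ordinary threads stay off them
# intrstat 5                                  # watch where interrupt time is actually being spent
If interrupt processing still lands on the wrong CPUs, pcitool, as mentioned above, can retarget the interrupt itself.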
Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones
On Tue, Jul 10, 2012 at 6:29 AM, Jordi Espasa Clofent wrote: > Thanks for you explanation Fajar. However, take a look on the next lines: > > # available ZFS in the system > > root@sct-caszonesrv-07:~# zfs list > > NAME USED AVAIL REFER MOUNTPOINT > opt 532M 34.7G 290M /opt > opt/zones 243M 34.7G 32K /opt/zones > opt/zones/sct-scw02-shared 243M 34.7G 243M /opt/zones/sct-scw02-shared > static 104K 58.6G 34K /var/www/ > > # creating a file in /root (UFS) > > root@sct-caszonesrv-07:~# dd if=/dev/zero of=file.bin count=1024 bs=1024 > 1024+0 records in > 1024+0 records out > 1048576 bytes (1.0 MB) copied, 0.0545957 s, 19.2 MB/s > root@sct-caszonesrv-07:~# pwd > /root > > # enable compression in some ZFS zone > > root@sct-caszonesrv-07:~# zfs set compression=on opt/zones/sct-scw02-shared > > # copying the previos file to this zone > > root@sct-caszonesrv-07:~# cp /root/file.bin > /opt/zones/sct-scw02-shared/root/ > > # checking the file size in the origin dir (UFS) and the destination one > (ZFS with compression enabled) > > root@sct-caszonesrv-07:~# ls -lh /root/file.bin > -rw-r--r-- 1 root root 1.0M Jul 10 13:21 /root/file.bin > > root@sct-caszonesrv-07:~# ls -lh /opt/zones/sct-scw02-shared/root/file.bin > -rw-r--r-- 1 root root 1.0M Jul 10 13:22 > /opt/zones/sct-scw02-shared/root/file.bin > > # the both files has exactly the same cksum! > > root@sct-caszonesrv-07:~# cksum /root/file.bin > 3018728591 1048576 /root/file.bin > > root@sct-caszonesrv-07:~# cksum /opt/zones/sct-scw02-shared/root/file.bin > 3018728591 1048576 /opt/zones/sct-scw02-shared/root/file.bin > > So... I don't see any size variation with this test. ls(1) tells you how much data is in the file - that is, how many bytes of data that an application will see if it reads the whole file. du(1) tells you how many disk blocks are used. If you look at the stat structure in stat(2), ls reports st_size, du reports st_blocks. Blocks full of zeros are special to zfs compression - it recognizes them and stores no data. Thus, a file that contains only zeros will only require enough space to hold the file metadata. $ zfs list -o compression ./ COMPRESS on $ dd if=/dev/zero of=1gig count=1024 bs=1024k 1024+0 records in 1024+0 records out $ ls -l 1gig -rw-r--r-- 1 mgerdts staff 1073741824 Jul 10 07:52 1gig $ du -k 1gig 0 1gig -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
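A small follow-on experiment, reusing the compression-enabled dataset from the example above: all-zero blocks compress away entirely while pseudo-random data does not, and du(1) (or the compressratio property) shows the difference that ls(1) hides.

# zeros: ZFS recognizes them and stores (almost) nothing
dd if=/dev/zero of=/opt/zones/sct-scw02-shared/zero.bin bs=1024k count=16

# pseudo-random data: effectively incompressible
dd if=/dev/urandom of=/opt/zones/sct-scw02-shared/rand.bin bs=1024k count=16

sync
du -h /opt/zones/sct-scw02-shared/zero.bin /opt/zones/sct-scw02-shared/rand.bin
zfs get compressratio opt/zones/sct-scw02-shared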
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Wed, Feb 20, 2013 at 4:49 PM, Markus Grundmann wrote: > Whenever I modify zfs pools or filesystems it's possible to destroy [on a > bad day :-)] my data. A new > property "protected=on|off" in the pool and/or filesystem can help the > administrator for datalost > (e.g. "zpool destroy tank" or "zfs destroy " command will > be rejected > when "protected=on" property is set). > > It's anywhere here on this list their can discuss/forward this feature > request? I hope you have > understand my post ;-) I like the idea and it is likely not very hard to implement. This is very similar to how snapshot holds work. # zpool upgrade -v | grep -i hold 18 Snapshot user holds So long as you aren't using a really ancient zpool version, you could use this feature to protect your file systems. # zfs create a/b # zfs snapshot a/b@snap # zfs hold protectme a/b@snap # zfs destroy a/b cannot destroy 'a/b': filesystem has children use '-r' to destroy the following datasets: a/b@snap # zfs destroy -r a/b cannot destroy 'a/b@snap': snapshot is busy Of course, snapshots aren't free if you write to the file system. A way around that is to create an empty file system within the one that you are trying to protect. # zfs create a/1 # zfs create a/1/hold # zfs snapshot a/1/hold@hold # zfs hold 'saveme!' a/1/hold@hold # zfs holds a/1/hold@hold NAME TAG TIMESTAMP a/1/hold@hold saveme! Wed Feb 20 15:06:29 2013 # zfs destroy -r a/1 cannot destroy 'a/1/hold@hold': snapshot is busy Extending the hold mechanism to filesystems and volumes would be quite nice. Mike ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
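And the other half of that workflow, for the day the protection really should come off, using the same a/b@snap hold from the example above:

# list the holds, release the tag, and then the destroy goes through
zfs holds a/b@snap
zfs release protectme a/b@snap
zfs destroy -r a/b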
Re: [zfs-discuss] VM's on ZFS - 7210
On Sat, Aug 28, 2010 at 8:19 AM, Ray Van Dolson wrote: > On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote: >> I can't think of an easy way to measure pages that have not been consumed >> since it's really an SSD controller function which is obfuscated from the >> OS, and add the variable of over provisioning on top of that. If anyone >> would like to really get into what's going on inside of an SSD that makes it >> a bad choice for a ZIL, you can start here: >> >> http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29 >> >> and >> >> http://en.wikipedia.org/wiki/Write_amplification >> >> Which will be more than you might have ever wanted to know. :) > > So has anyone on this list actually run into this issue? Tons of > people use SSD-backed slog devices... > > The theory sounds "sound", but if it's not really happening much in > practice then I'm not too worried. Especially when I can replace a > drive from my slog mirror for a $400 or so if problems do arise... (the > alternative being much more expensive DRAM backed devices) Presumably this problem is being worked... http://hg.genunix.org/onnv-gate.hg/rev/d560524b6bb6 Notice that it implements: 866610 Add SATA TRIM support With this in place, I would imagine a next step is for zfs to issue TRIM commands as zil entries have been committed to the data disks. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 5:42 PM, Richard Elling wrote: > On Sep 12, 2010, at 10:11 AM, Brandon High wrote: > >> On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar >> wrote: >>> No replies. Does this mean that you should avoid large drives with 4KB >>> sectors, that is, new drives? ZFS does not handle new drives? >> >> Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of >> osol. > > OSol source yes, binaries no :-( You will need another distro besides > OpenSolaris. > The needed support in sd was added around the b137 timeframe. OpenIndiana, to be released on Tuesday, is based on b146 or later. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
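For anyone wondering whether an existing pool was created with 4 KB alignment, the ashift value recorded in the pool configuration is the thing to check; a rough sketch, with a hypothetical pool named tank:

# ashift=9 means 512-byte alignment, ashift=12 means 4 KB alignment
zdb -C tank | grep ashift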
Re: [zfs-discuss] file level clones
On Mon, Sep 27, 2010 at 6:23 AM, Robert Milkowski wrote: > Also see http://www.symantec.com/connect/virtualstoreserver And http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/ -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Tue, Oct 26, 2010 at 9:40 AM, bhanu prakash wrote: > Hi Team, > > > There 17 zones on the machine T5120. I want to move all the zones which are > ZFS filesystem to another new LUN. > > Can you give me the steps to proceed this. If the only thing on the source lun is the pool that contains the zones and the new LUN is at least as big as the old LUN: zpool replace <pool> <old-LUN> <new-LUN> The above can be done while the zones are booted. Depending on the characteristics of the server and workloads, the workloads may feel a bit sluggish during this time due to increased I/O activity. If that works for you, stop reading now. In the event that the scenario above doesn't apply, read on. Assuming all the zones are under oldpool/zones, oldpool/zones is mounted at /zones, and you have done "zpool create newpool <new-LUN>" Be sure to test this procedure - I didn't! zfs create newpool/zones # optionally, shut down the zones zfs snapshot -r oldpool/zones@phase1 zfs send -r oldpool/zones@phase1 | zfs receive newpool/zones@phase1 # If you did not shut down the zones above, shut them down now. # If the zones were shut down, skip the next two commands zfs snapshot -r oldpool/zones@phase2 zfs send -rI oldpool/zones@phase1 oldpool/zones@phase2 \ | zfs receive newpool/zones@phase2 # Adjust mount points and restart the zones zfs set mountpoint=none oldpool/zones zfs set mountpoint=/zones newpool/zones for zone in $zonelist ; do zoneadm -z $zone boot ; done At such a time that you are comfortable that the zone data moved over ok... zfs destroy -r oldpool/zones Again, verify the procedure works on a test/lab/whatever box before trying it for real. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Wed, Oct 27, 2010 at 9:27 AM, bhanu prakash wrote: > Hi Mike, > > > Thanks for the information... > > Actually the requirement is like this. Please let me know whether it matches > for the below requirement or not. > > Question: > > The SAN team will assign the new LUN’s on EMC DMX4 (currently IBM Hitache is > there). We need to move the 17 containers which are existed on the > server Host1 to new LUN’s”. > > > Please give me the steps to do this activity. Without knowing the layout of the storage, it is impossible to give you precise instructions. This sounds like it is a production Solaris 10 system in an enterprise environment. In most places that I've worked, I would be hesitant to provide the required level of detail on a public mailing list. Perhaps you should open a service call to get the assistance you need. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hardware going bad
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam wrote: > I'm guessing it was probably more like 60 to 62 c under load. The > temperature I posted was after something like 5minutes of being > totally shutdown and the case been open for a long while. (mnths if > not yrs) What happens if the case is closed (and all PCI slot, disk, etc. slots are closed)? Having the case open likely changes the way that air flows across the various components. Also, if there is tobacco smoke near the machine, it will cause a sticky build-up that likely contributes to heat dissipation problems. Perhaps this belongs somewhere other than zfs-discuss - it has nothing to do with zfs. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FW: Solaris panic
ippy genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 > Version snv_151a 64-bit > Mar 17 15:28:51 zippy genunix: [ID 877030 kern.notice] Copyright (c) 1983, > 2010, Oracle and/or its affiliates. All rights reserved. > > Can anyone help? > > Regards > Karl > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Non-Global zone recovery
On Thu, Jul 7, 2011 at 2:41 PM, Ram kumar wrote: > > Hi Cindy, > > Thanks for the email. > > We are using Solaris 10 with out Live Upgrade. > > Tested following in the sandbox environment: > > 1) We have one non-global zone (TestZone) which is running on Test > zpool (SAN) > > 2) Don’t see zpool or non-global zone after re-image of Global zone. > > 3) Imported zpool Test > > Now I am trying to create Non-global zone and it is giving error > > bash-3.00# zonecfg -z Test > Test: No such zone configured > Use 'create' to begin configuring a new zone. > zonecfg:Test> create -a /zones/Test > invalid path to detached zone If you use create -a, it requires that SUNWdetached.xml exist as a means for configuring the various properties (e.g. zonepath, brand, etc.) and resources (inherit-pkg-dir, net, fs, device, etc.) for the zone. Since you don't have the SUNWdetached.xml, you can't use it. Assuming you have a backup of the system, you could restore a copy of /etc/zones/<zonename>.xml to /etc/zones/restored-<zonename>.xml, then run: zonecfg -z <zonename> create -t restored-<zonename> If that's not an option or is just too inconvenient, use zonecfg to configure the zone just like you did initially. That is, do not use "create -a", use "create", "create -b", or "create -t <template>" followed by whatever property settings and added resources are appropriate. After you get past zonecfg, you should be able to: zoneadm -z <zonename> attach If the package and patch levels don't match up (the global zone perhaps was installed from a newer update or has newer patches): zoneadm -z <zonename> attach -U or zoneadm -z <zonename> attach -u Since you seem to be doing this in a test environment to prepare for bad things to happen, I'd suggest that you make it a standard practice when you are done configuring a zone to do: zonecfg -z <zonename> export > /zonecfg.export Then if you need to recover the zone using only the things that are on the SAN, you can do: zpool import ... zonecfg -z <zonename> -f /zonecfg.export zoneadm -z <zonename> attach [-u|-U] Any follow-ups should probably go to Oracle Support or zones-discuss. Your problems are not related to zfs. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
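A sketch of making that export step routine for every configured zone; the backup directory here is just a placeholder, and ideally it would live somewhere that survives a re-image of the global zone, such as the SAN pool itself:

mkdir -p /var/tmp/zonecfg-backups
for z in $(zoneadm list -c | grep -v '^global$'); do
        zonecfg -z "$z" export > /var/tmp/zonecfg-backups/$z.cfg
done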
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
n, > not only on on hardware built for dedicated storage. > > Sparse-root vs. full-root zones, or disk images of VMs; > are they stuffed in one rpool or spread between rpool and > data pools - that detail is not actually the point of the thread. > > Actual useability of dedup for savings and gains on these > tasks (preferably working also on low-mid-range boxes, > where adding a good enterprise SSD would double the > server cost - not only on those big good systems with > tens of GB of RAM), and hopefully simplifying the system > configuration and maintenance - that is indeed the point > in question. > > //Jim > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is ".$EXTEND/$QUOTA" ?
On Tue, Jul 19, 2011 at 2:39 PM, Orvar Korvar wrote: > I am using S11E, and have created a zpool on a single disk as storage. In > several directories, I can see a directory called ".$EXTEND/$QUOTA". What is > it for? Can I delete it? > -- Perhaps this is of help. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/smbsrv/smb_pathname.c#752 752 /* 753 * smb_pathname_preprocess_quota 754 * 755 * There is a special file required by windows so that the quota 756 * tab will be displayed by windows clients. This is created in 757 * a special directory, $EXTEND, at the root of the shared file 758 * system. To hide this directory prepend a '.' (dot). 759 */ -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs rename query
On Wed, Jul 27, 2011 at 6:37 AM, Nishchaya Bahuguna wrote: > Hi, > > I have a query regarding the zfs rename command. > > There are 5 zones and my requirement is to change the zone paths using zfs > rename. > > + zoneadm list -cv > ID NAME STATUS PATH BRAND IP > 0 global running / native > shared > 34 public running /txzone/public native > shared > 35 internal running /txzone/internal native > shared > 36 restricted running /txzone/restricted native > shared > 37 needtoknow running /txzone/needtoknow native shared > 38 sandbox running /txzone/sandbox native shared > > A whole root zone was configured and installed. Rest of the 4 zones > were cloned from . > > zoneadm -z clone public > > zfs get origin lists the origin as for all 4 zones. > > I run zfs rename on 4 of these clone'd zones and it throws a device busy > error because of parent-child relationship. I think you are getting the device busy error for a different reason. I just did the following: zfs create -o mountpoint=/zones rpool/zones zonecfg -z z1 'create; set zonepath=/zones/z1' zoneadm -z z1 install zonecfg -z z1c1 'create -t z1; set zonepath=/zones/z1c1' zonecfg -z z1c2 'create -t z1; set zonepath=/zones/z1c2' zoneadm -z z1c1 clone z1 zoneadm -z z1c2 clone z2 At this point, I have the following: bash-3.2# zfs list -r -o name,origin rpool/zones NAME ORIGIN rpool/zones - rpool/zones/z1- rpool/zones/z1@SUNWzone1 - rpool/zones/z1@SUNWzone2 - rpool/zones/z1c1 rpool/zones/z1@SUNWzone1 rpool/zones/z1c2 rpool/zones/z1@SUNWzone2 Next, I decide that I would like z1c1 to be rpool/new/z1c1 instead of it's current place. Note that this will also change the mountpoint which breaks the zone. bash-3.2# zfs create -o mountpoint=/new rpool/new bash-3.2# zfs rename rpool/zones/z1c1 rpool/new/z1c1 bash-3.2# zfs list -o name,origin -r /new NAMEORIGIN rpool/new - rpool/new/z1c1 rpool/zones/z1@SUNWzone1 To get a "device busy" error, I need to cause a situation where the zonepath cannot be unmounted. Having the zone running is a good way to do that: bash-3.2# zoneadm -z z1c2 boot WARNING: zone z1c1 is installed, but its zonepath /zones/z1c1 does not exist. bash-3.2# zfs rename rpool/zones/z1c2 rpool/new/z1c2 cannot unmount '/zones/z1c2': Device busy > I guess that can be handled with zfs promote because promote would swap the > parent and child. You would need to do this to rename a dataset that the origin (one that is cloned) not the clones. That is, if you wanted to rename the dataset for your public zone or I wanted to rename the dataset for z1, then you would need to promote the datasets for all of the clones. This is a known issue. 6472202 'zfs rollback' and 'zfs rename' require that clones be unmounted > So, how do I make it work when there are multiple zones cloned from a single > parent? Is there a way that zfs rename can work for ALL the zones rather > than working with two zones at a time? As I said above. > > Also, is there a command line option available for sorting the datasets in > correct dependency order? "zfs list -r -o name,origin" is a good starting point. I suspect that it doesn't give you exactly the output you are looking for. 
FWIW, the best way to achieve what you are after without breaking the zones is going to be along the lines of: zlogin z1c1 init 0 zoneadm -z z1c1 detach zfs rename rpool/zones/z1c1 rpool/new/z1c1 zonecfg -z z1c1 'set zonepath=/new/z1c1' zoneadm -z z1c1 attach zoneadm -z z1c1 boot -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
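For the other direction, renaming the origin dataset rather than one of its clones, a minimal sketch of the promote step mentioned earlier, using the z1/z1c1 names from this example; with several clones of the same origin, each remaining clone still has to be promoted or otherwise dealt with before the rename will succeed:

# make z1c1 stand alone; rpool/zones/z1 becomes a clone of it
zfs promote rpool/new/z1c1
zfs list -r -o name,origin rpool

# the former origin can now be renamed (still subject to 6472202
# if any remaining clone of it is mounted)
zfs rename rpool/zones/z1 rpool/new/z1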
Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!
On Thu, Aug 4, 2011 at 2:47 PM, Stuart James Whitefish wrote: > # zpool import -f tank > > http://imageshack.us/photo/my-images/13/zfsimportfail.jpg/ I encourage you to open a support case and ask for an escalation on CR 7056738. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
ompromised), loss of one > block can thus be many more times severe. I believe this is true and likely a good topic for discussion. > We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances. The key thing here is that distributed applications will not play nicely. In my best use case, Solaris zones and LDoms are the "application". I don't expect or want Solaris to form some sort of P2P storage system across my data center to save a few terabytes. D12n at the storage device can do this much more reliably with less complexity. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 22, 2008 at 10:44 PM, Erik Trimble <[EMAIL PROTECTED]> wrote: > More than anything, Bob's reply is my major feeling on this. Dedup may > indeed turn out to be quite useful, but honestly, there's no broad data > which says that it is a Big Win (tm) _right_now_, compared to finishing > other features. I'd really want a Engineering Study about the > real-world use (i.e. what percentage of the userbase _could_ use such a > feature, and what percentage _would_ use it, and exactly how useful > would each segment find it...) before bumping it up in the priority > queue of work to be done on ZFS. I get this. However, for most of my uses of clones dedup is considered finishing the job. Without it, I run the risk of having way more writable data than I can restore. Another solution to this is to consider the output of "zfs send" to be a stable format and get integration with enterprise backup software that can perform restores in a way that maintains space efficiency. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot attach mirror to SPARC zfs root pool
On Wed, Jul 23, 2008 at 11:36 AM, <[EMAIL PROTECTED]> wrote: > Rainer, > > Sorry for your trouble. > > I'm updating the installboot example in the ZFS Admin Guide with the > -F zfs syntax now. We'll fix the installboot man page as well. Perhaps it also deserves a mention in the FAQ somewhere near http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/#mirrorboot. 5. How do I attach a mirror to an existing ZFS root pool"? Attach the second disk to form a mirror. In this example, c1t1d0s0 is attached. # zpool attach rpool c1t0d0s0 c1t1d0s0 Prior to build , bug 6668666 causes the following platform-dependent steps to also be needed: On sparc systems: # installboot -F zfs /usr/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0 On x86 systems: # ... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
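For the x86 case that is elided above, the equivalent step in that era was installgrub(1M) against the newly attached half of the mirror; a sketch, using the same hypothetical c1t1d0s0 device:

# installboot is the SPARC tool; installgrub writes the x86 boot blocks
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0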
Re: [zfs-discuss] zfs write cache enable on boot disks ?
On Fri, Apr 25, 2008 at 9:22 AM, Robert Milkowski <[EMAIL PROTECTED]> wrote: > Hello andrew, > > Thursday, April 24, 2008, 11:03:48 AM, you wrote: > > a> What is the reasoning behind ZFS not enabling the write cache for > a> the root pool? Is there a way of forcing ZFS to enable the write cache? > > The reason is that EFI labels are not supported for booting. > So from ZFS perspective you put root pool on a slice on SMI labeled > disk - the way currently ZFS works it assumes in such a case that > there could be other slices used by other programs and because you can > enable/disable write cache per disk and not per slice it's just safer > to not automatically enable it. > > If you havoever enable it yourself then it should stay that way (see > format -e -> cache) So long as the zpool uses all of the space used for dynamic data that needs to survive a reboot, it would seem to make a lot of sense to enable write cache on such disks. This assumes that ZFS does the flush no matter whether it thinks the write cache is enabled or not. Am I wrong about this somehow? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
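For reference, the manual route Robert mentions ("format -e -> cache") looks roughly like the following interactive session; the disk name is a placeholder and the exact menu entries can vary with the disk and driver:

# format -e
#   (select the root-pool disk, e.g. c1t0d0)
#   format> cache
#   cache> write_cache
#   write_cache> display
#   write_cache> enable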
Re: [zfs-discuss] ZFS on 32bit.
On Wed, Aug 6, 2008 at 6:22 PM, Carson Gaspar <[EMAIL PROTECTED]> wrote: > Brian D. Horn wrote: >> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches) >> all the known marvell88sx problems have long ago been dealt with. > > Not true. The working marvell patches still have not been released for > Solaris. They're still just IDRs. Unless you know something I (and my > Sun support reps) don't, in which case please provide patch numbers. I was able to get a Tpatch this week with encouraging words about a likely release of 138053-02 this week. In a separate thread last week (?) Enda said that it should be out within a couple weeks. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <[EMAIL PROTECTED]> wrote: > In the interest of "full disclosure" I have changed the sha256.c in the > ZFS source to use the default kernel one via the crypto framework rather > than a private copy. I wouldn't expect that to have too big an impact (I > will be verifying it I just didn't have the data to hand quickly). Would this also make it so that it would use hardware assisted sha256 on capable (e.g N2) platforms? Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Panic + corrupted pool in snv_98
I had just upgraded (pkg image-update) to snv_98 then was trying to do a build of ON. The build was happening inside of virtualbox, so I can't really say for sure what layer is at fault. I'll keep the disk image and crash dump around for a few days in case anyone is interested in more data from them. Here's the interesting bits from ::msgbuf panic[cpu0]/thread=d48f6de0: assertion failed: 0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SCRUB_FUNC, sizeof (uint32_t), 1, &dp->dp_scrub_func, tx), file: ../../common/fs/zfs/dsl_scrub.c, line: 124 d48f6c0c genunix:assfail+5a (feb96258, feb96238,) d48f6c58 zfs:dsl_pool_scrub_setup_sync+2cc (d68e8b00, ea9dadb4,) d48f6c90 zfs:dsl_sync_task_group_sync+da (df8ceac0, ee97d518) d48f6cdc zfs:dsl_pool_sync+121 (d68e8b00, 610b, 0) d48f6d4c zfs:spa_sync+2b5 (d3104500, 610b, 0) d48f6dc8 zfs:txg_sync_thread+2aa (d68e8b00, 0) d48f6dd8 unix:thread_start+8 () Here's what my pool looks like: pool: export id: 10328403348002192848 state: FAULTED status: One or more devices contains corrupted data. action: The pool cannot be imported due to damaged devices or data. The pool may be active on another system, but can be imported using the '-f' flag. see: http://www.sun.com/msg/ZFS-8000-5E config: export FAULTED corrupted data c6t0d0UNAVAIL corrupted data -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?
On Wed, Sep 24, 2008 at 1:41 PM, Erik Trimble <[EMAIL PROTECTED]> wrote: > I was under the impression that MLC is the preferred type of SSD, but I > want to prevent myself from having a think-o. MLC - description as to why can be found in http://mags.acm.org/communications/200807/ See "Flash Storage Memory" by Adam Leventhal, page 47. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OT: ramdisks (Was: Re: create raidz with 1 disk offline)
On Mon, Sep 29, 2008 at 2:12 AM, Volker A. Brandt <[EMAIL PROTECTED]> wrote: > kthr memorypagedisk faults cpu > r b w swap free re mf pi po fr de sr lf lf lf s0 in sy cs us sy id > 0 0 0 33849968 2223440 2 14 1 0 0 0 0 0 21 0 21 813 1263 957 0 0 99 Note this from vmstat(1M): Without options, vmstat displays a one-line summary of the virtual memory activity since the system was booted. In other words, the first line of vmstat output is some value that does not represent the current state of the system. Try this instead: $ vmstat 1 2 kthr memorypagedisk faults cpu r b w swap free re mf pi po fr de sr m0 m1 m2 m1 in sy cs us sy id 0 0 0 12371648 5840200 30 93 14 1 1 0 0 0 1 0 5 711 2258 1257 1 0 98 0 0 0 9581336 2987584 61 69 0 0 0 0 0 0 0 0 0 543 972 518 0 0 100 >From a free memory standpoint, the current state of the system is very different than the typical state since boot. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
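A tiny illustration of the same point: take two samples and keep only the last one, so the since-boot summary line never makes it into the analysis.

# first line is the since-boot summary; the second sample reflects "now"
vmstat 1 2 | tail -1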
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 4:53 AM, . <[EMAIL PROTECTED]> wrote: > While it's clearly my own fault for taking the risks I did, it's > still pretty frustrating knowing that all my data is likely still > intact and nicely checksummed on the disk but that none of it is > accessible due to some tiny filesystem inconsistency. ?With pretty > much any other FS I think I could get most of it back. > > Clearly such a small number of occurrences in what were admittedly > precarious configurations aren't going to be particularly convincing > motivators to provide a general solution, but I'd feel a whole lot > better about using ZFS if I knew that there were some documented > steps or a tool (zfsck? ;) that could help to recover from this kind > of metadata corruption in the unlikely event of it happening. Well said. You have hit on my #1 concern with deploying ZFS. FWIW, I belive that I have hit the same type of bug as the OP in the following combinations: - T2000, LDoms 1.0, various builds of Nevada in control and guest domains. - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @ build 97 guest In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal <[EMAIL PROTECTED]> wrote: > >> >>In the past year I've lost more ZFS file systems than I have any other >>type of file system in the past 5 years. With other file systems I >>can almost always get some data back. With ZFS I can't get any back. > >> Thats scary to hear! >> > > I am really scared now! I was the one trying to quantify ZFS reliability, > and that is surely bad to hear! The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures. I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patching only support won't help. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html Hang only after I mirrored the zpool, no response on the list http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump. http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html Panic during ON build. Pool was lost, no response from list. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: > Nevada isn't production code. For real ZFS testing, you must use a > production release, currently Solaris 10 (update 5, soon to be update 6). I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <[EMAIL PROTECTED]> wrote: > On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: >> Nevada isn't production code. For real ZFS testing, you must use a >> production release, currently Solaris 10 (update 5, soon to be update 6). > > I misstated before in my LDoms case. The corrupted pool was on > Solaris 10, with LDoms 1.0. The control domain was SX*E, but the > zpool there showed no problems. I got into a panic loop with dangling > dbufs. My understanding is that this was caused by a bug in the LDoms > manager 1.0 code that has been fixed in a later release. It was a > supported configuration, I pushed for and got a fix. However, that > pool was still lost. Or maybe it wasn't fixed yet. I see that this was committed just today. 6684721 file backed virtual i/o should be synchronous http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <[EMAIL PROTECTED]> wrote: > Note: even in a single-device pool, ZFS metadata is replicated via > ditto blocks at two or three different places on the device, so that > a localized media failure can be both detected and corrected. > If you have two or more devices, even without any mirroring > or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) > across those devices. And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per dataset level via copies=n. If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in wide spread loss. Normal bit rot that causes you to lose blocks here and there are somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than than rpool/ROOT) copies=2 can help there. And for those places where losing a txg or two is a mortal sin, don't use flaky hardware and allow zfs to handle a layer of redundancy. This gets me thinking that it may be worthwhile to have a small (<100 MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/) so that "pkg repair" could be used to deal with cases that prevent your normal (>4 GB) boot environment from booting. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
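A minimal sketch of the copies=n idea, with a hypothetical rpool/home dataset; note the setting only affects blocks written after it is set, so it is best applied at creation time:

# keep two copies of the data blocks (not just metadata) for the
# datasets that matter most
zfs create -o copies=2 rpool/home

# or on an existing dataset (new writes only)
zfs set copies=2 rpool/home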
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <[EMAIL PROTECTED]> wrote: > On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <[EMAIL PROTECTED]> wrote: >> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: >>> Nevada isn't production code. For real ZFS testing, you must use a >>> production release, currently Solaris 10 (update 5, soon to be update 6). >> >> I misstated before in my LDoms case. The corrupted pool was on >> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the >> zpool there showed no problems. I got into a panic loop with dangling >> dbufs. My understanding is that this was caused by a bug in the LDoms >> manager 1.0 code that has been fixed in a later release. It was a >> supported configuration, I pushed for and got a fix. However, that >> pool was still lost. > > Or maybe it wasn't fixed yet. I see that this was committed just today. > > 6684721 file backed virtual i/o should be synchronous > > http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10): Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk. Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain. set vds:vd_file_write_flags = 0 Note – Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance bake off vxfs/ufs/zfs need some help
On Sat, Nov 22, 2008 at 11:41 AM, Chris Greer <[EMAIL PROTECTED]> wrote: > vxvm with vxfs we achieved 2387 IOPS In this combination you should be using odm, which comes as part of the Storage Foundation for Oracle or Storage Foundation for Oracle RAC products. It makes the database files on vxfs behave much like they live on raw devices and tends to allow a much higher transaction rate with fewer physical I/Os and less kernel (%sys) utilization. The concept is similar to but different than direct I/O. This behavior is hard, if not impossible, to test without Oracle in the mix because (AFAIK) oracle is the only thing that knows how to make use of the odm interface. > vxvm with ufs we achieved 4447 IOPS > ufs on disk devices we achieved 4540 IOPS > zfs we achieved 1232 IOPS When you say RAC, I assume you mean multi-instance (clustered) databases. None of those are cluster file systems and as such are worthless for multi-instance oracle databases which require a shared file system. On Linux, you say that you were using ocfs. Were you really using ocfs, or were the databases really in ASM? Oracle's recommendation (last I knew) was to have executables on ocfs and have databases in ASM. Have you tried ASM on Solaris? It should give you a lot of the benefits you would expect from ZFS (pooled storage, incremental backups, (I think) efficient snapshots). It will only work for oracle database files (and indexes, etc.) and should work for clustered storage as well. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HELP!!! Need to disable zfs
Boot from the other root drive, mount up the "bad" one at /mnt. Then: # mv /mnt/etc/zfs/zpool.cache /mnt/etc/zpool.cache.bad On Tue, Nov 25, 2008 at 8:18 AM, Mike DeMarco <[EMAIL PROTECTED]> wrote: > My root drive is ufs. I have corrupted my zpool which is on a different drive > than the root drive. > My system paniced and now it core dumps when it boots up and hits zfs start. > I have a alt root drive that can boot the system up with but how can I > disable zfs from starting on a different drive? > > HELP HELP HELP > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate /var
On Tue, Dec 2, 2008 at 11:17 AM, Lori Alt <[EMAIL PROTECTED]> wrote: > I did pre-create the file system. Also, I tried omitting "special" and > zonecfg complains. > > I think that there might need to be some changes > to zonecfg and the zone installation code to get separate > /var datasets in non-global zones to work. You could probably do something like: zfs create rpool/zones/$zone zfs create rpool/zones/$zone/var zonecfg -z $zone add fs set dir=/var set special=/zones/$zone/var set type=lofs end ... zoneadm -z $zone install zonecfg -z $zone remove fs dir=/var zfs set mountpoint=/zones/$zone/root/var rpool/zones/$zone/var -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate /var
On Tue, Dec 2, 2008 at 6:13 PM, Lori Alt <[EMAIL PROTECTED]> wrote: > On 12/02/08 10:24, Mike Gerdts wrote: > I follow you up to here. But why do the next steps? > > > zonecfg -z $zone > > remove fs dir=/var > > > > zfs set mountpoint=/zones/$zone/root/var rpool/zones/$zone/var It's not strictly required to perform this last set of commands, but the lofs mount point is not really needed. Longer term it will likely look cleaner (e.g. to live upgrade) to not have this lofs mount. That is, I suspect that live upgrade is more likely to look at /var in the zone and say "ahhh, that is a zfs file system - I known how to deal with that" than it is for it to say "ahhh, that is a lofs file system to some other zfs file system in the global zone - I know how to deal with that." -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem with time-slider
On Mon, Dec 29, 2008 at 8:21 AM, Charles wrote: > Hi > > I'm a new user of OpenSolaris 2008.11, I switched from Linux to try the > time-slider, but now when I execute the time-slider I get this message: > > http://img115.imageshack.us/my.php?image=capturefentresansnomfx9.png Try running svcs -xv zfs/auto-snapshot The last few lines of the log files mentioned in the output from the above command may provide helpful hints. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] strange performance drop of solaris 10/zfs
On Thu, Jan 29, 2009 at 6:13 AM, Kevin Maguire wrote: > I have tried to establish if some client or clients are thrashing the > server via nfslogd, but without seeing anything obvious. Is there > some kind of per-zfs-filesystem iostat? The following should work in bash or ksh, so long as the list of zfs mount points does not overflow the maximum command line length. $ fsstat $(zfs list -H -o mountpoint | nawk '$1 !~ /^(\/|-|legacy)$/') 5 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is zfs snapshot -r atomic?
On Sun, Feb 22, 2009 at 11:59 AM, David Abrahams wrote: > > When I take a snapshot of a filesystem (or pool) and pass -r to get all > the sub-filesystems, am I getting the state of all the sub-filesystem > snapshots "at the same instant," or is it essentially equivalent to > making the sub-filesystem snapshots one at a time as I would have to do > if -r weren't available? Google for "zfs snapshot recursive atomic" leads me to: http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view Which says: Recursive ZFS snapshots are created quickly as one atomic operation. The snapshots are created together (all at once) or not created at all. The benefit of atomic snapshots operations is that the snapshot data is always taken at one consistent time, even across descendent file systems. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
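A quick illustration, assuming a pool named tank: the -r form creates every descendent snapshot in a single transaction, so they all share one point in time.

zfs snapshot -r tank@nightly-2009-02-22
zfs list -t snapshot -r tank | grep nightly-2009-02-22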
Re: [zfs-discuss] Details on raidz boot + zfs patents?
On Sat, Feb 28, 2009 at 4:53 AM, "C. Bergström" wrote: > The other question that I am less worried about is would this violate any > patents.. I mean.. Sun added the initial zfs support to grub and this is > essentially extending that, but I'm not aware of any patent provisions on > that code or some royalty free statement about ZFS related patents from > Sun.. (Frankly.. I look at Sun as /similar/ to Cononical in that I assume > they only sue to protect themselves and not go after any good intention foss > project..) See http://opensolaris.org/os/about/faq/licensing_faq/#patents. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 1:20 AM, Richard Elling wrote: > David Magda wrote: >> On Feb 27, 2009, at 20:02, Richard Elling wrote: >>> At the risk of repeating the Best Practices Guide (again): >>> The zfs send and receive commands do not provide an enterprise-level >>> backup solution. >> >> Yes, in its current state; hopefully that will change some point in the >> future (which is what we're talking about with GSoC--the potential to change >> the status quo). > > I suppose, but considering that enterprise backup solutions exist, > and some are open source, why reinvent the wheel? > -- richard The default mode of operation for every enterprise backup tool that I have used is file level backups. The determination of which files need to be backed up seems to be to crawl the file system looking for files that have an mtime after the previous backup. Areas of strength for such tools include: - Works with any file system that provides a POSIX interface - Restore of a full backup is an accurate representation of the data backed up - Restore can happen to a different file system type - Restoring an individual file is possible Areas of weakness include: - Extremely inefficient for file systems with lots of files and little change. - Restore of full + incremental tends to have extra files because of spotty support or performance overhead of tool that would prevent it. - Large files that have blocks rewritten get backed up in full each time - Restores of file systems with lots of small files (especially in one directory) are extremely slow There exist features (sometimes expensive add-ons) that deal with some of these shortcomings via: - Keeping track of deleted files so that a restore is more representative of what is on disk during the incremental backup. Administration manuals typically warn that this has a big performance and/or size overhead on the database used by the backup software. - Including add-ons that hook into other components (e.g. VxFS storage checkpoints, Oracle RMAN) that provide something similar to block-level incremental backups Why re-invent the wheel? - People are more likely to have snapshots available for file-level restores, and as such a "zfs send" data stream would only be used in the event of a complete pool loss. - It is possible to provide a general block-level backup solution so that every product doesn't have to invent it. This gives ZFS another feature benefit to put it higher in the procurement priority. - File creation slowness can likely be avoided allowing restore to happen at tape speed - To be competitive with NetApp "snapmirror to tape" - Even having a zfs(1M) option that could list the files that change between snapshots could be very helpful to prevent file system crawls and to avoid being fooled by bogus mtimes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 4:33 PM, Nicolas Williams wrote: > On Sat, Feb 28, 2009 at 10:44:59PM +0100, Thomas Wagner wrote: >> > >> pool-shrinking (and an option to shrink disk A when i want disk B to >> > >> become a mirror, but A is a few blocks bigger) >> > This may be interesting... I'm not sure how often you need to shrink a >> > pool >> > though? Could this be classified more as a Home or SME level feature? >> >> Enterprise level especially in SAN environments need this. >> >> Projects own theyr own pools and constantly grow and *shrink* space. >> And they have no downtime available for that. > > Multiple pools on one server only makes sense if you are going to have > different RAS for each pool for business reasons. It's a lot easier to > have a single pool though. I recommend it. Other scenarios for multiple pools include: - Need independent portability of data between servers. For example, in a HA cluster environment, various workloads will be mapped to various pools. Since ZFS does not do active-active clustering, a single pool for anything other than a simple active-standby cluster is not useful. - Array based copies are needed. There are times when copies of data are performed at a storage array level to allow testing and support operations to happen "on different spindles". For example, in a consolidated database environment, each database may be constrained to a set of spindles so that each database can be replicated or copied independent of the various others. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 8:34 PM, Nicolas Williams wrote: > On Sat, Feb 28, 2009 at 05:19:26PM -0600, Mike Gerdts wrote: >> On Sat, Feb 28, 2009 at 4:33 PM, Nicolas Williams >> wrote: >> > On Sat, Feb 28, 2009 at 10:44:59PM +0100, Thomas Wagner wrote: >> >> > >> pool-shrinking (and an option to shrink disk A when i want disk B to >> >> > >> become a mirror, but A is a few blocks bigger) >> >> > This may be interesting... I'm not sure how often you need to shrink a >> >> > pool >> >> > though? Could this be classified more as a Home or SME level feature? >> >> >> >> Enterprise level especially in SAN environments need this. >> >> >> >> Projects own theyr own pools and constantly grow and *shrink* space. >> >> And they have no downtime available for that. >> > >> > Multiple pools on one server only makes sense if you are going to have >> > different RAS for each pool for business reasons. It's a lot easier to >> > have a single pool though. I recommend it. >> >> Other scenarios for multiple pools include: >> >> - Need independent portability of data between servers. For example, >> in a HA cluster environment, various workloads will be mapped to >> various pools. Since ZFS does not do active-active clustering, a >> single pool for anything other than a simple active-standby cluster is >> not useful. > > Right, but normally each head in a cluster will have only one pool > imported. Not necessarily. Suppose I have a group of servers with a bunch of zones. Each zone represents a service group that needs to independently fail over between servers. In that case, I may have a zpool per zone. It seems this is how it is done in the real world.[1] 1. Upton, Tom. "A Conversation with Jason Hoffman." ACM Queue. January/February 2008. 9. > The Sun Storage 7xxx do this. One pool per-head, two pools altogether > in a cluster. Makes sense for your use case. If you are looking at a zpool per zone, it is likely a zpool created on a LUN provided by a Sun Storage 7xxx that is presented to multiple hosts. That is, ZFS on top of ZFS. >> - Array based copies are needed. There are times when copies of data >> are performed at a storage array level to allow testing and support >> operations to happen "on different spindles". For example, in a >> consolidated database environment, each database may be constrained to >> a set of spindles so that each database can be replicated or copied >> independent of the various others. > > This gets you back into managing physical space allocation. Do you > really want that? If you're using zvols you can do "array based copies" > of you zvols. If you're using filesystems then you should just use > normal backup tools. There are times when you have no real choice. If a regulation or a lawyer's interpretation of a regulation says that you need to have physically separate components, you need to have physically separate components. If your disaster recovery requirements mean that you need to have a copy of data at a different site and array based copies have historically been used - it is unlikely that "while true ; do zfs send | ssh | zfs receive" will be adapted in the first round of implementation. Given this, zvols don't do it today. When you have a smoking hole, the gap in transactions left by normal backup tools is not always good enough - especially if some of that smoke is coming from the tape library. Array based replication tends to allow you to keep much tighter tolerances on just how many committed transactions you are willing to lose. 
-- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [Fwd: ZFS user/group quotas & space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]
2009/3/31 Matthew Ahrens : > 4. New Properties > > user/group space accounting information and quotas can be manipulated > with 4 new properties: > > zfs get userused@<user> > zfs get groupused@<group> > > zfs get userquota@<user> > zfs get groupquota@<group> > > zfs set userquota@<user>=<quota> > zfs set groupquota@<group>=<quota> > > The <user> or <group> is specified using one of the following forms: > posix name (eg. ahrens) > posix numeric id (eg. 126829) > sid name (eg. ahrens@sun) > sid numeric id (eg. S-1-12345-12423-125829) How does this work with zones? Suppose in the global zone I have passwd entries like: jill:x:123:123:Jill Admin:/home/jill:/bin/bash joe:x:124:124:Joe Admin:/home/joe:/bin/bash And in a non-global zone (called bedrock) I have: fred:x:123:123:Fred Flintstone:/home/fred:/bin/bash barney:x:124:124:Barney Rubble:/home/barney:/bin/bash Dataset rpool/quarry is delegated to the zone bedrock. Does "zfs get all rpool/quarry" report the same thing whether it is run in the global zone or the non-global zone? Has there been any thought to using a UID resolution mechanism similar to that used by ps? That is, if "zfs get ... " is run in the global zone and the dataset is delegated to a non-global zone, display the UID rather than a possibly mistaken username. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
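For reference, a sketch of how the proposed properties would be used in practice; tank/home is a hypothetical dataset and ahrens is just the user name from the quoted example:

zfs set userquota@ahrens=10G tank/home
zfs get userquota@ahrens,userused@ahrens tank/home

# per-user accounting for the whole dataset (zfs userspace, also part of this case)
zfs userspace tank/home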
Re: [zfs-discuss] [Fwd: ZFS user/group quotas & space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]
On Tue, Mar 31, 2009 at 7:12 PM, Matthew Ahrens wrote: > River Tarnell wrote: >> >> Matthew Ahrens: >>> >>> ZFS user quotas (like other zfs properties) will not be accessible over >>> NFS; >>> you must be on the machine running zfs to manipulate them. >> >> does this mean that without an account on the NFS server, a user cannot >> see his >> current disk use / quota? > > That's correct. Do you have a reason for not wanting this to be implemented, or are you just avoiding scope creep? In the past, this was a big pain point for NFS servers that used VxFS. I used one of Sun's "source available" programs to get the rquotad source to implement this in the Solaris 7 days. Google suggests others have done the same using the opensolaris code as a starting point. Still others have written wrappers around quota(1M) that invoke rsh or ssh to the appropriate NFS server. It seems as though this was eventually addressed by Veritas with 110434-02. We really shouldn't repeat this for long. It should be fairly straight-forward to modify rquotad to support this, so long as the zfs end of it is not overly complicated. Is now too early to file the RFE? For some reason it feels like the person on the other end of bugs.opensolars.org will get confused by the request to enhance a feature that doesn't yet exist. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss