[zfs-discuss] Benchmarking Methodologies
I'm doing a little research study on ZFS benchmarking and performance profiling. Like most, I've had my favorite methods, but I'm re-evaluating my choices and trying to be a bit more scientific than I have in the past. To that end, I'm curious whether folks would mind sharing their work on the subject. What tool(s) do you prefer in what situations? Do you have a standard method of running them (tool args; block sizes, thread counts, ...) or procedures between runs (zpool import/export, new dataset creation, ...)? Any feedback is appreciated. I want to get a good sampling of opinions. Thanks! benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Benchmarking Methodologies
On 4/21/10 2:15 AM, Robert Milkowski wrote: > I haven't heard from you in a while! Good to see you here again :) > > Sorry for stating obvious but at the end of a day it depends on what > your goals are. > Are you interested in micro-benchmarks and comparison to other file > systems? > > I think the most relevant filesystem benchmarks for users is when you > benchmark a specific application and present results from an > application point of view. For example, given a workload for Oracle, > MySQL, LDAP, ... how quickly it completes? How much benefit there is > by using SSDs? What about other filesystems? > > Micro-benchmarks are fine but very hard to be properly interpreted by > most users. > > Additionally most benchmarks are almost useless if they are not > compared to some other configuration with only a benchmarked component > changed. For example, knowing that some MySQL load completes in 1h on > ZFS is basically useless. But knowing that on the same HW with > Linux/ext3 and under the same load it completes in 2h would be > interesting to users. > > Other interesting thing would be to see an impact of different ZFS > setting on a benchmark results (aligned recordsize for database vs. > default, atime off vs. on, lzjb, gzip, ssd). Also comparison of > benchmark results with all default zfs setting compared to whatever > setting you did which gave you the best result.

Hey Robert... I'm always around. :) You've made an excellent case for benchmarking and where it's useful, but what I'm asking for on this thread is for folks to share the research they've done, with as much specificity as possible, for research purposes. :) Let me illustrate.

To Darren's point on FileBench and vdbench... to date I've found these two to be the most useful. IOzone, while very popular, has always given me strange results which are inconsistent regardless of how large the block and data sizes are. Given that the most important aspect of any benchmark is repeatability and sanity in results, I've found no value in IOzone any longer.

vdbench has become my friend, particularly in the area of physical disk profiling. Before tuning ZFS (or any filesystem) it's important to find a solid baseline of performance on the underlying disk structure. Using a variety of vdbench profiles such as the following helps you pinpoint exactly the edges of the performance envelope:

sd=sd1,lun=/dev/rdsk/c0t1d0s0,threads=1
wd=wd1,sd=sd1,readpct=100,rhpct=0,seekpct=0
rd=run1,wd=wd1,iorate=max,elapsed=10,interval=1,forxfersize=(4k-4096k,d)

With vdbench and the workload above I can get consistent, reliable results time after time, and the results on other systems match. This is particularly key if you're running a hardware RAID controller under ZFS. There isn't anything dd can do that vdbench can't do better. Using a workload like the above, both at differing xfer sizes and at differing thread counts, really helps give an accurate picture of the disk capabilities.

Moving up into the filesystem: I've been looking intently at improving my FileBench profiles, based on the supplied ones with tweaking. I'm trying to get to a methodology that provides me with time-after-time repeatable results for real comparison between systems. I'm looking hard at vdbench file workloads, but they aren't yet nearly as sophisticated as FileBench. I am also looking at FIO (http://freshmeat.net/projects/fio/), which is FileBench-esque. At the end of the day, I agree entirely that application benchmarks are far more effective judges...
but they are also more time-consuming and less flexible than dedicated tools. The key is honing generic benchmarks to provide useful data which can be relied upon for making accurate estimates about application performance. When you start judging filesystem performance based on something like MySQL there are simply too many variables involved. So, I appreciate the Benchmark 101, but I'm looking for anyone interested in sharing meat. Most of the existing ZFS benchmarks folks have published are several years old now, and most were using IOzone. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
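[Editor's note] As a footnote to the vdbench discussion above: the same raw-disk profile can be made to sweep thread counts as well as transfer sizes in one run. This is only a sketch; the device path is a placeholder and the forthreads/forxfersize syntax should be checked against the vdbench release in use:

sd=sd1,lun=/dev/rdsk/c0t1d0s0,threads=32
wd=wd1,sd=sd1,readpct=100,rhpct=0,seekpct=0
rd=run1,wd=wd1,iorate=max,elapsed=30,interval=1,forxfersize=(4k-4096k,d),forthreads=(1-32,d)

Each (xfersize, threads) combination then becomes its own run, which makes it easy to see where the device's IOPS and bandwidth flatten out.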
Re: [zfs-discuss] Plugging in a hard drive after Solaris has booted up?
On 5/7/10 9:38 PM, Giovanni wrote:
> Hi guys,
>
> I have a quick question, I am playing around with ZFS and here's what I did.
>
> I created a storage pool with several drives. I unplugged 3 out of 5 drives from the array, currently:
>
> NAME      STATE     READ WRITE CKSUM
> gpool     UNAVAIL   0 0 0  insufficient replicas
>   raidz1  UNAVAIL   0 0 0  insufficient replicas
>     c8t2d0  UNAVAIL  0 0 0  cannot open
>     c8t4d0  UNAVAIL  0 0 0  cannot open
>     c8t0d0  UNAVAIL  0 0 0  cannot open
>
> These drives had power all the time, the SATA cable however was disconnected. Now, after I logged into Solaris and opened firefox, I plugged them back in to sit and watch if the storage pool suddenly becomes "available"
>
> This did not happen, so my question is, do I need to make Solaris re-detect the hard drives and if so how? I tried format -e but it did not seem to detect the 3 drives I just plugged back in. Is this a BIOS issue?
>
> Does hot-swap hard drives only work when you replace current hard drives (previously detected by BIOS) with others but not when you have ZFS/Solaris running and want to add more storage without shutting down?
>
> It all boils down to, say the scenario is that I will need to purchase more hard drives as my array grows, I would like to be able to (without shutting down) add the drives to the storage pool (zpool)

There are lots of different things you can look at and do, but it comes down to just one command: "devfsadm -vC". This will clean up (-C for cleanup, -v for verbose) the device tree if it gets into a funky state. Then run "format" or "iostat -En" to verify that the device(s) are there. Then re-import the zpool, add the device, or whatever you wish to do. Even if device locations change, ZFS will do the right thing on import. If you wish to dig deeper... normally, when you attach a new device, hot-plug will do the right thing and you'll see the connection messages in "dmesg". If you want to explicitly check the state of dynamic reconfiguration, check out the "cfgadm" command. Normally, however, on modern versions of Solaris there is no reason to resort to that; it's just something fun if you wish to dig. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
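[Editor's note] A minimal sketch of the sequence described above, using the pool name from the original post (a sketch only; adjust names to taste):

devfsadm -vC          # clean up and rebuild the device tree
iostat -En            # or run: format   (confirm the re-attached disks are visible)
zpool import          # list pools that are now importable
zpool import gpool    # bring the pool back once all devices are present

If the drives still don't show up, cfgadm -al shows the state of each port and whether the OS has configured the attachment point.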
Re: [zfs-discuss] Mirrored Servers
On 5/8/10 3:07 PM, Tony wrote: > Lets say I have two servers, both running opensolaris with ZFS. I basically > want to be able to create a filesystem where the two servers have a common > volume, that is mirrored between the two. Meaning, each server keeps an > identical, real time backup of the other's data directory. Set them both up > as file servers, and load balance between the two for incoming requests. > > How would anyone suggest doing this?

I would carefully consider whether or not they _really_ need to be real-time. Can you tolerate 5 minutes, or even just 60 seconds, of difference between them? If you can, then things are much easier and less complex. I'd personally use ZFS snapshots to keep the two servers in sync every 60 seconds. As for load balancing, that depends on which protocol you're using. FTP is easy; NFS/CIFS is a little harder. I'd simply use a load balancer (Zeus, NetScaler, Balance, HA-Proxy, etc.), but that is a little scary and bizarre in the case of NFS/CIFS, where you should instead use a single-server failover solution such as Sun Cluster. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
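[Editor's note] To make the snapshot approach concrete, a rough sketch of a 60-second sync loop using zfs send/receive over ssh. The pool and dataset names are made up; the initial full send, snapshot naming hygiene, and error handling are all omitted:

# runs on the primary; $LAST holds the name of the previous snapshot
NOW=sync-`date +%s`
zfs snapshot tank/data@$NOW
zfs send -i tank/data@$LAST tank/data@$NOW | ssh node2 zfs recv -F tank/data
zfs destroy tank/data@$LAST
LAST=$NOW

Note that the receiving dataset must not be modified locally between updates, or the next incremental receive will fail (hence the -F to force a rollback on the receiver).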
Re: [zfs-discuss] ZFS Hard disk buffer at 100%
The drive (c7t2d0) is bad and should be replaced. The second drive (c7t5d0) is either bad or going bad. This is exactly the kind of problem that can force a Thumper to its knees: ZFS performance is horrific, and as soon as you drop the bad disks things magically return to normal. My first recommendation is to pull the SMART data from the disks if you can. I wrote a blog entry about SMART back in 2008 to address exactly the behavior you're seeing: http://www.cuddletech.com/blog/pivot/entry.php?id=993 Yes, people will claim that SMART data is useless for predicting failures, but in a case like yours you are just looking for data to corroborate a hypothesis. In order to test this condition, "zpool offline..." c7t2d0, which emulates removal, and see if performance improves. On Thumpers I'd build a list of "suspect disks" based on 'iostat', like you show, then correlate the SMART data, and then systematically offline disks to see if it really was the problem. In my experience the only other reason you'll legitimately see really weird "bottoming out" of IO like this is if you hit the max concurrent IO limit in ZFS (until recently that limit was 35), so you'd see actv=35, and then when the device finally processed the IOs the thing would snap back to life. But even in those cases you shouldn't see request times (asvc_t) rise above 200ms. All that to say, replace those disks or at least test it. SSDs won't help; one or more drives are toast. benr.

On 5/8/10 9:30 PM, Emily Grettel wrote:
> Hi Giovani,
>
> Thanks for the reply.
>
> Here's a bit of iostat after uncompressing a 2.4Gb RAR file that has 1 DWF file that we use.
>
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 1.0 13.0 26.0 18.0 0.0 0.00.00.8 0 1 c7t1d0
> 2.05.0 77.0 12.0 2.4 1.0 343.8 142.8 100 100 c7t2d0
> 1.0 16.0 25.5 15.5 0.0 0.00.00.3 0 0 c7t3d0
> 0.0 10.00.0 17.0 0.0 0.03.21.2 1 1 c7t4d0
> 1.0 12.0 25.5 15.5 0.4 0.1 32.4 10.9 14 14 c7t5d0
> 1.0 15.0 25.5 18.0 0.0 0.00.10.1 0 0 c0t1d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 0.00.00.00.0 2.0 1.00.00.0 100 100 c7t2d0
> 1.00.00.50.0 0.0 0.00.00.1 0 0 c7t0d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 5.0 15.0 128.0 18.0 0.0 0.00.01.8 0 3 c7t1d0
> 1.09.0 25.5 18.0 2.0 1.8 199.7 179.4 100 100 c7t2d0
> 3.0 13.0 102.5 14.5 0.0 0.10.05.2 0 5 c7t3d0
> 3.0 11.0 102.0 16.5 0.0 0.12.34.2 1 6 c7t4d0
> 1.04.0 25.52.0 0.4 0.8 71.3 158.9 12 79 c7t5d0
> 5.0 16.0 128.5 19.0 0.0 0.10.12.6 0 5 c0t1d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 0.04.00.02.0 2.0 2.0 496.1 498.0 99 100 c7t2d0
> 0.00.00.00.0 0.0 1.00.00.0 0 100 c7t5d0
> extended device statistics
> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
> 7.00.0 204.50.0 0.0 0.00.00.2 0 0 c7t1d0
> 1.00.0 25.50.0 3.0 1.0 2961.6 1000.0 99 100 c7t2d0
> 8.00.0 282.00.0 0.0 0.00.00.3 0 0 c7t3d0
> 6.00.0 282.50.0 0.0 0.06.12.3 1 1 c7t4d0
> 0.03.00.05.0 0.5 1.0 165.4 333.3 18 100 c7t5d0
> 7.00.0 204.50.0 0.0 0.00.01.6 0 1 c0t1d0
> 2.02.0 89.0 12.0 0.0 0.03.16.1 1 2 c3t0d0
> 0.02.00.0 12.0 0.0 0.00.00.2 0 0 c3t1d0
>
> Sometimes two or more disks are going at 100. How does one solve this issue if its a firmware bug? I tried looking around for Western Digital Firmware for WD10EADS but couldn't find any available.
>
> Would adding an SSD or two help here?
> > Thanks, > Em > > > Date: Fri, 7 May 2010 14:38:25 -0300 > Subject: Re: [zfs-discuss] ZFS Hard disk buffer at 100% > From: gtirl...@sysdroid.com > To: emilygrettelis...@hotmail.com > CC: zfs-discuss@opensolaris.org > > > On Fri, May 7, 2010 at 8:07 AM, Emily Grettel > mailto:emilygrettelis...@hotmail.com>> > wrote: > > Hi, > > I've had my RAIDz volume working well on SNV_131 but it has come > to my attention that there has been some read issues with the > drives. Previously I thought this was a CIFS problem but I'm > noticing that when transfering files or uncompressing some fairly > large 7z (1-2Gb) files (or even smaller rar - 200-300Mb) files > occasionally running iostat will give th
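[Editor's note] To make the offline test described above concrete, a sketch assuming the pool is named tank and is raidz, so a single disk can safely be taken away:

zpool offline tank c7t2d0    # emulate pulling the suspect disk
iostat -xn 5                 # watch whether asvc_t and %b on the remaining disks return to sane values
zpool online tank c7t2d0     # put it back, or 'zpool replace tank c7t2d0 <new-disk>' if it's confirmed bad

If the stalls disappear with the disk offlined, that is the corroborating data point to go with the SMART attributes.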
Re: [zfs-discuss] Opensolaris is apparently dead
On 8/13/10 9:02 PM, "C. Bergström" wrote: > Erast wrote: >> >> >> On 08/13/2010 01:39 PM, Tim Cook wrote: >>> http://www.theregister.co.uk/2010/08/13/opensolaris_is_dead/ >>> >>> I'm a bit surprised at this development... Oracle really just doesn't >>> get it. The part that's most disturbing to me is the fact they >>> won't be >>> releasing nightly snapshots. It appears they've stopped Illumos in its >>> tracks before it really even got started (perhaps that explains the >>> timing of this press release) >> >> Wrong. Be patient, with the pace of current Illumos development it >> soon will have all the closed binaries liberated and ready to sync up >> with promised ON code drops as dictated by GPL and CDDL licenses. > Illumos is just a source tree at this point. You're delusional, > misinformed, or have some big wonderful secret if you believe you have > all the bases covered for a pure open source distribution though.. > > What's closed binaries liberated really mean to you? > > Does it mean >a. You copy over the binary libCrun and continue to use some > version of Sun Studio to build onnv-gate >b. You debug the problems with and start to use ancient gcc-3 (at > the probably expense of performance regressions which most people > would find unacceptable) >c. Your definition is narrow and has missed some closed binaries > > > I think it's great people are still hopeful, working hard and going to > steward this forward, but I wonder.. What pace are you referring to? > The last commit to illumos-gate was 6 days ago and you're already not > even keeping it in sync.. Can you even build it yet and if so where's > the binaries? Illumos is 2 weeks old. Lets cut it a little slack. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Opensolaris is apparently dead
On 8/14/10 1:12 PM, Frank Cusack wrote: > > Wow, what leads you guys to even imagine that S11 wouldn't contain > comstar, etc.? *Of course* it will contain most of the bits that > are current today in OpenSolaris.

That's a very good question actually. I would think that COMSTAR would stay because it's used by the Fishworks appliance... however, COMSTAR is a competitive advantage for DIY storage solutions. Maybe they will rip it out of S11 and make it an add-on or something. That would suck. I guess the only real reason you can't yank COMSTAR is because it's now the basis for iSCSI Target support. But again, there is nothing saying that Target support has to be part of the standard OS offering. Scary to think about. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Running on Dell hardware?
If you're still having issues go into the BIOS and disable C-States, if you haven't already. It is responsible for most of the problems with 11th Gen PowerEdge. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to delete hundreds of empty snapshots
zfs list is mighty slow on systems with a large number of objects, but there is no foreseeable plan that I'm aware of to solve that "problem". Nevertheless, you need to do a zfs list, so do it once and work from that:

zfs list > /tmp/zfs.out
for i in `awk '/mydataset@/ {print $1}' /tmp/zfs.out`; do zfs destroy $i; done

As for 5-minute snapshots, this is NOT a bad idea. It is, however, complex to manage, so you need to employ tactics to make it more digestible. Ask yourself first why you want 5-minute snaps. Is it replication? If so, create the snapshot, replicate it, then destroy all but the last snapshot, or rotate them. Or is it a fallback in case you make a mistake? Then just keep around the last 6 snapshots or so (see the sketch below). zfs rename & zfs destroy are your friends; use them wisely. :) If you want to discuss exactly what you're trying to facilitate I'm sure we can come up with some more concrete ideas to help you. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
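[Editor's note] A sketch of the "keep the last 6" rotation. Snapshot names are assumed to sort in creation order, the tank/mydataset name is made up, and option availability (zfs list -t snapshot, -H, -o) should be checked on older builds:

KEEP=6
zfs list -H -o name -t snapshot | grep '^tank/mydataset@' | sort > /tmp/snaps.out
COUNT=`wc -l < /tmp/snaps.out`
DEL=`expr $COUNT - $KEEP`
if [ "$DEL" -gt 0 ]; then
    head -$DEL /tmp/snaps.out | while read SNAP; do
        zfs destroy "$SNAP"
    done
fi

Run from the same cron job that takes the 5-minute snapshots, this keeps the snapshot set bounded.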
[zfs-discuss] ARCSTAT Kstat Definitions
Would someone "in the know" be willing to write up (preferably blog) definitive definitions/explanations of all the arcstats provided via kstat? I'm struggling with proper interpretation of certain values, namely "p", "memory_throttle_count", and the mru/mfu+ghost hit vs demand/prefetch hit counters. I think I've got it figured out, but I'd really like expert clarification before I start tweaking. Thanks. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARCSTAT Kstat Definitions
Thanks, not as much as I was hoping for but still extremely helpful. Can you, or others, have a look at this: http://cuddletech.com/arc_summary.html

This is a Perl script that uses kstats to drum up a report such as the following:

System Memory:
        Physical RAM:  32759 MB
        Free Memory :  10230 MB
        LotsFree:      511 MB

ARC Size:
        Current Size:             7989 MB (arcsize)
        Target Size (Adaptive):   8192 MB (c)
        Min Size (Hard Limit):    1024 MB (zfs_arc_min)
        Max Size (Hard Limit):    8192 MB (zfs_arc_max)

ARC Size Breakdown:
        Most Recently Used Cache Size:    13%  1087 MB (p)
        Most Frequently Used Cache Size:  86%  7104 MB (c-p)

ARC Efficency:
        Cache Access Total:   3947194710
        Cache Hit Ratio:      99%  3944674329
        Cache Miss Ratio:      0%  2520381
        Data Demand Efficiency:    99%
        Data Prefetch Efficiency:  69%

        CACHE HITS BY CACHE LIST:
          Anon:                        0%  16730069
          Most Frequently Used:       99%  3915830091 (mfu)
          Most Recently Used:          0%  10490502 (mru)
          Most Frequently Used Ghost:  0%  439554 (mfu_ghost)
          Most Recently Used Ghost:    0%  1184113 (mru_ghost)

        CACHE HITS BY DATA TYPE:
          Demand Data:        99%  3914527790
          Prefetch Data:       0%  2447831
          Demand Metadata:     0%  10709326
          Prefetch Metadata:   0%  16989382

        CACHE MISSES BY DATA TYPE:
          Demand Data:        45%  1144679
          Prefetch Data:      42%  1068975
          Demand Metadata:     5%  132649
          Prefetch Metadata:   6%  174078

Feedback and input is welcome, in particular if I'm mischaracterizing data. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ARCSTAT Kstat Definitions
It's a starting point anyway. The key is to try and draw useful conclusions from the info to answer the torrent of "why is my ARC 30GB???" questions. There are several things I'm unclear on whether or not I'm properly interpreting, such as:

* As you state, the anon pages. Even the comment in the code is, to me anyway, a little vague. I include them because otherwise you look at the hit counters and wonder where a large chunk of them went.

* Prefetch... I want to use the Prefetch Data hit ratio as a judgment call on the efficiency of prefetch. If the value is very low it might be best to turn it off, but I'd like to hear that from someone else before I go saying it. In high-latency environments, such as ZFS on iSCSI, prefetch can either significantly help or hurt, and determining which is difficult without some type of metric like the one above.

* There are several instances (based on dtracing) in which the ARC is bypassed... for the ZIL I understand; in some other cases I need to spend more time analyzing the DMU (dbuf_*) to see why.

* In answering the "Is having a 30GB ARC good?" question, I want to say that if MFU is >60% of the ARC, and the hits are mostly MFU, then you are deriving significant benefit from your large ARC. But on a system with a 2GB ARC or a 30GB ARC the overall hit ratio tends to be 99%, which is nuts and tends to reinforce a misinterpretation of anon hits.

The only way I'm seeing to _really_ understand ARC's efficiency is to look at the overall number of reads, then how many are intercepted by the ARC, how many actually made it to disk, and why (prefetch or demand). This is tricky to implement via kstats because you have to pick out and monitor the zpool disks themselves. I've spent a lot of time in this code (arc.c) and still have a lot of questions. I really wish there was an "Advanced ZFS Internals" talk coming up; I simply can't keep spending so much time on this. Feedback from PAE or other tuning experts is welcome and appreciated. :) benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
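[Editor's note] One small example of the kind of derived metric discussed above: demand-data hit rate computed separately from prefetch, straight from the arcstats kstat names used in arc.c (a sketch; run in ksh or bash, instance number may differ):

HITS=`kstat -p zfs:0:arcstats:demand_data_hits | awk '{print $2}'`
MISS=`kstat -p zfs:0:arcstats:demand_data_misses | awk '{print $2}'`
echo "demand data hit ratio: $(( HITS * 100 / (HITS + MISS) ))%"

Numbers broken out per access type like this are much harder to misread than the single overall hit ratio.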
Re: [zfs-discuss] ARCSTAT Kstat Definitions
New version is available (v0.2) : * Fixes divide by zero, * includes tuning from /etc/system in output * if prefetch is disabled I explicitly say so. * Accounts for jacked anon count. Still need improvement here. * Added friendly explanations for MRU/MFU & Ghost lists counts. Page and examples are updated: cuddletech.com/arc_summary.pl Still needs work, but hopefully interest in this will stimulate some improved understanding of ARC internals. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Lost Disk Space
I've been struggling to fully understand why disk space seems to vanish. I've dug through bits of code and reviewed all the mails on the subject that I can find, but I still don't have a proper understanding of what's going on. I did a test with a local zpool on snv_97... zfs list, zpool list, and zdb all seem to disagree on how much space is available. In this case it's only a discrepancy of about 20G or so, but I've got Thumpers that have a discrepancy of over 6TB! Can someone give a really detailed explanation about what's going on?

block traversal size 670225837056 != alloc 720394438144 (leaked 50168601088)

bp count:       15182232
bp logical:     672332631040  avg: 44284
bp physical:    669020836352  avg: 44066  compression: 1.00
bp allocated:   670225837056  avg: 44145  compression: 1.00
SPA allocated:  720394438144  used: 96.40%

Blocks  LSIZE  PSIZE  ASIZE  avg  comp  %Total  Type
12  120K  26.5K  79.5K  6.62K  4.53  0.00  deferred free
1  512  512  1.50K  1.50K  1.00  0.00  object directory
3  1.50K  1.50K  4.50K  1.50K  1.00  0.00  object array
1  16K  1.50K  4.50K  4.50K  10.67  0.00  packed nvlist
-  -  -  -  -  -  -  packed nvlist size
72  8.45M  889K  2.60M  37.0K  9.74  0.00  bplist
-  -  -  -  -  -  -  bplist header
-  -  -  -  -  -  -  SPA space map header
974  4.48M  2.65M  7.94M  8.34K  1.70  0.00  SPA space map
-  -  -  -  -  -  -  ZIL intent log
96.7K  1.51G  389M  777M  8.04K  3.98  0.12  DMU dnode
17  17.0K  8.50K  17.5K  1.03K  2.00  0.00  DMU objset
-  -  -  -  -  -  -  DSL directory
13  6.50K  6.50K  19.5K  1.50K  1.00  0.00  DSL directory child map
12  6.00K  6.00K  18.0K  1.50K  1.00  0.00  DSL dataset snap map
14  38.0K  10.0K  30.0K  2.14K  3.80  0.00  DSL props
-  -  -  -  -  -  -  DSL dataset
-  -  -  -  -  -  -  ZFS znode
2  1K  1K  2K  1K  1.00  0.00  ZFS V0 ACL
5.81M  558G  557G  557G  95.8K  1.00  89.27  ZFS plain file
382K  301M  200M  401M  1.05K  1.50  0.06  ZFS directory
9  4.50K  4.50K  9.00K  1K  1.00  0.00  ZFS master node
12  482K  20.0K  40.0K  3.33K  24.10  0.00  ZFS delete queue
8.20M  66.1G  65.4G  65.8G  8.03K  1.01  10.54  zvol object
1  512  512  1K  1K  1.00  0.00  zvol prop
-  -  -  -  -  -  -  other uint8[]
-  -  -  -  -  -  -  other uint64[]
-  -  -  -  -  -  -  other ZAP
-  -  -  -  -  -  -  persistent error log
1  128K  10.5K  31.5K  31.5K  12.19  0.00  SPA history
-  -  -  -  -  -  -  SPA history offsets
-  -  -  -  -  -  -  Pool properties
-  -  -  -  -  -  -  DSL permissions
-  -  -  -  -  -  -  ZFS ACL
-  -  -  -  -  -  -  ZFS SYSACL
-  -  -  -  -  -  -  FUID table
-  -  -  -  -  -  -  FUID table size
5  3.00K  2.50K  7.50K  1.50K  1.20  0.00  DSL dataset next clones
-  -  -  -  -  -  -  scrub work queue
14.5M  626G  623G  624G  43.1K  1.00  100.00  Total

real    21m16.862s
user    0m36.984s
sys     0m5.757s

===

Looking at the data:

[EMAIL PROTECTED] ~$ zfs list backup && zpool list backup
NAME     USED  AVAIL  REFER  MOUNTPOINT
backup   685G   237K    27K  /backup
NAME     SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
backup   696G   671G  25.1G   96%  ONLINE  -

So zdb says 626GB is used, zfs list says 685GB is used, and zpool list says 671GB is used. The pool was filled to 100% capacity via dd (this is confirmed; I can't write data), and yet zpool list says it's only 96%. benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Lost Disk Space
No takers? :) benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zdb to dump data
Is there some hidden way to coax zdb into not just displaying data based on a given DVA but rather to dump it in raw usable form? I've got a pool with large amounts of corruption. Several directories are toast and I get "I/O Error" when trying to enter or read the directory... however I can read the directory and files using ZDB, if I could just dump it in a raw format I could do recovery that way. To be clear, I've already recovered from the situation, this is purely an academic "can I do it" exercise for the sake of learning. If ZDB can't do it, I'd assume I'd have to write some code to read based on DVA. Maybe I could write a little tool for it. benr. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
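[Editor's note] For what it's worth, later zdb builds do have a raw block-read mode that may cover this. A hedged sketch only: the flag letters vary by build and the DVA below is a placeholder, not a value from this pool:

# zdb -R <pool> <vdev>:<offset>:<size>[:flags]
zdb -R tank 0:36000:200:r > /tmp/block.raw    # 'r' requests the raw on-disk block on builds that support it

Whether the output comes back decompressed or as the literal on-disk bytes depends on which flags that build supports, so treat this as a starting point rather than a recipe.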
Re: [zfs-discuss] [Fwd: Re: [perf-discuss] ZFS performance issue - READ is slow as hell...]
Ya, I agree that we need some additional data and testing. The iostat data in itself doesn't suggest to me that the process (dd) is slow but rather that most of the data is being retrieved elsewhere (ARC). An fsstat would be useful to correlate with the iostat data. One thing that also comes to mind with streaming write performance is the effects of the write throttle... curious if he'd have gotten more on the write side with that disabled. All these things don't strike me particularly as bugs (although there is always improvement) but rather that ZFS is designed for real world environments, not antiquated benchmarks. benr. Jim Mauro wrote: > > Posting this back to zfs-discuss. > > Roland's test case (below) is a single threaded sequential write > followed by a single threaded sequential read. His bandwidth > goes from horrible (~2MB/sec) to expected (~30MB/sec) > when prefetch is disabled. This is with relatively recent nv bits > (nv110). > > Roland - I'm wondering if you were tripping over > CR6732803 ZFS prefetch creates performance issues for streaming > workloads. > It seems possible, but that CR is specific about multiple, concurrent > IO streams, > and your test case was only one. > > I think it's more likely you were tripping over > CR6412053 zfetch needs a whole lotta love. > > For both CR's the workaround is disabling prefetch > (echo "zfs_prefetch_disable/W 1" | mdb -kw) > > Any other theories on this test case? > > Thanks, > /jim > > > Original Message > Subject: Re: [perf-discuss] ZFS performance issue - READ is slow > as hell... > Date: Tue, 31 Mar 2009 02:33:00 -0700 (PDT) > From: roland > To: perf-disc...@opensolaris.org > > > > Hello Jim, > i double checked again - but it`s like i told: > > echo zfs_prefetch_disable/W0t1 | mdb -kw > fixes my problem. > > i did a reboot and only set this single param - which immediately > makes the read troughput go up from ~2 MB/s to ~30 MB/s > >> I don't understand why disabling ZFS prefetch solved this >> problem. The test case was a single threaded sequential write, followed >> by a single threaded sequential read. 
> > i did not even do a single write - after reboot i just did > dd if=/zfs/TESTFILE of=/dev/null > > Solaris Express Community Edition snv_110 X86 > FSC RX300 S2 > 4GB RAM > LSI Logic MegaRaid 320 Onboard SCSI Raid Controller > 1x Raid1 LUN > 1x Raid5 LUN (3 Disks) > (both LUN`s show same behaviour) > > > before: > extended device statistics > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 21.30.1 2717.60.1 0.7 0.0 31.81.7 2 4 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 16.00.0 2048.40.0 34.9 0.1 2181.84.8 100 3 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 28.00.0 3579.20.0 34.8 0.1 1246.24.9 100 5 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 45.00.0 5760.40.0 34.8 0.2 772.74.5 100 7 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 19.00.0 2431.90.0 34.9 0.1 1837.34.4 100 3 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > 58.00.0 7421.10.0 34.6 0.3 597.45.8 100 12 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 >0.00.00.00.0 35.0 0.00.00.0 100 0 c0t1d0 > > > after: > extended device statistics > r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 218.00.0 27842.30.0 0.0 0.40.11.8 1 40 c0t1d0 > 241.00.0 30848.00.0 0.0 0.40.01.6 0 38 c0t1d0 > 237.00.0 30340.10.0 0.0 0.40.01.6 0 38 c0t1d0 > 230.00.0 29434.70.0 0.0 0.40.01.8 0 40 c0t1d0 > 238.10.0 30471.30.0 0.0 0.40.01.5 0 37 c0t1d0 > 234.90.0
Re: [zfs-discuss] zfs / nfs issue (not performance :-) with courier-imap
Robert Milkowski wrote: CLSNL> but if I click, say E, it has F's contents, F has Gs contents, and no CLSNL> mail has D's contents that I can see. But the list in the mail CLSNL> client list view is correct. I don't belive it's a problem with nfs/zfs server. Please try with simple dtrace script to see (or even truss) what files your imapd actually opens when you click E - I don't belive it opens E and you get F contents, I would bet it opens F.

I completely agree with Robert. I'd personally suggest 'truss' to start, because it's trivial to use, and then move to DTrace to further hone down the problem. In the case of Courier-IMAP the best way to go about it would be to truss the parent (courierlogger, which calls courierlogin and ultimately imapd) using 'truss -f -p '. Then open the mailbox and watch the stat() and open() calls closely. I'll be very interested in your findings. We use Courier on NFS/ZFS heavily and I'm thankful to report having no such problems. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
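[Editor's note] Once truss has narrowed things down, a DTrace one-liner along these lines will show every path imapd opens (a sketch; adjust the execname to match the courier processes in question):

dtrace -n 'syscall::open*:entry /execname == "imapd"/ { printf("%d %s\n", pid, copyinstr(arg0)); }'

Comparing that list against what the client thinks it asked for should settle whether the wrong file is being opened or the right file has the wrong contents.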
[zfs-discuss] Read Only Zpool: ZFS and Replication
I've been playing with replication of a ZFS zpool using the recently released AVS. I'm pleased with things, but just replicating the data is only part of the problem. The big question is: can I have a zpool open in two places? What I really want is a zpool on node1 open and writable (production storage) and replicated to node2, where it's open for read-only access (standby storage). This is an old problem, and I'm not sure it's remotely possible. It's bad enough with UFS, but ZFS maintains a hell of a lot more metadata. How is node2 supposed to know that a snapshot has been created, for instance? With UFS you can at least get around some of these problems using directio, but that's not an option with a zpool. I know this is a fairly remedial issue to bring up... but if I think about what I want Thumper-to-Thumper replication to look like, I want two usable storage systems. As I see it now, the secondary storage (node2) is useless until you break replication and import the pool, do your thing, and then re-sync storage to re-enable replication. Am I missing something? I'm hoping there is an option I'm not aware of. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] snapdir visible recursively throughout a dataset
Is there an existing RFE for, what I'll wrongly call, "recursively visible snapshots"? That is, .zfs in directories other than the dataset root. Frankly, I don't need it available in all directories, although that would be nice, but I do have a need for making it visible one directory down from the dataset root. The problem is that while ZFS and Zones work smoothly together for moving, cloning, sizing, etc., you can't view .zfs/ from within the zone because the zone root is one directory down:

/zones                 <-- Dataset
/zones/myzone01        <-- Dataset, .zfs is located here.
/zones/myzone01/root   <-- Directory, want .zfs here!

The ultimate idea is to make ZFS snapdirs accessible from within the zone. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
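[Editor's note] For reference, the existing snapdir property only controls visibility at the dataset root, which is exactly why it doesn't help here; shown as a sketch:

zfs set snapdir=visible zones/myzone01    # exposes /zones/myzone01/.zfs, but nothing appears under /zones/myzone01/root

Hence the RFE: some way to surface .zfs (or a view of it) below the dataset root so the zone can reach it.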
Re: [zfs-discuss] Read Only Zpool: ZFS and Replication
Jim Dunham wrote: Robert, Hello Ben, Monday, February 5, 2007, 9:17:01 AM, you wrote: BR> I've been playing with replication of a ZFS Zpool using the BR> recently released AVS. I'm pleased with things, but just BR> replicating the data is only part of the problem. The big BR> question is: can I have a zpool open in 2 places? BR> What I really want is a Zpool on node1 open and writable BR> (production storage) and a replicated to node2 where its open for BR> read-only access (standby storage). BR> This is an old problem. I'm not sure its remotely possible. Its BR> bad enough with UFS, but ZFS maintains a hell of a lot more BR> meta-data. How is node2 supposed to know that a snapshot has been BR> created for instance. With UFS you can at least get by some of BR> these problems using directio, but thats not an option with a zpool. BR> I know this is a fairly remedial issue to bring up... but if I BR> think about what I want Thumper-to-Thumper replication to look BR> like, I want 2 usable storage systems. As I see it now the BR> secondary storage (node2) is useless untill you break replication BR> and import the pool, do your thing, and then re-sync storage to re-enable replication. BR> Am I missing something? I'm hoping there is an option I'm not aware of. You can't mount rw on one node and ro on another (not to mention that zfs doesn't offer you to import RO pools right now). You can mount the same file system like UFS in RO on both nodes but not ZFS (no ro import). One can not just mount a filesystem in RO mode if SNDR or any other host-based or controller-based replication is underneath. For all filesystems that I know of, expect of course shared-reader QFS, this will fail given time. Even if one has the means to mount a filesystem with DIRECTIO (no-caching), READ-ONLY (no-writes), it does not prevent a filesystem from looking at the contents of block "A" and then acting on block "B". The reason being is that during replication at time T1 both blocks "A" & "B" could be written and be consistent with each other. Next the file system reads block "A". Now replication at time T2 updates blocks "A" & "B", also consistent with each other. Next the file system reads block "B" and panics due to an inconsistency only it sees between old "A" and new "B". I know this for a fact, since a forced "zpool import -f ", is a common instance of this exact failure, due most likely checksum failures between metadata blocks "A" & "B". Ya, that bit me last night. 
'zpool import' shows the pool fine, but when you force the import you panic: Feb 5 07:14:10 uma ^Mpanic[cpu0]/thread=fe8001072c80: Feb 5 07:14:10 uma genunix: [ID 809409 kern.notice] ZFS: I/O failure (write on off 0: zio fe80c54ed380 [L0 unallocated] 400L/200P DVA[0]=<0:36000:200> DVA[1]=<0:9c0003800:200> DVA[2]=<0:20004e00:200> fletcher4 lzjb LE contiguous birth=57416 fill=0 cksum=de2e56ffd:5591b77b74b:1101a91d58dfc:252efdf22532d0): error 5 Feb 5 07:14:11 uma unix: [ID 10 kern.notice] Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a40 zfs:zio_done+140 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072a60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ab0 zfs:zio_wait_for_children+5d () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072ad0 zfs:zio_wait_children_done+20 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072af0 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b40 zfs:zio_vdev_io_assess+129 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072b60 zfs:zio_next_stage+68 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bb0 zfs:vdev_mirror_io_done+2af () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072bd0 zfs:zio_vdev_io_done+26 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c60 genunix:taskq_thread+1a7 () Feb 5 07:14:11 uma genunix: [ID 655072 kern.notice] fe8001072c70 unix:thread_start+8 () Feb 5 07:14:11 uma unix: [ID 10 kern.notice] So without using II, whats the best method of bring up the secondary storage? Is just dropping the primary into logging acceptable? benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snapdir visible recursively throughout a dataset
Robert Milkowski wrote: I haven't tried it but what if you mounted ro via loopback into a zone /zones/myzone01/root/.zfs is loop mounted in RO to /zones/myzone01/.zfs

That is so wrong. ;) Besides just being evil, I doubt it'd work. And if it does, it probably shouldn't. I think I'm the only one that gets a rash when using LOFI. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] snapdir visible recursively throughout a dataset
Darren J Moffat wrote: Ben Rockwood wrote: Robert Milkowski wrote: I haven't tried it but what if you mounted ro via loopback into a zone /zones/myzone01/root/.zfs is loop mounted in RO to /zones/myzone01/.zfs That is so wrong. ;) Besides just being evil, I doubt it'd work. And if it does, it probly shouldn't. I think I'm the only one that gets a rash when using LOFI. lofi or lofs ? lofi - Loopback file driver Makes a block device from a file lofs - loopback virtual file system Makes a file system from a file system Yes, I know. I was referring more so to loopback happy people in general. :) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Making 'zfs destroy' safer
Peter Schuller wrote: Hello, with the advent of clones and snapshots, one will of course start creating them. Which also means destroying them. Am I the only one who is *extremely* nervous about doing "zfs destroy some/[EMAIL PROTECTED]"? This goes bot manually and automatically in a script. I am very paranoid about this; especially because the @ sign might conceivably be incorrectly interpreted by some layer of scripting, being a non-alphanumeric character and highly atypical for filenames/paths. What about having dedicated commands "destroysnapshot", "destroyclone", or "remove" (less dangerous variant of "destroy") that will never do anything but remove snapshots or clones? Alternatively having something along the lines of "zfs destroy --nofs" or "zfs destroy --safe". I realize this is borderline being in the same territory as special casing "rm -rf /" and similar, which is generally not considered a good idea. But somehow the snapshot situation feels a lot more risky. This isn't the first time this subject has come up. You are definitely NOT alone. The problem is only compounded when doing recursive actions. The general request has been for a confirmation "Are you sure?" which could be over-ridden with a -f. The general response is "if run from a script everyone will use -f and defeat the purpose". The suggestions you've come up with above are very good ones. I think the addition of "destroysnap" or "destroyclone" are particularly good because they could be added without conflicting with or changing the existing interfaces. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
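[Editor's note] As an interim measure, a tiny wrapper gives the proposed "destroysnap" behaviour today. This is a hypothetical script, not anything shipped with ZFS:

#!/bin/sh
# destroysnap: only ever destroy snapshots; refuse anything else
case "$1" in
    *@*) exec zfs destroy "$1" ;;
    *)   echo "refusing to destroy '$1': not a snapshot" >&2; exit 1 ;;
esac

Scripts call the wrapper instead of zfs destroy directly, so a mangled variable can never take out a whole filesystem.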
Re: [zfs-discuss] New zfs pr0n server :)))
Diego Righi wrote: Hi all, I just built a new zfs server for home and, being a long time and avid reader of this forum, I'm going to post my config specs and my benchmarks hoping this could be of some help for others :) http://www.sickness.it/zfspr0nserver.jpg http://www.sickness.it/zfspr0nserver.txt http://www.sickness.it/zfspr0nserver.png http://www.sickness.it/zfspr0nserver.pdf Correct me if I'm wrong: from the benchmark results, I understand that this setup is slow at writing, but fast at reading (and this is perfect for my usage, copying large files once and then accessing only to read them). It also seems that at 128kb it gives the best performances, iirc due to the zfs stripe size (again, correct me if I'm wrong :). I'd happily try any other test, but if you suggest bonnie++ please tell me what's the right version to use, too much of them I really can't understand which to try! tnx :) Classy. +1 for style. ;) benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZVol Panic on 62
May 25 23:32:59 summer unix: [ID 836849 kern.notice] May 25 23:32:59 summer ^Mpanic[cpu1]/thread=1bf2e740: May 25 23:32:59 summer genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ff00232c3a80 addr=490 occurred in module "unix" due to a NULL pointer dereference May 25 23:32:59 summer unix: [ID 10 kern.notice] May 25 23:32:59 summer unix: [ID 839527 kern.notice] grep: May 25 23:32:59 summer unix: [ID 753105 kern.notice] #pf Page fault May 25 23:32:59 summer unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x490 May 25 23:32:59 summer unix: [ID 243837 kern.notice] pid=18425, pc=0xfb83b6bb, sp=0xff00232c3b78, eflags=0x10246 May 25 23:32:59 summer unix: [ID 211416 kern.notice] cr0: 8005003b cr4: 6f8 May 25 23:32:59 summer unix: [ID 354241 kern.notice] cr2: 490 cr3: 1fce52000 cr8: c May 25 23:32:59 summer unix: [ID 592667 kern.notice]rdi: 490 rsi:0 rdx: 1bf2e740 May 25 23:32:59 summer unix: [ID 592667 kern.notice]rcx:0 r8:d r9: 62ccc700 May 25 23:32:59 summer unix: [ID 592667 kern.notice]rax:0 rbx:0 rbp: ff00232c3bd0 May 25 23:32:59 summer unix: [ID 592667 kern.notice]r10: fc18 r11:0 r12: 490 May 25 23:32:59 summer unix: [ID 592667 kern.notice]r13: 450 r14: 52e3aac0 r15:0 May 25 23:32:59 summer unix: [ID 592667 kern.notice]fsb:0 gsb: fffec3731800 ds: 4b May 25 23:32:59 summer unix: [ID 592667 kern.notice] es: 4b fs:0 gs: 1c3 May 25 23:33:00 summer unix: [ID 592667 kern.notice]trp:e err:2 rip: fb83b6bb May 25 23:33:00 summer unix: [ID 592667 kern.notice] cs: 30 rfl:10246 rsp: ff00232c3b78 May 25 23:33:00 summer unix: [ID 266532 kern.notice] ss: 38 May 25 23:33:00 summer unix: [ID 10 kern.notice] May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3960 unix:die+c8 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3a70 unix:trap+135b () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3a80 unix:cmntrap+e9 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3bd0 unix:mutex_enter+b () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3c20 zfs:zvol_read+51 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3c50 genunix:cdev_read+3c () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3cd0 specfs:spec_read+276 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3d40 genunix:fop_read+3f () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3e90 genunix:read+288 () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3ec0 genunix:read32+1e () May 25 23:33:00 summer genunix: [ID 655072 kern.notice] ff00232c3f10 unix:brand_sys_syscall32+1a3 () May 25 23:33:00 summer unix: [ID 10 kern.notice] May 25 23:33:00 summer genunix: [ID 672855 kern.notice] syncing file systems... Does anyone have an idea of what bug this might be? Occurred on X86 B62. I'm not seeing any putbacks into 63 or bugs that seem to match. Any insight is appreciated. Core's are available. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS, iSCSI + Mac OS X Tiger (globalSAN iSCSI)
George wrote: > I have set up an iSCSI ZFS target that seems to connect properly from > the Microsoft Windows initiator in that I can see the volume in MMC > Disk Management. > > > When I shift over to Mac OS X Tiger with globalSAN iSCSI, I am able to > set up the Targets with the target name shown by `iscsitadm list > target` and when I actually connect or "Log On" I see that one > connection exists on the Solaris server. I then go on to the Sessions > tab in globalSAN and I see the session details and it appears that > data is being transferred via the PDUs Sent, PDUs Received, Bytes, > etc. HOWEVER the connection then appears to terminate on the Solaris > side if I check it a few minutes later it shows no connections, but > the Mac OS X initiator still shows connected although no more traffic > appears to be flowing in the Session Statistics dialog area. > > > Additionally, when I then disconnect the Mac OS X initiator it seems > to drop fine on the Mac OS X side, even though the Solaris side has > shown it gone for a while, however when I reconnect or Log On again, > it seems to spin infinitely on the "Target Connect..." dialog. > Solaris is, interestingly, showing 1 connection while this apparent > issue (spinning beachball of death) is going on with globalSAN. Even > killing the Mac OS X process doesn't seem to get me full control again > as I have to restart the system to kill all processes (unless I can > hunt them down and `kill -9` them which I've not successfully done > thus far). > > Has anyone dealt with this before and perhaps be able to assist or at > least throw some further information towards me to troubleshoot this?

When I learned of the globalSAN Initiator I was overcome with joy. After about two days of spending way too much time with it, I gave up. Have a look at their forum (http://www.snsforums.com/index.php?s=b0c9031ebe1a89a40cfe4c417e3443f1&showforum=14); there is a wide range of problems. In my case connections to the target (Solaris/ZFS/iscsitgt) look fine and dandy initially, but you can't actually use the connection, on reboot globalSAN goes psycho, and so on. At this point I've given up on the product, at least for now. If I could actually get an accessible disk at least part of the time I'd dig my fingers into it, but it doesn't offer a usable remote disk to begin with, and in a variety of other environments it has identical problems. I consider debugging it to be purely academic at this point. It's a great way to gain insight into the inner workings of iSCSI, but without source code or DTrace on the Mac it's hard to expect any big gains. That's my personal take. If you really want to go hacking on it regardless, bring it up on the Storage list and we can corporately enjoy the academic challenge of finding the problems, but there is nothing to suggest it's an OpenSolaris issue. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] When I stab myself with this knife, it hurts... But - should it kill me?
Dick Davies wrote: > On 04/10/2007, Nathan Kroenert <[EMAIL PROTECTED]> wrote: > > >> Client A >> - import pool make couple-o-changes >> >> Client B >> - import pool -f (heh) >> > > >> Oct 4 15:03:12 fozzie ^Mpanic[cpu0]/thread=ff0002b51c80: >> Oct 4 15:03:12 fozzie genunix: [ID 603766 kern.notice] assertion >> failed: dmu_read(os, smo->smo_object, offset, size, entry_map) == 0 (0x5 >> == 0x0) >> , file: ../../common/fs/zfs/space_map.c, line: 339 >> Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51160 >> genunix:assfail3+b9 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51200 >> zfs:space_map_load+2ef () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51240 >> zfs:metaslab_activate+66 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51300 >> zfs:metaslab_group_alloc+24e () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b513d0 >> zfs:metaslab_alloc_dva+192 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51470 >> zfs:metaslab_alloc+82 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514c0 >> zfs:zio_dva_allocate+68 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b514e0 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51510 >> zfs:zio_checksum_generate+6e () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51530 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515a0 >> zfs:zio_write_compress+239 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b515c0 >> zfs:zio_next_stage+b3 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51610 >> zfs:zio_wait_for_children+5d () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51630 >> zfs:zio_wait_children_ready+20 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51650 >> zfs:zio_next_stage_async+bb () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51670 >> zfs:zio_nowait+11 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51960 >> zfs:dbuf_sync_leaf+1ac () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b519a0 >> zfs:dbuf_sync_list+51 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a10 >> zfs:dnode_sync+23b () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51a50 >> zfs:dmu_objset_sync_dnodes+55 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51ad0 >> zfs:dmu_objset_sync+13d () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51b40 >> zfs:dsl_pool_sync+199 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51bd0 >> zfs:spa_sync+1c5 () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c60 >> zfs:txg_sync_thread+19a () >> Oct 4 15:03:12 fozzie genunix: [ID 655072 kern.notice] ff0002b51c70 >> unix:thread_start+8 () >> Oct 4 15:03:12 fozzie unix: [ID 10 kern.notice] >> > > >> Is this a known issue, already fixed in a later build, or should I bug it? >> > > It shouldn't panic the machine, no. I'd raise a bug. > > >> After spending a little time playing with iscsi, I have to say it's >> almost inevitable that someone is going to do this by accident and panic >> a big box for what I see as no good reason. (though I'm happy to be >> educated... 
;) >> > > You use ACLs and TPGT groups to ensure 2 hosts can't simultaneously > access the same LUN by accident. You'd have the same problem with > Fibre Channel SANs. > I ran into similar problems when replicating via AVS. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS for OSX - it'll be in there.
Dale Ghent wrote: > ...and eventually in a read-write capacity: > > http://www.macrumors.com/2007/10/04/apple-seeds-zfs-read-write- > developer-preview-1-1-for-leopard/ > > Apple has seeded version 1.1 of ZFS (Zettabyte File System) for Mac > OS X to Developers this week. The preview updates a previous build > released on June 26, 2007. > Y! Finally my USB Thumb Drives will work on my MacBook! :) I wonder if it'll automatically mount the Zpool on my iPod when I sync it. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] ZFS Quota Oddness
I've run across an odd issue with ZFS quotas. This is an snv_43 system with several zones/zfs datasets, but only one is affected. The dataset shows 10GB used and 12GB referenced, but counting the files finds only 6.7GB of data:

zones/ABC        10.8G  26.2G  12.0G  /zones/ABC
zones/ABC@now    14.7M      -  12.0G  -

[xxx:/zones/ABC/.zfs/snapshot/now] root# gdu --max-depth=1 -h .
43k     ./dev
6.7G    ./root
1.5k    ./lu
6.7G    .

I don't understand what might cause this disparity. This is an older box, snv_43. Any bugs that might apply, fixed or in progress? Thanks. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Panic on Zpool Import (Urgent)
Today, suddenly, without any apparent reason that I can find, I'm getting panics during zpool import. The system paniced earlier today and has been suffering since. This is snv_43 on a Thumper. Here's the stack:

panic[cpu0]/thread=99adbac0: assertion failed: ss != NULL, file: ../../common/fs/zfs/space_map.c, line: 145

fe8000a240a0 genunix:assfail+83 ()
fe8000a24130 zfs:space_map_remove+1d6 ()
fe8000a24180 zfs:space_map_claim+49 ()
fe8000a241e0 zfs:metaslab_claim_dva+130 ()
fe8000a24240 zfs:metaslab_claim+94 ()
fe8000a24270 zfs:zio_dva_claim+27 ()
fe8000a24290 zfs:zio_next_stage+6b ()
fe8000a242b0 zfs:zio_gang_pipeline+33 ()
fe8000a242d0 zfs:zio_next_stage+6b ()
fe8000a24320 zfs:zio_wait_for_children+67 ()
fe8000a24340 zfs:zio_wait_children_ready+22 ()
fe8000a24360 zfs:zio_next_stage_async+c9 ()
fe8000a243a0 zfs:zio_wait+33 ()
fe8000a243f0 zfs:zil_claim_log_block+69 ()
fe8000a24520 zfs:zil_parse+ec ()
fe8000a24570 zfs:zil_claim+9a ()
fe8000a24750 zfs:dmu_objset_find+2cc ()
fe8000a24930 zfs:dmu_objset_find+fc ()
fe8000a24b10 zfs:dmu_objset_find+fc ()
fe8000a24bb0 zfs:spa_load+67b ()
fe8000a24c20 zfs:spa_import+a0 ()
fe8000a24c60 zfs:zfs_ioc_pool_import+79 ()
fe8000a24ce0 zfs:zfsdev_ioctl+135 ()
fe8000a24d20 genunix:cdev_ioctl+55 ()
fe8000a24d60 specfs:spec_ioctl+99 ()
fe8000a24dc0 genunix:fop_ioctl+3b ()
fe8000a24ec0 genunix:ioctl+180 ()
fe8000a24f10 unix:sys_syscall32+101 ()

syncing file systems... done

This is almost identical to a post to this list over a year ago titled "ZFS Panic". There was follow-up on it, but the results didn't make it back to the list. I spent time doing a full sweep for any hardware failures, pulled 2 drives that I suspected as problematic but weren't flagged as such, etc, etc, etc. Nothing helps. Bill suggested a 'zpool import -o ro' on the other post, but that's not working either. I _can_ use 'zpool import' to see the pool, but I have to force the import. A simple 'zpool import' returns output in about a minute. 'zpool import -f poolname' takes almost exactly 10 minutes every single time, like it hits some timeout and then panics. I did notice that while the 'zpool import' is running 'iostat' is useless, it just hangs. I still want to believe this is some device misbehaving but I have no evidence to support that theory. Any and all suggestions are greatly appreciated. I've put around 8 hours into this so far and I'm getting absolutely nowhere. Thanks benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Removing An Errant Drive From Zpool
I made a really stupid mistake. While having trouble removing a hot spare marked as failed, I was trying several ways to put it back in a good state. One means I tried was 'zpool add pool c5t3d0'... but I forgot to use the proper syntax, "zpool add pool spare c5t3d0". Now I'm in a bind. I've got 4 large raidz2's and now this puny 500GB drive in the config:

...
  raidz2    ONLINE 0 0 0
    c5t7d0  ONLINE 0 0 0
    c5t2d0  ONLINE 0 0 0
    c7t7d0  ONLINE 0 0 0
    c6t7d0  ONLINE 0 0 0
    c1t7d0  ONLINE 0 0 0
    c0t7d0  ONLINE 0 0 0
    c4t3d0  ONLINE 0 0 0
    c7t3d0  ONLINE 0 0 0
    c6t3d0  ONLINE 0 0 0
    c1t3d0  ONLINE 0 0 0
    c0t3d0  ONLINE 0 0 0
  c5t3d0    ONLINE 0 0 0
spares
  c5t3d0    FAULTED corrupted data
  c4t7d0    AVAIL
...

Detach and Remove won't work. Does anyone know of a way to get that c5t3d0 out of the data configuration and back to hot-spare duty where it belongs? If I understand the layout properly, this should not have an adverse impact on my existing configuration, but if I can't dump it, what happens when that disk fills up? I can't believe I made such a bone-headed mistake. This is one of those times when an "Are you sure you...?" prompt would be helpful. :( benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing An Errant Drive From Zpool
Eric Schrock wrote: > There's really no way to recover from this, since we don't have device > removal. However, I'm surprised that no warning was given. There are at > least two things that should have happened: > > 1. zpool(1M) should have warned you that the redundancy level you were > attempting did not match that of your existing pool. This doesn't > apply if you already have a mixed level of redundancy. > > 2. zpool(1M) should have warned you that the device was in use as an > active spare and not let you continue. > > What bits were you running? > snv_78, however the pool was created on snv_43 and hasn't yet been upgraded. Though, programmatically, I can't see why there would be a difference in the way 'zpool' would handle the check. The big question is, if I'm stuck like this permanently, what's the potential risk? Could I potentially just fail that drive and leave it in a failed state? benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Removing An Errant Drive From Zpool
Robert Milkowski wrote: > If you can't re-create the pool (+ backup & restore your data) I would > recommend waiting for device removal in ZFS, and in the meantime I would > attach another drive to it so you've got a mirrored configuration, then > remove them once device removal is there. Since you're already > working on Nevada you could probably adopt new bits quickly. > > The only question is when device removal is going to be integrated - > last time someone mentioned it here it was supposed to be by the end > of last year... > Ya, I'm afraid you're right. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Panic on Zpool Import (Urgent)
The solution here was to upgrade to snv_78. By "upgrade" I mean re-jumpstart the system. I tested snv_67 via net-boot but the pool paniced just as below. I also attempted using zfs_recover without success. I then tested snv_78 via net-boot, used both "aok=1" and "zfs:zfs_recover=1", and was able to (slowly) import the pool. Following that test I exported and then did a full re-install of the box.

A very important note to anyone upgrading a Thumper! Don't forget about the NCQ bug. After upgrading to a release more recent than snv_60, add the following to /etc/system:

set sata:sata_max_queue_depth = 0x1

If you don't, life will be highly unpleasant and you'll believe that disks are failing everywhere when in fact they are not. benr.

Ben Rockwood wrote:
> Today, suddenly, without any apparent reason that I can find, I'm
> getting panics during zpool import. The system paniced earlier today
> and has been suffering since. This is snv_43 on a thumper. Here's the
> stack:
>
> panic[cpu0]/thread=99adbac0: assertion failed: ss != NULL, file:
> ../../common/fs/zfs/space_map.c, line: 145
>
> fe8000a240a0 genunix:assfail+83 ()
> fe8000a24130 zfs:space_map_remove+1d6 ()
> fe8000a24180 zfs:space_map_claim+49 ()
> fe8000a241e0 zfs:metaslab_claim_dva+130 ()
> fe8000a24240 zfs:metaslab_claim+94 ()
> fe8000a24270 zfs:zio_dva_claim+27 ()
> fe8000a24290 zfs:zio_next_stage+6b ()
> fe8000a242b0 zfs:zio_gang_pipeline+33 ()
> fe8000a242d0 zfs:zio_next_stage+6b ()
> fe8000a24320 zfs:zio_wait_for_children+67 ()
> fe8000a24340 zfs:zio_wait_children_ready+22 ()
> fe8000a24360 zfs:zio_next_stage_async+c9 ()
> fe8000a243a0 zfs:zio_wait+33 ()
> fe8000a243f0 zfs:zil_claim_log_block+69 ()
> fe8000a24520 zfs:zil_parse+ec ()
> fe8000a24570 zfs:zil_claim+9a ()
> fe8000a24750 zfs:dmu_objset_find+2cc ()
> fe8000a24930 zfs:dmu_objset_find+fc ()
> fe8000a24b10 zfs:dmu_objset_find+fc ()
> fe8000a24bb0 zfs:spa_load+67b ()
> fe8000a24c20 zfs:spa_import+a0 ()
> fe8000a24c60 zfs:zfs_ioc_pool_import+79 ()
> fe8000a24ce0 zfs:zfsdev_ioctl+135 ()
> fe8000a24d20 genunix:cdev_ioctl+55 ()
> fe8000a24d60 specfs:spec_ioctl+99 ()
> fe8000a24dc0 genunix:fop_ioctl+3b ()
> fe8000a24ec0 genunix:ioctl+180 ()
> fe8000a24f10 unix:sys_syscall32+101 ()
>
> syncing file systems... done
>
> This is almost identical to a post to this list over a year ago titled
> "ZFS Panic". There was follow-up on it but the results didn't make it
> back to the list.
>
> I spent time doing a full sweep for any hardware failures, pulled 2
> drives that I suspected as problematic but weren't flagged as such, etc,
> etc, etc. Nothing helps.
>
> Bill suggested a 'zpool import -o ro' on the other post, but that's not
> working either.
>
> I _can_ use 'zpool import' to see the pool, but I have to force the
> import. A simple 'zpool import' returns output in about a minute.
> 'zpool import -f poolname' takes almost exactly 10 minutes every single
> time, like it hits some timeout and then panics.
>
> I did notice that while the 'zpool import' is running 'iostat' is
> useless, just hangs. I still want to believe this is some device
> misbehaving but I have no evidence to support that theory.
>
> Any and all suggestions are greatly appreciated. I've put around 8
> hours into this so far and I'm getting absolutely nowhere.
>
> Thanks
>
> benr.
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
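Pulling the tunables mentioned in this thread together, the /etc/system entries would look roughly like the following. This is a sketch based only on the settings named above, not a general recommendation; aok and zfs_recover are recovery aids, and the SATA line applies to Thumpers on builds newer than snv_60:

    * /etc/system
    * continue past failed kernel ASSERTs (recovery use only)
    set aok = 1
    * let ZFS attempt to import a damaged pool (recovery use only)
    set zfs:zfs_recover = 1
    * Thumper NCQ workaround
    set sata:sata_max_queue_depth = 0x1

After rebooting, the SATA setting can be confirmed from the running kernel with mdb:

    echo "sata_max_queue_depth/X" | mdb -k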
[zfs-discuss] ZFS and ACL's over NFSv3
Can someone please clarify the ability to utilize ACLs over NFSv3 from a ZFS share? I can "getfacl" but I can't "setfacl". I can't find any documentation in this regard. My suspicion is that ZFS shares must be NFSv4 in order to utilize ACLs, but I'm hoping this isn't the case. Can anyone definitively speak to this? The closest related bug I can find is 6340720, which simply says "See comments." benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
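A quick way to compare the two protocol versions from a Solaris client is to mount the same share once with each version and try the ACL operations on both. A sketch only; the server name, share, mount points, and user are placeholders, and the POSIX-draft setfacl call is exactly the operation that appears not to work against ZFS over v3:

    # client side
    mount -F nfs -o vers=3 server:/export/share /mnt/v3
    mount -F nfs -o vers=4 server:/export/share /mnt/v4

    # NFSv3 clients only speak the POSIX-draft ACL interfaces
    setfacl -m user:fred:rwx /mnt/v3/somefile
    getfacl /mnt/v3/somefile

    # NFSv4 clients carry the richer NFSv4/ZFS ACL model
    chmod A+user:fred:read_data/write_data:allow /mnt/v4/somefile
    ls -v /mnt/v4/somefile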
[zfs-discuss] 40min ls in empty directory
I've run into an odd problem which I lovingly refer to as a "black hole directory". On a Thumper used for mail stores we've found finds take an exceptionally long time to run. There are directories that have as many as 400,000 files, which I immediately considered the culprit. However, under investigation, they aren't the problem at all. The problem is seen here in this truss output (first column is delta time):

0.0001 lstat64("tmp", 0x08046A20) = 0
0. openat(AT_FDCWD, "tmp", O_RDONLY|O_NDELAY|O_LARGEFILE) = 8
0.0001 fcntl(8, F_SETFD, 0x0001) = 0
0. fstat64(8, 0x08046920) = 0
0. fstat64(8, 0x08046AB0) = 0
0. fchdir(8) = 0
1321.3133 getdents64(8, 0xFEE48000, 8192) = 48
1255.8416 getdents64(8, 0xFEE48000, 8192) = 0
0.0001 fchdir(7) = 0
0.0001 close(8) = 0

These two getdents64 syscalls take approx 20 mins each. Notice that the directory structure is 48 bytes; the directory is empty:

drwx-- 2 102 1022 Feb 21 02:24 tmp

My assumption is that the directory is corrupt, but I'd like to prove that. I have a scrub running on the pool, but it's got about 16 hours to go before it completes. 20% complete thus far and nothing is reported. No errors are logged when I stimulate this problem. Does anyone have suggestions on how to get additional data on this issue? I've used DTrace flows to examine it; however, what I really want to see is the ZIOs issued as a result of the getdents, and I can't see how to do so. Ideally I'd quiet the system and watch all ZIOs occurring while I stimulate it, but this is production and not possible. If anyone knows how to watch DMU/ZIO activity that _only_ pertains to a certain PID please let me know. ;) Suggestions on how to pro-actively catch these sorts of instances are welcome, as are alternative explanations. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
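Not a complete answer to the per-PID question, but an fbt-based sketch of the kind of tracing that can attribute synchronously issued ZFS work to the triggering process. The function name is taken from the OpenSolaris ZFS source of that era and may differ by build, and I/O issued asynchronously by other kernel threads will not show up under the target PID:

    # attach to the ls/find process and aggregate kernel stacks leading to zio_create
    dtrace -p <pid> -n '
    fbt:zfs:zio_create:entry
    /pid == $target/
    {
            @[stack(8)] = count();
    }
    tick-30s { exit(0); }'

Running the same aggregation keyed on probefunc across fbt:zfs::entry (with the same pid predicate) is a heavier but broader way to see which DMU/ZIO routines the getdents is driving.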
[zfs-discuss] zvol Performance
Hello, I'm curious if anyone would mind sharing their experiences with zvols. I recently started using a zvol as an iSCSI backend and was surprised by the performance I was getting. Further testing revealed that it wasn't an iSCSI performance issue but a zvol issue. Testing on a SATA disk locally, I get these numbers (sequential write):

UFS: 38MB/s
ZFS: 38MB/s
Zvol UFS: 6MB/s
Zvol Raw: ~6MB/s

ZFS is nice and fast but zvol performance just drops off a cliff. Suggestions or observations by others using zvols would be extremely helpful. My current testing is being done using a debug build of B44 (NV 6/10/06). benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
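The tests above were plain sequential writes; something along these lines reproduces the comparison. The pool, dataset, and volume names are placeholders, and the block size and count are only an example, not the exact commands used in the post:

    # ZFS filesystem
    dd if=/dev/zero of=/testpool/fs/ddtest bs=128k count=8192

    # zvol, raw device
    zfs create -V 10g testpool/testvol
    dd if=/dev/zero of=/dev/zvol/rdsk/testpool/testvol bs=128k count=8192

    # zvol with UFS on top: newfs the zvol, mount it, and repeat the dd there
    newfs /dev/zvol/rdsk/testpool/testvol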
[zfs-discuss] Re: NFS Performance and Tar
I was really hoping for some option other than ZIL_DISABLE, but finally gave up the fight. Some people suggested NFSv4 helping over NFSv3, but it didn't... at least not enough to matter. ZIL_DISABLE was the solution, sadly. I'm running B43/X86 and hoping to get up to 48 or so soonish (I BFU'd it straight to B48 last night and brick'ed it). Here are the times. This is an untar (gtar xfj) of SIDEkick (http://www.cuddletech.com/blog/pivot/entry.php?id=491) on NFSv4 on a 20TB RAIDZ2 ZFS pool:

ZIL Enabled:  real 1m26.941s
ZIL Disabled: real 0m5.789s

I'll update this post again when I finally get B48 or newer on the system and try it. Thanks to everyone for their suggestions. benr. This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
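For reference, on builds of this era zil_disable is a kernel variable rather than a pool or dataset property, so it is normally set in /etc/system before the pool is imported. A sketch of the usual procedure; disabling the ZIL gives up synchronous-write guarantees, so it is a diagnostic/benchmark lever rather than a fix:

    * /etc/system
    set zfs:zil_disable = 1

    # check the current value on a live system
    echo "zil_disable/D" | mdb -k

    # flip it live (historically only takes effect for datasets mounted afterwards)
    echo "zil_disable/W 1" | mdb -kw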
[zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
I've got a Thumper doing nothing but serving NFS. It's using B43 with zil_disabled. The system is being consumed in waves, but by what I don't know. Notice vmstat:

 3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
 0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
 0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low. Nothing to justify this. We're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 TCP shared via ZFS sharenfs on a 3Gbps aggregation (3*1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome! benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect. Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed "dns" was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks, but again, no improvement. In this process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF. Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006

10:38:05  %usr  %sys  %wio  %idle
10:38:06     0    27     0     73
10:38:07     0    27     0     73
10:38:09     0    27     0     73
10:38:10     1    26     0     73
10:38:11     0    26     0     74
10:38:12     0    26     0     74
10:38:13     0    24     0     76
10:38:14     0     6     0     94
10:38:15     0     7     0     93
10:38:22     0    99     0      1  <--
10:38:23     0    94     0      6  <--
10:38:24     0    28     0     72
10:38:25     0    27     0     73
10:38:26     0    27     0     73
10:38:27     0    27     0     73
10:38:28     0    27     0     73
10:38:29     1    30     0     69
10:38:30     0    27     0     73

And so we consider whether or not there is a pattern to the frequency.
The following is sar output from any lines in which sys is above 90%:

10:40:04  %usr  %sys  %wio  %idle   Delta
10:40:11     0    97     0      3
10:40:45     0    98     0      2   34 seconds
10:41:02     0    94     0      6   17 seconds
10:41:26     0   100     0      0   24 seconds
10:42:00     0   100     0      0   34 seconds
10:42:25  (end of sample)            >25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43  runq-sz  %runocc  swpq-sz  %swpocc
10:45:51     27.0       85      0.0        0
10:45:57      1.0       20      0.0        0
10:46:02      2.0       60      0.0        0
10:46:13     19.8       99      0.0        0
10:46:23     17.7       99      0.0        0
10:46:34     24.4       99      0.0        0
10:46:41     22.1       97      0.0        0
10:46:48     13.0       96      0.0        0
10:46:55     25.3      102      0.0        0

Looking at the per-CPU breakdown:

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 00 00 324 224000 1540 00 100 0 0
 10 00 1140 2260 10 130860 1 0 99
 20 00 162 138 1490540 00 1 0 99
 30 00556 460430 00 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
 00 00 310 210 340 17 1717 50 100 0 0
 10 00 1521 2000 17 265591 65 0 34
 20 00 271 197 1751 13 202 00 66 0 34
 30 00 12
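When a box is pinned in %sys like this, sampling kernel stacks with the DTrace profile provider is usually the fastest way to see where the time is going. A minimal, generic sketch; nothing here is specific to this Thumper:

    # sample on-CPU kernel stacks at 997Hz for 30 seconds
    dtrace -n '
    profile-997
    /arg0/
    {
            @[stack()] = count();
    }
    tick-30s { exit(0); }'

The predicate on arg0 (the kernel program counter) restricts samples to time spent in the kernel, and the most frequent stacks printed at exit point at the code responsible for the sys-time waves.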
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
eric kustarz wrote: So i'm guessing there's lots of files being created over NFS in one particular dataset? We should figure out how many creates/second you are doing over NFS (i should have put a timeout on the script). Here's a real simple one (from your snoop it looked like you're only doing NFSv3, so i'm not tracking NFSv4):

"
#!/usr/sbin/dtrace -s

rfs3_create:entry,
zfs_create:entry
{
        @creates[probefunc] = count();
}

tick-60s { exit(0); }
"

Eric, I love you. Running this bit of DTrace revealed more than 4,000 files being created in almost any given 60-second window. And I've only got one system that would fit that sort of mass file creation: our Joyent Connector product's Courier IMAP server, which uses Maildir. As a test I simply shut down Courier and unmounted the mail NFS share for good measure, and sure enough the problem vanished and could not be reproduced. 10 minutes later I re-enabled Courier and our problem came back. Clearly ZFS file creation is just amazingly heavy even with the ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores, we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected. You asked for the zpool and zfs info, which I don't want to share because it's confidential (if you want it privately I'll do so, but not on a public list), but I will say that it's a single massive zpool in which we're using less than 2% of the capacity. But in thinking about this problem, even if we used 2 or more pools, the CPU consumption still would have choked the system, right? This leaves me really nervous about what we'll do when it's not an internal mail server that's creating all those files but a customer. Oddly enough, this might be a very good reason to use iSCSI instead of NFS on the Thumper. Eric, I owe you a couple cases of beer for sure. I can't tell you how much I appreciate your help. Thanks to everyone else who chimed in with ideas and suggestions, all of you guys are the best! benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Spencer Shepler wrote: Good to hear that you have figured out what is happening, Ben. For future reference, there are two commands that you may want to make use of in observing the behavior of the NFS server and individual filesystems. There is the trusty nfsstat command. In this case, you would have been able to do something like:

nfsstat -s -v3 60

This will provide all of the server-side NFSv3 statistics on 60-second intervals. Then there is a new command, fsstat, that will provide vnode-level activity on a per-filesystem basis. Therefore, if the NFS server has multiple filesystems active and you want to look at just one, something like this can be helpful:

fsstat /export/foo 60

fsstat has a 'full' option that will list all of the vnode operations or just certain types. It also will watch a filesystem type (e.g. zfs, nfs). Very useful. nfsstat I've been using, but fsstat I was unaware of. Wish I'd used it rather than duplicating most of its functionality with a D script. :) Thanks for the tip. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Bill Moore wrote: On Fri, Dec 08, 2006 at 12:15:27AM -0800, Ben Rockwood wrote: Clearly ZFS file creation is just amazingly heavy even with ZIL disabled. If creating 4,000 files in a minute squashes 4 2.6GHz Opteron cores we're in big trouble in the longer term. In the meantime I'm going to find a new home for our IMAP mail so that the other things served from that NFS server at least aren't affected. For local tests, this is not true of ZFS. It seems that file creation only swamps us when coming over NFS. We can do thousands of files a second on a Thumper with room to spare if NFS isn't involved. Next step is to figure out why NFS kills us. Agreed. If mass file creation were a problem locally I'd think we'd have people beating down the doors with complaints. One thought I had as a workaround was to move all my mail on NFS to an iSCSI LUN and then put a zpool on that. I'm willing to bet that'd work fine. Hopefully I can try it. To round out the discussion, the root cause of this whole mess was Courier IMAP locking. After isolating the problem last night and writing a little D script to find out what files were being created, it was obviously lock files; turning off locking dropped file creations to a reasonable level and our problem vanished. If I can help at all with testing or analysis please let me know. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
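The kind of one-liner that identifies which names are being created looks roughly like this. It leans on zfs_create()'s second argument being the new file's name in the ZFS source of this era, so treat it as a sketch of the approach rather than the exact script that was used:

    # count file creations by name for one minute (run on the NFS server)
    dtrace -n '
    fbt:zfs:zfs_create:entry
    {
            @names[stringof(arg1)] = count();
    }
    tick-60s { exit(0); }'

Sorting the resulting aggregation makes any lock-file churn, or other pathological create pattern, stand out immediately.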
Re: [nfs-discuss] Re: [zfs-discuss] A Plea for Help: Thumper/ZFS/NFS/B43
Robert Milkowski wrote: Hello eric, Saturday, December 9, 2006, 7:07:49 PM, you wrote: ek> Jim Mauro wrote: Could be NFS synchronous semantics on file create (followed by repeated flushing of the write cache). What kind of storage are you using (feel free to send privately if you need to) - is it a thumper? It's not clear why NFS-enforced synchronous semantics would induce different behavior than the same load to a local ZFS. ek> Actually i forgot he had 'zil_disable' turned on, so it won't matter in ek> this case. Ben, are you sure zil_disable was set to 1 BEFORE the pool was imported? Yes, absolutely. Set the variable in /etc/system, reboot, system comes up. That happened almost 2 months ago, long before this lock insanity problem popped up. To be clear, the ZIL issue was a problem for the creation of a handful of files of any size. Untar'ing a file was a massive performance drain. This issue, on the other hand, deals with thousands of little files being created all the time (IMAP locks). These are separate issues from my point of view. With ZIL slowness NFS performance was just slow, but we didn't see massive CPU usage; with this issue, on the other hand, we were seeing waves in 10-second-ish cycles where the run queue would go sky high with 0 idle. Please see the earlier mails for examples of the symptoms. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS works in waves
Stuart Glenn wrote: A little back story: I have a Norco DS-1220, a 12-bay SATA box; it is connected via eSATA (SiI3124) on PCI-X. Two drives are straight connections, then the other two ports go to 5x multipliers within the box. My needs/hopes for this were to use 12 500GB drives and ZFS to make a very large & simple data dump spot on my network for other servers to rsync to daily, use zfs snapshots for some quick backup, and if things worked out start saving up towards getting a thumper someday. The trouble is it is too slow to really be useable. At times it is fast enough to be useable, ~13MB/s write. However, this lasts for only a few minutes. It then just stalls doing nothing. iostat shows 100% blocking for one of the drives in the pool. I can, however, use dd to read or write directly to/from the disks all at the same time with good speed (~30MB/s according to dd). The test pools I have had are either 2 raidz of 6 drives or 3 raidz of 4 drives. The system is using an Athlon 64 3500+ & 1GB of RAM. Any suggestions on what I could do to make this useable? More RAM? Too many drives for ZFS? Any tests to find the real slowdown? I would really like to use ZFS & Solaris for this. Linux was able to use the same hardware using some beta kernel modules for the SATA multipliers & its software RAID at an acceptable speed, but I would like to finally rid my network of linux boxen. I have similar issues on my home workstation. They started happening when I put Seagate SATA-II drives with NCQ on a SI3124. I do not believe this to be an issue with ZFS. I've largely dismissed the issue as hardware caused, although I may be wrong. This system has had several problems with SATA-II drives, which hardware forums suggest are issues with the nForce4 chipset and SATA-II. Anyway, you're not alone, but it's not a ZFS issue. It's possible a tunable parameter in the SATA drivers would help. If I find an answer I'll let you know. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS
Andrew Summers wrote: > So, I've read the wikipedia, and have done a lot of research on google about > it, but it just doesn't make sense to me. Correct me if I'm wrong, but you > can take a simple 5/10/20 GB drive or whatever size, and turn it into > exabytes of storage space? > > If that is not true, please explain the importance of this other than the > self heal and those other features. > I'm probably to blame for the image of endless storage. With ZFS Sparse Volumes (aka: Thin Provisioning) you can make a 1G drive _look_ like a 500TB drive, but of course it isn't. See my entry on the topic here: http://www.cuddletech.com/blog/pivot/entry.php?id=729 With ZFS compression you can, however, potentially store 10GB of data on a 5GB drive. It really depends on what type of data you're storing and how compressible it is, but I've seen almost 2:1 compression in some cases by simply turning compression on. benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
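To make the distinction concrete, here is roughly what the two features look like from the command line (pool and dataset names are placeholders):

    # a sparse ("thin provisioned") volume: the 500TB is a promise, not real space
    zfs create -s -V 500T tank/bigvol

    # compression: actually fits more data into the pool, depending on the data
    zfs set compression=on tank/data
    zfs get compressratio tank/data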
Re: [zfs-discuss] ZFS over NFS extra slow?
Brad Plecs wrote: I had a user report extreme slowness on a ZFS filesystem mounted over NFS over the weekend. After some extensive testing, the extreme slowness appears to only occur when a ZFS filesystem is mounted over NFS. One example is doing a 'gtar xzvf php-5.2.0.tar.gz'... over NFS onto a ZFS filesystem. This takes:

real    5m12.423s
user    0m0.936s
sys     0m4.760s

Locally on the server (to the same ZFS filesystem) it takes:

real    0m4.415s
user    0m1.884s
sys     0m3.395s

The same job over NFS to a UFS filesystem takes:

real    1m22.725s
user    0m0.901s
sys     0m4.479s

Same job locally on the server to the same UFS filesystem:

real    0m10.150s
user    0m2.121s
sys     0m4.953s

This is easily reproducible even with single large files, but the multiple small files seem to illustrate some awful sync latency between each file. Any idea why ZFS over NFS is so bad? I saw the threads that talk about an fsync penalty, but they don't seem relevant since the local ZFS performance is quite good. Known issue, discussed here: http://www.opensolaris.org/jive/thread.jspa?threadID=14696&tstart=15 benr. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
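One quick way to see the synchronous-commit difference for yourself is to count ZIL commits on the server during the NFS run versus the local run. A sketch using fbt; the function name is from the OpenSolaris ZFS source and may vary slightly by build:

    # run once during the NFS untar and once during the local untar, then compare
    dtrace -n '
    fbt:zfs:zil_commit:entry
    {
            @commits = count();
    }
    tick-60s { exit(0); }'

Over NFS, creates and commits have to be stable on disk before the server can reply, so the commit count (and the per-file latency it implies) should come out dramatically higher than for the same untar run locally.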