Ben Rockwood wrote:
> I've got a Thumper doing nothing but serving NFS.  It's using B43 with
> zil_disabled.  The system is being consumed in waves, but by what I
> don't know.  Notice vmstat:

We made several performance fixes in the NFS/ZFS area in recent builds, so if possible it would be great to upgrade you from snv_43. That said, there might be something else going on that we haven't accounted for.
...


Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

In our performance testing, we haven't found checksums to be a significant consumer of CPU, so I would recommend leaving them on (given their benefits).
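
To turn it back on:

# zfs set checksum=on joyous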

I suspect your apps/clients don't depend on atime, so I think it's a good idea to turn that off. We've seen better NFS performance with it off.

This is more of a heads-up, as it sounds like compression on/off isn't your problem. If you are not getting good I/O bandwidth with compression turned on, it's most likely due to:
6460622 zio_nowait() doesn't live up to its name

As Jim mentioned, using lockstat to figure out where your CPU is being spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep 60', which profiles the kernel for one minute and reports the top 60 callers. If 'mutex_enter' is high up in the results, then we have at least some mutex lock contention.


...


Interestingly, using prstat -mL to monitor thread latency, we see that a
handful of threads are the culprits consuming most of the CPU:

[private:/tmp] root# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 416   1   0   0 nfsd/1506
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 415   0   0   0 nfsd/1563
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 417   0   0   0 nfsd/1554
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 419   0   0   0 nfsd/1551
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0  26  74 418   0   0   0 nfsd/1553
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1536
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1555
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1539
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1562
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1545
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1559
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419   1   0   0 nfsd/1541
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1546
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1543
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86

The high SYS times charged to the userland nfsd threads are representative of what the kernel threads are doing (which is most likely NFS and ZFS work).
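
If you want to see exactly which kernel code is burning that time, a quick DTrace profiling sketch (roughly the same data 'lockstat -kgIW' collects) is:

# dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
    tick-60s { trunc(@, 20); exit(0); }'

That samples kernel stacks at 997Hz for a minute and prints the 20 hottest ones.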


Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I 
get an idea of what ZFS is doing:

[private:/] root# dtrace -s  zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END ZFS COUNT

  zfs_getsecattr                                                    4
  zfs_space                                                        70
  zfs_rename                                                      111
  zfs_readdir                                                     284
  zfs_read                                                        367
  zfs_mkdir                                                       670
  zfs_setattr                                                    1054
  zfs_frlock                                                     1562
  zfs_putpage                                                    3916
  zfs_write                                                      4110
  zfs_create                                                     4178
  zfs_remove                                                     7794
  zfs_fid                                                       14960
  zfs_inactive                                                  17637
  zfs_access                                                    20809
  zfs_fsync                                                     29668
  zfs_lookup                                                    31273
  zfs_getattr                                                  175457



ZFS AVG TIME

  zfs_fsync                                                      2337
  zfs_getattr                                                    2735
  zfs_access                                                     2774
  zfs_fid                                                        2948
  zfs_inactive                                                   4081
  zfs_lookup                                                     6067
  zfs_frlock                                                     6750
  zfs_putpage                                                    6840
  zfs_mkdir                                                      9259
  zfs_getsecattr                                                10940
  zfs_setattr                                                   15364
  zfs_space                                                     25214
  zfs_readdir                                                   25726
  zfs_remove                                                    26117
  zfs_write                                                     31942
  zfs_rename                                                    60504
  zfs_read                                                     144449
  zfs_create                                                 71215587

An average of 71 msecs per create seems high. Doing a quantize instead of just an average on zfs_create() would be interesting (I should have included that in my script)...
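
For the record, a minimal sketch of that quantize (assuming fbt can see zfs_create() on your build) would be:

#!/usr/sbin/dtrace -s

/* sketch: latency distribution of zfs_create() instead of just an average */
fbt:zfs:zfs_create:entry
{
        self->ts = timestamp;
}

fbt:zfs:zfs_create:return
/self->ts/
{
        @["zfs_create latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}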

...


Locks...

[private:/tmp] root# lockstat -P sleep 10

Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
   32 100% 100% 0.00 59963092 0xffffffffb282a1c8     dmu_object_alloc+0x75

That's interesting. This is build snv_43, right? The offset doesn't exactly line up with current bits, but I would guess this is the "mutex_enter(&osi->os_obj_lock);" in dmu_object_alloc().
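
One way to check that guess (a sketch, assuming fbt sees dmu_object_alloc() on snv_43) is to quantize dmu_object_alloc() itself; if that lock is the problem, most of its time should be spent spinning there:

# dtrace -n 'fbt:zfs:dmu_object_alloc:entry { self->ts = timestamp; }
    fbt:zfs:dmu_object_alloc:return /self->ts/
    { @["dmu_object_alloc (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'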

Can you show us the output for a longer sampling period, such as 'lockstat -P -D 30 sleep 60'? The "Adaptive mutex spin" category is usually the most/only interesting one in my experience.

I would be curious to know your memory usage. What is the physical memory on the machine?

And can you run:
# echo 'arc::print -d' | mdb -k
# echo '::kmastat' | mdb -k | sort -n -k 5 | tail -40

Matt has a nice little D script that figures out how long txgs are taking (I'll attach it). I believe it will work with old bits (I defined "OLD" for that to work). Can you try running it?

Can you run 'zpool list' and 'zpool status'? I'd like to know how full your pool is.

Let's start with that info,
eric
#!/usr/sbin/dtrace -qCs

#pragma D option bufsize=32k
#pragma D option aggsize=32k

#define ZIO_TYPE_WRITE 2
#define ZIO_TYPE_READ 1
#define B_TRUE 1
#define B_FALSE 0
/* only for older builds */
#define OLD 1

long long ios[int, int];
long long bytes[int, int];
long long txg_wait_time;
long long sync_time;
long long syncing_txg;

/*
 * header printing
 */

BEGIN
{
        bytes[0, 0] = 0;
        printed = 0;
        sync_time = timestamp;
        printf("   txg     time     sync-write     async-write       sync-read  
   async-read\n");
}

tick-1s
/printed >= 20/
{
        printf("   txg     time     sync-write     async-write       sync-read  
   async-read\n");
        printed = 0;

}

/*
 * txg quiesce monitoring
 */


#ifdef QUIESCE
txg_quiesce:entry
{
        quiesce_start = timestamp;
        quiescing_txg = arg1;
}

txg_quiesce:return
/quiescing_txg != 0 && timestamp - quiesce_start > 1000000/
{
        printf("%7d %5dms. quiescing\n",
            quiescing_txg, (timestamp - quiesce_start)/1000000);
        printed++;
        quiescing_txg = 0;
}
#endif

/*
 * txg train stall monitoring
 */

txg_wait_open:entry
{
        self->txg_wait_start = timestamp;
}

txg_wait_open:return
/self->txg_wait_start != 0 && (timestamp - self->txg_wait_start)/1000000 > 0/
{
        @txgwait["average time (milliseconds) waiting for txg to open"] =
            avg((timestamp - self->txg_wait_start)/1000000);
        @txgthr["number of threads that waited for txg to open"] =
            count();
        self->txg_wait_start = 0;
        txg_wait_time++;
}

tick-1s
/txg_wait_time > 0/
{
        printa(@txgwait);
        clear(@txgwait);
        printa(@txgthr);
        clear(@txgthr);
        txg_wait_time = 0;
}

/*
 * io stats for (and between) spa_syncs
 */

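/*
 * A zio is counted as synchronous if it has priority 0 and is not an
 * aggregation zio created by the vdev queue; everything else counts
 * as asynchronous.
 */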
vdev_disk_io_done:entry
{
        this->sync = (args[0]->io_priority == 0) &&
            args[0]->io_done != (void*)&zfs`vdev_queue_agg_io_done;
        ios[args[0]->io_type, this->sync]++;
        bytes[args[0]->io_type, this->sync] += args[0]->io_size;
#if 0
        printf("type=%u pri=%u done=%a(%p) agg=%a(%p) sync=%u\n",
            args[0]->io_type, args[0]->io_priority,
            args[0]->io_done, args[0]->io_done,
            &zfs`vdev_queue_agg_io_done,
            &zfs`vdev_queue_agg_io_done,
            this->sync);
#endif
}

spa_sync:entry,spa_sync:return
/(timestamp - sync_time)/1000000 != 0/
{
        this->ms = (timestamp - sync_time)/1000000;

        printf("%7d %5dms. %4d/s %3dMB/s; %4d/s %3dMB/s; %4d/s %3dMB/s; %4d/s 
%3dMB/s\n",
            syncing_txg, this->ms,
            ios[ZIO_TYPE_WRITE, B_TRUE] * 1000 / this->ms,
            ((bytes[ZIO_TYPE_WRITE, B_TRUE] * 1000) >> 20) / this->ms,
            ios[ZIO_TYPE_WRITE, B_FALSE] * 1000 / this->ms,
            ((bytes[ZIO_TYPE_WRITE, B_FALSE] * 1000) >> 20) / this->ms,
            ios[ZIO_TYPE_READ, B_TRUE] * 1000 / this->ms,
            ((bytes[ZIO_TYPE_READ, B_TRUE] * 1000) >> 20) / this->ms,
            ios[ZIO_TYPE_READ, B_FALSE] * 1000 / this->ms,
            ((bytes[ZIO_TYPE_READ, B_FALSE] * 1000) >> 20) / this->ms);
        printed++;

        ios[ZIO_TYPE_WRITE, B_TRUE] = 0;
        bytes[ZIO_TYPE_WRITE, B_TRUE] = 0;
        ios[ZIO_TYPE_WRITE, B_FALSE] = 0;
        bytes[ZIO_TYPE_WRITE, B_FALSE] = 0;
        ios[ZIO_TYPE_READ, B_TRUE] = 0;
        bytes[ZIO_TYPE_READ, B_TRUE] = 0;
        ios[ZIO_TYPE_READ, B_FALSE] = 0;
        bytes[ZIO_TYPE_READ, B_FALSE] = 0;
}

spa_sync:entry
{
        sync_time = timestamp;
        syncing_txg = arg1;
}

spa_sync:return
{
        sync_time = timestamp;
        syncing_txg = 0;
}

/*
 * Per-level sync monitoring
 */

#ifndef LVLMS
#define LVLMS 100
#endif

#ifdef OLD

dmu_objset_sync_dnodes:entry
{
        self->os = args[0];
        self->txg = args[2]->tx_txg;
        self->level = 0;
}

dmu_objset_sync_dnodes:return
/self->os/
{
        self->os = 0;
}

zio_wait:entry
/self->os && self->wait_begin == 0/
{
        self->wait_begin = timestamp;
        self->wait_ios = args[0]->io_children_notdone;
        self->wait_zio = args[0];
}

zio_wait:return
/self->wait_begin && (timestamp - self->wait_begin) / 1000000 >= LVLMS/
{
        printf("%7d %5dms. %4d ios for objset \"%s\"\n",
            self->txg, (timestamp - self->wait_begin) / 1000000,
            self->wait_ios,
            self->os->os_dsl_dataset ?
            self->os->os_dsl_dataset->ds_dir->dd_myname : "mos");
        /* stack(20); */
}

zio_wait:return
/self->wait_begin/
{
        self->wait_begin = 0;
}

#else

dmu_objset_sync_dnodes:wait-begin
{
        self->wait_begin = timestamp;
        self->wait_ios = ((zio_t*)arg0)->io_children_notdone;
}

dmu_objset_sync_dnodes:wait-end
/self->wait_begin && (timestamp - self->wait_begin) / 1000000 >= LVLMS/
{
        this->ds = ((objset_impl_t*)arg2)->os_dsl_dataset;
        printf("%7d %5dms. %4d ios for lvl %d objset \"%s\"\n",
            arg1, (timestamp - self->wait_begin) / 1000000,
            self->wait_ios, arg3,
            this->ds ? this->ds->ds_dir->dd_myname : "mos");
        printed++;
}

#endif
