Ben Rockwood wrote:
> I've got a Thumper doing nothing but serving NFS. It's using B43 with
> zil_disabled. The system is being consumed in waves, but by what I
> don't know. Notice vmstat:
We made several performance fixes in the NFS/ZFS area in recent builds,
so if possible it would be great to upgrade you from snv_43. That said,
there might be something else going on that we haven't accounted for.
...
Step 1 was to disable any ZFS features that might consume large amounts of CPU:
# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous
In our performance testing, we haven't found checksums to be anywhere
near a large consumer of CPU, so I would recommend leaving that on (due
to its benefits).
I suspect your apps/clients don't depend on atime, so I think it's a good
idea to turn that off. We've gotten better NFS performance with it off.
More of a heads-up, as it sounds like compression on/off isn't your
problem: if you are not getting good I/O bandwidth with compression
turned on, it's most likely due to:
6460622 zio_nowait() doesn't live up to its name
As Jim mentioned, using lockstat to figure out where your CPU is being
spent is the first step. I've been using 'lockstat -kgIW -D 60 sleep
60'. That collects data on the top 60 callers over a one-minute period.
If you see 'mutex_enter' high up in the results, then we have at least
mutex lock contention.
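If mutex_enter does show up, one follow-up (a sketch of mine, not from the
original thread) is to attribute the spinning to kernel stacks via DTrace's
lockstat provider; note that the meaning of arg1 on adaptive-spin (spin
count vs. spin time) has varied across builds, so treat the values as
relative:

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: sum adaptive-mutex spin (arg1; units vary by build) per
 * kernel stack, then print the top 10 offenders after 60 seconds.
 */
lockstat:::adaptive-spin
{
	@spins[stack()] = sum(arg1);
}

tick-60s
{
	trunc(@spins, 10);
	exit(0);
}
```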
...
Interestingly, using prstat -mL to monitor thread latency, we see that a
handful of threads are the culprits consuming most of the CPU:
[private:/tmp] root# prstat -mL
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 1 0 0 nfsd/1506
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 nfsd/1563
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 nfsd/1554
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 nfsd/1551
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 nfsd/1553
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1536
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1555
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1539
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1562
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1545
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1559
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 nfsd/1541
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1546
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1543
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86
The high SYS times charged to the userland nfsd threads are
representative of what the kernel threads are doing (which is most
likely going to be NFS and ZFS).
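Those in-kernel stacks can also be sampled directly; a minimal profiling
sketch (my addition, assuming the stock profile provider):

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: sample kernel stacks at 997Hz (a prime rate, to avoid
 * lockstep with periodic kernel activity); /arg0/ restricts samples
 * to those taken in kernel context. Keep the top 20 stacks.
 */
profile-997
/arg0/
{
	@ks[stack()] = count();
}

tick-30s
{
	trunc(@ks, 20);
	exit(0);
}
```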
Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I
get an idea of what ZFS is doing:
[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU ID FUNCTION:NAME
1 2 :END
ZFS COUNT
zfs_getsecattr 4
zfs_space 70
zfs_rename 111
zfs_readdir 284
zfs_read 367
zfs_mkdir 670
zfs_setattr 1054
zfs_frlock 1562
zfs_putpage 3916
zfs_write 4110
zfs_create 4178
zfs_remove 7794
zfs_fid 14960
zfs_inactive 17637
zfs_access 20809
zfs_fsync 29668
zfs_lookup 31273
zfs_getattr 175457
ZFS AVG TIME
zfs_fsync 2337
zfs_getattr 2735
zfs_access 2774
zfs_fid 2948
zfs_inactive 4081
zfs_lookup 6067
zfs_frlock 6750
zfs_putpage 6840
zfs_mkdir 9259
zfs_getsecattr 10940
zfs_setattr 15364
zfs_space 25214
zfs_readdir 25726
zfs_remove 26117
zfs_write 31942
zfs_rename 60504
zfs_read 144449
zfs_create 71215587
An average of 71 msecs per create (the times above are in nanoseconds)
seems high. Doing a quantize instead of just an average on zfs_create()
would be interesting (I should have included that in my script)...
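A minimal sketch of that quantize, using fbt on zfs_create() (mirroring
what zvop_times.d presumably does; the probe names are my assumption
against snv_43-era bits):

```d
#!/usr/sbin/dtrace -s
/*
 * Sketch: latency distribution for zfs_create(), in nanoseconds.
 */
fbt::zfs_create:entry
{
	self->ts = timestamp;
}

fbt::zfs_create:return
/self->ts/
{
	@lat["zfs_create latency (ns)"] = quantize(timestamp - self->ts);
	self->ts = 0;
}
```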
...
Locks...
[private:/tmp] root# lockstat -P sleep 10
Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
32 100% 100% 0.00 59963092 0xffffffffb282a1c8 dmu_object_alloc+0x75
That's interesting. This is build snv_43, right? It doesn't exactly line
up with current bits, but I would guess this is
"mutex_enter(&osi->os_obj_lock);".
Can you show us the output for a longer sampling period, such as
'lockstat -P -D 30 sleep 60'? The "Adaptive mutex spin" category is
usually the most/only interesting one in my experience.
I would be curious to know your memory usage. What is the physical
memory on the machine?
And can you run:
# echo 'arc::print -d' | mdb -k
# echo '::kmastat' | mdb -k | sort -n -k 5 | tail -40
Matt has a nice little D script that figures out how long txgs are taking
(I'll attach it). I believe it will work with old bits (I defined
"OLD" for that to work). Can you try running it?
Can you run 'zpool list' and 'zpool status'? I'd like to know how full
your pool is.
Let's start with that info,
eric
#!/usr/sbin/dtrace -qCs
#pragma D option bufsize=32k
#pragma D option aggsize=32k
#define ZIO_TYPE_WRITE 2
#define ZIO_TYPE_READ 1
#define B_TRUE 1
#define B_FALSE 0
/* only for older builds */
#define OLD 1
long long ios[int, int];
long long bytes[int, int];
long long txg_wait_time;
long long sync_time;
long long syncing_txg;
/*
* header printing
*/
BEGIN
{
bytes[0, 0] = 0;
printed = 0;
sync_time = timestamp;
printf(" txg time sync-write async-write sync-read async-read\n");
}
tick-1s
/printed >= 20/
{
printf(" txg time sync-write async-write sync-read async-read\n");
printed = 0;
}
/*
* txg quiesce monitoring
*/
#ifdef QUIESCE
txg_quiesce:entry
{
quiesce_start = timestamp;
quiescing_txg = arg1;
}
txg_quiesce:return
/quiescing_txg != 0 && timestamp - quiesce_start > 1000000/
{
printf("%7d %5dms. quiescing\n",
quiescing_txg, (timestamp - quiesce_start)/1000000);
printed++;
quiescing_txg = 0;
}
#endif
/*
* txg train stall monitoring
*/
txg_wait_open:entry
{
self->txg_wait_start = timestamp;
}
txg_wait_open:return
/self->txg_wait_start != 0 && (timestamp - self->txg_wait_start)/1000000 > 0/
{
@txgwait["average time (milliseconds) waiting for txg to open"] =
avg((timestamp - self->txg_wait_start)/1000000);
@txgthr["number of threads that waited for txg to open"] =
count();
self->txg_wait_start = 0;
txg_wait_time++;
}
tick-1s
/txg_wait_time > 0/
{
printa(@txgwait);
clear(@txgwait);
printa(@txgthr);
clear(@txgthr);
txg_wait_time = 0;
}
/*
* io stats for (and between) spa_syncs
*/
vdev_disk_io_done:entry
{
this->sync = (args[0]->io_priority == 0) &&
args[0]->io_done != (void*)&zfs`vdev_queue_agg_io_done;
ios[args[0]->io_type, this->sync]++;
bytes[args[0]->io_type, this->sync] += args[0]->io_size;
#if 0
printf("type=%u pri=%u done=%a(%p) agg=%a(%p) sync=%u\n",
args[0]->io_type, args[0]->io_priority,
args[0]->io_done, args[0]->io_done,
&zfs`vdev_queue_agg_io_done,
&zfs`vdev_queue_agg_io_done,
this->sync);
#endif
}
spa_sync:entry,spa_sync:return
/(timestamp - sync_time)/1000000 != 0/
{
this->ms = (timestamp - sync_time)/1000000;
printf("%7d %5dms. %4d/s %3dMB/s; %4d/s %3dMB/s; %4d/s %3dMB/s; %4d/s %3dMB/s\n",
syncing_txg, this->ms,
ios[ZIO_TYPE_WRITE, B_TRUE] * 1000 / this->ms,
((bytes[ZIO_TYPE_WRITE, B_TRUE] * 1000) >> 20) / this->ms,
ios[ZIO_TYPE_WRITE, B_FALSE] * 1000 / this->ms,
((bytes[ZIO_TYPE_WRITE, B_FALSE] * 1000) >> 20) / this->ms,
ios[ZIO_TYPE_READ, B_TRUE] * 1000 / this->ms,
((bytes[ZIO_TYPE_READ, B_TRUE] * 1000) >> 20) / this->ms,
ios[ZIO_TYPE_READ, B_FALSE] * 1000 / this->ms,
((bytes[ZIO_TYPE_READ, B_FALSE] * 1000) >> 20) / this->ms);
printed++;
ios[ZIO_TYPE_WRITE, B_TRUE] = 0;
bytes[ZIO_TYPE_WRITE, B_TRUE] = 0;
ios[ZIO_TYPE_WRITE, B_FALSE] = 0;
bytes[ZIO_TYPE_WRITE, B_FALSE] = 0;
ios[ZIO_TYPE_READ, B_TRUE] = 0;
bytes[ZIO_TYPE_READ, B_TRUE] = 0;
ios[ZIO_TYPE_READ, B_FALSE] = 0;
bytes[ZIO_TYPE_READ, B_FALSE] = 0;
}
spa_sync:entry
{
sync_time = timestamp;
syncing_txg = arg1;
}
spa_sync:return
{
sync_time = timestamp;
syncing_txg = 0;
}
/*
* Per-level sync monitoring
*/
#ifndef LVLMS
#define LVLMS 100
#endif
#ifdef OLD
dmu_objset_sync_dnodes:entry
{
self->os = args[0];
self->txg = args[2]->tx_txg;
self->level = 0;
}
dmu_objset_sync_dnodes:return
/self->os/
{
self->os = 0;
}
zio_wait:entry
/self->os && self->wait_begin == 0/
{
self->wait_begin = timestamp;
self->wait_ios = args[0]->io_children_notdone;
self->wait_zio = args[0];
}
zio_wait:return
/self->wait_begin && (timestamp - self->wait_begin) / 1000000 >= LVLMS/
{
printf("%7d %5dms. %4d ios for objset \"%s\"\n",
self->txg, (timestamp - self->wait_begin) / 1000000,
self->wait_ios,
self->os->os_dsl_dataset ?
self->os->os_dsl_dataset->ds_dir->dd_myname : "mos");
/* stack(20); */
}
zio_wait:return
/self->wait_begin/
{
self->wait_begin = 0;
}
#else
dmu_objset_sync_dnodes:wait-begin
{
self->wait_begin = timestamp;
self->wait_ios = ((zio_t*)arg0)->io_children_notdone;
}
dmu_objset_sync_dnodes:wait-end
/self->wait_begin && (timestamp - self->wait_begin) / 1000000 >= LVLMS/
{
this->ds = ((objset_impl_t*)arg2)->os_dsl_dataset;
printf("%7d %5dms. %4d ios for lvl %d objset \"%s\"\n",
arg1, (timestamp - self->wait_begin) / 1000000,
self->wait_ios, arg3,
this->ds ? this->ds->ds_dir->dd_myname : "mos");
printed++;
}
#endif
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss