Jürgen Keil writes:
> > ZFS 11.0 on Solaris release 06/06 hangs the system when
> > trying to copy files from my VXFS 4.1 file system.
> > Any ideas what this problem could be?
>
> What kind of system is that? How much memory is installed?
>
> I'm able to hang an Ultra 60 with 256 MByte of main memory,
> simply by writing big files to a ZFS filesystem. The problem
> happens with both Solaris 10 6/2006 and Solaris Express snv_48.
>
> In my case there seems to be a problem with ZFS' ARC cache,
> which is not returning memory to the kernel when free memory
> gets low. Instead, ZFS' ARC cache data structures keep growing
> until the machine runs out of kernel memory. At this point
> the machine hangs, lots of kernel threads are waiting for free memory,
> and the box must be power cycled (well, unplugging and re-connecting
> the type 5 keyboard works and gets me to the OBP, where I can force
> a system crashdump and reboot).
Seems like:
  6429205 each zpool needs to monitor it's throughput and throttle heavy writers
  (also fixes: 6415647 Sequential writing is jumping)
I'm not sure.
My problem on the Ultra 60 with 256 MByte of main memory is this:
I'm trying to set up an "amanda" backup server on that Ultra 60; it
receives backups from other amanda client systems over the network
(these can be big files, up to 150 GBytes) and writes the received data
to a zpool / zfs on a big 300GB USB HDD.
When ~ 25-30 GBytes of data has been written to the zfs filesystem, the
machine gets slower and slower and starts paging like crazy
("pi" high, "sr" high):
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr f0 s0 s1 s3   in   sy   cs us sy id
 0 0 0 1653608 38552 17 13 172 38 63 0 1318 0 9 0 1 1177 1269 871 2 10 88
 0 0 0 1568624 17544 3 179 1310 0 0 0 0 0 125 0 11 11978 3486 1699 11 89 0
 0 0 0 1568624 6704 0 174 1538 210 241 0 9217 0 118 0 22 11218 3388 1698 11 89 0
 1 0 0 1568624 1568 167 367 1359 2699 7987 0 185763 0 131 0 78 7395 2373 1477 7 76 18
 0 0 0 1568624 5288 119 282 1321 2811 4089 0 183360 0 129 0 47 3161 743 4001 3 59 38
 0 0 0 1568624 15688 41 247 1586 655 648 0 0 0 131 0 40 1637 97 8684 1 31 68
 1 0 0 1568624 24968 18 214 1600 16 16 0 0 0 129 0 28 1473 75 8685 1 30 69
 3 0 0 1574232 29032 30 226 1718 0 0 0 0 0 126 0 20 6978 3232 2671 7 60 33
 1 0 0 1568624 18248 40 314 2354 0 0 0 0 0 119 0 68 11085 4156 2125 12 87 1
 1 0 0 1568624 7520 43 299 2162 950 1426 0 25595 0 125 0 45 11452 3434 1888 10 90 0
 0 0 0 1568624 3384 201 360 2135 2897 8097 0 195148 0 105 0 58 9129 2956 1650 8 92 0
 2 0 0 1568624 9656 66 241 1791 2036 2655 0 138976 0 157 0 32 2016 243 6349 1 51 48
 0 0 0 1568624 18496 42 289 2900 131 131 0 0 0 150 0 51 1706 104 8640 2 32 66
 2 0 0 1572416 27688 77 324 1723 46 46 0 0 0 89 0 54 2440 572 6214 3 35 62
 0 0 0 1570872 24112 19 203 1506 0 0 0 0 0 110 0 27 12193 4103 2013 11 89 0
 3 0 0 1568624 13760 12 250 2269 0 0 0 0 0 102 0 58 6804 2772 1713 8 51 41
 2 0 0 1568624 7464 67 283 1779 1336 5188 0 44171 0 98 0 69 9749 3071 1889 9 81 11
In the mdb ::kmastat output (columns: cache name, buf size, bufs in use,
total bufs, memory in use, allocations succeeded, allocations failed), I
see that a huge number of arc_buf_hdr_t entries are allocated. The number
of arc_buf_hdr_t entries keeps growing, until the kernel runs out of
memory and the machine hangs.
...
zio_buf_512 512 76 150 81920 39230 0
zio_buf_1024 1024 6 16 16384 32602 0
zio_buf_1536 1536 0 5 8192 608 0
zio_buf_2048 2048 0 4 8192 2246 0
zio_buf_2560 2560 0 3 8192 777 0
zio_buf_3072 3072 0 8 24576 913 0
zio_buf_3584 3584 0 9 32768 26041 0
zio_buf_4096 4096 3 4 16384 4563 0
zio_buf_5120 5120 0 8 40960 229 0
zio_buf_6144 6144 0 4 24576 71 0
zio_buf_7168 7168 0 8 57344 24 0
zio_buf_8192 8192 0 2 16384 808 0
zio_buf_10240 10240 0 4 40960 1083 0
zio_buf_12288 12288 0 2 24576 1145 0
zio_buf_14336 14336 0 4 57344 55396 0
zio_buf_16384 16384 18 19 311296 13957 0
zio_buf_20480 20480 0 2 40960 878 0
zio_buf_24576 24576 0 2 49152 69 0
zio_buf_28672 28672 0 4 114688 104 0
zio_buf_32768 32768 0 2 65536 310 0
zio_buf_40960 40960 0 2 81920 152 0
zio_buf_49152 49152 0 2 98304 215 0
zio_buf_57344 57344 0 2 114688 335 0
zio_buf_65536 65536 0 2 131072 742 0
zio_buf_73728 73728 0 2 147456 433 0
zio_buf_81920 81920 0 2 163840 412 0
zio_buf_90112 90112 0 2 180224 634 0
zio_buf_98304 98304 0 2 196608 1190 0
zio_buf_106496 106496 0 2 212992 1502 0
zio_buf_114688 114688 0 2 229376 277544 0
zio_buf_122880 122880 0 2 245760 2456 0
zio_buf_131072 131072 357 359 47054848 3046795 0
dmu_buf_impl_t 328 454 912 311296 2557744 0
dnode_t 648 91 144 98304 454 0
arc_buf_hdr_t 128 535823 535878 69681152 2358050 0 <<<<<<
arc_buf_t 40 382 812 32768 2370019 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 22 126 24576 241 0
...
At the same time, I see that the ARC target cache size "arc.c" has
been reduced to the minimum allowed size: "arc.c == arc.c_min"
# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs sd ip sctp usba s1394
fcp fctl qlc ssd nca zfs random lofs nfs audiosup logindmux ptm md cpc fcip
sppp crypto ipc ]
arc::print
{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x2dec200
p = 0x9c58c8
c = 0x4000000 <<<<<<<<<<<<<<<<<<<<<<<
c_min = 0x4000000 <<<<<<<<<<<<<<<<<<<<<<<
c_max = 0xb80c800
hits = 0x650d
misses = 0xf2
deleted = 0x3884e
skipped = 0
hash_elements = 0x7e83b
hash_elements_max = 0x7e83b
hash_collisions = 0xa56ef
hash_chains = 0x1000
Segmentation fault (core dumped)   <<< That's another, unrelated mdb problem in S10 6/2006
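(For scale: hash_elements = 0x7e83b is about 518,000 cached headers; at
128 bytes per arc_buf_hdr_t that is roughly 63 MByte of header structures
alone -- more than the reported arc.size = 0x2dec200 (about 46 MByte) and
almost as much as arc.c_min = 0x4000000 (64 MByte), which suggests the
ghost list headers are not charged against arc.size.)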
Monitoring ::kmastat while writing data to the zfs filesystem gives me
something like this (note how the zio_buf_131072 cache grows and shrinks,
but the arc_buf_hdr_t cache keeps growing all the time):
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 587 610 79953920 329670 0
dmu_buf_impl_t 328 731 984 335872 296319 0
dnode_t 648 103 168 114688 1469 0
arc_buf_hdr_t 128 11170 11214 1458176 272535 0
arc_buf_t 40 642 1015 40960 272560 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 644 738 96731136 410864 0
dmu_buf_impl_t 328 735 984 335872 364413 0
dnode_t 648 73 168 114688 1472 0
arc_buf_hdr_t 128 31673 31689 4120576 334357 0
arc_buf_t 40 677 1015 40960 335571 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 566 671 87949312 466149 0
dmu_buf_impl_t 328 668 984 335872 410679 0
dnode_t 648 74 168 114688 1483 0
arc_buf_hdr_t 128 45589 45612 5931008 376351 0
arc_buf_t 40 609 1015 40960 378393 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 536 714 93585408 479158 0
dmu_buf_impl_t 328 694 984 335872 421623 0
dnode_t 648 74 168 114688 1483 0
arc_buf_hdr_t 128 48878 48888 6356992 386276 0
arc_buf_t 40 635 1015 40960 388509 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 738 740 96993280 530363 0
dmu_buf_impl_t 328 831 984 335872 464574 0
dnode_t 648 74 168 114688 1483 0
arc_buf_hdr_t 128 61789 61803 8036352 425230 0
arc_buf_t 40 771 1015 40960 428244 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 585 609 79822848 551540 0
dmu_buf_impl_t 328 697 984 335872 482245 0
dnode_t 648 74 168 114688 1483 0
arc_buf_hdr_t 128 67101 67158 8732672 441277 0
arc_buf_t 40 638 1015 40960 444617 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 714 716 93847552 750232 0
dmu_buf_impl_t 328 805 984 335872 648794 0
dnode_t 648 74 168 114688 1485 0
arc_buf_hdr_t 128 117201 117243 15245312 592453 0
arc_buf_t 40 746 1015 40960 598751 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 703 705 92405760 824503 0
dmu_buf_impl_t 328 795 984 335872 711001 0
dnode_t 648 74 168 114688 1485 0
arc_buf_hdr_t 128 135923 135954 17678336 648924 0
arc_buf_t 40 735 1015 40960 656282 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 671 672 88080384 870370 0
dmu_buf_impl_t 328 828 984 335872 749488 0
dnode_t 648 74 168 114688 1485 0
arc_buf_hdr_t 128 147515 147546 19185664 683917 0
arc_buf_t 40 777 1015 40960 692025 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
....
zio_buf_106496 106496 0 0 0 0 0
zio_buf_114688 114688 0 2 229376 56 0
zio_buf_122880 122880 0 0 0 0 0
zio_buf_131072 131072 676 677 88735744 1002908 0
dmu_buf_impl_t 328 774 984 335872 860504 0
dnode_t 648 73 168 114688 1488 0
arc_buf_hdr_t 128 180874 180936 23527424 784588 0
arc_buf_t 40 714 1015 40960 794600 0
zil_lwb_cache 208 0 0 0 0 0
zfs_znode_cache 192 5 42 8192 10 0
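(Across these samples the arc_buf_hdr_t cache grows from about 11,000
buffers / 1.4 MByte to about 180,000 buffers / 22 MByte, while
zio_buf_131072 stays in the range of roughly 540-740 buffers.)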
It seems the problem is that we keep adding arc.mru_ghost / arc.mfu_ghost
list entries while writing data to zfs, but when the ARC cache is running
at its minimum size, there's no one checking the ghost list sizes any more;
no one is calling arc_evict_ghost() to clean up the arc ghost lists.
arc_evict_ghost() is called from arc_adjust():
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/arc.c#arc_adjust
And arc_adjust() is called from arc_kmem_reclaim()...
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/arc.c#arc_kmem_reclaim
... but only when "arc.c > arc.c_min":
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/fs/zfs/arc.c#1170
1170 if (arc.c <= arc.c_min)
1171 return;
So once arc.c <= arc.c_min there are no more arc_adjust() calls, and
therefore no more arc_evict_ghost() calls.
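To make the effect easier to see, here is a minimal user-space sketch of
that control flow (a toy model only, not the real kernel code; the sizes
and the reclaim frequency are made up, and arc_adjust() is reduced to
"trim the ghost lists back to the target size"):

/*
 * Toy user-space model of the control flow described above -- NOT the
 * real arc.c code.  It only models two things: (1) arc_adjust() trims
 * the ghost lists back towards the target size c, and (2) the stock
 * arc_kmem_reclaim() returns before calling arc_adjust() once
 * c <= c_min.  All sizes and the reclaim frequency are made up.
 */
#include <stdio.h>
#include <stdint.h>

#define C_MIN           (64ULL << 20)   /* like arc.c_min = 0x4000000 */
#define HDR_SIZE        128ULL          /* sizeof (arc_buf_hdr_t) */
#define RECLAIM_SHIFT   5               /* like arc_kmem_reclaim_shift */

static uint64_t c = 184ULL << 20;       /* target size, roughly c_max */
static uint64_t ghost_bytes;            /* mru_ghost + mfu_ghost headers */

static void
model_arc_adjust(void)
{
        /* ghost lists are trimmed so they don't exceed the target size */
        if (ghost_bytes > c)
                ghost_bytes = c;
}

static void
model_arc_kmem_reclaim(int patched)
{
        if (c <= C_MIN && !patched)
                return;                         /* stock: no arc_adjust() */
        if (c > C_MIN) {
                c -= c >> RECLAIM_SHIFT;        /* shrink the target size */
                if (c < C_MIN)
                        c = C_MIN;
        }
        model_arc_adjust();
}

int
main(void)
{
        int i;

        /*
         * Pretend that between two low-memory reclaim callbacks the
         * workload evicts ~100000 buffers, each leaving a ghost header.
         */
        for (i = 1; i <= 60; i++) {
                ghost_bytes += 100000 * HDR_SIZE;
                model_arc_kmem_reclaim(0);      /* 0 = stock, 1 = patched */
                if (i % 10 == 0)
                        printf("reclaim %2d: c = %3llu MB, ghost headers = %3llu MB\n",
                            i, (unsigned long long)(c >> 20),
                            (unsigned long long)(ghost_bytes >> 20));
        }
        return (0);
}

With the stock behaviour (argument 0) the ghost header total keeps
climbing once c has been clamped to c_min; with the patched behaviour
(argument 1) it stays bounded. Of course this is only a cartoon of the
mechanism.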
===============================================================================
I'm currently experimenting with an arc_kmem_reclaim() changed like this,
which seems to work fine so far; no more hangs:
void
arc_kmem_reclaim(void)
{
        uint64_t to_free;

        /*
         * We need arc_reclaim_lock because we don't want multiple
         * threads trying to reclaim concurrently.
         */

        /*
         * umem calls the reclaim func when we destroy the buf cache,
         * which is after we do arc_fini().  So we set a flag to prevent
         * accessing the destroyed mutexes and lists.
         */
        if (arc_dead)
                return;

        mutex_enter(&arc_reclaim_lock);

        if (arc.c > arc.c_min) {
#ifdef _KERNEL
                to_free = MAX(arc.c >> arc_kmem_reclaim_shift, ptob(needfree));
#else
                to_free = arc.c >> arc_kmem_reclaim_shift;
#endif
                if (arc.c > to_free)
                        atomic_add_64(&arc.c, -to_free);
                else
                        arc.c = arc.c_min;

                atomic_add_64(&arc.p, -(arc.p >> arc_kmem_reclaim_shift));
                if (arc.c > arc.size)
                        arc.c = arc.size;
                if (arc.c < arc.c_min)
                        arc.c = arc.c_min;
                if (arc.p > arc.c)
                        arc.p = (arc.c >> 1);
                ASSERT((int64_t)arc.p >= 0);
        }

        arc_adjust();

        mutex_exit(&arc_reclaim_lock);
}
==============================================================================
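(The essential difference to the stock arc_kmem_reclaim() is that
arc_adjust() is now called unconditionally, outside the "arc.c > arc.c_min"
check, so the ghost lists still get trimmed while the cache is already
running at its minimum size; the target size arithmetic itself is still
only done when arc.c > arc.c_min.)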
I'm able to reproduce the issue with a test program like the one included
below. Run it with the current directory on a zfs filesystem, on a
machine with only 256 MByte of main memory.
/*
 * gcc `getconf LFS_CFLAGS` fill.c -o fill
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int
main(int argc, char **argv)
{
        char buf[32*1024];
        int i, n;
        int fd;

        fd = open("/dev/random", O_RDONLY);
        if (fd < 0) {
                perror("/dev/random");
                memset(buf, '*', sizeof(buf));
        } else {
                for (n = 0; n < sizeof(buf); n += i) {
                        i = read(fd, buf+n, sizeof(buf)-n);
                        if (i < 0) {
                                perror("read random data");
                                exit(1);
                        }
                        if (i == 0) {
                                fprintf(stderr, "EOF reading random data\n");
                                exit(1);
                        }
                }
                close(fd);
        }

        fd = creat("junk", 0666);
        if (fd < 0) {
                perror("create junk file");
                exit(1);
        }
        for (;;) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                        perror("write data");
                        break;
                }
        }
        close(fd);
        exit(0);
}
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss