On Jun 10, 2011 11:52 AM, "Jim Klimov" <jimkli...@cos.ru> wrote:
>
> 2011-06-10 18:00, Steve Gonczi wrote:
>>
>> Hi Jim,
>>
>> I wonder what OS version you are running?
>>
>> There was a problem similar to what you are describing in earlier
>> versions in the 13x kernel series.
>>
>> Should not be present in the 14x kernels.
>
> It is OpenIndiana oi_148a, and unlike many other details -
> this one was in my email post today ;)
>
>> I missed the system parameters in your earlier emails.
>
> Other config info: one dual-core P4 @ 2.8GHz or so, 8GB RAM
> (the maximum for the motherboard), 6*2TB Seagate ST2000DL003 disks
> in raidz2, plus an old 80GB disk for the OS and swap.
>
> This system turned overnight from a test box into a backup of some of
> my old-but-needed files, and then into their only storage after the
> original server was cleaned. So I do want this box to work reliably
> and not lose the data which is already on it, and without dedup at
> 1.5x I'm running close to not fitting into these 8TB ;)
>
>> The usual recommended solution is
>> "oh just do not use dedup, it is not production ready"
>
> Well, alas, I am getting to the same conclusion so far...
> I have never seen any remotely similar issues on any other servers
> I maintain, but this one is the first guinea-pig box for dedup.
> I guess for the next few years (until 128GB RAM or so becomes the
> norm for a cheap home-enthusiast NAS) this one may be the last, too ;)
>
>> Using Dedup is more or less hopeless with less than 16Gigs of memory.
>> (More is better)
>
> Well... 8GB RAM is not as low-end a configuration as some others
> discussed in home-NAS-with-ZFS blogs which claimed to have used dedup
> when it first came out. :)
>
> Since there's little real information about DDT appetites so far
> (this list is still buzzing with calculations and tests), I assumed
> that 8GB RAM is a reasonable amount for starters. At least it would
> have been for, say, Linux or other open-source enthusiast communities,
> which have to make do with whatever crappy hardware they got
> second-hand ;) I was in that camp, and I know that personal budgets
> do not often allow more than $1-2k per box. Until recently that
> bought very little RAM. And as I said, this is the maximum which can
> be put into my motherboard anyway.
>
> Anyhow, this box has two pools at the moment:
> * "pool" is the physical raidz2 on 6 disks with ashift=12.
>   Some datasets on "pool" are deduped and compressed (lzjb).
> * "dcpool" is built in a compressed volume inside "pool", which is
>   loopback-mounted over iSCSI; in the resulting disk I made another
>   pool with deduped datasets.
>
> This "dcpool" was used to test the idea of separating compression and
> deduplication (so that dedup decisions are made about raw source data,
> and after that, whatever has to be written is compressed - once).
>
> The POC worked somewhat - except that I used small block sizes in the
> "pool/dcpool" volume, so the ZFS metadata needed to address the volume
> blocks takes up as much space as the user data.
>
> Performance was abysmal - around 1MB/sec to write into "dcpool"
> lately, and not much faster to read it, so for the past month I have
> been trying to evacuate my data back from "dcpool" into "pool", which
> performs faster - about 5-10MB/s during these copies between pools.
> According to iostat, the physical hard disks are quite busy (over 60%)
> while "dcpool" is often stuck at 100% busy with several seconds(!) of
> wait times and zero IO operations.
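For what it's worth, rather than guessing at the DDT's appetite you can ask
zdb directly. A rough sketch -- the ~320 bytes of ARC per entry is only the
rule of thumb that usually gets quoted on this list, so treat the arithmetic
as an estimate, not a guarantee:

    # Dedup table statistics (unique entries, on-disk and in-core sizes):
    zdb -DD pool

    # Back-of-the-envelope RAM estimate, assuming ~320 bytes of ARC per
    # unique DDT entry (rule-of-thumb figure only):
    #   entries * 320 / 1024^3  ~=  GiB of RAM wanted for the DDT alone

If the result comes anywhere near your 8GB, that by itself would explain a
lot of the behaviour below.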
> The physical "pool" datasets without dedup performed at "wirespeed"
> 40-50MB/s when I was copying files over CIFS from another computer
> over a home gigabit LAN. Since this box is primarily archive storage
> with maybe a few datasets dedicated to somewhat active data (updates
> to a photo archive), slow but infrequent IO to deduped data is okay
> with me -- as long as the system doesn't crash as it does now.
>
> Partially, this is why I bumped the TXG sync interval up to 30
> seconds - so that ZFS would have more data in its buffers after slow
> IO to coalesce, and could try to minimize fragmentation and the
> mechanical IOPS involved.
>
>> The issue was an incorrectly sized buffer that caused ZFS to wait
>> too long for a buffer allocation. I can dig up the bug number and
>> the fix description if you are running something 130-ish.
>
> Unless this fix was not integrated into OI for some reason, I am
> afraid digging it up would be of limited help. Still, I would be
> interested to read the summary, postmortem and workarounds.
>
> Maybe this was broken again in OI by newer "improvements"? ;)
>
>> The thing I would want to check is the sync times and frequencies.
>> You can dtrace (and timestamp) this.
>
> Umm... could you please suggest a script, preferably one that I can
> leave running on the console, printing stats every second or so?
>
> I can only think of "time sync" so far. And I also think it could
> be counted in a few seconds or more.
>
>> I would suspect when the bad state occurs, your sync is taking a
>> _very_ long time.
>
> When the VERY BAD state occurs, I can no longer use the system or
> test anything ;)
>
> When it nearly occurs, I have only a few seconds of uptime left, and
> since each run from boot to crash takes roughly 2-3 hours now, I am
> unlikely to be active at the console during these critical few
> seconds.
>
> And the "sync" would likely never return in this case, too.
>
>> Deletes, and dataset / snapshot deletes, are not managed correctly
>> in a deduped environment in ZFS. This is a known problem, although
>> it should not be anywhere nearly as bad as what you are describing
>> in the current tip.
>
> Well, it is, on not-quite-lowest-end hardware (at least in terms of
> what OpenSolaris developers can expect from the general enthusiast
> community which is supposed to help by testing, deploying and
> co-developing the best OS).
>
> The part where such deletes are slow is understandable and
> explainable - I don't have any big performance expectations for the
> box, and 10MB/sec is quite fine with me here. The part where it
> leads to crashes and hangs system programs (zfs, zpool, etc.) is
> unacceptable.
>
>> The startup delay you are seeing is another "feature" of ZFS: if
>> you reboot in the middle of a large file delete or dataset destroy,
>> ZFS (and the OS) will not come up until it finishes the delete or
>> dataset destroy first.
>
> Why can't it be an intensive, but background, operation? Import the
> pool, let it be used, and go on deleting... like it was supposed to
> happen in that lifetime when the box began deleting these blocks ;)
>
> Well, it took me a worrisome while to figure this out the first time,
> a couple of months ago. Now I am just rather annoyed about the
> absence of access to my box and data, but I hope that it will come
> around after several retries.
>
> Apparently, this unpredictability (and the slowness and crashes) is a
> show-stopper for any enterprise use.
>
> I have made workarounds for the OS to come up okay, though.
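Regarding a script you can leave running: a rough sketch along these lines
traces spa_sync() through the fbt provider (that is the TXG sync path) and
prints one timestamped line per sync, so you can see how long each one takes:

    dtrace -qn '
      fbt::spa_sync:entry  { self->ts = timestamp; }
      fbt::spa_sync:return /self->ts/ {
        printf("%Y  txg sync took %d ms\n", walltimestamp,
            (timestamp - self->ts) / 1000000);
        self->ts = 0;
      }'

Bear in mind that if the 30-second interval was set via zfs_txg_timeout,
each sync may have up to 30 seconds' worth of dirty data to push out, so
long sync times here would not be surprising on their own; it is the
multi-minute or never-returning ones that would point at the real problem.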
> Since the root pool is separate, I removed "pool" and "dcpool" from
> the zpool.cache file, and now the OS milestones do not depend on them
> being available.
>
> Instead, importing "pool" (with cachefile=none), starting the iSCSI
> target and initiator, creating and removing the LUN with sbdadm, and
> importing "dcpool" are all wrapped in several SMF services, so I can
> relatively easily control the presence of these pools (I can disable
> them from autostart by touching a file in the /etc directory).
>
>> Steve
>>
>> ----- "Jim Klimov" <jimkli...@cos.ru> wrote:
>>>
>>> I've captured an illustration for this today, with my watchdog as
>>> well as vmstat, top and other tools. Half a gigabyte in under one
>>> second - the watchdog never saw it coming :(
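Side note on the SMF wrapping described above: that sounds like the right
approach. For anyone following along, the start method presumably boils down
to something like the following sketch -- the zvol name, guard file and LU
handling here are illustrative only, not Jim's actual setup:

    #!/bin/sh
    # Bring the data pools up by hand instead of via zpool.cache.
    [ -f /etc/noautoimport ] && exit 0          # guard file: leave pools offline

    zpool import -o cachefile=none pool         # physical raidz2 pool

    svcadm enable -s stmf                       # COMSTAR framework
    svcadm enable -s svc:/network/iscsi/target:default
    sbdadm create-lu /dev/zvol/rdsk/pool/dcvol  # recreate the backing LUN
    # (the LU still needs a view, e.g. "stmfadm add-view <GUID>",
    #  using the GUID that sbdadm just reported)

    svcadm enable -s svc:/network/iscsi/initiator:default
    iscsiadm add discovery-address 127.0.0.1
    iscsiadm modify discovery --sendtargets enable
    devfsadm -i iscsi                           # let the new LUN appear as a disk

    zpool import -o cachefile=none dcpool       # nested, deduped pool

The matching stop method would export dcpool, delete the LU, and export pool,
in roughly the reverse order.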
While your memory may be sufficient, that CPU is sorely lacking. Is it even
64-bit? There's a reason Intel couldn't give those things away in the early
2000s while AMD was eating their lunch.

--Tim
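P.S. To settle the 64-bit question quickly, the box can report it itself;
something along these lines with the standard illumos/Solaris utilities:

    isainfo -kv     # says whether a 64-bit (amd64) kernel is running
    isainfo -b      # prints 64 or 32 for the native word size
    psrinfo -pv     # shows the CPU model, cores and threads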
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss