On Jun 10, 2011 11:52 AM, "Jim Klimov" <jimkli...@cos.ru> wrote:
>
> 2011-06-10 18:00, Steve Gonczi wrote:
>>
>> Hi Jim,
>>
>> I wonder what OS version you are running?
>>
>> There was a problem similar to what you are describing in earlier
>> versions in the 13x kernel series.
>>
>> Should not be present in the 14x kernels.
>
>
> It is OpenIndiana oi_148a - and unlike many other details,
> this one was in my email post today ;)
>
>> I missed the system parameters in your earlier emails.
>
>
> Other config info: one dual-core P4 at about 2.8 GHz, 8 GB RAM
> (the maximum for the motherboard), 6 * 2 TB Seagate ST2000DL003
> disks in raidz2, plus an old 80 GB disk for the OS and swap.
>
> This system turned overnight from a test box into a backup of
> some of my old-but-needed files, and then into their only
> storage after the original server was cleaned out. So I do want
> this box to work reliably and not lose the data already on it,
> and without dedup's roughly 1.5x ratio I'm running close to
> not fitting into these 8 TB ;)
>
>> The usual recommended solution is
>> "oh, just do not use dedup, it is not production-ready"
>
>
> Well, alas, I am coming to the same conclusion so far...
> I have never seen any remotely similar issues on any other
> servers I maintain, but this one is the first guinea-pig
> box for dedup. I guess for the next few years (until 128 GB
> of RAM or so becomes the norm for a cheap home-enthusiast
> NAS) it may also be the last ;)
>
>> Using dedup is more or less hopeless with less than 16 GB of memory.
>> (More is better.)
>
>
> Well... 8 GB of RAM is not as low-end a configuration as some
> of those discussed in home-NAS-with-ZFS blogs, which claimed
> to have used dedup when it first came out. :)
>
> Since there's little hard information about DDT appetites
> so far (this list is still buzzing with calculations and
> tests), I assumed that 8 GB of RAM was a reasonable amount
> for starters. At least it would have been for, say, Linux or
> other open-source enthusiast communities, which have to make
> do with whatever second-hand hardware they can get ;)
> I was in that camp, and I know that personal budgets do not
> often stretch beyond $1-2k per box, which until recently
> bought very little RAM. And as I said, 8 GB is the maximum
> this motherboard takes anyway.
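>
> (For what it is worth, a rough way to estimate the DDT appetite is
> to ask zdb how many entries the table holds and multiply by the
> often-quoted ~320 bytes of RAM per entry - a figure I cannot vouch
> for, so treat the result as a ballpark only. "pool" below is just
> this box's pool name.)
>
>   # print dedup table statistics, including entry counts
>   zdb -D pool
>   # e.g. ~25 million total entries * ~320 B/entry ~= 8 GB of RAM
>   # just to keep the whole DDT in core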
>
> Anyhow, this box has two pools at the moment:
> * "pool" is the physical raidz2 on 6 disks with ashift=12.
>   Some datasets on "pool" are deduped and compressed (lzjb).
> * "dcpool" is built on a compressed volume inside "pool",
>   which is loopback-mounted over iSCSI; on the resulting
>   disk I made another pool with deduped datasets.
>
> This "dcpool" was used to test the idea about separating
> compression and deduplication (so that dedup decisions
> are made about raw source data, and after that whatever
> has to be written is compressed - once).
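>
> (For the record, the layering was roughly as sketched below. The
> zvol size and the COMSTAR/iSCSI steps are from memory and may miss
> a detail or two - e.g. the target itself has to be created with
> itadm - so take it as an outline, not a recipe.)
>
>   # compressed (but not deduped) zvol inside the physical pool
>   zfs create -V 6T -o compression=lzjb pool/dcpool
>
>   # export it over COMSTAR and log in from the same host
>   sbdadm create-lu /dev/zvol/rdsk/pool/dcpool
>   stmfadm add-view <lu-guid-reported-by-sbdadm>
>   iscsiadm add discovery-address 127.0.0.1
>   iscsiadm modify discovery --sendtargets enable
>   devfsadm -i iscsi
>
>   # build the new pool on the looped-back LUN and dedup its datasets
>   zpool create dcpool <new-iscsi-disk>
>   zfs set dedup=on dcpool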
>
> The PoC worked, somewhat - except that I used a small block
> size for the "pool/dcpool" volume, so the ZFS metadata needed
> to address the volume blocks takes up as much space as the
> user data.
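>
> (The block size is fixed when the zvol is created and cannot be
> changed afterwards; checking it is easy - "pool/dcpool" here is
> the actual volume:)
>
>   # volblocksize defaults to 8K if not set at creation time
>   zfs get volblocksize pool/dcpool
>   # compare the logical volume size with the space charged to it
>   zfs list -o name,volsize,used,referenced pool/dcpool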
>
> Performance was abysmal - lately around 1 MB/s for writes into
> "dcpool", and not much faster for reads, so for the past month
> I have been trying to evacuate my data back from "dcpool" into
> "pool", which performs faster - about 5-10 MB/s during these
> copies between pools. According to iostat, the physical hard
> disks are quite busy (over 60%) while "dcpool" is often stuck
> at 100% busy with several seconds(!) of wait time and zero IO
> operations. The physical "pool" datasets without dedup performed
> at "wire speed" (40-50 MB/s) when I was copying files over CIFS
> from another computer on a home gigabit LAN. Since this box is
> primarily archive storage with maybe a few datasets dedicated
> to somewhat active data (updates to a photo archive), slow but
> infrequent IO to deduped data is okay with me -- as long as
> the system doesn't crash, as it does now.
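>
> (Those busy/wait figures come from plain old iostat, watched in
> another terminal, roughly like this:)
>
>   # extended stats per named device, non-idle devices only,
>   # every 5 seconds: watch the %b, wsvc_t and asvc_t columns
>   iostat -xnz 5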
>
> This is partially why I bumped the TXG sync interval up to
> 30 seconds - so that ZFS would have more buffered data to
> coalesce after slow IO, and could try to minimize
> fragmentation and the mechanical IOPS involved.
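>
> (For anyone wanting to repeat this: I believe the tunable is
> zfs_txg_timeout, settable either permanently in /etc/system or on
> the fly with mdb - double-check the name against your build, since
> these knobs move around between releases.)
>
>   # persistent - add to /etc/system and reboot:
>   #   set zfs:zfs_txg_timeout = 30
>
>   # or patch the running kernel (0t30 = decimal 30):
>   echo "zfs_txg_timeout/W0t30" | mdb -kw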
>
>
>> The issue was an incorrectly sized buffer that caused ZFS to wait too
>> long for a buffer allocation. I can dig up the bug number and the fix
>> description if you are running something 130-ish.
>
>
> Unless that fix was somehow not integrated into OI, I am
> afraid digging it up would be of limited help. Still, I
> would be interested to read the summary, post-mortem and
> workarounds.
>
> Maybe this was broken again in OI by newer "improvements"? ;)
>
>> The thing I would want to check is the sync times and frequencies.
>> You can dtrace (and timestamp) this.
>
>
> Umm... could you please suggest a script, preferably one
> that I can leave running on the console, printing stats
> every second or so?
>
> I can only think of "time sync" so far. And I suspect the
> result would be measured in seconds, or more.
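>
> (If I guess right, something along these lines is what you mean -
> an fbt probe pair around spa_sync. I have not verified it on
> oi_148a, so consider it a starting point rather than a known-good
> script:)
>
>   #!/usr/sbin/dtrace -s
>   /* report how long each ZFS transaction group sync takes */
>   fbt::spa_sync:entry
>   {
>           self->ts = timestamp;
>   }
>
>   fbt::spa_sync:return
>   /self->ts/
>   {
>           printf("%Y  sync took %d ms\n", walltimestamp,
>               (timestamp - self->ts) / 1000000);
>           self->ts = 0;
>   }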
>
>
>> I would suspect that when the bad state occurs, your sync is taking a
>> _very_ long time.
>
>
> When the VERY BAD state occurs, I can no longer use the
> system or test anything ;)
>
> When it nearly occurs, I have only a few seconds of uptime
> left, and since each boot-to-crash run now takes roughly 2-3
> hours, I am unlikely to be active at the console during
> those critical few seconds.
>
> And "sync" would likely never return in that case anyway.
>
>> Deletes, and dataset / snapshot deletes, are not managed
>> correctly in a deduped environment in ZFS. This is a known
>> problem, although it should not be anywhere near as bad as
>> what you are describing in the current tip.
>
>
> Well, it is, and on hardware that is not exactly the lowest
> end (at least in terms of what OpenSolaris developers can
> expect from the general enthusiast community which is supposed
> to help by testing, deploying and co-developing the best OS).
>
> The part where such deletes are slow is understandable and
> explainable - I don't have big performance expectations for
> the box, and 10 MB/s is quite fine with me here. The part
> where it leads to crashes and hangs system programs (zfs,
> zpool, etc.) is unacceptable.
>
>
>> The startup delay you are seeing is another "feature" of ZFS: if you
>> reboot in the middle of a large file delete or dataset destroy, ZFS
>> (and the OS) will not come up until it finishes the delete or dataset
>> destroy first.
>
>
> Why can't it be an intensive, but background, operation?
> Import the pool, let it be used, and go on deleting...
> as was supposed to happen in that lifetime when the box
> began deleting these blocks ;)
>
> Well, it took me a worrisome while to figure this out
> the first time, a couple of months ago. Now I am just
> rather annoyed about the lack of access to my box and
> data, but I hope that it will come around after several
> retries.
>
> Apparently, this unpredictability (and the slowness and
> crashes) is a show-stopper for any enterprise use.
>
> I have made workarounds for the OS to come up okay,
> though. Since the root pool is separate, I removed
> "pool" and "dcpool" from the zpool.cache file, so the
> OS milestones no longer depend on their availability.
>
> Instead, importing "pool" (with cachefile=none),
> starting the iSCSI target and initiator, creating and
> removing the LUN with sbdadm, and importing "dcpool"
> are all wrapped in several SMF services, so I can
> relatively easily control the presence of these pools
> (I can disable them from autostarting by touching a
> file in the /etc directory).
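>
> (The start method of the "pool" wrapper service boils down to
> something like the sketch below; the flag-file path and the
> service layout are my own ad-hoc choices, nothing standard:)
>
>   #!/sbin/sh
>   # SMF start method sketch: import "pool" unless the flag file exists
>   . /lib/svc/share/smf_include.sh
>
>   [ -f /etc/noautoimport-pool ] && exit $SMF_EXIT_OK
>
>   zpool import -o cachefile=none pool || exit $SMF_EXIT_ERR_FATAL
>   exit $SMF_EXIT_OK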
>
>> Steve
>>
>>
>>
>> ----- "Jim Klimov" <jimkli...@cos.ru> wrote:
>>>
>>> I've captured an illustration for this today, with my watchdog as
>>> well as vmstat, top and other tools. Half a gigabyte in under one
>>> second - the watchdog never saw it coming :(
>>
>>
>
>

While your memory may be sufficient, that CPU is sorely lacking. Is it even
64-bit? There's a reason Intel couldn't give those things away in the early
2000s while AMD was eating their lunch.

--Tim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
