Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Wed, Mar 17, 2010 at 9:15 AM, Edward Ned Harvey wrote: >> I think what you're saying is: Why bother trying to backup with "zfs >> send" >> when the recommended practice, fully supportable, is to use other tools >> for >> backup, such as tar, star, Amanda, bacula, etc. Right? >> >> The answer to this is very simple. >> #1 ... >> #2 ... > > Oh, one more thing. "zfs send" is only discouraged if you plan to store the > data stream and do "zfs receive" at a later date. > > If instead, you are doing "zfs send | zfs receive" onto removable media, or > another server, where the data is immediately fed through "zfs receive" then > it's an entirely viable backup technique. Richard Elling made an interesting observation that suggests that storing a zfs send data stream on tape is a quite reasonable thing to do. Richard's background makes me trust his analysis of this much more than I trust the typical person that says that zfs send output is poison. http://opensolaris.org/jive/thread.jspa?messageID=465973&tstart=0#465861 I think that a similar argument could be made for storing the zfs send data streams on a zfs file system. However, it is not clear why you would do this instead of just zfs send | zfs receive. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thoughts on ZFS Pool Backup Strategies
On Fri, Mar 19, 2010 at 11:57 PM, Edward Ned Harvey wrote: >> 1. NDMP for putting "zfs send" streams on tape over the network. So > > Tell me if I missed something here. I don't think I did. I think this > sounds like crazy talk. > > I used NDMP up till November, when we replaced our NetApp with a Solaris Sun > box. In NDMP, to choose the source files, we had the ability to browse the > fileserver, select files, and specify file matching patterns. My point is: > NDMP is file based. It doesn't allow you to spawn a process and backup a > data stream. > > Unless I missed something. Which I doubt. ;-) 5+ years ago the variety of NDMP that was available with the combination of NetApp's OnTap and Veritas NetBackup did backups at the volume level. When I needed to go to tape to recover a file that was no longer in snapshots, we had to find space on a NetApp to restore the volume. It could not restore the volume to a Sun box, presumably because the contents of the backup used a data stream format that was proprietary to NetApp. An expired Internet Draft for NDMPv4 says: butype_name Specifies the name of the backup method to be used for the transfer (dump, tar, cpio, etc). Backup types are NDMP Server implementation dependent and MUST match one of the Data Server implementation specific butype_name strings accessible via the NDMP_CONFIG_GET_BUTYPE_INFO request. http://www.ndmp.org/download/sdk_v4/draft-skardal-ndmp4-04.txt It seems pretty clear from this that an NDMP data stream can contain most anything and is dependent on the device being backed up. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs diff
On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams wrote:
> One really good use for zfs diff would be: as a way to index zfs send
> backups by contents.

Or to generate the list of files for incremental backups via NetBackup or
similar. This is especially important for file systems with millions of
files with relatively few changes.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is it safe to disable the swap partition?
On Sun, May 9, 2010 at 7:40 PM, Edward Ned Harvey wrote: > > > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > > boun...@opensolaris.org] On Behalf Of Richard Elling > > > > For a storage server, swap is not needed. If you notice swap being used > > then your storage server is undersized. > > Indeed, I have two solaris 10 fileservers that have uptime in the range of a > few months. I just checked swap usage, and they're both zero. > > So, Bob, rub it in if you wish. ;-) I was wrong. I knew the behavior in > Linux, which Roy seconded as "most OSes," and apparently we both assumed the > same here, but that was wrong. I don't know if solaris and opensolaris both > have the same swap behavior. I don't know if there's *ever* a situation > where solaris/opensolaris would swap idle processes. But there's at least > evidence that my two servers have not, or do not. If Solaris is under memory pressure, pages may be paged to swap. Under severe memory pressure, entire processes may be swapped. This will happen after freeing up the memory used for file system buffers, ARC, etc. If the processes never page in the pages that have been paged out (or the processes that have been swapped out are never scheduled) then those pages will not consume RAM. The best thing to do with processes that can be swapped out forever is to not run them. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
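For anyone who wants to verify this on their own machines, a few quick
checks (a sketch; exact output formats vary by Solaris release):

  # Summary of reserved, allocated and available swap
  swap -s

  # Per-device swap usage; "free" close to "blocks" means little has been used
  swap -l

  # Paging activity broken out by type; a non-zero "apo" (anonymous
  # page-outs) column is the sign of real pressure pushing pages to swap
  vmstat -p 5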
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote:
> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
>> There are multiple factors at work. Your OpenSolaris should be new
>> enough to have the fix in which the zfs I/O tasks are run in in a
>> scheduling class at lower priority than normal user processes.
>> However, there is also a throttling mechanism for processes which
>> produce data faster than can be consumed by the disks. This
>> throttling mechanism depends on the amount of RAM available to zfs and
>> the write speed of the I/O channel. More available RAM results in
>> more write buffering, which results in a larger chunk of data written
>> at the next transaction group write interval. The maximum size of a
>> transaction group may be configured in /etc/system similar to:
>>
>> * Set ZFS maximum TXG group size to 2684354560
>> set zfs:zfs_write_limit_override = 0xa000
>>
>> If the transaction group is smaller, then zfs will need to write more
>> often. Processes will still be throttled but the duration of the
>> delay should be smaller due to less data to write in each burst. I
>> think that (with multiple writers) the zfs pool will be "healthier"
>> and less fragmented if you can offer zfs more RAM and accept some
>> stalls during writing. There are always tradeoffs.
>>
>> Bob
>
> well it seems like when messing with the txg sync times and stuff like
> that it did make the transfer more smooth but didn't actually help with
> speeds as it just meant the hangs happened for a shorter time but at a
> smaller interval and actually lowering the time between writes just
> seemed to make things worse (slightly).
>
> I think I have came to the conclusion that the problem here is CPU due
> to the fact that its only doing this with parity raid. I would think if
> it was I/O based then it would be the same as if anything its heavier on
> I/O on non parity raid due to the fact that it is no longer CPU
> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
> parity raidz2).

To see if the CPU is pegged, take a look at the output of:

  mpstat 1
  prstat -mLc 1

If mpstat shows that the idle time reaches 0 or the process' latency column
is more than a few tenths of a percent, you are probably short on CPU.

It could also be that interrupts are stealing cycles from rsync. Placing it
in a processor set with interrupts disabled in that processor set may help.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
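A minimal sketch of the processor set suggestion above (the CPU IDs, set ID,
and rsync PID are made-up examples; all of this requires root):

  # Create a processor set from two CPUs and keep interrupts off of them
  psrset -c 2 3          # prints the new set id, e.g. 1
  psradm -i 2 3          # disable interrupt handling on those CPUs
  psrset -b 1 1234       # bind the rsync process (PID is an example) to set 1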
Re: [zfs-discuss] Small stalls slowing down rsync from holding network saturation every 5 seconds
Sorry, turned on html mode to avoid gmail's line wrapping.

On Mon, May 31, 2010 at 4:58 PM, Sandon Van Ness wrote:
> On 05/31/2010 02:52 PM, Mike Gerdts wrote:
> > On Mon, May 31, 2010 at 4:32 PM, Sandon Van Ness wrote:
> >> On 05/31/2010 01:51 PM, Bob Friesenhahn wrote:
> >>> There are multiple factors at work. Your OpenSolaris should be new
> >>> enough to have the fix in which the zfs I/O tasks are run in in a
> >>> scheduling class at lower priority than normal user processes.
> >>> However, there is also a throttling mechanism for processes which
> >>> produce data faster than can be consumed by the disks. This
> >>> throttling mechanism depends on the amount of RAM available to zfs and
> >>> the write speed of the I/O channel. More available RAM results in
> >>> more write buffering, which results in a larger chunk of data written
> >>> at the next transaction group write interval. The maximum size of a
> >>> transaction group may be configured in /etc/system similar to:
> >>>
> >>> * Set ZFS maximum TXG group size to 2684354560
> >>> set zfs:zfs_write_limit_override = 0xa000
> >>>
> >>> If the transaction group is smaller, then zfs will need to write more
> >>> often. Processes will still be throttled but the duration of the
> >>> delay should be smaller due to less data to write in each burst. I
> >>> think that (with multiple writers) the zfs pool will be "healthier"
> >>> and less fragmented if you can offer zfs more RAM and accept some
> >>> stalls during writing. There are always tradeoffs.
> >>>
> >>> Bob
> >>
> >> well it seems like when messing with the txg sync times and stuff like
> >> that it did make the transfer more smooth but didn't actually help with
> >> speeds as it just meant the hangs happened for a shorter time but at a
> >> smaller interval and actually lowering the time between writes just
> >> seemed to make things worse (slightly).
> >>
> >> I think I have came to the conclusion that the problem here is CPU due
> >> to the fact that its only doing this with parity raid. I would think if
> >> it was I/O based then it would be the same as if anything its heavier on
> >> I/O on non parity raid due to the fact that it is no longer CPU
> >> bottlenecked (dd write test gives me near 700 megabytes/sec vs 450 with
> >> parity raidz2).
> >
> > To see if the CPU is pegged, take a look at the output of:
> >
> >   mpstat 1
> >   prstat -mLc 1
> >
> > If mpstat shows that the idle time reaches 0 or the process' latency
> > column is more then a few tenths of a percent, you are probably short
> > on CPU.
> >
> > It could also be that interrupts are stealing cycles from rsync.
> > Placing it in a processor set with interrupts disabled in that
> > processor set may help.
>
> Unfortunately none of these utilies make it possible to ge values for <1
> second which is what the hang is (its happening for about 1/2 of a second).
>
> Here is with mpstat:
>
> Here is what i get with prstat:
>
> Total: 57 processes, 260 lwps, load averages: 2.15, 2.16, 2.15
>    PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
>    604 root     0.0  33 0.0 0.0 0.0 0.0  42  25  18  13   0   0 zpool-data/13
>    604 root     0.0  30 0.0 0.0 0.0 0.0  41  29  12  12   0   0 zpool-data/15
>   1326 root      12 2.9 0.0 0.0 0.0 0.0  85 0.4  1K  12 11K   0 rsync/1
>    604 root     0.0  15 0.0 0.0 0.0 0.0  41  44 111   9   0   0 zpool-data/27
>    604 root     0.0  14 0.0 0.0 0.0 0.0  43  42  72   3   0   0 zpool-data/33
>    604 root     0.0 5.9 0.0 0.0 0.0 0.0  41  53 109   6   0   0 zpool-data/19
>    604 root     0.0 5.4 0.0 0.0 0.0 0.0  42  53 106   8   0   0 zpool-data/25
>    604 root     0.0 5.3 0.0 0.0 0.0 0.0  43  51 107   7   0   0 zpool-data/21
>    604 root     0.0 4.5 0.0 0.0 0.0 0.0  41  54 110   4   0   0 zpool-data/31
>    604 root     0.0 3.9 0.0 0.0 0.0 0.0  41  55 109   3   0   0 zpool-data/23
>    604 root     0.0 3.7 0.0 0.0 0.0 0.0  44  52 111   2   0   0 zpool-data/29
>   1322 root     0.0 0.4 0.0 0.0 0.0 0.0  98 2.0  1K   0   1   0 rsync/1
>  22644 root     0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  16  13 255   0 prstat/1
>  14409 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   5   3  69   0 sshd/1
>    196 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  15   2 105   0 nscd/17

In the interval abo
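For stalls that last well under a second, kernel profiling can show where
the time goes even though mpstat and prstat cannot sample that finely. A
sketch, using the common lockstat profiling idiom (run it while the hang is
happening):

  # Sample the kernel with the profiling interrupt for one second and
  # show the top 20 call sites; a txg sync or parity computation that
  # dominates the sample points at the culprit
  lockstat -kIW -D 20 sleep 1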
Re: [zfs-discuss] Sun Flash Accelerator F20
On Thu, Jun 10, 2010 at 9:39 AM, Andrey Kuzmin wrote:
> On Thu, Jun 10, 2010 at 6:06 PM, Robert Milkowski wrote:
>>
>> On 21/10/2009 03:54, Bob Friesenhahn wrote:
>>>
>>> I would be interested to know how many IOPS an OS like Solaris is able to
>>> push through a single device interface. The normal driver stack is likely
>>> limited as to how many IOPS it can sustain for a given LUN since the driver
>>> stack is optimized for high latency devices like disk drives. If you are
>>> creating a driver stack, the design decisions you make when requests will be
>>> satisfied in about 12ms would be much different than if requests are
>>> satisfied in 50us. Limitations of existing software stacks are likely
>>> reasons why Sun is designing hardware with more device interfaces and more
>>> independent devices.
>>
>> Open Solaris 2009.06, 1KB READ I/O:
>>
>> # dd of=/dev/null bs=1k if=/dev/rdsk/c0t0d0p0&
>
> /dev/null is usually a poor choice for a test lie this. Just to be on the
> safe side, I'd rerun it with /dev/random.
>
> Regards,
> Andrey

(aside from other replies about read vs. write and /dev/random...)

Testing performance of disk by reading from /dev/random and writing to disk
is misguided. From random(7d):

     Applications retrieve random bytes by reading /dev/random or
     /dev/urandom. The /dev/random interface returns random bytes only
     when sufficient amount of entropy has been collected.

In other words, when the kernel doesn't think that it can give high quality
random numbers, it stops providing them until it has gathered enough
entropy. It will pause your reads.

If instead you use /dev/urandom, the above problem doesn't exist, but the
generation of random numbers is CPU-intensive. There is a reasonable chance
(particularly with slow CPU's and fast disk) that you will be testing the
speed of /dev/urandom rather than the speed of the disk or other I/O
components.

If your goal is to provide data that is not all 0's to prevent ZFS
compression from making the file sparse or want to be sure that compression
doesn't otherwise make the actual writes smaller, you could try something
like:

  # create a file just over 100 MB
  dd if=/dev/random of=/tmp/randomdata bs=513 count=204401

  # repeatedly feed that file to dd
  while true ; do cat /tmp/randomdata ; done | dd of=/my/test/file bs=... count=...

The above should make it so that it will take a while before there are two
blocks that are identical, thus confounding deduplication as well.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On Tue, Jun 15, 2010 at 7:28 PM, David Magda wrote:
> On Jun 15, 2010, at 14:20, Fco Javier Garcia wrote:
>
>> I think dedup may have its greatest appeal in VDI environments (think
>> about a environment with 85% if the data that the virtual machine needs is
>> into ARC or L2ARC... is like a dream...almost instantaneous response... and
>> you can boot a new machine in a few seconds)...
>
> This may also be accomplished by using snapshots and clones of data sets. At
> least for OS images: user profiles and documents could be something else
> entirely.

It all depends on the nature of the VDI environment. If the VMs are
regenerated on each login, the snapshot + clone mechanism is sufficient.
Deduplication is not needed. However, if VMs have a long life and get
periodic patches and other software updates, deduplication will be required
if you want to remain at somewhat constant storage utilization.

It probably makes a lot of sense to be sure that swap or page files are on a
non-dedup dataset. Executables and shared libraries shouldn't be getting
paged out to it and the likelihood that multiple VMs page the same thing to
swap or a page file is very small.

> Another situation that comes to mind is perhaps as the back-end to a mail
> store: if you send out a message(s) with an attachment(s) to a lot of
> people, the attachment blocks could be deduped (and perhaps compressed as
> well, since base-64 adds 1/3 overhead).

It all depends on how this is stored. If the attachments are stored like
they were in 1990 as part of an mbox format, you will be very unlikely to
get the proper block alignment. Even storing the message body (including
headers) in the same file as the attachment may not align the attachments
because the mail headers may be different (e.g. different recipients'
messages took different paths, some were forwarded, etc.).

If the attachments are stored in separate files or a database format is used
that stores attachments separate from the message (with matching database +
zfs block size) things may work out favorably. However, a system that
detaches messages and stores them separately may just as well store them in
a file that matches the SHA256 hash, assuming that file doesn't already
exist. If it does exist, it can just increment a reference count. In other
words, an intelligent mail system should already dedup. Or at least that is
how I would have written it for the last decade or so...

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
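A minimal shell sketch of the content-addressed attachment store described
above (the paths are placeholders and the digest(1) usage is illustrative;
this is not how any particular mail system is implemented):

  # Store an attachment under the name of its SHA256 hash. If a file by
  # that name already exists, the content is already in the store.
  hash=`digest -a sha256 /tmp/attachment.pdf`
  store=/var/mailstore/attachments/$hash
  if [ ! -f "$store" ]; then
          cp /tmp/attachment.pdf "$store"
  fi
  # Messages reference $store rather than carrying their own copy; a
  # reference count decides when the attachment can be removed.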
Re: [zfs-discuss] VXFS to ZFS Quota
On Fri, Jun 18, 2010 at 8:09 AM, David Magda wrote:
> You could always split things up into groups of (say) 50. A few jobs ago,
> I was in an environment where we have a /home/students1/ and
> /home/students2/, along with a separate faculty/ (using Solaris and UFS).
> This had more to do with IOps than anything else.

A decade or so ago I managed similar environments and had (I think) 6 file
systems handling about 5000 students. Each file system had about 1/6 of the
students. Challenges I found in this were:

- Students needed to work on projects together. The typical way to do this
  was for them to request a group, then create a group writable directory in
  one of their home directories. If all students in the group had home
  directories on the same file system, there was nothing special to
  consider. If they were on different file systems then at least one would
  need to have a non-zero quota (that is, not 0 blocks soft, 1 block hard)
  on the file system where the group directory resides.

- Despite your best efforts things will get imbalanced. If you are tight on
  space, this means that you will need to migrate users. This will become
  apparent only at the times of the semester where even per-user outages are
  most inconvenient (i.e. at 6 and 13 weeks when big projects tend to be
  due).

It's probably a good idea to consider these types of situations in the
transition plan, or at least determine they don't apply. I was working in a
college of engineering where group projects were common and CAD, EDA, and
simulation tools could generate big files very quickly.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn wrote: >> >> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having >> one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS >> (according to zpool iostat) and 20-30MB/sec in random read on a big >> database. Does that sounds right? > > I am not sure who wrote the above text since the attribution quoting is all > botched up (Gmail?) in this thread. Regardless, it is worth pointing out > that 'zpool iostat' only reports the I/O operations which were actually > performed. It will not report the operations which did not need to be > performed due to already being in cache. A quite busy system can still > report very little via 'zpool iostat' if it has enough RAM to cache the > requested data. > > Bob Very good point. You can use a combination of "zpool iostat" and fsstat to see the effect of reads that didn't turn into physical I/Os. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
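For example (a sketch; the pool name and mount point are placeholders),
watching both layers side by side makes cache hits visible as the gap
between logical and physical activity:

  # Physical I/O actually issued to the pool, once per second
  zpool iostat tank 1

  # Logical read/write operations seen at the file system layer
  fsstat /tank/db 1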
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 10:08 AM, Ian D wrote: > What I don't understand is why, when I run a single query I get <100 IOPS > and <3MB/sec. The setup can obviously do better, so where is the > bottleneck? I don't see any CPU core on any side being maxed out so it > can't be it... In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPU's. "prstat -mL" gives a nice picture of how busy each LWP (thread) is. When viewed with "prstat -mL", A thread that has usr+sys at 100% cannot go any faster, unless you can get the CPU to go faster, as I suggest below. From my understanding (perhaps not 100% correct on the rest of this paragraph): The time spent in TRP may be reclaimed by running the application in a processor set with interrupts disabled on all of its processors. If TFL or DFL are high, optimizing the use of cache may be beneficial. Examples of how you can optimize the use of cache include using the FX scheduler with a priority that gives relatively long time slices, using processor sets to keep other processes off of the same caches (which are often shared by multiple cores), or perhaps disabling CPU's (threads) to ensure that only a single core is using each cache. With current generation Intel CPU's, this can allow the CPU clock rate to increase, thereby allowing more work to get done. > The database is MySQL, it runs on a Linux box that connects to the Nexenta Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis. > server through 10GbE using iSCSI. Have you done any TCP tuning? Based on the numbers you cite above, it looks like you are doing about 32 KB I/O's. I think you can perform a test that involves mainly the network if you use netperf with options like: netperf -H $host -t TCP_RR -r 32768 -l 30 That is speculation based on reading http://www.netperf.org/netperf/training/Netperf.html. Someone else (perhaps on networking or performance lists) may have better tests to run. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
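A sketch of the FX scheduling class suggestion above (the PID, priority, and
time quantum are illustrative values, not recommendations):

  # Move the process into the fixed-priority class with a long time
  # quantum so it gets relatively long, uninterrupted runs on a CPU
  priocntl -s -c FX -m 30 -p 30 -t 200 -i pid 12345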
Re: [zfs-discuss] Expected throughput
On Sun, Jul 4, 2010 at 2:08 PM, Ian D wrote:
> Mem:  74098512k total, 73910728k used,   187784k free,    96948k buffers
> Swap:  2104488k total,      208k used,  2104280k free, 63210472k cached
>
>   PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
> 17652 mysql  20   0 3553m 3.1g 5472 S   38  4.4 247:51.80 mysqld
> 16301 mysql  20   0 4275m 3.3g 5980 S    4  4.7   5468:33 mysqld
> 16006 mysql  20   0 4434m 3.3g 5888 S    3  4.6   5034:06 mysqld
> 12822 root   15  -5     0    0    0 S    2  0.0  22:00.50 scsi_wq_39

Is that 38% of one CPU or 38% of all CPU's? How many CPU's does the Linux
box have? I don't mean the number of sockets, I mean number of sockets *
number of cores * number of threads per core.

My recollection of top is that the CPU percentage is:

  (pcpu_t2 - pcpu_t1) / (interval * ncpus)

Where pcpu_t* is the process CPU time at a particular time. If you have a
two socket quad core box with hyperthreading enabled, that is 2 * 4 * 2 = 16
CPU's. 38% of 16 CPU's can be roughly 6 CPU's running as fast as they can
(and 10 of them idle) or 16 CPU's each running at about 38%. In the "I don't
have a CPU bottleneck" argument, there is a big difference.

If PID 16301 has a single thread that is doing significant work, on the
hypothetical 16 CPU box this means that it is spending about 2/3 of the time
on CPU. If the workload does:

  while ( 1 ) {
      issue I/O request
      get response
      do cpu-intensive work
  }

It is only trying to do I/O 1/3 of the time. Further, it has put a single
high latency operation between its bursts of CPU activity.

One other area of investigation that I didn't mention before: Your stats
imply that the Linux box is getting data 32 KB at a time. How does 32 KB
compare to the database block size? How does 32 KB compare to the block size
on the relevant zfs filesystem or zvol? Are blocks aligned at the various
layers?

http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Hashing files rapidly on ZFS
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen wrote: > Daniel Carosone wrote: >> Something similar would be useful, and much more readily achievable, >> from ZFS from such an application, and many others. Rather than a way >> to compare reliably between two files for identity, I'ld liek a way to >> compare identity of a single file between two points in time. If my >> application can tell quickly that the file content is unaltered since >> last time I saw the file, I can avoid rehashing the content and use a >> stored value. If I can achieve this result for a whole directory >> tree, even better. > > This would be great for any kind of archiving software. Aren't zfs checksums > already ready to solve this? If a file changes, it's dnodes' checksum changes, > the checksum of the directory it is in and so forth all the way up to the > uberblock. > There may be ways a checksum changes without a real change in the files > content, > but the other way round should hold. If the checksum didn't change, the file > didn't change. > So the only missing link is a way to determine zfs's checksum for a > file/directory/dataset. Am I missing something here? Of course atime update > should be turned off, otherwise the checksum will get changed by the archiving > agent. What is the likelihood that the same data is re-written to the file? If that is unlikely, it looks as though znode_t's z_seq may be useful. While it isn't a checksum, it seems to be incremented on every file change. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] NFS performance?
On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote:
> On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
>>
>> I think there may be very good reason to use iSCSI, if you're limited
>> to gigabit but need to be able to handle higher throughput for a
>> single client. I may be wrong, but I believe iSCSI to/from a single
>> initiator can take advantage of multiple links in an active-active
>> multipath scenario whereas NFS is only going to be able to take
>> advantage of 1 link (at least until pNFS).
>
> There are other ways to get multiple paths. First off, there is IP
> multipathing. which offers some of this at the IP layer. There is also
> 802.3ad link aggregation (trunking). So you can still get high
> performance beyond single link with NFS. (It works with iSCSI too,
> btw.)

With both IPMP and link aggregation, each TCP session will go over the same
wire. There is no guarantee that load will be evenly balanced between links
when there are multiple TCP sessions. As such, any scalability you get using
these configurations will be dependent on having a complex enough workload,
wise configuration choices, and a bit of luck.

Note that with Sun Trunking there was an option to load balance using a
round robin hashing algorithm. When pushing high network loads this may
cause performance problems with reassembly.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
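As a concrete illustration (a sketch; the link names are placeholders and
dladm syntax differs between Solaris 10 and OpenSolaris releases), the
aggregation's hashing policy only chooses a link per flow, so a single TCP
session still rides one wire:

  # 802.3ad aggregation over two NICs, hashing on L4 (port) headers so
  # that different sessions have a chance of landing on different links
  dladm create-aggr -P L4 -l e1000g0 -l e1000g1 aggr0
  dladm show-aggr aggr0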
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore wrote: > On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote: >> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore wrote: >> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote: >> >> >> >> I think there may be very good reason to use iSCSI, if you're limited >> >> to gigabit but need to be able to handle higher throughput for a >> >> single client. I may be wrong, but I believe iSCSI to/from a single >> >> initiator can take advantage of multiple links in an active-active >> >> multipath scenario whereas NFS is only going to be able to take >> >> advantage of 1 link (at least until pNFS). >> > >> > There are other ways to get multiple paths. First off, there is IP >> > multipathing. which offers some of this at the IP layer. There is also >> > 802.3ad link aggregation (trunking). So you can still get high >> > performance beyond single link with NFS. (It works with iSCSI too, >> > btw.) >> >> With both IPMP and link aggregation, each TCP session will go over the >> same wire. There is no guarantee that load will be evenly balanced >> between links when there are multiple TCP sessions. As such, any >> scalability you get using these configurations will be dependent on >> having a complex enough workload, wise cconfiguration choices, and and >> a bit of luck. > > If you're really that concerned, you could use UDP instead of TCP. But > that may have other detrimental performance impacts, I'm not sure how > bad they would be in a data center with generally lossless ethernet > links. Heh. My horror story with reassembly was actually with connectionless transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks via UDP by default, or LLT when used in the Veritas + Oracle RAC certified configuration from 5+ years ago. The use of Sun trunking with round robin hashing and the lack of use of jumbo packets made every cache fusion block turn into 6 LLT or UDP packets that had to be reassembled on the other end. This was on a 15K domain with the NICs spread across IO boards. I assume that interrupts for a NIC are handled by a CPU on the closest system board (Solaris 8, FWIW). If that assumption is true then there would also be a flurry of inter-system board chatter to put the block back together. In any case, performance was horrible until we got rid of round robin and enabled jumbo frames. > Btw, I am not certain that the multiple initiator support (mpxio) is > necessarily any better as far as guaranteed performance/balancing. (It > may be; I've not looked closely enough at it.) I haven't paid close attention to how mpxio works. The Veritas analog, vxdmp, does a very good job of balancing traffic down multiple paths, even when only a single LUN is accessed. The exact mode that dmp will use is dependent on the capabilities of the array it is talking to - many arrays work in an active/passive mode. As such, I would expect that with vxdmp or mpxio the balancing with iSCSI would be at least partially dependent on what the array said to do. > I should look more closely at NFS as well -- if multiple applications on > the same client are access the same filesystem, do they use a single > common TCP session, or can they each have separate instances open? > Again, I'm not sure. It's worse than that. A quick experiment with two different automounted home directories from the same NFS server suggests that both home directories share one TCP session to the NFS server. The latest version of Oracle's RDBMS supports a userland NFS client option. 
It would be very interesting to see if this does a separate session per data file, possibly allowing for better load spreading. >> Note that with Sun Trunking there was an option to load balance using >> a round robin hashing algorithm. When pushing high network loads this >> may cause performance problems with reassembly. > > Yes. Reassembly is Evil for TCP performance. > > Btw, the iSCSI balancing act that was described does seem a bit > contrived -- a single initiator and a COMSTAR server, both client *and > server* with multiple ethernet links instead of a single 10GbE link. > > I'm not saying it doesn't happen, but I think it happens infrequently > enough that its reasonable that this scenario wasn't one that popped > immediately into my head. :-) It depends on whether the people that control the network gear are the same ones that control servers. My experience suggests that if there is a disconnect, it seems rather likely that each group's standardization efforts, procurement cycles, and capacity plans will work against any attempt t
Re: [zfs-discuss] NFS performance?
On Mon, Jul 26, 2010 at 2:56 PM, Miles Nordin wrote:
>>>>>> "mg" == Mike Gerdts writes:
>
>     mg> it is rather common to have multiple 1 Gb links to
>     mg> servers going to disparate switches so as to provide
>     mg> resilience in the face of switch failures. This is not unlike
>     mg> (at a block diagram level) the architecture that you see in
>     mg> pretty much every SAN. In such a configuation, it is
>     mg> reasonable for people to expect that load balancing will
>     mg> occur.
>
> nope. spanning tree removes all loops, which means between any two
> points there will be only one enabled path. An L2-switched network
> will look into L4 headers for splitting traffic across an aggregated
> link (as long as it's been deliberately configured to do that---by
> default probably only looks to L2), but it won't do any multipath
> within the mesh.

I was speaking more of IPMP, which is at layer 3.

> Even with an L3 routing protocol it usually won't do multipath unless
> the costs of the paths match exactly, so you'd want to build the
> topology to achieve this and then do all switching at layer 3 by
> making sure no VLAN is larger than a switch.

By default, IPMP does outbound load spreading. Inbound load spreading is not
practical with a single (non-test) IP address. If you have multiple virtual
IP's you can spread them across all of the NICs in the IPMP group and get
some degree of inbound spreading as well. This is the default behavior of
the OpenSolaris IPMP implementation, last I looked. I've not seen any
examples (although I can't say I've looked real hard either) of the Solaris
10 IPMP configuration set up with multiple IP's to encourage inbound load
spreading as well.

> There's actually a cisco feature to make no VLAN larger than a *port*,
> which I use a little bit. It's meant for CATV networks I think, or
> DSL networks aggregated by IP instead of ATM like maybe some European
> ones? but the idea is not to put edge ports into vlans any more but
> instead say 'ip unnumbered loopbackN', and then some black magic they
> have built into their DHCP forwarder adds /32 routes by watching the
> DHCP replies. If you don't use DHCP you can add static /32 routes
> yourself, and it will work. It does not help with IPv6, and also you
> can only use it on vlan-tagged edge ports (what? arbitrary!) but
> neat that it's there at all.
>
> http://www.cisco.com/en/US/docs/ios/12_3t/12_3t4/feature/guide/gtunvlan.html

Interesting... however this seems to limit you to < 4096 edge ports per VTP
domain, as the VID field in the 802.1q header is only 12 bits. It is also
unclear how this works when you have one physical host with many guests. And
then there is the whole thing that I don't really see how this helps with
resilience in the face of a switch failure. Cool technology, but I'm not
certain that it addresses what I was talking about.

> The best thing IMHO would be to use this feature on the edge ports,
> just as I said, but you will have to teach the servers to VLAN-tag
> their packets. not such a bad idea, but weird.
>
> You could also use it one hop up from the edge switches, but I think
> it might have problems in general removing the routes when you unplug
> a server, and using it one hop up could make them worse. I only use
> it with static routes so far, so no mobility for me: I have to keep
> each server plugged into its assigned port, and reconfigure switches
> if I move it. Once you have ``no vlan larger than 1 switch,'' if you
> actually need a vlan-like thing that spans multiple switches, the new
> word for it is 'vrf'.

There was some other Cisco dark magic that our network guys were touting a
while ago that would make each edge switch look like a blade in a 6500
series. This would then allow them to do link aggregation across edge
switches. At least two of "organizational changes", "personnel changes", and
"roadmap changes" happened so I've not seen this in action.

> so, yeah, it means the server people will have to take over the job of
> the networking people. The good news is that networking people don't
> like spanning tree very much because it's always going wrong, so
> AFAICT most of them who are paying attention are already moving in
> this direction.
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving /export to another zpool
On Fri, Aug 13, 2010 at 1:07 PM, Handojo wrote:
>> Are the old /opt and /expore still listed in your
>> vfstab(4) file?
>
> I cant access /etc/vfstab because I can't even log in as my username. I can't
> even log in as root from the Login Screen
>
> And when I boot on using LiveCD, how can I mount my first drive that has
> opensolaris installed ?

To list the zpools it can see:

  zpool import

To import one called rpool at an alternate root:

  zpool import -R /mnt rpool

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
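To go on and inspect /etc/vfstab from the live CD, something along these
lines should work (a sketch; rpool/ROOT/opensolaris is a guess at the boot
environment name, and the root dataset normally has canmount=noauto so it
must be mounted explicitly):

  zpool import -R /mnt rpool
  zfs list -r rpool/ROOT              # find the boot environment dataset
  zfs mount rpool/ROOT/opensolaris    # root fs now appears under /mnt
  vi /mnt/etc/vfstab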
Re: [zfs-discuss] bigger zfs arc
On Fri, Oct 2, 2009 at 1:45 PM, Rob Logan wrote: >> zfs will use as much memory as is "necessary" but how is "necessary" >> calculated? > > using arc_summary.pl from > http://www.cuddletech.com/blog/pivot/entry.php?id=979 > my tiny system shows: > Current Size: 4206 MB (arcsize) > Target Size (Adaptive): 4207 MB (c) That looks a lot like ~ 4 * 1024 MB. Is this a 64-bit capable system that you have booted from a 32-bit kernel? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
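A couple of quick checks that would answer this (a sketch; the kstat name
assumes the usual zfs:0:arcstats kstat):

  isainfo -kv                      # 32-bit or 64-bit kernel?
  prtconf | grep Memory            # installed RAM
  kstat -p zfs:0:arcstats:c_max    # the ARC's configured upper bound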
Re: [zfs-discuss] dedupe is in
On Mon, Nov 2, 2009 at 7:20 AM, Jeff Bonwick wrote: >> Terrific! Can't wait to read the man pages / blogs about how to use it... > > Just posted one: > > http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup > > Enjoy, and let me know if you have any questions or suggestions for > follow-on posts. > > Jeff On systems with crypto accelerators (particularly Niagara 2) does the hash calculation code use the crypto accelerators, so long as a supported hash is used? Assuming the answer is yes, have performance comparisons been done between weaker hash algorithms implemented in software and sha256 implemented in hardware? I've been waiting very patiently to see this code go in. Thank you for all your hard work (and the work of those that helped too!). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe is in
On Mon, Nov 2, 2009 at 11:58 AM, Dennis Clarke wrote:
>
>>> Terrific! Can't wait to read the man pages / blogs about how to use
>>> it...
>>
>> Just posted one:
>>
>> http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup
>>
>> Enjoy, and let me know if you have any questions or suggestions for
>> follow-on posts.
>
> Looking at FIPS-180-3 in sections 4.1.2 and 4.1.3 I was thinking that the
> major leap from SHA256 to SHA512 was a 32-bit to 64-bit step.
>
> If the implementation of the SHA256 ( or possibly SHA512 at some point )
> algorithm is well threaded then one would be able to leverage those
> massively multi-core Niagara T2 servers. The SHA256 hash is based on six
> 32-bit functions whereas SHA512 is based on six 64-bit functions. The CMT
> Niagara T2 can easily process those 64-bit hash functions and the
> multi-core CMT trend is well established. So long as context switch times
> are very low one would think that IO with a SHA512 based de-dupe
> implementation would be possible and even realistic. That would solve the
> hash collision concern I would think.
>
> Merely thinking out loud here ...

And my out loud thinking on this says that the crypto accelerator on a T2
system does hardware acceleration of SHA256.

  NAME
       n2cp - Ultra-SPARC T2 crypto provider device driver

  DESCRIPTION
       The n2cp device driver is a multi-threaded, loadable hardware
       driver supporting hardware assisted acceleration of the following
       cryptographic operations, which are built into the Ultra-SPARC T2
       CMT processor:

       DES:     CKM_DES_CBC, CKM_DES_ECB
       DES3:    CKM_DES3_CBC, CKM_DES3_ECB,
       AES:     CKM_AES_CBC, CKM_AES_ECB, CKM_AES_CTR
       RC4:     CKM_RC4
       MD5:     CKM_MD5, CKM_MD5_HMAC, CKM_MD5_HMAC_GENERAL, CKM_SSL3_MD5_MAC
       SHA-1:   CKM_SHA_1, CKM_SHA_1_HMAC, CKM_SHA_1_HMAC_GENERAL,
                CKM_SSL3_SHA1_MAC
       SHA-256: CKM_SHA256, CKM_SHA256_HMAC, CKM_SHA256_HMAC_GENERAL

According to page 35 of
http://www.slideshare.net/ramesh_r_nagappan/wirespeed-cryptographic-acceleration-for-soa-and-java-ee-security,
a T2 CPU can do 41 Gb/s of SHA256. The implication here is that this keeps
the MAU's busy but the rest of the core is still idle for things like
compression, TCP, etc.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedup question
On Mon, Nov 2, 2009 at 2:16 PM, Nicolas Williams wrote:
> On Mon, Nov 02, 2009 at 11:01:34AM -0800, Jeremy Kitchen wrote:
>> forgive my ignorance, but what's the advantage of this new dedup over
>> the existing compression option? Wouldn't full-filesystem compression
>> naturally de-dupe?
>
> If you snapshot/clone as you go, then yes, dedup will do little for you
> because you'll already have done the deduplication via snapshots and
> clones. But dedup will give you that benefit even if you don't
> snapshot/clone all your data. Not all data can be managed
> hierarchically, with a single dataset at the root of a history tree.
>
> For example, suppose you want to create two VirtualBox VMs running the
> same guest OS, sharing as much on-disk storage as possible. Before
> dedup you had to: create one VM, then snapshot and clone that VM's VDI
> files, use an undocumented command to change the UUID in the clones,
> import them into VirtualBox, and setup the cloned VM using the cloned
> VDI files. (I know because that's how I manage my VMs; it's a pain,
> really.) With dedup you need only enable dedup and then install the two
> VMs.

The big difference here is when you consider a life cycle that ends long
after provisioning is complete. With clones, the images will diverge. If a
year after you install each VM you decide to do an OS upgrade, they will
still be linked but are quite unlikely to both reference many of the same
blocks. However, with deduplication, the similar changes (e.g. same patch
applied, multiple of the same application installed, upgrade to the same
newer OS) will result in fewer stored copies.

This isn't a big deal if you have 2 VM's. It becomes quite significant if
you have 5000 (e.g. on a ZFS-based file server). Assuming that the deduped
blocks stay deduped in the ARC, it means that it is feasible for every block
that is accessed with any frequency to be in memory. Oh yeah, and you save a
lot of disk space.

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] CIFS shares being lost
On Fri, Nov 20, 2009 at 7:55 PM, Emily Grettel wrote: > Well I took the plunge updating to the latest dev version. (snv_127) and I > don't seem to be able to remotely login via ssh via putty: > > Using username "emilytg". > Authenticating with public key "dsa-pub" from agent > Server refused to allocate pty > Sun Microsystems Inc. SunOS 5.11 snv_127 November 2008 This looks like... http://defect.opensolaris.org/bz/show_bug.cgi?id=12380 But that was supposed to be fixed in snv_126. Can you check /etc/minor_perm for this entry: clone:ptmx 0666 root sys Mike > > Not good :( > > Cheers, > Em > > > From: emilygrettelis...@hotmail.com > To: zfs-discuss@opensolaris.org > Date: Sat, 21 Nov 2009 12:30:45 +1100 > Subject: Re: [zfs-discuss] CIFS shares being lost > >> by a Win7 client was crashing our CIFS server within 5-10 seconds. > Hmmm thats probably it then. Most of our users have been using Windows 7 and > people put their machines on standby when they leave the office for the day. > Maybe this is why we've had issues and having to restart on a daily basis. > It works fine during the day with no downtime. > > Is it safe to update to the latest dev version? I know it creates a BE and > we can revert to 2009.06 later but is there a way of just updating ZFS > instead of downloading 800Mb more and updating the entire OS? > > Cheers, > Em > > >> Date: Fri, 20 Nov 2009 18:14:06 -0700 >> From: edmud...@bounceswoosh.org >> To: emilygrettelis...@hotmail.com >> CC: zfs-discuss@opensolaris.org >> Subject: Re: [zfs-discuss] CIFS shares being lost >> >> On Sat, Nov 21 at 11:41, Emily Grettel wrote: >> > >> > Wow that was mighty quick Tim! >> > >> > Sorry, I have to reboot the server. I can SSH into the box, VNC etc >> > but no CIFS shares are visible. >> >> I found 2009.06 to be unusable for CIFS due to hangs that weren't >> resolved until b114/b116. We had to revert to 2008.11, as any access >> by a Win7 client was crashing our CIFS server within 5-10 seconds. >> >> These were the suspected culprits: >> >> >> http://mail.opensolaris.org/pipermail/indiana-discuss/2009-June/015711.html >> >> though I think there was another issue in b114 that wasn't resolved >> until b116. >> >> Unfortunately, the zpool default version in 2009.06 is 1 iteration >> ahead of the one in 2008.11, so there's no smooth downgrade if you >> created your zpools with a 2009.06 image. We had started with 2008.11 >> and not updated our zpool, so a re-install of 2008.11 worked. >> >> The latest dev branch (b126-or-thereabouts, 2010.02 preview) is >> reportedly good for CIFS based on traffic from this list. >> >> --eric >> >> -- >> Eric D. Mudama >> edmud...@mail.bounceswoosh.org >> > > > Check out The Great Australian Pay Check now Want to know what your boss is > paid? > > View photos of singles in your area! Looking for a date? > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
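For reference, the check suggested above can be done with (a sketch):

  grep '^clone:ptmx' /etc/minor_perm

If the line is missing or shows more restrictive permissions than the entry
quoted in the message, that would line up with sshd being unable to allocate
a pty.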
[zfs-discuss] Best practices for zpools on zfs
Suppose I have a storage server that runs ZFS, presumably providing file
(NFS) and/or block (iSCSI, FC) services to other machines that are running
Solaris. Some of the use will be for LDoms and zones[1], which would create
zpools on top of zfs (fs or zvol). I have concerns about variable block
sizes and the implications for performance.

1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss

Suppose that on the storage server, an NFS shared dataset is created without
tuning the block size. This implies that when the client (ldom or zone v12n
server) runs mkfile or similar to create the backing store for a vdisk or a
zpool, the file on the storage server will be created with 128K blocks. Then
when Solaris or OpenSolaris is installed into the vdisk or zpool, files of a
wide variety of sizes will be created. At this layer they will be created
with variable block sizes (512B to 128K).

The implications for a 512 byte write in the upper level zpool (inside a
zone or ldom) seem to be:

- The 512 byte write turns into a 128 KB write at the storage server (256x
  multiplication in write size).
- To write that 128 KB block, the rest of the block needs to be read to
  recalculate the checksum. That is, a read/modify/write process is forced.
  (Less impact if block already in ARC.)
- Deduplication is likely to be less effective because it is unlikely that
  the same combination of small blocks in different zones/ldoms will be
  packed into the same 128 KB block.

Alternatively, the block size could be forced to something smaller at the
storage server. Setting it to 512 bytes could eliminate the
read/modify/write cycle, but would presumably be less efficient (less
performant) with moderate to large files. Setting it somewhere in between
may be desirable as well, but it is not clear where. The key competition in
this area seems to have a fixed 4 KB block size.

Questions:

Are my basic assumptions correct about a given file consisting only of a
single sized block, except for perhaps the final block?

Has any work been done to identify the performance characteristics in this
area?

Is there less to be concerned about from a performance standpoint if the
workload is primarily read?

To maximize the efficacy of dedup, would it be best to pick a fixed block
size and match it between the layers of zfs?

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
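As an illustration of forcing a smaller block size at the storage server (a
sketch; the pool and dataset names are placeholders and the right values
depend on the client workload):

  # File-based backing store shared over NFS: cap the record size
  zfs create -o recordsize=8k tank/vdisks

  # iSCSI backing store on a zvol: the block size is fixed at creation
  zfs create -V 20g -o volblocksize=4k tank/ldom1-disk0

Note that recordsize can be changed later but only affects files written
afterwards, while volblocksize cannot be changed once the zvol exists.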
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling wrote: > Good question! Additional thoughts below... > > On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: > >> Suppose I have a storage server that runs ZFS, presumably providing >> file (NFS) and/or block (iSCSI, FC) services to other machines that >> are running Solaris. Some of the use will be for LDoms and zones[1], >> which would create zpools on top of zfs (fs or zvol). I have concerns >> about variable block sizes and the implications for performance. >> >> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss >> >> Suppose that on the storage server, an NFS shared dataset is created >> without tuning the block size. This implies that when the client >> (ldom or zone v12n server) runs mkfile or similar to create the >> backing store for a vdisk or a zpool, the file on the storage server >> will be created with 128K blocks. Then when Solaris or OpenSolaris is >> installed into the vdisk or zpool, files of a wide variety of sizes >> will be created. At this layer they will be created with variable >> block sizes (512B to 128K). >> >> The implications for a 512 byte write in the upper level zpool (inside >> a zone or ldom) seems to be: >> >> - The 512 byte write turns into a 128 KB write at the storage server >> (256x multiplication in write size). >> - To write that 128 KB block, the rest of the block needs to be read >> to recalculate the checksum. That is, a read/modify/write process >> is forced. (Less impact if block already in ARC.) >> - Deduplicaiton is likely to be less effective because it is unlikely >> that the same combination of small blocks in different zones/ldoms >> will be packed into the same 128 KB block. >> >> Alternatively, the block size could be forced to something smaller at >> the storage server. Setting it to 512 bytes could eliminate the >> read/modify/write cycle, but would presumably be less efficient (less >> performant) with moderate to large files. Setting it somewhere in >> between may be desirable as well, but it is not clear where. The key >> competition in this area seems to have a fixed 4 KB block size. >> >> Questions: >> >> Are my basic assumptions about a given file consisting only of a >> single sized block, except for perhaps the final block? > > Yes, for a file system dataset. Volumes are fixed block size with > the default being 8 KB. So in the iSCSI over volume case, OOB > it can be more efficient. 4KB matches well with NTFS or some of > the Linux file systems OOB is missing from my TLA translator. Help, please. > >> Has any work been done to identify the performance characteristics in >> this area? > > None to my knowledge. The performance teams know to set the block > size to match the application, so they don't waste time re-learning this. That works great for certain workloads, particularly those with a fixed record size or large sequential I/O. If the workload is "installing then running an operating system" the answer is harder to define. > >> Is there less to be concerned about from a performance standpoint if >> the workload is primarily read? > > Sequential read: yes > Random read: no I was thinking that random wouldn't be too much of a concern either assuming that the things that are commonly read are in cache. I guess this does open the door for a small chunk of useful code in the middle of a largely useless shared library to force lot of that shared library into the ARC, among other things. 
> >> To maximize the efficacy of dedup, would it be best to pick a fixed >> block size and match it between the layers of zfs? > > I don't think we know yet. Until b128 arrives in binary, and folks get > some time to experiment, we just don't have much data... and there > are way too many variables at play to predict. I can make one > prediction, though, dedupe for mkfile or dd if=/dev/zero will scream :-) We already have that optimization with compression. Dedupe just messes up my method of repeatedly writing the same smallish (<1MB) chunk of random or already compressed data to avoid the block-of-zeros compression optimization. Pretty soon filebench is going to need to add statistical methods to mimic the level of duplicate data it is simulating. Trying to write simple benchmarks to test increasingly smart systems looks to be problematic. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On Tue, Nov 24, 2009 at 1:39 PM, Richard Elling wrote: > On Nov 24, 2009, at 11:31 AM, Mike Gerdts wrote: > >> On Tue, Nov 24, 2009 at 9:46 AM, Richard Elling >> wrote: >>> >>> Good question! Additional thoughts below... >>> >>> On Nov 24, 2009, at 6:37 AM, Mike Gerdts wrote: >>> >>>> Suppose I have a storage server that runs ZFS, presumably providing >>>> file (NFS) and/or block (iSCSI, FC) services to other machines that >>>> are running Solaris. Some of the use will be for LDoms and zones[1], >>>> which would create zpools on top of zfs (fs or zvol). I have concerns >>>> about variable block sizes and the implications for performance. >>>> >>>> 1. http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss >>>> >>>> Suppose that on the storage server, an NFS shared dataset is created >>>> without tuning the block size. This implies that when the client >>>> (ldom or zone v12n server) runs mkfile or similar to create the >>>> backing store for a vdisk or a zpool, the file on the storage server >>>> will be created with 128K blocks. Then when Solaris or OpenSolaris is >>>> installed into the vdisk or zpool, files of a wide variety of sizes >>>> will be created. At this layer they will be created with variable >>>> block sizes (512B to 128K). >>>> >>>> The implications for a 512 byte write in the upper level zpool (inside >>>> a zone or ldom) seems to be: >>>> >>>> - The 512 byte write turns into a 128 KB write at the storage server >>>> (256x multiplication in write size). >>>> - To write that 128 KB block, the rest of the block needs to be read >>>> to recalculate the checksum. That is, a read/modify/write process >>>> is forced. (Less impact if block already in ARC.) >>>> - Deduplicaiton is likely to be less effective because it is unlikely >>>> that the same combination of small blocks in different zones/ldoms >>>> will be packed into the same 128 KB block. >>>> >>>> Alternatively, the block size could be forced to something smaller at >>>> the storage server. Setting it to 512 bytes could eliminate the >>>> read/modify/write cycle, but would presumably be less efficient (less >>>> performant) with moderate to large files. Setting it somewhere in >>>> between may be desirable as well, but it is not clear where. The key >>>> competition in this area seems to have a fixed 4 KB block size. >>>> >>>> Questions: >>>> >>>> Are my basic assumptions about a given file consisting only of a >>>> single sized block, except for perhaps the final block? >>> >>> Yes, for a file system dataset. Volumes are fixed block size with >>> the default being 8 KB. So in the iSCSI over volume case, OOB >>> it can be more efficient. 4KB matches well with NTFS or some of >>> the Linux file systems >> >> OOB is missing from my TLA translator. Help, please. > > Out of box. Looky there, it was in my TLA translator after all. Not sure how I missed it the first time. > >>> >>>> Has any work been done to identify the performance characteristics in >>>> this area? >>> >>> None to my knowledge. The performance teams know to set the block >>> size to match the application, so they don't waste time re-learning this. >> >> That works great for certain workloads, particularly those with a >> fixed record size or large sequential I/O. If the workload is >> "installing then running an operating system" the answer is harder to >> define. > > running OSes don't create much work, post boot Agreed, particularly if backups are pushed to the storage server. 
I suspect that most apps that shuffle bits between protocols but do little disk I/O can piggy back on this idea. That is, a J2EE server that just talks to the web and database tier, with some log entries and occasional app deployments should be pretty safe too. > >>>> Is there less to be concerned about from a performance standpoint if >>>> the workload is primarily read? >>> >>> Sequential read: yes >>> Random read: no >> >> I was thinking that random wouldn't be too much of a concern either >> assuming that the things that are commonly read are in cache. I guess >> this does open the door for a small chunk of useful code in the middle >> of a largely useless sh
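A short sketch of the recordsize tuning discussed above, with hypothetical pool and dataset names: for file-backed vdisks the dataset recordsize can be forced down before the backing files are created, while an iSCSI-exported volume gets its block size fixed at creation time via volblocksize.
# zfs create -o recordsize=8k tank/vdisks          # files created here are written with 8 KB blocks
# zfs set recordsize=8k tank/vdisks                # or change it later; only files created afterward are affected
# zfs create -V 20g -o volblocksize=8k tank/vol1   # zvol for iSCSI/FC; volblocksize cannot be changed after creation
Which value actually matches an "installing then running an operating system" workload is the part that, as noted above, still lacks hard performance data.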
Re: [zfs-discuss] ZFS Random Read Performance
On Wed, Nov 25, 2009 at 7:54 AM, Paul Kraus wrote: >> You're peaking at 658 256KB random IOPS for the 3511, or ~66 >> IOPS per drive. Since ZFS will max out at 128KB per I/O, the disks >> see something more than 66 IOPS each. The IOPS data from >> iostat would be a better metric to observe than bandwidth. These >> drives are good for about 80 random IOPS each, so you may be >> close to disk saturation. The iostat data for IOPS and svc_t will >> confirm. > > But ... if I am saturating the 3511 with one thread, then why do I get > many times that performance with multiple threads ? I'm having troubles making sense of the iostat data (I can't tell how many threads at any given point), but I do see lots of times where asvc_t * reads is in the range 850 ms to 950 ms. That is, this is as fast as a single threaded app with a little bit of think time can issue reads (100 reads * 9 ms svc_t + 100 reads * 1 ms think_time = 1 sec). The %busy shows that 90+% of the time there is an I/O in flight (100 reads * 9ms = 900/1000 = 90%). However, %busy isn't aware of how many I/O's could be in flight simultaneously. When you fire up more threads, you are able to have more I/O's in flight concurrently. I don't believe that the I/O's per drive is really a limiting factor at the single threaded case, as the spec sheet for the 3511 says that it has 1 GB of cache per controller. Your working set is small enough that it is somewhat likely that many of those random reads will be served from cache. A dtrace analysis of just how random the reads are would be interesting. I think that hotspot.d from the DTrace Toolkit would be a good starting place. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
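For anyone who wants to attempt that analysis, here are a couple of hedged dtrace(1M) one-liners using the stock io provider (device names will differ; hotspot.d from the DTrace Toolkit remains the more polished option):
# dtrace -n 'io:::start { @[args[1]->dev_statname] = quantize(args[0]->b_blkno); }'
# dtrace -n 'io:::start { @n[args[1]->dev_statname] = count(); } tick-1sec { printa(@n); trunc(@n); }'
The first shows the distribution of block offsets per device (a flat spread suggests genuinely random reads that will defeat the 3511's cache); the second prints per-second I/O counts per device, so the effect of adding threads is directly visible.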
Re: [zfs-discuss] proposal partial/relative paths for zfs(1)
Is there still any interest in this? I've done a bit of hacking (then searched for this thread - I picked -P instead of -c)... $ zfs get -P compression,dedup /var NAME PROPERTY VALUE SOURCE rpool/ROOT/zfstest compression on inherited from rpool/ROOT rpool/ROOT/zfstest dedup off default $ pfexec zfs snapshot -P @now Creating snapshot Of course create/mkdir would make it into the eventual implementation as well. For those missing this thread in their mailboxes, the conversation is archived at http://mail.opensolaris.org/pipermail/zfs-discuss/2008-July/019762.html. Mike
On Thu, Jul 10, 2008 at 4:42 AM, Darren J Moffat wrote: > I regularly create new zfs filesystems or snapshots and I find it > annoying that I have to type the full dataset name in all of those cases. > > I propose we allow zfs(1) to infer the part of the dataset name up to the > current working directory. For example: > > Today: > > $ zfs create cube/builds/darrenm/bugs/6724478 > > With this proposal: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs create 6724478 > > Both of these would result in a new dataset cube/builds/darrenm/bugs/6724478 > > This will need some careful thought about how to deal with cases like this: > > $ pwd > /cube/builds/ > $ zfs create 6724478/test > > What should that do ? should it create cube/builds/6724478 and > cube/builds/6724478/test ? Or should it fail ? -p already provides > some capabilities in this area. > > Maybe the easiest way out of the ambiguity is to add a flag to zfs > create for the partial dataset name eg: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs create -c 6724478 > > Why "-c" ? -c for "current directory" "-p" partial is already taken to > mean "create all non existing parents" and "-r" relative is already used > consistently as "recurse" in other zfs(1) commands (as well as lots of > other places). > > Alternately: > > $ pwd > /cube/builds/darrenm/bugs > $ zfs mkdir 6724478 > > Which would act like mkdir does (including allowing a -p and -m flag > with the same meaning as mkdir(1)) but creates datasets instead of > directories. > > Thoughts ? Is this useful for anyone else ? My above examples are some > of the shorter dataset names I use, ones in my home directory can be > even deeper. > > -- > Darren J Moffat > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best practices for zpools on zfs
On Thu, Nov 26, 2009 at 8:53 PM, Toby Thain wrote: > > On 26-Nov-09, at 8:57 PM, Richard Elling wrote: > >> On Nov 26, 2009, at 1:20 PM, Toby Thain wrote: >>> >>> On 25-Nov-09, at 4:31 PM, Peter Jeremy wrote: >>> >>>> On 2009-Nov-24 14:07:06 -0600, Mike Gerdts wrote: >>>>> >>>>> ... fill a 128 >>>>> KB buffer with random data then do bitwise rotations for each >>>>> successive use of the buffer. Unless my math is wrong, it should >>>>> allow 128 KB of random data to be used to write 128 GB of data with very >>>>> little deduplication or compression. A much larger data set could be >>>>> generated with the use of a 128 KB linear feedback shift register... >>>> >>>> This strikes me as much harder to use than just filling the buffer >>>> with 8/32/64-bit random numbers >>> >>> I think Mike's reasoning is that a single bit shift (and propagation) is >>> cheaper than generating a new random word. After the whole buffer is >>> shifted, you have a new very-likely-unique block. (This seems like overkill >>> if you know the dedup unit size in advance.) >> >> You should be able to get a unique block by shifting one word, as long >> as the shift doesn't duplicate the word. > > That is true, but you will run out of permutations sooner. Rather than shifting a word, you could just increment it. In a multi-threaded test, each thread picks the word corresponding to the thread that is executing. Assuming 32-bit words (64-bit is overkill), this allows up to 128 threads with 512 byte blocks. It also allows up to 2 TB per thread per 512 bytes in a block. That is, if 50 threads are used and the block size is 8 KB, there should be no duplicates in 2 * 50 * 8192 / 512 = 1600 TB. But... this leads us back to the point that the workload generators are too good at generating unique data. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
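A single-threaded sketch of that incrementing-word generator as a perl one-liner (block size, block count, and output path are arbitrary choices): the buffer is filled with random bytes once, then a 32-bit counter is stamped into the first word before each write, so no two blocks are identical and the payload stays essentially incompressible. A multi-threaded generator would stamp thread N's counter into word N, as described above.
$ perl -e '$bs=8192; $n=1024; $buf .= chr(int(rand(256))) for 1..$bs;
    for $i (0..$n-1) { substr($buf,0,4) = pack("N",$i); print $buf }' > /pool/fs/unique-data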
Re: [zfs-discuss] ZFS send | verify | receive
On Sat, Dec 5, 2009 at 11:32 AM, Bob Friesenhahn wrote: > On Sat, 5 Dec 2009, dick hoogendijk wrote: > >> On Sat, 2009-12-05 at 09:22 -0600, Bob Friesenhahn wrote: >> >>> You can also stream into a gzip or lzop wrapper in order to obtain the >>> benefit of incremental CRCs and some compression as well. >> >> Can you give an example command line for this option please? > > Something like > > zfs send mysnapshot | gzip -c -3 > /somestorage/mysnap.gz > > should work nicely. Zfs send sends to its standard output so it is just a > matter of adding another filter program on its output. This could be > streamed over ssh or some other streaming network transfer protocol. > > Later, you can do 'gzip -t mysnap.gz' on the machine where the snapshot file > is stored to verify that it has not been corrupted in storage or transfer. > > lzop (not part of Solaris) is much faster than gzip but can be used in a > similar way since it is patterned after gzip. It seems as though a similar filter could be written to compute and inject an error correcting code into the stream. That is: zfs send $snap | ecc -i > /somestorage/mysnap.ecc ecc -o < /somestorage/mysnap.ecc | zfs receive ... I'm not aware of an existing ecc program, but I can't imagine it would be hard to create one. There seems to already be an implementation of Reed-Solomon encoding in ON that could likely be used as a starting point. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
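If writing a dedicated ecc filter is more effort than it is worth, an existing tool such as par2(1) from the par2cmdline package (not part of Solaris) can add recovery data alongside the stored stream. Something along these lines should work; the 10% redundancy figure is only an example:
$ zfs send mypool/fs@snap | gzip -c -3 > /somestorage/mysnap.gz
$ par2 create -r10 /somestorage/mysnap.gz.par2 /somestorage/mysnap.gz   # create 10% recovery data
$ par2 verify /somestorage/mysnap.gz.par2                               # detect corruption later
$ par2 repair /somestorage/mysnap.gz.par2                               # rebuild mysnap.gz if it is damaged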
Re: [zfs-discuss] ZFS - how to determine which physical drive to replace
On Sat, Dec 12, 2009 at 9:58 AM, Edward Ned Harvey wrote: > I would suggest something like this: While the system is still on, if the > failed drive is at least writable *a little bit* … then you can “dd > if=/dev/zero of=/dev/rdsk/FailedDiskDevice bs=1024 count=1024” … and then > after the system is off, you could plug the drives into another system > one-by-one, and read the first 1M, and see if it’s all zeros. (Or instead > of dd zero, you could echo some text onto the drive, or whatever you think > is easiest.) > How about reading instead? dd if=/dev/rdsk/$whatever of=/dev/null If the failed disk generates I/O errors that prevent it from reading at a rate that causes an LED to blink, you could read from all of the good disks. The one that doesn't blink is the broken one. You can also get the drive serial number with iostat -En: $ iostat -En c3d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 Model: Hitachi HTS5425 Revision: Serial No: 080804BB6300HCG Size: 160.04GB <160039305216 bytes> Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 Illegal Request: 0 ... That /should/ be printed on the disk somewhere. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
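A sketch of the "read from all of the good disks" idea, with example device names only; each dd should keep its disk's activity LED lit, so the drive that stays dark is the suspect (adjust the slice names to the actual labels in use):
# for d in c0t0d0 c0t1d0 c0t2d0 c0t3d0; do dd if=/dev/rdsk/${d}s0 of=/dev/null bs=1024k count=2048 & done
# wait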
Re: [zfs-discuss] compressratio vs. dedupratio
On Mon, Dec 14, 2009 at 3:54 PM, Craig S. Bell wrote: > I am also accustomed to seeing diluted properties such as compressratio. > IMHO it could be useful (or perhaps just familiar) to see a diluted dedup > ratio for the pool, or maybe see the size / percentage of data used to arrive > at dedupratio. > > As Jeff points out, there is enough data available to calculate this. Would > it be meaningful enough to present a diluted ratio property? IOW, would that > tell me anything than I don't get from simply using "available" as my fuel > gauge? > > This is probably a larger topic: What additional statistics would be > genuinely useful to the admin when there is space interaction between > datasets. As we have seen, some commands are less objective with dedup: I was recently confused when doing mkfile (or was it dd if=/dev/zero ...) and found that even though blocks were compressed away to nothing, the compressratio did not increase. For example: # perl -e 'print "a" x 1' > /test/a # zfs get compressratio test NAME PROPERTY VALUE SOURCE test compressratio 7.87x - However if I put null characters into the same file: # dd if=/dev/zero of=a bs=1 count=1 1+0 records in 1+0 records out # zfs get compressratio test NAME PROPERTY VALUE SOURCE test compressratio 1.00x - I understand that a block is not allocated if it contains all zero's, but that would seem to contribute to a higher compressratio rather than a lower compressratio. If I disable compression and enable dedup, does it count deduplicated blocks of zeros toward the dedupratio? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compressratio vs. dedupratio
On Tue, Dec 15, 2009 at 2:31 AM, Craig S. Bell wrote: > Mike, I believe that ZFS treats runs of zeros as holes in a sparse file, > rather than as regular data. So they aren't really present to be counted for > compressratio. > > http://blogs.sun.com/bonwick/entry/seek_hole_and_seek_data > http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/017565.html But it only does so when compression is enabled, as such I would expect that compression would claim this as a win. Without it, someone may assume that they aren't getting much benefit from compression, turn it off, then run into problems down the road because sparseness that develops in files never turns into free space. Also, I would expect that: - If a file is created via a write to every block, it should be accounted for as non-sparse (regardless of compression=) - If a file is sparse because the program that created the file used seek() or similar to skip past blocks, it should be accounted for as sparse (regardless of compression). - If a program overwrites a block in an existing file with zeros, the file should not be considered sparse. In the below example, I would expect that writing 100 MB of '\0' would contribute as much to compressratio as 100 MB of 'a'. Notice that a block of zeros does not turn into a sparse file with compression=off.
# zfs create test/on # zfs create test/off # zfs set compression=off test/off # zfs get compression test/on test/off NAME PROPERTY VALUE SOURCE test/off compression off local test/on compression on inherited from test # mkfile 100m on/100m off/100m # ls -l o*/100m -rw------T 1 root root 104857600 Dec 15 14:27 off/100m -rw------T 1 root root 104857600 Dec 15 14:27 on/100m # du -h o*/100m 100M off/100m 0K on/100m # perl -e 'print "a" x 100000000' > on/a # perl -e 'print "a" x 100000000' > off/a # sync # ls -l */a -rw-r--r-- 1 root root 100000000 Dec 15 14:35 off/a -rw-r--r-- 1 root root 100000000 Dec 15 14:35 on/a # du -h */a 95M off/a 3.4M on/a # zfs get compressratio test/on test/off NAME PROPERTY VALUE SOURCE test/off compressratio 1.00x - test/on compressratio 28.27x - -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Zones on shared storage - a warning
I've been playing around with zones on NFS a bit and have run into what looks to be a pretty bad snag - ZFS keeps seeing read and/or checksum errors. This exists with S10u8 and OpenSolaris dev build snv_129. This is likely a blocker for anyone thinking of implementing parts of Ed's Zones on Shared Storage: http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss The OpenSolaris example appears below. The order of events is: 1) Create a file on NFS, turn it into a zpool 2) Configure a zone with the pool as zonepath 3) Install the zone, verify that the pool is healthy 4) Boot the zone, observe that the pool is sick
root@soltrain19# mount filer:/path /mnt root@soltrain19# cd /mnt root@soltrain19# mkdir osolzone root@soltrain19# mkfile -n 8g root root@soltrain19# zpool create -m /zones/osol osol /mnt/osolzone/root root@soltrain19# zonecfg -z osol osol: No such zone configured Use 'create' to begin configuring a new zone. zonecfg:osol> create zonecfg:osol> info zonename: osol zonepath: brand: ipkg autoboot: false bootargs: pool: limitpriv: scheduling-class: ip-type: shared hostid: zonecfg:osol> set zonepath=/zones/osol zonecfg:osol> set autoboot=false zonecfg:osol> verify zonecfg:osol> commit zonecfg:osol> exit root@soltrain19# chmod 700 /zones/osol root@soltrain19# zoneadm -z osol install Publisher: Using opensolaris.org (http://pkg.opensolaris.org/dev/ http://pkg-na-2.opensolaris.org/dev/). Publisher: Using contrib (http://pkg.opensolaris.org/contrib/). Image: Preparing at /zones/osol/root. Cache: Using /var/pkg/download. Sanity Check: Looking for 'entire' incorporation. Installing: Core System (output follows) DOWNLOAD PKGS FILES XFER (MB) Completed 46/46 12334/12334 93.1/93.1 PHASE ACTIONS Install Phase 18277/18277 No updates necessary for this image. Installing: Additional Packages (output follows) DOWNLOAD PKGS FILES XFER (MB) Completed 36/36 3339/3339 21.3/21.3 PHASE ACTIONS Install Phase 4466/4466 Note: Man pages can be obtained by installing SUNWman Postinstall: Copying SMF seed repository ... done. Postinstall: Applying workarounds. Done: Installation completed in 2139.186 seconds. Next Steps: Boot the zone, then log into the zone console (zlogin -C) to complete the configuration process.
6.3 Boot the OpenSolaris zone root@soltrain19# zpool status osol pool: osol state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM osol ONLINE 0 0 0 /mnt/osolzone/root ONLINE 0 0 0 errors: No known data errors root@soltrain19# zoneadm -z osol boot root@soltrain19# zpool status osol pool: osol state: DEGRADED status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM osol DEGRADED 0 0 0 /mnt/osolzone/root DEGRADED 0 0 117 too many errors errors: No known data errors root@soltrain19# zlogin osol uptime 5:31pm up 1 min(s), 0 users, load average: 0.69, 0.38, 0.52 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Zones on shared storage - a warning
On Tue, Dec 22, 2009 at 8:02 PM, Mike Gerdts wrote: > I've been playing around with zones on NFS a bit and have run into > what looks to be a pretty bad snag - ZFS keeps seeing read and/or > checksum errors. This exists with S10u8 and OpenSolaris dev build > snv_129. This is likely a blocker for anything thinking of > implementing parts of Ed's Zones on Shared Storage: > > http://hub.opensolaris.org/bin/view/Community+Group+zones/zoss > > The OpenSolaris example appears below. The order of events is: > > 1) Create a file on NFS, turn it into a zpool > 2) Configure a zone with the pool as zonepath > 3) Install the zone, verify that the pool is healthy > 4) Boot the zone, observe that the pool is sick [snip] An off list conversation and a bit of digging into other tests I have done shows that this is likely limited to NFSv3. I cannot say that this problem has been seen with NFSv4. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 1:40 PM, Richard Elling wrote: > On Dec 30, 2009, at 10:53 AM, Andras Spitzer wrote: > >> Devzero, >> >> Unfortunately that was my assumption as well. I don't have source level >> knowledge of ZFS, though based on what I know it wouldn't be an easy way to >> do it. I'm not even sure it's only a technical question, but a design >> question, which would make it even less feasible. > > It is not hard, because ZFS knows the current free list, so walking that > list > and telling the storage about the freed blocks isn't very hard. > > What is hard is figuring out if this would actually improve life. The > reason > I say this is because people like to use snapshots and clones on ZFS. > If you keep snapshots, then you aren't freeing blocks, so the free list > doesn't grow. This is a very different use case than UFS, as an example. It seems as though the oft mentioned block rewrite capabilities needed for pool shrinking and changing things like compression, encryption, and deduplication would also show benefit here. That is, blocks would be re-written in such a way to minimize the number of chunks of storage that is allocated. The current HDS chunk size is 42 MB. The most benefit would seem to be to have ZFS make a point of reusing old but freed blocks before doing an allocation that causes the back-end storage to allocate another chunk of disk to the thin-provisioned. While it is important to be able to roll back a few transactions in the event of some widely discussed failure modes, it is probably reasonable to reuse a block freed by a txg that is 3,000 txg's old (about 1 day old if 1 txg per 30 seconds). Such a threshold could be used to determine whether to reuse a block or venture into previously untouched regions of the disk. This strategy would allow the SAN administrator (who is a different person than the sysadmin) to allocate extra space to servers and the sysadmin can control the amount of space really used by quotas. In the event that there is an emergency need for more space, the sysadmin can increase the quota and allow more of the allocate SAN space to be used. Assuming the block rewrite feature comes to fruition, this emergency growth could be shrunk back down to the original size once the surge in demand (or errant process) subsides. > > There are a few minor bumps in the road. The ATA PASSTHROUGH > command, which allows TRIM to pass through the SATA drivers, was > just integrated into b130. This will be more important to small servers > than SANs, but the point is that all parts of the software stack need to > support the effort. As such, it is not clear to me who, if anyone, inside > Sun is champion for the effort -- it crosses multiple organizational > boundaries. > >> >> Apart from the technical possibilities, this feature looks really >> inevitable to me in the long run especially for enterprise customers with >> high-end SAN as cost is always a major factor in a storage design and it's a >> huge difference if you have to pay based on the space used vs space >> allocated (for example). > > If the high cost of SAN storage is the problem, then I think there are > better ways to solve that :-) The "SAN" could be an OpenSolaris device serving LUNs through COMSTAR. If those LUNs are used to hold a zpool, the zpool could notify the LUN that blocks are no longer used and the "SAN" could reclaim those blocks. 
This is just a variant of the same problem faced with expensive SAN devices that have thin provisioning allocation units measured in the tens of megabytes instead of hundreds to thousands of kilobytes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Thin device support in ZFS?
On Wed, Dec 30, 2009 at 3:12 PM, Richard Elling wrote: > If the allocator can change, what sorts of policies should be > implemented? Examples include: > + should the allocator stick with best-fit and encourage more > gangs when the vdev is virtual? > + should the allocator be aware of an SSD's page size? Is > said page size available to an OS? > + should the metaslab boundaries align with virtual storage > or SSD page boundaries? Wandering off topic a little bit... Should the block size be a tunable so that page size of SSD (typically 4K, right?) and upcoming hard disks that sport a sector size > 512 bytes? http://arc.opensolaris.org/caselog/PSARC/2008/769/final_spec.txt > And, perhaps most important, how can this be done automatically > so that system administrators don't have to be rocket scientists > to make a good choice? Didn't you read the marketing literature? ZFS is easy because you only need to know two commands: zpool and zfs. If you just ignore all the subcommands, options to those subcommands, evil tuning that is sometimes needed, and effects of redundancy choices then there is no need for any rocket scientists. :) -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Clearing a directory with more than 60 million files
On Tue, Jan 5, 2010 at 4:34 AM, Mikko Lammi wrote: > Hello, > > As a result of one badly designed application running loose for some time, > we now seem to have over 60 million files in one directory. Good thing > about ZFS is that it allows it without any issues. Unfortunatelly now that > we need to get rid of them (because they eat 80% of disk space) it seems > to be quite challenging. > > Traditional approaches like "find ./ -exec rm {} \;" seem to take forever > - after running several days, the directory size still says the same. The > only way how I've been able to remove something has been by giving "rm > -rf" to problematic directory from parent level. Running this command > shows directory size decreasing by 10,000 files/hour, but this would still > mean close to ten months (over 250 days) to delete everything! > > I also tried to use "unlink" command to directory as a root, as a user who > created the directory, by changing directory's owner to root and so forth, > but all attempts gave "Not owner" error. > > Any commands like "ls -f" or "find" will run for hours (or days) without > actually listing anything from the directory, so I'm beginning to suspect > that maybe the directory's data structure is somewhat damaged. Is there > some diagnostics that I can run with e.g "zdb" to investigate and > hopefully fix for a single directory within zfs dataset? In situations like this, ls will be exceptionally slow partially because it will sort the output. Find is slow because it needs to call lstat() on every entry. In similar situations I have found the following to work. perl -e 'opendir(D, "."); while ( $d = readdir(D) ) { print "$d\n" }' Replace print with unlink if you wish... > > To make things even more difficult, this directory is located in rootfs, > so dropping the zfs filesystem would basically mean reinstalling the > entire system, which is something that we really wouldn't wish to go. > > > OS is Solaris 10, zpool version is 10 (rather old, I know, but is there > easy path for upgrade that might solve this problem?) and the zpool > consists two 146 GB SAS drivers in a mirror setup. > > > Any help would be appreciated. > > Thanks, > Mikko > > -- > Mikko Lammi | l...@lmmz.net | http://www.lmmz.net > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
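A slightly fleshed-out version of that one-liner which skips '.' and '..', unlinks as it goes, and reports progress to stderr every 100,000 files (the interval is arbitrary):
perl -e 'opendir(D, "."); while ($d = readdir(D)) { next if $d eq "." || $d eq ".."; unlink($d) or warn "$d: $!\n"; print STDERR "$n\n" if ++$n % 100000 == 0 }'
Run it from inside the problem directory; because it never builds the whole file list in memory and never calls lstat(), it avoids the costs that make ls and find crawl in this situation.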
Re: [zfs-discuss] Zones on shared storage - a warning
e error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h0m with 0 errors on Thu Jan 7 21:56:47 2010 config: NAME STATE READ WRITE CKSUM nfszone ONLINE 0 0 0 /nfszone/root ONLINE 0 0 109 errors: No known data errors I'm confused as to why this pool seems to be quite usable even with so many checksum errors. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:55 AM, Darren J Moffat wrote: > Frank Batschulat (Home) wrote: >> >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in >> NFS ? > > What are you using for on the wire protection with NFS ? Is it shared using > krb5i or do you have IPsec configured ? If not I'd recommend trying one of > those and see if your symptoms change. Shouldn't a scrub pick that up? Why would there be no errors from "zoneadm install", which under the covers does a pkg image create followed by *multiple* pkg install invocations. No checksum errors pop up there. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 6:51 AM, James Carlson wrote: > Frank Batschulat (Home) wrote: >> This just can't be an accident, there must be some coincidence and thus >> there's a good chance >> that these CHKSUM errors must have a common source, either in ZFS or in NFS ? > > One possible cause would be a lack of substantial exercise. The man > page says: > > A regular file. The use of files as a backing store is > strongly discouraged. It is designed primarily for > experimental purposes, as the fault tolerance of a file > is only as good as the file system of which it is a > part. A file must be specified by a full path. > > Could it be that "discouraged" and "experimental" mean "not tested as > thoroughly as you might like, and certainly not a good idea in any sort > of production environment?" > > It sounds like a bug, sure, but the fix might be to remove the option. This unsupported feature is supported with the use of Sun Ops Center 2.5 when a zone is put on a "NAS Storage Library". -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
STATE READ WRITE CKSUM > nfszone DEGRADED 0 0 0 > /nfszone DEGRADED 0 0 462 too many errors > > errors: No known data errors > > == > > now compare this with Mike's error output as posted here: > > http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg33041.html > > # fmdump -eV | grep cksum_actual | sort | uniq -c | sort -n | tail > > 2 cksum_actual = 0x14c538b06b6 0x2bb571a06ddb0 0x3e05a7c4ac90c62 > 0x290cbce13fc59dce > *D 3 cksum_actual = 0x175bb95fc00 0x1767673c6fe00 0xfa9df17c835400 > 0x7e0aef335f0c7f00 > *E 3 cksum_actual = 0x2eb772bf800 0x5d8641385fc00 0x7cf15b214fea800 > 0xd4f1025a8e66fe00 > *B 4 cksum_actual = 0x0 0x0 0x0 0x0 > 4 cksum_actual = 0x1d32a7b7b00 0x248deaf977d80 0x1e8ea26c8a2e900 > 0x330107da7c4bcec0 > 5 cksum_actual = 0x14b8f7afe6 0x915db8d7f87 0x205dc7979ad73 > 0x4e0b3a8747b8a8 > *C 6 cksum_actual = 0x1184cb07d00 0xd2c5aab5fe80 0x69ef5922233f00 > 0x280934efa6d20f40 > *A 6 cksum_actual = 0x348e6117700 0x765aa1a547b80 0xb1d6d98e59c3d00 > 0x89715e34fbf9cdc0 > *F 16 cksum_actual = 0xbaddcafe00 0x5dcc54647f00 0x1f82a459c2aa00 > 0x7f84b11b3fc7f80 > *G 48 cksum_actual = 0x5d6ee57f00 0x178a70d27f80 0x3fc19c3a19500 > 0x82804bc6ebcfc0 > > and observe that the values in 'chksum_actual' causing our CHKSUM pool errors > eventually > because of missmatching with what had been expected are the SAME ! for 2 > totally > different client systems and 2 different NFS servers (mine vrs. Mike's), > see the entries marked with *A to *G. > > This just can't be an accident, there must be some coincidence and thus > there's a good chance > that these CHKSUM errors must have a common source, either in ZFS or in NFS ? You saved me so much time with this observation. Thank you! -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 9:11 AM, Mike Gerdts wrote: > I've seen similar errors on Solaris 10 in the primary domain and on a > M4000. Unfortunately Solaris 10 doesn't show the checksums in the > ereport. There I noticed a mixture between read errors and checksum > errors - and lots more of them. This could be because the S10 zone > was a full root SUNWCXall compared to the much smaller default ipkg > branded zone. On the primary domain running Solaris 10... I've written a dtrace script to get the checksums on Solaris 10. Here's what I see with NFSv3 on Solaris 10. # zoneadm -z zone1 halt ; zpool export pool1 ; zpool import -d /mnt/pool1 pool1 ; zoneadm -z zone1 boot ; sleep 30 ; pkill dtrace # ./zfs_bad_cksum.d Tracing... dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x301b363a000) in action #4 at DIF offset 20 dtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3037f746000) in action #4 at DIF offset 20 cccdtrace: error on enabled probe ID 9 (ID 43443: fbt:zfs:zio_checksum_error:return): invalid address (0x3026e7b) in action #4 at DIF offset 20 cc Checksum errors: 3 : 0x130e01011103 0x20108 0x0 0x400 (fletcher_4_native) 3 : 0x220125cd8000 0x62425980c08 0x16630c08296c490c 0x82b320c082aef0c (fletcher_4_native) 3 : 0x2f2a0a202a20436f 0x7079726967687420 0x2863292032303031 0x2062792053756e20 (fletcher_4_native) 3 : 0x3c21444f43545950 0x452048544d4c2050 0x55424c494320222d 0x2f2f5733432f2f44 (fletcher_4_native) 3 : 0x6005a8389144 0xc2080e6405c200b6 0x960093d40800 0x9eea007b9800019c (fletcher_4_native) 3 : 0xac044a6903d00163 0xa138c8003446 0x3f2cd1e100b10009 0xa37af9b5ef166104 (fletcher_4_native) 3 : 0xbaddcafebaddcafe 0xc 0x0 0x0 (fletcher_4_native) 3 : 0xc4025608801500ff 0x1018500704528210 0x190103e50066 0xc34b90001238f900 (fletcher_4_native) 3 : 0xfe00fc01fc42fc42 0xfc42fc42fc42fc42 0xfffc42fc42fc42fc 0x42fc42fc42fc42fc (fletcher_4_native) 4 : 0x4b2a460a 0x0 0x4b2a460a 0x0 (fletcher_4_native) 4 : 0xc00589b159a00 0x543008a05b673 0x124b60078d5be 0xe3002b2a0b605fb3 (fletcher_4_native) 4 : 0x130e010111 0x32000b301080034 0x10166cb34125410 0xb30c19ca9e0c0860 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x104381285501102 0x418016996320408 (fletcher_4_native) 4 : 0x130e010111 0x3a201080038 0x1043812c5501102 0x81802325c080864 (fletcher_4_native) 4 : 0x130e010111 0x3a0001c01080038 0x1383812c550111c 0x818975698080864 (fletcher_4_native) 4 : 0x1f81442e9241000 0x2002560880154c00 0xff10185007528210 0x19010003e566 (fletcher_4_native) 5 : 0xbab10c 0xf 0x53ae 0xdd549ae39aa1ba20 (fletcher_4_native) 5 : 0x130e010111 0x3ab01080038 0x1163812c550110b 0x8180a7793080864 (fletcher_4_native) 5 : 0x61626300 0x0 0x0 0x0 (fletcher_4_native) 5 : 0x8003 0x3df0d6a1 0x0 0x0 (fletcher_4_native) 6 : 0xbab10c 0xf 0x5384 0xdd549ae39aa1ba20 (fletcher_4_native) 7 : 0xbab10c 0xf 0x0 0x9af5e5f61ca2e28e (fletcher_4_native) 7 : 0x130e010111 0x3a201080038 0x104381265501102 0xc18c7210c086006 (fletcher_4_native) 7 : 0x275c222074650a2e 0x5c222020436f7079 0x7269676874203139 0x38392041540a2e5c (fletcher_4_native) 8 : 0x130e010111 0x3a0003101080038 0x1623812c5501131 0x8187f66a4080864 (fletcher_4_native) 9 : 0x8a000801010c0682 0x2eed0809c1640513 0x70200ff00026424 0x18001d16101f0059 (fletcher_4_native) 12 : 0xbab10c 0xf 0x0 0x45a9e1fc57ca2aa8 (fletcher_4_native) 30 : 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe 0xbaddcafebaddcafe (fletcher_4_native) 47 : 0x0 0x0 0x0 0x0 (fletcher_4_native) 92 : 0x130e01011103 0x10108 0x0 0x200 
(fletcher_4_native) Since I had to guess at what the Solaris 10 source looks like, some extra eyeballs on the dtrace script is in order. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ zfs_bad_cksum.d Description: Binary data ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [zones-discuss] Zones on shared storage - a warning
On Fri, Jan 8, 2010 at 12:28 PM, Torrey McMahon wrote: > On 1/8/2010 10:04 AM, James Carlson wrote: >> Mike Gerdts wrote: >>> >>> This unsupported feature is supported with the use of Sun Ops Center >>> 2.5 when a zone is put on a "NAS Storage Library". >>> >> >> Ah, ok. I didn't know that. >> >> > Does anyone know how that works? I can't find it in the docs, no one inside of Sun seemed to have a clue when I asked around, etc. RTFM gladly taken. Storage libraries are discussed very briefly at: http://wikis.sun.com/display/OC2dot5/Storage+Libraries Creation of zones is discussed at: http://wikis.sun.com/display/OC2dot5/Creating+Zones I've found no documentation that explains the implementation details. From looking at a test environment that I have running, it seems to go like: 1. The storage admin carves out some NFS space and exports it with the appropriate options to the various hosts (global zones). 2. In the Ops Center BUI, the ops center admin creates a new storage library. He selects type NFS and specifies the hostname and path that was allocated. 3. The ops center admin associates the storage library with various hosts. This causes it to be mounted at /var/mnt/virtlibs/ on those hosts. I'll call this $libmnt. 4. When the sysadmin provisions a zone through ops center, a UUID is allocated and associated with this zone. I'll call it $zuuid. A directory $libmnt/$zuuid is created with a set of directories under it. 5. As the sysadmin provisions the zone, ops center prompts for the virtual disk size. A file of that size is created at $libmnt/$zuuid/virtdisk/data. 6. Ops center creates a zpool: zpool create -m /var/mnt/oc-zpools/$zuuid/ z$zuuid \ $libmnt/$zuuid/virtdisk/data 7. The zonepath dataset is created using a UUID that is unique to the zonepath ($puuid): z$zuuid/$puuid. It has a quota and a reservation set (8G each in the zpool history I am looking at). 8. The zone is configured with zonepath=/var/mnt/oc-zpools/$zuuid/$puuid, then installed. Just in case anyone sees this as the right way to do things, I think it is generally OK with a couple of caveats. The key areas that I would suggest for improvement are: - Mount the NFS space with -o forcedirectio. There is no need to cache data twice. - Never use UUIDs in paths. This makes it nearly impossible for a sysadmin or a support person to look at the output of commands on the system and understand what it is doing. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
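For anyone who wants to reproduce roughly that layout by hand, a sketch follows; the hostname, paths, and sizes are illustrative and it folds in the forcedirectio suggestion, so it is not a transcript of what Ops Center actually runs:
# mkdir -p /var/mnt/virtlibs
# mount -F nfs -o forcedirectio filer:/export/zonelib /var/mnt/virtlibs
# mkdir -p /var/mnt/virtlibs/myzone/virtdisk
# mkfile 10g /var/mnt/virtlibs/myzone/virtdisk/data
# zpool create -m /var/mnt/oc-zpools/myzone zmyzone /var/mnt/virtlibs/myzone/virtdisk/data
# zfs create -o quota=8g -o reservation=8g zmyzone/zonepath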
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Sat, Jan 16, 2010 at 5:31 PM, Toby Thain wrote: > On 16-Jan-10, at 7:30 AM, Edward Ned Harvey wrote: > >>> I am considering building a modest sized storage system with zfs. Some >>> of the data on this is quite valuable, some small subset to be backed >>> up "forever", and I am evaluating back-up options with that in mind. >> >> You don't need to store the "zfs send" data stream on your backup media. >> This would be annoying for the reasons mentioned - some risk of being able >> to restore in future (although that's a pretty small risk) and inability >> to >> restore with any granularity, i.e. you have to restore the whole FS if you >> restore anything at all. >> >> A better approach would be "zfs send" and pipe directly to "zfs receive" >> on >> the external media. This way, in the future, anything which can read ZFS >> can read the backup media, and you have granularity to restore either the >> whole FS, or individual things inside there. > > There have also been comments about the extreme fragility of the data stream > compared to other archive formats. In general it is strongly discouraged for > these purposes. > Yet it is used in ZFS flash archives on Solaris 10 and are slated for use in the successor to flash archives. This initial proposal seems to imply using the same mechanism for a system image backup (instead of just system provisioning). http://mail.opensolaris.org/pipermail/caiman-discuss/2010-January/015909.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup memory overhead
On Thu, Jan 21, 2010 at 2:51 PM, Andrey Kuzmin wrote: > Looking at dedupe code, I noticed that on-disk DDT entries are > compressed less efficiently than possible: key is not compressed at > all (I'd expect roughly 2:1 compression ratio with sha256 data), A cryptographic hash such as sha256 should not be compressible. A trivial example shows this to be the case: for i in {1..10000} ; do echo $i | openssl dgst -sha256 -binary ; done > /tmp/sha256 $ cd /tmp $ gzip -c sha256 > sha256.gz $ compress -c sha256 > sha256.Z $ bzip2 -c sha256 > sha256.bz2 $ ls -go sha256* -rw-r--r-- 1 320000 Jan 22 04:13 sha256 -rw-r--r-- 1 428411 Jan 22 04:14 sha256.Z -rw-r--r-- 1 321846 Jan 22 04:14 sha256.bz2 -rw-r--r-- 1 320068 Jan 22 04:14 sha256.gz -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs send/receive as backup - reliability?
On Thu, Jan 21, 2010 at 11:28 AM, Richard Elling wrote: > On Jan 21, 2010, at 3:55 AM, Julian Regel wrote: >> >> Until you try to pick one up and put it in a fire safe! >> >> >Then you backup to tape from x4540 whatever data you need. >> >In case of enterprise products you save on licensing here as you need a one >> >client license per x4540 but in fact can >backup data from many clients >> >which are there. >> >> Which brings up full circle... >> >> What do you then use to backup to tape bearing in mind that the Sun-provided >> tools all have significant limitations? > > Poor choice of words. Sun resells NetBackup and (IIRC) that which was > formerly called NetWorker. Thus, Sun does provide enterprise backup > solutions. (Symantec nee Veritas) NetBackup and (EMC nee Legato) Networker are different products that compete in the enterprise backup space. Under the covers NetBackup uses gnu tar to gather file data for the backup stream. At one point (maybe still the case), one of the claimed features of netbackup is that if a tape is written without multiplexing, you can use gnu tar to extract data. This seems to be most useful when you need to recover master and/or media servers and to be able to extract your data after you no longer use netbackup. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk wrote: > Is there a way to zero out unused blocks in a pool? I'm looking for ways to > shrink the size of an opensolaris virtualbox VM and > using the compact subcommand will remove zero'd sectors. I've long suspected that you should be able to just use mkfile or "dd if=/dev/zero ..." to create a file that consumes most of the free space then delete that file. Certainly it is not an ideal solution, but seems quite likely to be effective. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Sat, Jan 23, 2010 at 11:55 AM, John Hoogerdijk wrote: > Mike Gerdts wrote: >> >> On Fri, Jan 22, 2010 at 1:00 PM, John Hoogerdijk >> wrote: >> >>> >>> Is there a way to zero out unused blocks in a pool? I'm looking for ways >>> to >>> shrink the size of an opensolaris virtualbox VM and >>> using the compact subcommand will remove zero'd sectors. >>> >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. >> > > I tried this with mkfile - no joy. Let me ask a couple of the questions that come just after "are you sure your computer is plugged in?" Did you wait enough time for the data to be flushed to disk (or do sync and wait for it to complete) prior to removing the file? You did "mkfile $huge /var/tmp/junk" not "mkfile -n $huge /var/tmp/junk", right? If not, I suspect that "zpool replace" to a thin provisioned disk is going to be your best bet (as suggested in another message). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zero out block / sectors
On Mon, Jan 25, 2010 at 2:32 AM, Kjetil Torgrim Homme wrote: > Mike Gerdts writes: > >> John Hoogerdijk wrote: >>> Is there a way to zero out unused blocks in a pool? I'm looking for >>> ways to shrink the size of an opensolaris virtualbox VM and using the >>> compact subcommand will remove zero'd sectors. >> >> I've long suspected that you should be able to just use mkfile or "dd >> if=/dev/zero ..." to create a file that consumes most of the free >> space then delete that file. Certainly it is not an ideal solution, >> but seems quite likely to be effective. > > you'll need to (temporarily) enable compression for this to have an > effect, AFAIK. > > (dedup will obviously work, too, if you dare try it.) You are missing the point. Compression and dedup will make it so that the blocks in the devices are not overwritten with zeroes. The goal is to overwrite the blocks so that a back-end storage device or back-end virtualization platform can recognize that the blocks are not in use and as such can reclaim the space. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
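Pulling the thread together, a sketch of the whole procedure, first inside the guest and then on the VirtualBox host (the dataset and .vdi names are examples; compression and dedup must be off on the datasets being filled, or the zeros never reach the virtual disk, and it is wise to stop before the pool is completely full):
# zfs get compression,dedup rpool/export         # confirm both are off before filling
# dd if=/dev/zero of=/export/zerofill bs=1024k   # interrupt it while some headroom remains
# sync
# rm /export/zerofill
# sync
and then on the host:
$ VBoxManage modifyhd opensolaris.vdi --compact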
Re: [zfs-discuss] [OT] excess zfs-discuss mailman digests
On Mon, Feb 8, 2010 at 9:04 PM, grarpamp wrote: > PS: Is there any way to get a copy of the list since inception > for local client perusal, not via some online web interface? You can get monthly .gz archives in mbox format from http://mail.opensolaris.org/pipermail/zfs-discuss/. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FS Reliability WAS: about btrfs and zfs
On Fri, Oct 21, 2011 at 8:02 PM, Fred Liu wrote: > >> 3. Do NOT let a system see drives with more than one OS zpool at the >> same time (I know you _can_ do this safely, but I have seen too many >> horror stories on this list that I just avoid it). >> > > Can you elaborate #3? In what situation will it happen? Some people have trained their fingers to use the -f option on every command that supports it to force the operation. For instance, how often do you do rm -rf vs. rm -r and answer questions about every file? If various zpool commands (import, create, replace, etc.) are used against the wrong disk with a force option, you can clobber a zpool that is in active use by another system. In a previous job, my lab environment had a bunch of LUNs presented to multiple boxes. This was done for convenience in an environment where there would be little impact if an errant command were issued. I'd never do that in production without some form of I/O fencing in place. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 3:01 PM, Francois Dion wrote: > I've hit an interesting (not) problem. I need to remove a problematic > ld.config file (due to an improper crle...) to boot my laptop. This is > OI 151a, but fundamentally this is zfs, so i'm asking here. > > what I did after booting the live cd and su: > mkdir /tmp/disk > zpool import -R /tmp/disk -f rpool > > export shows up in there and rpool also, but in rpool there is only > boot and etc. > > zfs list shows rpool/ROOT/openindiana as mounted on /tmp/disk and I > see dump and swap, but no var. rpool/ROOT shows as legacy, so I > figured, maybe mount that. > > mount -F zfs rpool/ROOT /mnt/rpool That dataset (rpool/ROOT) should never have any files in it. It is just a "container" for boot environments. You can see which boot environments exist with: zfs list -r rpool/ROOT If you are running Solaris 11, the boot environment's root dataset will show a mountpoint property value of /. Assuming it is called "solaris" you can mount it with: zfs mount -o mountpoint=/mnt/rpool rpool/ROOT/solaris If the system is running Solaris 11 (and was not updated from Solaris 11 Express), it will have a separate /var dataset. zfs mount -o mountpoint=/mnt/rpool/var rpool/ROOT/solaris/var -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] gaining access to var from a live cd
On Tue, Nov 29, 2011 at 4:40 PM, Francois Dion wrote: > It is on openindiana 151a, no separate /var as far as But I'll have to > test this on solaris11 too when I get a chance. > > The problem is that if I > > zfs mount -o mountpoint=/tmp/rescue (or whatever) rpool/ROOT/openindiana > > i get a cannot mount /mnt/rpool: directory is not empty. > > The reason for that is that I had to do a zpool import -R /mnt/rpool > rpool (or wherever I mount it it doesnt matter) before I could do a > zfs mount, else I dont have access to the rpool zpool for zfs to do > its thing. > > chicken / egg situation? I miss the old fail safe boot menu... You can mount it pretty much anywhere: mkdir /tmp/foo zfs mount -o mountpoint=/tmp/foo ... I'm not sure when the temporary mountpoint option (-o mountpoint=...) came in. If it's not valid syntax then: mount -F zfs rpool/ROOT/solaris /tmp/foo -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Any rhyme or reason to disk dev names?
On Wed, Dec 21, 2011 at 1:58 AM, Matthew R. Wilson wrote: > Hello, > > I am curious to know if there is an easy way to guess or identify the device > names of disks. Previously the /dev/dsk/c0t0d0s0 system made sense to me... > I had a SATA controller card with 8 ports, and they showed up with the > numbers 1-8 in the "t" position of the device name. > > But I just built a new system with two LSI SAS HBAs in it, and my device > names are along the lines of: > /dev/dsk/c0t5000CCA228C0E488d0 > > I could not find any correlation between that identifier and the a) > controller the disk was plugged in to, or b) the port number on the > controller. The only way I could make a mapping of device name to controller > port was to add one drive at a time, reboot the system, and run "format" to > see which new disk name shows up. > > I'm guessing there's a better way, but I can't find any obvious answer as to > how to determine which port on my LSI controller card will correspond with > which seemingly random device name. Can anyone offer any suggestions on a > way to predict the device naming, or at least get the system to list the > disks after I insert one without rebooting? Depending on the hardware you are using, you may be able to benefit from croinfo. $ croinfo D:devchassis-path t:occupant-type c:occupant-compdev - --- - /dev/chassis//SYS/SASBP/HDD0/disk disk c0t5000CCA012B66E90d0 /dev/chassis//SYS/SASBP/HDD1/disk disk c0t5000CCA012B68AC8d0 The text in the left column represents text that should be printed on the corresponding disk slots. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
2012/3/26 ольга крыжановская : > How can I test if a file on ZFS has holes, i.e. is a sparse file, > using the C api? See SEEK_HOLE in lseek(2). -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] test for holes in a file?
On Mon, Mar 26, 2012 at 6:18 PM, Bob Friesenhahn wrote: > On Mon, 26 Mar 2012, Andrew Gabriel wrote: > >> I just played and knocked this up (note the stunning lack of comments, >> missing optarg processing, etc)... >> Give it a list of files to check... > > > This is a cool program, but programmers were asking (and answering) this > same question 20+ years ago before there was anything like SEEK_HOLE. > > If file space usage is less than file directory size then it must contain a > hole. Even for compressed files, I am pretty sure that Solaris reports the > uncompressed space usage. That's not the case. # zfs create -o compression=on rpool/junk # perl -e 'print "foo" x 10'> /rpool/junk/foo # ls -ld /rpool/junk/foo -rw-r--r-- 1 root root 30 Mar 26 18:25 /rpool/junk/foo # du -h /rpool/junk/foo 16K /rpool/junk/foo # truss -t stat -v stat du /rpool/junk/foo ... lstat64("foo", 0x08047C40) = 0 d=0x02B90028 i=8 m=0100644 l=1 u=0 g=0 sz=30 at = Mar 26 18:25:25 CDT 2012 [ 1332804325.742827733 ] mt = Mar 26 18:25:25 CDT 2012 [ 1332804325.889143166 ] ct = Mar 26 18:25:25 CDT 2012 [ 1332804325.889143166 ] bsz=131072 blks=32fs=zfs Notice that it says it has 32 512 byte blocks. The mechanism you suggest does work for every other file system that I've tried it on. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Strange hang during snapshot receive
On Thu, May 10, 2012 at 5:37 AM, Ian Collins wrote: > I have an application I have been using to manage data replication for a > number of years. Recently we started using a new machine as a staging > server (not that new, an x4540) running Solaris 11 with a single pool built > from 7x6 drive raidz. No dedup and no reported errors. > > On that box and nowhere else is see empty snapshots taking 17 or 18 seconds > to write. Everywhere else they return in under a second. > > Using truss and the last published source code, it looks like the pause is > between a printf and the call to zfs_ioctl and there aren't any other > functions calls between them: For each snapshot in a stream, there is one zfs_ioctl() call. During that time, the kernel will read the entire substream (that is, for one snapshot) from the input file descriptor. > > 100.5124 0.0004 open("/dev/zfs", O_RDWR|O_EXCL) = 10 > 100.7582 0.0001 read(7, "\0\0\0\0\0\0\0\0ACCBBAF5".., 312) = 312 > 100.7586 0. read(7, 0x080464F8, 0) = 0 > 100.7591 0. time() = 1336628656 > 100.7653 0.0035 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040CF0) = 0 > 100.7699 0.0022 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900) = 0 > 100.7740 0.0016 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040580) = 0 > 100.7787 0.0026 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x080405B0) = 0 > 100.7794 0.0001 write(1, " r e c e i v i n g i n".., 75) = 75 > 118.3551 0.6927 ioctl(8, ZFS_IOC_RECV, 0x08042570) = 0 > 118.3596 0.0010 ioctl(8, ZFS_IOC_OBJSET_STATS, 0x08040900) = 0 > 118.3598 0. time() = 1336628673 > 118.3600 0. write(1, " r e c e i v e d 3 1 2".., 45) = 45 > > zpool iostat (1 second interval) for the period is: > > tank 12.5T 6.58T 175 0 271K 0 > tank 12.5T 6.58T 176 0 299K 0 > tank 12.5T 6.58T 189 0 259K 0 > tank 12.5T 6.58T 156 0 231K 0 > tank 12.5T 6.58T 170 0 243K 0 > tank 12.5T 6.58T 252 0 295K 0 > tank 12.5T 6.58T 179 0 200K 0 > tank 12.5T 6.58T 214 0 258K 0 > tank 12.5T 6.58T 165 0 210K 0 > tank 12.5T 6.58T 154 0 178K 0 > tank 12.5T 6.58T 186 0 221K 0 > tank 12.5T 6.58T 184 0 215K 0 > tank 12.5T 6.58T 218 0 248K 0 > tank 12.5T 6.58T 175 0 228K 0 > tank 12.5T 6.58T 146 0 194K 0 > tank 12.5T 6.58T 99 258 209K 1.50M > tank 12.5T 6.58T 196 296 294K 1.31M > tank 12.5T 6.58T 188 130 229K 776K > > Can anyone offer any insight or further debugging tips? I have yet to see a time when zpool iostat tells me something useful. I'd take a look at "iostat -xzn 1" or similar output. It could point to imbalanced I/O or a particular disk that has abnormally high service times. Have you installed any SRUs? If not, you could be seeing: 7060894 zfs recv is excruciatingly slow which is fixed in Solaris 11 SRU 5. If you are using zones and are using any https pkg(5) origins (such as https://pkg.oracle.com/solaris/support), I suggest reading https://forums.oracle.com/forums/thread.jspa?threadID=2380689&tstart=15 before updating to SRU 6 (SRU 5 is fine, however). The fix for the problem mentioned in that forums thread should show up in an upcoming SRU via CR 7157313. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Occasional storm of xcalls on segkmem_zio_free
On Tue, Jun 12, 2012 at 11:17 AM, Sašo Kiselkov wrote: > On 06/12/2012 05:58 PM, Andy Bowers - Performance Engineering wrote: >> find where your nics are bound too >> >> mdb -k >> ::interrupts >> >> create a processor set including those cpus [ so just the nic code will >> run there ] >> >> andy > > Tried and didn't help, unfortunately. I'm still seeing drops. What's > even funnier is that I'm seeing drops when the machine is sync'ing the > txg to the zpool. So looking at a little UDP receiver I can see the > following input stream bandwidth (the stream is constant bitrate, so > this shouldn't happen): If processing in interrupt context (use intrstat) is dominating cpu usage, you may be able to use pcitool to cause the device generating all of those expensive interrupts to be moved to another CPU. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
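A sketch of the steps Andy describes, with the driver name and CPU numbers as placeholders:
# echo ::interrupts | mdb -k | grep ixgbe     # find which CPUs the NIC's interrupts land on (ixgbe is just an example)
# psrset -c 4 5                               # fence those CPUs into a processor set so ordinary threads stay off them
# intrstat 5                                  # watch where interrupt time is actually being spent
If interrupt processing still lands on the wrong CPUs, pcitool, as mentioned above, can retarget the interrupt itself.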
Re: [zfs-discuss] Benefits of enabling compression in ZFS for the zones
On Tue, Jul 10, 2012 at 6:29 AM, Jordi Espasa Clofent wrote: > Thanks for you explanation Fajar. However, take a look on the next lines: > > # available ZFS in the system > > root@sct-caszonesrv-07:~# zfs list > > NAME USED AVAIL REFER MOUNTPOINT > opt 532M 34.7G 290M /opt > opt/zones 243M 34.7G 32K /opt/zones > opt/zones/sct-scw02-shared 243M 34.7G 243M /opt/zones/sct-scw02-shared > static 104K 58.6G 34K /var/www/ > > # creating a file in /root (UFS) > > root@sct-caszonesrv-07:~# dd if=/dev/zero of=file.bin count=1024 bs=1024 > 1024+0 records in > 1024+0 records out > 1048576 bytes (1.0 MB) copied, 0.0545957 s, 19.2 MB/s > root@sct-caszonesrv-07:~# pwd > /root > > # enable compression in some ZFS zone > > root@sct-caszonesrv-07:~# zfs set compression=on opt/zones/sct-scw02-shared > > # copying the previos file to this zone > > root@sct-caszonesrv-07:~# cp /root/file.bin > /opt/zones/sct-scw02-shared/root/ > > # checking the file size in the origin dir (UFS) and the destination one > (ZFS with compression enabled) > > root@sct-caszonesrv-07:~# ls -lh /root/file.bin > -rw-r--r-- 1 root root 1.0M Jul 10 13:21 /root/file.bin > > root@sct-caszonesrv-07:~# ls -lh /opt/zones/sct-scw02-shared/root/file.bin > -rw-r--r-- 1 root root 1.0M Jul 10 13:22 > /opt/zones/sct-scw02-shared/root/file.bin > > # the both files has exactly the same cksum! > > root@sct-caszonesrv-07:~# cksum /root/file.bin > 3018728591 1048576 /root/file.bin > > root@sct-caszonesrv-07:~# cksum /opt/zones/sct-scw02-shared/root/file.bin > 3018728591 1048576 /opt/zones/sct-scw02-shared/root/file.bin > > So... I don't see any size variation with this test. ls(1) tells you how much data is in the file - that is, how many bytes of data that an application will see if it reads the whole file. du(1) tells you how many disk blocks are used. If you look at the stat structure in stat(2), ls reports st_size, du reports st_blocks. Blocks full of zeros are special to zfs compression - it recognizes them and stores no data. Thus, a file that contains only zeros will only require enough space to hold the file metadata. $ zfs list -o compression ./ COMPRESS on $ dd if=/dev/zero of=1gig count=1024 bs=1024k 1024+0 records in 1024+0 records out $ ls -l 1gig -rw-r--r-- 1 mgerdts staff 1073741824 Jul 10 07:52 1gig $ du -k 1gig 0 1gig -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
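A small follow-on experiment, reusing the compression-enabled dataset from the example above: all-zero blocks compress away entirely while pseudo-random data does not, and du(1) (or the compressratio property) shows the difference that ls(1) hides.

# zeros: ZFS recognizes them and stores (almost) nothing
dd if=/dev/zero of=/opt/zones/sct-scw02-shared/zero.bin bs=1024k count=16

# pseudo-random data: effectively incompressible
dd if=/dev/urandom of=/opt/zones/sct-scw02-shared/rand.bin bs=1024k count=16

sync
du -h /opt/zones/sct-scw02-shared/zero.bin /opt/zones/sct-scw02-shared/rand.bin
zfs get compressratio opt/zones/sct-scw02-shared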
Re: [zfs-discuss] Feature Request for zfs pool/filesystem protection?
On Wed, Feb 20, 2013 at 4:49 PM, Markus Grundmann wrote: > Whenever I modify zfs pools or filesystems it's possible to destroy [on a > bad day :-)] my data. A new > property "protected=on|off" in the pool and/or filesystem can help the > administrator for datalost > (e.g. "zpool destroy tank" or "zfs destroy " command will > be rejected > when "protected=on" property is set). > > It's anywhere here on this list their can discuss/forward this feature > request? I hope you have > understand my post ;-) I like the idea and it is likely not very hard to implement. This is very similar to how snapshot holds work. # zpool upgrade -v | grep -i hold 18 Snapshot user holds So long as you aren't using a really ancient zpool version, you could use this feature to protect your file systems. # zfs create a/b # zfs snapshot a/b@snap # zfs hold protectme a/b@snap # zfs destroy a/b cannot destroy 'a/b': filesystem has children use '-r' to destroy the following datasets: a/b@snap # zfs destroy -r a/b cannot destroy 'a/b@snap': snapshot is busy Of course, snapshots aren't free if you write to the file system. A way around that is to create an empty file system within the one that you are trying to protect. # zfs create a/1 # zfs create a/1/hold # zfs snapshot a/1/hold@hold # zfs hold 'saveme!' a/1/hold@hold # zfs holds a/1/hold@hold NAME TAG TIMESTAMP a/1/hold@hold saveme! Wed Feb 20 15:06:29 2013 # zfs destroy -r a/1 cannot destroy 'a/1/hold@hold': snapshot is busy Extending the hold mechanism to filesystems and volumes would be quite nice. Mike ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
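And the other half of that workflow, for the day the protection really should come off, using the same a/b@snap hold from the example above:

# list the holds, release the tag, and then the destroy goes through
zfs holds a/b@snap
zfs release protectme a/b@snap
zfs destroy -r a/b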
Re: [zfs-discuss] VM's on ZFS - 7210
On Sat, Aug 28, 2010 at 8:19 AM, Ray Van Dolson wrote: > On Sat, Aug 28, 2010 at 05:50:38AM -0700, Eff Norwood wrote: >> I can't think of an easy way to measure pages that have not been consumed >> since it's really an SSD controller function which is obfuscated from the >> OS, and add the variable of over provisioning on top of that. If anyone >> would like to really get into what's going on inside of an SSD that makes it >> a bad choice for a ZIL, you can start here: >> >> http://en.wikipedia.org/wiki/TRIM_%28SSD_command%29 >> >> and >> >> http://en.wikipedia.org/wiki/Write_amplification >> >> Which will be more than you might have ever wanted to know. :) > > So has anyone on this list actually run into this issue? Tons of > people use SSD-backed slog devices... > > The theory sounds "sound", but if it's not really happening much in > practice then I'm not too worried. Especially when I can replace a > drive from my slog mirror for a $400 or so if problems do arise... (the > alternative being much more expensive DRAM backed devices) Presumably this problem is being worked... http://hg.genunix.org/onnv-gate.hg/rev/d560524b6bb6 Notice that it implements: 866610 Add SATA TRIM support With this in place, I would imagine a next step is for zfs to issue TRIM commands as zil entries have been committed to the data disks. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] How to migrate to 4KB sector drives?
On Sun, Sep 12, 2010 at 5:42 PM, Richard Elling wrote: > On Sep 12, 2010, at 10:11 AM, Brandon High wrote: > >> On Sun, Sep 12, 2010 at 10:07 AM, Orvar Korvar >> wrote: >>> No replies. Does this mean that you should avoid large drives with 4KB >>> sectors, that is, new drives? ZFS does not handle new drives? >> >> Solaris 10u9 handles 4k sectors, so it might be in a post-b134 release of >> osol. > > OSol source yes, binaries no :-( You will need another distro besides > OpenSolaris. > The needed support in sd was added around the b137 timeframe. OpenIndiana, to be released on Tuesday, is based on b146 or later. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
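For anyone wondering whether an existing pool was created with 4 KB alignment, the ashift value recorded in the pool configuration is the thing to check; a rough sketch, with a hypothetical pool named tank:

# ashift=9 means 512-byte alignment, ashift=12 means 4 KB alignment
zdb -C tank | grep ashift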
Re: [zfs-discuss] file level clones
On Mon, Sep 27, 2010 at 6:23 AM, Robert Milkowski wrote: > Also see http://www.symantec.com/connect/virtualstoreserver And http://blog.scottlowe.org/2008/12/03/2031-enhancements-to-netapp-cloning-technology/ -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Tue, Oct 26, 2010 at 9:40 AM, bhanu prakash wrote: > Hi Team, > > > There 17 zones on the machine T5120. I want to move all the zones which are > ZFS filesystem to another new LUN. > > Can you give me the steps to proceed this. If the only thing on the source lun is the pool that contains the zones and the new LUN is at least as big as the old LUN: zpool replace <pool> <old-LUN> <new-LUN> The above can be done while the zones are booted. Depending on the characteristics of the server and workloads, the workloads may feel a bit sluggish during this time due to increased I/O activity. If that works for you, stop reading now. In the event that the scenario above doesn't apply, read on. Assuming all the zones are under oldpool/zones, oldpool/zones is mounted at /zones, and you have done "zpool create newpool <new-LUN>" Be sure to test this procedure - I didn't! zfs create newpool/zones # optionally, shut down the zones zfs snapshot -r oldpool/zones@phase1 zfs send -r oldpool/zones@phase1 | zfs receive newpool/zones@phase1 # If you did not shut down the zones above, shut them down now. # If the zones were shut down, skip the next two commands zfs snapshot -r oldpool/zones@phase2 zfs send -rI oldpool/zones@phase1 oldpool/zones@phase2 \ | zfs receive newpool/zones@phase2 # Adjust mount points and restart the zones zfs set mountpoint=none oldpool/zones zfs set mountpoint=/zones newpool/zones for zone in $zonelist ; do zoneadm -z $zone boot ; done At such a time that you are comfortable that the zone data moved over ok... zfs destroy -r oldpool/zones Again, verify the procedure works on a test/lab/whatever box before trying it for real. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Moving the 17 zones from one LUN to another LUN
On Wed, Oct 27, 2010 at 9:27 AM, bhanu prakash wrote: > Hi Mike, > > > Thanks for the information... > > Actually the requirement is like this. Please let me know whether it matches > for the below requirement or not. > > Question: > > The SAN team will assign the new LUN’s on EMC DMX4 (currently IBM Hitache is > there). We need to move the 17 containers which are existed on the > server Host1 to new LUN’s”. > > > Please give me the steps to do this activity. Without knowing the layout of the storage, it is impossible to give you precise instructions. This sounds like it is a production Solaris 10 system in an enterprise environment. In most places that I've worked, I would be hesitant to provide the required level of detail on a public mailing list. Perhaps you should open a service call to get the assistance you need. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hardware going bad
On Wed, Oct 27, 2010 at 3:41 PM, Harry Putnam wrote: > I'm guessing it was probably more like 60 to 62 c under load. The > temperature I posted was after something like 5minutes of being > totally shutdown and the case been open for a long while. (mnths if > not yrs) What happens if the case is closed (and all PCI slot, disk, etc. slots are closed)? Having the case open likely changes the way that air flows across the various components. Also, if there is tobacco smoke near the machine, it will cause a sticky build-up that likely contributes to heat dissipation problems. Perhaps this belongs somewhere other than zfs-discuss - it has nothing to do with zfs. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] FW: Solaris panic
ippy genunix: [ID 540533 kern.notice] ^MSunOS Release 5.11 > Version snv_151a 64-bit > Mar 17 15:28:51 zippy genunix: [ID 877030 kern.notice] Copyright (c) 1983, > 2010, Oracle and/or its affiliates. All rights reserved. > > Can anyone help? > > Regards > Karl > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Non-Global zone recovery
On Thu, Jul 7, 2011 at 2:41 PM, Ram kumar wrote: > > Hi Cindy, > > Thanks for the email. > > We are using Solaris 10 with out Live Upgrade. > > Tested following in the sandbox environment: > > 1) We have one non-global zone (TestZone) which is running on Test > zpool (SAN) > > 2) Don’t see zpool or non-global zone after re-image of Global zone. > > 3) Imported zpool Test > > Now I am trying to create Non-global zone and it is giving error > > bash-3.00# zonecfg -z Test > Test: No such zone configured > Use 'create' to begin configuring a new zone. > zonecfg:Test> create -a /zones/Test > invalid path to detached zone If you use create -a, it requires that SUNWdetached.xml exist as a means for configuring the various properties (e.g. zonepath, brand, etc.) and resources (inherit-pkg-dir, net, fs, device, etc.) for the zone. Since you don't have the SUNWdetached.xml, you can't use it. Assuming you have a backup of the system, you could restore a copy of /etc/zones/<zonename>.xml to /etc/zones/restored-<zonename>.xml, then run: zonecfg -z <zonename> create -t restored-<zonename> If that's not an option or is just too inconvenient, use zonecfg to configure the zone just like you did initially. That is, do not use "create -a", use "create", "create -b", or "create -t <template>" followed by whatever property settings and added resources are appropriate. After you get past zonecfg, you should be able to: zoneadm -z <zonename> attach If the package and patch levels don't match up (the global zone perhaps was installed from a newer update or has newer patches): zoneadm -z <zonename> attach -U or zoneadm -z <zonename> attach -u Since you seem to be doing this in a test environment to prepare for bad things to happen, I'd suggest that you make it a standard practice when you are done configuring a zone to do: zonecfg -z <zonename> export > /zonecfg.export Then if you need to recover the zone using only the things that are on the SAN, you can do: zpool import ... zonecfg -z <zonename> -f /zonecfg.export zoneadm -z <zonename> attach [-u|-U] Any follow-ups should probably go to Oracle Support or zones-discuss. Your problems are not related to zfs. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
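A sketch of making that export step routine for every configured zone; the backup directory here is just a placeholder, and ideally it would live somewhere that survives a re-image of the global zone, such as the SAN pool itself:

mkdir -p /var/tmp/zonecfg-backups
for z in $(zoneadm list -c | grep -v '^global$'); do
        zonecfg -z "$z" export > /var/tmp/zonecfg-backups/$z.cfg
done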
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
n, > not only on on hardware built for dedicated storage. > > Sparse-root vs. full-root zones, or disk images of VMs; > are they stuffed in one rpool or spread between rpool and > data pools - that detail is not actually the point of the thread. > > Actual useability of dedup for savings and gains on these > tasks (preferably working also on low-mid-range boxes, > where adding a good enterprise SSD would double the > server cost - not only on those big good systems with > tens of GB of RAM), and hopefully simplifying the system > configuration and maintenance - that is indeed the point > in question. > > //Jim > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] What is ".$EXTEND/$QUOTA" ?
On Tue, Jul 19, 2011 at 2:39 PM, Orvar Korvar wrote: > I am using S11E, and have created a zpool on a single disk as storage. In > several directories, I can see a directory called ".$EXTEND/$QUOTA". What is > it for? Can I delete it? > -- Perhaps this is of help. http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/smbsrv/smb_pathname.c#752 752 /* 753 * smb_pathname_preprocess_quota 754 * 755 * There is a special file required by windows so that the quota 756 * tab will be displayed by windows clients. This is created in 757 * a special directory, $EXTEND, at the root of the shared file 758 * system. To hide this directory prepend a '.' (dot). 759 */ -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs rename query
On Wed, Jul 27, 2011 at 6:37 AM, Nishchaya Bahuguna wrote: > Hi, > > I have a query regarding the zfs rename command. > > There are 5 zones and my requirement is to change the zone paths using zfs > rename. > > + zoneadm list -cv > ID NAME STATUS PATH BRAND IP > 0 global running / native > shared > 34 public running /txzone/public native > shared > 35 internal running /txzone/internal native > shared > 36 restricted running /txzone/restricted native > shared > 37 needtoknow running /txzone/needtoknow native shared > 38 sandbox running /txzone/sandbox native shared > > A whole root zone was configured and installed. Rest of the 4 zones > were cloned from . > > zoneadm -z clone public > > zfs get origin lists the origin as for all 4 zones. > > I run zfs rename on 4 of these clone'd zones and it throws a device busy > error because of parent-child relationship. I think you are getting the device busy error for a different reason. I just did the following: zfs create -o mountpoint=/zones rpool/zones zonecfg -z z1 'create; set zonepath=/zones/z1' zoneadm -z z1 install zonecfg -z z1c1 'create -t z1; set zonepath=/zones/z1c1' zonecfg -z z1c2 'create -t z1; set zonepath=/zones/z1c2' zoneadm -z z1c1 clone z1 zoneadm -z z1c2 clone z2 At this point, I have the following: bash-3.2# zfs list -r -o name,origin rpool/zones NAME ORIGIN rpool/zones - rpool/zones/z1- rpool/zones/z1@SUNWzone1 - rpool/zones/z1@SUNWzone2 - rpool/zones/z1c1 rpool/zones/z1@SUNWzone1 rpool/zones/z1c2 rpool/zones/z1@SUNWzone2 Next, I decide that I would like z1c1 to be rpool/new/z1c1 instead of it's current place. Note that this will also change the mountpoint which breaks the zone. bash-3.2# zfs create -o mountpoint=/new rpool/new bash-3.2# zfs rename rpool/zones/z1c1 rpool/new/z1c1 bash-3.2# zfs list -o name,origin -r /new NAMEORIGIN rpool/new - rpool/new/z1c1 rpool/zones/z1@SUNWzone1 To get a "device busy" error, I need to cause a situation where the zonepath cannot be unmounted. Having the zone running is a good way to do that: bash-3.2# zoneadm -z z1c2 boot WARNING: zone z1c1 is installed, but its zonepath /zones/z1c1 does not exist. bash-3.2# zfs rename rpool/zones/z1c2 rpool/new/z1c2 cannot unmount '/zones/z1c2': Device busy > I guess that can be handled with zfs promote because promote would swap the > parent and child. You would need to do this to rename a dataset that the origin (one that is cloned) not the clones. That is, if you wanted to rename the dataset for your public zone or I wanted to rename the dataset for z1, then you would need to promote the datasets for all of the clones. This is a known issue. 6472202 'zfs rollback' and 'zfs rename' require that clones be unmounted > So, how do I make it work when there are multiple zones cloned from a single > parent? Is there a way that zfs rename can work for ALL the zones rather > than working with two zones at a time? As I said above. > > Also, is there a command line option available for sorting the datasets in > correct dependency order? "zfs list -r -o name,origin" is a good starting point. I suspect that it doesn't give you exactly the output you are looking for. 
FWIW, the best way to achieve what you are after without breaking the zones is going to be along the lines of: zlogin z1c1 init 0 zoneadm -z z1c1 detach zfs rename rpool/zones/z1c1 rpool/new/z1c1 zonecfg -z z1c1 'set zonepath=/new/z1c1' zoneadm -z z1c1 attach zoneadm -z z1c1 boot -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
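For the other direction, renaming the origin dataset rather than one of its clones, a minimal sketch of the promote step mentioned earlier, using the z1/z1c1 names from this example; with several clones of the same origin, each remaining clone still has to be promoted or otherwise dealt with before the rename will succeed:

# make z1c1 stand alone; rpool/zones/z1 becomes a clone of it
zfs promote rpool/new/z1c1
zfs list -r -o name,origin rpool

# the former origin can now be renamed (still subject to 6472202
# if any remaining clone of it is mounted)
zfs rename rpool/zones/z1 rpool/new/z1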
Re: [zfs-discuss] Kernel panic on zpool import. 200G of data inaccessible!
On Thu, Aug 4, 2011 at 2:47 PM, Stuart James Whitefish wrote: > # zpool import -f tank > > http://imageshack.us/photo/my-images/13/zfsimportfail.jpg/ I encourage you to open a support case and ask for an escalation on CR 7056738. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
ompromised), loss of one > block can thus be many more times severe. I believe this is true and likely a good topic for discussion. > We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances. The key thing here is that distributed applications will not play nicely. In my best use case, Solaris zones and LDoms are the "application". I don't expect or want Solaris to form some sort of P2P storage system across my data center to save a few terabytes. D12n at the storage device can do this much more reliably with less complexity. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 22, 2008 at 10:44 PM, Erik Trimble <[EMAIL PROTECTED]> wrote: > More than anything, Bob's reply is my major feeling on this. Dedup may > indeed turn out to be quite useful, but honestly, there's no broad data > which says that it is a Big Win (tm) _right_now_, compared to finishing > other features. I'd really want a Engineering Study about the > real-world use (i.e. what percentage of the userbase _could_ use such a > feature, and what percentage _would_ use it, and exactly how useful > would each segment find it...) before bumping it up in the priority > queue of work to be done on ZFS. I get this. However, for most of my uses of clones dedup is considered finishing the job. Without it, I run the risk of having way more writable data than I can restore. Another solution to this is to consider the output of "zfs send" to be a stable format and get integration with enterprise backup software that can perform restores in a way that maintains space efficiency. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cannot attach mirror to SPARC zfs root pool
On Wed, Jul 23, 2008 at 11:36 AM, <[EMAIL PROTECTED]> wrote: > Rainer, > > Sorry for your trouble. > > I'm updating the installboot example in the ZFS Admin Guide with the > -F zfs syntax now. We'll fix the installboot man page as well. Perhaps it also deserves a mention in the FAQ somewhere near http://opensolaris.org/os/community/zfs/boot/zfsbootFAQ/#mirrorboot. 5. How do I attach a mirror to an existing ZFS root pool"? Attach the second disk to form a mirror. In this example, c1t1d0s0 is attached. # zpool attach rpool c1t0d0s0 c1t1d0s0 Prior to build , bug 6668666 causes the following platform-dependent steps to also be needed: On sparc systems: # installboot -F zfs /usr/`uname -i`/lib/fs/zfs/bootblk /dev/rdsk/c1t1d0s0 On x86 systems: # ... -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
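For the x86 case that is elided above, the equivalent step in that era was installgrub(1M) against the newly attached half of the mirror; a sketch, using the same hypothetical c1t1d0s0 device:

# installboot is the SPARC tool; installgrub writes the x86 boot blocks
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0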
Re: [zfs-discuss] zfs write cache enable on boot disks ?
On Fri, Apr 25, 2008 at 9:22 AM, Robert Milkowski <[EMAIL PROTECTED]> wrote: > Hello andrew, > > Thursday, April 24, 2008, 11:03:48 AM, you wrote: > > a> What is the reasoning behind ZFS not enabling the write cache for > a> the root pool? Is there a way of forcing ZFS to enable the write cache? > > The reason is that EFI labels are not supported for booting. > So from ZFS perspective you put root pool on a slice on SMI labeled > disk - the way currently ZFS works it assumes in such a case that > there could be other slices used by other programs and because you can > enable/disable write cache per disk and not per slice it's just safer > to not automatically enable it. > > If you havoever enable it yourself then it should stay that way (see > format -e -> cache) So long as the zpool uses all of the space used for dynamic data that needs to survive a reboot, it would seem to make a lot of sense to enable write cache on such disks. This assumes that ZFS does the flush no matter whether it thinks the write cache is enabled or not. Am I wrong about this somehow? -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
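For reference, the manual route Robert mentions ("format -e -> cache") looks roughly like the following interactive session; the disk name is a placeholder and the exact menu entries can vary with the disk and driver:

# format -e
#   (select the root-pool disk, e.g. c1t0d0)
#   format> cache
#   cache> write_cache
#   write_cache> display
#   write_cache> enable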
Re: [zfs-discuss] ZFS on 32bit.
On Wed, Aug 6, 2008 at 6:22 PM, Carson Gaspar <[EMAIL PROTECTED]> wrote: > Brian D. Horn wrote: >> In the most recent code base (both OpenSolaris/Nevada and S10Ux with patches) >> all the known marvell88sx problems have long ago been dealt with. > > Not true. The working marvell patches still have not been released for > Solaris. They're still just IDRs. Unless you know something I (and my > Sun support reps) don't, in which case please provide patch numbers. I was able to get a Tpatch this week with encouraging words about a likely release of 138053-02 this week. In a separate thread last week (?) Enda said that it should be out within a couple weeks. Mike -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <[EMAIL PROTECTED]> wrote: > In the interest of "full disclosure" I have changed the sha256.c in the > ZFS source to use the default kernel one via the crypto framework rather > than a private copy. I wouldn't expect that to have too big an impact (I > will be verifying it I just didn't have the data to hand quickly). Would this also make it so that it would use hardware assisted sha256 on capable (e.g N2) platforms? Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Panic + corrupted pool in snv_98
I had just upgraded (pkg image-update) to snv_98 then was trying to do a build of ON. The build was happening inside of virtualbox, so I can't really say for sure what layer is at fault. I'll keep the disk image and crash dump around for a few days in case anyone is interested in more data from them. Here's the interesting bits from ::msgbuf panic[cpu0]/thread=d48f6de0: assertion failed: 0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_SCRUB_FUNC, sizeof (uint32_t), 1, &dp->dp_scrub_func, tx), file: ../../common/fs/zfs/dsl_scrub.c, line: 124 d48f6c0c genunix:assfail+5a (feb96258, feb96238,) d48f6c58 zfs:dsl_pool_scrub_setup_sync+2cc (d68e8b00, ea9dadb4,) d48f6c90 zfs:dsl_sync_task_group_sync+da (df8ceac0, ee97d518) d48f6cdc zfs:dsl_pool_sync+121 (d68e8b00, 610b, 0) d48f6d4c zfs:spa_sync+2b5 (d3104500, 610b, 0) d48f6dc8 zfs:txg_sync_thread+2aa (d68e8b00, 0) d48f6dd8 unix:thread_start+8 () Here's what my pool looks like: pool: export id: 10328403348002192848 state: FAULTED status: One or more devices contains corrupted data. action: The pool cannot be imported due to damaged devices or data. The pool may be active on another system, but can be imported using the '-f' flag. see: http://www.sun.com/msg/ZFS-8000-5E config: export FAULTED corrupted data c6t0d0UNAVAIL corrupted data -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which is better for root ZFS: mlc or slc SSD?
On Wed, Sep 24, 2008 at 1:41 PM, Erik Trimble <[EMAIL PROTECTED]> wrote: > I was under the impression that MLC is the preferred type of SSD, but I > want to prevent myself from having a think-o. MLC - description as to why can be found in http://mags.acm.org/communications/200807/ See "Flash Storage Memory" by Adam Leventhal, page 47. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OT: ramdisks (Was: Re: create raidz with 1 disk offline)
On Mon, Sep 29, 2008 at 2:12 AM, Volker A. Brandt <[EMAIL PROTECTED]> wrote: > kthr memorypagedisk faults cpu > r b w swap free re mf pi po fr de sr lf lf lf s0 in sy cs us sy id > 0 0 0 33849968 2223440 2 14 1 0 0 0 0 0 21 0 21 813 1263 957 0 0 99 Note this from vmstat(1M): Without options, vmstat displays a one-line summary of the virtual memory activity since the system was booted. In other words, the first line of vmstat output is some value that does not represent the current state of the system. Try this instead: $ vmstat 1 2 kthr memorypagedisk faults cpu r b w swap free re mf pi po fr de sr m0 m1 m2 m1 in sy cs us sy id 0 0 0 12371648 5840200 30 93 14 1 1 0 0 0 1 0 5 711 2258 1257 1 0 98 0 0 0 9581336 2987584 61 69 0 0 0 0 0 0 0 0 0 543 972 518 0 0 100 >From a free memory standpoint, the current state of the system is very different than the typical state since boot. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
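A tiny illustration of the same point: take two samples and keep only the last one, so the since-boot summary line never makes it into the analysis.

# first line is the since-boot summary; the second sample reflects "now"
vmstat 1 2 | tail -1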
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 4:53 AM, . <[EMAIL PROTECTED]> wrote: > While it's clearly my own fault for taking the risks I did, it's > still pretty frustrating knowing that all my data is likely still > intact and nicely checksummed on the disk but that none of it is > accessible due to some tiny filesystem inconsistency. ?With pretty > much any other FS I think I could get most of it back. > > Clearly such a small number of occurrences in what were admittedly > precarious configurations aren't going to be particularly convincing > motivators to provide a general solution, but I'd feel a whole lot > better about using ZFS if I knew that there were some documented > steps or a tool (zfsck? ;) that could help to recover from this kind > of metadata corruption in the unlikely event of it happening. Well said. You have hit on my #1 concern with deploying ZFS. FWIW, I belive that I have hit the same type of bug as the OP in the following combinations: - T2000, LDoms 1.0, various builds of Nevada in control and guest domains. - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @ build 97 guest In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal <[EMAIL PROTECTED]> wrote: > >> >>In the past year I've lost more ZFS file systems than I have any other >>type of file system in the past 5 years. With other file systems I >>can almost always get some data back. With ZFS I can't get any back. > >> Thats scary to hear! >> > > I am really scared now! I was the one trying to quantify ZFS reliability, > and that is surely bad to hear! The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures. I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patching only support won't help. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html Hang only after I mirrored the zpool, no response on the list http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump. http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html Panic during ON build. Pool was lost, no response from list. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: > Nevada isn't production code. For real ZFS testing, you must use a > production release, currently Solaris 10 (update 5, soon to be update 6). I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <[EMAIL PROTECTED]> wrote: > On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: >> Nevada isn't production code. For real ZFS testing, you must use a >> production release, currently Solaris 10 (update 5, soon to be update 6). > > I misstated before in my LDoms case. The corrupted pool was on > Solaris 10, with LDoms 1.0. The control domain was SX*E, but the > zpool there showed no problems. I got into a panic loop with dangling > dbufs. My understanding is that this was caused by a bug in the LDoms > manager 1.0 code that has been fixed in a later release. It was a > supported configuration, I pushed for and got a fix. However, that > pool was still lost. Or maybe it wasn't fixed yet. I see that this was committed just today. 6684721 file backed virtual i/o should be synchronous http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <[EMAIL PROTECTED]> wrote: > Note: even in a single-device pool, ZFS metadata is replicated via > ditto blocks at two or three different places on the device, so that > a localized media failure can be both detected and corrected. > If you have two or more devices, even without any mirroring > or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) > across those devices. And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per dataset level via copies=n. If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in wide spread loss. Normal bit rot that causes you to lose blocks here and there are somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than than rpool/ROOT) copies=2 can help there. And for those places where losing a txg or two is a mortal sin, don't use flaky hardware and allow zfs to handle a layer of redundancy. This gets me thinking that it may be worthwhile to have a small (<100 MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/) so that "pkg repair" could be used to deal with cases that prevent your normal (>4 GB) boot environment from booting. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
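A minimal sketch of the copies=n idea, with a hypothetical rpool/home dataset; note the setting only affects blocks written after it is set, so it is best applied at creation time:

# keep two copies of the data blocks (not just metadata) for the
# datasets that matter most
zfs create -o copies=2 rpool/home

# or on an existing dataset (new writes only)
zfs set copies=2 rpool/home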
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <[EMAIL PROTECTED]> wrote: > On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <[EMAIL PROTECTED]> wrote: >> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <[EMAIL PROTECTED]> wrote: >>> Nevada isn't production code. For real ZFS testing, you must use a >>> production release, currently Solaris 10 (update 5, soon to be update 6). >> >> I misstated before in my LDoms case. The corrupted pool was on >> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the >> zpool there showed no problems. I got into a panic loop with dangling >> dbufs. My understanding is that this was caused by a bug in the LDoms >> manager 1.0 code that has been fixed in a later release. It was a >> supported configuration, I pushed for and got a fix. However, that >> pool was still lost. > > Or maybe it wasn't fixed yet. I see that this was committed just today. > > 6684721 file backed virtual i/o should be synchronous > > http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10): Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk. Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain. set vds:vd_file_write_flags = 0 Note – Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Performance bake off vxfs/ufs/zfs need some help
On Sat, Nov 22, 2008 at 11:41 AM, Chris Greer <[EMAIL PROTECTED]> wrote: > vxvm with vxfs we achieved 2387 IOPS In this combination you should be using odm, which comes as part of the Storage Foundation for Oracle or Storage Foundation for Oracle RAC products. It makes the database files on vxfs behave much like they live on raw devices and tends to allow a much higher transaction rate with fewer physical I/Os and less kernel (%sys) utilization. The concept is similar to but different than direct I/O. This behavior is hard, if not impossible, to test without Oracle in the mix because (AFAIK) oracle is the only thing that knows how to make use of the odm interface. > vxvm with ufs we achieved 4447 IOPS > ufs on disk devices we achieved 4540 IOPS > zfs we achieved 1232 IOPS When you say RAC, I assume you mean multi-instance (clustered) databases. None of those are cluster file systems and as such are worthless for multi-instance oracle databases which require a shared file system. On Linux, you say that you were using ocfs. Were you really using ocfs, or were the databases really in ASM? Oracle's recommendation (last I knew) was to have executables on ocfs and have databases in ASM. Have you tried ASM on Solaris? It should give you a lot of the benefits you would expect from ZFS (pooled storage, incremental backups, (I think) efficient snapshots). It will only work for oracle database files (and indexes, etc.) and should work for clustered storage as well. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] HELP!!! Need to disable zfs
Boot from the other root drive, mount up the "bad" one at /mnt. Then: # mv /mnt/etc/zfs/zpool.cache /mnt/etc/zpool.cache.bad On Tue, Nov 25, 2008 at 8:18 AM, Mike DeMarco <[EMAIL PROTECTED]> wrote: > My root drive is ufs. I have corrupted my zpool which is on a different drive > than the root drive. > My system paniced and now it core dumps when it boots up and hits zfs start. > I have a alt root drive that can boot the system up with but how can I > disable zfs from starting on a different drive? > > HELP HELP HELP > -- > This message posted from opensolaris.org > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate /var
On Tue, Dec 2, 2008 at 11:17 AM, Lori Alt <[EMAIL PROTECTED]> wrote: > I did pre-create the file system. Also, I tried omitting "special" and > zonecfg complains. > > I think that there might need to be some changes > to zonecfg and the zone installation code to get separate > /var datasets in non-global zones to work. You could probably do something like: zfs create rpool/zones/$zone zfs create rpool/zones/$zone/var zonecfg -z $zone add fs set dir=/var set special=/zones/$zone/var set type=lofs end ... zoneadm -z $zone install zonecfg -z $zone remove fs dir=/var zfs set mountpoint=/zones/$zone/root/var rpool/zones/$zone/var -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Separate /var
On Tue, Dec 2, 2008 at 6:13 PM, Lori Alt <[EMAIL PROTECTED]> wrote: > On 12/02/08 10:24, Mike Gerdts wrote: > I follow you up to here. But why do the next steps? > > > zonecfg -z $zone > > remove fs dir=/var > > > > zfs set mountpoint=/zones/$zone/root/var rpool/zones/$zone/var It's not strictly required to perform this last set of commands, but the lofs mount point is not really needed. Longer term it will likely look cleaner (e.g. to live upgrade) to not have this lofs mount. That is, I suspect that live upgrade is more likely to look at /var in the zone and say "ahhh, that is a zfs file system - I known how to deal with that" than it is for it to say "ahhh, that is a lofs file system to some other zfs file system in the global zone - I know how to deal with that." -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problem with time-slider
On Mon, Dec 29, 2008 at 8:21 AM, Charles wrote: > Hi > > I'm a new user of OpenSolaris 2008.11, I switched from Linux to try the > time-slider, but now when I execute the time-slider I get this message: > > http://img115.imageshack.us/my.php?image=capturefentresansnomfx9.png Try running svcs -xv zfs/auto-snapshot The last few lines of the log files mentioned in the output from the above command may provide helpful hints. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] strange performance drop of solaris 10/zfs
On Thu, Jan 29, 2009 at 6:13 AM, Kevin Maguire wrote: > I have tried to establish if some client or clients are thrashing the > server via nfslogd, but without seeing anything obvious. Is there > some kind of per-zfs-filesystem iostat? The following should work in bash or ksh, so long as the list of zfs mount points does not overflow the maximum command line length. $ fsstat $(zfs list -H -o mountpoint | nawk '$1 !~ /^(\/|-|legacy)$/') 5 -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is zfs snapshot -r atomic?
On Sun, Feb 22, 2009 at 11:59 AM, David Abrahams wrote: > > When I take a snapshot of a filesystem (or pool) and pass -r to get all > the sub-filesystems, am I getting the state of all the sub-filesystem > snapshots "at the same instant," or is it essentially equivalent to > making the sub-filesystem snapshots one at a time as I would have to do > if -r weren't available? Google for "zfs snapshot recursive atomic" leads me to: http://docs.sun.com/app/docs/doc/819-5461/gdfdt?a=view Which says: Recursive ZFS snapshots are created quickly as one atomic operation. The snapshots are created together (all at once) or not created at all. The benefit of atomic snapshots operations is that the snapshot data is always taken at one consistent time, even across descendent file systems. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
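A quick illustration, assuming a pool named tank: the -r form creates every descendent snapshot in a single transaction, so they all share one point in time.

zfs snapshot -r tank@nightly-2009-02-22
zfs list -t snapshot -r tank | grep nightly-2009-02-22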
Re: [zfs-discuss] Details on raidz boot + zfs patents?
On Sat, Feb 28, 2009 at 4:53 AM, "C. Bergström" wrote: > The other question that I am less worried about is would this violate any > patents.. I mean.. Sun added the initial zfs support to grub and this is > essentially extending that, but I'm not aware of any patent provisions on > that code or some royalty free statement about ZFS related patents from > Sun.. (Frankly.. I look at Sun as /similar/ to Cononical in that I assume > they only sue to protect themselves and not go after any good intention foss > project..) See http://opensolaris.org/os/about/faq/licensing_faq/#patents. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 1:20 AM, Richard Elling wrote: > David Magda wrote: >> On Feb 27, 2009, at 20:02, Richard Elling wrote: >>> At the risk of repeating the Best Practices Guide (again): >>> The zfs send and receive commands do not provide an enterprise-level >>> backup solution. >> >> Yes, in its current state; hopefully that will change some point in the >> future (which is what we're talking about with GSoC--the potential to change >> the status quo). > > I suppose, but considering that enterprise backup solutions exist, > and some are open source, why reinvent the wheel? > -- richard The default mode of operation for every enterprise backup tool that I have used is file level backups. The determination of which files need to be backed up seems to be to crawl the file system looking for files that have an mtime after the previous backup. Areas of strength for such tools include: - Works with any file system that provides a POSIX interface - Restore of a full backup is an accurate representation of the data backed up - Restore can happen to a different file system type - Restoring an individual file is possible Areas of weakness include: - Extremely inefficient for file systems with lots of files and little change. - Restore of full + incremental tends to have extra files because of spotty support or performance overhead of tool that would prevent it. - Large files that have blocks rewritten get backed up in full each time - Restores of file systems with lots of small files (especially in one directory) are extremely slow There exist features (sometimes expensive add-ons) that deal with some of these shortcomings via: - Keeping track of deleted files so that a restore is more representative of what is on disk during the incremental backup. Administration manuals typically warn that this has a big performance and/or size overhead on the database used by the backup software. - Including add-ons that hook into other components (e.g. VxFS storage checkpoints, Oracle RMAN) that provide something similar to block-level incremental backups Why re-invent the wheel? - People are more likely to have snapshots available for file-level restores, and as such a "zfs send" data stream would only be used in the event of a complete pool loss. - It is possible to provide a general block-level backup solution so that every product doesn't have to invent it. This gives ZFS another feature benefit to put it higher in the procurement priority. - File creation slowness can likely be avoided allowing restore to happen at tape speed - To be competitive with NetApp "snapmirror to tape" - Even having a zfs(1M) option that could list the files that change between snapshots could be very helpful to prevent file system crawls and to avoid being fooled by bogus mtimes. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 4:33 PM, Nicolas Williams wrote: > On Sat, Feb 28, 2009 at 10:44:59PM +0100, Thomas Wagner wrote: >> > >> pool-shrinking (and an option to shrink disk A when i want disk B to >> > >> become a mirror, but A is a few blocks bigger) >> > This may be interesting... I'm not sure how often you need to shrink a >> > pool >> > though? Could this be classified more as a Home or SME level feature? >> >> Enterprise level especially in SAN environments need this. >> >> Projects own theyr own pools and constantly grow and *shrink* space. >> And they have no downtime available for that. > > Multiple pools on one server only makes sense if you are going to have > different RAS for each pool for business reasons. It's a lot easier to > have a single pool though. I recommend it. Other scenarios for multiple pools include: - Need independent portability of data between servers. For example, in a HA cluster environment, various workloads will be mapped to various pools. Since ZFS does not do active-active clustering, a single pool for anything other than a simple active-standby cluster is not useful. - Array based copies are needed. There are times when copies of data are performed at a storage array level to allow testing and support operations to happen "on different spindles". For example, in a consolidated database environment, each database may be constrained to a set of spindles so that each database can be replicated or copied independent of the various others. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] GSoC 09 zfs ideas?
On Sat, Feb 28, 2009 at 8:34 PM, Nicolas Williams wrote: > On Sat, Feb 28, 2009 at 05:19:26PM -0600, Mike Gerdts wrote: >> On Sat, Feb 28, 2009 at 4:33 PM, Nicolas Williams >> wrote: >> > On Sat, Feb 28, 2009 at 10:44:59PM +0100, Thomas Wagner wrote: >> >> > >> pool-shrinking (and an option to shrink disk A when i want disk B to >> >> > >> become a mirror, but A is a few blocks bigger) >> >> > This may be interesting... I'm not sure how often you need to shrink a >> >> > pool >> >> > though? Could this be classified more as a Home or SME level feature? >> >> >> >> Enterprise level especially in SAN environments need this. >> >> >> >> Projects own theyr own pools and constantly grow and *shrink* space. >> >> And they have no downtime available for that. >> > >> > Multiple pools on one server only makes sense if you are going to have >> > different RAS for each pool for business reasons. It's a lot easier to >> > have a single pool though. I recommend it. >> >> Other scenarios for multiple pools include: >> >> - Need independent portability of data between servers. For example, >> in a HA cluster environment, various workloads will be mapped to >> various pools. Since ZFS does not do active-active clustering, a >> single pool for anything other than a simple active-standby cluster is >> not useful. > > Right, but normally each head in a cluster will have only one pool > imported. Not necessarily. Suppose I have a group of servers with a bunch of zones. Each zone represents a service group that needs to independently fail over between servers. In that case, I may have a zpool per zone. It seems this is how it is done in the real world.[1] 1. Upton, Tom. "A Conversation with Jason Hoffman." ACM Queue. January/February 2008. 9. > The Sun Storage 7xxx do this. One pool per-head, two pools altogether > in a cluster. Makes sense for your use case. If you are looking at a zpool per zone, it is likely a zpool created on a LUN provided by a Sun Storage 7xxx that is presented to multiple hosts. That is, ZFS on top of ZFS. >> - Array based copies are needed. There are times when copies of data >> are performed at a storage array level to allow testing and support >> operations to happen "on different spindles". For example, in a >> consolidated database environment, each database may be constrained to >> a set of spindles so that each database can be replicated or copied >> independent of the various others. > > This gets you back into managing physical space allocation. Do you > really want that? If you're using zvols you can do "array based copies" > of you zvols. If you're using filesystems then you should just use > normal backup tools. There are times when you have no real choice. If a regulation or a lawyer's interpretation of a regulation says that you need to have physically separate components, you need to have physically separate components. If your disaster recovery requirements mean that you need to have a copy of data at a different site and array based copies have historically been used - it is unlikely that "while true ; do zfs send | ssh | zfs receive" will be adapted in the first round of implementation. Given this, zvols don't do it today. When you have a smoking hole, the gap in transactions left by normal backup tools is not always good enough - especially if some of that smoke is coming from the tape library. Array based replication tends to allow you to keep much tighter tolerances on just how many committed transactions you are willing to lose. 
-- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [Fwd: ZFS user/group quotas & space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]
2009/3/31 Matthew Ahrens : > 4. New Properties > > user/group space accounting information and quotas can be manipulated > with 4 new properties: > > zfs get userused@<user> > zfs get groupused@<group> > > zfs get userquota@<user> > zfs get groupquota@<group> > > zfs set userquota@<user>=<quota> > zfs set groupquota@<group>=<quota> > > The <user> or <group> is specified using one of the following forms: > posix name (eg. ahrens) > posix numeric id (eg. 126829) > sid name (eg. ahrens@sun) > sid numeric id (eg. S-1-12345-12423-125829) How does this work with zones? Suppose in the global zone I have passwd entries like: jill:x:123:123:Jill Admin:/home/jill:/bin/bash joe:x:124:124:Joe Admin:/home/joe:/bin/bash And in a non-global zone (called bedrock) I have: fred:x:123:123:Fred Flintstone:/home/fred:/bin/bash barney:x:124:124:Barney Rubble:/home/barney:/bin/bash Dataset rpool/quarry is delegated to the zone bedrock. Does "zfs get all rpool/quarry" report the same thing whether it is run in the global zone or the non-global zone? Has there been any thought to using a UID resolution mechanism similar to that used by ps? That is, if "zfs get ... " is run in the global zone and the dataset is delegated to a non-global zone, display the UID rather than a possibly mistaken username. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
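For reference, a sketch of how the proposed properties would be used in practice; tank/home is a hypothetical dataset and ahrens is just the user name from the quoted example:

zfs set userquota@ahrens=10G tank/home
zfs get userquota@ahrens,userused@ahrens tank/home

# per-user accounting for the whole dataset (zfs userspace, also part of this case)
zfs userspace tank/home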
Re: [zfs-discuss] [Fwd: ZFS user/group quotas & space accounting [PSARC/2009/204 FastTrack timeout 04/08/2009]]
On Tue, Mar 31, 2009 at 7:12 PM, Matthew Ahrens wrote: > River Tarnell wrote: >> >> Matthew Ahrens: >>> >>> ZFS user quotas (like other zfs properties) will not be accessible over >>> NFS; >>> you must be on the machine running zfs to manipulate them. >> >> does this mean that without an account on the NFS server, a user cannot >> see his >> current disk use / quota? > > That's correct. Do you have a reason for not wanting this to be implemented, or are you just avoiding scope creep? In the past, this was a big pain point for NFS servers that used VxFS. I used one of Sun's "source available" programs to get the rquotad source to implement this in the Solaris 7 days. Google suggests others have done the same using the opensolaris code as a starting point. Still others have written wrappers around quota(1M) that invoke rsh or ssh to the appropriate NFS server. It seems as though this was eventually addressed by Veritas with 110434-02. We really shouldn't repeat this for long. It should be fairly straight-forward to modify rquotad to support this, so long as the zfs end of it is not overly complicated. Is now too early to file the RFE? For some reason it feels like the person on the other end of bugs.opensolars.org will get confused by the request to enhance a feature that doesn't yet exist. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss