Re: How to bind a static ether address to bridge?
If you can swing a routed network that will definitely have the fewest complications. For a switched network, if_bridge and ARP have to be integrated, something I just finished doing in DragonFly, so that all member interfaces of the bridge use *only* the bridge's MAC for all transactions, including ARP transactions, whether they require forwarding through the bridge or not. The bridge has its own internal forwarding table and a great deal of confusion occurs if the normal ARP code is trying to tie into individual interfaces instead of just the bridge interface, for *ANY* member of the bridge, not just the first member of the bridge.

Some of the problems you are likely to hit using if_bridge:

* An ARP response flows in on member interface A with an ether destination of member interface B. The OS decides to record the ARP route as coming from interface B (when it's actually coming from interface A), while the bridge internally records the proper forwarding (A). Fireworks ensue.

* ARP responses target member interfaces which are part of the spanning tree protocol (when you have redundant links) and which wind up in the blocking state.

The if_bridge code in FreeBSD sets the bridge's MAC to be the same as the first added interface, which is usually your LAN ethernet port. This will help a bit; just make sure that it *IS* your LAN ethernet port and that the spanning tree protocol is *NOT* turned on for that port. However, other member interfaces (usually TAPs if you are using something like OpenVPN) will have different MAC addresses and that will cause confusion. It might be possible to work around both issues by setting the MAC for *ALL* member interfaces to be the same as the bridge MAC (a sketch of that follows the caveats below), but I don't know. I gave up trying to do that in DFly and instead modified the ARP code to always use the bridge MAC for any interface which is a member of a bridge. That appears to have worked quite well.

My home network (using DragonFly) is using if_bridge to a colocated box, ether bridging a class C over three WANs via OpenVPN, with the related TAP interfaces and the LAN interface as members of the bridge. The bridge is set up with the spanning tree protocol turned on for the three TAP interfaces and with bonding turned on for two of the TAP interfaces. But that's with DFly (and I just finished the work two days ago). If something similar cannot be done w/FreeBSD then I recommend porting the changes from DFly over to FreeBSD's bridging and ARP modules. It was a big headache, but once I cleared up the ARP confusion things just started magically working.

Other caveats:

* TAP and BRIDGE interfaces are assigned a nearly random MAC address when they are created (in FreeBSD the bridge sets its MAC to the first member interface, so that is at least ok if you always add your LAN as the first member interface; the other member interfaces aren't so lucky). Rebooting the machine containing the bridge, or destroying and rebuilding the bridge, can create total and absolute havoc on your network because the rest of your switching infrastructure and machines will have the old MACs cached. The partial solution is taking on the MAC address of the LAN interface, which FreeBSD's bridging code does, and it might be possible to also set the other member interfaces to that same MAC (but I don't know if that will work). If not, this is almost a non-solvable problem short of making the ARP module more aware of the bridge.
* If using redundant links without bonding support in the bridge code, the bridge itself will get confused when the topology changes, though if it is a simple topology the bridge should be able to start forwarding to the backup link even though its internal forwarding table is messed up. The concept of a 'backup' link is a bit of a hack in the STP code (just as the concept of 'bonding' is a bit of a hack), so how well it works will depend on a lot of different factors. The idea of a 'backup' link is to be able to continue to switch packets when only one path is available, even if that path has not been completely resolved through the STP protocol.

* ARP only works because *EVERYONE* uses the same timeout. Futzing around with member associations on the bridge will cause the bridge to forget. The bridge should theoretically broadcast unicast packets for which it doesn't have a forwarding entry but... well, it is still possible for machines to get confused. When working on your setup you may have to 'arp -d -a' on one or more machines multiple times to force them to re-arp and cause all your intermediate ethernet switches to re-learn their forwarding tables.
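Here is roughly what the "set every member's MAC to the bridge MAC" workaround mentioned above might look like on FreeBSD. This is only a sketch: em0/tap0/tap1 are hypothetical interface names, and whether this actually clears up the ARP confusion is exactly the open question discussed above.

    # create the bridge and add members, LAN first so the bridge inherits its MAC
    ifconfig bridge0 create
    ifconfig bridge0 addm em0 addm tap0 addm tap1 up

    # read back the MAC the bridge settled on
    BRMAC=$(ifconfig bridge0 | awk '/ether/ { print $2 }')

    # force every member interface to carry the same MAC as the bridge
    for IF in em0 tap0 tap1; do
        ifconfig $IF ether $BRMAC
    done

    # flush stale ARP entries so hosts (and switches) re-learn
    arp -d -a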
Re: Constant rebooting after power loss
The core of the issue here comes down to two things.

First, a power loss to the drive will cause the drive's dirty write cache to be lost; that data will not make it to disk. Nor do you really want to turn off write caching on the physical drive. Well, you CAN turn it off, but if you do, performance will become so bad that there's no point. So turning off the write caching is really a non-starter.

The solution to this first item is for the OS/filesystem to issue a disk flush command to the drive at appropriate times. If I recall, the ZFS implementation in FreeBSD *DOES* do this for transaction groups, which guarantees that a prior transaction group is fully synced before a new one starts running (HAMMER in DragonFly also does this). (Just getting an 'ack' from the write transaction over the SATA bus only means the data made it to the drive's cache, not that it made it to the platter). I'm not sure about UFS vis-a-vis the recent UFS logging features... it might be an option but I don't know if it is a default. Perhaps someone can comment on that.

One last note here. Many modern drives have very large ram caches. OCZ's SSDs have something like 256MB write caches and many modern HDs now come with 32MB and 64MB caches. Aged drives with lots of relocated sectors and bit errors can also take a very long time to perform writes on certain sectors. So these large caches take time to drain, and one can't really assume that an acknowledged write to disk will actually make it to the disk under adverse circumstances any more. All sorts of bad things can happen.

Finally, the drives don't order their writes to the platter (you can set a bit to tell them to, but like many similar bits in the past there is no real guarantee that the drives will honor it). So if two transactions do not have a disk flush command in between them, it is possible for data from the second transaction to commit to the platter before all the data from the first transaction commits to the platter. Or worse, for the non-transactional data to update out of order relative to the transactional data which was supposed to commit first. Hence, IMHO, the OS/filesystem must use the disk flush command in such situations for good reliability.

--

The second problem is that a physical loss of power to the drive can cause the drive to physically lose one or more sectors, and can even effectively destroy the drive (even with the fancy auto-park)... if the drive happens to be in the middle of a track write-back when power is lost it is possible to lose far more than a single sector, including sectors unrelated to recent filesystem operations.

The only solution to #2 is to make sure your machines (or at least the drives, if they happen to be in external enclosures) are connected to a UPS and that the machines are communicating with the UPS via something like the "apcupsd" port. AND also that you test to make sure the machines properly shut themselves down when AC is lost, before the UPS itself runs out of battery time. After all, a UPS won't help if the machines don't at least idle their drives before power is lost!!!

I learned this lesson the hard way about 3 years ago. I had something like a dozen drives in two raid arrays doing heavy write activity and lost physical power, and several of the drives were totally destroyed, with thousands of sector errors. Not just one or two... thousands.

(It is unclear how SSDs react to physical loss of power during heavy writing activity. Theoretically, while they will certainly lose their write cache, they shouldn't wind up with any read errors).

-Matt
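For the UPS half of this, a minimal apcupsd setup on FreeBSD might look roughly like the following. This is a hedged sketch for a USB-connected APC unit; the exact directives and thresholds depend on your UPS model and the port's documentation.

    # install sysutils/apcupsd from ports, then in /etc/rc.conf:
    #   apcupsd_enable="YES"

    # /usr/local/etc/apcupsd/apcupsd.conf (key settings only):
    #   UPSCABLE usb
    #   UPSTYPE usb
    #   DEVICE
    #   BATTERYLEVEL 10     # shut down at 10% battery remaining
    #   MINUTES 5           # ...or when ~5 minutes of runtime are left

    # verify the daemon is actually talking to the UPS
    apcaccess status

    # then actually pull the AC plug once, as suggested above, to confirm the
    # box shuts itself down cleanly before the battery runs out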
Re: Constant rebooting after power loss
...that covers 90% of the market and 99% of the cases where protocol reliability is required.

-Matt Matthew Dillon
Re: Constant rebooting after power loss
:> Do you know if that's changed at all with NCQ on modern SATA drives?
:> I've seen people commenting that using tags recovers most, if not all,
:> of the performance lost by disabling the write cache.
:...

I've never tried that combination. Theoretically the 32 tags SATA supports would just barely be enough for sequential write service loads, but I really doubt it would be enough for mixed service loads, and you would be blowing up your read performance to achieve even that due to the length of time the tags stay busy with writes. With some driver massaging, such as partitioning the tag space and dedicating a specific number of tags for writing, read performance could probably be maintained, but write performance (with caches off) would definitely still suffer. It might not be horrible, though.

One advantage of turning off the drive's write cache is that it would be possible for the OS to control write interference vs read loads, which is impossible to do with caches turned on. That is, with caches turned on your writes are instantly acknowledged until the drive's own caches exceed their dirty limits, and by that time the drive is juggling so much dirty data that we (the OS/driver) have no control over read vs write performance. This is why it is so blasted difficult to write I/O schedulers in OS's that actually work.

With caches disabled the OS/driver would have a great deal more control over read vs write performance. I/O scheduling would become viable. But to really make it work well I think we would need 64-128 tags (or more) to be able to cover multiple writing zones. With only 32 tags the drive's zone cache will be defeated.

It would be a very interesting test. I can't immediately dismiss tagged I/O with write caches disabled.

-Matt
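If someone wants to experiment with this, the tag depth is at least easy to inspect and limit from userland. A hedged sketch follows; ada0 is an assumed CAM/ahci device name, and under the old ata(4) driver the tools differ.

    # what the drive claims to support (look for the NCQ line and queue depth)
    camcontrol identify ada0

    # current number of tagged openings the driver is using for the device
    camcontrol tags ada0 -v

    # artificially limit the device to 16 concurrent tags, e.g. to reserve
    # headroom for reads while testing write-cache-off behavior
    camcontrol tags ada0 -N 16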
Re: PCIe SATA HBA for ZFS on -STABLE
:I'm not on the -STABLE list so please reply to me.
:
:I'm using an Intel Core i3-530 on a Gigabyte H55M-D2H motherboard with 8 x
:2TB drives & 2 x 1TB drives.
:The plan is to have the 1 TB drives in a zmirror and the 8 in a raidz2.
:
:Now the Intel chipset has only 6 on board SATA II ports so ideally I'm
:looking for a non RAID SATA II HBA to give me 6 extra ports (4 min).
:Why 6 extra ?
:Well the case I'm using has 2 x eSATA ports so 6 would be ideal, 5 OK, and 4
:the minimum I need to do the job.
:
:So...
:
:What do people recommend for 8-STABLE as a PCIe SATA II HBA for someone
:using ZFS ?
:
:Not wanting to break the bank.
:Not interested in SATA III 6GB at this time... though it could be useful if
:I add an SSD for... (is it ZIL ?).
:Can this be added at any time ?
:
:The main issue is I need at least 10 ports total for all existing drives...
:ZIL would require 11 so ideally we are talking a 6 port HBA.

The absolute cheapest solution is to buy a Sil-3132 PCIe card (providing 2 E-SATA ports), and then connect an external port multiplier to each port. External port multiplier enclosures typically support 5 drives each, so that would give you your 10 drives. Even though the 3132 is a piss-ant little card, it does support FIS-based switching, so performance will be very good... it will just be limited to SATA-II speeds is all.

Motherboard AHCI-based SATA ports typically do NOT have FIS-based switching support (this would be the FBSS capability flag when the AHCI driver probes the chipset). This means that while you can attach an external port multiplier enclosure to mobo SATA ports (see later on E-SATA vs SATA), read performance from multiple drives concurrently will be horrible. Write performance will still be decent due to drive write caches, despite being serialized.

On E-SATA vs SATA: essentially there are only two differences between E-SATA and SATA. One is the cable and connector format. The other is hot swap detection. Most mobo SATA ports can be strung out to E-SATA with an appropriate adapter. High-end Intel ASUS mobos often come with such adapters (this is why they usually don't sport an actual E-SATA port on the backplane) and the BIOS has setup features to specify E-SATA on a port-by-port basis.

--

For SSDs you want to directly connect the SSD to a mobo SATA port and then either mount the SSD in the case or mount it in a hot-swap gadget that you can screw into a PCI slot (it doesn't actually use the PCI connector, just the slot). A SATA-III port with a SATA-III SSD really shines here, and 400-500 MBytes/sec random read performance from a single SSD is possible, but it isn't an absolute requirement. A SATA-II port will still work fine as long as you don't mind maxing out the bandwidth at 250 MBytes/sec.

--

I can't help with any of the other questions. Someone also suggested the MPS driver for FreeBSD, with caveats. I'll add a caveat on the port multiplier enclosures. Nearly all such enclosures use another SIL chipset internally, and it works pretty well EXCEPT that it isn't 100% dependable if you try to hot-swap drives in the enclosure while other drives in the enclosure are active. So with that caveat, I recommend the port multiplier enclosure as the cheapest solution.

To get robust hot-swap enclosures you either need to go with SAS or you need to go with discrete SATA ports (no port multiplication), and the ports have to support hot-swap. The best hot-swap support for an AHCI port is if the AHCI chipset supports cold-presence-detect (CPD), and again mobo AHCI chipsets usually don't. Hot-swap is a bit hit or miss without CPD because power savings modes can effectively prevent hot-swap detect from working properly. Drive disconnects will always be detected but drive connects might not be.

And even with discrete SATA ports, the AHCI firmware on mobos does not necessarily handle hot-swap properly. For example, my Intel-i7 ASUS mobo will generate spurious interrupts and status on a DIFFERENT discrete SATA port when I hot swap on some other discrete SATA port, in addition to generating the status interrupt on the correct port. So then it comes down to the driver in the operating system properly handling the spurious status and properly stopping and restarting pending commands when necessary. So, again, it is best for the machine to be idle before attempting a hot-swap.

Lots of caveats. Sorry... you can blame Intel for all the blasted issues with AHCI and SATA. Intel didn't produce a very good chipset spec and vendors took all sorts of liberties.

-Matt Matthew Dillon
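For what it's worth, you can usually see from the boot messages whether a controller probed with FIS-based switching support. A hedged sketch; the exact wording of the capability line varies by driver and FreeBSD version.

    # AHCI controllers: look for FBS among the capability flags
    dmesg | grep -i ahci

    # Silicon Image 3124/3132 cards attach via siis(4) on newer FreeBSD
    dmesg | grep -i siis

    # identify which SATA controllers are present in the first place
    pciconf -lv | grep -A 3 -E 'ahci|atapci|siis'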
Re: 32GB limit per swap device?
The limitation was ONLY due to a *minor* 32-bit integer overflow in one or two *intermediate* calculations in the radix tree code, which I long ago fixed in DragonFly. Just find the changes in the DFly codebase and determine if they need to be applied.

The swap space radix code (which I wrote long ago) is in page-sized blocks, so you actually probably want to keep using a 32-bit integer for the block number there to keep the physical memory reservation required for the radix tree low. If you just pop the base block id up to 64 bits without adjusting the radix code to overlay a 64-bit bitmap on it, you waste a lot of physical memory for the same amount of swap reservation. This is NOT where the limitation lies. It was strictly an intermediate calculation that caused the original limitation.

With 32-bit block numbers stored in the radix tree nodes in the swap code, the physical limitation is something like 1 to 4 TB of total swap. I forget exactly, but it is at least 1TB. I've tested 1TB swap partitions on DragonFly with just the minor fixes to the original radix tree code.

--

Also note that I believe FreeBSD has done away with interleaved swap. I'm not sure why; I guess geom can interleave the swap for you, but I've always thought that it would be easier to just specify and add the partitions separately so one has the flexibility to swapon and swapoff the individual partitions on a live system. Interleaving is important because you get an almost perfect performance multiplier. You don't want to just append the swap partitions after each other.

--

One last thing: the amount of wired physical memory required is still on the order of ~1MB per ~1GB of swap. A 32-bit kernel is thus still limited by available KVM, effectively limiting you to around ~32G of swap depending on various factors if you do not want to run the system out of KVM. I've run upwards of 128G of swap on 32-bit systems but it really pushed the KVM use and I would not recommend it.

A 64-bit kernel is *NOT* limited by KVM. Swap is effectively limited to ~1TB or ~2TB using the original radix code with the one or two intermediate overflow fixes applied. The daddr_t in the original radix code can remain 32-bits (in DragonFly I typedef'd another name so I could explicitly make it 32-bits regardless of daddr_t).

Large amounts of swap space are becoming important as things like tmpfs (and swapcache in DragonFly as well) can really make use of it. Swap performance (the ability to interleave the swap space) is also important for the same reason. Interleaved swap on two SATA-III SSDs is just insane... gives you something like 800MB/sec of aggregate read bandwidth.

-Matt
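As a concrete illustration of "specify and add the partitions separately": on FreeBSD the individual partitions can be added and removed at runtime, roughly as below. Device names are assumptions, and whether FreeBSD still interleaves separately-added partitions is exactly the question raised above.

    # add two swap partitions on different physical drives
    swapon /dev/ada1p2
    swapon /dev/ada2p2

    # see how swap is distributed and used
    swapinfo -k

    # remove one partition from a live system (its pages get migrated off)
    swapoff /dev/ada2p2

    # the equivalent /etc/fstab lines, one per partition:
    #   /dev/ada1p2   none   swap   sw   0   0
    #   /dev/ada2p2   none   swap   sw   0   0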
Re: 32GB limit per swap device?
Two additional pieces of information.

The original limitation was more related to DEV_BSIZE calculations for the buf/bio, which is now 64-bits and thus not applicable, though you probably need some preemptive casts to ensure the multiplication is done in 64-bits. There was also another intermediate calculation overflow in the swap radix-tree code which had to be fixed to be able to use the full range... I think with simple casts. I haven't looked it up, but there should be a few commits in the DFly codebase that can be referenced.

Second item: the main physical memory use is not the radix tree bitmap code for the swap code, but instead the auxiliary data structure used to store the swapblk information which is associated with the vm_object structure. This structure contains a short array of swap block assignments (as a memory optimization to reduce header overhead) and it is these fields which you really want to keep 32-bits (unless you want the ~1MB per ~1GB of swap to become ~2MB per ~1GB of swap in physical memory overhead). The block number is in page-sized chunks so the practical limit is still ~4TB, with a further caveat below.

The further caveat is that the actual limitation for the radix tree is 0x10000000 blocks, which is 1/4 the full range, or ~1TB, so the actual limitation for the (fixed) original radix tree code is ~1TB rather than ~4TB. This restricted range is due to some shift << >> operators used in the radix tree code that I didn't want to make more complicated.

So, my recommendation is to fix the intermediate calculations and keep the swapblk-related blockno fields 32 bits. The preallocation for the vm_object's auxiliary structure must be large enough to actually be able to fill up swap and assign all the swap blocks. This is what eats the physical memory (4 bytes per 4K = a 1024x storage factor). The radix tree bitmap itself winds up eating only around 2 bits per swap block in total overhead. So the auxiliary structure is the main culprit. You definitely want to keep those block number fields in the aux structure 32 bits.

The practical limit of ~1TB of swap requires ~1GB of preallocated physical memory with a 32-bit block number field. That would become ~2GB of preallocated memory if 64-bit block numbers were used instead, for no gain other than wasting physical memory.

Ok, nobody is likely to actually need that much swap, but people might be surprised; there are a lot of modern-day uses for swap space that don't involve heavy paging of anonymous memory.

-Matt
Re: Make ZFS auto-destroy snapshots when the out of space?
It is actually a security issue to automatically destroy snapshots based on whether a filesystem is full, even automatically generated snapshots. Since one usually implements snapshots to perform a function you wish to rely on, such as to retain backups of historical data for auditing or other purposes, you do not want an attacker to be able to indirectly destroy snapshots simply by filling up the filesystem.

Instead what you want to do is to treat both the automatic and the manual snapshots as an integrated part of the filesystem's operation. Just as we have to deal with a nominal non-snapshotted filesystem-full condition today, we also want to treat a filesystem with multiple snapshots in the same vein. So, for example, you might administratively desire 60 1-day snapshots, plus 10-minute snapshots for the most recent 3 days, to be retained at all times. The automatic maintenance of the snapshots would then administratively delete snapshots over 60 days old and prune to a coarser grain past 3 days.

The use of snapshots on modern filesystems capable of managing large numbers of snapshots relatively pain-free, particularly on large storage systems and/or on modern multi-terabyte HDs, requires a bit of a change in thinking. You have to stop thinking of the snapshots as optional and start thinking of them as mandatory. When snapshot availability is an assumed condition and not an exceptional or special-case condition, it opens up a whole new arena in how filesystems can be managed, backed-up, audited, and used in every-day work. Once your thinking processes change you'll never go back to non-snapshotted or nontrivially-snapshotted filesystems. And you will certainly not want to allow a filesystem being mistakenly filled up to destroy your precious snapshots :-)

-Matt
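To make the retention policy above concrete, a ZFS version driven from cron might look roughly like this. The pool/dataset name and the exact counts are placeholders, and the daily snapshots would be handled the same way with a different prefix and retention count.

    #!/bin/sh
    # take a 10-minute snapshot (run from cron every 10 minutes)
    FS=tank/data
    zfs snapshot ${FS}@10min-$(date +%Y%m%d-%H%M)

    # keep only the newest 432 ten-minute snapshots (3 days' worth);
    # 'zfs list -s creation' sorts oldest-first, so the head of the
    # list is what gets pruned
    KEEP=432
    COUNT=$(zfs list -H -t snapshot -o name -r $FS | grep -c "@10min-")
    if [ "$COUNT" -gt "$KEEP" ]; then
        zfs list -H -t snapshot -o name -s creation -r $FS |
            grep "@10min-" |
            head -n $((COUNT - KEEP)) |
            xargs -n 1 zfs destroy
    fi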
Re: vm.swap_reserved toooooo large?
One of the problems with resource management in general is that it has traditionally been per-process, and due to the multiplicative effect (e.g. max-descriptors * limit-per-descriptor), per-process resources cannot be set such that any given user is prevented from DDOSing the system without making them so low that normal programs begin to fail for no good reason. Hence the advent of per-user and other more suitable resource limits, nominally set via sysctl.

Even with these, however, it is virtually impossible to protect against a user DDOS. The kernel itself has resource limitations which are fairly easy to blow out... mbufs are usually the easiest to blow up, followed by pipe KVM memory. Filesystems can be blown up too, by creating sparse files and mmap()ing them (thus circumventing normal overcommit limitations). Paging just by itself, without running the system out of VM, can destroy a machine's performance and be just as effective a DDOS attack as resource starvation is. Virtual memory resources are similarly impacted.

Overcommit limiting features have as many downsides as they have upsides. It's an endless argument, but I've seen systems blow up with overcommit limits set even more readily than with no (overcommit) limits set. Theoretically overcommit limits make the system more manageable, but in actual practice they only work when the application base is written with such limits in mind (and most are not). So for a general purpose unix environment, putting limits on overcommit tends to create headaches. To be sure, in a turn-key environment overcommit serves a very important function. In a non-turn-key environment, however, it will likely create more problems than it will solve.

The only way to realistically deal with the mess, if it is important to you, is to partition the system's real resources and run stuff inside their own virtualized kernels, each of which does its own independent resource management and whose I/O on the real system can be well-controlled as an aggregate.

Alternatively, creating very large swap partitions works very well to mitigate the more common problems. Swap itself is changing its function. Swap is no longer just used for real memory overcommit (in fact, real memory overcommit is quite uncommon these days). It is now also used for things like tmpfs, temporary virtual disks, meta-data caching, and so forth. These days the minimum amount of swap I configure is 32G, and as efficient swap storage gets more cost effective (e.g. SSDs), significantly more. 70G, 110G, etc.

It becomes more a matter of being able to detect and act on the DDOS/resource issue BEFORE it gets to the point of killing important processes (definition: whatever is important for the functioning of that particular machine, user-run or root-run), and less a matter of hoping the system will do the right thing when the resource limit is actually reached. Having a lot of swap gives you more time to act.

-Matt
Re: bad NFS/UDP performance
:> -vfs.nfs.realign_test: 22141777
:> +vfs.nfs.realign_test: 498351
:>
:> -vfs.nfsrv.realign_test: 5005908
:> +vfs.nfsrv.realign_test: 0
:>
:> +vfs.nfsrv.commit_miss: 0
:> +vfs.nfsrv.commit_blks: 0
:>
:> changing them did nothing - or at least with respect to nfs throughput :-)
:
:I'm not sure what any of these do, as NFS is a bit out of my league.
::-)  I'll be following this thread though!
:
:--
:| Jeremy Chadwick  jdc at parodius.com |

A non-zero nfs_realign_count is bad; it means NFS had to copy the mbuf chain to fix the alignment. nfs_realign_test is just the number of times it checked. So nfs_realign_test is irrelevant; it's nfs_realign_count that matters.

Several things can cause NFS payloads to be improperly aligned. Anything from older network drivers which can't start DMA on a 2-byte boundary, resulting in the 14-byte encapsulation header causing improper alignment of the IP header & payload, to RPCs embedded in NFS TCP streams winding up being misaligned. Modern network hardware either supports 2-byte-aligned DMA, allowing the encapsulation to be 2-byte aligned so the payload winds up being 4-byte aligned, or supports DMA chaining allowing the payload to be placed in its own mbuf, or pads, etc.

--

One thing I would check is to be sure a couple of nfsiod's are running on the client when doing your tests. If none are running, the RPCs wind up being more synchronous and less pipelined. Another thing I would check is IP fragment reassembly statistics (for UDP) - there should be none for TCP connections no matter what NFS I/O size is selected.

(It does seem more likely to be scheduler-related, though).

-Matt
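Two quick client-side checks for the points above, as a hedged sketch; the sysctl names below are what I would expect on a FreeBSD 7/8-era client, so verify them locally before relying on them.

    # are nfsiod threads available/running on the client?
    sysctl vfs.nfs.iodmin vfs.nfs.iodmax
    ps ax | grep '[n]fsiod'

    # IP fragment reassembly and drop counters (relevant for NFS over UDP)
    netstat -s -p ip | grep -i fragment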
Re: bad NFS/UDP performance
:how can I see the IP fragment reassembly statistics?
:
:thanks,
: danny

netstat -s

Also look for unexpected dropped packets, dropped fragments, and errors during the test and such; they are counted in the statistics as well.

-Matt Matthew Dillon
Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
A couple of things to note here. Well, many things actually.

* Turning off write caching, assuming the drive even looks at the bit, will destroy write performance for any driver which does not support command queueing. So, for example, SCSI typically has command queueing (as long as the underlying drive firmware actually implements it properly), and 3Ware cards have it (the underlying drives, if SATA, may not, but 3Ware's firmware itself might do the right thing). The FreeBSD ATA driver does not, not even in AHCI mode. The RAID code does not as far as I can tell. You don't want to turn this off.

* Filesystems like ZFS and HAMMER make no assumptions on write ordering to disk for completed write I/O vs future write I/O, and use BIO_FLUSH to enforce ordering on-disk. These filesystems are able to queue up large numbers of parallel writes in between each BIO_FLUSH, so the flush operation has only a very small effect on actual performance. Numerous Linux filesystems also use the flush command and do not make assumptions on BIO-completion/future-BIO ordering.

* UFS + softupdates assumes write ordering between completed BIOs and future BIOs. This doesn't hold true on a modern drive (with write caching turned on). Unfortunately it is ALSO not really the cause behind most of the inconsistency reports. UFS was *never* designed to deal with disk flushing. Softupdates was never designed with a BIO_FLUSH command in mind. They were designed for formally ordered I/O (bowrite), which fell out of favor about a decade ago and has since been removed from most operating systems.

* Don't get stuck in a rut and blame DMA/drive/firmware for all the troubles. It just doesn't happen often enough to even come close to being responsible for the number of bug reports.

With some work UFS can be modified to do it, but performance will probably degrade considerably, because the only way to do it is to hold the completed write BIOs (not biodone() them) until something gets stuck, or enough build up, then issue a BIO_FLUSH and, after it returns, finish completing the BIOs (call the biodone()) for the prior write I/Os. This will cause softupdates to work properly. Softupdates orders I/O's based on BIO completions. Another option would be to complete the BIOs but do major surgery on softupdates itself to mark the dependencies as waiting for a flush, then flush proactively and re-sync.

Unfortunately, this will not solve the whole problem. IF THE DRIVE DOESN'T LOSE POWER IT WILL FLUSH THE BIOs IT SAID WERE COMPLETED. In other words, unless you have an actual power failure, the assumptions softupdates makes will hold. A kernel crash does NOT prevent the actual drive from flushing the I/Os in its cache. The disk can wind up with unexpected softupdates inconsistencies on reboot anyway. Thus the source of most of the inconsistency reports will not be fixed by adding this feature. So more work is needed on top of that.

--

Nearly ALL of the unexpected softupdates inconsistencies you see *ARE* for the case where the drive DOES in fact get all the BIO data it returned as completed onto the disk media. This has happened to me many, many times with UFS. I'm repeating this: short of an actual power failure, any I/O's sent to and acknowledged by the drive are flushed to the media before the drive resets. A FreeBSD crash does not magically prevent the drive from flushing out its internal queues.

This means that there are bugs in softupdates & the kernel which can result in unexpected inconsistencies on reboot. Nobody has ever life-tested softupdates to try to locate and fix the issues. Though I do occasionally see commits that try to fix various issues, they tend to be more for live-side non-crash cases than for crash cases.

Some easy areas which can be worked on:

* Don't flush the buffer cache on a crash. Some of you already do this for other reasons (it makes it more likely that you can get a crash dump). The kernel's flushing of the buffer cache is likely a cause of a good chunk of the inconsistency reports by fsck, because unless someone worked on the buffer flushing code it likely bypasses softupdates. I know when working on HAMMER I had to add a bioop explicitly to allow the kernel flush-buffers-on-crash code to query whether it was actually ok to flush a dirty buffer or not. Until I did that, DragonFly was flushing HAMMER buffers on a crash which it had absolutely no business flushing.

* Implement active dependency flushing in softupdates. Instead of having it just adjust the dependencies for later flushes, softupdates needs to actively initiate I/O for the buffers those dependencies are attached to.
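For reference, the write-cache bit discussed in the first point can be toggled and inspected from userland. This is only a hedged illustration; the device names are assumptions, and as noted above, actually turning the cache off will generally wreck write performance.

    # old ata(4) driver: disable drive write caching at boot via /boot/loader.conf
    #   hw.ata.wc="0"

    # inspect what an ata(4)-attached drive reports (ad4 is hypothetical)
    atacontrol cap ad4

    # for CAM-attached devices, camcontrol can show similar information
    camcontrol identify ada0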
Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
:Completely agree. ZFS is the way of the future for FreeBSD. In my
:latest testing, the memory problems are now under control, there is just
:stability problems with random lockups after days of heavy load unless I
:turn off ZIL. So its nearly there.
:
:If only ZFS also supported a network distributed mode. Or can we
:convince you to port Hammer to FreeBSD? :-)
:
:- Andrew

Heh. No, you guys would have to port it if you want it, though I would be happy to support it once ported. Issues are between minor and moderate but would still require a knowledgeable filesystem person to do. Biggest issues will be buffer cache and bioops differences, and differences in the namespace VOPs.

--

But, IMHO, you guys should focus on ZFS since clearly a lot of work has gone into its port, it works now in FBSD, and it just needs to be made production-ready and a little more programming support from the community. It also has a different feature set than HAMMER.

P.S. FreeBSD should issue a $$ grant or stipend to Pawel for that work, he's really saving your asses. UFS has clearly reached its end-of-life.

Speaking of ZFS, you guys probably went through the same boot issues that we are going through with HAMMER. I came up with a solution which turned out to be quite non-invasive and very easy to implement.

* Make a small /boot UFS partition. e.g. 256M ad0s1a.
* Put HAMMER (or ZFS in your case) on the rest of the disk (ad0s1d).
* Adjust the loader to search both / and /boot, so /boot can be its own partition or a sub-directory on root.
* Add a simple line to /boot/loader.conf to direct the kernel to the proper root, e.g. vfs.root.mountfrom="hammer:ad0s1d"

And poof, you're done. Then when the system boots it boots into a HAMMER (ZFS) root, and /boot is mounted as a small UFS filesystem under it. Miscellaneous other partitions would then be pseudo-fs's under the single HAMMER (or ZFS) root, removing the need to worry about reserving particular amounts of space, and providing the needed backup and snapshot separation domains. Well, you guys might have already solved it. There isn't much to it.

I recall there was quite a discussion on trying to create a redundant boot setup on FreeBSD, such as boot-to-RAID setups, and having trouble getting the BIOS to recognize it. There's an alternative solution... having a separate, small /boot means you can boot from a small solid state storage device whose MTBF is going to be the same as the PC hardware itself. No real storage redundancy is needed, and if your root is somewhere else that gives you the option of putting more sophisticated code in /boot (it being the full kernel) to properly mount the root. I have 0 (ZERO) trust in BIOS-RAID or card-supported RAID-enabled (such as with 3Ware) boot support. ZERO.

-Matt Matthew Dillon
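A hedged ZFS-flavored version of that recipe, mirroring the HAMMER one above. The pool name, dataset name, and partition letters are assumptions, and the exact incantation depends on the ZFS version in your FreeBSD release.

    # ad0s1a: small UFS partition mounted on /boot
    # ad0s1d: the ZFS pool holding the root dataset

    # /boot/loader.conf on the UFS partition:
    #   zfs_load="YES"
    #   vfs.root.mountfrom="zfs:tank/root"

    # /etc/fstab entry inside the ZFS root so /boot appears after mount:
    #   /dev/ad0s1a   /boot   ufs   rw   1   1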
Re: UNEXPECTED SOFT UPDATE INCONSISTENCY; RUN fsck MANUALLY
...or not, without risking unexpected softupdates inconsistencies on-media. This alone makes background fsck problematic and risky.

-Matt Matthew Dillon
Re: Would anybody port DragonFlyBSD's HAMMER fs to FreeBSD?
Guys, please don't start a flamewar. And lhmwzy, we discussed this on the DFly lists. It's really up to them... that is, a programmer who has an interest, inclination, and time. It isn't really fair to try to push it.

I personally believe that the FreeBSD community as a whole should focus on ZFS for now. It has the momentum and the most interest on their lists.

-Matt
Re: sidetrack [was Re: 'at now' not working as expected]
Also, if you happen to have a handheld GPS unit, it almost certainly has a menu option to tell you the sunrise and sunset times at your current position.

-Matt
Re: An old gripe: Reading via mmap stinks
: mmap: 43.400u 9.439s 2:35.19 34.0%  16+184k 0+0io 106994pf+0w
: read: 41.358u 23.799s 2:12.04 49.3%  16+177k 67677+0io 0pf+0w
:
:Observe, that even though read-ing is quite taxing on the kernel (high
:sys-time), the mmap-ing loses overall -- at least, on an otherwise idle
:system -- because read gets the full throughput of the drive (systat -vm
:shows 100% disk utilization), while pagefaulting gets only about 69%.
:
:When I last brought this up in 2006, it was "revealed", that read(2)
:uses heuristics to perform a read-ahead. Why can't the pagefaulting-in
:implementation use the same or similar "trickery" was never explained...

Well, the VM system does do read-ahead, but clearly the pipelining is not working properly, because if it were then either the cpu or the disk would be pegged, and neither is. It's broken in DFly too.

Both FreeBSD and DragonFly use vnode_pager_generic_getpages() (UFS's ffs_getpages() just calls the generic), which means (typically) the whole thing devolves into a UIO_NOCOPY VOP_READ(). The VOP_READ should be doing read-ahead based on the sequential access heuristic, but I already see issues in both implementations of vnode_pager_generic_getpages() where it finds a valid page from an earlier read-ahead and stops (failing to issue any new read-aheads because it fails to issue a new UIO_NOCOPY VOP_READ... doh!). This would explain why the performance is not as bad as Linux but is not as good as a properly pipelined case.

I'll play with it some in DFly and I'm sure the FreeBSD folks can fix it in FreeBSD.

-Matt
Re: immense delayed write to file system (ZFS and UFS2), performance issues
:I'm experiencing the same thing, except in my case it's most noticeable
:when writing to a USB flash drive with a FAT32 filesystem. It slows the
:entire system down, even if the data being written is coming from cache
:or a memory file system.
:
:I don't know if it's related. I'm running 8-STABLE from about 4 December.
:
:Regards,
:Aragon

I don't know re: the main thread, but in regards to writing to a USB flash drive interfering with other operations, the most likely cause is that the buffer cache fills up with dirty buffers destined for the (slow) USB drive. This causes other unrelated drive subsystems to block on the buffer cache.

There are no easy answers. A poor-man's solution would be to limit dirty buffers in the buffer cache to 80% of the nominal dirty maximum on a per-mount basis, so no single mount can kill the buffer cache. (One can't just cut up the buffer cache, as that would leave too few buffers available for each mount to operate efficiently). A per-mount minimum buffer guarantee would also help smooth things out, but the value would have to be small (comprise no more than 20% of the buffer cache in aggregate).

In the case of UFS the write-behind code is asynchronous, so even though UFS wants to flush the buffers out, all that happens in reality when writing to slow media is that the dirty buffers wind up on the I/O queue (which is actually worse than leaving them B_DELWRI in the buffer cache, because now the VM pages are all soft-busied).

-Matt Matthew Dillon
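Two band-aids that sometimes help in the quoted USB/FAT32 case, offered only as a hedged sketch; the device, mount point, and whether they actually help depend on the setup and workload.

    # mount the stick synchronously so its dirty buffers never pile up;
    # writes to the stick get slower but the rest of the system stays usable
    mount_msdosfs -o sync /dev/da0s1 /mnt/usb

    # watch the buffer-cache dirty counters while the copy runs
    sysctl vfs.numdirtybuffers vfs.hidirtybuffers vfs.lodirtybuffers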
Re: immense delayed write to file system (ZFS and UFS2), performance issues
Here's what I got from one of my 2TB WD green drives. This one is Firmware 01.00A01. Load_Cycle_Count is 26... seems under control. It gets hit with a lot of activity separated by a lot of time (several minutes to several hours), depending on what is going on. The box is used for filesystem testing. Regardless, it seems to stay spun-up all the time, or nearly all the time.

Neither the BIOS nor the kernel driver is messing with the SUD control on the Silicon Image board it is connected to (other than just turning it on and leaving it that way). If the drive has an intelligent parking function it doesn't seem to be using it much. I haven't specifically disabled any such function.

Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY0259672
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan 26 19:25:48 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 212   150   021    Pre-fail Always  -           6375
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           39
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           4252
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           37
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           13
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           26
194 Temperature_Celsius     0x0022 121   111   000    Old_age  Always  -           31
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

I have a few of these babies strewn around. The others show about the same stats, e.g. this one is used in a production box. Same drive type, same firmware:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032 095   095   000    Old_age  Always  -           4164
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           43
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           26
...

So on the face of it things seem ok with these drives. Presumably WD is working adjustments into the firmware as time goes on. Hopefully they aren't just masking the count in the SMART page to appease techies :-)

These particular WDs (2TB Caviar Greens) are slow drives. 5600 rpm, 100MB/sec. But they are also very quiet in operation and seem to be quite power efficient.

-Matt Matthew Dillon
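For anyone wanting to pull the same numbers on their own drives, the output above is what smartctl from the sysutils/smartmontools port prints; roughly as below, where the device node is an assumption and depends on which driver the drive is attached through.

    smartctl -a /dev/ad10      # full identification plus the attribute table
    smartctl -A /dev/ad10      # just the SMART attribute table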
Re: hardware for home use large storage
The Silicon Image 3124A chipsets (the PCIe version of the 3124; the original 3124 was PCI-X). The 3124A's are starting to make their way into distribution channels. This is probably the best 'cheap' solution which offers fully concurrent multi-target NCQ operation through a port multiplier enclosure with more than the PCIe 1x bus the ultra-cheap 3132 offers. I think the 3124A uses an 8x bus (not quite sure, but it is more than 1x).

AHCI on-motherboard chipsets with equivalent capabilities do not appear to be in wide distribution yet. Most AHCI chips can do NCQ to a single target (even a single target behind a PM), but not concurrently to multiple targets behind a port multiplier. Even though SATA bandwidth constraints might seem to make this a reasonable alternative, it actually isn't, because any seek-heavy activity to multiple drives will be serialized and perform EXTREMELY poorly. Linear performance will be fine. Random performance will be horrible.

It should be noted that while hotswap is supported with Silicon Image chipsets and port multiplier enclosures (which also use SiI chips in the enclosure), the hot-swap capability is not anywhere near as robust as you would find with a more costly commercial SAS setup. SiI chips are very poorly made (this is the same company that went bust under another name a few years back due to shoddy chipsets), and have a lot of on-chip hardware bugs, but fortunately OSS driver writers (linux guys) have been able to work around most of them. So even though the chipset is a bit shoddy, actual operation is quite good.

However, this does mean you generally want to idle all activity on the enclosure to safely hot swap anything, not just the drive you are pulling out. I've done a lot of testing, and hot-swapping an idle disk while other drives in the same enclosure are hot is not reliable (for a cheap port multiplier enclosure using a SiI chip inside, which nearly all do). Also, a disk failure within the enclosure can create major command sequencing issues for other targets in the enclosure, because error processing has to be serialized. Fine for home use but don't expect miracles if you have a drive failure.

The SiI chips and port multiplier enclosures are definitely the cheapest multi-disk solution. You lose on aggregate bandwidth and you lose on some robustness, but you get the hot-swap basically for free.

--

Multi-HD setups for home use are usually a lose. I've found over the years that it is better to just buy a big whopping drive and then another one or two for backups and not try to gang them together in a RAID. And yes, at one time in the past I was running three separate RAID-5s using 3ware controllers. I don't anymore and I'm a lot happier. If you have more than 2TB worth of critical data you don't have much of a choice, but I'd go with as few physical drives as possible regardless. The 2TB Maxtor green or black drives are nice. I strongly recommend getting the highest-capacity drives you can afford if you don't want your power bill to blow out your budget.

The bigger problem is always having an independent backup of the data. Depending on a single-instanced filesystem, even one like ZFS, for a lifetime's worth of data is not a good idea. Fire, theft... there are a lot of ways the data can be lost. So when designing the main system you have to take care to also design the backup regimen, including something off-site (or swapping the physical drive once a month, etc). i.e. multiple backup regimens.

If single-drive throughput is an issue then using ZFS's caching solution with a small SSD is the way to go (and yes, DFly has an SSD caching solution now too, but that's not pertinent to this thread). The Intel SSDs are really nice, but I am singularly unimpressed with the OCZ Colossus's, which don't even negotiate NCQ. I don't know much re: other vendors. A little $100 Intel 40G SSD has around a 40TB write endurance and can last 10 years as a disk meta-data caching environment with a little care, particularly if you only cache meta-data.

A very small incremental cost gives you 120-200MB/sec of seek-agnostic bandwidth, which is perfect for network serving, backup, remote filesystems, etc. Unless the box has 10GigE or multiple 1xGigE network links there's no real need to try to push HD throughput beyond what the network can do, so it really comes down to avoiding thrashing the HDs with random seeks. That is what the small SSD cache gives you. It can be like night and day.

-Matt
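On the ZFS side, the "small SSD cache" idea above maps onto an L2ARC (and optionally a log) device. A hedged sketch with hypothetical pool/device names; note that ZFS's L2ARC caches data as well as metadata by default, though the secondarycache property can restrict it to metadata if your ZFS version supports it.

    # add the SSD as a read cache (L2ARC) and, optionally, a log device
    zpool add tank cache ada4
    zpool add tank log ada5

    # restrict the cache device to metadata, if the property is available
    zfs set secondarycache=metadata tank

    # watch how much traffic the cache device absorbs
    zpool iostat -v tank 5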
Re: hardware for home use large storage
:Correction -- more than likely on a consumer motherboard you *will not*
:be able to put a non-VGA card into the PCIe x16 slot. I have numerous
:Asus and Gigabyte motherboards which only accept graphics cards in their
:PCIe x16 slots; this """feature""" is documented in user manuals. I
:don't know how/why these companies chose to do this, but whatever.
:
:I would strongly advocate that the OP (who has stated he's focusing on
:stability and reliability over speed) purchase a server motherboard that
:has a PCIe x8 slot on it and/or server chassis (usually best to buy both
:of these things from the same vendor) and be done with it.
:
:--
:| Jeremy Chadwick  j...@parodius.com |

It is possible this is related to the way Intel on-board graphics work in recent chipsets, e.g. i915 or i925 chipsets. The on-motherboard video uses a 16-lane internal PCI-e connection which is SHARED with the 16-lane PCI-e slot. If you plug something into the slot (e.g. a graphics card), it disables the on-motherboard video.

I'm not sure if the BIOS can still boot if you plug something other than a video card into these MBs and no video at all is available. Presumably it should be able to; you just wouldn't have any video at all.

Insofar as I know, AMD-based MBs with on-board video don't have this issue, though it should also be noted that AMD-based MBs tend to be about 6-8 months behind Intel ones in terms of features.

-Matt
Re: How to take down a system to the point of requiring a newfs with one line of C (userland)
Jim's original report seemed to indicate that the filesystem paniced on mount even after repeated fsck's. That implies that Jim has a filesystem image that panics on mount. Maybe Jim can make that image available and a few people can see if downloading and mounting it reproduces the problem. It would narrow things down anyhow. Also, I didn't see a system backtrace anywhere. If it paniced, where did it panic?

The first thing that came to my mind was the dirhash code, but simply mounting a filesystem doesn't scan the mount point directory at all, except possibly for '.' or '..'... I don't think it even does that. All it does is resolve the root inode of the filesystem. The code path for mounting a UFS or UFS2 filesystem is very short.

-Matt
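For anyone who wants to try reproducing from such an image, a hedged sketch of one way to capture the partition and mount it elsewhere; the device, file, and md unit names are all assumptions.

    # on the affected machine: capture the raw partition to a file
    dd if=/dev/ad0s1f of=/bigdisk/broken-ufs.img bs=1m

    # on a test machine: attach the image as a memory-backed disk
    mdconfig -a -t vnode -f /tmp/broken-ufs.img -u 9

    # read-only fsck first, then try the mount that supposedly panics
    fsck_ufs -n /dev/md9
    mount -t ufs /dev/md9 /mnt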
Re: fsck_ufs: cannot alloc 94208 bytes for inoinfo
fsck's memory usage is directly related to the number of inodes and the number of directories in the filesystem. Directories are particularly memory intensive. I've found on my backup system that a UFS1 filesystem with 40 million inodes is about the limit that can be fsck'd (at least with a 32-bit architecture). My cron jobs keep my backup partition below that point. Even in a 64-bit environment you will be limited by swap and the sheer time it takes for fsck to run. It takes well over 8 hours for my backup system to fsck.

You can also reduce fsck time by reducing the number of cylinder groups on the disk. I usually max them out (-c 999; newfs then sets it to the maximum, usually in the 50-80 range). This will improve performance but not reduce the memory required.

-Matt
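A hedged illustration of that newfs knob (the device name is an assumption): asking for an oversized -c value makes newfs clamp it to the maximum it supports, and dumpfs shows what you actually ended up with.

    # create the filesystem with the cylinder-group parameter maxed out
    # (plus soft updates, as is typical for a big backup partition)
    newfs -U -c 999 /dev/ad4s1d

    # inspect the resulting cylinder group layout
    dumpfs /dev/ad4s1d | head -20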
Re: udf
:> BTW, Remko has kindly notified me that Reinoud Zandijk has completed his
:> long work on UDF write support in NetBSD. I think that porting his work
:> is our best chance to get write support in FreeBSD too.
:>
:I think you'll find that implementing VOPs and filling in UDF data
:structures will be easy, while interacting with the VM will be many
:orders of magnitude harder. Still it should be a fun challenge for
:someone to do.
:
:Scott

One avenue that could be pursued would be to finish the UIO_NOCOPY support in vm/vnode_pager.c. You have UIO_NOCOPY support for the putpages code but not the getpages code. If that were done, the VFS can simply use VMIO-backed buffer cache buffers (they have to be VMIO-backed for UIO_NOCOPY to work properly)... and not have to deal with getpages or putpages at all. The vnode pager would convert them to a UIO_NOCOPY VOP_READ or VOP_WRITE as appropriate. The entire VM coding burden winds up being in the kernel proper and not in the VFS at all.

IMHO implementing per-VFS getpages/putpages is an exercise in frustration, to be avoided at all costs. Plus, once you have a generic getpages/putpages layer in vm/vnode_pager.c, the VFS code no longer has to mess with VM pages anywhere and winds up being far more portable. I did the necessary work in DragonFly in order to avoid having to mess with VM pages in HAMMER. Primary work:

* It is a good idea to require that all vnode-based buffer cache buffers be B_VMIO backed (aka have a VM object). It ensures a clean interface and avoids confusion, and also cleans up numerous special cases that are simply not needed in this day and age.

* Add support for UIO_NOCOPY in generic getpages. Get rid of all the special cases for small-block filesystems in getpages. Make it completely generic and simply issue the UIO_NOCOPY VOP_READ/VOP_WRITE.

* Make minor adjustments to existing VFSs (but nothing prevents them from still rolling their own getpages/putpages, so no major changes are needed).

And then enjoy the greatly simplified VFS interactions that result.

I would also recommend removing the VOP_BMAP() from the generic getpages/putpages code and simply letting the VFS's VOP_READ/VOP_WRITE deal with it. The BMAP calls were being made from getpages/putpages to check for discontiguous blocks, to avoid unnecessary disk seeks. Those checks are virtually worthless on today's modern hardware, particularly since filesystems already localize most data accesses. In other words, if your filesystem is fragmented you are going to be doing the seeks anyway, probably.

-Matt Matthew Dillon
Re: Sockets stuck in FIN_WAIT_1
I guess nobody mentioned the obvious thing to check: make sure TCP keepalive is turned on.

sysctl net.inet.tcp.always_keepalive=1

If you don't do this then dead TCP connections can build up, particularly on busy servers, due to the other end simply disappearing. Without this option the TCP protocol can get stuck, because it does not usually send packets to the other end of an idle connection unless (1) its window has closed completely or (2) it has unacknowledged data or state pending.

The keepalive forces a probe to occur every so often on an idle connection (like once every 30min-2hrs, I forget what the default is), to check that the connection still exists. It is possible to get stuck during normal data operation and while in a half-closed state. The 2MSL timeout does not activate until you go into a fully closed state (FIN2/TIME_WAIT).

Pretty much if you are running any sort of service on the internet, and even if you aren't, you need to make sure keepalive is turned on for the long term health of your system.

-Matt
Re: Sockets stuck in FIN_WAIT_1
:On May 29, 2008, at 3:12 PM, Matthew Dillon wrote:
:> I guess nobody mentioned the obvious thing to check: Make sure
:> TCP keepalive is turned on.
:>
:> sysctl net.inet.tcp.always_keepalive=1
:
:Thanks Matt.
:
:I also thought that keepalives were not running and sessions just
:stuck around forever, however I do have:
:
:net.inet.tcp.keepidle=90
:net.inet.tcp.keepintvl=3
:net.inet.tcp.msl=5000
:net.inet.tcp.always_keepalive=1 (default)
:
:I believe keepidle was defaulted to 2hrs, I changed it to 15 minutes
:with a 30 second tick... I still found FIN_WAIT_1 sessions stuck for
:several hours, if not infinite.
:
:Nonetheless, I have a new server up running 7.0-p1, I'll be pumping
:a lot of traffic to that box soon and I'll see how that makes out.
:
:--
:Robert Blayzor, BOFH
:INOC, LLC
:[EMAIL PROTECTED]
:http://www.inoc.net/~rblayzor/

If it is still giving you trouble I recommend using tcpdump to observe the IP/port pair of one of the stuck connections over the keepalive period and see if the keepalives are still being sent and, if they are, what kind of response you get from the other end. It is quite possible that the other ends of the connection are still live and that the issue could very well be a timeout setting in the server config file instead of something in the TCP stack.

This is what you should see when a keepalive occurs over an idle connection:

* A TCP packet w/ 0 data sent to the remote
* A response from the remote: either a pure ACK, or a TCP RESET

If no response occurs from the remote, the keepalive code will then retry a couple of times over keepintvl (every 30 seconds in your case), and if it still gets no response after I think 3 retries (30+30+30 = 90 seconds later) it should terminate the connection state.

-Matt Matthew Dillon
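A hedged example of the suggested observation; the interface name and addresses are placeholders, so substitute one stuck connection's endpoints as reported by netstat.

    # watch one stuck connection for keepalive probes and the peer's replies
    tcpdump -n -i em0 host 203.0.113.5 and port 33379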
Re: Sockets stuck in FIN_WAIT_1
:I think we're onto something here, but for some reason it doesn't make :any sense. I have keepalives turned OFF in Apache: : :When I tcpdump this, I see something sending ack's back and forth :every 60 seconds, but what? Apache? I'm not sure why. I don't see :any timeouts in Apache for ~60 seconds. As you can see, sometimes we :send an ack, but never see a reply. I'm gathering the OS level :keepalives don't come into play because this session is not considered :idle? :
:20:13:07.640426 IP 1.1.1.1.80 > 2.2.2.2.33379: . 4208136508:4208136509(1) ack 1471446041 win 520
:20:13:07.736505 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:20:14:07.702647 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:20:15:07.764920 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:20:15:07.860988 IP 2.2.2.2.33379 > 1.1.1.1.80: . ack 0 win 0
:20:16:07.827262 IP 1.1.1.1.80 > 2.2.2.2.33379: . 0:1(1) ack 1 win 520
:...

Yah, the connection is valid so keepalives do not come into play. What is happening is that 1.1.1.1 wants to send something to 2.2.2.2, but 2.2.2.2 is telling 1.1.1.1 that it has no buffer space (win 0). This forces the TCP stack on 1.1.1.1 (the kernel, not the apache server) to 'probe' the connection, which it appears to be doing once a minute. It is probing the connection waiting for 2.2.2.2 to tell it that buffer space is available (win != 0). The connection remains valid because 2.2.2.2 continues to respond to the probes.

Now, the connection is also in a half-closed state, which means that one direction is closed. I can't tell which direction that is but my guess is that 1.1.1.1 (the apache server) closed the 1.1.1.1->2.2.2.2 direction and the 2.2.2.2 box has a broken TCP implementation and can't deal with it.

:I'm finding several of these sessions doing the same exact thing : :-- :Robert Blayzor, BOFH :INOC, LLC

I can suggest two things. First, the TCP connection is good but you still may be able to tell Apache, in the apache configuration file, to timeout after a certain period of time and clear the connection. Secondly, it may be beneficial to identify exactly what the client and server were talking about which caused the client to hang with a live tcp connection. The only way to do that is to tcpdump EVERYTHING going on related to the apache server, save it to a big-ass disk partition (like 500G), and then when you see a stuck connection go back through the tcpdump log file and locate it, grep it out, and review what exactly it was talking about. You'd have to tcpdump with options to tell it to dump the TCP data payloads.

It seems likely that the client is running an applet or javascript that receives a stream over the connection, and that applet or javascript program has locked up, causing the data sent from the server to build up and for the client's buffer space to run out, and start advertising the 0 window.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Sockets stuck in FIN_WAIT_1
:This is exactly what we're seeing, it's VERY strange. I did kill off :Apache, and all the FIN_WAIT_1's stuck around, so the kernel is in :fact sending these probe packets, every 60 seconds, which the client :responds to... (most of the time).

Ach. Now that I think about it, it is still possible for it to happen that way. Apache closed the connection while there was still data in the socket buffer to the client. The client then refused to read it, but otherwise left the connection alive. It's got to be a bug on the client(s) in question. I can't think of anything else. You may have to resort to injecting a TCP RST packet (e.g. via a TUN device) to clear the connections.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Sockets stuck in FIN_WAIT_1
:Yes, IPFW is running on the box. Why not? : :-- :Robert Blayzor, BOFH :INOC, LLC :[EMAIL PROTECTED] :http://www.inoc.net/~rblayzor/

There's nothing wrong with running IPFW on the same box :-) But, I think that rule change is masking the problem rather than solving it. The keep-state is limited. The reason the number of dead connections isn't going up is probably because IPFW is either hitting its keep-state limit and dropping connections, or the connection becomes idle long enough for IPFW to recycle the keep-state for it, also causing it to drop. Once the keep-state is lost that deny established rule will cause the connection to fail.

I would be very careful with any type of ruleset (IPFW or PF) which relies on keep-state. You can wind up causing legitimate connections to drop if it isn't carefully tuned. It might be a reasonable bandaid, though.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Sysctl knob(s) to set TCP 'nagle' time-out?
:Hi, : :I'm wondering if anything exists to set this.. When you create an INET :socket :without the 'TCP_NODELAY' flag the network layer does 'naggling' on your :transmitted data. Sometimes with hosts that use Delayed_ACK :(net.inet.tcp. :delayed_ack) it creates a dead-lock where the host will not ACK until :it gets :another packet and the client will not send another packet until it :gets an ACK.. : :The dead-lock gets broken by a time-out, which I think is around 200ms? : :But I would like to change that time-out if possible to something :lower, yet :I can't really see any sysctl knobs that have a name that suggests :they do :that.. : :So does anyone know IF this can be tuned and if so by what? : :Cheers, :Jerahmy. : :(And yes you could solve it by setting the TCP_NODELAY flag on the :socket, :but not everything has programmed in options to set it and you don't :always :have access to the source, besides setting a sysctl value would be much :simpler than recompiling stuff)

There is a sysctl which adjusts the delayed-ack timing, it's called net.inet.tcp.delacktime. The default is 1/10 of a second (100 == 100 ms = 1/10 of a second). BUT, it shouldn't be possible for nagle to deadlock against delayed acks unless the TCP implementation is broken somehow. A delayed ack is simply that... the ack is delayed 100 ms in order to improve its chances of being piggy-backed on return data. The ack is not blocked completely, just delayed, and certain events (such as the receiving end turning around and sending data back, which is typical for an interactive connection)... certain events will cause the delayed ack to be aborted and for the ack to be immediately sent with the return data.

Can it break down and cause excessive lag? Yes, it can. Interactive games almost universally have to disable Nagle because the lag is actually due to the data relay from client 1 -> server then relaying the interactive event to client 2. Without an immediate interactive response to client 1 the ack gets delayed and the next event from client 1 hits Nagle and stops dead in the water until the first event reaches client 2 and client 2 reacts to it (then client 2 -> server -> (abort delayed ack and send) -> client 1 (client 1's nagle now allows the second event to be transmitted). That isn't a deadlock, just really poor interactive performance in that particular situation.

Delayed acks also have a safety valve. The spec says that an ack cannot be delayed more than two packets. In a batch link when the second (unacked) packet is received, the delayed ack is aborted and an ack is immediately returned to the sender. This is to prevent congestion control (which is based on acks) from getting completely out of whack and also to prevent the TCP window from getting exhausted.

In any case, the usual solution is to disable Nagle rather than mess with delayed acks. What we need is a new Nagle that understands the new reality for interactive connections... something that doesn't break performance in the 'server in the middle' data relaying case.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
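As noted above, the usual fix is to disable Nagle on the affected socket rather than tune delayed acks. For programs you do have source for, that is a one-line setsockopt; the helper below is a plain illustration of the standard API:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    /* Disable Nagle on an already-created TCP socket.  Small writes will
     * then be packetized immediately instead of waiting for outstanding
     * data to be acknowledged.  Returns 0 on success, -1 on error. */
    int
    disable_nagle(int s)
    {
            int on = 1;

            if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) < 0) {
                    perror("setsockopt(TCP_NODELAY)");
                    return -1;
            }
            return 0;
    }

For binaries you cannot rebuild, the sysctl route discussed in the post is the only lever left, which is exactly the limitation the original poster ran into.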
Re: Sysctl knob(s) to set TCP 'nagle' time-out?
:One possibility I see is a statistic about DelACKs per TCP connection, :counting those that were rightfully delayed (with hindsight). I.e., :if an ACK is delayed, but there was no chance to piggy-back it or to :combine it with another ACK, it could have been sent without delay. :Only those delayed ACKs that reduce load are "good", all others cause :additional state to be maintained and may increase latencies for no :good reason. : :... :consideration. And to me, automatic setting of TCP_NODELAY seems :more useful than automatic clearing (after delayed ACKs had been :found to be of no use for a window of say 8 or 16 ACKs). : :The implementation would be quite simple: Whenever a delayed ACK :is sent, check whether it is sent on its own (bad) or whether it :could be piggy-backed (good). If, say, 7 of 8 delayed ACKs had to :be sent as ACK-only packets, anyway, set TCP_NODELAY and do not :bother to keep on deciding whether delayed ACKs had become useful :in a different phase of the communication. If you want to be able :to automatically disable TCP_NODELAY, then just set a time-stamp :... :Regards, STefan

That's an interesting approach. I think it would catch some of the cases, but not enough of them. If the round-trip in the server-relaying case is less than the delayed-ack, the acks will still wind up piggy-backed on return traffic but the latency will also still remain horrible.

It should be noted that Nagle can cause high latencies even when delayed acks are turned off. Nagle's delay is not timed... in its simplest description it prevents packets from being transmitted for new data coming from userland if the data already in the sockbuf (and presumably already transmitted) has not yet been acknowledged. For interactive traffic this means that Nagle is putting the screws on the packet stream even if the acks aren't delayed, simply from the ack latency. With delayed acks turned off the latency is lower, but not 0, so interactive traffic is still being held up by Nagle. The effect is noticeable even on a LAN. Jerahmy brought up Samba... that is an excellent example. NFS-over-TCP would be another good example. Any protocol which multiplexes multiple commands from different sources over the same connection gets really messed up (slowed down) by Nagle.

On the flip side, Nagle can't just be turned off by default because it would cause streaming connections from user programs which do tiny writes to generate a lot of unnecessarily tiny packets. This can become apparent when using SSH over a slow link. Numerous programs run from a shell generate fairly inefficient packets which could have easily been batched when operating over SSH. The result can be sludgy performance for output which ought to be batched up by TCP but isn't because SSH turns off Nagle unconditionally.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
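The "tiny write" problem Matt describes above can also be attacked in userland: a program that must run with Nagle disabled can batch its pieces itself instead of issuing many small write() calls. A small, self-contained illustration (the pipe in main() only stands in for a real TCP socket):

    #include <sys/uio.h>
    #include <string.h>
    #include <unistd.h>

    /* With TCP_NODELAY set, each write() of a few bytes can become its own
     * small packet.  Handing TCP one coalesced buffer via writev() lets it
     * packetize efficiently regardless of the Nagle setting. */
    static ssize_t
    send_batched(int fd, char *hdr, char *body, char *trailer)
    {
            struct iovec iov[3];

            iov[0].iov_base = hdr;     iov[0].iov_len = strlen(hdr);
            iov[1].iov_base = body;    iov[1].iov_len = strlen(body);
            iov[2].iov_base = trailer; iov[2].iov_len = strlen(trailer);

            return writev(fd, iov, 3);      /* one syscall, one buffer */
    }

    int
    main(void)
    {
            int fds[2];
            char buf[128];

            if (pipe(fds) != 0)
                    return 1;
            send_batched(fds[1], "GET ", "/index.html", " HTTP/1.0\r\n\r\n");
            return (read(fds[0], buf, sizeof(buf)) > 0) ? 0 : 1;
    }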
Re: Performance of madvise / msync
: 65074 python 0.000006 CALL madvise(0x287c5000,0x70,_MADV_WILLNEED)
: 65074 python 0.027455 RET madvise 0
: 65074 python 0.000058 CALL madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
: 65074 python 0.016904 RET madvise 0
: 65074 python 0.000179 CALL madvise(0x287c6000,0x1950,_MADV_WILLNEED)
: 65074 python 0.008629 RET madvise 0
: 65074 python 0.000040 CALL madvise(0x287c8000,0x8,_MADV_WILLNEED)
: 65074 python 0.004173 RET madvise 0
:...
: 65074 python 0.006084 CALL msync(0x287c5000,0x453b7618,MS_ASYNC)
: 65074 python 0.106284 RET msync 0
:...
:As you can see, it's quite a bit faster. : :I know that msync is necessary under Linux but obsolete under FreeBSD, but :it's still funny that it takes a tenth of a second to return even with :MS_ASYNC specified. : :Also, why is it that the madvise() calls take so much longer when the :program does a couple of its own madvise() calls? Was madvise() never :intended to be run so frequently and is therefore a little slower than :it could be? : :Here's the diff between the code for the first kdump above and the :second one.

Those times are way way too large, even with other running threads in the system. madvise() should not take that long unless it is being forced to wait on a busied page, and neither should msync(). madvise() doesn't even do any I/O (or shouldn't anyhow).

Try removing just the msync() but keep the madvise() calls and see if the madvise() calls continue to take horrendous amounts of time. Then try the vice-versa. It kinda feels like a prior msync() is initiating physical I/O on pages and a later mmap/madvise or page fault is being forced to wait on the prior pages for the I/O to finish.

The size_t argument to msync() (0x453b7618) is highly questionable. It could be ktrace reporting the wrong value, but maybe not. On any sort of random writing test, particularly if multiple threads are involved, specifying a size that large could result in very large latencies.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Performance of madvise / msync
:With madvise() and without msync(), there are high numbers of :faults, which matches the number of disk io operations. It :goes through cycles, every once in a while stalling while about :60MB of data is dumped to disk at 20MB/s or so (buffers flushing?) :At the beginning of each cycle it's fast, with 140 faults/s or so, :and slows as the number of faults climbs to 180/s or so before :stalling and flusing again. It never gets _really_ slow though.

Yah, without the msync() the dirty pages build up in the kernel's VM page cache. A flush should happen automatically every 30-60 seconds, or sooner if the buffer cache builds up too many dirty pages. The activity you are seeing sounds like the 30-60 second filesystem sync the kernel does periodically. Either NetBSD or OpenBSD, I forget which, implemented a partial sync feature to prevent long stalls when the filesystem syncer hits a file with a lot of dirty pages. FreeBSD could borrow that optimization if they want to reduce stalls from the filesystem sync. I ported it to DFly a while back myself.

:With msync() and without madvise(), things are very slow, and :there are no faults, just writes. :... :> The size_t argument to msync() (0x453b7618) is highly questionable. :> It could be ktrace reporting the wrong value, but maybe not. : :That's the size of rg2.rrd. It's 1161524760 bytes long. :... :Looks like the source of my problem is very slow msync() on the :file when the file is over a certain size. It's still fastest :without either madvise or msync. : :Thanks for your time, : :Marcus

The msync() is clearly the problem. There are numerous optimizations in the kernel but msync() is frankly a rather nasty critter even when the optimizations work. Nobody using msync() in real life ever tries to run it over the entirety of such a large mapping... usually it is just run on explicit sub-ranges that the program wishes to sync.

One reason why msync() is so nasty is that the kernel must physically check the page table(s) to determine whether a page has been marked dirty by the MMU, so it can't just iterate the pages it knows are dirty in the VM object. It's nasty whether it scans the VM object and iterates the page tables, or scans the page tables and looks up the related VM pages. The only way to optimize this is to force write-faults by mapping clean pages read-only, in order to track whether a page is actually dirty in real time instead of lazily. Then msync() would only have to do a ranged-scan of the VM object's dirty-page list and would not have to actually check the page tables for clean pages.

A secondary effect of the msync() is that it is initiating asynchronous I/O for what sounds like hundreds of VM pages, or even more. All those pages are locked and busied from the point they are queued to the point the I/O finishes, which for some of the pages can be a very, very long time (into the multiples of seconds). Pages locked that long will interfere with madvise() calls made after the msync(), and probably even interfere with the following msync().

It used to be that msync() only synced VM pages to the underlying file, making them consistent with read()'s and write()'s against the underlying file. Since FreeBSD uses a unified VM page cache this is always true. However, the Open Group specification now requires that the dirty pages actually be written out to the underlying media... i.e. issue real I/O. So msync() can't be a NOP if you go by the OpenGroup specification.
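A short userland illustration of the "sync only the sub-range you touched" advice above, using the file size reported in this thread (the file name and offsets are just placeholders; error handling is minimal):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            size_t maplen = 1161524760;            /* size of rg2.rrd per the thread */
            size_t pagesz = (size_t)getpagesize();
            size_t off = 12345678;                 /* arbitrary region we dirty */
            int fd = open("rg2.rrd", O_RDWR);
            char *base;

            if (fd < 0) { perror("open"); return 1; }
            base = mmap(NULL, maplen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (base == MAP_FAILED) { perror("mmap"); return 1; }

            /* Dirty a small region of the mapping... */
            memset(base + off, 0, 4096);

            /* ...then msync() only the pages covering that region, aligned
             * down to a page boundary, instead of the entire 1.1GB mapping. */
            size_t start = off & ~(pagesz - 1);
            size_t len = (off + 4096) - start;
            if (msync(base + start, len, MS_ASYNC) != 0)
                    perror("msync");

            munmap(base, maplen);
            close(fd);
            return 0;
    }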
-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: dvd dma problems
:> quite fine from 5.3 to somewhere in the 6.x branch. Nowadays I have to send :> them to PIO4 to play DVDs, because they'll just throw DMA not aligned errors :> around in UDMA33 or WDMA2 mode. :> :> Should someone be interested in this I'm willing to supply all necessary :> information, such as the exact drives, firmware versions, kernel traces... :> whatever comes to your mind. I'm also willing to test patches. : :Is the problem you're seeing identical to this? : :http://lists.freebsd.org/pipermail/freebsd-hackers/2008-July/025297.html : :-- :| Jeremy Chadwick   jdc at parodius.com |

One of our guys (in DragonFly-land) tracked this down to two issues, fixing either one will fix the problem. I'd include a patch but he has not finished it yet. Still, anyone with moderate kernel programming skills can probably fix it in an hour or less.

physio() - uses vmapbuf(). vmapbuf() does NOT realign the user address, it simply maps it into the buffer and adjusts b_data. So if the user supplies a badly aligned buffer, physio() will happily pass that bad alignment to the driver. physio() could be modified to allocate kernel memory to back the pbuf and copy instead of calling vmapbuf(), for those cases where the user supplied buffer is not well aligned (e.g. not at least 512-byte aligned). The pbuf already reserves KVA so all one would need to do is allocate pages to back the KVA space. I think a couple of other subsystems in the kernel do this with pbufs so there is plenty of example material.

--

The ATA driver has an explicit alignment check and also uses BUS_DMA_NOWAIT in its call to bus_dmamap_load() in ata_dmaload(). The ATA driver could be adjusted to remove the alignment check, remove the BUS_DMA_NOWAIT flag, and also not free the bounce buffer when DMA ends (so you don't get allocator deadlocks). You might have other issues related to lock ordering, and this solution would eat a considerable amount of memory (upwards of a megabyte, more if you have more ATA channels), but that's the gist of it. It should be noted that only physio() can supply unaligned BIOs to the driver layer. All other BIO sources (that I know of) will always be at least 512-byte aligned.

--

My recommendation is to fix physio(). User programs that do not supply aligned buffers clearly don't care about performance, so the kernel can just back the pbuf with memory and copyin/out the user data.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
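Since only physio() can hand unaligned buffers to the driver, a userland program doing raw device reads can also sidestep the whole problem by allocating a sector-aligned buffer itself. A minimal sketch; the device node is a placeholder for whatever DVD/ATA device is affected:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define SECTOR 512              /* alignment the driver layer expects */

    int
    main(void)
    {
            void *buf;
            int fd = open("/dev/acd0", O_RDONLY);   /* placeholder device */

            if (fd < 0) { perror("open"); return 1; }

            /* A 64KB buffer aligned to a 512-byte boundary, so the address
             * physio()/vmapbuf() passes down is never misaligned. */
            if (posix_memalign(&buf, SECTOR, 65536) != 0) {
                    perror("posix_memalign");
                    return 1;
            }
            if (read(fd, buf, 65536) < 0)
                    perror("read");

            free(buf);
            close(fd);
            return 0;
    }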
Re: taskqueue timeout
:Hi everyone, : :I'm wondering if the problems described in the following link have been :resolved: : :http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2008-02/msg00211.html : :I've got four 500GB SATA disks in a ZFS raidz pool, and all four of them :are experiencing the behavior. : :The problem only happens with extreme disk activity. The box becomes :unresponsive (can not SSH etc). Keyboard input is displayed on the :console, but the commands are not accepted. : :Is there anything I can do to either figure this out, or work around it? : :Steve

If you are getting DMA timeouts, go to this URL:

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Then I would suggest going into /usr/src/sys/dev/ata (I think, on FreeBSD), locate all instances where request->timeout is set to 5, and change them all to 10.

cd /usr/src/sys/dev/ata
fgrep 'request->timeout' *.c
... change all assignments of 5 to 10 ...

Try that first. If it helps then it is a known issue. Basically a combination of the on-disk write cache and possible ECC corrections, remappings, or excessive remapped sectors can cause the drive to take much longer than normal to complete a request. The default 5-second timeout is insufficient. If it does help, post confirmation to prod the FBsd developers to change the timeouts.

--

If you are NOT getting DMA timeouts then the ZFS lockups may be due to buffer/memory deadlocks. ZFS has knobs for adjusting its memory footprint size. Lowering the footprint ought to solve (most of) those issues. It's actually somewhat of a hard issue to solve. Filesystems like UFS aren't complex enough to require the sort of dynamic memory allocations deep in the filesystem that ZFS and HAMMER need to do.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Multi-machine mirroring choices
:Oliver Fromme wrote: : :> Yet another way would be to use DragoFly's "Hammer" file :> system which is part of DragonFly BSD 2.0 which will be :> released in a few days. It supports remote mirroring, :> i.e. mirror source and mirror target can run on different :> machines. Of course it is still very new and experimental :> (however, ZFS is marked experimental, too), so you probably :> don't want to use it on critical production machines. : :Let's not get carried away here :) : :Kris

Heh. I think its safe to say that a *NATIVE* uninterrupted and fully cache coherent fail-over feature is not something any of us in BSDland have yet. It's a damn difficult problem that is frankly best solved above the filesystem layer, but with filesystem support for bulk mirroring operations.

HAMMER's native mirroring was the last major feature to go into it before the upcoming release, so it will definitely be more experimental than the rest of HAMMER. This is mainly because it implements a full blown queue-less incremental snapshot and mirroring algorithm, single-master-to-multi-slave. It does it at a very low level, by optimally scanning HAMMER's B-Tree. In other words, the kitchen sink. The B-Tree propagates the highest transaction id up to the root to support incremental mirroring and that's the bit that is highly experimental and not well tested yet. It's fairly complex because even destroyed B-Tree records and collapses must propagate a transaction id up the tree (so the mirroring code knows what it needs to send to the other end to do comparative deletions on the target). (transaction ids are bundled together in larger flushes so the actual B-Tree overhead is minimal).

The rest of HAMMER is shaping up very well for the release. It's phenomenal when it comes to storing backups. Post-release I'll be moving more of our production systems to HAMMER. The only sticky issue we have is filesystem-full handling, but it is more a matter of fine-tuning than anything else.

--

Someone mentioned atime and mtime. For something like ZFS or HAMMER, these fields represent a real problem (atime more than mtime). I'm kinda interested in knowing, does ZFS do block replacement for atime updates? For HAMMER I don't roll new B-Tree records for atime or mtime updates. I update the fields in-place in the current version of the inode and all snapshot accesses will lock them (in getattr) to ctime in order to guarantee a consistent result. That way (tar | md5) can be used to validate snapshot integrity.

At the moment, in this first release, the mirroring code does not propagate atime or mtime. I plan to do it, though. Even though I don't roll new B-Tree records for atime/mtime updates I can still propagate a new transaction id up the B-Tree to make the changes visible to the mirroring code. I'll definitely be doing that for mtime and will have the option to do it for atime as well. But atime still represents a big expense in actual mirroring bandwidth. If someone reads a million files on the master then a million inode records (sans file contents) would end up in the mirroring stream just for the atime update. Ick.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: taskqueue timeout
:Went from 10->15, and it took quite a bit longer into the backup before :the problem cropped back up. Try 30 or longer. See if you can make the problem go away entirely. then fall back to 5 and see if the problem resumes at its earlier pace. -- It could be temperature related. The drives are being exercised a lot, they could very well be overheating. To find out add more airflow (a big house fan would do the trick). -- It could be that errors are accumulating on the drives, but it seems unlikely that four drives would exhibit the same problem. -- Also make sure the power supply can handle four drives. Most power supplies that come with consumer boxes can't under full load if you also have a mid or high-end graphics card installed. Power supplies that come with OEM slap-together enclosures are not usually much better. Specifically, look at the +5V and +12V amperage maximums on the power supply, then check the disk labels to see what they draw, then multiply by 2. e.g. if your power supply can do [EMAIL PROTECTED] and you have four drives each taking [EMAIL PROTECTED] (and typically ~half that at 5V), thats 4x2x2 = [EMAIL PROTECTED] and you would probably be ok. To test, remove two of the four drives, reformat the ZFS to use just 2, and see if the problem reoccurs with just two drives. -Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: taskqueue timeout
:... :> and see if the problem reoccurs with just two drives. : :... I knew that was going to come up... my response is "I worked so hard :to get this system with ZFS all configured *exactly* how I wanted it". : :To test, I'm going to flip to 30 as per Matthews recommendation, and see :how far that takes me. At this time, I'm only testing by backing up one :machine on the network. If it fails, I'll clock the time, and then :'reformat' with two drives. : :Is there a technical reason this may work better with only two drives? : :Is there anyone interested to the point where remote login would be helpful? : :Steve

This issue is vexing a lot of people. Setting the timeout to 30 will not affect performance, but it will cause a 30 second delay in recovery when (if) the problem occurs. i.e. when the disk stalls it will just sit there doing nothing for 30 seconds, then it will print the timeout message and try to recover.

It occurs to me that it might be beneficial to actually measure the disk's response time to each request, and then graph it over a period of time. Maybe seeing the issue visually will give some clue as to the actual cause.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Max size of one swap slice
:> Recently we found that we can only allocate 32GB for one swap slice. :> Does there is any sysctl oid or any kernel option to increase it? Why :> we have this restriction? : :this is a consequence of the data structure used to manage swap space. See :sys/blist.h for details. It *seems* that you *might* be able to increase the :coverage by decreasing BLIST_META_RADIX, but that's from a quick glance and :most certainly not a good idea. : :However, the blist is a abstract enough API so that you can likely replace it :with something that supports 64bit addresses (and thus 512*2^64 bytes of swap :space per device) ... but I don't see why you'd want to do something like :this. Remember that you need memory to manage your swap space as well! : :-- :/"\ Best regards, | [EMAIL PROTECTED] :\ / Max Laier | ICQ #67774661

The core structures can handle 2 billion swap pages == 2TB of swap, but the blist code hits arithmetic overflows if a single blist has more than (0x40000000 / BLIST_META_RADIX) = 1G/16 = 64M swap blocks, or 256GB. I think the VM/BIO system had additional overflow issues due to conversions back and forth between PAGE_SIZE and DEV_BSIZE which further restricted the limit to 32GB. Those restrictions may be gone now that FreeBSD is using 64 bit block numbers, so you may be able to pop it up to 256GB with virtually no effort (but you need to test it significantly!).

With some work on the blist code only (not its structures) the arithmetic overflow issues could also be resolved, increasing the swap capability to 2TB. I do not recommend changing any of the core blist structure, particularly not BLIST_META_RADIX. Just don't try :-). You do NOT want to bump the swap block number fields to 64 bits.

Also note that significant memory is used to manage that much swap. It's a factor of 1:16384 or so for the blist structures and probably about the same amount for the vm_object tracking structures. 32G of swap needs around 2-4MB of wired ram.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Max size of one swap slice
: :See : :http://www.freebsd.org/cgi/getmsg.cgi?fetch=540837+0+/usr/local/www/db/text/2008/freebsd-questions/20080706.freebsd-questions : :Kris

Hmm. I see an issue that FreeBSD could correct to reduce wired memory use by the swap system. Your sys/blist.h has this:

typedef u_int32_t u_daddr_t; /* unsigned disk address */

and your sys/types.h has this:

typedef int64_t daddr_t; /* unsigned disk address */

sys/blist.h really assumes a 32 bit daddr_t. It's amazing the code even still works with daddr_t at 64 bits and u_daddr_t at 32 bits. Changing that whole mess in sys/blist.h to a different typedef name, say swblk_t (which is already defined to be 32 bits), and renaming u_daddr_t to u_swblk_t, plus also changing the swblock structure in vm/swap_pager.c to use 32 bit array elements instead of 64 bit array elements will cut the size of struct swblock almost in half. There is no real need for swap block addressing > 32 bits. 32 bits gives you swap in the terabyte range.

struct swblock {
        struct swblock  *swb_hnext;
        vm_object_t     swb_object;
        vm_pindex_t     swb_index;
        int             swb_count;
        daddr_t         swb_pages[SWAP_META_PAGES];     <<<<<<<<< this array
};

Any arithmetic accessing the elements would also have to be vetted for any necessary promotions.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
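For clarity, this is roughly what the suggested change would look like in code. The typedef names follow Matt's proposal (swblk_t / u_swblk_t); it is a sketch of the idea, not a patch against the actual FreeBSD sources:

    /* Sketch: key swap metadata to an explicit 32 bit swap block type
     * instead of the now-64-bit daddr_t, roughly halving the size of
     * struct swblock.  SWAP_META_PAGES lives in vm/swap_pager.c. */
    typedef int32_t         swblk_t;        /* swap block number */
    typedef u_int32_t       u_swblk_t;      /* unsigned swap block number */

    struct swblock {
            struct swblock  *swb_hnext;
            vm_object_t     swb_object;
            vm_pindex_t     swb_index;
            int             swb_count;
            swblk_t         swb_pages[SWAP_META_PAGES];     /* was daddr_t */
    };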
Re: calcru: runtime went backwards, RELENG_6, SMP
:IV> > Upd: on GENERIC/amd64 kernel I got the same errors. :IV> :IV> Do you perhaps run with TSC timecounter? (that's the only cause I've notice :IV> that can generate this message). : :Nope: : :[EMAIL PROTECTED]:~> sysctl kern.timecounter :kern.timecounter.tick: 1 :kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-100) :kern.timecounter.hardware: ACPI-fast :...

kgdb your live kernel and 'print cpu_ticks'. See what the cpu ticker is actually pointing at, because it might not be the time counter. It could still be TSC.

The TSC isn't synchronized between the cores on a SMP box, not even on multi-core parts. It can't be used to calculate delta times for any thread that has the possibility of migrating between cpu's. Not only will the absolute offset be off between cpus, but the frequency will also be slightly different (at least on SMP multi-core parts), so you get frequency drift too.

There is also possibly an issue with tc_cpu_ticks(), which seems to be using a static 64 bit variable to handle rollover instead of a per-cpu variable. I don't see how that could possibly be MP safe, especially if the timecount is not synchronized between cpus and causes multiple rollover events.

In fact, I can *barely* use the TSC on DragonFly for KTR logging, and even then I have to have some kernel threads sitting there doing nothing but figuring out the drift between the cpus so it can correct the TSC values when it logs information... and even with all of that I can't get them synchronized any closer than around 500ns from each other.

I'd recommend that FreeBSD do what we did years ago with calcru ... stop trying to calculate the time down to the nanosecond and just do it statistically. It works just fine and takes the whole mess out of the critical path.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: calcru: runtime went backwards, RELENG_6, SMP
:Hmm, i'm not sure I understand you right: what do you mean by 'kgdb live :kernel'? I send break over serial console, and in ddb got : :db> print cpu_ticks :Symbol not found : :Sincerely, :D.Marck [DM5020, MCK-RIPE, DM3-RIPN] I think it works the same on FreeBSD, so it would be something like: kgdb /kernel /dev/mem ^^^ NOTE! Dangerous! ^^^ But I looked at the cvs logs and the variable didn't exist in FreeBSD-6, so it wouldn't have helped anyway. It looks like it is using binuptime() in 6.x, and it also looks like the tick calculations, e.g. rux_uticks, is based on the stat clock interrupt, whereas the runtime calculation is using binuptime. There is no way those two could possibly be synchronized. No chance whatsoever. Your only solution may be to upgrade to FreeBSD-7 which uses an entirely different mechanism for the calculation (though one that also seems flawed in its own way). Alternatively you could just remove the error message from the kernel entirely and not worry about it. It's a printf around line 774 in /usr/src/sys/kern/kern_resource.c (in FreeBSD-6.x). -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: calcru: runtime went backwards, RELENG_6, SMP
:Well, I can of course shut the kernel up, but kernel time stability is still my :concern. I run ntpd there and while sometimes it seems stable (well, sorta: :drift are within several seconds...) there are cases of half-a-minute time :steps. : :Sincerely, :D.Marck [DM5020, MCK-RIPE, DM3-RIPN] I think the only hope you have of getting the issue addressed is to run FreeBSD current. If you can reproduce the time slips under current the developers should be able to track the problem down and fix it. The code is so different between those two releases that they are going to have a hard time working the problem in FreeBSD-6. If you don't want to do that, try forcing the timer to use the 8254 and see if that helps. You may also have to reduce the system tick to ~100-200 hz. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: calcru: runtime went backwards, RELENG_6, SMP
:==
: cs3661.rinet.ru  192.38.7.240     2 u  354 1024  377  5.305  -66314.  4321.47
: ns.rinet.ru      130.207.244.240  2 u  365 1024  377  6.913  -66316.  4305.33
: whale.rinet.ru   195.2.64.5       2 u  358 1024  377  7.939  -66308.  4304.90
:
:Any directions to debug this? : :Sincerely, :D.Marck [DM5020, MCK-RIPE, DM3-RIPN]

Since you are running on HEAD now, could you also kgdb the live kernel and print cpu_ticks? I believe the sequence is (someone correct me if I am wrong):

kgdb /kernel /dev/mem
print cpu_ticks

As for further tests... try building a non-SMP kernel (i.e. one that only recognizes one cpu) and see if the problem occurs there. That will determine whether there is a basic problem with time keeping or whether it is an issue with SMP. I'm afraid there isn't much more I can do to help, other than to make suggestions on tests that you can run that will hopefully ring a bell with another developer.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: calcru: runtime went backwards, RELENG_6, SMP
:s,/kernel,/boot/kernel/kernel, ;-) : :well, strange enough result for me: : :(kgdb) print cpu_ticks :$1 = (cpu_tick_f *) 0x8036cef0 : :Does this mean that kernel uses tsc? sysctl reports : :kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-100) :kern.timecounter.hardware: ACPI-fast

It means the kernel is using the TSC for calcru. It's using ACPI-fast for normal timekeeping. In any case, that's the problem right there, or at least one problem. The TSC cannot safely be used for calcru or much of anything else on a SMP system because the TSCs aren't synchronized between cpu's and because their frequencies aren't locked, so they will drift relative to each other as well.

If you want to run another test, try disabling the use of the TSC for calcru. There is no boot variable I can see to do it so go into /usr/src/sys/i386/i386/tsc.c and comment out the call to set_cputicker() on line 107 and line 187. Then see if that helps. If you are doing an amd64 build comment it out in amd64/amd64/tsc.c line 98 and line 163.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: removing external usb hdd without unmounting causes reboot?
:> By the way, the problem apparently has been solved in :> DragonFly BSD (i.e. DF BSD does not panic when a mounted :> FS is physically removed). Maybe it is worth to have a

We didn't do much here. Just started pulling devices, looking at the crash dumps, and fixing things. Basically it was just a collection of minor bugs... things like certain error paths in UFS (which only occur on an I/O error) had bugs, or caused corruption instead of properly handling the error, and various bits and pieces of the USB I/O path would get ripped out on the device pull while still referenced by other bits of the USB I/O path.

You will also have to look at the way vfs flushing handles errors in order to allow a filesystem to be force-unmounted after the device has been pulled. Basically you have to make umount -f work and you have to make sure it properly dereferences the underlying device and properly destroys the (now unwritable) dirty buffers.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: default dns config change causing major poolpah
The vast majority of machine installations just slave their dns off of another machine, and because of that I do not think it is particularly odious to require some level of skill for those who actually want to set up their own server. To that end what I do on DragonFly is simply supply a README file in /etc/namedb along with a few helper scripts describing how to do it in a fairly painless manner. If a user cannot understand the README then he has no business setting up a DNS server anyhow. Distributions need to be fairly sensitive to doing anything that might accidentally (through lack of understanding) cause an overload of critical internet resources.

http://www.dragonflybsd.org/cvsweb/src/etc/namedb/

I generally recommend using our 'getroot' script to download an actual root.zone file instead of using a hints file (and I guess AXFR is supposed to replace both concepts). It has always seemed to me that actually downloading a physical root zone file once a week is the most reliable solution. I've never trusted using a hints file... not for at least a decade, and I probably wouldn't trust AXFR for the same reason. Probably my mistrust is due to the massive problems I had using a hints file long ago and I'm sure it works better these days, but I've never found any reason to switch back from an actual root.zone. I've enclosed the getroot script we ship below.

In any case, it seems to me that there is no good reason to try to automate dns services as a distribution default in the manner being described. Just my two-cents.

-Matt

#!/bin/tcsh -f
#
# If you are running named and using root.zone as a master, the root.zone
# file should be updated periodicly from ftp.rs.internic.net.
#
# $DragonFly: src/etc/namedb/getroot,v 1.2 2005/02/24 21:58:20 dillon Exp $

cd /etc/namedb
umask 027
set hostname = 'ftp.rs.internic.net'
set remfile = domain/root.zone.gz
set locfile = root.zone.gz
set path = ( /bin /usr/bin /sbin /usr/sbin )

fetch ftp://${hostname}:/${remfile}
if ( $status != 0) then
    rm -f ${locfile}
    echo "Download failed"
else
    gunzip < ${locfile} > root.zone.new
    if ( $status == 0 ) then
        rm -f ${locfile}
        if ( -f root.zone ) then
            mv -f root.zone root.zone.bak
        endif
        chmod 644 root.zone.new
        mv -f root.zone.new root.zone
        echo "Download succeeded, restarting named"
        rndc reload
        sleep 1
        rndc status
    else
        echo "Download failed: gunzip returned an error"
        rm -f ${locfile}
    endif
endif

___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: A little story of failed raid5 (3ware 8000 series)
A friend of mine once told me that the only worthwhile RAID systems are the ones that email you a detailed message when something goes south. -Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Quation about HZ kernel option
The basic answer is that HZ is almost, but not quite, irrelevant. If a process blocks, another will immediately be scheduled. More importantly, if an interrupt driven event (keyboard, tty, network, disk, etc) wakes a process up the scheduler has the ability to force an IMMEDIATE reschedule. Nearly ALL process related events schedule the process from this sort of reschedule. Generally speaking only cpu-bound processes will be hitting the scheduler quantum on a regular basis.

For network protocols HZ is the basis for the timeout subsystem which is only triggered when things actually time-out, which is fairly rare in a normally running system. Queue timers, select timeouts, and nanosleep are restricted by HZ in granularity, but in nearly all cases those calls are used with very large timeouts not really subject to the granularity of HZ.

I think a higher HZ can be somewhat beneficial if you are running a lot of processes which fall through the scheduler's cracks (both cpu and disk bound, usually), or if the scheduler is badly written, but otherwise a lower value will not have much of an effect. I would not go under 100, though. I personally believe that a default of 1000 is ridiculously high, especially on a SMP system.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Quation about HZ kernel option
:Nuts! Everybody has his own opinion on this matter. :Any idea how to actually build syntetic but close to real :benchmark for this?

It is literally impossible to write a benchmark to test this, because the effects you are measuring are primarily scheduling effects related to the scheduling algorithm and not so much the time quantum. One can demonstrate that ultra low values of HZ are bad, and ultra high values of HZ are also bad, but everything in the middle is subject to so much 'noise' (the type of test, the scheduler algorithm, and so on and so forth) that it is just impossible. This is probably why there is so much argument over the issue.

:For example: :Usual web server does: :1) forks :2) reads a bunch of small files from disk for some time :3) forks some cgi scripts :4) dies : :If i write a test in C doing somthing like this and run :very many of then is parallel for, say, 1 hour and then :count how many interation have been done with HZ=100 and :with HZ=1000 will it be a good test for this? : :-- :Regards :Artem

Well, the vast majority of web pages are served in a microsecond timeframe and clearly not subject to scheduler quantum because the web server almost immediately blocks. Literally 100 uS or less and the web server's work is done. You can ktrace a web server to see this in action. Serving pages is usually either very fast or the process winds up blocking on I/O (again not subject to the scheduler quantum). CGIs and applets are another story because they tend to be more cpu-intensive, but I would argue that the scheduler algorithm will have a much larger effect on performance and interactivity than the time quantum. You only have so much cpu to play with -- a faster HZ will not give you more, so if your system is cpu bound it all comes down to the scheduler selecting which processes it feels are the most important to run at any given moment.

One might think that quickly switching between processes is a good idea but there are plenty of workloads where it can have catastrophic results, such as when a X client is shoving a lot of data to the X server. In that case fast switching is bad because efficient client/server interactions depend very heavily on the client being able to build up a large buffer of operations for the server to execute in bulk. X becomes wildly inefficient with fast switching... It can wind up going 2x, 4x, even 8x slower. Generally speaking, any pipelined workload suffers with fast switching whereas non-pipelined workloads tend to benefit. Operations which can complete in a short period of time anyway (say 10ms) suffer if they are switched out, operations which take longer do not. One of the biggest problems is that applications tend to operate in absolutes (a different absolute depending on the application and the situation), whereas the scheduler has to make decisions based on counting quantums.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: ZFS root File System
My experience with one of our people trying to do the same thing w/ HAMMER... we got it working, but it is not necessarily cleaner. I'd rather just boot from a small UFS /boot partition on 'a' (256M or 512M), followed by swap on 'b', followed by the big-ass root partition on 'd' using your favorite filesystem. The boot code already pretty much handles this state of affairs, one only needs:

(1) To partition it this way.

(2) Add a line to /boot/loader.conf pointing the kernel at the actual root, e.g. (in my case): vfs.root.mountfrom="hammer:ad6s1d"

(3) Adjust sysctl kern.bootfile in e.g. /etc/sysctl.conf. Since the boot loader thinks the kernel is on / instead of /boot (because /boot is the root from the point of view of the bootloader), it might set this to "/kernel" instead of "/boot/kernel". So you may have to override it to make crash dumps and name lists work properly.

(4) Add a mount for the little /boot partition in /etc/fstab.

Trying to create one large root on 'a' puts the default spot for swap on 'b' at the end of the disk instead of near the beginning. The end of the disk (closer to the spindle) is a bad place for swap. Having a small /boot partition there instead retains the ordering and puts the swap where it is expected to be.

# df
Filesystem   1K-blocks      Used     Avail Capacity  Mounted on
/dev/ad6s1d  193888256   1662976 192225280     1%    /
/dev/ad6s1a     257998    110896    126464    47%    /boot

--

In any case, if RAID is an issue the loader could always be adjusted to look for a boot partition on multiple disks. One could then have a /boot on two independent disks, or even operate it as a soft-raid-mirror. It seems less of an issue these days since someone with that sort of requirement who isn't already net-booting can just pop in a SSD for booting which will have approximately the same or better MTBF as the motherboard electronics.

The problem we face with HAMMER is related to the boot loader not being able to run the UNDO buffer (yet), so it might not be able to find the kernel after a crash. That and the inconvenient place swap ends up at.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: incorrect usleep/select delays with HZ > 2500
What we wound up doing was splitting tvtohz() into two functions.

tvtohz_high(tv) - Returned value meets or exceeds the requested time. A minimum value of 1 is returned (really only for {0,0}.. else the minimum value is 2).

tvtohz_low(tv) - Returned value might be shorter than the requested time, and 0 can be returned.

Most kernel functions use the tvtohz_high() function. Only a few use tvtohz_low(). I have not found any 'good' solution to the problem. For example, average-up errors can mount up when using the results to control a callout timer, resulting in much longer delays than originally intended, and similarly same-tick interrupts (e.g. a value of 1) can create much shorter delays than expected. Sometimes one cares more about the average interval being correct, other times the time must not be allowed to be too short. You lose no matter what you choose.

http://fxr.watson.org/fxr/source/kern/kern_clock.c?v=DFBSD

If you look at tvtohz_high() you will note that the minimum value of 1 is only returned if the passed tv is essentially {0,0}. i.e. 0uS. 1uS == 2 ticks (((us + (tick - 1)) / tick) + 1). The 'tick' global here is the number of uS per tick (not to be confused with 'ticks'). Because of all of that I decided to split the function to make the requirements more apparent.

--

The nanosleep() work is a different issue... that's for userland calls (primarily the libc usleep() function). We found that some linux programs assumed that nanosleep() was far more fine-grained than (hz) and, anyway, the system call is called 'nanosleep' and 'usleep' which kind of implies a fine-grained sleep, so we turned it into one when small time intervals were being requested.

http://fxr.watson.org/fxr/source/kern/kern_time.c?v=DFBSD

The way I figure it if a userland program wants to make system calls with fine-grained sleeps that are too small, it's really no different from treating that program as being cpu-bound anyway so why not try to accommodate it?

--

The 8254 issue is more one of a lack of interest in fixing it. Basically using the 8254 as a measure of realtime when the reload value is set too small (i.e. high hz) will always lead to serious timing problems. The reason there is such a lack of interest in fixing it is that most machines have other timers available (lapic, acpi, hpet, tsc, etc). A secondary issue might be tying real-time functions to 'ticks', which could still be driven by the 8254 interrupt; those have to be divorced from ticks. I'm not sure if FreeBSD has any of those left (does date still skip quickly if hz is set ultra-high? Even when other timers are available?).

I will note that tying real-time functions to the hz-based tick function (which is also the 8254-driven problem when other timers are not available) leads to serious problems, particularly with ntpd, even if you only lose track of the full cycle of the timer occasionally. However, neither do you want to 'skip' the ticks value to catch up to a lost interrupt. That will mess up tsleep() and other hz-based timeouts that assume that values of '2' will not instantly timeout. So actual realtime operations really do have to be completely divorced from the hz-based ticks counter and it must only be used for looser timing needs such as protocol timeouts and sleeps.

-Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
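A small userland model of the two rounding policies described above may help make the difference concrete. Here 'tick_us' plays the role of the kernel's 'tick' variable (microseconds per hardclock tick); this is an illustrative sketch, not the DragonFly kernel source:

    #include <sys/time.h>
    #include <stdio.h>

    /* Round up and add one tick, so the timeout can never fire early.
     * A {0,0} timeval yields 1; anything non-zero yields at least 2. */
    static int
    tvtohz_high(const struct timeval *tv, int tick_us)
    {
            long long us = (long long)tv->tv_sec * 1000000 + tv->tv_usec;

            return (int)((us + (tick_us - 1)) / tick_us + 1);
    }

    /* Round down; 0 is a legal result and the delay may come up short. */
    static int
    tvtohz_low(const struct timeval *tv, int tick_us)
    {
            long long us = (long long)tv->tv_sec * 1000000 + tv->tv_usec;

            return (int)(us / tick_us);
    }

    int
    main(void)
    {
            struct timeval tv = { 0, 1 };   /* a 1 microsecond request */

            /* With hz=100 (tick_us=10000): high=2 ticks, low=0 ticks. */
            printf("high=%d low=%d\n",
                tvtohz_high(&tv, 10000), tvtohz_low(&tv, 10000));
            return 0;
    }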
Re: serious networking (em) performance (ggate and NFS) problem
Polling should not produce any improvement over interrupts for EM0. The EM0 card will aggregate 8-14+ packets per interrupt, or more, which is only around 8000 interrupts/sec. I've got a ton of these cards installed.

# mount_nfs -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes  colls
     66401     0   93668746       5534     0     962920      0
     66426     0   94230092       5537     0    1007108      0
     66424     0   93699848       5536     0     963268      0
     66422     0   94222372       5536     0    1007290      0
     66391     0   93654846       5534     0     962746      0
     66375     0   94154432       5532     0    1006404      0

systat -vm 1 (excerpt):
    Interrupts: 8100 total; 7873 mux irq10; 227 clk irq0; ata0 irq14; ata1 irq15; mux irq11
    CPU: 19.2% Sys, 0.0% Intr, 0.0% User, 0.0% Nice, 80.8% Idle
    Memory: 88864 wire, 10404 act, 864476 inact, 58152 cache, 2992 free

Note that the interrupt rate is only 7873 interrupts per second while I am transferring 94 MBytes/sec over NFS (UDP) and receiving over 66000 packets per second (~8 packets per interrupt). If I use a TCP mount I get just about the same thing:

# mount_nfs -T -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes  colls
     61752     0   93978800       8091     0     968618      0
     61780     0   93530484       8098     0     904370      0
     61710     0   93917880       8093     0     968128      0
     61754     0   93491260       8095     0     903940      0
     61756     0   93986320       8097     0     968336      0

systat -vm 1 (excerpt):
    Interrupts: 8145 total; 7917 mux irq10; 228 clk irq0; ata0 irq14; ata1 irq15; mux irq11
    CPU: 26.4% Sys, 0.0% Intr, 0.0% User, 0.0% Nice, 73.6% Idle
    Memory: 141556 wire, 7800 act, 244872 inact, 8 cache, 630780 free

In this case around 8000 interrupts per second with 61700 packets per second incoming on the interface (around ~8 packets per interrupt). The extra interrupts are due to the additional outgoing TCP ack traffic. If I look at the systat -vm 1 output on the NFS server it also sees only around 8000 interrupts per second, which isn't saying much other than its transmit path (61700 pps outgoing) is not creating an undue interrupt burden relative to the receive path.

-Matt ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Re[2]: serious networking (em) performance (ggate and NFS) problem
: I did simple benchmark at some settings. : : I used two boxes which are single Xeon 2.4GHz with on-boarded em. : I measured a TCP throughput by iperf. : : These results show that the throughput of TCP increased if Interrupt :Moderation is turned OFF. At least, adjusting these parameters affected :TCP performance. Other appropriate combination of parameter may exist.

Very interesting, but the only reason you get lower results is simply because the TCP window is not big enough. That's it. 8000 ints/sec = ~15KB of backlogged traffic, x 2 (sender, receiver). Multiply by two (both the sender's reception of acks and the receiver's reception of data) and you get ~30KB. This is awfully close to the default 32.5KB window size that iperf uses. Other than window sizing issues I can think of no rational reason why throughput would be lower. Can you? And, in fact, when I do the same tests on DragonFly and play with the interrupt throttle rate I get nearly the results I expect.

* Shuttle Athlon 64 3200+ box, EM card in 32 bit PCI slot
* 2 machines connected through a GiGE switch
* All other hw.em0 delays set to 0 on both sides
* throttle settings set on both sides
* -w option set on iperf client AND server for 63.5KB window
* software interrupt throttling has been turned off for these tests

    throttle    result          result
    freq        (32.5KB win)    (63.5KB win)
    (default)
    --------    ------------    ------------
    maxrate     481 MBit/s      533 MBit/s    (not sure what's going on here)
    120000      518 MBit/s      558 MBit/s    (not sure what's going on here)
    100000      613 MBit/s      667 MBit/s    (not sure what's going on here)
    70000       679 MBit/s      691 MBit/s
    60000       668 MBit/s      694 MBit/s
    50000       678 MBit/s      684 MBit/s
    40000       694 MBit/s      696 MBit/s
    30000       694 MBit/s      696 MBit/s
    20000       698 MBit/s      703 MBit/s
    10000       707 MBit/s      716 MBit/s
    9000        708 MBit/s      716 MBit/s
    8000        710 MBit/s      717 MBit/s    <--- drop off pt 32.5KB win
    7000        683 MBit/s      716 MBit/s
    6000        680 MBit/s      720 MBit/s
    5000        652 MBit/s      718 MBit/s    <--- drop off pt 63.5KB win
    4000        555 MBit/s      695 MBit/s
    3000        522 MBit/s      533 MBit/s    <--- GiGE throttling likely
    2000        449 MBit/s      384 MBit/s    (256 ring descriptors =
    1000        260 MBit/s      193 MBit/s     2500 hz minimum)

Unless you are in a situation where you need to route small packets flying around a cluster where low latency is important, it doesn't really make any sense to turn off interrupt throttling. It might make sense to change the default from 8000 to 10000 to handle typical default TCP window sizes (at least in a LAN situation), but it certainly should not be turned off.

I got some weird results when I increased the frequency past 100KHz, and when I turned throttling off entirely. I'm not sure why. Maybe setting the ITR register to 0 is a bad idea. If I set it to 1 (i.e. 3906250 Hz) then I get 625 MBit/s. Setting the ITR to 1 (i.e. 256ns delay) should amount to the same thing as setting it to 0 but it doesn't. Very odd. The maximum interrupt rate as reported by systat is only ~46000 ints/sec so all the values above 50KHz should read about the same... and they do until we hit around 100KHz (10uS delay). Then everything goes to hell in a handbasket.

Conclusion: 10000 hz would probably be a better default than 8000 hz.

-Matt Matthew Dillon <[EMAIL PROTECTED]> ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Re[4]: serious networking (em) performance (ggate and NFS) problem
vity that requires cpu, turning off moderation could wind up being a very, very bad idea. In fact, even if you are just routing packets I would argue that turning off moderation might not be a good choice... it might make more sense to set it to some high frequency like 4 Hz. But, of course, it depends on what other things the machine might be running and what sort of processing (e.g. firewall lists) the machine has to do on the packets. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Re[4]: serious networking (em) performance (ggate and NFS) problem
:Increasing the interrupt moderation frequency worked on the re driver,
:but it only made it marginally better. Even without moderation,
:however, I could lose packets without m_defrag. I suspect that there is
:something in the higher level layers that is causing the packet loss. I
:have no explanation why m_defrag makes such a big difference for me, but
:it does. I also have no idea why a 20Mbps UDP stream can lose data over
:gigE phy and not lose anything over 100BT... without the above mentioned
:changes that is.

It kinda sounds like the receiver's UDP buffer is not large enough to handle the burst traffic. 100BT is a much slower transport and the receiver (userland process) was likely able to drain its buffer before new packets arrived.

Use netstat -s to observe the drop statistics for udp on both the sender and receiver sides. You may also be able to get some useful information looking at the ip stats on both sides too. Try bumping up net.inet.udp.recvspace and see if that helps. In any case, you should be able to figure out where the drops are occurring by observing netstat -s output.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ [EMAIL PROTECTED] mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
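Concretely, the checks suggested above look something like this; the recvspace value is just an illustrative number, not a recommendation:

    # netstat -s -p udp        # look for "dropped due to full socket buffers"
    # netstat -s -p ip         # fragmentation / reassembly counters
    # sysctl net.inet.udp.recvspace
    # sysctl net.inet.udp.recvspace=262144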
Re: ntpd flipping between PLL and FLL mode
:How would decreasing the polling time fix this? I do not understand
:the semantics/behaviour of NTP very well.
:
:Taken from the manpage:
:
:    minpoll minpoll
:    maxpoll maxpoll
:        These options specify the minimum and maximum poll intervals for
:        NTP messages, in seconds to the power of two. The maximum poll
:        interval defaults to 10 (1,024 s), but can be increased by the
:        maxpoll option to an upper limit of 17 (36.4 h). The minimum
:        poll interval defaults to 6 (64 s), but can be decreased by the
:        minpoll option to a lower limit of 4 (16 s).

Though I can't speak to the algorithm ntpd uses, if a correlation is used along with a standard deviation to calculate offset and frequency errors, then decreasing the polling interval makes it virtually impossible to get an accurate frequency lock. Frequency locks require long polling intervals. So you wouldn't see any flips (or fewer flips), but you wouldn't have a very accurate time base either. You know you have a bad frequency correction if you see significant offset corrections occurring every day.

The whole concept of 'flips' is broken anyhow, it just means the application is not using the correct mathematical algorithm. NTPD never worked very well for me in all the years I've used it. Not ever. OpenNTPD also uses an awful algorithm.

If you need an NTP client-only app you might want to consider porting our DNTPD. It is a client-only app (no server component) which uses two staggered correlations and two staggered standard deviations for each time source and corrects the time based on a mathematically calculated accuracy rather than after some pre-contrived time delay or interval. Some minor messing around might be needed to port it since we use a slightly more advanced sysctl scheme to control offset and frequency correction.

It also has a tendency to make errors in OS time keeping obvious. In particular, any bugs in how the OS handles offset and frequency corrections will become very obvious. We found a microsecond-vs-nanosecond bug in DragonFly with it.

If you have a good frequency lock you should not see offset corrections occurring very often. I've included examples of what you should be able to achieve below from a few of our machines. In the examples below I got a reasonable frequency lock within an hour and then did not have to correct for it after that (which means that the error calculation for the continuously running frequency drift calculation was not small enough to make further frequency corrections). These are using the pool NTP sources on the internet. With a LAN source you would probably see more frequency corrections.

Correlations are only useful with a limited number of samples... once you get beyond 30 samples or so the algorithm tends to plateau, which is why you need to have at least two running correlations with staggered start times. I have considered adding two additional staggered correlations to get more accurate frequency locks (e.g. 30 two-hour samples in addition to 30 thirty-minute samples) but PC time bases just aren't accurate enough to justify it (I know of no PC motherboards which use temperature-corrected crystal time bases. They are all uncorrected time bases. It's really annoying). Ah well.
-Matt

Dec 3 10:46:57 crater dntpd[605]: dntpd version 1.0 started
Dec 3 10:47:13 crater dntpd[605]: issuing offset adjustment: 0.706663
Dec 3 11:29:32 crater dntpd[605]: issuing offset adjustment: 0.015905
Dec 3 11:39:57 crater dntpd[605]: issuing frequency adjustment: 8.656ppm
Dec 3 11:50:25 crater dntpd[605]: issuing offset adjustment: 0.011579
Dec 4 09:21:18 crater dntpd[605]: issuing offset adjustment: -0.007325
Dec 5 20:26:08 crater dntpd[605]: issuing offset adjustment: 0.007002
Dec 6 09:20:32 crater dntpd[605]: issuing offset adjustment: -0.008491
Dec 6 09:40:11 crater dntpd[605]: issuing offset adjustment: 0.004089
Dec 6 22:23:50 crater dntpd[605]: issuing offset adjustment: 0.006602
Dec 6 22:43:16 crater dntpd[605]: issuing offset adjustment: -0.002391
Dec 8 13:29:11 crater dntpd[605]: issuing offset adjustment: -0.005005
Dec 11 23:37:00 crater dntpd[605]: issuing offset adjustment: 0.004607
Dec 17 23:11:26 crater dntpd[605]: issuing offset adjustment: -0.005559
Dec 18 23:05:12 crater dntpd[605]: issuing offset adjustment: 0.008101

Dec 3 10:47:13 leaf dntpd[593]: dntpd version 1.0 started
Dec 3 10:47:29 leaf dntpd[593]: issuing offset adjustment: 0.027401
Dec 3 11:08:45 leaf dntpd[593]: issuing frequency adjustment: -12.384ppm
Dec 3 13:14:49 leaf dntpd[593]: issuing offset adjustment: -0.012258
Dec 3 20:14:44 leaf dntpd[593]: issuing offset adjustment: -0.010502
Dec 10 04:27:05 leaf dntpd[593]: issuing offset adjustment: -0.008231
Dec 16
Re: Xen Dom0, are we making progress?
with fairly optimal network and file I/O calls seem to do the best. Virtual kernels won't be winning any awards, but they sure can be convenient. Most of my kernel development is now done in virtual kernels. It also makes kernel development more attainable to people who are not traditionally kernel coders. The synergy is very good.

--

In any case, as usual I rattle on. If FreeBSD is interested I recommend simply looking at the cool features I added to DragonFly's kernel to make virtual kernels possible. It's really just three major items: Signal mailboxes, a new MAP_VPAGETABLE for mmap, and the new vmspace_*() system calls for managing VM spaces. Once those features were in place it didn't take long for me to create a 'vkernel' platform that linked against libc and used the new system calls.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Xen Dom0, are we making progress?
:Virtual kernels are a cool idea, but I (and I believe practically anyone
:using FreeBSD for non-development work) would much rather see a Xen-like
:functionality (to be precise: ability to run foreign kernels and
:Windows; qemu is too slow) than just a variation of the native kernel.

There is certainly a functionality there that people will find useful, but you also have to realize that Xen involves two or more distinct operating systems which will multiply the number of bugs you have to deal with and create major compatibility issues with the underlying hardware, making it less than reliable.

Really only the disk and network I/O can be made reliably compatible in a Xen installation. Making sound cards, video capture cards, encryption cards, graphics engines, and many other hardware features work well with the guest operating system will not only be difficult, but it will also be virtually unmaintainable in that environment over the long term. Good luck getting anything more than basic application functionality out of it. For example, you would have no problem running pure network applications such as web and mail servers on the guest operating system, but the moment you delve outside of that box and into sound and high quality (or high performance) video, things won't be so rosy.

I don't see much of an advantage in having multi-OS hardware virtualization for any serious deployment. It would be interesting and useful on a personal desktop, at least within the scope of the limited hardware compatibility, but at the same time it will also lock you into software and OS combinations that aren't likely to extend into the future, and which will be a complete and utter nightmare to maintain. Any failure at all could lead to a completely unrecoverable system.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
:I'm trying to probe this as well as I can, but network stacks and sockets have
:never been my strong suit ...
:
:Robert had mentioned in one of his emails about a "Sockets can also exist
:without any referencing process (if the application closes, but there is still
:data draining on an open socket)."
:
:Now, that makes sense to me, I can understand that ... but, how would that look
:as far as netstat -nA shows? Or, would it? For example, I have:
:
:...

Netstat should show any sockets, whether they are attached to processes or not. Usually you can match up the address from netstat -nA with the addresses from sockets shown by fstat to figure out what processes the sockets are attached to.

There are three situations that you have to watch out for:

(1) The socket was close()'d and is still draining. The socket will timeout and terminate within ~1-5 minutes. It will not be referenced to a descriptor or process.

(2) The socket descriptor itself has been sent over a unix domain socket from one process to another and is currently in transit. The file pointer representing the descriptor is what is actually in transit, and will not be referenced by any processes while it is in transit. There is a garbage collector that figures out unreferenceable loops. I think it's called unp_gc or something like that.

(3) The socket is not closed, but is idle (like having a remote shell open and never typing in it). Service processes can get stuck waiting for data on such sockets. The socket WILL be referenced by some process. These are controlled by net.inet.tcp.keep* and net.inet.tcp.always_keepalive. I almost universally turn on net.inet.tcp.always_keepalive to ensure that dead idle connections get cleaned out. Note that keepalive only applies to idle connections. A socket that has been closed and needs to drain (either data or the FIN state) will timeout and clean up itself whether keepalive is turned on or off.

netstat -nA will give you the status of all your sockets. You can observe the state of any TCP sockets. Unix domain sockets have no state and closure is governed simply by them being dereferenced, just like a pipe. In this case there are really only two situations: (1) One end of the unix domain socket is still referenced by a process or (2) The socket has been sent over another unix domain socket and is 'in transit'. The socket will remain intact until it is either no longer in transit (read out from the other unix domain socket), or the garbage collector determines that the socket the descriptor is transiting over is not externally referenceable, and will destroy it and any in-transit sockets contained within.

Any sockets that don't fall into these categories are in trouble... either a timer has failed somewhere or (if unix domain) the garbage collector has failed to detect that it is in an unreferenceable loop.

-

One thing you can do is drop into single user mode... kill all the processes on the system, and see if the sockets are recovered. That will give you a good idea as to whether it is a real leak or whether some process is directly or indirectly (by not draining a unix domain socket on which other sockets are being transferred) holding onto the socket.

-Matt
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
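As a concrete sketch of the matching described above (the pcb address is hypothetical; use whatever address netstat prints for the socket you care about):

    # netstat -nA                                # first column is the pcb/socket address
    # fstat | grep c1a2b3c4                      # does any process hold that address?
    # sysctl net.inet.tcp.always_keepalive=1     # reap dead idle TCP connections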
Re: Socket leak (Was: Re: What triggers "No Buffer Space) Available"?
:*groan* why couldn't this be happening on a server that I have better remote
:access to? :(
:
:But, based on your explanation(s) above ... if I kill off all of the jail(s) on
:the machine, so that there are minimal processes running, shouldn't I see a
:significant drop in the number of sockets in use as well? or is there
:something special about single user mode vs just killing off all 'extra
:processes'?
:
:-
:Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)

Yes, you can. Nothing special about single user... just kill all the processes that might be using sockets. Killing the jails is a good start. If you are running a lot of jails then I would strongly suspect that there is an issue with file descriptor passing over unix domain sockets. In particular, web servers, databases, and java or other applets could be the culprit.

Other possibilities... you could just be running out of file descriptors in the file descriptor table. Use vmstat -m and vmstat -z too... find out what allocates the socket memory and see what it reports. Check your mbuf allocation statistics too (netstat -m). Damn, I wish that information were collected on a per-jail basis but I don't think it is.

Look at all the memory statistics and check to see if anything is growing unbounded over a long period of time (versus just growing into a cache balance). Create a cron job that dumps memory statistics once a minute to a file, then break each report with a clear-screen sequence and cat it in a really big xterm window.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: swap zone exhausted, increase kern.maxswzone
Basically maxswzone is the amount of KVM the kernel is willing to use to store 'struct swblock' structures. These are the little structures that are stuck onto VM objects and specify which pages in the VM object(s) correspond to which pages of swap, for any swapped out data that no longer has a vm_page_t. It should be almost impossible to run out. Each structure can handle 16 contiguous swap block assignments in the VM object. Pages in VM objects tend to get swapped out in large linear swaths and the dynamic nature of paging tends to enforce this even if things are a bit chunky initially. So running out should just never happen. The only thing I can think of is if a machine has a tiny, tiny amount of ram and a huge amount of swap. e.g. like 64M of ram and 2G of swap, and actually tries to use it all. The default KVM reservation is based on physical memory, I think. Otherwise, it just shouldn't happen. I see that the code in FreeBSD is using UMA now, I would double check that it is actually calculating the proper amount of space allowed to be allocated. Maybe you have a leak somewhere. Note that swap interactions have to operate in low-memory situations. Make sure UMA isn't gonna have a meltdown if the system is running low on freeable VM pages. -Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: swap zone exhausted, increase kern.maxswzone
:If you do have 8gigs of swap, then you do need to increase the parameter.. :The default is 7.7gigs of supported swap... (assuming that struct swblock :hasn't changed size... The maxswblock only limits it... If swap is more :than 8x memory, then changing kern.maxswzone will not fix it and will :require a code change... : :-- : John-Mark Gurney Voice: +1 415 225 5579 The swblock structures only apply to actively swapped out data. Mark, how much data is actually swapped out (pstat -s) at the time the problem is reported? If you can dump UMA memory statistics that would be beneficial as well. I just find it hard to imagine that any system would actually be using that much swap, but hey! :-) -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: swap zone exhausted, increase kern.maxswzone
:That's why I think that the socket issue and this one are co-related ... with :everything started up (93 jails), my swap usage right now is: : :mars# pstat -s :Device 1K-blocks UsedAvail Capacity :/dev/da0s1b 8388608 20 8388588 0% : :Its only been up 2.5 hours so far, but still, everything is started up ... : :- :Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) The "swap zone exhausted, increase kern.maxswzone" message only prints if uma_zone_exhausted() returns TRUE. uma_zone_exhausted() appears to be based on a UMA flag which is only set if the pages for the zone exceeds some maximum setting. Insofar as I can tell, vmstat -z on FreeBSD will dump the UMA zones, so try using that when the problem occurs along with pstat -s. It sounds like there is a leak somewhere (but I don't see how anything in any other UMA zones could cause the SWAPMETA zone to fill up). Or the maximum setting is too low, or something is getting lost somewhere. We'll have a better idea as to what is going on when you get the message again. You might even want to do a once-a-10-minutes cron job to append pstat -s, vmstat -m, and vmstat -z to a file. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
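A sketch of the suggested cron job, written as an /etc/crontab line (the log path and the 10-minute interval are arbitrary choices):

    # append swap and zone statistics every 10 minutes
    */10 * * * * root (date; pstat -s; vmstat -m; vmstat -z) >> /var/log/swapstats 2>&1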
RE: Creating one's own installer/mfsroot
:> You could also look at the INSTALL guides for (early versions of?)
:> Dragonfly, it taught me how to install a BSD system from
:> scratch, using
:> only what's in base of liveCD :)
:...
:
:We have set up a boot CD (or pxeboot/nfs environment) where we can run a
:Ruby script that will take directives from a configuration file, configure
:the disks, slices and partitions, align the partitions to start on a block
:...

I came up with a neat little script-driven remote configurator called rconfig (/usr/src/sbin/rconfig in the DragonFly codebase) as an alternative to the standard features in our installer. I recommend checking it out.

http://www.dragonflybsd.org/cvsweb/src/sbin/rconfig/

rconfig is really easy to use. Basically it's just a client/server pair with socket broadcast capabilities. All it does is negotiate a shell script download from the server to the client (server on the same subnet), then runs the script on the client. That's it. I wanted to be able to boot a CD, login as root, dhclient the network up, and then just go 'rconfig -a' and have my script do the rest.

It takes a bit of time to write the shell script to do a full install from fdisk to completion, but if you have a fully working CD based environment (all the binaries in /, /usr, a writable /tmp, /etc, and so forth)... then shell scripts are just as easy to write as they are on fully installed machines. I use rconfig to do fresh installs of my test boxes from CD, with all the customization, my ssh keys, fstab entries, NFS mounts, etc that I need to be able to ssh into the box and start using it immediately.

NFS booting is 'ok', but requires a lot of infrastructure and gets out of date easily. Often you also have to mess with the BIOS settings, which is very annoying because you have to change them back after you finish the install. I used NFS booting for a while, but just couldn't depend on it always being operational. With rconfig I just leave the rconfig server running on one of my boxes and the worst I have to do is tweak my script a bit. And adding smarts to the script is easy whereas you just can't add much smarts to an NFS boot without a lot of messing around with the RC sequence.

In any case, check it out. My assumption is that rconfig would compile nearly without modification on a FreeBSD box.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
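The client-side sequence being described boils down to something like this (the interface name is just an example):

    (boot the live CD and log in as root)
    # dhclient em0       # bring the network up
    # rconfig -a         # locate an rconfig server on the subnet, download its script, run it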
Re: clock problem
:One of our customers has 6 GPS-locked NTP servers. Only problem is
:that two of them are reporting a time that is exactly one second
:different to the other four. You shouldn't rely solely on your
:GPS or DCF receiver - use it as the primary source but have some
:secondary sources for sanity checks. (From experience, I can state
:that ntpd does not behave well when presented with two stratum 1
:servers that differ by 1 second).
:
:--
:Peter Jeremy

Ntp will also become really unhappy when chunky time slips occur or if the skew rate is more than a few hundred ppm. Ntp will also blow up if it loses the network link for a long period of time. It will just give up and stop making corrections entirely, even after the link is restored. This is particularly true when it is used over a dialup (me having done that for over a year in 1997, so I can tell you how badly it works).

A slow time slip over a day could still be chunky, which would imply lost interrupts. Determining whether the problem is due to an 8254 rollover or lost hardclock interrupts is easy... just set 'hz' to something really high, like 2, and see if your time goes crazy. If it does, then you have your culprit.

I don't know if those bugs are still present in FreeBSD, but I do remember that I had to redo all the timekeeping in DragonFly because lost interrupts from high 'hz' settings were causing timekeeping to go nuts. That turned out to mainly be due to the same 8254 timer being used to generate the hardclock interrupt AND handle time keeping. i.e. at high hz settings one was not getting the full 1/18 second benefit from the timer. You just can't do that... it doesn't work. It is almost 100% guaranteed to result in a bad time base.

It is easy to test... just set your kern.hz in the boot env, reboot, and see if things blow up or not. Time keeping should be stable regardless of what hz is set to (proviso: never set hz less than 100).

Unfortunately, all the timebases in the system have their own quirks. Blame the hardware manufacturers.

    TSC             Haha. Good luck. Nice wide timer, easy to read, but any
                    power savings mode, including the failsafe modes that intel
                    has when a cpu overheats, will probably blow it up. Because
                    of that it is not really a good idea to use it as a
                    timebase. I shake my fist at Intel! $#%$#%$#%

    ACPI timer      Despite the hardware bugs this almost always works as a
                    timebase, but sometimes the frequency changes when the cpu
                    goes into power savings mode or EST, and sometimes the
                    frequency is something other than what it is supposed to be.

    8254 timer 0    Almost always works as a timebase, but only if not also
                    used to generate high-speed interrupts (because interrupts
                    are lost easily). Set it to a full cycle (1/18 second) and
                    you will be fine. Set it to anything else and you will lose
                    interrupts. The BIOS will sometimes mess with timer 0, but
                    not as often as it messes with timer 2.

    8254 timer 1    Sometimes works as a time base, but can lock older machines
                    up. Can even lock up newer machines. Why? Because hardware
                    manufacturers are idiots.

    8254 timer 2    Often can be used as a time base, but video bios calls
                    often try to use it too. [EMAIL PROTECTED] bios makers! Still, this
                    is better than losing interrupts when timer 0 is set to
                    high speed so DragonFly uses timer 2 for its timebase as a
                    default until the ACPI timer becomes available, with a boot
                    option to use timer 1 instead. Using timer 2 as a time base
                    means you don't get motherboard speaker sound (the old beep
                    beep BEEP!). Do I care? No.

    LAPIC timer     Dunno. Probably best to use it as a high speed clock
                    interrupt which would free 8254 timer 0 to use as a time
                    base.

    RTC interrupt   Basically unusable. Stable, but doesn't have sufficient
                    resolution to be helpful and takes forever to read.

-Matt Matthew Dillon
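On FreeBSD the "set your kern.hz in the boot env" test corresponds to a loader tunable; the value below is only a placeholder for "something really high" and should be reverted once the test is done:

    # /boot/loader.conf
    kern.hz="20000"      # example test value only; remove after checking timekeeping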
Re: clock problem
Another idea to help track down timebase problems. Port dntpd to FreeBSD. You need like three sysctls (because the ntp API and the original sysctl API are both insufficient). Alternatively you could probably hack dntpd to run in debug mode without having to implement any new sysctls, as long as you make sure to clean out any active kernel timebase adjustments in the kernel before you run it. Here's some sample output:

http://apollo.backplane.com/DFlyMisc/dntpd.sample01.txt

Dntpd in debug mode will print out the results from two staggered continuously running linear regressions (resets after 30 samples, staggered by 15 samples). For anyone who understands how linear regressions work, finding kernel timekeeping bugs is really easy with this sort of output. You get the slope, y-intercept, correlation, and standard deviation, and then you get calculated frequency drift and time offset based on those numbers. The correlation is accurate after around 10 samples. Note that frequency drift calculations require longer intervals to get better results. The forced 30 second interval set in the sample output is way too short, hence the errors (it has to be in the 90th percentile to even have a chance of producing a reasonable PPM calculation). But also remember we are talking parts per million here.

If you throw away iteration numbers < 15 or so you will get very nice output and kernel bugs will show up in fairly short order. Kernel bugs will show up as non-trivial y-intercept calculations over multiple samples, large jumps in the offset, inability to get a good correlation (proviso: sample interval has to be at least 120 seconds, not the 30 in my example), and so on and so forth. Also be sure to use a locked ntp source, otherwise running corrections on the source will show up as problems in the debug output. ntp.pool.org is usually good enough.

It's fun checking various time sources with an idle box with a good timebase. hhahahhaha. OMG.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
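For anyone porting or re-implementing this, the math behind that debug output is an ordinary least-squares fit of measured clock offset against elapsed time: the slope is the frequency drift (ppm), the y-intercept is the offset, and the correlation and standard deviation say how much to trust the fit. The sketch below is not dntpd source; the sample numbers are invented.

    /*
     * Least-squares fit of clock offset vs. elapsed time -- a sketch of the
     * regression described above, not the actual dntpd code.
     */
    #include <math.h>
    #include <stdio.h>

    struct regress {
            double slope;           /* frequency drift (sec per sec) */
            double intercept;       /* offset at t = 0 (sec) */
            double corr;            /* correlation coefficient, -1..1 */
            double stddev;          /* std deviation of the residuals */
    };

    static struct regress
    fit(const double *t, const double *off, int n)
    {
            double st = 0, so = 0, stt = 0, soo = 0, sto = 0;
            double covto, vart, varo, sumsq, e;
            struct regress r;
            int i;

            for (i = 0; i < n; ++i) {
                    st += t[i];
                    so += off[i];
                    stt += t[i] * t[i];
                    soo += off[i] * off[i];
                    sto += t[i] * off[i];
            }
            covto = sto - st * so / n;
            vart = stt - st * st / n;
            varo = soo - so * so / n;

            r.slope = covto / vart;
            r.intercept = (so - r.slope * st) / n;
            r.corr = covto / sqrt(vart * varo);

            sumsq = 0;
            for (i = 0; i < n; ++i) {
                    e = off[i] - (r.intercept + r.slope * t[i]);
                    sumsq += e * e;
            }
            r.stddev = sqrt(sumsq / n);
            return (r);
    }

    int
    main(void)
    {
            /* hypothetical samples: offset measured every 120 seconds */
            double t[] = { 0, 120, 240, 360, 480, 600 };
            double off[] = { 0.0021, 0.0032, 0.0041, 0.0053, 0.0061, 0.0074 };
            struct regress r = fit(t, off, 6);

            printf("drift %.3fppm  offset %.6fs  corr %.3f  stddev %.6f\n",
                r.slope * 1e6, r.intercept, r.corr, r.stddev);
            return (0);
    }

Two of these run staggered, as the post describes, so a fresh regression is always warming up while the older, better-established one is the one actually being trusted.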
Re: Does a pipe take a socket ... ?
:Marc G. Fournier wrote: : > For those that remmeber the other day, I had that swzone issue, where I ran out : > of swap space? I just about hit it again today, swap was up to 99% used ... I : > was able to get a ps listing in, and there were a whack of find processes : > running ... : > : > Now, I think I know which VPS they were running in, so that isn't a problem ... : > and I suspect that the find was just part of a longer pipe ... I'm just curious : > if those pipes would happen to use up any of those sockets that are : > 'evaporating', or is this totally unrelated to sockets? : :In FreeBSD, pipe() is implemented with the socketpair(2) :system call. Every pipe uses two sockets (one for each :endpoint). : :Best regards : Oliver : :-- :Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M. Nuh uh. pipe() is a direct implementation... no sockets anywhere. Using socketpair() will eat sockets up, but using pipe() will not. -Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
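For reference, the two calls being contrasted above are both standard; which kernel objects end up backing the descriptors is exactly the implementation detail in question. A small self-contained sketch (not FreeBSD or DragonFly source):

    /* pipe() vs. socketpair(): two ways to get a connected descriptor pair. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int p[2], s[2];
            char buf[16];

            if (pipe(p) == -1 || socketpair(AF_UNIX, SOCK_STREAM, 0, s) == -1) {
                    perror("pipe/socketpair");
                    return (1);
            }

            write(p[1], "via pipe", 8);             /* pipe endpoints */
            read(p[0], buf, sizeof(buf));

            write(s[0], "via socketpair", 14);      /* unix-domain socket endpoints */
            read(s[1], buf, sizeof(buf));

            printf("pipe fds %d,%d  socketpair fds %d,%d\n", p[0], p[1], s[0], s[1]);
            close(p[0]); close(p[1]); close(s[0]); close(s[1]);
            return (0);
    }

On a system where pipe() is a native implementation, fstat run against this process would show sockets only for the socketpair() descriptors.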
Re: weird bugs with mmap-ing via NFS
:
: [Moved from -current to -stable]
:
:On Tuesday, 21 March 2006 16:23, Matthew Dillon wrote:
:> You might be doing just writes to the mmap()'d memory, but the system
:> doesn't know that.
:
:Actually, it does. The program tells it, that I don't care to read, what's
:currently there, by specifying the PROT_WRITE flag only.

That's an architectural flag. Very few architectures actually support write-only memory maps. IA32 does not. It does not change the fact that the operating system must validate the memory underlying the page, nor does it imply that the system shouldn't.

:Sounds like a missed optimization opportunity :-(

Even on architectures that did support write-only memory maps, the system would still have to fault in the rest of the data on the page, because the system would have no way of knowing which bytes in the page you wrote to (that is, whether you wrote to all the bytes in the page or whether you left gaps). The system does not take a fault for every write you issue to the page, only for the first one. So, no matter how you twist it, the system *MUST* validate the entire page when it takes the page fault.

:> It kinda sounds like the buffer cache is getting blown out, but not
:> having seen the program I can't really analyze it.
:
:See http://aldan.algebra.com/~mi/mzip.c

I can't access this URL, it says 'not found'.

:> It will always be more efficient to write to a file using write() than
:> using mmap()
:
:I understand, that write() is much better optimized at the moment, but the
:mmap interface carries some advantages, which may allow future OSes to
:optimize their ways. The application can hint at its planned usage of the
:data via madvise, for example.

Yes, but those advantages are limited by the way memory mapping hardware works. There are some things that simply cannot be optimized through lack of sufficient information. Reading via mmap() is very well optimized. Making modifications via mmap() is optimized insofar as the expectation that the data is intended to be read, modified, and written back. It is not possible to optimize with the expectation that the data would only be written to the mmap, for the reasons described above. The hardware simply does not provide sufficient information to the operating system to optimize the write-only case.

:Unfortunately, my problem, so far, is with it not writing _at all_...

Not sure what is going on since I can't access the program yet, but I'd be happy to take a look at the code. The most common mistake people make when trying to write to a file via mmap() is that they forget to ftruncate() the file to the proper length first. Mapped memory beyond the file's EOF is ignored within the last page, and the program will take a page fault if it tries to write to mapped pages that are entirely beyond the file's current EOF. Writing to mapped memory does *not* extend the size of a file. Only ftruncate() or write() can extend the size of a file. The second most common mistake is to forget to specify MAP_SHARED in the mmap() call.

:Yes, this is an example of how a good implemented mmap can be better than
:write. Without explicit writes by the application and without doubling the
:memory requirements, the data can be written in the most optimal way.
:...
:Thanks for your help. Yours,
:
: -mi

I don't think mmap()-based writing will EVER be more efficient than write() except in the case where the entire data set fits into memory and has been entirely cached by the system. In that one case writing via mmap will be faster.
In all other cases the system will be taking as many VM faults on the pages as it would be taking system call faults to execute the write()'s. You are making a classic mistake by assuming that the copying overhead of a write() into the file's backing store, versus directly mmap()ing the file's backing store, represents a large chunk of the overhead for the operation. In fact, the copying overhead represents only a small chunk of the related overhead. The vast majority of the overhead is always going to be the disk I/O itself. I/O must occur even in the cached/delayed-write case, so on a busy system it still represents the greatest overhead from the point of view of system load. On a lightly loaded system nobody is going to care about a few milliseconds of improved performance here and there since, by definition, the system is lightly loaded and thus has plenty of idle cpu and I/O cycles to spare.

-Matt Matthew Dillon
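A minimal sketch of the pattern the two "common mistakes" above refer to: ftruncate() the file to its final size first, then map it MAP_SHARED with both PROT_READ and PROT_WRITE before storing into it. The path and size are invented for illustration.

    /*
     * Writing a file through mmap(): size the file with ftruncate() first
     * (stores through the mapping never extend a file), then map it
     * MAP_SHARED so the stores reach the backing file.
     */
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <err.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *path = "/tmp/mmap.out";     /* hypothetical output file */
            size_t len = 1024 * 1024;
            char *p;
            int fd;

            if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)) == -1)
                    err(1, "open");
            if (ftruncate(fd, len) == -1)           /* set the final size up front */
                    err(1, "ftruncate");

            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");

            memset(p, 'x', len);                    /* "write" by storing into the map */

            if (msync(p, len, MS_SYNC) == -1)       /* push the dirty pages to the file */
                    err(1, "msync");
            munmap(p, len);
            close(fd);
            return (0);
    }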
Re: more weird bugs with mmap-ing via NFS
:When the client is in this state it remains quite usable except for the
:following:
:
: 1) Trying to start `systat 1 -vm' stalls ALL access to local disks,
:    apparently -- no new programs can start, and the running ones
:    can not access any data either; attempts to Ctrl-C the starting
:    systat succeed only after several minutes.
:
: 2) The writing process is stuck unkillable in the following state:
:
:    CPU PRI NI VSZ RSS MWCHAN STAT TT TIME
:    27 -4 0 1351368 137764 nfsDLp41:05,52
:
:    Sending it any signal has no effect. (Large sizes are explained
:    by it mmap-ing its large input and output.)
:
: 3) Forceful umount of the share, that the program is writing to,
:    paralyzes the system for several minutes -- unlike in 1), not
:    even the mouse is moving. It would seem, the process is dumping
:    core, but it is not -- when the system unfreezes, the only
:    message from the kernel is:
:
:    vm_fault: pager read error, pid (mzip)
:
:Again, this is on 6.1/i386 from today, which we are about to release into the
:cruel world.
:
:Yours,
:
: -mi

There are a number of problems using a block size of 65536. First of all, I think you can only safely do it if you use a TCP mount, also assuming the TCP buffer size is appropriately large to hold an entire packet. For UDP mounts, 65536 is too large (the UDP data length can only be 65535 bytes; for that matter, the *IP* packet itself can not exceed 65535 bytes), so 65536 will not work with a UDP mount.

The second problem is related to the network driver. The packet MTU is 1500, which means, typically, a limit of around 1460-1480 payload bytes per packet. A large UDP packet that is, say, 48KB will be broken down into over 33 IP packet fragments. The network stack could very well drop some of these packet fragments, making delivery of the overall UDP packet unreliable.

The NFS protocol itself does allow read and write packets to be truncated providing that the read or write operation is either bounded by the file EOF or (for a read) the remaining data is all zero's. Typically the all-zero's case is only optimized by the NFS server when the underlying filesystem block itself is unallocated (i.e. a 'hole' in the file). In all other cases the full NFS block size is passed between client and server.

I would stick to an NFS block size of 8K or 16K. Frankly, there is no real reason to use a larger block size.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
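In mount terms, sticking to the suggested sizes (or moving the mount to TCP) looks something like this; -r and -w set the NFS read and write block sizes, -T selects TCP, and the server path is a placeholder:

    # mount_nfs -r 16384 -w 16384 server:/export /mnt     # UDP with a conservative block size
    # mount_nfs -T -a 4 server:/export /mnt               # or simply switch the mount to TCP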
Re: more weird bugs with mmap-ing via NFS
:I don't specify either, but the default is UDP, is not it?

Yes, the default is UDP.

:> Now imagine a client that experiences this problem only
:> sometimes. Modern hardware, but for some reason (network
:> congestion?) some frames are still lost if sent back-to-back.
:> (Realtek chipset on the receiving side?)
:
:No, both sides have em-cards and are only separated by a rather decent large
:switch.
:
:I'll try the TCP mount, workaround. If it helps, we can assume, our UDP NFS is
:broken for sustained high bandwidth writes :-(
:
:Thanks!
:
: -mi

I can't speak for FreeBSD's current implementation, but it should be possible to determine whether there is an issue with packet drops or not by observing the network statistics via netstat -s. Generally speaking, however, I know of no problems with a UDP NFS mount per se, at least as long as reasonable values are chosen for the block size.

The mmap() call in your mzip.c program looks ok to me with the exception of the use of PROT_WRITE. Try using PROT_READ|PROT_WRITE. The ftruncate() looks ok as well. If the program works over a local filesystem but fails to produce data in the output file on an NFS mount (but completes otherwise), then there is a bug in NFS somewhere. If the problem is simply due to the program stalling, and not completing due to the stalling, then it could be a problem with dropped packets in the network stack. If the problem is that the program simply runs very inefficiently over NFS, with excessive network bandwidth for the data being written (as you also reported), this is probably an artifact of attempting to use mmap() to write out the data, for reasons previously discussed.

I would again caution against using mmap() to populate a file in this manner. Even with MADV_SEQUENTIAL there is no guarantee that the system will actually flush the pages to the actual file on the server sequentially, and you could end up with a very badly fragmented file. When a file is truncated to a larger size the underlying filesystem does not allocate the actual backing store on disk for the data hole created. Allocation winds up being based on the order in which the operating system flushes the VM pages. The VM system does its best, but it is really designed more as a random-access system rather than a sequential system. Pages are flushed based on memory availability and a thousand other factors and may not necessarily be flushed to the file in the order you think they should be. write() is really a much better way to write out a sequential file (on any operating system, not just BSD).

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: more weird bugs with mmap-ing via NFS
:The file stops growing, but the network bandwidth remains at 20Mb/s. `Netstat :-s' on the client, had the following to say (udp and ip only): If the network bandwidth is still going full bore then the program is doing something. NFS retries would not account for it. A simple test for that would be to ^Z the program once it gets into this state and see if the network bandwidth goes to zero. So if we assume that packets aren't being lost, then the question becomes: what is the program doing that is causing the network bandwidth to go nuts? And if it isn't the program, then what is the OS doing that is causing the network bandwidth to go nuts? ktrace on the program would tell us if read() or write() or ftruncate() were causing an issue. 'vmstat 1' while the program is running would tell us if VM faults are creating an issue. If neither of those are an issue then I would guess that the problem could be related to the NFSv3 2-phase commit protocol. A way to test that would be to mount with NFSv2 and see if the problem still occurs. Running tcpdump on the network interface while the program is in this state might also give us some valuable clues. 50 lines of output from something like this after the program has gotten into its weird state might give us a clue: tcpdump -s 4096 -n -i -l port 2049 -Matt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: more weird bugs with mmap-ing via NFS
:>tcpdump -s 4096 -n -i -l port 2049 : :Now I am thoroughly confused, the lines are very repetative: : :tcpdump: verbose output suppressed, use -v or -vv for full protocol decode :listening on em0, link-type EN10MB (Ethernet), capture size 4096 bytes :20:41:55.788436 IP 172.21.128.43.2049 > 172.21.130.86.1445243414: reply ok 60 :20:41:55.788502 IP 172.21.130.86.1445243415 > 172.21.128.43.2049: 1472 write :fh 1090,6005/15141914 5120 (5120) bytes @ 4943872 :20:41:55.788811 IP 172.21.128.43.2049 > 172.21.130.86.1445243415: reply ok 60 :write ERROR: Permission denied :20:41:55.788872 IP 172.21.130.86.1445243416 > 172.21.128.43.2049: 1472 write :fh 1090,6005/15141914 5120 (5120) bytes @ 4947968 :[...] : :The only reason for "permission denied" I know, is the firewall, but neither :the server nor the client even have ipfw loaded... : :Yours, : : -mi Ah ha. That's the problem. I don't know why you are getting a write error, but that is preventing the client from cleaning out the dirty buffers. The number of dirty buffers continues to rise and the client is just cycling on them over and over trying to write them out, because it's just as confused about why it is getting a permission denied error as you are :-) If you can figure out why you are getting that error, and fix it, it will solve the problem. It is an NFS error returned by the server, not a firewall issue. So it probably has something to do either with the way the filesystem being exported was mounted on the server, or the export line in /etc/exports. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: more weird bugs with mmap-ing via NFS
My guess is that you are exporting the filesystem as a particular user id that is not root (i.e. you do not have -maproot=root: in the exports line on the server). What is likely happening is that the NFS client is trying to push out the pages using the root uid rather than the user uid.

This is a highly probable circumstance for VM pages because once they get disassociated from the related buffer cache buffer, the cred information for the last process to modify the related VM pages is lost. When the kernel tries to flush the pages out it winds up using root creds. On DragonFly, I gave up entirely on trying to associate creds with buffers.

I consider this more of a bug on the server side than on the client side. The server should automatically translate the root uid to the exported uid for I/O ops. Or, barring that, we have to add an option to the client-side mount to be able to specify a user/group id to translate all I/O requests to.

-Matt
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
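For reference, the workaround implied above (mapping remote root back to root so the flushes are accepted) is an /etc/exports line on the server along these lines, with the obvious security tradeoff and with placeholder network numbers:

    /home -maproot=root: -network 172.21.128.0 -mask 255.255.252.0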
Re: more weird bugs with mmap-ing via NFS
:So mmap is just a more "reliable" way to trigger this problem, right? : :Is not this, like, a major bug? A file can be opened, written to for a while, :and then -- at a semi-random moment -- the log will drop across the road? :Ouch... : :Thanks a lot to all concerned for helping solve this problem. Yours, : : -mi I consider it a bug. I think the only way to reliably fix the problem is to give the client the ability to specify the uid to issue RPCs with in the NFS mount command, to match what the export does. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: flushing "anonymous" buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)
:So, the problem is, the dirtied buffers _sometimes_ lose their owner and thus
:become root-owned. When the NFS client tries to flush them out, the NFS
:server (by default suspecting remote roots of being evil) rejects the
:flushing, which brings the client to its weak knees.
:
:1. Do the yet unflushed buffers really have to be anonymous?
:
:2. Can't the client's knees be strengthened in this regard?
:
:Thanks!
:
: -mi

Basically correct, though it's not the buffers that get lost, it's that the VM pages get disconnected from the buffers when the buffers are recycled, then get reconnected (sans creds info) later on.

The basic answer is that we don't want to strengthen the client with regards to buffer/VM page creds, because buffers and VM pages are cached items in the system and can potentially have many different 'owners'. The entire cred infrastructure for buffers was a terrible hack put into place many years ago, solely to support NFS. It created a huge mess in the system code and didn't even solve the problem (as you found out). I've already removed most of that junk from DragonFly and I would argue that there isn't much point keeping it in FreeBSD either.

The only real solution is to make the NFS client aware of the restricted user id exported by the server by requiring that the same uid be specified in the mount command the client uses to mount the NFS partition. The NFS client would then use that user id for all write I/O operations.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: flushing "anonymous" buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)
:What about different users accessing the same share from the same client? : : -mi Yah, you're right. That wouldn't work. It would have to be a server-side solution. Basically the server would have to accept root creds but instead of translating them to a fixed uid it should allow the I/O operation to run as long as some non-root user would be able to do the I/O op. -Matt Matthew Dillon <[EMAIL PROTECTED]> ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: flushing "anonymous" buffers over NFS is rejected by server (more weird bugs with mmap-ing via NFS)
:This doesn't work with modes like 446 (which allow writing by everyone
:not in a particular group).

It should work just fine. The client validated the creds as of the original operation (such as the mmap() or the original write()). Regardless of what happens after that, if the creds were valid when the original operation occurred, then the server should allow the write.

If the client supplies root creds for a later operation and the server translated that to mean 'write it if it's possible to write without root creds' for exports whose roots were not mapped to root, it would actually conform better to the reality of the state of the file at the time the client originally performed the operation versus if the client provided the user creds of the original write. If the file were chmoded or chowned in between the original write and the actual I/O operation then it is arguable that the delayed write I/O should succeed rather than fail.

:Doesn't that amount to significantly reducing the security of NFS?
:ISTR the original reason for "nobody" was that it was trivial to fake
:root so the server would map it to an account with (effectively) no
:privileges. This change would give root on a client (file) privileges
:equal to the union of every non-root user on the server. In
:particular, it appears that the server can't tell if a file was opened
:for read or write so a client could open a file for reading (getting a
:valid FH) and then write to it (even though it couldn't have opened the
:file for writing).
:
:--
:Peter Jeremy

No, it has no effect on the security of NFS. With the exception of 'root' creds, the server trusts the client's creds, so there isn't going to be any real difference between the client supplying user creds versus the server translating root creds into some non-root user's creds.

NFS has never been secure. The only reasonably secure method of exporting a NFS filesystem is to export an entire filesystem read-only. For any read-write export, NFS is only secure insofar as you assume that the client can then modify any file in the exported filesystem. The 'maproot' option is a bandaid at best, and not a very good one. For example, exporting subdirectories of a filesystem is not secure (and never was). It is fairly trivial for a client to supply file handles that are outside of the subdirectory tree that was exported.

-Matt Matthew Dillon <[EMAIL PROTECTED]>
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:Actually, I can not agree here -- quite the opposite seems true. When running
:locally (no NFS involved) my compressor with the `-1' flag (fast, least
:effective compression), the program easily compresses faster, than it can
:read.
:
:The Opteron CPU is about 50% idle, *and so is the disk* producing only 15Mb/s.
:I guess, despite the noise I raised on this subject a year ago, reading via
:mmap continues to ignore the MADV_SEQUENTIAL and has no other adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault the
:file's pieces in efficiently :-(
:
:Although the program was written to compress files, that are _likely_ still in
:memory, when used with regular files, it exposes the lack of mmap
:optimization.
:
:This should be even more obvious, if you time searching for a string in a
:large file using grep vs. 'grep --mmap'.
:
:Yours,
:
: -mi
:
:http://aldan.algebra.com/~mi/mzip.c

Well, I don't know about FreeBSD, but both grep cases work just fine on DragonFly. I can't test mzip.c because I don't see the compression library you are calling (maybe that's a FreeBSD thing). The results of the grep test ought to be similar for FreeBSD since the heuristic used by both OS's is the same. If they aren't, something might have gotten nerfed accidentally in the FreeBSD tree.

Here is the cache case test. mmap is clearly faster (though I would again caution that this should not be an implicit assumption since VM fault overheads can rival read() overheads, depending on the situation). The 'x1' file in all tests below is simply /usr/share/dict/words concatenated over and over again to produce a large file.

    crater# ls -la x1
    -rw-r--r-- 1 root wheel 638228992 Mar 23 11:36 x1

    [ machine has 1GB of ram ]

    crater# time grep --mmap asdfasf x1
    1.000u 0.117s 0:01.11 100.0% 10+40k 0+0io 0pf+0w
    crater# time grep --mmap asdfasf x1
    0.976u 0.132s 0:01.13 97.3% 10+40k 0+0io 0pf+0w
    crater# time grep --mmap asdfasf x1
    0.984u 0.140s 0:01.11 100.9% 10+41k 0+0io 0pf+0w
    crater# time grep asdfasf x1
    0.601u 0.781s 0:01.40 98.5% 10+42k 0+0io 0pf+0w
    crater# time grep asdfasf x1
    0.507u 0.867s 0:01.39 97.8% 10+40k 0+0io 0pf+0w
    crater# time grep asdfasf x1
    0.562u 0.812s 0:01.43 95.8% 10+41k 0+0io 0pf+0w

    crater# iostat 1
    [ while grep is running, in order to test the cache case and verify that
      no I/O is occurring once the data has been cached ]

The disk I/O case, which I can test by unmounting and remounting the partition containing the file in question, then running grep, seems to be well optimized on DragonFly. It should be similarly optimized on FreeBSD since the code that does this optimization is nearly the same. In my test, it is clear that the page-fault overhead in the uncached case is considerably greater than the copying overhead of a read(), though not by much. And I would expect that, too.
    test28# umount /home
    test28# mount /home
    test28# time grep asdfasdf /home/x1
    0.382u 0.351s 0:10.23 7.1% 55+141k 42+0io 4pf+0w
    test28# umount /home
    test28# mount /home
    test28# time grep asdfasdf /home/x1
    0.390u 0.367s 0:10.16 7.3% 48+123k 42+0io 0pf+0w
    test28# umount /home
    test28# mount /home
    test28# time grep --mmap asdfasdf /home/x1
    0.539u 0.265s 0:10.53 7.5% 36+93k 42+0io 19518pf+0w
    test28# umount /home
    test28# mount /home
    test28# time grep --mmap asdfasdf /home/x1
    0.617u 0.289s 0:10.47 8.5% 41+105k 42+0io 19518pf+0w
    test28#

    test28# iostat 1 during the test showed ~60MBytes/sec for all four tests

Perhaps you should post specifics of the test you are running, as well as specifics of the results you are getting, such as the actual timing output instead of a human interpretation of the results. For that matter, being an opteron system, were you running the tests on a UP system or an SMP system? grep is single-threaded, so on a 2-cpu system it will show 50% cpu utilization since one cpu will be saturated and the other idle. With specifics, a FreeBSD person can try to reproduce your test results.

A grep vs grep --mmap test is pretty straightforward and should be a good test of the VM read-ahead code, but there might always be some unknown circumstance specific to a machine configuration that is the cause of the problem. Repeatability and reproducibility by third parties is important when diagnosing any problem.

Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD. Unless someone ripped it out since I committed it many years ago, which I doubt, FreeBSD's VM heuristic will figure out that the accesses are sequential and start issuing read-aheads. It should pre-fault, and it should do read-ahead. That isn't to say that there isn't a bug, just that everyone interested in the problem has to be able to reproduce it and help each other track down the source. Just making an
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:Yes, they both do work fine, but time gives very different stats for each. In
:my experiments, the total CPU time is noticeably less with mmap, but the
:elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 --
:notice the large number of page faults, because the system does not try to
:preload file in the mmap case as it does in the read case:
:
: time fgrep meowmeowmeow /home/oh.0.dump
: 2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
: time fgrep --mmap meowmeowmeow /home/oh.0.dump
: 1.552u 7.109s 2:46.03 5.2% 18+1031k 156+0io 106327pf+0w
:
:Use a big enough file to bust the memory caching (oh.0.dump above is 2.9Gb),
:I'm sure, you will have no problems reproducing this result.

106,000 page faults. How many pages is a 2.9GB file? If this is running in 64-bit mode those would be 8K pages, right? So that would come to around 380,000 pages. About 1:4. So, clearly the operating system *IS* pre-faulting multiple pages.

Since I don't believe that a memory fault would be so inefficient as to account for 80 seconds of run time, it seems more likely to me that the problem is that the VM system is not issuing read-aheads. Not issuing read-aheads would easily account for the 80 seconds. It is possible that the kernel believes the VM system to be too loaded to issue read-aheads, as a consequence of your blowing out of the system caches. It is also possible that the read-ahead code is broken in FreeBSD.

To determine which of the two is more likely, you have to run a smaller data set (like 600MB of data on a system with 1GB of ram), and use the unmount/mount trick to clear the cache before each grep test. If the time differential is still huge using the unmount/mount data set test as described above, then the VM system's read-ahead code is broken. If the time differential is tiny, however, then it's probably nothing more than the kernel interpreting your massive 2.9GB mmap as being too stressful on the VM system and disabling read-aheads for that reason.

In any case, this sort of test is not really a good poster child for how to use mmap(). Nobody in their right mind uses mmap() on datasets that they expect to be uncacheable and which are accessed sequentially. It's just plain silly to use mmap() in that sort of circumstance. This is a truism on ANY operating system, not just FreeBSD. The uncached data set test (using unmount/mount and a dataset which fits into memory) is a far more realistic test because it simulates the most common case encountered by a system under load... the accessing of a reasonably sized data set which happens to not be in the cache.

-Matt
___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:I thought one serious advantage to this situation for sequential read :mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to :wait for the clock hands to reap them. On a large Solaris box I used :to have the non-pleasure of running the VM page scan rate was high, and :I suggested to the app vendor that proper use of mmap might reduce that :overhead. Admitedly the files in question were much smaller than the :available memory, but they were also not likely to be referenced again :before the memory had to be reclaimed forcibly by the VM system. : :Is that not the case? Is it better to let the VM system reclaim pages :as needed? : :Thanks, : :Gary madvise() should theoretically have that effect, but it isn't quite so simple a solution. Lets say you have, oh, your workstation, with 1GB of ram, and you run a program which runs several passes on a 900MB data set. Your X session, xterms, gnome, kde, etc etc etc all take around 300MB of working memory. Now that data set could fit into memory if portions of your UI were pushed out of memory. The question is not only how much of that data set should the kernel fit into memory, but which portions of that data set should the kernel fit into memory and whether the kernel should bump out other data (pieces of your UI) to make it fit. Scenario #1: If the kernel fits the whole 900MB data set into memory, the entire rest of the system would have to compete for the remaining 100MB of memory. Your UI would suck rocks. Scenario #2: If the kernel fits 700MB of the data set into memory, and the rest of the system (your UI, etc) is only using 300MB, and the kernel is using MADV_DONTNEED on pages it has already scanned, now your UI works fine but your data set processing program is continuously accessing the disk for all 900MB of data, on every pass, because the kernel is always only keeping the most recently accessed 700MB of the 900MB data set in memory. Scenario #3: Now lets say the kernel decides to keep just the first 700MB of the data set in memory, and not try to cache the last 200MB of the data set. Now your UI works fine, and your processing program runs FOUR TIMES FASTER because it only has to access the disk for the last 200MB of the 900MB data set. -- Now, which of these scenarios does madvise() cover? Does it cover scenario #1? Well, no. the madvise() call that the program makes has no clue whether you intend to play around with your UI every few minutes, or whether you intend to leave the room for 40 minutes. If the kernel guesses wrong, we wind up with one unhappy user. What about scenario #2? There the program decided to call madvise(), and the system dutifully reuses the pages, and you come back an hour later and your data processing program has only done 10 passes out of the 50 passes it needs to do on the data and you are PISSED. Ok. What about scenario #3? Oops. The program has no way of knowing how much memory you need for your UI to be 'happy'. No madvise() call of any sort will make you happy. Not only that, but the KERNEL has no way of knowing that your data processing program intends to make multiple passes on the data set, whether the working set is represented by one file or several files, and even the data processing program itself might not know (you might be running a script which runs a separate program for each pass on the same data set). So much for madvise(). So, no matter what, there will ALWAYS be an unhappy user somewhere. Lets take Mikhail's grep test as an example. 
If he runs it over and over again, should the kernel be 'optimized' to realize that the same data set is being scanned sequentially, over and over again, ignore the localized sequential nature of the data accesses, and just keep a dedicated portion of that data set in memory to reduce long term disk access?  Should it keep the first 1.5GB, or the last 1.5GB, or perhaps it should slice the data set up and keep every other 256MB block?  How does it figure out what to cache and when?  What if the program suddenly starts accessing the data in a cacheable way?

Maybe it should randomly throw some of the data away slowly in the hopes of 'adapting' to the access pattern, which would also require that it throw away most of the 'recently read' data far more quickly to make up for the data it isn't throwing away.  Believe it or not, that actually works for certain types of problems, except then you get hung up in a situation where two subsystems are competing with each other for memory resources (like a mail server versus a web server), and the system is unable to cope as the relative load factors for the competing subsystems change.  The problem becomes really complex really fast.
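As a concrete illustration of the pattern Gary describes (this is not code from the thread; the file name and the 16MB chunk size are invented for the example), a single sequential pass that releases each chunk with madvise(MADV_DONTNEED) once it has been processed might look like the sketch below.  It works nicely for exactly one pass, and, per the scenarios above, actively hurts if the program comes back for a second pass:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHUNK (16 * 1024 * 1024)        /* hint granularity, arbitrary */

    int
    main(void)
    {
        const char *path = "dataset.bin";   /* hypothetical data set */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long sum = 0;
        for (off_t off = 0; off < st.st_size; off += CHUNK) {
            size_t len = (st.st_size - off < CHUNK) ? st.st_size - off : CHUNK;
            for (size_t i = 0; i < len; i++)        /* one sequential pass */
                sum += (unsigned char)base[off + i];
            /* Tell the VM we are done with this chunk.  As argued above,
             * this only helps if a single pass really is the whole story. */
            madvise(base + off, len, MADV_DONTNEED);
        }
        printf("checksum %lu\n", sum);
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }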
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:On an amd64 system running about 6-week old -stable, both behave
:pretty much identically.  In both cases, systat reports that the disk
:is about 96% busy whilst loading the cache.  In the cache case, mmap
:is significantly faster.
:
:...
:turion% ls -l /6_i386/var/tmp/test
:-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
:turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
:       21.69 real         0.16 user         0.68 sys
:[umount/remount /6_i386/var]
:turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
:       21.68 real         0.41 user         0.51 sys
:
:The speed gain with mmap is clearly evident when the data is cached and
:the CPU clock wound right down (99MHz ISO 2200MHz):
:...
:--
:Peter Jeremy

That pretty much means that the read-ahead algorithm is working.  If it weren't, the disk would not be running at near 100%.

Ok.  The next test is to NOT do umount/remount and then use a data set that is ~2x system memory (but can still be mmap'd by grep).  Rerun the data set multiple times using grep and grep --mmap.

If the times for the mmap case blow up relative to the non-mmap case, then the vm_page_alloc() calls and/or vm_page_count_severe() (and other tests) in the vm_fault case are causing the read-ahead to drop out.  If this is the case the problem is not in the read-ahead path, but probably in the pageout code not maintaining a sufficient number of free and cache pages.  The system would only be allocating ~60MB/s (or whatever your disk can do), so the pageout thread ought to be able to keep up.

If the times for the mmap case do not blow up, we are back to square one and I would start investigating the disk driver that Mikhail is using.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
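For anyone wanting to rerun the comparison, here is a rough sketch of the mmap side of the test (the path argument is a placeholder; plain grep --mmap exercises the same vm_fault path).  It simply touches the file one page at a time, in order, so the sequential read-ahead heuristic gets a workout:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/var/tmp/test"; /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        long pagesize = sysconf(_SC_PAGESIZE);
        unsigned long sum = 0;

        /* Touch one byte per page, in order, so every access is a (possibly
         * soft) page fault and the kernel's read-ahead heuristic runs. */
        for (off_t off = 0; off < st.st_size; off += pagesize)
            sum += (unsigned char)base[off];

        printf("touched %jd bytes, checksum %lu\n", (intmax_t)st.st_size, sum);
        munmap((void *)base, st.st_size);
        close(fd);
        return 0;
    }

Running this and a read()-based scan back to back several times against a data set roughly twice the size of RAM is the experiment being asked for above.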
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:The results here are weird.  With 1GB RAM and a 2GB dataset, the
:timings seem to depend on the sequence of operations: reading is
:significantly faster, but only when the data was mmap'd previously
:There's one outlier that I can't easily explain.
:...
:Peter Jeremy

Really odd.  Note that if your disk can only do 25 MBytes/sec, the calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds as you would expect from your numbers.  So that would imply that the 80 second numbers represent read-ahead, and the 60 second numbers indicate that some of the data was retained from a prior run (and not blown out by the sequential reading in the later run).

This type of situation *IS* possible as a side effect of other heuristics.  It is particularly possible when you combine read() with mmap because read() uses a different heuristic than mmap() to implement the read-ahead.  There is also code in there which depresses the page priority of 'old' already-read pages in the sequential case.  So, for example, if you do a linear grep of 2GB you might end up with a cache state that looks like this:

    l = low priority page
    m = medium priority page
    h = high priority page

    FILE: [------------------------mmmmmmmm]

Then when you rescan using mmap,

    FILE: [llllllll----------------mmmmmmmm]

The low priority pages don't bump out the medium priority pages from the previous scan, so the grep winds up doing read-ahead until it hits the large swath of pages already cached from the previous scan, without bumping out those pages.

There is also a heuristic in the system (FreeBSD and DragonFly) which tries to randomly retain pages.  It clearly isn't working :-)  I need to change it to randomly retain swaths of pages, the idea being that it should take repeated runs to rebalance the VM cache rather than allowing a single run to blow it out or allowing a static set of pages to be retained indefinitely, which is what your tests seem to show is occurring.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
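The "randomly retain swaths" idea can be sketched with a toy decision function.  This is purely hypothetical illustration code, not the FreeBSD or DragonFly heuristic itself: pages are grouped into fixed-size swaths and one pseudo-random keep/recycle decision is made per swath per run, so any single sequential pass can only displace a bounded fraction of the existing cache.

    #include <stdint.h>

    #define SWATH_PAGES   256          /* 1MB swaths with 4K pages, arbitrary */
    #define RETAIN_PCT    25           /* fraction of swaths to shield per run */

    /* Cheap integer hash (xorshift-style mixer); any decent mixer would do. */
    static uint32_t
    mix(uint32_t x)
    {
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        x ^= x >> 16;
        return x;
    }

    /*
     * Return non-zero if the cached page at 'pindex' should be shielded from
     * reuse during this scan.  'run_seed' changes once per scan, so repeated
     * runs gradually rebalance the cache instead of one run blowing it out.
     */
    int
    retain_page(uint64_t pindex, uint32_t run_seed)
    {
        uint32_t swath = (uint32_t)(pindex / SWATH_PAGES);
        return (mix(swath ^ run_seed) % 100) < RETAIN_PCT;
    }

Because run_seed changes between runs, which swaths are shielded rotates over time, which is the "repeated runs to rebalance the cache" behavior described above.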
Re: Maximum Swapsize
From 'man tuning' (I think I wrote this, a long time ago):

    You should typically size your swap space to approximately 2x main
    memory.  If you do not have a lot of RAM, though, you will generally
    want a lot more swap.  It is not recommended that you configure any
    less than 256M of swap on a system and you should keep in mind future
    memory expansion when sizing the swap partition.  The kernel's VM
    paging algorithms are tuned to perform best when there is at least 2x
    swap versus main memory.  Configuring too little swap can lead to
    inefficiencies in the VM page scanning code as well as create issues
    later on if you add more memory to your machine.  Finally, on larger
    systems with multiple SCSI disks (or multiple IDE disks operating on
    different controllers), we strongly recommend that you configure swap
    on each drive (up to four drives).  The swap partitions on the drives
    should be approximately the same size.  The kernel can handle
    arbitrary sizes but internal data structures scale to 4 times the
    largest swap partition.  Keeping the swap partitions near the same
    size will allow the kernel to optimally stripe swap space across the N
    disks.  Do not worry about overdoing it a little, swap space is the
    saving grace of UNIX and even if you do not normally use much swap, it
    can give you more time to recover from a runaway program before being
    forced to reboot.

--

The last sentence is probably the most important.  The primary reason why you want to configure a fairly large amount of swap has less to do with performance and more to do with giving the system admin a long runway, so there is time to deal with unexpected situations before the machine blows itself to bits.

The swap subsystem has the following limitation:

    /*
     * If we go beyond this, we get overflows in the radix
     * tree bitmap code.
     */
    if (nblks > 0x40000000 / BLIST_META_RADIX / nswdev) {
        printf("exceeded maximum of %d blocks per swap unit\n",
            0x40000000 / BLIST_META_RADIX / nswdev);
        VOP_CLOSE(vp, FREAD | FWRITE, td);
        return (ENXIO);
    }

By default, BLIST_META_RADIX is 16 and nswdev is 4, so the maximum number of blocks *PER* swap device is 16 million.  If PAGE_SIZE is 4K, the limitation is 64 GB per swap device and up to 4 swap devices (256 GB total swap).

The kernel has to allocate memory to track the swap space.  This memory is allocated and managed by kern/subr_blist.c (assuming you haven't changed things since I wrote it).  This is basically implemented as a flattened radix tree using a fixed radix of 16.  The memory overhead is fixed (based on the amount of swap configured) and comes to approximately 2 bits per VM page.  Performance is approximately O(log N).

Additionally, once pages are actually swapped out the VM object must record the swap index for each page.  This costs around 4 bytes per swapped-out page and is probably the greatest limiting factor in the amount of swap you can actually use.  256GB of 100% used swap would eat 256MB of kernel ram.

I believe that large linear chunks of reserved swap, such as used by MD, currently still require the per-page overhead.  However, theoretically, since the reservation model uses a radix tree, it *IS* possible to reserve huge swaths of linear-addressed swap space with no per-page storage requirements in the VM object.  It is even possible to do away with the 2 bits per page that the radix tree uses if the radix tree were allocated dynamically.  I decided against doing that because I did not want the swap subsystem to be reliant on malloc() during critical low-memory paging situations.
-Matt
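The arithmetic above is easy to double-check; a throwaway calculation using the default constants quoted in the text (4K pages, radix 16, 4 swap devices):

    #include <stdio.h>

    int
    main(void)
    {
        const long long page_size        = 4096;
        const int       blist_meta_radix = 16;          /* radix tree fan-out */
        const int       nswdev           = 4;           /* max swap devices */

        /* Per-device block limit from the check in the swap code. */
        long long max_blocks_per_dev = 0x40000000LL / blist_meta_radix / nswdev;
        long long max_swap_bytes     = max_blocks_per_dev * nswdev * page_size;

        /* Bookkeeping overhead: ~2 bits/page for the blist, ~4 bytes per
         * swapped-out page recorded in the VM object. */
        long long pages       = max_swap_bytes / page_size;
        long long blist_bytes = pages * 2 / 8;
        long long swp_bytes   = pages * 4;

        printf("blocks per swap device : %lld (~%lld GB)\n",
            max_blocks_per_dev, (max_blocks_per_dev * page_size) >> 30);
        printf("total swap (4 devices) : %lld GB\n", max_swap_bytes >> 30);
        printf("blist overhead         : %lld MB\n", blist_bytes >> 20);
        printf("fully-used swap meta   : %lld MB\n", swp_bytes >> 20);
        return 0;
    }

It prints 64 GB per device, 256 GB total, roughly 16 MB of blist overhead, and 256 MB of per-page swap metadata when the swap is 100% used, matching the figures above.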
Re: serious networking (em) performance (ggate and NFS) problem
Polling should not produce any improvement over interrupts for EM0.  The EM0 card will aggregate 8-14+ packets per interrupt or more, which is only around 8000 interrupts/sec.  I've got a ton of these cards installed.

# mount_nfs -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
     66401     0   93668746       5534     0     962920     0
     66426     0   94230092       5537     0    1007108     0
     66424     0   93699848       5536     0     963268     0
     66422     0   94222372       5536     0    1007290     0
     66391     0   93654846       5534     0     962746     0
     66375     0   94154432       5532     0    1006404     0

(systat -vm 1, abridged: 8100 interrupts/sec total, of which 7873 are the
mux irq10 and 227 the clk irq0; CPU 19.2% Sys, 0.0% Intr, 0.0% User,
80.8% Idle; memory 88864 wire, 10404 act, 864476 inact, 58152 cache,
2992 free)

Note that the interrupt rate is only 7873 interrupts per second while I am transferring 94 MBytes/sec over NFS (UDP) and receiving over 66000 packets per second (~8 packets per interrupt).

If I use a TCP mount I get just about the same thing:

# mount_nfs -T -a 4 dhcp61:/home /mnt
# dd if=/mnt/x of=/dev/null bs=32k
# netstat -in 1

            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
     61752     0   93978800       8091     0     968618     0
     61780     0   93530484       8098     0     904370     0
     61710     0   93917880       8093     0     968128     0
     61754     0   93491260       8095     0     903940     0
     61756     0   93986320       8097     0     968336     0

(systat -vm 1, abridged: 8145 interrupts/sec total, of which 7917 are the
mux irq10 and 228 the clk irq0; CPU 26.4% Sys, 0.0% Intr, 0.0% User,
73.6% Idle; memory 141556 wire, 7800 act, 244872 inact, 8 cache,
630780 free)

In this case around 8000 interrupts per second with 61700 packets per second incoming on the interface (around ~8 packets per interrupt).  The extra interrupts are due to the additional outgoing TCP ack traffic.

If I look at the systat -vm 1 output on the NFS server it also sees only around 8000 interrupts per second, which isn't saying much other than that its transmit path (61700 pps outgoing) is not creating an undue interrupt burden relative to the receive path.

-Matt
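The "~8 packets per interrupt" figure is just the ratio of the two measurements; for the record (numbers copied from the samples above):

    #include <stdio.h>

    int
    main(void)
    {
        /* Rough per-second figures read off the netstat/systat samples. */
        double udp_pps   = 66400.0;   /* input packets/sec, UDP NFS mount */
        double udp_intr  = 7873.0;    /* mux irq10 interrupts/sec */
        double udp_bytes = 94e6;      /* ~94 MBytes/sec in */

        printf("UDP: %.1f packets/interrupt, %.0f bytes/packet\n",
            udp_pps / udp_intr, udp_bytes / udp_pps);

        double tcp_pps  = 61750.0;    /* input packets/sec, TCP NFS mount */
        double tcp_intr = 7917.0;     /* mux irq10 interrupts/sec */
        printf("TCP: %.1f packets/interrupt\n", tcp_pps / tcp_intr);
        return 0;
    }

That works out to roughly 8 packets and about 1400 bytes (near a full MTU) per interrupt, which is the aggregation being described.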
Re: 4.4-STABLE machine unusable (was Re: Openssh)
:On Thu, 25 Oct 2001 [EMAIL PROTECTED] wrote:
:
:> Guys,
:>
:> You can say all you like, but something in stable is totally fscked up.  As
:> soon as I log in and start doing anything that involves a bit of traffic
:> (e.g. tailing a file), the connection freezes and I have to kill it.  sshd
:> doesn't die, so I can log in again.  I can reproducibly freeze it by
:> doing... well practically anything:
:>
:> tail /var/log/messages, vi, cat, etc. all freeze the connection.  Strangely
:> enough,
:
:I've seen this before, or something that sounds identical.  telnet did the
:same thing, and anything over a size i dont remember via http did it as
:well.
:
:The workaround I found was to drop the MTU on the ethernet card (a generic
:ne2k card at the time, no idea what it was plugged into.) down to 512 and
:it was fine.  Move it above 512 and the problems came back.
:...

TCP does what is known as path MTU discovery to figure out the lowest MTU in the connection path, and it sets the no-frag (don't-fragment) bit on its packets while doing so.  This can break down if you are running through a misconfigured firewall, or if an intermediate router or machine does not respond with the correct ICMP error when an oversized no-frag packet is received.  If the firewall blocks ICMP error #3 (destination unreachable) subcode 4 (fragmentation needed), your TCP connection will not properly detect the MTU.

Reducing the client machine's interface MTU is a work-around (it caps your packet size at something hopefully no larger than the smallest MTU between you and the destination), but the best solution is to figure out where the misconfigured router/machine is and fix it.

-Matt
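For reference, the interface-MTU workaround mentioned in the quoted text is normally just 'ifconfig <ifname> mtu 512'; done programmatically it is a SIOCSIFMTU ioctl.  A hedged sketch, with the interface name and MTU value as placeholders:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/sockio.h>     /* SIOCSIFMTU on the BSDs */
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct ifreq ifr;
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        memset(&ifr, 0, sizeof(ifr));
        strlcpy(ifr.ifr_name, "ed0", sizeof(ifr.ifr_name)); /* placeholder NIC */
        ifr.ifr_mtu = 512;                                  /* work-around MTU */

        if (ioctl(s, SIOCSIFMTU, &ifr) < 0) {
            perror("SIOCSIFMTU");
            return 1;
        }
        close(s);
        return 0;
    }

This is only a band-aid, as the post says; finding and fixing the hop that eats the ICMP "fragmentation needed" errors is the real fix.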
Re: 4.4-STABLE machine unusable (was Re: Openssh)
:Hey I won't criticize you.  I thought that was my dsl doing it but it does
:not happen on my windows box.  Whenever I start doing something that involves
:traffic on my freebsd box it will not respond remotely for about 30 seconds.
:Whats up with it?

Startup delays in the 30-second range are typically due to broken DNS.  Usually the host involved has trouble looking up its hostname or looking up its IP address (reverse DNS).  It can take a while for the DNS request to time out, hence the delay.

-Matt
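An easy way to test that theory is to time a reverse lookup of the client's address yourself; a small sketch using getnameinfo (the IP address below is a placeholder):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int
    main(void)
    {
        struct sockaddr_in sin;
        char host[NI_MAXHOST];

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_len = sizeof(sin);                        /* BSD sockaddrs carry a length */
        inet_pton(AF_INET, "192.0.2.10", &sin.sin_addr);  /* placeholder client IP */

        time_t start = time(NULL);
        int error = getnameinfo((struct sockaddr *)&sin, sizeof(sin),
            host, sizeof(host), NULL, 0, NI_NAMEREQD);
        time_t took = time(NULL) - start;

        if (error)
            printf("reverse lookup failed (%s) after %ld seconds\n",
                gai_strerror(error), (long)took);
        else
            printf("resolved to %s in %ld seconds\n", host, (long)took);
        return 0;
    }

If this sits there for tens of seconds before failing, fix the PTR record (or list the host in /etc/hosts) and the login delay should go away.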
Re: /usr/src/UPDATING
: * Bill Fenner's code to make ipfw/dummynet/bridge KLD'able
:
:   BOTH THESE THINGS REQUIRES REBUILDING OF ipfw.ko and /sbin/ipfw
:
:I am not sure if that should be added to UPDATING.  After all, it was not
:added to -CURRENT's /usr/src/UPDATING neither ;-)
:
:       Regards,
:       JMA
:--

Whenever ipfw changes, people who update their kernel and do not reinstall world (or at least the ipfw binary) often wind up with inaccessible machines when they reboot, because /etc/rc* cannot load the IPFW rule set.  It's happened to me many, many times, and it's extremely annoying.  An UPDATING entry is the minimum that should be added.

It would also be nice if the system detected out-of-date IPFWs (like maybe adding a version field to the 'rule' structure) so ipfw could request a 'safe' mode of operation or something.  I don't know.  But it's getting really annoying.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
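Purely to illustrate the "version field" suggestion (these are hypothetical structures and names, not the real ipfw rule layout), the idea is that the kernel could detect a rule blob built against older headers instead of silently misparsing it:

    #include <errno.h>
    #include <stdint.h>

    #define RULE_ABI_VERSION  7            /* bumped whenever the layout changes */

    struct user_rule {                     /* hypothetical userland rule blob */
        uint32_t ur_version;               /* filled in by /sbin/ipfw at build time */
        uint32_t ur_flags;
        /* ... rest of the rule ... */
    };

    /*
     * Sketch of a sockopt-handler check: refuse the rule (or fall back to a
     * 'safe' mode) rather than loading a rule set it cannot interpret, so
     * the boot scripts can detect the mismatch and warn instead of leaving
     * the machine unreachable.
     */
    static int
    check_rule_version(const struct user_rule *ur)
    {
        if (ur->ur_version != RULE_ABI_VERSION)
            return (EPROTONOSUPPORT);
        return (0);
    }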
Re: KDE init [was RE: DCOP server problem...]
:
:Matt,
:
:Check out /usr/ports/sysutils/portupgrade.  Automatic port & port
:dependency upgrading, fully recursive.
:
:Took me about thirty minutes to set up.  My desktop is three years old
:and has been manually upgraded, patched, etc, repeatedly.  And
:portupgrade upgraded every port to the latest version, cleaned up all
:sorts of old crap, etc.  Highly recommended.

Yah, I'm playing with it now.  After fixing up the database with pkgdb -F (scary!) I am now running portupgrade -a.  It's happily churning away on the ports tree, upgrading god knows what.

Thanks!

-Matt
Re: -STABLE buildkernel broke! (linux module)
:: Yet it _did_ fail when doing:
:: cd /usr/src
:: make buildkernel KERNCONF=HADES
::
:: Please advise!
:
:I had to rm all the files in the compile/HADES/./modules/linux
:directory in order for my compile to succeed.  The problem is, I
:think, that some generated files in the modules directory become
:actual files in the repo.  This causes the wrong versions of these
:files to be picked up and boom.
:
:Warner

I had to do the same thing to get rid of the linux_time_t compiler failures.  Hmm.  'make depend' might also work.  But the easiest thing to do is just rm -rf /usr/src/sys/compile/BLAH/modules and then config and make it again.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
HEADS UP! GENERIC/Kernel configs using maxusers of 0 will autosize but require new config binary.
The simplified version of the maxusers auto-sizing has been MFCd but people need to be aware that to use it you need to update your kernel source AND recompile /usr/src/usr.sbin/config.  If you use an old config with a kernel conf specifying a maxusers of 0 you will get a warning and maxusers will be forced to 8 by the old config binary.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: Speeding up ssh over modem?
:I am wondering if there is a way for sshd to check if the delay between
:two write()s to the terminal it creates are closer than say 5ms, and
:combining those on the outgoing wire, for a max of 10.  This would yield
:a max delay of 50ms, which I think would not really be noticeable.
:
:Alex

Try hacking your ssh to not set TCP_NODELAY.  In /usr/src/crypto/openssh/packet.c, around line 1284, #if 0 out the setsockopt TCP_NODELAY code.

-Matt
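What that hack disables looks roughly like the following (a generic illustration, not the literal packet.c code):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * Roughly what ssh does: disable Nagle so each keystroke goes out
     * immediately.  Compiling this out (the suggested #if 0) leaves Nagle
     * enabled, so the kernel coalesces ssh's many tiny writes into fewer,
     * fuller packets, which is what you want on a modem link.
     */
    static void
    set_nodelay(int fd, int on)
    {
    #if 0   /* suggested hack: compile this out for low-bandwidth links */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    #else
        (void)fd;
        (void)on;
    #endif
    }

The trade-off is a little extra keystroke latency from Nagle's algorithm, which over a modem is usually lost in the link latency anyway.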
Re: on/off NFS connection errors
:For a while I have been treating this as an annoyance, but I thought it
:would be wise to investigate if something larger and more nefarious might
:be being indicated by this.
:
:I have a mixed environment of mainly Linux boxen, with a dozen or so
:FreeBSD machines (For when we need the kind of network resources that
:raising the NMBCLUSTERS can offer.)  Both types of systems serve mainly as
:webservers, serving content that ultimately comes off of exported NFS
:directories, from a Network Appliance (NetApp Release 5.3.4R3: Thu Jan 27
:12:08:07 PST 2000)  The Linux boxen don't complain at all, but the FreeBSD
:boxen can get rather noisy about NFS connection errors.  It happens
:on and off like so:
:
:><118>Dec 15 21:01:47 cc117 /kernel: nfs server netapp1:/vol/members: not
:>responding
:><118>Dec 15 21:01:47 cc117 /kernel: nfs server netapp1:/vol/members: is
:>alive again
:..

Our NFS is somewhat finicky about response times.  It could probably use some tuning.  Try mounting with the 'dumbtimer' mount option and see if that fixes your problem (also see the -d and -t options in 'man mount_nfs'; note that -t is specified in 1/10 second increments).

-Matt

:Has anyone else seen these kinds of persistent NFS errors is the 4.x
:branch?  (This didn't happen noticeably in 3.x, but I would still
:maintain that the NFS code in 4.x is an improvement over 3.x.)  Can anyone
:suggest a sysctl/kernel variable I might tune to help remedy the problem?
:If the root of the problem is more likely on the Netapp side, I have a support
:contact and am not afraid to use it.  Anyone have any advice or
:suggestions to offer?
:
:This is the platform that I am working on:
:FreeBSD cc117 4.2-STABLE FreeBSD 4.2-STABLE #0: Sat Aug 18 00:21:16 EDT
:2001 root@cc117:/usr/src/sys/compile/CCI_KERNEL i386
Re: HEADS UP! GENERIC/Kernel configs using maxusers of 0 will autosize but require new config binary.
:
:On Thu, 13 Dec 2001, Matthew Dillon wrote:
:
:> The simplified version of the maxusers auto-sizing has been MFCd but
:> people need to be aware that to use it you need to update your kernel
:> source AND recompile /usr/src/usr.sbin/config.
:
:So will the following sequence be ok?
:make buildworld
:make installworld
:make kernel

Sure, that will work fine.  In fact, I made a slight change after 3 people forgot to update their config and wound up with a maxusers of 8.  So, in fact, you do not have to recompile the world or config for this change to be effective; you need only recompile the kernel.  Recompiling the world (or just config) will get rid of a harmless config-generated warning, however.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: Matt Dillon: cvs commit: src/sys/kern vfs_bio.c src/sys/nfs nfs.h nfs_bio.c nfs_vnops.c src/sys/vm vm_page.c vnode_pager.c
:
:Ah, I withdraw my concerns about issue #1 then.  Now we're just down
:to the floppy overflow issue. :)
:
:- Jordan

What was issue #1?

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: Matt Dillon: cvs commit: src/sys/kern vfs_bio.c src/sys/nfs nfs.h nfs_bio.c nfs_vnops.c src/sys/vm vm_page.c vnode_pager.c
:
::
::Ah, I withdraw my concerns about issue #1 then.  Now we're just down
::to the floppy overflow issue. :)
::
::- Jordan
:
:What was issue #1?
:

Never mind :-).  I have two things left on my plate for -stable:

* The vnode recycling code, which Yahoo and I have been testing.

* The pageout daemon's vget() deadlock.  Martin got back to me on an older version of the patch, which appears to have worked, and he is now testing the newer (submitted to the release engineers) version of the patch.  If that succeeds I will want the re's to sign off on it.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: 4.4-STABLE crashes - suspects new ata-driver over wd-drivers
:I induced the crash by running "make clean; make buildworld" in one
:infinite loop and "portsdb -Uu" in another.  That string occurs in a
:bunch of makefiles in /usr/ports.  Some of the occurences in the core
:are clearly from them, but many of them are surrounded by binary
:data.  I recursively grepped /usr/{src,obj,bin,ports} and
:/usr/local/{bin,lib} and didn't find any binary files with that
:string.  My guess then is that it's from the memory image of a make
:process.
:
:--
: Brady Montz

This is so weird.  The corruption is occurring in the vm_page_t itself, at least in the crash you sent me.  The vm_page_t is at a locked-down address in the kernel.  It is not affecting the vm_page_t's around the one that got corrupted.  The corruption does not appear to be on a page or device block boundary.  I am at a loss as to how it's getting there.

Could you try playing with the DMA modes on your IDE hard drive?  Try turning DMA off, for example, and see if the corruption still occurs.

-Matt
Re: My first 4.5-PRERELEASE Issue
:> > Ever since I made world/kernel I've been getting small hangs in my
:> > internet connection.  These messages have started appearing in my logs
:> > as well:
:> > sio0: 1 more silo overflow (total 1257)
:> > sio0: 1 more silo overflow (total 1258)
:> > sio0: 1 more silo overflow (total 1259)
:> > sio0: 1 more silo overflow (total 1260)
:>
:> Are you sure you haven't just previously overlooked these types of messages?
:
:Yes.  They started when I indicated.

Try changing the FIFO_RX_HIGH to FIFO_RX_MEDH in isa/sio.c line 2385, recompile your kernel, and see if that helps.

-Matt
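For context on what that change does (the defines below are the standard 16550 FIFO Control Register values, shown for illustration rather than quoted from sio.c): FIFO_RX_HIGH interrupts only when 14 of the 16 receive-FIFO bytes are full, leaving 2 bytes of headroom, while FIFO_RX_MEDH triggers at 8 bytes, giving a slow or busy machine more time to service the interrupt before the silo overflows.

    /* Standard 16550 FIFO Control Register bits. */
    #define FIFO_ENABLE   0x01
    #define FIFO_RCV_RST  0x02
    #define FIFO_XMT_RST  0x04
    #define FIFO_RX_LOW   0x00      /* interrupt after 1 byte   */
    #define FIFO_RX_MEDL  0x40      /* interrupt after 4 bytes  */
    #define FIFO_RX_MEDH  0x80      /* interrupt after 8 bytes  */
    #define FIFO_RX_HIGH  0xc0      /* interrupt after 14 bytes */

    /* Hypothetical helper: build the FCR value for a given RX trigger level.
     * The suggested experiment amounts to passing FIFO_RX_MEDH here instead
     * of FIFO_RX_HIGH. */
    static unsigned char
    fifo_control(unsigned char rx_trigger)
    {
        return (FIFO_ENABLE | FIFO_RCV_RST | FIFO_XMT_RST | rx_trigger);
    }

The cost of the lower trigger level is roughly twice as many receive interrupts at a given line rate, which is usually a fine trade on a serial link.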
Re: 4.5-PRERELEASE NFS Problem
:I'm seeing problems with NFS serving over UDP in a 4.5-PRERELEASE
:system.  The problem seems to be in readdir or perhaps stat.  I can't ls
:a directory that's mounted via UDP.  I can read files.  The problem
:goes away with TCP mounts.  I had problems with both FBSD 4.4 and Solaris
:2.8 clients.  The server is new, so I don't have any historical data,
:however I've never had problems in the past in other servers using UDP.
:The system was cvsup'ed Thursday evening.
:
:Danny
:
:--
:Daniel Schales, Louisiana Tech University

This could be just about anything.  I recommend you run tcpdump on both the client and the server and observe the NFS packet traffic during the failure.  We haven't made any changes to the readdir code specifically.

Other possibilities: the ethernet card may be dropping packets or otherwise misbehaving, there may be MTU issues, or there may be firewall issues.

-Matt