Re: zfs + uma
on 21/09/2010 09:39 Jeff Roberson said the following:
> I'm afraid there is not enough context here for me to know what 'the same
> mechanism' is or what solaris does.  Can you elaborate?

This was in my first post:
[[[
There is this good book:
http://books.google.com/books?id=r_cecYD4AKkC&printsec=frontcover
Please see section 6.2.4.5 on page 225 and table 6-11 on page 226.

And also this code:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#971
]]]

> I prefer not to take the weight of specific examples too heavily when
> considering the allocator, as it must handle many cases and many types of
> systems.  I believe there are cases where you want large allocations to be
> handled by per-cpu caches, regardless of whether ZFS is one such case.  If ZFS
> does not need them, then it should simply allocate directly from the VM.
> However, I don't want to introduce some maximum constraint unless it can be
> shown that adequate behavior is not generated from some more adaptable
> algorithm.

Yes, I agree in general.  But sometimes simplicity has its benefits too, as
opposed to the complex dynamic behavior that _might_ result from adaptive
algorithms.

Anyway, I have some early patches implementing the first two of your
suggestions and I am testing them now.  They look good to me so far.  The
parameters in the adaptations would probably need some additional tuning.

-- 
Andriy Gapon
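[Editorial note: for context, the Solaris code referenced above caps the per-CPU
magazine depth as a function of object size (the table discussed in the book).
A rough C sketch of the same idea applied to per-CPU bucket sizing follows; the
function name and the exact thresholds are invented for illustration and are
not the FreeBSD or Solaris code.]

#include <sys/types.h>

/*
 * Illustrative sketch only: cap the number of items kept in a per-CPU
 * bucket based on item size, in the spirit of the Solaris kmem
 * magazine-size table referenced above.  Thresholds are made up.
 */
static int
sketch_bucket_items_for_size(size_t item_size)
{
	if (item_size <= 4096)
		return (128);	/* small items: deep per-CPU caches are cheap */
	if (item_size <= 16384)
		return (32);
	if (item_size <= 32768)
		return (8);
	return (4);	/* e.g. the 4 items/bucket limit mentioned in this thread */
}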
Re: zfs + uma
on 21/09/2010 09:35 Jeff Roberson said the following:
> On Tue, 21 Sep 2010, Andriy Gapon wrote:
>
>> on 19/09/2010 01:16 Jeff Roberson said the following:
>>> Additionally we could make a last ditch flush mechanism that runs on each
>>> cpu in
>>
>> How would you qualify a "last ditch" trigger?
>> Would this be called from the "standard" vm_lowmem hook or would there be
>> some extra check for an even more severe memory condition?
>
> If lowmem does not make enough progress to improve the condition.

Do we have a good way to detect that?
I see that currently vm_lowmem is always invoked with an argument value of zero.

-- 
Andriy Gapon
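[Editorial note: the vm_lowmem event is an EVENTHANDLER, and its integer
argument is where a severity hint could be carried; as noted above, the
invocations in the tree at the time pass 0.  The sketch below shows a consumer
hook; the handler name and the VM_LOW_DESPERATE flag are hypothetical, not
existing kernel symbols.]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>

#define	VM_LOW_DESPERATE	0x01	/* hypothetical severity flag */

/*
 * Illustrative consumer of the vm_lowmem event.  The "flags" argument is
 * where a severity hint could be passed; today the event is always invoked
 * with 0, so the check below is purely a sketch of the idea.
 */
static void
example_lowmem(void *arg __unused, int flags)
{
	if (flags & VM_LOW_DESPERATE) {
		/* last-ditch behaviour, e.g. flush per-CPU caches */
	}
	/* ordinary lowmem behaviour: trim caches, free what is cheap */
}

static void
example_lowmem_register(void)
{
	EVENTHANDLER_REGISTER(vm_lowmem, example_lowmem, NULL,
	    EVENTHANDLER_PRI_FIRST);
}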
Re: zfs + uma
On Tue, 21 Sep 2010, Andriy Gapon wrote:

> on 19/09/2010 01:16 Jeff Roberson said the following:
>> Additionally we could make a last ditch flush mechanism that runs on each
>> cpu in
>
> How would you qualify a "last ditch" trigger?
> Would this be called from the "standard" vm_lowmem hook or would there be
> some extra check for an even more severe memory condition?

If lowmem does not make enough progress to improve the condition.

Jeff

>> turn and flushes some or all of the buckets in per-cpu caches.  Presently
>> that is not done due to synchronization issues.  It can't be done from a
>> central place.  It could be done with a callout mechanism or a for loop
>> that binds to each core in succession.
>
> -- 
> Andriy Gapon
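[Editorial note: a minimal sketch of the "for loop that binds to each core in
succession" idea mentioned above.  This is not the actual UMA code;
cache_drain_cpu() is a hypothetical helper standing in for whatever would
actually free the current CPU's buckets back to the zone, and locking details
of the real cache layout are ignored.]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

#include <vm/uma.h>

void	cache_drain_cpu(uma_zone_t zone, int cpu);	/* hypothetical helper */

/*
 * Drain the per-CPU bucket caches of a zone by binding the current thread
 * to each CPU in turn, so the per-CPU state is only ever touched from its
 * owning CPU.
 */
static void
zone_drain_percpu_sketch(uma_zone_t zone)
{
	struct thread *td = curthread;
	int cpu;

	for (cpu = 0; cpu <= mp_maxid; cpu++) {
		if (CPU_ABSENT(cpu))
			continue;
		thread_lock(td);
		sched_bind(td, cpu);		/* migrate to the target CPU */
		thread_unlock(td);

		cache_drain_cpu(zone, cpu);	/* hypothetical per-CPU drain */
	}
	thread_lock(td);
	sched_unbind(td);
	thread_unlock(td);
}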
Re: zfs + uma
On Tue, 21 Sep 2010, Andriy Gapon wrote:

> on 19/09/2010 11:42 Andriy Gapon said the following:
>> on 19/09/2010 11:27 Jeff Roberson said the following:
>>> I don't like this because even with very large buffers you can still have
>>> high enough turnover to require per-cpu caching.  Kip specifically added
>>> UMA support to address this issue in zfs.  If you have allocations which
>>> don't require per-cpu caching and are very large why even use UMA?
>>
>> Good point.
>> Right now I am running with a 4 items/bucket limit for items larger than
>> 32KB.
>
> But I also have two counter-points actually :)
> 1. Uniformity.  E.g. you can handle all ZFS I/O buffers via the same
> mechanism regardless of buffer size.
> 2. (Open)Solaris has done this for a while and it seems to suit them well.
> Not saying that they are perfect, or the best, or an example to follow, but
> still that means quite a bit (for me).

I'm afraid there is not enough context here for me to know what 'the same
mechanism' is or what solaris does.  Can you elaborate?

I prefer not to take the weight of specific examples too heavily when
considering the allocator, as it must handle many cases and many types of
systems.  I believe there are cases where you want large allocations to be
handled by per-cpu caches, regardless of whether ZFS is one such case.  If ZFS
does not need them, then it should simply allocate directly from the VM.
However, I don't want to introduce some maximum constraint unless it can be
shown that adequate behavior is not generated from some more adaptable
algorithm.

Thanks,
Jeff

> -- 
> Andriy Gapon
Re: zfs + uma
On Tue, Sep 21, 2010 at 1:39 AM, Jeff Roberson wrote:
> On Tue, 21 Sep 2010, Andriy Gapon wrote:
>
>> on 19/09/2010 11:42 Andriy Gapon said the following:
>>> on 19/09/2010 11:27 Jeff Roberson said the following:
>>>> I don't like this because even with very large buffers you can still
>>>> have high enough turnover to require per-cpu caching.  Kip specifically
>>>> added UMA support to address this issue in zfs.  If you have allocations
>>>> which don't require per-cpu caching and are very large why even use UMA?
>>>
>>> Good point.
>>> Right now I am running with a 4 items/bucket limit for items larger than
>>> 32KB.
>>
>> But I also have two counter-points actually :)
>> 1. Uniformity.  E.g. you can handle all ZFS I/O buffers via the same
>> mechanism regardless of buffer size.
>> 2. (Open)Solaris has done this for a while and it seems to suit them well.
>> Not saying that they are perfect, or the best, or an example to follow,
>> but still that means quite a bit (for me).
>
> I'm afraid there is not enough context here for me to know what 'the same
> mechanism' is or what solaris does.  Can you elaborate?
>
> I prefer not to take the weight of specific examples too heavily when
> considering the allocator, as it must handle many cases and many types of
> systems.  I believe there are cases where you want large allocations to be
> handled by per-cpu caches, regardless of whether ZFS is one such case.  If
> ZFS does not need them, then it should simply allocate directly from the
> VM.  However, I don't want to introduce some maximum constraint unless it
> can be shown that adequate behavior is not generated from some more
> adaptable algorithm.

Actually, I think that there is a middle ground between "per-cpu caches" and
"directly from the VM" that we are missing.  When I've looked at the default
configuration of ZFS (without the extra UMA zones enabled), there is an
incredible amount of churn on the kmem map caused by the implementation of
uma_large_malloc() and uma_large_free() going directly to the kmem map.  Not
only are the obvious things happening, like allocating and freeing kernel
virtual addresses and underlying physical pages on every call, but also
system-wide TLB shootdowns and sometimes superpage demotions are occurring.

I have some trouble believing that the large allocations being performed by
ZFS really need per-CPU caching, but I can certainly believe that they could
benefit from not going directly to the kmem map on every uma_large_malloc()
and uma_large_free().  In other words, I think it would make a lot of sense
to have a thin layer between UMA and the kmem map that caches allocated but
unused ranges of pages.

Regards,
Alan
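[Editorial note: a minimal sketch of the "thin layer that caches allocated but
unused ranges of pages" suggested above.  This is not actual FreeBSD code:
kmem_alloc_stub()/kmem_free_stub() are hypothetical stand-ins for the real
kmem-map entry points, all names, sizes and depths are invented, and locking
plus a lowmem drain hook are omitted.]

#include <sys/param.h>

void	*kmem_alloc_stub(int pages);		/* hypothetical */
void	 kmem_free_stub(void *addr, int pages);	/* hypothetical */

#define	LARGE_CLASSES	8	/* cache ranges of 1..8 pages */
#define	LARGE_DEPTH	16	/* keep at most 16 idle ranges per class */

static void	*large_cache[LARGE_CLASSES][LARGE_DEPTH];
static int	 large_count[LARGE_CLASSES];

void *
large_alloc_sketch(int pages)
{
	int c = pages - 1;

	/* Reuse a cached range: no KVA allocation, no TLB shootdown. */
	if (c < LARGE_CLASSES && large_count[c] > 0)
		return (large_cache[c][--large_count[c]]);
	return (kmem_alloc_stub(pages));
}

void
large_free_sketch(void *addr, int pages)
{
	int c = pages - 1;

	/* Park the range instead of tearing down its mappings right away. */
	if (c < LARGE_CLASSES && large_count[c] < LARGE_DEPTH) {
		large_cache[c][large_count[c]++] = addr;
		return;
	}
	kmem_free_stub(addr, pages);
}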
Kernel side buffer overflow issue
Hi,

I am using a FreeBSD 8.1-STABLE-201008 snapshot.  The system behaves
strangely: unexpected and meaningless messages are seen on consoles.  You can
download a screen shot from: http://193.255.128.30/~ryland/syslogd.jpg

Additionally, the default router changes unexpectedly.  I tried FreeBSD 7.1,
7.2, 7.3 and the 8.1-STABLE-201008 releases (both i386 and amd64).  All these
versions are affected.  I inspected the logs to see if someone had logged in
or changed the route (with the "route -n monitor" command).  When the default
route changed, there were no messages in the "route -n monitor" output.

I think there may be a buffer overflow in kernel code.  When dummynet is
enabled, the problem is seen more frequently; it repeats about once per 10
minutes.  I wrote a shell script which monitors the default router.  I saw
that sometimes "netstat -rn" shows that the default router has changed to
10.0.16.251 or 10.6.10.240 etc., which are client IP addresses, but traffic
is still routed to the right router 193.X.Y.Z.  After a while, routing really
fails.

You can download the tcpdump capture file from
http://193.255.128.30/~ryland/flowdata_10_0_16_251 .  This file was captured
while the default router changed.  The capture belongs to the IP address that
was shown as the default router (10.0.16.251).

The tcpdump command:
tcpdump -w /home/flowdata_10_0_16_251 -ni bce0.116 host 10.0.16.251

--
dummynet rules are:
3 pipe 3 tcp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to any dst-port 8000,80,22,25,88,110,443,1720,1863,1521,3389,4489 via em0 // Upload
3 pipe 3 udp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to any dst-port 53 via em0 // Upload
3 pipe 4 tcp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to any via em0 // Upload
3 pipe 4 udp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to any via em0 // Upload
LOTS OF NAT RULES HERE (in kernel nat)
6 pipe 1 tcp from any 8000,80,22,25,88,110,443,1720,1863,1521,3389,4489 to 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 1 udp from any 53 to 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 2 tcp from any to 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 2 udp from any to 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download

/sbin/ipfw pipe 1 config bw 8192Kbit/s mask dst-ip 0x
/sbin/ipfw pipe 3 config bw 1024Kbit/s mask src-ip 0x
/sbin/ipfw pipe 2 config bw 4096Kbit/s mask dst-ip 0x
/sbin/ipfw pipe 4 config bw 1024Kbit/s mask src-ip 0x

--
sysctl vars:
net.inet.ip.fw.dyn_max=65535
net.inet.ip.fw.dyn_ack_lifetime=100
net.inet.ip.fw.dyn_short_lifetime=10
net.inet.ip.fw.one_pass=0
kern.maxfiles=65000
kern.ipc.somaxconn=1024
net.inet.ip.process_options=0
net.inet.ip.fastforwarding=1
net.link.ether.ipfw=1
net.inet.ip.fw.dyn_buckets=65536
kern.maxvnodes=40
net.inet.ip.dummynet.hash_size=256 (also tried with 8192)
net.inet.ip.dummynet.pipe_slot_limit=500
net.inet.ip.dummynet.io_fast=1

--
/boot/loader.conf:
autoboot_delay="1"
beastie_disable="YES"
kern.ipc.nmbclusters=98304
vm.kmem_size="2048M"
vm.kmem_size_max="2048M"
splash_bmp_load="YES"
vesa_load="YES"
bitmap_load="YES"
bitmap_name="/boot/splash.bmp"
hw.ata.ata_dma=0
kern.hz="1"

--
kernel config (in addition to GENERIC):
device tap
device if_bridge
device vlan
device carp
options GEOM_BDE
options IPFIREWALL
options IPFIREWALL_VERBOSE
options HZ=4000
options IPFIREWALL_VERBOSE_LIMIT=4000
options IPFIREWALL_FORWARD
options IPFIREWALL_DEFAULT_TO_ACCEPT
options IPFIREWALL_NAT
options DUMMYNET
options IPDIVERT
options IPSTEALTH
options NETGRAPH
options NETGRAPH_IPFW
options LIBALIAS
options NETGRAPH_NAT
options NETGRAPH_PPPOE
options NETGRAPH_SOCKET
options NETGRAPH_ETHER
options DEVICE_POLLING
device crypto
options IPSEC

--
Some information about the network:
The system has 3 NICs (WAN, LAN, DMZ).  There are VLANs on the WAN and LAN
interfaces.  Throughput is between 20Mbps and 100Mbps.

Any ideas?

Regards,
Ozkan KIRIK
Mersin University @ Turkey
Re: page table fault, which should map kernel virtual address space
On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus wrote:
>
> Hallo,
>
> this is about the 'NKPT' definition, 'kernel_map' submaps,
> and the 'vm_map_findspace' function.
>
> The variable 'kernel_map' is used to manage the kernel virtual address
> space.  When the 'vm_map_findspace' function deals with 'kernel_map',
> the 'pmap_growkernel' function is called.
>
> At least on the 'i386' architecture, the pmap implementation uses
> the 'pmap_growkernel' function to allocate missing page tables.
> Missing page tables are a problem, because no one checks
> the 'pte' pointer for validity after use of the 'vtopte' macro.
>
> The 'NKPT' definition defines the number of page tables preallocated
> during system boot.
>
> Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
> pager_map, ...) exist as the result of 'kmem_suballoc' function calls.
> When these submaps are used (for example by the 'kmem_alloc_nofault'
> function) and their virtual address subspace is at the end of the
> currently used kernel virtual address space (and above the 'NKPT'
> preallocation), then the missing page tables are not allocated
> and a double fault can happen.

No, the page tables are allocated.  If you create a submap X of the kernel
map using kmem_suballoc(), then a vm_map_findspace() is performed by
vm_map_find() on the kernel map to find space for the submap X.  As you note
above, the call to vm_map_findspace() on the kernel map will call
pmap_growkernel() if needed to extend the kernel page table.

If you create another submap X' of X, then that submap X' can only map
addresses that fall within the range of X.  So, any necessary page table
pages were allocated when X was created.

That said, there may actually be a problem with the implementation of the
superpage_align parameter to kmem_suballoc().  If a submap is created with
superpage_align equal to TRUE, but the submap's size is not a multiple of the
superpage size, then vm_map_find() may not allocate a page table page for the
last megabyte or so of the submap.

There are only a few places where kmem_suballoc() is called with
superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
way to test this hypothesis.

Regards,
Alan
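[Editorial note: a sketch of the test suggested above, assuming the
kmem_suballoc() signature of that era, i.e. kmem_suballoc(parent, &min, &max,
size, superpage_align).  The submap name and TEST_SUBMAP_SIZE are placeholders,
not an actual call site from the tree.]

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_extern.h>
#include <vm/vm_kern.h>
#include <vm/vm_map.h>

#define	TEST_SUBMAP_SIZE	(64 * 1024 * 1024)	/* placeholder size */

static vm_map_t test_submap;

static void
create_test_submap(void)
{
	vm_offset_t minaddr, maxaddr;

	/*
	 * Passing FALSE instead of TRUE for superpage_align rules out the
	 * alignment/rounding interaction as the cause of the missing page
	 * tables described above.
	 */
	test_submap = kmem_suballoc(kernel_map, &minaddr, &maxaddr,
	    TEST_SUBMAP_SIZE, FALSE);
}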
Re: ar(1) format_decimal failure is fatal?
On 09/18/10 03:24, Tim Kientzle wrote:
> On Sep 17, 2010, at 9:01 PM, Benjamin Kaduk wrote:
>
>> On Sun, 29 Aug 2010, Jilles Tjoelker wrote:
>>
>>> On Sat, Aug 28, 2010 at 07:08:34PM -0400, Benjamin Kaduk wrote:
>>>> [...]
>>>> building static egacy library
>>>> ar: fatal: Numeric user ID too large
>>>> *** Error code 70
>>>
>>> This error appears to be coming from
>>> lib/libarchive/archive_write_set_format_ar.c , which seems to only have
>>> provisions for outputting a user ID in AR_uid_size = 6 columns.
>>
>> [...]
>> It looks like this macro was so defined in version 1.1 of that file, with
>> commit message "'ar' format support for libarchive, contributed by Kai
>> Wang.".  This doesn't make it terribly clear whether the 'ar' format
>> mandates this length, or if it is an implementation decision...
>
> There's no official standard for the ar format, only old
> conventions and compatibility with legacy implementations.
>
>>> I wonder if the uid/gid fields are useful at all for ar archives.  Ar
>>> archives are usually not extracted, and when they are, the current
>>> user's values seem good enough.  The uid/gid also prevent exactly
>>> reproducible builds (together with the timestamp).
>>
>> GNU binutils has recently (well, March 2009) added a -D ("deterministic")
>> argument to ar(1) which sets the timestamp, uid, and gid to zero, and the
>> mode to 644.  If that argument is not given, linux's ar(1) happily uses my
>> 8-digit uid as-is; the manual page seems to imply that it will handle 15
>> or 16 digits in that field.
>
> Please send me a small example file...  I don't think I've seen
> this format variant.  Maybe we can extend our ar(1) to support
> this variant.
>
> Personally, I wonder if it wouldn't make sense to just always
> force the timestamp, uid, and gid to zero.  I find it hard
> to believe anyone is using ar(1) as a general-purpose archiving
> tool.  Of course, it should be trivial to add -D support to our ar(1).
>
>> I propose that format_{decimal,octal}() return ARCHIVE_FAILED for negative
>> input, and ARCHIVE_WARN for overflow.  archive_write_ar_header() can then
>> catch ARCHIVE_WARN from the format_foo functions and continue on,
>> propagating the ARCHIVE_WARN return value at the end of its execution ...
>
> This sounds entirely reasonable to me.  I personally don't see much
> advantage to distinguishing negative versus overflow, but certainly
> have no objections to that part.  Definitely ar(1) should not abort on
> a simple ARCHIVE_WARN.
>
>> Would (one of) you be willing to review a patch to that effect?
>
> Happy to do so.

Hi,

I've been using the attached patch for quite some time now.  It basically
replaces the offending gid/uid with nobody's id when necessary.

If I remember correctly, Tim was supposed to add these changes to the
upstream version of libarchive and then import them back into FreeBSD.
Tim, do you remember what happened with those?

Regards,
Steph

Index: usr.bin/tar/write.c
===================================================================
--- usr.bin/tar/write.c	(revision 212556)
+++ usr.bin/tar/write.c	(working copy)
@@ -439,7 +439,30 @@
 {
 	const char *arg;
 	struct archive_entry *entry, *sparse_entry;
+	struct passwd		nobody_pw, *nobody_ppw;
+	struct group		nobody_gr, *nobody_pgr;
+	char			id_buffer[128];
 
+	/*
+	 * Some formats (like ustar) have a limit on the size of the uids/gids
+	 * supported.  Tell libarchive to use the uid/gid of nobody in this
+	 * case instead of failing.
+	 */
+	getpwnam_r("nobody", &nobody_pw, id_buffer, sizeof (id_buffer),
+	    &nobody_ppw);
+	if (nobody_ppw)
+		archive_write_set_nobody_uid(a, nobody_ppw->pw_uid);
+	else
+		bsdtar_warnc(0,
+		    "nobody's uid not found, large uids won't be supported.");
+	getgrnam_r("nobody", &nobody_gr, id_buffer, sizeof (id_buffer),
+	    &nobody_pgr);
+	if (nobody_pgr)
+		archive_write_set_nobody_gid(a, nobody_pgr->gr_gid);
+	else
+		bsdtar_warnc(0,
+		    "nobody's gid not found, large gids won't be supported.");
+
 	/* Allocate a buffer for file data. */
 	if ((bsdtar->buff = malloc(FILEDATABUFLEN)) == NULL)
 		bsdtar_errc(1, 0, "cannot allocate memory");
Index: usr.bin/tar/bsdtar.1
===================================================================
--- usr.bin/tar/bsdtar.1	(revision 212556)
+++ usr.bin/tar/bsdtar.1	(working copy)
@@ -1027,3 +1027,6 @@
 convention can cause hard link information to be lost. (Th
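[Editorial note: the return-value convention Benjamin proposes above
(ARCHIVE_FAILED for negative input, ARCHIVE_WARN for overflow, with
archive_write_ar_header() propagating the warning) might look roughly like
the sketch below.  This is not the committed libarchive code; the field
padding and the overflow marker are invented for illustration.]

#include <stdint.h>
#include <string.h>

#include <archive.h>	/* ARCHIVE_OK / ARCHIVE_WARN / ARCHIVE_FAILED */

/*
 * Sketch: write v as decimal into a fixed-width field of s columns.
 * Negative input is a hard failure; overflow fills the field with '9'
 * and returns a warning the caller can propagate instead of aborting.
 */
static int
format_decimal_sketch(int64_t v, char *p, int s)
{
	int len = s;

	if (v < 0)
		return (ARCHIVE_FAILED);

	/* Fill the field from the right with the decimal digits of v. */
	p += s - 1;
	while (s > 0) {
		*p-- = (char)('0' + (v % 10));
		v /= 10;
		s--;
	}
	if (v == 0)
		return (ARCHIVE_OK);

	/* Overflow: the value did not fit in len columns. */
	memset(p + 1, '9', len);
	return (ARCHIVE_WARN);
}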