Re: zfs + uma

2010-09-21 Thread Andriy Gapon
on 21/09/2010 09:39 Jeff Roberson said the following:
> I'm afraid there is not enough context here for me to know what 'the same
> mechanism' is or what solaris does.  Can you elaborate?

This was in my first post:
[[[
There is this good book:
http://books.google.com/books?id=r_cecYD4AKkC&printsec=frontcover
Please see section 6.2.4.5 on page 225 and table 6-11 on page 226.
And also this code:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#971
]]]
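
To make the reference a bit more concrete: the mechanism boils down to a table that
caps the per-CPU magazine (bucket) size by item size, along the lines of the purely
illustrative sketch below (the structure name and the numbers are invented, not taken
from either OpenSolaris or FreeBSD):

#include <sys/types.h>

/*
 * Illustrative only: item-size -> per-CPU bucket-size cap, in the spirit
 * of the Solaris table referenced above.  The name and numbers are made up.
 */
struct bucket_size_limit {
	size_t	item_size;	/* items up to this size ...                  */
	int	max_items;	/* ... may cache at most this many per bucket */
};

static const struct bucket_size_limit bucket_size_limits[] = {
	{   4 * 1024, 128 },
	{  16 * 1024,  32 },
	{  32 * 1024,   8 },
	{ 128 * 1024,   4 },	/* very large items: tiny per-CPU caches */
};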

> I prefer not to take the weight of specific examples too heavily when
> considering the allocator as it must handle many cases and many types of
> systems.  I believe there are cases where you want large allocations to be
> handled by per-cpu caches, regardless of whether ZFS is one such case.  If ZFS
> does not need them, then it should simply allocate directly from the VM. 
> However, I don't want to introduce some maximum constraint unless it can be
> shown that adequate behavior is not generated from some more adaptable 
> algorithm.

Yes, I agree in general.
But sometimes simplicity has its benefits too, as opposed to the complex dynamic
behavior that _might_ result from adaptive algorithms.

Anyway, I have some early patches that implement the first two of your suggestions,
and I am testing them now.  They look good to me so far.
The parameters in the adaptations would probably need some additional tuning.

-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: zfs + uma

2010-09-21 Thread Andriy Gapon
on 21/09/2010 09:35 Jeff Roberson said the following:
> On Tue, 21 Sep 2010, Andriy Gapon wrote:
> 
>> on 19/09/2010 01:16 Jeff Roberson said the following:
>>> Additionally we could make a last ditch flush mechanism that runs on each 
>>> cpu in
>>
>> How would you qualify a "last ditch" trigger?
>> Would this be called from the "standard" vm_lowmem hook or would there be some
>> extra check for an even more severe memory condition?
> 
> If lowmem does not make enough progress to improve the condition.

Do we have a good way to detect that?
I see that currently vm_lowmem is always invoked with argument value of zero.
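
For reference, this is roughly how a consumer hooks vm_lowmem today; the
VM_LOWMEM_SEVERE flag below is hypothetical and only shows where a "last ditch"
severity could be passed, since the real invocation currently always passes 0:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>

#define	VM_LOWMEM_SEVERE	0x01	/* hypothetical flag, does not exist today */

static eventhandler_tag example_lowmem_tag;

/* vm_lowmem handlers receive (arg, flags); flags is currently always 0. */
static void
example_lowmem(void *arg __unused, int flags)
{
	if (flags & VM_LOWMEM_SEVERE) {
		/* last ditch: flush per-CPU caches, drop clean caches, ... */
	} else {
		/* ordinary low-memory reclaim */
	}
}

static void
example_lowmem_init(void)
{
	example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
	    example_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
}
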
-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: zfs + uma

2010-09-21 Thread Jeff Roberson

On Tue, 21 Sep 2010, Andriy Gapon wrote:


> on 19/09/2010 01:16 Jeff Roberson said the following:
>> Additionally we could make a last ditch flush mechanism that runs on each cpu in
>
> How would you qualify a "last ditch" trigger?
> Would this be called from the "standard" vm_lowmem hook or would there be some extra
> check for an even more severe memory condition?

If lowmem does not make enough progress to improve the condition.

Jeff

>> turn and flushes some or all of the buckets in per-cpu caches.  Presently that is
>> not done due to synchronization issues.  It can't be done from a central place.
>> It could be done with a callout mechanism or a for loop that binds to each core
>> in succession.
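
A rough sketch of the "for loop that binds to each core" variant might look like
this; uma_cache_drain_cpu() is only a placeholder for the actual per-CPU bucket
flushing, not an existing UMA interface:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <sys/smp.h>

/* Placeholder: stands in for whatever would actually free this CPU's buckets. */
static void
uma_cache_drain_cpu(int cpu __unused)
{
}

/*
 * Visit each CPU in turn and drain its private bucket cache while pinned
 * there, so we never race another consumer of that CPU's per-CPU state.
 */
static void
uma_flush_pcpu_caches(void)
{
	int cpu;

	CPU_FOREACH(cpu) {
		thread_lock(curthread);
		sched_bind(curthread, cpu);
		thread_unlock(curthread);

		uma_cache_drain_cpu(cpu);
	}
	thread_lock(curthread);
	sched_unbind(curthread);
	thread_unlock(curthread);
}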


> --
> Andriy Gapon


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: zfs + uma

2010-09-21 Thread Jeff Roberson

On Tue, 21 Sep 2010, Andriy Gapon wrote:


> on 19/09/2010 11:42 Andriy Gapon said the following:
>> on 19/09/2010 11:27 Jeff Roberson said the following:
>>> I don't like this because even with very large buffers you can still have high
>>> enough turnover to require per-cpu caching.  Kip specifically added UMA support
>>> to address this issue in zfs.  If you have allocations which don't require
>>> per-cpu caching and are very large why even use UMA?
>>
>> Good point.
>> Right now I am running with 4 items/bucket limit for items larger than 32KB.
>
> But I also have two counter-points actually :)
> 1. Uniformity.  E.g. you can handle all ZFS I/O buffers via the same mechanism
> regardless of buffer size.
> 2. (Open)Solaris does that for a while and it seems to suit them well.  Not
> saying that they are perfect, or the best, or an example to follow, but still
> that means quite a bit (for me).


I'm afraid there is not enough context here for me to know what 'the same 
mechanism' is or what solaris does.  Can you elaborate?


I prefer not to take the weight of specific examples too heavily when 
considering the allocator as it must handle many cases and many types of 
systems.  I believe there are cases where you want large allocations to be 
handled by per-cpu caches, regardless of whether ZFS is one such case.  If 
ZFS does not need them, then it should simply allocate directly from the 
VM.  However, I don't want to introduce some maximum constraint unless it 
can be shown that adequate behavior is not generated from some more 
adaptable algorithm.


Thanks,
Jeff



> --
> Andriy Gapon


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: zfs + uma

2010-09-21 Thread Alan Cox
On Tue, Sep 21, 2010 at 1:39 AM, Jeff Roberson wrote:

> On Tue, 21 Sep 2010, Andriy Gapon wrote:
>
>> on 19/09/2010 11:42 Andriy Gapon said the following:
>>
>>> on 19/09/2010 11:27 Jeff Roberson said the following:
>>>
>>>> I don't like this because even with very large buffers you can still
>>>> have high
>>>> enough turnover to require per-cpu caching.  Kip specifically added UMA
>>>> support
>>>> to address this issue in zfs.  If you have allocations which don't
>>>> require
>>>> per-cpu caching and are very large why even use UMA?
>>>>
>>>
>>> Good point.
>>> Right now I am running with 4 items/bucket limit for items larger than
>>> 32KB.
>>>
>>
>> But I also have two counter-points actually :)
>> 1. Uniformity.  E.g. you can handle all ZFS I/O buffers via the same
>> mechanism
>> regardless of buffer size.
>> 2. (Open)Solaris does that for a while and it seems to suit them well.
>>  Not
>> saying that they are perfect, or the best, or an example to follow, but
>> still
>> that means quite a bit (for me).
>>
>
> I'm afraid there is not enough context here for me to know what 'the same
> mechanism' is or what solaris does.  Can you elaborate?
>
> I prefer not to take the weight of specific examples too heavily when
> considering the allocator as it must handle many cases and many types of
> systems.  I believe there are cases where you want large allocations to be
> handled by per-cpu caches, regardless of whether ZFS is one such case.  If
> ZFS does not need them, then it should simply allocate directly from the VM.
>  However, I don't want to introduce some maximum constraint unless it can be
> shown that adequate behavior is not generated from some more adaptable
> algorithm.
>
>
Actually, I think that there is a middle ground between "per-cpu caches" and
"directly from the VM" that we are missing.  When I've looked at the default
configuration of ZFS (without the extra UMA zones enabled), there is an
incredible amount of churn on the kmem map caused by the implementation of
uma_large_malloc() and uma_large_free() going directly to the kmem map.  Not
only are the obvious things happening, like allocating and freeing kernel
virtual addresses and underlying physical pages on every call, but also
system-wide TLB shootdowns and sometimes superpage demotions are occurring.

I have some trouble believing that the large allocations being performed by
ZFS really need per-CPU caching, but I can certainly believe that they could
benefit from not going directly to the kmem map on every uma_large_malloc()
and uma_large_free().  In other words, I think it would make a lot of sense
to have a thin layer between UMA and the kmem map that caches allocated but
unused ranges of pages.
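
To make that concrete, a minimal sketch of such a layer could look like the
following; the names are invented, kva_backend_alloc()/kva_backend_free() stand in
for whatever would actually talk to the kmem map, and a real version would of course
also need a lowmem hook to drain the lists:

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <vm/vm.h>

/* Placeholders for whatever actually allocates and frees from the kmem map. */
static vm_offset_t kva_backend_alloc(vm_size_t size);
static void kva_backend_free(vm_offset_t addr, vm_size_t size);

/* The free-list header is stored in the cached range itself. */
struct kva_cached_range {
	SLIST_ENTRY(kva_cached_range) link;
};

#define	KVA_CACHE_DEPTH	8	/* ranges kept per size class */

struct kva_cache {
	struct mtx	lock;	/* mtx_init()ed elsewhere */
	vm_size_t	size;	/* every entry in this cache is this size */
	int		count;
	SLIST_HEAD(, kva_cached_range) free;
};

static vm_offset_t
kva_cache_alloc(struct kva_cache *kc)
{
	struct kva_cached_range *r;

	mtx_lock(&kc->lock);
	r = SLIST_FIRST(&kc->free);
	if (r != NULL) {
		/* Reuse a cached range: no kmem map work, no TLB shootdown. */
		SLIST_REMOVE_HEAD(&kc->free, link);
		kc->count--;
		mtx_unlock(&kc->lock);
		return ((vm_offset_t)r);
	}
	mtx_unlock(&kc->lock);
	return (kva_backend_alloc(kc->size));
}

static void
kva_cache_free(struct kva_cache *kc, vm_offset_t addr)
{
	struct kva_cached_range *r = (struct kva_cached_range *)addr;

	mtx_lock(&kc->lock);
	if (kc->count < KVA_CACHE_DEPTH) {
		/* Keep the range (and its pages) for the next allocation. */
		SLIST_INSERT_HEAD(&kc->free, r, link);
		kc->count++;
		mtx_unlock(&kc->lock);
		return;
	}
	mtx_unlock(&kc->lock);
	kva_backend_free(addr, kc->size);
}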

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Kernel side buffer overflow issue

2010-09-21 Thread Özkan KIRIK
Hi,

I am using the FreeBSD 8.1-STABLE-201008 snapshot.
The system behaves strangely: unexpected and meaningless messages appear on the consoles.
You can download a screenshot from:
http://193.255.128.30/~ryland/syslogd.jpg

Additionally, the default router changes unexpectedly.
I tried FreeBSD 7.1, 7.2, 7.3, and the 8.1-STABLE-201008 snapshot (both
i386 and amd64). All of these versions are affected.
I inspected the logs to check whether someone had logged in or changed the
route (with the "route -n monitor" command).
When the default route changed, there were no messages in the "route
-n monitor" output.
I think there may be a buffer overflow somewhere in the kernel code.
With dummynet enabled, this problem can be seen more frequently.

The problem repeats roughly once every 10 minutes.
I wrote a shell script that monitors the default router.
I saw that sometimes "netstat -rn" shows the default router changed
to 10.0.16.251 or 10.6.10.240, etc.,
which are client IP addresses, but traffic is still routed to the correct
router, 193.X.Y.Z.
After a while, routing really does fail.

You can download the tcpdump capture file from
http://193.255.128.30/~ryland/flowdata_10_0_16_251 .
This file was captured while the default router was changing.
The tcpdump capture belongs to the IP address shown as the default
router (10.0.16.251).

the tcpdump command:

tcpdump -w /home/flowdata_10_0_16_251 -ni bce0.116 host 10.0.16.251
--

dummynet rules are:
3 pipe 3 tcp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to
any dst-port 8000,80,22,25,88,110,443,1720,1863,1521,3389,4489 via em0
// Upload
3 pipe 3 udp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to
any dst-port 53 via em0 // Upload
3 pipe 4 tcp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to
any via em0 // Upload
3 pipe 4 udp from 10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 to
any via em0 // Upload
 LOTS OF NAT RULES HERE (in kernel nat)
6 pipe 1 tcp from any
8000,80,22,25,88,110,443,1720,1863,1521,3389,4489 to
10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 1 udp from any 53 to
10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 2 tcp from any to
10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download
6 pipe 2 udp from any to
10.0.0.0/8,192.168.0.0/16,172.16.0.0/12 via bce0* // Download

/sbin/ipfw pipe 1 config bw 8192Kbit/s mask dst-ip 0x
/sbin/ipfw pipe 3 config bw 1024Kbit/s mask src-ip 0x
/sbin/ipfw pipe 2 config bw 4096Kbit/s mask dst-ip 0x
/sbin/ipfw pipe 4 config bw 1024Kbit/s mask src-ip 0x
--

sysctl vars:
net.inet.ip.fw.dyn_max=65535
net.inet.ip.fw.dyn_ack_lifetime=100
net.inet.ip.fw.dyn_short_lifetime=10
net.inet.ip.fw.one_pass=0
kern.maxfiles=65000
kern.ipc.somaxconn=1024
net.inet.ip.process_options=0
net.inet.ip.fastforwarding=1
net.link.ether.ipfw=1
net.inet.ip.fw.dyn_buckets=65536
kern.maxvnodes=40
net.inet.ip.dummynet.hash_size=256 ( also tried with 8192 )
net.inet.ip.dummynet.pipe_slot_limit=500
net.inet.ip.dummynet.io_fast=1
--

/boot/loader.conf:
autoboot_delay="1"
beastie_disable="YES"
kern.ipc.nmbclusters=98304
vm.kmem_size="2048M"
vm.kmem_size_max="2048M"
splash_bmp_load="YES"
vesa_load="YES"
bitmap_load="YES"
bitmap_name="/boot/splash.bmp"
hw.ata.ata_dma=0
kern.hz="1"
--

kernel config ( additionally to GENERIC ):
device  tap
device  if_bridge
device  vlan
device  carp
options GEOM_BDE
options IPFIREWALL
options IPFIREWALL_VERBOSE
options HZ=4000
options IPFIREWALL_VERBOSE_LIMIT=4000
options IPFIREWALL_FORWARD
options IPFIREWALL_DEFAULT_TO_ACCEPT
options IPFIREWALL_NAT
options DUMMYNET
options IPDIVERT
options IPSTEALTH
options NETGRAPH
options NETGRAPH_IPFW
options LIBALIAS
options NETGRAPH_NAT
options NETGRAPH_PPPOE
options NETGRAPH_SOCKET
options NETGRAPH_ETHER
options DEVICE_POLLING
device  crypto
options IPSEC
--


Some information about the network:
The system has 3 NICs: WAN, LAN, and DMZ.
There are VLANs on the WAN and LAN interfaces.
Throughput is between 20 Mbps and 100 Mbps.


Any ideas?

Regards,
Ozkan KIRIK
Mersin University @ Turkey
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: page table fault, which should map kernel virtual address space

2010-09-21 Thread Alan Cox
On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus  wrote:

>
> Hello,
>
> this is about the 'NKPT' definition, 'kernel_map' submaps,
> and the 'vm_map_findspace' function.
>
> The variable 'kernel_map' is used to manage the kernel virtual address
> space. When the 'vm_map_findspace' function deals with 'kernel_map',
> the 'pmap_growkernel' function is called.
>
> At least on the 'i386' architecture, the pmap implementation uses
> the 'pmap_growkernel' function to allocate missing page tables.
> Missing page tables are a problem, because no one checks
> the 'pte' pointer for validity after use of the 'vtopte' macro.
>
> The 'NKPT' definition sets the number of page tables preallocated
> during system boot.
>
> Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
> pager_map, ...) exist as a result of 'kmem_suballoc' function calls.
> When these submaps are used (for example by the 'kmem_alloc_nofault'
> function) and their virtual address subspace is at the end of the
> kernel virtual address space used so far (and above the 'NKPT'
> preallocation), then the missing page tables are not allocated
> and a double fault can happen.
>
>
No, the page tables are allocated.  If you create a submap X of the kernel
map using kmem_suballoc(), then a vm_map_findspace() is performed by
vm_map_find() on the kernel map to find space for the submap X.  As you note
above, the call to vm_map_findspace() on the kernel map will call
pmap_growkernel() if needed to extend the kernel page table.

If you create another submap X' of X, then that submap X' can only map
addresses that fall within the range for X.  So, any necessary page table
pages were allocated when X was created.

That said, there may actually be a problem with the implementation of the
superpage_align parameter to kmem_suballoc().  If a submap is created with
superpage_align equal to TRUE, but the submap's size is not a multiple of
the superpage size, then vm_map_find() may not allocate a page table page
for the last megabyte or so of the submap.

There are only a few places where kmem_suballoc() is called with
superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
way to test this hypothesis.
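
Concretely, testing that would amount to a one-argument change in those callers; the
sketch below is illustrative only, with placeholder map and size names rather than
one of the actual kmem_suballoc() callers in the tree:

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_extern.h>
#include <vm/vm_kern.h>

static vm_map_t example_map;
static vm_offset_t example_min, example_max;

static void
example_create_submap(vm_size_t size)
{
	/*
	 * The last argument is superpage_align; the few existing callers
	 * pass TRUE, and passing FALSE here is the suggested experiment to
	 * see whether the unallocated page table pages at the end of the
	 * submap are the culprit.
	 */
	example_map = kmem_suballoc(kernel_map, &example_min, &example_max,
	    size, FALSE);
}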

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: ar(1) format_decimal failure is fatal?

2010-09-21 Thread Stephane E. Potvin
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 09/18/10 03:24, Tim Kientzle wrote:
> 
> On Sep 17, 2010, at 9:01 PM, Benjamin Kaduk wrote:
> 
>> On Sun, 29 Aug 2010, Jilles Tjoelker wrote:
>>
>>> On Sat, Aug 28, 2010 at 07:08:34PM -0400, Benjamin Kaduk wrote:
 [...]
 building static egacy library
 ar: fatal: Numeric user ID too large
 *** Error code 70
>>>
 This error appears to be coming from
 lib/libarchive/archive_write_set_format_ar.c , which seems to only have
 provisions for outputting a user ID in AR_uid_size = 6 columns.
>> [...]
 It looks like this macro was so defined in version 1.1 of that file, with
 commit message "'ar' format support for libarchive, contributed by Kai
 Wang.".  This doesn't make it terribly clear whether the 'ar' format
 mandates this length, or if it is an implementation decision...
> 
> There's no official standard for the ar format, only old
> conventions and compatibility with legacy implementations.
> 
>>> I wonder if the uid/gid fields are useful at all for ar archives. Ar
>>> archives are usually not extracted, and when they are, the current
>>> user's values seem good enough. The uid/gid also prevent exactly
>>> reproducible builds (together with the timestamp).
>>
>> GNU binutils has recently (well, March 2009) added a -D ("deterministic") 
>> argument to ar(1) which sets the timestamp, uid, and gid to zero, and the 
>> mode to 644.  If that argument is not given, linux's ar(1) happily uses my 
>> 8-digit uid as-is; the manual page seems to imply that it will handle 15 or 
>> 16 digits in that field.
> 
> Please send me a small example file...  I don't think I've seen
> this format variant.  Maybe we can extend our ar(1) to support
> this variant.
> 
> Personally, I wonder if it wouldn't make sense to just always
> force the timestamp, uid, and gid to zero.  I find it hard
> to believe anyone is using ar(1) as a general-purpose archiving
> tool.  Of course, it should be trivial to add -D support to our ar(1).
> 
>> I propose that format_{decimal,octal}() return ARCHIVE_FAILED for negative 
>> input, and ARCHIVE_WARN for overflow.  archive_write_ar_header() can then 
>> catch ARCHIVE_WARN from the format_foo functions and continue on, 
>> propagating the ARCHIVE_WARN return value at the end of its execution ...
> 
> This sounds entirely reasonable to me.  I personally don't see much
> advantage to distinguishing negative versus overflow, but certainly
> have no objections to that part.  Definitely ar(1) should not abort on
> a simple ARCHIVE_WARN.
> 
>> Would (one of) you be willing to review a patch to that effect?
> 
> Happy to do so. 
> 
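
For reference, the return-code behavior proposed above would look roughly like the
sketch below (illustrative only, using the error constants from archive.h; this is
not the change that was eventually committed to libarchive):

#include <stdint.h>
#include <string.h>

#define	ARCHIVE_OK	  0
#define	ARCHIVE_WARN	(-20)	/* partial success, keep going */
#define	ARCHIVE_FAILED	(-25)	/* this entry cannot be written */

static int
format_decimal(int64_t v, char *p, int s)
{
	int len = s;

	if (v < 0)
		return (ARCHIVE_FAILED);	/* nonsensical input */
	p += s;
	while (s-- > 0) {
		*--p = (char)('0' + (v % 10));
		v /= 10;
	}
	if (v == 0)
		return (ARCHIVE_OK);
	/* Overflow: blank the field and let the caller warn and continue. */
	memset(p, ' ', len);
	return (ARCHIVE_WARN);
}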

Hi,

I've been using the attached patch for quite some time now. It basically
replaces the offending uid/gid with nobody's ids when necessary.

If I remember correctly, Tim was supposed to add these changes to the upstream
version of libarchive and then import them back into FreeBSD. Tim, do you
remember what happened with those?

Regards,

Steph

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.16 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkyZI38ACgkQmdOXtTCX/nt2WwCgqvd4GIyE5zRvL5kkHCWTGoAA
yA0AoJ/8Dx2QrLXAJHkOrd1YqW+QR03h
=KxCW
-END PGP SIGNATURE-
Index: usr.bin/tar/write.c
===
--- usr.bin/tar/write.c (revision 212556)
+++ usr.bin/tar/write.c (working copy)
@@ -439,7 +439,30 @@
 {
const char *arg;
struct archive_entry *entry, *sparse_entry;
+   struct passwd nobody_pw, *nobody_ppw;
+   struct group nobody_gr, *nobody_pgr;
+   char id_buffer[128];
 
+   /*
+* Some formats (like ustar) have a limit on the size of the uids/gids
+* supported. Tell libarchive to use the uid/gid of nobody in this case
+* instead of failing.
+*/
+   getpwnam_r("nobody", &nobody_pw, id_buffer, sizeof (id_buffer),
+   &nobody_ppw);
+   if (nobody_ppw)
+   archive_write_set_nobody_uid(a, nobody_ppw->pw_uid);
+   else
+   bsdtar_warnc(0,
+   "nobody's uid not found, large uids won't be supported.");
+   getgrnam_r("nobody", &nobody_gr, id_buffer, sizeof (id_buffer),
+   &nobody_pgr);
+   if (nobody_pgr)
+   archive_write_set_nobody_gid(a, nobody_pgr->gr_gid);
+   else
+   bsdtar_warnc(0,
+   "nobody's gid not found, large gids won't be supported.");
+
/* Allocate a buffer for file data. */
if ((bsdtar->buff = malloc(FILEDATABUFLEN)) == NULL)
bsdtar_errc(1, 0, "cannot allocate memory");
Index: usr.bin/tar/bsdtar.1
===
--- usr.bin/tar/bsdtar.1 (revision 212556)
+++ usr.bin/tar/bsdtar.1 (working copy)
@@ -1027,3 +1027,6 @@
 convention can cause hard link information to be lost.
 (Th