[PATCH] perf: remove duplicate block from Makefile
This looks like a merge error, the code is duplicated with the first copy doing something else as well. Just remove the second block.

Signed-off-by: Ulrich Drepper

 Makefile | 8 --------
 1 file changed, 8 deletions(-)

Index: perf/config/Makefile
===
--- perf.orig/config/Makefile
+++ perf/config/Makefile
@@ -200,14 +200,6 @@ endif # NO_DWARF
 
 endif # NO_LIBELF
 
-ifndef NO_LIBELF
-CFLAGS += -DLIBELF_SUPPORT
-FLAGS_LIBELF=$(CFLAGS) $(LDFLAGS) $(EXTLIBS)
-ifeq ($(call try-cc,$(SOURCE_ELF_MMAP),$(FLAGS_LIBELF),-DLIBELF_MMAP),y)
-  CFLAGS += -DLIBELF_MMAP
-endif # try-cc
-endif # NO_LIBELF
-
 # There's only x86 (both 32 and 64) support for CFI unwind so far
 ifneq ($(ARCH),x86)
   NO_LIBUNWIND := 1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] apparently broken RLIMIT_CORE
On Sun, Oct 6, 2013 at 4:42 PM, Linus Torvalds wrote:
> I doubt it is intentional, but I also cannot really feel that we care
> deeply. Afaik we don't really honor the size limit exactly anyway, ie
> we tend to check only at page boundaries etc. So do we really care?

I could imagine in the case Al brought up (a pipe as core file filter) we might want to have some assurance the limits are not breached. If it doesn't cost that much I'd say implement it precisely.
Re: sendfile and EAGAIN
On Mon, Feb 25, 2013 at 2:22 PM, Eric Dumazet wrote:
> I don't understand the issue.
>
> sendfile() returns -EAGAIN only if no bytes were copied to the socket.

There is something wrong/unexpected/... I have a program which can use either sendfile or send. When using sendfile to transmit a large block (I've seen it with 900k) the sendfile call does not transmit everything. The receiver gets only about 600k. This is the situation when I think I've seen EAGAIN errors from sendfile but I cannot just now reproduce it.

This is with sockets of AF_UNIX type. Are there any limits to take into account?
Re: sendfile and EAGAIN
On Sat, Mar 2, 2013 at 10:09 PM, Eric Dumazet wrote:
>
> Using non blocking IO means the sender (and the receiver) must be able
> to perform several operations, as long as the whole transfert is not
> finished.

Certainly, and this is implemented. But the receiver never gets the rest of the data while the sender (most of the time) gets notified that everything is sent.

I don't have a reduced test case yet. Hopefully I'll get to it sometime soon. For now I worked around it by not using sendfile.
Re: [PATCH 2/2] perf: use XSI-compliant version of strerror_r() instead of GNU-specific
On Mon, Jul 23, 2012 at 11:00 AM, Kirill A. Shutemov wrote:
> The right way to fix it is to switch to XSI-compliant version.

And why exactly would this be "the right way"? Just fix the use of strerror_r or use strerror_l.
Re: [PATCH 2/2] perf: use XSI-compliant version of strerror_r() instead of GNU-specific
On Mon, Jul 23, 2012 at 4:31 PM, Kirill A. Shutemov wrote:
> + const char *err = strerror_r(errnum, buf, buflen);
> +
> + if (err != buf && buflen > 0) {
> +         size_t len = strlen(err);
> +         char *c = mempcpy(buf, err, min(buflen - 1, len));
> +         *c = '\0';
> + }

No need to check for err == NULL.

buflen == 0 is a possibility given the interface but I'd say this is an error and should be tested for at the beginning of the function and the call should fail or even abort the program.
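A minimal sketch of the suggestion above, with the buflen == 0 check moved to the top of the function. The wrapper name errnum_to_string is hypothetical and this is not the code that went into perf. To build regardless of which strerror_r() variant the headers select, the sketch tests glibc's internal __USE_GNU macro, which is an assumption about the C library in use.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: always leaves a NUL-terminated message in buf.
   Returns 0 on success, -1 if the caller passed an empty buffer. */
static int errnum_to_string(int errnum, char *buf, size_t buflen)
{
    if (buflen == 0)
        return -1;                   /* caller error, rejected up front */

#if defined(__GLIBC__) && defined(__USE_GNU)
    /* GNU variant: returns the message, which may or may not be in buf. */
    const char *err = strerror_r(errnum, buf, buflen);
    if (err != buf) {
        size_t len = strlen(err);
        if (len > buflen - 1)
            len = buflen - 1;        /* truncate to fit, keep room for NUL */
        memcpy(buf, err, len);
        buf[len] = '\0';
    }
#else
    /* XSI variant: returns 0 on success and fills buf itself. */
    if (strerror_r(errnum, buf, buflen) != 0)
        snprintf(buf, buflen, "Unknown error %d", errnum);
#endif
    return 0;
}
```

Whether an empty buffer should fail quietly, as here, or abort the program is the design choice left open in the message above.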
Re: [PATCH 2/2] perf: use XSI-compliant version of strerror_r() instead of GNU-specific
On Mon, Jul 23, 2012 at 5:06 PM, Kirill A. Shutemov wrote:
> They are bugs.
>
> Let's fix strerror_r() usage.
>
> Signed-off-by: Kirill A. Shutemov

Acked-by: Ulrich Drepper
Re: [RFC 0/4] perf tool: Adding ratios support
On Tue, Jan 15, 2013 at 8:39 AM, Jiri Olsa wrote:
> $ perf stat -f formula.conf:cpi kill
> usage: kill [ -s signal | -p ] [ -a ] pid ...
> kill -l [ signal ]

I do like this proposal. The only comment I have is that perhaps the command line syntax isn't ideal. What you use above is tied to the ratios being defined in the config file. I would imagine that at least over time (for some ratios probably right away) they become available by default and don't require a config file. Also, users might want to put individualized ratio definitions in a config file which is read by default.

How about the formulas becoming available whenever the config file is read? Maybe this means a few more keywords in the config file (ratio, ratio-set, ...). E.g.:

ratio-set branch {
  events = {instructions,branch-instructions,branch-misses}:u

  ratio branch-rate {
    formula = branch-instructions / instructions
    desc = branch rate
  }
  ratio branch-miss-rate {
    formula = branch-misses / instructions
    desc = branch misprediction rate
  }
  ratio branch-miss-ratio {
    formula = branch-misses / branch-instructions
    desc = branch misprediction ratio
  }
}

You get the idea. Maybe substitute "ratio" with "formula".

Then allow such a ratio/formula to be used just like a normal event, perhaps with a special suffix/prefix to designate it. This should then also mark the events as part of a group so that the underlying counters are scheduled in together.
Re: [RFC 0/4] perf tool: Adding ratios support
On Wed, Jan 16, 2013 at 9:25 AM, Jiri Olsa wrote:
> I was thinking having config files (global and arch specific)
> comming with perf having predefined formulas.

All the more reason to not mention the file name or really any source for the definition of the formula in the name.

> 1) -e 'ratio/branch-rate/' # special event class
> 2) -e 'ratio-branch-rate' # 'ratio-' prefix
> 3) -e cpu/branch-rate/ # handled like aliases, ratio name would need to
> be unique
> ... ?

I think 3 is the most extensible. Perhaps use the syntax used in other places. We have these :u suffixes etc. Perhaps have :r or :R or whatever.

Given the other comments, we might want to avoid "ratio" right away. If the mechanism is generalized it could be used to express "counter1 - counter2" for events which cannot be expressed with a single counter but are not really ratios.
Re: struct stat{st_blksize} for /dev entries in 2.4.3
Russell Coker <[EMAIL PROTECTED]> writes:

> diff -ru textutils-2.0/src/cat.c textutils-new/src/cat.c
> --- textutils-2.0/src/cat.c Sun Apr 8 22:55:10 2001
> +++ textutils-new/src/cat.c Sun Apr 8 23:23:54 2001
> @@ -790,6 +790,9 @@
>    if (options == 0)
>      {
>        insize = max (insize, outsize);
> +#ifdef _SC_PHYS_PAGES
> +      insize = max (insize, sysconf(_SC_PAGESIZE));
> +#endif
>        inbuf = (unsigned char *) xmalloc (insize);
>
>        simple_cat (inbuf, insize);

The #ifdef is certainly wrong. And there is no guarantee that any of the _SC_* constants are defined as macros. You'll have to add a configure test.

-- 
Ulrich Drepper, Red Hat
1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA
drepper at redhat.com
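One way the configure-test approach could look in the C source is sketched below. HAVE_SC_PAGESIZE and min_buffer_size are hypothetical names, not something textutils defines; the macro stands for the result of a configure check that sysconf(_SC_PAGESIZE) compiles, and it is hard-wired here only so the sketch builds on its own.

```c
#include <stddef.h>
#include <unistd.h>

/* Assumption: a configure test would define HAVE_SC_PAGESIZE.  Testing
   the _SC_* name itself with #ifdef is unreliable because the constant
   may be an enumerator rather than a macro. */
#ifndef HAVE_SC_PAGESIZE
# define HAVE_SC_PAGESIZE 1
#endif

/* Hypothetical helper: return insize, raised to the page size if the
   page size is known and larger. */
static size_t min_buffer_size(size_t insize)
{
#if HAVE_SC_PAGESIZE
    long page = sysconf(_SC_PAGESIZE);

    /* sysconf() returns -1 on failure, so check before converting. */
    if (page > 0 && (size_t) page > insize)
        return (size_t) page;
#endif
    return insize;
}
```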
Re: List of all-zero .data variables in linux-2.4.3 available
"Adam J. Richter" <[EMAIL PROTECTED]> writes:

> >Shouldn't a compiler be able to deal with this instead?
>
> Yes.

No. gcc must not do this. There are situations where you must place a zero-initialized variable in .data. It is a programmer problem.
Re: List of all-zero .data variables in linux-2.4.3 available
"Adam J. Richter" <[EMAIL PROTECTED]> writes:

> I am aware of a couple of cases where code relied on static
> variables being allocated contiguously, but, in both cases, those
> variables were either all zeros or all non-zeros, so my proposed
> change would not break such code.

Contiguous placement is not the only property defined by initialization. There are many more. You cannot change this since it will break quite a few programs and libraries in subtle and hard to impossible to identify ways. Simply educate programmers to not initialize.
Re: PATCH(?): linux-2.4.4-pre2: fork should run child first
Linus Torvalds <[EMAIL PROTECTED]> writes:

> spawn() is trivial to implement if you want to. I don't think it's all
> that much more interesting than vfork()+execve(), though.

spawn() (actually posix_spawn) is currently implemented in the libc. If anybody for whatever reason thinks it is necessary to implement this in the kernel look at the interface. It is really only interesting for systems with limited VMs but it would be trivial to add another flag which allows different scheduling characteristics which some people apparently want.
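For readers who have not used the interface, a minimal sketch of a posix_spawn-based launcher follows. The helper name spawn_and_wait is made up for illustration; it uses posix_spawnp() so that argv[0] is looked up via PATH, and passes no file actions or attributes.

```c
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: start argv[0] (searched via PATH), wait for it,
   and return its exit status, or -1 on failure or abnormal exit. */
static int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    int status;

    /* NULL file actions and attributes: the child inherits everything. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Scheduling parameters of the kind mentioned above would be requested through the posix_spawnattr_t argument, which is left NULL in this sketch.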
Re: light weight user level semaphores
Linus Torvalds <[EMAIL PROTECTED]> writes:

Sounds good so far. Some comments.

> - FS_create is responsible for allocating a shared memory region
>   at "FS_create()" time.

This is not so great. The POSIX shared semaphores require that a pthread_mutex_t object placed in a shared memory region can be initialized to work across process boundaries. I.e., the FS_create function would actually be FS_init. There is no problem with the kernel or the helper code at user level allocating more storage (for the waitlist or whatever) but it must not be necessary for the user to know about them and place them in shared memory themselves.

The situation for non-shared (i.e. intra-process) semaphores is easier. What I didn't understand is your remark about fork. The semaphores should be cloned. Unless the shared flag is set there should be no sharing among processes.

The rest seems OK. Thanks,
Re: light weight user level semaphores
Linus Torvalds <[EMAIL PROTECTED]> writes:

> Looks good to me. Anybody want to try this out and test some benchmarks?

I fail to see how this works across processes. How can you generate a file descriptor for this pipe in a second process which simply shares some memory with the first one? The first process is passive: no file descriptor passing must be necessary.

How these things are working elsewhere is that a memory address (probably a physical address) is used as a token. The semaphore object is placed in the memory shared by the processes and the virtual address is passed in the syscall.

Note that semaphores need not always be shared between processes. This is a property the user has to choose. So the implementation can be easier in the normal intra-process case.

In any case all kinds of user-level operations are possible as well and all the schemes suggested for dealing with the common case without syscalls can be applied here as well.
Re: light weight user level semaphores
Alan Cox <[EMAIL PROTECTED]> writes:

> > can libraries use fast semaphores behind the back of the user? They might
> > well want to use the semaphores exactly for things like memory allocator
> > locking etc. But libc certainly cant use fd's behind peoples backs.
>
> libc is entitled to, and most definitely does exactly that. Take a look at
> things like gethostent, getpwent etc etc.

You are mixing two completely different things.

Functions like gethostent() and catopen() are explicitly allowed to be implemented using file descriptors. If this is allowed the standard contains appropriate wording.

Other functions like setlocale() do use file descriptors, yes, but these are not kept. Before the function returns they are closed. This can cause disruptions in other threads which find descriptors not allocated sequentially but this has to be taken into account. Rules for multi-threaded applications are different. A single-threaded application will not see such a difference.

Now, the standards do not allow POSIX mutexes to be implemented using file descriptors. The same is true for unnamed POSIX semaphores. So Linus is right, though for a different reason than he thought.

The situation is a bit different for named POSIX semaphores. These can be implemented using file descriptors. But they don't have to and IMO they shouldn't. A memory reference based semaphore implementation would allow a named semaphore to be implemented using

  fd = open (name)
  addr = mmap (..fd..)
  close (fd)
  sem_syscall (addr)

i.e., it can be mapped to a memory reference again.
Re: light weight user level semaphores
Alan Cox <[EMAIL PROTECTED]> writes:

> mknod foo p. Or use sockets (although AF_UNIX sockets are higher latency)
> Thats why I suggested using flock - its name based. Whether you mkstemp()
> stuff and pass it around isnt something I care about
>
> Files give you permissions for free too

I don't want nor need file permissions. A program would look like this:

process 1:

  fd = open("somefile")
  addr = mmap(fd);

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
  pthread_mutex_init ((pthread_mutex_t *) addr, &attr);

  pthread_mutex_lock ((pthread_mutex_t *) addr);

  pthread_mutex_destroy((pthread_mutex_t *) addr);

process 2:

  fd = open("somefile")
  addr = mmap(fd);

  pthread_mutex_lock ((pthread_mutex_t *) addr);

The shared mem segment can be retrieved in whatever way. The mutex in this case is anonymous. Everybody who has access to the shared mem *must* have access to the mutex.

For semaphores it looks similar. First the anonymous case:

process 1:

  fd = open("somefile")
  addr = mmap(fd);

  sem_init ((sem_t *) addr, 1, 10);   // 10 is arbitrary

  sem_wait ((sem_t *) addr);

  sem_destroy((sem_t *) addr);

process 2:

  fd = open("somefile")
  addr = mmap(fd);

  sem_wait ((sem_t *) addr);

Note that POSIX semaphores could be implemented with global POSIX mutexes.

Finally, named semaphores:

  semp = sem_open("somefile", O_CREAT|O_EXCL, 0600)

  sem_wait (semp);

  sem_close(semp);

  sem_unlink("somefile");

This is the only semaphore kind which maps nicely to a pipe or socket. All the others don't. And even for named semaphores it is best to have a separate name space like the shmfs.

> So you have unix file permissions on them ?

See above. Permissions are only allowed for named semaphores.
Re: light weight user level semaphores
Alan Cox <[EMAIL PROTECTED]> writes:

> > I don't want nor need file permissions. A program would look like this:
>
> Your example opens/mmaps so has file permissions. Which is what I was asking

There are no permissions on the mutex object. It is the shared memory which counts. If you would implement the global mutexes as independent objects in the filesystem hierarchy you would have to somehow magically make the permissions match those of the object containing the memory representation of the global semaphore.

  fd = open("somefile", O_CREAT|O_TRUNC, 0666)
  addr = mmap(fd)

  // assume attr is for a global mutex
  pthread_mutex_init((pthread_mutex_t*)addr, &attr)

  fchmod(fd, 0600)

  fchown(fd, someuser, somegroup)

If pthread_mutex_attr() is allocating some kind of file, how do you determine the permissions? How are they changed if the permissions to the file change?

The kernel representation of the mutex must not be disassociated from the shared memory region.

Even if you all think very little about Solaris, look at the kernel interface for semaphores.
Re: light weight user level semaphores
Ingo Oeser <[EMAIL PROTECTED]> writes:

> Are you sure, you can implement SMP-safe, atomic operations (which you need
> for all up()/down() in user space) WITHOUT using privileged
> instructions on ALL archs Linux supports?

Which processors have no such instructions but are SMP-capable?

> How do we do this on nccNUMA machines later? How on clusters[1]?

Clusters are not my problem. They require additional software. And NUMA machines may require a certain sequence in which the operations must be performed and the hardware should take care of the rest.

I don't really care what the final implementation will be like. For UP and SMP machines I definitely want to have as much as possible at user-level. If you need a special libpthread for NUMA machines, so be it.
Re: light weight user level semaphores
Linus Torvalds <[EMAIL PROTECTED]> writes:

> > I fail to see how this works across processes.
>
> It's up to FS_create() to create whatever shared mapping is needed.

No, the point is that FS_create is *not* the one creating the shared mapping. The user is explicitly doing this her/himself.
Re: light weight user level semaphores
Linus Torvalds <[EMAIL PROTECTED]> writes:

> I'm not interested in re-creating the idiocies of Sys IPC.

I'm not talking about sysv semaphores (couldn't care less). And you haven't read any of the mails with examples I sent.

If the new interface can be useful for anything it must allow implementing process-shared POSIX mutexes. The user-level representation of these mutexes is a simple variable which in the case of inter-process mutexes is placed in shared memory. These variables must be usable with the normal pthread_mutex_lock() functions and perform whatever is needed.

Whether the pthread_mutex_init() function for shared mutexes is doing a lot more work and allocates even more memory, I don't care. The standard certainly permits this and every pthread_mutex_init() must have a pthread_mutex_destroy() which allows allocating and freeing resources (no file descriptor, though).

So, yes, your FS_create syscall can allocate something. But the question is what handle to put in the pthread_mutex_t variable so the different processes can use the mutex. It cannot be a file descriptor since it's not shared between processes. It cannot be a pointer to some other place in the virtual memory since the place pointed to might not be (and probably isn't if FS_create is allocating something in the process setting up the mutex). You could put some magic cookie in the pthread_mutex_t object the kernel can then use.

So, instead of repeating over and over again the same old story, fill in the gaps here:

int
pthread_mutex_init (pthread_mutex_t *mutex,
                    const pthread_mutexattr_t *mutex_attr)
{
  if (mutex_attr != NULL && mutex_attr->__pshared != 0)
    {
      ... FILL IN HERE ...
    }
  else
    ...intra-process mutex, uninteresting here...
}

int
pthread_mutex_lock (pthread_mutex_t *mutex)
{
  if (mutex_attr != NULL && mutex_attr->__pshared != 0)
    {
      ... FILL IN HERE ...
    }
  else
    ...intra-process mutex, uninteresting here...
}

int
pthread_mutex_destroy (pthread_mutex_t *mutex)
{
  if (mutex_attr != NULL && mutex_attr->__pshared != 0)
    {
      ... FILL IN HERE ...
    }
  else
    ...intra-process mutex, uninteresting here...
}

These functions must work with something like this:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cons.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int
main (int argc, char *argv[])
{
  char tmpl[] = "/tmp/fooXXXXXX";   /* mkstemp needs six trailing X's */
  int fd = mkstemp (tmpl);
  pthread_mutexattr_t attr;
  pthread_mutex_t *m1;
  pthread_mutex_t *m2;
  void *addr;
  volatile int *i;

  pthread_mutexattr_init (&attr);
  pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);

  /* Room for two mutexes followed by the shared counter.  */
  ftruncate (fd, 2 * sizeof (*m1) + sizeof (int));
  addr = mmap (NULL, 2 * sizeof (*m1) + sizeof (int),
               PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  m1 = addr;
  m2 = m1 + 1;
  i = (int *) (m2 + 1);
  *i = 0;

  pthread_mutex_init (m1, &attr);
  pthread_mutex_lock (m1);
  pthread_mutex_init (m2, &attr);
  pthread_mutex_lock (m2);

  if (fork () == 0)
    {
      /* The producer inherits fd and maps the same file.  */
      char buf[10];
      snprintf (buf, sizeof buf, "%d", fd);
      execl ("./prod", "prod", buf, NULL);
    }

  while (1)
    {
      pthread_mutex_lock (m1);
      printf ("*i = %d\n", *i);
      pthread_mutex_unlock (m2);
    }

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ prod.c ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int
main (int argc, char *argv[])
{
  int fd = atoi (argv[1]);
  void *addr;
  pthread_mutex_t *m1;
  pthread_mutex_t *m2;
  volatile int *i;

  addr = mmap (NULL, 2 * sizeof (*m1) + sizeof (int),
               PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  m1 = addr;
  m2 = m1 + 1;
  i = (int *) (m2 + 1);

  while (1)
    {
      ++*i;
      pthread_mutex_unlock (m1);
      pthread_mutex_lock (m2);
    }

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Re: [linux-lvm] 2.4.3-ac{6,7} LVM hang
Jens Axboe <[EMAIL PROTECTED]> writes:

> Does attached patch fix it?

Yes.
Re: light weight user level semaphores
Alexander Viro <[EMAIL PROTECTED]> writes:

> > If the new interface can be useful for anything it must allow to
> > implement process-shared POSIX mutexes.
>
> Pardon me the bluntness, but... Why?

Because otherwise there is no reason to even waste a second with this. At least for me and everybody else who has interest in portable solutions.

I don't care how it's implemented. Look at the code example I posted. If you can provide an implementation which can implement anonymous inter-process mutexes then ring again. Until then I'll wait. If you implement something else I couldn't care less since it's useless for me.
Re: BUG: Global FPU corruption in 2.2
"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> If it "fixes" it, there is no problem with the FPU, but with the
> 'C' runtime library which doesn't initialize the FPU to a known
> state before it uses it.

It's the kernel which initializes the FPU. This was always the case and necessary to implement the fast lazy FPU saving/restoring. Processes which never use the FPU never initialize it.
Re: BUG: Global FPU corruption in 2.2
"Richard B. Johnson" <[EMAIL PROTECTED]> writes:

> The kernel doesn't know if a process is going to use the FPU when
> a new process is created. Only the user's code, i.e., the 'C' runtime
> library knows.

Maybe you should try to understand the kernel code and the features of the processor first. The kernel can detect when the FPU is used for the first time.
Re: RFC: changing precision control setting in initial FPU context
[EMAIL PROTECTED] (Kevin Buhr) writes:

> > You want peoples existing applications to suddenely and magically change
> > their results. Umm problem.
>
> So, how would you feel about a mechanism whereby the kernel could be
> passed a default FPU control word by the binary (with old binaries, by
> default,

There will be no change whatsoever with me. The existing ABI is fixed. If you want your programs to behave differently set the mode appropriately. I have not the slightest interest in seeing applications (including the libc) being broken just because of this stupid idea.

No kernel and no libc modifications necessary. This is the end of the story as far as I'm concerned.
Re: [PATCH] new bug report script
Matthias Juchem <[EMAIL PROTECTED]> writes:

> +# c library 5
> +if ( -e "/lib/libc.so.5" ) {
> +  ( $v_libc5 = `/lib/libc.so.5`) =~ m/GNU C Library .+ version (\S+),/;
> +  $v_libc5 = $1;
> +} else {
> +  $v_libc5 = "not found";
> +}

This is wrong. You cannot execute libc.so.5. This only works with glibc.
Re: [PATCH] new bug report script
Matthias Juchem <[EMAIL PROTECTED]> writes:

> Or is the file name scheme reliable (/lib/libc.so.5.x.y)?

Yes, since this was how HJ named the releases. You have to find out which version is actually used (there might be several .so files there).
Re: [PATCH] Re: [PATCH] new bug report script
Matthias Juchem <[EMAIL PROTECTED]> writes:

> # c library 5
> -if ( -e "/lib/libc.so.5" ) {
> -  ( $v_libc5 = `/lib/libc.so.5`) =~ m/GNU C Library .+ version (\S+),/;
> -  $v_libc5 = $1;
> -} else {
> -  $v_libc5 = "not found";
> +opendir LIBDIR, "/lib" or die "/lib/ not found, very strange";
> +my @allfiles = readdir LIBDIR;
> +closedir LIBDIR;
> +$v_libc5 = 'not found';
> +foreach (sort @allfiles) {
> +  m/libc.so.(5\S+)/ and $v_libc5 = $1;
> }
> +closedir LIBDIR;

This won't work everywhere either. Red Hat systems (maybe others) have libc5 out of the way in a separate subdir. Your best bet is to use ldconfig:

  /sbin/ldconfig -p|grep libc.so.5

which produces something like

  libc.so.5 (libc5) => /usr/i486-linux-libc5/lib/libc.so.5

and then look in that directory (/usr/i486-linux-libc5/lib).
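The script under review is Perl, so the following is only a sketch, written in C for consistency with the other examples here, of the parsing step described above: take one line of `ldconfig -p` output, pull out the path after "=> ", and derive the directory to search. Both function names are hypothetical, and the caller is assumed to have stripped the trailing newline.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: given one line of `ldconfig -p` output, return a
   pointer to the path that follows "=> ", or NULL if there is none. */
static const char *ldconfig_line_path(const char *line)
{
    const char *arrow = strstr(line, "=> ");
    return arrow != NULL ? arrow + 3 : NULL;
}

/* Hypothetical helper: copy the directory part of `path` into `dir`.
   Returns 0 on success, -1 if there is no slash or `dir` is too small. */
static int path_directory(const char *path, char *dir, size_t dirlen)
{
    const char *slash = strrchr(path, '/');
    if (slash == NULL)
        return -1;

    size_t len = (size_t) (slash - path);
    if (len == 0)
        len = 1;                     /* file directly under "/" */
    if (len + 1 > dirlen)
        return -1;

    memcpy(dir, path, len);
    dir[len] = '\0';
    return 0;
}
```

With the sample line from the message, the directory comes out as /usr/i486-linux-libc5/lib, which is where the versioned libc.so.5.x.y file would then be looked up.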
Re: [PATCH] new bug report script
Keith Owens <[EMAIL PROTECTED]> writes:

> 5)
>     # glibc versions. Take the last symbolic link,
>     # extract the version number from the file it points to.
>     if [ `expr "X$1" : 'Xl'` -eq 2 ]
>     then
>         while [ "X$2" != "X->" ]
>         do
>             shift
>         done
>         version2=`echo "$3" | tr -cd '[.0-9]' | \
>             sed -e 's/\.\.*/./g' |
>             sed -e 's/^\.//g' |
>             sed -e 's/\.$//g'`
>     fi
>     ;;

Why don't you, as the other script suggested, execute libc.so.6? Symlinks can be missing or can be wrong.
thread group comments
I hoped somebody else would write something about Linus' test8-pre1 thread group changes but I haven't seen anything. Now you have to bear with me even though I'm incompetent[1]. I took a look at the code and thought about the changes necessary/possible in the thread library. Here's what I came up with so far:

1st Problem: One signal handler process-wide

What is handled correctly now is sending signals to the group. Also that every thread has its mask. But there must be exactly one signal handler installed. I.e., a sigaction() call to set a new handler has consequences process-wide. Since this must be atomic I think the information should be kept in the thread group leader's data structures and the other threads should simply use this information all the time. Yeah, I know, one more indirection.

2nd Problem: Fatal signal handling

kernel/signal.c contains:

 * Send a thread-group-wide signal.
 *
 * Rule: SIGSTOP and SIGKILL get delivered to _everybody_.

That's OK. Except that if a signal whose default action is to terminate the process is not caught by the application, this signal also has to be handled process-wide. E.g., if there is no SIGSEGV handler the whole process is terminated. This will have to go hand in hand with an extension of the core file format to include information about all threads but for the time being it's enough if only the offending thread is dumped and the rest simply killed.

3rd Problem: one uid/gid process-wide

All the IDs (uid/gid/euid/egid/...) must be process-wide. The problem is similar to the signal handler. I think one should again keep the information exclusively in the master thread and have all others refer to this information.

4th Problem: thread termination

In general, thread termination is not of much interest for the rest of the system. It is at the moment, but once the fatal signal handling is done this will change. If a thread gets a fatal signal, the whole process is killed. No cleanup necessary. Signal handlers can be installed if necessary. If a thread terminates naturally it can perform the cleanup itself. In any case, the death signal should be ignored. Except for the last thread, of course, which has to notify the process starting the MT application. I can see two possible solutions, neither of which I've tried:

- the termination signal given to clone calls is 0 (zero). So no notification is sent out. Question is: does the kernel allow this?
- the kernel ignores the SIGCHLD signal for all threads except the last one

In any case there is the problem of how to handle the termination of the master thread. If it is not aware of starting and terminating threads I could imagine some user-level mechanisms to make this work but they are not very clean (it involves changing the death signal in the thread depending on the situation). If there is something people think the kernel could do this would probably be better.

5th Problem: suspended starting

Related to the last problem a good old friend pops up. Depending on the solution of the last problem it might be necessary to add suspended starting of threads. The problem is that sometimes the starter has to modify parameters (e.g., scheduler) of the newly started thread before it can actually start working. If this fails, the new thread must be terminated immediately. But who will get the termination signal? The data structures for the new thread must be removed as well, and this only after the new thread is guaranteed to have vanished. Anyway, I still think it's not even worth discussing this much since the whole change to implement this is only a few lines. And it's in no fastpath.

I might have more if I get deeper into implementation details. But if the above problems could be fixed we'd be a long way down the road to a good implementation.

[1] Since Linus says so it must be true.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: thread group comments
Alan Cox <[EMAIL PROTECTED]> writes: > You dont want it in kernel space. I don't see how you can do this. Also on user level you would have to do this atomically since otherwise communication between the threads isn't possible anymore. We have a PR in the glibc bug data base about just that. And I know that there are many other users with this problem. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
"Andi Kleen" <[EMAIL PROTECTED]> writes: > I've been thinking about how to best get rid of the thread manager for > thread creation in LinuxThreads. It is currently needed to do the wait. If you get rid of the manager thread (the +1 thread) then you have another problem: you cannot send a signal explicitly to this thread (to implement pthread_kill). The PID of this initial thread is now used as the PID of the thread group. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
"Theodore Y. Ts'o" <[EMAIL PROTECTED]> writes: > True, but this can be handled by having the master thread process catch > SIGSEGV and redistributing the signal to all of its child-threads. No, it cannot. We have to have a core dump with all threads. > (The assumption I'm making here is that the master thread doesn't do > anything except spawn all threads for the process and monitors its child > processes for death. This is the n+1 model.) The master thread will not anymore spawn the threads. That's the whole purpose of this exercise. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
Linus Torvalds <[EMAIL PROTECTED]> writes: > But I'd much rather just have the "n+1" thing. The overhead is basically > nonexistent, and it simplifies so many things. I see no big problems with this either. The only tricky thing is to get the stack swapped after the first clone() but this is solvable. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
"Andi Kleen" <[EMAIL PROTECTED]> writes: > Do you think the SA_NOCLDWAIT/queued exit signal approach makes sense ? I'm not sure whether it's worth the effort. But I'm saying this now looking at the code for another implementation following the 1:1 model. In a second stage where we have m kernel threads and n user-level threads (the ultimate goal) things might be different. But this is beyond what is needed in the 2.4 kernel so lets just skip the SA_NOCLDWAIT stuff for now. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
Linus Torvalds <[EMAIL PROTECTED]> writes: > Well, I would just swap it _before_ the clone() - basically in the > original parent when you create the new stack, you call clone() with the > new stack and with the old stack as the argument. No? Yes. I have it basically working. You have of course to swap before the clone since the new thread will use the stack. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
"Andi Kleen" <[EMAIL PROTECTED]> writes: > But I guess you don't want the context switch to a thread manager just to > generate a thread ? (and which is one of the main causes of the bad thread > creation latency in Linux currently) The thread manager, is I see it in the moment, will consist more or less of this: extern volatile int nthreads; do waitpid (0, &res, __WCLONE) while (nthreads > 0); exit (WEXITSTATUS (res)); No signal handler, since it cannot receive signals. Everything else the threads will do themselves. There is a problem though: the code we currently use for something like restarting depends on the manager doing this. This can be implemented in two ways: - send the manager a signal; this would require the threadkill() syscall already mentioned. Note that we can assume RT signals and therefore can transport data. But we get into problems if too many RT signals are queued. - extend the loop above to something similar to what we have today: do n = poll (..,..,.., timeout); check_for_dead_threads(); // use WNOHANG if (n > 0) read request and process it while (nthreads > 0) I really would like to avoid this. It has the problems we are seeing today: * high latency of these requests * must adjust the priority of the manager (this now gets complicated since it's not the manager which start the threads) * problems with changing UID/GID It will require some investigation to see whether we can implement the restart semantics correctly without a manager thread. If yes, we should be able to live with the simple loop. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: thread group comments
"Andi Kleen" <[EMAIL PROTECTED]> writes: > So you have a different way to implement pthread_create without context > switch to the thread manager in 2.4 ? It should be possible to do these things with CLONE_PARENT. It's a long weekend coming up, let's see what I have next week. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Question regarding kernel_threads
[EMAIL PROTECTED] writes: > I'm currently thinking of adding a PF_NOZOMBIE flag to the process > flags which releases the process immeadiately instead of calling > exit_notify in do_exit in exit.c I think this should happen if the exit signal is zero. At least I would like to use it this way. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] threaded processes get stuck in rt_sigsuspend/fillonedir/exit_notify
David Ford <[EMAIL PROTECTED]> writes:
> On the recent test kernels, processes get stuck. A kill -9 results in
> zombies.

The thread group changes broke the signal handling in linuxthreads. The CLONE_SIGHAND is now also used to enable thread groups but since linuxthreads already used CLONE_SIGHAND and is not prepared for thread groups all hell breaks loose. I've told Linus several times about this problem but he puts out one test release after the other without this fixed.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: [BUG] threaded processes get stuck in rt_sigsuspend/fillonedir/exit_notify
Ray Bryant <[EMAIL PROTECTED]> writes:
> Is there a succinct description of the thread group changes
> someplace? I'd be willing to take a look at fixing linuxthreads,
> but haven't seen any description (other than the kernel source) of
> what the changes are. Or is someone already working on this?

"Fixing" alone won't cut it. I've started a rewrite and sent Linus more comments about what is needed but did not even get a reply. Seems the short interest span is already over.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: [BUG] threaded processes get stuck in rt_sigsuspend/fillonedir/exit_notify
David Ford <[EMAIL PROTECTED]> writes:
> Regardless of who does it or whether or not it goes in testX patch, I'd
> surely like to have a patch(es) for my systems. Depending on what gets run,
> I could easily end up with hundreds+ of hung programs and zombies.

This is completely unrelated. The fix for your problem is to change the CLONE_SIGHAND flag back to its original behavior. Changing linuxthreads to take advantage of the new kernel functionality is on a different plate.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: [BUG] threaded processes get stuck in rt_sigsuspend/fillonedir/exit_notify
[EMAIL PROTECTED] writes:
> I didn't realize things had changed that broke the old threading model.
> Did Linus do more than add support for the new thread groups? I didn't
> think any that had changed that would break the old LinuxThread
> programs.

First he introduced CLONE_THREAD (or whatever it was called). This was fine. But in pre2 or pre3 he unified CLONE_SIGHAND and CLONE_THREAD under the new name CLONE_SIGNAL, which would make perfect sense if CLONE_SIGHAND were not already used. But it is. Simply undo this change, separate the two flags.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
Jesse Pollard <[EMAIL PROTECTED]> writes:
> It's not a bug, but a security feature. NO log to syslogd should be lost,
> since it may be related to an attack.

That's exactly the argument I'm always using to turn down change requests like this. If the syslog() function could drop an entry and not send it, it would be easy enough for somebody who has something to hide to overflow syslog() and then do whatever s/he does not want to be logged. If anything has to be changed it's (as suggested) the configuration or even the implementation of syslogd. Make it robust.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: syslog() blocks on glibc 2.1.3 with kernel 2.2.x
[EMAIL PROTECTED] (Patrick J. LoPresti) writes: > OK, but my current syslogd only listens to /dev/log as a SOCK_DGRAM. > [...] I don't care what the current syslogd does. Changing the libc just to work around the limitations of current implementations is wrong. Write a new syslogd and do it right. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Dual XEON - >>SLOW<< on SMP
"Richard B. Johnson" <[EMAIL PROTECTED]> writes: > Yes. Look at the NMI count. Looks like every access produces a > NMI. I'm seeing this as well, but only with PIII Xeon systems, not PII Xeon. Every single timer interrupt on any CPU is accompanied by a NMI and LOC increment on every CPU. CPU0 CPU1 0: 146727 153389IO-APIC-edge timer [...] NMI: 300035 300035 LOC: 300028 300028 -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Can EINTR be handled the way BSD handles it? -- a plea from a user-land programmer...
[EMAIL PROTECTED] writes: > Can we _PLEASE_PLEASE_PLEASE_ not do this anymore and have the kernel do > what BSD does: re-start the interrupted call? This is crap. Returning EINTR is necessary for many applications. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Can EINTR be handled the way BSD handles it? -- a plea from a user-land programmer...
"Theodore Y. Ts'o" <[EMAIL PROTECTED]> writes: > Arguably though the bug is in glibc, in that if it's using signals > behinds the scenes, it should have passed SA_RESTART to sigaction. Why are you talking such a nonsense? > > However, from a portability point of view, you should *always* surround > certain system calls with while loops, since even if your program > doesn't use signals, if you run that program on a System-V derived Unix > system, and someone types ^Z at the wrong moment, you can also get an > EINTR. Similarly, you should always check the return value from write > and make sure all of what you asked to be written, was actually > written. > > What I normally do is have a full_write routine which looks something > like this: > > static errcode_t full_write(int fd, void *buf, int count) > { > char*cp = buf; > int left = count, c; > > while (left) { > c = write(fd, cp, left); > if (c < 0) { > if (errno == EINTR || errno == EAGAIN) > continue; > return errno; > } > left -= c; > cp += c; > } > return 0; > } > > It's like checking the return value from malloc(). Not everyone does > it, but even if it's not needed 99% of the time, it's a darned good idea > to do that. > > - Ted > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ > > -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Can EINTR be handled the way BSD handles it? -- a plea from a user-land programmer...
Ulrich Drepper <[EMAIL PROTECTED]> writes:
> "Theodore Y. Ts'o" <[EMAIL PROTECTED]> writes:
>
> > Arguably though the bug is in glibc, in that if it's using signals
> > behinds the scenes, it should have passed SA_RESTART to sigaction.
>
> Why are you talking such a nonsense?

[Note to self: remove kitten from keyboard before writing mail.]

Glibc has to use signals because there *still* is no mechanism in the kernel to allow synchronization. After how many years. I don't blame Linus. He has no interest in threads and therefore spends not much time thinking about it. But everybody who's complaining about things like this has to be willing to fix the real problems. Get your ass up and write a fast semaphore/mutex system.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: 2.2.18 signal.h
Andrea Arcangeli <[EMAIL PROTECTED]> writes: > x() > { > > switch (1) { > case 0: > case 1: > case 2: > case 3: > ; > } > } > > Why am I required to put a `;' only in the last case and not in all > the previous ones? Or maybe gcc-latest is forgetting to complain about > the previous ones ;) Your C language knowledge seems to have holes. It must be possible to have more than one label for a statement. Look through the kernel sources, there are definitely cases where this is needed. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
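A small illustration of the point, not taken from the kernel sources: the function `days_in_month` is a made-up example. In C89 and C99 a label (case, default or goto) labels a statement, several labels may share one statement, and a label that ends the block therefore needs at least the empty statement `;`.

```c
/*
 * Number of days in a month (1..12), 0 for anything else.
 * Leap years are ignored in this sketch.
 */
int days_in_month(int month)
{
    int days = 0;

    switch (month) {
    case 4:
    case 6:
    case 9:
    case 11:            /* four labels, one labeled statement */
        days = 30;
        break;
    case 2:
        days = 28;
        break;
    case 1: case 3: case 5: case 7: case 8: case 10: case 12:
        days = 31;
        break;
    default:
        ;               /* the label needs a statement; days stays 0 */
    }
    return days;
}
```

Only the last label needs the explicit `;` because every earlier label is followed, directly or through further labels, by a real statement.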
Re: What is up with Redhat 7.0?
I really didn't want to make a comment on this stupid thread but now you are getting personal:

> > > OTOH, [EMAIL PROTECTED] might get pressed into not doing incompatible
> > > changes,
> >
> > We're doing no such thing.
>
> If you say so However, I am not sure that you (we?) can actually
> control it.

You are excused this one and only time since I am fortunate enough to never have met you but listen carefully now: I allow nobody to tell me what to do. Nobody from Red Hat ever tried to do this. If this would have been on the mind of somebody (which I doubt) this illusion would have been destroyed on the first day when I told them that this never would be an option. There are external entities (commercial and non-commercial) who try this, though, of course without success either.

> > If we did this sort of thing, he would have been pressed into releasing
> > glibc 2.2 in time.
>
> Well, I actually do think that this has happened with glibc-2.1.

And this I take as personal insult. Who the f*ck do you think you are to claim the right of making such a statement? This is so completely insane that I really have not the slightest idea how you can make something like this up. Go and find somebody who is working on glibc to back up this "statement" and not some idiot like you who has no insight whatsoever. If you cannot find anybody I demand a public apology from you.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: [RFC] Additional pad in struct stat(64)
Christoph Hellwig <[EMAIL PROTECTED]> writes:
> Hehe, that's why I'd like to introduce some additional pad with my
> patch ;)

There is no reason to introduce unnecessary incompatibilities now. If you want to look forward and add more padding do this when there is another change necessary. Introducing breakage just to possibly avoid it in the future is stupid.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: [RFC] Additional pad in struct stat(64)
Christoph Hellwig <[EMAIL PROTECTED]> writes: > I'd like to have st_flags added to struct stat64, so adding the actual > feature in Linux 2.5 (if it has a chance to get in - that's why I'm > interested in a comment by Linus on this) will not need a new version > of struct stat (and a new libc to use it), It will need a new libc version anyway. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] pcnet32 compilation fix for 2.4.3pre6
[EMAIL PROTECTED] writes: > with the new ansi standard, this use of __inline__ is no longer > necessary, This is not correct. Since the semantics of inline in C99 and gcc differ all code which depends on the gcc semantics should continue to use __inline__ since this keyword will hopefully forever signal the gcc semantics. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] devfsd, compiling on glibc22x
Pierre Rousselet <[EMAIL PROTECTED]> writes: > for me : > make CFLAGS='-O2 -I. -D_GNU_SOURCE' > compiles without any patch. is it correct ? Yes. RTLD_NEXT is not in any standard, it's an extension available via -D_GNU_SOURCE. -- ---. ,-. 1325 Chesapeake Terrace Ulrich Drepper \,---' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com ` - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] devfsd, compiling on glibc22x
Richard Gooch <[EMAIL PROTECTED]> writes:
> So why do old binaries (compiled with glibc 2.1.3) segfault when they
> call dlsym() with RTLD_NEXT? Even newly compiled binaries (with glibc
> 2.2) still segfault.

Why do you ask me? You wrote the code.

-- Ulrich Drepper, Red Hat, 1325 Chesapeake Terrace, Sunnyvale, CA 94089 USA, drepper at redhat.com
Re: pselect() modifying timeout
Michael Kerrisk wrote:
> Please consider making Linux pselect() conform to POSIX on this
> point.

There is no question the implementation will conform. But this is not dependent on changing the syscall interface. We deliberately decided to not require the kernel interface to be different from select. The userlevel code will take care of the difference. The kernel code is good as proposed.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
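A sketch of the copy-before-call idea behind "the userlevel code will take care of the difference". It is not the libc wrapper itself: `wait_readable` is a hypothetical helper layered on the public `pselect()`, shown only to make visible how a caller's timeout stays untouched when the layer underneath is handed a private copy.

```c
#include <sys/select.h>
#include <time.h>

/*
 * Wait up to *timeout (or forever if timeout is NULL) for fd to become
 * readable.  The caller's timespec is copied first, so it is unchanged
 * whatever the call underneath does with the structure it is given.
 * Returns what pselect() returns: 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, const struct timespec *timeout)
{
    struct timespec local;
    fd_set rfds;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (timeout != NULL)
        local = *timeout;       /* private copy handed to the lower layer */
    return pselect(fd + 1, &rfds, NULL, NULL,
                   timeout != NULL ? &local : NULL, NULL);
}
```

Because only the copy is passed down, the same `struct timespec` can be reused for every call in a loop.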
Re: sigwait() breaks when straced
On 7/30/05, Sanjoy Mahajan <[EMAIL PROTECTED]> wrote:
> so the return value should not be 4 (or the docs are not right).

This return value simply indicates EINTR (sigwait does not set errno, read the docs). The kernel simply doesn't restart the function in case of a signal. It should do this, though.
Re: Fw: sigwait() breaks when straced
On 7/31/05, Roland McGrath <[EMAIL PROTECTED]> wrote:
> However, there is in fact no bug here. The test program is just wrong.
> sigwait returns zero or an error number, as POSIX specifies.

No question, no error is detected incorrectly. But sigwait is not a function specified with an EINTR error number. As I said before, this does not mean that EINTR cannot be returned. But it will create havoc among programs and it causes undefined behavior wrt SA_RESTART. I think it is best to not have any function for which EINTR is not a defined error fail this way. This causes the least amount of surprises and avoids unnecessary loops around the userlevel call sites.
Re: Fw: sigwait() breaks when straced
On 8/1/05, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> I'm not quite sure you are right Ulrich. Given this little bit from
> SUSv3 about SA_RESTART in the page describing sigaction (
> http://www.opengroup.org/onlinepubs/009695399/functions/sigaction.html
> ) :

It's not an official SA_RESTART since the syscall is defined to support EINTR. It's clear that sigwait in this sense is not interruptible. Returning EINTR from sigwait is only allowed by POSIX since there is no contrary wording (unlike for the pthread functions). But if this clause were used each and every syscall could return EINTR and we would have to surround all syscalls with a loop. Hence the syscall should be restarted, not because SA_RESTART is set, but because EINTR shouldn't be returned.

Now, Roland correctly said sigtimedwait and sigwaitinfo need to return EINTR and we use one syscall for them all. I overlooked that part. So, I'll add the wrapper in the libc so that sigwait restarts on EINTR.
Re: [PATCH 2.6.13-rc6 1/2] New Syscall: get rlimits of any process (update)
On 8/18/05, Alan Cox <[EMAIL PROTECTED]> wrote:
> Perhaps those application authors should provide a management interface
> to do so within the soft limit range at least. Its not clear to me that
> growing the fd array on a process is even safe. Some programs do size
> arrays at startup after querying the rlimit data.

That's very true. Using such a remote-rlimit syscall would break all kinds of code. It's a basic assumption from Unix/POSIX that the limits remain constant.

And as Alan hinted at: this is why there are soft and hard limits. If they are set to the same value you obviously don't get anything. But this is the application programmer's fault. An application which is aware of resources and tries to limit them should set the soft limits to a reasonably low value and the hard limit to the absolute maximum (probably the system's maximum). Then you can have remote procedure calls into the application to adjust the soft limits.

Having to change the hard limit means the capacity planning for the app is completely wrong. A restart is certainly acceptable in that case since it should really never happen.
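A minimal sketch of the soft-limit half of that scheme, assuming the application adjusts its own limits with `getrlimit()`/`setrlimit()`. The helper name `set_soft_limit` is made up; how a management request reaches it (RPC, control socket) is left out.

```c
#include <sys/resource.h>

/*
 * Set the soft limit of `resource` to `want`, clamped to the current hard
 * limit.  The hard limit is left alone, so the soft limit can be raised
 * again later by the process itself without any privilege.
 * Returns 0 on success, -1 on error (errno set by getrlimit/setrlimit).
 */
int set_soft_limit(int resource, rlim_t want)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;     /* never ask for more than the hard limit */
    rl.rlim_cur = want;
    return setrlimit(resource, &rl);
}
```

At startup the application would call it with a conservative value, for example `set_soft_limit(RLIMIT_NOFILE, 256)`, and call it again with a larger value when asked to.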
mremap() use is racy
Not the mremap() implementation itself, so don't worry.

If mremap() is to be used without the MREMAP_MAYMOVE flag the call will only succeed if the address space after the block which is to be remapped is empty. This is rarely the case since there are many users of mmap and memory is allocated consecutively in many cases.

So what programs have to do is to make sure ahead of time that the mremap() call can succeed. The best way to do this is using an anonymous, unused, unusable mapping. Code like this:

  p = mmap(NULL, 65536, PROT_NONE, MAP_PRIVATE|MAP_ANON, -1, 0);
  mmap(p, 16384, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

Then when the mapping has to be extended one should be able to use mremap():

  mremap(p, 16384, 32768, 0);

But this is not possible since mremap() respects the anonymous mapping. So one has to use

  munmap((char*)p + 16384, 16384);

before the mremap() call. But this is where the race comes in. Some other thread might allocate these blocks before the mremap() call can do it.

One possible solution would be to add a flag to mremap() which allows mremap() to steal memory. In general that would be too dangerous but we could limit it to private, anonymous mappings which have no access permissions (i.e., PROT_NONE with MAP_PRIVATE and MAP_ANON). One explicitly has to allocate such blocks, they don't appear naturally. And the program in any case knows about the address space layout.

So, how about adding MREMAP_MAPOVERNONE or so?

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
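The sequence from this mail, spelled out as a compilable sketch for Linux. Several things are assumptions of the illustration: anonymous memory stands in for the file mapping (so no `fd` is needed), the helper names `reserve_and_map` and `grow_in_place` are made up, and the sizes are expected to be multiples of the page size. The comment in `grow_in_place` marks the window in which another thread's mmap() can take the freed range.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for mremap() */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve `reserve` bytes of address space with an inaccessible mapping
 * and make the first `used` bytes readable and writable.
 * Returns the base address or MAP_FAILED.
 */
void *reserve_and_map(size_t reserve, size_t used)
{
    void *p = mmap(NULL, reserve, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return MAP_FAILED;

    /* MAP_FIXED replaces the start of the reservation in place. */
    if (mmap(p, used, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
        munmap(p, reserve);
        return MAP_FAILED;
    }
    return p;
}

/*
 * Grow the usable part from old_len to new_len without moving it.
 * Between munmap() and mremap() the freed range belongs to nobody:
 * another thread calling mmap() may be handed it, and then mremap()
 * fails because it is not allowed to move the mapping.
 * Returns p on success, MAP_FAILED on error.
 */
void *grow_in_place(void *p, size_t old_len, size_t new_len)
{
    if (munmap((char *)p + old_len, new_len - old_len) != 0)
        return MAP_FAILED;
    return mremap(p, old_len, new_len, 0);  /* flags 0: no MREMAP_MAYMOVE */
}
```

In a single-threaded program the two steps always succeed together; the proposed flag would let the kernel perform them as one operation so that the gap never becomes visible to other threads.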
Re: mremap() use is racy
Hugh Dickins wrote:
> If the app can plan ahead as you're proposing, why doesn't it just
> mmap the maximum it might need, mprotect PROT_NONE the end it doesn't
> need yet, then progressively re-mprotect parts to make them accessible
> as needed?

Because the underlying file isn't larger than the initial mapping. In the one case I'm working on now the file can grow over time. More data is added at the end but the mapping cannot move in the address space. Using mmap with a too-large size for the underlying file and then hoping that future file growth is magically handled when those pages are accessed is not valid.

> I'm missing what mremap gives you here that mprotect doesn't. Though
> I do see that it would be nice not to be forced into mremap moving
> all the time, because of other maps blocking you off: nice perhaps
> to know what region of the layout is least likely to be so affected.

Just accept here that moving is not an option. If remap cannot be used then a complete new mmap() with adjusted length is needed. That is unnecessarily expensive. It is the reason why there is mremap(). But mremap() with MREMAP_MAYMOVE is unreliable as it is implemented today.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: mremap() use is racy
Linus Torvalds wrote:
> Actually, it should be pretty much as valid as using mremap - ie it works
> on Linux.
>
> Especially if you use MAP_SHARED, you don't even need to mprotect
> anything: you'll get a nice SIGBUS if you ever try to access past the last
> page that maps the file.

If you guarantee this (and test for this) it's fine with me. The POSIX spec explicitly leaves this undefined and requiring to use mremap() would be a nice way to work around this without allowing the introduction of undefined behavior into programs.

I probably would prefer to use mremap() since this makes it clear what should happen but I can live with using the too-large mapping.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
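The too-large mapping discussed in this exchange can be sketched as follows. This relies on the Linux behavior Linus describes (pages past the end of the file raise SIGBUS until the file has grown to cover them), which POSIX leaves undefined, so treat it as Linux-specific. The helper names map_max and grow_backing_file are made up for the sketch.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map max_len bytes of fd as a shared mapping, even
   if the file is currently shorter than that.  On Linux, touching a page
   that lies entirely past the end of the file raises SIGBUS.  */
void *map_max(int fd, size_t max_len)
{
    return mmap(NULL, max_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/* Hypothetical helper: extend the file so that the first new_size bytes
   of the existing mapping become backed by file pages.  The mapping does
   not move and no mremap() call is needed.  */
int grow_backing_file(int fd, off_t new_size)
{
    return ftruncate(fd, new_size);
}
```

The address never changes, which is the property the message asks for; the cost is that the maximum size has to be chosen when the mapping is created.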
Re: Add pselect, ppoll system calls.
David Woodhouse wrote:
> If it's mandatory that we actually call the signal handler, then we need
> to play tricks like sigsuspend() does to leave the old signal mask on
> the stack frame. That's a bit painful atm because do_signal is different
> between architectures.

It is necessary that the handler is called. This is the purpose of these interfaces. If this means more complexity is needed then this is how the cookie crumbles.

One use case for pselect would be something like this:

  int got_signal;

  void
  sigint_handler(int sig)
  {
    got_signal = 1;
  }

  {
    ...
    while (1)
      {
        if (!got_signal)
          pselect()

        if (got_signal)
          {
            handle signal
            got_signal = 0;
          }
      }
    ...
  }

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCH] RUSAGE_THREAD
Roland McGrath wrote:
> +#define RUSAGE_LWP RUSAGE_THREAD /* Solaris name for same */

No need to clutter the kernel header with this, it'll be in the libc header.

Aside from that:

Acked-by: Ulrich Drepper <[EMAIL PROTECTED]>

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv3 0/4] sys_indirect system call
Eric Dumazet wrote:
>> union indirect_params i;
>> i.file_flags.flags = O_CLOEXEC;
>
> This setup forbids future addons to file_flags
>
> In three years, when we want to add a new indirect feature to socket()
> call, do we need a new indirect2() syscall ?

No, it doesn't. The setup is indefinitely expandable. All you need to do, if it becomes necessary to have more than an int, is to define a little structure for the system call and then use it. The only requirement is that the code has to assume a value of zero is what is used today. That's the whole point.

union indirect_params {
  struct {
    int flags;
  } file_flags;
  struct {
    int flags;
    int new_syscall_data1;
    sigset_t and_a_sigmask;
  } new_data;
};

Old programs will set only the 'flags' member of 'new_data' while new ones can also set the new elements. New programs on old kernels will either have failing calls since the structure is too big or the call will not have all the desired effects. The latter can be tested for.

> Or better, you could avoid using 'union indirect_params' in user code, and
> only use the substructs for each function.

There is no overhead introduced through the union. The only reason the union is there in the first place is to allocate sufficient data in task_struct to cover all cases.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv3 0/4] sys_indirect system call
Eric Dumazet wrote:
> So when you recompile your old program (as you post it and as I commented on),
> it will pass a >= 12 bytes data to kernel, with only first 4 bytes set to
> O_CLOEXEC.
>
> Other bytes will contain junk

If you don't initialize the entire structure and you use it all, of course you get undefined behavior. That's nothing new. The program I attached is not an example, it's a test for the functionality in this patch.

Like with every kernel interface, you have to use it correctly. The good news is that user programs should never use this syscall directly (just like they don't for existing ones). I see no problem at all here.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv3 0/4] sys_indirect system call
H. Peter Anvin wrote:
> What bothers me about the sys_indirect approach is that it will get
> increasingly expensive as time goes on, and in doing so it does a
> user-space memory reference, which are extra expensive. The extra table
> can be colocated with the main table (a structure, in effect) so they'll
> share the same cache line.

You assume that using sys_indirect will be the norm. It won't. We mustn't design system calls deliberately wrong so that they require the indirection.

Besides, if the number of syscalls which has to be handled this way grows we can use something more efficient for large numbers of tests than a switch statement. It could even be a word next to the system call table.

But I still don't see that the magic encoding is a valid solution, it doesn't address the limited parameter number. Plus, using sys_indirect could in future be used to transport entire parameters (like a sigset_t) along with other information, thereby saving individual copy operations.

I think the sys_indirect approach is the way forward. I'll submit a last version of the patch in a bit.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets
This is a first user of sys_indirect. Several of the socket-related system calls which produce a file handle now can be passed an additional parameter to set the FD_CLOEXEC flag.

 arch/x86/ia32/Makefile        |    1 +
 arch/x86/ia32/sys_ia32.c      |    4
 include/asm-x86/ia32_unistd.h |    1 +
 include/linux/indirect.h      |   33 +
 kernel/Makefile               |    2 ++
 kernel/indirect.c             |    4
 net/socket.c                  |   21 +
 7 files changed, 58 insertions(+), 8 deletions(-)

--- arch/x86/ia32/Makefile
+++ arch/x86/ia32/Makefile
@@ -36,6 +36,7 @@ $(obj)/vsyscall-sysenter.so.dbg $(obj)/vsyscall-syscall.so.dbg: \
 $(obj)/vsyscall-%.so.dbg: $(src)/vsyscall.lds $(obj)/vsyscall-%.o FORCE
 	$(call if_changed,syscall)
+CFLAGS_sys_ia32.o = -Wno-undef
 AFLAGS_vsyscall-sysenter.o = -m32 -Wa,-32
 AFLAGS_vsyscall-syscall.o = -m32 -Wa,-32
--- kernel/Makefile
+++ kernel/Makefile
@@ -67,6 +67,8 @@ ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
 endif
+CFLAGS_indirect.o = -Wno-undef
+
 $(obj)/configs.o: $(obj)/config_data.h
 # config_data.h contains the same information as ikconfig.h but gzipped.
diff -u net/socket.c net/socket.c
--- net/socket.c
+++ net/socket.c
@@ -344,11 +344,11 @@
  *	but we take care of internal coherence yet.
  */
-static int sock_alloc_fd(struct file **filep)
+static int sock_alloc_fd(struct file **filep, int flags)
 {
 	int fd;
-	fd = get_unused_fd();
+	fd = get_unused_fd_flags(flags);
 	if (likely(fd >= 0)) {
 		struct file *file = get_empty_filp();
@@ -391,10 +391,10 @@
 	return 0;
 }
-int sock_map_fd(struct socket *sock)
+static int sock_map_fd_flags(struct socket *sock, int flags)
 {
 	struct file *newfile;
-	int fd = sock_alloc_fd(&newfile);
+	int fd = sock_alloc_fd(&newfile, flags);
 	if (likely(fd >= 0)) {
 		int err = sock_attach_fd(sock, newfile);
@@ -409,6 +409,11 @@
 	return fd;
 }
+int sock_map_fd(struct socket *sock)
+{
+	return sock_map_fd_flags(sock, 0);
+}
+
 static struct socket *sock_from_file(struct file *file, int *err)
 {
 	if (file->f_op == &socket_file_ops)
@@ -1208,7 +1213,7 @@
 	if (retval < 0)
 		goto out;
-	retval = sock_map_fd(sock);
+	retval = sock_map_fd_flags(sock, INDIRECT_PARAM(file_flags, flags));
 	if (retval < 0)
 		goto out_release;
@@ -1249,13 +1254,13 @@
 	if (err < 0)
 		goto out_release_both;
-	fd1 = sock_alloc_fd(&newfile1);
+	fd1 = sock_alloc_fd(&newfile1, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(fd1 < 0)) {
 		err = fd1;
 		goto out_release_both;
 	}
-	fd2 = sock_alloc_fd(&newfile2);
+	fd2 = sock_alloc_fd(&newfile2, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(fd2 < 0)) {
 		err = fd2;
 		put_filp(newfile1);
@@ -1411,7 +1416,7 @@
 	 */
 	__module_get(newsock->ops->owner);
-	newfd = sock_alloc_fd(&newfile);
+	newfd = sock_alloc_fd(&newfile, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(newfd < 0)) {
 		err = newfd;
 		sock_release(newsock);
diff -u arch/x86/ia32/sys_ia32.c arch/x86/ia32/sys_ia32.c
--- arch/x86/ia32/sys_ia32.c
+++ arch/x86/ia32/sys_ia32.c
@@ -902,6 +902,10 @@
 	switch (INDIRECT_SYSCALL32(&regs))
 	{
+#define INDSYSCALL(name) __NR_ia32_##name
+#include
+		break;
+
 	default:
 		return -EINVAL;
 	}
diff -u include/linux/indirect.h include/linux/indirect.h
--- include/linux/indirect.h
+++ include/linux/indirect.h
@@ -1,6 +1,39 @@
+#ifndef INDSYSCALL
 #ifndef _LINUX_INDIRECT_H
 #define _LINUX_INDIRECT_H
 #include
+
+union indirect_params {
+  struct {
+    int flags;
+  } file_flags;
+};
+
+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
+
+#endif
+#else
+
+/* Here comes the list of system calls which can be called through
+   sys_indirect.  When the list of supported system calls is needed the
+   file including this header is supposed to define a macro "INDSYSCALL"
+   which adds a prefix fitting to the use.  If the resulting macro is
+   defined we generate a line
+	case MACRO:
+   */
+#if INDSYSCALL(accept)
+  case INDSYSCALL(accept):
+#endif
+#if INDSYSCALL(socket)
+  case INDSYSCALL(socket):
+#endif
+#if INDSYSCALL(socketcall)
+  case INDSYSCALL(socketcall):
+#endif
+#if INDSYSCALL(socketpair)
+  case INDSYSCALL(socketpair):
+#endif
+
 #endif
diff -u kernel/indirect.c kernel/indirect.c
--- kernel/indirect.c
+++ kernel/indirect.c
@@ -19,6 +19,10 @@
 	switch (INDIRECT_SYSCALL (&regs))
 	{
+#define INDSYSCALL(name) __NR_##name
+#include
+		break;
+
 	default:
 		return -EINVAL;
 	}
--- include/
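The idea behind the INDSYSCALL list in this patch (one list of allowed calls, expanded into case labels with a different prefix for the native and the compat path) can be shown in a single file. The sketch below swaps in a different technique, an X-macro, rather than the patch's re-included header with #if tests, so unlike the patch it cannot skip calls an ABI does not define. All names and syscall numbers in it are made up for illustration.

```c
/* One list of the calls allowed through the indirection.  */
#define INDIRECT_SYSCALLS(X) \
    X(accept)                \
    X(socket)                \
    X(socketpair)

/* Hypothetical syscall numbers for two ABIs (not real kernel values).  */
enum { NR_native_accept = 43, NR_native_socket = 41, NR_native_socketpair = 53 };
enum { NR_compat_accept = 1043, NR_compat_socket = 1041, NR_compat_socketpair = 1053 };

#define NATIVE_CASE(name) case NR_native_##name:
#define COMPAT_CASE(name) case NR_compat_##name:

/* Returns 1 if nr is on the white-list for the native ABI, else 0.  */
int native_allowed(int nr)
{
    switch (nr) {
    INDIRECT_SYSCALLS(NATIVE_CASE)
        return 1;
    default:
        return 0;
    }
}

/* Same list, expanded with the compat prefix.  */
int compat_allowed(int nr)
{
    switch (nr) {
    INDIRECT_SYSCALLS(COMPAT_CASE)
        return 1;
    default:
        return 0;
    }
}
```

Adding a call to INDIRECT_SYSCALLS updates both switch statements at once, which is the maintenance property the patch is after.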
[PATCHv4 1/6] actual sys_indirect code
This is the actual architecture-independent part of the system call implementation.

 include/linux/indirect.h |    6 ++
 include/linux/sched.h    |    4
 include/linux/syscalls.h |    4
 kernel/Makefile          |    2 +-
 kernel/indirect.c        |   36
 5 files changed, 51 insertions(+), 1 deletion(-)

--- /dev/null
+++ include/linux/indirect.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_INDIRECT_H
+#define _LINUX_INDIRECT_H
+
+#include
+
+#endif
--- include/linux/sched.h
+++ include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1174,6 +1175,9 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+
+	/* Additional system call parameters.  */
+	union indirect_params indirect_params;
 };
 /*
--- include/linux/syscalls.h
+++ include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct indirect_registers;
 #include
 #include
@@ -611,6 +612,9 @@ asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 			    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+			     void __user *userparams, size_t paramslen,
+			     int flags);
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
--- /dev/null
+++ kernel/indirect.c
@@ -0,0 +1,36 @@
+#include
+#include
+#include
+#include
+
+
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+			     void __user *userparams, size_t paramslen,
+			     int flags)
+{
+	struct indirect_registers regs;
+	long result;
+
+	if (unlikely(flags != 0))
+		return -EINVAL;
+
+	if (copy_from_user(&regs, userregs, sizeof(regs)))
+		return -EFAULT;
+
+	switch (INDIRECT_SYSCALL (&regs))
+	{
+	default:
+		return -EINVAL;
+	}
+
+	if (paramslen > sizeof(union indirect_params))
+		return -EINVAL;
+
+	result = -EFAULT;
+	if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+		result = CALL_INDIRECT(&regs);
+
+	memset(&current->indirect_params, '\0', paramslen);
+
+	return result;
+}
--- kernel/Makefile
+++ kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \
-	    utsname.o notifier.o
+	    utsname.o notifier.o indirect.o
 obj-$(CONFIG_SYSCTL) += sysctl_check.o
 obj-$(CONFIG_STACKTRACE) += stacktrace.o
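The control flow of sys_indirect in this patch (reject unknown flags, check the white-list, bound the parameter length, stash the parameters where the real call can find them, run the call, clear the parameters again) can be modelled in plain user-space C. Everything below is a stand-in invented for the sketch: task_params plays the role of current->indirect_params, memcpy replaces copy_from_user, and the NR_* numbers and model_socket are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

union indirect_params {
    struct {
        int flags;
    } file_flags;
};

/* Stand-in for current->indirect_params.  */
static union indirect_params task_params;

/* Hypothetical call numbers for the model.  */
enum { NR_SOCKET = 1, NR_EXECVE = 2 };

/* Stand-in for a white-listed system call; it reports the extra flags it
   found in the per-task parameter block.  */
static long model_socket(void)
{
    return task_params.file_flags.flags;
}

/* User-space model of the dispatch logic.  */
long model_indirect(int nr, const void *params, size_t paramslen, int flags)
{
    long result;

    if (flags != 0)
        return -EINVAL;

    switch (nr) {
    case NR_SOCKET:          /* white-listed */
        break;
    default:                 /* everything else is refused */
        return -EINVAL;
    }

    if (paramslen > sizeof(task_params))
        return -EINVAL;

    memcpy(&task_params, params, paramslen);
    result = model_socket();

    /* Clear the block so a later ordinary call sees "no extra params".  */
    memset(&task_params, 0, paramslen);
    return result;
}
```

The final memset is what lets an ordinary, non-indirect call of the same system call behave exactly as before: it always finds a zeroed parameter block.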
[PATCHv4 3/6] UML support for sys_indirect
This part adds support for sys_indirect for UML.

 indirect.h |    6 ++
 1 file changed, 6 insertions(+)

--- /dev/null
+++ include/asm-um/indirect.h
@@ -0,0 +1,6 @@
+#ifndef __UM_INDIRECT_H
+#define __UM_INDIRECT_H
+
+#include "asm/arch/indirect.h"
+
+#endif
[PATCHv4 2/6] x86&x86-64 support for sys_indirect
This part adds support for sys_indirect on x86 and x86-64.

 arch/x86/ia32/ia32entry.S          |    2 ++
 arch/x86/ia32/sys_ia32.c           |   31 +++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/asm-x86/indirect.h         |    5 +
 include/asm-x86/indirect_32.h      |   23 +++
 include/asm-x86/indirect_64.h      |   34 ++
 include/asm-x86/unistd_32.h        |    3 ++-
 include/asm-x86/unistd_64.h        |    2 ++
 8 files changed, 100 insertions(+), 1 deletion(-)

--- arch/x86/ia32/ia32entry.S
+++ arch/x86/ia32/ia32entry.S
@@ -400,6 +400,7 @@ END(ia32_ptregs_common)
 	.section .rodata,"a"
 	.align 8
+	.globl ia32_sys_call_table
 ia32_sys_call_table:
 	.quad sys_restart_syscall
 	.quad sys_exit
@@ -726,4 +727,5 @@ ia32_sys_call_table:
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
 	.quad sys32_fallocate
+	.quad sys32_indirect	/* 325 */
 ia32_syscall_end:
--- arch/x86/ia32/sys_ia32.c
+++ arch/x86/ia32/sys_ia32.c
@@ -887,3 +887,37 @@ asmlinkage long sys32_fallocate(int fd, int mode, unsigned offset_lo,
 	return sys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo,
 			     ((u64)len_hi << 32) | len_lo);
 }
+
+asmlinkage long sys32_indirect(struct indirect_registers32 __user *userregs,
+			       void __user *userparams, size_t paramslen,
+			       int flags)
+{
+	extern long (*ia32_sys_call_table[])(u32, u32, u32, u32, u32, u32);
+
+	struct indirect_registers32 regs;
+	long result;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	if (copy_from_user(&regs, userregs, sizeof(regs)))
+		return -EFAULT;
+
+	switch (INDIRECT_SYSCALL32(&regs))
+	{
+	default:
+		return -EINVAL;
+	}
+
+	if (paramslen > sizeof(union indirect_params))
+		return -EINVAL;
+	result = -EFAULT;
+	if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+		result = ia32_sys_call_table[regs.eax](regs.ebx, regs.ecx,
+						       regs.edx, regs.esi,
+						       regs.edi, regs.ebp);
+
+	memset(&current->indirect_params, '\0', paramslen);
+
+	return result;
+}
--- arch/x86/kernel/syscall_table_32.S
+++ arch/x86/kernel/syscall_table_32.S
@@ -324,3 +324,4 @@ ENTRY(sys_call_table)
 	.long sys_timerfd
 	.long sys_eventfd
 	.long sys_fallocate
+	.long sys_indirect	/* 325 */
--- /dev/null
+++ include/asm-x86/indirect_32.h
@@ -0,0 +1,23 @@
+#ifndef _ASM_X86_INDIRECT_32_H
+#define _ASM_X86_INDIRECT_32_H
+
+struct indirect_registers {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+	__u32 esi;
+	__u32 edi;
+	__u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->eax
+
+#define CALL_INDIRECT(regs) \
+  ({ extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32); \
+     sys_call_table[INDIRECT_SYSCALL(regs)] ((regs)->ebx, (regs)->ecx, \
+					     (regs)->edx, (regs)->esi, \
+					     (regs)->edi, (regs)->ebp); \
+  })
+
+#endif
--- /dev/null
+++ include/asm-x86/indirect_64.h
@@ -0,0 +1,34 @@
+#ifndef _ASM_X86_INDIRECT_64_H
+#define _ASM_X86_INDIRECT_64_H
+
+struct indirect_registers {
+	__u64 rax;
+	__u64 rdi;
+	__u64 rsi;
+	__u64 rdx;
+	__u64 r10;
+	__u64 r8;
+	__u64 r9;
+};
+
+struct indirect_registers32 {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+	__u32 esi;
+	__u32 edi;
+	__u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->rax
+#define INDIRECT_SYSCALL32(regs) (regs)->eax
+
+#define CALL_INDIRECT(regs) \
+  ({ extern long (*sys_call_table[]) (__u64, __u64, __u64, __u64, __u64, __u64); \
+     sys_call_table[INDIRECT_SYSCALL(regs)] ((regs)->rdi, (regs)->rsi, \
+					     (regs)->rdx, (regs)->r10, \
+					     (regs)->r8, (regs)->r9); \
+  })
+
+#endif
--- /dev/null
+++ include/asm-x86/indirect.h
@@ -0,0 +1,5 @@
+#ifdef CONFIG_X86_32
+# include "indirect_32.h"
+#else
+# include "indirect_64.h"
+#endif
--- include/asm-x86/unistd_32.h
+++ include/asm-x86/unistd_32.h
@@ -330,10 +330,11 @@
 #define __NR_timerfd		322
 #define __NR_eventfd		323
 #define __NR_fallocate		324
+#define __NR_indirect		325
 #ifdef __KERNEL__
-#define NR_syscalls 325
+#define NR_syscalls 326
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
--- include/asm-x86/unistd_64.h
+++ include/asm-x86/unistd_64.h
@@ -635,6 +635,8 @@ __SYSCALL(__NR_timerfd, sys_timerfd)
 __SYSCALL(__NR_eventfd, sys_eventfd)
 #define __NR_fallocate
[PATCHv4 0/6] sys_indirect system call
The following patches provide an alternative implementation of the sys_indirect system call which has been discussed a few times. This new system call allows us to extend existing system call interfaces without adding more system calls. Davide's previous implementation is IMO far more complex than warranted. This code here is trivial, as you can see. I've discussed this approach with Linus last week and for a brief moment we actually agreed on something.

We pass an additional block of data to the kernel, it is copied into the task_struct, and then it is up to the function implementing the system call to interpret the data. Each system call, which is meant to be extended this way, has to be white-listed in sys_indirect. The alternative is to filter out those system calls which absolutely cannot be handled using sys_indirect (like clone, execve) since they require the stack layout of an ordinary system call. This is more dangerous since it is too easy to miss a call.

The code for x86 and x86-64 gets by without a single line of assembly code. This is likely to be true for most/all the other archs as well. There is architecture-dependent code, though. For x86 and x86-64 I've also fixed up UML (although only x86-64 is tested, that's my setup).

The last three patches show the first application of the functionality. They also show a complication: we need the test for valid sub-syscalls in the main implementation and in the compatibility code. And more: the actual sources and generated binary for the test are very different (the numbers differ). Duplicating the information is a big problem, though. I've used some macro tricks to avoid this. All the information about the flags and the system calls using them is concentrated in one header. This should make maintenance bearable.

This patch to use sys_indirect is just the beginning. More will follow, but I want to see how these patches are received before I spend more time on it.

This code is enough to test the implementation with the following test program. Adjust it for architectures other than x86 and x86-64.

#include
#include
#include
#include
#include
#include
#include
#include

typedef uint32_t __u32;
typedef uint64_t __u64;

union indirect_params {
  struct {
    int flags;
  } file_flags;
};

#ifdef __x86_64__
# define __NR_indirect 286
struct indirect_registers {
  __u64 rax;
  __u64 rdi;
  __u64 rsi;
  __u64 rdx;
  __u64 r10;
  __u64 r8;
  __u64 r9;
};
#elif defined __i386__
# define __NR_indirect 325
struct indirect_registers {
  __u32 eax;
  __u32 ebx;
  __u32 ecx;
  __u32 edx;
  __u32 esi;
  __u32 edi;
  __u32 ebp;
};
#else
# error "need to define __NR_indirect and struct indirect_params"
#endif

#define FILL_IN(var, values...) \
  var = (struct indirect_registers) { values }

int
main (void)
{
  int fd = socket (AF_INET, SOCK_DGRAM, IPPROTO_IP);
  int s1 = fcntl (fd, F_GETFD);
  int t1 = fcntl (fd, F_GETFL);
  printf ("old: FD_CLOEXEC %s set, NONBLOCK %s set\n",
          s1 == 0 ? "not" : "is", (t1 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  union indirect_params i;
  i.file_flags.flags = O_CLOEXEC|O_NONBLOCK;

  struct indirect_registers r;
#ifdef __NR_socketcall
# define SOCKOP_socket 1
  long args[3] = { AF_INET, SOCK_DGRAM, IPPROTO_IP };
  FILL_IN (r, __NR_socketcall, SOCKOP_socket, (long) args);
#else
  FILL_IN (r, __NR_socket, AF_INET, SOCK_DGRAM, IPPROTO_IP);
#endif

  fd = syscall (__NR_indirect, &r, &i, sizeof (i));
  int s2 = fcntl (fd, F_GETFD);
  int t2 = fcntl (fd, F_GETFL);
  printf ("new: FD_CLOEXEC %s set, NONBLOCK %s set\n",
          s2 == 0 ? "not" : "is", (t2 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  i.file_flags.flags = O_CLOEXEC;
  sigset_t ss;
  sigemptyset(&ss);
  FILL_IN(r, __NR_signalfd, -1, (long) &ss, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i));
  int s3 = fcntl (fd, F_GETFD);
  printf ("signalfd: FD_CLOEXEC %s set\n", s3 == 0 ? "not" : "is");
  close (fd);

  FILL_IN(r, __NR_eventfd, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i));
  int s4 = fcntl (fd, F_GETFD);
  printf ("eventfd: FD_CLOEXEC %s set\n", s4 == 0 ? "not" : "is");
  close (fd);

  return s1 != 0 || s2 == 0 || t1 != 0 || t2 == 0 || s3 == 0 || s4 == 0;
}

Signed-off-by: Ulrich Drepper <[EMAIL PROTECTED]>

 arch/x86/ia32/Makefile             |    1
 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/ia32/sys_ia32.c           |   37 +-
 arch/x86/kernel/syscall_table_32.S |    1
 include/asm-um/indirect.h          |    6 +
 include/asm-x86/ia32_unistd.h      |    1
 include/asm-x86/indi
[PATCHv4 6/6] FD_CLOEXEC support for eventfd, signalfd, timerfd
This patch adds support to set the FD_CLOEXEC flag for the file descriptors returned by eventfd, signalfd, timerfd.

 fs/anon_inodes.c              |   15 +++
 fs/eventfd.c                  |    5 +++--
 fs/signalfd.c                 |    6 --
 fs/timerfd.c                  |    6 --
 include/asm-x86/ia32_unistd.h |    3 +++
 include/linux/anon_inodes.h   |    3 +++
 include/linux/indirect.h      |    3 +++
 7 files changed, 31 insertions(+), 10 deletions(-)

--- fs/anon_inodes.c
+++ fs/anon_inodes.c
@@ -70,9 +70,9 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.
  */
-int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
-		     const char *name, const struct file_operations *fops,
-		     void *priv)
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file **pfile,
+			   const char *name, const struct file_operations *fops,
+			   void *priv, int flags)
 {
 	struct qstr this;
 	struct dentry *dentry;
@@ -85,7 +85,7 @@ int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 	if (!file)
 		return -ENFILE;
-	error = get_unused_fd();
+	error = get_unused_fd_flags(flags);
 	if (error < 0)
 		goto err_put_filp;
 	fd = error;
@@ -138,6 +138,13 @@ err_put_filp:
 	put_filp(file);
 	return error;
 }
+
+int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+		     const char *name, const struct file_operations *fops,
+		     void *priv)
+{
+	return anon_inode_getfd_flags(pfd, pinode, pfile, name, fops, priv, 0);
+}
 EXPORT_SYMBOL_GPL(anon_inode_getfd);
 /*
--- fs/eventfd.c
+++ fs/eventfd.c
@@ -215,8 +215,9 @@ asmlinkage long sys_eventfd(unsigned int count)
 	 * When we call this, the initialization must be complete, since
 	 * anon_inode_getfd() will install the fd.
 	 */
-	error = anon_inode_getfd(&fd, &inode, &file, "[eventfd]",
-				 &eventfd_fops, ctx);
+	error = anon_inode_getfd_flags(&fd, &inode, &file, "[eventfd]",
+				       &eventfd_fops, ctx,
+				       INDIRECT_PARAM(file_flags, flags));
 	if (!error)
 		return fd;
--- fs/signalfd.c
+++ fs/signalfd.c
@@ -224,8 +224,10 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemas
 		 * When we call this, the initialization must be complete, since
 		 * anon_inode_getfd() will install the fd.
 		 */
-		error = anon_inode_getfd(&ufd, &inode, &file, "[signalfd]",
-					 &signalfd_fops, ctx);
+		error = anon_inode_getfd_flags(&ufd, &inode, &file,
+					       "[signalfd]", &signalfd_fops,
+					       ctx, INDIRECT_PARAM(file_flags,
+								   flags));
 		if (error)
 			goto err_fdalloc;
 	} else {
--- fs/timerfd.c
+++ fs/timerfd.c
@@ -182,8 +182,10 @@ asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 		 * When we call this, the initialization must be complete, since
 		 * anon_inode_getfd() will install the fd.
 		 */
-		error = anon_inode_getfd(&ufd, &inode, &file, "[timerfd]",
-					 &timerfd_fops, ctx);
+		error = anon_inode_getfd_flags(&ufd, &inode, &file, "[timerfd]",
+					       &timerfd_fops, ctx,
+					       INDIRECT_PARAM(file_flags,
+							      flags));
 		if (error)
 			goto err_tmrcancel;
 	} else {
--- include/asm-x86/ia32_unistd.h
+++ include/asm-x86/ia32_unistd.h
@@ -15,5 +15,8 @@
 #define __NR_ia32_socketcall	102
 #define __NR_ia32_sigreturn	119
 #define __NR_ia32_rt_sigreturn	173
+#define __NR_ia32_signalfd	321
+#define __NR_ia32_timerfd	322
+#define __NR_ia32_eventfd	323
 #endif /* _ASM_X86_64_IA32_UNISTD_H_ */
--- include/linux/anon_inodes.h
+++ include/linux/anon_inodes.h
@@ -8,6 +8,9 @@
 #ifndef _LINUX_ANON_INODES_H
 #define _LINUX_ANON_INODES_H
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file **pfile,
+			   const char *name, const struct file_operations *fops,
+			   void *priv, int flags);
 int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 		     const char *name, const struct file_operations *fops,
 		     void *priv);
--- include/
[PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
This patch adds support for setting the O_NONBLOCK flag of the file descriptors returned by socket, socketpair, and accept.

 socket.c |   15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

--- net/socket.c
+++ net/socket.c
@@ -362,7 +362,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
@@ -384,7 +384,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file)
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
 		  &socket_file_ops);
 	SOCK_INODE(sock)->i_fop = &socket_file_ops;
-	file->f_flags = O_RDWR;
+	file->f_flags = O_RDWR | (flags & O_NONBLOCK);
 	file->f_pos = 0;
 	file->private_data = sock;
@@ -397,7 +397,7 @@ static int sock_map_fd_flags(struct socket *sock, int flags)
 	int fd = sock_alloc_fd(&newfile, flags);
 	if (likely(fd >= 0)) {
-		int err = sock_attach_fd(sock, newfile);
+		int err = sock_attach_fd(sock, newfile, flags);
 		if (unlikely(err < 0)) {
 			put_filp(newfile);
@@ -1268,12 +1268,14 @@ asmlinkage long sys_socketpair(int family, int type, int protocol,
 		goto out_release_both;
 	}
-	err = sock_attach_fd(sock1, newfile1);
+	err = sock_attach_fd(sock1, newfile1,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		goto out_fd2;
 	}
-	err = sock_attach_fd(sock2, newfile2);
+	err = sock_attach_fd(sock2, newfile2,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		fput(newfile1);
 		goto out_fd1;
@@ -1423,7 +1425,8 @@ asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr,
 		goto out_put;
 	}
-	err = sock_attach_fd(newsock, newfile);
+	err = sock_attach_fd(newsock, newfile,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (err < 0)
 		goto out_fd_simple;
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
David Miller wrote:
> FWIW, I think this indirect syscall stuff is the most ugly interface
> I've ever seen proposed for the kernel.

Well, the alternative is to introduce dozens of new interfaces. It was Linus who suggested this alternative. Plus, it seems that for syslets we need basically the same interface anyway.

> And I agree with all of the objections raised by both H. Pater Anvin
> and Eric Dumazet.

Eric had no arguments and HP's comments lack a viable alternative proposal.

> Where does this INDIRECT_PARAM() macro get defined? I do not
> see it being defined anywhere in these patches.

Defined in :

+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name

Not my idea, I was following one review comment.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv4 0/6] sys_indirect system call
Eric Dumazet wrote:
> I am wondering if some parts are missing from your ChangeLog
>
> You apparently added in v3 a new 'flags' parameter to indirect syscall
> but no trace of this change in Changelog, and why it was added. This
> seems to imply a future multiplexor.

This was mentioned in one of my mails. I added the parameter to accommodate Linus's and Zack's idea to use the functionality for syslets as well. Not really a multiplexer, it is meant to be an "execute synchronously or asynchronously" flag. In the latter case an additional parameter might be needed to indicate the notification mechanism.

> And no change in the test program reflecting this 'flags' new param, so
> it fails.

Yep, sorry, I didn't update the text by including the most recent test program. I'll do that for the next version.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv3 0/4] sys_indirect system call
dean gaudet wrote:
> as an application writer how do i access accept(2) with FD_CLOEXEC
> functionality? will glibc expose an accept2() with a flags param?

Not yet decided. There is the alternative to extend the accept() interface to have both interfaces:

  int accept(int, struct sockaddr *, socklen_t *);

and

  int accept(int, struct sockaddr *, socklen_t *, int);

We can do this with type safety even in C nowadays.

> if so... why don't we just have an accept2() syscall?

If you read the mails of my first submission you'll find that I explained this. I talked to Andrew and he favored new syscalls. But then I talked to Linus and he favored this approach. Probably especially because it can be used for syslets as well. And it is less code and data than introducing new syscalls.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
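One way the "both interfaces under one name" idea could be expressed in plain C is dispatch on the argument count with a variadic macro. The following is only a sketch under assumptions: the names my_accept, accept3, accept4_flags and struct fake_sockaddr are hypothetical stand-ins invented for the example, and nothing here claims to be what glibc does or planned to do.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sockaddr, so the sketch needs no sockets. */
struct fake_sockaddr { int family; };

/* Hypothetical three-argument entry point (classic accept semantics). */
static int accept3(int fd, struct fake_sockaddr *addr, unsigned *len)
{
    (void)fd; (void)addr; (void)len;
    return 3;                      /* marker value: the 3-argument form ran */
}

/* Hypothetical four-argument entry point taking an extra flags word. */
static int accept4_flags(int fd, struct fake_sockaddr *addr, unsigned *len,
                         int flags)
{
    (void)fd; (void)addr; (void)len; (void)flags;
    return 4;                      /* marker value: the 4-argument form ran */
}

/* Select the fifth macro argument; which name lands there depends on how
   many arguments the caller supplied. */
#define PICK_ACCEPT(_1, _2, _3, _4, name, ...) name

/* With 3 arguments this expands to accept3(...), with 4 arguments to
   accept4_flags(...).  Fewer arguments select an undeclared identifier and
   fail to compile.  Each call goes through a real prototype, so the
   compiler still checks every argument type. */
#define my_accept(...)                                              \
    PICK_ACCEPT(__VA_ARGS__, accept4_flags, accept3,                \
                accept_needs_3_or_4_args,                           \
                accept_needs_3_or_4_args)(__VA_ARGS__)
```

A call such as my_accept(fd, NULL, NULL) resolves to the old form, while my_accept(fd, NULL, NULL, flags) resolves to the flags-taking form.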
Re: [PATCHv4 2/6] x86&x86-64 support for sys_indirect
Heiko Carstens wrote:
> All these macros could be functions, or? Would give us some type checking
> and avoids the capital letters.

Should be possible now. I didn't do it initially since the macro used the macro for the largest syscall number. That macro wasn't always available. I'll test it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv4 0/6] sys_indirect system call
Zach Brown wrote:
> I'm sure the additional parameter will be needed, and it might be pretty
> involved. I think the current notion of syslets needs, at the very least:

All correct. I just want to point out that the proposed interface is sufficiently prepared for this and that there is no need to hold back this initial, synchronous syscall stuff until the syslet stuff is ready. These interface changes are security-relevant and should be added ASAP.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [PATCHv4 4/6] Allow setting FD_CLOEXEC flag for new sockets
Zach Brown wrote:
> Have you given thought to having to perform compat translation on this?
> Today it's only copied directly from the user pointer into the union
> in the task_struct.

Since there is no legacy interface to worry about all members added to the structure can and should be neutral of the word size. We've done this with some syscalls already (like pread64) where we always use the wide form in the parameter list. It's just simpler here since it does not have to split into two 32-bit registers.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
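To illustrate what "neutral of the word size" can mean in practice, here is a small sketch. The struct portable_params and its members are hypothetical and do not appear in the patches; the point is only that fixed-width members plus explicit padding give the same layout to 32-bit and 64-bit callers, so no compat translation is required.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block.  A "long" member would be 4 bytes for a
   32-bit caller and 8 bytes for a 64-bit one; fixed-width types avoid that.
   The explicit pad keeps the 64-bit member at offset 8 even on ABIs such
   as i386 that align 64-bit integers to only 4 bytes inside structs. */
struct portable_params {
    uint32_t flags;
    uint32_t pad;      /* explicit padding, always zero in this sketch */
    uint64_t offset;
};

/* Compile-time checks that the layout is the same everywhere. */
_Static_assert(sizeof(struct portable_params) == 16,
               "layout must not depend on the word size");
_Static_assert(offsetof(struct portable_params, offset) == 8,
               "64-bit member must sit at a fixed offset");

/* Small helper so the size can also be inspected at run time. */
static size_t portable_params_size(void)
{
    return sizeof(struct portable_params);
}
```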
Re: 2.6.24-rc3: find complains about /proc/net
Roland McGrath wrote:
> Oh, it seems it has indeed been that way for a very long time, so I was
> mistaken. It still seems a little odd to me. Ulrich can say definitively
> whether the kind of concern I mentioned really matters one way or the other
> for glibc.

glibc cannot survive (at least NPTL) if somebody uses funny CLONE_* flags to separate various pieces of information, e.g., file descriptors. So, all the information in each thread's /proc/self should be identical.

When the information is not the same, the current semantics seems to be more useful. So I guess, no change is the way to go here.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[PATCHv5 5/5] FD_CLOEXEC support for eventfd, signalfd, timerfd
This patch adds support to set the FD_CLOEXEC flag for the file descriptors
returned by eventfd, signalfd, timerfd.

 fs/anon_inodes.c              | 15 +++
 fs/eventfd.c                  |  5 +++--
 fs/signalfd.c                 |  6 --
 fs/timerfd.c                  |  6 --
 include/asm-x86/ia32_unistd.h |  3 +++
 include/linux/anon_inodes.h   |  3 +++
 include/linux/indirect.h      |  3 +++
 7 files changed, 31 insertions(+), 10 deletions(-)

--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -40,5 +40,8 @@ union indirect_params {
 #if INDSYSCALL(socketpair)
   case INDSYSCALL(socketpair):
 #endif
+  case INDSYSCALL(eventfd):
+  case INDSYSCALL(signalfd):
+  case INDSYSCALL(timerfd):
 
 #endif
--- linux/fs/anon_inodes.c
+++ linux/fs/anon_inodes.c
@@ -70,9 +70,9 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
  * hence saving memory and avoiding code duplication for the file/inode/dentry
  * setup.
  */
-int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
-		     const char *name, const struct file_operations *fops,
-		     void *priv)
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file **pfile,
+			   const char *name, const struct file_operations *fops,
+			   void *priv, int flags)
 {
 	struct qstr this;
 	struct dentry *dentry;
@@ -85,7 +85,7 @@ int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 	if (!file)
 		return -ENFILE;
 
-	error = get_unused_fd();
+	error = get_unused_fd_flags(flags);
 	if (error < 0)
 		goto err_put_filp;
 	fd = error;
@@ -138,6 +138,13 @@ err_put_filp:
 	put_filp(file);
 	return error;
 }
+
+int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
+		     const char *name, const struct file_operations *fops,
+		     void *priv)
+{
+	return anon_inode_getfd_flags(pfd, pinode, pfile, name, fops, priv, 0);
+}
 EXPORT_SYMBOL_GPL(anon_inode_getfd);
 
 /*
--- linux/include/linux/anon_inodes.h
+++ linux/include/linux/anon_inodes.h
@@ -8,6 +8,9 @@
 #ifndef _LINUX_ANON_INODES_H
 #define _LINUX_ANON_INODES_H
 
+int anon_inode_getfd_flags(int *pfd, struct inode **pinode, struct file **pfile,
+			   const char *name, const struct file_operations *fops,
+			   void *priv, int flags);
 int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile,
 		     const char *name, const struct file_operations *fops,
 		     void *priv);
--- linux/fs/eventfd.c
+++ linux/fs/eventfd.c
@@ -215,8 +215,9 @@ asmlinkage long sys_eventfd(unsigned int count)
 	 * When we call this, the initialization must be complete, since
 	 * anon_inode_getfd() will install the fd.
 	 */
-	error = anon_inode_getfd(&fd, &inode, &file, "[eventfd]",
-				 &eventfd_fops, ctx);
+	error = anon_inode_getfd_flags(&fd, &inode, &file, "[eventfd]",
+				       &eventfd_fops, ctx,
+				       INDIRECT_PARAM(file_flags, flags));
 	if (!error)
 		return fd;
 
--- linux/fs/signalfd.c
+++ linux/fs/signalfd.c
@@ -224,8 +224,10 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemas
 		 * When we call this, the initialization must be complete, since
 		 * anon_inode_getfd() will install the fd.
 		 */
-		error = anon_inode_getfd(&ufd, &inode, &file, "[signalfd]",
-					 &signalfd_fops, ctx);
+		error = anon_inode_getfd_flags(&ufd, &inode, &file,
+					       "[signalfd]", &signalfd_fops,
+					       ctx, INDIRECT_PARAM(file_flags,
+								   flags));
 		if (error)
 			goto err_fdalloc;
 	} else {
--- linux/fs/timerfd.c
+++ linux/fs/timerfd.c
@@ -182,8 +182,10 @@ asmlinkage long sys_timerfd(int ufd, int clockid, int flags,
 		 * When we call this, the initialization must be complete, since
 		 * anon_inode_getfd() will install the fd.
 		 */
-		error = anon_inode_getfd(&ufd, &inode, &file, "[timerfd]",
-					 &timerfd_fops, ctx);
+		error = anon_inode_getfd_flags(&ufd, &inode, &file, "[timerfd]",
+					       &timerfd_fops, ctx,
+					       INDIRECT_PARAM(file_flags,
+							      flags));
 		if (error)
 			goto err_tmrcancel;
 	} else {
--- linux/include/
[PATCHv5 4/5] Allow setting O_NONBLOCK flag for new sockets
This patch adds support for setting the O_NONBLOCK flag of the file
descriptors returned by socket, socketpair, and accept.

 socket.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

--- linux/net/socket.c
+++ linux/net/socket.c
@@ -362,7 +362,7 @@ static int sock_alloc_fd(struct file **filep, int flags)
 	return fd;
 }
 
-static int sock_attach_fd(struct socket *sock, struct file *file)
+static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
@@ -384,7 +384,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file)
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,
 		  &socket_file_ops);
 	SOCK_INODE(sock)->i_fop = &socket_file_ops;
-	file->f_flags = O_RDWR;
+	file->f_flags = O_RDWR | (flags & O_NONBLOCK);
 	file->f_pos = 0;
 	file->private_data = sock;
 
@@ -397,7 +397,7 @@ static int sock_map_fd_flags(struct socket *sock, int flags)
 	int fd = sock_alloc_fd(&newfile, flags);
 
 	if (likely(fd >= 0)) {
-		int err = sock_attach_fd(sock, newfile);
+		int err = sock_attach_fd(sock, newfile, flags);
 
 		if (unlikely(err < 0)) {
 			put_filp(newfile);
@@ -1268,12 +1268,14 @@ asmlinkage long sys_socketpair(int family, int type, int protocol,
 		goto out_release_both;
 	}
 
-	err = sock_attach_fd(sock1, newfile1);
+	err = sock_attach_fd(sock1, newfile1,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		goto out_fd2;
 	}
 
-	err = sock_attach_fd(sock2, newfile2);
+	err = sock_attach_fd(sock2, newfile2,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(err < 0)) {
 		fput(newfile1);
 		goto out_fd1;
@@ -1423,7 +1425,8 @@ asmlinkage long sys_accept(int fd, struct sockaddr __user *upeer_sockaddr,
 		goto out_put;
 	}
 
-	err = sock_attach_fd(newsock, newfile);
+	err = sock_attach_fd(newsock, newfile,
+			     INDIRECT_PARAM(file_flags, flags));
 	if (err < 0)
 		goto out_fd_simple;
 
[PATCHv5 3/5] Allow setting FD_CLOEXEC flag for new sockets
This is a first user of sys_indirect. Several of the socket-related system
calls which produce a file handle now can be passed an additional parameter
to set the FD_CLOEXEC flag.

 include/asm-x86/ia32_unistd.h |  1 +
 include/linux/indirect.h      | 27 +++
 net/socket.c                  | 21 +
 3 files changed, 41 insertions(+), 8 deletions(-)

diff -u linux/include/linux/indirect.h linux/include/linux/indirect.h
--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -1,3 +1,4 @@
+#ifndef INDSYSCALL
 #ifndef _LINUX_INDIRECT_H
 #define _LINUX_INDIRECT_H
 
@@ -13,5 +14,31 @@
+  struct {
+    int flags;
+  } file_flags;
 };
 
 #define INDIRECT_PARAM(set, name) current->indirect_params.set.name
 
 #endif
+#else
+
+/* Here comes the list of system calls which can be called through
+   sys_indirect.  When the list of supported system calls is needed the
+   file including this header is supposed to define a macro "INDSYSCALL"
+   which adds a prefix fitting to the use.  If the resulting macro is
+   defined we generate a line
+	case MACRO:
+   */
+#if INDSYSCALL(accept)
+  case INDSYSCALL(accept):
+#endif
+#if INDSYSCALL(socket)
+  case INDSYSCALL(socket):
+#endif
+#if INDSYSCALL(socketcall)
+  case INDSYSCALL(socketcall):
+#endif
+#if INDSYSCALL(socketpair)
+  case INDSYSCALL(socketpair):
+#endif
+
+#endif
--- linux/include/asm-x86/ia32_unistd.h
+++ linux/include/asm-x86/ia32_unistd.h
@@ -12,6 +12,7 @@
 #define __NR_ia32_exit		  1
 #define __NR_ia32_read		  3
 #define __NR_ia32_write		  4
+#define __NR_ia32_socketcall	102
 #define __NR_ia32_sigreturn	119
 #define __NR_ia32_rt_sigreturn	173
 
diff -u linux/net/socket.c linux/net/socket.c
--- linux/net/socket.c
+++ linux/net/socket.c
@@ -344,11 +344,11 @@
  *	but we take care of internal coherence yet.
  */
 
-static int sock_alloc_fd(struct file **filep)
+static int sock_alloc_fd(struct file **filep, int flags)
 {
 	int fd;
 
-	fd = get_unused_fd();
+	fd = get_unused_fd_flags(flags);
 	if (likely(fd >= 0)) {
 		struct file *file = get_empty_filp();
 
@@ -391,10 +391,10 @@
 	return 0;
 }
 
-int sock_map_fd(struct socket *sock)
+static int sock_map_fd_flags(struct socket *sock, int flags)
 {
 	struct file *newfile;
-	int fd = sock_alloc_fd(&newfile);
+	int fd = sock_alloc_fd(&newfile, flags);
 
 	if (likely(fd >= 0)) {
 		int err = sock_attach_fd(sock, newfile);
@@ -409,6 +409,11 @@
 	return fd;
 }
 
+int sock_map_fd(struct socket *sock)
+{
+	return sock_map_fd_flags(sock, 0);
+}
+
 static struct socket *sock_from_file(struct file *file, int *err)
 {
 	if (file->f_op == &socket_file_ops)
@@ -1208,7 +1213,7 @@
 	if (retval < 0)
 		goto out;
 
-	retval = sock_map_fd(sock);
+	retval = sock_map_fd_flags(sock, INDIRECT_PARAM(file_flags, flags));
 	if (retval < 0)
 		goto out_release;
 
@@ -1249,13 +1254,13 @@
 	if (err < 0)
 		goto out_release_both;
 
-	fd1 = sock_alloc_fd(&newfile1);
+	fd1 = sock_alloc_fd(&newfile1, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(fd1 < 0)) {
 		err = fd1;
 		goto out_release_both;
 	}
 
-	fd2 = sock_alloc_fd(&newfile2);
+	fd2 = sock_alloc_fd(&newfile2, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(fd2 < 0)) {
 		err = fd2;
 		put_filp(newfile1);
@@ -1411,7 +1416,7 @@
 	 */
 	__module_get(newsock->ops->owner);
 
-	newfd = sock_alloc_fd(&newfile);
+	newfd = sock_alloc_fd(&newfile, INDIRECT_PARAM(file_flags, flags));
 	if (unlikely(newfd < 0)) {
 		err = newfd;
 		sock_release(newsock);
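The patch keeps the list of permitted calls in one header that is included with different INDSYSCALL prefixes. A closely related single-file technique is the X-macro list, sketched below. This is a swapped-in variant for illustration, not the patch's mechanism: INDIRECT_CALLS, NATIVE_CASE, COMPAT_CASE, both functions and all syscall numbers are hypothetical, chosen only for the example.

```c
#include <stdbool.h>

/* Hypothetical syscall numbers for a native and a compat table. */
enum { NR_socket = 41, NR_accept = 43, NR_socketpair = 53, NR_execve = 59 };
enum { NR32_socket = 341, NR32_accept = 343, NR32_socketpair = 353 };

/* The single place that names every call allowed through the indirect
   path.  Adding a call here updates both checks below. */
#define INDIRECT_CALLS(X) X(accept) X(socket) X(socketpair)

/* Two prefixes applied to the same list, one per number space. */
#define NATIVE_CASE(name) case NR_##name:
#define COMPAT_CASE(name) case NR32_##name:

/* White-list check for the native table. */
static bool native_allowed(int nr)
{
    switch (nr) {
    INDIRECT_CALLS(NATIVE_CASE)
        return true;
    default:
        return false;
    }
}

/* The same list, generated again with the compat numbers. */
static bool compat_allowed(int nr)
{
    switch (nr) {
    INDIRECT_CALLS(COMPAT_CASE)
        return true;
    default:
        return false;
    }
}
```

As in the patch, the list of names exists once and each user only supplies the prefix that turns a name into a case label.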
[PATCHv5 1/5] actual sys_indirect code
This is the actual architecture-independent part of the system call
implementation.

 include/linux/indirect.h | 17 +
 include/linux/sched.h    |  4
 include/linux/syscalls.h |  4
 kernel/Makefile          |  3 +++
 kernel/indirect.c        | 40
 5 files changed, 68 insertions(+)

diff -u linux/include/linux/indirect.h linux/include/linux/indirect.h
--- linux/include/linux/indirect.h
+++ linux/include/linux/indirect.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_INDIRECT_H
+#define _LINUX_INDIRECT_H
+
+#include
+
+
+/* IMPORTANT:
+   All the elements of this union must be neutral to the word size
+   and must not require reworking when used in compat syscalls.  Use
+   fixed-size types or types which are known to not vary in size across
+   architectures.  */
+union indirect_params {
+};
+
+#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
+
+#endif
diff -u linux/kernel/Makefile linux/kernel/Makefile
--- linux/kernel/Makefile
+++ linux/kernel/Makefile
@@ -57,6 +57,7 @@
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_MARKERS) += marker.o
+obj-$(CONFIG_ARCH_HAS_INDIRECT_SYSCALLS) += indirect.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <[EMAIL PROTECTED]>, the -fno-omit-frame-pointer is
@@ -67,6 +68,8 @@
 CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
 endif
 
+CFLAGS_indirect.o = -Wno-undef
+
 $(obj)/configs.o: $(obj)/config_data.h
 
 # config_data.h contains the same information as ikconfig.h but gzipped.
diff -u linux/kernel/indirect.c linux/kernel/indirect.c
--- linux/kernel/indirect.c
+++ linux/kernel/indirect.c
@@ -0,0 +1,40 @@
+#include
+#include
+#include
+#include
+
+
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+			     void __user *userparams, size_t paramslen,
+			     int flags)
+{
+	struct indirect_registers regs;
+	long result;
+
+	if (unlikely(flags != 0))
+		return -EINVAL;
+
+	if (copy_from_user(&regs, userregs, sizeof(regs)))
+		return -EFAULT;
+
+	switch (INDIRECT_SYSCALL (&regs))
+	{
+#define INDSYSCALL(name) __NR_##name
+#include
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	if (paramslen > sizeof(union indirect_params))
+		return -EINVAL;
+
+	result = -EFAULT;
+	if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+		result = call_indirect(&regs);
+
+	memset(&current->indirect_params, '\0', paramslen);
+
+	return result;
+}
diff -u linux/include/linux/syscalls.h linux/include/linux/syscalls.h
--- linux/include/linux/syscalls.h
+++ linux/include/linux/syscalls.h
@@ -54,6 +54,7 @@
 struct compat_timeval;
 struct robust_list_head;
 struct getcpu_cache;
+struct indirect_registers;
 
 #include
 #include
@@ -611,6 +612,9 @@
 				    const struct itimerspec __user *utmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
+			     void __user *userparams, size_t paramslen,
+			     int flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
--- linux/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -80,6 +80,7 @@ struct sched_param {
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -1174,6 +1175,9 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+
+	/* Additional system call parameters.  */
+	union indirect_params indirect_params;
 };
 
 /*
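The control flow of sys_indirect (reject unknown flags, white-list the call, bound and copy the extra parameters, run the call, clear the parameters again) can be modelled in user space. The sketch below is an assumption-laden toy, not kernel code: task_params stands in for current->indirect_params, and demo_indirect, demo_socket and the DEMO_NR_* numbers are hypothetical names made up for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

union indirect_params {
    struct {
        int flags;
    } file_flags;
};

/* Stand-in for the per-task storage the patch adds to task_struct. */
static union indirect_params task_params;

/* Hypothetical call numbers; only DEMO_NR_SOCKET is white-listed. */
enum { DEMO_NR_SOCKET = 1, DEMO_NR_EXECVE = 2 };

/* Toy "system call": it reads its extra argument from the task storage,
   the way the patched socket code uses INDIRECT_PARAM(file_flags, flags). */
static long demo_socket(void)
{
    return task_params.file_flags.flags;
}

static long demo_indirect(int nr, const void *params, size_t paramslen,
                          int flags)
{
    long result;

    if (flags != 0)                 /* reserved for future syslet use */
        return -EINVAL;

    switch (nr) {                   /* white-list of permitted calls */
    case DEMO_NR_SOCKET:
        break;
    default:
        return -EINVAL;
    }

    if (paramslen > sizeof(union indirect_params))
        return -EINVAL;

    memcpy(&task_params, params, paramslen);   /* copy_from_user stand-in */
    result = demo_socket();

    /* Clear the storage so a later ordinary call sees no stale flags. */
    memset(&task_params, 0, paramslen);

    return result;
}
```

The final memset is what lets an ordinary, non-indirect call observe all-zero parameters and therefore behave exactly as before.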
[PATCHv5 0/5] sys_indirect system call
The following patches provide an alternative implementation of the sys_indirect system call which has been discussed a few times. This is a system call that allows us to extend existing system call interfaces by adding more system call parameters.

Davide's previous implementation is IMO far more complex than warranted. This code here is trivial, as you can see. I've discussed this approach with Linus recently and for a brief moment we actually agreed on something.

We pass an additional block of data to the kernel, it is copied into the task_struct, and then it is up to the function implementing the system call to interpret the data. Each system call, which is meant to be extended this way, has to be white-listed in sys_indirect. The alternative is to filter out those system calls which absolutely cannot be handled using sys_indirect (like clone, execve) since they require the stack layout of an ordinary system call. This is more dangerous since it is too easy to miss a call.

Note that the sys_indirect system call takes an additional parameter which is for now forced to be zero. This parameter is meant to enable the use of sys_indirect to create syslets, asynchronously executed system calls. This syslet approach is also the main reason for the interface in the form proposed here.

The code for x86 and x86-64 gets by without a single line of assembly code. This is likely to be true for many other archs as well. There is architecture-dependent code, though.

The last three patches show the first application of the functionality. They also show a complication: we need the test for valid sub-syscalls in the main implementation and in the compatibility code. And more: the actual sources and generated binary for the test are very different (the numbers differ). Duplicating the information is a big problem, though. I've used some macro tricks to avoid this. All the information about the flags and the system calls using them is concentrated in one header. This should keep maintenance bearable.

This patch to use sys_indirect is just the beginning. More will follow, but I want to see how these patches are received before I spend more time on it. This code is enough to test the implementation with the following test program. Adjust it for architectures other than x86 and x86-64.

What is not addressed are differences in opinion about the whole approach. Maybe Linus can chime in and defend what is basically his design.

#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/syscall.h>

typedef uint32_t __u32;
typedef uint64_t __u64;

union indirect_params {
  struct {
    int flags;
  } file_flags;
};

#ifdef __x86_64__
# define __NR_indirect 286
struct indirect_registers {
  __u64 rax;
  __u64 rdi;
  __u64 rsi;
  __u64 rdx;
  __u64 r10;
  __u64 r8;
  __u64 r9;
};
#elif defined __i386__
# define __NR_indirect 325
struct indirect_registers {
  __u32 eax;
  __u32 ebx;
  __u32 ecx;
  __u32 edx;
  __u32 esi;
  __u32 edi;
  __u32 ebp;
};
#else
# error "need to define __NR_indirect and struct indirect_params"
#endif

#define FILL_IN(var, values...) \
  var = (struct indirect_registers) { values }

int
main (void)
{
  /* Baseline: an ordinary socket has neither flag set.  */
  int fd = socket (AF_INET, SOCK_DGRAM, IPPROTO_IP);
  int s1 = fcntl (fd, F_GETFD);
  int t1 = fcntl (fd, F_GETFL);
  printf ("old: FD_CLOEXEC %s set, NONBLOCK %s set\n",
          s1 == 0 ? "not" : "is", (t1 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  union indirect_params i;
  memset(&i, '\0', sizeof(i));
  i.file_flags.flags = O_CLOEXEC|O_NONBLOCK;

  /* Same socket call, this time routed through sys_indirect.  */
  struct indirect_registers r;
#ifdef __NR_socketcall
# define SOCKOP_socket 1
  long args[3] = { AF_INET, SOCK_DGRAM, IPPROTO_IP };
  FILL_IN (r, __NR_socketcall, SOCKOP_socket, (long) args);
#else
  FILL_IN (r, __NR_socket, AF_INET, SOCK_DGRAM, IPPROTO_IP);
#endif

  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s2 = fcntl (fd, F_GETFD);
  int t2 = fcntl (fd, F_GETFL);
  printf ("new: FD_CLOEXEC %s set, NONBLOCK %s set\n",
          s2 == 0 ? "not" : "is", (t2 & O_NONBLOCK) ? "is" : "not");
  close (fd);

  i.file_flags.flags = O_CLOEXEC;
  sigset_t ss;
  sigemptyset(&ss);
  FILL_IN(r, __NR_signalfd, -1, (long) &ss, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s3 = fcntl (fd, F_GETFD);
  printf ("signalfd: FD_CLOEXEC %s set\n", s3 == 0 ? "not" : "is");
  close (fd);

  FILL_IN(r, __NR_eventfd, 8);
  fd = syscall (__NR_indirect, &r, &i, sizeof (i), 0);
  int s4 = fcntl (fd, F_GETFD);
  printf ("eventfd: FD_CLOEXEC %s set\n", s4 == 0 ? "not" : "is");
  close (fd);

  /* F_GETFL also reports the access mode (O_RDWR for a socket), so only
     the O_NONBLOCK bit is compared here.  */
  return s1 != 0 || s2 == 0
         || (t1 & O_NONBLOCK) != 0 || (t2 & O_NONBLOCK) == 0
         || s3 == 0 || s4 == 0;
}
[PATCHv5 2/5] x86&x86-64 support for sys_indirect
This part adds support for sys_indirect on x86 and x86-64.

 arch/x86/Kconfig                   |  3 ++
 arch/x86/ia32/Makefile             |  1
 arch/x86/ia32/ia32entry.S          |  2 +
 arch/x86/ia32/sys_ia32.c           | 38 +
 arch/x86/kernel/syscall_table_32.S |  1
 include/asm-x86/indirect.h         |  5
 include/asm-x86/indirect_32.h      | 25
 include/asm-x86/indirect_64.h      | 36 +++
 include/asm-x86/unistd_32.h        |  3 +-
 include/asm-x86/unistd_64.h        |  2 +
 10 files changed, 115 insertions(+), 1 deletion(-)

--- linux/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -112,6 +112,9 @@ config GENERIC_TIME_VSYSCALL
 	bool
 	default X86_64
 
+config ARCH_HAS_INDIRECT_SYSCALLS
+	def_bool y
+
diff -u linux/include/asm-x86/indirect_32.h linux/include/asm-x86/indirect_32.h
--- linux/include/asm-x86/indirect_32.h
+++ linux/include/asm-x86/indirect_32.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_INDIRECT_32_H
+#define _ASM_X86_INDIRECT_32_H
+
+struct indirect_registers {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+	__u32 esi;
+	__u32 edi;
+	__u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->eax
+
+static inline long call_indirect(struct indirect_registers *regs)
+{
+  extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32);
+
+  return sys_call_table[INDIRECT_SYSCALL(regs)](regs->ebx, regs->ecx,
+						regs->edx, regs->esi,
+						regs->edi, regs->ebp);
+}
+
+#endif
diff -u linux/include/asm-x86/indirect_64.h linux/include/asm-x86/indirect_64.h
--- linux/include/asm-x86/indirect_64.h
+++ linux/include/asm-x86/indirect_64.h
@@ -0,0 +1,36 @@
+#ifndef _ASM_X86_INDIRECT_64_H
+#define _ASM_X86_INDIRECT_64_H
+
+struct indirect_registers {
+	__u64 rax;
+	__u64 rdi;
+	__u64 rsi;
+	__u64 rdx;
+	__u64 r10;
+	__u64 r8;
+	__u64 r9;
+};
+
+struct indirect_registers32 {
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+	__u32 esi;
+	__u32 edi;
+	__u32 ebp;
+};
+
+#define INDIRECT_SYSCALL(regs) (regs)->rax
+#define INDIRECT_SYSCALL32(regs) (regs)->eax
+
+static inline long call_indirect(struct indirect_registers *regs)
+{
+  extern long (*sys_call_table[]) (__u64, __u64, __u64, __u64, __u64, __u64);
+
+  return sys_call_table[INDIRECT_SYSCALL(regs)](regs->rdi, regs->rsi,
+						regs->rdx, regs->r10,
+						regs->r8, regs->r9);
+}
+
+#endif
diff -u linux/arch/x86/ia32/sys_ia32.c linux/arch/x86/ia32/sys_ia32.c
--- linux/arch/x86/ia32/sys_ia32.c
+++ linux/arch/x86/ia32/sys_ia32.c
@@ -889,0 +890,38 @@
+
+asmlinkage long sys32_indirect(struct indirect_registers32 __user *userregs,
+			       void __user *userparams, size_t paramslen,
+			       int flags)
+{
+	extern long (*ia32_sys_call_table[])(u32, u32, u32, u32, u32, u32);
+
+	struct indirect_registers32 regs;
+	long result;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	if (copy_from_user(&regs, userregs, sizeof(regs)))
+		return -EFAULT;
+
+	switch (INDIRECT_SYSCALL32(&regs))
+	{
+#define INDSYSCALL(name) __NR_ia32_##name
+#include
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	if (paramslen > sizeof(union indirect_params))
+		return -EINVAL;
+	result = -EFAULT;
+	if (!copy_from_user(&current->indirect_params, userparams, paramslen))
+		result = ia32_sys_call_table[regs.eax](regs.ebx, regs.ecx,
+						      regs.edx, regs.esi,
+						      regs.edi, regs.ebp);
+
+	memset(&current->indirect_params, '\0', paramslen);
+
+	return result;
+}
--- linux/arch/x86/ia32/Makefile
+++ linux/arch/x86/ia32/Makefile
@@ -36,6 +36,7 @@ $(obj)/vsyscall-sysenter.so.dbg $(obj)/vsyscall-syscall.so.dbg: \
 $(obj)/vsyscall-%.so.dbg: $(src)/vsyscall.lds $(obj)/vsyscall-%.o FORCE
 	$(call if_changed,syscall)
 
+CFLAGS_sys_ia32.o = -Wno-undef
 AFLAGS_vsyscall-sysenter.o = -m32 -Wa,-32
 AFLAGS_vsyscall-syscall.o = -m32 -Wa,-32
 
--- linux/arch/x86/ia32/ia32entry.S
+++ linux/arch/x86/ia32/ia32entry.S
@@ -400,6 +400,7 @@ END(ia32_ptregs_common)
 
 	.section .rodata,"a"
 	.align 8
+	.globl ia32_sys_call_table
 ia32_sys_call_table:
 	.quad sys_restart_syscall
 	.quad sys_exit
@@ -726,4 +727,5 @@ ia32_sys_call_table:
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
 	.quad sys32_fallocate
+	.quad sys32_indirect		/* 325 */
 ia32_syscall_end:
--- linux/arch/x86/kernel/syscall_table_32.S
+++ linux/arch/x86
Re: Where is the new timerfd?
On Nov 23, 2007 9:29 AM, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> Yes, it's disabled, and yes, I'll repost today ...

I haven't seen the patch and don't feel like searching. So I say it here: please make sure you add a flags parameter to the system call itself (instead of adding it on as for eventfd and signalfd). We need to be able to use O_CLOEXEC some way or another.
Re: [PATCHv5 4/5] Allow setting O_NONBLOCK flag for new sockets
Eric Dumazet wrote:
> 1) Can the fd passing with recvmsg() on AF_UNIX also gets O_CLOEXEC
> support ?

Already there, see MSG_CMSG_CLOEXEC.

> 2) Why this O_NONBLOCK ability is needed for sockets ? Is it a security
> issue, and if yes could you remind it to me ?

No security issue. But look at any correct network program, all need to set the mode to non-blocking. Adding this support to the syscall comes at almost no cost and it cuts the cost for every program down by one or two syscalls.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
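For reference, the extra calls mentioned in the reply look roughly like this in a program that creates a non-blocking socket with the classic interfaces. It is a minimal sketch using only standard POSIX calls; the helper name nonblocking_socket is made up for the example, and AF_UNIX is used only so that it runs without network configuration.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a datagram socket and switch it to non-blocking mode.
   Returns the descriptor, or -1 on failure. */
static int nonblocking_socket(void)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);       /* syscall 1 */
    if (fd < 0)
        return -1;

    int fl = fcntl(fd, F_GETFL);                   /* syscall 2 */
    if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0) { /* syscall 3 */
        close(fd);
        return -1;
    }
    return fd;
}
```

With the flag handed to the kernel together with the socket call, the two fcntl() round trips would no longer be needed.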
Re: [PATCHv5 4/5] Allow setting O_NONBLOCK flag for new sockets
On Nov 24, 2007 12:28 AM, Eric Dumazet <[EMAIL PROTECTED]> wrote:
> OK, but maybe for consistency, we might accept the two mechanisms.

It's not a question of the kernel interface. The issue with all these extensions is the userlevel interface. Ideally no new userlevel interface is needed. This is the case for open() and incidentally also for this case (through the flags parameter for recvmsg). For socket(), accept(), the situation is unfortunately different and we need a new interface.

With your proposed patch, we would have to introduce another recvmsg() interface to take advantage of the additional functionality. This just doesn't make any sense.

This is no contest in aesthetics. You first have to think about the interface presented to the programmer at userlevel and then design the syscall interface. This is how MSG_CMSG_CLOEXEC came about.
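The MSG_CMSG_CLOEXEC path mentioned above can be exercised from user space as follows. This is a sketch assuming a Linux system whose kernel and C library provide the flag; the helper names send_fd, recv_fd_cloexec and received_fd_is_cloexec, and the union fd_control, are hypothetical and exist only for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one descriptor, aligned for cmsghdr. */
union fd_control {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Pass fd_to_send across the AF_UNIX socket sock.  Returns 0 on success. */
static int send_fd(int sock, int fd_to_send)
{
    char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cm;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; MSG_CMSG_CLOEXEC asks the kernel to install it
   with close-on-exec already set.  Returns the descriptor or -1. */
static int recv_fd_cloexec(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg;
    struct cmsghdr *cm;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
        return -1;
    cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET
        || cm->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}

/* Returns 1 if the received descriptor has FD_CLOEXEC set, 0 if it does
   not, and -1 on any error. */
static int received_fd_is_cloexec(void)
{
    int sv[2];
    int fd, flags, result = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (send_fd(sv[0], sv[0]) == 0 && (fd = recv_fd_cloexec(sv[1])) >= 0) {
        flags = fcntl(fd, F_GETFD);
        if (flags >= 0)
            result = (flags & FD_CLOEXEC) ? 1 : 0;
        close(fd);
    }
    close(sv[0]);
    close(sv[1]);
    return result;
}
```

No new function is needed at user level here: the existing flags argument of recvmsg() carries the request, which is the design point made in the message above.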
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
H. Peter Anvin wrote:
> The 6-word limit is a red herring. There is at least two ways to deal
> with it (and this doesn't mean wiping the legacy stuff we already have):
>
> - Let each architecture pick a calling convention and redefine the
> architecture-independent bits to take an arbitrary number of arguments.
> This is a one-time panarchitectural change.
> [...]

Just think beyond wishful thinking for a moment. What does it take to come up with something completely new and grand?

Let's start with the basics: you need to signal that the new syscall calling convention is used. Since the syscall entry code is limited (at least the likes of syscall/sysenter, it would be easy enough to use int $0x81 in addition to int $0x80) you would have to extend the use of the syscall number while keeping binary compatibility. This means additional costs for every single syscall.

Once you're past that, how do you implement the expandable syscall parameter count? There are two ways:

- pass to the real sys_* implementations the number of provided syscall parameters and have each function figure out what this means

- dynamically construct a call to the sys_* functions where the syscall magic adds an appropriate number of parameters filled with zeros. This is quite complicated and, more importantly, it requires that you have code/data somewhere which specifies how many parameters each of the sys_* functions actually requires. The actual sys_* code and the data have to be kept in sync at all times. A maintenance nightmare.

The handling of syscalls with many parameters should not be a driver of this design at all. Syscalls shouldn't be that complicated; I completely agree with Ingo.

I'm perfectly willing to give you the benefit of the doubt: show us a design for what you're proposing which is not slower than the current code, doesn't impact existing code, and solves the problem in a nice and clean way. I cannot really see it now but I might be missing something.

The sys_indirect approach ain't pretty but it does its job, doesn't impact performance, and is expandable in a direction we *know* we will want to go very soon.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
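To make the second option above concrete, here is a minimal user-space sketch of a dispatcher that zero-fills missing parameters from a per-call argument-count table. It is not kernel code and not anything proposed in the thread: `call_table`, `dispatch`, `demo_add2` and `demo_sum3` are hypothetical names chosen for illustration only. The point it shows is the coupling the mail warns about: the `nargs` column has to be edited by hand every time a handler's real parameter list changes.

```c
#include <stddef.h>

/* Hypothetical handler type: every handler accepts the full six words. */
typedef long (*handler_fn)(long, long, long, long, long, long);

struct call_entry {
    handler_fn fn;
    unsigned nargs;   /* must be kept in sync with what fn really uses */
};

/* Two toy handlers standing in for sys_* functions. */
static long demo_add2(long a, long b, long c, long d, long e, long f)
{
    (void)c; (void)d; (void)e; (void)f;
    return a + b;
}

static long demo_sum3(long a, long b, long c, long d, long e, long f)
{
    (void)d; (void)e; (void)f;
    return a + b + c;
}

/* The table that duplicates knowledge already present in the handlers. */
static const struct call_entry call_table[] = {
    { demo_add2, 2 },
    { demo_sum3, 3 },
};

/*
 * Call handler number nr with `provided` arguments taken from args.
 * Arguments the caller did not supply are passed as zero.
 * Returns -1 for an unknown call number or for too many arguments.
 */
long dispatch(unsigned nr, const long *args, unsigned provided)
{
    long padded[6] = { 0, 0, 0, 0, 0, 0 };
    const struct call_entry *entry;
    unsigned i;

    if (nr >= sizeof(call_table) / sizeof(call_table[0]))
        return -1;
    entry = &call_table[nr];
    if (provided > entry->nargs)
        return -1;

    for (i = 0; i < provided; i++)
        padded[i] = args[i];

    return entry->fn(padded[0], padded[1], padded[2],
                     padded[3], padded[4], padded[5]);
}
```

Calling `dispatch(1, args, 1)` with a single argument runs `demo_sum3` with the two missing parameters set to zero. If `demo_sum3` later grew a fourth parameter and the table still said 3, nothing in the compiler would notice.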
Re: [PATCHv4 5/6] Allow setting O_NONBLOCK flag for new sockets
H. Peter Anvin wrote:
> No.
>
> I already said I'm not looking at changing the calling convention for
> existing syscalls.

I did not suggest or ask for that at all. I was asking you to consider the real implementation details for a new syscall mechanism. We do not want to abandon the use of syscall/sysenter and go back to int (on x86/x86-64). This means that you have to come up with a mechanism which hooks into the current syscall/sysenter path while preserving full backward compatibility. Now it's your turn. How do you do this without additional costs?

> Hardly so, as evidenced by the fact that we have successfully done so
> for 15 years already; a number of Linux architectures require this
> information for the existing system calls.

Nothing at this scale is there at the moment, as far as I can see. And nothing so critical to get right.

Talk is cheap. You still haven't shown one bit of design for how you want to achieve your grand goal. The time for hand-waving is over. Do some work or step out of the way. Nothing you have said so far in the least convinces me, and your arguments like "sys_indirect adds parameters" are not really contested. Yes, that's what sys_indirect does. So what? It does this with almost no cost, which outweighs the ugliness factor in my book.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] Per-thread getrusage
Vinay Sridhar wrote:
> There are two ways to implement this in the kernel:
> 1) Introduce an additional parameter 'tid' to sys_getrusage() and put
> code in glibc to handle getrusage() and pthread_getrusage() calls
> correctly.
> 2) Introduce a new system call to handle pthread_getrusage() and leave
> sys_getrusage() untouched.

You're doing two things at once:

a) provide a way to get a thread's usage
b) provide a way to get another process's/thread's usage

The former is a trivial extension and I completely agree. RUSAGE_THREAD is trivial to implement and should go in ASAP.

The second part isn't that easy. The first question is: do we really need this? It is a new type of interface. We have the /proc filesystem etc. for programs which want to look at other processes' data.

Second, more importantly right now, your patch seems not to include any security support. Correct me if I'm wrong, but find_task_by_pid will always succeed, regardless of whether the calling thread belongs to another UID or not. I.e., your patch enables any process to read any other process's usage. That's a no-no.

I suggest that you split the patch in two. The first should implement RUSAGE_THREAD. You'll immediately get an ACK from me for that.

The second part then should introduce a way to get another process's usage. This patch should only be used initially as a starting point for discussions. You'll have to argue why it is necessary in the first place. The argument might have to do with why you want a pthread_getrusage() interface (which, btw, is a bad name since the interface is nothing like getrusage; getrusage doesn't allow requesting any other process's data). Yes, for intra-process lookups relying on /proc is not a good idea. But then, I have not seen any reason so far why such an API is needed and why a thread cannot just be responsible for reading its own usage data.

Anyway, if pthread_getrusage (or whatever it'll be called) is the only use, then the syscall should require that the TID parameter is from a thread in the same process, which would solve the security problem.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
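For reference, the first half of the split (a thread reading its own usage) looks roughly like this from user space on a kernel and libc that provide RUSAGE_THREAD. This is a minimal sketch: `thread_usage` is a hypothetical wrapper name, and the fallback definition assumes the Linux value of the constant in case the headers hide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* glibc exposes RUSAGE_THREAD only with this */
#endif
#include <sys/resource.h>

#ifndef RUSAGE_THREAD
#define RUSAGE_THREAD 1          /* assumption: Linux value, used only as a fallback */
#endif

/*
 * Fill *ru with the resource usage of the calling thread only.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EINVAL on a kernel that does not know RUSAGE_THREAD).
 */
int thread_usage(struct rusage *ru)
{
    return getrusage(RUSAGE_THREAD, ru);
}
```

Because the call can only ever describe the calling thread, none of the permission questions raised for the cross-process case come up here.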
Re: sigwait() and 2.6
On Tue, 15 Feb 2005 13:58:28 +0100, Yves Crespin <[EMAIL PROTECTED]> wrote:
>ThreadUnblockSignal();
>signo = WaitSignal();
>ThreadBlockSignal();

You expect this to work? Just read the POSIX spec or even the man pages. All signals sigwait() waits for must be blocked before the call. You deliberately do the opposite. Swap the ThreadUnblockSignal and ThreadBlockSignal lines and suddenly the program doesn't crash anymore.
Re: close-exec flag not working in 2.6.9?
On Sun, 30 Jan 2005 23:56:07 -0800, Ben Greear <[EMAIL PROTECTED]> wrote:
>flags = fcntl(s, F_GETFL);
>flags |= (FD_CLOEXEC);
>if (fcntl(s, F_SETFL, flags) < 0) {

These have to be F_GETFD and F_SETFD respectively. Note L -> D.
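A corrected version of the quoted fragment might look like the sketch below. `set_cloexec` is a hypothetical helper name; the fcntl() commands and the FD_CLOEXEC flag are the standard ones.

```c
#include <fcntl.h>

/*
 * Set the close-on-exec flag on fd while preserving any other
 * descriptor flags. Returns 0 on success, -1 on error.
 */
int set_cloexec(int fd)
{
    /* F_GETFD reads the descriptor flags; F_GETFL would read file status flags. */
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;

    /* F_SETFD writes the descriptor flags back with FD_CLOEXEC added. */
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
        return -1;

    return 0;
}
```

The original code fails silently because F_SETFL only accepts file status flags such as O_NONBLOCK, so the FD_CLOEXEC bit passed to it never reaches the descriptor flags.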
Re: short read from /dev/urandom
Matt Mackall wrote: _Neither_ case mentions signals and the "and will return as many bytes as requested" is clearly just a restatement of "does not have this limit". Whoever copied this comment to the manpage was a bit sloppy and dropped the first clause rather than the second: It still means the documented API says there are no short reads. So anyone doing a read() can expect a short read regardless of the fd and is quite clear that reads can be interrupted by signals. "It is not an error". Ever. Of course signal interruptions are wrong if the signal uses SA_RESTART.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
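Whichever reading of the documentation is right, a caller that wants to be safe against short reads and signal interruptions can loop. The following is a generic sketch, not code from the thread; `read_full` is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read up to count bytes from fd into buf, retrying after short reads
 * and after interruptions by signals (EINTR). Returns the number of
 * bytes read, which is less than count only at end of file, or -1 on
 * any other error.
 */
ssize_t read_full(int fd, void *buf, size_t count)
{
    unsigned char *p = buf;
    size_t done = 0;

    while (done < count) {
        ssize_t n = read(fd, p + done, count - done);

        if (n > 0)
            done += (size_t)n;      /* short read: keep going */
        else if (n == 0)
            break;                  /* end of file */
        else if (errno != EINTR)
            return -1;              /* real error */
        /* EINTR: interrupted before any data arrived, retry */
    }
    return (ssize_t)done;
}
```

With a handler installed using SA_RESTART the EINTR branch should not normally be reached, but the loop costs nothing in that case.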
Re: Pollable Semaphores
On Fri, 21 Jan 2005 17:17:51 -0600, Brent Casavant <[EMAIL PROTECTED]> wrote:
> 2. select/poll on the fd return EWOULDBLOCK if the current value of
> the futex is not equal to the value of interest. Otherwise it
> behaves as FUTEX_FD currently does.

This is the problematic part. The expected value, as you suggested, can be handled with a write() and since the expected value is often constant, this is a low-overhead method.

But the poll() interface is not so easy. You cannot change the poll() semantics to return such an error. It makes really no sense.

What I thought could be done is to define instead a new POLL* constant which signals the EWOULDBLOCK condition of the futex() syscall in the revents member. The poll/epoll syscall would do its normal work and just fill all the appropriate revents. A futex value mismatch would mean the call is not blocking at all, just as available data would be for POLLIN.

For select, I would use the exception bitmap. The bit is set for futex fds in the EWOULDBLOCK case.

All this _could_ work. But we've been bitten quite a few times in the past. There might be special cases which may need at least some additional functionality. This should be taken into account in the original design.

So, if people are interested in this, code something up and try it. Stress it as much as you can. I would oppose adding any new futex interface created on a hunch if I were Andrew.

And there is another thing to consider. There is at least one other event which should be pollable: process (maybe thread) deaths. I was hoping that we get support for this, perhaps in the form of polling the /proc/PID directory. For poll(), a POLLERR value could mean the process/thread died. For select(), once again a bit in the except array could be set.
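The difference between poll() failing and poll() reporting a condition in revents can be seen with an ordinary pipe. The proposed futex POLL* constant does not exist, so this sketch uses POLLHUP on a pipe as a stand-in for "a condition signalled per descriptor"; `revents_after_writer_closed` is a hypothetical helper name.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Create a pipe, close its write end, and poll the read end without
 * blocking. Returns the revents value poll() filled in, or 0 if
 * anything failed along the way.
 */
int revents_after_writer_closed(void)
{
    int fds[2];
    struct pollfd pfd;
    int ready;

    if (pipe(fds) != 0)
        return 0;
    close(fds[1]);                  /* no writer left */

    pfd.fd = fds[0];
    pfd.events = POLLIN;            /* we only ask about readable data */
    pfd.revents = 0;

    /* poll() itself succeeds; the hang-up is reported in revents. */
    ready = poll(&pfd, 1, 0);
    close(fds[0]);

    return ready == 1 ? pfd.revents : 0;
}
```

On Linux the returned value has POLLHUP set even though only POLLIN was requested: the call returns immediately with a positive count and the caller inspects revents to learn why. A futex mismatch flag would follow the same pattern under the scheme described above.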