On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Once an userfaultfd is created MADV_USERFAULT regions talks through
> the userfaultfd protocol with the thread responsible for doing the
> memory externalization of the process.
>
> The protocol starts by userland writing the requested/preferred
> USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> kernel knows it, it will ack it by allowing userland to read 64bit
> from the userfault fd that will contain the same 64bit
> USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> will have to try again by writing an older protocol version if
> suitable for its usage too, and read it back again until it stops
> reading -1ULL. After that the userfaultfd protocol starts.
>
> The protocol consists in the userfault fd reads 64bit in size
> providing userland the fault addresses. After a userfault address has
> been read and the fault is resolved by userland, the application must
> write back 128bits in the form of [ start, end ] range (64bit each)
> that will tell the kernel such a range has been mapped. Multiple read
> userfaults can be resolved in a single range write. poll() can be used
> to know when there are new userfaults to read (POLLIN) and when there
> are threads waiting a wakeup through a range write (POLLOUT).
>
> Signed-off-by: Andrea Arcangeli <aarca...@redhat.com>
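
To check my reading of the version handshake described above, here is a
rough userspace sketch. It is not taken from the patch. The struct
uffd_io transport and the fake_kernel stand-in are hypothetical names
made up for illustration, so the negotiation loop can be exercised
without the proposed syscall. The fake kernel also assumes version
numbers start at 1, which the description does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value read back when the kernel does not know the requested version
 * (-1ULL in the description above). */
#define USERFAULTFD_UNKNOWN_PROTOCOL ((uint64_t)-1)

/* Hypothetical transport: stands in for the 64-bit write()/read() pair
 * on the userfault fd. Both callbacks return 0 on success. */
struct uffd_io {
	int (*write64)(void *ctx, uint64_t val);
	int (*read64)(void *ctx, uint64_t *val);
	void *ctx;
};

/*
 * Offer each version in order of preference (newest first). Returns the
 * version the other side acked, or USERFAULTFD_UNKNOWN_PROTOCOL if none
 * of the offered versions was accepted or the transport failed.
 */
static uint64_t uffd_negotiate(const struct uffd_io *io,
			       const uint64_t *versions, size_t n)
{
	assert(io && io->write64 && io->read64);

	for (size_t i = 0; i < n; i++) {
		uint64_t ack;

		if (io->write64(io->ctx, versions[i]) != 0)
			break;
		if (io->read64(io->ctx, &ack) != 0)
			break;
		if (ack == versions[i])
			return ack;	/* kernel echoed our version */
		/* otherwise (e.g. -1ULL): retry with an older version */
	}
	return USERFAULTFD_UNKNOWN_PROTOCOL;
}

/* Fake kernel side, for demonstration only. Assumes it knows every
 * version from 1 up to max_version. */
struct fake_kernel {
	uint64_t max_version;
	uint64_t reply;
};

static int fake_write64(void *ctx, uint64_t val)
{
	struct fake_kernel *k = ctx;

	k->reply = (val >= 1 && val <= k->max_version)
			? val : USERFAULTFD_UNKNOWN_PROTOCOL;
	return 0;
}

static int fake_read64(void *ctx, uint64_t *val)
{
	struct fake_kernel *k = ctx;

	*val = k->reply;
	return 0;
}
```

With a fake kernel whose max_version is 2, offering { 3, 2, 1 } should
settle on 2 after one rejected attempt. A real caller would back the two
callbacks with write() and read() on the userfault fd.
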
> +#ifdef CONFIG_PROC_FS
> +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> +	struct userfaultfd_ctx *ctx = f->private_data;
> +	int ret;
> +	wait_queue_t *wq;
> +	struct userfaultfd_wait_queue *uwq;
> +	unsigned long pending = 0, total = 0;
> +
> +	spin_lock(&ctx->fault_wqh.lock);
> +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> +		if (uwq->pending)
> +			pending++;
> +		total++;
> +	}
> +	spin_unlock(&ctx->fault_wqh.lock);
> +
> +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> +	int fd, error;
> +	struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)
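
To illustrate the kind of mapping I mean, a library could reserve its
own address range roughly like this. This is only a sketch using the
existing mmap() interface. reserve_range() is a hypothetical helper
name, and the step that would associate the resulting vma with a
particular userfaultfd is left out, since no such interface exists in
this patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for MAP_ANONYMOUS and MAP_NORESERVE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve len bytes of address space with no access permissions and
 * without reserving swap for it. Returns MAP_FAILED on error.
 */
static void *reserve_range(size_t len)
{
	return mmap(NULL, len, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}
```

Each library would then hand its own reserved range to its own
userfaultfd instead of sharing one per-process fd, and release the
range with munmap() when done.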