On Wed, Jun 20, 2012, Nadav Har'El wrote about "Python question - first call is slower?": > Hi, I have run across a puzzling issue in Python, and I wonder if anyone > on the list can explain it. >... > But strangely, it doesn't - the first call takes 14 milliseconds, while > the second and all subsequent calls take only 2 milliseconds each. > Does anybody have any idea why this can happen?
Thanks to everyone for all the ideas and suggestions. While trying to simplify the example which showed this problem, I started to think that this is perhaps not a general Python problem, but rather a problem caused by one specific thing that the example program did which I previously neglected to mention: My example program actually *forked* right before that first call. It turns out that (at least, this is my understanding now) memory allocation - in Python but also in C - is significantly slower during this first call, but returns to normal speed during the second call. Because Python relies on a lot of memory allocations, the first call becomes significantly slower than the second. But why is allocation slower during the first run of that function? The short explanation is copy-on-write page faults. The long explanation: Python's initialization before our fork() allocates and frees a lot of memory, leaving a fragmented heap full of holes that can be reused by future allocations. Now, think of what happens after a fork(). The child gets a copy of this fragmented heap; But this is not a real copy, but rather a copy-on-write - pages are only copied when the child writes them. What happens now when the child allocates memory? Since the heap is full of holes, one of these holes will be returned, but when the child writes in it, it causes the full page containing the hole to be copied - which is slow. I include below an example program (in C) to demonstrate the problem. It starts by creating a "swiss cheese" heap (100 MB allocated and then half the memory is freed creating a lot of 1000 byte holes), and runs a measure() function which allocates memory from these holes. Before the fork, measure() takes 15 milliseconds each time (the first and subsequent). After the fork, the first measure() takes 90 millisecond, and all the subsequent calls again take 15 milliseconds. In theory, this problem can be elliminated if we could "compact" the heap before calling fork(), or find a way to cause memory allocations in the child to return memory from a new parts of the heap - not holes in the parent's heap. Somewhat counter-intuitively, that second approach will also save memory. /* Try to show that memory allocation after fork is slower because of copy-on-write and previously freed memory - and this can explain the slowdown we see after the fork. */ #include <sys/time.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #define N 100000 #define M 1000 static measure(){ /* See how much time it takes to allocate (and then free) N/2 chunks. Presumably, these will be returned from the holes left in the heap by our previously freeing half of the N chunks we allocated initially. */ int i; struct timeval old_time, new_time; long usec; void **p = calloc(N/2+1, sizeof(void *)); void **pp = p; gettimeofday(&old_time, NULL); for(i=0; i<N; i+=2){ *(pp++)=malloc(M); } pp=p; for(i=0; i<N; i+=2){ free(*(pp++)); } gettimeofday(&new_time, NULL); usec = new_time.tv_usec - old_time.tv_usec + 1000000*(new_time.tv_sec - old_time.tv_sec); fprintf(stderr, "measure() - alloc and freeing N/2 chunks: %d usec\n", usec); } int main(int argc, char *argv[]){ int i; struct timeval old_time, new_time; long usec; fprintf(stderr,"Preparing swiss-cheese heap:\n"); fprintf(stderr," N=%d (chunks), M=%d (chunk size)\n",N,M); void **p = calloc(N, sizeof(void *)); /* Allocate N chunks of M bytes */ gettimeofday(&old_time, NULL); for(i=0; i<N; i++){ p[i]=malloc(M); } gettimeofday(&new_time, NULL); usec = new_time.tv_usec - old_time.tv_usec + 1000000*(new_time.tv_sec - old_time.tv_sec); fprintf(stderr, " allocating N chunks: %d usec\n", usec); /* Free half the chunks - every second chunk, so that we have a lot of free space in the heap, but no page can actually be fully returned to the system. */ gettimeofday(&old_time, NULL); for(i=0; i<N; i+=2){ free(p[i]); } gettimeofday(&new_time, NULL); usec = new_time.tv_usec - old_time.tv_usec + 1000000*(new_time.tv_sec - old_time.tv_sec); fprintf(stderr, " freeing N/2 chunks: %d usec\n", usec); fprintf(stderr, "\nIn original process. Note first run is just as fast as the rest:\n"); measure(); measure(); measure(); measure(); if(fork() != 0){ /* parent */ wait(NULL); } else { /* child */ fprintf(stderr, "\nIn child. Note how slow the first run in the child is!\n"); measure(); measure(); measure(); measure(); } } -- Nadav Har'El | Thursday, Jun 21 2012, n...@math.technion.ac.il |----------------------------------------- Phone +972-523-790466, ICQ 13349191 |The person who knows how to laugh at http://nadav.harel.org.il |himself will never cease to be amused. _______________________________________________ Linux-il mailing list Linux-il@cs.huji.ac.il http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il