"David E. Wheeler" <da...@kineticode.com> writes:
> One of the reasons our client wants GIN for the integer[] column so bad is
> because recreating the GiST integer[] index is quite painful. Before I duped
> the table, I was just dropping and recreating the index on the original
> table. It was great to create the GIN index; for 400K rows, it took 1300 ms.
> After my initial round of testing, I dropped it and created the GiST index.
> That ran for…well, *hours*. Four at least, maybe six or seven (I forgot
> \timing and was letting it run on screen while I did some iOS hacking). I
> think something might be really wrong with GiST index creation for integer
> arrays, because the difference was just appalling.
I poked into this a bit, although it's really off-topic for what I was
doing in GIN, and I agree that there is something fundamentally rotten
in the gist__int_ops logic. oprofile shows that the code is spending
all its time in g_int_compress and g_int_union during index build:

samples  %        app name         symbol name
52747    40.2486  _int.so          g_int_compress
27092    20.6726  libc-2.12.2.so   _wordcopy_fwd_aligned
18938    14.4506  _int.so          compASC
18256    13.9302  postgres         pg_qsort
5741      4.3807  no-vmlinux       /no-vmlinux
3297      2.5158  postgres         swapfunc
864       0.6593  _int.so          g_int_decompress
826       0.6303  _int.so          _int_unique
700       0.5341  _int.so          inner_int_union
617       0.4708  postgres         med3
588       0.4487  libc-2.12.2.so   memset
371       0.2831  _int.so          g_int_same
143       0.1091  libc-2.12.2.so   memcpy
123       0.0939  postgres         ArrayGetNItems
67        0.0511  _int.so          inner_int_inter
48        0.0366  postgres         AllocSetAlloc
47        0.0359  libc-2.12.2.so   memmove
40        0.0305  postgres         MemoryContextAllocZero

The time in g_int_compress is all on this loop:

	while (len > MAXNUMRANGE * 2)
	{
		min = 0x7fffffff;
		for (i = 2; i < len; i += 2)
			if (min > (dr[i] - dr[i - 1]))
			{
				min = (dr[i] - dr[i - 1]);
				cand = i;
			}
		memmove((void *) &dr[cand - 1], (void *) &dr[cand + 1],
				(len - cand - 1) * sizeof(int32));
		len -= 2;
	}

which is not too surprising since that's obviously O(N^2). The other
high-percentage functions are qsort and subsidiaries, and those calls
are coming from g_int_union. That part could probably be sped up, since
most of the calls are unioning two inputs, which could be done a lot
cheaper than this (the inputs are always presorted, no?). But there is
something really fundamentally wrong in the logic, I think, because
while poking at it with gdb I saw it union-ing fifty-thousand-element
arrays. Surely it should get lossy before it gets to that.
I'm also wondering how it tells the difference between lossy and
non-lossy keys --- although it's hard to tell what such spectacularly
uncommented code is supposed to be doing, it looks like the lossy case
is supposed to be pairs of values representing ranges, in which case
g_int_compress is surely doing the Wrong Thing with already-lossy
inputs, and union'ing lossy inputs is an entirely insane operation
as well.

At the moment my opinion is that gist__int_ops is too broken to be
usable, and it's also too uncommented to be fixable by anyone other
than the original author.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers