[HACKERS] Cube extension improvement, GSoC

Stas Kelvich Fri, 22 Mar 2013 16:11:08 -0700

Hello,

some time ago I started working on the data search system (about 100-200M of 
records) with queries consisted of several diapason and equality conditions, 
e.g.:


  WHERE dim1 BETWEEN 128 AND 137 AND
  WHERE dim2 BETWEEN 4815 AND 162342 AND
  WHERE dim3 = 42
  ORDER BY dim1 ASC

There are 6 or 7 search criteria in my task. In order to avoid full table scan 
I started using R-Tree over cube data type:

  CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))

For fast sorting I used Alexander Korotkov's patch for knngist 
(http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), 
which changes metric for nearest neighbors search and allows to obtain cubes 
ordered by a specific coordinate. Having changed some data types and 
function/operator definitions I ported this to 9.2 
(https://github.com/kelvich/postgres/commit/bb372). Then, after this I could 
select ordered records right from the index:

  SELECT * FROM my_table
  WHERE cube(array[dim1, dim2, dim3]) <@ cube '(128,4815,42),(137,162342,42)'
  ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;

The main issue of such approach is the index size. In my case it was about 
100GB for 100M of records. Therefore index building becomes very expensive 
disk-related operation. For the same reason reading a large number of records 
from the index is slow too.

I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander Korotkov 
for any help. (I'm very thankful to them and glad that they agreed to meet, 
listen to me and give useful advices). Having discussed it we decided that 
there was several possible improvements for the cube extension:

  * Adding point data type support to the cube extension in order to avoid 
storing of coincident upper left and lower right vertices, which may reduce the 
volume that leaf nodes occupy almost twice.
  * Checking the split algorithm with big datasets and, if possible, improving 
it.
  * Learning cube extension to store dimensions with different data types. Such 
index would be good alternative to compound key B-Tree multi-index (suitable 
for diapason queries and data ordering).
  * Providing support for KNN with metrics induced by the different norms: 
euclidean, taxicab norm, p-norm. This can be useful in fields where we can 
extract signature: similar images search, similar audio search.

I'd like to participate in GSoC with this improvements, and I'm very interested 
in any comments or suggestions about this feature list.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Cube extension improvement, GSoC

Reply via email to