[PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Andrew Lazarus
I have a table with 2.5 million real[] arrays. (They are points in a
time series.) Given a new array X, I'd like to find, say, the 25
closest to X in some sense--for simplification, let's just say in the
usual vector norm. Speed is critical here, and everything I have tried
has been too slow.

I imported the cube contrib package, and I tried creating an index on
a cube of the last 6 elements, which are the most important. Then I
tested the 2.5MM rows for containment within a box around the last 6
elements of X, +/- 0.1 in each coordinate, figuring that would be an
indexed search (and I CLUSTERED the table on that index). I then ran the
distance sort on this smaller set. The index was used, but the query was
still too slow. I also tried creating new columns with rounded int2
values of the last 6 coordinates and building a multicolumn index on them.
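
(For concreteness, a minimal sketch of this approach, with all names
illustrative: the table is points(id, v real[]), the last 6 elements are
v[109]..v[114], :x and :x_tail stand for the full 114-element and last-6
query vectors as float8[] values, and l2_dist is a homemade helper, not
part of contrib/cube.)

-- hypothetical helper: exact Euclidean distance between equal-length arrays
CREATE FUNCTION l2_dist(float8[], float8[]) RETURNS float8 AS $$
    SELECT sqrt(sum(($1[i] - $2[i])^2))
    FROM generate_series(array_lower($1, 1), array_upper($1, 1)) AS i
$$ LANGUAGE sql IMMUTABLE STRICT;

-- GiST index on a cube built from the last 6 coordinates
CREATE INDEX points_tail6_gist
    ON points USING gist (cube(v[109:114]::float8[]));

-- indexed box prefilter (+/- 0.1 per coordinate), exact sort of the survivors
SELECT id
FROM points
WHERE cube(v[109:114]::float8[]) <@ cube_enlarge(cube(:x_tail), 0.1, 6)
ORDER BY l2_dist(v::float8[], :x)
LIMIT 25;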

For each X, the search takes about 4-15 seconds, which is at least an
order of magnitude above my target. Absolute numbers depend on my
hardware and settings, and some of this can be addressed with
configuration tweaks, etc., but first I think I need to know the
optimal data structure/indexing strategy.

Is anyone on the list experienced with this sort of issue?

Thanks.
Andrew Lazarus  [EMAIL PROTECTED]



Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Jeff Davis
On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:
> I have a table with 2.5 million real[] arrays. (They are points in a
> time series.) Given a new array X, I'd like to find, say, the 25
> closest to X in some sense--for simplification, let's just say in the
> usual vector norm. Speed is critical here, and everything I have tried
> has been too slow.
> 
> I imported the cube contrib package, and I tried creating an index on
> a cube of the last 6 elements, which are the most important. Then I

Have you tried just making normal indexes on each of the last 6 elements
and seeing whether PostgreSQL will use a BitmapAnd to combine them?
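
(A sketch of what that could look like, reusing the hypothetical
points(id, v real[]) schema from above; the doubled parentheses are needed
because v[109] and friends are expressions, and :x109 .. :x114 stand for
the corresponding elements of X.)

CREATE INDEX p_e109 ON points ((v[109]));
CREATE INDEX p_e110 ON points ((v[110]));
CREATE INDEX p_e111 ON points ((v[111]));
CREATE INDEX p_e112 ON points ((v[112]));
CREATE INDEX p_e113 ON points ((v[113]));
CREATE INDEX p_e114 ON points ((v[114]));

-- six independent range clauses; the planner can answer each from its own
-- index and combine the results with a BitmapAnd before touching the heap
SELECT id
FROM points
WHERE v[109] BETWEEN :x109 - 0.1 AND :x109 + 0.1
  AND v[110] BETWEEN :x110 - 0.1 AND :x110 + 0.1
  AND v[111] BETWEEN :x111 - 0.1 AND :x111 + 0.1
  AND v[112] BETWEEN :x112 - 0.1 AND :x112 + 0.1
  AND v[113] BETWEEN :x113 - 0.1 AND :x113 + 0.1
  AND v[114] BETWEEN :x114 - 0.1 AND :x114 + 0.1;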

Regards,
Jeff Davis





Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Mark Kirkwood

Jeff Davis wrote:
> On Fri, 2007-04-20 at 12:07 -0700, Andrew Lazarus wrote:
> > I have a table with 2.5 million real[] arrays. (They are points in a
> > time series.) Given a new array X, I'd like to find, say, the 25
> > closest to X in some sense--for simplification, let's just say in the
> > usual vector norm. Speed is critical here, and everything I have tried
> > has been too slow.
> >
> > I imported the cube contrib package, and I tried creating an index on
> > a cube of the last 6 elements, which are the most important. Then I
>
> Have you tried just making normal indexes on each of the last 6 elements
> and seeing whether PostgreSQL will use a BitmapAnd to combine them?

I don't think that will work for the vector norm, i.e.:

|x - y| = sqrt(sum over j ((x[j] - y[j])^2))
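
(In other words, bitmap-AND'ed btrees can only shrink the candidate set;
the norm itself still has to be evaluated row by row over whatever
survives, along the lines of the hypothetical l2_dist helper sketched
earlier in the thread:)

SELECT id
FROM points
WHERE v[109] BETWEEN :x109 - 0.1 AND :x109 + 0.1
  -- ... the other five range clauses ...
ORDER BY l2_dist(v::float8[], :x)   -- exact 114-dimension distance
LIMIT 25;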


Cheers

Mark


Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Andrew Lazarus
Because I know the 25 closest are going to be fairly close in each
coordinate, I did try a multicolumn index on the last 6 columns and
used a +/- 0.1 or 0.2 tolerance on each. (Given the distribution of the
data in question, the 25 best are very probably inside that hypercube.)

This hypercube tended to contain 10-20K records, which took at least 4
seconds to retrieve. I was a little surprised by how long that took.
So I'm wondering if my data representation is off the wall.

I should mention that I also tried a GiST cube index on all 114
elements, but CREATE INDEX hadn't finished after 36 hours, when I killed
it; in retrospect, I wasn't sure an index that took something like
6GB by itself would be helpful on a box with 2GB of RAM.
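
(Back-of-the-envelope, that is about the size one would expect: contrib/cube
stores both corners of each cube even for zero-volume points, so each entry
carries roughly 114 x 2 x 8 = 1824 bytes of coordinates, and 2.5M x 1824
bytes is about 4.6GB before tuple and page overhead; in the neighborhood of
the 6GB figure, and several times the 2GB of RAM. Note also that contrib/cube
is compiled with a 100-dimension limit by default, so a 114-element cube
needs a rebuilt CUBE_MAX_DIM in any case.)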

MK> I don't think that will work for the vector norm, i.e.:

MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))


MK> Cheers

MK> Mark


-- 
Sincerely,
 Andrew Lazarus  mailto:[EMAIL PROTECTED]


Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Mark Kirkwood

Andrew Lazarus wrote:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and
> used a +/- 0.1 or 0.2 tolerance on each. (Given the distribution of the
> data in question, the 25 best are very probably inside that hypercube.)
>
> This hypercube tended to contain 10-20K records, which took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.
>
> I should mention that I also tried a GiST cube index on all 114
> elements, but CREATE INDEX hadn't finished after 36 hours, when I killed
> it; in retrospect, I wasn't sure an index that took something like
> 6GB by itself would be helpful on a box with 2GB of RAM.
>
> MK> I don't think that will work for the vector norm, i.e.:
>
> MK> |x - y| = sqrt(sum over j ((x[j] - y[j])^2))

Sorry, in that case it probably *is* worth trying out six single-column
indexes and seeing if they get bitmap-AND'ed together...
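
(A quick way to check is to run the query under EXPLAIN ANALYZE and look
for a BitmapAnd node sitting over several Bitmap Index Scans; if the
planner chooses a sequential scan instead, the individual +/- 0.1 clauses
are probably not selective enough for the bitmap plan to win:)

EXPLAIN ANALYZE
SELECT id
FROM points
WHERE v[109] BETWEEN :x109 - 0.1 AND :x109 + 0.1
  AND v[110] BETWEEN :x110 - 0.1 AND :x110 + 0.1
  -- ... and so on for the remaining four coordinates ...
;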


Mark


Re: [PERFORM] index structure for 114-dimension vector

2007-04-20 Thread Tom Lane
Andrew Lazarus <[EMAIL PROTECTED]> writes:
> Because I know the 25 closest are going to be fairly close in each
> coordinate, I did try a multicolumn index on the last 6 columns and
> used a +/- 0.1 or 0.2 tolerance on each. (Given the distribution of the
> data in question, the 25 best are very probably inside that hypercube.)

> This hypercube tended to contain 10-20K records, which took at least 4
> seconds to retrieve. I was a little surprised by how long that took.
> So I'm wondering if my data representation is off the wall.

A multicolumn btree index isn't going to be helpful at all: with a range
condition on the leading column, the lower-order columns can do little to
narrow the scan.  Jeff's idea of using six single-column indexes with the
above query might work, though.

regards, tom lane
