On 10/6/2014 12:52 PM, Richard Frith-Macdonald wrote:
On 6 Oct 2014, at 17:54, Igor Neyman <iney...@perceptron.com> wrote:

-----Original Message-----
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Richard Frith-Macdonald
Sent: Monday, October 06, 2014 4:02 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] How to get good performance for very large lists/sets?

I'm wondering if anyone can help with advice on how to manage large lists/sets 
of items in a postgresql database.

I have a database which uses multiple  lists of items roughly like this:

CREATE TABLE List (
  ID SERIAL,
  Name VARCHAR ....
);

and a table containing individual entries in the lists:

CREATE TABLE ListEntry (
  ListID INT, /* Reference the List table */
  ItemID INT /* References an Item table */
) ;
CREATE UNIQUE INDEX ListEntryIDX ON ListEntry(ListID, ItemID);

Now, there are thousands of lists, many with millions of entries, and items are 
added to and removed from lists in an unpredictable way (in response to our 
customer's actions, not under our control).  Lists are also created by customer 
actions.

Finding whether a particular item is in a particular list is reasonably fast, 
but when we need to do things like find all the items in list A but not list B 
things can get very slow (particularly when both lists contain millions of 
common items).

I think that server won't use index-only scans because, even in cases where a 
particular list has not had any recent changes, the ListEntry table will almost 
always have had some change (for one of the other lists) since its last vacuum.
Perhaps creating multiple ListEntry tables (one for each list) would allow 
better performance; but that would be thousands (possibly tens of thousands) of 
tables, and allowing new tables to be created by our clients might conflict 
with things like nightly backups.

Is there a better way to manage list/set membership for many thousands of sets 
and many millions of items?

--

You mean you are get sequential scans?
Index-only scans are not always quicker (you could try "turning off" seq scans 
by setting enable_seqscan=off).

Could you show your query, corresponding plans, and what don't you like about 
them?

I guess I didn't express myself well.

No I'm not particularly dissatisfied with any query plan;  have tried 
enabling/disabling different scan types to experiment, and have been able to 
get better results from the query planner with such tweaks in some cases (ie 
with specific datasets), but not consistently.  Certainly the index is used 
quite often, and when it isn't the query planner seems to be making reasonable 
decisions.
I've tried NOT IN, and NOT EXISTS and NOT EXISTS for different situations ...

My fundamental problem is huge datasets;  with hundreds of gigabytes of memory, 
I can have the lists basically in memory and these queries seem to be 
cpu-limited ... so I'm searching for a way to minimise the work the cpu has to 
do.

So what I was wondering was whether this whole approach to set/list membership 
was the correct one to use or if there's some other approach which can simply 
avoid the cpu having to look at so much data (which was why I wondered about 
index-only scans).





Ohhh..

Um, completely left field, but, if your items are sequential in some way, maybe there is some gross misuse of ranges you could use?

http://www.postgresql.org/docs/9.2/static/rangetypes.html


-Andy





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Reply via email to