From: Thomas Kellerer
Sent: Tuesday, August 30, 2016 4:39 AM
Nicolas Grilly wrote on 30.08.2016 at 13:12:

> We rely on clustered indexes to preserve data locality for each
> tenant. Primary keys start with the tenant ID. This way, rows
> belonging to the same tenant are stored next to each other. Because
> all requests hit only one tenant, this is a great performance
> improvement.

What about partitioning by tenant? With a local index on each partition.
Partitioning is currently a bit limited in Postgres (e.g. you can't have
incoming foreign keys) but this would fit your requirements pretty much as I
understand them.

 

For Nicolas’s situation, that would require 10,000 partitions – not very 
useful, and each partition would be very small.
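
For anyone curious, per-tenant partitioning in 9.x means inheritance plus check constraints, roughly like this (the events table and its columns are just placeholders I made up):

  -- parent table
  CREATE TABLE events (
      tenant_id int    NOT NULL,
      event_id  bigint NOT NULL,
      payload   text
  );

  -- one child table per tenant, each with its own "local" index
  CREATE TABLE events_tenant_42
      (CHECK (tenant_id = 42))
      INHERITS (events);

  CREATE INDEX ON events_tenant_42 (event_id);

  -- constraint_exclusion (default 'partition') lets the planner skip
  -- children whose CHECK clause can't match the query's tenant_id

Multiply that child table (and its index) by 10,000 tenants and you end up managing a huge number of tiny objects.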

 

The premise of clustered indexes is that the row data is “in the index”, so no 
row lookups are needed – that’s very effective/performant and clearly works 
well for many scenarios.  
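
In Postgres, the closest you get to that “data in the index” behavior is an index-only scan. With the composite key Nicolas described, a single-table sketch (again, hypothetical names) looks like:

  CREATE TABLE events (
      tenant_id int    NOT NULL,
      event_id  bigint NOT NULL,
      payload   text,
      PRIMARY KEY (tenant_id, event_id)
  );

  -- when the query only touches indexed columns and the table is
  -- recently vacuumed, this can run as an index-only scan:
  EXPLAIN
  SELECT tenant_id, event_id
  FROM events
  WHERE tenant_id = 123;

It only helps when the needed columns are in the index, of course, so it’s a narrower tool than a true clustered index.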

 

In Postgres, as you mentioned, CLUSTER is a “one time” operation: the ordering
is not maintained as new rows arrive, so you’ll need to re-cluster the table
periodically.  Depending on the activity model for that table, that may be
feasible.  For example, if you load it via regular batch scripts, the
clustering could be done after those loads.  If you add rows only rarely but
then do lots of updates, clustering would work great.  If this is an active
real-time data table, then clustering would not be viable.
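
The re-cluster step after a batch load is just a couple of statements (using the hypothetical events table and its default-named primary key index from the sketch above):

  -- CLUSTER takes an exclusive lock and rewrites the table,
  -- so schedule it right after the batch load finishes
  CLUSTER events USING events_pkey;

  -- refresh planner statistics for the rewritten table
  ANALYZE events;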

 

But you may still be fine without them.  Here’s why:  Postgres is a very
fast database engine.  I am constantly surprised at how performant it is.  I
came from the SQL Server world where I always leveraged the automatic clustered
indexes, but in Postgres I am not using them and am running some big data with
no table partitioning (yet) and performance is still very good.  I carefully
optimize my data models, so that is part of it.  I also carefully optimize my
stored functions/indexes.  I am migrating data from 500+ MySQL databases/~8,000
tables/~20 billion rows into a single Postgres db.  As my data grows, I may
shard it.  The new parallel query support in PG v9.6 may also serve your
table model very nicely, depending on how you query into it.
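
If you want to see what 9.6 does with a query like that, the knob to experiment with is max_parallel_workers_per_gather (session-level here, and the value and filter are just examples):

  -- allow up to 4 parallel workers per Gather node in this session
  SET max_parallel_workers_per_gather = 4;

  -- a filter on an un-indexed column forces a sequential scan, which
  -- 9.6 can parallelize under a Gather node on a big enough table
  EXPLAIN
  SELECT count(*)
  FROM events
  WHERE payload LIKE '%timeout%';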

 

So I’d say, just build a prototype PG db, build that one table, load your
existing data into it (very easy via the mysql_fdw foreign data wrapper – it’s
exactly what I am doing…ping me off list if you need some ideas), put a regular
index on it and run some queries.  If you have a decent size dev server to work
on, you should see adequate performance.  And what you’ll be incredibly pleased
with is the remarkably rich and flexible plpgsql coding environment.  It’s
heaven for SQL devs.
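
The mysql_fdw setup for that prototype is only a handful of statements (the host, credentials, and column list below are placeholders, not my actual setup):

  CREATE EXTENSION mysql_fdw;

  CREATE SERVER mysql_src
      FOREIGN DATA WRAPPER mysql_fdw
      OPTIONS (host '10.0.0.5', port '3306');

  CREATE USER MAPPING FOR CURRENT_USER
      SERVER mysql_src
      OPTIONS (username 'etl_user', password 'secret');

  -- point a foreign table at the source MySQL table
  CREATE FOREIGN TABLE mysql_events (
      tenant_id int,
      event_id  bigint,
      payload   text
  ) SERVER mysql_src OPTIONS (dbname 'app', table_name 'events');

  -- then loading is a plain INSERT ... SELECT into the local table
  INSERT INTO events (tenant_id, event_id, payload)
  SELECT tenant_id, event_id, payload FROM mysql_events;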

 

BTW, I am putting together a rather detailed examination of Postgres ETL/bulk
loading performance, on 2 different platforms: 4 cores/16 GB RAM/3 TB SSD on AWS
(Amazon cloud), and 48 cores/256 GB RAM/10 TB SSD on a monster loaner Cisco UCS
server.  Should have that posted to the Perform list later this week.

 

Mike Sofen (USA)

 
