On 02.03.2018 19:30, Darafei "Komяpa" Praliaskouski wrote:
Hi,
I work at a ride sharing company, and we found a simple scenario where
Postgres has a lot of room for improvement.
After my talk at pgconf.ru, Alexander Korotkov encouraged me to share
my story and thoughts in -hackers.
Story setting:
- Amazon RDS (thus vanilla, unpatchable Postgres),
synchronous_commit=off (we're OK with losing a few seconds of data on crash).
- Amazon gp2 storage. It behaves almost like an SSD for small bursts,
but has an IOPS limit if you're doing a lot of operations.
- Drivers drive around the city and send their GPS positions every
second; for simplicity, assume 50k of them at any given time.
- An append-only table of shape (id uuid, ts timestamp, geom geometry,
heading/speed/accuracy float, source text), with a btree index on
(id, ts). (A rough DDL sketch follows this list.)
- A service that receives the measurements over the network, batches
them into a 1-second buffer (~N=50k rows) and inserts them via COPY.
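For reference, the setup above is roughly the following (a sketch only;
the exact column types are my approximation, geometry assumes PostGIS):

create table positions (
    id       uuid,
    ts       timestamp,
    geom     geometry,   -- PostGIS type
    heading  float,
    speed    float,
    accuracy float,
    source   text
);

create index positions_id_ts on positions (id, ts);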
After someone orders and completes a trip, we get the id of the driver
and the trip's time interval from another service, and want to fetch the
trip's route to calculate the bill. Trip durations range from seconds up
to 4 hours (N=4*3600=14400 rows; a page typically holds 60 rows).
select * from positions where id = :id and ts between :start_ts and
:end_ts;
Data older than 24 hours is not needed in this service, so 70 GB of
storage should be enough. To be on the safe side we're giving it
500 GB, which on gp2 provides a steady 1500 IOPS.
In development (on synthetic data) the plans used the index and looked
great, so we proceeded with Postgres and not MySQL. :)
When deployed to production we found out that a single query for an
interval of more than half an hour (1800+ rows) can exhaust all the IOPS.
Data is appended with an increasing time field, which effectively
ensures that no two rows from the same driver ever end up in the same
heap page. A 4-hour request can degrade the system for 10 seconds: gp2
provides at most 10000 IOPS, and we need to fetch 14400 pages for it. We
need the biggest available gp2 offering just to read 2 megabytes of data
in about 1.5 seconds. The best io-optimized io1 volumes provide 32000
IOPS, which gets us down to about 500 ms.
If the data were perfectly packed, it would be just 240 8 kB pages,
translating into 120 read operations with gp2's 16 kB blocks.
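For completeness, the back-of-the-envelope numbers above (values taken
from the text, one row per driver per second):

select 4 * 3600                as rows_per_4h_trip,     -- 14400 rows
       14400 / 10000.0         as seconds_at_gp2_iops,  -- ~1.44 s: one random page read per row
       14400 / 32000.0         as seconds_at_io1_iops,  -- ~0.45 s on the best io1 volume
       ceil(14400 / 60.0)      as pages_if_packed,      -- 240 pages at ~60 rows per page
       ceil(14400 / 60.0) * 8  as packed_kilobytes,     -- ~2 MB of actual data
       ceil(14400 / 60.0) / 2  as gp2_16kb_reads;       -- 120 reads of gp2's 16 kB blocks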
Our options were:
- partitioning. Not entirely trivial when your id is a uuid. To get
visible gains, we would need to make sure each driver gets their own
partition. That would leave us with 50 000(+) tables, and rumor has it
that this is what is done in some bigger taxi service, where the
relcache then eats up all the RAM and the system OOMs.
- CLUSTER table USING index. Works perfectly on a test stand, but isn't
available as an online option.
- Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest
https://commitfest.postgresql.org/15/1350/). We can't use that as it's
not in upstream/amazon Postgres yet.
- We decided to live with the overhead of unnecessary sorting by all
fields and of keeping a copy of the heap, and created a btree over all
the fields to utilize Index-Only Scans:
* Testing went well on a dump of the production database.
* After we built the indexes on production, we found out that
performance was _worse_ than with the simpler index.
* EXPLAIN (BUFFERS) revealed that the Visibility Map never gets
updated, as autovacuum ignores an append-only, never-updated,
never-deleted table that is only truncated once a day. There is no way
to force autovacuum on such a table.
* We created a microservice (it's hard to find a spot for a crontab in a
distributed system!) that periodically and aggressively runs VACUUM on
the table.
It indeed helped with the queries, but... while VACUUM skips all-visible
pages of the heap, it always walks over all the pages of the btree,
which in our case is even larger than the heap.
There is a patch on commitfest that repairs this behavior,
https://commitfest.postgresql.org/16/952/ - but it's not yet in
upstream/amazon Postgres.
* We ended up inventing a partitioning scheme that rotates to a new
table every gigabyte of data, to keep VACUUM run time low. There are
hundreds of partitions with indexes on all the fields. Finally the
system is stable.
* EXPLAIN (BUFFERS) later showed that each reading query visits all the
indexes of the partitioned table, fetching a page from each index just
to learn that there are 0 matching rows in it. To prune obviously
unneeded partitions we decided to add a constraint on the timestamp
once a partition is finalized (see the sketch right after this list).
Timestamps are stamped on the client side and sanitized because of
mobile network instability, so we don't know the bounds in advance.
Unfortunately that means we need two seq scans to do it: the first one
to get min(ts) and max(ts), and the second one for ALTER TABLE ADD
CONSTRAINT. This operation also eats up IOPS.
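The finalization step looks roughly like this (the partition name is
made up; :min_ts/:max_ts stand for the values returned by the first
query):

-- first seq scan: learn the actual time bounds of the finalized partition
select min(ts), max(ts) from positions_p0042;

-- second seq scan: let the planner skip this partition for other intervals
alter table positions_p0042
    add constraint positions_p0042_ts_check
    check (ts >= :min_ts and ts <= :max_ts);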
We are not a very large company, but we have already bumped into too
many scalability issues on this path. Searching for solutions at every
step turns up other people with tables named things like "gpsdata" and
"traces", so we're not alone with this problem. :)
I gave this all some thought, and it looks like none of it would have
happened if Postgres were able to cluster heap insertions by the
(id, ts) index. We're OK with synchronous_commit=off, so the amplified
writes won't immediately hit disk and can be smoothed out along the way.
Clustering doesn't require perfect sorting: we need to minimize the
number of pages fetched; it's OK if the pages are not consecutive on
disk.
I see the option for clustered heap writes as follows:
- on tuple insertion, check whether the table is clustered by some index;
- if the table is clustered, we don't write into the last used page,
but instead go into the index and collect the heap page numbers of
index tuples that are less than or equal to the current one, up to the
index page boundary (or some other exit strategy; at least one heap
page is needed) - roughly the lookup illustrated in the SQL below;
- if we can fit the tuple into one of those pages, write it there;
- if we cannot fit it, consult statistics to see whether there is too
much empty space in the pages. If more than 50% of the space is empty,
get a page from the FSM; create a new page otherwise.
(Looking into the FSM is a safety measure for people who specifically
insert tuples in backward-sorted order.)
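At SQL level the lookup step can be sketched with the ctid decoding
trick (purely an illustration, not the proposed implementation):

-- heap pages holding the nearest existing tuples for this driver,
-- found via the (id, ts) index; the insert path would try these first
select (ctid::text::point)[0]::bigint as heap_page, count(*)
from (
    select ctid
    from positions
    where id = :id and ts <= :new_ts
    order by ts desc
    limit 60
) nearest
group by 1;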
This would not require more space than is currently needed to keep a
vacuumed all-fields index, and it would let us omit VACUUM passes over
the relations. A pencil-and-paper simulation shows it to be a good
thing. Do you have thoughts, can you help with an implementation, or
can you tell me why it would be a bad idea and we need to follow our
competitor in moving away to another RDBMS? :)
I encourage everyone to help with at least
https://commitfest.postgresql.org/16/952/, as no good SQL-level
workaround exists for it. After that, enabling autovacuum for tables
that only ever receive inserts and no deletes would be cool to have, so
that their pages get marked as all-visible.
Hackers, can you help me keep Postgres in the system? :)
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
A clustered updateable heap would have a structure very similar to a
B-Tree: we need to somehow locate the target page for an insert by key,
handle page overflow/underflow (split/merge), and so on.
So if something looks like a B-Tree and behaves like a B-Tree, then
most likely we should not reinvent the wheel and should just use a
B-Tree.
From my point of view, a covering index (CREATE INDEX .. INCLUDE on
commitfest https://commitfest.postgresql.org/15/1350/) is what is
needed in this case.
Maybe some more options are needed to force an index-only scan for
append-only data. While this patch is not committed yet, it is possible
to try creating a standard compound index including all record columns
(problems can be caused by types not having the required comparison
operators).
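For the table from the original mail that would be something like the
following (the INCLUDE form uses the syntax proposed in that patch; the
plain compound form can be tried today, provided every column type has
a btree operator class - PostGIS provides a basic one for geometry):

create index positions_id_ts_covering
    on positions (id, ts)
    include (geom, heading, speed, accuracy, source);

-- fallback while INCLUDE is not available: put everything into the key
create index positions_all_columns
    on positions (id, ts, geom, heading, speed, accuracy, source);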
An alternative solution is manual record clustering. It can be achieved
by storing several points of the same route in one Postgres record,
using the Postgres array or json types.
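A minimal sketch of such packing with jsonb (the names are made up, and
ST_AsGeoJSON assumes PostGIS):

create table positions_packed (
    id        uuid,
    ts_minute timestamp,
    points    jsonb,   -- e.g. [{"ts": ..., "geom": ..., "speed": ...}, ...]
    primary key (id, ts_minute)
);

-- periodically roll raw rows up into one row per driver per minute
insert into positions_packed (id, ts_minute, points)
select id,
       date_trunc('minute', ts),
       jsonb_agg(jsonb_build_object('ts', ts,
                                    'geom', ST_AsGeoJSON(geom)::jsonb,
                                    'speed', speed)
                 order by ts)
from positions
group by id, date_trunc('minute', ts);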
The approach with json was used by the Platon company to solve a
similar task (tracking information about vehicle movements).
Although it may seem a little bit strange to use JSON for statically
structured data, Platon was able to reduce storage size several times
over and increase query execution speed.
Instead of the array/json types you can define your own "vector"/"tile"
types or use my extension VOPS. Certainly, in all these cases you will
have to rewrite your queries, and they will become more complicated.
It is actually a very common problem that the order in which data is
inserted differs from the order in which it is traversed. The same
thing happens in most trading systems, where data is imported in
time-ascending order while it is usually accessed and analyzed grouped
by symbol. The most general solution to the problem is to maintain
several different representations of the data.
These can be, for example, "vertical" and "horizontal" representations.
In Vertica you can create an arbitrary number of "projections" in which
the data is sorted in different ways.
In Postgres this can be achieved using indexes or external storages. I
hope that the extensible heap API will simplify development and
integration of such alternative storages.
But even right now you can try to do some manual clustering using
indexes or compound types.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company