On 02.03.2018 19:30, Darafei "Komяpa" Praliaskouski wrote:
Hi,
I work at a ride sharing company, and we found a simple scenario where
Postgres has a lot of room for improvement.
After my talk at pgconf.ru, Alexander Korotkov encouraged me to share
my story and thoughts in -hackers.
Story setting:
- Amazon RDS (thus vanilla, unpatchable Postgres),
synchronous_commit=off (we're OK with losing a few seconds of data on crash).
- Amazon gp2 storage. It behaves almost like an SSD for small bursts,
but has an IOPS limit if you're doing a lot of operations.
- Drivers drive around the city and send their GPS positions every
second; for simplicity, assume 50k of them at any given time.
- An append-only table of shape (id uuid, ts timestamp, geom geometry,
heading/speed/accuracy float, source text), with a btree index on
(id, ts). (A rough DDL sketch follows this list.)
- A service that receives the measurements over the network, batches
them into a 1-second buffer (~N=50k rows) and inserts them via COPY.
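For reference, the setup above is roughly the following (a sketch only;
the exact column types are my approximation, geometry assumes PostGIS):

create table positions (
    id       uuid,
    ts       timestamp,
    geom     geometry,   -- PostGIS type
    heading  float,
    speed    float,
    accuracy float,
    source   text
);

create index positions_id_ts on positions (id, ts);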
After someone orders and completes a trip, we get the id of the driver
and the trip's time interval from another service, and want to fetch the
trip's route to calculate the bill. Trip durations range from seconds up
to 4 hours (N=4*3600=14400 rows; a page typically holds 60 rows).
select * from positions where id = :id and ts between :start_ts and
:end_ts;
Data older than 24 hours is not needed in this service, so 70 GB of
storage should be enough. To be on the safe side we're giving it
500 GB, which on gp2 provides a steady 1500 IOPS.
In development (on synthetic data) the plans used the index and looked
great, so we proceeded with Postgres and not MySQL. :)
When deployed to production we found out that a single query for an
interval of more than half an hour (1800+ rows) can exhaust all the IOPS.
Data is appended with an increasing time field, which effectively
ensures that no two rows from the same driver ever end up in the same
heap page. A 4-hour request can degrade the system for 10 seconds: gp2
provides at most 10000 IOPS, and we need to fetch 14400 pages for it. We
need the biggest available gp2 offering just to read 2 megabytes of data
in about 1.5 seconds. The best io-optimized io1 volumes provide 32000
IOPS, which gets us down to about 500 ms.
If the data were perfectly packed, it would be just 240 8 kB pages,
translating into 120 read operations with gp2's 16 kB blocks.
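For completeness, the back-of-the-envelope numbers above (values taken
from the text, one row per driver per second):

select 4 * 3600                as rows_per_4h_trip,     -- 14400 rows
       14400 / 10000.0         as seconds_at_gp2_iops,  -- ~1.44 s: one random page read per row
       14400 / 32000.0         as seconds_at_io1_iops,  -- ~0.45 s on the best io1 volume
       ceil(14400 / 60.0)      as pages_if_packed,      -- 240 pages at ~60 rows per page
       ceil(14400 / 60.0) * 8  as packed_kilobytes,     -- ~2 MB of actual data
       ceil(14400 / 60.0) / 2  as gp2_16kb_reads;       -- 120 reads of gp2's 16 kB blocks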
Our options were:
- partitioning. Not entirely trivial when your id is a uuid. To get
visible gains, we would need to make sure each driver gets their own
partition. That would leave us with 50 000(+) tables, and rumor has it
that this is what is done in some bigger taxi service, where the
relcache then eats up all the RAM and the system OOMs.
- CLUSTER table USING index. Works perfectly on a test stand, but isn't
available as an online option.
- Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest
https://commitfest.postgresql.org/15/1350/). We can't use that as it's
not in upstream/amazon Postgres yet.
- We decided to live with the overhead of unnecessary sorting by all
fields and of keeping a copy of the heap, and created a btree over all
the fields to utilize Index-Only Scans:
* Testing went well on a dump of the production database.
* After we built the indexes on production, we found out that
performance was _worse_ than with the simpler index.
* EXPLAIN (BUFFERS) revealed that the Visibility Map never gets
updated, as autovacuum ignores an append-only, never-updated,
never-deleted table that is only truncated once a day. There is no way
to force autovacuum on such a table.
* We created a microservice (it's hard to find a spot for a crontab in a
distributed system!) that periodically and aggressively runs VACUUM on
the table.
It indeed helped with the queries, but... while VACUUM skips all-visible
pages of the heap, it always walks over all the pages of the btree,
which in our case is even larger than the heap.
There is a patch on commitfest that repairs this behavior,
https://commitfest.postgresql.org/16/952/ - but it's not yet in
upstream/amazon Postgres.
* We ended up inventing a partitioning scheme that rotates to a new
table every gigabyte of data, to keep VACUUM run time low. There are
hundreds of partitions with indexes on all the fields. Finally the
system is stable.
* EXPLAIN (BUFFERS) later showed that each reading query visits all the
indexes of the partitioned table, fetching a page from each index just
to learn that there are 0 matching rows in it. To prune obviously
unneeded partitions we decided to add a constraint on the timestamp
once a partition is finalized (see the sketch right after this list).
Timestamps are stamped on the client side and sanitized because of
mobile network instability, so we don't know the bounds in advance.
Unfortunately that means we need two seq scans to do it: the first one
to get min(ts) and max(ts), and the second one for ALTER TABLE ADD
CONSTRAINT. This operation also eats up IOPS.
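The finalization step looks roughly like this (the partition name is
made up; :min_ts/:max_ts stand for the values returned by the first
query):

-- first seq scan: learn the actual time bounds of the finalized partition
select min(ts), max(ts) from positions_p0042;

-- second seq scan: let the planner skip this partition for other intervals
alter table positions_p0042
    add constraint positions_p0042_ts_check
    check (ts >= :min_ts and ts <= :max_ts);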
We are not a very large company, but we have already bumped into too
many scalability issues on this path. Searching for solutions at every
step turns up other people with tables named things like "gpsdata" and
"traces", so we're not alone with this problem. :)
I gave this all some thought, and it looks like none of it would have
happened if Postgres were able to cluster heap insertions by the
(id, ts) index. We're OK with synchronous_commit=off, so the amplified
writes won't immediately hit disk and can be smoothed out along the way.
Clustering doesn't require perfect sorting: we need to minimize the
number of pages fetched; it's OK if the pages are not consecutive on
disk.
I see the option for clustered heap writes as follows:
- on tuple insertion, check whether the table is clustered by some index;
- if the table is clustered, we don't write into the last used page,
but instead go into the index and collect the heap page numbers of
index tuples that are less than or equal to the current one, up to the
index page boundary (or some other exit strategy; at least one heap
page is needed) - roughly the lookup illustrated in the SQL below;
- if we can fit the tuple into one of those pages, write it there;
- if we cannot fit it, consult statistics to see whether there is too
much empty space in the pages. If more than 50% of the space is empty,
get a page from the FSM; create a new page otherwise.
(Looking into the FSM is a safety measure for people who specifically
insert tuples in backward-sorted order.)
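At SQL level the lookup step can be sketched with the ctid decoding
trick (purely an illustration, not the proposed implementation):

-- heap pages holding the nearest existing tuples for this driver,
-- found via the (id, ts) index; the insert path would try these first
select (ctid::text::point)[0]::bigint as heap_page, count(*)
from (
    select ctid
    from positions
    where id = :id and ts <= :new_ts
    order by ts desc
    limit 60
) nearest
group by 1;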
This would not require more space than is currently needed to keep a
vacuumed all-fields index, and it would let us omit VACUUM passes over
the relations. A pencil-and-paper simulation shows it to be a good
thing. Do you have thoughts, can you help with an implementation, or
can you tell me why it would be a bad idea and we need to follow our
competitor in moving away to another RDBMS? :)
I encourage everyone to help with at least
https://commitfest.postgresql.org/16/952/, as no good SQL-level
workaround exists for it. After that, enabling autovacuum for tables
that only ever receive inserts and no deletes would be cool to have, so
that their pages get marked as all-visible.
Hackers, can you help me keep Postgres in the system? :)
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
A clustered updateable heap would have a structure very similar to a
B-Tree: we need to somehow locate the target page for an insert by key,
handle page overflow/underflow (split/merge), and so on.
So if something looks like a B-Tree and behaves like a B-Tree, then
most likely we should not reinvent the wheel and should just use a
B-Tree.
From my point of view, a covering index (CREATE INDEX .. INCLUDE on
commitfest https://commitfest.postgresql.org/15/1350/) is what is
needed in this case.
Maybe some more options are needed to force an index-only scan for
append-only data. While this patch is not committed yet, it is possible
to try creating a standard compound index including all record columns
(problems can be caused by types not having the required comparison
operators).
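For the table from the original mail that would be something like the
following (the INCLUDE form uses the syntax proposed in that patch; the
plain compound form can be tried today, provided every column type has
a btree operator class - PostGIS provides a basic one for geometry):

create index positions_id_ts_covering
    on positions (id, ts)
    include (geom, heading, speed, accuracy, source);

-- fallback while INCLUDE is not available: put everything into the key
create index positions_all_columns
    on positions (id, ts, geom, heading, speed, accuracy, source);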
An alternative solution is manual record clustering. It can be achieved
by storing several points of the same route in one Postgres record,
using the Postgres array or json types.
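A minimal sketch of such packing with jsonb (the names are made up, and
ST_AsGeoJSON assumes PostGIS):

create table positions_packed (
    id        uuid,
    ts_minute timestamp,
    points    jsonb,   -- e.g. [{"ts": ..., "geom": ..., "speed": ...}, ...]
    primary key (id, ts_minute)
);

-- periodically roll raw rows up into one row per driver per minute
insert into positions_packed (id, ts_minute, points)
select id,
       date_trunc('minute', ts),
       jsonb_agg(jsonb_build_object('ts', ts,
                                    'geom', ST_AsGeoJSON(geom)::jsonb,
                                    'speed', speed)
                 order by ts)
from positions
group by id, date_trunc('minute', ts);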
The approach with json was used by the Platon company to solve a
similar task (tracking information about vehicle movements).
Although it may seem a little bit strange to use JSON for statically
structured data, Platon was able to reduce storage size several times
over and increase query execution speed.
Instead of the array/json types you can define your own "vector"/"tile"
types or use my extension VOPS. Certainly, in all these cases you will
have to rewrite your queries, and they will become more complicated.
It is actually a very common problem that the order in which data is
inserted differs from the order in which it is traversed. The same
thing happens in most trading systems, where data is imported in
time-ascending order while it is usually accessed and analyzed grouped
by symbol. The most general solution to the problem is to maintain
several different representations of the data.
These can be, for example, "vertical" and "horizontal" representations.
In Vertica you can create an arbitrary number of "projections" in which
the data is sorted in different ways.
In Postgres this can be achieved using indexes or external storages. I
hope that the extensible heap API will simplify development and
integration of such alternative storages.
But even right now you can try to do some manual clustering using
indexes or compound types.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company