Following list etiquette, responding inline ;)

On Mon, Sep 26, 2016 at 2:28 AM, Álvaro Hernández Tortosa <a...@8kdata.com> wrote:
> On 26/09/16 05:50, Greg Spiegelberg wrote:
>> Hey all,
>>
>> Obviously everyone who has been in PostgreSQL, or almost any RDBMS, for
>> any length of time has said not to have millions of tables. I too had
>> long believed it, until recently.
>>
>> My test rig is an AWS d2.8xlarge instance with 9.5, using XFS on EBS
>> (io1) for PGDATA. Over the weekend, I created 8M tables with 16M indexes
>> on those tables. Table creation initially took 0.018031 seconds, averaged
>> 0.027467 seconds, and after tossing out outliers (qty 5) the maximum
>> creation time found was 0.66139 seconds. Total time was 30 hours, 31
>> minutes and 8.435049 seconds. Tables were created by a single process.
>> Do note that table creation is done via a plpgsql function, as there are
>> other housekeeping tasks necessary, though minimal.
>>
>> No system tuning, but here is a list of PostgreSQL knobs and switches:
>> shared_buffers = 2GB
>> work_mem = 48 MB
>> max_stack_depth = 4 MB
>> synchronous_commit = off
>> effective_cache_size = 200 GB
>> pg_xlog is on its own file system
>>
>> There are still some obvious problems. General DBA functions such as
>> VACUUM and ANALYZE should not be done. Each will run forever and cause
>> much grief. Backups are problematic in the traditional pg_dump and PITR
>> space. Large JOINs via VIEW, SELECT or table inheritance (which I am
>> abusing in my test case) are no-nos. A system or database crash could
>> take potentially hours to days to recover. There are likely other issues
>> ahead.
>>
>> You may wonder, "Why is Greg attempting such a thing?" I looked at
>> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
>> it's antiquated, and don't get me started on "Hadoop". I looked at many
>> others, and ultimately each vendor's recommended usage was to have one
>> table for all data. That overcomes the millions-of-tables problem, right?
>>
>> The problem with the "one big table" solution is that I anticipate 1,200
>> trillion records. Random access is expected, and the customer expects
>> <30ms reads for a single record fetch.
>>
>> No data is loaded yet; table and index creation only. I am interested in
>> the opinions of all, including tests I may perform. If you had this
>> setup, what would you capture / analyze? I have a job running that is
>> preparing data. I did this on a much smaller scale (50k tables), and data
>> load via function allowed close to 6,000 records/second. The schema has
>> been simplified since, and the last test reached just over 20,000
>> records/second with 300k tables.
>>
>> I'm not looking for alternatives yet, but input on my test. Takers?
>>
>> I can't promise immediate feedback but will do my best to respond with
>> results.
>>
>> TIA,
>> -Greg
>
> Hi Greg.
>
> This is a problem (creating a large number of tables; a really large
> number indeed) that we researched at my company a while ago. You might
> want to read about it:
> https://www.pgcon.org/2013/schedule/events/595.en.html

updatedb, funny. Thank you for the pointer. I had no intention of going to
1B tables. I may need to understand autovacuum better. My impression was
that it consults statistics and performs vacuums one table at a time, based
on the vacuum threshold formula at
https://www.postgresql.org/docs/9.5/static/routine-vacuuming.html.

-Greg
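P.S. For reference, that page gives the trigger condition as "vacuum
threshold = vacuum base threshold + vacuum scale factor * number of tuples",
compared against the dead-tuple count tracked for each table. Below is a
rough sketch of that check expressed in SQL; it is an illustration only,
assumes the global settings (ignoring any per-table autovacuum storage
parameters), and is not how autovacuum itself is implemented:

  -- Tables whose dead-tuple count currently exceeds the documented threshold.
  SELECT s.relname,
         s.n_dead_tup,
         current_setting('autovacuum_vacuum_threshold')::bigint
           + current_setting('autovacuum_vacuum_scale_factor')::float8
             * c.reltuples AS vacuum_threshold
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid
  WHERE s.n_dead_tup >
        current_setting('autovacuum_vacuum_threshold')::bigint
          + current_setting('autovacuum_vacuum_scale_factor')::float8
            * c.reltuples
  ORDER BY s.n_dead_tup DESC;

With millions of tables, even walking the catalog and statistics like this
is presumably a sizable amount of work per autovacuum cycle.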