Re: [PERFORM] Storing large documents - one table or partition by doc?

2016-09-25 Thread Jeff Janes
On Fri, Sep 23, 2016 at 3:12 AM, Dev Nop  wrote:

> I’m storing thousands of independent documents each containing around 20k
> rows. The larger the document, the more likely it is to be active with
> inserts and updates (1000s/day). The most common read query is to get all
> the rows for a single document (100s/day).
>

How can the query be an order of magnitude less than the writes? Wouldn't
anything doing an insert or update want to see the results of other
people's inserts/updates about as frequently as they happen?




> It will be supporting real-time collaboration but with strong-consistency
> for a simple schema so not well-suited to dedicated "document databases"
> that assume schema-less & eventual consistency. I won’t have great
> hardware/budget so need to squeeze the most out of the least.
>
> My question is whether to put all documents into a single huge table or
> partition by document?
>
> The documents are independent so it's purely a performance question. It's
> too many tables for PostgreSQL's partitioning support but I don’t get any
> benefit from a master table and constraints. Handling partitioning in
> application logic is effectively zero cost.
>
> I know that 1000s of tables is regarded as an anti-pattern but I can only
> see the performance and maintenance benefits of one table per independent
> document e.g. fast per-table vacuum, incremental schema updates, easy
> future sharding. A monster table will require additional key columns and
> indexes that don’t have any value beyond allowing the documents to sit in
> the same table.
>

If you go the partitioned route, I would add the extra column anyway (but
not an index on it), so that it is there if/when you need it.
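
Something along these lines, for example (just a sketch; the table and column
names here are invented):

CREATE TABLE doc_12345 (
    doc_id     bigint      NOT NULL DEFAULT 12345,  -- redundant for now; deliberately left unindexed
    row_id     bigint      NOT NULL,
    body       text,
    updated_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (row_id)
);

That way a later consolidation or sharding step can union the per-document
tables without having to backfill a document identifier first.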


>
> The only downsides seem to be the system level per-table overhead but I
> only see that as a problem if I have a very long tail of tiny documents.
> I'd rather solve that problem if it occurs than manage an
> all-eggs-in-one-basket monster table.
>
> Is there anything significant I am missing in my reasoning?
>

If you use a reasonably modern version of PostgreSQL (say, >=9.4), the
overhead of having 1000s of tables should not be too large of a problem.
When you get into the 100,000 range, it is likely to start being a
problem.  If you get to 1,000,000, you almost definitely have a problem.
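
One way to keep an eye on where you stand is to watch the system catalogs
themselves, e.g. (a rough sketch; the thresholds will vary with your schema):

-- total number of tables and indexes
SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i');

-- size of the catalogs that grow with every relation and column
SELECT pg_size_pretty(pg_total_relation_size('pg_catalog.pg_class'))     AS pg_class_size,
       pg_size_pretty(pg_total_relation_size('pg_catalog.pg_attribute')) AS pg_attribute_size;

When those catalog relations get large, catalog lookups and per-backend
caches tend to be where the pain shows up first.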

Cheers,

Jeff


[PERFORM] Millions of tables

2016-09-25 Thread Greg Spiegelberg
Hey all,

Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
has said not to have millions of tables.  I too have long believed it until
recently.

AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
tables.  Table creation initially took 0.018031 secs, averaged 0.027467 secs,
and after tossing out outliers (qty 5) the maximum creation time found was
0.66139 seconds.  Total time: 30 hours, 31 minutes and 8.435049 seconds.
Tables were created by a single process.  Do note that table creation is
done via a plpgsql function, as there are other housekeeping tasks necessary,
though minimal.
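
For reference, the creation function is roughly along these lines (a
simplified sketch; the real one does more housekeeping and the column names
here are made up):

CREATE OR REPLACE FUNCTION create_partition(p_key bigint)
RETURNS void LANGUAGE plpgsql AS $$
BEGIN
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS part_%s (
             rec_id     int8        NOT NULL,
             flag       boolean     NOT NULL DEFAULT false,
             created_at timestamptz NOT NULL DEFAULT now(),
             PRIMARY KEY (rec_id)
         )', p_key);
    EXECUTE format(
        'CREATE INDEX IF NOT EXISTS part_%s_created_idx ON part_%s (created_at)',
        p_key, p_key);
    -- grants, registration in a lookup table, etc. omitted
END;
$$;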

No system tuning but here is a list of PostgreSQL knobs and switches:
shared_buffers = 2GB
work_mem = 48 MB
max_stack_depth = 4 MB
synchronous_commit = off
effective_cache_size = 200 GB
pg_xlog is on its own file system

There are still some obvious problems.  General DBA functions such as
VACUUM and ANALYZE should not be done.  Each will run forever and cause
much grief.  Backups are problematic in the traditional pg_dump and PITR
space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
it in my test case) are no-no's.  A system or database crash could take
potentially hours to days to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked at
DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
it's antiquated and don't get me started on "Hadoop".  I looked at many
others, and ultimately the recommendation from each vendor was to have one
table for all data.  That overcomes the millions of tables problem, right?

Problem with the "one big table" solution is I anticipate 1,200 trillion
records.  Random access is expected and the customer expects <30ms reads
for a single record fetch.

No data is loaded... yet.  Table and index creation only.  I am interested
in the opinions of all, including tests I may perform.  If you had this
setup, what would you capture / analyze?  I have a job running preparing
data.  I did this on a much smaller scale (50k tables) and data load via a
function allowed close to 6,000 records/second.  The schema has been
simplified since, and the last test reached just over 20,000 records/second
with 300k tables.

I'm not looking for alternatives yet but input to my test.  Takers?

I can't promise immediate feedback but will do my best to respond with
results.

TIA,
-Greg


Re: [PERFORM] Millions of tables

2016-09-25 Thread julyanto SUTANDANG
Dear Greg,

Have you checked PostgresXL?
With millions of tables, how does the app choose which table is appropriate?
In my opinion, at that scale it should go with a parallel query approach with
data sharing, like what PostgresXL does.

Thanks,


Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source and Open Mind Company)
www.equnix.co.id
Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 2282 F: +62216315281 M: +628164858028


On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg
 wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems.  General DBA functions such as VACUUM
> and ANALYZE should not be done.  Each will run forever and cause much grief.
> Backups are problematic in the traditional pg_dump and PITR space.  Large
> JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my test
> case) are no-no's.  A system or database crash could take potentially hours
> to days to recover.  There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads for
> a single record fetch.
>
> No data is loaded... yet  Table and index creation only.  I am interested in
> the opinions of all including tests I may perform.  If you had this setup,
> what would you capture / analyze?  I have a job running preparing data.  I
> did this on a much smaller scale (50k tables) and data load via function
> allowed close to 6,000 records/second.  The schema has been simplified since
> and last test reach just over 20,000 records/second with 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg




Re: [PERFORM] Millions of tables

2016-09-25 Thread Mike Sofen
From: Greg Spiegelberg  Sent: Sunday, September 25, 2016 7:50 PM
… Over the weekend, I created 8M tables with 16M indexes on those tables. 

… A system or database crash could take potentially hours to days to recover.  
There are likely other issues ahead.

 

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, 
BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated 
and don't get me started on "Hadoop".  Problem with the "one big table" 
solution is I anticipate 1,200 trillion records.  Random access is expected and 
the customer expects <30ms reads for a single record fetch.

 

I'm not looking for alternatives yet but input to my test.

_

 

Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the 
risks/limitations of the known for the risks of the unknown.  The unknown 
being, in the numerous places where postgres historical development may have 
cut corners, you may be the first to exercise those corners and flame out like 
the recent SpaceX rocket.

 

Put it another way – you’re going to bet your career (perhaps) or a client’s 
future on an architectural model that just doesn’t seem feasible.  I think 
you’ve got a remarkable design problem to solve, and am glad you’ve chosen to 
share that problem with us.

 

And I do think it will boil down to this: it’s not that you CAN do it on 
Postgres (which you clearly can), but once in production, assuming things are 
actually stable, how will you handle the data management aspects like 
inevitable breakage, data integrity issues, backups, restores, user contention 
for resources, fault tolerance and disaster recovery.  Just listing the tables 
will take forever.  Add a column?  Never.  I do think the amount of testing 
you’ll need to do to prove that every normal data management function still works 
at that table count…that in itself is not going to be a lot of fun.

 

This one hurts my head.  Ironically, the most logical destination for this type 
of data may actually be Hadoop – auto-scale, auto-shard, fault tolerant, 
etc…and I’m not a Hadoopie.

 

I am looking forward to hearing how this all plays out, it will be quite an 
adventure!  All the best,

 

Mike Sofen (Synthetic Genomics…on Postgres 9.5x)



Re: [PERFORM] Millions of tables

2016-09-25 Thread Greg Spiegelberg
Precisely why I shared with the group.  I must understand the risks
involved.  I need to explore whether it can be stable at this size and, if
not, when it becomes unstable.  Aside from locking down user access to
superuser, is there a way to prohibit database-wide VACUUM & ANALYZE?  I am
certainly putting my trust in autovacuum :) which is something I have not
yet fully explored how best to tune.
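
As far as I can tell there is no switch that outright blocks a manual
database-wide VACUUM, so the per-table autovacuum storage parameters look
like the closest knob, settable in the same creation function, e.g. (a
sketch with made-up values, on the hypothetical part_* tables):

ALTER TABLE part_12345 SET (
    autovacuum_vacuum_scale_factor  = 0.2,
    autovacuum_analyze_scale_factor = 0.1,
    autovacuum_vacuum_cost_delay    = 10
);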

Couple more numbers... ~231 GB is the size of PGDATA with 8M empty tables
and 16M empty indexes.  ~5% of inodes on the file system have been used.
Sar data during the 8M table creation shows a very stable and regular I/O
pattern.  Not a blip worth mentioning.

Another point worth mentioning, the tables contain a boolean, int8's and
timestamptz's only.  Nothing of variable size like bytea, text, json or
xml.  Each of the 8M tables will contain on the very high side between 140k
and 200k records.  The application also has a heads up as to which table
contains which record.  The searches come in saying "give me record X from
partition key Y" where Y identifies the table and X is used in the filter
on the table.
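
Mechanically the fetch is just dynamic SQL against one small table, something
like this (a sketch, reusing the hypothetical part_* naming from my earlier
mail):

CREATE OR REPLACE FUNCTION fetch_record(p_key bigint, p_rec_id int8)
RETURNS TABLE (rec_id int8, flag boolean, created_at timestamptz)
LANGUAGE plpgsql AS $$
BEGIN
    RETURN QUERY EXECUTE format(
        'SELECT rec_id, flag, created_at FROM part_%s WHERE rec_id = $1',
        p_key)
    USING p_rec_id;
END;
$$;

so the only index touched is the primary key of a single table with at most
~200k rows.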

Last point: add column will never be done.  I can hear eyes rolling :) but
the schema and its intended use are complete.  You'll have to trust me on
that one.

-Greg

On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen  wrote:

> *From:* Greg Spiegelberg  *Sent:* Sunday, September 25, 2016 7:50 PM
> … Over the weekend, I created 8M tables with 16M indexes on those tables.
>
> … A system or database crash could take potentially hours to days to
> recover.  There are likely other issues ahead.
>
>
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  Problem with the
> "one big table" solution is I anticipate 1,200 trillion records.  Random
> access is expected and the customer expects <30ms reads for a single record
> fetch.
>
>
>
> I'm not looking for alternatives yet but input to my test.
>
> _
>
>
>
> Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the
> risks/limitations of the known for the risks of the unknown.  The unknown
> being, in the numerous places where postgres historical development may
> have cut corners, you may be the first to exercise those corners and flame
> out like the recent SpaceX rocket.
>
>
>
> Put it another way – you’re going to bet your career (perhaps) or a
> client’s future on an architectural model that just doesn’t seem feasible.
> I think you’ve got a remarkable design problem to solve, and am glad you’ve
> chosen to share that problem with us.
>
>
>
> And I do think it will boil down to this: it’s not that you CAN do it on
> Postgres (which you clearly can), but once in production, assuming things
> are actually stable, how will you handle the data management aspects like
> inevitable breakage, data integrity issues, backups, restores, user
> contention for resources, fault tolerance and disaster recovery.  Just
> listing the tables will take forever.  Add a column?  Never.  I do think
> the amount of testing you’ll need to do prove that every normal data
> management function still works at that table count…that in itself is going
> to be not a lot of fun.
>
>
>
> This one hurts my head.  Ironically, the most logical destination for this
> type of data may actually be Hadoop – auto-scale, auto-shard, fault
> tolerant, etc…and I’m not a Hadoopie.
>
>
>
> I am looking forward to hearing how this all plays out, it will be quite
> an adventure!  All the best,
>
>
>
> Mike Sofen (Synthetic Genomics…on Postgres 9.5x)
>


Re: [PERFORM] Millions of tables

2016-09-25 Thread Greg Spiegelberg
I did look at PostgresXL and CitusDB.  Both are admirable; however, neither
could support the need to read a random record consistently in under 30ms.
It's a similar problem Cassandra and others have: network latency.  At this
scale, to provide the ability to access any given record amongst trillions
it is imperative to know precisely where it is stored (system & database)
and read a relatively small index.  I have other requirements that prohibit
use of any technology that is eventually consistent.

I liken the problem to fishing.  To find a particular fish of length, size,
color &c in a data lake you must accept the possibility of scanning the
entire lake.  However, if all fish were in barrels where each barrel had a
particular kind of fish of specific length, size, color &c then the problem
is far simpler.

-Greg

On Sun, Sep 25, 2016 at 9:04 PM, julyanto SUTANDANG 
wrote:

> Dear Greg,
>
> Have you checked PostgresXL ?
> with millions of table, how the apps choose which table is approriate?
> in my opinion, with that scale it should go with parallel query with
> data sharing like what PostgresXL is done.
>
> Thanks,
>
>
> Julyanto SUTANDANG
>
> Equnix Business Solutions, PT
> (An Open Source and Open Mind Company)
> www.equnix.co.id
> Pusat Niaga ITC Roxy Mas Blok C2/42.  Jl. KH Hasyim Ashari 125, Jakarta
> Pusat
> T: +6221 2282 F: +62216315281 M: +628164858028
>
>
> Caution: The information enclosed in this email (and any attachments)
> may be legally privileged and/or confidential and is intended only for
> the use of the addressee(s). No addressee should forward, print, copy,
> or otherwise reproduce this message in any manner that would allow it
> to be viewed by any individual not originally listed as a recipient.
> If the reader of this message is not the intended recipient, you are
> hereby notified that any unauthorized disclosure, dissemination,
> distribution, copying or the taking of any action in reliance on the
> information herein is strictly prohibited. If you have received this
> communication in error, please immediately notify the sender and
> delete this message.Unless it is made by the authorized person, any
> views expressed in this message are those of the individual sender and
> may not necessarily reflect the views of PT Equnix Business Solutions.
>
>
> On Mon, Sep 26, 2016 at 9:50 AM, Greg Spiegelberg
>  wrote:
> > Hey all,
> >
> > Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
> time
> > has said not to have millions of tables.  I too have long believed it
> until
> > recently.
> >
> > AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
> for
> > PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> > tables.  Table creation initially took 0.018031 secs, average 0.027467
> and
> > after tossing out outliers (qty 5) the maximum creation time found was
> > 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> > Tables were created by a single process.  Do note that table creation is
> > done via plpgsql function as there are other housekeeping tasks necessary
> > though minimal.
> >
> > No system tuning but here is a list of PostgreSQL knobs and switches:
> > shared_buffers = 2GB
> > work_mem = 48 MB
> > max_stack_depth = 4 MB
> > synchronous_commit = off
> > effective_cache_size = 200 GB
> > pg_xlog is on it's own file system
> >
> > There are some still obvious problems.  General DBA functions such as
> VACUUM
> > and ANALYZE should not be done.  Each will run forever and cause much
> grief.
> > Backups are problematic in the traditional pg_dump and PITR space.  Large
> > JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in my
> test
> > case) are no-no's.  A system or database crash could take potentially
> hours
> > to days to recover.  There are likely other issues ahead.
> >
> > You may wonder, "why is Greg attempting such a thing?"  I looked at
> > DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> > it's antiquated and don't get me started on "Hadoop".  I looked at many
> > others and ultimately the recommended use of each vendor was to have one
> > table for all data.  That overcomes the millions of tables problem,
> right?
> >
> > Problem with the "one big table" solution is I anticipate 1,200 trillion
> > records.  Random access is expected and the customer expects <30ms reads
> for
> > a single record fetch.
> >
> > No data is loaded... yet  Table and index creation only.  I am
> interested in
> > the opinions of all including tests I may perform.  If you had this
> setup,
> > what would you capture / analyze?  I have a job running preparing data.
> I
> > did this on a much smaller scale (50k tables) and data load via function
> > allowed close to 6,000 records/second.  The schema has been simplified
> since
> > and last test reach just over 20,000 records/second with 300k tables.
> >
> > I'm not looking for alternatives yet but input to my test.  

Re: [PERFORM] Millions of tables

2016-09-25 Thread Gavin Flower

Hi Greg,

Please follow the conventions of this mailing list, to avoid confusion
- see the bottom of this posting for further comments.



On 26/09/16 17:05, Greg Spiegelberg wrote:
Precisely why I shared with the group.  I must understand the risks 
involved.  I need to explore if it can be stable at this size when 
does it become unstable?  Aside from locking down user access to 
superuser, is there a way to prohibit database-wide VACUUM & ANALYZE?  
Certainly putting my trust in autovacuum :) which is something I have 
not yet fully explored how to best tune.


Couple more numbers... ~231 GB is the size of PGDATA with 8M empty 
tables and 16M empty indexes.  ~5% of inodes on the file system have 
been used.  Sar data during the 8M table creation shows a very stable 
and regular I/O pattern.  Not a blip worth mentioning.


Another point worth mentioning, the tables contain a boolean, int8's 
and timestamptz's only.  Nothing of variable size like bytea, text, 
json or xml.  Each of the 8M tables will contain on the very high side 
between 140k and 200k records.  The application also has a heads up as 
to which table contains which record.  The searches come in saying 
"give me record X from partition key Y" where Y identifies the table 
and X is used in the filter on the table.


Last point, add column will never be done.  I can hear eyes rolling :) 
but the schema and it's intended use is complete.  You'll have to 
trust me on that one.


-Greg

On Sun, Sep 25, 2016 at 9:23 PM, Mike Sofen wrote:


*From:* Greg Spiegelberg *Sent:* Sunday, September 25, 2016 7:50 PM
… Over the weekend, I created 8M tables with 16M indexes on those
tables.

… A system or database crash could take potentially hours to days
to recover.  There are likely other issues ahead.

You may wonder, "why is Greg attempting such a thing?"  I looked
at DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's
face it, it's antiquated and don't get me started on "Hadoop". 
Problem with the "one big table" solution is I anticipate 1,200
trillion records.  Random access is expected and the customer
expects <30ms reads for a single record fetch.

I'm not looking for alternatives yet but input to my test.

_

Holy guacamole, batman!  Ok, here’s my take:  you’ve traded the
risks/limitations of the known for the risks of the unknown.  The
unknown being, in the numerous places where postgres historical
development may have cut corners, you may be the first to exercise
those corners and flame out like the recent SpaceX rocket.

Put it another way – you’re going to bet your career (perhaps) or
a client’s future on an architectural model that just doesn’t seem
feasible.  I think you’ve got a remarkable design problem to
solve, and am glad you’ve chosen to share that problem with us.

And I do think it will boil down to this: it’s not that you CAN do
it on Postgres (which you clearly can), but once in production,
assuming things are actually stable, how will you handle the data
management aspects like inevitable breakage, data integrity
issues, backups, restores, user contention for resources, fault
tolerance and disaster recovery.  Just listing the tables will
take forever.  Add a column? Never.  I do think the amount of
testing you’ll need to do prove that every normal data management
function still works at that table count…that in itself is going
to be not a lot of fun.

This one hurts my head.  Ironically, the most logical destination
for this type of data may actually be Hadoop – auto-scale,
auto-shard, fault tolerant, etc…and I’m not a Hadoopie.

I am looking forward to hearing how this all plays out, it will be
quite an adventure!  All the best,

Mike Sofen (Synthetic Genomics…on Postgres 9.5x)



In this list, the convention is to post replies at the end (with some rare 
exceptions), or interspersed when appropriate,
and to omit parts no longer relevant.

The motivation for bottom posting like this is that people get to see the
context before the reply, AND emails don't end
up getting longer & longer as people reply at the beginning, forgetting to trim
the now-irrelevant stuff at the end.


Cheers,
Gavin





Re: [PERFORM] Millions of tables

2016-09-25 Thread Jeff Janes
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg 
wrote:

> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on it's own file system
>
> There are some still obvious problems.  General DBA functions such as
> VACUUM and ANALYZE should not be done.  Each will run forever and cause
> much grief.
>

Why would the auto versions of those cause less grief than the manual
versions?


>   Backups are problematic in the traditional pg_dump and PITR space.
>

Is there a third option to those two spaces?  File-system snapshots?


> Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing it in
> my test case) are no-no's.  A system or database crash could take
> potentially hours to days to recover.
>

Isn't that a show-stopper?


> There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads
> for a single record fetch.
>

Sorry, I don't really follow.  Whether you have 1 table or millions,
eventually someone has to go get the data off the disk. Why would the
number of tables make much of a difference to that fundamental?

Also, how many tablespaces do you anticipate having?  Can you get 120
petabytes of storage all mounted to one machine?


> No data is loaded... yet  Table and index creation only.  I am interested
> in the opinions of all including tests I may perform.  If you had this
> setup, what would you capture / analyze?  I have a job running preparing
> data.  I did this on a much smaller scale (50k tables) and data load via
> function allowed close to 6,000 records/second.  The schema has been
> simplified since and last test reach just over 20,000 records/second with
> 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>

Go through and put one row (or 8kB worth of rows) into each of the 8 million
tables.  The stats collector and the autovacuum process will start going
nuts.  Now, maybe you can deal with it.  But maybe not.  That is the first
non-obvious thing I'd look at.
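
Something like the following would do it (a rough sketch using the
hypothetical part_* tables sketched earlier in the thread; in practice you
would commit in batches rather than run one giant DO block):

DO $$
DECLARE
    i bigint;
BEGIN
    FOR i IN 1..8000000 LOOP
        EXECUTE format('INSERT INTO part_%s (rec_id) VALUES (1)', i);
        IF i % 100000 = 0 THEN RAISE NOTICE 'touched % tables', i; END IF;
    END LOOP;
END;
$$;

-- then watch what autovacuum and the stats collector are doing, e.g.:
SELECT count(*) FROM pg_stat_activity WHERE query LIKE 'autovacuum:%';
SELECT pg_size_pretty(pg_database_size(current_database()));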

Cheers,

Jeff


Re: [PERFORM] Millions of tables

2016-09-25 Thread julyanto SUTANDANG
-sorry for my last email, which also was not bottom posted-

Hi Greg,
On Mon, Sep 26, 2016 at 11:19 AM, Greg Spiegelberg 
wrote:

> I did look at PostgresXL and CitusDB.  Both are admirable however neither
> could support the need to read a random record consistently under 30ms.
> It's a similar problem Cassandra and others have: network latency.  At this
> scale, to provide the ability to access any given record amongst trillions
> it is imperative to know precisely where it is stored (system & database)
> and read a relatively small index.  I have other requirements that prohibit
> use of any technology that is eventually consistent.
>
Then you can get below 30ms, but how many processes might you have to run
concurrently?
This is something that you should consider: a single machine can only have
fewer than 50 hardware threads for Intel, or 192 for Power8, which is still
far below millions compared with the number of tables (8 million).
If you use indexes correctly, you would not need a sequential scan since the
scanning runs in memory (the index is loaded into memory).
Do you plan to query through the master table of the partition? That is quite
slow actually, considering the millions of rules to check for every query.

With 8 million tables of data, you would require very big data storage for
sure, and it would not fit mounted on a single machine unless you are planning
to use IBM z machines.


> I liken the problem to fishing.  To find a particular fish of length,
> size, color &c in a data lake you must accept the possibility of scanning
> the entire lake.  However, if all fish were in barrels where each barrel
> had a particular kind of fish of specific length, size, color &c then the
> problem is far simpler.
>
>