Rearchitecting for storage

2019-07-18 Thread Matthew Pounsett
I've recently inherited a database that is dangerously close to outgrowing
the available storage on its existing hardware.  I'm looking for (pointers
to) advice on scaling the storage in a financially constrained
not-for-profit.

The current size of the DB's data directory is just shy of 23TB.  When I
received the machine it's on, it was configured with 18x3TB drives in
RAID10 (9x 2-drive mirrors striped together) for about 28TB of available
storage.  As a short term measure I've reconfigured them into RAID50 (3x
6-drive RAID5 arrays).  This is obviously a poor choice for performance,
but it'll get us through until we figure out what to do about
upgrading/replacing the hardware.  The host is constrained to 24x3TB
drives, so we can't get much of an upgrade by just adding/replacing disks.

One of my anticipated requirements for any replacement we design is that I
should be able to do upgrades of Postgres for up to five years without
needing major upgrades to the hardware.  My understanding of the standard
upgrade process is that this requires that the data directory be smaller
than the free storage (so that there is room to hold two copies of the data
directory simultaneously).  I haven't got detailed growth statistics yet,
but given that the DB has grown to 23TB in 5 years, I should assume that it
could double in the next five years, requiring 100TB of available storage
to be able to do upgrades.
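
(Until I have real growth numbers I'll probably just start recording the
database size from cron with something simple like this; pg_database_size()
and pg_size_pretty() are standard built-ins:)

    psql -At -c "SELECT now(), pg_size_pretty(pg_database_size(current_database()));"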

This seems to be right on the cusp of what is possible to fit in a single
chassis with a RAID10 configuration (at least, with commodity hardware),
which means we're looking at a pretty high cost:performance ratio.  I'd like
to see if we can find designs that get that ratio down a bit, or a lot, but
I'm a general sysadmin, and the detailed effects of those choices are
outside my limited DBA experience.

Are there good documents out there on sizing hardware for this sort of
mid-range storage requirement, one that is neither big data nor "small data"
able to fit on a single host?  I'm hoping for an overview of the tradeoffs
between single-head servers, dual-head setups with a JBOD array, or whatever else
is advisable to consider these days.  Corrections of any poor assumptions
exposed above are also quite welcome. :)

Thanks in advance for any assistance!


Re: Rearchitecting for storage

2019-07-18 Thread Matthew Pounsett
On Thu, 18 Jul 2019 at 13:34, Kenneth Marshall  wrote:

> Hi Matt,
>

Hi!  Thanks for your reply.


> Have you considered using the VDO compression for tables that are less
> update intensive?  Using just compression you can get almost 4X size
> reduction. For a database, I would forgo the deduplication function.
> You can then use a non-compressed tablespace for the heavier I/O tables
> and indexes.
>

VDO is a RedHat-only thing, isn't it?  We're not running RHEL... Debian.
Anyway, the bulk of the data (nearly 80%) is in a single table and its
indexes: ~6TB for the table, and ~12TB for its indices.  Even if we switched
over to RedHat, there'd be little value in compressing only the lesser-used
tables.
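
(For reference, sizes like those can be pulled with something along these
lines; the table name here is purely illustrative:)

    SELECT pg_size_pretty(pg_relation_size('results')) AS table_size,
           pg_size_pretty(pg_indexes_size('results'))  AS index_size;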


>
> > My understanding of the standard
> > upgrade process is that this requires that the data directory be smaller
> > than the free storage (so that there is room to hold two copies of the
> data
> > directory simultaneously).
>
> The link option with pg_upgrade does not require 2X the space, since it
> uses hard links instead of copying the files to the new cluster.
>

That would likely keep the extra storage requirements small, but still
non-zero.  Presumably the upgrade would be unnecessary if it could be done
without rewriting files.  Is there any rule of thumb for making sure one
has enough space available for the upgrade?   I suppose that would come
down to what exactly needs to get rewritten, in what order, etc., but the
pg_upgrade docs don't seem to have that detail.  For example, since we've
got an ~18TB table (including its indices), if that needs to be rewritten
then we're still looking at requiring significant extra storage.  Recent
experience suggests postgres won't necessarily do things in the most
storage-efficient way.. we just had a reindex on that database fail (in
--single-user) because 17TB was insufficient free storage for the db to
grow into.
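
(For the archives: if we do end up testing the hard-link route, the
invocation would be roughly like this; the paths and target version here
are just illustrative:)

    pg_upgrade \
      --old-bindir  /usr/lib/postgresql/9.4/bin \
      --new-bindir  /usr/lib/postgresql/11/bin \
      --old-datadir /srv/postgresql/9.4/data \
      --new-datadir /srv/postgresql/11/data \
      --link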


Re: Rearchitecting for storage

2019-07-19 Thread Matthew Pounsett
On Thu, 18 Jul 2019 at 19:53, Rob Sargent  wrote:

>
> >
> > That would likely keep the extra storage requirements small, but still
> non-zero.  Presumably the upgrade would be unnecessary if it could be done
> without rewriting files.  Is there any rule of thumb for making sure one
> has enough space available for the upgrade?   I suppose that would come
> down to what exactly needs to get rewritten, in what order, etc., but the
> pg_upgrade docs don't seem to have that detail.  For example, since we've
> got an ~18TB table (including its indices), if that needs to be rewritten
> then we're still looking at requiring significant extra storage.  Recent
> experience suggests postgres won't necessarily do things in the most
> storage-efficient way.. we just had a reindex on that database fail (in
> --single-user) because 17TB was insufficient free storage for the db to
> grow into.
> >
> Can you afford to drop and re-create those 6 indices?


Technically, yes.  I don't see any reason we'd be prevented from doing
that.  But, rebuilding them will take a long time.  That's a lot of
downtime to incur any time we update the DB.  I'd prefer to avoid it if I
can.  For scale, the recent 'reindex database' that failed ran for nine
days before it ran out of room, and that was in single-user.  Trying to do
that concurrently would take a lot longer, I imagine.
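
(If we ever did go that way, I imagine it would be something like the
following per index, with illustrative names; CONCURRENTLY avoids blocking
writes during the rebuild, at the cost of taking even longer:)

    DROP INDEX CONCURRENTLY results_timestamp_idx;
    CREATE INDEX CONCURRENTLY results_timestamp_idx ON results (test_timestamp);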


Re: Rearchitecting for storage

2019-07-19 Thread Matthew Pounsett
On Thu, 18 Jul 2019 at 21:59, Andy Colson  wrote:

> >
>
> Now might be a good time to consider splitting the database onto multiple
> computers.  Might be simpler with a mid-range database, then your plan for
> the future is "add more computers".
>

Hmm... yes.  Range partitioning seems like a possible way forward.  I
hadn't considered that yet.  We might hold that back for when a dual-head
approach (server + JBOD) can't scale anymore, but I think that's a long way
in the future.
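
(Declarative range partitioning along these lines is the sort of thing I'd
be looking at; it needs PostgreSQL 10 or later, and the names and ranges
here are illustrative:)

    CREATE TABLE results (
        test_date  date NOT NULL,
        payload    text
    ) PARTITION BY RANGE (test_date);

    CREATE TABLE results_2019 PARTITION OF results
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');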



Re: Rearchitecting for storage

2019-07-19 Thread Matthew Pounsett
On Fri, 19 Jul 2019 at 04:21, Luca Ferrari  wrote:

>
> This could be trivial, but any chance you can partition the table
> and/or archive unused records (at least temporarily)? An 18 TB table
> quite frankly sounds like a good candidate to contain records no one is
> interested in the near future.
>

Partitioning is a possibility.  The whole database is historical test
results, stored specifically for doing comparisons over time, so I'm not
sure we can actually archive anything.  Expiring old test data is a
discussion we might have to have, eventually.


> In any case, if you can partition the table chances are you can at
> least do a per-table backup that could simplify maintenance of the
> database.
>

My current backup plan for this database is on-site replication, and a
monthly pg_dump from the standby to be copied off-site.  Doing per-table
backups sounds like a great way to end up with an inconsistent backup, but
perhaps I misunderstand what you mean.
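
(To be clear about what we do now: the monthly off-site copy is a single
whole-database dump taken from the standby, roughly like this, with the
hostname, path, and database name illustrative:)

    pg_dump -h standby.example.org -Fc -f /backups/staging/testdb.dump testdb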

Another possibility is putting the server into backup mode and taking a
snapshot of the filesystem, but coordinating that across chassis (in the
case where partitioning is used) in such a way that the db is consistent
sounds like a hard problem... unless issuing pg_start_backup on the chassis
holding the master table coordinates backup mode on all the chassis holding
child tables at the same time?  I haven't read enough on that yet.
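
(The single-chassis version of that is straightforward enough, something
like the exclusive-backup form below; it's the multi-chassis coordination I
haven't worked out:)

    psql -c "SELECT pg_start_backup('fs-snapshot', true);"
    # ... take the filesystem snapshot(s) here ...
    psql -c "SELECT pg_stop_backup();"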


>
> In order of desperation, I would also check the log files (I mean, textual
> logs, not wals) because occasionally I found them requiring a few GBs
> on my disk, and that can be easily archived to gain some more extra
> space.
> Then I would go for some commodity NAS to attach as extra storage, at
> least for the upgrade process.
>

Okay.  So I guess the short answer is no, nobody really knows how to judge
how much space is required for an upgrade?  :)

Any logs we have are going to be a rounding error when compared to the
database itself.  And buying storage last-minute because an upgrade failed
is exactly the sort of thing that a resource-constrained not-for-profit
can't do.  We really need to be able to plan this out long term so that we
get as much as possible out of every dollar.


> If any of the following fails, I would probably drop all the indexes
> to gain extra space, perform the upgrade, and then reindex (removing
> the old cluster, in the case it has not been upgraded with the link
> option).
>

Yeah, this sort of trial-and-error approach to getting upgrades done bothers
me, but it seems like it'll be necessary once we start growing into whatever
new home we get for the db.

Thanks very much for your time on this.


Re: Rearchitecting for storage

2019-07-19 Thread Matthew Pounsett
On Thu, 18 Jul 2019 at 09:44, Matthew Pounsett  wrote:

>
> I've recently inherited a database that is dangerously close to outgrowing
> the available storage on its existing hardware.  I'm looking for (pointers
> to) advice on scaling the storage in a financially constrained
> not-for-profit.
>

Thanks for your replies, everyone.  Here's a quick summary of what I've got
out of this.

Although nobody really addressed the core question of the performance
tradeoffs in different storage architectures, perhaps the fact that nobody
mentioned them means there really aren't any.  We'll proceed on the
assumption that externally attached storage really doesn't make a
difference.  NAS storage seems like a poor choice to me, for performance
reasons, and nobody's really said anything to disabuse me of that notion.

We're going to have a look at the relative costs of single-head solutions
as well as dual-head (server + JBOD) setups, and see what gets us the most
growth for the least cost. We'll plan for enough storage to get us five
years of growth, and just accept that maybe in the 5th year we won't be
able to do in-place upgrades without dropping indexes.

Horizontal scalability through range partitioning sounds interesting, but I
don't think it's a cost-effective solution for us right now.  As long as
it's possible for the db to fit in a single server (or JBOD) using
commodity hardware, the incremental cost of adding more chassis (and
therefore more motherboards, more CPUs, more memory) isn't offset by a
reduced cost anywhere else (e.g. using cheaper drives).  And that's not
even accounting for the increased operational cost of coordinating the DB
across multiple servers.  It could be a useful approach if DB growth
outpaces historical averages and we need to add hardware before a normal
replacement cycle.  It could also be useful at the end of that replacement
cycle if DB growth has outpaced commodity hardware improvements, and single
server solutions are no longer viable.

The DB server we inherited is currently a single instance, but once we
expand and have replication in place I'll have to do some testing to see if
LVM compression gives us any performance boost (or incurs an acceptable
performance cost). The big question there is whether the processing
required to do the compression is faster than the difference in read times
on the disk... I think that might be dependent on the data and how it's
accessed.  It certainly seems like it could give us some benefits, but I
don't think it's an experiment I want to attempt with only a single
production copy of the DB; the downtime required to rebuild the DB server
for A+B comparisons would be unacceptable.
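
(When we do have a replica to experiment on, a rough first pass would
probably just be pgbench against compressed and uncompressed volumes,
something like this, even though synthetic data won't compress the way
ours does:)

    createdb bench
    pgbench -i -s 1000 bench         # populate a synthetic test database
    pgbench -c 8 -j 8 -T 600 bench   # 8 clients, 8 threads, 10 minutes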

Thanks again everyone.  This has been educational.


Re: Rearchitecting for storage

2019-07-19 Thread Matthew Pounsett
On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer  wrote:

> On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
> > Okay.  So I guess the short answer is no, nobody really knows how to
> > judge how much space is required for an upgrade?  :)
>
> As I understand it, a pg_upgrade --link uses only negligible extra
> space. It duplicates a bit of housekeeping information, but not your
> data tables or indexes. Your 18 TB table will definitely not be duplicated
> during the upgrade if you can use --link.
>

The documentation for pg_upgrade --link says that the old copy is no longer
usable, which means it's modifying files that are linked.  If it were only
modifying small housekeeping files, then it would be most efficient not to
link those, which would keep both copies of the db usable.  That seems
incompatible with your suggestion that it doesn't need to modify the data
files.  Depending on how it goes about doing that, it could mean a
significant short-term increase in storage requirements while the data is
being converted.

Going back to our recent 'reindex database' attempt, pgsql does not
necessarily do these things in the most storage-efficient manner; it seems
entirely likely that it would choose to use links to duplicate the data
directory, then create copies of each data file as it converts them over,
then link that back to the original for an atomic replacement.  That could
eat up a HUGE amount of storage during the conversion process without the
start and end sizes being very different at all.

Sorry, but I can't reconcile your use of "as I understand it" with your use
of "definitely".  It sounds like you're guessing, rather than speaking from
direct knowledge of how the internals of pg_upgrade work.


Re: [External] LIMIT not showing all results

2019-03-05 Thread Matthew Pounsett
On Tue, 5 Mar 2019 at 12:54, Tom Lane  wrote:

> Andreas Kretschmer  writes:
> > the other thing is, it would be nice to to know why the index is corrupt.
>
> Given that (a) this was triggered by a server migration and (b)
> the leading column of the index looks like it's probably varchar,
> I'm suspicious that the new server has different collation behavior.


The migration in question was an rsync from a Debian 9 box
running 9.4.19-0+deb8u1 to a FreeBSD 11 box running 9.4.20.


Re: [External] LIMIT not showing all results

2019-03-05 Thread Matthew Pounsett
On Tue, 5 Mar 2019 at 13:55, Tom Lane  wrote:

>
> Yeah, that would fit the theory :-(.  Debian would be using glibc
> and FreeBSD would not be.  If you were using C collation in the
> database, you'd be all right because that's standardized, but I'll
> bet you were using something else.  What does psql \l show for the
> "collate" setting of this database?  (Or, if by chance you had an
> explicit COLLATE setting on the column in question, what's that?)
>

All of the databases are using en_US.UTF-8, which is (I think) the default
these days for most distributions, isn't it?

So yeah.. that would be it.  Thanks for your help.
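
(For the record, a quick way to check the per-database settings without \l
is:)

    psql -c "SELECT datname, datcollate, datctype FROM pg_database;"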

The rsync migration was because we needed to do a cross-country copy before
putting the original DB server on a truck, but we couldn't get
pg_basebackup to complete the 22TB sync without dying.  rsync was
restartable, so we went that route instead.  Now that the two copies are
physically next to each other again, after we do a rebuild of the original
server I'll be syncing the data back (this time using pg_basebackup and
replication).  I *assume* we shouldn't expect similar collation problems
replicating data that way, but it seems prudent to check.
Should we?
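
(The copy back would be run from the rebuilt server, roughly like this,
with the hostname, user, and path illustrative:)

    pg_basebackup -h current-primary.example.org -U replicator \
        -D /srv/postgresql/9.4/data -X stream --checkpoint=fast --progress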


Re: [External] LIMIT not showing all results

2019-03-05 Thread Matthew Pounsett
On Tue, 5 Mar 2019 at 18:09, Tom Lane  wrote:

>
> If you're planning to install (the same version of) FreeBSD on
> the original server hardware, then rsync'ing back from the new
> system should be fine.  But Debian<->FreeBSD is gonna be trouble
> in either direction.
>

But I'm specifically NOT talking about doing an rsync to get the data
back.. the plan is to use in-protocol replication.  Maybe that's a
distinction without a difference, but that's why I brought it up.

The replication documentation, and more specifically the pg_basebackup
documentation, makes no mention of cross-OS replication as being a problem
for any reason.  If that is expected to be a problem, then perhaps that
should be updated?


Re: [External] LIMIT not showing all results

2019-03-05 Thread Matthew Pounsett
On Tue, 5 Mar 2019 at 18:39, Adrian Klaver  wrote:

> >
> > The replication documentation, and more specifically the pg_basebackup
> > documentation, makes no mention of cross-OS replication as being a
> > problem for any reason.  If that is expected to be a problem, then
> > perhaps that should be updated?
>
> Generally covered under:
>
> https://www.postgresql.org/docs/10/warm-standby.html#STANDBY-PLANNING
>
> "It is usually wise to create the primary and standby servers so that
> they are as similar as possible, at least from the perspective of the
> database server."
>

Nothing in that paragraph says to me that I'm going to have problems as a
result of differences in postgres software dependencies.  It's a pretty
vague warning that seems to imply that as long as your hardware
architecture and filesystem layout are identical there shouldn't be any
issues.

Thanks for your clarification, though.  We'll have to take that into
account in our migration plan back to the original server.  I don't think
we can afford the downtime for a dump/restore between systems, so we'll
likely just end up regenerating indexes again, before cutting the
application over to the other DB server.

> That is why later versions (10+) grew logical replication:
> https://www.postgresql.org/docs/10/logical-replication.html


That seems promising, but we'd have to deal with the upgrade to 10.x
first.  And presumably logical replication would allow more freedom in
replicating between mismatched versions as well... possibly as a
low-downtime migration path for getting very large databases moved to more
recent releases?
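
(i.e. once both ends are on 10 or later, something along these lines, with
illustrative names:)

    -- on the current primary
    CREATE PUBLICATION all_tables_pub FOR ALL TABLES;

    -- on the destination server
    CREATE SUBSCRIPTION all_tables_sub
        CONNECTION 'host=primary.example.org dbname=testdb user=replicator'
        PUBLICATION all_tables_pub;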