Rearchitecting for storage
I've recently inherited a database that is dangerously close to outgrowing the available storage on its existing hardware. I'm looking for (pointers to) advice on scaling the storage in a financially constrained not-for-profit.

The current size of the DB's data directory is just shy of 23TB. When I received the machine it's on, it was configured with 18x3TB drives in RAID10 (9x 2-drive mirrors striped together) for about 28TB of available storage. As a short-term measure I've reconfigured them into RAID50 (3x 6-drive RAID5 arrays). This is obviously a poor choice for performance, but it'll get us through until we figure out what to do about upgrading/replacing the hardware. The host is constrained to 24x3TB drives, so we can't get much of an upgrade by just adding/replacing disks.

One of my anticipated requirements for any replacement we design is that I should be able to do upgrades of Postgres for up to five years without needing major upgrades to the hardware. My understanding of the standard upgrade process is that it requires the data directory to be smaller than the free storage (so that there is room to hold two copies of the data directory simultaneously). I haven't got detailed growth statistics yet, but given that the DB has grown to 23TB in 5 years, I should assume that it could double in the next five years, requiring 100TB of available storage to be able to do upgrades. That seems to be right on the cusp of what is possible to fit in a single chassis with a RAID10 configuration (at least with commodity hardware), which means we're looking at a pretty high cost:performance ratio. I'd like to see if we can find designs that get that ratio down a bit, or a lot, but I'm a general sysadmin, and the detailed effects of those choices are outside of my limited DBA experience.

Are there good documents out there on sizing hardware for this sort of mid-range storage requirement, that is neither big data nor "small data" able to fit on a single host? I'm hoping for an overview of the tradeoffs between single-head setups, dual-head setups with a JBOD array, or whatever else is advisable to consider these days. Corrections of any poor assumptions exposed above are also quite welcome. :)

Thanks in advance for any assistance!
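For reference, a minimal sketch of how those growth statistics could be gathered: run something like the following on a schedule and chart the results. The size_history table is purely illustrative, not something that exists in this database.

    -- Sketch: record the total on-disk size of each database so growth
    -- can be charted over time. "size_history" is a hypothetical table.
    CREATE TABLE IF NOT EXISTS size_history (
        sampled_at timestamptz NOT NULL DEFAULT now(),
        datname    name        NOT NULL,
        bytes      bigint      NOT NULL
    );

    INSERT INTO size_history (datname, bytes)
    SELECT datname, pg_database_size(datname)
    FROM   pg_database
    WHERE  NOT datistemplate;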
Re: Rearchitecting for storage
On Thu, 18 Jul 2019 at 13:34, Kenneth Marshall wrote:

> Hi Matt,

Hi! Thanks for your reply.

> Have you considered using the VDO compression for tables that are less
> update intensive. Using just compression you can get almost 4X size
> reduction. For a database, I would forgo the deduplication function.
> You can then use a non-compressed tablespace for the heavier I/O tables
> and indexes.

VDO is a RedHat-only thing, isn't it? We're not running RHEL... Debian. Anyway, the bulk of the data (nearly 80%) is in a single table and its indexes: ~6TB in the table, and ~12TB in its indices. Even if we switched over to RedHat, there'd be little value in compressing only the lesser-used tables.

> > My understanding of the standard
> > upgrade process is that this requires that the data directory be smaller
> > than the free storage (so that there is room to hold two copies of the data
> > directory simultaneously).
>
> The link option with pg_upgrade does not require 2X the space, since it
> uses hard links instead of copying the files to the new cluster.

That would likely keep the extra storage requirements small, but still non-zero. Presumably the upgrade would be unnecessary if it could be done without rewriting any files. Is there any rule of thumb for making sure one has enough space available for the upgrade? I suppose that would come down to what exactly needs to get rewritten, in what order, etc., but the pg_upgrade docs don't seem to have that detail. For example, since we've got an ~18TB table (including its indices), if that needs to be rewritten then we're still looking at requiring significant extra storage. Recent experience suggests Postgres won't necessarily do things in the most storage-efficient way... we just had a reindex on that database fail (in --single-user) because 17TB was insufficient free storage for the DB to grow into.
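For anyone sizing the same problem, a breakdown like the table/index split above can be pulled with a query along these lines; a rough sketch, nothing specific to this database:

    -- Sketch: heap vs. index size for the ten largest ordinary tables.
    SELECT c.oid::regclass                               AS table_name,
           pg_size_pretty(pg_table_size(c.oid))          AS heap_and_toast,
           pg_size_pretty(pg_indexes_size(c.oid))        AS indexes,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total
    FROM   pg_class c
    WHERE  c.relkind = 'r'
    ORDER  BY pg_total_relation_size(c.oid) DESC
    LIMIT  10;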
Re: Rearchitecting for storage
On Thu, 18 Jul 2019 at 19:53, Rob Sargent wrote:

> > That would likely keep the extra storage requirements small, but still
> > non-zero. Presumably the upgrade would be unnecessary if it could be done
> > without rewriting files. Is there any rule of thumb for making sure one
> > has enough space available for the upgrade? I suppose that would come
> > down to what exactly needs to get rewritten, in what order, etc., but the
> > pg_upgrade docs don't seem to have that detail. For example, since we've
> > got an ~18TB table (including its indices), if that needs to be rewritten
> > then we're still looking at requiring significant extra storage. Recent
> > experience suggests postgres won't necessarily do things in the most
> > storage-efficient way... we just had a reindex on that database fail (in
> > --single-user) because 17TB was insufficient free storage for the db to
> > grow into.
>
> Can you afford to drop and re-create those 6 indices?

Technically, yes. I don't see any reason we'd be prevented from doing that. But rebuilding them will take a long time. That's a lot of downtime to incur any time we update the DB. I'd prefer to avoid it if I can. For scale, the recent 'reindex database' that failed ran for nine days before it ran out of room, and that was in single-user mode. Trying to do that concurrently would take a lot longer, I imagine.
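For completeness, a rebuild without taking writes offline would look roughly like the sketch below; the table and index names are hypothetical, and each concurrent build still needs scratch space and, as noted, a lot of time:

    -- Sketch: replace one index at a time while the table stays writable.
    -- Names are hypothetical; adjust the column list to the real definition.
    CREATE INDEX CONCURRENTLY big_table_col_idx_new ON big_table (some_col);
    DROP INDEX CONCURRENTLY big_table_col_idx;        -- the old, bloated index
    ALTER INDEX big_table_col_idx_new RENAME TO big_table_col_idx;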
Re: Rearchitecting for storage
On Thu, 18 Jul 2019 at 21:59, Andy Colson wrote:

> Now might be a good time to consider splitting the database onto multiple
> computers. Might be simpler with a mid-range database, then your plan for
> the future is "add more computers".

Hmm... yes. Range partitioning seems like a possible way forward. I hadn't considered that yet. We might hold that back for when a dual-head approach (server + JBOD) can't scale anymore, but I think that's a long way in the future.
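As a rough illustration of what range partitioning could look like (this assumes declarative partitioning from PostgreSQL 10+, and the table and column names are invented; on 9.x the same layout would need inheritance and triggers):

    -- Sketch: range-partition historical test results by time.
    CREATE TABLE test_results (
        run_at  timestamptz NOT NULL,
        test_id bigint      NOT NULL,
        result  jsonb
    ) PARTITION BY RANGE (run_at);

    CREATE TABLE test_results_2018 PARTITION OF test_results
        FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
    CREATE TABLE test_results_2019 PARTITION OF test_results
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');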
Re: Rearchitecting for storage
On Fri, 19 Jul 2019 at 04:21, Luca Ferrari wrote:

> This could be trivial, but any chance you can partition the table
> and/or archive unused records (at least temporarily)? An 18 TB table
> quite frankly sounds a good candidate to contain records no one is
> interested in the near future.

Partitioning is a possibility. The whole database is historical test results, stored specifically for doing comparisons over time, so I'm not sure we can actually archive anything. Expiring old test data is a discussion we might have to have eventually.

> In any case, if you can partition the table chances are you can at
> least do a per-table backup that could simplify maintenance of the
> database.

My current backup plan for this database is on-site replication, and a monthly pg_dump from the standby to be copied off-site. Doing per-table backups sounds like a great way to end up with an inconsistent backup, but perhaps I misunderstand what you mean. Another possibility is putting the server into backup mode and taking a snapshot of the filesystem, but coordinating that across chassis (in the case where partitioning is used) in such a way that the DB is consistent sounds like a hard problem... unless issuing pg_start_backup on the chassis holding the master table coordinates backup mode on all the chassis holding child tables at the same time? I haven't read enough on that yet.

> In desperate order, I would check also the log files (I mean, textual
> logs, not WALs) because occasionally I found them requiring a few GBs
> on my disk, and that can be easily archived to gain some more extra
> space. Then I would go for some commodity NAS to attach as extra
> storage, at least for the upgrade process.

Okay. So I guess the short answer is no, nobody really knows how to judge how much space is required for an upgrade? :) Any logs we have are going to be a rounding error when compared to the database itself. And buying storage last-minute because an upgrade failed is exactly the sort of thing that a resource-constrained not-for-profit can't do. We really need to be able to plan this out long term so that we get as much as possible out of every dollar.

> If any of the following fails, I would probably drop all the indexes
> to gain extra space, perform the upgrade, and then reindex (removing
> the old cluster, in the case it has not been upgraded with the link
> option).

Yeah, this sort of trial-and-error approach to getting upgrades done will bother me, but it seems like it'll be necessary once we start growing into whatever new home we get for the DB.

Thanks very much for your time on this.
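For reference, the backup-mode-plus-snapshot idea looks roughly like this on a single cluster (a minimal sketch; pg_start_backup only affects the cluster it is run on, so each chassis would need its own invocation and some external coordination):

    -- Sketch: put one cluster into backup mode around a filesystem snapshot.
    SELECT pg_start_backup('fs-snapshot', true);  -- second argument requests a fast checkpoint
    -- ... take the LVM/ZFS snapshot of the data directory here ...
    SELECT pg_stop_backup();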
Re: Rearchitecting for storage
On Thu, 18 Jul 2019 at 09:44, Matthew Pounsett wrote:

> I've recently inherited a database that is dangerously close to outgrowing
> the available storage on its existing hardware. I'm looking for (pointers
> to) advice on scaling the storage in a financially constrained
> not-for-profit.

Thanks for your replies, everyone. Here's a quick summary of what I've got out of this.

Although nobody really addressed the core question of the performance tradeoffs in different storage architectures, perhaps the fact that nobody mentioned them means there really aren't any. We'll proceed on the assumption that externally attached storage really doesn't make a difference. NAS storage seems like a poor choice to me, for performance reasons, and nobody's really said anything to disabuse me of that notion. We're going to have a look at the relative costs of single-head solutions as well as dual-head (server + JBOD) setups, and see what gets us the most growth for the least cost. We'll plan for enough storage to get us five years of growth, and just accept that maybe in the fifth year we won't be able to do in-place upgrades without dropping indexes.

Horizontal scalability through range partitioning sounds interesting, but I don't think it's a cost-effective solution for us right now. As long as it's possible for the DB to fit on a single server (or JBOD) using commodity hardware, the incremental cost of adding more chassis (and therefore more motherboards, more CPUs, more memory) isn't offset by a reduced cost anywhere else (e.g. using cheaper drives). And that's not even accounting for the increased operational cost of coordinating the DB across multiple servers. It could be a useful approach if DB growth outpaces historical averages and we need to add hardware before a normal replacement cycle. It could also be useful at the end of that replacement cycle if DB growth has outpaced commodity hardware improvements and single-server solutions are no longer viable.

The DB server we inherited is currently a single instance, but once we expand and have replication in place I'll have to do some testing to see whether LVM compression gives us any performance boost (or incurs an acceptable performance cost). The big question there is whether the processing required to do the compression is faster than the difference in read times on the disk... I think that might be dependent on the data and how it's accessed. It certainly seems like it could give us some benefits, but I don't think it's an experiment I want to attempt with only a single production copy of the DB; the downtime required to rebuild the DB server for A/B comparisons would be unacceptable.

Thanks again, everyone. This has been educational.
Re: Rearchitecting for storage
On Fri, 19 Jul 2019 at 11:25, Peter J. Holzer wrote:

> On 2019-07-19 10:41:31 -0400, Matthew Pounsett wrote:
> > Okay. So I guess the short answer is no, nobody really knows how to
> > judge how much space is required for an upgrade? :)
>
> As I understand it, a pg_upgrade --link uses only negligible extra
> space. It duplicates a bit of housekeeping information, but not your
> data tables or indexes. Your 18 TB table will definitely not be duplicated
> during the upgrade if you can use --link.

The documentation for pg_upgrade --link says that the old copy is no longer usable, which means it's modifying files that are linked. If it were only modifying small housekeeping files, then it would be most efficient not to link those, which would keep both copies of the DB usable. That seems incompatible with your suggestion that it doesn't need to modify the data files. Depending on how it goes about doing that, it could mean a significant short-term increase in storage requirements while the data is being converted.

Going back to our recent 'reindex database' attempt, Postgres does not necessarily do these things in the most storage-efficient manner; it seems entirely likely that it would choose to use links to duplicate the data directory, then create copies of each data file as it converts them over, and then link those back to the originals for an atomic replacement. That could eat up a HUGE amount of storage during the conversion process without the start and end sizes being very different at all.

Sorry, but I can't reconcile your use of "as I understand it" with your use of "definitely". It sounds like you're guessing, rather than speaking from direct knowledge of how the internals of pg_upgrade work.
Re: [External] LIMIT not showing all results
On Tue, 5 Mar 2019 at 12:54, Tom Lane wrote:

> Andreas Kretschmer writes:
> > the other thing is, it would be nice to know why the index is corrupt.
>
> Given that (a) this was triggered by a server migration and (b)
> the leading column of the index looks like it's probably varchar,
> I'm suspicious that the new server has different collation behavior.

The migration in question was an rsync from a Debian 9 box running 9.4.19-0+deb8u1 to a FreeBSD 11 box running 9.4.20.
Re: [External] LIMIT not showing all results
On Tue, 5 Mar 2019 at 13:55, Tom Lane wrote:

> Yeah, that would fit the theory :-(. Debian would be using glibc
> and FreeBSD would not be. If you were using C collation in the
> database, you'd be all right because that's standardized, but I'll
> bet you were using something else. What does psql \l show for the
> "collate" setting of this database? (Or, if by chance you had an
> explicit COLLATE setting on the column in question, what's that?)

All of the databases are using en_US.UTF-8, which is (I think) the default these days for most distributions, isn't it? So yeah... that would be it. Thanks for your help.

The rsync migration was because we needed to do a cross-country copy before putting the original DB server on a truck, but we couldn't get pg_basebackup to complete the 22TB sync without dying. rsync was restartable, so we went that route instead. Now that the two copies are physically next to each other again, after we do a rebuild of the original server I'll be syncing the data back (this time using pg_basebackup and replication). I *assume* we shouldn't expect similar collation problems replicating data that way, but it seems prudent to check. Should we?
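As an aside, the same collation information that psql's \l shows can be read straight from the catalog, which is handy when checking many databases at once (a small sketch, not part of the exchange above):

    -- Sketch: show each database's collation and ctype settings.
    SELECT datname, datcollate, datctype
    FROM   pg_database
    ORDER  BY datname;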
Re: [External] LIMIT not showing all results
On Tue, 5 Mar 2019 at 18:09, Tom Lane wrote:

> If you're planning to install (the same version of) FreeBSD on
> the original server hardware, then rsync'ing back from the new
> system should be fine. But Debian<->FreeBSD is gonna be trouble
> in either direction.

But I'm specifically NOT talking about doing an rsync to get the data back... the plan is to use in-protocol replication. Maybe that's a distinction without a difference, but that's why I brought it up. The replication documentation, and more specifically the pg_basebackup documentation, makes no mention of cross-OS replication being a problem for any reason. If that is expected to be a problem, then perhaps the documentation should be updated?
Re: [External] LIMIT not showing all results
On Tue, 5 Mar 2019 at 18:39, Adrian Klaver wrote:

> > The replication documentation, and more specifically the pg_basebackup
> > documentation, makes no mention of cross-OS replication as being a
> > problem for any reason. If that is expected to be a problem, then
> > perhaps that should be updated?
>
> Generally covered under:
>
> https://www.postgresql.org/docs/10/warm-standby.html#STANDBY-PLANNING
>
> "It is usually wise to create the primary and standby servers so that
> they are as similar as possible, at least from the perspective of the
> database server."

Nothing in that paragraph says to me that I'm going to have problems as a result of differences in Postgres's software dependencies. It's a pretty vague warning that seems to imply that as long as your hardware architecture and filesystem layout are identical there shouldn't be any issues. Thanks for your clarification, though. We'll have to take that into account in our migration plan back to the original server. I don't think we can afford the downtime for a dump/restore between systems, so we'll likely just end up regenerating indexes again before cutting the application over to the other DB server.

> That is why later versions (10+) grew logical replication:
> https://www.postgresql.org/docs/10/logical-replication.html

That seems promising, but we'd have to deal with the upgrade to 10.x first. And presumably logical replication would allow more freedom in replicating between mismatched versions as well... possibly as a low-downtime migration path for getting very large databases moved to more recent releases?
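For illustration only (both clusters would need to be on 10+ first, and all the names below are made up), the logical replication setup boils down to a publication on the source and a subscription on the target:

    -- On the source (publisher) cluster:
    CREATE PUBLICATION big_db_pub FOR ALL TABLES;

    -- On the target (subscriber) cluster, after creating the same schema there:
    CREATE SUBSCRIPTION big_db_sub
        CONNECTION 'host=old-server dbname=bigdb user=replicator'
        PUBLICATION big_db_pub;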