On Fri, Nov 6, 2015 at 8:08 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Wed, Aug 19, 2015 at 7:24 PM, Peter Geoghegan <p...@heroku.com> wrote:
> > I'll start a new thread for this, since my external sorting patch has now evolved well past the original "quicksort with spillover" idea...although not quite how I anticipated it would. It seems like I've reached a good point to get some feedback.
>
> Corey Huinker has once again assisted me with this work, by doing some benchmarking on an AWS instance of his:
>
> 32 cores (c3.8xlarge, I suppose)
> MemTotal: 251902912 kB
>
> I believe it had one EBS volume.
>
> This testing included 2 data sets:
>
> * A data set that he happens to have that is representative of his production use-case. Corey had some complaints about the sort performance of PostgreSQL, particularly prior to 9.5, and I like to link any particular performance optimization to an improvement in an actual production workload, if at all possible.
>
> * A tool that I wrote, which works on top of sortbenchmark.org's "gensort" [1] data generation tool. It seems reasonable to me to drive this work in part with a benchmark devised by Jim Gray. He did, after all, receive a Turing award for his contributions to transaction processing. I'm certainly a fan of his work. A key practical advantage of that benchmark is that it has reasonable guarantees about determinism, making these results relatively easy to recreate independently.
>
> The modified "gensort" is available from https://github.com/petergeoghegan/gensort
>
> The python script postgres_load.py performs bulk-loading for Postgres using COPY FREEZE. It ought to be fairly self-documenting:
>
> $:~/gensort$ ./postgres_load.py --help
> usage: postgres_load.py [-h] [-w WORKERS] [-m MILLION] [-s] [-l] [-c]
>
> optional arguments:
>   -h, --help            show this help message and exit
>   -w WORKERS, --workers WORKERS
>                         Number of gensort workers (default: 4)
>   -m MILLION, --million MILLION
>                         Generate n million tuples (default: 100)
>   -s, --skew            Skew distribution of output keys (default: False)
>   -l, --logged          Use logged PostgreSQL table (default: False)
>   -c, --collate         Use default collation rather than C collation
>                         (default: False)
>
> For this initial report to the list, I'm going to focus on a case involving 16 billion non-skewed tuples generated using the gensort tool. I wanted to see how a sort of a ~1TB table (1017GB as reported by psql, actually) could be improved, as compared to the relatively small volumes of data (in the multiple gigabyte range) whose sorts were so improved on my laptop, which has enough memory to avoid blocking on physical I/O much of the time. How the new approach deals with hundreds of runs that are actually reasonably sized is also of interest. This server does have a lot of memory, and many CPU cores. It was kind of underpowered on I/O, though.
>
> The initial load of 16 billion tuples (with a sortkey that is "C" locale text) took about 10 hours. My tool supports parallel generation of COPY format files, but serial performance of that stage isn't especially fast. Further, in order to support COPY FREEZE, and in order to ensure perfect determinism, the COPY operations occur serially in a single transaction that also creates the table we later ran the CREATE INDEX on.
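If it helps anyone reproduce the load, an invocation along the following lines (flags as shown in the --help output above; the worker count here is only a guess, not necessarily what was actually used) asks for 16,000 million -- i.e. 16 billion -- non-skewed tuples with the default "C" collation:

$:~/gensort$ ./postgres_load.py --workers 32 --million 16000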
> Patch, with 3GB maintenance_work_mem:
>
> ...
> LOG: performsort done (except 411-way final merge): CPU 1017.95s/17615.74u sec elapsed 23910.99 sec
> STATEMENT: create index on sort_test (sortkey );
> LOG: external sort ended, 54740802 disk blocks used: CPU 2001.81s/31395.96u sec elapsed 41648.05 sec
> STATEMENT: create index on sort_test (sortkey );
>
> So just over 11 hours (11:34:08), then. The initial sorting for 411 runs took 06:38:30.99, as you can see.
>
> Master branch:
>
> ...
> LOG: finished writing run 202 to tape 201: CPU 1224.68s/31060.15u sec elapsed 34409.16 sec
> LOG: finished writing run 203 to tape 202: CPU 1230.48s/31213.55u sec elapsed 34580.41 sec
> LOG: finished writing run 204 to tape 203: CPU 1236.74s/31366.63u sec elapsed 34750.28 sec
> LOG: performsort starting: CPU 1241.70s/31501.61u sec elapsed 34898.63 sec
> LOG: finished writing run 205 to tape 204: CPU 1242.19s/31516.52u sec elapsed 34914.17 sec
> LOG: finished writing final run 206 to tape 205: CPU 1243.23s/31564.23u sec elapsed 34963.03 sec
> LOG: performsort done (except 206-way final merge): CPU 1243.86s/31570.58u sec elapsed 34974.08 sec
> LOG: external sort ended, 54740731 disk blocks used: CPU 2026.98s/48448.13u sec elapsed 55299.24 sec
> CREATE INDEX
> Time: 55299315.220 ms
>
> So 15:21:39 for master -- the patch is a big improvement over that, but this was still disappointing given the huge improvements seen on relatively small cases.
>
> The finished index was fairly large, which can be seen here by working back from "total relation size":
>
> postgres=# select pg_size_pretty(pg_total_relation_size('sort_test'));
>  pg_size_pretty
> ----------------
>  1487 GB
> (1 row)
>
> I think that the disappointing result is probably due to the relatively slow I/O on this server, and because the merge step is more of a bottleneck. As we increase maintenance_work_mem, we're likely to then suffer from the lack of explicit asynchronous I/O here. It helps, still, but not dramatically. With maintenance_work_mem = 30GB, the patch is somewhat faster (no reason to think that this would help master at all, so that was untested):
>
> ...
> LOG: starting quicksort of run 40: CPU 1815.99s/19339.80u sec elapsed 24910.38 sec
> LOG: finished quicksorting run 40: CPU 1820.09s/19565.94u sec elapsed 25140.69 sec
> LOG: finished writing run 40 to tape 39: CPU 1833.76s/19642.11u sec elapsed 25234.44 sec
> LOG: performsort starting: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
> LOG: starting quicksort of run 41: CPU 1849.46s/19803.28u sec elapsed 25499.98 sec
> LOG: finished quicksorting run 41: CPU 1852.37s/20000.73u sec elapsed 25700.43 sec
> LOG: finished writing run 41 to tape 40: CPU 1864.89s/20069.09u sec elapsed 25782.93 sec
> LOG: performsort done (except 41-way final merge): CPU 1965.43s/20086.28u sec elapsed 25980.80 sec
> LOG: external sort ended, 54740909 disk blocks used: CPU 3270.57s/31595.37u sec elapsed 40376.43 sec
> CREATE INDEX
> Time: 40383174.977 ms
>
> So that takes 11:13:03 in total -- we only managed to shave about 20 minutes off the total time taken, despite a 10x increase in maintenance_work_mem. Still, at least it gets moderately better, not worse (getting worse is what I'd certainly expect from the master branch). 60GB was halfway between 3GB and 30GB in terms of performance, so increasing maintenance_work_mem further doesn't continue to help, but, again, at least things don't get much worse.
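As a rough cross-check of those run counts, using only the figures quoted above (run size is in linear proportion to maintenance_work_mem with the patch, so runs times workMem should be roughly the in-memory volume of the tuples in both cases):

    411 runs x  3GB maintenance_work_mem ~= 1233 GB
     41 runs x 30GB maintenance_work_mem ~= 1230 GB

Both come out to about the same total, and the "disk blocks used" figures in the two traces are nearly identical (54740802 vs. 54740909), so the 10x increase in memory divides the run count by roughly 10 but doesn't reduce the volume of data written out and then merged -- which fits with the total times being so close.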
> Thoughts on these results:
>
> * I'd really like to know the role of I/O here. Better, low-overhead instrumentation is required to see when and how we are I/O bound. I've been doing much of that on a more-or-less ad hoc basis so far, using iotop. I'm looking into a way to usefully graph the I/O activity over many hours, to correlate with the trace_sort output that I'll also show. I'm open to suggestions on the easiest way of doing that. I haven't used the "perf" tool for instrumenting I/O at all in the past.
>
> * Parallelism would probably help us here *a lot*.
>
> * As I said, I think we suffer from the lack of asynchronous I/O much more at this scale. Will need to confirm that theory.
>
> * It seems kind of ill-advised to make run size (which is always in linear proportion to maintenance_work_mem with this new approach to sorting) larger, because it probably will hurt writing runs more than it will help in making merging cheaper (perhaps mostly due to the lack of asynchronous I/O to hide the latency of writes -- Linux might not do so well at this scale).
>
> * Maybe adding actual I/O bandwidth is the way to go to get a better picture. I wouldn't be surprised if we were very bottlenecked on I/O here. Might be worth using many parallel EBS volumes here, for example.
>
> [1] http://sortbenchmark.org/FAQ-2015.html
> --
> Peter Geoghegan

The machine in question still exists, so if you have questions about it, or commands you'd like me to run to give you insight into the I/O capabilities of the machine, let me know. I can't guarantee we'll keep the machine much longer.
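For example, something like the rough sketch below could be left running during a test to log per-device read/write throughput with timestamps, which could then be graphed and lined up against the trace_sort output. The device name and sampling interval are placeholders, and the field offsets assume the standard Linux /proc/diskstats layout:

#!/usr/bin/env python
# Rough sketch: sample /proc/diskstats every INTERVAL seconds and print
# timestamped read/write throughput for one device, so the numbers can be
# correlated with trace_sort timestamps. DEVICE and INTERVAL are placeholders.
import time

DEVICE = "xvdb"      # placeholder; whatever the EBS volume shows up as
INTERVAL = 10        # seconds between samples
SECTOR_BYTES = 512   # /proc/diskstats counts 512-byte sectors

def read_sectors(device):
    # Columns per /proc/diskstats line: major, minor, device name, then the
    # I/O counters; 1-based column 6 is sectors read, column 10 is sectors written.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[5]), int(fields[9])
    raise SystemExit("device %s not found in /proc/diskstats" % device)

prev = read_sectors(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = read_sectors(DEVICE)
    read_mb = (cur[0] - prev[0]) * SECTOR_BYTES / 1e6 / INTERVAL
    write_mb = (cur[1] - prev[1]) * SECTOR_BYTES / 1e6 / INTERVAL
    print("%s  read %7.1f MB/s  write %7.1f MB/s"
          % (time.strftime("%Y-%m-%d %H:%M:%S"), read_mb, write_mb))
    prev = cur

Something like "iostat -dxt 10" redirected to a file would give much the same information; the point is just to have a timestamped record that can be correlated with the sort traces.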