On Fri, Jan 23, 2015 at 6:42 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Fixed-Chunks
>
> No. of workers/Time (ms)       0       2       4       8      16      24      32
> Run-1                     250536  266279  251263  234347   87930   50474   35474
> Run-2                     249587  230628  225648  193340   83036   35140    9100
> Run-3                     234963  220671  230002  256183  105382   62493   27903
> Run-4                     239111  245448  224057  189196  123780   63794   24746
> Run-5                     239937  222820  219025  220478  114007   77965   39766
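For readers following along, the fixed-chunk approach under test carves the relation's block range into one contiguous chunk per worker, decided up front. This is a minimal illustrative sketch (the function name and block counts are my own, not the patch's actual code), which makes it easy to see why a worker whose chunk happens to be cached can finish long before the others:

```python
def fixed_chunks(total_blocks, nworkers):
    """Split a relation's block range into one contiguous chunk per
    worker, assigned up front (fixed-chunk division).  A worker whose
    chunk is already in cache finishes far earlier than one whose
    chunk must come from disk."""
    chunk = total_blocks // nworkers
    ranges = []
    start = 0
    for i in range(nworkers):
        # The last worker absorbs any remainder blocks.
        end = total_blocks if i == nworkers - 1 else start + chunk
        ranges.append((start, end))
        start = end
    return ranges

# A 120 GB table is roughly 15.7 million 8 kB blocks; with 32 workers
# each gets a fixed run of about 491k blocks.
print(fixed_chunks(15_728_640, 32)[:2])
```

Because every chunk is fixed at start, there is no way to rebalance when some workers finish early, which is the skew visible in the log below.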
I cannot reproduce these results. I applied your fixed-chunk size patch and ran SELECT parallel_count('tbl_perf', 32) a few times. The first thing I notice is that, as I predicted, there's an issue with different workers finishing at different times. For example, from my first run:

2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34700) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34698) exited with exit code 0
2015-01-27 22:13:09 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34701) exited with exit code 0
2015-01-27 22:13:10 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34699) exited with exit code 0
2015-01-27 22:15:00 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34683) exited with exit code 0
2015-01-27 22:15:29 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34673) exited with exit code 0
2015-01-27 22:15:58 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34679) exited with exit code 0
2015-01-27 22:16:38 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34689) exited with exit code 0
2015-01-27 22:16:39 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34671) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34677) exited with exit code 0
2015-01-27 22:16:47 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34672) exited with exit code 0
2015-01-27 22:16:48 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34680) exited with exit code 0
2015-01-27 22:16:50 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34686) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34670) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34690) exited with exit code 0
2015-01-27 22:16:51 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34674) exited with exit code 0
2015-01-27 22:16:52 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34684) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34675) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34682) exited with exit code 0
2015-01-27 22:16:53 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34691) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34676) exited with exit code 0
2015-01-27 22:16:54 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34685) exited with exit code 0
2015-01-27 22:16:55 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34692) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34687) exited with exit code 0
2015-01-27 22:16:56 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34678) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34681) exited with exit code 0
2015-01-27 22:16:57 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34688) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34694) exited with exit code 0
2015-01-27 22:16:59 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34693) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34695) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34697) exited with exit code 0
2015-01-27 22:17:02 UTC [34660] LOG: worker process: parallel worker for PID 34668 (PID 34696) exited with exit code 0

That run started at 22:13:01. Within the first 9 seconds, 4 workers had already exited. So clearly we are not getting the promised 32-way parallelism for the whole test. Granted, in this instance, *most* of the workers run until the end, but I think we'll find that there are uncomfortably frequent cases where we get significantly less parallelism than we planned on because the work isn't divided evenly.

But leaving that aside, I've run this test 6 times in a row now, with a warm cache, and the best time I have is 237310.042 ms and the worst time I have is 242936.315 ms. So there's very little variation, and it's reasonably close to the results I got with dd, suggesting that the system is fairly well I/O-bound. At a sequential read speed of 400 MB/s, 240 s = 96 GB of data. Assuming it takes no time at all to process the cached data (which seems to be not far from wrong, judging by how quickly the first few workers exit), that means we're getting 24 GB of data from cache on a 64 GB machine. That seems a little low, but if the kernel is refusing to cache the whole relation to avoid cache-thrashing, it could be right.

Now, when you did what I understand to be the same test on the same machine, you got times ranging from 9.1 seconds to 35.4 seconds. Clearly, there is some difference between our test setups. Moreover, I'm kind of suspicious about whether your results are actually physically possible. Even in the best case where you somehow had the maximum possible amount of data cached - 64 GB on a 64 GB machine, leaving no space for cache duplication between PG and the OS and no space for the operating system or postgres itself - the table is 120 GB, so you've got to read *at least* 56 GB from disk. Reading 56 GB from disk in 9 seconds represents an I/O rate of >6 GB/s.
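The back-of-the-envelope numbers above are easy to check directly. This sketch just redoes the arithmetic, taking the 120 GB table, 400 MB/s sequential read speed, 64 GB of RAM, and the observed timings from the discussion, and assuming decimal units (1 GB = 1000 MB) for simplicity:

```python
# Figures taken from the discussion; decimal units assumed.
table_gb = 120        # table size
seq_read_mb_s = 400   # sequential read speed measured with dd
elapsed_s = 240       # roughly the best/worst warm-cache run times
ram_gb = 64           # machine RAM

# My ~240 s runs: how much data could the disk have supplied,
# and how much must therefore have come from cache?
disk_gb = seq_read_mb_s * elapsed_s / 1000   # 96.0 GB from disk
cache_gb = table_gb - disk_gb                # 24.0 GB from cache
print(disk_gb, cache_gb)

# The reported 9.1 s run: even with all 64 GB of RAM holding table
# data, at least 56 GB must still be read from disk.
min_disk_gb = table_gb - ram_gb              # 56 GB minimum from disk
rate_gb_s = min_disk_gb / 9.1                # implied I/O rate, >6 GB/s
print(min_disk_gb, round(rate_gb_s, 1))
```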
I grant that there could be some speedup from issuing I/O requests in parallel instead of serially, but that is a 15x speedup over dd, so I am a little suspicious that there is some problem with the test setup, especially because I cannot reproduce the results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers