On Sat, Feb 1, 2014 at 1:41 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> However, I tested the
>> most recent revision from your git remote on the AWS instance.
> But that was before my fix, right. Except you managed to timetravel :)

Heh, okay. So Nathan Boley has generously made available a machine with
4 AMD Opteron 6272s. I've performed the same benchmark on that server.
However, I thought it might be interesting to circle back and get some
additional numbers for the AWS instance already tested - I'd like to see
what it looks like after your recent tweaks to fix the regression. The
single-client performance of that instance seems to be markedly better
than that of Nathan's server.

Tip: the AWS command line tools + S3 are a great way to easily publish
bulky pgbench-tools results, once you figure out how to correctly set
your S3 bucket's access policy to allow public http. It has similar
advantages to rsync, and just works with a minimum of fuss.

Anyway, I don't think that the new, third c3.8xlarge-rwlocks testset
tells us much of anything:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/c38xlarge-rwlocks/

Here are the results of a benchmark on Nathan Boley's 64-core, 4-socket
server:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

Perhaps I should have gone past 64 clients, because in the document
"Lock Scaling Analysis on Intel Xeon Processors" [1], Intel writes:

"This implies that with oversubscription (more threads running than
available logical CPUs), the performance of spinlocks can depend heavily
on the exact OS scheduler behavior, and may change drastically with
operating system or VM updates."

I haven't bothered with higher client counts, though, because Andres
noted it's the same with 90 clients on this AMD system. Andres: do you
see big problems when # clients < # logical cores on the affected Intel
systems?

There is only a marginal improvement in performance on this big 4-socket
system. Andres informs me privately that he has reproduced the problem
on multiple new 4-socket Intel servers, so it seems reasonable to
suppose that it's more or less an Intel thing. The Intel document [1]
further notes:

"As the number of threads polling the status of a lock address
increases, the time it takes to process those polling requests will
increase. Initially, the latency to transfer data across socket
boundaries will always be an order of magnitude longer than the on-chip
cache-to-cache transfer latencies. Such cross-socket transfers, if they
are not effectively minimized by software, will negatively impact the
performance of any lock algorithm that depends on them."

So I think it's fair to say, given what we now know from Andres'
numbers and the numbers I got from Nathan's server, that this patch is
closer to being something that addresses a particularly unfortunate
pathology on many-socket Intel systems than it is to being a general
performance optimization. Based on the above quoted passage, it isn't
unreasonable to suppose that other vendors or architectures could be
affected, but that isn't in evidence.

While I welcome the use of atomic operations in the context of LW_SHARED
acquisition as general progress (see the rough sketch below for what
that amounts to), I think that to the outside world my proposed
messaging is more useful. It's not quite a bug fix, but if you're using
a many-socket Intel server, you're *definitely* going to want to use a
PostgreSQL version that is unaffected. You may well not want to take on
the burden of waiting for 9.4, or waiting for it to fully stabilize. I
note that Andres has a feature branch of this backported to Postgres
9.2, no doubt because of a request from a 2ndQuadrant customer.
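For anyone who hasn't read the patch, here is a rough sketch of the
general technique. To be clear, this is not Andres' actual code - the
names and layout are invented for illustration, and it uses the GCC
__atomic builtins directly rather than any PostgreSQL infrastructure.
The point is that the lock state lives in a single word, and a shared
acquisition is a compare-and-swap loop rather than "acquire spinlock;
bump shared count; release spinlock":

#include <stdint.h>
#include <stdbool.h>

#define EXCLUSIVE_FLAG  ((uint32_t) 1 << 31)   /* high bit: held exclusively;
                                                * remaining bits: share count */

typedef struct
{
    uint32_t    state;      /* flag bit plus shared count, updated atomically */
} demo_lwlock;

/*
 * Try to acquire the lock in shared mode without taking any spinlock.
 * Returns false if the lock is held exclusively, in which case the
 * caller would queue and sleep (not shown).
 */
static bool
demo_lwlock_acquire_shared(demo_lwlock *lock)
{
    uint32_t    old = __atomic_load_n(&lock->state, __ATOMIC_RELAXED);

    for (;;)
    {
        if (old & EXCLUSIVE_FLAG)
            return false;       /* exclusively held; caller must wait */

        /*
         * Attempt to bump the shared count.  If the CAS fails, 'old' is
         * refreshed with the current value and we retry.  Readers never
         * serialize on a separate spinlock just to increment a counter,
         * which is where the contention shows up.
         */
        if (__atomic_compare_exchange_n(&lock->state, &old, old + 1,
                                        false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return true;
    }
}

static void
demo_lwlock_release_shared(demo_lwlock *lock)
{
    /* Drop our share; waking up waiters is not shown. */
    __atomic_fetch_sub(&lock->state, 1, __ATOMIC_RELEASE);
}

Obviously the real thing also has to handle exclusive acquisition, wait
queues and wakeups, but the shared fast path above is the part relevant
to the contention being discussed here.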
I have to wonder if we should think about making this available with a
configure switch in one or more back branches. I think that the complete
back-porting of the fix for the fsync request queue issue in commit
758728 could be considered a precedent - that too was a fix for a really
bad worst case that was encountered fairly infrequently in the wild.
It's sort of horrifying to have red-hot spinlocks in production, so that
seems like the kind of thing we should make an effort to address for
those running multi-socket systems. Of those running Postgres on new
multi-socket systems, the reality is that the majority are running on
Intel hardware. Unfortunately, everyone knows that Intel will soon be
the only game in town when it comes to high-end x86_64 servers, which
contributes to my feeling that we need to target back branches. First,
though, we should do something about the possible regression with older
compilers that have to use the fallback.

It would be useful to know more about the nature of the problem that
made such an appreciable difference in Andres' original post. Again,
through private e-mail, I saw perf profiles from affected servers and
from an unaffected though roughly comparable server (i.e. Nathan's
64-core AMD server). Andres observed that the "stalled-cycles-frontend"
and "stalled-cycles-backend" Linux perf events varied hugely depending
on whether these Intel systems were patched or unpatched. They were
about the same on the AMD system to begin with.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

--
Peter Geoghegan