(Follow-ups to dev-platform, please)

Dear all,

This email summarizes the results of our investigation into our options for the future of PGO on Windows. I will first describe the work that happened as part of the investigation, and will then lay out the solutions that are available to us. If you're interested in the tl;dr version, please scroll to the bottom. For the details, see the dependencies of bug 833881.

(Note that we're only talking about PGO for libxul. Anything outside of libxul, specifically the JS engine, is not going to be affected by the decision coming out of this thread. And obviously, this discussion is only about Windows.)

The first thing that we investigated was whether upgrading to Visual Studio 2012 Update 1 would significantly reduce the memory usage of the PGO linker. Thanks to the investigation done by jimm, we know that it will actually increase the memory usage, and is therefore not an option.

Then, we tried to see how much breathing room we would gain if we disabled PGO but kept link-time code generation (LTCG), and if we disabled both together. It turns out that disabling PGO but keeping LTCG enabled reduces the memory usage by only ~200MB, which is not enough to make it an effective measure. Disabling both LTCG and PGO brings the linker's virtual memory usage down to around 1GB, which means that we will not hit the maximum virtual memory size of 4GB for a *long* time. (Unfortunately, the Microsoft toolchain cannot perform PGO builds without LTCG.) Therefore, for the rest of this email, I will talk about disabling both PGO and LTCG.

We then tried to get a sense of how much of a win the PGO optimizations are. Thanks to a series of measurements by dmandelin, we know that disabling PGO/LTCG will result in a regression of about 10-20% on benchmarks which examine DOM and layout performance, such as Dromaeo and guimark2 (and 40% in one case), but no significant regressions in startup time or gmail interactions. Thanks to a series of telemetry measurements performed by Vladan on a Nightly build we did last week which had PGO/LTCG disabled, we also know that no telemetry probes show a significant regression on builds without PGO/LTCG. Vladan is going to try to get this data out of a Tp5 run tomorrow as well, but we have no reason to believe that the results of that experiment will be any different.


Given the above, I'd like to propose the following long-term solutions:

1. Disable PGO/LTCG now. The downside is that we will take a hit in microbenchmarks, specifically Dromaeo. But we have no reason to believe that this is going to affect any of the performance characteristics observed by our users. And it means that engineers can stop worrying about this problem once and for all.

2. Try to delay disabling PGO/LTCG as much as possible. Given the tracking implemented in bug 710840, we can now watch those graphs so that we know when this problem is going to hit next, and come up with a mitigation strategy. In order to effectively implement this solution, we're going to need:
  * A person to own watching the graphs and report back when we step inside "the danger zone" again.
  * A detailed plan of action on what we'll do to mitigate this problem the next time, as opposed to treating it as a fire drill. One possible plan of action could be disabling PGO for everything except content/dom/layout/xpcom/gfx, no questions asked.
  * A group of engineers to own performing the above action.
  * Someone to go back through the historical data over the past year, determine the causes behind the large spikes in the gradual memory usage increase, and find solutions to them to buy as much time as possible.
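
To make "watching the graphs" a bit more concrete, here is a minimal sketch of the kind of check the owner could run over the linker memory numbers. Everything in it is an assumption for illustration: the function names, the 256MB safety margin, and the idea of fitting a straight line through per-build peak memory samples are hypothetical, and it does not reflect how the tracking in bug 710840 actually reports its data. Only the 4GB ceiling comes from the discussion above.

```python
from typing import Optional, Sequence

# 4GB virtual memory ceiling for the linker, as discussed above.
LIMIT_MB = 4096.0


def in_danger_zone(samples_mb: Sequence[float],
                   limit_mb: float = LIMIT_MB,
                   margin_mb: float = 256.0) -> bool:
    """True if the latest sample is within margin_mb of the limit.

    margin_mb is a hypothetical safety margin, not an agreed-upon value.
    """
    return len(samples_mb) > 0 and samples_mb[-1] >= limit_mb - margin_mb


def builds_until_limit(samples_mb: Sequence[float],
                       limit_mb: float = LIMIT_MB) -> Optional[float]:
    """Estimate how many more builds until the limit is reached.

    Fits a least-squares line through (build index, peak memory) and
    extrapolates. Returns None if there are fewer than two samples or
    if memory usage is not trending upward.
    """
    n = len(samples_mb)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    index_at_limit = (limit_mb - intercept) / slope
    # Count from the most recent build, never reporting a negative number.
    return max(0.0, index_at_limit - (n - 1))


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    history = [3000.0, 3100.0, 3200.0, 3300.0]
    print("in danger zone:", in_danger_zone(history))
    print("estimated builds until limit:", builds_until_limit(history))
```

A straight-line fit is deliberately crude. The historical data shows large spikes on top of a gradual increase, so an estimate like this would at best be an early warning, not a prediction.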

3. Try to delay disabling PGO/LTCG until the next time that we hit the limit, and disable PGO/LTCG then once and for all. In order to implement this solution, we're going to need:
  * A person to own watching the graphs and report back when we step inside the danger zone again.
  * A build-system patch which makes it possible to disable PGO/LTCG for libxul by toggling a switch.
  * Clear documentation on what that switch is, so that anybody can toggle it when we need to take action the next time.


Given the information that we currently have, I think the best course of action is #3, followed by #1 and then #2. I'd like to explicitly recommend against #2, because I don't think we have evidence that spending that much effort will bring any noticeable gains to our users. This effort is better spent elsewhere.


Please let me know if you have any questions, if I have missed anything, and do provide your feedback on the above proposal. As we ultimately need a decision to come out of this thread, and given that it affects Firefox Desktop, I have asked johnath to be the person who makes the final call, or to delegate that to someone whom he trusts.

Last but not least, hats off to everyone who helped during this investigation!

Cheers,
Ehsan
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
