Price matters, since every dollar we spend chasing ECC would be a
dollar we can't allocate towards perf improvements, hardware refresh
rate, or simply more machines for any build clusters we may want.

The paper linked above addresses massive compute clusters, so it seems
to have limited implications for our use cases.

Almost none of the machines we develop on currently use ECC.
I don't see why that should change now. To me, ECC for desktop compute
workloads crosses the line into jumping at shadows, since "restart
your machine slightly more often than otherwise" is not onerous.

On Mon, Nov 6, 2017 at 9:19 AM, Gregory Szorc <gsz...@mozilla.com> wrote:
>
>
>> On Nov 6, 2017, at 05:19, Gabriele Svelto <gsve...@mozilla.com> wrote:
>>
>>> On 04/11/2017 01:10, Jeff Gilbert wrote:
>>> Clock speed and core count matter much more than ECC. I wouldn't chase
>>> ECC support for general dev machines.
>>
>> The Xeon-W SKUs I posted in the previous thread all had identical or
>> higher clock speeds than equivalent Core i9 SKUs and ECC support with
>> the sole exception of the i9-7980XE which has slightly higher (100MHz)
>> peak turbo clock than the Xeon W-2195.
>>
>> There is IMHO no performance-related reason to skimp on ECC support
>> especially for machines that will sport a significant amount of memory.
>>
>> The importance of ECC memory is IMHO underestimated, mostly because it's
>> not common, and thus users may be hitting memory errors more frequently
>> than they realize. My main workstation is now 5 years old and has
>> accumulated 24 memory errors; that may not seem like much, but if one
>> happens at a bad time, or in a bad place, it can ruin your day or
>> permanently corrupt your data.
>>
>> As another example of ECC importance my laptop (obviously) doesn't have
>> ECC support and two years ago had a single bit that went bad in the
>> second DIMM. The issue manifested itself as internal compiler errors
>> while building Fennec. The first time I just pulled again from central
>> thinking it was a fluke, the second I updated the build dependencies
>> which I hadn't done in a while thinking that an old GCC might have been
>> the cause. It was not until the third day with a failure that I realized
>> what was happening. A 2-hour memory test showed me the second DIMM
>> was bad so I removed it, ordered a new one and went on to check my
>> machine. I had to purge my compilation cache because garbage had
>> accumulated in there, run an hg verify on my repo as well as verifying
>> all the installed packages for errors. Since I didn't have access to my
>> main workstation at the time, I wasted 3 days chasing the issue and
>> my workflow was slowed down by a cold compilation cache and a gimped
>> machine (until I could replace the DIMM).
>>
>> This is not common, but it's not rare either and we now have hundreds of
>> developers within Mozilla so people are going to run into issues that
>> can be easily prevented by having ECC memory.
>>
>> That being said, ECC memory also makes machines less susceptible to
>> Rowhammer-like attacks and makes such attacks detectable while they are
>> happening.
>>
>> For more in-depth reading on the matter I suggest "Memory
>> Errors in Modern Systems - The Good, The Bad, and The Ugly" [1] in which
>> the authors analyze memory errors on live systems over two years and
>> argue that SEC-DED ECC (the type of protection you usually get on
>> workstations) is often insufficient and even chipkill ECC (now common on
>> most servers) is not enough to catch all errors happening during real
>> world use.
>>
>> Gabriele
>>
>> [1] https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf
>>
>
> The Xeon-Ws are basically the i9s (both Skylake-X) with support for ECC,
> more vPro, and AMT. The Xeon-Ws lack Turbo 3.0 (preferred core). However,
> Turbo 2.0 apparently reaches the same MHz, so I don't think it matters much.
> There are some other differences with regard to PCIe lanes, chipset, etc.
>
> Another big difference is price. The Xeons cost a lot more.
>
> For building Firefox, the i9s and Xeon-Ws are probably very similar (which
> is something we should test). It likely comes down to whether you want to
> pay a premium for ECC and other Xeon-W features. I'm not in a position to
> answer that.
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
