On 04/11/2017 01:10, Jeff Gilbert wrote:
> Clock speed and core count matter much more than ECC. I wouldn't chase
> ECC support for general dev machines.

The Xeon-W SKUs I posted in the previous thread all support ECC and have
clock speeds identical to or higher than the equivalent Core i9 SKUs,
with the sole exception of the i9-7980XE, which has a slightly higher
(100 MHz) peak turbo clock than the Xeon W-2195.

There is IMHO no performance-related reason to skimp on ECC support,
especially for machines that will sport a significant amount of memory.

The importance of ECC memory is IMHO underestimated, mostly because it's
not common and thus users do not realize they may be hitting memory
errors more frequently than they think. My main workstation is now 5
years old and has accumulated 24 memory errors; that may not seem like
much, but if one happens at a bad time, or in a bad place, it can ruin
your day or permanently corrupt your data.
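
For anyone who wants to check their own machine, here is a minimal
sketch of how such counts can be read. It assumes a Linux system with an
EDAC driver loaded for the memory controller, in which case the kernel
exposes corrected and uncorrected error counters under
/sys/devices/system/edac/mc. The function name read_edac_counts is made
up for illustration, and this is not necessarily how the 24 errors
above were counted.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the Linux EDAC layout: one mcN directory per memory
    controller, each with ce_count and ue_count files.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (corrected, uncorrected) in read_edac_counts().items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

On a machine without ECC or without an EDAC driver the directory is
usually absent and the function simply returns an empty dict.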

As another example of ECC importance my laptop (obviously) doesn't have
ECC support and two years ago had a single bit that went bad in the
second DIMM. The issue manifested itself as internal compiler errors
while building Fennec. The first time I just pulled again from central
thinking it was a fluke, the second I updated the build dependencies
which I hadn't done in a while thinking that an old GCC might have been
the cause. It was not until the third day of failures that I realized
what was happening. A two-hour memory test showed that the second DIMM
was bad, so I removed it, ordered a new one and went on to check the
rest of my machine. I had to purge my compilation cache because garbage
had accumulated in it, run hg verify on my repo and verify all the
installed packages for errors. Since I didn't have access to my main
workstation at the time, I wasted three days chasing the issue, and
afterwards my workflow was slowed down by a cold compilation cache and a
gimped machine (until I could replace the DIMM).

This is not common, but it's not rare either, and we now have hundreds
of developers within Mozilla, so people are going to run into issues
that could easily be prevented by having ECC memory.

In addition, ECC memory makes machines less susceptible to
Rowhammer-like attacks and makes such attacks detectable while they are
happening.

For a more in-depth treatment of the matter I suggest reading "Memory
Errors in Modern Systems - The Good, The Bad, and The Ugly" [1], in
which the authors analyze memory errors on live systems over two years
and argue that SEC-DED ECC (the type of protection you usually get on
workstations) is often insufficient and that even chipkill ECC (now
common on most servers) is not enough to catch all errors happening
during real-world use.
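
To make "SEC-DED" (single error correction, double error detection)
concrete, here is a toy sketch using an extended Hamming (8,4) code: 4
data bits, 3 Hamming parity bits and 1 overall parity bit. This is only
an illustration. Real ECC DIMMs typically protect 64 data bits with 8
check bits, and the exact code used by any given memory controller is
not something I am describing here. The function names encode and
decode are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2  # 1 if an odd number of bits flipped

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single-bit error at index `syndrome` (0 means the overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot be fixed.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword is corrected and flipping any two is
reported as uncorrectable. Three or more flipped bits can be silently
"corrected" to the wrong value, which is the kind of limitation the
paper is concerned with.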

 Gabriele

[1] https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
