On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintz...@gmail.com> wrote:
> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>> has numerous heat failures.
>>
>> Due to poor cooling ... surprised?
>>
>> The cooling is not working right. Something is still wrong.
>>
>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>> > cards and a Tesla card.
>> >
>> > The system is a few years old at this point. Old enough that the
>> > thermal compound could have hardened, which is why I replaced it.
>
> If the problem started suddenly, rather than getting progressively worse over
> time, it may have something to do with kernel drivers, or some change in
> firmware.
>

As far as I know it has always been like this. It may be why it was
hardly used before it came into my care. Looking at the server I could
blame poor design; the inside is rather cramped, despite the care taken
with the internal baffles. They may not have run a good flow
simulation. Mr. Bird's observation seems to support this.

> If the cause is mechanical, I'd also suggest checking the heat sink contact
> surface. Some heat sinks are poorly manufactured and require flattening with
> wet 'n dry sandpaper to get a flat enough surface and improve their contact
> with the CPU. I've seen 15°C improvement in a Zalman CPU cooler after excess
> metal was removed from copper pipes, which were manufactured proud. Hardcore
> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
> badly wrong if you remove more than the surface varnish from the chip.
>
> In the interim, opening the side panel may also help in hot weather.
>

The internals are custom made to fit the motherboard, cards, and drive
slots. It may work better if I move it to another tower but it will be
a while before I can find one.

I will look at the interface between the heatsink and processor again,
but it looked fine.

How concerned should I be about overheating machine check errors? I
used to think that it was best to avoid them, as the threshold was
high enough that very small parts of the die could overshoot and fail,
but I was informed that is not the case. Besides the throttling (which
is fairly bad) I am not sure if there are any drawbacks to the
overheating.

I am wondering what the point of 32 threads is if you can't use them at 100%.

Cheers,
     R0b0t1
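
For putting a number on the throttling mentioned above, here is a minimal sketch rather than a definitive tool. It assumes a Linux kernel on Intel hardware that exposes per-CPU `thermal_throttle` counters under `/sys/devices/system/cpu`; on other kernels or CPUs those files may simply be absent, in which case it reports nothing. The names `read_throttle_counts` and `_read_int` are hypothetical helpers made up for this illustration.

```python
from pathlib import Path


def _read_int(path):
    """Return the integer stored in a sysfs file, or None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_throttle_counts(sysfs_root="/sys/devices/system/cpu"):
    """Map each CPU directory name to (core_throttle_count, package_throttle_count).

    Assumption: the kernel exposes cpuN/thermal_throttle/ counters (typical
    on Intel x86 Linux). CPUs without that directory are skipped.
    """
    root = Path(sysfs_root)
    counts = {}
    if not root.is_dir():
        return counts
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(root.glob("cpu[0-9]*")):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue
        counts[cpu_dir.name] = (
            _read_int(throttle_dir / "core_throttle_count"),
            _read_int(throttle_dir / "package_throttle_count"),
        )
    return counts


if __name__ == "__main__":
    for cpu, (core, package) in read_throttle_counts().items():
        print(f"{cpu}: core={core} package={package}")
```

Running it once before and once after a sustained all-thread load, then comparing the two sets of counters, should give a rough idea of how often the thermal limit is actually being hit.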