On Fri, Apr 20, 2018 at 7:21 AM, Mick <michaelkintz...@gmail.com> wrote:
> On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
>> Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
>> has numerous heat failures.
>>
>> Due to poor cooling ... surprised?
>>
>> The cooling is not working right. Something is still wrong.
>>
>> On 04/19/2018 09:33 PM, R0b0t1 wrote:
>> > Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
>> > cards and a Tesla card.
>> >
>> > The system is a few years old at this point. Old enough that the
>> > thermal compound could have hardened, which is why I replaced it.
>
> If the problem started suddenly, rather than getting progressively worse over
> time, it may have something to do with kernel drivers, or some change in
> firmware.
>

As far as I know it has always been like this. It may be why it was
hardly used before it came into my care. Looking at the server I could
blame poor design; the inside is rather cramped, despite the care
taken with the internal baffles. They may not have run a good flow
simulation.

Mr. Bird's observation seems to support this.

> If the cause is mechanical, I'd also suggest checking the heat sink contact
> surface.  Some heat sinks are poorly manufactured and require flattening with
> wet 'n dry sandpaper to get a flat enough surface and improve their contact
> with the CPU.  I've seen 15°C improvement in a Zalman CPU cooler after excess
> metal was removed from copper pipes, which were manufactured proud.  Hardcore
> O/C's flatten the CPU too, but I'd avoid anything as radical because it can go
> badly wrong if you remove more than the surface varnish from the chip.
>
> In the interim, opening the side panel may also help in hot weather.
>

The internals are custom-made to fit the motherboard, cards, and drive
slots. It may work better if I move it to another tower, but it will be
a while before I can find one. I will look at the interface between
the heatsink and processor again, but it looked fine when I last checked.


How concerned should I be about the machine check errors caused by
overheating? I used to think it was best to avoid them, since the
threshold is high enough that very small parts of the die could
overshoot it and fail, but I was informed that is not the case. Besides
the throttling (which is fairly bad), I am not sure whether the
overheating has any other drawbacks.
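
To put a number on "fairly bad", here is a minimal sketch of how the
throttle events could be counted on Linux. It assumes an Intel CPU and a
kernel that exposes the thermal_throttle directory under
/sys/devices/system/cpu/cpuN/ (not every system does). The function name
read_throttle_counts is my own invention, not part of any tool.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {counter_name: count}} for CPUs that expose
    thermal_throttle counters. CPUs without the directory are skipped."""
    counts = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        throttle_dir = cpu_dir / "thermal_throttle"
        if not throttle_dir.is_dir():
            continue
        per_cpu = {}
        # Assumption: these two counter files are present; missing or
        # unreadable ones are silently ignored.
        for name in ("core_throttle_count", "package_throttle_count"):
            try:
                per_cpu[name] = int((throttle_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue
        if per_cpu:
            counts[cpu_dir.name] = per_cpu
    return counts


if __name__ == "__main__":
    for cpu, values in read_throttle_counts().items():
        print(cpu, values)
```

Running it before and after a load test and comparing the counts should
show how often each core was throttled during the run. The counters
start from zero at boot, so only the difference is meaningful.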

I am wondering what the point of 32 threads is if you can't use them at 100%.

Cheers,
     R0b0t1
