On Jan 4, 2018, at 4:33 PM, Jeff Hammond <jeff.scie...@gmail.com
<mailto:jeff.scie...@gmail.com>> wrote:
Can we restrain ourselves to talk about Open-MPI or at least
technical aspects of HPC communication on this list and leave the
stock market tips for Hacker News and Twitter?
Thanks,
Jeff
On Thu, Jan 4, 2018 at 3:53 PM, John
Chludzinski <john.chludzin...@gmail.com
<mailto:john.chludzin...@gmail.com>> wrote:
From https://semiaccurate.com/2018/01/04/kaiser-security-holes-will-devastate-intels-marketshare/
<https://semiaccurate.com/2018/01/04/kaiser-security-holes-will-devastate-intels-marketshare/>
Kaiser security holes will devastate Intel’s marketshare
Analysis: This one tips the balance toward AMD in a big way
Jan 4, 2018 by Charlie Demerjian
<https://semiaccurate.com/author/charlie/>
This latest decade-long critical security hole in Intel CPUs
is going to cost the company significant market share.
SemiAccurate thinks it is not only consequential but will
shift the balance of power away from Intel CPUs for at least
the next several years.
Today’s crop of gaping security flaws comprises three sets
of holes across Intel, AMD, and ARM processors, along with a
slew of official statements and detailed analyses. On top of
that, the statements from vendors range from detailed and
direct to intentionally misleading and slimy. Let’s take a
look at what the problems are, who they affect, and what the
outcome will be. Those outcomes range from trivial patching
to destroying the market share of Intel servers, and no, we
are not joking.
(*Author’s Note 1:* For the technical readers, we are
simplifying a lot; sorry, we know this hurts. The full
disclosure docs are linked, read them for the details.)
(*Author’s Note 2:* For the financially oriented subscribers out
there, the parts relevant to you are at the very end; the
section is titled *Rubber Meet Road*.)
*The Problem(s):*
As we said earlier there are three distinct security flaws
that all fall somewhat under the same umbrella. All are ‘new’
in the sense that the class of attacks hasn’t been publicly
described before, and all are very obscure CPU speculative
execution and timing related problems. The extent the fixes
affect differing architectures also ranges from minor to
near-crippling slowdowns. Worse yet is that all three flaws
aren’t bugs or errors, they exploit correct CPU behavior to
allow the systems to be hacked.
The three problems are cleverly labeled Variant One, Variant
Two, and Variant Three. Google Project Zero was the original
discoverer of them and has labeled the classes as Bounds
Check Bypass, Branch Target Injection, and Rogue Data Cache
Load respectively. You can read up on the extensive and gory
details here
<https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html>
if you wish.
If you are the TLDR type the very simplified summary is that
modern CPUs will speculatively execute operations ahead of
the one they are currently running. Some architectures will
allow these executions to start even when they violate
privilege levels, but those instructions are killed or rolled
back hopefully before they actually complete running.
Another feature of modern CPUs is virtual memory which can
allow memory from two or more processes to occupy the same
physical page. This is a good thing because if you have
memory from the kernel and a bit of user code in the same
physical page but different virtual pages, changing from
kernel to userspace execution doesn’t require a page fault.
This saves massive amounts of time and overhead giving modern
CPUs a huge speed boost. (For the really technical out there,
I know you are cringing at this simplification, sorry).
These two things together allow you to do some interesting
things and along with timing attacks add new weapons to your
hacking arsenal. If you have code executing on one side of a
virtual memory page boundary, it can speculatively execute
the next few instructions on the physical page that cross the
virtual page boundary. This isn’t a big deal unless the two
virtual pages are mapped to processes that are from different
users or different privilege levels. Then you have a problem.
(Again painfully simplified and liberties taken with the
explanation, read the Google paper for the full detail.)
This speculative execution allows you to get a few short (low
latency) instructions in before the speculation ends. Under
certain circumstances you can read memory from different
threads or privilege levels, write those things somewhere,
and figure out what addresses other bits of code are using.
The latter bit has the nasty effect of potentially blowing
through address space randomization defenses which are a
keystone of modern security efforts. It is ugly.
*Who Gets Hit:*
So we have three attack vectors and three affected companies,
Intel, AMD, and ARM. Each has a different set of
vulnerabilities to the different attacks due to differences
in underlying architectures. AMD put out a pretty clear
statement of what is affected, ARM put out by far the best
and most comprehensive description, and Intel obfuscated,
denied, blamed others, and downplayed the problem. If this
was a contest for misleading with doublespeak and
misdirection, Intel won with a gold star, the others weren’t
even in the game. Let’s look at who said what and why.
*ARM:*
ARM has a page up
<https://developer.arm.com/support/security-update> listing
vulnerable processor cores, descriptions of the attacks, and
plenty of links to more information. They also put up a very
comprehensive white paper that rivals Google’s original
writeup, complete with code examples and a new 3a variant.
You can find it here
<https://developer.arm.com/support/security-update/download-the-whitepaper>.
Just for completeness we are putting up ARM’s excellent table
of affected processors, enjoy.
ARM Kaiser core table
<https://www.semiaccurate.com/assets/uploads/2018/01/ARM_Kaiser_response_table.jpg>
*Affected ARM cores*
*AMD:*
AMD gave us the following table which lays out their position
pretty clearly. The short version is that architecturally
speaking they are vulnerable to 1 and 2 but three is not
possible due to microarchitecture. More on this in a bit, it
is very important. AMD also went on to describe some of the
issues and mitigations to SemiAccurate, but again, more in a bit.
AMD Kaiser response Matrix
<https://www.semiaccurate.com/assets/uploads/2018/01/AMD_Kaiser_response.jpg>
*AMD’s response matrix*
*Intel:*
Intel is continuing to be the running joke of the industry as
far as messaging is concerned. Their statement is a pretty
awe-inspiring example of saying nothing while desperately
trying to minimize the problem. You can find it here
<https://newsroom.intel.com/news/intel-responds-to-security-research-findings/>
but it contains zero useful information. SemiAccurate is getting
tired of saying this, but Intel should be ashamed of how their
messaging is done; not saying anything would do less damage
than their current course of action.
You will notice the line in the second paragraph, /“Recent
reports that these exploits are caused by a “bug” or a “flaw”
and are unique to Intel products are incorrect.”/ This is
technically true and pretty damning. They are directly saying
that the problem is not a bug but is due to *misuse of
correct processor behavior*. This is a critical problem
because it can’t be ‘patched’ or ‘updated’ like a bug or flaw
without breaking the CPU. In short you can’t fix it, and this
will be important later. Intel mentions this but others don’t
for a good reason, again later.
Then Intel goes on to say, /“Intel is committed to the
industry best practice of responsible disclosure of potential
security issues, which is why Intel and other vendors had
planned to disclose this issue next week when more software
and firmware updates will be available. However, Intel is
making this statement today because of the current inaccurate
media reports.”/ This is simply not true, or at least the
part about industry best practices of responsible disclosure.
Intel sat on the last critical security flaw affecting 10+
years of CPUs, which SemiAccurate exclusively disclosed
<https://www.semiaccurate.com/2017/05/01/remote-security-exploit-2008-intel-platforms/>,
for 6+ weeks after a patch was released. Why? PR reasons.
SemiAccurate feels that Intel holding back knowledge of what
we believe were flaws being actively exploited in the field
even though there were simple mitigation steps available is
not responsible. Or best practices. Or ethical. Or anything
even intoning goodness. It is simply unethical, but only that
good if you are feeling kind. Intel does not do the right
thing for security breaches and has not even attempted to do
so in the 15+ years this reporter has been tracking them on
the topic. They are by far the worst major company in this
regard, and getting worse.
*Mitigation:*
As is described by Google, ARM, and AMD, but not Intel, there
are workarounds for the three new vulnerabilities. Since
Google first discovered these holes in June 2017, there have
been patches pushed up to various Linux kernel and related
repositories. The first one SemiAccurate can find was dated
October 2017 and the industry-coordinated announcement was
set for Tuesday, January 9, 2018, so you can be pretty sure
that the patches are in place and ready to be pushed out if
not on your systems already. Microsoft and Apple are said to
be at a similar state of readiness too. In short by the time
you read this, it will likely be fixed.
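On Linux, one way to check whether the patches are "on your systems already" is to ask the kernel what it reports. The sketch below assumes a kernel recent enough to expose the `/sys/devices/system/cpu/vulnerabilities` directory, which was added after this article was written, so older or unpatched kernels will simply not have it. The function name `read_mitigation_status` is made up for illustration.

```python
from pathlib import Path

# Assumption: the kernel exposes one text file per known issue in this directory.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir: Path = VULN_DIR) -> dict:
    """Return {issue_name: kernel_status_line}; empty if the directory is absent."""
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    report = read_mitigation_status()
    if not report:
        print("Kernel does not report mitigation status (too old, or not Linux).")
    for name, line in report.items():
        print(f"{name}: {line}")
```

An empty result only means the kernel does not report a status; it says nothing about whether the machine is vulnerable.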
That said the fixes do have consequences, and all are heavily
workload dependent. For variants 1 and 2 the performance hit
is pretty minor with reports of ~1% performance hits under
certain circumstances but for the most part you won’t notice
anything if you patch, and you should patch. Basically 1 and
2 are irrelevant from any performance perspective as long as
your system is patched.
The big problem is with variant 3, which ARM claims has a
similar effect on devices like phones or tablets, i.e., low
single-digit performance hits, if that. Given the way ARM CPUs
are used in the majority of devices, they don’t tend to have
the multi-user, multi-tenant, heavily virtualized workloads
that servers do. For the few ARM cores that are affected,
their users will see a minor, likely unnoticeable performance
hit when patched.
User x86 systems will likely be closer to the ARM model for
performance hits. Why? Because while they can run heavily
virtualized, multi-user, multi-tenant workloads, most desktop
users don’t. Even if they do, it is pretty rare that these
users are CPU bound for performance, memory and storage
bandwidth will hammer performance on these workloads long
before the CPU becomes a bottleneck. Why do we bring this up?
Because in those heavily virtualized, multi-tenant,
multi-user workloads that most servers run in the modern
world, the patches for 3 are painful. How painful?
SemiAccurate’s research has found reports of between 5-50%
slowdowns, again workload and software dependent, with the
average being around 30%. This stands to reason because the
fixes we have found essentially force a demapping of kernel
code on a context switch.
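Since the hit depends on how often a workload enters the kernel, a rough first measurement is the cost of a kernel round trip before and after patching. The following is a minimal sketch, not a rigorous benchmark: it assumes `os.getppid` enters the kernel on every call (true on Linux as far as we know), and the helper names are invented for illustration. Interpreter overhead is included in both numbers, so only the difference is meaningful.

```python
import os
import time


def ns_per_call(func, iterations: int = 200_000) -> float:
    """Average wall-clock nanoseconds per call of func in a tight loop."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


def user_space_noop() -> None:
    """Baseline call that never leaves user space."""
    return None


if __name__ == "__main__":
    baseline = ns_per_call(user_space_noop)
    syscall = ns_per_call(os.getppid)  # assumed to trap into the kernel each time
    print(f"user-space call:   {baseline:8.1f} ns")
    print(f"getppid call:      {syscall:8.1f} ns")
    print(f"kernel round trip: {syscall - baseline:8.1f} ns (approximate)")
```

Running it on the same machine with and without the patch applied gives a per-entry cost that can be multiplied by a workload's own system call rate.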
*The Pain:*
This may sound like techno-babble but it isn’t, and it
happens many thousands of times a second on modern machines,
if not more. Because, as Intel pointed out, the CPU is
operating correctly and the exploit uses correct behavior, it
can’t be patched or ‘fixed’ without breaking the CPU itself.
Instead what you have to do is make sure the circumstances
that can be exploited don’t happen. Consider this a software
workaround or avoidance mechanism, not a patch or bug fix,
the underlying problem is still there and exploitable, there
is just nothing to exploit.
Since the root cause of 3 is a mechanism that results in a
huge performance benefit by not having to take a few thousand
or perhaps millions of page faults a second, at the very least
you now have to take the hit of those page faults. Worse yet,
the fix, from what SemiAccurate has gathered so far, has to
unload the kernel pages from virtual memory maps on a context
switch. So with the patch not only do you have to take the
hit you previously avoided, but you have to also do a lot of
work copying/scrubbing virtual memory every time you do. This
explains the hit of ~1/3rd of your total CPU performance
quite nicely.
Going back to user x86 machines and ARM devices, they aren’t
doing nearly as many context switches as the servers are but
likely have to do the same work when doing a switch. In short
if you do a theoretical 5% of the switches, you take 5% of
that 30% hit. It isn’t this simple but you get the idea, it
is unlikely to cripple a consumer desktop PC or phone but
will probably cripple a server. Workload dependent, we meant it.
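The back-of-envelope scaling in the paragraph above can be written down directly. This is only the article's linear simplification, which the article itself says "isn't this simple"; the function name and the 30% and 5% inputs are taken from the text as illustrative values, not measurements.

```python
def estimated_hit(server_hit: float, relative_switch_rate: float) -> float:
    """Scale a server-class slowdown linearly by relative context-switch rate.

    server_hit: fractional slowdown seen on the reference server workload.
    relative_switch_rate: this machine's switch rate divided by the server's.
    """
    if not 0.0 <= server_hit <= 1.0:
        raise ValueError("server_hit must be a fraction between 0 and 1")
    if relative_switch_rate < 0.0:
        raise ValueError("relative_switch_rate must be non-negative")
    # Cap at 100%: a machine cannot lose more than all of its throughput.
    return min(1.0, server_hit * relative_switch_rate)


if __name__ == "__main__":
    desktop = estimated_hit(server_hit=0.30, relative_switch_rate=0.05)
    print(f"estimated desktop hit: {desktop:.1%}")
```

With the article's numbers, a machine doing 5% of the switches lands at roughly a 1.5% slowdown, which is why desktops and phones are unlikely to notice.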
*The Knife Goes In:*
So x86 servers are in deep trouble, what was doable on two
racks of machines now needs three if you apply the patch for
3. If not, well customers have lawyers, will you risk it?
Worse yet would you buy cloud services from someone who
didn’t apply the patch? Think about this for the economics of
the megadatacenters, if you are buying 100K+ servers a month,
you now need closer to 150K, not a trivial added outlay for
even the big guys.
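The rack arithmetic can be checked by dividing the required capacity by what each patched server still delivers. In this sketch the function name is hypothetical and exact fractions are used to avoid rounding surprises. Note that "two racks become three" corresponds to losing one third of throughput; a flat 30% hit comes out nearer 143K than 150K servers.

```python
import math
from fractions import Fraction


def servers_needed(baseline_servers: int, slowdown: Fraction) -> int:
    """Servers required after the patch to match pre-patch total throughput."""
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in the range [0, 1)")
    remaining = 1 - slowdown  # fraction of throughput each server still delivers
    return math.ceil(Fraction(baseline_servers) / remaining)


if __name__ == "__main__":
    for label, slowdown in (("30% hit", Fraction(3, 10)), ("1/3 hit", Fraction(1, 3))):
        print(f"{label}: {servers_needed(100_000, slowdown):,} servers")
```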
But there is one big caveat and it comes down to the part we
said we would get to later. Later is now. Go back and look at
that AMD chart near the top of the article, specifically
their vulnerability for Variant 3 attacks. Note the bit
about, /“Zero AMD vulnerability or risk because of AMD
architecture differences.”/ See an issue here?
What AMD didn’t spell out in detail is a minor difference in
microarchitecture between Intel and AMD CPUs. When a CPU
speculatively executes and crosses a privilege level
boundary, any idiot would probably say that the CPU should
see this crossing and not execute the following instructions
that are out of its privilege level. This isn’t rocket
science, just basic common sense.
AMD’s microarchitecture sees this privilege level change and
throws the microarchitectural equivalent of a hissy fit and
doesn’t execute the code. Common sense wins out. Intel’s
implementation does execute the following code across
privilege levels which sounds on the surface like a bit of a
face-palm implementation but it really isn’t.
What saves Intel is that the speculative execution goes on
but, to the best of our knowledge, is unwound when the
privilege level changes a few instructions later. Since Intel
CPUs in the wild don’t crash or violate privilege levels, it
looks like that mechanism works properly in practice. What
these new exploits do is slip a few very short instructions
in that can read data from the other user or privilege level
before the context change happens. If crafted correctly the
instructions are unwound but the data can be stashed in a
place that is persistent.
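The idea of rolled-back instructions leaving data "stashed in a place that is persistent" can be illustrated with a toy model. The sketch below is a pure simulation under loud assumptions: there is no real speculation, no real cache, and the latencies are invented constants. All class and function names are hypothetical. It only shows why a value that is never architecturally returned can still be recovered from which cache line became fast.

```python
CACHE_HIT_NS = 1     # invented latency for a line already in the cache
CACHE_MISS_NS = 100  # invented latency for a line that must be fetched


class ToyCpu:
    """Simulated CPU: privileged memory plus a set of cached line numbers."""

    def __init__(self, kernel_memory: bytes):
        self._kernel_memory = kernel_memory  # never returned to the caller
        self.cached_lines = set()            # microarchitectural state

    def flush(self) -> None:
        """Attacker evicts every probe line before the attempt."""
        self.cached_lines.clear()

    def speculative_read(self, address: int) -> None:
        """Model a transient privileged load that is later unwound."""
        secret = self._kernel_memory[address]
        self.cached_lines.add(secret)  # side effect survives the rollback
        # Rollback: the transient value is discarded, so nothing is returned.

    def probe_latency(self, line: int) -> int:
        """Simulated time to touch one probe line."""
        return CACHE_HIT_NS if line in self.cached_lines else CACHE_MISS_NS


def leak_byte(cpu: ToyCpu, address: int) -> int:
    """Recover one byte by finding the single fast probe line."""
    cpu.flush()
    cpu.speculative_read(address)
    latencies = [cpu.probe_latency(line) for line in range(256)]
    return latencies.index(min(latencies))


def leak(cpu: ToyCpu, length: int) -> bytes:
    """Recover `length` bytes, one simulated probe round per byte."""
    return bytes(leak_byte(cpu, address) for address in range(length))


if __name__ == "__main__":
    cpu = ToyCpu(b"secret")
    print(leak(cpu, 6))
```

Real attacks need precise timers, eviction strategies, and retries to deal with noise; the linked Google write-up covers those details.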
Intel probably gets a slight performance gain from doing this
‘sloppy’ method but AMD seems to have done the right
thing for the right reasons. That extra bounds check probably
takes a bit of time but in retrospect, doing the right thing
was worth it. Since both are fundamentally ‘correct’ behaviors
for their respective microarchitectures, there is no possible
fix, just code that avoids scenarios where it can be abused.
For Intel this avoidance comes with a 30% performance hit on
server type workloads, less on desktop workloads. For AMD the
problem was avoided by design and the performance hit is
zero. Doing the right thing for the right reasons even if it
is marginally slower seems to have paid off in this
circumstance. Mother was right, AMD listened, Intel didn’t.
*Weasel Words:*
Now you have a bit more context about why Intel’s response
was, well, a non-response. They blamed others, correctly, for
having the same problem but their blanket statement avoided
the obvious issue that the others aren’t crippled by the
effects of the patches the way Intel is. Intel screwed up, badly,
and is facing a 30% performance hit going forward for it.
AMD did right and are probably breaking out the champagne at
HQ about now.
Intel also tried to deflect lawyers by saying they follow
industry best practices. They don’t and the AMT hole was a
shining example of them putting PR above customer security.
Similarly, their sitting on the fix for the TXT flaw for
*THREE YEARS*
<https://www.semiaccurate.com/2016/01/20/intel-puts-out-secure-cpus-based-on-insecurity/>
because they didn’t want to admit to architectural security blunders
and reveal publicly embarrassing policies, until forced to
disclose by a governmental agency being exploited by a
foreign power, is another example that shines a harsh light on
their ‘best practices’ line. There are many more like this.
Intel isn’t to be trusted for security practices or
disclosures because PR takes precedence over customer security.
*Rubber Meet Road:*
Unfortunately security doesn’t sell and rarely affects
marketshare. This time, however, is different and will hit
Intel where it hurts, in the wallet. SemiAccurate thinks this
exploit is going to devastate Intel’s marketshare. Why? Read
on subscribers.
/Note: The following is analysis for professional level
subscribers only./
/Disclosures: Charlie Demerjian and Stone Arch Networking
Services, Inc. have no consulting relationships, investment
relationships, or hold any investment positions with any of
the companies mentioned in this report./
On Thu, Jan 4, 2018 at 6:21 PM,
Reuti <re...@staff.uni-marburg.de
<mailto:re...@staff.uni-marburg.de>> wrote:
On 04.01.2018 at 23:45, r...@open-mpi.org
<mailto:r...@open-mpi.org> wrote:
> As more information continues to surface, it is clear
> that this original article that spurred this thread was
> somewhat incomplete - probably released a little too
> quickly, before full information was available. There is
> still some confusion out there, but the gist from surfing
> the various articles (and trimming away the hysteria)
> appears to be:
>
> * there are two security issues, both stemming from the
> same root cause. The “problem” has actually been around
> for nearly 20 years, but faster processors are making it
> much more visible.
>
> * one problem (Meltdown) specifically impacts at least
> Intel, ARM, and AMD processors. This problem is the one
> that the kernel patches address as it can be corrected
> via software, albeit with some impact that varies based
> on application. Those apps that perform lots of kernel
> services will see larger impacts than those that don’t
> use the kernel much.
>
> * the other problem (Spectre) appears to impact _all_
> processors (including, by some reports, SPARC and Power).
> This problem lacks a software solution.
>
> * the “problem” is only a problem if you are running on
> shared nodes - i.e., if multiple users share a common OS
> instance as it allows a user to potentially access the
> kernel information of the other user. So HPC
> installations that allocate complete nodes to a single
> user might want to take a closer look before installing
> the patches. Ditto for your desktop and laptop - unless
> someone can gain access to the machine, it isn’t really a
> “problem”.
Weren't there some PowerPC with strict in-order-execution
which could circumvent this? I find a hint about an
"EIEIO" command only. Sure, in-order-execution might slow
down the system too.
-- Reuti
>
> * containers and VMs don’t fully resolve the problem -
> the only solution other than the patches is to limit
> allocations to single users on a node
>
> HTH
> Ralph
>
>
>> On Jan 3, 2018, at 10:47 AM, r...@open-mpi.org
>> <mailto:r...@open-mpi.org> wrote:
>>
>> Well, it appears from that article that the primary
>> impact comes from accessing kernel services. With an
>> OS-bypass network, that shouldn’t happen all that
>> frequently, and so I would naively expect the impact to
>> be at the lower end of the reported scale for those
>> environments. TCP-based systems, though, might be on the
>> other end.
>>
>> Probably something we’ll only really know after testing.
>>
>>> On Jan 3, 2018, at 10:24 AM, Noam Bernstein
>>> <noam.bernst...@nrl.navy.mil
>>> <mailto:noam.bernst...@nrl.navy.mil>> wrote:
>>>
>>> Out of curiosity, have any of the OpenMPI developers
>>> tested (or care to speculate) how strongly affected
>>> OpenMPI based codes (just the MPI part, obviously) will
>>> be by the proposed Intel CPU memory-mapping-related
>>> kernel patches that are all the rage?
>>>
>>>
>>> https://arstechnica.com/gadgets/2018/01/whats-behind-the-intel-design-flaw-forcing-numerous-patches/
>>> <https://arstechnica.com/gadgets/2018/01/whats-behind-the-intel-design-flaw-forcing-numerous-patches/>
>>>
>>> Noam
>>> _______________________________________________
>>> users mailing list
>>>users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>
>>>https://lists.open-mpi.org/mailman/listinfo/users
<https://lists.open-mpi.org/mailman/listinfo/users>
--
Jeff Hammond
jeff.scie...@gmail.com <mailto:jeff.scie...@gmail.com>
http://jeffhammond.github.io/