An article with "market share" in the title is not a technical assessment, but in any case, you aren't willing to respect the request to focus on Open-MPI on the Open-MPI list, so I'll be piping mail from your address to trash from now on.
Jeff

On Thu, Jan 4, 2018 at 10:54 PM, John Chludzinski <john.chludzin...@gmail.com> wrote:

> That article gives the best technical assessment I've seen of Intel's
> architecture bug. I noted the discussion's subject and thought I'd add some
> clarity. Nothing more.
>
> For the TL;DR crowd: get an AMD chip in your computer.
>
> On Thursday, January 4, 2018, r...@open-mpi.org <r...@open-mpi.org> wrote:
>
>> Yes, please - that was totally inappropriate for this mailing list.
>> Ralph
>>
>>
>> On Jan 4, 2018, at 4:33 PM, Jeff Hammond <jeff.scie...@gmail.com> wrote:
>>
>> Can we restrain ourselves to talk about Open-MPI or at least technical
>> aspects of HPC communication on this list and leave the stock market tips
>> for Hacker News and Twitter?
>>
>> Thanks,
>>
>> Jeff
>>
>> On Thu, Jan 4, 2018 at 3:53 PM, John Chludzinski <john.chludzinski@gmail.com> wrote:
>>
>>> From https://semiaccurate.com/2018/01/04/kaiser-security-holes-will-devastate-intels-marketshare/
>>>
>>> Kaiser security holes will devastate Intel’s marketshare
>>> Analysis: This one tips the balance toward AMD in a big way
>>> Jan 4, 2018 by Charlie Demerjian <https://semiaccurate.com/author/charlie/>
>>>
>>> This latest decade-long critical security hole in Intel CPUs is going to
>>> cost the company significant market share. SemiAccurate thinks it is not
>>> only consequential but will shift the balance of power away from Intel CPUs
>>> for at least the next several years.
>>>
>>> Today’s latest crop of gaping security flaws has three sets of holes
>>> across Intel, AMD, and ARM processors along with a slew of official
>>> statements and detailed analyses. On top of that the statements from
>>> vendors range from detailed and direct to intentionally misleading and
>>> slimy. Let’s take a look at what the problems are, who they affect and what
>>> the outcome will be.
>>> Those outcomes range from trivial patching to
>>> destroying the market share of Intel servers, and no we are not joking.
>>>
>>> (*Author’s Note 1:* For the technical readers we are simplifying a lot,
>>> sorry we know this hurts. The full disclosure docs are linked, read them
>>> for the details.)
>>>
>>> (*Author’s Note 2:* For the financially oriented subscribers out there,
>>> the parts relevant to you are at the very end, the section is titled
>>> *Rubber Meet Road*.)
>>>
>>> *The Problem(s):*
>>>
>>> As we said earlier there are three distinct security flaws that all fall
>>> somewhat under the same umbrella. All are ‘new’ in the sense that the class
>>> of attacks hasn’t been publicly described before, and all are very obscure
>>> CPU speculative execution and timing related problems. The extent to which
>>> the fixes affect differing architectures also ranges from minor to
>>> near-crippling slowdowns. Worse yet is that all three flaws aren’t bugs or
>>> errors, they exploit correct CPU behavior to allow the systems to be hacked.
>>>
>>> The three problems are cleverly labeled Variant One, Variant Two, and
>>> Variant Three. Google Project Zero was the original discoverer of them and
>>> has labeled the classes as Bounds Check Bypass, Branch Target Injection,
>>> and Rogue Data Cache Load respectively. You can read up on the
>>> extensive and gory details here
>>> <https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html>
>>> if you wish.
>>>
>>> If you are the TLDR type the very simplified summary is that modern CPUs
>>> will speculatively execute operations ahead of the one they are currently
>>> running. Some architectures will allow these executions to start even when
>>> they violate privilege levels, but those instructions are killed or rolled
>>> back hopefully before they actually complete running.
>>>
>>> Another feature of modern CPUs is virtual memory which can allow memory
>>> from two or more processes to occupy the same physical page. This is a good
>>> thing because if you have memory from the kernel and a bit of user code in
>>> the same physical page but different virtual pages, changing from kernel to
>>> userspace execution doesn’t require a page fault. This saves massive
>>> amounts of time and overhead giving modern CPUs a huge speed boost. (For
>>> the really technical out there, I know you are cringing at this
>>> simplification, sorry).
>>>
>>> These two things together allow you to do some interesting things and
>>> along with timing attacks add new weapons to your hacking arsenal. If you
>>> have code executing on one side of a virtual memory page boundary, it can
>>> speculatively execute the next few instructions on the physical page that
>>> cross the virtual page boundary. This isn’t a big deal unless the two
>>> virtual pages are mapped to processes that are from different users or
>>> different privilege levels. Then you have a problem. (Again painfully
>>> simplified and liberties taken with the explanation, read the Google paper
>>> for the full detail.)
>>>
>>> This speculative execution allows you to get a few short (low latency)
>>> instructions in before the speculation ends. Under certain circumstances
>>> you can read memory from different threads or privilege levels, write those
>>> things somewhere, and figure out what addresses other bits of code are
>>> using. The latter bit has the nasty effect of potentially blowing through
>>> address space randomization defenses which are a keystone of modern
>>> security efforts. It is ugly.
>>>
>>> *Who Gets Hit:*
>>>
>>> So we have three attack vectors and three affected companies, Intel,
>>> AMD, and ARM. Each has a different set of vulnerabilities to the different
>>> attacks due to differences in underlying architectures.
>>> AMD put out a
>>> pretty clear statement of what is affected, ARM put out by far the best and
>>> most comprehensive description, and Intel obfuscated, denied, blamed
>>> others, and downplayed the problem. If this was a contest for misleading
>>> with doublespeak and misdirection, Intel won with a gold star, the others
>>> weren’t even in the game. Let’s look at who said what and why.
>>>
>>> *ARM:*
>>>
>>> ARM has a page up <https://developer.arm.com/support/security-update>
>>> listing vulnerable processor cores, descriptions of the attacks, and plenty
>>> of links to more information. They also put up a very comprehensive white
>>> paper that rivals Google’s original writeup, complete with code examples
>>> and a new 3a variant. You can find it here
>>> <https://developer.arm.com/support/security-update/download-the-whitepaper>.
>>> Just for completeness we are putting up ARM’s excellent table of affected
>>> processors, enjoy.
>>>
>>> [image: ARM Kaiser core table]
>>> <https://www.semiaccurate.com/assets/uploads/2018/01/ARM_Kaiser_response_table.jpg>
>>>
>>> *Affected ARM cores*
>>>
>>> *AMD:*
>>>
>>> AMD gave us the following table which lays out their position pretty
>>> clearly. The short version is that architecturally speaking they are
>>> vulnerable to 1 and 2 but three is not possible due to microarchitecture.
>>> More on this in a bit, it is very important. AMD also went on to describe
>>> some of the issues and mitigations to SemiAccurate, but again, more in a
>>> bit.
>>>
>>> [image: AMD Kaiser response Matrix]
>>> <https://www.semiaccurate.com/assets/uploads/2018/01/AMD_Kaiser_response.jpg>
>>>
>>> *AMD’s response matrix*
>>>
>>> *Intel:*
>>>
>>> Intel is continuing to be the running joke of the industry as far as
>>> messaging is concerned. Their statement is a pretty awe-inspiring example
>>> of saying nothing while desperately trying to minimize the problem.
>>> You can find it here
>>> <https://newsroom.intel.com/news/intel-responds-to-security-research-findings/>
>>> but it contains zero useful information. SemiAccurate is getting tired of
>>> saying this but Intel should be ashamed of how their messaging is done, not
>>> saying anything would do less damage than their current course of action.
>>>
>>> You will notice the line in the second paragraph, “*Recent reports that
>>> these exploits are caused by a “bug” or a “flaw” and are unique to Intel
>>> products are incorrect.”* This is technically true and pretty damning.
>>> They are directly saying that the problem is not a bug but is due to
>>> *misuse of correct processor behavior*. This is a critical problem because
>>> it can’t be ‘patched’ or ‘updated’ like a bug or flaw without breaking the
>>> CPU. In short you can’t fix it, and this will be important later. Intel
>>> mentions this but others don’t for a good reason, again later.
>>>
>>> Then Intel goes on to say, *“Intel is committed to the industry best
>>> practice of responsible disclosure of potential security issues, which is
>>> why Intel and other vendors had planned to disclose this issue next week
>>> when more software and firmware updates will be available. However, Intel
>>> is making this statement today because of the current inaccurate media
>>> reports.*” This is simply not true, or at least the part about industry
>>> best practices of responsible disclosure. Intel sat on the last critical
>>> security flaw affecting 10+ years of CPUs which SemiAccurate
>>> exclusively disclosed
>>> <https://www.semiaccurate.com/2017/05/01/remote-security-exploit-2008-intel-platforms/>
>>> for 6+ weeks after a patch was released. Why? PR reasons.
>>>
>>> SemiAccurate feels that Intel holding back knowledge of what we believe
>>> were flaws being actively exploited in the field even though there were
>>> simple mitigation steps available is not responsible. Or best practices. Or
>>> ethical.
>>> Or anything even intoning goodness. It is simply unethical, but
>>> only that good if you are feeling kind. Intel does not do the right thing
>>> for security breaches and has not even attempted to do so in the 15+ years
>>> this reporter has been tracking them on the topic. They are by far the
>>> worst major company in this regard, and getting worse.
>>>
>>> *Mitigation:*
>>>
>>> As is described by Google, ARM, and AMD, but not Intel, there are
>>> workarounds for the three new vulnerabilities. Since Google first
>>> discovered these holes in June, 2017, there have been patches pushed up to
>>> various Linux kernel and related repositories. The first one SemiAccurate
>>> can find was dated October 2017 and the industry coordinated announcement
>>> was set for Monday, January 9, 2018 so you can be pretty sure that the
>>> patches are in place and ready to be pushed out if not on your systems
>>> already. Microsoft and Apple are said to be at a similar state of readiness
>>> too. In short by the time you read this, it will likely be fixed.
>>>
>>> That said the fixes do have consequences, and all are heavily workload
>>> dependent. For variants 1 and 2 the performance hit is pretty minor with
>>> reports of ~1% performance hits under certain circumstances but for the
>>> most part you won’t notice anything if you patch, and you should patch.
>>> Basically 1 and 2 are irrelevant from any performance perspective as long
>>> as your system is patched.
>>>
>>> The big problem is with variant 3 which ARM claims has a similar effect
>>> on devices like phones or tablets, i.e., low single-digit performance hits
>>> if that. Given the way ARM CPUs are used in the majority of devices, they
>>> don’t tend to have the multi-user, multi-tenant, heavily virtualized
>>> workloads that servers do. For the few ARM cores that are affected, their
>>> users will see a minor, likely unnoticeable performance hit when patched.
>>>
>>> User x86 systems will likely be closer to the ARM model for performance
>>> hits. Why? Because while they can run heavily virtualized, multi-user,
>>> multi-tenant workloads, most desktop users don’t. Even if they do, it is
>>> pretty rare that these users are CPU bound for performance, memory and
>>> storage bandwidth will hammer performance on these workloads long before
>>> the CPU becomes a bottleneck. Why do we bring this up?
>>>
>>> Because in those heavily virtualized, multi-tenant, multi-user workloads
>>> that most servers run in the modern world, the patches for 3 are painful.
>>> How painful? SemiAccurate’s research has found reports of between 5-50%
>>> slowdowns, again workload and software dependent, with the average being
>>> around 30%. This stands to reason because the fixes we have found
>>> essentially force a demapping of kernel code on a context switch.
>>>
>>> *The Pain:*
>>>
>>> This may sound like techno-babble but it isn’t, and it happens many
>>> thousands of times a second on modern machines if not more. Because as
>>> Intel pointed out, the CPU is operating correctly and the exploit uses
>>> correct behavior, it can’t be patched or ‘fixed’ without breaking the CPU
>>> itself. Instead what you have to do is make sure the circumstances that can
>>> be exploited don’t happen. Consider this a software workaround or avoidance
>>> mechanism, not a patch or bug fix, the underlying problem is still there
>>> and exploitable, there is just nothing to exploit.
>>>
>>> Since the root cause of 3 is a mechanism that results in a huge
>>> performance benefit by not having to take a few thousand or perhaps
>>> millions of page faults a second, at the very least you now have to take
>>> the hit of those page faults. Worse yet the fix, from what SemiAccurate has
>>> gathered so far, has to unload the kernel pages from virtual memory maps on
>>> a context switch.
>>> So with the patch not only do you have to take the hit
>>> you previously avoided, but you have to also do a lot of work
>>> copying/scrubbing virtual memory every time you do. This explains the hit
>>> of ~1/3rd of your total CPU performance quite nicely.
>>>
>>> Going back to user x86 machines and ARM devices, they aren’t doing
>>> nearly as many context switches as the servers are but likely have to do
>>> the same work when doing a switch. In short if you do a theoretical 5% of
>>> the switches, you take 5% of that 30% hit. It isn’t this simple but you get
>>> the idea, it is unlikely to cripple a consumer desktop PC or phone but will
>>> probably cripple a server. Workload dependent, we meant it.
>>>
>>> *The Knife Goes In:*
>>>
>>> So x86 servers are in deep trouble, what was doable on two racks of
>>> machines now needs three if you apply the patch for 3. If not, well
>>> customers have lawyers, will you risk it? Worse yet would you buy cloud
>>> services from someone who didn’t apply the patch? Think about this for the
>>> economics of the megadatacenters, if you are buying 100K+ servers a month,
>>> you now need closer to 150K, not a trivial added outlay for even the big
>>> guys.
>>>
>>> But there is one big caveat and it comes down to the part we said we
>>> would get to later. Later is now. Go back and look at that AMD chart near
>>> the top of the article, specifically their vulnerability for Variant 3
>>> attacks. Note the bit about, “*Zero AMD vulnerability or risk because
>>> of AMD architecture differences.*” See an issue here?
>>>
>>> What AMD didn’t spell out in detail is a minor difference in
>>> microarchitecture between Intel and AMD CPUs. When a CPU speculatively
>>> executes and crosses a privilege level boundary, any idiot would probably
>>> say that the CPU should see this crossing and not execute the following
>>> instructions that are out of its privilege level. This isn’t rocket
>>> science, just basic common sense.
>>>
>>> AMD’s microarchitecture sees this privilege level change and throws the
>>> microarchitectural equivalent of a hissy fit and doesn’t execute the code.
>>> Common sense wins out. Intel’s implementation does execute the following
>>> code across privilege levels which sounds on the surface like a bit of a
>>> face-palm implementation but it really isn’t.
>>>
>>> What saves Intel is that the speculative execution goes on but, to the
>>> best of our knowledge, is unwound when the privilege level changes a few
>>> instructions later. Since Intel CPUs in the wild don’t crash or violate
>>> privilege levels, it looks like that mechanism works properly in practice.
>>> What these new exploits do is slip a few very short instructions in that
>>> can read data from the other user or privilege level before the context
>>> change happens. If crafted correctly the instructions are unwound but the
>>> data can be stashed in a place that is persistent.
>>>
>>> Intel probably gets a slight performance gain from doing this ‘sloppy’
>>> method but AMD seems to have done the right thing for the right
>>> reasons. That extra bounds check probably takes a bit of time but in
>>> retrospect, doing the right thing was worth it. Since both are fundamental
>>> ‘correct’ behaviors for their respective microarchitectures, there is no
>>> possible fix, just code that avoids scenarios where it can be abused.
>>>
>>> For Intel this avoidance comes with a 30% performance hit on server type
>>> workloads, less on desktop workloads. For AMD the problem was avoided by
>>> design and the performance hit is zero. Doing the right thing for the right
>>> reasons even if it is marginally slower seems to have paid off in this
>>> circumstance. Mother was right, AMD listened, Intel didn’t.
>>>
>>> *Weasel Words:*
>>>
>>> Now you have a bit more context about why Intel’s response was, well, a
>>> non-response.
>>> They blamed others, correctly, for having the same problem
>>> but their blanket statement avoided the obvious issue that the others
>>> aren’t crippled by the effects of the patches like Intel. Intel screwed up,
>>> badly, and are facing a 30% performance hit going forward for it. AMD did
>>> right and are probably breaking out the champagne at HQ about now.
>>>
>>> Intel also tried to deflect lawyers by saying they follow industry best
>>> practices. They don’t and the AMT hole was a shining example of them
>>> putting PR above customer security. Similarly their sitting on the fix
>>> for the TXT flaw for *THREE YEARS*
>>> <https://www.semiaccurate.com/2016/01/20/intel-puts-out-secure-cpus-based-on-insecurity/>
>>> because they didn’t want to admit to architectural security blunders and
>>> reveal publicly embarrassing policies until forced to disclose by a
>>> governmental agency being exploited by a foreign power is another example
>>> that shines a harsh light on their ‘best practices’ line. There are many
>>> more like this. Intel isn’t to be trusted for security practices or
>>> disclosures because PR takes precedence over customer security.
>>>
>>> *Rubber Meet Road:*
>>>
>>> Unfortunately security doesn’t sell and rarely affects marketshare. This
>>> time however is different and will hit Intel where it hurts, in the wallet.
>>> SemiAccurate thinks this exploit is going to devastate Intel’s marketshare.
>>> Why? Read on subscribers.
>>>
>>> *Note: The following is analysis for professional level subscribers
>>> only.*
>>>
>>> *Disclosures: Charlie Demerjian and Stone Arch Networking Services, Inc.
>>> have no consulting relationships, investment relationships, or hold any
>>> investment positions with any of the companies mentioned in this report.*
>>>
>>>
>>> On Thu, Jan 4, 2018 at 6:21 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>>>
>>>> On 04.01.2018 at 23:45, r...@open-mpi.org wrote:
>>>>
>>>> > As more information continues to surface, it is clear that this
>>>> > original article that spurred this thread was somewhat incomplete -
>>>> > probably released a little too quickly, before full information was
>>>> > available. There is still some confusion out there, but the gist from
>>>> > surfing the various articles (and trimming away the hysteria) appears
>>>> > to be:
>>>> >
>>>> > * there are two security issues, both stemming from the same root
>>>> > cause. The “problem” has actually been around for nearly 20 years, but
>>>> > faster processors are making it much more visible.
>>>> >
>>>> > * one problem (Meltdown) specifically impacts at least Intel, ARM,
>>>> > and AMD processors. This problem is the one that the kernel patches
>>>> > address as it can be corrected via software, albeit with some impact
>>>> > that varies based on application. Those apps that perform lots of
>>>> > kernel services will see larger impacts than those that don’t use the
>>>> > kernel much.
>>>> >
>>>> > * the other problem (Spectre) appears to impact _all_ processors
>>>> > (including, by some reports, SPARC and Power). This problem lacks a
>>>> > software solution
>>>> >
>>>> > * the “problem” is only a problem if you are running on shared nodes
>>>> > - i.e., if multiple users share a common OS instance as it allows a
>>>> > user to potentially access the kernel information of the other user.
>>>> > So HPC installations that allocate complete nodes to a single user
>>>> > might want to take a closer look before installing the patches. Ditto
>>>> > for your desktop and laptop - unless someone can gain access to the
>>>> > machine, it isn’t really a “problem”.
>>>>
>>>> Weren't there some PowerPC with strict in-order-execution which could
>>>> circumvent this? I find a hint about an "EIEIO" command only. Sure,
>>>> in-order-execution might slow down the system too.
>>>>
>>>> -- Reuti
>>>>
>>>> >
>>>> > * containers and VMs don’t fully resolve the problem - the only
>>>> > solution other than the patches is to limit allocations to single
>>>> > users on a node
>>>> >
>>>> > HTH
>>>> > Ralph
>>>> >
>>>> >> On Jan 3, 2018, at 10:47 AM, r...@open-mpi.org wrote:
>>>> >>
>>>> >> Well, it appears from that article that the primary impact comes
>>>> >> from accessing kernel services. With an OS-bypass network, that
>>>> >> shouldn’t happen all that frequently, and so I would naively expect
>>>> >> the impact to be at the lower end of the reported scale for those
>>>> >> environments. TCP-based systems, though, might be on the other end.
>>>> >>
>>>> >> Probably something we’ll only really know after testing.
>>>> >>
>>>> >>> On Jan 3, 2018, at 10:24 AM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>>>> >>>
>>>> >>> Out of curiosity, have any of the OpenMPI developers tested (or
>>>> >>> care to speculate) how strongly affected OpenMPI based codes (just
>>>> >>> the MPI part, obviously) will be by the proposed Intel CPU
>>>> >>> memory-mapping-related kernel patches that are all the rage?
>>>> >>>
>>>> >>> https://arstechnica.com/gadgets/2018/01/whats-behind-the-intel-design-flaw-forcing-numerous-patches/
>>>> >>>
>>>> >>> Noam
>>>> >>> _______________________________________________
>>>> >>> users mailing list
>>>> >>> users@lists.open-mpi.org
>>>> >>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> http://jeffhammond.github.io/

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/