On 2024-07-20 at 22:07, Jeffrey Walton wrote:

> On Sat, Jul 20, 2024 at 9:46 PM The Wanderer <wande...@fastmail.fm>
> wrote:
>
>> On 2024-07-20 at 09:19, jeremy ardley wrote:
>>> The problem is the Windows Systems Administrators who contracted
>>> for / allowed unattended remote updates of kernel drivers on live
>>> hardware systems. This is the height of folly and there is no
>>> recovery if it causes a BSOD.

>> All the sysadmins involved did was agree to let an
>> antivirus-equivalent utility update itself, and its definitions. I
>> would be surprised if this could not have easily happened with *any*
>> antivirus-type utility which has self-update capability; I'm fairly
>> sure all modern broad-spectrum antivirus-etc. suites on Windows do
>> kernel-level access in similar fashion. CrowdStrike just happens to
>> be the company involved when it *did* happen.

> I was around when Symantec Antivirus did about the same to about half
> the workstations at the Social Security Administration. A definition
> file update blue screened about half the Windows NT 4.0 and Windows
> 2000 hosts. That was about 50,000 machines, if I recall correctly.

There *is* a difference between this incident and that one: the *scale*
of the issue. But otherwise, yes, I've seen less-severe breakages of
this sort occur in the past as well.

>> That the sysadmins decided to deploy CrowdStrike does not make it
>> reasonable to fault them for this consequence, any more than e.g. if
>> a gamer decided to install a game, and then the game required a
>> patch to let them keep playing, and that patch silently included
>> new/updated DRM which installed a driver which broke the system (as
>> I recall some past DRM implementations have reportedly done), it
>> would then be reasonable to fault the gamer. In neither case was the
>> consequence foreseeable from the decision.

> Sysadmins don't make that decision in the Enterprise. That decision
> was made above the lowly sysadmin's pay grade.

It does depend on the enterprise. In my organization, I'm fairly sure
the people who made the decision at least did so with informed input
from the sysadmins, including specifically the people who were
administering the existing antivirus solution (McAfee).

>>> The situation is recoverable if all the Windows machines are
>>> virtual with a good backup/restore plan. The situation is not
>>> recoverable if the kernel updates are on raw iron running Windows.
>>
>> The situation is trivially recoverable if you can get access to the
>> machine in a way which lets you either boot to safe mode and get
>> local-administrator access, or lets you boot an alternative
>> environment (e.g. live-boot media) from which you can read and write
>> to the hard drive.

> I don't think it's trivial for some enterprises due to the sheer
> number of machines and the remote workforce.

Yeah - after the fact, it occurred to me that I hadn't specified that
what this process is *not* is *automatable*, which has inevitable
consequences for how hard it is to scale the solution out. At most you
could provide bootable media which would, when booted to, fix the issue
and reboot. (If you could set things up for that to be available by PXE
boot, and if you have everything configured to try PXE booting first
before booting locally, then maybe you could automate it with nothing
more than telling people to reboot any computer they see affected? But
even that type of solution has its limits.)
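To give a concrete sense of what such self-remediating boot media might
run, here's a minimal sketch (Python, purely for illustration). It
assumes the affected Windows volume is already mounted read-write
somewhere, and that the problem file matches the C-00000291*.sys
pattern that has been widely reported for the bad channel-file update;
both the mount point and that pattern are assumptions on my part, not
anything official from CrowdStrike.

    #!/usr/bin/env python3
    # Rough illustration only, not an official remediation procedure.
    # Assumes a live/recovery environment with the affected Windows
    # volume mounted read-write; pass the mount point as the first
    # argument (the /mnt/windows default is made up).
    import pathlib
    import shutil
    import sys

    mount = pathlib.Path(sys.argv[1] if len(sys.argv) > 1 else "/mnt/windows")
    drivers = mount / "Windows" / "System32" / "drivers" / "CrowdStrike"
    quarantine = mount / "crowdstrike-quarantine"

    # The C-00000291*.sys pattern is the widely-reported one for the
    # problematic content update; treat it as an assumption.
    matches = list(drivers.glob("C-00000291*.sys"))
    if matches:
        quarantine.mkdir(exist_ok=True)
    for bad_file in matches:
        # Move the file aside rather than deleting it, so it can be
        # inspected (or restored) later if need be.
        print(f"moving {bad_file} -> {quarantine / bad_file.name}")
        shutil.move(str(bad_file), str(quarantine / bad_file.name))

Of course, getting a generic image to the point where it can mount
every affected machine's disk in the first place (disk encryption being
the obvious obstacle) is exactly the part that doesn't generalize well,
which is what I meant about that type of solution having its limits.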
> I'm guessing the company I work for will spend the next week or month
> sorting things out. And the company is a medium size enterprise with
> about 30,000 employees. Imagine how bad it's going to be for an
> enterprise with 100,000 employees.

Oh, I can.

>> I've spent a fair chunk of my workday today going around to affected
>> computers and performing a variant of the latter process.
>>
>> Once you've done that, the fix is simple: delete, or move out of the
>> way, a single file whose name claims that it's a driver. With that
>> file gone, you can reboot, and Windows will come up normally without
>> the bluescreen.

> Unfortunately, I don't see this as scalable. It works fine for a
> small business with 100 employees, but not an enterprise.

My own organization has thousands of computers, something like
1000-3000 of which have CrowdStrike Falcon as their antimalware
solution. The part of our IT department which would typically be
expected to handle the client-side remediation of something like this
(including making and keeping appointments with remote workers who were
impacted) is a maximum of 16 people, and I believe we're currently
working with two positions empty.

That said, a *lot* of our CrowdStrike-using computers seem not to have
been affected; as far as I can tell, most of them were *off* for the
entire active-issue period, and so never received the problematic
update. Someone has estimated that only 8% of our total computers are
affected. (I don't know where they got that figure from, but I do know
that "our total computers" includes another 3000-5000 units which use a
different antimalware solution, so that's going to skew the
percentage.)

It's still likely to take us weeks, if not months, to get everything
affected by this back into working order.

>>> Heads should roll but obviously won't
>>
>> What good would decapitation do, here?

> I think it's a figure of speech; not a literal.

Indeed. I was simply extending the metaphor.

>> At most, CrowdStrike's people are guilty of rolling out an
>> insufficiently-tested update, or of designing a system such that
>> it's too easy for an update to break things in this way, or that
>> it's possible to break things in this way not with an actual new
>> client version (which goes through a release cascade, with each
>> organization deciding which of the most recent three versions each
>> of their computers will get) but just with a data-files update
>> (which, as we have seen here, appears to go out to all clients
>> regardless of version).

> At minimum, it is negligence.

Agreed.

>> The first would be poor institutional practice; the others would be
>> potentially-questionable software design, although it's hard to know
>> without seeing the internal architecture of the software in question
>> and understanding *why* it's designed that way.
>>
>> In either case, it's not obvious to me why decapitating a few
>> scapegoats would *improve* the situation going forward, unless it
>> can be determined that specific people were actually negligent.

> The incident affected the company's share price. Shares were down $10
> or $15.

I was watching this over the course of the day, and saw it quoted as
"down nearly 15%" before the start of trading and "down 9%" after
trading had closed for the day. I'm not sure what that reflects in
real-world practice, and I didn't see dollar prices quoted.

> If the potential issues were not detailed in company literature and
> prospectus, then the Securities and Exchange Commission might get
> involved for misrepresenting risk and liabilities. There could be big
> fines, and that will cost the shareholders more money.
>
> All this points to an incompetent board.
> If someone's head is going to be taken (figuratively), then it should
> start with the CEO and other executives.

I could see an argument for that, although I'm not 100% convinced based
on what I've seen to date. I'd need more information and details, and
am unlikely to get them.

--
   The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man.        -- George Bernard Shaw