> > >>> +static bool sgx_has_eupdatesvn;
> > >>
> > >> We have CPUID "caches" of sorts. Why open code this?
> > >
> > > You mean X86_FEATURE_*?
> >
> > Yes.
> >
> > > SGX CPUID is only defined in SGX code currently (btw, I am not sure
> > > why they are made special) so it doesn’t support this.
> >
> > It's only used in SGX code and each subleaf is only used once, at init
> > time. There's really no reason to add any caching of any parts of it for
> > a single init-time use.
> >
> > > Is this a past decision that you want to change?
> >
> > *This* patch changes things. It moves the one user from a single
> > init-time use to a runtime user. It adds a global variable. It's not
> > about *me* wanting change. It's about this patch changing the status quo.
> >
> > I don't really care about the #define. That's not the question. The
> > question is whether we want a scattered X86_FEATURE bit for
> > 'sgx_has_eupdatesvn' or not.
> 
> I don't have a strong opinion about this.
> What is your suggestion here?
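> 
> Just to check that I understand the alternative correctly: with a scattered
> X86_FEATURE bit added to cpufeatures.h, the code would look roughly like the
> sketch below. The bit and helper names here are only placeholders for
> illustration, not something taken from the patch:
> 
> 	/* at SGX init time, after the one-time read of the SGX CPUID subleaf */
> 	if (eupdatesvn_supported)
> 		setup_force_cpu_cap(X86_FEATURE_SGX_EUPDATESVN);
> 
> 	/* at runtime, instead of testing the global bool */
> 	if (cpu_feature_enabled(X86_FEATURE_SGX_EUPDATESVN))
> 		ret = sgx_update_svn();
> 
> Is this the direction you would prefer over the static bool?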
> 
> >
> > >>> +static atomic_t sgx_usage_count;
> > >>
> > >> Is 32 bits enough for sgx_usage_count? What goes boom when it
> > >> overflows?
> > >
> > > I can increase the value easily, but even with the current number we are
> > > talking about 2^32 enclaves, I don't think this is realistic to happen in
> > > practice.
> >
> > I don't mean to be pedantic, but is it 2^32? I thought we had signed
> > ints underneath atomic_t and we allow negative atomic_t values. Are you
> > suggesting that since we're just using a counter that the overflows into
> > the negative space are OK and it doesn't matter until it wraps all the
> > way back around to 0?
> 
> Yes, wrapping into the negative range is ok in this case as long as we don't
> end up hitting 0. But when that does happen, we will think that the EPC is
> empty and attempt to execute EUPDATESVN. In that case, EUPDATESVN fails with
> SGX_EPC_NOT_READY or SGX_LOCKFAIL errors depending on the state
> of the system, see the microcode flow:
> 
> IF (Other instruction is accessing EPC) THEN
>  RFLAGS.ZF := 1
>  RAX := SGX_LOCKFAIL;
>  GOTO ERROR_EXIT;
> FI
> (* Verify EPC is ready *)
> IF (the EPC contains any valid pages) THEN
>  RFLAGS.ZF := 1;
>  RAX := SGX_EPC_NOT_READY;
>  GOTO ERROR_EXIT;
> FI
> 
> This is ok on its own, but in case any *other* ENCLS
> instruction is executing in parallel during this, it might #GP,
> which is going to be unexpected behaviour for legacy SW
> and something we would like to prevent. So, I would state that
> this counter must be designed against possible overflows
> because of this.
> 
> >
> > Assuming that you actually mean 2^32... Is it about 2^32 enclaves? Or
> > 2^32 *opens*? Those are very, very different things.
> 
> We will increment the counter on every open, so it is 2^32 opens at the same
> time, because each release decrements the counter.
> 
> >
> > Also, this is kinda the second question that I asked that hasn't really
> > been answered directly. I really asked it for a good reason, and I'd
> > really prefer that it be answered, even if you disagree with the premise.
> >
> > I really want to know what goes boom when it overflows, so I'll go into
> > detail why it matters.
> >
> > If it's just that an extra EUPDATESVN gets run and some random SGX user
> > might see a random #GP, that's not so bad. But if it's kernel crash or
> > use-after-free, that _is_ bad.
> 
> Well, I would still say that a random legacy SGX user getting a #GP is not
> good behaviour either, but yes, this is the worst case.
> 
> >
> > Is 4 bytes (maybe even 0 with the alignment of other things in practice)
> > worth it to take the problem from "not realistic" to "utterly impossible"?
> >
> > sizeof(struct file) == 240, so I guess it would take almost 1TB of
> > 'struct file' to overflow the counter, plus the overhead of 2^32 'struct
> > sgx_encl's. I'd call that "stupid" but not impossible on a big modern
> > machine.
> >
> > But with a atomic64_t, you literally can't overflow it with 'struct
> > file' in a 64-bit address space because the address space fills up
> > before the counter overflows. It's in "utterly impossible" territory.
> 
> Yes, so I will change it to atomic64_t.
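> 
> Concretely, something like the sketch below is what I have in mind (the
> helper names are placeholders, not necessarily what the next revision will
> use):
> 
> 	static atomic64_t sgx_usage_count;
> 
> 	/* called from every open() of an SGX fd */
> 	static int sgx_inc_usage_count(void)
> 	{
> 		/*
> 		 * Only the very first user can assume the EPC is empty, so
> 		 * only the 0 -> 1 transition attempts EUPDATESVN. A wrapped
> 		 * counter is exactly what would make us take this path with
> 		 * a populated EPC.
> 		 */
> 		if (atomic64_inc_return(&sgx_usage_count) == 1)
> 			return sgx_update_svn();
> 
> 		return 0;
> 	}
> 
> 	/* called from every release() */
> 	static void sgx_dec_usage_count(void)
> 	{
> 		atomic64_dec(&sgx_usage_count);
> 	}
> 
> With atomic64_t, wrapping would require ~2^64 files open at the same time,
> which, as you said, cannot even fit into the address space.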
> 
> >
> > I'll admit, I was rushing through the review and didn't spell all of
> > that out on my first pass around. Yeah, it's a little bit overkill to
> > explain this for the type of one variable. But, I'm hoping that with all
> > of that logic laid out, you will consider answering my question this time.
> >
> > I'd also very much appreciate it if you could directly answer my questions
> > in future reviews, even if they appear silly. It'll make this all go faster.
> 
> Yes, of course, I am sorry for not answering this point properly before.
> Hopefully you have an answer now.
> 
> >
> > ...
> > >>  if (ret == SGX_NO_UPDATE)
> > >>          return 0;
> > >>
> > >>  if (ret) {
> > >>          ENCLS_WARN(ret, "EUPDATESVN");
> > >>          return ret;
> > >>  }
> > >>
> > >>  pr_info("SVN updated successfully\n");
> > >>  return 0;
> > >>
> > >> Oh, and do we expect SGX_INSUFFICIENT_ENTROPY all the time? I thought
> > >> it was supposed to be horribly rare. Shouldn't we warn on it?
> > >
> > > The entropy comes from RDSEED in this case, not RDRAND.
> > > This is not a horribly rare but very realistic failure.
> >
> > Fair enough. If you want to keep it this way, could you run an
> > experiment and see how realistic it is? Surely, if it's very realistic,
> > you should be able to show it on real hardware quite easily.
> >
> > We had a loooooooong discussion about this not so long ago. I believe
> > older CPUs were much more likely to be able to observe RDSEED failures.
> > But, even on those CPUs, repeated failures were pretty darn rare. New
> > CPUs are less likely to observe a single RDSEED failure. They are very
> > unlikely to see repeated failures.
> >
> > So I really want to know if this is needed in _practice_. Sure,
> > theoretically, the architecture allows CPUs to fail RDSEED all the time.
> > But do the CPUs with EUPDATESVN also have RDSEED implementations that
> > fail easily?
> 
> This is the most recent discussion on this matter that I am aware of:
> https://lkml.org/lkml/2024/2/5/1595
> The measurements were done for an older platform (Skylake), but I am not
> aware of any architectural changes since that time to improve this.

And to make it more concrete, I made some simple tests based on the
program for stress testing RDRAND/RDSEED that was discussed earlier in
that thread: https://lkml.org/lkml/2024/2/6/746
Using this stress testing and enough threads, I can make EUPDATESVN fail
reliably and quite easily even with a 10-iteration retry loop in the
kernel. So, given this, what should the correct behaviour be here? WARN_ON
is not justified imo. But should we return an error to userspace and
block open()?
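
To give an idea of the shape of the test (this is my own rough
reconstruction, not the exact program from that thread): spawn enough
threads that every logical CPU keeps draining the entropy source, and
count how often RDSEED still fails after 10 back-to-back retries, which
mirrors the retry loop on the kernel side.

	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define RETRIES	10
	#define ITERS	1000000UL

	static unsigned long long hard_failures;
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

	/* returns 1 on success, 0 if RDSEED reported no entropy (CF=0) */
	static int rdseed64(uint64_t *out)
	{
		unsigned char ok;

		asm volatile("rdseed %0; setc %1" : "=r" (*out), "=qm" (ok));
		return ok;
	}

	static void *worker(void *arg)
	{
		unsigned long long fails = 0;
		uint64_t val;
		unsigned long i;
		int retry;

		for (i = 0; i < ITERS; i++) {
			for (retry = 0; retry < RETRIES; retry++) {
				if (rdseed64(&val))
					break;
			}
			if (retry == RETRIES)
				fails++;	/* still failing after 10 retries */
		}

		pthread_mutex_lock(&lock);
		hard_failures += fails;
		pthread_mutex_unlock(&lock);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, nthreads = argc > 1 ? atoi(argv[1]) : 64;
		pthread_t threads[nthreads];

		for (i = 0; i < nthreads; i++)
			pthread_create(&threads[i], NULL, worker, NULL);
		for (i = 0; i < nthreads; i++)
			pthread_join(threads[i], NULL);

		printf("RDSEED failed after %d retries: %llu times\n",
		       RETRIES, hard_failures);
		return 0;
	}

With enough threads the "failed after retries" counter goes non-zero
quickly here, and that is exactly the case that turns into
SGX_INSUFFICIENT_ENTROPY for the EUPDATESVN attempt.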

Best Regards,
Elena.
