> > >>> +static bool sgx_has_eupdatesvn;
> > >>
> > >> We have CPUID "caches" of sorts. Why open code this?
> > >
> > > You mean X86_FEATURE_*?
> >
> > Yes.
> >
> > > SGX CPUID is only defined in SGX code currently (btw, I am not sure
> > > why they are made special) so it doesn’t support this.
> >
> > It's only used in SGX code and each subleaf is only used once, at init
> > time. There's really no reason to add any caching of any parts of it for
> > a single init-time use.
> >
> > > Is this a past decision that you want to change?
> >
> > *This* patch changes things. It moves the one user from a single
> > init-time use to a runtime user. It adds a global variable. It's not
> > about *me* wanting change. It's about this patch changing the status quo.
> >
> > I don't really care about the #define. That's not the question. The
> > question is whether we want a scattered X86_FEATURE bit for
> > 'sgx_has_eupdatesvn' or not.
>
> I don't have a strong opinion about this.
> What is your suggestion here?
>
> > >>> +static atomic_t sgx_usage_count;
> > >>
> > >> Is 32 bits enough for sgx_usage_count? What goes boom when it
> > >> overflows?
> > >
> > > I can increase the value easily, but even with the current number we
> > > are talking about 2^32 enclaves; I don't think this is realistic to
> > > happen in practice.
> >
> > I don't mean to be pedantic, but is it 2^32? I thought we had signed
> > ints underneath atomic_t and we allow negative atomic_t values. Are you
> > suggesting that since we're just using a counter that the overflows into
> > the negative space are OK and it doesn't matter until it wraps all the
> > way back around to 0?
>
> Yes, you would think that wrapping to negative is ok in this case as long
> as we don't end up hitting 0. But when it does happen, we will think that
> the EPC is empty and attempt to execute EUPDATESVN. In this case,
> EUPDATESVN fails with SGX_EPC_NOT_READY or SGX_LOCKFAIL errors depending
> on the state of the system, see the microcode flow:
>
> IF (Other instruction is accessing EPC) THEN
>     RFLAGS.ZF := 1;
>     RAX := SGX_LOCKFAIL;
>     GOTO ERROR_EXIT;
> FI
> (* Verify EPC is ready *)
> IF (the EPC contains any valid pages) THEN
>     RFLAGS.ZF := 1;
>     RAX := SGX_EPC_NOT_READY;
>     GOTO ERROR_EXIT;
> FI
>
> This is ok on its own, but in case any *other* ENCLS instruction is
> executing in parallel during this, it might #GP, which is going to be
> unexpected behaviour for legacy SW and something we would like to
> prevent. So, I would state that this counter must be designed against
> possible overflows because of this.
>
> > Assuming that you actually mean 2^32... Is it about 2^32 enclaves? Or
> > 2^32 *opens*? Those are very, very different things.
>
> We will increment the counter on every open, so 2^32 opens at the same
> time, because each release drops the counter.
>
> > Also, this is kinda the second question that I asked that hasn't really
> > been answered directly. I really asked it for a good reason, and I'd
> > really prefer that it be answered, even if you disagree with the premise.
> >
> > I really want to know what goes boom when it overflows, so I'll go into
> > detail why it matters.
> >
> > If it's just that an extra EUPDATESVN gets run and some random SGX user
> > might see a random #GP, that's not so bad. But if it's kernel crash or
> > use-after-free, that _is_ bad.
>
> Well, I would still say that a random legacy SGX user getting a #GP is
> not good behaviour either, but yes, this is the worst case.
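
Just to spell out where the spurious EUPDATESVN attempt would come from,
the counting in question is roughly the following (a sketch with
illustrative names, not the actual patch code; sgx_update_svn() just
stands in for the EUPDATESVN attempt):

static atomic_t sgx_usage_count;

int sgx_inc_usage_count(void)
{
        /*
         * 0 -> 1 transition: the EPC is expected to be empty, so this
         * is where EUPDATESVN is attempted. A counter that wraps all
         * the way back to 0 while users still exist would retrigger
         * this path with a non-empty EPC and hit the errors above.
         */
        if (atomic_inc_return(&sgx_usage_count) == 1)
                return sgx_update_svn();

        return 0;
}

void sgx_dec_usage_count(void)
{
        atomic_dec(&sgx_usage_count);
}

Every open() bumps the counter and every release drops it, which is why
the question is about concurrent opens rather than enclaves.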

> > Is 4 bytes (maybe even 0 with the alignment of other things in practice)
> > worth it to take the problem from "not realistic" to "utterly impossible"?
> >
> > sizeof(struct file) == 240, so I guess it would take almost 1TB of
> > 'struct file' to overflow the counter, plus the overhead of 2^32 'struct
> > sgx_encl's. I'd call that "stupid" but not impossible on a big modern
> > machine.
> >
> > But with an atomic64_t, you literally can't overflow it with 'struct
> > file' in a 64-bit address space because the address space fills up
> > before the counter overflows. It's in "utterly impossible" territory.
>
> Yes, so I will change it to atomic64_t.
>
> > I'll admit, I was rushing through the review and didn't spell all of
> > that out my first pass around. Yeah, it's a little bit overkill to
> > explain this for the type of one variable. But, I'm hoping that with all
> > of that logic laid out, you will consider answering my question this time.
> >
> > I'd also very much appreciate if you could directly answer my questions
> > in future reviews, even if they appear silly. It'll make this all go
> > faster.
>
> Yes, of course, I am sorry for not answering this point properly before.
> Hopefully you have an answer now.
>
> > ...
> > >> 	if (ret == SGX_NO_UPDATE)
> > >> 		return 0;
> > >>
> > >> 	if (ret) {
> > >> 		ENCLS_WARN(ret, "EUPDATESVN");
> > >> 		return ret;
> > >> 	}
> > >>
> > >> 	pr_info("SVN updated successfully\n");
> > >> 	return 0;
> > >>
> > >> Oh, and do we expect SGX_INSUFFICIENT_ENTROPY all the time? I thought
> > >> it was supposed to be horribly rare. Shouldn't we warn on it?
> > >
> > > The entropy comes from RDSEED in this case, not RDRAND.
> > > This is not a horribly rare but a very realistic failure.
> >
> > Fair enough. If you want to keep it this way, could you run an
> > experiment and see how realistic it is? Surely, if it's very realistic,
> > you should be able to show it on real hardware quite easily.
> >
> > We had a loooooooong discussion about this not so long ago. I believe
> > older CPUs were much more likely to be able to observe RDSEED failures.
> > But, even on those CPUs, repeated failures were pretty darn rare. New
> > CPUs are less likely to observe a single RDSEED failure. They are very
> > unlikely to see repeated failures.
> >
> > So I really want to know if this is needed in _practice_. Sure,
> > theoretically, the architecture allows CPUs to fail RDSEED all the time.
> > But do the CPUs with EUPDATESVN also have RDSEED implementations that
> > fail easily?
>
> This was the recent discussion we had on this matter, as far as I am aware:
> https://lkml.org/lkml/2024/2/5/1595
> The measurements were done for an older platform (Skylake), but I am not
> aware of any architectural changes since that time to improve this.
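
To make the entropy side easier to discuss: what I have been testing
against is roughly the following bounded retry around the ENCLS leaf (a
sketch only; __eupdatesvn() and the retry count are placeholders, not the
actual patch code):

#define SGX_EUPDATESVN_RETRIES	10

static int sgx_update_svn(void)
{
        int ret, i;

        for (i = 0; i < SGX_EUPDATESVN_RETRIES; i++) {
                ret = __eupdatesvn();

                /* Only entropy exhaustion (RDSEED failure) is retried. */
                if (ret != SGX_INSUFFICIENT_ENTROPY)
                        break;
        }

        if (ret == SGX_NO_UPDATE)
                return 0;

        /*
         * What to do when all retries fail with SGX_INSUFFICIENT_ENTROPY
         * is exactly the open question below.
         */
        if (ret) {
                ENCLS_WARN(ret, "EUPDATESVN");
                return ret;
        }

        pr_info("SVN updated successfully\n");
        return 0;
}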

And to make it more concrete, I made some simple tests based on the
program for stress testing rdrand/rdseed that was discussed in that
thread earlier:
https://lkml.org/lkml/2024/2/6/746

Using this stress test and enough threads, I can make EUPDATESVN fail
reliably and quite easily, even with a 10-iteration retry loop in the
kernel.

So, given this, what should the correct behaviour be here? WARN_ON is not
justified imo. But should we return an error to userspace and block
open()?

Best Regards,
Elena.
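
P.S. For reference, the core of what the stress program hammers is just
the bare RDSEED check, roughly like the snippet below (a minimal
illustration only, built with -mrdseed; the actual program from the
thread above runs this in many threads in parallel):

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
        unsigned long long seed, fails = 0, total = 1000000;

        for (unsigned long long i = 0; i < total; i++) {
                /* _rdseed64_step() returns 0 when no entropy is available */
                if (!_rdseed64_step(&seed))
                        fails++;
        }

        printf("rdseed failures: %llu / %llu\n", fails, total);
        return 0;
}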