[Responses inline]

No need to keep me "bcc'd" (though thanks for the consideration) -- I'm happy 
to ignore anything I don't want to be pulled into ;-)

Here's a rollup of what needs to be done based on the discussion below:

1) Remove extraneous exports from sha1.h
2) Remove "safe mode" support.
3) Remove sha1_compression_W if it is not needed by the performance 
improvements.
4) Evaluate logic around storing states and generating recompression states.  
Remove defines that bloat code footprint.

Thanks,
Dan


-----Original Message-----
From: linus...@gmail.com [mailto:linus...@gmail.com] On Behalf Of Linus Torvalds
Sent: Tuesday, February 28, 2017 11:34 AM
To: Junio C Hamano <gits...@pobox.com>
Cc: Jeff King <p...@peff.net>; Joey Hess <i...@joeyh.name>; Git Mailing List 
<git@vger.kernel.org>
Subject: Re: SHA1 collisions found

On Tue, Feb 28, 2017 at 11:07 AM, Junio C Hamano <gits...@pobox.com> wrote:
>
> In a way similar to 8415558f55 ("sha1dc: avoid c99 
> declaration-after-statement", 2017-02-24), we would want this on top.

There's a few other simplifications that could be done:

 (1) make the symbols static that aren't used.

     The sha1.h header ends up declaring several things that shouldn't
     have been exported.

     I suspect the code may have had some debug mode that got stripped out
     from it before making it public (or that was never there, and was just
     something the generating code could add).

[danshu] Yes, this is reasonable.  Until now, the emphasis of the code has been 
on illustrating our unavoidable bit condition performance improvement to 
counter-cryptanalysis.  I'm happy to remove the unused stuff from the public 
header.

 (2) get rid of the "safe mode" support.

     That one is meant for non-checking replacements where it generates a
     *different* hash for input with the collision fingerprint, but that's
     pointless for the git use when we abort on a collision fingerprint.

[danshu] Yes, I agree that if you aren't using this it can be taken out.  I 
believe Marc has some use cases / potential consumers of this algorithm in 
mind.  We can move it into separate header/source files for anyone who wants to 
use it.

I think the first one will show that the sha1_compression() function isn't 
actually used, and with the removal of safe-mode I think
sha1_compression_W() also is unused.

[danshu]  Some of the performance experiments that I've looked at involve 
putting sha1_compression_W(...) back in, though that doesn't look like it's 
helping.  If it is unused after the performance improvements, we'll take it 
out, or move it into its own file.

Finally, only states 58 and 65 (out of all 80 states) are actually used, and 
from what I can tell, the 'maski' value is always 0, so the looping over 80 
state masks is really just a loop over two.

[danshu]  So, while looking at performance optimizations, I specifically looked 
at how much removing the storing of intermediate states helps -- and I found 
that it doesn't seem to make a difference for performance.  My cursory 
hypothesis is that, because nothing is waiting on those writes to memory, the 
code moves on quickly.  That said, it is a bunch of code that is essentially 
doing nothing, and removing it is worthwhile.  Partially, though, what we're 
seeing here is that, as you point out below, we're working with generated code 
that we want to be general.  Specifically, right now, we're checking only 
disturbance vectors that we know can be used to efficiently attack the 
compression function.  It may be the case that further cryptanalysis uncovers 
more.  We want to have a general enough approach that we can add scanning for 
new disturbance vectors if they're found later.  Over-specializing the code 
makes that more difficult: currently the algorithm is data driven, so we don't 
need to write new code, but rather just add more data to check.  One other 
note -- the "maski" field of the dv_info_t struct is not an index to check the 
state, but rather an index into the mask generated by the ubc check code, so it 
doesn't pertain to looping over the states.  More on this below.
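
A minimal sketch of the data-driven idea described above, not the actual 
sha1dc code.  Only the "maski" field and DVMASKSIZE come from this thread; the 
dv_entry_t type, the testt and maskb fields, the table contents and the 
count_candidate_dvs() helper are hypothetical names made up for illustration.  
The point is that a new disturbance vector means a new table row, not new 
code:

```c
#include <stddef.h>
#include <stdint.h>

#define DVMASKSIZE 1

/* Hypothetical, trimmed-down table entry; only "maski" is named in the thread. */
typedef struct {
    int testt;  /* assumed: step whose stored state is used for recompression */
    int maski;  /* index of the word inside the UBC mask array */
    int maskb;  /* assumed: bit inside that word belonging to this DV */
} dv_entry_t;

/* Illustrative rows only; these are not the real disturbance vector data. */
static const dv_entry_t dv_table[] = {
    { 58, 0, 0 },
    { 65, 0, 1 },
};

/* Count the DVs that the UBC mask has not ruled out.  The loop runs over
 * table rows; "maski" only selects a mask word, it never indexes a state. */
static size_t count_candidate_dvs(const uint32_t mask[DVMASKSIZE])
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < sizeof(dv_table) / sizeof(dv_table[0]); i++) {
        if (mask[dv_table[i].maski] & ((uint32_t)1 << dv_table[i].maskb))
            n++;
    }
    return n;
}
```

In this sketch a mask word of 0 rules out every row, so nothing further would 
need to be checked for that block.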

The file has code to *generate* all the 80 sha1_recompression_step() 
functions, and I don't think the compiler is smart enough to notice that only 
two of them matter.

[danshu] That's a good observation -- We should clean up the unused 
recompression steps, especially because that will generate a ton of object 
code.  We should add some logic to only compile the functions that are used.
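
One possible shape for "only compile the functions that are used", as a rough 
sketch under assumptions rather than a proposed patch.  The USE_RECOMPRESS_* 
switches, the recompress_fn type and the function names are hypothetical, and 
the function bodies are placeholders, not real SHA-1 recompression:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build switches: one per recompression step that the DV data
 * actually references (58 and 65, per the discussion above). */
#define USE_RECOMPRESS_58 1
#define USE_RECOMPRESS_65 1

typedef void (*recompress_fn)(uint32_t ihv[5]);

#if USE_RECOMPRESS_58
static void recompress_step_58(uint32_t ihv[5])
{
    ihv[0] += 58; /* placeholder body, stands in for the generated code */
}
#endif

#if USE_RECOMPRESS_65
static void recompress_step_65(uint32_t ihv[5])
{
    ihv[0] += 65; /* placeholder body, stands in for the generated code */
}
#endif

/* Return the recompression function for a step, or NULL if that step was
 * not compiled in.  Steps without a switch produce no object code at all. */
static recompress_fn find_recompress_step(int step)
{
    switch (step) {
#if USE_RECOMPRESS_58
    case 58: return recompress_step_58;
#endif
#if USE_RECOMPRESS_65
    case 65: return recompress_step_65;
#endif
    default: return NULL;
    }
}
```

Adding a DV that needs another step would then mean enabling one more switch, 
so the approach stays general without emitting all 80 functions.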

And because 'maski' is always zero, this

   ubc_dv_mask[sha1_dvs[i].maski]

code looks like it might as well just use ubc_dv_mask[0] - in fact the 
ubc_dv_mask[] "array" really is just a single-entry array anyway:

   #define DVMASKSIZE 1

[danshu]  The idea here is that we are currently checking 32 disturbance 
vectors with our bit mask.  We're checking 32 DVs because we have 32 bits of 
mask that we can use.  The DVs are ordered by their probability of leading to 
an attack (which is directly correlated to the complexity of finding a 
collision).  Several of those DVs correspond to very low probability / high 
cost attacks, which we wouldn't expect to see in practice.  We just have the 
space to check, so why not?  However, improvements in cryptanalysis may make 
those attacks cheaper, in which case we would potentially want to check more 
DVs, and we would then expand both the number of DVs and the mask.
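
To illustrate why the mask is an array even though it has one entry today, 
here is a small sketch of how a DV number could map onto a larger mask.  
MAX_DVS, dv_is_candidate() and the index arithmetic are assumptions for 
illustration; only DVMASKSIZE and the notion of a word index ("maski") come 
from the code under discussion:

```c
#include <stdint.h>

/* Hypothetical: room for 64 DVs instead of today's 32. */
#define MAX_DVS 64
#define DVMASKSIZE ((MAX_DVS + 31) / 32)

/* Return 1 if the given DV is still a candidate according to the mask.
 * The word index plays the role of "maski"; the bit index selects the
 * DV's bit within that 32-bit word. */
static int dv_is_candidate(const uint32_t mask[DVMASKSIZE], unsigned dv_index)
{
    unsigned maski = dv_index / 32;
    unsigned maskb = dv_index % 32;

    return (int)((mask[maski] >> maskb) & 1u);
}
```

With MAX_DVS at 32 this collapses to a single word and every word index is 0, 
which matches what the current code looks like.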

so that code has a few oddities in it. It's generated code, which is probably 
why.

[danshu]  Accurate.  We're also just trying to be general enough that we can 
easily add more DVs later if need be.  I don't know how likely that is; 
certainly the DVs that we're checking now are based on solid conjectures and 
rigorous analysis of the problem.  Still, we don't want to rule out 
subsequent cryptanalytic developments.  Marc can comment more here.

Basically, some of it could be improved. In particular, the "generate code for 
80 different recompression cases, but only ever use two of them" really looks 
like it would blow up the code generation footprint a lot.

I'm adding Marc Stevens and Dan Shumow to this email (bcc'd, so that they don't 
get dragged into any unrelated email threads) in case they want to comment.

I'm wondering if they perhaps have a cleaned-up version somewhere, or maybe 
they can tell me that I'm just full of sh*t and missed something.

[danshu]  Naw man, it looks pretty good, modulo a little bit of understandable 
confusion over 'maski' -- No fake news or alternative facts here ;-)

                    Linus
