[vox-tech] shell script challenge - Now MD5sum errata

Micah Cowan vox-tech@lists.lugod.org
Thu, 8 Aug 2002 14:08:02 -0700


Charles Polisher writes:
 > Micah Cowan wrote:
 > > GNU Linux writes:
 > >  > Found a very interesting page on md5sum. It's:
 > >  >
 > >  > http://hills.ccsf.org/~jharri01/project.html
 > >  >
 > >  > "So why does MD5 seem so secure? Because 128 bits allows you to
 > >  > have 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456
 > >  > different possible MD5 codes"
 > >  >
 > >  > Lots of good reading for insomniacs.
 > >
 > > It still shouldn't be relied upon, however, that two identical MD5
 > > checksums are sufficient evidence that the corresponding files are
 > > identical; I've heard more than one person claim to have encountered
 > > identical MD5 sums for different files, and it's certainly not
 > > impossible, just improbable.
 >
 >   I'm dubious ;^)
 >
 >   <H. Lecter voice>
 >     The voices tell you they've seen MD5s collide; do they
 >     tell you other things, Micah?
 >   </H. Lecter voice>

Bruce Schneier, in "Applied Cryptography", points out that collisions
in MD5's compression function have been deliberately produced by den
Boer and Bosselaers, though the attack is said to have no practical
impact on MD5's use in security applications.

 > > But it's a helluva lot better than running diff from one file to
 > > every other file - a factorial-time operation! :)
 >
 >   And now, a quibble:
 >   Actually, a comparison of the entire file would be
 >   no different from a comparison of the md5sum. From a
 >   Big O standpoint, it's just a constant factor. If
 >   comparing md5s isn't factorial, neither should a full
 >   diff be. If there were a ton of files and the lengths were
 >   spread out, the comparisons could be further reduced
 >   by sorting the list by file size, then comparing only
 >   among groups of the same size.


You're right - I'd been assuming the MD5 sums would be sorted, and
the files being diffed would not be.
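
For reference, that size-grouping pass is easy to sketch in shell.
This is only an illustration - it assumes GNU find (for -printf) and
filenames without embedded newlines - and it prints just the files
whose size occurs more than once, i.e. the only candidates worth
diffing at all:

    # list size<TAB>path for every file, then keep only the lines
    # whose size field appears more than once
    find . -type f -printf '%s\t%p\n' |
      sort |
      awk -F'\t' '
        { count[$1]++; line[NR] = $0; key[NR] = $1 }
        END { for (i = 1; i <= NR; i++)
                if (count[key[i]] > 1) print line[i] }'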

In that case, both operations have equivalent complexity in both their
best and worst cases:

In the best case, where the sort key (file length or MD5) is unique
for identical files, the complexity is equal to the complexity of the
sort algorithm, hopefully O(N log N).

In the worst case, where the same sort key is frequently found for
different files, the complexity will be quadratic due to the many
comparisons.
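
To make that worst case concrete: if every file in a candidate group
shares the same key, each file has to be compared against every other,
N(N-1)/2 pairs in all. A bash illustration (the *.dat group is
hypothetical):

    files=(*.dat)    # a hypothetical group sharing one sort key
    for ((i = 0; i < ${#files[@]}; i++)); do
      for ((j = i + 1; j < ${#files[@]}; j++)); do
        # cmp -s is silent and exits 0 only for byte-identical files
        cmp -s "${files[i]}" "${files[j]}" &&
          echo "${files[i]} == ${files[j]}"
      done
    done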

However, collisions of MD5s are much less likely than collisions in
file lengths, so the diff comparison is much more likely to approach
its worst case, and the MD5 comparison its best. So, practically
speaking, they are likely to be close to O(N^2) and O(N log N),
respectively: the "N" in the quadratic term for the MD5 comparison is
the size of a group of colliding sums, which should usually be 1,
rendering that term effectively linear overall.
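
Which is why the usual shell idiom hashes everything once and lets
sort and uniq do the grouping. A sketch, assuming GNU coreutils
(uniq's -w and --all-repeated are GNU extensions); -w32 compares only
the 32-character digest at the start of each md5sum line:

    # hash each file once, sort by digest, print groups of repeats
    find . -type f -exec md5sum {} + |
      sort |
      uniq -w32 --all-repeated=separate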

Thanks, all, for pointing out my major brain-farts. I've only just had
my coffee :)

-Micah