[vox-tech] shell script challenge - Now MD5sum erratia

Thu, 8 Aug 2002 12:57:11 -0700

begin Charles Polisher <cpolish@attbi.com> 
> Micah Cowan wrote:
> > GNU Linux writes:
> >  > Found a very interesting page on md5sum. It's:
> >  > 
> >  > http://hills.ccsf.org/~jharri01/project.html
> >  > 
> >  > "So why does MD5 seem so secure? Because 128 bits allows you to have
> >  > 2128=340,282,366,920,938,463,463,374,607,431,768,211,456 different
> >  > possible MD5 codes"
> >  > 
> >  > Lots of good reading for insomniacs.
> > 
> > It still shouldn't be relied upon, however, that two identical MD5
> > checksums are sufficient evidence that the corresponding files are
> > identical; I've heard more than one person claim to have encountered
> > identical MD5 sums for different files, and its certainly not
> > impossible, just improbable.
> 
>   I'm dubious ;^)
> 
>   <H. Lector voice>
>     The voices tell you they've seen MD5's collide; do they
>     tell you other things, Micah? 
>   </H. Lector voice>

i'd have to agree with chuck.

   2^128 = 340,282,366,920,938,463,374,607,431,768,211,456

is an awfully big number.  according to ORA's PGP book, this number is 
is billions of times larger than the total number of computer files that
the human race has ever created, and will ever create, for the next
thousand years.

another way to put this:  you are 10^25 times more likely to win
the california lottery jackpot than you are to find 2 files with the
same md5sum.

i think this goes over the line of being "improbable" and enters the
realm of "effectively impossible".

if someone WERE to find 2 files with the same md5sum, i think they owe
it to the world to announce their find.  it's kind of like finding
amelia earhardt's plane, or noah's ark.  you just don't find 2 files
with equal md5sums and say "oh, that's interesting".  you report it
scientific american, new york times and slashdot.   :-)

> > But it's a heluva lot better than running diff from one file to
> > every other file - a factorial-time operation! :)
> 
>   
>   And now, a quibble: Actually, a comparison of the entire file would
>   be no different than a comparison of the md5sum. From a Big O
>   standpoint, it's just a constant factor. If comparing md5's isn't
>   factorial, neither should a full diff. If there were a ton of files
>   and the lengths were spread out, the comparisons could be further
>   reduced by sorting the list by file size, then comparing only among
>   groups of the same size.

that may suffer from having to be done using a HUGE temp file, rather
than small atomic operations done in memory.

you'd have to generate the list (ugh), sort the list (double ugh) and
THEN do the comparisons on the candidates.  i think this would be
significantly more work for the computer, but would take a few seconds
to write for the programmer.

pete

-- 
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E  70A9 A3B9 1945 67EA 951D