[vox-tech] shell script challenge

GNU Linux vox-tech@lists.lugod.org
Wed, 7 Aug 2002 19:19:17 -0700


On Wed, Aug 07, 2002 at 12:20:14PM -0700, Chris McKenzie wrote:
> I'm using cygwin and I was given the request by my boss to remove all
> duplicate files from the server
> the server is on the x: drive of the windows machine which means that
> cygwin saw it as /cygwin/x
> I forget exactly what command I ran to get checksums.txt
> but it is in the format
> 
> <checksum> *x:<filename>
> 
> The challenge is to find the duplicate checksums and print the file name
> of those checksums.  This is tricky because the directories contain spaces

MD5 is a better way to go than a plain checksum. Checksums are usually
16-bit or 32-bit; an MD5 sum is 128 bits. It will take longer to compute,
but two different files are far less likely to end up with the same sum.

More importantly, the program "md5sum" prints its results in much the
same format as the one you describe.

Example:

gnulinux@deb:~$ md5sum funny.txt
ad739a7ca6402db9e8f73a544602137d  funny.txt
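
If the checksums.txt you already have really is in the
"<checksum> *x:<filename>" shape, something like the following might
pull the duplicates straight out of it. This is only a sketch: the name
dupes_from_list is made up, and it assumes exactly two separator
characters between the checksum and the name, as in md5sum output.

```shell
# dupes_from_list FILE: print every file name whose checksum occurs
# more than once in FILE. Spaces inside names are preserved.
dupes_from_list() {
    awk '
        {
            sum = $1
            # Skip the checksum plus the two separator characters.
            name = substr($0, length(sum) + 3)
            count[sum]++
            names[sum] = names[sum] name "\n"
        }
        END {
            for (sum in count)
                if (count[sum] > 1)
                    printf "%s", names[sum]
        }
    ' "$1"
}
```

Usage would be "dupes_from_list checksums.txt". The order of the groups
is whatever awk happens to produce, so pipe it through "sort" if that
matters.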

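I am certainly no programmer, so I can't give you a shell program, but
here's a one-liner that will check the directory "/cygdrive/c/foo" and
output duplicate files:

find /cygdrive/c/foo -type f -exec md5sum '{}' ';' | sort | \
    uniq -D -w 32 | cut -c 35-

(An md5sum line is 32 hex digits, then two separator characters, then
the file name, hence the 32 and the 35.)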
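
A variation on the one-liner above, as a rough sketch: it wraps the
pipeline in a function (list_dupes is a made-up name), hands many files
to each md5sum run with "{} +" instead of starting one process per file,
and puts a blank line between groups of identical files. It assumes GNU
uniq, since "-w" and "--all-repeated" are GNU extensions.

```shell
# list_dupes DIR: print groups of files under DIR that share an MD5 sum,
# one name per line, with a blank line between groups.
list_dupes() {
    find "$1" -type f -exec md5sum {} + |
        sort |
        uniq -w 32 --all-repeated=separate |
        cut -c 35-
}
```

For example, "list_dupes /cygdrive/c/foo > dupes.txt" would leave a list
you can read over before deleting anything.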

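NOTE: Spaces and quotes in names should be fine, since find hands each
name straight to md5sum, but this probably will _not_ work for file
names containing newlines or backslashes, because md5sum escapes those
in its output and the rest of the pipeline works line by line.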

There are MANY ways to find duplicate files and this is just one method
that will work under linux and under Cygwin windoze. I tried it with
both and like most things it works better with Linux.

I suggest checking out "man md5sum", and in particular the "-c" option
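
As a sketch of how "-c" might be used here (the function names
make_manifest and verify_manifest are made up for illustration): save
the sums to a file once, then have md5sum re-check every listed file
against it later.

```shell
# make_manifest DIR FILE: record an MD5 sum for every regular file under DIR.
# Keep FILE outside DIR so the manifest does not end up listing itself.
make_manifest() {
    find "$1" -type f -exec md5sum {} + > "$2"
}

# verify_manifest FILE: re-hash each file listed in FILE and compare.
# Returns non-zero if any file is missing or has changed.
verify_manifest() {
    md5sum -c "$1"
}
```

So "make_manifest /cygdrive/c/foo sums.txt" followed some time later by
"verify_manifest sums.txt" would print one OK or FAILED line per file.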