[vox-tech] shell script challenge

Chris McKenzie vox-tech@lists.lugod.org
Wed, 7 Aug 2002 12:20:14 -0700 (PDT)


I'm using Cygwin, and my boss asked me to remove all duplicate files
from the server.  The server is mapped to the x: drive of the Windows
machine, which means Cygwin sees it as /cygwin/x.  I forget exactly
which command I ran to get checksums.txt, but each line is in the
format

<checksum> *x:<filename>
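
For example, a line looks something like this (the checksum and the
path here are made up, but the real paths have spaces in them just
like this one does):

d41d8cd98f00b204e9800998ecf8427e *x:\Shared Docs\budget (final).xls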

The challenge is to find the duplicate checksums and print the file
names associated with them.  This is tricky because the directory
names contain spaces, which gawk, sed, etc. treat as field separators.
Even if I change the field separator to * and use gawk to reorder the
lines as *x:<fname> <checksum>, sort wouldn't know how to deal with
that, which would make uniq useless (I think).  If I leave the lines
the other way around, <checksum> *x:<filename>, sort works fine but
uniq fails because the filename is different on every line.  If I drop
the filename with gawk ' { print $1 } ', then sort and uniq work fine
but I no longer have a filename.  So every combination I can think of
fails.  Does anyone know how I can find only the duplicate checksums
and the file names associated with them?
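
The closest I've come is the two-pass gawk below, but I haven't tried
it against the real data, so I'd appreciate a sanity check.  The first
pass just counts each checksum; the second pass prints every whole
line whose checksum was seen more than once:

gawk 'NR == FNR { count[$1]++; next } count[$1] > 1' checksums.txt checksums.txt

Since it only ever looks at $1, the spaces later in each line
shouldn't matter, and sort and uniq never enter the picture.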

I realize I could loop over each duplicate checksum and grep
checksums.txt for it, but the problem is that there are 4,575
duplicate checksums, counted with:
cat checksums.txt | awk ' { print $1 } ' | sort | uniq -d | wc -l
and 46,340 files on the server, so all that grepping seems like it
would take an awfully long time.  Any suggestions?
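
Another idea I've been kicking around is to dump the duplicate sums
into a file and hand them all to grep in one shot, so checksums.txt
only gets scanned once instead of 4,575 times.  This is also untested,
and dup-sums.txt is just a name I made up:

cat checksums.txt | awk ' { print $1 } ' | sort | uniq -d > dup-sums.txt
grep -F -f dup-sums.txt checksums.txt

-F makes grep treat every line of dup-sums.txt as a fixed string, and
-f reads the list of strings from that file.  (In theory a checksum
could also match inside a filename, but with full-length hex sums that
seems unlikely enough to ignore.)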

Sincerely,
	Christopher J. McKenzie

	cjm@ucdavis.edu
	mckenzie@cs.ucdavis.edu
	H: (818) 991-7724
	C: (818) 429-3772
	1815 Mesa Ridge Ave
	Westlake Village, CA 91362