[vox-tech] shell script challenge

Shawn P. Neugebauer vox-tech@lists.lugod.org
Thu, 8 Aug 2002 23:12:11 -0700


(pretending this hasn't been beaten to death...)

not sure why, but i looked at the man page for sort and uniq.
there are options for each that solve your problem.  here's an example:

  #!/bin/tcsh
  # -print0 and xargs -0 keep filenames with spaces from being split
  find /some/path -type f -print0 >! /tmp/f
  cat /tmp/f | xargs -0 md5sum >! /tmp/g
  # sort on the first field (the md5 hash), then show every repeat
  cat /tmp/g | sort -k1,1 | uniq -w 32 -D

comments:
- i'm writing temp files because i presume you're running this over
  lots of files, and a wildcard on the command line would quickly blow
  past the shell's argument-length limit.  furthermore, the temp files
  give you an intermediate product to inspect.
- xargs is a cool filter to handle situations where you would otherwise
  have lots of cmdline arguments (like filenames); the -0 flag makes it
  split its input on NULs, so spaces in the names don't break anything
- the -k1,1 option to sort tells it to use the first field (in this
  case, the 32-character md5 hash) as the key; -k counts fields, not
  characters
- the -w option to uniq tells it to compare no more than the first
  32 characters; the -D option displays all the duplicates for any
  repeated line
- i recognize there are more concise/elegant ways to do this...
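
incidentally, since you already have checksums.txt in the
<checksum> *x:<filename> format, you don't even need the find/md5sum
step; the same sort/uniq options work directly on it.  a minimal sketch,
assuming the checksums are 32-character md5-style hashes at the start of
each line and that you have GNU sort/uniq (which cygwin should be giving
you):

  sort -k1,1 checksums.txt | uniq -w 32 -D > dupes.txt

dupes.txt (a name i just made up) then holds every line whose checksum
appears more than once, filename and all, with no awk field-splitting to
trip over the spaces.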

as i was trying this out on some detritus i found some surprising matches.
in a few cases they turned out to be short configuration-type files.  that
got me thinking: i don't think one would *want* to delete ALL duplicate
files on a system.  i understand the intent, but it could be dangerous
(you might not notice the damage right away).  for example, different
packages could ship an identical file, e.g., a configuration file, in
different locations; you wouldn't want to delete those!  anyway, it seems
to me that manual inspection is required, pre-deletion, as a sanity check.
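
for that manual pass, the long form of uniq's -D can put a blank line
between each group of duplicates, which makes the review file much easier
to eyeball.  a small sketch, reusing /tmp/g from the script above (the
/tmp/review name is just one i picked):

  # one blank-line-separated group per duplicated md5 hash (GNU uniq)
  sort -k1,1 /tmp/g | uniq -w 32 --all-repeated=separate >! /tmp/review

then you can page through /tmp/review and decide, group by group, what is
actually safe to remove.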

shawn.

On Wednesday 07 August 2002 12:20 pm, Chris McKenzie wrote:
> I'm using cygwin and I was given the request by my boss to remove all
> duplicate files from the server.
> The server is on the x: drive of the windows machine, which means that
> cygwin saw it as /cygwin/x.
> I forget exactly what command I ran to get checksums.txt,
> but it is in the format
>
> <checksum> *x:<filename>
>
> The challenge is to find the duplicate checksums and print the file name
> for each of those checksums.  This is tricky because the directories contain
> spaces, which gawk, sed, etc. see as field separators.  Even if I change the
> IFS to * and then use gawk to print *x:<fname> <checksum>, sort wouldn't know
> how to deal with it, which would make uniq useless (I think).  If I do it the
> other way, <checksum> *x:<filename>, sort will work fine but uniq will fail
> because the filename is there.  If I exclude the filename with a gawk
> ' { print $1 } ' then sort and uniq will work fine but I won't have a
> filename.  So all the combinations I can think of fail.  Does anyone know
> how I can find only the duplicate checksums and the file names associated
> with them?
>
> **I realize that with a lbut the problem is that there are 4,575 duplicate
> checksums using:
> cat checksums.txt | awk ' { print $1 } ' | sort -uniq -d | wc -l
> and 46340 files on the server, which seems like it would take an awful
> long time.  any suggestions?
>
> Sincerely,
> 	Christopher J. McKenzie
>
> 	cjm@ucdavis.edu
> 	mckenzie@cs.ucdavis.edu
> 	H: (818) 991-7724
> 	C: (818) 429-3772
> 	1815 Mesa Ridge Ave
> 	Westlake Village, CA 91362
>