Bill Bumgarner’s useful Dupinator script, for removing duplicate files, recently hit Python-URL. However, it has a logic bug that ends up deleting too many files.

If you have several sets of duplicates that happen to share the same file size, all but one of the sets will be wiped out completely. The problem is that within each group of files of identical size, the script builds only a single "duplicates" list, no matter how many distinct file contents the group contains. The first file on that list is spared; the rest are deleted.
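For clarity, here is a minimal sketch of the grouping the fix relies on: within a size group, files are keyed by a full-content hash, and each hash bucket becomes its own duplicate set. The function name, the choice of SHA-1, and the chunk size are mine for illustration, not necessarily what the posted script uses.

```python
import hashlib

def group_by_hash(same_size_files):
    """Split one same-size group into true duplicate sets.

    Illustrative helper, not the script's own code: files count as
    duplicates only when their full-content hashes match, so we keep
    one list per hash rather than one list for the whole size group.
    """
    by_hash = {}
    for path in same_size_files:
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        by_hash.setdefault(h.hexdigest(), []).append(path)

    # Only buckets with more than one member are duplicate sets;
    # the first file in each set is spared, the rest are redundant.
    return [paths for paths in by_hash.values() if len(paths) > 1]
```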

The net effect, when I tested the script on a large corpus of text files, was that the program reported it would delete many files that were clearly not identical. (I had commented out the os.remove call for testing.)

There was an additional problem with iPhoto: the posted script follows symbolic links. iPhoto stores its albums as collections of symbolic links, so all photos in albums are flagged as duplicates of the original photos. An islink() test fixes this.
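The check itself is a one-liner during the directory walk. The traversal below is my own sketch, not the script's exact loop:

```python
import os

def collect_files(root):
    """Gather regular files under root, skipping symbolic links so
    iPhoto's album links are never mistaken for duplicates of the
    original photos. Sketch only; the posted script differs in detail."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # don't treat symbolic links as real files
            paths.append(full)
    return paths
```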

Here’s a modified version of the script. It has only been lightly tested, though the changes did successfully eliminate the false positives. Uncomment the os.remove() line only when you are satisfied with the list of redundant files generated.

Minor optimizations: all files <= 1024 bytes go directly into the dupes list, not potentialDupes, since the whole file has already been checked. Also, Mac OS X’s pesky .DS_Store files are skipped.
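Roughly, those two tweaks sit in the first-chunk pass like this. The dupes/potentialDupes split echoes the script's variable names, but the function, the SHA-1 choice, and the rest of the structure are illustrative assumptions:

```python
import hashlib
import os

FIRST_CHUNK = 1024

def first_pass(same_size_files, size):
    """Hash the first 1024 bytes of each file in a same-size group."""
    dupes = {}            # confirmed: the whole file has been hashed
    potentialDupes = {}   # still need a full-content hash in a later pass
    for path in same_size_files:
        if os.path.basename(path) == '.DS_Store':
            continue  # ignore Mac OS X's Finder metadata files
        with open(path, 'rb') as f:
            digest = hashlib.sha1(f.read(FIRST_CHUNK)).hexdigest()
        # A file of 1024 bytes or less is fully covered by this read,
        # so it can go straight into the dupes list.
        target = dupes if size <= FIRST_CHUNK else potentialDupes
        target.setdefault(digest, []).append(path)
    return dupes, potentialDupes
```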

(I haven’t heard back from Bill yet on incorporating the fixes into his code, so I’m posting here.)

View Source Code (dupinator.py)
