
cmpfiles

Compares the set of files in DIR1 and DIR2 (by hashing) while ignoring the directory structure.

I have a lot of old files and odd old copies that contain similar files in slightly different file tree structures. Before removing the duplicates, I wanted to be sure that the files really were duplicates and were safe to delete.

$ tree
.
├── dir1
│   ├── a.txt
│   ├── b.txt
│   └── c.txt
└── dir2
    ├── dir3
    │   └── c.txt
    └── dir4
        ├── a.txt
        └── b.txt

5 directories, 6 files

The example above shows two directories with different file tree structures. Manually checking whether it is safe to delete dir2 (i.e. whether every file in dir1 is also in dir2 and vice versa) can be time-consuming. cmpfiles recursively traverses both file trees, hashes every file, and then sorts and compares the lists of hashes to determine whether the two trees contain the same files.

$ cmpfiles dir1/ dir2/
'dir1/c.txt' does not exist in 'dir2'

'dir2/dir3/c.txt' does not exist in 'dir1'

'dir1/a.txt' <=> 'dir2/dir4/a.txt'
'dir1/b.txt' <=> 'dir2/dir4/b.txt'

In this example, cmpfiles tells us that dir1/c.txt and dir2/dir3/c.txt have no match in the other directory because their hashes differ, so we should not remove either file before investigating the discrepancy. One file could, for instance, be an older version of the other.
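The traverse, hash, and compare steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the actual cmpfiles source: the names hash_file, hash_tree, and compare_trees are hypothetical, the wording of the duplicate warning is made up, symbolic links are simply skipped, and the comparison uses set operations on the hashes instead of sorting a list of them.

```python
import hashlib
import os
import sys


def hash_file(path, algorithm="sha256"):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root, algorithm="sha256"):
    """Map each content hash under root to the first path found with it."""
    hashes = {}
    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Assumption: symlinks and special files are skipped entirely.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            file_hash = hash_file(path, algorithm)
            if file_hash in hashes:
                # Assumption: the warning text here is invented.
                print(f"warning: '{path}' duplicates '{hashes[file_hash]}'",
                      file=sys.stderr)
                continue
            hashes[file_hash] = path
    return hashes


def compare_trees(dir1, dir2, algorithm="sha256"):
    """Return (only_in_dir1, only_in_dir2, in_both) as sorted lists."""
    first = hash_tree(dir1, algorithm)
    second = hash_tree(dir2, algorithm)
    only1 = sorted(first[h] for h in first.keys() - second.keys())
    only2 = sorted(second[h] for h in second.keys() - first.keys())
    both = sorted((first[h], second[h]) for h in first.keys() & second.keys())
    return only1, only2, both
```

Because files are keyed by content hash and not by path, a file counts as present in both directories no matter where it sits in either tree, which is the property that makes the comparison independent of the directory structure.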

Help

Usage: cmpfiles [OPTION]... DIR1 DIR2
Compares the files in DIR1 to the files in DIR2

Compares the files (but not the file structure) by hashing each file in DIR1
and in DIR2, and then comparing them. It does not dereference symbolic links.

With no options, it produces output in three sections. The first section contains
the files that are in DIR1 and not in DIR2. The second section contains the
files that are in DIR2 and not in DIR1. The third section contains the files
that are in both DIR1 and DIR2.

If the program encounters duplicate files inside DIR1 (or inside DIR2) it
ignores these and outputs warnings to stderr.

  -1     Suppress first section  (files in DIR1 but not in DIR2)
  -2     Suppress second section (files in DIR2 but not in DIR1)
  -3     Suppress third section  (files in both DIR1 and DIR2)

  --md5          Use MD5 as hash function instead of SHA256
  -h, --help     Display this help and exit

Examples:
  cmpfiles -12 dir1 dir2  Print only files in both dir1 and dir2.
  cmpfiles -3  dir1 dir2  Print files in dir1 and not in dir2, and vice versa.

Somewhat inspired by comm(1).
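
The three-section output and the -1, -2, and -3 flags from the help text could be rendered roughly as follows. This is a hedged sketch, not the tool's real code: render_sections and its suppress parameter are hypothetical names, and the line formats are copied from the example output earlier on this page.

```python
def render_sections(only1, only2, both, dir1, dir2, suppress=frozenset()):
    """Return comm(1)-style output; suppress holds the section numbers to hide."""
    sections = []
    if 1 not in suppress:
        # Section 1: files in DIR1 but not in DIR2.
        sections.append([f"'{p}' does not exist in '{dir2}'" for p in only1])
    if 2 not in suppress:
        # Section 2: files in DIR2 but not in DIR1.
        sections.append([f"'{p}' does not exist in '{dir1}'" for p in only2])
    if 3 not in suppress:
        # Section 3: files found in both directories.
        sections.append([f"'{a}' <=> '{b}'" for a, b in both])
    # Separate the non-empty sections with a blank line.
    return "\n\n".join("\n".join(lines) for lines in sections if lines)


if __name__ == "__main__":
    # Equivalent in spirit to "cmpfiles -3 dir1 dir2" on the example trees.
    print(render_sections(
        ["dir1/c.txt"],
        ["dir2/dir3/c.txt"],
        [("dir1/a.txt", "dir2/dir4/a.txt"), ("dir1/b.txt", "dir2/dir4/b.txt")],
        "dir1",
        "dir2",
        suppress={3},
    ))
```

Passing suppress={1, 2} would correspond to -12 and leave only the matched pairs, mirroring how comm(1) hides columns.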

The code for the project can be downloaded with:

$ git clone https://mvidell.se/cmpfiles.git