I’m in the process of setting up a proper backup solution. However, over the years I’ve copy-pasted home directories from different systems as a quick and dirty solution. Now I have to pay off my technical debt and remove the duplicates. I’m looking for a deduplication tool that would:
- accept a destination directory
- source locations should be deleted after the operation
- if two files’ content is the same, delete the redundant copy
- if the files’ content is different, move the file and change its name to avoid a name collision
I tried doing it in Nautilus, but it does not look at the file content, only the file name. E.g. if two photos have the same content but different names, it will still create a redundant copy.
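For anyone who wants to see what the four bullet points amount to, here is a minimal Python sketch. It is not an existing tool: the names `merge_into`, `file_digest` and `unique_name` are made up for illustration, and it assumes the destination and the sources do not overlap and that SHA-256 is an acceptable way to compare content. Try it on copies before trusting it with real data.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def unique_name(dest_dir, name):
    """Pick a free name in dest_dir (photo.jpg -> photo_1.jpg, photo_2.jpg, ...)."""
    candidate = dest_dir / name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while candidate.exists():
        candidate = dest_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def merge_into(dest, sources):
    """Move regular files from each source into dest, dropping content duplicates."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Index the content that is already in the destination.
    seen = {file_digest(p) for p in dest.rglob("*")
            if p.is_file() and not p.is_symlink()}
    stats = {"moved": 0, "deleted": 0}
    for src in map(Path, sources):
        # Materialise the list first so moving files does not disturb the walk.
        files = sorted(p for p in src.rglob("*")
                       if p.is_file() and not p.is_symlink())
        for path in files:
            digest = file_digest(path)
            if digest in seen:
                path.unlink()  # same content is already kept: drop this copy
                stats["deleted"] += 1
                continue
            target_dir = dest / path.relative_to(src).parent
            target_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(unique_name(target_dir, path.name)))
            seen.add(digest)
            stats["moved"] += 1
        # Remove the now-empty source directories, deepest first.
        for dirpath, _dirs, _files in os.walk(src, topdown=False):
            try:
                os.rmdir(dirpath)
            except OSError:
                pass  # something other than a regular file is left; keep it
    return stats
```

Symlinks and special files are deliberately left where they are, so a source directory that still contains them is not deleted.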
Edit:
Some comments suggested using duperemove on btrfs. This replaces identical file content with pointers to the same location. This is not what I intend; I want to remove the redundant files completely.
Edit 2: Another quite cool solution is to use hardlinks. It replaces all occurrences of the same data with a hardlink. Then the redundant directories can be traversed and whatever is a link can be deleted. The remaining files will be unique. I’m not going for this myself as I don’t trust myself to write a bug-free implementation.
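The second half of the hardlink idea (traverse the redundant directories and delete whatever is a link) could look roughly like the sketch below. It assumes some other tool has already replaced duplicates with hardlinks; `remove_hardlinked` is a hypothetical name, and a link count above 1 is taken as the only sign that a file is a duplicate, which is wrong for files that were hardlinked on purpose.

```python
import os
import stat
from pathlib import Path


def remove_hardlinked(redundant_dir, dry_run=True):
    """List (and, unless dry_run, delete) regular files with more than one link."""
    removed = []
    for root, _dirs, files in os.walk(redundant_dir):
        for name in files:
            path = Path(root) / name
            info = path.lstat()  # lstat: look at the entry itself, not a symlink target
            if stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                removed.append(path)
                if not dry_run:
                    path.unlink()  # the data stays reachable via the other link(s)
    return sorted(removed)
```

The link count is read again for every file, so when all links to some data live inside the redundant directory, a real run keeps the last one. A dry run lists all of them, because nothing has been deleted yet.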
Use Borg Backup. It has built-in deduplication — it works with chunks not files and will recognize identical chunks and avoid storing them multiple times. It will deduplicate your files and will find duplicated chunks even in files you didn’t know had duplicates. You can continue to keep your files duplicated or clean them out, it doesn’t matter, the borg backups will be optimized either way.
Here are the stats from a backup of 1 server with approx 600gig
                Original size   Compressed size   Deduplicated size
This archive:   592.44 GB       553.58 GB         13.79 MB
All archives:   14.81 TB        13.94 TB          599.58 GB

                Unique chunks   Total chunks
Chunk index:    2760965         19590945
13meg… nice
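To give an idea of why the deduplicated size in stats like these can be so much smaller than the original size, here is a toy chunk store in Python. It is only an illustration of the idea, not how Borg is implemented: it uses fixed-size chunks and plain SHA-256 for simplicity (Borg uses content-defined chunking), and `ChunkStore` is a made-up name.

```python
import hashlib


class ChunkStore:
    """Toy store that keeps every distinct chunk only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # digest -> chunk bytes (the "unique chunks")
        self.total_chunks = 0   # every chunk ever added, duplicates included

    def add(self, data):
        """Store data and return its recipe: the ordered list of chunk digests."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # stored only if new
            self.total_chunks += 1
            recipe.append(digest)
        return recipe

    def restore(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self):
        """Bytes actually kept, i.e. the deduplicated size."""
        return sum(len(chunk) for chunk in self.chunks.values())
```

Adding a second archive that differs from the first in only a few chunks grows `stored_bytes()` by just those chunks, while `total_chunks` keeps counting everything.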
Restic
jdupes is my go-to solution for file deduplication. It should be able to remove duplicate files. I don’t know how much control it gives you over which duplicate to remove though.
As said previously, Borg is a full deduplicating incremental archiver complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like pika to browse the archives to ensure a complete history.
I did not ask for a backup solution, but for a deduplication tool
Tbf you did start your post with
I’m in the process of starting a proper backup
So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like borg). Your bullet points can probably be organized into a shell script with sync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using borg import-tar (docs). It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.
What about folders? Because sometimes when you have duplicated folders (sometimes with a lot of nested subfolders), a file deduplicator will take forever. Do you know of any software that works with duplicate folders?
What do you mean that a file deduplication will take forever if there are duplicated directories? That the scan will take forever or that manual confirmation will take forever?
I don’t actually know, but I bet that’s relatively costly, so I would at least try to be mindful of efficiency, e.g.
- use find to start only with large files, e.g. > 1 GB (depends on your own threshold)
- look for a “cheap” way to find duplicates, e.g. exact same size (far from perfect, yet I bet it is sufficient in most cases)
then after trying a couple of times
- find a “better” way to avoid duplicates, e.g. SHA1 (quite expensive)
- lower the threshold to include more files, e.g. > 0.1 GB
and possibly heuristics, e.g.
- directories where all filenames are identical, maybe based on locate/updatedb, which is most likely already indexing your entire filesystem
Why do I suggest all this rather than a tool? Because I bet a lot of decisions have to be made manually.
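The staged approach from this comment (size threshold first, then the cheap size comparison, then the expensive hash only where sizes collide) can be sketched in Python as follows. The function names are made up, and SHA-256 is swapped in for the SHA1 mentioned above; either one works for the illustration.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def content_hash(path, block_size=1 << 20):
    """SHA-256 of a file, read block by block to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(root, min_size=0):
    """Return groups of paths under root that have identical content."""
    # Stage 1 (cheap): bucket files by size, skipping anything below the threshold.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            size = path.stat().st_size
            if size >= min_size:
                by_size[size].append(path)

    # Stage 2 (expensive): hash only files that share their size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[content_hash(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Starting with something like `find_duplicates(home, min_size=10**9)` and lowering `min_size` on later runs matches the "lower the threshold" step. Nothing is deleted, so the manual decisions stay manual.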
fclones https://github.com/pkolaczk/fclones looks great but I didn’t use it so can’t vouch for it.
If you use rmlint, as others suggested, here is how to list the paths of the dupes:
jq -c '.[] | select(.type == "duplicate_file").path' rmlint.json
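If you would rather post-process the report in a script, the same filter can be written in Python. This is a sketch that assumes only what the jq command above implies: that rmlint.json is a JSON array of objects with "type" and "path" keys. `duplicate_paths` is a hypothetical helper name.

```python
import json
from pathlib import Path


def duplicate_paths(report_file):
    """Return the path of every report entry whose type is "duplicate_file"."""
    entries = json.loads(Path(report_file).read_text(encoding="utf-8"))
    return [
        entry["path"]
        for entry in entries
        # Entries without a "type" key are skipped, as with the jq select().
        if isinstance(entry, dict) and entry.get("type") == "duplicate_file"
    ]
```

Calling `duplicate_paths("rmlint.json")` gives a plain list that can be reviewed or filtered further before anything is removed.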
- use