How to find and delete duplicate files within the same directory?
Method-1
Bash script to find and remove duplicate files using checksums
#!/bin/bash
# Index files by MD5 checksum; any repeated checksum marks a duplicate.
declare -A arr
shopt -s globstar   # make ** match files in subdirectories too
for file in **; do
    [[ -f "$file" ]] || continue
    read -r cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"   # dry run: prints the command instead of deleting
    fi
done
This is both recursive and handles any file name. It only prints the rm commands; once you trust the output, replace echo "rm $file" with rm "$file" to actually delete.
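The read cksm _ line works because md5sum prints the hex digest followed by the file name, and read splits on whitespace, keeping the first field and discarding the rest into the placeholder variable. A quick demonstration (the file name demo.txt is just for illustration):

```shell
# Create a throwaway file
printf 'hello\n' > demo.txt

# md5sum prints "<32-char hex digest>  <filename>"
md5sum demo.txt

# read keeps the first whitespace-separated field in cksm,
# the remainder lands in the throwaway variable _
read -r cksm _ < <(md5sum demo.txt)
echo "$cksm"   # the hex digest alone, no file name
```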
Method-2
One line command to find and remove all duplicate files in Linux
1. Finding duplicate files (based on size first, then MD5 hash)
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
2. Delete found duplicate files
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -f3-100 -d ' ' | tr '\n.' '\t.' | sed 's/\t\t/\n/g' | cut -f2-100 | tr '\t' '\n' | perl -i -pe 's/([ (){}-])/\\$1/g' | perl -i -pe 's/'\''/\\'\''/g' | xargs -pr rm -v
If you want to delete the files without being asked for confirmation, remove the -p option after the last xargs in the command above.
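All the perl escaping in that one-liner exists only because file names travel through whitespace-splitting xargs. A NUL-delimited rewrite avoids escaping entirely. A minimal sketch (not the original one-liner: it hashes every file rather than pre-filtering by size, and it only reports duplicates rather than deleting them):

```shell
#!/bin/bash
# Map each MD5 digest to the first file that produced it; any later
# file with the same digest is reported as a duplicate.
declare -A seen
while IFS= read -r -d '' file; do
    read -r sum _ < <(md5sum "$file")
    if [[ -n "${seen[$sum]}" ]]; then
        echo "duplicate: $file (same content as ${seen[$sum]})"
        # rm -- "$file"   # uncomment to actually delete
    else
        seen[$sum]=$file
    fi
done < <(find . -type f -print0)
```

Because find -print0 and read -d '' delimit names with NUL bytes, spaces, quotes, and even newlines in file names pass through untouched.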
Method-3
1. How do we test whether two files have the same content?
if diff "$file1" "$file2" > /dev/null; then
...
2. How can we get the list of files in a directory?
files="$( find "${files_dir}" -type f )"
We can then take any two files from that list and check whether their names differ while their content is the same.
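The if above keys on diff's exit status: 0 when the files are identical, 1 when they differ. A quick demonstration (the x1/x2/x3 names are just for illustration):

```shell
printf 'same\n'  > x1
printf 'same\n'  > x2
printf 'other\n' > x3

# Identical content: diff exits 0, so the then-branch runs
if diff x1 x2 > /dev/null; then
    echo "x1 and x2 have the same content"
fi

# Different content: diff exits 1
if ! diff x1 x3 > /dev/null; then
    echo "x1 and x3 differ"
fi
```

For a pure same-or-different test, cmp -s file1 file2 behaves equivalently and is typically faster, since it compares bytes without computing an actual diff.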
Script
#!/bin/bash
# removeDuplicates.sh
# Note: relying on word-splitting of $files means this breaks on file
# names containing whitespace; it is fine for simple names like below.
files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi
files="$( find "${files_dir}" -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done
For example, suppose we have this directory:
$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)
So there are only 3 unique files.
Let's run the script:
$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'
And only the 3 unique files are left:
$> ls .tmp/ -1
all.txt
file
text(2)
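Because the script word-splits $files, it breaks on file names containing whitespace. A sketch of the same pairwise comparison that survives such names, reading the file list NUL-delimited into a bash array (requires bash 4.4+ for mapfile -d ''; cmp -s is swapped in for diff as a faster same-or-different test, and the script name is hypothetical):

```shell
#!/bin/bash
# removeDuplicatesSafe.sh -- same pairwise idea as above, but file
# names are read NUL-delimited into an array, so whitespace is safe.
files_dir=${1:-.}
mapfile -d '' files < <(find "$files_dir" -type f -print0)
for file1 in "${files[@]}"; do
    for file2 in "${files[@]}"; do
        # skip self-comparison and files already removed this run
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if cmp -s -- "$file1" "$file2"; then
                echo "$file1 and $file2 are duplicates"
                rm -v -- "$file2"
            fi
        fi
    done
done
```

Quoting the array expansion as "${files[@]}" is what preserves each file name as a single word; the -- after cmp and rm guards against names that begin with a dash.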