Update: Including seed-centered clustering to prevent perceptual hash chaining
When I first wrote a duplicate photo detector, it was of course a script. It worked pretty well. I used perceptual hashing and clustered duplicates. I also used configurable criteria to select the keeper from each cluster of duplicates. The script generated a CSV audit trail. Technically correct and even rather fast.
But a script that works is not the same thing as a system people blindly can trust. This article is not about hashing. It’s about what happens when an algorithm meets reality — scale, safety, configuration drift, packaging, and not to forget: non-technical users.
The hard part was never detecting duplicates, it was turning a Reddit-discussed Python script into a deterministic, easily installable and user-friendly tool.
The Algorithm (Recap)
The original version (see the first article Finding and eliminating Duplicates) focused on exact hashing (SHA-1) for byte-identical files and perceptual hashing (dHash, pHash, wHash and optional colorhash). It used corroboration thresholds to reduce false positives and union–find clustering to group duplicate families. You can configure a deterministic keeper policy (resolution, sharpness, format preference, compression proxy). This solved the algorithmic problem and it worked really well — even at 10,000+ images.
Then reality intervened in the form of discussions on Reddit.
When a Script Stops Being Enough
As soon as real users tested it, new constraints arose from real world (problem) situations. Just to mention a few: Libraries with over 400,000 photos. People with a lot of iPhone photos in HEIC – JPEG variants. Re-scans of huge photo libraries after adding new folders. Need for cancellation of a started scan. Configuration changes between runs. Desire for visual review of clusters, actually we all want to be the human in the loop… Non-Python users wanting a GUI. At this point duplicate detection stopped being a hashing problem and became a systems problem.
Breaking the Monolith
The original script evolved: after the Reddit discussions find_duplicates.py learned how to cache image signatures in a SQLite DB, to speed up rescanning of photo collections and thus became find_duplicates_indexed.py. I then choose to implement a modular architecture to improve maintainability with a single unified engine for the CLI and a GUI frontend.
I split the monolith script into the following components.
| cli.py | command line interface |
| dedup_gui.py | GUI frontend |
| worker_qt.py | Multi-thread orchestration |
| engine.py | orchestration of core processing |
| config.py | configuration for engine |
| hashing.py | decode + resize + perceptual hashing |
| features.py | image features |
| index_db.py | SQLite incremental cache |
| reporting.py | CSV + HTML output |
| app_logging.py | exception logging |
| capabilities.py | availability of code libraries |
| dedup_types.py | type aliases |
| actions.py | move to quarantine / trash |
| settings.py | JSON-based configuration |
The image below shows an architectural diagram of those components.

This is not cosmetic refactoring. The key rules used enforce an architectural discipline: the engine knows nothing about the GUI, frontends do not perform computation and reporting and handling are separated from detection. This way CLI and GUI can share the exact same logic and the engine can be tested independently. The configuration is deterministic and there is never any hidden coupling between UI and policy.
‘Scaling Beyond O(n²)’
In information science this is known as the Quadratic Wall. A concept describing the point where an algorithm’s performance becomes unusable because the workload quadruples every time the input doubles. At small scales, standard nested loops are often fast enough with modern hardware. However once the input reaches a certain threshold, the growth curve becomes so steep that even fast hardware cannot keep up.
Naively comparing every image to every other image does not scale. For 50,000 images that’s 1.25 billion pairwise comparisons! That’s why, already in the original script, we implemented hash prefix bucketing, images are grouped by a coarse hash prefix. Only images inside the same bucket are compared. This way bucketing reduces comparisons from millions to thousands — without sacrificing correctness.
Also we implemented union–find clustering in which similarity is transitive: If A ≈ B and B ≈ C then A, B, C belong to the same cluster. Union–Find (disjoint-set) builds connected components efficiently.
Below you find a sequence diagram of the standard flow of interaction.


SQLite Incremental Indexing
Hashing 50,000 images repeatedly is wasteful. So the engine now uses an IndexDB cache. It stores extracted image features, validates the cache against a configuration signature and automatically invalidates stale entries if hashing settings change. Now re-scans become fast because incremental. Performance shifts from hash everything every time to hash only what changed. That’s the difference between an initial script and a battle tested system.
Here’s a diagram of the standard data flow.

Safety as a Design Constraint
Duplicate detection is not a toy problem. Deleting the wrong file can destroy the best version of a photo. So safety became a first-class architectural requirement from the very beginning.
From the very first start safeguards were built in. In the original script dry-run is already the default. Also a quarantine folder is used instead of direct deletion. Optionally we use send-to-trash integration which on Windows uses the system Recycle Bin as an extra safe guard. A CSV audit trail for is part of every run. An HTML report with thumbnails can be used to visually inspect the results of duplicates clustering.

The deterministic keeper selection can be configured by the user and we default to conservative corroboration thresholds. Similarity is not just a binary guess:
if dhash_dist <= T1:
similar
elif dhash_dist <= T2 and phash_dist <= T3 and whash_dist <= T4:
similar
else:
differentPythonBorderline cases require agreement from multiple independent hashes. So DedupTool is not about aggressive cleanup. Instead it can be used for controlled reduction of redundancy.
Adjusting the Keeper Policy
In the initial keeper policy ‘sharpness’ is a primary criterium, even more than intended, leading for instance to preferring a JPEG copy of 400KB over the original 2.5MB image.
Root cause in this case is, that the sharpness metric is based on the Laplacian variance. That measure detects high-frequency edges. JPEG compression introduces artifacts that artificially increase those edges: compression ringing, block edges, quantization noise and micro-contrast artifacts.
So the metric sees more edge ‘energy’, higher Laplacian variance and decides ‘sharper’, even though the image is objectively of lower quality. This is a known problem in image quality metrics. Edge-based sharpness metrics are not trustworthy quality metrics. They measure edge strength not image fidelity.
For the deduplication keeper selection, you usually want a hybrid score. Since DedupTool works with perceptual hashes, duplicates are usually characterized by same resolution but different compression. In those cases file size alone is often enough to pick the better file.
In our initial engine.py it’s this method that actually applies the keeper policy.
# --- keeper policy & actions ---
def _keeper_key(self, f: Features) -> Tuple:
# area, sharpness, format rank, size-per-pixel
spp = f.size / max(1, f.area)
return (f.area, f.sharp, file_ext_rank(f.path), -spp, f.size)PythonIf the winner is chosen with max(…), then the ranking priority is lexicographic: f.area, f.sharp, file_ext_rank(f.path), -spp, f.size. So if two files have the same resolution, sharpness is indeed the dominant criterion before format and before size.
And there is a second subtle issue: The -spp term is backwards for quality preference. Since spp = f.size / f.area, that is essentially bytes per pixel. Higher spp usually means: less compression, more retained information, better quality for same format/resolution.
A Balanced Solution
Using -spp and then max(…). That means the algorithm prefers lower bytes per pixel, which often means more heavily compressed files. So for equal area: sharpness favors overly sharpened, compressed JPEGs; -spp also favors smaller compressed files. That combination strongly explains why a 400 KB JPEG can currently beat the 2.5 MB original.
For archive-quality keeper selection, a much safer order would be: area, format rank, size-per-pixel or size and sharpness as tiebreaker. If we adjust the keeper policy to:
def _keeper_key(self, f: Features) -> Tuple:
spp = f.size / max(1, f.area)
return (f.area, file_ext_rank(f.path), spp, f.size, f.sharp)PythonThis does three good things. We let resolution still dominate, prefer better formats early; less-compressed files win before sharpness because sharpness becomes only a late tiebreaker.
In practice, f.size and spp are correlated for equal area, so either ordering is fine. I slightly prefer spp before size because it normalizes better across near-equal dimensions. Our file extension ranking is actually quite reasonable and does what we intended: RAW → 5, TIFF → 4, PNG → 3, JPEG → 2,
GIF → 1.
The most balanced solution then becomes:
def _keeper_key(self, f: Features) -> Tuple:
spp = f.size / max(1, f.area)
return (f.area, file_ext_rank(f.path), spp, f.size, f.sharp)PythonPriority now becomes: resolution, format quality, bytes per pixel (less compression), file size, sharpness.
Improving Cluster Purity: Preventing Hash Chaining
While testing DedupTool recently I noticed an interesting edge case in clustering.
In one cluster of four images, two images were true duplicates while the other two were slightly different photos from the same shoot. The duplicates were detected correctly and the keeper selection was correct. However, the cluster itself contained images that were not actually duplicates.
This revealed a subtle property of perceptual hash clustering sometimes called hash chaining.
The Root Cause: Transitive Similarity
DedupTool builds clusters using a similarity graph. Images are compared using perceptual hashes (dHash, pHash and wHash) and similar pairs are connected. Clustering then uses connected components to group images. This means the following chain can occur: A similar to B, B similar to C, C similar to D. Even if: A not similar to C, A not similar to D, B not similar to D all four images still end up in the same cluster.
This phenomenon is well known in perceptual hash systems and is sometimes referred to as over-merging or transitive similarity. Users of other photo deduplication tools such as PhotoSweeper or Duplicate Cleaner sometimes report similar behaviour when similarity thresholds are too permissive.
In DedupTool this did not affect duplicate detection — the correct keeper was still selected — but it did reduce cluster purity.
The Fix: Seed-Centered Cluster Refinement
The solution was to add a lightweight refinement stage after clustering. The key idea is simple: Every image in a cluster must be similar to the cluster seed. Instead of relying purely on connected components, clusters are re-centred around the image that the keeper policy would choose.
The pipeline therefore becomes:
hash_all()
↓
cluster() (fast DSU clustering using perceptual hashes)
↓
refine_clusters() ← new step
↓
choose_keepers()
During refinement: The best image in the cluster (according to the keeper policy) is selected as the seed. Every other image is compared directly with the seed. Images that are not sufficiently similar to the seed are removed from the cluster.
This converts clusters into seed-centred star clusters, eliminating most chaining artefacts.
Implementation
Because DedupTool already had the necessary building blocks — perceptual similarity checks, DSU clustering, and keeper scoring — the fix required only a small refinement helper. The refinement step simply filters cluster members based on similarity to the seed.
def refine_clusters(self, clusters, feats):
refined = {}
for cid, idxs in clusters.items():
if len(idxs) <= 2:
refined[cid] = idxs
continue
seed = max((feats[i] for i in idxs), key=self._keeper_key)
seed_i = feats.index(seed)
new_cluster = [seed_i]
for i in idxs:
if i == seed_i:
continue
if self.similar(seed, feats[i]):
new_cluster.append(i)
if len(new_cluster) > 1:
refined[cid] = new_cluster
return refinedPythonThis post-cluster refinement dramatically improves cluster purity without slowing down the tool, because the expensive hash comparisons have already been performed during the candidate generation stage.
Why This Approach Works Well
The refinement stage preserves the original performance characteristics of DedupTool: Fast candidate generation using hash bucketing, Efficient clustering using DSU, Safe keeper selection.
The only change is a small verification step that ensures cluster members are genuinely duplicates of the keeper image. In practice this eliminates most false cluster members while keeping the engine deterministic and predictable.
Dependency Hygiene and Packaging Reality
The most surprising part of the system’s evolution was not clustering. It was packaging. Turning a Python script into a Windows executable forces discipline. Optional dependencies must truly be optional. Capability detection must be accurate. Heavy code libraries must justify their existence. Frozen environments behave differently.
The original implementation relied on a couple of heavy weight code libraries, like imagehash from SciPy, OpenCV and skimage. That worked fine in my development environment. But for distribution, it caused dependency bloat and fragile packaging. So the system evolved again taking the steps described below.
Removing SciPy
Perceptual hashing via imagehash pulled in SciPy. SciPy dramatically increases package size and complexity. So the pHash DCT was reimplemented using NumPy only. The result is deterministic behavior and a drastically reduced dependency chain. Thus enabling cleaner packaging and a smaller executable footprint.
Removing OpenCV
OpenCV, a very useful but also rather heavy code library finally was used only for Laplacian sharpness estimation. Replacing it with, again, a NumPy-based, Laplacian variance calculation removed another ~100MB of runtime size, simplified packaging while preserving ranking behavior.
HEIC Support (and the x265 Surprise)
Supporting HEIC required pillow_heif. Which in turn brings a lot of native codec dependencies. Some DLLs look optional — until you remove them and decoding breaks. Packaging revealed that what looks like an video encoder library is actually required for image decode linkage. That forced an explicit decision: keep HEIC support and accept size, or offer a lighter build without it… Packaging is actually not an afterthought, it can be an architectural constraint!
From Script to Windows Application
The PyQt6 GUI is not just a nice wrapper, it enforces structural coherence also. The worker thread keeps the UI responsive but it also enables us to implement Cancel support checks inside engine loops. Instead of desperately trying to kill a started scan on a large photo collection it offers an elegant way out, just pressing one button!
Using PyQt also makes it easy to use settings to hydrate UI state at startup. DB validity is shown transparently and reporting preview matches the actual output location.
The control flow now is as follows: the GUI starts the Worker, that in turn starts the Engine and when that finishes, offers Reporting and returns control back to the GUI.
The engine itself always remains pure computation. That separation ensures a deterministic startup, prevents hidden side effects and ensures predictable behavior with reusable core logic. At this point, our project stopped being a script, it has become a real tool.
Shipping an Executable
Shipping a Windows executable does several things. It makes our project usable by non-Python users. It also forces dependency hygiene. Instead of offering some code, an installable application gives you a reproducible and operational system.
Positioning
Some additional positioning. This tool is not: a deep-learning vision model, a semantic similarity engine or a content-aware AI system. It is deterministic, explainable, conservative, auditable and scalable. It focuses on structural similarity, not semantic meaning. And that is deliberate, because duplicate detection requires confidence.
The Bigger Picture: Photo Intelligence
Duplicate detection is a controlled similarity problem. Our next frontier is harder: detecting AI-generated images, identifying heavy AI enhancements and distinguishing synthetic from authentic structure. Auditing image provenance requires much deeper structural analysis. We’ll discuss this in upcoming posts.
The foundations however are the same: feature extraction, deterministic signal handling, careful use of thresholding, safety-first based decision models and explainability over black-box inference. This tool however useful in itself is not the end goal, it is the stepping stone!
Final Thoughts
The first version proved already that perceptual hashing works. The second version then proved that architecture matters. The current version however also proves something else: engineering is not only about clever algorithms. It is also about building systems that people can trust.
Installation & First Use
DedupTool is distributed as a portable Windows build. No installation and no Python environment are required. Just follow these steps:
Download & Extract
The portable Windows build is available here. After downloading just follow the instructions above. These are also available as a README file in the download. Download the ZIP archive. Extract it to a folder of your choice (e.g. C:\Tools\DedupTool). Open the folder and start DedupTool.exe. That’s it.
What Happens on First Run?
On the first launch DedupTool will automatically create a few supporting folders inside the application directory:
DedupTool\
│
|── cache\
│ └── dedup_index.sqlite
│
|── reports\
│
└── dedup_settings.json
First Scan (Safe Workflow)
Select a test folder (not your entire archive). Enable Write HTML report to create a thumbnails report of the duplicate clusters found. Leave Dry Run enabled (default). Click Scan.
When finished: Pick ‘Use index Db’, hit the browse button and confirm dedup_index.sqlite inside the child folder cache. The ‘traffic light’ will turn green.
From the Help menu, you can choose Diagnostics to see an overview the installed capabilities, important paths and the cache index DB.
Review the CSV in the reports folder, open the HTML report for visual inspection. Only after verification consider enabling: Move to quarantine, or Send to Recycle Bin. DedupTool is conservative by design. Review first, act later. Duplicate detection isn’t about speed… as I already said, it’s about confidence.

A summary of this post has also been published on LinkedIn.
A problem summery of the post has been discussed on Reddit.
Engineering Notes
2026
— Adjusted the keeper policy to prevent over-prioritizing sharpness.
— Added seed-centred cluster refinement to prevent perceptual hash chaining.