I know you are requesting functionality in Lightroom proper. But in the meantime, there are a multitude of utilities that can do this job now.
This would at least allow you to find and delete duplicates - that's what I did before importing my collection into Lightroom.
Although finding 5000 of them is still gonna take some time...
The utility I know of (and have used) will find perfect duplicates of most image, video, and MP3 files, but does not recognize RAW files. Also, it can only find perfect duplicates, and does not compare similarity. I suspect I have fewer RAW duplicates than JPEG duplicates, but it doesn't take nearly as many RAW files to waste a lot of disk space. Let me know if you are aware of a utility that will handle RAW images.
A 3rd party utility would help, but only to a point: it would be tedious at best to chase down and cross-check in LR which file(s) to delete (i.e. usually the one lacking any LR editing history). In your case it was not a problem since you were thinking ahead and found duplicates before installing LR. It's too late for me to do that on my original pre-LR photos. But I am also adding possible duplicates on an ongoing basis for one reason or another.
I don't mean to throw water on your FR - it might be worthwhile, as you said - for other purposes besides just finding dups.
But in the meantime, there are several utilities capable of assessing similarity, not just exact dups. I don't know about raw dups. But you could export all your raws and then run the dup checker on your exports.
But raw dups will have the exact same file contents, unless you use DNG. If you do use DNG, the software would have to be smart enough to exclude xmp when doing the comparison (I don't know whether any software like that exists), assuming neither the present filename nor the original filename would match. If you want to find raws that are "similar" to other raws, then it gets trickier...
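For the exact-duplicate case, here is a rough Python sketch of how a byte-level check could work on any file type, RAW included. It is not what any particular utility does; the function names `file_digest` and `find_exact_duplicates` are made up for illustration. It simply hashes file contents, so it would not catch DNG files whose embedded metadata differs.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Group files under root whose bytes are identical (hypothetical helper)."""
    # Files of different sizes cannot be identical, so bucket by size first
    # and only hash the files that share a size with at least one other file.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_digest(path)].append(path)

    # Keep only the digests shared by two or more files.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_exact_duplicates("/path/to/photos")` would return a list of groups, each group being the paths of files with identical contents. Deciding which copy to delete is still up to you.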
PS - I would definitely invent a workflow that imports no more dups in Lightroom, so once solved, it stays solved.
You've got my vote for this as a feature, but I guess a definition of 'similarity' would be needed to implement. One way would be to allow sorting by some user-defined combination of metadata fields, but I don't know if this is what you had in mind. Anyhow, if this did come to be, it'd be nice to also be able to filter and see only 'similar' images.
Short of a new feature, I've found that working with 'All Photographs', sorting by capture time, and visually scanning for duplicates works pretty well if the dups may have different names. Otherwise sorting by filename works. You can use metadata filters at the top to keep the number of images you review at one time reasonable. Obviously, the smaller your catalog, the more manageable this is.
There are a couple of plug-ins that may be helpful. The "Duplicate Finder" plug-in will let you select from some metadata options and then run through the entire catalog, putting identified duplicates in a Smart Collection for you to review. My personal experience with a previous version was that this took a very long time and failed to finish on a really large catalog (60k+).
If you've got a lot of files to compare, LR/Transporter can be useful. It lets you create a text file report on selected images containing whatever metadata fields you specify. I found it helpful in some cases to output the filenames, capture time, edit time, etc. If you are able to work with this data in Excel, Access, or some other database, you can whittle down your list of potential duplicates this way. You then use LR/Transporter to import a list of images you want to flag, and you can then filter on that flag.
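If a spreadsheet isn't handy, the whittling-down step could also be done with a few lines of Python. This is only a sketch: it assumes the report is tab-delimited with a header row, and the column names `filename` and `capture_time` are placeholders for whatever fields you chose to export, not names the plug-in is known to use.

```python
import csv
import io
from collections import defaultdict


def candidate_duplicates(report_text, key_fields=("capture_time",), delimiter="\t"):
    """Group report rows that share the same values for key_fields.

    Column names are assumptions; adjust them to match your own report.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(report_text), delimiter=delimiter):
        key = tuple(row[field].strip() for field in key_fields)
        # Skip rows with a blank key (e.g. scans with no capture time),
        # otherwise they would all be lumped together as one "duplicate" group.
        if not all(key):
            continue
        groups[key].append(row["filename"])
    # Only keys shared by two or more files are potential duplicates.
    return {key: names for key, names in groups.items() if len(names) > 1}
```

The filenames in each returned group are the ones worth writing back out as the list of images to flag.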
Tedious no matter how you do it.
Short of a new feature, I've found that working with 'All Photographs', sorting by capture time, and visually scanning for duplicates works pretty well if the dups may have different names. Otherwise sorting by filename works. You can use metadata filters at the top to keep the number of images you review at one time reasonable.
That's a good idea. Two passes, by capture, and name, would find a worthwhile number of dups for most images. Scans are a problem since they lack an authentic capture time. I store the physical photos in separate directories named yyyy/mm/dd because that seemed like the logical thing to do to avoid too many photos in one directory as well as too many directories (365 per year max). But each directory scheme comes with its own problems.
definition of 'similarity' would be needed to implement.
4 sort criteria come to mind: Shape/Outline (there is calculus for finding similar shapes in two or more dimensions), Color (predominance of the 8 LR colors), Contrast (overall relative), Pixel Sharpness.
That would be quite cool to use. But I would be happy with whatever the developer considers to be "similar", because any definition would arrange them close enough to greatly speed up a manual process. In addition to locating duplicates, similar images could be spotted: select a photo before sorting, and it should appear in context. You will still have to eyeball it after that, but instead of thousands of photos you might have a dozen or so to consider.
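To make the color criterion a bit more concrete, here is a minimal Python sketch of one possible definition of "similar". It is an illustration only, not how Lightroom or any plug-in works. It assumes each image has already been reduced to a list of (r, g, b) pixel tuples (getting those from a real file would need an imaging library), sorts every pixel into one of 8 coarse color buckets, and compares the bucket proportions. The function names are invented for this example.

```python
def color_signature(pixels, levels=2):
    """Return the share of pixels falling in each of levels**3 coarse RGB buckets."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = [0] * (levels ** 3)
    for r, g, b in pixels:
        # Map each 0-255 channel value to a bucket index in 0..levels-1.
        ri, gi, bi = (v * levels // 256 for v in (r, g, b))
        counts[(ri * levels + gi) * levels + bi] += 1
    total = len(pixels)
    return [c / total for c in counts]


def similarity(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical signatures, 0.0 for no overlap."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


def rank_by_similarity(selected_pixels, others):
    """Order (name, pixels) pairs from most to least similar to the selected photo."""
    ref = color_signature(selected_pixels)
    scored = [(similarity(ref, color_signature(px)), name) for name, px in others]
    return [name for _score, name in sorted(scored, key=lambda item: -item[0])]
```

A crude measure like this would put exact copies and near-copies of the selected photo at the top of the list, which is all that is needed to shrink the pile you have to eyeball.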
Add my vote. We could also imagine using this technology to find images according to a user sketch: the user would just need to roughly draw what he is looking for. This technology is already implemented in an open source application, digiKam. These functionalities were themselves copied from another open source program that I used to use to find similar images and to find images according to a sketch. It worked quite well and I miss it in Lightroom. Regards, Eric