Skip navigation
_c_h_r_i_s_
Currently Being Moderated

PSE10 Remove Duplicate Photos

Aug 7, 2012 1:29 PM

I am running PSE10 on Win 7, and have found myself with a bit of a mess I'd like to clean up.  My Catalog has many hundred duplicate photos.  These are pairs of items that point to the same physical photo via a different path.  All follow the pattern of C:\Documents and Settings\Chris\My Documents\...   and  C:\Users\Chris\Pictures...

 

I have tried using the Search for duplicate photos function, which does identify the duplicates, but it also finds tons of false positives, photos that merely look alike but are not actual duplicates.  Sorting through these is not feasible.

 

I could perhaps identify the My Documents versions by deleting the shortcut to My Documents in the OS and seeing which images need to be reconnected.  However my tags are all attached to the C:\Documents and Settings versions.

 

I'm looking for a good way to identify actual duplicate photos in the catalog, rather than merely similar ones.

 

The workflow I imagine is: identify duplicate photos | transfer tags to C:\Users version | delete C:\My Documents version.

 

Anyone have any suggestions?

 

Thanks,

 

Chris

 

PS  I'll note that I am experienced with databases and sql, on the off chance the catalog is really a simple sqlite database that I can manipulate.

 
Replies
  • Currently Being Moderated
    Aug 7, 2012 2:35 PM   in reply to _c_h_r_i_s_

    Chris,

     

    Just so I understand:

     

    Are you saying that your photos exist in one and only one place on your hard drive, but that the Organizer has two thumbnails for each one, each thumbnail pointing to a different location?

     

    Ken

     
    |
    Mark as:
  • 99jon
    12,821 posts
    Jan 6, 2010
    Currently Being Moderated
    Aug 7, 2012 4:06 PM   in reply to _c_h_r_i_s_

    You should really have used backup and restore rather than copying the old catalog based on the XP system structure to Windows 7.

     

    http://helpx.adobe.com/photoshop-elements/kb/backup-restore-move-catal og-photoshop.html

     

    

     

     

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 7, 2012 6:02 PM   in reply to _c_h_r_i_s_

    Ooooof!  That's one messed up database, for sure.

     

    Using that sample file that now has both copies listed as "missing", when you restore its name to the correct one (in its correct directory), what happens if you do a File...Reconnect...Missing File on that file?

     

    This may not something that we're going to solve quickly.  If you can afford to spend a couple of days discussing options, I'd like to hear the opinions of others here. 

     

    The Organizer's main catalog file (catalog.pse10db) is an unencrypted, non-password protected SQLite database file.  If there isn't an easy solution using the Organizer, then running SQL queries certainly can be an option.  If you don't have a favourite SQL tool that can connect to SQLite, I recommend the free Database.Net tool:

    http://fishcodelib.com/Database.htm

    "Installation" is ridiculously easy -- extract the executable to a directory of your choosing and make a Desktop shortcut to it.

     

    In preparation, use Windows Explorer to make a file copy of your entire catalog directory (including subdirectories) somewhere for safekeeping....

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 7, 2012 11:26 PM   in reply to photodrawken

    See my answer on your previous post.

    The solution will be to do a full backup, then a restore to a new custom location.

    In the backup process, all files in the catalog should be copied, which means one file for the right folder path and the same physical file once again with its shortcut path. This will create real duplicates. If you restore to a custom location on a new master folder, the new folder tree will include different subfolders outside the shortcuts known by the OS.

    Then, the visual search should work to find duplicates you can stack and flatten afterwards.

    To make a quicker search, I would have a look at the location of suspected duplicates to see if you can limit the visual search to given subfolders and not to the whole database. Assigning a tag such as 'possible dup' to the files in a subfolder and the corresponding shortcut folder would help a lot.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 7, 2012 11:42 PM   in reply to MichelBParis

    Michel,

     

    I'm glad you responded to this, because I was also thinking that a backup and restore should be the first thing to try.

     

    If I understand what you're saying, it's this:

    1. Make a full PSE backup to an external drive.
    2. Delete the images from the internal drive.  (Backing them up separately, first, of course.)
    3. Delete the existing catalog.
    4. Restore from that PSE backup to a custom location, such as C:\Catalogs for the catalog and C:\Pictures for the images.

     

    Although that would ensure the photos are not duplicated on the drive, I'm not sure that the duplicates would be removed from the catalog entries.  What do you think?

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 7, 2012 11:56 PM   in reply to photodrawken

    photodrawken wrote:

     

    Michel,

     

    I'm glad you responded to this, because I was also thinking that a backup and restore should be the first thing to try.

     

    If I understand what you're saying, it's this:

    1. Make a full PSE backup to an external drive.
    2. Delete the images from the internal drive.  (Backing them up separately, first, of course.)
    3. Delete the existing catalog.
    4. Restore from that PSE backup to a custom location, such as C:\Catalogs for the catalog and C:\Pictures for the images.

     

    Although that would ensure the photos are not duplicated on the drive, I'm not sure that the duplicates would be removed from the catalog entries.  What do you think?

     

    Ken

    Steps 2 and 3 will be useful unless you have plenty of disk space.

     

    The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

     
    |
    Mark as:
  • 99jon
    12,821 posts
    Jan 6, 2010
    Currently Being Moderated
    Aug 8, 2012 12:46 AM   in reply to photodrawken

     

    I would have thought there may possibly be a database tool which could extract the details into a table.

     

    Unfortunately I don’t know one.

     

    If the table could be sorted it may be possible to delete all the lines with Documents & settings/My Documents and then resave the file. Then replacing the existing db file with the newly saved version.

    It’s worth a shot.

     

     

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 4:29 AM   in reply to 99jon

    I think in the situation of a restored catalog with duplicate files, the best solution would be to use an external  utility to find duplicates. The duplicates will have the same file name, size and creation date, they should be easy to find. If you delete them from the disk via the software, you'll only have to find for missing files and delete them from the catalog.

    I don't know of such softwares myself, but I have seen many threads about them.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 12:26 PM   in reply to MichelBParis

    MichelBParis wrote:

     

    The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

    Ah, I understand what you're saying -- each duplicated entry in the database will cause the backup process to backup the same referenced file again.

     

    So, doing a backup and restore will only make the problem worse: 

    • It will create duplicate image files on the drive whereas now there is only one image file.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 1:20 PM   in reply to 99jon

    99jon wrote:

     

     

    I would have thought there may possibly be a database tool which could extract the details into a table.

    Unfortunately I don’t know one.

    If the table could be sorted it may be possible to delete all the lines with Documents & settings/My Documents and then resave the file. Then replacing the existing db file with the newly saved version.

    It’s worth a shot.

     

    That's pretty much the approach I'm leaning toward right now.

     

    The complicating matter is the tags and album data.  Chris will have to give us the details, but if the "incorrect" file location data is the only data that has tags and albums, there will be some fancy SQL footwork needed to attach the tags and albums to the "correct" file location data.

     

    In general, the procedure would be to:

    1. Create a temporary table and populate it with the "incorrect" file location data (which contains correct reference keys to tags and albums) from the main media table.
    2. Use that temporary table to attach the tags and album reference keys to the "correct" file location data in the main media table.
    3. Use that temporary table to delete the "incorrect" file location records references from all the other tables.
    4. Use that temporary table to delete the "incorrect" file location records from the main media table.
    5. Delete the temporary table.

     

    That can be done without creating a temporary table, but the subsequent SQL queries would contain somewhat hellish WHERE clauses.  All this can be done in a straightforward manner using SQL queries in Database.Net.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 2:40 PM   in reply to photodrawken

    photodrawken wrote:

     

    MichelBParis wrote:

     

    The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

    Ah, I understand what you're saying -- each duplicated entry in the database will cause the backup process to backup the same referenced file again.

     

    So, doing a backup and restore will only make the problem worse: 

    • It will create duplicate image files on the drive whereas now there is only one image file.

     

    Ken

    No, the problem is different, and far from worse... As a matter of fact, having duplicates files wastes disk space, and that's about all for practical purposes. Yes, it's not satisfying for the mind when you want everything to be tidy and in order, but considering the initial mess in the present situation, that's not a big deal. On the contrary, having several entries in the database for the same file is like a delayed bomb : it will necessarily create big problems.

     

    You can solve a part of the problem with the database, but deleting outside PSE an entry in the media table has a lot of accessory consequences on version sets, stacks and albums.

     

    I prefer to use the built-in feature of PSE : backup and restore when I can clearly see the consequences, and I would prefer an external file duplicates eradication, knowing that I can easily find and delete the missing entries from within Elements.

     

    To summarize, I would simply do the backup and restore and live with the duplicates.

    And, as you know I am curious, if I have some free time, I would test cleaning the duplicates as explained above (after all, I'll have the backup if anything goes wrong...)

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 4:48 PM   in reply to _c_h_r_i_s_

    Chris,

     

    If you're saying that the media to be deleted are not the media that have the tags and albums associated, then the problem is simpler.  But, you need to look at the tag_table table to see if any of its media_id values are the same as the id value in the media_table that you're intending to delete.

     

    In other words, it's not as simple as just deleting rows from the media_table table -- it's a relational database and when you're finished, you want the correct relations between tables, as well as no orphaned entries.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 8, 2012 9:52 PM   in reply to MichelBParis

    MichelBParis wrote:

     

    On the contrary, having several entries in the database for the same file is like a delayed bomb : it will necessarily create big problems.

     

    You can solve a part of the problem with the database, but deleting outside PSE an entry in the media table has a lot of accessory consequences on version sets, stacks and albums.

    I agree with both statements.

    MichelBParis wrote:

     

    I prefer to use the built-in feature of PSE : backup and restore when I can clearly see the consequences, and I would prefer an external file duplicates eradication, knowing that I can easily find and delete the missing entries from within Elements.

    OK, I see what you're saying:  if the backup and restore creates two sets of images on the hard drive corresponding to two sets of image data in the catalog, and one uses an external method to delete one set of image files, then the "Reconnect All Missing Files" dialog box allows one to select all the missing files and delete them from the catalog.  That's a great solution.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 10, 2012 6:08 AM   in reply to _c_h_r_i_s_

    Chris,

    Are you able to edit (change) something in the filename included at the end of the full path ?

    I have just tested that adding '___' before the .jpg extension made the file 'missing';

     

    So, if you can batch rename the full path in the media table, the search for missing file will show all duplicates and you'll simply select all and delete. I have tested that files renamed by sqlite like explained above are indeed considered missing and can be deleted cleanly by the Organizer without problems for tags, stacks or the like.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 10, 2012 8:27 AM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

     

    What I discovered was, without any modification to my catalog, there were 649 entries in the tag table, 305 of which appeared not to be associated with any media.  A number of these are the import type tags.  Therefore, I conclude that adding a few additional orphaned import tags are not likely to cause any problems.  (Either that, or I've managed to completely bugger my catalog somehow!)

    Tags are not only used with media, they are also used for metadata. (I just had a look at the PSE6 catalog on my laptop). There are 3 tables : tag_table, tag_to_media_table, tag_to_metadata_table.

     

    I don't master sql queries like you, but I believe the organizer does a good job of deleting everything related to missing files. So, if you can fool the Organizer to believe 'marked' files are missing, that would be great.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 10, 2012 1:04 PM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    I found that nearly every item is indeed tagged.

     

    A tag entry looks like this:

     

    106616|925|import 2012-07-19T21:45:16|50|194|import|0|1|1|0|0

     

    The 3rd col ("import...") is the tag name, and the type name is "import".  My guess is that this is some sort of hidden tag used by PSE to manage import batches.  Is that correct?

    Yes, although it's not quite "hidden" -- you can set the Organizer's display mode to "Import Batch" and those tags are used to sort the thumbnails.

     

     

    _c_h_r_i_s_ wrote:

     

    Would there be any negative ramifications of leaving orphaned tags in place, or do I need to take care to remove any that refer only to the images I am deleting?

    Off the top of my head, I can't think of any negative ramifications -- any Organizer code that uses one of those "orphaned" tags should simply return an empty set.  On the other hand, my personal bias is that if I was doing all that SQL work to delete entries, I might as well do my best to get as clean a database as I can to reduce (or eliminate) the possibility of unintended consequences.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 10, 2012 1:11 PM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    What I discovered was, without any modification to my catalog, there were 649 entries in the tag table, 305 of which appeared not to be associated with any media.  A number of these are the import type tags.  Therefore, I conclude that adding a few additional orphaned import tags are not likely to cause any problems.

    Sounds good to me.

     

    However, since you have the image ids of your proposed records deletions stored in your "xxx" table, it would be straightforward to check each of the other tables for the presence of those image ids.  Then, it's your choice whether or not to delete the records from those other tables, too.  My preference would be to do that.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 10, 2012 11:07 PM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    In answer to my own question about orphaned tags, the following query will find any orphaned tags:

     

    select tagids.tag_id

    from tag_table as tt

    left outer join (select distinct(tag_id) from tag_to_media_table)  as tagids

              on tagids.tag_id = tt.id

    where tagids.tag_id is null;

    That query will find unused tags.  "Orphaned" tags would be tags existing in the tag_to_media table whose media_id is not in the media_table table.

     

    SELECT *

    FROM tag_to_media_table

    WHERE media_id NOT IN (

      SELECT id

      FROM media_table

      )

     

    should do it.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 11, 2012 2:12 AM   in reply to photodrawken

    photodrawken wrote:

     

    However, since you have the image ids of your proposed records deletions stored in your "xxx" table, it would be straightforward to check each of the other tables for the presence of those image ids.  Then, it's your choice whether or not to delete the records from those other tables, too.  My preference would be to do that.

     

    Ken

     

    I Also think that using the 'integrity checks' of the sqlite database (quick or complete) or the repair and optimize functions of the organizer might help in cleaning the database.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 11, 2012 2:39 PM   in reply to MichelBParis

    MichelBParis wrote:

     

    I Also think that using the 'integrity checks' of the sqlite database (quick or complete) or the repair and optimize functions of the organizer might help in cleaning the database.

    Absolutely right.  I'm not sure how useful it really is, but running

     

    PRAGMA integrity_check

     

    after all the work is done is a Good Thing to do.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 11, 2012 3:05 PM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    At this point, it appears my work has been successful.  The catalog is running fine, all my images appear to be intact, the duplicates are gone, and there seem to be no further references to c:\My Documents... in the catalog.

    Wahoo!  C'mon, Chris, you've gotta be happy.   Congratulations for persevering on this and your success!

     

     

    _c_h_r_i_s_ wrote:

     

    However, I do have a further question:  what is the purpose of the media_stack_sort_table and the version_stack_sort_table ?  Both of these have a media_id column that refers to media_table items that do not exist.  My operations probably added additional entries pointing to missing items, but without knowing what the tables do, it's hard to know if it matters.

    I can't figure out what those tables are used for.  Sorry.

     

    _c_h_r_i_s_ wrote:

     

    PS PRAGMA integrity_check turned up nothing interesting.

    Yeah, reading the SQLite documentation about it didn't sound as if it would return anything useful for this purpose.

     

    Ken

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 12, 2012 1:35 AM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    At this point, it appears my work has been successful.  The catalog is running fine, all my images appear to be intact, the duplicates are gone, and there seem to be no further references to c:\My Documents... in the catalog.

     

    I will not post all my queries here unless someone wants them, as it would be a lengthy and technically complex posting.

     

    However, I do have a further question:  what is the purpose of the media_stack_sort_table and the version_stack_sort_table ?  Both of these have a media_id column that refers to media_table items that do not exist.  My operations probably added additional entries pointing to missing items, but without knowing what the tables do, it's hard to know if it matters.

     

    Thanks for all your help!

     

    PS PRAGMA integrity_check turned up nothing interesting.

    Congrats for your efforts and great success

    !

    The media_stack_sort_table is probably used with the 'Find by Media type' search menu. The version_stack_sort table is probably used to show thumbnails in conjunction with the choice to show stacks/version collapsed or not.

     
    |
    Mark as:
  • Currently Being Moderated
    Aug 12, 2012 1:09 PM   in reply to _c_h_r_i_s_

    _c_h_r_i_s_ wrote:

     

    I boring integrity check is much better than an "interesting" one!

    You got that right.  When it comes to databases, boring is beautiful!

     

    Ken

     
    |
    Mark as:

More Like This

  • Retrieving data ...

Bookmarked By (0)

Answers + Points = Status

  • 10 points awarded for Correct Answers
  • 5 points awarded for Helpful Answers
  • 10,000+ points
  • 1,001-10,000 points
  • 501-1,000 points
  • 5-500 points