30 Replies Latest reply: Aug 12, 2012 1:09 PM by photodrawken RSS

    PSE10 Remove Duplicate Photos

    _c_h_r_i_s_

      I am running PSE10 on Win 7, and have found myself with a bit of a mess I'd like to clean up.  My Catalog has many hundred duplicate photos.  These are pairs of items that point to the same physical photo via a different path.  All follow the pattern of C:\Documents and Settings\Chris\My Documents\...   and  C:\Users\Chris\Pictures...

       

      I have tried using the Search for duplicate photos function, which does identify the duplicates, but it also finds tons of false positives, photos that merely look alike but are not actual duplicates.  Sorting through these is not feasible.

       

      I could perhaps identify the My Documents versions by deleting the shortcut to My Documents in the OS and seeing which images need to be reconnected.  However my tags are all attached to the C:\Documents and Settings versions.

       

      I'm looking for a good way to identify actual duplicate photos in the catalog, rather than merely similar ones.

       

      The workflow I imagine is: identify duplicate photos | transfer tags to C:\Users version | delete C:\My Documents version.

       

      Anyone have any suggestions?

       

      Thanks,

       

      Chris

       

      PS  I'll note that I am experienced with databases and sql, on the off chance the catalog is really a simple sqlite database that I can manipulate.

        • 1. Re: PSE10 Remove Duplicate Photos
          photodrawken Community Member

          Chris,

           

          Just so I understand:

           

          Are you saying that your photos exist in one and only one place on your hard drive, but that the Organizer has two thumbnails for each one, each thumbnail pointing to a different location?

           

          Ken

          • 2. Re: PSE10 Remove Duplicate Photos
            _c_h_r_i_s_

            Yes, that's preciesely the situation.  I have verified that there are no duplicate files using an external tool, and renaming a sample file causes both copies in the catalog to become disconnected.

            • 3. Re: PSE10 Remove Duplicate Photos
              99jon MVP

              You should really have used backup and restore rather than copying the old catalog based on the XP system structure to Windows 7.

               

              http://helpx.adobe.com/photoshop-elements/kb/backup-restore-move-catalog-photoshop.html

               

               

               

              • 4. Re: PSE10 Remove Duplicate Photos
                photodrawken Community Member

                Ooooof!  That's one messed up database, for sure.

                 

                Using that sample file that now has both copies listed as "missing", when you restore its name to the correct one (in its correct directory), what happens if you do a File...Reconnect...Missing File on that file?

                 

                This may not something that we're going to solve quickly.  If you can afford to spend a couple of days discussing options, I'd like to hear the opinions of others here. 

                 

                The Organizer's main catalog file (catalog.pse10db) is an unencrypted, non-password protected SQLite database file.  If there isn't an easy solution using the Organizer, then running SQL queries certainly can be an option.  If you don't have a favourite SQL tool that can connect to SQLite, I recommend the free Database.Net tool:

                http://fishcodelib.com/Database.htm

                "Installation" is ridiculously easy -- extract the executable to a directory of your choosing and make a Desktop shortcut to it.

                 

                In preparation, use Windows Explorer to make a file copy of your entire catalog directory (including subdirectories) somewhere for safekeeping....

                 

                Ken

                • 5. Re: PSE10 Remove Duplicate Photos
                  MichelBParis MVP

                  See my answer on your previous post.

                  The solution will be to do a full backup, then a restore to a new custom location.

                  In the backup process, all files in the catalog should be copied, which means one file for the right folder path and the same physical file once again with its shortcut path. This will create real duplicates. If you restore to a custom location on a new master folder, the new folder tree will include different subfolders outside the shortcuts known by the OS.

                  Then, the visual search should work to find duplicates you can stack and flatten afterwards.

                  To make a quicker search, I would have a look at the location of suspected duplicates to see if you can limit the visual search to given subfolders and not to the whole database. Assigning a tag such as 'possible dup' to the files in a subfolder and the corresponding shortcut folder would help a lot.

                  • 6. Re: PSE10 Remove Duplicate Photos
                    photodrawken Community Member

                    Michel,

                     

                    I'm glad you responded to this, because I was also thinking that a backup and restore should be the first thing to try.

                     

                    If I understand what you're saying, it's this:

                    1. Make a full PSE backup to an external drive.
                    2. Delete the images from the internal drive.  (Backing them up separately, first, of course.)
                    3. Delete the existing catalog.
                    4. Restore from that PSE backup to a custom location, such as C:\Catalogs for the catalog and C:\Pictures for the images.

                     

                    Although that would ensure the photos are not duplicated on the drive, I'm not sure that the duplicates would be removed from the catalog entries.  What do you think?

                     

                    Ken

                    • 7. Re: PSE10 Remove Duplicate Photos
                      MichelBParis MVP

                      photodrawken wrote:

                       

                      Michel,

                       

                      I'm glad you responded to this, because I was also thinking that a backup and restore should be the first thing to try.

                       

                      If I understand what you're saying, it's this:

                      1. Make a full PSE backup to an external drive.
                      2. Delete the images from the internal drive.  (Backing them up separately, first, of course.)
                      3. Delete the existing catalog.
                      4. Restore from that PSE backup to a custom location, such as C:\Catalogs for the catalog and C:\Pictures for the images.

                       

                      Although that would ensure the photos are not duplicated on the drive, I'm not sure that the duplicates would be removed from the catalog entries.  What do you think?

                       

                      Ken

                      Steps 2 and 3 will be useful unless you have plenty of disk space.

                       

                      The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

                      • 8. Re: PSE10 Remove Duplicate Photos
                        99jon MVP

                         

                        I would have thought there may possibly be a database tool which could extract the details into a table.

                         

                        Unfortunately I don’t know one.

                         

                        If the table could be sorted it may be possible to delete all the lines with Documents & settings/My Documents and then resave the file. Then replacing the existing db file with the newly saved version.

                        It’s worth a shot.

                         

                         

                        • 9. Re: PSE10 Remove Duplicate Photos
                          MichelBParis MVP

                          I think in the situation of a restored catalog with duplicate files, the best solution would be to use an external  utility to find duplicates. The duplicates will have the same file name, size and creation date, they should be easy to find. If you delete them from the disk via the software, you'll only have to find for missing files and delete them from the catalog.

                          I don't know of such softwares myself, but I have seen many threads about them.

                          • 10. Re: PSE10 Remove Duplicate Photos
                            photodrawken Community Member

                            MichelBParis wrote:

                             

                            The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

                            Ah, I understand what you're saying -- each duplicated entry in the database will cause the backup process to backup the same referenced file again.

                             

                            So, doing a backup and restore will only make the problem worse: 

                            • It will create duplicate image files on the drive whereas now there is only one image file.

                             

                            Ken

                            • 11. Re: PSE10 Remove Duplicate Photos
                              photodrawken Community Member

                              99jon wrote:

                               

                               

                              I would have thought there may possibly be a database tool which could extract the details into a table.

                               

                              Unfortunately I don’t know one.

                               

                              If the table could be sorted it may be possible to delete all the lines with Documents & settings/My Documents and then resave the file. Then replacing the existing db file with the newly saved version.

                              It’s worth a shot.

                               

                               

                               

                              That's pretty much the approach I'm leaning toward right now.

                               

                              The complicating matter is the tags and album data.  Chris will have to give us the details, but if the "incorrect" file location data is the only data that has tags and albums, there will be some fancy SQL footwork needed to attach the tags and albums to the "correct" file location data.

                               

                              In general, the procedure would be to:

                              1. Create a temporary table and populate it with the "incorrect" file location data (which contains correct reference keys to tags and albums) from the main media table.
                              2. Use that temporary table to attach the tags and album reference keys to the "correct" file location data in the main media table.
                              3. Use that temporary table to delete the "incorrect" file location records references from all the other tables.
                              4. Use that temporary table to delete the "incorrect" file location records from the main media table.
                              5. Delete the temporary table.

                               

                              That can be done without creating a temporary table, but the subsequent SQL queries would contain somewhat hellish WHERE clauses.  All this can be done in a straightforward manner using SQL queries in Database.Net.

                               

                              Ken

                              • 12. Re: PSE10 Remove Duplicate Photos
                                MichelBParis MVP

                                photodrawken wrote:

                                 

                                MichelBParis wrote:

                                 

                                The restore will create real duplicate files with different entries in the catalog, instead of single files with two entries in the catalog. That's the situation when searching by visual similarity will be useful to stack duplicates, then flatten those stacks.

                                Ah, I understand what you're saying -- each duplicated entry in the database will cause the backup process to backup the same referenced file again.

                                 

                                So, doing a backup and restore will only make the problem worse: 

                                • It will create duplicate image files on the drive whereas now there is only one image file.

                                 

                                Ken

                                No, the problem is different, and far from worse... As a matter of fact, having duplicates files wastes disk space, and that's about all for practical purposes. Yes, it's not satisfying for the mind when you want everything to be tidy and in order, but considering the initial mess in the present situation, that's not a big deal. On the contrary, having several entries in the database for the same file is like a delayed bomb : it will necessarily create big problems.

                                 

                                You can solve a part of the problem with the database, but deleting outside PSE an entry in the media table has a lot of accessory consequences on version sets, stacks and albums.

                                 

                                I prefer to use the built-in feature of PSE : backup and restore when I can clearly see the consequences, and I would prefer an external file duplicates eradication, knowing that I can easily find and delete the missing entries from within Elements.

                                 

                                To summarize, I would simply do the backup and restore and live with the duplicates.

                                And, as you know I am curious, if I have some free time, I would test cleaning the duplicates as explained above (after all, I'll have the backup if anything goes wrong...)

                                • 13. Re: PSE10 Remove Duplicate Photos
                                  _c_h_r_i_s_

                                  Ok, I see how this could work.  I am looking in the catalog now (via sqlite), and there is a table called media_table.  That stores the file path to each file (among other things), so I can identify all my problematic media easily.  I should be able to use that to find the two parallel copies, then I can delete the c:\user... version, and finally alter the paths of the c:\my documents version.

                                   

                                  I'll note that, as far as I know, none of the entries I want to delete have tags or are in stacks.

                                   

                                  Well, if I have some time tomorrow, I may take a stab at it.  Looks risky but doable.  I'll let you guys know how things go.

                                  • 14. Re: PSE10 Remove Duplicate Photos
                                    photodrawken Community Member

                                    Chris,

                                     

                                    If you're saying that the media to be deleted are not the media that have the tags and albums associated, then the problem is simpler.  But, you need to look at the tag_table table to see if any of its media_id values are the same as the id value in the media_table that you're intending to delete.

                                     

                                    In other words, it's not as simple as just deleting rows from the media_table table -- it's a relational database and when you're finished, you want the correct relations between tables, as well as no orphaned entries.

                                     

                                    Ken

                                    • 15. Re: PSE10 Remove Duplicate Photos
                                      photodrawken Community Member

                                      MichelBParis wrote:

                                       

                                      On the contrary, having several entries in the database for the same file is like a delayed bomb : it will necessarily create big problems.

                                       

                                      You can solve a part of the problem with the database, but deleting outside PSE an entry in the media table has a lot of accessory consequences on version sets, stacks and albums.

                                      I agree with both statements.

                                      MichelBParis wrote:

                                       

                                      I prefer to use the built-in feature of PSE : backup and restore when I can clearly see the consequences, and I would prefer an external file duplicates eradication, knowing that I can easily find and delete the missing entries from within Elements.

                                      OK, I see what you're saying:  if the backup and restore creates two sets of images on the hard drive corresponding to two sets of image data in the catalog, and one uses an external method to delete one set of image files, then the "Reconnect All Missing Files" dialog box allows one to select all the missing files and delete them from the catalog.  That's a great solution.

                                       

                                      Ken

                                      • 16. Re: PSE10 Remove Duplicate Photos
                                        _c_h_r_i_s_

                                        Hi,

                                         

                                        I've made some good progress using sqlite to read the PSE10 catalog.  Using the query at the bottom of this message, I was able to identify all my duplicate media, and write the media ids that I would like to delete to a temporary table called xxx.  Now I am investigating whether the media I want to delete has any tags (the catalog shows none in PSE), and, to my dismay, I found that nearly every item is indeed tagged.

                                         

                                        A tag entry looks like this:

                                         

                                        106616|925|import 2012-07-19T21:45:16|50|194|import|0|1|1|0|0

                                         

                                        The 3rd col ("import...") is the tag name, and the type name is "import".  My guess is that this is some sort of hidden tag used by PSE to manage import batches.  Is that correct? 

                                         

                                        Would there be any negative ramifications of leaving orphaned tags in place, or do I need to take care to remove any that refer only to the images I am deleting?

                                         

                                        Thanks,

                                         

                                        Chris

                                         

                                        =====

                                         

                                        Query to generate xxx table:

                                         

                                        create table xxx (media_id int);

                                         

                                        select x.id

                                        from (select *, replace(full_filepath, '/Users/Chris/Pictures/','') as n from media_table where full_filepath like '/Users/Chris/Pictures/%') as x

                                        join (select *, replace(full_filepath, '/Documents and Settings/Chris/My Documents/My Pictures/','') as n from media_table where full_filepath like '/Documents and Settings/Chris/%') as y

                                        on x.n = y.n

                                        where x.id is not null and y.id is not null;

                                        • 17. Re: PSE10 Remove Duplicate Photos
                                          MichelBParis MVP

                                          Chris,

                                          Are you able to edit (change) something in the filename included at the end of the full path ?

                                          I have just tested that adding '___' before the .jpg extension made the file 'missing';

                                           

                                          So, if you can batch rename the full path in the media table, the search for missing file will show all duplicates and you'll simply select all and delete. I have tested that files renamed by sqlite like explained above are indeed considered missing and can be deleted cleanly by the Organizer without problems for tags, stacks or the like.

                                          • 18. Re: PSE10 Remove Duplicate Photos
                                            _c_h_r_i_s_

                                            I have not yet tried making any modifications to my catalog, but I do anticipate making the Win XP names into Win 7 names when the duplicates are gone.

                                             

                                            In answer to my own question about orphaned tags, the following query will find any orphaned tags:

                                             

                                            select tagids.tag_id

                                            from tag_table as tt

                                            left outer join (select distinct(tag_id) from tag_to_media_table)  as tagids

                                                      on tagids.tag_id = tt.id

                                            where tagids.tag_id is null;

                                             

                                            What I discovered was, without any modification to my catalog, there were 649 entries in the tag table, 305 of which appeared not to be associated with any media.  A number of these are the import type tags.  Therefore, I conclude that adding a few additional orphaned import tags are not likely to cause any problems.  (Either that, or I've managed to completely bugger my catalog somehow!)

                                            • 19. Re: PSE10 Remove Duplicate Photos
                                              MichelBParis MVP

                                              _c_h_r_i_s_ wrote:

                                               

                                               

                                              What I discovered was, without any modification to my catalog, there were 649 entries in the tag table, 305 of which appeared not to be associated with any media.  A number of these are the import type tags.  Therefore, I conclude that adding a few additional orphaned import tags are not likely to cause any problems.  (Either that, or I've managed to completely bugger my catalog somehow!)

                                              Tags are not only used with media, they are also used for metadata. (I just had a look at the PSE6 catalog on my laptop). There are 3 tables : tag_table, tag_to_media_table, tag_to_metadata_table.

                                               

                                              I don't master sql queries like you, but I believe the organizer does a good job of deleting everything related to missing files. So, if you can fool the Organizer to believe 'marked' files are missing, that would be great.

                                              • 20. Re: PSE10 Remove Duplicate Photos
                                                photodrawken Community Member

                                                _c_h_r_i_s_ wrote:

                                                 

                                                I found that nearly every item is indeed tagged.

                                                 

                                                A tag entry looks like this:

                                                 

                                                106616|925|import 2012-07-19T21:45:16|50|194|import|0|1|1|0|0

                                                 

                                                The 3rd col ("import...") is the tag name, and the type name is "import".  My guess is that this is some sort of hidden tag used by PSE to manage import batches.  Is that correct?

                                                Yes, although it's not quite "hidden" -- you can set the Organizer's display mode to "Import Batch" and those tags are used to sort the thumbnails.

                                                 

                                                 

                                                _c_h_r_i_s_ wrote:

                                                 

                                                Would there be any negative ramifications of leaving orphaned tags in place, or do I need to take care to remove any that refer only to the images I am deleting?

                                                Off the top of my head, I can't think of any negative ramifications -- any Organizer code that uses one of those "orphaned" tags should simply return an empty set.  On the other hand, my personal bias is that if I was doing all that SQL work to delete entries, I might as well do my best to get as clean a database as I can to reduce (or eliminate) the possibility of unintended consequences.

                                                 

                                                Ken

                                                • 21. Re: PSE10 Remove Duplicate Photos
                                                  photodrawken Community Member

                                                  _c_h_r_i_s_ wrote:

                                                   

                                                  What I discovered was, without any modification to my catalog, there were 649 entries in the tag table, 305 of which appeared not to be associated with any media.  A number of these are the import type tags.  Therefore, I conclude that adding a few additional orphaned import tags are not likely to cause any problems.

                                                  Sounds good to me.

                                                   

                                                  However, since you have the image ids of your proposed records deletions stored in your "xxx" table, it would be straightforward to check each of the other tables for the presence of those image ids.  Then, it's your choice whether or not to delete the records from those other tables, too.  My preference would be to do that.

                                                   

                                                  Ken

                                                  • 22. Re: PSE10 Remove Duplicate Photos
                                                    photodrawken Community Member

                                                    _c_h_r_i_s_ wrote:

                                                     

                                                    In answer to my own question about orphaned tags, the following query will find any orphaned tags:

                                                     

                                                    select tagids.tag_id

                                                    from tag_table as tt

                                                    left outer join (select distinct(tag_id) from tag_to_media_table)  as tagids

                                                              on tagids.tag_id = tt.id

                                                    where tagids.tag_id is null;

                                                    That query will find unused tags.  "Orphaned" tags would be tags existing in the tag_to_media table whose media_id is not in the media_table table.

                                                     

                                                    SELECT *

                                                    FROM tag_to_media_table

                                                    WHERE media_id NOT IN (

                                                      SELECT id

                                                      FROM media_table

                                                      )

                                                     

                                                    should do it.

                                                     

                                                    Ken

                                                    • 23. Re: PSE10 Remove Duplicate Photos
                                                      MichelBParis MVP

                                                      photodrawken wrote:

                                                       

                                                      However, since you have the image ids of your proposed records deletions stored in your "xxx" table, it would be straightforward to check each of the other tables for the presence of those image ids.  Then, it's your choice whether or not to delete the records from those other tables, too.  My preference would be to do that.

                                                       

                                                      Ken

                                                       

                                                      I Also think that using the 'integrity checks' of the sqlite database (quick or complete) or the repair and optimize functions of the organizer might help in cleaning the database.

                                                      • 24. Re: PSE10 Remove Duplicate Photos
                                                        _c_h_r_i_s_

                                                        Yes, of course your terminology is better.  I have no orphaned tags (your terminology), and will make sure I still have none when I am done.

                                                        • 25. Re: PSE10 Remove Duplicate Photos
                                                          photodrawken Community Member

                                                          MichelBParis wrote:

                                                           

                                                          I Also think that using the 'integrity checks' of the sqlite database (quick or complete) or the repair and optimize functions of the organizer might help in cleaning the database.

                                                          Absolutely right.  I'm not sure how useful it really is, but running

                                                           

                                                          PRAGMA integrity_check

                                                           

                                                          after all the work is done is a Good Thing to do.

                                                           

                                                          Ken

                                                          • 26. Re: PSE10 Remove Duplicate Photos
                                                            _c_h_r_i_s_

                                                            At this point, it appears my work has been successful.  The catalog is running fine, all my images appear to be intact, the duplicates are gone, and there seem to be no further references to c:\My Documents... in the catalog.

                                                             

                                                            I will not post all my queries here unless someone wants them, as it would be a lengthy and technically complex posting.

                                                             

                                                            However, I do have a further question:  what is the purpose of the media_stack_sort_table and the version_stack_sort_table ?  Both of these have a media_id column that refers to media_table items that do not exist.  My operations probably added additional entries pointing to missing items, but without knowing what the tables do, it's hard to know if it matters.

                                                             

                                                            Thanks for all your help!

                                                             

                                                            PS PRAGMA integrity_check turned up nothing interesting.

                                                            • 27. Re: PSE10 Remove Duplicate Photos
                                                              photodrawken Community Member

                                                              _c_h_r_i_s_ wrote:

                                                               

                                                              At this point, it appears my work has been successful.  The catalog is running fine, all my images appear to be intact, the duplicates are gone, and there seem to be no further references to c:\My Documents... in the catalog.

                                                              Wahoo!  C'mon, Chris, you've gotta be happy.   Congratulations for persevering on this and your success!

                                                               

                                                               

                                                              _c_h_r_i_s_ wrote:

                                                               

                                                              However, I do have a further question:  what is the purpose of the media_stack_sort_table and the version_stack_sort_table ?  Both of these have a media_id column that refers to media_table items that do not exist.  My operations probably added additional entries pointing to missing items, but without knowing what the tables do, it's hard to know if it matters.

                                                              I can't figure out what those tables are used for.  Sorry.

                                                               

                                                              _c_h_r_i_s_ wrote:

                                                               

                                                              PS PRAGMA integrity_check turned up nothing interesting.

                                                              Yeah, reading the SQLite documentation about it didn't sound as if it would return anything useful for this purpose.

                                                               

                                                              Ken

                                                              • 28. Re: PSE10 Remove Duplicate Photos
                                                                _c_h_r_i_s_

                                                                I boring integrity check is much better than an "interesting" one!

                                                                • 29. Re: PSE10 Remove Duplicate Photos
                                                                  MichelBParis MVP

                                                                  _c_h_r_i_s_ wrote:

                                                                   

                                                                  At this point, it appears my work has been successful.  The catalog is running fine, all my images appear to be intact, the duplicates are gone, and there seem to be no further references to c:\My Documents... in the catalog.

                                                                   

                                                                  I will not post all my queries here unless someone wants them, as it would be a lengthy and technically complex posting.

                                                                   

                                                                  However, I do have a further question:  what is the purpose of the media_stack_sort_table and the version_stack_sort_table ?  Both of these have a media_id column that refers to media_table items that do not exist.  My operations probably added additional entries pointing to missing items, but without knowing what the tables do, it's hard to know if it matters.

                                                                   

                                                                  Thanks for all your help!

                                                                   

                                                                  PS PRAGMA integrity_check turned up nothing interesting.

                                                                  Congrats for your efforts and great success

                                                                  !

                                                                  The media_stack_sort_table is probably used with the 'Find by Media type' search menu. The version_stack_sort table is probably used to show thumbnails in conjunction with the choice to show stacks/version collapsed or not.

                                                                  • 30. Re: PSE10 Remove Duplicate Photos
                                                                    photodrawken Community Member

                                                                    _c_h_r_i_s_ wrote:

                                                                     

                                                                    I boring integrity check is much better than an "interesting" one!

                                                                    You got that right.  When it comes to databases, boring is beautiful!

                                                                     

                                                                    Ken