
Photos Finding duplicate images in different formats?

Discussion in 'Photography, Art & Design' started by Mister_Tad, 9 Feb 2020.

  1. Mister_Tad

    Mister_Tad Will work for nuts Super Moderator

    Joined:
    27 Dec 2002
    Posts:
    14,085
    Likes Received:
    2,451
    I’m looking for something that will scan two locations for duplicate images and remove the duplicates from one, except they will be in different formats and sizes.

    I have a local image library that will be a mix of RAW, TIFF, and JPEGs in anything up to 20MP that has been synced to Google Photos.

    My Google Photos also has synced photos from the phones over the years, many of which won’t be in my local library, in their “High quality”.

    I’m moving away from Google Photos to Synology Moments, and have imported all of my local photos, but need a way to weed out the duplicates and only take the ones that are only located in Google Photos.

    Except the duplicates could be a 20MP ARW in one place and a 4MP JPG in the other.

    Lots of software from various locations, some of which look a bit dubious, claim to do this or something like it, but does anyone have something they’ve used and trust that they can recommend?
     
  2. adidan

    adidan Guesswork is still work

    Joined:
    25 Mar 2009
    Posts:
    19,804
    Likes Received:
    5,591
    Hm, not sure.

    I know CCleaner has a find duplicates option that you can tinker with to find duplicates by name, size or whatever.

    There'd still be some manual work involved though I guess.

    https://www.ccleaner.com/docs/cclea...-duplicate-files/changing-file-finder-options

    Those are the options they have, not sure it will be the best solution but it may be something.

    Edit: Sorry, not sure if it'll be any use as I don't know if it'll let you scan Google Photos as a location. Never tried that so not sure.
     
  3. bawjaws

    bawjaws Multimodder

    Joined:
    5 Dec 2010
    Posts:
    4,284
    Likes Received:
    891
    How are you identifying duplicates? Presumably it's a bit more involved than by filename (sans extension)?
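For what it's worth, the filename-only approach is cheap to try as a first pass before anything visual. Below is a minimal Python sketch, assuming the downloaded copies keep the camera's original base name (which is not guaranteed). The function names and sample filenames are made up for illustration.

```python
from pathlib import PurePath


def stem_key(name):
    """Filename without directory or extension, lowercased."""
    return PurePath(name).stem.lower()


def match_by_stem(library_names, google_names):
    """Split google_names into (probable_duplicates, google_only).

    A Google file counts as a probable duplicate when any library file
    shares its stem, e.g. DSC01234.ARW and DSC01234.JPG.
    """
    library_stems = {stem_key(name) for name in library_names}
    dupes, only = [], []
    for name in google_names:
        (dupes if stem_key(name) in library_stems else only).append(name)
    return dupes, only


# Hypothetical filenames, purely to show the shape of the result.
dupes, only = match_by_stem(
    ["DSC01234.ARW", "holiday/IMG_0001.tiff"],
    ["DSC01234.JPG", "IMG_9999.jpg"],
)
print("probable duplicates:", dupes)
print("only in Google:", only)
```

This says nothing about content, so renamed files are missed and reused names (a camera counter wrapping round) are flagged wrongly.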
     
  4. Mister_Tad

    Mister_Tad Will work for nuts Super Moderator

    Joined:
    27 Dec 2002
    Posts:
    14,085
    Likes Received:
    2,451
    Both locations will be local. I've downloaded the Google lot.

    Content only. Files, formats, dimensions, sizes... All different
     
  5. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    I've used findimagedupes (it's in the Debian repos) with considerable success before. digiKam can also do it through the Tools menu, but you have to add everything into digiKam's library for that to work.

    Both work on visual similarity, and don't care about different dimensions or formats (so long as the formats are ones they can read, of course!)
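To give a feel for how visual-similarity matching can ignore dimensions and formats, here is a minimal sketch of one common approach, an "average hash". This is not necessarily what findimagedupes or digiKam do internally. It assumes the image has already been decoded to rows of 0-255 grayscale values (something like Pillow could do that, not shown here), and `average_hash`, `hamming` and `gradient` are illustrative names.

```python
def average_hash(pixels, size=8):
    """Fingerprint a grayscale image given as a list of rows of 0-255 values.

    The image is shrunk to size x size cells by block averaging, then each
    cell becomes 1 if it is brighter than the mean of all cells, else 0.
    Assumes the image is at least `size` pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if cell > mean else 0 for cell in cells]


def hamming(a, b):
    """Number of positions at which two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def gradient(width, height):
    """Synthetic test image: brightness rises from left to right."""
    return [[255 * x // (width - 1) for x in range(width)] for _ in range(height)]


big = gradient(64, 64)                 # stands in for the full-size original
small = gradient(32, 32)               # same content at half the dimensions
mirrored = [row[::-1] for row in big]  # different content: left and right swapped

print(hamming(average_hash(big), average_hash(small)))     # 0: same fingerprint
print(hamming(average_hash(big), average_hash(mirrored)))  # 64: every bit differs
```

Because the fingerprint is computed from a tiny downscaled version, a resized or recompressed copy lands on the same (or nearly the same) bits, while a crop shifts the content between cells and can look very different.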

    EDIT:
    Hmm, maybe findimagedupes isn't as good as I remembered - it seems to think these four images are duplicates with a 99% confidence threshold...

    upload_2020-2-9_13-39-1.png
    upload_2020-2-9_13-39-13.png
    upload_2020-2-9_13-39-48.png
    upload_2020-2-9_13-40-4.png

    EDIT EDIT:
    Ah, maybe I was using this findimagedupes, not the one in the Debian repos...
     
    Last edited: 9 Feb 2020
  6. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    Oh, yeah, that's the stuff. Took a fraction of the time of the Perl version (and made the CPU fan *super* angry, 'cos it was balls to the wall with 16 workers) and spat out:

    Code:
    $ findimagedupes -R /home/blacklaw/Dropbox/Work/bittech/images
    /home/blacklaw/Dropbox/Work/bittech/images/synology-ds1515-2.jpg /home/blacklaw/Dropbox/Work/bittech/images/synology-ds1515.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/ad-block-plus.jpg /home/blacklaw/Dropbox/Work/bittech/images/eyeo-abp-logo.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/samsung-consumer-marketing.jpg /home/blacklaw/Dropbox/Work/bittech/images/samsung-consumer-marketing.xcf
    /home/blacklaw/Dropbox/Work/bittech/images/oculus-vr-nate-mitchell.jpg /home/blacklaw/Dropbox/Work/bittech/images/oculus-vr-nate-mitchell.png
    /home/blacklaw/Dropbox/Work/bittech/images/intel-otellini.jpg /home/blacklaw/Dropbox/Work/bittech/images/intelotellini.jpg
    
    Judging by the filenames, they're definitely duplicates - and the Samsung one shows it can compare between two formats (JPEG and XCF).

    Think it might have missed some, though - might run it again with a higher threshold value (which, ironically, seems to lower the comparison threshold...)

    EDIT:
    Okay, a threshold of 15 brought up a lot more - and while you can kinda see where it's coming from, these definitely aren't duplicates.

    upload_2020-2-9_13-52-36.png upload_2020-2-9_13-52-45.png

    Let's try a value of 5...
     
  7. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    Right, threshold of 5 has cut down things nicely - and picked up a couple of true duplicates that the default threshold of 0 missed:

    Code:
    $ findimagedupes -t 5 -R /home/blacklaw/Dropbox/Work/bittech/images
    /home/blacklaw/Dropbox/Work/bittech/images/synology-ds1515-2.jpg /home/blacklaw/Dropbox/Work/bittech/images/synology-ds1515.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/equifax-logo.jpg /home/blacklaw/Dropbox/Work/bittech/images/microsoft-windows-10-logo.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/ad-block-plus.jpg /home/blacklaw/Dropbox/Work/bittech/images/eyeo-abp-logo.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/gambling-man.jpg /home/blacklaw/Dropbox/Work/bittech/images/pexels-gambling.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/type-rider2.jpg /home/blacklaw/Dropbox/Work/bittech/images/typerider.jpg
    /home/blacklaw/Dropbox/Work/bittech/images/samsung-consumer-marketing.jpg /home/blacklaw/Dropbox/Work/bittech/images/samsung-consumer-marketing.xcf
    /home/blacklaw/Dropbox/Work/bittech/images/oculus-vr-nate-mitchell.jpg /home/blacklaw/Dropbox/Work/bittech/images/oculus-vr-nate-mitchell.png
    /home/blacklaw/Dropbox/Work/bittech/images/intel-otellini.jpg /home/blacklaw/Dropbox/Work/bittech/images/intelotellini.jpg
    
    However, there's still a false positive in there:

    upload_2020-2-9_13-57-8.png
    upload_2020-2-9_13-57-31.png

    In other words: you might want to do some fine-tuning with the threshold value, and double-check that duplicates it flags are really duplicates!

    EDIT:
    Annoyingly, dropping the threshold to 3 removes a true duplicate as well as the false positive; raising to 4 brings both back again.
     
  8. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    Eh, on second thoughts maybe neither of those is really up to the job.

    Created a directory with some sample images:

    Code:
    -rw-r--r-- 1 blacklaw blacklaw  1520612 Feb  9 14:10 raspberrypi4-2gb-left.jpg
    -rw-r--r-- 1 blacklaw blacklaw   178679 Feb  9 14:09 raspberrypi4-2gb-top.jpg
    -rw-r--r-- 1 blacklaw blacklaw 13407608 Feb  9 14:09 raspberrypi4-2gb-top.png
    -rw-r--r-- 1 blacklaw blacklaw  2707825 Feb  9 14:10 raspberrypi4-2gb-whitespacecrop.jpg
    -rw-r--r-- 1 blacklaw blacklaw  2225338 Feb  9 14:11 raspberrypi4-4gb.jpg
    
    They're all versions of the same image: the .png is 4000x2660, the .jpg is 2000x1330, the whitespacecrop is cropped to minimise whitespace but otherwise unmodified, left is a crop of just the left half of the image, and raspberrypi4-4gb.jpg is a wildcard: a near-identical but technically distinct picture of a 4GB Raspberry Pi 4.

    Threshold of 0: No results.
    Threshold of 5: Correctly marks the png and jpg as duplicates.
    Threshold of 10: No change.
    Threshold of 15: No change.
    Threshold of 20: Incorrectly adds "raspberrypi4-4gb.jpg" as a false positive.
    Threshold of 40: Correctly marks the whitespacecrop as a duplicate.
    Threshold of 60: Decides everything in the directory is a duplicate, even if I add a completely different picture.
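The pattern in that list can be sketched in a few lines of Python. This assumes the threshold acts as a maximum allowed distance between fingerprints, which fits the behaviour described but is an assumption about the tool. The distances are illustrative numbers chosen to reproduce the ordering above, not values measured from findimagedupes.

```python
# (description, illustrative distance, is it really a duplicate?)
# The distances are made up for illustration; only their ordering matters.
PAIRS = [
    ("top.png vs top.jpg (resized)", 3, True),
    ("top vs 4gb (different board)", 18, False),
    ("top vs whitespacecrop", 35, True),
    ("top vs unrelated picture", 58, False),
]


def sweep(pairs, thresholds):
    """Map each threshold to (true duplicates flagged, false positives flagged)."""
    results = {}
    for threshold in thresholds:
        flagged = [is_dupe for _, dist, is_dupe in pairs if dist <= threshold]
        hits = sum(flagged)
        results[threshold] = (hits, len(flagged) - hits)
    return results


for threshold, (hits, false_pos) in sweep(PAIRS, [0, 5, 20, 40, 60]).items():
    print(f"threshold {threshold:2d}: {hits} true, {false_pos} false")
```

When a false positive sits closer than a true duplicate, as with the 4GB board and the whitespace crop here, no single threshold catches one without the other, so a manual check of the flagged pairs is still needed.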

    Could still try digiKam, tho'.
     
  9. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    15,425
    Likes Received:
    3,011
  10. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    That only seems to hit on the original PNG and resized JPG - but the fact it loads a webpage to confirm the results is nice:

    upload_2020-2-9_14-28-46.png

    (Yeah, I typo'd "image" when I made the directory, lemme 'lone.)
     
    RedFlames likes this.
  11. Gareth Halfacree

    Gareth Halfacree WIIGII! Lover of bit-tech Administrator Super Moderator Moderator

    Joined:
    4 Dec 2007
    Posts:
    17,132
    Likes Received:
    6,728
    Had a go with Geeqie, but that has the same problem as findimagedupes - it's more confident that the 4GB Raspberry Pi is a duplicate of the 2GB Raspberry Pi than the 2GB Raspberry Pi with a bit of whitespace removed.

    upload_2020-2-9_14-44-59.png
     
  12. Mister_Tad

    Mister_Tad Will work for nuts Super Moderator

    Joined:
    27 Dec 2002
    Posts:
    14,085
    Likes Received:
    2,451
    So what you’re saying is, it needs a bottle of wine, a long playlist and some good old fashioned manual labour.

    :/
     
  13. adidan

    adidan Guesswork is still work

    Joined:
    25 Mar 2009
    Posts:
    19,804
    Likes Received:
    5,591
    Pretty much.

    Unless you know somebody who works for the Security Service who wants to run them through their systems.
     
  14. RedFlames

    RedFlames ...is not a Belgian football team

    Joined:
    23 Apr 2009
    Posts:
    15,425
    Likes Received:
    3,011
    *hands @Mister_Tad a corkscrew*

    Dave is busy.
     
  15. wolfticket

    wolfticket Downwind from the bloodhounds

    Joined:
    19 Apr 2008
    Posts:
    3,556
    Likes Received:
    646
    Speaking of which, I'm kinda surprised Google doesn't leverage its reverse image search magic to find likely duplicates in Google Photos, especially since it already has content aware search.
     
    adidan likes this.
  16. wolfticket

    wolfticket Downwind from the bloodhounds

    Joined:
    19 Apr 2008
    Posts:
    3,556
    Likes Received:
    646
    Last edited: 9 Feb 2020
  17. adidan

    adidan Guesswork is still work

    Joined:
    25 Mar 2009
    Posts:
    19,804
    Likes Received:
    5,591
    I'm surprised they've not turned that into a money spinner too.
     
  18. dynamis_dk

    dynamis_dk Grr... Grumpy!!

    Joined:
    23 Nov 2005
    Posts:
    3,762
    Likes Received:
    339
    I use Duplicate Cleaner Pro - https://www.digitalvolcano.co.uk/dcfeatures.html

    I haven't done exactly what you're asking, with two separate folders, but it does a great job of figuring out dups in the same folder where they are different size/resolution. Might be worth a look before you break out the bottle :)

    EDIT: So I was curious how well it compares folders. I created two folders, A & B, and copied 4 images into A. I then copied those 4 into B and edited the copies: resized each by a different reduction, and flipped one horizontally and another vertically. I ran a scan against the two folders and it did successfully match the duplicates against each other, producing 4 groups with options on which to keep, move, rename etc. Might be OK for what you're after, the downside being that it's paid if you want RAW/Pro format support.
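The two-folder comparison boils down to "keep whatever in B has no close match in A". Below is a minimal Python sketch of that step, assuming each file already has a 64-bit visual fingerprint stored as an integer. How Duplicate Cleaner actually does it is not known here, and the filenames and fingerprint values are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def only_in_second(first, second, max_distance=5):
    """Names in `second` whose fingerprint has no match in `first`.

    `first` and `second` map filename -> integer fingerprint. Two files
    count as the same picture when their fingerprints differ by at most
    `max_distance` bits.
    """
    keep = []
    for name, fingerprint in second.items():
        if not any(hamming(fingerprint, other) <= max_distance
                   for other in first.values()):
            keep.append(name)
    return keep


# Hypothetical fingerprints, purely for illustration.
local = {
    "DSC01234.ARW": 0xFFFF0000FFFF0000,
    "DSC01235.ARW": 0x0F0F0F0F0F0F0F0F,
}
google = {
    "DSC01234.jpg": 0xFFFF0000FFFF0001,        # one bit away from the ARW
    "phone_20180704.jpg": 0x123456789ABCDEF0,  # nothing similar locally
}

print(only_in_second(local, google))
```

This compares every file in one folder against every file in the other, which is fine for a sketch but slow for tens of thousands of files; real tools usually index the fingerprints instead.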
     
    Last edited: 10 Feb 2020
    Mister_Tad likes this.
  19. Mister_Tad

    Mister_Tad Will work for nuts Super Moderator

    Joined:
    27 Dec 2002
    Posts:
    14,085
    Likes Received:
    2,451
    Ooh, that sounds promising, I'll check it out.

    £24 is a small price to pay versus manually sifting through around 80k files if it works - given the quantity that's going to be well more than £24 in wine.
     
  20. Xlog

    Xlog Minimodder

    Joined:
    16 Dec 2006
    Posts:
    714
    Likes Received:
    80
    There is also a Windows program called VisiPics (free). It hasn't been updated in a while, but last time I used it, it worked quite well.
     
