Duplicate finder not finding many duplicates

dankie8948 · April 11, 2024, 2:55am

So I moved 3500 FLACs into a folder of 4300 MP3s in hopes of replacing the MP3s with my FLACs, but out of those 3500 FLACs only about 100 were found and marked as replacements for the MP3s. I started with a low tolerance and worked my way up to high, and that was the most I could replace. Some tracks that did not show up on the list have the exact same titles. Wondering if there is a better way to do this, as this is the primary feature I signed up to Lexicon to use.

dankie8948 · April 11, 2024, 3:30am

I forgot to run smart fixes and search for more tags, will update with new results soon

dankie8948 · April 11, 2024, 4:12am

So now I’ve run smart fixes and searched for tags, and it has not changed my effectiveness in finding duplicates. Decided to give the broken track scanner a try because I know at least one of my files has artifacting, but it didn’t seem to find that or any others. Software is nice, but I think I’m looking for something a bit more destructive that will take all day to complete a task if it has to.

Christiaan · April 11, 2024, 9:18am

If the artist and title are the same (or nearly the same), it should definitely find it as a duplicate. Do you have an example of 2 tracks that aren’t found?
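
To make "the same (or nearly the same)" concrete, here is a minimal Python sketch of comparing two tracks by artist and title with a similarity threshold. This is only an illustration using the standard library's `difflib`; it is not Lexicon's matching algorithm, and the function name `is_probable_duplicate` and the `threshold` value are hypothetical.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def is_probable_duplicate(track_a, track_b, threshold=0.9):
    """Compare two (artist, title) pairs.

    threshold is a hypothetical tolerance: 1.0 means identical after
    normalization, lower values accept looser matches.
    """
    artist_ratio = SequenceMatcher(
        None, _normalize(track_a[0]), _normalize(track_b[0])
    ).ratio()
    title_ratio = SequenceMatcher(
        None, _normalize(track_a[1]), _normalize(track_b[1])
    ).ratio()
    # Both fields have to be close enough for the pair to count.
    return artist_ratio >= threshold and title_ratio >= threshold
```

With a comparison like this, a track whose artist field is empty will not match a properly tagged copy of the same song, however similar the titles look.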

dankie8948 · April 11, 2024, 8:06pm

Certainly, here are a few “Happy - Ashanti” tracks that were not found.

Filename: Ashanti - Happy (Mainstream Edit).mp3
Title: Ashanti - Happy (Mainstream Edit)

Filename: Ashanti - Happy(Explicit).flac
Title: Happy

Filename: Ashanti - Happy.mp3
Title: Ashanti - Happy

I’m still working on getting clean versions of everything, and the other night I was working with a file renamer to really get rid of all the bracketed junk. The problem is I can’t push those changes to my Serato files without destroying the hierarchy, because it’s nowhere near as sophisticated as your program. I think I could clean up the data set myself if “replace with space” in smart fixes could work with a custom list of terms and the entire inputted string, to remove terms like (Official Music Video). I have a list of 60 terms I used to clean up a copy of the original data set of 4355. That let me use the batch file I wrote to list all the file names in a specific folder, and I used that list to batch download whatever soundizz could find, which was somewhere around 3500. I have found a few downloads that didn’t match the original set, but that’s a me problem. Thinking of trying to make a BERT or GPT NLP model to deal with that, but that’s currently above my skill level XD
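
For anyone wanting to try the custom term list idea outside Lexicon, a rough Python sketch of it could look like the following. It only works on plain strings; the `JUNK_TERMS` list and the function name `strip_junk_terms` are made up for illustration (the actual 60 terms are not listed in this thread), and it only removes a term when it appears inside round or square brackets.

```python
import re

# Hypothetical examples; substitute your own list of terms.
JUNK_TERMS = ["Official Music Video", "Official Video", "Lyrics"]


def strip_junk_terms(title, terms=JUNK_TERMS):
    """Remove bracketed junk terms such as "(Official Music Video)" from a title."""
    for term in terms:
        # Match the term inside () or [], ignoring case and inner padding.
        pattern = r"[\(\[]\s*" + re.escape(term) + r"\s*[\)\]]"
        title = re.sub(pattern, " ", title, flags=re.IGNORECASE)
    # Collapse the whitespace runs left behind by the removals.
    return " ".join(title.split())
```

Bracketed text that is not on the list, for example (Mainstream Edit), is left alone, so version information survives the cleanup.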

I got a bit long winded here, thanks to you for all the work you have done so far

Christiaan · April 12, 2024, 6:28am

Filename doesn’t matter, that isn’t used in the duplicate scan at all. You didn’t show me the artist fields, so maybe they are empty?
I think you need to run a smart fix, Extract Artist From Title, so your artist fields are populated properly. That will make the duplicate scanner work much better.
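
As a rough illustration of what extracting the artist from the title means, here is a small Python sketch. It assumes titles follow an "Artist - Title" pattern with a spaced dash, as in the examples above; it is not Lexicon's implementation, and the function name `split_artist_from_title` is hypothetical.

```python
def split_artist_from_title(title, artist=""):
    """Return (artist, title), moving a leading "Artist - " out of the title.

    Assumes the "Artist - Title" convention. Tracks that already have an
    artist, or whose title has no " - " separator, are returned unchanged.
    """
    if artist.strip():
        return artist, title
    left, separator, right = title.partition(" - ")
    if separator and left.strip() and right.strip():
        return left.strip(), right.strip()
    return artist, title
```

Applied to the title "Ashanti - Happy (Mainstream Edit)" with an empty artist field, this yields the artist "Ashanti" and the title "Happy (Mainstream Edit)".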

dankie8948 · April 12, 2024, 6:25pm

Thanks, I certainly haven’t tried that yet, so I’ll give it a shot and let you know how it works out

dankie8948 · April 13, 2024, 1:08am

Ok! Nice! I am finding a lot more now, found 600+ tracks even with tolerance set to none! You’ve got one heck of a brain on you my friend. Absolute legend

dankie8948 · April 13, 2024, 1:13am

Now I’m noticing that many of my duplicates are not defaulting to the new FLAC. Any way to adjust that behavior?

Also, since the syntax of my titles is all over the place and my “-” is sometimes unspaced, I’m not able to find or correct all my artists.

Do you know of any ways to extract the artist name based on a list of known artists? I think I can handle the unspaced “-” issue with some python (unless you’ve got a tool hiding in there somewhere)
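
One way the known-artists idea could be sketched in Python is below. It works on plain title strings only, and the `KNOWN_ARTISTS` list and the function name `extract_known_artist` are hypothetical placeholders. Because it matches the artist name first, it copes with both spaced and unspaced dashes and does not trip over a hyphen inside the artist name itself.

```python
import re

# Hypothetical list; replace with your own known artists.
KNOWN_ARTISTS = ["Ashanti", "Jay-Z"]


def extract_known_artist(title, known_artists=KNOWN_ARTISTS):
    """Return (artist, rest) if the title starts with a known artist and a dash.

    Returns (None, title) when no known artist is found at the start.
    """
    # Try longer names first so a longer name wins over its own prefix.
    for artist in sorted(known_artists, key=len, reverse=True):
        # Artist at the start, then a dash with or without spaces around it.
        pattern = r"^\s*" + re.escape(artist) + r"\s*-\s*(.+)$"
        match = re.match(pattern, title, flags=re.IGNORECASE)
        if match:
            return artist, match.group(1).strip()
    return None, title
```

For example, `extract_known_artist("Ashanti-Happy")` returns the artist "Ashanti" and the remaining title "Happy", while a title with no known artist at the start comes back unchanged with `None` as the artist.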

dankie8948 · April 13, 2024, 5:17am

Well, here’s what I came up with to fix the unspaced dash issue. I haven’t fully tested it yet and I don’t know if it’s going to break anything, so if anyone with this issue finds this code and wants to try it, be wary and absolutely DO NOT apply it to your main library yet. You will also need Python and have to pip install mutagen.

import os
from mutagen.id3 import ID3, TIT2


def add_single_space_around_hyphen(description):
    """
    Adds a single space around each hyphen in the given description.
    """
    # Note: this also splits hyphenated names, e.g. "Jay-Z" becomes "Jay - Z".
    modified_description = description.replace('-', ' - ')
    # Remove any extra spaces around hyphens
    modified_description = ' '.join(modified_description.split())
    return modified_description


def process_mp3_titles(directory_path):
    """
    Processes MP3 titles in the specified directory.
    """
    for root, _, files in os.walk(directory_path):
        for filename in files:
            if filename.lower().endswith('.mp3'):
                mp3_path = os.path.join(root, filename)
                try:
                    audio = ID3(mp3_path)
                    if 'TIT2' in audio:
                        mp3_description = audio['TIT2'].text[0]
                        modified_description = add_single_space_around_hyphen(mp3_description)
                        audio['TIT2'] = TIT2(text=[modified_description])
                        audio.save(mp3_path)
                        print(f"Modified title in '{filename}'")
                except Exception as e:
                    print(f"Error processing '{filename}': {e}")


if __name__ == "__main__":
    mp3_directory = r"D:\hyphon space insurance python tests"  # Replace with your actual directory
    process_mp3_titles(mp3_directory)

dankie8948 · April 13, 2024, 5:30am

Props on the undo button, just saved me hours haha

Christiaan · April 13, 2024, 6:58am

Lots of info in the manual: Find Duplicates

It takes FLAC first because it has higher bitrate and more cues in this case.

You can use the Replace Text recipe to replace the dash with a space dash space.