Improving Duplicate Scanner search results

Scythe · September 4, 2022, 10:23am

There are lots of duplicates that Lexicon isn’t picking up currently, even with search tolerance set to high. I’ll try to document as many examples as I can find in this thread. Some may be easy to tweak the searcher to catch while others might not be possible. But I just thought it would be good to document as many cases as possible in a single thread to keep things organised.

For anyone else looking to contribute to this list, ensure you provide the artist/title fields of both tracks and verify yourself that they aren’t considered a duplicate by running a scan with High search tolerance (put them in a small playlist to make scanning faster).

Angerfist and Miss K8 - Bogota (2020 Refix)
Angerfist & Miss K8 - Bogota (2020 Refix) (Original Mix)
Miss K8 & Angerfist - Bogotá (2020 Refix) (Original Mix)

There are a few things going on here, but “and” and “&” should be considered interchangeable for duplicate detection, as should accented and unaccented characters (“a” and “á”).
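To illustrate the accent part, here is a minimal Python sketch of accent folding using the standard `unicodedata` module. The function name `fold_accents` is made up for this example, and this is only one possible approach, not how Lexicon does it.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining accent marks removed.

    NFKD splits a character like "á" into "a" plus a combining accent,
    and the combining marks are then dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical usage: compare titles after folding accents and case.
same = fold_accents("Bogotá").casefold() == fold_accents("Bogota").casefold()
```

Letters that are not built from a base letter plus an accent (for example "ø") are left unchanged by this method, so it would not catch every variant spelling.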

Allowing a duplicate match with jumbled artist names “Miss K8 & Angerfist” vs “Angerfist & Miss K8” would also be nice, but maybe harder to implement

Miss K8 and Nolz - Elevate
Miss K8 & Mc Nolz - Elevate
Miss K8 & Nolz - Elevate

“&” and “and” issue again, but maybe add “MC” as a potentially ignorable value on higher search tolerances

Miss K8 vs. Angerfist - New World Order
Miss K8 & Angerfist - New World Order

“vs.” and “vs” could also be considered interchangeable with “and” and “&” within the artist field.

Miss K8 - Impact (Radio Edit)
Angerfist & Miss K8 - Impact (Radio Edit)

This might be harder to implement, but detecting a missing artist would be handy on higher tolerance levels to catch cases like this.

Christiaan · September 4, 2022, 10:22pm

Great suggestions!

What I’ll do is make the artist field a bit smarter where it will ignore the order of possible artists and handle “and” synonyms. This will be from low tolerance and up.
I’ll make it ignore " mc " in the artist field as well.
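
For readers following along, the idea can be sketched roughly like this in Python. This is only an illustration under assumptions, not Lexicon's actual code: the separator list, the `artist_key` name, and the choice to compare sets of names are all made up for the example.

```python
import re

# Assumed list of joiners treated as "and" synonyms (from the examples above).
_SEPARATORS = re.compile(r"\s+(?:and|&|vs\.?)\s+", re.IGNORECASE)
# Assumed ignorable prefix: a leading "MC" on an individual artist name.
_PREFIX = re.compile(r"^mc\s+", re.IGNORECASE)


def artist_key(artist: str) -> frozenset:
    """Split the artist field into names and return them as an unordered set."""
    names = set()
    for part in _SEPARATORS.split(artist.strip()):
        name = _PREFIX.sub("", part.strip()).casefold()
        if name:
            names.add(name)
    return frozenset(names)


# Order and joiner no longer matter, and "Mc" is ignored:
same_order = artist_key("Miss K8 & Angerfist") == artist_key("Angerfist and Miss K8")
same_mc = artist_key("Miss K8 & Mc Nolz") == artist_key("Miss K8 and Nolz")
```

One caveat of a sketch like this: an artist whose real name contains " and " (or " & ") gets split into two names. That is harmless as long as both tracks are split the same way, but it is a reason to treat the result as a comparison key only.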

The last one isn’t possible because the information is just not there, unless it is missing a very short artist name, then it could be considered a typo.

–
If anyone has more of these, keep them coming please!

Scythe · November 10, 2022, 9:56am

Been collecting some more examples of things that don’t match in the duplicate scanner that ideally should:

Daisy & Stormtrooper - Mindwalkers (Tymon Remix)
Daisy & Stormtrooper - Mindwalkers (Tymon RMX)

Would be good if we could treat RMX and Remix as the same word for duplicates

Gammer Feat. MC Storm - 21st Century Rush Master)
Gammer & MC Storm - 21st Century Rush-(Original Mix)

If we’re already ignoring ‘Feat.’ ‘MC’ and ‘&’ then the reason this isn’t being caught is the ‘Master)’ bit. That might be hard to catch as I doubt you can ignore ‘Master’ without the brackets without adding heaps of false positives, but I do think ‘(Master)’ can be ignored from the title field if not already
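
A rough Python sketch of this kind of title clean-up, purely as an illustration: the tag list, the synonym table and the `normalize_title` name are assumptions for the example, not anything taken from Lexicon.

```python
import re

# Assumed bracketed tags that carry no identifying information.
IGNORABLE_TAGS = {"original mix", "master"}
# Assumed word-level synonyms.
WORD_SYNONYMS = {"rmx": "remix"}

_PAREN = re.compile(r"\(([^()]*)\)")


def normalize_title(title: str) -> str:
    """Drop ignorable bracketed tags, unify synonyms, strip punctuation."""

    def keep_or_drop(match):
        tag = match.group(1).strip()
        # Only a complete "(...)" group is ever dropped.
        return " " if tag in IGNORABLE_TAGS else f" {tag} "

    text = _PAREN.sub(keep_or_drop, title.casefold())
    words = re.findall(r"[^\W_]+", text)  # runs of letters and digits
    return " ".join(WORD_SYNONYMS.get(word, word) for word in words)


rmx_matches = normalize_title("Mindwalkers (Tymon Remix)") == normalize_title("Mindwalkers (Tymon RMX)")
# The broken "Master)" has no opening bracket, so it is not removed:
master_matches = normalize_title("21st Century Rush Master)") == normalize_title("21st Century Rush-(Original Mix)")
```

Here `rmx_matches` is true while `master_matches` is false, which matches the concern above: only a complete “(Master)” can be dropped safely by a rule like this.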

Eclipse - 24-7 (Breeze And Styles Mix)
Eclipse - 24/7 (Breeze & Styles remix)

Assuming this doesn’t match because of the 24-7 vs 24/7 which might be too niche to make a rule for

Brisk - Airhead (Fracus & Darwin Remix)
DJ Brisk - Airhead (Fracus and Darwin Remix)

If we can ignore DJ in the artist field that would be good.

Technikore & Jts Feat. Niki Mak - Always (Extended Mix)
Technikore X Jts Feat. Niki Mak - Always (Mixed Cut)

Not sure if we’re ignoring ‘X’ in the artist field as a synonym for &. I’d ideally like ‘(Mixed Cut)’ ignored from titles as well, though I’ll admit it’s a less frequently used title tag

Rob IYF & Al Storm & Monster - Mutant Bass
Rob IYF & Monster - Mutant Bass

I think we kinda touched on this already, but for high-tolerance searches I think it would be really valuable to implement some way to catch examples like this, where one artist is missing from one of the duplicates. Thinking on this, there are a few ways to tackle it:

Levenshtein Distance or similar algorithm that gives a ‘distance’ between 2 strings where you could consider the artist fields a match if the Levenshtein Distance is less than X, though I have a feeling this on its own is going to catch way too many false positives with short artist values with any X value high enough to catch the above example. Maybe including the title as well would reduce false positives enough to make this useful as it would increase the distance between most non-duplicates but not duplicates
A slightly different approach: if you took the entire artist and title as an array of strings (with the irrelevant information removed), could it work to count the strings that match and consider it a match if at least N-X of them do, where N is the number of strings in the larger array and X is how many strings of difference you permit? So you’d have:
A: (Rob, IYF, Al, Storm, Monster, Mutant, Bass)
B: (Rob, IYF, Monster, Mutant, Bass)
So track A would be considered a match to track B if X was 2 or greater
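
Both ideas can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the function names and the tokenising rule (lowercase runs of letters and digits) are made up for illustration.

```python
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def tokens(text: str) -> Counter:
    """Lowercased words; punctuation such as '&' and '-' is discarded."""
    return Counter(re.findall(r"[^\W_]+", text.casefold()))


def token_match(a: str, b: str, max_missing: int) -> bool:
    """True if at least N - X tokens are shared (N = size of the larger side)."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    shared = sum((tokens_a & tokens_b).values())
    larger = max(sum(tokens_a.values()), sum(tokens_b.values()))
    return shared >= larger - max_missing


track_a = "Rob IYF & Al Storm & Monster - Mutant Bass"
track_b = "Rob IYF & Monster - Mutant Bass"

artist_distance = levenshtein("Rob IYF & Al Storm & Monster", "Rob IYF & Monster")
matches_with_2 = token_match(track_a, track_b, max_missing=2)
matches_with_1 = token_match(track_a, track_b, max_missing=1)
```

For this pair the artist-field edit distance is 11 (the length of the missing “Al Storm & ”), which shows how large the Levenshtein threshold would need to be. The token check matches with X = 2 but not with X = 1, as in the example above.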

Just brainstorming, so not sure if these are of any use. I know these would definitely result in more false positives, but given the way the duplicate scanner processes and removes duplicates so cleanly without breaking playlists, I wouldn’t consider removing duplicates from my collection in any other way. So I’m more than happy to spend the extra time going through the duplicate scanner results carefully to check every match if it means I’m going to catch more of them within my library.

Christiaan · November 23, 2022, 3:26pm

I’ve implemented most of your suggestions for the Medium tolerance in the next beta version.

Did not implement the 24/7 thing.

High and low tolerance have a fuzzy string compare (trigram search). For High the tolerance is set a bit higher for more positives.
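
For anyone curious what a trigram compare looks like in general, here is a generic Python sketch. It assumes a common formulation (Jaccard similarity over padded character trigrams); the padding, the function names and the threshold parameter are illustrative assumptions, not the actual implementation or settings.

```python
def trigrams(text: str) -> set:
    """Set of 3-character windows over the lowercased, space-padded text."""
    padded = f"  {text.casefold()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_fuzzy_match(a: str, b: str, threshold: float) -> bool:
    """Hypothetical helper: a lower threshold means more (and looser) matches."""
    return trigram_similarity(a, b) >= threshold


near = trigram_similarity("Eclipse - 24-7", "Eclipse - 24/7")
far = trigram_similarity("Eclipse - 24-7", "Brisk - Airhead")
```

In this sketch `near` comes out well above `far`, so a small difference like “24-7” vs “24/7” could pass a fuzzy compare without a dedicated rule, depending on where the threshold sits.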