Identifying duplicate songs, artists, and albums on Rap Genius: not so easy
by Tom Lehman
Before you can prevent users from adding duplicate artists to Rap Genius, you have to be able to identify duplicate artists.
This is harder than it sounds – you can’t just look for duplicate artist names because different users will invariably use different names for the same artist. For example, Lil' Wayne and Lil Wayne (i.e., with and without the apostrophe) both refer to the same person.
So different names do not necessarily imply different artists. The question is: how different do two names have to be before we can infer that they refer to different things?

String “Essence”

The apostrophe case gives us a start: two names must differ by more than just punctuation – specifically, they must differ by at least one letter or number – in order to refer to different things.
How do we use this fact to help us detect duplicates? Rather than comparing two names directly, we compare their essences – where a name’s essence is calculated by lowercasing it and removing every character that is not a letter or a number.
Here’s a Ruby implementation of this idea:

    class String
      def essence
        strip.
          downcase.
          gsub(/[^a-z0-9]/, '')   # keep only letters and digits
      end
    end

Here’s how this works:

    "Lil' Wayne".essence                           # => "lilwayne"
    "Lil' Wayne" == "Lil Wayne"                    # => false
    "Lil' Wayne".essence == "Lil Wayne".essence    # => true

The names aren’t equal, but their essences are, so they refer to the same thing.

Improving Essence

Does our implementation of essence completely capture the idea of “sameness”? Is it possible for two names to contain different alphanumeric characters and yet still refer to the same thing? Anything is possible when you’re dealing with natural language!
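A quick way to see the gap is to run the punctuation-only version over a few pairs that plainly name the same act. The sketch below restates that first version as a standalone method (basic_essence and same_by_basic_essence? are names made up for this example, chosen so they do not clash with String#essence) and reports which pairs it considers equal.

```ruby
# The punctuation-only essence, restated as a plain method.
# basic_essence is a hypothetical name used only in this sketch.
def basic_essence(name)
  name.strip.downcase.gsub(/[^a-z0-9]/, '')
end

# True when two names have the same punctuation-only essence.
def same_by_basic_essence?(a, b)
  basic_essence(a) == basic_essence(b)
end

pairs = [
  ["Lil' Wayne", "Lil Wayne"],
  ["Puff Daddy & The Family", "Puff Daddy and The Family"],
  ["Nipsey Hu$$le", "Nipsey Hussle"],
  ["The Hot Boyz", "The Hot Boys"]
]

pairs.each do |a, b|
  puts "#{a.inspect} vs #{b.inspect}: #{same_by_basic_essence?(a, b)}"
end
```

Only the apostrophe pair comes out equal. The other three differ by real letters or digits once punctuation is stripped, so they need extra rules.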
Here are some modifications to String#essence I made based on experimenting with real-world data:
“Puff Daddy & The Family” refers to the same thing as “Puff Daddy and The Family”.
So essence should treat “&” and “and” the same. Similarly, essence should treat “+” and “plus” the same.
“Nipsey Hu$$le” should equal “Nipsey Hussle”.
Essence should treat “$” and “s” the same (at least in the case of rap names).
“The Hot Boyz” should equal “Hot Boyz”.
Essence should ignore leading articles.
“The Hot Boyz” should equal “The Hot Boys”.
Essence should treat a trailing “z” the same as a trailing “s”.
“Da Hot Boyz” should equal “The Hot Boyz”.
Essence should treat “Da” (and “Tha”) like “The” (again, at least for rap names).

The finished product

Here’s a version of String#essence that accounts for these observations. Am I missing anything?

    class String
      def essence
        strip.
          downcase.
          gsub('&', 'and').                    # "&" means "and"
          gsub('$', 's').                      # "Hu$$le" -> "hussle"
          gsub('+', 'plus').                   # "+" means "plus"
          gsub(/\bda\b/i, 'the').              # "da" as a whole word -> "the"
          gsub(/([a-y\-])z\b/i, '\1s').        # word-final "z" -> "s" ("boyz" -> "boys")
          sub(/^(th[ea]|a|an)\s+/i, '').       # drop a leading "the", "tha", "a" or "an"
          gsub(/[^a-z0-9]/, '')                # keep only letters and digits
      end
    end

(Note: Essence doesn’t help you at all in the Tupac / 2Pac case. For this you have to manually compile a list of alternate names for each artist! lol!)
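To make the duplicate check and the alternate-names idea concrete, here is a minimal sketch of how the two could fit together. It is an illustration under stated assumptions, not how Rap Genius actually does it: essence_of repeats the steps of the finished String#essence as a standalone method, and ALIASES, canonical_key and find_duplicates are hypothetical names invented for this example. The alias table holds one hand-written entry.

```ruby
# Sketch only: essence_of, ALIASES, canonical_key and find_duplicates are
# names invented for this example.

# Same steps as the finished String#essence, as a standalone method.
def essence_of(name)
  name.strip.
    downcase.
    gsub('&', 'and').
    gsub('$', 's').
    gsub('+', 'plus').
    gsub(/\bda\b/i, 'the').
    gsub(/([a-y\-])z\b/i, '\1s').
    sub(/^(th[ea]|a|an)\s+/i, '').
    gsub(/[^a-z0-9]/, '')
end

# Hand-compiled alternate names: essence of the alternate name mapped to
# the essence of the preferred name. The single entry is illustrative.
ALIASES = {
  essence_of('Tupac') => essence_of('2Pac')
}.freeze

# The key used for comparison: the essence, redirected through ALIASES
# when the name is a known alternate.
def canonical_key(name)
  key = essence_of(name)
  ALIASES.fetch(key, key)
end

# Groups names that share a canonical key and keeps only groups with
# more than one member, i.e. the likely duplicates.
def find_duplicates(names)
  names.group_by { |name| canonical_key(name) }.
    values.
    select { |group| group.size > 1 }
end

names = ["Lil' Wayne", 'Lil Wayne', 'Tupac', '2Pac',
         'Da Hot Boyz', 'The Hot Boys', 'Nas']

find_duplicates(names).each { |group| puts group.join(' | ') }
```

With this input the sketch reports three groups: the two Lil Wayne spellings, Tupac with 2Pac, and Da Hot Boyz with The Hot Boys. Nas stands alone and is left out. Keying the alias table by essence rather than by raw name means one entry also covers punctuation and capitalization variants of the alternate name.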