Unexpected Data Cleaning

I’ve had to get creative with writing SQL queries to start the process of merging my Spotify data and my LastFM data. At first I thought it was going rather well if not very fast. I was definitely exercising my brain and strengthening my SQL and algorithm skills.

I was quite pleased and relieved upon realizing all the releases in a LastFM release-group–as well as all common recordings across them–shared the same Listeners and Playcount numbers.

What does all that gibberish mean? Check out my first post about Using the Last.fm and MusicBrainz APIs.

Think of an album. Any album. I’ll think of 13 by Black Sabbath because that will come up here in a minute. That album is released in multiple countries, even in a single country there are multiple versions of an album such as different covers, deluxe versions with bonus tracks, after a few years the remastered version, and so on. Each unique version is a “release” in the 13 “release-group.” What you and I would call “tracks,” MusicBrainz calls “Recordings.”

Table listing data from Last FM about the Black Sabbath album thirteen.
Figure 01: Data porn in the form of the “13” Release Group.

Anyway, all nineteen versions of 13 (Figure 01) are really all the same album with mostly all the same songs so, thankfully, LastFM gives them the same standard MBID and numbers. I’m relieved because they could have been pricks (as they’re known to be) and said more people listen to the version of “God Is Dead?” on vinyl than on the digital copy from iTunes but–again, thank God–they don’t. See what I did there? With the “thank God”? I am so clever.

Just in case you’re as aroused as I am looking at all the yummy goodness from MusicBrainz in Figure 01, here’s another screenshot.

13_ReleaseTracks.png
Figure 02: Just look at it! My imagination is running wild!

At first, I thought it would be messy kinda like Spotify, so I had Python grab every MB release for which LastFM had data and store it all in a JSON file (so I could review it and plan for my other data science and data visualization needs & wants).

PythonValidation.png
Figure 03: A “valid” release, for me, is one for which LastFM has data. There’s no reason to keep that data and waste time checking all of them every time.

I wanted to merge or at least somehow “link” the Popularity and Followers data from Spotify with the Listeners and Playcount data from LastFM for something resembling easy access. After deciding I only needed one set of stats per album and/or song from each release group, I told PHP to just grab the first release from each release-group for my database. 

python_JustFirstRelease.png
Figure 04: That argument in the if statement came after a lot of time and frustration caused by the fact that I didn’t know some release-groups didn’t have any releases. I know.

I added columns to my current albums table and created an albumsMB table with plans to merge them. To make that easier, I chose to make some temporary, redundant columns rather than convoluted JOINs and sub-queries. Then I made copies of those tables and ran some tests on those temporary copies. I have far too much data I love and am attached to — I am not going to risk losing it no matter how trivial the task is on which I’m working.

I played with two tables. The table of album info from MusicBrainz (below) and my existing table of album info from Spotify.

BS_albumsMB.png
Figure 05: Wherever there’s a Black Sabbath MBID in my albumsMB (album info from MusicBrainz) table, I added Black Sabbath’s Spotify ID. Easy.

I wondered–a lot–whether I really needed more columns from the data like “country”, “disambiguation,” etc. but all my SQL test queries worked so well so easily, I eagerly, perhaps hastily went in the “opposite” direction. I thought it was enough to tell PHP that wherever the album title and Spotify artist ID from the matched, add the album’s MBID to my existing albums table.

I was SO excited. I was finally going to have complete charts–with quantifiable data–for Black Sabbath that included all the Tony Martin era albums Spotify lacks. I was totally going to take screenshots and send them to Tony and he was going to be so grateful and we’d be best buds and I could move onto actual new features and stuff since my data was massaged and merged and yay!

I switched browser tabs to bask in my victory.

BS_albumsPopList.png
Figure 06: This page has things I’ll fix and temporary stuff I’ll remove once everything works.

I didn’t scroll down so I didn’t see this (Figure 06) whole thing. I didn’t notice there were actually three albums with LastFM data. I thought I’d not yet added code to the query that should populate the Listeners and Playcount columns but when I checked, I saw it was there and should be working. I don’t know which I saw first — those three rows with LastFM data or … this …

BS_albumsCropped.png
Figure 07: The albums table of Spotify album info. Only three rows received a MBID.

I was more confused than frustrated already by all the missing data in the albumMBID column when I noticed the two tables also contained very different albums.

Later: I wonder how the releases are ordered … but I can’t control the order they come in … regardless of how they are in their home database … I may have to use some surgical precision … with certain properties.

I should have anticipated that Tony Martin’s album MBIDs wouldn’t have a place to go but I also never noticed most of those Spotify albums include “(Remastered Edition)” in the title which meant most of the titles didn’t match. I wasn’t happy about that … that potentially meant some exhausting work with RegEx. I thought I might get off easy if I could change some existing columns to use FullText but I immediately thought of potential problems.

I half-heartedly started copying MBIDs and pasting them into the empty fields but then deleted them as I thought of … not only did I not want to do this manually for all my current (and future!) MB albums but what if I added columns later or found a more … accurate-ish, clean-ish solution. Which is what I started doing earlier and what I’m going to do now. Add more columns for the properties I mentioned earlier. One of the JSON files to which I referred when planning all this had no values for all the keys I thought of using. It looks like the most recent Black Sabbath data (from yesterday, FWIW) has values for most of them. I’ll need to add equivalent columns to the Spotify albums table, too.

In the back of my mind, I kept wondering why my Spotify albums list was so short. I keep adding and removing values from the “type” and “group” key in the PHP file that fetches Spotify album data. I’ll have to do that again so I can have more albums from each source so I can, hopefully, have more matches.

As I wrote this post and went to MusicBrainz to take screenshots, I saw this …

BS_MusicBrainz_Discography.png
Figure 08: So THERE are the albums that Spotify had but my MB/LFM data was missing!

Later, I also noticed the “Type: Album” bit in Figure 02.

I now know I have to request “Album + Compilation” and “Album + Live” to get, for example, Reunion. But … why don’t I have Vol 4 from MusicBrainz? It can’t be possible there were no releases from the Vol 4 release-group with data at LastFM … right?

Well, time to get knee and elbow deep with even more data about each release-group and release …

Yes, I’m aroused by that. 😉

P.S. I wonder if … and hope that … MusicBrainz and/or LastFM have the CD I once found at a truck stop … it was called something like “The Essential Black Sabbath” or “Black Sabbath’s Greatest” and it was exclusively Tony Martin tracks. I’d love find that and someday get it autographed just because it’s so … peculiar.

Update: Finding that Tony Martin Compilation

I have no idea if is even any of those listed in Figure 08, so …

TMdiscograpny.png
Figure 09: I love how precise MB users are — that someone made sure to add “featuring Tony Iommi” to Seventh Star.

Step #1 Click any album in their discography (Figure 09) between 87-95 except Dehumanizer. Poor Tony, man.

headlessCrossReleaseGroup.png
Figure 10: Headless Cross release group

Step #2 Click any release (Figure 10).

headlessCrossRelease.png
Figure 11: Dang, dudes! They gotta say “ex-Black Sabbath vocalist”? I wonder if Dio suffered the same fate. 

Step #3 Click Tony Martin’s name in the credits (Figure 11).

TMdiscography.png
Figure 12: Tony Martin’s discography

Crap. I was hoping it would be listed right here (Figure 12) but, I suppose this makes more sense.

Step #4 Click “Show all release groups instead” (Figure 12) or anything else I need to.

Crap #2. I’m actually going to have to go through all of those compilations …

I opened each compilation 1996 and later (that I knew wasn’t specifically Ozzy or Dio) in another tab.

*Sigh* Each tab is a release group so requires at least one more click.

I checked eight of those candidates:

  • The Sabbath Stones
    Six instrumentals, seven Tony Martin tracks, and one apiece by Dio, Ian Gillan, and Glenn Hughes. No Ozzy.
  • Greatest Hits
    Ozzy & Dio
  • Ozzy is the only vocalist featured on:
    • The Ultimate Collection
    • Forever
    • Greatest Hits
    • The Collection
    • Rock Giants
  • Rock Champions is all Tony Martin!

I checked a few albums and Dio is never referred to as “ex-vocalist”.

I feel it’s also worth mentioning The Best of Black Sabbath which is an unusually respectable–albeit unbalanced–compilation as these things go. First of all–awesome cover.

  • 28 songs by Ozzy
  • 2 by Dio
  • 1 awful song by Ian Gillan
  • 1 instrumental

Not a single Tony Martin song.

 

Advertisements

Same Artist with Different MBIDs: MusicBrainz, LastFM (and Spotify)

TL;DR ==> Skip the Introduction and scroll to “The Actual Tests”

Introduction

One of the cooler (to me) features of my PopRock app is it combines related artists for some of the charts–if the same artist is in multiple bands, for example, like Ronnie James Dio was in Dio, ElfBlack Sabbath, Rainbow, and so on. That was fun to work on and most triumphant when I finally got it to work.

MusicBrainz and LastFM, however, have provided me with a somewhat similar challenge. In my first post about using the MusicBrainz and LastFM APIs, I mentioned that Alice Cooper has two MBIDs–one for the band (1964-1975 according to MusicBrainz) and another for the lead singer cum solo artist (or is it “solo artist née lead singer”?). I’ve stored those two MBIDs, along with several others, in a python list that I loop through to get data about multiple artists. I wasn’t yet ready to store this data in my database, so I was saving each artist’s daily data in a JSON file (which Python makes so freaking easy, by the way!) like so:

artistNameFor_file_name = artistName.replace(‘ ‘, ”)

dateFor_file_name = time.strftime(“%m-%d-%y”)

artistJSON = json.dumps(artist, indent=4)

f = open (‘data/’ + artistNameFor_file_name + ‘_‘ + dateFor_file_name + ‘.json‘, ‘w’)

f.write (artistJSON)

f.close()

It works totally awesome — even for artists with unusual characters in their name such as Mötley Crüe. It doesn’t work so well for artists with exactly the same name such as Alice Cooper and Alice Cooper. In the latter case, data for the second Alice Cooper overwrites the file and data for the first Alice Cooper.

I could use the ‘a’ (instead of the ‘w’) argument to append it but for it to merge the data how I want it* isn’t really worth the time and effort it would take to write that script. What I’ll do, instead, is write the PHP for putting both sets of data where it belongs in my MySQL database which I have to do anyway. What I noticed while writing that previous post was LastFM isn’t a pretentious wiener like the guys who work for Championship Vinyl in High Fidelity (and MusicBrainz contributors). LastFM uses only one of those MBIDs for  Alice Cooper. I just have to test both MBIDs again to see which one — then I can write … wait … I could add some … other string from an artist’s data to the filename … OR I could include the time in addition to the date … and that might be helpful for … something else as well. But, as I said, I need this PHP written as well.

Back to the primary topic:

I need a script (and maybe a … are they called “lookup” tables?) that knows to put data from both Alice Cooper MBID … crap … no, wait … because these are hard, quantifiable (yes, I know I’m mis-using that word … be thankful I didn’t say “mis-abusing”) numbers I can add them! That’s the whole purpose behind getting LastFM data in the first place!

I am hoping Joan Jett is equally easy. For some reason, MusicBrainz contributors have her under Joan Jett and Joan Jett and the Blackhearts. I have no idea why. She’s never done anything without the Blackhearts. This is not a Tom Petty and Tom Petty & the Heartbreakers situation … aw, man! … I also need to deal with Tom Petty.

Spotify has one artist id for all Alice Cooper albums as well as Joan Jett. They have different, separate ids for the two Tom Petty entities which I’m okay with (and I’m sure Spotify is relieved). There are other issues with Spotify both odd but whatever as well as inaccurate but … that’s a whole different, well, issue.

The Actual Tests

TEST #1 – Part 1

First, I’ll demonstrate the “problem” by showing the response to a request to MusicBrainz for the Alice Cooper group’s Release Groups (a list of albums containing all the different releases–like other countries, etc.–of that album).

ACgroup_MB_ReleaseGroups.png
The Alice Cooper band MBID data includes 25 release groups (for only 7 “real life” albums! Still only 8 if you include Greatest Hits).

OMG! I just noticed there is a “type” property with a “Group” value! Woo-hoo! That is going to make everything sooooo easy.

ACperson_MB_ReleaseGroups.png
MusicBrainz has 25 release groups for Alice Cooper as a solo artist. Exactly the same amount as the group which is odd considering he has 20 studio albums (which doesn’t include live albums or compilations). That … is weird.

I notice, with a smile, this MBID has a “type” of “person” key/value pair.

I have to note that the Alice Cooper discography Wikipedia page combines solo artist and group lists for studio albums, live albums, and compilations but separates group and solo artist singles. Alice Cooper is just a mess. 😉

Important part: Each MBID has 25 albums for a total of 50. Yes, I read all of that and know each MBID has a different list of albums.

TEST #1 – Part 2

Artist Info from LastFM using the person vs band MBIDs

ACcombined_LastFM_artistInfo.png

The LastFM response using the person MBID gets me artist info for the same MBID.

The LastFM response using the group MBID redirects me to the artist info for the person MBID and the data is the same — note the listeners and playcount values.

Moral: I only need to use one MBID for getting info from LastFM — the person MBID.

Let’s try Joan Jett.

JJperson_MB_ReleaseGroups.png
MusicBrainz has 9 release groups for Joan as an individual.

I’m amused they have hometowns (“begin_area”) for both Alice (Detroit) and Joan (Wynnewood) but for the Alice Cooper group (Pheonix) and JJ & the Blackhearts (Los Angeles) they have the city in which the band was formed.

JJperson_MB_ReleaseGroups.png
MusicBrainz has 25 release groups for Joan Jett and the Blackhearts!

Joan Jett & Company isn’t quite as simple with LastFM.

JJperson_LastFM_artistInfo.png
Joan Jett individual MBID stats from LastFM

Joan Jett the person is, apparently, similar to neither the artist who inspired everything about her (Suzi Quatro) nor her contemporary, Pat Benatar.  Okay.

JJband_LastFM_artistInfo.png
Joan Jett and the Blackhearts MBID stats from LastFM

JJ & the BHs are, unsurprisingly, similar to her old band The Runaways and, well, Lita Ford (lead guitarist for The Runaways).

Neither “solo” or with the BHs (and those should be, in truth, the same MBID) is similar to Evil Stig (a band for which JJ sang 100% the lead vocals, not just some as she did with the Runaways).

Most importantly and irritatingly, the MBIDs have different values for listeners and playcount. I can still simply add them but this requires me to fetch each MBID in Last FM separately (for the time being, while I’m using my Python script to put the responses into JSON files).

Update: I’ve made a little lookup-ish table.

queryForMBID.png

I know I don’t need the name columns, but it helps me. This is for artists that are identical in the real world to make sure data from LastFM goes where it should — Alice Cooper has all Alice Cooper stats, albums, and tracks and Joan Jett gets all her stuff.

Speaking of Joan Jett, I am about to–any second now–make a lookup-ish table for related artists so I don’t need my “manual” group charts below.

steveTaylorGroup.png
This chart uses arrays containing an artist and groups for which they were the only lead singer.

On a related note, I’ll make a “related albums” table for artists who were only on some of a band’s albums — for example, any Dio-related query knows to get only those Rainbow and/or Black Sabbath albums for which he was the lead singer.

An issue that will still remain is albums for which an artist is only on one or some of the songs:

  • My albums table won’t accept albums if the artistID exists in the artist table which means neither The Rocky Horror Show nor The Rocky Horror Picture Show get inserted because they’re cast albums with various singers–Meat Loaf‘s artistID won’t match the artistID for those albums and “Various” or whatever isn’t in the artists table.
  • Ted Nugent‘s Free for All album doesn’t display for Meat Loaf queries despite Loaf singing some of the songs — he isn’t the “artist” for the album. Hmm … I should look and see if he shows up in any of properties for that album in any of my data sources … my goodness but this app is becoming wicked cool wicked fast!

Footnotes

*Once I increase my MongoDB and CouchDB knowledge, perhaps it won’t be such a time-consuming task.

Using the Last.fm and MusicBrainz APIs

I am quite pleased with my app that uses the Spotify Web API for creating data visualizations using the popularity data of artists, albums, and tracks. That popularity score, however, is a relative number so I can’t truly compare much in a useful way, however pretty my UX is.

Now is the time on Sprockets where we start a new project using the Last.fm API to get real, quantifiable data — the number of listeners for each entity (artist, album, track) and playcount for each. I’m very interested to find how those numbers differ.

Spotify and the Spotify Web API make it very easy to get ID numbers for each artist, album IDs for each artist and track IDs for each album. In fact, the documentation shows they might even be making an effort to be stupid. The example URL in their documentation for getting info about an artist is:

/2.0/?method=artist.getinfo&artist=Cher&api_key=YOUR_API_KEY&format=json

So any spelling errors or similar names will make your life miserable. Fortunately, their database uses (for the most part–I’ll get to that in a moment) the mbid or MusicBrainz ID for everything so you can, instead, write the above like this:

/2.0/?method=artist.getinfo&mbid=bfcc6d75-a6a5-4bc6-8282-47aec8531818&api_key=YOUR_API_KEY&format=json

Unfortunately, there’s no Last.fm equivalent to Spotify‘s utterly delightful endpoint for getting all albums by an artist:

/artists/{id}/albums

Fortunately, the MusicBrainz API has such an animal but it’s a little more complicated and it took me a bit of trial and error to figure out quite how it works. I’d like to share it with you. Here are the steps I’ll need to complete to get all of the information about an artist, their albums, and the tracks from those albums with an extra curveball just to make it fun.

Step #1 – Search MusicBrainz.org for the desired artist

In this case, I chose Alice Cooper.

aliceCooperSearchResults.png

For the uninititated, not only would using the search term “alice cooper” bring up tribute bands but the fans (I count myself among them) who insist on differentiating between the group and the solo artist mean I have to use not one but two MBID numbers for this single artist.

Step #2 – Get the Artist’s MBID from MusicBrainz

Click the artist name in the search results then click the Details tab on the artist page.

aliceCooperDetailsTab.png

Step #3 – Get the Artist’s release-groups from MusicBrainz

Copy the artist MBID and paste it into a MusicBrainz web service URL using the artist entity and a ?inc=release-groups parameter like so:

/ws/2/artist/ee58c59f-8e7f-4430-b8ca-236c4d3745ae?inc=release-groups&fmt=json

to get a list of “release groups.”

AC_MB_info_browser.png

I broke up with Dreamweaver a while back, dumping it for the far smarter, cuter, cooler VS Code but ol’ DW will always have a special place in my heart because of cool commands like Edit > Code > Apply Source Formatting which does this:

beforeAfter.png

Remember, you have to do that twice for Alice Cooper. Because we, his fans, are pricks like that.

Step #4 – Get the Artist’s Releases from MusicBrainz

A release is an album — like Zipper Catches Skin or Billion Dollar Babies. A release group is all editions and versions of that album. To see exactly what that means, we copy the id for the Constrictor release-group from line 26 in the above image and construct the URL below using the release-group entity and inc=releases parameter.

/ws/2/release-group/24c4fdde-4494-3837-a288-5b4fdbe966eb?inc=releases&fmt=json

That gets us all six releases in the Constrictor release-group:

  • United States with “packaging”: “Cardboard/Paper Sleeve” (a vinyl record)
  • United States with “packaging”: “Jewel Case” (a CD)
  • Japan CD
  • United States with “packaging”: “” (digital?)
  • Germany with “packaging”: null
  • United Kingdom with “packaging”: null

I don’t know what null packaging means.

The first time I get data from Last.fm, I’ll try all of these and hypothesize “Cardboard/Paper Sleeve” will have a zero playcount.

Step #5 – Get a Release’s Recordings from MusicBrainz

A recording is a “track.”

Construct a URL using the individual release entity and inc=recordings parameter.

/ws/2/release/1189de93-7e9f-3b47-a9df-e2599e62f75d?inc=recordings&fmt=json

The UK release of Constrictor has 10 tracks. There will probably never be a deluxe or remastered edition. Which is a damn shame.

Step #6 – Get an Artist’s data from Last.fm

Using the artist MBID, method=artist.getinfo, and an API Key as parameters, assemble a Last.fm URL like so:

/2.0/?method=artist.getinfo&mbid=ee58c59f-8e7f-4430-b8ca-236c4d3745ae&api_key=YOURAPIKEY&format=json

The resulting JSON for Alice Cooper the man includes a bio, birthday, similar artists, genres, and–for me–the important bit:

"stats": {
"listeners": "1362113",
"playcount": "31011489"
}

You’re dying to know how that compares to Alice Cooper the group, amiright?

"stats": {
"listeners": "1362113",
"playcount": "31011489"
}

Well, check that shizzle out! We only need to run what will be this query once for Alice Cooper! Both MBIDs get the exact same data. Wanna know something nerdy and cool?

The artist entity URLs I used earlier to get the release groups differ …

  • Alice Cooper the person has a gender of “male.” Alice Cooper the band has a gender of “null.”
  • The person has a “begin_area” of Detroit, MI while the band has Pheonix, AZ
  • The band has a life-span of 1964-1975 with “ended”: true while the person’s life-span begins with “1948-02-04” and ends with “null” and “ended” is false.
  • Type properties are “group” and “person”

Step #7 – Get Album Info from Last.fm

We’ll go with Constrictor again. Using album.getinfo this time.

/2.0/?method=album.getinfo&mbid=1189de93-7e9f-3b47-a9df-e2599e62f75d&api_key=YOURAPIKEY&format=json

  • listeners: 64,771
  • playcount: 545,070

For comparison (the entire purpose of this whole exercise), let’s also get Billion Dollar Babies:

Well, I’ll be horn-swaggled! There were three Billion Dollar Babies release-groups in my JSON file but none of their IDs worked so I searched at MusicBrainz and got the actual mbid for the release-group which had 13 releases in it! How do you pick? Eventually, I’ll get all of them but …

First choice didn’t work … this is going to get tedious …

Okay, second try gets me some stats

  • listeners: 169,768
  • playcount: 1,222,112

Yeah, see that’s a significant difference.

Step #8 – Get Track Info from Last.fm

I went for “The World Needs Guts” from Constrictor using the track.getInfo method as a parameter.

/2.0/?method=track.getInfo&mbid=70bd458f-c62d-4014-99e7-233c3c0b8f96&api_key=YOURAPIKEY&format=json

  • listeners: 11,332
  • playcount: 38,656

Which I’m sure is tiny crumbs compared to even one instance of “School’s Out” let alone adding all of the recordings of “School’s Out” together.

Conclusion

It will take me a lot longer to get the data from Last.fm that I need than it does to get data from Spotify but it will also be data I can use!