Unexpected Data Cleaning

I’ve had to get creative with writing SQL queries to start the process of merging my Spotify data and my LastFM data. At first I thought it was going rather well if not very fast. I was definitely exercising my brain and strengthening my SQL and algorithm skills.

I was quite pleased and relieved upon realizing all the releases in a LastFM release-group–as well as all common recordings across them–shared the same Listeners and Playcount numbers.

What does all that gibberish mean? Check out my first post about Using the Last.fm and MusicBrainz APIs.

Think of an album. Any album. I’ll think of 13 by Black Sabbath because that will come up here in a minute. An album is released in multiple countries, and even in a single country there are multiple versions: different covers, deluxe versions with bonus tracks, the remastered version a few years later, and so on. Each unique version is a “release” in the 13 “release-group.” What you and I would call “tracks,” MusicBrainz calls “Recordings.”

Table listing data from Last FM about the Black Sabbath album thirteen.
Figure 01: Data porn in the form of the “13” Release Group.

Anyway, all nineteen versions of 13 (Figure 01) are really all the same album with mostly all the same songs so, thankfully, LastFM gives them the same standard MBID and numbers. I’m relieved because they could have been pricks (as they’re known to be) and said more people listen to the version of “God Is Dead?” on vinyl than on the digital copy from iTunes but–again, thank God–they don’t. See what I did there? With the “thank God”? I am so clever.

Just in case you’re as aroused as I am looking at all the yummy goodness from MusicBrainz in Figure 01, here’s another screenshot.

13_ReleaseTracks.png
Figure 02: Just look at it! My imagination is running wild!

At first, I thought it would be messy kinda like Spotify, so I had Python grab every MB release for which LastFM had data and store it all in a JSON file (so I could review it and plan for my other data science and data visualization needs & wants).

PythonValidation.png
Figure 03: A “valid” release, for me, is one for which LastFM has data. There’s no reason to keep that data and waste time checking all of them every time.
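A minimal sketch of that validation idea, not the actual code in the screenshot. It assumes each release is a dict with an "mbid" key and that some lookup function returns LastFM stats (or nothing) for an MBID. The function name, the dict shapes, and the fake numbers are all made up for illustration.

```python
import json


def keep_valid_releases(releases, lastfm_lookup):
    """Return only the releases for which lastfm_lookup finds stats.

    lastfm_lookup is a hypothetical callable: MBID in, stats dict (or None) out.
    """
    valid = []
    for release in releases:
        stats = lastfm_lookup(release["mbid"])
        if stats:  # None or {} means LastFM had nothing for this MBID
            valid.append({**release, "lastfm": stats})
    return valid


# Stand-in for a real API call; the numbers are invented.
fake_lastfm = {"mbid-1": {"listeners": 100, "playcount": 900}}

releases = [
    {"mbid": "mbid-1", "title": "13"},
    {"mbid": "mbid-2", "title": "13"},
]

valid = keep_valid_releases(releases, fake_lastfm.get)

# Serialize for later review; writing this string to a file is the obvious next step.
as_json = json.dumps(valid, indent=2)
```

Passing the lookup in as an argument keeps the filtering logic testable without touching the network.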

I wanted to merge or at least somehow “link” the Popularity and Followers data from Spotify with the Listeners and Playcount data from LastFM for something resembling easy access. After deciding I only needed one set of stats per album and/or song from each release group, I told PHP to just grab the first release from each release-group for my database. 

python_JustFirstRelease.png
Figure 04: That argument in the if statement came after a lot of time and frustration caused by the fact that I didn’t know some release-groups didn’t have any releases. I know.
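The guard described in that caption can be sketched like this. It assumes each release-group is a dict with an "id" and an optional "releases" list; those key names and the helper name are assumptions, not a copy of the code in Figure 04.

```python
def first_release_per_group(release_groups):
    """Map each release-group id to its first release, skipping empty groups."""
    picked = {}
    for group in release_groups:
        # "releases" may be missing, None, or an empty list.
        releases = group.get("releases") or []
        if releases:
            picked[group["id"]] = releases[0]
    return picked


groups = [
    {"id": "rg-1", "releases": [{"id": "r-1a"}, {"id": "r-1b"}]},
    {"id": "rg-2", "releases": []},  # a release-group with no releases
    {"id": "rg-3"},                  # key missing entirely
]

picked = first_release_per_group(groups)
```

Without the `if releases:` check, indexing `releases[0]` on an empty group raises an IndexError, which is the kind of surprise the caption is describing.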

I added columns to my current albums table and created an albumsMB table with plans to merge them. To make that easier, I chose to make some temporary, redundant columns rather than convoluted JOINs and sub-queries. Then I made copies of those tables and ran some tests on those temporary copies. I have far too much data I love and am attached to — I am not going to risk losing it no matter how trivial the task is on which I’m working.

I played with two tables. The table of album info from MusicBrainz (below) and my existing table of album info from Spotify.

BS_albumsMB.png
Figure 05: Wherever there’s a Black Sabbath MBID in my albumsMB (album info from MusicBrainz) table, I added Black Sabbath’s Spotify ID. Easy.

I wondered–a lot–whether I really needed more columns from the data like “country”, “disambiguation,” etc. but all my SQL test queries worked so well, so easily, that I eagerly, perhaps hastily, went in the “opposite” direction. I thought it was enough to tell PHP that wherever the album title and Spotify artist ID from the two tables matched, add the album’s MBID to my existing albums table.
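Here is a rough sketch of that matching rule, using Python's built-in sqlite3 instead of PHP so it runs on its own. The table names albums and albumsMB and the albumMBID column come from this post; the other column names and all the IDs are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums   (albumName TEXT, artistSpotifyID TEXT, albumMBID TEXT);
    CREATE TABLE albumsMB (albumName TEXT, artistSpotifyID TEXT, albumMBID TEXT);

    -- Placeholder rows: one title exists in both tables, the others do not.
    INSERT INTO albums   VALUES ('13', 'spotify-artist-1', NULL),
                                ('Paranoid', 'spotify-artist-1', NULL);
    INSERT INTO albumsMB VALUES ('13', 'spotify-artist-1', 'mbid-13'),
                                ('Headless Cross', 'spotify-artist-1', 'mbid-hc');
""")

# Copy the MBID across wherever title AND Spotify artist ID both match.
# The WHERE EXISTS clause leaves unmatched rows alone instead of writing NULL.
conn.execute("""
    UPDATE albums
    SET albumMBID = (
        SELECT mb.albumMBID FROM albumsMB AS mb
        WHERE mb.albumName = albums.albumName
          AND mb.artistSpotifyID = albums.artistSpotifyID
    )
    WHERE EXISTS (
        SELECT 1 FROM albumsMB AS mb
        WHERE mb.albumName = albums.albumName
          AND mb.artistSpotifyID = albums.artistSpotifyID
    )
""")

rows = conn.execute(
    "SELECT albumName, albumMBID FROM albums ORDER BY albumName"
).fetchall()
```

An exact-equality match like this only fills a row when the two titles are character-for-character identical.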

I was SO excited. I was finally going to have complete charts–with quantifiable data–for Black Sabbath that included all the Tony Martin era albums Spotify lacks. I was totally going to take screenshots and send them to Tony and he was going to be so grateful and we’d be best buds and I could move onto actual new features and stuff since my data was massaged and merged and yay!

I switched browser tabs to bask in my victory.

BS_albumsPopList.png
Figure 06: This page has things I’ll fix and temporary stuff I’ll remove once everything works.

I didn’t scroll down so I didn’t see this (Figure 06) whole thing. I didn’t notice there were actually three albums with LastFM data. I thought I’d not yet added code to the query that should populate the Listeners and Playcount columns but when I checked, I saw it was there and should be working. I don’t know which I saw first — those three rows with LastFM data or … this …

BS_albumsCropped.png
Figure 07: The albums table of Spotify album info. Only three rows received a MBID.

I was more confused than frustrated already by all the missing data in the albumMBID column when I noticed the two tables also contained very different albums.

Later: I wonder how the releases are ordered … but I can’t control the order they come in … regardless of how they are in their home database … I may have to use some surgical precision … with certain properties.

I should have anticipated that Tony Martin’s album MBIDs wouldn’t have a place to go but I also never noticed most of those Spotify albums include “(Remastered Edition)” in the title which meant most of the titles didn’t match. I wasn’t happy about that … that potentially meant some exhausting work with RegEx. I thought I might get off easy if I could change some existing columns to use FullText but I immediately thought of potential problems.
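One possible shape for that RegEx work, as a sketch only: strip a trailing parenthetical that mentions a remaster, deluxe edition, or similar before comparing titles. The keyword list and the function name are guesses, and a title that legitimately ends in such a parenthetical would get mangled too.

```python
import re

# Trailing "(...)" or "[...]" that contains one of a few edition-style keywords.
_EDITION_SUFFIX = re.compile(
    r"\s*[\(\[][^)\]]*\b(remaster(ed)?|deluxe|edition|version)\b[^)\]]*[\)\]]\s*$",
    re.IGNORECASE,
)


def normalize_title(title):
    """Return a comparison key: edition suffix removed, whitespace trimmed, case folded."""
    stripped = _EDITION_SUFFIX.sub("", title)
    return stripped.strip().casefold()


same = normalize_title("Paranoid (Remastered Edition)") == normalize_title("Paranoid")
```

The normalized value would live in its own column or be computed at match time, so the original titles stay untouched.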

I half-heartedly started copying MBIDs and pasting them into the empty fields but then deleted them as I thought of … not only did I not want to do this manually for all my current (and future!) MB albums but what if I added columns later or found a more … accurate-ish, clean-ish solution. Which is what I started doing earlier and what I’m going to do now. Add more columns for the properties I mentioned earlier. One of the JSON files to which I referred when planning all this had no values for all the keys I thought of using. It looks like the most recent Black Sabbath data (from yesterday, FWIW) has values for most of them. I’ll need to add equivalent columns to the Spotify albums table, too.

In the back of my mind, I kept wondering why my Spotify albums list was so short. I keep adding and removing values from the “type” and “group” key in the PHP file that fetches Spotify album data. I’ll have to do that again so I can have more albums from each source so I can, hopefully, have more matches.

As I wrote this post and went to MusicBrainz to take screenshots, I saw this …

BS_MusicBrainz_Discography.png
Figure 08: So THERE are the albums that Spotify had but my MB/LFM data was missing!

Later, I also noticed the “Type: Album” bit in Figure 02.

I now know I have to request “Album + Compilation” and “Album + Live” to get, for example, Reunion. But … why don’t I have Vol 4 from MusicBrainz? It can’t be possible there were no releases from the Vol 4 release-group with data at LastFM … right?
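To see which type combinations are actually in the fetched data, something like the following could help. It assumes each release-group in the MusicBrainz JSON carries a "primary-type" string and a "secondary-types" list; the sample rows below are hand-written for illustration, not real API output.

```python
from collections import Counter


def type_label(group):
    """Build a label such as 'Album + Live' from a release-group dict."""
    primary = group.get("primary-type") or "Unknown"
    secondary = sorted(group.get("secondary-types") or [])
    return " + ".join([primary, *secondary])


# Hand-written sample rows (assumed shape).
sample_groups = [
    {"title": "Paranoid", "primary-type": "Album", "secondary-types": []},
    {"title": "Reunion", "primary-type": "Album", "secondary-types": ["Live"]},
    {"title": "The Sabbath Stones", "primary-type": "Album",
     "secondary-types": ["Compilation"]},
    {"title": "Some EP", "primary-type": "EP", "secondary-types": []},
]

# How many release-groups fall under each combination.
counts = Counter(type_label(g) for g in sample_groups)

# Keep the combinations this post says are needed.
wanted = {"Album", "Album + Compilation", "Album + Live"}
kept = [g["title"] for g in sample_groups if type_label(g) in wanted]
```

Counting the labels first makes it obvious when a whole category (live albums, compilations) never made it into the local data.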

Well, time to get knee and elbow deep with even more data about each release-group and release …

Yes, I’m aroused by that. 😉

P.S. I wonder if … and hope that … MusicBrainz and/or LastFM have the CD I once found at a truck stop … it was called something like “The Essential Black Sabbath” or “Black Sabbath’s Greatest” and it was exclusively Tony Martin tracks. I’d love to find that and someday get it autographed just because it’s so … peculiar.

Update: Finding that Tony Martin Compilation

I have no idea if it is even any of those listed in Figure 08, so …

TMdiscograpny.png
Figure 09: I love how precise MB users are — that someone made sure to add “featuring Tony Iommi” to Seventh Star.

Step #1 Click any album in their discography (Figure 09) between 1987 and 1995 except Dehumanizer. Poor Tony, man.

headlessCrossReleaseGroup.png
Figure 10: Headless Cross release group

Step #2 Click any release (Figure 10).

headlessCrossRelease.png
Figure 11: Dang, dudes! They gotta say “ex-Black Sabbath vocalist”? I wonder if Dio suffered the same fate. 

Step #3 Click Tony Martin’s name in the credits (Figure 11).

TMdiscography.png
Figure 12: Tony Martin’s discography

Crap. I was hoping it would be listed right here (Figure 12) but, I suppose this makes more sense.

Step #4 Click “Show all release groups instead” (Figure 12) or anything else I need to.

Crap #2. I’m actually going to have to go through all of those compilations …

I opened each compilation 1996 and later (that I knew wasn’t specifically Ozzy or Dio) in another tab.

*Sigh* Each tab is a release group so requires at least one more click.

I checked eight of those candidates:

  • The Sabbath Stones
    Six instrumentals, seven Tony Martin tracks, and one apiece by Dio, Ian Gillan, and Glenn Hughes. No Ozzy.
  • Greatest Hits
    Ozzy & Dio
  • Ozzy is the only vocalist featured on:
    • The Ultimate Collection
    • Forever
    • Greatest Hits
    • The Collection
    • Rock Giants
  • Rock Champions is all Tony Martin!

I checked a few albums and Dio is never referred to as “ex-vocalist”.

I feel it’s also worth mentioning The Best of Black Sabbath which is an unusually respectable–albeit unbalanced–compilation as these things go. First of all–awesome cover.

  • 28 songs by Ozzy
  • 2 by Dio
  • 1 awful song by Ian Gillan
  • 1 instrumental

Not a single Tony Martin song.

 


Drag & Drop Rock & Roll

I really need to decide which in my list of “coming soon” features for myRockinApp will get my full attention. One is dragging and dropping an artist onto a chart to compare their popularity over time (in two different ways) but … rats … that’s not really what I just took a bunch of screenshots to talk about … the other is this … either a simple array or, more fun, a groovy algorithm to combine multiple artists that are actually the same artist … like all the versions of Mike Knott that include but are not limited to …

knott.png

  • Mike Knott
  • Michael Knott
  • L.S.U.
  • L.S.U. (Life Savers Underground)
  • Life Savers

Or Miss Joan Jett

joanJett.png

Finally … FINALLY … it looks like they have all her shizzle on Spotify.

And the artist that got me started thinking about this, Ronnie James Dio

dio.png

I want to make something that will grab me just the Rainbow or Black Sabbath albums on which he is the lead singer. I’d also like to write something that gets me just the songs off other artists’ albums like Roger Glover’s Butterfly Ball and the Grasshopper’s Feast as well as Kerry Livgren’s Seeds of Change.

Of course, making this drag & drop, too, is probably the easiest and least frustrating way to do this … and … make a way to save the dropped artists as a group the user can name something like “Dio and Friends.”
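The “simple array” version of the idea could look roughly like this: a named group that lists every alias, plus an index for looking up which group a given artist name belongs to. The structure and function names are just one possible sketch; the alias list is the Mike Knott one from earlier in this post.

```python
# Group name -> every artist name that should be treated as the same artist.
ARTIST_GROUPS = {
    "Mike Knott": [
        "Mike Knott",
        "Michael Knott",
        "L.S.U.",
        "L.S.U. (Life Savers Underground)",
        "Life Savers",
    ],
}


def build_alias_index(groups):
    """Invert the groups into {case-folded alias: group name} for fast lookup."""
    index = {}
    for group_name, aliases in groups.items():
        for alias in aliases:
            index[alias.strip().casefold()] = group_name
    return index


def canonical_artist(name, index):
    """Return the group an artist belongs to, or the name itself if ungrouped."""
    return index.get(name.strip().casefold(), name)


alias_index = build_alias_index(ARTIST_GROUPS)
```

A user-named group like “Dio and Friends” would just be another key in the same dictionary, which is what makes the drag & drop version easy to save.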

I’ve found several promising looking bits of documentation and tutorials for jQueryUI and using that with AJAX, etc.

Note to self: Add Elf to database.

I’ll likely do this first because the multi-line charts sound like they’re gonna be quite a bit more difficult.

petty.png

And I need to find out why so many albums have no art like Tom Petty & the Heartbreakers above. Meat Loaf has that issue, too.