Spotify and LastFM Data for Rock and Roll Hall of Fame Class of 2019

The ongoing project I’m usually busiest with is PopRock — a web app that tracks statistics and creates data visualizations using data from the Spotify Web API and the Last.fm API.

From Spotify, I grab artist Popularity and Followers (daily) as well as Popularity for albums and tracks (weekly). That is

  • 327 artists
  • Their 5,742 albums
  • The 75,283 tracks from those albums

If you’ve never heard of Last.fm, it gathers data via “scrobbles” from all participants who have a “scrobbling” app installed. When you, the music lover, listen to a song using iTunes, Spotify, etc. the app sends that bit of information as a “scrobble” to Last.fm. As often as I remember to, I run a Python script that gathers the Listeners and Playcount for each artist, their albums and tracks. At the moment, this task is so cumbersome, I keep the list relatively short — 32 artists.

What brings joy to my nerdy little heart is when there’s an event that creates an anomaly or jump in my data. For example, a movie about Queen causes a spike in popularity — especially for a particular song (“Bohemian Rhapsody”) or a movie about Motley Crue causes a spike in their popularity.

What brings sadness to my heart is how little influence the Rock and Roll Hall of Fame has on my little data exercises. For the most part, neither inductions nor nominations cause much of a change. Having said that, I look forward to the announcement of nominations every year as well as to the broadcast of the induction ceremony. The ceremony was held last week and the ceremony will air on HBO April 27. To celebrate that, I’m going to share some data and charts for this year’s nominees and inductees.

Below (Figure 01) is the Spotify popularity for the Class of 2019.

compareBars.png
Figure 01: L-to-R Def Leppard, Janet Jackson, Radiohead, Roxy Music, Stevie Nicks, the Cure, the Zombies

I’ve already written about how Spotify calculates their popularity score numerous times and every article about the RRHoF repeats their rules and controversy so forgive me if I don’t regurgitate all of that here.

Induction Process and Eligibility Requirements, etc.

compareLines_Inductees.png
Figure 02: A line starts on the day (or day after) I add an artist to my database.

As you can see, all artists flow together for the most part — because Spotify’s popularity scores are relative. That’s what caused all of them to go down together in late June, climb back up and fall again together in February. Given that, the lines, in a vacuum, would be pretty straight. Knowing that, the nominations announced in December 2018 seem to have no effect.

Speaking of nominations, here are those who were nominated last year but not inducted.

compareBars_Noms.png
Figure 03: L-to-R: Devo, John Prine, Kraftwerk, LL Cool J, MC5, Rage Against the Machine, Rufus featuring Chaka Khan, Todd Rundgren

Like many people, I find it extremely difficult to keep my thoughts about this to myself. At least three of those artists deserve to be in more than at least three of the inductees. But, of course, if the RRHoF used reason and logic, people wouldn’t write motherloads of articles and blog posts about them twice a year.

compareLines_Noms.png
Figure 04: Yes, I know I need to tweak a couple things. It’s a work in progress.

The only artist I would say was really affected by their nomination is the MC5. I would expect all of them would bump up because the public was reminded of these artists and listened to them like, “Oh, yeah, I haven’t listened to LL Cool J in forever” but, in the case of the MC5, I think it’s more like, “Who are the MC5?” and I am just fine with a new generation being exposed to them by any means necessary. Very similar to another highly-influential Detroit band, The Stooges (inducted 2010). I’m still waiting for Ted Nugent to get selected for consideration to be put on the ballot. Once.

Because popularity is relative, you might be thinking the line charts are pretty useless and you’d be right. If I look at each artist individually, however,  and change the Y-axis from a 0-to-100 scale to a lowest-to-highest scale, they become (a little) more useful. Sometimes. This begins a series of posts about each inductee that do exactly that with some more useful statistics and data visualizations.

Before we go on, however, here’s something more useful — a graph using each artist’s followers on Spotify.

followerColumnsX.png
Figure 05: L-to-R: Def Leppard, Kraftwerk, Janet JacksonJohn Prine, DevoLL Cool J, MC5, RadioheadRage Against the Machine, Roxy MusicRufus featuring Chaka Khan, Stevie Nicks, The Cure, The ZombiesTodd Rundgren

Day-um! Radiohead has some fans, yo!

That wide range between, for example, The Zombies (who deserve better, by the way) and Radiohead makes for what could be considered a clunky chart.

In other charts, the ratio for popularity (which is, conveniently, on a scale of 0-100) is 1:2. One popularity point gets represented by two pixels so I can fight against the opposite problem of a chart that’s too small. The ratio in the above chart is 5000:1. 5000 followers to one pixel.

Chrome’s dev tools were invaluable in troubleshooting that.

I love how “good” I’m getting at math now that it’s always in a real, practical context. For the labels on top of the columns I first checked to see whether the number was above or below one million.

  • If it was below, I divided by 1000 and used Javascript‘s toFixed(0) method to strip away the decimal places — which super conveniently converts the integer to a string which I concatenated with a “k”.
  • If it was above, I divided by one million, used toFixed(1) and added the “M”.
Advertisements

Unexpected Data Cleaning

I’ve had to get creative with writing SQL queries to start the process of merging my Spotify data and my LastFM data. At first I thought it was going rather well if not very fast. I was definitely exercising my brain and strengthening my SQL and algorithm skills.

I was quite pleased and relieved upon realizing all the releases in a LastFM release-group–as well as all common recordings across them–shared the same Listeners and Playcount numbers.

What does all that gibberish mean? Check out my first post about Using the Last.fm and MusicBrainz APIs.

Think of an album. Any album. I’ll think of 13 by Black Sabbath because that will come up here in a minute. That album is released in multiple countries, even in a single country there are multiple versions of an album such as different covers, deluxe versions with bonus tracks, after a few years the remastered version, and so on. Each unique version is a “release” in the 13 “release-group.” What you and I would call “tracks,” MusicBrainz calls “Recordings.”

Table listing data from Last FM about the Black Sabbath album thirteen.
Figure 01: Data porn in the form of the “13” Release Group.

Anyway, all nineteen versions of 13 (Figure 01) are really all the same album with mostly all the same songs so, thankfully, LastFM gives them the same standard MBID and numbers. I’m relieved because they could have been pricks (as they’re known to be) and said more people listen to the version of “God Is Dead?” on vinyl than on the digital copy from iTunes but–again, thank God–they don’t. See what I did there? With the “thank God”? I am so clever.

Just in case you’re as aroused as I am looking at all the yummy goodness from MusicBrainz in Figure 01, here’s another screenshot.

13_ReleaseTracks.png
Figure 02: Just look at it! My imagination is running wild!

At first, I thought it would be messy kinda like Spotify, so I had Python grab every MB release for which LastFM had data and store it all in a JSON file (so I could review it and plan for my other data science and data visualization needs & wants).

PythonValidation.png
Figure 03: A “valid” release, for me, is one for which LastFM has data. There’s no reason to keep that data and waste time checking all of them every time.

I wanted to merge or at least somehow “link” the Popularity and Followers data from Spotify with the Listeners and Playcount data from LastFM for something resembling easy access. After deciding I only needed one set of stats per album and/or song from each release group, I told PHP to just grab the first release from each release-group for my database. 

python_JustFirstRelease.png
Figure 04: That argument in the if statement came after a lot of time and frustration caused by the fact that I didn’t know some release-groups didn’t have any releases. I know.

I added columns to my current albums table and created an albumsMB table with plans to merge them. To make that easier, I chose to make some temporary, redundant columns rather than convoluted JOINs and sub-queries. Then I made copies of those tables and ran some tests on those temporary copies. I have far too much data I love and am attached to — I am not going to risk losing it no matter how trivial the task is on which I’m working.

I played with two tables. The table of album info from MusicBrainz (below) and my existing table of album info from Spotify.

BS_albumsMB.png
Figure 05: Wherever there’s a Black Sabbath MBID in my albumsMB (album info from MusicBrainz) table, I added Black Sabbath’s Spotify ID. Easy.

I wondered–a lot–whether I really needed more columns from the data like “country”, “disambiguation,” etc. but all my SQL test queries worked so well so easily, I eagerly, perhaps hastily went in the “opposite” direction. I thought it was enough to tell PHP that wherever the album title and Spotify artist ID from the matched, add the album’s MBID to my existing albums table.

I was SO excited. I was finally going to have complete charts–with quantifiable data–for Black Sabbath that included all the Tony Martin era albums Spotify lacks. I was totally going to take screenshots and send them to Tony and he was going to be so grateful and we’d be best buds and I could move onto actual new features and stuff since my data was massaged and merged and yay!

I switched browser tabs to bask in my victory.

BS_albumsPopList.png
Figure 06: This page has things I’ll fix and temporary stuff I’ll remove once everything works.

I didn’t scroll down so I didn’t see this (Figure 06) whole thing. I didn’t notice there were actually three albums with LastFM data. I thought I’d not yet added code to the query that should populate the Listeners and Playcount columns but when I checked, I saw it was there and should be working. I don’t know which I saw first — those three rows with LastFM data or … this …

BS_albumsCropped.png
Figure 07: The albums table of Spotify album info. Only three rows received a MBID.

I was more confused than frustrated already by all the missing data in the albumMBID column when I noticed the two tables also contained very different albums.

Later: I wonder how the releases are ordered … but I can’t control the order they come in … regardless of how they are in their home database … I may have to use some surgical precision … with certain properties.

I should have anticipated that Tony Martin’s album MBIDs wouldn’t have a place to go but I also never noticed most of those Spotify albums include “(Remastered Edition)” in the title which meant most of the titles didn’t match. I wasn’t happy about that … that potentially meant some exhausting work with RegEx. I thought I might get off easy if I could change some existing columns to use FullText but I immediately thought of potential problems.

I half-heartedly started copying MBIDs and pasting them into the empty fields but then deleted them as I thought of … not only did I not want to do this manually for all my current (and future!) MB albums but what if I added columns later or found a more … accurate-ish, clean-ish solution. Which is what I started doing earlier and what I’m going to do now. Add more columns for the properties I mentioned earlier. One of the JSON files to which I referred when planning all this had no values for all the keys I thought of using. It looks like the most recent Black Sabbath data (from yesterday, FWIW) has values for most of them. I’ll need to add equivalent columns to the Spotify albums table, too.

In the back of my mind, I kept wondering why my Spotify albums list was so short. I keep adding and removing values from the “type” and “group” key in the PHP file that fetches Spotify album data. I’ll have to do that again so I can have more albums from each source so I can, hopefully, have more matches.

As I wrote this post and went to MusicBrainz to take screenshots, I saw this …

BS_MusicBrainz_Discography.png
Figure 08: So THERE are the albums that Spotify had but my MB/LFM data was missing!

Later, I also noticed the “Type: Album” bit in Figure 02.

I now know I have to request “Album + Compilation” and “Album + Live” to get, for example, Reunion. But … why don’t I have Vol 4 from MusicBrainz? It can’t be possible there were no releases from the Vol 4 release-group with data at LastFM … right?

Well, time to get knee and elbow deep with even more data about each release-group and release …

Yes, I’m aroused by that. 😉

P.S. I wonder if … and hope that … MusicBrainz and/or LastFM have the CD I once found at a truck stop … it was called something like “The Essential Black Sabbath” or “Black Sabbath’s Greatest” and it was exclusively Tony Martin tracks. I’d love find that and someday get it autographed just because it’s so … peculiar.

Update: Finding that Tony Martin Compilation

I have no idea if is even any of those listed in Figure 08, so …

TMdiscograpny.png
Figure 09: I love how precise MB users are — that someone made sure to add “featuring Tony Iommi” to Seventh Star.

Step #1 Click any album in their discography (Figure 09) between 87-95 except Dehumanizer. Poor Tony, man.

headlessCrossReleaseGroup.png
Figure 10: Headless Cross release group

Step #2 Click any release (Figure 10).

headlessCrossRelease.png
Figure 11: Dang, dudes! They gotta say “ex-Black Sabbath vocalist”? I wonder if Dio suffered the same fate. 

Step #3 Click Tony Martin’s name in the credits (Figure 11).

TMdiscography.png
Figure 12: Tony Martin’s discography

Crap. I was hoping it would be listed right here (Figure 12) but, I suppose this makes more sense.

Step #4 Click “Show all release groups instead” (Figure 12) or anything else I need to.

Crap #2. I’m actually going to have to go through all of those compilations …

I opened each compilation 1996 and later (that I knew wasn’t specifically Ozzy or Dio) in another tab.

*Sigh* Each tab is a release group so requires at least one more click.

I checked eight of those candidates:

  • The Sabbath Stones
    Six instrumentals, seven Tony Martin tracks, and one apiece by Dio, Ian Gillan, and Glenn Hughes. No Ozzy.
  • Greatest Hits
    Ozzy & Dio
  • Ozzy is the only vocalist featured on:
    • The Ultimate Collection
    • Forever
    • Greatest Hits
    • The Collection
    • Rock Giants
  • Rock Champions is all Tony Martin!

I checked a few albums and Dio is never referred to as “ex-vocalist”.

I feel it’s also worth mentioning The Best of Black Sabbath which is an unusually respectable–albeit unbalanced–compilation as these things go. First of all–awesome cover.

  • 28 songs by Ozzy
  • 2 by Dio
  • 1 awful song by Ian Gillan
  • 1 instrumental

Not a single Tony Martin song.

 

Finding and Deleting (almost) Duplicate Rows in MySQL Part 2

Background:

  • I copied the original popArtists table, renaming the original popArtistsOLD
  • I worked (see below) on the copy which is now called popArtists

When I tried repeating my process for finding and deleting duplicates in a MySQL database (not to mention converting dates to assist with that and make future data wrangling easier) with popArtists, () it “loaded” (I think the message said “Loading”) for a long, long time before, eventually giving some sort of error stating it had pooped out and/or lost a connection. Checking the new, copied table, it appeared as though no rows had been affected — there were still 728,082 rows. When I logged in and checked this morning, however, the table showed only 56, 049 rows! So, I’m going to copy the original table again and try some tests.

NOTE TO SELF: Actually write out the following:

  1. SELECT Alice Cooper in SQL tab
  2. Export as CSV for Excel
  3. Import CSV into Excel
  4. Highlight duplicates

There were three instances of two rows for one date and one instance of four rows in a single date. I have 328 artists in my database so if each artist had the same amount of duplicates, there would be only 3,280 extra rows to delete, not 672,033.

I didn’t want to export and import a CSV file with 728,082 rows but it looks like I have to if I want to find the problem.

  • I made another copy of the original called popArtistsNew.

The exported CSV file is just over 53 MB!

After Excel finished importing the CSV, it said, “730,440 rows loaded” from popArtistsNEW which should have 728,082. I double-checked popArtistsNEW in PHPMyAdmin and, unexpectedly, it has 727,834 rows. All three should be the same.

Now I’m worried. And my face is stuck in a WTF expression.

Because of the sheer size of the spreadsheet, Excel crashed every time I tried using Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values so, instead, I tried Home > Conditional Formatting > New Rule. The formula, after much Googling, trial & error, baffling results, and profanity is:

=OR(AND($B2=$B1,$D2=$D1),AND($B2=$B3,$D2=$D3))

Figure 01

Let me break the formula down and explain it. First, look at it like this:

=OR(Condition1, Condition2)

The ‘=‘ sign tells Excel this is a formula, not text. ‘OR‘ tells Excel that if either Condition1 or Condition1 is true, apply my orange fill. This next part is where people can get confused. Just like an ‘OR’ statement has two conditions, an ‘AND’ statement has two conditions, like so:

=AND(ConditionA, ConditionB)

The difference here being if both ConditionA AND ConditionB must be true.

So, my formula is actually this:

=OR(AND(ConditionA, ConditionB),AND(ConditionC , ConditionD))

which means:

  • If ConditionA AND ConditionB are true OR
  • If ConditionC AND ConditionD are true

Make that whole row of cells orange.

After a whole lot of scrolling–which was far more stressful that it should be because I had, really, almost a million rows–I saw that I finally had only my duplicate rows highlighted (seriously, you should have seen the horrific crap Excel did for some of the “errors” in my “trials”).

Now, I just needed to count them and see if there were, actually, several hundreds of thousands. Mostly, I saw chunks of 2, 4, or 6 rows in a single day. But then I started seeing chunks of 556 rows for a single artist in a single day. Notice “chunks” is plural. Lots of ’em.

When you import a CSV file, Excel automatically gives you Sort & Filter menus for each column. I used

FilterMenu > Filter by Color > Filter by Cell Color

FilterMenu > Filter by Color > Filter by Cell Color

The Excel status bar at the bottom said 676,965 displayed out of 730,440. So, I’m satisfied.

Finding and Deleting (almost) Duplicate Rows in MySQL

The database for my PopRock web app has evolved as I learn and as my needs & wants expand. Learning means mistakes and mistakes with MySQL (and PHP) mean duplicate rows. Sometimes lots of them–I once forgot to empty an array at the end of a loop (inside another loop). It didn’t take long to notice that page was getting increasingly slower … because it had millions of rows.

The PopRock database contains data fetched from the Spotify Web API using cron jobs for:

  • Artist popularity – daily
  • Album popularity – weekly
  • Track popularity – weekly

In the last couple days, I noticed some of these items hadn’t been fetched in a couple months–or a couple years.

Sidebar/Note to self: I also noticed that some albums (anthologies) didn’t have all their tracks in my database — my solution to (and explanation of) that problem will probably be my next post.

Debugging and testing also has the side effect of creating lots of duplicate rows. For example:

albumID_01.png
Figure 01

There’s also this from when I added a date column after I’d started fetching data. I am fine with deleting all of these and doing so is easy.

noDateInPopTracks.png
Figure 02

And, lastly, for a bit, my cron job was getting twice a day:

twice.png
Figure 03

This happened because my albums and tracks cron jobs works like this:

  1. Get an array of artists
  2. For each artist
  3. get an array of their albums
  4. For each album, get that album’s popularity score from Spotify

I do have one array containing all of my artists for getting their popularity scores but that task is relatively small. Looping through every artist then getting all of their albums is a bigger task. Then getting all of the tracks for each album is even bigger. So, for those last two tasks, I’ve broken up my artists into smaller arrays.

If, accidentally, I’ve placed one of my artists into two of those arrays, everything for that artist gets done twice (that last screenshot). How does one accidentally get millions of rows like I mentioned earlier? One accidentally writes (or copy & pastes together) a script that, basically, says something like: For every track in David Bowie’s oeuvre, loop through all of David Bowie‘s albums that loop through every track on each of those albums. Yeah, imagine that grows like a zombie apocalypse.

My page for listing all an artist’s tracks has been slowing again, lately, so I need to:

  • Find out if something works but is happening more than it should
  • Resume work on making PopRock a Progressive Web App (PWA) so it stores data locally, speeding up everything it does.

Finding & Deleting Duplicate Rows

Finding and deleting all “duplicate” rows for Figure 02 is easy. I just need to find all rows where date equals 0000-00-00 00:00:00.

SELECT * FROM popTracks WHERE date='0000-00-00 00:00:00';

That query returns 1,116 rows out of 1,622,599 for (currently) 72,997 tracks so I’m confident I won’t lose anything significant when I delete them like

DELETE FROM popTracks WHERE date='0000-00-00 00:00:00';

leaving 1,621,483.

The other two aren’t so easy.

I don’t want to delete every row inserted on a given date–I want to delete all but one. Also, I created an extra hurdle when coding my PHP. When I get a new album to insert into my albums table, I format the date:

$albumReleasedWhole = $album->release_date;
$albumReleased = substr($albumReleasedWhole, 0, 4);

I do that because some release_dates are a four digit year, some are month and year, and some are full dates. I didn’t do that for the popAlbums table — I didn’t anticipate ever caring about the date format.

So I have dates that are kind of alike. In Figure 01, I have sixteen rows inserted on March 10 but all at different times. If I muster my limitless optimism and try using just the date …

SELECT * FROM popAlbums WHERE date='2018-03-10';

that fails. Zero rows.

We know that if I were to select those dates to display on a web page, I can write something like the above PHP so what I’m going to do is

  1. Create a new column called, say, justDate.
  2. Fill all the empty cells in the justDate column with a converted date from the old date column.
  3. I’ll scour all those rows for any obvious yuckiness and test for any less-obvious.
  4. Once I’m sure everything is okay, I can delete the old date column.

I created the new column under the Structure tab in PHPMyAdmin, then hopped over to the SQL tab and:

UPDATE popAlbums SET justDate = date_format((popAlbums.date), '%Y-%m-%d');

Ta-da!

Figure 04

Because I’m a very cautious man, I copied the table before trying the next couple things without reducing myself to a sobbing mess on the floor. In popAlbums2, I deleted the original date column and proceeded with this:

DELETE a FROM popAlbums2 a
INNER JOIN popAlbums2 b
WHERE a.albumID = b.albumID AND a.justDate = b.justDate AND a.id < b.id;

In line 2, we create two tables (sort of–if we make a change in a, it also happens in b because they’re the same table–don’t think about it too hard) by using INNER JOIN.

Line three looks at all the rows in b. When it finds more than one row for a given album on a given day (a.albumID = b.albumID and a.justDate = b.justDate) like the sixteen rows on March 10 in Figure 01, it compares the id of the row in b to the id of the first similar row in a:

  • Is a.id ‘6001’ less than b.id ‘6001’? No. Move on to the next row in b.
  • Is a.id ‘6001’ less than b.id ‘6037’? Yes. Delete it and move on.

I held my breath and ran the query for a very, very long 16 seconds.

Rows affected = 11,850

I am SO glad I made a copy of this table. I moved back over to the Browse tab and … here’s the final result!

Figure 05 (Compare to Figure 01)

Lastly, I renamed the popAlbums table to popAlbumsOLD and renamed popAlbums2 to popAlbums as well as renamed the justDate column to date so all my functions and pages still work.

I had to hold my breath for a much longer 33 seconds while I ran the date_format query on the popTracks table’s 1,646,115 rows.

My DELETE query put me through, as NASA would say, 187 seconds of terror (it seemed like a lot longer) to delete 48,492 rows.

UPDATE: Fortunately, I remembered I had to change how the date gets inserted into the new column. Before, the default was just “current timestamp.” In my PHP, I changed it to use the MySQL curdate() function.

Schema Revision and PoliticalPorn

I got to my sillyDayJob early this morning so I could work on my SQL query because using JSON and javascript methods got the same weird results. I went to print out the picture of my revised schema and saw the image was more out of date than I thought so I opened it in Photoshop and updated it.

I keep wanting to just put all of this information in one table but I’m very stubborn about wanting to master SQL for properly normalized databases. While working on that, some lady came into our office and asked if I would be interested in a League of Women Voters guide to amendments on the ballot. Would I? I jumped up and shouted, “And how!” No, actually I didn’t but that’s how I felt. I then really wanted to be all Mad Men and shout, “Thanks, this is sure swell!” because … obviously, I’m just a big geek.

It feels good to be a political geek again.

andHow
Yes, people in Florida can read. Yes, people in Florida vote. *sigh*

Hmm … a couple of those don’t show the table’s primary id column. [Below I have an updated update of the revised revision.]

But … already … I think I know how it should and will work.

In related news, two of my pull requests for Hacktoberfest are for OpenElections repos.

Which … makes me sad … because it makes me think about the fact that if openFEC provided election data through their API instead of jacked up PDFs, etc. then … the fine folks at OpenElections wouldn’t have to work so hard.

Update 13 hours later: Getting closer …

Screen Shot 2018-10-12 at 7.14.39 PM.png

There’s probably a better way to do it, but … it’s working (so far) and that’s all I care about at this point.

Screen Shot 2018-10-12 at 7.14.14 PM.png

Here is the corrected revised revision of the revision.

schema_101618.png

Where Were These Resources Weeks Ago?

After learning about IndexedDB in the GrowWithGoogle Mobile Web course, I’m enhancing my big personal project to create a local backup of the MySQL database using IndexedDB with PHP and jQuery

If only I’d been aware of this “Google Search” thing a month or two ago … 😉

From MySQL to IndexedDB
https://stackoverflow.com/questions/44638999/from-mysql-to-indexeddb

Also mentions PouchDB and CouchDB
https://stackoverflow.com/questions/37921898/synchronization-between-mysql-and-indexeddb

Synchronization between mysql and IndexedDB
https://codeforgeek.com/2014/11/sync-app-mysql-indexeddb/

SyncedDB repo
https://github.com/paldepind/synceddb

Client-side database solutions: IndexedDB (says “teh codez”)
https://www.corporate3design.com/blog/24

Create Your Own To-Do App with HTML5 and IndexedDB
http://blog.teamtreehouse.com/create-your-own-to-do-app-with-html5-and-indexeddb

Offline Add, Edit, Delete Data in HTML5 IndexedDB
https://www.mindstick.com/Articles/1535/offline-add-edit-delete-data-in-html5-indexeddb

Self-Doubt Pays Off

TIL (Theotherday I Learned) That Word “Group” Doesn’t Mean What You Think It Means.

Here’s my schema so you can follow what I was trying to do.

popRock_schema.png
Figure 1. My Schema

My database gathers data I collect using the Spotify Web API on the popularity of artists, albums, and tracks over time. For example, I have a weekly cron task that grabs the popularity of a bunch of artists.

This query’s ultimate purpose is so I can show the latest popularity scores for all albums by a given artist. As pictured below, it shows the artistName, albumName, year (the album was released), and latest popularity score. It needs the albumID for two of the tables it selects from and the artistID to select from two tables. I only want the latest popularity score so it selects from the popAlbums table multiple times.

After several days of trying to get this complex query to work (I kept getting a “mysqli_fetch_array() expects parameter … to be mysqli_result, boolean given in” error), I broke it up into a couple views and incorporated them into this query:

yayKickAssRockOn.png
Figure 2. Query with Views Now that I think about it, in retrospect, those scores may not be accurate because maybe that popAlbumsLatest view doesn’t really group them how I want. I’ll have to check. 

It worked fabulously. I was delighted. But I needed to have a PHP variable in a couple spots to generate it dynamically so I couldn’t use views. It still took me a while longer to get the sub-queries to function properly. This is what I came up with. It uses the actual artistID instead of a variable for testing purposes.

badScabies_workBench2.png
Figure 3. Query #1 using subqueries instead of views Stick a pin in that score and date for Trash.

It worked fine in MySQL Workbench (which I’ve fallen in love with along with Sequel Pro for Mac OS) but I’ve seen stuff be fine in my editor and fail with the Terminal, browser, server, etc. so I uploaded that and when it actually worked in the browser from my server—instantly with no error—my first reaction was shock and skepticism.

badScabies.png
Figure 4. Results of Query #1 Pay attention to that score for Trash (among others).

Maybe it was pessimism or maybe it was intuition but my first thought was, “That can’t be right.”

The browser showed only one result per album but … still … my gut didn’t accept it.

I ran a query for all (Alice Cooper) results in the popArtists table to see if these were, indeed, the latest results and not … whatever … so I looked for album popularity scores that were different the last time I pulled them from Spotify.

allTrashWorkbench2.png
Figure 5. Checking my work All dates except the most recent have pop=63 but Dec 2nd has pop=62

What I saw was that the score was, indeed, incorrect in Figure 3 and Figure 4. But the date was correct (see Figure 3).

“This isn’t going to end the way you think” Luke said but I didn’t listen.

My results were showing the latest date but not the latest popularity score.

Now I was really confused. But fascinated. I had that gleeful feeling I’ve heard scientists get when they say, “THAT’s not supposed to happen!”

Some googling told me Group doesn’t work exactly how you think and, thankfully, one of the solutions was presented and explained in a way that worked for me.

GROUP BY groups the specified column but, apparently, it doesn’t necessarily keep all the other columns with it … you know, like in Excel … or, at least, it doesn’t … I don’t know. It just doesn’t work like most people think it should. I am far from the first to encounter this phenomenon.

I’m still not sure I could explain how the solution works to someone else but it involved yet another sub-sub-sub-query that first groups the results then grabs the row with the … but you’d think that’s what … nevermind … just trust me the solution works.

Here’s what finally worked:

goodScabiesWorkbench2.png
Figure 6. Correct date with correct score. 

And the result

goodScabies.png

Note several of the scores changed once the query is written* and works correctly.

*It worked just fine … MySQL isn’t broken …

Lastly, I love both MySQL Workbench and Sequel Pro because they allow me to work remotely via ssh getting instant test results (not having to write a file, upload it, test it in the browser, rinse, repeat like I was doing) while not worrying about people attacking, say, an installation of PHPMyAdmin (no matter how securely I’ve locked it down). These apps made this and so much other work go SO much faster so it was so much more enjoyable.