Unexpected Data Cleaning

I’ve had to get creative with writing SQL queries to start the process of merging my Spotify data and my LastFM data. At first I thought it was going rather well if not very fast. I was definitely exercising my brain and strengthening my SQL and algorithm skills.

I was quite pleased and relieved upon realizing all the releases in a LastFM release-group–as well as all common recordings across them–shared the same Listeners and Playcount numbers.

What does all that gibberish mean? Check out my first post about Using the Last.fm and MusicBrainz APIs.

Think of an album. Any album. I’ll think of 13 by Black Sabbath because that will come up here in a minute. That album is released in multiple countries, even in a single country there are multiple versions of an album such as different covers, deluxe versions with bonus tracks, after a few years the remastered version, and so on. Each unique version is a “release” in the 13 “release-group.” What you and I would call “tracks,” MusicBrainz calls “Recordings.”

Table listing data from Last FM about the Black Sabbath album thirteen.
Figure 01: Data porn in the form of the “13” Release Group.

Anyway, all nineteen versions of 13 (Figure 01) are really all the same album with mostly all the same songs so, thankfully, LastFM gives them the same standard MBID and numbers. I’m relieved because they could have been pricks (as they’re known to be) and said more people listen to the version of “God Is Dead?” on vinyl than on the digital copy from iTunes but–again, thank God–they don’t. See what I did there? With the “thank God”? I am so clever.

Just in case you’re as aroused as I am looking at all the yummy goodness from MusicBrainz in Figure 01, here’s another screenshot.

13_ReleaseTracks.png
Figure 02: Just look at it! My imagination is running wild!

At first, I thought it would be messy kinda like Spotify, so I had Python grab every MB release for which LastFM had data and store it all in a JSON file (so I could review it and plan for my other data science and data visualization needs & wants).

PythonValidation.png
Figure 03: A “valid” release, for me, is one for which LastFM has data. There’s no reason to keep that data and waste time checking all of them every time.

I wanted to merge or at least somehow “link” the Popularity and Followers data from Spotify with the Listeners and Playcount data from LastFM for something resembling easy access. After deciding I only needed one set of stats per album and/or song from each release group, I told PHP to just grab the first release from each release-group for my database. 

python_JustFirstRelease.png
Figure 04: That argument in the if statement came after a lot of time and frustration caused by the fact that I didn’t know some release-groups didn’t have any releases. I know.

I added columns to my current albums table and created an albumsMB table with plans to merge them. To make that easier, I chose to make some temporary, redundant columns rather than convoluted JOINs and sub-queries. Then I made copies of those tables and ran some tests on those temporary copies. I have far too much data I love and am attached to — I am not going to risk losing it no matter how trivial the task is on which I’m working.

I played with two tables. The table of album info from MusicBrainz (below) and my existing table of album info from Spotify.

BS_albumsMB.png
Figure 05: Wherever there’s a Black Sabbath MBID in my albumsMB (album info from MusicBrainz) table, I added Black Sabbath’s Spotify ID. Easy.

I wondered–a lot–whether I really needed more columns from the data like “country”, “disambiguation,” etc. but all my SQL test queries worked so well so easily, I eagerly, perhaps hastily went in the “opposite” direction. I thought it was enough to tell PHP that wherever the album title and Spotify artist ID from the matched, add the album’s MBID to my existing albums table.

I was SO excited. I was finally going to have complete charts–with quantifiable data–for Black Sabbath that included all the Tony Martin era albums Spotify lacks. I was totally going to take screenshots and send them to Tony and he was going to be so grateful and we’d be best buds and I could move onto actual new features and stuff since my data was massaged and merged and yay!

I switched browser tabs to bask in my victory.

BS_albumsPopList.png
Figure 06: This page has things I’ll fix and temporary stuff I’ll remove once everything works.

I didn’t scroll down so I didn’t see this (Figure 06) whole thing. I didn’t notice there were actually three albums with LastFM data. I thought I’d not yet added code to the query that should populate the Listeners and Playcount columns but when I checked, I saw it was there and should be working. I don’t know which I saw first — those three rows with LastFM data or … this …

BS_albumsCropped.png
Figure 07: The albums table of Spotify album info. Only three rows received a MBID.

I was more confused than frustrated already by all the missing data in the albumMBID column when I noticed the two tables also contained very different albums.

Later: I wonder how the releases are ordered … but I can’t control the order they come in … regardless of how they are in their home database … I may have to use some surgical precision … with certain properties.

I should have anticipated that Tony Martin’s album MBIDs wouldn’t have a place to go but I also never noticed most of those Spotify albums include “(Remastered Edition)” in the title which meant most of the titles didn’t match. I wasn’t happy about that … that potentially meant some exhausting work with RegEx. I thought I might get off easy if I could change some existing columns to use FullText but I immediately thought of potential problems.

I half-heartedly started copying MBIDs and pasting them into the empty fields but then deleted them as I thought of … not only did I not want to do this manually for all my current (and future!) MB albums but what if I added columns later or found a more … accurate-ish, clean-ish solution. Which is what I started doing earlier and what I’m going to do now. Add more columns for the properties I mentioned earlier. One of the JSON files to which I referred when planning all this had no values for all the keys I thought of using. It looks like the most recent Black Sabbath data (from yesterday, FWIW) has values for most of them. I’ll need to add equivalent columns to the Spotify albums table, too.

In the back of my mind, I kept wondering why my Spotify albums list was so short. I keep adding and removing values from the “type” and “group” key in the PHP file that fetches Spotify album data. I’ll have to do that again so I can have more albums from each source so I can, hopefully, have more matches.

As I wrote this post and went to MusicBrainz to take screenshots, I saw this …

BS_MusicBrainz_Discography.png
Figure 08: So THERE are the albums that Spotify had but my MB/LFM data was missing!

Later, I also noticed the “Type: Album” bit in Figure 02.

I now know I have to request “Album + Compilation” and “Album + Live” to get, for example, Reunion. But … why don’t I have Vol 4 from MusicBrainz? It can’t be possible there were no releases from the Vol 4 release-group with data at LastFM … right?

Well, time to get knee and elbow deep with even more data about each release-group and release …

Yes, I’m aroused by that. 😉

P.S. I wonder if … and hope that … MusicBrainz and/or LastFM have the CD I once found at a truck stop … it was called something like “The Essential Black Sabbath” or “Black Sabbath’s Greatest” and it was exclusively Tony Martin tracks. I’d love find that and someday get it autographed just because it’s so … peculiar.

Update: Finding that Tony Martin Compilation

I have no idea if is even any of those listed in Figure 08, so …

TMdiscograpny.png
Figure 09: I love how precise MB users are — that someone made sure to add “featuring Tony Iommi” to Seventh Star.

Step #1 Click any album in their discography (Figure 09) between 87-95 except Dehumanizer. Poor Tony, man.

headlessCrossReleaseGroup.png
Figure 10: Headless Cross release group

Step #2 Click any release (Figure 10).

headlessCrossRelease.png
Figure 11: Dang, dudes! They gotta say “ex-Black Sabbath vocalist”? I wonder if Dio suffered the same fate. 

Step #3 Click Tony Martin’s name in the credits (Figure 11).

TMdiscography.png
Figure 12: Tony Martin’s discography

Crap. I was hoping it would be listed right here (Figure 12) but, I suppose this makes more sense.

Step #4 Click “Show all release groups instead” (Figure 12) or anything else I need to.

Crap #2. I’m actually going to have to go through all of those compilations …

I opened each compilation 1996 and later (that I knew wasn’t specifically Ozzy or Dio) in another tab.

*Sigh* Each tab is a release group so requires at least one more click.

I checked eight of those candidates:

  • The Sabbath Stones
    Six instrumentals, seven Tony Martin tracks, and one apiece by Dio, Ian Gillan, and Glenn Hughes. No Ozzy.
  • Greatest Hits
    Ozzy & Dio
  • Ozzy is the only vocalist featured on:
    • The Ultimate Collection
    • Forever
    • Greatest Hits
    • The Collection
    • Rock Giants
  • Rock Champions is all Tony Martin!

I checked a few albums and Dio is never referred to as “ex-vocalist”.

I feel it’s also worth mentioning The Best of Black Sabbath which is an unusually respectable–albeit unbalanced–compilation as these things go. First of all–awesome cover.

  • 28 songs by Ozzy
  • 2 by Dio
  • 1 awful song by Ian Gillan
  • 1 instrumental

Not a single Tony Martin song.

 

Advertisements

Finding and Deleting (almost) Duplicate Rows in MySQL Part 2

Background:

  • I copied the original popArtists table, renaming the original popArtistsOLD
  • I worked (see below) on the copy which is now called popArtists

When I tried repeating my process for finding and deleting duplicates in a MySQL database (not to mention converting dates to assist with that and make future data wrangling easier) with popArtists, () it “loaded” (I think the message said “Loading”) for a long, long time before, eventually giving some sort of error stating it had pooped out and/or lost a connection. Checking the new, copied table, it appeared as though no rows had been affected — there were still 728,082 rows. When I logged in and checked this morning, however, the table showed only 56, 049 rows! So, I’m going to copy the original table again and try some tests.

NOTE TO SELF: Actually write out the following:

  1. SELECT Alice Cooper in SQL tab
  2. Export as CSV for Excel
  3. Import CSV into Excel
  4. Highlight duplicates

There were three instances of two rows for one date and one instance of four rows in a single date. I have 328 artists in my database so if each artist had the same amount of duplicates, there would be only 3,280 extra rows to delete, not 672,033.

I didn’t want to export and import a CSV file with 728,082 rows but it looks like I have to if I want to find the problem.

  • I made another copy of the original called popArtistsNew.

The exported CSV file is just over 53 MB!

After Excel finished importing the CSV, it said, “730,440 rows loaded” from popArtistsNEW which should have 728,082. I double-checked popArtistsNEW in PHPMyAdmin and, unexpectedly, it has 727,834 rows. All three should be the same.

Now I’m worried. And my face is stuck in a WTF expression.

Because of the sheer size of the spreadsheet, Excel crashed every time I tried using Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values so, instead, I tried Home > Conditional Formatting > New Rule. The formula, after much Googling, trial & error, baffling results, and profanity is:

=OR(AND($B2=$B1,$D2=$D1),AND($B2=$B3,$D2=$D3))

Figure 01

Let me break the formula down and explain it. First, look at it like this:

=OR(Condition1, Condition2)

The ‘=‘ sign tells Excel this is a formula, not text. ‘OR‘ tells Excel that if either Condition1 or Condition1 is true, apply my orange fill. This next part is where people can get confused. Just like an ‘OR’ statement has two conditions, an ‘AND’ statement has two conditions, like so:

=AND(ConditionA, ConditionB)

The difference here being if both ConditionA AND ConditionB must be true.

So, my formula is actually this:

=OR(AND(ConditionA, ConditionB),AND(ConditionC , ConditionD))

which means:

  • If ConditionA AND ConditionB are true OR
  • If ConditionC AND ConditionD are true

Make that whole row of cells orange.

After a whole lot of scrolling–which was far more stressful that it should be because I had, really, almost a million rows–I saw that I finally had only my duplicate rows highlighted (seriously, you should have seen the horrific crap Excel did for some of the “errors” in my “trials”).

Now, I just needed to count them and see if there were, actually, several hundreds of thousands. Mostly, I saw chunks of 2, 4, or 6 rows in a single day. But then I started seeing chunks of 556 rows for a single artist in a single day. Notice “chunks” is plural. Lots of ’em.

When you import a CSV file, Excel automatically gives you Sort & Filter menus for each column. I used

FilterMenu > Filter by Color > Filter by Cell Color

FilterMenu > Filter by Color > Filter by Cell Color

The Excel status bar at the bottom said 676,965 displayed out of 730,440. So, I’m satisfied.

Finding and Deleting (almost) Duplicate Rows in MySQL

The database for my PopRock web app has evolved as I learn and as my needs & wants expand. Learning means mistakes and mistakes with MySQL (and PHP) mean duplicate rows. Sometimes lots of them–I once forgot to empty an array at the end of a loop (inside another loop). It didn’t take long to notice that page was getting increasingly slower … because it had millions of rows.

The PopRock database contains data fetched from the Spotify Web API using cron jobs for:

  • Artist popularity – daily
  • Album popularity – weekly
  • Track popularity – weekly

In the last couple days, I noticed some of these items hadn’t been fetched in a couple months–or a couple years.

Sidebar/Note to self: I also noticed that some albums (anthologies) didn’t have all their tracks in my database — my solution to (and explanation of) that problem will probably be my next post.

Debugging and testing also has the side effect of creating lots of duplicate rows. For example:

albumID_01.png
Figure 01

There’s also this from when I added a date column after I’d started fetching data. I am fine with deleting all of these and doing so is easy.

noDateInPopTracks.png
Figure 02

And, lastly, for a bit, my cron job was getting twice a day:

twice.png
Figure 03

This happened because my albums and tracks cron jobs works like this:

  1. Get an array of artists
  2. For each artist
  3. get an array of their albums
  4. For each album, get that album’s popularity score from Spotify

I do have one array containing all of my artists for getting their popularity scores but that task is relatively small. Looping through every artist then getting all of their albums is a bigger task. Then getting all of the tracks for each album is even bigger. So, for those last two tasks, I’ve broken up my artists into smaller arrays.

If, accidentally, I’ve placed one of my artists into two of those arrays, everything for that artist gets done twice (that last screenshot). How does one accidentally get millions of rows like I mentioned earlier? One accidentally writes (or copy & pastes together) a script that, basically, says something like: For every track in David Bowie’s oeuvre, loop through all of David Bowie‘s albums that loop through every track on each of those albums. Yeah, imagine that grows like a zombie apocalypse.

My page for listing all an artist’s tracks has been slowing again, lately, so I need to:

  • Find out if something works but is happening more than it should
  • Resume work on making PopRock a Progressive Web App (PWA) so it stores data locally, speeding up everything it does.

Finding & Deleting Duplicate Rows

Finding and deleting all “duplicate” rows for Figure 02 is easy. I just need to find all rows where date equals 0000-00-00 00:00:00.

SELECT * FROM popTracks WHERE date='0000-00-00 00:00:00';

That query returns 1,116 rows out of 1,622,599 for (currently) 72,997 tracks so I’m confident I won’t lose anything significant when I delete them like

DELETE FROM popTracks WHERE date='0000-00-00 00:00:00';

leaving 1,621,483.

The other two aren’t so easy.

I don’t want to delete every row inserted on a given date–I want to delete all but one. Also, I created an extra hurdle when coding my PHP. When I get a new album to insert into my albums table, I format the date:

$albumReleasedWhole = $album->release_date;
$albumReleased = substr($albumReleasedWhole, 0, 4);

I do that because some release_dates are a four digit year, some are month and year, and some are full dates. I didn’t do that for the popAlbums table — I didn’t anticipate ever caring about the date format.

So I have dates that are kind of alike. In Figure 01, I have sixteen rows inserted on March 10 but all at different times. If I muster my limitless optimism and try using just the date …

SELECT * FROM popAlbums WHERE date='2018-03-10';

that fails. Zero rows.

We know that if I were to select those dates to display on a web page, I can write something like the above PHP so what I’m going to do is

  1. Create a new column called, say, justDate.
  2. Fill all the empty cells in the justDate column with a converted date from the old date column.
  3. I’ll scour all those rows for any obvious yuckiness and test for any less-obvious.
  4. Once I’m sure everything is okay, I can delete the old date column.

I created the new column under the Structure tab in PHPMyAdmin, then hopped over to the SQL tab and:

UPDATE popAlbums SET justDate = date_format((popAlbums.date), '%Y-%m-%d');

Ta-da!

Figure 04

Because I’m a very cautious man, I copied the table before trying the next couple things without reducing myself to a sobbing mess on the floor. In popAlbums2, I deleted the original date column and proceeded with this:

DELETE a FROM popAlbums2 a
INNER JOIN popAlbums2 b
WHERE a.albumID = b.albumID AND a.justDate = b.justDate AND a.id < b.id;

In line 2, we create two tables (sort of–if we make a change in a, it also happens in b because they’re the same table–don’t think about it too hard) by using INNER JOIN.

Line three looks at all the rows in b. When it finds more than one row for a given album on a given day (a.albumID = b.albumID and a.justDate = b.justDate) like the sixteen rows on March 10 in Figure 01, it compares the id of the row in b to the id of the first similar row in a:

  • Is a.id ‘6001’ less than b.id ‘6001’? No. Move on to the next row in b.
  • Is a.id ‘6001’ less than b.id ‘6037’? Yes. Delete it and move on.

I held my breath and ran the query for a very, very long 16 seconds.

Rows affected = 11,850

I am SO glad I made a copy of this table. I moved back over to the Browse tab and … here’s the final result!

Figure 05 (Compare to Figure 01)

Lastly, I renamed the popAlbums table to popAlbumsOLD and renamed popAlbums2 to popAlbums as well as renamed the justDate column to date so all my functions and pages still work.

I had to hold my breath for a much longer 33 seconds while I ran the date_format query on the popTracks table’s 1,646,115 rows.

My DELETE query put me through, as NASA would say, 187 seconds of terror (it seemed like a lot longer) to delete 48,492 rows.

UPDATE: Fortunately, I remembered I had to change how the date gets inserted into the new column. Before, the default was just “current timestamp.” In my PHP, I changed it to use the MySQL curdate() function.

Data Science Isn’t Always Sexy and Glamorous

Sometimes data science is a bunch of debugging and fact-checking.

Bit o’ trivia about me: I got into this because I wanted to start using open data and APIs instead of constantly fact-checking frequently inaccurate data from my co-workers. Now, I find I have to fact-check my data. Not sexy. Not sexy at all.

I’ve wanted to write this post for a long time but, until now, I felt it would seem like mere whining … or conspiracy ranting (you’ll see).

Gathering complete, comprehensive election results seems an impossible task. It’s almost (get ready) as if somebodyThey … don’t want us to have them (imagine my voice in any shrill tone you like).

It should be easier. Much easier.

I promise I won’t complain about every little thing because I’ve complained about some of these things before — such as

  • The FEC provides Excel files for some elections but only PDFs for others
  • The FEC API doesn’t provide access to election results at all
  • The aforementioned results aren’t available on their website for months
  • Politico has results in real time and even updates them for a couple days after the election but the page is then like an operating system update or rendering video … it stays at 99% complete forever. Even now, as of December 10, 2018, Alaska shows only 99.5% precincts reporting.

Last night, I opened VS Code to continue adding features to Election Insights (the web app formerly known as prezPlayPro) when I noticed a few things were somehow broken since my triumphant post of November 23 — not only did some results change but some maps were downright broken.

Long Story Short: I now test everything using a private or Incognito window so I can be at least a wee bit more sure I’m looking at the latest code. Results that I expected to change after a fix made 2-3 weeks ago finally showed up. So, nothing was broken or causing incorrect results but I only know the “new” results are accurate because I did some digging in several piles of data to confirm … digging which should have been easier.

I had two potential problems I needed to investigate. Two questions needed answering:

  • Did Evan McMullin really beat Darrell Castle in a buttload of states?
  • Why do I have two candidates (Darrell Castle and Emidio Soltysik) affiliated with the U.S. Taxpayers party in Michigan?

In my previous, Final 2016 Presidential Election Maps, post, I realized I hadn’t included Evan McMullin in my arrays of “right-leaning” candidates. Much to my surprise, adding him to those arrays didn’t change any results (or so I thought). Last night, when I saw that including him may have drastically changed the results, I realized one of two things was true — either my code was broken or my database contained mistakes.

I chose to look into Texas because when I moved my hand, that’s where my cursor landed, showing me McMullin. My results are taken from the PDF from the FEC but, FORTUNATELY, I didn’t go directly to that PDF to confirm results. I also wanted to check party affiliations which I got from Ballotpedia (whom I’ve whined about previously for other issues even before the inaccuracy I just found). Otherwise, I wouldn’t have found some of the groovy things I did.

So, first, I went to the Ballotpedia page for Michigan’s 2016 presidential election results. Much to my relief, the mistake was theirs.

ballotpediaMichigan2016.png
I took all my party affiliations from their Results tables which, at least in this case, differs from the list above.

When I first started this project, I tried using Python‘s Beautiful Soup to grab info like that in the above screenshot from Politico because they conveniently listed every state on a single page. Unfortunately, the code is filled with inconsistencies and invisible crap neither I nor Beautiful Soup could beat into submission. Also, if memory serves, candidates’ names were spelled differently on different state ballots. <– That’s infuriating fact #4,987 on the list.

So I just did some major cutting and pasting to fifty pages I saved from Ballotpedia which sucked in it’s own way because you can’t right-click on their US map to open them in separate tabs — you have to click each one and, after saving the state page, click the Back button to get back to the map.

Before I noticed the Ballotpedia candidate list contained different parties than the results table, I followed the link to their data source (Michigan‘s Secretary of State or, as Ballotpedia calls it, “Department of State”) but when I clicked it, got a 404. Several of the source links at Ballotpedia have the same result but I don’t know whether I should be frustrated with Ballotpedia for having broken links or, as I’d thought previously, frustrated with those states for not keeping their results pages up. My FEC results PDF lists parties for each candidate (but not, much to my chagrin, by state). There I found Soltysik listed as Natural Law Party (which is still kinda conservative, if my recollection is correct) and Socialist Party USA (like the Beach Boys song).

Not yet noticing the mistake in the screenshot above, I set my party affiliation problem aside for the moment and went to Ballotpedia’s Texas page so I could confirm my results (from the FEC) for Castle/McMullin.

ballotpediaTexas2016.png

Ballotpedia doesn’t even list McMullin as a candidate in Texas but does list 51,261 write-in votes. Ever the optimist, I clicked the link for Texas Secretary of State.

TexasSoS2016.png
And, as it turns out, McMullin wallops Castle in Texas.

Black gold! Texas tea! Comprehensive election results, that is!

Note that most of those 51k+ write-in votes are for a single candidate. I think that’s rather significant. If I were the type to post election results, I might consider including that bit of information. Of course, Ballotpedia is probably in the pocket of the Commission On Presidential Debates (who fit nicely in the pocket of Big Insurance who are run by the Illuminati).

Now I was curious if Politico limited their results like Ballotpedia.  I had to go there anyway to see what party affiliation they had for Soltysik anway, so … after finding Soltysik was accurately listed as NLP in Michigan, I saw Politico‘s Texas results were wanting as much as Ballotpedia’s.

TexasPolitico2016.png
These are Politico’s “detailed” results.

Now I was grateful I couldn’t get Beautiful Soup working to my satisfaction with Politico. I’d have missed out on a bunch of candidates!

I still have much digging to do because far too many of my candidates have “null” for party affiliation — not to mention I now know I must fact-check whatever I find. Getting data directly from each state would be best, of course, but since Ballotpedia’s links don’t go anywhere, that won’t be as easy as I’d like.

Adding Genres to my Rockin App

Currently, I have an empty genres table that looks like this:

emptyGenres.png

My thought when I created it was I’d just use booleans for each column (except artistID, of course). I was going to use these as a short list of categories to create charts with but I’m now thinking I want to use all of the genres Spotify may provide.

Their console gave me the following genres for Alice Cooper:

  "genres": [
    "album rock",
    "classic rock",
    "comic",
    "glam metal",
    "glam rock",
    "hard rock",
    "metal",
    "protopunk",
    "rock"
  ]

I’m fine with that. I wrote about Spotify genres once before, wondering how artists were assigned genres. Were tracks and albums assigned genres and the artist was then assigned all of those? That makes the most sense to me, but I don’t think that’s how Spotify does it.

I wanted to test and dig around but despite the fact that Alice Cooper has tracks that might not fit any of those genres, I couldn’t think of an Alice Cooper album that would go beyond those listed above. So I did Iggy Pop:

"genres": [
    "alternative rock",
    "classic rock",
    "dance rock",
    "garage rock",
    "glam rock",
    "new wave",
    "permanent wave",
    "protopunk",
    "punk",
    "rock"
  ]

The first thing I notice is Alice didn’t get “punk” despite being voted “Punk of the Year” in 1973 by Cream magazine readers. I don’t even know what “permanent wave” is. Anyway, I chose Iggy because I wondered what genre his two French language jazz albums might fall under.

Preliminaires (33EUXrFKGjpUSGacqEHhU4)

"genres": []

One version of Preliminaires has a genres property but no values.

Preliminaires (0bwmAmXOTTfX1PMS0psdjM)

"genres": []

Same.

Après

"genres": []

Same. No surprise.

As a “control group,” I also tried:

  • Raw Power – nope, no genres
  • Lust for Life – no genres

Anyway … what I’m wondering is — can I use Spotify’s genres in the way that I want to? If I want a list of punk bands, will One Bad Pig appear or not? If they don’t, I can add the genres myself … the question really is, I guess … how do I design the table(s)? Maybe just a two-column table (artistID, genre) with an index for each. And … create a function that

  1. gets an artist
  2. gets the genres array
  3. for each genre in genre, insert a row with the artistID and the genre

Man, I hope the spelling for their genres is standardized … like … no “Funk” and “Phunk” or some such nonsense.

Yeah, that’s what I’ll do.

One Bad Pig, by the way, gets:

"genres": [
    "christian metal",
    "christian punk",
    "deep christian rock"
  ]

See, this is why I didn’t want to use Spotify’s genres. I have to write extra code to make sure a “punk” chart includes “punk” but also “protopunk” as well as “christian punk” and … whatever else. Seriously — ew.

The silver lining is, it will all make for a killer relationship chart.

Where Were These Resources Weeks Ago?

After learning about IndexedDB in the GrowWithGoogle Mobile Web course, I’m enhancing my big personal project to create a local backup of the MySQL database using IndexedDB with PHP and jQuery

If only I’d been aware of this “Google Search” thing a month or two ago … 😉

From MySQL to IndexedDB
https://stackoverflow.com/questions/44638999/from-mysql-to-indexeddb

Also mentions PouchDB and CouchDB
https://stackoverflow.com/questions/37921898/synchronization-between-mysql-and-indexeddb

Synchronization between mysql and IndexedDB
https://codeforgeek.com/2014/11/sync-app-mysql-indexeddb/

SyncedDB repo
https://github.com/paldepind/synceddb

Client-side database solutions: IndexedDB (says “teh codez”)
https://www.corporate3design.com/blog/24

Create Your Own To-Do App with HTML5 and IndexedDB
http://blog.teamtreehouse.com/create-your-own-to-do-app-with-html5-and-indexeddb

Offline Add, Edit, Delete Data in HTML5 IndexedDB
https://www.mindstick.com/Articles/1535/offline-add-edit-delete-data-in-html5-indexeddb

My Second Schema Revising

“Hmm …” he says, realizing he’ll want data viz for multiple elections and many candidates run in multiple elections … and sometimes they change parties …

  • Where should the year go?
  • Where the party at?

Oh, man … the number of Electoral College votes a state gets can change (based on population) as well …

And, aw, crap! Two states (Nebraska and Maine) don’t follow the stupid “winner take all” rule. Normally, I’d like that but it’s a pain in my butt now.

schemaRevised
Later changed “popVotes” table name to “results.” Considered having different results tables for different years but I think having affiliations and results structured the same will help me keep my code straight in my head.

Note how New York and Michigan lose votes with each census while Florida and Texas gain them (check out other examples in the screenshot below). The number of votes a state gets changes every ten years with the census so my electoralCollege table has columns for each first election of the decade so the following election years are grouped together.

  • 1992, 1996, 2000
  • 2004, 2008
  • 2012, 2016, 2020

Here’s what I made in Excel that I’ll save as a CSV.

ecExcel

It’s irritating how many different websites I have to visit to get these numbers, election results, etc.

News media get these results in real time so somebody has this information handy somewhere. Ugh.

It’s also irritating that, in an effort to find a faster way to convert the (sometimes PDF to) Excel to something usable in CSV I spend more time than if I just tediously copied and pasted. I do the same thing with my code. I think there must be a more elegant, faster way to do stuff when, nope, it’s only code, and it’s more simple and easy than I’m making it. Having said that, I’m unsure if this is the best way for this and other tables in this database to be organized, but here’s how I’m using the data for party affiliations.

affiliationsExcel

In the original PDF from the Federal Election Commission (FEC), it looked like this:

Evan McMullin (CCS, IDP, IND, BFA, NP, UN)

I haven’t thought too much yet about how I’ll query for party affiliation and morph that into my broad-stroke “socialist” and “conservative” big tents so I tried to make it as simple and malleable as possible. I’m very anxious to get started — this concept has been wanted to be born for a long, long time — so I’m allowing myself to rush a bit. I don’t mind having to fix it later if I must — it’ll be a fun learning experience (I hope).

It’s late in the day and I’m not sure if this is the best schema because I’m rushing a bit so I’ll need to review this later but this is what I’m going with to start.

electionDB2
Yep. Some of those commas shouldn’t be there. Noticed when I tried pasting into my Terminal.