The database for my PopRock web app has evolved as I learn and as my needs & wants expand. Learning means mistakes, and mistakes with MySQL (and PHP) mean duplicate rows. Sometimes lots of them: I once forgot to empty an array at the end of a loop (inside another loop). It didn’t take long to notice that page getting slower and slower … because it had millions of rows.
The PopRock database contains data fetched from the Spotify Web API using cron jobs for:
- Artist popularity – daily
- Album popularity – weekly
- Track popularity – weekly
In the last couple of days, I noticed some of these items hadn’t been fetched in a couple of months, or even a couple of years.
Sidebar/Note to self: I also noticed that some albums (anthologies) didn’t have all their tracks in my database — my solution to (and explanation of) that problem will probably be my next post.
Debugging and testing also have the side effect of creating lots of duplicate rows. For example:
There’s also this (Figure 02) from when I added a date column after I’d started fetching data. I’m fine with deleting all of these, and doing so is easy.
And, lastly, for a bit, my cron job was running twice a day:
This happened because my albums and tracks cron jobs work like this:
- Get an array of artists
- For each artist, get an array of their albums
- For each album, get that album’s popularity score from Spotify
I do have one array containing all of my artists for fetching their popularity scores, but that task is relatively small. Looping through every artist and then getting all of their albums is a bigger task. Getting all of the tracks for each album is bigger still. So, for those last two tasks, I’ve broken my artists up into smaller arrays.
If I’ve accidentally placed one of my artists into two of those arrays, everything for that artist gets done twice (that last screenshot). How does one accidentally get millions of rows like I mentioned earlier? One accidentally writes (or copies & pastes together) a script that basically says: for every track in David Bowie’s oeuvre, loop through all of David Bowie’s albums, and for each of those albums, loop through every one of its tracks. Yeah, imagine how that grows. Like a zombie apocalypse.
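The duplication mechanism is easy to reproduce. Here’s a minimal Python sketch (the batch lists and album lookup are made up for illustration): if an artist lands in two batches, every one of their rows gets inserted twice.

```python
# Hypothetical artist batches; "David Bowie" accidentally appears in two of them.
batches = [
    ["Prince", "David Bowie"],
    ["David Bowie", "Queen"],
]

# Stand-in for the Spotify albums lookup.
albums_by_artist = {
    "Prince": ["Purple Rain", "1999"],
    "David Bowie": ["Low", "Heroes", "Lodger"],
    "Queen": ["Jazz"],
}

inserted_rows = []
for batch in batches:
    for artist in batch:
        for album in albums_by_artist[artist]:
            # Each pass through an artist inserts one popularity row per album.
            inserted_rows.append((artist, album))

# Bowie's three albums were processed in both batches: 2 + 3 + 3 + 1 = 9 rows.
print(len(inserted_rows))  # 9
```

Six albums exist, but nine rows land in the table; nest one more accidental loop inside that and the row count explodes.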
My page for listing all of an artist’s tracks has been slowing again lately, so I need to:
- Find out if something works but is happening more than it should
- Resume work on making PopRock a Progressive Web App (PWA) so it stores data locally, speeding up everything it does.
Finding & Deleting Duplicate Rows
Finding and deleting all the “duplicate” rows shown in Figure 02 is easy. I just need to find every row where date equals 0000-00-00 00:00:00.
SELECT * FROM popTracks WHERE date='0000-00-00 00:00:00';
That query returns 1,116 rows out of 1,622,599 for (currently) 72,997 tracks, so I’m confident I won’t lose anything significant when I delete them:
DELETE FROM popTracks WHERE date='0000-00-00 00:00:00';
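That SELECT-before-DELETE habit is worth keeping. Here’s the same pattern sketched against an in-memory SQLite table (table contents made up for illustration): count what the DELETE would remove, delete, then confirm what’s left.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE popTracks (id INTEGER PRIMARY KEY, trackID TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO popTracks (trackID, date) VALUES (?, ?)",
    [
        ("t1", "2018-03-10 09:15:00"),
        ("t2", "0000-00-00 00:00:00"),  # rows from before the date column existed
        ("t3", "0000-00-00 00:00:00"),
    ],
)

# 1. Look before you leap: count what the DELETE would remove.
(bad,) = conn.execute(
    "SELECT COUNT(*) FROM popTracks WHERE date = '0000-00-00 00:00:00'"
).fetchone()
print(bad)  # 2

# 2. Delete the sentinel rows, then confirm what survived.
conn.execute("DELETE FROM popTracks WHERE date = '0000-00-00 00:00:00'")
(remaining,) = conn.execute("SELECT COUNT(*) FROM popTracks").fetchone()
print(remaining)  # 1
```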
The other two aren’t so easy.
I don’t want to delete every row inserted on a given date; I want to delete all but one. I also created an extra hurdle for myself when writing my PHP. When I get a new album to insert into my albums table, I format the date:
// Spotify's release_date may be 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'.
$albumReleasedWhole = $album->release_date;
// Keep just the four-digit year.
$albumReleased = substr($albumReleasedWhole, 0, 4);
I do that because some release_dates are a four-digit year, some are a year and month, and some are full dates. I didn’t do that for the popAlbums table; I didn’t anticipate ever caring about the date format there.
So I have dates that are only kind of alike. In Figure 01, I have sixteen rows inserted on March 10, but all at different times. If I muster my limitless optimism and try using just the date …
SELECT * FROM popAlbums WHERE date='2018-03-10';
that fails. Zero rows. The date column is a DATETIME, so MySQL treats '2018-03-10' as '2018-03-10 00:00:00', which matches none of those timestamps.
If I were selecting those dates to display on a web page, I could reformat them with PHP like the snippet above. For the table itself, though, what I’m going to do is:
- Create a new column called, say, justDate.
- Fill the justDate column with a converted date from the old date column.
- Scour those rows for any obvious yuckiness and test for anything less obvious.
- Once I’m sure everything is okay, delete the old date column.
I created the new column under the Structure tab in phpMyAdmin, then hopped over to the SQL tab and ran:
UPDATE popAlbums SET justDate = date_format((popAlbums.date), '%Y-%m-%d');
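The add-and-backfill step can be sketched end to end on a toy table. This uses in-memory SQLite, where date() plays the role of MySQL’s DATE_FORMAT(…, '%Y-%m-%d'); the table and column names follow the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE popAlbums (id INTEGER PRIMARY KEY, albumID TEXT, date TEXT)")
conn.execute("INSERT INTO popAlbums (albumID, date) VALUES ('a1', '2018-03-10 08:00:13')")

# Step 1: add the new date-only column.
conn.execute("ALTER TABLE popAlbums ADD COLUMN justDate TEXT")

# Step 2: backfill it from the old DATETIME column.
conn.execute("UPDATE popAlbums SET justDate = date(date)")

just_date = conn.execute("SELECT justDate FROM popAlbums").fetchone()[0]
print(just_date)  # 2018-03-10
```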
Because I’m a very cautious man, I copied the table so I could try the next couple of things without reducing myself to a sobbing mess on the floor. In popAlbums2, I deleted the original date column and proceeded with this:
DELETE a FROM popAlbums2 a
INNER JOIN popAlbums2 b
WHERE a.albumID = b.albumID AND a.justDate = b.justDate AND a.id < b.id;
In line 2, the INNER JOIN gives us two tables (sort of: a and b are really the same table, so if we delete a row from a, it also disappears from b; don’t think about it too hard).
Line 3 looks at all the rows in b. When it finds more than one row for a given album on a given day (a.albumID = b.albumID AND a.justDate = b.justDate), like the sixteen rows on March 10 in Figure 01, it compares the id of the row in a to the id of each matching row in b:
- Is a.id ‘6001’ less than b.id ‘6001’? No. Move on to the next row in b.
- Is a.id ‘6001’ less than b.id ‘6037’? Yes. Delete row 6001 from a and move on.
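To watch that delete-all-but-one logic run on a toy table, here’s an equivalent in in-memory SQLite (ids and albums made up). SQLite doesn’t support MySQL’s multi-table DELETE … JOIN, so this uses a correlated EXISTS with the same condition; either way, only the row with the highest id per (albumID, justDate) pair survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE popAlbums2 (id INTEGER PRIMARY KEY, albumID TEXT, justDate TEXT)")
conn.executemany(
    "INSERT INTO popAlbums2 (id, albumID, justDate) VALUES (?, ?, ?)",
    [
        (6001, "a1", "2018-03-10"),
        (6037, "a1", "2018-03-10"),  # duplicate of 6001 for the same day
        (6073, "a1", "2018-03-10"),  # another duplicate
        (7001, "a2", "2018-03-10"),  # different album: not a duplicate
    ],
)

# Delete any row that has a later (higher-id) twin for the same album and day.
conn.execute(
    """
    DELETE FROM popAlbums2
    WHERE EXISTS (
        SELECT 1 FROM popAlbums2 b
        WHERE b.albumID = popAlbums2.albumID
          AND b.justDate = popAlbums2.justDate
          AND popAlbums2.id < b.id
    )
    """
)

survivors = conn.execute("SELECT id FROM popAlbums2 ORDER BY id").fetchall()
print(survivors)  # [(6073,), (7001,)]
```

Rows 6001 and 6037 each have a higher-id twin, so they go; 6073 (the newest of the trio) and 7001 (no duplicates) remain.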
I held my breath and ran the query for a very, very long 16 seconds.
Rows affected = 11,850
I am SO glad I made a copy of this table. I moved back over to the Browse tab and … here’s the final result!
Lastly, I renamed the popAlbums table to popAlbumsOLD, renamed popAlbums2 to popAlbums, and renamed the justDate column to date so all my functions and pages still work.
I had to hold my breath for a much longer 33 seconds while I ran the date_format query on the popTracks table’s 1,646,115 rows.
My DELETE query put me through, as NASA would say, 187 seconds of terror (it seemed like a lot longer) to delete 48,492 rows.
UPDATE: Fortunately, I remembered I had to change how the date gets inserted into the new column. Before, the default was just the current timestamp. In my PHP, I changed the INSERT to use MySQL’s CURDATE() function.
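The insert-time fix looks something like the sketch below, again on an in-memory SQLite table with a made-up row: supply the date-only current date explicitly rather than relying on a timestamp default. SQLite’s date('now') stands in for MySQL’s CURDATE().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE popAlbums (id INTEGER PRIMARY KEY, albumID TEXT, date TEXT)")

# Insert today's date with no time part, instead of a full current timestamp.
conn.execute(
    "INSERT INTO popAlbums (albumID, date) VALUES (?, date('now'))", ("a1",)
)

stored = conn.execute("SELECT date FROM popAlbums").fetchone()[0]
print(stored)  # a bare 'YYYY-MM-DD' date, no time part
```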