Finding and Deleting (almost) Duplicate Rows in MySQL

The database for my PopRock web app has evolved as I learn and as my needs & wants expand. Learning means mistakes and mistakes with MySQL (and PHP) mean duplicate rows. Sometimes lots of them–I once forgot to empty an array at the end of a loop (inside another loop). It didn’t take long to notice that page was getting increasingly slower … because it had millions of rows.

The PopRock database contains data fetched from the Spotify Web API using cron jobs for:

  • Artist popularity – daily
  • Album popularity – weekly
  • Track popularity – weekly

In the last couple days, I noticed some of these items hadn’t been fetched in a couple months–or a couple years.

Sidebar/Note to self: I also noticed that some albums (anthologies) didn’t have all their tracks in my database — my solution to (and explanation of) that problem will probably be my next post.

Debugging and testing also has the side effect of creating lots of duplicate rows. For example:

albumID_01.png
Figure 01

There’s also this from when I added a date column after I’d started fetching data. I am fine with deleting all of these and doing so is easy.

noDateInPopTracks.png
Figure 02

And, lastly, for a bit, my cron job was getting twice a day:

twice.png
Figure 03

This happened because my albums and tracks cron jobs works like this:

  1. Get an array of artists
  2. For each artist
  3. get an array of their albums
  4. For each album, get that album’s popularity score from Spotify

I do have one array containing all of my artists for getting their popularity scores but that task is relatively small. Looping through every artist then getting all of their albums is a bigger task. Then getting all of the tracks for each album is even bigger. So, for those last two tasks, I’ve broken up my artists into smaller arrays.

If, accidentally, I’ve placed one of my artists into two of those arrays, everything for that artist gets done twice (that last screenshot). How does one accidentally get millions of rows like I mentioned earlier? One accidentally writes (or copy & pastes together) a script that, basically, says something like: For every track in David Bowie’s oeuvre, loop through all of David Bowie‘s albums that loop through every track on each of those albums. Yeah, imagine that grows like a zombie apocalypse.

My page for listing all an artist’s tracks has been slowing again, lately, so I need to:

  • Find out if something works but is happening more than it should
  • Resume work on making PopRock a Progressive Web App (PWA) so it stores data locally, speeding up everything it does.

Finding & Deleting Duplicate Rows

Finding and deleting all “duplicate” rows for Figure 02 is easy. I just need to find all rows where date equals 0000-00-00 00:00:00.

SELECT * FROM popTracks WHERE date='0000-00-00 00:00:00';

That query returns 1,116 rows out of 1,622,599 for (currently) 72,997 tracks so I’m confident I won’t lose anything significant when I delete them like

DELETE FROM popTracks WHERE date='0000-00-00 00:00:00';

leaving 1,621,483.

The other two aren’t so easy.

I don’t want to delete every row inserted on a given date–I want to delete all but one. Also, I created an extra hurdle when coding my PHP. When I get a new album to insert into my albums table, I format the date:

$albumReleasedWhole = $album->release_date;
$albumReleased = substr($albumReleasedWhole, 0, 4);

I do that because some release_dates are a four digit year, some are month and year, and some are full dates. I didn’t do that for the popAlbums table — I didn’t anticipate ever caring about the date format.

So I have dates that are kind of alike. In Figure 01, I have sixteen rows inserted on March 10 but all at different times. If I muster my limitless optimism and try using just the date …

SELECT * FROM popAlbums WHERE date='2018-03-10';

that fails. Zero rows.

We know that if I were to select those dates to display on a web page, I can write something like the above PHP so what I’m going to do is

  1. Create a new column called, say, justDate.
  2. Fill all the empty cells in the justDate column with a converted date from the old date column.
  3. I’ll scour all those rows for any obvious yuckiness and test for any less-obvious.
  4. Once I’m sure everything is okay, I can delete the old date column.

I created the new column under the Structure tab in PHPMyAdmin, then hopped over to the SQL tab and:

UPDATE popAlbums SET justDate = date_format((popAlbums.date), '%Y-%m-%d');

Ta-da!

Figure 04

Because I’m a very cautious man, I copied the table before trying the next couple things without reducing myself to a sobbing mess on the floor. In popAlbums2, I deleted the original date column and proceeded with this:

DELETE a FROM popAlbums2 a
INNER JOIN popAlbums2 b
WHERE a.albumID = b.albumID AND a.justDate = b.justDate AND a.id < b.id;

In line 2, we create two tables (sort of–if we make a change in a, it also happens in b because they’re the same table–don’t think about it too hard) by using INNER JOIN.

Line three looks at all the rows in b. When it finds more than one row for a given album on a given day (a.albumID = b.albumID and a.justDate = b.justDate) like the sixteen rows on March 10 in Figure 01, it compares the id of the row in b to the id of the first similar row in a:

  • Is a.id ‘6001’ less than b.id ‘6001’? No. Move on to the next row in b.
  • Is a.id ‘6001’ less than b.id ‘6037’? Yes. Delete it and move on.

I held my breath and ran the query for a very, very long 16 seconds.

Rows affected = 11,850

I am SO glad I made a copy of this table. I moved back over to the Browse tab and … here’s the final result!

Figure 05 (Compare to Figure 01)

Lastly, I renamed the popAlbums table to popAlbumsOLD and renamed popAlbums2 to popAlbums as well as renamed the justDate column to date so all my functions and pages still work.

I had to hold my breath for a much longer 33 seconds while I ran the date_format query on the popTracks table’s 1,646,115 rows.

My DELETE query put me through, as NASA would say, 187 seconds of terror (it seemed like a lot longer) to delete 48,492 rows.

UPDATE: Fortunately, I remembered I had to change how the date gets inserted into the new column. Before, the default was just “current timestamp.” In my PHP, I changed it to use the MySQL curdate() function.

Advertisements

TIL How to Make and Use PHP Cookies

When I first created popRock, I had a heck of a time getting the artistID session variable to travel from page to page — so you could see various charts and lists for the artist chosen from my hideously coded Choose Artist menu on the choose_artist page. I finally got it working but, a couple months ago, I switched hosting plans and the app stopped working.

I still haven’t converted it all to AJAX as mentioned in the aforelinked link, but in the meantime, everything is working again because I finally gave PHP cookies a chance. Yay. I wish I’d used them from the beginning it is SO easy! But I read countless articles about why you shouldn’t use them and should use sessions instead. Well, sessions can kiss my butt as I walk out the door because I’m breaking up with them.

Where Were These Resources Weeks Ago?

After learning about IndexedDB in the GrowWithGoogle Mobile Web course, I’m enhancing my big personal project to create a local backup of the MySQL database using IndexedDB with PHP and jQuery

If only I’d been aware of this “Google Search” thing a month or two ago … 😉

From MySQL to IndexedDB
https://stackoverflow.com/questions/44638999/from-mysql-to-indexeddb

Also mentions PouchDB and CouchDB
https://stackoverflow.com/questions/37921898/synchronization-between-mysql-and-indexeddb

Synchronization between mysql and IndexedDB
https://codeforgeek.com/2014/11/sync-app-mysql-indexeddb/

SyncedDB repo
https://github.com/paldepind/synceddb

Client-side database solutions: IndexedDB (says “teh codez”)
https://www.corporate3design.com/blog/24

Create Your Own To-Do App with HTML5 and IndexedDB
http://blog.teamtreehouse.com/create-your-own-to-do-app-with-html5-and-indexeddb

Offline Add, Edit, Delete Data in HTML5 IndexedDB
https://www.mindstick.com/Articles/1535/offline-add-edit-delete-data-in-html5-indexeddb

PHP Rabbit Holes [solved]

Whoa.

  1. Got tracks to work (with laptop not at home).
  2. Destroyed tracks by attempting to “fix” or improve it (with laptop still not at home).
  3. Fortunately, I hadn’t pushed anything to GitHub yet so, at home on my iMac, got tracks to work again … with a bit of ghost code “product” appearing in the browser with my returned data.

After my success with albums, I then got tracks to work. The loop(s), however were sending 36 requests (one for each album). What I wanted, however, was to combine all the tracks from all albums into one array, then divide them by 50 so there were nine arrays and, thus, only nine requests. So I started to “fix” it.

That’s when the madness began.

Insanity #1: After much debugging and profanity, I realized that instead of putting tracks into the arrays like this …

  • 50
  • 50
  • 50
  • 50
  • 50

and so on, it was doing this …

  • 50
  • 100
  • 150
  • 100
  • 50

At first I thought the numbers were only going up but then realized they were going up then down. It did this while using array_slice despite the start and stop variables echoing the correct numbers … and that’s just one bit of weirdness with numbers/calculations not making sense. One of the other bits of weirdness is back when the tracks script worked (meaning it got everything even if it took “too many” requests for my standards of elegance) the variables for start and stop in the array_slice were not correct — they weren’t incrementing at all.

Still don’t think figured that out but it may be related to …

Insanity #2: After giving up on solving and fixing that (I just accepted the idea of 36 requests for the time being), one of the echo commands I used from a previous version of madness still appeared …

You have made 0 getTracks requests

Two things about that:

  • The line of code with the echo no longer exists.
  • The line of code setting the variable the echo refers to no longer exists.

Honestly, I don’t remember googling “super globals” or even “php globals” or why I thought of those but I’ve since learned, so far, that they, apparently, are stored in either php.ini or some other php config file … so me deleting the line of code in my script may not have gotten rid of the variable? Maybe? Let’s assume that’s true. That explains my second bullet above — why the variable still exists — but it doesn’t explain how or why the echo command still … “exists”(?) … somewhere …

INSERTED UPDATE: The above theory about the $GLOBALS being stored somewhere is just as false and nonsensical as I knew/hoped it was. See end of post.

It can’t be the cache because I did the destruction on one computer and re-did the progress with tracks on a different computer. The code related to the “ghost” feedback never existed on the iMac and I never accessed the “broken” files from browsers on the iMac. The ghost is in the remote machine.

Seriously, PHP has been creeping me out for about 18 hours. I had dreams about it. For real. I also had dreams about multiple wives and multiple ex-wives and the Chinese successfully invading the USA though those may have been subconscious metaphors for the PHP weirdness. Or metaphors or prophecies having to do with something else entirely.

More to come.

P.S. What worries me is, in troubleshooting, I thought I should make more $GLOBALS variables of things in loops so I could unset them, but … so … who knows what supernatural, ancient, Lovecraftian, Raimian, code gods I’ve unleashed …

Update #1: Read entire php.ini file and there’s no mention of the $GLOBALS array. The other super globals are mentioned.

Update #2: Insanity #2 was totally my fault. That actually makes me feel much better. There is currently a total of five files in this little app. Here at home, I was only changing and uploading two of them but the two lines of code also appeared (I was using that echo and variable in multiple locations for debugging purposes) in a third file.

Insanity #1 still haunts me.

UPDATE #3 on Nov 8: Insanity #1 [solved]

I was basing my array_slice on how I did it in Javascript so it looked like this:

$x = ceil((count($artistAlbums))/20);
$firstAlbum = 0;
for ($i=0; $i<$x; ++$i) {
$lastAlbum = $firstAlbum + 19;
$artistAlbumsChunk = array_slice($artistAlbums, $firstAlbum,     $lastAlbum);
$albumsArrays [] = $artistAlbumsChunk;
$firstAlbum += 19;
};

I was interpreting the offset parameter as “first” (accurate enough) and the length parameter (because I wasn’t paying attention when I read the documentation) as “last” like it is in Javascript.

It’s Miller Time

Or something.

After much hard work and learning, I have the albums portion of the rockinPHP app working. Behold:

alice.png

and

jjcm.png

Lots more work still to do. Including getting that data into a database — the table is just a visual feedback tool for my development and debugging.

Awesome. Rock ON.

Deeper. And stuff.

Just installed Composer.

Used that to install the latest version of PHPMailer.

Watching The Fugitive for the billionth time and was inspired to search for PHP jobs in Chicago. God, I’d love to live in Chicago. Almost as much as I love The Fugitive.

My Site Is Secure

Just bought my first TLS/SSL certificate. Feel cool. Like I’ve gone through a rite of passage. Like I’m not an amateur. Also continuing to make php forms and such more secure in the app itself.

Pretty pleased with myself.

Learning TONS. Such as: I need to change references to all my javascript (bootstrap, etc.) so my site can get back to it’s normal, pretty self.