- I copied the original popArtists table, renaming the original popArtistsOLD
- I worked (see below) on the copy which is now called popArtists
When I tried repeating my process for finding and deleting duplicates in a MySQL database (not to mention converting dates to assist with that and make future data wrangling easier) with popArtists, () it “loaded” (I think the message said “Loading”) for a long, long time before, eventually giving some sort of error stating it had pooped out and/or lost a connection. Checking the new, copied table, it appeared as though no rows had been affected — there were still 728,082 rows. When I logged in and checked this morning, however, the table showed only 56, 049 rows! So, I’m going to copy the original table again and try some tests.
NOTE TO SELF: Actually write out the following:
- SELECT Alice Cooper in SQL tab
- Export as CSV for Excel
- Import CSV into Excel
- Highlight duplicates
There were three instances of two rows for one date and one instance of four rows in a single date. I have 328 artists in my database so if each artist had the same amount of duplicates, there would be only 3,280 extra rows to delete, not 672,033.
I didn’t want to export and import a CSV file with 728,082 rows but it looks like I have to if I want to find the problem.
- I made another copy of the original called popArtistsNew.
The exported CSV file is just over 53 MB!
After Excel finished importing the CSV, it said, “730,440 rows loaded” from popArtistsNEW which should have 728,082. I double-checked popArtistsNEW in PHPMyAdmin and, unexpectedly, it has 727,834 rows. All three should be the same.
Now I’m worried. And my face is stuck in a WTF expression.
Because of the sheer size of the spreadsheet, Excel crashed every time I tried using Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values so, instead, I tried Home > Conditional Formatting > New Rule. The formula, after much Googling, trial & error, baffling results, and profanity is:
Let me break the formula down and explain it. First, look at it like this:
The ‘=‘ sign tells Excel this is a formula, not text. ‘OR‘ tells Excel that if either Condition1 or Condition1 is true, apply my orange fill. This next part is where people can get confused. Just like an ‘OR’ statement has two conditions, an ‘AND’ statement has two conditions, like so:
The difference here being if both ConditionA AND ConditionB must be true.
So, my formula is actually this:
=OR(AND(ConditionA, ConditionB),AND(ConditionC , ConditionD))
- If ConditionA AND ConditionB are true OR
- If ConditionC AND ConditionD are true
Make that whole row of cells orange.
After a whole lot of scrolling–which was far more stressful that it should be because I had, really, almost a million rows–I saw that I finally had only my duplicate rows highlighted (seriously, you should have seen the horrific crap Excel did for some of the “errors” in my “trials”).
Now, I just needed to count them and see if there were, actually, several hundreds of thousands. Mostly, I saw chunks of 2, 4, or 6 rows in a single day. But then I started seeing chunks of 556 rows for a single artist in a single day. Notice “chunks” is plural. Lots of ’em.
When you import a CSV file, Excel automatically gives you Sort & Filter menus for each column. I used
FilterMenu > Filter by Color > Filter by Cell Color
The Excel status bar at the bottom said 676,965 displayed out of 730,440. So, I’m satisfied.