Finding and Deleting (almost) Duplicate Rows in MySQL Part 2

Background:

  • I copied the original popArtists table, renaming the original popArtistsOLD
  • I worked (see below) on the copy which is now called popArtists

When I tried repeating my process for finding and deleting duplicates in a MySQL database (not to mention converting dates to assist with that and make future data wrangling easier) with popArtists, () it “loaded” (I think the message said “Loading”) for a long, long time before, eventually giving some sort of error stating it had pooped out and/or lost a connection. Checking the new, copied table, it appeared as though no rows had been affected — there were still 728,082 rows. When I logged in and checked this morning, however, the table showed only 56, 049 rows! So, I’m going to copy the original table again and try some tests.

NOTE TO SELF: Actually write out the following:

  1. SELECT Alice Cooper in SQL tab
  2. Export as CSV for Excel
  3. Import CSV into Excel
  4. Highlight duplicates

There were three instances of two rows for one date and one instance of four rows in a single date. I have 328 artists in my database so if each artist had the same amount of duplicates, there would be only 3,280 extra rows to delete, not 672,033.

I didn’t want to export and import a CSV file with 728,082 rows but it looks like I have to if I want to find the problem.

  • I made another copy of the original called popArtistsNew.

The exported CSV file is just over 53 MB!

After Excel finished importing the CSV, it said, “730,440 rows loaded” from popArtistsNEW which should have 728,082. I double-checked popArtistsNEW in PHPMyAdmin and, unexpectedly, it has 727,834 rows. All three should be the same.

Now I’m worried. And my face is stuck in a WTF expression.

Because of the sheer size of the spreadsheet, Excel crashed every time I tried using Home > Conditional Formatting > Highlight Cells Rules > Duplicate Values so, instead, I tried Home > Conditional Formatting > New Rule. The formula, after much Googling, trial & error, baffling results, and profanity is:

=OR(AND($B2=$B1,$D2=$D1),AND($B2=$B3,$D2=$D3))

Figure 01

Let me break the formula down and explain it. First, look at it like this:

=OR(Condition1, Condition2)

The ‘=‘ sign tells Excel this is a formula, not text. ‘OR‘ tells Excel that if either Condition1 or Condition1 is true, apply my orange fill. This next part is where people can get confused. Just like an ‘OR’ statement has two conditions, an ‘AND’ statement has two conditions, like so:

=AND(ConditionA, ConditionB)

The difference here being if both ConditionA AND ConditionB must be true.

So, my formula is actually this:

=OR(AND(ConditionA, ConditionB),AND(ConditionC , ConditionD))

which means:

  • If ConditionA AND ConditionB are true OR
  • If ConditionC AND ConditionD are true

Make that whole row of cells orange.

After a whole lot of scrolling–which was far more stressful that it should be because I had, really, almost a million rows–I saw that I finally had only my duplicate rows highlighted (seriously, you should have seen the horrific crap Excel did for some of the “errors” in my “trials”).

Now, I just needed to count them and see if there were, actually, several hundreds of thousands. Mostly, I saw chunks of 2, 4, or 6 rows in a single day. But then I started seeing chunks of 556 rows for a single artist in a single day. Notice “chunks” is plural. Lots of ’em.

When you import a CSV file, Excel automatically gives you Sort & Filter menus for each column. I used

FilterMenu > Filter by Color > Filter by Cell Color

FilterMenu > Filter by Color > Filter by Cell Color

The Excel status bar at the bottom said 676,965 displayed out of 730,440. So, I’m satisfied.

Advertisements

From Excel to MySQL Mess

Here are my candidates in Excel:

The CSV file in VS Code:

My import command and … the mess I can’t figure out (and immediately deleted):

WHAT happened?

Attempt #2

I surrounded everything with single quotes and …

tryingAgainCandidates

Note it created only 16 rows!

Filling Empty Cells with Zero in Excel

I had an enormous spreadsheet I’d made by saving a PDF as an Excel workbook and rearranging the data.

noZeroes
Some 2016 presidential election results in Excel

I needed to get the data into MySQL but there were countless empty cells where candidates received no votes and I knew that wouldn’t fly. I needed zeroes.

STEP #1

Select all the empty cells by clicking Find & Select in the Editing section of the Home tab in the spawn-from-Hell ribbon-thing and choosing Go To Special.

goToSpecial

STEP #2

Select Blanks and click OK.

blanks

STEP #3

Press F2 (function key at the top of your keyboard) to select one of the selected empty cells (using F2 prevents deselecting all the other empty cells).

STEP #4

Type “0” and press either Ctrl+Enter on Windows or Cmd+Return on a Mac.

STEP #5

Behold …

allZeroes