Beautiful Socialism

About 13 months ago, I made some maps of the 2012 election results using only the socialist candidates. If nothing else, I was curious how many votes they’d have gotten if their support had all gone to one candidate instead of being split across seven or more. Those maps were made using Illustrator, Photoshop, and data from the delightfully detailed, completely comprehensive FEC 2012 Election Results (PDF), which I tediously transcribed into Excel. Fun, but yuck.

For 2016, I’m using Python and D3 and will probably throw in some MySQL, PHP, and jQuery along with, of course, plain-old JavaScript for some other fun.

I’ve just finished preparing the latest data from the 2016 election. The post and project are called beautifulSocialism because I use BeautifulSoup. Get it? See what I did there?

First, I just saved the 2016 Presidential Election Results page from Politico. After my first few tries (ever) using Beautiful Soup, I reduced just over 6,200 lines of markup (and they were really long lines) to 103 equally dense lines that I could almost use. I can’t express how elegant it is, IMHO, or how proud I am of it.

[screenshot: the Beautiful Soup script]
Lines 30-33 were a last-minute addition after I noticed one tag contained the hidden treasure of the full party name.
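
For anyone curious, the heart of it looks roughly like this. To be clear, this is a stripped-down sketch rather than my actual 103 lines, and every class name here (results-table, results-name, results-popular) plus the abbr/title trick is a stand-in I made up–Politico’s real markup is different:

  import io
  from bs4 import BeautifulSoup

  # Parse my saved copy of Politico's results page (BS4 flavor)
  with io.open("politico_2016_results.html", encoding="utf-8") as f:
      soup = BeautifulSoup(f.read(), "html.parser")

  rows = []
  # NOTE: all of these class names are stand-ins; the real markup differs
  for table in soup.find_all("table", class_="results-table"):
      state_cell = table.find("th")
      state = state_cell.get_text(strip=True) if state_cell else ""
      for tr in table.find_all("tr"):
          name_cell = tr.find("td", class_="results-name")
          vote_cell = tr.find("td", class_="results-popular")
          if not (name_cell and vote_cell):
              continue
          abbr = name_cell.find("abbr")
          # the "hidden treasure": the full party name hiding in an attribute
          party = abbr["title"] if abbr and abbr.has_attr("title") else ""
          rows.append((state, name_cell.get_text(strip=True), party,
                       vote_cell.get_text(strip=True)))

  for row in rows:
      print("\t".join(row))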

However, there were some whitespace issues I just could not solve and neither Google searches nor StackOverflow provided solutions that worked for me. Also, BeautifulSoup’s encoding shoved some additional unwanted characters into my “final” product.

[screenshot: the table before editing]

I spent a lot of time trying many things but solved neither problem. I made it even worse a couple times, though!
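
Looking at it now, I suspect something like this would have handled both problems in Python–assuming the mystery characters were mostly things like non-breaking spaces, which I never actually confirmed:

  import re
  import unicodedata

  def clean_cell(text):
      # fold compatibility characters (like non-breaking spaces) into plain equivalents
      text = unicodedata.normalize("NFKC", text)
      text = text.replace(u"\u00a0", u" ")   # just in case any non-breaking spaces survive
      text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace into single spaces
      return text.strip()

  print(clean_cell(u"  Gloria\u00a0La Riva \n"))   # Gloria La Riva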

Eventually, I surrendered and used Dreamweaver for a relatively small number of rounds of Find & Replace. First, I used Dreamweaver’s awesome Apply Source Formatting command, which made the code pretty but ballooned the number of lines to 2,595.

Sadly, correcting the candidates’ names took far more rounds than I expected because they were screwed up in so many different ways. I wanted full names and, since there were many candidates even I was unfamiliar with, I went to my go-to source of presidential candidate information for the last ten years. But Politics1.com’s lack of state-specific ballot information (in their defense, that’s not the site’s purpose) posed two problems:

  • They give the candidate’s home state but I didn’t know if Smith from whatever state, for example, would be the Smith running in some other state.
  • They give the party the candidate most identifies with but Politico’s results used whatever was on the ballot–often “unaffiliated,” “independent,” or “other.”

So, much to my chagrin, I used Ballot-o-pedia or whatever it’s called. It’s the slowest damn site on the Internet. I hate to admit that it was a huge help and I’m still not providing a link to it. It made me want to throw stuff several times.

After a bunch of manual editing I didn’t expect, I now have this:

[screenshot: the cleaned-up table]

Now for some fun DOM manipulation using jQuery to dynamically add some sexy CSS to the table. Yes, it might be quicker to just do it manually (much like my eventual editing in Dreamweaver), but I prefer learning, even if it takes longer and I make mistakes. Besides, someday, hopefully, I’ll be working with much larger data sets, and this knowledge will, of course, pay off.

That’s unrelated to the maps, of course, but there are so many cool things I can do to practice with this data! I’ll also make something that finds and lists all the different parties the candidates run under in different states–partly just for fun, but also to consolidate them for use in the maps.
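
Something like this is what I have in mind–just a sketch, with a few illustrative rows standing in for whatever shape my scraped data ends up taking (don’t quote me on the exact state/party combinations):

  from collections import defaultdict

  # illustrative (state, candidate, party-as-printed-on-the-ballot) rows
  rows = [
      ("CO", "Gloria La Riva", "Party for Socialism and Liberation"),
      ("NM", "Gloria La Riva", "Independent"),
      ("OH", "Richard Duncan", "Other"),
  ]

  parties_by_candidate = defaultdict(set)
  for state, candidate, party in rows:
      parties_by_candidate[candidate].add(party)

  # every ballot label each candidate appears under, across all states
  for candidate, parties in sorted(parties_by_candidate.items()):
      print("{0}: {1}".format(candidate, ", ".join(sorted(parties))))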

Eventually, much of this work will apply to my politicsPlay project as well.


Pile of Pythons

My first focus on Python came from recommendations regarding web scraping. One of my first project ideas is based on web scraping, which I’d never heard of until I asked /r/LearnProgramming for direction & advice about a project I could only describe in the newbiest terms (my question was pretty general and vague because I didn’t know where to start, let alone which direction to take).

Based on that, two of today’s To-Dos were installing Grab (I’d read the most delightfully enthusiastic article about it, in surprisingly understandable broken English) and BeautifulSoup. Based on all my Data Science reading and goals, my third item was installing Anaconda. I learned so much just downloading and installing these three. I love learning even the smallest details–is any given tool a library, a framework, a language, or a program? Which ones are equivalent to each other? Learning all these details makes the learning curve seem so much less intimidating. So, here’s what I learned today. I found it all extremely valuable and interesting.

Starting off, I already had, at least, Python 2.7.6 installed.

First, here’s the order in which they need to be installed and why, with a quick import check after the list.

  1. Installed pip, a Package Management System, so I could install Grab and, theoretically, beautifulsoup:
    sudo apt-get install python-pip
  2. Installed beautifulsoup. The Crummy documentation says to use pip with either beautifulsoup4 or python-beautifulsoup4, but that didn’t work for me; this got me v3.2.1-1:
    sudo apt-get install python-beautifulsoup
  3. Installed python-dev after multiple failed attempts at installing Grab. Each failed Grab install had different errors, and the Grab documentation covered the possibilities (mostly relating to lxml) in a mostly excellent manner but still didn’t fix the problem–which I found only after scrolling slowly up through all the terminal output. I Googled my findings, and this article provided a great explanation and this solution:
    sudo apt-get install python-dev
  4. Installed Grab successfully! I knew things were going well when it paused during the lxml building portion of the installation to do lots of something other than spit out an error. I say “lots of something” because it was obviously(?) “thinking.” Can I just say that while the Command Line is infinite in its unassailable coolness, the lack of a progress bar stresses me out? First, a couple pre-emptive dependency issues averted with:
    sudo apt-get install libcurl4-openssl-dev (so it can build pycurl)
    sudo apt-get install libxml2-dev libxslt-dev (so it can build lxml)
    Then finally getting to use pip:
    pip install -U Grab
  5. Installed Anaconda with no problem whatsoever using their excellent documentation:
    1. Download the installer script
    2. Run the installer script
    3. Test/confirm using conda list
    4. 30-minute Test Drive
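
As that quick sanity check that all three installs actually took (and that each package is visible to whichever Python I launch), a few lines in the interpreter are enough:

  # run inside the Python interpreter to confirm the installs took
  import BeautifulSoup   # v3 from apt (may only be visible to the system Python)
  import bs4             # BeautifulSoup 4, bundled with Anaconda
  import grab            # the Grab scraping framework

  print(bs4.__version__)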

I haven’t done that Test Drive yet, because, first, I’m dying to talk about all of the following shizzle I caught flying by in the Terminal during successful and failed attempts followed by Googling and reading (and copious bookmarking).

BeautifulSoup and LXML
In the aforelinked article about Grab, the author stated they didn’t use beautifulsoup, a widely praised and beloved HTML parser, because it didn’t play nice with their particular install of Anaconda. At the time I first read the article, that aside didn’t mean much to me, but after today’s lxml-related struggles it stuck out to me that the two are equivalent products. The author claims lxml meets their needs, so I’m interested in trying both–it seems the whole world is in love with BeautifulSoup, and all I have to go on for lxml is that author’s “I use it because I kinda have to” reasoning and what I could call a bad first impression (or bad first association) from today.
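
To see how interchangeable they really are, here’s the same trivial extraction done both ways–a toy snippet, not a real page:

  from bs4 import BeautifulSoup
  from lxml import html

  snippet = "<table><tr><td class='name'>Gloria La Riva</td></tr></table>"

  # BeautifulSoup: search by tag and class
  soup = BeautifulSoup(snippet, "html.parser")
  print(soup.find("td", class_="name").get_text())

  # lxml: the same thing, but with XPath
  tree = html.fromstring(snippet)
  print(tree.xpath("//td[@class='name']/text()")[0])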

I was also interested because I, too, was installing Anaconda.

Speaking of Anaconda and BeautifulSoup, Anaconda is a Framework, meaning, if the “my own words” in which I am putting this are accurate, it is a collection of tools used together. Am I correct in saying that any given “stack” is also a “framework”? The Anaconda framework/collection includes oodles of other tools I’ve read about and seen elsewhere, including, as it turns out, BeautifulSoup4, so I think I now have both v3.2.1-1 and v4.

BeautifulSoup is a Library–much like, say, jQuery. You don’t need to “install” it per se; you can simply drop the BS4 folder into your project folder, like jQuery and/or Bootstrap. Bootstrap, however, now that I think about it, is a bit of a framework because it is a collection of libraries working together that also requires jQuery.

A library is a … think of a CSS file as a library of styles. An external JS file (like jQuery) is a library of functions and whatnot all ready out of the box.

Conda is a Package Management System like apt, pip or rpm.

Grab is, according to that author person, a framework (for web scraping), if for no other reason than that it includes/uses another framework, heavily discussed in the article, called Spider. I mention this because I noticed that, included in the ten bazillion packages of the Anaconda framework, was a little guy named Spyder, and I wondered if it was a spelling error in one place or the other.

Spider vs Spyder
Grab is a library wrapped around the PyCurl library. I mention that only because I’d read about pycurl elsewhere. The Grab:Spider framework processes network requests for scraping asynchronously using the MultiCurl library. I’ll admit right here that’s gibberish to me. I find it interesting because Anaconda also included PyCurl as well as Curl (but not MultiCurl, if you’re keeping score). All of this is interesting to me because it’s helping me understand how to compare and contrast different frameworks and toolsets. How and why particular components are chosen and combined.

So, Spider is just a library (I think). Spyder, on the other hand, is an Integrated Development Environment (IDE) and is, actually, an acronym of sorts: Scientific PYthon Development EnviRonment. I was, for a while, pretty confused as to how an IDE was different from, say, any other given code/text editor–Dreamweaver, for example. Dreamweaver can do a lot that, say, Notepad, or even Notepad++, can’t. An IDE can do much that Dreamweaver can’t. Wouldn’t it be awesome if Dreamweaver had a console like Firebug? Such a lovechild would be an IDE–integrating the live/final view with the coding/editing view, plus debugging and feedback as well. Dare I say Flash is an IDE?

Spyder describes itself as–and I’m kinda paraphrasing–a “powerful, interactive development environment for Python with advanced editing, interactive testing, debugging and introspection features.” It is “also a numerical computing environment thanks to the support of IPython plus NumPy, SciPy and MatPlotLib,” which provide MatLab-like features, and it is built on Spyderlib, a Python module based on PyQt4, Rope, and others.

All these library names are starting to stand out to me instead of jumbling together in a meaningless pile of pythons as I learn how each is different.

  • NumPy is for linear algebra
  • SciPy is for signal (waveforms) & image processing
  • MatPlotLib is for interactive 2D & 3D plotting (there’s a quick taste of NumPy and MatPlotLib right after this list)
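
Here’s that quick taste–nothing fancy, just NumPy building an array and MatPlotLib plotting it, which is about the extent of my understanding so far:

  import numpy as np
  import matplotlib.pyplot as plt

  x = np.linspace(0, 10, 200)   # NumPy: 200 evenly spaced points
  y = np.sin(x)                 # vectorized math, no loops

  plt.plot(x, y)                # MatPlotLib: plot y against x
  plt.title("sin(x)")
  plt.show()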

All of the above come with Anaconda, as did its very own copy of Python v2.7.11-0.

The Spyder IDE jumped out at me because I knew Anaconda also included another IDE, discussed at far more length in Python for Data Science for Dummies, called IPython Notebook.

IPython Notebook describes itself as an enhanced interactive Python interpreter and “a web app, an interactive computational environment in which you can combine code execution, rich text (explanatory content), mathematics, plots and rich media (images).” Notebook is the IDE; IPython is an architecture, though I don’t know what that means quite yet.

I think both Spyder and Jupyter use a notebook paradigm as well, much like Dreamweaver uses Sites and other apps use Projects.

Anaconda comes with a third IDE, Jupyter. While they’re similar, my impression is that Jupyter is the rising star. It is also “a web app that allows users to create and share documents containing live code, equations, visualizations and explanatory text for data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.”

Last item of note is that Anaconda also includes Pandas. I’ve seen Pandas mentioned a lot in Data Science discussions, and I have a book, Pandas: Powerful Python Data Analysis Toolkit, that is 1,787 pages long. Normally, an intro chapter called “10 Minutes to Pandas” wouldn’t seem so funny to me, but that “short introduction … geared mainly for new users” starts at page 259 and, at 22 pages, is the proverbial tip o’ the iceberg.

Except polar bears, not pandas, live on icebergs.

I find it equally funny that the FAQ is only four pages.
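
I couldn’t resist at least a ten-minute peek, though. This seems to be the pandas way of poking at a small table of the kind of data I’m hoping to scrape–a sketch with made-up numbers, not real results:

  import pandas as pd

  # a tiny stand-in for the kind of table I'm hoping to scrape; numbers are made up
  df = pd.DataFrame({
      "state":     ["CO", "NM", "OH"],
      "candidate": ["Gloria La Riva", "Gloria La Riva", "Richard Duncan"],
      "votes":     [1000, 2000, 3000],
  })

  # total votes per candidate across all states, in one line
  print(df.groupby("candidate")["votes"].sum())

Grouping and summing in one line makes the appeal pretty obvious, even to a newbie.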