My first focus on Python came from recommendations regarding web scraping. One of my first project ideas is based on web scraping, which I’d never heard of until I asked /r/LearnProgramming for direction and advice about a project I could only describe in the newbiest terms. (My question was pretty general and vague because I didn’t know where to start, let alone which direction to take.)
Based on that, two of today’s To-Dos were installing Grab (I’d read the most delightfully enthusiastic article in surprisingly understandable broken English about it) and BeautifulSoup. Based on all my Data Science reading and goals, my third item was installing Anaconda. I learned so much just downloading and installing these three. I love learning even the smallest details: is any given tool a library, a framework, a language, or a program? Which ones are equivalents of each other? Learning all these details makes the learning curve seem so much less intimidating. So, here’s what I learned today. I found it all extremely valuable and interesting.
Starting off, I already had, at least, Python 2.7.6 installed.
First, here’s the order in which they need to be installed and why.
- Installed pip (so I could install Grab and, theoretically, beautifulsoup), a Package Management System
sudo apt-get install python-pip
- Installed beautifulsoup. The Crummy documentation says to use pip for either beautifulsoup4 or python-beautifulsoup4, but that didn’t work for me. This got me v3.2.1-1:
sudo apt-get install python-beautifulsoup
- Installed python-dev after multiple failed attempts at installing Grab. Each failed Grab install threw a different error. The Grab documentation covered the possibilities (mostly relating to lxml) in a mostly excellent manner but still didn’t fix the problem, which I finally found by scrolling slowly up through all the terminal output. I Googled my findings, and this article provided a great explanation and this solution:
sudo apt-get install python-dev
- Installed Grab successfully! I knew things were going well when it paused during the lxml building portion of the installation to do lots of something other than spit out an error. I say “lots of something” because it was obviously(?) “thinking.” Can I just say that while the Command Line is infinite in its unassailable coolness, the lack of a progress bar stresses me out? First, a couple pre-emptive dependency issues averted with:
sudo apt-get install libcurl4-openssl-dev (so it can build pycurl)
sudo apt-get install libxml2-dev libxslt-dev (so it can build lxml)
Then finally getting to use pip:
pip install -U Grab
- Installed Anaconda with no problem whatsoever using their excellent documentation:
  - Download the installer script
  - Run the installer script
  - Test/confirm using conda list
  - 30-minute Test Drive
I haven’t done that Test Drive yet, because, first, I’m dying to talk about all of the following shizzle I caught flying by in the Terminal during successful and failed attempts followed by Googling and reading (and copious bookmarking).
BeautifulSoup and LXML
In the aforelinked article about Grab, the author stated they didn’t use beautifulsoup, a widely praised and beloved HTML parser, because it didn’t play nice with their particular install of Anaconda. When I first read the article, that aside didn’t mean much to me, but after today’s lxml-related struggles it stuck out to me that lxml and BeautifulSoup are equivalent products. The author claims lxml meets their needs, so I’m interested to try both: it seems the whole world is in love with BeautifulSoup, while all lxml has going for it so far is their “I use it because I kinda have to” reasoning and what I could call a bad first impression (or bad first association) today.
I was also interested because I was installing Anaconda too.
Speaking of Anaconda and BeautifulSoup, Anaconda is a Framework, meaning, if the “my own words” in which I am putting this are accurate, it is a collection of tools used together. Am I correct in saying that any given “stack” is also a “framework”? The Anaconda framework/collection includes oodles of other tools I’ve read about and seen elsewhere including, as it turns out, BeautifulSoup4 so I think I now have both v3.2.1-1 and v4.
BeautifulSoup is a Library–much like, say, jQuery. You don’t need to “install” it per se, you can simply drop the BS4 folder into your project folder like jQuery and/or Bootstrap. Bootstrap, however, now that I think about it, is a bit of a framework because it is a collection of libraries working together that also require jQuery.
A library is a … think of a CSS file as a library of styles. An external JS file (like jQuery) is a library of functions and whatnot all ready out of the box.
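To make the "library of functions ready out of the box" idea concrete, here is a minimal sketch that pulls the links out of a scrap of HTML. It assumes Python 3 and uses html.parser from the standard library as a stand-in for BeautifulSoup, so this is not BeautifulSoup's API. LinkCollector, extract_links and SAMPLE_HTML are made-up names for illustration.

```python
# A library is code someone else wrote that you import and call.
# html.parser ships with Python 3; it stands in for BeautifulSoup here.
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in a string of HTML."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Made-up sample page, for illustration only
SAMPLE_HTML = """
<html><body>
  <a href="/first">First</a>
  <p>No link here</p>
  <a href="https://example.com/second">Second</a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
# → ['/first', 'https://example.com/second']
```

None of the parsing logic is written here. The library does the hard part, and the project code only says what to do when a tag shows up.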
Conda is a Package Management System like apt, pip or rpm.
Grab is, according to that author person, a framework (for web scraping) if for no other reason than that it includes/uses another framework heavily discussed in the article called Spider. I mention this because I noticed that, included in the ten bazillion packages of the Anaconda framework, was a little guy named Spyder and I wondered if it was a spelling error in one place or the other.
Spider vs Spyder
Grab is a library wrapped around the PyCurl library. I mention that only because I’d read about pycurl elsewhere. The Grab:Spider framework processes network requests for scraping asynchronously using the MultiCurl library. I’ll admit right here that’s gibberish to me. I find it interesting because Anaconda also included PyCurl as well as Curl (but not MultiCurl, if you’re keeping score). All of this is interesting to me because it’s helping me understand how to compare and contrast different frameworks and toolsets. How and why particular components are chosen and combined.
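The "asynchronously" part may be easier to picture with a toy. The sketch below is not Grab:Spider or MultiCurl, and those tools do this differently under the hood. It only shows the general idea of starting several slow requests at once instead of waiting for each in turn, using threads from Python 3's standard library as a stand-in. fake_fetch, fetch_all and the URLs are hypothetical, and nothing is actually downloaded.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Pretend to download a page; the sleep stands in for network wait."""
    time.sleep(0.1)
    return "contents of " + url


def fetch_all(urls, max_workers=4):
    """Start several fetches at once and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


# Hypothetical URLs; nothing is actually requested
URLS = ["http://example.com/page%d" % n for n in range(4)]

start = time.time()
pages = fetch_all(URLS)
elapsed = time.time() - start

print(pages[0])
print("fetched %d pages in about %.1f seconds" % (len(pages), elapsed))
```

Fetched one after another, four pages at 0.1 seconds each would take about 0.4 seconds. Started together, the waits overlap, which is the whole appeal for a scraper that spends most of its time waiting on the network.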
So, Spider is just a library (I think). Spyder, on the other hand, is an Integrated Development Environment (IDE) and is, actually, an acronym of sorts: Scientific PYthon Development EnviRonment. I was, for a while, pretty confused as to how an IDE was different from, say, any other given code/text editor, Dreamweaver for example. Dreamweaver can do a lot that, say, Notepad, or even Notepad++, can’t. An IDE can do much that Dreamweaver can’t. Wouldn’t it be awesome if Dreamweaver had a console like Firebug? Such a lovechild would be an IDE, integrating the live/final view with the coding/editing view with debugging and feedback as well. Dare I say Flash is an IDE?
Spyder describes itself as–and I’m kinda paraphrasing–a “Powerful, interactive development environment for Python with advanced editing, interactive testing, debugging and introspection features and is also a numerical computing environment thanks to support of IPython plus NumPy, SciPy and MatPlotLib providing MatLab-like features part of Spyderlib, a python module based on PyQt4, Rope, and others.”
All these library names are starting to stand out to me instead of jumbling together in a meaningless pile of pythons as I learn how each is different.
- NumPy is for linear algebra
- SciPy is for signal (waveforms) & image processing
- MatPlotLib is for interactive 2D & 3D plotting
All of the above come with Anaconda, as did its very own copy of Python v2.7.11-0.
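A quick way to confirm which of these libraries the current Python can actually see is to try importing each one. This is a minimal sketch, not anything from the Anaconda documentation. find_installed and LIBRARIES are made-up names, and it assumes the import names are the lowercase numpy, scipy and matplotlib.

```python
import importlib
import sys


def find_installed(module_names):
    """Map each module name to True if it can be imported, else False."""
    found = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            found[name] = True
        except ImportError:
            found[name] = False
    return found


# Import names for the libraries listed above (note the lowercase)
LIBRARIES = ["numpy", "scipy", "matplotlib"]

print("Python " + sys.version.split()[0])
for name, ok in sorted(find_installed(LIBRARIES).items()):
    print("%-12s %s" % (name, "installed" if ok else "missing"))
```

Running it with the system Python and again with Anaconda's Python should show whether the two installs really carry different sets of libraries.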
The Spyder IDE jumped out at me because I knew Anaconda also included another IDE, discussed at far more length in Python for Data Science for Dummies, called IPython Notebook.
IPython Notebook describes itself as an enhanced interactive Python interpreter and web app, an “interactive computational environment in which you can combine code execution, rich text (explanatory content), mathematics, plots and rich media (images).” Notebook is the IDE. IPython is an architecture, though I don’t know what that means quite yet.
Anaconda comes with a third IDE, Jupyter. While it is similar to IPython Notebook, my impression is that Jupyter is a rising star. It is also “a web app that allows users to create and share documents containing live code, equations, visualizations and explanatory text for data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.”
I think both Spyder and Jupyter use a notebook paradigm as well, much like Dreamweaver uses Sites and other apps use Projects.
Last item of note is that Anaconda also includes Pandas. I’ve seen Pandas mentioned a lot in Data Science discussions and I have a book, Pandas: Powerful Python Data Analysis Toolkit, that is 1,787 pages long. Normally, an intro chapter called “10 Minutes to Pandas” wouldn’t seem so funny to me, but that “short introduction … geared mainly for new users” starts at page 259 and, at 22 pages, is the proverbial tip o’ the iceberg.
Except polar bears, not pandas, live on icebergs.
I find it equally funny that the FAQ is only four pages.