Articles

Easily manage Python environments with Anaconda

Lately, I’ve been doing a lot of Python data analysis. Though Python has many strengths, package management has long been a nightmare.

Continuum Analytics has created several wonderful free, open-source projects, perhaps most notably Anaconda, which makes installing Python and its packages much, much easier. That’s especially true if you want to maintain multiple Python environments, which you probably do, especially if you want to run both Python 2 and Python 3.

I’ve just hit on a workflow that enables me to keep my environment up to date without risk of breaking stuff.

I currently have two environments, the default Python 2.7 (root) environment and a Python 3.4 (py34) environment that I use most of the time:

→ conda info -e
# conda environments:
#
py34                  *  /Users/JLavin/Applications/Anaconda/anaconda/envs/py34
root                     /Users/JLavin/Applications/Anaconda/anaconda

I want to update many packages in py34 that have gone stale, but I’m afraid something might break. So I run:

conda list -n py34 --export > ~/Python/conda_packages_20140911

This creates a file that allows me to clone my current py34 environment with a simple command:

conda create --name oldpy34 --file ~/Python/conda_packages_20140911

Hopefully, I won’t need this, but it’s a super simple insurance policy in case anything goes awry.

Now, let’s try updating my current py34 environment:

→ conda update --all
Fetching package metadata: ..
Solving package specifications: .
Package plan for installation in environment /Users/JLavin/Applications/Anaconda/anaconda/envs/py34:

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
astroid-1.2.1              |           py34_0         189 KB
astropy-0.4.1              |       np18py34_0         4.9 MB
bcolz-0.7.1                |       np18py34_0         324 KB
beautiful-soup-4.3.2       |           py34_0         114 KB
binstar-0.5.5              |           py34_0          68 KB
....
xlsxwriter-0.5.7           |           py34_0         165 KB
xz-5.0.5                   |                0         132 KB
------------------------------------------------------------
                                       Total:       121.2 MB

The following NEW packages will be INSTALLED:

bcolz:             0.7.1-np18py34_0
cytoolz:           0.7.0-py34_0
decorator:         3.4.0-py34_0
toolz:             0.7.0-py34_0
xz:                5.0.5-0

The following packages will be UPDATED:

astroid:           1.1.1-py34_0        --> 1.2.1-py34_0
astropy:           0.3.2-np18py34_0    --> 0.4.1-np18py34_0
beautiful-soup:    4.3.1-py34_0        --> 4.3.2-py34_0
binstar:           0.5.3-py34_0        --> 0.5.5-py34_0
blaze:             0.5.0-np18py34_1    --> 0.6.3-np18py34_0
bokeh:             0.4.4-np18py34_1    --> 0.6.0-np18py34_0
colorama:          0.2.7-py34_0        --> 0.3.1-py34_0
configobj:         5.0.5-py34_0        --> 5.0.6-py34_0
cython:            0.20.1-py34_0       --> 0.21-py34_0
datashape:         0.2.0-np18py34_1    --> 0.3.0-np18py34_1
docutils:          0.11-py34_0         --> 0.12-py34_0
dynd-python:       0.6.2-np18py34_0    --> 0.6.5-np18py34_0
...
tornado:           3.2.1-py34_0        --> 4.0.1-py34_0
werkzeug:          0.9.6-py34_0        --> 0.9.6-py34_1
xlsxwriter:        0.5.5-py34_0        --> 0.5.7-py34_0

Proceed ([y]/n)? y

The update succeeded, so my environment is now totally up to date. Thanks, Continuum Analytics! But the update could have failed. Or it could have succeeded while one or more of the updated packages broke my applications in ways I don’t like, making me want to roll back to where I began and update more selectively.

Having a snapshot of my environment and the ability to instantly recreate it gives me peace of mind.
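
If an update ever did break something, rolling back would be about this simple (a rough sketch using the names above, not something I’ve actually had to run):

# Recreate the pre-update environment from the exported package list
conda create --name oldpy34 --file ~/Python/conda_packages_20140911
source activate oldpy34

# Or rebuild py34 itself from the snapshot
conda remove --name py34 --all
conda create --name py34 --file ~/Python/conda_packages_20140911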

Posted by James on Sep 11, 2014

Reason #427 why I hate proprietary operating systems

At home, I run Linux machines, my wife is on a Mac, and my kids and in-laws are on Windows laptops. (I’ll transition my kids to Linux as they move into programming.)

Because of this heterogeneity, I like to format my external hard drives with multiple partitions, each for a different OS.

But this can be a huge pain. I formatted a 3 TB hard drive with a Windows partition, a Linux partition, and space for a Mac (HFS+) partition. But my wife’s MacBook Pro’s Disk Utility refused to create an HFS+ partition on the third physical partition, complaining that the hard drive had a Master Boot Record partition table. I wasn’t trying to create a bootable partition, but Mac OS didn’t care. Using my Linux machine (and the “hfsprogs” package), I managed to format the partition as HFS+. It shocked me that Linux could create a Mac-formatted partition where a Mac couldn’t.
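
For reference, the Linux side looked roughly like this (a sketch; the device name below is a made-up example, so check yours with lsblk or fdisk -l first):

# Install the HFS+ userspace tools (Debian/Ubuntu package name)
sudo apt-get install hfsprogs

# Format the third partition as HFS+ with a volume label
# (/dev/sdb3 is an example device name, not necessarily yours)
sudo mkfs.hfsplus -v "MacBackup" /dev/sdb3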

My wife’s MacBook agreed it was an HFS+ partition in a good state, but it still refused to let Time Machine back up to it because it was a non-journaled HFS+ partition. GParted can’t create a journaled HFS+ partition.

I finally surrendered, threw away all my partitions, and let the MacBook Pro take the first physical spot on the hard drive. Time Machine is finally running. I won’t know whether I can use the unformatted 2 TB of space for Windows or Linux till it finishes. Proprietary OSes are so annoying!

Posted by James on Mar 21, 2014

Database tuning: Triggers & materialized views

Had fun today at work tuning a Postgres database that has gotten very slow over the years as it has accumulated many gigabytes of data. This app is frequently rendered inoperable by just one or two users visiting its home page, which is obviously a bad situation. (Luckily, it’s an in-house tool used almost exclusively by a single user, which is why it hasn’t received more love before now.)

Over the past week, I’ve been recording the slowest queries, and today I started attacking them. The easiest to fix were the ones caused by missing indexes. Another problem I found was unnecessary overhead from two compound indexes that indexed the same two columns with opposite orderings; I turned one into a single-column index, which should produce similar read performance and superior write performance.

A third fix I proposed was adding a column for the precomputed value of md5(email). Some queries have been doing full-table searches on md5(email). I don’t understand why that’s necessary, but calculating md5() for every row and then scanning the whole table sounds pretty inefficient. So I created a named function that calculates md5(email) and a trigger that calls it whenever a row is inserted or updated. Doing this at the database layer makes sense because Rails doesn’t need to know anything about md5(email).
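
Roughly, the function and trigger look like this (a sketch with hypothetical table and column names, not our production schema):

-- Hypothetical example: store md5(email) in its own column, maintained by a trigger
ALTER TABLE users ADD COLUMN email_md5 varchar(32);

CREATE OR REPLACE FUNCTION set_email_md5() RETURNS trigger AS $$
BEGIN
  NEW.email_md5 := md5(NEW.email);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_set_email_md5
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE PROCEDURE set_email_md5();

-- Backfill existing rows and index the new column so lookups avoid a full-table scan
UPDATE users SET email_md5 = md5(email);
CREATE INDEX index_users_on_email_md5 ON users (email_md5);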

I also created my first Postgres materialized view today. Another query can occasionally take 40+ seconds on our server. The same query normally runs orders of magnitude faster, so I’m not sure what causes such long delays, but it does a join that involves calculating a count on a large table. My first thought was to add a counter cache, but that didn’t make sense when I looked at the table layout. I instead made a materialized view, which worked well on my static copy of the production database. But when I went to the Postgres documentation, I discovered two flaws in Postgres 9.3’s materialized view implementation: 1) refreshing the materialized view is a manual process; and 2) refreshing takes an exclusive lock on the view. So I’m not sure it’s worth pushing to production, but I’m glad to read that Postgres devs are already working to improve materialized views.
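
The materialized view itself is simple to define; it’s the refresh behavior that’s limiting in 9.3. A sketch, again with hypothetical table and column names:

CREATE MATERIALIZED VIEW category_item_counts AS
  SELECT c.id AS category_id, count(i.id) AS item_count
  FROM categories c
  LEFT JOIN items i ON i.category_id = c.id
  GROUP BY c.id;

-- In 9.3 this must be run by hand (or from a scheduled job) and takes an
-- exclusive lock on the view while it runs:
REFRESH MATERIALIZED VIEW category_item_counts;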

Posted by James on Mar 21, 2014

Chef pain point: Modifying 2+ lines but not an entire file

I’m suffering some pain modifying server configuration files with Chef.

Chef::Util::FileEdit is great for replacing one line with another, as many times as desired:

ruby_block "provide dovecot with custom MySQL connection info" do
  block do
    file = Chef::Util::FileEdit.new("/etc/dovecot/dovecot-sql.conf.ext")
    file.search_file_replace_line(/#driver = /,"driver = mysql")
    file.search_file_replace_line(/#connect = /,"connect = host=127.0.0.1 dbname=mail user=mailuser password=new_pw")
    file.search_file_replace_line(/default_pass_scheme/,"default_pass_scheme = SHA512-CRYPT")
    file.search_file_replace_line(/password_query/,"password_query = select email as user, password from users where email = '%u';")
    file.write_file
  end
end

And templates are great for replacing entire files:

template "/etc/dovecot/conf.d/10-master.conf" do
  source "10-master.conf.erb"
  mode 0640
  owner "vmail"
  group "dovecot"
end

But I can’t figure out how to replace a multi-line code block in a file. The articles I’ve read suggest Chef tries to force users into replacing whole files. cassianoleal answers a question about how to do so with, “As you said yourself, the recommended Chef pattern is to manage the whole file.” Why must we copy entire files just to replace one block of code with another? Chef 11 apparently includes template “partials,” which let you insert multi-line fragments, and Chef::Util::FileEdit also lets you insert multiple lines. But the ability to insert multiple lines doesn’t help with deleting a multi-line block. I could probably read the entire file into memory, search-and-replace the multi-line segment with a regex, and write the modified file back (see the sketch below), but shouldn’t this be a built-in Chef tool?
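
Here’s roughly what that slurp-and-substitute approach might look like (a sketch, not production code; the file path, pattern, and replacement text are made-up examples):

ruby_block "replace the imap-login service block in 10-master.conf" do
  block do
    # Example path and block; adjust pattern and replacement to your own config
    path = "/etc/dovecot/conf.d/10-master.conf"
    contents = ::File.read(path)
    replacement = "service imap-login {\n  inet_listener imap {\n    port = 0\n  }\n}\n"
    # /m lets '.' match newlines, so the pattern can span the whole block
    contents.sub!(/^service imap-login \{.*?^\}\n/m, replacement)
    ::File.open(path, "w") { |f| f.write(contents) }
  end
end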

Posted by James on Mar 17, 2014

Reverting commits with Git without losing history

Today at work, I decided to roll back the previous few commits I had made. I didn’t want to git reset --hard and throw away history or mess up anyone else who might have pulled from my branch, so I decided to git revert instead, but I wasn’t quite sure of the syntax.

I pulled out my normally reliable name-brand search engine and, after an unusually long search, found the “answer.” But it wasn’t quite right. It failed to revert one of the commits I wanted to revert. So I’m putting the answer here in hopes it saves someone else some pain.

I make three commits below and then revert the last two…

mkdir test_git_revert
cd test_git_revert/
git init .
vim a.txt
git add a.txt
git commit -m "create a.txt"
vim b.txt
git add b.txt
git commit -m "create b.txt"
vim c.txt
git add c.txt
git commit -m "create c.txt"
git log

  commit b59b5ecddc5284358da38635dc0829f629be11a7
  Author: James Lavin <james@fakedomain.com>
  Date:   Thu Mar 13 16:58:42 2014 -0400

      create c.txt

  commit 31fec743e007f94eb4738d1108c79b38dfa6cff0
  Author: James Lavin <james@fakedomain.com>
  Date:   Thu Mar 13 16:58:19 2014 -0400

      create b.txt

  commit a9d88ae06cedf5296297705142020a5264c839b8
  Author: James Lavin <james@fakedomain.com>
  Date:   Thu Mar 13 16:57:56 2014 -0400

      create a.txt

To revert the previous two commits and keep the first, I ran the following:

git revert --no-edit a9d88ae06cedf..b59b5ecddc5284

which is equivalent to:

git revert --no-edit <last_good_commit_SHA>..<last_bad_commit_SHA>

The output:

[master 4323ac0] Revert "create c.txt"
 1 file changed, 1 deletion(-)
 delete mode 100644 c.txt
[master c49aa86] Revert "create b.txt"
 1 file changed, 1 deletion(-)
 delete mode 100644 b.txt

I then confirmed with git log:

commit c49aa86bd04addb0a585417534bdb02638800e17
Author: James Lavin <james@fakedomain.com>
Date:   Thu Mar 13 16:59:26 2014 -0400

    Revert "create b.txt"

    This reverts commit 31fec743e007f94eb4738d1108c79b38dfa6cff0.

commit 4323ac0c5bccda28fc263ca7c8ff9d4d9f88a14c
Author: James Lavin <james@fakedomain.com>
Date:   Thu Mar 13 16:59:26 2014 -0400

    Revert "create c.txt"

    This reverts commit b59b5ecddc5284358da38635dc0829f629be11a7.

commit b59b5ecddc5284358da38635dc0829f629be11a7
Author: James Lavin <james@fakedomain.com>
Date:   Thu Mar 13 16:58:42 2014 -0400

    create c.txt

commit 31fec743e007f94eb4738d1108c79b38dfa6cff0
Author: James Lavin <james@fakedomain.com>
Date:   Thu Mar 13 16:58:19 2014 -0400

    create b.txt

commit a9d88ae06cedf5296297705142020a5264c839b8
Author: James Lavin <james@fakedomain.com>
Date:   Thu Mar 13 16:57:56 2014 -0400

    create a.txt

To check again, I ran git diff a9d88ae06cedf52 and got no output, indicating my working tree was back to the state it was in after the first commit.

To triple check, I ran ls and saw only the file I added to Git in the first commit:

a.txt
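
(As an aside, since the two reverted commits were the most recent ones, the same range could have been written with relative refs instead of SHAs: git revert --no-edit HEAD~2..HEAD.)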

Posted by James on Mar 13, 2014