Articles

Getting started with Docker

A Hedgeye colleague who has suffered the same pain from Chef that I have (especially unstable APIs) suggested we switch to Docker, and we’re now pursuing that. So far, we both like it, though we haven’t yet done anything complicated.

To help our future selves and our colleagues use Docker, I’m going to document some stuff here.

To install Docker on my Mac, I followed this nice guide from Chris Jones.

With Docker running, getting started is super easy. Just type:

docker run -it ubuntu /bin/bash

Docker will then pull down an Ubuntu image (unless it has already done so, in which case it will use the one you already have), spin up an instance running just a bash console (the command you told it to run), and drop you into the console session.

When you’re finished interacting with your instance, type exit and it will log you out and then terminate the instance. It terminates because Docker’s approach is to spin up a container running just one process and shut that container down when the initial process is no longer active. From the initial process, you can, of course, spawn additional processes. But whenever the special initial process closes, your container dies.

Here are some basic Docker commands.

[ More to come ]

Posted by James on Dec 03, 2014

Easily manage Python environments with Anaconda

Lately, I’ve been doing a lot of Python data analysis. Though Python has many strengths, package management has long been a nightmare.

Continuum Analytics has created several wonderful free open-source projects, perhaps most notably Anaconda, which makes installing Python and packages much, much easier. That’s especially true if you want to maintain multiple Python environments, which you probably do if you want to run both Python 2 and Python 3.

I’ve just hit on a workflow that enables me to keep my environment up to date without risk of breaking stuff.

I currently have two environments, the default 2.7 environment and a 3.4 environment I use most of the time:

conda info -e
# conda environments:
#
py34                  *  /Users/JLavin/Applications/Anaconda/anaconda/envs/py34
root                     /Users/JLavin/Applications/Anaconda/anaconda

I want to update many packages in py34 that have gone stale, but I’m afraid something might break. So I run:

conda list -n py34 --export > ~/Python/conda_packages_20140911

This creates a file that allows me to clone my current py34 environment with a simple command:

conda create --name oldpy34 --file ~/Python/conda_packages_20140911

Hopefully, I won’t need this, but it’s a super simple insurance policy in case anything goes awry.

Now, let’s try updating my current py34 environment:

conda update --all
Fetching package metadata: ..
Solving package specifications: .
Package plan for installation in environment /Users/JLavin/Applications/Anaconda/anaconda/envs/py34:

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
astroid-1.2.1              |           py34_0         189 KB
astropy-0.4.1              |       np18py34_0         4.9 MB
bcolz-0.7.1                |       np18py34_0         324 KB
beautiful-soup-4.3.2       |           py34_0         114 KB
binstar-0.5.5              |           py34_0          68 KB
....
xlsxwriter-0.5.7           |           py34_0         165 KB
xz-5.0.5                   |                0         132 KB
------------------------------------------------------------
                                       Total:       121.2 MB

The following NEW packages will be INSTALLED:

bcolz:             0.7.1-np18py34_0
cytoolz:           0.7.0-py34_0
decorator:         3.4.0-py34_0
toolz:             0.7.0-py34_0
xz:                5.0.5-0

The following packages will be UPDATED:

astroid:           1.1.1-py34_0        --> 1.2.1-py34_0
astropy:           0.3.2-np18py34_0    --> 0.4.1-np18py34_0
beautiful-soup:    4.3.1-py34_0        --> 4.3.2-py34_0
binstar:           0.5.3-py34_0        --> 0.5.5-py34_0
blaze:             0.5.0-np18py34_1    --> 0.6.3-np18py34_0
bokeh:             0.4.4-np18py34_1    --> 0.6.0-np18py34_0
colorama:          0.2.7-py34_0        --> 0.3.1-py34_0
configobj:         5.0.5-py34_0        --> 5.0.6-py34_0
cython:            0.20.1-py34_0       --> 0.21-py34_0
datashape:         0.2.0-np18py34_1    --> 0.3.0-np18py34_1
docutils:          0.11-py34_0         --> 0.12-py34_0
dynd-python:       0.6.2-np18py34_0    --> 0.6.5-np18py34_0
...
tornado:           3.2.1-py34_0        --> 4.0.1-py34_0
werkzeug:          0.9.6-py34_0        --> 0.9.6-py34_1
xlsxwriter:        0.5.5-py34_0        --> 0.5.7-py34_0

Proceed ([y]/n)? y

The update succeeded, so my environment is now totally up to date. Thanks, Continuum Analytics! But the update could have failed. Or it could have succeeded but one or more of the updated packages could have broken my applications in ways I don’t like, causing me to want to roll back to where I began and update more selectively.

Having a snapshot of my environment and the ability to instantly recreate it gives me peace of mind.

Posted by James on Sep 11, 2014

Reason #427 why I hate proprietary operating systems

At home, I run Linux machines, my wife is on a Mac, and my kids and in-laws are on Windows laptops. (I’ll transition my kids to Linux as they move into programming.)

Because of this heterogeneity, I like to format my external hard drives with multiple partitions, each for a different OS.

But this can be a huge pain. I formatted a 3 TB hard drive with a Windows partition, a Linux partition, and space for a Mac (HFS+) partition. But my wife’s MacBook Pro’s Disk Utility refused to create an HFS+ partition on the third physical partition, complaining that the hard drive has a Master Boot Record. I wasn’t trying to create a bootable partition, but Mac OS didn’t care. Using my Linux machine (and the “hfsprogs” package), I managed to format the partition as HFS+. It shocked me that Linux could create a Mac-formatted partition where a Mac couldn’t.

My wife’s MacBook agreed it was an HFS+ partition in good condition, but it still refused to let Time Machine back up to it because it was a non-journaled HFS+ partition. GParted can’t create a journaled HFS+ partition.

I finally surrendered, threw away all my partitions, and let the MacBook Pro take the first physical spot on the hard drive. Time Machine is finally running. I won’t know whether I can use the unformatted 2 TB of space for Windows or Linux until it finishes. Proprietary OSes are so annoying!

Posted by James on Mar 21, 2014

Database tuning: Triggers & materialized views

Had fun today at work tuning a Postgres database that has gotten very slow over the years as it has accumulated many gigabytes of data. This app is frequently rendered inoperable by just one or two users visiting its home page, which is obviously a bad situation. (Luckily, it’s an in-house tool used almost exclusively by a single user, which is why it hasn’t received more love before now.)

Over the past week, I’ve been recording the slowest queries, and today I started attacking them. The easiest-to-fix were the ones caused by missing indexes. Another problem I found was unnecessary overhead from two compound indexes that were indexing the same two columns with opposite orderings; I turned one into a single-column index, which should produce similar read performance and superior write performance.

A third fix I proposed was adding a column that stores the calculated value of md5(email). Some queries have been searching the whole table on md5(email). I don’t understand why that’s necessary, but calculating md5() for every row and then scanning the whole table sounds pretty inefficient. So I created a named function for calculating md5(email) and a trigger that calls the function whenever a table record is added or modified. Doing this at the database layer makes sense because Rails doesn’t need to know anything about md5(email).
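A minimal sketch of what such a function and trigger might look like. This is not the actual code from the app: the `users` table, the `email_md5` column, and the function, trigger, and index names are all hypothetical, and the index is my assumption about how the stored value would be used. The SQL is held in a Ruby string of the kind a Rails migration could pass to `execute`.

```ruby
# Hypothetical names throughout: the `users` table, the `email_md5` column,
# and the function/trigger/index names are assumptions for illustration.
# Single-quoted heredoc so Ruby leaves the $$ dollar-quoting untouched.
MD5_TRIGGER_SQL = <<~'SQL'
  ALTER TABLE users ADD COLUMN email_md5 text;

  -- Named function: store md5(email) on the row being written.
  CREATE OR REPLACE FUNCTION set_email_md5() RETURNS trigger AS $$
  BEGIN
    NEW.email_md5 := md5(NEW.email);
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  -- Call the function whenever a record is added or modified.
  CREATE TRIGGER users_set_email_md5
    BEFORE INSERT OR UPDATE ON users
    FOR EACH ROW EXECUTE PROCEDURE set_email_md5();

  -- Backfill existing rows, then index the stored value (assumed step).
  UPDATE users SET email_md5 = md5(email);
  CREATE INDEX index_users_on_email_md5 ON users (email_md5);
SQL

puts MD5_TRIGGER_SQL
```

With something like this in place, queries could compare against the stored column instead of recomputing md5() for every row.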

I also created my first Postgres materialized view today. Another query can occasionally take 40+ seconds on our server. The same query normally runs orders of magnitude faster, so I’m not sure what causes such long delays. But it’s doing a join that involves calculating a count on a large table. My first thought was to add a counter cache, but that didn’t make sense when I looked at the table layout. I instead made a materialized view, which worked well on my static copy of the production database. But when I went to the Postgres documentation, I discovered two flaws with Postgres 9.3’s materialized view implementation: 1) Refreshing the materialized view is a manual process; and, 2) Refreshing the materialized view takes an exclusive lock on the view, blocking reads until the refresh finishes. So I’m not sure it’s worth pushing to production, but I’m glad to read that Postgres devs are already working to improve the implementation of materialized views.

Posted by James on Mar 21, 2014

Chef pain point: Modifying 2+ lines but not an entire file

I’m suffering some pain modifying server configuration files with Chef.

Chef::Util::FileEdit is great for replacing one line with another, as many times as desired:

ruby_block "provide dovecot with custom MySQL connection info" do
  block do
    file = Chef::Util::FileEdit.new("/etc/dovecot/dovecot-sql.conf.ext")
    file.search_file_replace_line(/#driver = /, "driver = mysql")
    file.search_file_replace_line(/#connect = /, "connect = host=127.0.0.1 dbname=mail user=mailuser password=new_pw")
    file.search_file_replace_line(/default_pass_scheme/, "default_pass_scheme = SHA512-CRYPT")
    file.search_file_replace_line(/password_query/, "password_query = select email as user, password from users where email = '%u';")
    file.write_file
  end
end

And templates are great for replacing entire files:

template "/etc/dovecot/conf.d/10-master.conf" do
  source "10-master.conf.erb"
  mode 0640
  owner "vmail"
  group "dovecot"
end

But I can’t figure out how to replace a multi-line code block in a file. The articles I’ve read suggest Chef tries to force users into replacing whole files. cassianoleal answers a question about how to replace part of a file with, “As you said yourself, the recommended Chef pattern is to manage the whole file.” Why must we copy entire files to replace one block of code with another?

Chef 11 apparently includes “partials,” which let you insert multi-line code elements. Chef::Util::FileEdit also lets you do that. But the ability to insert multiple lines doesn’t enable deleting multi-line code blocks. I probably could copy the entire file into memory, search-and-replace the multi-line segment with a regex, and write the modified file back, but shouldn’t this be a built-in Chef tool?
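Here is a rough sketch of that read-modify-write idea in plain Ruby. It is not a Chef API: `replace_block` is a hypothetical helper, and the sample config contents and the new `service auth` block are made up for illustration.

```ruby
require "tempfile"

# Hypothetical helper (not part of Chef): replace the first span matching
# `pattern` in the file at `path` with `replacement`.
# Returns true if the file changed, false if it was already up to date.
def replace_block(path, pattern, replacement)
  original = File.read(path)
  # Block form so backslashes in `replacement` are not read as backreferences.
  updated = original.sub(pattern) { replacement }
  return false if updated == original

  File.write(path, updated)
  true
end

# Made-up config contents, for illustration only.
SAMPLE = <<~CONF
  service imap-login {
    inet_listener imap {
      port = 143
    }
  }
  service auth {
    unix_listener auth-userdb {
      mode = 0600
    }
  }
CONF

NEW_AUTH_BLOCK = <<~CONF.chomp
  service auth {
    unix_listener auth-userdb {
      mode = 0660
      user = vmail
    }
  }
CONF

# In Ruby, ^ and $ match at line boundaries and /m lets . span newlines,
# so this matches from "service auth {" through the first unindented "}".
AUTH_BLOCK = /^service auth \{.*?^\}$/m

Tempfile.create("10-master.conf") do |f|
  f.write(SAMPLE)
  f.flush
  puts replace_block(f.path, AUTH_BLOCK, NEW_AUTH_BLOCK) # first run changes the file
  puts replace_block(f.path, AUTH_BLOCK, NEW_AUTH_BLOCK) # second run finds nothing to change
  puts File.read(f.path)
end
```

Inside a recipe, a helper like this could presumably be called from a ruby_block, the same way the Chef::Util::FileEdit example above is. Because it only writes when the content differs, repeated Chef runs would leave an already-updated file alone.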

Posted by James on Mar 17, 2014