Ok. I have a big honking server with a bigger honkinger graphics card that I use to run experiments. I'd like to run a bunch of them, but that causes the big honking server to choke.

Solution? More servers. What I'd like to do is push a button and have the following happen:

  • An EC2 instance with a preconfigured image spins up.
  • My experimental code is copied over.
  • The experiment is run remotely.
  • The output is collected locally.
  • The instance is terminated.

It's particularly important that the last step happen so that I don't end up putting the lab $30k in the hole to Amazon.

I've gotten it working for a persistent client session (i.e. I never close or move my laptop). It looks like this:


echo "spinning up instance and extracting instance id..."
INSTANCE_ID=`aws ec2 run-instances --key-name keyname --security-groups a-security-group --count 1 --image-id ami-xxxxxxxx --instance-type g2.2xlarge | grep "InstanceId" | grep --only-matching 'i-[0-9a-f]*'`
echo "instance id: ${INSTANCE_ID}"

echo "waiting for it to spin up..."
sleep 60  #lol

echo "extracting ip..."
IP=`aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | grep "PublicIpAddress" | head -n1 | grep --only-matching '[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*'`
echo "IP: ${IP}"

echo "copying code..."
scp -i /path/to/key.pem -r  path/to/code/  ubuntu@${IP}:

echo "running experiment..."
ssh -i /path/to/key.pem ubuntu@${IP} "command to run"

echo "terminating instance..."
aws ec2 terminate-instances --instance-ids ${INSTANCE_ID}

echo "done"

What I'm not sure about is what happens if I close my laptop: does ssh kill the remote command when my session drops? Will the terminate step still run, or will the instance keep running forever?
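
One way to sidestep the laptop question is to make the instance responsible for its own teardown. Whether the remote command survives a dropped ssh session depends on how it was started, and a sleeping laptop cannot run the terminate step in any case. The sketch below is only an illustration under stated assumptions: `MAX_MINUTES`, `launch_self_terminating`, `remote_command`, and `experiment.log` are made-up names, the `ubuntu` user is assumed to have passwordless sudo, the image is assumed to ship GNU `timeout`, and the experiment command is assumed to contain no single quotes.

```shell
#!/usr/bin/env bash
# Sketch: the instance terminates itself, so teardown does not depend on the laptop.
# MAX_MINUTES, the function names, and experiment.log are illustrative, not from the original.

MAX_MINUTES=240   # assumed hard upper bound on the experiment's run time

launch_self_terminating() {
    # With this flag, an OS-level shutdown on the instance terminates it
    # instead of merely stopping it. Prints the new instance id.
    aws ec2 run-instances --key-name keyname --security-groups a-security-group \
        --count 1 --image-id ami-xxxxxxxx --instance-type g2.2xlarge \
        --instance-initiated-shutdown-behavior terminate \
        --query 'Instances[0].InstanceId' --output text
}

remote_command() {
    # $1 is the experiment command (simple command, no single quotes).
    # The generated line:
    #   - runs the experiment detached from the ssh session (nohup, background),
    #   - kills it after MAX_MINUTES via timeout,
    #   - shuts the machine down afterwards, whatever the exit status was.
    printf "nohup sh -c 'timeout %sm %s; sudo shutdown -h now' > experiment.log 2>&1 < /dev/null &" \
        "${MAX_MINUTES}" "$1"
}
```

Usage would look like `ssh -i /path/to/key.pem ubuntu@${IP} "$(remote_command 'command to run')"`, which returns as soon as the job is started. The trade-off is that termination typically discards the instance's root volume, so the output has to be copied off (by a later scp, or by the experiment itself) before the shutdown happens.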

All of this would be solved by some kind of callback mechanism with a default 'kill everything' handler that fires if anything goes wrong. That would eliminate the timing hack during spin-up (the sleep 60) and ensure that the instance is torn down no matter what. But, hey, I'll take what I can get.
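
The closest thing the shell offers to that callback is a `trap` on `EXIT`, and the AWS CLI's `wait` subcommands can stand in for the fixed sleep. What follows is a minimal sketch rather than a tested deployment: the function names (`cleanup`, `launch_instance`, `public_ip`, `run_experiment`) are invented for illustration, and the key name, security group, AMI, and paths are the same placeholders as in the script above.

```shell
#!/usr/bin/env bash
# Sketch: EXIT trap as the "kill everything" handler, waiter instead of sleep 60.
# Function names are illustrative; placeholders are carried over from the original script.

INSTANCE_ID=""

cleanup() {
    # Runs whenever the script exits: success, a failed step, or a signal.
    if [ -n "${INSTANCE_ID}" ]; then
        echo "terminating instance ${INSTANCE_ID}..."
        aws ec2 terminate-instances --instance-ids "${INSTANCE_ID}" > /dev/null
    fi
}

launch_instance() {
    # --query with --output text prints the bare instance id, so no grep is needed.
    aws ec2 run-instances --key-name keyname --security-groups a-security-group \
        --count 1 --image-id ami-xxxxxxxx --instance-type g2.2xlarge \
        --query 'Instances[0].InstanceId' --output text
}

public_ip() {
    aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' --output text
}

run_experiment() {
    trap cleanup EXIT
    trap 'exit 1' INT TERM HUP   # turn signals into a normal exit so cleanup fires

    INSTANCE_ID=$(launch_instance) || return 1
    echo "instance id: ${INSTANCE_ID}"

    # Blocks until the EC2 status checks pass, replacing the fixed sleep.
    aws ec2 wait instance-status-ok --instance-ids "${INSTANCE_ID}" || return 1

    IP=$(public_ip "${INSTANCE_ID}") || return 1
    echo "IP: ${IP}"

    scp -i /path/to/key.pem -r path/to/code/ "ubuntu@${IP}:" || return 1
    ssh -i /path/to/key.pem "ubuntu@${IP}" "command to run"
}
```

Calling `run_experiment` as the last line of the script kicks everything off. The trap only fires when the local script actually exits, so it covers errors, Ctrl-C, and a closed terminal, but not a laptop that goes to sleep and never wakes up.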